DtSearchQuery
library call
DtSearchQuery
Perform a DtSearch database search for a specified query
#include <Dt/Search.h>
int DtSearchQuery(
    void *qry,
    char *dbname,
    int search_type,
    char *date1,
    char *date2,
    DtSrResult **results,
    long *resultscount,
    char *stems,
    int *stemscount);
DESCRIPTION
DtSearchQuery is the DtSearch API search function.
DtSearchQuery is passed a query string and some
search options, performs the requested search, and if successful returns a
linked list of DtSrResult structures representing
the documents satisfying the search.
The results list contains information about the documents that can be
used for subsequent retrievals, as well as information suitable for
display to an end user.
Search Types
DtSearchQuery supports three types of searches:
P, W, and S.
Type P Search Query Strings
Query strings for search type P have the simplest syntax, namely a
sequence of words separated by ASCII whitespace. Punctuation and invalid words
are silently discarded by the search engine. The only possible syntax error
is that all query words happen to be invalid in the language of the database.
Search type P is often used to implement a limited
Query-by-Example (QBE) search paradigm. In this scenario, users
typically paste document text from whatever source into a query string
text field. Their expectation is that the search engine will return the
documents in the database that are "most similar" to the text of the
query string, and the statistical sort of the results list usually
satisfies that expectation.
Note that although search type P does not use boolean
syntax, it is actually implemented as a stemmed search (type
S search) with implied boolean ORs between words.
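The equivalence between a type P query and a type S query with implied ORs can be illustrated with a small sketch. The helper name explicit_or_query is hypothetical and is not part of the DtSearch API; the search engine performs this step internally, so applications never need to do it themselves. The sketch does not model the discarding of punctuation or invalid words.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of the DtSearch API.
 * Rewrites whitespace-separated words as an explicit OR query,
 * e.g. "ice cream" -> "ice | cream".
 * Returns 0 on success, -1 if the output buffer is too small. */
static int explicit_or_query(const char *words, char *out, size_t size)
{
    size_t n = 0;
    int in_word = 0;    /* currently inside a word? */
    int seen_word = 0;  /* has any word been copied yet? */

    if (size == 0)
        return -1;
    for (const char *p = words; *p != '\0'; p++) {
        if (isspace((unsigned char)*p)) {
            in_word = 0;
            continue;
        }
        if (!in_word && seen_word) {
            /* A new word starts: separate it from the previous one. */
            if (n + 3 >= size)
                return -1;
            out[n++] = ' ';
            out[n++] = '|';
            out[n++] = ' ';
        }
        if (n + 1 >= size)
            return -1;
        out[n++] = *p;
        in_word = seen_word = 1;
    }
    out[n] = '\0';
    return 0;
}
```

Under these assumptions, the type P string "ice  cream cone" corresponds to the type S string "ice | cream | cone".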
Types S and W Boolean Query Strings
Query strings for search types S (stemmed boolean)
and W (exact word boolean) must be syntactically
valid boolean expressions as described below. Any string that does not
match a valid expression rule is invalid and will fail with an error
message.
Query words for all search types may be entered in any codeset for a
supported DtSearch language, including multibyte languages. Words may be
identified as invalid by the language module of the database for a
number of reasons including any words that would not have been indexed
because they are too short, too long, on the stop list, etc. With one
exception, linguistically invalid words result in a syntax error. The
exception is in the case of an "all ANDs" query, where invalid words and
valid words that happen not to be in the database are silently erased
from the query string.
The boolean query operators are the ASCII metacharacters: '&' for
AND, '|' for OR, '~' for NOT, '(' and ')' for open and close parentheses
respectively, and '@nnn' for collocation expressions.
All expression tokens are separated by ASCII whitespace. Typically this
is one or more space or tab characters. Omitting whitespace separators is
legal if it can be done unambiguously. For example "word1&word2" is
a legal expression but "word1word2" would be interpreted as a single
word token.
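The tokenization rule above can be sketched as follows. This is a minimal illustration under the assumption that the only metacharacters are the ones listed in this section; split_tokens, MAX_TOKENS and MAX_TOKEN_LEN are hypothetical names, and the real query parser also applies language-specific word validation that is not modelled here.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKENS    32   /* arbitrary limits chosen for this sketch */
#define MAX_TOKEN_LEN 64

/* Hypothetical helper, not part of the DtSearch API.
 * Splits a boolean query into operator and word tokens and
 * returns the number of tokens stored. */
static size_t split_tokens(const char *q, char tokens[][MAX_TOKEN_LEN],
                           size_t max_tokens)
{
    size_t count = 0;

    while (*q != '\0' && count < max_tokens) {
        size_t len = 0;
        char *tok;

        if (isspace((unsigned char)*q)) {
            q++;                        /* whitespace only separates tokens */
            continue;
        }
        tok = tokens[count];
        if (strchr("&|~()", *q) != NULL) {
            tok[len++] = *q++;          /* single-character operator */
        } else if (*q == '@') {
            tok[len++] = *q++;          /* collocation: '@' plus digits */
            while (isdigit((unsigned char)*q) && len + 1 < MAX_TOKEN_LEN)
                tok[len++] = *q++;
        } else {
            /* A word runs until whitespace or the next metacharacter. */
            while (*q != '\0' && !isspace((unsigned char)*q) &&
                   strchr("&|~()@", *q) == NULL && len + 1 < MAX_TOKEN_LEN)
                tok[len++] = *q++;
        }
        tok[len] = '\0';
        count++;
    }
    return count;
}
```

With this sketch, "word1&word2" yields three tokens (word1, &, word2), while "word1word2" yields a single word token.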
The ASCII "at" sign (@) marks a special boolean collocation
operator. The collocation operator has the syntax "@n...",
the ASCII "at" sign followed by one or more ASCII numeric digits,
representing an integer with value greater than zero. Collocation is a
variation of the AND search where a user can specify the maximum
distance in bytes between any two words. In most languages a byte is
equivalent to a character position. For example to find "ice" and
"cream" separated by no more than five characters, the search query "ice
@5 cream" may be used. Unlike other boolean operators, the collocation
operator can apply only to naked word tokens, not other expressions.
Searches including collocation operators are slower than searches
without them, and can be much slower for common words.
A query may contain a maximum of 8 distinct word tokens, and each collocation
operator counts as one of the 8. There is no limit to the number of other
operators, as long as they match the syntax rules.
Collocation operators are only supported for "Austext flavor" databases.
The default flavor of database created by dtsrcreate is
"Dtinfo flavor," which does not support collocation.
Boolean Query Syntax Rules
There are only 6 syntax rules and the rules are recursive. Ambiguity is
resolved by precedence and associativity rules.
valid_expression := word_token
A valid expression can be just a valid naked word token. Semantically,
the expression returns all documents containing the specified word. The
word_token must be a valid word in the language of
the database being searched.
valid_expression := valid_expression '&' valid_expression
The ASCII ampersand character is the AND character. Semantically, it
returns all documents satisfying both the first and second expressions
(boolean intersection). AND is also the "implied" boolean operator in
the following sense: the query parser will insert an ampersand between
words or expressions that otherwise would be separated only by
whitespace. For example "word1 word2" becomes "word1 & word2".
valid_expression := valid_expression '|' valid_expression
The ASCII vertical bar character is the OR character. It
means return all documents satisfying either the first or the second
expression (boolean union).
valid_expression := '(' valid_expression ')'
Valid expressions may be recursively nested in ASCII open and close
parentheses characters. The query parser "forgives" two common human errors.
It will automatically discard excessive close parentheses characters, and
it will automatically generate close parentheses characters if necessary at
the end of a query. For example, "aaa | (bbb & ccc)))))) ddd" becomes
"aaa | ( bbb & ccc) & ddd", and "aaa ((bbb" becomes
"aaa ( ( bbb ) )".
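The two forgiving behaviours described above can be sketched as follows. This is a minimal illustration, not the DtSearch parser itself: balance_parens is a hypothetical helper, and it neither inserts implied '&' operators nor normalizes spacing the way the examples above show.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the DtSearch API.
 * Copies query into out, discarding close parentheses that have no
 * matching open parenthesis and appending any close parentheses that
 * are still missing at the end.
 * Returns 0 on success, -1 if the output buffer is too small. */
static int balance_parens(const char *query, char *out, size_t size)
{
    size_t n = 0;
    int depth = 0;      /* number of currently open parentheses */

    if (size == 0)
        return -1;
    for (const char *p = query; *p != '\0'; p++) {
        if (*p == '(') {
            depth++;
        } else if (*p == ')') {
            if (depth == 0)
                continue;       /* excess close parenthesis: discard */
            depth--;
        }
        if (n + 1 >= size)
            return -1;
        out[n++] = *p;
    }
    while (depth > 0) {         /* generate the missing close parentheses */
        if (n + 1 >= size)
            return -1;
        out[n++] = ')';
        depth--;
    }
    out[n] = '\0';
    return 0;
}
```

With this sketch, "aaa | (bbb & ccc)))))) ddd" becomes "aaa | (bbb & ccc) ddd", and "aaa ((bbb" becomes "aaa ((bbb))".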
valid_expression := '~' valid_expression
The ASCII tilde character is the unary NOT operator. It returns every
document in the database that is not in the set satisfying the expression.
valid_expression := word_token collocation_operator word_token
Collocation operators are permitted only between words, not expressions.
Each of the word tokens and the collocation operator itself occupy slots
in the table of 8 maximum word tokens.
Boolean Associativity and Precedence Table
In order from highest precedence to lowest:
Associativity   Operator     Example
(none)          COLLOC
right           NOT          "aaa~bbb" resolved as "aaa & (~(bbb))"
left            AND          "aaa bbb ccc" resolved as "(aaa & bbb) & ccc"
left            OR           "aaa|bbb|ccc" resolved as "(aaa | bbb) | ccc"
(none)          naked word
Example Boolean Queries
aaa bbb ccc
Returns all records that contain at least one occurrence of all three words.
aaa | (bbb ~ccc)
Retrieves all records containing "aaa"
and also all records containing "bbb", but not
"ccc".
aaa ~(aaa @1 bbb)
Returns all records containing "aaa" but omits those
where "aaa" is one character away from "bbb".
It is possible to formulate a query that requires retrieving all records
in the database that contain none of the query words (for example,
~aaa). Users should be warned that in
a large database such a search can take a very long time.
Using the implied associativity and precedence rules, the ambiguous
query string aaa ~bbb | ccc ~ddd @10 eee
is disambiguated as (aaa & (~bbb))
| (ccc & (~(ddd @10 eee))).
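The associativity and precedence rules can be made concrete with a small recursive-descent sketch that prints a query in fully parenthesized form. It is an illustration under stated assumptions, not the DtSearch parser: every name in it (Node, parenthesize, and so on) is hypothetical, it does no word validation or parenthesis repair, and it wraps every operator application in parentheses, including the outermost one.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

typedef struct Node {
    char op;               /* 'w' word, '@' collocation, '~', '&' or '|' */
    char text[32];         /* word text, or the collocation distance digits */
    struct Node *l, *r;    /* operands; '~' uses only r */
} Node;

static Node pool[64];      /* fixed arena: this sketch assumes short queries */
static int used;
static const char *cur;    /* parse cursor into the query string */

static Node *parse_or(void);

static Node *mk(char op, Node *l, Node *r) {
    assert(used < 64);
    Node *n = &pool[used++];
    n->op = op; n->text[0] = '\0'; n->l = l; n->r = r;
    return n;
}

/* Skip whitespace and return the next character without consuming it. */
static int peek(void) {
    while (isspace((unsigned char)*cur)) cur++;
    return (unsigned char)*cur;
}

/* Copy a run of digits (digits != 0) or word characters into n->text. */
static Node *read_text(Node *n, int digits) {
    size_t i = 0;
    int c = peek();
    while (i + 1 < sizeof n->text && c != '\0' &&
           (digits ? isdigit(c) : (!isspace(c) && !strchr("&|~()@", c)))) {
        n->text[i++] = (char)c;
        c = (unsigned char)*++cur;
    }
    n->text[i] = '\0';
    return n;
}

static Node *parse_primary(void) {  /* '(' expr ')'  |  word [ @n word ] */
    if (peek() == '(') {
        cur++;
        Node *e = parse_or();
        if (peek() == ')') cur++;
        return e;
    }
    Node *w = read_text(mk('w', NULL, NULL), 0);
    if (peek() != '@') return w;
    cur++;
    Node *c = read_text(mk('@', w, NULL), 1);
    c->r = read_text(mk('w', NULL, NULL), 0);
    return c;
}

static Node *parse_not(void) {      /* NOT: right associative */
    if (peek() != '~') return parse_primary();
    cur++;
    return mk('~', NULL, parse_not());
}

static Node *parse_and(void) {      /* AND: left associative, may be implied */
    Node *l = parse_not();
    for (;;) {
        int c = peek();
        if (c == '\0' || c == '|' || c == ')') return l;
        if (c == '&') cur++;
        l = mk('&', l, parse_not());
    }
}

static Node *parse_or(void) {       /* OR: left associative, lowest precedence */
    Node *l = parse_and();
    while (peek() == '|') {
        cur++;
        l = mk('|', l, parse_and());
    }
    return l;
}

static void put(char *out, size_t size, const char *s) {
    strncat(out, s, size - strlen(out) - 1);
}

static void emit(const Node *n, char *out, size_t size) {
    if (n->op == 'w') { put(out, size, n->text); return; }
    put(out, size, "(");
    if (n->l) emit(n->l, out, size);
    put(out, size, n->op == '~' ? "~" : n->op == '@' ? " @" : n->op == '&' ? " & " : " | ");
    if (n->op == '@') { put(out, size, n->text); put(out, size, " "); }
    emit(n->r, out, size);
    put(out, size, ")");
}

/* Hypothetical entry point: writes the fully parenthesized form of query. */
static void parenthesize(const char *query, char *out, size_t size) {
    used = 0; cur = query; out[0] = '\0';
    emit(parse_or(), out, size);
}
```

Each precedence level gets its own function, and a lower-precedence function calls the next higher one for its operands, which is why collocation binds tightest and OR loosest. For the query "aaa ~bbb | ccc ~ddd @10 eee" this sketch produces "((aaa & (~bbb)) | (ccc & (~(ddd @10 eee))))", which matches the disambiguation given above apart from the extra outermost parentheses.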
ARGUMENTS
search_type
Specifies the type of search to perform. Valid values are
P, W, and S.
Search type P indicates that the query string is a
sequence of words separated by ASCII whitespace.
It requests that the words be stemmed prior to searching, that all
documents containing any of the words be returned, that the results list
be statistically sorted, and that no more than the top
MaxResults list items be returned where
MaxResults is the current value
returned from DtSearchGetMaxResults. Note that a
type P search is identical to a type
S boolean search with an implied boolean OR between
words.
Search types W and S are boolean
query searches. They indicate that the query string is a sequence of
words and boolean operators matching the syntax described under "Types
S and W Boolean Query Strings"
above.
Type S requests that words be stemmed prior to
searching. Type W requests that words be left unstemmed. Both types
request that all documents containing the combinations of query words
specified by the boolean operations be returned, that the results list
be statistically sorted if possible, and that no more than the top
MaxResults list items be returned,
where MaxResults is the current value
returned from DtSearchGetMaxResults.
dbname
Specifies which database is to be searched. It is any one of the
database name strings returned from DtSearchInit or
DtSearchReinit. If
dbname is NULL, the first database name string
is used.
Within the specified database, searches will be restricted to those
documents whose DtSrKeytype.is_selected
field is nonzero.
date1 and
date2
Specify a range of document dates to use for the search. Only documents
within the specified range will be returned on the results list.
date1 is the older end of the range and
if not NULL, requests DtSearch to return only those records younger than
(that is, after) the specified date.
date2 is the younger end of the range
and if not NULL, requests DtSearch to return only those records older
than (that is, before) the specified date.
It is valid to specify just one of the arguments.
Undated documents always qualify for a results list regardless of search
date strings. The format of a valid date string is described in
DtSearchValidDateString(3).
stems and
stemscount
Specify a character buffer to hold parsed and stemmed words and a
variable to receive the number of stored words.
stems and stemscount are optional; they can be NULL. However, if either
is specified, they must both be specified.
If specified, stems must point to a
character buffer large enough to hold
DtSrMAX_STEMCOUNT by
DtSrMAXWIDTH_HWORD bytes. An array of parsed
and stemmed query words will be stored here by the API for use by a
later call to DtSearchHighlight.
The size of the array will be stored in
stemscount.
results and
resultscount
Specify where a pointer to the results list will be stored and a
variable to receive the number of items on the list.
Results lists can be manipulated with several utility functions.
In DtSearch, frequency of occurrence information is
maintained for words across the whole database and within documents. For
most queries, results lists are sorted by this statistical information
and presented to the user as a "proximity" number for each document on
the list. Proximity is meant to appear to a user as a distance, or a
measure of the nearness of the query to the document. Conceptually, the
smaller the proximity the "closer" the document is to the query and the
more likely it will be valuable to the user.
DtSearch searches only one database at a time and returns only results
lists for that single database. However, browsers often provide the
illusion of simultaneous searches in multiple databases, merging the
results lists by proximity when completed. Since the domain of knowledge
and density of words and records may vary from database to database, the
value of proximity numbers may similarly vary, and some databases may be
underrepresented on merged results lists.
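The merging step that such browsers perform can be sketched as an ordinary merge of two sorted linked lists. This is only an illustration: the Hit structure is a hypothetical stand-in and does not reproduce the DtSrResult layout, and the sketch assumes each list is already sorted by ascending proximity (smallest, that is "closest", first).

```c
#include <stddef.h>

/* Hypothetical stand-in for one results-list entry;
 * this is not the DtSrResult layout. */
typedef struct Hit {
    int proximity;          /* smaller means "closer" to the query */
    const char *dbname;     /* database the hit came from */
    struct Hit *next;
} Hit;

/* Merge two lists that are each sorted by ascending proximity.
 * The nodes are relinked in place; no memory is allocated. */
static Hit *merge_by_proximity(Hit *a, Hit *b)
{
    Hit head = {0, NULL, NULL};     /* dummy node simplifies appending */
    Hit *tail = &head;

    while (a != NULL && b != NULL) {
        /* Take from b only when it is strictly closer, so ties keep
         * the entries of list a first. */
        Hit **pick = (b->proximity < a->proximity) ? &b : &a;
        tail->next = *pick;
        tail = *pick;
        *pick = (*pick)->next;
    }
    tail->next = (a != NULL) ? a : b;   /* append whatever remains */
    return head.next;
}
```

Because the merge compares raw proximity values, a database whose proximities run systematically higher ends up near the bottom of the merged list, which is the underrepresentation effect described above.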
RETURN VALUE
This function has three common return codes.
DtSrOK is returned, as well
as a results list and stems array, when the search was completely successful.
DtSrNOTAVAIL is returned when
the query was valid but the search was unsuccessful (that is, no set of
documents matched the query). There are usually no messages with
DtSrNOTAVAIL.
DtSrFAIL is returned when the
search was unsuccessful, usually because of an invalid query, and user
messages on the MessageList explain why.
Any API function can also return DtSrREINIT and the return codes for fatal engine errors at any time.
SEE ALSO
DtSrAPI(3),
DtSearchReinit(3),
DtSearchGetMaxResults(3),
DtSearchSetMaxResults(3),
DtSearchGetKeytypes(3),
DtSearchValidDateString(3),
DtSearchSortResults(3),
DtSearchFreeResults(3),
DtSearchHighlight(3)