dtsrindex (user cmd)
NAME
dtsrindex - Load inverted index for document objects
SYNOPSIS
dtsrindex −ddbname [−tetxstr] [−h0 | −hhashsz] [−rrecdots] [−bbatchsz] [−ccachesz] [−iinbufsz] file
DESCRIPTION
dtsrindex is the second of a pair of programs that
load a database with document data from an input fzk file.
dtsrload loads document header information and
optionally the documents themselves. dtsrindex parses
words from document text and loads them into the inverted index files.
Word parsing is performed in the specified language and linguistic
codeset of the database. The inverted index contains the search terms
used for subsequent online queries.
An fzk file can be generated by dtsrhan, manually with
a text editor, or by a special application program created for the
purpose. Typically the same fzk file is used for
dtsrload and dtsrindex. However,
this is not required, and there are situations where it may not be
desirable. If the same fzk file is not used by both programs, the one
used for dtsrindex must represent the same objects in
the same order. Only the unique key line and the text portions of the
file are used by this program. (See dtsrfzkfiles for
information about DtSearch fzk files.)
A document's unique key in the fzk file must already exist in the
database (that is, dtsrload must be executed before
dtsrindex). If any words are already indexed for the
unique document key, indicating dtsrload "updated"
the document, then the newly parsed words from the current fzk file will
totally replace the previously indexed words.
When duplicate record ids are encountered in a single fzk file, only the
first occurrence of the document is indexed into the database; the
second one is discarded. Since this is exactly the same discard order as
dtsrload, the same fzk file can be used for both
programs. Duplicate record ids are tracked during execution with a
hash table.
dtsrindex performs two passes. In the first pass,
dtsrindex constructs an inverted index in memory of
all the words it parses from the fzk file. Since the index is built in
memory, it is possible to run out of memory for very large fzk files.
For this reason very large fzk files are processed in batches. Execution
time in the first pass depends on the size of the fzk file.
In the second pass, dtsrindex merges the information
in the memory index into the database's disk inverted index. Execution
time in the second pass depends on both the size of the incoming fzk
file and the overall size of the database.
If dtsrindex is interrupted in the first pass, it
can be reexecuted without database damage. However, if it is interrupted
in the second pass, the database will be corrupted. Database backups
are always recommended.
To prevent database corruption, execute dtsrindex
only after all users of a preexisting database have exited their search
programs. For a single fzk file, dtsrload must be
executed immediately before dtsrindex so that
dtsrindex can map the words it indexes to the correct
internal database addresses. Only after both programs successfully
complete execution may users again be allowed to perform online searches
of the database.
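The required ordering can be sketched as a small POSIX shell function. This is only an illustration: the function name load_and_index and the override variables DTSRLOAD and DTSRINDEX are hypothetical, and the sketch assumes that dtsrload accepts the same −ddbname option and follows the same exit status convention as dtsrindex.

```sh
#!/bin/sh
# Sketch: run dtsrload, then dtsrindex, on the same fzk file.
# DTSRLOAD and DTSRINDEX are hypothetical override variables (not part of
# DtSearch); they default to the real program names.
# Assumption: dtsrload accepts the same -ddbname option as dtsrindex.
load_and_index() {
    db=$1
    fzk=$2

    "${DTSRLOAD:-dtsrload}" "-d$db" "$fzk"
    status=$?
    # Assumption: as with dtsrindex, a status above 1 means a fatal error.
    if [ "$status" -gt 1 ]; then
        echo "dtsrload failed with status $status; not indexing" >&2
        return "$status"
    fi

    # Index immediately after loading; the function returns dtsrindex's status.
    "${DTSRINDEX:-dtsrindex}" "-d$db" "$fzk"
}
```

A call such as load_and_index mydb batch1.fzk would run both programs back to back. Users should stay out of the database until the function returns.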
OPTIONS
The following options are available:
If an option takes a value, the value must be directly appended to
the option name without white space.
−ddbname
Specifies the 1 to 8 ASCII character name of the database to be
updated.
If an optional directory path is not prepended to the database
name, dtsrindex will attempt to open the database from
the current working directory. File name extensions for database
files are automatically appended.
−tetxstr
Specifies the end-of-document-text delimiter string. The default
document separator in an fzk file is an ASCII form feed character
followed by an ASCII line feed ('\f\n'). For certain multibyte languages
it may be more convenient to specify a non-ASCII string as the document
delimiter.
−h0
Instructs dtsrindex to not check for duplicate
record ids. This option should not be specified unless it
is certain that there are no duplicate ids in the fzk file.
−hhashsz
Sets the duplicate record id hash table size to hashsz. The default is 3000.
dtsrindex will execute more efficiently if the
specified table size is larger than the number of documents in the fzk
file.
−rrecdots
Instructs dtsrindex to print a progress character to
stdout for every recdots documents
processed during the first pass. The default is 20.
−bbatchsz
Sets the batch size to batchsz. The
default is 10000. The batch size is the maximum number of records
processed in Pass 1 before copying the in-memory index to disk in Pass
2. Larger batch sizes significantly improve execution time in Pass 2,
but require exponentially larger amounts of memory. The default batch
size has been optimized for moderately fast machines with large amounts
of memory.
−ccachesz
Sets the number of 1024 byte cache pages used by the DtSearch Database
Management System to cachesz. The
default is 64. The cache size affects memory paging performance for word
b-trees. cachesz should be greater than
or equal to 16, in even powers of 2. The default is usually sufficient.
−iinbufsz
Sets the size of the input line buffer to inbufsz. The default is 1024 bytes. This
buffer is used only for reading the four ASCII header lines for each
document in an fzk file. (The text portion of each document is parsed on
the fly a word at a time.) Increasing inbufsz may be appropriate for very large
abstracts, but the default is sufficient in most cases.
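The guidance for the hash table and cache sizes can be turned into two small checks. Both functions below are hypothetical helpers, not part of DtSearch. suggest_hashsz estimates the document count by counting form feed bytes, which only makes sense for files that use the default delimiter, and doubling that count is an arbitrary margin. is_valid_cachesz reads "even powers of 2" as "exact powers of 2", which is an interpretation.

```sh
#!/bin/sh
# Hypothetical helpers; neither function is part of DtSearch.

# Print a hash table size larger than the estimated number of documents in
# an fzk file that uses the default '\f\n' delimiter. The estimate counts
# form feed bytes. Doubling is an arbitrary margin, and 3000 (the documented
# default) is used as the floor.
suggest_hashsz() {
    docs=$(( $(tr -cd '\f' < "$1" | wc -c) ))
    size=$((docs * 2))
    [ "$size" -lt 3000 ] && size=3000
    echo "$size"
}

# Succeed if the argument is a power of 2 that is at least 16.
is_valid_cachesz() {
    n=$1
    case $n in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    # A power of 2 has exactly one bit set, so n & (n - 1) is zero.
    [ "$n" -ge 16 ] && [ $((n & (n - 1))) -eq 0 ]
}
```

Because option values must be appended without white space, the result would be used as in dtsrindex -dmydb -h"$(suggest_hashsz batch1.fzk)" batch1.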
OPERANDS
The required input file name (file)
identifies the file to be processed by dtsrindex. It
can optionally include a path prefix, either from root or relative to
the current working directory. If a file name extension is not
specified, dtsrindex assumes a default extension of
.fzk.
ENVIRONMENT VARIABLES
None.
RESOURCES
None.
ACTIONS/MESSAGES
None.
RETURN VALUES
The return values are as follows:
0
dtsrindex completed successfully.
1
dtsrindex successfully
recovered from an error. This occurs when one or more
documents were discarded because of a partially invalid
fzk file format, duplicate record ids, or empty record text.
>1
dtsrindex encountered a fatal error.
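A calling script might map these return values to messages as in the following sketch. The function name describe_dtsrindex_status and the message wording are illustrative only.

```sh
#!/bin/sh
# Hypothetical helper: translate a dtsrindex exit status into a short message.
describe_dtsrindex_status() {
    case $1 in
        0) echo "success" ;;
        1) echo "completed, but one or more documents were discarded" ;;
        *) echo "fatal error" ;;   # any status greater than 1
    esac
}
```

For example, after running dtsrindex -dmydb batch1, a script could call describe_dtsrindex_status "$?" and stop only when the result is a fatal error.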
FILES
dtsrindex reads the specified fzk file and opens
all the database and related language files for the specified
database name.
dtsrindex updates the following database files:
dbname.d21
dbname.d22
dbname.d23
dbname.k21
dbname.k22
dbname.k23
dbname.d99
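Since an interruption during the second pass corrupts the database, it can be useful to copy these files before indexing. The sketch below copies only the files listed above, so it is not a complete database backup. The function name backup_index_files and the destination directory argument are hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: copy the files that dtsrindex updates for database
# $1 (optionally with a directory prefix) into directory $2.
# This covers only the files listed in this section, not the whole database.
backup_index_files() {
    db=$1
    dest=$2
    mkdir -p "$dest" || return 1
    for ext in d21 d22 d23 k21 k22 k23 d99; do
        # -p preserves modification times and permissions.
        cp -p "$db.$ext" "$dest/" || return 1
    done
}
```

A call such as backup_index_files mydb /tmp/mydb.bak would be made after all users have exited their search programs and before dtsrindex is started.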
EXAMPLES
Index all words in the fzk file named batch1.fzk in
the current working directory into database mydb.
dtsrindex -dmydb batch1
Index into database mydb the documents specified in
the fzk file /u/dtsearch/jpndocs.1. Three ASCII
plus signs at the bottom of each document signal the end of document
text and the beginning of the next fzk file record.
dtsrindex -dmydb -t+++ /u/dtsearch/jpndocs.1
SEE ALSO
dtsrload,
dtsrhan,
dtsrfzkfiles,
DtSearch