cdesktop/cde/doc/C/guides/man/man1_dt/srindex.sgm

<!-- $XConsortium: srindex.sgm /main/7 1996/09/08 19:56:47 rws $ -->
<!-- (c) Copyright 1996 Digital Equipment Corporation. -->
<!-- (c) Copyright 1996 Hewlett-Packard Company. -->
<!-- (c) Copyright 1996 International Business Machines Corp. -->
<!-- (c) Copyright 1996 Sun Microsystems, Inc. -->
<!-- (c) Copyright 1996 Novell, Inc. -->
<!-- (c) Copyright 1996 FUJITSU LIMITED. -->
<!-- (c) Copyright 1996 Hitachi. -->
<![%CDE.C.CDE; [<refentry id="CDE.SEARCH.dtsrindex">]]>
<refmeta><refentrytitle>dtsrindex</refentrytitle><manvolnum>user cmd</manvolnum>
</refmeta>
<refnamediv><refname><command>dtsrindex</command></refname><refpurpose>Load
 inverted index for document objects</refpurpose></refnamediv>
<refsynopsisdiv>
<cmdsynopsis>
<command>dtsrindex</command>
<arg choice="plain">&minus;d<replaceable>dbname</replaceable></arg>
<arg choice="opt">&minus;t<replaceable>etxstr</replaceable></arg>
<arg choice="opt"><group choice="plain"><arg choice="plain">&minus;h0</arg>
<arg choice="plain">&minus;h<replaceable>hashsz</replaceable></arg>
</group></arg>
<arg choice="opt">&minus;r<replaceable>recdots</replaceable></arg>
<arg choice="opt">&minus;b<replaceable>batchsz</replaceable></arg>
<arg choice="opt">&minus;c<replaceable>cachesz</replaceable></arg>
<arg choice="opt">&minus;i<replaceable>inbufsz</replaceable></arg>
<arg choice="plain"><replaceable>file</replaceable></arg>
</cmdsynopsis>
</refsynopsisdiv>
<refsect1>
<title>DESCRIPTION</title>
<para><command>dtsrindex</command> is the second of a pair of programs that
load a database with documents data from an input fzk file.
<command>dtsrload</command> loads document header information and
optionally the documents themselves. <command>dtsrindex</command> parses
words from document text and loads them into the inverted index files.
Word parsing is performed in the specified language and linguistic
codeset of the database. The inverted index contains the search terms
used for subsequent online queries.
</para>
<para>An fzk file can be generated by <command>dtsrhan</command> manually with
a text editor, or by a special application program created for the
purpose. Typically the same fzk file is used for
<command>dtsrload</command> and <command>dtsrindex</command>. However,
it is not required and there are situations where it may not be
desirable. If the same fzk file is not used by both programs, the one
used for <command>dtsrindex</command> must represent the same objects in
the same order. Only the unique key line and the text portions of the
file are used by this program. (See &cdeman.dtsrfzkfiles; for
information about DtSearch fzk files).
</para>
<para>A document's unique key in the fzk file must already preexist in the
database (that is, <command>dtsrload</command> must be executed before
<command>dtsrindex</command>). If any words are already indexed for the
unique document key, indicating <command>dtsrload</command> "updated"
the document, then the newly parsed words from the current fzk file will
totally replace the previously indexed words.
</para>
<para>When duplicate record ids are encountered in a single fzk file, only the
first occurrence of the document is indexed into the database; the
second one is discarded. Sinxe this is exactly the same discard order as
<command>dtsrload</command>, the same fzk file can be used for both
programs. Duplicate record ids are maintained during execution with a
hash table.
</para>
<para><command>dtsrindex</command> performs two passes. In the first pass,
<command>dtsrindex</command> constructs an inverted index in memory of
all the words it parses from the fzk file. Since the index is built in
memory, it is possible to run out of memory for very large fzk files.
For this reason very large fzk files are processed in batches. Execution
time in the first pass depends on the size of the fzk file.
</para>
<para>In the second pass, <command>dtsrindex</command> merges the information
in the memory index into the database's disk inverted index. Execution
time in the second pass depends on both the size of the incoming fzk
file and the overall size of the database.
</para>
<para>If <command>dtsrindex</command> is interrupted in the first pass, it
can be reexecuted without database damage. However if it is interrupted
in the second pass, the database will be corrupted. Database backups
are always recommended.
</para>
<caution>
<para>To prevent database corruption, execute <command>dtsrindex</command>
only after all users of a preexisting database have exited their search
programs. For a single fzk file, <command>dtsrload</command> must be
executed immediately before <command>dtsrindex</command> so that
<command>dtsrindex</command> can map the words it indexes to the correct
internal database addresses. Only after both programs successfully
complete execution may users again be allowed to perform online searches
of the database.
</para>
</caution>
</refsect1>
<refsect1>
<title>OPTIONS</title>
<para>The following options are available:</para>
<note>
<para>If an option takes a value, the value must be directly appended to
the option name without white space.</para>
</note>
<variablelist>
<varlistentry><term><literal>&minus;d</literal><Symbol Role="Variable">dbname</Symbol></term>
<listitem>
<para>Specifies the 1 to 8 ASCII character name of the database to be
updated.
If an optional directory path is not prepended to the database
name, <command>dtsrindex</command> will attempt to open the database from
the current working directory. File name extensions for database
files are automatically appended.
</para>
</listitem>
</varlistentry>
<varlistentry><term><literal>&minus;t</literal><Symbol Role="Variable">etxstr</Symbol></term>
<listitem>
<para>Specifies the end of document text delimiter string. The default
document separator in an fzk file is an ASCII form feed character
followed by an ASCII line feed ('\f\n'). For certain multibyte languages
it may be more convenient to specify a nonASCII string as the document
delimiter.
</para>
</listitem>
</varlistentry>
<varlistentry><term><literal>&minus;h0</literal></term>
<listitem>
<para>Instructs <command>dtsrindex</command> to not check for duplicate
record ids. This option should not be specified unless it
is certain that there are no duplicate ids in the fzk file.
</para>
</listitem>
</varlistentry>
<varlistentry><term><literal>&minus;h</literal><Symbol Role="Variable">hashsz</Symbol></term>
<listitem>
<para>Sets the duplicate record id hash table size to <Symbol Role="Variable">hashsz</Symbol>. The default is 3000.
<command>dtsrindex</command> will execute more efficiently if the
specified table size is larger than the number of documents in the fzk
file.
</para>
</listitem>
</varlistentry>
<varlistentry><term><literal>&minus;r</literal><Symbol Role="Variable">recdots</Symbol></term>
<listitem>
<para>Instructs <command>dtsrindex</command> to print a progress character to
stdout for every <Symbol Role="Variable">recdots</Symbol> documents
processed during the first pass. The default is 20.
</para>
</listitem>
</varlistentry>
<varlistentry><term><literal>&minus;b</literal><Symbol Role="Variable">batchsz</Symbol></term>
<listitem>
<para>Sets the batch size to <Symbol Role="Variable">batchsz</Symbol>. The
default is 10000. The batch size is the maximum number of records
processed in Pass 1 before copying the in memory index to disk in Pass
2. Larger batch sizes significantly improve execution time in Pass 2,
but require exponentially larger amounts of memory. The default batch
size has been optimized for moderately fast machines with large amounts
of memory.
</para>
</listitem>
</varlistentry>
<varlistentry><term><literal>&minus;c</literal><Symbol Role="Variable">cachesz</Symbol></term>
<listitem>
<para>Sets the number of 1024 byte cache pages used by the DtSearch Database
Management System to <Symbol Role="Variable">cachesz</Symbol>. The
default is 64. The cache size affects memory paging performance for word
b-trees. <Symbol Role="Variable">cachesz</Symbol>should be greater than
or equal to 16, in even powers of 2. The default is usually sufficient.
</para>
</listitem>
</varlistentry>
<varlistentry><term><literal>&minus;i</literal><Symbol Role="Variable">inbufsz</Symbol></term>
<listitem>
<para>Sets the size of the input line buffer to <Symbol Role="Variable">inbufsz</Symbol>. The default is 1024 bytes. This
buffer is used only for reading the four ASCII header lines for each
document in an fzk file. (The text portion of each document is parsed on
the fly a word at a time.) Increasing <Symbol Role="Variable">inbufsz</Symbol> may be appropriate for very large
abstracts, but the default is sufficient in most cases.
</para>
</listitem>
</varlistentry>
</variablelist>
</refsect1>
<refsect1>
<title>OPERANDS</title>
<para>The required input file name (<Symbol Role="Variable">file</Symbol>)
identifies the file to be processed by <command>dtsrindex</command>. It
can optionally include a path prefix, either from root or relative to
the current working directory. If a file name extension is not
specified, <command>dtsrindex</command> assumes a default extension of
<Filename>.fzk</Filename>.
</para>
</refsect1>
<refsect1>
<title>ENVIRONMENT VARIABLES</title>
<para>None.</para>
</refsect1>
<refsect1>
<title>RESOURCES</title>
<para>None.</para>
</refsect1>
<refsect1>
<title>ACTIONS/MESSAGES</title>
<para>None.</para>
</refsect1>
<refsect1>
<title>RETURN VALUES</title>
<para>The return values are as follows:</para>
<variablelist>
<varlistentry><term>0</term>
<listitem>
<para><command>dtsrindex</command> completed successfully.</para>
</listitem>
</varlistentry>
<varlistentry><term>1</term>
<listitem>
<para><command>dtsrindex</command> successfully
recovered from an error. This occurs when one or more
documents were discarded because of a partially invalid
fzk file format, duplicate record ids, or empty record text.
</para>
</listitem>
</varlistentry>
<varlistentry><term>>1</term>
<listitem>
<para><command>dtsrindex</command> encountered a fatal error.
</para>
</listitem>
</varlistentry>
</variablelist>
</refsect1>
<refsect1>
<title>FILES</title>
<para><command>dtsrindex</command> reads the specified fzk file and opens
all the database and related language files for the specified
database name.
</para>
<para><command>dtsrindex</command> updates the following database files:
</para>
<simplelist>
<member><symbol role="Variable">dbname</symbol>.d21</member>
<member><symbol role="Variable">dbname</symbol>.d22</member>
<member><symbol role="Variable">dbname</symbol>.d23</member>
<member><symbol role="Variable">dbname</symbol>.k21</member>
<member><symbol role="Variable">dbname</symbol>.k22</member>
<member><symbol role="Variable">dbname</symbol>.k23</member>
<member><symbol role="Variable">dbname</symbol>.d99</member>
</simplelist>
</refsect1>
<refsect1>
<title>EXAMPLES</title>
<para>Index all words in the fzk file named <filename>batch1.fzk</filename> in
the current working directory into database <filename>mydb</filename>.
</para>
<programlisting>
dtsrindex -dmydb batch1
</programlisting>
<para>Load database <filename>mydb</filename> with the documents specified in
the fzk file <filename>/u/dtsearch/jpndocs.1</filename>. Three ASCII
plus signs at the bottom of each document signals the end of document
text and the beginning of the next fzk file record.
</para>
<programlisting>
dtsrindex -dmydb -t+++ /u/dtsearch/jpndocs.1
</programlisting>
</refsect1>
<refsect1>
<title>SEE ALSO</title>
<para>&cdeman.dtsrload;,
&cdeman.dtsrhan;,
&cdeman.dtsrfzkfiles;,
&cdeman.DtSearch;
</para>
</refsect1>
</refentry>