275 lines
11 KiB
Plaintext
275 lines
11 KiB
Plaintext
<!-- $XConsortium: srindex.sgm /main/7 1996/09/08 19:56:47 rws $ -->
|
|
<!-- (c) Copyright 1996 Digital Equipment Corporation. -->
|
|
<!-- (c) Copyright 1996 Hewlett-Packard Company. -->
|
|
<!-- (c) Copyright 1996 International Business Machines Corp. -->
|
|
<!-- (c) Copyright 1996 Sun Microsystems, Inc. -->
|
|
<!-- (c) Copyright 1996 Novell, Inc. -->
|
|
<!-- (c) Copyright 1996 FUJITSU LIMITED. -->
|
|
<!-- (c) Copyright 1996 Hitachi. -->
|
|
<![%CDE.C.CDE; [<refentry id="CDE.SEARCH.dtsrindex">]]>
|
|
<refmeta><refentrytitle>dtsrindex</refentrytitle><manvolnum>user cmd</manvolnum>
|
|
</refmeta>
|
|
<refnamediv><refname><command>dtsrindex</command></refname><refpurpose>Load
|
|
inverted index for document objects</refpurpose></refnamediv>
|
|
<refsynopsisdiv>
|
|
<cmdsynopsis>
|
|
<command>dtsrindex</command>
|
|
<arg choice="plain">−d<replaceable>dbname</replaceable></arg>
|
|
<arg choice="opt">−t<replaceable>etxstr</replaceable></arg>
|
|
<arg choice="opt"><group choice="plain"><arg choice="plain">−h0</arg>
|
|
<arg choice="plain">−h<replaceable>hashsz</replaceable></arg>
|
|
</group></arg>
|
|
<arg choice="opt">−r<replaceable>recdots</replaceable></arg>
|
|
<arg choice="opt">−b<replaceable>batchsz</replaceable></arg>
|
|
<arg choice="opt">−c<replaceable>cachesz</replaceable></arg>
|
|
<arg choice="opt">−i<replaceable>inbufsz</replaceable></arg>
|
|
<arg choice="plain"><replaceable>file</replaceable></arg>
|
|
</cmdsynopsis>
|
|
</refsynopsisdiv>
|
|
<refsect1>
|
|
<title>DESCRIPTION</title>
|
|
<para><command>dtsrindex</command> is the second of a pair of programs that
|
|
load a database with documents data from an input fzk file.
|
|
<command>dtsrload</command> loads document header information and
|
|
optionally the documents themselves. <command>dtsrindex</command> parses
|
|
words from document text and loads them into the inverted index files.
|
|
Word parsing is performed in the specified language and linguistic
|
|
codeset of the database. The inverted index contains the search terms
|
|
used for subsequent online queries.
|
|
</para>
|
|
<para>An fzk file can be generated by <command>dtsrhan</command> manually with
|
|
a text editor, or by a special application program created for the
|
|
purpose. Typically the same fzk file is used for
|
|
<command>dtsrload</command> and <command>dtsrindex</command>. However,
|
|
it is not required and there are situations where it may not be
|
|
desirable. If the same fzk file is not used by both programs, the one
|
|
used for <command>dtsrindex</command> must represent the same objects in
|
|
the same order. Only the unique key line and the text portions of the
|
|
file are used by this program. (See &cdeman.dtsrfzkfiles; for
|
|
information about DtSearch fzk files).
|
|
</para>
|
|
<para>A document's unique key in the fzk file must already preexist in the
|
|
database (that is, <command>dtsrload</command> must be executed before
|
|
<command>dtsrindex</command>). If any words are already indexed for the
|
|
unique document key, indicating <command>dtsrload</command> "updated"
|
|
the document, then the newly parsed words from the current fzk file will
|
|
totally replace the previously indexed words.
|
|
</para>
|
|
<para>When duplicate record ids are encountered in a single fzk file, only the
|
|
first occurrence of the document is indexed into the database; the
|
|
second one is discarded. Sinxe this is exactly the same discard order as
|
|
<command>dtsrload</command>, the same fzk file can be used for both
|
|
programs. Duplicate record ids are maintained during execution with a
|
|
hash table.
|
|
</para>
|
|
<para><command>dtsrindex</command> performs two passes. In the first pass,
|
|
<command>dtsrindex</command> constructs an inverted index in memory of
|
|
all the words it parses from the fzk file. Since the index is built in
|
|
memory, it is possible to run out of memory for very large fzk files.
|
|
For this reason very large fzk files are processed in batches. Execution
|
|
time in the first pass depends on the size of the fzk file.
|
|
</para>
|
|
<para>In the second pass, <command>dtsrindex</command> merges the information
|
|
in the memory index into the database's disk inverted index. Execution
|
|
time in the second pass depends on both the size of the incoming fzk
|
|
file and the overall size of the database.
|
|
</para>
|
|
<para>If <command>dtsrindex</command> is interrupted in the first pass, it
|
|
can be reexecuted without database damage. However if it is interrupted
|
|
in the second pass, the database will be corrupted. Database backups
|
|
are always recommended.
|
|
</para>
|
|
<caution>
|
|
<para>To prevent database corruption, execute <command>dtsrindex</command>
|
|
only after all users of a preexisting database have exited their search
|
|
programs. For a single fzk file, <command>dtsrload</command> must be
|
|
executed immediately before <command>dtsrindex</command> so that
|
|
<command>dtsrindex</command> can map the words it indexes to the correct
|
|
internal database addresses. Only after both programs successfully
|
|
complete execution may users again be allowed to perform online searches
|
|
of the database.
|
|
</para>
|
|
</caution>
|
|
</refsect1>
|
|
<refsect1>
|
|
<title>OPTIONS</title>
|
|
<para>The following options are available:</para>
|
|
<note>
|
|
<para>If an option takes a value, the value must be directly appended to
|
|
the option name without white space.</para>
|
|
</note>
|
|
<variablelist>
|
|
<varlistentry><term><literal>−d</literal><Symbol Role="Variable">dbname</Symbol></term>
|
|
<listitem>
|
|
<para>Specifies the 1 to 8 ASCII character name of the database to be
|
|
updated.
|
|
If an optional directory path is not prepended to the database
|
|
name, <command>dtsrindex</command> will attempt to open the database from
|
|
the current working directory. File name extensions for database
|
|
files are automatically appended.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry><term><literal>−t</literal><Symbol Role="Variable">etxstr</Symbol></term>
|
|
<listitem>
|
|
<para>Specifies the end of document text delimiter string. The default
|
|
document separator in an fzk file is an ASCII form feed character
|
|
followed by an ASCII line feed ('\f\n'). For certain multibyte languages
|
|
it may be more convenient to specify a nonASCII string as the document
|
|
delimiter.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry><term><literal>−h0</literal></term>
|
|
<listitem>
|
|
<para>Instructs <command>dtsrindex</command> to not check for duplicate
|
|
record ids. This option should not be specified unless it
|
|
is certain that there are no duplicate ids in the fzk file.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry><term><literal>−h</literal><Symbol Role="Variable">hashsz</Symbol></term>
|
|
<listitem>
|
|
<para>Sets the duplicate record id hash table size to <Symbol Role="Variable">hashsz</Symbol>. The default is 3000.
|
|
<command>dtsrindex</command> will execute more efficiently if the
|
|
specified table size is larger than the number of documents in the fzk
|
|
file.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry><term><literal>−r</literal><Symbol Role="Variable">recdots</Symbol></term>
|
|
<listitem>
|
|
<para>Instructs <command>dtsrindex</command> to print a progress character to
|
|
stdout for every <Symbol Role="Variable">recdots</Symbol> documents
|
|
processed during the first pass. The default is 20.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry><term><literal>−b</literal><Symbol Role="Variable">batchsz</Symbol></term>
|
|
<listitem>
|
|
<para>Sets the batch size to <Symbol Role="Variable">batchsz</Symbol>. The
|
|
default is 10000. The batch size is the maximum number of records
|
|
processed in Pass 1 before copying the in memory index to disk in Pass
|
|
2. Larger batch sizes significantly improve execution time in Pass 2,
|
|
but require exponentially larger amounts of memory. The default batch
|
|
size has been optimized for moderately fast machines with large amounts
|
|
of memory.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry><term><literal>−c</literal><Symbol Role="Variable">cachesz</Symbol></term>
|
|
<listitem>
|
|
<para>Sets the number of 1024 byte cache pages used by the DtSearch Database
|
|
Management System to <Symbol Role="Variable">cachesz</Symbol>. The
|
|
default is 64. The cache size affects memory paging performance for word
|
|
b-trees. <Symbol Role="Variable">cachesz</Symbol>should be greater than
|
|
or equal to 16, in even powers of 2. The default is usually sufficient.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry><term><literal>−i</literal><Symbol Role="Variable">inbufsz</Symbol></term>
|
|
<listitem>
|
|
<para>Sets the size of the input line buffer to <Symbol Role="Variable">inbufsz</Symbol>. The default is 1024 bytes. This
|
|
buffer is used only for reading the four ASCII header lines for each
|
|
document in an fzk file. (The text portion of each document is parsed on
|
|
the fly a word at a time.) Increasing <Symbol Role="Variable">inbufsz</Symbol> may be appropriate for very large
|
|
abstracts, but the default is sufficient in most cases.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
</refsect1>
|
|
<refsect1>
|
|
<title>OPERANDS</title>
|
|
<para>The required input file name (<Symbol Role="Variable">file</Symbol>)
|
|
identifies the file to be processed by <command>dtsrindex</command>. It
|
|
can optionally include a path prefix, either from root or relative to
|
|
the current working directory. If a file name extension is not
|
|
specified, <command>dtsrindex</command> assumes a default extension of
|
|
<Filename>.fzk</Filename>.
|
|
</para>
|
|
</refsect1>
|
|
<refsect1>
|
|
<title>ENVIRONMENT VARIABLES</title>
|
|
<para>None.</para>
|
|
</refsect1>
|
|
<refsect1>
|
|
<title>RESOURCES</title>
|
|
<para>None.</para>
|
|
</refsect1>
|
|
<refsect1>
|
|
<title>ACTIONS/MESSAGES</title>
|
|
<para>None.</para>
|
|
</refsect1>
|
|
<refsect1>
|
|
<title>RETURN VALUES</title>
|
|
<para>The return values are as follows:</para>
|
|
<variablelist>
|
|
<varlistentry><term>0</term>
|
|
<listitem>
|
|
<para><command>dtsrindex</command> completed successfully.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry><term>1</term>
|
|
<listitem>
|
|
<para><command>dtsrindex</command> successfully
|
|
recovered from an error. This occurs when one or more
|
|
documents were discarded because of a partially invalid
|
|
fzk file format, duplicate record ids, or empty record text.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry><term>>1</term>
|
|
<listitem>
|
|
<para><command>dtsrindex</command> encountered a fatal error.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
</refsect1>
|
|
<refsect1>
|
|
<title>FILES</title>
|
|
<para><command>dtsrindex</command> reads the specified fzk file and opens
|
|
all the database and related language files for the specified
|
|
database name.
|
|
</para>
|
|
<para><command>dtsrindex</command> updates the following database files:
|
|
</para>
|
|
<simplelist>
|
|
<member><symbol role="Variable">dbname</symbol>.d21</member>
|
|
<member><symbol role="Variable">dbname</symbol>.d22</member>
|
|
<member><symbol role="Variable">dbname</symbol>.d23</member>
|
|
<member><symbol role="Variable">dbname</symbol>.k21</member>
|
|
<member><symbol role="Variable">dbname</symbol>.k22</member>
|
|
<member><symbol role="Variable">dbname</symbol>.k23</member>
|
|
<member><symbol role="Variable">dbname</symbol>.d99</member>
|
|
</simplelist>
|
|
</refsect1>
|
|
<refsect1>
|
|
<title>EXAMPLES</title>
|
|
<para>Index all words in the fzk file named <filename>batch1.fzk</filename> in
|
|
the current working directory into database <filename>mydb</filename>.
|
|
</para>
|
|
<programlisting>
|
|
dtsrindex -dmydb batch1
|
|
</programlisting>
|
|
<para>Load database <filename>mydb</filename> with the documents specified in
|
|
the fzk file <filename>/u/dtsearch/jpndocs.1</filename>. Three ASCII
|
|
plus signs at the bottom of each document signals the end of document
|
|
text and the beginning of the next fzk file record.
|
|
</para>
|
|
<programlisting>
|
|
dtsrindex -dmydb -t+++ /u/dtsearch/jpndocs.1
|
|
</programlisting>
|
|
</refsect1>
|
|
<refsect1>
|
|
<title>SEE ALSO</title>
|
|
<para>&cdeman.dtsrload;,
|
|
&cdeman.dtsrhan;,
|
|
&cdeman.dtsrfzkfiles;,
|
|
&cdeman.DtSearch;
|
|
</para>
|
|
</refsect1>
|
|
</refentry>
|
|
|