<!-- $XConsortium: huffcode.sgm /main/7 1996/09/08 19:53:22 rws $ -->
<!-- (c) Copyright 1996 Digital Equipment Corporation. -->
<!-- (c) Copyright 1996 Hewlett-Packard Company. -->
<!-- (c) Copyright 1996 International Business Machines Corp. -->
<!-- (c) Copyright 1996 Sun Microsystems, Inc. -->
<!-- (c) Copyright 1996 Novell, Inc. -->
<!-- (c) Copyright 1996 FUJITSU LIMITED. -->
<!-- (c) Copyright 1996 Hitachi. -->
<![%CDE.C.CDE; [<refentry id="CDE.SEARCH.huffcode">]]>
<refmeta><refentrytitle>huffcode</refentrytitle><manvolnum>user cmd</manvolnum>
</refmeta>
<refnamediv><refname><command>huffcode</command></refname><refpurpose>
Create optimized DtSearch compression/decompression tables
</refpurpose></refnamediv>
<refsynopsisdiv>
<cmdsynopsis>
<command>huffcode</command>
<arg choice="opt"><group choice="plain">
<arg choice="plain">−l<replaceable>lit_thresh</replaceable></arg>
<arg choice="plain">−l−</arg>
</group></arg>
<arg choice="opt">−o</arg>
<arg choice="plain"><replaceable>huffile</replaceable></arg>
<arg choice="opt"><replaceable>textfile</replaceable></arg>
</cmdsynopsis>
</refsynopsisdiv>
<refsect1>
<title>DESCRIPTION</title>
<para><command>huffcode</command> creates optimized DtSearch
compression/decompression tables.
</para>
<para>Documents stored in a DtSearch database text repository can first be
compressed using a Huffman text compression algorithm. The algorithm
provides optimal compression only with preanalysis of the statistical
distribution of bytes in the database corpus.
<command>huffcode</command> analyzes a text corpus and generates
DtSearch compression and decompression tables. It is provided as a
convenience utility for database developers who want to optimize offline
storage requirements. Compression is not used in databases created
without the ability to store text in a DtSearch repository.
</para>
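<para>The dependence of Huffman compression on byte statistics can be
illustrated with a minimal Python sketch. It is not the DtSearch
implementation: the function names and the sample text are hypothetical,
and only code lengths (not actual bit patterns) are computed. It counts
the bytes in a sample and derives a code length for each byte, so that
frequent bytes receive short codes.
</para>
<programlisting><![CDATA[
```python
import heapq
from collections import Counter


def byte_counts(data: bytes) -> Counter:
    """Count how often each byte value occurs in the sample."""
    return Counter(data)


def huffman_code_lengths(counts: dict) -> dict:
    """Return {byte: code length in bits} for a Huffman code built from counts."""
    symbols = [s for s, c in counts.items() if c > 0]
    if len(symbols) == 1:
        return {symbols[0]: 1}
    # Heap entries: (total count, unique tie-breaker, bytes in this subtree).
    heap = [(counts[s], i, [s]) for i, s in enumerate(sorted(symbols))]
    heapq.heapify(heap)
    lengths = dict.fromkeys(symbols, 0)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, syms1 = heapq.heappop(heap)
        c2, _, syms2 = heapq.heappop(heap)
        # Merging two subtrees pushes every byte in them one level deeper.
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (c1 + c2, tie, syms1 + syms2))
        tie += 1
    return lengths


def compressed_bits(counts: dict, lengths: dict) -> int:
    """Total size in bits of the sample when coded with the given lengths."""
    return sum(counts[s] * lengths[s] for s in lengths)


sample = b"aaaaaaaabbbbccd"  # hypothetical sample text
counts = byte_counts(sample)
lengths = huffman_code_lengths(counts)
total_bits = compressed_bits(counts, lengths)
print(total_bits, "bits coded versus", 8 * len(sample), "bits uncompressed")
```
]]></programlisting>
<para>In this sample the most frequent byte receives a 1-bit code and the
rarest bytes receive 3-bit codes. A table built from unrepresentative
text would assign short codes to the wrong bytes, which is why the
statistics should come from the corpus itself.
</para>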
<para><command>huffcode</command> reads a text file as input and writes out
<filename>ophuf.huf</filename> (compression or "encode" table) and
<filename>ophuf.c</filename> (decompression or "decode" table).
<filename>ophuf.huf</filename> is an external ASCII file that also
retains the statistical information on how it was generated.
<command>huffcode</command> can be executed repeatedly against different
text samples, continually accumulating results. In the case of a small
or static text corpus, the entire corpus can be fed into
<command>huffcode</command> for optimal Huffman compression. In large or
dynamic databases the typical practice is to feed
<command>huffcode</command> representative text samples.
</para>
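<para>The accumulation of statistics across repeated runs can be sketched
as follows. This is an assumption-laden illustration only: the JSON file
used here is a hypothetical stand-in and does not reflect the actual
format of a huf file, and the function names are invented for the example.
</para>
<programlisting><![CDATA[
```python
import json
import os
import tempfile
from collections import Counter


def load_counts(path: str) -> Counter:
    """Load saved counts, or start from zero if the file does not exist yet."""
    if not os.path.exists(path):
        return Counter()
    with open(path, "r", encoding="ascii") as fh:
        return Counter({int(k): v for k, v in json.load(fh).items()})


def save_counts(path: str, counts: Counter) -> None:
    """Write the counts back so that the next run can continue from them."""
    with open(path, "w", encoding="ascii") as fh:
        json.dump({str(k): v for k, v in sorted(counts.items())}, fh)


def accumulate(path: str, sample: bytes) -> Counter:
    """Add one sample's byte frequencies to the counts stored at path."""
    counts = load_counts(path)
    counts.update(sample)
    save_counts(path, counts)
    return counts


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in file, not the real huf format.
    table = os.path.join(tmp, "counts.json")
    accumulate(table, b"abba")          # first run starts from zero counts
    totals = accumulate(table, b"abc")  # second run adds to the saved counts

print(dict(totals))
```
]]></programlisting>
<para>After the second call the totals reflect both samples, which mirrors
how successive runs against different text samples build up one set of
statistics.
</para>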
<para>The Huffman code tables are created once for each API instance (not once
per database) before any documents are loaded. The only program to read
the encode table, an external file, is <command>dtsrload</command>. The
<filename>ophuf.huf</filename> file generated by
<command>huffcode</command> should replace the provided
default file before the first run of <command>dtsrload</command> for
any databases to be accessed by a particular API instance. The decode
table, a C module, should be compiled and linked into the application
code ahead of the API library to override the default decode module in
the library. Huf files and decode modules are not user editable.
</para>
<refsect2>
<title>HCTREE_ID</title>
<para>The encode and decode tables must reflect identical
byte statistics to prevent decode errors. The first line of
<filename>ophuf.huf</filename> includes a long integer value named
<Symbol>HCTREE_ID</Symbol>. Each execution of
<command>huffcode</command> generates a new, unique
<literal>hctree_id</literal> integer. <command>dtsrload</command> loads
this integer into the database configuration and status record when it
loads the first document into a new database. Thereafter, each execution
of <command>dtsrload</command> for that database confirms that the same
<literal>hctree_id</literal> is used for each document compression.
<command>dtsrload</command> aborts if the
<literal>hctree_id</literal> in <filename>ophuf.huf</filename> does not
match the value recorded for the database by previous executions.
</para>
<para><literal>hctree_id</literal> is also stored as a variable in the decode
module <filename>ophuf.c</filename>. <function>DtSearchInit</function>
will not open any database listed in the ocf file whose
<literal>hctree_id</literal>, as stored in its configuration and status
record, does not match the value in the decode module. The
<command>dtsrdbrec</command> utility will print the
<literal>hctree_id</literal> value for any database.
</para>
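<para>The record-then-verify behavior described above can be sketched in a
few lines of Python. This is only a model of the described logic, not the
code used by <command>dtsrload</command> or
<function>DtSearchInit</function>: the dictionary standing in for the
configuration and status record, the exception class, and the identifier
values are all hypothetical.
</para>
<programlisting><![CDATA[
```python
from typing import Optional


class HctreeMismatch(Exception):
    """Raised when a table's identifier differs from the one already recorded."""


def record_or_verify(db_record: dict, table_id: int) -> None:
    """Record the table id on first use; afterwards require the same id."""
    stored: Optional[int] = db_record.get("hctree_id")
    if stored is None:
        # First document loaded into a new database: remember the table id.
        db_record["hctree_id"] = table_id
    elif stored != table_id:
        raise HctreeMismatch(f"database expects {stored}, table has {table_id}")


db = {}                           # hypothetical stand-in for the database record
record_or_verify(db, 841234567)   # first load records the id
record_or_verify(db, 841234567)   # later load with the same table passes

try:
    record_or_verify(db, 999)     # a regenerated table carries a different id
    mismatch_detected = False
except HctreeMismatch:
    mismatch_detected = True

print(db["hctree_id"], mismatch_detected)
```
]]></programlisting>
<para>Because each run of <command>huffcode</command> produces a new
identifier, a check of this kind rejects any table generated after the
database was first loaded.
</para>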
</refsect2>
</refsect1>
<refsect1>
<title>OPTIONS</title>
<para>The following options are available:</para>
<note>
<para>If an option takes a value, the value must be directly appended to
the option name without white space.</para>
</note>
<variablelist>
<varlistentry><term><literal>−l</literal><Symbol Role="Variable">lit_thresh</Symbol></term>
<listitem>
<para>Sets the literal character's minimum threshold to the integer specified
by <Symbol Role="Variable">lit_thresh</Symbol>.
</para>
<para>This Huffman algorithm implements a pseudo-character called the literal
character. It represents all characters whose frequency is so low that
no Huffman translation will be attempted. This reduces the maximum
length of the coded bit string when there are many zero- or
low-frequency bytes in the text corpus. For example, pure ASCII text
files only occasionally have byte values less than 32 (control
characters) and rarely greater than 127 (high order bit turned on). The
<Symbol Role="Variable">lit_thresh</Symbol> value specifies the literal
character's threshold count. After counting is completed, any character
in the encode table occurring with frequency less than or equal to
<Symbol Role="Variable">lit_thresh</Symbol> will be coded with the
literal character.
</para>
<para>If this option and the <literal>−l−</literal> option are
omitted, the default is <literal>−l0</literal>, meaning that
literal coding is provided only for bytes that never occur (counts of
zero).
</para>
</listitem>
</varlistentry>
<varlistentry><term><literal>−l−</literal></term>
<listitem>
<para>Disables literal character encoding. Disabling literal character
encoding in corpora with unbalanced byte frequency distributions will lead
to extremely long bit string codes. Most natural language text corpora
are represented by highly unbalanced frequency distributions, so this
option is not recommended for most DtSearch applications.
</para>
<para>If this option and the <literal>−l</literal><Symbol Role="Variable">lit_thresh</Symbol> option are omitted, the default is
<literal>−l0</literal>, meaning that literal coding is provided
only for bytes that never occur (counts of zero).
</para>
</listitem>
</varlistentry>
<varlistentry><term><literal>−o</literal></term>
<listitem>
<para>Suppresses the overwrite prompt. It preauthorizes erasure and
reinitialization of the decode module.
</para>
</listitem>
</varlistentry>
<varlistentry><term><Symbol Role="Variable">textfile</Symbol></term>
<listitem>
<para>Specifies an optional input file of text that is representative of the
entire text corpus of the databases. It should contain bytes in the same
relative abundances as occur in documents in the entire corpus. Since
<command>huffcode</command> can be executed repeatedly with different
document <Symbol Role="Variable">textfile</Symbol>s, it is possible to
analyze the entire actual corpus if it is small enough or static.
</para>
<para>If <Symbol Role="Variable">textfile</Symbol> is not specified, the byte
frequencies in the currently loaded tables are not changed, and the
Huffman codes are recomputed with the existing frequencies. This is
useful for examining the relative merits of using different literal
character thresholds.
</para>
</listitem>
</varlistentry>
</variablelist>
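<para>The effect of the literal character threshold on maximum code length
can be illustrated with a short Python sketch. It rests on assumptions
that the text above does not state: the literal pseudo-character is given
the summed count of the bytes it replaces, and the cost of emitting the
raw byte after the literal code is not modeled. The
<literal>LITERAL</literal> marker, the function names, and the sample
counts are hypothetical.
</para>
<programlisting><![CDATA[
```python
import heapq

LITERAL = "LITERAL"  # hypothetical marker for the literal pseudo-character


def code_lengths(counts: dict) -> dict:
    """Huffman code length in bits for every symbol in counts."""
    # Heap entries: (total count, unique tie-breaker, symbols in this subtree).
    heap = [(c, i, [s]) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        # Each merge pushes the merged symbols one level deeper in the tree.
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (c1 + c2, tie, s1 + s2))
        tie += 1
    return lengths


def apply_literal_threshold(counts: dict, lit_thresh: int) -> dict:
    """Fold every byte with count <= lit_thresh into one LITERAL symbol."""
    kept = {b: c for b, c in counts.items() if c > lit_thresh}
    # Assumption: the literal symbol is weighted by the bytes it replaces.
    kept[LITERAL] = sum(c for c in counts.values() if c <= lit_thresh)
    return kept


# Highly unbalanced sample: each byte occurs twice as often as the previous one.
counts = {byte: 2 ** byte for byte in range(12)}
plain = code_lengths(counts)
folded = code_lengths(apply_literal_threshold(counts, 200))
print(max(plain.values()), max(folded.values()))  # prints "11 4"
```
]]></programlisting>
<para>With a threshold of 200, the eight rarest bytes in this sample share
one code and the longest code shrinks from 11 bits to 4 bits. Running
the same comparison with several thresholds is analogous to rerunning
<command>huffcode</command> without a
<Symbol Role="Variable">textfile</Symbol> to compare literal character
thresholds.
</para>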
</refsect1>
<refsect1>
<title>OPERANDS</title>
<para>The required input file name (<Symbol Role="Variable">huffile</Symbol>)
is the base file name of the encode table, excluding the
<Filename>.huf</Filename> extension. <command>dtsrload</command> expects
<Symbol Role="Variable">huffile</Symbol> to be
<filename>ophuf</filename>. Similarly, the decode module will be named
<Symbol Role="Variable">huffile</Symbol>.c.
</para>
<para>At the beginning of each new execution, <command>huffcode</command>
tries to open the encode table file and continue byte frequency counting
from the last run. If the huf file represented by
<Symbol Role="Variable">huffile</Symbol> does not exist, the table's counts are
initialized to zero. The decode module is recomputed from scratch on each run,
whether or not it existed before.
</para>
</refsect1>
<refsect1>
<title>ENVIRONMENT VARIABLES</title>
<para>None.</para>
</refsect1>
<refsect1>
<title>RESOURCES</title>
<para>None.</para>
</refsect1>
<refsect1>
<title>ACTIONS/MESSAGES</title>
<para>None.</para>
</refsect1>
<refsect1>
<title>RETURN VALUES</title>
<para>The return values are as follows:</para>
<variablelist>
<varlistentry><term>0</term>
<listitem>
<para><command>huffcode</command> completed successfully.</para>
</listitem>
</varlistentry>
<varlistentry><term>nonzero</term>
<listitem>
<para><command>huffcode</command> encountered an error.
</para>
</listitem>
</varlistentry>
</variablelist>
</refsect1>
<refsect1>
<title>FILES</title>
<para><command>huffcode</command> reads the specified
<Symbol Role="Variable">huffile</Symbol>. It also reads
<Symbol Role="Variable">textfile</Symbol> if it is
specified.
It writes to
<Symbol Role="Variable">huffile</Symbol>.huf and
<Symbol Role="Variable">huffile</Symbol>.c.
</para>
</refsect1>
<refsect1>
<title>EXAMPLES</title>
<para>Read <filename>ophuf.huf</filename> if it exists and initialize the
internal byte count table with its byte frequency counts. If
<filename>ophuf.huf</filename> does not exist, the internal byte counts
will be initialized to zero. The encoding table in the original huf
file will be discarded. The text file <filename>foo.txt</filename> will
be read and its individual byte frequencies added to the internal byte
count table. Then, <filename>ophuf.huf</filename> will be written out,
with an encoding scheme based on the current byte counts, and with a
literal character encoding all bytes that have zero frequency. Finally,
if the decode module <filename>ophuf.c</filename> already exists, a
prompt requesting permission to overwrite it will be output to stdout
and, if an affirmative response is read from stdin, a new version
corresponding to the new <filename>ophuf.huf</filename> will be written
out.
</para>
<programlisting>
huffcode ophuf foo.txt
</programlisting>
<para>Read <filename>myappl.huf</filename> and initialize the internal byte
count table with its byte frequency counts. Since no
<filename>textfile</filename> argument is specified, the only possible
action is to build different coding tables using existing frequency
counts in <filename>myappl.huf</filename>. The new tables will be based
on a literal character implementation where only bytes that occur more
than 200 times will be given an encoding; all other bytes will be
encoded with the literal character. After new encoding tables are
generated, <filename>myappl.huf</filename> will be written out. The
decode module <filename>myappl.c</filename> will also be written out
without prompting, whether or not it already exists.
</para>
<programlisting>
huffcode -l200 -o myappl
</programlisting>
</refsect1>
<refsect1>
<title>SEE ALSO</title>
<para>&cdeman.dtsrcreate;,
&cdeman.dtsrdbrec;,
&cdeman.dtsrload;,
&cdeman.DtSrAPI;,
&cdeman.DtSearch;
</para>
</refsect1>
</refentry>