
<!-- $XConsortium: huffcode.sgm /main/7 1996/09/08 19:53:22 rws $ -->
<!-- (c) Copyright 1996 Digital Equipment Corporation. -->
<!-- (c) Copyright 1996 Hewlett-Packard Company. -->
<!-- (c) Copyright 1996 International Business Machines Corp. -->
<!-- (c) Copyright 1996 Sun Microsystems, Inc. -->
<!-- (c) Copyright 1996 Novell, Inc. -->
<!-- (c) Copyright 1996 FUJITSU LIMITED. -->
<!-- (c) Copyright 1996 Hitachi. -->
<![%CDE.C.CDE; [<refentry id="CDE.SEARCH.huffcode">]]>
<refmeta><refentrytitle>huffcode</refentrytitle><manvolnum>user cmd</manvolnum>
</refmeta>
<refnamediv><refname><command>huffcode</command></refname><refpurpose>
Create optimized DtSearch compression/decompression tables
</refpurpose></refnamediv>
<refsynopsisdiv>
<cmdsynopsis>
<command>huffcode</command>
<arg choice="opt"><group choice="plain">
<arg choice="plain">&minus;l<replaceable>lit_thresh</replaceable></arg>
<arg choice="plain">&minus;l&minus;</arg>
</group></arg>
<arg choice="opt">&minus;o</arg>
<arg choice="plain"><replaceable>huffile</replaceable></arg>
<arg choice="opt"><replaceable>textfile</replaceable></arg>
</cmdsynopsis>
</refsynopsisdiv>
<refsect1>
<title>DESCRIPTION</title>
<para><command>huffcode</command> creates optimized DtSearch
compression/decompression tables.
</para>
<para>Documents stored in a DtSearch database text repository can first be
compressed using a Huffman text compression algorithm. The algorithm
provides optimal compression only with preanalysis of the statistical
distribution of bytes in the database corpus.
<command>huffcode</command> analyzes a text corpus and generates
DtSearch compression and decompression tables. It is provided as a
convenience utility for database developers who want to optimize offline
storage requirements. Compression is not used in databases created
without the ability to store text in a DtSearch repository.
</para>
<para><command>huffcode</command> reads a text file as input and writes out
<filename>ophuf.huf</filename> (compression or "encode" table) and
<filename>ophuf.c</filename> (decompression or "decode" table).
<filename>ophuf.huf</filename> is an external ASCII file that also
retains the statistical information on how it was generated.
<command>huffcode</command> can be executed repeatedly against different
text samples, continually accumulating results. In the case of a small
or static text corpus, the entire corpus can be fed into
<command>huffcode</command> for optimal Huffman compression. In large or
dynamic databases the typical practice is to feed
<command>huffcode</command> representative text samples.
</para>
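<para>As an illustration of the accumulation described above, the following
Python sketch adds the byte frequencies of successive samples to a running
total. It is not the <command>huffcode</command> implementation: the function
name <literal>add_sample</literal> and the sample strings are hypothetical,
and reading or writing the <filename>ophuf.huf</filename> format is
deliberately left out.
</para>
<programlisting>
```python
from collections import Counter


def add_sample(counts, data):
    """Add the byte frequencies of one text sample to the running totals."""
    counts.update(data)  # iterating over bytes yields integers 0..255
    return counts


# Running totals, as if no encode table existed yet (all counts zero).
counts = Counter()

# Two hypothetical samples standing in for successive textfile arguments.
add_sample(counts, b"the cat sat")
add_sample(counts, b"the hat")

# Occurrences of the byte 't' across both samples.
print(counts[ord("t")])
```
</programlisting>
<para>Because the totals only ever grow, the order in which samples are
processed does not affect the final counts in this sketch.
</para>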
<para>The Huffman code tables are created once for each API instance (not once
per database) before any documents are loaded. The only program to read
the encode table, an external file, is <command>dtsrload</command>. The
<filename>ophuf.huf</filename> file generated by
<command>huffcode</command> should be used instead of the provided
default file prior to the first run of <command>dtsrload</command> for
any databases to be accessed by a particular API instance. The decode
table, a C module, should be compiled and linked into the application
code ahead of the API library to override the default decode module in
the library. Huf files and decode modules are not user editable.
</para>
<refsect2>
<title>HCTREE_ID</title>
<para>It is imperative that the encode and decode tables reflect identical
byte statistics to prevent decode errors. The first line of
<filename>ophuf.huf</filename> includes a long integer value named
<Symbol>HCTREE_ID</Symbol>. Each execution of
<command>huffcode</command> generates a new, unique
<literal>hctree_id</literal> integer. <command>dtsrload</command> loads
this integer into the database configuration and status record when it
loads the first document into a new database. Thereafter, each execution
of <command>dtsrload</command> for that database confirms that the same
<literal>hctree_id</literal> is used for each document compression.
<command>dtsrload</command> aborts if the <literal>hctree_id</literal> in
<filename>ophuf.huf</filename> does not match the value recorded for the
database by previous executions.
</para>
<para><literal>hctree_id</literal> is also stored as a variable in the decode
module <filename>ophuf.c</filename>. <function>DtSearchInit</function>
will not open any database listed in the ocf file whose
<literal>hctree_id</literal>, as stored in its configuration and status
record, does not match the value in the decode module. The
<command>dtsrdbrec</command> utility will print the
<literal>hctree_id</literal> value for any database.
</para>
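<para>The record-then-verify behavior described for
<command>dtsrload</command> can be sketched in Python as follows. This is an
assumption-laden illustration, not DtSearch code: the dictionary
<literal>db_record</literal> is a hypothetical stand-in for the database
configuration and status record, and the names
<literal>confirm_hctree_id</literal> and <literal>HctreeMismatch</literal>
and the id value are invented for the example.
</para>
<programlisting>
```python
class HctreeMismatch(Exception):
    """Raised when a table's hctree_id differs from the recorded one."""


def confirm_hctree_id(db_record, table_hctree_id):
    """Record the id on first load; afterwards require an exact match.

    db_record is a hypothetical dict standing in for the database
    configuration and status record.
    """
    stored = db_record.get("hctree_id")
    if stored is None:
        # First document loaded into a new database: remember the id.
        db_record["hctree_id"] = table_hctree_id
    elif stored != table_hctree_id:
        # The tables were regenerated after the first load: refuse to go on.
        raise HctreeMismatch(
            f"database has {stored}, table has {table_hctree_id}"
        )
    return db_record["hctree_id"]


record = {}
confirm_hctree_id(record, 842001)  # first load stores the id
confirm_hctree_id(record, 842001)  # same tables: accepted
```
</programlisting>
<para>A later call with a different id raises
<literal>HctreeMismatch</literal> and leaves the recorded value unchanged.
</para>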
</refsect2>
</refsect1>
<refsect1>
<title>OPTIONS</title>
<para>The following options are available:</para>
<note>
<para>If an option takes a value, the value must be directly appended to
the option name without white space.</para>
</note>
<variablelist>
<varlistentry><term><literal>&minus;l</literal><Symbol Role="Variable">lit_thresh</Symbol></term>
<listitem>
<para>Sets the literal character's minimum threshold to the integer specified
by <Symbol Role="Variable">lit_thresh</Symbol>.
</para>
<para>This Huffman algorithm implements a pseudo-character called the literal
character. It represents all characters whose frequency is so low that
no Huffman translation will be attempted. This reduces the maximum
length of the coded bit string when there are many zero- or
low-frequency bytes in the text corpus. For example, pure ASCII text
files only occasionally have byte values less than 32 (control
characters) and rarely greater than 127 (high order bit turned on). The
<Symbol Role="Variable">lit_thresh</Symbol> value specifies the literal
character's threshold count. After counting is completed, any character
in the encode table occurring with frequency less than or equal to
<Symbol Role="Variable">lit_thresh</Symbol> will be coded with the
literal character.
</para>
<para>If this option and the <literal>&minus;l&minus;</literal> option are
omitted, the default is <literal>&minus;l0</literal>, meaning that
literal coding is provided only for bytes that never occur (counts of
zero).
</para>
</listitem>
</varlistentry>
<varlistentry><term><literal>&minus;l&minus;</literal></term>
<listitem>
<para>Disables literal character encoding. Disabling literal character
encoding in corpora with unbalanced byte frequency distributions will lead
to extremely long bit string codes. Most natural language text corpora
have highly unbalanced frequency distributions, so this
option is not recommended for most DtSearch applications.
</para>
<para>If this option and the <literal>&minus;l</literal><Symbol Role="Variable">lit_thresh</Symbol> option are omitted, the default is
<literal>&minus;l0</literal>, meaning that literal coding is provided
only for bytes that never occur (counts of zero).
</para>
</listitem>
</varlistentry>
<varlistentry><term><literal>&minus;o</literal></term>
<listitem>
<para>Suppresses the overwrite prompt. It preauthorizes erasure and
reinitialization of the decode module.
</para>
</listitem>
</varlistentry>
<varlistentry><term><Symbol Role="Variable">textfile</Symbol></term>
<listitem>
<para>Specifies an optional input file of text that is representative of the
entire text corpus of the databases. It should contain bytes in the same
relative abundances as occur in documents in the entire corpus. Since
<command>huffcode</command> can be executed repeatedly with different
document <Symbol Role="Variable">textfile</Symbol>s, it is possible to
analyze the entire actual corpus if it is small enough or static.
</para>
<para>If <Symbol Role="Variable">textfile</Symbol> is not specified, the byte
frequencies in the currently loaded tables are not changed, and the
Huffman codes are recomputed with the existing frequencies. This is
useful for examining the relative merits of using different literal
character thresholds.
</para>
</listitem>
</varlistentry>
</variablelist>
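<para>To make the effect of the literal threshold concrete, the following
Python sketch builds Huffman code lengths over the 256 byte values with and
without a literal pseudo-character. It is a textbook Huffman construction
under stated assumptions, not the <command>huffcode</command> source: the
names <literal>code_lengths</literal> and <literal>LITERAL</literal> are
hypothetical, the byte counts are invented, and weighting the literal by the
sum of the counts folded into it is an assumption made for the example.
</para>
<programlisting>
```python
import heapq

LITERAL = "LIT"  # hypothetical name for the literal pseudo-character


def code_lengths(counts, lit_thresh=None):
    """Return {symbol: bit length} for a Huffman code over 256 byte values.

    counts maps byte value -> frequency. With lit_thresh set, every byte
    whose count is at or below the threshold is folded into LITERAL.
    With lit_thresh=None (comparable to disabling literal coding), all
    256 bytes receive their own code.
    """
    weights = {}
    for byte in range(256):
        n = counts.get(byte, 0)
        if lit_thresh is not None and lit_thresh >= n:
            weights[LITERAL] = weights.get(LITERAL, 0) + n
        else:
            weights[byte] = n
    # Huffman construction: repeatedly merge the two lightest nodes.
    # The integer in the middle of each tuple breaks ties between weights.
    heap = [(w, i, [sym]) for i, (sym, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in weights}
    tie = len(heap)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:
            lengths[sym] += 1  # each merge adds one bit to every member
        heapq.heappush(heap, (w1 + w2, tie, syms1 + syms2))
        tie += 1
    return lengths


# Hypothetical counts: four bytes seen, every other byte never seen.
sample = {ord("e"): 500, ord("t"): 300, ord("a"): 150, ord("q"): 2}

for thresh in (None, 0, 200):
    lengths = code_lengths(sample, thresh)
    # Threshold, number of coded symbols, longest code in bits.
    print(thresh, len(lengths), max(lengths.values()))
```
</programlisting>
<para>With these invented counts, the longest code shrinks as the threshold
rises, at the cost of sending more bytes through the literal character.
</para>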
</refsect1>
<refsect1>
<title>OPERANDS</title>
<para>The required input file name (<Symbol Role="Variable">huffile</Symbol>)
is the base file name of the encode table, excluding the
<Filename>.huf</Filename> extension. <command>dtsrload</command> expects
<Symbol Role="Variable">huffile</Symbol> to be
<filename>ophuf</filename>. Similarly, the decode module will be named
<Symbol Role="Variable">huffile</Symbol>.c.
</para>
<para>At the beginning of each new execution, <command>huffcode</command>
tries to open the encode table file and continue byte frequency counting
from the last run. If the huf file represented by
<Symbol Role="Variable">huffile</Symbol> does not exist, the table's counts are
initialized to zeroes. The decode module is recomputed fresh each run,
whether it existed before or not.
</para>
</refsect1>
<refsect1>
<title>ENVIRONMENT VARIABLES</title>
<para>None.</para>
</refsect1>
<refsect1>
<title>RESOURCES</title>
<para>None.</para>
</refsect1>
<refsect1>
<title>ACTIONS/MESSAGES</title>
<para>None.</para>
</refsect1>
<refsect1>
<title>RETURN VALUES</title>
<para>The return values are as follows:</para>
<variablelist>
<varlistentry><term>0</term>
<listitem>
<para><command>huffcode</command> completed successfully.</para>
</listitem>
</varlistentry>
<varlistentry><term>nonzero</term>
<listitem>
<para><command>huffcode</command> encountered an error.
</para>
</listitem>
</varlistentry>
</variablelist>
</refsect1>
<refsect1>
<title>FILES</title>
<para><command>huffcode</command> reads
<Symbol Role="Variable">huffile</Symbol>.huf if it exists. It also reads
<Symbol Role="Variable">textfile</Symbol> if it is
specified.
It writes to
<Symbol Role="Variable">huffile</Symbol>.huf and
<Symbol Role="Variable">huffile</Symbol>.c.
</para>
</refsect1>
<refsect1>
<title>EXAMPLES</title>
<para>Read <filename>ophuf.huf</filename> if it exists and initialize the
internal byte count table with its byte frequency counts. If
<filename>ophuf.huf</filename> does not exist, the internal byte counts
will be initialized to zeros. The encoding table in the original huf
file will be discarded. The text file <filename>foo.txt</filename> will
be read and its individual byte frequencies added to the internal byte
count table. Then, <filename>ophuf.huf</filename> will be written out,
with an encoding scheme based on the current byte counts, and with a
literal character encoding all bytes that have zero frequency. Finally,
if the decode module <filename>ophuf.c</filename> already exists, a
prompt requesting permission to overwrite it will be output to stdout
and, if an affirmative response is read from stdin, a new version
corresponding to the new <filename>ophuf.huf</filename> will be written
out.
</para>
<programlisting>
huffcode ophuf foo.txt
</programlisting>
<para>Read <filename>myappl.huf</filename> and initialize the internal byte
count table with its byte frequency counts. Since no
<Symbol Role="Variable">textfile</Symbol> argument is specified, the only possible
action is to build different coding tables using the existing frequency
counts in <filename>myappl.huf</filename>. The new tables will be based
on a literal character implementation where only bytes that occur more
than 200 times will be given an encoding; all other bytes will be
encoded with the literal character. After the new encoding tables are
generated, <filename>myappl.huf</filename> will be written out. The
decode module <filename>myappl.c</filename> will also be written out
without prompting, whether or not it already exists.
</para>
<programlisting>
huffcode -l200 -o myappl
</programlisting>
</refsect1>
<refsect1>
<title>SEE ALSO</title>
<para>&cdeman.dtsrcreate;,
&cdeman.dtsrdbrec;,
&cdeman.dtsrload;,
&cdeman.DtSrAPI;,
&cdeman.DtSearch;
</para>
</refsect1>
</refentry>