huffcode user cmd
NAME
huffcode - create optimized DtSearch compression/decompression tables
SYNOPSIS
huffcode [−llit_thresh | −l−] [−o] huffile [textfile]
DESCRIPTION
huffcode creates optimized DtSearch
compression/decompression tables.
Documents stored in a DtSearch database text repository can first be
compressed using a Huffman text compression algorithm. The algorithm
provides optimal compression only with preanalysis of the statistical
distribution of bytes in the database corpus.
huffcode analyzes a text corpus and generates
DtSearch compression and decompression tables. It is provided as a
convenience utility for database developers who want to optimize offline
storage requirements. Compression is not used in databases created
without the ability to store text in a DtSearch repository.
huffcode reads a text file as input and writes out
ophuf.huf (compression or "encode" table) and
ophuf.c (decompression or "decode" table).
ophuf.huf is an external ASCII file that also
retains the statistical information on how it was generated.
huffcode can be executed repeatedly against different
text samples, continually accumulating results. In the case of a small
or static text corpus, the entire corpus can be fed into
huffcode for optimal Huffman compression. In large or
dynamic databases the typical practice is to feed
huffcode representative text samples.
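The accumulation across runs can be pictured as a running 256-entry byte count table. The following is a minimal Python sketch of that idea, not the huffcode implementation; the function name accumulate_counts and the two sample strings are assumptions made for illustration.

```python
from collections import Counter


def accumulate_counts(counts, sample):
    """Add the byte frequencies of one text sample to a running 256-entry table."""
    tally = Counter(sample)              # iterating over bytes yields ints 0..255
    for byte_value, n in tally.items():
        counts[byte_value] += n
    return counts


counts = [0] * 256                       # a fresh table starts at all zeroes
for sample in (b"the quick brown fox", b"the lazy dog"):
    accumulate_counts(counts, sample)    # each sample adds to the same table

print(counts[ord("t")], counts[ord("e")], counts[ord(" ")])
```

Because the table only ever grows, feeding additional samples refines the byte statistics rather than replacing them.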
The Huffman code tables are created once for each API instance (not once
per database) before any documents are loaded. The only program to read
the encode table, an external file, is dtsrload. The
ophuf.huf file generated by
huffcode should be used instead of the provided
default file prior to the first run of dtsrload for
any databases to be accessed by a particular API instance. The decode
table, a C module, should be compiled and linked into the application
code ahead of the API library to override the default decode module in
the library. Huf files and decode modules are not user editable.
HCTREE_ID
It is imperative that the encode and decode tables reflect identical
byte statistics to prevent decode errors. The first line of
ophuf.huf includes a long integer value named
HCTREE_ID. Each execution of
huffcode generates a new, unique
hctree_id integer. dtsrload loads
this integer into the database configuration and status record when it
loads the first document into a new database. Thereafter, each execution
of dtsrload for that database confirms that the same
hctree_id is used for each document compression. dtsrload
aborts if the hctree_id in ophuf.huf does not match the value
recorded for the database by previous executions.
hctree_id is also stored as a variable in the decode
module ophuf.c. DtSearchInit
will not open any database listed in the ocf file whose
hctree_id, as stored in its configuration and status
record, does not match the value in the decode module. The
dtsrdbrec utility will print the
hctree_id value for any database.
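The two consistency checks described above can be modeled in a few lines. This is an illustrative Python sketch only: the dictionary standing in for the configuration and status record, the function names check_before_load and may_open, and the id values 1001 and 2002 are all hypothetical and do not reflect the actual dtsrload or DtSearchInit code.

```python
class HctreeMismatch(Exception):
    """Raised when a table's hctree_id disagrees with the database record."""


def check_before_load(db_record, huf_hctree_id):
    """Model the load-time check: record the id on first load, compare afterwards."""
    stored = db_record.get("hctree_id")
    if stored is None:
        db_record["hctree_id"] = huf_hctree_id   # first document in a new database
    elif stored != huf_hctree_id:
        raise HctreeMismatch(
            f"database has {stored}, encode table has {huf_hctree_id}"
        )


def may_open(db_record, decode_hctree_id):
    """Model the open-time check against the id stored in the decode module."""
    return db_record.get("hctree_id") == decode_hctree_id


db = {}                              # hypothetical stand-in for the status record
check_before_load(db, 1001)          # first load stores the id
check_before_load(db, 1001)          # same encode table again: accepted
try:
    check_before_load(db, 2002)      # regenerated encode table: rejected
    rejected = False
except HctreeMismatch:
    rejected = True
```

The point of the model is that regenerating the tables after documents have been loaded makes the existing database unusable with the new tables.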
OPTIONS
The following options are available:
If an option takes a value, the value must be directly appended to
the option name without white space.
−llit_thresh
Sets the literal character's minimum threshold to the integer specified
by lit_thresh.
This Huffman algorithm implements a pseudo-character called the literal
character. It represents all characters whose frequency is so low that
no Huffman translation will be attempted. This reduces the maximum
length of the coded bit string when there are many zero- or
low-frequency bytes in the text corpus. For example, pure ASCII text
files only occasionally have byte values less than 32 (control
characters) and rarely greater than 127 (high order bit turned on). The
lit_thresh value specifies the literal
character's threshold count. After counting is completed, any character
in the encode table occurring with frequency less than or equal to
lit_thresh will be coded with the
literal character.
If this option and the −l− option are
omitted, the default is −l0, meaning that
literal coding is provided only for bytes that never occur (counts of
zero).
−l−
Disables literal character encoding. Disabling literal character
encoding in corpora with unbalanced byte frequency distributions will lead
to extremely long bit string codes. Most natural language text corpora
are represented by highly unbalanced frequency distributions, so this
option is not recommended for most DtSearch applications.
If this option and the −llit_thresh option are omitted, the default is
−l0, meaning that literal coding is provided
only for bytes that never occur (counts of zero).
−o
Suppresses the overwrite prompt. It preauthorizes erasure and
reinitialization of the decode module.
textfile
Specifies an optional input file of text that is representative of the
entire text corpus of the databases. It should contain bytes in the same
relative abundances as occur in documents in the entire corpus. Since
huffcode can be executed repeatedly with different
document textfiles, it is possible to
analyze the entire actual corpus if it is small enough or static.
If textfile is not specified, the byte
frequencies in the currently loaded tables are not changed, and the
Huffman codes are recomputed with the existing frequencies. This is
useful for examining the relative merits of using different literal
character thresholds.
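The effect of the literal character threshold on code length can be explored with a small experiment. The Python sketch below builds textbook Huffman code lengths from a byte count table. It is not the huffcode algorithm itself: the symbol id LITERAL, the function code_lengths, and in particular the weight given to the literal character (the sum of the folded counts) are assumptions made for illustration.

```python
import heapq
from itertools import count

LITERAL = 256  # hypothetical symbol id for the literal pseudo-character


def code_lengths(counts, lit_thresh=0):
    """Return {symbol: code length in bits} for a 256-entry byte count table.

    lit_thresh=None models -l- (no literal character). Otherwise every byte
    whose count is <= lit_thresh is folded into the single LITERAL symbol.
    """
    if lit_thresh is None:
        weights = dict(enumerate(counts))
    else:
        weights = {b: c for b, c in enumerate(counts) if c > lit_thresh}
        # Assumption: the literal symbol weighs as much as all folded bytes.
        weights[LITERAL] = sum(c for c in counts if c <= lit_thresh)
    tiebreak = count()               # keeps heap entries comparable on ties
    heap = [(w, next(tiebreak), [s]) for s, w in weights.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(weights, 0)
    while len(heap) > 1:             # a lone symbol keeps length 0
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:            # every symbol under the merge gains a bit
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, next(tiebreak), s1 + s2))
    return lengths


sample = b"aaaaaaaabbbbccd"
counts = [0] * 256
for b in sample:
    counts[b] += 1

for label, thresh in (("-l-", None), ("-l0", 0), ("-l1", 1)):
    print(label, max(code_lengths(counts, thresh).values()))
```

With the literal character disabled, all 256 byte values need their own code, so the longest code cannot be shorter than 8 bits, and it grows further as the distribution becomes more skewed. Folding rare bytes into one symbol keeps the tree small.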
OPERANDS
The required input file name (huffile)
is the base file name of the encode table, excluding the
.huf extension. dtsrload expects
huffile to be
ophuf. Similarly, the decode module will be named
huffile.c.
At the beginning of each new execution, huffcode
tries to open the encode table file and continue byte frequency counting
from the last run. If the huf file represented by
huffile does not exist, the table's counts are
initialized to zeroes. The decode module is recomputed fresh each run,
whether it existed before or not.
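The start-of-run behavior can be sketched as "load the saved counts if the file exists, otherwise start from zeroes". In the Python sketch below a JSON file stands in for the encode table, because the actual layout of a .huf file is not described here; the file name demo_counts.json and the functions load_counts and run_once are likewise hypothetical.

```python
import json
import os
import tempfile


def load_counts(path):
    """Return the saved 256-entry count table, or all zeroes if path is missing."""
    if os.path.exists(path):
        with open(path, "r", encoding="ascii") as fh:
            return json.load(fh)["counts"]
    return [0] * 256


def run_once(path, sample):
    """Model one execution: resume counting, add the sample, save the table."""
    counts = load_counts(path)
    for b in sample:
        counts[b] += 1
    with open(path, "w", encoding="ascii") as fh:
        json.dump({"counts": counts}, fh)    # JSON is a stand-in format only
    return counts


with tempfile.TemporaryDirectory() as tmp:
    table = os.path.join(tmp, "demo_counts.json")
    first = run_once(table, b"aab")          # no file yet: starts from zeroes
    second = run_once(table, b"abc")         # file exists: counting continues

print(first[ord("a")], second[ord("a")])
```

Only the counts carry over between runs in this model; anything derived from them would be rebuilt each time, in the same way that the decode module is recomputed on every run.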
ENVIRONMENT VARIABLES
None.
RESOURCES
None.
ACTIONS/MESSAGES
None.
RETURN VALUES
The return values are as follows:
0
huffcode completed successfully.
nonzero
huffcode encountered an error.
FILES
huffcode reads
huffile.huf if it exists. It also reads
textfile if it is specified.
It writes to
huffile.huf and
huffile.c.
EXAMPLES
Read ophuf.huf if it exists and initialize the
internal byte count table with its byte frequency counts. If
ophuf.huf does not exist, the internal byte counts
will be initialized to zeroes. The encoding table in the original huf
file will be discarded. The text file foo.txt will
be read and its individual byte frequencies added to the internal byte
count table. Then, ophuf.huf will be written out,
with an encoding scheme based on the current byte counts, and with a
literal character encoding all bytes that have zero frequency. Finally,
if the decode module ophuf.c already exists, a
prompt requesting permission to overwrite it will be output to stdout
and, if an affirmative response is read from stdin, a new version
corresponding to the new ophuf.huf will be written
out.
huffcode ophuf foo.txt
Read myappl.huf and initialize the internal byte
count table with its byte frequency counts. Since no
textfile argument is specified, the only possible
action is to build different coding tables using existing frequency
counts in myappl.huf. The new tables will be based
on a literal character implementation where only bytes that occur more
than 200 times will be given an encoding; all other bytes will be
encoded with the literal character. After the new encoding tables are
generated, myappl.huf will be written out. The
decode module myappl.c will also be written out
without prompting, whether or not it already exists.
huffcode -l200 -o myappl
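For scripted runs, the two command lines above could be assembled programmatically. The helper below is a hypothetical Python convenience, not part of DtSearch; it uses only the options documented on this page and appends the threshold value directly to −l without white space. Treating −l− and −llit_thresh as mutually exclusive is an assumption of the helper.

```python
import shlex


def huffcode_argv(huffile, textfile=None, lit_thresh=None,
                  no_literal=False, overwrite=False):
    """Build a huffcode argument list from the documented options."""
    if no_literal and lit_thresh is not None:
        # Assumption: the two -l forms are alternatives, not combinable.
        raise ValueError("-l- and -l<lit_thresh> are alternatives")
    argv = ["huffcode"]
    if no_literal:
        argv.append("-l-")
    elif lit_thresh is not None:
        argv.append(f"-l{int(lit_thresh)}")  # value appended without white space
    if overwrite:
        argv.append("-o")
    argv.append(huffile)                     # base name, without the .huf extension
    if textfile is not None:
        argv.append(textfile)
    return argv


print(shlex.join(huffcode_argv("ophuf", "foo.txt")))
print(shlex.join(huffcode_argv("myappl", lit_thresh=200, overwrite=True)))
```

The resulting list can be passed to subprocess.run on a system where huffcode is installed; a nonzero return code indicates an error, as described under RETURN VALUES.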
SEE ALSO
dtsrcreate,
dtsrdbrec,
dtsrload,
DtSrAPI,
DtSearch