CLuceneIndexingTool

This is a tool for reading documents that are stored in a markup language and packed in gzip
archives. English GigaWord, for example, is stored this way.

Processing involves three steps. The first is decompression using zlib, which runs at about
200 MB/sec on an ordinary desktop computer. The following stage is the SGML parser, which decodes
the decompressed data and returns documents and their annotations. The parser is somewhat slower,
around 20 MB/sec, but that is where multiple threads come in: the program is written to
utilize all the logical processors in the system. The last step is storage of the documents,
which may go either to a RAW file or to a CLucene index.
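
The parsing stage described above can be sketched roughly as follows. This is only an
illustration, not the tool's actual code: the Document struct and the functions splitDocuments,
parseDocument and parseAll are hypothetical names, and the input is assumed to consist of
<DOC id="..."> ... </DOC> blocks with no nesting. Decompression and storage are left out, so
the sketch starts from text that has already been decompressed.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical record type: one parsed document.
struct Document {
    std::string id;    // value of the id="..." attribute, empty if absent
    std::string body;  // everything between the opening tag and </DOC>
};

// Split decompressed text into raw <DOC ...>...</DOC> blocks.
// Naive on purpose: assumes no nesting and no other tag starting with "<DOC".
std::vector<std::string> splitDocuments(const std::string& text) {
    std::vector<std::string> blocks;
    const std::string open = "<DOC";
    const std::string close = "</DOC>";
    std::size_t pos = 0;
    while ((pos = text.find(open, pos)) != std::string::npos) {
        std::size_t end = text.find(close, pos);
        if (end == std::string::npos) break;  // truncated trailing block
        end += close.size();
        blocks.push_back(text.substr(pos, end - pos));
        pos = end;
    }
    return blocks;
}

// Parse one block: pull the id attribute out of the opening tag, keep the rest as body.
Document parseDocument(const std::string& block) {
    Document doc;
    const std::size_t tagEnd = block.find('>');
    assert(tagEnd != std::string::npos);  // a block always ends with "</DOC>"
    const std::string tag = block.substr(0, tagEnd);
    const std::string key = "id=\"";
    const std::size_t idPos = tag.find(key);
    if (idPos != std::string::npos) {
        const std::size_t start = idPos + key.size();
        const std::size_t stop = tag.find('"', start);
        if (stop != std::string::npos) doc.id = tag.substr(start, stop - start);
    }
    const std::size_t bodyEnd = block.rfind("</DOC>");
    doc.body = block.substr(tagEnd + 1, bodyEnd - (tagEnd + 1));
    return doc;
}

// Parse all blocks using one worker thread per logical processor.
// Workers claim block indices from a shared atomic counter and write
// to distinct slots of the result vector, so no locking is needed.
std::vector<Document> parseAll(const std::vector<std::string>& blocks) {
    std::vector<Document> docs(blocks.size());
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;  // the count may be unknown on some platforms
    std::atomic<std::size_t> next{0};
    auto worker = [&] {
        for (std::size_t i = next.fetch_add(1); i < blocks.size(); i = next.fetch_add(1))
            docs[i] = parseDocument(blocks[i]);
    };
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    return docs;
}
```

Calling parseAll(splitDocuments(text)) returns the documents in input order. Splitting is kept
on a single thread because it is a cheap scan, while the slower per-document parsing is what
gets spread over the workers.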

The program is written in portable C++.

Because the program uses multiple threads, it is possible to use the Hoard library to optimize
memory allocation, gaining some extra speed. To use Hoard, see here: http://www.hoard.org/

synopsis:

CLuceneIndexingTool --in-path <input file> -o <output-index> [--verbose] [--no-verbose]
