Massive Algorithms: Looking at a Plaintext Lucene Index

Looking at a Plaintext Lucene Index

Starting with Lucene 4 the format for these files can be configured using the Codec API. Several implementations are provided with the release, among those the SimpleTextCodec that can be used to write the files in plaintext for learning and debugging purposes.

To configure the Codec you just set it on the IndexWriterConfig:

StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
// recreate the index on each execution
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
config.setCodec(new SimpleTextCodec());

The rest of the indexing process stays exactly the same as it used to be:

Directory luceneDir = FSDirectory.open(plaintextDir);
try (IndexWriter writer = new IndexWriter(luceneDir, config)) {
    writer.addDocument(Arrays.asList(
            new TextField("title", "The title of my first document", Store.YES),
            new TextField("content", "The content of the first document", Store.NO)));

    writer.addDocument(Arrays.asList(
            new TextField("title", "The title of the second document", Store.YES),
            new TextField("content", "And this is the content", Store.NO)));
}

After running this code the index directory contains several files. Those are not the same type of files that are created using the default codec.

ls /tmp/lucene-plaintext/
_1_0.len  _1_1.len  _1.fld  _1.inf  _1.pst  _1.si  segments_2  segments.gen

The segments_x file is the starting point (x depends on the amount of times you have written to the index before and starts with 1). This still is a binary file but contains the information which codec is used to write to the index. It contains the name of each Codec that is used for writing a certain segment.

The rest of the index files are all plaintext. They do not contain the same information as their binary cousins. For example the .pst file represents the complete posting list, the structure you normally mean when talking about an inverted index:

The content that is marked as stored resides in the .fld file:

The SimpleTextCodec only is an interesting byproduct. The Codec API can be used for a lot useful things. For example the feature to read indices of older Lucene versions is implemented using seperate codecs. Also, you can mix several Codecs in an index so reindexing on version updates should not be necessary immediately.
Please read full article from Looking at a Plaintext Lucene Index

Looking at a Plaintext Lucene Index

Labels

Popular Posts