solrdump: A serialization tool for Lucene indexes

Apache Lucene is a widely used open source indexing engine. Nearly every company with a text search software stack has Lucene somewhere inside it, whether through direct Lucene API calls or through a wrapper such as Elasticsearch or Apache Solr. In a well-designed architecture, the search engine should not be the source of truth for the data. Unfortunately, reality is not always that tidy. Sometimes a legacy system keeps all of its data in the search engine, in a Lucene format that is hard to process any further. In Solr's terminology these are segment files: binary files that can only be accessed meaningfully through the Apache Lucene API.

That is the background against which solrdump was created. I wanted to export the data stuck in Solr to a format that is easier to process or archive, such as JSON. The proper way is to write a Java program that reads the binary fields through the Lucene API. However, writing it in Java is rather tedious. Clojure runs on the JVM and can call the same API, which makes it an interesting alternative, and that is how the decision was made.

The tool currently provides two output modes: you can either print the records as JSON to the console, or insert each record into a JSONB column in PostgreSQL 9.4.

Print the data to the console

lein run test/data/solr-segment

Insert the records into PostgreSQL

lein run --output=postgres test/data/solr-segment

To insert the records into the database, you have to specify the database connection information in the edn file.

{:db {:protocol "postgres"
      :hostname "localhost"
      :dbname   "test"
      :dbport   5432
      :username "mno2"
      :prepared_sql "insert into units (payload) values (?);"
      :password ""}}
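
The prepared statement in the configuration above expects a table named units with a payload column to exist before the tool runs. The schema is not given here, so the following is only a minimal sketch: the id column and the file name units.sql are assumptions made for illustration, and payload is declared as JSONB to match the output mode described earlier.

```shell
# Write a hypothetical schema for the target table to units.sql.
# Only the table name (units) and column name (payload) come from the
# prepared_sql setting; the id column is an assumption for illustration.
cat > units.sql <<'SQL'
CREATE TABLE IF NOT EXISTS units (
    id      serial PRIMARY KEY,
    payload jsonb
);
SQL
```

With the connection settings shown above, the file could then be applied with something like psql -h localhost -p 5432 -U mno2 -d test -f units.sql before running the tool with --output=postgres.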

More output modes could be supported in the future if there is clear demand.