elasticsearch_dsl: Domain Specific Language in Python for Elasticsearch
An emerging trend in recent years for software such as MongoDB, Elasticsearch, and Chef is to expose a JSON interface for accepting complex requests. These systems give up the traditional SQL query and adopt JSON as the text encoding of an abstract syntax tree. Whenever you compose a request to one of these services, you are effectively hand-coding an abstract syntax tree in JSON. While that is flexible and easy to extend, it is also error prone and hard to maintain. A common solution is to write a Domain Specific Language, and given Python's language design, a natural approach is to use classes to denote AST nodes and the Visitor pattern to generate the underlying JSON. That is why I created elasticsearch_dsl.
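To make that concrete, here is a minimal sketch of the pattern with made-up node classes; it is not the library's actual implementation, just the shape of the idea:

# A minimal sketch of the idea, not elasticsearch_dsl's actual code.
# Each query type is a class (an AST node); a visitor walks the tree
# and emits the corresponding JSON fragment as plain Python data.

class TermQuery(object):
    def __init__(self, field, value):
        self.field = field
        self.value = value

class BoolQuery(object):
    def __init__(self, should=None):
        self.should = should or []

class JsonVisitor(object):
    def visit(self, node):
        # Dispatch on the node's class name: TermQuery -> visit_TermQuery.
        return getattr(self, "visit_" + type(node).__name__)(node)

    def visit_TermQuery(self, node):
        return {"term": {node.field: node.value}}

    def visit_BoolQuery(self, node):
        return {"bool": {"should": [self.visit(q) for q in node.should]}}

# JsonVisitor().visit(BoolQuery(should=[TermQuery("tags.name", "python")]))
# => {"bool": {"should": [{"term": {"tags.name": "python"}}]}}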
Imagine that you could write a query this way:
# Build a nested terms aggregation over the "tags" field.
nested_agg = ast.NestedAggregation(
    "name", "tags",
    ast.TermsAggregation("tags.name", size=20, order_type="_count",
                         order="desc", min_doc_count=100))
aggregation = ast.TopLevelAggregation("tag", nested_agg)
ast_root = ast.TopLevelQuery(ast.MatchAllQuery(), aggs=[aggregation])

# Walk the AST and generate the query as a plain Python dict.
codegen = CodeGeneratorVisitor()
codegen.visit(ast_root)
query_py_obj = codegen.query
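For reference, query_py_obj ends up as a plain dict mirroring Elasticsearch's aggregation syntax, roughly like the following (assuming the first argument to NestedAggregation becomes the inner aggregation's name):

{
    "query": {"match_all": {}},
    "aggs": {
        "tag": {
            "nested": {"path": "tags"},
            "aggs": {
                "name": {
                    "terms": {
                        "field": "tags.name",
                        "size": 20,
                        "order": {"_count": "desc"},
                        "min_doc_count": 100,
                    }
                }
            },
        }
    },
}

A second, more involved example builds a geo-filtered nested query: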
# Filter results to within 10 distance units of the user's location.
and_clauses = []
and_clauses.append(ast.GeoDistanceFilter("geocode",
                                         user.geocode["latitude"],
                                         user.geocode["longitude"], 10))

# Match documents whose nested tags contain any of the given tag names.
should_clauses = []
for t, ind in tags:
    should_clauses.append(ast.TermQuery("tags.name", t.name.lower()))
nested_queries = ast.BoolQuery(should=should_clauses)
f = ast.NestedQuery("tags", nested_queries)
query = ast.FilteredQuery(f, ast.AndFilter(and_clauses))

query_size = 20
query_from = 0
ast_root = ast.TopLevelQuery(query, query_size, query_from,
                             sort={"_score": {"order": "desc"}})
codegen = CodeGeneratorVisitor()
codegen.visit(ast_root)
query_py_obj = codegen.query
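Run through the visitor, this compiles down to Elasticsearch's (1.x) filtered-query syntax, roughly as follows; the tag names and coordinates are illustrative, and how GeoDistanceFilter encodes the distance unit is up to the library:

{
    "query": {
        "filtered": {
            "query": {
                "nested": {
                    "path": "tags",
                    "query": {"bool": {"should": [
                        {"term": {"tags.name": "python"}},
                        {"term": {"tags.name": "clojure"}},
                    ]}},
                }
            },
            "filter": {"and": [
                {"geo_distance": {
                    "distance": 10,
                    "geocode": {"lat": 25.03, "lon": 121.56},
                }},
            ]},
        }
    },
    "size": 20,
    "from": 0,
    "sort": {"_score": {"order": "desc"}},
}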
But not this way:
query = {}
query["tags"] = []
for t, ind in tags:
    query["tags"].append({"tags.name": t.name.lower()})
query["size"] = query_size
query["from"] = query_from
query["sort"] = {"_score": {"order": "desc"}}
which is error prone and obscures the intention behind each query. There is already an elasticsearch-dsl package on PyPI, but one thing I don't like about it is the big runtime that comes with it: if anything goes wrong, you have to read almost all of its implementation to figure out what the error is and how to fix it. This version of elasticsearch_dsl is just a thin wrapper that compiles the classes into JSON. You could even do it from the command line.
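Since the compiled query is a plain dict, turning it into a request body is one standard-library call away; for example (the index name in the comment is just a placeholder):

import json

# query_py_obj comes from the examples above.  The JSON string can be
# written to a file and sent with curl, no Elasticsearch client needed:
#   curl -XPOST 'http://localhost:9200/myindex/_search' -d @query.json
print(json.dumps(query_py_obj, indent=2))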
solrdump: A serialization tool for Lucene indexes
Apache Lucene is one of the most commonly seen open-source index engines on the market. Basically every company with a text-search software stack has Lucene inside it somewhere, whether through direct Lucene API calls or through wrappers like Elasticsearch and Apache Solr. In a well-designed architecture and in common use cases, the search engine shouldn't be used as the source of truth for the data. Unfortunately, reality isn't always that bright: sometimes a legacy system has all of its data stored in the search engine, in a Lucene format that is hard to process any further. In Solr's terminology these are segment files; they are in a binary format, and any meaningful access must go through the Apache Lucene API.
That's the background against which solrdump was created. I wanted to export the data stuck in Solr to a format that is easier to process or archive, say JSON. The proper and correct way is to program in Java and access the binary fields with the Lucene API. However, writing it in Java is quite boring. Clojure is an interesting alternative, and that's how the decision was made.
The tool currently provides two output modes: you can either print the JSON to the console, or insert the records into a JSONB field in PostgreSQL 9.4.
- print the data to the console
lein run test/data/solr-segment
- insert the records into Postgres
lein run --output=postgres test/data/solr-segment
In order to insert records into the database, you have to specify the connection information in the EDN config file:
{:db {:protocol "postgres"
      :hostname "localhost"
      :dbname "test"
      :dbport 5432
      :username "mno2"
      :prepared_sql "insert into units (payload) values (?);"
      :password ""}}
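Note that solrdump only executes the prepared statement, so the target table has to exist beforehand; for the statement above, a minimal matching schema would be something like create table units (payload jsonb);.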
In the future it could support more output modes if the demand is clear.