Writing a search query transpiler

Published on August 23, 2024 under the Coding category.

Google, among other search engines, supports advanced operators for writing refined search queries. For example, you can use the site: search operator to constrain your search to a website, like:

site:reddit.com aeropress coffee recipe

In my How to build a query language in Python blog post, I walked through how to create the foundations of a query language. The blog post explains how to implement AND and OR operators. The engine could then be used to implement query refinements relevant to a particular use case (i.e. site: for site search).

In that blog post, the query language is coupled with the query execution engine. Processing a query follows these steps:

A user provides a query.
The query is parsed into an Abstract Syntax Tree (AST) with a grammar.
The AST is evaluated from the bottom up, querying the search index to find terms that match each word, then applying any operators as the parser encounters them in the AST.

With that said, the query language can be separate from the query execution engine. One use for this is to implement a search query transpiler. Transpilers covert text from one language into another.

There are three high-level steps to build a transpiler:

Decide on the what query types should be supported (i.e. site: or title: or AND)
Turn the queries you want to implement into a grammar that can be parsed.
Implement an Abstract Syntax Tree evaluator that reads a tree made from the grammar and assembles a valid query in the new grammar.

In this post, I am going to explain how I made an advanced string search feature that transpiles into my query language.

Building a transpiler for JameSQL

I have been working on implementing a NoSQL database in Python. The database accepts queries in a JSON structure, a common input format for NoSQL databases. Here is an example of a query:

{
  "query": {
      "and": [
          {"title": {"contains": "aeropress"}},
          {"title": {"contains": "coffee"}},
      ],
    	"not": [
        {"title": {"contains": "recipe"}}
      ]
  },
  "limit": 2,
  "sort_by": "title",
}

This search query looks for documents whose titles contain the words aeropress and coffee, and whose titles do not contain the word recipe.

This internal representation is good for developers, but not useful for an end user. It would be ideal if an end user could type in a query like this:

aeropress coffee -recipe

This query could then be transpiled into a JameSQL query that follows the JSON structure above.

To implement the transpiler, I started by making a list of what operators I wanted to support. Indeed, query language development should begin with the outline of a syntax. I decided that I wanted to support:

coffee recipe: Search for documents that contain “coffee” or “recipe”.
'coffee recipe': Search for documents that contain the exact string “coffee recipe”.
-coffee recipe: Search for documents that contain “recipe” but not “coffee”.
title:'coffee' recipe: Search for documents whose title contains “coffee” and whose title or body text contains “recipe”.

With this blueprint in mind, I implemented the following grammar:

start: query

query: (negate_query | strict_search_query | word_query | field_query)+

strict_search_query: "'" MULTI_WORD "'"
word_query: WORD
field_query: TERM ":" "'" MULTI_WORD "'" | TERM ":" WORD
negate_query: "-" "'" MULTI_WORD "'" | "-" WORD
WORD: /[a-zA-Z0-9_*]+/
MULTI_WORD: /[a-zA-Z0-9 ]+/
TERM: /[a-zA-Z0-9_]+/

%import common.WS
%ignore WS

This grammar encodes the list of operators mentioned above.

With this grammar in place, the next step was to write a tree parser that understands this grammar. This tree parser needs to evaluate each term and turn it into a final, JameSQL-compatible query. To do this, I need to encode exactly what every part of a query looks like in JameSQL so that the tree parser can return a final, assembled query.

Consider the query:

-recipe aeropress coffee

The transpiler works by assembling every piece of the query into its own components.

Here is the result of the transpiler when run on the query above:

{
	'and': [
		{'not': {'or': [{'title': {'contains': 'recipe'}}]}},
		{'or': [{'title': {'contains': 'aeropress'}}]},
		{'or': [{'title': {'contains': 'coffee'}}]}
 	]
 }

The base unit of any query is:

{'title': {'contains': 'WORD'}}

In the above query, three base statements are written:

Find documents whose title does not contain recipe.
Find documents whose title contains aeropress, and;
Find statements whose title contains coffee.

This says “find all documents whose title contains WORD”. The transpiler then encloses these statements in the operators required to make a valid query according to what the user has typed. For example, new recipes is turned into an and statement that looks for documents that contain both the word coffee and recipe (although the words to not have to appear next to each other).

The - flag adds a {"not": } statement around a query to ensure that any result from the enclosed query is excluded from the final search results.

All statements are then wrapped in an and statement, which means that all conditions stated must be met for a result to be returned.

The transpiler is doing all of the heavy lifting to assemble a query.

Of note, some of the JSON in the above query is technically redundant. For example, new and recipes could be in one or block instead of two. This is a limitation with my current implementation. It doesn’t look to simplify queries: it only looks to assemble them correctly. Thus, there are improvements I could make to ensure queries were cleaner.

Let’s talk through another example.

Consider the following query, which searches for all documents whose category contains the exact string coffee recipe:

category:'coffee recipes'

This query is transpiled as:

{'and': [{'category': {'contains': 'coffee recipes', strict: True}}]

“coffee recipes” is turned into a contains query that searches the category field. Unlike in the last example, “coffee recipes” is a single word because it is enclosed in quotation marks.

An “and” statement is wrapped around the contents because the JameSQL execution engine expects one for a query that may have more than one condition. If we added another condition to our search engine, it would be added to the list.

The strict key is set to True, which means that both words must appear next to each other in the document.

Conclusion

With the transpiler above, users can write queries in natural language and add syntax that has special meaning. This syntax can then be evaluated and turned into a complete query that is then sent to the JameSQL execution engine. This is a transpiler because we are turning one form of text (the plain text user query) into another representation (a JameSQL query) for use in running a search.

As part of this project, I wrote several tests. The tests check that:

A plain text query (i.e. -coffee recipes) is turned into its proper corresponding JameSQL representation.
The query executes without an exception.
The query returns the number of results expected.
The top result is equal to what is expected, according to the provided document scoring mechanisms.

From a UX perspective, a valid query could be assembled by adding custom buttons, range filters, and select dropdowns. This is common in search systems. As a user, such visual features can be helpful. Custom buttons, range filters, and select dropdown menus allow a user to perform a search without having to know custom text. These features can also offer a preview of valid options, allowing a user to refine their search with less knowledge of a system.

With that said, the ability to run advanced queries as part of text is useful for power users who have knowledge of the structure of data in a search system.

I will likely roll out this feature on my blog search engine to allow for more complex queries in the coming days.