Writing Your Own Filtering DSL in Python

A few months ago, we saw how to write a filtering syntax tree in Python. The idea was to create a data structure, in the form of a dictionary, that would allow us to filter data based on conditions.

Our API looked like this:

>>> f = Filter(
...     {"and": [
...         {"eq": ("foo", 3)},
...         {"gt": ("bar", 4)},
...     ]}
... )
>>> f(foo=3, bar=5)
True
>>> f(foo=4, bar=5)
False
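
The Filter class itself comes from the earlier post and is not reproduced here. If you don't have it at hand, here is a minimal sketch of what such a class could look like. This is an assumption on my part, not the original implementation: the operator tables, the `_build` helper and the symbolic aliases ("=", ">", and so on) are hypothetical, the aliases being there so that a parser can emit symbols directly.

```python
import operator


class Filter:
    # Hypothetical operator tables (not taken from the earlier post):
    # word names as in the example above, plus symbolic aliases.
    binary_operators = {
        "eq": operator.eq, "=": operator.eq,
        "ne": operator.ne, "≠": operator.ne,
        "gt": operator.gt, ">": operator.gt,
        "ge": operator.ge, "≥": operator.ge,
        "lt": operator.lt, "<": operator.lt,
        "le": operator.le, "≤": operator.le,
    }
    multiple_operators = {"and": all, "or": any}

    def __init__(self, tree):
        # Compile the dictionary once into a callable taking a dict of values.
        self._eval = self._build(tree)

    def _build(self, tree):
        # Each node is a one-entry dict: {operator_name: arguments}.
        if len(tree) != 1:
            raise ValueError(f"expected exactly one operator, got {tree!r}")
        name, args = next(iter(tree.items()))
        if name in self.multiple_operators:
            combine = self.multiple_operators[name]
            children = [self._build(child) for child in args]
            return lambda values: combine(child(values) for child in children)
        if name in self.binary_operators:
            compare = self.binary_operators[name]
            key, expected = args
            return lambda values: compare(values[key], expected)
        raise ValueError(f"unknown operator {name!r}")

    def __call__(self, **values):
        return self._eval(values)


f = Filter({"and": [{"eq": ("foo", 3)}, {"gt": ("bar", 4)}]})
print(f(foo=3, bar=5))  # True
print(f(foo=4, bar=5))  # False
```

Compiling the tree into nested callables up front means malformed input fails when the filter is created rather than when it is first called.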

While such a mechanism is pretty powerful, the input data structure format is not very user-friendly. It's great to use with, for example, a JSON-based REST API, but it's pretty terrible to use from a command-line interface.

A good solution to that problem is to build our own language. That's called a DSL.

Building a DSL

What's a Domain-Specific Language (DSL)? It's a computer language that is specialized to a certain domain. In our case, our domain is filtering, as we're providing a Filter class that allows us to filter a set of values.

How do you build a data structure such as {"and": [{"eq": ("foo", 3)}, {"gt": ("bar", 4)}]} from a string? Well, you define a language, parse it, and then convert it to the right format.

In order to parse a language, there are a lot of different solutions, from implementing manual parsers to using regular expressions. In this case, we'll use lexical analysis.

First Iteration

Let's start small and define the base of our grammar. That should be something simple, so we'll go with <identifier><operator><value>. For example, "foobar"="baz" is a valid sentence in our grammar and will convert to {"=": ("foobar", "baz")}.
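
For comparison, here is roughly what the regular-expression route mentioned earlier could look like for this first grammar. It's only a sketch: the pattern `MATCH_RE` and the function `parse_with_regex` are hypothetical names of mine, and the pattern assumes quoted strings never contain an escaped quote.

```python
import re

# Hypothetical pattern for <identifier><operator><value>,
# where identifier and value are double-quoted strings.
MATCH_RE = re.compile(r'\s*"([^"]*)"\s*(=|≠|≥|≤|<|>)\s*"([^"]*)"\s*')


def parse_with_regex(s):
    # fullmatch() makes sure nothing is left over before or after the expression.
    m = MATCH_RE.fullmatch(s)
    if m is None:
        raise ValueError(f"cannot parse {s!r}")
    return m.groups()


print(parse_with_regex('"foobar"="baz"'))
# ('foobar', '=', 'baz')
```

This works for a single flat expression, but it gets unwieldy as soon as the language needs nesting or precedence, which is where a grammar library earns its keep.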

The following code snippet leverages pyparsing for parsing the string and specifying the grammar:

import pyparsing

# Identifiers and values are both double-quoted strings
identifier = pyparsing.QuotedString('"')
operator = (
    pyparsing.Literal("=") |
    pyparsing.Literal("≠") |
    pyparsing.Literal("≥") |
    pyparsing.Literal("≤") |
    pyparsing.Literal("<") |
    pyparsing.Literal(">")
)
value = pyparsing.QuotedString('"')

match_format = identifier + operator + value

print(match_format.parseString('"foobar"="123"'))

# Prints:
# ['foobar', '=', '123']

With that simple grammar, we can parse and get a token list composed of our 3 items: the identifier, the operator and the value.

Transforming the Data

The list above, in the format [identifier, operator, value], is not really what we need in the end. We need something like {operator: (identifier, value)}. We can leverage the pyparsing API to help us with that.

def list_to_dict(pos, tokens):
    # tokens is [identifier, operator, value]
    return {tokens[1]: (tokens[0], tokens[2])}

match_format = (identifier + operator + value).setParseAction(list_to_dict)

print(match_format.parseString('"foobar"="123"'))

# Prints:
# [{'=': ('foobar', '123')}]

The setParseAction method allows us to modify the value returned for a grammar token. In this case, we transform the list into the dict we need.

Plugging the Parser and the Filter

We'll reuse the Filter class we wrote in our previous post and simply add the following code to the example above:

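def parse_string(s):
    # parseAll=True rejects any trailing input the grammar does not cover
    return match_format.parseString(s, parseAll=True)[0]

f = Filter(parse_string('"foobar"="baz"'))
print(f(foobar="baz"))
print(f(foobar="biz"))

# Prints:
# True
# False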

Now, we have a pretty simple parser and a good way to build a Filter object from a string.

As our Filter object supports complex and nested operations, such as and and or, we could also add them to the grammar. I'll leave that to you, reader, as an exercise!
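
If you want a head start on that exercise, one possible direction is pyparsing's infixNotation helper, which builds operator-precedence grammars. The snippet below is a sketch under my own assumptions rather than the definitive answer: the names `condition`, `nest` and `expression` are hypothetical, and I chose to give `and` a higher precedence than `or`.

```python
import pyparsing

identifier = pyparsing.QuotedString('"')
operator = pyparsing.oneOf("= ≠ ≥ ≤ < >")
value = pyparsing.QuotedString('"')


def list_to_dict(tokens):
    # [identifier, operator, value] -> {operator: (identifier, value)}
    return {tokens[1]: (tokens[0], tokens[2])}


condition = (identifier + operator + value).setParseAction(list_to_dict)


def nest(tokens):
    # infixNotation passes a single group: [operand, "and", operand, ...]
    group = tokens[0]
    # Operands sit at even positions; the keyword sits at position 1.
    return {group[1]: list(group[0::2])}


# Operators listed first bind tighter, so "and" is evaluated before "or".
expression = pyparsing.infixNotation(
    condition,
    [
        (pyparsing.Keyword("and"), 2, pyparsing.opAssoc.LEFT, nest),
        (pyparsing.Keyword("or"), 2, pyparsing.opAssoc.LEFT, nest),
    ],
)

tree = expression.parseString('"foo"="3" and "bar">"4"', parseAll=True)[0]
print(tree)
# {'and': [{'=': ('foo', '3')}, {'>': ('bar', '4')}]}
```

Note that the values come out as strings, so a real implementation would probably want to convert numbers before comparing them.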

Building your own Grammar

pyparsing makes it easy to build one's own grammar. However, it should not be abused: building a DSL means that your users will have to discover and learn it. If it's very different from what they know and from what already exists, it might be cumbersome for them.

Finally, if you're curious and want to see real-world usage, Mergify's condition system leverages pyparsing to implement its parser. Check it out!