A Case for a New Language

How often do we need to craft a new programming language? It is quite a common belief that the answer to that question should be “never” or at least “unless you absolutely have to, and even then think twice and decide against it”. However, if we slightly re-formulate the question to “shall I write a DSL for that particular problem?” the answer would be the roaring “YES”.

The reason for such polar opinions on essentially the same problem, I suppose, is that people often associate domain specific languages with some sort of a sophisticated library which can be used in the realm of an existing language. In other words, one does not write a new programming language, but leverages existing language features.

Why is this better? Well, we can use all the existing tools (i.e. bells and whistles) such as IDEs, have syntax highlighting in GitHub, type checkers for some lucky those who use statically typed languages (or pretend to be), debuggers and what not. I call those languages “embeddable”1 and their host languages are “embedding”.

On the other hand, we have a slew of standalone DSLs we use routinely without treating them as such - regular expressions, JSON, XML2 to name just a few. Among less well-known ones, I would point out mini-languages for command line parameters for find and tcpdump tools. Those are non-trivial languages, needfully complex for expressing all the underlying functionality. And that is reasonable, because the more complex something is, the more complex a configuration language for that thing becomes. A great (albeit somewhat infamous) example of that phenomenon is Sendmail which configuration language is accidentally Turing-complete.

My point is that a configuration language can be, and sometimes, should be treated as a programming language. Often times an existing language is enough - just slap in a YAML/TOML file (or an XML if you want to sell it as “enterprise-grade”) and move on. But what if it isn’t enough? Then you have no choice but to craft something of your own. And with that comes the whole stack of problems which are typically associated with “big” programming languages - frontends (i.e. lexical analysers and parsers), a typechecker (why not?), and backends - iterpreters, optimisers, compilers… We also should think about the language semantic, its coherency and consistency. But above all else, we want to make those languages laser-focused on the single problem they are born to solve. And make them perform well there, in fact, better than anything else on Earth.

It just so happened that I came across such a case.

The Problem

The project I am currently working on involves receiving and processing gRPC messages from some A-brand network devices. The messages contain ordered sets of key-value pairs. The meaning and naming conventions for the keys and values varies greatly and is inconsistent across different message types. The problem was how to describe various processing rules for the messages. As a starting point I decided to use a simple JSON file with some relevant information and hard-coded processing rules. And it was OK for a few ad-hoc rules, but later on our customers have been asking for more and more complex processing and the triggering point was the requirement of remote configurability. In other words, customers themselves wanted to supply those configurations.

That was when I first thought about a programming language. We can treat the ordered set of key-value pairs as a list of tuples. And the very first thing which comes to mind when we are talking about list processing is Lisp :)

When I was about to start sketching a Lisp parser/interpreter a colleague of mine made an interesting observation. Apparently, the key-value pairs were not as chaotic as I thought, but occurred as result of flattening of hierarchical data. Trees.

What languages do we have to work with trees? I don’t think I know many of those. Thus I looked at XPath and XSLT and also at GraphQL but neither was a quite the right fit. We needed something like SQL with CTE. And if fact the two could have worked, but expressions quickly became unwieldy even for relatively simple rules.

A Solution

I knew that SQL had the right semantic. We wanted to select a subset of key-value pairs and aggregate/processes the values in some way. And the natural way to select elements is by using paths. So the language should have had a way to succinctly express paths. Well, I conjectured that file system-like paths expressions were concise enough

/foo/goo/blah/x/y

So I made those into first-class syntactic constructs, like numbers or strings. In order to express variability I added “match all” syntax which matched any path segment

/foo/*/blah/*/y

and to store parts of the path for later use I used “var” syntax

/foo/$k/blah/$l/y

Additionally, the paths could have annotated segments

/foo/goo[instance='a2']/blah/x[a='S 512']/y

I added star- and var- syntax there as well.

/foo/goo[instance=*]/blah/x[a=$a_val]/y

Closures

Now that I had a way to express arbitrary paths (with fixed number of segments) I was looking for a syntactic construction to filter the resulting node sets. In SQL we have

SELECT a, b FROM c WHERE <condition>

I got SELECT and FROM with paths but WHERE clause was still missing. Moreover, that <condition> could be arbitrary complex, possibly depending on previous results. The solution was to “capture” environment and delay evaluation of an expression till when it’s needed. In other words, a closure (aka lambda-function):

{ $1 < 5 && $foo == 'abc' }

where, by convention, the caller binds $1, $2, and so on.

Gluing it all together

The combination of function calls and variable bindings was deemed flexible enough for the task.

The main workhorse of the DSL became select function

select(/foo/$x/$goo/z/y/w, { value($1) < 3 })

It accepts a path and a predicate which may use arithmetic, boolean, and regular expressions.

I also added group_by function which takes a collection of elements and splits it into collection of collections, based on key function, i.e a closure that returns certain value

# group elements by their length
group_by($xs, { len($1) })

Intermediate results may be bound to variables

# return a collection of even numbers between 1 and 10
let $even = {$1 % 2 == 0} in filter(range(1, 10), $even)

and as in many declarative and functional languages, expressions may occur almost anywhere.

Side effects

SQL, at least its query part, is pure, declarative, side-effect free language. My initial thought was to follow the suit. However, processing a single message may yield multiple results from application point of view. While it was possible to accumulate them all into one big structure I decided that the escape path that side-effects offer was too convenient to overlook.

let $xs = range(1, 10) in for_each($xs, { yield('value=', $1) })

and I added “expression chaining” to perform multiple side-effects

yield('hello');
yield('world');
let $x = 'hi ' in (yield('greetings: '); yield($x + $name))

Totality

When applied to programming languages, word “total” means that program written in such a language should provably terminate. I am not sure that it is possible to prove that the language is total (at least, I know that I don’t have enough knowledge to attempt it), but it is a crucial property the language should possess. We don’t want a message processing script to stuck in an infinite loop!

One way to achieve it is to prohibit conditional loops, recursion (some forms of it, at least) and infinite sequences. We still have iteration via for_each but it is finite, because ranges and node sets (which result from select function) are also finite. And, to prohibit recursion, I decided against apply operator which could be applied to a closure. Otherwise we could have infinite recursion, for instance

let $f = { apply($1, $1) } in apply($f, $f)

That’s why only built-in functions, that we have full control over, can evaluate closures. Did it make the language always terminating? I daresay “yes”. The script either terminates or crashes because of type mismatch.

Type system (lack of thereof)

As much as I’d love to have a static type system, this was not quite possible here. The reason was that the key-value pairs that come from the messages may have various data types, such as signed/unsigned integers of different length, strings, booleans, and blobs. While I could have provided “typed” versions of “value-extracting” function, e.g. str_val($node) or int_val($node) - they wouldn’t prevent from run-time errors when applied to a node of a different type. And if a type system can not guard against run-time errors, what’s the merit of having one? I still think that some limited form of static analysis is possible, but that’s an exercise for the future. So it ended up being a dynamically, albeit strongly, typed language.

Implementation

On one hand, I tried to keep the language small but flexible, goal-oriented, containing familiar syntactic constructs and just easy to use. On the other hand, we had only two weeks to build a working prototype (yeah, I have a tyrant of a boss one the most awesome bosses around), so I had to trade some of the conveniences for a simpler grammar and easy of implementation.

The implementation language was Elixir because it was the go-to language for anything that runs 247 in our team. While Elixir can leverage Erlang’s parser generator - yecc - which accepts LALR(1) grammars, I had to prototype the grammar in OCaml (Menhir, to be precise) because it provided so much nicer facilities to debug grammars and inspect the automaton. In the end I settled down with just one shift/reduce conflict left which I solved by precedence declarations.

Writing an interpreter was relatively straightforward task (see e.g. this post). It was just walk-the-tree interpreter, and, unless performance is proved to be an issue, I am not going to look at byte-compilation, VMs, JIT, … Yeah, definitely not going to… maybe only if I have free time in one of those long weekends ;)

Conclusion

When I started the project I had a true-to-life business justification. Yet I hesitated a lot and I spent maybe two weeks in total trying to express the rules in JSON/XPath/YAML/Whatelse because… well, because I also share the common opinion. Was it successful? I believe so. Even if we only ever use it internally to pre-configure our software, I think it’s a great time-saver. I just hope that this, somewhat emotional, post would help me or someone else to make the right choice faster :)

Oh, and I have a sure-kill argument for the managers who might oppose your well-deserved fun time spent on designing a programming language every now and then. We all know that microservices are the cutting edge of software architecture design principles. The best practices nowadays are unthinkable without an agile process of mass-producing creating loosely coupled3 services that do one thing and do it well. Then why are we so opposed to exactly the same idea at “compile-time”? Moreover, as using Tomcat and servlets is not as trendy anymore, why should we blindly welcome the idea of embeddable DSLs and frown upon standalone mini-languages? I suggest to adopt the idea of “To the cloud!” slogan some run-time-people use and make it “Let them be Dragon(slayer)s!” or “Finite Automata to the masses!”.


  1. I also heard term “parasite languages” but I think it has an unnecessary negative connotation. [return]
  2. It was quite a big thing, almost a silver bullet-like, in early 2000s. [return]
  3. Unfortunately more often than not they are “loosely coupled” only because they talk over loosely defined JSON messages. [return]