Parsing tasks are extremely common and even though major efforts have been done to standardize our inter-systems communications we still end up analyzing various text files which have structure – but not a structure that we have a library at hand that can parse. As an example, say you have something that outputs this sort of text (I assume this isn’t a real format, it’s not supposed to be).
fruits { banana { [tasty="true"] [colour="yellow"] } orange { } [eatable="if not rotten"] }
How would you go about it, if you want to store the output result in some form of object? I often see splitting on spaces, and other characters, making convoluted paths and abusing regular expressions to no end to wrangle these sorts of things into whatever data structure you have on your client. What if I said there is a better way. It’s not new by a long short, but it rarely spreads outside of advanced comp-sci classes or compiler designer classes to the more pragmatic programmers working in the real world.
Grammars
There’s generic tools that can parse these sorts of texts. They’re incredibly efficient at doing this as well. They simply require the right input. That input is a grammar. A grammar is really the set of rules that a text must follow in order to be parseable. For instance, in the example above we could assume that it would be illegal to not match the brackets properly, or to have an unnamed node. The way of expressing grammars is known as the Backus-Naur form, or BNF for short. It’s an easy syntax.
Resulting symbol := List of input symbols | Another list of symbols | [maybe an empty section as well];
A symbol is one of two things. It’s either a terminal symbol or a nonterminal symbol. Terminal symbols are those that are directly translatable from the input, nonterminals are built up from lists of other symbols – both terminal and nonterminal ones. Lets start to make a grammar for our input.
To make it easy on ourselves, we’ll simplify the input text and add features in as we go. Let’s start with identifying only this subset
basenode { }
There are three distinct terminal tokens here: name
, {
and }
. Note that the terminal name
will represent for us all the legal nodenames. It’s typical that a terminal token is defined by a regular expression. Let’s assume that name can be only lowercase alphabetic characters. The only legal combination of these terminals are is name followed by { followed by }. This must yield a non terminal token. Let’s call it object. The grammar will look like this:
terminal name [a-z]+ object := name '{' '}' ;
A few things about this grammar, for shortness it allows single character terminal to be used as is. Standard regex notation defines the terminal to be lowercase a-z, repeated at least once.
Lets add the [attribute=”value”] notation next. A sample of legal input is now:
rickastley { [nevergonnaone="give you up"] [nevergonnatwo="let you down"] [nevergonnathree="run around"] [nevergonnafour="and desert you"] }
Now it gets more interesting, since there can be more than one attribute, and maybe zero. This is where the cool recursive nature of grammars come into play. Looking at the notation of attributes, we can see that we will require four more terminals: [
, ]
, =
and a quoted-string
. We will also require three more non-terminals: attribute
, attribute-list
and optional-attribute-list
. I’ll give you the grammar, and we can discuss why it looks like it does.
terminal name [a-z]+ terminal quoted-string "[^"]+" object := name '{' optional-attribute-list '}' ; optional-attribute-list := attribute-list | ; attribute-list := attribute-list attribute | attribute; attribute := '[' name '=' quoted-string ']'
I like to pronounce := as is-made-up-of. Using that, lets analyze the rules:
- An object is made up by name followed by an
optional-attribute-list
enclosed in bracers - An
optional-attribute-list
is made up of either anattribute-list
or nothing. The nothing part is important, since this is what makes the rule optional. - An
attribute-list
is anattribute-list
followed by anattribute
, or just anattribute
. This sort of self-recursion is key to understanding formal grammars. A list is either a single item, or another list with another item tacked on to the end. This is probably all very familiar if you know functional languages like lisp (cons) - An
attribute
is simply some terminals tacked together to form a nonterminal called attribute.
The grammar above will recognize an object with any number of attributes, including none, using recursion.
Adding the child objects is similar. This time we will not require any new terminals. We will require two new non-terminals: optional-object-list
and object-list
.
terminal name [a-z]+ terminal quoted-string "[^"]+" object := name '{' optional-object-list optional-attribute-list '}' ; optional-object-lit := object-list | ; object-list := object-list object | object ; optional-attribute-list := attribute-list | ; attribute-list := attribute-list attribute | attribute ; attribute := '[' name '=' quoted-string ']';
As you can see, the object list refers to the main rule object, which causes all the same rules applied to the main object to be applied to the child objects as well. This recursive feature of grammars is extremely powerful. So, we have reduced this problem down to 8 lines of grammar, which save for a few changes in syntax is entirely compatible for a whole range of parser generators.
In fact, more often than not when you are reading text you are in fact parsing it. Only when reading text that has no real context, like counting words without caring if they are nouns or verbs are you not parsing. So why do we keep writing text parsing code by hand when we have all this power right at our fingertips?
This question has been the main motivator for me behind writing Piglet, my parser generator. Not to replace any of the large full-featured generators out there, but to provide a small and nimble alternative that can be easily embedded into your application. Democratizing the formal grammars if you will. It’s currently near an alpha state, with runnable code and a reasonably efficient generation. Feel free to contribute, it’s on github. I’ve also updated the Demo project to include an implementation of this particular grammar. The curious reader will perhaps wonder why there currently are two implementations of the same stuff. That’s because Piglet tries to be educational and chatty and supports a fluent interface for configuration which has a few convenience methods for making lists and optional things. The implementation that exactly matches the non-extended grammar shown in this post is the second method.
If you study the grammar above there are a few things that could be improved. For instance, you must specify child objects first, before specifying attributes. If you’re interested enough, try modifying the grammar so that this becomes possible. It’s good exercise!
I hope I’ve gotten you at least a bit interested in how parsing using tools really works. Hopefully an introductory post for my own library isn’t too far away so it can be put to the test. This little series on parsing will continue with more technical details on how these parsing tools actually work.