Tag Archives: piglet

Building a DSL with Piglet – Extending our DSL

In the last part we created a very basic DSL for use with turtle graphics. Today we’re going to extend the language with more capabilites and learn more about how to use Piglet and also incidentally a bit of compiler construction (nothing scary, I promise).

This is the DSL we ended up with last time:

pendown
move 50
rotate 90
move 50
rotate 90
move 50
rotate 90
move 50
penup
move -10
rotate 90
move 10
pendown
move 30
rotate 90
move 30
rotate 90
move 30
rotate 90
move 30

One obvious thing here is that it would be really nice if we could have variables. We could represent the constants with something else and perform calculations on them. An important thing is to consider the syntax, for a simple DSL we want something that is easy to parse. For this reason we’ll use a var keyword to make variable declarations and require all variables to start with a $. Let’s modify the existing program to include our new language features:

var $size = 50
var $innersize = 30
pendown
move $size
rotate 90
move $size
rotate 90
move $size
rotate 90
move $size
penup
move -10
rotate 90
move 10
pendown
move $innersize
rotate 90
move $innersize
rotate 90
move $innersize
rotate 90
move $innersize

Looking at the new DSL, we note that there is now a new legal statement to make – the var declaration. So we need to extend statement with another rule. The statement list is already pretty long, so we separate it out into another rule:

// Runtime stuff
var variables = new Dictionary<string, int>();

var variableIdentifier = configurator.Expression();
variableIdentifier.ThatMatches("$[A-Za-z0-9]+").AndReturns(f => f.Substring(1));

variableDeclaration.IsMadeUp.By("var")
    .Followed.By(variableIdentifier).As("Name")
    .Followed.By("=")
    .Followed.By<int>().As("InitialValue").WhenFound(f =>
    {
        variables.Add(f.Name, f.InitialValue);
        return null;
    });

Since we want something that matches not a single constant string, we need to make an expression. Expressions are made using the configuration.Expression() method. You give it a regex and you get back an object. You also need to make a function to return a value for the expression. What we want is the variable name without the $ at the start.

To keep track of what variables we have and what values they have, we add them to a dictionary of strings and integers which we then can reference.

We add this new rule to our choices for statementList by adding this to the end:

.Or.By(variableDeclaration)

Note that our rule has just a single component we do not need to add a WhenFound method. It will automatically return the value of the function returned by the single value. Not that we’re using the values returned yet, but we will.

Continuing on, we want to use the variables for the rest of the commands. In order to do that we need to make another rule. Let’s call it expression. For now, a valid expression is either a constant int value, or a variable name. We then use the new rule instead of plain ints for the move and rotate commands. Here’s the full listing that supports the language that we set out to support so far:

// Runtime stuff
var turtle = new Turtle(canvas1);
var variables = new Dictionary<string, int>();

// Parser configurator
var configurator = ParserFactory.Fluent();
var statementList = configurator.Rule();
var statement = configurator.Rule();
var variableDeclaration = configurator.Rule();
var expression = configurator.Rule();

var variableIdentifier = configurator.Expression();
variableIdentifier.ThatMatches("$[A-Za-z0-9]+").AndReturns(f => f.Substring(1));

variableDeclaration.IsMadeUp.By("var")
    .Followed.By(variableIdentifier).As("Name")
    .Followed.By("=")
    .Followed.By<int>().As("InitialValue").WhenFound(f =>
    {
        variables.Add(f.Name, f.InitialValue);
        return null;
    });

expression.IsMadeUp.By<int>()
    .Or.By(variableIdentifier).As("Variable").WhenFound(f => variables[f.Variable]);

statementList.IsMadeUp.ByListOf(statement);
statement.IsMadeUp.By("pendown").WhenFound(f =>
{
    turtle.PenDown = true;
    return null;
})
    .Or.By("penup").WhenFound(f =>
    {
        turtle.PenDown = false;
        return null;
    })
    .Or.By("move").Followed.By(expression).As("Distance").WhenFound(f =>
    {
        turtle.Move(f.Distance);
        return null;
    })
    .Or.By("rotate").Followed.By(expression).As("Angle").WhenFound(f =>
    {
        turtle.Rotate(f.Angle);
        return null;
    })
    .Or.By(variableDeclaration);

A few interesting possibilities here. It’s perfectly acceptable to replace the int for the variable declaration. In fact, it’s a pretty good idea. This enables us to write things like this:

var $foo = 42
var $bar = $foo

What we really want however is to be able to write this:

var $size = 50
var $spacing = 10
var $innersize = $size - $spacing * 2

We’re halfway there already. What is needed is to create some more rules for how an expression can be made. We start simple, we allow only a single addition. Separate what is today an expression into a new rule called term, and make a new rule for additionExpressions.

expression.IsMadeUp.By(additionExpression);

additionExpression.IsMadeUp.By(term).As("First").Followed.By("+").Followed.By(term).As("Second").WhenFound(f => f.First + f.Second)
    .Or.By(term);
term.IsMadeUp.By<int>()
    .Or.By(variableIdentifier).As("Variable").WhenFound(f => variables[f.Variable]);

So, an addition expression is either an add, or a single lone value. This works fine as long as you dont try something like var $a = $b + $c + 100. This is easily fixable by changing the first term into an additionExpression. We are wiring the rule to itself! And like magic we can have as many additions in a row as we’d like. It’s trivial to add a subtraction expression as well, duplicate the rule, and change the operator and the WhenFound function to subtract instead of add (renaming to addSub instead to reflect the new functionality).

addSub.IsMadeUp.By(additionExpression).As("First").Followed.By("+").Followed.By(term).As("Second").WhenFound(f => f.First + f.Second)
                    .Or.By(additionExpression).As("First").Followed.By("-").Followed.By(term).As("Second").WhenFound(f => f.First - f.Second)
                    .Or.By(term);

However, if we add a multiplication and division here things will start to get strange. Everything will be evaluated left to right, so when writing var $a = 2 + 3 * 10 $a will have the value 60 instead of the expected 32. In order to solve this we need to redefine term. A term will now be a multiplication or division expression or a single factor that is a constant. We also wire this rule to itself like the addition rule so we can do many multiplications in a row. The expression grammar now looks like this:

expression.IsMadeUp.By(addSub);

addSub.IsMadeUp.By(addSub).As("First").Followed.By("+").Followed.By(mulDiv).As("Second").WhenFound(f => f.First + f.Second)
    .Or.By(addSub).As("First").Followed.By("-").Followed.By(mulDiv).As("Second").WhenFound(f => f.First - f.Second)
    .Or.By(mulDiv);

mulDiv.IsMadeUp.By(mulDiv).As("First").Followed.By("*").Followed.By(factor).As("Second").WhenFound(f => f.First * f.Second)
    .Or.By(mulDiv).As("First").Followed.By("/").Followed.By(factor).As("Second").WhenFound(f => f.First / f.Second)
    .Or.By(factor);

factor.IsMadeUp.By<int>()
    .Or.By(variableIdentifier).As("Variable").WhenFound(f => variables[f.Variable]);

As the final thing for today, let’s add support for parenthesis. They go in the factor rule. So, a factor can be an entire expression wrapped in parenthesis. It’s almost a thing of magic. We wire the entire last rule back up to the start.

factor.IsMadeUp.By<int>()
    .Or.By(variableIdentifier).As("Variable").WhenFound(f => variables[f.Variable])
    .Or.By("(").Followed.By(expression).As("Expression").Followed.By(")").WhenFound(f => f.Expression);

This is where we’ll end for this part. We have a fully functional expression parser. You can use expressions for all the commands that previously only took integer values, and you can assign values to variables. A few things for the adventurous to try:

  • Add variable assignment. Make set $foo = 1+3*$bar a legal, working piece of code
  • Add support for the modulus operator
  • Add support for unary minus. var $foo = -$barM work. This is a bit tricky, and don’t be afraid if you get some scary exceptions from Piglet

Next time, we’ll make the DSL a bit more fun. We’ll add some basic flow control! Code on github is updated with the latest version

Tagged , , ,

Using Piglet to create a DSL – Setting the stage for turtles

Piglet has quite a few uses, and a case where it shines in particular is in the area of domain specific languages. So, in order to compliment the admittedly scarce tutorials on how to actually use Piglet, I intend to write a short little series on how to make a domain specific language using Piglet. Let’s get straight to it.

Enter the Turtle

I’m going to go for a classic educational DSL here, Turtle graphics. Because everyone like turtles, right? Basically, it’s a way of drawing graphics that could be imagined as if you had a (very obedient) turtle with a paintbrush attached to the back of it’s shell. Masterfully trained, it responds to three commands:

  • Put the brush up or down
  • Move forward or backward a certain number of steps
  • Rotate a certain number of degrees

I have made a little WPF application that implements this behaviour. Eschewing all the WPF junk, here’s the turtle itself:

    public class Turtle
    {
        private readonly Canvas canvas;
        private double angle;
        private double x;
        private double y;

        public Turtle(Canvas canvas)
        {
            this.canvas = canvas;
            x = canvas.ActualWidth/2;
            y = canvas.ActualHeight/2;
        }

        public bool PenDown { get; set; }

        public void Move(double distance)
        {
            var oldX = x;
            var oldY = y;

            double rads = angle/360.0*Math.PI*2.0;
            x += Math.Sin(rads)*distance;
            y += Math.Cos(rads)*distance;

            if (PenDown)
            {
                canvas.Children.Add(new Line { X1 = oldX, Y1 = oldY, X2 = x, Y2 = y, Stroke = new SolidColorBrush(Color.FromRgb(0, 0, 0))});
            }
        }

        public void Rotate(float a)
        {
            angle += a;
        }
    }

Nothing fancy, unless you’ve forgotten all about basic trigonometry. It takes a canvas and when moving it leaves trails if the pen is set to down. Using the turtle is straightforward. This program draws a twenty-sided polygon:

var turtle = new Turtle(canvas);          
turtle.Rotate(90);

turtle.PenDown = true;

for (int i = 0; i < 20; ++i)
{
   turtle.Rotate(360/20);
   turtle.Move(20);
}

It’s fun drawing with the turtle, but cumbersome since you’d have to recompile every time you want to change the shape. What we would really want is to describe it in a structured way using text. Sort of like a mini-programming language. LOGO would be the classic for turtle programming, but I feel it would be more fun to make our own little language for this.

Turtleese

When making a DSL or full blown language, in any tool, it’s usually easiest to start small and work your way up. To get the ball rolling, or turtle walking, let’s make a language that contains only the basic commands understood by the turtle. A sample piece of this rudimentary turtleese looks like this:

pendown
move 50
rotate 90
move 50
rotate 90
move 50
rotate 90
move 50
penup
move -10
rotate 90
move 10
pendown
move 30
rotate 90
move 30
rotate 90
move 30
rotate 90
move 30

This program should make the turtle draw two concentric squares, 10 units apart. It’s really verbose, but it’s a start that we can improve on. If we look at the structure of the program, we can see that this is made up of list of statements. This is the key to formulate a grammar for parsing the language. Piglet gives us two options for configuring a parser, a fluent and a more technical interface. I’ll use the fluent interface for this series, maybe in the end provide an equivalent grammar in the technical interface. In the end, it’ll serve to deepen the knowledge of how grammars actually work, though you should have a pretty good idea when we actually get there.

Here’s the entire thing:

var turtle = new Turtle(canvas1);

var configurator = ParserFactory.Fluent();
var statementList = configurator.Rule();
var statement = configurator.Rule();

statementList.IsMadeUp.ByListOf(statement);
statement.IsMadeUp.By("pendown").WhenFound(f =>
    {
        turtle.PenDown = true;
        return null;
    })
    .Or.By("penup").WhenFound(f =>
    {
        turtle.PenDown = false;
        return null;
    })
    .Or.By("move").Followed.By<int>().As("Distance").WhenFound(f =>
    {
        turtle.Move(f.Distance);
        return null;
    })
    .Or.By("rotate").Followed.By<int>().As("Angle").WhenFound(f =>
    {
        turtle.Rotate(f.Angle);
        return null;
    });
var parser = configurator.CreateParser();
parser.Parse(code.Text);

Going through it line by line. First we make a configurator object. This is the object that we’ll use to make the parser. Then we make two rules, statementList and statement. A thing of huge importance here. The order of declaring these is vitally important. You see, if you’d make this in the wrong order it would just try to find a single statement. A parser constructed by Piglet must be able to parse everything you give it down to a single rule. For now, the single rule is the statement list. Usually it’s something like “program” for a programming language or translation unit or something to that effect.

Moving on, we declare that a statementList unsurprisingly is made up by a list of statement. This is a reference to another rule.

A statement is declared as one of the four commands that the turtle understands. Following each possibility there is a lambda that gets executed when the parser has recognized that a rule is to be applied. For now we make the turtle do the same thing as the command says. There is a need to return something from the parse function, we ignore that for now and return null. Later on we’ll revisit this and find out why you want to return things from rules.

The move and rotate are interesting, since they have a parameter which is an integer value. We’ll need to find out the value of this parameter. In Piglet, this requires you to give the parameter a name. This name then becomes accessible on the dynamic object that is the input to the WhenFound function. Integers are simple types, so Piglet has a built-in function for recognizing them.

Calling parser.Parse with the code causes to the turtle to do what you wanted it to. The program also gives some helpful hints for when you’ve confused our poor reptilian friend, you get this functionality for free when using Piglet.

The full source code for this tutorial is found on GitHub: TurtlePig. Feel free to fork and mess about with it. The same repo will be updated once this tutorial continues, so you might find a slightly different version. I’ll figure a way to keep all versions accessible.

Next time, we’ll take a look at making this language a lot more capable – step by step.

Tagged , , ,

Piglet 1.2.2 released

Well, honestly, 1.2.1 was released too but I forgot to add one of the features that I was going to add to it, so there was two releases today.

Either way, here’s whats new:

Support for escaping inside character classes

You can now use escaped characters inside the character classes. That is, regular expressions like [\d\s] will now result in the correct expected expression. This also enables you to escape the – and ^ sign inside the character class which previously was not possible.

Support for ignoring expressions in character classes

For those purposes where you want the lexer to silently drop things instead of generating tokens. Use configurator.Ignore(expression) to ignore it. For instance, this expression ignores a C# style comment: configurator.Ignore(@”//[^\n]*\n”);

As always, please do report any issues or questions, I’m happy to take a look at it!

Tagged ,

Piglet 1.2.0 released

A new version of Piglet – the fluent parser generator is now out. Here’s a bit of what has been fixed, and how you use those new features in your project.

Added error recovery and reporting to fluent parser configuration

Piglet can now recover from errors and continue parsing in a controlled manner when using the fluent configuration. This is done by the very special Error token. What you do is that when defining a rule, instead of the token you want to use, you use the Error token. In case of an error this token will match instead and all subsequent tokens will be skipped until the next legal token is matched.

When a rule containing the Error token has been found, you will find the Error variable filled in in the corresponding WhenFound clause.

Added token precedence to fluent configuration

You can now specify precedence groups for tokens using the LeftAssociative, RightAssociative and NonAssociative methods found in the fluent configurator object.

Pass one of these methods a group of tokens to declare them to be of equal precedence. Literals or expressions are legal input.

Completed XML documentation to include every method

No more empty intellisense! There should be somewhat helpful text everywhere now.

Fixed bug with escaped ?

\? was not handled correctly when used as an expression. It is now.

Fixed bug with numbered repetition

Numbered repetition in regular expressions now supports the standard {x,y} syntax in addition to the {x:y} syntax that I apparently implemented originally. The old syntax still works.

Also now, the {x,} syntax is supported, which Piglet didn’t support before.

Fixed bug with possible wrong order of defined expressions for fluent configuration

The terminal wasn’t immediately created when a fluent expression was declared. It now becomes created at the moment when the AndReturns function is called. Note that expressions now MUST call the AndReturns function. An incompletely defined expression is no longer legal tender in the piglet world.

Automated the NuGet package management

Thanks to @orjansjoholm the NuGet packages are now built with psake and automated. I could not have done this myself, and it works a treat!

Other news

Check out this thing: A DCPU16 emulator and assembler. Uses Piglet as its parser for the assembly!

Tagged ,

New versions of Piglet and Regexvisualizer are out

It’s been a busy few weeks now, but I finally managed to get some OSS work done again. Here’s an overview of the new stuff

Piglet 1.1.0

Piglet has been updated with a few new features.

  • DFAs generated by piglet can now be minimized, and are minimized by default. This is in essence an optimization that will reduce Piglets memory footprint and make it run faster. As alwyas, this is a tradeoff and you can if you really want to turn off the minimization. The result will be a larger memory footprint but a faster parser generation time. The correctness of the thing remains the same.
  • Ability to get the graph string for the lexer has been added to the public API. Really a feature coming from the regex visualizer, it has been improved with a better API for access from outside the piglet internals. This has also been enhanced with new features, outlined below in the updates to the regex visualizer.
  • Changed test framework to NUnit. This decision was made since there had been some issues with MsTest, mainly that it screwed up building on a machine which did not have Visual Studio installed. This was not an acceptable situation so we decided to change it.

All in all, this release should be fully backwards compatible with earlier versions of Piglet. If you are using it and finding any problems, please do let me know. Piglet is, as always, available on NuGet and on GitHub

Regex visualizer

The regular expression visualizer has been given a new feature, making it really useful for debugging. It will now highlight given an input string which are the active states in both the NFA and the DFA. And if your expression isn’t matched, it will show you what part of your expression is matched and the last state where the machine was in a legal state. All in all this is turning out to be a pretty useful tool. I hope to do some work on the rendering soon, to avoid the pretty annoying flickering – going to look into canviz to turn the rendering into client-side instead of torturing google with constant requests.

Tagged ,

Introducing Piglet

This is an introductory post for my latest creation, Piglet the friendly parsing and lexing tool. As so much software Piglet is written out of both a sense of satisfaction in solving a difficult problem, and seeing a void where my piece of software might be useful to someone. Piglet is open sourced using the MIT license. This post is intended to compliment the readme on github to form a better picture of why I have written this library and how it can be used.

Why another parser generator?!

Because the other parser generators are too big for easy tasks, and harder to integrate into a project. They are also highly technical and hard to understand for someone who isn’t genuinely interested in old-school comp.sci.

The purpose is for you to have a tool at hand for parsing tasks that are smaller than full blown language construction efforts. Piglet tries to bridge the gap between (ab)using regular expressions and going in guns blazing with a large parsing toolset. It is not going to replace ANTLR or any of those tools, though you certainly can parse larger grammars with Piglet. This is the tool to use if you want a low-treshold, easy to use tool for any context free data.

Code based configuration

Piglet is, in sharp contrast to most other generators configurable in code. You create your parser and use it as naturally as you configure any other object. It has about the same functionality as the yacc/bison family of tools and generates similar parsers. This is an important point because if your parser is configured using a separate input file or even a completely separate tool you are always going to have a distance between the running code and the parser. The parser generators also sometimes generate fairly incomprehensible code. If you have a small parsing task, are you really going to use a generator or are you going to roll your own parser?

There are some tradeoffs in this strategy. Obviously the parser generation is going to be done at runtimes. However, the parsers are reentrant to the construction should only need to be done once. It also enables you to be able to use lambdas as actions when rules are followed, which also means that Piglet can construct type-safe parsers!

Ease of use

Unfortunately parsing is a bit complex and it’s usually required to know a bit about context free grammars to be able to construct a parser. Piglet tries to get around this problem by introducing a fluent configuration interface.

It’s easier to show it straight up. This is a full blown JSON parser. The only thing missing here is really only details such as the ability to write 0x notation for hexadecimal and exponent notation for numbers. There rest is all there.

// Create a configuration object. This object generates all the configuration needed for the parser.
var config = ParserFactory.Fluent();

// Create rules. The first rule is the MAIN rule, the rule that everything must be able
// to condense down to. For JSON this is a single object, since a JSON string is always
// just one object.
var jsonObject = config.Rule();

// Used to represent an element of an object
// which is something like "elementname" : some_value
var jsonElement = config.Rule();

// This represents a value of a json element
var jsonValue = config.Rule();

// This represents an array of values
var jsonArray = config.Rule();

// Now we declare what a jsonObject is made up of
// Literals are found in quotes. Interesting parts that we are interested 
// in are named using the As clause, which makes them accessible in the 
// .WhenFound. The result of WhenFound is returned to the caller.
jsonObject.IsMadeUp.By("{")
		  .Followed.ByListOf<JsonElement>(jsonElement).As("ElementList").ThatIs.SeparatedBy(",").Optional
		  .Followed.By("}")
	.WhenFound( o => new JsonObject { Elements = o.ElementList } );

// Declares what an element is made up of. Note that the rule above uses the
// jsonElement before what it is made of is declared. This is a crucial part
// of parsing. This bit has two interesting named pieces "Name" and "Value".
// Since each bit has a value, this gets assigned to the o parameter to the WhenFound lambda.
jsonElement.IsMadeUp.By(config.QuotedString).As("Name")
		   .Followed.By(":")
		   .Followed.By(jsonValue).As("Value")
	.WhenFound( o => new JsonElement { Name = o.Name, Value = o.Value } );

// A jsonValue has several interpretations, separated by Or clauses.
// Predefined parsing is found for simple types. Note the recursive use
// of jsonObject, values can be full objects themselves! There is no need
// for .WhenFound clauses for single part rules, they
// will always return the value of the single part.
jsonValue.IsMadeUp.By(config.QuotedString)
	.Or.By<int>()
	.Or.By<double>()
	.Or.By(jsonObject)
	.Or.By(jsonArray)
	.Or.By<bool>()
	.Or.By("null").WhenFound(o => null); // Need to specify, since the default will return the string "null"

// This rule could have been merged into the jsonValue rule, as its 
// own .Or.By clause. It's separated for readability only.
jsonArray.IsMadeUp.By("[")
		 .Followed.ByListOf(jsonValue).As("Values").ThatIs.SeparatedBy(",").Optional
		 .Followed.By("]")
	   .WhenFound(o => o.Values);

// When the configuratin is completed - create the parser!
// If the parser creation is unsuccessful, this throws the
// friendliest exception possible to help you find the issue.
var parser = config.CreateParser();

// Here is how you use it!
var jObject = (JsonObject)parser.Parse(
	@"{ 
		 ""Property1"":""va\""lue"", 
		 ""IntegerProperty"" : 1234, 
		 ""array"":[1,2,3,4,5],
		 ""another_object"" : {
			""another_property"":13.37
		 },
		 ""empty_object"" : {
			
		 }
	}");

There is also a less verbose technical interface, which right now includes more functions than the fluent one does. In the interest of brevity in this post I’m not going to describe it in detail. The features that currently only exists in the tech interface are context dependent precedence, token precedence, panic mode error recovery and type specific parser generation (the fluent parsers will all generate an object as an end result). These should all be upcoming features in further versions for the fluent interface.

The fluent configuration interface is right now in a working state, but can certainly use more functions! Common tasks should be easy to accomplish while harder tasks may require you to fall back on the technical configuration. As an exciting preview of things to come, I’m currently working on an embedded domain specific language generator (made using Piglet) called Snout, which when finished should help the maintainability of the fluent interface.

Piglet is totally dependency free, and written in C# 4. The code is well commented and understandable. If you are looking for code that reasonably explains the parser generation algorithm, Piglet is a sensible choice.

In technical terms, Piglet generates LALR(1) parsers – mainly because it’s an easier way of expressing your grammar since you do not require left factoring which is a pretty difficult concept to explain.

More examples

I’ve written a Demo project that is available in the main github repository which contains a few demos. The unit tests for Piglet are also quite extensive and are suitable for learning. I’m hoping to find more time to write a detailed tutorial on how to go from a given format into a complete parser. Suggestions for tutorials and articles are most welcome.

Getting your hands on it

Piglet lives on GitHub if you want the source code and demos. It also lives on NuGet if you just want the binaries.

Contributing

You are most welcome, regardless of skill level. There’s lots of stuff to do! Or just fork it and go to town on your own. I would be flattered if you did.

And if you can’t program, Piglet could use a logo. I’m thinking a piglet would make an appropriate one 🙂

Tagged , , ,

An introduction to parsing and BNF, and why you should care even if you’re not writing a compiler

Parsing tasks are extremely common and even though major efforts have been done to standardize our inter-systems communications we still end up analyzing various text files which have structure – but not a structure that we have a library at hand that can parse. As an example, say you have something that outputs this sort of text (I assume this isn’t a real format, it’s not supposed to be).

fruits {
    banana {
        [tasty="true"]
        [colour="yellow"]
    }

    orange {
    }
    
    [eatable="if not rotten"]
}

How would you go about it, if you want to store the output result in some form of object? I often see splitting on spaces, and other characters, making convoluted paths and abusing regular expressions to no end to wrangle these sorts of things into whatever data structure you have on your client. What if I said there is a better way. It’s not new by a long short, but it rarely spreads outside of advanced comp-sci classes or compiler designer classes to the more pragmatic programmers working in the real world.

Grammars

There’s generic tools that can parse these sorts of texts. They’re incredibly efficient at doing this as well. They simply require the right input. That input is a grammar. A grammar is really the set of rules that a text must follow in order to be parseable. For instance, in the example above we could assume that it would be illegal to not match the brackets properly, or to have an unnamed node. The way of expressing grammars is known as the Backus-Naur form, or BNF for short. It’s an easy syntax.

Resulting symbol := List of input symbols | Another list of symbols | [maybe an empty section as well];

A symbol is one of two things. It’s either a terminal symbol or a nonterminal symbol. Terminal symbols are those that are directly translatable from the input, nonterminals are built up from lists of other symbols – both terminal and nonterminal ones. Lets start to make a grammar for our input.

To make it easy on ourselves, we’ll simplify the input text and add features in as we go. Let’s start with identifying only this subset

basenode {
}

There are three distinct terminal tokens here: name, { and }. Note that the terminal name will represent for us all the legal nodenames. It’s typical that a terminal token is defined by a regular expression. Let’s assume that name can be only lowercase alphabetic characters. The only legal combination of these terminals are is name followed by { followed by }. This must yield a non terminal token. Let’s call it object. The grammar will look like this:

terminal name [a-z]+

object := name '{' '}' ;

A few things about this grammar, for shortness it allows single character terminal to be used as is. Standard regex notation defines the terminal to be lowercase a-z, repeated at least once.

Lets add the [attribute=”value”] notation next. A sample of legal input is now:

rickastley {
    [nevergonnaone="give you up"]
    [nevergonnatwo="let you down"]
    [nevergonnathree="run around"]
    [nevergonnafour="and desert you"]
}

Now it gets more interesting, since there can be more than one attribute, and maybe zero. This is where the cool recursive nature of grammars come into play. Looking at the notation of attributes, we can see that we will require four more terminals: [, ], = and a quoted-string. We will also require three more non-terminals: attribute, attribute-list and optional-attribute-list. I’ll give you the grammar, and we can discuss why it looks like it does.

terminal name [a-z]+
terminal quoted-string "[^"]+"

object := name '{' optional-attribute-list '}' ;

optional-attribute-list := attribute-list 
                         | ;

attribute-list := attribute-list attribute 
                | attribute;

attribute := '[' name '=' quoted-string ']'

I like to pronounce := as is-made-up-of. Using that, lets analyze the rules:

  1. An object is made up by name followed by an optional-attribute-list enclosed in bracers
  2. An optional-attribute-list is made up of either an attribute-list or nothing. The nothing part is important, since this is what makes the rule optional.
  3. An attribute-list is an attribute-list followed by an attribute, or just an attribute. This sort of self-recursion is key to understanding formal grammars. A list is either a single item, or another list with another item tacked on to the end. This is probably all very familiar if you know functional languages like lisp (cons)
  4. An attribute is simply some terminals tacked together to form a nonterminal called attribute.

The grammar above will recognize an object with any number of attributes, including none, using recursion.

Adding the child objects is similar. This time we will not require any new terminals. We will require two new non-terminals: optional-object-list and object-list.

terminal name [a-z]+
terminal quoted-string "[^"]+"

object := name '{' optional-object-list optional-attribute-list '}' ;

optional-object-lit := object-list 
                     | ;

object-list := object-list object 
             | object ;

optional-attribute-list := attribute-list 
                        | ;

attribute-list := attribute-list attribute 
                | attribute ;

attribute := '[' name '=' quoted-string ']';

As you can see, the object list refers to the main rule object, which causes all the same rules applied to the main object to be applied to the child objects as well. This recursive feature of grammars is extremely powerful. So, we have reduced this problem down to 8 lines of grammar, which save for a few changes in syntax is entirely compatible for a whole range of parser generators.

In fact, more often than not when you are reading text you are in fact parsing it. Only when reading text that has no real context, like counting words without caring if they are nouns or verbs are you not parsing. So why do we keep writing text parsing code by hand when we have all this power right at our fingertips?

This question has been the main motivator for me behind writing Piglet, my parser generator. Not to replace any of the large full-featured generators out there, but to provide a small and nimble alternative that can be easily embedded into your application. Democratizing the formal grammars if you will. It’s currently near an alpha state, with runnable code and a reasonably efficient generation. Feel free to contribute, it’s on github. I’ve also updated the Demo project to include an implementation of this particular grammar. The curious reader will perhaps wonder why there currently are two implementations of the same stuff. That’s because Piglet tries to be educational and chatty and supports a fluent interface for configuration which has a few convenience methods for making lists and optional things. The implementation that exactly matches the non-extended grammar shown in this post is the second method.

If you study the grammar above there are a few things that could be improved. For instance, you must specify child objects first, before specifying attributes. If you’re interested enough, try modifying the grammar so that this becomes possible. It’s good exercise!

I hope I’ve gotten you at least a bit interested in how parsing using tools really works. Hopefully an introductory post for my own library isn’t too far away so it can be put to the test. This little series on parsing will continue with more technical details on how these parsing tools actually work.

Tagged , , ,