Tag Archives: parsing

Building a DSL with Piglet – Extending our DSL

In the last part we created a very basic DSL for use with turtle graphics. Today we’re going to extend the language with more capabilites and learn more about how to use Piglet and also incidentally a bit of compiler construction (nothing scary, I promise).

This is the DSL we ended up with last time:

pendown
move 50
rotate 90
move 50
rotate 90
move 50
rotate 90
move 50
penup
move -10
rotate 90
move 10
pendown
move 30
rotate 90
move 30
rotate 90
move 30
rotate 90
move 30

One obvious thing here is that it would be really nice if we could have variables. We could represent the constants with something else and perform calculations on them. An important thing is to consider the syntax, for a simple DSL we want something that is easy to parse. For this reason we’ll use a var keyword to make variable declarations and require all variables to start with a $. Let’s modify the existing program to include our new language features:

var $size = 50
var $innersize = 30
pendown
move $size
rotate 90
move $size
rotate 90
move $size
rotate 90
move $size
penup
move -10
rotate 90
move 10
pendown
move $innersize
rotate 90
move $innersize
rotate 90
move $innersize
rotate 90
move $innersize

Looking at the new DSL, we note that there is now a new legal statement to make – the var declaration. So we need to extend statement with another rule. The statement list is already pretty long, so we separate it out into another rule:

// Runtime stuff
var variables = new Dictionary<string, int>();

var variableIdentifier = configurator.Expression();
variableIdentifier.ThatMatches("$[A-Za-z0-9]+").AndReturns(f => f.Substring(1));

variableDeclaration.IsMadeUp.By("var")
    .Followed.By(variableIdentifier).As("Name")
    .Followed.By("=")
    .Followed.By<int>().As("InitialValue").WhenFound(f =>
    {
        variables.Add(f.Name, f.InitialValue);
        return null;
    });

Since we want something that matches not a single constant string, we need to make an expression. Expressions are made using the configuration.Expression() method. You give it a regex and you get back an object. You also need to make a function to return a value for the expression. What we want is the variable name without the $ at the start.

To keep track of what variables we have and what values they have, we add them to a dictionary of strings and integers which we then can reference.

We add this new rule to our choices for statementList by adding this to the end:

.Or.By(variableDeclaration)

Note that our rule has just a single component we do not need to add a WhenFound method. It will automatically return the value of the function returned by the single value. Not that we’re using the values returned yet, but we will.

Continuing on, we want to use the variables for the rest of the commands. In order to do that we need to make another rule. Let’s call it expression. For now, a valid expression is either a constant int value, or a variable name. We then use the new rule instead of plain ints for the move and rotate commands. Here’s the full listing that supports the language that we set out to support so far:

// Runtime stuff
var turtle = new Turtle(canvas1);
var variables = new Dictionary<string, int>();

// Parser configurator
var configurator = ParserFactory.Fluent();
var statementList = configurator.Rule();
var statement = configurator.Rule();
var variableDeclaration = configurator.Rule();
var expression = configurator.Rule();

var variableIdentifier = configurator.Expression();
variableIdentifier.ThatMatches("$[A-Za-z0-9]+").AndReturns(f => f.Substring(1));

variableDeclaration.IsMadeUp.By("var")
    .Followed.By(variableIdentifier).As("Name")
    .Followed.By("=")
    .Followed.By<int>().As("InitialValue").WhenFound(f =>
    {
        variables.Add(f.Name, f.InitialValue);
        return null;
    });

expression.IsMadeUp.By<int>()
    .Or.By(variableIdentifier).As("Variable").WhenFound(f => variables[f.Variable]);

statementList.IsMadeUp.ByListOf(statement);
statement.IsMadeUp.By("pendown").WhenFound(f =>
{
    turtle.PenDown = true;
    return null;
})
    .Or.By("penup").WhenFound(f =>
    {
        turtle.PenDown = false;
        return null;
    })
    .Or.By("move").Followed.By(expression).As("Distance").WhenFound(f =>
    {
        turtle.Move(f.Distance);
        return null;
    })
    .Or.By("rotate").Followed.By(expression).As("Angle").WhenFound(f =>
    {
        turtle.Rotate(f.Angle);
        return null;
    })
    .Or.By(variableDeclaration);

A few interesting possibilities here. It’s perfectly acceptable to replace the int for the variable declaration. In fact, it’s a pretty good idea. This enables us to write things like this:

var $foo = 42
var $bar = $foo

What we really want however is to be able to write this:

var $size = 50
var $spacing = 10
var $innersize = $size - $spacing * 2

We’re halfway there already. What is needed is to create some more rules for how an expression can be made. We start simple, we allow only a single addition. Separate what is today an expression into a new rule called term, and make a new rule for additionExpressions.

expression.IsMadeUp.By(additionExpression);

additionExpression.IsMadeUp.By(term).As("First").Followed.By("+").Followed.By(term).As("Second").WhenFound(f => f.First + f.Second)
    .Or.By(term);
term.IsMadeUp.By<int>()
    .Or.By(variableIdentifier).As("Variable").WhenFound(f => variables[f.Variable]);

So, an addition expression is either an add, or a single lone value. This works fine as long as you dont try something like var $a = $b + $c + 100. This is easily fixable by changing the first term into an additionExpression. We are wiring the rule to itself! And like magic we can have as many additions in a row as we’d like. It’s trivial to add a subtraction expression as well, duplicate the rule, and change the operator and the WhenFound function to subtract instead of add (renaming to addSub instead to reflect the new functionality).

addSub.IsMadeUp.By(additionExpression).As("First").Followed.By("+").Followed.By(term).As("Second").WhenFound(f => f.First + f.Second)
                    .Or.By(additionExpression).As("First").Followed.By("-").Followed.By(term).As("Second").WhenFound(f => f.First - f.Second)
                    .Or.By(term);

However, if we add a multiplication and division here things will start to get strange. Everything will be evaluated left to right, so when writing var $a = 2 + 3 * 10 $a will have the value 60 instead of the expected 32. In order to solve this we need to redefine term. A term will now be a multiplication or division expression or a single factor that is a constant. We also wire this rule to itself like the addition rule so we can do many multiplications in a row. The expression grammar now looks like this:

expression.IsMadeUp.By(addSub);

addSub.IsMadeUp.By(addSub).As("First").Followed.By("+").Followed.By(mulDiv).As("Second").WhenFound(f => f.First + f.Second)
    .Or.By(addSub).As("First").Followed.By("-").Followed.By(mulDiv).As("Second").WhenFound(f => f.First - f.Second)
    .Or.By(mulDiv);

mulDiv.IsMadeUp.By(mulDiv).As("First").Followed.By("*").Followed.By(factor).As("Second").WhenFound(f => f.First * f.Second)
    .Or.By(mulDiv).As("First").Followed.By("/").Followed.By(factor).As("Second").WhenFound(f => f.First / f.Second)
    .Or.By(factor);

factor.IsMadeUp.By<int>()
    .Or.By(variableIdentifier).As("Variable").WhenFound(f => variables[f.Variable]);

As the final thing for today, let’s add support for parenthesis. They go in the factor rule. So, a factor can be an entire expression wrapped in parenthesis. It’s almost a thing of magic. We wire the entire last rule back up to the start.

factor.IsMadeUp.By<int>()
    .Or.By(variableIdentifier).As("Variable").WhenFound(f => variables[f.Variable])
    .Or.By("(").Followed.By(expression).As("Expression").Followed.By(")").WhenFound(f => f.Expression);

This is where we’ll end for this part. We have a fully functional expression parser. You can use expressions for all the commands that previously only took integer values, and you can assign values to variables. A few things for the adventurous to try:

  • Add variable assignment. Make set $foo = 1+3*$bar a legal, working piece of code
  • Add support for the modulus operator
  • Add support for unary minus. var $foo = -$barM work. This is a bit tricky, and don’t be afraid if you get some scary exceptions from Piglet

Next time, we’ll make the DSL a bit more fun. We’ll add some basic flow control! Code on github is updated with the latest version

Advertisements
Tagged , , ,

Using Piglet to create a DSL – Setting the stage for turtles

Piglet has quite a few uses, and a case where it shines in particular is in the area of domain specific languages. So, in order to compliment the admittedly scarce tutorials on how to actually use Piglet, I intend to write a short little series on how to make a domain specific language using Piglet. Let’s get straight to it.

Enter the Turtle

I’m going to go for a classic educational DSL here, Turtle graphics. Because everyone like turtles, right? Basically, it’s a way of drawing graphics that could be imagined as if you had a (very obedient) turtle with a paintbrush attached to the back of it’s shell. Masterfully trained, it responds to three commands:

  • Put the brush up or down
  • Move forward or backward a certain number of steps
  • Rotate a certain number of degrees

I have made a little WPF application that implements this behaviour. Eschewing all the WPF junk, here’s the turtle itself:

    public class Turtle
    {
        private readonly Canvas canvas;
        private double angle;
        private double x;
        private double y;

        public Turtle(Canvas canvas)
        {
            this.canvas = canvas;
            x = canvas.ActualWidth/2;
            y = canvas.ActualHeight/2;
        }

        public bool PenDown { get; set; }

        public void Move(double distance)
        {
            var oldX = x;
            var oldY = y;

            double rads = angle/360.0*Math.PI*2.0;
            x += Math.Sin(rads)*distance;
            y += Math.Cos(rads)*distance;

            if (PenDown)
            {
                canvas.Children.Add(new Line { X1 = oldX, Y1 = oldY, X2 = x, Y2 = y, Stroke = new SolidColorBrush(Color.FromRgb(0, 0, 0))});
            }
        }

        public void Rotate(float a)
        {
            angle += a;
        }
    }

Nothing fancy, unless you’ve forgotten all about basic trigonometry. It takes a canvas and when moving it leaves trails if the pen is set to down. Using the turtle is straightforward. This program draws a twenty-sided polygon:

var turtle = new Turtle(canvas);          
turtle.Rotate(90);

turtle.PenDown = true;

for (int i = 0; i < 20; ++i)
{
   turtle.Rotate(360/20);
   turtle.Move(20);
}

It’s fun drawing with the turtle, but cumbersome since you’d have to recompile every time you want to change the shape. What we would really want is to describe it in a structured way using text. Sort of like a mini-programming language. LOGO would be the classic for turtle programming, but I feel it would be more fun to make our own little language for this.

Turtleese

When making a DSL or full blown language, in any tool, it’s usually easiest to start small and work your way up. To get the ball rolling, or turtle walking, let’s make a language that contains only the basic commands understood by the turtle. A sample piece of this rudimentary turtleese looks like this:

pendown
move 50
rotate 90
move 50
rotate 90
move 50
rotate 90
move 50
penup
move -10
rotate 90
move 10
pendown
move 30
rotate 90
move 30
rotate 90
move 30
rotate 90
move 30

This program should make the turtle draw two concentric squares, 10 units apart. It’s really verbose, but it’s a start that we can improve on. If we look at the structure of the program, we can see that this is made up of list of statements. This is the key to formulate a grammar for parsing the language. Piglet gives us two options for configuring a parser, a fluent and a more technical interface. I’ll use the fluent interface for this series, maybe in the end provide an equivalent grammar in the technical interface. In the end, it’ll serve to deepen the knowledge of how grammars actually work, though you should have a pretty good idea when we actually get there.

Here’s the entire thing:

var turtle = new Turtle(canvas1);

var configurator = ParserFactory.Fluent();
var statementList = configurator.Rule();
var statement = configurator.Rule();

statementList.IsMadeUp.ByListOf(statement);
statement.IsMadeUp.By("pendown").WhenFound(f =>
    {
        turtle.PenDown = true;
        return null;
    })
    .Or.By("penup").WhenFound(f =>
    {
        turtle.PenDown = false;
        return null;
    })
    .Or.By("move").Followed.By<int>().As("Distance").WhenFound(f =>
    {
        turtle.Move(f.Distance);
        return null;
    })
    .Or.By("rotate").Followed.By<int>().As("Angle").WhenFound(f =>
    {
        turtle.Rotate(f.Angle);
        return null;
    });
var parser = configurator.CreateParser();
parser.Parse(code.Text);

Going through it line by line. First we make a configurator object. This is the object that we’ll use to make the parser. Then we make two rules, statementList and statement. A thing of huge importance here. The order of declaring these is vitally important. You see, if you’d make this in the wrong order it would just try to find a single statement. A parser constructed by Piglet must be able to parse everything you give it down to a single rule. For now, the single rule is the statement list. Usually it’s something like “program” for a programming language or translation unit or something to that effect.

Moving on, we declare that a statementList unsurprisingly is made up by a list of statement. This is a reference to another rule.

A statement is declared as one of the four commands that the turtle understands. Following each possibility there is a lambda that gets executed when the parser has recognized that a rule is to be applied. For now we make the turtle do the same thing as the command says. There is a need to return something from the parse function, we ignore that for now and return null. Later on we’ll revisit this and find out why you want to return things from rules.

The move and rotate are interesting, since they have a parameter which is an integer value. We’ll need to find out the value of this parameter. In Piglet, this requires you to give the parameter a name. This name then becomes accessible on the dynamic object that is the input to the WhenFound function. Integers are simple types, so Piglet has a built-in function for recognizing them.

Calling parser.Parse with the code causes to the turtle to do what you wanted it to. The program also gives some helpful hints for when you’ve confused our poor reptilian friend, you get this functionality for free when using Piglet.

The full source code for this tutorial is found on GitHub: TurtlePig. Feel free to fork and mess about with it. The same repo will be updated once this tutorial continues, so you might find a slightly different version. I’ll figure a way to keep all versions accessible.

Next time, we’ll take a look at making this language a lot more capable – step by step.

Tagged , , ,

An introduction to parsing and BNF, and why you should care even if you’re not writing a compiler

Parsing tasks are extremely common and even though major efforts have been done to standardize our inter-systems communications we still end up analyzing various text files which have structure – but not a structure that we have a library at hand that can parse. As an example, say you have something that outputs this sort of text (I assume this isn’t a real format, it’s not supposed to be).

fruits {
    banana {
        [tasty="true"]
        [colour="yellow"]
    }

    orange {
    }
    
    [eatable="if not rotten"]
}

How would you go about it, if you want to store the output result in some form of object? I often see splitting on spaces, and other characters, making convoluted paths and abusing regular expressions to no end to wrangle these sorts of things into whatever data structure you have on your client. What if I said there is a better way. It’s not new by a long short, but it rarely spreads outside of advanced comp-sci classes or compiler designer classes to the more pragmatic programmers working in the real world.

Grammars

There’s generic tools that can parse these sorts of texts. They’re incredibly efficient at doing this as well. They simply require the right input. That input is a grammar. A grammar is really the set of rules that a text must follow in order to be parseable. For instance, in the example above we could assume that it would be illegal to not match the brackets properly, or to have an unnamed node. The way of expressing grammars is known as the Backus-Naur form, or BNF for short. It’s an easy syntax.

Resulting symbol := List of input symbols | Another list of symbols | [maybe an empty section as well];

A symbol is one of two things. It’s either a terminal symbol or a nonterminal symbol. Terminal symbols are those that are directly translatable from the input, nonterminals are built up from lists of other symbols – both terminal and nonterminal ones. Lets start to make a grammar for our input.

To make it easy on ourselves, we’ll simplify the input text and add features in as we go. Let’s start with identifying only this subset

basenode {
}

There are three distinct terminal tokens here: name, { and }. Note that the terminal name will represent for us all the legal nodenames. It’s typical that a terminal token is defined by a regular expression. Let’s assume that name can be only lowercase alphabetic characters. The only legal combination of these terminals are is name followed by { followed by }. This must yield a non terminal token. Let’s call it object. The grammar will look like this:

terminal name [a-z]+

object := name '{' '}' ;

A few things about this grammar, for shortness it allows single character terminal to be used as is. Standard regex notation defines the terminal to be lowercase a-z, repeated at least once.

Lets add the [attribute=”value”] notation next. A sample of legal input is now:

rickastley {
    [nevergonnaone="give you up"]
    [nevergonnatwo="let you down"]
    [nevergonnathree="run around"]
    [nevergonnafour="and desert you"]
}

Now it gets more interesting, since there can be more than one attribute, and maybe zero. This is where the cool recursive nature of grammars come into play. Looking at the notation of attributes, we can see that we will require four more terminals: [, ], = and a quoted-string. We will also require three more non-terminals: attribute, attribute-list and optional-attribute-list. I’ll give you the grammar, and we can discuss why it looks like it does.

terminal name [a-z]+
terminal quoted-string "[^"]+"

object := name '{' optional-attribute-list '}' ;

optional-attribute-list := attribute-list 
                         | ;

attribute-list := attribute-list attribute 
                | attribute;

attribute := '[' name '=' quoted-string ']'

I like to pronounce := as is-made-up-of. Using that, lets analyze the rules:

  1. An object is made up by name followed by an optional-attribute-list enclosed in bracers
  2. An optional-attribute-list is made up of either an attribute-list or nothing. The nothing part is important, since this is what makes the rule optional.
  3. An attribute-list is an attribute-list followed by an attribute, or just an attribute. This sort of self-recursion is key to understanding formal grammars. A list is either a single item, or another list with another item tacked on to the end. This is probably all very familiar if you know functional languages like lisp (cons)
  4. An attribute is simply some terminals tacked together to form a nonterminal called attribute.

The grammar above will recognize an object with any number of attributes, including none, using recursion.

Adding the child objects is similar. This time we will not require any new terminals. We will require two new non-terminals: optional-object-list and object-list.

terminal name [a-z]+
terminal quoted-string "[^"]+"

object := name '{' optional-object-list optional-attribute-list '}' ;

optional-object-lit := object-list 
                     | ;

object-list := object-list object 
             | object ;

optional-attribute-list := attribute-list 
                        | ;

attribute-list := attribute-list attribute 
                | attribute ;

attribute := '[' name '=' quoted-string ']';

As you can see, the object list refers to the main rule object, which causes all the same rules applied to the main object to be applied to the child objects as well. This recursive feature of grammars is extremely powerful. So, we have reduced this problem down to 8 lines of grammar, which save for a few changes in syntax is entirely compatible for a whole range of parser generators.

In fact, more often than not when you are reading text you are in fact parsing it. Only when reading text that has no real context, like counting words without caring if they are nouns or verbs are you not parsing. So why do we keep writing text parsing code by hand when we have all this power right at our fingertips?

This question has been the main motivator for me behind writing Piglet, my parser generator. Not to replace any of the large full-featured generators out there, but to provide a small and nimble alternative that can be easily embedded into your application. Democratizing the formal grammars if you will. It’s currently near an alpha state, with runnable code and a reasonably efficient generation. Feel free to contribute, it’s on github. I’ve also updated the Demo project to include an implementation of this particular grammar. The curious reader will perhaps wonder why there currently are two implementations of the same stuff. That’s because Piglet tries to be educational and chatty and supports a fluent interface for configuration which has a few convenience methods for making lists and optional things. The implementation that exactly matches the non-extended grammar shown in this post is the second method.

If you study the grammar above there are a few things that could be improved. For instance, you must specify child objects first, before specifying attributes. If you’re interested enough, try modifying the grammar so that this becomes possible. It’s good exercise!

I hope I’ve gotten you at least a bit interested in how parsing using tools really works. Hopefully an introductory post for my own library isn’t too far away so it can be put to the test. This little series on parsing will continue with more technical details on how these parsing tools actually work.

Tagged , , ,