If you follow the generic instruction for converting a NFA to a DFA presented in my earlier post, the DFA you generate will work, but it will not be optimal. It will have too many states opposed to the bare minimum needed to do it’s job properly. This tends to happen more frequently if you give it an input of an expression that in some respect have redundancy. A prime example of this is the expression `a+b+|ab`

. The `|ab`

part is of course, totally useless. But the subset construction algorithm doesn’t know that.

To compare, here is the NFA generated by thompsons construction:

Here is the non-minimized DFA that the subset construction algorithm will generate:

And here is the minimum DFA required to do the job properly:

Quite a difference! So, how do we do this? There are several ways of accomplishing this, some more complex than others. I’m going to give you a working algorithm which admittedly isn’t the fastest. The faster versions are more complex, and would ruin the idea of explaining the algorithm.

## For programmers

I’m willing to bet you’re like me, a programmer and not a mathematician. You can describe this algorithm in math terms to no end, and I’m willing to bet that most people are going to miss out on what is really not the most complex algorithm in the world. If you google this you’re going to get the algorithm described as some function which partitions states so that this and that is true. Perhaps this is jealousy on my part since I feel left out of the cool crowd with the greek symbols and all, but I think there are better ways of explaining this stuff.

## The algorithm

It’s a bit backwards, but what we’re going to do is to assume that all states can be merged together and then rule out those that can’t. This is going to be represented by a table, where empty squares are those that we cannot tell apart. So, for now, this table looks like this:

+-----+ |s1| | +--+--+--+ |s2| | | +--+--+--+--+ |s3| | | | +--+--+--+--+--+ |s4| | | | | +--+--+--+--+--+ |s0|s1|s2|s3| +--+--+--+--+

This table is going to be filled later on. There’s no reason for a full table, we only want the distinct pairs of states.

What we are interested in is to sort the states into groups that may be merged into a single state without altering the functionality of the DFA. So, what we really need to do is to find out how to find these state groups normally referred to as nondistinguishable states. The first rule of this is:

A state that is an accepting state may never be in the same group as a state which is not an accepting state.

So, in our graphs we may never merge a double-circle state with a single circle state. Let’s apply this to our table. We end up with this:

+-----+ |s1| | +--+--+--+ |s2|X |X | +--+--+--+--+ |s3| | |X | +--+--+--+--+--+ |s4|X |X | |X | +--+--+--+--+--+ |s0|s1|s2|s3| +--+--+--+--+

We’re not out of the woods yet, still more work to do. There is another rule at work here. It’s easiest to express this in pseudocode. You will apply the algorithm * repeatedly *until the table stops changing. If you complete a full iteration without any entry getting filled in, you are done.

do { for (int i = 0; i < numStates; ++i) { for (int j = i + 1; j < numStates; ++j) { if (table[i,j].IsEmpty) { foreach (var inputsymbol in allValidInputs) { if (one of the states has a transition but the other one doesn't) { table[i,j] = X; } else if (both states have a transition) { int iTarget = state at other end of transition from state i; int jTarget = state at other end of transition from state j; if (!table[iTarget, jTarget].IsEmpty) { table[i,j] = X; } } else { // For clarity, if none of the states have transitions // on this symbol, don't change the table } } } } } } while (table.WasChanged)

What this does is to check if there is any significant difference by checking where the transitions go, and filling in the table as it goes along. Applying this changes the table to look like this:

+-----+ |s1|X | +--+--+--+ |s2|X |X | +--+--+--+--+ |s3|X | |X | +--+--+--+--+--+ |s4|X |X | |X | +--+--+--+--+--+ |s0|s1|s2|s3| +--+--+--+--+

In this example, iterating the algorithm won’t change the table which means we are done. The states s1 and s3 can be merged into one state and s2 and s4 can be merged into another. To illustrate, here is the two nondistinct state groups marked out:

A final caveat is that you might get long chains of states. Say that s0 and s1 will be merged along with s1 and s2, which really means you need to merge s0, s1 and s2 into another state. This is pretty trivial, and I will not elaborate further on this, an example of this can be found in the class DFA of my parser library Piglet, which is available on GitHub.

As a final note, it can be proven that this algorithm generates the smallest possible DFA that accepts the given grammar. All in a few lines of code. Try the algorithm out yourself at the regex visualizer – make a redundant expression and uncheck the Minimize checkbox.