Sunday, October 30, 2016

Monadic Parser Combinators in C#

I have only recently come across the now classic paper by Graham Hutton and Erik Meijer on Monadic Parser Combinators. This beautifully written paper walks the reader, in surprisingly accessible terms, through a technique for developing recursive descent parsers using functional programming.

In the paper, Hutton and Meijer use the functional programming language Gofer for all the examples, but in fact C# is also a perfectly valid language in which to develop these techniques. Doing so offered me a number of insights about C# type inference and generics, LINQ, and laziness, so in this series of blog posts I would like to offer my own translation of the examples in the paper into C#, section by section.

The type of parsers

In the first section of the paper, Hutton and Meijer introduce the parser monad as a function that takes a string as input and returns a collection of results, where each result contains both the abstract syntax tree (AST) corresponding to the parsed input (i.e. the Value) and any unconsumed text (i.e. the Tail) that remains to be parsed:

type Parser a = String -> [(a,String)]

Their definition can be translated directly into the following C# delegate type:

public delegate IEnumerable<IResult<TValue>> Parser<out TValue>(string input);

Each individual result from the collection of results can be defined using the following types:

public interface IResult<out TValue>
{
    TValue Value { get; }

    string Tail { get; }
}

public class Result<TValue> : IResult<TValue>
{
    public Result(TValue value, string tail)
    {
        Value = value;
        Tail = tail;
    }

    public TValue Value { get; private set; }

    public string Tail { get; private set; }
}

Notice how we have made TValue covariant through the use of the out keyword in the interface and delegate definitions. In this way, we get for free the ability to use a parser for a more specific type of syntax tree wherever a parser for a more general type is expected: for reference types, a Parser<Derived> converts implicitly to a Parser<Base>.
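
To make the covariance point concrete, here is a minimal sketch. The Node and Literal classes and the WholeInput parser are hypothetical names invented for this illustration; they do not come from the paper.

```csharp
using System;
using System.Collections.Generic;

public delegate IEnumerable<IResult<TValue>> Parser<out TValue>(string input);

public interface IResult<out TValue>
{
    TValue Value { get; }
    string Tail { get; }
}

public class Result<TValue> : IResult<TValue>
{
    public Result(TValue value, string tail) { Value = value; Tail = tail; }
    public TValue Value { get; private set; }
    public string Tail { get; private set; }
}

// Hypothetical AST node types, used only for this illustration.
public abstract class Node { }

public class Literal : Node
{
    public string Text;
    public override string ToString() { return "Literal(" + Text + ")"; }
}

public static class Demo
{
    // Hypothetical parser: wraps the whole input in a Literal and leaves no tail.
    public static readonly Parser<Literal> WholeInput =
        input => new[] { new Result<Literal>(new Literal { Text = input }, "") };

    public static void Main()
    {
        // Allowed because TValue is declared with 'out' and Literal derives from Node.
        Parser<Node> general = WholeInput;

        foreach (IResult<Node> result in general("abc"))
        {
            Console.WriteLine(result.Value); // prints "Literal(abc)"
        }
    }
}
```

One caveat worth keeping in mind: variance in C# only applies to reference types, so a Parser<char> cannot be assigned to a Parser<object> in the same way.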

Primitive parsers

Monadic parsers exploit function composition to build more complex parsers out of combinations of more primitive parsers. In the paper, three primitive parsers are presented as building blocks out of which all other parsers are built: result, zero and item.

In order to keep our C# parser monad more idiomatic relative to other existing monads such as IEnumerable<T> or IObservable<T>, I decided to rename these primitives as Return, Empty and Char, respectively. They are defined below as functions of a static class Parser, much like the enumerable extension methods are defined as part of static class Enumerable:


public static partial class Parser
{
    public static Parser<TValue> Return<TValue>(TValue value)
    {
        return input => EnumerableEx.Return(new Result<TValue>(value, input));
    }

    public static Parser<TValue> Empty<TValue>()
    {
        return input => Enumerable.Empty<IResult<TValue>>();
    }

    public static Parser<char> Char()
    {
        return input => input.Take(1)
                             .Select(x => new Result<char>(x, input.Substring(1)));
    }
}

The Return method creates a parser that succeeds by returning the single result value without consuming any of the input string. In order to create a collection with a single result, I have used the Return method over IEnumerable<T> defined in the Interactive Extensions for the .NET framework (Ix). In case you don't want to include such a large library just for this method definition, its implementation is quite trivial:

static class EnumerableEx
{
    internal static IEnumerable<TValue> Return<TValue>(TValue value)
    {
        yield return value;
    }
}

The Empty method creates a parser that always fails (i.e. produces an empty list of results), in a very close parallel to Enumerable.Empty.

Finally, the Char method returns a parser that consumes a single character from the input string, and returns that character as the result value. Note that this implementation is potentially a very inefficient way to consume single characters from large input strings: every call to Substring copies the remaining input, so consuming a long string one character at a time takes quadratic time overall. For now, however, we are focusing on clarity rather than performance. Indeed, one advantage of monadic parsers is that if we later come back and reimplement Char using a more efficient technique, the entire library of parsers will automatically benefit from the added performance.
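
As a quick sanity check, the sketch below applies each of the three primitives to a sample input and prints the resulting list. The Show helper and the Demo class are hypothetical additions for display purposes, and a one-element array stands in for EnumerableEx.Return so that the snippet does not depend on Ix.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public delegate IEnumerable<IResult<TValue>> Parser<out TValue>(string input);

public interface IResult<out TValue>
{
    TValue Value { get; }
    string Tail { get; }
}

public class Result<TValue> : IResult<TValue>
{
    public Result(TValue value, string tail) { Value = value; Tail = tail; }
    public TValue Value { get; private set; }
    public string Tail { get; private set; }
}

public static class Parser
{
    public static Parser<TValue> Return<TValue>(TValue value)
    {
        // Assumption: a one-element array replaces EnumerableEx.Return here.
        return input => new[] { new Result<TValue>(value, input) };
    }

    public static Parser<TValue> Empty<TValue>()
    {
        return input => Enumerable.Empty<IResult<TValue>>();
    }

    public static Parser<char> Char()
    {
        return input => input.Take(1)
                             .Select(x => new Result<char>(x, input.Substring(1)));
    }
}

public static class Demo
{
    // Hypothetical helper: formats a result list as [(value, "tail"), ...].
    public static string Show<T>(IEnumerable<IResult<T>> results)
    {
        return "[" + string.Join(", ",
            results.Select(r => "(" + r.Value + ", \"" + r.Tail + "\")")) + "]";
    }

    public static void Main()
    {
        Console.WriteLine(Show(Parser.Return(42)("abc")));   // prints [(42, "abc")]
        Console.WriteLine(Show(Parser.Empty<int>()("abc"))); // prints []
        Console.WriteLine(Show(Parser.Char()("abc")));       // prints [(a, "bc")]
        Console.WriteLine(Show(Parser.Char()("")));          // prints []
    }
}
```

Note that Char fails gracefully on an empty input: Take(1) yields nothing, so Substring(1) is never evaluated.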

Parser combinators

The parsers defined above will not be very useful until we have provided a solid foundation for combining them into more complex parsers. In monadic parsers, Hutton and Meijer introduce the bind function as a way to integrate sequencing of parsers with processing of their result values:

bind :: Parser a -> (a -> Parser b) -> Parser b
p `bind` f = \inp -> concat [f v inp' | (v,inp') <- p inp]


In the C# world, bind is more commonly known as SelectMany, which is also one of the cornerstones of the IEnumerable<T> monad. The above signature (and implementation) can be translated into idiomatic C# by the following:

public static Parser<TResult> SelectMany<TValue, TResult>(
    this Parser<TValue> parser,
    Func<TValue, Parser<TResult>> parserSelector)
{
    return input => parser(input).SelectMany(
           result => parserSelector(result.Value)(result.Tail));
}

Following the monadic parsers paper, the above definition can be interpreted as follows. First, we apply parser to the input and obtain a collection of results (TValue, string). Notice that we also have a function parserSelector that takes a value and returns a parser. This suggests that what we want to do next is simply to apply that function to each result value (and unconsumed input string) in turn. In this way, for each result we get a collection of new results, so we end up with a collection of collections.

In order to keep the operator inside the monad (i.e. without creating any additional nested structures) we need to flatten this collection of collections. That way, the combined function returns a flat IEnumerable<IResult<T>>, and is therefore itself a Parser<T>, rather than returning a nested IEnumerable<IEnumerable<IResult<T>>>. This is important because we can treat the result just as any other parser, without thinking of how to deal with the details of some complicated nested structure (i.e. Parser<T> is our monad). Another way to think about this monadic operator is that it allows each of the results of the first parser to be made directly available for the second parser to process.

The easiest way to achieve all of this is simply to apply the existing Enumerable.SelectMany to the collection of results from the parser. This method already expects us to define a function that maps every element in the collection to a new collection (i.e. from one element, select many elements) and it also takes care of concatenating all of the output collections into one final grand collection.
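
The following sketch shows the combinator in use by sequencing two Char parsers. The TwoChars parser and the Demo class are hypothetical names chosen for this example, and a one-element array again stands in for EnumerableEx.Return.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public delegate IEnumerable<IResult<TValue>> Parser<out TValue>(string input);

public interface IResult<out TValue>
{
    TValue Value { get; }
    string Tail { get; }
}

public class Result<TValue> : IResult<TValue>
{
    public Result(TValue value, string tail) { Value = value; Tail = tail; }
    public TValue Value { get; private set; }
    public string Tail { get; private set; }
}

public static class Parser
{
    public static Parser<TValue> Return<TValue>(TValue value)
    {
        // Assumption: a one-element array replaces EnumerableEx.Return here.
        return input => new[] { new Result<TValue>(value, input) };
    }

    public static Parser<char> Char()
    {
        return input => input.Take(1)
                             .Select(x => new Result<char>(x, input.Substring(1)));
    }

    public static Parser<TResult> SelectMany<TValue, TResult>(
        this Parser<TValue> parser,
        Func<TValue, Parser<TResult>> parserSelector)
    {
        return input => parser(input).SelectMany(
               result => parserSelector(result.Value)(result.Tail));
    }
}

public static class Demo
{
    // Hypothetical parser: consumes two characters and returns them as a string.
    public static Parser<string> TwoChars()
    {
        return Parser.Char().SelectMany(
               first => Parser.Char().SelectMany(
               second => Parser.Return(new string(new[] { first, second }))));
    }

    public static void Main()
    {
        foreach (var r in TwoChars()("abc"))
        {
            Console.WriteLine("{0} | {1}", r.Value, r.Tail); // prints "ab | c"
        }

        // With only one character available the second Char fails,
        // so the whole sequence produces no results.
        Console.WriteLine(TwoChars()("a").Any()); // prints "False"
    }
}
```

The second call illustrates how failure propagates: an empty result list at any step leaves nothing for the following steps to process.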

Query comprehension syntax

Following the implementation of the monadic bind (or SelectMany), Hutton and Meijer proceed to define the sat function in terms of bind, which allows parsing characters that satisfy a given predicate. At this point, however, it seemed to me more idiomatic to implement the generic Where combinator, which tests values from any parser against a defined predicate function:

public static Parser<TValue> Where<TValue>(
    this Parser<TValue> parser,
    Func<TValue, bool> predicate)
{
    return input => parser(input).Where(result => predicate(result.Value));
}

Again, we can simply make use of the matching Enumerable combinator applied to the collection of results from the parser in order to make the implementation trivial.
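
For comparison with the paper, a counterpart of sat can be recovered in one line on top of Char and Where. The name Satisfy and the Lower parser below are my own hypothetical choices, given only as a sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public delegate IEnumerable<IResult<TValue>> Parser<out TValue>(string input);

public interface IResult<out TValue>
{
    TValue Value { get; }
    string Tail { get; }
}

public class Result<TValue> : IResult<TValue>
{
    public Result(TValue value, string tail) { Value = value; Tail = tail; }
    public TValue Value { get; private set; }
    public string Tail { get; private set; }
}

public static class Parser
{
    public static Parser<char> Char()
    {
        return input => input.Take(1)
                             .Select(x => new Result<char>(x, input.Substring(1)));
    }

    public static Parser<TValue> Where<TValue>(
        this Parser<TValue> parser,
        Func<TValue, bool> predicate)
    {
        return input => parser(input).Where(result => predicate(result.Value));
    }

    // Hypothetical counterpart of the paper's sat, built from Char and Where.
    public static Parser<char> Satisfy(Func<char, bool> predicate)
    {
        return Char().Where(predicate);
    }
}

public static class Demo
{
    // Hypothetical parser: accepts a single lower-case ASCII letter.
    public static readonly Parser<char> Lower =
        Parser.Satisfy(c => 'a' <= c && c <= 'z');

    public static void Main()
    {
        Console.WriteLine(Lower("abc").Single().Value); // prints "a"
        Console.WriteLine(Lower("Abc").Any());          // prints "False"
    }
}
```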

At this point, we might as well also implement the Select method, which will allow us to project parsing results directly into other types:

public static Parser<TResult> Select<TValue, TResult>(
    this Parser<TValue> parser,
    Func<TValue, TResult> selector)
{
    return input => parser(input).Select(
           result => new Result<TResult>(selector(result.Value), result.Tail));
}
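
As an illustration of Select, the sketch below turns a parsed digit character into its numeric value. The DigitValue parser and the Demo class are hypothetical names used only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public delegate IEnumerable<IResult<TValue>> Parser<out TValue>(string input);

public interface IResult<out TValue>
{
    TValue Value { get; }
    string Tail { get; }
}

public class Result<TValue> : IResult<TValue>
{
    public Result(TValue value, string tail) { Value = value; Tail = tail; }
    public TValue Value { get; private set; }
    public string Tail { get; private set; }
}

public static class Parser
{
    public static Parser<char> Char()
    {
        return input => input.Take(1)
                             .Select(x => new Result<char>(x, input.Substring(1)));
    }

    public static Parser<TValue> Where<TValue>(
        this Parser<TValue> parser,
        Func<TValue, bool> predicate)
    {
        return input => parser(input).Where(result => predicate(result.Value));
    }

    public static Parser<TResult> Select<TValue, TResult>(
        this Parser<TValue> parser,
        Func<TValue, TResult> selector)
    {
        return input => parser(input).Select(
               result => new Result<TResult>(selector(result.Value), result.Tail));
    }
}

public static class Demo
{
    // Hypothetical parser: reads one ASCII digit and projects it to an int.
    public static Parser<int> DigitValue()
    {
        return Parser.Char()
                     .Where(c => '0' <= c && c <= '9')
                     .Select(c => c - '0');
    }

    public static void Main()
    {
        var r = DigitValue()("42").Single();
        Console.WriteLine("{0} | {1}", r.Value, r.Tail); // prints "4 | 2"
    }
}
```

The projection only touches the Value of each result; the Tail is passed through unchanged, so Select never consumes input on its own.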

Indeed, we can extend the previous implementation of SelectMany to also support projection of the results from the monadic combinator by quite simply applying Select to the intermediate list of results:

public static Parser<TResult> SelectMany<TValue, TIntermediate, TResult>(
    this Parser<TValue> parser,
    Func<TValue, Parser<TIntermediate>> parserSelector,
    Func<TValue, TIntermediate, TResult> resultSelector)
{
    return input => parser(input).SelectMany(
           result => parserSelector(result.Value)(result.Tail).Select(
           iresult => new Result<TResult>(
               resultSelector(result.Value, iresult.Value),
               iresult.Tail)));
}

Now, the amazing thing to realize is that by implementing this set of monadic parser combinators (Select, Where and SelectMany), we have implemented precisely the methods that the compiler requires in order to enable the from, where and select clauses of LINQ query comprehension syntax over our Parser<T> monad.

Basically, this means we can now write more complex parsers by writing expressions like so:

var digit = from c in Parser.Char()
            where '0' <= c && c <= '9'
            select c;
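
To see which combinators the compiler actually calls, here is a sketch that places the query next to its method-chain form and adds a query with two from clauses. The DigitChained and TwoDigits parsers and the Demo class are hypothetical names introduced for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public delegate IEnumerable<IResult<TValue>> Parser<out TValue>(string input);

public interface IResult<out TValue>
{
    TValue Value { get; }
    string Tail { get; }
}

public class Result<TValue> : IResult<TValue>
{
    public Result(TValue value, string tail) { Value = value; Tail = tail; }
    public TValue Value { get; private set; }
    public string Tail { get; private set; }
}

public static class Parser
{
    public static Parser<char> Char()
    {
        return input => input.Take(1)
                             .Select(x => new Result<char>(x, input.Substring(1)));
    }

    public static Parser<TValue> Where<TValue>(
        this Parser<TValue> parser,
        Func<TValue, bool> predicate)
    {
        return input => parser(input).Where(result => predicate(result.Value));
    }

    public static Parser<TResult> SelectMany<TValue, TIntermediate, TResult>(
        this Parser<TValue> parser,
        Func<TValue, Parser<TIntermediate>> parserSelector,
        Func<TValue, TIntermediate, TResult> resultSelector)
    {
        return input => parser(input).SelectMany(
               result => parserSelector(result.Value)(result.Tail).Select(
               iresult => new Result<TResult>(
                   resultSelector(result.Value, iresult.Value),
                   iresult.Tail)));
    }
}

public static class Demo
{
    public static readonly Parser<char> Digit =
        from c in Parser.Char()
        where '0' <= c && c <= '9'
        select c;

    // Method-chain form of the query above: the trailing "select c"
    // after a where clause does not produce a Select call.
    public static readonly Parser<char> DigitChained =
        Parser.Char().Where(c => '0' <= c && c <= '9');

    // Two from clauses are translated into the three-argument SelectMany.
    public static readonly Parser<int> TwoDigits =
        from tens in Digit
        from ones in Digit
        select (tens - '0') * 10 + (ones - '0');

    public static void Main()
    {
        var r = TwoDigits("42!").Single();
        Console.WriteLine("{0} | {1}", r.Value, r.Tail);     // prints "42 | !"
        Console.WriteLine(TwoDigits("4x").Any());            // prints "False"
        Console.WriteLine(DigitChained("7").Single().Value); // prints "7"
    }
}
```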

While digit is not exactly the most complicated of parser combinators, in the next post we will make use of query comprehension syntax freely to enable chaining and sequencing of multiple parser results in order to allow for readable implementations of more advanced (and useful) parsers.