Internal Optimizations

In this section we'll look at internal optimizations that are invisible to the end user.


Our current implementation of delay re-evaluates the delayed Parser every time it is encountered in the interpreter. This is wasteful as the Parser will not change between evaluations. We can store the value after we first evaluate it, a technique known as memoization.

Scala's lazy val makes memoization straightforward. A lazy val is like a val except the right-hand side expression, which produces the value, is not evaluated until the lazy val is first used. The value is then saved and reused on subsequent uses. The following example shows these semantics. Note the println is not evaluated until the first use, but later uses do not re-evaluate the println.

lazy val a = {
  println("Evaluating")
  1
}

a
// Evaluating
// res0: Int = 1

a
// res1: Int = 1

(Make sure you understand the difference between the behaviour of val, lazy val, and def. If you're unsure of the difference, replace lazy val first with val and then with def and see how the output changes.)
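To make the comparison concrete, here's a sketch using counters to record how many times each right-hand side runs (the names eager, deferred, and repeated are mine, for illustration):

```scala
// Counters record how many times each right-hand side is evaluated.
var valCount = 0
var lazyCount = 0
var defCount = 0

val eager = { valCount += 1; 1 }          // evaluated immediately, exactly once
lazy val deferred = { lazyCount += 1; 1 } // not evaluated until first use
def repeated = { defCount += 1; 1 }       // evaluated on every use

assert(valCount == 1 && lazyCount == 0 && defCount == 0)

val use1 = deferred + deferred // first use triggers evaluation, second reuses it
val use2 = repeated + repeated // def runs the body both times

assert(lazyCount == 1) // lazy val memoizes
assert(defCount == 2)  // def does not
```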

With a lazy val we can change the reification of delay, which in my implementation is a case class called ParserDelay, to memoize the delayed parser. I've shown how to do this below, but try to do it yourself before reading my solution.

final case class ParserDelay[A](parser: () => Parser[A]) extends Parser[A] {
  lazy val force: Parser[A] = parser()
}

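To see the memoization in action, here's a self-contained sketch with a stand-in Parser trait (ParserPure is my own minimal addition for illustration; the book's full ADT differs):

```scala
trait Parser[A]
final case class ParserPure[A](value: A) extends Parser[A]

final case class ParserDelay[A](parser: () => Parser[A]) extends Parser[A] {
  lazy val force: Parser[A] = parser()
}

var evaluations = 0
val delayed = ParserDelay { () =>
  evaluations += 1
  ParserPure(1)
}

assert(evaluations == 0) // constructing the delay evaluates nothing
delayed.force
delayed.force
assert(evaluations == 1) // the thunk ran once; force reuses the memoized result
```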
Now we should verify this change actually improves performance. Parsing expressions, which we looked at in the previous section, depends on delayed parsers and therefore makes a good benchmark. Be careful when benchmarking this change. You may need to change how orElse is implemented in addition to Parser.delay.

Below are the results from my experiments. I don't see any consistent improvement from memoization; the memoized and non-memoized parsers are within the error bounds of each other. The results are surprising to me; I was expecting to see more effect from memoization. I tried a lot of different implementations to try to find a consistent performance difference, but was unable to do so.

[info] Benchmark                                               Mode  Cnt       Score      Error  Units
[info] ExpressionBenchmark.exprParser                         thrpt   25   30904.951 ±  612.385  ops/s
[info] ExpressionBenchmark.exprMemoizedParser                 thrpt   25   31923.289 ±  520.764  ops/s
[info] ExpressionBenchmark.exprWhereMemoizedParser            thrpt   25  123323.837 ± 3973.923  ops/s
[info] ExpressionBenchmark.exprWhereParser                    thrpt   25  128436.999 ± 3552.719  ops/s
[info] ExpressionBenchmark.exprWhereAccumulateParser          thrpt   25  182851.316 ± 3105.287  ops/s
[info] ExpressionBenchmark.exprWhereAccumulateMemoizedParser  thrpt   25  183370.477 ± 8889.627  ops/s

A failing experiment is still a useful experiment. It shows that my model of program performance doesn't match reality, and I need to revise my model. In this case the issue is something relatively mundane. We're still avoiding some work with memoization, but the work is so trivial that it doesn't meaningfully affect performance. When we create a recursive parser, like the one below, we're creating a circular data structure. Then, when we evaluate the delay, we're following a reference back into the data structure. The performance difference between evaluating a function to produce this reference, and just directly getting the reference, is very small.

val term: Parser[Expr] =
  (number, mul, Parser.delay(term))
    .mapN((f, _, t) => Expr.literal(f) * t)
    .orElse(number.map(Expr.literal _))

In more complicated situations we might want more detail about how the code runs inside the JVM. A good place to start is with a profiler such as VisualVM. This may not give enough detail, in which case we can look at the generated output code. This blog post goes into more detail on how this can be done.


In many cases we can express the same functionality using charIn or charWhere. This is a problem when it comes to reasoning about performance. We've seen that charWhere can be much more efficient than charIn, but it's not obvious to the user that this is the case. We'd ideally like equivalent code to have roughly equivalent performance no matter how it's written, rather than requiring the developer to know obscure implementation-specific rules.

We can achieve this by rewriting charIn to charWhere. In fact we can do better, by rewriting any chain of orElse and Parser.char into charWhere. For example, if we see

Parser.char('a').orElse(Parser.char('b'))
we can rewrite it to

Parser.charWhere(x => x == 'a' || x == 'b')

This type of optimization, where we rewrite code into a functionally equivalent but more efficient alternative, is the core of what optimizing compilers do.
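By "functionally equivalent" I mean the two forms accept exactly the same input. At the level of character predicates that's easy to check exhaustively; a sketch with plain functions and sets, not the Parser type:

```scala
// The orElse chain of char('a') and char('b') matches the set {'a', 'b'};
// the rewritten charWhere predicate must accept exactly the same characters.
val chainChars: Set[Char] = Set('a', 'b')
val rewritten: Char => Boolean = x => x == 'a' || x == 'b'

// Check every character in the Latin-1 range agrees.
assert(('\u0000' to '\u00ff').forall(c => rewritten(c) == chainChars.contains(c)))
```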

I want to be able to benchmark this optimization against the current implementation, so I'm going to implement it in a separate object in a separate file. Because we've reified Parser, and hence it's represented as an algebraic data type, we can easily do this.

My implementation of the optimization has three parts:

1. a method that collects the characters matched by a chain of orElse and Parser.char, failing if the parser has any other shape;
2. a method that converts the collected characters into a charWhere parser; and
3. a structural recursion that applies this rewrite wherever it occurs within a parser.

Have a go at implementing it yourself. It should use techniques you already know, except you're likely to run into one issue: the type system.

Attempting to convince the type system that the rewrite is valid cannot be done with just what we have available. The issue is we are attempting to replace, at run-time, an OrElse[A,B] with a CharWhere, and the compiler doesn't know that A and B will both be Char and hence that this is ok. We know because we check this is the case, but we cannot represent this check in a way the compiler understands. Remember types exist at compile-time and this is a run-time operation. We could transfer information from compile-time to run-time using so-called "implicit evidence" but doing so requires changing the interface, which goes against the idea that this is a purely internal change.

The solution is to throw in an asInstanceOf call, to tell the compiler "I know you think this Parser[Char] is not the same as the Parser[A] you're looking for, but trust me". In general, you should never use this method. In almost all the cases I've seen it's a sign of a bad design. It's justified here because we're switching levels. The type system ensures correct construction of the parsers at compile-time. We're rewriting them at run-time, and have done our own checks to ensure the rewrite is correct.

If you struggle, my implementation is below. Once you have an implementation run some benchmarks. In my benchmarks, results of which are also below, this transformation results in code that is about 25 times faster!

Here's the implementation:

import cats.implicits._
import parser.Parser._

object Optimize {
  def orElseCharToCharWhere[A](parser: Parser[A]): Parser[A] = {
    def collectCharIn[A](parser: Parser[A]): Option[Set[Char]] =
      parser match {
        case ParserChar(value) => Set(value).some
        case ParserOrElse(left, right) =>
          (collectCharIn(left), collectCharIn(right)).mapN((l, r) => l ++ r)
        case _ => none[Set[Char]]
      }

    def toCharWhere(chars: Set[Char]): Parser[Char] =
      Parser.charWhere(char => chars.contains(char))

    parser match {
      case p: ParserOrElse[a] =>
        collectCharIn(p)
          .map(chars => toCharWhere(chars).asInstanceOf[Parser[a]])
          .getOrElse(p)
      case ParserMap(source, f) => ParserMap(orElseCharToCharWhere(source), f)
      case ParserRepeatBetween(source, min, max, monoid) =>
        ParserRepeatBetween(orElseCharToCharWhere(source), min, max, monoid)
      case ParserRepeat(source, monoid) =>
        ParserRepeat(orElseCharToCharWhere(source), monoid)
      case ParserRepeatAccumulator(source, min, accumulator) =>
        ParserRepeatAccumulator(orElseCharToCharWhere(source), min, accumulator)
      case ParserProduct(left, right) =>
        ParserProduct(orElseCharToCharWhere(left), orElseCharToCharWhere(right))
      case ParserAnd(left, right, semigroup) =>
        ParserAnd(orElseCharToCharWhere(left), orElseCharToCharWhere(right), semigroup)
      case ParserFlatMap(source, f) =>
        ParserFlatMap(orElseCharToCharWhere(source), f)
      case other => other
    }
  }
}

Here are the benchmark results:

[info] Benchmark                                          Mode  Cnt        Score       Error  Units
[info] ToCharWhereBenchmark.parser                       thrpt   25   104026.788 ±   565.152  ops/s
[info] ToCharWhereBenchmark.parserOrElseCharToCharWhere  thrpt   25  2565950.528 ± 54878.768  ops/s

There are many more rewrites you could attempt to improve performance. For example, you could implement map-fusion, transforming parser.map(f).map(g) into parser.map(f.andThen(g)).
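As a sketch, here is map-fusion on a toy reified parser. I've made it monomorphic (functions Int => Int only) to sidestep the type-system issue discussed above, and the names P, PInput, PMap, fuse, and run are all mine, not from the real implementation:

```scala
sealed trait P { def map(f: Int => Int): P = PMap(this, f) }
case object PInput extends P
final case class PMap(source: P, f: Int => Int) extends P

// Rewrite source.map(f).map(g) into source.map(f.andThen(g)),
// repeating until no nested maps remain at the top of the chain.
def fuse(p: P): P =
  p match {
    case PMap(PMap(source, f), g) => fuse(PMap(source, f.andThen(g)))
    case other                    => other
  }

// A tiny interpreter so we can check the rewrite preserves behaviour.
def run(p: P, input: Int): Int =
  p match {
    case PInput          => input
    case PMap(source, f) => f(run(source, input))
  }

def countMaps(p: P): Int =
  p match {
    case PMap(source, _) => 1 + countMaps(source)
    case _               => 0
  }

val parser = PInput.map(_ + 1).map(_ * 2)
val fused  = fuse(parser)

assert(countMaps(parser) == 2)
assert(countMaps(fused) == 1)           // two maps fused into one
assert(run(parser, 3) == run(fused, 3)) // both produce (3 + 1) * 2 == 8
```

The rewrite is safe because function composition is exactly what nested maps compute; the interpreter check above confirms the fused parser produces the same result.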