JVM Compiler Construction with Scala and BCEL, Part 1.5

July 9th, 2008

The second part of my Scala compiler construction tutorial has been a long time coming. This post is, unfortunately, not the second part of the article — although that is coming soon. Honest.

Since part 1 was published, Scala 2.7 has been released which — among other things — introduced changes to the parser combinator library. Changes that meant the source code from part 1 will not compile in Scala 2.7. Sorry about that. Updated, working code can be found at the end of this post.

So what exactly has changed? Well, let’s see …

keyword() no longer discards its result

A bit of a refresher first:

In both Scala 2.6 and 2.7, the keyword implicit is called whenever you use a string in your “grammar rules”. For example:

def sum = expr ~ "+" ~ expr

Is the equivalent of:

def sum = expr.~(keyword("+").expr)

In Scala 2.6’s parser combinator library, the result of the keyword() call was actually the UnitParser — that is, a parser that would discard its result. At the time, that meant we could use the tilde operator (”~”) to create a sequenced parser and everything was fine:

def sum = expr ~ "+" ~ expr ^^
    ((left : Expr, right : Expr) => Sum(left, right))

In 2.7, the keyword() implicit returns a Parser[String] rather than a UnitParser. This means we have to either deal with the newly introduced tokens as follows:

def sum = expr ~ "+" ~ expr ^^
    ((left : Expr, op : String, right : Expr) => Sum(left, right))

Or alternatively …

Use <~ and ~> to indicate the important parts of a parse rule

Let’s start with something simple based on the 2.6 combinators:

def bracketExpr = "(" ~ expr ~ ")" ^^ ((e : Expr) => e)

Again, in 2.7 we know that keyword() is no longer a UnitParser, so we have to deal with it like so:

def bracketExpr = "(" ~ expr ~ ")" ^^
    ((l : String, e : Expr, r : String) => e)

Alright, this compiles and does what we expect. But why should we keep those strings around if we don’t need them? They sure do clutter up the code a whole bunch.

<~ and ~> can be used to include or discard the result of a given parser. <~ builds a parser that takes the result of two parsers (the parsers to its left and right) and builds a parser that takes the result of both and discards the result of the one on the right. Conversely, ~> yields a parser that takes the result of both parsers and discards the result on the left.

So, using these two operators we can rewrite bracketExpr as follows:

def bracketExpr = "(" ~> expr <~ ")" ^^
    ((e : Expr) => e)

Ah. Much better. :)

The ^^^ operator

I had to go digging in the Scala source code to work out what exactly this one does.

First, let’s take a look at some 2.6 code:

def simpleExpr = term * (
    "+" ^^ ((x : Expr, y : Expr) => Add(x, y)) |
    "-" ^^ ((x : Expr, y : Expr) => Sub(x, y))
)

In 2.6, this can parse zero or more repetitions of “term” interleaved by “+” and “-” (check out part 1 if you need a refresher on how the * combinator works). In 2.7 it’s a compile-time error because of the fact keyword() is no longer a UnitParser (are you seeing a pattern here? :) ), and ^^ is trying to pass the resulting String on to the anonymous method in each case.

If we use the ^^^ operator here, we can effectively discard the result of the keyword parse, and build a parser that uses a simple anonymous method to parse the current pair of terms:

def simpleExpr = term * (
    "+" ^^^ ((x : Expr, y : Expr) => Add(x, y)) |
    "-" ^^^ ((x : Expr, y : Expr) => Sub(x, y))
)

Exactly what we’re after. This compiles and behaves as expected.

What else?

There may be more changes to the parser combinator library which I haven’t covered here, but I’m not going to go looking for any more changes since the updated code seems to work just fine. This should be enough to at least understand the updated code for the compiler described in part 1 without needing to deal with any cryptic compiler errors.

Finally, the new code!

Special thanks to Harshad for sending through working code for 2.7 ages ago which I never got around to posting here. This code, along with the Scala API docs, was used to figure out just what had changed since 2.6. The code below is derived from some code he sent to me a few months back.

I’m really sorry this has been so long in the making. I’ll try to get around to writing the “real” part two of this article. In the meantime, here’s the updated code. Thanks! Please post or email any comments or questions.

Categories: Compiler Construction with Scala and BCEL, Compilers, Functional Programming, Scala, Software Development | 3 Comments

Implicits for the Masses

July 5th, 2008

I just finished reading Tony Morris’s blog post on Scala implicits and saw the following comment:

To me, implicits look a lot like global variables, which is why I don’t like them.

Or maybe I missed something?

Uh, yep — you missed something :) I was going to reply in the comments to Tony’s post but as per usual it became far too long, so I’m posting it here instead.

Let’s try something a little less abstract than what was covered in Tony’s article. I haven’t actually tested any of this code to check that it compiles (lazy, lazy, lazy!), but it should be pretty straightforward & clarify what implicits can be used for:

class Person(name : String) {}
class Conversation(person1 : Person, person2 : Person) {
    def greet() = {
       println(person1.name + " says: Hello, " + person2.name)
       println(person2.name + " says: Oh, hello " + person1.name)
    }
}
implicit def stringToPerson(name : String) : Person = new Person(name)

With this implicit in effect, we can create a Conversation without even mentioning the Person class:

val conv = new Conversation("John", "Mary")
conv.greet()

See how we passed two strings to the Conversation constructor, even though it’s supposed to take two Persons?

Global variables don’t really come into it, it’s more about implicit type conversion.

Why would you do this? Well for one, it lets you create faux multimethods:

class FrameWrapper(frame : JFrame) {
    def title_=(s : String) = frame.setTitle(s)
    def visible_=(b : Boolean) = frame.setVisible(b)
}
implicit def wrapFrame(frame : JFrame) : FrameWrapper =
    new FrameWrapper(frame)

This means we can call FrameWrapper methods on a JFrame and the Scala compiler will figure out what we mean:

val frame = new JFrame
frame.title = "My Application"
frame.visible = true

Normally you’d have to call setTitle/setVisible at the second and third lines of the code above because JFrame doesn’t have any public attributes called “title” and “visible”. However, with the implicits in effect the Scala compiler will wrap your JFrame in a FrameWrapper instance to call that method. A weak example to be sure, but it really is a nice way to “scala-ize” any existing Java code. All this is statically checked too, so it won’t try to set your children on fire when you’re not looking.

Nifty eh?

Categories: Scala, Software Development | 15 Comments

Anonymous Type Acrobatics With Scala

November 12th, 2007

As mentioned in passing at the beginning of a recent article, one of the coolest things made possible by Scala’s expressive type system are anonymous type declarations. The idea was brought to my attention by Neil Bartlett in this article before I even started looking at Scala in any depth. Neil’s short-but-sweet post is most definitely worth reading, but there’s a few more neat tricks to be gleaned from his little example. Let me paraphrase his original code:

def printName(o: { def getName: String }) = Console.println(o.getName)

Here, the printName method can be passed an instance of any type, so long as it implements the anonymous interface we provide here (i.e. “any object with a getName method that returns a String”). Neat.

The most obvious question that occured to me is this: does the statically-checked duck-typing rule hold true for data structures as well as functions? For example, we have a list of objects of any old type, so long as they satisfy the interface constraints of our anonymous type? You bet.

class Foo {
  def getName: String = "Foo"
}

class Bar {
  def getName: String = "Bar"
}

val items = List[{ def getName: String }](new Foo, new Bar)

// despite Foo and Bar being completely types, the compiler "knows"
// that they will both implement our anonymous type declaration
items.map(_.getName).foreach(Console.println)

Very cool stuff. No longer do we need ugly wrapper classes/proxies to work around the shortcomings of more primitive type systems.

Oh and if the anonymous type syntax irks you, fear not: types can be aliased:

// Foo and Bar declared elsewhere ...

type HasName = { def getName: String }

def printName(o: HasName) = Console.println(o.getName)

val items = List[HasName](new Foo, new Bar)
items.foreach(printName)

Irrespective of which side of the dynamic/static fence you sit, you have to admit that Scala is an interesting compromise …

Categories: Scala, Software Development | 1 Comment

JVM Compiler Construction with Scala and BCEL, Part 1

November 3rd, 2007

Writing a compiler is no simple undertaking. I’ve spent a few years on and off (mostly with the Python source code) trying to understand compiler internals on a practical level, and I’m only really only just beginning to learn enough to be able to manage some of the basics. So, in this article I present a detailed discussion of a simple compiler implementation that targets the Java Virtual Machine. I’ve been wanting to write a tutorial like this for a few months, because it’s exactly the sort of thing I wish I could have read a two or three years ago. So this is my little gift to all the other aspiring compiler nerds out there struggling (like me!) with the learning curve of compiler construction :)

Prerequisites

You’ll probably need at least a fundamental understanding of Scala and/or Java to be able to follow the code samples in this article, but I promise to try and keep the code as simple as possible. :) I’d also recommend that you have at least a high-level understanding of how compilers tend to be structured internally i.e. if you don’t know what a “scanner” or a “parser” is, you’re going to get lost very quickly.

Obviously, you’ll want Scala installed if you plan to run the code. You’ll also want some sort of *nix toolchain available if you plan on using the simple build scripts I’ve provided. It’s not that hard to compile it by hand if you’re stuck on Windows or something like that.

Scala, the language

I haven’t covered Scala a whole lot on this blog. To be honest, I only started using Scala about two weeks ago. I’m already falling in love with it – you can think of it as a more expressive Java that straddles the fence the computing industry seems to have erected between functional and object-oriented languages. Yes, yes, there’s classes in Ocaml, the other functional/OO language you’ll hear me ranting about on this blog – but even in Ocaml, OOP is generally shunned in favour of a functional approach. Scala, on the other hand, is a beautiful blend of the two. I won’t go into the details of some of the more elegant aspects of Scala, but here’s just a few little things about Scala that I find especially nice:

  • Scala is statically typed, but the excessive verbosity of Java is a thing of the past thanks to Scala’s type inferencing
  • Scala has closures and anonymous methods
  • Scala has what I can only describe as anonymous types: Neil Bartlett called this statically-checked duck typing
  • Scala types can be aliased
  • Scala code can use any Java class
  • If some care is taken, Java code can easily use Scala classes.

Scala, the standard library

Scala comes with its own standard library (on top of the standard Java library) containing a slew of useful components. The Scala standard library comes with its own unit testing toolkit called SUnit, a database abstraction layer (think Scalafied JDBC), syntactic sugar wrappers around well-known Java collection classes, concurrent programming utilities and – the bit we’re interested in – a bunch of stuff useful to compiler construction.

For more information about the Scala standard library, please check out the Scala API documentation.

Our programming language

The source language will effectively be a glorified calculator with variables. The result of the final expression evaluated is to be printed to the screen. For example, something like this:

v = 5;
t = 10;
v + 6 * t;

Would output “65″.

High-level overview of a compiler

Irrespective of what language you write it in, the structure of a compiler tends to look something like this:

  1. Lexical analysis or Scanning - scan the source document to produce a stream of interesting tokens
  2. Syntax analysis or Parsing - transform the token stream produced by the previous phase into an abstract representation of the program structure
  3. Semantic analysis or Type Checking - analyse the in-memory representation of the compiled program to ensure it “makes sense”
  4. Code generation – generate the executable code from the type-checked representation of the program

The compiler presented by this tutorial won’t have an explicit “step 3″ – the language is simple enough that it should be impossible to write a syntactically correct, semantically invalid program.

Our abstract syntax tree

Our parser will be transforming the token sequence generated by the scanner in phase 1 to produce what’s called an Abstract Syntax Tree – a representation of the program as a structured tree. There’s plenty of literature available on ASTs out in the wild, so I’ll gloss over the details and provide you with the relevant Scala code for declaring the nodes in our tree:

abstract class Expr
case class Add(l: Expr, r: Expr) extends Expr
case class Sub(l: Expr, r: Expr) extends Expr
case class Mul(l: Expr, r: Expr) extends Expr
case class Div(l: Expr, r: Expr) extends Expr
case class Num(value: int) extends Expr
case class Store(id: String, r: Expr) extends Expr
case class Load(id: String) extends Expr
case class Program(body: List[Expr]) extends Expr

So for a simple example, imagine we have a simple program in our language that looks something like this:

3 + 5; 4 * 6 - 2;

We could manually express this as an AST using the above classes as follows:

Program(
  List(
    Add(Num(3), Num(5)),
    Sub(
      Mul(Num(4), Num(6)),
      Num(2)
    )
  )
)

Of course, building ASTs like this isn’t quite what we’re after. In practice, we break AST construction down into a set of simple rules using Scala’s parser combinators. But more on that later.

Finally, you will want to note that the concrete elements of the AST are implemented as Scala case classes. Case classes will allow us to use Scala’s pattern matching features to inspect the AST in the code generation phase. This makes it really, really easy conceptually to see how the code should be generated.

The standard lexer

Scala comes with a standard lexical analysis class (StdLexical) which is capable of producing tokens for a simple Scala-like language. More than enough for our particular needs here. StdLexical knows how to parse numeric values, so all we need to do is tell the scanner which delimiters are important (i.e. a few simple arithmetic operators):

type Tokens = StdLexical
val lexical = new StdLexical
lexical.delimiters ++= Set("(", ")", "+", "-", "*", "/", ";")

That’s all the Scala code we need to write to configure our lexical analyser! Assuming your language doesn’t do anything too weird, StdLexical will likely be able to take care of the rest of the scanning process for you.

Pseudo-EBNF using parser combinators

EBNF is the notation de rigeur for describing programming language syntax in the compiler construction world. In theory, once you have written an EBNF description of your programming language syntax, writing a parser for it should be reasonably simple: you simply transcribe the EBNF into your language of choice and away you go. Unfortunately it’s not always that simple in practice, so people resort to using tools like yacc or bison to generate their parsers using EBNF-esque pattern-matching. The downside to these tools is that most have a fairly big learning curve.

Along comes parser combinators. Using combinators, it suddenly becomes possible to use EBNF-like rules in the compiler implementation language – a sort of DSL for parsers, if you like. Combinators appear to be especially popular in functional languages like Haskell (via Parsec) – although I’m still not sure what’s intrinsically special about functional languages that would make them more appropriate for combinators. If anybody could clear that up for more it would be much appreciated. In any case, Scala provides a parser combinator library as part of its standard API. Here’s the Scala code describing our simple little language, which also acts as the parser for our language:

def program = ((expr ~ ";")+) ^^ ((e: List[Expr])  => Program(e))def expr = assign | simpleExpr

def assign = ident ~ "=" ~ expr ^^ flatten2(Store)

def simpleExpr = term * (
    "+" ^^ ((x: Expr, y: Expr) => Add(x, y))
  | "-" ^^ ((x: Expr, y: Expr) => Sub(x, y))
)

def term = factor * (
    "*" ^^ ((x: Expr, y: Expr) => Mul(x, y))
  | "/" ^^ ((x: Expr, y: Expr) => Div(x, y))
)

def factor: Parser[Expr] = ("(" ~ expr ~ ")"
  | ident ^^ ((s: String) => Load(s))
  | numericLit ^^ ((s: String) => Num(s.toInt)))

Please note that, although I haven’t been explicit about return types for all of these methods they all return a Parser[Expr] – that is, a parser for an element of our AST. In fact, pretty much every method call you see here (including the operators!) is returning a Parser of some sort. If it’s not obvious, parser combinators – despite the scary name – have a very simple task: that of chaining simple parsers together to form a slightly more complex parser. These slightly more complex parsers can then be combined together as well, to form a slightly more complex parser again. Eventually you wind up with a tree of parsers that match one-to-one with the rules you provide in your combinator “grammar”.

It’s also interesting to note here that precedence is actually being enforced using the nested parsing rules: a factor will be parsed before any other rules because the factor rule is nested deeper in the parsing hierarchy than any of the others.

Now it probably goes without saying, but there’s a lot going on under the hood here. I’ll spend the next few sections trying to explain just exactly what all this actually means, but before we start it’s important that you understand …

Scala operators are just method calls

Many of the operators you see in the above source code listing are actually method calls on the objects preceding them. For example:

def assign = ident ~ "=" ~ expr ^^ flatten2(Store)

Could actually be rewritten thus:

def assign = ident.~(keyword("=").~(expr.^^(flatten2(Store))))

Obviously the first example is much more readable, but knowing this will make understanding what’s going on in the above code examples just that little bit easier.

Another bit of Scala magic: the parser combinators know to implicitly convert String arguments into a UnitParser using the keyword method. keyword accepts a string argument – the token it “expects” – and then discards the result.

As a side note, methods with more traditional names can be used as operators too:

import java.util.ArrayList
val list = new ArrayList
list add "Test" // the same as list.add("Test")

Finally, Scala operators – being methods – can be trivially overridden:

class Monkey {
  var numBananas = 0
  def +=(numBananas: int): unit = this.numBananas += numBananas
}
val m = new Monkey
Console.println(m.numBananas)
m += 5
Console.println(m.numBananas)

The “sequence”, “one-or-more” and “apply” combinators

Let’s take a look at the first rule in our program.

def program = ((expr ~ ";")+) ^^ ((e: List[Expr])  => Program(e))

The parser for this rule is expressed by this fragment:

def program = ((expr ~ ";")+)

Which, in plain old English, means: “a program consists of one or more sequences of expr and a semi-colon.”. Those of you familiar with regular expressions will probably have already guessed that the + combinator means “one or more”. The ~ combinator has a slightly more subtle meaning: it’s used to combine parsers together in a sequence. That is, within the context of this rule, once we have finished parsing an expr we would expect to find a literal semi-colon. If we do not, it’s a syntax error.

The final operator we’re covering here is the ^^ operator, which takes the result of the previous parser, applies a function to it, and returns the result (which should be an element of our AST) . Let’s take a look at the whole rule again:

def program = ((expr ~ ";")+) ^^ ((e: List[Expr])  => Program(e))

So the result of the one-or-more combinator is a list of expressions (List[Expr]). If parsing of the one-or-many rule succeeds, the ^^ takes the result (the said list of expressions) and passes it to the function passed as the operand on the right. In this case, we’re wrapping the list of expressions in a Program AST element.

So that should really be all you need to know about the ~, + and ^^ combinators. Please note that the annoying excess of brackets I use here is more for clarity than anything else – you could rewrite this as:

def program = (expr ~ ";" +) ^^ ((e: List[Expr]) => Program(e))

The “alternative” combinator

Take the next rule in our parser:

def expr = assign | simpleExpr

A relatively simple rule compared to the previous one. The only thing of note here, really, is the | operator, which creates a parser that can parse either of its operands: in this case, an assign or a simpleExpr. You can specify as many alternatives as you want. e.g.

def expr = assign | simpleExpr | ifExpr | functionDef | classDef | lambda

The way it works is by simply attempting to parse the leftmost rule. If that fails, it tries the next rule along. If that fails, it tries the next one along again and so on and so forth, until there are no more alternatives (in which case we have a syntax error) or one of the rules are successfully parsed. The net effect is similar to a logical “or”.

Something else of note is that we’re not using the apply combinator here. We simply don’t need to in this case, since both assign and simpleExpr both return a Parser[Expr], so there’s no need to convert the result of the parse into an AST element.

Simplifying your code with flatten[2-5]

The next rule – the “variable assignment” rule – appears to be pretty straightforward:

def assign = ident ~ "=" ~ expr ^^ flatten2(Store)

You’ve seen ~ and ^^ before – we’re just parsing a sequence and applying a function. But what’s flatten2?

The Scala parser combinator library includes a utility class called ImplicitConversions. This includes a couple of functions useful for writing cleaner combinators, including flatten2. Calling flatten2 here returns a function that takes the two arguments resulting from the sequence (the ident String and the expr Expr) and passes them to our AST Store constructor. Rewriting this using typical anonymous methods isn’t quite so clean:

def assign = ident ~ "=" ~ expr ^^ ((id: String, e: Expr) => Store(id, e))

It’s still readable, but still obviously much cleaner in the first example. Feel free to use either approach.

If you need to use flatten with more than two arguments, the library also provides flatten3, flatten4 and flatten5.

(Note: It should also be possible to have the flatten* calls happening implicitly, but I haven’t quite worked out how yet)

The “zero-or-more with separator” combinator

The next combinator we’re going to cover is similar to the one-or-more combinator, with a slight twist. First, let’s take a look at the next parsing rule:

def simpleExpr = term * (
"+" ^^ ((x: Expr, y: Expr) => Add(x, y))
| "-" ^^ ((x: Expr, y: Expr) => Sub(x, y))
)

Let’s get rid of the application combinators and look at the core rule to simplify things:

term * ("+" | "-")

That should be much easier to read. Now, you might initially think that this rule can be described as “a simpleExpr is zero or more terms followed by a plus or minus”. However, that’s not the case. In actual fact, it means: “a simpleExpr is zero or more terms interleaved by a plus or minus”. In order to parse the former, it would need to be expressed as something like:

(term*) ~ ("+" | "-")

But you probably don’t want that, and I digress

Let’s take another look at the original rule:

def simpleExpr = term * (
"+" ^^ ((x: Expr, y: Expr) => Add(x, y))
| "-" ^^ ((x: Expr, y: Expr) => Sub(x, y))
)

If there is a single term, it will be returned as the result for simpleExpr. If there are multiple terms, separated by “+” or “-” literals, they will be folded into a single Expr by recursively parsing the terms into Add or Sub AST nodes depending on which separator is encountered at each step. For example, given this source code:

5 + 3 - 2;

First, the parser will see a term, which evaluates a Num:

Num(5)

Next, the parser will see the literal “+”, and wraps the Num we just parsed in an Add with the next evaluated term (another Num):

Add(Num(5), Num(3))

Finally, the parser will see the literal “-” before wrapping the previous Add in a Sub with the next evaluated term (again, another Num):

Sub(Add(Num(5), Num(3)), Num(2))

I hope this clears up a potentially hairy construct. It’s really pretty simple once you get used to it, just keep at it. :)

More of the same

The next rule is exactly the same as the previous rule, except it expects factors (instead of terms) interleaved by either “*” or “/” (instead of “+” and “-”), and returns Mul and Div AST nodes (instead of Add and Sub nodes).

def term = factor * (
    "*" ^^ ((x: Expr, y: Expr) => Mul(x, y))
  | "/" ^^ ((x: Expr, y: Expr) => Div(x, y))
)

No further explanation should be necessary, but feel free to query it in the comments if you have any problems.

Parsing numbers and variable identifiers

Earlier I discussed the fact that parser combinators (and, indeed, parsers in general) simply take rules for simple parsers and combine them together to make more complex parsers. The factor rule is the simplest of all the parsers in terms of complexity, although the syntax expressing it is a little daunting:

def factor: Parser[Expr] = ("(" ~ expr ~ ")"
  | ident ^^ ((s: String) => Load(s))
  | numericLit ^^ ((s: String) => Num(s.toInt)))

The first thing to note is that I’ve explicitly stated the return type: because our parser is recursive, we must explicitly tell the compiler what our return type is going to be in the most fundamental element of our compiler. Now let’s break that rule down into more manageable chunks for discussion:

"(" ~ expr ~ ")"

This parser forces expressions in brackets to be parsed before anything else. As soon as a factor starting with a bracket is found, parsing starts near the top level again at the expr rule. Once the inner expr has been parsed, we expect a closing bracket. Easy.

ident ^^ ((s: String) => Load(s))

The ident method is provided by the Scala parser combinator library, which can handle typical variable names. If you need to handle any weird and wonderful characters, you may need to write your own.

Anyway, this simple rule parses a variable identifier (expressed as a String) and wraps it in an AST Load node. The only time that identifiers are not wrapped in an AST Load is when we’re doing assignment (see assign), where we use a Store node instead. The difference between Load and Store is semantically important to the code generator. I’ll discuss that more in the next part of this tutorial.

numericLit ^^ ((s: String) => Num(s.toInt))

As above, numericLit is provided by the Scala parser combinator library. It can parse simple numeric values (including floating point numbers, I believe). It returns a string expression of the number, which we then convert to an int and pass to the Num constructor. All numeric literals in our program will be wrapped in Num AST nodes.

Looking at the original factor rule again:

def factor: Parser[Expr] = ("(" ~ expr ~ ")"
  | ident ^^ ((s: String) => Load(s))
  | numericLit ^^ ((s: String) => Num(s.toInt)))

It doesn’t look so daunting any more. In simple terms, a factor is any expression wrapped in brackets, a variable name, or a number.

Code generation

Just to get something on-screen, let’s generate some simple bytecode:

def generateCode(e: Expr): String = e match {
case Num(v) => "push " + v + "n"
case Ident(id) => pushString(id) + "derefn"
case Mul(l, r) => generateCode(r) + generateCode(l) + "muln"
case Div(l, r) => generateCode(r) + generateCode(l) + "divn"
case Add(l, r) => generateCode(r) + generateCode(l) + "addn"
case Sub(l, r) => generateCode(r) + generateCode(l) + "subn"
case Assign(id, e) => generateCode(e) + pushString(id) + "bindn"
case Program(body) => body map(generateCode(_)) mkString ""
}

def pushString(s: String): String = "push '" + s + "'n"

The code generation method accepts an AST node (an Expr) and uses pattern matching to determine what code should be generated. This pseudo-bytecode is for a stack-based virtual machine, similar to the JVM. I’m not going to go into the details of how it works, because it’s quite self explanatory: effectively we walk the AST generated in the parse phase and generate simple pseudo-bytecode instructions depending on the type of the AST node. Arguments to operations like add, sub, mul & div are passed in reverse order. When the node we’re processing contains child nodes, we generate code for them through simple recursion.

Hooking the lexer up to the parser and starting compilation

Finally, to kick of the compilation process we use the following code:

program(new lexical.Scanner(Console.readLine())) match {
    case Success(e, _) => Console.println(generateCode(e))
    case Failure(msg, _) => Console.println("[FAILURE] " + msg)
    case Error(msg, _) => Console.println("[ERROR] " + msg)
}

Here we simply instantiate a new lexical.Scanner instance (recall that lexical is an instance of StdLexical from the Scala standard library). Our call to program here actually returns the Parser[Expr] combinator constructed from our set of parsing rules, which then has its apply method called with the scanner as its one and only argument. It is the apply() call which starts the actual parsing of the input. This is possible thanks to functor-like syntatic sugar made possible by Scala conventions.

Once the Parser object has finished the parsing process, a ParseResult case class is returned. We use pattern matching to determine the outcome of the parsing process: ParseResult can be an instance of any one of Success, Failure or Error. Success obviously means that parsing completed successfully. The other two obviously indicate parsing failures and other problems in the combinators. Assuming parsing completes successfully, the root node of the AST (a Program instance in our parser) is provided as part of the Success case class. We then simply generate code for the AST and print it to the screen.

Click here to download the source code for part 1. Just for kicks, I’ve included a simple Python “virtual machine” in the download below which takes the pseudo-bytecode generated by this compiler and “executes” it in case you want to test it out. :) You can run the compile/execute toolchain using the following command:

script/build; script/run "v = 1; n = 2; v + n * 5;"

Thanks

This massive post was in no small part inspired by this document, written by Adriaan Moors. I strongly recommend you give it a read for a more formal introduction to Scala’s parser combinators by somebody who’s probably been doing it a lot longer than I have. :)

Part 2

Thanks for taking the time to read this tutorial. If you enjoyed it, drop me a line with some suggestions and/or improvements. I’m always open to input. In part 2 I’ll be discussing code generation with BCEL and providing the source code to the final, JVM-targetted compiler discussed in both tutorials. See you next time!

UPDATE: No part 2 yet, but part 1.5 is here!

Categories: Compiler Construction with Scala and BCEL, Compilers, Scala, Software Development | 6 Comments