March 15th, 2008
I’ve got to admit that I don’t come from a hard core CS background (and I’m sure it shows in some of my articles on more complicated topics
), but I know a little bit about data structures and algorithms. While I’m no expert on the topic of data structures, I was surprised that – up until now – I had never even heard of Bloom filters.
What are Bloom filters?
Let’s say you have a set of millions of records in a database. Running a query over such a database can be quite time consuming. It would be great if we a cheap way to determine if a given record is likely to be within the database without necessarily performing a query. Bloom filters to the rescue!
Rather than executing a complex database query, we can ask our Bloom filter whether or not a certain record exists. It will then respond with either a definitive “no” (i.e. the record is definitely not in the database) or a “maybe”. In the event our Bloom filter says “maybe”, we must (unfortunately) query the database to determine whether or not the record really exists.
While this may sound like double the workload for our poor application, querying the Bloom filter is likely to be very fast. The goal of the Bloom filter here is to reduce the number of times we actually hit the database, a way to determine the probability a given input exists in our data set. Bloom filters are also small in terms of memory usage, requiring nothing more than a simple bitfield.
A quick overview
Bloom filters primarily consist of a bitfield of size m and one or more (call it k) hash functions. The classic Bloom filter provides two main operations:
- add, which is used to add new records to the Bloom filter; and
- query, which is used to determine whether a record is likely to be found in the data set.
Starting out with a zero-value bitfield, each time we call add our hash functions are each called on the input record in turn to produce a set of k indices (one for each hash function). For each one of these indices, we set the corresponding bit in the bitfield.
To query this filter, we again generate k indices using our hash functions on the input record. Rather than setting the bits in the bitfield, we instead check each bit. If any one of the bits is not set, then there is zero chance our input record exists in the data set (a definitive “no”). Otherwise, if all k bits are set, we have a “maybe”.
Problems with the classic implementation
There’s one glaring problem with the classic Bloom filter: there are no deletions! If records in your source data set can be deleted, your Bloom filter may need to be regenerated (depending on your accuracy needs). There exists a variation on the classic filter which uses an array of counters instead of a bitfield. The add operation becomes a counter increment. This allows for a delete operation which, predictably, becomes a decrement of the counter. This variant trades memory for flexibility.
A reference implementation
I’m uploading a C reference implementation of a Bloom filter with 1 (super naive) hash function and a 32-bit-long bitfield. Hopefully there is somebody out there who finds it informative.
Download my crummy (but hopefully instructive!) reference implementation here.
WARNING I wrote this implementation from a quick read of the Wikipedia page, so if there are any CS geeks out there who can see I’m clearly doing something wrong, please tell me how wrong I am
Categories: C, Software Development |
3 Comments
January 23rd, 2008
I’m still somewhat in holiday mode, so this entry is probably going to feel a little cheap for those of you following my more technical posts. I’m a big fan of GTK+ for user interfaces. If you don’t have the option or desire to use the Java platform and Swing, GTK+ is one of the better cross-platform user interface toolkits out there.
It’s high-level enough that it is easy to build quick, effective GUIs but low-level enough not to get in your way when you need to start messing around at the pixel level. GUIs can be built by hand using code, or designed using Glade and exported to an XML document to be loaded at runtime by any GTK+ binding. Currently it runs on Windows and Linux (the toolkit actually has its roots in GNOME) and has bindings for most popular programming languages. The only major downer is that Mac users are left out in the cold unless they go to the (herculean) effort of getting X11 up and running.
All the little differences between the various GTK+ bindings out there tend to get my goat when I move from one language to another. You would think that a GTK+ example in C could easily be translated to other languages without referencing documentation right? For one reason or another, the GTK+ bindings for other languages tend to diverge from the pleasant consistency of the C API to varying degrees. This post is all about those little differences that crop up even in the simplest applications: GTK’s take on “Hello World” in a few different languages.
C
C, being the language GTK+ is actually written in, is probably the most consistent with function naming across the different objects … the GTK_* and G_* macros are quite ugly though.
#include <gtk/gtk.h>
int main (int argc, char **argv) {
GtkWidget *window;
gtk_init(&argc, &argv);
window = gtk_window_new (GTK_WINDOW_TOPLEVEL);
gtk_window_set_title (GTK_WINDOW (window), "Hello, World");
g_signal_connect (G_OBJECT (window), "delete-event", gtk_main_quit, NULL);
gtk_widget_show_all (window);
gtk_main();
return 0;
}
C#/Mono
GTK# adds stuff like the Application object and delegates to produce a “Hello World” example I had to go digging around in documentation for. Not too bad on the whole, though.
using Gtk;
using GtkSharp;
public class HelloWorld {
public static void Main(string[] args) {
Gtk.Window window = new Gtk.Window();
window.Title = "Hello, World";
window.DeleteEvent += delegate { Application.Quit(); };
window.ShowAll();
Application.Run();
}
}
Ocaml
The Ocaml bindings diverge from the original C API much more so than the other languages listed here, most likely due to design decisions that had to be made to provide a C binding to a functional language. My only real grumble is the inconsistency with window#event#connect vs window#connect. I’m guessing there was a technical reason for that, but it still irks me every time I see it.
let delete_event evt = false
let destroy () = GMain.Main.quit ()
let main () =
let window = GWindow.window in
let _ = window#set_title "Hello, World" in
let _ = window#event#connect#delete ~callback:delete_event in
let _ = window#connect#destroy ~callback:destroy in
let _ = window#show () in
GMain.Main.main ()
;;
let _ = main () ;;
Perl
Although I find writing Perl to be painful for everything but processing text files in a terminal, I found the Perl GTK bindings to be relatively straightforward.
use strict;
use Gtk2 '-init';
my $window = Gtk2::Window->new;
$window->set_title("Hello, World");
$window->signal_connect('delete-event', sub { Gtk2->main_quit; });
$window->show_all;
Gtk2->main;
Python
I’m more familiar with PyGTK than with other bindings, so this was a snap. Why they chose GtkObject.connect over GtkObject.signal_connect is a mystery and the pygtk.require(’…’) crap is a little weird, but aside from that there should be nothing surprising here (this is a good thing!).
import pygtk
pygtk.require('2.0')
import gtk
window = gtk.Window()
window.set_title('Hello, World')
window.connect('delete-event', gtk.main_quit)
window.show_all()
gtk.main()
Ruby
RubyGNOME provides a GTK+ binding for Ruby. Take an almost 1:1 port of the C API, take away the ugly casting macros, mix in closures for handling signals and Ruby really is one of the nicest ways to get intimate with GTK+.
require 'gtk2'
window = Gtk::Window.new
window.title = 'Hello, World'
window.signal_connect(:delete-event) { Gtk.main_quit }
window.show_all
Gtk.main
That’s all for now. There are many more language bindings for GTK+ out in the wild for languages like Lisp/Scheme, C++, Haskell and Erlang. If you’re looking around for a GUI toolkit, be sure to give GTK a go. There’s plenty of documentation available for all the bindings listed here, often with some very detailed and easy to follow tutorials.
UPDATE: Miguel and she suggested some changes to the C#/Mono and RubyGNOME examples.
UPDATE 2: Aristotle suggested an easier way to initialize the Perl GTK bindings.
Categories: C, GTK, Ocaml, Python, Ruby, Software Development |
11 Comments
November 11th, 2007
I just read Hans-Eric Gronlund’s post on the pending extinction of programmers. The sky is falling in Software Development Land, it would seem: we’re all to be replaced by the very programs we’re writing. My natural reaction to his post is outright rejection. To my mind, there are too many fringe cases to account for, too many semantic nuances in spoken and written language to possibly pretend – even for a moment – that we’re just an academic paper or two away from being able to produce truly useful software from a high level specification. I wager that we’ll never see such a silver bullet come to fruition any time in the foreseeable future.
The Devil in the Details
Software development is not easy. You can trivialize it away all you want, and pretend that somehow the computer is going to know what you mean when you say “generate a report based on the data in the products table”. The devil is, and always will be, in the details. Simply put, you cannot abstract away details. Combine this with the fact that programming languages are constantly being born, evolved and killed. Ideas in languages decades old are being reimplemented in modern languages with a twist. We’re in an endless spiral of self-destruction: the software we write to day will be out of date – technologically speaking – by tomorrow because language X is no longer popular, or framework Y is now unsupported, or platform Z has crap support for multiple cores and so our software simply cannot scale. Our tools are constantly improving, but as developers we’re forever reinventing the software wheel – and that’s a problem caused by the overzealous nature of the software market.
The VM Wars
For example, the arrival of feasible byte-code-driven virtual machines is a wonderful thing: so what if language X is out of date? We can start using language W on Monday because it compiles down to the same bytecode. Platform Z has crap support for multiple cores? Hey, that’s cool: the VM we target also runs on another platform with better support for multi-core hardware. Sure, we’re still stuck with a shit framework, but hey: gift, horse, mouth. Virtual machines, while not fixing The Semantic Problem discussed by Gronlund, are a big step forward on other fronts.
How could the industry possibly screw such a theoretically wonderful idea up? Enter Sun and Microsoft with the JVM and .NET, respectively. Two entirely incompatible virtual machines with separate bytecode instructions and standard libraries. Brilliant. So now we all have to make a choice and hope that it’s the right one or we’re back to inventing wheels when the JVM is no longer supported by Sun, or .NET is superseded by an instruction set only feasibly implemented on specific hardware configurations. Even if both continue to exist forever and a day, the choice is still largely mutually exclusive: if you pick Java, your options for reuse are limited to Java solutions. If you pick .NET, you’re in the same boat in a different creek with the familiar uneasiness one feels in the absence of paddles in such a situation.
I wouldn’t for a second advocate blind homogeneity, but the two virtual machines are so god damn similar at a semantic level, they might as well be the same platform. It’s for entirely avoidable reasons like this VM divide that we’re doomed to the software development dark ages – and all that without even mentioning the words “software patent”. My prediction is that there will never be a magical breakthrough while we’re throwing away everything we produce every few years. We’ll be too busy fixing yesterday’s problem with today’s technology to ever think about a better way to approach the process as a whole. Even if we improve the process, we’re still doomed to the details, because – again – you can’t abstract away details.
No “Super Compiler”
Clearly, I’m a sceptic of the super compiler that Gronlund predicts (even when looking far, far into the future as he does), but I do see highly configurable software systems filling specific niches: systems like Drupal and Joomla are already filling this niche on the web development front. For bog-standard content-managed web sites, you can hand Drupal off to a half-trained monkey, slap a pretty theme on it, and you really can’t go wrong. However, even with such systems you are limited to the constraints of the systems themselves: the moment you want your site to do something unsupported out-of-the-box, the moment you want a certain feature to behave slightly differently, you’re back in the land of programmers. Further, what happens should PHP (the technology on which Drupal is built) disappear? Is Drupal then to be simply rebuilt in another language on another platform? Even if this is unlikely to happen in the real world: on a hypothetical level, why the hell should we need to do this?
Commercialization, Legislation and a Dull Sense of Deja VuÂ
To make matters worse, software development is outrageously commercialized. Even if we assume that a massive breakthrough in the software development tool set will eventually surface, you can bet it will be so riddled with patents that nobody will be able to legally use it anyway. It’s sad, but it’s the brutal truth. I don’t mean to be overly gloomy with this post: I love software development, and the possibilities for our profession are simply endless. It’s because I love my job that I look to the future with such pessimism: it’s unrealistic to think that – with the industry’s current approach to software development – we are going to see anything more than incremental improvement.
The more you try something new to the software development scene, the more it feels like we’re all doing a slightly different dance to the same bad song. I guess all we can do for now is keep on trying new things and hope somebody, somewhere stumbles upon Brooks’ silver bullet in the process …
Categories: Compilers, Languages, Software Development |
4 Comments
November 3rd, 2007
Writing a compiler is no simple undertaking. I’ve spent a few years on and off (mostly with the Python source code) trying to understand compiler internals on a practical level, and I’m only really only just beginning to learn enough to be able to manage some of the basics. So, in this article I present a detailed discussion of a simple compiler implementation that targets the Java Virtual Machine. I’ve been wanting to write a tutorial like this for a few months, because it’s exactly the sort of thing I wish I could have read a two or three years ago. So this is my little gift to all the other aspiring compiler nerds out there struggling (like me!) with the learning curve of compiler construction
Prerequisites
You’ll probably need at least a fundamental understanding of Scala and/or Java to be able to follow the code samples in this article, but I promise to try and keep the code as simple as possible.
I’d also recommend that you have at least a high-level understanding of how compilers tend to be structured internally i.e. if you don’t know what a “scanner” or a “parser” is, you’re going to get lost very quickly.
Obviously, you’ll want Scala installed if you plan to run the code. You’ll also want some sort of *nix toolchain available if you plan on using the simple build scripts I’ve provided. It’s not that hard to compile it by hand if you’re stuck on Windows or something like that.
Scala, the language
I haven’t covered Scala a whole lot on this blog. To be honest, I only started using Scala about two weeks ago. I’m already falling in love with it – you can think of it as a more expressive Java that straddles the fence the computing industry seems to have erected between functional and object-oriented languages. Yes, yes, there’s classes in Ocaml, the other functional/OO language you’ll hear me ranting about on this blog – but even in Ocaml, OOP is generally shunned in favour of a functional approach. Scala, on the other hand, is a beautiful blend of the two. I won’t go into the details of some of the more elegant aspects of Scala, but here’s just a few little things about Scala that I find especially nice:
- Scala is statically typed, but the excessive verbosity of Java is a thing of the past thanks to Scala’s type inferencing
- Scala has closures and anonymous methods
- Scala has what I can only describe as anonymous types: Neil Bartlett called this statically-checked duck typing
- Scala types can be aliased
- Scala code can use any Java class
- If some care is taken, Java code can easily use Scala classes.
Scala, the standard library
Scala comes with its own standard library (on top of the standard Java library) containing a slew of useful components. The Scala standard library comes with its own unit testing toolkit called SUnit, a database abstraction layer (think Scalafied JDBC), syntactic sugar wrappers around well-known Java collection classes, concurrent programming utilities and – the bit we’re interested in – a bunch of stuff useful to compiler construction.
For more information about the Scala standard library, please check out the Scala API documentation.
Our programming language
The source language will effectively be a glorified calculator with variables. The result of the final expression evaluated is to be printed to the screen. For example, something like this:
v = 5;
t = 10;
v + 6 * t;
Would output “65″.
High-level overview of a compiler
Irrespective of what language you write it in, the structure of a compiler tends to look something like this:
- Lexical analysis or Scanning - scan the source document to produce a stream of interesting tokens
- Syntax analysis or Parsing - transform the token stream produced by the previous phase into an abstract representation of the program structure
- Semantic analysis or Type Checking - analyse the in-memory representation of the compiled program to ensure it “makes sense”
- Code generation – generate the executable code from the type-checked representation of the program
The compiler presented by this tutorial won’t have an explicit “step 3″ – the language is simple enough that it should be impossible to write a syntactically correct, semantically invalid program.
Our abstract syntax tree
Our parser will be transforming the token sequence generated by the scanner in phase 1 to produce what’s called an Abstract Syntax Tree – a representation of the program as a structured tree. There’s plenty of literature available on ASTs out in the wild, so I’ll gloss over the details and provide you with the relevant Scala code for declaring the nodes in our tree:
abstract class Expr
case class Add(l: Expr, r: Expr) extends Expr
case class Sub(l: Expr, r: Expr) extends Expr
case class Mul(l: Expr, r: Expr) extends Expr
case class Div(l: Expr, r: Expr) extends Expr
case class Num(value: int) extends Expr
case class Store(id: String, r: Expr) extends Expr
case class Load(id: String) extends Expr
case class Program(body: List[Expr]) extends Expr
So for a simple example, imagine we have a simple program in our language that looks something like this:
3 + 5; 4 * 6 - 2;
We could manually express this as an AST using the above classes as follows:
Program(
List(
Add(Num(3), Num(5)),
Sub(
Mul(Num(4), Num(6)),
Num(2)
)
)
)
Of course, building ASTs like this isn’t quite what we’re after. In practice, we break AST construction down into a set of simple rules using Scala’s parser combinators. But more on that later.
Finally, you will want to note that the concrete elements of the AST are implemented as Scala case classes. Case classes will allow us to use Scala’s pattern matching features to inspect the AST in the code generation phase. This makes it really, really easy conceptually to see how the code should be generated.
The standard lexer
Scala comes with a standard lexical analysis class (StdLexical) which is capable of producing tokens for a simple Scala-like language. More than enough for our particular needs here. StdLexical knows how to parse numeric values, so all we need to do is tell the scanner which delimiters are important (i.e. a few simple arithmetic operators):
type Tokens = StdLexical
val lexical = new StdLexical
lexical.delimiters ++= Set("(", ")", "+", "-", "*", "/", ";")
That’s all the Scala code we need to write to configure our lexical analyser! Assuming your language doesn’t do anything too weird, StdLexical will likely be able to take care of the rest of the scanning process for you.
Pseudo-EBNF using parser combinators
EBNF is the notation de rigeur for describing programming language syntax in the compiler construction world. In theory, once you have written an EBNF description of your programming language syntax, writing a parser for it should be reasonably simple: you simply transcribe the EBNF into your language of choice and away you go. Unfortunately it’s not always that simple in practice, so people resort to using tools like yacc or bison to generate their parsers using EBNF-esque pattern-matching. The downside to these tools is that most have a fairly big learning curve.
Along comes parser combinators. Using combinators, it suddenly becomes possible to use EBNF-like rules in the compiler implementation language – a sort of DSL for parsers, if you like. Combinators appear to be especially popular in functional languages like Haskell (via Parsec) – although I’m still not sure what’s intrinsically special about functional languages that would make them more appropriate for combinators. If anybody could clear that up for more it would be much appreciated. In any case, Scala provides a parser combinator library as part of its standard API. Here’s the Scala code describing our simple little language, which also acts as the parser for our language:
def program = ((expr ~ ";")+) ^^ ((e: List[Expr]) => Program(e))def expr = assign | simpleExpr
def assign = ident ~ "=" ~ expr ^^ flatten2(Store)
def simpleExpr = term * (
"+" ^^ ((x: Expr, y: Expr) => Add(x, y))
| "-" ^^ ((x: Expr, y: Expr) => Sub(x, y))
)
def term = factor * (
"*" ^^ ((x: Expr, y: Expr) => Mul(x, y))
| "/" ^^ ((x: Expr, y: Expr) => Div(x, y))
)
def factor: Parser[Expr] = ("(" ~ expr ~ ")"
| ident ^^ ((s: String) => Load(s))
| numericLit ^^ ((s: String) => Num(s.toInt)))
Please note that, although I haven’t been explicit about return types for all of these methods they all return a Parser[Expr] – that is, a parser for an element of our AST. In fact, pretty much every method call you see here (including the operators!) is returning a Parser of some sort. If it’s not obvious, parser combinators – despite the scary name – have a very simple task: that of chaining simple parsers together to form a slightly more complex parser. These slightly more complex parsers can then be combined together as well, to form a slightly more complex parser again. Eventually you wind up with a tree of parsers that match one-to-one with the rules you provide in your combinator “grammar”.
It’s also interesting to note here that precedence is actually being enforced using the nested parsing rules: a factor will be parsed before any other rules because the factor rule is nested deeper in the parsing hierarchy than any of the others.
Now it probably goes without saying, but there’s a lot going on under the hood here. I’ll spend the next few sections trying to explain just exactly what all this actually means, but before we start it’s important that you understand …
Scala operators are just method calls
Many of the operators you see in the above source code listing are actually method calls on the objects preceding them. For example:
def assign = ident ~ "=" ~ expr ^^ flatten2(Store)
Could actually be rewritten thus:
def assign = ident.~(keyword("=").~(expr.^^(flatten2(Store))))
Obviously the first example is much more readable, but knowing this will make understanding what’s going on in the above code examples just that little bit easier.
Another bit of Scala magic: the parser combinators know to implicitly convert String arguments into a UnitParser using the keyword method. keyword accepts a string argument – the token it “expects” – and then discards the result.
As a side note, methods with more traditional names can be used as operators too:
import java.util.ArrayList
val list = new ArrayList
list add "Test" // the same as list.add("Test")
Finally, Scala operators – being methods – can be trivially overridden:
class Monkey {
var numBananas = 0
def +=(numBananas: int): unit = this.numBananas += numBananas
}
val m = new Monkey
Console.println(m.numBananas)
m += 5
Console.println(m.numBananas)
The “sequence”, “one-or-more” and “apply” combinators
Let’s take a look at the first rule in our program.
def program = ((expr ~ ";")+) ^^ ((e: List[Expr]) => Program(e))
The parser for this rule is expressed by this fragment:
def program = ((expr ~ ";")+)
Which, in plain old English, means: “a program consists of one or more sequences of expr and a semi-colon.”. Those of you familiar with regular expressions will probably have already guessed that the + combinator means “one or more”. The ~ combinator has a slightly more subtle meaning: it’s used to combine parsers together in a sequence. That is, within the context of this rule, once we have finished parsing an expr we would expect to find a literal semi-colon. If we do not, it’s a syntax error.
The final operator we’re covering here is the ^^ operator, which takes the result of the previous parser, applies a function to it, and returns the result (which should be an element of our AST) . Let’s take a look at the whole rule again:
def program = ((expr ~ ";")+) ^^ ((e: List[Expr]) => Program(e))
So the result of the one-or-more combinator is a list of expressions (List[Expr]). If parsing of the one-or-many rule succeeds, the ^^ takes the result (the said list of expressions) and passes it to the function passed as the operand on the right. In this case, we’re wrapping the list of expressions in a Program AST element.
So that should really be all you need to know about the ~, + and ^^ combinators. Please note that the annoying excess of brackets I use here is more for clarity than anything else – you could rewrite this as:
def program = (expr ~ ";" +) ^^ ((e: List[Expr]) => Program(e))
The “alternative” combinator
Take the next rule in our parser:
def expr = assign | simpleExpr
A relatively simple rule compared to the previous one. The only thing of note here, really, is the | operator, which creates a parser that can parse either of its operands: in this case, an assign or a simpleExpr. You can specify as many alternatives as you want. e.g.
def expr = assign | simpleExpr | ifExpr | functionDef | classDef | lambda
The way it works is by simply attempting to parse the leftmost rule. If that fails, it tries the next rule along. If that fails, it tries the next one along again and so on and so forth, until there are no more alternatives (in which case we have a syntax error) or one of the rules are successfully parsed. The net effect is similar to a logical “or”.
Something else of note is that we’re not using the apply combinator here. We simply don’t need to in this case, since both assign and simpleExpr both return a Parser[Expr], so there’s no need to convert the result of the parse into an AST element.
Simplifying your code with flatten[2-5]
The next rule – the “variable assignment” rule – appears to be pretty straightforward:
def assign = ident ~ "=" ~ expr ^^ flatten2(Store)
You’ve seen ~ and ^^ before – we’re just parsing a sequence and applying a function. But what’s flatten2?
The Scala parser combinator library includes a utility class called ImplicitConversions. This includes a couple of functions useful for writing cleaner combinators, including flatten2. Calling flatten2 here returns a function that takes the two arguments resulting from the sequence (the ident String and the expr Expr) and passes them to our AST Store constructor. Rewriting this using typical anonymous methods isn’t quite so clean:
def assign = ident ~ "=" ~ expr ^^ ((id: String, e: Expr) => Store(id, e))
It’s still readable, but still obviously much cleaner in the first example. Feel free to use either approach.
If you need to use flatten with more than two arguments, the library also provides flatten3, flatten4 and flatten5.
(Note: It should also be possible to have the flatten* calls happening implicitly, but I haven’t quite worked out how yet)
The “zero-or-more with separator” combinator
The next combinator we’re going to cover is similar to the one-or-more combinator, with a slight twist. First, let’s take a look at the next parsing rule:
def simpleExpr = term * (
"+" ^^ ((x: Expr, y: Expr) => Add(x, y))
| "-" ^^ ((x: Expr, y: Expr) => Sub(x, y))
)
Let’s get rid of the application combinators and look at the core rule to simplify things:
term * ("+" | "-")
That should be much easier to read. Now, you might initially think that this rule can be described as “a simpleExpr is zero or more terms followed by a plus or minus”. However, that’s not the case. In actual fact, it means: “a simpleExpr is zero or more terms interleaved by a plus or minus”. In order to parse the former, it would need to be expressed as something like:
(term*) ~ ("+" | "-")
But you probably don’t want that, and I digress
Let’s take another look at the original rule:
def simpleExpr = term * (
"+" ^^ ((x: Expr, y: Expr) => Add(x, y))
| "-" ^^ ((x: Expr, y: Expr) => Sub(x, y))
)
If there is a single term, it will be returned as the result for simpleExpr. If there are multiple terms, separated by “+” or “-” literals, they will be folded into a single Expr by recursively parsing the terms into Add or Sub AST nodes depending on which separator is encountered at each step. For example, given this source code:
5 + 3 - 2;
First, the parser will see a term, which evaluates a Num:
Num(5)
Next, the parser will see the literal “+”, and wraps the Num we just parsed in an Add with the next evaluated term (another Num):
Add(Num(5), Num(3))
Finally, the parser will see the literal “-” before wrapping the previous Add in a Sub with the next evaluated term (again, another Num):
Sub(Add(Num(5), Num(3)), Num(2))
I hope this clears up a potentially hairy construct. It’s really pretty simple once you get used to it, just keep at it.
More of the same
The next rule is exactly the same as the previous rule, except it expects factors (instead of terms) interleaved by either “*” or “/” (instead of “+” and “-”), and returns Mul and Div AST nodes (instead of Add and Sub nodes).
def term = factor * (
"*" ^^ ((x: Expr, y: Expr) => Mul(x, y))
| "/" ^^ ((x: Expr, y: Expr) => Div(x, y))
)
No further explanation should be necessary, but feel free to query it in the comments if you have any problems.
Parsing numbers and variable identifiers
Earlier I discussed the fact that parser combinators (and, indeed, parsers in general) simply take rules for simple parsers and combine them together to make more complex parsers. The factor rule is the simplest of all the parsers in terms of complexity, although the syntax expressing it is a little daunting:
def factor: Parser[Expr] = ("(" ~ expr ~ ")"
| ident ^^ ((s: String) => Load(s))
| numericLit ^^ ((s: String) => Num(s.toInt)))
The first thing to note is that I’ve explicitly stated the return type: because our parser is recursive, we must explicitly tell the compiler what our return type is going to be in the most fundamental element of our compiler. Now let’s break that rule down into more manageable chunks for discussion:
"(" ~ expr ~ ")"
This parser forces expressions in brackets to be parsed before anything else. As soon as a factor starting with a bracket is found, parsing starts near the top level again at the expr rule. Once the inner expr has been parsed, we expect a closing bracket. Easy.
ident ^^ ((s: String) => Load(s))
The ident method is provided by the Scala parser combinator library, which can handle typical variable names. If you need to handle any weird and wonderful characters, you may need to write your own.
Anyway, this simple rule parses a variable identifier (expressed as a String) and wraps it in an AST Load node. The only time that identifiers are not wrapped in an AST Load is when we’re doing assignment (see assign), where we use a Store node instead. The difference between Load and Store is semantically important to the code generator. I’ll discuss that more in the next part of this tutorial.
numericLit ^^ ((s: String) => Num(s.toInt))
As above, numericLit is provided by the Scala parser combinator library. It can parse simple numeric values (including floating point numbers, I believe). It returns a string expression of the number, which we then convert to an int and pass to the Num constructor. All numeric literals in our program will be wrapped in Num AST nodes.
Looking at the original factor rule again:
def factor: Parser[Expr] = ("(" ~ expr ~ ")"
| ident ^^ ((s: String) => Load(s))
| numericLit ^^ ((s: String) => Num(s.toInt)))
It doesn’t look so daunting any more. In simple terms, a factor is any expression wrapped in brackets, a variable name, or a number.
Code generation
Just to get something on-screen, let’s generate some simple bytecode:
def generateCode(e: Expr): String = e match {
case Num(v) => "push " + v + "n"
case Ident(id) => pushString(id) + "derefn"
case Mul(l, r) => generateCode(r) + generateCode(l) + "muln"
case Div(l, r) => generateCode(r) + generateCode(l) + "divn"
case Add(l, r) => generateCode(r) + generateCode(l) + "addn"
case Sub(l, r) => generateCode(r) + generateCode(l) + "subn"
case Assign(id, e) => generateCode(e) + pushString(id) + "bindn"
case Program(body) => body map(generateCode(_)) mkString ""
}
def pushString(s: String): String = "push '" + s + "'n"
The code generation method accepts an AST node (an Expr) and uses pattern matching to determine what code should be generated. This pseudo-bytecode is for a stack-based virtual machine, similar to the JVM. I’m not going to go into the details of how it works, because it’s quite self explanatory: effectively we walk the AST generated in the parse phase and generate simple pseudo-bytecode instructions depending on the type of the AST node. Arguments to operations like add, sub, mul & div are passed in reverse order. When the node we’re processing contains child nodes, we generate code for them through simple recursion.
Hooking the lexer up to the parser and starting compilation
Finally, to kick of the compilation process we use the following code:
program(new lexical.Scanner(Console.readLine())) match {
case Success(e, _) => Console.println(generateCode(e))
case Failure(msg, _) => Console.println("[FAILURE] " + msg)
case Error(msg, _) => Console.println("[ERROR] " + msg)
}
Here we simply instantiate a new lexical.Scanner instance (recall that lexical is an instance of StdLexical from the Scala standard library). Our call to program here actually returns the Parser[Expr] combinator constructed from our set of parsing rules, which then has its apply method called with the scanner as its one and only argument. It is the apply() call which starts the actual parsing of the input. This is possible thanks to functor-like syntatic sugar made possible by Scala conventions.
Once the Parser object has finished the parsing process, a ParseResult case class is returned. We use pattern matching to determine the outcome of the parsing process: ParseResult can be an instance of any one of Success, Failure or Error. Success obviously means that parsing completed successfully. The other two obviously indicate parsing failures and other problems in the combinators. Assuming parsing completes successfully, the root node of the AST (a Program instance in our parser) is provided as part of the Success case class. We then simply generate code for the AST and print it to the screen.
Click here to download the source code for part 1. Just for kicks, I’ve included a simple Python “virtual machine” in the download below which takes the pseudo-bytecode generated by this compiler and “executes” it in case you want to test it out.
You can run the compile/execute toolchain using the following command:
script/build; script/run "v = 1; n = 2; v + n * 5;"
Thanks
This massive post was in no small part inspired by this document, written by Adriaan Moors. I strongly recommend you give it a read for a more formal introduction to Scala’s parser combinators by somebody who’s probably been doing it a lot longer than I have.
Part 2
Thanks for taking the time to read this tutorial. If you enjoyed it, drop me a line with some suggestions and/or improvements. I’m always open to input. In part 2 I’ll be discussing code generation with BCEL and providing the source code to the final, JVM-targetted compiler discussed in both tutorials. See you next time!
UPDATE: No part 2 yet, but part 1.5 is here!
Categories: Compiler Construction with Scala and BCEL, Compilers, Scala, Software Development |
6 Comments
October 3rd, 2007
Welcome to part two in my series of Ocaml tutorials! There was an overwhelming response to the first tutorial in the series and I have taken plenty of suggestions on board for this and future tutorials. In this instalment I’ll be introducing the list data type and the very useful List.iter function from the Ocaml standard library.
An Introduction to List Initialization and List.iter
Here’s the source code we’re going to be breaking down:
let main () =
let seq = ["a"; "b"; "c"] in
List.iter print_endline seq ;;
let _ = main ()
Looks scary, doesn’t it? It’s actually quite simple, but we’ll get to the source code in a moment. First, let’s see the program’s output. Save this file as tmhoo02.ml and run it like so:
$ ocaml tmhoo02.ml
You should see the following:
a
b
c
Now, let’s take a look at the interesting bits of the source code:
let seq = ["a"; "b"; "c"] in
As you (hopefully!) learned in part 1, the let syntax binds the variable seq to some expression. The in keyword indicates that this binding is a named local expression whose value to be discarded when the function exits (think “local variable”). But what exactly does that expression mean?
["a"; "b"; "c"]
This is the most direct way to initialize a list in Ocaml. Each element in a list is separated by a semi-colon. This particular list has three elements: the strings “a”, “b” and “c”. As our list comprises of string elements, we can say that this is a string list. Note that unlike some dynamically typed languages (e.g. Python and Ruby), every Ocaml list element must be of the same type: you have a list of string elements or a list of integer elements, but you simply can’t mix the two. Honestly, it won’t compile.
List.iter print_endline seq ;;
Here we’re calling the iter function of the List module – a module which is bundled as part of the Ocaml standard library. This function takes two parameters: the first is another function that accepts a single parameter, the second is a list.
As you might suspect, List.iter iterates over the list, passing each element in the list to the function one by one. To word it another way, inside List.iter we’re calling print_endline for each element in our list to the equivalent of: print_endline “a”, print_endline “b”, print_endline “c”. This explains the output of our example program.
Finally:
let _ = main ()
main is evaluated here and its return value (the <em>unit</em> value) is ignored by the underscore wildcard. You’ll notice in part 1 I omitted everything up to the equals operator but, at the behest of others who know better, I’ve included it here in part two. I’m not going to say anything more about it, other than the fact that the underscore wildcard is a simple but powerful construct you’ll likely become very familiar with at a later date. If you’d prefer to use the syntax from part 1, that should work too (but don’t come crawling to me when other Ocaml programmers come down on you like a ton of bricks
).
Advanced Discussion of the Type System
As a sneak preview into Ocaml’s type system, I can tell you that List.iter isn’t restricted to string lists. In fact, it can work with lists containing elements of any type. However, because each element of the list is passed to the function given to List.iter as the first parameter, that function must accept a parameter the same type as a single element in the list. Since print_endline accepts a string, we must use a string list. Or: since we’re passing in a string list as the second parameter to List.iter, the function passed as the first parameter must itself accept a string as its sole parameter.
Confused? Read that last paragraph again carefully. If you still don’t understand it, don’t worry too much: we’ll cover Ocaml’s type system more in the next few tutorials.
Until Next Time …
So that wraps up tutorial number two. I’ve already got parts three and four in the pipeline and I can tell you they introduce a fair bit of new material. Take your time trying to understand what was covered in this tutorial and post a comment or two if you have any suggestions, questions or improvements.
Categories: Functional Programming, Ocaml, Software Development, TMHOO |
No Comments
July 6th, 2007
DRAFT: I’m still learning Ocaml myself, so some of what I say here may be plain wrong or misguided. I’m relying on those more experienced with Ocaml to correct my mistakes! If you have plenty of Ocaml experience, please correct my mistakes. By all means, be brutal!
Welcome to the very first part of my (hopefully enlightening) series of tutorials on the Ocaml programming language! In this tutorial we’ll be learning the Ocaml flavour of everybody’s favourite program: Hello World.
Since those interested in Ocaml will likely be coming from such a background, please note that this tutorial is written under the assumption that readers will have experience with other, imperative languages such as Java or C++. Now, on with the code:
let main () =
print_endline "Hello, World!" ;;
main () ;;
Save this file as tmhoo01.ml. You can run this program from the command-line using the following:
$ ocaml tmhoo01.ml
Or compile it to native code:
$ ocamlopt -o tmhoo01 tmhoo01.ml
Predictably, running this program displays “Hello, World!” in the console window. It’s not the output we’re interested in, however: it’s Ocaml’s weird ass syntax! Let’s take this line by line:
let main () =
This declares a function called “main”. let is an Ocaml keyword which defines named expressions (sometimes called let-bindings). In this case, the name of our expression/let-binding is “main”.
main takes a single parameter: the parentheses () represent what is known as the “unit value”. It is the only possible value for the unit type, and has a similar use and meaning to what void has in C/C++. In this case, it means the function main accepts no other parameters. This is necessary because Ocaml functions must always be applied to one or more parameters: thus, when we have no parameters to pass we must resort to using (). If we do not accept this unit parameter at a minimum, the expression in main is evaluated at the next double semi-colon ;;. This is not our intention.
print_endline "Hello, World!" ;;
The print_endline call does as you would expect. Similar to System.out.println or printf. print_endline returns the unit value. Implicitly, our main function has a unit type. This distinction may not completely make sense just yet, but the importance of this will make sense later once we start doing more work with different types.
The ;; keyword is used to separate multiple top-level constructs (e.g. let-bindings and class definitions). Later, you will use ; to separate expressions within other constructs.
main () ;;
As you would expect, this calls our main function.
So that’s our first Ocaml program dissected. Thanks for reading! If you have any questions or comments, please post them here: I’ll surely be revising this article based on your recommendations.
UPDATE 1: Correction to the descriptions of ; and ;;. Thanks Paul!
UPDATE 2: Removed my haughty claim that Ocaml references are more like C++ pointers than Java references. Cheers Chris!
UPDATE 3: Tried to otherwise simplify the tutorial. Less “blah blah blah”!
UPDATE 4: Remove introduction to references all together. Several people felt it was out of place to introduce a concept like references in the first tutorial. I happen to agree.
Categories: Ocaml, Software Development, TMHOO |
16 Comments