Python 2.6a2: Compile ASTs from within Python code

April 4th, 2008

I’m not going to go into too much depth because Georg Brandl has already covered it, but it’s an interesting topic. I couldn’t help but write a little entry about it. :)

A new alpha of Python has just been released, including a patch I wrote for compiling Python Abstract Syntax Trees down to bytecode. This means it’s now possible to manipulate ASTs from within your Python program, which lets you do all sorts of crazy things – like this, for example.

Piping this little program into itself yields the following:

$ ./python wacky.py <wacky.py
Bwahaha! I was once an Assign node!
Bwahaha! I was once an Assign node!

Neat huh?

Anyway, since then – on Neal Norwitz’s advice – I’ve started on an experimental patch for what I hope will one day be an optimizer for Python ASTs. Even though it’s early days, the possibilities offered by optimizing at the AST level are very interesting. For example, the (dirty, filthy, ugly, hack of a) patch I’m working on at the moment has support for optimizing this code:

if 1:
    'true'
else:
    'false'

Down to this (remember, no bytecode has been generated yet):

'true'

Very, very exciting stuff.

Anyway, I have a train to catch. More on this when I have more to show!

Categories: Compilers, Python, Software Development | No Comments

When Speed Matters

April 3rd, 2008

I’d be the first to tell you if, when confronted with a task that requires a degree of automation, you use a scripting language like Python, Ruby, Perl, Bash or some mix of the four. However, I recently had a problem that involved changing the permissions on a large number of files.

My initial approach used a small (10 lines or so) Bash script to traverse the filesystem hierarchy and change the permissions based on whether we were processing a file or a directory. The resulting script ran over more than 3000 files and directories in about a minute and a half. Which wasn’t exactly slow, but I had to run this script about twenty to thirty times throughout the day and it felt very unproductive to have to wait for a whole minute and a half before I could continue my work. So I did something crazy: I rewrote that little Bash program in C.

The resulting C program – of maybe 75 lines of code – finished in 1.3 seconds.

Initially I thought I had made a mistake, so I checked the processed files. They were all exactly as I expected them to be. I was astounded: I could execute this program almost 150 times before my old Bash solution finished even once! This decision made my day – the rest of my afternoon was much more productive. I even felt happier to know I wasn’t wasting so much time on something trivial and secondary to the actual task at hand.

Again, I’m not one to go preaching about how important performance is with respect to any given language – often it’s much, much easier to write a few Python/Ruby/Perl scripts and push data between them with some Bash glue. In this particular case, however, the choice of a lower-level language was clearly a massive win.

Even in retaining the pragmatic perspective, it really makes you wonder just how much time is lost to “inefficient” software stacks…

Categories: C, Linux, Software Development | 4 Comments

On Bloom Filters

March 15th, 2008

I’ve got to admit that I don’t come from a hard core CS background (and I’m sure it shows in some of my articles on more complicated topics :P ), but I know a little bit about data structures and algorithms. While I’m no expert on the topic of data structures, I was surprised that – up until now – I had never even heard of Bloom filters.

What are Bloom filters?
Let’s say you have a set of millions of records in a database. Running a query over such a database can be quite time consuming. It would be great if we a cheap way to determine if a given record is likely to be within the database without necessarily performing a query. Bloom filters to the rescue!

Rather than executing a complex database query, we can ask our Bloom filter whether or not a certain record exists. It will then respond with either a definitive “no” (i.e. the record is definitely not in the database) or a “maybe”. In the event our Bloom filter says “maybe”, we must (unfortunately) query the database to determine whether or not the record really exists.

While this may sound like double the workload for our poor application, querying the Bloom filter is likely to be very fast. The goal of the Bloom filter here is to reduce the number of times we actually hit the database, a way to determine the probability a given input exists in our data set. Bloom filters are also small in terms of memory usage, requiring nothing more than a simple bitfield.

A quick overview

Bloom filters primarily consist of a bitfield of size m and one or more (call it k) hash functions. The classic Bloom filter provides two main operations:

  1. add, which is used to add new records to the Bloom filter; and
  2. query, which is used to determine whether a record is likely to be found in the data set.

Starting out with a zero-value bitfield, each time we call add our hash functions are each called on the input record in turn to produce a set of k indices (one for each hash function). For each one of these indices, we set the corresponding bit in the bitfield.

To query this filter, we again generate k indices using our hash functions on the input record. Rather than setting the bits in the bitfield, we instead check each bit. If any one of the bits is not set, then there is zero chance our input record exists in the data set (a definitive “no”). Otherwise, if all k bits are set, we have a “maybe”.

Problems with the classic implementation

There’s one glaring problem with the classic Bloom filter: there are no deletions! If records in your source data set can be deleted, your Bloom filter may need to be regenerated (depending on your accuracy needs). There exists a variation on the classic filter which uses an array of counters instead of a bitfield. The add operation becomes a counter increment. This allows for a delete operation which, predictably, becomes a decrement of the counter. This variant trades memory for flexibility.

A reference implementation

I’m uploading a C reference implementation of a Bloom filter with 1 (super naive) hash function and a 32-bit-long bitfield. Hopefully there is somebody out there who finds it informative.

Download my crummy (but hopefully instructive!) reference implementation here.

WARNING I wrote this implementation from a quick read of the Wikipedia page, so if there are any CS geeks out there who can see I’m clearly doing something wrong, please tell me how wrong I am :)

Categories: C, Software Development | 3 Comments

Fixing Rails Nested Forms (or: HashWithIndifferentAccess is evil)

March 12th, 2008

I’ve moved over to Shine’s Ruby on Rails team and, as such, have been exposed to a whole lot of Rails code over the past few weeks. Something I ran into a few weeks back was this bug, relating to a parsing bug in nested forms. I’ve submitted a series of patches, the last of which I hope to see in HEAD sometime over the next few days.

Anyway, the actual bug was caused by weirdo (and non-Hash-like) semantics in HashWithIndifferentAccess that causes certain types (Hash and Array) to be stored as a copy of the value passed in. The following code will give you an idea of the crazy logic that might result from such behavior:

def mktom(); {:name => 'Tom', :age => 23}; end

tom = mktom
people = {:tom => tom}
tom[:name] = 'Thomas'
puts people[:tom][:name] # displays 'Thomas' - cool!
puts "The same object" if tom.object_id == people[:tom].object_id # displays 'The same object'

tom = mktom
people = {:tom => tom}.with_indifferent_access
tom[:name] = 'Thomas'
puts people[:tom][:name] # displays 'Tom' - wtf!
puts "Not the same object!" if tom.object_id != people[:tom].object_id # displays 'Not the same object!'

The lesson learned from tracking down and fixing this bug?

  • Do not treat HashWithIndifferentAccess as though it were any old Hash. It has different semantics to Hash and may store copies of the keys and/or values instead of the original objects.
  • Do not nest HashWithIndifferentAccess instances in Hash instances unless you enjoy headaches. It’s so easy to think that HashWithIndifferentAccess is semantically identical to Hash (especially thanks to Hash#with_indifferent_access), but this assumption was the cause of two nasty, hard-to-find bugs in this instance.
  • When providing an API virtually identical to (and easily mistakable for) core Ruby classes, ensure that semantics are consistent with convention.

I’m not sure why they didn’t just override the reader for #[]. Maybe there’s more to that than I realize. Anyway, I’ve spent enough time thinking about this for tonight, time for bed. Sorry for the blog drought, I’m hoping to finalize a few more installments of the Ocaml series in along with the final part of my Scala/BCEL parser tutorial over the next few months. Stay tuned. :)

UPDATE Since I wrote this post, I’ve written a plugin that replaces Rails’ parameter parsing implementation to make dealing with complex nested forms easier still. Refer to Taking the Pain Out of Complex Forms in Rails.

Categories: Ruby, Software Development | No Comments

Don’t Just Read Other People’s Code: Understand It!

January 30th, 2008

Becoming a Better Programmer
The Internet’s software development stratosphere is forever spewing forth lists of N steps to becoming a better programmer. Among those touting their personal take on the road to code nirvana, I’ve noticed a single step that seems to be almost ubiquitous: read other people’s code. Since reading the perfectly distilled common sense of Steve McConnell’s Code Complete a few years back, I’ve always considered this to be a stylistic thing: when you read other people’s code, you do so to improve your own aesthetic.

McConnell’s book encouraged readers to (and I’m paraphrasing here) take what they found in the code of others and evaluate it. I took that at face value at the time (and I have ever since), and used others’ code as a judge of good style. If somebody’s code was easy to read, I’d integrate elements of their style into my own. If the code was … uh … labyrinthine … well, I’d tend to lean away from the styles in play. Easy.

The Goal is to Grok
Some time ago it struck me that maybe, after all this time, I’ve been quite wrong about what I should be focusing on when looking at others’ code. I think the one thing I’ve dearly undervalued in all my years of programming is this: reading code is just a step on the long path to understanding it. That is, don’t strive to merely “read” the individual statements in a source file. Develop a general understanding of how the system fits together as a whole and at multiple levels of granularity. My narrow-minded approach was never going to be nearly as effective as truly understanding more complex code bases.

Style is doubtlessly important, but there’s more to be gained from reading code than merely what works stylistically. There is really so, so much more to take from the nitty gritty of the implementation details of even the most poorly constructed software systems. By all means, discard the stylistic abortions of programmers who don’t care for aesthetics or code of a language which doesn’t necessarily lend itself to the problem at hand. But damn it, play with it. Become familiar with how all the little ugly parts of a system fit together. Step through the interesting bits in a debugger, watch the data move around, watch the state changes. Read the documentation back to front. Know the libraries, know the subsystems. Come to know the code, the software and the domain at a level you never thought possible when you first started playing with it. Understand it! Again, I guarantee you’ll learn as much from all the bad code you encounter as you do from the good – but only if you take the time to fit the pieces together.

Finding direction
Having said that, slogging through code for software you find mundane or boring is going to both test your patience and limit your potential for learning something new. Pick an open-source project (or two) that interests you and spend a few weeks or months tinkering with it. For example, you might be interested to learn how Pidgin/Gaim protocols and plugins are handled internally, or how JEdit’s editor component is rendered. How does Glade work? Delve into the source code behind your favourite libraries like wxWidgets, GTK+ and Qt/KDE and see how user interface libraries are built. How do Mono’s C# compiler and Python’s internals tick? How do you implement a robust HTTP server like Apache?

There’s a lot of code out there, so much to learn and a lot of smart people. Don’t be afraid to get your hands dirty with non-trivial projects, even when it’s initially frustrating to understand the basics of an unfamiliar domain. Understanding complex software systems will make you a better programmer and help you grow as a professional.

Categories: Software Development | 1 Comment

GTK Hello World in Six Different Languages

January 23rd, 2008

I’m still somewhat in holiday mode, so this entry is probably going to feel a little cheap for those of you following my more technical posts. I’m a big fan of GTK+ for user interfaces. If you don’t have the option or desire to use the Java platform and Swing, GTK+ is one of the better cross-platform user interface toolkits out there.

It’s high-level enough that it is easy to build quick, effective GUIs but low-level enough not to get in your way when you need to start messing around at the pixel level. GUIs can be built by hand using code, or designed using Glade and exported to an XML document to be loaded at runtime by any GTK+ binding. Currently it runs on Windows and Linux (the toolkit actually has its roots in GNOME) and has bindings for most popular programming languages. The only major downer is that Mac users are left out in the cold unless they go to the (herculean) effort of getting X11 up and running.

All the little differences between the various GTK+ bindings out there tend to get my goat when I move from one language to another. You would think that a GTK+ example in C could easily be translated to other languages without referencing documentation right? For one reason or another, the GTK+ bindings for other languages tend to diverge from the pleasant consistency of the C API to varying degrees. This post is all about those little differences that crop up even in the simplest applications: GTK’s take on “Hello World” in a few different languages.

C

C, being the language GTK+ is actually written in, is probably the most consistent with function naming across the different objects … the GTK_* and G_* macros are quite ugly though.

#include <gtk/gtk.h>

int main (int argc, char **argv) {
  GtkWidget *window;
  gtk_init(&argc, &argv);

  window = gtk_window_new (GTK_WINDOW_TOPLEVEL);
  gtk_window_set_title (GTK_WINDOW (window), "Hello, World");
  g_signal_connect (G_OBJECT (window), "delete-event", gtk_main_quit, NULL);
  gtk_widget_show_all (window);

  gtk_main();
  return 0;
}

C#/Mono

GTK# adds stuff like the Application object and delegates to produce a “Hello World” example I had to go digging around in documentation for. Not too bad on the whole, though.

using Gtk;
using GtkSharp;

public class HelloWorld {
  public static void Main(string[] args) {
    Gtk.Window window = new Gtk.Window();
    window.Title = "Hello, World";
    window.DeleteEvent += delegate { Application.Quit(); };
    window.ShowAll();
    Application.Run();
  }
}

Ocaml

The Ocaml bindings diverge from the original C API much more so than the other languages listed here, most likely due to design decisions that had to be made to provide a C binding to a functional language. My only real grumble is the inconsistency with window#event#connect vs window#connect. I’m guessing there was a technical reason for that, but it still irks me every time I see it.

let delete_event evt = false

let destroy () = GMain.Main.quit ()

let main () =
  let window = GWindow.window in
  let _ = window#set_title "Hello, World" in
  let _ = window#event#connect#delete ~callback:delete_event in
  let _ = window#connect#destroy ~callback:destroy in
  let _ = window#show () in
  GMain.Main.main ()
;;

let _ = main () ;;

Perl

Although I find writing Perl to be painful for everything but processing text files in a terminal, I found the Perl GTK bindings to be relatively straightforward.

use strict;
use Gtk2 '-init';

my $window = Gtk2::Window->new;
$window->set_title("Hello, World");
$window->signal_connect('delete-event', sub { Gtk2->main_quit; });
$window->show_all;
Gtk2->main;

Python

I’m more familiar with PyGTK than with other bindings, so this was a snap. Why they chose GtkObject.connect over GtkObject.signal_connect is a mystery and the pygtk.require(’…’) crap is a little weird, but aside from that there should be nothing surprising here (this is a good thing!).

import pygtk
pygtk.require('2.0')
import gtk

window = gtk.Window()
window.set_title('Hello, World')
window.connect('delete-event', gtk.main_quit)
window.show_all()
gtk.main()

Ruby

RubyGNOME provides a GTK+ binding for Ruby. Take an almost 1:1 port of the C API, take away the ugly casting macros, mix in closures for handling signals and Ruby really is one of the nicest ways to get intimate with GTK+.

require 'gtk2'

window = Gtk::Window.new
window.title = 'Hello, World'
window.signal_connect(:delete-event) { Gtk.main_quit }
window.show_all

Gtk.main

That’s all for now. There are many more language bindings for GTK+ out in the wild for languages like Lisp/Scheme, C++, Haskell and Erlang. If you’re looking around for a GUI toolkit, be sure to give GTK a go. There’s plenty of documentation available for all the bindings listed here, often with some very detailed and easy to follow tutorials.

UPDATE: Miguel and she suggested some changes to the C#/Mono and RubyGNOME examples.
UPDATE 2: Aristotle suggested an easier way to initialize the Perl GTK bindings.

Categories: C, GTK, Ocaml, Python, Ruby, Software Development | 11 Comments

HTML Sucks for Rich Web Applications

December 2nd, 2007

This post started out as a reply to a work colleague’s article – Ben Teese’s CSS Layout Sucks for Panel-Based Web Apps- but it eventually expanded out into such a large little rant that I thought it would be better suited to my blog. So here it is.

Trying to Make CSS Suck Less

Ben, since I come from a web development background … I hear ya. At my last job, XHTML and CSS were our primary building blocks for web sites and applications. By the end of my time there, we got pretty good at building CSS layouts – but there was always some little exception, some little element that didn’t *quite* look right due to some weird CSS nuance. Worse, client change requests could often be downright painful for exactly the reasons you mention: recalculating measurements, layout “knowledge” scattered throughout the stylesheet(s), etc. Worse, trying to make it work in every version of Firefox, IE, Safari and Opera? No deal. Not without many, many painful hacks. Despite being a standard for *years*, most vendors’ implementations of CSS are half-assed at best and will remain so for the foreseeable future.

Regarding the DRY problems mentioned with CSS: a half-baked idea of mine was to use $TEMPLATE_LANGUAGE to generate CSS files as part of the build process to alleviate some of the pain. I never actually got around to this myself, but others seem to be scared of this approach. Why? If your template engine provides you with variables, conditionals, loops and a means for performing calculations in your CSS templates why aren’t you generating your CSS using Velocity, Erb or even PHP!?

HTML and Rich UI Don’t Mix

All gripes aside, when it comes to the web I remain a somewhat stalwart supporter of CSS and XHTML – if only because tables clutter up HTML like a dog. However, I do think that the HTML hemisphere of the web world is starved for choice: both tables and CSS suck in their own hairy little warty ways. Tables suck because they’re awful to maintain and effectively render screen-readers useless. CSS sucks because it’s error-prone and void of expressive syntax & semantics.

The web world should be paying a lot of attention to rich client solutions such as Flex and Silverlight. Flex would be my first choice for a client platform if I had to start building a web application tomorrow. In fact, I’d go so far as to argue that the decision between CSS/tables for web applications is moot – the real problem is the current trend of using HTML as a medium for rich web applications. GWT is a perfect example of the ridiculous lengths we’re going to in order to just make things work. It’s a godawful hack around a limited technology.

If your requirements permit it, don’t try to make your HTML jump through hoops – opt for Flex or Silverlight instead if you plan on having a truly interactive UI. If you’re stuck with HTML, inexpressive CSS stylesheets can be augmented with a template language to reduce the amount of knowledge being spread throughout your code.

Categories: Software Development | 7 Comments

New Server

November 30th, 2007

I’ve just moved this blog to a new hosting provider. ‘Tis a spiffy new VPS. Please drop me a line if there are any issues (performance or otherwise).

Categories: Software Development | No Comments

Anonymous Type Acrobatics With Scala

November 12th, 2007

As mentioned in passing at the beginning of a recent article, one of the coolest things made possible by Scala’s expressive type system are anonymous type declarations. The idea was brought to my attention by Neil Bartlett in this article before I even started looking at Scala in any depth. Neil’s short-but-sweet post is most definitely worth reading, but there’s a few more neat tricks to be gleaned from his little example. Let me paraphrase his original code:

def printName(o: { def getName: String }) = Console.println(o.getName)

Here, the printName method can be passed an instance of any type, so long as it implements the anonymous interface we provide here (i.e. “any object with a getName method that returns a String”). Neat.

The most obvious question that occured to me is this: does the statically-checked duck-typing rule hold true for data structures as well as functions? For example, we have a list of objects of any old type, so long as they satisfy the interface constraints of our anonymous type? You bet.

class Foo {
  def getName: String = "Foo"
}

class Bar {
  def getName: String = "Bar"
}

val items = List[{ def getName: String }](new Foo, new Bar)

// despite Foo and Bar being completely types, the compiler "knows"
// that they will both implement our anonymous type declaration
items.map(_.getName).foreach(Console.println)

Very cool stuff. No longer do we need ugly wrapper classes/proxies to work around the shortcomings of more primitive type systems.

Oh and if the anonymous type syntax irks you, fear not: types can be aliased:

// Foo and Bar declared elsewhere ...

type HasName = { def getName: String }

def printName(o: HasName) = Console.println(o.getName)

val items = List[HasName](new Foo, new Bar)
items.foreach(printName)

Irrespective of which side of the dynamic/static fence you sit, you have to admit that Scala is an interesting compromise …

Categories: Scala, Software Development | 1 Comment

The Software Development Dark Ages

November 11th, 2007

I just read Hans-Eric Gronlund’s post on the pending extinction of programmers. The sky is falling in Software Development Land, it would seem: we’re all to be replaced by the very programs we’re writing. My natural reaction to his post is outright rejection. To my mind, there are too many fringe cases to account for, too many semantic nuances in spoken and written language to possibly pretend – even for a moment – that we’re just an academic paper or two away from being able to produce truly useful software from a high level specification. I wager that we’ll never see such a silver bullet come to fruition any time in the foreseeable future.

The Devil in the Details

Software development is not easy. You can trivialize it away all you want, and pretend that somehow the computer is going to know what you mean when you say “generate a report based on the data in the products table”. The devil is, and always will be, in the details. Simply put, you cannot abstract away details. Combine this with the fact that programming languages are constantly being born, evolved and killed. Ideas in languages decades old are being reimplemented in modern languages with a twist. We’re in an endless spiral of self-destruction: the software we write to day will be out of date – technologically speaking – by tomorrow because language X is no longer popular, or framework Y is now unsupported, or platform Z has crap support for multiple cores and so our software simply cannot scale. Our tools are constantly improving, but as developers we’re forever reinventing the software wheel – and that’s a problem caused by the overzealous nature of the software market.

The VM Wars

For example, the arrival of feasible byte-code-driven virtual machines is a wonderful thing: so what if language X is out of date? We can start using language W on Monday because it compiles down to the same bytecode. Platform Z has crap support for multiple cores? Hey, that’s cool: the VM we target also runs on another platform with better support for multi-core hardware. Sure, we’re still stuck with a shit framework, but hey: gift, horse, mouth. Virtual machines, while not fixing The Semantic Problem discussed by Gronlund, are a big step forward on other fronts.

How could the industry possibly screw such a theoretically wonderful idea up? Enter Sun and Microsoft with the JVM and .NET, respectively. Two entirely incompatible virtual machines with separate bytecode instructions and standard libraries. Brilliant. So now we all have to make a choice and hope that it’s the right one or we’re back to inventing wheels when the JVM is no longer supported by Sun, or .NET is superseded by an instruction set only feasibly implemented on specific hardware configurations. Even if both continue to exist forever and a day, the choice is still largely mutually exclusive: if you pick Java, your options for reuse are limited to Java solutions. If you pick .NET, you’re in the same boat in a different creek with the familiar uneasiness one feels in the absence of paddles in such a situation.

I wouldn’t for a second advocate blind homogeneity, but the two virtual machines are so god damn similar at a semantic level, they might as well be the same platform. It’s for entirely avoidable reasons like this VM divide that we’re doomed to the software development dark ages – and all that without even mentioning the words “software patent”. My prediction is that there will never be a magical breakthrough while we’re throwing away everything we produce every few years. We’ll be too busy fixing yesterday’s problem with today’s technology to ever think about a better way to approach the process as a whole. Even if we improve the process, we’re still doomed to the details, because – again – you can’t abstract away details.

No “Super Compiler”

Clearly, I’m a sceptic of the super compiler that Gronlund predicts (even when looking far, far into the future as he does), but I do see highly configurable software systems filling specific niches: systems like Drupal and Joomla are already filling this niche on the web development front. For bog-standard content-managed web sites, you can hand Drupal off to a half-trained monkey, slap a pretty theme on it, and you really can’t go wrong. However, even with such systems you are limited to the constraints of the systems themselves: the moment you want your site to do something unsupported out-of-the-box, the moment you want a certain feature to behave slightly differently, you’re back in the land of programmers. Further, what happens should PHP (the technology on which Drupal is built) disappear? Is Drupal then to be simply rebuilt in another language on another platform? Even if this is unlikely to happen in the real world: on a hypothetical level, why the hell should we need to do this?

Commercialization, Legislation and a Dull Sense of Deja Vu 

To make matters worse, software development is outrageously commercialized. Even if we assume that a massive breakthrough in the software development tool set will eventually surface, you can bet it will be so riddled with patents that nobody will be able to legally use it anyway. It’s sad, but it’s the brutal truth. I don’t mean to be overly gloomy with this post: I love software development, and the possibilities for our profession are simply endless. It’s because I love my job that I look to the future with such pessimism: it’s unrealistic to think that – with the industry’s current approach to software development – we are going to see anything more than incremental improvement.

The more you try something new to the software development scene, the more it feels like we’re all doing a slightly different dance to the same bad song. I guess all we can do for now is keep on trying new things and hope somebody, somewhere stumbles upon Brooks’ silver bullet in the process …

Categories: Compilers, Languages, Software Development | 4 Comments