When Speed Matters

April 3rd, 2008

I’d be the first to tell you if, when confronted with a task that requires a degree of automation, you use a scripting language like Python, Ruby, Perl, Bash or some mix of the four. However, I recently had a problem that involved changing the permissions on a large number of files.

My initial approach used a small (10 lines or so) Bash script to traverse the filesystem hierarchy and change the permissions based on whether we were processing a file or a directory. The resulting script ran over more than 3000 files and directories in about a minute and a half. Which wasn’t exactly slow, but I had to run this script about twenty to thirty times throughout the day and it felt very unproductive to have to wait for a whole minute and a half before I could continue my work. So I did something crazy: I rewrote that little Bash program in C.

The resulting C program – of maybe 75 lines of code – finished in 1.3 seconds.

Initially I thought I had made a mistake, so I checked the processed files. They were all exactly as I expected them to be. I was astounded: I could execute this program almost 150 times before my old Bash solution finished even once! This decision made my day – the rest of my afternoon was much more productive. I even felt happier to know I wasn’t wasting so much time on something trivial and secondary to the actual task at hand.

Again, I’m not one to go preaching about how important performance is with respect to any given language – often it’s much, much easier to write a few Python/Ruby/Perl scripts and push data between them with some Bash glue. In this particular case, however, the choice of a lower-level language was clearly a massive win.

Even in retaining the pragmatic perspective, it really makes you wonder just how much time is lost to “inefficient” software stacks…

Categories: C, Linux, Software Development | 4 Comments

On Bloom Filters

March 15th, 2008

I’ve got to admit that I don’t come from a hard core CS background (and I’m sure it shows in some of my articles on more complicated topics :P ), but I know a little bit about data structures and algorithms. While I’m no expert on the topic of data structures, I was surprised that – up until now – I had never even heard of Bloom filters.

What are Bloom filters?
Let’s say you have a set of millions of records in a database. Running a query over such a database can be quite time consuming. It would be great if we a cheap way to determine if a given record is likely to be within the database without necessarily performing a query. Bloom filters to the rescue!

Rather than executing a complex database query, we can ask our Bloom filter whether or not a certain record exists. It will then respond with either a definitive “no” (i.e. the record is definitely not in the database) or a “maybe”. In the event our Bloom filter says “maybe”, we must (unfortunately) query the database to determine whether or not the record really exists.

While this may sound like double the workload for our poor application, querying the Bloom filter is likely to be very fast. The goal of the Bloom filter here is to reduce the number of times we actually hit the database, a way to determine the probability a given input exists in our data set. Bloom filters are also small in terms of memory usage, requiring nothing more than a simple bitfield.

A quick overview

Bloom filters primarily consist of a bitfield of size m and one or more (call it k) hash functions. The classic Bloom filter provides two main operations:

  1. add, which is used to add new records to the Bloom filter; and
  2. query, which is used to determine whether a record is likely to be found in the data set.

Starting out with a zero-value bitfield, each time we call add our hash functions are each called on the input record in turn to produce a set of k indices (one for each hash function). For each one of these indices, we set the corresponding bit in the bitfield.

To query this filter, we again generate k indices using our hash functions on the input record. Rather than setting the bits in the bitfield, we instead check each bit. If any one of the bits is not set, then there is zero chance our input record exists in the data set (a definitive “no”). Otherwise, if all k bits are set, we have a “maybe”.

Problems with the classic implementation

There’s one glaring problem with the classic Bloom filter: there are no deletions! If records in your source data set can be deleted, your Bloom filter may need to be regenerated (depending on your accuracy needs). There exists a variation on the classic filter which uses an array of counters instead of a bitfield. The add operation becomes a counter increment. This allows for a delete operation which, predictably, becomes a decrement of the counter. This variant trades memory for flexibility.

A reference implementation

I’m uploading a C reference implementation of a Bloom filter with 1 (super naive) hash function and a 32-bit-long bitfield. Hopefully there is somebody out there who finds it informative.

Download my crummy (but hopefully instructive!) reference implementation here.

WARNING I wrote this implementation from a quick read of the Wikipedia page, so if there are any CS geeks out there who can see I’m clearly doing something wrong, please tell me how wrong I am :)

Categories: C, Software Development | 3 Comments

GTK Hello World in Six Different Languages

January 23rd, 2008

I’m still somewhat in holiday mode, so this entry is probably going to feel a little cheap for those of you following my more technical posts. I’m a big fan of GTK+ for user interfaces. If you don’t have the option or desire to use the Java platform and Swing, GTK+ is one of the better cross-platform user interface toolkits out there.

It’s high-level enough that it is easy to build quick, effective GUIs but low-level enough not to get in your way when you need to start messing around at the pixel level. GUIs can be built by hand using code, or designed using Glade and exported to an XML document to be loaded at runtime by any GTK+ binding. Currently it runs on Windows and Linux (the toolkit actually has its roots in GNOME) and has bindings for most popular programming languages. The only major downer is that Mac users are left out in the cold unless they go to the (herculean) effort of getting X11 up and running.

All the little differences between the various GTK+ bindings out there tend to get my goat when I move from one language to another. You would think that a GTK+ example in C could easily be translated to other languages without referencing documentation right? For one reason or another, the GTK+ bindings for other languages tend to diverge from the pleasant consistency of the C API to varying degrees. This post is all about those little differences that crop up even in the simplest applications: GTK’s take on “Hello World” in a few different languages.

C

C, being the language GTK+ is actually written in, is probably the most consistent with function naming across the different objects … the GTK_* and G_* macros are quite ugly though.

#include <gtk/gtk.h>

int main (int argc, char **argv) {
  GtkWidget *window;
  gtk_init(&argc, &argv);

  window = gtk_window_new (GTK_WINDOW_TOPLEVEL);
  gtk_window_set_title (GTK_WINDOW (window), "Hello, World");
  g_signal_connect (G_OBJECT (window), "delete-event", gtk_main_quit, NULL);
  gtk_widget_show_all (window);

  gtk_main();
  return 0;
}

C#/Mono

GTK# adds stuff like the Application object and delegates to produce a “Hello World” example I had to go digging around in documentation for. Not too bad on the whole, though.

using Gtk;
using GtkSharp;

public class HelloWorld {
  public static void Main(string[] args) {
    Gtk.Window window = new Gtk.Window();
    window.Title = "Hello, World";
    window.DeleteEvent += delegate { Application.Quit(); };
    window.ShowAll();
    Application.Run();
  }
}

Ocaml

The Ocaml bindings diverge from the original C API much more so than the other languages listed here, most likely due to design decisions that had to be made to provide a C binding to a functional language. My only real grumble is the inconsistency with window#event#connect vs window#connect. I’m guessing there was a technical reason for that, but it still irks me every time I see it.

let delete_event evt = false

let destroy () = GMain.Main.quit ()

let main () =
  let window = GWindow.window in
  let _ = window#set_title "Hello, World" in
  let _ = window#event#connect#delete ~callback:delete_event in
  let _ = window#connect#destroy ~callback:destroy in
  let _ = window#show () in
  GMain.Main.main ()
;;

let _ = main () ;;

Perl

Although I find writing Perl to be painful for everything but processing text files in a terminal, I found the Perl GTK bindings to be relatively straightforward.

use strict;
use Gtk2 '-init';

my $window = Gtk2::Window->new;
$window->set_title("Hello, World");
$window->signal_connect('delete-event', sub { Gtk2->main_quit; });
$window->show_all;
Gtk2->main;

Python

I’m more familiar with PyGTK than with other bindings, so this was a snap. Why they chose GtkObject.connect over GtkObject.signal_connect is a mystery and the pygtk.require(’…’) crap is a little weird, but aside from that there should be nothing surprising here (this is a good thing!).

import pygtk
pygtk.require('2.0')
import gtk

window = gtk.Window()
window.set_title('Hello, World')
window.connect('delete-event', gtk.main_quit)
window.show_all()
gtk.main()

Ruby

RubyGNOME provides a GTK+ binding for Ruby. Take an almost 1:1 port of the C API, take away the ugly casting macros, mix in closures for handling signals and Ruby really is one of the nicest ways to get intimate with GTK+.

require 'gtk2'

window = Gtk::Window.new
window.title = 'Hello, World'
window.signal_connect(:delete-event) { Gtk.main_quit }
window.show_all

Gtk.main

That’s all for now. There are many more language bindings for GTK+ out in the wild for languages like Lisp/Scheme, C++, Haskell and Erlang. If you’re looking around for a GUI toolkit, be sure to give GTK a go. There’s plenty of documentation available for all the bindings listed here, often with some very detailed and easy to follow tutorials.

UPDATE: Miguel and she suggested some changes to the C#/Mono and RubyGNOME examples.
UPDATE 2: Aristotle suggested an easier way to initialize the Perl GTK bindings.

Categories: C, GTK, Ocaml, Python, Ruby, Software Development | 11 Comments

Passing Data Between GTK Applications With GtkClipboard

June 27th, 2007

GtkClipboard is a wonderful (if somewhat recent) addition to GTK. As you would expect, it allows users to pass data between your application and other GTK applications via the X11 clipboard (and vice versa). A common past annoyance with the clipboard under linux is that data stored disappeared from the clipboard when the application which set the data terminated. No longer so, thanks to gtk_clipboard_store and the GNOME Clipboard Daemon.

Now, tutorials on GtkClipboard are a little sparse and the documentation seems to leave a few questions unanswered, so here’s a quick and dirty primer on passing text data between your GTK applications using GtkClipboard.

Python

import pygtk
pygtk.require('2.0')
import gtk

# get the clipboard
clipboard = gtk.clipboard_get()

# set the clipboard text data
clipboard.set_text('Hello!')

# make our data available to other applications
clipboard.store()

Ruby

require 'gtk2'

# initialize Ruby's GTK bindings
Gtk.init

# get the clipboard
clipboard = Gtk::Clipboard.get(Gdk::Selection::CLIPBOARD)

# set the text. Ruby-Gnome2 also provides a text= setter
clipboard.set_text('Hello, World')

# make the clipboard data available to external applications
clipboard.store

C

#include <gtk/gtk.h>
#include <gdk/gdk.h>
#include <string.h>

int main (int argc, char **argv) {
    const char *message = "Hello, World";

    /* initialize GTK */
    gtk_init (&argc, &argv);

    /* set the clipboard text */
    gtk_clipboard_set_text(gtk_clipboard_get(GDK_SELECTION_CLIPBOARD), message, strlen(message));

    /* store the clipboard text */
    gtk_clipboard_store(gtk_clipboard_get(GDK_SELECTION_CLIPBOARD));

    return 0;
}

You can also use gtk_clipboard_set_image (for Ruby and Python, there are equivalents) to pass GdkPixbuf data to the clipboard. Check the GTK/PyGTK/Ruby-Gnome2 documentation for more details.

It’s actually all quite easy, but I thought it might be nice to see the code in practice.

Categories: C, GTK, Linux, Python, Ruby, Software Development | 3 Comments