Thursday, June 18, 2009

Hamster time tracker on openSUSE

I've become interested in working out what I spend my time in front of my computer doing. I knocked up a Python script that watches 'wnck' and 'dbus' for desktop events (dbus for the screensaver); it's available on my website.
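The bookkeeping at the heart of a script like this is simple; here's a minimal sketch of just the aggregation step (tally_focus_time and the event format are invented for illustration, not lifted from my actual script):

```python
from collections import defaultdict

def tally_focus_time(events):
    """Sum seconds of focus per window title from a time-ordered
    stream of (timestamp, title) focus-change events; a title of
    None marks idle time (e.g. screensaver active)."""
    totals = defaultdict(float)
    for (t0, title), (t1, _) in zip(events, events[1:]):
        if title is not None:
            totals[title] += t1 - t0
    return dict(totals)

events = [(0, "editor"), (300, "browser"), (420, None), (600, "editor")]
print(tally_focus_time(events))  # {'editor': 300.0, 'browser': 120.0}
```

The real work, of course, is in hooking the wnck focus-change and dbus screensaver signals to feed a stream like this.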

While I was thinking of alerting the GNOME community to this embryonic work, I discovered the 'Hamster' applet, which apparently aims to fulfil the same need as my scratchy program.

So I visited the Hamster site, cloned the git repository onto my machine, and set about configuring it.

For openSUSE, this turned out to require a bunch of dependencies to be fulfilled, so I thought I'd record them here so others may benefit.

zypper install autoconf automake libtool glib2-devel libgconf2-dev intltool gconf2-devel python-devel
zypper install gtk2-devel python-gtk-devel python-gobject2-devel python-gnome-devel

I might have missed some out while going through my .history, but that should get you close.

I'm not sure whether this is openSUSE-specific, but I also needed to set $PKG_CONFIG_PATH, which I pointed at /usr/lib64/pkgconfig/ (I'm on 64-bit).

Friday, June 5, 2009

PostgreSQL row overhead

As mentioned previously, I'm currently planning a 1-billion-plus row ingestion. I've begun to examine the on-disk storage requirements; memory will be treated in a future post.

One of the properties of the dataset I'm ingesting is that it is time-stamped. In fact, there is not much else besides timestamps. With one obvious representation, the total cost of the data in a row is ~ 8+4+4+4+4 = 24 bytes.

However, what I didn't appreciate was that the overhead PostgreSQL imposes on each row is ~34 bytes! This figure improved slightly in 8.3, but not significantly enough.
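A back-of-envelope calculation shows why this matters at a billion rows (ignoring page headers, alignment padding and indexes, all of which only make things worse):

```python
rows = 1_000_000_000
data_per_row = 8 + 4 + 4 + 4 + 4   # one timestamp plus four 4-byte values
overhead_per_row = 34              # approximate PostgreSQL per-row overhead

total_gb = rows * (data_per_row + overhead_per_row) / 1e9
overhead_share = overhead_per_row / (data_per_row + overhead_per_row)

print(total_gb)                  # 58.0 (GB)
print(round(overhead_share, 2))  # 0.59 -- well over half of it is overhead
```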

So a couple of options occur to me: use a database with a smaller row overhead (MySQL?) or redesign the tables.

I've plumped for the latter. Basically, I'm working the tables into a much squarer form: reducing the row count and increasing the amount of data per row, using arrays to hold the extra data.
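The effect of packing values into arrays is easy to estimate. A hedged sketch, treating the array's own header cost as negligible (in reality PostgreSQL arrays carry some overhead of their own, so the real numbers will be a little worse):

```python
def bytes_per_element(k, element_bytes=24, row_overhead=34):
    """Effective bytes per logical element when k elements share one row."""
    return element_bytes + row_overhead / k

print(bytes_per_element(1))              # 58.0 -- naive one-element-per-row layout
print(round(bytes_per_element(100), 2))  # 24.34 -- row overhead nearly amortised away
```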

This redesign has a cost, and without exact requirements from our users it's too early to tell if this decision will return to haunt me...

Tuesday, June 2, 2009

Perl DBI vs C libpq PostgreSQL ingestion

I'm planning to ingest ~1 billion rows into a PostgreSQL database.

There are some interesting storage requirements here, but in the first instance I've been working on getting ingestion throughput up.

For my trial ingestion of 5M rows, I've found libpq to be roughly twice as fast as Perl DBI.

Some notes:
  • indexing is omitted during ingestion.

  • the libpq code included data type hints; my Perl code, using DBI, didn't

  • each row ingested included a timestamp, and a few (<5) numeric values

  • running multiple ingestion processes improved throughput, although less than linearly (for <4 CPUs): two processes roughly doubled throughput, but it tailed off after that.
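That tail-off is the shape you'd expect if a fixed fraction of the work serialises (WAL writes, lock contention). Purely as an illustration — the 10% serial fraction below is a made-up number, not something I measured — Amdahl's law gives:

```python
def speedup(n, serial=0.1):
    # Amdahl's law: the serial fraction caps the gain from adding workers
    return 1 / (serial + (1 - serial) / n)

for n in (1, 2, 4, 8):
    print(n, round(speedup(n), 2))
# 1 1.0, 2 1.82, 4 3.08, 8 4.71 -- near-linear at first, then tailing off
```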