Friday, June 5, 2009

PostgreSQL row overhead

As mentioned previously, I'm in currently planning a +1 billion row ingestion. I've begun to examine the on-disk storage requirements. Memory will be treated in a future post.

One of the properties of the dataset which I'm ingesting is that it is time stamped. Infact, there is not too much else other than timestamps. With one obvious representation, total cost of the data in a row is ~ 8+4+4+4+4 = 22 bytes.

However, what I didn't appreciate was that the overhead which PostgreSQL imposes on a row is ~34bytes! This figure has improved slightly in 8.3, but not significantly enough.

So a couple of options occur to me: use a data base with a smaller row overhead (MySQL?) or redesign the tables.

I've plumped for the latter. Basically I'm working the tables into a form where they are much squarer. Reducing the row count and increasing the amount of data per row. I'm using arrays to increase the amount of data.

This redesign has a cost, and without exact requirements from our users it's too early to tell if this decision will return to haunt me...

No comments: