SQLite: Negative Integer Primary Keys


Posted:   |  More posts about SQLite analysis

I was working on the design of a PlayBook app which would perform some data logging, and considered using SQLite as the primary mechanism. Alternatives included a simple text file (CSV format), XML, or SharedObject (not really).

Since the user may want to acquire data for a very long time, possibly on an ongoing basis for years, it seemed important to spend at least a bit of time considering the storage space that would be used.

With my prototype, which simply appends rows to a CSV file each time a reading occurs, the file would grow to approximately 50MB in the course of a year if the app were always running. Each record is roughly 204 bytes including newline.

The nature of the data is largely integers, but I realized that many of them would be negative, and wondered about how this would affect the file size of a SQLite database.

SQLite is interesting in many ways, one of which is that it uses a variable-length format for integers, using anywhere from 1 to 8 bytes depending on the size. That page may be inaccurate, because in a more detailed description it says it could be 1 to 9 bytes.

I built a small test database with an INTEGER PRIMARY KEY column, some text, and another integer column, and created rows with a variety of data. I then analyzed a hex dump of the records in the data area.

I immediately noted that my rows that had negative primary key values were taking up space for lots of FF bytes for the primary key. Yikes! I thought, is it really this bad for storing negative integers? Here's a couple of samples (note, this output doesn't actually match the schema I mentioned above, but it doesn't matter for this analysis):

0A FF FF FF FF FF FF FF FF 81 05 00 17 00 00 66 2D 31 32 37 "f-127"
08 FF FF FF FF FF FF FF FF FD 05 00 13 00 00 66 2D 33       "f-3"
08 FF FF FF FF FF FF FF FF FE 05 00 13 00 00 66 2D 32       "f-2"
08 03 05 00 13 00 00 66 2B 33                               "f+3"
09 40 05 00 15 00 00 66 2B 36 34                            "f+64"
0A 81 00 05 00 17 00 00 66 2B 31 32 38                      "f+128"

Note all those FF bytes. What a waste of space. The positive integers are stored compactly, with 3 and 64 requiring only a single byte, and 128 taking up two bytes. (Note that it would have made comparison easier had I used the same length of text in each row.)

Upon further inspection and reading, I sorted it out. Every row, even if you don't specify an INTEGER PRIMARY KEY, has a ROWID assigned by the database to track it. If you specify any other column as the INTEGER type, even negative values will be stored compactly. A -3 is stored as just FD, rather than FF FF FF FF FF FF FF FD as shown in the ROWID sample earlier.

Here's the key point: if you have a column which is defined as the INTEGER PRIMARY KEY, then SQLite doesn't bother creating a real column just for that data. It will reuse the ROWID instead. But if you plan to store lots of small negative values in that field, you'll potentially be using up lots of extra bytes for all those FFs! The "varint" format for ROWID is compact only for positive values.

It would take a very particular type of data for this even possibly to be a problem. You'd have to have lots of negative values between -1 and maybe -16 million or so.

If this were an issue, the "fix" would be to define a separate column for your data, and a INTEGER PRIMARY KEY which you mostly ignore, allowing SQLite to auto-assign and increment as positive values.

As a crude estimate of the difference in space used, here are file sizes for the two versions, one where we store our negative values in the INTEGER PRIMARY KEY column where they end up becoming the ROWID, and the other where we assign a second column just for that data. Each example assumes a million rows, with data going from -1 to -999,999, and I'm ignoring any overhead for tracking b-tree pages, the database header, etc.

Mode File Size
reuse ROWID 10.5MB
separate field 6.6MB

Okay, well, that about ties things up. Quite clearly, even on the PlayBook, it's unlikely a difference of 4MB is going to matter to anyone. And none of this would have affected me anyway, as the values I need for my primary key fields turn out not to be negative!

Comments powered by Disqus