This blog post is part of a series about the new OSM file format “OMA”. It is the fourth post in the series. At the end of the article you’ll find links to the other blog entries.
The real subject of this blog series is the newly developed file format. So far I have mainly talked about the tools for creating and using the format, because the format itself is a dry subject. But now it’s time to dive more deeply into the format. I will not go into all the details, because I think that, apart from a few freaks like me, people are not interested in them. If you are, take a look at the specs.
Fast Access
OpenStreetMap data consists of a set of elements. Some of these elements are nodes, some are ways, and some are relations. You can think of OSM as a big storehouse where all the elements are scattered around:

Obviously, it’s hard to find what you’re looking for if there is no order.
The Order of OSM Files
Fortunately, it’s not that bad. In traditional OSM files there is some order: Elements are sorted by type and, as a second criterion, by ID.

This order doesn’t help much when looking for specific elements. In particular, you can’t take an element out of the middle and look at it; you have to go through the elements from top to bottom.1
The Order of Databases
Because it’s hard to work this way, many schemes for sorting OSM elements have been developed over time. For example, in a database they can be stored as tables with one element per row and one table for each type. You can think of this as storing nodes in one room, ways in another, and relations in a third room in our storehouse.
Typically, databases also contain some indexes. In our analogy, they are like posters at the entrance to each room, telling you which shelf to look on for a particular element. Indexes are very important for quick searching.

The Order of OMA Files
Oma files sort elements on three levels: First by type and geographic location, second by key, and third by value. They contain something like a database index at each level, so typical searches can be handled very quickly.
In our analogy, you might think of the rooms in the building as the first level, the shelf units in each room as the second level and the shelves in each shelf unit as the third level; again, with posters at each level.

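To make this a little more concrete, here is a minimal sketch in Java (the language the Oma tools are written in) of how such a three-level index could be organised. The class, the nested maps and the string keys are purely my own illustration and do not reflect the actual data structures or layout of Oma files:

```java
// Illustrative only: a three-level index, roughly mirroring the
// rooms / shelf units / shelves analogy. Not the real Oma structures.
import java.util.HashMap;
import java.util.Map;

class ThreeLevelIndex {
    // level 1: element type + geographic block -> level 2 index
    // level 2: tag key                         -> level 3 index
    // level 3: tag value                       -> position of the slice in the file
    private final Map<String, Map<String, Map<String, Long>>> index = new HashMap<>();

    void put(String typeAndBlock, String key, String value, long filePosition) {
        index.computeIfAbsent(typeAndBlock, t -> new HashMap<>())
             .computeIfAbsent(key, k -> new HashMap<>())
             .put(value, filePosition);
    }

    // A query like "all restaurants in this block" only has to jump to the
    // slices it needs instead of scanning the whole file.
    Long lookup(String typeAndBlock, String key, String value) {
        Map<String, Map<String, Long>> byKey = index.get(typeAndBlock);
        if (byKey == null) return null;
        Map<String, Long> byValue = byKey.get(key);
        if (byValue == null) return null;
        return byValue.get(value);
    }
}
```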
Small Filesize
The second design goal was to keep the file size small. Two mechanisms are used for this: Efficient storage of the elementary datatypes and compression. There are basically three elementary datatypes to handle: geographic coordinates, integers and strings.
Geographical Coordinates
Coordinates in OSM data are given as a pair of numbers expressing longitude and latitude in the WGS84 coordinate system. A naive approach would be to store them as a pair of floating point numbers. For full accuracy you would need to use double precision numbers, resulting in 16 bytes per pair of coordinates.
That’s a lot. For example, germany.oma contains 576,208,086 pairs of coordinates. At 16 bytes per pair, this alone would result in a file size of over 8 GB.
It would be possible to use single precision numbers, accepting some small rounding errors. This would halve the number of bytes used. But there are better ways: OSM coordinates have at most seven decimal places, so if you multiply each number by 10,000,000, the two coordinates become integers that can be stored in 8 bytes without loss of precision.
This can still be improved using a technique called delta encoding: Two consecutive nodes in the file are often close to each other. So the difference between the coordinates of these two nodes is a much smaller number that can be stored in fewer bytes.
In Oma files, if the difference between two coordinates is between -32767 and 32767, this difference is stored using only 2 bytes. This is the case for 94% of all coordinates. If the difference is larger, 6 bytes must be used. On average, this results in a total of 4.4 bytes for a pair of coordinates, almost halving the amount needed again.2
For comparison, the pbf and o5m formats use a similar, but slightly more complicated, delta encoding. The result is the same: an average of 4.4 bytes for a pair of coordinates.
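To illustrate the idea, here is a minimal sketch of such a delta encoding in Java. The scaling by 10,000,000 and the 2-byte/6-byte split follow the description above; the exact byte layout and the escape marker for the 6-byte case are my own assumptions, not the Oma specification:

```java
// Sketch of delta-encoding coordinates. Assumption: a sentinel short value
// marks the rare case where the full 4-byte delta follows (2 + 4 = 6 bytes).
import java.io.DataOutputStream;
import java.io.IOException;

class CoordinateWriter {
    private static final int ESCAPE = Short.MIN_VALUE; // assumed marker value
    private int lastLon = 0;
    private int lastLat = 0;

    void writeCoordinate(DataOutputStream out, double lon, double lat) throws IOException {
        // scale to fixed-point integers (7 decimal places, lossless for OSM data)
        int lonFixed = (int) Math.round(lon * 10_000_000);
        int latFixed = (int) Math.round(lat * 10_000_000);
        // (extreme antimeridian jumps aside, consecutive deltas fit in an int)
        writeDelta(out, lonFixed - lastLon);
        writeDelta(out, latFixed - lastLat);
        lastLon = lonFixed;
        lastLat = latFixed;
    }

    private void writeDelta(DataOutputStream out, int delta) throws IOException {
        if (delta >= -32767 && delta <= 32767) {
            out.writeShort(delta);      // the common case (94%): 2 bytes
        } else {
            out.writeShort(ESCAPE);     // 2-byte marker ...
            out.writeInt(delta);        // ... plus the full 4-byte value = 6 bytes
        }
    }
}
```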
Integers
If you look carefully at OSM data, there are many places where there are small non-negative integers (the number of tags of an element, the number of nodes of a way, the version number, and so on). A normal integer takes up 4 bytes of space, and that’s a waste when the numbers are so small, because almost always 3 of those 4 bytes are just zeros. That’s like writing 0008 instead of 8.
So, Oma files store integers smaller than 255 in just one byte. Integers between 255 and 65534 use 3 bytes, and all larger integers use 7 bytes. On average, this results in 1.006 bytes per integer, which is almost perfect (using germany.oma as a reference).
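Again as an illustration, here is a minimal sketch of such a variable-length integer encoding. The 1/3/7-byte split follows the description above; the concrete escape values are my own assumptions, not necessarily what Oma files really use:

```java
// Sketch of a variable-length non-negative integer. Assumption: 255 and
// 65535 act as escape markers announcing that a wider value follows.
import java.io.DataOutputStream;
import java.io.IOException;

final class VarInt {
    static void write(DataOutputStream out, int value) throws IOException {
        if (value < 255) {
            out.writeByte(value);      // 1 byte for 0..254 (the usual case)
        } else if (value < 65535) {
            out.writeByte(255);        // escape: a 2-byte value follows
            out.writeShort(value);     // 3 bytes in total for 255..65534
        } else {
            out.writeByte(255);        // first escape ...
            out.writeShort(65535);     // ... second escape ...
            out.writeInt(value);       // ... plus 4 bytes = 7 bytes in total
        }
    }
}
```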
Strings
There are a lot of repetitive strings in OSM data. Imagine how many times the string highway alone occurs. It would be nice to store these strings more efficiently. Other formats use lookup tables to do so. For example, the lookup table might contain highway at position 3, so instead of storing the string highway, the number 3 could be stored.
I have experimented with such lookup tables too, but they turned out to be a bad idea. Why is that? Well, the whole thing is piped through a compression algorithm called deflate. This algorithm is very good at recognising repetitions, so the repeated highway strings can be compressed better than repeated occurrences of 3. It turned out that table lookup mechanisms combined with deflate increased the file size instead of reducing it.
So I didn’t do much with strings. There is only one small difference from the standard way of storing strings in Java: I do not use a two-byte value to store the length of the string, but an integer encoded as explained above, which most of the time takes only one byte.
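As a small sketch (reusing the VarInt helper from above, and again only an illustration rather than the actual Oma code), storing a string then looks roughly like this:

```java
// Sketch: a string is stored as a variable-length length prefix followed
// by its UTF-8 bytes; short strings pay only one byte of overhead.
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class StringCodec {
    static void write(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        VarInt.write(out, bytes.length);   // usually a single byte
        out.write(bytes);
    }
}
```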
Compression
All slices can be compressed using an algorithm called deflate. This is a very old and well-established algorithm: it was developed back in the early nineties and is used in zip files, pbf files and many other places.
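For illustration, this is roughly how a byte slice can be compressed with Java’s built-in deflate implementation (java.util.zip); it only shows the algorithm the format relies on, not the actual Oma slice layout:

```java
// Sketch: deflate a byte array using java.util.zip.Deflater.
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

final class SliceCompressor {
    static byte[] deflate(byte[] slice) {
        Deflater deflater = new Deflater();
        deflater.setInput(slice);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }
}
```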
Well, I didn’t think any more about compression – it worked, so what?
In a comment on my introductory post, user cello pointed out that there are better algorithms available today. They are about 5 to 10 times faster, often with similar or even better compression.
I don’t know how well these algorithms would work together with the element storage decisions I wrote about above. To find out, they would probably have to be added to the tools. I haven’t found the time to do that, and I probably won’t find the time in the near future.
But I thought I could at least calculate rough estimates of the speed gains of these algorithms. All I had to do was measure the time taken by the deflate algorithm and divide it by 5 or 10. Unfortunately, the measurement produced some very strange results, leaving me with very little information to base an estimate on.3
It could be that the speed gain is almost nothing, or even a loss. It could also be that access times are halved. I can’t predict it.
I thought for a long time about what to do with these new compression algorithms. Yesterday I finally forced myself to make a decision: I will not include them in version 1 of the file format. It would take too many resources on my part, and I can’t tell yet whether anyone (besides me) will use OMA files anyway. But I definitely plan to include these algorithms in version 2 of the file format, when the time is ripe.
See also
- The analogy of a stack is somewhat misleading here. Actually, you can look at every place in the pile, but you do not know where one element ends and where the next one starts. I described this in more detail in my introductory post. ↩
- Oma files do not restrict the order of elements in a slice. If they were sorted in an appropriate manner, the average number of bytes needed could probably be reduced to almost 4.0. I never tried this, though. ↩
- I used two versions of germany.oma: one compressed, the other not. The second one is about three times larger than the first one. Reading the whole file byte by byte takes 1 minute for the first one and 3 minutes for the second one; that’s what I expected. Next, I read both files with an OmaReader, using an empty filter, which means that all elements are read. I expected this to be slower in both cases. While this was true for the compressed version (it now took about 2 minutes), it was not true for the uncompressed version (it took only 1½ minutes). I have no clue why reading the whole file while jumping around in it and doing some additional computation with the bytes is much faster than just reading the file sequentially… ↩