Reference data as dependency

TL;DR: Suppose your application needs some kind of reference data, for example the names of all countries, currencies, or time zones. You have two options:

  • copy the data from an open source into your application and work with it directly
  • use a third-party dependency (library, npm package, Ruby gem, etc.) that provides this data

The second approach is a good choice: you get versioning, community support, and bug fixes.

Case studies

ruby-mime-types

A Ruby MIME type registry and library. It is used by the mail gem, which is used by Action Mailer, which is used by Rails, which means it is very popular.

Problem: memory hog. Before the fix it took about 30 MB of the Ruby process's memory; after the fix it takes about 7 MB.

mini_mime to the rescue. It uses a cache-like proxy that bounds how many objects are kept in memory at once. This PR describes in detail what the problems were and what was fixed.
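The idea of bounding how many objects are held at once can be illustrated with a small least-recently-used cache. This is a minimal Python sketch of the general technique, not mini_mime's actual implementation; the `BoundedLookup` class, the `load_one` callback, and the `SAMPLE` table are hypothetical names made up for illustration.

```python
from collections import OrderedDict


class BoundedLookup:
    """Keep at most `max_size` looked-up records in memory (LRU eviction)."""

    def __init__(self, load_one, max_size=100):
        self._load_one = load_one      # hypothetical loader: key -> record or None
        self._max_size = max_size
        self._cache = OrderedDict()

    def get(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)     # mark as most recently used
            return self._cache[key]
        record = self._load_one(key)         # e.g. a search in an on-disk file
        self._cache[key] = record
        if len(self._cache) > self._max_size:
            self._cache.popitem(last=False)  # evict the least recently used
        return record

    def __len__(self):
        return len(self._cache)


# Stand-in for the on-disk data set; a real loader would read from a file.
SAMPLE = {"png": "image/png", "txt": "text/plain", "json": "application/json"}

lookup = BoundedLookup(SAMPLE.get, max_size=2)
for ext in ("png", "txt", "json"):
    lookup.get(ext)
```

After the three lookups only the two most recent entries stay cached, so memory use depends on `max_size` rather than on the size of the data set.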

Even more optimization is on its way: replacing the lookup with a native extension and using capnproto for data storage.

all-the-cities

all-the-cities is an npm package that contains all 138,398 cities of the world with a population of at least 1000, in one big JSON array.

Problem: slow startup.

The problem was solved by replacing the text-based file with a binary file, Protocol Buffers to be specific.

See: Comparison of data serialization formats. Pay special attention to capnproto.
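To show why a binary layout can help startup, here is a rough Python sketch. It uses fixed-width records from the standard `struct` module as a stand-in, not Protocol Buffers and not the package's real format; the sample cities and the record layout are assumptions for illustration.

```python
import json
import struct

# Hypothetical sample; the real package holds far more cities.
cities = [
    {"name": "Oslo", "lat": 59.91, "lon": 10.75, "population": 580000},
    {"name": "Lima", "lat": -12.04, "lon": -77.03, "population": 7737002},
]

# Fixed-width record: 32-byte name, two 64-bit floats, one unsigned 32-bit int.
# Names longer than 32 bytes would be truncated in this simplified layout.
RECORD = struct.Struct("<32sddI")


def encode(rows):
    return b"".join(
        RECORD.pack(r["name"].encode("utf-8"), r["lat"], r["lon"], r["population"])
        for r in rows
    )


def read_record(blob, index):
    """Decode one record by offset, without touching the rest of the blob."""
    name, lat, lon, population = RECORD.unpack_from(blob, index * RECORD.size)
    return {
        "name": name.rstrip(b"\0").decode("utf-8"),
        "lat": lat,
        "lon": lon,
        "population": population,
    }


blob = encode(cities)
second = read_record(blob, 1)
text = json.dumps(cities)  # the text form must be parsed in full before any lookup
```

With the JSON array, every record is parsed at load time even if only one is needed. With fixed-width records, a single entry can be decoded directly from its offset.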

Countries

Countries is a Ruby gem.

Countries is a collection of all sorts of useful information for every country in the ISO 3166 standard.

An interesting feature is that its interface looks like ActiveRecord's, which is familiar to any Rails developer.

ActiveHash

ActiveHash is a simple base class that allows you to use a ruby hash as a read-only data source for an ActiveRecord-like model.

ActiveHash also ships with:

  • ActiveFile: a base class that you can use to create file data sources
  • ActiveYaml: a base class that will turn YAML into a hash and load the data into an ActiveHash object
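The "ActiveRecord-like model over a hash" idea can be rendered in a few lines of Python. This is only an illustration of the pattern, not ActiveHash's API; `StaticModel`, `Country`, and the sample rows are hypothetical.

```python
def _matches(row, conditions):
    """True when every key/value pair in `conditions` is present in `row`."""
    return all(row.get(key) == value for key, value in conditions.items())


class StaticModel:
    """Read-only, ActiveRecord-like access to rows held in a plain list of dicts."""

    data = []  # subclasses supply the rows

    def __init__(self, **attrs):
        self.__dict__.update(attrs)  # bypasses __setattr__ on purpose

    def __setattr__(self, name, value):
        raise AttributeError("reference data is read-only")

    @classmethod
    def all(cls):
        return [cls(**row) for row in cls.data]

    @classmethod
    def where(cls, **conditions):
        return [cls(**row) for row in cls.data if _matches(row, conditions)]

    @classmethod
    def find_by(cls, **conditions):
        found = cls.where(**conditions)
        return found[0] if found else None


class Country(StaticModel):
    # Hypothetical sample rows; a real data set would be loaded from a file.
    data = [
        {"alpha2": "NO", "name": "Norway", "currency": "NOK"},
        {"alpha2": "JP", "name": "Japan", "currency": "JPY"},
    ]


norway = Country.find_by(alpha2="NO")
```

The caller works with model-style queries (`all`, `where`, `find_by`) and never touches the underlying list directly, so the storage can later be swapped for a file or a binary format.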

Idea: use a binary format for the data and a C or Rust extension to traverse it.

Predefined schema

world.db takes a somewhat different approach. Instead of just bundling reference data, it provides predefined data models.

Embedded database

You can start with a simple Hash or Array to bundle the data, but you can also go with a full-featured embedded database. The best known is SQLite; for example, Mapbox suggests distributing map tiles as SQLite files.

But there are other interesting options:
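As a minimal sketch of bundling reference data in SQLite, the following uses Python's standard `sqlite3` module. The `currencies` table, its columns, and the rows are assumptions for illustration; an in-memory database stands in for the `.db` file a package would ship.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")  # a shipped package would open a bundled .db file
conn.execute("CREATE TABLE currencies (code TEXT PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO currencies (code, name) VALUES (?, ?)",
    [("EUR", "Euro"), ("JPY", "Japanese yen"), ("NOK", "Norwegian krone")],
)
conn.commit()


def currency_name(code):
    """Look up one currency by its primary key; returns None when absent."""
    row = conn.execute(
        "SELECT name FROM currencies WHERE code = ?", (code,)
    ).fetchone()
    return row[0] if row else None


name = currency_name("JPY")
```

Nothing is loaded into application memory up front: each lookup is an indexed query against the database file.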

  • RocksDB
  • LMDB, which is memory-mapped, allowing zero-copy lookup and iteration
  • MDBM, which uses “Memory Mapping” and “Zero-Copy”
  • Mnesia is the somewhat odd name for the real-time, distributed database which comes with Erlang.
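Memory mapping, which LMDB and MDBM rely on, can be tried out with Python's standard `mmap` module. The sketch below is not how either database is implemented; the file name, record layout, and rows are hypothetical. It writes sorted fixed-width records to a file and binary-searches them through a memory map.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout: 3-byte currency code followed by a 20-byte name.
RECORD = struct.Struct("<3s20s")

rows = sorted([(b"EUR", b"Euro"), (b"JPY", b"Japanese yen"), (b"NOK", b"Norwegian krone")])

# Build the data file once; a package would ship it prebuilt.
path = os.path.join(tempfile.mkdtemp(), "currencies.bin")
with open(path, "wb") as f:
    for code, name in rows:
        f.write(RECORD.pack(code, name))


def lookup(buf, code):
    """Binary search over sorted fixed-width records in a memory-mapped buffer."""
    lo, hi = 0, len(buf) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        key, name = RECORD.unpack_from(buf, mid * RECORD.size)
        if key == code:
            return name.rstrip(b"\0").decode("utf-8")
        if key < code:
            lo = mid + 1
        else:
            hi = mid
    return None


with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
    found = lookup(buf, b"NOK")
    missing = lookup(buf, b"XXX")
```

The operating system pages in only the parts of the file that the search touches. Unlike a true zero-copy store, `struct.unpack_from` still copies each decoded field into a Python object.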

See also:

Data structures

You do not need a full database when a good data structure will do. You just need to implement it in something performant like C or Rust.
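One example of such a structure is a trie (prefix tree), which answers "which names start with this prefix?" without scanning every entry. The sketch below is written in Python for readability; a C or Rust version would have the same shape. The `Trie` class and the sample names are made up for illustration.

```python
class Trie:
    """Prefix tree: find all stored names that start with a given prefix."""

    def __init__(self):
        self._root = {}

    def insert(self, word):
        node = self._root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = word  # the empty key marks the end of a complete word

    def starts_with(self, prefix):
        node = self._root
        for ch in prefix:
            if ch not in node:
                return []
            node = node[ch]
        # Depth-first walk below the prefix node collects every complete word.
        found, stack = [], [node]
        while stack:
            current = stack.pop()
            for key, child in current.items():
                if key == "":
                    found.append(child)
                else:
                    stack.append(child)
        return sorted(found)


trie = Trie()
for city in ("Oslo", "Osaka", "Lima"):
    trie.insert(city)

matches = trie.starts_with("Os")
```

The cost of walking down to the prefix node depends on the length of the prefix, not on how many names are stored.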

Examples:

Open data

Last but not least: where do you get the data? Open data is a nice initiative that addresses this problem. See also: Awesome Public Datasets.

See also:

  • http://www.datasciencetoolkit.org/