Thursday, November 11, 2010

The impact of document IDs on performance of CouchDB

One major part of our product is a data acquisition framework. The framework is responsible for gathering data on devices and sending it to the server. The server then processes the data and eventually stores it in CouchDB.

Our CouchDB storage is pretty simple. It's just a single database where the data gets inserted with a random ID. The data is hardly ever removed or modified after it has been inserted; its only purpose is to be read later and displayed as graphs and such.

Things started to get weird

However, the data server started to behave strangely. The first symptoms were slow inserts and huge disk usage by the database. For the slow inserts we first suspected the cause was that we were not using CouchDB's bulk document API, but after fixing that we only got a minor speed increase. Compacting the database didn't really help either, as the compaction ran too slowly, peaking at 300 documents per second and averaging about 100 docs/s. With 8 million docs that takes a while. Worst of all, the compaction didn't even reduce the disk usage: the database ate 26 gigabytes of disk while containing only 8 million documents. That's a whopping 3 kilobytes per document, even though the documents were really only about 100 bytes in size.

What was even weirder was that we didn't really see anything strange in the way the server performed. IO ops were low, CPU usage was low and there was plenty of free memory. Heck, we could even write zeros (dd if=/dev/zero of=/var/somefile.img) to the disk at a rate of 50 MB/s. Even killing most of the other services didn't help: CouchDB just kept being slow. We also upgraded CouchDB in our test environment to see if it helped, but got only a small performance gain from that.

Let's not be that random

As you might have heard, the problem with randomness is that you can never be sure. After a week or so of pondering the issue we stumbled upon a chapter about bulk inserts and monotonic DocIDs in the excellent CouchDB guide. With a tip from the awesome folks at #couchdb we rebuilt our system to use sequential IDs instead of purely random IDs.

We copied the idea of sequential IDs straight from CouchDB, which uses them for its auto-generated DocIDs. We just needed to implement it in Python and ended up with something along these lines:
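A minimal sketch of the approach (the class name, the exact rollover threshold and the suffix width here are assumptions based on the description below, not the production code):

    import binascii
    import os
    import random

    class SequentialID(object):
        """Generates 32-character hexadecimal document IDs: a 26-character
        random prefix that is reused for roughly 8000 calls, followed by a
        monotonically increasing 6-character suffix."""

        ROLLOVER = 8000  # assumption: regenerate the prefix after ~8000 calls

        def __init__(self):
            self._new_prefix()

        def _new_prefix(self):
            self.prefix = binascii.hexlify(os.urandom(13)).decode()  # 26 hex chars
            self.suffix = random.randint(1, 0xFF)  # small positive start value
            self.calls = 0

        def __call__(self):
            if self.calls >= self.ROLLOVER:
                self._new_prefix()
            self.calls += 1
            self.suffix += 1
            return "%s%06x" % (self.prefix, self.suffix)

Each writer process creates its own SequentialID instance and calls it once per document it inserts.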

The SequentialID is a random 32-character hexadecimal string. The first 26 characters form a prefix that stays the same for approximately 8000 subsequent calls, while the suffix is increased monotonically. After around 8000 calls the prefix is regenerated and the suffix is reset to a small positive value. We also built another variant using an ISO-formatted UTC timestamp with a few random digits suffixed.

Now you might think that we would get collisions with the SequentialID, especially if multiple processes are writing to the same CouchDB database. However, we don't, since the prefix is a randomly generated string and its entropy (i.e. length) is big enough to make collisions highly unlikely. Plus, the SequentialID is never shared between processes; each process generates its own.

Performance rocketed through the roof

No fix would be complete without performance metrics, so Petri wrote a small benchmarking script.
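Roughly, such a benchmark could look like the following sketch: it times bulk inserts of small documents into two fresh databases, one using random IDs and one using the SequentialID sketched above (the database names, batch sizes and payload are made up, and this is not the actual script):

    import json
    import time
    import urllib2
    import uuid

    COUCH = "http://localhost:5984"
    BATCH = 1000
    BATCHES = 100

    def request(path, body=None, method="POST"):
        req = urllib2.Request(COUCH + path, body, {"Content-Type": "application/json"})
        req.get_method = lambda: method
        return urllib2.urlopen(req).read()

    def benchmark(dbname, make_id):
        request("/" + dbname, method="PUT")  # create a fresh database (assumed not to exist)
        start = time.time()
        for _ in range(BATCHES):
            docs = [{"_id": make_id(), "payload": "x" * 100} for _ in range(BATCH)]
            request("/" + dbname + "/_bulk_docs", json.dumps({"docs": docs}))
        elapsed = time.time() - start
        print("%s: %.1f docs/s" % (dbname, BATCH * BATCHES / elapsed))

    benchmark("bench_random", lambda: uuid.uuid4().hex)
    benchmark("bench_sequential", SequentialID())  # the generator sketched earlier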

This showed us the huge performance increase we got. The write speed is almost four times as fast with sequential IDs as with random IDs. Not to mention that the database takes one seventh of the space on disk! We didn't measure the compaction speed, but everything indicates it should be a lot faster too.

How does it work?

The reason this worked lies in how CouchDB stores its data on disk. Internally CouchDB uses a B+-tree (from now on referred to as a B-tree) to store the data of a single database. If the keys inserted into the B-tree (in this case, the DocIDs) are random, the tree fans out quickly. Since the minimum fill rate of every internal node is 1/2, and random keys spread evenly across the key space, the nodes end up filled to only about half of their capacity, producing far more internal nodes than necessary.

The more internal nodes there are, the more disk seeks CouchDB has to perform to find the correct leaf to write the data into. This is why we didn't see any IO ops stacking up: the disk seeks do not show up in iostat! And this was the single biggest cause of the slowdown in both reads and, especially, writes of our CouchDB database.

The problems aren't even limited to disk seeking. The randomness of the DocIDs causes the tree structure to change often (due to the nature of B-trees), and since CouchDB uses an append-only log, the database has no choice but to write the restructured parts of the tree to the end of that log. As this happens more and more often while random data keeps pouring in, the disk usage grows rapidly, eventually hogging a lot more disk than the data in the database actually requires. On top of that, it slows down compaction too, as the compaction needs to constantly rebuild the new database.

The B-tree is why using (semi-)sequential IDs is a real life saver. The new scheme fills the database in an orderly fashion, and the buckets (i.e. leaves) get filled up instead of being left half full. The best part is that the IDs auto-generated by CouchDB (which were not an option for us) already use the sequential ID scheme, so if you use those you don't really need to worry about a thing.
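You can see those server-generated IDs yourself by asking CouchDB for a few of them; a quick sketch, assuming a local server on the default port:

    import json
    import urllib2

    # CouchDB hands out its auto-generated document IDs via the /_uuids endpoint.
    uuids = json.loads(urllib2.urlopen("http://localhost:5984/_uuids?count=3").read())
    print(uuids)
    # With the sequential algorithm the IDs share a long prefix and only the tail grows.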

So remember, kids: if you cram loads of data into your CouchDB, select your document ID scheme carefully!

Friday, October 29, 2010

Migrating from Xapian to ElasticSearch

We need full text search capabilities in our frontend WWW interface to allow users to search through logs sent by embedded devices in the field. Quite recently, we changed the search backend from Xapian to ElasticSearch.

Background

Xapian is an open source, GPL-licensed C++ library that implements a rich set of features for indexing documents of any type, and for searching and ranking them. An application that uses Xapian embeds it by linking to the C++ library. There's no server involved whatsoever, unless your application itself is a server.

ElasticSearch is an open source, Apache-licensed Java application: a server that indexes and searches JSON documents. It's built on top of Lucene, a popular Java library used by many higher-level search engines. ElasticSearch has an HTTP REST API as well as a higher-performance Thrift API, and its query DSL provides rich searching capabilities.
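For a taste of what the query DSL looks like, here is a sketch of the kind of search we run against our logs: the 1000 most recent entries of a single device (the index and field names are invented for the example):

    import json
    import urllib2

    query = {
        "query": {"term": {"device_id": "device-1234"}},
        "sort": [{"timestamp": {"order": "desc"}}],
        "size": 1000,
    }
    response = urllib2.urlopen(
        urllib2.Request("http://localhost:9200/logs/_search", json.dumps(query)))
    hits = json.loads(response.read())["hits"]["hits"]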

From day one, our log indexing service has indexed JSON documents. Log parsers output JSON documents, and mappings can be used to convert specific fields to forms understood by Xapian. I think the system (not invented by me!) is quite clever in how it uses Xapian to index JSON documents.

Problems with Xapian

As more and more logs started to flow in, we began facing problems with Xapian.

First, we had problems scaling indexing. Xapian's database is a bunch of files, and only one process is allowed to write to the database at a time. We wanted good durability, so the database was flushed often to avoid losing any data. Due to how the communication between the client (which sends logs) and the server (which submits them for indexing) works, we couldn't index a large batch of documents and then flush the database. So as the amount of incoming logs grew, indexing fell behind at times.

The next problem was search performance. As our log database hit about 10 million entries, searches on a single device's logs were taking many seconds to complete. Searching through the logs of all devices took minutes, even when limiting the results to a few lines. The situation was worsened by the fact that flushing a Xapian database invalidates all ongoing search operations and they have to be restarted. And our indexer flushed often.

Our setup had a single node and a single database. I believe that with some refactoring, splitting databases, adding nodes and so on, we could have done better with Xapian. It would just have been too much trouble, as we would have had to build clustering and scaling all by ourselves. At about 19 million log entries we decided to do something about it.

Meet ElasticSearch

We started to look for alternatives and found ElasticSearch. It was amazing how perfectly it seemed to fit our needs. It uses JSON as its native document format, and its mapping capabilities and JSON-based search language were built in the same spirit as our Xapian-based system.

So I started playing with it.

ElasticSearch was ridiculously simple to get running: just download the binaries and start one shell script. It was up and running in 5 minutes, with zero configuration. Once I got a grip on the mapping system, it was easy to make the same fields searchable in the same way as we had done with Xapian. What needed the most work was switching from building Xapian-type queries to ElasticSearch ones. But even this wasn't such a big deal, as our own query language was also based on JSON.
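To give an idea of the mapping system, here is a sketch of how a couple of log fields might be mapped (the index, type and field names are invented, not our actual mapping):

    import json
    import urllib2

    mapping = {
        "log": {
            "properties": {
                "timestamp": {"type": "date"},
                "device_id": {"type": "string", "index": "not_analyzed"},
                "message": {"type": "string"},
            }
        }
    }
    request = urllib2.Request(
        "http://localhost:9200/logs/log/_mapping", json.dumps(mapping))
    request.get_method = lambda: "PUT"
    urllib2.urlopen(request)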

After these issues were solved came the fun part: moving our logs from Xapian to ElasticSearch. I wrote a small Python script that iterated through all the documents in our 16GB Xapian database, made minor modifications to them and used the ElasticSearch bulk API to index a few thousand of them in each request. The process took a few hours to complete, and after it was done, it was time to see what had happened to the performance.
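The idea of the script looks roughly like this (a sketch: the database path, index and type names, chunk size and the "minor modifications" step are placeholders, not the actual code):

    import json
    import urllib2
    import xapian

    ES_BULK = "http://localhost:9200/_bulk"
    CHUNK = 2000

    def bulk_index(docs):
        # The bulk API takes newline-delimited action/document pairs.
        lines = []
        for doc in docs:
            lines.append(json.dumps({"index": {"_index": "logs", "_type": "log"}}))
            lines.append(json.dumps(doc))
        urllib2.urlopen(urllib2.Request(ES_BULK, "\n".join(lines) + "\n")).read()

    db = xapian.Database("/path/to/xapian.db")
    chunk = []
    for item in db.postlist(""):  # an empty term iterates over every document
        doc = json.loads(db.get_document(item.docid).get_data())
        # ... minor, application-specific modifications would go here ...
        chunk.append(doc)
        if len(chunk) >= CHUNK:
            bulk_index(chunk)
            chunk = []
    if chunk:
        bulk_index(chunk)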

ElasticSearch is fast

Our first ElasticSearch node had one CPU and 2GB of memory, of which 1GB was dedicated to ElasticSearch. And searching was blazingly fast. After getting used to waiting 10-15 seconds for the 1000 most recent log entries of a single client with Xapian, ElasticSearch returned the results in 5 seconds. When I pressed the search button again, I got the results (with a few new lines) in less than a second. This was amazing.

The log indexer performed a lot better too. Before, catching up on 2000 pending indexing jobs took an hour to complete. Now it took 2 minutes.

ElasticSearch is bonsai cool

All the worries about scaling are gone. If speed becomes an issue, we can start a new node or three, and let ElasticSearch work out load balancing behind the scenes.

But we're nowhere near requiring more performance. Currently, in our testing environment we still have a single ElasticSearch node, but I reduced the memory limit of ElasticSearch to 512 MB. Nothing changed in terms of speed even though the available memory was cut in half.

We're really happy with ElasticSearch and would never change back to our old system. Because of the speed, we're now able to enhance the log searching user experience. We have plans to implement polling for new entries in the log browser, fetching more lines dynamically when the user scrolls the window, and more.

As its website states, ElasticSearch really is bonsai cool.