Turning Bitcask into a Caching Solution Contd.

Before I start listing the various amazing speed boosts tricks, I’d better give some numbers:

  • Windows 7, 16 cores, 8g RAM, HDD
  • key size: 16 bytes (“0000000000000001” etc.)
  • value size: 100 bytes
  • entries: 10M

We used LevelDB’s benchmark testing (right at my hand), and tested major scenarios like “fillSeq”, “fillRandom”, “overwrite”, “randSeq”, “readRandom” etc. And here’s what we got:

  • Avg write seq: 10+ micros/write
  • Avg write random: 20+ micros/write
  • Avg overwrite: 30+ micros/write
  • Avg read seq: 3+ micros/read
  • Avg read random: 5+ micros/read

Well, both write and read are pretty fast, in the range of 10 micros, besides, a few more interesting things:

  • IBM’s JDK 6 is slower than Oracle’s JDK 6, 150% to 200% time, mainly in the area of IO operations, still looking at this big gap.
  • SSD drive could boost the performance noticeably, 200% faster or more.
  • Heap usage is sort of inevitable, much as we designed to shift the storage to RAM & Disk, as frequent write/read requires serialization & deserialization, the intermediary exists on heap couldn’t get collected as fast as we anticipate, even if we use -Xgcpolicy:gencon to ensure faster young generation objects collections. GC frees up eventually, but OOM still happened couple of times during the stress tests.

Despite the fact that this POC isn’t over, and the performance definitely has room of improvements, I’m still quite excited by proving couple of ideas that really speed things up. And I’m gonna list a few:

  • The fastest IO is no IO
    • Obviously it’s impossible, caching at certain stage would have to start swapping, and IO is inevitable apparently.
    • But it could be reduced by a few means: memory mapped file, sequential writes are some good choices, but here I’m gonna talk about a nio structure: ByteBuffer; LevelDB calls it a Slice, well, it’s essentially a byte array or stream, plus position, limit marks. Simple as it looks, it’s quite powerful.
    • Here’s a real case, as we store “key”, “value” as a document structure, when we need to read it back, we’ll have to deserialize them respectively. We tried using byte[] as an obvious choice, but it requires us to copy the byte array from either MappedByteBuffer or FileChannel. I hated this copy, well System.arraycopy is exetremely fast, but when dealing with KB or MB data, we’re talking about O(n) operation to do the copy. ByteBuffer gives me a choice, as I fetched the whole document ByteBuffer (or mapped), I could do sth neat: {code}ByteBuffer getKey(final ByteBuffer doc, final int offset, final int length){final ByteBuffer key = doc.duplicate();key.position(offset).limit(offset + length);return key;}{code}
    • The power as simple as it looks like is to reduce the redundant bytes copies, deserializer should know how to handle ByteBuffer indifferent from byte[], and it’s reading from the MappedByteBuffer in the best scenario;
    • Using compact structure to reduce IO is also widely used in the caching solution: BloomFilter, BitSet such structures could help us to prevent repeated IO operations as long as we fully leverage them, and most of the time, they’re very small, and deserve to live in heap.
    • When no IO isn’t an option, then smaller IO is better too, and we used Snappy compression which LevelDB does in the background too.
  • Memory mapped IO is faster than RandomAccessFile, besides, easier to code sometimes.
    • It’s 3 times faster in our benchmark test, but it could vary due to the actual RAM available, or the SSD drive.
    • It’s not more difficult but in fact easier to code against as I found, you could literally map the whole content of file into a MappedByteBuffer, a large chunk of bytes, and you could write, read anywhere you like using offsets, not like file channels, you read sth at a time, more to read later, you don’t know if you’ve read enough sometimes, which means more IO operations, and less readable code.
    • Though memory mapped files are faster, we couldn’t go beyond the RAM size limit, when exceeding RAM availability, it slides down pretty fast, such problem has been identified by MongoDB users because of the extensive use of memory mapped files in MongoDB. So it practically forces us to “level” our journal files into 2 levels “memory mapped” vs. “file channel”; It’s no doubt that file channel would be slower. Therefore, we do compaction + compressions, by examine the usage of each journal, we identify low usage journals, and move the active documents to latest one, and dump the whole thing. Call it survival or anything alike, it’s helping us reduce file channel reads.
  • Concurrency could be messy, and must be dealt with case by case.
    • Immutable structure is the simplest in a concurrent world. That’s why we chose to do journals, all writes are APPENDs, and turns into journaled documents. We have a limited number of journal files, and only the latest accepts APPEND, after it’s filled up, it’s transformed into a read only journal and pushed down. The read only journal is such an immutable structure, it’s 100% thread safe, and it allows all sorts of dangerous concurrent operations, including the complex optimizations tasks we issued. Because such immutable structures are so good to deal with background optimizations: compactions, compressions, evictions etc. We ensured that all optimizations against documents are done only in the read only journals. The writable journal doesn’t accept any concurrent optimizations.
    • Thread Safety has a cost, Atomic structures cost less than Locks, Semaphores, etc. When we do APPENDs, we map the writing document into the writable journal, which is memory mapped as MappedByteBuffer. Clearly we need to handle concurrent writes, using ReentrantLock was a good option, but we found that an AtomicInteger could fulfill the task nicely and faster!
    • How’s non-blocking possible? And yes, we do support non-blocking readings, not writings, even if it’s just for readings, it was a big deal. I’ll explain in future Post, how we could have done non-blocking readings.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s