Skip to content

RocksDB Basics

Common KV Stores

Common KV stores include:

  • Memory-based Redis
  • Single-machine disk-based LevelDB, RocksDB
  • Distributed (over TB level) Cassandra, Hbase

RocksDB

Installation

RocksDB is written in C++, and because the official binary library for the corresponding platform is not provided, different platform-compiled libraries are not interchangeable, so you need to compile and use them on the corresponding platform. If you use RocksDBJava, this problem is much better, but the Java API is always behind the C++ version.

Before installing RocksDB, you need to understand what an embedded kv-store is. For non-embedded kv-stores, such as using Redis, you need to install Redis-server first, and then read and write data through Redis-client. For embedded kv-stores, there is no specific server, and data storage is directly operated through code, that is, you only need to introduce the relevant dependencies of RocksDB into your project (add dependencies in Java to maven) to start developing.

C++

RocksDB Install: https://github.com/facebook/rocksdb/blob/master/INSTALL.md

# If ./configure reports an error like configure: error: C++ compiler cannot create executables, this means that the C++ component is missing
$ yum install -y gcc gcc-c++ gcc-g77

# Install gflags
$ git clone <https://github.com/gflags/gflags.git>
$ cd gflags
$ git checkout v2.0
$ ./configure && make && sudo make install

# Install snappy:
$ sudo yum install -y snappy snappy-devel

# Install zlib:
$ sudo yum install -y zlib zlib-devel

# Install bzip2:
$ sudo yum install -y bzip2 bzip2-devel

# Install lz4:
$ sudo yum install -y lz4-devel

# Install ASAN (optional for debugging):
$ sudo yum install -y libasan

# Install zstandard:
$ wget <https://github.com/facebook/zstd/archive/v1.1.3.tar.gz>
$ mv v1.1.3.tar.gz zstd-1.1.3.tar.gz
$ tar zxvf zstd-1.1.3.tar.gz
$ cd zstd-1.1.3
$ make && sudo make install

# Install rocksdb
$ git clone <https://github.com/facebook/rocksdb.git>

$ cd rocksdb

# Compile static library, release mode, get librocksdb.a
$ make static_lib

# Compile dynamic library, release mode, get librocksdb.so
$ make shared_lib

$ cp librocksdb.a /usr/lib
$ cp -r include/ /usr/include
$ cp librocksdb.so /usr/lib
$ sudo ldconfig

Java

RocksJava Basics: https://github.com/facebook/rocksdb/wiki/RocksJava-Basics

<dependency>
    <groupId>org.rocksdb</groupId>
    <artifactId>rocksdbjni</artifactId>
    <version>5.17.2</version>
</dependency>

Management Tool

Use ldb and sst_dump to manage RocksDB, after installing the above C++ dependencies

# Linux
$ cd rocksdb
$ make ldb sst_dump
$ cp ldb /usr/local/bin/
$ cp sst_dump /usr/local/bin/

# OSX
$ brew install rocksdb
# Corresponding commands are
$ rocksdb_ldb
$ rocksdb_sst_dump

For the usage of ldb and sst_dump, please refer to the official document https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool

examples

View all keys of all databases

#!/bin/bash

export DATA_PATH=/hadoop/rockfs/data
export DB_INSTANCES=$(ls /hadoop/rockfs/data)

for i in $DB_INSTANCES; do
  ldb --db=$DATA_PATH/$i/ scan 2>/dev/null;
done

View SST file properties

# Note that --file needs to specify the absolute path
$ sst_dump --file=$DATA_PATH/test --show_properties
from [] to []
Process /hadoop/rockfs/data/0006.sst
Sst file format: block-based
Table Properties:
------------------------------
  # data blocks: 26541
  # entries: 2283572
  raw key size: 264639191
  raw average key size: 115.888262
  raw value size: 26378342
  raw average value size: 11.551351
  data block size: 67110160
  index block size: 3620969
  filter block size: 0
  (estimated) table size: 70731129
  filter policy name: N/A
  # deleted keys: 571272

# If sst_dump: error while loading shared libraries: libgflags.so.2: cannot open shared object file: No such file or directory
$ find / -name libgflags.so.2
/root/gflags/.libs/libgflags.so.2
/usr/local/lib/libgflags.so.2
# It can be seen that the dependency has been downloaded, but it has not been found. Specify the dependency library manually
$ vim /etc/ld.so.conf.d/usr-libs.conf
/usr/local/lib
$ sudo ldconfig

Upgrade gcc to 8.1

Download gmp-5.1.1.tar.bz2, mpfr-3.1.5.tar.bz2, mpc-1.1.0.tar.gz, gcc-8.0.1.tar.gz from http://mirror.hust.edu.cn/gnu

yum install -y m4

tar -xvf gmp-5.1.1.tar.bz2
cd gmp-5.1.1
./configure --prefix=/usr/local/gmp-5.1.1 && make
make install

tar -xvf mpfr-3.1.5.tar.bz2
cd mpfr-3.1.5
./configure --prefix=/usr/local/mpfr-3.1.5 --with-gmp=/usr/local/gmp-5.1.1 && make
make install

tar -zxvf mpc-1.1.0.tar.gz
cd mpc-1.1.0
./configure --prefix=/usr/local/mpc-1.1.0 --with-gmp=/usr/local/gmp-5.1.1 --with-mpfr=/usr/local/mpfr-3.1.5 && make
make install

tar -zxvf gcc-8.0.1.tar.gz
export LD_LIBRARY_PATH=/usr/local/mpc-1.1.0/lib:/usr/local/gmp-5.1.1/lib:/usr/local/mpfr-3.1.5/lib:$LD_LIBRARY_PATH
cd gcc-8.0.1
./configure --prefix=/usr/local/gcc-8.0.1 --enable-threads=posix --disable-checking --disable-multilib --enable-languages=c,c++ --with-gmp=/usr/local/gmp-5.1.1 --with-mpfr=/usr/local/mpfr-3.1.5 --with-mpc=/usr/local/mpc-1.1.0 && make
make install

LSM-Tree(Log-Structured-Merge Tree)

Background

LSM provides better write throughput than traditional B+ trees or ISAM by eliminating random disk I/O. The fundamental problem is that whether it is HDD or SSD, random read and write are slower than sequential read and write, especially for mechanical hard disks. However, if we write to the disk sequentially, the speed is comparable to that of writing to memory: because it reduces the seek time and rotation time. Moreover, in the case of sequential writing, the data can be put into the buffer first, waiting for the number to reach one page of the disk before being written to further reduce the number of disk IOs.

For example, for log data, they have a strict time order, so they can be written sequentially. When you want to search for data, you need to scan all the data, similar to grep, so the search speed is relatively slow O(n).

For cases where there is a lot of data, the general way to improve reading performance is

  • Binary search: Sort file data by key and use binary search to complete the search for a specific key, O(logN).
  • Hash: Divide the data into different buckets using a hash, and when querying some data, get the bucket where the data is located through the hash function, and then traverse a small amount of data in the bucket.
  • B+ tree: Using B+ trees or ISAM methods can reduce the read of external files
  • External files: Save the data as logs and create a hash or search tree to map the corresponding file.

The above methods effectively improve the reading performance by arranging and storing data according to fixed data structures, but also increase a lot of random IO, losing good writing performance, especially when the amount of data is large. For example, when updating the structure of Hash and B+ Tree, part of the data needs to be moved, resulting in random IO.

So how to ensure that the data is sorted in a certain order and then written to the disk in sequence, and support fast data retrieval? It sounds contradictory, but LSM does this.

LSM principle

LSM uses a method different from the above four methods, which maintains the log file write performance and the small read operation performance loss. Essentially, it sequences all operations to avoid random disk reads and writes. How to achieve it?

The basic principle of LSM is to save the addition and modification of data to a number of similar ordered files sstable. Each file contains a certain number of changes in a short period of time. Because the files are ordered, subsequent queries will also be very fast. The file is immutable, and it will never be updated. New updates will only be written to new files. The read operation checks very few files. The background merges these files periodically to reduce the number of files.

Write principle

When there are new updates, they are first written to the memory cache, that is, memtable. The memtable uses tree structure to keep the key ordered. To ensure that the data is not lost, most implementations write WAL to the disk before the operation is written to memory. When the memtable data reaches a certain size, it will be flushed to the disk to generate an sstfile. In the flush process, the system only performs sequential disk reads and writes. The file is immutable, and the modification of the data is done by writing new files to cover the data in the old files.

Reading Principles

When a read operation comes, the system first checks memtables. If the key is not found, it will sequentially check the sst files in reverse order (newer sst data is prioritized over older sst data) until the key is found or all sst files have been traversed. The search for a single sst file is O(logN), but if there are many sst files, in the worst case, each sst file needs to be checked, so the overall time complexity is O(k*logN), where k is the number of sst files.

Therefore, the read operation of LSM is slower than other locally updated structures, and there are some tricks to improve performance.

  • The most basic method is page caching, such as the TableCache in LevelDB, which caches sstables in memory according to LRU to reduce the cost of binary search.
  • Create an index for each file using Block Index.
  • Even if each file has an index, as the number of files increases, read operations will still be slow. Moreover, each read operation still needs to access a large number of files. Is there a way to quickly tell us whether the data we want is in the current sstfile without accessing a large number of index files? The answer is to use Bloom Filter. If the Bloom filter tells you that the key is not in the current set, it will definitely not exist. But if it tells you that it is, it may not exist in reality, and there is a certain probability that the key you want to find is not in the current set. However, this does not affect the efficiency of Bloom Filter, which avoids unnecessary file reads.

All write operations are processed in batches and written only to the sequential block. In addition, the periodic operation of the merge operation will have an impact on IO, and read operations may access a large number of files (scattered reads). This simplifies the algorithm work method, and we exchange the random IO of reading and writing. This compromise is meaningful, and we can optimize read performance through software techniques such as Bloom filters or hardware (large file cache).

Basic Compaction

Summary

LSM is neutral between log and traditional single-file indexes (B+ tree, Hash Index). It provides a mechanism for managing smaller independent index files (sstable). By managing a group of index files instead of a single index file, LSM makes expensive random IO of structures such as B+ trees faster, at the cost of read operations that need to process many index files (sstable) rather than one. In addition, some IOs are consumed by merge operations.

Reference

Terminology

Terminology: https://github.com/facebook/rocksdb/wiki/Terminology

Storage Types

RocksDB has three very important structures: memtable, sstfile, and logfile.

  • memtable: As the name suggests, it is a data structure in memory. memtable is divided into two types: active memtable and immutable memtable. When the active memtable is full, it becomes immutable memtable and is written to sstfile and successfully saved to disk. After that, the data in memory can be cleared. The format of a default sstfile (https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format).
  • sstfile: sorted string table, which is an ordered string of data key. sst is layered, and each layer is larger than the previous one. The data in the upper layer and the lower layer are newer than the data in the lower layer. When the key in the upper layer is hit, the key in the lower layer is not queried, and only the old version of the key is deleted during compaction.
  • logfile: When new data is written, it is inserted into memtable. By choosing to enable WAL, the data operation is written to logfile before it is written to protect against the loss of data written to memory. logfile is a sequential log file written to disk.

Compaction

Compaction mainly includes two types: the process of dumping imutable from memory to disk sst is called flush or minor compaction. The process of dumping sst files on disk from lower levels to higher levels is called compaction or major compaction. There are separate threads for minor compaction and major compaction, which can be controlled by parameters rocksdb_max_background_flushes and rocksdb_max_background_compactions. Through minor compaction, data in memory is continuously written to disk to ensure that there is enough memory to cope with new writes. Through major compaction, the duplicate and useless data between multiple layers of SST files can be quickly reduced, thereby reducing the disk space occupied by SST files. For reading, since the number of sst files that need to be accessed is reduced, performance is also improved. Since the compaction process is continuously operating in the background, the amount of content compaction in a unit of time is not large, which will not affect the overall performance. Of course, this can be adjusted according to the actual situation by adjusting the parameters. The overall architecture of compaction can be seen in Figure 1. Understanding the basic concepts of compaction, the following will introduce the process of compaction in detail, mainly including two parts: flush (minor compaction) and compaction (major compaction), and the corresponding entry functions are BackgroundFlush and BackgroundCompaction.

RocksDB Tuning

Amplification Factors

In fact, a large part of the work of tuning a database system is to balance between the three amplification factors: write amplification, read amplification, and space amplification.

  • Write Amplification: On average, one byte is written, and n bytes are actually written during the data’s life cycle in the engine (for example, the compaction operation will write the data again after merging and compressing), and its write amplification rate is n. That is, if the write speed is 10mb/s in the business side, and the data write speed observed in the engine or operating system level is 30mb/s, then the write amplification rate of the system is 3. A high write amplification rate will limit the actual throughput of the system. Moreover, for SSDs, the larger the write amplification, the shorter the life of the SSD.
  • Read Amplification: For a small read request, the system needs to read n pages to complete the query. Its read amplification rate is n. The logical read operation may hit the internal cache of the engine or the file system cache, and if it does not hit the cache, it will carry out actual disk IO. The cost of reading operations that hit the cache is very low, but it also consumes CPU.
  • Space Amplification: It means that on average, storing one byte of data occupies n bytes of disk space in the storage engine. Its space amplification is n. For example, if you write 10mb of data and it actually occupies 100mb on the disk, the space amplification rate is 10. Space amplification and write amplification are often mutually exclusive in tuning. The larger the space amplification, the less frequent data needs to be compaction, and the lower the write amplification will be; if the space amplification rate is set small, the data will need to be compaction frequently to release storage space, which will increase the write amplification.

Statistics

Statistics is a function that RocksDB uses to statistically analyze system performance and throughput information. Enabling it can provide performance observation data more directly, quickly detect system bottlenecks or system operating status, and increase 5%~10% additional overhead due to the large number of buried points updated by various operations in the engine for statistics, but it is worth it.

** Compaction Stats **
Level Files  Size(MB) Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms)     KeyIn   KeyDrop
------------------------------------------------------------
L0      2/0        15   0.5     0.0     0.0      0.0      32.8     32.8       0.0   0.0      0.0     23.0    1457      4346    0.335       0.00          0    0.00             0        0
L1     22/0       125   1.0   163.7    32.8    130.9     165.5     34.6       0.0   5.1     25.6     25.9    6549      1086    6.031       0.00          0    0.00    1287667342        0
L2    227/0      1276   1.0   262.7    34.4    228.4     262.7     34.3       0.1   7.6     26.0     26.0   10344      4137    2.500       0.00          0    0.00    1023585700        0
L3   1634/0     12794   1.0   259.7    31.7    228.1     254.1     26.1       1.5   8.0     20.8     20.4   12787      3758    3.403       0.00          0    0.00    1128138363        0
L4   1819/0     15132   0.1     3.9     2.0      2.0       3.6      1.6      13.1   1.8     20.1     18.4     201

Source

  • https://arch-long.cn/articles/kv%20store/RocksDB.html
Feedback