Head First RocksDB
Common KV Stores
Common KV stores include:
- In-memory: Redis
- Single-machine, disk-based: LevelDB, RocksDB
- Distributed (TB scale and above): Cassandra, HBase
RocksDB
Installation
RocksDB is written in C++, and no official prebuilt binaries are provided; libraries compiled on one platform are not interchangeable with another, so you must compile on the platform where you will run it. RocksJava largely avoids this problem, but the Java API always lags behind the C++ version.
Before installing RocksDB, you need to understand what an embedded kv-store is. A non-embedded kv-store such as Redis requires a Redis server to be installed first, and data is read and written through a Redis client. An embedded kv-store has no separate server: data is read and written directly from your code. You only need to add the RocksDB dependency to your project (for Java, a Maven dependency) to start developing.
C++
# If ./configure reports an error like configure: error: C++ compiler cannot create executables, this means that the C++ component is missing
$ yum install -y gcc gcc-c++ gcc-g77
# Install gflags
$ git clone https://github.com/gflags/gflags.git
$ cd gflags
$ git checkout v2.0
$ ./configure && make && sudo make install
# Install snappy:
$ sudo yum install -y snappy snappy-devel
# Install zlib:
$ sudo yum install -y zlib zlib-devel
# Install bzip2:
$ sudo yum install -y bzip2 bzip2-devel
# Install lz4:
$ sudo yum install -y lz4-devel
# Install ASAN (optional for debugging):
$ sudo yum install -y libasan
# Install zstandard:
$ wget https://github.com/facebook/zstd/archive/v1.1.3.tar.gz
$ mv v1.1.3.tar.gz zstd-1.1.3.tar.gz
$ tar zxvf zstd-1.1.3.tar.gz
$ cd zstd-1.1.3
$ make && sudo make install
# Install rocksdb
$ git clone https://github.com/facebook/rocksdb.git
$ cd rocksdb
# Compile static library, release mode, get librocksdb.a
$ make static_lib
# Compile dynamic library, release mode, get librocksdb.so
$ make shared_lib
$ sudo cp librocksdb.a /usr/lib
$ sudo cp -r include/ /usr/include
$ sudo cp librocksdb.so /usr/lib
$ sudo ldconfig
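With the library and headers in place, a quick smoke test confirms the installation and shows the embedded model in action: there is no server process, just a program operating on a directory of files. This is a minimal sketch of the standard open/put/get usage (the database path /tmp/testdb is an arbitrary choice):

```cpp
// test_rocksdb.cc -- minimal embedded usage of RocksDB.
#include <cassert>
#include <string>
#include "rocksdb/db.h"

int main() {
  rocksdb::DB* db;
  rocksdb::Options options;
  options.create_if_missing = true;  // create the DB directory if absent

  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/testdb", &db);
  assert(s.ok());

  s = db->Put(rocksdb::WriteOptions(), "hello", "world");
  assert(s.ok());

  std::string value;
  s = db->Get(rocksdb::ReadOptions(), "hello", &value);
  assert(s.ok() && value == "world");

  delete db;  // closes the database
  return 0;
}
```

Depending on which compression libraries were compiled in, linking looks something like `g++ -std=c++11 test_rocksdb.cc -lrocksdb -lsnappy -lz -lbz2 -llz4 -lzstd -lpthread -ldl`.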
Java
<dependency>
    <groupId>org.rocksdb</groupId>
    <artifactId>rocksdbjni</artifactId>
    <version>5.17.2</version>
</dependency>
Management Tool
After installing the C++ dependencies above, use `ldb` and `sst_dump` to manage RocksDB:
# Linux
$ cd rocksdb
$ make ldb sst_dump
$ cp ldb /usr/local/bin/
$ cp sst_dump /usr/local/bin/
# OSX
$ brew install rocksdb
# Corresponding commands are
$ rocksdb_ldb
$ rocksdb_sst_dump
For the usage of `ldb` and `sst_dump`, refer to the official documentation: https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool
Examples
View all keys of all databases
#!/bin/bash
export DATA_PATH=/hadoop/rockfs/data
export DB_INSTANCES=$(ls $DATA_PATH)
for i in $DB_INSTANCES; do
  ldb --db=$DATA_PATH/$i/ scan 2>/dev/null
done
View SST file properties
# Note that --file needs to specify the absolute path
$ sst_dump --file=$DATA_PATH/test --show_properties
from [] to []
Process /hadoop/rockfs/data/0006.sst
Sst file format: block-based
Table Properties:
------------------------------
# data blocks: 26541
# entries: 2283572
raw key size: 264639191
raw average key size: 115.888262
raw value size: 26378342
raw average value size: 11.551351
data block size: 67110160
index block size: 3620969
filter block size: 0
(estimated) table size: 70731129
filter policy name: N/A
# deleted keys: 571272
# If sst_dump: error while loading shared libraries: libgflags.so.2: cannot open shared object file: No such file or directory
$ find / -name libgflags.so.2
/root/gflags/.libs/libgflags.so.2
/usr/local/lib/libgflags.so.2
# The library is present but not on the loader's search path; add it manually
$ vim /etc/ld.so.conf.d/usr-libs.conf
# add the following line to the file:
/usr/local/lib
$ sudo ldconfig
Upgrade GCC to 8.0.1
Download gmp-5.1.1.tar.bz2, mpfr-3.1.5.tar.bz2, mpc-1.1.0.tar.gz, and gcc-8.0.1.tar.gz from http://mirror.hust.edu.cn/gnu
yum install -y m4
tar -xvf gmp-5.1.1.tar.bz2
cd gmp-5.1.1
./configure --prefix=/usr/local/gmp-5.1.1 && make
make install
cd ..
tar -xvf mpfr-3.1.5.tar.bz2
cd mpfr-3.1.5
./configure --prefix=/usr/local/mpfr-3.1.5 --with-gmp=/usr/local/gmp-5.1.1 && make
make install
cd ..
tar -zxvf mpc-1.1.0.tar.gz
cd mpc-1.1.0
./configure --prefix=/usr/local/mpc-1.1.0 --with-gmp=/usr/local/gmp-5.1.1 --with-mpfr=/usr/local/mpfr-3.1.5 && make
make install
cd ..
tar -zxvf gcc-8.0.1.tar.gz
export LD_LIBRARY_PATH=/usr/local/mpc-1.1.0/lib:/usr/local/gmp-5.1.1/lib:/usr/local/mpfr-3.1.5/lib:$LD_LIBRARY_PATH
cd gcc-8.0.1
./configure --prefix=/usr/local/gcc-8.0.1 --enable-threads=posix --disable-checking --disable-multilib --enable-languages=c,c++ --with-gmp=/usr/local/gmp-5.1.1 --with-mpfr=/usr/local/mpfr-3.1.5 --with-mpc=/usr/local/mpc-1.1.0 && make
make install
LSM-Tree (Log-Structured Merge-Tree)
Background
LSM provides better write throughput than traditional B+ trees or ISAM by eliminating random disk I/O. The fundamental problem is that on both HDDs and SSDs, random reads and writes are slower than sequential ones, dramatically so on mechanical disks. Written sequentially, a disk approaches memory speed, because seek and rotation time are eliminated. Moreover, with sequential writes, data can first be buffered until a full disk page has accumulated before being written, further reducing the number of disk I/Os.
Log data, for example, has a strict time order, so it can be written sequentially. But to search it you have to scan all the data, like `grep`, so lookup is relatively slow: O(n).
When there is a lot of data, the usual ways to improve read performance are:

- Binary search: sort the file data by `key` and use binary search to find a specific `key`, O(logN).
- Hash: divide the data into `buckets` using a hash function; when querying, the hash function gives the `bucket` holding the data, and only the small amount of data in that `bucket` is then scanned.
- B+ tree: using `B+` trees or `ISAM` methods can reduce reads of external files.
- External files: save the data as logs and create a `hash` or search tree that maps to the corresponding file.
The above methods effectively improve read performance by arranging data in fixed structures, but they also add a lot of random I/O, sacrificing the good write performance, especially as the amount of data grows. For example, updating a `Hash` or `B+ Tree` structure requires moving parts of the data, which causes random I/O.
So how can data be kept in sorted order, written to disk sequentially, and still support fast retrieval? It sounds contradictory, but this is exactly what LSM does.
LSM principle
LSM takes an approach different from the four methods above: it keeps the write performance of a log file while losing only a little read performance. Essentially, it makes all operations sequential to avoid random disk reads and writes. How is that achieved?
The basic principle of LSM is to write additions and modifications of data to a series of ordered files (`sstables`). Each file contains the changes made over a short period of time. Because the files are ordered, subsequent queries within them are fast. Files are immutable and never updated in place; new updates go only into new files. Read operations check only a few files, and the background periodically merges files to keep their number down.
Write principle
When new updates arrive, they are first written to an in-memory cache, the `memtable`. The `memtable` uses a tree structure to keep `keys` ordered. To ensure data is not lost, most implementations write a `WAL` to disk before the operation is applied in memory. When the `memtable` reaches a certain size, it is `flushed` to disk, producing an `sstfile`; during the `flush`, the system performs only sequential disk I/O. Files are immutable; data is modified by writing new files whose entries supersede those in the old files.
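The options behind this write path are visible in a short sketch (the path, buffer size, and the decision to skip the `WAL` are arbitrary illustrations, not recommendations):

```cpp
#include <cassert>
#include "rocksdb/db.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.write_buffer_size = 64 * 1024 * 1024;  // memtable size that triggers a flush

  rocksdb::DB* db;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/writedb", &db);
  assert(s.ok());

  // Default path: the update is appended to the WAL, then inserted into the
  // memtable; it only reaches an sstfile when the memtable is flushed.
  s = db->Put(rocksdb::WriteOptions(), "k1", "v1");
  assert(s.ok());

  // Risky variant: skip the WAL; an unflushed memtable is lost on a crash.
  rocksdb::WriteOptions no_wal;
  no_wal.disableWAL = true;
  s = db->Put(no_wal, "k2", "v2");
  assert(s.ok());

  // Force the memtable to disk, producing an sstfile (sequential I/O only).
  s = db->Flush(rocksdb::FlushOptions());
  assert(s.ok());

  delete db;
  return 0;
}
```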
Read principle
When a read operation arrives, the system first checks the `memtables`. If the `key` is not found there, the `sst` files are checked in reverse order (newer `sst` data takes priority over older `sst` data) until the `key` is found or every `sst` file has been examined. Searching a single `sst` file is O(logN), but with many `sst` files the worst case must check each of them, so the overall time complexity is O(k*logN), where `k` is the number of `sst` files.
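A toy sketch of this lookup order, for illustration only (this is not RocksDB code; `std::map` stands in for the memtable and for each sorted file):

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

struct ToyLSM {
  std::map<std::string, std::string> memtable;               // ordered, in memory
  std::vector<std::map<std::string, std::string>> sstables;  // oldest first

  std::optional<std::string> get(const std::string& key) const {
    // 1. The memtable holds the newest data, so check it first.
    if (auto it = memtable.find(key); it != memtable.end()) return it->second;
    // 2. Then each table from newest to oldest: newer entries shadow older ones.
    for (auto t = sstables.rbegin(); t != sstables.rend(); ++t) {
      if (auto it = t->find(key); it != t->end()) return it->second;  // O(logN) each
    }
    return std::nullopt;  // all k tables checked: O(k*logN) worst case
  }
};
```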
Therefore, `LSM` reads are slower than in other, update-in-place structures, but there are some tricks to improve performance:
- The most basic method is page caching, such as the `TableCache` in `LevelDB`, which caches `sstables` in memory with an `LRU` policy to reduce the cost of binary search.
- Create an index for each file, e.g. a `Block Index`.
- Even if each file has an index, as the number of files grows, reads still slow down, and each read may touch a large number of files. Is there a way to tell quickly whether the wanted data is in the current `sstfile` at all, without reading a large number of index blocks? The answer is the `Bloom Filter` (a configuration sketch follows this list). If the Bloom filter says a `key` is not in the set, it definitely is not there. If it says the `key` is present, it may still be absent; there is some probability of a false positive. This does not hurt the usefulness of the `Bloom Filter`, which avoids many unnecessary file reads.
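As a sketch, wiring a Bloom filter and a block cache into the options looks roughly like this (10 bits per key and a 256MB cache are arbitrary example values):

```cpp
#include <cassert>
#include "rocksdb/cache.h"
#include "rocksdb/db.h"
#include "rocksdb/filter_policy.h"
#include "rocksdb/table.h"

int main() {
  rocksdb::BlockBasedTableOptions table_options;
  // Bloom filter: ~10 bits per key gives roughly a 1% false-positive rate.
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10, false));
  // Block cache: keeps hot sstable blocks in memory (LRU).
  table_options.block_cache = rocksdb::NewLRUCache(256 * 1024 * 1024);

  rocksdb::Options options;
  options.create_if_missing = true;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));

  rocksdb::DB* db;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/readdb", &db);
  assert(s.ok());
  delete db;
  return 0;
}
```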
All writes are processed in batches and written only as sequential blocks. In exchange, the periodic merge (compaction) operation costs I/O, and read operations may touch a number of files (scattered reads). This simplifies how the algorithm works: we trade the random I/O of writes for random I/O on reads. This compromise is worthwhile, because read performance can be recovered through software techniques such as Bloom filters or hardware such as a large file cache.
Basic Compaction
Summary
`LSM` sits between a plain log and a traditional single-file index (`B+ tree`, `Hash Index`). It provides a mechanism for managing a set of smaller, independent index files (`sstables`). By managing a group of index files instead of a single one, `LSM` turns the expensive random I/O of structures such as `B+` trees into fast sequential writes, at the cost of read operations having to consult many index files (`sstables`) rather than one. In addition, some I/O is consumed by merge operations.
Terminology
Storage Types
RocksDB has three very important structures: `memtable`, `sstfile`, and `logfile`.

- `memtable`: as the name suggests, a data structure in memory. There are two kinds: `active memtable` and `immutable memtable`. When the `active memtable` is full, it becomes an `immutable memtable` and is written out to an `sstfile`; once safely on disk, the corresponding memory can be cleared.
- `sstfile`: a `sorted string table`, i.e. data sorted by `key`. `sst` files are layered, each level larger than the one above it, and data in an upper level is newer than data in a lower level. When a `key` is hit in an upper level, the same `key` in lower levels is not queried; old versions of a `key` are only deleted during `compaction`.
- `logfile`: newly written data is inserted into the `memtable`. With `WAL` enabled, the operation is written to the `logfile` before it is applied in memory, protecting against loss of data that exists only in memory. The `logfile` is a sequentially written log file on disk.

(Figure: the format of a default `sstfile`.)
Compaction
`Compaction` comes in two kinds: dumping an `immutable memtable` from memory to an on-disk `sst` file is called `flush` or `minor compaction`; merging `sst` files on disk from lower levels into higher levels is called `compaction` or `major compaction`. Separate threads handle `minor compaction` and `major compaction`, controlled by the parameters `rocksdb_max_background_flushes` and `rocksdb_max_background_compactions`. Minor compaction continuously writes in-memory data to disk, ensuring there is enough memory for new writes; major compaction quickly reduces the duplicate and obsolete data across levels of `sst` files, reducing the disk space they occupy, and reads improve as well because fewer `sst` files need to be accessed. Since compaction runs continuously in the background, the amount compacted per unit of time is small and does not affect overall performance; this can of course be tuned via the parameters to fit the actual workload. The overall architecture of `compaction` is shown in Figure 1. With the basic concepts in place, the `compaction` process consists of two parts, flush (minor compaction) and compaction (major compaction), whose entry functions are `BackgroundFlush` and `BackgroundCompaction` respectively.
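(`rocksdb_max_background_flushes` and `rocksdb_max_background_compactions` are the names as exposed by MyRocks; the underlying RocksDB options are `max_background_flushes` and `max_background_compactions`.) A minimal sketch with arbitrary thread counts:

```cpp
#include <cassert>
#include "rocksdb/db.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.max_background_flushes = 2;      // threads for flush (minor compaction)
  options.max_background_compactions = 4;  // threads for major compaction

  rocksdb::DB* db;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/compactdb", &db);
  assert(s.ok());

  // Manually compact the whole key range (nullptr .. nullptr); normally
  // major compaction just runs continuously in the background.
  s = db->CompactRange(rocksdb::CompactRangeOptions(), nullptr, nullptr);
  assert(s.ok());

  delete db;
  return 0;
}
```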
RocksDB Tuning
Amplification Factors
In practice, a large part of tuning a database system is balancing three amplification factors: write amplification, read amplification, and space amplification.
- Write amplification: on average, writing one byte causes `n` bytes to actually be written over the data's lifetime in the engine (for example, `compaction` writes the data again after merging and compressing it); the write amplification factor is `n`. That is, if the business side writes at `10MB/s` but the engine or operating system level observes `30MB/s` of writes, the system's write amplification is `3`. High write amplification limits the system's real throughput, and for SSDs, the higher the write amplification, the shorter the drive's life.
- Read amplification: a small read request requires the system to read `n` pages to complete the query; the read amplification factor is `n`. A logical read may hit the engine's internal `cache` or the file system `cache`; on a `cache` miss it performs actual disk I/O. Reads that hit the `cache` are cheap, but they still consume CPU.
- Space amplification: on average, storing one byte of data occupies `n` bytes of disk space in the storage engine; the space amplification factor is `n`. For example, if you write `10MB` of data and it occupies `100MB` on disk, the space amplification is `10`. Space amplification and write amplification trade off against each other in tuning: the larger the space amplification allowed, the less often data needs `compaction` and the lower the write amplification; the smaller the space amplification target, the more frequently data must be `compacted` to release storage, which increases write amplification.
Statistics
`Statistics` is RocksDB's facility for collecting performance and throughput information about the running system. Enabling it provides direct performance observability, making it quicker to detect bottlenecks and understand the system's operating state. It adds roughly `5%~10%` overhead, because every operation in the engine updates a large number of instrumentation points, but it is worth it.
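Enabling `Statistics` from C++ looks roughly like the sketch below (the 60-second dump period is an arbitrary example):

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include "rocksdb/db.h"
#include "rocksdb/statistics.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.statistics = rocksdb::CreateDBStatistics();  // enable the counters
  options.stats_dump_period_sec = 60;  // dump stats to the LOG file periodically

  rocksdb::DB* db;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/statsdb", &db);
  assert(s.ok());

  s = db->Put(rocksdb::WriteOptions(), "k", "v");
  assert(s.ok());

  // Individual tickers can be read programmatically...
  uint64_t written = options.statistics->getTickerCount(rocksdb::BYTES_WRITTEN);
  std::cout << "bytes written: " << written << "\n";
  // ...or the whole report printed at once.
  std::cout << options.statistics->ToString() << std::endl;

  delete db;
  return 0;
}
```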
**Compaction Stats**
Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) KeyIn KeyDrop
------------------------------------------------------------
L0 2/0 15 0.5 0.0 0.0 0.0 32.8 32.8 0.0 0.0 0.0 23.0 1457 4346 0.335 0.00 0 0.00 0 0
L1 22/0 125 1.0 163.7 32.8 130.9 165.5 34.6 0.0 5.1 25.6 25.9 6549 1086 6.031 0.00 0 0.00 1287667342 0
L2 227/0 1276 1.0 262.7 34.4 228.4 262.7 34.3 0.1 7.6 26.0 26.0 10344 4137 2.500 0.00 0 0.00 1023585700 0
L3 1634/0 12794 1.0 259.7 31.7 228.1 254.1 26.1 1.5 8.0 20.8 20.4 12787 3758 3.403 0.00 0 0.00 1128138363 0
L4 1819/0 15132 0.1 3.9 2.0 2.0 3.6 1.6 13.1 1.8 20.1 18.4 201
Source
https://arch-long.cn/articles/kv%20store/RocksDB.html
Disclaimer
- Licensed under CC BY-NC 4.0
- Copyright issue feedback: me#imzye.me (replace # with @)
- Not all commands and scripts have been tested in a production environment; use at your own risk
- No personal information is collected