
Redis source code - how to get started



Redis is implemented in C. The natural place to start is the main function. But rather than reading the code line by line, we should follow one main thread of execution: how the code runs, step by step, when we send a command to Redis. This lets us observe from the outside first, try executing a few commands, and only after understanding their externally visible behavior dive into the corresponding source code. To understand this code, we first need to understand Redis's event mechanism. Once we understand the event loop, we can also answer an interesting question: why can Redis handle multiple requests simultaneously even though it runs on a single thread? (Strictly speaking, Redis does not run on only a single thread, but apart from the main thread, the other threads play auxiliary roles and run asynchronously in the background.)

Starting from the main function and following the execution path, we could trace the code all the way down. To keep this article from growing too long, however, we will limit its scope: the goal is to lead readers from the main function, step by step, to the execution entry point of any Redis command. From there, this article connects with the series on Redis internal data structures, and you can explore the remaining parts yourself.

To express clearly, this article is divided into several parts:

  1. An overview of the initialization process (starting from the main function) and the structure of the event loop;
  2. An overview of how Redis processes command requests;
  3. A detailed introduction to the event mechanism;
  4. The detailed code call relationships behind the flows introduced earlier, for easy reference at any time.

Given this structure, if you only want a rough idea of the processing flow, you only need to read the first two parts; the latter two delve into details worth noting.

Note: This article analyzes Redis source code on the 5.0 branch.

Overview of Initialization Flow and Event Loop

The main function of Redis source code is in the server.c file. After the main function starts executing, the logic can be divided into two stages:

  • Various initializations (including event loop initialization);
  • Execution of the event loop.

These two execution stages can be expressed using the following flowchart:

First, let’s take a look at each step in the initialization stage:

  • Configuration loading and initialization. This step initializes the basic data structures and parameters of the Redis server. In the source code, the server is represented by a struct called redisServer, which holds everything the server needs to run: the port number and listening file descriptors, the connected clients, the Redis command table, the various persistence-related parameters, and the event loop structure that we will discuss shortly. The server is represented by a single global variable of type redisServer (named server), and the main purpose of this step is to initialize that global variable. One function deserves special attention here: populateCommandTable. It initializes the Redis command table, from which the configuration of any command (such as its expected number of arguments and its execution entry point) can be retrieved by name. In the second part of this article, we will start from receiving a command request and follow the code until it looks up this table, thereby finding the command's execution entry point. One more thing worth noting in this step: after the global redisServer structure is initialized, the configuration is loaded from the configuration file (redis.conf), which may overwrite some of the parameters just initialized. In other words, Redis first runs one initialization round so that all internal data structures and parameters have default values, and then loads custom configuration from the file.
  • Create the event loop. In Redis, the event loop is represented by a struct called aeEventLoop. “Create the event loop” is mainly to create an aeEventLoop structure and store it in the server global variable (the aforementioned redisServer type structure). In addition, the execution of the event loop depends on the underlying I/O multiplexing mechanism (such as the epoll mechanism on the Linux system). Therefore, this step also includes the initialization of the underlying I/O multiplexing mechanism (by calling system APIs).
  • Start socket listening. The server must listen in order to receive requests. Depending on the configuration, this step may open two types of listeners: for TCP connections and for Unix domain sockets. A Unix domain socket is an efficient inter-process communication (IPC) mechanism, explicitly defined in the POSIX specification, used for communication between two processes on the same host. It performs better than TCP because it bypasses the overhead of the protocol stack. When connecting to a Redis server on the same machine, the Redis client can choose to connect over a Unix domain socket. Whichever type of listener is used, the program obtains a file descriptor and stores it in the server global variable. For TCP, since multiple IP addresses and ports can be bound, there may be multiple listening file descriptors. Later, the program uses the file descriptors obtained in this step to register I/O event callbacks.
  • Register timer event callbacks. As a single-threaded program, if Redis wants to schedule some asynchronously executed tasks, such as periodically executing the expired key recycling action, there is no other way except relying on the event loop mechanism. This step is to register a timer event in the previously created event loop and configure it to periodically execute a callback function: serverCron. Since Redis only has one main thread, the periodic execution of this function is also in this thread, driven by the event loop (that is, called at the appropriate time), but it does not affect the execution of other logic on the same thread (equivalent to time slicing). What does the serverCron function do? In fact, in addition to periodically executing the expired key recycling action, it also performs many other tasks, such as master-slave reconnection, reconnection between Cluster nodes, triggering the execution of BGSAVE and AOF rewrite, etc. This is not the focus of this article, so it will not be described in detail here.
  • Register I/O event callbacks. The main job of the Redis server is to listen for I/O events, parse command requests from clients, execute the commands, and return the results; listening for I/O events naturally relies on the event loop as well. As mentioned earlier, Redis can open two types of listeners: for TCP connections and for Unix domain sockets. This step therefore registers callbacks for both types of I/O events, with two callback functions: acceptTcpHandler and acceptUnixHandler. Requests from Redis clients enter through these two functions; we will discuss this process in the next section. (Redis also registers one more I/O event callback here, which is beyond the scope of this article.)
  • Initialize background threads. Redis creates a few additional threads that run in the background, dedicated to time-consuming tasks that can be deferred (usually cleanup work). In Redis these are called bio (Background I/O service) threads. They handle tasks such as deferred file close operations (e.g., for the unlink command), AOF persistence writes (i.e., fsync calls; note that only deferred fsync operations run on a background thread), and freeing large keys (e.g., for flushdb async). As you can see, the name bio is somewhat misleading, since these tasks are not necessarily I/O related. One question remains: a timer event callback, serverCron, was already registered earlier, and it seems these background tasks could also run there. In fact they cannot. As mentioned above, serverCron is driven by the event loop and executes on the Redis main thread, time-slicing with the other work on that thread (mainly command execution). It therefore must not perform overly time-consuming operations, or it would hurt Redis's command response time. Time-consuming, deferrable tasks can only run on separate threads.
  • Start event loop. The event loop structure has been created earlier, but the logic of actually entering the loop has not yet been implemented. After this step, the event loop runs and drives the timer event callback and I/O event callback registered earlier to execute continuously.

Note: The initialization of the Redis server actually involves many, many things, such as loading data into memory, initializing the Cluster cluster, initializing modules, etc. However, to simplify, the initialization process discussed above only lists the steps we are currently focusing on. This article focuses on the entire operation mechanism driven by events and directly related to command execution. Therefore, we temporarily ignore other irrelevant steps.

Now, let us continue to discuss the second stage in the flowchart above: the event loop.

First, let us think about why we need a loop here.

After a program starts, if there is no loop, it executes from the first instruction to the last and can only exit. As a server, Redis needs to keep waiting for client requests and processing them; it cannot exit after running through its own code once, so it must enter an infinite loop after starting. In each iteration, if an event occurs (including I/O events from client requests), the program processes it. But what if no event occurs? The program should not spin idly; it should block and wait. The wait and wakeup operations needed here both rely on capabilities provided by the operating system, which we will explain in detail in the third part of this article.

In fact, this event loop mechanism is very common; anyone who has developed mobile applications will find it basic and familiar. Apps on iOS/Android have a message loop that waits for UI events (taps, swipes, and so on) and processes them. For a server, the principle is similar; the events waited for and processed are I/O events instead. Besides I/O events, the system also needs to schedule work by time while it runs, such as delaying an operation by 100 milliseconds or running a task every second. This requires waiting for and processing another type of event: timer events.

Timer events and I/O events are two completely different kinds of events. How can the event loop schedule them uniformly? Suppose the event loop waits for I/O events while idle; a timer event may come due first, and the loop will not be awakened in time (it is still waiting for I/O). Conversely, if the loop waits for timer events and an I/O event occurs first, it will not be awakened in time either. We therefore need a mechanism that can wait for both types of events simultaneously. Fortunately, some system APIs can do exactly this, such as the epoll mechanism mentioned earlier.

The second stage in the flowchart above has expressed the execution process of the event loop quite clearly. Here we will provide some supplementary explanations for some of the steps that need attention:

  • Find the nearest timer event. As mentioned earlier, the event loop needs to wait for both timer and I/O events. For I/O events, it is sufficient to specify which file descriptors to wait for; for timer events, it is necessary to compare and determine how long to wait in the current round of the loop. Since multiple timer event callbacks may be registered during system operation, for example, a callback is required to be executed after 100 milliseconds, and another callback is required to be executed after 200 milliseconds, the event loop needs to find the nearest timer event that needs to be executed before each round of execution. In this way, the event loop knows how long to wait in the next waiting period (in this example, we need to wait for 100 milliseconds).
  • Wait for events to occur. In this step we need to wait for timer and I/O events simultaneously, relying on the system's underlying I/O multiplexing mechanism. Such mechanisms are generally designed as follows: they let us wait for I/O events on multiple file descriptors while specifying a maximum blocking timeout. If an I/O event occurs within the timeout, the program is awakened and continues; if no I/O event occurs before the timeout expires, the program is awakened as well. Waiting for timer events relies on this timeout mechanism. The timeout can also be infinite, which is equivalent to waiting only for I/O events. Recalling the previous step, Find the nearest timer event, there are three possible situations in this waiting step:
  • In the first case, a nearest timer event is found, which is required to be triggered at a future moment. In this case, this step only needs to convert the future moment into a blocking timeout time.
  • In the second case, the nearest timer event is found, but its scheduled moment has already passed. It should be triggered immediately, without any waiting. In the implementation, the event-waiting API is still called, but with the timeout set to 0 to achieve this effect.
  • In the third case, no registered timer events are found. At this time, the timeout time should be set to infinite. Only I/O events can awaken the program.
  • Determine whether I/O events occur or time out. This is the judgment logic executed by the program after it resumes from the (possible) blocking state in the previous step.
  • Execute I/O event callbacks. The callback functions acceptTcpHandler and acceptUnixHandler for TCP connection listening and Unix domain socket listening respectively, which we mentioned earlier, are called at this step.
  • Execute timer event callbacks. The periodic callback serverCron, mentioned earlier, is called in this step. Normally, once a timer event is processed, it is deleted from the event loop queue and never runs again. serverCron, however, runs repeatedly because Redis's timer callbacks can return the number of milliseconds until their next execution. If the return value is a positive number, Redis does not delete the timer event from the queue, so it gets a chance to run again later. With the default settings, serverCron returns 100, so it runs every 100 milliseconds (this frequency can be adjusted via the hz parameter in redis.conf).

So far, we have a clear outline of the Redis event loop. The main processing flow of Redis, including receiving requests, executing commands, and periodically running background tasks (serverCron), is all driven by this loop. When a request arrives, an I/O event awakens the loop, which executes the command and returns the response. Meanwhile, background asynchronous tasks (such as reclaiming expired keys) are split into small slices, triggered by timer events, and run in the gaps between I/O processing. This design lets a single thread handle a large number of requests with fast response times. Beyond the structure of the event loop itself, this implementation runs efficiently thanks to the I/O multiplexing mechanism provided by the system: the event loop shares CPU time among code paths that never "really" run concurrently, while CPU work and I/O do proceed truly in parallel through the multiplexing mechanism. Using a single thread has a further advantage: with no concurrent execution of code, there are no thread-safety concerns when accessing data structures, which greatly reduces implementation complexity.

Overview of Redis command request processing flow

When we discussed “registering I/O event callbacks” earlier, we mentioned that Redis processes requests from clients by calling the acceptTcpHandler or acceptUnixHandler callback functions. In fact, this description is too rough.

The process of Redis clients sending commands to the server can be divided into two steps:

  1. Connection establishment. The client initiates a connection request (via TCP or Unix domain socket), and the server accepts the connection.
  2. Command sending, execution, and response. Once the connection is established, the client can send command data over this newly established connection, and the server executes the command and returns the execution result to the client. Moreover, on the newly established connection, the entire “command sending, execution, and response” process can be repeated.

The first step, “connection establishment”, corresponds to the callbacks acceptTcpHandler or acceptUnixHandler in the server-side code. In other words, every time Redis server receives a new connection request, an I/O event is triggered by the event loop, and the code executes the acceptTcpHandler or acceptUnixHandler callback function.

Next, from the perspective of socket programming, the server should call the system API accept[7] to accept the connection request, creating a socket for the new connection, which corresponds to a new file descriptor. To receive command data from the new connection, the event loop must register an I/O event callback for this new file descriptor. The flowchart of this process is as follows:

From the above flowchart, we can see that a new connection registers an I/O event callback, namely readQueryFromClient. In other words, for the second step mentioned earlier, "command sending, execution, and response", whenever the server receives command data, the event loop triggers an I/O event and the code executes the readQueryFromClient callback, which is in charge of the "execution and response" part. Let's look at the flowchart of this function:

There are several points to note in the above flowchart:

  • Reading data from the socket is done in a streaming manner. From the application layer's perspective, the data read from the network layer is a stream of bytes, and we must parse complete Redis commands out of that stream before we can handle them. Due to the nature of network transmission, we cannot control how many bytes are read at a time; even if the server has received only part of a command (even a single byte), an I/O event callback may be triggered. We read data by calling the system API read[8]. Although we can specify the expected number of bytes when calling read, it does not guarantee that many bytes are returned: if we ask for 100 bytes, we may get only 80, with the remaining 20 still in transit. This causes real trouble when receiving Redis commands. First, we may not yet have enough data to parse a complete command, in which case we should wait for more data to arrive. Second, we may receive a large amount of data at once, containing more than one command; we must then parse every complete command in it, correctly finding the boundary of the last complete one, and leave any excess data to be processed when more arrives. This complication is commonly referred to as "packet sticking" (i.e., handling partial and concatenated messages).
  • The first manifestation of "packet sticking" handling: if the attempt to parse a complete command fails, the flow above exits directly. When more data arrives, the event loop triggers the I/O event callback again, and the flow re-enters to continue processing.
  • The second manifestation of “packet sticking” processing is the large loop in the above flowchart. As long as there is data that can be processed in the query buffer that temporarily stores the input data, we will keep trying to parse complete commands until all the complete commands inside are processed before exiting the loop.
  • In the step of checking the command table, we are looking up the command table initialized by populateCommandTable, which is stored in the global variable redisCommandTable in server.c. The command table contains the execution entry of each Redis command.
  • For the execution result of the command, it is only stored in an output buffer in the above flowchart, and is not output to the client yet. The process of outputting to the client is not included in this process, but is completed by another process driven by the event loop. This process involves many details, which we will skip for now and discuss in the fourth part later.

Introduction to event mechanism

In the first part of this article, we mentioned that we must have a mechanism to wait for the occurrence of both I/O and timer events at the same time. This mechanism is the system’s underlying I/O multiplexing mechanism. However, there are multiple different I/O multiplexing mechanisms on different systems. Therefore, for the convenience of upper-layer program implementation, Redis has implemented a simple event-driven program library, namely the code of ae.c, which shields the differences in the underlying system’s event processing.

Redis currently supports four I/O multiplexing mechanisms in its event library:

  • The select system call[9]. This was one of the earliest I/O multiplexing mechanisms, first used in 4.2BSD Unix in 1983[10], and is part of the POSIX specification. A similar system call is poll[11], first used in SVR3 Unix in 1986[10], which also follows the POSIX specification. Any operating system that follows POSIX supports both select and poll, so these two mechanisms are generally available on the systems we commonly use today.
  • The epoll mechanism[1]. epoll is a more modern I/O multiplexing mechanism than select, first introduced in version 2.5.44 of the Linux kernel[12]. It was designed to replace the older select and poll and provides more efficient I/O. Note that epoll is unique to Linux and is not part of the POSIX specification.
  • The kqueue mechanism[13]. kqueue first appeared in FreeBSD 4.1 in 2000 and was later adopted by NetBSD, OpenBSD, DragonflyBSD, and macOS[14]. It is the counterpart of epoll on Linux systems.
  • event ports. This is an I/O event mechanism unique to the illumos system[15].

Given that different systems provide different event mechanisms, which one does Redis use when compiled on a particular system? Since the latter three mechanisms are more modern and more efficient than select and poll, Redis prefers them whenever available.

From the summary of the operating systems applicable to each I/O mechanism, it is easy to see that if you compile Redis on macOS, it will use kqueue at the bottom. If you compile it in Linux, it will choose the epoll mechanism, which is also a common situation in the actual operation of Redis.

It should be noted that the underlying I/O event mechanism is closely tied to how high-concurrency network services are implemented. Many readers will have heard of the C10K problem[16]. With the development of hardware and networks, supporting 10,000 connections on a single machine, or even millions, has become possible[17]. High-performance network programming depends heavily on these underlying mechanisms. For those interested, here is some recommended reading (for links, see the references at the end of the article):

  • The C10K problem[16]
  • Epoll is fundamentally broken[18]
  • The Implementation of epoll[19]

Now let’s take a look back at how these underlying I/O event mechanisms support Redis’ event loop (the following description is a refinement of the event loop process in the first part of this article):

  • First, when registering I/O event callbacks in the event loop, you need to specify which callback function is registered to which event (the event is represented by a file descriptor). The mapping between events and callback functions is maintained by the event-driven program library encapsulated by Redis. Please refer to the code of the aeCreateFileEvent function for details.
  • Similarly, when registering the timer event callback in the event loop, you need to specify which callback function will be executed after how long. You need to record which callback function is expected to be called at which time, which is also maintained by the event-driven program library encapsulated by Redis. Please refer to the code of the aeCreateTimeEvent function for details.
  • Various underlying event mechanisms will provide a waiting event operation, such as the epoll_wait API provided by epoll. This waiting operation can generally specify the expected event list (the event is represented by a file descriptor), and also specify a timeout (i.e. how long to wait at most). When waiting for events in the event loop, call this waiting operation, pass in all previously registered I/O events, and convert the nearest timer event’s corresponding time into the timeout required here. Please refer to the code of the aeProcessEvents function for details.
  • There are two ways to wake up from the waiting operation: if an I/O event occurred, look up the I/O callback function for the triggered event and call it; if the timeout expired, check all registered timer events and call the callbacks whose scheduled time has already passed.

Finally, regarding event mechanisms, one more thing is worth noting: mature open-source event libraries already exist, such as libevent[20] and libev[21]. Generally speaking, these libraries hide very complex underlying system details and maintain compatibility across system versions, which is very valuable. Why did Redis's author implement his own anyway? In a post on the Redis Google Group, he gave some reasons.

The reasons can be summarized as follows:

  • Avoiding too many external dependencies; libevent, for example, is larger than Redis's own codebase.
  • Facilitate customized development.
  • Third-party libraries sometimes have unexpected bugs.

Code call relationship

For the code processing flows analyzed earlier in this article, including initialization, the event loop, receiving command requests, executing commands, and returning responses, the following tree diagram shows the call relationships of some key functions for easy reference. Once again, this diagram is based on the 5.0 branch of the Redis source code and may change as the codebase evolves.

The meaning of this tree structure is first introduced as follows:

  • Each branch to the right of the tree represents a deeper level of function call (call stack push).
  • Going to the end branch to the right indicates that there is no more function call (the call stack begins to pop, and control is returned to the event loop).
  • There are six independent trees in total. Apart from the first, whose root is the main function entry point, the other five are call flows triggered by the event loop. The root at the left of each tree is the entry point of that flow.
  • This tree diagram does not express all the function call relationships, only lists the call flows related to this article.

Some annotations have been added to the above diagram, which should be clear and correspond to some of the previous flows introduced in this article. In addition, some details in the figure that may need attention are listed below:

  • The initialization process calls aeSetBeforeSleepProc and aeSetAfterSleepProc to register two callback functions not mentioned earlier in this article. One is called at the beginning of each round of the event loop, the other after each round's blocking wait (i.e., after aeApiPoll returns). The entry point of the fifth call flow in the diagram is beforeSleep, registered into the event loop by aeSetBeforeSleepProc.
  • serverCron executes periodically, via the timeProc call in the processTimeEvents call flow.
  • In the data reception flow readQueryFromClient, lookupCommand queries the Redis command table, i.e., the redisCommandTable global structure initialized by populateCommandTable during startup. After finding the command entry, the call function in server.c is invoked to execute the command; the next level below call in the diagram is the entry function of each command (only a few examples are listed). Taking getCommand, the entry function of the get command, as an example: the execution result is ultimately stored in the output buffer, that is, the buf or reply field of the client structure (depending on the size of the result). Note that, as in the earlier discussion of the command request processing flow, only the execution result is stored in an output buffer at this point.


References

  • [1] epoll − I/O event notification facility
  • [2] Unix domain socket
  • [3] Inter-process communication
  • [4] POSIX.1-2017
  • [5] Definitions for UNIX domain sockets
  • [6] Create descriptor pair for interprocess communication
  • [7] BSD System Calls Manual ACCEPT(2)
  • [8] BSD System Calls Manual READ(2)
  • [9] BSD System Calls Manual SELECT(2)
  • [10] poll vs select vs event-based
  • [11] BSD System Calls Manual POLL(2)
  • [12] Epoll from Wikipedia
  • [13] BSD System Calls Manual KQUEUE(2)
  • [14] Kqueue from Wikipedia
  • [15] illumos from Wikipedia
  • [16] The C10K problem
  • [17] C10k problem from Wikipedia
  • [18] Epoll is fundamentally broken
  • [19] The Implementation of epoll
  • [20] libevent
  • [21] libev
  • License under CC BY-NC 4.0