MongoDB Monitoring
MongoDB monitoring is critical to the smooth operation of your database. By tracking the right metrics, you can quickly identify performance issues and take corrective action. This post covers the key metrics to monitor when running a MongoDB database, so you can ensure it is performing optimally.
Capacity
The `db.stats()` command can be used to obtain storage space information for each database.
Category | Indicator Name | Monitoring Item | Reference Threshold |
---|---|---|---|
Capacity | Index size | dbstats.indexSize | <= cacheSize |
Capacity | Data size | dbstats.dataSize | <= 2T × 80% |
Capacity | Storage size | dbstats.storageSize | <= diskSize × 60% |
- The database's `cacheSize` must have enough space to accommodate the indexes; otherwise performance will suffer.
- The demand for disk space is roughly equal to the sum of `storageSize` (the size of the WiredTiger compressed dataset) and `indexSize`, with the water level set at around 80%.
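These capacity thresholds can be checked programmatically. Below is a minimal sketch in Python, assuming a dict shaped like the output of `db.stats()`; the function name and sample numbers are hypothetical:

```python
def check_capacity(dbstats, cache_size, disk_size):
    """Return capacity warnings based on the reference thresholds above.

    dbstats    -- dict with indexSize / dataSize / storageSize (bytes)
    cache_size -- configured WiredTiger cache size (bytes)
    disk_size  -- total disk capacity (bytes)
    """
    warnings = []
    # indexes should fit entirely in the cache
    if dbstats["indexSize"] > cache_size:
        warnings.append("indexSize exceeds cacheSize")
    # storage should stay below 60% of the disk
    if dbstats["storageSize"] > disk_size * 0.60:
        warnings.append("storageSize exceeds 60% of disk")
    # disk demand ~= storageSize + indexSize; keep below the 80% water level
    if dbstats["storageSize"] + dbstats["indexSize"] > disk_size * 0.80:
        warnings.append("disk usage above 80% water level")
    return warnings


# Hypothetical sample: 2 GB of indexes, 500 GB stored on a 700 GB volume
GB = 1024 ** 3
stats = {"indexSize": 2 * GB, "dataSize": 800 * GB, "storageSize": 500 * GB}
print(check_capacity(stats, cache_size=4 * GB, disk_size=700 * GB))
```

In practice you would feed in the live `db.stats()` output rather than a hand-built dict.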
Resource Usage
Connection Number
The `db.serverStatus()` command can be used to obtain complete database status information.
Category | Indicator Name | Monitoring Item | Reference Threshold |
---|---|---|---|
Connection | Available connections | connections.available | > 0 |
Connection | Current connections | connections.current | <= 8000 |
- The database can limit the number of incoming connections a single process accepts by setting `net.maxIncomingConnections`, which defaults to 65536.
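A simple check against the connection thresholds above, assuming a dict shaped like the `connections` section of `db.serverStatus()` output (function name and the sample values are hypothetical):

```python
def connection_headroom(server_status, max_current=8000):
    """Check the reference thresholds: available > 0 and current <= max_current."""
    conns = server_status["connections"]
    alerts = []
    if conns["available"] <= 0:
        alerts.append("no available connections")
    if conns["current"] > max_current:
        alerts.append("current connections above %d" % max_current)
    return alerts


# Hypothetical sample: connection count has crept past the reference threshold
status = {"connections": {"current": 9500, "available": 56036}}
print(connection_headroom(status))
```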
Concurrent Queue
Category | Indicator Name | Monitoring Item | Reference Threshold |
---|---|---|---|
Concurrency | Ticket read usage | wiredTiger.concurrentTransactions.read.out | < 128 |
Concurrency | Ticket write usage | wiredTiger.concurrentTransactions.write.out | < 128 |
Concurrency | Ticket read remaining | wiredTiger.concurrentTransactions.read.available | > 0 |
Concurrency | Ticket write remaining | wiredTiger.concurrentTransactions.write.available | > 0 |
- The WiredTiger engine manages concurrent threads with a `ticket` mechanism. The number of tickets corresponds to the number of read and write operations that can execute simultaneously; when the available tickets drop to 0, new read and write requests are blocked (entering the waiting queue).
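The ticket thresholds can be evaluated like this. A sketch assuming a dict shaped like `serverStatus().wiredTiger.concurrentTransactions`; the function name and sample numbers are hypothetical:

```python
def ticket_pressure(wt_stats, limit=128):
    """Flag WiredTiger ticket exhaustion from concurrentTransactions stats."""
    alerts = []
    for mode in ("read", "write"):
        tickets = wt_stats["concurrentTransactions"][mode]
        # "out" tickets are in use; at the limit, concurrency is saturated
        if tickets["out"] >= limit:
            alerts.append("%s tickets in use at limit" % mode)
        # with no tickets available, new requests enter the blocking queue
        if tickets["available"] <= 0:
            alerts.append("no %s tickets available; requests will queue" % mode)
    return alerts


# Hypothetical sample: reads saturated, writes healthy
wt = {"concurrentTransactions": {
    "read":  {"out": 128, "available": 0},
    "write": {"out": 12,  "available": 116},
}}
print(ticket_pressure(wt))
```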
Memory and Cache Usage
Category | Indicator Name | Monitoring Item | Reference Threshold |
---|---|---|---|
Memory | Physical memory | memory.resident | < OS.TotalMemory × 85% |
Memory | Virtual memory | memory.virtual | < OS.TotalMemory |
Cache | Cache usage size | wiredTiger.cache."bytes currently in the cache" | < maximum × 95% |
Cache | Maximum cache size | wiredTiger.cache."maximum bytes configured" | None |
Cache | Dirty cache size | wiredTiger.cache."tracked dirty bytes in the cache" | < maximum × 20% |
Cache | Pages read into cache | wiredTiger.cache."pages read into cache" | Observe fluctuations |
Cache | Unmodified eviction pages | wiredTiger.cache."unmodified pages evicted" | Observe fluctuations |
- WiredTiger uses both the file system cache and the storage engine cache (which defaults to half of physical memory). `memory.resident` is the physical memory occupied by MongoDB; unreasonable schema designs and redundant indexes can drive memory usage excessively high.
- Dirty cache is data that has been modified in the cache but not yet flushed to disk. Once the dirty proportion exceeds 20%, cache eviction comes under heavy pressure and business request latency rises accordingly. A persistently high dirty ratio usually means write pressure exceeds the disk's write performance; optimize by improving disk performance or scaling horizontally.
- For read-heavy workloads, reserve sufficient cache space. Checkpoint and TTL timers generate bursty writes to some extent; on disks with poor throughput this shows up as spikes in I/O utilization. If business latency jitters, consider a smaller trigger interval to smooth out the writes.
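The cache-usage and dirty-ratio thresholds above reduce to two divisions. A sketch assuming a dict shaped like `serverStatus().wiredTiger.cache` (function name and sample sizes are hypothetical):

```python
def cache_pressure(wt_cache):
    """Compute usage and dirty ratios from wiredTiger.cache stats."""
    maximum = wt_cache["maximum bytes configured"]
    used = wt_cache["bytes currently in the cache"]
    dirty = wt_cache["tracked dirty bytes in the cache"]
    return {
        "usage_ratio": used / maximum,   # alert above 0.95
        "dirty_ratio": dirty / maximum,  # alert above 0.20
    }


# Hypothetical sample: 8 GB cache, 6 GB used, 2 GB dirty
GB = 1024 ** 3
cache = {
    "maximum bytes configured": 8 * GB,
    "bytes currently in the cache": 6 * GB,
    "tracked dirty bytes in the cache": 2 * GB,
}
print(cache_pressure(cache))  # dirty_ratio 0.25 is above the 20% threshold
```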
Throughput
Access Class Indicators
Category | Indicator Name | Monitoring Item | Reference Threshold |
---|---|---|---|
Access | Insert | opcounters.insert (growth rate) | Calculated by merging write operations |
Access | Query | opcounters.query (growth rate) | Calculated by merging read operations |
Access | Update | opcounters.update (growth rate) | Calculated by merging write operations |
Access | Delete | opcounters.delete (growth rate) | Calculated by merging write operations |
Access | Getmore | opcounters.getmore (growth rate) | Calculated by merging read operations |
Access | Command | opcounters.command (growth rate) | <= 10000 |
Traffic | netIn | network.bytesIn (growth rate) | <= 100MB |
Traffic | netOut | network.bytesOut (growth rate) | <= 100MB |
Queue | Active read clients | globalLock.activeClients.readers | < 128 |
Queue | Active write clients | globalLock.activeClients.writers | < 128 |
Queue | Blocked read clients | globalLock.currentClients.readers | < 32 |
Queue | Blocked write clients | globalLock.currentClients.writers | < 32 |
- `opcounters` are cumulative counters of request operations; the growth rate of each operation type indicates the current access throughput.
- Monitoring read and write requests closely lets you spot potential load bottlenecks early and scale capacity before problems occur.
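Since `opcounters` are cumulative, the growth rate is the delta between two samples divided by the interval. A minimal sketch with hypothetical sample values:

```python
def op_rates(prev, curr, interval_s):
    """Per-second growth rate of each opcounter between two samples.

    prev, curr -- opcounters dicts sampled interval_s seconds apart
    """
    return {op: (curr[op] - prev[op]) / interval_s for op in curr}


# Two hypothetical serverStatus().opcounters samples taken 60 s apart
prev = {"insert": 1000, "query": 5000, "update": 800, "delete": 40,
        "getmore": 300, "command": 9000}
curr = {"insert": 1600, "query": 6500, "update": 1100, "delete": 70,
        "getmore": 450, "command": 12000}
rates = op_rates(prev, curr, interval_s=60)
print(rates["insert"])  # 10.0 inserts per second
```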
Cursor
Category | Indicator Name | Monitoring Item | Reference Threshold |
---|---|---|---|
Cursor | Number of open cursors | metrics.cursor.open.total | None |
Cursor | Number of timed-out cursors | metrics.cursor.timedOut | None |
Cursor | Number of never-time-out cursors | metrics.cursor.open.noTimeout | None |
- MongoDB opens a cursor for each query and points it at a result set.
- When a connection is disconnected abnormally, its cursor may not be closed; the server destroys a cursor only after it has been idle for 10 minutes (the default cursor timeout). If the application does not close cursors promptly, a large cursor backlog accumulates and consumes substantial memory.
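A cursor backlog check over a dict shaped like `serverStatus().metrics.cursor`; the function name and the `open_limit` value are hypothetical, since the table above sets no reference threshold:

```python
def cursor_alerts(cursor_stats, open_limit=10000):
    """Warn on cursor backlog from metrics.cursor stats."""
    alerts = []
    # a large number of open cursors suggests the app is not closing them
    if cursor_stats["open"]["total"] > open_limit:
        alerts.append("too many open cursors")
    # noTimeout cursors are never reaped automatically and can leak memory
    if cursor_stats["open"]["noTimeout"] > 0:
        alerts.append("noTimeout cursors present; they are never reaped")
    return alerts


# Hypothetical sample
stats = {"open": {"total": 120, "noTimeout": 3}, "timedOut": 17}
print(cursor_alerts(stats))
```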
Replication Set
Category | Indicator Name | Monitoring Item | Reference Threshold |
---|---|---|---|
Replication | Node status | members.state | = PRIMARY/SECONDARY/ARBITER |
Replication | Node replication lag | members.optimeDate[primary] - members.optimeDate[secondary] | < 60s |
Replication | Replication window | getReplicationInfo().timeDiff | > 5h |
Replication | Replication net value | oplog.window - oplog.lag | > 0 |
- Replication lag describes the gap between a secondary node and the primary node; the smaller the value, the better.
- The replication window is the time interval between the newest and oldest records in the `oplog` collection.
- The replication net value is the difference between the replication window and the replication lag.
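The replication net value is a one-line computation, but it is the number that decides whether a lagging secondary can ever catch up. A sketch with hypothetical values:

```python
def replication_net(window_s, lag_s):
    """Replication net value = replication window - replication lag (seconds).

    A non-positive value means the secondary risks falling off the end of
    the oplog and needing a full resync.
    """
    return window_s - lag_s


# Hypothetical: a 6-hour oplog window and 45 seconds of lag
net = replication_net(window_s=6 * 3600, lag_s=45)
print(net > 0)  # True: the secondary can still catch up
```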