TLDR: It is not profile level 1 that is the problem; it’s a gotcha with the optional ‘slowms’ argument that causes users to accidentally set verbose logging and fill their disk with log files.
In MongoDB, there are two ways to see, with individual detail, which operations were executed and how long they took.
- Profiling. Saves operation details to the capped collection system.profile. You access this information through a MongoDB connection.
- Log lines with the "COMMAND" component type in the mongod log files (also mongos log files in v4.0+). You have to access these files as a Unix or Windows user and work with them as text.
Profiling is low-cost, but because the collection is capped it keeps only a small snapshot. It has levels: 0 (off), 1 (slow ops only), 2 (all).
The (log file) logger also has levels of its own, but there is no 'off'. Even at level 0, it prints any slow operation in a "COMMAND" log line. 'Slow' is defined by the configured slowOpThresholdMs option (originally "slowms"). That is 100ms by default, which is a good default in my opinion.
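For reference, the 100ms default (and the profiler mode) can also be set on disk in the mongod config file. A sketch of the relevant YAML section, using the standard operationProfiling option names:

```
# mongod.conf (YAML) – a sketch of the relevant section
operationProfiling:
  mode: "off"             # off | slowOp | all  (profiler level 0 | 1 | 2)
  slowOpThresholdMs: 100  # the shared 'slow' threshold; 100ms is the default
```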
Usability problem
To the user, the log-or-not and profile-or-not decisions look like unrelated systems, but they share the same global variable for 'slow': serverGlobalParams.slowMS
The basic post-mortem will go:
- Someone used db.setProfilingLevel(1 /*slowOp*/, 0 /*ms*/) to start profiling all operations. (DON'T DO THIS – use db.setProfilingLevel(2 /*all*/) instead.)
- The logger starts writing every command to the log files.
- They executed db.setProfilingLevel(0/*off*/) to turn the profiler off.
- The logger continues writing every command to the log files because slowMS is still 0ms.
- The DBA gets paged after hours because the disk filled up with log files, and thinks 'oh, I should have set up log rotation on that server; I'll do it in the morning'.
- The DBA gets woken up in the small hours of the morning because the new primary node has also crashed due to a full disk.
So that’s the advice: Until MongoDB is enhanced to have separate slow-op threshold options for the profiler, never use db.setProfilingLevel(1, …). Even if you know the gotcha, someone learning over your shoulder won’t see it.
What to do instead:
- Use only db.setProfilingLevel(0 /*off*/) <-> db.setProfilingLevel(2 /*all*/), and don't touch slowms.
- There is still a valid use case for using the profiler at level 1, but if you are not taking on the responsibility of looking after the log file code's interdependence on the slowMS value, don't go there.
- If you want the logger to print "COMMAND" lines for every command, including fast ones, use db.setLogLevel(1, "command"), and run db.setLogLevel(-1, "command") to stop it again. 'Use log level 1, not profile level 1' could almost be the catchphrase of this article.
  (db.setLogLevel(1) + db.setLogLevel(0) is an alternative to the above, but it is an older, blunter method.)
- If you want to set the sampleRate, you can do that without changing the level (or 'slowms') with the following command: db.setProfilingLevel(db.getProfilingStatus().was, {"sampleRate": <new float value>})
- If you want a different slowMS threshold permanently, use the slowOpThresholdMs option in the config file; you can also change it dynamically, as in the sampleRate instructions above.
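Put together, the safe recipe in the mongo shell looks like this (a sketch; these are the standard shell helpers already named above, run against the node you want to affect, and 0.5 is just an example sample rate):

```javascript
// Safe profiling on/off – never pass level 1 with an ms argument:
db.setProfilingLevel(2)        // profile everything...
db.setProfilingLevel(0)        // ...and off again; slowms is untouched

// Log every command without involving the profiler at all:
db.setLogLevel(1, "command")   // start verbose COMMAND logging
db.setLogLevel(-1, "command")  // revert to inheriting the global level

// Change sampleRate while keeping the current level and slowms:
var st = db.getProfilingStatus()
db.setProfilingLevel(st.was, {sampleRate: 0.5, slowms: st.slowms})
```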
Some other gotchas to do with profiling and logging:
- The logger level is global, and the slowOpThresholdMs (a.k.a. slowMS) value is global, but the profiling level is per database.
- All of the above are local to the mongod (or mongos) node the commands are run on, or whose config file is set. If you run them on a primary they do not change the secondaries, and if you run them on a mongos they do not change the shard or config server replica set nodes.
- mongos nodes only provide a subset of these diagnostic features. They have no collections of their own, so for starters, they cannot make a system.profile collection.
```cpp
bool shouldDBProfile(bool shouldSample = true) {
    // Profile level 2 should override any sample rate or slowms settings.
    if (_dbprofile >= 2)
        return true;

    if (!shouldSample || _dbprofile <= 0)
        return false;

    /* Blog: and by elimination if _dbprofile == 1: */
    return elapsedTimeExcludingPauses() >= Milliseconds{serverGlobalParams.slowMS};
}

bool CurOp::completeAndLogOperation(OperationContext* opCtx,
                                    logger::LogComponent component,
                                    boost::optional<size_t> responseLength,
                                    boost::optional<long long> slowMsOverride,
                                    bool forceLog) {
    // Log the operation if it is eligible according to the current slowMS and
    // sampleRate settings.
    const bool shouldLogOp = (forceLog || shouldLog(component, logger::LogSeverity::Debug(1)));
    const long long slowMs = slowMsOverride.value_or(serverGlobalParams.slowMS);
    ...
    ...
    const bool shouldSample =
        client->getPrng().nextCanonicalDouble() < serverGlobalParams.sampleRate;

    if (shouldLogOp || (shouldSample && _debug.executionTimeMicros > slowMs * 1000LL)) {
        auto lockerInfo = opCtx->lockState()->getLockerInfo(_lockStatsBase);
        ...
        // Return 'true' if this operation should also be added to the profiler.
        return shouldDBProfile(shouldSample);
    }
}
```
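To make the interdependence concrete, here is a small JavaScript (Node) re-implementation of the decision logic above. This is my own sketch, not MongoDB code; the names simply mirror the C++ snippet:

```javascript
// Sketch of mongod's log-vs-profile decisions, modeled on the C++ above.
// The key point: slowMS is shared by BOTH decisions.
function shouldDBProfile(dbprofile, elapsedMs, slowMS, shouldSample = true) {
  if (dbprofile >= 2) return true;            // level 2: profile everything
  if (!shouldSample || dbprofile <= 0) return false;
  return elapsedMs >= slowMS;                 // level 1: only 'slow' ops
}

function shouldLogOperation(elapsedMs, slowMS, shouldSample = true, forceLog = false) {
  // forceLog stands in for shouldLog(component, Debug(1)) in the real code.
  return forceLog || (shouldSample && elapsedMs > slowMS);
}

// The gotcha: the profiler has been turned off again, but slowMS was left at 0.
const slowMS = 0;       // left behind by db.setProfilingLevel(1, 0)
const dbprofile = 0;    // set later by db.setProfilingLevel(0)
const fastOpMs = 1;     // a 1ms operation

console.log(shouldDBProfile(dbprofile, fastOpMs, slowMS)); // false – profiler off
console.log(shouldLogOperation(fastOpMs, slowMS));         // true – still logged!
```

With slowMS back at the default 100, the same 1ms operation would be neither profiled nor logged.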
Please file a feature request and list the URL in this post
See https://jira.mongodb.org/browse/SERVER-18946 (from 2015), and the related SERVER-4785 (2012).
We can hope that it might be fixed in v4.4, I guess. (I can't see any patch being backward compatible, as it would probably involve small user-interface changes.)
Thank you. Now I can file the issue on jira
Wait a minute. System.profile can be set as a capped collection, did you try it?
v4.4 update: I thought I saw https://jira.mongodb.org/browse/SERVER-4785 and/or https://jira.mongodb.org/browse/SERVER-5266 as one of the "4.3 Desired" or "4.3 Required" items several months ago, but maybe that was just a daydream. It's in Backlog now and won't be in 4.4.