New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce fatlog to record big response packet. #336
base: unstable
Are you sure you want to change the base?
Conversation
Signed-off-by: Chen Tianjie <TJ_Chen@outlook.com>
Signed-off-by: Chen Tianjie <TJ_Chen@outlook.com>
4fd206a
to
c3727a7
Compare
I'm not sure about this specific feature. At least from what I've seen, it's less common for a single command to produce an unexpectedly large result without it also showing up in the slowlog. |
This does happen. Some of our users are very sensitive to network flow spikes, and they couldn't find any clue in slowlog, so thay asked us about it. Slowlog only records time of command executions ss. But a hash lookup and a memory copy (simple |
@CharlesChen888 Before I review your PR, could you please provide some background information about why you want to update the slowlogCommand to heavyLoadLogCommand? Because in the top comment, you mention you want to add some new commands for fatlog, but in your pr, you make some changes for slowlog files. I am confused it. |
@hwware This is because fatlog shares a lot code of slowlog. So the function used to be called slowlogCommand now contains code of both slowlog command and fatlog command. I tried to think of a name meaning "both slow and fat", "heavy load" seems to cover both situation. |
Can you be more precise than this? People ask us at AWS too about issues, but we've always been able to debug this type of problem without what you proposed. What do you mean they are sensitive to "network flow spikes". We've seen a lot more issues with a sudden burst of smaller commands than this type of "single large" packet issue. |
Yes, this is a common case and we've also seen a lot. But there is another situation in which network flow suddenly increases without significant higher QPS, the most common reason is that users are accessing big keys, like getting a long string, or using commands like keys, scan, hget, zrange to request too many elements at once. This is common among our users. And fatlog is mainly helpful in this situation, it is a quick way to pinpoint the cause. |
Signed-off-by: chentianjie.ctj <chentianjie.ctj@alibaba-inc.com>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #336 +/- ##
============================================
+ Coverage 69.79% 69.84% +0.05%
============================================
Files 109 109
Lines 61795 61824 +29
============================================
+ Hits 43127 43184 +57
+ Misses 18668 18640 -28
|
Signed-off-by: chentianjie.ctj <chentianjie.ctj@alibaba-inc.com>
428f06b
to
f894a52
Compare
I think the scenario in the top comment is very clear: "slow" and "fat" are not same. I'd love to explain it further in my understanding. We have different types of resources like CPU, memory, network, etc. The slowlog can only record commands eat lots of CPU during the processing phase (doesn't include read/write network time), but can not record commands eat too many memory and network. For example:
So, the fatlog can help users to locate these special cases. BTW, the PR only record the fat output buffer, we should also record the fat input buffer (by |
+1, although I would imagine unexpected large outputs are a lot more common than unexpected large inputs. There are many surprising ways to see large output, like running I guess I'm convinced about the value of this feature, however I don't believe that this should be solving the case where there is uniform data with large keys, such as only being able to hit 1k tps with 1mb items. That is trivial to debug by just doing a Another thought, should we also accumulate this information on cmd_stats? That would maybe more useful for seeing changes in command output trends compared to a fat log? |
src/Makefile
Outdated
@@ -383,7 +383,7 @@ endif | |||
ENGINE_NAME=valkey | |||
SERVER_NAME=$(ENGINE_NAME)-server$(PROG_SUFFIX) | |||
ENGINE_SENTINEL_NAME=$(ENGINE_NAME)-sentinel$(PROG_SUFFIX) | |||
ENGINE_SERVER_OBJ=threads_mngr.o adlist.o quicklist.o ae.o anet.o dict.o kvstore.o server.o sds.o zmalloc.o lzf_c.o lzf_d.o pqsort.o zipmap.o sha1.o ziplist.o release.o networking.o util.o object.o db.o replication.o rdb.o t_string.o t_list.o t_set.o t_zset.o t_hash.o config.o aof.o pubsub.o multi.o debug.o sort.o intset.o syncio.o cluster.o cluster_legacy.o crc16.o endianconv.o slowlog.o eval.o bio.o rio.o rand.o memtest.o syscheck.o crcspeed.o crc64.o bitops.o sentinel.o notify.o setproctitle.o blocked.o hyperloglog.o latency.o sparkline.o valkey-check-rdb.o valkey-check-aof.o geo.o lazyfree.o module.o evict.o expire.o geohash.o geohash_helper.o childinfo.o defrag.o siphash.o rax.o t_stream.o listpack.o localtime.o lolwut.o lolwut5.o lolwut6.o acl.o tracking.o socket.o tls.o sha256.o timeout.o setcpuaffinity.o monotonic.o mt19937-64.o resp_parser.o call_reply.o script_lua.o script.o functions.o function_lua.o commands.o strl.o connection.o unix.o logreqres.o | |||
ENGINE_SERVER_OBJ=threads_mngr.o adlist.o quicklist.o ae.o anet.o dict.o kvstore.o server.o sds.o zmalloc.o lzf_c.o lzf_d.o pqsort.o zipmap.o sha1.o ziplist.o release.o networking.o util.o object.o db.o replication.o rdb.o t_string.o t_list.o t_set.o t_zset.o t_hash.o config.o aof.o pubsub.o multi.o debug.o sort.o intset.o syncio.o cluster.o cluster_legacy.o crc16.o endianconv.o heavyloadlog.o eval.o bio.o rio.o rand.o memtest.o syscheck.o crcspeed.o crc64.o bitops.o sentinel.o notify.o setproctitle.o blocked.o hyperloglog.o latency.o sparkline.o valkey-check-rdb.o valkey-check-aof.o geo.o lazyfree.o module.o evict.o expire.o geohash.o geohash_helper.o childinfo.o defrag.o siphash.o rax.o t_stream.o listpack.o localtime.o lolwut.o lolwut5.o lolwut6.o acl.o tracking.o socket.o tls.o sha256.o timeout.o setcpuaffinity.o monotonic.o mt19937-64.o resp_parser.o call_reply.o script_lua.o script.o functions.o function_lua.o commands.o strl.o connection.o unix.o logreqres.o |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even assuming we accept this, I don't really like the idea of renaming the slowlog. I think the current naming of "slow" is probably sufficient.
@@ -6,7 +6,7 @@ | |||
"since": "2.2.12", | |||
"arity": -2, | |||
"container": "SLOWLOG", | |||
"function": "slowlogCommand", | |||
"function": "heavyLoadLogCommand", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My preference was to avoid large container commands. I would rather have slowlogCommand
call some shared function then have this call a container function.
* Similarly, fatlog remembers the latest N queries that has a response | ||
* larger than K bytes. | ||
* | ||
* The size of the response to reach to be logged in the fat lof is set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* The size of the response to reach to be logged in the fat lof is set | |
* The size of the response to reach to be logged in the fat log is set |
@@ -1156,6 +1156,7 @@ int64_t commandFlagsFromString(char *s) { | |||
else if (!strcasecmp(t,"allow-stale")) flags |= CMD_STALE; | |||
else if (!strcasecmp(t,"no-monitor")) flags |= CMD_SKIP_MONITOR; | |||
else if (!strcasecmp(t,"no-slowlog")) flags |= CMD_SKIP_SLOWLOG; | |||
else if (!strcasecmp(t,"no-fatlog")) flags |= CMD_SKIP_FATLOG; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would the use case be for logging to the slowlog but not the fatlog (or vice versa)? Today it's only set in exec, since we want the underlying container command to show up instead of the exec, which I imagine is also true here. I almost think we have one flag, no-slowlog
that implies skipping both.
@@ -3230,6 +3231,7 @@ standardConfig static_configs[] = { | |||
createLongLongConfig("proto-max-bulk-len", NULL, DEBUG_CONFIG | MODIFIABLE_CONFIG, 1024*1024, LONG_MAX, server.proto_max_bulk_len, 512ll*1024*1024, MEMORY_CONFIG, NULL, NULL), /* Bulk request max size */ | |||
createLongLongConfig("stream-node-max-entries", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.stream_node_max_entries, 100, INTEGER_CONFIG, NULL, NULL), | |||
createLongLongConfig("repl-backlog-size", NULL, MODIFIABLE_CONFIG, 1, LLONG_MAX, server.repl_backlog_size, 1024*1024, MEMORY_CONFIG, NULL, updateReplBacklogSize), /* Default: 1mb */ | |||
createLongLongConfig("fatlog-log-bigger-than", NULL, MODIFIABLE_CONFIG, -1, LLONG_MAX, server.fatlog_log_bigger_than, 16*1024, INTEGER_CONFIG, NULL, NULL), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
createLongLongConfig("fatlog-log-bigger-than", NULL, MODIFIABLE_CONFIG, -1, LLONG_MAX, server.fatlog_log_bigger_than, 16*1024, INTEGER_CONFIG, NULL, NULL), | |
createLongLongConfig("fatlog-log-bigger-than", NULL, MODIFIABLE_CONFIG, -1, LLONG_MAX, server.fatlog_log_bigger_than, 16*1024, MEMORY_CONFIG, NULL, NULL), |
Any reason we wouldn't let this be a memory config, so you could set it to 1mb or something?
16kb also seems way too small, some historical data was have shows that isn't even in the P99 of data size. I would say at least 1mb.
@@ -0,0 +1,69 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I understand the rational behind a new command, a fat command is logically different from a slow command, I would also like us to more seriously consider just adding one new field to the slowlog, the output/input bytes, and then having both get logged to the same place. It's not a perfect fit, but then users can just look in "one" place for bad behavior. It also allows the two logs to perhaps explain each other, since a large output can cause a slow command. We could extend SLOWLOG GET [count] [TYPE FAT|SLOW]
to only show entries of the specific type.
I'm thinking about it more from a simpler way to explain it to end users.
Why fat log?
We already have slowlog, which keeps a record of recent queries that take relatively long time to execute. However sometimes it is not enough for debugging.
For example, we often receive questions from users like: "Why is my QPS not high, but the network flow is substantial?" or "Why did the network flow suddenly increase?" The situation is network flow suddenly increases without significant higher QPS.
The most common reason is that users are accessing big keys, like getting a long string, or using commands like keys, scan, hget, zrange to request too many elements at once. This is common among our users.
Here a new log is introduced to keep a record when a query triggers a large response (and we may call this log "fatlog"), in which the command, response size, time stamp, client name and client network address are logged. This can help to monitor traffic flow spikes and big keys. And it is a quick way to pinpoint the problem.
Alternative solution?
Detail introduction
Just like slowlog, fatlog is not actually logged in a file, and is accessible through a series of commands:
also a help command is provided:
And just like slowlog, two config items are provided to set how many log items are preserved and the response size threshold of being logged.
Also a new command flag (currently used by
EXEC
).About implementation
You will find that a lot changes have been appllied to slowlog.c. That is because the implementation of fatlog is basically reusing the code of slowlog, since they have similar format and are controled by similar commands. So what used to be slowlog.c now contains code of both slowlog and fatlog. I tried to think of a name meaning "both slow and fat", and "heavy load" seems to cover both situations.