High Memory Usage/ LRU cache size is not being respected #12579
@ajkr any idea what could have happened here in both cases? I guess the easiest one to answer is how/why rocksdb went above the allocated LRU cache size? Unfortunately, I don't have any other LOGs to share because of the issues described here: #12584 (nothing showed up in the WARN level logs so I don't know what was happening at the time) |
I was thinking of using strict LRU capacity but it looks like reads (and writes?) will fail if the capacity is hit which is not expected. Why don't we evict from cache instead of failing new reads? |
looks like it happens when we have lots of tombstones. This appears to match what was happening in #2952 although the issue there was due to some compaction bug. I'm wondering if there is another compaction bug at play here. |
What allocator are you using? RocksDB tends to perform poorly with glibc malloc, and better with allocators like jemalloc, which is what we use internally. Reference: https://smalldatum.blogspot.com/2015/10/myrocks-versus-allocators-glibc.html
We evict from cache as long as we can find clean, unpinned entries to evict. Block cache only contains dirty entries when write buffer memory is charged to it through a `WriteBufferManager`. That said, we try to evict from cache even if you don't set strict LRU capacity. That setting is to let you choose the behavior in cases where there is nothing evictable - fail the operation (strict), or allocate more memory (non-strict). |
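For later readers, a minimal sketch of those two behaviors (C++ API; an assumption, since the thread goes through the Rust bindings):

```cpp
#include <memory>
#include <rocksdb/cache.h>

// strict_capacity_limit = true: an insert that finds nothing evictable fails,
// so the read/write needing that block sees an error (the "strict" behavior).
// strict_capacity_limit = false: the cache temporarily allocates beyond its
// capacity instead of failing (the default, "non-strict" behavior).
rocksdb::LRUCacheOptions cache_opts;
cache_opts.capacity = 1536ull << 20;       // capacity budget, e.g. 1.5 GB
cache_opts.strict_capacity_limit = false;  // flip to true to fail instead of over-allocating
std::shared_ptr<rocksdb::Cache> block_cache = rocksdb::NewLRUCache(cache_opts);
```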
I'm using jemalloc for the allocator (i've double checked this). In the last instance this happened (screenshot above), block cache was not maxing out beyond what is configured so I don't think that's the issue. I started seeing this issue happen when I enabled the part of the system that does the "is there an index that matches prefix x" check, which is a prefix seek that only looks at the first kv returned. From the last graph i posted, it also appears to happen when there are a lot of tombstones, so the seek + tombstones combination is very odd/suspect to me (similar to the problem reported in the rocksdb ticket i linked to). Right now, i'm doing a load test, so I'm sending 5K requests with unique prefixes and the prefixes are guaranteed not to find any matching kv |
Thanks for the details. Are the 5K requests in parallel? Does your memory budget allow indexes to be pinned in block cache? |
Also, here are the db options I have configured: |
For more details on this problem, see the stats added in #6681. It looks like you have statistics enabled so you might be able to check those stats to confirm or rule out whether that is the problem. If it is the problem, unfortunately I don't think we have a good solution yet. That is why I was wondering if you have enough memory to pin the indexes so they don't risk thrashing. Changing |
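For reference, a sketch of reading those stats, assuming the #6681 stats are the block cache "redundant add" tickers (the later mention of "redundant insertion stats" suggests they are):

```cpp
#include <cstdio>
#include <memory>
#include <rocksdb/statistics.h>

// `stats` is the object assigned to options.statistics (CreateDBStatistics()).
// High redundant-add counts mean many threads loaded the same block at the
// same time, so all but one of those loads wasted memory and CPU.
void LogRedundantBlockCacheAdds(const std::shared_ptr<rocksdb::Statistics>& stats) {
  std::printf("redundant adds: total=%llu index=%llu filter=%llu data=%llu\n",
              (unsigned long long)stats->getTickerCount(rocksdb::BLOCK_CACHE_ADD_REDUNDANT),
              (unsigned long long)stats->getTickerCount(rocksdb::BLOCK_CACHE_INDEX_ADD_REDUNDANT),
              (unsigned long long)stats->getTickerCount(rocksdb::BLOCK_CACHE_FILTER_ADD_REDUNDANT),
              (unsigned long long)stats->getTickerCount(rocksdb::BLOCK_CACHE_DATA_ADD_REDUNDANT));
}
```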
ok this is good to know, i'll definitely investigate this part. I would like to confirm: if we assume that's the problem, then my options are:
I think the main reason I set cache_index_and_filter_blocks to true is to cap/control memory usage (but also that's when i thought i had jemalloc enabled but it wasn't so my issues at the time could be different).
regarding this part, is there a way/formula to know how much memory it will cost to pin the indexes? Or is this a try and find out kind of thing? Is it any different/better to use WriteBufferManager to control memory usage vs cache_index_and_filter_blocks ? |
There is a property: |
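The property name above is truncated in this copy; as an assumption, two properties that help estimate what pinning the indexes would cost are sketched here (C++ API):

```cpp
#include <cstdio>
#include <string>
#include <rocksdb/db.h>

void PrintIndexMemoryEstimates(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf) {
  std::string value;
  // Memory used by index/filter blocks currently held outside block cache.
  if (db->GetProperty(cf, "rocksdb.estimate-table-readers-mem", &value)) {
    std::printf("estimate-table-readers-mem: %s\n", value.c_str());
  }
  // Aggregated table properties include index_size and filter_size, which
  // bound how much memory pinning all indexes/filters could take.
  if (db->GetProperty(cf, "rocksdb.aggregated-table-properties", &value)) {
    std::printf("aggregated-table-properties: %s\n", value.c_str());
  }
}
```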
cool, i'll check this out and just to double check, is unpartitioned_pinning = PinningTier::kAll preferred over setting cache_index_and_filter_blocks to false? |
It is preferable if you want to use our block cache capacity setting for limiting RocksDB's total memory usage.
|
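A sketch of that preferred setup (C++ API; the thread itself goes through the Rust/C bindings): index/filter blocks stay charged to block cache but are pinned so scans cannot evict them.

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

rocksdb::BlockBasedTableOptions table_opts;
table_opts.block_cache = rocksdb::NewLRUCache(1536ull << 20);  // e.g. the 1.5 GB budget
table_opts.cache_index_and_filter_blocks = true;               // account for them in the cache
table_opts.metadata_cache_options.unpartitioned_pinning =
    rocksdb::PinningTier::kAll;                                 // but never evict them

rocksdb::Options options;
options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
```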
great! Thanks for confirming, once the c api changes land I'll experiment with this and report back |
A few other questions that just came to my mind:
|
Yes, prefix filter should prevent thrashing for index block lookups. I didn't notice earlier that it's already enabled. Then, it's surprising that |
I don't, unless that is set by default in rocksdb under the hood? In the rust library, I call Maybe I should start by looking at the ribbon filter metrics? is there a specific metric I should be looking at to see if things are working as they should? |
I found the following:
I couldn't find anything specific to ribbon filter so my guess is "bloom" filter would also be populated for ribbon filter. If so, which would be the most useful for me to add a metric for to track this issue? or maybe seek stats: rocksdb/include/rocksdb/statistics.h Lines 457 to 481 in 36ab251
|
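For reference, a sketch of reading the filter and seek tickers discussed above; it assumes (as guessed in the comment) that ribbon filters report through the same BLOOM_FILTER_* tickers:

```cpp
#include <cstdio>
#include <memory>
#include <rocksdb/statistics.h>

void LogPrefixFilterEffectiveness(const std::shared_ptr<rocksdb::Statistics>& stats) {
  // Prefix seeks that consulted the filter vs. those the filter ruled out.
  uint64_t checked = stats->getTickerCount(rocksdb::BLOOM_FILTER_PREFIX_CHECKED);
  uint64_t useful = stats->getTickerCount(rocksdb::BLOOM_FILTER_PREFIX_USEFUL);
  // Seeks issued by the application vs. seeks that actually found a key.
  uint64_t seeks = stats->getTickerCount(rocksdb::NUMBER_DB_SEEK);
  uint64_t seeks_found = stats->getTickerCount(rocksdb::NUMBER_DB_SEEK_FOUND);
  std::printf("prefix filter: checked=%llu useful=%llu; seeks=%llu found=%llu\n",
              (unsigned long long)checked, (unsigned long long)useful,
              (unsigned long long)seeks, (unsigned long long)seeks_found);
}
```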
Looks like the If you want to measure an operation's stats in isolation, we have |
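The API named above is truncated in this copy; one per-operation mechanism that fits the description is `PerfContext`, shown here as an assumption:

```cpp
#include <cstdio>
#include <memory>
#include <rocksdb/db.h>
#include <rocksdb/perf_context.h>
#include <rocksdb/perf_level.h>

// Measure a single Seek() in isolation, instead of relying on DB-wide statistics.
void MeasureOneSeek(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf,
                    const rocksdb::Slice& prefix) {
  rocksdb::SetPerfLevel(rocksdb::PerfLevel::kEnableTimeExceptForMutex);
  rocksdb::get_perf_context()->Reset();
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions(), cf));
  it->Seek(prefix);  // the operation being measured
  std::printf("%s\n", rocksdb::get_perf_context()->ToString().c_str());
  rocksdb::SetPerfLevel(rocksdb::PerfLevel::kDisable);
}
```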
something that I'm wanting to make sure of, is the |
It's as wide a scope as the |
great! thanks for confirming, I'm going to track:
and will report back what I see |
so while the issue didn't happen again (I'm still waiting for it), I think I've narrowed down the part of the system that causes this. So I initially told you that we do a prefix check to see whether there exists a key in rocksdb that starts with some prefix x that gets provided by some external system. Most of the time, there is none, so the bloom filter does the job. Now, the part I forgot to mention which is very relevant is that in the event there exists a key, we store this prefix key in the "queue" cf. A background service then iterates over each key in the queue cf, and fetches all matching kvs by prefix. After adding metrics, we see that one prefix can match 3M kvs. Once we find those keys, we delete the file that is referenced by the kvs and eventually delete the kvs. The calls we do are basically:
We get keys in batches of 1K to process them, which is why we either start from the first key that matches the prefix or continue from where we last left off. Given this background, I'm thinking this is definitely causing the thrashing issue as we are iterating over millions of keys. Given that the intention is to then delete those keys, maybe I should disable caching while iterating over those keys so that rocksdb doesn't try to cache those lookups as they are useless? To be specific, i'm thinking of setting fill_cache to false before iterating. What do you think @ajkr? |
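A sketch of that proposal (C++ API; the Rust bindings expose the same knob as `ReadOptions::set_fill_cache`):

```cpp
#include <memory>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

// Scan one prefix for the delete pass without inserting the scanned data
// blocks into block cache, since they won't be useful for future reads.
void ScanPrefixWithoutCaching(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf,
                              const rocksdb::Slice& prefix) {
  rocksdb::ReadOptions ro;
  ro.fill_cache = false;           // don't pollute block cache with this scan
  ro.prefix_same_as_start = true;  // stop once keys no longer share the prefix
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro, cf));
  for (it->Seek(prefix); it->Valid(); it->Next()) {
    // process it->key() / it->value(), e.g. queue the referenced file for deletion
  }
}
```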
we have the same issue: recently we moved from rocksdb 7.x + CLOCK_CACHE to rocksdb 8.x + LRU_CACHE. The limit (3GB) is not respected at all: the process keeps allocating memory until it gets OOM-killed. Our setup is: rocksdbjni 8.x + range scan workload |
@zaidoon1 FYI: switching to HYPER_CLOCK_CACHE fixed the memory issue in our case. Maybe it is a valid workaround for you too (but we were using CLOCK_CACHE in rocksdb 7.x) |
Thanks! That's interesting to know, maybe clock cache is more resistant to thrashing? @ajkr any idea? Or it could be your issue is different from mine. In my case, I'm pretty sure it's the iterators that are reading 1M+ kvs, and disabling caching for those should help with that. In general, I plan on switching to hyper clock cache once the auto tuning parameter of hyper clock cache (rocksdb/include/rocksdb/cache.h, line 380 in 4eaf628)
|
@zaidoon1 you're welcome! :-) Anyway, we are dealing with smaller range scans (up to a few thousand) with caching enabled, and 0 as estimated_entry_charge is working fine (maybe it could be better in 9.x)
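For reference, a sketch of that workaround in the C++ API (the commenters above are on rocksdbjni and the Rust bindings), with `estimated_entry_charge = 0` so the cache tunes its table automatically:

```cpp
#include <memory>
#include <rocksdb/cache.h>
#include <rocksdb/table.h>

rocksdb::HyperClockCacheOptions hcc_opts(
    /*_capacity=*/3ull << 30,        // e.g. the 3 GB limit mentioned above
    /*_estimated_entry_charge=*/0);  // 0 lets the cache size this itself
std::shared_ptr<rocksdb::Cache> block_cache = hcc_opts.MakeSharedCache();

rocksdb::BlockBasedTableOptions table_opts;
table_opts.block_cache = block_cache;
```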
Sorry I'm not sure. It sounds like a generally good idea to use fill_cache=false for a scan-and-delete workload because the scanned blocks will not be useful for future user reads. But, I am not sure how much it will help with this specific problem. The CPU profile is mostly showing index blocks, which leads me to think there is something special about those. If you are pinning index blocks in cache, the index blocks will already be cached so the logic related to fill_cache is bypassed. |
I don't think i'm pinning them (I have not set unpartitioned_pinning = PinningTier::kAll since the c api change didn't land yet in the rocksdb release, and everything else is default, which I don't think enables pinning?). I'm setting cache_index_and_filter_blocks to true, which should be just caching them, but as I scan and fill the cache with useless data, I expect to be kicking things out, including index and filter blocks, so this issue would be seen? Or am I misunderstanding how this works? |
Oh ok if you're not pinning them then fill_cache will make a difference on index blocks. Whether it's a positive or negative difference, I don't know. The reason it could be negative is if 5k iterators simultaneously load the same index block without putting it in block cache, the memory usage could be a bit higher than if they had accessed a shared copy of that block in block cache. But let's see. |
Would this increased memory exceed the allocated memory? I assume this is not something that is capped by block cache since the reads are not getting cached? |
Right, it won't be capped by block cache. |
So this happened again, which means the fill cache idea I had didn't work. Here are the values of the metrics I added from #12579 (comment) as well as other metrics/state of things. Notice that the filtering is doing a great job and pretty much filtering out all seeks. At this point, I don't have any more ideas, are there any other metrics/stats you suggest I add? The next step I think is waiting for the next rocksdb release that will let me set via c api:
or did this solution also become invalid given that we are filtering out most seeks? @ajkr What do you think? Also, looking at the graphs above, is it weird that block cache memory usage didn't go up at all even though index blocks end up in block cache as they are read, so I would expect block cache to also spike? Or maybe I'm misunderstanding that? For example, in my first occurrence of this, block cache matched the total memory usage: #12579 (comment) but then other times/most times, it doesn't and block cache is fine but total memory usage spikes to the max. |
What are the data sources for "Number of SST Files" and "FDs" metric? I previously assumed they would be similar but in the most recent charts I realized there's 15-20 SST files but up to 1K FDs. |
fds metric comes from: |
is it possible that it's a prefetching issue? over 90% of the prefix lookups are for keys that don't exist, but for the ones that do, when I do my prefix check using a prefix iterator, does the iterator try to fetch more data in anticipation that I would read this data and thus I end up pulling more data than I need? If there is prefetching happening, is there a way to limit this since I only care about the first kv? Would lowering the readahead size help here? |
is there another metric I should add that would at least show the problem? For example, without getting a flamegraph, and by just looking at the existing metrics I'm tracking, we wouldn't know that we are doing any lookups as it looks like everything is being filtered out by the filter setup but this is not true. It feels like there is an observability gap? |
@ajkr So I added more metrics and here is what they look like when the issue happens. A few questions:
|
It is low, but that could be due to swapping to make room for index allocations, which I think was what was happening in the original CPU profile. I think block cache does not necessarily need to be maxed out. With
If you have access to any URL column family SST files it would be pretty interesting to know what their index size is. |
so I wanted to update: I set cache_index_and_filter_blocks to false 4 days ago and I haven't seen the issue since, and rocksdb seems to be behaving much better. At this point, i'm going to say the thrashing issue was definitely the cause and it's fixed now. Thank you so much @ajkr !! I wouldn't have connected the dots without you. Also memory-wise, I don't see a noticeable increase in memory after removing cache index and filter blocks; in fact on some machines, the memory went down. Follow ups:
|
Glad to hear!
No, the allocation for a block is unaccounted until it is inserted to block cache. The redundant insertion stats we eventually used would be difficult to find without guessing the root cause first. Plus they still didn't give any info on the magnitude of the memory usage for redundantly loading blocks. Maybe we can add a DB property for memory usage of incomplete block loads, and also expose it as part of I also think we should give the option to charge block cache for these allocations immediately, but we probably couldn't make it default at this point as it is bound to cause a regression for some workload somewhere. We could also reduce the memory usage for these redundant block loads with something like @pdillinger's cooperative caching idea (mentioned here: #6699 (comment)). |
SGTM, thanks again for all the help! |
Commenting on this closed ticket as it is relevant for me as well. We are also experiencing high memory usage disrespecting the memory cap we set with a custom rocksdb config setter. We have 6GB allocated for the container limit and 5 state stores. We configured 50MB for the LRU cache, number of memtables 3, write buffer size 64MB, write buffer manager 3 * 64MB. JVM heap usage is around 2GB. On top of that we have
Given the settings, my understanding is that overall DataBlock + IndexBlock should be capped by 50MB, and WriteBuffer should be shared across the stores and capped by number of memtables * size, which is 3 * 64 = 192MB. In our case we also have punctuators triggering every 5 mins and cleaning up 1000 old records from the state stores to avoid letting them grow indefinitely. Here it uses
I checked the rocksdb internal log. For one of the pods having high memory usage, this is what it shows for a state store of size 1.7GB:
And JVM RSS shows this:
I suspect it is something to do with the iteration and index block? @zaidoon1 What were your settings for these? It was recommended to use to cap the memory.
If we set these to false, won't the index take memory from the rest of the available memory, meaning the memory is not capped? @ajkr |
And now I have:
|
Is a single cache shared across the state stores? Even if there's only one, to achieve the objective you described ("DataBlock + IndexBlock should be capped by 50MB"), you could simply not tell the WriteBufferManager about the block cache, so write buffers are not charged against it. A more ideal setup would be to set the block cache capacity to the amount of memory you want RocksDB to use for all purposes and continue letting it charge write buffers to block cache. But you would need to come up with that number. Sorry if I am missing something from the chart. The inline image in GitHub truncates the key for the line colors, and when I click into it to see the full picture, it returns a page not found error. |
Yes. I just want to understand the impact of these settings a bit more.
/**
* Indicating if we'd put index/filter blocks to the block cache.
* If not specified, each "table reader" object will pre-load index/filter
* block during table initialization.
*
* @param cacheIndexAndFilterBlocks and filter blocks should be put in block cache.
* @return the reference to the current config.
*/
and on this,
Say we need around 1 GB for caching (block cache + index/filter blocks), and with 3 memtables and 64MB for each memtable, the setting should be something like: cache = LRUCache(1GB - 1, strictCapacityLimit)
writeBufferManager = WriteBufferManager(192MB, cache)
Please correct my understanding. 🙇 Another question: how does the thrashing lead to high memory consumption? |
Right. The benefit is preventing thrashing. The downside is the index/filter memory is not accounted for. To fix that, one way is charging block cache for table reader memory usage: rocksdb/include/rocksdb/table.h Lines 355 to 368 in 0ee7f8b
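A sketch of that first option (C++ API), assuming the lines referenced above are the `cache_usage_options` override for the table-reader cache entry role:

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/table.h>

rocksdb::BlockBasedTableOptions table_opts;
rocksdb::CacheEntryRoleOptions reader_charge;
reader_charge.charged = rocksdb::CacheEntryRoleOptions::Decision::kEnabled;
// Charge table reader (index/filter metadata) memory against block cache
// capacity so it is no longer unaccounted for.
table_opts.cache_usage_options.options_overrides.emplace(
    rocksdb::CacheEntryRole::kBlockBasedTableReader, reader_charge);
```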
Another way is to pin the indexes/filters in block cache as mentioned earlier in this issue: #12579 (comment).
Yes that looks good. 1GB is the combined size for blocks+write buffers. So up to 1GB-192MB=832MB will always be available for blocks. The remaining 192MB will sometimes be available for blocks depending on the memory demand for write buffers.
We don't account for allocations for pending block loads. So N threads simultaneously loading a 10MB index block could use up to N*10MB extra memory. |
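Putting the numbers confirmed above together, a C++ sketch of that setup (rocksdbjni wires up the equivalent objects):

```cpp
#include <memory>
#include <rocksdb/cache.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>
#include <rocksdb/write_buffer_manager.h>

// One 1 GB cache that accounts for both blocks and write buffers: at least
// 1 GB - 192 MB = 832 MB is always available for blocks, and the remaining
// 192 MB is available whenever write buffers don't need it.
auto cache = rocksdb::NewLRUCache(/*capacity=*/1ull << 30,
                                  /*num_shard_bits=*/-1,
                                  /*strict_capacity_limit=*/true);
auto wbm = std::make_shared<rocksdb::WriteBufferManager>(
    /*buffer_size=*/192ull << 20, cache);  // charge write buffers to the cache

rocksdb::Options options;
options.write_buffer_manager = wbm;

rocksdb::BlockBasedTableOptions table_opts;
table_opts.block_cache = cache;
table_opts.cache_index_and_filter_blocks = true;
options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
```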
6GB RSS is not expected. I am not sure what allocator you are using, but fwiw allocator choice came up earlier in this thread: #12579 (comment). Assuming the RSS is not fragmentation, we have a function,
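The function named above is truncated in this copy; one candidate (an assumption) is `MemoryUtil::GetApproximateMemoryUsageByType`, which breaks RocksDB memory down by memtables, table readers, and block cache:

```cpp
#include <cstdio>
#include <map>
#include <memory>
#include <unordered_set>
#include <rocksdb/cache.h>
#include <rocksdb/db.h>
#include <rocksdb/utilities/memory_util.h>

void PrintMemoryBreakdown(rocksdb::DB* db, const std::shared_ptr<rocksdb::Cache>& cache) {
  std::map<rocksdb::MemoryUtil::UsageType, uint64_t> usage;
  std::unordered_set<const rocksdb::Cache*> caches = {cache.get()};
  rocksdb::MemoryUtil::GetApproximateMemoryUsageByType({db}, caches, &usage);
  for (const auto& [type, bytes] : usage) {
    std::printf("usage type %d: %llu bytes\n", static_cast<int>(type),
                (unsigned long long)bytes);
  }
}
```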
I don't think so as I didn't see |
We are using jemalloc. And exactly, yes, we also don't expect 6GB memory usage. The JVM native memory tracker shows it is using around ~2GB. And I suspect it must be off-heap memory used by RocksDB. Native Memory Tracking:
Total: reserved=6375413KB, committed=2142949KB
Ok! I will look into this. |
options file:
OPTIONS.txt
I've set the LRU cache to 1.5gb for the "url" cf. However, all of a sudden, the service that runs rocksdb hit the max memory limit I allocated for the service, and I can see that the LRU cache for the "url" cf hit that limit:
This also caused the service to max out the cpu usage (likely because of back pressure).
flamegraph: