
High Memory Usage/ LRU cache size is not being respected #12579

Closed
zaidoon1 opened this issue Apr 24, 2024 · 52 comments

@zaidoon1
Contributor

options file:
OPTIONS.txt

I've set the LRU cache to 1.5 GB for the "url" CF. However, all of a sudden, the service that runs RocksDB hit the max memory limit I allocated for it, and I can see that the LRU cache for the "url" CF hit that limit:

Screenshot 2024-04-24 at 11 13 44 AM

This also caused the service to max out its CPU usage (likely because of back pressure).

flamegraph:

cachedb service-2024-04-23T23_16_26+00_00-127m51

@zaidoon1
Contributor Author

While at least in the above instance I had something "obvious" to blame, this has now happened again on a different machine. Except this time it doesn't look like we exceeded the configured LRU cache size by much, and yet RocksDB still used up all the memory:

Screenshot 2024-04-25 at 1 21 20 AM

@zaidoon1 zaidoon1 changed the title LRU cache size is not being respected High Memory Usage/ LRU cache size is not being respected Apr 25, 2024
@zaidoon1
Contributor Author

@ajkr any idea what could have happened here in both cases? I guess the easiest one to answer is how/why RocksDB went above the allocated LRU cache size. Unfortunately, I don't have any other LOG files to share because of the issue described in #12584 (nothing showed up in the WARN-level logs, so I don't know what was happening at the time).

@zaidoon1
Contributor Author

zaidoon1 commented May 1, 2024

I was thinking of using a strict LRU capacity limit, but it looks like reads (and writes?) will fail once the capacity is hit, which is not what I expected. Why don't we evict from the cache instead of failing new reads?

@zaidoon1
Contributor Author

zaidoon1 commented May 2, 2024

Here is more data:
Screenshot 2024-05-02 at 10 29 12 AM

It looks like it happens when we have lots of tombstones. This appears to match what was happening in #2952, although the issue there was due to a compaction bug. I'm wondering if there is another compaction bug at play here.

@ajkr
Contributor

ajkr commented May 2, 2024

What allocator are you using? RocksDB tends to perform poorly with glibc malloc, and better with allocators like jemalloc, which is what we use internally. Reference: https://smalldatum.blogspot.com/2015/10/myrocks-versus-allocators-glibc.html

Why don't we evict from cache instead of failing new reads?

We evict from cache as long as we can find clean, unpinned entries to evict. Block cache only contains dirty entries when WriteBufferManager is configured to charge its memory to block cache, and only contains pinned entries according to the BlockBasedTableOptions.

That said, we try to evict from cache even if you don't set strict LRU capacity. That setting is to let you choose the behavior in cases where there is nothing evictable - fail the operation (strict), or allocate more memory (non-strict).
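
For illustration, that choice is the strict_capacity_limit flag on the cache constructor; a rough C++ sketch (placeholder numbers, not code from this issue):

#include <memory>
#include "rocksdb/cache.h"

std::shared_ptr<rocksdb::Cache> MakeBlockCache(bool strict) {
  // capacity, num_shard_bits (-1 = auto), strict_capacity_limit,
  // high_pri_pool_ratio. strict=false: allocate past capacity when nothing
  // is evictable; strict=true: fail the operation instead.
  return rocksdb::NewLRUCache(static_cast<size_t>(1536) << 20 /* 1.5 GB */,
                              /*num_shard_bits=*/-1,
                              /*strict_capacity_limit=*/strict,
                              /*high_pri_pool_ratio=*/0.5);
}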

@zaidoon1
Contributor Author

zaidoon1 commented May 2, 2024

What allocator are you using

I'm using jemalloc as the allocator (I've double-checked this).

In the last instance this happened (screenshot above), the block cache was not growing beyond what is configured, so I don't think that's the issue. I started seeing this happen when I enabled the part of the system that does the "is there an index that matches prefix x" check, which is a prefix seek that only looks at the first KV returned. From the last graph I posted, it also appears to happen when there are a lot of tombstones, so the seek + tombstones combination is very odd/suspect to me (similar to the problem reported in the RocksDB ticket I linked to). Right now I'm doing a load test, so I'm sending 5K requests with unique prefixes, and the prefixes are guaranteed not to find any matching KV.

@ajkr
Contributor

ajkr commented May 3, 2024

Thanks for the details. Are the 5K requests in parallel? Does your memory budget allow indexes to be pinned in block cache (unpartitioned_pinning = PinningTier::kAll, or less preferably but still good, unpartitioned_pinning = PinningTier::kFlushedAndSimilar)? The CPU profile shows most of the work is decompressing index blocks, and that work might even be redundant, in which case the memory consumed by index data would be amplified.

@zaidoon1
Contributor Author

zaidoon1 commented May 3, 2024

  1. Are the 5K requests in parallel?
     Yes, they are.
  2. I've enabled cache_index_and_filter_blocks; everything else is whatever the default is (I'll need to check the default for unpartitioned_pinning).
  3. It just happened again, and here is what it looked like:
Screenshot 2024-05-02 at 9 07 51 PM

cachedb service-2024-05-03T01_11_22+00_00-378m6

And here is a pprof of what is happening, to make it easier to see what RocksDB is doing:

cachedb-2024-05-03T01_13_09+00_00-378m6 inuse_space

@zaidoon1
Contributor Author

zaidoon1 commented May 3, 2024

Also, here are the DB options I have configured:
db_options.txt

@ajkr
Contributor

ajkr commented May 3, 2024

and that work might even be redundant in which case the memory consumed by index data would be amplified.

For more details on this problem, see the stats added in #6681. It looks like you have statistics enabled so you might be able to check those stats to confirm or rule out whether that is the problem.

If it is the problem, unfortunately I don't think we have a good solution yet. That is why I was wondering if you have enough memory to pin the indexes so they don't risk thrashing. Changing cache_index_and_filter_blocks to false could have a similar effect.

@zaidoon1
Contributor Author

zaidoon1 commented May 3, 2024

OK, this is good to know; I'll definitely investigate this part. I would like to confirm: if we assume that's the problem, then my options are:

  1. Set cache_index_and_filter_blocks to false.
  2. Keep cache_index_and_filter_blocks set to true and also set unpartitioned_pinning = PinningTier::kAll? Or should I set cache_index_and_filter_blocks to false and set unpartitioned_pinning?

I think the main reason I set cache_index_and_filter_blocks to true is to cap/control memory usage (but that was also when I thought I had jemalloc enabled and it wasn't, so my issues at the time could be different).

That is why I was wondering if you have enough memory to pin the indexes so they don't risk thrashing

Regarding this part, is there a way/formula to know how much memory it will cost to pin the indexes? Or is this a try-and-find-out kind of thing?

Is it any different/better to use WriteBufferManager to control memory usage vs. cache_index_and_filter_blocks?

@ajkr
Contributor

ajkr commented May 4, 2024

regarding this part, is there a way/formula to know how much memory it will cost to pin the indexes? Or is this a try and find out kind of thing?

There is a property: TableProperties::index_size. It's available via DB APIs like GetPropertiesOfAllTables(), or externally to the DB via sst_dump on the DB dir. It isn't exactly the same as the memory cost of holding an index block in memory but I think it should give an upper bound.
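
Roughly, in the C++ API, summing that property over the live SST files of one CF looks like this (a sketch, not code from this issue):

#include <cstdint>
#include "rocksdb/db.h"

// Sums TableProperties::index_size over all live SST files in one CF,
// as a rough upper bound on what keeping the indexes in memory would cost.
uint64_t TotalIndexSize(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf) {
  rocksdb::TablePropertiesCollection props;  // file path -> properties
  uint64_t total = 0;
  if (db->GetPropertiesOfAllTables(cf, &props).ok()) {
    for (const auto& entry : props) {
      total += entry.second->index_size;
    }
  }
  return total;
}
// Offline alternative mentioned above:
//   sst_dump --file=/path/to/db_dir --show_properties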

@zaidoon1
Contributor Author

zaidoon1 commented May 4, 2024

Cool, I'll check this out. And just to double-check: is unpartitioned_pinning = PinningTier::kAll more preferable than setting cache_index_and_filter_blocks to false?

@ajkr
Contributor

ajkr commented May 4, 2024

Cool, I'll check this out. And just to double-check: is unpartitioned_pinning = PinningTier::kAll more preferable than setting cache_index_and_filter_blocks to false?

It is preferable if you want to use our block cache capacity setting for limiting RocksDB's total memory usage.

  • cache_index_and_filter_blocks=true with unpartitioned_pinning = PinningTier::kAll: index block memory usage counts towards block cache capacity. Pinning prevents potential thrashing.
  • cache_index_and_filter_blocks=false: Index block memory usage counts toward table reader memory, which is not counted towards block cache capacity by default. Potential thrashing is still prevented because they are preloaded and non-evictable in table reader memory.
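
A rough C++ sketch of the two setups above (the exact knobs exposed through the C API and language bindings may differ):

#include <memory>
#include "rocksdb/cache.h"
#include "rocksdb/table.h"

// Option 1: index/filter blocks live in block cache and are pinned there,
// so their memory counts against cache capacity and cannot thrash.
rocksdb::BlockBasedTableOptions PinnedInCacheSetup(
    std::shared_ptr<rocksdb::Cache> cache) {
  rocksdb::BlockBasedTableOptions t;
  t.block_cache = cache;
  t.cache_index_and_filter_blocks = true;
  t.metadata_cache_options.unpartitioned_pinning = rocksdb::PinningTier::kAll;
  return t;
}

// Option 2: index/filter blocks are preloaded into table reader memory.
// No thrashing, but that memory is outside the block cache cap by default.
rocksdb::BlockBasedTableOptions TableReaderSetup(
    std::shared_ptr<rocksdb::Cache> cache) {
  rocksdb::BlockBasedTableOptions t;
  t.block_cache = cache;
  t.cache_index_and_filter_blocks = false;
  return t;
}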

@zaidoon1
Contributor Author

zaidoon1 commented May 4, 2024

Great! Thanks for confirming. Once the C API changes land, I'll experiment with this and report back.

@zaidoon1
Contributor Author

zaidoon1 commented May 5, 2024

A few other questions that just came to mind:

  1. Right now, I'm using a prefix extractor with a prefix bloom (Ribbon filter, ratio 10.0). The 5K prefix lookups per second are for KVs that don't exist; even the prefix wouldn't exist. I expect the Ribbon filter to detect this, and therefore RocksDB just skips doing any work. Given this, would we still have the thrashing issue?
  2. Would https://rocksdb.org/blog/2017/05/12/partitioned-index-filter.html help in my case?

@ajkr
Contributor

ajkr commented May 6, 2024

Yes, prefix filter should prevent thrashing for index block lookups. I didn't notice earlier that it's already enabled. Then, it's surprising that BinarySearchIndexReader::NewIterator() is consuming most of the CPU. Do you set ReadOptions::total_order_seek for the iterator? That can defeat the prefix filter optimizations.

@zaidoon1
Contributor Author

zaidoon1 commented May 7, 2024

Do you set ReadOptions::total_order_seek for the iterator

I don't, unless that is set by default in RocksDB under the hood? In the Rust library, I call prefix_iterator_cf, which just sets prefix_same_as_start and then does the lookup: https://github.com/zaidoon1/rust-rocksdb/blob/567825463480b75f733b73f01c1bd05990aea5b9/src/db.rs#L1438-L1446

Maybe I should start by looking at the Ribbon filter metrics? Is there a specific metric I should be looking at to see if things are working as they should?

@zaidoon1
Contributor Author

zaidoon1 commented May 7, 2024

I found the following:

rocksdb.bloom.filter.useful COUNT
rocksdb.bloom.filter.full.positive COUNT
rocksdb.bloom.filter.full.true.positive COUNT
rocksdb.bloom.filter.prefix.checked COUNT
rocksdb.bloom.filter.prefix.useful COUNT
rocksdb.bloom.filter.prefix.true.positive COUNT

I couldn't find anything specific to the Ribbon filter, so my guess is the "bloom" filter stats would also be populated for the Ribbon filter. If so, which would be the most useful for me to add a metric for, to track this issue?

Or maybe the seek stats:

// Statistics on iterator Seek() (and variants) for each sorted run. I.e. a
// single user Seek() can result in many sorted run Seek()s.
// The stats are split between last level and non-last level.
// Filtered: a filter such as prefix Bloom filter indicate the Seek() would
// not find anything relevant, so avoided a likely access to data+index
// blocks.
LAST_LEVEL_SEEK_FILTERED,
// Filter match: a filter such as prefix Bloom filter was queried but did
// not filter out the seek.
LAST_LEVEL_SEEK_FILTER_MATCH,
// At least one data block was accessed for a Seek() (or variant) on a
// sorted run.
LAST_LEVEL_SEEK_DATA,
// At least one value() was accessed for the seek (suggesting it was useful),
// and no filter such as prefix Bloom was queried.
LAST_LEVEL_SEEK_DATA_USEFUL_NO_FILTER,
// At least one value() was accessed for the seek (suggesting it was useful),
// after querying a filter such as prefix Bloom.
LAST_LEVEL_SEEK_DATA_USEFUL_FILTER_MATCH,
// The same set of stats, but for non-last level seeks.
NON_LAST_LEVEL_SEEK_FILTERED,
NON_LAST_LEVEL_SEEK_FILTER_MATCH,
NON_LAST_LEVEL_SEEK_DATA,
NON_LAST_LEVEL_SEEK_DATA_USEFUL_NO_FILTER,
NON_LAST_LEVEL_SEEK_DATA_USEFUL_FILTER_MATCH,

Not sure which of these would help me figure out what I need.

@ajkr
Contributor

ajkr commented May 8, 2024

Looks like the *LEVEL_SEEK* statistics are for iterators. The *FILTERED vs. *DATA can tell you the filtering rate. If you have a lot of other iterations happening, it could be hard to attribute the metric values to prefix checks vs. other things. Though if *LEVEL_SEEK*FILTERED stats are zero, that'd tell us a lot.

If you want to measure an operation's stats in isolation, we have PerfContext instead of Statistics for that: https://github.com/facebook/rocksdb/wiki/Perf-Context-and-IO-Stats-Context
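
For example, roughly (C++; a sketch following the wiki above, not code from this issue):

#include <iostream>
#include <memory>
#include "rocksdb/db.h"
#include "rocksdb/iostats_context.h"
#include "rocksdb/perf_context.h"

void ProfileOneSeek(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf,
                    const rocksdb::Slice& prefix) {
  rocksdb::SetPerfLevel(rocksdb::PerfLevel::kEnableTimeExceptForMutex);
  rocksdb::get_perf_context()->Reset();
  rocksdb::get_iostats_context()->Reset();

  rocksdb::ReadOptions ro;
  ro.prefix_same_as_start = true;
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro, cf));
  it->Seek(prefix);

  // Counters for just this one operation (block reads, filter/index work, etc.).
  std::cout << rocksdb::get_perf_context()->ToString() << std::endl;
  rocksdb::SetPerfLevel(rocksdb::PerfLevel::kDisable);
}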

@zaidoon1
Contributor Author

zaidoon1 commented May 8, 2024

Something I want to make sure of: are the *LEVEL_SEEK* statistics DB-level stats, or are they per CF? As far as iterators go, I only use them when doing the prefix check. Other operations use MultiGet.

@ajkr
Contributor

ajkr commented May 8, 2024

It's as wide a scope as the Statistics object, which is at minimum one DB since it's configured in DBOptions::statistics. It could be multiple DBs if you configure multiple DBs to use the same object.

@zaidoon1
Contributor Author

zaidoon1 commented May 8, 2024

Great! Thanks for confirming. I'm going to track:

LAST_LEVEL_SEEK_FILTERED,
LAST_LEVEL_SEEK_DATA,
NON_LAST_LEVEL_SEEK_FILTERED,
NON_LAST_LEVEL_SEEK_DATA,

and will report back with what I see.
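
Roughly, this is the kind of polling I mean (sketched against the C++ Statistics API; in my service this goes through the Rust/C bindings, and the exporter names are my own):

#include <cstdint>
#include <cstdio>
#include <memory>
#include "rocksdb/statistics.h"

// stats is the same object set on DBOptions::statistics
// (created via rocksdb::CreateDBStatistics()).
void ReportSeekFiltering(const std::shared_ptr<rocksdb::Statistics>& stats) {
  const uint64_t filtered =
      stats->getTickerCount(rocksdb::LAST_LEVEL_SEEK_FILTERED) +
      stats->getTickerCount(rocksdb::NON_LAST_LEVEL_SEEK_FILTERED);
  const uint64_t data =
      stats->getTickerCount(rocksdb::LAST_LEVEL_SEEK_DATA) +
      stats->getTickerCount(rocksdb::NON_LAST_LEVEL_SEEK_DATA);
  // filtered vs. data is the per-sorted-run filtering rate discussed above.
  std::printf("seek filtered=%llu data=%llu\n",
              (unsigned long long)filtered, (unsigned long long)data);
}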

@zaidoon1
Contributor Author

zaidoon1 commented May 13, 2024

So while the issue hasn't happened again (I'm still waiting for it), I think I've narrowed down the part of the system that causes this.

So I initially told you that we do a prefix check to see whether there exists a key in RocksDB that starts with some prefix x provided by an external system. Most of the time there is none, so the bloom filter does the job. Now, the part I forgot to mention, which is very relevant, is that if a key does exist, we store this prefix key in the "queue" CF. A background service then iterates over each key in the queue CF and fetches all matching KVs by prefix. After adding metrics, we see that one prefix can match 3M KVs. Once we find those keys, we delete the file referenced by the KVs and eventually delete the KVs themselves.

The call we make is basically:

let kv_iterator = if starting_index_key.is_empty() {
    self.db.prefix_iterator_cf(&cf_handle, prefix)
} else {
    self.db.prefix_iterator_cf(&cf_handle, starting_index_key)
};

We get keys in batches of 1K to process them; that's why we either start from the first key that matches the prefix or continue from where we last left off.

Given this background, I'm thinking this is definitely causing the thrashing issue, as we are iterating over millions of keys. Given that the intention is to then delete those keys, maybe I should disable caching while iterating over them so that RocksDB doesn't try to cache those lookups, since they are useless.

To be specific, I'm thinking of setting:

rocksdb_readoptions_set_fill_cache

to false before iterating. What do you think, @ajkr?
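
In C++ terms (I'd actually go through the Rust/C bindings), one batch of the scan would look roughly like this sketch:

#include <memory>
#include "rocksdb/db.h"

// One batch: start (or resume) at start_key, read up to 1000 keys,
// and skip populating the block cache along the way.
void ScanBatchNoFill(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf,
                     const rocksdb::Slice& start_key) {
  rocksdb::ReadOptions ro;
  ro.prefix_same_as_start = true;  // what prefix_iterator_cf sets
  ro.fill_cache = false;           // blocks read by this scan won't be cached
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro, cf));
  int n = 0;
  for (it->Seek(start_key); it->Valid() && n < 1000; it->Next(), ++n) {
    // collect it->key() for the delete pass
  }
}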

@dfa1
Contributor

dfa1 commented May 15, 2024

We have the same issue: recently we moved from RocksDB 7.x + CLOCK_CACHE to RocksDB 8.x + LRU_CACHE. The limit (3 GB) is not respected at all: the process keeps allocating memory until it gets OOM-killed.

Our setup is: rocksdbjni 8.x + a range scan workload.

@dfa1
Contributor

dfa1 commented May 15, 2024

@zaidoon1 FYI: switching to HYPER_CLOCK_CACHE fixed the memory issue in our case. Maybe it is a valid workaround for you too (but we were using CLOCK_CACHE in RocksDB 7.x).

LRU_CACHE
image

HYPER_CLOCK_CACHE
image

@zaidoon1
Contributor Author

Thanks! That's interesting to know; maybe the clock cache is more resistant to thrashing? @ajkr, any idea? Or it could be that your issue is different from mine. In my case, I'm pretty sure it's the iterators that are reading 1M+ KVs, and disabling caching for those should help with that.

In general, I plan on switching to HyperClockCache once the auto-tuning parameter of HyperClockCache

// * Requires an extra tuning parameter: see estimated_entry_charge below.

is marked production-ready. I would rather not have to play around with parameters too much and let RocksDB figure it out.

@dfa1
Contributor

dfa1 commented May 16, 2024

@zaidoon1 you're welcome! :-) Anyway, we are dealing with smaller range scans (up to a few thousand keys) with caching enabled, and 0 as estimated_entry_charge is working fine (maybe it could be better in 9.x).
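
For illustration, the equivalent setup in the C++ API is roughly the following sketch (we actually configure it through rocksdbjni; the capacity is a placeholder):

#include <memory>
#include "rocksdb/cache.h"

std::shared_ptr<rocksdb::Cache> MakeHyperClockCache() {
  // estimated_entry_charge = 0 asks recent RocksDB releases to size the
  // entries automatically instead of requiring manual tuning.
  rocksdb::HyperClockCacheOptions opts(
      /*_capacity=*/static_cast<size_t>(3) << 30 /* 3 GB */,
      /*_estimated_entry_charge=*/0);
  return opts.MakeSharedCache();
}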

@ajkr
Contributor

ajkr commented May 16, 2024

To be specific, I'm thinking of setting:

rocksdb_readoptions_set_fill_cache

to false before iterating. What do you think, @ajkr?

Sorry I'm not sure. It sounds like a generally good idea to use fill_cache=false for a scan-and-delete workload because the scanned blocks will not be useful for future user reads.

But, I am not sure how much it will help with this specific problem. The CPU profile is mostly showing index blocks, which leads me to think there is something special about those. If you are pinning index blocks in cache, the index blocks will already be cached so the logic related to fill_cache is bypassed.

@zaidoon1
Contributor Author

zaidoon1 commented May 16, 2024

If you are pinning index blocks in cache, the index blocks will already be cached so the logic related to fill_cache is bypassed

I don't think I'm pinning them (I have not set unpartitioned_pinning = PinningTier::kAll since the C API change hasn't landed yet in a RocksDB release, and everything else is default, which I don't think enables pinning?). I'm setting cache_index_and_filter_blocks to true, which should just cache them, but as I scan and fill the cache with useless data, I expect to be kicking things out, including index and filter blocks, so this issue would be seen? Or am I misunderstanding how this works?

@ajkr
Contributor

ajkr commented May 16, 2024

Oh OK, if you're not pinning them then fill_cache will make a difference for index blocks. Whether it's a positive or negative difference, I don't know. The reason it could be negative is that if 5K iterators simultaneously load the same index block without putting it in block cache, the memory usage could be a bit higher than if they had accessed a shared copy of that block in block cache. But let's see.

@zaidoon1
Contributor Author

the memory usage could be a bit higher than if they had accessed a shared copy of that block in block cache

Would this increased memory exceed the allocated memory? I assume this is not something that is capped by block cache since the reads are not getting cached?

@ajkr
Contributor

ajkr commented May 17, 2024

Right, it won't be capped by block cache.

@zaidoon1
Contributor Author

zaidoon1 commented May 21, 2024

So this happened again, which means the fill_cache idea I had didn't work. Here are the values of the metrics I added from #12579 (comment), as well as other metrics/state of things:

Screenshot 2024-05-21 at 7 35 56 PM

Notice that the filtering is doing a great job and pretty much filtering out all seeks.

Screenshot 2024-05-21 at 7 35 47 PM Screenshot 2024-05-21 at 7 35 35 PM Screenshot 2024-05-21 at 7 34 58 PM

At this point I don't have any more ideas; are there any other metrics/stats you suggest I add? The next step, I think, is waiting for the next RocksDB release that will let me set, via the C API:

cache_index_and_filter_blocks=true with unpartitioned_pinning = PinningTier::kAll

Or did this solution also become invalid given that we are filtering out most seeks? @ajkr, what do you think?

Also, looking at the graphs above, is it weird that block cache memory usage didn't go up at all? Index blocks end up in block cache as they are read, so I would expect block cache usage to also spike. Or maybe I'm misunderstanding that? For example, in my first occurrence of this, block cache usage matched the total memory usage (#12579 (comment)), but other times/most times it doesn't: block cache is fine but total memory usage spikes to the max.

@zaidoon1
Contributor Author

zaidoon1 commented May 22, 2024

Can point lookups also cause this issue, or is it unique to seeks/prefix lookups? During the entire period when memory was maxed out, the endpoint that performs point lookups was spiking in requests:
Screenshot 2024-05-21 at 9 02 35 PM

This endpoint does two things:

  1. MultiGet a bunch of keys from the queue CF; however, the queue CF is empty 99% of the time since we process as fast as possible, so I don't think that's it (unless something heavy still happens even when the CF is empty?)
  2. Get the value of a key from the key CF using a RocksDB PinnableSlice. Can this be doing something bad?

@ajkr
Contributor

ajkr commented May 22, 2024

What are the data sources for the "Number of SST Files" and "FDs" metrics? I previously assumed they would be similar, but in the most recent charts I realized there are 15-20 SST files but up to 1K FDs.

@zaidoon1
Contributor Author

The FDs metric comes from: container_file_descriptors{app_name="app_running_rocksdb"}
Number of SST Files: count(rocksdb_livefiles)

@zaidoon1
Contributor Author

Is it possible that it's a prefetching issue? Over 90% of the prefix lookups are for keys that don't exist, but for the ones that do exist, when I do my prefix check using a prefix iterator, does the iterator try to fetch more data in anticipation that I would read it, so that I end up pulling more data than I need? If prefetching is happening, is there a way to limit it, since I only care about the first KV? Would lowering the readahead size help here?

@zaidoon1
Contributor Author

zaidoon1 commented May 23, 2024

Is there another metric I should add that would at least show the problem? For example, without getting a flamegraph, just looking at the existing metrics I'm tracking, we wouldn't know that we are doing any lookups, since it looks like everything is being filtered out by the filter setup, but that is not true. It feels like there is an observability gap?

@zaidoon1
Contributor Author

zaidoon1 commented May 23, 2024

@ajkr So I added more metrics and here is what they look like when the issue happens:

Screenshot 2024-05-23 at 12 35 04 AM Screenshot 2024-05-23 at 12 35 18 AM Screenshot 2024-05-23 at 12 35 33 AM Screenshot 2024-05-23 at 12 35 45 AM Screenshot 2024-05-23 at 12 36 06 AM Screenshot 2024-05-23 at 12 36 29 AM Screenshot 2024-05-23 at 12 36 40 AM

A few questions:

  1. Does it make sense that block cache usage is low but memory usage is maxed out?
  2. I added more seek-related metrics; none look interesting to me, but maybe something sticks out to you?
  3. I added a metric for block cache index redundant adds per your suggestion. I can see that it's happening constantly but at a very low rate, so I'm not sure this rate is enough to push the memory usage from the typical 2-3 GB to the max allocated memory of 10 GB, and even then, I would expect us to max out the block cache, right?

@ajkr
Contributor

ajkr commented May 26, 2024

  1. I added a metric for block cache index redundant adds per your suggestion. I can see that it's happening constantly but at a very low rate, so I'm not sure this rate is enough to push the memory usage from the typical 2-3 GB to the max allocated memory of 10 GB, and even then, I would expect us to max out the block cache, right?

It is low, but that could be due to swapping to make room for index allocations, which I think was what was happening in the original CPU profile.

I think block cache does not necessarily need to be maxed out. With capacity=1.5GB and num_shard_bits=6, each cache shard is 1.5GB / 2^6 = 24MB. If one or more index blocks are assigned to the same cache shard and their combined size exceeds 24MB, it is already enough to cause thrashing.

There is a property: TableProperties::index_size. It's available via DB APIs like GetPropertiesOfAllTables(), or externally to the DB via sst_dump on the DB dir

If you have access to any URL column family SST files it would be pretty interesting to know what their index size is.

@zaidoon1
Contributor Author

zaidoon1 commented May 26, 2024

So I wanted to give an update: I set cache_index_and_filter_blocks to false 4 days ago, I haven't seen the issue since, and RocksDB seems to be behaving much better. At this point I'm going to say the thrashing issue was definitely the cause and it's fixed now. Thank you so much @ajkr!! I wouldn't have connected the dots without you.

Also, memory-wise, I don't see a noticeable increase after disabling cache_index_and_filter_blocks; in fact, on some machines the memory went down.

Screenshot 2024-05-26 at 7 29 14 PM

Follow-ups:

  1. Do you feel we had enough metrics in RocksDB to identify this? I felt like I was missing something the entire time, but maybe I just don't know enough.

@ajkr
Contributor

ajkr commented May 27, 2024

So I wanted to give an update: I set cache_index_and_filter_blocks to false 4 days ago, I haven't seen the issue since, and RocksDB seems to be behaving much better. At this point I'm going to say the thrashing issue was definitely the cause and it's fixed now.

Glad to hear!

  1. Do you feel we had enough metrics in RocksDB to identify this?

No, the allocation for a block is unaccounted until it is inserted to block cache. The redundant insertion stats we eventually used would be difficult to find without guessing the root cause first. Plus they still didn't give any info on the magnitude of the memory usage for redundantly loading blocks.

Maybe we can add a DB property for memory usage of incomplete block loads, and also expose it as part of GetApproximateMemoryUsageByType().

I also think we should give the option to charge block cache for these allocations immediately, but we probably couldn't make it default at this point as it is bound to cause a regression for some workload somewhere.

We could also reduce the memory usage for these redundant block loads with something like @pdillinger's cooperative caching idea (mentioned here: #6699 (comment)).

@zaidoon1
Contributor Author

SGTM, thanks again for all the help!

@sazalhf

sazalhf commented May 27, 2024

Commenting on this closed ticket as it is relevant for me as well. We are also experiencing high memory usage that disregards the memory cap we set with a custom RocksDB config setter.

We have a 6 GB container limit and 5 state stores. We configured a 50 MB LRU cache, 3 memtables, a 64 MB write buffer size, and a 3 * 64 MB write buffer manager. JVM heap usage is around 2 GB. On top of that we have:

tableConfig.setCacheIndexAndFilterBlocks(true)
tableConfig.setCacheIndexAndFilterBlocksWithHighPriority(true)
tableConfig.setPinTopLevelIndexAndFilter(true)

Given these settings, my understanding is that overall DataBlock + IndexBlock should be capped at 50 MB, and the write buffers should be shared across the stores and capped at number of memtables * size, which is 3 * 64 = 192 MB.

Screenshot 2024-05-27 at 14 43 58

In our case we also have punctuators triggering every 5 minutes and cleaning up 1000 old records from the state stores, to avoid letting them grow indefinitely. This uses the all() API to iterate over the stores, find 1000 candidate items, and remove them. Could this be a potential bottleneck?

I checked the RocksDB internal LOG. For one of the pods with high memory usage, this is what it shows for a state store of size 1.7 GB:

Cumulative compaction: 139.70 GB write, 9.52 MB/s write, 138.05 GB read, 9.40 MB/s read, 2174.8 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
Block cache LRUCache@0x7fcb27971000#1 capacity: 50.00 MB collections: 52 last_copies: 5 last_secs: 0.000199 secs_since: 12
Block cache entry stats(count,size,portion): DataBlock(32,1.00 MB,2.00488%) IndexBlock(16,6.57 MB,13.1431%) WriteBuffer(449,112.25 MB,224.5%) Misc(1,0.00 KB,0%)

And JVM RSS shows this:

1879184

I suspect it is something to do with the iteration and the index blocks?

@zaidoon1 What were your settings for these? They were recommended for capping memory.

tableConfig.setCacheIndexAndFilterBlocks(true)
tableConfig.setCacheIndexAndFilterBlocksWithHighPriority(true)
tableConfig.setPinTopLevelIndexAndFilter(true)

If we set these to false, won't the index take memory from the rest of the available memory, so that memory usage isn't capped? @ajkr

@zaidoon1
Contributor Author

@sazalhf

setPinTopLevelIndexAndFilter and setCacheIndexAndFilterBlocksWithHighPriority default to true in RocksDB, so I didn't change those. Before this issue, I used to have:

  cache_index_and_filter_blocks: 1
  cache_index_and_filter_blocks_with_high_priority: 1
  pin_top_level_index_and_filter: 1

And now I have:

cache_index_and_filter_blocks: 0
cache_index_and_filter_blocks_with_high_priority: 1
pin_top_level_index_and_filter: 1

@ajkr
Contributor

ajkr commented May 27, 2024

Is a single WriteBufferManager shared across the five state stores?

Even if there's only one WriteBufferManager, the block cache is still oversubscribed since a 192MB WriteBufferManager is charging its memory usage against a block cache with 50MB capacity.

To achieve the objective you described ("DataBlock + IndexBlock should be capped by 50MB"), you could simply not tell the WriteBufferManager constructor about the block cache and then its allocations would be untracked.

A more ideal setup would be to set the block cache capacity to the amount of memory you want RocksDB to use for all purposes and continue letting it charge write buffers to block cache. But you would need to come up with that number.

Sorry if I am missing something from the chart. The inline image in GitHub truncates the key for the line colors, and when I click into it to see the full picture, it returns a page not found error.

@sazalhf

sazalhf commented May 27, 2024

@ajkr

Is a single WriteBufferManager shared across the five state stores?

Yes.

I just want to understand the impact of these settings a bit more.

cache_index_and_filter_blocks: 0
cache_index_and_filter_blocks_with_high_priority: 1
pin_top_level_index_and_filter: 1

With cache_index_and_filter_blocks: 0, will the index & filter blocks consume RAM as part of the table reader object in this case? This is the API doc:

  /**
   * Indicating if we'd put index/filter blocks to the block cache.
   * If not specified, each "table reader" object will pre-load index/filter
   * block during table initialization.
   *
   * @param cacheIndexAndFilterBlocks and filter blocks should be put in block cache.
   * @return the reference to the current config.
   */

and on this,

A more ideal setup would be to set the block cache capacity to the amount of memory you want RocksDB to use for all purposes and continue letting it charge write buffers to block cache. But you would need to come up with that number.

Say we need around 1 GB for caching (block cache + index/filter blocks), and 3 memtables at 64 MB each; in that case the settings should be something like:

cache = LRUCache(1GB -1, strictCapacityLimit)
writeBufferManager = WriteBufferManager(192MB, cache)

Please correct my understanding. 🙇

Another question: how does the thrashing lead to high memory consumption?

@ajkr
Contributor

ajkr commented May 28, 2024

I just want to understand the impact of these settings a bit more.

cache_index_and_filter_blocks: 0
cache_index_and_filter_blocks_with_high_priority: 1
pin_top_level_index_and_filter: 1

cache_index_and_filter_blocks: 0 in this case the index & filter block cache will consume the ram with the table reader object?

Right. The benefit is preventing thrashing. The downside is the index/filter memory is not accounted for. To fix that, one way is charging block cache for table reader memory usage:

// (c) CacheEntryRole::kBlockBasedTableReader
// (i) If kEnabled:
// Charge memory usage of table properties +
// index block/filter block/uncompression dictionary (when stored in table
// reader i.e, BlockBasedTableOptions::cache_index_and_filter_blocks ==
// false) + some internal data structures during table reader creation.
// If such a table reader exceeds
// the avaible space left in the block cache at some point (i.e, causing
// a cache full under `LRUCacheOptions::strict_capacity_limit` = true),
// creation will fail with Status::MemoryLimit().
// (ii) If kDisabled:
// Does not charge the memory usage mentioned above.
// (iii) Compatible existing behavior:
// Same as kDisabled.
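
In C++ that option looks roughly like the sketch below (field names from BlockBasedTableOptions; whether you want strict_capacity_limit=true depends on whether you prefer failing with Status::MemoryLimit() as the quoted comment describes):

#include "rocksdb/cache.h"
#include "rocksdb/table.h"

rocksdb::BlockBasedTableOptions ChargeTableReaderSetup() {
  rocksdb::BlockBasedTableOptions t;
  t.cache_index_and_filter_blocks = false;  // index/filter live in table readers
  t.block_cache = rocksdb::NewLRUCache(
      /*capacity=*/static_cast<size_t>(1) << 30, /*num_shard_bits=*/-1,
      /*strict_capacity_limit=*/true);
  // Charge table reader memory (incl. preloaded index/filter) to block cache.
  t.cache_usage_options.options_overrides.insert(
      {rocksdb::CacheEntryRole::kBlockBasedTableReader,
       {/*charged=*/rocksdb::CacheEntryRoleOptions::Decision::kEnabled}});
  return t;
}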

Another way is to pin the indexes/filters in block cache as mentioned earlier in this issue: #12579 (comment).

Say, we need around 1 GB for caching (block cache + index / filter block) and with 3 memtables and 64MB for each memtable in that case the setting should be something like:

cache = LRUCache(1GB -1, strictCapacityLimit)
writeBufferManager = WriteBufferManager(192MB, cache)

Please correct my understanding. 🙇

Yes that looks good. 1GB is the combined size for blocks+write buffers. So up to 1GB-192MB=832MB will always be available for blocks. The remaining 192MB will sometimes be available for blocks depending on the memory demand for write buffers.
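
In the C++ API that setup is roughly the following sketch (placeholder numbers; the Java LRUCache/WriteBufferManager classes mirror these):

#include <memory>
#include "rocksdb/cache.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"
#include "rocksdb/write_buffer_manager.h"

void ConfigureSharedBudget(rocksdb::Options* options) {
  // One cache sized for the whole RocksDB budget (~1 GB here).
  auto cache = rocksdb::NewLRUCache(static_cast<size_t>(1) << 30);

  // Write buffers (192 MB here) are charged against that same cache.
  options->write_buffer_manager =
      std::make_shared<rocksdb::WriteBufferManager>(
          static_cast<size_t>(192) << 20, cache);

  rocksdb::BlockBasedTableOptions t;
  t.block_cache = cache;  // data/index/filter blocks share the same budget
  options->table_factory.reset(rocksdb::NewBlockBasedTableFactory(t));
}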

Another question: how does the thrashing lead to high memory consumption?

We don't account for allocations for pending block loads. So N threads simultaneously loading a 10MB index block could use up to N*10MB extra memory.

@sazalhf

sazalhf commented May 28, 2024

Thanks, I will look into it.

A more ideal setup would be to set the block cache capacity to the amount of memory you want RocksDB to use for all purposes and continue letting it charge write buffers to block cache. But you would need to come up with that number.

Is there any easy way to calculate that number, like deriving it from the SST file sizes or by checking the RocksDB internal stats dump in the LOG?

FYI, I tried with these settings:

cache_index_and_filter_blocks: 0
cache_index_and_filter_blocks_with_high_priority: 1
pin_top_level_index_and_filter: 1

Yet we see high memory usage. This is the disk usage of the stores (SST files + other files) from one of the pods:

220M    Table1
1.4G    Table2
1.5G    Table3
649M    Table4
703M    Table5
Screenshot 2024-05-28 at 15 53 26

I noticed the spike in memory aligned with the punctuator iterating over the stores to clean up old records.

store1.all()

We load it lazily as a sequence and stop iterating as soon as we have deleted 1000 records, then the next 1000 in another cycle.

Another suggestion you provided that I want to try is:

Right. The benefit is preventing thrashing. The downside is the index/filter memory is not accounted for. To fix that, one way is charging block cache for table reader memory usage:

 // Charge memory usage of table properties + 
 // index block/filter block/uncompression dictionary (when stored in table 
 // reader i.e, BlockBasedTableOptions::cache_index_and_filter_blocks == 
 // false) + some internal data structures during table reader creation. 
 // If such a table reader exceeds 
 // the avaible space left in the block cache at some point (i.e, causing 
 // a cache full under `LRUCacheOptions::strict_capacity_limit` = true), 
 // creation will fail with Status::MemoryLimit(). 
 // (ii) If kDisabled: 
 // Does not charge the memory usage mentioned above. 
 // (iii) Compatible existing behavior: 
 // Same as kDisabled. 

Is there a corresponding Java API to configure this? We are using rocksdb_version = "7.5.3".

@ajkr

@ajkr
Contributor

ajkr commented May 28, 2024

6GB RSS is not expected. I am not sure what allocator you are using, but fwiw allocator choice came up earlier in this thread - #12579 (comment)

Assuming the RSS is not fragmentation, we have a function, GetApproximateMemoryUsageByType(), that can report how much memory RocksDB is using for various purposes. I checked and it is available in Java in "MemoryUtil.java".
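
For reference, the C++ form is roughly the sketch below (the Java MemoryUtil wrapper should expose the same buckets):

#include <cstdint>
#include <map>
#include <memory>
#include <unordered_set>
#include <vector>
#include "rocksdb/cache.h"
#include "rocksdb/db.h"
#include "rocksdb/utilities/memory_util.h"

std::map<rocksdb::MemoryUtil::UsageType, uint64_t> MemoryByType(
    rocksdb::DB* db, const std::shared_ptr<rocksdb::Cache>& cache) {
  std::map<rocksdb::MemoryUtil::UsageType, uint64_t> usage;
  // Buckets: kMemTableTotal, kMemTableUnFlushed, kTableReadersTotal, kCacheTotal
  rocksdb::MemoryUtil::GetApproximateMemoryUsageByType(
      {db}, {cache.get()}, &usage);
  return usage;
}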

Is there a corresponding java API to configure this? We are using rocksdb_version = "7.5.3"

I don't think so as I didn't see cache_usage_options set anywhere in Java.

@sazalhf

sazalhf commented May 28, 2024

6GB RSS is not expected. I am not sure what allocator you are using,

We are using jemalloc.

And yes, exactly, we also don't expect 6 GB of memory usage. The JVM native memory tracker shows it is using around ~2 GB, and I suspect the rest must be off-heap memory used by RocksDB.

Native Memory Tracking:

Total: reserved=6375413KB, committed=2142949KB

ps aux shows RSS memory around 2.8 GB.

GetApproximateMemoryUsageByType(),

Ok! I will look into this.
@ajkr
