Initial Load Memory Grows Exponentially? #503

Open
bxcodec opened this issue Nov 6, 2023 · 3 comments
Comments

@bxcodec

bxcodec commented Nov 6, 2023

PGSync version: master branch, commit 9511670
Postgres version: 14.5
Elasticsearch version: latest version of OpenSearch
Redis version: latest
Python version: 3.8

Problem Description:

Hi,
I tried to run this on my local machine.

I have these environment variables set:

ELASTICSEARCH_PORT=9200
ELASTICSEARCH_SCHEME=http
ELASTICSEARCH_HOST=opensearch
ELASTICSEARCH=false
OPENSEARCH=true
ELASTICSEARCH_CHUNK_SIZE=1000
QUERY_CHUNK_SIZE=10000

Questions:

Why does memory grow exponentially during my initial load?
Is there any way to make the memory consumption stable?

[Screenshot: memory usage graph, 2023-11-06]

I have tried lowering the chunk size, e.g. to 1K and even 500, but I still hit the same issue on the initial load.

Error Message (if any):
The application was killed because of OOM:

Killed
exited with code 137

@accelq

accelq commented Nov 6, 2023

The QUERY_CHUNK_SIZE is not applied to the Postgres cursor, so SQLAlchemy keeps pulling data for as long as it can.

result = conn.execution_options(

You can add a new environment variable to control it:
https://github.com/accelq/pgsync/blob/f1d7caa95cf8edb30da03e05172e90bf7775b666/pgsync/base.py#L869

This worked for me, though.
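
For reference, a minimal sketch of what a streaming (server-side) cursor looks like with SQLAlchemy execution options; this is not PGSync's actual code, and the DSN, query, and process() handler are placeholders:

import sqlalchemy as sa

QUERY_CHUNK_SIZE = 10_000  # matches the env setting above

engine = sa.create_engine("postgresql://user:pass@localhost/db")  # placeholder DSN

with engine.connect() as conn:
    # stream_results uses a server-side cursor; yield_per sets the fetch batch size,
    # so only one chunk of rows is resident in memory at a time.
    result = conn.execution_options(
        stream_results=True,
        yield_per=QUERY_CHUNK_SIZE,
    ).execute(sa.text("SELECT * FROM big_table"))  # placeholder query
    for batch in result.partitions():
        process(batch)  # hypothetical handler for each batch of rows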

@bxcodec

bxcodec commented Nov 8, 2023

Is there any plan to add this functionality as a core feature? cc @toluaina

@sergiojgm

sergiojgm commented Nov 29, 2023

After debugging an initial load of 26MM records, I found the following: if you use ES parallel bulk (the default; streaming bulk exits on error) and failures/exceptions occur (e.g. records whose JSON structure differs between rows), then with ELASTICSEARCH_RAISE_ON_ERROR and ELASTICSEARCH_RAISE_ON_EXCEPTION (both true by default) every failed record stays resident in memory because of the unhandled error/exception. Processing continues and the failed records pile up until the process runs out of memory. Disable both options and memory will be stable; also review whether the data structure of all records matches what is being created in ES. Hope this helps ;) I was able to copy the 26MM records (250 GB) to ES in 3h35m with a maximum of 1.6 GB of RAM.
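
For context, a minimal sketch of how those two settings map onto the elasticsearch-py bulk helpers (this is an assumption about the mapping, not PGSync's actual code; the host, index, and documents are placeholders):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

client = Elasticsearch("http://localhost:9200")  # placeholder host

def actions():
    # placeholder documents; in practice these come from the Postgres rows
    yield {"_index": "my-index", "_source": {"field": "value"}}

for ok, item in parallel_bulk(
    client,
    actions(),
    raise_on_error=False,      # ELASTICSEARCH_RAISE_ON_ERROR=False
    raise_on_exception=False,  # ELASTICSEARCH_RAISE_ON_EXCEPTION=False
):
    if not ok:
        print("failed:", item)  # log and skip failed documents instead of accumulating and raising them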
