
Memory leak when using geo_point #531

Open
mbianchihr opened this issue Feb 26, 2024 · 0 comments

PGSync version: 3.1.0
Postgres version: 16
Elasticsearch/OpenSearch version: 8.11.4
Redis version: 7.2.4
Python version: 3.11.6
Problem Description:

When fields are marked as geo_point, a memory leak occurs during the initial sync. The relevant schema:

[
    {
        "database": "db",
        "index": "delivery",
        "plugins": [
            "Multisource"
        ],
        "nodes": {
            "table": "parcel",
            "schema": "public",
            "columns": [
                "id",
                "externalId",
                "status",
                "weight",
                "statusChangeTime",
                "statusNote",
                "createdAt"
            ],
            "transform": {
                "mapping": {
                    "location": {
                        "type": "geo_point"
                    },
                    "proofLocation": {
                        "type": "geo_point"
                    }
                }
            },
            ...
        }
    }
]

Then, in the plugin, I have the following "rewrite", since the location fields are nested deep in the document and are remapped to the top level.

from pgsync import plugin


class MultisourcePlugin(plugin.Plugin):
    name = 'Multisource'

    def transform(self, doc, **kwargs):
        ...

        # Default the remapped fields at the top level
        doc['location'] = None
        doc['proofLocation'] = None

        ...

        # Copy the GeoJSON coordinates up from the nested delivery object
        if doc['delivery'].get('location'):
            doc['location'] = doc['delivery']['location']['coordinates']

        if doc['delivery'].get('proofLocation'):
            doc['proofLocation'] = doc['delivery']['proofLocation']['coordinates']

        ...
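As a workaround sketch (an assumption on my part, not something from the PGSync docs): if the nested delivery payload is what keeps accumulating, popping the copied sub-objects out of the document after remapping would at least keep each emitted doc small. The `remap_geo_fields` helper and the `doc` shape below are hypothetical stand-ins for the real transform and documents.

```python
def remap_geo_fields(doc):
    """Copy nested GeoJSON coordinates to top-level fields, then drop
    the nested copies so the emitted document stays small."""
    delivery = doc.get('delivery') or {}
    for field in ('location', 'proofLocation'):
        doc[field] = None
        value = delivery.get(field)
        if value:
            doc[field] = value['coordinates']
            delivery.pop(field, None)  # release the nested copy

doc = {
    'id': 1,
    'delivery': {
        'location': {'type': 'Point', 'coordinates': [16.44, 43.51]},
        'proofLocation': None,
    },
}
remap_geo_fields(doc)
# doc['location'] is now [16.44, 43.51] and delivery no longer holds it
```

Whether dropping the nested copy actually helps depends on where PGSync retains references, so this is only a first thing to try.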

When not using this, barely any memory is used, and Redis in both cases uses around 6-10 MB of RAM. The real issue is the memory usage, which makes the EC2 instance fail and become unresponsive at around 200,000 rows with 3.7 GB of RAM used; extrapolating from current usage, our data set of 1.5 million rows would eat up 32+ GB of RAM.
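The extrapolation above is a straight linear fit of the observed usage (~28 GB for 1.5 M rows; the 32+ GB figure presumably adds headroom):

```python
observed_rows = 200_000
observed_gb = 3.7
total_rows = 1_500_000

# Linear extrapolation: memory grows proportionally with rows synced
projected_gb = observed_gb * (total_rows / observed_rows)
print(round(projected_gb, 1))  # ≈ 27.8 GB before any overhead
```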

This happens on the initial sync. What I suppose is happening is that data is not being cleared from memory because the fields do not exist in the schema at the top level.
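One way to check that hypothesis (a generic technique, not PGSync tooling) is the standard library's tracemalloc: start tracing before a chunk of the sync runs, then list the top allocation sites. The list comprehension below is just a stand-in for the real transform workload.

```python
import tracemalloc

tracemalloc.start()

# Stand-in for a slice of the sync/transform work being profiled
data = [
    {'delivery': {'location': {'type': 'Point', 'coordinates': [i, i]}}}
    for i in range(1000)
]

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('lineno')[:5]:
    print(stat)  # top allocation sites, grouped by source line
```

If the leak is in the transform path, the top entries should point at the lines building or retaining the nested documents.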

However, I would love to be pointed to the correct place, and yes, I sadly have to have these fields at the top level instead of parcel.delivery.location.

Sync with geo_point (150k/1.5M synced): [memory usage screenshot]

Sync without geo_point (600k/1.5M synced): [memory usage screenshot]
