[Bug]: Increased number of Read/Write operations per second #17647

EvgeniaPatsoni · 2024-05-13T12:04:58Z

Bug description

We use the Netdata Cloud Business plan to monitor several of our clusters, including some on AWS.

For one of these AWS clusters, we use an EFS file system to provide persistent volumes.
After some time we noticed a fairly high number of read/write operations between the EFS and the cluster node on which the Netdata parent pod is running.
Specifically, using nfsiostat, we saw ~300 operations/sec on average, most of them reads and fewer of them writes. This increased number of IO operations caused a large increase in the EFS cost on AWS.
We then disabled persistence for parent.alarms, parent.database and k8sState, and both the IO operations and the EFS cost dropped significantly (by almost $1000).
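
To put the ~300 operations/sec figure in perspective, here is a rough conversion to daily and monthly totals. It assumes the rate stays constant over a 30-day month, which is an assumption for illustration and not something we measured.

```python
# Rough scale of a sustained NFS operation rate.
# OPS_PER_SEC is the nfsiostat average quoted above; treating it as constant
# over a 30-day month is an assumption made only for this estimate.
OPS_PER_SEC = 300
SECONDS_PER_DAY = 24 * 60 * 60

ops_per_day = OPS_PER_SEC * SECONDS_PER_DAY
ops_per_30_days = ops_per_day * 30

print(f"{ops_per_day:,} operations per day")
print(f"{ops_per_30_days:,} operations per 30 days")
```

At that rate the node would issue about 25.9 million operations per day, which is why even a modest per-second number is noticeable on a metered file system.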

Based on the above behavior, I believe that the increased disk IO traffic is sustained and is not limited to the parent's startup process.

Finally, is Machine Learning enabled by default for the parent? We could disable it if you believe that the increased number of Disk IO operations comes from ML.

parent/child: 3,
number of collected metrics per second:

{
    "tier":0,
    "metrics":21556,
    "samples":474439282,
    "disk_used":256836556,
    "disk_max":268435456,
    "disk_percent":95.679073,
    "from":1715344390,
    "to":1715600972,
    "retention":256582,
    "expected_retention":268169,
    "currently_collected_metrics":17480
},{
    "tier":1,
    "metrics":21556,
    "samples":19316561,
    "disk_used":129440304,
    "disk_max":134217728,
    "disk_percent":96.4405417,
    "from":1715357220,
    "to":1715600972,
    "retention":243752,
    "expected_retention":252748,
    "currently_collected_metrics":17480
},{
    "tier":2,
    "metrics":21556,
    "samples":2500160,
    "disk_used":59486976,
    "disk_max":67108864,
    "disk_percent":88.6425018,
    "from":1714356000,
    "to":1715600972,
    "retention":1244972,
    "expected_retention":1404486,
    "currently_collected_metrics":17480
}
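
As a rough reading of the statistics above, the following sketch converts each tier's retention to days and divides `disk_used` by `retention` to get an average on-disk growth rate. The `TIERS` list and the `summarize` helper are names made up for this illustration, and the resulting rate is only an average over the retention window, not a measurement of actual NFS traffic.

```python
# Figures copied from the per-tier database statistics above.
TIERS = [
    {"tier": 0, "disk_used": 256836556, "from": 1715344390, "to": 1715600972, "retention": 256582},
    {"tier": 1, "disk_used": 129440304, "from": 1715357220, "to": 1715600972, "retention": 243752},
    {"tier": 2, "disk_used": 59486976, "from": 1714356000, "to": 1715600972, "retention": 1244972},
]

SECONDS_PER_DAY = 86400


def summarize(tier):
    """Return (retention in days, average stored bytes per second) for a tier."""
    days = tier["retention"] / SECONDS_PER_DAY
    bytes_per_sec = tier["disk_used"] / tier["retention"]
    return days, bytes_per_sec


for tier in TIERS:
    # Sanity check: retention should equal the span between "from" and "to".
    assert tier["to"] - tier["from"] == tier["retention"]
    days, rate = summarize(tier)
    print(f"tier {tier['tier']}: {days:.2f} days retained, "
          f"~{rate:.0f} bytes/s stored on average")
```

If this estimate is representative, the database grows by roughly 1 KB/s or less per tier, which would be consistent with reads rather than writes making up most of the operations seen in nfsiostat.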

In case you require any further information please let us know.

Expected behavior

The number of read/write operations should be lower, if possible.

Steps to reproduce

...

Installation method

helmchart (kubernetes)

System info

/etc/os-release:NAME="Amazon Linux"
/etc/os-release:VERSION="2"
/etc/os-release:ID="amzn"
/etc/os-release:ID_LIKE="centos rhel fedora"
/etc/os-release:VERSION_ID="2"
/etc/os-release:PRETTY_NAME="Amazon Linux 2"
/etc/os-release:ANSI_COLOR="0;33"
/etc/os-release:CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
/etc/os-release:SUPPORT_END="2025-06-30"
/etc/system-release:Amazon Linux release 2 (Karoo)

Netdata build info

Packaging:
    Netdata Version ____________________________________________ : v1.45.3
    Installation Type __________________________________________ : oci
    Package Architecture _______________________________________ : x86_64
    Package Distro _____________________________________________ : unknown
    Configure Options __________________________________________ : dummy-configure-command
Default Directories:
    User Configurations ________________________________________ : /etc/netdata
    Stock Configurations _______________________________________ : /usr/lib/netdata/conf.d
    Ephemeral Databases (metrics data, metadata) _______________ : /var/cache/netdata
    Permanent Databases ________________________________________ : /var/lib/netdata
    Plugins ____________________________________________________ : /usr/libexec/netdata/plugins.d
    Static Web Files ___________________________________________ : /usr/share/netdata/web
    Log Files __________________________________________________ : /var/log/netdata
    Lock Files _________________________________________________ : /var/lib/netdata/lock
    Home _______________________________________________________ : /var/lib/netdata
Operating System:
    Kernel _____________________________________________________ : Linux
    Kernel Version _____________________________________________ : 5.10.205-195.807.amzn2.x86_64
    Operating System ___________________________________________ : Amazon Linux
    Operating System ID ________________________________________ : amzn
    Operating System ID Like ___________________________________ : centos rhel fedora
    Operating System Version ___________________________________ : 2
    Operating System Version ID ________________________________ : 12
    Detection __________________________________________________ : /host/etc/os-release
Hardware:
    CPU Cores __________________________________________________ : 4
    CPU Frequency ______________________________________________ : 3584000000
    RAM Bytes __________________________________________________ : 16467312640
    Disk Capacity ______________________________________________ : 107374182400
    CPU Architecture ___________________________________________ : x86_64
    Virtualization Technology __________________________________ : kvm
    Virtualization Detection ___________________________________ : lscpu
Container:
    Container __________________________________________________ : container
    Container Detection ________________________________________ : kubernetes
    Container Orchestrator _____________________________________ : kubernetes
    Container Operating System _________________________________ : Debian GNU/Linux
    Container Operating System ID ______________________________ : debian
    Container Operating System ID Like _________________________ : unknown
    Container Operating System Version _________________________ : 12 (bookworm)
    Container Operating System Version ID ______________________ : 12
    Container Operating System Detection _______________________ : /etc/os-release
Features:
    Built For __________________________________________________ : Linux
    Netdata Cloud ______________________________________________ : YES
    Health (trigger alerts and send notifications) _____________ : YES
    Streaming (stream metrics to parent Netdata servers) _______ : YES
    Back-filling (of higher database tiers) ____________________ : YES
    Replication (fill the gaps of parent Netdata servers) ______ : YES
    Streaming and Replication Compression ______________________ : YES (zstd lz4 gzip)
    Contexts (index all active and archived metrics) ___________ : YES
    Tiering (multiple dbs with different metrics resolution) ___ : YES (5)
    Machine Learning ___________________________________________ : YES
Database Engines:
    dbengine ___________________________________________________ : YES
    alloc ______________________________________________________ : YES
    ram ________________________________________________________ : YES
    none _______________________________________________________ : YES
Connectivity Capabilities:
    ACLK (Agent-Cloud Link: MQTT over WebSockets over TLS) _____ : YES
    static (Netdata internal web server) _______________________ : YES
    h2o (web server) ___________________________________________ : YES
    WebRTC (experimental) ______________________________________ : NO
    Native HTTPS (TLS Support) _________________________________ : YES
    TLS Host Verification ______________________________________ : YES
Libraries:
    LZ4 (extremely fast lossless compression algorithm) ________ : YES
    ZSTD (fast, lossless compression algorithm) ________________ : YES
    zlib (lossless data-compression library) ___________________ : YES
    Brotli (generic-purpose lossless compression algorithm) ____ : NO
    protobuf (platform-neutral data serialization protocol) ____ : YES (system)
    OpenSSL (cryptography) _____________________________________ : YES
    libdatachannel (stand-alone WebRTC data channels) __________ : NO
    JSON-C (lightweight JSON manipulation) _____________________ : YES
    libcap (Linux capabilities system operations) ______________ : NO
    libcrypto (cryptographic functions) ________________________ : YES
    libyaml (library for parsing and emitting YAML) ____________ : YES
Plugins:
    apps (monitor processes) ___________________________________ : YES
    cgroups (monitor containers and VMs) _______________________ : YES
    cgroup-network (associate interfaces to CGROUPS) ___________ : YES
    proc (monitor Linux systems) _______________________________ : YES
    tc (monitor Linux network QoS) _____________________________ : YES
    diskspace (monitor Linux mount points) _____________________ : YES
    freebsd (monitor FreeBSD systems) __________________________ : NO
    macos (monitor MacOS systems) ______________________________ : NO
    statsd (collect custom application metrics) ________________ : YES
    timex (check system clock synchronization) _________________ : YES
    idlejitter (check system latency and jitter) _______________ : YES
    bash (support shell data collection jobs - charts.d) _______ : YES
    debugfs (kernel debugging metrics) _________________________ : YES
    cups (monitor printers and print jobs) _____________________ : NO
    ebpf (monitor system calls) ________________________________ : NO
    freeipmi (monitor enterprise server H/W) ___________________ : YES
    nfacct (gather netfilter accounting) _______________________ : NO
    perf (collect kernel performance events) ___________________ : YES
    slabinfo (monitor kernel object caching) ___________________ : YES
    Xen ________________________________________________________ : NO
    Xen VBD Error Tracking _____________________________________ : NO
    Logs Management ____________________________________________ : YES
Exporters:
    AWS Kinesis ________________________________________________ : NO
    GCP PubSub _________________________________________________ : NO
    MongoDB ____________________________________________________ : YES
    Prometheus (OpenMetrics) Exporter __________________________ : YES
    Prometheus Remote Write ____________________________________ : YES
    Graphite ___________________________________________________ : YES
    Graphite HTTP / HTTPS ______________________________________ : YES
    JSON _______________________________________________________ : YES
    JSON HTTP / HTTPS __________________________________________ : YES
    OpenTSDB ___________________________________________________ : YES
    OpenTSDB HTTP / HTTPS ______________________________________ : YES
    All Metrics API ____________________________________________ : YES
    Shell (use metrics in shell scripts) _______________________ : YES
Debug/Developer Features:
    Trace All Netdata Allocations (with charts) ________________ : NO
    Developer Mode (more runtime checks, slower) _______________ : NO

Additional info

No response

vkalintiris · 2024-05-13T12:15:17Z

> Finally, is Machine Learning enabled by default for the parent?

Yes, ML is enabled everywhere by default.

> We could disable it if you believe that the increased number of Disk IO operations comes from ML.

ML is probably the culprit here, because it needs to read historical data at regular intervals for every dimension that gets trained. You can disable it by updating the [ml] section of netdata.conf like this:

[ml]
    enabled = no

vkalintiris · 2024-05-16T13:30:35Z

@EvgeniaPatsoni I'm curious whether disabling ML fixed the issue you were facing.

EvgeniaPatsoni · 2024-05-17T09:35:17Z

Hello @vkalintiris

We disabled machine learning on the parent and upgraded to the latest version (3.7.89). We re-enabled persistence as well, so we'll have to wait a few days to see whether the traffic increases again.

vkalintiris · 2024-06-01T08:57:16Z

@EvgeniaPatsoni gentle ping on this one.

EvgeniaPatsoni added the bug and needs triage labels on May 13, 2024

ilyam8 added the need feedback, performance, and question labels and removed the needs triage label on May 16, 2024
