[Bug]: It looks like there is something wrong with the L0 compaction task with 248k segments #33131

Closed · 1 task done
ThreadDao opened this issue May 17, 2024 · 3 comments
Labels: kind/bug (Issues or changes related to a bug), triage/accepted (Indicates an issue or PR is ready to be actively worked on.)
Milestone: 2.4.3

@ThreadDao (Contributor):
### Is there an existing issue for this?

- [x] I have searched the existing issues

### Environment

- Milvus version: master-20240516-875ad88d
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

### Current Behavior

1. Deploy Milvus with the following config:

```yaml
spec:
  components:
    dataNode:
      replicas: 1
      resources:
        limits:
          cpu: "4"
          memory: 16Gi
        requests:
          cpu: "2"
          memory: 8Gi
    indexNode:
      replicas: 3
      resources:
        limits:
          cpu: "8"
          memory: 8Gi
        requests:
          cpu: "4"
          memory: 4Gi
    mixCoord:
      replicas: 1
      resources:
        limits:
          cpu: "4"
          memory: 16Gi
        requests:
          cpu: "2"
          memory: 8Gi
    proxy:
      resources:
        limits:
          cpu: "1"
          memory: 8Gi 
    queryNode:
      replicas: 1
      resources:
        limits:
          cpu: "4" 
          memory: 4Gi 
        requests:
          cpu: "2" 
          memory: 2Gi
  config:
    log:
      level: debug
    quotaAndLimits:
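      # Assumption: flushRate.max is measured in flush requests per second,
      # so max: 0.1 throttles clients to roughly one flush every 10 seconds.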
      flushRate:
        enabled: true
        max: 0.1
    trace:
      exporter: jaeger
      jaeger:
        url: http://tempo-distributor.tempo:14268/api/traces
      sampleFraction: 1
```

2. Create a collection with 1 shard and 4096 num_partitions (using a partition key).
3. Create an HNSW index -> insert 1M SIFT-128d vectors -> flush.
4. Run upsert and flush concurrently (a pymilvus sketch of steps 2-4 follows this list). (screenshot)
5. Something goes wrong with compaction: L0 compaction tasks do not complete. (screenshots)
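For reference, a minimal pymilvus sketch of steps 2-4; the connection details, collection/field names, index params, and batch size are assumptions, and the real test inserted 1M SIFT-128d vectors with upsert/flush running concurrently:

```python
# Sketch of steps 2-4 under stated assumptions (pymilvus 2.4.x, local server).
import random
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="localhost", port="19530")

schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("key", DataType.INT64, is_partition_key=True),
    FieldSchema("vec", DataType.FLOAT_VECTOR, dim=128),
])
# Step 2: one shard, 4096 partitions managed through the partition key.
coll = Collection("compact_opt", schema, shards_num=1, num_partitions=4096)

# Step 3: HNSW index, then insert and flush.
coll.create_index("vec", {"index_type": "HNSW", "metric_type": "L2",
                          "params": {"M": 8, "efConstruction": 200}})
rows = 10_000
coll.insert([
    list(range(rows)),                                             # id
    [random.randrange(4096) for _ in range(rows)],                 # partition key
    [[random.random() for _ in range(128)] for _ in range(rows)],  # vec
])
coll.flush()

# Step 4: upsert + flush; run these from separate threads to mimic the
# concurrent workload. Each upsert produces deletes that land in L0
# segments, and each flush seals segments, which drives the segment count up.
coll.upsert([
    list(range(rows)),
    [random.randrange(4096) for _ in range(rows)],
    [[random.random() for _ in range(128)] for _ in range(rows)],
])
coll.flush()
```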

### Expected Behavior

- Although there are only a few L0 segments, L0 compaction should still complete, and the number of flushed L0 segments should drop to 0.
- Mix compaction should continue, and the number of flushed L1 segments should gradually decrease.

### Steps To Reproduce

- argo workflow: https://argo-workflows.zilliz.cc/archived-workflows/qa/73692041-123c-49dc-aa8e-e047b94115c4?nodeId=compact-opt-partitionkey-5
- grafana: [metrics before compact-opt-op-5-3844 restarted](https://grafana-4am.zilliz.cc/d/uLf5cJ3Ga/milvus2-0?orgId=1&var-datasource=P1809F7CD0C75ACF3&var-cluster=&var-namespace=qa-milvus&var-instance=compact-opt-op-5-3844&var-collection=All&var-app_name=milvus&from=1715861072000&to=1715900414000)
- pods:

```
compact-opt-op-5-3844-etcd-0                                      1/1     Running       0               14h     10.104.21.70    4am-node24   <none>           <none>
compact-opt-op-5-3844-etcd-1                                      1/1     Running       1 (3h15m ago)   14h     10.104.27.209   4am-node31   <none>           <none>
compact-opt-op-5-3844-etcd-2                                      1/1     Running       0               14h     10.104.20.96    4am-node22   <none>           <none>
compact-opt-op-5-3844-milvus-datanode-6b86644dc5-s522w            1/1     Running       2 (3h12m ago)   14h     10.104.26.100   4am-node32   <none>           <none>
compact-opt-op-5-3844-milvus-indexnode-d6d8d778-clbxq             1/1     Running       1 (3h15m ago)   14h     10.104.30.152   4am-node38   <none>           <none>
compact-opt-op-5-3844-milvus-indexnode-d6d8d778-crt8k             1/1     Running       1 (3h15m ago)   14h     10.104.4.196    4am-node11   <none>           <none>
compact-opt-op-5-3844-milvus-indexnode-d6d8d778-hhfw6             1/1     Running       1 (3h15m ago)   14h     10.104.21.81    4am-node24   <none>           <none>
compact-opt-op-5-3844-milvus-mixcoord-6b7745bddc-5x8fv            1/1     Running       2 (3h12m ago)   14h     10.104.13.133   4am-node16   <none>           <none>
compact-opt-op-5-3844-milvus-proxy-6969654c8b-shjzv               1/1     Running       3 (3h10m ago)   14h     10.104.13.132   4am-node16   <none>           <none>
compact-opt-op-5-3844-milvus-querynode-0-7699bd85bb-9hxmx         1/1     Running       1 (3h15m ago)   14h     10.104.5.64     4am-node12   <none>           <none>
compact-opt-op-5-3844-minio-0                                     1/1     Running       0               14h     10.104.21.71    4am-node24   <none>           <none>
compact-opt-op-5-3844-minio-1                                     1/1     Running       0               14h     10.104.24.92    4am-node29   <none>           <none>
compact-opt-op-5-3844-minio-2                                     1/1     Running       0               14h     10.104.20.99    4am-node22   <none>           <none>
compact-opt-op-5-3844-minio-3                                     1/1     Running       0               14h     10.104.27.210   4am-node31   <none>           <none>
compact-opt-op-5-3844-pulsar-bookie-0                             1/1     Running       0               14h     10.104.21.72    4am-node24   <none>           <none>
compact-opt-op-5-3844-pulsar-bookie-1                             1/1     Running       0               14h     10.104.20.100   4am-node22   <none>           <none>
compact-opt-op-5-3844-pulsar-bookie-2                             1/1     Running       0               14h     10.104.24.95    4am-node29   <none>           <none>
compact-opt-op-5-3844-pulsar-bookie-init-hglzs                    0/1     Completed     0               14h     10.104.14.208   4am-node18   <none>           <none>
compact-opt-op-5-3844-pulsar-broker-0                             1/1     Running       0               14h     10.104.14.205   4am-node18   <none>           <none>
compact-opt-op-5-3844-pulsar-proxy-0                              1/1     Running       0               14h     10.104.14.204   4am-node18   <none>           <none>
compact-opt-op-5-3844-pulsar-pulsar-init-9bdtw                    0/1     Completed     0               14h     10.104.14.207   4am-node18   <none>           <none>
compact-opt-op-5-3844-pulsar-recovery-0                           1/1     Running       0               14h     10.104.5.62     4am-node12   <none>           <none>
compact-opt-op-5-3844-pulsar-zookeeper-0                          1/1     Running       0               14h     10.104.21.73    4am-node24   <none>           <none>
compact-opt-op-5-3844-pulsar-zookeeper-1                          1/1     Running       0               14h     10.104.27.212   4am-node31   <none>           <none>
compact-opt-op-5-3844-pulsar-zookeeper-2                          1/1     Running       0               14h     10.104.20.102   4am-node22   <none>           <none>
```

### Milvus Log

_No response_

### Anything else?

_No response_
@ThreadDao added the kind/bug and needs-triage labels on May 17, 2024
@ThreadDao (Contributor, Author):

The pod restarted after losing its connection to etcd. After the restart, the channel watch failed (for the reason, see #33125):

```
Milvus(compact-opt-op-5-3844) > show channel-watch --collection 449802983197901109
=============================
key: compact-opt-op-5-3844/meta/channelwatch/7/compact-opt-op-5-3844-rootcoord-dml_0_449802983197901109v0
Channel Name:compact-opt-op-5-3844-rootcoord-dml_0_449802983197901109v0 	 WatchState: ToWatch
Channel Watch start from: 2024-05-16 20:24:34 +0800, timeout at: 1970-01-01 08:00:00 +0800
Start Position ID: [8 1 16 0 24 0 32 0], time: 2024-05-16 20:24:31.219 +0800
Unflushed segments: []
Flushed segments: []
Dropped segments: []
--- Total Channels: 1
```
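As a side note, compaction progress can also be polled from the client. A rough sketch with the pymilvus ORM (the collection name is an assumption, and L0 compaction itself is scheduled internally by datacoord, so this only exercises the manual compaction path):

```python
# Sketch: trigger a manual compaction and poll its state via pymilvus.
from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")
coll = Collection("compact_opt")        # assumed collection name

coll.compact()                          # submit a manual compaction request
print(coll.get_compaction_state())      # overall state of the last request
print(coll.get_compaction_plans())      # per-plan details (source/target segments)
```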

@yanliang567 (Contributor):

Seems like L0 compaction is not working?
/assign @XuanYang-cn
/unassign

@yanliang567 added this to the 2.4.2 milestone on May 17, 2024
@yanliang567 added the triage/accepted label and removed the needs-triage label on May 17, 2024
@yanliang567 modified the milestone: 2.4.2 → 2.4.3 on May 24, 2024
@ThreadDao (Contributor, Author):

Upgraded to image 2.4-20240528-ef9c1191-amd64, and compaction started running.
