
Optimization task panicked after a collection is recovered, causing search API timeout #4131

Open
no7dw opened this issue Apr 29, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@no7dw

no7dw commented Apr 29, 2024

Current Behavior

We have a collection reporting the following error:

"optimizer_status": {
    "error": "Service internal error: Optimization task panicked: called `Option::unwrap()` on a `None` value"
},

This seems to cause slow search performance, with searches usually timing out.
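For reference, a minimal way to check this status (assuming the instance is reachable at $QDRANT_URL and jq is installed):

# query the collection info endpoint and extract only the optimizer status
curl -s "$QDRANT_URL/collections/new_text" | jq '.result.optimizer_status'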

We checked the Docker logs, which show:

2024-04-29T09:04:57.585769Z  INFO actix_web::middleware::logger: 103.152.220.15 "GET /collections/new_text HTTP/1.1" 200 761 "http://34.142.201.204:6333/dashboard" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36" 0.001342

2024-04-29T09:04:57.632442Z ERROR qdrant::startup: Panic backtrace:
   0: qdrant::startup::setup_panic_hook::{{closure}}
   1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
             at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/alloc/src/boxed.rs:2029:9
   2: std::panicking::rust_panic_with_hook
             at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/panicking.rs:785:13
   3: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/panicking.rs:659:13
   4: std::sys_common::backtrace::__rust_end_short_backtrace
             at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/sys_common/backtrace.rs:171:18
   5: rust_begin_unwind
             at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/panicking.rs:647:5
   6: core::panicking::panic_fmt
             at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/panicking.rs:72:14
   7: <segment::vector_storage::vector_storage_base::VectorStorageEnum as segment::vector_storage::vector_storage_base::VectorStorage>::is_deleted_vector
   8: segment::segment::Segment::vector_by_offset
   9: <segment::segment::Segment as segment::entry::entry_point::SegmentEntry>::vector
  10: <segment::segment::Segment as segment::entry::entry_point::SegmentEntry>::all_vectors
  11: collection::collection_manager::segments_searcher::SegmentsSearcher::retrieve
  12: collection::shards::local_shard::shard_ops::<impl collection::shards::shard_trait::ShardOperation for collection::shards::local_shard::LocalShard>::scroll_by::{{closure}}
  13: collection::shards::replica_set::read_ops::<impl collection::shards::replica_set::ShardReplicaSet>::scroll_by::{{closure}}::{{closure}}::{{closure}}
  14: collection::shards::replica_set::execute_read_operation::<impl collection::shards::replica_set::ShardReplicaSet>::execute_and_resolve_read_operation::{{closure}}
  15: <futures_util::future::try_future::into_future::IntoFuture<Fut> as core::future::future::Future>::poll
  16: collection::collection::point_ops::<impl collection::collection::Collection>::scroll_by::{{closure}}
  17: storage::content_manager::toc::point_ops::<impl storage::content_manager::toc::TableOfContent>::scroll::{{closure}}
  18: actix_web::handler::handler_service::{{closure}}::{{closure}}
  19: <actix_web::resource::Resource<T> as actix_web::service::HttpServiceFactory>::register::{{closure}}::{{closure}}
  20: <actix_utils::future::either::Either<L,R> as core::future::future::Future>::poll
  21: <actix_web_extras::middleware::condition::ConditionMiddlewareFuture<E,D> as core::future::future::Future>::poll
  22: <actix_cors::middleware::CorsMiddleware<S> as actix_service::Service<actix_web::service::ServiceRequest>>::call::{{closure}}
  23: <qdrant::actix::actix_telemetry::ActixTelemetryService<S> as actix_service::Service<actix_web::service::ServiceRequest>>::call::{{closure}}
  24: actix_http::h1::dispatcher::InnerDispatcher<T,S,B,X,U>::poll_response
  25: <actix_http::h1::dispatcher::Dispatcher<T,S,B,X,U> as core::future::future::Future>::poll
  26: tokio::runtime::task::raw::poll
  27: tokio::task::local::LocalSet::tick
  28: tokio::task::local::LocalSet::run_until::{{closure}}
  29: std::sys_common::backtrace::__rust_begin_short_backtrace
  30: core::ops::function::FnOnce::call_once{{vtable.shim}}
  31: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/alloc/src/boxed.rs:2015:9
  32: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/alloc/src/boxed.rs:2015:9
  33: std::sys::pal::unix::thread::Thread::new::thread_start
             at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/sys/pal/unix/thread.rs:108:17
  34: <unknown>
  35: clone

2024-04-29T09:04:57.632456Z ERROR qdrant::startup: Panic occurred in file /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/bitvec-1.0.1/src/slice/api.rs at line 2594: index 178442 out of bounds: 0

2024-04-29T09:04:57.632787Z DEBUG reqwest::connect: starting new connection: https://telemetry.qdrant.io/
2024-04-29T09:04:59.366668Z  INFO actix_web::middleware::logger: 103.152.220.15 "GET /collections/new_text/cluster HTTP/1.1" 200 166 "http://34.142.201.204:6333/dashboard" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36" 0.000207

Config for the collection new_text:

{
  "result": {
    "status": "red",
    "optimizer_status": {
      "error": "Service internal error: Optimization task panicked: called `Option::unwrap()` on a `None` value"
    },
    "indexed_vectors_count": 14246516,
    "points_count": 23415154,
    "segments_count": 12,
    "config": {
      "params": {
        "vectors": {
          "gemini:models/embedding-001": {
            "size": 768,
            "distance": "Cosine"
          },
          "huggingface:jinaai/jina-embeddings-v2-base-en": {
            "size": 768,
            "distance": "Cosine"
          },
          "mistral:mistral-embed": {
            "size": 1024,
            "distance": "Cosine"
          },
          "openai:text-embedding-ada-002": {
            "size": 1536,
            "distance": "Cosine",
            "on_disk": true
          }
        },
        "shard_number": 1,
        "replication_factor": 1,
        "write_consistency_factor": 1,
        "on_disk_payload": true
      },
      "hnsw_config": {
        "m": 16,
        "ef_construct": 100,
        "full_scan_threshold": 10000,
        "max_indexing_threads": 0,
        "on_disk": false
      },
      "optimizer_config": {
        "deleted_threshold": 0.2,
        "vacuum_min_vector_number": 1000,
        "default_segment_number": 0,
        "max_segment_size": null,
        "memmap_threshold": 120000,
        "indexing_threshold": 20000,
        "flush_interval_sec": 5,
        "max_optimization_threads": 16
      },
      "wal_config": {
        "wal_capacity_mb": 32,
        "wal_segments_ahead": 0
      },
      "quantization_config": {
        "binary": {
          "always_ram": true
        }
      }
    },
    "payload_schema": {
      "created_at": {
        "data_type": "integer",
        "points": 23323677
      },
      "website_profile": {
        "data_type": "keyword",
        "points": 23323677
      },
      "source": {
        "data_type": "keyword",
        "points": 23323677
      },
      "language": {
        "data_type": "keyword",
        "points": 300382
      },
      "headline": {
        "data_type": "text",
        "points": 23323677
      },
      "website_url": {
        "data_type": "keyword",
        "points": 23323677
      },
      "content": {
        "data_type": "text",
        "points": 23323677
      },
      "published_at": {
        "data_type": "integer",
        "points": 21134611
      }
    }
  },
  "status": "ok",
  "time": 0.000158987
}

Context (Environment)

Version: v1.9, running in Docker on Linux.

This collection (new_text) was restored from a snapshot; something seems to be wrong with the snapshot.
When the snapshot recovery completed (it took 1–2 hours), the collection could be queried/searched with good performance.
After roughly ~6 hours, when we query

GET collections/new_text

the error above shows up and the search API becomes extremely slow, usually timing out (>60 s).

Once this issue happens, the status always turns red.

One odd thing: when a query has a filter (an index hit), the collection remains queryable without timeouts, as in the sketch below.
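A minimal example of the kind of filtered read that still responds (sketch only: source is one of the indexed keyword fields listed above, and the filter value is just a placeholder):

# filtered scroll over the indexed "source" payload field (placeholder value)
curl -s -X POST "$QDRANT_URL/collections/new_text/points/scroll" \
  -H 'Content-Type: application/json' \
  -d '{"filter": {"must": [{"key": "source", "match": {"value": "example"}}]}, "limit": 10, "with_payload": true}'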

Detailed Description

The snapshot is likely broken due to a deletion from some abnormal operation (probably on our side).
We have other collections on this instance, and they seem to work fine.

Possible Implementation

Would it be possible to rebuild the index to work around the collection issue?
Thanks in advance.
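Not sure if this is the right approach, but would re-triggering the optimizers by updating the collection parameters be expected to rebuild the affected segments? A sketch of what we have in mind (the value shown is just our current indexing_threshold):

# nudge the optimizers by re-applying an optimizer setting on the collection
curl -s -X PATCH "$QDRANT_URL/collections/new_text" \
  -H 'Content-Type: application/json' \
  -d '{"optimizers_config": {"indexing_threshold": 20000}}'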

@no7dw no7dw added the bug Something isn't working label Apr 29, 2024
@generall
Member

Hey @no7dw, thanks for reporting this.

Could you please provide a bit more detail on how the snapshot was created (and on which version of the engine)?

@no7dw
Author

no7dw commented Apr 30, 2024

The current version is v1.9, while the backup snapshot was created by v1.8.3.

We used the following commands, just following the docs:

backup:
curl -s -X POST "$QDRANT_PROD_URL/collections/$COLLECTION_NAME/snapshots"

restore:
curl -X POST "$QDRANT_DEV_URL/collections/$COLLECTION_NAME/snapshots/upload" -H 'Content-Type:multipart/form-data' -F "snapshot=@/data/qdrant/snapshots/$COLLECTION_NAME/$SNAPSHOTS_NAME"

Additional details:
We are investigating (not sure about this yet) which operation caused it. One candidate: our team once deleted some of the unused vectors (we use multiple models, but only one is used in prod); details were disclosed in this issue.

@timvisee
Member

Does the very same thing happen repeatedly, or was this a one-time occurrence? That isn't entirely clear to me from the issue description.

@no7dw
Author

no7dw commented Apr 30, 2024

It happens repeatedly. Once I use a new VM and restore the collection (new_text), it will soon show:

"error": "Service internal error"

@timvisee
Member

I do indeed expect a corrupted snapshot state somehow.

We did just merge #4132 which likely fixes this panic. It'll be part of our next release.

In 1.9 we also improved snapshots and their data consistency, which should prevent this from happening in the future. Here's the related PR if you're interested: #3420
