
Optimization task panicked after a collection is recovered, causing search API timeout #4131

Open
no7dw opened this issue Apr 29, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@no7dw

no7dw commented Apr 29, 2024

Current Behavior

We have a collection reporting the following error:

"optimizer_status": {
    "error": "Service internal error: Optimization task panicked: called `Option::unwrap()` on a `None` value"
},

This seems to cause slow search performance, with searches usually timing out.
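For reference, a minimal way to check this status (assuming the instance is reachable at $QDRANT_URL and jq is installed):

# query the collection info endpoint and extract only the optimizer status
curl -s "$QDRANT_URL/collections/new_text" | jq '.result.optimizer_status'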

We checked the Docker logs, which show:

2024-04-29T09:04:57.585769Z  INFO actix_web::middleware::logger: 103.152.220.15 "GET /collections/new_text HTTP/1.1" 200 761 "http://34.142.201.204:6333/dashboard" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36" 0.001342

2024-04-29T09:04:57.632442Z ERROR qdrant::startup: Panic backtrace:
   0: qdrant::startup::setup_panic_hook::{{closure}}
   1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
             at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/alloc/src/boxed.rs:2029:9
   2: std::panicking::rust_panic_with_hook
             at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/panicking.rs:785:13
   3: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/panicking.rs:659:13
   4: std::sys_common::backtrace::__rust_end_short_backtrace
             at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/sys_common/backtrace.rs:171:18
   5: rust_begin_unwind
             at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/panicking.rs:647:5
   6: core::panicking::panic_fmt
             at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/panicking.rs:72:14
   7: <segment::vector_storage::vector_storage_base::VectorStorageEnum as segment::vector_storage::vector_storage_base::VectorStorage>::is_deleted_vector
   8: segment::segment::Segment::vector_by_offset
   9: <segment::segment::Segment as segment::entry::entry_point::SegmentEntry>::vector
  10: <segment::segment::Segment as segment::entry::entry_point::SegmentEntry>::all_vectors
  11: collection::collection_manager::segments_searcher::SegmentsSearcher::retrieve
  12: collection::shards::local_shard::shard_ops::<impl collection::shards::shard_trait::ShardOperation for collection::shards::local_shard::LocalShard>::scroll_by::{{closure}}
  13: collection::shards::replica_set::read_ops::<impl collection::shards::replica_set::ShardReplicaSet>::scroll_by::{{closure}}::{{closure}}::{{closure}}
  14: collection::shards::replica_set::execute_read_operation::<impl collection::shards::replica_set::ShardReplicaSet>::execute_and_resolve_read_operation::{{closure}}
  15: <futures_util::future::try_future::into_future::IntoFuture<Fut> as core::future::future::Future>::poll
  16: collection::collection::point_ops::<impl collection::collection::Collection>::scroll_by::{{closure}}
  17: storage::content_manager::toc::point_ops::<impl storage::content_manager::toc::TableOfContent>::scroll::{{closure}}
  18: actix_web::handler::handler_service::{{closure}}::{{closure}}
  19: <actix_web::resource::Resource<T> as actix_web::service::HttpServiceFactory>::register::{{closure}}::{{closure}}
  20: <actix_utils::future::either::Either<L,R> as core::future::future::Future>::poll
  21: <actix_web_extras::middleware::condition::ConditionMiddlewareFuture<E,D> as core::future::future::Future>::poll
  22: <actix_cors::middleware::CorsMiddleware<S> as actix_service::Service<actix_web::service::ServiceRequest>>::call::{{closure}}
  23: <qdrant::actix::actix_telemetry::ActixTelemetryService<S> as actix_service::Service<actix_web::service::ServiceRequest>>::call::{{closure}}
  24: actix_http::h1::dispatcher::InnerDispatcher<T,S,B,X,U>::poll_response
  25: <actix_http::h1::dispatcher::Dispatcher<T,S,B,X,U> as core::future::future::Future>::poll
  26: tokio::runtime::task::raw::poll
  27: tokio::task::local::LocalSet::tick
  28: tokio::task::local::LocalSet::run_until::{{closure}}
  29: std::sys_common::backtrace::__rust_begin_short_backtrace
  30: core::ops::function::FnOnce::call_once{{vtable.shim}}
  31: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/alloc/src/boxed.rs:2015:9
  32: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/alloc/src/boxed.rs:2015:9
  33: std::sys::pal::unix::thread::Thread::new::thread_start
             at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/sys/pal/unix/thread.rs:108:17
  34: <unknown>
  35: clone

2024-04-29T09:04:57.632456Z ERROR qdrant::startup: Panic occurred in file /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/bitvec-1.0.1/src/slice/api.rs at line 2594: index 178442 out of bounds: 0

2024-04-29T09:04:57.632787Z DEBUG reqwest::connect: starting new connection: https://telemetry.qdrant.io/
2024-04-29T09:04:59.366668Z  INFO actix_web::middleware::logger: 103.152.220.15 "GET /collections/new_text/cluster HTTP/1.1" 200 166 "http://34.142.201.204:6333/dashboard" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36" 0.000207

Config for the collection new_text:

{
  "result": {
    "status": "red",
    "optimizer_status": {
      "error": "Service internal error: Optimization task panicked: called `Option::unwrap()` on a `None` value"
    },
    "indexed_vectors_count": 14246516,
    "points_count": 23415154,
    "segments_count": 12,
    "config": {
      "params": {
        "vectors": {
          "gemini:models/embedding-001": {
            "size": 768,
            "distance": "Cosine"
          },
          "huggingface:jinaai/jina-embeddings-v2-base-en": {
            "size": 768,
            "distance": "Cosine"
          },
          "mistral:mistral-embed": {
            "size": 1024,
            "distance": "Cosine"
          },
          "openai:text-embedding-ada-002": {
            "size": 1536,
            "distance": "Cosine",
            "on_disk": true
          }
        },
        "shard_number": 1,
        "replication_factor": 1,
        "write_consistency_factor": 1,
        "on_disk_payload": true
      },
      "hnsw_config": {
        "m": 16,
        "ef_construct": 100,
        "full_scan_threshold": 10000,
        "max_indexing_threads": 0,
        "on_disk": false
      },
      "optimizer_config": {
        "deleted_threshold": 0.2,
        "vacuum_min_vector_number": 1000,
        "default_segment_number": 0,
        "max_segment_size": null,
        "memmap_threshold": 120000,
        "indexing_threshold": 20000,
        "flush_interval_sec": 5,
        "max_optimization_threads": 16
      },
      "wal_config": {
        "wal_capacity_mb": 32,
        "wal_segments_ahead": 0
      },
      "quantization_config": {
        "binary": {
          "always_ram": true
        }
      }
    },
    "payload_schema": {
      "created_at": {
        "data_type": "integer",
        "points": 23323677
      },
      "website_profile": {
        "data_type": "keyword",
        "points": 23323677
      },
      "source": {
        "data_type": "keyword",
        "points": 23323677
      },
      "language": {
        "data_type": "keyword",
        "points": 300382
      },
      "headline": {
        "data_type": "text",
        "points": 23323677
      },
      "website_url": {
        "data_type": "keyword",
        "points": 23323677
      },
      "content": {
        "data_type": "text",
        "points": 23323677
      },
      "published_at": {
        "data_type": "integer",
        "points": 21134611
      }
    }
  },
  "status": "ok",
  "time": 0.000158987
}

Context (Environment)

Version: v1.9, running in Docker on Linux.

This collection (new_text) was restored from a snapshot; something seems to be wrong with the snapshot.
When the snapshot recovery completed (it took 1–2 hours), the collection could be queried/searched with good performance.
After roughly ~6 hours, when we query

GET collections/new_text

the error above shows up and the search API becomes extremely slow, usually timing out (>60 s).

Once this issue happens, the status always turns red.

One odd thing: when a query has a filter (an index hit), the collection remains queryable without timeouts, as in the sketch below.
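A minimal example of the kind of filtered read that still responds (sketch only: source is one of the indexed keyword fields listed above, and the filter value is just a placeholder):

# filtered scroll over the indexed "source" payload field (placeholder value)
curl -s -X POST "$QDRANT_URL/collections/new_text/points/scroll" \
  -H 'Content-Type: application/json' \
  -d '{"filter": {"must": [{"key": "source", "match": {"value": "example"}}]}, "limit": 10, "with_payload": true}'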

Detailed Description

The snapshot is likely broken due to a deletion from some abnormal operation (probably on our side).
We have other collections on this instance, and they seem to work fine.

Possible Implementation

Would it be possible to rebuild the index to work around the collection issue?
Thanks in advance.
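Not sure if this is the right approach, but would re-triggering the optimizers by updating the collection parameters be expected to rebuild the affected segments? A sketch of what we have in mind (the value shown is just our current indexing_threshold):

# nudge the optimizers by re-applying an optimizer setting on the collection
curl -s -X PATCH "$QDRANT_URL/collections/new_text" \
  -H 'Content-Type: application/json' \
  -d '{"optimizers_config": {"indexing_threshold": 20000}}'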

@no7dw no7dw added the bug Something isn't working label Apr 29, 2024
@generall
Member

Hey @no7dw, thanks for reporting this.

Could you please provide a bit more detail on how the snapshot was created (and on which version of the engine)?

@no7dw
Author

no7dw commented Apr 30, 2024

The current version is v1.9, while the backup snapshot was created by v1.8.3.

We used the following commands, just following the docs:

backup:
curl -s -X POST "$QDRANT_PROD_URL/collections/$COLLECTION_NAME/snapshots"

restore:
curl -X POST "$QDRANT_DEV_URL/collections/$COLLECTION_NAME/snapshots/upload" -H 'Content-Type:multipart/form-data' -F "snapshot=@/data/qdrant/snapshots/$COLLECTION_NAME/$SNAPSHOTS_NAME"

Additional details:
We are investigating (not sure about this yet) which operation caused it. One candidate: our team once deleted some of the unused vectors (we use multiple models, but only one is used in prod); details were disclosed in this issue.

@timvisee
Member

Does the very same thing happen repeatedly, or was this a one-time occurrence? That isn't entirely clear to me from the issue description.

@no7dw
Author

no7dw commented Apr 30, 2024

It happens repeatedly. Once I use a new VM and restore the collection (new_text), it will soon show:

"error": "Service internal error"

@timvisee
Member

I do indeed expect a corrupted snapshot state somehow.

We did just merge #4132 which likely fixes this panic. It'll be part of our next release.

In 1.9 we also improved snapshots and their data consistency, which should prevent this from happening in the future. Here's the related PR if you're interested: #3420
