Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enhance: resource estimate improvement #32964

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

chyezh
Copy link
Contributor

@chyezh chyezh commented May 10, 2024

issue: #32963

  • Add Exception to file class to avoid exception ignore.

  • Add GetResourceUsage method for index, column and segment.

  • Use bin log size but not mem size in CU of search and query.

  • Use in-use disk size in disk cache weight if segment is loaded.

  • Enable cache metric itself, remove redundant metric at search/query path.

@sre-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: chyezh
To complete the pull request process, please assign wxyucs after the PR has been reviewed.
You can assign the PR to them by writing /assign @wxyucs in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sre-ci-robot sre-ci-robot added the size/XL Denotes a PR that changes 500-999 lines. label May 10, 2024
@mergify mergify bot added dco-passed DCO check passed. kind/enhancement Issues or changes related to enhancement labels May 10, 2024
Copy link
Contributor

mergify bot commented May 10, 2024

@chyezh ut workflow job failed, comment rerun ut can trigger the job again.

Copy link
Contributor

mergify bot commented May 10, 2024

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@sre-ci-robot sre-ci-robot added size/XXL Denotes a PR that changes 1000+ lines. and removed size/XL Denotes a PR that changes 500-999 lines. labels May 11, 2024
@chyezh chyezh force-pushed the fix_lru_master branch 2 times, most recently from d774a06 to 5c8d83c Compare May 11, 2024 03:19
Copy link
Contributor

mergify bot commented May 11, 2024

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Copy link
Contributor

mergify bot commented May 11, 2024

@chyezh ut workflow job failed, comment rerun ut can trigger the job again.

Copy link
Contributor

mergify bot commented May 11, 2024

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@chyezh chyezh force-pushed the fix_lru_master branch 2 times, most recently from 46989a0 to 595b74c Compare May 11, 2024 12:45
Copy link
Contributor

mergify bot commented May 11, 2024

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@chyezh
Copy link
Contributor Author

chyezh commented May 11, 2024

/run-cpu-e2e

@chyezh
Copy link
Contributor Author

chyezh commented May 11, 2024

rerun ut

Copy link
Contributor

mergify bot commented May 11, 2024

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@chyezh
Copy link
Contributor Author

chyezh commented May 11, 2024

/run-cpu-e2e

Copy link

codecov bot commented May 11, 2024

Codecov Report

Attention: Patch coverage is 70.79646% with 99 lines in your changes are missing coverage. Please review.

Project coverage is 82.08%. Comparing base (8a9a421) to head (7935ad6).
Report is 28 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff            @@
##           master   #32964    +/-   ##
========================================
  Coverage   82.07%   82.08%            
========================================
  Files        1007      999     -8     
  Lines      127647   128008   +361     
========================================
+ Hits       104767   105074   +307     
- Misses      18903    18960    +57     
+ Partials     3977     3974     -3     
Files Coverage Δ
internal/core/src/common/Types.h 31.45% <ø> (+3.43%) ⬆️
internal/core/src/index/Index.h 100.00% <ø> (ø)
internal/core/src/index/InvertedIndexTantivy.cpp 89.47% <100.00%> (ø)
internal/core/src/index/InvertedIndexTantivy.h 47.36% <ø> (ø)
internal/core/src/index/ScalarIndexSort.cpp 74.63% <100.00%> (+0.21%) ⬆️
internal/core/src/index/ScalarIndexSort.h 57.14% <ø> (ø)
internal/core/src/index/StringIndexMarisa.h 100.00% <ø> (ø)
internal/core/src/index/VectorDiskIndex.h 18.18% <ø> (ø)
internal/core/src/index/VectorMemIndex.h 100.00% <ø> (ø)
internal/core/src/segcore/SegmentGrowingImpl.h 81.35% <100.00%> (+0.99%) ⬆️
... and 31 more

... and 197 files with indirect coverage changes

@chyezh chyezh force-pushed the fix_lru_master branch 2 times, most recently from f37ae6f to dd77db4 Compare May 12, 2024 05:40
Copy link
Contributor

mergify bot commented May 12, 2024

@chyezh ut workflow job failed, comment rerun ut can trigger the job again.

Copy link
Contributor

mergify bot commented May 12, 2024

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Signed-off-by: chyezh <chyezh@outlook.com>
Signed-off-by: chyezh <chyezh@outlook.com>
Signed-off-by: chyezh <chyezh@outlook.com>
Signed-off-by: chyezh <chyezh@outlook.com>
Signed-off-by: chyezh <chyezh@outlook.com>
@chyezh
Copy link
Contributor Author

chyezh commented May 12, 2024

rerun ut

1 similar comment
@chyezh
Copy link
Contributor Author

chyezh commented May 12, 2024

rerun ut

Copy link
Contributor

mergify bot commented May 12, 2024

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@chyezh
Copy link
Contributor Author

chyezh commented May 12, 2024

/run-cpu-e2e

@chyezh
Copy link
Contributor Author

chyezh commented May 12, 2024

rerun ut

1 similar comment
@chyezh
Copy link
Contributor Author

chyezh commented May 12, 2024

rerun ut

Signed-off-by: chyezh <chyezh@outlook.com>
Copy link
Contributor

mergify bot commented May 12, 2024

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@chyezh
Copy link
Contributor Author

chyezh commented May 13, 2024

/run-cpu-e2e

@sunby
Copy link
Contributor

sunby commented May 13, 2024

/lgtm

Copy link
Contributor

mergify bot commented May 13, 2024

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@chyezh
Copy link
Contributor Author

chyezh commented May 13, 2024

/run-cpu-e2e

@chyezh chyezh added this to the 2.4.2 milestone May 13, 2024
Copy link
Contributor

@longjiquan longjiquan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm generally for the cpp part.

throw SegcoreError(
ErrorCode::UnistdError,
fmt::format("write index to fd error: {}", strerror(errno)));
throw;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

throw e.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

throw; will rethrow the current exception.

write_disk_duration_sum +=
(std::chrono::system_clock::now() - start_write_file);
AssertInfo(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ignore this exception here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File instance in File.h is already handing the exception.
So exception handling can be ignored here.

milvus::ResourceUsage
GetResourceUsage() const override {
auto mem_size = stats_.mem_size.load() + deleted_record_.mem_size();
return milvus::ResourceUsage{mem_size, 0};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also consider the intermediate index for growing index.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, it will be a TODO after incoming growing mmap index.

ResourceUsage
StringIndexMarisa::GetResourceUsage() const {
// TODO: we cannot estimate the memory usage of marisa trie.
return resource_usage_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

die we enable mmap for marisa trie?
if so, the memory usage can be set to 0

VectorDiskAnnIndex<T>::GetResourceUsage() const {
// TODO: we can't get the memory usage of vector index now.
auto disk_size = file_manager_->GetLocalFileSize();
return ResourceUsage{0, disk_size};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this need to double check with @liliu-z
i believe there is a way to estimate memory utilizations

ResourceUsage
VectorMemIndex<T>::GetResourceUsage() const {
// TODO: we can't get the memory usage of vector index now.
return resource_usage_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if mmap is not enbaled, resource usage should be all memory usage. but I dind't see any calculation here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will use file size as mem size if disable mmap

@@ -179,12 +167,6 @@ type promMetricsObserver struct {
nodeID string
label SegmentLabel // never updates

DiskCacheLoadTotal prometheus.Counter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we remove these metrics?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's already exported in cache monitor.go
repeated metrics can be removed.

// ResourceUsageEstimate returns the estimated resource usage of the segment
ResourceUsageEstimate() ResourceUsage

// ResourceUsageEstimate returns the estimated resource usage of the segment.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change the variable name,
ResourceUsageEstimateofLoad

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it

return pk.Size() + 8
})
return SegmentResourceUsage{
Predict: ResourceUsage{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do you get predict value before you load it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

original segment loader logic is used in prediction.

@@ -180,7 +180,7 @@ func NewManager() *Manager {
if segment == nil {
return 0
}
return int64(segment.ResourceUsageEstimate().DiskSize)
return int64(segment.ResourceUsageEstimate().GetInuseOrPredictDiskUsage())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for cache, if weight is set.
once we use index as alternative, or add new delete, the weight will not be correct.

@@ -1540,7 +1540,7 @@ func (loader *segmentLoader) checkSegmentSize(ctx context.Context, segmentLoadIn
zap.Float64("diskUsage(MB)", toMB(usage.DiskSize)),
zap.Float64("memoryLoadFactor", factor.memoryUsageFactor),
)
mmapFieldCount += usage.MmapFieldCount
mmapFieldCount += mmapFieldCount
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does we use mmapFieldCount?
and mmapFieldCount += mmapFieldCount seems to be a bug

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only used in logging, record how many fields to be mmap.

}

type LazyScavenger[K comparable] struct {
capacity int64
size int64
size *atomic.Int64
weight func(K) int64
weights map[K]int64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do not maintain weights here.
we should cache sizable object

@czs007 czs007 removed this from the 2.4.2 milestone May 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dco-passed DCO check passed. do-not-merge/hold kind/enhancement Issues or changes related to enhancement lgtm size/XXL Denotes a PR that changes 1000+ lines.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants