upstream: implementation of outlier detection extensions #34154

cpakulski · 2024-05-14T20:00:14Z

Commit Message:
upstream: implementation of outlier detection extensions

Additional Description:
The idea and need for the extensions are described in RFC document: https://docs.google.com/document/d/1ZCZSoirVB39eOLdD0VPlsEUING8c23Sq5bzozrv6f4k/edit?usp=drive_link

In a nutshell, the design decouples the types of results reported to the outlier detector from the algorithm that marks a host as unhealthy. For example, an outlier detector may be configured to count 3 consecutive errors, and the type of those errors can be defined by the user:

- count http codes in range 500-503
- count http codes in range 401-403
- count locally originated errors (resets, timeouts)

So the algorithm does not care about the exact type of a reported result. It is only interested in whether the reported result should be considered an error or not.
The idea of using user-defined errors can be expanded to database errors. See issue #24215 (I have a working prototype for errors reported by Redis).
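
To make the decoupling concrete, here is a minimal single-threaded sketch. It is not the code from this PR: `ErrorBucket`, `httpRange` and `ConsecutiveErrorsSketch` are hypothetical names, and locally originated errors are left out. The point is only that the counting algorithm sees "error" or "not error" and never looks at the code itself.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is an illustration, not Envoy's API.
// A "bucket" is a user-defined predicate deciding whether a result is an error.
using ErrorBucket = std::function<bool(uint64_t http_code)>;

// Returns a bucket matching HTTP codes in the inclusive range [low, high].
inline ErrorBucket httpRange(uint64_t low, uint64_t high) {
  return [low, high](uint64_t code) { return code >= low && code <= high; };
}

// Counts consecutive errors. What counts as an error is decided by the buckets.
class ConsecutiveErrorsSketch {
public:
  ConsecutiveErrorsSketch(uint32_t max, std::vector<ErrorBucket> buckets)
      : max_(max), buckets_(std::move(buckets)) {}

  // Returns true when the configured number of consecutive errors is reached.
  bool report(uint64_t http_code) {
    bool is_error = false;
    for (const auto& bucket : buckets_) {
      if (bucket(http_code)) {
        is_error = true;
        break;
      }
    }
    if (!is_error) {
      counter_ = 0; // Any non-error result breaks the streak.
      return false;
    }
    if (counter_ < max_) {
      ++counter_;
    }
    return counter_ >= max_;
  }

private:
  const uint32_t max_;
  std::vector<ErrorBucket> buckets_;
  uint32_t counter_{0};
};
```

With buckets for 500-503 and 401-403 and a threshold of 3, the sequence 500, 402, 503 trips the monitor on the third report, while a 200 in between would reset the streak.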

This design puts extensions on top of the already existing outlier detector, so the solution is 100% backwards compatible: previous configs are still accepted. In addition, a user may configure outlier detection extensions to:

- use "old" outlier detection to react to 5xx errors and add another range of HTTP errors (say 4xx)
- disable "old" outlier detection and configure everything using extensions

The implementation of extensions is built on top of already existing outlier detection structures. This minimizes code changes and re-uses event logger, timers, etc.

The implementation is built on top of already approved API for extensions: #31205

Risk Level: Low (previous configuration still works. Extensions do not have to be configured)
Testing: Added unit tests for new code and tests checking co-existence of "old" outlier and "new" extensions
Docs Changes: Yes. Added.
Release Notes: Yes.
Platform Specific Features: No
Fixes #18789

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

repokitteh-read-only · 2024-05-14T20:00:18Z

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #34154 was opened by cpakulski.

see: more, trace.

repokitteh-read-only · 2024-05-14T20:00:24Z

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @wbpcode
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #34154 was opened by cpakulski.

see: more, trace.

cpakulski · 2024-05-17T18:08:46Z

/retest

cpakulski · 2024-05-17T21:36:40Z

@wbpcode I see that you have been assigned for API approval, but changes in proto files are not really API changes, but rather changes to event log which is defined in terms of protobufs. Hope it helps!

nezdolik · 2024-05-20T21:10:56Z

will the old outlier detection extension be eventually deprecated? (given that the new one lands)

cpakulski · 2024-05-21T23:37:06Z

/retest

api/envoy/data/cluster/v3/outlier_detection_event.proto

cpakulski · 2024-05-22T15:22:03Z

/retest

envoy/upstream/outlier_detection.h

nezdolik · 2024-05-23T09:54:12Z

source/common/upstream/outlier_detection_impl.cc

@@ -67,12 +68,12 @@ void DetectorHostMonitorImpl::updateCurrentSuccessRateBucket() {

 void DetectorHostMonitorImpl::putHttpResponseCode(uint64_t response_code) {
  external_origin_sr_monitor_.incTotalReqCounter();
+  std::shared_ptr<DetectorImpl> detector = detector_.lock();


An alternative approach would be to disallow configuring the same outlier detection algorithm both the traditional (old) way and via an extension, or to fall back to the new mechanism if both are configured. That would simplify the flow and make it easier to reason about ejection events. Does it even make sense to configure the same algorithm twice with different params for the same cluster? I wonder what the use case could be.

I think that use case could be to use "old" method for 5xx errors and "new" method for 4xx range. "old" has been battle tested and users trust it, but it is not expandable for ranges outside of 5xx. Once there is enough usage of extensions, we can start limiting config options. WDYT?

nezdolik · 2024-05-23T10:13:09Z

source/extensions/outlier_detection_monitors/consecutive_errors/consecutive_errors_monitor.cc

+
+void ConsecutiveErrorsMonitor::onReset() { counter_ = 0; }
+
+class ConsecutiveErrorsMonitorFactory


Most Envoy extensions follow a specific code layout. It is not enforced, but it makes the code easier to navigate, helps group tests, etc. The factories that parse the extension proto config and create extension instances are usually placed in config.h+cc files. Extension-related code goes into its own dedicated header+impl files. For example:
extension config
extension

I don't have a strong opinion here, since some extensions deviate from this layout and this file does not have that much code.

OK. Thanks. I will leave it for now as it is.

nezdolik · 2024-05-23T10:23:00Z

source/extensions/outlier_detection_monitors/consecutive_errors/consecutive_errors_monitor.cc

+namespace Outlier {
+
+bool ConsecutiveErrorsMonitor::onError() {
+  if (counter_ < max_) {


This if block does not avoid race conditions, e.g. you have counter_ == 29 and max_ == 30:

T1 reads the counter_ value in line 12

T2 increments counter_ value to 30

T1 executes line 13 and increases counter_ to 31

You could utilize atomic CAS primitive: https://en.cppreference.com/w/cpp/atomic/atomic_compare_exchange

Yeah! That is good point. Will change the code.

Done. Converted to using CAS primitives. Thanks for catching it!
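
For readers following along, a minimal sketch of the bounded increment with CAS. This is not the code that landed in the PR: `BoundedCounter` and `hammer` are hypothetical names, and default (sequentially consistent) ordering is assumed. The bound check and the increment happen as one atomic step, so the counter cannot overshoot `max_` even when several threads report errors at once.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for the monitor's counter; not the code from the PR.
class BoundedCounter {
public:
  explicit BoundedCounter(uint32_t max) : max_(max) {}

  // Increments the counter unless it has already reached max_.
  // Returns true if the counter is at max_ after this call.
  bool increment() {
    uint32_t expected = counter_.load();
    while (expected < max_) {
      // On failure, compare_exchange_weak reloads `expected` with the current
      // value, so the bound is re-checked before the next attempt.
      if (counter_.compare_exchange_weak(expected, expected + 1)) {
        return expected + 1 == max_;
      }
    }
    return true;
  }

  uint32_t value() const { return counter_.load(); }
  void reset() { counter_.store(0); }

private:
  const uint32_t max_;
  std::atomic<uint32_t> counter_{0};
};

// Increments one counter from several threads and returns the final value.
inline uint32_t hammer(uint32_t max, int threads, int increments_per_thread) {
  BoundedCounter counter(max);
  std::vector<std::thread> workers;
  for (int t = 0; t < threads; ++t) {
    workers.emplace_back([&counter, increments_per_thread] {
      for (int i = 0; i < increments_per_thread; ++i) {
        counter.increment();
      }
    });
  }
  for (auto& worker : workers) {
    worker.join();
  }
  return counter.value();
}
```

In the scenario above (counter_ == 29, max_ == 30), T1's CAS fails after T2 wins, T1 reloads 30, sees the bound is reached, and leaves the counter alone.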

nezdolik · 2024-05-23T10:31:58Z

will the old outlier detection extension be eventually deprecated? (given that the new one lands)

It depends on the community and the adoption of extensions. In general, everything the current outlier detector offers will be implemented in the form of extensions, so deprecation should be easy. I believe that removing the "old" outlier detector should improve performance a bit, because in the current implementation all monitors, like frequency and success rate, always run regardless of whether users use them to eject nodes or not.

Deprecating the old mechanism would also reduce the maintenance load and the cognitive load for the outlier detection API (where the same thing can be achieved via different config mechanisms). I think we should do a proper deprecation cycle going forward.

cpakulski · 2024-05-23T19:16:37Z

OK. Let me try to add warnings that "old" consecutive errors will be deprecated. Thanks!

====================================================================================
Update:
Actually, after some thinking, I believe it would be best to start deprecating and displaying warnings once the extensions' functionality is equivalent to the "old" outlier detector, that is, after the success rate and failure frequency monitors are implemented. Once this PR is merged, the framework is ready and implementing those missing monitors should not take too much time. The reason is that the "old" outlier detector is "flat": once one option is enabled, it actually enables all methods, whereas we would like to display a warning only when a user uses the "old" consecutive_5xx, consecutive_gateway or consecutive_local_origin.

cpakulski · 2024-05-24T18:40:01Z

/retest

cpakulski · 2024-05-24T23:26:52Z

/retest

cpakulski · 2024-05-27T01:03:32Z

@nezdolik @wbpcode Thanks for reviewing this PR. I think I addressed all the comments. In some cases, I provided explanation why I am not planning to change anything and leave it as is. I would appreciate if you could do another pass. Thanks!

nezdolik · 2024-05-27T21:13:04Z

source/extensions/outlier_detection_monitors/consecutive_errors/consecutive_errors_monitor.cc

+  // no-op. Just keep executing compare_exchange_strong until threads synchronize.
+  do {
+    ;
+  } while (!counter_.compare_exchange_strong(expected_count, expected_count + 1));


I think the default memory order, memory_order_seq_cst, requires too much synchronization across worker threads here. You could use relaxed order for CAS failure and make writes to the atomic visible to other threads on success (e.g. memory order release).

How about just counter_++?

The reason is that while counter_ can be "atomically" increased, what we need is entire logic of incrementing and checking a value to be deterministic. This approach seems to be sort of pattern for this lock-less approach. See #34154 (comment)
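
A sketch of what the suggested orderings could look like, assuming a free function `incrementUpTo` (a hypothetical name, not from the PR) operating on a counter passed by reference. It also illustrates why a plain `counter_++` is not enough: the increment alone is atomic, but the check against the bound and the increment must succeed or fail together. Whether release ordering on success is sufficient depends on what other data readers of the counter rely on, so treat the orderings as an assumption to verify rather than a recommendation.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper; `counter` and `max` stand in for counter_ and max_.
// Returns true only for the call that moves the counter to `max`.
inline bool incrementUpTo(std::atomic<uint32_t>& counter, uint32_t max) {
  uint32_t expected = counter.load(std::memory_order_relaxed);
  while (expected < max) {
    if (counter.compare_exchange_weak(expected, expected + 1,
                                      std::memory_order_release,    // on success
                                      std::memory_order_relaxed)) { // on failure
      return expected + 1 == max;
    }
    // On failure `expected` now holds the current value; the loop condition
    // re-checks the bound before retrying.
  }
  return false; // The bound was already reached; nothing was written.
}
```

Compared with the `do { } while (!compare_exchange_strong(...))` loop in the hunk above, the retry here re-evaluates the bound on every iteration, which is the deterministic "check and increment" behaviour the discussion is about.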

nezdolik · 2024-05-27T21:23:11Z

@cpakulski thank you for all the work done! i added one more comment and believe this is now ready for further review. @wbpcode do you have capacity to review this one? (otherwise will tag senior maintainers)

nezdolik · 2024-05-29T10:58:16Z

/assign-from @envoyproxy/senior-maintainers

repokitteh-read-only · 2024-05-29T10:58:21Z

@envoyproxy/senior-maintainers assignee is @wbpcode

🐱

Caused by: a #34154 (comment) was created by @nezdolik.

see: more, trace.

wbpcode

Thanks for the great contribution. But it is huge, so I only have some initial comments for now.

wbpcode · 2024-06-03T06:46:20Z

CODEOWNERS

+# Outlier detection monitors
+/*/extensions/extensions/outlier_detection_monitors @cpakulski @nezdolik


It would be better to list the specific extension here. Like:

/*/extensions/extensions/outlier_detection_monitors/consecutive_errors

OK. I can do that.

wbpcode · 2024-06-03T07:11:38Z

source/extensions/extensions_build_config.bzl

+    # Outlier Detection Monitors
+    #
+    #
+    "envoy.outlier_detection_monitors.common":             "//source/extensions/outlier_detection_monitors/common:outlier_detection_monitors_lib",


envoy.outlier_detection_monitors.common is not an extension and does not need to be listed here.

Agree. I originally did not list it here, but unfortunately things are so interconnected in bazel/docs that the build fails without that line. I do not remember what exactly failed, but I remember I spent a few days trying to plug things together so everything works, and this line was just necessary.

wbpcode · 2024-06-03T07:11:57Z

source/extensions/extensions_metadata.yaml

+envoy.outlier_detection_monitors.common:
+  categories:
+  - envoy.outlier_detection_monitors
+  security_posture: robust_to_untrusted_downstream
+  status: alpha
+  undocumented: true


Same as above.

wbpcode · 2024-06-03T07:15:49Z

envoy/upstream/outlier_detection.h

+// Types of outlier detection extension results which can be reported.
+enum class ExtResultType;


Please just add the definition of this ResultType in the header file.

Done. Moved full definition of ResultType to the main header file.

wbpcode · 2024-06-03T07:42:59Z

envoy/upstream/outlier_detection.h

+/*
+ * Class carries result of a transaction with upstream entity
+ * or generated internally by Envoy.
+ * Different categories of results will be derived from that base class.
+ * Those categories of results are fed only into Outlier Detection extensions.
+ */
+class ExtResult {
+public:
+  ExtResult() = delete;
+  ExtResult(ExtResultType type) : type_(type) {}
+  virtual ExtResultType type() const { return type_; };
+  virtual ~ExtResult() = default;
+
+private:
+  const ExtResultType type_;
+};


I also don't get why this is necessary.

That is the base class for different types of results. So, Http results will be reported via a class derived from ExtResult. This is needed because when matching to user-defined buckets happens, the type is checked first. So, Http codes are matched only versus a bucket which has the same type. It does not make sense to match Http code against a bucket which has say database errors:

envoy/source/extensions/outlier_detection_monitors/common/monitor_base_impl.cc

Lines 34 to 38 in 43361bb

  // if the bucket is not interested in this type of result/error
  // just ignore it.
  if (!bucket->matchType(result)) {
    continue;
  }

The only common property across all different results is type.
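
A simplified sketch of the type-first matching described above. Only `ExtResult`, `ExtResultType` and `matchType` come from the PR; `HttpCodeResult`, `LocalOriginResult`, `HttpCodesBucket`, `match` and `isError` are hypothetical names chosen for illustration, and the real classes carry more detail.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Simplified mirror of the classes discussed above; details are assumptions.
enum class ExtResultType { HttpCode, LocalOrigin };

class ExtResult {
public:
  explicit ExtResult(ExtResultType type) : type_(type) {}
  virtual ~ExtResult() = default;
  ExtResultType type() const { return type_; }

private:
  const ExtResultType type_;
};

class HttpCodeResult : public ExtResult {
public:
  explicit HttpCodeResult(uint64_t code) : ExtResult(ExtResultType::HttpCode), code_(code) {}
  uint64_t code() const { return code_; }

private:
  const uint64_t code_;
};

class LocalOriginResult : public ExtResult {
public:
  LocalOriginResult() : ExtResult(ExtResultType::LocalOrigin) {}
};

class ErrorsBucket {
public:
  virtual ~ErrorsBucket() = default;
  virtual bool matchType(const ExtResult& result) const = 0;
  virtual bool match(const ExtResult& result) const = 0;
};

class HttpCodesBucket : public ErrorsBucket {
public:
  HttpCodesBucket(uint64_t low, uint64_t high) : low_(low), high_(high) {}
  bool matchType(const ExtResult& result) const override {
    return result.type() == ExtResultType::HttpCode;
  }
  bool match(const ExtResult& result) const override {
    // The downcast is safe only after matchType() returned true.
    const auto& http = static_cast<const HttpCodeResult&>(result);
    return http.code() >= low_ && http.code() <= high_;
  }

private:
  const uint64_t low_;
  const uint64_t high_;
};

// Returns true if any bucket of the matching type treats the result as an error.
inline bool isError(const std::vector<std::unique_ptr<ErrorsBucket>>& buckets,
                    const ExtResult& result) {
  for (const auto& bucket : buckets) {
    if (!bucket->matchType(result)) {
      continue; // The bucket is not interested in this type of result.
    }
    if (bucket->match(result)) {
      return true;
    }
  }
  return false;
}
```

The type tag is what lets a bucket of HTTP codes skip a locally originated result (or, later, a database error) without trying to interpret it.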

wbpcode · 2024-06-03T07:57:41Z

source/extensions/outlier_detection_monitors/common/monitor_base_impl.h

+class Monitor {
+public:
+  Monitor(const std::string& name, uint32_t enforce) : name_(name), enforce_(enforce) {}
+  Monitor() = delete;
+  virtual ~Monitor() {}
+  void reportResult(const ExtResult&);
+
+  void
+  setCallback(std::function<void(uint32_t, std::string, absl::optional<std::string>)> callback) {
+    callback_ = callback;
+  }
+
+  void reset() { onReset(); }
+  std::string name() const { return name_; }
+
+  void processBucketsConfig(
+      const envoy::extensions::outlier_detection_monitors::common::v3::ErrorBuckets& config);
+  void addErrorBucket(ErrorsBucketPtr&& bucket) { buckets_.push_back(std::move(bucket)); }
+
+protected:
+  virtual bool onError() PURE;
+  virtual void onSuccess() PURE;
+  virtual void onReset() PURE;
+  // Default extra info is empty string. Descendant classes may overwrite it.
+  virtual std::string getFailedExtraInfo() { return ""; }
+
+  std::string name_;
+  uint32_t enforce_{100};
+  std::vector<ErrorsBucketPtr> buckets_;
+  std::function<void(uint32_t, std::string, absl::optional<std::string>)> callback_;
+};
+
+using MonitorPtr = std::unique_ptr<Monitor>;
+
+class MonitorsSet {
+public:
+  void addMonitor(MonitorPtr&& monitor) { monitors_.push_back(std::move(monitor)); }
+  const std::vector<MonitorPtr>& monitors() { return monitors_; }
+
+private:
+  std::vector<MonitorPtr> monitors_;
+};


Please define the abstract interface of the extended Monitor in outlier_detection.h and use absl::InlinedVector<MonitorPtr, 3> as MonitorSet. This way, the code base in common/upstream does not need to depend on the code in extensions/.

You can create a MonitorBase here as common base class of different implementations.

OK. Good idea. Created base abstract class ExtMonitor in outlier_detection.h and derived intermediate class called ExtMonitorBase in source/extensions.... Other monitors can be derived from ExtMonitorBase.

wbpcode · 2024-06-03T07:58:34Z

source/extensions/outlier_detection_monitors/common/monitor_base_impl.h

+class MonitorFactoryContext {
+public:
+  MonitorFactoryContext(ProtobufMessage::ValidationVisitor& validation_visitor)
+      : validation_visitor_(validation_visitor) {}
+  ProtobufMessage::ValidationVisitor& messageValidationVisitor() { return validation_visitor_; }
+
+private:
+  ProtobufMessage::ValidationVisitor& validation_visitor_;
+};
+
+class MonitorFactory : public Envoy::Config::TypedFactory {
+public:
+  ~MonitorFactory() override = default;
+
+  virtual MonitorPtr createMonitor(const std::string& name, const Protobuf::Message& config,
+                                   MonitorFactoryContext& context) PURE;
+
+  std::string category() const override { return "envoy.outlier_detection_monitors"; }
+};


ditto. Move these classes to the outlier_detection.h

Hmm. I am confused. I thought that outlier_detection.h is mostly for defining abstract classes. Are you sure that outlier_detection.h is a better location? I was under the impression that source/extensions/... should hide as many details as possible.

wbpcode · 2024-06-03T08:03:52Z

source/common/upstream/outlier_detection_impl.cc

+
+  // Store extensions' config. It will be used to create extensions monitors
+  // when host are added to the cluster.
+  extensions_config_ = config.monitors();


You may need to get the monitor factory and load the monitor config here. Then only the creation of Monitor instances happens when we create the host detection monitor, which ensures that no exception (no factory, invalid config, etc.) is thrown at that time.

For example:

// Factory should load the configuration and return a lambda.
using MonitorFn = std::function<MonitorPtr()>;

class MonitorFactory : public Envoy::Config::TypedFactory {
public:
  ~MonitorFactory() override = default;

  virtual MonitorFn createMonitor(const std::string& name, const Protobuf::Message& config,
                                  MonitorFactoryContext& context) PURE;

  std::string category() const override { return "envoy.outlier_detection_monitors"; }
};
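
A self-contained sketch of the same idea without Envoy types, to show the intended split between config time and host-creation time. `MonitorConfigSketch`, `MonitorSketch` and `createMonitorFn` are hypothetical stand-ins (the real factory would take a `Protobuf::Message`), and the "threshold must be non-zero" check is only an example of a validation that may throw.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical, simplified types used only for this illustration.
struct MonitorConfigSketch {
  uint32_t threshold;
};

class MonitorSketch {
public:
  MonitorSketch(std::string name, uint32_t threshold)
      : name_(std::move(name)), threshold_(threshold) {}
  const std::string& name() const { return name_; }
  uint32_t threshold() const { return threshold_; }

private:
  const std::string name_;
  const uint32_t threshold_;
};

using MonitorPtr = std::unique_ptr<MonitorSketch>;
using MonitorFn = std::function<MonitorPtr()>;

// Validates the config once, at configuration time, and may throw here.
// The returned closure only constructs monitors from already validated values.
inline MonitorFn createMonitorFn(const std::string& name, const MonitorConfigSketch& config) {
  if (config.threshold == 0) {
    throw std::invalid_argument("threshold must be greater than zero");
  }
  const uint32_t threshold = config.threshold;
  return [name, threshold]() { return std::make_unique<MonitorSketch>(name, threshold); };
}
```

The detector config would store one `MonitorFn` per configured extension and call it each time a host is added, so any config error surfaces when the cluster is configured rather than when hosts appear.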

wbpcode · 2024-06-03T08:06:29Z

source/common/upstream/outlier_detection_impl.cc

+  extensions_config_ = config.monitors();
+}
+
+std::unique_ptr<Extensions::Outlier::MonitorsSet> DetectorConfig::createMonitorExtensions(


Reducing unnecessary heap allocation is always better, so an absl::InlinedVector<MonitorPtr, 3> may be a better choice.

Done. MonitorsSet uses inlined vector now. Thanks for good suggestion!

cpakulski · 2024-06-05T23:53:19Z

@wbpcode Thanks for your initial comments. I agree with you that this is large PR. The additional problem is that extensions should work along with "previous" implementation. I believe that code can be cleaner, but keeping backwards compatibility sometimes requires building not-very clean approach. Once we start deprecating "old" way we can start cleaning the code.

I addressed most of your comments. One or two comments require more thoughts and I will work on them tomorrow.

cpakulski added 11 commits April 24, 2024 14:45

Bridged extensions to upstream's outlier detector.

bf7fcd2

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Added test skeleton for common.

5f9a76d

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Eject and uneject is working for consecutive error extension.

1f918ba

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Added creating evemt log. Integrated with "enforcing" logic.

744c1d0

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Added support for locally originated events.

766c176

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Added visibility exception.

4f76dc5

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Added unit tests for monitor, local origin events, consecutive errors…

2b7e368

… monitor. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Added tests to verify co-existance with legacy outlier detector.

99e07c4

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Removed name from buckets.

c1c59f9

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Monitor class cleanup.

16abd3e

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Comment, cleanup.

86b3512

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

repokitteh-read-only bot added the api label May 14, 2024

repokitteh-read-only bot assigned wbpcode May 14, 2024

cpakulski added 9 commits May 14, 2024 21:05

Merge remote-tracking branch 'upstream/main' into od_ext_impl

caad4dd

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Fixed proto numbering.

6ceea01

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Added comments to event proto.

2ca515d

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Fixed extensions metadata error.

8b1e9d5

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Metadata change and function rename.

0e86c96

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Fixed docs. Added sponsor.

e4ffac4

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Updated docs.

d63ed76

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Fixed proto.

1eef3ca

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Fixed conflict in classes namespace.

06b037e

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

cpakulski changed the title ~~WIP upstream: implementation of outlier detection extensions~~ upstream: implementation of outlier detection extensions May 17, 2024

cpakulski marked this pull request as ready for review May 17, 2024 19:13

Updated proto comments.

8358b72

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Fixed format.

6ca3260

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

wbpcode assigned nezdolik May 22, 2024

wbpcode reviewed May 22, 2024

api/envoy/data/cluster/v3/outlier_detection_event.proto

wbpcode reviewed May 22, 2024

api/envoy/data/cluster/v3/outlier_detection_event.proto

Updated proto for consecutive errors monitor.

7085b6c

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

cpakulski requested review from wbpcode and nezdolik May 22, 2024 16:59

nezdolik reviewed May 23, 2024

cpakulski added 2 commits May 23, 2024 20:23

Take monitor name from configured extension name.

d83e4f2

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Made consecutive errors counter thread safe.

a382300

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

cpakulski added 2 commits May 24, 2024 19:20

Attempt to fix coverage.

98f6b58

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Reshuffled lines to address coverage issue.

b3c0cf0

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Merge remote-tracking branch 'upstream/main' into od_ext_impl

43361bb

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

nezdolik reviewed May 27, 2024

wbpcode requested changes Jun 3, 2024

wbpcode added the waiting label Jun 4, 2024

cpakulski added 2 commits June 5, 2024 23:33

Moved outlier's base classes to envoy main header file.

f562e81

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Merge remote-tracking branch 'upstream/main' into od_ext_impl

43fcfe9

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

repokitteh-read-only bot removed the waiting label Jun 5, 2024

