Feature Flags #3134
Comments
This is a terrific write-up. I love the example use cases, press-release preview, the drawbacks to the design, and the alternatives considered. Well done @yhakbar 👏 Some feedback in a somewhat random order:
Responding to @brikis98:
Feature Flag Dynamicity
This design did assume that it would be fully compatible with usage of external web services for feature flag management! I wanted to focus on the core functionality of how the
I would guess that the majority of users leveraging the feature flag functionality proposed here would be setting and adjusting environment variables dynamically in their CI/CD pipelines like GitHub Actions, GitLab CI, Jenkins, etc. Prioritizing the ability to toggle feature flags via environment variables and CLI flags was a way to ensure that the feature flag functionality could be used in a wide variety of CI/CD environments, without relying on an external service.
e.g. In the context of a GitHub Actions workflow, configuration like the following would allow for the
env:
  TG_FLAG_use_service_module_v2: ${{ vars.TG_FLAG_use_service_module_v2 }}
run: terragrunt apply -auto-approve
Now, for users who are currently using a feature flag management service, I think the current design does not preclude them from using it. There are two ways that I would expect users to use the feature flag functionality as currently proposed in conjunction with a feature flag management service:
I like the idea of seamless integration with feature flag management services that doesn't require leveraging Terragrunt functionality in a manner this sophisticated, however. If this is commonly done within the community, it might be worth it to prioritize a system for integrating with these services directly. Maybe a plugin system that provides nice interfaces for common feature flag management services?
Mixing of Concerns
I agree that there's definitely tension between the feature flag concept, the error suppression concept and the module skip concept. The error suppression and module skip concepts do end up constricting the feature flag implementation in such a way that it's not as flexible as folks typically want feature flags to be. Tying it to those concepts requires that the feature flag is boolean to allow for the module to be skipped or not, and that the feature flag is used to determine whether or not to suppress errors. As you described, this prevents usage of string or numeric feature flags. At the same time, I could imagine users wanting to tightly integrate those concepts, as it might only make sense to suppress particular errors within the context of a feature flag being enabled.
What do we think about having three separate configuration blocks for feature flags, error suppression, and module skipping? This would allow for more flexibility in how these concepts are used together, and would allow for more complex feature flag configurations that don't necessarily involve error suppression or module skipping.
So, instead of:
feature "feature_name" {
  default = false # Optionally default it so that you can opt-in or out.
  # Conditions that result in the feature being skipped.
  skip {
    actions = ["all"] # Actions to skip when active. Other options might be ["plan", "apply", "all_except_output"], etc
  }
  # Alter behavior on failure
  failure {
    ignorable_errors = ".*" # Specify a pattern that will be detected in the error for ignores, or just ignore any error
    message = "Flaky feature failing here!" # Add an optional warning message if it fails
    # Key-value map that can be used to emit signals on failure
    signals = {
      safe_to_revert = true # Signal that the apply is safe to revert on failure
    }
  }
}
We could have:
feature "feature_name" {
  default = "A"
}
skip {
  if = feature.feature_name.value == "A"
  actions = ["all"]
}
failure {
  ignorable_errors = feature.feature_name.value == "A" ? [".*"] : []
  message = feature.feature_name.value == "A" ? "Flaky feature failing here!" : "Woah, this feature is supposed to be solid!"
  signals = {
    safe_to_revert = feature.feature_name.value == "A"
  }
}
And folks might just conventionally keep the blocks together within the
I worry that this might introduce quite a bit of complexity to the configuration, but it might be worth it for the added flexibility. It would allow for the values of feature flags to take on more complex values, and for the other concepts to be used outside of the context of feature flags.
Feature Flags as Functions
I like the idea of not needing additional configuration blocks for feature flags, and instead using them as functions that can be used in various places in the configuration. I don't know if one would end up being more expensive to maintain than the other, so it might be worth preferring the cheaper option.
There may be advantages to having the feature flag defined via a block, however. e.g. It might be easier to see all of the feature flags that are available in configuration at a glance:
feature "feature_name" {
  default = "A"
}
terraform {
  source = "github.com/foo//bar?ref=${feature.feature_name.value == "A" ? "v2" : "v1"}"
}
Might be easier to spot than:
terraform {
  source = "github.com/foo//bar?ref=${feature_flag("feature_name") == "A" ? "v2" : "v1"}"
}
That would be especially relevant when searching for feature flags to remove once features are stable. This might even lend itself to a
There is also functionality that could be added to the feature block that would be difficult to add to a function. For example, configuring a default value for a feature flag might be more likely to be consistent when done via a block than when done via a function. e.g. To keep a default value consistent across all uses of a function, you might have to do something like:
locals {
  do_experiment = feature_flag("DO_EXPERIMENT", false) # Where the second argument is the default value
  value1 = local.do_experiment ? "A" : "B"
  value2 = local.do_experiment ? "C" : "D"
  # Because a different default is used here, it's harder to reason about the value of the feature flag
  value3 = feature_flag("DO_EXPERIMENT", true) ? "E" : "F"
}
Whereas, with a block, it's a lot more explicit:
feature "do_experiment" {
  default = false
}
locals {
  value1 = feature.do_experiment.value ? "A" : "B"
  value2 = feature.do_experiment.value ? "C" : "D"
  # Here we're explicitly negating the value of the feature flag, the default can't vary between uses
  value3 = !feature.do_experiment.value ? "E" : "F"
}
Having a block also allows for more complex feature flag configurations in the future, like the ability to configure a provider for integration with a feature flag management service or to have validations, etc.
I think the env-var and Note that I'm only looking for guidance here; not first-class features built into TG itself. At least, not at this stage. If this somehow becomes super popular, sure, we can think about native support in plugins or whatever, but for now, I just want to make sure that if we say "TG supports feature flags," that we support its most common use case, which is enabling/disabling features with a click in a UI.
I'm a big +1 on that. I think we'd want to iterate on exactly what the blocks are, but having these as separate entities seems much more powerful, maintainable, understandable, etc.
Your analysis is convincing. The block approach wins, hands-down, for helping with readability, understanding, and static analysis/commands based off feature toggles.
I think it's important to make it clear what the behavior will be when a module is skipped (or fails and the failure is ignored). If we use mock outputs or skip or whatever else, we need to make sure it's clear and expected for the user. Maybe even some sort of "use last known in case of skipped or failed dependency" setting, where we use the last known good outputs? Not sure on this, but again, clarity is king here :)
Summary
Provide first class support for feature flags as part of Terragrunt HCL configuration.
Allow for dynamic configuration of behavior in select terragrunt.hcl files based on the presence or absence of feature flags that are set via environment variables and CLI flags.
Motivation
Terragrunt is frequently used in monorepo contexts, and it lends itself to this in how it segments IAC state into separate directories. One definition of monorepos is a single codebase with multiple independent, but related, projects. By this definition, Terragrunt is very much an IAC monorepo tool. Multiple units of IAC are defined independently, and as a whole, they represent a repository of IAC.
Feature flags are a common way to manage the complexity of a monorepo. They allow for the gradual rollout of new features, the ability to turn off features that are not ready for production, and the ability to manage the complexity of a large codebase.
This is especially important in the context of Terragrunt, where infrastructure is most safely updated when updated in small, incremental changes. In addition, the ability to control how failure is handled in IAC is extremely important. Preventing full resolution of an apply across multiple Terragrunt units because a known flaky unit is failing is not always something that can be remediated by the use of retries, and it can be expensive to do so. Occasionally, it is better to ignore the failure of a known flaky unit and continue with the rest of applies, assuming that the failure is not critical to the overall success of the apply.
An example of such a failure would be a dependency chain where one service is deployed by Terragrunt, and has a url output where the service can be accessed, and another service which uses a dependency block to pass that url into the environment variables of a second service.
In this example, if the first service fails to deploy, the second service will also fail to deploy. However, if the first service is known to be flaky, and the second service is not dependent on the first service being deployed successfully, it is better to ignore the failure of the first service and continue with the deployment of the second service, leveraging the url output from a previous successful apply.
Reasons that a unit might be marked in this way include:
Proposal
Provide a combination of:
Proposed Syntax
Examples
The syntax is intended to be flexible enough to support a couple different use-cases that are common when using feature flags.
Dynamic Module Example
Mark a terragrunt.hcl file as having a feature that triggers usage of a new module that is not yet stable. In lower environments, this flag is enabled, and in production, it is disabled.
In addition, if the apply fails, it is safe to revert the apply, and a special error message is logged to the console.
In this contrived example, the "v2" tag of the module is not currently stable, however, to encourage continuous integration, the platform team has decided to merge in configurations that can use it when a flag is enabled. In the dev environment, the feature flag is enabled, and in the production environment, it is disabled.
When an apply fails, as is expected, a special message is emitted to STDERR to indicate that the source of failure is due to a failure in a feature flag.
In addition, on failure, a special failure-signals.json file will be created in the same directory as the terragrunt.hcl file with a payload that the platform team knows will be useful to handle the failure safely. In this scenario, the logic the team has agreed upon is that if any terragrunt apply fails, they revert to the last commit and re-run the apply, provided a safe_to_revert entry is found in the failure-signals.json for the corresponding terragrunt.hcl file that was applied.
The logic here is definitely not what would work for most organizations to achieve a reliable mechanism for reverting a failed apply. It is merely a demonstration of why authors might want signals emitted on failure.
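The configuration for this example is not preserved in this copy of the issue. As a rough sketch only, here is how it might look using the separated feature and failure blocks floated in the comments. The block names, the feature.<name>.value reference, and the signals map are proposed syntax rather than released Terragrunt behavior, and the module source is the placeholder used elsewhere in the thread.

```hcl
# terragrunt.hcl (sketch of PROPOSED syntax; not released Terragrunt behavior)

# Flag name taken from the TG_FLAG_use_service_module_v2 example in the comments.
feature "use_service_module_v2" {
  default = false # Off unless enabled, e.g. by setting the env var in the dev pipeline
}

terraform {
  # Placeholder module source; select the unstable "v2" tag only when the flag is on.
  source = "github.com/foo//bar?ref=${feature.use_service_module_v2.value ? "v2" : "v1"}"
}

failure {
  # Message emitted to STDERR when the apply fails.
  message = feature.use_service_module_v2.value ? "Apply failed while use_service_module_v2 was enabled" : "Apply failed with use_service_module_v2 disabled"

  # Assumed to be written to failure-signals.json next to this file on failure.
  signals = {
    safe_to_revert = feature.use_service_module_v2.value
  }
}
```

With something like this, the dev pipeline would enable the flag while production leaves it at its default, so both environments can share the same file.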
Unreliable Module Example
Mark a terragrunt.hcl file as being unreliable, and ignore any failures with errors matching Networking Error that might occur when applying it.
$ tree
.
├── reliable
│   └── terragrunt.hcl
└── unreliable
    └── terragrunt.hcl
In this example, users are able to mark the terragrunt.hcl file in the unreliable directory as being unreliable, knowing that it predictably produces an error with the message Networking Error that can be safely ignored when re-applied.
The ability to ignore errors in the unreliable module is handy here, as the reliable module has a stable output from the unreliable module that doesn't change much, and uses it as an input.
Examples of modules that can have this kind of relationship include:
The dependent modules can continue to codify their dependency relationships to get access to inputs like the database hostname, which is frequently required to connect to the database, and the cluster ID can be passed to the pod, so that its placement can be targeted to the cluster.
In both scenarios, users might find it convenient to be able to avoid failing to successfully deploy the dependent modules when predictable, intermittent errors occur in the dependency.
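As an illustration of the relationship described above, the two files might look roughly like the following. This is a sketch under assumptions: the feature and failure blocks follow the separated syntax discussed in the comments and are not released behavior, and the hostname output and database_hostname input are hypothetical names. The dependency and inputs blocks are existing Terragrunt configuration.

```hcl
# unreliable/terragrunt.hcl (feature/failure blocks are PROPOSED syntax)
feature "unreliable" {
  default = true # Error suppression is on unless the flag is explicitly turned off
}

failure {
  # Ignore only the known, predictable error, and only while the flag is enabled.
  ignorable_errors = feature.unreliable.value ? [".*Networking Error.*"] : []
  message          = "Ignoring a known networking error in the unreliable unit"
}

# reliable/terragrunt.hcl (existing Terragrunt syntax)
dependency "unreliable" {
  config_path = "../unreliable"
}

inputs = {
  # Hypothetical output name; uses the value recorded by the last successful apply.
  database_hostname = dependency.unreliable.outputs.hostname
}
```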
When using feature flags to support this kind of functionality, the feature flag can be opted-out, via setting an environment variable like so:
TG_FLAG_unreliable='false'
This allows for platform teams to safely test out removal of ignored failures until the feature configuration blocks can be removed (possibly by only disabling the feature in lower environments).
In-progress Module Example
Mark a terragrunt.hcl file as being in-progress, skipping all operations on it until a certain feature is complete. The feature can be manually turned on when developing locally, but is off by default.
When developing the module locally, use the following flag to activate the module:
This is a simple way to allow incomplete IaC work to be integrated into a code-base without requiring that the code be fully mature before merging it in.
Rapid, frequent and incremental integration is the standard in Continuous Integration, and this provides a mechanism for achieving that for large IaC code bases.
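A minimal sketch of the in-progress case, again assuming the separated feature and skip blocks discussed in the comments (proposed syntax, not released behavior). The flag name in_progress is hypothetical.

```hcl
# terragrunt.hcl for a unit that is still being built (PROPOSED syntax)
feature "in_progress" {
  default = false # Off by default, so the unit is skipped in CI and shared environments
}

skip {
  # Skip every action unless a developer has explicitly turned the flag on.
  if      = !feature.in_progress.value
  actions = ["all"]
}
```

Following the TG_FLAG_<feature name> convention from this proposal, a developer would presumably activate the unit locally by setting TG_FLAG_in_progress='true' before running Terragrunt.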
Feedback Requested
I don't think this is the best way to define this syntax, but I think it's a decent starting point. Something should be attempted, then iterated on based on feedback. I'm open to suggestions on how to improve this syntax.
Technical Details
Some components that will definitely be impacted include:
- feature blocks.
- error_hooks and retryable_errors already alter behavior of a normal Terragrunt execution on failure. This would be another tool that can change how errors are handled in Terragrunt due to the feature.failure block.
- TG_FLAG_<feature name>.
- --feature (or maybe --terragrunt-feature).
- terragrunt command when the feature.skip conditions are met.
- terragrunt run-all command when they have feature.skip conditions met.
Press Release
First Class Feature Flags
Terragrunt now has built-in support for feature flags, allowing behavior of Terragrunt executions to be altered dynamically at runtime.
Feature flags are a staple of modern DevOps best practices, and using them in Terragrunt will allow you to improve the scalability of your IaC code base.
Use feature flags to support the following, and more:
Feature flags are available as of [RELEASE]. To learn more about how to use them, click [here](link to feature flag documentation).
Drawbacks
Some drawbacks of this proposal include:
- terragrunt.hcl file. Users have already been encountering terragrunt.hcl files that are too long and difficult to maintain. This added complexity might make terragrunt.hcl files even more difficult to reason about.
- terragrunt.hcl files might be very difficult to reason about.
- skip logic, during execution of the module if the enabled status of the feature is used in controlling behavior, and if failure logic is used to handle failure.
Alternatives
1. get_env, and adding custom logic to adjust behavior of executions based on the values of the environment variables.
2. ignored_errors companion to the retryable_errors that just ignores errors instead of retrying them. Customers have been asking for functionality like this to handle both failures that are not intermittent enough to recover from retrying over a short duration, and errors in modules that are computationally or temporally expensive to retry soon after failure.
3. get_env and run_cmd. Provide nice walkthroughs on how to achieve common feature flag patterns with existing tooling in Terragrunt.
These alternatives, while less expensive than undertaking the introduction of net new functionality in Terragrunt, were considered less beneficial, as first class support for feature flags is generally something that makes a good match for Terragrunt, in my opinion.
Option #2 is also not necessarily mutually exclusive. It might be a good idea to pursue that anyways.
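For comparison, the get_env alternative can be approximated today with existing Terragrunt functions. The sketch below reuses the env var name and placeholder module source from earlier in this thread. The convention of comparing against the string "true" is an assumption, since get_env returns a string.

```hcl
# terragrunt.hcl using only existing Terragrunt functionality
locals {
  # get_env returns a string, so compare explicitly; defaults to "false" when unset.
  use_service_module_v2 = get_env("TG_FLAG_use_service_module_v2", "false") == "true"
}

terraform {
  # Placeholder module source; the ref switches based on the environment variable.
  source = "github.com/foo//bar?ref=${local.use_service_module_v2 ? "v2" : "v1"}"
}
```

This covers switching values based on a flag, but it does not provide the skip or failure-handling behavior that the proposal adds.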
Migration Strategy
None
Unresolved Questions
See the section above about the syntax of feature flags.
I also am not sure how expensive this functionality would be to implement and maintain.
Would the community be interested in this functionality, or would they be more interested in any of the alternatives?
References
Proof of Concept Pull Request
N/A
Support Level
Customer Name
No response