Failed (before master node created) to create EKS cluster fails deletion at first try, succeeds on second #3316

Open
pregnor opened this issue Nov 16, 2020 · 2 comments

pregnor commented Nov 16, 2020

Describe the bug

When EKS cluster creation fails before the control plane has been created, the first deletion attempt fails with secret not found while trying to delete the node pool label set. A subsequent deletion attempt succeeds without leaving any resources behind.

Steps to reproduce the issue:

  1. Try creating an EKS cluster with insufficient privileges for cluster role creation, envelope encryption, or possibly VPC creation.
  2. Observe that the creation fails before the control plane is created.
  3. Try deleting the failed cluster.
  4. Observe that the deletion fails with secret not found on node pool stack deletion.
  5. Try deleting the failed cluster again.
  6. Observe that the deletion succeeds.
  7. Check that the AWS resources for the cluster have been completely removed.

Expected behavior

The first deletion attempt should succeed as the second does.

Additional context

My guess is that the node pool label set deletion runs unconditionally early in the deletion process, even when the node pool label set operator was never installed.
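If that guess is right, one possible direction is to make the step conditional. The following is only a minimal sketch of the idea and not the actual pipeline workflow code: the `Cluster` type, its `ConfigSecretID` field, the `deletionSteps` helper, and the `delete-infrastructure` step name are all hypothetical. Only the `delete-node-pool-label-set` activity name comes from the log in this issue.

```go
package main

import "fmt"

// Cluster is a hypothetical, simplified view of a cluster record.
type Cluster struct {
	ID uint
	// ConfigSecretID is assumed to be empty when no kubeconfig secret
	// was ever stored, e.g. because creation failed before the control
	// plane came up.
	ConfigSecretID string
}

// shouldDeleteNodePoolLabelSet reports whether the label set deletion
// step can do anything useful for this cluster.
func shouldDeleteNodePoolLabelSet(c Cluster) bool {
	return c.ConfigSecretID != ""
}

// deletionSteps returns the ordered list of steps a deletion would run.
// "delete-infrastructure" is a placeholder for everything else.
func deletionSteps(c Cluster) []string {
	steps := []string{}
	if shouldDeleteNodePoolLabelSet(c) {
		steps = append(steps, "delete-node-pool-label-set")
	}
	steps = append(steps, "delete-infrastructure")
	return steps
}

func main() {
	// Cluster whose creation failed before a kubeconfig secret existed.
	fmt.Println(deletionSteps(Cluster{ID: 1}))
	// Cluster that got far enough to have a kubeconfig secret.
	fmt.Println(deletionSteps(Cluster{ID: 2, ConfigSecretID: "abc"}))
}
```

With a guard like this, the first deletion attempt would skip the step that currently needs the missing secret.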

Error:

ERRO[2096] Activity error.                               
ActivityType=delete-node-pool-label-set 
Domain=pipeline 
RunID=5b16698d-fb2d-4d74-934a-de1160bd68e2 
TaskList=pipeline 
WorkerID=42495@-@pipeline 
WorkflowID=2a8a5d23-ff29-4fae-93c5-0293da27dbfd_2 
application=pipeline.worker 
component=cadence-worker 
environment=production 
error="secret not found" 
errorVerbose="secret not found
github.com/banzaicloud/pipeline/internal/common/commonadapter.(*SecretStore).GetSecretValues
	/Users/pregnor/development/src/github.com/banzaicloud/pipeline/internal/common/commonadapter/secret.go:87
github.com/banzaicloud/pipeline/internal/kubernetes.DefaultConfigFactory.FromSecret
	/Users/pregnor/development/src/github.com/banzaicloud/pipeline/internal/kubernetes/config_factory.go:55
github.com/banzaicloud/pipeline/internal/kubernetes.DynamicClientFactory.FromSecret
	/Users/pregnor/development/src/github.com/banzaicloud/pipeline/internal/kubernetes/client_factory.go:64
github.com/banzaicloud/pipeline/internal/cluster.DynamicClientFactory.FromClusterID
	/Users/pregnor/development/src/github.com/banzaicloud/pipeline/internal/cluster/client_factory.go:108
github.com/banzaicloud/pipeline/internal/cluster/clusterworkflow.DeleteNodePoolLabelSetActivity.Execute
	/Users/pregnor/development/src/github.com/banzaicloud/pipeline/internal/cluster/clusterworkflow/delete_node_pool_label_set_activity.go:48
reflect.Value.call
	/usr/local/Cellar/go/1.15.5/libexec/src/reflect/value.go:476
reflect.Value.Call
	/usr/local/Cellar/go/1.15.5/libexec/src/reflect/value.go:337
go.uber.org/cadence/internal.(*activityExecutor).Execute
	/Users/pregnor/development/pkg/mod/go.uber.org/cadence@v0.13.4/internal/internal_worker.go:710
go.uber.org/cadence/internal.(*activityTaskHandlerImpl).Execute
	/Users/pregnor/development/pkg/mod/go.uber.org/cadence@v0.13.4/internal/internal_task_handlers.go:1820
go.uber.org/cadence/internal.(*activityTaskPoller).ProcessTask
	/Users/pregnor/development/pkg/mod/go.uber.org/cadence@v0.13.4/internal/internal_task_pollers.go:886
go.uber.org/cadence/internal.(*baseWorker).processTask
	/Users/pregnor/development/pkg/mod/go.uber.org/cadence@v0.13.4/internal/internal_worker_base.go:321
runtime.goexit
	/usr/local/Cellar/go/1.15.5/libexec/src/runtime/asm_amd64.s:1374"
pregnor changed the title from "Failed (before control plane) to create EKS cluster fails deletion at first try, succeeds on second." to "Failed (before control plane) to create EKS cluster fails deletion at first try, succeeds on second" on Nov 16, 2020

pregnor changed the title from "Failed (before control plane) to create EKS cluster fails deletion at first try, succeeds on second" to "Failed (before master node created) to create EKS cluster fails deletion at first try, succeeds on second" on Mar 18, 2021

pregnor commented Mar 18, 2021

An easier reproduction:

diff --git a/templates/eks/amazon-eks-iam-cf.yaml b/templates/eks/amazon-eks-iam-cf.yaml
index a797c34cd..b6330f938 100644
--- a/templates/eks/amazon-eks-iam-cf.yaml
+++ b/templates/eks/amazon-eks-iam-cf.yaml
@@ -2,6 +2,10 @@ AWSTemplateFormatVersion: '2010-09-09'
 Description: 'Amazon EKS IAM'
 
 Parameters:
+  BreakMe:
+    Type: String
+    Description: reproduction requirement.
+
   ClusterName:
     Type: String
     Description: The name of the EKS cluster.
  • Create an EKS cluster with this change applied: creation fails in 1 minute, the first delete fails with secret not found, and the second succeeds.

@janosSarusiKis

DeleteNodePoolLabelSetActivity is the activity that fails on the first attempt. On the second attempt this activity is not triggered, which may be why the second deletion succeeds. It is possible that cluster creation fails before the node pool label sets are created, so the delete fails when it tries to delete a nonexistent resource.
