Deployments canceled every time I try to redeploy them

Any idea why all my deployments are now canceled every time I try to redeploy them? I am an admin of the account, but it seems I don't have permission to redeploy anymore.

here are the deployment logs:

🏁 Deployment request ce578534-8cba-4a40-9d86-e1224bebe9d3-89-1692047457 for stage 1 `APPLICATION DEFAULT` has been sent to the engine
⏳ Your deployment is 1 in the Queue
🚀 Qovery Engine starts to execute the deployment
🧑‍🏭 Provisioning 1 docker builder with 4000m CPU and 8gib RAM for parallel build. This can take some time
🗂️ Provisioning container repository zb96fa686
📥 Cloning repository: https://github.com/Non-Fungible-Labs/thingie-art-manager-api.git to /home/qovery/.qovery-workspace/ce578534-8cba-4a40-9d86-e1224bebe9d3-89-1692047457/build/zb96fa686
🕵️ Checking if image already exist remotely 594084547872.dkr.ecr.eu-west-1.amazonaws.com/zb96fa686:16051326618033803529-cac0b0631af770fdf2e3c8af08eeac9aee5ba00d
🎯 Skipping build. Image already exist in the registry 594084547872.dkr.ecr.eu-west-1.amazonaws.com/zb96fa686:16051326618033803529-cac0b0631af770fdf2e3c8af08eeac9aee5ba00d
✅ Container image 594084547872.dkr.ecr.eu-west-1.amazonaws.com/zb96fa686:10837494947257010242-cac0b0631af770fdf2e3c8af08eeac9aee5ba00d is built and ready to use
💣 Deployment aborted following a failure to deploy a service
Qovery Engine has terminated the deployment

Hi @herve76, could you share your web console URL so we can check this? Thank you.

https://console.qovery.com/organization/ca6140f3-75f4-4d2b-9d0e-a014010dc781/project/3c003023-e391-40f0-a78f-90275a9e5d34/environment/ce578534-8cba-4a40-9d86-e1224bebe9d3/logs/b96fa686-f380-4043-8481-9ea156991d44/deployment-logs/ce578534-8cba-4a40-9d86-e1224bebe9d3-90

@herve76 our engineering team is looking into what happened


I also lost my permission to delete services.

Can you check with your organization admin? They might have changed your permissions.

It feels like something is wrong with permissions. Your user arn:aws:iam::[...]872:user/stephane@non[...].xyz is properly in the Admins group, but some permissions seem to be missing.
Do you mind double-checking that by following this doc?

Let us know if it helps.

Hi @bchastanier,

I took a look at the user arn:aws:iam::[…]872:user/stephane@non[…].xyz. I can confirm that we haven't touched user permissions on this account this week.

Could you please share more details regarding the missing permissions?

Regards,
Fred

Hey @realfredlai,

Did you have a chance to follow all the steps from the doc and confirm everything is correct? Did you change anything earlier (not only this week, but within the last 3 weeks) around those permissions?
Your cluster failed to update; I noticed it this week because we triggered a manual update of the Qovery stack, but the issue may have been there a bit before.

I am continuing to investigate to see if anything obvious turns up.

Hey @bchastanier, Fred might be sleeping now. I believe Fred double-checked that all our permissions follow all the steps of the doc. No changes were made on our side within the past weeks.

Thanks Ben for investigating our issue.

I am trying my best, but I cannot connect to the cluster with the user set in Qovery :frowning:

Is there a way for you to connect to the cluster and confirm the user arn:aws:iam::[...]872:user/stephane@non[...].xyz is properly set in the aws-auth configmap, via this command: kubectl get configmap aws-auth -n kube-system -o yaml? If not, can you please add it under mapUsers:

[...]
mapUsers: |
    - userarn: arn:aws:iam::[...]872:user/stephane@non[...].xyz
      username: USERNAME # <= you put username of the user having this ARN
      groups:
        - system:masters
[...]

Save the config map.

Let me know when it’s done.
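
For reference, a minimal command sketch to check and then edit that configmap (assuming kubectl access with working credentials; kubectl edit opens the configmap in your editor so you can add the mapUsers entry above):

  # Check whether stephane's user is already listed under mapUsers
  kubectl get configmap aws-auth -n kube-system -o yaml

  # Add the mapUsers entry shown above, then save and exit
  kubectl edit configmap aws-auth -n kube-system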


Hi @bchastanier,

From what we checked in the EKS authenticator logs, we can see there was a cluster change at 15/08/2023 00:23:10, and a few hours after that change stephane's account started getting permission issues. Do you know what was changed at that moment?

Also, since the EKS cluster was created by Qovery's Terraform, we don't have any way to access it at the moment. We are hoping Qovery's Terraform can help update the aws-auth config so that another AWS user can access the cluster.


Thanks,
Yang

We filtered some k8s API server logs. It seems a lambda, AWSWesleyClusterManagerLambda, tried to manage something but got a token-expired error, so it did not finish the job. After that happened, the node no longer had permission, and stephane's account lost permission as well. Please help confirm whether this also shows up in your logs.

BTW, the time in the screenshot is UTC


Hello @juehai,

Thanks for your input here. I can explain what happened on our end.

The lambda AWSWesleyClusterManagerLambda seems to come from the AWS control plane and to be used by AWS to monitor EKS deployments.

Last week we released a change in our stack to fully move the Qovery stack to role-based instead of user-based authentication, see this thread.
To roll it out, a cluster update was needed in order to:

  1. Remove the iam-eks-user-mapper user from IAM
  2. Create a new role for iam-eks-user-mapper in IAM
  3. Deploy the new app allowing IAM users from the Admins group to be granted access to the EKS cluster
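
For context, once the new app is running, the aws-auth configmap should end up with one mapUsers entry per IAM user in the Admins group, roughly like this (illustrative shape only; the exact usernames are whatever the mapper derives from each IAM user):

mapUsers: |
    - userarn: arn:aws:iam::<ACCOUNT_ID>:user/<admin-user>  # one entry per user in the Admins IAM group
      username: <admin-user>
      groups:
        - system:masters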

As to what's going on on your cluster, I guess it boils down to the EKS aws-auth configmap being empty, or at least not having stephane's account anymore, hence we cannot access it via credentials from this account.
The only root cause I can see for this is if somehow there was an issue during your cluster update between steps 1 and 2 (the AWS user account in IAM for the tool was removed but the new app hasn't been deployed): the iam-eks-user-mapper app is still the old one, trying to use the old user to update the aws-auth configmap, and it fails.
If so, if I had access to the cluster via the cluster creator (which I usually always have), I would manually add stephane's user to aws-auth, then trigger a cluster update again, which should fix the issue.

The issue here is that I don't understand why stephane's account access is completely lost. Usually when such an issue happens, I can still connect to the cluster using the cluster creator's credentials (the ones you provided to Qovery), which makes me think something else happened somehow.

So now, there are 3 solutions I can think of:

  1. (If the issue comes from the case described above) If you can use a master user from your AWS account (having all rights), you can try to run this command, which is supposed to add stephane's account to the cluster's aws-auth configmap (you can verify the result with the command shown after this list):
  eksctl create iamidentitymapping --region eu-west-1 --cluster qovery-z16cd1bde --arn arn:aws:iam::594084547872:user/stephane@nonfungiblelabs.xyz --group system:masters
    Let me know once this is done so I can try to connect to the cluster and check its status.
  2. Open a case with AWS support, sharing your cluster info, asking why we cannot connect to the cluster anymore and whether they can add this user back. There might also be an error on their side triggered by the cluster update somehow. In any case, this can help.

  3. Create a new cluster via Qovery and move your workload to it, using environment clone and targeting the new cluster.
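
Once the command from solution 1 has been run, you can verify the mapping was created with standard eksctl (same region and cluster name as above):

  # List the IAM identity mappings currently on the cluster
  eksctl get iamidentitymapping --region eu-west-1 --cluster qovery-z16cd1bde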

I really think there is something fishy going on here, so I would try solution 1 and then fall back to AWS support to get more insights. Solution 3 should be the last resort.

Please keep me posted so I can help you further.

Cheers


Hey @herve76 @juehai,

Do you have any updates here?
Let me know how I can help.

Cheers

I managed to get access to the cluster using the cluster creator's credentials.
And yes, the configmap is empty; I'm seeing many authenticator errors from the iam-eks-user-mapper.

[ERRO] 2023/08/23 10:17 InvalidClientTokenId: The security token included in the request is invalid.
	status code: 403, request id: 57167a8b-0675-468e-9b54-33e4e6416be4
[INFO] 2023/08/23 10:17 successfully updated user roles
[INFO] 2023/08/23 10:17 &ConfigMap{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:aws-auth,GenerateName:,Namespace:kube-system,SelfLink:,UID:7f10ff8b-e7ee-43eb-9e26-e6f523986926,ResourceVersion:171046175,Generation:0,CreationTimestamp:2022-06-20 00:13:58 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,ManagedFields:[{vpcLambda Update v1 2022-06-20 00:13:58 +0000 UTC nil} {kubectl-edit Update v1 2022-07-12 10:21:26 +0000 UTC nil} {app Update v1 2023-08-14 16:27:38 +0000 UTC nil}],},Data:map[string]string{mapRoles: - groups:
  - system:bootstrappers
  - system:nodes
  rolearn: arn:aws:iam::594084547872:role/qovery-eks-workers-z16cd1bde
  username: system:node:{{EC2PrivateDNSName}}
,mapUsers: []
,},BinaryData:map[string][]byte{},}
[ERRO] 2023/08/23 10:17 InvalidClientTokenId: The security token included in the request is invalid.
	status code: 403, request id: 47b53392-9612-40b7-b958-9cdb7dd8982f
[INFO] 2023/08/23 10:17 successfully updated user roles
[INFO] 2023/08/23 10:17 &ConfigMap{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:aws-auth,GenerateName:,Namespace:kube-system,SelfLink:,UID:7f10ff8b-e7ee-43eb-9e26-e6f523986926,ResourceVersion:171046175,Generation:0,CreationTimestamp:2022-06-20 00:13:58 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,ManagedFields:[{vpcLambda Update v1 2022-06-20 00:13:58 +0000 UTC nil} {kubectl-edit Update v1 2022-07-12 10:21:26 +0000 UTC nil} {app Update v1 2023-08-14 16:27:38 +0000 UTC nil}],},Data:map[string]string{mapRoles: - groups:
  - system:bootstrappers
  - system:nodes
  rolearn: arn:aws:iam::594084547872:role/qovery-eks-workers-z16cd1bde
  username: system:node:{{EC2PrivateDNSName}}
,mapUsers: []
,},BinaryData:map[string][]byte{},}
[ERRO] 2023/08/23 10:17 InvalidClientTokenId: The security token included in the request is invalid.
	status code: 403, request id: 2aa3eeb4-2531-4dee-89fa-81ebec685182
[INFO] 2023/08/23 10:17 successfully updated user roles
[INFO] 2023/08/23 10:17 &ConfigMap{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:aws-auth,GenerateName:,Namespace:kube-system,SelfLink:,UID:7f10ff8b-e7ee-43eb-9e26-e6f523986926,ResourceVersion:171046175,Generation:0,CreationTimestamp:2022-06-20 00:13:58 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,ManagedFields:[{vpcLambda Update v1 2022-06-20 00:13:58 +0000 UTC nil} {kubectl-edit Update v1 2022-07-12 10:21:26 +0000 UTC nil} {app Update v1 2023-08-14 16:27:38 +0000 UTC nil}],},Data:map[string]string{mapRoles: - groups:
  - system:bootstrappers
  - system:nodes
  rolearn: arn:aws:iam::594084547872:role/qovery-eks-workers-z16cd1bde
  username: system:node:{{EC2PrivateDNSName}}
,mapUsers: []
,},BinaryData:map[string][]byte{},}
[ERRO] 2023/08/23 10:18 InvalidClientTokenId: The security token included in the request is invalid.
	status code: 403, request id: 7feaa61b-2e89-4bfa-b337-0660a8d96402
[INFO] 2023/08/23 10:18 successfully updated user roles
[INFO] 2023/08/23 10:18 &ConfigMap{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:aws-auth,GenerateName:,Namespace:kube-system,SelfLink:,UID:7f10ff8b-e7ee-43eb-9e26-e6f523986926,ResourceVersion:171046175,Generation:0,CreationTimestamp:2022-06-20 00:13:58 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,ManagedFields:[{vpcLambda Update v1 2022-06-20 00:13:58 +0000 UTC nil} {kubectl-edit Update v1 2022-07-12 10:21:26 +0000 UTC nil} {app Update v1 2023-08-14 16:27:38 +0000 UTC nil}],},Data:map[string]string{mapRoles: - groups:
  - system:bootstrappers
  - system:nodes
  rolearn: arn:aws:iam::594084547872:role/qovery-eks-workers-z16cd1bde
  username: system:node:{{EC2PrivateDNSName}}
,mapUsers: []
,},BinaryData:map[string][]byte{},}

I tried adding stephane to the configmap using eksctl, but it quickly got wiped again (almost instantly).

What's the next course of action? It seems like the deployment that was triggered failed and caused issues in our cluster. Can you suggest how we can deploy the application manually to fix this? This is the result of describing the deployment:

Name:                   iam-eks-user-mapper
Namespace:              kube-system
CreationTimestamp:      Mon, 20 Jun 2022 12:16:07 +1200
Labels:                 app.kubernetes.io/instance=iam-eks-user-mapper
                        app.kubernetes.io/managed-by=Helm
                        app.kubernetes.io/name=iam-eks-user-mapper
                        app.kubernetes.io/version=0.1.0
                        helm.sh/chart=iam-eks-user-mapper-0.1.0
Annotations:            deployment.kubernetes.io/revision: 2
                        meta.helm.sh/release-name: iam-eks-user-mapper
                        meta.helm.sh/release-namespace: kube-system
Selector:               app.kubernetes.io/instance=iam-eks-user-mapper,app.kubernetes.io/name=iam-eks-user-mapper
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app.kubernetes.io/instance=iam-eks-user-mapper
                    app.kubernetes.io/name=iam-eks-user-mapper
  Service Account:  iam-eks-user-mapper
  Containers:
   iam-eks-user-mapper:
    Image:      public.ecr.aws/r3m4q3r9/iam-eks-user-mapper:v1.0.0
    Port:       <none>
    Host Port:  <none>
    Command:
      ./app
      --aws-iam-group
      Admins
      --k8s-cap
      system:masters
    Limits:
      cpu:     20m
      memory:  32Mi
    Requests:
      cpu:     10m
      memory:  32Mi
    Environment:
      AWS_REGION:             eu-west-1
      AWS_ACCESS_KEY_ID:      <REDACTED>
      AWS_SECRET_ACCESS_KEY:  <set to the key 'awsKey' in secret 'iam-eks-user-mapper'>  Optional: false
    Mounts:                   <none>
  Volumes:                    <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      True    MinimumReplicasAvailable
OldReplicaSets:  iam-eks-user-mapper-6477f5848d (0/0 replicas created)
NewReplicaSet:   iam-eks-user-mapper-6498884dcd (1/1 replicas created)
Events:          <none>

Just to add to my previous reply.

I see two options for fixing this:

  1. We add the cluster creator credentials to AWS and do an update that way, hoping it deploys correctly this time.

  2. We do a helm chart upgrade with the correct values to update the version (see the helm sketch after this list).
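
If we go the helm route, it's probably worth inspecting the current release first (a sketch using standard helm commands; the release name and namespace come from the describe output above):

  # Show the release history and which revision is currently deployed
  helm -n kube-system history iam-eks-user-mapper

  # Show the values the current release was installed with
  helm -n kube-system get values iam-eks-user-mapper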

As for the issue: I can see the old version of the service is still deployed. However, the user that the keys are tied to has been deleted, while the role that the newer version is supposed to use exists.

My thinking is that when the deploy was triggered, the user got deleted and the role got created, but the service deploy failed. The service cannot read the IAM group anymore, so it overwrites the aws-auth configmap with empty values.
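
A rough way to double-check that diagnosis (the IAM user and role names below are placeholders, not the real ones):

  # Which image/version of the mapper is actually running?
  kubectl -n kube-system get deployment iam-eks-user-mapper -o jsonpath='{.spec.template.spec.containers[0].image}'

  # Does the old IAM user still exist, and is the new role there? (AWS CLI)
  aws iam get-user --user-name <old-iam-eks-user-mapper-user>
  aws iam get-role --role-name <new-iam-eks-user-mapper-role>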

Hey @fv-kuba!

So it just confirms my theory:

As to what's going on on your cluster, I guess it boils down to the EKS aws-auth configmap being empty, or at least not having stephane's account anymore, hence we cannot access it via credentials from this account.
The only root cause I can see for this is if somehow there was an issue during your cluster update between steps 1 and 2 (the AWS user account in IAM for the tool was removed but the new app hasn't been deployed): the iam-eks-user-mapper app is still the old one, trying to use the old user to update the aws-auth configmap, and it fails.

In order to get everything back on track, here's what you can do (see the command sketch below):

  1. Scale the iam-eks-user-mapper deployment down to 0
  2. Wait for the iam-eks-user-mapper pod to be terminated
  3. Manually add stephane to the configmap using eksctl, as you did before
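
Roughly, that corresponds to (a sketch assuming kubectl and eksctl access via the cluster creator credentials; the label selector comes from the deployment describe output above, and the eksctl command is the one shared earlier):

  # Stop the mapper so it no longer overwrites aws-auth
  kubectl -n kube-system scale deployment iam-eks-user-mapper --replicas=0

  # Confirm the mapper pod is gone
  kubectl -n kube-system get pods -l app.kubernetes.io/name=iam-eks-user-mapper

  # Re-add stephane's user mapping
  eksctl create iamidentitymapping --region eu-west-1 --cluster qovery-z16cd1bde --arn arn:aws:iam::594084547872:user/stephane@nonfungiblelabs.xyz --group system:masters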

At this point, stephane's access should be good again, but iam-eks-user-mapper still needs to be deployed to its latest version. From there, either let me know and I will take it from here, or if you want to handle it yourself:

  1. Once stephane's access is back, try to connect to the cluster with his account; if it works, move to step 2
  2. Trigger a cluster update and wait for it to finish
  3. Check the iam-eks-user-mapper pod; it should be running without errors (see the log check after this list)
  4. Check the aws-auth configmap; it should contain some users (all users from your Admins group, including stephane)
  5. Try to connect with stephane's account; if that works, you should be all good
  6. Trigger an application deploy; it should be OK
Let me know when you start and/or when you are done with it; I can assist if you want.

Cheers

Hey, I will trigger an update now.


Seems to be all good, let me know :slight_smile: