[Karpenter] - CoreDNS installation error

Hello :wave:

I’m trying to migrate our cluster to the excellent new Karpenter feature :muscle:

But in my case I have an error on CoreDNS:

I’m thinking about deleting and recreating the CoreDNS add-on, but I’m not confident about what could happen to my cluster.
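In case it helps, here is roughly what that would look like with the AWS CLI (a sketch, assuming CoreDNS is installed as an EKS managed add-on; note that delete-addon has a --preserve flag that removes the add-on from EKS management while leaving the running CoreDNS resources in place):

    # Inspect the managed add-on before touching it
    aws eks describe-addon \
        --cluster-name qovery-zdc37d137 \
        --addon-name coredns

    # If recreating, --preserve keeps the running CoreDNS pods on the cluster
    aws eks delete-addon \
        --cluster-name qovery-zdc37d137 \
        --addon-name coredns \
        --preserve

    aws eks create-addon \
        --cluster-name qovery-zdc37d137 \
        --addon-name coredns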

My current node groups:

{
    "nodegroups": [
        "qovery-20240827181650210400000001"
    ]
}

The description:

{
    "nodegroup": {
        "nodegroupName": "qovery-20240827181650210400000001",
        "nodegroupArn": "arn:aws:eks:eu-west-1:XXXXXXXXX:nodegroup/qovery-zdc37d137/qovery-20240827181650210400000001/68c8ca87-047e-d567-8d65-6ba9b6576467",
        "clusterName": "qovery-zdc37d137",
        "version": "1.28",
        "releaseVersion": "1.28.11-20240817",
        "createdAt": "2024-08-27T20:16:54.093000+02:00",
        "modifiedAt": "2024-09-12T09:06:52.298000+02:00",
        "status": "ACTIVE",
        "capacityType": "ON_DEMAND",
        "scalingConfig": {
            "minSize": 3,
            "maxSize": 30,
            "desiredSize": 5
        },
        "instanceTypes": [
            "t3a.medium"
        ],
        "subnets": [
            "subnet-042c467f2128ea259",
            "subnet-05a060b5f0ce709fd",
            "subnet-0e6eb8f44e172f983",
            "subnet-0c720f7aa0917d36e",
            "subnet-07b54265ebef2b6a3",
            "subnet-088e44d3cae56e0dd"
        ],
        "amiType": "AL2_x86_64",
        "nodeRole": "arn:aws:iam::XXXXXXXX:role/qovery-eks-workers-zdc37d137",
        "labels": {},
        "resources": {
            "autoScalingGroups": [
                {
                    "name": "eks-qovery-20240827181650210400000001-68c8ca87-047e-d567-8d65-6ba9b6576467"
                }
            ]
        },
        "health": {
            "issues": []
        },
        "updateConfig": {
            "maxUnavailablePercentage": 10
        },
        "launchTemplate": {
            "name": "terraform-20230502090830471400000001",
            "version": "5",
            "id": "lt-00870d69fbdb99527"
        },
        "tags": {
            "QoveryNodeGroupId": "zdc37d137-1",
            "ClusterId": "zdc37d137",
            "QoveryProduct": "EKS",
            "Region": "eu-west-1",
            "Service": "EKS",
            "OrganizationLongId": "cf7a89a0-7fe4-4096-9f62-a99aa7dd3f21",
            "OrganizationId": "zcf7a89a0",
            "creationDate": "2023-05-02T09:08:29Z",
            "QoveryNodeGroupName": "default",
            "ClusterLongId": "dc37d137-c921-4f9b-88e2-a0e26c461b42"
        }
    }
}
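For reference, both snapshots above come from the AWS CLI:

    aws eks list-nodegroups --cluster-name qovery-zdc37d137

    aws eks describe-nodegroup \
        --cluster-name qovery-zdc37d137 \
        --nodegroup-name qovery-20240827181650210400000001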

Thanks for your help :pray:

Hello @Mike
Can you share the web console URL of your cluster?

Thank you

Hello @Pierre_Gerbelot :wave:

Yes sorry: https://console.qovery.com/organization/cf7a89a0-7fe4-4096-9f62-a99aa7dd3f21/cluster/dc37d137-c921-4f9b-88e2-a0e26c461b42/logs

I’m investigating the issue. I will let you know once it has been fixed.


The cluster update has been fixed, and your cluster is now running Karpenter.

You can now redeploy all your environments.

We will work on making the CoreDNS update more robust in the next release.

Thank you.


Thanks a lot for your responsiveness @Pierre_Gerbelot :hand_with_index_finger_and_thumb_crossed:

Hi @Pierre_Gerbelot

I have the same issue on my cluster, could you help me?

The link to my console is: https://console.qovery.com/organization/fcdd956b-60bb-4d2c-906c-80d7ac3bb53d/cluster/b1193cd7-bbd6-4fbd-a6b7-c7ee0062c121/logs

Thanks,
Samuel

Hello @wewelll ,
I’m looking into the issue, and I will let you know once it is fixed.
Thank you


Thank you for your responsiveness @Pierre_Gerbelot! I see that you have launched a new deployment of the cluster, and there is a new issue: the iam-eks-user-mapper deployment is timing out.

Also, I can’t reach my deployed apps anymore…
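For anyone triaging the same timeout, a minimal way to locate and inspect that deployment (the namespace varies by installation, so discover it first):

    # Find where iam-eks-user-mapper lives
    kubectl get deployments -A | grep iam-eks-user-mapper

    # Then check rollout state and recent events, substituting the namespace found above
    kubectl -n <namespace> rollout status deployment/iam-eks-user-mapper
    kubectl -n <namespace> describe deployment iam-eks-user-mapper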

Hey @wewelll,

Yes, sorry for the inconvenience. There is an issue with the CNI add-on; we are working on making your cluster available again.


Your cluster has been fixed, even though the status shows an error in the web console.

We have locked your cluster until Monday, when we will apply a definitive correction.

We encountered an issue with Datadog that prevented all pods from starting. We were forced to delete the Datadog mutating webhook configuration because APM is activated by default. We saw that you tried to update the Datadog Helm chart. Please try again to reinstall the mutating webhook.
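For reference, that deletion amounts to something like this (a sketch; the object name datadog-webhook is an assumption and can differ between chart versions, so list first):

    # Find the Datadog mutating webhook configuration (name varies by chart version)
    kubectl get mutatingwebhookconfigurations | grep -i datadog

    # Deleting it stops the admission controller from injecting into new pods;
    # the Datadog Cluster Agent should recreate it on its next reconciliation
    kubectl delete mutatingwebhookconfiguration datadog-webhook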

Sorry for the inconvenience

Notice: For users who encounter the same issue, please wait until Monday (09/16/24) when a fix will be delivered.

Thank you.

Thank you for your answer.

Yes, I’m using mutateUnlabelled: true in the Datadog chart because, at the time I set it up, I could not label each service for the admission controller.

I could get rid of mutateUnlabelled: true now.
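For anyone doing the same, a minimal sketch (the release name datadog and the namespace below are assumptions): turn off blanket injection in the chart, then opt individual workloads in with the label the admission controller watches:

    # Disable injection into unlabelled pods
    helm upgrade datadog datadog/datadog \
        --reuse-values \
        --set clusterAgent.admissionController.mutateUnlabelled=false

    # Opt a workload in explicitly via the pod template label
    kubectl -n my-namespace patch deployment my-app --type merge \
        -p '{"spec":{"template":{"metadata":{"labels":{"admission.datadoghq.com/enabled":"true"}}}}}'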

I redeployed the Datadog chart and it worked, but now I have another issue when deploying my applications: I see an ImagePullBackOff error with the message Failed to pull image .... no match for platform in manifest: not found

Hello @wewelll
When you migrated to Karpenter, I believe you selected ARM64 as the architecture, whereas your node group was previously configured with AMD64 nodes.

Now, when you deploy your applications, the ARM64 architecture is being used, but the image you’re using is not compatible with this architecture. This is causing the ImagePullBackOff error.
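A quick way to confirm this is to inspect which platforms the image manifest actually provides (the image name below is a placeholder):

    # Lists the platforms (e.g. linux/amd64, linux/arm64) published for the tag
    docker buildx imagetools inspect my-registry/my-app:latest

    # Alternative without buildx
    docker manifest inspect my-registry/my-app:latest | grep -A2 platform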

Was this architecture change made on purpose?

No, it was a mistake on my side; I need AMD64 nodes.

Is it possible to update the architecture? I can’t do it from the console because my cluster is locked…

Now that you have changed the architecture in the cluster settings, I think you should redeploy all the environments running on this cluster (also Datadog).

Thank you


This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.