[KARPENTER] - ARM nodes selected in an AMD-only cluster

Hello :wave:

Since I switched my cluster to Karpenter (with Spot instances enabled), I have been seeing a weird behaviour.

Karpenter sometimes provisions ARM-architecture (arm64) nodes, but my cluster is configured to provide only AMD (amd64) nodes.

I saw a post describing exactly the same problem. In my case, Metabase crashes regularly because the official Docker image does not work on ARM.

Could you check whether the Karpenter migration went well, or whether something is wrong?

I also noticed that Karpenter does not optimize resources very well: sometimes I have to delete some Karpenter nodeClaims to remove expensive nodes and get back to normal billing, or to get rid of the ARM nodes.
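
For reference, this is roughly what I run to clean them up (a sketch; the nodeClaim name is a placeholder):

  # list the nodeClaims Karpenter currently manages
  kubectl get nodeclaims

  # delete one so Karpenter drains the node and reschedules the pods
  kubectl delete nodeclaim <nodeclaim-name>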

Thanks a lot :pray:

As a quick fix, I built a multi-arch Metabase Docker image and specified it in the Helm values.yml.

But the underlying problem persists: Karpenter provisions ARM-architecture nodes even though I configured only the AMD architecture.

Hi @Mike, just a quick message to tell you that someone from our support engineering team will take a look at your thread tomorrow :slight_smile:

Hello @Mike,

Currently, we have configured Karpenter to create nodes from a list of instance categories.
In this list, there are nodes of different architecture types (amd64 and arm64).
The “Default node architecture” option allows applications to be constrained to deploy on the defined architecture.
This is done by adding the architecture in the node affinity section:

  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/arch
          operator: In
          values:
          - amd64

This option does not constrain the architecture of the nodes that Karpenter can create.
However, in the coming weeks, we will introduce more fine-grained configuration options for the nodes that Karpenter can use. This will allow you to limit node creation to a single architecture type if needed.
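
For reference, in the upstream Karpenter API this kind of restriction is expressed as a requirement on the NodePool. A minimal sketch of the upstream resource (not the exact managed configuration we will expose):

  apiVersion: karpenter.sh/v1beta1
  kind: NodePool
  metadata:
    name: default
  spec:
    template:
      spec:
        requirements:
          # every node created by this pool must be amd64
          - key: kubernetes.io/arch
            operator: In
            values: ["amd64"]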

When a Helm chart only supports one architecture, you can apply the same constraint to ensure the pods are deployed on a node with the required architecture.
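
For example, many charts expose an affinity (or nodeSelector) value that is passed through to the pod spec. A minimal sketch of a values override, assuming the chart forwards affinity as-is:

  # values.yaml, assuming the chart passes `affinity` through to the pod spec
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values:
            - amd64

When the chart exposes a plain nodeSelector instead, setting kubernetes.io/arch: amd64 there achieves the same result.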

Regarding the question about node optimization with Karpenter, there are several factors that could prevent Karpenter from reducing the number of nodes.
In this case, I believe the main issue is that you have enabled spot instances. Currently, Karpenter does not “consolidate” spot instances.
“Spot consolidation” is in Alpha, which is why we haven’t enabled it yet.
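
(For reference, on a self-managed Karpenter installation this alpha behaviour is toggled through a feature gate in the controller's Helm values. A sketch of the upstream option, and not something we enable today:)

  # Karpenter controller chart values: alpha feature gate, off by default
  settings:
    featureGates:
      spotToSpotConsolidation: true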

Let me know if this does not address your concern.

Regards

Hello @Pierre_Gerbelot :wave:

Thanks for your answers, I now understand the current behaviour better :pray:

For information:
I switched my cluster with Karpenter to stop using Spot instances.
The results were worse, so I switched back to Spot instances.

We received feedback from several clients who migrated to Karpenter, reporting instability with some of their services or database containers.
When Karpenter attempted to remove a node to optimize resources, applications with only one replica experienced downtime. Since Karpenter is often more aggressive than traditional cluster auto-scaling, this downtime became problematic.

One solution to address this issue would be to add a Pod Disruption Budget (PDB) or apply annotations to prevent Karpenter from moving these specific pods.
However, this would also prevent Karpenter from reclaiming the node where these pods are running, limiting optimization.
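
For illustration, the two options look like this (a sketch; the names and labels are placeholders):

  # Option 1: a PDB that keeps at least one replica running during voluntary disruptions
  apiVersion: policy/v1
  kind: PodDisruptionBudget
  metadata:
    name: metabase-pdb          # placeholder name
  spec:
    minAvailable: 1
    selector:
      matchLabels:
        app: metabase           # must match your pod labels

  # Option 2: annotate the pod template so Karpenter will not evict the pod
  metadata:
    annotations:
      karpenter.sh/do-not-disrupt: "true"

Note that with a single replica, a PDB with minAvailable: 1 effectively blocks voluntary eviction entirely, which is exactly the optimization limit described above.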

To resolve this, we chose to introduce a new node pool called “stable,” configured with the “WhenEmpty” disruption mode (the node is removed by Karpenter only when no pod is running on it).
The trade-off with this approach is that Karpenter does not optimize resource usage on this node pool. In the future, we plan to introduce a configuration window that will allow us to change the policy to WhenUnderutilized during specific periods.
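
For reference, in the upstream Karpenter API this policy lives in the NodePool's disruption block. A sketch of what the “stable” pool's behaviour corresponds to (not the exact manifest we use):

  apiVersion: karpenter.sh/v1beta1
  kind: NodePool
  metadata:
    name: stable
  spec:
    disruption:
      consolidationPolicy: WhenEmpty   # only reclaim nodes running no pods
      consolidateAfter: 30s            # example value: how long a node must stay empty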

Other clients with a similar setup have observed cost reductions due to the use of spot instances, even though the number of instances could theoretically be optimized further. Is this not the case in your environment?

@Pierre_Gerbelot

Thanks for this complete information :pray:
With eks-node-viewer I can see my resources and costs (per hour and per month). Since the change to Karpenter, I have seen costs that are sometimes similar, sometimes lower, and sometimes higher. But overall (at the end of this month) the budget has slightly increased for the same or slightly lower resource usage.

I can say more about this at the end of October, when we will have used Karpenter for a full month.

I clean up the nodes regularly so costs don't increase, even with Spot instances, because of what you explained earlier.

OK, I see the beginning of an answer to the Karpenter cost increase.
I have a lot of pods stuck in “Terminating” status that never finish.

Now I need to investigate what causes this behaviour. @Pierre_Gerbelot, do you have an idea?
I'm using Airbyte, which creates pods itself with its own kubectl, so that's my first thought.

Hello @Mike,

Are you experiencing any issues with the pods started by Airbyte?

When Karpenter optimizes costs by replacing a node that is running some Airbyte pods, the node is drained, and the pods attempt to terminate. However, Airbyte needs to complete its job before the pod can be destroyed. Therefore, I believe this is the expected behavior of both Airbyte and Karpenter.
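
If you want to check what those pods are waiting on, something like this can help (a sketch; replace the placeholders):

  # list pods stuck in Terminating across all namespaces
  kubectl get pods -A | grep Terminating

  # a deletionTimestamp plus finalizers, or a long grace period, explains the wait
  kubectl get pod <pod-name> -n <namespace> -o yaml | grep -E -A3 "deletionTimestamp|finalizers"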

Regards

Yes, I'm experiencing issues with the pods started by Airbyte.

I can't find any solution to get back to normal behaviour :confused:

You can try preventing Karpenter from disrupting the pods created by Airbyte by adding this annotation:

  karpenter.sh/do-not-disrupt: "true"

I'm not familiar with Airbyte, but I believe you can use the JOB_KUBE_ANNOTATIONS environment variable to set the above annotation.
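
If I remember correctly, that variable takes comma-separated key=value pairs, so it would look something like this (a sketch; where it goes in your values file depends on the chart version, and `worker.extraEnv` is an assumption):

  worker:
    extraEnv:
      - name: JOB_KUBE_ANNOTATIONS
        value: "karpenter.sh/do-not-disrupt=true"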

Regards

Thanks a lot @Pierre_Gerbelot :pray:

But unfortunately, I have already tested this solution with no success :confused:

I could be wrong, but in your Airbyte chart values, the jobs key should be nested under the global key, as shown below:

global:
  storage:
    .....
  jobs:
    kube:
      annotations:
        karpenter.sh/do-not-disrupt: "true"

and not

global:
  storage:
    .....

jobs:
  kube:
    annotations:
      karpenter.sh/do-not-disrupt: "true"

Sorry for the mistake :pray:
I had already tried it with the correct hierarchy.
No success.

It seems quite a bit more stable for now with:

jobs:
  kube:
    annotations:
      karpenter.sh/do-not-disrupt: "true"
  resources:
    limits:
      cpu: 100m

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.