[KARPENTER] - ARM nodes selected in an AMD-only cluster

Hello :wave:

Since I switched my cluster to Karpenter (with Spot instances enabled), I have been seeing a weird behaviour.

Karpenter sometimes provisions ARM-architecture (arm64) nodes, but my cluster is configured to provide only AMD (amd64) nodes.

I saw a post describing exactly the same problem. In my case, Metabase crashes regularly because the official Docker image does not work on ARM.

Could you check whether the Karpenter migration went well, or whether something is wrong?

I also noticed that Karpenter does not optimize resources very well: sometimes I have to delete some Karpenter nodeClaims to remove expensive nodes and get back to normal billing, or to get rid of the ARM nodes.
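
For reference, this is roughly what I run to clean them up (a sketch; the nodeClaim name is a placeholder):

  # list the nodeClaims Karpenter currently manages
  kubectl get nodeclaims

  # delete one so Karpenter drains the node and reschedules the pods
  kubectl delete nodeclaim <nodeclaim-name>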

Thanks a lot :pray:

As a quick fix, I built a multi-arch Metabase Docker image and specified it in the Helm values.yml.

But the underlying problem persists: Karpenter provisions ARM-architecture nodes even though I configured only the AMD architecture.

Hi @Mike, just a quick message to tell you that someone from our support engineering team will take a look at your thread tomorrow :slight_smile:

Hello @Mike,

Currently, we have configured Karpenter to create nodes from a list of instance categories.
In this list, there are nodes of different architecture types (amd64 and arm64).
The “Default node architecture” option allows applications to be constrained to deploy on the defined architecture.
This is done by adding the architecture in the node affinity section:

  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/arch
          operator: In
          values:
          - amd64

This option does not constrain the architecture of the nodes that Karpenter can create.
However, in the coming weeks, we will introduce more fine-grained configuration options for the nodes that Karpenter can use. This will allow you to limit node creation to a single architecture type if needed.
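
For reference, in the upstream Karpenter API this kind of restriction is expressed as a requirement on the NodePool. A minimal sketch of the upstream resource (not the exact managed configuration we will expose):

  apiVersion: karpenter.sh/v1beta1
  kind: NodePool
  metadata:
    name: default
  spec:
    template:
      spec:
        requirements:
          # every node created by this pool must be amd64
          - key: kubernetes.io/arch
            operator: In
            values: ["amd64"]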

When a Helm chart only supports one architecture, you can apply the same constraint to ensure the pods are deployed on a node with the required architecture.
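
For example, many charts expose an affinity (or nodeSelector) value that is passed through to the pod spec. A minimal sketch of a values override, assuming the chart forwards affinity as-is:

  # values.yaml, assuming the chart passes `affinity` through to the pod spec
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values:
            - amd64

When the chart exposes a plain nodeSelector instead, setting kubernetes.io/arch: amd64 there achieves the same result.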

Regarding the question about node optimization with Karpenter, there are several factors that could prevent Karpenter from reducing the number of nodes.
In this case, I believe the main issue is that you have enabled spot instances. Currently, Karpenter does not “consolidate” spot instances.
“Spot consolidation” is in Alpha, which is why we haven’t enabled it yet.
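
(For reference, on a self-managed Karpenter installation this alpha behaviour is toggled through a feature gate in the controller's Helm values. A sketch of the upstream option, and not something we enable today:)

  # Karpenter controller chart values: alpha feature gate, off by default
  settings:
    featureGates:
      spotToSpotConsolidation: true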

Let me know if this does not address your concern.

Regards

Hello @Pierre_Gerbelot :wave:

Thanks for your answers, I now understand the current behaviour better :pray:

For information:
I switched my cluster with Karpenter to stop using Spot instances.
The results were worse, so I switched back to Spot instances.

We received feedback from several clients who migrated to Karpenter, reporting instability with some of their services or database containers.
When Karpenter attempted to remove a node to optimize resources, applications with only one replica experienced downtime. Since Karpenter is often more aggressive than traditional cluster auto-scaling, this downtime became problematic.

One solution to address this issue would be to add a Pod Disruption Budget (PDB) or apply annotations to prevent Karpenter from moving these specific pods.
However, this would also prevent Karpenter from reclaiming the node where these pods are running, limiting optimization.
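
For illustration, the two options look like this (a sketch; the names and labels are placeholders):

  # Option 1: a PDB that keeps at least one replica running during voluntary disruptions
  apiVersion: policy/v1
  kind: PodDisruptionBudget
  metadata:
    name: metabase-pdb          # placeholder name
  spec:
    minAvailable: 1
    selector:
      matchLabels:
        app: metabase           # must match your pod labels

  # Option 2: annotate the pod template so Karpenter will not evict the pod
  metadata:
    annotations:
      karpenter.sh/do-not-disrupt: "true"

Note that with a single replica, a PDB with minAvailable: 1 effectively blocks voluntary eviction entirely, which is exactly the optimization limit described above.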

To resolve this, we chose to introduce a new node pool called “stable,” configured with the “WhenEmpty” disruption mode (the node is removed by Karpenter only when no pod is running on it).
The trade-off with this approach is that Karpenter does not optimize resource usage on this node pool. In the future, we plan to introduce a configuration window that will allow us to change the policy to WhenUnderutilized during specific periods.
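
For reference, in the upstream Karpenter API this policy lives in the NodePool's disruption block. A sketch of what the “stable” pool's behaviour corresponds to (not the exact manifest we use):

  apiVersion: karpenter.sh/v1beta1
  kind: NodePool
  metadata:
    name: stable
  spec:
    disruption:
      consolidationPolicy: WhenEmpty   # only reclaim nodes running no pods
      consolidateAfter: 30s            # example value: how long a node must stay empty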

Other clients with a similar setup have observed cost reductions due to the use of spot instances, even though the number of instances could theoretically be optimized further. Is this not the case in your environment?

@Pierre_Gerbelot

Thanks for this complete information :pray:
With eks-node-viewer I can see my resources and costs (per hour and per month). Since the change to Karpenter, I have seen costs that are sometimes similar, sometimes lower, and sometimes higher. But overall (at the end of this month) the budget has slightly increased for the same or slightly lower resource usage.

I can say more about this at the end of October, when we will have used Karpenter for a full month.

I clean up the nodes regularly so costs don't increase, even with Spot instances, because of what you explained earlier.

OK, I see the beginning of an answer to the Karpenter cost increase.
I have a lot of pods stuck in “Terminating” status that never finish.

Now I need to investigate what causes this behaviour. @Pierre_Gerbelot, do you have an idea?
I'm using Airbyte, which creates pods itself with its own kubectl, so that's my first thought.

Hello @Mike,

Are you experiencing any issues with the pods started by Airbyte?

When Karpenter optimizes costs by replacing a node that is running some Airbyte pods, the node is drained, and the pods attempt to terminate. However, Airbyte needs to complete its job before the pod can be destroyed. Therefore, I believe this is the expected behavior of both Airbyte and Karpenter.
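
If you want to check what those pods are waiting on, something like this can help (a sketch; replace the placeholders):

  # list pods stuck in Terminating across all namespaces
  kubectl get pods -A | grep Terminating

  # a deletionTimestamp plus finalizers, or a long grace period, explains the wait
  kubectl get pod <pod-name> -n <namespace> -o yaml | grep -E -A3 "deletionTimestamp|finalizers"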

Regards

Yes, I'm experiencing issues with the pods started by Airbyte.

I can't find any solution to get back to normal behaviour :confused:

You can try preventing Karpenter from disrupting the pods created by Airbyte by adding this annotation:

  karpenter.sh/do-not-disrupt: "true"

I'm not familiar with Airbyte, but I believe you can use the JOB_KUBE_ANNOTATIONS environment variable to set the above annotation.
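
If I remember correctly, that variable takes comma-separated key=value pairs, so it would look something like this (a sketch; where it goes in your values file depends on the chart version, and `worker.extraEnv` is an assumption):

  worker:
    extraEnv:
      - name: JOB_KUBE_ANNOTATIONS
        value: "karpenter.sh/do-not-disrupt=true"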

Regards

Thanks a lot @Pierre_Gerbelot :pray:

But unfortunately, I have already tested this solution with no success :confused:

I could be wrong, but in your Airbyte chart values, the jobs key should be nested under the global key, as shown below:

global:
  storage:
    .....
  jobs:
    kube:
      annotations:
        karpenter.sh/do-not-disrupt: "true"

and not

global:
  storage:
    .....

jobs:
  kube:
    annotations:
      karpenter.sh/do-not-disrupt: "true"

Sorry for the mistake :pray:
I had already tried it with the correct hierarchy.
No success.

It seems quite a bit more stable for now with:

jobs:
  kube:
    annotations:
      karpenter.sh/do-not-disrupt: "true"
  resources:
    limits:
      cpu: 100m

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.