Since I switched my cluster to Karpenter (with Spot instances enabled), I have been seeing some weird behaviour.
Karpenter sometimes provisions ARM (arm64) nodes, even though my cluster is configured to provide only AMD (amd64) nodes.
I found a post describing exactly the same problem as mine. In my case, Metabase crashes regularly because the official Docker image does not work on ARM.
Could you check whether the Karpenter migration went well, or whether something is wrong?
I also noticed that Karpenter does not optimize resources very well: sometimes I have to delete some Karpenter nodeClaims to get rid of expensive nodes and bring billing back to normal, or to remove the ARM nodes.
Currently, we have configured Karpenter to create nodes from a list of instance categories.
In this list, there are nodes of different architecture types (amd64 and arm64).
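For illustration, here is a minimal sketch of what such a NodePool requirements block can look like (the pool name and instance categories below are assumptions, not your exact configuration):

```yaml
# Hypothetical Karpenter NodePool excerpt (v1beta1 API) -- illustrative values only
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]       # assumed instance categories
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]    # both architectures allowed, so Karpenter
                                        # may pick arm64 when it is cheaper
```

Restricting the `kubernetes.io/arch` values to `["amd64"]` is, in effect, what the fine-grained configuration mentioned below will let you do.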
The “Default node architecture” option allows applications to be constrained to deploy on the defined architecture.
This is done by adding the architecture in the node affinity section:
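For example (a minimal sketch using the well-known `kubernetes.io/arch` node label):

```yaml
# Pod spec excerpt: constrain scheduling to amd64 nodes
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/arch
              operator: In
              values: ["amd64"]
```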
This option does not constrain the architecture of the nodes that Karpenter can create.
However, in the coming weeks, we will introduce more fine-grained configuration options for the nodes that Karpenter can use. This will allow you to limit node creation to a single architecture type if needed.
When a Helm chart only supports one architecture, you can apply the same constraint to ensure the pods are deployed on a node with the required architecture.
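Most charts expose a `nodeSelector` (or `affinity`) value for this; the exact key depends on the chart, so treat this as a sketch:

```yaml
# values.yaml override -- a nodeSelector is the simpler equivalent
# of the affinity rule shown above
nodeSelector:
  kubernetes.io/arch: amd64
```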
Regarding the question about node optimization with Karpenter, there are several factors that could prevent Karpenter from reducing the number of nodes.
In this case, I believe the main issue is that you have enabled spot instances. Currently, Karpenter does not “consolidate” spot instances.
“Spot consolidation” is in Alpha, which is why we haven’t enabled it yet.
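For reference, in upstream Karpenter (v0.34+) spot-to-spot consolidation sits behind an alpha feature gate that can be switched on through the Helm chart values; this is an assumption about the upstream chart, not something we currently set:

```yaml
# Karpenter Helm values excerpt (assumes upstream chart >= v0.34,
# where spot-to-spot consolidation is an alpha feature gate)
settings:
  featureGates:
    spotToSpotConsolidation: true
```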
Let me know if this does not address your concern.
We received feedback from several clients who migrated to Karpenter, reporting instability with some of their services or database containers.
When Karpenter attempted to remove a node to optimize resources, applications with only one replica experienced downtime. Since Karpenter is often more aggressive than traditional cluster auto-scaling, this downtime became problematic.
One solution to address this issue would be to add a Pod Disruption Budget (PDB) or apply annotations to prevent Karpenter from moving these specific pods.
However, this would also prevent Karpenter from reclaiming the node where these pods are running, limiting optimization.
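For reference, a minimal PDB sketch for a single-replica service (the name and labels are hypothetical); the annotation route is `karpenter.sh/do-not-disrupt: "true"` on the pod itself:

```yaml
# Hypothetical PDB for a single-replica deployment labelled app: metabase
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: metabase-pdb
spec:
  minAvailable: 1        # with one replica, this blocks any voluntary eviction
  selector:
    matchLabels:
      app: metabase
```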
To resolve this, we chose to introduce a new node pool called “stable,” configured with the “WhenEmpty” disruption mode (a node is removed by Karpenter only when no pods are running on it).
The trade-off with this approach is that Karpenter does not optimize resource usage on this node pool. In the future, we plan to introduce a configurable time window that will allow the policy to switch to WhenUnderutilized during specific periods.
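A sketch of what the disruption settings of such a pool look like (pool name and delay are illustrative):

```yaml
# "stable" NodePool excerpt -- nodes are reclaimed only once they are empty
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: stable
spec:
  disruption:
    consolidationPolicy: WhenEmpty   # never drain a node that still runs pods
    consolidateAfter: 30s            # illustrative delay before an empty node is removed
```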
Other clients with a similar setup have observed cost reductions due to the use of spot instances, even though the number of instances could theoretically be optimized further. Is this not the case in your environment?
Thanks for this complete information.
With eks-node-viewer I can see my resources and costs (per hour and per month). Since the switch to Karpenter, costs are sometimes similar, sometimes lower, and sometimes higher. But overall (at the end of this month) the budget has slightly increased for the same or slightly lower resource usage.
I will be able to say more at the end of October, once we have used Karpenter for a full month.
I regularly clean up the nodes (by deleting nodeClaims, as mentioned above) to keep costs from increasing, even with Spot instances, because of what you explained earlier.
Now I need to investigate what causes this behaviour. @Pierre_Gerbelot, do you have an idea?
I'm using Airbyte, which creates pods itself with its own kubectl, so that's my first thought.
Are you experiencing any issues with the pods started by Airbyte?
When Karpenter optimizes costs by replacing a node that is running some Airbyte pods, the node is drained, and the pods attempt to terminate. However, Airbyte needs to complete its job before the pod can be destroyed. Therefore, I believe this is the expected behavior of both Airbyte and Karpenter.
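If a given sync must never be interrupted by consolidation, one option is Karpenter's `karpenter.sh/do-not-disrupt` pod annotation (older Karpenter versions used `karpenter.sh/do-not-evict`); a generic sketch on a Job's pod template, with hypothetical names:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: sync-job                # hypothetical stand-in for an Airbyte job pod
spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"   # Karpenter will not voluntarily evict this pod
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox
          command: ["sh", "-c", "echo syncing && sleep 60"]
```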