However, I recently created another cluster using Karpenter (with spot instances) and I cloned all my services from the old k3s cluster to this new cluster. Everything else is working fine, but I noticed that the Datadog service fails to start with the same settings.
It always ends up with 2 pods up and the remaining 3 stuck in the starting state. After the timeout period, the deployment is terminated.
Is this something with Karpenter, or am I doing something wrong?
My overridden YAML values for the Datadog Helm chart are as follows:
# The following YAML contains the minimum configuration required to deploy the Datadog Agent
# on your cluster. Update it according to your needs
datadog:
  # here we use a Qovery secret to retrieve the Datadog API Key (See next step)
  apiKey: qovery.env.DD_API_KEY
  # Update the site depending on where you want to store your data in Datadog
  site: datadoghq.eu
  # Update the cluster name with the name of your choice
  clusterName: fere-all
  logs:
    enabled: true
    containerCollectAll: true
We have documentation on how to integrate Datadog with Qovery here.
In section 2, Create the Datadog service within Qovery, you can see that we have some specific configuration for Karpenter:
# The following YAML contains the minimum configuration required to deploy the Datadog Agent
# on your cluster. Update it according to your needs
datadog:
  # here we use a Qovery secret to retrieve the Datadog API Key (See next step)
  apiKey: qovery.env.DD_API_KEY
  # Update the site depending on where you want to store your data in Datadog
  site: datadoghq.eu
  # Update the cluster name with the name of your choice
  clusterName: qoverycluster
agents:
  tolerations:
    - operator: Exists
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: eks.amazonaws.com/compute-type
                operator: NotIn
                values:
                  - fargate
Can you try overriding your values file with this configuration and let me know if it works?
# The following YAML contains the minimum configuration required to deploy the Datadog Agent
# on your cluster. Update it according to your needs
datadog:
  # here we use a Qovery secret to retrieve the Datadog API Key (See next step)
  apiKey: qovery.env.DD_API_KEY
  # Update the site depending on where you want to store your data in Datadog
  site: datadoghq.eu
  # Update the cluster name with the name of your choice
  clusterName: fere-all
  logs:
    enabled: true
    containerCollectAll: true
agents:
  tolerations:
    - operator: Exists
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: eks.amazonaws.com/compute-type
                operator: NotIn
                values:
                  - fargate
So now I see 15 pods running. 13 of them are fine, but 2 have been in the “Starting” state for almost 40 minutes now.
There is a known bug with Karpenter where it will not create DaemonSet pods on a node that is already full.
To prevent it, you need to deploy a PriorityClass and reference that PriorityClass when you deploy your DaemonSet, as in the sketch below.
Let me know if you need help doing this.
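For reference, here is a minimal sketch of what that could look like. The PriorityClass name (datadog-agent-priority) and the priority value are just examples, and the agents.priorityClassName key assumes your version of the Datadog Helm chart exposes it, so check the chart's values for your version:

# PriorityClass that gives the Datadog Agent DaemonSet pods a higher priority
# than regular workloads, so the scheduler can make room for them on nodes
# that Karpenter has already filled
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: datadog-agent-priority   # example name
value: 1000000                   # higher number = higher scheduling priority
globalDefault: false             # only pods referencing this class are affected
description: "Priority class for the Datadog Agent DaemonSet"

Then reference it in your Helm values override, alongside the tolerations and affinity shown above:

agents:
  # assumes the chart supports agents.priorityClassName; the Agent DaemonSet
  # pods will then be scheduled with the priority defined above
  priorityClassName: datadog-agent-priority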
Additional note: we have had some feedback about problems with Datadog costs when using Karpenter. Karpenter will spawn multiple nodes for you, and Datadog bills customers based on the number of nodes.