Possible issue when deploying the Datadog Cluster Agent on a Karpenter cluster

So, I had been using a k3s cluster with the Datadog integration set up by following this guide, with Datadog chart version 3.66.0 and the following settings:

URL = https://console.qovery.com/organization/0f2a7baf-84e9-4b4e-a219-72fb44811f99/project/1e2ddd37-24f7-4e38-aaf1-250a5987410a/environment/f4e9e073-8bac-4015-af45-a5bbc5ac64ec/application/3148c5df-dc3a-43a4-8bac-843641638648/settings/general

However, I recently created another cluster using Karpenter (with spot instances) and cloned all my services from the old k3s cluster to this new cluster. Everything else is working fine, but I noticed that the Datadog service fails to start with the same settings.

It always ends up with 2 pods up and the remaining 3 stuck in the starting state. After the timeout period, the deployment terminates.

I tried:

  1. increasing the timeout to 20 minutes, with no difference
  2. using the latest version of the Datadog chart, with no difference

The URL of the new Datadog app is https://console.qovery.com/organization/0f2a7baf-84e9-4b4e-a219-72fb44811f99/project/1e2ddd37-24f7-4e38-aaf1-250a5987410a/environment/e6c22b6a-6363-45da-b310-a61bdb04969e/application/4a17d43f-1e4a-44bb-a45c-ed8db3648ad8/general

Is this something with Karpenter, or am I doing something wrong?

My overridden YAML values for the Datadog Helm chart are as follows:

# The following YAML contains the minimum configuration required to deploy the Datadog Agent
# on your cluster. Update it accordingly to your needs
datadog:
  # here we use a Qovery secret to retrieve the Datadog API Key (See next step)
  apiKey: qovery.env.DD_API_KEY
  # Update the site depending on where you want to store your data in Datadog
  site: datadoghq.eu
  # Update the cluster name with the name of your choice
  clusterName: fere-all
  logs:
    enabled: true
    containerCollectAll: true

If it helps, my log file is at dpaste: 5TFZA4Q6M

Hello @0xbitmonk,

We have documentation on how to integrate Datadog with Qovery here.

In section 2, "Create the Datadog service within Qovery", you can see that we have some specific configuration for Karpenter:

# The following YAML contains the minimum configuration required to deploy the Datadog Agent
# on your cluster. Update it accordingly to your needs
datadog:
  # here we use a Qovery secret to retrieve the Datadog API Key (See next step)
  apiKey: qovery.env.DD_API_KEY
  # Update the site depending on where you want to store your data in Datadog
  site: datadoghq.eu
  # Update the cluster name with the name of your choice
  clusterName: qoverycluster
agents:
  tolerations:
    - operator: Exists
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
            - key: eks.amazonaws.com/compute-type
              operator: NotIn
              values:
                - fargate

Can you try overriding your values file with this configuration and let me know if it works?

Regards,
Charles-Edouard

Ah, I see. My apologies. I missed that one.

After your response, I updated my values to the following:

# The following YAML contains the minimum configuration required to deploy the Datadog Agent
# on your cluster. Update it accordingly to your needs
datadog:
  # here we use a Qovery secret to retrieve the Datadog API Key (See next step)
  apiKey: qovery.env.DD_API_KEY
  # Update the site depending on where you want to store your data in Datadog
  site: datadoghq.eu
  # Update the cluster name with the name of your choice
  clusterName: fere-all
  logs:
    enabled: true
    containerCollectAll: true
agents:
  tolerations:
    - operator: Exists
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
            - key: eks.amazonaws.com/compute-type
              operator: NotIn
              values:
                - fargate

So, now I see 15 pods running. Of those, 13 are fine, and 2 have been in "Starting" for almost 40 minutes now.

Yes, my bad, I forgot that you also need to add a Priority Class.

You can check our documentation here.

There is a known bug with Karpenter where it will not schedule a DaemonSet's pod on a node that is already full.

To prevent this, you need to deploy a Priority Class and reference it when you deploy your DaemonSet.

Let me know if you need help doing this.
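For illustration, here is a minimal sketch of what this could look like. The class name `datadog-agent-priority` and the priority value are example choices, not required values:

```yaml
# priority-class.yaml -- apply with: kubectl apply -f priority-class.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: datadog-agent-priority   # example name, pick your own
value: 1000000                   # high value so agent pods can preempt lower-priority pods on full nodes
globalDefault: false
description: "High priority for the Datadog agent DaemonSet on Karpenter nodes"
```

You would then reference it from the Helm values override, assuming the chart's `agents.priorityClassName` setting:

```yaml
agents:
  priorityClassName: datadog-agent-priority
```

With this in place, the scheduler can evict lower-priority pods from a full node to make room for the agent DaemonSet pod instead of leaving it stuck in a pending/starting state.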

Additional note: we have had some feedback about Datadog cost issues when using Karpenter. Karpenter will spawn multiple nodes for you, and Datadog bills customers based on the number of nodes.

Regards,
Charles-Edouard

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.