Error while waiting for change on pods/job/certificate

We recently deployed a BYOK cluster which suddenly stopped working. Cloudflare is returning "Web server is returning an unknown error" (Error code 520) and we see a lot of logs like this:

INFO spawn_services_watcher{namespace=Some("z5cae5e43-staging2")}: cluster_agent::service_status::watcher: error while waiting for change on pods/job/certificate stream for namespace Some("z5cae5e43-staging2"): failed to perform initial object list: ApiError: "404 page not found\n": Failed to parse error data (ErrorResponse { status: "404 Not Found", message: "\"404 page not found\\n\"", reason: "Failed to parse error data", code: 404 })

The cluster is deployed without cert-manager and uses a custom domain via Cloudflare. Could you explain what's happening?

Hi, can you please be more precise about what "suddenly stopped working" means? Thanks

We deployed this cluster on Monday and cloned our environments from an old managed cluster. It worked fine until today. I'm guessing it could be related to some undocumented dependency on cert-manager and the initial certificates expiring. Like I mentioned before, we deployed the cluster without cert-manager. It's also using a custom domain via Cloudflare.

Hey !

Actually, for BYOK, cert-manager is required if you want to have TLS (see the documentation here).

In your case, you should install cert-manager so that your certificates are properly generated.

If your applications are still not reachable after cert-manager is installed and configured, you probably have other issues.

Please note that while BYOK is more flexible, it is better suited to advanced Kubernetes users who can install and manage it autonomously. If you need help with your setup, you can have a look at our support plans.

Cheers

@bchastanier what if we don't need TLS? A self-signed nginx certificate would be sufficient for us. We didn't check the "Generate certificate" box in Qovery, so I don't understand where the TLS/cert-manager requirement comes from.

Hey,

I am not sure I fully grasp the issue here. Would you mind sharing the broken app URL (Qovery console link) so I can have a look?

So far, here is what I did:

1- Checked the DNS of the web app and it seems to have nothing configured behind it:

❯ dig CNAME app.t***.dev +short
// Nothing

Doing the same thing on the A record returns something:

❯ dig A app.t***.dev +short
188.114.xx.x
188.114.xx.x

Those IPs seem to point directly to the AWS LB.

:warning: We should find a CNAME instead of an A record. It may be normal depending on your Cloudflare configuration. Can you please give more details on this setup?

2- The certificate for this domain was generated on June 24th, which is before your switch to the BYOK cluster.

As for the errors triggered by the Qovery agent, there should already be a self-signed certificate on it; we shipped a patch about 2 weeks ago. Can you make sure you update to the latest chart version?
The Qovery agent only does status reporting and has nothing to do with services being unreachable.
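For reference, here is a minimal sketch of what the chart update could look like (assuming the release is named qovery, lives in the qovery namespace, was installed from the Qovery Helm repository, and your overrides are kept in a local values.yaml; adjust names and paths to your setup):

❯ helm repo update
❯ helm upgrade qovery qovery/qovery -n qovery -f values.yaml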

Anything else that can help me understand the issue/context is welcome, so I can help you better.

Thanks

Hi @bchastanier,

It appears you were right and the Qovery error was not related to services being unreachable. We resolved that on our side.

Let's continue with fixing this Qovery Cluster Agent error though:

Error while waiting for change on pods/job/certificate

It's flooding our logs (a line almost every 1 ms) and has already generated >100 GB of logs. I think you should look into adding some sort of back-off logic to the retries happening here.

Do you think it could be related to an old chart version? I will double-check that we are using the latest chart, but it seems all chart updates are being pushed as v1.0.0, which makes it very difficult to track changes.

That's something we will be working on in the future, but if you can try updating the chart from the latest source, it's likely to solve the issue. Please let me know so I can investigate further if needed.
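In the meantime, a quick way to check what is actually deployed (assuming the release lives in the qovery namespace): even though the chart version string stays at v1.0.0, the revision number and last-updated timestamp shown by Helm tell you whether the upgrade actually went through.

❯ helm list -n qovery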

Cheers

I think I'm on the latest version but the issue still persists. I pulled the latest values for clusterAgentVersion and engineVersion from the Qovery UI. What additional info can I provide? The logs keep printing the same 2 lines for different namespaces:

2024-07-19T13:03:32.296779Z  INFO spawn_services_watcher{namespace=None}: cluster_agent::service_status::watcher: error while waiting for change on pods/job/certificate stream for namespace None: failed to perform initial object list: ApiError: "404 page not found\n": Failed to parse error data (ErrorResponse { status: "404 Not Found", message: "\"404 page not found\\n\"", reason: "Failed to parse error data", code: 404 })
2024-07-19T13:03:32.298664Z  WARN spawn_services_watcher{namespace=Some("zbe75eee0-pr-2252td-766-fix-right-to-left-emails")}: kube_client::client: Unsuccessful data error parse: 404 page not found

We are working on a fix and will push it ASAP. I will let you know once it has landed.

Hey @prki,

We just released a fix for this one: the agent no longer emits those logs if cert-manager is not installed.

Can you update the Qovery chart? (The Qovery agent image version should be b97bc6496687fb0efaf7424686405582070039e7 once updated.)

public.ecr.aws/r3m4q3r9/cluster-agent:b97bc6496687fb0efaf7424686405582070039e7
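If you want to double-check which image the running agent uses, something like the following should work (assuming the agent runs as a Deployment named qovery-cluster-agent in the qovery namespace; names may differ in your install):

❯ kubectl get deployment qovery-cluster-agent -n qovery -o jsonpath='{.spec.template.spec.containers[0].image}'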

Thanks

We have updated the chart and will be validating the fix soon. We noticed this warning:

2024-07-19T15:56:22.861118Z WARN spawn_services_watcher{namespace=None}: kube_client::client: Unsuccessful data error parse: 404 page not found

Is this expected?
