Exploding CloudWatch costs

Hi,

I’ve got a cost explosion problem in AWS, due to CloudWatch, probably caused by something related to qovery-cluster-agent that suddenly started producing a huge quantity of logs.

Here are my CloudWatch dashboards:


We can clearly see that starting from 9 Nov, log processing goes insane.

I have a single cluster with a staging app and about 10 dev apps spawned manually. Nothing special happened on the development side these last few days: no traffic peak, nothing unusual.
I’ve since stopped all my deployments / Qovery environments to try to stop the bleeding, but it doesn’t seem to change anything.

I’ve noticed that one of my nodes was showing weird CPU usage:

which seems to come from promtail and qovery-cluster-agent on that node:

and logs from qovery-cluster-agent are indeed going pretty insane, at DEBUG level (I don’t know why it’s at debug level), and spitting out errors about some forbidden access:

I think it’s this one in particular:

 INFO spawn_pods_watcher{namespace=None}: cluster_agent::service_status: error while waiting for change on pods/job stream for namespace None: failed to perform initial object list: ApiError: jobs.batch is forbidden: User "system:serviceaccount:qovery:qovery-cluster-agent" cannot list resource "jobs" in API group "batch" at the cluster scope: Forbidden (ErrorResponse { status: "Failure", message: "jobs.batch is forbidden: User \"system:serviceaccount:qovery:qovery-cluster-agent\" cannot list resource \"jobs\" in API group \"batch\" at the cluster scope", reason: "Forbidden", code: 403 })
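
For anyone hitting a similar error: the message itself names the missing permission (subject, verb, resource, API group). Here is a minimal Python sketch, not taken from Qovery's tooling, that pulls those fields out of such a message and builds a Kubernetes SubjectAccessReview manifest that could be used to re-check the permission. The function names `parse_forbidden` and `access_review_manifest` are made up for illustration, and the regex assumes the message wording shown above.

```python
import re

# Matches the wording of the Kubernetes "forbidden" message quoted above.
FORBIDDEN_RE = re.compile(
    r'User "(?P<user>[^"]+)" cannot (?P<verb>\w+) resource "(?P<resource>[^"]+)" '
    r'in API group "(?P<group>[^"]*)"'
)


def parse_forbidden(message):
    """Return the user, verb, resource and API group named in the message, or None."""
    match = FORBIDDEN_RE.search(message)
    return match.groupdict() if match else None


def access_review_manifest(info):
    """Build a cluster-scoped SubjectAccessReview manifest (as a plain dict)."""
    return {
        "apiVersion": "authorization.k8s.io/v1",
        "kind": "SubjectAccessReview",
        "spec": {
            "user": info["user"],
            "resourceAttributes": {
                "group": info["group"],
                "resource": info["resource"],
                "verb": info["verb"],
            },
        },
    }
```

Submitting the manifest to the cluster needs API access and sufficient rights, which is deliberately left out of this sketch.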

Nothing changed in my AWS account, policies, or network config in recent days, so I don’t really see what is going on.

Any idea how to get that under control and stop the CloudWatch costs from exploding? (+1000% in the last 2 days)

Hi @Yann ,

I’m looking into it. I think it is because of recent changes we’ve made to prepare for job support. Let me dig into it and get back to you with a fix and a clear answer.

Thanks

Hi @Yann ,

I confirm the issue was on our side. As your logs show, the agent was missing Kubernetes RBAC permissions that it needs for the upcoming “jobs” feature. As a result, it kept retrying the forbidden API request in a loop, which led to the high CPU usage.
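
To illustrate the failure mode described here, this is a minimal Python sketch of retrying a failing watch with exponential backoff instead of in a tight loop. It is not the agent's code (the agent's log lines suggest Rust) and not the fix that was shipped; `Forbidden`, `backoff_delays` and `watch_with_backoff` are hypothetical names, and the delay values are arbitrary assumptions.

```python
import time


class Forbidden(Exception):
    """Hypothetical stand-in for a 403 response from the Kubernetes API."""


def backoff_delays(base=1.0, factor=2.0, cap=300.0):
    """Yield retry delays in seconds: base, base*factor, ... never above cap."""
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, cap)


def watch_with_backoff(start_watch, max_attempts, sleep=time.sleep):
    """Call start_watch() until it succeeds, waiting longer after each failure.

    Returns the list of delays that were slept. Re-raises Forbidden if the
    last allowed attempt still fails.
    """
    slept = []
    delays = backoff_delays()
    for attempt in range(max_attempts):
        try:
            start_watch()
            return slept
        except Forbidden:
            if attempt == max_attempts - 1:
                raise
            delay = next(delays)
            slept.append(delay)
            sleep(delay)
    return slept
```

With a growing delay, a permanently forbidden request is retried (and logged) a few times per hour at most, rather than continuously.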

A fix has been made and tested. A temporary workaround has been deployed onto your cluster, so you don’t have the issue anymore. I will release the final fix in the coming hours, and your cluster will then be updated to the final fixed version.

Thanks for the report, and sorry for the inconvenience

Pierre

Thank you Pierre,

Sorry to have made you work on it on a weekend, but the impact was high for us :sweat_smile:

I was thinking about things that could mitigate this kind of incident, and I have two questions / suggestions related to that:

Do you think qovery-cluster-agent should log at Debug level? It can be a bit dense in those situations.

Could we, as Qovery users, leverage the Grafana instance that is installed by Qovery to visualise more things, or even couple it with an Alertmanager setup to have some configurable monitoring included by default with Qovery?

Hi @Yann,

First of all, I disabled debug logs on the agent and other Qovery applications. It doesn’t impact the cost since application logs are not stored by default.

Here, the errors were logged on the Kubernetes API side because of the missing RBAC permissions, and they are not easy to catch (they are only visible on the CloudWatch side). So the cost comes only from the Kubernetes API logs, not from the application’s logs.
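
For readers who want to see which CloudWatch log group is driving ingestion, here is a rough Python sketch using boto3. It assumes boto3 is installed and AWS credentials are configured; `top_log_groups` and `fetch_ingested_bytes` are illustrative names, and the 3-day window is an arbitrary choice. It sums the `IncomingBytes` metric that CloudWatch Logs publishes per log group in the `AWS/Logs` namespace.

```python
from datetime import datetime, timedelta, timezone


def top_log_groups(ingested_bytes, limit=5):
    """Return the `limit` (name, bytes) pairs with the most ingested bytes, largest first."""
    ranked = sorted(ingested_bytes.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


def fetch_ingested_bytes(days=3):
    """Sum IncomingBytes per log group over the last `days` days (needs AWS access)."""
    import boto3  # imported lazily so top_log_groups works without AWS access

    logs = boto3.client("logs")
    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=days)
    totals = {}
    for page in logs.get_paginator("describe_log_groups").paginate():
        for group in page["logGroups"]:
            name = group["logGroupName"]
            stats = cloudwatch.get_metric_statistics(
                Namespace="AWS/Logs",
                MetricName="IncomingBytes",
                Dimensions=[{"Name": "LogGroupName", "Value": name}],
                StartTime=start,
                EndTime=end,
                Period=86400,  # one datapoint per day
                Statistics=["Sum"],
            )
            totals[name] = sum(point["Sum"] for point in stats["Datapoints"])
    return totals
```

Calling `top_log_groups(fetch_ingested_bytes())` would then list the noisiest log groups; on accounts with many log groups this makes one metrics call per group, so it can take a while.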

Opening the Qovery Grafana instance to users is not something we can do yet, since we have to work on the graphs stack first. This is on our roadmap. For now, there is only log access.

Next year, we will propose a better integration with monitoring. I can’t tell you more about it at the moment, because we have to define what we’ll provide exactly and how.

If you’re interested in alerting, graphs, etc., a quick win is to use Datadog. Also, I recently wrote articles on the logging part. Feel free to look at them if you need to: Perform advanced search in your application logs

Pierre