Exploding Cloudwatch costs

Yann · November 10, 2022, 11:10pm

Hi,

I’ve got a cost explosion problem in AWS due to CloudWatch, probably related to qovery-cluster-agent suddenly producing a huge quantity of logs.

Here are my CloudWatch dashboards:

We can clearly see that starting from 9 Nov, log processing goes insane.

I have a single cluster with a staging app and about 10 dev apps spawned manually. Nothing special happened in the last few days on the development side: no traffic peak, nothing unusual.
I’ve since stopped all my deployments / Qovery environments to try to stop the bleeding, but it doesn’t seem to change anything.

I’ve noticed that one of my nodes was showing weird CPU usage:

which seems to come from promtail and qovery-cluster-agent on that node:

and logs from qovery-cluster-agent are indeed going pretty insane, at DEBUG level (I don’t know why it’s at debug level), and spitting out errors about forbidden access:

I think it’s this one in particular:

INFO spawn_pods_watcher{namespace=None}: cluster_agent::service_status: error while waiting for change on pods/job stream for namespace None: failed to perform initial object list: ApiError: jobs.batch is forbidden: User "system:serviceaccount:qovery:qovery-cluster-agent" cannot list resource "jobs" in API group "batch" at the cluster scope: Forbidden (ErrorResponse { status: "Failure", message: "jobs.batch is forbidden: User \"system:serviceaccount:qovery:qovery-cluster-agent\" cannot list resource \"jobs\" in API group \"batch\" at the cluster scope", reason: "Forbidden", code: 403 })

Nothing has changed in my AWS account, policies, or network config in recent days, so I don’t really see what is going on.

Any idea how to get this under control and stop the CloudWatch costs from exploding? (+1000% in the last 2 days)

Pierre_Mavro · November 12, 2022, 8:17am

Hi @Yann ,

I’m looking into it. I think it is because of recent changes we’ve made to prepare the job support. Let me dig into it and get back to you with a fix and a clear answer.

Thanks

Pierre_Mavro · November 12, 2022, 8:59am

Hi @Yann ,

I confirm the issue was on our side. As your logs show, the agent was missing Kubernetes RBAC permissions that it needs for the upcoming “jobs” feature. As a result, a loop kept sending the forbidden request to the API, which led to the high CPU usage.
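To illustrate the failure mode: a watcher that retries a rejected request immediately hammers the API, and every rejected call can end up in the API-side logs. The sketch below is not the agent's actual code (which is not shown in this thread); it is a minimal Python illustration of capped exponential backoff, one common way to bound the request rate while a permission is missing. The names `backoff_delays`, `watch_with_backoff`, and `list_objects` are hypothetical.

```python
def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=6):
    """Return capped exponential delays in seconds, one per attempt."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def watch_with_backoff(list_objects, sleep, max_attempts=6):
    """Call list_objects until it succeeds, waiting longer after each failure.

    list_objects: hypothetical callable standing in for the initial
                  "list" request; it raises PermissionError on a 403.
    sleep:        callable taking a delay in seconds (e.g. time.sleep).
    """
    delays = backoff_delays(attempts=max_attempts)
    for attempt, delay in enumerate(delays, start=1):
        try:
            return list_objects()
        except PermissionError as err:
            # Wait before retrying instead of looping at full speed.
            print(f"attempt {attempt} failed: {err}; retrying in {delay:.0f}s")
            sleep(delay)
    raise RuntimeError("giving up after repeated permission errors")
```

With the default parameters, six failed attempts are spread over about a minute rather than being sent back to back. A real fix still has to grant the missing permission; backoff only limits the damage while the request keeps failing.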

A fix has been made and tested. A temporary workaround has been deployed onto your cluster, so you don’t have the issue anymore. I will release the final fix in the coming hours, and your cluster will be updated with that version as well.

Thanks for the report, and sorry for the inconvenience

Pierre

Yann · November 14, 2022, 5:32am

Thank you Pierre,

Sorry to have made you work on it on a weekend, but impact was high for us

I was thinking about things that could mitigate this kind of incident, and I have two questions / suggestions related to that:

Do you think qovery-cluster-agent should log at Debug level? It can be a bit dense in these situations.

Could we, as Qovery users, leverage the Grafana instance that is installed by Qovery to visualise more things, or even couple it to some Alertmanager setup to have configurable monitoring included by default with Qovery?

Pierre_Mavro · November 18, 2022, 2:36pm

Hi @Yann,

First of all, I disabled debug logs on the agent and other Qovery applications. It doesn’t impact the cost since application logs are not stored by default.

In this case, the errors were logged on the Kubernetes API side because of the missing RBAC permissions, and they are not easy to catch since they are only visible on the CloudWatch side. So the cost comes only from the Kubernetes API logs, not from the application’s logs.

Access to the Qovery Grafana instance is not something we can open right now, since we have to work on the graphs stack first. This is on our roadmap. For now, there is only log access.

We will offer a better monitoring integration next year. I can’t tell you more about it at the moment, because we have to define what exactly we’ll provide and how.

If you’re interested in alerting, graphs, etc., a quick win is to use Datadog. Also, I recently wrote articles on the logging part. Feel free to look at them if you need to: Perform advanced search in your application logs

Pierre

kincorvia · August 5, 2023, 4:41pm

I keep finding this thread for some reason and it really scares me that a mistake by Qovery could produce crippling costs to a smaller company. I wonder what strategies can be used to prevent this.
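One possible mitigation is to alert on daily spend so a spike like the one above is caught on day one rather than on the bill. AWS offers managed services for this (AWS Budgets and Cost Anomaly Detection); the sketch below only illustrates the underlying idea in plain Python. The function name `find_cost_spikes`, the window, the threshold, and the cost figures are all hypothetical, not data from this thread.

```python
def find_cost_spikes(daily_costs, window=3, threshold=3.0):
    """Return indexes of days whose cost exceeds threshold x the trailing average.

    daily_costs: list of daily cost figures, oldest first.
    window:      number of preceding days used as the baseline.
    threshold:   multiplier over the baseline that counts as a spike.
    """
    spikes = []
    for i in range(window, len(daily_costs)):
        baseline = sum(daily_costs[i - window:i]) / window
        if baseline > 0 and daily_costs[i] > threshold * baseline:
            spikes.append(i)
    return spikes


# Hypothetical daily costs in USD: stable, then a sudden jump.
costs = [2.0, 2.1, 1.9, 2.0, 22.0, 24.0]
print(find_cost_spikes(costs))
```

Only the first day of the jump is flagged here, because the following day is compared against a baseline that already includes the spike. For alerting, catching the first day is what matters.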
