Hi,
I’ve got a cost explosion problem in AWS, caused by CloudWatch, probably due to something related to qovery-cluster-agent that suddenly started spawning a huge quantity of logs.
Here are my CloudWatch dashboards:
We can clearly see that starting from 9 Nov, log processing goes insane.
I have a single cluster with a staging app and about 10 dev apps spawned manually. Nothing special happened on the development side over the last few days: no traffic spike, nothing unusual.
I’ve since stopped all my deployments / Qovery environments to try to stop the bleeding, but it doesn’t seem to change anything.
I’ve noticed that one of my nodes was showing weird CPU usage:
which seems to come from promtail and qovery-cluster-agent on that node:
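(For reference, this is roughly how I’m checking per-pod CPU usage; it assumes metrics-server is installed:)

# Show the most CPU-hungry pods across all namespaces (needs metrics-server)
kubectl top pods -A --sort-by=cpu | head -20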
and the logs from qovery-cluster-agent are indeed going pretty insane, at DEBUG level (I don’t know why it’s in debug), and they spit errors about some forbidden access:
I think it’s this one in particular:
INFO spawn_pods_watcher{namespace=None}: cluster_agent::service_status: error while waiting for change on pods/job stream for namespace None: failed to perform initial object list: ApiError: jobs.batch is forbidden: User "system:serviceaccount:qovery:qovery-cluster-agent" cannot list resource "jobs" in API group "batch" at the cluster scope: Forbidden (ErrorResponse { status: "Failure", message: "jobs.batch is forbidden: User \"system:serviceaccount:qovery:qovery-cluster-agent\" cannot list resource \"jobs\" in API group \"batch\" at the cluster scope", reason: "Forbidden", code: 403 })
Nothing changed in my AWS account, IAM policies, or network config in recent days, so I don’t really see what is going on.
Any idea how to get this under control and stop the CloudWatch cost from exploding? (+1000% in the last 2 days)
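In the meantime, to limit the damage on the CloudWatch side, I’m thinking of finding the noisiest log group and shortening its retention, roughly like this (the log group name below is just a placeholder, I haven’t confirmed which one it actually is):

# Find which log groups hold the most data (storedBytes is in bytes)
aws logs describe-log-groups \
  --query 'logGroups[*].[logGroupName,storedBytes]' --output table

# Temporarily cap retention on the noisy group (placeholder name)
aws logs put-retention-policy \
  --log-group-name /my/noisy/log-group \
  --retention-in-days 1

Though as far as I understand, that only limits storage, not the ingestion cost, which seems to be the real problem here.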