Deployment failed: HELM UPGRADE FAILED

Hi there,
Since yesterday, we have been getting a Helm error (UPGRADE FAILED) every time we launch a deployment.
Here’s the full message:

Helm command `UPGRADE` for release application-z31330131-z31330131 terminated with an error: CommandError { full_details: Some("Error: UPGRADE FAILED: create: failed to create: rpc error: code = Unknown desc = database is locked
create: failed to create
helm.sh/helm/v3/pkg/storage/driver.(*Secrets).Create
    helm.sh/helm/v3/pkg/storage/driver/secrets.go:164
helm.sh/helm/v3/pkg/storage.(*Storage).Create
    helm.sh/helm/v3/pkg/storage/storage.go:69
helm.sh/helm/v3/pkg/action.(*Upgrade).performUpgrade
    helm.sh/helm/v3/pkg/action/upgrade.go:320
helm.sh/helm/v3/pkg/action.(*Upgrade).RunWithContext
    helm.sh/helm/v3/pkg/action/upgrade.go:148
main.newUpgradeCmd.func2
    helm.sh/helm/v3/cmd/helm/upgrade.go:200
github.com/spf13/cobra.(*Command).execute
    github.com/spf13/cobra@v1.2.1/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
    github.com/spf13/cobra@v1.2.1/command.go:974
github.com/spf13/cobra.(*Command).Execute
    github.com/spf13/cobra@v1.2.1/command.go:902
main.main
    helm.sh/helm/v3/cmd/helm/helm.go:87
runtime.main
    runtime/proc.go:225
runtime.goexit
    runtime/asm_amd64.s:1371
UPGRADE FAILED
main.newUpgradeCmd.func2
    helm.sh/helm/v3/cmd/helm/upgrade.go:202
github.com/spf13/cobra.(*Command).execute
    github.com/spf13/cobra@v1.2.1/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
    github.com/spf13/cobra@v1.2.1/command.go:974
github.com/spf13/cobra.(*Command).Execute
    github.com/spf13/cobra@v1.2.1/command.go:902
main.main
    helm.sh/helm/v3/cmd/helm/helm.go:87
runtime.main
    runtime/proc.go:225
runtime.goexit
    runtime/asm_amd64.s:1371: Command terminated with a non success exit status code: exit status: 1"), message_safe: "Helm upgrade error" }

Console: https://console.qovery.com/platform/organization/659a7215-944e-48b0-8128-4e250a030c8a/projects/08a4cc2f-95cc-46d0-86c5-7031ab5417d3/environments/496b0aad-490d-465e-99b4-3cec1b4581b2/applications?fullscreenLogs=true

Note: the error appeared after a storage capacity upgrade from 20 GB to 50 GB (and after a number of strange application deployments with 700 or 900 pods, which caused those deployments to fail; I don’t know whether that is related to the Helm issue above).

Thanks for your help.

Hello @RSZ-Raphael,

Having a look.

This cluster has more than 2,500 pods, and 95% of them are either Evicted or UnknownStatus. I am doing some cleanup and trying to understand what happened; I will let you know ASAP.
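For readers who want to check for the same symptom on their own cluster, here is a minimal sketch that tallies pods by state. It is not the procedure used in this thread, which was not detailed. It assumes the output of `kubectl get pods --all-namespaces -o json` has already been parsed into a dict, and that evicted pods carry `status.reason` set to `Evicted`. The function names `summarize_pods` and `evicted_pods` are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def summarize_pods(pod_list: dict) -> Counter:
    """Count pods by status.reason when it is set (e.g. "Evicted"),
    otherwise by status.phase (e.g. "Running", "Pending", "Unknown")."""
    counts: Counter = Counter()
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        # Evicted pods normally have phase "Failed" and reason "Evicted",
        # so the reason is the more informative label when present.
        label = status.get("reason") or status.get("phase", "Unknown")
        counts[label] += 1
    return counts


def evicted_pods(pod_list: dict) -> List[Tuple[str, str]]:
    """Return (namespace, name) for every pod whose status.reason is "Evicted"."""
    return [
        (pod["metadata"]["namespace"], pod["metadata"]["name"])
        for pod in pod_list.get("items", [])
        if pod.get("status", {}).get("reason") == "Evicted"
    ]
```

The parsed JSON can come from `json.load` on the saved `kubectl` output. The returned (namespace, name) pairs are meant to be reviewed before anything is deleted.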

OK, I cleaned everything up, updated the cluster, and triggered a redeploy of your app. All good now.

As for the root cause, I am not 100% certain yet, but from what I saw, your cluster node was already slightly over capacity, so some pods started to be evicted and nothing new could be deployed.

Please let me know if you need further help.

Cheers

Thanks @bchastanier, but before closing this issue, do you have any explanation of what led to this situation? We also noticed that a “deployment error” tag is still displayed in the UI.

Sorry, yes, I added more details about the root cause in my previous comment.
Regarding the “deployment error” tag that is still displayed, let me check.

Also, did you increase the disk size because it was full? That might have been triggering all those evictions and reboots as well.
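If a full disk is suspected, the node conditions are one place to look, since the kubelet reports `DiskPressure` when a node is running low on disk and may evict pods in response. The following is a minimal sketch, not taken from the investigation in this thread. It assumes the output of `kubectl get nodes -o json` has been parsed into a dict, and the function name `nodes_under_pressure` is hypothetical.

```python
from typing import Dict, List

# Standard Kubernetes node condition types that signal resource pressure.
PRESSURE_CONDITIONS = ("DiskPressure", "MemoryPressure", "PIDPressure")


def nodes_under_pressure(node_list: dict) -> Dict[str, List[str]]:
    """Map each node name to the pressure conditions currently reported as "True".

    Nodes with no active pressure condition are omitted from the result.
    """
    result: Dict[str, List[str]] = {}
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        active = [
            cond["type"]
            for cond in conditions
            # Condition status is a string: "True", "False", or "Unknown".
            if cond.get("type") in PRESSURE_CONDITIONS and cond.get("status") == "True"
        ]
        if active:
            result[node["metadata"]["name"]] = active
    return result
```

An empty result only means that no pressure condition is active at the time of the query. It does not rule out pressure that occurred earlier.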

Regarding the “deployment error” tag that is still displayed, it seems that one of your DNS entries failed to resolve, but this should not be blocking.

The deployment has been restarted since then, and the log stating that has disappeared (EC2 logs are not persisted), so I cannot link it for you. I am waiting to see whether this run fails again.

From time to time, if you deploy apps one by one in an environment, the global environment status can be wrong in very specific cases. The solution is to redeploy the whole environment, which should fix the issue.

Thank you @bchastanier! All’s working fine now.
Best.
Raphaël.