We ran into a problem this weekend: one application went into an ImagePullBackOff error and stopped.
The application had been running smoothly for months and we hadn’t deployed anything since then, so I guess it must have been a maintenance-related deployment.
I didn’t find any clue in the application logs or in New Relic. It’s a simple nginx server serving a frontend app.
A manual redeploy solved the problem.
We had encountered this error many times before, but it’s the first time we’ve hit it since going to production.
I’m wondering whether the problem comes from the mirror registry (Scaleway) or from the “source” container registry on GitLab.
I think it was a temporary problem, because everything came back to normal with nothing more than a redeploy.
This error happens when the image is no longer available in the container registry.
When an application is deployed, Qovery mirrors the image into your cluster’s container registry: you can see the line 🪞 Mirroring image to private cluster registry to ensure reproducibility in the deployment logs.
Out of curiosity, did you delete any of your cluster registry images? (The repository naming pattern is qovery-mirror-${container_id}.)
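Also, if it helps with the investigation, the exact image reference the pod failed to pull (including the qovery-mirror-* repository) is visible in the pod events. A minimal sketch, assuming you have kubectl access to the cluster (the pod name and namespace are placeholders):

```bash
# List the pods of the affected application and spot the one in ImagePullBackOff
kubectl get pods -n <app-namespace>

# The Events section at the bottom shows the exact image reference
# (the qovery-mirror-* repository path) and the error returned by the registry
kubectl describe pod <failing-pod-name> -n <app-namespace>
```

The pull error itself (not found, timeout, authentication) usually tells you whether the image was missing or the registry was simply unreachable at that moment.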
Our cloud provider is Scaleway; is it possible that Scaleway applies a retention policy and that images are automatically deleted after a period of time?
I didn’t delete any image in the mirror container registries.
I never modify qovery-* elements in Scaleway.
In the container registry, I can see 16 qovery-mirror-* namespaces: 14 with 1 image and 2 with 0 images. Some images have been there for 4 months, but the namespaces with 0 images were created 2 months ago.
So the retention policy doesn’t seem a good explanation.
Thank you. The network connection was my first guess.
According to the information I’ve found, k8s tries to pull the image 5 times before going into an error state.
Is it possible to increase this parameter, and/or to configure an exponential back-off delay in k8s/Qovery?
Another question: is it possible to find out what triggered the deployment? I’d like to better understand what happened so we can make the system more resilient. As we didn’t trigger a deployment ourselves, I was surprised to see a “deployment” error rather than an application error.
I’m going to look at what kubectl has to offer. As a k8s beginner, I’d be interested if you already know a command to check this.
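For reference, here is roughly what I plan to start with (a sketch; the namespace and deployment name are placeholders I still need to fill in):

```bash
# Cluster-wide events, most recent last: image pulls, back-offs,
# pod (re)scheduling and node lifecycle events
kubectl get events -A --sort-by=.lastTimestamp

# Rollout history of the application's Deployment
kubectl rollout history deployment/<app-deployment> -n <app-namespace>
```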
I just realized I’d forgotten about your useful Audit logs view; now I know there was probably a maintenance task on June 26th that required redeploying the apps.
According to the information I’ve found, k8s tries to pull the image 5 times before going into an error state.
Is it possible to increase this parameter, and/or to configure an exponential back-off delay in k8s/Qovery?
According to the docs, the back-off period doesn’t seem to be customizable in k8s (it’s already an exponential delay, capped at 5 minutes).
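You can’t change the schedule, but you can observe it from the pod events; a quick sketch, assuming kubectl access (the namespace and pod name are placeholders):

```bash
# Back-off events for the failing pod; the timestamps show the delay
# roughly doubling between retries, up to the 5-minute cap
kubectl get events -n <app-namespace> \
  --field-selector involvedObject.name=<pod-name>,reason=BackOff \
  --sort-by=.lastTimestamp
```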
As we didn’t trigger a deployment ourselves, I was surprised to see a “deployment” error rather than an application error.
What do you mean by a “deployment” error? I guess you may have seen your service marked as “Error” in the “service status” column on your environment page?
I just realized I’d forgotten about your useful Audit logs view; now I know there was probably a maintenance task on June 26th that required redeploying the apps.
Yes, we regularly update the clusters to upgrade the components Qovery installs (qovery-agent, charts). This cluster deployment should have no impact on your applications: the ImagePullBackOff error may have happened if your service was moved to another node and the pod couldn’t be started there because the image pull failed.
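If you want to confirm this scenario, one simple check is to compare the node ages with the time of the incident; a sketch (the namespace is a placeholder):

```bash
# If the node that hosted your pod was replaced during the maintenance,
# its AGE will be younger than the others
kubectl get nodes -o wide

# Shows which node each pod of the application is currently running on
kubectl get pods -n <app-namespace> -o wide
```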
Thank you for your answers, I have a better understanding of the subject.
When I wrote “deployment” error, it was because I assumed the image would only be pulled from the container registry before a deployment, whatever the trigger.
We’re going to improve our alerts so that we can react more quickly in the event of a problem.
To limit costs we set the number of instances to 1 for less critical services; would increasing to a minimum of 2 instances (as you recommend in production) have avoided this kind of issue?
Yes, you’re right, as you would have 2 pods running, each on a different node.
This would of course reduce the risk but not totally remove it: if the 2 nodes where your pods are running are shut down AND there is at the same time an issue with the Scaleway registry like there was on Saturday, you would end up with the same problem. (This is a rare edge case, but just so you know it “can” happen.)
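If you do raise the minimum to 2 instances, you can quickly check that the replicas are spread across different nodes; a sketch, where the label selector is only an assumption about how your pods are labelled:

```bash
# The NODE column should show two different nodes for the two replicas
kubectl get pods -n <app-namespace> -l app=<app-label> -o wide
```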
Thank you again, now I’m able to visualise how it works.
We’ll try to increase the number of instances on a few more services.
And now that we’ve improved our alerts in New Relic, we’ll be more reactive in restoring services if a rare edge case happens again.