Can't deploy on production cluster


I’m struggling to deploy apps on our production cluster. Whenever I try to deploy an existing image version, I get the following:

🪞 Mirroring image to private cluster registry to ensure reproducibility
🪞 Retrying Mirroring image due to error...
🪞 Retrying Mirroring image due to error...
🪞 Retrying Mirroring image due to error...
🪞 Retrying Mirroring image due to error...
❌ Failed to mirror image <namespace>/api:<tag> due to Docker terminated with a non success exit status code: ExitStatus(unix_wait_status(256))

Organization ID: 558f6260-d61b-47bc-bb08-f28d2d655ce9
Project ID: 70edc842-a910-4f27-9411-e3e786d39c0c

Cloud provider: Scaleway.

Hey @Sryther,

I am looking into it. In the meantime, can you double check that the credentials set for the production registry 2 months ago are still working?

Trying to list repositories using this registry gives me a permission issue.

Can you try creating a new container registry (e.g. "Scaleway Registry - Production project New") in the organization, with new credentials that work, and use that repository instead?

Let me know how it goes,

Cheers

Hello @bchastanier
Thanks for your quick answer. As far as I know, the credentials have not changed. I set them again, and the error is now different, but it looks like it comes from us.
Thanks!

EDIT: maybe the error should be described in a better way than an ExitStatus. If the credentials were wrong, I remember Qovery used to tell me so.

Hello!

Indeed, checks are done properly when the container is first set up (and on edit, I guess), but later on the error gets fuzzy.
I'm looping in @Alessandro_carrano to see what can be done on that front to improve the product. Thanks for the feedback :slight_smile:

Looking quickly at the deploy, there seems to be an issue with the Job:
ValueError: invalid literal for int() with base 10: '1 RESET_PASSWORD_EMAIL_KEY_EXPIRE_DAYS'
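
This error is Python's int() receiving a string that isn't just a number. A minimal repro sketch, assuming the value comes from an environment variable whose content accidentally ran into the name of the next variable (SOME_EXPIRY_DAYS below is only a placeholder, not a variable from your app):

export SOME_EXPIRY_DAYS="1 RESET_PASSWORD_EMAIL_KEY_EXPIRE_DAYS"   # two entries merged into one value
python3 -c 'import os; print(int(os.environ["SOME_EXPIRY_DAYS"]))'
# ValueError: invalid literal for int() with base 10: '1 RESET_PASSWORD_EMAIL_KEY_EXPIRE_DAYS'

So it's probably worth checking how the environment variables of that Job are defined; a missing newline or quote would produce exactly this.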

Let me know if you need further assistance.

Cheers,

Hello Benjamin,

Now it works when deploying on prod, but it doesn't work anymore when deploying on the develop and staging environments. All environments use the same registry, defined in the Qovery settings, as the source of the images, so reading images shouldn't be an issue since it works for prod.
Writing cache images is done to 2 different Scaleway caches; we have not changed those settings for the past 2 months, and it was working well yesterday.

@bchastanier in the Qovery settings we can't confirm which access key is associated with which registry, so it's a bit hard to debug on our side. Could you send me over email the access keys that are stored on Qovery, so that we can double check their access rights?
(product improvement suggestion: display the stored access key when editing the registry, just not the secret key)

Do you see anything in your logs? Example of a deployment that failed on dev: https://console.qovery.com/organization/558f6260-d61b-47bc-bb08-f28d2d655ce9/project/70edc842-a910-4f27-9411-e3e786d39c0c/environment/12d194e4-230d-4eca-b87a-7ea9906db114/logs/408bbda6-81e7-4753-83c9-095890715662/deployment-logs

🔓 Login to registry rg.fr-par.scw.cloud as user nologin
🪞 Mirroring image to private cluster registry to ensure reproducibility
🪞 Retrying Mirroring image due to error...
🪞 Retrying Mirroring image due to error...
🪞 Retrying Mirroring image due to error...
🪞 Retrying Mirroring image due to error...
❌ Failed to mirror image airsaas/api:develop-ff649f5e due to Docker terminated with a non success exit status code: ExitStatus(unix_wait_status(256))

Since the registries for the source images and the registries for the mirrors are both on Scaleway, accessed using different access keys, are we sure there's no mix-up in the access keys used? For example, using the key for the source registry to write to the mirror registry? That would definitely explain this behavior.
(just a shot at understanding what's going on…)

Thx!

Hey @Matthieu_Delanoe !

I need to have a closer look, but the issue seems to boil down to the way Docker handles logins to registries: docker login creates a map where the key is the container registry hostname, such as:

{
    "creds": {
        "rg.fr-par.scw.cloud": "<auth-token>",
        "another-registry.com": "<auth-token>"
    }
}
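
For reference, Docker persists this map in ~/.docker/config.json under the "auths" key (unless a credential helper is configured), keyed by the registry hostname, roughly:

cat ~/.docker/config.json
# {
#   "auths": {
#     "rg.fr-par.scw.cloud":  { "auth": "<base64 of user:token>" },
#     "another-registry.com": { "auth": "<base64 of user:token>" }
#   }
# }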

In your case, if my understanding is correct, you have two registries for each cluster (one tied to the cluster for caching/mirroring and one where you push your images via CI). On our end, we connect to both registries via two docker login commands:

docker login ... rg.fr-par.scw.cloud <token-cluster-mirroring-registry>
docker login ... rg.fr-par.scw.cloud <token-your-container-images-registry>

Because the SCW registry for a given region always uses the same hostname, the second docker login command erases the credentials from the first login in the map above.
This is a known limitation on our end: we do not manage this part, as it's handled by Docker, and we haven't found a way to overcome it for the time being. There's a GitHub ticket describing this issue.
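
To make the failure concrete, here is a rough sketch of what the mirroring step ends up doing after those two logins, assuming it boils down to a pull from your images registry followed by a push to the cluster registry (namespaces and tags are placeholders):

# Only the token from the second login survives for rg.fr-par.scw.cloud, so:
docker pull rg.fr-par.scw.cloud/<your-images-namespace>/api:<tag>      # works, right token
docker push rg.fr-par.scw.cloud/<cluster-mirror-namespace>/api:<tag>   # rejected, wrong token for this namespace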

That being said, there are several workarounds to make it work:

  1. give your cluster users read / write access to your images repo (the one where your CI pushes your images), so that on this repository both users are allowed: your dev cluster user and your production cluster user. This way the docker login overwrite no longer matters (the second login still erases the first token, but since both tokens are the same, it's not an issue).

  2. set up your images repo in another SCW region: since the hostname will be different, it won't clash, but it will eventually induce cross-region network costs, so it's not the best option here… (see the sketch after this list)

  3. this one won't work for you, but just FYI: you can have your images hosted on another provider (Docker Hub, GitHub, GitLab, or any generic container registry).
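
To illustrate option 2: with the images repo in another region (nl-ams is only an example), the hostnames differ, so both entries survive in the auths map:

docker login -u nologin -p <token-cluster-mirroring-registry> rg.fr-par.scw.cloud
docker login -u nologin -p <token-your-container-images-registry> rg.nl-ams.scw.cloud
# ~/.docker/config.json now keeps two distinct entries:
# "auths": { "rg.fr-par.scw.cloud": { ... }, "rg.nl-ams.scw.cloud": { ... } }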

I will have a closer look tomorrow and will provide more information for you to investigate, but the issue looks like the docker login hostname clash.

Cheers

@bchastanier since the credential cache is keyed by domain, could we just CNAME our source registry to a custom domain and then use that on Qovery? Wouldn't that work?

Hey @Matthieu_Delanoe !

We did think about it back then, but that was for the mirror, not for the external registry. So in theory, sticking a CNAME on your own registry and using that DNS name should indeed work.

Let me know how it goes,
Cheers

Actually, it's likely not going to work for other operations, because the TLS certificate won't be valid for the custom domain.
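
A quick sketch of what that failure would look like, assuming a hypothetical registry.example.com CNAME'd to rg.fr-par.scw.cloud (the certificate served is only valid for rg.fr-par.scw.cloud):

curl -v https://registry.example.com/v2/
# curl: (60) SSL: no alternative certificate subject name matches target host name 'registry.example.com'
# (or a similar certificate verification error, depending on the TLS backend)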

What about option 1, granting the two cluster users access to the external registry, so the external registry and the cluster mirror registry end up using the same token?

Let me know how it goes,
Cheers

Yeah, we're currently looking at all the options to see which one will be best. For option 1, what you mean is that we update all 3 registry declarations on Qovery and set the same access key on all 3 of them, with access rights to all registries, right?

Sorry, I wasn't clear. Yeah, basically, the idea is that, for each cluster, its user also has access to your external registry.

Let me illustrate it:

Your current setup:
As of today, I guess you have 3 credentials:

  • dev environment (used to operate your dev cluster) => let's call it dev user
  • production environment (used to operate your production cluster) => let's call it production user
  • external registry (used to read / write on the registry where your images are stored) => let's call it registry user
    Neither the dev user nor the production user has access to this registry.

The target setup:
Ideally you would have two credentials:

  • dev environment (used to operate your dev cluster), BUT also with read (and possibly write) access to the external registry => dev user
  • production environment (used to operate your production cluster), BUT also with read (and possibly write) access to the external registry => production user

This way, when doing docker login and further operations, the cluster (mirror) user and the external registry user are the same and can operate on both registries, which should solve the issue.
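
In docker login terms, the target setup looks roughly like this (tokens are placeholders):

docker login -u nologin -p <cluster-user-token> rg.fr-par.scw.cloud   # login for the cluster mirror registry
docker login -u nologin -p <cluster-user-token> rg.fr-par.scw.cloud   # login for the external images registry, same user
# The second login still overwrites the first entry, but since it is the same
# token and that token can access both registries, both pull and push keep working.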

Let me know if it helps

@bchastanier we modified the policies linked to the different users used for the 3 registries on Qovery, allowing them to access all registries, and so far it seems to work. We’re able to deploy again on all 3 environments.

Thanks for the assistance!

