Can't deploy on production cluster


I’m struggling to deploy apps on our production cluster. Whenever I try to deploy an existing image version, I get the following:

🪞 Mirroring image to private cluster registry to ensure reproducibility
🪞 Retrying Mirroring image due to error...
🪞 Retrying Mirroring image due to error...
🪞 Retrying Mirroring image due to error...
🪞 Retrying Mirroring image due to error...
❌ Failed to mirror image <namespace>/api:<tag> due to Docker terminated with a non success exit status code: ExitStatus(unix_wait_status(256))

Organization ID: 558f6260-d61b-47bc-bb08-f28d2d655ce9
Project ID: 70edc842-a910-4f27-9411-e3e786d39c0c

Cloud provider: Scaleway.

Hey @Sryther,

I am looking into it. In the meantime, can you double check that the credentials set for the production registry 2 months ago are still working?

Trying to list repositories using this registry gives me a permission issue.

Can you try creating a new container registry (e.g. "Scaleway Registry - Production project New") in the organization, with new credentials that work, and use that repository instead?

Let me know how it goes,

Cheers

Hello @bchastanier
Thanks for your quick answer. As far as I know, the credentials have not changed. I set them again, and the error is now different, but it looks like it comes from us.
Thanks!

EDIT: maybe the error should be described in a better way than an ExitStatus. If the credentials were wrong, I remember Qovery used to tell me so.

Hello!

Indeed, checks are done properly when the container is first set up (and on edit, I guess), but later on the error gets fuzzy.
I'm looping in @Alessandro_carrano to see what can be done on that front to improve the product. Thanks for the feedback :slight_smile:

Looking quickly at the deploy, there seems to be an issue with the Job:
ValueError: invalid literal for int() with base 10: '1 RESET_PASSWORD_EMAIL_KEY_EXPIRE_DAYS'
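
This error is Python's int() receiving a string that isn't just a number. A minimal repro sketch, assuming the value comes from an environment variable whose content accidentally ran into the name of the next variable (SOME_EXPIRY_DAYS below is only a placeholder, not a variable from your app):

export SOME_EXPIRY_DAYS="1 RESET_PASSWORD_EMAIL_KEY_EXPIRE_DAYS"   # two entries merged into one value
python3 -c 'import os; print(int(os.environ["SOME_EXPIRY_DAYS"]))'
# ValueError: invalid literal for int() with base 10: '1 RESET_PASSWORD_EMAIL_KEY_EXPIRE_DAYS'

So it's probably worth checking how the environment variables of that Job are defined; a missing newline or quote would produce exactly this.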

Let me know if you need further assistance.

Cheers,

Hello Benjamin,

Now it works when deploying on prod, but it doesn't work anymore when deploying on the develop and staging environments. All environments use the same registry, defined in the Qovery settings, as the source of the images, so reading images shouldn't be an issue since it works for prod.
Writing cache images is done to 2 different Scaleway caches; we have not changed those settings for the past 2 months, and it was working well yesterday.

@bchastanier in the Qovery settings we can't confirm which access key is associated with which registry, so it's a bit hard to debug on our side. Could you send me over email the access keys that are stored on Qovery, so that we can double check their access rights?
(product improvement suggestion: display the stored access key when editing the registry, just not the secret key)

Do you see anything in your logs? Example of a deployment that failed on dev: https://console.qovery.com/organization/558f6260-d61b-47bc-bb08-f28d2d655ce9/project/70edc842-a910-4f27-9411-e3e786d39c0c/environment/12d194e4-230d-4eca-b87a-7ea9906db114/logs/408bbda6-81e7-4753-83c9-095890715662/deployment-logs

🔓 Login to registry rg.fr-par.scw.cloud as user nologin
🪞 Mirroring image to private cluster registry to ensure reproducibility
🪞 Retrying Mirroring image due to error...
🪞 Retrying Mirroring image due to error...
🪞 Retrying Mirroring image due to error...
🪞 Retrying Mirroring image due to error...
❌ Failed to mirror image airsaas/api:develop-ff649f5e due to Docker terminated with a non success exit status code: ExitStatus(unix_wait_status(256))

Since the registries for the source images and the registries for the mirrors are both on Scaleway, accessed using different access keys, are we sure there's no mix-up in the access keys used? For example, using the key for the source registry to write to the mirror registry? That would definitely explain this behavior.
(just a shot at understanding what's going on…)

Thx!

Hey @Matthieu_Delanoe !

I need to have a closer look, but the issue seems to boil down to the way Docker handles logins to registries: docker login creates a map where the key is the container registry hostname, such as:

{
    "creds": {
        "rg.fr-par.scw.cloud": "<auth-token>",
        "another-registry.com": "<auth-token>"
    }
}
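
For reference, Docker persists this map in ~/.docker/config.json under the "auths" key (unless a credential helper is configured), keyed by the registry hostname, roughly:

cat ~/.docker/config.json
# {
#   "auths": {
#     "rg.fr-par.scw.cloud":  { "auth": "<base64 of user:token>" },
#     "another-registry.com": { "auth": "<base64 of user:token>" }
#   }
# }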

In your case, if my understanding is correct, you have two registries for each cluster (one tied to the cluster for caching/mirroring and one where you push your images via CI). On our end, we connect to both registries via two docker login commands:

docker login ... rg.fr-par.scw.cloud <token-cluster-mirroring-registry>
docker login ... rg.fr-par.scw.cloud <token-your-container-images-registry>

Because the SCW registry for a given region always uses the same hostname, the second docker login command erases the credentials from the first login in the map above.
This is a known limitation on our end: we do not manage this part, as it's handled by Docker, and we haven't found a way to overcome it for the time being. There's a GitHub ticket describing this issue.
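
To make the failure concrete, here is a rough sketch of what the mirroring step ends up doing after those two logins, assuming it boils down to a pull from your images registry followed by a push to the cluster registry (namespaces and tags are placeholders):

# Only the token from the second login survives for rg.fr-par.scw.cloud, so:
docker pull rg.fr-par.scw.cloud/<your-images-namespace>/api:<tag>      # works, right token
docker push rg.fr-par.scw.cloud/<cluster-mirror-namespace>/api:<tag>   # rejected, wrong token for this namespace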

That being said, there are several workarounds to make it work:

  1. give your cluster users read / write access to your images repo (the one where your CI pushes your images), so that on this repository both users are allowed: your dev cluster user and your production cluster user. This way the docker login overwrite no longer matters (the second login still erases the first token, but since both tokens are the same, it's not an issue).

  2. set up your images repo in another SCW region: since the hostname will be different, it won't clash, but it will eventually induce cross-region network costs, so it's not the best option here… (see the sketch after this list)

  3. this one won't work for you, but just FYI: you can have your images hosted on another provider (Docker Hub, GitHub, GitLab, or any generic container registry).
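
To illustrate option 2: with the images repo in another region (nl-ams is only an example), the hostnames differ, so both entries survive in the auths map:

docker login -u nologin -p <token-cluster-mirroring-registry> rg.fr-par.scw.cloud
docker login -u nologin -p <token-your-container-images-registry> rg.nl-ams.scw.cloud
# ~/.docker/config.json now keeps two distinct entries:
# "auths": { "rg.fr-par.scw.cloud": { ... }, "rg.nl-ams.scw.cloud": { ... } }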

I will have a closer look tomorrow and will provide more information for you to investigate, but the issue looks like the docker login hostname clash.

Cheers

@bchastanier since the credential cache is keyed by domain, could we just CNAME our source registry to a custom domain and then use that on Qovery? Wouldn't that work?

Hey @Matthieu_Delanoe !

We did think about it back then, but that was for the mirror, not for the external registry. So in theory, sticking a CNAME on your own registry and using that DNS name should indeed work.

Let me know how it goes,
Cheers

Actually, it's likely not going to work for other operations, because the TLS certificate won't be valid for the custom domain.
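
A quick sketch of what that failure would look like, assuming a hypothetical registry.example.com CNAME'd to rg.fr-par.scw.cloud (the certificate served is only valid for rg.fr-par.scw.cloud):

curl -v https://registry.example.com/v2/
# curl: (60) SSL: no alternative certificate subject name matches target host name 'registry.example.com'
# (or a similar certificate verification error, depending on the TLS backend)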

What about option 1, granting the two cluster users access to the external registry, so the external registry and the cluster mirror registry end up using the same token?

Let me know how it goes,
Cheers

Yeah, we're currently looking at all the options to see which one will be best. For option 1, what you mean is that we update all 3 registry declarations on Qovery and set the same access key on all 3 of them, with access rights to all registries, right?

Sorry, I wasn't clear. Yeah, basically, the idea is that, for each cluster, its user also has access to your external registry.

Let me illustrate it:

Your current setup:
As of today, I guess you have 3 credentials:

  • dev environment (used to operate your dev cluster) => let's call it dev user
  • production environment (used to operate your production cluster) => let's call it production user
  • external registry (used to read / write on the registry where your images are stored) => let's call it registry user
    Neither the dev user nor the production user has access to this registry.

The target setup:
Ideally you would have two credentials:

  • dev environment (used to operate your dev cluster), BUT also with read (and possibly write) access to the external registry => dev user
  • production environment (used to operate your production cluster), BUT also with read (and possibly write) access to the external registry => production user

This way, when doing docker login and further operations, the cluster (mirror) user and the external registry user are the same and can operate on both registries, which should solve the issue.
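
In docker login terms, the target setup looks roughly like this (tokens are placeholders):

docker login -u nologin -p <cluster-user-token> rg.fr-par.scw.cloud   # login for the cluster mirror registry
docker login -u nologin -p <cluster-user-token> rg.fr-par.scw.cloud   # login for the external images registry, same user
# The second login still overwrites the first entry, but since it is the same
# token and that token can access both registries, both pull and push keep working.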

Let me know if it helps

@bchastanier we modified the policies linked to the different users used for the 3 registries on Qovery, allowing them to access all registries, and so far it seems to work. We’re able to deploy again on all 3 environments.

Thanks for the assistance!

