Image build caching with AWS ECR

I will duplicate the prelude of my other question here in order to provide better context:

After looking at the response to this topic and browsing through the linked engine code, I have a couple of questions regarding the way Qovery builds images.

  • We have multiple logical services (by logical service I mean a Qovery project and a corresponding Qovery environment)
  • Each logical service/application has multiple ‘sidecars’ (i.e. Qovery services). These sidecars are usually the web server, a couple of background job processors, etc.
  • All of these sidecars use the same application code, the only difference between them is the entry point, so ideally the containers should be created from the same image

When looking at the ECR instance that Qovery has created, I cannot see any cache artefacts. Docker layers containing dependency installs, such as APT packages and Ruby gems, also do not seem to persist between deployments, even when the dependency versions are unchanged.

This code comment seems to confirm that remote caching is currently disabled.

Question 1:
Is this comment still relevant? According to this announcement, AWS ECR private registries now seem to support cache-from and cache-to, but I might be wrong.
The bulk of the image build process is spent pulling dependencies, so enabling the cache would speed up build times significantly.
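
For reference, here is a rough sketch of what the announced registry-backed caching looks like with plain BuildKit against a private ECR repository (account ID, region, and repository name are placeholders, and this is not necessarily how Qovery invokes BuildKit):

```sh
# Authenticate Docker against the private ECR registry
# (account ID and region are placeholders).
aws ecr get-login-password --region us-east-2 \
  | docker login --username AWS --password-stdin 111111111111.dkr.ecr.us-east-2.amazonaws.com

# Build with BuildKit, importing and exporting the layer cache via a :cache tag.
# image-manifest=true and oci-mediatypes=true are needed for ECR to accept the
# cache manifest; mode=max also caches intermediate layers.
docker buildx build \
  --cache-from type=registry,ref=111111111111.dkr.ecr.us-east-2.amazonaws.com/my-app:cache \
  --cache-to type=registry,ref=111111111111.dkr.ecr.us-east-2.amazonaws.com/my-app:cache,mode=max,image-manifest=true,oci-mediatypes=true \
  --tag 111111111111.dkr.ecr.us-east-2.amazonaws.com/my-app:latest \
  --push .
```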

Question 2 - follow-up:
In case it’s possible to use remote caching and it gets enabled sometime soon, this caching will only work within the scope of a single Qovery service, since the cache artefact bears the name of the application, just like ordinary images. Is this correct?
So, in order to leverage layer caching between multiple sidecars/Qovery services (so that Ruby gems and APT dependencies get cached), we’d still have to build our own image build pipeline and set up the Qovery services with the ‘Container registry’ source?
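
To illustrate: if we ran the builds ourselves, all sidecars could point at one shared cache reference, so only the first build would pay for the dependency installs (same placeholder names as above; web and worker are hypothetical sidecar names):

```sh
ECR=111111111111.dkr.ecr.us-east-2.amazonaws.com
CACHE="$ECR/my-app:cache"

# The sidecars share the same application code, so building them against a
# single shared cache ref means only the first build installs the APT/gem
# dependencies; the other builds reuse those layers straight from ECR.
for svc in web worker; do
  docker buildx build \
    --cache-from type=registry,ref="$CACHE" \
    --cache-to type=registry,ref="$CACHE",mode=max,image-manifest=true,oci-mediatypes=true \
    --tag "$ECR/my-app:$svc" \
    --push .
done
```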

Hi @daniilsvetlov, first of all, great analysis. Thanks for asking all these questions. I will try to respond as clearly as possible. I have also cc’d @a_carrano and @Julien_Dan.

You’re completely right. At the time we integrated the build cache via BuildKit into our engine, AWS ECR didn’t support it. It’s now supported :partying_face: . I’m not sure if there is a task on our end to support it yet, but I’m sure @Julien_Dan will take a look with the team. On my end, I discussed this with one of our staff engineers who worked on that part, and he is aware that AWS ECR now supports this. So I’m pretty sure we are not far from providing something here. Of course, we need to test this part and integrate it properly.

Quick question: do you have some numbers to share with us? How long does your build take today, and how long would it take with the cache? (I could take a look myself, of course.)

Indeed, you’re right, but this is something we discussed this week with our team: how to make caching efficient across different environments (and projects). The goal would be to provide an option to enable build caching across multiple services, within the same environment and across others. This is basically to avoid building the same image twice.

Today, yes, but in the (hopefully) near future we want to provide better caching support, regardless of where your app is deployed. The idea would be to avoid wasting time rebuilding something that has already been built once.

I hope it helps.

Thanks for the responses! It’s good to hear that caching could be enabled soon.

Quick question: do you have some numbers to share with us? How long does your build take today, and how long would it take with the cache? (I could take a look myself, of course.)

I think you’ll gain more insight if you look it up yourselves. For example, the layer that runs apk add --update --upgrade takes ~20-30 seconds to build, and the layer that installs our app dependencies takes ~90 seconds. Sprinkle in some intermediate layers and one of our images takes exactly 5 minutes to build. Take into account that this has to happen twice in sequence, since we have lifecycle jobs that run first (I’m omitting the fact that each app has to build its image individually, since that doesn’t have a big impact on deployment time thanks to parallelisation), and we get at least 10 minutes spent on image builds alone during a deployment.

We have proceeded with the solution outlined in the adjacent discussion: we built our own CI pipeline that builds the images (with ECR’s cache-to and cache-from flags) and deploys them to Qovery using qovery X update and qovery environment deploy. The whole CI workflow, including a deployment, now takes 3 minutes on average (including build time and waiting for pods to reach the running state), since most of the Docker layers get fetched from the cache.
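
Roughly, the pipeline looks like this (a simplified sketch with placeholder names; the exact Qovery CLI arguments are omitted, as above):

```sh
ECR=111111111111.dkr.ecr.us-east-2.amazonaws.com/my-app
TAG=$(git rev-parse --short HEAD)   # deterministic, commit-based tag

# Build and push the image, reading from and writing to the registry cache.
docker buildx build \
  --cache-from type=registry,ref="$ECR:cache" \
  --cache-to type=registry,ref="$ECR:cache",mode=max,image-manifest=true,oci-mediatypes=true \
  --tag "$ECR:$TAG" \
  --push .

# Then point each Qovery service at the new tag and deploy, using the CLI
# commands mentioned above (exact arguments omitted):
#   qovery X update ...
#   qovery environment deploy
```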

We have some follow-up questions:

  1. First off, I understand that the potential feature you mentioned, where a Qovery service would be able to reuse an existing image, is probably at a very early conception stage.
    Still, do you think that adopting this feature, once it’s released, would require us to recreate the existing services that are currently set up with the CR source and deployed using images built by us? (Asking this since switching the source from Git to Container Registry and vice versa requires creating another service.)
    We’re trying to assess how hard it would be to migrate running services to this feature once it’s released, so we can ditch our manual approach when the time comes and the caching + image reuse issues get resolved by Qovery.

  2. Would this feature support setting custom ECR repository lifecycle rules, security scanning options, and so on?

  3. Would this feature allow us to define image names and the way tags are going to be generated?
    Asking this because our current approach with manual builds allows us to use these images to run tests in CI (outside of AWS), since we control the way tags are generated (we don’t use latest).

No, this is something we are working on to make it possible for an app to be switched from Git to Container Registry and vice versa. cc @Julien_Dan @a_carrano

Ok, it makes perfect sense. I know how painful this part is at the moment.

What do you have in mind here? Is there anything missing from our ECR configuration that you would need? We can add options inside our cluster advanced settings.

It’s not planned to allow you to define image names since we make sure that there is no clash. What do you need exactly? Do you need to be able to have deterministic image names?

Romaric.

No, this is something we are working on to make it possible for an app to be switched from Git to Container Registry and vice versa.

That’s good to hear!

What do you have in mind here? Is there anything missing from our ECR configuration that you would need? We can add options inside our cluster advanced settings.

We have the scan-on-push feature enabled for our ECR repositories; I guess adding it as an advanced setting could be useful.

It’s not planned to allow you to define image names since we make sure that there is no clash. What do you need exactly?

I was just theorising, since our current setup allows us to use the built images to run tests in CI because we know how a tag is generated. Some kind of determinism in image names and tags (e.g. an additional tag in the form of a git hash) could perhaps help, but I understand it might be problematic to implement.
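
For example, something along these lines (a hypothetical sketch; bundle exec rspec stands in for whatever test command applies):

```sh
ECR=111111111111.dkr.ecr.us-east-2.amazonaws.com/my-app
TAG=$(git rev-parse --short HEAD)

# Because the tag is derived from the commit, the CI test job can pull the
# exact image that will be (or was) deployed and run the test suite inside it.
docker pull "$ECR:$TAG"
docker run --rm "$ECR:$TAG" bundle exec rspec
```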

I put @Julien_Dan (Technical Product Manager) in cc :slight_smile:

It makes perfect sense, and we can explain how to “predict” the URLs, but you will still need to make some API calls to Qovery. Out of curiosity, what type of tests is your CI running in that case?

Out of curiosity, what type of tests is your CI running in that case?

Just ordinary unit and integration tests of an application.

Hey @daniilsvetlov,

Just a quick update: we will be testing the remote cache for AWS ECR in the coming weeks.

Julien


Hi @daniilsvetlov,

The feature has been released! We haven’t yet verified the improvement in terms of deployment speed, but you will see a

importing cache manifest from xxx.dkr.ecr.us-east-2.amazonaws.com/zYYYY:cache

line when the layers are already available (this is on an AWS cluster).

We have added the info to our latest changelog: Qovery Helper, search bar, remote cache usage on app build ...

