Kubernetes solves my problem by allowing multiple containers in one pod.
But from what I understand, in Qovery each “service” is deployed in a different pod, right?
You’re right! This is to avoid misbehavior in production. Suppose one container in a pod has an issue (it exits, it crashes because of excessive memory consumption…); anything that goes wrong and restarts that container restarts the whole pod.
Here is an example: let’s say you have a frontend and a backend app in the same pod. If your frontend app crashes (Nginx or PHP-FPM not correctly sized for the workload, a bug, or anything else), then your backend will restart along with the frontend. What happens in this situation? All the work your backend was doing is lost. Why? Your backend was fine and may have been processing something important, and reloading it can take time as well. So why should it be restarted in that case? Moreover, there is a high chance the restart will bring more unexpected issues than it solves.
In addition, autoscaling your pod will be a nightmare to manage: the two apps don’t behave the same way, so you can’t apply the same scaling rules to the same metrics (CPU/RAM/anything else). Multiple containers in a single pod should only be used for specific reasons (service mesh sidecars, automatic TLS endpoints, app metrics…). In my experience, having several application containers in one pod is a terrible practice, which is why we don’t offer it.
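To make the coupling concrete, here is a minimal sketch of the anti-pattern in raw Kubernetes (the names and images are hypothetical). The two apps share one pod, so they share one lifecycle and one autoscaling target:

```yaml
# Anti-pattern: frontend and backend coupled in a single pod.
# They restart together, and an HPA can only scale them together.
apiVersion: v1
kind: Pod
metadata:
  name: frontend-and-backend   # hypothetical name
spec:
  containers:
    - name: frontend
      image: my-frontend:1.0   # hypothetical image
    - name: backend
      image: my-backend:1.0    # hypothetical image
```

With one service per pod, each app gets its own Deployment instead, so each can crash, restart, and scale on its own metrics without dragging the other along.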
To be honest, I don’t see how I could run a production application with real users without ensuring my environment is sane (= all my services are in good health AND at the same version).
I agree with that, but not with the way you want to do it. I don’t mean to offend you; I’m just bringing my point of view based on years of experience. My goal is to help you as much as I can, not to tell you what you have to do.
So, from my experience, this architecture can work well in specific, simple use cases without much data to manipulate. But that’s not common. With a standard product backed by a database, you’ll have to manage schema upgrades, and you won’t be able to roll back a schema easily, right?
For applications, it should be the same. Each application should be able to support ~2 versions of its dependencies while waiting for deprecation and code removal. That lets you update/release only what you need and keeps rollback simple, because you can update one app at a time instead of doing a big-bang release.
As time passes, you will undoubtedly multiply the number of applications you have, and keeping all of them at the same version at the same time will become incredibly time-consuming and reduce your delivery velocity. At some point, rollback will be almost impossible without a massive impact on your users (more products, more changes, more people to sync with each other…).
So the theory is excellent, but the practice is not. This is why you’ll find rolling updates as the default strategy on most (maybe all) application/container schedulers.
To conclude, adding blue-green deployment to the product is something we’re considering because, as I mentioned, it’s interesting in some particular cases. But since our customers don’t request it much and it’s not that common in production, it’s not at the top of Qovery’s priority list, so I don’t have an ETA to give you.
My last piece of advice: I encourage you to switch to a classical rolling update strategy now, as you will undoubtedly switch to it in the future anyway.
Let me give you an example. I have two services to deploy (my backend and my frontend). Let’s say my frontend deployment finishes first. Then my users won’t be able to use new features (which rely on new API endpoints) until the backend is deployed. It’s even worse if my backend deployment fails, because then I’m stuck with no working features until I deploy a working version again.
In this case, your backend should be able to handle both versions of the frontend. While the service is reloading, customers will land on one frontend or the other, but in any case it will work. You can also use sticky sessions if you’re worried about a single customer being switched over too quickly.
It raises a few questions:
- How do your other customers handle this problem?
By maintaining two versions, or, as @itajenglish mentioned above, with feature flags.
- Is there a way to minimize the time between the deployment of my two services?
It only depends on your apps’ startup time. If you manage two versions in parallel, it’s fine. In my previous experience, I had the same application deployed 5,000+ times (instances). You simply can’t do blue-green deployment in that case, because you’d need to double the number of instances just for the update (you can imagine the bill, or the bare metal; it’s too expensive to be considered). Rolling updates, canary deployments, and feature flags help a lot in that situation. This is why I encourage you to consider a rolling update strategy.
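For reference, this is roughly what a rolling update looks like in a raw Kubernetes Deployment (Qovery configures this kind of thing for you; the name, image, and numbers below are only illustrative). Instead of doubling your fleet as blue-green would, the rollout only ever adds a couple of extra pods at a time:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend                # hypothetical name
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2              # at most 2 extra pods during the rollout
      maxUnavailable: 0        # never drop below the desired replica count
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      containers:
        - name: backend
          image: my-backend:2.0   # hypothetical new version being rolled out
```

During the rollout, old and new pods serve traffic side by side, which is exactly why the backend needs to tolerate both versions.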
- I guess an option would be to plan deployments during the night? (is that what the “deploy on specific timeframe” feature does?)
Once again, I worked in the financial industry for ~6 years. You can’t imagine how painful it is to release off-hours; the time constraints are exhausting for everyone. If you can avoid it, do. What is commonly done in the industry is to add tests and validate with a pre-prod or ephemeral environment (thanks to Qovery for making that so easy). Then you can safely ship updates during working hours without fear.
And because we’re all human, shit happens; this is why you can regularly find post-mortems even from big companies like Google, Facebook… and this is also why you’re looking for an easy way to roll back.
I’m not saying you shouldn’t care about issues, but what I explained above will drastically reduce the ones you encounter.
- Besides not being optimal, still causing downtime, and limiting my capacity to deploy frequently, what happens if something goes wrong? If the deployment of one of the services fails (or succeeds but the app health check fails), is there a way to instantly roll back a service?
With the rolling update strategy, when you run a deployment, Kubernetes uses probes to ensure your app is working correctly. So if you’re using liveness and readiness probes correctly, your new application version will not receive traffic until the probes tell Kubernetes to route traffic to it.
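Concretely, the probes live in the container spec of your Deployment; here is a minimal sketch (the endpoint paths, port, and timings are hypothetical and should be tuned to your app):

```yaml
containers:
  - name: backend                  # hypothetical name
    image: my-backend:2.0          # hypothetical image
    ports:
      - containerPort: 8080
    readinessProbe:                # gates traffic: the pod only receives
      httpGet:                     # requests once this probe passes
        path: /healthz/ready       # hypothetical endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:                 # restarts the container if it keeps failing
      httpGet:
        path: /healthz/live        # hypothetical endpoint
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
```

During a rolling update, if the new pods never become ready, the rollout stalls instead of sending users to a broken version.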
And if you misconfigured the probes, or the probes say everything is OK but you still have a bug to roll back, then yes, you can select the application version to roll back to. Performing a rollback with Qovery is effortless.
- Is there a way to do it programmatically? (so that I don’t need to monitor actively each deploy)
Avoiding the monitoring of your app is, I think, one of the biggest mistakes you can make. You don’t want to spend time on observability and monitoring; I can understand that even if I disagree, since you have concerns about your application’s uptime. In this case, I suggest you look at one or more observability/monitoring/APM tools that plug directly into your app as a library, like Datadog, New Relic, Sentry, LogRocket… It will be effortless to get more than the bare minimum you should have.
Sorry for the long message too!