Customize NGINX ingress and Kubernetes Deployment configuration

Hi Qovery team,

We would like to fine-tune our Kubernetes configuration, and it seems this cannot be done through the Qovery interface. Here are the modifications we have made directly on the Kubernetes cluster and would like to persist in Qovery, so that Qovery updates do not overwrite them:

  • Configure the lifecycle.preStop field in the podTemplate section of our deployments.
  • Configure the NGINX ingress controller (customize the ConfigMap and annotations).

How can we persist these changes in Qovery?

Thanks and regards,

Hi @Alexis_Bel ,

I invite you to look at our documentation. It’s possible to add custom annotations to all Kubernetes resources and also to customise some resources by using what we call advanced settings.

However, maybe using the Qovery lifecycle jobs is a better option than lifecycle.preStop?

Could you give more inputs on what you need to achieve? (It would help me to better guide you). Thank you.

Hi @rophilogene,

Thanks for your answer! However, the current docs can’t help me solve my issues.

I would like to update values of the NGINX ingress controller that is already deployed when the cluster is provisioned (there is no NGINX “service” in my Qovery environment). What is the recommended method for that?

Regarding advanced settings, there is no section for the lifecycle.preStop field. I need to set it because some of our pods, once terminated, are removed from the cluster before the NGINX ingress controller is able to update its configuration, leading to “Host is unreachable” errors. I simply add “sleep 10” to delay the termination of the pod.
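
For illustration, the kind of change described above can be sketched as a strategic merge patch on a Deployment. This is only a sketch: the container name "app" is a placeholder, the image is assumed to ship /bin/sh and sleep, and Python is used merely to make the structure explicit (the printed JSON could be passed to kubectl patch with -p).

```python
import json


def prestop_sleep_patch(container_name: str, seconds: int) -> dict:
    """Build a strategic merge patch adding a preStop sleep to one container."""
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            # Strategic merge matches containers by name.
                            "name": container_name,
                            "lifecycle": {
                                "preStop": {
                                    "exec": {
                                        "command": ["/bin/sh", "-c", f"sleep {seconds}"]
                                    }
                                }
                            },
                        }
                    ]
                }
            }
        }
    }


# "app" is a hypothetical container name; 10 seconds is the delay mentioned above.
patch = prestop_sleep_patch("app", 10)
print(json.dumps(patch, indent=2))
```

Applied by hand, such a patch is exactly what gets lost on the next Qovery deployment, which is why it needs to be persisted on the Qovery side.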

Thanks a lot,

Alexis

Hi again @rophilogene,

To give you more context, we are currently facing 502 HTTP errors on our application during the rollout and scale-down of application and NGINX pods.

After investigation, we have identified two root causes for this issue:

  1. The first issue is related to requests sent between the NGINX ingress controller pods and application pods. When our application scales down or rolls out, pods are terminated too quickly and the NGINX configuration is not updated in time, so NGINX keeps sending requests to pods that no longer exist (this delay lasts about 10 seconds). To avoid this issue, we delay the pod termination process with the lifecycle.preStop field. However, there is no way to configure this in Qovery, even with advanced settings.
  2. The second issue is related to requests sent from the AWS NLB to the NGINX pods. When an NGINX pod is terminated (during scale-down or rollout), the NLB continues to send requests to the node (which is still considered healthy by the NLB health check), resulting in 502 errors. The same issue occurs when a node is scaled down. The recommended approach would be to use the Load-Balancer IP target type instead of the instance target type, to directly target NGINX pods and not worker nodes. This can be achieved with the AWS Load Balancer Controller.
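
As an illustration of the second point: with the AWS Load Balancer Controller installed, the NLB target type is selected through annotations on the ingress controller's Service of type LoadBalancer. The sketch below only builds those annotations. It assumes the controller is already running in the cluster, and the "internet-facing" scheme is an example value, not something taken from our setup.

```python
import json


def nlb_ip_target_annotations(scheme: str = "internet-facing") -> dict:
    """Service annotations asking the AWS Load Balancer Controller for an
    NLB that targets pod IPs instead of worker node instances."""
    return {
        # Hand the Service over to the AWS Load Balancer Controller.
        "service.beta.kubernetes.io/aws-load-balancer-type": "external",
        # Register pod IPs as targets rather than instances.
        "service.beta.kubernetes.io/aws-load-balancer-nlb-target-type": "ip",
        # Example value; "internal" is the other option.
        "service.beta.kubernetes.io/aws-load-balancer-scheme": scheme,
    }


service_patch = {"metadata": {"annotations": nlb_ip_target_annotations()}}
print(json.dumps(service_patch, indent=2))
```

With IP targets, the NLB health checks reach the NGINX pods themselves, so a terminating pod can be taken out of rotation independently of the node it runs on.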

Does Qovery provide these features out-of-the-box? It seems the default Qovery configuration doesn’t address the two common issues described above. Our clusters were created a few years ago, so perhaps there have been improvements since then?

What are your recommendations for resolving these issues on the Qovery platform?

Thank you for your assistance.

Hello @Alexis_Bel,

Yes this is something we are going to improve by integrating the AWS LB Controller.

To give you more context, today the nginx-ingress-controller is the “old” way (also called in-tree) to manage nginx pods.
The AWS LB Controller would allow us to enable some features on the LB side, like the proxy protocol, that would fix those scaling issues.

We’ll communicate once this is ready on our side. It is in our current sprint but requires some testing, so there is no accurate ETA yet.


Hi @Melvin_Zottola,

Could you provide a rough estimate of the date for this improvement release or a temporary workaround in the meantime?

We often hit this issue, and it impacts production customers: a small portion of them and for a limited duration, but it is still a real customer-facing problem that we consider major.

We pay for services like Qovery and managed clusters precisely so that we do not have to deal with such issues. What do you advise?

Thank you

Hi @clemg ,

The team will provide an ETA on this - I put @a_carrano and @Pierre_Mavro in cc

Hi @clemg ,

I’m currently working on the implementation and migration to the ALB controller. It will be available and rolled out next month (June 2024).

Regarding your point 1, my advice is:

  1. We do not propose (yet?) pre-stop hooks. However, maybe you can handle the SIGTERM in your app and, if running on Qovery (based on available env vars), you decide to add a sleep for 10 seconds before closing the app. What do you think about this?
  2. I don’t know your app, so I’m just giving general guidance here: even with a workaround/fix, you will still face hardware issues/bugs that you can’t control or prevent. Implementing a retry on the client side is the best option to avoid a bad user experience.
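
A minimal sketch of the idea in point 1, in Python. It assumes that Qovery injects environment variables whose names start with "QOVERY_" (check the variables actually present in your containers), and the serving loop is only a placeholder for a real server:

```python
import os
import signal
import threading
import time

SHUTDOWN_DELAY_SECONDS = 10  # delay suggested above
shutdown_requested = threading.Event()


def running_on_qovery(environ=os.environ) -> bool:
    # Assumption: Qovery-injected variables start with "QOVERY_".
    return any(name.startswith("QOVERY_") for name in environ)


def shutdown_delay(environ=os.environ) -> int:
    """Seconds to wait after SIGTERM before closing the app."""
    return SHUTDOWN_DELAY_SECONDS if running_on_qovery(environ) else 0


def handle_sigterm(signum, frame):
    # Only record the request; the main loop decides when to stop.
    shutdown_requested.set()


def main():
    signal.signal(signal.SIGTERM, handle_sigterm)
    while not shutdown_requested.wait(timeout=1.0):
        pass  # placeholder: a real app serves requests here
    # A real server should keep accepting requests during this delay,
    # so traffic still routed to the pod is answered.
    time.sleep(shutdown_delay())
    # Close listeners and finish in-flight requests here.
```

Calling main() from the application entrypoint installs the handler. The delay has to stay well below the pod's termination grace period, otherwise the container is killed before it exits cleanly.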

Note: on this particular subject, there is an open issue and an ongoing PR which may solve this problem. We’re following their progress and will update the Qovery config once the fix is out.

Pierre


Hi again,

While discussing this issue with the team, we are also considering implementing post-start and pre-stop hooks, with a default sleep 15 on pre-stop to avoid this issue.

I’ll keep you posted on the outcome of this internal discussion.

Pierre

Hi @Pierre_Mavro,

Thanks for your message and these details.

The ALB controller will be a key improvement for us (or any changes that fix these current limitations); we can’t wait to have it! For now, to try mitigating the issue, we disabled auto-scaling on nginx pods, but we still have issues in the event of nginx re-applying a new config.

Regarding the pre-stop, I definitely think you should propose a default sleep. This seems to be a basic and common case, and as Qovery managed k8s users, we shouldn’t have to worry about this.

As for your point about retries, we also have API-only users from other systems and products, so we can’t control how they implement calls to our API. We want to provide a service with the highest quality possible.

Hi @clemg ,

The pre-stop and post-start features will be released this week. I’ll keep you posted when they are out. I’ll then continue the ALB integration.

Pierre

Hi @clemg ,

Just to let you know that we’ve just released the pre-stop and post-start lifecycles. They are available in the advanced settings. Currently deployed applications/containers do not get a default sleep, to avoid changing the behavior of what’s running for all of our customers. However, every new application/container created will inherit a sleep 15s in the stop process.

The documentation is ongoing and should be released shortly.

Pierre

Great! Thank you for this first improvement :pray:

What is your recommendation for applying this new setting to all our existing apps/environments (and how can we ensure no downtime for production)? Do we have to recreate them one by one?

You can just update the advanced settings (pre-stop lifecycle) of your apps with the default value and redeploy: ["/bin/sh", "-c", "sleep 15"].

You can ensure there is no downtime with an observability tool like Datadog. I also encourage you to have a look at this page: How to Achieve Zero-Downtime Application with Kubernetes
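
If no observability tool is at hand, a crude check is to poll an endpoint of the application while it is being redeployed and count the failed requests. The sketch below is only that, a rough probe and not a replacement for proper monitoring; the URL passed to probe_once is whatever health endpoint your app exposes.

```python
import time
import urllib.error
import urllib.request


def probe_once(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP 4xx/5xx, connection errors and timeouts.
        return False


def count_failures(probe, attempts: int, interval: float = 0.0) -> int:
    """Call probe() `attempts` times and count how often it returns False."""
    failures = 0
    for _ in range(attempts):
        if not probe():
            failures += 1
        time.sleep(interval)
    return failures
```

For example, count_failures(lambda: probe_once(url), attempts=300, interval=0.2) run during a rollout should return 0 if the redeployment is really free of downtime from that client's point of view.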

Stay tuned for the ALB controller rollout this month.

Pierre

Hi @Pierre_Mavro,

We’ve been experiencing reliability issues as well and looking forward to all the improvements happening here. I have a couple questions to better understand the changes:

  1. sleep 15 became a default/recommended pre stop hook. Was this needed to better accommodate applications that do not handle SIGTERM correctly? Is sleeping in pre stop hook needed when application properly finishes all pending requests before exiting?

  2. Will ALB rollout change the current load balancer setup in any way? Is it still gonna be 1 NLB per cluster?

Thanks,
Pranas

Hello @prki

The sleep 15 is not there to accommodate applications that do not handle SIGTERM. It is still required for an application to handle SIGTERM to avoid disruption during rollout.

The pre-stop is there to ensure the application has received all pending/in-flight requests before being asked to shut down.

It is to avoid lost requests caused by various systems still updating their state while the app stops too quickly.
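
To make the timing concrete: in Kubernetes the pre-stop hook runs first, SIGTERM is sent to the container once the hook has finished, and both phases have to fit within the pod's terminationGracePeriodSeconds (30 seconds by default in Kubernetes), since the countdown starts before the hook runs. A small sketch of that budget, where the drain times are made-up figures:

```python
def remaining_after_prestop(grace_period_s: int, prestop_sleep_s: int) -> int:
    """Seconds left for the app to handle SIGTERM once the pre-stop hook ends."""
    return max(grace_period_s - prestop_sleep_s, 0)


def fits_in_grace_period(grace_period_s: int, prestop_sleep_s: int, drain_s: int) -> bool:
    """True if the pre-stop sleep plus the app's own drain time fit in the
    grace period, so the container is not force-killed mid-request."""
    return drain_s <= remaining_after_prestop(grace_period_s, prestop_sleep_s)


# Default Kubernetes grace period of 30s with a 15s pre-stop sleep.
print(remaining_after_prestop(30, 15))
print(fits_in_grace_period(30, 15, drain_s=10))
print(fits_in_grace_period(30, 15, drain_s=20))
```

An application that needs longer than the remaining time to finish in-flight requests would need a larger grace period, not a shorter sleep.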

  2. The setup should not change; it will still be one NLB per cluster. We are upgrading the Kubernetes controller that manages this LB. We are mainly interested in getting the proxy protocol feature, which allows us to load balance traffic across all nodes and still retain the client source IP.
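
For readers who want to see what enabling the proxy protocol typically involves with ingress-nginx behind an NLB, here is a sketch. It is not necessarily how Qovery configures it: it only shows the commonly documented Service annotation and the matching ingress-nginx ConfigMap key, which must be enabled together so that NGINX can parse the header added by the load balancer.

```python
import json

# Annotation on the ingress controller's LoadBalancer Service:
# ask the NLB to prepend a PROXY protocol header on all backends.
service_annotations = {
    "service.beta.kubernetes.io/aws-load-balancer-proxy-protocol": "*",
}

# Key in the ingress-nginx controller ConfigMap: make NGINX expect
# and decode that header to recover the client source IP.
nginx_configmap_data = {
    "use-proxy-protocol": "true",
}

print(json.dumps({"service": service_annotations, "configmap": nginx_configmap_data}, indent=2))
```

Enabling only one side breaks traffic, because NGINX would either reject requests without the header or treat the header as garbage in the request.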