Airbyte 1.1.0 installation on Qovery managed cluster broke again

Airbyte 1.1.0 deployment on a Qovery managed EKS was working a few weeks ago. It broke again:

https://console.qovery.com/organization/5fafa1c2-689c-4a54-8ea1-533a7230a2a5/project/8fdd5013-c37f-45af-928a-a3b88179e1ef/environment/bbd9323d-6009-4ae4-af85-2026269e705d/logs/bdb5a18b-a560-4f33-98a0-c905c41105f8/deployment-logs

|Helm timed out for release helm-zbdb5a18b-airbyte during helm UPGRADE: Command killed due to timeout: Killing process AWS_ACCESS_KEY_ID="xxx" AWS_DEFAULT_REGION="us-east-1" AWS_SECRET_ACCESS_KEY="xxx" KUBECONFIG="/home/qovery/.qovery-workspace/bbd9323d-6009-4ae4-af85-2026269e705d-1-1732060600/bootstrap/z7d0953dd/qovery-kubeconfigs-z7d0953dd/z7d0953dd.yaml" "helm" "upgrade" "helm-zbdb5a18b-airbyte" "/home/qovery/.qovery-workspace/bbd9323d-6009-4ae4-af85-2026269e705d-1-1732060600/helm_charts/bdb5a18b-a560-4f33-98a0-c905c41105f8/chart" "--install" "-n" "zbbd9323d-airbyte-production" "--values" "/home/qovery/.qovery-workspace/bbd9323d-6009-4ae4-af85-2026269e705d-1-1732060600/helm_charts/bdb5a18b-a560-4f33-98a0-c905c41105f8/chart/file1" "--timeout" "600s" "--wait" "--atomic" "--debug" due to Timeout(600s)|
|198|20 Nov, 00:11:47.89|❌ Deployment of helm chart failed!

https://console.qovery.com/organization/5fafa1c2-689c-4a54-8ea1-533a7230a2a5/project/8fdd5013-c37f-45af-928a-a3b88179e1ef/environment/bbd9323d-6009-4ae4-af85-2026269e705d/services/deployments

Same problem with JupyterHub deployment. I followed the instructions here:

Deployment errors here:

https://console.qovery.com/organization/5fafa1c2-689c-4a54-8ea1-533a7230a2a5/project/0014bb51-b28a-4048-a385-262dab6ec1b5/environment/f025bac3-b0d4-4542-8158-ecf690dd13a7/logs/e20d6fbe-57d4-41fd-acb2-f997a9f3d8a6/deployment-logs

ready.go:284: [debug] PersistentVolumeClaim is not bound: zf025bac3-frontier-production/jupyterhub-hub-db-dir

https://console.qovery.com/organization/5fafa1c2-689c-4a54-8ea1-533a7230a2a5/project/0014bb51-b28a-4048-a385-262dab6ec1b5/environment/f025bac3-b0d4-4542-8158-ecf690dd13a7/services/deployments

Hello,

I’ll check your problem with the team and get back to you.

Regards,
Charles-Edouard

Thanks! I’d like to get both Airbyte and JupyterHub (latest versions of the Helm charts) deployed successfully. I’m trying a non-Karpenter EKS cluster first. I was able to deploy both previously on ARM nodes; I’ll try AMD nodes today.

For what it’s worth, I spun up 3 Qovery managed EKS clusters:

  • no-karpenter AMD nodes
  • no-karpenter ARM nodes
  • karpenter AMD nodes

Airbyte 1.1.0 helm chart deployments are failing on all three of these clusters:
https://console.qovery.com/organization/5fafa1c2-689c-4a54-8ea1-533a7230a2a5/project/8fdd5013-c37f-45af-928a-a3b88179e1ef/environments/general

JupyterHub 4.0.0 Helm Chart deployments are failing on all 3 of the clusters:
https://console.qovery.com/organization/5fafa1c2-689c-4a54-8ea1-533a7230a2a5/project/0014bb51-b28a-4048-a385-262dab6ec1b5/environments/general

Hello @data_admin ,

We checked your latest problem and it looks like you tried to deploy your application on a stopped Cluster.

Can you start your cluster and try again?

Please let me know if you still have errors after starting your clusters.

Regards,
Charles-Edouard

I wouldn’t make a mistake that basic. All clusters were live and operational when I tried to deploy the latest versions of Airbyte and JupyterHub on them. You can look at the deployment histories of Airbyte and JupyterHub to see the failing deployments. You can also duplicate the problem yourself by deploying Airbyte and JupyterHub onto your own EKS clusters.

Thank you for your feedback,

Do you mind if we try to start your cluster again to investigate the problem on your config?

Regards,
Charles-Edouard

I already did that, you can see the deployment errors here:

https://console.qovery.com/organization/5fafa1c2-689c-4a54-8ea1-533a7230a2a5/project/8fdd5013-c37f-45af-928a-a3b88179e1ef/environment/04771e89-0617-4c99-b627-025b8033df80/services/deployments

https://console.qovery.com/organization/5fafa1c2-689c-4a54-8ea1-533a7230a2a5/project/0014bb51-b28a-4048-a385-262dab6ec1b5/environment/103d13c4-6d71-4711-9478-8e0f86abe4c0/services/deployments

Airbyte 1.1.0 deployment on an AWS EKS cluster (3-node AMD t4g.xlarge) was working 2 weeks ago. Now it’s failing with the exact same setup:

https://console.qovery.com/organization/5fafa1c2-689c-4a54-8ea1-533a7230a2a5/project/8fdd5013-c37f-45af-928a-a3b88179e1ef/environment/6d101bde-aae9-45f3-a34f-95920eeeb215/services/deployments

Feel free to start / stop any clusters and jupyterhub / airbyte deployments if you need to debug.

Hello,

I am looking at your issue.
One of the issues is that your cluster is too small. For example, the minio pod can’t start because the cluster has reached its maximum number of nodes.
So you should give the cluster some extra room to expand by increasing its maximum number of nodes.

The second issue is on our side: we set a hard timeout of 5 minutes for downloading the chart’s dependencies, and it seems that Airbyte is now hitting this timeout every time.

I am going to increase our hard timeout and will keep you posted.

Hi again, we have increased our timeout for the dependency fetch. The only thing left for you to do is increase the cluster’s maximum number of nodes, as the minio pod can’t start:

pod didn't trigger scale-up: 1 max node group size reached
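For anyone triaging similar events, here is a minimal Python sketch that splits a scheduling message like the one above into its individual reasons. It assumes the message keeps the `<count> <reason>, <count> <reason>` layout seen in this thread; the function name `scale_up_blockers` is made up for illustration and is not part of any Qovery or Kubernetes tooling.

```python
import re

# Matches "<count> <reason>" items. A reason is assumed to end either at
# ", <digits> " (the start of the next item) or at the end of the message.
_REASON = re.compile(r"(\d+) (.+?)(?=, \d+ |$)")

PREFIX = "pod didn't trigger scale-up:"


def scale_up_blockers(message: str) -> dict[str, int]:
    """Return {reason: count} for a "didn't trigger scale-up" event message.

    Returns an empty dict for messages that do not start with PREFIX.
    """
    if not message.startswith(PREFIX):
        return {}
    body = message[len(PREFIX):].strip()
    return {reason: int(count) for count, reason in _REASON.findall(body)}


# The exact message quoted in this thread.
event = "pod didn't trigger scale-up: 1 max node group size reached"
print(scale_up_blockers(event))
```

A "max node group size reached" entry means the autoscaler wanted to add a node but was not allowed to, which is why raising the cluster's maximum node count is the suggested fix.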

I don’t think this makes sense; two weeks ago Airbyte deployment was working on clusters with 3 nodes.

I increased the number of nodes on our clusters to 5. We also have a Karpenter-enabled cluster:

https://console.qovery.com/organization/5fafa1c2-689c-4a54-8ea1-533a7230a2a5/clusters/general

Airbyte deployments are failing on all 3 clusters, you can check the deployment histories:

https://console.qovery.com/organization/5fafa1c2-689c-4a54-8ea1-533a7230a2a5/project/8fdd5013-c37f-45af-928a-a3b88179e1ef/environments/general

You may want to try to duplicate this problem on your own EKS clusters.

I’m just following these instructions:

Github repo was updated GitHub - evoxmusic/qovery-airbyte: Deploy Airbyte on Kubernetes with Qovery
working a few weeks ago

Hello, any updates? Should I shut down our Qovery managed cluster if Qovery doesn’t have the bandwidth to debug this? You guys are welcome to re-deploy the cluster if you have the bandwidth to help debug this.

I did take a look and here are my notes:

  1. First of all, I stuck to the version from my tutorial (1.1.x at the time) to make sure I had a valid working version.
  2. I tried to deploy it on your EKS cluster with and without Karpenter. The error was the same - related to the minio airbyte pod not starting. So I suspect a bug with this pod, and I have a few ideas that I’ll explore today. (I suspect some metadata is still present on the persistent storage for minio and is blocking the proper startup of this service.)

In the meantime, and for production purposes, I’d suggest using S3 storage from AWS (or an equivalent) to remove the dependency on minio, which is not recommended for production use.

You can find the documentation here: State and Logging Storage | Airbyte Documentation


Hey, that works for us - thanks! I was able to get the latest version of the Airbyte Helm chart (v1.2.0) deployed using S3:

EDIT: I removed the above environments since both are working; will replace with a more permanent one later.

Don’t worry about the minio airbyte pod; we’re not going to use it in production. I’ll also try using an external AWS managed database service with Airbyte:

Can you also help with deploying the JupyterHub Helm chart? I followed these instructions:

and the deployment failed:
https://console.qovery.com/organization/5fafa1c2-689c-4a54-8ea1-533a7230a2a5/project/0014bb51-b28a-4048-a385-262dab6ec1b5/environment/103d13c4-6d71-4711-9478-8e0f86abe4c0/services/general

There seems to be a
ready.go:284: [debug] PersistentVolumeClaim is not bound: z103d13c4-jupyterhub-production-amd/jupyterhub-hub-db-dir
bug that wasn’t there 2 weeks ago. I was able to deploy JupyterHub from the Helm chart successfully using the instructions above two weeks ago.
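In case it helps with debugging, here is a small Python sketch that pulls the unbound claims out of Helm `--debug` output and prints a `kubectl describe pvc` command for each one. It assumes the log lines look exactly like the one quoted above; `unbound_pvcs` is a hypothetical helper name, not something Helm or Qovery provides.

```python
import re

# Helm --debug line format, as quoted in this thread:
#   ready.go:284: [debug] PersistentVolumeClaim is not bound: <namespace>/<claim>
_UNBOUND = re.compile(r"PersistentVolumeClaim is not bound: (\S+)/(\S+)")


def unbound_pvcs(log_text: str) -> list[tuple[str, str]]:
    """Return unique (namespace, claim) pairs in first-seen order."""
    seen: dict[tuple[str, str], None] = {}
    for namespace, claim in _UNBOUND.findall(log_text):
        seen[(namespace, claim)] = None  # dict keys keep insertion order
    return list(seen)


# Helm repeats the same line while it waits, so duplicates are expected.
log = """\
ready.go:284: [debug] PersistentVolumeClaim is not bound: z103d13c4-jupyterhub-production-amd/jupyterhub-hub-db-dir
ready.go:284: [debug] PersistentVolumeClaim is not bound: z103d13c4-jupyterhub-production-amd/jupyterhub-hub-db-dir
"""

for namespace, claim in unbound_pvcs(log):
    # The Events section of the describe output usually says why the claim is pending.
    print(f"kubectl describe pvc {claim} -n {namespace}")
```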

Hello @data_admin ,

We are investigating this with the team.

Regards,
Charles-Edouard

Hey @data_admin,

We have found the culprit: the default storage class (gp2) is not tagged as default.
We are working on a viable fix.

In the meantime, I can patch your cluster to set gp2 as the default storage class, which should solve your issue for the time being.
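For reference, a storage class is normally marked as default through the standard Kubernetes annotation `storageclass.kubernetes.io/is-default-class`. The Python sketch below only builds the patch payload and the matching `kubectl patch` command line; it is an assumption about what such a patch could look like, not necessarily the exact change applied on the Qovery side, and the helper names are illustrative.

```python
import json
import shlex

# Standard Kubernetes annotation used to mark a StorageClass as the default.
DEFAULT_CLASS_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def default_class_patch(is_default: bool = True) -> dict:
    """Build the patch body that sets (or clears) the default-class flag."""
    value = "true" if is_default else "false"
    return {"metadata": {"annotations": {DEFAULT_CLASS_ANNOTATION: value}}}


def patch_command(storage_class: str, is_default: bool = True) -> list[str]:
    """Return the kubectl argv that would apply the patch (nothing is executed)."""
    payload = json.dumps(default_class_patch(is_default))
    return ["kubectl", "patch", "storageclass", storage_class, "-p", payload]


# Print a copy-pasteable command for the gp2 class mentioned in this thread.
print(shlex.join(patch_command("gp2")))
```

Running `kubectl get storageclass` afterwards should show `(default)` next to the patched class if the annotation took effect.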

Cheers
