Handling crash loops

What is the recommended protocol for addressing crash loops in an environment?

In our case, we ran into an issue with our Django project where Celery was initializing and querying the database before Django migrations had run, thereby trying to access DB tables that didn't exist yet. We can:

  • Stop the application
  • Make order-of-operations adjustments to our Dockerfile scripts and build so migrations run first
  • Redeploy the app

But as soon as the Docker container is built, the crash loop begins again, which triggers deploy/build failures. We also can't sh into the app and manually run the migration, for the same reason (the crash exits us out).

How can an application be brought out of this cycle (without completely recreating it)? Are there other tools/approaches here we're not thinking of? Thanks!

Qovery application: Qovery

Hello @ChrisBolman1,

From what I understand, you have:

  1. A DB (managed / containerized / handled via Qovery?) - from what I see it seems this DB is outside of Qovery, am I right?
  2. A django app
  3. Several Django crons

You want to deploy the whole environment, and to do so the steps should be:

  1. The DB should be created: you can create a lifecycle job, triggered on env start / stop in a new step, which will handle managed DB creation / existence - check this doc
  2. Django migration to kick in (done on Django application start?) (can be merged with step 3.)
  3. Django application to start
  4. Django crons to be deployed

Let me know if it’s clear enough.

Cheers

Hi @bchastanier, thanks. To confirm, we have a Postgres DB hosted on AWS (alongside the Django app with several crons, as you mention). The DB already exists; we don't need to create it, we just need to migrate it before the application crashes. We'll review the lifecycle docs and see if that could work for us to run migrate at the container default stage.

OK! You can then just make sure the migration is done before Celery / the Django app tries to connect. It can be a lifecycle job doing only the migration in a dedicated step (before your app starts).
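For example, the job can reuse your app image and run nothing but the migration. A minimal sketch (the path is a placeholder for wherever your manage.py lives):

#!/bin/sh
# Hypothetical lifecycle-job script: apply migrations only, nothing else.
# Reusing the app image means Django and all dependencies are already installed.
set -e
cd /usr/src/app                       # placeholder: your app's WORKDIR
python3 manage.py migrate --noinput   # non-zero exit makes the job (and deploy step) fail loudly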

Let me know when you managed to make it work.

Cheers

Hi @bchastanier, I read up on Terraform today and created a lifecycle job following your docs.

  1. In the lifecycle job I do not see "job" as an available scope option (whereas I see it in your docs) - only "project", "environment", and "service". Am I missing something? I'm just setting our vars as "environment" - or I guess these should be "service"?

  2. What I’d like to do is have Terraform:

  • Spin up a temporary EC2 instance resource
  • Install minimum dependencies (or is this not even necessary because the dependencies have already been installed in the Qovery build process?)
  • cd into app path
  • run python3 manage.py migrate
provisioner "remote-exec" {
    inline = [
      "sudo apt update",
      "sudo apt install -y python3 python3-pip", # install dependencies, necessary?
      "pip3 install django", # install dependencies, necessary?
      "cd /usr/src/app", # set path
      "python3 manage.py migrate"
    ]

I am using the same WORKDIR /usr/src/app path as in our standard deployment Dockerfile.

  3. Is there any guidance or documentation on making the EC2 connection and mounting a private key in Docker and Terraform? Connecting to EC2 via Terraform seems like the most complex part here; now I just need to handle the timeout when trying to connect to the temp EC2 instance. It's good to know the lifecycle job capability exists, but this is still a fairly complex solution to implement for the type of issue we're trying to troubleshoot. It would also be nice if it were easier to cancel or stop running jobs - currently "cancel deploy" on a job doesn't seem to do anything until the job has completed running (or timed out) ("Some operation cannot be stopped (i.e: terraform actions) and need to be completed before stopping the deployment").

aws_instance.temporary_instance: Still creating... [5m20s elapsed]
aws_instance.temporary_instance: Still creating... [5m30s elapsed]

Error: remote-exec provisioner error

  with aws_instance.temporary_instance,
  on main.tf line 29, in resource "aws_instance" "temporary_instance":
  29: provisioner "remote-exec" {

timeout - last error: dial tcp xx.xxx.xxx.xx:22: i/o timeout
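(From the i/o timeout on port 22, it looks like the instance isn't reachable over SSH at all - presumably a security group / networking issue rather than Terraform itself. A quick sanity check from a host on the same network, using the masked IP as a placeholder:)

nc -vz -w 5 xx.xxx.xxx.xx 22   # succeeds only if the instance accepts connections on SSH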

Hey @ChrisBolman1,

Regarding 1.: indeed, we changed the naming around jobs (it is now "service"). The doc has been updated.

Indeed, the setup you are trying to spin up looks way too complex for what it does. May I ask why your migration step is not part of the app start process?

I have a Django project, and here’s the setup I have on Qovery for migration:

  1. A step to create the DB (I understand it’s not needed on your side)

  2. A job step starting my app cron jobs (can be done at the end); each Django job uses my main app container with a custom ENV variable value for cron mode, which triggers only the cron jobs and kills the container once they're done

  3. Start my app, executing the migration first, erroring out if the migration fails, and starting the app if everything is OK

My configuration is rather simple, here’s my app Dockerfile (base image includes Node + Python because I need npm to build front):

FROM nikolaik/python-nodejs:python3.7-nodejs19
ARG TRILLE_SITE_VERSION_TAG
ARG TRILLE_SITE_VERSION_COMMIT
ADD . /app/
WORKDIR /app
RUN echo "Set disable_coredump false" >> /etc/sudo.conf # https://unix.stackexchange.com/questions/578949/sudo-setrlimitrlimit-core-operation-not-permitted #
RUN pip install pipenv && \
  apt-get update && \
  apt-get upgrade -y
RUN curl -sS https://dl.yarnpkg.com/debian/pubkey.gpg | apt-key add - 
RUN apt-get install -y --no-install-recommends gcc python3-dev libssl-dev libpq-dev dos2unix libmagic-dev musl-dev gettext && \
  pipenv install --dev --deploy --system && \
  apt-get remove -y gcc python3-dev libssl-dev libpq-dev && \
  apt-get autoremove -y && \
  pip uninstall pipenv -y
RUN dos2unix entrypoint.sh
RUN chmod +x entrypoint.sh
ENV PYTHONUNBUFFERED 1
ENV TRILLE_SITE_VERSION_TAG=${TRILLE_SITE_VERSION_TAG}
ENV TRILLE_SITE_VERSION_COMMIT=${TRILLE_SITE_VERSION_COMMIT}
EXPOSE 8000
HEALTHCHECK --interval=15s --timeout=5s --retries=5 --start-period=10s CMD wget -qO- http://localhost:8000 || exit 1
ENTRYPOINT ["/app/entrypoint.sh"]

And here’s my entrypoint.sh:

#!/bin/sh

# Generate statics via NPM
cd jstools
npm install && npm run build
cd -

# Collect static assets
echo "Collect static assets" && python manage.py collectstatic --no-input

# Apply database migrations
echo "Apply database migrations"
python -u manage.py migrate --noinput || exit 1 # abort so the app never starts if the migration fails

# Create django super user
echo "Create django super user"
python -u manage.py createsuperuser --noinput

# Populate data if required
POPULATE_TEST_DATA="${TRILLE_APP_POPULATE_TEST_DATA:-False}"
if [ "$POPULATE_TEST_DATA" = "True" ]
then
	echo "load data from fake_data.json"
    python -u manage.py loaddata data/fake_data.json

    # Re-sync Algolia DB with actual data
    echo "Reindex algolia"
    python -u manage.py algolia_reindex
fi

# If is CRON mode, just launch the app, execute cron jobs and kill the app
IS_CRON_MODE="${TRILLE_IS_CRON_MODE:-False}"
if [ "$IS_CRON_MODE" = "True" ]
then
	echo "Launch cron jobs"
    python -u manage.py runcrons
    exit 0
fi

# Run the app
echo "Run the app"
python -u manage.py runserver 0.0.0.0:8000

exit 0

Did you already give a try to a similar solution?

Cheers

Thanks @bchastanier. The primary issue we've been running into, as I outlined in my original post, is that a process (typically django-celery) initializes and starts making calls to our RDS database server before a new migration has been run.

Our main objective is to make sure migrations always run before Celery can start, so to prevent these crash loops from occurring we often use this in frontend.sh:

sleep 60
celery -A project.settings worker -l info &
sleep 60
celery -A project.settings beat -l info
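(Side note: we know fixed sleeps are only a guess at timing. A more deterministic gate would be to block until Django reports no unapplied migrations - a rough, untested sketch, assuming Django 3.1+ for migrate --check:)

# wait until all migrations are applied before starting the workers
until python3 manage.py migrate --check > /dev/null 2>&1; do
    echo "Waiting for migrations to be applied..."
    sleep 5
done
celery -A project.settings worker -l info &
celery -A project.settings beat -l info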

If you or someone on your team is able to check, are you able to see why this environment is still failing to deploy?

Qovery Staging App

I believe we've addressed all the underlying issues, yet Qovery still shows our container crashing. However, there is no longer a crash loop, and the app is running correctly per our live logs and checks.

Hey @ChrisBolman1,

From what I see, your app is still crash-looping, message being:

/usr/local/lib/python3.8/site-packages/storages/backends/s3boto3.py:281: UserWarning: The default behavior of S3Boto3Storage is insecure and will change in django-storages 2.0. By default files and new buckets are saved with an ACL of 'public-re
  warnings.warn(
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
/usr/local/lib/python3.8/site-packages/storages/backends/s3boto3.py:281: UserWarning: The default behavior of S3Boto3Storage is insecure and will change in django-storages 2.0. By default files and new buckets are saved with an ACL of 'public-re
  warnings.warn(
Operations to perform:
  Apply all migrations: account, admin, apiv1, auth, authtoken, billing, causes, common, community, contenttypes, conversion_views, django_celery_beat, forms, magic_link, management, products, sessions, sites, socialaccount, thumbnail, twilio_bo
Running migrations:
  No migrations to apply.
  Your models have changes that are not yet reflected in a migration, and so won't be applied.
  Run 'manage.py makemigrations' to make new migrations, and then re-run 'manage.py migrate' to apply them.
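In other words, the deployed image is missing migration files for those model changes. The fix the message points at is roughly (run locally, then commit the generated files so the deployed image contains them):

python manage.py makemigrations   # generate migrations for the pending model changes
python manage.py migrate          # apply them (your deploy already runs migrate, per the log above)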

Where do you check your app logs? Are you sure you are looking at the version being deployed?

Cheers

@bchastanier ah, thanks - I didn't realize that "pod name" is a filter with different options; I was only looking at the live running app. Very helpful.

@bchastanier we addressed the migrations, but deploys are still crashing. I see no errors now related to pod app-z2685498d-6969cd5c67-ff2cl?


Update: that one may have been a fluke - I redeployed and finally got the deploy to go through :white_check_mark:
