Lifecycle services are often (but not always) failing, and we don't know why

INFORMATION

Relevant information to this issue:

  • Programming language and version: Ruby 3.2.2 + Rails 7.0.4
  • Link to your application: Qovery

ISSUE

When cloning our blueprint environment to create a Preview Environment, we have several Lifecycle jobs, such as:

  1. “Clone Staging DB” with the following rake task:
task :copy, %i[qovery_environment erase_data local_environment] => :environment do |_t, args|
  is_db_empty = ActiveRecord::Base.connection.tables.empty?

  puts "[1/2] Get source database credentials"
  source = db_name(args[:qovery_environment])

  if is_db_empty
    puts "[2/2] Import source database into destination database"
    `pg_dump -Fc #{source} | pg_restore --no-owner --no-acl -d #{destination_db_name}`
  elsif args[:erase_data] == "true"
    puts "[2/2] Import source database into destination database and erase existing data"
    `pg_dump -Fc #{source} | pg_restore --no-owner --no-acl -d #{destination_db_name} --clean`
  else
    puts "[2/2] Database is not empty, use `rake db:copy[#{args[:qovery_environment]},true]` to erase data"
  end
end
  2. “DB - run migration” which simply runs:
rails db:migrate
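
(Side note on the backticks in the :copy task above: Kernel#` does not raise when the command fails, and $? only reflects pg_restore, the last command in the pipe, so a failed pg_dump can silently look like a success. A minimal sketch of a stricter variant, assuming bash is available in the image; source and destination_db_name are the task's own helpers:)

# Run the pipe under bash with pipefail so a pg_dump failure is not
# masked by a succeeding pg_restore, and abort the task on failure.
ok = system(
  "bash", "-c",
  "set -o pipefail; pg_dump -Fc #{source} | " \
  "pg_restore --no-owner --no-acl -d #{destination_db_name}"
)
abort "[2/2] Database copy failed" unless ok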

Both of them sometimes work, but most of the time they fail on deployment. Sometimes they work better when we Force Run them.

We customized our Rakefile to get some logs and to print the exit status of each task, in order to understand why Qovery considers it a “failure” while our scripts seem to be successful.

Here is our Rakefile:

# frozen_string_literal: true

require 'English'

# Add your own tasks in files placed in lib/tasks ending in .rake,
# for example lib/tasks/capistrano.rake, and they will automatically be available to Rake.

# By default, ensure the current environment is "test".
ENV.update("RACK_ENV" => "test") if ARGV.empty? || ARGV.any? { |param| param.include?("parallel") }

require_relative "config/application"

Appsignal.config = Appsignal::Config.new(
  Rails.application.credentials.appsignal_push_api_key,
  ENV.fetch("RAILS_ENV", "development"),
  { log_path: "log" }
)

Appsignal.start
Appsignal.start_logger

Rails.application.load_tasks

task default: %i[spec]

task :before_hook do
  @start_time = Time.current
  @logger = Appsignal::Logger.new(ARGV.join(" "))

  puts "#{Time.now} - rake #{ARGV.join(' ')} - Start..."
  @logger.info "Start..."
end

task :after_hook do
  at_exit do
    end_time = Time.current
    duration = ((end_time - @start_time) * 1000).round(2)

    Appsignal.set_gauge("rake #{ARGV.join(' ')} duration", duration)

    @logger.info "Done. Duration: #{duration.round(2)}ms"

    if $ERROR_INFO.nil? || ($ERROR_INFO.is_a?(SystemExit) && $ERROR_INFO.success?)
      puts "#{Time.now} - rake #{ARGV.join(' ')} - Success - Duration: #{duration}ms"
    else
      code = $ERROR_INFO.is_a?(SystemExit) ? $ERROR_INFO.status : 1
      puts "#{Time.now} - rake #{ARGV.join(' ')} - Failure - Code: #{code} - Duration: #{duration}ms"
    end

    Appsignal.stop "rake"
    exit 0
  end
end

tasks = Rake.application.tasks

tasks.each do |task|
  next if [Rake::Task["before_hook"], Rake::Task["after_hook"]].include?(task)

  task.enhance(%i[before_hook after_hook])
end
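
(One detail worth flagging in this Rakefile, as an observation rather than a confirmed root cause: the unconditional exit 0 inside at_exit overrides whatever exit status the task produced, so even a failing run would report success to the caller. A sketch of a variant that keeps the original status:)

task :after_hook do
  at_exit do
    # ...same logging as above...
    Appsignal.stop "rake"
    # No `exit 0` here: calling exit inside an at_exit handler overrides
    # the process exit status, so a real failure would be masked as success.
  end
end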

The lifecycle service doing a simple rails db:migrate is also failing on our Staging environment (a non-cloned environment).

Do you have any suggestion on how to figure out why Qovery considers it a failure while it’s a success on our side?

Hello @jmeiss !

Are you sure your db:migrate returns a proper exit 0, or might it return something else due to warnings or errors?

Also, it seems there is a rails db:migrate:status command. Would it be possible to run it before / after the migration, just to see whether any migrations were applied?

Any chance you can print the return status of this task?
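
(For example, something along these lines; this is a generic shell check, nothing Qovery-specific:)

rails db:migrate:status   # state before
rails db:migrate
echo "db:migrate exit status: $?"
rails db:migrate:status   # state after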

The fact that it only sometimes works smells fishy.

Cheers !

Hey @bchastanier thanks for your reply and suggestion.

If you have a look at that service: Qovery

We are printing out the return status of each rake task, including rails db:migrate, via the edited Rakefile you can see in my previous post.

Judging by the logs, it looks successful.

I just tried to run the same command via qovery shell from a container running the same image tag, and I’m getting the same log, which looks successful:

Services:
✔ web
# rails db:migrate
2023-05-10 16:02:03 +0200 - rake  - Start...
2023-05-10T14:03:02.290Z pid=1883 tid=1ql3 INFO: Sidekiq 7.1.0 connecting to Redis with options {:size=>5, :pool_name=>"internal", :url=>"redis://:REDACTED@zafd24c96-redis:6379/0"}
2023-05-10 16:03:06 +0200 - rake  - Success - Duration: 62827.19ms

Running rails db:migrate:status gives the following log:

# rails db:migrate:status
2023-05-10 16:06:41 +0200 - rake  - Start...

database: postgres

 Status   Migration ID    Migration Name
--------------------------------------------------
   up     20220621155616  ********** NO FILE **********
   up     20220627112525  ********** NO FILE **********
   up     20220627112526  ********** NO FILE **********
   up     20220627112527  ********** NO FILE **********
   up     20220628094635  ********** NO FILE **********
   up     20220628154416  ********** NO FILE **********
   up     20220701123928  ********** NO FILE **********
   up     20220708141409  ********** NO FILE **********
   up     20220802213443  ********** NO FILE **********
   up     20220817131239  ********** NO FILE **********
   up     20220830143544  ********** NO FILE **********
   up     20220831144314  ********** NO FILE **********
   up     20220905132727  ********** NO FILE **********
   up     20220908140832  ********** NO FILE **********
   up     20220912091925  ********** NO FILE **********
   up     20220912142515  ********** NO FILE **********
   up     20220912153344  ********** NO FILE **********
   up     20220919123347  ********** NO FILE **********
   up     20220922123122  ********** NO FILE **********
   up     20220923103239  ********** NO FILE **********
   up     20220930123813  ********** NO FILE **********
   up     20221003132255  ********** NO FILE **********
   up     20221012135632  Install audited
   up     20221026145618  ********** NO FILE **********
   up     20221028142703  ********** NO FILE **********
   up     20221108112356  ********** NO FILE **********
   up     20221108151657  ********** NO FILE **********
   up     20221108153141  ********** NO FILE **********
   up     20221110142109  ********** NO FILE **********
   up     20221122134454  ********** NO FILE **********
   up     20221124194945  ********** NO FILE **********
   up     20221129214118  ********** NO FILE **********
   up     20221202124809  Init schema
   up     20221207140059  Create authentication
   up     20221216111845  Move authentications from user to authentications
   up     20221221180151  Add ai data to post records
   up     20230103105303  Add home ui settings to city
   up     20230103163416  Remove feature demand anomaly
   up     20230112165018  Remove current contributors
   up     20230119132456  Add privacy policy to spaces
   up     20230123102838  Rename typeform columns
   up     20230123153358  Create post permissions
   up     20230125162446  Create action mailbox tablesaction mailbox
   up     20230127135119  Set defualt value of fields config
   up     20230202123753  Add mail discussion id to post records
   up     20230208130701  Remove assoc name
   up     20230208132931  Add subject to note records
   up     20230209130113  Remove identities columns
   up     20230213143909  Add citizen notification settings to city
   up     20230226160117  Create newsletters
   up     20230227135317  Add sent at to newsletters
   up     20230306203633  Add arabic columns
   up     20230308102940  Add query to newsletter
   up     20230309094221  Add voting options to post record
   up     20230418133041  Add index for ahoy
   up     20230504085232  Delete is subscribed on mailchimp

2023-05-10 16:06:47 +0200 - rake  - Success - Duration: 6791.1ms

Do you have any suggestion on how I could print the same status that Qovery reads to set the deployment status of the service?

Is it possible to add --trace to all your rake / rails commands?
Same for pg_dump and pg_restore; I think there is a verbose flag, -v.
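
(Something like this, sketched against the task shown earlier; source and destination_db_name are the task's own helpers:)

# from a shell:
rails db:migrate --trace

# and inside the rake task, the same pipe with verbose flags:
`pg_dump -v -Fc #{source} | pg_restore -v --no-owner --no-acl -d #{destination_db_name}`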

I just did it on the lifecycle DB Migrate on Staging and now the deployment is successful :face_with_raised_eyebrow:
I’ll redeploy a few times to see whether it’s consistent or not.

** Invoke db:migrate (first_time)
** Invoke db:load_config (first_time)
** Invoke environment (first_time)
** Invoke before_hook (first_time)
** Execute before_hook
2023-05-10 16:28:06 +0200 - rake --trace - Start...
** Invoke after_hook (first_time)
** Execute after_hook
** Execute environment
** Invoke before_hook 
** Invoke after_hook 
** Execute db:load_config
** Invoke before_hook 
** Invoke after_hook 
** Execute db:migrate
** Invoke db:_dump (first_time)
** Invoke before_hook 
** Invoke after_hook 
** Execute db:_dump
2023-05-10 16:28:13 +0200 - rake --trace - Success - Duration: 6995.51ms

The issue reappeared when, on the same lifecycle service, I deployed a new image tag.

It failed with the following log:

But when Redeploying it, it worked.
I still don’t get what’s going on :thinking:

Not sure I understand why either; I will continue to look at it. One question I have though: why is the DB Clone Staging job in the application stage?
I would expect it to sit between the DB migration and Application steps. Am I missing anything?

That’s a good point. It might be a misconfiguration. I updated our blueprint. Thx

For the record, a GitHub Action was just triggered after a merge on master. The same issue appeared a second time:

Looping it back internally; someone might have a clue. Will let you know ASAP.

Hey @jmeiss,

Just to double-check: do you think it’s possible to add an explicit exit 0 to your job, to make sure a success status is returned?

From what I see on this one, for example, the job has run but is just hanging forever until the timeout, as if no return was specified.

[screenshot]

Cheers

Morning @bchastanier,

My last run was:

  • disable the custom Rakefile to get back to a default one
  • add an exit 0 at the end of our Clone DB Staging task, giving the following rake task:
  desc "Copy source database to destination database"
  task :copy, %i[qovery_environment erase_data local_environment] => :environment do |_t, args|
    is_db_empty = ActiveRecord::Base.connection.tables.empty?

    puts "[1/2] Get source database credentials"
    source = db_name(args[:qovery_environment])

    if is_db_empty
       puts "[2/2] Import source database into destination database"
       `pg_dump -Fc #{source} | pg_restore --no-owner --no-acl -d #{destination_db_name}`
    elsif args[:erase_data] == "true"
      puts "[2/2] Import source database into destination database and erase existing data"
      `pg_dump -Fc #{source} | pg_restore --no-owner --no-acl -d #{destination_db_name} --clean`
    else
      puts "[2/2] Database is not empty, use `rake db:copy[#{args[:qovery_environment]},true]` to erase data"
    end

    exit 0
  end

Result:

  • Deploy from CLI: KO
  • Re-run from UI: KO
  • Re-deploy from UI: OK

Also, there’s a difference in the Version column, where we can see that the working version fetched the image from the qovery-mirror. Any idea why it worked only with that version?

Can you share the CLI command you use to deploy the given app?
Also, which action are you using to rerun from the UI? Redeploy?

The first run via CLI raises perl warnings; are we sure those don’t return a non-zero exit status?

There is still a job not terminating its task and waiting for the timeout to be triggered:

[screenshot]

Looking at the pipeline settings, I also noticed that not all pipelines are set up the same: DB - run migration is sometimes set in a step before DB - Clone staging. I guess that shouldn’t be the case, right?

The staging environment has the same config.

I would expect all of them to have a pipeline similar to this one:

    - name: Create environment
      if: ${{ env.environment_exists != 'true' }}
      shell: bash
      run: |
        qovery environment clone \
          --organization my_org \
          --project WebApp \
          --environment blueprint \
          --environment-type preview \
          --new-environment-name "${{ env.environment }}" \
          --apply-deployment-rule

    - name: Deploy databases
      if: ${{ env.environment_exists != 'true' }}
      shell: bash
      run: |
        qovery database deploy \
          --organization my_org \
          --project WebApp \
          --environment "${{ env.environment }}" \
          --database postgresql \
          --watch

        qovery database deploy \
          --organization my_org \
          --project WebApp \
          --environment "${{ env.environment }}" \
          --database redis \
          --watch

    - name: Clone Staging Database
      if: ${{ env.environment_exists != 'true' }}
      shell: bash
      run: |
        qovery lifecycle deploy \
          --organization my_org \
          --project WebApp \
          --environment "${{ env.environment }}" \
          --lifecycle "DB - Clone Staging" \
          --tag ${{ env.image_tag }} \
          --watch

    - name: Configure environment
      if: ${{ env.environment_exists != 'true' }}
      shell: bash
      run: |
        qovery container env create \
          --organization my_org \
          --project WebApp \
          --environment "${{ env.environment }}" \
          --container web \
          --scope environment \
          --key PULL_REQUEST_ID \
          --value ${{ steps.PR.outputs.number }}

        qovery lifecycle deploy \
          --organization my_org \
          --project WebApp \
          --environment "${{ env.environment }}" \
          --lifecycle "ENV - Setup / Destroy preview environment" \
          --tag ${{ env.image_tag }} \
          --watch

    - name: "Set env GIT_COMMIT_ID=${{ env.image_tag }}"
      shell: bash
      run: |
        qovery container env delete \
          --organization my_org \
          --project WebApp \
          --environment "${{ env.environment }}" \
          --container web \
          --key GIT_COMMIT_ID

        qovery container env create \
          --organization my_org \
          --project WebApp \
          --environment "${{ env.environment }}" \
          --container web \
          --key GIT_COMMIT_ID \
          --value ${{ env.image_tag }} \
          --scope ENVIRONMENT

    - name: Deploy containers
      shell: bash
      run: |
        qovery container deploy \
          --organization my_org \
          --project WebApp \
          --environment "${{ env.environment }}" \
          --containers "web,worker,crontab" \
          --tag ${{ env.image_tag }} \
          --watch

    - name: Run DB migration
      shell: bash
      run: |
        qovery lifecycle deploy \
          --organization my_org \
          --project WebApp \
          --environment "${{ env.environment }}" \
          --lifecycle "DB - run migration" \
          --tag ${{ env.image_tag }} \
          --watch

Sorry, yes, I meant Redeploy.

To be frank, I don’t know. Yesterday I was testing with the DB migrate lifecycle, which does a simple rails db:migrate, and I was getting the same behaviour (from what I remember; after that many tries, I’m not sure anymore :confused:).

Yes, that’s because the pipeline you’re looking at was created before the blueprint pipeline update you suggested yesterday.
If you want to have a look, you should only check [PR-10460] Try to fix lifecycle deployment, which is the pipeline where I’m currently doing all my tests.

My last assumptions are:

  • DB - Clone Staging fails because of the perl warning. Any suggestion on how to fix it? I tried adding these lines to the Dockerfile, but without success (see the sketch after this list):
# https://www.thomas-krenn.com/en/wiki/Perl_warning_Setting_locale_failed_in_Debian
RUN export LANGUAGE=en_US.UTF-8
RUN export LC_ALL=en_US.UTF-8
RUN export LANG=en_US.UTF-8
RUN locale-gen en_US.UTF-8
RUN dpkg-reconfigure locales
  • Something in the before_hook and/or after_hook is causing the job to not terminate. I removed the hooks to get back to a default Rakefile and see whether that fixes it. For now, DB - migration doesn’t fail on the pipeline benefiting from that last change.
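
(On the locale fix, a hedged sketch assuming a Debian-based image: each RUN executes in its own shell, so RUN export LANG=... is lost as soon as that build step finishes and never reaches the running container. Locale settings usually need ENV instructions instead:)

# Generate the locale once at build time, then persist the variables
# into the final image with ENV (RUN export does not persist).
RUN apt-get update && apt-get install -y locales \
    && sed -i 's/^# *en_US.UTF-8 UTF-8/en_US.UTF-8 UTF-8/' /etc/locale.gen \
    && locale-gen
ENV LANG=en_US.UTF-8 \
    LANGUAGE=en_US.UTF-8 \
    LC_ALL=en_US.UTF-8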

I am not a perl expert unfortunately :frowning: but it seems some runs succeeded even with those warnings, so it’s maybe not the issue.

Let’s see how it goes then. From what I see, it seems to work now, but let me know :slight_smile:

Cheers

I retried several times to Deploy / Redeploy / Force Run and I’m getting inconsistent results.

Same observation: the warning doesn’t prevent the job from succeeding, but it kinda looks like it’s always failing on the first try.

I’m still digging and I’ll keep you posted.

My last try is a very basic test.

I cloned blueprint to create https://console.qovery.com/organization/12e0c625-6fb8-434b-adeb-041614fa7a46/project/a7f79a8f-8c0e-46a5-8748-1ba9846481c9/environment/95274c99-1382-4725-a3aa-8dddab8abb15/services/deployments

The lifecycle job DB - Clone Staging failed on deploy from the CLI.

The thing is, this job only prints a message and then exits 0:

  task :copy, %i[qovery_environment erase_data local_environment] => :environment do |_t, args|
    # is_db_empty = ActiveRecord::Base.connection.tables.empty?

    # puts "[1/2] Get source database credentials"
    # source = db_name(args[:qovery_environment])

    # if is_db_empty
    #    puts "[2/2] Import source database into destination database"
    #    `pg_dump -Fc #{source} | pg_restore --no-owner --no-acl -d #{destination_db_name}`
    # elsif args[:erase_data] == "true"
    #   puts "[2/2] Import source database into destination database and erase existing data"
    #   `pg_dump -Fc #{source} | pg_restore --no-owner --no-acl -d #{destination_db_name} --clean`
    # else
    #   puts "[2/2] Database is not empty, use `rake db:copy[#{args[:qovery_environment]},true]` to erase data"
    # end
    puts "That task does nothing except printing this message then exit 0."

    exit 0
  end

Then I Force Ran it from the UI, and now it succeeds. :exploding_head:

I can’t make sense of any of it :confused:

I did the same test several times.

Now I’m getting a reproducible sequence:

  1. Deploy a very simple rake task (it only does a puts) via the CLI: it fails, but it looks like the job did not finish, because I can “Stop” it and the logs are still in a loading state.
  2. “Force run” or “Redeploy” it, and then it succeeds.

Someone from my team suggested this issue might be related since we are using newrelic: Hangs with rake task · Issue #340 · newrelic/newrelic-ruby-agent · GitHub. But it didn’t solve the problem.
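
(A hedged aside: one possible mitigation for that newrelic issue is to disable the agent for one-off rake tasks via its standard environment variable, e.g.:)

# Disable the New Relic agent for this single invocation only.
NEW_RELIC_AGENT_ENABLED=false rails db:migrate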

That issue is the last blocker preventing us from migrating our Production server off Heroku :confused:

I just saw that log:

before I refreshed the tab to see the puts I was expecting:

I don’t know if this can help you.

Hey @jmeiss,

The first log you see is just our agent starting to look for container logs, which are not there yet because the container is still being created. It’s like that for all apps; nothing is wrong here. We can do something to improve that, but it’s not the root cause of your issue.

The issue you shared really looks like what we are seeing :confused:

To be honest, I think there is an issue with your app / setup. I have little to no experience with the Rails stack, but maybe someone here with a similar setup can provide a hint; I will try to find someone.

I will come back to you; in the meantime, please keep me posted.

Cheers