Basic replibyte setup and crazy memory usage

Issues information

  • OS: Ubuntu 20.04
  • databases: postgres
  • Programming language and version: Ruby 3.1.2
  • Link to your project on Github/Gitlab:

Your issue

Hello, all! I’ve been having some trouble getting replibyte to work. I’ve read the provided documentation and searched for discussions online, but I still don’t know what I’m doing wrong.

I have two issues:

  • First, I’ve created a simple, preliminary config file for testing purposes and succeeded in uploading my transformed database dump to a bucket in S3. Upon inspecting this dump, however, I noticed that no data had been transformed! I’ve copied my conf.yaml file below so you can help point out where I went wrong.
  • Second, I’ve only ever been able to create a dump using my test database. Transforming my production dump has, so far, always required more memory than I have available (upwards of 20GB of RAM!). The database in question is by no means tiny (close to 6GB after pg_restore), but I’ve heard of colleagues using replibyte on much larger data sets, so something funky must be going on.

Lastly, here are the commands I’ve been using to accomplish what little I’ve managed so far:

Transform and upload the dump:
cat test_dump.sql | replibyte -c conf.yaml dump create -n test_transform -i -s postgresql
Download the transformed dump (I haven’t been able to restore it into my local database either, so I save it to a local file instead):
replibyte -c conf.yaml dump restore local -i postgres -v test_transform -o > test_transform.sql
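
In case it helps anyone reproduce the first issue, here is a minimal sketch of one way to check whether anything in the dump changed. It assumes both dumps are plain SQL files on disk, and `count_differing_lines` is a hypothetical helper name of mine, not a replibyte command. A non-zero count only shows that something differs, not that the right columns were transformed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of replibyte): prints how many '<' and '>'
# lines diff reports between the original dump and the downloaded dump.
# A single changed line therefore counts twice (old version + new version).
count_differing_lines() {
  local original="$1" transformed="$2"
  # grep -c exits non-zero when the count is 0, so '|| true' keeps the
  # function from failing when the two files are identical.
  diff "$original" "$transformed" | grep -c '^[<>]' || true
}
```

With the file names from the commands above, `count_differing_lines test_dump.sql test_transform.sql` prints 0 when the two files are identical.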

Thanks in advance for any help given, and I hope we can sort this out so I can make use of this wonderful tool!

Dockerfile content (if any)

# Dockerfile development version
FROM ruby:3.1.2-bullseye

# Install Postgresql 14
RUN apt-get update -y
RUN apt install curl ca-certificates gnupg libzmq5-dev -y
RUN curl https://www.postgresql.org/media/keys/ACCC4CF8.asc \
          | gpg --dearmor \
          | tee /etc/apt/trusted.gpg.d/apt.postgresql.org.gpg >/dev/null
RUN sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ bullseye-pgdg main" > /etc/apt/sources.list.d/postgresql.list'
RUN apt update
RUN apt-get -y install postgresql-14

# Install node
RUN curl -sL https://deb.nodesource.com/setup_16.x -o /tmp/nodesource_setup.sh
RUN bash /tmp/nodesource_setup.sh
RUN apt install nodejs

# Skip installing gem documentation
RUN set -eux; \
	mkdir -p /usr/local/etc; \
	{ \
		echo 'install: --no-document'; \
		echo 'update: --no-document'; \
	} >> /usr/local/etc/gemrc

# Install gems
WORKDIR /app
COPY Gemfile Gemfile.lock ./
RUN gem install bundler
RUN mkdir -p vendor/cache
ARG BUNDLE_WITHOUT=development:test
RUN bundle config set without "$BUNDLE_WITHOUT"
RUN bundle check || bundle install --jobs $(nproc)
COPY . ./ 

# Start server
EXPOSE 3000
ENTRYPOINT ["/app/bin/docker-entrypoint.sh"]

CMD ["bin/rails", "server", "-b", "0.0.0.0"]

config.yaml content

source:
  connection_uri: $DATABASE_URL
  # database_subset: # downscale database while keeping it consistent
  transformers:
    - database: public
      table: users
      columns:
        - name: name
          transformer_name: first-name
        - name: email
          transformer_name: email
    - database: public
      table: people
      columns: 
        - name: email
          transformer_name: email
        - name: name
          transformer_name: first-name
datastore:
  aws:
    bucket: yuri-db-seed
    region: us-east-1
    credentials:
      access_key_id: $AWS_SEED_BUCKET_ACCESS_KEY_ID
      secret_access_key: $AWS_SEED_BUCKET_SECRET_ACCESS_KEY