At 2:23 AM last Tuesday, something went wrong with our database. The most recent backup was a few days old, so recovery took longer than it should have. That’s when I realized our backup system had to be better.

Managed databases can be expensive to run. Going self-managed is the alternative, but it comes with its own responsibilities, resiliency chief among them. Part of that, for me, means building a backup system that:
Runs automatically at 2 AM
Retries up to 10 times if something fails
Sends Slack notifications on success or failure
Stores backups in S3
Makes restoration a single command
No database was harmed in the making of this system. Here's how I did it.
The Stack
Database: PostgreSQL 17 in Docker
Storage: AWS S3 with lifecycle policies
Orchestration: Bash scripts + cron
Notifications: Slack webhooks
Backup tool: pg_dump (PostgreSQL's built-in tool)
Architecture Overview
The backup system has three main components:
The Backup Script - Handles the actual pg_dump, S3 upload, and local cleanup. Includes retry logic for resilience.
Cron Jobs - Schedule backups (production daily, staging weekly).
Notification Webhook - POSTs status updates directly to a Slack webhook.
The flow is straightforward: cron triggers the script → the script backs up the database → uploads to S3 → sends a notification → cleans up old local files.
Implementation
Step 1: Configure AWS and Docker
AWS Setup:
First, configure the AWS CLI on your host machine and create S3 buckets for your backups. Install the AWS CLI if it isn’t already installed.
aws configure
# Enter your credentials and region
# Test access
aws s3 ls
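The scripts below assume two buckets, myapp-backups and myapp-staging-backups. If you haven't created them yet, something like this works (bucket names are globally unique, so substitute your own):
aws s3 mb s3://myapp-backups --region us-east-1
aws s3 mb s3://myapp-staging-backups --region us-east-1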
Docker Setup:
Add a volume mount to your database service in compose.yml:
services:
  db:
    image: postgres:17.2-bookworm
    volumes:
      - db_data:/var/lib/postgresql/data
      - ./backups:/backups  # Add this for backup files
    environment:
      POSTGRES_DB: myapp_db
      POSTGRES_USER: myapp_user
      POSTGRES_PASSWORD: ${DB_PASSWORD}

volumes:
  db_data:
Restart your containers to apply the changes:
docker compose up -d
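Optionally, a quick sanity check that the bind mount is in place (this assumes the service is named db, as above):
# create a marker file from inside the container...
docker compose exec db touch /backups/mount-test
# ...confirm it shows up on the host, then remove it
ls -l ./backups/mount-test
docker compose exec db rm /backups/mount-test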
Production Note: I actually use Docker Swarm in production/staging for better orchestration. The backup strategy is pretty much the same; just update the container filter to docker ps -qf "name=stackname_db.1" instead of "name=projectname_db". I'll cover this in the Production Considerations section.
Step 2: The Backup Script
Create ~/scripts/backup.sh with three key pieces:
run_backup() - Executes pg_dump via docker exec, uploads to S3, and cleans up old local files
send_notification() - POSTs backup status directly to Slack
Retry logic - attempts the backup up to 10 times at 10-minute intervals
Here's the simplified structure:
#!/bin/bash
set -euo pipefail

ENV="${1:-prod}"
MAX_RETRIES=10
RETRY_INTERVAL=600  # 10 minutes

# load configs (Slack webhook, AWS region, etc.)
source ~/.backup_env

# environment-specific config
if [ "$ENV" == "prod" ]; then
  DB_NAME="myapp_prod"
  DB_USER="myapp_user"
  S3_BUCKET="myapp-backups"
else
  DB_NAME="myapp_staging"
  DB_USER="myapp_user"
  S3_BUCKET="myapp-staging-backups"
fi

BACKUP_FILE="backup_$(date +%Y-%m-%dT%H-%M-%S).dump"

send_notification() {
  local status=$1
  local error_msg=${2:-""}

  # determine color
  local color="good"
  [[ "$status" == *"failure"* ]] && color="danger"

  # build the Slack payload
  local payload=$(cat <<EOF
{
  "text": "DB Backup for ${ENV}: ${status}",
  "attachments": [{
    "color": "${color}",
    "fields": [
      {"title": "Environment", "value": "${ENV}", "short": true},
      {"title": "Status", "value": "${status}", "short": true},
      {"title": "Backup File", "value": "${BACKUP_FILE}", "short": false}
      $([ -n "$error_msg" ] && echo ", {\"title\": \"Error\", \"value\": \"${error_msg}\", \"short\": false}")
    ]
  }]
}
EOF
  )

  # send to Slack
  curl -X POST "$SLACK_WEBHOOK_URL" \
    -H "Content-Type: application/json" \
    -d "$payload" --silent --show-error
}

run_backup() {
  # find database container
  local container_id=$(docker ps -qf "name=db")

  # run pg_dump inside the container
  # ('|| return 1' because 'set -e' is suppressed while run_backup runs inside an if condition)
  docker exec "$container_id" \
    pg_dump -U "$DB_USER" -d "$DB_NAME" \
    -Fc -b -v -f "/backups/$BACKUP_FILE" || return 1

  # upload to S3
  aws s3 cp ~/backups/"$BACKUP_FILE" \
    s3://"$S3_BUCKET"/ --region us-east-1 || return 1

  # keep only the last 2 local backups
  ls -t ~/backups/*.dump | tail -n +3 | xargs -r rm
}

# Main execution with retry logic
if run_backup; then
  send_notification "success"
  exit 0
else
  send_notification "first_failure" "Initial backup failed"
  for i in $(seq 1 $MAX_RETRIES); do
    sleep $RETRY_INTERVAL
    if run_backup; then
      send_notification "success_after_retry"
      exit 0
    fi
  done
  send_notification "final_failure" "All retries exhausted"
  exit 1
fi
Breaking down the docker exec command with pg_dump:
docker exec 42f3acbe6d70 pg_dump -U $DB_USER -d $DB_NAME -Fc -b -v -f /backups/$BACKUP_FILE
docker exec - Execute a command in a running container
42f3acbe6d70 - The database container's ID
pg_dump - Exports a PostgreSQL database as an SQL script or to another format
-U - User name to connect as
-d - Name of the database to connect to
-Fc - Selects the output format; c is the custom-format archive suitable for input into pg_restore
-b - Include large objects in the dump
-v - Verbose mode
-f /backups/$BACKUP_FILE - Output file path
Key decisions explained:
Why -Fc format? Custom format is compressed and allows selective restoration of specific tables.
Why retry logic? Networks fail. AWS can hiccup. Retries make the system resilient.
Why keep 2 local backups? Quick access for recent restores without hitting S3, but doesn't fill the disk.
Why direct Slack posting? Fewer dependencies. If you already have a notification service, you could POST there instead and let it handle formatting. Slack's documentation covers setting up an incoming webhook.
What it looks like on Slack: [screenshot of the backup notification]
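If you want to sanity-check the webhook itself before wiring it into the script, a one-off curl is enough (substitute your own webhook URL; the script reads it from ~/.backup_env, created below):
curl -X POST "https://hooks.slack.com/services/YOUR/WEBHOOK/URL" \
  -H "Content-Type: application/json" \
  -d '{"text": "Backup notifications wired up correctly"}'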

Make the script executable:
chmod +x ~/scripts/backup.sh
Create the environment file with your secrets:
cat > ~/.backup_env << 'EOF'
SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
EOF
chmod 600 ~/.backup_env # only the file's owner has read and write access
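Before handing things over to cron, it's worth one manual run against staging (assuming the staging database container is running on this host) to confirm the dump, the S3 upload, and the Slack notification end to end:
~/scripts/backup.sh staging
aws s3 ls s3://myapp-staging-backups/ | tail -n 1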
Step 3: Automation with Cron
Create the log directory first:
mkdir -p ~/.db_backup_logs
Then schedule the backups to run automatically:
crontab -e
# add these lines:
# Production - daily at 2 AM
0 2 * * * /home/username/scripts/backup.sh prod >> /home/username/.db_backup_logs/backup-prod.log 2>&1
# Staging - Sundays at 2 AM
0 2 * * 0 /home/username/scripts/backup.sh staging >> /home/username/.db_backup_logs/backup-staging.log 2>&1
# Cleanup old logs - daily at 3 AM
0 3 * * * find /home/username/.db_backup_logs -name "*.log" -mtime +30 -delete
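To confirm the schedule took, list the crontab and check the log after the first scheduled run:
crontab -l
tail -n 50 ~/.db_backup_logs/backup-prod.log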
Step 4: The Restoration Script
Backups are useless if you can't restore them. You could restore manually, but why not script that too? Create ~/scripts/db-restore.sh:
#!/bin/bash
set -euo pipefail

# Usage: ./db-restore.sh prod backup_2025-12-13T02-00-00.dump [--full]
ENV="${1}"
BACKUP_FILE="${2}"
FULL_RESTORE="${3:-}"

# Pick the target database from the environment argument
if [ "$ENV" == "prod" ]; then
  DB_NAME="myapp_prod"
else
  DB_NAME="myapp_staging"
fi

# Download from S3
mkdir -p ~/restore  # ensure the local restore directory exists
aws s3 cp s3://myapp-backups/$BACKUP_FILE ~/restore/

# Copy to container
CONTAINER_ID=$(docker ps -qf "name=db")
docker cp ~/restore/$BACKUP_FILE $CONTAINER_ID:/tmp/

# Full restore: drop and recreate the database (with confirmation prompt)
if [ "$FULL_RESTORE" == "--full" ]; then
  read -rp "This will DROP and recreate $DB_NAME. Continue? (y/N) " CONFIRM
  [ "$CONFIRM" == "y" ] || exit 1
  docker exec -i $CONTAINER_ID psql -U postgres <<EOF
DROP DATABASE IF EXISTS $DB_NAME;
CREATE DATABASE $DB_NAME;
EOF
fi

# Run pg_restore
docker exec $CONTAINER_ID \
  pg_restore -U myapp_user -d $DB_NAME \
  --clean --if-exists -v /tmp/$BACKUP_FILE

# Cleanup
docker exec $CONTAINER_ID rm /tmp/$BACKUP_FILE
rm ~/restore/$BACKUP_FILE
Make it executable:
chmod +x ~/scripts/db-restore.sh
Testing restoration is very important; do it regularly, perhaps monthly, so you know your backups are actually restorable.
# Download latest prod backup
LATEST=$(aws s3 ls s3://myapp-backups/ | sort | tail -n 1 | awk '{print $4}')
# Restore to staging
./scripts/db-restore.sh staging $LATEST --full
Production Considerations
Docker Swarm vs Compose
In production, I use Docker Swarm instead of Compose for better orchestration and secrets management. It’s actually the same backup strategy with a few changes:
Container naming:
Compose: projectname_db_1
Stack: stackname_db.1.container_id
Finding containers:
# works for both
docker ps -qf "name=db"
Secrets management:
Docker Swarm handles secrets more securely:
services:
  db:
    secrets:
      - db_password
    environment:
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password

secrets:
  db_password:
    external: true
Create the secret:
# I find stdin to be easier to work with
printf "your-database-password-here" | docker secret create db-password -
Then in the backup script, read the password from inside the container:
docker exec $CONTAINER_ID sh -c \
"PGPASSWORD=\$(cat /run/secrets/db_password) pg_dump ..."
S3 Lifecycle Policies
If you want to prevent backup costs from growing forever, consider auto-deleting old backups:
Go to S3 Console → Your bucket → Management → Lifecycle rules
Create rule: Delete objects older than 30 days
Apply to prefix:
prod-db-backups/
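If you prefer the CLI to the console, the same rule can be applied with s3api. A sketch, assuming the prod-db-backups/ prefix above and a rule name I made up:
cat > lifecycle.json << 'EOF'
{
  "Rules": [{
    "ID": "expire-old-db-backups",
    "Filter": {"Prefix": "prod-db-backups/"},
    "Status": "Enabled",
    "Expiration": {"Days": 30}
  }]
}
EOF
aws s3api put-bucket-lifecycle-configuration \
  --bucket myapp-backups \
  --lifecycle-configuration file://lifecycle.json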
Monitoring
Beyond Slack notifications, you could also:
Review logs: tail -50 ~/.db_backup_logs/backup-prod.log
Check S3: aws s3 ls s3://myapp-backups/ | tail
Test restoration at intervals (as mentioned above)
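Beyond that, a small cron-able freshness check makes a nice backstop. A rough sketch, assuming GNU date and the same ~/.backup_env:
#!/bin/bash
# Alert in Slack if the newest prod backup is older than ~25 hours
set -euo pipefail
source ~/.backup_env
LATEST=$(aws s3 ls s3://myapp-backups/ | sort | tail -n 1 | awk '{print $1" "$2}')
if (( $(date +%s) - $(date -d "$LATEST" +%s) > 90000 )); then
  curl --silent -X POST "$SLACK_WEBHOOK_URL" \
    -H "Content-Type: application/json" \
    -d '{"text": "Warning: no fresh prod DB backup in the last 25 hours"}'
fi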
The Result
Automated nightly backups
Slack notifications on every success and failure
A tested restoration process
Off-site storage in S3
and of course, peace of mind, heh…
S3 buckets are way cheaper than RDS instances, so this setup keeps project costs low without sacrificing the safety net of reliable backups.
Try It Yourself
The full scripts are available in this GitHub Gist. The system is environment-agnostic; just update the configuration variables for your setup.
Questions? Drop them in the comments.
Originally published at blog.theolujay.dev