Why this migration, and why now
The project in question was a large legacy Django application — years of accumulated features, a growing test suite, and a deployment process that involved SSH and crossed fingers. It worked, until it didn't. Scaling meant vertical scaling. Deployments meant downtime. Monitoring meant checking the error inbox.
The goal wasn't to rewrite anything. It was to Dockerise the application, move it to AWS, introduce proper infrastructure-as-code, and add observability — without breaking existing functionality or taking months to do it.
Step 1: Dockerise the application
Docker is the foundation. Before any AWS resources exist, the application needs to run identically in a container on your laptop and in production. This forces you to resolve all the "works on my machine" issues before they become production incidents.
The key decisions for a Django Dockerfile:
- Use a slim Python base image (python:3.12-slim); avoid Alpine unless you have a specific reason
- Run as a non-root user: a security baseline, not optional
- Copy requirements.txt before the source code so layer caching keeps rebuilds fast
- Use gunicorn with uvicorn workers for async Django views
- Store secrets in environment variables, never baked into the image
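As a rough illustration of the gunicorn and environment-variable points above, a gunicorn config file is itself Python. The sketch below is a starting point under stated assumptions, not the project's actual configuration: the module name myproject, the PORT and WEB_CONCURRENCY variable names, and the worker-count fallback are all assumptions. Newer uvicorn releases also point to a separate uvicorn-worker package for the worker class, so check which applies to your version.

```python
# gunicorn.conf.py, loaded with: gunicorn -c gunicorn.conf.py myproject.asgi:application
# "myproject" is a placeholder module name, not taken from the original project.
import multiprocessing
import os

# Bind to all interfaces inside the container; the port is overridable via env.
bind = f"0.0.0.0:{os.environ.get('PORT', '8000')}"

# Uvicorn's gunicorn worker class, so async Django views run on an event loop.
worker_class = "uvicorn.workers.UvicornWorker"

# Worker count comes from the environment so the same image can be tuned per
# task size; the (2 x cores) + 1 fallback is only a common starting point.
workers = int(os.environ.get("WEB_CONCURRENCY", multiprocessing.cpu_count() * 2 + 1))

# Log to stdout/stderr so the container runtime collects the output.
accesslog = "-"
errorlog = "-"
```

Keeping the tunables in environment variables means the same image runs unchanged on a laptop and in production, with only the task environment differing.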
A docker-compose.yml that matches production configuration — PostgreSQL, Redis, the Django app — makes local development reliable and eliminates environment-specific bugs before they reach CI.
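One way to keep local and production configuration aligned is to have Django settings read everything from the environment that docker-compose (locally) and the container platform (in production) provide. The helpers below are a minimal sketch; the function names and the DATABASE_URL, REDIS_URL, and DJANGO_SECRET_KEY variable names are assumptions for illustration, not part of the original project.

```python
import os
from urllib.parse import unquote, urlparse


def database_from_url(url: str) -> dict:
    """Turn postgres://user:pass@host:5432/name into a Django DATABASES entry."""
    parts = urlparse(url)
    return {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": parts.path.lstrip("/"),
        "USER": unquote(parts.username or ""),
        "PASSWORD": unquote(parts.password or ""),
        "HOST": parts.hostname or "",
        "PORT": str(parts.port or 5432),
    }


def require_env(name: str) -> str:
    """Fail at startup, not at first request, when a setting is missing."""
    try:
        return os.environ[name]
    except KeyError:
        raise RuntimeError(f"Missing required environment variable: {name}") from None


# In settings.py (hypothetical variable names):
# SECRET_KEY = require_env("DJANGO_SECRET_KEY")
# DATABASES = {"default": database_from_url(require_env("DATABASE_URL"))}
# CACHES = {"default": {"BACKEND": "django.core.cache.backends.redis.RedisCache",
#                       "LOCATION": require_env("REDIS_URL")}}
```

Failing loudly on a missing variable tends to surface configuration drift in CI or at container start rather than in the middle of a request.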
Step 2: CI/CD with GitHub Actions
Before deploying anything to AWS, the CI pipeline needs to build the Docker image, run the test suite inside it, and push the image to ECR on merge to main. This gives you confidence that every deployment is built from a tested, immutable artefact.
- name: Build and push to ECR
  run: |
    docker build -t "$ECR_REGISTRY/$ECR_REPO:$GITHUB_SHA" .
    docker push "$ECR_REGISTRY/$ECR_REPO:$GITHUB_SHA"
Tag images with the Git commit SHA, not latest. This makes rollbacks trivial: deploy the previous SHA rather than rebuilding.
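To make the rollback idea concrete: because every image is addressed by its commit SHA, rolling back amounts to pointing the container definition at an earlier SHA. The sketch below uses hypothetical helper names (image_uri, with_image) and assumes container definitions shaped like those in an ECS task definition, with name and image keys. It only prepares the data; registering the new task definition revision is left out.

```python
import copy


def image_uri(registry: str, repo: str, sha: str) -> str:
    """Build the immutable image reference that CI pushed for a commit."""
    return f"{registry}/{repo}:{sha}"


def with_image(container_definitions: list, container_name: str, uri: str) -> list:
    """Return a copy of the container definitions with one image swapped."""
    updated = copy.deepcopy(container_definitions)
    for container in updated:
        if container["name"] == container_name:
            container["image"] = uri
            return updated
    raise ValueError(f"No container named {container_name!r}")
```

A deploy and a rollback then go through the same code path; the only difference is which SHA is passed in.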
Step 3: Infrastructure as code with Terraform
Every AWS resource is defined in Terraform before it's created. This isn't optional — manual console configuration doesn't survive team growth, incident recovery, or environment duplication.
The initial Terraform modules we wrote:
- VPC with public and private subnets, NAT gateway, and route tables
- ECR repository with lifecycle policies to clean up old images
- RDS PostgreSQL with parameter groups, automated backups, and multi-AZ
- ElastiCache Redis cluster for session and cache storage
- ECS Fargate service with task definitions referencing ECR images
- Application Load Balancer with HTTPS termination and health checks
- IAM roles with least-privilege policies for each service
Run terraform plan before every apply. Review the diff. Never apply in anger.
Step 4: Introduce SQS for async work
The monolith had several synchronous operations that shouldn't be synchronous — sending emails, generating reports, calling third-party APIs. We introduced SQS queues and Lambda functions to handle these asynchronously.
The pattern is simple: Django views publish a message to SQS instead of doing the work inline, and a Lambda function (or Celery worker) consumes the queue. Benefits:
- Web request response times drop dramatically
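- Failed jobs retry automatically: SQS redelivers a message once its visibility timeout expires, and the consumer can lengthen that timeout on each attempt to get exponential backoff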
- Dead-letter queues capture messages that exhaust retries for manual review
- Queue depth is a leading indicator of system stress — visible in CloudWatch
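The publish-and-consume pattern described above can be sketched as follows. This is an illustration under assumptions, not the project's code: the job types, the message shape, and the function names are hypothetical. The sqs_client argument is expected to be a boto3 SQS client, and the handler's return value uses Lambda's partial batch response format, which only takes effect when ReportBatchItemFailures is enabled on the event source mapping.

```python
import json


def enqueue(sqs_client, queue_url: str, job_type: str, payload: dict) -> str:
    """Publish a job description to SQS and return the message ID."""
    response = sqs_client.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"type": job_type, "payload": payload}),
    )
    return response["MessageId"]


def handle_job(job: dict) -> None:
    """Placeholder for the real work (send the email, build the report)."""
    if job["type"] not in {"send_email", "generate_report"}:
        raise ValueError(f"Unknown job type: {job['type']}")


def lambda_handler(event, context):
    """Process an SQS batch and report only the failed messages for retry."""
    failures = []
    for record in event["Records"]:
        try:
            handle_job(json.loads(record["body"]))
        except Exception:
            # Failed messages return to the queue; successful ones are deleted.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

In a Django view, the inline work would be replaced by a single enqueue call, so the response returns as soon as the message is accepted by SQS.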
Step 5: Observability with CloudWatch
You cannot operate what you cannot observe. We configured CloudWatch from day one:
- Custom metrics emitted from the application for business-level events (orders processed, API calls made, export jobs queued)
- Alarms on P99 latency, error rate, and queue depth with SNS notifications to the team Slack channel
- Log groups with metric filters to surface application errors without tailing logs manually
- CloudWatch Dashboards with a single view of application health, queue depth, and database connections
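For the custom business metrics in the first bullet, a thin wrapper around CloudWatch's PutMetricData call is often enough to start. The sketch below assumes cloudwatch_client is a boto3 CloudWatch client; the MyApp/Business namespace, the record_event name, and the dimension names are hypothetical.

```python
def record_event(cloudwatch_client, name: str, value: float = 1.0,
                 unit: str = "Count", **dimensions: str) -> None:
    """Emit one data point for a business-level event."""
    cloudwatch_client.put_metric_data(
        Namespace="MyApp/Business",  # hypothetical namespace
        MetricData=[{
            "MetricName": name,
            # Sorted so the same dimensions always produce the same metric.
            "Dimensions": [{"Name": k, "Value": v}
                           for k, v in sorted(dimensions.items())],
            "Value": value,
            "Unit": unit,
        }],
    )
```

A call such as record_event(client, "OrdersProcessed", Environment="staging") would then feed both dashboards and alarms. Each call is a network request, so on a hot path it is probably worth batching data points or emitting them from the async workers instead.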
The first time an alarm fires in a staging environment rather than being discovered by a user in production, the investment pays for itself.
What we would do differently
Start with Terraform, not the console. The temptation to click through the AWS console to "just get it working" is strong and costly — you'll spend twice as long codifying what you built manually. Write the Terraform first, even if it's slower initially.
Instrument before you migrate. Add application metrics and logging to the monolith before it moves to AWS. You want a baseline of normal behaviour to compare against after the migration.
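A low-effort way to get that baseline is a request-timing middleware added to the monolith before anything moves. The sketch below follows Django's callable middleware shape but imports nothing from Django; the class name, logger name, and log format are assumptions. It covers synchronous request handling only.

```python
import logging
import time

logger = logging.getLogger("request_timing")


class RequestTimingMiddleware:
    """Django-style middleware that logs method, path, status, and duration."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        started = time.perf_counter()
        response = self.get_response(request)
        elapsed_ms = (time.perf_counter() - started) * 1000
        logger.info(
            "method=%s path=%s status=%s duration_ms=%.1f",
            request.method, request.path, response.status_code, elapsed_ms,
        )
        return response
```

Running this for a few weeks before the migration gives per-endpoint latency numbers to compare against once the application is on AWS.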
Don't underestimate IAM. Every hour spent designing minimal IAM policies at the start saves days of debugging mysterious permission errors later.
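As an example of what minimal can look like, here is a sketch of a policy document for a web task that should only be able to publish to one queue. In the setup described here the policy would be attached through Terraform; the document itself is plain JSON, shown in Python to make its shape explicit. The function name, the Sid, and the queue ARN in the usage note are hypothetical.

```python
import json


def send_only_policy(queue_arn: str) -> dict:
    """IAM policy allowing sqs:SendMessage on exactly one queue, nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "PublishJobsOnly",
            "Effect": "Allow",
            "Action": ["sqs:SendMessage"],
            "Resource": [queue_arn],
        }],
    }


def to_json(policy: dict) -> str:
    """Serialise the policy for use as an IAM policy document."""
    return json.dumps(policy, indent=2)
```

For instance, to_json(send_only_policy("arn:aws:sqs:eu-west-1:123456789012:jobs")) yields a document with no wildcards in either the action or the resource, which makes a later permission error much easier to trace to a specific missing action.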