Migrating a Django Project to AWS: A Practical Step-by-Step Guide

From a single Dockerised monolith to a resilient AWS architecture with Lambda, SQS, CloudWatch alarms, and Terraform-managed infrastructure. The real decisions, trade-offs, and lessons from doing this in production.

Why this migration, and why now

The project in question was a large legacy Django application — years of accumulated features, a growing test suite, and a deployment process that involved SSH and crossed fingers. It worked, until it didn't. Scaling meant vertical scaling. Deployments meant downtime. Monitoring meant checking the error inbox.

The goal wasn't to rewrite anything. It was to Dockerise the application, move it to AWS, introduce proper infrastructure-as-code, and add observability — without breaking existing functionality or taking months to do it.

Step 1: Dockerise the application

Docker is the foundation. Before any AWS resources exist, the application needs to run identically in a container on your laptop and in production. This forces you to resolve all the "works on my machine" issues before they become production incidents.

The key decisions for a Django Dockerfile:

  • Use a slim Python base image (python:3.12-slim) — avoid Alpine unless you have a specific reason
  • Run as a non-root user — security baseline, not optional
  • Copy requirements.txt before source code — layer caching makes rebuilds fast
  • Use gunicorn with uvicorn workers, serving the project's ASGI application, if you have async Django views
  • Store secrets in environment variables, never baked into the image
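To make the last point concrete, here is a minimal sketch of reading configuration from the environment at startup. The helper names (env_setting, env_bool, MissingSetting) are our own illustrations, not part of Django; the idea is only that a missing secret should stop the container at boot rather than surface later as a confusing runtime error.

```python
import os


class MissingSetting(RuntimeError):
    """Raised when a required environment variable is absent."""


def env_setting(name, default=None, environ=None):
    """Return the value of an environment variable, or fail loudly.

    With no default, a missing variable raises MissingSetting so that a
    misconfigured container stops at startup instead of at first use.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name, default)
    if value is None:
        raise MissingSetting(f"environment variable {name} is not set")
    return value


def env_bool(name, default=False, environ=None):
    """Interpret common truthy strings ("1", "true", "yes", "on") as True."""
    raw = env_setting(name, str(default), environ)
    return raw.strip().lower() in {"1", "true", "yes", "on"}
```

In settings.py this might be used as SECRET_KEY = env_setting("DJANGO_SECRET_KEY") and DEBUG = env_bool("DJANGO_DEBUG"); the variable names are assumptions, so use whatever your deployment already defines.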

A docker-compose.yml that matches production configuration — PostgreSQL, Redis, the Django app — makes local development reliable and eliminates environment-specific bugs before they reach CI.

Step 2: CI/CD with GitHub Actions

Before running anything on AWS, the CI pipeline needs to build the Docker image, run the test suite inside it, and push the image to ECR on merge to main. This gives you confidence that every deployment is built from a tested, immutable artefact.

- name: Build and push to ECR
  # Assumes ECR_REGISTRY and ECR_REPO are set in env and an earlier step has logged in to ECR.
  run: |
    docker build -t "$ECR_REGISTRY/$ECR_REPO:$GITHUB_SHA" .
    docker push "$ECR_REGISTRY/$ECR_REPO:$GITHUB_SHA"

Tag images with the Git commit SHA, not latest. This makes rollbacks trivial: deploy the previous SHA rather than rebuilding.
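As a rough illustration of why SHA tags make rollbacks cheap: with the ECS task definitions described in Step 3, rolling back amounts to pointing one container at an older image reference. The sketch below only shows that substitution on a list of container definitions; the function names are hypothetical, and registering the revised task definition and updating the service are deliberately left out.

```python
import copy


def image_uri(registry, repository, git_sha):
    """Build the immutable image reference for one commit."""
    return f"{registry}/{repository}:{git_sha}"


def pin_image(container_definitions, container_name, image):
    """Return a copy of the container definitions with one image replaced.

    The input is left untouched; an unknown container name raises ValueError.
    """
    updated = copy.deepcopy(container_definitions)
    for container in updated:
        if container["name"] == container_name:
            container["image"] = image
            return updated
    raise ValueError(f"no container named {container_name!r}")
```

Because the previous image already exists in ECR under its SHA, nothing is rebuilt: the rollback is a configuration change, not a new build.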

Step 3: Infrastructure as code with Terraform

Every AWS resource is defined in Terraform before it's created. This isn't optional — manual console configuration doesn't survive team growth, incident recovery, or environment duplication.

The initial Terraform modules we wrote:

  • VPC with public and private subnets, NAT gateway, and route tables
  • ECR repository with lifecycle policies to clean up old images
  • RDS PostgreSQL with parameter groups, automated backups, and multi-AZ
  • ElastiCache Redis cluster for session and cache storage
  • ECS Fargate service with task definitions referencing ECR images
  • Application Load Balancer with HTTPS termination and health checks
  • IAM roles with least-privilege policies for each service
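To show what "least privilege" looks like in practice, here is the shape of a policy document for a web task that only needs to publish to a single SQS queue (the async pattern from Step 4). It is written as a Python dict purely for illustration; in the project itself the policy would live in Terraform. The function name and the queue ARN are placeholders.

```python
import json


def publish_only_policy(queue_arn):
    """IAM policy document allowing SendMessage on a single queue only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PublishToOneQueue",
                "Effect": "Allow",
                "Action": ["sqs:SendMessage"],
                "Resource": [queue_arn],
            }
        ],
    }


# Hypothetical ARN; the account ID and queue name are placeholders.
policy = publish_only_policy("arn:aws:sqs:eu-west-1:123456789012:example-jobs")
print(json.dumps(policy, indent=2))
```

Note what is absent: no wildcard actions, no wildcard resources, and no permission to receive or delete messages, which belongs to the consumer's role instead.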

Run terraform plan before every apply. Review the diff. Never apply in anger.

Step 4: Introduce SQS for async work

The monolith had several synchronous operations that shouldn't be synchronous — sending emails, generating reports, calling third-party APIs. We introduced SQS queues and Lambda functions to handle these asynchronously.

The pattern is simple: Django views publish a message to SQS instead of doing the work inline, and a Lambda function (or Celery worker) consumes the queue. Benefits:

  • Web request response times drop dramatically
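  • Failed jobs retry automatically: a message the consumer does not delete becomes visible again after the visibility timeout (exponential backoff is not built into SQS and has to be added by the consumer)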
  • Dead-letter queues capture messages that exhaust retries for manual review
  • Queue depth is a leading indicator of system stress — visible in CloudWatch
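The publish-and-consume pattern can be sketched roughly as follows. This is not our production code: the job types, the handle_job placeholder, and the message shape are assumptions made for the example. The publisher takes the SQS client as an argument (in practice boto3.client("sqs")), and the consumer uses Lambda's partial batch response so that one bad message does not force the whole batch to be retried.

```python
import json


def enqueue_job(sqs_client, queue_url, job_type, payload):
    """Publish one job to SQS instead of doing the work inside the request.

    sqs_client is expected to be a boto3 SQS client.
    """
    body = json.dumps({"type": job_type, "payload": payload})
    response = sqs_client.send_message(QueueUrl=queue_url, MessageBody=body)
    return response["MessageId"]


def handle_job(job):
    """Placeholder for the real work (send the email, build the report)."""
    if job["type"] not in {"send_email", "generate_report"}:
        raise ValueError(f"unknown job type: {job['type']}")


def lambda_handler(event, context):
    """Consume a batch of SQS messages and report failures individually.

    Returning batchItemFailures makes Lambda retry only the failed messages,
    provided ReportBatchItemFailures is enabled on the event source mapping.
    """
    failures = []
    for record in event["Records"]:
        try:
            handle_job(json.loads(record["body"]))
        except Exception:
            # Leave the message on the queue; after maxReceiveCount
            # deliveries the redrive policy moves it to the dead-letter queue.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

A Django view would then call enqueue_job and return immediately, for example with a 202 response, rather than waiting for the email or report to finish.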

Step 5: Observability with CloudWatch

You cannot operate what you cannot observe. We configured CloudWatch from day one:

  • Custom metrics emitted from the application for business-level events (orders processed, API calls made, export jobs queued)
  • Alarms on p99 latency, error rate, and queue depth, with SNS notifications forwarded to the team Slack channel
  • Log groups with metric filters to surface application errors without tailing logs manually
  • CloudWatch Dashboards with a single view of application health, queue depth, and database connections
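Emitting a business-level metric from the application can be as small as the sketch below. It assumes a boto3 CloudWatch client is passed in (boto3.client("cloudwatch")); the helper names, the namespace, and the metric and dimension names are illustrative choices, not the ones we used.

```python
def build_metric_datum(name, value=1, unit="Count", dimensions=None):
    """Shape one data point the way CloudWatch's PutMetricData expects it."""
    datum = {"MetricName": name, "Value": float(value), "Unit": unit}
    if dimensions:
        datum["Dimensions"] = [
            {"Name": key, "Value": str(val)}
            for key, val in sorted(dimensions.items())
        ]
    return datum


def emit_metric(cloudwatch_client, namespace, name, value=1, **kwargs):
    """Send a single custom metric data point.

    cloudwatch_client is expected to be a boto3 CloudWatch client.
    """
    cloudwatch_client.put_metric_data(
        Namespace=namespace,
        MetricData=[build_metric_datum(name, value, **kwargs)],
    )
```

A call such as emit_metric(client, "ExampleApp", "OrdersProcessed", dimensions={"Environment": "staging"}) adds a network round-trip, so on a hot request path it may be worth batching data points or emitting them from the async workers instead.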

The first time an alarm fires in a staging environment rather than being discovered by a user in production, the investment pays for itself.

What we would do differently

Start with Terraform, not the console. The temptation to click through the AWS console to "just get it working" is strong and costly — you'll spend twice as long codifying what you built manually. Write the Terraform first, even if it's slower initially.

Instrument before you migrate. Add application metrics and logging to the monolith before it moves to AWS. You want a baseline of normal behaviour to compare against after the migration.

Don't underestimate IAM. Every hour spent designing minimal IAM policies at the start saves days of debugging mysterious permission errors later.
