
Claude Code for DevOps: Infrastructure, CI/CD, and Monitoring with AI

DevOps is YAML, scripts, and config files — exactly what Claude Code excels at. Here's how DevOps engineers use it for Terraform, GitHub Actions, Docker, monitoring, and incident response.

AI Builder Club · April 12, 2026 · 4 min read

DevOps work is mostly config files, scripts, and YAML. That is the kind of work Claude Code handles well.

The pattern is always the same: you know what you want the system to do, but translating that into the exact Terraform syntax, the right GitHub Actions workflow, or the correct Docker multi-stage build takes 30 minutes of reading docs. Claude Code can produce a reviewable first draft in far less time, drawing on the many production-style configs it has seen.


Use Case 1: Terraform Modules from Scratch

Create a Terraform module for our production infrastructure on AWS:

1. VPC with public and private subnets across 3 AZs
2. ECS Fargate cluster for our Next.js app
   - Service with auto-scaling (min 2, max 10 tasks, scale on CPU > 70%)
   - ALB with HTTPS listener (ACM certificate)
   - Health check on /api/health
3. RDS PostgreSQL 15 in private subnets
   - Multi-AZ, db.t3.medium, 100GB gp3 storage
   - Automated backups, 7-day retention
4. ElastiCache Redis for session storage
5. S3 bucket for file uploads with CloudFront CDN

Use separate files: main.tf, variables.tf, outputs.tf for each module.
Modules in modules/ directory (vpc, ecs, rds, redis, cdn).
Tag everything with: Environment, Project, ManagedBy=terraform.

Time saved: A full production Terraform setup is a 1-2 day task. Claude Code generates a first draft in minutes, leaving you to review the terraform plan output before applying.
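The tagging rule in the prompt is easy to verify mechanically once a plan exists. The sketch below scans the JSON form of a plan (as produced by `terraform show -json`) and lists resources missing the required tags. It is an illustration, not part of the generated modules: `findUntagged` and `REQUIRED_TAGS` are hypothetical names, and it assumes tags appear under `resource_changes[].change.after.tags`, so tags applied only through provider-level defaults would not be seen.

```typescript
// Hypothetical helper: flag planned resources that lack the required tags.
// Input shape follows the `resource_changes` array of `terraform show -json`.

interface ResourceChange {
  address: string;
  change: {
    actions: string[];
    after: { tags?: Record<string, string> | null } | null;
  };
}

interface PlanJson {
  resource_changes?: ResourceChange[];
}

const REQUIRED_TAGS = ["Environment", "Project", "ManagedBy"];

function findUntagged(plan: PlanJson): Map<string, string[]> {
  const problems = new Map<string, string[]>();
  for (const rc of plan.resource_changes ?? []) {
    const after = rc.change.after;
    // Skip destroyed resources (no planned state) and, as an assumption,
    // treat resources with no `tags` attribute at all as not taggable.
    if (after === null || !("tags" in after)) continue;
    const tags = after.tags ?? {};
    const missing = REQUIRED_TAGS.filter((key) => !(key in tags));
    if (tags.ManagedBy !== undefined && tags.ManagedBy !== "terraform") {
      missing.push("ManagedBy (must be terraform)");
    }
    if (missing.length > 0) problems.set(rc.address, missing);
  }
  return problems;
}
```

Running a check like this in CI before `terraform apply` would turn the tagging convention into something enforced rather than remembered.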


Use Case 2: GitHub Actions CI/CD Pipeline

Create a GitHub Actions workflow at .github/workflows/deploy.yml:

On push to main:
1. Run TypeScript type checking
2. Run ESLint
3. Run the test suite (Jest)
4. Build the Next.js app
5. If all pass, deploy to Vercel production
6. After deploy, run a smoke test (curl the /api/health endpoint, expect 200)
7. If the smoke test fails, automatically roll back the Vercel deployment

On pull request:
1. Run steps 1-4 (no production deploy)
2. Post a comment on the PR with the build status and test coverage
3. Deploy a Vercel preview and post the preview URL as a PR comment

Use caching for node_modules, concurrency groups so multiple pushes
don't deploy simultaneously, and environment secrets for VERCEL_TOKEN.
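The smoke test in step 6 can also live in a small script that the workflow runs after the deploy, which makes retries easier than a bare curl. The following is a minimal sketch under some assumptions: Node 18+ (for the global `fetch`), a hypothetical function name `smokeTest`, and arbitrary retry defaults. The `fetchFn` parameter exists only so the logic can be exercised without a network.

```typescript
// Hypothetical smoke test for step 6: poll /api/health until it returns 200.
// Assumes Node 18+ where `fetch` is available globally.

type FetchLike = (url: string) => Promise<{ status: number }>;

async function smokeTest(
  baseUrl: string,
  attempts = 5,
  delayMs = 3000,
  fetchFn: FetchLike = (url) => fetch(url),
): Promise<boolean> {
  const url = new URL("/api/health", baseUrl).toString();
  for (let i = 1; i <= attempts; i++) {
    try {
      const res = await fetchFn(url);
      if (res.status === 200) return true;
      console.error(`attempt ${i}: got HTTP ${res.status}`);
    } catch (err) {
      // A network error counts as a failed attempt, not a crash.
      console.error(`attempt ${i}: request failed: ${String(err)}`);
    }
    // Wait before retrying, but not after the final attempt.
    if (i < attempts) await new Promise((r) => setTimeout(r, delayMs));
  }
  return false;
}
```

A workflow step could call this with the deployment URL and exit non-zero when it returns `false`, so that a later step conditioned on failure performs the rollback from step 7.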

Use Case 3: Docker Multi-Stage Build Optimization

Our Dockerfile builds a Next.js app but the image is 1.2GB and
takes 8 minutes to build. Optimize it.

Goals:
- Final image under 200MB
- Build time under 3 minutes with warm cache
- Use multi-stage build (deps stage, build stage, runner stage)
- Runner stage should use node:20-alpine
- Only copy production artifacts to the final stage
- Add a proper .dockerignore and a HEALTHCHECK instruction
- Pin all base image versions by SHA digest for reproducibility
- Don't run as root in the final stage

Use Case 4: Monitoring and Alerting Setup

Set up monitoring for our Next.js app deployed on Vercel:

1. Create a lib/monitoring.ts module that wraps logging and metrics:
   - Structured JSON logging (not console.log)
   - Request duration tracking for all API routes
   - Error rate tracking with stack traces
   - Custom business metrics (signups, purchases, api_calls)

2. Create an API route app/api/health/route.ts:
   - Check database connectivity (Supabase query)
   - Check Stripe API reachability
   - Return 200 with status of each dependency, or 503 if any is down

3. Create alert rules:
   - Error rate > 5% for 5 minutes → Slack alert
   - P99 latency > 3s for 10 minutes → Slack alert
   - Health check down for 2 minutes → PagerDuty alert
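To make item 2 concrete, here is one way the core of the health route might look: run every dependency check, record each as up or down, and choose 200 or 503. This is a sketch, not the code Claude Code will necessarily produce. `runHealthChecks`, `withTimeout`, and the 2-second timeout are hypothetical, and the actual Supabase and Stripe checks are left as functions you supply from your own client code.

```typescript
// Hypothetical core of app/api/health/route.ts: run each dependency check,
// report per-dependency status, and pick 200 or 503 for the response.

type Check = () => Promise<void>; // resolves if healthy, rejects if not

interface HealthReport {
  httpStatus: 200 | 503;
  dependencies: Record<string, "up" | "down">;
}

// Reject if the check takes longer than `ms`, so a hung dependency
// cannot hang the health endpoint itself.
async function withTimeout(check: Check, ms: number): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("timed out")), ms);
  });
  try {
    await Promise.race([check(), timeout]);
  } finally {
    clearTimeout(timer);
  }
}

async function runHealthChecks(
  checks: Record<string, Check>,
  timeoutMs = 2000,
): Promise<HealthReport> {
  const names = Object.keys(checks);
  // allSettled lets every check finish even when one of them fails.
  const results = await Promise.allSettled(
    names.map((name) => withTimeout(checks[name], timeoutMs)),
  );
  const dependencies: Record<string, "up" | "down"> = {};
  results.forEach((result, i) => {
    dependencies[names[i]] = result.status === "fulfilled" ? "up" : "down";
  });
  const allUp = Object.values(dependencies).every((s) => s === "up");
  return { httpStatus: allUp ? 200 : 503, dependencies };
}
```

Inside the route's `GET` handler, the report could be returned with `Response.json(report.dependencies, { status: report.httpStatus })`, which gives the smoke test and the PagerDuty rule a single endpoint to watch.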

Use Case 5: Incident Runbooks

Create incident runbooks in docs/runbooks/ for our most common incidents:

1. database-connection-exhausted.md
   - Symptoms, diagnosis steps, resolution, prevention

2. stripe-webhook-failures.md
   - Symptoms, diagnosis, resolution, prevention

3. deployment-rollback.md
   - When to roll back, steps, post-rollback actions

4. high-latency.md
   - Symptoms, diagnosis, resolution by cause

Each runbook should follow the same template: Severity, Symptoms,
Diagnosis, Resolution, Prevention, Escalation contacts.

Why this works: Nobody writes runbooks until after an incident. Claude Code generates thorough, well-structured runbooks from your architecture description.


DevOps CLAUDE.md Template

# CLAUDE.md

## Infrastructure
AWS: ECS Fargate, RDS PostgreSQL, ElastiCache, S3/CloudFront.
Deployment: Vercel (app), Terraform (infra).
CI/CD: GitHub Actions. Monitoring: Grafana Cloud + structured logging.

## Conventions
- Terraform: modules in modules/, environments in envs/
- Docker: multi-stage builds, alpine base, non-root user
- GitHub Actions: reusable workflows in .github/workflows/
- Scripts: bash with set -euo pipefail, shellcheck compliant
- Secrets: never in code, always in environment variables

## Don'ts
- Never hardcode AWS credentials or API keys
- Don't modify production Terraform state manually
- No latest tags for Docker images — always pin versions

If you're using Claude Code for infrastructure and want to share patterns with other DevOps engineers, join AI Builder Club. We discuss IaC patterns, CI/CD optimization, and real production setups.
