Claude Code for DevOps: Infrastructure, CI/CD, and Monitoring with AI
DevOps is YAML, scripts, and config files — exactly what Claude Code excels at. Here's how DevOps engineers use it for Terraform, GitHub Actions, Docker, monitoring, and incident response.
Most DevOps work lives in config files, scripts, and YAML. Claude Code was born for this.
The pattern is always the same: you know what you want the system to do, but translating that into exact Terraform syntax, the right GitHub Actions workflow, or a correct Docker multi-stage build can take 30 minutes of reading docs. Claude Code already knows these formats from a large body of public infrastructure code, so it can produce a solid first draft in minutes for you to review.
Use Case 1: Terraform Modules from Scratch
Create a Terraform module for our production infrastructure on AWS:
1. VPC with public and private subnets across 3 AZs
2. ECS Fargate cluster for our Next.js app
- Service with auto-scaling (min 2, max 10 tasks, scale on CPU > 70%)
- ALB with HTTPS listener (ACM certificate)
- Health check on /api/health
3. RDS PostgreSQL 15 in private subnet
- Multi-AZ, db.t3.medium, 100GB gp3 storage
- Automated backups, 7-day retention
4. ElastiCache Redis for session storage
5. S3 bucket for file uploads with CloudFront CDN
Use separate files: main.tf, variables.tf, outputs.tf for each module.
Modules in modules/ directory (vpc, ecs, rds, redis, cdn).
Tag everything with: Environment, Project, ManagedBy=terraform.
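Item 1 leaves the subnet layout to Claude Code, so it is worth checking the CIDRs it picks. One common convention is to split a /16 VPC into /24 subnets with Terraform's `cidrsubnet()` function; this is an assumption here, not part of the prompt. The TypeScript sketch below reimplements the IPv4 case of `cidrsubnet(prefix, newbits, netnum)` so you can preview such a plan. The 10.0.0.0/16 range, the us-east-1 AZ names, and the 0-2 / 10-12 netnum split are all hypothetical.

```typescript
// Minimal IPv4-only reimplementation of Terraform's cidrsubnet(prefix, newbits, netnum),
// for sanity-checking a subnet layout. Not a full port (no IPv6).

function ipToInt(ip: string): number {
  const octets = ip.split(".").map(Number);
  if (octets.length !== 4 || octets.some((o) => !Number.isInteger(o) || o < 0 || o > 255)) {
    throw new Error(`invalid IPv4 address: ${ip}`);
  }
  // Multiply instead of bit-shifting so values above 2^31 stay positive.
  return octets.reduce((acc, o) => acc * 256 + o, 0);
}

function intToIp(n: number): string {
  return [24, 16, 8, 0].map((shift) => Math.floor(n / 2 ** shift) % 256).join(".");
}

export function cidrsubnet(prefix: string, newbits: number, netnum: number): string {
  const [base, lenText] = prefix.split("/");
  const prefixLen = Number(lenText);
  const newLen = prefixLen + newbits;
  if (!Number.isInteger(prefixLen) || newLen > 32) throw new Error(`bad prefix or newbits: ${prefix}`);
  if (netnum < 0 || netnum >= 2 ** newbits) throw new Error(`netnum ${netnum} out of range`);
  const hostBits = 32 - prefixLen;
  // Mask the base address down to its network boundary, then step netnum blocks.
  const network = Math.floor(ipToInt(base) / 2 ** hostBits) * 2 ** hostBits;
  return `${intToIp(network + netnum * 2 ** (32 - newLen))}/${newLen}`;
}

// Hypothetical layout: a 10.0.0.0/16 VPC, public /24s at netnum 0-2, private at 10-12.
const vpcCidr = "10.0.0.0/16";
const azs = ["us-east-1a", "us-east-1b", "us-east-1c"];

export const subnetPlan = azs.map((az, i) => ({
  az,
  publicCidr: cidrsubnet(vpcCidr, 8, i),
  privateCidr: cidrsubnet(vpcCidr, 8, 10 + i),
}));

for (const s of subnetPlan) {
  console.log(`${s.az}: public ${s.publicCidr}, private ${s.privateCidr}`);
}
```

For the first AZ this prints public 10.0.0.0/24 and private 10.0.10.0/24. Comparing that output against the generated modules/vpc code is a quick way to spot overlapping or oddly sized subnets.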
Time saved: Writing a full production Terraform setup by hand is typically a 1-2 day task. Claude Code generates a first draft in minutes; review it and run terraform plan before you apply anything.
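Reviewing generated infrastructure code goes faster when a script checks the mechanical rules. The tagging rule from the prompt is a good candidate. This sketch assumes you have exported a plan with `terraform plan -out=plan.tfplan` followed by `terraform show -json plan.tfplan`. It models only a small, simplified slice of that JSON (`resource_changes[].change.after`). The code assumes provider `default_tags` surface in `tags_all`; confirm that against your own plan output. The resource addresses in the sample are hypothetical.

```typescript
// Hypothetical tag-compliance check over a simplified `terraform show -json` plan.
// Only the fields read here are modeled; real plan JSON has many more.

interface ResourceChange {
  address: string;
  change: { actions: string[]; after: Record<string, unknown> | null };
}

interface PlanJson {
  resource_changes?: ResourceChange[];
}

const REQUIRED_TAGS = ["Environment", "Project", "ManagedBy"] as const;

export function findTagViolations(plan: PlanJson): string[] {
  const problems: string[] = [];
  for (const rc of plan.resource_changes ?? []) {
    const after = rc.change.after;
    if (after === null) continue; // resource is being destroyed
    // Skip resource types that have no tags attribute at all.
    if (!("tags_all" in after) && !("tags" in after)) continue;
    // Assumption: provider default_tags appear in tags_all; fall back to tags.
    const tags = (after["tags_all"] ?? after["tags"] ?? {}) as Record<string, string>;
    for (const key of REQUIRED_TAGS) {
      if (!(key in tags)) problems.push(`${rc.address}: missing tag ${key}`);
    }
    if ("ManagedBy" in tags && tags["ManagedBy"] !== "terraform") {
      problems.push(`${rc.address}: ManagedBy is "${tags["ManagedBy"]}", expected "terraform"`);
    }
  }
  return problems;
}

// Example input shaped like a tiny slice of plan JSON (hypothetical addresses).
const samplePlan: PlanJson = {
  resource_changes: [
    {
      address: "module.vpc.aws_vpc.this",
      change: { actions: ["create"], after: { tags_all: { Environment: "prod", Project: "web", ManagedBy: "terraform" } } },
    },
    {
      address: "module.rds.aws_db_instance.this",
      change: { actions: ["create"], after: { tags: { Environment: "prod" } } },
    },
  ],
};

console.log(findTagViolations(samplePlan));
```

Here the RDS instance is reported as missing Project and ManagedBy. Fail the CI job whenever the returned list is non-empty.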
Use Case 2: GitHub Actions CI/CD Pipeline
Create a GitHub Actions workflow at .github/workflows/deploy.yml:
On push to main:
1. Run TypeScript type checking
2. Run ESLint
3. Run the test suite (Jest)
4. Build the Next.js app
5. If all pass, deploy to Vercel production
6. After deploy, run a smoke test (curl the /api/health endpoint, expect 200)
7. If the smoke test fails, automatically roll back the Vercel deployment
On pull request:
1. Run steps 1-4 (no deploy)
2. Post a comment on the PR with the build status and test coverage
3. Deploy a Vercel preview and post the preview URL as a PR comment
Use caching for node_modules, concurrency groups so multiple pushes
don't deploy simultaneously, and environment secrets for VERCEL_TOKEN.
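Steps 6 and 7 deserve a closer look in whatever workflow Claude Code generates. A single curl right after a deploy can fail just because the new deployment is still warming up, which triggers an unnecessary rollback. Below is a sketch of the smoke-test logic with retries. It is written as a plain function with the HTTP call injected, so it can be tested without a network. The attempt count and delay are assumptions, and the rollback command itself stays in the workflow.

```typescript
// Smoke test with retries: poll a health URL until it returns 200 or attempts run out.
// Assumptions: 5 attempts, 3 s apart by default; the HTTP call is injected so the
// logic runs without a network (in CI it could wrap fetch).

export type StatusFetcher = (url: string) => Promise<number>;

export interface SmokeResult {
  ok: boolean;
  attempts: number;
  lastStatus: number | null; // null when the last attempt threw (e.g. network error)
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

export async function smokeTest(
  url: string,
  getStatus: StatusFetcher,
  maxAttempts = 5,
  delayMs = 3000,
): Promise<SmokeResult> {
  let lastStatus: number | null = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      lastStatus = await getStatus(url);
      if (lastStatus === 200) return { ok: true, attempts: attempt, lastStatus };
    } catch {
      lastStatus = null; // count a thrown error as a failed attempt
    }
    if (attempt < maxAttempts) await sleep(delayMs);
  }
  return { ok: false, attempts: maxAttempts, lastStatus };
}
```

In CI on Node 18 or later, `getStatus` could be `async (u) => (await fetch(u)).status`. The script could then call `process.exit(1)` when `ok` is false, so that a rollback step guarded by `if: failure()` runs.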
Use Case 3: Docker Multi-Stage Build Optimization
Our Dockerfile builds a Next.js app, but the image is 1.2GB and
takes 8 minutes to build. Optimize it.
Goals:
- Final image under 200MB
- Build time under 3 minutes with warm cache
- Use multi-stage build (deps stage, build stage, runner stage)
- Runner stage should use node:20-alpine
- Only copy production artifacts to the final stage
- Add a proper .dockerignore and a HEALTHCHECK instruction
- Pin all base image versions by SHA digest for reproducibility
- Don't run as root in the final stage
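Two of these goals, digest-pinned base images and a non-root final stage, are easy to verify once Claude Code has rewritten the Dockerfile. The checker below is a deliberately naive illustration, not a substitute for a real linter such as hadolint. It flags `FROM` lines without an `@sha256:` digest, ignoring references to earlier stages and `scratch`. It also reports when the last stage never switches to a non-root `USER`. It assumes one instruction per line and no `ARG`-substituted image names.

```typescript
// Naive Dockerfile check for two goals from the prompt: digest-pinned base images
// and a non-root final stage. Assumptions: one instruction per line (no backslash
// continuations), no ARG-substituted image names, base images default to root.

export function checkDockerfile(text: string): string[] {
  const issues: string[] = [];
  const stageNames = new Set<string>();
  let finalStageUser: string | null = null;

  const lines = text
    .split("\n")
    .map((l) => l.trim())
    .filter((l) => l.length > 0 && !l.startsWith("#"));

  for (const line of lines) {
    const [instr, ...rest] = line.split(/\s+/);
    const keyword = instr.toUpperCase();
    if (keyword === "FROM") {
      // FROM [--platform=...] <image> [AS <name>]
      const args = rest.filter((a) => !a.startsWith("--"));
      const image = args[0] ?? "";
      const fromEarlierStage = stageNames.has(image.toLowerCase());
      const pinned = /@sha256:[0-9a-f]{64}$/.test(image);
      if (!fromEarlierStage && image !== "scratch" && !pinned) {
        issues.push(`unpinned base image: ${image}`);
      }
      const asIdx = args.findIndex((a) => a.toUpperCase() === "AS");
      if (asIdx !== -1 && args[asIdx + 1]) stageNames.add(args[asIdx + 1].toLowerCase());
      finalStageUser = null; // reset per stage (assumes the base image runs as root)
    } else if (keyword === "USER") {
      finalStageUser = rest[0] ?? null;
    }
  }

  if (finalStageUser === null || /^(root|0)(:|$)/.test(finalStageUser)) {
    issues.push("final stage runs as root (no non-root USER)");
  }
  return issues;
}
```

An empty result means both goals are met. Anything else is a concrete item to feed back to Claude Code.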
Use Case 4: Monitoring and Alerting Setup
Set up monitoring for our Next.js app deployed on Vercel:
1. Create a lib/monitoring.ts module that wraps logging and metrics:
- Structured JSON logging (not console.log)
- Request duration tracking for all API routes
- Error rate tracking with stack traces
- Custom business metrics (signups, purchases, api_calls)
2. Create an API route app/api/health/route.ts:
- Check database connectivity (Supabase query)
- Check Stripe API reachability
- Return 200 with status of each dependency, or 503 if any is down
3. Create alert rules:
- Error rate > 5% for 5 minutes → Slack alert
- P99 latency > 3s for 10 minutes → Slack alert
- Health check down for 2 minutes → PagerDuty alert
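Item 2 has a subtle requirement: a dependency that hangs should be reported as down, rather than making the health endpoint itself hang. Here is a sketch of the aggregation logic. It assumes each dependency check is an async function that resolves when the dependency is reachable and rejects otherwise. The actual Supabase and Stripe calls are not shown, and the 2-second per-check timeout is an assumption.

```typescript
// Aggregate dependency checks for a health endpoint: run them in parallel,
// treat a rejection or a timeout as "down", and return 200 only if all are up.
// Assumption: each check resolves on success and rejects on failure.

type Check = () => Promise<unknown>;
type DependencyState = "up" | "down";

export interface HealthReport {
  httpStatus: 200 | 503;
  body: { status: "ok" | "degraded"; dependencies: Record<string, DependencyState> };
}

function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    p.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

export async function runHealthChecks(
  checks: Record<string, Check>,
  timeoutMs = 2000, // assumed per-check budget
): Promise<HealthReport> {
  const names = Object.keys(checks);
  const outcomes = await Promise.all(
    names.map((name) =>
      // Promise.resolve().then(...) also turns a synchronous throw into a rejection.
      withTimeout(Promise.resolve().then(checks[name]), timeoutMs).then(
        () => true,
        () => false,
      ),
    ),
  );
  const dependencies: Record<string, DependencyState> = {};
  names.forEach((name, i) => {
    dependencies[name] = outcomes[i] ? "up" : "down";
  });
  const allUp = outcomes.every(Boolean);
  return {
    httpStatus: allUp ? 200 : 503,
    body: { status: allUp ? "ok" : "degraded", dependencies },
  };
}
```

In app/api/health/route.ts, the GET handler could call `runHealthChecks` with your own check functions and return `Response.json(report.body, { status: report.httpStatus })`. The smoke test and the PagerDuty rule then both key off the same endpoint.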
Use Case 5: Incident Runbooks
Create incident runbooks in docs/runbooks/ for our most common incidents:
1. database-connection-exhausted.md
- Symptoms, diagnosis steps, resolution, prevention
2. stripe-webhook-failures.md
- Symptoms, diagnosis, resolution, prevention
3. deployment-rollback.md
- When to roll back, steps, post-rollback actions
4. high-latency.md
- Symptoms, diagnosis, resolution by cause
Each runbook should follow the same template: Severity, Symptoms,
Diagnosis, Resolution, Prevention, Escalation contacts.
Why this works: Nobody writes runbooks until after an incident. Claude Code turns your architecture description into thorough, consistently structured first drafts, which your team then completes with real commands, dashboards, and escalation contacts.
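Because every runbook is supposed to follow the same template, a small CI check can catch one that drifts. This sketch assumes each template section is a level-2 Markdown heading (`## Severity`, `## Symptoms`, and so on), which the prompt does not specify.

```typescript
// Check a runbook's Markdown for the template sections listed in the prompt.
// Assumption: each section is a level-2 heading ("## Severity", ...), matched case-insensitively.

export const REQUIRED_SECTIONS = [
  "Severity",
  "Symptoms",
  "Diagnosis",
  "Resolution",
  "Prevention",
  "Escalation contacts",
];

export function missingSections(markdown: string): string[] {
  const headings = new Set(
    markdown
      .split("\n")
      .map((line) => /^##\s+(.+?)\s*$/.exec(line)) // "###" headings do not match
      .filter((m): m is RegExpExecArray => m !== null)
      .map((m) => m[1].toLowerCase()),
  );
  return REQUIRED_SECTIONS.filter((s) => !headings.has(s.toLowerCase()));
}
```

Run it over each file in docs/runbooks/ and fail the build if any runbook returns a non-empty list.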
DevOps CLAUDE.md Template
# CLAUDE.md
## Infrastructure
AWS: ECS Fargate, RDS PostgreSQL, ElastiCache, S3/CloudFront.
Deployment: Vercel (app), Terraform (infra).
CI/CD: GitHub Actions. Monitoring: Grafana Cloud + structured logging.
## Conventions
- Terraform: modules in modules/, environments in envs/
- Docker: multi-stage builds, alpine base, non-root user
- GitHub Actions: reusable workflows in .github/workflows/
- Scripts: bash with set -euo pipefail, shellcheck compliant
- Secrets: never in code, always in environment variables
## Don'ts
- Never hardcode AWS credentials or API keys
- Don't modify production Terraform state manually
- No latest tags for Docker images — always pin versions
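A rule like "never hardcode AWS credentials" is easier to keep when something enforces it. The sketch below is a rough illustration only; dedicated scanners such as gitleaks cover far more secret types. It flags strings shaped like AWS access key IDs: AKIA (long-term keys) or ASIA (temporary STS keys), followed by 16 uppercase letters or digits.

```typescript
// Flag lines that look like they contain an AWS access key ID.
// AKIA = long-term access key, ASIA = temporary (STS) access key; 20 characters total.

const ACCESS_KEY_ID = /\b(?:AKIA|ASIA)[A-Z0-9]{16}\b/;

export interface Finding {
  line: number; // 1-based line number in the scanned text
  match: string;
}

export function findAwsKeyIds(text: string): Finding[] {
  const findings: Finding[] = [];
  text.split("\n").forEach((content, i) => {
    const m = ACCESS_KEY_ID.exec(content); // first match per line is enough to flag it
    if (m) findings.push({ line: i + 1, match: m[0] });
  });
  return findings;
}
```

It also matches AWS's documented example key, AKIAIOSFODNN7EXAMPLE, which makes that string a convenient test fixture for a pre-commit hook or CI step.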
If you're using Claude Code for infrastructure and want to share patterns with other DevOps engineers, join AI Builder Club. We discuss IaC patterns, CI/CD optimization, and real production setups.