Claude Code for Data Engineers: Build Pipelines, Write SQL, and Automate ETL
Data engineering is 80% boilerplate. Claude Code writes your SQL migrations, builds ETL pipelines, generates dbt models, and debugs query performance — here's how.
Data engineering is the perfect use case for Claude Code, and almost nobody talks about it.
Here's why: data work is highly structured, deeply boilerplate-heavy, and requires touching many files for a single logical change. Add a new table? You need a migration, a model, tests, documentation, and downstream updates. Claude Code handles the entire chain.
Use Case 1: Database Migrations
Create a database migration that adds an analytics_events table:
- id: uuid, primary key, default gen_random_uuid()
- user_id: uuid, references users(id), indexed
- event_name: text, not null, indexed
- properties: jsonb, default '{}'
- session_id: text, indexed
- created_at: timestamptz, default now(), indexed
- page_url: text
- referrer: text
Add a composite index on (user_id, event_name, created_at) for our most common query pattern.
Add row-level security: users can only read their own events. Service role can read/write all.
Follow the migration pattern in supabase/migrations/ — use the timestamp naming convention.
Time saved: Writing migrations with proper indexes, RLS, and naming conventions takes 15-20 minutes. Claude Code: 1 minute.
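The migration this prompt produces might look roughly like the following. This is a sketch, not Claude Code's exact output: the timestamp in the filename, the index names, and the policy name are all illustrative, and it assumes Supabase-style Postgres where the service role bypasses RLS automatically.

```sql
-- supabase/migrations/20240101000000_create_analytics_events.sql (name illustrative)
create table analytics_events (
    id uuid primary key default gen_random_uuid(),
    user_id uuid references users (id),
    event_name text not null,
    properties jsonb default '{}',
    session_id text,
    created_at timestamptz default now(),
    page_url text,
    referrer text
);

-- Single-column indexes from the spec
create index analytics_events_user_id_idx on analytics_events (user_id);
create index analytics_events_event_name_idx on analytics_events (event_name);
create index analytics_events_session_id_idx on analytics_events (session_id);
create index analytics_events_created_at_idx on analytics_events (created_at);

-- Composite index for the most common query pattern
create index analytics_events_user_event_time_idx
    on analytics_events (user_id, event_name, created_at);

-- Row-level security: users read only their own events
alter table analytics_events enable row level security;

create policy "Users can read own events"
    on analytics_events for select
    using (auth.uid() = user_id);
```

On Supabase, the service role connects with a key that bypasses RLS, so no explicit service-role policy is needed for the read/write-all requirement.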
Use Case 2: Complex SQL Queries
Write a SQL query for our analytics dashboard that shows:
For each day in the last 30 days:
- Total unique users (by user_id)
- New users (first event ever was that day)
- Returning users (had events before that day)
- Total events
- Average events per user
- Top 3 event names by count
- Day-over-day change as a percentage for unique users
Use CTEs to keep it readable. The table is analytics_events with columns: user_id, event_name, created_at.
Optimize for a table with ~10M rows — avoid correlated subqueries.
The result is a query with five or more CTEs, window functions, and layered aggregations. Writing it correctly by hand takes 30-45 minutes; Claude Code generates it in seconds.
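The shape of that query is easier to see in a runnable miniature. This sketch uses Python's stdlib sqlite3 in place of Postgres and covers three of the metrics (unique users, new vs. returning, day-over-day change); the real version would add the remaining CTEs and swap in Postgres date functions like date_trunc and generate_series.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table analytics_events (user_id text, event_name text, created_at text);
insert into analytics_events values
  ('u1', 'page_view', '2024-01-01'),
  ('u2', 'page_view', '2024-01-01'),
  ('u1', 'click',     '2024-01-02'),
  ('u3', 'page_view', '2024-01-02');
""")

rows = conn.execute("""
with firsts as (
    -- each user's first-ever event day
    select user_id, min(created_at) as first_day
    from analytics_events
    group by user_id
),
daily as (
    select e.created_at as day,
           count(distinct e.user_id) as unique_users,
           count(distinct case when f.first_day = e.created_at
                               then e.user_id end) as new_users,
           count(distinct case when f.first_day < e.created_at
                               then e.user_id end) as returning_users,
           count(*) as total_events
    from analytics_events e
    join firsts f using (user_id)
    group by e.created_at
)
select day, unique_users, new_users, returning_users, total_events,
       -- day-over-day change in unique users, as a percentage
       round(100.0 * (unique_users - lag(unique_users) over (order by day))
             / lag(unique_users) over (order by day), 1) as dod_pct
from daily
order by day
""").fetchall()

for r in rows:
    print(r)
```

Each metric gets its own CTE, and the window function (lag) runs last over the aggregated rows, which is exactly the structure that keeps the full 30-day query readable.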
Use Case 3: ETL Pipeline with Error Handling
Build a Python ETL pipeline that:
1. Extracts data from our Stripe API — all invoices from the last 24 hours
2. Transforms: flatten the nested JSON into a flat schema with columns:
invoice_id, customer_id, customer_email, amount_cents, currency,
status, subscription_id, product_name, created_at, paid_at
3. Loads into our PostgreSQL analytics database, table: stripe_invoices
4. Handles: API pagination (Stripe uses cursor-based), rate limiting
(respect Stripe's headers), partial failures (log and continue,
don't fail the whole batch), duplicates (upsert on invoice_id)
Use stripe Python library for extraction, pandas for transformation,
sqlalchemy for loading, structlog for logging.
Include a dry-run mode (--dry-run flag) that does extract + transform
but prints instead of loading. Include a backfill mode
(--start-date --end-date) for historical data.
This should be production-ready: proper error handling, logging,
retry logic, and a clear main() entry point.
Use Case 4: dbt Model Generation
Read the existing dbt models in models/ to understand our naming
conventions, materialization choices, and documentation patterns.
Create a new dbt model chain for our billing analytics:
1. models/staging/stg_stripe_invoices.sql — clean the raw stripe_invoices table.
Cast types, rename columns to our convention (snake_case),
filter out test invoices (where customer_email contains '@test.').
2. models/intermediate/int_monthly_revenue.sql — aggregate to monthly
revenue per customer. Include: customer_id, month, total_revenue,
invoice_count, average_invoice_amount, first_invoice_date,
is_first_month (boolean).
3. models/marts/fct_revenue_metrics.sql — final mart with: month,
total_mrr, new_mrr (from first-month customers), expansion_mrr,
churned_mrr, net_new_mrr, customer_count, arpu.
For each model: add a .yml file with descriptions for every column,
add tests (not_null on keys, accepted_values where appropriate,
relationships tests for foreign keys).
Materialization: staging as views, intermediate as ephemeral, marts as tables.
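The .yml file accompanying the staging model might look like this. It's a sketch: the column descriptions are assumptions, and the accepted status values follow Stripe's documented invoice statuses.

```yaml
version: 2

models:
  - name: stg_stripe_invoices
    description: "Cleaned Stripe invoices; test invoices filtered out."
    columns:
      - name: invoice_id
        description: "Primary key from Stripe."
        tests:
          - not_null
          - unique
      - name: customer_id
        description: "Stripe customer the invoice belongs to."
        tests:
          - not_null
      - name: status
        description: "Invoice lifecycle status."
        tests:
          - accepted_values:
              values: ['draft', 'open', 'paid', 'void', 'uncollectible']
```

Because Claude Code read the existing models first, the generated .yml files follow whatever description and test conventions the project already uses rather than this generic shape.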
Use Case 5: Query Performance Debugging
This query is running in 45 seconds on a table with 50M rows.
It needs to run in under 2 seconds.
[paste the slow query]
Here's the current table definition and indexes:
[paste CREATE TABLE and index definitions]
Here's the EXPLAIN ANALYZE output:
[paste the explain output]
Diagnose why it's slow and give me:
1. The optimized query
2. Any new indexes needed (with the CREATE INDEX statements)
3. An explanation of what was causing the slowdown
4. The expected improvement
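The fix Claude Code proposes is usually one of a few classics: a composite index matching the filter-plus-sort pattern, a partial index that skips rows the query never touches, or rewriting a correlated subquery as a join. A typical index suggestion, assuming the slow query filters on a status column and sorts by recency (table and column names here are illustrative), looks like:

```sql
-- Matches a "where status = 'active' ... order by created_at desc" pattern,
-- so Postgres can walk the index instead of sorting millions of rows.
-- CONCURRENTLY avoids locking writes while the index builds.
create index concurrently orders_active_created_at_idx
    on orders (created_at desc)
    where status = 'active';
```

Always re-run EXPLAIN ANALYZE after applying the suggestion; the expected improvement in step 4 is a prediction, not a guarantee.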
Data Engineering CLAUDE.md Template
# CLAUDE.md
## Project
Data platform for [company]. PostgreSQL + dbt + Python ETL pipelines.
## Stack
- PostgreSQL 15 (hosted on Supabase)
- dbt-core for transformations
- Python 3.12 for ETL scripts
- SQLAlchemy 2.0 for database access
- pandas for data manipulation
- structlog for logging
## Conventions
- SQL: lowercase keywords, CTEs over subqueries, explicit column lists (no SELECT *)
- Python: type hints everywhere, dataclasses for schemas, structlog for all logging
- dbt: staging → intermediate → marts pattern, .yml docs for every model
- Migrations: timestamp-prefixed, descriptive names
- All queries must use parameterized inputs (no string interpolation)
## Don'ts
- No SELECT * in production queries
- No raw SQL string interpolation (always parameterize)
- No pandas operations on datasets > 1M rows without chunking
- Don't modify production migration files (create new ones)
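The parameterization rule in the template is worth a concrete illustration. This uses Python's stdlib sqlite3 (placeholder syntax is ? there; psycopg uses %s and SQLAlchemy uses named parameters, but the principle is identical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table users (id integer, email text)")
conn.execute("insert into users values (1, 'a@example.com')")

email = "a@example.com'; drop table users; --"  # hostile input

# Wrong: string interpolation invites SQL injection
# conn.execute(f"select id from users where email = '{email}'")

# Right: parameterized placeholder; the driver escapes the value
rows = conn.execute(
    "select id from users where email = ?", (email,)
).fetchall()
print(rows)  # hostile string matches nothing, table survives
```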
If you're a data engineer using Claude Code and want to share patterns with other builders, join AI Builder Club. We have data engineers sharing pipeline architectures, dbt patterns, and query optimization techniques.
Go deeper with AI Builder Club
Join 1,000+ ambitious professionals and builders learning to use AI at work.
- ✓ Expert-led courses on Cursor, MCP, AI agents, and more
- ✓ Weekly live workshops with industry builders
- ✓ Private community for feedback, collaboration, and accountability