Claude Code for Data Engineers: Build Pipelines, Write SQL, and Automate ETL
Data engineering is 80% boilerplate. Claude Code writes your SQL migrations, builds ETL pipelines, generates dbt models, and debugs query performance — here's how.
Data engineering is the perfect use case for Claude Code, and almost nobody talks about it.
Here's why: data work is highly structured, deeply boilerplate-heavy, and requires touching many files for a single logical change. Add a new table? You need a migration, a model, tests, documentation, and downstream updates. Claude Code handles the entire chain.
Use Case 1: Database Migrations
Create a database migration that adds an analytics_events table:
- id: uuid, primary key, default gen_random_uuid()
- user_id: uuid, references users(id), indexed
- event_name: text, not null, indexed
- properties: jsonb, default '{}'
- session_id: text, indexed
- created_at: timestamptz, default now(), indexed
- page_url: text
- referrer: text
Add a composite index on (user_id, event_name, created_at) for our most common query pattern.
Add row-level security: users can only read their own events. Service role can read/write all.
Follow the migration pattern in supabase/migrations/ — use the timestamp naming convention.
Time saved: Writing migrations with proper indexes, RLS, and naming conventions takes 15-20 minutes. Claude Code: 1 minute.
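The migration this prompt produces might look roughly like the following. This is a sketch, not Claude Code's exact output: the timestamp in the filename, the index names, and the policy name are all illustrative, and it assumes Supabase-style Postgres where the service role bypasses RLS automatically.

```sql
-- supabase/migrations/20240101000000_create_analytics_events.sql (name illustrative)
create table analytics_events (
    id uuid primary key default gen_random_uuid(),
    user_id uuid references users (id),
    event_name text not null,
    properties jsonb default '{}',
    session_id text,
    created_at timestamptz default now(),
    page_url text,
    referrer text
);

-- Single-column indexes from the spec
create index analytics_events_user_id_idx on analytics_events (user_id);
create index analytics_events_event_name_idx on analytics_events (event_name);
create index analytics_events_session_id_idx on analytics_events (session_id);
create index analytics_events_created_at_idx on analytics_events (created_at);

-- Composite index for the most common query pattern
create index analytics_events_user_event_time_idx
    on analytics_events (user_id, event_name, created_at);

-- Row-level security: users read only their own events
alter table analytics_events enable row level security;

create policy "Users can read own events"
    on analytics_events for select
    using (auth.uid() = user_id);
```

On Supabase, the service role connects with a key that bypasses RLS, so no explicit service-role policy is needed for the read/write-all requirement.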
Use Case 2: Complex SQL Queries
Write a SQL query for our analytics dashboard that shows:
For each day in the last 30 days:
- Total unique users (by user_id)
- New users (first event ever was that day)
- Returning users (had events before that day)
- Total events
- Average events per user
- Top 3 event names by count
- Day-over-day change as a percentage for unique users
Use CTEs to keep it readable. The table is analytics_events with columns: user_id, event_name, created_at.
Optimize for a table with ~10M rows — avoid correlated subqueries.
The result is a query with five or more CTEs, window functions, and layered aggregations. Writing it correctly by hand takes 30-45 minutes; Claude Code generates it in seconds.
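The shape of that query is easier to see in a runnable miniature. This sketch uses Python's stdlib sqlite3 in place of Postgres and covers three of the metrics (unique users, new vs. returning, day-over-day change); the real version would add the remaining CTEs and swap in Postgres date functions like date_trunc and generate_series.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table analytics_events (user_id text, event_name text, created_at text);
insert into analytics_events values
  ('u1', 'page_view', '2024-01-01'),
  ('u2', 'page_view', '2024-01-01'),
  ('u1', 'click',     '2024-01-02'),
  ('u3', 'page_view', '2024-01-02');
""")

rows = conn.execute("""
with firsts as (
    -- each user's first-ever event day
    select user_id, min(created_at) as first_day
    from analytics_events
    group by user_id
),
daily as (
    select e.created_at as day,
           count(distinct e.user_id) as unique_users,
           count(distinct case when f.first_day = e.created_at
                               then e.user_id end) as new_users,
           count(distinct case when f.first_day < e.created_at
                               then e.user_id end) as returning_users,
           count(*) as total_events
    from analytics_events e
    join firsts f using (user_id)
    group by e.created_at
)
select day, unique_users, new_users, returning_users, total_events,
       -- day-over-day change in unique users, as a percentage
       round(100.0 * (unique_users - lag(unique_users) over (order by day))
             / lag(unique_users) over (order by day), 1) as dod_pct
from daily
order by day
""").fetchall()

for r in rows:
    print(r)
```

Each metric gets its own CTE, and the window function (lag) runs last over the aggregated rows, which is exactly the structure that keeps the full 30-day query readable.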
Use Case 3: ETL Pipeline with Error Handling
Build a Python ETL pipeline that:
1. Extracts data from our Stripe API — all invoices from the last 24 hours
2. Transforms: flatten the nested JSON into a flat schema with columns:
invoice_id, customer_id, customer_email, amount_cents, currency,
status, subscription_id, product_name, created_at, paid_at
3. Loads into our PostgreSQL analytics database, table: stripe_invoices
4. Handles: API pagination (Stripe uses cursor-based), rate limiting
(respect Stripe's headers), partial failures (log and continue,
don't fail the whole batch), duplicates (upsert on invoice_id)
Use stripe Python library for extraction, pandas for transformation,
sqlalchemy for loading, structlog for logging.
Include a dry-run mode (--dry-run flag) that does extract + transform
but prints instead of loading. Include a backfill mode
(--start-date --end-date) for historical data.
This should be production-ready: proper error handling, logging,
retry logic, and a clear main() entry point.
Use Case 4: dbt Model Generation
Read the existing dbt models in models/ to understand our naming
conventions, materialization choices, and documentation patterns.
Create a new dbt model chain for our billing analytics:
1. models/staging/stg_stripe_invoices.sql — clean the raw stripe_invoices table.
Cast types, rename columns to our convention (snake_case),
filter out test invoices (where customer_email contains '@test.').
2. models/intermediate/int_monthly_revenue.sql — aggregate to monthly
revenue per customer. Include: customer_id, month, total_revenue,
invoice_count, average_invoice_amount, first_invoice_date,
is_first_month (boolean).
3. models/marts/fct_revenue_metrics.sql — final mart with: month,
total_mrr, new_mrr (from first-month customers), expansion_mrr,
churned_mrr, net_new_mrr, customer_count, arpu.
For each model: add a .yml file with descriptions for every column,
add tests (not_null on keys, accepted_values where appropriate,
relationships tests for foreign keys).
Materialization: staging as views, intermediate as ephemeral, marts as tables.
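The .yml file accompanying the staging model might look like this. It's a sketch: the column descriptions are assumptions, and the accepted status values follow Stripe's documented invoice statuses.

```yaml
version: 2

models:
  - name: stg_stripe_invoices
    description: "Cleaned Stripe invoices; test invoices filtered out."
    columns:
      - name: invoice_id
        description: "Primary key from Stripe."
        tests:
          - not_null
          - unique
      - name: customer_id
        description: "Stripe customer the invoice belongs to."
        tests:
          - not_null
      - name: status
        description: "Invoice lifecycle status."
        tests:
          - accepted_values:
              values: ['draft', 'open', 'paid', 'void', 'uncollectible']
```

Because Claude Code read the existing models first, the generated .yml files follow whatever description and test conventions the project already uses rather than this generic shape.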
Use Case 5: Query Performance Debugging
This query is running in 45 seconds on a table with 50M rows.
It needs to run in under 2 seconds.
[paste the slow query]
Here's the current table definition and indexes:
[paste CREATE TABLE and index definitions]
Here's the EXPLAIN ANALYZE output:
[paste the explain output]
Diagnose why it's slow and give me:
1. The optimized query
2. Any new indexes needed (with the CREATE INDEX statements)
3. An explanation of what was causing the slowdown
4. The expected improvement
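The fix Claude Code proposes is usually one of a few classics: a composite index matching the filter-plus-sort pattern, a partial index that skips rows the query never touches, or rewriting a correlated subquery as a join. A typical index suggestion, assuming the slow query filters on a status column and sorts by recency (table and column names here are illustrative), looks like:

```sql
-- Matches a "where status = 'active' ... order by created_at desc" pattern,
-- so Postgres can walk the index instead of sorting millions of rows.
-- CONCURRENTLY avoids locking writes while the index builds.
create index concurrently orders_active_created_at_idx
    on orders (created_at desc)
    where status = 'active';
```

Always re-run EXPLAIN ANALYZE after applying the suggestion; the expected improvement in step 4 is a prediction, not a guarantee.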
Data Engineering CLAUDE.md Template
# CLAUDE.md
## Project
Data platform for [company]. PostgreSQL + dbt + Python ETL pipelines.
## Stack
- PostgreSQL 15 (hosted on Supabase)
- dbt-core for transformations
- Python 3.12 for ETL scripts
- SQLAlchemy 2.0 for database access
- pandas for data manipulation
- structlog for logging
## Conventions
- SQL: lowercase keywords, CTEs over subqueries, explicit column lists (no SELECT *)
- Python: type hints everywhere, dataclasses for schemas, structlog for all logging
- dbt: staging → intermediate → marts pattern, .yml docs for every model
- Migrations: timestamp-prefixed, descriptive names
- All queries must use parameterized inputs (no string interpolation)
## Don'ts
- No SELECT * in production queries
- No raw SQL string interpolation (always parameterize)
- No pandas operations on datasets > 1M rows without chunking
- Don't modify production migration files (create new ones)
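The parameterization rule in the template is worth a concrete illustration. This uses Python's stdlib sqlite3 (placeholder syntax is ? there; psycopg uses %s and SQLAlchemy uses named parameters, but the principle is identical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table users (id integer, email text)")
conn.execute("insert into users values (1, 'a@example.com')")

email = "a@example.com'; drop table users; --"  # hostile input

# Wrong: string interpolation invites SQL injection
# conn.execute(f"select id from users where email = '{email}'")

# Right: parameterized placeholder; the driver escapes the value
rows = conn.execute(
    "select id from users where email = ?", (email,)
).fetchall()
print(rows)  # hostile string matches nothing, table survives
```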
If you're a data engineer using Claude Code and want to share patterns with other builders, join AI Builder Club. We have data engineers sharing pipeline architectures, dbt patterns, and query optimization techniques.
Go deeper with AI Builder Club
Join 1,000+ ambitious professionals and builders learning to use AI at work.
- ✓ Expert-led courses on Cursor, MCP, AI agents, and more
- ✓ Weekly live workshops with industry builders
- ✓ Private community for feedback, collaboration, and accountability