By the end of this tutorial, you’ll have a working Claude code review automation agent that integrates with GitHub pull requests, understands your codebase’s conventions, and catches real bugs — not just style violations that ESLint already handles. We’re talking about the kind of feedback that usually requires a senior engineer: “this function will deadlock if called concurrently”, “you’re missing error handling on the S3 upload timeout”, “this SQL query is vulnerable to injection despite the ORM”.
Standard linters are pattern matchers. They’re good at what they do — enforcing formatting, flagging unused variables, catching obvious anti-patterns. But they have zero understanding of intent. Claude does. That’s the gap this tutorial fills.
What You’ll Build
A Python-based agent that:
- Reads a git diff (from a PR or local branch comparison)
- Loads relevant context files from the repo (architecture notes, style guides, related modules)
- Sends a structured review request to Claude with that context
- Returns categorised feedback: bugs, security issues, style violations, and suggestions
- Posts comments directly to GitHub PRs via the API
Running this on Claude 3.5 Haiku for a typical 200-line diff costs roughly $0.003–0.006 per review at current pricing. On a Sonnet-class model, you’re looking at $0.02–0.04 for deeper analysis. Both are cheap compared to the 20 minutes a senior dev spends on a routine review.
Here’s the build, step by step:
- Install dependencies — Set up the Python environment with the Anthropic SDK and a GitHub API client
- Build the diff extractor — Pull structured diffs from git or GitHub PRs
- Load codebase context — Feed Claude your conventions, architecture notes, and related files
- Design the review prompt — Structure the system prompt for consistent, categorised output
- Parse and post feedback — Extract structured results and post inline PR comments
- Wire up the GitHub webhook — Trigger reviews automatically on PR open/update
Step 1: Install Dependencies
You’ll need Python 3.11+, the Anthropic SDK, PyGithub for the API integration, and GitPython for local diff extraction.
pip install anthropic PyGithub gitpython python-dotenv pydantic
Set up your environment variables:
# .env
ANTHROPIC_API_KEY=sk-ant-...
GITHUB_TOKEN=ghp_...
GITHUB_WEBHOOK_SECRET=your_webhook_secret
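Assuming python-dotenv loads that file at startup, a small stdlib-only guard can fail fast when a variable is missing instead of crashing mid-review (the variable names match the .env above):

```python
import os

REQUIRED_VARS = ["ANTHROPIC_API_KEY", "GITHUB_TOKEN", "GITHUB_WEBHOOK_SECRET"]

def missing_env_vars(required: list[str]) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not os.getenv(name)]

# Call this once at startup, before accepting any webhooks:
missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```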
Step 2: Build the Diff Extractor
The diff is your primary input. You want it structured — file name, line numbers, old vs new content — not just a raw unified diff blob. Claude can handle raw diffs, but structured input produces dramatically better inline comments.
import os
import re
from dataclasses import dataclass

from github import Github


@dataclass
class FileDiff:
    filename: str
    status: str    # added, modified, removed
    additions: int
    deletions: int
    patch: str     # the actual diff content
    raw_url: str   # link to full file for context


def extract_pr_diffs(repo_name: str, pr_number: int) -> list[FileDiff]:
    """Extract structured diffs from a GitHub PR."""
    g = Github(os.getenv("GITHUB_TOKEN"))
    repo = g.get_repo(repo_name)
    pr = repo.get_pull(pr_number)

    diffs = []
    for file in pr.get_files():
        # Skip binary files, lock files, and generated code
        if should_skip_file(file.filename):
            continue
        diffs.append(FileDiff(
            filename=file.filename,
            status=file.status,
            additions=file.additions,
            deletions=file.deletions,
            patch=file.patch or "",  # None for binary files
            raw_url=file.raw_url,
        ))
    return diffs


def should_skip_file(filename: str) -> bool:
    """Filter out files that don't benefit from AI review."""
    skip_patterns = [
        r'\.lock$', r'package-lock\.json$', r'yarn\.lock$',
        r'\.min\.js$', r'dist/', r'build/', r'\.pb\.go$',
        r'migrations/\d+_.*\.sql$',  # auto-generated migrations
    ]
    return any(re.search(p, filename) for p in skip_patterns)
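A quick spot-check of the filter, re-declaring `should_skip_file` so the snippet runs on its own:

```python
import re

def should_skip_file(filename: str) -> bool:
    """Same patterns as the extractor above."""
    skip_patterns = [
        r'\.lock$', r'package-lock\.json$', r'yarn\.lock$',
        r'\.min\.js$', r'dist/', r'build/', r'\.pb\.go$',
        r'migrations/\d+_.*\.sql$',
    ]
    return any(re.search(p, filename) for p in skip_patterns)

# Lock files and bundles are skipped; application code is reviewed
assert should_skip_file("package-lock.json")
assert should_skip_file("dist/bundle.min.js")
assert not should_skip_file("src/api/users.py")
```

Note that these patterns are substring matches anywhere in the path, so tighten them (e.g. anchor `dist/` to `^dist/`) if your repo has directories that happen to contain those strings.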
Step 3: Load Codebase Context
This is where most AI code review tools fail: they review the diff in isolation. Claude performing a review without knowing your team’s patterns is like a contractor reviewing blueprints without knowing the building code you’re targeting.
You want to inject three types of context:
- Static conventions: your CONTRIBUTING.md, architecture decision records, style guide
- Dynamic context: files that import or are imported by the changed files
- Domain context: relevant interfaces, type definitions, or schemas
import os

from github import Github


def load_review_context(repo_name: str, pr_number: int, diffs: list[FileDiff]) -> dict:
    """Build context payload for Claude."""
    g = Github(os.getenv("GITHUB_TOKEN"))
    repo = g.get_repo(repo_name)

    context = {
        "conventions": load_static_conventions(repo),
        "related_files": {},
    }

    for diff in diffs:
        # Load the full current version of changed files for context
        try:
            content = repo.get_contents(diff.filename)
            if content.size < 50_000:  # skip huge files
                context["related_files"][diff.filename] = \
                    content.decoded_content.decode("utf-8")
        except Exception:
            pass

    return context


def load_static_conventions(repo) -> str:
    """Pull project-level docs that define standards."""
    convention_files = [
        "CONTRIBUTING.md", ".github/CONTRIBUTING.md",
        "docs/architecture.md", "docs/conventions.md",
        "CLAUDE_REVIEW_CONTEXT.md",  # custom file you create for this agent
    ]
    conventions = []
    for filepath in convention_files:
        try:
            content = repo.get_contents(filepath)
            conventions.append(f"## {filepath}\n{content.decoded_content.decode('utf-8')}")
        except Exception:
            continue
    return "\n\n".join(conventions) if conventions else "No explicit conventions found."
I’d strongly recommend creating a CLAUDE_REVIEW_CONTEXT.md in your repo root. Think of it as a briefing document for the reviewer: what patterns to look for, what patterns to ignore, team-specific security requirements, and which third-party libraries you use and how. This single file has more impact on review quality than any amount of prompt engineering.
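If you want a starting point, a skeleton might look like this (every detail below is a placeholder to replace with your own conventions):

```markdown
# Review Briefing for Claude

## Patterns We Intentionally Use
- Raw SQL via a parameterised wrapper in `db.py`: do not flag string formatting there.

## What to Prioritise
- Error handling around external API calls
- Concurrency issues in anything under `workers/`

## What to Ignore
- Import ordering and formatting (handled by CI)

## Key Libraries
- FastAPI for HTTP, SQLAlchemy Core (no ORM models), boto3 for AWS
```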
If you’re thinking about token costs for loading all this context on every review, check out LLM caching strategies for cutting API costs 30–50% — prompt caching on Claude 3.5 makes the static convention context essentially free to reuse across reviews.
Step 4: Design the Review Prompt
The system prompt is doing a lot of work here. You want structured output (so you can parse it programmatically), but you also want Claude to think carefully before categorising. The trick is asking for reasoning before the verdict — it produces fewer false positives.
import json

import anthropic
from pydantic import BaseModel


class ReviewComment(BaseModel):
    filename: str
    line_number: int | None  # None = file-level comment
    severity: str            # "bug", "security", "style", "suggestion"
    title: str
    body: str
    suggested_fix: str | None


class ReviewResult(BaseModel):
    summary: str
    overall_verdict: str     # "approve", "request_changes", "comment"
    comments: list[ReviewComment]


SYSTEM_PROMPT = """You are a senior software engineer performing a thorough code review.
You have deep expertise in security vulnerabilities, concurrency issues, and production reliability.

Your job is to review the provided diff and return structured feedback.

Rules:
- Only flag real issues — do NOT comment on style unless it's in the conventions doc
- For bugs and security issues, always explain the exact failure scenario
- Do not repeat what ESLint/mypy/tsc would already catch — assume those tools are running
- Be specific about line numbers when possible
- Suggested fixes should be concrete code, not vague advice

Severity definitions:
- "bug": will cause incorrect behavior or crashes in production
- "security": creates a vulnerability (injection, auth bypass, data exposure, etc.)
- "style": violates project conventions (only if explicitly defined in context)
- "suggestion": performance improvement, readability, or better pattern — not required

Return ONLY valid JSON matching this schema:
{
  "summary": "2-3 sentence overview of the changes",
  "overall_verdict": "approve|request_changes|comment",
  "comments": [
    {
      "filename": "src/api/users.py",
      "line_number": 42,
      "severity": "bug",
      "title": "Missing error handling on database timeout",
      "body": "If the DB connection times out, this will raise an unhandled exception...",
      "suggested_fix": "try:\\n    result = db.query(...)\\nexcept TimeoutError as e:\\n    raise HTTPException(503, detail=str(e))"
    }
  ]
}"""


def run_review(diffs: list[FileDiff], context: dict) -> ReviewResult:
    client = anthropic.Anthropic()

    # Build the user message
    diff_content = format_diffs_for_prompt(diffs)
    conventions = context.get("conventions", "")
    related = context.get("related_files", {})
    related_content = "\n\n".join([
        f"=== {fname} (full file for context) ===\n{content}"
        for fname, content in list(related.items())[:5]  # cap at 5 files
    ])

    user_message = f"""## Project Conventions
{conventions}

## Related Files (for context)
{related_content}

## Diff to Review
{diff_content}

Please review this diff and return your analysis as JSON."""

    response = client.messages.create(
        model="claude-sonnet-4-5",  # use Haiku for simple diffs to save cost
        max_tokens=4096,
        temperature=0,  # deterministic output for consistent reviews
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": user_message}],
    )

    raw_json = response.content[0].text
    data = json.loads(raw_json)
    return ReviewResult(**data)


def format_diffs_for_prompt(diffs: list[FileDiff]) -> str:
    """Format diffs clearly for Claude."""
    parts = []
    for diff in diffs:
        parts.append(
            f"### {diff.filename} ({diff.status}, +{diff.additions}/-{diff.deletions})\n"
            f"```diff\n{diff.patch}\n```"
        )
    return "\n\n".join(parts)
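One guard worth adding before formatting: cap the patch size per file so a single enormous file can’t blow your token budget. A minimal sketch (the 8,000-character ceiling is an arbitrary choice to tune):

```python
def truncate_patch(patch: str, max_chars: int = 8000) -> str:
    """Trim an oversized patch, keeping the head where most hunks start."""
    if len(patch) <= max_chars:
        return patch
    return patch[:max_chars] + "\n... [patch truncated for length] ..."

# Short patches pass through untouched; long ones get a visible marker
assert truncate_patch("short diff") == "short diff"
assert truncate_patch("x" * 20000).endswith("[patch truncated for length] ...")
```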
Note temperature=0. For code review, you want consistency. The same diff reviewed twice should produce the same findings. If you’re curious about when to deviate from this default, the temperature and Top-P guide covers exactly when to adjust LLM randomness in production contexts like this.
Step 5: Parse and Post Feedback to GitHub
import os

from github import Github


def post_review_to_github(
    repo_name: str,
    pr_number: int,
    result: ReviewResult,
    commit_sha: str,
):
    g = Github(os.getenv("GITHUB_TOKEN"))
    repo = g.get_repo(repo_name)
    pr = repo.get_pull(pr_number)

    # Build inline comments for the review
    review_comments = []
    for comment in result.comments:
        if comment.line_number:
            body = f"**[{comment.severity.upper()}] {comment.title}**\n\n{comment.body}"
            if comment.suggested_fix:
                body += f"\n\n**Suggested fix:**\n```\n{comment.suggested_fix}\n```"
            review_comments.append({
                "path": comment.filename,
                "line": comment.line_number,
                "body": body,
            })

    # Map verdict to GitHub review event
    event_map = {
        "approve": "APPROVE",
        "request_changes": "REQUEST_CHANGES",
        "comment": "COMMENT",
    }

    pr.create_review(
        commit=repo.get_commit(commit_sha),
        body=f"🤖 **Claude Code Review**\n\n{result.summary}",
        event=event_map.get(result.overall_verdict, "COMMENT"),
        comments=review_comments,
    )
Step 6: Wire Up the GitHub Webhook
The last piece is making this trigger automatically. You’ll expose a small webhook endpoint that GitHub calls when a PR is opened or updated.
import asyncio
import hashlib
import hmac
import os

from fastapi import FastAPI, HTTPException, Request

app = FastAPI()


@app.post("/webhook/github")
async def github_webhook(request: Request):
    # Verify the webhook signature
    secret = os.getenv("GITHUB_WEBHOOK_SECRET").encode()
    signature = request.headers.get("X-Hub-Signature-256", "")
    body = await request.body()
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        raise HTTPException(403, "Invalid signature")

    payload = await request.json()
    event = request.headers.get("X-GitHub-Event")

    # Only handle PR open and synchronize (new commits pushed)
    if event == "pull_request" and payload["action"] in ["opened", "synchronize"]:
        repo_name = payload["repository"]["full_name"]
        pr_number = payload["pull_request"]["number"]
        commit_sha = payload["pull_request"]["head"]["sha"]

        # Run in the background so the webhook returns quickly
        asyncio.create_task(run_full_review(repo_name, pr_number, commit_sha))

    return {"status": "ok"}


async def run_full_review(repo_name: str, pr_number: int, commit_sha: str):
    diffs = extract_pr_diffs(repo_name, pr_number)
    if not diffs:
        return  # nothing reviewable

    context = load_review_context(repo_name, pr_number, diffs)
    result = run_review(diffs, context)
    post_review_to_github(repo_name, pr_number, result, commit_sha)
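Before pointing GitHub at your endpoint, you can exercise the signature check locally by forging a request body the same way GitHub signs it:

```python
import hashlib
import hmac

def verify_github_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Standalone version of the check inside the webhook handler."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature_header, expected)

secret = b"your_webhook_secret"
body = b'{"action": "opened"}'
good = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()

assert verify_github_signature(secret, body, good)
assert not verify_github_signature(secret, body, "sha256=deadbeef")
```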
For deployment, this FastAPI app can run on any small VPS or serverless platform. If you’re evaluating hosting options, the comparison of serverless platforms for Claude agents covers the tradeoffs between Vercel, Replicate, and Beam for exactly this kind of workload.
Common Errors
Claude Returns Invalid JSON
This happens roughly 2–5% of the time, usually on very long diffs where the model loses track of the schema. Fix: wrap your json.loads() in a retry loop, and add an explicit instruction in the system prompt: “If you cannot produce valid JSON, return an empty comments array rather than prose.” You can also use Claude’s structured output patterns to force schema compliance.
import json
import re


def parse_review_response(raw: str) -> dict:
    """Extract JSON even if Claude wraps it in markdown."""
    # Strip markdown code fences if present (greedy match so nested
    # braces inside the JSON don't cut the capture short)
    match = re.search(r'```(?:json)?\s*(\{.*\})\s*```', raw, re.DOTALL)
    if match:
        raw = match.group(1)
    return json.loads(raw.strip())
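To see the fence-stripping in action, here is the parser re-declared so the snippet runs on its own, fed both a fenced and a bare response:

```python
import json
import re

def parse_review_response(raw: str) -> dict:
    """Extract JSON even if the model wraps it in markdown fences."""
    match = re.search(r'```(?:json)?\s*(\{.*\})\s*```', raw, re.DOTALL)
    if match:
        raw = match.group(1)
    return json.loads(raw.strip())

wrapped = '```json\n{"summary": "LGTM", "comments": []}\n```'
assert parse_review_response(wrapped)["summary"] == "LGTM"
assert parse_review_response('{"summary": "plain"}')["summary"] == "plain"
```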
GitHub “422 Unprocessable Entity” on Review Comments
This means your line number doesn’t exist in the diff. GitHub will only accept comments on lines that appear in the PR’s diff. Fix: validate that the line number falls within the patch before sending. If it doesn’t, convert the comment to a file-level comment (omit the line field) rather than dropping it entirely.
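One way to validate line numbers before posting is to walk the hunk headers and collect every new-file line the patch actually shows. A sketch under the standard unified-diff format:

```python
import re

def lines_in_patch(patch: str) -> set[int]:
    """Collect new-file line numbers that appear in a unified diff patch."""
    valid = set()
    current = None
    for line in patch.splitlines():
        header = re.match(r'@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@', line)
        if header:
            current = int(header.group(1))  # start of the new-file range
            continue
        if current is None:
            continue  # preamble before the first hunk
        if line.startswith('-'):
            continue  # removed lines don't exist in the new file
        if line.startswith('\\'):
            continue  # "\ No newline at end of file" marker
        valid.add(current)  # context and added lines are commentable
        current += 1
    return valid

patch = "@@ -1,3 +1,4 @@\n context\n+added line\n context\n context"
assert lines_in_patch(patch) == {1, 2, 3, 4}
```

If a comment’s line number isn’t in that set, downgrade it to a file-level comment instead of sending it and eating the 422.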
Context Window Exceeded on Large PRs
A 1,000-line diff plus 5 related files can easily exceed 100k tokens. Fix: implement a chunking strategy — split large PRs into logical file groups, review each group separately, then synthesise. Alternatively, skip loading full related files for large diffs and rely only on the diff content plus conventions. You can also filter aggressively: skip test files, generated code, and configuration-only changes from the AI review entirely.
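The grouping step above can be as simple as bucketing changed files by top-level directory, then running one review call per bucket. A sketch (directory names are illustrative):

```python
from collections import defaultdict

def group_diffs_by_area(filenames: list[str]) -> dict[str, list[str]]:
    """Group changed files by top-level directory for separate review passes."""
    groups = defaultdict(list)
    for name in filenames:
        area = name.split("/", 1)[0] if "/" in name else "(root)"
        groups[area].append(name)
    return dict(groups)

changed = ["api/users.py", "api/auth.py", "frontend/app.tsx", "README.md"]
groups = group_diffs_by_area(changed)
assert groups["api"] == ["api/users.py", "api/auth.py"]
assert groups["(root)"] == ["README.md"]
```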
What to Build Next
Add a learning loop: Store dismissed and accepted comments in a database. After 30 days, analyse which comment types your team consistently dismisses and add them to a “suppress” list in the system prompt. You can also track which bug-severity comments were later confirmed as real issues — this becomes your accuracy baseline for benchmarking the agent over time, similar to the approach covered in Claude agent benchmarking frameworks.
Bottom Line: Who Should Build This
Solo founders and small teams: This pays off immediately. If you’re the only reviewer on a codebase, Claude catches the class of bugs you miss when reviewing your own code. Run it on every PR, use Haiku for routine changes, escalate to Sonnet for anything touching auth, payments, or data access. Total cost for a solo dev’s PR volume is under $10/month.
Teams with existing review processes: Deploy this as a first-pass reviewer that fires within 60 seconds of a PR being opened. By the time a human reviewer gets to it, the mechanical issues are already flagged. Engineers report spending 30–40% less time on routine review feedback when the first pass is automated.
Enterprise: The context-loading architecture here is the real asset. You can extend it to pull from your internal Confluence docs, load relevant Jira tickets, or incorporate your security team’s custom vulnerability patterns. Claude code review automation at that scale justifies dedicated infrastructure — consider a worker queue rather than inline async tasks to handle volume spikes during sprint-end PR floods.
The one thing to get right from day one: write that CLAUDE_REVIEW_CONTEXT.md file. Thirty minutes documenting your team’s non-obvious patterns will do more for review quality than any amount of prompt tuning.
Frequently Asked Questions
How accurate is Claude code review automation compared to a human engineer?
For security issues and concurrency bugs, Claude catches a surprisingly high percentage of what a senior engineer would flag — especially when given proper context. Where it falls short is domain-specific business logic bugs (“this discount calculation is wrong because our pricing model changed last quarter”) that require knowledge outside the codebase. Treat it as a first-pass reviewer, not a replacement for human review on high-stakes changes.
What’s the difference between using Claude for code review versus ESLint or SonarQube?
ESLint and SonarQube are pattern matchers — they apply static rules regardless of context. Claude understands intent, so it can identify a race condition that emerges from how two functions interact, or flag an SQL injection risk that slips through ORM parameterisation. The two approaches are complementary, not competitive. Run static analysis tools first, then Claude for semantic review.
Can I run this against a local branch without GitHub?
Yes. Replace the GitHub diff extraction with GitPython: repo = git.Repo("."); diff = repo.git.diff("main", "HEAD"). You lose the inline PR comment posting, but you can print structured results to the terminal or write them to a file. This is useful for pre-commit hooks where you want fast local feedback before pushing.
How do I prevent Claude from commenting on things my team intentionally does differently?
The CLAUDE_REVIEW_CONTEXT.md file is the right place for this. Include a “Patterns We Intentionally Use” section that lists conventions which might look wrong to an outside reviewer. For example: “We use raw SQL via psycopg2, not an ORM — parameterised queries are handled in our db.py wrapper, don’t flag direct string formatting there.” This eliminates the majority of false positives within a week of tuning.
Which Claude model should I use for code review?
Claude 3.5 Haiku handles routine reviews (style, simple logic, obvious bugs) at roughly $0.003 per review. Claude 3.5 Sonnet or Sonnet 4 is worth it for security-sensitive code, complex concurrency, or anything touching authentication and data access — the reasoning quality difference is meaningful. A tiered approach based on which files changed (auth, payments, infra = Sonnet; everything else = Haiku) gives you the best cost/quality balance.
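That routing rule is only a few lines of code. A sketch (the path prefixes and model IDs are illustrative; check current model names against Anthropic’s docs):

```python
SENSITIVE_PREFIXES = ("auth/", "payments/", "infra/")  # adapt to your repo layout

def pick_model(filenames: list[str]) -> str:
    """Route security-sensitive changes to a stronger model, the rest to Haiku."""
    if any(name.startswith(SENSITIVE_PREFIXES) for name in filenames):
        return "claude-sonnet-4-5"
    return "claude-3-5-haiku-latest"

assert pick_model(["auth/login.py", "README.md"]) == "claude-sonnet-4-5"
assert pick_model(["docs/guide.md"]) == "claude-3-5-haiku-latest"
```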
How do I handle large PRs that exceed the context window?
Split by logical concern: review database changes separately from API changes separately from frontend changes. Each group gets its own API call with only the relevant related files for context. For PRs over 500 lines, prioritise reviewing files by risk level — auth, data access, and external API calls first — and skip pure test additions or config changes entirely.
Put this into practice
Try the Review Agent — ready to use, no setup required.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

