Sunday, April 5

By the end of this tutorial, you’ll have a working Python agent that crawls a website, runs a full SEO content audit on each page using Claude, and outputs a structured JSON report with actionable recommendations — handling 100+ pages in under 10 minutes. SEO content audit automation that used to require a Screaming Frog license, a content strategist, and half a day can now run unattended on a schedule.

Here’s the full scope: we’ll crawl pages, extract on-page signals (title tags, meta descriptions, heading structure, word count, internal links, keyword density), pass the content to Claude for semantic analysis, and generate a prioritized action list per page. The whole thing costs roughly $0.07–$0.12 to audit a 100-page site at Claude Haiku 3 pricing.

  1. Install dependencies — Set up crawling and Claude API libraries
  2. Build the crawler — Extract page content and on-page SEO signals
  3. Write the audit prompt — Craft the system prompt that drives consistent analysis
  4. Run Claude against each page — Process pages in async batches
  5. Aggregate and score results — Build the final JSON report with prioritized fixes
  6. Export and schedule — Save reports and run audits on a cron

Step 1: Install Dependencies

You need a handful of libraries: anthropic for the Claude API, crawl4ai for fast async crawling (it handles JavaScript-rendered pages, which requests won’t), beautifulsoup4 plus lxml for HTML parsing, python-dotenv for loading your API key from a .env file, and aiohttp for any custom async HTTP requests.

pip install anthropic crawl4ai beautifulsoup4 aiohttp lxml python-dotenv
# crawl4ai requires a one-time browser install (it uses Playwright internally):
crawl4ai-setup

Store your API key in a .env file — never hardcode it. We’ll use python-dotenv to load it.

# .env
ANTHROPIC_API_KEY=sk-ant-...
TARGET_DOMAIN=https://yourdomain.com
MAX_PAGES=150

Step 2: Build the Crawler

The crawler needs to do more than fetch HTML — it needs to extract structured SEO signals before we even touch Claude. This keeps your API costs down because you only send Claude what it needs, not raw HTML.

import asyncio
import re
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup
from crawl4ai import AsyncWebCrawler
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PageSignals:
    url: str
    title: str
    meta_description: str
    h1_tags: list[str]
    h2_tags: list[str]
    word_count: int
    internal_links: list[str]
    external_links: list[str]
    images_without_alt: int
    canonical_url: Optional[str]
    body_text: str  # first 3000 chars — enough for Claude to assess relevance

def extract_signals(url: str, html: str, domain: str) -> PageSignals:
    soup = BeautifulSoup(html, "lxml")
    
    title = soup.find("title")
    meta_desc = soup.find("meta", attrs={"name": "description"})
    canonical = soup.find("link", attrs={"rel": "canonical"})
    
    # Extract heading structure
    h1s = [tag.get_text(strip=True) for tag in soup.find_all("h1")]
    h2s = [tag.get_text(strip=True) for tag in soup.find_all("h2")]
    
    # Count words in body (exclude nav, footer, scripts)
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()
    body_text = soup.get_text(separator=" ", strip=True)
    word_count = len(body_text.split())
    
    # Internal vs external links
    all_links = [a.get("href", "") for a in soup.find_all("a", href=True)]
    parsed_domain = urlparse(domain).netloc
    internal = [l for l in all_links if parsed_domain in l or l.startswith("/")]
    external = [l for l in all_links if l.startswith("http") and parsed_domain not in l]
    
    # Images missing alt text
    images = soup.find_all("img")
    missing_alt = sum(1 for img in images if not img.get("alt", "").strip())
    
    return PageSignals(
        url=url,
        title=title.get_text(strip=True) if title else "",
        meta_description=meta_desc.get("content", "") if meta_desc else "",
        h1_tags=h1s,
        h2_tags=h2s,
        word_count=word_count,
        internal_links=internal[:20],  # cap to avoid token bloat
        external_links=external[:10],
        images_without_alt=missing_alt,
        canonical_url=canonical.get("href") if canonical else None,
        body_text=body_text[:3000],
    )

async def crawl_site(start_url: str, max_pages: int = 100) -> list[PageSignals]:
    visited = set()
    to_visit = [start_url]
    results = []
    domain = start_url
    
    async with AsyncWebCrawler(verbose=False) as crawler:
        while to_visit and len(visited) < max_pages:
            url = to_visit.pop(0)
            if url in visited:
                continue
            visited.add(url)
            
            try:
                result = await crawler.arun(url=url)
                if result.success:
                    signals = extract_signals(url, result.html, domain)
                    results.append(signals)
                    
                    # Queue discovered internal links
                    parsed = urlparse(domain).netloc
                    new_links = [
                        urljoin(url, l) for l in signals.internal_links
                        if parsed in urljoin(url, l) and urljoin(url, l) not in visited
                    ]
                    to_visit.extend(new_links[:10])  # breadth-first, cap queue growth
            except Exception as e:
                print(f"Failed to crawl {url}: {e}")
    
    return results

One thing crawl4ai gets right that plain requests doesn’t: it waits for JavaScript to hydrate the DOM before extracting HTML. If your target site uses React or Next.js, this matters a lot for getting real title tags and meta descriptions. For a more comprehensive guide on building crawlers that work with dynamic content, the article on building web-browsing Claude agents covers browser automation patterns that pair well with this approach.
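One signal from the scope above that extract_signals doesn’t yet compute is keyword density. Here’s a minimal sketch of a helper you could add alongside it (the function name, stopword list, and top-10 cutoff are my choices, not part of the original pipeline):

```python
import re
from collections import Counter

# Deliberately tiny stopword list; extend it for production use
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on", "with"}

def keyword_density(body_text: str, top_n: int = 10) -> list[tuple[str, float]]:
    """Return the top_n non-stopword terms as (term, percent of total words)."""
    words = [w.lower() for w in re.findall(r"[a-zA-Z']+", body_text)]
    total = len(words) or 1
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [(term, round(100 * n / total, 2)) for term, n in counts.most_common(top_n)]
```

Feed the result into the user message as one more signal line; Claude can then judge whether a high density on a head term looks intentional or stuffed.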

Step 3: Write the Audit Prompt

This is where most implementations fall apart. Vague prompts produce vague audits. You want Claude to return structured JSON you can actually process — not a wall of markdown prose.

SYSTEM_PROMPT = """You are an expert SEO analyst. You will receive on-page signals for a single webpage and must return a structured JSON audit. Be specific and actionable — not generic.

Return ONLY valid JSON matching this schema:
{
  "url": "string",
  "overall_score": 0-100,
  "issues": [
    {
      "type": "string",  // e.g. "missing_meta_description", "thin_content", "duplicate_h1"
      "severity": "critical|high|medium|low",
      "description": "string",
      "recommendation": "string"  // specific, not generic
    }
  ],
  "content_gaps": ["string"],  // topics the page should cover but doesn't
  "primary_keyword_assessment": "string",
  "estimated_search_intent": "informational|transactional|navigational|commercial"
}

Scoring guide: 90-100 = minimal issues, 70-89 = some optimization needed, 50-69 = significant gaps, <50 = major problems."""

def build_user_message(signals: PageSignals) -> str:
    return f"""Audit this page:

URL: {signals.url}
Title tag: {signals.title or "MISSING"}
Meta description: {signals.meta_description or "MISSING"}
H1 tags: {signals.h1_tags or ["MISSING"]}
H2 tags (first 8): {signals.h2_tags[:8]}
Word count: {signals.word_count}
Internal links: {len(signals.internal_links)}
External links: {len(signals.external_links)}
Images missing alt text: {signals.images_without_alt}
Canonical URL: {signals.canonical_url or "not set"}

Page content sample (first 3000 chars):
{signals.body_text}"""

The schema constraint in the system prompt is doing real work here. Without it, Claude will narrate your SEO problems like a consultant on retainer. Constraining the output to JSON with a fixed schema means you can loop over issues programmatically and sort by severity. For more on writing system prompts that produce reliable structured outputs from Claude, this framework for consistent agent behavior applies directly.
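The payoff of the fixed schema is that downstream processing becomes a plain sort. A sketch of the loop you can now run over each page’s audit (the numeric severity ranking is my own convention, not something Claude returns):

```python
# Map the schema's severity strings to sortable ranks
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}

def prioritized_issues(audit: dict) -> list[dict]:
    """Sort one page's issues so critical items surface first."""
    return sorted(
        audit.get("issues", []),
        key=lambda i: SEVERITY_RANK.get(i.get("severity"), 99),  # unknowns sink to bottom
    )
```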

Step 4: Run Claude Against Each Page

We’ll bound concurrency with a semaphore to avoid hitting rate limits. At Claude Haiku 3 pricing (~$0.00025 per 1K input tokens, ~$0.00125 per 1K output tokens), a full 100-page audit with ~800 tokens input and ~400 tokens output per page runs to about $0.07–$0.12 total. Swap in a Sonnet model if you want deeper semantic analysis — expect 8–10x the cost but noticeably better content gap identification.

import anthropic
import asyncio
import json
import re  # needed for the code-fence stripping below
from typing import Any

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

async def audit_page(signals: PageSignals, semaphore: asyncio.Semaphore) -> dict[str, Any]:
    async with semaphore:
        try:
            # Claude API is synchronous — run in thread pool to not block event loop
            loop = asyncio.get_running_loop()
            response = await loop.run_in_executor(
                None,
                lambda: client.messages.create(
                    model="claude-haiku-4-5",  # swap to claude-sonnet-4-5 for deeper analysis
                    max_tokens=800,
                    system=SYSTEM_PROMPT,
                    messages=[{"role": "user", "content": build_user_message(signals)}]
                )
            )
            raw = response.content[0].text.strip()
            
            # Strip markdown code fences if Claude adds them despite instructions
            if raw.startswith("```"):
                raw = re.sub(r"^```(?:json)?\n?", "", raw)
                raw = re.sub(r"\n?```$", "", raw)
            
            return json.loads(raw)
        except json.JSONDecodeError as e:
            # Return a stub so we don't lose the URL from the report
            return {"url": signals.url, "overall_score": 0, "error": f"Parse failed: {e}", "issues": []}
        except Exception as e:
            return {"url": signals.url, "overall_score": 0, "error": str(e), "issues": []}

async def audit_all_pages(pages: list[PageSignals], concurrency: int = 5) -> list[dict]:
    semaphore = asyncio.Semaphore(concurrency)  # 5 concurrent API calls is safe
    tasks = [audit_page(page, semaphore) for page in pages]
    return await asyncio.gather(*tasks)

The run_in_executor pattern matters here. The Anthropic Python SDK is synchronous, so if you call it directly in an async function, you’ll block the event loop and kill your concurrency gains. This wraps the blocking call in a thread pool so other crawl/audit tasks keep running. If you’re processing thousands of pages, look at the batch processing workflow guide — the Anthropic batch API cuts costs by 50% for non-real-time workloads.
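If you take the batch route, the payload looks roughly like this. It’s a sketch based on the Message Batches API shape at time of writing (build_batch_requests is a hypothetical helper; verify field names against current Anthropic docs before relying on it):

```python
def build_batch_requests(user_messages: list[str], system_prompt: str,
                         model: str = "claude-haiku-4-5") -> list[dict]:
    """One batch request per page; custom_id maps results back to input order."""
    return [
        {
            "custom_id": f"page-{i}",  # URLs aren't valid custom_ids, so use an index
            "params": {
                "model": model,
                "max_tokens": 800,
                "system": system_prompt,
                "messages": [{"role": "user", "content": msg}],
            },
        }
        for i, msg in enumerate(user_messages)
    ]
```

You’d submit with client.messages.batches.create(requests=build_batch_requests(...)), poll until the batch finishes, then match results back to pages via custom_id.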

Step 5: Aggregate and Score Results

from collections import Counter
import statistics

def generate_report(audit_results: list[dict], domain: str) -> dict:
    valid = [r for r in audit_results if "error" not in r]
    errors = [r for r in audit_results if "error" in r]
    
    scores = [r["overall_score"] for r in valid]
    all_issues = [issue for r in valid for issue in r.get("issues", [])]
    
    # Aggregate issue frequency
    issue_types = Counter(i["type"] for i in all_issues)
    critical_issues = [i for i in all_issues if i["severity"] == "critical"]
    
    # Pages needing immediate attention (score < 50)
    urgent_pages = sorted(
        [r for r in valid if r["overall_score"] < 50],
        key=lambda x: x["overall_score"]
    )
    
    # Content gap summary
    all_gaps = [gap for r in valid for gap in r.get("content_gaps", [])]
    
    return {
        "domain": domain,
        "pages_audited": len(valid),
        "pages_failed": len(errors),
        "average_score": round(statistics.mean(scores), 1) if scores else 0,
        "median_score": round(statistics.median(scores), 1) if scores else 0,
        "score_distribution": {
            "90_100": sum(1 for s in scores if s >= 90),
            "70_89": sum(1 for s in scores if 70 <= s < 90),
            "50_69": sum(1 for s in scores if 50 <= s < 70),
            "below_50": sum(1 for s in scores if s < 50),
        },
        "top_issue_types": issue_types.most_common(10),
        "critical_issues_count": len(critical_issues),
        "urgent_pages": urgent_pages[:10],  # worst 10 pages
        "content_gaps": list(set(all_gaps))[:20],
        "per_page_results": valid,
        "failed_pages": errors,
    }

Step 6: Export and Schedule the Audit

import json
import os
from datetime import datetime
from dotenv import load_dotenv

load_dotenv()

async def main():
    domain = os.getenv("TARGET_DOMAIN", "https://example.com")
    max_pages = int(os.getenv("MAX_PAGES", "100"))
    
    print(f"Crawling {domain} (max {max_pages} pages)...")
    pages = await crawl_site(domain, max_pages)
    print(f"Crawled {len(pages)} pages. Running audit...")
    
    audit_results = await audit_all_pages(pages, concurrency=5)
    report = generate_report(audit_results, domain)
    
    # Save timestamped report
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"seo_audit_{timestamp}.json"
    with open(filename, "w") as f:
        json.dump(report, f, indent=2)
    
    print(f"\n=== AUDIT COMPLETE ===")
    print(f"Pages audited: {report['pages_audited']}")
    print(f"Average SEO score: {report['average_score']}/100")
    print(f"Critical issues: {report['critical_issues_count']}")
    print(f"Report saved to: {filename}")
    
    # Print top 3 urgent pages
    for page in report['urgent_pages'][:3]:
        print(f"\n[URGENT] {page['url']} — Score: {page['overall_score']}")
        for issue in page.get('issues', [])[:2]:
            print(f"  → {issue['type']}: {issue['recommendation']}")

if __name__ == "__main__":
    asyncio.run(main())

To schedule this as a weekly audit, add it to cron:

# Run every Monday at 8am
0 8 * * 1 cd /path/to/audit && python audit.py >> audit.log 2>&1

Common Errors

Claude returns Markdown instead of JSON

Even with explicit instructions to return “ONLY valid JSON,” Claude will occasionally wrap the response in triple-backtick code fences. The regex strip in Step 4 handles this. If you’re still seeing issues, add "Do NOT wrap in markdown code fences." to your system prompt — it sounds redundant but it works. This is a known model behavior documented in our piece on reducing LLM hallucinations in production; constraining output format is one of the most effective grounding strategies.

Crawl4ai returns empty HTML for some pages

Some pages trigger bot detection or have aggressive rate limiting. crawl4ai uses Playwright internally, so you can pass user_agent and wait_for parameters: await crawler.arun(url=url, wait_for="css:.main-content", user_agent="Mozilla/5.0 ..."). For sites that hard-block headless browsers, fall back to requests with a real browser user-agent — you’ll lose JS rendering but get something.

Rate limit errors from the Anthropic API

With 5 concurrent calls at default tier limits, you’ll rarely hit this. If you do, reduce concurrency to 2-3 and add exponential backoff. The Anthropic SDK raises anthropic.RateLimitError — catch it and retry with await asyncio.sleep(60). At Tier 1 (default), Haiku allows 50 requests/minute which is more than enough for auditing at concurrency=5.
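A generic backoff wrapper covering that retry advice, as a sketch; in the real pipeline you’d pass retry_on=(anthropic.RateLimitError,) and wrap the lambda from Step 4:

```python
import time

def retry_with_backoff(fn, retries: int = 4, base_delay: float = 2.0,
                       retry_on: tuple = (Exception,)):
    """Call fn(); on a retryable error, sleep base_delay, 2x, 4x... then re-raise."""
    for attempt in range(retries):
        try:
            return fn()
        except retry_on:
            if attempt == retries - 1:
                raise  # out of retries, let the caller's error handling take over
            time.sleep(base_delay * (2 ** attempt))
```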

What to Build Next

Add competitor comparison. Crawl 2-3 competitor domains with the same agent, then pass both your page’s signals and the competitor’s equivalent page to Claude in a single prompt, asking it to identify what the competitor covers that you don’t. This gives you a content gap analysis grounded in actual live data, not keyword tool exports. The same architecture from our AI-powered competitor monitoring guide integrates cleanly here — you’d reuse the crawl layer and just modify the audit prompt to accept two-page comparisons.
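A sketch of what that two-page prompt could look like (build_comparison_message is hypothetical; the field names mirror PageSignals, passed here as plain dicts):

```python
def build_comparison_message(yours: dict, theirs: dict) -> str:
    """Format our page's signals and a competitor's into a single Claude prompt."""
    return f"""Compare these two pages targeting the same topic.

OUR PAGE: {yours['url']}
Title: {yours['title']}
H2s: {yours['h2_tags']}
Word count: {yours['word_count']}

COMPETITOR PAGE: {theirs['url']}
Title: {theirs['title']}
H2s: {theirs['h2_tags']}
Word count: {theirs['word_count']}

List topics the competitor covers that our page doesn't, and which of their
heading-level sections we should add. Return ONLY a JSON array of strings."""
```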

When to Use This (Bottom Line)

Solo founders and small teams: Run this monthly on your own site. The $0.10 cost per 100-page audit is genuinely negligible — you’ll get more actionable output than most paid SEO tools because Claude can reason about content quality, not just count characters in title tags.

Agencies: Wrap this in a simple Flask endpoint, take a domain as input, return the JSON report. You now have a client-facing audit tool that runs in 10 minutes instead of hours. Add a simple HTML template to render the report and you have a deliverable.

Enterprise teams: Add the Claude batch API for overnight processing of large sites (1000+ pages), hook the JSON output into your BI dashboard, and schedule the cron to run weekly. The comparison reports between audit runs will show you exactly which optimizations moved the needle.
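That run-over-run comparison can start as a per-URL score diff between two saved reports. A sketch, assuming the report JSON shape from Step 5 (score_deltas is a hypothetical helper):

```python
def score_deltas(old_report: dict, new_report: dict) -> list[tuple[str, int]]:
    """Return (url, score_change) pairs for URLs in both runs, biggest movers first."""
    old = {r["url"]: r["overall_score"] for r in old_report["per_page_results"]}
    deltas = [
        (r["url"], r["overall_score"] - old[r["url"]])
        for r in new_report["per_page_results"]
        if r["url"] in old  # skip pages that appeared or vanished between runs
    ]
    return sorted(deltas, key=lambda d: abs(d[1]), reverse=True)
```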

The core value of SEO content audit automation isn’t replacing your SEO strategist — it’s eliminating the 80% of audit work that’s mechanical so they can focus on the 20% that requires judgment. This pipeline delivers that.

Frequently Asked Questions

How much does it cost to run a Claude SEO audit on 100 pages?

Using Claude Haiku 3 with ~800 input tokens and ~400 output tokens per page, expect roughly $0.07–$0.12 for 100 pages. Switching to Claude Sonnet 3.5 for deeper semantic analysis costs approximately $0.80–$1.20 for the same run — still cheaper than most SaaS SEO tools per audit.
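The arithmetic behind that estimate, as a quick sanity check you can rerun when prices change (the defaults are the Haiku 3 rates quoted in Step 4, not current pricing):

```python
def audit_cost(pages: int, in_tokens: int = 800, out_tokens: int = 400,
               in_price_per_1k: float = 0.00025, out_price_per_1k: float = 0.00125) -> float:
    """Estimated total USD for one audit run at per-1K-token prices."""
    per_page = (in_tokens / 1000) * in_price_per_1k + (out_tokens / 1000) * out_price_per_1k
    return round(pages * per_page, 2)

print(audit_cost(100))  # 0.07 at the quoted Haiku 3 rates
```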

Can this tool crawl JavaScript-rendered pages built with React or Next.js?

Yes — crawl4ai uses Playwright under the hood, which renders JavaScript before extracting HTML. This means title tags injected by client-side routing, meta descriptions set dynamically, and content loaded via API calls will all be captured correctly. Plain requests-based crawlers will miss this entirely.

How do I avoid Claude returning inconsistent JSON structures across pages?

Two things help: include the exact JSON schema in the system prompt with field types annotated, and explicitly say “return ONLY valid JSON, no markdown, no explanation.” If you need guaranteed schema compliance, add a Pydantic validation step after parsing and re-prompt on validation failures. Validating and re-prompting is more reliable than hoping the instructions hold across every response.
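A stdlib-only sketch of that validation gate (Pydantic is the sturdier choice in production; validate_audit is hypothetical and checks only the schema from Step 3):

```python
REQUIRED_FIELDS = {"url": str, "overall_score": int, "issues": list}
VALID_SEVERITIES = {"critical", "high", "medium", "low"}

def validate_audit(audit: dict) -> list[str]:
    """Return a list of problems; an empty list means the audit passed."""
    problems = [
        f"missing or mistyped field: {name}"
        for name, typ in REQUIRED_FIELDS.items()
        if not isinstance(audit.get(name), typ)
    ]
    problems += [
        f"bad severity: {i.get('severity')}"
        for i in audit.get("issues", [])
        if isinstance(i, dict) and i.get("severity") not in VALID_SEVERITIES
    ]
    return problems
```

On a non-empty result, append the problem list to a follow-up message and re-prompt.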

What’s the difference between running this audit with Haiku vs Sonnet?

Haiku is fast and cheap but misses nuanced content quality signals — it’ll catch missing meta descriptions and thin content reliably, but struggle with identifying subtle topic gaps or evaluating whether content matches search intent accurately. Sonnet 3.5 gives noticeably better content gap analysis and more specific recommendations. For mechanical issues (missing tags, duplicate H1s), Haiku is fine. For a full content strategy audit, Sonnet is worth the cost.

Can I run this on a site I don’t own?

Technically yes, but check the site’s robots.txt and terms of service first. crawl4ai respects robots.txt by default. For competitor analysis, crawl only publicly accessible pages and rate-limit your requests to avoid impacting their infrastructure — 1 request per second is a safe default for sites you don’t control.

Put this into practice

Try the Content Marketer agent — ready to use, no setup required.

Browse Agents →

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

