Sunday, April 5

Most founders and developers track competitors the same way: they check a few websites when they remember, skim a couple of newsletters, and call it done. Then they get blindsided when a competitor ships a pricing change, launches a new feature, or pivots their positioning entirely. AI-powered competitor monitoring solves this with a system that does the watching for you — scraping pages, detecting changes, and sending you a clean daily digest without you lifting a finger.

This article walks through a complete end-to-end implementation. By the end, you’ll have a working workflow that scrapes competitor pages on a schedule, diffs the content against previous runs, passes the changes to Claude for summarisation, and delivers a structured report to Slack or email. I’ll cover the architecture, show real code, and flag the parts that break in production.

What You’re Actually Building

The system has four stages:

  1. Scraping — fetch target pages (pricing, blog, changelog, home page) on a schedule
  2. Storage + diffing — compare today’s content against yesterday’s snapshot
  3. Summarisation — send meaningful diffs to Claude and ask for a structured analysis
  4. Delivery — push the summary to wherever your team actually reads things

You can run this entirely in Python on a cron job, or wire it into n8n if you want a visual workflow with retry logic. I’ll show the Python core first because understanding the primitives matters, then show how to wrap it in n8n for production use.
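For the cron route, the schedule itself is one line — the install path and log file below are placeholders for wherever you deploy the script:

```
# m h dom mon dow — run the monitor every day at 07:00, keep a log for debugging
0 7 * * * cd /opt/competitor-monitor && python3 monitor.py >> monitor.log 2>&1
```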

Scraping Competitor Pages Without Getting Blocked

Plain HTTP fetching with httpx (or requests) plus BeautifulSoup works fine for most SaaS pricing and blog pages. Where it breaks: SPAs rendered client-side (anything built in React without SSR), Cloudflare-protected pages, and sites that fingerprint headless browsers. For static pages, keep it simple.

import httpx
from bs4 import BeautifulSoup
import hashlib
import json
from pathlib import Path
from datetime import datetime

HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; MarketMonitor/1.0)",
    "Accept-Language": "en-US,en;q=0.9",
}

def fetch_page_text(url: str) -> str:
    """Fetch a URL and return clean visible text, stripping nav/footer noise."""
    resp = httpx.get(url, headers=HEADERS, timeout=15, follow_redirects=True)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    # Remove boilerplate elements
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()

    return soup.get_text(separator="\n", strip=True)


def snapshot_path(url: str, base_dir: Path) -> Path:
    """Generate a stable filename from the URL."""
    url_hash = hashlib.md5(url.encode()).hexdigest()[:10]
    return base_dir / f"{url_hash}.json"


def load_snapshot(path: Path) -> dict | None:
    if path.exists():
        return json.loads(path.read_text())
    return None


def save_snapshot(path: Path, content: str) -> None:
    path.write_text(json.dumps({
        "content": content,
        "captured_at": datetime.utcnow().isoformat()
    }))

For JavaScript-heavy sites, reach for Playwright instead. It adds about 3–5 seconds per page and requires a browser binary, but it handles 95% of modern SPAs. The trade-off is real: a 10-site monitor that uses Playwright will take 40–60 seconds per run versus 5–10 seconds with httpx. Fine for a daily job, annoying for anything faster.
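A minimal Playwright variant of `fetch_page_text`, assuming you've run `pip install playwright` and `playwright install chromium` — the import is deferred so the httpx-only path never needs a browser installed:

```python
def fetch_rendered_text(url: str, timeout_ms: int = 15_000) -> str:
    """Fetch a client-rendered page by letting Chromium execute its JavaScript first."""
    # Deferred import: only pay the Playwright dependency when a page actually needs it
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "networkidle" waits for the SPA's XHR traffic to settle before reading the DOM
            page.goto(url, timeout=timeout_ms, wait_until="networkidle")
            return page.inner_text("body")
        finally:
            browser.close()
```

Route only the sites that need it through this function and keep everything else on httpx — that preserves the fast path for the 10-site daily run.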

Diffing Content and Filtering Out Noise

Raw text diffs between two webpage snapshots are messy. Timestamps, ad slots, and “last updated” strings create false positives constantly. You need to normalise before diffing.

import re
from difflib import unified_diff

def normalise(text: str) -> str:
    """Strip common noise: dates, times, 'X minutes ago', dynamic counters."""
    text = re.sub(r'\d{1,2}:\d{2}(:\d{2})?\s?(AM|PM)?', '', text)
    text = re.sub(r'\b\d+ (minutes?|hours?|days?) ago\b', '', text)
    text = re.sub(r'\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w*\s\d{1,2},?\s\d{4}', '', text)
    # Collapse runs of spaces/tabs but keep newlines — the diff below is line-based,
    # so flattening everything onto one line would make unified_diff useless
    text = re.sub(r'[ \t]+', ' ', text)
    text = re.sub(r' ?\n ?', '\n', text)
    return text.strip()


def compute_diff(old: str, new: str) -> str:
    """Return a unified diff string, empty string if no meaningful changes."""
    old_lines = normalise(old).splitlines()
    new_lines = normalise(new).splitlines()

    diff_lines = list(unified_diff(old_lines, new_lines, lineterm='', n=2))
    return "\n".join(diff_lines)

One thing the documentation never tells you: if you’re monitoring a blog’s index page, the diff will trigger on every new post even if the posts themselves are irrelevant. Track the blog index to detect new posts, but follow the link and summarise the actual post content separately. Treating the index as a signal and the post as the payload makes the summaries far more useful.
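A sketch of that index-as-signal pattern — the regex link extraction and the `/blog/` path filter here are simplifying assumptions, not robust HTML parsing:

```python
import re

def extract_post_links(index_html: str, base_url: str) -> set[str]:
    """Pull candidate post URLs from a blog index page (naive href scan)."""
    hrefs = re.findall(r'href="([^"]+)"', index_html)
    absolute = {
        h if h.startswith("http") else base_url.rstrip("/") + "/" + h.lstrip("/")
        for h in hrefs
    }
    return {h for h in absolute if "/blog/" in h}

def new_posts(current_links: set[str], seen_links: set[str]) -> set[str]:
    """Links present now but absent from the last snapshot — fetch and summarise these individually."""
    return current_links - seen_links
```

Persist `seen_links` alongside the snapshot, diff the sets each run, and feed each new post's full text to the summarisation step on its own.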

Passing Changes to Claude for Intelligent Summarisation

This is where the workflow earns its keep. A raw diff is unreadable to anyone who isn’t a developer. Claude can turn it into a structured competitive intelligence brief in a single API call.

import anthropic

client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY env var

SYSTEM_PROMPT = """You are a competitive intelligence analyst. 
Given a content diff from a competitor's website, extract meaningful business changes.
Ignore cosmetic changes (whitespace, punctuation, formatting).
Focus on: pricing changes, new features, positioning shifts, new pages, removed content.
Return a JSON object with keys: summary (2-3 sentence overview), changes (list of specific changes), 
significance (low/medium/high), and recommended_action (what the reader should do with this info)."""

def summarise_diff(url: str, diff: str, competitor_name: str) -> dict:
    """Ask Claude to interpret a content diff and return structured analysis."""
    if not diff.strip():
        return {"summary": "No meaningful changes detected.", "changes": [], "significance": "low"}

    # Truncate very long diffs — Claude has context limits, and huge diffs are rarely useful
    truncated_diff = diff[:8000]

    message = client.messages.create(
        model="claude-haiku-4-5",  # ~$0.001 per run at typical diff sizes
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[{
            "role": "user",
            "content": f"Competitor: {competitor_name}\nURL: {url}\n\nContent diff:\n{truncated_diff}\n\nAnalyse this diff."
        }]
    )

    raw = message.content[0].text
    try:
        # Claude usually returns valid JSON when instructed in the system prompt, but not always
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"summary": raw, "changes": [], "significance": "unknown"}

I use claude-haiku-4-5 for this, not Sonnet or Opus. At typical diff sizes (500–3000 tokens), a full daily run across 10 competitor pages costs roughly $0.01–0.03. Sonnet would be 5x more expensive for analysis quality you won’t notice at this task. Save the more capable models for synthesis across multiple competitors or strategic analysis — I’ll cover that below.

Wiring It Together: The Main Loop

from pathlib import Path

COMPETITORS = [
    {"name": "Acme Corp", "urls": [
        "https://acmecorp.com/pricing",
        "https://acmecorp.com/blog",
    ]},
    {"name": "RivalSaaS", "urls": [
        "https://rivalsaas.io/pricing",
        "https://rivalsaas.io/changelog",
    ]},
]

SNAPSHOT_DIR = Path("./snapshots")
SNAPSHOT_DIR.mkdir(exist_ok=True)

def run_monitoring_cycle() -> list[dict]:
    """Run one full monitoring cycle. Returns list of findings with changes."""
    findings = []

    for competitor in COMPETITORS:
        for url in competitor["urls"]:
            path = snapshot_path(url, SNAPSHOT_DIR)
            old_snapshot = load_snapshot(path)

            try:
                current_text = fetch_page_text(url)
            except Exception as e:
                print(f"Failed to fetch {url}: {e}")
                continue

            if old_snapshot is None:
                # First run — just save baseline, nothing to diff
                save_snapshot(path, current_text)
                continue

            diff = compute_diff(old_snapshot["content"], current_text)

            if diff:
                analysis = summarise_diff(url, diff, competitor["name"])
                findings.append({
                    "competitor": competitor["name"],
                    "url": url,
                    "analysis": analysis,
                    "detected_at": datetime.utcnow().isoformat(),
                })

            # Always update snapshot after processing
            save_snapshot(path, current_text)

    return findings

Generating the Daily Digest with Claude Sonnet

Individual page summaries are useful, but the real value comes from a synthesised daily brief that connects the dots across all competitors. This is where I’d use a more capable model — once per day, so the cost is negligible.

def generate_daily_digest(findings: list[dict]) -> str:
    """Use Claude Sonnet to synthesise all findings into a strategic brief."""
    if not findings:
        return "No meaningful competitor changes detected today."

    findings_text = json.dumps(findings, indent=2)

    message = client.messages.create(
        model="claude-sonnet-4-5",  # Worth the step up for strategic synthesis
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"""Here are today's competitor monitoring findings:

{findings_text}

Write a concise daily brief (max 400 words) covering:
1. Most significant changes and why they matter
2. Any patterns across multiple competitors
3. Top 2-3 recommended actions for our product/marketing team

Write for a technical founder who reads fast and wants signal, not noise."""
        }]
    )

    return message.content[0].text

Delivery: Slack Webhook in Under 20 Lines

import httpx
import os

def send_to_slack(digest: str, findings: list[dict]) -> None:
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]
    
    change_count = len(findings)
    high_sig = [f for f in findings if f.get("analysis", {}).get("significance") == "high"]

    blocks = [
        {"type": "header", "text": {"type": "plain_text", "text": "🔍 Daily Competitor Monitor"}},
        {"type": "section", "text": {"type": "mrkdwn", 
            "text": f"*{change_count} changes detected* ({len(high_sig)} high significance)"}},
        {"type": "section", "text": {"type": "mrkdwn", "text": digest}},
    ]

    httpx.post(webhook_url, json={"blocks": blocks}).raise_for_status()
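For completeness, the glue between the stages is only a few lines, assuming the functions from the earlier sections live in the same module:

```python
def main() -> None:
    """One daily run: scrape + diff, synthesise, deliver."""
    findings = run_monitoring_cycle()         # defined in the main-loop section
    digest = generate_daily_digest(findings)  # defined in the digest section
    send_to_slack(digest, findings)           # defined just above
```

Point cron (or the n8n workflow below) at `main()` and the whole pipeline runs unattended.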

Running This in n8n for Production Reliability

The Python script works fine as a cron job, but n8n gives you retry logic, error alerting, and a UI for non-technical teammates to inspect runs. The translation is straightforward:

  • Schedule Trigger → runs daily at 7am
  • Code node → runs the scraping and diffing logic (port it to JavaScript, or keep the Python and call it as an external script over HTTP)
  • HTTP Request node → calls the Anthropic API directly (n8n has a built-in node for this)
  • IF node → routes to Slack only if significance is medium or high
  • Slack node → delivers the digest

The n8n approach is especially useful if you want to add branching logic — for example, trigger a separate flow that creates a Notion page or Linear ticket when a competitor changes their pricing. Wiring that in Python means more code; in n8n it’s a five-minute configuration change.
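If you stay in pure Python, the IF-node routing collapses to a one-line predicate over the `significance` field that `summarise_diff` returns:

```python
def should_alert(findings: list[dict]) -> bool:
    """Python equivalent of the n8n IF node: only notify on medium/high significance."""
    return any(
        f.get("analysis", {}).get("significance") in ("medium", "high")
        for f in findings
    )
```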

What Breaks in Production (Honest Assessment)

A few failure modes you’ll hit within the first week:

  • Rate limiting and IP blocks — rotate user agents, add random delays (2–5s between requests), and consider a residential proxy service if you’re monitoring more than 20 URLs daily. ScraperAPI costs about $29/month for 250k requests and handles most blocks transparently.
  • Login-gated content — you can’t scrape SaaS app interiors without maintaining session cookies. Stick to public-facing pages: pricing, home, blog, changelog, docs.
  • Claude returning malformed JSON — happens maybe 2–3% of the time even with explicit instructions. The try/except block above handles it gracefully, but you should also log these cases and review the raw output periodically to tune your prompts.
  • Diff explosion after a site redesign — when a competitor relaunches their site, the diff is essentially the entire page. Cap diff length (already done above at 8000 chars) and consider a “magnitude” check: if more than 60% of lines changed, flag it as a site redesign rather than a content update.
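That magnitude check is a few lines with difflib — the 60% threshold is the rule of thumb from above, worth tuning per site:

```python
import difflib

def looks_like_redesign(old: str, new: str, threshold: float = 0.6) -> bool:
    """Flag a snapshot pair where most lines changed — likely a relaunch, not a content edit."""
    old_lines = old.splitlines()
    new_lines = new.splitlines()
    if not old_lines or not new_lines:
        return False
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
    # Fraction of lines NOT covered by any matching block = how much of the page turned over
    unchanged = sum(block.size for block in matcher.get_matching_blocks())
    changed_ratio = 1 - unchanged / max(len(old_lines), len(new_lines))
    return changed_ratio > threshold
```

Call it before `summarise_diff`: when it fires, send a one-line "site redesign detected" note instead of burning tokens analysing a wall of diff.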

When to Use This and Who It’s For

Solo founders and small teams get the most immediate value here. If you’re currently doing competitor monitoring manually — or not at all — this system pays for itself in the first week. The infrastructure cost is near-zero: a $5 VPS or a free-tier cloud function, plus Claude API costs under $1/day for a 10-competitor setup.

For growth teams and product managers, the strategic synthesis step (the Sonnet-powered daily digest) is the real deliverable. Route it to a #competitor-intel Slack channel and you’ve created a lightweight intelligence function without hiring an analyst.

If you’re at a larger company already using tools like Crayon or Klue, this custom approach gives you flexibility those platforms don’t: you can track any URL, customise the analysis prompt for your specific market context, and integrate directly with your internal tooling. It won’t match their UI or historical data, but it costs 95% less.

The full system — scraping, diffing, AI summarisation, and delivery — is maybe 200 lines of Python. That’s a morning’s work for something that replaces hours of manual competitor monitoring every week. Start with three to five URLs you already check manually, let it run for a week, and tune the normalisation and prompts based on what you see.

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.
