Most inbound lead flows are broken in the same way: a form submission arrives, sits in a shared inbox for hours, gets manually reviewed by whoever has time, and ends up routed based on gut feel or whoever’s turn it is in the rotation. AI lead qualification fixes this at the source — every lead gets scored in seconds, and the right salesperson gets it immediately, with context already attached.
This article walks through building a working lead qualification and routing system using an LLM as the scoring engine, with n8n as the orchestration layer and a CRM webhook as the destination. You’ll get actual scoring logic, prompt templates, routing conditions, and cost estimates. The whole pipeline can process a lead in under three seconds.
What the Pipeline Actually Does
Before writing any code, it helps to be precise about what “qualification” means in a system context. There are three distinct jobs here:
- Scoring: Assign a numeric or categorical score based on fit criteria (budget, company size, use case, urgency)
- Enrichment: Add context the form didn’t capture (industry vertical, tech stack signals from email domain, etc.)
- Routing: Send the lead to the right queue, rep, or Slack channel based on the score and attributes
Traditional rule-based scoring handles structured fields well — if budget > 10000, add 20 points. But it falls apart on unstructured data: a freeform “tell us about your use case” field that says “we’re rebuilding our entire data infra after a Series B” contains strong buying signals that no regex will catch. That’s exactly where the LLM earns its place.
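For contrast, the rule-based half takes a few lines. This is a hypothetical scorer over structured fields only (the field names and weights are illustrative, not from any particular CRM):

```python
def rule_based_score(lead: dict) -> int:
    """Score a lead on structured fields only; weights are illustrative."""
    score = 0
    if lead.get("budget", 0) > 10000:
        score += 20
    if 10 <= lead.get("company_size", 0) <= 500:
        score += 15
    # A company email domain is a weak fit signal
    if lead.get("email", "").split("@")[-1] not in ("gmail.com", "yahoo.com"):
        score += 10
    return score
```

Note what it cannot see: the freeform use-case text never enters the calculation, which is precisely the gap the LLM fills.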
Architecture Overview
The stack I’d recommend for most teams building this without a full engineering department:
- n8n (self-hosted or cloud) as the workflow engine — handles webhooks, conditional routing, CRM writes
- Claude Haiku or GPT-4o-mini as the scoring LLM — cheap, fast, good enough for classification tasks
- HubSpot or Pipedrive as the CRM destination — both have solid REST APIs
- Slack for rep notifications with the lead summary attached
The flow: form submission → webhook → n8n → LLM scoring → routing logic → CRM update + Slack notify. Total processing time is under 3 seconds. At Claude Haiku pricing (~$0.25 per million input tokens at the time of writing), a typical lead scoring call costs a small fraction of a cent, so 10,000 leads run only a few dollars.
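The per-lead cost math is simple enough to keep in a helper. The prices below are assumptions frozen at the time of writing; treat them as placeholders and check the vendor's pricing page:

```python
# Assumed per-million-token prices (Claude Haiku, USD); these drift over time
PRICE_IN_PER_MTOK = 0.25
PRICE_OUT_PER_MTOK = 1.25

def cost_per_lead(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one scoring call."""
    return (input_tokens * PRICE_IN_PER_MTOK
            + output_tokens * PRICE_OUT_PER_MTOK) / 1_000_000

# A ~400-token prompt with a ~200-token response:
# cost_per_lead(400, 200) -> 0.00035, i.e. ~$3.50 per 10,000 leads
```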
Building the Scoring Prompt
The scoring prompt is the core of the system. Most implementations get this wrong by asking the LLM to do too many things at once or by giving it no schema to return. Here’s a prompt that works reliably in production:
```python
SCORING_PROMPT = """
You are a B2B sales qualification assistant. Analyze the following lead submission
and return a JSON object with your assessment.

Lead data:
- Name: {name}
- Company: {company}
- Email: {email}
- Budget: {budget}
- Use case (freeform): {use_case}
- Company size: {company_size}

Score this lead based on the following criteria (our ICP: B2B SaaS companies,
10-500 employees, annual budget $20k+, pain points around data or automation):

Return ONLY valid JSON in this exact format:
{{
  "score": <integer 0-100>,
  "tier": <"hot" | "warm" | "cold">,
  "icp_fit": <"strong" | "moderate" | "weak">,
  "budget_signal": <"confirmed" | "implied" | "missing">,
  "urgency_signal": <"high" | "medium" | "low">,
  "key_signals": [<list of 2-3 specific phrases from the use case that drove your score>],
  "routing_reason": <one sentence explaining who should receive this lead and why>
}}

Tier thresholds: hot = 75+, warm = 45-74, cold = below 45.
Be conservative — only mark hot if multiple strong signals are present.
"""
```
Two things matter here that most guides skip. First, the key_signals field forces the model to cite evidence from the actual text — this means your sales reps see why the score was assigned, not just the number. Second, the “be conservative” instruction reduces false positives on hot leads, which is the failure mode that destroys rep trust in any scoring system.
Parsing the Response Reliably
LLMs occasionally wrap JSON in markdown code fences or add trailing text. Don’t assume clean output — parse defensively:
```python
import json
import re

def parse_score_response(raw_response: str) -> dict:
    # Strip markdown code fences if present
    cleaned = re.sub(r"```(?:json)?|```", "", raw_response).strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Fallback: try to extract a JSON object with regex
        match = re.search(r"\{.*\}", cleaned, re.DOTALL)
        if match:
            try:
                return json.loads(match.group())
            except json.JSONDecodeError:
                pass  # Fall through to the default below
        # If all else fails, return a default cold score rather than crashing
        return {
            "score": 0,
            "tier": "cold",
            "icp_fit": "weak",
            "budget_signal": "missing",
            "urgency_signal": "low",
            "key_signals": [],
            "routing_reason": "Parse error — manual review required",
        }
```
That fallback matters in production. A parse failure shouldn’t drop the lead — it should flag it for human review and move on.
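Parsing is not the same as validating: the JSON can be well-formed and still out of schema (a tier of "lukewarm", a score of 140). A small validator, sketched here as a hypothetical helper rather than part of the article's stack, lets you send anything out-of-schema to the same manual-review fallback:

```python
# Allowed values, mirroring the schema in the scoring prompt
ALLOWED_VALUES = {
    "tier": {"hot", "warm", "cold"},
    "icp_fit": {"strong", "moderate", "weak"},
    "budget_signal": {"confirmed", "implied", "missing"},
    "urgency_signal": {"high", "medium", "low"},
}

def is_valid_score(data: dict) -> bool:
    """Check a parsed LLM response against the prompt's output schema."""
    score = data.get("score")
    if not isinstance(score, int) or not 0 <= score <= 100:
        return False
    for field, allowed in ALLOWED_VALUES.items():
        if data.get(field) not in allowed:
            return False
    return isinstance(data.get("key_signals"), list)
```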
Routing Logic: Beyond Simple Score Thresholds
Score alone is a blunt instrument. A score of 80 might mean “enterprise deal, needs AE” or “SMB that’s perfect for a product-led motion.” Routing should use the full structured output from the LLM, not just the number.
Here’s the routing decision table I use. Implement this as a series of conditions in n8n’s Switch node, or as a function if you’re writing this in Python:
```python
def route_lead(score_data: dict, lead_data: dict) -> dict:
    tier = score_data["tier"]
    budget = score_data["budget_signal"]
    company_size = lead_data.get("company_size", "")

    # Enterprise route: hot lead + large company
    if tier == "hot" and "enterprise" in company_size.lower():
        return {
            "queue": "enterprise_ae",
            "sla_hours": 1,
            "slack_channel": "#leads-enterprise",
            "priority": "urgent",
        }

    # Standard hot route
    if tier == "hot":
        return {
            "queue": "senior_sdr",
            "sla_hours": 2,
            "slack_channel": "#leads-hot",
            "priority": "high",
        }

    # Warm with confirmed budget — still worth a quick call
    if tier == "warm" and budget == "confirmed":
        return {
            "queue": "sdr_pool",
            "sla_hours": 8,
            "slack_channel": "#leads-warm",
            "priority": "medium",
        }

    # Warm, budget implied — nurture sequence
    if tier == "warm":
        return {
            "queue": "nurture_sequence",
            "sla_hours": 24,
            "slack_channel": "#leads-warm",
            "priority": "low",
        }

    # Cold — automated nurture only, no rep time
    return {
        "queue": "cold_nurture",
        "sla_hours": None,
        "slack_channel": None,  # No Slack notification for cold
        "priority": "none",
    }
```
The cold path deliberately skips Slack notifications. Every cold lead pinging your reps trains them to ignore the channel. Protect the signal-to-noise ratio from the start.
n8n Workflow Implementation
In n8n, the workflow looks like this as a node sequence:
- Webhook node — receives the form POST, validates required fields
- Function node — builds the scoring prompt by interpolating lead fields
- HTTP Request node — calls the Anthropic or OpenAI API with the prompt
- Function node — parses the JSON response, runs the routing function
- Switch node — branches on the queue value from the routing output
- HubSpot node (per branch) — creates/updates contact with score properties
- Slack node (hot/warm branches) — sends formatted notification with lead summary
The HubSpot integration deserves attention. You need to create custom properties in HubSpot first: ai_score (number), ai_tier (single-line text), ai_icp_fit, ai_key_signals (multi-line text), and ai_routing_reason. These appear on the contact record, so reps see the full context in the CRM before making a call.
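If you are making the CRM write in code rather than via the HubSpot node, the update body is just a `properties` object keyed by those custom property names. The shape below follows HubSpot's v3 contacts API (a PATCH to `/crm/v3/objects/contacts/{id}`), but verify against the current API docs before relying on it:

```python
def hubspot_properties(score: dict) -> dict:
    """Build the request body for a HubSpot v3 contact update.

    Assumes the custom properties (ai_score, ai_tier, ...) already exist
    in the portal; creating them is a one-time setup step.
    """
    return {
        "properties": {
            "ai_score": score["score"],
            "ai_tier": score["tier"],
            "ai_icp_fit": score["icp_fit"],
            # Multi-line text property: one signal per line
            "ai_key_signals": "\n".join(score["key_signals"]),
            "ai_routing_reason": score["routing_reason"],
        }
    }
```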
Slack Notification Format That Actually Gets Read
Slack notifications from automated systems get ignored if they look automated. Format them to front-load the signal:
```python
def format_slack_message(lead: dict, score: dict, routing: dict) -> dict:
    tier_emoji = {"hot": "🔥", "warm": "♨️", "cold": "❄️"}
    return {
        "blocks": [
            {
                "type": "header",
                "text": {
                    "type": "plain_text",
                    "text": f"{tier_emoji[score['tier']]} New {score['tier'].upper()} Lead — {lead['company']}",
                },
            },
            {
                "type": "section",
                "fields": [
                    {"type": "mrkdwn", "text": f"*Score:* {score['score']}/100"},
                    {"type": "mrkdwn", "text": f"*Budget:* {score['budget_signal']}"},
                    {"type": "mrkdwn", "text": f"*ICP Fit:* {score['icp_fit']}"},
                    {"type": "mrkdwn", "text": f"*SLA:* {routing['sla_hours']}h"},
                ],
            },
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": "*Why this score:*\n" + "\n".join(f"• {s}" for s in score["key_signals"]),
                },
            },
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*Routing note:* {score['routing_reason']}",
                },
            },
        ]
    }
```
What Breaks in Production and How to Handle It
Running this in production for a few months surfaces predictable failure modes:
LLM latency spikes: Anthropic and OpenAI both have occasional slow responses (5-10 seconds). Set a timeout of 8 seconds on your HTTP request node and have a fallback path that sends the lead to a manual review queue with the raw form data attached. Never let a lead drop silently.
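If you implement the scoring call in code rather than in n8n, the same timeout-plus-fallback behavior is a few lines with a thread pool. Here `score_fn` stands in for whatever function wraps your LLM API call, and `MANUAL_REVIEW` is an assumed default payload mirroring the parser's fallback:

```python
from concurrent.futures import ThreadPoolExecutor

# Default result mirroring the parse-failure fallback; flags for human review
MANUAL_REVIEW = {
    "score": 0, "tier": "cold", "icp_fit": "weak",
    "budget_signal": "missing", "urgency_signal": "low",
    "key_signals": [], "routing_reason": "Timeout: manual review required",
}

def score_with_timeout(score_fn, lead: dict, timeout_s: float = 8.0) -> dict:
    """Run the scoring call with a hard deadline; timeouts and API errors
    fall through to manual review instead of dropping the lead."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(score_fn, lead).result(timeout=timeout_s)
    except Exception:  # TimeoutError, API error, bad response: all go to review
        return MANUAL_REVIEW
    finally:
        # Don't block on the stuck call; the worker thread finishes in the background
        pool.shutdown(wait=False)
```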
Score drift: The LLM’s interpretation of your ICP shifts as the model is updated. Run a weekly audit — pull 20 random leads from the past week and manually review whether the scores match your reps’ assessments. When drift appears, update the prompt with additional examples or tighter criteria language.
Gaming the form: Prospects learn to write “we have a $50k budget” in freeform fields even when they don’t. The model will believe them. Add a hard rule: any lead where budget_signal == "confirmed" but the structured budget field is blank gets flagged for human review before hot routing. Trust the structured field over the freeform one.
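That hard rule is a single comparison. A hypothetical guard, run before routing:

```python
def needs_budget_review(score: dict, lead: dict) -> bool:
    """Flag leads whose freeform text claims a budget the structured field doesn't back up."""
    claims_budget = score.get("budget_signal") == "confirmed"
    structured_blank = not str(lead.get("budget", "")).strip()
    return claims_budget and structured_blank
```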
ICP mismatch: Your ICP definition in the prompt is probably too vague initially. After two weeks, look at all the hot leads that didn’t convert to meetings. Pull their key_signals fields and update the prompt to explicitly exclude those patterns. Treat the prompt like a classifier you’re iteratively improving — because that’s exactly what it is.
Model Selection: Haiku vs GPT-4o-mini vs Full Models
For this specific task, I’d use Claude Haiku or GPT-4o-mini over larger models without hesitation. Lead scoring is a classification task with a well-defined schema. You don’t need reasoning depth — you need speed, reliability, and low cost.
Claude Haiku processes a typical scoring prompt (roughly 400 input tokens, 200 output tokens) for a few hundredths of a cent; GPT-4o-mini lands in the same range. Claude Sonnet or GPT-4o costs 15-20x more for the same task with no measurable quality improvement on structured extraction.
The one case where I’d reach for a larger model: if your use case descriptions are highly technical (e.g., a DevOps tooling company where leads describe complex infrastructure setups) and the small models consistently miss the ICP signals. In that case, the improved accuracy on niche technical content may justify the cost.
When to Build This vs. Buy It
Tools like MadKudu, Clearbit Reveal, and 6sense do AI lead scoring out of the box. They’re good products. Here’s when building your own makes sense instead:
- Your ICP is niche enough that generic models don’t capture it (they’re trained on broad B2B patterns)
- You have heavy unstructured data — long use case descriptions, support tickets repurposed as lead signals
- You need to own the scoring logic for compliance or explainability reasons
- You want to integrate signals from non-standard sources (product usage data, community activity, support history)
- Volume is low enough that SaaS pricing doesn’t pencil out
If you’re doing under 500 leads/month and your ICP is well-defined with mostly structured fields, a simple n8n + HubSpot workflow without an LLM might be sufficient. Add the LLM layer when the freeform fields contain meaningful signal you’re currently ignoring.
Bottom Line: Who Should Build This
Solo founders and small teams get the highest ROI here — you’re replacing the “whoever-has-time reviews the inbox” problem with a system that never sleeps and costs pennies per lead. Build the n8n version described above; it takes a day to implement and immediately pays for itself.
Technical teams at growth-stage companies should build this as a microservice rather than n8n, wrap it with a proper queue (Redis or SQS), and add a feedback loop where rep outcomes are logged back against the AI scores. That feedback data becomes your fine-tuning dataset if you ever want to move to a fine-tuned model.
Enterprises with existing CRM infrastructure and dedicated RevOps teams should evaluate MadKudu or 6sense first — the custom integrations and reporting dashboards save significant engineering time. But this build-your-own approach remains valid for teams with non-standard data sources or compliance constraints that SaaS vendors can’t accommodate.
The core insight behind effective AI lead qualification isn’t the model — it’s designing a structured output schema that your routing logic and reps can actually use. Get that right and the rest follows.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

