
Most AI assistant products make an implicit promise they can’t keep: “We’ll personalize your experience” — while quietly centralizing every preference, habit, and behavioral signal on their servers. The uncomfortable truth is that most “personalized AI” is just surveillance with a friendly UI. Building privacy-first AI agents that actually understand user context without hoovering up sensitive data is a genuinely hard architectural problem — and most tutorials skip it entirely.

This article is about building what I’m calling Monalith-style personal assistant agents: agents that maintain rich user context locally, reason about personal data without transmitting it unnecessarily, and degrade gracefully when context is unavailable. The architecture is inspired by Google’s shift toward personal intelligence agents, but approached from a developer-first perspective — with working code, real tradeoffs, and honest failure modes.

Why “Privacy-First” Is Harder Than It Sounds

Here’s the misconception I see constantly: developers assume “privacy-first” just means encrypting data at rest and adding a consent checkbox. That’s compliance theater. Real privacy-first architecture means deciding what data leaves the device or user session at all — and minimizing that surface before you write a single API call.

The actual challenge is that LLM-based agents are fundamentally stateless by default. Every Claude or GPT-4 API call is a fresh context window. To give your agent “memory” of the user’s preferences, habits, and history, you have two options:

  • Centralize user profiles on your server — easy to build, privacy nightmare, regulatory exposure under GDPR/CCPA
  • Store context client-side and selectively inject it — harder to implement, but actually privacy-respecting

Most production teams default to option one because option two requires solving hard problems around context compression, selective disclosure, and what I call the “minimum necessary context” principle. Let’s solve those problems.

The Monalith Architecture: Local Context, Remote Reasoning

The core idea is a split-brain design: context lives with the user, reasoning happens in the cloud, and the two meet only at inference time — with the user controlling what gets sent.

Component 1: The Local Context Store

This is a structured JSON object (or lightweight SQLite DB for native apps) that lives on the user’s device or in their browser’s encrypted storage. It holds behavioral signals, preferences, and derived summaries — but not raw sensitive data.


import json
import hashlib
from datetime import datetime
from typing import Optional

class LocalContextStore:
    """
    Manages user context locally. Raw data never leaves this class
    unless explicitly exported for inference.
    """
    
    def __init__(self, storage_path: str):
        self.storage_path = storage_path
        self.context = self._load()
    
    def _load(self) -> dict:
        try:
            with open(self.storage_path, 'r') as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            return {
                "preferences": {},
                "behavioral_signals": [],
                "derived_summaries": {},
                "sensitive_topics": [],  # topics user has flagged as private
                "last_updated": None
            }
    
    def record_signal(self, signal_type: str, value: str, sensitive: bool = False):
        """
        Record behavioral signal. Sensitive signals are hashed
        before storage — preserving pattern without content.
        """
        entry = {
            "type": signal_type,
            "timestamp": datetime.utcnow().isoformat(),
            # Hash sensitive values — we track pattern frequency, not content
            "value": hashlib.sha256(value.encode()).hexdigest()[:12] if sensitive else value
        }
        self.context["behavioral_signals"].append(entry)
        self._save()
    
    def export_for_inference(self, include_sensitive: bool = False) -> dict:
        """
        Produces a minimal context payload for API injection.
        Strips behavioral signals, returns only derived summaries
        and non-sensitive preferences by default.
        """
        payload = {
            "preferences": self.context["preferences"],
            "summaries": self.context["derived_summaries"]
        }
        if include_sensitive:
            # Only if user explicitly granted — prompt for consent first
            payload["behavioral_context"] = self.context["behavioral_signals"][-20:]
        return payload
    
    def _save(self):
        self.context["last_updated"] = datetime.utcnow().isoformat()
        with open(self.storage_path, 'w') as f:
            json.dump(self.context, f, indent=2)

The key design decision: sensitive behavioral signals are hashed before storage. You can track that a user frequently asks about a topic without storing what they asked. This makes the local store far less valuable to an attacker — and less legally problematic if your app’s storage is ever subpoenaed.
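To make that tradeoff concrete, here is a minimal standalone sketch of the same truncated-hash scheme `record_signal` uses, showing that hashed signals still support frequency analysis (the example queries are hypothetical):

```python
import hashlib
from collections import Counter

def hash_signal(value: str) -> str:
    # Same scheme as record_signal: SHA-256, truncated to 12 hex chars
    return hashlib.sha256(value.encode()).hexdigest()[:12]

# Hashing is deterministic, so repeated queries about the same topic
# collapse to the same token — you can count frequency without content.
queries = ["medication dosage", "medication dosage", "divorce lawyer"]
counts = Counter(hash_signal(q) for q in queries)

print(counts[hash_signal("medication dosage")])  # → 2
```

One caveat worth noting: a plain hash of a low-entropy value is still dictionary-attackable. A keyed hash (HMAC with a device-local secret) is a stronger variant of the same idea.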

Component 2: Context Compression Before Transmission

Naively injecting full user history into every API call has two problems: it’s expensive (a 5,000-token context history costs roughly $0.005 per call at Claude Haiku 4.5’s ~$1 per million input tokens, which adds up fast at scale) and it sends more data than needed. The solution is context compression — summarizing the local store into a minimal representation before injection.


def compress_context_for_system_prompt(context: dict, max_tokens: int = 300) -> str:
    """
    Converts structured context into a compact system prompt fragment.
    Keeps it under 300 tokens to avoid inflating inference costs.
    """
    lines = []
    
    prefs = context.get("preferences", {})
    if prefs:
        pref_str = ", ".join(f"{k}: {v}" for k, v in list(prefs.items())[:5])
        lines.append(f"User preferences: {pref_str}")
    
    summaries = context.get("summaries", {})
    if summaries.get("communication_style"):
        lines.append(f"Communication style: {summaries['communication_style']}")
    if summaries.get("expertise_domains"):
        domains = ", ".join(summaries["expertise_domains"][:3])
        lines.append(f"Domain expertise: {domains}")
    
    # Cap at roughly max_tokens (rough character estimate: 4 chars/token)
    result = "\n".join(lines)
    return result[:max_tokens * 4]

A compressed context fragment typically runs 150-250 tokens. At Claude Haiku 4.5 pricing (~$1 per million input tokens), that’s roughly $0.0002 per call for the context overhead — negligible. This is exactly the kind of optimization covered in depth in the LLM caching strategies guide.

The Inference Layer: Injecting Context Without Leaking It

System Prompt Construction

The agent’s system prompt is assembled at runtime by merging a static base prompt with the user’s compressed context. The LLM provider never sees the raw local store — only the summary.


import anthropic

def build_privacy_respecting_agent(user_context: dict, user_message: str) -> str:
    """
    Constructs and executes an inference call with minimal context exposure.
    """
    client = anthropic.Anthropic()
    
    # Compress context before it leaves the local environment
    context_fragment = compress_context_for_system_prompt(user_context)
    
    system_prompt = f"""You are a personal assistant. Use the following context 
about this user to personalize your responses, but do not reference or repeat 
this context back to the user unless directly relevant.

User context:
{context_fragment}

Important: Do not ask the user to confirm or expand on personal details. 
Work with what you have."""
    
    message = client.messages.create(
        model="claude-haiku-4-5",  # Haiku for cost efficiency on frequent calls
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": user_message}]
    )
    
    return message.content[0].text

One thing the documentation doesn’t make obvious: the system prompt counts toward billable input tokens, and, separately, Anthropic does not use API conversation data to train models by default. So the context you inject is processed in-flight rather than fed into training, though limited operational retention may apply. That’s meaningfully different from consumer products — worth making explicit in your app’s privacy policy.

Handling the Three Hardest Edge Cases

Edge Case 1: Consent Management for Context Upgrades

Users will eventually want richer personalization that requires sharing more context. You need a consent gate — not a blanket permission at signup, but granular consent at the moment of upgrade. Build it as a context tier system:


from enum import IntEnum

class ContextTier(IntEnum):
    MINIMAL = 0    # Preferences only, no behavioral data
    STANDARD = 1   # Preferences + anonymized behavioral summaries
    FULL = 2       # Full behavioral context (explicit user opt-in required)

def get_context_for_tier(store: LocalContextStore, tier: ContextTier) -> dict:
    if tier == ContextTier.MINIMAL:
        return {"preferences": store.context["preferences"]}
    elif tier == ContextTier.STANDARD:
        return store.export_for_inference(include_sensitive=False)
    else:
        # Require explicit, session-specific user confirmation before calling this
        return store.export_for_inference(include_sensitive=True)

Edge Case 2: Profile Drift and Stale Context

User preferences change. A context store built six months ago will mislead the agent if you don’t handle staleness. The fix is decay weighting — older signals count less, and preferences older than a configurable TTL get flagged for re-confirmation. The ethical implications of persistent user profiling are worth reading about separately; this breakdown of AI user profiling ethics covers the edge cases most developers miss.
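The decay weighting described above can be sketched in a few lines. This is a minimal illustration assuming timezone-aware ISO timestamps; the 30-day half-life and 180-day TTL are invented defaults, not recommendations:

```python
from datetime import datetime, timedelta, timezone

def signal_weight(timestamp_iso: str, half_life_days: float = 30.0) -> float:
    """Exponential decay: a signal's influence halves every half_life_days."""
    recorded = datetime.fromisoformat(timestamp_iso)
    age_days = (datetime.now(timezone.utc) - recorded).total_seconds() / 86400
    return 0.5 ** (max(age_days, 0.0) / half_life_days)

def needs_reconfirmation(timestamp_iso: str, ttl_days: float = 180.0) -> bool:
    """Flag preferences past the TTL for explicit user re-confirmation."""
    return signal_weight(timestamp_iso, half_life_days=ttl_days) < 0.5

now = datetime.now(timezone.utc)
fresh = now.isoformat()
stale = (now - timedelta(days=200)).isoformat()
print(needs_reconfirmation(fresh), needs_reconfirmation(stale))  # → False True
```

The same weight function can rank behavioral signals before export, so recent signals dominate the compressed context while stale ones fade out instead of abruptly disappearing.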

Edge Case 3: Cross-Device Sync Without Centralization

This is the hardest one. If your context store is purely local, users lose their personalization when they switch devices. The privacy-respecting solutions are:

  • End-to-end encrypted sync (like iCloud Keychain or Signal’s encrypted backups) — your server stores ciphertext it can’t read
  • User-controlled export/import — let users manually export their context as an encrypted file
  • Zero-knowledge proof schemes — advanced, but increasingly viable for high-security use cases

Skip any solution that requires your server to hold decryption keys. That defeats the architecture entirely.
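The user-controlled export/import option can be sketched concretely. This is a hedged illustration using the third-party cryptography package’s Fernet recipe with stdlib PBKDF2 key derivation; the iteration count and the fixed demo salt are illustrative assumptions, not production settings (use a random per-export salt and tuned parameters in practice):

```python
import base64
import hashlib
import json

from cryptography.fernet import Fernet  # third-party: pip install cryptography

def derive_key(passphrase: str, salt: bytes) -> bytes:
    # PBKDF2-HMAC-SHA256; Fernet expects a urlsafe-base64-encoded 32-byte key
    raw = hashlib.pbkdf2_hmac("sha256", passphrase.encode(), salt, 600_000)
    return base64.urlsafe_b64encode(raw)

def export_context(context: dict, passphrase: str, salt: bytes) -> bytes:
    """Ciphertext is safe to hand to a sync server that never sees the key."""
    return Fernet(derive_key(passphrase, salt)).encrypt(json.dumps(context).encode())

def import_context(blob: bytes, passphrase: str, salt: bytes) -> dict:
    return json.loads(Fernet(derive_key(passphrase, salt)).decrypt(blob))

salt = b"0123456789abcdef"  # demo only — generate a random salt per export
encrypted = export_context({"preferences": {"tone": "concise"}}, "correct horse", salt)
```

Your server can store and replicate the encrypted blob across devices, but without the passphrase it holds only ciphertext — exactly the property the bullets above demand.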

Misconceptions That Will Bite You in Production

Misconception 1: “On-device models solve the privacy problem.” They reduce it. But a 3B-parameter on-device model won’t deliver the quality users expect for complex reasoning tasks. A hybrid approach — local context, cloud inference with minimal context injection — is more practical for most consumer apps today. If you’re evaluating model hosting tradeoffs, the Claude Agents vs OpenAI Assistants architecture comparison is a useful reference for understanding what you’re giving up at each tier.

Misconception 2: “Anonymization is good enough.” Re-identification attacks on “anonymized” behavioral data are well-documented. Hashing individual signals (as in the code above) is better than storing plaintext, but the right bar is: would this data be damaging if the local store were extracted by malware? Design accordingly.

Misconception 3: “Users don’t care about privacy until something goes wrong.” This is increasingly false, especially in the EU, and doubly false for health, finance, and communications apps. Users who understand what your agent does with their data will trust it more — and your conversion and retention numbers will reflect that. Privacy-first architecture is a product feature, not just a compliance checkbox.

Real Cost Numbers for a Consumer-Scale Deployment

For an agent making 10 calls/day per active user, using Claude Haiku 4.5 with a ~200-token context injection (at ~$1 per million input tokens and ~$5 per million output tokens — verify current rates):

  • Context overhead: ~$0.0002 per call → $0.002/user/day
  • Average response (400 tokens output): ~$0.002 per call → $0.02/user/day
  • Total API cost at 10k DAU: roughly $6,600/month
  • Context storage (local JSON, ~5KB/user): effectively free

The local context approach adds zero marginal infrastructure cost compared to centralized profile databases — no vector DB, no user profile service, no encryption-at-rest overhead on your end. At scale, that’s significant. For deeper cost modeling, the LLM cost calculator can help you model different call volumes and model tiers before committing to an architecture.
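The bullet math above can be sanity-checked with a small cost model. The per-million-token prices are hardcoded assumptions to be replaced with current published rates, and the user’s own message tokens are deliberately excluded for simplicity:

```python
def monthly_api_cost(dau: int, calls_per_day: int, ctx_tokens: int,
                     out_tokens: int, in_usd_per_mtok: float = 1.0,
                     out_usd_per_mtok: float = 5.0, days: int = 30) -> float:
    """Rough monthly spend: context-injection input plus response output."""
    per_call = (ctx_tokens * in_usd_per_mtok + out_tokens * out_usd_per_mtok) / 1e6
    return dau * calls_per_day * per_call * days

print(round(monthly_api_cost(10_000, 10, ctx_tokens=200, out_tokens=400)))  # → 6600
```

Rerunning this with a larger context injection or a pricier model tier shows quickly where the budget goes: output tokens dominate, so trimming the context fragment matters less than capping response length.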

When to Use This Architecture (And When Not To)

Use Monalith-style privacy-first AI agents when:

  • Your app handles health, financial, legal, or communications data
  • You’re targeting EU users and want GDPR compliance by design rather than by patch
  • Your users are privacy-conscious (security tools, productivity apps, professional tools)
  • You want to differentiate on trust — especially against Big Tech incumbents who centralize everything

Skip this complexity when:

  • Your app is purely stateless (single-turn queries, no personalization needed)
  • You’re building internal enterprise tooling where centralized data is already accepted and governed
  • You’re in early prototype stage and need to validate the use case before optimizing architecture

My honest take: For solo founders building consumer AI apps in 2025, the Monalith local-context architecture is the right default. It’s not significantly harder than a centralized approach once you’ve internalized the pattern, it reduces your regulatory surface area, and it gives you a genuine privacy story to tell users — which is increasingly a competitive differentiator. Start with the three-tier consent system, implement context compression from day one, and add cross-device sync only when users ask for it.

Frequently Asked Questions

How do privacy-first AI agents handle cross-device sync without centralizing data?

The cleanest approach is end-to-end encrypted sync where your server holds only ciphertext it cannot decrypt — similar to how iCloud Keychain works. Alternatively, you can let users manually export and import an encrypted context file. Avoid any design where your server holds decryption keys, since that makes you a centralized data processor regardless of how you describe it.

Does Anthropic store the user context I inject into Claude API calls?

By default, Anthropic does not use data from API calls to train models, and does not retain conversation data beyond the processing window. This is different from consumer Claude.ai usage. Always verify the current data processing terms in Anthropic’s usage policies before production deployment, especially if you’re handling regulated data categories.

What is the minimum viable context injection for meaningful personalization?

In practice, 150-250 tokens covering communication style preference, 2-3 domain expertise signals, and response format preferences produces noticeably better outputs than a generic system prompt. You don’t need behavioral history to get good personalization — derived summaries of preferences do most of the work at a fraction of the token cost.

Can I use on-device models instead of cloud APIs to avoid data transmission entirely?

Yes, but with quality tradeoffs. Models that run on-device today (Phi-3 Mini, Gemma 2B, Llama 3.2 3B) handle simple tasks well but struggle with multi-step reasoning and nuanced context interpretation. A hybrid approach — on-device for simple queries and context summarization, cloud API for complex reasoning — is currently the best balance of privacy and capability.
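A toy sketch of that hybrid routing decision follows. The length threshold and multi-step keyword markers are invented heuristics for illustration, not a tested policy — a production router would likely use a small classifier instead:

```python
def route_query(query: str, max_simple_words: int = 12) -> str:
    """Send short single-step queries on-device; escalate the rest to cloud."""
    multi_step_markers = ("then", "compare", "plan", "step", "why")
    words = query.lower().split()
    if len(words) > max_simple_words or any(m in words for m in multi_step_markers):
        return "cloud"
    return "on_device"

print(route_query("what's on my calendar today"))                      # → on_device
print(route_query("compare these two offers and then draft a reply"))  # → cloud
```

The key privacy win is that the router itself runs locally, so the decision about whether a query ever leaves the device is made before any network call.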

How should I handle GDPR right-to-erasure requests with a local context architecture?

If context is stored purely client-side, erasure is trivially handled by the user deleting their local data — no server-side deletion request needed. If you have any server-side components (encrypted sync, telemetry, logs), you need standard GDPR deletion pipelines for those. Document clearly in your privacy policy which data lives where, so users know what deletion actually clears.


Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.
