MCP Servers · Nov 1, 2025 · 10 min read

MCP Architecture Patterns: Building Robust AI Integrations

Deep dive into Model Context Protocol architecture. Learn dual-vault patterns, tool design best practices, and real implementation examples from production MCP servers handling 26+ tools.

Eli
Systems Architect, NH/VT

Model Context Protocol (MCP) is becoming the industry standard for extending AI capabilities. After building 12 production MCP servers (including a 26-tool YouTube server and dual-vault Obsidian integration), I've learned what works and what doesn't. Here's the real-world architecture guide.

What is Model Context Protocol?

MCP is an open protocol that lets AI assistants connect to external data sources and tools. Think of it as an API standard specifically designed for AI context—not for humans calling endpoints.

Key Concepts

  • Server: Provides tools/resources to AI assistants
  • Tool: Function the AI can call (like "get_video_transcript")
  • Resource: Static data the AI can read (like file contents)
  • Prompt: Pre-configured conversation starters

The protocol handles:

  • Tool discovery (AI learns what tools are available)
  • Parameter validation (ensures correct inputs)
  • Authentication (OAuth, API keys)
  • Error handling (standardized responses)
  • Streaming (for large responses)
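The discovery and validation steps can be sketched with a minimal, hypothetical tool registry. The real MCP SDKs handle this for you; the class and method names here are illustrative, not part of the protocol:

```python
import inspect
from typing import Any, Callable, Dict

class ToolRegistry:
    """Illustrative registry: tracks tools and exposes their schemas."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def tool(self, func: Callable[..., Any]) -> Callable[..., Any]:
        """Register a function so clients can discover and call it."""
        self._tools[func.__name__] = func
        return func

    def list_tools(self) -> Dict[str, dict]:
        """Tool discovery: each tool's name, description, and parameters."""
        return {
            name: {
                "description": inspect.getdoc(fn) or "",
                "params": list(inspect.signature(fn).parameters),
            }
            for name, fn in self._tools.items()
        }

    def call(self, name: str, **kwargs: Any) -> Any:
        """Parameter validation: reject unknown tools or bad arguments."""
        if name not in self._tools:
            raise KeyError(f"Unknown tool: {name}")
        fn = self._tools[name]
        inspect.signature(fn).bind(**kwargs)  # raises TypeError on bad args
        return fn(**kwargs)

registry = ToolRegistry()

@registry.tool
def get_video_transcript(video_id: str) -> str:
    """Fetch a transcript (stub for illustration)."""
    return f"transcript for {video_id}"
```

The AI assistant only ever sees the output of `list_tools`-style discovery, which is why names, descriptions, and parameter schemas carry so much weight.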

Dual-Vault Architecture Pattern

One of my most useful patterns: connecting AI assistants to two separate Obsidian vaults simultaneously. This solves the problem of mixing personal and work contexts.

┌──────────────────────────────────────────┐
│         Claude Desktop                   │
│  (or Claude Code)                        │
└──────────────────────────────────────────┘
            │
    ┌───────┴───────┐
    │               │
    ▼               ▼
┌─────────┐    ┌─────────┐
│ Personal│    │  Work   │
│  MCP    │    │  MCP    │
│ Port    │    │ Port    │
│ 22360   │    │ 22361   │
└─────────┘    └─────────┘
    │               │
    ▼               ▼
┌─────────┐    ┌─────────┐
│Personal │    │  Work   │
│ Vault   │    │  Vault  │
│         │    │         │
│Business │    │Property │
│Projects │    │  Mgmt   │
└─────────┘    └─────────┘

Configuration Example

{
  "mcpServers": {
    "obsidian-personal": {
      "command": "npx",
      "args": ["mcp-remote", "http://localhost:22360/sse"],
      "env": {}
    },
    "obsidian-work": {
      "command": "npx",
      "args": ["mcp-remote", "http://localhost:22361/sse"],
      "env": {}
    }
  }
}

Why This Works

  • Separate contexts: AI assistants see tools prefixed by server name
  • No confusion: "obsidian-personal__get_file" vs "obsidian-work__get_file"
  • Security: Each vault has its own authentication/permissions
  • Scaling: Can add more vaults without conflicts
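The namespacing behavior above can be sketched in a few lines: the client merges the tool lists from each server and prefixes every tool with its server name. This is a simplified model of what the client does, not Claude Desktop's actual implementation:

```python
def merge_tool_lists(servers: dict[str, list[str]]) -> list[str]:
    """Prefix each tool with its server name so identical tools never collide."""
    return [
        f"{server}__{tool}"
        for server, tools in servers.items()
        for tool in tools
    ]

# Both vaults expose the same tools, but the merged names are unambiguous.
tools = merge_tool_lists({
    "obsidian-personal": ["get_file", "append_content"],
    "obsidian-work": ["get_file", "append_content"],
})
```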

Real Usage Examples

Query: "What tasks do I have this week?"

The AI assistant searches both vaults, aggregates tasks, and presents them separated by context. It knows which vault is for personal vs work based on server names.

Query: "Update the work property file for Highland Terrace"

Your AI assistant knows to use obsidian-work__append_content rather than the personal vault's tools. Clear separation prevents cross-contamination.

Tool Design Patterns

Pattern 1: Context-First Design

Always return lightweight context by default, with options for more detail:

@server.tool()
async def get_video_context(
    video_id: str,
    include_transcript: bool = False,  # Optional
    include_comments: bool = False,    # Optional
    include_transcript_status: bool = False  # Optional
):
    """Get video metadata (lightweight by default)."""

    # Always return: title, description, stats, chapters
    context = await fetch_metadata(video_id)  # ~2-3KB

    # Optionally add heavy data
    if include_transcript:
        context['transcript'] = await fetch_transcript(video_id)  # Could be 100KB+

    if include_comments:
        context['comments'] = await fetch_comments(video_id)  # Could be 50KB+

    if include_transcript_status:
        context['transcript_status'] = await check_transcript_availability(video_id)

    return context
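The pattern generalizes to any tool: decide up front which fields are cheap and always returned, and which are expensive and opt-in. A tiny self-contained sketch with stubbed fetchers (the sizes and field names are illustrative):

```python
import asyncio

async def fetch_metadata(video_id: str) -> dict:
    """Cheap: always fetched (~2-3KB in the real server)."""
    return {"id": video_id, "title": "Example", "stats": {"views": 100}}

async def fetch_transcript(video_id: str) -> str:
    """Expensive: only fetched on request (could be 100KB+)."""
    return "full transcript text..."

async def get_video_context(video_id: str, include_transcript: bool = False) -> dict:
    context = await fetch_metadata(video_id)   # lightweight default
    if include_transcript:                     # heavy data is opt-in
        context["transcript"] = await fetch_transcript(video_id)
    return context

light = asyncio.run(get_video_context("abc"))
heavy = asyncio.run(get_video_context("abc", include_transcript=True))
```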

Why This Matters

AI context windows are large but not infinite. Tools that return 100KB+ of data by default consume the context budget quickly. Start small, let the AI request more if needed.

Pattern 2: Smart Chunking

For large content, return indexes/summaries first with chunking options:

from typing import List, Optional

@server.tool()
async def get_video_transcript(
    video_id: str,
    chunk_size: Optional[int] = None,  # If set, return chunks
    chunk_index: Optional[int] = 0,    # Which chunk
    time_range: Optional[List[str]] = None  # Or get specific time range
):
    """Get transcript with intelligent size handling."""

    transcript = await fetch_transcript(video_id)

    # Long video? Return index/summary by default
    if len(transcript) > 50000 and chunk_size is None:
        return {
            "type": "index",
            "summary": generate_summary(transcript),
            "duration": get_duration(transcript),
            "preview": transcript[:500],  # First 500 chars
            "recommendations": [
                "Use chunk_size=100 to get full transcript in chunks",
                "Use time_range=['00:10:00', '00:20:00'] for specific section"
            ]
        }

    # Handle chunking or time ranges
    if chunk_size:
        return get_chunk(transcript, chunk_size, chunk_index)
    elif time_range:
        return get_time_range(transcript, time_range)

    # Short video: return full transcript
    return transcript
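The get_chunk helper referenced above isn't shown in the original; one plausible sketch, assuming the transcript is a plain string and chunks are fixed-size slices with paging metadata:

```python
def get_chunk(transcript: str, chunk_size: int, chunk_index: int) -> dict:
    """Return one slice of the transcript plus enough metadata to keep paging."""
    total_chunks = -(-len(transcript) // chunk_size)  # ceiling division
    start = chunk_index * chunk_size
    return {
        "type": "chunk",
        "chunk_index": chunk_index,
        "total_chunks": total_chunks,
        "content": transcript[start:start + chunk_size],
        "has_more": chunk_index + 1 < total_chunks,
    }
```

Returning `total_chunks` and `has_more` lets the AI assistant decide whether to keep requesting chunks without guessing at the transcript's length.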

Pattern 3: Multi-Account Quota Management

For APIs with daily quotas, implement automatic account rotation:

from typing import Optional

class AccountManager:
    async def get_youtube_client(self, preferred_account: Optional[str] = None):
        """Get YouTube client with automatic quota balancing."""

        if preferred_account:
            return await self._get_client(preferred_account)

        # Round-robin through accounts
        for account in self.accounts:
            quota_used = await self._check_quota(account)

            if quota_used < 9000:  # Under daily limit of 10,000
                return await self._get_client(account)

        # All accounts exhausted
        raise QuotaExceededError("All accounts at quota limit")

    async def _get_client(self, account: str):
        """Get authenticated client with token refresh."""
        credentials = await self._load_credentials(account)

        if credentials.expired:
            credentials.refresh()
            await self._save_credentials(account, credentials)

        return youtube_client(credentials)

Case Study: YouTube MCP Server

YouTube MCP v0.3.6

  • 26 tools for video analysis, transcript extraction, comment search
  • 356 tests with 100% pass rate
  • Multi-account OAuth with automatic token refresh
  • Smart caching achieving 90% quota savings
  • Zero-quota transcripts via OAuth caption download

Architecture Decisions

Decision 1: Metadata-First Approach

get_video_context returns ~2-3KB by default (title, description, stats). This lets AI assistants decide if they need transcript/comments before requesting heavy data.

Decision 2: Search Across Accounts

search_all_accounts tool searches multiple Google accounts in parallel, deduplicates results, and tracks which accounts found each video. Perfect for quota balancing.

Decision 3: Code Block Detection

search_transcript can detect code blocks in transcripts (Python, JS, TypeScript, etc.) and extract them with context. Great for learning from technical tutorials.

Real Usage Example

Query: "Find authentication examples in this tutorial"

AI workflow:

  1. get_video_context → Confirms video exists, checks duration
  2. check_transcript_availability → Verifies captions exist
  3. search_transcript(search_query="authentication", search_type="code")
  4. Returns code blocks with timestamps and YouTube URLs

Total quota used: 1 unit (metadata check) + 0 units (transcript is free with OAuth)

Case Study: Obsidian MCP Integration

Obsidian MCP Suite

  • Dual-vault architecture (personal + work)
  • Real-time sync via MCP plugin running in Obsidian
  • Full markdown support with frontmatter parsing
  • Search, create, update, delete operations
  • Cross-vault queries with context separation

Architecture Decisions

Decision 1: Obsidian Plugin as MCP Server

Instead of building a standalone server that watches files, the MCP server runs inside Obsidian as a plugin. This gives real-time access to Obsidian's API and metadata.

Decision 2: Port-Based Vault Separation

Each vault runs on a different port (22360, 22361). Claude Desktop connects to both via mcp-remote bridge. Tools are automatically namespaced by server name.

Decision 3: Markdown-Aware Operations

Tools understand Obsidian conventions: frontmatter, internal links [[like this]], tags #like-this, and block references ^like-this.
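Extracting those conventions is mostly regex work. A minimal sketch of markdown-aware parsing; the patterns are illustrative, not exhaustive (they ignore escaping, embeds, and code fences):

```python
import re

def extract_obsidian_refs(markdown: str) -> dict:
    """Pull Obsidian-style links, tags, and block references out of a note."""
    return {
        # [[Target]] or [[Target|alias]] -> capture just the target
        "links": re.findall(r"\[\[([^\]|]+)(?:\|[^\]]+)?\]\]", markdown),
        # #tag preceded by whitespace or start of line
        "tags": re.findall(r"(?<!\S)#([\w/-]+)", markdown),
        # ^block-id references
        "block_refs": re.findall(r"\^([\w-]+)", markdown),
    }

note = "Met with [[Highland Terrace|the board]] about #property-mgmt ^meeting-notes"
refs = extract_obsidian_refs(note)
```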

Real Usage Example

Query: "Create a meeting note for today's board meeting at Highland Terrace"

AI workflow:

  1. Uses obsidian-work__get_file_contents to read property file
  2. Extracts board member names, meeting topics from property context
  3. Creates note with obsidian-work__append_content
  4. Includes proper frontmatter, links to property file, agenda structure

The note is created in the correct vault with proper context and links.

Best Practices for Production MCP Servers

1. Comprehensive Error Handling

@server.tool()
async def get_video_transcript(video_id: str):
    try:
        transcript = await fetch_transcript(video_id)
        return transcript
    except TranscriptNotAvailable:
        return {
            "error": "No captions available",
            "suggestions": [
                "Check if video has auto-captions",
                "Try another video from same channel"
            ]
        }
    except QuotaExceeded:
        return {
            "error": "API quota exceeded",
            "suggestions": [
                "Wait until midnight Pacific for reset",
                "Add another Google account"
            ]
        }

2. Tool Descriptions Matter

AI assistants use descriptions to decide which tool to call. Be specific:

Bad ✗: "Get video data"

Good ✓: "Get video metadata (title, description, stats, chapters). Lightweight by default (~2-3KB). Use include_transcript for full captions."

3. Implement Caching

Cache expensive operations with TTL:

@cache(ttl=3600)  # 1 hour
async def get_video_metadata(video_id: str):
    # Expensive API call
    return await youtube_api.videos().list(id=video_id)

YouTube MCP achieves 90% quota reduction through caching.
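The @cache(ttl=...) decorator above isn't from the standard library; a minimal in-memory version might look like this (no eviction, single-process — a production server would likely want an LRU bound or a shared store like Redis):

```python
import functools
import time

def cache(ttl: float):
    """Cache an async function's results in memory for ttl seconds."""
    def decorator(func):
        store: dict = {}  # args -> (timestamp, result)

        @functools.wraps(func)
        async def wrapper(*args):
            now = time.monotonic()
            if args in store:
                cached_at, result = store[args]
                if now - cached_at < ttl:
                    return result  # cache hit: no API call
            result = await func(*args)  # cache miss: pay for the call once
            store[args] = (now, result)
            return result
        return wrapper
    return decorator

@cache(ttl=3600)
async def get_video_metadata(video_id: str) -> dict:
    return {"id": video_id}  # stands in for the expensive API call
```

Keying on positional args keeps the sketch short; a real implementation would also hash keyword arguments and handle unhashable inputs.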

4. Authentication Best Practices

  • OAuth 2.0: Use for Google/Microsoft/GitHub APIs
  • API Keys: Store in environment variables, not code
  • Token Refresh: Implement automatic refresh before expiration
  • Multi-Account: Support multiple credentials for quota balancing
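Proactive refresh — renewing before expiry rather than reacting to a 401 — can be sketched like this. The Credentials shape is hypothetical; google-auth and similar libraries expose equivalents:

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class Credentials:
    """Hypothetical credential record; real libraries carry more fields."""
    access_token: str
    expires_at: float  # epoch seconds

    def expires_within(self, seconds: float) -> bool:
        return time.time() + seconds >= self.expires_at

def ensure_fresh(
    creds: Credentials,
    refresh_fn: Callable[[Credentials], Credentials],
    margin: float = 300.0,
) -> Credentials:
    """Refresh the token if it expires within `margin` seconds (default 5 min)."""
    if creds.expires_within(margin):
        return refresh_fn(creds)  # e.g. hits the OAuth token endpoint
    return creds
```

Calling this before every API request means a token never expires mid-call, which matters for long-running tools like transcript downloads.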

5. Rate Limiting

from asyncio import Semaphore

class RateLimiter:
    def __init__(self, max_concurrent=5):
        self.semaphore = Semaphore(max_concurrent)

    async def __aenter__(self):
        await self.semaphore.acquire()

    async def __aexit__(self, *args):
        self.semaphore.release()

rate_limiter = RateLimiter(max_concurrent=5)

@server.tool()
async def search_videos(query: str):
    async with rate_limiter:
        return await youtube_api.search(q=query)

Common Mistakes to Avoid

❌ Returning Too Much Data by Default

Problem: Tool returns 100KB transcript when AI only needed video title.
Solution: Lightweight context first, optional parameters for heavy data.

❌ No Chunking for Large Content

Problem: 3-hour video transcript overwhelms context window.
Solution: Return summary/index by default, offer chunking or time-range options.

❌ Poor Error Messages

Problem: "Error 403" with no context.
Solution: Return actionable error messages with suggestions.

❌ No Quota Management

Problem: Hit API quota limit at 10am, server unusable rest of day.
Solution: Multi-account rotation, caching, quota tracking.

❌ Blocking Operations

Problem: Synchronous API calls block entire server.
Solution: Use async/await throughout, connection pooling for HTTP clients.

Key Takeaways

  • 🎯 Context-first design: Return lightweight data by default, heavy data on request.
  • 📦 Smart chunking: Large content needs indexes/summaries, not full dumps.
  • 🔄 Multi-account patterns: Quota balancing is essential for production systems.
  • 🏗️ Dual-vault architecture: Separate MCP servers for separate contexts prevents confusion.
  • 💾 Caching matters: Can reduce API costs by 90% with intelligent caching.
  • 📝 Tool descriptions are critical: AI assistants use them to decide which tool to call.

About Eli

Systems Architect with 12 production MCP servers including YouTube MCP (26 tools, 356 tests) and dual-vault Obsidian integration. Specializing in Model Context Protocol architecture and production AI systems.

Need MCP Architecture Consulting?

I help teams design and build production-ready MCP servers with proper authentication, caching, and error handling.

Schedule a Consultation