Production Engineering Nov 1, 2025 14 min read

Production AI Engineering: The Reality Behind the Hype

Cost optimization, validation layers, testing strategies, and monitoring. Real numbers from production systems processing 7,000+ products with 356 tests at 100% pass rate.

Eli
Systems Architect, NH/VT

Everyone talks about building AI demos. Few discuss what it takes to run AI systems in production where mistakes cost money and downtime affects real customers. Here's what I learned running production AI systems for property management (a portfolio of properties) and e-commerce (7,000+ products).

The Demo-to-Production Gap

Demo AI System

  • ✓ Works on example data
  • ✓ Happy path only
  • ✓ No error handling
  • ✓ Manual testing
  • ✓ Cost? Who cares!
  • ✓ "It worked once"

Production AI System

  • ✓ Handles edge cases
  • ✓ Comprehensive error handling
  • ✓ Validation layers
  • ✓ 356 automated tests
  • ✓ $0.011/1k tokens tracked
  • ✓ 100% uptime required

The gap is real. Your demo works on hand-picked examples. Production needs to handle every edge case your customers throw at it. Here's how to bridge that gap.

Cost Optimization: Real Numbers

Model Selection Strategy

Not every task needs Claude Sonnet 4.5. Here's the decision tree I use:

Task Type                          Model Choice     Cost
Complex reasoning, architecture    Sonnet 4.5       $3/M in, $15/M out
Data extraction, classification    Haiku            $0.80/M in, $4/M out
Bulk processing, simple tasks      Cloudflare AI    $0.011/1k "Neurons"
Sensitive data, offline work       Local (MLX)      ~$0 (electricity)
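In code, that decision tree can be as small as a lookup table. A minimal sketch; the TaskType names and model identifiers below are illustrative placeholders, not the exact IDs any provider exposes:

from enum import Enum, auto

class TaskType(Enum):
    COMPLEX_REASONING = auto()   # architecture decisions, multi-step analysis
    EXTRACTION = auto()          # pull structured fields out of documents
    BULK_SIMPLE = auto()         # classification, tagging, short rewrites
    SENSITIVE_OFFLINE = auto()   # data that never leaves the machine

# Placeholder model identifiers; substitute the IDs your providers actually use
MODEL_ROUTES = {
    TaskType.COMPLEX_REASONING: "claude-sonnet-4-5",      # $3/M in, $15/M out
    TaskType.EXTRACTION:        "claude-haiku",           # $0.80/M in, $4/M out
    TaskType.BULK_SIMPLE:       "cloudflare-workers-ai",  # $0.011/1k "Neurons"
    TaskType.SENSITIVE_OFFLINE: "local-mlx",              # ~$0, runs on-device
}

def pick_model(task: TaskType) -> str:
    """Route each task to the cheapest model that handles it well."""
    return MODEL_ROUTES[task]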

Real Example: E-commerce Pricing

Processing 7,000 products for pricing analysis:

  • Sonnet 4.5: $42/run (overkill for this task)
  • Haiku: $12/run (good balance)
  • Cloudflare AI: $1.10/run (current choice)

Savings: $40.90 per run × 30 runs/month = $1,227/month saved

Prompt Caching

For repeated operations with the same context, caching cuts costs by up to 90%: prompt caching discounts repeated input tokens on the LLM side, and response caching skips the API call entirely. Example from the YouTube MCP server:

from aiocache import cached  # or any async TTL cache decorator

@cached(ttl=3600)  # Cache for 1 hour
async def get_video_metadata(video_id: str):
    """Cache expensive API calls so repeated lookups are free."""
    return await youtube_api.videos().list(id=video_id)

# First call: full API cost
# Subsequent calls within 1 hour: served from cache, no quota used
# Result: ~90% reduction in API calls for repeated lookups
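The same idea applies on the LLM side for large shared context. A hedged sketch using the Anthropic SDK's cache_control marker; the model ID is a placeholder and the exact field names should be checked against your SDK version:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask_with_cached_context(context: str, question: str) -> str:
    """Mark the large shared context as cacheable; only the question changes per call."""
    response = client.messages.create(
        model="claude-haiku",  # placeholder model ID
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": context,  # e.g. pricing rules and catalog context, thousands of tokens
                "cache_control": {"type": "ephemeral"},  # cached and discounted on repeat calls
            }
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text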

Batch Processing

Process multiple items in one request instead of N individual requests:

Bad: Individual Requests

for product in products:
    analyze(product)          # 7,000 API calls
# Cost: $42

Good: Batch Processing

for i in range(0, len(products), 100):
    analyze_batch(products[i:i + 100])   # 100 products per request
# 70 calls total, same output
# Cost: $21 (50% savings)
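The interesting part is what analyze_batch does inside: pack many items into one prompt and ask for structured output back. A minimal sketch, assuming a generic call_llm helper (not a specific SDK) and products with sku and description attributes:

import json

async def analyze_batch(batch):
    """Analyze ~100 products in one LLM call instead of one call per product."""
    lines = [f"{p.sku}: {p.description}" for p in batch]
    prompt = (
        "For each product below, return a JSON array of objects with "
        '"sku" and "suggested_price" fields, in the same order.\n\n'
        + "\n".join(lines)
    )
    raw = await call_llm(prompt)  # assumed helper wrapping whichever model API you route to
    return json.loads(raw)        # still run every result through the validation layers below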

Validation Layers: Trust But Verify

AI is probabilistic. Production systems need deterministic validation. Never trust AI output without verification—especially for financial or legal data.

Layer 1: Schema Validation

from pydantic import BaseModel, ValidationError, validator

class ResaleCertificate(BaseModel):
    """Strict schema for resale certificate data."""
    property_address: str
    hoa_fee_monthly: float
    assessment_amount: float
    outstanding_balance: float

    @validator('hoa_fee_monthly')
    def validate_fee_reasonable(cls, v):
        """HOA fees should be $50-$1000/month typically."""
        if not 50 <= v <= 1000:
            raise ValueError(f"HOA fee ${v} seems wrong")
        return v

    @validator('outstanding_balance')
    def validate_balance_positive(cls, v):
        """Balance can't be negative."""
        if v < 0:
            raise ValueError("Negative balance not allowed")
        return v

# Use it:
try:
    cert = ResaleCertificate(**ai_output)
except ValidationError as e:
    # Flag for human review (alert_human is whatever notification hook you use)
    alert_human(e)

Layer 2: Business Logic Validation

from statistics import mean

def validate_pricing(product, ai_suggested_price):
    """Business logic: prices must make sense.

    ValidationError / ValidationSuccess are lightweight result objects
    (a sketch follows this block), not exceptions.
    """

    # Check 1: Positive margin
    cost = product.wholesale_cost + product.shipping_cost
    if ai_suggested_price <= cost:
        return ValidationError("Price below cost")

    # Check 2: Within bounds of similar products
    similar = get_similar_products(product)
    avg_price = mean([p.price for p in similar])

    if abs(ai_suggested_price - avg_price) / avg_price > 0.20:
        return ValidationError(
            f"Price ${ai_suggested_price} is >20% from average ${avg_price:.2f}"
        )

    # Check 3: Competitive but not undercutting
    competitor_prices = get_competitor_prices(product)
    min_competitor = min(competitor_prices)

    if ai_suggested_price < min_competitor * 0.95:
        return ValidationError("Price undercuts market by >5%")

    return ValidationSuccess()
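ValidationError and ValidationSuccess above are result objects rather than exceptions, so callers can branch on validation.passed. A minimal sketch; the field names match what the logging code later expects, but the exact shape is an assumption:

from dataclasses import dataclass

@dataclass
class ValidationError:
    """A failed check, with context for human review."""
    reason: str
    expected_range: tuple | None = None
    passed: bool = False

@dataclass
class ValidationSuccess:
    """All checks passed."""
    passed: bool = True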

Layer 3: Human-in-the-Loop

For high-stakes operations, require human approval:

Approval Workflow Example

  1. AI processes data and generates output
  2. Validation layers check for errors
  3. If validation passes → Queue for human approval
  4. Human reviews AI reasoning + output
  5. Approve → Commit to production
  6. Reject → Flag pattern for future improvement

Critical operations that need human approval (a minimal approval-queue sketch follows this list):

  • Financial transactions (QuickBooks commits)
  • Legal documents (resale certificates)
  • Price changes affecting >$1000 in inventory
  • Customer-facing communications
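One way to wire the queue step is a plain dataclass plus approve/reject handlers. This is an in-memory sketch with illustrative hook names, not the production implementation:

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class PendingChange:
    """An AI-generated change waiting for a human decision."""
    description: str
    ai_reasoning: str
    payload: dict
    created_at: datetime = field(default_factory=datetime.now)
    status: str = "pending"  # pending | approved | rejected

approval_queue: list[PendingChange] = []

def queue_for_approval(description: str, ai_reasoning: str, payload: dict) -> PendingChange:
    """Called only after validation passes; nothing commits until a human approves."""
    change = PendingChange(description, ai_reasoning, payload)
    approval_queue.append(change)
    return change

def approve(change: PendingChange) -> None:
    change.status = "approved"
    commit_to_production(change.payload)   # assumed hook, e.g. the QuickBooks writer

def reject(change: PendingChange, reason: str) -> None:
    change.status = "rejected"
    log_rejection_pattern(change, reason)  # assumed hook: feed the pattern back into prompts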

Testing Strategies: The 356-Test Example

My YouTube MCP server has 356 tests with 100% pass rate. Here's the testing pyramid I use:

Testing Pyramid for AI Systems

Level 1: Unit Tests (70%)

Test individual functions with mocked AI responses

Level 2: Integration Tests (20%)

Test workflows with real API calls (rate limited)

Level 3: End-to-End Tests (10%)

Test full user workflows in staging environment

Unit Test Example

import pytest
from unittest.mock import AsyncMock, MagicMock

@pytest.mark.asyncio
async def test_transcript_extraction():
    """Test transcript tool with mocked API."""

    # Mock the YouTube client: chained captions() calls stay synchronous via MagicMock,
    # while download() is async-mocked so the tool can await it
    mock_youtube = MagicMock()
    mock_youtube.captions.return_value.download = AsyncMock(return_value="Test transcript")

    # Test the tool
    result = await get_video_transcript(
        video_id="test123",
        youtube_client=mock_youtube
    )

    # Assertions
    assert result["text"] == "Test transcript"
    assert result["duration"] > 0
    mock_youtube.captions.return_value.download.assert_called_once()

@pytest.mark.asyncio
async def test_transcript_not_available():
    """Test error handling when no captions exist."""

    mock_youtube = MagicMock()
    mock_youtube.captions.return_value.download = AsyncMock(
        side_effect=TranscriptNotAvailable()
    )

    result = await get_video_transcript(
        video_id="nocaptions",
        youtube_client=mock_youtube
    )

    # Should return helpful error, not crash
    assert "error" in result
    assert "suggestions" in result

Integration Test Example

@pytest.mark.integration
@pytest.mark.asyncio
async def test_full_video_analysis():
    """Test complete workflow with real API (rate limited)."""

    # Use real video ID (public test video)
    video_id = "dQw4w9WgXcQ"

    # Step 1: Get context
    context = await get_video_context(video_id)
    assert context["title"]

    # Step 2: Check transcript availability
    availability = await check_transcript_availability(video_id)
    assert availability["available"]

    # Step 3: Search transcript
    results = await search_transcript(
        video_id=video_id,
        search_query="test"
    )
    assert isinstance(results, list)  # May or may not have matches

# Run integration tests with: pytest -m integration --maxfail=1
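To keep the slow tests out of the default run, the integration marker has to be registered. A minimal conftest.py sketch (pytest-asyncio is assumed for the async tests above):

# conftest.py (sketch)
def pytest_configure(config):
    # Register the custom marker so `pytest -m integration` selects cleanly, without warnings
    config.addinivalue_line(
        "markers",
        "integration: tests that hit real APIs; run explicitly with `pytest -m integration`",
    )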

Monitoring & Observability

You can't improve what you don't measure. Here's what I track:

  • $0.011: cost per 1k tokens (Cloudflare AI)
  • 90%: quota savings via caching
  • 356: tests at 100% pass rate

Structured Logging

import time
import traceback
from datetime import datetime

import structlog

logger = structlog.get_logger()

async def process_product_pricing(product_id):
    """Process pricing with full observability."""

    logger.info(
        "pricing.started",
        product_id=product_id,
        timestamp=datetime.now()
    )

    try:
        # Get AI suggestion
        start = time.time()
        ai_price = await get_ai_price_suggestion(product_id)
        duration = time.time() - start

        logger.info(
            "pricing.ai_complete",
            product_id=product_id,
            suggested_price=ai_price,
            tokens_used=ai_price.tokens,  # the suggestion object also carries its token usage
            duration_ms=duration * 1000,
            model="claude-haiku"
        )

        # Validate
        validation = validate_pricing(product_id, ai_price)

        if not validation.passed:
            logger.warning(
                "pricing.validation_failed",
                product_id=product_id,
                reason=validation.reason,
                ai_price=ai_price,
                expected_range=validation.expected_range
            )
            return None

        logger.info(
            "pricing.success",
            product_id=product_id,
            final_price=ai_price
        )

        return ai_price

    except Exception as e:
        logger.error(
            "pricing.error",
            product_id=product_id,
            error=str(e),
            traceback=traceback.format_exc()
        )
        raise

Key Metrics Dashboard

Track these metrics in your monitoring system (a minimal cost-tracker sketch follows the list):

  • Request volume: How many AI calls per hour/day
  • Token usage: Input/output tokens by model
  • Cost tracking: Dollars spent per day/week/month
  • Error rate: Percentage of failed requests
  • Validation failures: How often AI output fails validation
  • Response time: P50, P95, P99 latencies
  • Cache hit rate: Percentage served from cache
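A cost tracker does not need to be elaborate to be useful. A minimal in-memory sketch that accumulates token counts and dollar estimates per model; the rates mirror the list prices quoted earlier, and the class name is illustrative:

from collections import defaultdict
from dataclasses import dataclass, field

# $ per million tokens (input, output), from the model table above
RATES = {
    "claude-sonnet-4-5": (3.00, 15.00),
    "claude-haiku":      (0.80, 4.00),
}

@dataclass
class CostTracker:
    """Accumulate token usage and estimated spend per model."""
    tokens: dict = field(default_factory=lambda: defaultdict(lambda: [0, 0]))

    def record(self, model: str, input_tokens: int, output_tokens: int) -> None:
        self.tokens[model][0] += input_tokens
        self.tokens[model][1] += output_tokens

    def spend(self) -> dict:
        """Estimated dollars spent per model so far."""
        return {
            model: (i * RATES[model][0] + o * RATES[model][1]) / 1_000_000
            for model, (i, o) in self.tokens.items()
        }

Feed record() from the same place the structured logger records tokens_used, and export spend() to whatever dashboard you already use.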

Real Numbers from Production

YouTube MCP Server (26 Tools)

  • Daily requests: ~200 (transcript extractions, searches, analysis)
  • Quota usage: ~2,000 units/day (out of 10,000 limit)
  • Cache hit rate: 90% (5x reduction in API calls)
  • Tests: 356 with 100% pass rate
  • Error rate: <1% (mostly quota exhaustion)
  • Multi-account: 3 Google accounts for quota balancing

E-commerce Pricing (7,000 Products)

  • Processing time: ~45 minutes for full catalog (overnight run)
  • Model: Cloudflare AI (Llama 3 70B)
  • Cost per run: $1.10 (vs $42 with Sonnet 4.5)
  • Validation failure rate: 3% (flagged for human review)
  • Monthly savings vs manual: ~60 hours of human time
  • ROI: Positive after 2 weeks

Lessons Learned the Hard Way

🔴 Lesson 1: Start with the Smallest Model That Works

I initially used Sonnet 4.5 for everything. Cost ballooned. Switching simple tasks to Haiku and Cloudflare AI saved $1,200+/month with no quality loss.

🔴 Lesson 2: Validation is Non-Negotiable

AI hallucinated a $400 "monthly HOA fee" that was actually annual ($400/12 = $33/month). Caught by validation. Would have been embarrassing in production.

🔴 Lesson 3: Cache Everything You Can

YouTube MCP initially hit quota limits by noon. Added caching → 90% reduction in API calls. Now runs all day without issues.

🔴 Lesson 4: Test Error Cases, Not Just Happy Path

Most bugs happen on edge cases: missing data, malformed inputs, API timeouts. Test those explicitly. My 356 tests mostly cover error scenarios.

🔴 Lesson 5: Logging is Your Best Friend

Structured logging lets you debug production issues without accessing customer data. Log inputs, outputs, validation results, and AI reasoning.

Key Takeaways

  • 💰 Model selection matters: Cloudflare AI at $0.011/1k tokens vs Claude at $3-15/M tokens—choose based on task complexity.
  • ✅ Validate everything: Schema validation + business logic + human approval for critical operations.
  • 🧪 Test exhaustively: 356 tests covering edge cases and error scenarios = production confidence.
  • 📊 Monitor religiously: Track costs, errors, validation failures, and response times in real-time.
  • 💾 Cache aggressively: 90% cache hit rate = 90% cost reduction for repeated operations.
  • 🎯 Start small, iterate: Ship working systems fast, improve based on real production data.

About Eli

Running production AI systems for property management (a portfolio of properties) and e-commerce (7,000+ products). 356 tests, $0.011/1k tokens, 90% cache hit rate. Real systems, real numbers, real lessons.

Need Production AI Consulting?

I help teams move from AI demos to production systems with proper validation, monitoring, and cost optimization.

Schedule a Consultation