Production Engineering Nov 1, 2025 14 min read

Production AI Engineering: The Reality Behind the Hype

Cost optimization, validation layers, testing strategies, and monitoring. Real numbers from production systems processing 7,000+ products with 356 tests at 100% pass rate.

Eli
Systems Architect, NH/VT

Everyone talks about building AI demos. Few discuss what it takes to run AI systems in production where mistakes cost money and downtime affects real customers. Here's what I learned running production AI systems for property management (a portfolio of properties) and e-commerce (7,000+ products).

The Demo-to-Production Gap

Demo AI System

  • ✓ Works on example data
  • ✓ Happy path only
  • ✓ No error handling
  • ✓ Manual testing
  • ✓ Cost? Who cares!
  • ✓ "It worked once"

Production AI System

  • ✓ Handles edge cases
  • ✓ Comprehensive error handling
  • ✓ Validation layers
  • ✓ 356 automated tests
  • ✓ $0.011/1k tokens tracked
  • ✓ 100% uptime required

The gap is real. Your demo works on hand-picked examples. Production needs to handle every edge case your customers throw at it. Here's how to bridge that gap.

Cost Optimization: Real Numbers

Model Selection Strategy

Not every task needs Claude Sonnet 4.5. Here's the decision tree I use:

Task Type                          Model Choice     Cost
Complex reasoning, architecture    Sonnet 4.5       $3/M in, $15/M out
Data extraction, classification    Haiku            $0.80/M in, $4/M out
Bulk processing, simple tasks      Cloudflare AI    $0.011/1k "Neurons"
Sensitive data, offline work       Local (MLX)      ~$0 (electricity)
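In code, that decision tree can be as small as a lookup table. A minimal sketch; the TaskType names and model identifiers below are illustrative placeholders, not the exact IDs any provider exposes:

from enum import Enum, auto

class TaskType(Enum):
    COMPLEX_REASONING = auto()   # architecture decisions, multi-step analysis
    EXTRACTION = auto()          # pull structured fields out of documents
    BULK_SIMPLE = auto()         # classification, tagging, short rewrites
    SENSITIVE_OFFLINE = auto()   # data that never leaves the machine

# Placeholder model identifiers; substitute the IDs your providers actually use
MODEL_ROUTES = {
    TaskType.COMPLEX_REASONING: "claude-sonnet-4-5",      # $3/M in, $15/M out
    TaskType.EXTRACTION:        "claude-haiku",           # $0.80/M in, $4/M out
    TaskType.BULK_SIMPLE:       "cloudflare-workers-ai",  # $0.011/1k "Neurons"
    TaskType.SENSITIVE_OFFLINE: "local-mlx",              # ~$0, runs on-device
}

def pick_model(task: TaskType) -> str:
    """Route each task to the cheapest model that handles it well."""
    return MODEL_ROUTES[task]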

Real Example: E-commerce Pricing

Processing 7,000 products for pricing analysis:

  • Sonnet 4.5: $42/run (overkill for this task)
  • Haiku: $12/run (good balance)
  • Cloudflare AI: $1.10/run (current choice)

Savings: $40.90 per run × 30 runs/month = $1,227/month saved

Prompt Caching

For repeated operations with the same context, caching cuts costs by up to 90%: prompt caching discounts repeated input tokens on the LLM side, and response caching skips the API call entirely. Example from the YouTube MCP server:

from aiocache import cached  # or any async TTL cache decorator

@cached(ttl=3600)  # Cache for 1 hour
async def get_video_metadata(video_id: str):
    """Cache expensive API calls so repeated lookups are free."""
    return await youtube_api.videos().list(id=video_id)

# First call: full API cost
# Subsequent calls within 1 hour: served from cache, no quota used
# Result: ~90% reduction in API calls for repeated lookups
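The same idea applies on the LLM side for large shared context. A hedged sketch using the Anthropic SDK's cache_control marker; the model ID is a placeholder and the exact field names should be checked against your SDK version:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask_with_cached_context(context: str, question: str) -> str:
    """Mark the large shared context as cacheable; only the question changes per call."""
    response = client.messages.create(
        model="claude-haiku",  # placeholder model ID
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": context,  # e.g. pricing rules and catalog context, thousands of tokens
                "cache_control": {"type": "ephemeral"},  # cached and discounted on repeat calls
            }
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text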

Batch Processing

Process multiple items in one request instead of N individual requests:

Bad: Individual Requests

for product in products:
    analyze(product)          # 7,000 API calls
# Cost: $42

Good: Batch Processing

for i in range(0, len(products), 100):
    analyze_batch(products[i:i + 100])   # 100 products per request
# 70 calls total, same output
# Cost: $21 (50% savings)
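The interesting part is what analyze_batch does inside: pack many items into one prompt and ask for structured output back. A minimal sketch, assuming a generic call_llm helper (not a specific SDK) and products with sku and description attributes:

import json

async def analyze_batch(batch):
    """Analyze ~100 products in one LLM call instead of one call per product."""
    lines = [f"{p.sku}: {p.description}" for p in batch]
    prompt = (
        "For each product below, return a JSON array of objects with "
        '"sku" and "suggested_price" fields, in the same order.\n\n'
        + "\n".join(lines)
    )
    raw = await call_llm(prompt)  # assumed helper wrapping whichever model API you route to
    return json.loads(raw)        # still run every result through the validation layers below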

Validation Layers: Trust But Verify

AI is probabilistic. Production systems need deterministic validation. Never trust AI output without verification—especially for financial or legal data.

Layer 1: Schema Validation

from pydantic import BaseModel, ValidationError, validator

class ResaleCertificate(BaseModel):
    """Strict schema for resale certificate data."""
    property_address: str
    hoa_fee_monthly: float
    assessment_amount: float
    outstanding_balance: float

    @validator('hoa_fee_monthly')
    def validate_fee_reasonable(cls, v):
        """HOA fees should be $50-$1000/month typically."""
        if not 50 <= v <= 1000:
            raise ValueError(f"HOA fee ${v} seems wrong")
        return v

    @validator('outstanding_balance')
    def validate_balance_positive(cls, v):
        """Balance can't be negative."""
        if v < 0:
            raise ValueError("Negative balance not allowed")
        return v

# Use it:
try:
    cert = ResaleCertificate(**ai_output)
except ValidationError as e:
    # Flag for human review (alert_human is whatever notification hook you use)
    alert_human(e)

Layer 2: Business Logic Validation

from statistics import mean

def validate_pricing(product, ai_suggested_price):
    """Business logic: prices must make sense.

    ValidationError / ValidationSuccess are lightweight result objects
    (a sketch follows this block), not exceptions.
    """

    # Check 1: Positive margin
    cost = product.wholesale_cost + product.shipping_cost
    if ai_suggested_price <= cost:
        return ValidationError("Price below cost")

    # Check 2: Within bounds of similar products
    similar = get_similar_products(product)
    avg_price = mean([p.price for p in similar])

    if abs(ai_suggested_price - avg_price) / avg_price > 0.20:
        return ValidationError(
            f"Price ${ai_suggested_price} is >20% from average ${avg_price:.2f}"
        )

    # Check 3: Competitive but not undercutting
    competitor_prices = get_competitor_prices(product)
    min_competitor = min(competitor_prices)

    if ai_suggested_price < min_competitor * 0.95:
        return ValidationError("Price undercuts market by >5%")

    return ValidationSuccess()
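ValidationError and ValidationSuccess above are result objects rather than exceptions, so callers can branch on validation.passed. A minimal sketch; the field names match what the logging code later expects, but the exact shape is an assumption:

from dataclasses import dataclass

@dataclass
class ValidationError:
    """A failed check, with context for human review."""
    reason: str
    expected_range: tuple | None = None
    passed: bool = False

@dataclass
class ValidationSuccess:
    """All checks passed."""
    passed: bool = True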

Layer 3: Human-in-the-Loop

For high-stakes operations, require human approval:

Approval Workflow Example

  1. AI processes data and generates output
  2. Validation layers check for errors
  3. If validation passes → Queue for human approval
  4. Human reviews AI reasoning + output
  5. Approve → Commit to production
  6. Reject → Flag pattern for future improvement

Critical operations that need human approval (a minimal approval-queue sketch follows this list):

  • Financial transactions (QuickBooks commits)
  • Legal documents (resale certificates)
  • Price changes affecting >$1000 in inventory
  • Customer-facing communications
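One way to wire the queue step is a plain dataclass plus approve/reject handlers. This is an in-memory sketch with illustrative hook names, not the production implementation:

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class PendingChange:
    """An AI-generated change waiting for a human decision."""
    description: str
    ai_reasoning: str
    payload: dict
    created_at: datetime = field(default_factory=datetime.now)
    status: str = "pending"  # pending | approved | rejected

approval_queue: list[PendingChange] = []

def queue_for_approval(description: str, ai_reasoning: str, payload: dict) -> PendingChange:
    """Called only after validation passes; nothing commits until a human approves."""
    change = PendingChange(description, ai_reasoning, payload)
    approval_queue.append(change)
    return change

def approve(change: PendingChange) -> None:
    change.status = "approved"
    commit_to_production(change.payload)   # assumed hook, e.g. the QuickBooks writer

def reject(change: PendingChange, reason: str) -> None:
    change.status = "rejected"
    log_rejection_pattern(change, reason)  # assumed hook: feed the pattern back into prompts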

Testing Strategies: The 356-Test Example

My YouTube MCP server has 356 tests with 100% pass rate. Here's the testing pyramid I use:

Testing Pyramid for AI Systems

Level 1: Unit Tests (70%)

Test individual functions with mocked AI responses

Level 2: Integration Tests (20%)

Test workflows with real API calls (rate limited)

Level 3: End-to-End Tests (10%)

Test full user workflows in staging environment

Unit Test Example

import pytest
from unittest.mock import AsyncMock, MagicMock

@pytest.mark.asyncio
async def test_transcript_extraction():
    """Test transcript tool with mocked API."""

    # Mock the YouTube client: chained captions() calls stay synchronous via MagicMock,
    # while download() is async-mocked so the tool can await it
    mock_youtube = MagicMock()
    mock_youtube.captions.return_value.download = AsyncMock(return_value="Test transcript")

    # Test the tool
    result = await get_video_transcript(
        video_id="test123",
        youtube_client=mock_youtube
    )

    # Assertions
    assert result["text"] == "Test transcript"
    assert result["duration"] > 0
    mock_youtube.captions.return_value.download.assert_called_once()

@pytest.mark.asyncio
async def test_transcript_not_available():
    """Test error handling when no captions exist."""

    mock_youtube = MagicMock()
    mock_youtube.captions.return_value.download = AsyncMock(
        side_effect=TranscriptNotAvailable()
    )

    result = await get_video_transcript(
        video_id="nocaptions",
        youtube_client=mock_youtube
    )

    # Should return helpful error, not crash
    assert "error" in result
    assert "suggestions" in result

Integration Test Example

@pytest.mark.integration
@pytest.mark.asyncio
async def test_full_video_analysis():
    """Test complete workflow with real API (rate limited)."""

    # Use real video ID (public test video)
    video_id = "dQw4w9WgXcQ"

    # Step 1: Get context
    context = await get_video_context(video_id)
    assert context["title"]

    # Step 2: Check transcript availability
    availability = await check_transcript_availability(video_id)
    assert availability["available"]

    # Step 3: Search transcript
    results = await search_transcript(
        video_id=video_id,
        search_query="test"
    )
    assert isinstance(results, list)  # May or may not have matches

# Run integration tests with: pytest -m integration --maxfail=1
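To keep the slow tests out of the default run, the integration marker has to be registered. A minimal conftest.py sketch (pytest-asyncio is assumed for the async tests above):

# conftest.py (sketch)
def pytest_configure(config):
    # Register the custom marker so `pytest -m integration` selects cleanly, without warnings
    config.addinivalue_line(
        "markers",
        "integration: tests that hit real APIs; run explicitly with `pytest -m integration`",
    )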

Monitoring & Observability

You can't improve what you don't measure. Here's what I track:

  • $0.011: cost per 1k tokens (Cloudflare AI)
  • 90%: quota savings via caching
  • 356: tests at 100% pass rate

Structured Logging

import time
import traceback
from datetime import datetime

import structlog

logger = structlog.get_logger()

async def process_product_pricing(product_id):
    """Process pricing with full observability."""

    logger.info(
        "pricing.started",
        product_id=product_id,
        timestamp=datetime.now()
    )

    try:
        # Get AI suggestion
        start = time.time()
        ai_price = await get_ai_price_suggestion(product_id)
        duration = time.time() - start

        logger.info(
            "pricing.ai_complete",
            product_id=product_id,
            suggested_price=ai_price,
            tokens_used=ai_price.tokens,  # the suggestion object also carries its token usage
            duration_ms=duration * 1000,
            model="claude-haiku"
        )

        # Validate
        validation = validate_pricing(product_id, ai_price)

        if not validation.passed:
            logger.warning(
                "pricing.validation_failed",
                product_id=product_id,
                reason=validation.reason,
                ai_price=ai_price,
                expected_range=validation.expected_range
            )
            return None

        logger.info(
            "pricing.success",
            product_id=product_id,
            final_price=ai_price
        )

        return ai_price

    except Exception as e:
        logger.error(
            "pricing.error",
            product_id=product_id,
            error=str(e),
            traceback=traceback.format_exc()
        )
        raise

Key Metrics Dashboard

Track these metrics in your monitoring system (a minimal cost-tracker sketch follows the list):

  • Request volume: How many AI calls per hour/day
  • Token usage: Input/output tokens by model
  • Cost tracking: Dollars spent per day/week/month
  • Error rate: Percentage of failed requests
  • Validation failures: How often AI output fails validation
  • Response time: P50, P95, P99 latencies
  • Cache hit rate: Percentage served from cache
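A cost tracker does not need to be elaborate to be useful. A minimal in-memory sketch that accumulates token counts and dollar estimates per model; the rates mirror the list prices quoted earlier, and the class name is illustrative:

from collections import defaultdict
from dataclasses import dataclass, field

# $ per million tokens (input, output), from the model table above
RATES = {
    "claude-sonnet-4-5": (3.00, 15.00),
    "claude-haiku":      (0.80, 4.00),
}

@dataclass
class CostTracker:
    """Accumulate token usage and estimated spend per model."""
    tokens: dict = field(default_factory=lambda: defaultdict(lambda: [0, 0]))

    def record(self, model: str, input_tokens: int, output_tokens: int) -> None:
        self.tokens[model][0] += input_tokens
        self.tokens[model][1] += output_tokens

    def spend(self) -> dict:
        """Estimated dollars spent per model so far."""
        return {
            model: (i * RATES[model][0] + o * RATES[model][1]) / 1_000_000
            for model, (i, o) in self.tokens.items()
        }

Feed record() from the same place the structured logger records tokens_used, and export spend() to whatever dashboard you already use.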

Real Numbers from Production

YouTube MCP Server (26 Tools)

  • Daily requests: ~200 (transcript extractions, searches, analysis)
  • Quota usage: ~2,000 units/day (out of 10,000 limit)
  • Cache hit rate: 90% (5x reduction in API calls)
  • Tests: 356 with 100% pass rate
  • Error rate: <1% (mostly quota exhaustion)
  • Multi-account: 3 Google accounts for quota balancing

E-commerce Pricing (7,000 Products)

  • Processing time: ~45 minutes for full catalog (overnight run)
  • Model: Cloudflare AI (Llama 3 70B)
  • Cost per run: $1.10 (vs $42 with Sonnet 4.5)
  • Validation failure rate: 3% (flagged for human review)
  • Monthly savings vs manual: ~60 hours of human time
  • ROI: Positive after 2 weeks

Lessons Learned the Hard Way

🔴 Lesson 1: Start with the Smallest Model That Works

I initially used Sonnet 4.5 for everything. Cost ballooned. Switching simple tasks to Haiku and Cloudflare AI saved $1,200+/month with no quality loss.

🔴 Lesson 2: Validation is Non-Negotiable

AI hallucinated a $400 "monthly HOA fee" that was actually annual ($400/12 = $33/month). Caught by validation. Would have been embarrassing in production.

🔴 Lesson 3: Cache Everything You Can

YouTube MCP initially hit quota limits by noon. Added caching → 90% reduction in API calls. Now runs all day without issues.

🔴 Lesson 4: Test Error Cases, Not Just Happy Path

Most bugs happen on edge cases: missing data, malformed inputs, API timeouts. Test those explicitly. My 356 tests mostly cover error scenarios.

🔴 Lesson 5: Logging is Your Best Friend

Structured logging lets you debug production issues without accessing customer data. Log inputs, outputs, validation results, and AI reasoning.

Key Takeaways

  • 💰 Model selection matters: Cloudflare AI at $0.011/1k tokens vs Claude at $3-15/M tokens—choose based on task complexity.
  • ✅ Validate everything: Schema validation + business logic + human approval for critical operations.
  • 🧪 Test exhaustively: 356 tests covering edge cases and error scenarios = production confidence.
  • 📊 Monitor religiously: Track costs, errors, validation failures, and response times in real-time.
  • 💾 Cache aggressively: 90% cache hit rate = 90% cost reduction for repeated operations.
  • 🎯 Start small, iterate: Ship working systems fast, improve based on real production data.

About Eli

Running production AI systems for property management (a portfolio of properties) and e-commerce (7,000+ products). 356 tests, $0.011/1k tokens, 90% cache hit rate. Real systems, real numbers, real lessons.

Need Production AI Consulting?

I help teams move from AI demos to production systems with proper validation, monitoring, and cost optimization.

Schedule a Consultation