Production AI Engineering: The Reality Behind the Hype
Cost optimization, validation layers, testing strategies, and monitoring. Real numbers from production systems processing 7,000+ products with 356 tests at 100% pass rate.
Everyone talks about building AI demos. Few discuss what it takes to run AI systems in production where mistakes cost money and downtime affects real customers. Here's what I learned running production AI systems for property management (a portfolio of properties) and e-commerce (7,000+ products).
The gap is real. Your demo works on hand-picked examples. Production needs to handle every edge case your customers throw at it. Here's how to bridge that gap.
Not every task needs Claude Sonnet 4.5. Here's the decision tree I use:
| Task Type | Model Choice | Cost |
|---|---|---|
| Complex reasoning, architecture | Sonnet 4.5 | $3/M in, $15/M out |
| Data extraction, classification | Haiku | $0.80/M in, $4/M out |
| Bulk processing, simple tasks | Cloudflare AI | $0.011/1k "Neurons" |
| Sensitive data, offline work | Local (MLX) | ~$0 (electricity) |
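A sketch of how that routing can look in code. The task categories mirror the table; the model identifiers and the `route_task` helper are illustrative placeholders, not a fixed API:

```python
from enum import Enum

class Task(Enum):
    COMPLEX_REASONING = "complex_reasoning"    # architecture, multi-step analysis
    EXTRACTION = "extraction"                  # data extraction, classification
    BULK = "bulk"                              # high-volume, simple transforms
    SENSITIVE = "sensitive"                    # data that must stay local/offline

# Illustrative model choices mirroring the table above.
MODEL_FOR_TASK = {
    Task.COMPLEX_REASONING: "claude-sonnet-4-5",   # $3/M in, $15/M out
    Task.EXTRACTION: "claude-haiku",               # $0.80/M in, $4/M out
    Task.BULK: "cloudflare-workers-ai",            # ~$0.011/1k Neurons
    Task.SENSITIVE: "local-mlx",                   # on-device, ~$0 (electricity)
}

def route_task(task: Task) -> str:
    """Pick the cheapest model that can handle the task."""
    return MODEL_FOR_TASK[task]

# Example: bulk pricing analysis goes to the cheap tier
assert route_task(Task.BULK) == "cloudflare-workers-ai"
```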
Real Example: e-commerce Pricing
Processing 7,000 products for pricing analysis:
Savings: $40.90 per run × 30 runs/month = $1,227/month saved
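For context on where a per-run figure like that comes from, here's a rough cost estimator using the rates from the table above. The per-product token counts are made-up placeholders for illustration, not the measured numbers behind the $40.90 savings:

```python
# Rates from the table above: (input $/M tokens, output $/M tokens)
RATES = {
    "sonnet-4.5": (3.00, 15.00),
    "haiku": (0.80, 4.00),
}

def run_cost(model: str, products: int, tokens_in: int, tokens_out: int) -> float:
    """Estimated cost of one pricing run across `products` items."""
    rate_in, rate_out = RATES[model]
    return products * (tokens_in * rate_in + tokens_out * rate_out) / 1_000_000

# Example with illustrative token counts (2,000 in / 300 out per product):
sonnet = run_cost("sonnet-4.5", 7_000, 2_000, 300)   # ~$73.50
haiku = run_cost("haiku", 7_000, 2_000, 300)         # ~$19.60
print(f"Estimated savings per run: ${sonnet - haiku:.2f}")
```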
For repeated operations with the same context, caching reduces costs by 90%. Here's an example from my YouTube MCP server, which caches API responses:

```python
# `cache` is assumed to be a TTL-aware async cache decorator
# (e.g. aiocache.cached); swap in whatever caching layer you use.
@cache(ttl=3600)  # Cache for 1 hour
async def get_video_metadata(video_id: str):
    """Cache expensive API calls."""
    return await youtube_api.videos().list(id=video_id)

# First call: full API cost
# Subsequent calls within 1 hour: free (from cache)
# Result: 90% cost reduction on repeated lookups
```
Process multiple items in one request instead of N individual requests:
```python
# Before: one API call per product
for product in products:
    analyze(product)                      # 7,000 API calls
# Cost: $42

# After: process products in batches of 100
for i in range(0, len(products), 100):
    analyze_batch(products[i:i + 100])    # 70 calls total, same output
# Cost: $21 (50% savings)
```
AI is probabilistic. Production systems need deterministic validation. Never trust AI output without verification—especially for financial or legal data.
```python
from pydantic import BaseModel, ValidationError, validator

class ResaleCertificate(BaseModel):
    """Strict schema for resale certificate data."""
    property_address: str
    hoa_fee_monthly: float
    assessment_amount: float
    outstanding_balance: float

    @validator('hoa_fee_monthly')
    def validate_fee_reasonable(cls, v):
        """HOA fees should be $50-$1,000/month typically."""
        if not 50 <= v <= 1000:
            raise ValueError(f"HOA fee ${v} seems wrong")
        return v

    @validator('outstanding_balance')
    def validate_balance_positive(cls, v):
        """Balance can't be negative."""
        if v < 0:
            raise ValueError("Negative balance not allowed")
        return v

# Use it:
try:
    cert = ResaleCertificate(**ai_output)
except ValidationError as e:
    # Flag for human review
    alert_human(e)
```
```python
from statistics import mean

def validate_pricing(product, ai_suggested_price):
    """Business logic: prices must make sense."""
    # ValidationError / ValidationSuccess here are lightweight result
    # objects returned to the caller, not raised exceptions.

    # Check 1: Positive margin
    cost = product.wholesale_cost + product.shipping_cost
    if ai_suggested_price <= cost:
        return ValidationError("Price below cost")

    # Check 2: Within bounds of similar products
    similar = get_similar_products(product)
    avg_price = mean([p.price for p in similar])
    if abs(ai_suggested_price - avg_price) / avg_price > 0.20:
        return ValidationError(
            f"Price ${ai_suggested_price} is >20% from average ${avg_price}"
        )

    # Check 3: Competitive but not undercutting
    competitor_prices = get_competitor_prices(product)
    min_competitor = min(competitor_prices)
    if ai_suggested_price < min_competitor * 0.95:
        return ValidationError("Price undercuts market by >5%")

    return ValidationSuccess()
```
For high-stakes operations, require human approval.
Approval Workflow Example
Critical operations don't execute automatically: they queue until a human approves or rejects them, as in the sketch below.
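This is a minimal sketch of that gate, assuming an in-memory queue with print statements standing in for real notification and audit hooks; the `ApprovalQueue` and `Operation` names are illustrative, not an existing API:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable

class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"

@dataclass
class Operation:
    """A high-stakes action waiting on a human decision."""
    description: str
    execute: Callable[[], None]     # runs only after approval
    status: Status = Status.PENDING

@dataclass
class ApprovalQueue:
    """Queue critical operations instead of executing them immediately."""
    pending: list[Operation] = field(default_factory=list)

    def submit(self, op: Operation) -> None:
        self.pending.append(op)
        print(f"[review needed] {op.description}")   # stand-in for a Slack/email hook

    def approve(self, op: Operation) -> None:
        op.status = Status.APPROVED
        op.execute()                                 # the action only runs here

    def reject(self, op: Operation, reason: str) -> None:
        op.status = Status.REJECTED
        print(f"[rejected] {op.description}: {reason}")  # stand-in for an audit log

# Usage: the AI proposes, a human disposes.
queue = ApprovalQueue()
queue.submit(Operation(
    description="Publish AI-suggested price change for product 1234",
    execute=lambda: print("price published"),        # stand-in for the real action
))
```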
My YouTube MCP server has 356 tests with a 100% pass rate. Here's the testing pyramid I use:

- Unit tests: individual functions with mocked AI responses
- Integration tests: workflows with real API calls (rate limited)
- End-to-end tests: full user workflows in a staging environment
```python
import pytest
from unittest.mock import AsyncMock

@pytest.mark.asyncio
async def test_transcript_extraction():
    """Test transcript tool with mocked API."""
    # Mock the YouTube API response
    mock_youtube = AsyncMock()
    mock_youtube.captions().download.return_value = "Test transcript"

    # Test the tool
    result = await get_video_transcript(
        video_id="test123",
        youtube_client=mock_youtube
    )

    # Assertions
    assert result["text"] == "Test transcript"
    assert result["duration"] > 0
    mock_youtube.captions().download.assert_called_once()

@pytest.mark.asyncio
async def test_transcript_not_available():
    """Test error handling when no captions exist."""
    mock_youtube = AsyncMock()
    mock_youtube.captions().download.side_effect = TranscriptNotAvailable()

    result = await get_video_transcript(
        video_id="nocaptions",
        youtube_client=mock_youtube
    )

    # Should return helpful error, not crash
    assert "error" in result
    assert "suggestions" in result
```
```python
@pytest.mark.integration
@pytest.mark.asyncio
async def test_full_video_analysis():
    """Test complete workflow with real API (rate limited)."""
    # Use real video ID (public test video)
    video_id = "dQw4w9WgXcQ"

    # Step 1: Get context
    context = await get_video_context(video_id)
    assert context["title"]

    # Step 2: Check transcript availability
    availability = await check_transcript_availability(video_id)
    assert availability["available"]

    # Step 3: Search transcript
    results = await search_transcript(
        video_id=video_id,
        search_query="test"
    )
    assert len(results) >= 0  # May or may not have matches

# Run integration tests with: pytest -m integration --maxfail=1
```
You can't improve what you don't measure. Here's what I track:
```python
import time
import traceback
from datetime import datetime

import structlog

logger = structlog.get_logger()

async def process_product_pricing(product_id):
    """Process pricing with full observability."""
    logger.info(
        "pricing.started",
        product_id=product_id,
        timestamp=datetime.now()
    )
    try:
        # Get AI suggestion
        start = time.time()
        ai_price = await get_ai_price_suggestion(product_id)
        duration = time.time() - start

        logger.info(
            "pricing.ai_complete",
            product_id=product_id,
            suggested_price=ai_price,
            tokens_used=ai_price.tokens,
            duration_ms=duration * 1000,
            model="claude-haiku"
        )

        # Validate
        validation = validate_pricing(product_id, ai_price)
        if not validation.passed:
            logger.warning(
                "pricing.validation_failed",
                product_id=product_id,
                reason=validation.reason,
                ai_price=ai_price,
                expected_range=validation.expected_range
            )
            return None

        logger.info(
            "pricing.success",
            product_id=product_id,
            final_price=ai_price
        )
        return ai_price

    except Exception as e:
        logger.error(
            "pricing.error",
            product_id=product_id,
            error=str(e),
            traceback=traceback.format_exc()
        )
        raise
```
Track these metrics in your monitoring system: cost per request (tokens used × model rate), latency, validation failure rate, cache hit rate, and error rate.
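Here's a minimal sketch of how those counters might be wired up with `prometheus_client`; the metric names and the `record_call` helper are assumptions for illustration, not part of the stack described above:

```python
from prometheus_client import Counter, Histogram

# Hypothetical metric names -- adjust to your monitoring stack.
AI_REQUESTS = Counter("ai_requests_total", "AI calls made", ["model"])
AI_TOKENS = Counter("ai_tokens_total", "Tokens consumed", ["model", "direction"])
AI_LATENCY = Histogram("ai_request_seconds", "AI call latency", ["model"])
VALIDATION_FAILURES = Counter("ai_validation_failures_total", "Outputs rejected by validation")

def record_call(model: str, tokens_in: int, tokens_out: int, seconds: float, passed: bool) -> None:
    """Record one AI call's cost, latency, and validation outcome."""
    AI_REQUESTS.labels(model=model).inc()
    AI_TOKENS.labels(model=model, direction="in").inc(tokens_in)
    AI_TOKENS.labels(model=model, direction="out").inc(tokens_out)
    AI_LATENCY.labels(model=model).observe(seconds)
    if not passed:
        VALIDATION_FAILURES.inc()
```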
I initially used Sonnet 4.5 for everything. Cost ballooned. Switching simple tasks to Haiku and Cloudflare AI saved $1,200+/month with no quality loss.
AI hallucinated a $400 "monthly HOA fee" that was actually annual ($400/12 = $33/month). Caught by validation. Would have been embarrassing in production.
YouTube MCP initially hit quota limits by noon. Added caching → 90% reduction in API calls. Now runs all day without issues.
Most bugs happen on edge cases: missing data, malformed inputs, API timeouts. Test those explicitly. My 356 tests mostly cover error scenarios.
Structured logging lets you debug production issues without accessing customer data. Log inputs, outputs, validation results, and AI reasoning.
Running production AI systems for property management (a portfolio of properties) and e-commerce (7,000+ products). 356 tests, $0.011 per 1k Cloudflare Neurons on bulk tasks, 90% cache hit rate. Real systems, real numbers, real lessons.
I help teams move from AI demos to production systems with proper validation, monitoring, and cost optimization.
Schedule a Consultation