Scaling Without Breaking: Handling Traffic and Latency in AI Workflows
When the first real trading bot found my x402 API and started calling it every 5 minutes for Korean price data, I had a brief moment of excitement followed by a longer moment of anxiety. What happens when there are 10 bots calling every 5 minutes? What about 100? Is a single free Oracle instance going to handle that?
The honest answer: I don't know yet. I haven't hit a scaling wall. But thinking through where the bottlenecks will appear — and building the architecture to delay them as long as possible — is work I've done. This post is about that work, with clear markers for what I've tested in production versus what I've planned but not stress-tested.
What This Post Covers
The specific scaling patterns I use across my projects, why caching is the single biggest performance lever for AI-powered APIs, how async processing prevents slow AI calls from blocking fast endpoints, the latency numbers I actually see in production, and the line between "scaling for real traffic" and "overengineering for traffic that doesn't exist yet."
The Bottleneck Isn't Where You'd Expect
Most developers worry about compute scaling — CPU maxing out, memory exhaustion, running out of connections. For AI-powered APIs, the actual bottleneck is almost always the AI call itself.
A Claude Haiku call takes 1-3 seconds. My endpoints that return cached data respond in under 10 milliseconds. That's a 100x-300x difference in response time. The architecture of the entire system flows from this single fact: the AI call is slow, everything else is fast, so the goal is to minimize how often the AI call actually happens.
The Caching Layer That Runs Everything
Every AI-powered endpoint in the x402 Protocol uses some form of caching. The patterns vary by how fast the underlying data changes.
kr-sentiment: 1-hour cache. Korean market sentiment doesn't flip by the minute. An agent calling this endpoint wants the current mood, not millisecond-level precision. Claude Haiku generates the analysis once per hour maximum. Every other request returns the cached version instantly.
market-read: 1-hour cache, same rationale. The AI-generated English summary of Korean market conditions refreshes on demand, not on schedule.
kimchi-premium, kr-prices, fx-rate: 60-second cache. These endpoints return numerical data from exchange APIs that update every minute. The background polling task refreshes the data; the API endpoint serves whatever's currently in memory. No AI call involved, so latency is consistently under 10ms.
arbitrage-scanner, exchange-alerts, market-movers: 60-second cache, derived from the same polling data. Calculations run in-process, no external API calls during request handling.
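As a rough illustration, the cache tiers listed above could be written down as a single lookup table. The endpoint names and TTLs come from the list; the dictionary, the `AI_ENDPOINTS` set, and the `is_fresh` helper are hypothetical names for this sketch, not the service's actual code.

```python
# Hypothetical TTL table. Endpoint names and TTL values mirror the list above;
# the structure itself is an assumption made for illustration.
CACHE_TTL_SECONDS = {
    "kr-sentiment": 3600,      # AI-generated, refreshed at most once per hour
    "market-read": 3600,       # AI-generated, same rationale
    "kimchi-premium": 60,      # numerical, fed by the background poller
    "kr-prices": 60,
    "fx-rate": 60,
    "arbitrage-scanner": 60,   # derived in-process from the polled data
    "exchange-alerts": 60,
    "market-movers": 60,
}

# Endpoints whose cache miss costs a Claude call rather than a memory read.
AI_ENDPOINTS = {"kr-sentiment", "market-read"}


def is_fresh(endpoint: str, age_seconds: float) -> bool:
    """Return True if a cached entry of this age can still be served."""
    return age_seconds < CACHE_TTL_SECONDS[endpoint]
```

Keeping the TTLs in one place makes it easy to see that only two endpoints can ever trigger a slow call.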
The key insight is layered caching: the background poller refreshes raw data every 60 seconds regardless of traffic. The API endpoints serve from that pre-computed cache. AI-powered endpoints add a second cache layer with longer TTLs. The user never waits for data collection or AI processing — both happen asynchronously.
One detail deserves attention: the async lock around the cache refresh. Without it, 10 simultaneous requests hitting an expired cache would each trigger a separate Claude call. With the lock, the first request triggers the call while the others wait for the result. Same output, a tenth of the cost in that case, and the lock adds negligible latency.
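Here is a minimal sketch of that locking pattern using Python's `asyncio.Lock`. It assumes an asyncio-based server. The class name `SingleFlightCache` and the `fake_claude_call` stand-in are hypothetical, and the real endpoint code may be organized differently.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional


class SingleFlightCache:
    """Caches one value for `ttl` seconds; concurrent misses share one refresh."""

    def __init__(self, fetch: Callable[[], Awaitable[str]], ttl: float) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._value: Optional[str] = None
        self._expires_at = 0.0
        self._lock = asyncio.Lock()

    def _fresh(self) -> bool:
        return self._value is not None and time.monotonic() < self._expires_at

    async def get(self) -> str:
        if self._fresh():
            return self._value  # fast path: no lock, no AI call
        async with self._lock:
            # Re-check after acquiring the lock: another request may have
            # refreshed the value while this one was waiting.
            if self._fresh():
                return self._value
            self._value = await self._fetch()
            self._expires_at = time.monotonic() + self._ttl
            return self._value


async def demo() -> int:
    """Fire 10 concurrent requests at a cold cache and count upstream calls."""
    calls = 0

    async def fake_claude_call() -> str:  # hypothetical stand-in for the AI call
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.05)  # simulated latency, far shorter than 1-3 s
        return "neutral"

    cache = SingleFlightCache(fake_claude_call, ttl=3600)
    results = await asyncio.gather(*(cache.get() for _ in range(10)))
    assert all(r == "neutral" for r in results)
    return calls


if __name__ == "__main__":
    print(asyncio.run(demo()))  # prints 1: ten requests, one upstream call
```

The second freshness check inside the lock is what makes this work. Without it, each waiting request would make its own call as soon as it acquired the lock.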
What the Latency Actually Looks Like
Here's what I see in production across different endpoint types. These are real numbers from the x402 Protocol, not benchmarks:
The gap between cache hit and cache miss on AI endpoints is the entire performance story. A cache hit returns in under a millisecond. A cache miss takes 2+ seconds because it's waiting on Claude. Every architectural decision I've made is about pushing the cache hit ratio as high as possible.
Current cache hit ratio for kr-sentiment: about 95%. Meaning only 1 in 20 requests actually triggers a Claude call. The other 19 get instant responses at zero marginal cost.
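To see why the hit ratio matters so much, it helps to work out the average. The sketch below assumes round figures of 1 ms for a cache hit and 2,000 ms for a miss. These are illustrative inputs loosely based on the numbers in this section, not measurements, and the function names are hypothetical.

```python
def expected_latency_ms(hit_ratio: float, hit_ms: float, miss_ms: float) -> float:
    """Average response time as a weighted mix of cache hits and misses."""
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms


def ai_calls(requests: int, hit_ratio: float) -> float:
    """Expected number of requests that fall through to the AI call."""
    return requests * (1.0 - hit_ratio)


if __name__ == "__main__":
    # Assumed inputs: 1 ms hit, 2000 ms miss (illustrative, not measured).
    for ratio in (0.80, 0.95, 0.99):
        avg = expected_latency_ms(ratio, hit_ms=1.0, miss_ms=2000.0)
        print(f"hit ratio {ratio:.0%}: avg {avg:.0f} ms, "
              f"{ai_calls(1000, ratio):.0f} AI calls per 1000 requests")
```

Under those assumptions, the average at a 95% hit ratio is about 101 ms, and nearly all of it comes from the 5% of requests that miss. Moving from 80% to 95% cuts both the average latency and the number of AI calls by roughly a factor of four.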
Background Polling: The Invisible Architecture
The reason numerical endpoints respond in 3ms isn't clever code. It's that the work already happened before the request arrived.
A background task runs every 60 seconds, polling five APIs: Upbit prices, Upbit market details, Bithumb prices, Binance prices, and the USD/KRW exchange rate. Five calls, all free, no authentication needed. The results get processed — 189 token premiums calculated, anomalies detected, alerts evaluated — and stored in memory.
When a user requests kimchi-premium, the endpoint reads from the pre-computed in-memory store. No API call happens during request handling. The polling and the serving are completely decoupled. The user gets whatever data was computed in the most recent 60-second cycle.
This is the same pattern the Weather Bot uses for Polymarket data: poll the source on a fixed schedule, cache the results, serve from cache on demand. The polling frequency matches how fast the underlying data actually changes. Crypto prices? 60 seconds. Market sentiment? 1 hour. Weather forecasts? Varies by model.
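The poll-then-serve pattern can be sketched in a few lines of asyncio. Everything named here (`latest`, `poll_forever`, `read_snapshot`, `fake_fetch_all`) is hypothetical. The choice to keep serving the previous snapshot when a poll fails is also an assumption of this sketch, since the post does not describe its failure handling.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict

# In-memory store that request handlers read from.
latest: Dict[str, Any] = {"data": None, "updated_at": None}


async def poll_forever(
    fetch_all: Callable[[], Awaitable[Dict[str, Any]]],
    interval: float = 60.0,
) -> None:
    """Refresh the in-memory snapshot on a fixed schedule, independent of traffic."""
    while True:
        try:
            latest["data"] = await fetch_all()
            latest["updated_at"] = time.time()
        except Exception as exc:
            # Assumption: on failure, keep serving the previous snapshot.
            print(f"poll failed: {exc!r}")
        await asyncio.sleep(interval)


def read_snapshot() -> Dict[str, Any]:
    """What an endpoint handler would call: a dict copy, no network I/O."""
    return dict(latest)


async def demo() -> Dict[str, Any]:
    async def fake_fetch_all() -> Dict[str, Any]:
        # Stand-in for the exchange and FX calls; the value is made up.
        return {"BTC_premium_pct": 1.2}

    task = asyncio.create_task(poll_forever(fake_fetch_all, interval=0.01))
    await asyncio.sleep(0.05)  # let a few poll cycles run
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return read_snapshot()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because the handler only reads a dictionary, its latency does not depend on how slow or flaky the upstream APIs are at that moment.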
Where I Haven't Scaled Yet (Honestly)
Everything above works at current traffic levels, which are modest. A handful of bots, a few dozen calls per day, occasional bursts when a new user discovers the service. The Oracle free tier handles this without measurable load.
What I haven't tested:
Concurrent heavy traffic. If 50 trading bots all start polling every 5 minutes simultaneously, that's 600 requests per hour. The caching layer should absorb this easily — most requests hit cache — but I haven't verified it under load. The first real stress test will happen organically when traffic grows, not from a synthetic benchmark.
Database contention. SQLite handles concurrent reads well, but writes are serialized, so concurrent writers can block each other. If the SpeedTap game gets a sudden traffic spike (say, from an Apps Center listing), the single-writer limitation might become visible. Enabling SQLite's write-ahead log (WAL) mode would ease this by letting reads proceed during a write, and moving to PostgreSQL would remove the single-writer limit altogether, but I'm not migrating preemptively.
Multi-region serving. Everything runs from a single Oracle instance in Tokyo. Users in Ohio or Lima experience network latency on top of application latency. For cached endpoints returning in 3ms, the network round trip dominates the total response time. For AI endpoints with 2+ second response times on a cache miss, the network latency is lost in the noise. Multi-region would matter only if I needed sub-50ms responses globally, which I don't.
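On the database point above, switching SQLite to write-ahead logging is a one-line change, sketched here with Python's standard `sqlite3` module. The helper name `open_wal`, the file name, and the 5-second busy timeout are assumptions for illustration. WAL mode lets readers proceed while a write is in progress, but it still allows only one writer at a time.

```python
import os
import sqlite3
import tempfile


def open_wal(path: str) -> sqlite3.Connection:
    """Open a SQLite database file and switch it to write-ahead logging."""
    # timeout: how long a connection waits on a locked database before raising.
    conn = sqlite3.connect(path, timeout=5.0)
    mode = conn.execute("PRAGMA journal_mode=WAL;").fetchone()[0]
    if mode.lower() != "wal":
        # In-memory databases, for example, cannot use WAL.
        conn.close()
        raise RuntimeError(f"could not enable WAL, journal mode is {mode!r}")
    return conn


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conn = open_wal(os.path.join(tmp, "scores.db"))  # hypothetical file name
        conn.execute("CREATE TABLE scores (player TEXT, points INTEGER)")
        conn.execute("INSERT INTO scores VALUES (?, ?)", ("demo", 42))
        conn.commit()
        print(conn.execute("SELECT COUNT(*) FROM scores").fetchone()[0])
        conn.close()
```

The journal mode is stored in the database file, so it only has to be set once per file rather than on every connection.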
I'm documenting these gaps not because they're urgent but because pretending they don't exist would be dishonest. The architecture handles current traffic comfortably. Where it breaks under 10x or 100x load is a problem I'll solve when — and if — it arrives.
The Overengineering Trap
This is the hardest discipline for a solo builder: not building for scale you don't have.
I could set up a Redis cluster for distributed caching. I could deploy across three regions with a load balancer. I could move to PostgreSQL with read replicas. Each of those would make the system more robust at higher traffic levels. Each would also add operational complexity that I'd have to maintain every day, at a current traffic level of 30-50 calls per day.
The cost of premature scaling isn't just time. It's ongoing maintenance burden for infrastructure that serves no current purpose. Every additional system is something that can break, something that needs monitoring, something that adds to the list of things I check when something goes wrong.
The rule I follow: build the simplest architecture that handles current traffic with comfortable headroom. When traffic exceeds that headroom, add the next layer. Not before. The caching + background polling pattern on a single server handles at least 10x my current traffic. That's enough headroom to wait.
Where to Start
If you're running an AI-powered service and haven't thought about scaling, start with one question: how many of your requests actually need a fresh AI call?
For most services, the answer is "far fewer than you think." If the same question gets asked repeatedly and the answer doesn't change minute-to-minute, cache it. If data comes from an external source that updates on a schedule, poll it in the background and serve from memory.
Add a cache hit ratio to your monitoring (covered in Episode 12). If it's below 80%, you have room to improve. If it's above 95%, your caching layer is doing its job and the next bottleneck is somewhere else.
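Tracking the hit ratio needs little more than two counters. This is a minimal sketch with a hypothetical `CacheStats` class. How the numbers get exported to a monitoring system is left out, since that depends on the setup.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Counts cache hits and misses for one endpoint."""

    hits: int = 0
    misses: int = 0

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        # Report 0.0 before any traffic instead of dividing by zero.
        return self.hits / total if total else 0.0


if __name__ == "__main__":
    stats = CacheStats()
    for _ in range(19):
        stats.record(hit=True)
    stats.record(hit=False)
    print(f"{stats.hit_ratio:.0%}")  # prints 95%
```

Calling `record` once in the cache lookup path, with one `CacheStats` per endpoint, is enough to tell whether an endpoint sits below the 80% mark or above 95%.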
The rest of scaling — multi-region, database migration, load balancing — can wait until the numbers demand it. Build for the traffic you have. Plan for the traffic you expect. Don't build for the traffic you dream about.
What's Next
We've covered the full stack: infrastructure, deployment, security, monitoring, coding philosophy, and scaling. The final post in this series ties it all together — the path from solo builder to sustainable revenue, what "professional-grade" means when you're one person with AI tools, and the honest economics of running production AI services at the scale I operate at today.
The final post in this series will cover revenue strategies and the economics of solo AI building. If you're working on shipping something with AI tools and have questions, drop them in the comments — the more we share, the faster we all move.
Disclaimer: This blog documents practical development workflows based on personal experience. Nothing here is financial, legal, or professional advice.