Monitoring & Logging: How to Track Your AI Agent's Performance in Production

The x402 Protocol was live, serving real paid requests, and I had no idea what it was doing at 3 AM. Was it handling requests correctly? Was the sentiment cache refreshing? Was the polling loop stuck? The service was running but I was flying blind. That gap between "deployed" and "monitored" is where solo projects silently die.

Enterprise teams use Datadog, Grafana, PagerDuty — tools that cost hundreds per month and take weeks to configure. I needed something that costs nothing, takes an afternoon to set up, and tells me what's happening in production from my phone. The answer turned out to be a combination of JSONL logging, Telegram bot commands, and Claude-assisted log analysis.

What This Post Covers

The monitoring stack I run across every PrintMoneyLab project for $0, why Telegram became my alerting platform, the specific log entries that actually matter versus the ones that just create noise, how I debug production issues by pasting logs into Claude AI, and the daily auto-report that tells me the health of every service without me asking.

Log Everything, Read Almost Nothing

The instinct is to log only what you think you'll need. The problem is you don't know what you'll need until something breaks. The rule I follow now: log every meaningful event, but structure the logs so you never have to read them all.

Every x402 API call writes a JSONL line with timestamp, endpoint, whether it was paid, the price, and the response status. Every Claude Haiku invocation logs input tokens, output tokens, estimated cost, and response time. Every cache hit, cache miss, and cache expiration gets a line. Every payment settlement records the transaction hash.

# stats.jsonl: one event per line, append-only
{"ts": 1746230400, "type": "api_call", "ep": "kr-sentiment", "paid": true, "usd": 0.05}
{"ts": 1746230401, "type": "claude", "in": 2073, "out": 204, "cost": 0.003}
{"ts": 1746230402, "type": "cache_hit", "ep": "kr-sentiment", "age_s": 1847}
{"ts": 1746230520, "type": "alert", "sym": "SNT", "premium": -63.8}

The file grows by maybe 5 MB per month. That's nothing on a 200 GB disk. I never read it manually. Everything that matters gets surfaced by the tools that sit on top of it.
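A minimal Python sketch of an append-only event writer that produces lines in this shape. The function name `log_event` and its signature are assumptions made for illustration, not the code running in production; only the field names come from the sample entries.

```python
import json
import time


def log_event(path, event_type, **fields):
    """Append one event as a single JSON line and return the record written."""
    record = {"ts": int(time.time()), "type": event_type, **fields}
    line = json.dumps(record) + "\n"
    # Append mode plus one write per event keeps each entry on its own line.
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)
    return record
```

Called as `log_event("stats.jsonl", "api_call", ep="kr-sentiment", paid=True, usd=0.05)`, it writes one line per event and never rewrites earlier ones.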

Why JSONL instead of a database for logs: Each event is appended as one short line in a single write, which on a local filesystem keeps concurrent processes from corrupting each other's entries in practice. Each line is independent, so a corrupt line doesn't break the whole file. Command-line tools (grep, jq, wc) work directly on JSONL without needing a database client. And the file is portable: I can scp it to my laptop and analyze it offline with Claude Code if I need to. Structured data goes in SQLite (covered in Episode 11). Event streams go in JSONL.

Telegram as the Monitoring Dashboard

This is the part that feels like a hack but works better than most professional monitoring tools for a solo operation.

Every project I run has a Telegram bot that doubles as a monitoring interface. The bot I already built for SpeedTap's game notifications also serves as my admin console. The x402 Protocol bot that handles payment alerts also accepts monitoring commands.

Four commands that I actually use daily:

# Telegram bot commands (registered via @BotFather)
/stats      Today/this week/this month summary
            DAU, API calls, paid calls, revenue, cache hit rate
/cost       Claude API cost breakdown
            Calls today, tokens used, estimated USD, daily cap status
/sentiment  Current kr-sentiment analysis (cached)
            Quick glance at what the AI is reporting
/kimp       Kimchi premium alerts ±10% tokens
            Which tokens are trading significantly above/below global

I type /stats on my phone during lunch and get back a summary of everything that happened this morning. How many API calls, how many were paid, total revenue for the day, cache hit ratio. If everything looks normal, I move on. If something's off — zero calls when there should be traffic, cache hit ratio dropping to 0%, revenue that doesn't match the call count — I know where to dig.
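The aggregation behind a command like /stats can be sketched in a few lines of Python. This is an illustration under assumptions: the `summarize` helper is hypothetical, and the `cache_miss` event type is assumed to mirror the `cache_hit` entries shown earlier.

```python
import json


def summarize(lines, since_ts):
    """Aggregate JSONL event lines with ts >= since_ts into a small summary."""
    calls = paid = hits = misses = 0
    revenue = 0.0
    for line in lines:
        try:
            ev = json.loads(line)
        except json.JSONDecodeError:
            continue  # a corrupt line is skipped, the rest still count
        if not isinstance(ev, dict) or ev.get("ts", 0) < since_ts:
            continue
        kind = ev.get("type")
        if kind == "api_call":
            calls += 1
            if ev.get("paid"):
                paid += 1
                revenue += ev.get("usd", 0.0)
        elif kind == "cache_hit":
            hits += 1
        elif kind == "cache_miss":  # assumed event name
            misses += 1
    lookups = hits + misses
    return {
        "calls": calls,
        "paid_calls": paid,
        "revenue_usd": round(revenue, 2),
        "cache_hit_rate": hits / lookups if lookups else None,
    }
```

The bot handler would then only need to format this dictionary into a message. Passing a different `since_ts` gives the weekly or monthly view from the same file.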

The daily auto-report is even more useful. Every day at midnight KST, the bot sends me an unprompted summary of the day. Calls, revenue, new users, errors, Claude API cost. I read it in bed. If all the numbers look reasonable, I sleep peacefully. If something's wrong, I can SSH in from my phone and check the logs.

Alerts That Actually Matter

The hardest part of monitoring isn't setting up alerts. It's setting up the right alerts. Too many and you start ignoring them. Too few and you miss real problems.

The alerts I kept after filtering out the noise:

Kimchi premium spikes above ±10%. Any token that crosses the threshold triggers a Telegram message with the symbol, the premium percentage, and the exchange. I added a 6-hour cooldown per token after the SIGN token kept oscillating around 10% and sending me alerts every minute. Cooldown solved the spam while keeping the signal.
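A time-based cooldown of this kind can be sketched as follows. The class and method names are hypothetical; the 6-hour window and the 10% threshold come from the text. The key property is that a dip below the threshold does not reset the timer.

```python
class AlertCooldown:
    """Allow at most one alert per symbol within a fixed time window."""

    def __init__(self, cooldown_s=6 * 3600):
        self.cooldown_s = cooldown_s
        self._last_sent = {}  # symbol -> timestamp of the last alert sent

    def should_alert(self, symbol, premium_pct, now, threshold_pct=10.0):
        if abs(premium_pct) < threshold_pct:
            return False  # under the threshold: no alert, timer untouched
        last = self._last_sent.get(symbol)
        if last is not None and now - last < self.cooldown_s:
            return False  # still cooling down for this symbol
        self._last_sent[symbol] = now
        return True
```

A token hovering around 10% produces one alert, then stays quiet for six hours no matter how often it crosses the line. Other symbols are tracked independently.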

Claude API cost exceeding hourly threshold. If Claude calls exceed a set amount per hour, I get a warning. This is the safety rail from Episode 6 — the early detection system for runaway loops. The threshold is generous enough that normal operation never triggers it, and tight enough that a stuck retry loop gets caught within minutes.
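One way to express the hourly check in Python, assuming events have already been parsed from the JSONL file into dictionaries. The function names and the `limit_usd` parameter are illustrative; the text does not state the actual threshold.

```python
def hourly_claude_cost(events, now, window_s=3600):
    """Sum the 'cost' field of claude events inside the trailing window."""
    return sum(
        ev.get("cost", 0.0)
        for ev in events
        if ev.get("type") == "claude" and now - window_s < ev.get("ts", 0) <= now
    )


def cost_warning(events, now, limit_usd):
    """Return a warning string if the last hour's spend exceeds limit_usd."""
    spent = hourly_claude_cost(events, now)
    if spent > limit_usd:
        return f"Claude cost ${spent:.3f} in the last hour exceeds ${limit_usd:.2f}"
    return None
```

Run every few minutes, a check like this would catch a stuck retry loop well before the daily total becomes painful.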

Service downtime. A simple health-check script on the Oracle instance pings each service endpoint every 60 seconds. If a service doesn't respond, the script sends a Telegram message. If it still doesn't respond after three checks, the script attempts an automatic restart via systemd. I've had this trigger twice in three months, both times because I deployed a broken config, not because the service itself failed.
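The alert-then-restart logic can be sketched as a small state tracker. This is an illustration in Python rather than the actual script, and it assumes "after three checks" means the third consecutive failure. `HealthTracker`, `is_up`, and `restart` are hypothetical names.

```python
import subprocess
import urllib.request


def is_up(url, timeout_s=5):
    """Return True only if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, timeouts, refused connections
        return False


class HealthTracker:
    """Count consecutive failures and decide what to do after each check."""

    def __init__(self, restart_after=3):
        self.restart_after = restart_after
        self.failures = 0

    def record(self, up):
        """Return 'ok', 'alert', or 'restart' for this check result."""
        if up:
            self.failures = 0
            return "ok"
        self.failures += 1
        if self.failures >= self.restart_after:
            self.failures = 0  # start counting again after the restart attempt
            return "restart"
        return "alert"


def restart(unit):
    """Ask systemd to restart a unit; the unit name is supplied by the caller."""
    subprocess.run(["systemctl", "restart", unit], check=False)
```

Keeping the decision in `record` separate from the network call makes the restart rule easy to test without taking a real service down.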

What I don't alert on: individual 4xx errors (users send malformed requests all the time), cache misses (expected behavior), individual payment failures (wallet balance issues aren't my problem). These get logged but don't generate notifications. The log is the archive; Telegram is the signal.

Debugging With Claude Using Logs

This is the workflow that saves me the most time, and it barely counts as a "tool."

When something goes wrong in production, the old me would've stared at the error message, tried to guess what was wrong, changed something, restarted, and checked if it worked. The new workflow:

Copy the last 50-100 lines of relevant logs. Paste them into Claude AI. Describe what I expected to happen and what actually happened. Ask for a hypothesis.

Nine times out of ten, Claude spots a pattern I missed. A specific API response code I didn't handle. A race condition between the cache check and the Claude call. A timezone mismatch that only appears during the UTC midnight rollover.

The logs are the context. Claude AI is the analyst. I'm the one who decides which hypothesis to test first. The loop — paste logs, get hypothesis, test it with Claude Code, check results — compresses hours of solo debugging into minutes.

One concrete example: the x402 Protocol's kimchi premium alerts were firing every single minute for SIGN token. The log showed the alert, the cooldown reset, the alert again, the cooldown reset again. The pattern was obvious once Claude saw 20 lines of it — the cooldown was resetting whenever the premium dipped below 10% and then immediately crossed back above. A 6-hour time-based cooldown replaced the threshold-based reset, and the spam stopped.

Token and Cost Monitoring

When you're running AI-powered endpoints in production, monitoring the model's behavior is as important as monitoring the application's behavior.

Every Claude Haiku call in the x402 Protocol logs three things: input token count, output token count, and estimated cost in USD. The /cost Telegram command reads these entries and summarizes the day's AI spend.

This matters for two reasons. First, cost control: if the sentiment endpoint suddenly starts producing 800-token responses instead of the usual 200, that's 4x the output-token cost on every call, and I need to investigate. Usually it means the prompt drifted or the input data changed shape. Second, quality control: if output length drops dramatically, the model might be giving truncated or low-effort responses that users are paying for.
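Both failure modes can be caught with one comparison against a baseline output length. The sketch below is an assumption about how such a check might look: the `output_drift` name, the use of the median, and the 2x factor are all illustrative choices.

```python
from statistics import median


def output_drift(events, baseline_out, factor=2.0):
    """Classify the median output-token count of claude events vs a baseline."""
    outs = [
        ev["out"] for ev in events
        if ev.get("type") == "claude" and "out" in ev
    ]
    if not outs:
        return None  # nothing to judge yet
    typical = median(outs)
    if typical > baseline_out * factor:
        return "long"   # possible prompt drift or changed input shape
    if typical < baseline_out / factor:
        return "short"  # possible truncated or low-effort responses
    return "normal"
```

With a baseline of 200 tokens, a day of roughly 800-token responses comes back as "long", which is the cue to look at the prompt and the input data.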

The daily auto-report includes a "Claude cost" line item. If it's under $0.10, everything's normal. If it spikes, I check whether it's legitimate traffic (more users means more calls) or something broken (a loop, a prompt that's too long, a cache that stopped working).

What This Costs

Total monitoring infrastructure cost across all projects: $0.

Telegram bot API is free. JSONL logging is just writing to disk. The health-check script is a bash loop. The daily report is a Python function triggered by cron. Claude AI for log analysis is part of my existing subscription.

The same monitoring capabilities from commercial tools would run $50-200 per month at the lowest tiers. For a solo builder whose projects have earned $1.19 in cumulative revenue, paying $100 per month for monitoring is upside-down economics. The free stack does everything I actually need.

Where to Start

If your project is running in production without any monitoring, start with one thing: a health check that tells you when the service is down.

A bash script that runs every minute via cron, curls your main endpoint, and sends a Telegram message if the response isn't 200. That's maybe 10 lines of code. Claude Code can write it in under a minute. Set it up, verify it fires correctly by stopping your service intentionally, and you have basic monitoring.
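For the alerting half, here is a hedged sketch in Python instead of bash, using only the standard library and the Telegram Bot API's `sendMessage` method. The function names are hypothetical, and the bot token and chat ID are placeholders you would supply yourself.

```python
import json
import urllib.request


def build_alert_request(bot_token, chat_id, text):
    """Build the POST request for the Telegram Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    body = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(bot_token, chat_id, text, timeout_s=10):
    """Send the message; returns True if Telegram answers with HTTP 200."""
    req = build_alert_request(bot_token, chat_id, text)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.status == 200
```

A cron entry that runs a script calling `send_alert` whenever the endpoint check fails gives the same result as the bash version. Building the request separately from sending it lets you inspect it without hitting the network.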

From there, add logging for the events you care about. API calls, errors, cost. Write them as JSONL lines. Build a Telegram command that summarizes today's entries. Each layer takes maybe an hour to add and permanently improves your visibility into what the project is doing.

The pattern is always the same: log the event, surface it through Telegram, and use Claude to analyze patterns when something looks wrong. Simple tools, consistently applied.

What's Next

Monitoring tells you what's happening. The next post in this series steps back from tools and talks about the philosophy of how AI-generated code actually gets built in practice — the spectrum between "vibe coding" and precision engineering, where each approach works, and the workflow I use to get speed without sacrificing stability.



More posts in this series will cover the philosophy of AI-assisted coding, scaling patterns, and the path to sustainable revenue. If you're working on shipping something with AI tools and have questions, drop them in the comments — the more we share, the faster we all move.

Disclaimer: This blog documents practical development workflows based on personal experience. Nothing here is financial, legal, or professional advice.
