The cost shape of an LLM-heavy serverless pipeline
Running my FPL enrichment pipeline costs roughly £1.50 to £3.00 a week. Nearly all of that is the Claude API. The AWS side (Lambda, Step Functions, S3, CloudFront, DynamoDB, SSM) combines into a few pennies, and most months it’s literally zero.
I engineered for that deliberately. It’s worth writing down what the cost shape actually looks like, because most “serverless cost” advice online is calibrated for workloads where the compute dominates. On an LLM-heavy pipeline the compute doesn’t dominate, so almost nothing about the standard advice applies.
A note on the free tier before going further. AWS’s is unusually generous if you read it carefully: 1M Lambda invocations and 400K GB-seconds of compute per month, 5GB of S3, 1TB of CloudFront egress, 4,000 Step Functions state transitions, 25GB of DynamoDB storage. For a personal-scale weekly pipeline, that’s a wide envelope. Most “I built this on AWS for $0” posts aren’t lying; they’re operating inside it.
The goal is to keep the cost predictable and proportional to the value, so I can make changes without flinching. Minimisation is a different target, and the difference matters once you start tuning.
The pipeline runs once a week. Five collector Lambdas pull from the FPL API, Understat, and a handful of news sources, landing raw JSON in an S3 lake. A validation Lambda promotes successful runs to a clean layer. A Step Function fans out enrichment Lambdas that call Anthropic Haiku for bulk summaries and Sonnet for the harder reasoning passes. A curate step writes dashboard-ready and agent-ready JSON to a curated layer, which CloudFront serves through an OAC-signed S3 origin. The whole graph runs in roughly fifteen minutes.
The week’s bill, broken out: Lambda compute about £0.00, Step Functions about £0.00, S3 storage plus requests about £0.00, CloudFront egress about £0.00, DynamoDB on-demand about £0.00, SSM Parameter Store £0.00 (standard tier is free), Anthropic API £1.50 to £3.00. What matters more than the absolute figure is the ratio. If I doubled Lambda memory tomorrow, total monthly cost would change by pennies. If I switched one Haiku call to Sonnet across all 700 players per gameweek, total monthly cost would roughly double. The leverage sits in the second knob, and only the second knob.
The right starting question for an LLM workload at any non-trivial volume is rate limits, not pricing. Pricing tells you what a call costs in pence. Rate limits tell you whether the call can happen at the volume you need. Anthropic’s rate limits are set per minute, in requests and in tokens, and on a batch workload the token limit binds well before pricing does. The first time I tried to enrich 700 players in parallel I hit the per-minute token cap, queued, and discovered that a “should take 30 seconds” job actually took eight minutes because it had spent seven of them waiting on rate-limit windows. So the cost-engineering started with “what’s the rate-limit shape, what’s the call schedule that respects it, what’s the parallelism factor that holds without queuing”. Pricing falls out of that. Without it, pricing is a meaningless number, because the job won’t run at the rate you want.
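That scheduling question reduces to a little arithmetic. A minimal sketch, assuming a hypothetical cap of 200,000 tokens per minute (check the real figure for your own tier) and about 2,000 tokens per call; `plan_batch` and its parameter names are mine, for illustration only:

```python
import math


def plan_batch(total_calls: int, tokens_per_call: int, tokens_per_minute_cap: int) -> dict:
    """Work out how many calls fit in a one-minute window and how many windows the batch needs."""
    if tokens_per_call > tokens_per_minute_cap:
        raise ValueError("a single call exceeds the per-minute token cap")
    # Whole calls that fit inside one rate-limit window.
    calls_per_minute = tokens_per_minute_cap // tokens_per_call
    # Windows needed to get through the whole batch without queuing.
    minutes_needed = math.ceil(total_calls / calls_per_minute)
    return {"calls_per_minute": calls_per_minute, "minutes_needed": minutes_needed}


# 200_000 is an assumed cap, not a quoted Anthropic limit.
plan = plan_batch(total_calls=700, tokens_per_call=2_000, tokens_per_minute_cap=200_000)
print(plan)
```

With those assumed numbers the 700 calls need seven one-minute windows however much parallelism is thrown at them, so the useful parallelism factor is whatever keeps roughly 100 calls in flight per minute and no more.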
Once the rate-limit shape was right, the cost picture sharpened.
Pick any LLM workload and write down the cost matrix: rows for each model used, columns for each call type, cell equals (model price per token) times (calls per period) times (tokens per call). For a batch workload at non-trivial volume, one cell is almost always more than 90% of total spend, in my experience. Everything else is rounding error. For the FPL pipeline that cell is Sonnet calls on enrichment, about 700 per gameweek, about 2,000 tokens per call. Optimising the Haiku calls on summarisation saves pennies. Switching the Sonnet calls to a cheaper model would save pounds. Filtering the input list of 700 players down to 200 by relevance would save pounds. Batching ten enrichment requests into a single prompt with structured output for ten players at once would save pounds. The first cost optimisation is identifying that one cell and attacking it specifically, rather than reaching for a global “switch to a cheaper model everywhere”; the other cells round off to zero whether you optimise them or not.
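The matrix is small enough to write out directly. A rough sketch: the Sonnet enrichment row uses the figures above, while the prices and the Haiku row are illustrative placeholders I have made up, not real list prices or measured volumes.

```python
# Placeholder prices in pence per million tokens. NOT real list prices.
PENCE_PER_MILLION_TOKENS = {"haiku": 20, "sonnet": 100}

# (model, call type) -> (calls per gameweek, tokens per call)
CELLS = {
    ("sonnet", "enrichment"): (700, 2_000),   # figures from the text
    ("haiku", "summarisation"): (700, 500),   # hypothetical volume
}


def cost_matrix(cells: dict, prices: dict) -> dict:
    """Cost per cell in pence: price per token x calls per period x tokens per call."""
    return {
        (model, call_type): prices[model] * calls * tokens / 1_000_000
        for (model, call_type), (calls, tokens) in cells.items()
    }


matrix = cost_matrix(CELLS, PENCE_PER_MILLION_TOKENS)
total = sum(matrix.values())
dominant = max(matrix, key=matrix.get)  # the cell worth attacking first
share = matrix[dominant] / total
print(f"{dominant}: {matrix[dominant]:.0f}p of {total:.0f}p ({share:.0%})")
```

Under these made-up prices the Sonnet enrichment cell carries about 95% of the total. The same few lines also make it cheap to test a change before making it: edit one tuple in `CELLS` (say 200 calls instead of 700) and see how far the total moves.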
A corollary I had to repeat to myself a few times before I trusted it: serverless infrastructure is rounding error against LLM spend. Lambda, Step Functions, S3, CloudFront, DynamoDB and SSM combined run me about 5p a week. The Claude API costs thirty to sixty times that. Optimising Lambda memory for cost reasons on an LLM-heavy workload is the wrong instinct. Spend the memory if it makes the cold start better. The cost barely moves.
The one piece of AWS that did bite me on cost, and which I think is the most common surprise on workloads this small, was Secrets Manager. £0.30 per secret per month, plus £0.04 per 10K API calls. With four secrets (Anthropic API key, Langfuse public and secret keys, Neon connection string, agent shared secret), that’s £1.20 per month, which doesn’t move the needle but is a recurring line item that doesn’t need to exist. I moved everything to SSM Parameter Store. Standard-tier parameters are free, up to 10,000 of them, with no per-API-call charge at standard throughput. The main thing you lose is automatic rotation, and a solo personal project doesn’t need automatic secret rotation. Migration was an afternoon. The savings are £14.40 a year forever, plus a slight reduction in IAM complexity.
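For reference, reading a secret back out of Parameter Store is one boto3 call. A minimal sketch, assuming the secrets are stored as SecureString parameters; the parameter name `/fpl/anthropic_api_key` and the helper names are hypothetical, not the pipeline’s actual ones.

```python
_cache: dict = {}  # survives across invocations in a warm Lambda container


def _default_client():
    # Imported lazily so this module also loads where boto3 is not installed (e.g. unit tests).
    import boto3
    return boto3.client("ssm")


def get_secret(name: str, client=None) -> str:
    """Return the decrypted value of an SSM parameter, fetching it at most once per container."""
    if name not in _cache:
        client = client or _default_client()
        # WithDecryption=True asks SSM to decrypt SecureString values before returning them.
        response = client.get_parameter(Name=name, WithDecryption=True)
        _cache[name] = response["Parameter"]["Value"]
    return _cache[name]


# Usage inside a handler (hypothetical parameter name):
#   api_key = get_secret("/fpl/anthropic_api_key")
```

The function’s IAM role needs `ssm:GetParameter` on the parameter, plus `kms:Decrypt` if the SecureString uses a customer-managed key rather than the default one.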
The rule I’d put on a fridge magnet: Secrets Manager is paying for rotation. If you aren’t using rotation, you’re paying for nothing.
The DynamoDB story is the opposite shape. On-demand DynamoDB is generous at small scale. The always-free tier covers 25GB of storage and roughly 25 WCU/RCU equivalent per second. The Scout Agent uses DynamoDB for two things: per-request budget caps (which prevent runaway agent loops from blowing past £1 per request) and per-IP rate limiting (one read/write per request). At a few hundred chat requests a month, the table is invisible on the bill. The mistake I almost made early on was creating the table in provisioned-capacity mode to be “safe”. Provisioned capacity is billed per WCU/RCU for every hour it is provisioned, regardless of usage. On-demand is more expensive per request but charges only for actual usage; at workloads that are mostly idle, on-demand wins by a wide margin. The rule of thumb I now repeat: provisioned makes sense when you have predictable steady traffic above roughly 10 ops per second sustained. Below that, on-demand is cheaper.
On Lambda memory, which I’d been treating as a cost knob and now treat as a latency knob: Lambda pricing is per GB-second of compute. Doubling memory doubles the per-second cost, but on CPU-bound work it also typically comes close to halving the execution time, so the total cost stays roughly the same. On I/O-bound work (which most LLM-calling Lambdas are, because most of the runtime is waiting for the model to respond), doubling memory often produces a small latency win and a small cost increase. The interesting case is cold starts. My Scout Agent Lambda’s cold start was 27 seconds at 1024 MB and 8 to 12 seconds at 3008 MB. The cost difference at the volume the agent gets called is fractions of a penny per month. The latency difference is the difference between a usable user experience and an unusable one. The decision is trivial: take the latency, eat the rounding-error cost.
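The GB-second arithmetic is easy to check by hand. A small sketch, assuming a price of about $0.0000166667 per GB-second (verify against current pricing for your region and architecture); the 10, 5 and 9 second durations are invented to show the two cases, not measurements from the pipeline.

```python
# Assumed per-GB-second price in USD; check the current figure for your region.
PRICE_PER_GB_SECOND = 0.0000166667


def invocation_cost(memory_mb: int, duration_s: float, price: float = PRICE_PER_GB_SECOND) -> float:
    """Compute cost of one invocation: memory in GB x billed seconds x price per GB-second."""
    return (memory_mb / 1024) * duration_s * price


# CPU-bound: double the memory, half the duration, same bill.
cpu_small = invocation_cost(1024, 10.0)
cpu_large = invocation_cost(2048, 5.0)

# I/O-bound: double the memory, duration barely moves, bill nearly doubles.
io_large = invocation_cost(2048, 9.0)

print(f"CPU-bound: {cpu_small:.7f} vs {cpu_large:.7f} USD")
print(f"I/O-bound at 2048 MB costs {io_large / cpu_small:.1f}x the 1024 MB run")
```

Even in the I/O-bound case the absolute figure is a small fraction of a cent per invocation at these sizes, which is why the latency column is the one worth reading.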
The same principle for the broader workload. Lambda memory on an LLM-heavy serverless workload is the wrong place to look for cost savings. It’s the right place to look for latency improvements; set it to whatever you need for latency without thinking about the cost line on the Lambda side.
For monitoring: two AWS Budgets, set up on day one. One at $1/month with an alarm at 100% of actuals; this is the “something went very wrong” tripwire, because in normal months AWS spend is roughly $0.50 and any breach means a misconfigured cron or runaway loop. The second at $5/month forecast with an alarm at 80%; the “you’re about to spend real money” warning. On the LLM side, Langfuse tracks per-call cost across the pipeline. I can pull the dashboard and see exactly which prompt is generating which cost over a week. When prompt-engineering a new enrichment pass, I run it on five players, look at Langfuse, see the per-call token count, and extrapolate before letting it loose on 700. This catches the “I added a 4K-token few-shot block and didn’t realise” case before it costs anything visible.
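The five-player check is a one-line extrapolation. A sketch with hypothetical token counts standing in for what the Langfuse dashboard would show, a placeholder price of 100 pence per million tokens (not a real list price), and a £3.00 weekly ceiling taken from the top of the range quoted earlier; the function names are mine.

```python
from statistics import mean


def projected_cost_pence(sample_token_counts, total_calls, pence_per_million_tokens):
    """Scale the mean token count of a small sample up to the full batch."""
    if not sample_token_counts:
        raise ValueError("need at least one sampled call")
    avg_tokens = mean(sample_token_counts)
    return avg_tokens * total_calls * pence_per_million_tokens / 1_000_000


def safe_to_run(projected_pence, weekly_budget_pence):
    """Gate the full run on the projection staying inside the weekly budget."""
    return projected_pence <= weekly_budget_pence


# Hypothetical per-call token counts from a five-player trial.
sample = [1_900, 2_100, 2_000, 2_050, 1_950]
projection = projected_cost_pence(sample, total_calls=700, pence_per_million_tokens=100.0)

# The same prompt after accidentally adding a 4K-token few-shot block.
bloated = [tokens + 4_000 for tokens in sample]
bloated_projection = projected_cost_pence(bloated, total_calls=700, pence_per_million_tokens=100.0)

print(safe_to_run(projection, weekly_budget_pence=300))
print(safe_to_run(bloated_projection, weekly_budget_pence=300))
```

With these placeholder numbers the first projection passes the gate and the bloated one fails it, so the few-shot mistake shows up after five calls rather than 700.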
If you’re starting a new workload of this shape, the order I’d do things in: rate-limit shape first, then identify the dominating cell, then build the input filtering and batching around that cell, then choose models, then everything else. The free-tier work falls out of “everything else” rather than driving any of it. Free tier is the icing; the cake is the LLM cell.