FPL Pulse, weekly enrichment pipeline

AWS Lambda, Step Functions, S3, Anthropic Haiku and Sonnet, Langfuse, Terraform.

Figure: FPL Pulse architecture. Five Lambda collectors land raw data in S3, a Step Function drives enrichment through Anthropic Haiku and Sonnet, and curated output is served via CloudFront. Stages shown: sources, S3 raw, Step Functions, enrichment Lambdas, curated, CloudFront. The diagram lives at fpl-platform/docs/architecture.

Every gameweek the pipeline picks up raw data from the FPL API, Understat, and a handful of news sources, normalises it, and runs each of the roughly 700 players through an LLM enrichment pass (short-form summary, injury signal, sentiment, transfer narrative). The curated output lands as versioned parquet in S3 and feeds both the live dashboard and the Scout Agent.

The shape

Five collector Lambdas run in parallel and land raw JSON in s3://fpl-data-lake-dev/raw/season=.../gameweek=.../. A validation Lambda promotes successful runs to clean/. A Step Function fans out enrichment Lambdas that call Anthropic Haiku for bulk summaries and Sonnet for the harder reasoning passes. A curate step synthesises the enriched parquet into dashboard-ready and agent-ready JSON under curated/, which CloudFront serves through an OAC-signed S3 origin.

Eleven ADRs in docs/adr/ capture the decisions that actually changed the shape: direct API over LangChain (ADR-0003), LLM cost optimisation (ADR-0004), the parallel pipeline design (ADR-0006), Neon pgvector for agent retrieval (ADR-0008), agent transport on a Function URL (ADR-0010), SSM Parameter Store over Secrets Manager (ADR-0011). The full build log lives in the vault.

Decisions worth pulling out

Cost discipline was a feature from the start. A typical weekly run costs £1.50 to £3.00, of which roughly 95% is the Anthropic API; the AWS side (Lambda, Step Functions, S3, CloudFront, DynamoDB, SSM) adds up to a few pence, and most months the AWS bill is zero. That ratio isn't an accident, and the framing I now use for any LLM-heavy workload (dominating-cell, rate-limits-before-pricing, Secrets-Manager-is-paying-for-rotation) came out of building this one. The full version is in the cost-engineering post.

Direct Anthropic SDK rather than LangChain. Every LLM call in the pipeline is structured: a prompt template, a Pydantic schema for the output, a model selection (Haiku for bulk, Sonnet for the harder passes), a token budget, a retry policy. LangChain would add a chain or agent abstraction between me and the API; for a workload this constrained, the abstraction would buy nothing and would cost something every time LangChain's surface changed between minor versions. The orchestration that LangChain's chains do well already lives one layer up in Step Functions, which is the right place for it. The trade-off is written up in docs/adrs/adr-0003-direct-api-over-langchain.md.
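As a rough illustration of what "structured" means here, the sketch below bundles a prompt template, a model choice, a token budget and a retry count into one spec, then validates the reply against a schema. It is a minimal sketch, not the pipeline's code: CallSpec, PlayerSummary, run_call and the field names are hypothetical, the model string is a placeholder rather than a real model ID, and a plain dataclass stands in for the Pydantic schema so the example has no dependencies. The API call is passed in as a callable; in the real pipeline that would wrap the Anthropic SDK.

```python
import json
from dataclasses import dataclass
from string import Template
from typing import Callable


# Hypothetical stand-in for the Pydantic output schema; field names are assumed.
@dataclass(frozen=True)
class PlayerSummary:
    player_id: int
    summary: str
    injury_flag: bool


# Everything a single LLM call needs, declared up front.
@dataclass(frozen=True)
class CallSpec:
    prompt: Template
    model: str
    max_tokens: int
    max_attempts: int


# (model, prompt, max_tokens) -> raw text returned by the model.
Complete = Callable[[str, str, int], str]


def parse_summary(raw: str) -> PlayerSummary:
    """Parse and coerce the model's JSON reply; raises on malformed output."""
    data = json.loads(raw)
    return PlayerSummary(
        player_id=int(data["player_id"]),
        summary=str(data["summary"]),
        injury_flag=bool(data["injury_flag"]),
    )


def run_call(spec: CallSpec, complete: Complete, **fields: str) -> PlayerSummary:
    """Render the prompt, call the model, and retry if the reply fails validation."""
    prompt = spec.prompt.substitute(**fields)
    last_error = None
    for _ in range(spec.max_attempts):
        raw = complete(spec.model, prompt, spec.max_tokens)
        try:
            return parse_summary(raw)
        except (KeyError, TypeError, ValueError) as exc:
            last_error = exc
    raise RuntimeError(
        f"no valid output after {spec.max_attempts} attempts"
    ) from last_error


# Placeholder model name and budget: assumptions, not the pipeline's settings.
BULK_SUMMARY = CallSpec(
    prompt=Template("Summarise recent form for $player in two sentences. Reply as JSON."),
    model="haiku-model-id-placeholder",
    max_tokens=300,
    max_attempts=2,
)
```

With the call reduced to a spec plus one function, there is little left for a chain abstraction to do; swapping Haiku for Sonnet on a harder pass is a one-field change.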

One evening the pipeline went green from Step Functions and zero downstream states had executed properly. Every Task in the execution history was marked LambdaFunctionSucceeded, the run summary was clean, the curated outputs in S3 were empty. CloudWatch logs from the collector Lambda were full of Cloudflare 403s.

It took longer than it should have to spot the gap. A Lambda that returns {"statusCode": 500, "body": ...} is LambdaFunctionSucceeded from Step Functions' perspective: the invocation completed, a JSON body came back, the SDK didn't throw. The HTTP status inside the body is application-level information that Step Functions doesn't look at. My CheckShouldRun state had assumed otherwise, and the Cloudflare 403 inside the collector had propagated as a successful run with empty data, which then took out four downstream states with cryptic JsonPath errors. The fix turned out to be a pattern rather than a patch: after every Task state that can return a non-success body, the next state has to be a Choice checking the actual response. The PR shipped as fpl-platform#102; the pattern sits in the vault for next time.
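The Task-then-Choice pattern can be sketched as a small helper that emits the Amazon States Language for one guarded step. This is an illustration under assumptions rather than the definition shipped in the PR: the helper name guarded_task, the state names, the error name and the placeholder ARN are all hypothetical, and it assumes the Lambda returns a body with a numeric statusCode field.

```python
import json


def guarded_task(task_name: str, lambda_arn: str, next_state: str) -> dict:
    """Return ASL states for a Lambda Task followed by a Choice on its response."""
    check_name = f"Check{task_name}Response"
    fail_name = f"{task_name}Failed"
    return {
        task_name: {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": lambda_arn, "Payload.$": "$"},
            # Unwrap the invoke envelope so the next state sees the Lambda's own body.
            "OutputPath": "$.Payload",
            "Next": check_name,
        },
        check_name: {
            "Type": "Choice",
            "Choices": [
                {
                    # IsPresent guards against a body that has no statusCode at all.
                    "And": [
                        {"Variable": "$.statusCode", "IsPresent": True},
                        {"Variable": "$.statusCode", "NumericEquals": 200},
                    ],
                    "Next": next_state,
                }
            ],
            # Anything else, including a 500 carrying a Cloudflare 403, fails loudly.
            "Default": fail_name,
        },
        fail_name: {
            "Type": "Fail",
            "Error": "NonSuccessResponse",
            "Cause": f"{task_name} returned a body without statusCode 200",
        },
    }


if __name__ == "__main__":
    # Placeholder ARN and state names, for illustration only.
    states = guarded_task(
        "CollectFpl",
        "arn:aws:lambda:REGION:ACCOUNT_ID:function:collector-placeholder",
        "ValidateRaw",
    )
    print(json.dumps(states, indent=2))
```

The other way to get the same guarantee is to have the Lambda raise instead of returning a 500 body, which Step Functions does treat as a failed Task; the Choice approach keeps the handler's contract unchanged and makes the check visible in the state machine itself.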

Cloudflare blocks AWS Lambda IPs on many public endpoints. User-agent spoofing doesn't help because Cloudflare also TLS-fingerprints the request, and Python httpx doesn't look like Chrome no matter what header you send. curl_cffi with Chrome impersonation passes the TLS layer; exponential backoff (2s, 4s, 8s, 16s, 32s, five attempts) handles the rate-based blocks that impersonation alone doesn't. Without these the collectors fail roughly one in three runs. With them the collectors have been stable for months, apart from the occasional IP-level block.
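A minimal sketch of how the two mitigations combine, assuming one reading of the schedule: an initial request followed by up to five retries, waiting 2s, 4s, 8s, 16s and 32s. The function names, the choice to retry only on 403 and 429, and the 30-second timeout are assumptions for illustration. The fetcher is injected so the retry logic can be exercised without a network; chrome_fetch shows the curl_cffi call it would wrap.

```python
import time
from typing import Callable, Tuple

# Delays from the schedule described above: 2s, 4s, 8s, 16s, 32s.
BACKOFF_DELAYS = [2.0 * 2**i for i in range(5)]

# Assumption: treat Cloudflare blocks (403) and rate limits (429) as retryable.
RETRYABLE_STATUSES = {403, 429}

# A fetcher takes a URL and returns (status_code, body_text).
Fetch = Callable[[str], Tuple[int, str]]


def chrome_fetch(url: str) -> Tuple[int, str]:
    """Fetch with a Chrome TLS fingerprint via curl_cffi (third-party package)."""
    from curl_cffi import requests  # imported lazily so the retry logic stays stdlib-only

    resp = requests.get(url, impersonate="chrome", timeout=30)
    return resp.status_code, resp.text


def fetch_with_backoff(
    fetch: Fetch, url: str, sleep: Callable[[float], None] = time.sleep
) -> str:
    """Call fetch, retrying with exponential backoff while the status is retryable."""
    status, body = fetch(url)
    for delay in BACKOFF_DELAYS:
        if status not in RETRYABLE_STATUSES:
            break
        sleep(delay)
        status, body = fetch(url)
    if status != 200:
        # Raise so the failure surfaces instead of returning an empty payload.
        raise RuntimeError(f"{url} failed with HTTP {status}")
    return body
```

In a collector this would be called as fetch_with_backoff(chrome_fetch, url). Raising on exhaustion matters here for the same reason as the Choice-state fix: a swallowed 403 looks like a successful run with no data.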

Observability

Langfuse traces every LLM call at level 3, with per-request traces, quality scores, and per-node cost. Fast Lambdas need an explicit langfuse.flush() before return because the background shipper thread doesn't get to run otherwise; sub-second Lambdas freeze mid-flight and the traces are gone. Tight timeouts (LANGFUSE_TIMEOUT=2s, FLUSH_INTERVAL=1s) stop Langfuse outages from cascading into Lambda timeouts.
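One way to make the flush hard to forget is a decorator that flushes on every exit path from the handler. This is a sketch rather than the pipeline's code: flush_before_return and the Flushable protocol are hypothetical names, and the only thing assumed about the Langfuse client is that it exposes a flush() method.

```python
import functools
from typing import Any, Callable, Protocol


class Flushable(Protocol):
    """Anything with a blocking flush(), such as a Langfuse client."""

    def flush(self) -> None: ...


Handler = Callable[[Any, Any], Any]


def flush_before_return(client: Flushable) -> Callable[[Handler], Handler]:
    """Wrap a Lambda handler so buffered traces are sent before the runtime freezes."""

    def decorator(handler: Handler) -> Handler:
        @functools.wraps(handler)
        def wrapper(event: Any, context: Any) -> Any:
            try:
                return handler(event, context)
            finally:
                # Runs on success and on exceptions, so failed invocations are traced too.
                client.flush()

        return wrapper

    return decorator


# Usage sketch (assumes the langfuse package and its credentials are configured):
#
#   from langfuse import Langfuse
#   langfuse = Langfuse()
#
#   @flush_before_return(langfuse)
#   def handler(event, context):
#       ...
```

Putting the flush in a finally block means an invocation that raises still ships its trace, which is usually the run you most want to see.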

What I'd do differently

Observability was the regret. The first runs of the pipeline produced no Langfuse traces because the Lambdas finished before the SDK's background shipper thread had time to send anything; I hadn't added the explicit langfuse.flush() before return. Quick fix once spotted. The cost of catching observability late is that the earliest runs are opaque, and on a weekly cadence it takes about a month of runs before the picture sharpens. On the next pipeline I'd wire observability deliberately on day one rather than trusting an SDK to "handle it for you" on a serverless runtime that can pause and resume between invocations.

Secrets Manager was the other. I defaulted to it as the AWS-recommended pattern, only spotting the £1.20 monthly line item a few weeks in when I opened AWS Cost Explorer and noticed the bill was mostly secrets I wasn't rotating. Migration to SSM Parameter Store was an afternoon. The full reasoning is in the cost-engineering writeup; the regret isn't the £1.20 itself but that I'd been paying for rotation I never used.

How I'd productionise

The pipeline fails occasionally on collection because Cloudflare blocks the AWS IP ranges that Lambda exits from. The curl_cffi impersonation gets past the TLS-fingerprinting layer; it doesn't help with the IP-level block, which fires intermittently and takes out a run. The production fix is a residential-IP proxy service (Bright Data or similar) or a VPC with a NAT routed through residential IPs. Both are persistent monthly costs in the tens of pounds, which doesn't make sense for a personal pipeline that costs about £2 a week to run. So I live with the failure-and-retry: a failed run triggers a CloudWatch alarm, I rerun the Step Function manually, and the extra invocation cost is a rounding error. If the project ever needed to serve anyone else, the residential-proxy line item would be the first thing I'd add.