FPL Pulse — weekly enrichment pipeline
AWS Lambda · Step Functions · S3 · Anthropic Haiku/Sonnet · Langfuse · Terraform
What it does
Every gameweek the pipeline picks up raw data from the FPL API, Understat, and a handful of news sources, normalises it, and runs each of the ~700 players through an LLM enrichment pass — short-form summary, injury signal, sentiment, transfer narrative. The curated output lands as versioned parquet in S3 and feeds both the public dashboard and the Scout Agent.
The goal wasn't another FPL tool; it was a personal-scale clone of the production shape I work on at Curve — batch pipelines, enrichment, observability, IaC — built end-to-end in one repo so the architecture is legible without an NDA conversation.
Architecture, briefly
Five collector Lambdas run in parallel and land raw JSON in
s3://fpl-data-lake-dev/raw/season=.../gameweek=.../. A
validation Lambda promotes successful runs to clean/, then a
Step Function fans out enrichment Lambdas that call Anthropic Haiku for
bulk summaries and Sonnet for the harder reasoning passes. A curate step
synthesises the enriched parquet into dashboard- and agent-ready JSON
under curated/, which CloudFront serves through an
OAC-signed S3 origin.
Eleven ADRs in docs/adrs/ capture the decisions that
actually changed the shape — model selection, Step Function retry
alignment, the curate-vs-on-read tradeoff, the OAC POST landmine on the
agent side. The build log lives in
knowledge-vault/projects/fpl-build-log.md.
Three decisions worth pulling out
Cost discipline is a feature, not a constraint
The whole pipeline runs inside the AWS free tier. Lambda, Step Functions, S3, CloudFront, SSM Parameter Store — combined cost is rounding error against the Claude API spend, which itself caps at £5–10/month even with weekly full-population enrichment. That isn't an accident; it shaped real decisions. Secrets Manager came out (£2/month → £0 via SSM SecureString once I accepted there was no rotation requirement on a solo project). DynamoDB on-demand stays under the always-free tier as long as I keep the Scout Agent's per-request state small. Memory on the cold-start path got a deliberate bump to 3008 MB because the 27s → 8–12s cold-start improvement was worth the per-ms cost — at this scale, optimising for pennies of compute is the wrong instinct.
The corollary that keeps showing up: serverless infra is rounding error against LLM spend. Cost optimisation on an LLM-heavy workload should focus on model selection, batching, input filtering, and record filtering, not Lambda memory or container size.
Step Functions "Succeeded" ≠ business success
A Lambda that returns {"statusCode": 500, "body": {...}}
is LambdaFunctionSucceeded from Step Functions' perspective —
the invocation completed and a JSON body came back. The first version of
CheckShouldRun assumed otherwise, and a Cloudflare 403
inside the collector silently propagated as a successful run with empty
data, taking out four downstream states with cryptic JSONPath errors.
The fix is the durable pattern, not the patch: after every Task state
that can return a non-success body, the next state has to be a Choice
checking the actual response. The lesson sits in
knowledge-vault/patterns/step-function-retry-alignment.md and
has already saved an evening on the day job.
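A minimal sketch of that Task-then-Choice pairing, written as a Python helper that emits the two Amazon States Language states. The helper name guarded_task, the state names and the ARN used with it are hypothetical, not taken from the repo. It assumes the Task uses the Lambda ARN directly as its Resource, so the handler's return value becomes the state output and the status sits at $.statusCode; with the lambda:invoke service integration it would sit under $.Payload instead.

```python
import json


def guarded_task(task_name, function_arn, next_state, failure_state):
    """Build a Lambda Task state plus the Choice state that checks its response body."""
    check_name = f"{task_name}Response"
    return {
        task_name: {
            "Type": "Task",
            "Resource": function_arn,
            "Next": check_name,
        },
        check_name: {
            "Type": "Choice",
            "Choices": [
                {
                    # Only continue when statusCode exists AND equals 200.
                    "And": [
                        {"Variable": "$.statusCode", "IsPresent": True},
                        {"Variable": "$.statusCode", "NumericEquals": 200},
                    ],
                    "Next": next_state,
                }
            ],
            # Anything else (500, missing field) is routed to the failure state.
            "Default": failure_state,
        },
    }


def render(states):
    """Serialise the states as JSON, ready to merge into a state machine definition."""
    return json.dumps(states, indent=2)
```

The failure_state would typically be a Fail state, so a 500 body ends the execution as Failed rather than flowing into downstream states with empty data.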
curl_cffi for TLS-fingerprinted endpoints
Many of the FPL-adjacent endpoints sit behind Cloudflare, which blocks AWS
Lambda IPs. User-agent spoofing didn't help — Cloudflare also
TLS-fingerprints the request, and Python httpx doesn't look
like Chrome no matter what header you send. curl_cffi with
Chrome impersonation passes the TLS layer; exponential backoff
(2s/4s/8s/16s/32s, five attempts) handles the rate-based blocks that the
impersonation alone doesn't. Without these the collectors fail roughly
one in three runs; with them they have been stable for months.
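A rough sketch of how the two pieces can fit together. The function names, the set of status codes treated as a block, and the 10-second timeout are assumptions for illustration. The schedule is read here as one initial request followed by up to five retries, which is one interpretation of the numbers above. The "chrome" impersonation alias needs a reasonably recent curl_cffi; older releases expect a pinned target such as chrome110.

```python
import time

# Delay schedule from the text: 2s, 4s, 8s, 16s, 32s.
BACKOFF_SECONDS = (2, 4, 8, 16, 32)
# Assumption: these status codes are the ones treated as a transient block.
RETRYABLE_STATUS = frozenset({403, 429, 503})


def chrome_get(url):
    """GET with a Chrome TLS fingerprint via the third-party curl_cffi package."""
    # Imported lazily so the retry logic below has no hard dependency on it.
    from curl_cffi import requests

    return requests.get(url, impersonate="chrome", timeout=10)


def fetch_with_backoff(url, get=chrome_get, sleep=time.sleep):
    """Call get(url); while the status looks like a block, wait and try again."""
    response = get(url)
    for delay in BACKOFF_SECONDS:
        if response.status_code not in RETRYABLE_STATUS:
            return response
        sleep(delay)
        response = get(url)
    if response.status_code in RETRYABLE_STATUS:
        raise RuntimeError(
            f"{url} still blocked after {len(BACKOFF_SECONDS)} retries "
            f"(last status {response.status_code})"
        )
    return response
```

Raising once the schedule is exhausted matters for the Step Functions point above: an exception fails the invocation outright, whereas returning an empty body would look like success.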
Observability
Langfuse traces every LLM call at level 3 — per-request traces, quality
scores, per-node cost. Fast Lambdas need an explicit
langfuse.flush() before return because the background
shipper thread doesn't get to run otherwise; sub-second Lambdas freeze
mid-flight and the traces are gone. Tight timeouts
(LANGFUSE_TIMEOUT=2s, LANGFUSE_FLUSH_INTERVAL=1s) stop
Langfuse outages from cascading into Lambda timeouts.
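One way to make the flush hard to forget is a small decorator around the handler. This is a sketch, not the repo's code: flush_before_return is a hypothetical name, and the only thing it assumes about the client is a flush() method, which the Langfuse Python client provides.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def flush_before_return(client):
    """Decorator factory: flush the tracing client when the handler finishes."""

    def decorate(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            try:
                return handler(event, context)
            finally:
                # Runs on success and on exceptions, before Lambda freezes the sandbox.
                try:
                    client.flush()
                except Exception:
                    # A tracing failure should not turn into a handler failure.
                    logger.warning("trace flush failed", exc_info=True)

        return wrapper

    return decorate
```

In a Lambda module this would wrap the handler with a module-level client, for example @flush_before_return(Langfuse()) with Langfuse imported from the langfuse package; the client reads its keys and timeout settings from environment variables.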
What I'd do differently
The four-stage Hive partitioning was over-engineered for the actual data
volumes; a single curated layer with deeper schema validation would have
paid for itself faster. And the first Step Function definition used
Retry blocks that overlapped with the Lambda's own retry
logic, multiplying the effective retry count — aligning retries
explicitly at the Step Function layer would have saved an afternoon.
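The multiplication is easy to underestimate, so here is the arithmetic as a tiny helper. The function name and the numbers in the usage note are hypothetical, not figures from the build log.

```python
def worst_case_calls(state_max_attempts, in_lambda_attempts):
    """Upstream calls made in the worst case, when every attempt fails.

    state_max_attempts: MaxAttempts on the Task state's Retry block,
        i.e. retries after the first invocation.
    in_lambda_attempts: requests the handler itself makes per invocation
        before it gives up.
    """
    invocations = 1 + state_max_attempts
    return invocations * in_lambda_attempts
```

With a Retry block at MaxAttempts 3 (the Step Functions default) and a handler that makes up to 6 requests per invocation, worst_case_calls(3, 6) gives 24 upstream calls for a single logical step; setting the state-level retries to 0 brings that back to 6.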