
FPL Pulse — weekly enrichment pipeline

AWS Lambda · Step Functions · S3 · Anthropic Haiku/Sonnet · Langfuse · Terraform

FPL Pulse architecture: five Lambda collectors land raw data in S3, a Step Function drives enrichment through Anthropic Haiku and Sonnet, curated output is served via CloudFront.
Sources → S3 raw → Step Functions → enrichment Lambdas → curated → CloudFront. Diagram available on GitHub.

What it does

Every gameweek the pipeline picks up raw data from the FPL API, Understat, and a handful of news sources, normalises it, and runs each of the ~700 players through an LLM enrichment pass — short-form summary, injury signal, sentiment, transfer narrative. The curated output lands as versioned parquet in S3 and feeds both the public dashboard and the Scout Agent.

The goal wasn't another FPL tool; it was a personal-scale clone of the production shape I work on at Curve — batch pipelines, enrichment, observability, IaC — built end-to-end in one repo so the architecture is legible without an NDA conversation.

Architecture, briefly

Five collector Lambdas run in parallel and land raw JSON in s3://fpl-data-lake-dev/raw/season=.../gameweek=.../. A validation Lambda promotes successful runs to clean/, then a Step Function fans out enrichment Lambdas that call Anthropic Haiku for bulk summaries and Sonnet for the harder reasoning passes. A curate step synthesises the enriched parquet into dashboard- and agent-ready JSON under curated/, which CloudFront serves through an OAC-signed S3 origin.
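As a rough illustration of the raw/ to clean/ promotion described above, here is a minimal sketch of how the Hive-style keys could be built. Only the bucket name and the raw/season=.../gameweek=.../ prefix shape come from this write-up; the season format, the zero-padded gameweek, the per-source folder, the timestamped filename, and both function names are hypothetical.

```python
from datetime import datetime, timezone

BUCKET = "fpl-data-lake-dev"  # bucket name as given in the text


def raw_key(source: str, season: str, gameweek: int, run_at: datetime) -> str:
    """Build a Hive-style key under raw/ for one collector run.

    Everything after `gameweek=` is an assumed layout, not the repo's.
    """
    stamp = run_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"raw/season={season}/gameweek={gameweek:02d}/{source}/{stamp}.json"


def promote_key(key: str) -> str:
    """Map a raw/ key to the matching clean/ key, keeping the partitions."""
    if not key.startswith("raw/"):
        raise ValueError(f"not a raw key: {key}")
    return "clean/" + key[len("raw/"):]
```

Keeping the partition segments identical across layers means a validated run can be promoted with a plain S3 copy to the mapped key.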

Eleven ADRs in docs/adrs/ capture the decisions that actually changed the shape — model selection, Step Function retry alignment, the curate-vs-on-read tradeoff, the OAC POST landmine on the agent side. The build log lives in knowledge-vault/projects/fpl-build-log.md.

Three decisions worth pulling out

Cost discipline is a feature, not a constraint

The whole pipeline runs inside the AWS free tier. Lambda, Step Functions, S3, CloudFront, and SSM Parameter Store together cost a rounding error next to the Claude API spend, which itself caps at £5–10/month even with weekly full-population enrichment. That isn't an accident; it shaped real decisions. Secrets Manager came out (£2/month → £0 via SSM SecureString) once I accepted there was no rotation requirement on a solo project. DynamoDB on-demand stays under the always-free tier as long as I keep the Scout Agent's per-request state small. Memory on the cold-start path got a deliberate bump to 3008 MB because cutting cold starts from 27s to 8–12s was worth the higher per-ms cost: at this scale, optimising for pennies of compute is the wrong instinct.

The corollary that keeps showing up: serverless infra is a rounding error against LLM spend. Cost optimisation on an LLM-heavy workload should focus on model selection, batching, input filtering, and record filtering, not Lambda memory or container size.
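For the Secrets Manager to SSM SecureString swap mentioned above, the read path is a single `get_parameter` call with decryption enabled. The sketch below is an assumption about how that could look, not the repo's code: `make_secret_reader` and the parameter name in the comment are hypothetical, and the client is injected so the function can be exercised without AWS credentials.

```python
from functools import lru_cache


def make_secret_reader(ssm_client):
    """Return a reader that fetches each SecureString parameter once."""

    @lru_cache(maxsize=None)
    def read(name: str) -> str:
        # WithDecryption=True makes SSM return the decrypted SecureString value.
        resp = ssm_client.get_parameter(Name=name, WithDecryption=True)
        return resp["Parameter"]["Value"]

    return read


# In a Lambda module this would be wired up once, outside the handler:
#   import boto3
#   read_secret = make_secret_reader(boto3.client("ssm"))
#   api_key = read_secret("/fpl-pulse/anthropic-api-key")  # hypothetical name
```

Caching at module scope means warm invocations skip the SSM round trip. That is acceptable here precisely because nothing rotates the secret.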

Step Functions "Succeeded" ≠ business success

A Lambda that returns {"statusCode": 500, "body": {...}} is LambdaFunctionSucceeded from Step Functions' perspective — the invocation completed and a JSON body came back. The first version of CheckShouldRun assumed otherwise, and a Cloudflare 403 inside the collector silently propagated as a successful run with empty data, taking out four downstream states with cryptic JsonPath errors.

The fix is the durable pattern, not the patch: after every Task state that can return a non-success body, the next state has to be a Choice checking the actual response. The lesson sits in knowledge-vault/patterns/step-function-retry-alignment.md and has already saved an evening on the day job.
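A minimal sketch of that Task-then-Choice pattern, written as a Python dict that serialises to Amazon States Language. Apart from CheckShouldRun, every state name is hypothetical, as are the ARN placeholder and the error text. It also assumes the Task uses the plain Lambda ARN as its Resource, so the function's return value is the state output; with the `lambda:invoke` service integration the field would sit under `$.Payload` instead.

```python
import json


def build_definition() -> dict:
    """Task followed by a Choice that inspects the returned body."""
    return {
        "StartAt": "CheckShouldRun",
        "States": {
            "CheckShouldRun": {
                "Type": "Task",
                # Placeholder ARN; the real function name is not in the write-up.
                "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:check-should-run",
                "Next": "DidCheckSucceed",
            },
            "DidCheckSucceed": {
                "Type": "Choice",
                "Choices": [
                    # Only an explicit 200 moves the run forward.
                    {"Variable": "$.statusCode", "NumericEquals": 200, "Next": "RunCollectors"}
                ],
                # Any other status code is routed to an explicit failure.
                "Default": "FailRun",
            },
            "RunCollectors": {"Type": "Succeed"},  # stand-in for the real next stage
            "FailRun": {
                "Type": "Fail",
                "Error": "NonSuccessBody",
                "Cause": "Task returned a body with statusCode != 200",
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_definition(), indent=2))
```

Routing the non-200 case to a Fail state makes the execution itself show as failed, rather than letting empty data reach later states.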

curl_cffi for TLS-fingerprinted endpoints

Many of the FPL-adjacent endpoints sit behind Cloudflare, which blocks AWS Lambda IPs. User-agent spoofing didn't help: Cloudflare also fingerprints the TLS handshake, and Python's httpx doesn't look like Chrome at that layer no matter which headers you send. curl_cffi with Chrome impersonation passes the TLS check; exponential backoff (2s/4s/8s/16s/32s, five attempts) handles the rate-based blocks that impersonation alone doesn't. Without both, the collectors failed roughly one run in three; with them the pipeline has been stable for months.
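The two pieces above can be sketched roughly as follows. This is an illustration under assumptions, not the collectors' code: `Blocked`, `fetch_once`, and `fetch_with_backoff` are hypothetical names, treating 403 and 429 as the retryable statuses is a guess, and "five attempts" is read here as up to five retries after the first call. The `impersonate="chrome"` alias needs a reasonably recent curl_cffi; older releases expect a pinned target such as `chrome110`.

```python
import time

BACKOFF_SECONDS = (2, 4, 8, 16, 32)  # schedule from the text


class Blocked(Exception):
    """The endpoint answered with a block or rate-limit status."""


def fetch_once(url: str) -> str:
    # Imported lazily so the retry logic can be tested without the dependency.
    from curl_cffi import requests

    # impersonate makes the TLS handshake look like a real Chrome build.
    resp = requests.get(url, impersonate="chrome", timeout=30)
    if resp.status_code in (403, 429):  # assumed retryable statuses
        raise Blocked(f"{resp.status_code} from {url}")
    resp.raise_for_status()
    return resp.text


def fetch_with_backoff(url: str, fetch=fetch_once, sleep=time.sleep) -> str:
    """Try once, then retry after each delay in BACKOFF_SECONDS."""
    for delay in BACKOFF_SECONDS:
        try:
            return fetch(url)
        except Blocked:
            sleep(delay)
    # Final try after the last delay; a Blocked here propagates to the caller.
    return fetch(url)
```

Passing `fetch` and `sleep` as parameters keeps the backoff testable without network access or real waiting.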

Observability

Langfuse traces every LLM call at level 3: per-request traces, quality scores, per-node cost. Fast Lambdas need an explicit langfuse.flush() before return. The execution environment freezes as soon as the handler returns, so in a sub-second invocation the background shipper thread never gets to run and the traces are gone. Tight timeouts (LANGFUSE_TIMEOUT=2s, FLUSH_INTERVAL=1s) stop Langfuse outages from cascading into Lambda timeouts.
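One way to make the flush hard to forget is a small decorator around the handler. This is a sketch assuming only that the client object exposes a `flush()` method, as the Langfuse Python client does; `flush_traces` is a hypothetical helper name, not part of the SDK.

```python
import functools


def flush_traces(client):
    """Decorator: flush buffered traces before the handler returns or raises."""

    def decorate(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            try:
                return handler(event, context)
            finally:
                # Runs on success and on error, before Lambda can freeze
                # the execution environment.
                client.flush()

        return wrapper

    return decorate


# Intended wiring in a Lambda module (needs the langfuse package and its
# credentials in the environment):
#   from langfuse import Langfuse
#   langfuse = Langfuse()
#
#   @flush_traces(langfuse)
#   def handler(event, context):
#       ...
```

Putting the flush in `finally` means failed invocations still ship their traces, which are usually the ones worth reading.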

What I'd do differently

The four-stage Hive partitioning was over-engineered for the actual data volumes; a single curated layer with deeper schema validation would have paid for itself faster. And the first Step Function definition used Retry blocks that overlapped with the Lambda's own retry logic, multiplying the effective retry count — aligning retries explicitly at the Step Function layer would have saved an afternoon.
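The retry multiplication is easy to underestimate, so here is the arithmetic as a small sketch. The numbers in the usage note are hypothetical, not the pipeline's actual settings. In Amazon States Language, a Retry block's MaxAttempts counts retries, so a Task can run up to 1 + MaxAttempts times.

```python
def worst_case_calls(sfn_max_attempts: int, inner_retries: int) -> int:
    """Worst-case upstream calls when both layers retry independently.

    sfn_max_attempts: MaxAttempts on the Step Functions Retry block
                      (retries, not counting the first run).
    inner_retries:    retries inside the Lambda after its first call.
    """
    task_runs = 1 + sfn_max_attempts
    calls_per_run = 1 + inner_retries
    return task_runs * calls_per_run
```

With a hypothetical MaxAttempts of 3 on the Task and five in-Lambda retries, `worst_case_calls(3, 5)` gives 24 upstream calls for what looks like a single step. Setting the state machine's MaxAttempts to 0 for that Task brings it back to 6.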

Links