Use ECS Fargate Tasks for Scheduled Batch Workers

Status

Proposed

Context

Several services follow a scheduler + SQS + worker Lambda pattern for scheduled batch jobs (e.g. syncing external data, polling payment statuses). While this works, it has poor observability — there is no aggregate view of a run, no easy way to see if a job succeeded, and no in-process progress tracking. Each unit of work is a separate Lambda invocation with its own log stream.

This is not a cost-saving measure. The two approaches do differ in cost, but by little enough that cost is not a factor in the decision. The motivation is operational visibility, control, and testability.

Decision

For scheduled batch jobs with a natural start and end, replace the scheduler + SQS + worker Lambda chain with a single ECS Fargate task, started on a schedule by an Amazon EventBridge (formerly CloudWatch Events) rule via ecs:RunTask.

The task runs, does all the work, and exits. Business logic stays in a shared package; the Fargate entrypoint and any admin API endpoints are thin wrappers around the same code. SQS is removed from the path.

Key implementation points:

Use DynamoDB/RDS pagination with a bounded goroutine pool instead of relying on SQS as a buffer — never load unbounded datasets into memory at once
Handle rate limiting in code, via the goroutine pool size and golang.org/x/time/rate, rather than through Lambda reserved concurrency settings. This makes limits explicit, colocated with the code that hits external APIs, and easier to tune per job
Read records from the database immediately before processing them, so there is no need to re-fetch and validate state at the start of each unit of work. With SQS, a message can sit in the queue long enough for the underlying record to change, which forces a defensive re-read in the worker to avoid acting on stale data
Use ECS native secret injection rather than fetching secrets in application code at startup
Run a Datadog Agent sidecar for metrics/traces and a FireLens (Fluent Bit) sidecar for log forwarding
For failure alerting, create a Datadog monitor on aws.ecs.tasks_stopped filtered by task family and non-zero exit code. This covers all failure modes, including hard kills, without additional code or infrastructure, but it requires the Datadog AWS integration with ECS metrics enabled
Post a Slack summary on completion. Because the task accumulates counters throughout the run, it can report meaningful aggregate output (records processed, errors, duration) in a single message when it exits. Post to a dedicated ops channel so every run is visible without needing to open CloudWatch
Add an admin API endpoint (POST /admin/<job-name>) as the integration test hook so the QA framework can trigger scoped runs over HTTP without SQS or AWS-specific tooling

This does not apply to APIs, high-throughput event consumers, or any Lambda with unbounded or latency-sensitive invocation patterns.

Consequences

Pros

Single log stream and ECS task history per run — easy to see if a job ran, how long it took, and whether it succeeded
In-process counters for progress and summary reporting at exit
No 15-minute Lambda timeout
Simpler local development — run the binary directly with env vars
Deployment safety — in-flight tasks are not affected by new task definition revisions
Admin API endpoint doubles as a production debugging tool

Cons

More infrastructure per service — ECS task definition, ECR image, IAM task and execution roles
No automatic per-item retry — a mid-run crash restarts from zero; acceptable for idempotent jobs but requires careful design otherwise
~30–60s container startup overhead per run
Cost difference is not negligible but is in the range of a few dollars per month — not a deciding factor
Datadog integration requires sidecar containers instead of a Lambda layer