Use ECS Fargate Tasks for Scheduled Batch Workers

Status

Proposed

Context

Several services follow a scheduler + SQS + worker Lambda pattern for scheduled batch jobs (e.g. syncing external data, polling payment statuses). While this works, it has poor observability — there is no aggregate view of a run, no easy way to see if a job succeeded, and no in-process progress tracking. Each unit of work is a separate Lambda invocation with its own log stream.

This is not a cost-saving measure. The two approaches differ in cost by only a few dollars per month, which is too small to influence the decision. The motivation is purely operational visibility, control, and testability.

Decision

For scheduled batch jobs with a natural start and end, replace the scheduler + SQS + worker Lambda chain with a single ECS Fargate task triggered by an EventBridge (formerly CloudWatch Events) scheduled rule via ecs:RunTask.

The task runs, does all the work, and exits. Business logic stays in a shared package; the Fargate entrypoint and any admin API endpoints are thin wrappers around the same code. SQS is removed from the path.
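
As a minimal sketch of the thin-wrapper layout described above, the Fargate entrypoint and the admin handler might both delegate to one shared function. The names RunSync, Scope, Result, the /admin/sync route, and the limit query parameter are hypothetical and do not come from this ADR; the dataset is a placeholder. Both wrappers appear in one file only for brevity.

```go
package main

import (
	"context"
	"encoding/json"
	"log"
	"net/http"
	"os"
	"strconv"
)

// Scope narrows a run. Hypothetical type; the zero value means "everything".
type Scope struct {
	Limit int // maximum records to process; 0 means no limit
}

// Result is the aggregate outcome of one run.
type Result struct {
	Processed int `json:"processed"`
}

// RunSync stands in for the shared business logic that both wrappers call.
func RunSync(ctx context.Context, scope Scope) (Result, error) {
	total := 10 // placeholder for the real dataset size
	if scope.Limit > 0 && scope.Limit < total {
		total = scope.Limit
	}
	var res Result
	for i := 0; i < total; i++ {
		if err := ctx.Err(); err != nil {
			return res, err // stop early if the caller cancelled
		}
		res.Processed++
	}
	return res, nil
}

// adminHandler is the thin HTTP wrapper, e.g. POST /admin/sync?limit=5.
func adminHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	// A missing or invalid limit parses to 0, which means "no limit".
	limit, _ := strconv.Atoi(r.URL.Query().Get("limit"))
	res, err := RunSync(r.Context(), Scope{Limit: limit})
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(res)
}

// runTask is the thin Fargate wrapper; it returns the process exit code.
func runTask() int {
	res, err := RunSync(context.Background(), Scope{})
	if err != nil {
		log.Printf("run failed after %d records: %v", res.Processed, err)
		return 1
	}
	log.Printf("run complete: %d records processed", res.Processed)
	return 0
}

func main() {
	os.Exit(runTask())
}
```

Returning a non-zero exit code on failure is what lets an exit-code-based monitor distinguish failed runs from successful ones without extra reporting code.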

Key implementation points:

  • Use DynamoDB/RDS pagination with a bounded goroutine pool instead of relying on SQS as a buffer — never load unbounded datasets into memory at once

  • Rate limiting is handled in code via the goroutine pool size and golang.org/x/time/rate rather than through Lambda reserved concurrency settings — this makes limits explicit, colocated with the code that hits external APIs, and easier to tune per job

  • Records are read from the database immediately before processing, so there is no need to re-fetch and validate state at the start of each unit of work. With SQS, a message can sit in the queue long enough for the underlying record to change, which forces a defensive re-read in the worker to avoid acting on stale data

  • Use ECS native secret injection rather than fetching secrets in application code at startup

  • Run a Datadog Agent sidecar for metrics/traces and a FireLens (Fluent Bit) sidecar for log forwarding

  • For failure alerting: create a Datadog monitor on aws.ecs.tasks_stopped filtered by task family and non-zero exit code — this covers all failure modes including hard kills without additional code or infrastructure; requires the Datadog AWS integration with ECS metrics enabled

  • Post a Slack summary on completion — since the task accumulates counters throughout the run, it can report meaningful aggregate output (records processed, errors, duration) in a single message when it exits; post to a dedicated ops channel so every run is visible without needing to open CloudWatch

  • Add an admin API endpoint (POST /admin/<job-name>) as the integration test hook so the QA framework can trigger scoped runs over HTTP without SQS or AWS-specific tooling
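
The pagination, bounded pool, rate limiting, and counter points above can be sketched together as follows. This is an illustration under stated assumptions, not the services' actual code: fetchPage and process are hypothetical stand-ins for a paginated DynamoDB/RDS query and an external API call, and a standard-library time.Ticker stands in for golang.org/x/time/rate so the example has no external dependencies.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// Record is a hypothetical row read from the database.
type Record struct{ ID int }

const totalRecords = 25 // size of the fake dataset

// fetchPage returns up to pageSize records starting at cursor, plus the next
// cursor; next is -1 when there are no more pages.
func fetchPage(cursor, pageSize int) (page []Record, next int) {
	for id := cursor; id < cursor+pageSize && id < totalRecords; id++ {
		page = append(page, Record{ID: id})
	}
	next = cursor + pageSize
	if next >= totalRecords {
		next = -1
	}
	return page, next
}

// process stands in for a call to an external API; IDs divisible by 10 fail.
func process(ctx context.Context, r Record) error {
	if r.ID%10 == 0 {
		return fmt.Errorf("record %d rejected", r.ID)
	}
	return nil
}

// Summary holds the in-process counters reported when the run exits.
type Summary struct {
	Processed, Failed int64
	Duration          time.Duration
}

// runBatch pages through the dataset with at most `workers` goroutines and
// hands out at most one record per `interval`.
func runBatch(ctx context.Context, workers, pageSize int, interval time.Duration) Summary {
	start := time.Now()
	var processed, failed atomic.Int64
	jobs := make(chan Record) // unbuffered: the reader blocks until a worker is free
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range jobs {
				if err := process(ctx, r); err != nil {
					failed.Add(1)
					continue
				}
				processed.Add(1)
			}
		}()
	}

	tick := time.NewTicker(interval)
	defer tick.Stop()
feed:
	for cursor := 0; cursor >= 0; {
		page, next := fetchPage(cursor, pageSize) // only one page in memory at a time
		for _, r := range page {
			select {
			case <-ctx.Done():
				break feed // stop reading; workers finish what was already sent
			case <-tick.C: // rate limit: wait for the next slot
			}
			jobs <- r
		}
		cursor = next
	}
	close(jobs)
	wg.Wait()
	return Summary{Processed: processed.Load(), Failed: failed.Load(), Duration: time.Since(start)}
}

// runDemo runs the sketch with 4 workers, pages of 10, and one record per millisecond.
func runDemo() Summary {
	return runBatch(context.Background(), 4, 10, time.Millisecond)
}

func main() {
	s := runDemo()
	fmt.Printf("processed=%d failed=%d duration=%s\n",
		s.Processed, s.Failed, s.Duration.Round(time.Millisecond))
}
```

Because the counters live in one process, the final Summary is the natural payload for a single completion message. In a real job the ticker would likely be replaced by a rate.Limiter, which also supports bursts.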

This does not apply to APIs, high-throughput event consumers, or any Lambda with unbounded or latency-sensitive invocation patterns.

Consequences

Pros

  • Single log stream and ECS task history per run — easy to see if a job ran, how long it took, and whether it succeeded

  • In-process counters for progress and summary reporting at exit

  • No 15-minute Lambda timeout

  • Simpler local development — run the binary directly with env vars

  • Deployment safety — in-flight tasks are not affected by new task definition revisions

  • Admin API endpoint doubles as a production debugging tool
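
The local-development point pairs well with ECS secret injection, which delivers secrets to the container as environment variables. The sketch below assumes a hypothetical Config type and hypothetical variable names (PARTNER_API_KEY, WORKERS); it shows one way a job could read all settings from the environment so the same binary runs unchanged on a laptop and in Fargate.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
)

// Config is a hypothetical set of settings for one batch job.
type Config struct {
	APIKey  string // injected by ECS from a secret in production
	Workers int    // goroutine pool size
}

// loadConfig reads settings through getenv so tests can pass a fake lookup.
func loadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{APIKey: getenv("PARTNER_API_KEY"), Workers: 4} // 4 is an arbitrary default
	if cfg.APIKey == "" {
		return cfg, errors.New("PARTNER_API_KEY is not set")
	}
	if raw := getenv("WORKERS"); raw != "" {
		n, err := strconv.Atoi(raw)
		if err != nil || n < 1 {
			return cfg, fmt.Errorf("WORKERS must be a positive integer, got %q", raw)
		}
		cfg.Workers = n
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "config error:", err)
		os.Exit(1)
	}
	fmt.Printf("starting with %d workers\n", cfg.Workers)
}
```

Locally this could be run as `PARTNER_API_KEY=dummy WORKERS=2 go run .`, with no AWS credentials or secret-fetching code involved.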

Cons

  • More infrastructure per service — ECS task definition, ECR image, IAM task and execution roles

  • No automatic per-item retry: after a mid-run crash, the next run starts again from zero, which is acceptable for idempotent jobs but requires careful design otherwise

  • ~30–60s container startup overhead per run

  • Cost difference is not negligible but is in the range of a few dollars per month — not a deciding factor

  • Datadog integration requires sidecar containers instead of a Lambda layer