Use ECS Fargate Tasks for Scheduled Batch Workers
Context
Several services follow a scheduler + SQS + worker Lambda pattern for scheduled batch jobs (e.g. syncing external data, polling payment statuses). While this works, it has poor observability — there is no aggregate view of a run, no easy way to see if a job succeeded, and no in-process progress tracking. Each unit of work is a separate Lambda invocation with its own log stream.
This is not a cost-saving measure. The cost difference between the two approaches is not negligible but is small enough that it is not a factor in the decision. The motivation is purely operational visibility, control, and testability.
Decision
For scheduled batch jobs with a natural start and end, replace the scheduler + SQS + worker Lambda chain with a single ECS Fargate task started on a schedule by a CloudWatch Events (EventBridge) rule that calls ecs:RunTask.
The task runs, does all the work, and exits. Business logic stays in a shared package; the Fargate entrypoint and any admin API endpoints are thin wrappers around the same code. SQS is removed from the path.
Key implementation points:
- Use DynamoDB/RDS pagination with a bounded goroutine pool instead of relying on SQS as a buffer; never load unbounded datasets into memory at once
- Rate limiting is handled in code via the goroutine pool size and golang.org/x/time/rate rather than through Lambda reserved concurrency settings; this makes limits explicit, colocated with the code that hits external APIs, and easier to tune per job
- Records are read from the database immediately before processing, so there is no need to re-fetch and validate state at the start of each unit of work. With SQS, a message can sit in the queue long enough for the underlying record to change, requiring a defensive re-read in the worker to avoid acting on stale data
- Use ECS native secret injection rather than fetching secrets in application code at startup
- Run a Datadog Agent sidecar for metrics/traces and a FireLens (Fluent Bit) sidecar for log forwarding
- For failure alerting: create a Datadog monitor on aws.ecs.tasks_stopped filtered by task family and non-zero exit code. This covers all failure modes, including hard kills, without additional code or infrastructure; it requires the Datadog AWS integration with ECS metrics enabled
- Post a Slack summary on completion. Since the task accumulates counters throughout the run, it can report meaningful aggregate output (records processed, errors, duration) in a single message when it exits; post to a dedicated ops channel so every run is visible without needing to open CloudWatch
- Add an admin API endpoint (POST /admin/<job-name>) as the integration test hook so the QA framework can trigger scoped runs over HTTP without SQS or AWS-specific tooling
This does not apply to APIs, high-throughput event consumers, or any Lambda with unbounded or latency-sensitive invocation patterns.
Consequences
Pros
- Single log stream and ECS task history per run; easy to see if a job ran, how long it took, and whether it succeeded
- In-process counters for progress and summary reporting at exit
- No 15-minute Lambda timeout
- Simpler local development: run the binary directly with env vars
- Deployment safety: in-flight tasks are not affected by new task definition revisions
- Admin API endpoint doubles as a production debugging tool
Cons
- More infrastructure per service: ECS task definition, ECR image, IAM task and execution roles
- No automatic per-item retry; a mid-run crash restarts from zero, which is acceptable for idempotent jobs but requires careful design otherwise
- ~30–60s container startup overhead per run
- Cost difference is not negligible but is in the range of a few dollars per month; not a deciding factor
- Datadog integration requires sidecar containers instead of a Lambda layer