Fargate

The Float Service runs scheduled, long-running, or oversized jobs on Amazon ECS Fargate alongside its Lambda functions. Fargate is used where the 15-minute Lambda timeout is insufficient or where container-based packaging is preferable to the provided.al2023 Lambda runtime.

Current Tasks

| Task | Trigger | CPU (units) | Memory (MB) | Purpose |
|---|---|---|---|---|
| site-floats-day-before-ach | EventBridge Scheduler (site-floats-day-before-ach, weekdays 19:00 ET) | 512 | 1024 | Batch job that pages all SCHEDULING floats due on the next business day, runs a rule set per float (re-fetch guard, ACH attempt cap, debit-card validity check), and submits next-day ACH for floats with no valid primary debit card. Runs approximately two hours before the Usio 21:00 ET ACH cutoff. |
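The per-float rule set described above can be pictured as an ordered chain in which the first rule that objects stops evaluation. The following is only a sketch under assumptions: the Float type, its fields, the skip reasons, and the cap of 3 attempts are hypothetical and are not taken from the collections-jobs source.

```go
package main

import "fmt"

// Float is a hypothetical, trimmed-down view of a float record.
type Float struct {
	ID              string
	Status          string // status after re-fetching the record
	ACHAttempts     int    // ACH attempts already made
	HasValidPrimary bool   // true when a valid primary debit card is on file
}

// Rule returns a non-empty skip reason when the float must not go to ACH.
type Rule func(f Float) string

// maxACHAttempts is an assumed cap chosen for illustration only.
const maxACHAttempts = 3

// rules run in order: re-fetch guard, ACH attempt cap, debit-card validity.
var rules = []Rule{
	func(f Float) string {
		if f.Status != "SCHEDULING" {
			return "status changed since the page was read"
		}
		return ""
	},
	func(f Float) string {
		if f.ACHAttempts >= maxACHAttempts {
			return "ACH attempt cap reached"
		}
		return ""
	},
	func(f Float) string {
		if f.HasValidPrimary {
			return "valid primary debit card on file"
		}
		return ""
	},
}

// Evaluate reports whether next-day ACH should be submitted for f.
// When it returns false, the string carries the first skip reason.
func Evaluate(f Float) (string, bool) {
	for _, rule := range rules {
		if reason := rule(f); reason != "" {
			return reason, false
		}
	}
	return "", true
}

func main() {
	reason, submit := Evaluate(Float{ID: "f-1", Status: "SCHEDULING", HasValidPrimary: true})
	fmt.Println(submit, reason)
}
```

Keeping each rule as a small function makes the evaluation order explicit and lets a skip reason be logged per float.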

All tasks run on the shared site-floats ECS cluster, in awsvpc network mode, on the FARGATE launch type with platform version LATEST. Tasks run in the FloatMe private subnets with the PrivateSG security group and assign_public_ip = false.

Deploy Flow

[Diagram: Fargate deploy flow]

Container builds happen in GitHub Actions (.github/workflows/deploy.yaml), not inside the devkit — the devkit container has no docker-in-docker. The Makefile exposes two host targets:

  • make container-build-collections-jobs: docker buildx build --load, used for local smoke testing.

  • make container-push-collections-jobs: docker buildx build --push, used by CI. CONTAINER_REGISTRY and CONTAINER_TAG are overridable; CI sets CONTAINER_REGISTRY to the ECR registry from aws-actions/amazon-ecr-login (with /floats suffix) and CONTAINER_TAG to the value of TF_VAR_service_version.

CI Sequence

Target ordering (steady-state, after first-time ECR bootstrap):

  1. make build — produces Lambda artifacts.

  2. (push to main / release only) aws-actions/amazon-ecr-login and make container-push-collections-jobs — buildx build with --platform=linux/amd64 and --push to ECR.

  3. terraform init / terraform plan (on PRs) or terraform apply (on push to main / release). On apply, the task definition references the image tag that was just pushed.

The current deploy.yaml still runs Terraform before the image push (legacy ordering); the TODO in the workflow tracks the move to the target ordering above. On the legacy ordering, the task definition is reconciled with an image tag that does not yet exist in ECR — a Fargate task launched in the gap fails to pull until the image lands at the end of the same CI run. First-time bootstrap of the ECR repo still requires a one-off terraform apply before any image push is possible (the ECR repo itself is created by Terraform); the chicken-and-egg lives there, not in steady state.

Image Tag Selection

| Environment | Trigger | TF_VAR_service_version |
|---|---|---|
| test | PR or push to main | SHORT_SHA of the commit (e.g., a1b2c3d). |
| prod | Release published | Full semver tag (e.g., v1.42.0). |

Tags in ECR are immutable (image_tag_mutability = "IMMUTABLE"), so a push can never overwrite a prior build; every new build must be pushed under a new tag.
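The tag selection above can be sketched as a small function. This is an illustration under assumptions, not the workflow's actual logic: the event names follow GitHub Actions conventions, the function name imageTag is hypothetical, and the 7-character short SHA length is inferred from the a1b2c3d example.

```go
package main

import (
	"errors"
	"fmt"
)

// imageTag picks the value used for TF_VAR_service_version and the image tag.
// The event names ("pull_request", "push", "release") are assumptions about
// how the workflow distinguishes its triggers.
func imageTag(event, commitSHA, releaseTag string) (string, error) {
	switch event {
	case "release":
		if releaseTag == "" {
			return "", errors.New("release event without a tag")
		}
		return releaseTag, nil // prod: full semver tag, e.g. v1.42.0
	case "pull_request", "push":
		if len(commitSHA) < 7 {
			return "", errors.New("commit SHA shorter than 7 characters")
		}
		return commitSHA[:7], nil // test: short SHA, e.g. a1b2c3d
	default:
		return "", fmt.Errorf("unsupported event %q", event)
	}
}

func main() {
	tag, err := imageTag("push", "a1b2c3d4e5f60718293a4b5c6d7e8f9012345678", "")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(tag)
}
```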

Build-Time Inputs

The collections-jobs Dockerfile (cmd/collections-jobs/Dockerfile) is a two-stage build:

  1. Builder: ghcr.io/floatme-corp/golang:1.26-alpine. Pulls module dependencies using a --mount=type=secret,id=github_token injected by Make (sourced from GITHUB_TOKEN), used to clone private github.com/floatme-corp modules. Compiles the binary with CGO_ENABLED=0 GOOS=linux GOARCH=amd64. Build version metadata is injected via -ldflags from the GIT_VERSION, GIT_COMMIT, GIT_COMMIT_DATE, and GIT_COMMIT_TIMESTAMP build args.

  2. Runtime: gcr.io/distroless/static-debian12:nonroot. The compiled binary is copied to /usr/local/bin/collections-jobs and used as the image entrypoint.

ECR Repositories

One repository per binary. Repositories are owned by this service and never destroyed by Terraform (lifecycle.prevent_destroy = true) so historical builds remain available for rollback.

| Repository | Notes |
|---|---|
| floats/collections-jobs | Container image for the collections-jobs binary. image_tag_mutability = "IMMUTABLE"; scan_on_push = true. |

Lifecycle policy on each repository:

  • Untagged images older than 7 days are expired (priority 1).

  • The most recent 5 images of any tag status are retained; older images are expired (priority 2).
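For readers who want to see what the two rules look like in the ECR lifecycle policy document format, the sketch below renders them as JSON. It is an approximation of the policy described above, not a copy of the Terraform source; the description strings and the Go type names are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// The struct fields mirror the ECR lifecycle policy JSON keys.
type selection struct {
	TagStatus   string `json:"tagStatus"`
	CountType   string `json:"countType"`
	CountUnit   string `json:"countUnit,omitempty"`
	CountNumber int    `json:"countNumber"`
}

type action struct {
	Type string `json:"type"`
}

type rule struct {
	RulePriority int       `json:"rulePriority"`
	Description  string    `json:"description"`
	Selection    selection `json:"selection"`
	Action       action    `json:"action"`
}

type policy struct {
	Rules []rule `json:"rules"`
}

// lifecyclePolicy builds the two rules: expire untagged images after 7 days,
// then expire everything beyond the 5 most recent images.
func lifecyclePolicy() policy {
	return policy{Rules: []rule{
		{
			RulePriority: 1,
			Description:  "expire untagged images older than 7 days", // illustrative text
			Selection: selection{
				TagStatus:   "untagged",
				CountType:   "sinceImagePushed",
				CountUnit:   "days",
				CountNumber: 7,
			},
			Action: action{Type: "expire"},
		},
		{
			RulePriority: 2,
			Description:  "keep only the 5 most recent images", // illustrative text
			Selection: selection{
				TagStatus:   "any",
				CountType:   "imageCountMoreThan",
				CountNumber: 5,
			},
			Action: action{Type: "expire"},
		},
	}}
}

func main() {
	out, err := json.MarshalIndent(lifecyclePolicy(), "", "  ")
	if err != nil {
		fmt.Println("marshal:", err)
		return
	}
	fmt.Println(string(out))
}
```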

ECS Cluster

A single cluster per environment hosts all Fargate tasks for the service.

  • Name: site-floats (e.g., prod-floats).

  • No capacity providers configured — all tasks specify FARGATE at run time.

IAM Roles

| Role | Purpose |
|---|---|
| site-floats-ecs-task-execution | Task execution role attached to every task definition. Trusts ecs-tasks.amazonaws.com. Holds the AWS-managed AmazonECSTaskExecutionRolePolicy (ECR pull, CloudWatch Logs) and a service policy granting secretsmanager:GetSecretValue on the runtime Datadog secret (site/datadog), which ECS uses to inject the Datadog API key into containers at startup. This is the same runtime secret the Lambda functions reference for the Datadog Extension; the separate site/datadog/terraform secret (API + app keys for the Datadog provider) is not used here. |
| site-floats-collections-jobs-task | Task role assumed by the collections-jobs container itself. Trusts ecs-tasks.amazonaws.com. Grants: secretsmanager:GetSecretValue + BatchGetSecretValue on the RDS main and replica secrets; dynamodb:PutItem + Query on collection-history; full read/write on locks (GetItem, PutItem, UpdateItem, Query, DeleteItem); execute-api:Invoke on the Payments Service and User Service API Gateway endpoints. |
| site-floats-scheduler-invoke-ecs | Role assumed by EventBridge Scheduler to launch Fargate tasks. Trusts scheduler.amazonaws.com. Grants ecs:RunTask scoped to the site-floats-day-before-ach task definition family (any revision) and the site-floats cluster, and iam:PassRole on the execution and task roles, conditioned on iam:PassedToService = ecs-tasks.amazonaws.com. |

Task Definitions

| Family | Notes |
|---|---|
| site-floats-day-before-ach | Three containers: collections-jobs (application binary; runs the day-before-ach subcommand), log-router (FireLens / fluent-bit sidecar), and datadog-agent (DD Agent sidecar for APM and DogStatsD). Network mode awsvpc, requires_compatibilities ["FARGATE"], CPU 512, memory 1024. Image URI for the app is ${ecr_repo_url}:${var.service_version}. |

Logging

The app container uses the awsfirelens log driver, routed by the FireLens log-router sidecar (public.ecr.aws/aws-observability/aws-for-fluent-bit:stable) directly to the Datadog logs intake at http-intake.logs.datadoghq.com. No CloudWatch Logs group is created. The Datadog API key is read from the site/datadog Secrets Manager secret (api_key JSON field) by ECS at container start via secretOptions.

Datadog tagging applied to log events:

  • dd_service = collections-jobs

  • dd_source = go

  • dd_tags = env:site,application:floats

  • provider = ecs

APM and Metrics

The datadog-agent sidecar (public.ecr.aws/datadog/agent:7) provides APM trace ingestion (port 8126/tcp) and DogStatsD (port 8125/udp). Both ports listen on localhost only; in awsvpc mode all containers in the task share a network namespace, so the app container reaches the agent at localhost.

Sidecar configuration:

  • ECS_FARGATE = true — required for the agent to discover task metadata via the ECS metadata endpoint instead of expecting a node-level agent.

  • DD_APM_ENABLED = true, DD_APM_NON_LOCAL_TRAFFIC = true — accept trace submissions from other containers in the task.

  • DD_DOGSTATSD_NON_LOCAL_TRAFFIC = true — accept StatsD from other containers.

  • DD_API_KEY — sourced from the same site/datadog secret as the FireLens config.

  • Marked essential = false so a flaky agent does not fail the task. The app container declares dependsOn with condition HEALTHY, which gates its startup on the agent's health check (the agent health command).

App container env vars (set on collections-jobs):

  • DD_AGENT_HOST = localhost, DD_TRACE_AGENT_PORT = 8126

  • DD_SERVICE = collections-jobs, DD_ENV = site, DD_VERSION = ${var.service_version}

  • DD_TRACE_ENABLED = true

These let the Go tracer (gopkg.in/DataDog/dd-trace-go.v1) auto-configure traces and correlate them with the FireLens-shipped logs (same dd_service).
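To make the localhost wiring concrete, the sketch below sends one DogStatsD counter to the agent sidecar over UDP using only the standard library. It is an illustration, not how the service emits metrics (a DogStatsD client library would normally be used); the metric name collections_jobs.ach_submitted and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strings"
)

// countDatagram renders a DogStatsD counter in the wire format
// name:value|c|#tag1,tag2 (the tag section is omitted when there are no tags).
func countDatagram(name string, value int64, tags []string) string {
	d := fmt.Sprintf("%s:%d|c", name, value)
	if len(tags) > 0 {
		d += "|#" + strings.Join(tags, ",")
	}
	return d
}

// agentAddr resolves the sidecar address from DD_AGENT_HOST, falling back to
// localhost, and uses the DogStatsD port 8125 from the section above.
func agentAddr() string {
	host := os.Getenv("DD_AGENT_HOST")
	if host == "" {
		host = "localhost"
	}
	return net.JoinHostPort(host, "8125")
}

func main() {
	conn, err := net.Dial("udp", agentAddr())
	if err != nil {
		fmt.Fprintln(os.Stderr, "dial:", err)
		return
	}
	defer conn.Close()

	// Hypothetical metric name and tags, for illustration only.
	msg := countDatagram("collections_jobs.ach_submitted", 1, []string{"env:test", "application:floats"})
	if _, err := conn.Write([]byte(msg)); err != nil {
		fmt.Fprintln(os.Stderr, "write:", err)
		return
	}
	fmt.Println("sent:", msg)
}
```

Because UDP is connectionless, a slow or absent agent generally does not block the job, which fits the choice to mark the sidecar non-essential.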

The task is provisioned at 512 CPU / 1024 MB total, split across the app, FireLens sidecar, and Datadog Agent. Revisit if worker count or page size is scaled up significantly.

Scheduled Jobs

EventBridge Scheduler (not classic CloudWatch Events / EventBridge rules) is used for Fargate task scheduling. Its native timezone support handles DST automatically, so we don’t need to recompute UTC offsets twice a year.

| Schedule | Cron | Purpose |
|---|---|---|
| site-floats-day-before-ach | cron(0 19 ? * MON-FRI *) in America/New_York | Triggers the day-before-ach Fargate task at 19:00 ET on weekdays, approximately two hours before the Usio 21:00 ET ACH cutoff. flexible_time_window.mode = "OFF" (fires on the schedule exactly). State ENABLED. retry_policy.maximum_retry_attempts = 0, so ecs:RunTask failures are not retried automatically (the AWS default of 185 retries over 24h is unsafe for ACH-adjacent workflows; on a missed fire, operator follow-up is preferred over silent duplicate launches). |

Subcommands and Flags

The collections-jobs binary dispatches on the first positional argument. Currently only one subcommand is implemented:

day-before-ach

Runs the day-before-ACH batch. The Fargate task definition passes ["day-before-ach"] as the container command. All flags have defaults suitable for production; override them in the task definition environment or via CLI.
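Dispatching on the first positional argument might look like the sketch below. It is a minimal illustration, not the binary's actual entry point: the commands map, the handler signature, and the runDayBeforeACH stub are hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// commands maps each subcommand name to its handler. Only day-before-ach
// exists today; new jobs would be registered here.
var commands = map[string]func(args []string) error{
	"day-before-ach": runDayBeforeACH,
}

// runDayBeforeACH is a stub standing in for the real batch.
func runDayBeforeACH(args []string) error {
	fmt.Println("day-before-ach invoked with", len(args), "flag argument(s)")
	return nil
}

// dispatch routes on the first positional argument and hands the remaining
// arguments (the flags) to the selected handler.
func dispatch(args []string) error {
	if len(args) == 0 {
		return fmt.Errorf("missing subcommand")
	}
	handler, ok := commands[args[0]]
	if !ok {
		return fmt.Errorf("unknown subcommand %q", args[0])
	}
	return handler(args[1:])
}

func main() {
	args := os.Args[1:]
	if len(args) == 0 {
		// Demo convenience only: mirror the task definition's container command.
		args = []string{"day-before-ach"}
	}
	if err := dispatch(args); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
}
```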

| Flag | Default | Effect |
|---|---|---|
| --payment-submit-rate | 5 | Payment service ACH submissions per second (token-bucket rate limiter shared across workers). Matches the Lambda path’s max_concurrent_executions. |
| --workers | 5 | Number of goroutines consuming from the producer channel. Matches the Lambda path’s max_concurrent_executions. |
| --page-size | 50 | Number of floats per RDS page query. |
| --channel-buffer | 0 (= workers) | Producer→worker channel buffer depth. Defaults to the worker count. |
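The flags above could be parsed and applied roughly as follows. This is a sketch under assumptions: the config struct, parseFlags, and run are hypothetical names, and a shared time.Ticker stands in for the token-bucket limiter (it paces submissions but, unlike a token bucket, allows no burst).

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"sync"
	"time"
)

type config struct {
	submitRate, workers, pageSize, channelBuffer int
}

// parseFlags applies the documented defaults; a channel buffer of 0 falls
// back to the worker count.
func parseFlags(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("day-before-ach", flag.ContinueOnError)
	fs.IntVar(&c.submitRate, "payment-submit-rate", 5, "ACH submissions per second")
	fs.IntVar(&c.workers, "workers", 5, "worker goroutines")
	fs.IntVar(&c.pageSize, "page-size", 50, "floats per RDS page query")
	fs.IntVar(&c.channelBuffer, "channel-buffer", 0, "channel depth; 0 means the worker count")
	if err := fs.Parse(args); err != nil {
		return c, err
	}
	if c.submitRate <= 0 || c.workers <= 0 {
		return c, errors.New("payment-submit-rate and workers must be positive")
	}
	if c.channelBuffer == 0 {
		c.channelBuffer = c.workers
	}
	return c, nil
}

// run fans ids out to c.workers goroutines. Every worker waits on the same
// ticker before submitting, so the rate is shared across the pool.
func run(c config, ids []string, submit func(id string)) {
	ch := make(chan string, c.channelBuffer)
	tick := time.NewTicker(time.Second / time.Duration(c.submitRate))
	defer tick.Stop()

	var wg sync.WaitGroup
	for i := 0; i < c.workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range ch {
				<-tick.C // wait for the shared rate budget
				submit(id)
			}
		}()
	}
	for _, id := range ids { // producer side
		ch <- id
	}
	close(ch)
	wg.Wait()
}

func main() {
	c, err := parseFlags([]string{"--workers", "3"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	var mu sync.Mutex
	submitted := 0
	run(c, []string{"a", "b", "c", "d"}, func(string) {
		mu.Lock()
		submitted++
		mu.Unlock()
	})
	fmt.Println(c.channelBuffer, submitted) // prints "3 4"
}
```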

Terraform Files

| File | Contents |
|---|---|
| fargate_cluster.tf | ECS cluster, task execution IAM role, Datadog Secrets Manager data source, and execution-role policy attachments (AWS-managed plus the Datadog secret read). |
| fargate_ecr.tf | ECR repository and lifecycle policy for the collections-jobs binary. Image tag is var.service_version (shared with Lambda deployments). |
| fargate_collections_jobs.tf | Collections-jobs task role and the site-floats-day-before-ach task definition (application container, FireLens sidecar, and Datadog Agent sidecar). |
| fargate_collections_jobs_schedule.tf | EventBridge Scheduler schedule for site-floats-day-before-ach, plus the scheduler invocation IAM role and its scoped ecs:RunTask / iam:PassRole policy. |