Deployment Guide

This page describes the deployment process for the Underwriting Service, including infrastructure provisioning, secrets management, and operational procedures.

Overview

The Underwriting Service is deployed using:

  • Infrastructure as Code: Terraform (AWS provider)

  • CI/CD: GitHub Actions

  • Build System: Makefile + Docker

  • Package Format: AWS Lambda deployment packages (ZIP)

  • Environments: dev, staging, prod

Deployment Flow:

  1. Code merged to main branch

  2. GitHub Actions triggered

  3. Build Lambda binaries (Go)

  4. Package deployment archives

  5. Upload to S3

  6. Terraform apply (infrastructure + Lambda updates)

  7. Post-deployment validation


Prerequisites

Required Tools

  • Terraform: v1.0+ (infrastructure provisioning)

  • Docker: For build environment consistency

  • AWS CLI: v2+ (credentials and manual operations)

  • Go: 1.20+ (local development)

  • Make: GNU Make (build automation)

  • Git: Version control

AWS Credentials

Authentication Method: AWS IAM Role assumption via GitHub Actions

Required Permissions:

  • Lambda: Create, update, delete functions

  • IAM: Create/manage roles and policies

  • S3: Upload Lambda deployment packages

  • API Gateway: Manage HTTP APIs

  • DynamoDB: Table operations

  • SQS: Queue management

  • EventBridge: Rule configuration

  • Secrets Manager: Read secrets

  • VPC: Network configuration (subnets, security groups)

  • CloudWatch: Logs and monitoring

GitHub Actions Role:

arn:aws:iam::267052520423:role/{environment}-github-actions-services-role

Role Assumption Duration: 15 minutes (sufficient for deployment)

Environment Variables

Set these before deploying:

| Variable | Description | Required |
|----------|-------------|----------|
| TF_VAR_environment | Target environment (dev/staging/prod) | Yes |
| TF_VAR_service_version | Git version tag for deployment | Yes |
| AWS_DEFAULT_REGION | Primary AWS region (us-west-2) | Yes |
| GITHUB_TOKEN | GitHub token for private module access | Yes (CI only) |
| GOPRIVATE | Private Go modules (github.com/floatme-corp) | Yes (build) |
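The variables above can be verified before any build or Terraform step runs, so a missing value fails fast instead of partway through a deploy. The following is a minimal sketch in POSIX shell; `check_required_env` is a hypothetical helper name, not something the repository is known to provide.

```shell
# Fail early if a required variable is unset or empty.
# check_required_env is a hypothetical helper, not part of the repository.
check_required_env() {
  missing=""
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing required variables:$missing" >&2
    return 1
  fi
}
```

A deploy script could call `check_required_env TF_VAR_environment TF_VAR_service_version AWS_DEFAULT_REGION || exit 1` as its first step, adding GITHUB_TOKEN in CI and GOPRIVATE for builds.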


Terraform Infrastructure

Directory Structure

deploy/
├── main.tf           # Provider configuration, locals
├── variables.tf      # Input variables (environment, regions, names)
├── terraform.tf      # Terraform and provider version constraints
├── lambda.tf         # Lambda function definitions (5 functions)
├── sqs.tf            # SQS queues and DLQs
├── kinesis.tf        # Kinesis event tap configuration
├── vpc.tf            # VPC, subnets, security groups
├── secrets.tf        # AWS Secrets Manager data sources
├── datadog.tf        # Datadog monitoring integration
└── Makefile          # Terraform automation commands

Terraform Modules

Main Configuration ([main.tf](deploy/main.tf))

Providers:

  • aws (default): Primary region operations

  • aws.dynamodb: DynamoDB region (may differ from primary)

  • aws.eventbridge: EventBridge region

Key Locals:

locals {
  account_id                         = data.aws_caller_identity.current.account_id
  s3_bucket                          = "${var.company}-${var.environment}-media"
  s3_prefix                          = "lambda/${var.application}"
  api_gateway_name                   = "${var.environment}-${var.application}"
  underwriting_table_name            = "${var.environment}-${var.application}"
  floatme_eventbridge_event_bus_name = "default"

  # Networking
  vpc_id             = data.aws_vpc.default.id
  private_subnet_ids = data.aws_subnets.private.ids
  security_group_ids = [aws_security_group.lambda.id]

  # Lambda configuration
  lambda_memory_size = 512
  lambda_logs_retention_days = 7
}

Lambda Module ([lambda.tf](deploy/lambda.tf))

Defines all 5 Lambda functions using the fmtf-module-lambda module:

Common Configuration:

  • Source code from S3 bucket

  • VPC integration (API and Rule Runner only)

  • CloudWatch Logs with retention

  • Datadog monitoring layer

  • Common environment variables

  • IAM execution roles with least privilege

Function-Specific Parameters:

| Lambda | Timeout | Memory | Special Configuration |
|--------|---------|--------|-----------------------|
| API | 300s | 512 MB | API Gateway trigger, VPC integration |
| Rule Runner | 108s | 512 MB | SQS trigger, VPC integration, batch processing |
| Result Runner | 108s | 512 MB | SQS trigger, aggregation logic |
| Float Created Handler | 108s | 512 MB | SQS trigger (EventBridge events) |
| Profile Handler | 108s | 512 MB | SQS trigger (user signup events) |

Deployment Package Location:

s3://floatme-{environment}-media/lambda/underwriting/{function-name}.zip
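As an illustration of the location pattern above, the sketch below expands it for a given environment and function. Both helper names are hypothetical, and the function list is taken from the build output directories described in this guide.

```shell
# Hypothetical helpers; the bucket and prefix follow the pattern shown above.
package_uri() {
  # $1 = environment (dev, staging, prod), $2 = function name
  printf 's3://floatme-%s-media/lambda/underwriting/%s.zip\n' "$1" "$2"
}

# Print the package URI of each of the five functions for one environment.
list_package_uris() {
  for fn in api rule-runner result-runner float-created-handler profile-handler; do
    package_uri "$1" "$fn"
  done
}
```

For example, `package_uri dev api` prints `s3://floatme-dev-media/lambda/underwriting/api.zip`.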

SQS Configuration ([sqs.tf](deploy/sqs.tf))

Defines 4 SQS queues with corresponding DLQs:

Queue Pattern:

resource "aws_sqs_queue" "queue_name" {
  name                       = "${var.environment}-underwriting-{queue-name}"
  visibility_timeout_seconds = 120
  message_retention_seconds  = 345600 # 4 days

  # Redrive policy: with maxReceiveCount = 1, a message moves to the DLQ
  # after its first failed receive, so there are no automatic retries.
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.queue_name_dlq.arn
    maxReceiveCount     = 1
  })
}

resource "aws_sqs_queue" "queue_name_dlq" {
  name                      = "${var.environment}-underwriting-{queue-name}-dlq"
  message_retention_seconds = 1209600 # 14 days
}

Queues:

  1. rule-runner-sqs-tap + DLQ

  2. result-runner-sqs-tap + DLQ

  3. float-created-sqs-tap + DLQ

  4. profile-handler-sqs-tap + DLQ

Key Parameters:

  • Visibility Timeout: 120 seconds (Lambda timeout + buffer)

  • Max Receive Count: 1 (fail fast, manual DLQ investigation)

  • Message Retention: 4 days (active), 14 days (DLQ)
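The retention and visibility values above are plain arithmetic, and checking them in a script guards against drift when a Lambda timeout changes. This is a minimal sketch with hypothetical helper names; the 12-second buffer is inferred from 120 minus the 108-second timeout of the SQS-triggered functions.

```shell
# Convert whole days to seconds (86400 seconds per day).
days_to_seconds() {
  echo $(( $1 * 86400 ))
}

# Succeed when the visibility timeout covers the Lambda timeout plus a buffer.
# $1 = Lambda timeout, $2 = buffer, $3 = visibility timeout (all in seconds).
visibility_timeout_ok() {
  [ "$3" -ge $(( $1 + $2 )) ]
}
```

`days_to_seconds 4` gives 345600 and `days_to_seconds 14` gives 1209600, matching the queue definitions. AWS's guidance for SQS event sources is a visibility timeout of at least six times the function timeout, which would be 648 seconds for a 108-second function, so the 120-second value may be worth revisiting if messages are redelivered while still being processed.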

VPC Configuration ([vpc.tf](deploy/vpc.tf))

VPC Strategy:

  • Use existing VPC (data source lookup)

  • Deploy API and Rule Runner in private subnets (access to RDS, internal services)

  • Security group allows HTTPS egress

Security Group Rules:

resource "aws_security_group" "lambda" {
  name_prefix = "${var.environment}-underwriting-lambda-"
  vpc_id      = local.vpc_id

  egress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
    description = "HTTPS outbound for AWS services and APIs"
  }
}

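Lambda ENI Considerations:

  • Since AWS's 2019 VPC networking update, Lambda creates shared (Hyperplane) ENIs when a VPC function is created or its VPC configuration changes, not on each invocation

  • The ~10 second ENI cold start penalty applied to the older per-invocation model; the first deploy or a VPC configuration change can still take longer while the ENI is set up

  • Pre-warming via provisioned concurrency reduces the remaining cold start latency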

Secrets Management ([secrets.tf](deploy/secrets.tf))

Secrets Used:

  1. Datadog Terraform Secret ({environment}/datadog/terraform):

    • API key for Datadog monitoring integration

    • Used by Terraform for Datadog resource provisioning

  2. Segment Secret ({environment}/segment):

    • Write key for Segment event tracking

    • Used by Rule/Result Runners for analytics

  3. GrowthBook Secret ({environment}/growthbook):

    • API key for feature flags and A/B testing

    • Used by all Lambdas for dynamic configuration

Secret Access Pattern:

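// Lambdas retrieve secrets at runtime (AWS SDK for Go v1)
sess := session.Must(session.NewSession())
secretsManager := secretsmanager.New(sess)
secret, err := secretsManager.GetSecretValue(&secretsmanager.GetSecretValueInput{
    SecretId: aws.String(os.Getenv("SM_GROWTHBOOK_NAME")),
})
if err != nil {
    return fmt.Errorf("fetching GrowthBook secret: %w", err)
}
payload := aws.StringValue(secret.SecretString) // JSON secret payload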

Secret Rotation:

  • Manual rotation via AWS Console or CLI

  • Lambda execution environments must be recycled after rotation (for example, by redeploying the functions) to invalidate cached credentials

  • No automatic rotation configured


Build Process

Makefile Targets

The root [Makefile](Makefile) provides automation for common tasks:

Development:

make devkit              # Build Docker devkit image
make generate            # Generate OpenAPI server code
make test                # Run unit tests
make coverage            # Generate coverage report
make lint                # Run linters (golangci-lint)

Build:

make build               # Build all Lambda binaries
make dist                # Create deployment ZIPs
make clean               # Clean build artifacts

Deployment:

make deploy-dev          # Deploy to dev environment
make deploy-staging      # Deploy to staging
make deploy-prod         # Deploy to production

Docker Devkit

Purpose: Consistent build environment across local and CI

Image: ghcr.io/floatme-corp/underwriting-api:latest-devkit

Contents:

  • Go 1.20 compiler

  • Terraform CLI

  • AWS CLI v2

  • golangci-lint

  • redocly CLI (OpenAPI linting)

Usage:

# All make commands run inside devkit container
make devkit              # Build/pull devkit image
make test                # Runs tests in container

Lambda Binary Compilation

Build Script: scripts/build.sh (invoked by make build)

Compilation Flags:

GOOS=linux GOARCH=amd64 go build \
  -ldflags="-s -w \
    -X main.Version=${GIT_VERSION} \
    -X main.Commit=${GIT_COMMIT} \
    -X main.BuildDate=${GIT_COMMIT_DATE}" \
  -o dist/$FUNCTION/bootstrap \
  cmd/$FUNCTION/main.go

Binary Optimization:

  • -s -w: Strip the symbol table and DWARF debug information (reduce size)

  • Static linking: No external dependencies

  • Bootstrap naming: AWS Lambda custom runtime convention
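The -X flags in the build command only take effect on package-level string variables, here Version, Commit, and BuildDate in package main. How scripts/build.sh fills the three GIT_* variables is not shown in this guide, so the sketch below is an assumption: `build_ldflags` is a hypothetical helper that assembles the flag string from values passed in.

```shell
# Assemble the linker flags used by the build command.
# $1 = version, $2 = commit, $3 = build date.
# build_ldflags is a hypothetical helper, not taken from scripts/build.sh.
build_ldflags() {
  printf '%s -X main.Version=%s -X main.Commit=%s -X main.BuildDate=%s\n' \
    '-s -w' "$1" "$2" "$3"
}
```

One plausible source for the values is `git describe --tags --always` for the version, `git rev-parse --short HEAD` for the commit, and `git log -1 --format=%cI` for the commit date. The result can then be passed as `go build -ldflags="$(build_ldflags "$GIT_VERSION" "$GIT_COMMIT" "$GIT_COMMIT_DATE")"`.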

Output:

dist/
├── api/
│   ├── bootstrap
│   └── api.zip
├── rule-runner/
│   ├── bootstrap
│   └── rule-runner.zip
├── result-runner/
│   ├── bootstrap
│   └── result-runner.zip
├── float-created-handler/
│   ├── bootstrap
│   └── float-created-handler.zip
└── profile-handler/
    ├── bootstrap
    └── profile-handler.zip

Deployment Procedures

Environment Strategy

Three Environments:

  1. dev: Development testing (frequent deploys)

  2. staging: Pre-production validation (release candidates)

  3. prod: Production (stable releases only)

Promotion Path:

dev → staging → prod

Pre-Deployment Checklist

Before deploying to any environment:

  • Code Review: All PRs reviewed and approved

  • Tests Passing: Unit tests, integration tests green

  • Linting: No linting errors (make lint)

  • OpenAPI Spec: Valid and up-to-date (make lint-spec)

  • Secrets Current: Verify secrets exist and are valid

  • Version Tagged: Git tag exists (prod only)

  • Changelog Updated: CHANGELOG.md reflects changes

  • Rollback Plan: Identify previous stable version

Additional for Staging/Prod:

  • Dev Environment Validated: Changes tested in dev

  • Database Migrations: Run and tested (if applicable)

  • Dependency Updates: Third-party services notified

  • Monitoring: Datadog dashboards ready

  • On-Call Notified: Team aware of deployment window

GitHub Actions Deployment

Workflow: [.github/workflows/deploy.yml](.github/workflows/deploy.yml)

Triggers:

  • dev: Push to main branch

  • staging: Push to staging branch

  • prod: Git tag push (v* pattern)
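The trigger rules above amount to a mapping from Git ref to environment. The sketch below shows one way a workflow step could derive the environment name; `target_environment` is a hypothetical helper, and it assumes the ref is given in the full form that GitHub Actions exposes in GITHUB_REF.

```shell
# Map a full Git ref to its target environment, following the trigger rules.
# Returns non-zero for refs that should not deploy anywhere.
target_environment() {
  case "$1" in
    refs/heads/main)    echo dev ;;
    refs/heads/staging) echo staging ;;
    refs/tags/v*)       echo prod ;;
    *)                  return 1 ;;
  esac
}
```

A workflow step might run `ENVIRONMENT=$(target_environment "$GITHUB_REF") || exit 1` before assuming the AWS role. Whether deploy.yml does this in shell or with separate workflow jobs is not stated here.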

Workflow Steps:

  1. Checkout Code

       - uses: actions/checkout@v3

  2. Assume AWS Role

       - uses: aws-actions/configure-aws-credentials@v2
         with:
           role-to-assume: arn:aws:iam::267052520423:role/${{ env.ENVIRONMENT }}-github-actions-services-role
           aws-region: us-west-2

  3. Build Devkit Image

       make devkit

  4. Run Tests

       make test

  5. Build Lambda Binaries

       make build

  6. Create Deployment Packages

       make dist

  7. Upload to S3

       aws s3 cp dist/ s3://floatme-${ENVIRONMENT}-media/lambda/underwriting/ --recursive

  8. Terraform Plan

       cd deploy
       terraform init
       terraform plan -out=tfplan

  9. Terraform Apply

       terraform apply tfplan

  10. Post-Deployment Tests

       make integration-test

Deployment Duration: ~5-10 minutes (full stack)

Manual Deployment

For emergency deployments or local testing:

Step 1: Build

export AWS_PROFILE=underwriting-dev
export TF_VAR_environment=dev
export TF_VAR_service_version=$(git describe --tags)

make build
make dist

Step 2: Upload to S3

aws s3 sync dist/ s3://floatme-dev-media/lambda/underwriting/ \
  --exclude "*" --include "*.zip"

Step 3: Terraform Apply

cd deploy
terraform init
terraform plan -out=tfplan
terraform apply tfplan

Step 4: Validate

# Test API endpoint
curl https://dev-underwriting.floatme.com/health

# Check Lambda logs
aws logs tail /aws/lambda/dev-underwriting-api --follow

Canary Deployments (Prod Only)

Strategy: Gradual traffic shift with rollback capability

Configuration: (Not currently implemented - future enhancement)

  1. Deploy new version with alias canary

  2. Configure API Gateway weighted routing: 95% live, 5% canary

  3. Monitor error rates, latency for 30 minutes

  4. If healthy, shift 50%, then 100%

  5. If unhealthy, revert to 100% live

Recommended Tool: AWS Lambda Alias Traffic Shifting or AWS CodeDeploy
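Because the canary flow is not implemented yet, the following is only a sketch of the weight progression described in the steps above (5%, then 50%, then 100%, with a revert to 0% on failure). `next_canary_weight` is a hypothetical helper; it decides the next weight and does not call any AWS API.

```shell
# Decide the next canary traffic weight (percent).
# $1 = current canary weight, $2 = "yes" if the canary is healthy.
next_canary_weight() {
  if [ "$2" != "yes" ]; then
    echo 0          # unhealthy: send all traffic back to the live version
    return 0
  fi
  case "$1" in
    5)   echo 50 ;;
    50)  echo 100 ;;
    100) echo 100 ;;
    *)   echo 5 ;;   # any other starting point begins at 5%
  esac
}
```

The resulting weight would then be applied through whichever mechanism is eventually chosen, such as a Lambda alias routing configuration or a CodeDeploy deployment preference.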


Post-Deployment Validation

Automated Checks

GitHub Actions Post-Deploy:

# API health check
curl -sf https://${ENVIRONMENT}-underwriting.floatme.com/health || exit 1

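# Lambda invocation test (AWS CLI v2 needs --cli-binary-format for a raw JSON payload)
aws lambda invoke --function-name ${ENVIRONMENT}-underwriting-api \
  --cli-binary-format raw-in-base64-out \
  --payload '{"path":"/health"}' response.json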

# DynamoDB connectivity
aws dynamodb describe-table --table-name ${ENVIRONMENT}-underwriting
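A single health request right after terraform apply can fail while the new code is still rolling out, so a bounded retry is a common addition. The sketch below assumes that behavior is wanted; `probe` and `wait_for_healthy` are hypothetical names, and the attempt count and delay are illustrative.

```shell
# probe succeeds when the URL answers with an HTTP status below 400.
# It is a separate function so it can be replaced, for example in tests.
probe() {
  curl -sf -o /dev/null "$1"
}

# wait_for_healthy URL ATTEMPTS DELAY_SECONDS
wait_for_healthy() {
  attempt=1
  while [ "$attempt" -le "$2" ]; do
    if probe "$1"; then
      return 0
    fi
    # Pause between attempts, but not after the last one.
    if [ "$attempt" -lt "$2" ]; then
      sleep "$3"
    fi
    attempt=$(( attempt + 1 ))
  done
  echo "health check failed after $2 attempts: $1" >&2
  return 1
}
```

For instance, `wait_for_healthy "https://${ENVIRONMENT}-underwriting.floatme.com/health" 5 10 || exit 1` would try for roughly 40 seconds before failing the job.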

Manual Validation

1. API Endpoint Health

curl https://{env}-underwriting.floatme.com/health
# Expected: {"status": "ok", "version": "v1.2.3"}

2. Lambda Function Status

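aws lambda list-functions --output text \
  --query 'Functions[?starts_with(FunctionName, `{env}-underwriting`)].FunctionName' | tr '\t' '\n' \
  | xargs -I FN aws lambda get-function-configuration --function-name FN --query '[FunctionName, State]' --output text
# Expected: every function reports "Active" (list-functions alone does not return State)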

3. CloudWatch Logs

aws logs tail /aws/lambda/{env}-underwriting-api --since 5m
# Expected: No ERROR level logs

4. SQS Queue Metrics

aws sqs get-queue-attributes \
  --queue-url https://sqs.us-west-2.amazonaws.com/267052520423/{env}-underwriting-rule-runner \
  --attribute-names ApproximateNumberOfMessages,ApproximateNumberOfMessagesNotVisible
# Expected: Low/zero message counts

5. Test Eligibility Check

# Replace with real user ID from environment
curl -X GET https://{env}-underwriting.floatme.com/{user_id}/float_check \
  -H "Authorization: Bearer $TOKEN"
# Expected: HTTP 200, valid evaluation response

6. Datadog Dashboard

  • Navigate to Underwriting Service dashboard

  • Verify deployment annotation appears

  • Check for error rate spikes

  • Monitor Lambda duration metrics

Success Criteria

Deployment is successful when:

  • All Lambda functions in "Active" state

  • API /health endpoint returns 200

  • CloudWatch Logs show successful invocations

  • SQS queues processing normally (no backlog)

  • Error rate < 1% (Datadog)

  • P95 latency < 500ms (API Lambda)

  • No alerts triggered in first 15 minutes


Rollback Procedures

When to Rollback

Rollback if any of the following occur within 30 minutes of deployment:

  • Error rate > 5%

  • API availability < 99%

  • Critical bugs reported

  • Downstream service failures

  • Database corruption or data loss
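The numeric thresholds from the success criteria and the rollback triggers can be combined into a single check. This is a sketch under assumptions: `deployment_verdict` is a hypothetical helper, the three inputs are expected to come from Datadog or CloudWatch, and the "investigate" outcome is a label introduced here for values that meet neither set of thresholds.

```shell
# deployment_verdict ERROR_RATE_PCT P95_MS AVAILABILITY_PCT
# Prints rollback, healthy, or investigate.
deployment_verdict() {
  awk -v err="$1" -v p95="$2" -v avail="$3" 'BEGIN {
    if (err > 5 || avail < 99) {
      print "rollback"      # rollback triggers: error rate > 5% or availability < 99%
    } else if (err < 1 && p95 < 500) {
      print "healthy"       # success criteria: error rate < 1% and P95 < 500 ms
    } else {
      print "investigate"   # in between: neither clearly healthy nor a rollback
    }
  }'
}
```

The non-numeric triggers (critical bugs, downstream failures, data loss) still need a human decision; the helper only covers the measurable ones.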

Terraform State Rollback

Revert to Previous Lambda Version:

# Step 1: Identify previous version
git log --oneline --decorate

# Step 2: Checkout previous commit
git checkout <previous-commit-hash>

# Step 3: Rebuild and redeploy
export TF_VAR_service_version=$(git describe --tags)
make build && make dist

# Step 4: Upload to S3
aws s3 sync dist/ s3://floatme-${ENVIRONMENT}-media/lambda/underwriting/

# Step 5: Terraform apply
cd deploy
terraform init
terraform apply -auto-approve

Alternate: Update Lambda from Console (Fastest)

  1. Navigate to AWS Lambda Console

  2. For each function (api, rule-runner, result-runner, etc.):

    • Click "Code" tab

    • Click "Upload from" → "S3"

    • Enter previous version S3 URI: s3://floatme-{environment}-media/lambda/underwriting/{function}/{previous-version}.zip

    • Click "Save"

  3. Verify health checks pass

Time to Rollback: 2-5 minutes (console method)

Database Rollback

DynamoDB Schema Changes:

  • No direct rollback capability (NoSQL schema-less)

  • Prevention: Use backward-compatible changes only

  • Mitigation: Restore from point-in-time backup (if enabled)

Point-in-Time Recovery:

aws dynamodb restore-table-to-point-in-time \
  --source-table-name dev-underwriting \
  --target-table-name dev-underwriting-restored \
  --restore-date-time 2024-01-15T10:00:00Z

Note: Requires enabling PITR on DynamoDB table (recommended)

Communication Plan

Rollback Announcement:

  1. Slack: Post in #engineering and #incidents channels

       🚨 ROLLBACK IN PROGRESS
       Service: Underwriting API
       Environment: {environment}
       Reason: {brief description}
       ETA: 5 minutes
       Status: https://status.floatme.com

  2. Status Page: Update status.floatme.com with incident

  3. Post-Mortem: Schedule blameless post-mortem within 24 hours


Monitoring & Alerting

CloudWatch Alarms

Critical Alarms (PagerDuty):

  • Lambda Errors > 10 in 5 minutes

       Metric: Errors
       Threshold: > 10
       Period: 5 minutes
       Actions: SNS → PagerDuty

  • API Gateway 5xx > 1% in 5 minutes

       Metric: 5XXError
       Threshold: > 1%
       Period: 5 minutes

  • SQS DLQ Messages > 0

       Metric: ApproximateNumberOfMessagesVisible
       Threshold: > 0
       Period: 1 minute

Warning Alarms (Slack):

  • Lambda Duration > 250ms (P95)

  • Lambda Throttles > 0

  • SQS Queue Age > 5 minutes

Datadog Monitoring

APM Traces:

  • Track request flows across Lambdas

  • Identify slow database queries

  • Monitor external service latency

Custom Metrics:

  • underwriting.evaluation.approved_count

  • underwriting.evaluation.denied_count

  • underwriting.cfi.limit_increased

  • underwriting.rule.execution.duration (by rule name)

Dashboards:

  • Underwriting Overview: High-level service health

  • Lambda Performance: Per-function metrics

  • Evaluation Analytics: Approval rates, amounts


Troubleshooting

Common Issues

Deployment Failure: Terraform Lock

Symptom:

Error: Error acquiring the state lock

Cause: Previous deployment didn’t release Terraform state lock

Solution:

# Release the stale lock (the lock ID is shown in the error output)
terraform force-unlock <lock-id>

# Or delete lock from DynamoDB (if state backend uses DynamoDB)
aws dynamodb delete-item \
  --table-name terraform-state-lock \
  --key '{"LockID": {"S": "terraform-state-prod"}}'

Lambda Cold Start Timeouts

Symptom: API requests timeout on first invocation

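Cause: Cold start latency in the API Lambda. The ~10 second VPC ENI creation delay applied to Lambda's pre-2019 networking model; ENIs are now created when the function is created or reconfigured, so check initialization time (runtime startup, SDK clients, secret fetches) first

Solution:

  • Enable provisioned concurrency (minimum 1)

  • Pre-warm Lambda with a scheduled EventBridge (CloudWatch Events) rule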

SQS Messages Not Processing

Symptom: SQS queue depth increasing, Lambda not invoking

Cause: Lambda error rate triggers automatic throttle

Solution:

  1. Check Lambda error logs in CloudWatch

  2. Fix underlying issue

  3. Re-invoke Lambda manually or wait for automatic retry

  4. Purge queue if messages are poison pills

DynamoDB Throttling

Symptom:

ProvisionedThroughputExceededException

Cause: Too many read/write operations

Solution:

  • Enable DynamoDB auto-scaling

  • Increase provisioned capacity temporarily

  • Optimize query patterns (use Query instead of Scan)

Secret Not Found

Symptom:

ResourceNotFoundException: Secret not found: dev/growthbook

Cause: Secret doesn’t exist in target environment

Solution:

# Create secret in AWS Secrets Manager
aws secretsmanager create-secret \
  --name dev/growthbook \
  --secret-string '{"api_key": "<growthbook-api-key>"}'

# Update Lambda environment to reference secret
# Redeploy Lambda

Security Considerations

IAM Least Privilege

  • Each Lambda has dedicated execution role

  • Permissions scoped to specific resources (table names, queue ARNs)

  • No wildcard permissions in production

Network Security

  • VPC-attached Lambdas (API and Rule Runner) run in private subnets (no direct internet access)

  • NAT Gateway for outbound traffic (AWS services, external APIs)

  • Security groups restrict inbound/outbound traffic

  • VPC endpoints for AWS services (DynamoDB, S3) reduce NAT costs

Secrets Rotation

Best Practices:

  • Rotate API keys every 90 days

  • Use AWS Secrets Manager rotation Lambda (recommended)

  • Test rotation in dev before applying to prod

  • Monitor for authentication failures after rotation


CI/CD Pipeline Diagram

┌─────────────────────────────────────────────────────────────┐
│  Developer                                                   │
│  └─> git push → main/staging/v* tag                         │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│  GitHub Actions                                              │
│  ├─> Checkout code                                           │
│  ├─> Assume AWS role                                         │
│  ├─> Build devkit image                                      │
│  ├─> Run tests (unit, lint)                                  │
│  ├─> Build Go binaries                                       │
│  └─> Create ZIP packages                                     │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│  AWS S3                                                      │
│  └─> Upload deployment packages                             │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│  Terraform                                                   │
│  ├─> Init (download modules, configure state)               │
│  ├─> Plan (calculate changes)                               │
│  ├─> Apply (provision infrastructure)                       │
│  │   ├─> Update Lambda functions (new code)                 │
│  │   ├─> Configure SQS queues                               │
│  │   ├─> Update API Gateway routes                          │
│  │   └─> Configure IAM roles/policies                       │
│  └─> Output (API endpoint URL, function ARNs)               │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│  Post-Deployment Validation                                  │
│  ├─> Health check API endpoint                              │
│  ├─> Verify Lambda state = Active                           │
│  ├─> Check CloudWatch Logs (no errors)                      │
│  ├─> Test eligibility check (smoke test)                    │
│  └─> Monitor Datadog for 15 minutes                         │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
                       ┌───┴───┐
                       │Success│
                       └───────┘

Infrastructure Costs

Monthly Cost Estimate (per environment):

| Service | Monthly Cost | Notes |
|---------|--------------|-------|
| Lambda Invocations | $20-50 | 5 functions × ~1M invocations |
| Lambda Duration | $30-80 | 512 MB × execution time |
| API Gateway | $15-30 | HTTP API (cheaper than REST) |
| DynamoDB | $25-100 | On-demand pricing, varies by traffic |
| SQS | $1-5 | First 1M requests free |
| S3 (Lambda packages) | $1-2 | Storage + GET requests |
| CloudWatch Logs | $5-15 | 7-day retention |
| Datadog APM | $31/host | Per Lambda function monitoring |
| Total | $128-313 | Assumes one Datadog host; scales with traffic |

Cost Optimization:

  • Use provisioned concurrency sparingly (expensive)

  • DynamoDB on-demand cheaper than provisioned for variable workloads

  • HTTP API Gateway cheaper than REST API

  • Reduce Lambda memory if underutilized


See Also