Deployment Guide
This page describes the deployment process for the Underwriting Service, including infrastructure provisioning, secrets management, and operational procedures.
Overview
The Underwriting Service is deployed using:
- Infrastructure as Code: Terraform (AWS provider)
- CI/CD: GitHub Actions
- Build System: Makefile + Docker
- Package Format: AWS Lambda deployment packages (ZIP)
- Environments: dev, staging, prod
Deployment Flow:
1. Code merged to main branch
2. GitHub Actions triggered
3. Build Lambda binaries (Go)
4. Package deployment archives
5. Upload to S3
6. Terraform apply (infrastructure + Lambda updates)
7. Post-deployment validation
Prerequisites
Required Tools
- Terraform: v1.0+ (infrastructure provisioning)
- Docker: For build environment consistency
- AWS CLI: v2+ (credentials and manual operations)
- Go: 1.20+ (local development)
- Make: GNU Make (build automation)
- Git: Version control
AWS Credentials
Authentication Method: AWS IAM Role assumption via GitHub Actions
Required Permissions:
* Lambda: Create, update, delete functions
* IAM: Create/manage roles and policies
* S3: Upload Lambda deployment packages
* API Gateway: Manage HTTP APIs
* DynamoDB: Table operations
* SQS: Queue management
* EventBridge: Rule configuration
* Secrets Manager: Read secrets
* VPC: Network configuration (subnets, security groups)
* CloudWatch: Logs and monitoring
GitHub Actions Role:
arn:aws:iam::267052520423:role/{environment}-github-actions-services-role
Role Assumption Duration: 15 minutes (sufficient for deployment)
Environment Variables
Set these before deploying:
| Variable | Description | Required |
|---|---|---|
| | Target environment (dev/staging/prod) | Yes |
| | Git version tag for deployment | Yes |
| | Primary AWS region (us-west-2) | Yes |
| | GitHub token for private module access | Yes (CI only) |
| | Private Go modules (github.com/floatme-corp) | Yes (build) |
Terraform Infrastructure
Directory Structure
deploy/
├── main.tf # Provider configuration, locals
├── variables.tf # Input variables (environment, regions, names)
├── terraform.tf # Terraform and provider version constraints
├── lambda.tf # Lambda function definitions (5 functions)
├── sqs.tf # SQS queues and DLQs
├── kinesis.tf # Kinesis event tap configuration
├── vpc.tf # VPC, subnets, security groups
├── secrets.tf # AWS Secrets Manager data sources
├── datadog.tf # Datadog monitoring integration
└── Makefile # Terraform automation commands
Terraform Modules
Main Configuration ([main.tf](deploy/main.tf))
Providers:
* aws (default): Primary region operations
* aws.dynamodb: DynamoDB region (may differ from primary)
* aws.eventbridge: EventBridge region
Key Locals:
locals {
  account_id = data.aws_caller_identity.current.account_id
  s3_bucket = "${var.company}-${var.environment}-media"
  s3_prefix = "lambda/${var.application}"
  api_gateway_name = "${var.environment}-${var.application}"
  underwriting_table_name = "${var.environment}-${var.application}"
  floatme_eventbridge_event_bus_name = "default"

  # Networking
  vpc_id = data.aws_vpc.default.id
  private_subnet_ids = data.aws_subnets.private.ids
  security_group_ids = [aws_security_group.lambda.id]

  # Lambda configuration
  lambda_memory_size = 512
  lambda_logs_retention_days = 7
}
Lambda Module ([lambda.tf](deploy/lambda.tf))
Defines all 5 Lambda functions using the fmtf-module-lambda module:
Common Configuration:
* Source code from S3 bucket
* VPC integration (API and Rule Runner only)
* CloudWatch Logs with retention
* Datadog monitoring layer
* Common environment variables
* IAM execution roles with least privilege
Function-Specific Parameters:
| Lambda | Timeout | Memory | Special Configuration |
|---|---|---|---|
| API | 300s | 512 MB | API Gateway trigger, VPC integration |
| Rule Runner | 108s | 512 MB | SQS trigger, VPC integration, batch processing |
| Result Runner | 108s | 512 MB | SQS trigger, aggregation logic |
| Float Created Handler | 108s | 512 MB | SQS trigger (EventBridge events) |
| Profile Handler | 108s | 512 MB | SQS trigger (user signup events) |
Deployment Package Location:
s3://floatme-{environment}-media/lambda/underwriting/{function-name}.zip
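As a rough illustration of how this location follows from the `s3_bucket` and `s3_prefix` locals, the sketch below assembles the URI in Go. The helper name `packageURI` is hypothetical and not part of the service code; the `floatme` and `underwriting` values are taken from the path above.

```go
package main

import "fmt"

// packageURI builds the S3 URI of a function's deployment ZIP, mirroring the
// s3_bucket and s3_prefix locals in main.tf. The helper itself is hypothetical.
func packageURI(company, environment, application, function string) string {
	bucket := fmt.Sprintf("%s-%s-media", company, environment)
	prefix := fmt.Sprintf("lambda/%s", application)
	return fmt.Sprintf("s3://%s/%s/%s.zip", bucket, prefix, function)
}

func main() {
	functions := []string{
		"api", "rule-runner", "result-runner",
		"float-created-handler", "profile-handler",
	}
	for _, fn := range functions {
		fmt.Println(packageURI("floatme", "dev", "underwriting", fn))
	}
}
```

Keeping the bucket and prefix rules in one place makes it harder for the upload step and the Terraform configuration to drift apart.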
SQS Configuration ([sqs.tf](deploy/sqs.tf))
Defines 4 SQS queues with corresponding DLQs:
Queue Pattern:
resource "aws_sqs_queue" "queue_name" {
  name = "${environment}-underwriting-{queue-name}"
  visibility_timeout_seconds = 120
  message_retention_seconds = 345600 # 4 days

  # Redrive policy: one delivery attempt, then DLQ (no retries)
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.queue_name_dlq.arn
    maxReceiveCount = 1
  })
}

resource "aws_sqs_queue" "queue_name_dlq" {
  name = "${environment}-underwriting-{queue-name}-dlq"
  message_retention_seconds = 1209600 # 14 days
}
Queues:
1. rule-runner-sqs-tap + DLQ
2. result-runner-sqs-tap + DLQ
3. float-created-sqs-tap + DLQ
4. profile-handler-sqs-tap + DLQ
Key Parameters:
* Visibility Timeout: 120 seconds (Lambda timeout + buffer)
* Max Receive Count: 1 (fail fast, manual DLQ investigation)
* Message Retention: 4 days (active), 14 days (DLQ)
VPC Configuration ([vpc.tf](deploy/vpc.tf))
VPC Strategy:
* Use existing VPC (data source lookup)
* Deploy API and Rule Runner in private subnets (access to RDS, internal services)
* Security group allows HTTPS egress
Security Group Rules:
resource "aws_security_group" "lambda" {
  name_prefix = "${var.environment}-underwriting-lambda-"
  vpc_id = local.vpc_id

  egress {
    from_port = 443
    to_port = 443
    protocol = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
    description = "HTTPS outbound for AWS services and APIs"
  }
}
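Lambda ENI Considerations:
* VPC-attached Lambdas reach the private subnets through ENIs
* On older Lambda networking, ENI creation at cold start added ~10 seconds of latency; since AWS moved ENI creation to function create/update time (2019), that cost should be paid at deployment rather than on invocation
* Pre-warming via provisioned concurrency reduces the remaining cold start impact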
Secrets Management ([secrets.tf](deploy/secrets.tf))
Secrets Used:
- Datadog Terraform Secret (`{environment}/datadog/terraform`):
  - API key for Datadog monitoring integration
  - Used by Terraform for Datadog resource provisioning
- Segment Secret (`{environment}/segment`):
  - Write key for Segment event tracking
  - Used by Rule/Result Runners for analytics
- GrowthBook Secret (`{environment}/growthbook`):
  - API key for feature flags and A/B testing
  - Used by all Lambdas for dynamic configuration
Secret Access Pattern:
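// Lambdas retrieve secrets at runtime (AWS SDK for Go v1)
sess := session.Must(session.NewSession())
secretsManager := secretsmanager.New(sess)
secret, err := secretsManager.GetSecretValue(&secretsmanager.GetSecretValueInput{
	SecretId: aws.String(os.Getenv("SM_GROWTHBOOK_NAME")),
})
if err != nil {
	return fmt.Errorf("get GrowthBook secret: %w", err)
}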
Secret Rotation:
* Manual rotation via AWS Console or CLI
* Lambda restart required after rotation (invalidate cached credentials)
* No automatic rotation configured
Build Process
Makefile Targets
The root [Makefile](Makefile) provides automation for common tasks:
Development:
make devkit # Build Docker devkit image
make generate # Generate OpenAPI server code
make test # Run unit tests
make coverage # Generate coverage report
make lint # Run linters (golangci-lint)
Build:
make build # Build all Lambda binaries
make dist # Create deployment ZIPs
make clean # Clean build artifacts
Deployment:
make deploy-dev # Deploy to dev environment
make deploy-staging # Deploy to staging
make deploy-prod # Deploy to production
Docker Devkit
Purpose: Consistent build environment across local and CI
Image: ghcr.io/floatme-corp/underwriting-api:latest-devkit
Contents:
* Go 1.20 compiler
* Terraform CLI
* AWS CLI v2
* golangci-lint
* redocly CLI (OpenAPI linting)
Usage:
# All make commands run inside devkit container
make devkit # Build/pull devkit image
make test # Runs tests in container
Lambda Binary Compilation
Build Script: scripts/build.sh (invoked by make build)
Compilation Flags:
GOOS=linux GOARCH=amd64 go build \
-ldflags="-s -w \
-X main.Version=${GIT_VERSION} \
-X main.Commit=${GIT_COMMIT} \
-X main.BuildDate=${GIT_COMMIT_DATE}" \
-o dist/$FUNCTION/bootstrap \
cmd/$FUNCTION/main.go
Binary Optimization:
* -s -w: Omit the symbol table (-s) and DWARF debug information (-w) to reduce binary size
* Static linking: No external dependencies
* Bootstrap naming: AWS Lambda custom runtime convention
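The `-X` flags in the build command only take effect if matching package-level string variables exist in package `main`. A minimal sketch of what those declarations might look like follows; the default values and the `versionString` helper are assumptions, not the service's actual code.

```go
package main

import "fmt"

// These variables are overwritten at link time by
// -ldflags "-X main.Version=... -X main.Commit=... -X main.BuildDate=...".
// The defaults below are assumed placeholders for builds made without the flags.
var (
	Version   = "dev"
	Commit    = "unknown"
	BuildDate = "unknown"
)

// versionString formats the build metadata, e.g. for a health response or a
// startup log line. Hypothetical helper.
func versionString() string {
	return fmt.Sprintf("%s (commit %s, built %s)", Version, Commit, BuildDate)
}

func main() {
	fmt.Println(versionString())
}
```

Note that `-X` only works on string variables; a constant or a non-string variable is silently left unchanged.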
Output:
dist/
├── api/
│ ├── bootstrap
│ └── api.zip
├── rule-runner/
│ ├── bootstrap
│ └── rule-runner.zip
├── result-runner/
│ ├── bootstrap
│ └── result-runner.zip
├── float-created-handler/
│ ├── bootstrap
│ └── float-created-handler.zip
└── profile-handler/
├── bootstrap
└── profile-handler.zip
Deployment Procedures
Environment Strategy
Three Environments:
- dev: Development testing (frequent deploys)
- staging: Pre-production validation (release candidates)
- prod: Production (stable releases only)
Promotion Path:
dev → staging → prod
Pre-Deployment Checklist
Before deploying to any environment:
- Code Review: All PRs reviewed and approved
- Tests Passing: Unit tests, integration tests green
- Linting: No linting errors (`make lint`)
- OpenAPI Spec: Valid and up-to-date (`make lint-spec`)
- Secrets Current: Verify secrets exist and are valid
- Version Tagged: Git tag exists (prod only)
- Changelog Updated: CHANGELOG.md reflects changes
- Rollback Plan: Identify previous stable version
Additional for Staging/Prod:
- Dev Environment Validated: Changes tested in dev
- Database Migrations: Run and tested (if applicable)
- Dependency Updates: Third-party services notified
- Monitoring: Datadog dashboards ready
- On-Call Notified: Team aware of deployment window
GitHub Actions Deployment
Workflow: [.github/workflows/deploy.yml](.github/workflows/deploy.yml)
Triggers:
* dev: Push to main branch
* staging: Push to staging branch
* prod: Git tag push (v* pattern)
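The trigger rules amount to a mapping from Git ref to target environment. The sketch below expresses that mapping in Go, assuming refs arrive in the `refs/heads/<branch>` and `refs/tags/<tag>` form that GitHub Actions exposes as `GITHUB_REF`; the function name `environmentForRef` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// environmentForRef maps a Git ref to the environment it deploys to,
// following the trigger rules above. The second return value is false when
// the ref should not trigger a deployment.
func environmentForRef(ref string) (string, bool) {
	switch {
	case ref == "refs/heads/main":
		return "dev", true
	case ref == "refs/heads/staging":
		return "staging", true
	case strings.HasPrefix(ref, "refs/tags/v"):
		return "prod", true
	default:
		return "", false
	}
}

func main() {
	refs := []string{
		"refs/heads/main",
		"refs/heads/staging",
		"refs/tags/v1.2.3",
		"refs/heads/feature-x",
	}
	for _, ref := range refs {
		env, ok := environmentForRef(ref)
		fmt.Printf("%-22s -> %q (deploy: %t)\n", ref, env, ok)
	}
}
```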
Workflow Steps:
1. Checkout Code
   ```yaml
   - uses: actions/checkout@v3
   ```
2. Assume AWS Role
   ```yaml
   - uses: aws-actions/configure-aws-credentials@v2
     with:
       role-to-assume: arn:aws:iam::267052520423:role/${{ env.ENVIRONMENT }}-github-actions-services-role
       aws-region: us-west-2
   ```
3. Build Devkit Image
   ```bash
   make devkit
   ```
4. Run Tests
   ```bash
   make test
   ```
5. Build Lambda Binaries
   ```bash
   make build
   ```
6. Create Deployment Packages
   ```bash
   make dist
   ```
7. Upload to S3
   ```bash
   aws s3 cp dist/ s3://floatme-${ENVIRONMENT}-media/lambda/underwriting/ --recursive
   ```
8. Terraform Plan
   ```bash
   cd deploy
   terraform init
   terraform plan -out=tfplan
   ```
9. Terraform Apply
   ```bash
   terraform apply tfplan
   ```
10. Post-Deployment Tests
    ```bash
    make integration-test
    ```
Deployment Duration: ~5-10 minutes (full stack)
Manual Deployment
For emergency deployments or local testing:
Step 1: Build
export AWS_PROFILE=underwriting-dev
export TF_VAR_environment=dev
export TF_VAR_service_version=$(git describe --tags)
make build
make dist
Step 2: Upload to S3
aws s3 sync dist/ s3://floatme-dev-media/lambda/underwriting/ \
--exclude "*" --include "*.zip"
Step 3: Terraform Apply
cd deploy
terraform init
terraform plan -out=tfplan
terraform apply tfplan
Step 4: Validate
# Test API endpoint
curl https://dev-underwriting.floatme.com/health
# Check Lambda logs
aws logs tail /aws/lambda/dev-underwriting-api --follow
Canary Deployments (Prod Only)
Strategy: Gradual traffic shift with rollback capability
Configuration: (Not currently implemented - future enhancement)
1. Deploy new version with alias `canary`
2. Configure API Gateway weighted routing: 95% live, 5% canary
3. Monitor error rates, latency for 30 minutes
4. If healthy, shift 50%, then 100%
5. If unhealthy, revert to 100% live
Recommended Tool: AWS Lambda Alias Traffic Shifting or AWS CodeDeploy
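Since canary deployments are not implemented yet, the following is only a sketch of the shift schedule described above, written as a pure function so it could sit behind whichever tool is eventually chosen. The name `nextCanaryWeight` and the `stages` slice are hypothetical.

```go
package main

import "fmt"

// stages lists the canary traffic percentages from the strategy above.
var stages = []int{5, 50, 100}

// nextCanaryWeight returns the canary traffic percentage to apply after a
// monitoring window. An unhealthy canary is reverted to 0% (100% live);
// a healthy one advances to the next stage and stays at 100% once there.
func nextCanaryWeight(current int, healthy bool) int {
	if !healthy {
		return 0
	}
	for _, s := range stages {
		if s > current {
			return s
		}
	}
	return 100
}

func main() {
	weight := 0
	for _, healthy := range []bool{true, true, true} {
		weight = nextCanaryWeight(weight, healthy)
		fmt.Printf("canary now receives %d%% of traffic\n", weight)
	}
	fmt.Printf("after a failed check: %d%%\n", nextCanaryWeight(50, false))
}
```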
Post-Deployment Validation
Automated Checks
GitHub Actions Post-Deploy:
# API health check
curl -sf https://${ENVIRONMENT}-underwriting.floatme.com/health || exit 1
# Lambda invocation test
aws lambda invoke --function-name ${ENVIRONMENT}-underwriting-api \
--cli-binary-format raw-in-base64-out \
--payload '{"path":"/health"}' response.json
# DynamoDB connectivity
aws dynamodb describe-table --table-name ${ENVIRONMENT}-underwriting
Manual Validation
1. API Endpoint Health
curl https://{env}-underwriting.floatme.com/health
# Expected: {"status": "ok", "version": "v1.2.3"}
2. Lambda Function Status
aws lambda list-functions --query 'Functions[?starts_with(FunctionName, `{env}-underwriting`)].[FunctionName, State]'
# Expected: All functions in "Active" state
3. CloudWatch Logs
aws logs tail /aws/lambda/{env}-underwriting-api --since 5m
# Expected: No ERROR level logs
4. SQS Queue Metrics
aws sqs get-queue-attributes \
--queue-url https://sqs.us-west-2.amazonaws.com/267052520423/{env}-underwriting-rule-runner-sqs-tap \
--attribute-names ApproximateNumberOfMessages,ApproximateNumberOfMessagesNotVisible
# Expected: Low/zero message counts
5. Test Eligibility Check
# Replace with real user ID from environment
curl -X GET https://{env}-underwriting.floatme.com/{user_id}/float_check \
-H "Authorization: Bearer $TOKEN"
# Expected: HTTP 200, valid evaluation response
6. Datadog Dashboard
* Navigate to Underwriting Service dashboard
* Verify deployment annotation appears
* Check for error rate spikes
* Monitor Lambda duration metrics
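The health check in step 1 can be made stricter by also comparing the reported version with the one that was just deployed. The sketch below assumes the response shape shown in that step (`status` and `version` fields); the `checkHealth` function and `healthResponse` type are illustrative names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// healthResponse matches the expected /health body from step 1.
type healthResponse struct {
	Status  string `json:"status"`
	Version string `json:"version"`
}

// checkHealth verifies that the body reports status "ok" and the version
// that was just deployed. Hypothetical helper.
func checkHealth(body []byte, wantVersion string) error {
	var h healthResponse
	if err := json.Unmarshal(body, &h); err != nil {
		return fmt.Errorf("decode health response: %w", err)
	}
	if h.Status != "ok" {
		return fmt.Errorf("unexpected status %q", h.Status)
	}
	if h.Version != wantVersion {
		return fmt.Errorf("deployed version is %q, want %q", h.Version, wantVersion)
	}
	return nil
}

func main() {
	body := []byte(`{"status": "ok", "version": "v1.2.3"}`)
	if err := checkHealth(body, "v1.2.3"); err != nil {
		fmt.Println("validation failed:", err)
		return
	}
	fmt.Println("health check passed")
}
```

Checking the version catches the case where the endpoint is healthy but still serving the previous release.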
Success Criteria
Deployment is successful when:
- All Lambda functions in "Active" state
- API /health endpoint returns 200
- CloudWatch Logs show successful invocations
- SQS queues processing normally (no backlog)
- Error rate < 1% (Datadog)
- P95 latency < 500ms (API Lambda)
- No alerts triggered in first 15 minutes
Rollback Procedures
When to Rollback
Rollback if any of the following occur within 30 minutes of deployment:
- Error rate > 5%
- API availability < 99%
- Critical bugs reported
- Downstream service failures
- Database corruption or data loss
Terraform State Rollback
Revert to Previous Lambda Version:
# Step 1: Identify previous version
git log --oneline --decorate
# Step 2: Checkout previous commit
git checkout <previous-commit-hash>
# Step 3: Rebuild and redeploy
export TF_VAR_service_version=$(git describe --tags)
make build && make dist
# Step 4: Upload to S3
aws s3 sync dist/ s3://floatme-${ENVIRONMENT}-media/lambda/underwriting/
# Step 5: Terraform apply
cd deploy
terraform init
terraform apply -auto-approve
Alternate: Update Lambda from Console (Fastest)
1. Navigate to AWS Lambda Console
2. For each function (api, rule-runner, result-runner, etc.):
   - Click "Code" tab
   - Click "Upload from" → "S3"
   - Enter previous version S3 URI: `s3://floatme-site-media/lambda/underwriting/{function}/{previous-version}.zip`
   - Click "Save"
3. Verify health checks pass
Time to Rollback: 2-5 minutes (console method)
Database Rollback
DynamoDB Schema Changes:
- No direct rollback capability (NoSQL schema-less)
- Prevention: Use backward-compatible changes only
- Mitigation: Restore from point-in-time backup (if enabled)
Point-in-Time Recovery:
aws dynamodb restore-table-to-point-in-time \
--source-table-name dev-underwriting \
--target-table-name dev-underwriting-restored \
--restore-date-time 2024-01-15T10:00:00Z
Note: Requires enabling PITR on DynamoDB table (recommended)
Communication Plan
Rollback Announcement:
- Slack: Post in `#engineering` and `#incidents` channels
  ```
  🚨 ROLLBACK IN PROGRESS
  Service: Underwriting API
  Environment: site
  Reason: {brief description}
  ETA: 5 minutes
  Status: https://status.floatme.com
  ```
- Status Page: Update status.floatme.com with incident
- Post-Mortem: Schedule blameless post-mortem within 24 hours
Monitoring & Alerting
CloudWatch Alarms
Critical Alarms (PagerDuty):
- Lambda Errors > 10 in 5 minutes
  ```
  Metric: Errors
  Threshold: > 10
  Period: 5 minutes
  Actions: SNS → PagerDuty
  ```
- API Gateway 5xx > 1% in 5 minutes
  ```
  Metric: 5XXError
  Threshold: > 1%
  Period: 5 minutes
  ```
- SQS DLQ Messages > 0
  ```
  Metric: ApproximateNumberOfMessagesVisible
  Threshold: > 0
  Period: 1 minute
  ```
Warning Alarms (Slack):
- Lambda Duration > 250ms (P95)
- Lambda Throttles > 0
- SQS Queue Age > 5 minutes
Datadog Monitoring
APM Traces:
* Track request flows across Lambdas
* Identify slow database queries
* Monitor external service latency
Custom Metrics:
* underwriting.evaluation.approved_count
* underwriting.evaluation.denied_count
* underwriting.cfi.limit_increased
* underwriting.rule.execution.duration (by rule name)
Dashboards:
* Underwriting Overview: High-level service health
* Lambda Performance: Per-function metrics
* Evaluation Analytics: Approval rates, amounts
Troubleshooting
Common Issues
Deployment Failure: Terraform Lock
Symptom:
Error: Error acquiring the state lock
Cause: Previous deployment didn’t release Terraform state lock
Solution:
# Release the stale lock (the lock ID is shown in the error output)
terraform force-unlock <lock-id>
# Or delete lock from DynamoDB (if state backend uses DynamoDB)
aws dynamodb delete-item \
--table-name terraform-state-lock \
--key '{"LockID": {"S": "terraform-state-prod"}}'
Lambda Cold Start Timeouts
Symptom: API requests timeout on first invocation
Cause: Cold start delay, historically dominated by VPC ENI creation (~10 seconds)
Solution:
* Enable provisioned concurrency (minimum 1)
* Pre-warm Lambda with scheduled CloudWatch event
SQS Messages Not Processing
Symptom: SQS queue depth increasing, Lambda not invoking
Cause: Lambda error rate triggers automatic throttle
Solution:
1. Check Lambda error logs in CloudWatch
2. Fix underlying issue
3. Re-invoke Lambda manually or wait for automatic retry
4. Purge queue if messages are poison pills
DynamoDB Throttling
Symptom:
ProvisionedThroughputExceededException
Cause: Too many read/write operations
Solution:
* Enable DynamoDB auto-scaling
* Increase provisioned capacity temporarily
* Optimize query patterns (use Query instead of Scan)
Secret Not Found
Symptom:
ResourceNotFoundException: Secret not found: dev/growthbook
Cause: Secret doesn’t exist in target environment
Solution:
# Create secret in AWS Secrets Manager
aws secretsmanager create-secret \
--name dev/growthbook \
--secret-string '{"api_key": "<growthbook-api-key>"}'
# Update Lambda environment to reference secret
# Redeploy Lambda
Security Considerations
IAM Least Privilege
- Each Lambda has dedicated execution role
- Permissions scoped to specific resources (table names, queue ARNs)
- No wildcard permissions in production
CI/CD Pipeline Diagram
┌─────────────────────────────────────────────────────────────┐
│ Developer │
│ └─> git push → main/staging/v* tag │
└──────────────────────────┬──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ GitHub Actions │
│ ├─> Checkout code │
│ ├─> Assume AWS role │
│ ├─> Build devkit image │
│ ├─> Run tests (unit, lint) │
│ ├─> Build Go binaries │
│ └─> Create ZIP packages │
└──────────────────────────┬──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ AWS S3 │
│ └─> Upload deployment packages │
└──────────────────────────┬──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Terraform │
│ ├─> Init (download modules, configure state) │
│ ├─> Plan (calculate changes) │
│ ├─> Apply (provision infrastructure) │
│ │ ├─> Update Lambda functions (new code) │
│ │ ├─> Configure SQS queues │
│ │ ├─> Update API Gateway routes │
│ │ └─> Configure IAM roles/policies │
│ └─> Output (API endpoint URL, function ARNs) │
└──────────────────────────┬──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Post-Deployment Validation │
│ ├─> Health check API endpoint │
│ ├─> Verify Lambda state = Active │
│ ├─> Check CloudWatch Logs (no errors) │
│ ├─> Test eligibility check (smoke test) │
│ └─> Monitor Datadog for 15 minutes │
└──────────────────────────┬──────────────────────────────────┘
│
▼
┌───┴───┐
│Success│
└───────┘
Infrastructure Costs
Monthly Cost Estimate (per environment):
| Service | Monthly Cost | Notes |
|---|---|---|
| Lambda Invocations | $20-50 | 5 functions × ~1M invocations |
| Lambda Duration | $30-80 | 512 MB × execution time |
| API Gateway | $15-30 | HTTP API (cheaper than REST) |
| DynamoDB | $25-100 | On-demand pricing, varies by traffic |
| SQS | $1-5 | First 1M requests free |
| S3 (Lambda packages) | $1-2 | Storage + GET requests |
| CloudWatch Logs | $5-15 | 7-day retention |
| Datadog APM | $31/host | Per Lambda function monitoring |
| Total | $128-313 | Scales with traffic; assumes one Datadog host |
Cost Optimization:
* Use provisioned concurrency sparingly (expensive)
* DynamoDB on-demand cheaper than provisioned for variable workloads
* HTTP API Gateway cheaper than REST API
* Reduce Lambda memory if underutilized
See Also
- Lambda Functions - Detailed Lambda documentation
- System Architecture - Overall design
- DynamoDB Schema - Data structures
- API Specification - REST API endpoints