Terraform for Genomics and Personalized Medicine on AWS

Provision HIPAA-aligned genomics infrastructure with Terraform: secure data lakes, AWS HealthOmics workflows, audit logging, and compliant compute.

Luca Berton · 1 min read

Gene editing and personalized medicine are reshaping healthcare in 2026. Sequencing is cheap; compliant compute is the bottleneck. Hospitals and biotechs need HIPAA-aligned data lakes, AWS HealthOmics workflow runners, audited access, and an isolated environment for each study. Terraform turns those building blocks into a reproducible "genomics stack."

This guide shows how to provision a personalized-medicine genomics backend on AWS.

Architecture

| Layer | AWS service |
| --- | --- |
| Patient sequence storage | HealthOmics Sequence Stores |
| Variant storage | HealthOmics Variant Stores |
| Workflows | HealthOmics Workflows |
| Annotation lake | S3 + Glue + Athena |
| PHI access | IAM + Lake Formation + CloudTrail |
| Compute | Batch with EFA / FSx |

HealthOmics Sequence and Variant Stores

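# NOTE: confirm the aws_omics_* resource types exist in your provider version;
# the AWSCC provider exposes awscc_omics_* equivalents.
resource "aws_omics_sequence_store" "patient_seq" {
  name        = "patient-sequences"
  description = "Primary patient FASTQ/BAM/CRAM"

  # HealthOmics accepts only "KMS" as the SSE type; use a customer-managed key for PHI.
  sse_config {
    type    = "KMS"
    key_arn = aws_kms_key.phi.arn
  }
}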
 
resource "aws_omics_variant_store" "germline" {
  name = "germline-variants"
 
  reference {
    reference_arn = aws_omics_reference_store.grch38.arn
  }
 
  sse_config {
    type    = "KMS"
    key_arn = aws_kms_key.phi.arn
  }
}
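
The stores above reference `aws_kms_key.phi`, which this article does not define. A minimal sketch of such a key might look like the following; the description, alias name, and deletion window are assumptions for illustration, not values from the original setup.

```hcl
# Hypothetical customer-managed key for PHI; all values are illustrative.
resource "aws_kms_key" "phi" {
  description             = "CMK for PHI in genomics stores"
  enable_key_rotation     = true # automatic rotation of key material
  deletion_window_in_days = 30   # longest allowed waiting period before deletion
}

# Assumed alias so the key is easy to find in the console and in policies.
resource "aws_kms_alias" "phi" {
  name          = "alias/genomics-phi"
  target_key_id = aws_kms_key.phi.key_id
}
```

A customer-managed key keeps key policy, rotation, and the CloudTrail record of key usage under your own control, which is usually what a PHI audit asks about.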

Workflow Runner

resource "aws_omics_workflow" "secondary_analysis" {
  name             = "germline-secondary-analysis"
  description      = "BWA-MEM2 + DeepVariant"
  engine           = "WDL"
  storage_capacity = 1200 # GiB of run storage

  # Zipped workflow definition uploaded to the workflows bucket
  definition_uri = "s3://${aws_s3_bucket.workflows.bucket}/germline.zip"

  # Inputs every run must supply
  parameter_template = jsonencode({
    sample_id  = { description = "Sample identifier", optional = false }
    fastq_uris = { description = "FASTQ files", optional = false }
    reference  = { description = "Reference genome", optional = false }
  })
}
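
The workflow points at `aws_s3_bucket.workflows`, which is not shown in this article. One way to provide it, sketched under the assumption that the zipped WDL bundle lives next to the Terraform module, is below. The bucket name and local path are hypothetical.

```hcl
# Hypothetical bucket for workflow bundles; the name is an assumption.
resource "aws_s3_bucket" "workflows" {
  bucket = "acme-genomics-workflows"
}

# Block every form of public access to the bundle bucket.
resource "aws_s3_bucket_public_access_block" "workflows" {
  bucket                  = aws_s3_bucket.workflows.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Upload the zipped WDL bundle; the local path is illustrative.
resource "aws_s3_object" "germline_bundle" {
  bucket      = aws_s3_bucket.workflows.id
  key         = "germline.zip"
  source      = "${path.module}/workflows/germline.zip"
  source_hash = filemd5("${path.module}/workflows/germline.zip") # re-upload when the bundle changes
}
```

Keeping the bundle in the same configuration means a change to the WDL source shows up in `terraform plan` alongside the infrastructure it runs on.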

Lake Formation–Governed Annotation Lake

resource "aws_s3_bucket" "annotations" {
  bucket = "acme-genomics-annotations"
}
 
resource "aws_lakeformation_resource" "annotations" {
  arn      = aws_s3_bucket.annotations.arn
  role_arn = aws_iam_role.lake_formation.arn
}
 
resource "aws_glue_catalog_database" "genomics" {
  name = "genomics"
}
 
resource "aws_lakeformation_permissions" "researcher_read" {
  for_each    = toset(var.researchers)
  principal   = each.value
  permissions = ["SELECT"]

  table_with_columns {
    database_name = aws_glue_catalog_database.genomics.name
    name          = "variants"

    # wildcard must be true whenever excluded_column_names is set
    wildcard              = true
    excluded_column_names = ["patient_id", "mrn", "dob"]
  }
}

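The exclusion list is the key: researchers granted this permission can query every column of variants except the excluded PHI columns, as long as they go through a Lake Formation-integrated engine such as Athena.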
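
The permissions resource iterates over `var.researchers`, which the article leaves undeclared. A possible declaration, assuming each entry is an IAM role or user ARN in the standard `aws` partition, could look like this. The validation rule is an added safeguard, not part of the original configuration.

```hcl
# Hypothetical declaration; the article uses var.researchers without defining it.
variable "researchers" {
  description = "IAM principal ARNs granted column-filtered SELECT on the variants table"
  type        = list(string)

  validation {
    # Reject anything that is not an IAM role or user ARN (commercial partition assumed).
    condition = alltrue([
      for arn in var.researchers :
      can(regex("^arn:aws:iam::[0-9]{12}:(role|user)/.+$", arn))
    ])
    error_message = "Each researcher must be an IAM role or user ARN."
  }
}
```

Failing at plan time on a malformed principal is cheaper than discovering a rejected grant halfway through an apply.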

Auditable Access (CloudTrail Lake)

resource "aws_cloudtrail_event_data_store" "phi" {
  name                          = "phi-audit"
  multi_region_enabled          = true
  retention_period              = 2557 # days, i.e. seven years
  termination_protection_enabled = true
 
  advanced_event_selector {
    name = "PHI bucket data events"
    field_selector {
      field  = "eventCategory"
      equals = ["Data"]
    }
    field_selector {
      field  = "resources.type"
      equals = ["AWS::S3::Object"]
    }
    field_selector {
      field       = "resources.ARN"
      starts_with = ["${aws_s3_bucket.phi.arn}/"]
    }
  }
}

Compute With Network Isolation

resource "aws_batch_compute_environment" "genomics" {
  compute_environment_name = "genomics-secondary"
  type                     = "MANAGED"
  service_role             = aws_iam_role.batch.arn

  compute_resources {
    # Fargate supports neither EFA nor FSx for Lustre; switch to type = "EC2"
    # (with an instance profile and instance types) if the workload needs them.
    type               = "FARGATE"
    max_vcpus          = 4096
    subnets            = var.private_subnet_ids # no NAT gateway, only VPC endpoints
    security_group_ids = [aws_security_group.batch_no_egress.id]
  }
}
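
Subnets without a NAT gateway only work if the endpoints the jobs depend on exist. The sketch below shows one plausible set for Fargate jobs that pull images from ECR and write logs to CloudWatch. The variables `vpc_id`, `region`, and `private_route_table_ids` and the `endpoints` security group are hypothetical names introduced here; `var.private_subnet_ids` and `aws_security_group.batch_no_egress` are the ones the compute environment above already uses.

```hcl
# Hypothetical inputs; none of these are declared in the article.
variable "vpc_id" { type = string }
variable "region" { type = string }
variable "private_route_table_ids" { type = list(string) }

# Allow HTTPS from the Batch tasks to the interface endpoints.
resource "aws_security_group" "endpoints" {
  name   = "genomics-vpc-endpoints"
  vpc_id = var.vpc_id

  ingress {
    from_port       = 443
    to_port         = 443
    protocol        = "tcp"
    security_groups = [aws_security_group.batch_no_egress.id]
  }
}

# Gateway endpoint: S3 traffic (including ECR image layers) stays on the AWS network.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids
}

# Interface endpoints for image pulls (ECR) and job logs (CloudWatch Logs).
resource "aws_vpc_endpoint" "interface" {
  for_each = toset(["ecr.api", "ecr.dkr", "logs"])

  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.${each.value}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}
```

The task security group still needs an egress rule for port 443 toward these endpoints, so "no egress" in practice means no route to the internet rather than no outbound rules at all.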

Best Practices

  • Use HealthOmics-managed stores instead of raw S3: they are built for genomics access patterns, and HealthOmics is a HIPAA-eligible service.
  • Use Lake Formation column-level security so PHI columns are never returned to researcher queries.
  • No-egress private subnets for compute; everything goes through VPC endpoints.
  • CloudTrail Lake retention of seven years (2,557 days), which exceeds HIPAA's six-year documentation retention requirement.
  • Terraform-backed BAA scope: every resource gets a data-classification=phi tag enforced via SCP.
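
The tagging practice in the last bullet can be applied on the Terraform side with provider-level default tags. This is a sketch only: the region value is an assumption, and the SCP that rejects untagged resources is a separate piece not shown here.

```hcl
# Every taggable resource created through this provider inherits the tag.
provider "aws" {
  region = "us-east-1" # assumed region; replace with your own

  default_tags {
    tags = {
      "data-classification" = "phi"
    }
  }
}
```

Default tags only cover resources managed through this `aws` provider block, so resources created by another provider or by hand still depend on the SCP for enforcement.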
Tags: Terraform, Genomics, AWS, HealthOmics, Compliance