TerraformPilot

Terraform for Mechanistic Interpretability Research Infrastructure

Provision mechanistic interpretability research infrastructure with Terraform: research compute, experiment tracking, model checkpoints, and notebooks.

Luca Berton · 1 min read

Mechanistic interpretability — reverse-engineering neural networks to understand circuits, features, and reasoning — is one of the more research-heavy 2026 trends. Labs running interpretability work need GPU notebooks, large model-checkpoint storage, experiment tracking, and reproducible Jupyter environments. Terraform turns that into a one-command lab setup.

This guide shows how to provision a research environment on AWS for interpretability work (sparse autoencoders, activation patching, circuit analysis).

Architecture

Layer                    | AWS service
-------------------------|------------------------------------------
GPU notebooks            | SageMaker Studio, EC2 with NVIDIA AMI
Large checkpoint storage | S3 + FSx for Lustre
Experiment tracking      | Self-hosted MLflow on ECS
Visualization            | EC2 with neuronpedia / transformer-lens
Datasets                 | S3 (frozen, versioned)

SageMaker Studio Domain

resource "aws_sagemaker_domain" "interp" {
  domain_name             = "interp-research"
  auth_mode               = "IAM"
  vpc_id                  = module.vpc.vpc_id
  subnet_ids              = module.vpc.private_subnets
  app_network_access_type = "VpcOnly"
 
  default_user_settings {
    execution_role = aws_iam_role.studio.arn
 
    jupyter_server_app_settings {
      default_resource_spec {
        # omit sagemaker_image_arn to use the AWS-managed default JupyterServer
        # image; the aws_sagemaker_prebuilt_ecr_image data source exposes
        # registry_path, not an ARN
        instance_type = "ml.t3.medium"
      }
    }
 
    kernel_gateway_app_settings {
      default_resource_spec {
        instance_type = "ml.g5.2xlarge"
      }
    }
  }
}
 
resource "aws_sagemaker_user_profile" "researcher" {
  for_each          = toset(var.researchers)
  domain_id         = aws_sagemaker_domain.interp.id
  user_profile_name = each.key
}
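
The `var.researchers` list used by `for_each` above is never declared, and the best practices below recommend pinning library versions in a lifecycle config. A minimal sketch of both, with placeholder package versions:

```hcl
variable "researchers" {
  description = "Usernames to provision as SageMaker Studio user profiles"
  type        = list(string)
  default     = []
}

# Runs on every JupyterServer start; pin the interpretability stack so
# notebooks stay reproducible. Versions below are placeholders — pin the
# ones your experiments were actually run with.
resource "aws_sagemaker_studio_lifecycle_config" "pin_libs" {
  studio_lifecycle_config_name     = "pin-interp-libs"
  studio_lifecycle_config_app_type = "JupyterServer"
  studio_lifecycle_config_content = base64encode(<<-EOT
    #!/bin/bash
    pip install --quiet "transformer-lens==2.0.0" "sae-lens==3.0.0"
  EOT
  )
}
```

Attach the config to the domain via `lifecycle_config_arns` inside `jupyter_server_app_settings`.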

Versioned Checkpoint Bucket

resource "aws_s3_bucket" "checkpoints" {
  bucket = "acme-interp-checkpoints"
}
 
resource "aws_s3_bucket_versioning" "checkpoints" {
  bucket = aws_s3_bucket.checkpoints.id
  versioning_configuration { status = "Enabled" }
}
 
resource "aws_s3_bucket_lifecycle_configuration" "checkpoints" {
  bucket = aws_s3_bucket.checkpoints.id
 
  rule {
    id     = "tier-old-runs"
    status = "Enabled"
    transition {
      days          = 30
      storage_class = "INTELLIGENT_TIERING"
    }
    noncurrent_version_transition {
      noncurrent_days = 60
      storage_class   = "GLACIER"
    }
  }
}
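
Checkpoint buckets often hold unreleased model weights and activations, so it is worth locking them down explicitly. A small hardening sketch for the bucket above (both resources are additions, not part of the original setup):

```hcl
resource "aws_s3_bucket_public_access_block" "checkpoints" {
  bucket                  = aws_s3_bucket.checkpoints.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_server_side_encryption_configuration" "checkpoints" {
  bucket = aws_s3_bucket.checkpoints.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}
```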

FSx for Lustre — Hot Activation Cache

Activation patching loads tens of gigabytes of cached activations per run; FSx for Lustre keeps them on the GPU's doorstep:

resource "aws_fsx_lustre_file_system" "activations" {
  storage_capacity            = 4800
  subnet_ids                  = [module.vpc.private_subnets[0]]
  deployment_type             = "PERSISTENT_2"
  per_unit_storage_throughput = 1000
  data_compression_type       = "LZ4"
}
 
# PERSISTENT_2 file systems link to S3 through a data repository
# association; the legacy import_path/export_path/auto_import_policy
# arguments are only valid for SCRATCH and PERSISTENT_1 deployments.
resource "aws_fsx_data_repository_association" "activations" {
  file_system_id       = aws_fsx_lustre_file_system.activations.id
  file_system_path     = "/activations"
  data_repository_path = "s3://${aws_s3_bucket.checkpoints.bucket}/activations"
 
  s3 {
    auto_import_policy { events = ["NEW", "CHANGED", "DELETED"] }
    auto_export_policy { events = ["NEW", "CHANGED", "DELETED"] }
  }
}

Self-Hosted MLflow Tracking

resource "aws_ecs_cluster" "mlflow" {
  name = "mlflow"
}
 
resource "aws_db_instance" "mlflow" {
  identifier              = "mlflow"
  engine                  = "postgres"
  engine_version          = "16.3"
  instance_class          = "db.t4g.medium"
  allocated_storage       = 100
  storage_encrypted       = true
  username                = "mlflow"
  password                = random_password.db.result
  db_subnet_group_name    = aws_db_subnet_group.mlflow.name
  vpc_security_group_ids  = [aws_security_group.mlflow_db.id]
  skip_final_snapshot     = false
  final_snapshot_identifier = "mlflow-final"
}
 
resource "aws_ecs_service" "mlflow" {
  name            = "mlflow"
  cluster         = aws_ecs_cluster.mlflow.id
  task_definition = aws_ecs_task_definition.mlflow.arn
  desired_count   = 2
  launch_type     = "FARGATE"
 
  network_configuration {
    subnets          = module.vpc.private_subnets
    security_groups  = [aws_security_group.mlflow.id]
    assign_public_ip = false
  }
 
  load_balancer {
    target_group_arn = aws_lb_target_group.mlflow.arn
    container_name   = "mlflow"
    container_port   = 5000
  }
}
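
Two references above are never defined: `random_password.db` and `aws_ecs_task_definition.mlflow`. A minimal sketch, assuming a custom container image (the hypothetical `var.mlflow_image`) that bundles mlflow, psycopg2, and boto3, plus an execution role `aws_iam_role.mlflow_exec`:

```hcl
resource "random_password" "db" {
  length  = 32
  special = false
}

resource "aws_ecs_task_definition" "mlflow" {
  family                   = "mlflow"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "1024"
  memory                   = "2048"
  execution_role_arn       = aws_iam_role.mlflow_exec.arn

  container_definitions = jsonencode([{
    name         = "mlflow"
    image        = var.mlflow_image
    portMappings = [{ containerPort = 5000 }]
    command = [
      "mlflow", "server", "--host", "0.0.0.0", "--port", "5000",
      "--backend-store-uri",
      "postgresql://mlflow:${random_password.db.result}@${aws_db_instance.mlflow.address}:5432/mlflow",
      "--artifacts-destination",
      "s3://${aws_s3_bucket.checkpoints.bucket}/mlflow-artifacts",
    ]
  }])
}
```

In production, inject the database password through the container definition's `secrets` block (backed by Secrets Manager) rather than interpolating it into the command, where it sits in plaintext in the task definition.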

Best Practices

  • Freeze datasets in versioned buckets — interpretability claims need byte-exact reproducibility.
  • Pin transformer-lens / SAE library versions in the SageMaker lifecycle config.
  • Tag every run with paper-id, researcher, model, commit-sha for traceability.
  • Keep raw activations on FSx, derived results on S3 — different cost/latency profiles.
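
At the infrastructure level, the same traceability idea can be mirrored with provider `default_tags`, so every resource Terraform creates is stamped with experiment context. Tag values here are illustrative placeholders:

```hcl
provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = {
      project    = "interp-research"
      paper-id   = "placeholder"
      researcher = "placeholder"
      commit-sha = "placeholder"
    }
  }
}
```

Per-run tags such as `model` belong on the MLflow run itself rather than on cloud resources.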