TerraformPilot

Terraform for Mechanistic Interpretability Research Infrastructure

Provision mechanistic interpretability research infrastructure with Terraform: research compute, experiment tracking, model checkpoints, and notebooks.

Luca Berton · 1 min read

Mechanistic interpretability — reverse-engineering neural networks to understand circuits, features, and reasoning — is one of the more research-heavy 2026 trends. Labs running interpretability work need GPU notebooks, large model-checkpoint storage, experiment tracking, and reproducible Jupyter environments. Terraform turns that into a one-command lab setup.

This guide shows how to provision a research environment on AWS for interpretability work (sparse autoencoders, activation patching, circuit analysis).

Architecture

Layer                    | AWS service
-------------------------|------------------------------------------
GPU notebooks            | SageMaker Studio, EC2 with NVIDIA AMI
Large checkpoint storage | S3 + FSx for Lustre
Experiment tracking      | Self-hosted MLflow on ECS
Visualization            | EC2 with neuronpedia / transformer-lens
Datasets                 | S3 (frozen, versioned)

SageMaker Studio Domain

resource "aws_sagemaker_domain" "interp" {
  domain_name             = "interp-research"
  auth_mode               = "IAM"
  vpc_id                  = module.vpc.vpc_id
  subnet_ids              = module.vpc.private_subnets
  app_network_access_type = "VpcOnly"
 
  default_user_settings {
    execution_role = aws_iam_role.studio.arn
 
    jupyter_server_app_settings {
      default_resource_spec {
        # omit sagemaker_image_arn to use the AWS-managed default JupyterServer
        # image; the aws_sagemaker_prebuilt_ecr_image data source exposes
        # registry_path, not an ARN
        instance_type = "ml.t3.medium"
      }
    }
 
    kernel_gateway_app_settings {
      default_resource_spec {
        instance_type = "ml.g5.2xlarge"
      }
    }
  }
}
 
resource "aws_sagemaker_user_profile" "researcher" {
  for_each          = toset(var.researchers)
  domain_id         = aws_sagemaker_domain.interp.id
  user_profile_name = each.key
}
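
The `var.researchers` list used by `for_each` above is never declared, and the best practices below recommend pinning library versions in a lifecycle config. A minimal sketch of both, with placeholder package versions:

```hcl
variable "researchers" {
  description = "Usernames to provision as SageMaker Studio user profiles"
  type        = list(string)
  default     = []
}

# Runs on every JupyterServer start; pin the interpretability stack so
# notebooks stay reproducible. Versions below are placeholders — pin the
# ones your experiments were actually run with.
resource "aws_sagemaker_studio_lifecycle_config" "pin_libs" {
  studio_lifecycle_config_name     = "pin-interp-libs"
  studio_lifecycle_config_app_type = "JupyterServer"
  studio_lifecycle_config_content = base64encode(<<-EOT
    #!/bin/bash
    pip install --quiet "transformer-lens==2.0.0" "sae-lens==3.0.0"
  EOT
  )
}
```

Attach the config to the domain via `lifecycle_config_arns` inside `jupyter_server_app_settings`.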

Versioned Checkpoint Bucket

resource "aws_s3_bucket" "checkpoints" {
  bucket = "acme-interp-checkpoints"
}
 
resource "aws_s3_bucket_versioning" "checkpoints" {
  bucket = aws_s3_bucket.checkpoints.id
  versioning_configuration { status = "Enabled" }
}
 
resource "aws_s3_bucket_lifecycle_configuration" "checkpoints" {
  bucket = aws_s3_bucket.checkpoints.id
 
  rule {
    id     = "tier-old-runs"
    status = "Enabled"
    transition {
      days          = 30
      storage_class = "INTELLIGENT_TIERING"
    }
    noncurrent_version_transition {
      noncurrent_days = 60
      storage_class   = "GLACIER"
    }
  }
}
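
Checkpoint buckets often hold unreleased model weights and activations, so it is worth locking them down explicitly. A small hardening sketch for the bucket above (both resources are additions, not part of the original setup):

```hcl
resource "aws_s3_bucket_public_access_block" "checkpoints" {
  bucket                  = aws_s3_bucket.checkpoints.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_server_side_encryption_configuration" "checkpoints" {
  bucket = aws_s3_bucket.checkpoints.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}
```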

FSx for Lustre — Hot Activation Cache

Activation patching loads tens of gigabytes of cached activations per run; FSx for Lustre keeps them on the GPU's doorstep:

resource "aws_fsx_lustre_file_system" "activations" {
  storage_capacity            = 4800
  subnet_ids                  = [module.vpc.private_subnets[0]]
  deployment_type             = "PERSISTENT_2"
  per_unit_storage_throughput = 1000
  data_compression_type       = "LZ4"
}
 
# PERSISTENT_2 file systems link to S3 through a data repository
# association; the legacy import_path/export_path/auto_import_policy
# arguments are only valid for SCRATCH and PERSISTENT_1 deployments.
resource "aws_fsx_data_repository_association" "activations" {
  file_system_id       = aws_fsx_lustre_file_system.activations.id
  file_system_path     = "/activations"
  data_repository_path = "s3://${aws_s3_bucket.checkpoints.bucket}/activations"
 
  s3 {
    auto_import_policy { events = ["NEW", "CHANGED", "DELETED"] }
    auto_export_policy { events = ["NEW", "CHANGED", "DELETED"] }
  }
}

Self-Hosted MLflow Tracking

resource "aws_ecs_cluster" "mlflow" {
  name = "mlflow"
}
 
resource "aws_db_instance" "mlflow" {
  identifier              = "mlflow"
  engine                  = "postgres"
  engine_version          = "16.3"
  instance_class          = "db.t4g.medium"
  allocated_storage       = 100
  storage_encrypted       = true
  username                = "mlflow"
  password                = random_password.db.result
  db_subnet_group_name    = aws_db_subnet_group.mlflow.name
  vpc_security_group_ids  = [aws_security_group.mlflow_db.id]
  skip_final_snapshot     = false
  final_snapshot_identifier = "mlflow-final"
}
 
resource "aws_ecs_service" "mlflow" {
  name            = "mlflow"
  cluster         = aws_ecs_cluster.mlflow.id
  task_definition = aws_ecs_task_definition.mlflow.arn
  desired_count   = 2
  launch_type     = "FARGATE"
 
  network_configuration {
    subnets          = module.vpc.private_subnets
    security_groups  = [aws_security_group.mlflow.id]
    assign_public_ip = false
  }
 
  load_balancer {
    target_group_arn = aws_lb_target_group.mlflow.arn
    container_name   = "mlflow"
    container_port   = 5000
  }
}
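
Two references above are never defined: `random_password.db` and `aws_ecs_task_definition.mlflow`. A minimal sketch, assuming a custom container image (the hypothetical `var.mlflow_image`) that bundles mlflow, psycopg2, and boto3, plus an execution role `aws_iam_role.mlflow_exec`:

```hcl
resource "random_password" "db" {
  length  = 32
  special = false
}

resource "aws_ecs_task_definition" "mlflow" {
  family                   = "mlflow"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "1024"
  memory                   = "2048"
  execution_role_arn       = aws_iam_role.mlflow_exec.arn

  container_definitions = jsonencode([{
    name         = "mlflow"
    image        = var.mlflow_image
    portMappings = [{ containerPort = 5000 }]
    command = [
      "mlflow", "server", "--host", "0.0.0.0", "--port", "5000",
      "--backend-store-uri",
      "postgresql://mlflow:${random_password.db.result}@${aws_db_instance.mlflow.address}:5432/mlflow",
      "--artifacts-destination",
      "s3://${aws_s3_bucket.checkpoints.bucket}/mlflow-artifacts",
    ]
  }])
}
```

In production, inject the database password through the container definition's `secrets` block (backed by Secrets Manager) rather than interpolating it into the command, where it sits in plaintext in the task definition.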

Best Practices

  • Freeze datasets in versioned buckets — interpretability claims need byte-exact reproducibility.
  • Pin transformer-lens / SAE library versions in the SageMaker lifecycle config.
  • Tag every run with paper-id, researcher, model, commit-sha for traceability.
  • Keep raw activations on FSx, derived results on S3 — different cost/latency profiles.
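
At the infrastructure level, the same traceability idea can be mirrored with provider `default_tags`, so every resource Terraform creates is stamped with experiment context. Tag values here are illustrative placeholders:

```hcl
provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = {
      project    = "interp-research"
      paper-id   = "placeholder"
      researcher = "placeholder"
      commit-sha = "placeholder"
    }
  }
}
```

Per-run tags such as `model` belong on the MLflow run itself rather than on cloud resources.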