Terraform for Mechanistic Interpretability Research: GPU Notebooks, Checkpoints, and Experiment Tracking
Provision mechanistic interpretability research infrastructure with Terraform: research compute, experiment tracking, model checkpoints, and notebooks.
Mechanistic interpretability — reverse-engineering neural networks to understand circuits, features, and reasoning — is one of the more research-heavy 2026 trends. Labs running interpretability work need GPU notebooks, large model-checkpoint storage, experiment tracking, and reproducible Jupyter environments. Terraform turns that into a one-command lab setup.
This guide shows how to provision a research environment on AWS for interpretability work (sparse autoencoders, activation patching, circuit analysis).
| Layer | AWS service |
|---|---|
| GPU notebooks | SageMaker Studio, EC2 with NVIDIA AMI |
| Large checkpoint storage | S3 + FSx for Lustre |
| Experiment tracking | Self-hosted MLflow on ECS |
| Visualization | EC2 with neuronpedia/transformer-lens |
| Datasets | S3 (frozen, versioned) |
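Everything below assumes a pinned AWS provider; the region is illustrative, so point it at wherever your GPU quota lives:

```hcl
terraform {
  required_version = ">= 1.6"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1" # illustrative; use the region holding your GPU capacity
}
```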
resource "aws_sagemaker_domain" "interp" {
domain_name = "interp-research"
auth_mode = "IAM"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnets
app_network_access_type = "VpcOnly"
default_user_settings {
execution_role = aws_iam_role.studio.arn
jupyter_server_app_settings {
default_resource_spec {
instance_type = "ml.t3.medium"
sagemaker_image_arn = data.aws_sagemaker_prebuilt_ecr_image.jupyter.arn
}
}
kernel_gateway_app_settings {
default_resource_spec {
instance_type = "ml.g5.2xlarge"
}
}
}
}
resource "aws_sagemaker_user_profile" "researcher" {
for_each = toset(var.researchers)
domain_id = aws_sagemaker_domain.interp.id
user_profile_name = each.key
}resource "aws_s3_bucket" "checkpoints" {
bucket = "acme-interp-checkpoints"
}
resource "aws_s3_bucket_versioning" "checkpoints" {
bucket = aws_s3_bucket.checkpoints.id
versioning_configuration { status = "Enabled" }
}
resource "aws_s3_bucket_lifecycle_configuration" "checkpoints" {
bucket = aws_s3_bucket.checkpoints.id
rule {
id = "tier-old-runs"
status = "Enabled"
transition {
days = 30
storage_class = "INTELLIGENT_TIERING"
}
noncurrent_version_transition {
noncurrent_days = 60
storage_class = "GLACIER"
}
}
}Activation patching loads tens of GB of cached activations; FSx keeps it on the GPU's doorstep:
resource "aws_fsx_lustre_file_system" "activations" {
storage_capacity = 4800
subnet_ids = [module.vpc.private_subnets[0]]
deployment_type = "PERSISTENT_2"
per_unit_storage_throughput = 1000
data_compression_type = "LZ4"
auto_import_policy = "NEW_CHANGED_DELETED"
import_path = "s3://${aws_s3_bucket.checkpoints.bucket}/activations/"
export_path = "s3://${aws_s3_bucket.checkpoints.bucket}/activations/"
}resource "aws_ecs_cluster" "mlflow" {
name = "mlflow"
}
resource "aws_db_instance" "mlflow" {
identifier = "mlflow"
engine = "postgres"
engine_version = "16.3"
instance_class = "db.t4g.medium"
allocated_storage = 100
storage_encrypted = true
username = "mlflow"
password = random_password.db.result
db_subnet_group_name = aws_db_subnet_group.mlflow.name
vpc_security_group_ids = [aws_security_group.mlflow_db.id]
skip_final_snapshot = false
final_snapshot_identifier = "mlflow-final"
}
resource "aws_ecs_service" "mlflow" {
name = "mlflow"
cluster = aws_ecs_cluster.mlflow.id
task_definition = aws_ecs_task_definition.mlflow.arn
desired_count = 2
launch_type = "FARGATE"
network_configuration {
subnets = module.vpc.private_subnets
security_groups = [aws_security_group.mlflow.id]
assign_public_ip = false
}
load_balancer {
target_group_arn = aws_lb_target_group.mlflow.arn
container_name = "mlflow"
container_port = 5000
}
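The task definition referenced above isn't shown here. A minimal sketch, assuming an MLflow image with psycopg2 and boto3 baked in (`var.mlflow_image` and the execution role are placeholders), pointing the backend store at RDS and artifacts at the checkpoints bucket:

```hcl
resource "random_password" "db" {
  length  = 32
  special = false
}

resource "aws_ecs_task_definition" "mlflow" {
  family                   = "mlflow"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 512
  memory                   = 1024
  execution_role_arn       = aws_iam_role.mlflow_exec.arn # hypothetical role

  container_definitions = jsonencode([{
    name  = "mlflow"
    image = var.mlflow_image # assumed: an MLflow image with psycopg2 + boto3
    command = [
      "mlflow", "server",
      "--host", "0.0.0.0",
      "--port", "5000",
      "--backend-store-uri",
      "postgresql://mlflow:${random_password.db.result}@${aws_db_instance.mlflow.address}:5432/mlflow",
      "--artifacts-destination", "s3://${aws_s3_bucket.checkpoints.bucket}/mlflow-artifacts"
    ]
    portMappings = [{ containerPort = 5000, protocol = "tcp" }]
  }])
}
```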
Tag every run with paper-id, researcher, model, and commit-sha for traceability.