Terraform for AI-Native Development Platforms on AWS
Provision AI-native developer platforms with Terraform: sandboxes, CI/CD runners, model-serving environments, secrets, VPCs, and preview environments.
Provision AI supercomputing infrastructure with Terraform. Deploy GPU clusters with p5.48xlarge, EFA networking, FSx Lustre storage
AI supercomputing is one of Gartner's top 2026 trends: the race for AI compute is reshaping how infrastructure teams provision GPU clusters, high-speed networking, and distributed storage. NVIDIA's Blackwell Ultra and AWS P5 instances make enterprise-scale AI training accessible, but provisioning them correctly requires careful infrastructure planning.
This guide shows how to provision AI training infrastructure with Terraform on AWS.
| Instance | GPUs | GPU Memory | Network | Use Case |
|---|---|---|---|---|
| g5.xlarge | 1× A10G | 24 GB | 10 Gbps | Inference, fine-tuning small models |
| g5.48xlarge | 8× A10G | 192 GB | 100 Gbps | Batch inference, medium training |
| p4d.24xlarge | 8× A100 | 320 GB | 400 Gbps EFA | Large model training |
| p5.48xlarge | 8× H100 | 640 GB | 3200 Gbps EFA | Frontier model training |
| trn1.32xlarge | 16× Trainium | 512 GB | 800 Gbps EFA | Cost-optimized training |
| inf2.48xlarge | 12× Inferentia2 | 384 GB | 100 Gbps | High-throughput inference |

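Multi-node GPU training relies on Elastic Fabric Adapter (EFA) for low-latency, high-bandwidth communication between nodes. Start with a VPC, a subnet in an Availability Zone that offers the instance type, and a cluster placement group:
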
resource "aws_vpc" "ai" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  tags                 = { Name = "ai-training-vpc" }
}

resource "aws_subnet" "gpu" {
  vpc_id            = aws_vpc.ai.id
  cidr_block        = "10.0.0.0/24"
  availability_zone = "us-east-1a" # GPU instances may be AZ-limited
  tags              = { Name = "gpu-subnet" }
}

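
Because GPU instance types are not offered in every Availability Zone, it can help to query the offerings before hard-coding the subnet's zone. The following is a minimal sketch using the `aws_ec2_instance_type_offerings` data source; the names `gpu_azs` and `gpu_capable_azs` are illustrative and not part of the original configuration.

```hcl
# Hypothetical names: "gpu_azs" and "gpu_capable_azs" are illustrative only.
# Lists the Availability Zones in the provider's region that offer p5.48xlarge.
data "aws_ec2_instance_type_offerings" "gpu_azs" {
  location_type = "availability-zone"

  filter {
    name   = "instance-type"
    values = ["p5.48xlarge"]
  }
}

output "gpu_capable_azs" {
  description = "Availability Zones in the current region that offer p5.48xlarge"
  value       = sort(data.aws_ec2_instance_type_offerings.gpu_azs.locations)
}
```

An offering only means the instance type exists in that zone. It does not guarantee that capacity is available when the cluster launches.
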
# Placement group for low-latency GPU-to-GPU communication
resource "aws_placement_group" "gpu_cluster" {
  name     = "gpu-cluster"
  strategy = "cluster" # Pack instances close together
}

resource "aws_instance" "gpu_worker" {
  count           = var.gpu_node_count
  ami             = data.aws_ami.deep_learning.id
  instance_type   = "p5.48xlarge"
  placement_group = aws_placement_group.gpu_cluster.id

  # EFA network interface. The subnet and security group come from the
  # interface itself; setting subnet_id here would conflict with this block.
  network_interface {
    network_interface_id = aws_network_interface.efa[count.index].id
    device_index         = 0
  }

  root_block_device {
    volume_size = 500
    volume_type = "gp3"
    throughput  = 1000
    iops        = 16000
  }

  # NVMe instance storage for scratch: on Nitro instances such as p5 these
  # volumes attach automatically, so no ephemeral_block_device mapping is
  # needed. Format and mount them in gpu-setup.sh.

  # user_data takes plain text; use user_data_base64 for pre-encoded content.
  user_data = templatefile("${path.module}/scripts/gpu-setup.sh", {
    fsx_dns    = aws_fsx_lustre_file_system.training_data.dns_name
    fsx_mount  = aws_fsx_lustre_file_system.training_data.mount_name
    node_rank  = count.index
    world_size = var.gpu_node_count
  })

  tags = {
    Name      = "gpu-worker-${count.index}"
    Component = "ai-training"
  }
}

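# EFA network interfaces (one per node in this example). p5.48xlarge has
# 32 network cards, so reaching the full 3200 Gbps requires additional
# EFA interfaces on the remaining cards.
resource "aws_network_interface" "efa" {
  count           = var.gpu_node_count
  subnet_id       = aws_subnet.gpu.id
  security_groups = [aws_security_group.gpu.id]
  interface_type  = "efa"
}
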
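
The snippets in this guide reference several input variables and an account identity data source that are never declared. A minimal sketch of those declarations could look like the following. The descriptions, defaults, and validation rule are assumptions, not values from the original configuration.

```hcl
# Assumed declarations for names the guide references but does not define.
# Defaults and descriptions are illustrative; adjust them to your environment.
variable "gpu_node_count" {
  description = "Number of on-demand GPU worker nodes"
  type        = number
  default     = 2 # assumption, not from the guide
}

variable "spot_gpu_count" {
  description = "Number of Spot GPU nodes to request"
  type        = number
  default     = 0 # assumption, not from the guide
}

variable "admin_cidr" {
  description = "CIDR block allowed to reach the cluster over SSH"
  type        = string

  validation {
    condition     = can(cidrhost(var.admin_cidr, 0))
    error_message = "admin_cidr must be a valid CIDR block, for example 203.0.113.0/24."
  }
}

variable "alert_email" {
  description = "Email address that receives budget notifications"
  type        = string
}

# Used to build a globally unique S3 bucket name from the account ID.
data "aws_caller_identity" "current" {}
```

Leaving `admin_cidr` and `alert_email` without defaults forces each deployment to set them explicitly, which avoids accidentally opening SSH to a placeholder range.
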
# Deep Learning AMI
data "aws_ami" "deep_learning" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["Deep Learning AMI GPU PyTorch *Ubuntu 22.04*"]
  }
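}

resource "aws_security_group" "gpu" {
  name_prefix = "gpu-cluster-"
  vpc_id      = aws_vpc.ai.id

  # Allow all traffic within the cluster (EFA needs this)
  ingress {
    from_port = 0
    to_port   = 0
    protocol  = "-1"
    self      = true
  }

  # SSH access
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [var.admin_cidr]
  }

  # AWS's EFA setup guide also calls for an outbound rule that targets the
  # security group itself.
  egress {
    from_port = 0
    to_port   = 0
    protocol  = "-1"
    self      = true
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_fsx_lustre_file_system" "training_data" {
  storage_capacity            = 4800 # GiB; PERSISTENT_2 accepts 1200 or multiples of 2400
  subnet_ids                  = [aws_subnet.gpu.id]
  deployment_type             = "PERSISTENT_2"
  per_unit_storage_throughput = 1000 # MB/s per TiB
  security_group_ids          = [aws_security_group.fsx.id]

  tags = {
    Component = "ai-training-storage"
  }
}

# PERSISTENT_2 does not accept import_path/export_path. Link the S3 bucket
# with a data repository association instead. The "/training" mount path is
# an assumed name; exports go back to the linked bucket path.
resource "aws_fsx_data_repository_association" "training_data" {
  file_system_id       = aws_fsx_lustre_file_system.training_data.id
  data_repository_path = "s3://${aws_s3_bucket.training_data.id}"
  file_system_path     = "/training"

  s3 {
    auto_import_policy { events = ["NEW", "CHANGED", "DELETED"] }
    auto_export_policy { events = ["NEW", "CHANGED", "DELETED"] }
  }
}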
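
The file system references `aws_security_group.fsx`, which the guide does not define. One possible definition is sketched below, assuming the GPU nodes are the only Lustre clients. FSx for Lustre uses TCP port 988 and ports 1018-1023; the resource layout and descriptions are illustrative.

```hcl
# Assumed definition of the security group referenced by the FSx file system.
# Lustre traffic uses TCP 988 and TCP 1018-1023.
resource "aws_security_group" "fsx" {
  name_prefix = "fsx-lustre-"
  vpc_id      = aws_vpc.ai.id

  ingress {
    description     = "Lustre traffic from GPU nodes and between file servers"
    from_port       = 988
    to_port         = 988
    protocol        = "tcp"
    security_groups = [aws_security_group.gpu.id]
    self            = true
  }

  ingress {
    description     = "Lustre traffic from GPU nodes and between file servers"
    from_port       = 1018
    to_port         = 1023
    protocol        = "tcp"
    security_groups = [aws_security_group.gpu.id]
    self            = true
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

The client side may need matching inbound rules on the same ports from this group. Check the FSx for Lustre access-control documentation before relying on this sketch.
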

resource "aws_s3_bucket" "training_data" {
  bucket = "ai-training-data-${data.aws_caller_identity.current.account_id}"
}

GPU instances are expensive. Use Spot instances for fault-tolerant training:

resource "aws_spot_instance_request" "gpu_spot" {
  count                          = var.spot_gpu_count
  ami                            = data.aws_ami.deep_learning.id
  instance_type                  = "p4d.24xlarge"
  spot_price                     = "15.00" # Max hourly bid
  wait_for_fulfillment           = true
  spot_type                      = "persistent"
  instance_interruption_behavior = "stop"
  placement_group                = aws_placement_group.gpu_cluster.id
  subnet_id                      = aws_subnet.gpu.id
  tags                           = { Name = "gpu-spot-${count.index}" }
}

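
The AWS provider documentation describes `aws_spot_instance_request` as built on legacy APIs and suggests `aws_instance` with `instance_market_options` instead. Tags on the request resource also apply to the Spot request rather than to the launched instance. A hedged sketch of the equivalent configuration follows, assuming a provider version that supports `instance_market_options`; the resource name `gpu_spot_alt` is hypothetical.

```hcl
# Alternative sketch: the same Spot settings expressed on aws_instance.
# "gpu_spot_alt" is a hypothetical name, not part of the original guide.
resource "aws_instance" "gpu_spot_alt" {
  count           = var.spot_gpu_count
  ami             = data.aws_ami.deep_learning.id
  instance_type   = "p4d.24xlarge"
  subnet_id       = aws_subnet.gpu.id
  placement_group = aws_placement_group.gpu_cluster.id

  instance_market_options {
    market_type = "spot"

    spot_options {
      max_price                      = "15.00" # Max hourly price, as in the guide
      spot_instance_type             = "persistent"
      instance_interruption_behavior = "stop"
    }
  }

  # These tags land on the instance itself.
  tags = { Name = "gpu-spot-${count.index}" }
}
```

Use one approach or the other, not both, or the cluster will request twice the intended Spot capacity.
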
# Budget alert
resource "aws_budgets_budget" "gpu_compute" {
  name         = "ai-training-compute"
  budget_type  = "COST"
  limit_amount = "10000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 50
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [var.alert_email]
  }
}

| Instance | On-Demand/hr | Spot/hr (~) | 8-node cluster/day (On-Demand) | 8-node cluster/day (Spot, ~) |
|---|---|---|---|---|
| p4d.24xlarge | $32.77 | ~$12 | $6,292 | ~$2,304 |
| p5.48xlarge | $98.32 | ~$35 | $18,877 | ~$6,720 |
| trn1.32xlarge | $21.50 | ~$7 | $4,128 | ~$1,344 |
| g5.48xlarge | $16.29 | ~$6 | $3,128 | ~$1,152 |

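
The daily cluster figures follow from nodes × 24 hours × hourly rate. For example, 8 p4d.24xlarge nodes at $32.77 per hour come to 32.77 × 8 × 24 = $6,291.84 per day on-demand, and about $2,304 at the ~$12 Spot rate. The sketch below reproduces that arithmetic in Terraform using the hourly rates from the table. All names here (`cluster_nodes`, `hourly_usd`, `daily_cluster_usd`) are hypothetical, and the rates are point-in-time estimates rather than live pricing.

```hcl
# Hypothetical names throughout; hourly rates are copied from the table.
variable "cluster_nodes" {
  description = "Number of nodes in the cluster being priced"
  type        = number
  default     = 8
}

locals {
  hourly_usd = {
    "p4d.24xlarge"  = { on_demand = 32.77, spot = 12 }
    "p5.48xlarge"   = { on_demand = 98.32, spot = 35 }
    "trn1.32xlarge" = { on_demand = 21.50, spot = 7 }
    "g5.48xlarge"   = { on_demand = 16.29, spot = 6 }
  }

  # Daily cost = hourly rate x node count x 24 hours.
  daily_cluster_usd = {
    for type, rate in local.hourly_usd : type => {
      on_demand   = rate.on_demand * var.cluster_nodes * 24
      spot        = rate.spot * var.cluster_nodes * 24
      savings_pct = floor((1 - rate.spot / rate.on_demand) * 100)
    }
  }
}

output "daily_cluster_usd" {
  description = "Estimated daily cost per instance type for the whole cluster"
  value       = local.daily_cluster_usd
}
```

Running `terraform plan` with this fragment prints the estimates, which makes it easy to compare them against the budget limit before launching a cluster.
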
AI supercomputing infrastructure on AWS requires GPU instances, EFA networking for multi-node training, FSx for Lustre for high-throughput storage, and placement groups for low-latency communication. Terraform makes GPU clusters reproducible: spin up for training, tear down when done, and use Spot instances to cut costs by roughly 60-70% at the Spot rates shown above. As AI training scales in 2026, infrastructure-as-code is the most practical way to manage GPU fleet provisioning.