AI supercomputing is one of Gartner’s top 2026 trends — the race for AI compute is reshaping how infrastructure teams provision GPU clusters, high-speed networking, and distributed storage. NVIDIA’s Blackwell Ultra and AWS P5 instances make enterprise-scale AI training accessible, but provisioning it correctly requires careful infrastructure planning.
This guide shows how to provision AI training infrastructure with Terraform on AWS.
GPU Instance Types for AI Workloads
| Instance | GPUs | GPU Memory | Network | Use Case |
|---|---|---|---|---|
| g5.xlarge | 1× A10G | 24 GB | 10 Gbps | Inference, fine-tuning small models |
| g5.48xlarge | 8× A10G | 192 GB | 100 Gbps | Batch inference, medium training |
| p4d.24xlarge | 8× A100 | 320 GB | 400 Gbps EFA | Large model training |
| p5.48xlarge | 8× H100 | 640 GB | 3200 Gbps EFA | Frontier model training |
| trn1.32xlarge | 16× Trainium | 512 GB | 800 Gbps EFA | Cost-optimized training |
| inf2.48xlarge | 12× Inferentia2 | 384 GB | 100 Gbps | High-throughput inference |
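The snippets in this guide reference a handful of input variables (`var.gpu_node_count`, `var.spot_gpu_count`, `var.admin_cidr`, `var.alert_email`). A minimal declaration for them, with assumed types and defaults, might look like:

```hcl
variable "gpu_node_count" {
  description = "Number of GPU worker nodes in the training cluster"
  type        = number
  default     = 2
}

variable "spot_gpu_count" {
  description = "Number of Spot GPU instances for fault-tolerant jobs"
  type        = number
  default     = 0
}

variable "admin_cidr" {
  description = "CIDR block allowed SSH access to the cluster"
  type        = string
}

variable "alert_email" {
  description = "Email address for budget notifications"
  type        = string
}
```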
VPC with EFA Networking
Elastic Fabric Adapter (EFA) provides the OS-bypass networking that multi-node GPU training effectively requires:
```hcl
resource "aws_vpc" "ai" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true

  tags = { Name = "ai-training-vpc" }
}

resource "aws_subnet" "gpu" {
  vpc_id            = aws_vpc.ai.id
  cidr_block        = "10.0.0.0/24"
  availability_zone = "us-east-1a" # GPU instances may be AZ-limited

  tags = { Name = "gpu-subnet" }
}

# Placement group for low-latency GPU-to-GPU communication
resource "aws_placement_group" "gpu_cluster" {
  name     = "gpu-cluster"
  strategy = "cluster" # Pack instances close together
}
```
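Before pinning an AZ, it helps to check which zones actually offer the instance type you want. A sketch using the `aws_ec2_instance_type_offerings` data source:

```hcl
# Which AZs in the current region offer p5.48xlarge?
data "aws_ec2_instance_type_offerings" "p5" {
  filter {
    name   = "instance-type"
    values = ["p5.48xlarge"]
  }

  location_type = "availability-zone"
}

output "p5_availability_zones" {
  value = data.aws_ec2_instance_type_offerings.p5.locations
}
```

Run `terraform apply` (or `terraform console`) and pick one of the listed zones for the GPU subnet.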
GPU Training Cluster
```hcl
resource "aws_instance" "gpu_worker" {
  count = var.gpu_node_count

  ami             = data.aws_ami.deep_learning.id
  instance_type   = "p5.48xlarge"
  placement_group = aws_placement_group.gpu_cluster.id

  # EFA network interface. Note: subnet_id is omitted on the instance
  # because it conflicts with an explicit network_interface block; the
  # ENI already carries the subnet. p5.48xlarge supports many EFA
  # interfaces; one is shown here for brevity.
  network_interface {
    network_interface_id = aws_network_interface.efa[count.index].id
    device_index         = 0
  }

  root_block_device {
    volume_size = 500
    volume_type = "gp3"
    throughput  = 1000
    iops        = 16000
  }

  # p5 NVMe instance-store volumes are attached automatically and show
  # up as /dev/nvme* devices; format and mount them for scratch space
  # in user data (ephemeral_block_device mappings do not apply to NVMe
  # instance store).

  user_data = base64encode(templatefile("${path.module}/scripts/gpu-setup.sh", {
    fsx_dns    = aws_fsx_lustre_file_system.training_data.dns_name
    fsx_mount  = aws_fsx_lustre_file_system.training_data.mount_name
    node_rank  = count.index
    world_size = var.gpu_node_count
  }))

  tags = {
    Name      = "gpu-worker-${count.index}"
    Component = "ai-training"
  }
}
```
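Distributed training launchers (torchrun, MPI) need the worker addresses. A small sketch of outputs for wiring up a hostfile, assuming the resource names used above:

```hcl
# Private IPs of all workers, e.g. for building an MPI/torchrun hostfile
output "gpu_worker_private_ips" {
  value = aws_instance.gpu_worker[*].private_ip
}

# Convention: treat node 0 as the rendezvous master
output "master_addr" {
  value = aws_instance.gpu_worker[0].private_ip
}
```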
```hcl
# EFA network interfaces
resource "aws_network_interface" "efa" {
  count = var.gpu_node_count

  subnet_id       = aws_subnet.gpu.id
  security_groups = [aws_security_group.gpu.id]
  interface_type  = "efa"
}
```
```hcl
# Deep Learning AMI. The name pattern is wildcarded; verify it matches
# the current DLAMI release naming in your region before relying on it.
data "aws_ami" "deep_learning" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["Deep Learning*AMI GPU PyTorch*Ubuntu 22.04*"]
  }
}
```
Security Group for GPU Cluster
```hcl
resource "aws_security_group" "gpu" {
  name_prefix = "gpu-cluster-"
  vpc_id      = aws_vpc.ai.id

  # Allow all traffic within the cluster (EFA requires a
  # self-referencing allow-all rule)
  ingress {
    from_port = 0
    to_port   = 0
    protocol  = "-1"
    self      = true
  }

  # SSH access
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [var.admin_cidr]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```
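The FSx file system later in this guide references `aws_security_group.fsx`, which is never defined. A sketch of that security group, opening the Lustre ports (TCP 988 and 1018-1023) to the GPU cluster:

```hcl
resource "aws_security_group" "fsx" {
  name_prefix = "fsx-lustre-"
  vpc_id      = aws_vpc.ai.id

  # Lustre traffic from the GPU cluster
  ingress {
    from_port       = 988
    to_port         = 988
    protocol        = "tcp"
    security_groups = [aws_security_group.gpu.id]
  }

  ingress {
    from_port       = 1018
    to_port         = 1023
    protocol        = "tcp"
    security_groups = [aws_security_group.gpu.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```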
FSx for Lustre: High-Performance Training Data
```hcl
resource "aws_fsx_lustre_file_system" "training_data" {
  storage_capacity            = 4800 # GiB: 1200, 2400, then multiples of 2400 for PERSISTENT_2
  subnet_ids                  = [aws_subnet.gpu.id]
  deployment_type             = "PERSISTENT_2"
  per_unit_storage_throughput = 1000 # MB/s per TiB
  security_group_ids          = [aws_security_group.fsx.id]

  tags = {
    Component = "ai-training-storage"
  }
}

# PERSISTENT_2 does not support import_path/export_path; link the
# S3 bucket through a data repository association instead
resource "aws_fsx_data_repository_association" "training_data" {
  file_system_id       = aws_fsx_lustre_file_system.training_data.id
  data_repository_path = "s3://${aws_s3_bucket.training_data.id}"
  file_system_path     = "/data"

  s3 {
    auto_import_policy { events = ["NEW", "CHANGED", "DELETED"] }
    auto_export_policy { events = ["NEW", "CHANGED", "DELETED"] }
  }
}
```
```hcl
resource "aws_s3_bucket" "training_data" {
  bucket = "ai-training-data-${data.aws_caller_identity.current.account_id}"
}
```
Cost Management
GPU instances are expensive. Use Spot instances for fault-tolerant training:
```hcl
resource "aws_spot_instance_request" "gpu_spot" {
  count = var.spot_gpu_count

  ami                            = data.aws_ami.deep_learning.id
  instance_type                  = "p4d.24xlarge"
  spot_price                     = "15.00" # Max hourly bid
  wait_for_fulfillment           = true
  spot_type                      = "persistent"
  instance_interruption_behavior = "stop"
  placement_group                = aws_placement_group.gpu_cluster.id
  subnet_id                      = aws_subnet.gpu.id

  # Note: these tags land on the Spot request, not the launched instance
  tags = { Name = "gpu-spot-${count.index}" }
}
```
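Spot instances get a two-minute interruption warning, which is the window for checkpointing a training job. A sketch that forwards those warnings to an SNS topic via EventBridge (topic name is an assumption, and the SNS topic policy granting EventBridge publish rights is omitted for brevity):

```hcl
resource "aws_sns_topic" "spot_warnings" {
  name = "spot-interruption-warnings"
}

# Fires ~2 minutes before EC2 reclaims a Spot instance
resource "aws_cloudwatch_event_rule" "spot_interruption" {
  name = "spot-interruption-warning"

  event_pattern = jsonencode({
    source      = ["aws.ec2"]
    detail-type = ["EC2 Spot Instance Interruption Warning"]
  })
}

resource "aws_cloudwatch_event_target" "spot_to_sns" {
  rule = aws_cloudwatch_event_rule.spot_interruption.name
  arn  = aws_sns_topic.spot_warnings.arn
}
```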
```hcl
# Budget alert
resource "aws_budgets_budget" "gpu_compute" {
  name         = "ai-training-compute"
  budget_type  = "COST"
  limit_amount = "10000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 50
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [var.alert_email]
  }
}
```
Cost Comparison
| Instance | On-Demand/hr | Spot/hr (~) | 8-node cluster/day (Spot, ~) |
|---|---|---|---|
| p4d.24xlarge | $32.77 | ~$12 | ~$2,304 |
| p5.48xlarge | $98.32 | ~$35 | ~$6,720 |
| trn1.32xlarge | $21.50 | ~$7 | ~$1,344 |
| g5.48xlarge | $16.29 | ~$6 | ~$1,176 |
Hands-On Courses
- Terraform for Beginners on CopyPasteLearn
- Terraform By Example — practical code examples
Conclusion
AI supercomputing infrastructure on AWS comes down to GPU instances, EFA networking for multi-node training, FSx for Lustre for high-throughput storage, and placement groups for low-latency communication. Terraform makes GPU clusters reproducible: spin up for training, tear down when done, and use Spot instances to cut costs by 60-70%. As AI training scales in 2026, infrastructure-as-code is the only practical way to manage GPU fleet provisioning.