
Terraform for AI Supercomputing: Provision GPU Clusters and NVIDIA DGX on AWS

Key Takeaway

Provision AI supercomputing infrastructure with Terraform. Deploy GPU clusters with p5.48xlarge, EFA networking, FSx Lustre storage, and auto-scaling for model training on AWS.


AI supercomputing is one of Gartner’s top 2026 trends — the race for AI compute is reshaping how infrastructure teams provision GPU clusters, high-speed networking, and distributed storage. NVIDIA’s Blackwell Ultra and AWS P5 instances make enterprise-scale AI training accessible, but provisioning it correctly requires careful infrastructure planning.

This guide shows how to provision AI training infrastructure with Terraform on AWS.
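The snippets below reference a few input variables (`var.gpu_node_count`, `var.spot_gpu_count`, `var.admin_cidr`, `var.alert_email`) and the caller's account ID. A minimal variables file might look like this; the defaults are illustrative assumptions, not recommendations:

```hcl
variable "gpu_node_count" {
  description = "Number of on-demand GPU training nodes"
  type        = number
  default     = 2
}

variable "spot_gpu_count" {
  description = "Number of Spot GPU nodes for fault-tolerant jobs"
  type        = number
  default     = 0
}

variable "admin_cidr" {
  description = "CIDR block allowed to SSH into the cluster"
  type        = string
}

variable "alert_email" {
  description = "Email address for budget notifications"
  type        = string
}

# Used later to build a globally unique S3 bucket name
data "aws_caller_identity" "current" {}
```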

GPU Instance Types for AI Workloads

| Instance | GPUs | GPU Memory | Network | Use Case |
|---|---|---|---|---|
| g5.xlarge | 1× A10G | 24 GB | 10 Gbps | Inference, fine-tuning small models |
| g5.48xlarge | 8× A10G | 192 GB | 100 Gbps | Batch inference, medium training |
| p4d.24xlarge | 8× A100 | 320 GB | 400 Gbps EFA | Large model training |
| p5.48xlarge | 8× H100 | 640 GB | 3200 Gbps EFA | Frontier model training |
| trn1.32xlarge | 16× Trainium | 512 GB | 800 Gbps EFA | Cost-optimized training |
| inf2.48xlarge | 12× Inferentia2 | 384 GB | 100 Gbps | High-throughput inference |

VPC with EFA Networking

Elastic Fabric Adapter (EFA) is required for multi-node GPU training:

resource "aws_vpc" "ai" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true

  tags = { Name = "ai-training-vpc" }
}

resource "aws_subnet" "gpu" {
  vpc_id            = aws_vpc.ai.id
  cidr_block        = "10.0.0.0/24"
  availability_zone = "us-east-1a"  # GPU instances may be AZ-limited

  tags = { Name = "gpu-subnet" }
}

# Placement group for low-latency GPU-to-GPU communication
resource "aws_placement_group" "gpu_cluster" {
  name     = "gpu-cluster"
  strategy = "cluster"  # Pack instances close together
}
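The VPC above has no route to the internet, but training nodes typically need outbound access to pull containers and Python packages (VPC endpoints or a NAT gateway are the alternatives for private subnets). A minimal sketch, assuming public egress is acceptable:

```hcl
resource "aws_internet_gateway" "ai" {
  vpc_id = aws_vpc.ai.id
}

resource "aws_route_table" "gpu" {
  vpc_id = aws_vpc.ai.id

  # Default route for outbound traffic (package installs, S3, ECR)
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.ai.id
  }
}

resource "aws_route_table_association" "gpu" {
  subnet_id      = aws_subnet.gpu.id
  route_table_id = aws_route_table.gpu.id
}
```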

GPU Training Cluster

resource "aws_instance" "gpu_worker" {
  count = var.gpu_node_count

  ami             = data.aws_ami.deep_learning.id
  instance_type   = "p5.48xlarge"
  placement_group = aws_placement_group.gpu_cluster.name

  # Primary EFA network interface. The subnet comes from the attached
  # interface, so subnet_id must not also be set on the instance.
  network_interface {
    network_interface_id = aws_network_interface.efa[count.index].id
    device_index         = 0
  }

  root_block_device {
    volume_size = 500
    volume_type = "gp3"
    throughput  = 1000
    iops        = 16000
  }

  # p5 NVMe instance-store volumes attach automatically (no block
  # device mapping needed); format and mount them as scratch space
  # in the setup script

  user_data = base64encode(templatefile("${path.module}/scripts/gpu-setup.sh", {
    fsx_dns    = aws_fsx_lustre_file_system.training_data.dns_name
    fsx_mount  = aws_fsx_lustre_file_system.training_data.mount_name
    node_rank  = count.index
    world_size = var.gpu_node_count
  }))

  tags = {
    Name      = "gpu-worker-${count.index}"
    Component = "ai-training"
  }
}

# EFA network interfaces
resource "aws_network_interface" "efa" {
  count = var.gpu_node_count

  subnet_id       = aws_subnet.gpu.id
  security_groups = [aws_security_group.gpu.id]
  interface_type  = "efa"
}

# Deep Learning AMI
data "aws_ami" "deep_learning" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["Deep Learning AMI GPU PyTorch *Ubuntu 22.04*"]
  }
}
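Distributed launchers such as torchrun need the workers' addresses for rendezvous; exposing them as a Terraform output avoids hardcoding IPs in job scripts. The output name here is illustrative:

```hcl
output "gpu_worker_private_ips" {
  description = "Private IPs of the GPU workers (rank 0 is the rendezvous host)"
  value       = aws_network_interface.efa[*].private_ip
}
```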

Security Group for GPU Cluster

resource "aws_security_group" "gpu" {
  name_prefix = "gpu-cluster-"
  vpc_id      = aws_vpc.ai.id

  # Allow all traffic within the cluster (EFA needs this)
  ingress {
    from_port = 0
    to_port   = 0
    protocol  = "-1"
    self      = true
  }

  # SSH access
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [var.admin_cidr]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

FSx for Lustre: High-Performance Training Data

resource "aws_fsx_lustre_file_system" "training_data" {
  storage_capacity            = 4800  # GB; must be 1200 or a multiple of 2400 for PERSISTENT_2
  subnet_ids                  = [aws_subnet.gpu.id]
  deployment_type             = "PERSISTENT_2"
  per_unit_storage_throughput = 1000  # MB/s per TiB
  security_group_ids          = [aws_security_group.fsx.id]

  tags = {
    Component = "ai-training-storage"
  }
}

# PERSISTENT_2 does not support import_path/export_path; link the
# S3 bucket through a data repository association instead
resource "aws_fsx_data_repository_association" "training_data" {
  file_system_id       = aws_fsx_lustre_file_system.training_data.id
  data_repository_path = "s3://${aws_s3_bucket.training_data.id}"
  file_system_path     = "/data"

  s3 {
    auto_import_policy {
      events = ["NEW", "CHANGED", "DELETED"]
    }
    auto_export_policy {
      events = ["NEW", "CHANGED", "DELETED"]
    }
  }
}

resource "aws_s3_bucket" "training_data" {
  bucket = "ai-training-data-${data.aws_caller_identity.current.account_id}"
}
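The file system references `aws_security_group.fsx`, which is not defined above. FSx for Lustre uses TCP port 988 plus 1018-1023 for Lustre traffic, so a sketch might look like this (in practice the rules should also be self-referencing so FSx network interfaces can talk to each other):

```hcl
resource "aws_security_group" "fsx" {
  name_prefix = "fsx-lustre-"
  vpc_id      = aws_vpc.ai.id

  # Lustre protocol traffic from the GPU cluster
  ingress {
    from_port       = 988
    to_port         = 988
    protocol        = "tcp"
    security_groups = [aws_security_group.gpu.id]
  }

  ingress {
    from_port       = 1018
    to_port         = 1023
    protocol        = "tcp"
    security_groups = [aws_security_group.gpu.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```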

Cost Management

GPU instances are expensive. Use Spot instances for fault-tolerant training:

resource "aws_spot_instance_request" "gpu_spot" {
  count = var.spot_gpu_count

  ami                    = data.aws_ami.deep_learning.id
  instance_type          = "p4d.24xlarge"
  spot_price             = "15.00"  # Max hourly bid
  wait_for_fulfillment   = true
  spot_type              = "persistent"
  instance_interruption_behavior = "stop"

  placement_group = aws_placement_group.gpu_cluster.id
  subnet_id       = aws_subnet.gpu.id

  tags = { Name = "gpu-spot-${count.index}" }
}
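One caveat with `aws_spot_instance_request`: its `tags` apply to the Spot request itself, not the launched instance, and AWS now recommends requesting Spot capacity through the regular instance resource. An equivalent sketch using `instance_market_options`:

```hcl
resource "aws_instance" "gpu_spot_v2" {
  count = var.spot_gpu_count

  ami             = data.aws_ami.deep_learning.id
  instance_type   = "p4d.24xlarge"
  subnet_id       = aws_subnet.gpu.id
  placement_group = aws_placement_group.gpu_cluster.name

  instance_market_options {
    market_type = "spot"

    spot_options {
      max_price                      = "15.00"  # omit to default to the on-demand price
      spot_instance_type             = "persistent"
      instance_interruption_behavior = "stop"
    }
  }

  # Tags land on the instance, unlike with aws_spot_instance_request
  tags = { Name = "gpu-spot-${count.index}" }
}
```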

# Budget alert
resource "aws_budgets_budget" "gpu_compute" {
  name         = "ai-training-compute"
  budget_type  = "COST"
  limit_amount = "10000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 50
    threshold_type            = "PERCENTAGE"
    notification_type         = "ACTUAL"
    subscriber_email_addresses = [var.alert_email]
  }
}

Cost Comparison

| Instance | On-Demand/hr | Spot/hr (~) | 8-node Spot cluster/day (~) |
|---|---|---|---|
| p4d.24xlarge | $32.77 | ~$12 | ~$2,304 |
| p5.48xlarge | $98.32 | ~$35 | ~$6,720 |
| trn1.32xlarge | $21.50 | ~$7 | ~$1,344 |
| g5.48xlarge | $16.29 | ~$6 | ~$1,152 |

Conclusion

AI supercomputing infrastructure on AWS requires GPU instances, EFA networking for multi-node training, FSx Lustre for high-throughput storage, and placement groups for low-latency communication. Terraform makes GPU clusters reproducible — spin up for training, tear down when done, and use Spot instances to cut costs by 60-70%. As AI training scales in 2026, infrastructure-as-code is the only practical way to manage GPU fleet provisioning.

Written by Luca Berton

DevOps Engineer, AWS Partner, Terraform expert, and author. Creator of Ansible Pilot, Terraform Pilot, and CopyPasteLearn.