Terraform for AI Supercomputing: Provision GPU Clusters and NVIDIA DGX on AWS

Provision AI supercomputing infrastructure with Terraform. Deploy GPU clusters with p5.48xlarge instances, EFA networking, and FSx for Lustre storage.

Luca Berton | April 12, 2026 | 2 min read

AI supercomputing is one of Gartner's top 2026 trends: the race for AI compute is reshaping how infrastructure teams provision GPU clusters, high-speed networking, and distributed storage. NVIDIA's Blackwell Ultra and AWS P5 instances make enterprise-scale AI training accessible, but a cluster only performs well when instance placement, networking, and storage are planned together.

This guide shows how to provision AI training infrastructure with Terraform on AWS.

GPU Instance Types for AI Workloads

Instance	Accelerators	Accelerator Memory (total)	Network	Use Case
`g5.xlarge`	1× A10G	24 GB	Up to 10 Gbps	Inference, fine-tuning small models
`g5.48xlarge`	8× A10G	192 GB	100 Gbps	Batch inference, medium training
`p4d.24xlarge`	8× A100	320 GB	400 Gbps EFA	Large model training
`p5.48xlarge`	8× H100	640 GB	3200 Gbps EFA	Frontier model training
`trn1.32xlarge`	16× Trainium	512 GB	800 Gbps EFA	Cost-optimized training
`inf2.48xlarge`	12× Inferentia2	384 GB	100 Gbps	High-throughput inference

VPC with EFA Networking

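Elastic Fabric Adapter (EFA) provides the low-latency, high-bandwidth interconnect that multi-node GPU training relies on. EFA traffic between nodes stays inside one subnet, so all GPU nodes go into a single Availability Zone and a cluster placement group: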

resource "aws_vpc" "ai" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
 
  tags = { Name = "ai-training-vpc" }
}
 
resource "aws_subnet" "gpu" {
  vpc_id            = aws_vpc.ai.id
  cidr_block        = "10.0.0.0/24"
  availability_zone = "us-east-1a"  # GPU instances may be AZ-limited
 
  tags = { Name = "gpu-subnet" }
}
 
# Placement group for low-latency GPU-to-GPU communication
resource "aws_placement_group" "gpu_cluster" {
  name     = "gpu-cluster"
  strategy = "cluster"  # Pack instances close together
}

GPU Training Cluster

resource "aws_instance" "gpu_worker" {
  count = var.gpu_node_count

  ami             = data.aws_ami.deep_learning.id
  instance_type   = "p5.48xlarge"
  placement_group = aws_placement_group.gpu_cluster.id

  # Primary EFA network interface. The subnet and security group are set on
  # the interface itself; subnet_id conflicts with a network_interface block.
  # A single interface uses only one of the instance's network cards, so it
  # does not reach the full 3200 Gbps listed above.
  network_interface {
    network_interface_id = aws_network_interface.efa[count.index].id
    device_index         = 0
  }

  root_block_device {
    volume_size = 500 # GiB
    volume_type = "gp3"
    throughput  = 1000 # MiB/s
    iops        = 16000
  }

  # NVMe instance store volumes on p5 are attached automatically, so no
  # ephemeral_block_device mapping is needed. Format and mount them for
  # scratch space in the setup script.

  # user_data takes plain text; use user_data_base64 for pre-encoded data.
  user_data = templatefile("${path.module}/scripts/gpu-setup.sh", {
    fsx_dns    = aws_fsx_lustre_file_system.training_data.dns_name
    fsx_mount  = aws_fsx_lustre_file_system.training_data.mount_name
    node_rank  = count.index
    world_size = var.gpu_node_count
  })

  tags = {
    Name      = "gpu-worker-${count.index}"
    Component = "ai-training"
  }
}
 
# EFA network interfaces
resource "aws_network_interface" "efa" {
  count = var.gpu_node_count
 
  subnet_id       = aws_subnet.gpu.id
  security_groups = [aws_security_group.gpu.id]
  interface_type  = "efa"
}
 
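# Deep Learning AMI
# AMI names have changed over time (newer images include "OSS Nvidia Driver"
# in the name), so the pattern below is deliberately loose. Check which images
# it matches in your region before relying on it.
data "aws_ami" "deep_learning" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["Deep Learning*AMI GPU PyTorch *Ubuntu 22.04*"]
  }
}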
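
The instance table lists 3200 Gbps for `p5.48xlarge`, but that figure is spread across many network cards, and the cluster above attaches one EFA interface per node. A minimal sketch of a launch template that attaches one EFA interface per network card is shown here. The count of 32 network cards, and the rule that card 0 uses device index 0 while every other card uses device index 1, reflect my reading of the AWS guidance for P5 and are assumptions to verify against the current EC2 documentation. The resource name `gpu_worker_multi_efa` is hypothetical.

```hcl
# Sketch only: one EFA interface per network card on p5.48xlarge.
# Assumes 32 network cards; confirm the count for your instance type.
resource "aws_launch_template" "gpu_worker_multi_efa" {
  name_prefix   = "gpu-worker-"
  image_id      = data.aws_ami.deep_learning.id
  instance_type = "p5.48xlarge"

  placement {
    group_name = aws_placement_group.gpu_cluster.name
  }

  dynamic "network_interfaces" {
    for_each = range(32) # assumed number of network cards

    content {
      network_card_index = network_interfaces.value
      # Assumption: card 0 is the primary interface (device index 0),
      # all remaining cards use device index 1.
      device_index    = network_interfaces.value == 0 ? 0 : 1
      interface_type  = "efa"
      subnet_id       = aws_subnet.gpu.id
      security_groups = [aws_security_group.gpu.id]
    }
  }
}
```

A template like this would typically be launched through an Auto Scaling group or an `aws_instance` that references it through its `launch_template` block, in place of the per-node `aws_network_interface` resources.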

Security Group for GPU Cluster

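resource "aws_security_group" "gpu" {
  name_prefix = "gpu-cluster-"
  vpc_id      = aws_vpc.ai.id

  # Allow all traffic between cluster members. EFA requires inbound and
  # outbound rules that reference the security group itself.
  ingress {
    from_port = 0
    to_port   = 0
    protocol  = "-1"
    self      = true
  }

  egress {
    from_port = 0
    to_port   = 0
    protocol  = "-1"
    self      = true
  }

  # SSH access from the admin network only
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [var.admin_cidr]
  }

  # General outbound access (package installs, S3, AWS APIs)
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}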
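
The snippets in this guide reference four input variables (`gpu_node_count`, `spot_gpu_count`, `admin_cidr`, `alert_email`) without declaring them. A minimal sketch of the declarations might look like this. The types, defaults, and validation rules are assumptions chosen for illustration, not values from the original configuration.

```hcl
# Hypothetical declarations for the variables used in this guide.
variable "gpu_node_count" {
  description = "Number of on-demand GPU worker nodes"
  type        = number
  default     = 2 # assumed default

  validation {
    condition     = var.gpu_node_count >= 1
    error_message = "gpu_node_count must be at least 1."
  }
}

variable "spot_gpu_count" {
  description = "Number of Spot GPU nodes"
  type        = number
  default     = 0 # assumed default: Spot capacity is opt-in
}

variable "admin_cidr" {
  description = "CIDR block allowed to reach the cluster over SSH"
  type        = string

  validation {
    # cidrhost() fails on malformed CIDR strings, so can() turns that into a check
    condition     = can(cidrhost(var.admin_cidr, 0))
    error_message = "admin_cidr must be a valid CIDR block, for example 203.0.113.0/24."
  }
}

variable "alert_email" {
  description = "Email address that receives budget alerts"
  type        = string
}
```

Leaving `admin_cidr` and `alert_email` without defaults forces each environment to set them explicitly, which avoids accidentally opening SSH to a placeholder range.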

FSx for Lustre: High-Performance Training Data

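resource "aws_fsx_lustre_file_system" "training_data" {
  storage_capacity            = 4800 # GiB; PERSISTENT_2 accepts 1200, 2400, or multiples of 2400
  subnet_ids                  = [aws_subnet.gpu.id]
  deployment_type             = "PERSISTENT_2"
  per_unit_storage_throughput = 1000 # MB/s per TiB
  security_group_ids          = [aws_security_group.fsx.id]

  tags = {
    Component = "ai-training-storage"
  }
}

# PERSISTENT_2 does not accept import_path/export_path. The S3 link is a
# separate data repository association, which imports from and exports to
# the same S3 path rather than a separate results prefix.
resource "aws_fsx_data_repository_association" "training_data" {
  file_system_id       = aws_fsx_lustre_file_system.training_data.id
  data_repository_path = "s3://${aws_s3_bucket.training_data.id}"
  file_system_path     = "/"

  s3 {
    auto_import_policy { events = ["NEW", "CHANGED", "DELETED"] }
    auto_export_policy { events = ["NEW", "CHANGED", "DELETED"] }
  }
}

# Account ID keeps the bucket name globally unique
data "aws_caller_identity" "current" {}

resource "aws_s3_bucket" "training_data" {
  bucket = "ai-training-data-${data.aws_caller_identity.current.account_id}"
}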
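
The file system references `aws_security_group.fsx`, which is not defined anywhere in this guide. A minimal sketch of that group follows, assuming the GPU nodes are the only Lustre clients. Lustre uses TCP port 988 and ports 1018-1023; the rule layout here is one reasonable arrangement, not the author's configuration.

```hcl
# Sketch of the security group referenced by the FSx file system.
resource "aws_security_group" "fsx" {
  name_prefix = "fsx-lustre-"
  vpc_id      = aws_vpc.ai.id

  # Lustre traffic from the GPU nodes and between file servers
  ingress {
    from_port       = 988
    to_port         = 988
    protocol        = "tcp"
    security_groups = [aws_security_group.gpu.id]
    self            = true
  }

  ingress {
    from_port       = 1018
    to_port         = 1023
    protocol        = "tcp"
    security_groups = [aws_security_group.gpu.id]
    self            = true
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

The GPU security group already allows all outbound traffic, so the clients need no extra rule to reach the file system.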

Cost Management

GPU instances are expensive. Use Spot instances for fault-tolerant training:

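# Note: the AWS provider documentation discourages this legacy resource in
# favor of aws_instance with instance_market_options.
resource "aws_spot_instance_request" "gpu_spot" {
  count = var.spot_gpu_count

  ami                            = data.aws_ami.deep_learning.id
  instance_type                  = "p4d.24xlarge"
  spot_price                     = "15.00" # Maximum price per instance-hour (USD)
  wait_for_fulfillment           = true
  spot_type                      = "persistent"
  instance_interruption_behavior = "stop" # Requires a persistent request

  placement_group        = aws_placement_group.gpu_cluster.id
  subnet_id              = aws_subnet.gpu.id
  vpc_security_group_ids = [aws_security_group.gpu.id] # Join the cluster's security group

  # These tags apply to the Spot request, not to the launched instance
  tags = { Name = "gpu-spot-${count.index}" }
}

# Budget alert: email when actual spend passes 50% of the monthly limit.
# Without a cost filter, this budget tracks all spend in the account.
resource "aws_budgets_budget" "gpu_compute" {
  name         = "ai-training-compute"
  budget_type  = "COST"
  limit_amount = "10000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 50
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [var.alert_email]
  }
}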
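
The AWS provider documentation recommends requesting Spot capacity through `aws_instance` with an `instance_market_options` block rather than through `aws_spot_instance_request`. A sketch of the equivalent request is shown below, assuming a recent AWS provider version that supports this block. The resource name `gpu_spot_instance` is hypothetical, and the price and interruption settings simply mirror the values used above.

```hcl
# Sketch: Spot capacity through aws_instance instead of the legacy resource.
resource "aws_instance" "gpu_spot_instance" {
  count = var.spot_gpu_count

  ami                    = data.aws_ami.deep_learning.id
  instance_type          = "p4d.24xlarge"
  subnet_id              = aws_subnet.gpu.id
  placement_group        = aws_placement_group.gpu_cluster.id
  vpc_security_group_ids = [aws_security_group.gpu.id]

  instance_market_options {
    market_type = "spot"

    spot_options {
      max_price                      = "15.00" # Maximum price per instance-hour (USD)
      spot_instance_type             = "persistent"
      instance_interruption_behavior = "stop"
    }
  }

  # Unlike the legacy resource, these tags land on the instance itself
  tags = { Name = "gpu-spot-${count.index}" }
}
```

Either way, training jobs on Spot nodes need regular checkpoints to shared storage so that an interruption costs minutes of work, not hours.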

Cost Comparison

Instance	On-Demand/hr	Spot/hr (~)	8-node cluster/day (On-Demand)	8-node cluster/day (Spot, ~)
`p4d.24xlarge`	$32.77	~$12	$6,292	~$2,304
`p5.48xlarge`	$98.32	~$35	$18,877	~$6,720
`trn1.32xlarge`	$21.50	~$7	$4,128	~$1,344
`g5.48xlarge`	$16.29	~$6	$3,128	~$1,152
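
The daily cluster figures follow from hourly price × 8 nodes × 24 hours. As a sketch, the same arithmetic can live in Terraform locals so the estimate updates when prices or node counts change. The hourly rates are copied from the table above, Spot rates are approximate, and the local and output names are hypothetical.

```hcl
# Sketch: daily cost of an 8-node cluster from the hourly rates in the table.
locals {
  nodes         = 8
  hours_per_day = 24

  hourly_usd = {
    "p4d.24xlarge"  = { on_demand = 32.77, spot = 12 }
    "p5.48xlarge"   = { on_demand = 98.32, spot = 35 }
    "trn1.32xlarge" = { on_demand = 21.50, spot = 7 }
    "g5.48xlarge"   = { on_demand = 16.29, spot = 6 }
  }

  # daily cost = hourly rate x nodes x hours per day
  daily_cluster_usd = {
    for type, price in local.hourly_usd : type => {
      on_demand = price.on_demand * local.nodes * local.hours_per_day
      spot      = price.spot * local.nodes * local.hours_per_day
    }
  }
}

output "daily_cluster_usd" {
  value = local.daily_cluster_usd
}
```

For `p4d.24xlarge` this works out to roughly $6,292 per day on demand against about $2,304 on Spot, a saving of a little over 60%.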

Hands-On Courses

Terraform for Beginners on CopyPasteLearn
Terraform By Example — practical code examples

Conclusion

AI supercomputing infrastructure on AWS combines GPU instances, EFA networking for multi-node training, FSx for Lustre for high-throughput storage, and placement groups for low-latency communication. Terraform makes GPU clusters reproducible: spin them up for training, tear them down when done, and use Spot instances to cut costs by roughly 60-70%. As AI training scales in 2026, infrastructure-as-code is the most practical way to manage GPU fleet provisioning.
