
Terraform for AI Supercomputing: Provision GPU Clusters and NVIDIA DGX on AWS

Key Takeaway

Provision AI supercomputing infrastructure with Terraform. Deploy GPU clusters with p5.48xlarge, EFA networking, FSx Lustre storage, and auto-scaling for model training on AWS.


AI supercomputing is one of Gartner’s top 2026 trends — the race for AI compute is reshaping how infrastructure teams provision GPU clusters, high-speed networking, and distributed storage. NVIDIA’s Blackwell Ultra and AWS P5 instances make enterprise-scale AI training accessible, but provisioning it correctly requires careful infrastructure planning.

This guide shows how to provision AI training infrastructure with Terraform on AWS.
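The snippets below reference a few input variables (`var.gpu_node_count`, `var.spot_gpu_count`, `var.admin_cidr`, `var.alert_email`) and the caller's account ID. A minimal variables file might look like this; the defaults are illustrative assumptions, not recommendations:

```hcl
variable "gpu_node_count" {
  description = "Number of on-demand GPU training nodes"
  type        = number
  default     = 2
}

variable "spot_gpu_count" {
  description = "Number of Spot GPU nodes for fault-tolerant jobs"
  type        = number
  default     = 0
}

variable "admin_cidr" {
  description = "CIDR block allowed to SSH into the cluster"
  type        = string
}

variable "alert_email" {
  description = "Email address for budget notifications"
  type        = string
}

# Used later to build a globally unique S3 bucket name
data "aws_caller_identity" "current" {}
```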

GPU Instance Types for AI Workloads

| Instance | GPUs | GPU Memory | Network | Use Case |
|---|---|---|---|---|
| g5.xlarge | 1× A10G | 24 GB | 10 Gbps | Inference, fine-tuning small models |
| g5.48xlarge | 8× A10G | 192 GB | 100 Gbps | Batch inference, medium training |
| p4d.24xlarge | 8× A100 | 320 GB | 400 Gbps EFA | Large model training |
| p5.48xlarge | 8× H100 | 640 GB | 3200 Gbps EFA | Frontier model training |
| trn1.32xlarge | 16× Trainium | 512 GB | 800 Gbps EFA | Cost-optimized training |
| inf2.48xlarge | 12× Inferentia2 | 384 GB | 100 Gbps | High-throughput inference |

VPC with EFA Networking

Elastic Fabric Adapter (EFA) is required for multi-node GPU training:

resource "aws_vpc" "ai" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true

  tags = { Name = "ai-training-vpc" }
}

resource "aws_subnet" "gpu" {
  vpc_id            = aws_vpc.ai.id
  cidr_block        = "10.0.0.0/24"
  availability_zone = "us-east-1a"  # GPU instances may be AZ-limited

  tags = { Name = "gpu-subnet" }
}

# Placement group for low-latency GPU-to-GPU communication
resource "aws_placement_group" "gpu_cluster" {
  name     = "gpu-cluster"
  strategy = "cluster"  # Pack instances close together
}
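The VPC above has no route to the internet, but training nodes typically need outbound access to pull containers and Python packages (VPC endpoints or a NAT gateway are the alternatives for private subnets). A minimal sketch, assuming public egress is acceptable:

```hcl
resource "aws_internet_gateway" "ai" {
  vpc_id = aws_vpc.ai.id
}

resource "aws_route_table" "gpu" {
  vpc_id = aws_vpc.ai.id

  # Default route for outbound traffic (package installs, S3, ECR)
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.ai.id
  }
}

resource "aws_route_table_association" "gpu" {
  subnet_id      = aws_subnet.gpu.id
  route_table_id = aws_route_table.gpu.id
}
```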

GPU Training Cluster

resource "aws_instance" "gpu_worker" {
  count = var.gpu_node_count

  ami             = data.aws_ami.deep_learning.id
  instance_type   = "p5.48xlarge"
  placement_group = aws_placement_group.gpu_cluster.name

  # Primary EFA network interface. The subnet comes from the attached
  # interface, so subnet_id must not also be set on the instance.
  network_interface {
    network_interface_id = aws_network_interface.efa[count.index].id
    device_index         = 0
  }

  root_block_device {
    volume_size = 500
    volume_type = "gp3"
    throughput  = 1000
    iops        = 16000
  }

  # p5 NVMe instance-store volumes attach automatically (no block
  # device mapping needed); format and mount them as scratch space
  # in the setup script

  user_data = base64encode(templatefile("${path.module}/scripts/gpu-setup.sh", {
    fsx_dns    = aws_fsx_lustre_file_system.training_data.dns_name
    fsx_mount  = aws_fsx_lustre_file_system.training_data.mount_name
    node_rank  = count.index
    world_size = var.gpu_node_count
  }))

  tags = {
    Name      = "gpu-worker-${count.index}"
    Component = "ai-training"
  }
}

# EFA network interfaces
resource "aws_network_interface" "efa" {
  count = var.gpu_node_count

  subnet_id       = aws_subnet.gpu.id
  security_groups = [aws_security_group.gpu.id]
  interface_type  = "efa"
}

# Deep Learning AMI
data "aws_ami" "deep_learning" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["Deep Learning AMI GPU PyTorch *Ubuntu 22.04*"]
  }
}
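Distributed launchers such as torchrun need the workers' addresses for rendezvous; exposing them as a Terraform output avoids hardcoding IPs in job scripts. The output name here is illustrative:

```hcl
output "gpu_worker_private_ips" {
  description = "Private IPs of the GPU workers (rank 0 is the rendezvous host)"
  value       = aws_network_interface.efa[*].private_ip
}
```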

Security Group for GPU Cluster

resource "aws_security_group" "gpu" {
  name_prefix = "gpu-cluster-"
  vpc_id      = aws_vpc.ai.id

  # Allow all traffic within the cluster (EFA needs this)
  ingress {
    from_port = 0
    to_port   = 0
    protocol  = "-1"
    self      = true
  }

  # SSH access
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [var.admin_cidr]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

FSx for Lustre: High-Performance Training Data

resource "aws_fsx_lustre_file_system" "training_data" {
  storage_capacity            = 4800  # GB; must be 1200 or a multiple of 2400 for PERSISTENT_2
  subnet_ids                  = [aws_subnet.gpu.id]
  deployment_type             = "PERSISTENT_2"
  per_unit_storage_throughput = 1000  # MB/s per TiB
  security_group_ids          = [aws_security_group.fsx.id]

  tags = {
    Component = "ai-training-storage"
  }
}

# PERSISTENT_2 does not support import_path/export_path; link the
# S3 bucket through a data repository association instead
resource "aws_fsx_data_repository_association" "training_data" {
  file_system_id       = aws_fsx_lustre_file_system.training_data.id
  data_repository_path = "s3://${aws_s3_bucket.training_data.id}"
  file_system_path     = "/data"

  s3 {
    auto_import_policy {
      events = ["NEW", "CHANGED", "DELETED"]
    }
    auto_export_policy {
      events = ["NEW", "CHANGED", "DELETED"]
    }
  }
}

resource "aws_s3_bucket" "training_data" {
  bucket = "ai-training-data-${data.aws_caller_identity.current.account_id}"
}
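The file system references `aws_security_group.fsx`, which is not defined above. FSx for Lustre uses TCP port 988 plus 1018-1023 for Lustre traffic, so a sketch might look like this (in practice the rules should also be self-referencing so FSx network interfaces can talk to each other):

```hcl
resource "aws_security_group" "fsx" {
  name_prefix = "fsx-lustre-"
  vpc_id      = aws_vpc.ai.id

  # Lustre protocol traffic from the GPU cluster
  ingress {
    from_port       = 988
    to_port         = 988
    protocol        = "tcp"
    security_groups = [aws_security_group.gpu.id]
  }

  ingress {
    from_port       = 1018
    to_port         = 1023
    protocol        = "tcp"
    security_groups = [aws_security_group.gpu.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```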

Cost Management

GPU instances are expensive. Use Spot instances for fault-tolerant training:

resource "aws_spot_instance_request" "gpu_spot" {
  count = var.spot_gpu_count

  ami                    = data.aws_ami.deep_learning.id
  instance_type          = "p4d.24xlarge"
  spot_price             = "15.00"  # Max hourly bid
  wait_for_fulfillment   = true
  spot_type              = "persistent"
  instance_interruption_behavior = "stop"

  placement_group = aws_placement_group.gpu_cluster.id
  subnet_id       = aws_subnet.gpu.id

  tags = { Name = "gpu-spot-${count.index}" }
}
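One caveat with `aws_spot_instance_request`: its `tags` apply to the Spot request itself, not the launched instance, and AWS now recommends requesting Spot capacity through the regular instance resource. An equivalent sketch using `instance_market_options`:

```hcl
resource "aws_instance" "gpu_spot_v2" {
  count = var.spot_gpu_count

  ami             = data.aws_ami.deep_learning.id
  instance_type   = "p4d.24xlarge"
  subnet_id       = aws_subnet.gpu.id
  placement_group = aws_placement_group.gpu_cluster.name

  instance_market_options {
    market_type = "spot"

    spot_options {
      max_price                      = "15.00"  # omit to default to the on-demand price
      spot_instance_type             = "persistent"
      instance_interruption_behavior = "stop"
    }
  }

  # Tags land on the instance, unlike with aws_spot_instance_request
  tags = { Name = "gpu-spot-${count.index}" }
}
```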

# Budget alert
resource "aws_budgets_budget" "gpu_compute" {
  name         = "ai-training-compute"
  budget_type  = "COST"
  limit_amount = "10000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 50
    threshold_type            = "PERCENTAGE"
    notification_type         = "ACTUAL"
    subscriber_email_addresses = [var.alert_email]
  }
}

Cost Comparison

| Instance | On-Demand/hr | Spot/hr (~) | 8-node Spot cluster/day (~) |
|---|---|---|---|
| p4d.24xlarge | $32.77 | ~$12 | ~$2,304 |
| p5.48xlarge | $98.32 | ~$35 | ~$6,720 |
| trn1.32xlarge | $21.50 | ~$7 | ~$1,344 |
| g5.48xlarge | $16.29 | ~$6 | ~$1,152 |

Conclusion

AI supercomputing infrastructure on AWS requires GPU instances, EFA networking for multi-node training, FSx Lustre for high-throughput storage, and placement groups for low-latency communication. Terraform makes GPU clusters reproducible — spin up for training, tear down when done, and use Spot instances to cut costs by 60-70%. As AI training scales in 2026, infrastructure-as-code is the only practical way to manage GPU fleet provisioning.

Written by Luca Berton

DevOps Engineer, AWS Partner, Terraform expert, and author. Creator of Ansible Pilot, Terraform Pilot, and CopyPasteLearn.