TerraformPilot


Terraform for Domain-Specific Language Models on AWS

Provision domain-specific LLM infrastructure with Terraform: GPU inference endpoints, private data stores, fine-tuning pipelines, and isolated environments.

Luca Berton · 1 min read

Domain-specific language models are a 2026 trend reshaping enterprise AI. Instead of one giant general-purpose model, organizations fine-tune smaller models on legal, medical, financial, or industrial corpora — and run them in compliance-isolated environments. Terraform provisions the GPU endpoints, private data stores, and fine-tuning pipelines that make this repeatable.

This guide shows how to build a domain-specific LLM platform on AWS with Terraform.
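Every snippet below assumes a standard provider setup. A minimal sketch (the provider version pin is illustrative; `var.vpc_id`, `var.private_subnet_ids`, and similar inputs are declared the same way):

```hcl
terraform {
  required_version = ">= 1.5"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.region
}

# Inputs referenced throughout the article.
variable "region" { type = string }
variable "env"    { type = string }
```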

Architecture Overview

  • Private data lake: S3 + Lake Formation
  • Fine-tuning: SageMaker Training Jobs
  • Model registry: SageMaker Model Registry
  • Inference: SageMaker async / serverless / real-time endpoints
  • Isolation: VPC endpoints, KMS, IAM, PrivateLink

Private Training Data Lake

resource "aws_s3_bucket" "training_corpus" {
  bucket = "acme-legal-corpus-${var.env}"
}
 
resource "aws_s3_bucket_server_side_encryption_configuration" "corpus" {
  bucket = aws_s3_bucket.training_corpus.id
 
  rule {
    apply_server_side_encryption_by_default {
      kms_master_key_id = aws_kms_key.corpus.arn
      sse_algorithm     = "aws:kms"
    }
    bucket_key_enabled = true
  }
}
 
resource "aws_s3_bucket_public_access_block" "corpus" {
  bucket                  = aws_s3_bucket.training_corpus.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
 
resource "aws_kms_key" "corpus" {
  description         = "Encrypts domain-specific training data"
  enable_key_rotation = true
}
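To keep the corpus reachable only from inside the VPC, a bucket policy can deny any request that does not arrive through the S3 VPC endpoint. A sketch, assuming the endpoint ID is passed in as a variable:

```hcl
resource "aws_s3_bucket_policy" "corpus" {
  bucket = aws_s3_bucket.training_corpus.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid       = "DenyOutsideVpcEndpoint"
      Effect    = "Deny"
      Principal = "*"
      Action    = "s3:*"
      Resource = [
        aws_s3_bucket.training_corpus.arn,
        "${aws_s3_bucket.training_corpus.arn}/*",
      ]
      Condition = {
        StringNotEquals = {
          # Assumed variable: the ID of the S3 gateway endpoint.
          "aws:SourceVpce" = var.s3_vpc_endpoint_id
        }
      }
    }]
  })
}
```

Note that a blanket deny like this also locks out console access; in practice you would carve out an exception for a break-glass admin role.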

Fine-Tuning Job

The Terraform AWS provider has no resource for one-off training jobs, so the repeatable way to codify a fine-tune is a SageMaker Pipeline whose Training step carries the CreateTrainingJob arguments:

resource "aws_sagemaker_pipeline" "legal_finetune" {
  pipeline_name         = "legal-llm-finetune"
  pipeline_display_name = "legal-llm-finetune"
  role_arn              = aws_iam_role.sagemaker.arn

  pipeline_definition = jsonencode({
    Version = "2020-12-01"
    Steps = [{
      Name = "LegalFineTune"
      Type = "Training"
      Arguments = {
        RoleArn = aws_iam_role.sagemaker.arn
        AlgorithmSpecification = {
          TrainingImage     = "763104351884.dkr.ecr.${var.region}.amazonaws.com/huggingface-pytorch-training:2.3-transformers4.46-gpu-py311-cu124-ubuntu22.04"
          TrainingInputMode = "File"
        }
        ResourceConfig = {
          InstanceType   = "ml.p4d.24xlarge"
          InstanceCount  = 2
          VolumeSizeInGB = 500
        }
        InputDataConfig = [{
          ChannelName = "train"
          DataSource = {
            S3DataSource = {
              S3DataType = "S3Prefix"
              S3Uri      = "s3://${aws_s3_bucket.training_corpus.bucket}/processed/"
            }
          }
        }]
        OutputDataConfig = {
          S3OutputPath = "s3://${aws_s3_bucket.models.bucket}/output/"
          KmsKeyId     = aws_kms_key.corpus.arn
        }
        VpcConfig = {
          Subnets          = var.private_subnet_ids
          SecurityGroupIds = [aws_security_group.sagemaker.id]
        }
        # CreateTrainingJob requires a stopping condition; 24h is a safe cap here.
        StoppingCondition = {
          MaxRuntimeInSeconds = 86400
        }
        EnableInterContainerTrafficEncryption = true
        EnableNetworkIsolation                = true
      }
    }]
  })
}
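The `aws_iam_role.sagemaker` referenced above is not shown in the article; a minimal trust-policy sketch (attach least-privilege S3, KMS, ECR, and CloudWatch permissions to it separately):

```hcl
resource "aws_iam_role" "sagemaker" {
  name = "legal-llm-sagemaker"

  # Allow the SageMaker service to assume this role for training and hosting.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "sagemaker.amazonaws.com" }
    }]
  })
}
```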

Private Inference Endpoint

resource "aws_sagemaker_model" "legal_llm" {
  name               = "legal-llm-v1"
  execution_role_arn = aws_iam_role.sagemaker.arn
 
  primary_container {
    image          = var.inference_image
    model_data_url = "s3://${aws_s3_bucket.models.bucket}/output/model.tar.gz"
  }
 
  vpc_config {
    subnets            = var.private_subnet_ids
    security_group_ids = [aws_security_group.sagemaker.id]
  }
}
 
resource "aws_sagemaker_endpoint_configuration" "legal_llm" {
  name = "legal-llm-config"
 
  production_variants {
    variant_name           = "AllTraffic"
    model_name             = aws_sagemaker_model.legal_llm.name
    instance_type          = "ml.g5.12xlarge"
    initial_instance_count = 2
  }
 
  kms_key_arn = aws_kms_key.corpus.arn
}
 
resource "aws_sagemaker_endpoint" "legal_llm" {
  name                 = "legal-llm"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.legal_llm.name
}
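A real-time endpoint can scale on invocation load via Application Auto Scaling; a sketch where the min/max counts and the 70-invocations-per-instance target are illustrative:

```hcl
resource "aws_appautoscaling_target" "legal_llm" {
  service_namespace  = "sagemaker"
  resource_id        = "endpoint/${aws_sagemaker_endpoint.legal_llm.name}/variant/AllTraffic"
  scalable_dimension = "sagemaker:variant:DesiredInstanceCount"
  min_capacity       = 2
  max_capacity       = 6
}

resource "aws_appautoscaling_policy" "legal_llm" {
  name               = "legal-llm-scale"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.legal_llm.service_namespace
  resource_id        = aws_appautoscaling_target.legal_llm.resource_id
  scalable_dimension = aws_appautoscaling_target.legal_llm.scalable_dimension

  target_tracking_scaling_policy_configuration {
    # Add an instance when sustained invocations per instance exceed this value.
    target_value = 70
    predefined_metric_specification {
      predefined_metric_type = "SageMakerVariantInvocationsPerInstance"
    }
  }
}
```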

VPC Endpoints for No-Egress Inference

resource "aws_vpc_endpoint" "sagemaker_runtime" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.sagemaker.runtime"
  vpc_endpoint_type = "Interface"
  subnet_ids        = var.private_subnet_ids
 
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}
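The `aws_security_group.endpoints` referenced above, plus the S3 gateway endpoint that usually accompanies this layout, might look like the following (the route-table and CIDR variables are assumptions):

```hcl
# Gateway endpoint so private subnets reach S3 without internet egress.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids
}

# Allow HTTPS to the interface endpoints from inside the VPC only.
resource "aws_security_group" "endpoints" {
  name_prefix = "vpce-"
  vpc_id      = var.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.vpc_cidr]
  }
}
```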

Best Practices

  • Use one KMS key per data domain, so revoking the key revokes access to every derived artifact at once.
  • Enable network isolation on training jobs so the proprietary corpus cannot leak during fine-tuning.
  • Gate promotion to production endpoints behind a Model Registry approval workflow.
  • Use SageMaker async endpoints for batch inference over long contracts and records.
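The async-endpoint recommendation maps to an `async_inference_config` block on the endpoint configuration. A sketch, with an assumed output prefix in the models bucket:

```hcl
resource "aws_sagemaker_endpoint_configuration" "legal_llm_async" {
  name = "legal-llm-async-config"

  production_variants {
    variant_name           = "AllTraffic"
    model_name             = aws_sagemaker_model.legal_llm.name
    instance_type          = "ml.g5.12xlarge"
    initial_instance_count = 1
  }

  async_inference_config {
    output_config {
      # Results land here instead of being returned inline.
      s3_output_path = "s3://${aws_s3_bucket.models.bucket}/async-output/"
      kms_key_id     = aws_kms_key.corpus.arn
    }
    client_config {
      max_concurrent_invocations_per_instance = 4
    }
  }
}
```

Async endpoints queue requests and write responses to S3, which suits long documents that would time out a synchronous invocation.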
#Terraform #AI #LLM #AWS #SageMaker
