Skip to main content

Terraform for Agentic AI Infrastructure: Deploy Multi-Agent Systems on AWS

Key Takeaway

Deploy agentic AI and multi-agent systems with Terraform on AWS. Provision SQS queues, Lambda functions, Step Functions orchestration, and vector databases for autonomous AI agents.

Table of Contents

Agentic AI is the biggest infrastructure trend of 2026. AI is moving from chat interfaces to autonomous agents that execute multi-step tasks across workflows — and those agents need infrastructure. Gartner lists multiagent systems in its 2026 top 10 strategic trends.

This guide shows how to provision the infrastructure for agentic AI systems using Terraform on AWS.

What Agentic AI Infrastructure Looks Like

Unlike a simple LLM API call, agentic systems need:

  • Message queues for agent-to-agent communication
  • Orchestration for multi-step task execution
  • Vector databases for agent memory and context
  • Compute for running agent logic (Lambda, ECS, or EC2)
  • Observability for tracking agent decisions and costs
  • Guardrails for safety boundaries around autonomous actions

Architecture Overview

User Request
    │
    ▼
┌─────────────────┐
│  API Gateway     │
└────────┬────────┘
         ▼
┌─────────────────┐     ┌──────────────┐
│  Orchestrator    │────▶│  Agent Queue  │
│  (Step Functions)│     │  (SQS)       │
└────────┬────────┘     └──────┬───────┘
         │                     │
    ┌────┴────┐          ┌─────┴─────┐
    ▼         ▼          ▼           ▼
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│Research│ │Planning│ │Executor│ │Reviewer│
│ Agent  │ │ Agent  │ │ Agent  │ │ Agent  │
│(Lambda)│ │(Lambda)│ │(Lambda)│ │(Lambda)│
└────┬───┘ └────┬───┘ └────┬───┘ └────┬───┘
     │          │          │          │
     └──────────┴──────────┴──────────┘
                    │
              ┌─────┴─────┐
              │  Bedrock   │  ← LLM API
              │  OpenSearch │  ← Vector memory
              │  DynamoDB   │  ← State/history
              └───────────┘

Agent Communication: SQS Queues

Each agent gets its own input queue for asynchronous communication:

variable "agents" {
  default = ["research", "planning", "executor", "reviewer"]
}

resource "aws_sqs_queue" "agent_queue" {
  for_each = toset(var.agents)

  name                       = "agent-${each.key}-queue"
  visibility_timeout_seconds = 300    # 5 min for LLM processing
  message_retention_seconds  = 86400  # 24 hours
  receive_wait_time_seconds  = 20     # Long polling

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.agent_dlq[each.key].arn
    maxReceiveCount     = 3
  })

  tags = {
    Component = "agentic-ai"
    Agent     = each.key
  }
}

resource "aws_sqs_queue" "agent_dlq" {
  for_each = toset(var.agents)
  name     = "agent-${each.key}-dlq"
}

Agent Compute: Lambda Functions

resource "aws_lambda_function" "agent" {
  for_each = toset(var.agents)

  function_name = "agent-${each.key}"
  runtime       = "python3.12"
  handler       = "handler.main"
  timeout       = 300  # 5 min max for complex reasoning
  memory_size   = 512

  filename         = "lambda/${each.key}/deployment.zip"
  source_code_hash = filebase64sha256("lambda/${each.key}/deployment.zip")
  role             = aws_iam_role.agent_role[each.key].arn

  environment {
    variables = {
      AGENT_NAME      = each.key
      BEDROCK_MODEL   = var.bedrock_model_id
      OPENSEARCH_HOST = aws_opensearch_domain.memory.endpoint
      STATE_TABLE     = aws_dynamodb_table.agent_state.name
      OUTPUT_QUEUE    = aws_sqs_queue.orchestrator_results.url
    }
  }
}

# SQS triggers each agent
resource "aws_lambda_event_source_mapping" "agent_trigger" {
  for_each = toset(var.agents)

  event_source_arn = aws_sqs_queue.agent_queue[each.key].arn
  function_name    = aws_lambda_function.agent[each.key].arn
  batch_size       = 1  # One task at a time per agent
}

Orchestration: Step Functions

resource "aws_sfn_state_machine" "agent_orchestrator" {
  name     = "agent-orchestrator"
  role_arn = aws_iam_role.sfn_role.arn

  definition = jsonencode({
    Comment = "Multi-agent task orchestration"
    StartAt = "Research"
    States = {
      Research = {
        Type     = "Task"
        Resource = "arn:aws:states:::sqs:sendMessage.waitForTaskToken"
        Parameters = {
          QueueUrl    = aws_sqs_queue.agent_queue["research"].url
          MessageBody = {
            "task.$"       = "$.task"
            "taskToken.$"  = "$$.Task.Token"
          }
        }
        TimeoutSeconds = 600
        Next           = "Planning"
        Catch = [{
          ErrorEquals = ["States.TaskFailed", "States.Timeout"]
          Next        = "HandleError"
        }]
      }
      Planning = {
        Type     = "Task"
        Resource = "arn:aws:states:::sqs:sendMessage.waitForTaskToken"
        Parameters = {
          QueueUrl    = aws_sqs_queue.agent_queue["planning"].url
          MessageBody = {
            "task.$"           = "$.task"
            "research_output.$" = "$.research_output"
            "taskToken.$"      = "$$.Task.Token"
          }
        }
        TimeoutSeconds = 600
        Next           = "Execute"
        Catch = [{
          ErrorEquals = ["States.TaskFailed"]
          Next        = "HandleError"
        }]
      }
      Execute = {
        Type     = "Task"
        Resource = "arn:aws:states:::sqs:sendMessage.waitForTaskToken"
        Parameters = {
          QueueUrl    = aws_sqs_queue.agent_queue["executor"].url
          MessageBody = {
            "task.$"     = "$.task"
            "plan.$"     = "$.plan"
            "taskToken.$" = "$$.Task.Token"
          }
        }
        TimeoutSeconds = 900
        Next           = "Review"
        Catch = [{
          ErrorEquals = ["States.TaskFailed"]
          Next        = "HandleError"
        }]
      }
      Review = {
        Type     = "Task"
        Resource = "arn:aws:states:::sqs:sendMessage.waitForTaskToken"
        Parameters = {
          QueueUrl    = aws_sqs_queue.agent_queue["reviewer"].url
          MessageBody = {
            "task.$"     = "$.task"
            "result.$"   = "$.result"
            "taskToken.$" = "$$.Task.Token"
          }
        }
        TimeoutSeconds = 600
        Next           = "CheckApproval"
        Catch = [{
          ErrorEquals = ["States.TaskFailed"]
          Next        = "HandleError"
        }]
      }
      CheckApproval = {
        Type = "Choice"
        Choices = [{
          Variable     = "$.approved"
          BooleanEquals = true
          Next         = "Success"
        }]
        Default = "Planning"  # Loop back if rejected
      }
      Success = {
        Type = "Succeed"
      }
      HandleError = {
        Type  = "Task"
        Resource = aws_lambda_function.agent["reviewer"].arn
        End   = true
      }
    }
  })
}

Agent Memory: OpenSearch Vector Store

resource "aws_opensearch_domain" "memory" {
  domain_name    = "agent-memory"
  engine_version = "OpenSearch_2.13"

  cluster_config {
    instance_type  = "r6g.large.search"
    instance_count = 2
  }

  ebs_options {
    ebs_enabled = true
    volume_size = 100
    volume_type = "gp3"
  }

  encrypt_at_rest {
    enabled = true
  }

  node_to_node_encryption {
    enabled = true
  }

  tags = {
    Component = "agent-memory"
  }
}

Agent State: DynamoDB

resource "aws_dynamodb_table" "agent_state" {
  name         = "agent-state"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "task_id"
  range_key    = "agent_name"

  attribute {
    name = "task_id"
    type = "S"
  }

  attribute {
    name = "agent_name"
    type = "S"
  }

  ttl {
    attribute_name = "expires_at"
    enabled        = true
  }

  tags = {
    Component = "agent-state"
  }
}

Cost Controls and Guardrails

# Budget alarm for AI spend
resource "aws_budgets_budget" "ai_spend" {
  name         = "agentic-ai-monthly"
  budget_type  = "COST"
  limit_amount = "500"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name   = "TagKeyValue"
    values = ["user:Component$agentic-ai"]
  }

  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 80
    threshold_type            = "PERCENTAGE"
    notification_type         = "ACTUAL"
    subscriber_email_addresses = [var.alert_email]
  }
}

# CloudWatch alarm for runaway agents
resource "aws_cloudwatch_metric_alarm" "agent_invocations" {
  alarm_name          = "agent-high-invocations"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "Invocations"
  namespace           = "AWS/Lambda"
  period              = 300
  statistic           = "Sum"
  threshold           = 1000
  alarm_description   = "Agent invocations exceeded safety threshold"
  alarm_actions       = [aws_sns_topic.alerts.arn]
}

Hands-On Courses

Conclusion

Agentic AI systems need purpose-built infrastructure: message queues for agent communication, Step Functions for orchestration, vector databases for memory, and budget guardrails for cost control. Terraform makes this repeatable — deploy the same multi-agent architecture across dev, staging, and production with consistent configuration. As agents move from experiments to production workloads in 2026, infrastructure-as-code becomes essential for managing their complexity.

🚀

Level Up Your Terraform Skills

Hands-on courses, books, and resources from Luca Berton

Luca Berton
Written by

Luca Berton

DevOps Engineer, AWS Partner, Terraform expert, and author. Creator of Ansible Pilot, Terraform Pilot, and CopyPasteLearn.