Introduction

Not everything in your infrastructure is managed by Terraform. Legacy resources, infrastructure created manually in the console, resources managed by other teams, and information from external services all live outside your Terraform state. Data sources are Terraform’s mechanism for reading information from these external sources and using it in your configurations.

Unlike resources, data sources don’t create, update, or delete anything. They perform read-only queries that return information you can reference elsewhere in your code. This makes them essential for building configurations that integrate with existing infrastructure rather than starting from scratch.

Data Source Basics

A data source is defined using a data block:

data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"
}

In this example, instead of hardcoding an AMI ID (which differs across regions and changes as new images are published), we query for the latest Ubuntu 22.04 AMI dynamically. Terraform re-fetches the current AMI ID on every terraform plan.

Common Data Source Patterns

Looking Up VPCs and Subnets

When your networking is managed by a different team or Terraform configuration:

data "aws_vpc" "main" {
  filter {
    name   = "tag:Name"
    values = ["production-vpc"]
  }
}

data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.main.id]
  }

  filter {
    name   = "tag:Tier"
    values = ["private"]
  }
}

resource "aws_instance" "app" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.medium"
  subnet_id     = data.aws_subnets.private.ids[0]
  vpc_security_group_ids = [aws_security_group.app.id]
}

resource "aws_security_group" "app" {
  name   = "app-sg"
  vpc_id = data.aws_vpc.main.id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [data.aws_vpc.main.cidr_block]
  }
}

Referencing Remote State

One of the most powerful data sources lets you read outputs from another Terraform state:

data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "my-terraform-state"
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.medium"
  subnet_id     = data.terraform_remote_state.networking.outputs.private_subnet_ids[0]
}

This pattern enables separation of concerns — the networking team manages VPCs and subnets, and application teams reference them via remote state outputs.
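For this to work, the networking configuration has to export the values as outputs — they form the contract between the two states. A minimal sketch of the producing side (the output names match the consumer above; the resource names are illustrative):

```hcl
# In the networking configuration (the producer of the remote state)
output "vpc_id" {
  value = aws_vpc.main.id
}

output "private_subnet_ids" {
  # Assumes the private subnets were created with count or for_each
  value = aws_subnet.private[*].id
}
```

Anything not exposed as an output is invisible to consumers, which is exactly what makes the contract explicit.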

Looking Up IAM Policies

Reference AWS-managed or existing custom policies:

data "aws_iam_policy" "admin" {
  name = "AdministratorAccess"
}

data "aws_iam_policy_document" "custom" {
  statement {
    actions = [
      "s3:GetObject",
      "s3:PutObject",
      "s3:ListBucket",
    ]

    resources = [
      "arn:aws:s3:::my-bucket",
      "arn:aws:s3:::my-bucket/*",
    ]
  }

  statement {
    actions   = ["logs:*"]
    resources = ["*"]
  }
}

resource "aws_iam_policy" "app" {
  name   = "app-policy"
  policy = data.aws_iam_policy_document.custom.json
}

The aws_iam_policy_document data source is particularly valuable because it generates valid JSON policy documents with proper syntax, reducing the chance of policy errors.

Querying Availability Zones

Build region-agnostic configurations:

data "aws_availability_zones" "available" {
  state = "available"

  filter {
    name   = "opt-in-status"
    values = ["opt-in-not-required"]
  }
}

resource "aws_subnet" "private" {
  count             = min(length(data.aws_availability_zones.available.names), 3)
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name = "private-${data.aws_availability_zones.available.names[count.index]}"
    Tier = "private"
  }
}

This creates subnets in up to 3 availability zones, regardless of which region you deploy to.
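The cidrsubnet call above carves one block per zone out of the VPC range by adding 8 bits to the prefix. Assuming an illustrative 10.0.0.0/16 VPC CIDR, that yields one /24 per index:

```hcl
locals {
  vpc_cidr = "10.0.0.0/16" # illustrative value

  # Adding 8 bits to a /16 prefix produces /24 blocks, selected by index:
  subnet_0 = cidrsubnet(local.vpc_cidr, 8, 0) # "10.0.0.0/24"
  subnet_1 = cidrsubnet(local.vpc_cidr, 8, 1) # "10.0.1.0/24"
  subnet_2 = cidrsubnet(local.vpc_cidr, 8, 2) # "10.0.2.0/24"
}
```

You can verify these interactively with terraform console before wiring the expression into a resource.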

Current AWS Account and Region

Access metadata about your current context:

data "aws_caller_identity" "current" {}
data "aws_region" "current" {}
data "aws_partition" "current" {}

locals {
  account_id = data.aws_caller_identity.current.account_id
  region     = data.aws_region.current.name
  partition  = data.aws_partition.current.partition
}

output "account_info" {
  value = "Running in account ${local.account_id} in region ${local.region}"
}

Looking Up Route 53 Zones

Reference existing DNS zones:

data "aws_route53_zone" "main" {
  name         = "example.com"
  private_zone = false
}

resource "aws_route53_record" "app" {
  zone_id = data.aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.app.dns_name
    zone_id                = aws_lb.app.zone_id
    evaluate_target_health = true
  }
}

The external Data Source

For data that doesn’t have a native Terraform provider, use the external data source to call any script:

data "external" "git_info" {
  program = ["bash", "-c", <<-EOT
    echo '{"commit":"'$(git rev-parse --short HEAD)'","branch":"'$(git rev-parse --abbrev-ref HEAD)'"}'
  EOT
  ]
}

resource "aws_instance" "app" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"

  tags = {
    GitCommit = data.external.git_info.result.commit
    GitBranch = data.external.git_info.result.branch
  }
}

The program must write a single flat JSON object to stdout, and every value must be a string — nested objects, arrays, and numbers are not allowed.
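Hand-assembling JSON with echo breaks as soon as a value contains a quote or backslash. If jq is available on the machine running Terraform, a sketch of a safer version of the same lookup (attribute names match the example above):

```hcl
data "external" "git_info" {
  program = ["bash", "-c", <<-EOT
    jq -n \
      --arg commit "$(git rev-parse --short HEAD)" \
      --arg branch "$(git rev-parse --abbrev-ref HEAD)" \
      '{commit: $commit, branch: $branch}'
  EOT
  ]
}
```

jq handles the quoting, so arbitrary branch names can't produce malformed JSON.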

The http Data Source

Fetch data from HTTP endpoints:

data "http" "my_ip" {
  url = "https://ipv4.icanhazip.com"
}

resource "aws_security_group_rule" "ssh_from_my_ip" {
  type              = "ingress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  cidr_blocks       = ["${chomp(data.http.my_ip.response_body)}/32"]
  security_group_id = aws_security_group.bastion.id
}

Data Source Lifecycle

Understanding when data sources are evaluated is crucial:

  1. During plan: Most data sources are read during terraform plan
  2. During apply: Data sources that depend on resources being created are deferred until apply
  3. Every run: Data sources are re-read on every plan/apply — they don’t cache results between runs

This means data source results can change between runs if the underlying data changes. For example, a new AMI might be published, or a VPC might be modified.
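Point 2 shows up in plans as "(known after apply)": a data source that references a resource being created in the same run — directly or via depends_on — cannot be read until that resource exists. A sketch, assuming a hypothetical autoscaling group named app:

```hcl
# Deferred to apply time: the instances don't exist until the ASG is created
data "aws_instances" "asg_members" {
  depends_on = [aws_autoscaling_group.app]

  instance_tags = {
    "aws:autoscaling:groupName" = aws_autoscaling_group.app.name
  }
}
```

Use depends_on on data sources sparingly — it forces the read to apply time on every run, making plans less informative.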

Best Practices

1. Use Specific Filters

Avoid broad queries that might return unexpected results:

# Bad — might match unintended VPCs
data "aws_vpc" "main" {
  default = true
}

# Good — specific filter
data "aws_vpc" "main" {
  filter {
    name   = "tag:Name"
    values = ["production-vpc"]
  }

  filter {
    name   = "tag:Environment"
    values = ["production"]
  }
}

2. Handle Multiple Results

Some queries can match multiple results. Singular data sources (like aws_vpc) fail if more than one resource matches, so use most_recent for AMIs or make your filters specific enough to match exactly one. Plural data sources (like aws_subnets) return lists you must index or iterate over.
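On Terraform 1.2 or later you can also make the expectation explicit with a postcondition, which fails the run with a clear message instead of silently proceeding with an empty or surprising result (the filter values here are illustrative):

```hcl
data "aws_subnets" "private" {
  filter {
    name   = "tag:Tier"
    values = ["private"]
  }

  lifecycle {
    postcondition {
      condition     = length(self.ids) > 0
      error_message = "No subnets tagged Tier=private were found in this VPC."
    }
  }
}
```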

3. Prefer Remote State Over Data Sources

When both teams use Terraform, terraform_remote_state is more reliable than querying resources by tags:

# Better — explicit contract via outputs
data "terraform_remote_state" "networking" {
  backend = "s3"
  config  = { ... }
}

# Riskier — depends on tags being correct
data "aws_vpc" "main" {
  filter { ... }
}

4. Use aws_iam_policy_document for IAM

Always prefer the aws_iam_policy_document data source over inline JSON:

  • Type-checked at plan time
  • Composable with source_policy_documents
  • Readable and maintainable
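Composition in particular is worth a sketch: source_policy_documents merges one or more already-rendered documents into a new one, so teams can layer app-specific statements on top of a shared baseline (the baseline document name here is illustrative):

```hcl
data "aws_iam_policy_document" "combined" {
  # Start from the shared baseline, then append app-specific statements
  source_policy_documents = [data.aws_iam_policy_document.baseline.json]

  statement {
    sid       = "AppBucketRead"
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::my-bucket/*"]
  }
}
```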

5. Don’t Over-Use External Data Sources

The external data source should be a last resort. It introduces dependencies on local tools and reduces portability. Check if a native provider or data source exists first.

Conclusion

Data sources are the bridge between Terraform-managed infrastructure and everything else. They enable you to build configurations that reference existing resources, query dynamic information, and integrate with external systems — all without duplicating resource management. Master data sources, and you’ll write more flexible, maintainable, and collaborative Terraform code.