Using Terraform Data Sources Effectively

Learn how to use Terraform data sources to query existing resources, look up AMIs, reference remote state, and build dynamic configurations.

Luca Berton | February 19, 2025 | 3 min read

Introduction

Not everything in your infrastructure is managed by Terraform. Legacy resources, manually created configurations, resources managed by other teams, or information from external services all exist outside your Terraform state. Data sources are Terraform's mechanism for reading information from these external sources and using it in your configurations.

Unlike resources, data sources don't create, update, or delete anything. They perform read-only queries that return information you can reference elsewhere in your code. This makes them essential for building configurations that integrate with existing infrastructure rather than starting from scratch.

Data Source Basics

A data source is defined using a data block:

data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical
 
  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
 
  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}
 
resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"
}

In this example, instead of hardcoding an AMI ID (which differs between regions and changes as new images are published), we query for the latest Ubuntu 22.04 AMI dynamically. Terraform re-runs the query on every terraform plan, so the ID always reflects the newest image that matches the filters.
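Because the lookup is repeated on every plan, a newly published image changes the ami argument, and Terraform plans to replace the instance. If that is not wanted, one option is to ignore later AMI changes. The following is a minimal sketch under that assumption, repeating the lookup so the snippet stands alone; whether to pin or to roll forward is a judgment call for each workload.

```hcl
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"

  lifecycle {
    # Keep the AMI chosen at creation time: a newer matching image
    # no longer shows up as a change that replaces the instance.
    ignore_changes = [ami]
  }
}
```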

Common Data Source Patterns

Looking Up VPCs and Subnets

When your networking is managed by a different team or Terraform configuration:

data "aws_vpc" "main" {
  filter {
    name   = "tag:Name"
    values = ["production-vpc"]
  }
}
 
data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.main.id]
  }
 
  filter {
    name   = "tag:Tier"
    values = ["private"]
  }
}
 
resource "aws_instance" "app" {
  ami                    = data.aws_ami.ubuntu.id
  instance_type          = "t3.medium"
  subnet_id              = data.aws_subnets.private.ids[0]
  vpc_security_group_ids = [aws_security_group.app.id]
}
 
resource "aws_security_group" "app" {
  name   = "app-sg"
  vpc_id = data.aws_vpc.main.id
 
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [data.aws_vpc.main.cidr_block]
  }
}
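
The example above uses only the first subnet ID. When details of each subnet are needed, such as its availability zone, the list of IDs can be fed into the singular aws_subnet data source. This sketch assumes the data.aws_subnets.private lookup from the example above is in scope; the output name is illustrative.

```hcl
# Look up each private subnet individually, keyed by subnet ID.
data "aws_subnet" "private" {
  for_each = toset(data.aws_subnets.private.ids)
  id       = each.value
}

# Map of subnet ID => availability zone, e.g. for spreading instances.
output "private_subnet_azs" {
  value = { for id, s in data.aws_subnet.private : id => s.availability_zone }
}
```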

Referencing Remote State

One of the most powerful data sources lets you read outputs from another Terraform state:

data "terraform_remote_state" "networking" {
  backend = "s3"
 
  config = {
    bucket = "my-terraform-state"
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}
 
resource "aws_instance" "app" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.medium"
  subnet_id     = data.terraform_remote_state.networking.outputs.private_subnet_ids[0]
}

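This pattern enables separation of concerns: the networking team manages VPCs and subnets, and application teams reference them via remote state outputs. Only outputs declared in the root module of the networking configuration are visible this way, and the consumer needs read access to the whole state file, not just the outputs.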
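
For the lookup above to work, the networking configuration has to publish the value under the same name. A minimal sketch of that producer side follows; it assumes the networking configuration defines its subnets as aws_subnet.private with count, which the example above does not show.

```hcl
# In the networking configuration, whose state is stored at
# networking/terraform.tfstate in the my-terraform-state bucket.
output "private_subnet_ids" {
  description = "IDs of the private subnets, consumed by application configurations"
  value       = aws_subnet.private[*].id # assumes aws_subnet.private uses count
}
```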

Looking Up IAM Policies

Reference AWS-managed or existing custom policies:

data "aws_iam_policy" "admin" {
  name = "AdministratorAccess"
}
 
data "aws_iam_policy_document" "custom" {
  statement {
    actions = [
      "s3:GetObject",
      "s3:PutObject",
      "s3:ListBucket",
    ]
 
    resources = [
      "arn:aws:s3:::my-bucket",
      "arn:aws:s3:::my-bucket/*",
    ]
  }
 
  statement {
    actions   = ["logs:*"]
    resources = ["*"]
  }
}
 
resource "aws_iam_policy" "app" {
  name   = "app-policy"
  policy = data.aws_iam_policy_document.custom.json
}

The aws_iam_policy_document data source is particularly valuable because it generates valid JSON policy documents with proper syntax, reducing the chance of policy errors.
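
The same data source also covers trust policies. As an illustration, the sketch below builds an assume-role policy for EC2 and attaches the aws_iam_policy.app policy from the example above to a role. The role name app-role and the resource labels are hypothetical.

```hcl
# Trust policy: lets EC2 instances assume the role.
data "aws_iam_policy_document" "assume_ec2" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "app" {
  name               = "app-role" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.assume_ec2.json
}

# Attach the customer-managed policy defined in the example above.
resource "aws_iam_role_policy_attachment" "app" {
  role       = aws_iam_role.app.name
  policy_arn = aws_iam_policy.app.arn
}
```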

Querying Availability Zones

Build region-agnostic configurations:

data "aws_availability_zones" "available" {
  state = "available"
 
  filter {
    name   = "opt-in-status"
    values = ["opt-in-not-required"]
  }
}
 
resource "aws_subnet" "private" {
  count             = min(length(data.aws_availability_zones.available.names), 3)
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]
 
  tags = {
    Name = "private-${data.aws_availability_zones.available.names[count.index]}"
    Tier = "private"
  }
}

This creates subnets in up to 3 availability zones, regardless of which region you deploy to.
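
To see what cidrsubnet produces here, consider a VPC CIDR of 10.0.0.0/16 (an assumed value, not taken from the example). Adding 8 bits to the /16 prefix gives /24 networks, and count.index selects which one.

```hcl
locals {
  vpc_cidr = "10.0.0.0/16" # assumed VPC range for illustration

  # Same expression as in the subnet resource, evaluated for indexes 0..2.
  subnet_cidrs = [for i in range(3) : cidrsubnet(local.vpc_cidr, 8, i)]
}

output "subnet_cidrs" {
  # ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"]
  value = local.subnet_cidrs
}
```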

Current AWS Account and Region

Access metadata about your current context:

data "aws_caller_identity" "current" {}
data "aws_region" "current" {}
data "aws_partition" "current" {}
 
locals {
  account_id = data.aws_caller_identity.current.account_id
  region     = data.aws_region.current.name # AWS provider v6+ deprecates "name" in favor of "region"
  partition  = data.aws_partition.current.partition
}
 
output "account_info" {
  value = "Running in account ${local.account_id} in region ${local.region}"
}
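
A common use of these values is assembling ARNs without hardcoding the account ID or the partition, which is aws in commercial regions but aws-us-gov in GovCloud and aws-cn in China. A small sketch, in which the role name deploy is hypothetical:

```hcl
# Repeated so the snippet stands alone; omit if already declared.
data "aws_caller_identity" "current" {}
data "aws_partition" "current" {}

locals {
  # IAM is a global service, so the region field of the ARN stays empty.
  deploy_role_arn = "arn:${data.aws_partition.current.partition}:iam::${data.aws_caller_identity.current.account_id}:role/deploy"
}

output "deploy_role_arn" {
  value = local.deploy_role_arn
}
```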

Looking Up Route 53 Zones

Reference existing DNS zones:

data "aws_route53_zone" "main" {
  name         = "example.com"
  private_zone = false
}
 
resource "aws_route53_record" "app" {
  zone_id = data.aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"
 
  alias {
    name                   = aws_lb.app.dns_name
    zone_id                = aws_lb.app.zone_id
    evaluate_target_health = true
  }
}

The `external` Data Source

For data that no provider exposes through a native data source, use the external data source to run any program:

data "external" "git_info" {
  program = ["bash", "-c", <<-EOT
    echo '{"commit":"'$(git rev-parse --short HEAD)'","branch":"'$(git rev-parse --abbrev-ref HEAD)'"}'
  EOT
  ]
}
 
resource "aws_instance" "app" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"
 
  tags = {
    GitCommit = data.external.git_info.result.commit
    GitBranch = data.external.git_info.result.branch
  }
}

The program must print a single JSON object to stdout, and every value in that object must be a string. Numbers, booleans, lists, and nested objects are rejected.
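
Building JSON with echo and nested quotes breaks if a value ever contains a double quote. One way to hand the escaping to a tool is jq, sketched below. This assumes jq and git are installed wherever Terraform runs; the label git_info_jq is illustrative.

```hcl
data "external" "git_info_jq" {
  # jq -n builds the object from scratch; --arg passes each value as a
  # properly escaped JSON string.
  program = ["bash", "-c", <<-EOT
    jq -n \
      --arg commit "$(git rev-parse --short HEAD)" \
      --arg branch "$(git rev-parse --abbrev-ref HEAD)" \
      '{commit: $commit, branch: $branch}'
  EOT
  ]
}

output "git_commit" {
  value = data.external.git_info_jq.result.commit
}
```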

The `http` Data Source

Fetch data from HTTP endpoints:

data "http" "my_ip" {
  url = "https://ipv4.icanhazip.com"
}
 
resource "aws_security_group_rule" "ssh_from_my_ip" {
  type              = "ingress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  cidr_blocks       = ["${chomp(data.http.my_ip.response_body)}/32"]
  security_group_id = aws_security_group.bastion.id
}
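
An HTTP lookup can fail or return something unexpected, and the result here ends up in a firewall rule. The sketch below adds postcondition checks to the same lookup. It assumes Terraform 1.2 or later and a version of the hashicorp/http provider that exposes status_code alongside response_body.

```hcl
data "http" "my_ip" {
  url = "https://ipv4.icanhazip.com"

  lifecycle {
    postcondition {
      condition     = self.status_code == 200
      error_message = "The IP lookup did not return HTTP 200."
    }

    postcondition {
      # cidrhost fails on input that is not a valid CIDR prefix, so can()
      # turns a malformed body into a false condition.
      condition     = can(cidrhost("${chomp(self.response_body)}/32", 0))
      error_message = "The IP lookup did not return a valid IP address."
    }
  }
}
```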

Data Source Lifecycle

Understanding when data sources are evaluated is crucial:

- During plan: most data sources are read during terraform plan.
- During apply: data sources whose arguments depend on values that are not known until apply, such as attributes of resources that have yet to be created, are deferred until apply.
- Every run: data sources are re-read on every plan and apply; results are not reused from previous runs.

This means data source results can change between runs if the underlying data changes. For example, a new AMI might be published, or a VPC might be modified.
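
The plan-versus-apply distinction can be seen in a small sketch. The first lookup has no unknown inputs, so it is read during plan. The second takes its argument from an attribute that is unknown until the bucket is created, so on the first run Terraform defers the read to apply. The bucket name is a hypothetical placeholder.

```hcl
# Read during plan: nothing here depends on a resource.
data "aws_region" "current" {}

resource "aws_s3_bucket" "logs" {
  bucket = "example-logs-bucket" # hypothetical name
}

# Deferred on the first run: aws_s3_bucket.logs.id is only known
# after the bucket exists.
data "aws_s3_bucket" "logs" {
  bucket = aws_s3_bucket.logs.id
}
```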

Best Practices

1. Use Specific Filters

Avoid broad queries that might return unexpected results:

# Bad: selects the region's default VPC, which is rarely the one you intend
data "aws_vpc" "main" {
  default = true
}
 
# Good — specific filter
data "aws_vpc" "main" {
  filter {
    name   = "tag:Name"
    values = ["production-vpc"]
  }
 
  filter {
    name   = "tag:Environment"
    values = ["production"]
  }
}

2. Handle Multiple Results

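Singular data sources such as aws_ami and aws_vpc fail if the query matches more than one object. Use most_recent = true for AMIs, tighten the filters until exactly one result remains, or switch to a plural data source such as aws_subnets when you expect a list.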
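
Plural data sources such as aws_subnets need a different safeguard: a query that matches nothing typically returns an empty list rather than an error, and the failure only surfaces later as an invalid index. A hedged sketch of catching that early with a postcondition (Terraform 1.2 or later); the tag filter mirrors the earlier subnet example.

```hcl
data "aws_subnets" "private" {
  filter {
    name   = "tag:Tier"
    values = ["private"]
  }

  lifecycle {
    postcondition {
      # Fail at read time instead of on a later ids[0] lookup.
      condition     = length(self.ids) > 0
      error_message = "No subnets tagged Tier=private were found."
    }
  }
}
```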

3. Prefer Remote State Over Data Sources

When both teams use Terraform, terraform_remote_state is more reliable than querying resources by tags:

# Better — explicit contract via outputs
data "terraform_remote_state" "networking" {
  backend = "s3"
  config  = { ... }
}
 
# Riskier — depends on tags being correct
data "aws_vpc" "main" {
  filter { ... }
}

4. Use `aws_iam_policy_document` for IAM

Always prefer the aws_iam_policy_document data source over inline JSON:

- Type-checked at plan time
- Composable with source_policy_documents
- Readable and maintainable
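
As an illustration of composition, the sketch below merges a base document into a second one through source_policy_documents. The document labels, sid values, and bucket ARN are illustrative.

```hcl
# Shared statements that several policies can reuse.
data "aws_iam_policy_document" "base" {
  statement {
    sid       = "ReadLogs"
    actions   = ["logs:GetLogEvents"]
    resources = ["*"]
  }
}

# Final document: the statements from "base" plus one of its own.
# Non-empty sid values must be unique across the merged documents.
data "aws_iam_policy_document" "combined" {
  source_policy_documents = [data.aws_iam_policy_document.base.json]

  statement {
    sid       = "ReadBucket"
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::my-bucket/*"]
  }
}

output "combined_policy_json" {
  value = data.aws_iam_policy_document.combined.json
}
```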

5. Don't Over-Use External Data Sources

The external data source should be a last resort. It introduces dependencies on local tools and reduces portability. Check if a native provider or data source exists first.

Conclusion

Data sources are the bridge between Terraform-managed infrastructure and everything else. They enable you to build configurations that reference existing resources, query dynamic information, and integrate with external systems — all without duplicating resource management. Master data sources, and you'll write more flexible, maintainable, and collaborative Terraform code.
