Introduction
Not everything in your infrastructure is managed by Terraform. Legacy resources, manually created configurations, resources managed by other teams, or information from external services all exist outside your Terraform state. Data sources are Terraform’s mechanism for reading information from these external sources and using it in your configurations.
Unlike resources, data sources don’t create, update, or delete anything. They perform read-only queries that return information you can reference elsewhere in your code. This makes them essential for building configurations that integrate with existing infrastructure rather than starting from scratch.
Data Source Basics
A data source is defined using a data block:
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"
}
In this example, instead of hardcoding an AMI ID (which changes across regions and over time), we query for the latest Ubuntu 22.04 AMI dynamically. Every time you run terraform plan, it fetches the current AMI ID.
Common Data Source Patterns
Looking Up VPCs and Subnets
When your networking is managed by a different team or Terraform configuration:
data "aws_vpc" "main" {
  filter {
    name   = "tag:Name"
    values = ["production-vpc"]
  }
}

data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.main.id]
  }

  filter {
    name   = "tag:Tier"
    values = ["private"]
  }
}

resource "aws_instance" "app" {
  ami                    = data.aws_ami.ubuntu.id
  instance_type          = "t3.medium"
  subnet_id              = data.aws_subnets.private.ids[0]
  vpc_security_group_ids = [aws_security_group.app.id]
}

resource "aws_security_group" "app" {
  name   = "app-sg"
  vpc_id = data.aws_vpc.main.id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [data.aws_vpc.main.cidr_block]
  }
}
Referencing Remote State
One of the most powerful data sources lets you read outputs from another Terraform state:
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "my-terraform-state"
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.medium"
  subnet_id     = data.terraform_remote_state.networking.outputs.private_subnet_ids[0]
}
This pattern enables separation of concerns — the networking team manages VPCs and subnets, and application teams reference them via remote state outputs.
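For this to work, the networking configuration has to publish the values it wants to share — only declared outputs are visible through terraform_remote_state. A minimal sketch of the producing side, assuming (hypothetically) that the networking configuration defines its subnets as aws_subnet.private with count or for_each:

```hcl
# In the networking configuration (the producing side).
# Anything not declared as an output stays invisible to consumers.
output "vpc_id" {
  value = aws_vpc.main.id
}

output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}
```

Treat these outputs as a published interface: renaming or removing one is a breaking change for every configuration that consumes the state.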
Looking Up IAM Policies
Reference AWS-managed or existing custom policies:
data "aws_iam_policy" "admin" {
  name = "AdministratorAccess"
}

data "aws_iam_policy_document" "custom" {
  statement {
    actions = [
      "s3:GetObject",
      "s3:PutObject",
      "s3:ListBucket",
    ]
    resources = [
      "arn:aws:s3:::my-bucket",
      "arn:aws:s3:::my-bucket/*",
    ]
  }

  statement {
    actions   = ["logs:*"]
    resources = ["*"]
  }
}

resource "aws_iam_policy" "app" {
  name   = "app-policy"
  policy = data.aws_iam_policy_document.custom.json
}
The aws_iam_policy_document data source is particularly valuable because it generates valid JSON policy documents with proper syntax, reducing the chance of policy errors.
Querying Availability Zones
Build region-agnostic configurations:
data "aws_availability_zones" "available" {
  state = "available"

  filter {
    name   = "opt-in-status"
    values = ["opt-in-not-required"]
  }
}

resource "aws_subnet" "private" {
  count             = min(length(data.aws_availability_zones.available.names), 3)
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name = "private-${data.aws_availability_zones.available.names[count.index]}"
    Tier = "private"
  }
}
This creates subnets in up to 3 availability zones, regardless of which region you deploy to.
Current AWS Account and Region
Access metadata about your current context:
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}
data "aws_partition" "current" {}

locals {
  account_id = data.aws_caller_identity.current.account_id
  region     = data.aws_region.current.name
  partition  = data.aws_partition.current.partition
}

output "account_info" {
  value = "Running in account ${local.account_id} in region ${local.region}"
}
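A common use for this metadata is constructing ARNs that work across AWS partitions rather than hardcoding arn:aws:. A sketch, using the locals above (the log group name is hypothetical):

```hcl
locals {
  # Hardcoding "arn:aws:..." breaks in GovCloud (aws-us-gov) and
  # China (aws-cn) partitions; interpolating the partition, region,
  # and account ID keeps the ARN portable.
  log_group_arn = "arn:${local.partition}:logs:${local.region}:${local.account_id}:log-group:/app/api:*"
}
```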
Looking Up Route 53 Zones
Reference existing DNS zones:
data "aws_route53_zone" "main" {
  name         = "example.com"
  private_zone = false
}

resource "aws_route53_record" "app" {
  zone_id = data.aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.app.dns_name
    zone_id                = aws_lb.app.zone_id
    evaluate_target_health = true
  }
}
The external Data Source
For data that doesn’t have a native Terraform provider, use the external data source to call any script:
data "external" "git_info" {
  program = ["bash", "-c", <<-EOT
    echo '{"commit":"'$(git rev-parse --short HEAD)'","branch":"'$(git rev-parse --abbrev-ref HEAD)'"}'
  EOT
  ]
}

resource "aws_instance" "app" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"

  tags = {
    GitCommit = data.external.git_info.result.commit
    GitBranch = data.external.git_info.result.branch
  }
}
The script must output valid JSON to stdout with string values only.
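Hand-quoting JSON in shell is fragile — a branch name containing a quote would break it. If jq is available on the machines running Terraform (an assumption), building the JSON with it is safer:

```hcl
data "external" "git_info" {
  # jq -n builds a JSON object from scratch; --arg safely escapes
  # each value, so odd characters in branch names can't corrupt
  # the output.
  program = ["bash", "-c", <<-EOT
    jq -n \
      --arg commit "$(git rev-parse --short HEAD)" \
      --arg branch "$(git rev-parse --abbrev-ref HEAD)" \
      '{commit: $commit, branch: $branch}'
  EOT
  ]
}
```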
The http Data Source
Fetch data from HTTP endpoints:
data "http" "my_ip" {
  url = "https://ipv4.icanhazip.com"
}

resource "aws_security_group_rule" "ssh_from_my_ip" {
  type              = "ingress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  cidr_blocks       = ["${chomp(data.http.my_ip.response_body)}/32"]
  security_group_id = aws_security_group.bastion.id
}
Data Source Lifecycle
Understanding when data sources are evaluated is crucial:
- During plan: Most data sources are read during terraform plan
- During apply: Data sources that depend on resources being created are deferred until apply
- Every run: Data sources are re-read on every plan/apply — they don’t cache results between runs
This means data source results can change between runs if the underlying data changes. For example, a new AMI might be published, or a VPC might be modified.
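The deferral case is worth seeing concretely. When a data source's arguments depend on a resource created in the same run, Terraform can't read it at plan time and postpones the read until apply. A contrived sketch:

```hcl
resource "aws_eip" "nat" {
  domain = "vpc"
}

# The public IP doesn't exist until the EIP is created, so this
# read is deferred to apply time and its attributes show as
# "(known after apply)" in the plan output.
data "aws_eip" "nat_lookup" {
  public_ip = aws_eip.nat.public_ip
}
```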
Best Practices
1. Use Specific Filters
Avoid broad queries that might return unexpected results:
# Bad — might match unintended VPCs
data "aws_vpc" "main" {
  default = true
}

# Good — specific filters
data "aws_vpc" "main" {
  filter {
    name   = "tag:Name"
    values = ["production-vpc"]
  }

  filter {
    name   = "tag:Environment"
    values = ["production"]
  }
}
2. Handle Multiple Results
Some data sources can return multiple results. Use most_recent for AMIs or ensure your filters are specific enough to return exactly one result.
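When a query must return results for the configuration to make sense, you can also fail fast with a custom condition (supported on data sources since Terraform 1.2) instead of letting an empty result surface as a confusing index error later. A sketch based on the earlier subnet lookup:

```hcl
data "aws_subnets" "private" {
  filter {
    name   = "tag:Tier"
    values = ["private"]
  }

  lifecycle {
    # Abort with a clear message if the filters match nothing,
    # rather than failing later on private.ids[0].
    postcondition {
      condition     = length(self.ids) > 0
      error_message = "No subnets tagged Tier=private were found."
    }
  }
}
```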
3. Prefer Remote State Over Data Sources
When both teams use Terraform, terraform_remote_state is more reliable than querying resources by tags:
# Better — explicit contract via outputs
data "terraform_remote_state" "networking" {
  backend = "s3"
  config  = { ... }
}

# Riskier — depends on tags being correct
data "aws_vpc" "main" {
  filter { ... }
}
4. Use aws_iam_policy_document for IAM
Always prefer the aws_iam_policy_document data source over inline JSON:
- Type-checked at plan time
- Composable with source_policy_documents
- Readable and maintainable
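Composition via source_policy_documents lets you layer a shared baseline under service-specific statements. A sketch, assuming a hypothetical organization-wide baseline document:

```hcl
# Shared baseline: deny any S3 access over plain HTTP.
data "aws_iam_policy_document" "baseline" {
  statement {
    sid       = "DenyInsecureTransport"
    effect    = "Deny"
    actions   = ["s3:*"]
    resources = ["*"]

    condition {
      test     = "Bool"
      variable = "aws:SecureTransport"
      values   = ["false"]
    }
  }
}

data "aws_iam_policy_document" "app" {
  # Statements from the baseline document are merged in; statements
  # defined here are appended after them.
  source_policy_documents = [data.aws_iam_policy_document.baseline.json]

  statement {
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::my-bucket/*"]
  }
}
```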
5. Don’t Over-Use External Data Sources
The external data source should be a last resort. It introduces dependencies on local tools and reduces portability. Check if a native provider or data source exists first.
Conclusion
Data sources are the bridge between Terraform-managed infrastructure and everything else. They enable you to build configurations that reference existing resources, query dynamic information, and integrate with external systems — all without duplicating resource management. Master data sources, and you’ll write more flexible, maintainable, and collaborative Terraform code.

