The Datadog IP Ranges Anti-Pattern

December 31, 2020

Why do we seek IP addresses in the Cloud-first world?

Is it the mindset leftover from the bygone era of procured hardware & CIDR blocks?
Is it the availability of published IP ranges that makes you want to utilise them?
Or is it a part of how the internet works that is simply hard to detach from?

Let's consider the case of Datadog Agent v7.32.3, the endpoint for which, as per official docs, would simply be 7-32-3-app.agent.datadoghq.com. This is the contract Datadog will fulfil, and we shouldn't assume more.

But first, Datadog IP ranges at present...

The IP addresses currently resolving for that FQDN are more than a few.

$ dig +short 7-32-3-app.agent.datadoghq.com
metrics.agent.datadoghq.com.
alb-metrics-agent-shard1-770518637.us-east-1.elb.amazonaws.com.
3.233.148.117
3.233.148.51
3.233.148.8
3.233.148.16
3.233.148.11
3.233.148.35
3.233.148.18
3.233.148.83
$ date --utc
Tue  1 Feb 15:03:09 UTC 2022

Moreover, they are likely to change over time, possibly within 5 minutes, because they are assigned to Cloud load balancers from a vast pool. Every time there is a change of any sort, the IPs allocated to the load balancer are likely to change. The Cloud providers update the DNS entries on the fly when this happens.
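One way to observe this churn is to resolve the name repeatedly and compare the answers. The following is a minimal sketch using only the Python standard library; the function names (`resolve_ipv4`, `diff_resolutions`, `watch`) and the 300-second default interval are illustrative assumptions, not part of any Datadog or AWS tooling.

```python
from __future__ import annotations

import socket
import time


def resolve_ipv4(name: str) -> set[str]:
    """Return the IPv4 addresses the local resolver currently gives for name."""
    infos = socket.getaddrinfo(name, 443, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}


def diff_resolutions(before: set[str], after: set[str]) -> tuple[set[str], set[str]]:
    """Return (added, removed) between two resolutions of the same name."""
    return after - before, before - after


def watch(name, rounds, interval_s=300, resolver=resolve_ipv4, sleep=time.sleep):
    """Resolve name once, then `rounds` more times, recording what changed each time."""
    changes = []
    previous = resolver(name)
    for _ in range(rounds):
        sleep(interval_s)
        current = resolver(name)
        changes.append(diff_resolutions(previous, current))
        previous = current
    return changes
```

Calling `watch("7-32-3-app.agent.datadoghq.com", rounds=12)` would sample the name for about an hour. What it reports depends entirely on when and where it runs, which is rather the point.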

Almost Random IPs

A few Datadog CIDR blocks perhaps...?

You may also note that the IPs currently resolved are non-contiguous. Since firewall rules (Security Groups) only take CIDR blocks, these IPs would take a /32 entry each. Datadog do make their IP ranges available via an API, but there are too many of them to fit within the number of rules a Security Group allows.
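To check that claim, the resolved addresses can be collapsed into the fewest CIDR blocks that cover exactly them. This is a small sketch with Python's standard `ipaddress` module; `minimal_cidrs` is a hypothetical helper name, and the address list is copied from the dig output earlier in this post.

```python
import ipaddress

# The eight addresses from the dig output earlier in this post.
RESOLVED = [
    "3.233.148.117", "3.233.148.51", "3.233.148.8", "3.233.148.16",
    "3.233.148.11", "3.233.148.35", "3.233.148.18", "3.233.148.83",
]


def minimal_cidrs(ips):
    """Collapse addresses into the fewest CIDR blocks covering exactly those addresses."""
    networks = (ipaddress.ip_network(ip) for ip in ips)
    return [str(net) for net in ipaddress.collapse_addresses(networks)]


cidrs = minimal_cidrs(RESOLVED)
print(len(cidrs), "rules needed:", cidrs)
```

None of the eight addresses are adjacent, so nothing merges and each one still needs its own /32 entry.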

⦾ Could you run a script, in a loop, to keep resolving IP addresses for given names and updating the firewall rules?

You could, but would it be effective? Nope.

caution

The IPs resolved at any time by your script could, and in many cases would, differ from the IPs resolved by the clients that need to connect to this resource on the internet. These mismatches can cause serious operational issues, as legitimate connections get blocked by rules that are already stale.
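The failure mode is a simple set difference. In the sketch below, the split of addresses between "what the script saw" and "what the client saw" is hypothetical (the addresses themselves are from the dig output earlier in this post), and `blocked_for_client` is an illustrative name.

```python
from __future__ import annotations


def blocked_for_client(allowlisted: set[str], client_resolved: set[str]) -> set[str]:
    """Addresses the client may pick that the firewall rules do not yet allow."""
    return client_resolved - allowlisted


# Hypothetical: the updater script resolved the name a moment before the client did.
script_saw = {"3.233.148.117", "3.233.148.51", "3.233.148.8", "3.233.148.16"}
client_saw = {"3.233.148.16", "3.233.148.11", "3.233.148.35", "3.233.148.18"}

blocked = blocked_for_client(script_saw, client_saw)
print(f"{len(blocked)} of {len(client_saw)} client answers would be dropped")
```

Any connection attempt to an address in `blocked` fails until the script's next run happens to catch up, by which time the client may have been handed yet another set.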

IP address mismatch between Firewall Rules and source Client

Besides, setting up the infrastructure for this script to run and giving it the right permissions will require considerable effort.

⦾ Could you create a rule that allows only port 443 over TCP outbound, but for all IP addresses, i.e. 0.0.0.0/0?

You could, and since defence-in-depth measures always reduce risk, this surely reduces the risk somewhat. Would it reduce substantive risk though?

Let's draw an analogy here of IP Addresses and Ports to Hotels and Rooms.

caution

The rule 0.0.0.0/0 port 443 would be analogous to any hotel in the world as long as the room number is 443.

Opening up port 443 to the entire world

Well, let's admit it. There are more than 4 billion IPv4 addresses in the world, many in the hands of adversaries or in hostile territories. Would you rather care about the hotel address or the room number? And most traffic is on port 443 anyway.
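To put a number on the analogy, here is a small sketch of how such a rule evaluates a destination. `rule_allows` is a hypothetical stand-in for a firewall's matching logic, not any Cloud provider's API, and 203.0.113.66 is an address from a documentation-only range standing in for a host you would not want to talk to.

```python
import ipaddress

ANYWHERE = ipaddress.ip_network("0.0.0.0/0")


def rule_allows(dest_ip: str, dest_port: int, cidr=ANYWHERE, port: int = 443) -> bool:
    """True if an egress rule for `cidr` on `port` would permit this destination."""
    return ipaddress.ip_address(dest_ip) in cidr and dest_port == port


print(ANYWHERE.num_addresses)            # 4294967296 hotels
print(rule_allows("3.233.148.8", 443))   # the Datadog endpoint: allowed
print(rule_allows("203.0.113.66", 443))  # any other host on 443: also allowed
```

The only thing the rule ever rejects is a different room number.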

Let's not introduce a rule that only creates an illusion of security.

Leaving destination 0.0.0.0/0 open also inadvertently permits your software and frameworks to phone home and divulge metrics or data that your organisation may not be interested in sharing. See our findings on Ubuntu for an example.

StackDriver DiscrimiNAT flow log

Remote Code Execution (RCE) vulnerabilities could have a field day. See our analysis of Log4J's Log4Shell vulnerability as seen through an egress filter.

CloudWatch DiscrimiNAT flow log

info

So it does appear to be a part of how the internet works that is hard to detach from.

⦿ Wouldn't it be easier if one could put the FQDNs straight into firewall rules and let the Cloud take care of all this?

With DiscrimiNAT, you can. Just swap your Cloud provider's basic NAT Gateway with a DiscrimiNAT and Bob's your uncle.

This isn't a radical new approach. In fact, before the Cloud providers had a native NAT Gateway offering, creating a proper NAT Instance with a firewall OS was the way.

This Is The Way

AWS' documentation on how to create one when your use case is not vanilla is here. AWS' Well-Architected Framework even goes on to recommend the use of a proper FQDN egress filtering solution.

info

DiscrimiNAT has been engineered from the ground up with Cloud patterns in mind. It's as Cloud-native as they get! See the FAQ for more info.

Your workflow with DiscrimiNAT will have:

✓ No more private addressing and DNS hacks to make that work.
✓ No more creating private Container registries and running CI jobs to sync them with upstream, only so you could pull from a private address.
✓ Removal of the bulk of VPC Endpoints from your Infrastructure-as-Code (IaC) where the resource was public anyway.
✓ Copying desired FQDNs found in flow logs and pasting them into firewall rules.

Copy-Paste

Whether you operate the Cloud via the web console, the CLI, Terraform, CloudFormation or Deployment Manager, the class for firewall rules is standard and always a built-in. Your next CD cycle will roll out the egress control change, and you can move on to the next ticket in your backlog.

This also simplifies the process of adding or changing an FQDN-to-allow going forward. The configuration lives in the deployment code, and the potential blast-radius of an impact is contained to the application associated with the particular firewall rules.

You also get certain security baselines met out of the box. Only TLS 1.2 or better connections are allowed (or SSH v2 if using SFTP or similar), and there is no risk of inadvertently making a plaintext HTTP connection to the world outside of your VPC. No need to worry here: any modern HTTPS connection will be using TLS 1.2 or better anyway, and only a misconfigured client, a misconfigured server, or an ancient server on the other side would negotiate a lower version of TLS.

Award for Security

Oh, and the forensic logs flow into StackDriver (or CloudWatch) automatically. Any changes made to the firewall rules, any connections denied, and any connections allowed (and against which rule) are recorded right in the Cloud itself for troubleshooting and analysis.

🕐 Shall we try again? $ dig +short 7-32-3-app.agent.datadoghq.com

Replace your basic NAT with a DiscrimiNAT today. It takes 5 minutes to deploy, and the free trial will allow you plenty of time to get a feel for it in your workflow.

You can either retrofit one into your current VPC, or create a new VPC from scratch with a DiscrimiNAT serving the egress function. Just pick the desired IaC template from our library and off you go.

Swapping Components

Don't like it? It's as quick to remove as it was to put in. Since there was never a change to your application configuration, swapping a DiscrimiNAT for another NAT will have no impact on it.

tip

Need more time or instances on the free trial? Just get in touch and we'll sort you out.

MacBook Deploy

Follow through to our AWS Quick Start, or the Google Cloud (GCP) Quick Start. Or watch a short demo first: GCP / AWS.