The Datadog IP Ranges Anti-Pattern
Why do we seek IP addresses in the Cloud-first world?
- Is it a mindset left over from the bygone era of procured hardware & CIDR blocks?
- Is it the availability of published IP ranges that makes you want to utilise them?
- Or is it a part of how the internet works that is simply hard to detach from?
Let’s consider the case of Datadog Agent v7.25.1, the endpoint for which, as per official docs, would simply be 7-25-1-app.agent.datadoghq.com. This is the contract Datadog will fulfil, and we shouldn’t assume more.
But first, Datadog IP ranges at present...
The IP addresses currently resolving for that FQDN are more than a few.
$ dig +short 7-25-1-app.agent.datadoghq.com
alb-metrics-agent-shard2-1030968321.us-east-1.elb.amazonaws.com.
188.8.131.52
184.108.40.206
220.127.116.11
18.104.22.168
22.214.171.124
126.96.36.199
188.8.131.52
184.108.40.206
$ date --utc
Sat 24 Apr 15:00:00 UTC 2021
Moreover, they are likely to change over time, possibly within as little as 5 minutes. Cloud load balancers are assigned addresses from a vast pool, and any change to a load balancer is likely to change the addresses allocated to it. The Cloud providers update the DNS entries on the fly when this happens.
A few Datadog CIDR blocks perhaps...?
You may also note that the IPs currently resolved are non-contiguous. Since firewall rules (Security Groups) only take CIDR blocks, these IPs would take a /32 entry each. Datadog do make their IP ranges available via an API, but there are too many of them to fit within the number of rules a Security Group allows.
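To make the /32 point concrete, here is a minimal sketch using Python's standard `ipaddress` module. The addresses are hypothetical ones taken from the RFC 5737 documentation ranges, not Datadog's, and `minimal_cidr_blocks` is an illustrative helper name.

```python
import ipaddress


def minimal_cidr_blocks(addresses):
    """Collapse individual IPs into the fewest CIDR blocks covering exactly them."""
    # Each bare address is parsed as a /32 network.
    hosts = {ipaddress.ip_network(a) for a in addresses}
    return list(ipaddress.collapse_addresses(hosts))


# Hypothetical, non-contiguous addresses (RFC 5737 documentation ranges).
scattered = ["192.0.2.10", "198.51.100.7", "203.0.113.99", "192.0.2.200"]
# Hypothetical adjacent addresses, for contrast.
adjacent = ["192.0.2.0", "192.0.2.1", "192.0.2.2", "192.0.2.3"]

scattered_blocks = minimal_cidr_blocks(scattered)
adjacent_blocks = minimal_cidr_blocks(adjacent)

print("scattered:", [str(n) for n in scattered_blocks])
print("adjacent: ", [str(n) for n in adjacent_blocks])
```

Adjacent addresses merge into a single block, while scattered ones stay as one /32 each, so the rule count grows with every address the name resolves to.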
⦾ Could you run a script, in a loop, to keep resolving IP addresses for given names and updating the firewall rules?
You could, but would it be effective? Nope.
Besides, setting up the infrastructure for this script to run and giving it the right permissions will require considerable effort.
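For illustration, one pass of such a loop might be sketched as below. This is an assumption about how the script could work, not a recommended design: `plan_rule_changes` is a hypothetical helper, the addresses are from documentation ranges, and the DNS lookup and the firewall API calls are deliberately left out.

```python
def plan_rule_changes(current_rules, resolved_ips):
    """Return the /32 rules to add and to revoke so rules match the latest DNS answer."""
    wanted = {f"{ip}/32" for ip in resolved_ips}
    existing = set(current_rules)
    to_add = sorted(wanted - existing)
    to_revoke = sorted(existing - wanted)
    return to_add, to_revoke


# Hypothetical state: rules written on the previous pass vs. a fresh DNS answer.
current_rules = {"192.0.2.10/32", "198.51.100.7/32"}
fresh_answer = ["198.51.100.7", "203.0.113.99"]

to_add, to_revoke = plan_rule_changes(current_rules, fresh_answer)
print("add:", to_add)
print("revoke:", to_revoke)
```

Between the moment the DNS answer changes and the moment the next pass applies these changes, the new address is still blocked and the old one is still allowed, even though the old one may have gone back into the provider's pool. That window is one plausible reason the loop falls short.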
⦾ Could you create a rule that allows only port 443 over TCP outbound, but for all IP addresses, i.e. 0.0.0.0/0?
You could, and since defence-in-depth measures always reduce risk, this surely reduces the risk somewhat. Would it reduce substantive risk though?
Let’s draw an analogy here of IP Addresses and Ports to Hotels and Rooms.
Allowing every IP address on port 443 would be analogous to permitting any hotel in the world as long as the room number is 443.
Well, let’s admit it. There are more than 4 billion IPv4 addresses in the world, many of them in the hands of adversaries or in hostile territories. Would you rather care about the hotel address or the room number? And most of the traffic is on port 443 anyway.
Let’s not introduce a rule that creates only an illusion of security.
Leaving 0.0.0.0/0 open also inadvertently permits your software and frameworks to phone home and divulge metrics or data that your organisation may not be interested in sharing. See our findings on Ubuntu for an example.
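The hotel analogy can be sketched in a few lines. The function below is a simplified model of a port-only egress rule, not any Cloud provider's actual evaluation logic, and both destination addresses are hypothetical.

```python
import ipaddress

ANYWHERE = ipaddress.ip_network("0.0.0.0/0")


def port_only_rule_allows(dest_ip, dest_port, protocol="tcp"):
    """Model an egress rule of 0.0.0.0/0 on TCP 443: in effect only the port is checked."""
    in_range = ipaddress.ip_address(dest_ip) in ANYWHERE  # true for every IPv4 address
    return in_range and protocol == "tcp" and dest_port == 443


# Hypothetical destinations: one you intended to reach, one you did not.
intended = port_only_rule_allows("192.0.2.10", 443)
unintended = port_only_rule_allows("203.0.113.66", 443)
print(intended, unintended)
```

Both destinations pass, because the rule never looks at which hotel it is, only at the room number.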
⦿ Wouldn’t it be easier if one could stick in the FQDNs straight into firewall rules and let the Cloud take care of all this?
With discrimiNAT, you can. Just swap your Cloud provider’s basic NAT Gateway with a discrimiNAT and Bob’s your uncle.
This isn’t a radical new approach. In fact, before the Cloud providers had a native NAT Gateway offering, creating a proper NAT Instance with a firewall OS was the way.
AWS' documentation on how to create one when your use case is not vanilla is here. AWS' Well-Architected Framework even goes on to recommend the use of a proper FQDN egress filtering solution. And Google Cloud’s (GCP) is here.
Your workflow with discrimiNAT will have:
✓ No more private addressing and DNS hacks to make that work.
✓ No more creating private Container registries and running CI jobs to sync them with upstream, only so you could pull from a private address.
✓ Removal of the bulk of VPC Endpoints from your Infrastructure-as-Code (IaC) where the resource was public anyway.
✓ Copying desired FQDNs found in flow logs and pasting them into firewall rules.
Whether you operate the Cloud via the web console, the CLI, Terraform, CloudFormation or Deployment Manager, the class for firewall rules is standard and always a built-in. Your next CD cycle will roll out the egress control change, and you can move on to the next ticket in your backlog.
This also simplifies the process of adding or changing an FQDN-to-allow going forward. The configuration lives in the deployment code, and the potential blast-radius of an impact is contained to the application associated with the particular firewall rules.
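To show why a name-based rule is easier to review than a list of /32 entries, here is a minimal sketch of an exact-match FQDN allowlist check. It is purely illustrative and is not how discrimiNAT is implemented or configured; `ALLOWED_FQDNS` and `egress_allowed` are hypothetical names.

```python
# Hypothetical allowlist kept alongside the application's deployment code.
ALLOWED_FQDNS = frozenset({"7-25-1-app.agent.datadoghq.com"})


def egress_allowed(fqdn, allowlist=ALLOWED_FQDNS):
    """Return True if the destination name is on the allowlist (exact match only)."""
    # DNS names are case-insensitive and may carry a trailing root dot.
    normalised = fqdn.rstrip(".").lower()
    return normalised in allowlist


print(egress_allowed("7-25-1-app.agent.datadoghq.com"))
print(egress_allowed("example.com"))
```

Adding or changing a destination becomes a one-line change to the allowlist, regardless of how many addresses the name resolves to or how often they change.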
You also get certain security baselines met out of the box. Only TLS 1.2 or better connections are allowed (or SSH v2 if using SFTP or similar); and there is no risk of inadvertently making a plaintext HTTP connection to the world outside of your VPC. No need to worry here — any modern HTTPS connection will be utilising TLS 1.2 or better anyway, and it is only the misconfigurations on the client-side, the server-side, or an ancient server on the other side that would be running a lower version of TLS.
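On the client side, the same TLS 1.2 floor can be stated explicitly. The sketch below uses Python's standard `ssl` module; it shows a client setting only and says nothing about how discrimiNAT enforces its baseline. `strict_client_context` is an illustrative helper name.

```python
import ssl


def strict_client_context():
    """Build a client TLS context that refuses anything older than TLS 1.2."""
    ctx = ssl.create_default_context()  # verifies certificates and hostnames
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


ctx = strict_client_context()
print(ctx.minimum_version.name)
```

A client configured this way would already satisfy the baseline, which is why well-behaved applications should see no difference.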
Oh, and the forensic logs flow into Stackdriver (or CloudWatch) automatically. Any changes made to the firewall rules, and any connections denied or allowed against which rule, are right there in the Cloud itself for troubleshooting and analysis.
🕐 Shall we try again?
$ dig +short 7-25-1-app.agent.datadoghq.com
Replace your basic NAT with a discrimiNAT today. It takes 5 minutes to deploy, and the free trial will allow you plenty of time to get a feel for it in your workflow.
You can either retrofit one into your current VPC, or create a new VPC from scratch with a discrimiNAT serving the egress function. Just pick the desired IaC template from our library and off you go.
Don’t like it? It’s as quick to remove as it was to put in. Since there was never a change to your application configuration, there will be no impact from swapping a discrimiNAT for another NAT.