DNS caching and forwarding in Cloud Foundry with Consul Agents

Hector Rivas Gandara


Recently we were investigating some DNS resolution errors in our CF
installation running CF v238 with consul-agent.

We found some caveats and would like to hear your thoughts: how you
set it up, and what improvements are possible.

I will assume CF v238 [1] + Diego 0.1476.0 [3] + consul-release 92
[2], running on AWS.

Our findings:

* the consul-release/consul-agent job always configures consul as a
forwarding resolver.
* consul-agent is added to resolv.conf before the normal recursors.
The linux resolver will still query the recursors directly if consul
fails [5].
* consul forwarding capabilities are simple [4], compared with
bind/pdnsd/dnsmasq (eg. hardcoded timeout of 2s, simply iterates over
each recursor, no health monitoring, no caching/stale, no parallel
queries)
* it is not possible to customise the consul-agent DNS config (listen
port, disable forwarding, etc)
* consul-agent does not have "serve stale" configured [6], so the DNS
interface will fail during leader election [7]. Also, every query must
be served by the leader.
* the recursor, AWS DNS, might time out for some queries (as any DNS
server might)

We think this setup is not the most resilient one: leader election
makes resolution fail, consul times out after 2s but linux after 5s,
linux might bypass the consul-agent, and there is no stale serving.
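
As a rough illustration of the timeout mismatch, here is a small
Python sketch. It assumes the worst case where every recursor is
unresponsive, takes the 2s consul timeout from the finding above, and
uses the glibc default of 5s per nameserver. The number of recursors
is a made-up input, not something taken from our deployment.

```python
# Back-of-the-envelope model, not a measurement.
CONSUL_RECURSOR_TIMEOUT_S = 2  # per-recursor timeout in consul's forwarder (see [4])
LINUX_RESOLVER_TIMEOUT_S = 5   # glibc default "options timeout" per nameserver (see [5])


def consul_worst_case(num_recursors: int) -> int:
    """Consul tries recursors one after another, so the timeouts add up."""
    return num_recursors * CONSUL_RECURSOR_TIMEOUT_S


def linux_gives_up_first(num_recursors: int) -> bool:
    """True if the linux resolver times out before consul has tried every recursor."""
    return consul_worst_case(num_recursors) > LINUX_RESOLVER_TIMEOUT_S


if __name__ == "__main__":
    for n in (1, 2, 3):  # hypothetical recursor counts
        print(n, consul_worst_case(n), linux_gives_up_first(n))
```

Under these assumptions, with three or more unresponsive recursors the
linux resolver would give up on consul and move on to the next
nameserver in resolv.conf before consul has finished iterating.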

We are considering some improvements to this situation:

1. Deploy some DNS forwarding+cache servers (eg bind),
- use them in front of AWS DNS doing caching
- they can delegate to the consul masters for
*.services.cf.internal domains.
- Pros: central cache will have a better success rate.
- Cons: Another service/server to take care of.

2. Deploy a pdnsd/dnsmasq/bind in each node as local cache.
- Implement all the DNS local caching goodness (and badness)[8]
- Can forward to the local consul-agent for *.services.cf.internal domains.
- We will need to change consul-release to allow this.

3. Modify consul-release to allow enabling stale caching, changing the
DNS configuration and changing how resolv.conf is managed?

4. Enable serving stale DNS results in consul, to avoid issues during
leader election.

5. Submit some PRs to consul to improve the DNS forwarding, adding some
cool features like those in pdnsd [8]
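
To make options 3 and 4 more concrete, here is a minimal Python sketch
that builds the consul agent config fragment we have in mind. The
option names (dns_config.allow_stale, dns_config.max_stale, recursors)
are consul agent settings; the helper function, its defaults and the
idea that consul-release would render such a fragment are assumptions
for illustration only, since the release does not expose them today.

```python
import json


def consul_dns_overrides(max_stale="5s", recursors=None):
    """Build a hypothetical consul config fragment for DNS tuning.

    max_stale: how stale a DNS answer may be (consul duration string).
    recursors: None leaves forwarding untouched; an empty list means
               no recursors, ie. consul would not forward at all.
    """
    config = {
        "dns_config": {
            "allow_stale": True,  # let any server answer, not only the leader
            "max_stale": max_stale,
        }
    }
    if recursors is not None:
        config["recursors"] = list(recursors)
    return config


if __name__ == "__main__":
    # Example: serve stale answers and disable forwarding entirely.
    print(json.dumps(consul_dns_overrides(recursors=[]), indent=2))
```

With forwarding disabled like this, something else (option 1 or 2)
would have to sit in front of consul and the upstream recursors.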

Our questions are:

* Do you see DNS errors similar to ours? (eg deployments
failing due to DNS resolution errors)
* Is the current consul-release setup a desired configuration? Is
there any documented architectural decision about this?
* Should we use consul-agent as a DNS forwarder at all?
* How do you deal with DNS caching? Do you use any local agent or
bind caching servers?
* Should we enable serve stale in consul-release?
* Is it right that the consul-agent changes and parses
/etc/resolv.conf in the start scripts? [9][10]


Bunch of links:

[1] https://github.com/cloudfoundry/cf-release/releases/tag/v238
[2] https://github.com/cloudfoundry-incubator/consul-release/tree/v92
[3] https://github.com/cloudfoundry/diego-release/tree/v0.1476.0
[4] https://github.com/hashicorp/consul/blob/ab1654758f3e216ef0035ba3ae2defaccb772747/command/agent/dns.go#L767
[5] http://manpages.ubuntu.com/manpages/trusty/man5/resolv.conf.5.html
[6] https://github.com/cloudfoundry-incubator/consul-release/issues/26
[7] https://github.com/hashicorp/consul/issues/1888
[8] https://wiki.archlinux.org/index.php/pdnsd#Additional_performance_settings
[9] https://github.com/cloudfoundry-incubator/consul-release/blob/d2d875badabcbf5b41ac3802d8fd986c81c99688/jobs/consul_agent/templates/pre-start.erb#L9-L22
[10] https://github.com/cloudfoundry-incubator/consul-release/blob/d2d875badabcbf5b41ac3802d8fd986c81c99688/jobs/consul_agent/templates/agent_ctl.sh.erb#L19-L21

Hector Rivas | GDS / Multi-Cloud PaaS
