Re: DNS caching and forwarding in Cloud Foundry with Consul Agents

Amit Kumar Gupta

Hi Hector,

I believe you offered [0] to submit a PR to consul-release to address
the caching and stale-read issue [1]; we would happily welcome that.
I'm less sure about the other suggestions you mentioned, and I'm not
clear on what concrete problem they solve, but I'm happy to discuss
further if you can explain.


Amit, CF Infrastructure team PM

On Thu, Jul 21, 2016 at 4:10 AM, Hector Rivas Gandara <
hector.rivas.gandara(a)> wrote:


Recently we were investigating some DNS resolution errors in our CF
installation running CF v238 with consul-agent.

We found some caveats and would like to know your thoughts, how you
set this up, and possible improvements.

I will assume CF v238 [1] + Diego 0.1476.0 [2] + consul-release 92
[3], running on AWS.

Our findings:

* the consul-release/consul-agent job always configures consul as a
forwarding resolver.
* consul-agent is added to resolv.conf before the normal recursors.
The Linux resolver will still query the recursors directly if consul
fails [5].
* consul's forwarding capabilities are simple [4] compared with
bind/pdnsd/dnsmasq (e.g. hardcoded 2s timeout, simply iterates over
each recursor, no health monitoring, no caching/stale, no parallel
queries).
* it is not possible to customise the consul-agent DNS config (listen
port, disable forwarding, etc)
* consul-agent does not have "serve stale" configured [6], so the DNS
interface will fail during leader election [7]. Also, every query must
be served by the leader.
* the recursor, AWS DNS, might time out for some queries (like any DNS
server).
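
To make the timeout interaction concrete, here is a rough simulation of
a forwarder that tries each recursor in turn with a fixed per-recursor
timeout. It is only a sketch: the 2s and 5s values are the ones quoted
in this thread, while the function and recursor names are hypothetical
and this is not consul's actual forwarding code.

```python
from typing import Callable, Optional, Sequence, Tuple

FORWARDER_TIMEOUT_S = 2.0  # per-recursor timeout reported in the findings
CLIENT_TIMEOUT_S = 5.0     # Linux resolver timeout mentioned in this thread

# A simulated recursor returns its reply latency in seconds,
# or None if it never replies.
Recursor = Callable[[str], Optional[float]]


def forward_sequentially(
    name: str,
    recursors: Sequence[Recursor],
    timeout_s: float = FORWARDER_TIMEOUT_S,
) -> Tuple[bool, float]:
    """Try each recursor in order; return (answered, seconds spent)."""
    elapsed = 0.0
    for recursor in recursors:
        latency = recursor(name)
        if latency is not None and latency <= timeout_s:
            return True, elapsed + latency
        elapsed += timeout_s  # waited the full timeout before moving on
    return False, elapsed


def dead(_name: str) -> Optional[float]:
    """Hypothetical recursor that never answers."""
    return None


def healthy(_name: str) -> Optional[float]:
    """Hypothetical recursor that answers in 100 ms."""
    return 0.1


if __name__ == "__main__":
    answered, spent = forward_sequentially("example.com", [dead, dead, dead])
    # With three unresponsive recursors the forwarder is still waiting
    # when a client with a 5s timeout has already given up.
    print(answered, spent, spent > CLIENT_TIMEOUT_S)
```

With no health monitoring, every query pays the full timeout for each
dead recursor ahead of a healthy one.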

We think this setup is not the most resilient one: resolution fails
during leader election, consul times out after 2s but Linux after 5s,
Linux might bypass the consul-agent, and there is no stale serving.

We are considering some improvements on this situation:

1. Deploy some DNS forwarding+cache servers (eg bind),
- use them in front of AWS DNS doing caching
- they can delegate to the consul masters for
* domains.
- Pros: central cache will have a better success rate.
- Cons: Another service/server to take care of.

2. Deploy a pdnsd/dnsmasq/bind in each node as local cache.
- Implement all the DNS local caching goodness (and badness)[8]
- Can forward to the local consul-agent for *
- We would need to change consul-release to allow this.

3. Modify consul-release to allow enabling stale caching, changing the
DNS configuration and changing how resolv.conf is managed?

4. Enable serving stale DNS in consul, to avoid issues during leader
election.

5. Submit some PRs to consul to improve its DNS forwarding, adding some
cool features like those in pdnsd [8]
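
For the stale-serving options above, a minimal sketch of the agent
configuration fragment involved, generated from Python. The
`dns_config.allow_stale` and `dns_config.max_stale` keys are Consul
agent settings; the `5s` value and the helper name are illustrative
assumptions, and how consul-release would expose them as job properties
is not defined here.

```python
import json


def consul_dns_config(allow_stale: bool = True, max_stale: str = "5s") -> dict:
    """Build a Consul agent config fragment that permits stale DNS reads.

    allow_stale lets any server answer DNS queries, not only the leader;
    max_stale bounds how stale an answer may be (value is illustrative).
    """
    return {"dns_config": {"allow_stale": allow_stale, "max_stale": max_stale}}


if __name__ == "__main__":
    # Print the fragment as JSON, the format the agent config files use.
    print(json.dumps(consul_dns_config(), indent=2))
```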

Our questions are:

* Do you see DNS errors similar to ours? (e.g. deployments failing due
to DNS resolution errors)
* Is the current consul-release setup a desired configuration? Is
there any documented architectural decision about this?
* Shall we use consul-agent as a DNS forwarder at all?
* How do you deal with DNS caching? Do you use any local agent or
bind caching servers?
* Shall we enable serve stale in consul-release?
* Is it right that the consul-agent changes and parses
/etc/resolv.conf in the start scripts? [9][10]
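
To illustrate the kind of resolv.conf rewriting the last question is
about, here is a small sketch that puts a local agent ahead of the
existing nameservers. It is not the consul-release start script; the
function name and the 127.0.0.1 address are assumptions for the example.

```python
def prepend_nameserver(resolv_conf: str, address: str = "127.0.0.1") -> str:
    """Return resolv.conf text with `address` as the first nameserver.

    Existing entries for the same address are dropped first, so calling
    this twice gives the same result as calling it once.
    """
    entry = f"nameserver {address}"
    lines = [line for line in resolv_conf.splitlines() if line.strip() != entry]
    for index, line in enumerate(lines):
        if line.strip().startswith("nameserver"):
            lines.insert(index, entry)  # ahead of the first recursor
            break
    else:
        lines.append(entry)  # no nameserver lines were present
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    original = "search ec2.internal\nnameserver 10.0.0.2\n"
    print(prepend_nameserver(original), end="")
```

Because the Linux resolver walks the nameserver list in order, the
recursors after the local entry are still queried directly when the
local agent does not answer.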


Bunch of links:


Hector Rivas | GDS / Multi-Cloud PaaS
