Re: DNS caching and forwarding in Cloud Foundry with Consul Agents


Hector Rivas Gandara
 

Hello,

Yes, it was us who offered to implement the stale reads. I will try to
get that prioritised.

About what we want to solve/ask:

- It might be interesting to implement some kind of DNS caching solution
on a CF installation, to reduce the impact of issues in DNS resolvers
and to reduce latency.

Does it make sense to implement DNS caching on CF? What is the
community doing at the moment for DNS caching?

- Consul does not offer DNS caching

- It does not feel right that the consul-release implementation forces
you to use consul-agent to forward DNS, or that it directly changes
/etc/resolv.conf. We should allow that to be customised.

Thank you.

On 21 July 2016 at 17:36, Amit Gupta <agupta(a)pivotal.io> wrote:
Hi Hector,

I believe you offered [0] to submit a PR to consul-release to deal with the
caching and allow stale reads issue [1], we would happily welcome that. I'm
less sure about the other suggestions you mentioned, and not clear on what
concrete problem they solve, but happy to discuss further if you can
explain.

[0] https://cloudfoundry.slack.com/archives/general/p1467821852000818
[1] https://github.com/cloudfoundry-incubator/consul-release/issues/26

Thanks,
Amit, CF Infrastructure team PM

On Thu, Jul 21, 2016 at 4:10 AM, Hector Rivas Gandara
<hector.rivas.gandara(a)digital.cabinet-office.gov.uk> wrote:

Hello,

Recently we were investigating some DNS resolution errors in our CF
installation running CF v238 with consul-agent.

We found some caveats and would like to know your thoughts, how you
set it up, and possible improvements.

I will assume CF v238 [1] + Diego 0.1476.0 [3] + consul-release 92
[2], running on AWS.

Our findings:

* the consul-release/consul-agent job will always configure consul as a
forwarding resolver.
* consul-agent is added to resolv.conf before the normal recursors.
The Linux resolver will still query the recursors directly if consul
fails [5].
* consul's forwarding capabilities are simple [4] compared with
bind/pdnsd/dnsmasq (e.g. hardcoded 2s timeout, simply iterates over each
recursor, no health monitoring, no caching/stale serving, no parallel
queries...)
* it is not possible to customise the consul-agent DNS config (listen
port, disable forwarding, etc.)
* consul-agent does not have "serve stale" configured [6], so every
query must be served by the leader and the DNS interface will fail
during leader election [7].
* the recursor, AWS DNS, might time out for some queries (as any DNS
server might)
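
To make the forwarding finding concrete, here is a minimal Python sketch
of the behaviour as we understand it from [4]: each recursor is tried in
turn with a fixed timeout, with no cache and no health tracking. It is
not Consul's actual code; `forward_sequential`, `worst_case_latency` and
the `query` callable are hypothetical names used only for illustration.

```python
from typing import Callable, Sequence

RECURSOR_TIMEOUT_S = 2.0  # the hardcoded 2s timeout mentioned above


def forward_sequential(
    question: str,
    recursors: Sequence[str],
    query: Callable[[str, str, float], str],
    timeout: float = RECURSOR_TIMEOUT_S,
) -> str:
    """Try each recursor in order and return the first answer.

    `query(recursor, question, timeout)` is a hypothetical transport
    function that returns an answer or raises TimeoutError.
    """
    for recursor in recursors:
        try:
            return query(recursor, question, timeout)
        except TimeoutError:
            # No health tracking: a dead recursor is retried on every query.
            continue
    raise TimeoutError(f"all {len(recursors)} recursors timed out for {question}")


def worst_case_latency(recursor_count: int, timeout: float = RECURSOR_TIMEOUT_S) -> float:
    """Upper bound on the time spent before the last recursor is given up on."""
    return recursor_count * timeout
```

Under these assumptions, three recursors give a worst case of 6s, which
is longer than the 5s after which the Linux resolver gives up on the
agent and moves on to the next nameserver.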

We think this setup is not the most resilient one: leader election
makes resolution fail, consul times out after 2s but the Linux resolver
after 5s, Linux might bypass the consul-agent, there is no stale
serving, etc.

We are considering some improvements to this situation:

1. Deploy some DNS forwarding+cache servers (e.g. bind),
- use them in front of AWS DNS to do caching
- they can delegate to the consul masters for
*.services.cf.internal domains.
- Pros: a central cache will have a better success rate.
- Cons: another service/server to take care of.

2. Deploy pdnsd/dnsmasq/bind on each node as a local cache.
- Implement all the DNS local caching goodness (and badness) [8]
- Can forward to the local consul-agent for *.services.cf.internal
domains.
- We will need to change consul-release to allow this.

3. Modify consul-release to allow enabling stale caching, changing the
DNS configuration and changing how resolv.conf is managed?

4. Enable serving stale DNS answers in consul, to avoid issues during
leader election.

5. Submit some PRs to consul to improve the DNS forwarding, adding some
cool features like those in pdnsd [8]
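
As a rough illustration of option 2, the Python sketch below shows the
two behaviours we would want from a local cache: routing names under
services.cf.internal to the local consul-agent, and serving a stale
answer when the upstream fails. It is only a sketch under stated
assumptions, not how pdnsd, dnsmasq or bind implement it. The addresses,
the `StaleCache` class and the `resolve` callable are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple

CONSUL_SUFFIX = "services.cf.internal"  # domain named in the list above
CONSUL_AGENT = "127.0.0.1:8600"         # assumed address of the local agent
DEFAULT_RECURSOR = "10.0.0.2:53"        # assumed upstream recursor


def pick_upstream(name: str) -> str:
    """Send consul-managed names to the agent, everything else upstream."""
    n = name.rstrip(".").lower()
    if n == CONSUL_SUFFIX or n.endswith("." + CONSUL_SUFFIX):
        return CONSUL_AGENT
    return DEFAULT_RECURSOR


class StaleCache:
    """TTL cache that serves expired answers while the upstream is failing."""

    def __init__(
        self,
        resolve: Callable[[str, str], Tuple[str, float]],
        max_stale: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # resolve(upstream, name) returns (answer, ttl_seconds) or raises OSError.
        self._resolve = resolve
        self._max_stale = max_stale
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # name -> (answer, expiry)

    def lookup(self, name: str) -> str:
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]  # fresh hit, no upstream query
        try:
            answer, ttl = self._resolve(pick_upstream(name), name)
        except OSError:  # includes TimeoutError
            if entry is not None and now < entry[1] + self._max_stale:
                return entry[0]  # upstream failed: serve the stale answer
            raise
        self._entries[name] = (answer, now + ttl)
        return answer
```

The `max_stale` window bounds how long an outdated answer can be
returned, which is the trade-off to weigh against failing the lookup.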

Our questions are:

* Do you see similar DNS errors to ours? (e.g. deployments
failing due to DNS resolution errors)
* Is the current consul-release setup a desired configuration? Is
there any documented architectural decision about this?
* Should we use consul-agent as a DNS forwarder at all?
* How do you deal with DNS caching? Do you use any local agent or
bind caching servers?
* Should we enable serve stale in consul-release?
* Is it right that the consul-agent parses and changes
/etc/resolv.conf in the start scripts? [9][10]
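
For reference, the agent settings these questions touch could look
roughly like the fragment generated below. `recursors`, `ports.dns`,
`dns_config.allow_stale` and `dns_config.max_stale` are Consul agent
options. The helper name `agent_dns_config` and the values are
assumptions for illustration, and how consul-release would expose them
as job properties is left open.

```python
import json
from typing import Any, Dict, Sequence


def agent_dns_config(
    recursors: Sequence[str],
    allow_stale: bool = True,
    max_stale: str = "30s",  # illustrative value, not a recommendation
    dns_port: int = 53,      # assumed listen port
) -> Dict[str, Any]:
    """Build the DNS-related part of a Consul agent config (sketch only)."""
    return {
        # An empty list means the agent does not forward queries at all.
        "recursors": list(recursors),
        "ports": {"dns": dns_port},
        "dns_config": {
            # Let any server answer, not only the leader.
            "allow_stale": allow_stale,
            # Accept answers that lag the leader by at most this much.
            "max_stale": max_stale,
        },
    }


print(json.dumps(agent_dns_config(["10.0.0.2"]), indent=2))
```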

Thx!

Bunch of links:

[1] https://github.com/cloudfoundry/cf-release/releases/tag/v238
[2] https://github.com/cloudfoundry-incubator/consul-release/tree/v92
[3] https://github.com/cloudfoundry/diego-release/tree/v0.1476.0
[4] https://github.com/hashicorp/consul/blob/ab1654758f3e216ef0035ba3ae2defaccb772747/command/agent/dns.go#L767
[5] http://manpages.ubuntu.com/manpages/trusty/man5/resolv.conf.5.html
[6] https://github.com/cloudfoundry-incubator/consul-release/issues/26
[7] https://github.com/hashicorp/consul/issues/1888
[8] https://wiki.archlinux.org/index.php/pdnsd#Additional_performance_settings
[9] https://github.com/cloudfoundry-incubator/consul-release/blob/d2d875badabcbf5b41ac3802d8fd986c81c99688/jobs/consul_agent/templates/pre-start.erb#L9-L22
[10] https://github.com/cloudfoundry-incubator/consul-release/blob/d2d875badabcbf5b41ac3802d8fd986c81c99688/jobs/consul_agent/templates/agent_ctl.sh.erb#L19-L21

--
Regards
Hector Rivas | GDS / Multi-Cloud PaaS
