DNS caching and forwarding in Cloud Foundry with Consul Agents
Hector Rivas Gandara
Hello,
Recently we were investigating some DNS resolution errors in our CF installation, which runs CF v238 with consul-agent. We found some caveats and would like to know your thoughts, how you set this up, and what improvements are possible.

I will assume CF v238 [1] + Diego 0.1476.0 [2] + consul-release 92 [3], running on AWS.

Our findings:

* The consul-release consul-agent job always configures consul as a forwarding resolver.
* consul-agent is added to resolv.conf before the normal recursors. The Linux resolver will still query the recursors directly if consul fails [5].
* Consul's forwarding capabilities are simple [4] compared with bind/pdnsd/dnsmasq (e.g. a hardcoded 2s timeout, simple iteration over each recursor, no health monitoring, no caching or stale serving, no parallel queries...).
* It is not possible to customise the consul-agent DNS config (listen port, disabling forwarding, etc.).
* consul-agent does not have "serve stale" configured [6], so the DNS interface fails during leader election [7]. Also, queries must be served by the leader.
* The recursor, AWS DNS, might time out for some queries (like any DNS server).

We think this setup is not the most resilient one: leader election makes resolution fail, consul times out after 2s but Linux after 5s, Linux might bypass the consul-agent, there is no stale serving, etc.

We are considering some improvements:

1. Deploy some DNS forwarding + cache servers (e.g. bind).
   - Use them in front of AWS DNS to do caching.
   - They can delegate to the consul masters for *.services.cf.internal domains.
   - Pros: a central cache will have a better success rate.
   - Cons: another service/server to take care of.
2. Deploy pdnsd/dnsmasq/bind on each node as a local cache.
   - Implement all the local DNS caching goodness (and badness) [8].
   - It can forward to the local consul-agent for *.services.cf.internal domains.
   - We would need to change consul-release to allow this.
3. Modify consul-release to allow enabling stale caching, changing the DNS configuration, and changing how resolv.conf is managed?
4. Enable DNS serve stale in consul, to avoid issues during leader election.
5. Submit some PRs to consul to improve the DNS forwarding, adding some cool features like those in pdnsd [8].

Our questions are:

* Do you see DNS errors similar to ours? (e.g. deployments failing due to DNS resolution errors)
* Is the current consul-release setup the desired configuration? Is there any documented architectural decision about this?
* Should we use consul-agent as a DNS forwarder at all?
* How do you deal with DNS caching? Do you use a local agent or bind caching servers?
* Should we enable serve stale in consul-release?
* Is it right that consul-agent modifies and parses /etc/resolv.conf in its start scripts? [9][10]

Thx!

Bunch of links:

[1] https://github.com/cloudfoundry/cf-release/releases/tag/v238
[2] https://github.com/cloudfoundry/diego-release/tree/v0.1476.0
[3] https://github.com/cloudfoundry-incubator/consul-release/tree/v92
[4] https://github.com/hashicorp/consul/blob/ab1654758f3e216ef0035ba3ae2defaccb772747/command/agent/dns.go#L767
[5] http://manpages.ubuntu.com/manpages/trusty/man5/resolv.conf.5.html
[6] https://github.com/cloudfoundry-incubator/consul-release/issues/26
[7] https://github.com/hashicorp/consul/issues/1888
[8] https://wiki.archlinux.org/index.php/pdnsd#Additional_performance_settings
[9] https://github.com/cloudfoundry-incubator/consul-release/blob/d2d875badabcbf5b41ac3802d8fd986c81c99688/jobs/consul_agent/templates/pre-start.erb#L9-L22
[10] https://github.com/cloudfoundry-incubator/consul-release/blob/d2d875badabcbf5b41ac3802d8fd986c81c99688/jobs/consul_agent/templates/agent_ctl.sh.erb#L19-L21

--
Regards
Hector Rivas | GDS / Multi-Cloud PaaS
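
For illustration, the 2s vs 5s timeout mismatch mentioned in the findings can be put into numbers. This is only a rough sketch, not a measurement: it assumes consul tries each recursor in turn with a fixed 2s timeout (as described above) and that the client resolver waits the default 5s from resolv.conf(5) before moving on. The function names are made up for this example.

```python
# Rough model of the timeout mismatch. Both constants are assumptions taken
# from the message above and resolv.conf(5), not values measured on a system.
CONSUL_RECURSOR_TIMEOUT_S = 2.0  # consul's hardcoded per-recursor timeout
RESOLVER_TIMEOUT_S = 5.0         # default client resolver timeout per nameserver


def consul_worst_case_s(num_recursors: int) -> float:
    """Time consul needs to give up when every recursor times out in turn."""
    return num_recursors * CONSUL_RECURSOR_TIMEOUT_S


def recursors_fully_tried(num_recursors: int) -> int:
    """How many recursors consul can exhaust before the client stops waiting."""
    budget = int(RESOLVER_TIMEOUT_S // CONSUL_RECURSOR_TIMEOUT_S)
    return min(num_recursors, budget)


def client_gives_up_first(num_recursors: int) -> bool:
    """True if the client times out before consul has tried every recursor."""
    return consul_worst_case_s(num_recursors) > RESOLVER_TIMEOUT_S


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, consul_worst_case_s(n), recursors_fully_tried(n),
              client_gives_up_first(n))
```

Under these assumptions, with three or more unresponsive recursors the client stops waiting for consul before consul has finished iterating, and falls through to the next nameserver in resolv.conf.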