john mcteague <john.mcteague@...>
I've only had a brief look. My Graphite server does not seem to have the same set of stats for each DEA and Doppler node, but where I can draw a comparison, the Dopplers in zones 2 and 3 are receiving 10x more logs than zone 4.
The DEA stat for zone 4 has a receivedMessageCount that is lower than the zone 4 Doppler's receivedMessageCount. I'm not convinced my stats are being sent correctly; I will investigate and provide further info tomorrow.
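(For reference, one way to pull those two counters out of Graphite for a side-by-side look -- a rough sketch that assumes a collector feeding Graphite via its standard /render API; the metric path prefix below is an assumption, so adjust it to however your collector namespaces job metrics, e.g. deployment.job.index.metric:)

GRAPHITE="http://your-graphite-host"   # placeholder
curl -s "$GRAPHITE/render?target=*.*.*.MetronAgent.dropsondeAgentListener.receivedMessageCount&from=-1h&format=json"
curl -s "$GRAPHITE/render?target=*.*.*.DopplerServer.dropsondeListener.receivedMessageCount&from=-1h&format=json"
# Grouping the returned series by zone and comparing the latest datapoints should show
# whether the imbalance is on the emitting (Metron) side or the receiving (Doppler) side.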
Thanks for your help.
John
John Tuley <jtuley@...>
I also don't expect that to be the source of your CPU imbalance.
Well, the bad news is that the easy stuff checks out, so I have no idea what's actually wrong. I'll keep suggesting diagnostics, but I don't have a silver bullet for you.
Do you have a collector wired up in your deployment? If so, I'd take a look at the metrics `MetronAgent.dropsondeAgentListener.receivedMessageCount` across each of the runners, and `DopplerServer.dropsondeListener.receivedMessageCount` across each of the dopplers. That should give you a better idea of the number of log messages that *should* be sent to each Doppler (the first metric) and that *are* received and processed (the second metric).
If the metron numbers are high (as you expect), but the doppler numbers are low, then there's probably something wrong with those doppler instances. If the metron numbers are low, then there might be something wrong with metron on the runners, or with the DEA logging agent. Or, maybe the app instances in those zones just aren't logging much (which seems the least likely explanation so far).
– John Tuley
john mcteague <john.mcteague@...>
- From etcd I see 5 unique entries; all 5 Doppler hosts are listed with the correct zone
- All metron_agent.json files list the correct zone name
- All doppler.json files also contain the correct zone name
All 5 doppler servers contain the following two errors, in varying amounts.
{"timestamp":1432671780.232883453,"process_id":1422," source":"doppler","log_level":"error","message":"AppStoreWatcher: Got error while waiting for ETCD events: store request timed out","data":null,"file":"/var/vcap/data/compile/doppler/loggregator/src/ github.com/cloudfoundry/loggregatorlib/store/app_service_store_watcher.go ","line":78,"method":" github.com/cloudfoundry/loggregatorlib/store.(*AppServiceStoreWatcher).Run"}
{"timestamp":1432649819.481923819,"process_id":1441," source":"doppler","log_level":"warn","message":"TB: Output channel too full. Dropped 100 messages for app f744c900-d82d-4efc-bbe4- 004e94ffdfec.","data":null,"file":"/var/vcap/data/compile/ doppler/loggregator/src/doppler/truncatingbuffer/ truncating_buffer.go","line":65,"method":"doppler/truncatingbuffer.(* TruncatingBuffer).Run"}
For the latter, given the high log rate of the test app, it suggests I need to tune Doppler's buffer, but I don't expect this to be the cause of my CPU imbalance.
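(For reference, a rough sketch of sizing that problem on a Doppler VM. Both the log path and the manifest property named below are assumptions about cf-release of this era -- verify them against your doppler job spec:)

# Count truncating-buffer drops per Doppler (log path assumed; adjust to your release):
grep -c 'Output channel too full' /var/vcap/sys/log/doppler/doppler.stdout.log
# See what buffer size the job was rendered with; the manifest property is believed to be
# doppler.message_drain_buffer_size, but check the doppler job spec for your release:
grep -i buffer /var/vcap/jobs/doppler/config/doppler.json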
John Tuley <jtuley@...>
John,
Can you verify (on, say, one runner in each of your zones) that Metron's local configuration has the correct zone? (Look in /var/vcap/jobs/metron_agent/config/metron.json.)
Can you also verify the same for the Doppler servers (/var/vcap/jobs/doppler/config/doppler.json)?
And then can you please verify that etcd is being updated correctly? (curl $ETCD_URL/api/v2/keys/healthstatus/doppler/?recursive=true with the correct ETCD_URL; the output should contain entries with the correct IP address of each of your dopplers, under the correct zone.)
If all of those check out, then please send me the logs from the affected Doppler servers and I'll take a look.
– John Tuley
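(A minimal sketch of those three checks. The config paths are the ones given above; the etcd path shown here is the standard v2 keys API, so use whichever of /v2/keys or /api/v2/keys actually matches your etcd endpoint:)

grep -i zone /var/vcap/jobs/metron_agent/config/metron.json
grep -i zone /var/vcap/jobs/doppler/config/doppler.json
curl -s "$ETCD_URL/v2/keys/healthstatus/doppler/?recursive=true" | python -m json.tool
# Each Doppler should appear exactly once, under the key for its own zone, with its correct IP.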
john mcteague <john.mcteague@...>
We are using cf v204 and all loggregators are the same size and config (other than zone).
The distribution of requests across app instances is fairly even as far as I can see.
John.
Erik Jasiak <ejasiak@...>
Hi John,
I'll be working on this with engineering in the morning; thanks for the details thus far.
This is puzzling: Metrons do not route traffic to dopplers outside their zone today. If all your app instances are spread evenly, and all are serving an equal amount of requests, then I would expect no major variability in Doppler load either.
For completeness, what version of CF are you running? I assume your configurations for all dopplers are roughly the same? All app instances per AZ are serving an equal number of requests?
Thanks, Erik Jasiak
john mcteague <john.mcteague@...>
Correct, thanks.
James Bayer <jbayer@...>
ok thanks for the extra detail.
to confirm: during the load test, the http traffic is being routed through zones 4 and 5 app instances on DEAs in a balanced way, however the dopplers associated with zones 4 / 5 are getting a very small amount of load sent their way. is that right?
-- Thank you, James Bayer
john mcteague <john.mcteague@...>
I am seeing logs from zones 4 and 5 when tailing the logs (*cf logs hello-world | grep App | awk '{ print $2 }'*); I see a relatively even balance between all app instances, yet the Dopplers in zones 1-3 consume far greater CPU resources (15x in some cases) than zones 4 and 5. Generally zones 4 and 5 barely get above 1% utilization.
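(A quick way to put numbers on that balance, assuming the default cf logs line format where the second field is the [App/N] instance tag:)

timeout 60 cf logs hello-world | grep 'App' | awk '{ print $2 }' | sort | uniq -c | sort -rn
# Roughly equal counts per instance index over the window confirm that log traffic really is
# spread evenly across all 30 instances.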
Running *cf curl /v2/apps/guid/stats | grep host | sort* shows 30 instances, 6 in each zone, a perfect balance.
Each loggregator is running with 8GB RAM and 4vcpus.
John
James Bayer <jbayer@...>
john,
can you say more about "receiving no load at all"? for example, if you restart one of the app instances in zone 4 or zone 5, do you see logs with "cf logs"? you can target a single app instance index to get restarted by using a "cf curl" command for terminating an app index [1]. you can find the details with json output from "cf stats", which should show you the private IPs for the DEAs hosting your app and help you figure out which zone each app index is in.
[1] http://apidocs.cloudfoundry.org/209/apps/terminate_the_running_app_instance_at_the_given_index.html
if you are seeing logs from zone 4 and zone 5, then what might be happening is that for some reason DEAs in zone 4 or zone 5 are not routable somewhere along the path. reasons for that could be:
* DEAs in zone 4 / zone 5 not getting apps that are hosted there listed in the routing table
* the routing table may be correct, but for some reason the routers cannot reach DEAs in zone 4 or zone 5 with outbound traffic, and the routers fail over to instances on DEAs 1-3 that they can reach
* some other mystery
-- Thank you, James Bayer
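(For reference, a sketch of the restart-by-index suggestion above, for instance index 3 of the test app from this thread; it assumes the v2 endpoint named in [1] and the cf CLI's --guid flag:)

GUID=$(cf app hello-world --guid)
cf curl -X DELETE "/v2/apps/$GUID/instances/3"
# Then watch "cf logs hello-world" for that index restarting, to see whether its log lines
# make it through a zone 4/5 Doppler.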
john mcteague <john.mcteague@...>
We map our DEAs, Dopplers and traffic controllers into 5 logical zones using the various zone properties of doppler, metron_agent and traffic_controller. This aligns with our physical failure domains in OpenStack.
During a recent load test we discovered that zones 4 and 5 were receiving no load at all; all traffic went to zones 1-3.
What would cause this unbalanced distribution? I have a single app running 30 instances and have verified that it is evenly balanced across all 5 zones (6 instances in each). I have additionally verified that each logical zone in the BOSH manifest contains 1 DEA, Doppler server and traffic controller.
Thanks, John