add Diego into our monitoring system


Liu Rui
 

Hello,

We need to add Diego to our current monitoring system. BOSH can still be used. The Varz information that HM9000 previously provided, shown below, is very useful to us. Is there a substitute for it in Diego?

{
  "name": "HM9000",
  "numCPUS": 4,
  "numGoRoutines": 82,
  "memoryStats": {
    "..."
  },
  "tags": {
    "ip": "..."
  },
  "contexts": [
    {
      "name": "HM9000",
      "metrics": [
        {
          "name": "StartEvacuating",
          "value": 643631
        },
        {
          "name": "StopEvacuationComplete",
          "value": 564223
        },
        {
          "name": "DesiredStateSyncTimeInMilliseconds",
          "value": 1576.52033
        },
        {
          "name": "ActualStateListenerStoreUsagePercentage",
          "value": 4.43904
        },
        {
          "name": "StartCrashed",
          "value": 3396363
        },
        {
          "name": "StartMissing",
          "value": 1519145
        },
        {
          "name": "StopDuplicate",
          "value": 45593
        },
        {
          "name": "StopExtra",
          "value": 2040903
        },
        {
          "name": "SavedHeartbeats",
          "value": 14958857
        },
        {
          "name": "ReceivedHeartbeats",
          "value": 14958857
        },
        {
          "name": "NumberOfAppsWithAllInstancesReporting",
          "value": 66841
        },
        {
          "name": "NumberOfAppsWithMissingInstances",
          "value": 10
        },
        {
          "name": "NumberOfUndesiredRunningApps",
          "value": 7
        },
        {
          "name": "NumberOfRunningInstances",
          "value": 71574
        },
        {
          "name": "NumberOfMissingIndices",
          "value": 12
        },
        {
          "name": "NumberOfCrashedInstances",
          "value": 2139
        },
        {
          "name": "NumberOfCrashedIndices",
          "value": 373
        },
        {
          "name": "NumberOfDesiredApps",
          "value": 66851
        },
        {
          "name": "NumberOfDesiredInstances",
          "value": 71869
        },
        {
          "name": "NumberOfDesiredAppsPendingStaging",
          "value": 9
        }
      ]
    }
  ]
}
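
For anyone reading along who wants to see how a monitoring system might consume a payload of this shape, here is a minimal Go sketch. It is only an illustration: the `varz` struct and the `flattenVarz` helper are hypothetical names, the struct covers only the fields shown above, and the sample payload in `main` is a shortened copy of the output with the elided sections left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// varz mirrors only the fields of the /varz payload that appear in this thread.
// It is a hypothetical type for illustration, not taken from HM9000's source.
type varz struct {
	Name     string `json:"name"`
	Contexts []struct {
		Name    string `json:"name"`
		Metrics []struct {
			Name  string  `json:"name"`
			Value float64 `json:"value"`
		} `json:"metrics"`
	} `json:"contexts"`
}

// flattenVarz decodes a payload and returns "context.metric" -> value.
func flattenVarz(payload []byte) (map[string]float64, error) {
	var v varz
	if err := json.Unmarshal(payload, &v); err != nil {
		return nil, err
	}
	out := make(map[string]float64)
	for _, c := range v.Contexts {
		for _, m := range c.Metrics {
			out[c.Name+"."+m.Name] = m.Value
		}
	}
	return out, nil
}

func main() {
	// Shortened sample built from two of the metrics listed above.
	sample := []byte(`{"name":"HM9000","contexts":[{"name":"HM9000","metrics":[
		{"name":"NumberOfCrashedInstances","value":2139},
		{"name":"DesiredStateSyncTimeInMilliseconds","value":1576.52033}]}]}`)
	metrics, err := flattenVarz(sample)
	if err != nil {
		panic(err)
	}
	for name, value := range metrics {
		fmt.Printf("%s = %v\n", name, value)
	}
}
```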


Matthew Sykes <matthew.sykes@...>
 

There are dozens of metrics emitted by diego but I don't know of any
documentation for them in the open source repositories.

You can find most of them with a quick search of the diego-release
submodules under `src/github.com/cloudfoundry-incubator` with a pattern
like `metric\..*\(`. You will see metrics like `CrashedActualLRPs`,
`LRPsMissing`, and `LRPsExtra` in there.
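
If grep is not handy, the same search can be sketched in Go. This is a rough illustration, not a tool from the Diego repositories: `matchingLines` is a hypothetical helper, the pattern is the one quoted above, and the program assumes you pass it the path to a local checkout.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"regexp"
	"strings"
)

// metricCall is the search pattern quoted in the message above.
var metricCall = regexp.MustCompile(`metric\..*\(`)

// matchingLines returns the trimmed lines of src that match the pattern.
func matchingLines(src string) []string {
	var hits []string
	for _, line := range strings.Split(src, "\n") {
		if metricCall.MatchString(line) {
			hits = append(hits, strings.TrimSpace(line))
		}
	}
	return hits
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: findmetrics <path to source tree>")
		return
	}
	// Walk the tree and print every matching line found in a .go file.
	err := filepath.Walk(os.Args[1], func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if info.IsDir() || !strings.HasSuffix(path, ".go") {
			return nil
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		for _, line := range matchingLines(string(data)) {
			fmt.Printf("%s: %s\n", path, line)
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Run it against `src/github.com/cloudfoundry-incubator` in a diego-release checkout. The pattern is loose, so expect some false positives.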

--
Matthew Sykes
matthew.sykes(a)gmail.com


Kim Hoffman <khoffman@...>
 

Does this help? http://docs.cloudfoundry.org/loggregator/all_metrics.html



Eric Malm <emalm@...>
 

The Diego runtime emits statistics about instances as dropsonde metrics
that flow through the Loggregator system, and can then be consumed from the
Loggregator firehose via a nozzle (such as the Datadog Firehose nozzle at
https://github.com/cloudfoundry-incubator/datadog-firehose-nozzle-release).
I've compiled a fairly complete list of the metrics that Diego 0.1446.0
emits, with some preliminary descriptions, and made it available at
https://docs.google.com/spreadsheets/d/1v6ThMxf9oGmTYAuEF8IYX3sIqsB8RmWBNPqmGpmp5yU/.
There isn't a verbatim correspondence with the HM9000 metrics, because the
architectures are different, but many of the quantities should be similar.
If you have details on specific HM9000 quantities you monitor, it should be
possible to construct an analogous quantity or alert from the Diego metrics.
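
As a rough illustration of building an alert from Diego metrics, here is a minimal Go sketch. The metric names are the ones mentioned earlier in this thread; the thresholds, the `breaches` helper, and the sample values are all made up for the example, and nothing here is a confirmed mapping from any HM9000 quantity.

```go
package main

import (
	"fmt"
	"sort"
)

// thresholds are hypothetical limits chosen only for illustration.
var thresholds = map[string]float64{
	"CrashedActualLRPs": 100,
	"LRPsMissing":       10,
	"LRPsExtra":         10,
}

// breaches returns the sorted names of metrics whose latest value
// exceeds the configured limit. Metrics with no reading are skipped.
func breaches(latest, limits map[string]float64) []string {
	var out []string
	for name, limit := range limits {
		if v, ok := latest[name]; ok && v > limit {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Made-up readings, as a nozzle might have collected them.
	latest := map[string]float64{
		"CrashedActualLRPs": 373,
		"LRPsMissing":       12,
		"LRPsExtra":         7,
	}
	for _, name := range breaches(latest, thresholds) {
		fmt.Printf("ALERT: %s = %v exceeds %v\n", name, latest[name], thresholds[name])
	}
}
```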

The Diego team also has a prioritized story at
https://www.pivotaltracker.com/story/show/112544565 to automate the
documentation of those metrics from the codebase, since updating them
manually is unreliable and error-prone; the Google doc linked above is also
provided on that story as a starting point for documentation.

Thanks,
Eric, CF Runtime Diego PM


Liu Rui
 

Thanks for all the comments; they are very helpful.

We are now able to get the metrics through a nozzle connected to the traffic controller.

I have two questions:

1) Even if we only want those metrics, we still have to collect all of the logging data through the Loggregator system. That is a huge network overhead for us because our CF system is large.

2) We will migrate to CF release 233 soon, and /varz support for HM9000 will be sunset. Are the corresponding statistics available as metrics flowing through the Loggregator system? Is there documentation for them?



Jim CF Campbell
 

Hi Liu Rui,


1) If we just want to have those metrics, we still have to collect all
logging data through the Loggregator system. It is a huge network overhead
for our system because our CF system is large.
You can use a metrics nozzle that only passes metrics and not logs. For
example, here is the Datadog nozzle
<https://github.com/cloudfoundry-incubator/datadog-firehose-nozzle>.
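
To make the idea concrete, here is a minimal Go sketch of a consumer that forwards metric envelopes and drops everything else. It is not the Datadog nozzle's code: the `envelope` and `eventType` types are simplified, hypothetical stand-ins for the dropsonde envelope, and the sample data is made up. Note that the filtering happens inside the consumer, after the envelopes have already been received.

```go
package main

import "fmt"

// eventType and envelope are simplified stand-ins for the dropsonde
// envelope, defined here only so the sketch is self-contained.
type eventType int

const (
	logMessage eventType = iota
	valueMetric
	counterEvent
)

type envelope struct {
	Type   eventType
	Origin string
	Name   string
	Value  float64 // gauge value or counter total
}

// isMetric reports whether an envelope carries a metric rather than a log line.
func isMetric(e envelope) bool {
	return e.Type == valueMetric || e.Type == counterEvent
}

// forwardMetrics reads envelopes from in, passes only metric envelopes
// to out, and closes out once in has been drained.
func forwardMetrics(in <-chan envelope, out chan<- envelope) {
	defer close(out)
	for e := range in {
		if isMetric(e) {
			out <- e
		}
	}
}

func main() {
	in := make(chan envelope, 3)
	out := make(chan envelope, 3)
	// Made-up sample traffic: one log line and two metrics.
	in <- envelope{Type: logMessage, Origin: "example", Name: "app log line"}
	in <- envelope{Type: valueMetric, Origin: "example", Name: "LRPsMissing", Value: 12}
	in <- envelope{Type: counterEvent, Origin: "example", Name: "exampleCounter", Value: 3}
	close(in)

	forwardMetrics(in, out)
	for e := range out {
		fmt.Printf("%s.%s = %v\n", e.Origin, e.Name, e.Value)
	}
}
```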



2) We will migrate to CF release 233 soon. And the /varz support for
HM9000 will be sunset. Do we have the corresponding metrics of statistics
flowing through the Loggregator system? Do we have some document for that?
See the list of CF metrics:
<https://docs.cloudfoundry.org/loggregator/all_metrics.html>


--
Jim Campbell | Product Manager | Cloud Foundry | Pivotal.io | 303.618.0963


Patrick Wang <goupeng212wpp@...>
 

Hi Jim,
I pulled the source code of the Datadog nozzle. It also gets all of the drain data from the traffic controller and then filters it inside the nozzle, so the nozzle still receives everything. That means a huge network overhead on metron/doppler/traffic controller. From my perspective, to avoid this huge network traffic, it would be better to add a filter on the metron/doppler/traffic controller side. Do you know if there is a plan to add one there?
func getValue(envelope *events.Envelope) float64 {
	switch envelope.GetEventType() {
	case events.Envelope_ValueMetric:
		return envelope.GetValueMetric().GetValue()
	case events.Envelope_CounterEvent:
		return float64(envelope.GetCounterEvent().GetTotal())
	default:
		panic("Unknown event type")
	}
}


Jim CF Campbell
 

Hi Patrick,

The basic design philosophy is for all logging and metric data to be
transported by Loggregator to the firehose. At that point the design intent
is to filter as appropriate. Sorry, but not only do we have no plans to
filter at the data origin, we have roadmap items to add *more* data to
Loggregator with system component syslogs, and for PCF, custom app metrics.

Jim



--
Jim Campbell | Product Manager | Cloud Foundry | Pivotal.io | 303.618.0963