[abacus] Separate time-based from discrete usage metrics


Hristo Iliev
 

Hi,

We're trying to fix Abacus issue 88: Missing aggregated usage for the running application [1].

Background
==========

See jsdelfino's comment in the GitHub issue [2]. TL;DR: Resource providers currently have to send a 'ping' doc every month for time-based metrics.

Proposed solution
=================

We decided to implement a solution in Abacus that frees the usage providers from sending the 'ping' submission.

To fix the issue we decided to:
1. Distinguish between time-based (linux-container) and discrete usage metrics (basically everything else)
2. Store the time-based metrics in separate DB(s)

We have already drafted a proposal for adding a measurement type to the usage plans in PR #320 [3].
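
A rough sketch of how the two steps above could fit together. It assumes a hypothetical `type` field on each metric in the plan (the shape actually proposed in PR #320 may differ), and the database names are made up for illustration:

```python
# Hypothetical usage plan: each metric carries a measurement type.
# The field names and DB names are illustrative, not the Abacus schema.
PLAN = {
    "metrics": [
        {"name": "memory", "type": "time-based"},
        {"name": "api_calls", "type": "discrete"},
    ]
}


def target_db(plan, metric_name):
    """Pick the store for a metric based on its measurement type."""
    for metric in plan["metrics"]:
        if metric["name"] == metric_name:
            # Metrics without an explicit type are treated as discrete.
            if metric.get("type") == "time-based":
                return "time-based-state-db"
            return "partitioned-usage-db"
    raise KeyError(f"unknown metric: {metric_name}")
```

With this plan, `target_db(PLAN, "memory")` selects the separate time-based store, while `api_calls` stays in the partitioned databases.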

We're about to spike on storing the time-based metrics in their own database, but we wanted to get the community's opinion on the topic first.

Motivation
==========

The discrete usage submitted to Abacus is:
* stored in partitioned databases, due to its size/number
* like an event log, storing the history of the usage/resources

In contrast, the current time-based metrics are:
* limited in number (usually around 2 million on a loaded CF system)
* storing just the app's resource usage state (GB/h consumed so far, GB/h consuming currently)

Therefore it looks like a good idea to separate the two usage metric types and store the time-based metrics in a separate database. This will allow us not only to solve the issue, but also to store and query the data more effectively.
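
To illustrate why keeping only the state is enough, here is a minimal sketch of how usage could be computed at report time without any 'ping' submission. It assumes "consumed so far" is kept in GB-hours and "consuming currently" in GB; the class and field names are hypothetical, not Abacus code:

```python
from dataclasses import dataclass

MS_PER_HOUR = 3_600_000


@dataclass
class AppUsageState:
    consumed_gbh: float  # GB-hours accumulated up to since_ms
    consuming_gb: float  # memory currently in use, in GB
    since_ms: int        # timestamp (ms) of the last state change


def usage_at(state, now_ms):
    """Total GB-hours at now_ms: past consumption plus the running share."""
    elapsed_hours = max(0, now_ms - state.since_ms) / MS_PER_HOUR
    return state.consumed_gbh + state.consuming_gb * elapsed_hours
```

For an app that has consumed 10 GB-hours and has been running with 2 GB for two hours since its last change, this yields 14 GB-hours, computed purely from the stored state.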

We may still need to maintain 2 databases and swap new/old (irrelevant) metrics at month boundaries to reduce the DB size.


Regards,
Hristo & Adriana

[1] https://github.com/cloudfoundry-incubator/cf-abacus/issues/88
[2] https://github.com/cloudfoundry-incubator/cf-abacus/issues/88#issuecomment-148498164
[3] https://github.com/cloudfoundry-incubator/cf-abacus/pull/320


Jean-Sebastien Delfino
 

Hi,

> To fix the issue we decided to:
> 1. Distinguish between time-based (linux-container) and discrete usage
> metrics (the rest basically)
> 2. Store the time-based metrics in a separate DB(s)

Your proposal looks good to me. Piotr, Kevin and Raj and I had several
design discussions on this topic in the last few days and we've come up
with a few more ideas on top of what you're describing here:

- The distinction between time-based and discrete resource usage metering
could also be understood as usage metering of a stateful vs stateless
resource or metric. In the stateful case, we simply store the state of the
resource instance in a separate DB like you're proposing (e.g. we store the
fact that an app or a container is currently running, or has stopped), and
update that state in place when it changes. Then to compute and report
usage later on we just need to read the current resource instance state
from that DB.

- We could continue to store the metrics in the current historical
databases as well (on top of that new DB) to preserve the resource instance
history as many users typically want to know their past usage.

- Some of us were not sure if the time-based / discrete distinction should
be at the resource type level or at the metric level... IMO your proposal
to do that at the metric level is cleaner so I'm happy with it :)

- The dataflow module will probably need a few minor code changes to detect
the case where some of the output docs need to go to a separate DB (IIRC
you or someone else also mentioned that on one of our scrums or on slack...)

- Like you said, we may still need to maintain 2 DBs to purge old entries.
If that's easier, we could also adjust the usage accumulator service and
the dataflow module a bit to delete entries for inactive resource instances
right away (e.g. when an app or container stops.)
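
A minimal in-memory sketch of the ideas in this list: state updated in place, history still appended, and inactive instances deleted right away. The class, the `status` field and its values are assumptions for illustration, and a real implementation would sit on top of the DBs rather than Python dicts:

```python
class ResourceStateStore:
    """Keeps the latest state per resource instance plus a history log."""

    def __init__(self):
        self.state = {}    # instance id -> latest state doc (updated in place)
        self.history = []  # append-only log, standing in for the historical DBs

    def record(self, instance_id, doc):
        # Always keep the event so past usage remains available.
        self.history.append((instance_id, doc))
        if doc.get("status") == "stopped":
            # Purge inactive instances immediately instead of at month end.
            self.state.pop(instance_id, None)
        else:
            self.state[instance_id] = doc
```

Recording a "running" doc followed by a "stopped" doc for the same instance leaves the state map empty while the history keeps both entries.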

Thoughts?

P.S. I'll add these comments to issue #88 as well to make it easier to
follow up there.

- Jean-Sebastien

On Thu, May 12, 2016 at 5:54 AM, Hristo Iliev <hsiliev(a)gmail.com> wrote:

[...]


Hristo Iliev
 

Hi,

When we talk about a DB storing stateful metrics, do we really mean a single
DB storing all the Abacus pipeline data, or input & output DBs for each
of the Abacus micro-services?

+1 for storing the data in the historical/log-like databases. This gives us
the possibility to extend the implementation of the failed events
management to the stateful measures.

We have already started a spike on the dataflow module. It would be
straightforward to detect stateful metrics in POST requests. This can be
done by extending the account plugin API and the metering config & schema.

We have two challenges:
* GET requests do not have access to the stateful flag, so we need a way to
detect stateful data using the document id. The idea we have is to use a
new ID schema (or just a prefix?), as you proposed at the last IPM.
* We think the replay function might miss some of the data, precisely
because of the problem we are trying to solve.
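
To make the prefix idea for GET requests concrete, here is a small sketch. The prefix value and the helper names are hypothetical, and it assumes that ids of stateless docs never start with the chosen prefix:

```python
# Hypothetical marker; the real ID schema is still to be decided.
STATEFUL_PREFIX = "s/"


def make_doc_id(base_id, stateful):
    """Build the document id at POST time, when the stateful flag is known."""
    return STATEFUL_PREFIX + base_id if stateful else base_id


def is_stateful_id(doc_id):
    """Recover the flag at GET time from the id alone."""
    return doc_id.startswith(STATEFUL_PREFIX)
```

A GET handler could then choose the DB to read from by calling `is_stateful_id` on the requested id, without needing the plan or the metering config.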

Regards,
Hristo Iliev

2016-05-18 19:34 GMT+03:00 Jean-Sebastien Delfino <jsdelfino(a)gmail.com>:

[...]