[abacus] Refactor Aggregated Usage and Aggregated Rated Usage data model


Saravanakumar A. Srinivasan
 

I've started looking into two user stories ([1] and [2]) titled "Organize the usage report data model for better querying and DB utilization".

Current state of the Abacus processing pipeline, starting from the Usage Accumulator:

    a) Usage Accumulator processes metered usage for a resource instance, accumulates the usage at the resource instance scope, and then forwards the accumulated usage for that resource instance to Usage Aggregator.
    b) Usage Aggregator processes accumulated usage for a resource instance, aggregates the usage at the following scopes:
        organization.resources, 
        organization.resources.plans, 
        organization.spaces.resources, 
        organization.spaces.resources.plans, 
        organization.spaces.consumers.resources and 
        organization.spaces.consumers.resources.plans, and then forwards aggregated usage for an organization to Usage Rating Service.
    c) Usage Rating Service processes aggregated usage for an organization, rates the aggregated usage at the following scopes:
        organization.resources.plans, 
        organization.spaces.resources.plans, and 
        organization.spaces.consumers.resources.plans.
    d) Usage Reporting Service processes rated usage for an organization and summarizes usage and charge at all aggregation scopes. See [3] for a sample Abacus usage report.
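
To make steps b and c concrete, here is a minimal Python sketch of aggregating one accumulated usage record at the six scopes and then rating the three plan scopes. It is only an illustration, not Abacus code: the record field names (org, space, consumer, resource, plan, quantity), the plain summing of quantities and the flat price-per-unit table are all assumptions made for this example.

```python
from collections import defaultdict


def scope_keys(u):
    """Return the six (scope name, ids) keys one usage record contributes to."""
    org, space, consumer = u["org"], u["space"], u["consumer"]
    res, plan = u["resource"], u["plan"]
    return [
        ("organization.resources", (org, res)),
        ("organization.resources.plans", (org, res, plan)),
        ("organization.spaces.resources", (org, space, res)),
        ("organization.spaces.resources.plans", (org, space, res, plan)),
        ("organization.spaces.consumers.resources",
         (org, space, consumer, res)),
        ("organization.spaces.consumers.resources.plans",
         (org, space, consumer, res, plan)),
    ]


def aggregate(records):
    """Step b: sum the quantity of every record into each of its scopes."""
    totals = defaultdict(float)
    for u in records:
        for key in scope_keys(u):
            totals[key] += u["quantity"]
    return dict(totals)


def rate(totals, prices):
    """Step c: rate only the *.plans scopes.

    prices is a hypothetical table mapping (resource, plan) to a unit price.
    """
    charges = {}
    for (scope, ids), quantity in totals.items():
        if scope.endswith(".plans"):
            res, plan = ids[-2], ids[-1]
            charges[(scope, ids)] = quantity * prices[(res, plan)]
    return charges
```

Note that every submitted record touches all six scopes, which is why the single org-level document carries an entry for every consumer in the organization.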


My initial thoughts on the changes needed to optimize steps b, c, and d:

    b) Usage Aggregator processes accumulated usage for a resource instance, then aggregates and rates the usage at the consumer scope (equivalent to the scopes organization.spaces.consumers.resources and organization.spaces.consumers.resources.plans). It then maintains a normalized aggregated usage document for the organization that contains references to all consumer-scoped documents belonging to that organization.
    c) Eliminate Usage Rating Service and split the current rating step across Usage Aggregator and Usage Reporting Service.
    d) Usage Reporting Service processes the normalized aggregated usage for an organization, follows its references to fetch all consumer-scoped documents that belong to the organization, aggregates and rates the consumer-scoped usage at all other scopes, and then summarizes usage and charge at all aggregation scopes.
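
The proposed split can be sketched roughly as follows, in Python. This is a hedged illustration under assumptions, not the actual design: the in-memory dict standing in for the database, the "org/space/consumer" id format, and the field names consumer_ids and plans are all invented for the example, and rating is left out to keep it short.

```python
def aggregate_consumer(store, usage):
    """Aggregator side: update one consumer-scoped doc and the org's references."""
    cid = f'{usage["org"]}/{usage["space"]}/{usage["consumer"]}'
    doc = store.setdefault(cid, {"space": usage["space"], "plans": {}})
    key = (usage["resource"], usage["plan"])
    doc["plans"][key] = doc["plans"].get(key, 0) + usage["quantity"]

    # The normalized org doc only holds references, not the usage itself.
    org = store.setdefault(usage["org"], {"consumer_ids": []})
    if cid not in org["consumer_ids"]:
        org["consumer_ids"].append(cid)


def report(store, org_id):
    """Reporting side: follow the references and aggregate at the org scope."""
    by_resource_plan = {}
    for cid in store[org_id]["consumer_ids"]:
        for key, quantity in store[cid]["plans"].items():
            by_resource_plan[key] = by_resource_plan.get(key, 0) + quantity
    return by_resource_plan
```

In this shape a write touches one small consumer document plus the reference list, while a report has to read one document per consumer.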

Any comments?


[1] https://www.pivotaltracker.com/story/show/107598654
[2] https://www.pivotaltracker.com/story/show/107598652
[3] https://gist.github.com/sasrin/697437b33d38bdddf825#file-report-json

Thanks,
Saravanakumar Srinivasan (Assk),

Bay Area Lab, 1001, E Hillsdale Blvd, Ste 400, Foster City, CA - 94404.
E-mail: sasrin(a)us.ibm.com
Phone: 650 645 8251 (T/L 367-8251)


Jean-Sebastien Delfino
 

Hi all,

Here's an update on this topic and the design discussions Assk, Ben and I
had in the last few days:

I'll start with a description of the problem we're trying to solve here:

Abacus currently computes and stores the aggregated usage at various levels
within an org in real time. Each time new usage for resource instances gets
submitted, we compute your latest aggregated usage at the org, space, app,
resource and plan levels, and store that in a new document keyed by the org
id and the current time.

We effectively write a history of your org's aggregated usage in the Abacus
database, and that design allows us to efficiently report your latest
usage, your usage history, or trigger usage limit alerts in real time for
example, simply because we always have your latest usage for a given time
in hand in a single doc, as opposed to having to run complex database
queries pulling all your usage data into an aggregation when it's needed.
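
As a rough illustration of why that lookup is cheap, here is a small Python sketch of a history keyed by (org id, time). It is an in-memory stand-in written for this example only; the class name UsageHistory and its methods are hypothetical and do not reflect how Abacus talks to its database.

```python
import bisect


class UsageHistory:
    """Append-only history of aggregated usage docs keyed by (org_id, time)."""

    def __init__(self):
        self._keys = []   # sorted list of (org_id, time) keys
        self._docs = {}   # key -> aggregated usage doc

    def put(self, org_id, time, doc):
        key = (org_id, time)
        if key not in self._docs:
            bisect.insort(self._keys, key)
        self._docs[key] = doc

    def latest(self, org_id, at_time):
        """Return the newest doc for org_id with time <= at_time, or None."""
        i = bisect.bisect_right(self._keys, (org_id, at_time))
        if i == 0:
            return None
        key = self._keys[i - 1]
        return self._docs[key] if key[0] == org_id else None
```

A single ordered-key lookup returns the complete aggregated usage for the org at that time, with no aggregation work at read time.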

So, that design is all good until somebody creates a thousand (or even a
hundred) apps in the org. With many apps, our aggregated usage (JSON) docs
get pretty big as we're keeping track of the aggregated usage for each app,
JSON is not very space-efficient at representing all that data (that's a
euphemism), and since we're writing a new doc for each new submitted usage,
we eventually overload our Couch database with these big JSON docs.
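
To get a feel for how quickly this adds up, here is a small Python sketch that measures the JSON bytes written when each app in an org submits usage once and every submission writes a new full org doc. The document shape and field names are made up for the example and are much simpler than a real Abacus aggregated usage doc.

```python
import json


def history_bytes(n_apps):
    """Total JSON bytes written when n_apps apps each submit usage once, in turn."""
    consumers = {}
    total = 0
    for t in range(1, n_apps + 1):
        # Hypothetical per-app entry; real docs carry far more detail.
        consumers[f"app-{t}"] = {
            "resources": {"object-storage": {"plans": {"basic": {"quantity": 1}}}}
        }
        # Every submission writes a new doc holding *all* apps seen so far.
        doc = {"organization_id": "org-1", "time": t, "consumers": consumers}
        total += len(json.dumps(doc))
    return total
```

Because each new doc repeats the entries of all earlier apps, the total grows roughly with the square of the number of apps: going from 10 to 100 apps writes far more than 10 times the data.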

Long story short... this discussion is about trying to optimize our data
model for aggregated usage to fix that problem. It's also an example of the
typical tension in systems that need to stream a lot of data, compute some
aggregates, and make quick decisions based on them: (a) do you proactively
compute and store the aggregated values in real time as you're consuming
your stream of input data? or (b) do you just write the input data and then
run a mix of pseudo-real time and batch queries over and over on that data
to compute the aggregates later? Our current design is along the lines of
(a), but we're starting to also poke at ideas from the (b) camp to mitigate
some of the issues of the (a) camp.

The initial proposal described by Assk earlier in this thread was to split
the single org level doc containing all the usage aggregations within the
org into smaller docs: one doc per app for example (aka consumer in Abacus,
as we support usage from things other than pure apps). That's what he was
calling 'normalized' usage, since the exercise of coming up with that new
structure would be similar to a 'normalization' of the data in the
relational database sense, as opposed to the 'denormalization' we went
through to design the structure of our current aggregated usage doc (a JSON
hierarchical structure including some data duplication).

Now, while that data 'normalization' would help reduce the size of the docs
and the amount of data written to record the history of your org's
aggregated usage, in the last few days we've also started to realize that
it would on the other hand increase the amount of data we'd have to read,
to retrieve all the little docs representing the current aggregated usage
and 'join' them into a complete view of the org's aggregated usage before
adding new usage to it...

Like I said before, it's a tension between two approaches: (a) writes a lot
of data but is cheap on reads; (b) writes the minimum but requires a lot of
reads... nothing's easy or perfect :) So the next step here is going to be
an evaluation of some of the trade-offs between:

a) write all the aggregated usage data for an org in one doc like we do now,
but slightly simplify and refactor the JSON format we use to represent it,
in an attempt to make that JSON representation much smaller;

b) split the aggregated usage in separate docs, one per app, linked
together by a parent doc per org containing their ids, and optimize (with
caching for example) the reads and 'joins' of all the docs forming the
aggregated usage for the org;

c) a middle-ground approach where we'll store the aggregated usage per app
in separate docs, but maintain the aggregated usage at the upper levels
(org, space, resource, plan) in the parent doc linking the app usage docs
together, and explore what constraints or limitations that would impose on
our ability to trigger real time usage limit alerts at any org, space,
resource, plan, app etc level.
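
Option c) could look roughly like the following Python sketch. It is only an illustration under assumptions: the in-memory dict used as a store, the "org/space/consumer" id format, the field names app_ids, org_total and spaces, and the single org-level limit are all hypothetical.

```python
def submit(store, usage, org_limit):
    """Record usage per app, keep upper-level totals in the parent org doc.

    Returns True when the org-level total exceeds org_limit.
    """
    app_id = f'{usage["org"]}/{usage["space"]}/{usage["consumer"]}'

    # Small per-app doc holding only that app's aggregated usage.
    app_doc = store.setdefault(app_id, {"quantity": 0})
    app_doc["quantity"] += usage["quantity"]

    # Parent doc: links to the app docs plus org and space level aggregates.
    parent = store.setdefault(
        usage["org"], {"app_ids": [], "org_total": 0, "spaces": {}})
    if app_id not in parent["app_ids"]:
        parent["app_ids"].append(app_id)
    parent["org_total"] += usage["quantity"]
    spaces = parent["spaces"]
    spaces[usage["space"]] = spaces.get(usage["space"], 0) + usage["quantity"]

    # The org-level check needs only the parent doc, no 'join' of app docs.
    return parent["org_total"] > org_limit
```

In this sketch org and space level alerts stay cheap because their totals live in the parent doc, while an app-level alert would need the app's own doc; that difference is the kind of constraint option c) would have to be evaluated against.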

This is a rather complex subject, so please feel free to ask questions or
send any thoughts here, or in the tracker and Github issues referenced by
Assk earlier if that's easier. Thanks!

- Jean-Sebastien

On Fri, Nov 20, 2015 at 11:09 AM, Saravanakumar A Srinivasan <
sasrin(a)us.ibm.com> wrote:



Saravanakumar A. Srinivasan
 

c) a middle-ground approach where we'll store the aggregated usage per app in separate docs, but maintain the aggregated usage at the upper levels (org, space, resource, plan) in the parent doc linking the app usage docs together, and explore what constraints or limitations that would impose on our ability to trigger real time usage limit alerts at any org, space, resource, plan, app etc level.

As a first step (refer to [1] for more details) toward refactoring the usage data model using the middle-ground approach, we have removed the Usage Rating Service from the Abacus pipeline (refer to the commit at [2]) and moved the entire rating implementation from the Usage Rating Service to the Usage Aggregator (refer to the commit at [3]).

If you are using Abacus, be aware that with these commits the Abacus pipeline has become shorter and you have one less application (Usage Rating Service) to manage.

[1] https://github.com/cloudfoundry-incubator/cf-abacus/issues/184
[2] https://github.com/cloudfoundry-incubator/cf-abacus/commit/1488e1ae2e4547a010151ad2245f3a3f1ff2e488
[3] https://github.com/cloudfoundry-incubator/cf-abacus/commit/c661b7bdd35e70e985583570cb9920b90ced44a8