Re: [abacus] Refactor Aggregated Usage and Aggregated Rated Usage data model
Here's an update on this topic and the design discussions Assk, Ben and I
have had over the last few days:
I'll start with a description of the problem we're trying to solve here:
Abacus currently computes and stores the aggregated usage at various levels
within an org in real time. Each time new usage for resource instances gets
submitted, we compute your latest aggregated usage at the org, space, app,
resource and plan level, and store that in a new document keyed by the org
id and the current time.
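To make that concrete, here's a minimal Python sketch of the idea (an
illustration only, not Abacus code). The field names, the nesting order and
the `aggregate` helper are all assumptions made up for this example; the
real aggregated usage doc is richer than this.

```python
import copy


def aggregate(prev_doc, usage, now):
    """Fold one submitted usage into a NEW org-level doc (hypothetical schema).

    The previous doc is left untouched, so each submission adds one more
    entry to the org's aggregated usage history.
    """
    if prev_doc is None:
        doc = {"org_id": usage["org_id"], "total": 0, "spaces": {}}
    else:
        doc = copy.deepcopy(prev_doc)

    # Key the new doc by org id and time, as described above.
    doc["id"] = f"{usage['org_id']}/{now}"

    qty = usage["quantity"]
    doc["total"] += qty

    space = doc["spaces"].setdefault(usage["space_id"], {"total": 0, "apps": {}})
    space["total"] += qty

    app = space["apps"].setdefault(usage["app_id"], {"total": 0, "resources": {}})
    app["total"] += qty

    res = app["resources"].setdefault(usage["resource_id"], {"total": 0, "plans": {}})
    res["total"] += qty
    res["plans"][usage["plan_id"]] = res["plans"].get(usage["plan_id"], 0) + qty

    return doc
```

Note that every call returns a full copy of the org's aggregation, which is
what makes the latest usage available in a single doc.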
We effectively write a history of your org's aggregated usage in the Abacus
database. That design allows us to efficiently report your latest usage or
your usage history, or to trigger usage limit alerts in real time, for
example, simply because we always have your latest usage for a given time
in hand in a single doc, as opposed to having to run complex database
queries pulling all your usage data into an aggregation when it's needed.
So, that design is all good until somebody creates a thousand (or even a
hundred) apps in the org. With many apps, our aggregated usage (JSON) docs
get pretty big, as we're keeping track of the aggregated usage for each app
and JSON is not very space-efficient at representing all that data (that's
a euphemism). Since we're writing a new doc for each new submitted usage,
we eventually overload our Couch database with these big JSON docs.
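A rough way to see how this grows, sketched in Python under made-up
assumptions: `org_doc` builds a toy doc with one small entry per app (the
shape is hypothetical, not our actual format), and `bytes_written` multiplies
its serialized size by the number of submissions, since each submission
writes a whole new doc.

```python
import json


def org_doc(n_apps):
    """Build a toy org-level doc with one entry per app (hypothetical shape)."""
    return {
        "org_id": "org-1",
        "apps": {
            f"app-{i}": {
                "total": 1,
                "resources": {"storage": {"plans": {"basic": 1}}},
            }
            for i in range(n_apps)
        },
    }


def doc_size(n_apps):
    """Size in characters of the JSON serialization of the toy doc."""
    return len(json.dumps(org_doc(n_apps)))


def bytes_written(n_apps, n_submissions):
    """Total written if every submission stores a full copy of the doc."""
    return doc_size(n_apps) * n_submissions
```

The doc size grows with the number of apps and the total written grows with
apps times submissions, which is the part that hurts.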
Long story short... this discussion is about trying to optimize our data
model for aggregated usage to fix that problem. It's also an example of the
typical tension in systems that need to stream a lot of data, compute some
aggregates, and make quick decisions based on them: (a) do you proactively
compute and store the aggregated values in real time as you're consuming
your stream of input data? or (b) do you just write the input data and then
run a mix of pseudo-real time and batch queries over and over on that data
to compute the aggregates later? Our current design is along the lines of
(a), but we're starting to also poke at ideas from the (b) camp to mitigate
some of the issues of the (a) camp.
The initial proposal described by Assk earlier in this thread was to split
the single org level doc containing all the usage aggregations within the
org into smaller docs: one doc per app for example (aka consumer in Abacus,
as we support usage from things other than pure apps). That's what he was
calling 'normalized' usage, since the exercise of coming up with that new
structure would be similar to a 'normalization' of the data in the
relational database sense, as opposed to the 'denormalization' we went
through to design the structure of our current aggregated usage doc (a JSON
hierarchical structure including some data duplication).
Now, while that data 'normalization' would help reduce the size of the docs
and the amount of data written to record the history of your org's
aggregated usage, in the last few days we've also started to realize that
it would on the other hand increase the amount of data we'd have to read,
to retrieve all the little docs representing the current aggregated usage
and 'join' them into a complete view of the org's aggregated usage before
adding new usage to it...
Like I said before, there's a tension between two approaches: (a) writes a
lot of data but is cheap on reads, (b) writes the minimum but requires a lot
of reads... nothing's easy or perfect :) So the next step here is going to
be an evaluation of some of the trade-offs between:
a) write all the aggregated usage data for an org in one doc like we do now
but simplify and refactor the JSON format we use to represent it a bit, in
an attempt to make that JSON representation much smaller;
b) split the aggregated usage in separate docs, one per app, linked
together by a parent doc per org containing their ids, and optimize (with
caching for example) the reads and 'joins' of all the docs forming the
aggregated usage for the org;
c) a middle-ground approach where we'll store the aggregated usage per app
in separate docs, but maintain the aggregated usage at the upper levels
(org, space, resource, plan) in the parent doc linking the app usage docs
together, and explore what constraints or limitations that would impose on
our ability to trigger real time usage limit alerts at any level (org,
space, resource, plan, app, etc.).
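To illustrate the read/write trade-off behind options (b) and (c), here's a
small Python sketch. Everything in it is hypothetical: `Store` is a toy
in-memory stand-in for the database that just counts reads and writes, the
doc keys and fields are invented, and it overwrites docs instead of keeping
a history, to keep the example short.

```python
class Store:
    """Toy in-memory document store that counts reads and writes."""

    def __init__(self):
        self.docs = {}
        self.reads = 0
        self.writes = 0

    def get(self, key):
        self.reads += 1
        return self.docs.get(key)

    def put(self, key, doc):
        self.writes += 1
        self.docs[key] = doc


def submit_middle_ground(store, org_id, app_id, qty):
    """Option (c): update one small app doc plus the parent doc, which
    links the app docs together and holds the upper-level total."""
    parent = store.get(f"org/{org_id}") or {"total": 0, "app_ids": []}
    app = store.get(f"app/{org_id}/{app_id}") or {"total": 0}

    app = {"total": app["total"] + qty}
    app_ids = parent["app_ids"]
    if app_id not in app_ids:
        app_ids = app_ids + [app_id]
    parent = {"total": parent["total"] + qty, "app_ids": app_ids}

    store.put(f"app/{org_id}/{app_id}", app)
    store.put(f"org/{org_id}", parent)
    return parent


def full_view(store, org_id):
    """The 'join' from option (b): read the parent doc, then every app doc
    it links to, to rebuild the complete view of the org."""
    parent = store.get(f"org/{org_id}") or {"total": 0, "app_ids": []}
    apps = {a: store.get(f"app/{org_id}/{a}") for a in parent["app_ids"]}
    return {"total": parent["total"], "apps": apps}
```

In this sketch a submission costs two reads and two small writes however
many apps the org has, and the org-level total is available from the parent
doc alone, while rebuilding the complete view costs one read per app plus
one for the parent.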
This is a rather complex subject, so please feel free to ask questions or
send any thoughts here, or in the tracker and GitHub issues referenced by
Assk earlier if that's easier. Thanks!
On Fri, Nov 20, 2015 at 11:09 AM, Saravanakumar A Srinivasan <
Started to look into two user stories( and ) titled "Organize the