Re: [abacus] Refactor Aggregated Usage and Aggregated Rated Usage data model
Jean-Sebastien Delfino
Hi all,
Here's an update on this topic and the design discussions Assk, Ben and I had in the last few days.

I'll start with a description of the problem we're trying to solve here. Abacus currently computes and stores the aggregated usage at various levels within an org in real time. Each time new usage for resource instances gets submitted, we compute your latest aggregated usage at the org, space, app, resource and plan level, and store that in a new document keyed by the org id and the current time.

We effectively write a history of your org's aggregated usage in the Abacus database. That design allows us to efficiently report your latest usage, your usage history, or trigger usage limit alerts in real time for example, simply because we always have your latest usage for a given time in hand in a single doc, as opposed to having to run complex database queries pulling all your usage data into an aggregation when it's needed.

So, that design is all good until somebody creates a thousand (or even a hundred) apps in the org. With many apps, our aggregated usage (JSON) docs get pretty big as we're keeping track of the aggregated usage for each app, JSON is not very space-efficient at representing all that data (that's a euphemism), and since we're writing a new doc for each new submitted usage, we eventually overload our Couch database with these big JSON docs.

Long story short... this discussion is about trying to optimize our data model for aggregated usage to fix that problem. It's also an example of the typical tension in systems that need to stream a lot of data, compute some aggregates, and make quick decisions based on them:

(a) do you proactively compute and store the aggregated values in real time as you're consuming your stream of input data? or
(b) do you just write the input data and then run a mix of pseudo-real time and batch queries over and over on that data to compute the aggregates later?

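To make the write cost of approach (a) a bit more concrete, here's a minimal sketch in Python. It is not Abacus code: the doc layout, the `org_id/timestamp` key format and the function names are all assumptions made up for illustration, and it only tracks the org and app levels (no space, resource or plan).

```python
import json


def new_org_doc(org_id, timestamp, previous, app_id, quantity):
    """Approach (a): copy the previous aggregate, add the new usage and
    return a complete new doc keyed by org id and time."""
    apps = dict(previous["apps"]) if previous else {}
    apps[app_id] = apps.get(app_id, 0) + quantity
    return {
        "id": f"{org_id}/{timestamp}",  # hypothetical key layout
        "org_total": sum(apps.values()),
        "apps": apps,
    }


def bytes_written(num_apps):
    """Total JSON bytes written when each app submits usage once."""
    doc, total = None, 0
    for n in range(num_apps):
        # Every submission produces a whole new org-level doc.
        doc = new_org_doc("org-1", n + 1, doc, f"app-{n}", 10)
        total += len(json.dumps(doc))
    return total
```

Since each new doc repeats the usage of every app seen so far, comparing `bytes_written(10)` with `bytes_written(100)` shows the total growing much faster than the number of apps, which is the overload described above.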
Our current design is along the lines of (a), but we're starting to also poke at ideas from the (b) camp to mitigate some of the issues of the (a) camp.

The initial proposal described by Assk earlier in this thread was to split the single org level doc containing all the usage aggregations within the org into smaller docs: one doc per app for example (aka consumer in Abacus, as we support usage from other things than pure apps). That's what he was calling 'normalized' usage, since the exercise of coming up with that new structure would be similar to a 'normalization' of the data in the relational database sense, as opposed to the 'denormalization' we went through to design the structure of our current aggregated usage doc (a JSON hierarchical structure including some data duplication).

Now, while that data 'normalization' would help reduce the size of the docs and the amount of data written to record the history of your org's aggregated usage, in the last few days we've also started to realize that it would on the other hand increase the amount of data we'd have to read, to retrieve all the little docs representing the current aggregated usage and 'join' them into a complete view of the org's aggregated usage before adding new usage to it...

Like I said before, a tension between two approaches: (a) writes a lot of data, is cheap on reads; (b) writes the minimum, requires a lot of reads...
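
The read side of that 'normalized' layout can be sketched the same way. Again this is only an illustration under assumptions, not the Abacus implementation: `Store` is an in-memory stand-in for the database that counts reads and writes, the key formats are invented, docs are overwritten in place instead of being appended as history, and a per-org index doc listing the app ids is assumed so the app docs can be found.

```python
class Store:
    """In-memory stand-in for the database, counting doc reads and writes."""

    def __init__(self):
        self.docs, self.reads, self.writes = {}, 0, 0

    def get(self, key):
        self.reads += 1
        return self.docs.get(key)

    def put(self, key, doc):
        self.writes += 1
        self.docs[key] = doc


def submit_usage(store, org_id, app_id, quantity):
    """Write path: update only the small doc of the app reporting usage."""
    key = f"{org_id}/{app_id}"  # hypothetical key layout
    doc = store.get(key) or {"quantity": 0}
    store.put(key, {"quantity": doc["quantity"] + quantity})
    # Hypothetical index doc listing the app ids known for the org.
    index = store.get(org_id) or {"apps": []}
    if app_id not in index["apps"]:
        store.put(org_id, {"apps": index["apps"] + [app_id]})


def org_usage(store, org_id):
    """Read path: 'join' the index doc and every app doc into one view."""
    index = store.get(org_id) or {"apps": []}
    apps = {a: store.get(f"{org_id}/{a}")["quantity"] for a in index["apps"]}
    return {"org_total": sum(apps.values()), "apps": apps}
```

In this sketch, new usage for an already known app costs a single small write, but building the complete org view with `org_usage` costs one read for the index doc plus one read per app, so the cost moves from the write path to the read path.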
Nothing's easy or perfect :)

So the next step here is going to be an evaluation of some of the trade-offs between:

a) write all the aggregated usage data for an org in one doc like we do now, but simplify and refactor a bit the JSON format we use to represent it, in an attempt to make that JSON representation much smaller;

b) split the aggregated usage in separate docs, one per app, linked together by a parent doc per org containing their ids, and optimize (with caching for example) the reads and 'joins' of all the docs forming the aggregated usage for the org;

c) a middle-ground approach where we'll store the aggregated usage per app in separate docs, but maintain the aggregated usage at the upper levels (org, space, resource, plan) in the parent doc linking the app usage docs together, and explore what constraints or limitations that would impose on our ability to trigger real time usage limit alerts at any org, space, resource, plan, app etc. level.

This is a rather complex subject, so please feel free to ask questions or send any thoughts here, or in the tracker and GitHub issues referenced by Assk earlier if that's easier.

Thanks!

- Jean-Sebastien

On Fri, Nov 20, 2015 at 11:09 AM, Saravanakumar A Srinivasan <sasrin(a)us.ibm.com> wrote:
Started to look into two user stories([1] and [2]) titled "Organize the