Re: [abacus] Handling Notifications


Jean-Sebastien Delfino
 

Hi Chaskin,

Thanks for sharing your thoughts! I'll try to address some of your
questions:

1. If topic based filtering is not expressive enough for our use case,
which of the types of content based filtering are required to express what
we need while maintaining a scalable pub/sub system?

I was thinking about starting simple with ANDs of predicates using the =
and > operators, like space = 'production' AND cost > 1000 for example.
Like you said there are well known efficient algorithms to implement that
kind of filters.

2. @Ben Would you be able to provide a reference to the algorithm that
you are describing? Which form of content based filtering are you
referring to?

Looks like Ben had started to describe what he had in mind in this thread
[1].

3. What are the general scalability goals of this system(How many
subscriptions would be allowed? How complex would the subscriptions be
allowed to be? etc.)

To get an idea (and maybe get a bit scared), you could probably start with
a low strawman estimate of 'number of users' * '10 to 20 apps' * '4 or 5
services' * '4 to 12 usage events per hour' * '10 or 20 filters' * '4 to 5
predicates', and say that each filter should react quickly enough to be
useful.

If you do the math... looks like high scalability will be a key goal here :)

4. Are you tied to a document based database, or can you use a
traditional, relational DBMS for this if it proves to be more efficient?

We're storing JSON docs in CouchDB now but I'll be happy to consider any
pull request that provides an efficient solution based on another data
format and storage technology. I'd also like to consider the opportunity to
filter incoming usage docs *before* they even hit the DB (as opposed to
querying a DB to find matching usage docs once they're stored there).

[1]
https://lists.cloudfoundry.org/archives/list/cf-dev(a)lists.cloudfoundry.org/message/WMESTS67YU2A5BPULOS54ARLB6F5DHMR/

HTH

- Jean-Sebastien

On Sun, Jan 31, 2016 at 3:46 PM, Chaskin Saroff <chaskin.saroff(a)gmail.com>
wrote:

I apologize in advance for the length of this email. This is a
complicated topic, so I want to ensure that everyone is on the same page.
TL;DR I define different types of pub/sub and list questions at the bottom.

I'm going to highlight my general understanding of the common types and
subtypes of publish/subscribe models for clarification.

Publishers publish objects or messages(collections of key, value pairs).
Subscribers only receive a subset of publications. Filtering is done by
two general methods.

From Wikipedia:

Topic-based Filtering- "Messages are published to "topics" or "channels."
Subscribers in a topic-based system will receive all messages published to
the topics to which they subscribe, and all subscribers to a topic will
receive the same messages.

Content-based Filtering - "Messages are only delivered to a subscriber if
the attributes or content of those messages match constraints defined by
the subscriber."[1]

The article goes on to say that there are hybrid based approaches where
content based subscription filtering is done within a topic.

Topic Based Filtering
---

Topic based delivery seems relatively easy. You save an entry in a
database for each subscription with the entry containing a subscriber's
address, secret and channel.

When a publication occurs, SELECT (Address, Secret) FROM Subscriptions
WHERE Channel=<publication channel>

Content based filtering seems to be much more challenging. Unfortunately,
topic based filtering does not seem to be expansive enough for the use
cases that Subhash listed above.

Content Based Filtering
---

Let a message, m = {(k_1, v_1), (k_2, v_2), ..., (k_n, v_n)}

Let a predicate, p = <messagekey> <operator> <value>

Ex)

For a message, m,

m = {
"event_type": "usage_increase",
"organization": "af41dcc7-54cc-4489-89a6-b20c2abadc79"
"aggregate_rated_usage": 1234.5,
"region": "us-south",
"country": "USA"
"currency": "USD"
}

p = "aggregate_rated_usage" > 1000

In the widely used and simple case of content filtering, we would only
allow subscriptions to be *conjunctions* of predicates. Let's refer to
this as "simple conjunctive filtering" or "canonical filtering".

For a subscription, s, and predicates p_1, ..., p_n

s = p_1 ^ p_2 ^ ... ^ p_n

Ex)

s = "aggregate_rated_usage" > 1000 AND "organization" =
af41dcc7-54cc-4489-89a6-b20c2abadc79 AND "currency" = "USD"

In such cases where only conjunctions of predicates are allowed, efficient
algorithms are known. [2]

As you might already have considered, there are cases where a subscriber
may want a subscription that is more than a flat list of predicates that
are joined by conjunctions.

Let us refer to any collection of predicates joined with conjunctions,
disjunctions and parentheses as "non-canonical filtering" or "generalized
junctive filtering"[3]

Ex)
s = "aggregate_rated_usage" > 1000 AND ("country" = "USA" OR "currency" =
"USD")

The final, and most generalized form of content-based filtering could be
called "generalized logical filtering". This would allow for the use of
the remainder of the logical operators such as the unary NOT operator, and
the XOR operator.

Ex)
s = "aggregate_rated_usage" > 1000 XOR NOT("country" = "USA" OR
"currency" = "USD")


This leads me to several questions.

1. If topic based filtering is not expressive enough for our use case,
which of the types of content based filtering are required to express what
we need while maintaining a scalable pub/sub system?

Also, we ended up discussing a little bit on the high level on how to do
the matching
algorithm, so I'll relay it here.
2. @Ben Would you be able to provide a reference to the algorithm that you
are describing? Which form of content based filtering are you referring to?

3. What are the general scalability goals of this system(How many
subscriptions would be allowed? How complex would the subscriptions be
allowed to be? etc.)

4. Are you tied to a document based database, or can you use a
traditional, relational DBMS for this if it proves to be more efficient?

Cheers,
Chaskin

[1]
https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern#Message_filtering
[2] https://www.cs.cornell.edu/Courses/cs614/2003SP/papers/ASS99.pdf
[3]
http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=0F159B2E9BE3145BC0F8C51F79C4DEB0?doi=10.1.1.483.1899&rep=rep1&type=pdf

Join cf-dev@lists.cloudfoundry.org to automatically receive all group messages.