On Thu, Nov 12, 2015 at 1:41 PM, Benjamin Cheng <bscheng(a)us.ibm.com> wrote:
How about keying by criteria as well to know when to trigger a Webhook
call? And maybe allow multiple registrations per URL? (e.g. please call me
back at http://foo.com/bar on new usage for org abc, and also when usage
for app xyz goes over 1000; that would probably be two separate
registration docs for a single URL)
I think that would work well too. I guess the main thing between the two
approaches is number of docs versus size of a doc. Building a query for
either wouldn't be too hard, but putting multiple registrations inside a
doc would probably add additional complexity to the notification logic.
Yes, that's why I was suggesting one doc per registration keyed by trigger
criteria.
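
To make the "one doc per registration keyed by trigger criteria" idea a bit
more concrete, here is a minimal sketch in Python. Everything in it is an
assumption for illustration only: the key layout, the criteria field names
(org, app, on, limit) and the in-memory dict standing in for the database are
hypothetical, not an existing Abacus schema or API.

```python
import hashlib
import json


def criteria_key(criteria):
    # Canonical JSON (sorted keys) so equal criteria always give the same key.
    return json.dumps(criteria, sort_keys=True, separators=(",", ":"))


def registration_id(url, criteria):
    # Hypothetical key layout: the criteria come first so a lookup by
    # criteria is a simple key-prefix scan; the URL hash keeps the key
    # unique when several URLs register for the same criteria.
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    return "registration/" + criteria_key(criteria) + "/" + url_hash


def register(store, url, criteria):
    # One doc per (criteria, URL) pair; registering twice overwrites the doc.
    doc = {"id": registration_id(url, criteria), "url": url, "criteria": criteria}
    store[doc["id"]] = doc
    return doc


def registrations_for(store, criteria):
    # Find every registration to notify when these criteria are met.
    prefix = "registration/" + criteria_key(criteria) + "/"
    return [doc for key, doc in sorted(store.items()) if key.startswith(prefix)]


# Two separate registration docs for a single URL, as in the example above.
store = {}
register(store, "http://foo.com/bar", {"org": "abc", "on": "new-usage"})
register(store, "http://foo.com/bar",
         {"app": "xyz", "on": "usage-over", "limit": 1000})
```

With this layout the notification logic only needs a prefix query per met
criteria, and a URL with several registrations simply shows up in several
small docs.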
+1 for a sort of quarantine on an unreachable Webhook. How about slow
Webhooks causing back pressure problems? Would we quarantine these too?
This is kind of hard for me to figure out. If they continue to cause
problems, leaving the slow webhooks would probably compound the problems as
time goes on, but at the same time, they aren't quite in the same category
as an unreachable one since they're actually reachable. Somehow the client
needs to know their webhook is causing issues or unreachable.
I thought a bit more about this. We could set a timeout (using our circuit
breaker module for example) causing slow Webhooks to error and then get
handled like other errors.
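
A rough sketch of that timeout idea, in Python for brevity. This is not our
circuit breaker module; the names call_with_timeout and WebhookError and the
injected send function are hypothetical, and the only point is that a slow
call surfaces as the same kind of error as a failed one.

```python
import concurrent.futures
import time


class WebhookError(Exception):
    """Raised for any failed delivery, including one that was too slow."""


# Small shared pool for outgoing calls (size chosen arbitrarily here).
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(send, url, payload, timeout_s):
    # 'send' is whatever function actually posts to the Webhook.
    future = _pool.submit(send, url, payload)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The worker thread is not interrupted, we just stop waiting for it
        # and report the slow Webhook as an error.
        raise WebhookError("timed out after %ss: %s" % (timeout_s, url))
    except Exception as exc:
        # Unreachable host, bad status, etc. end up in the same category.
        raise WebhookError("failed: %s (%s)" % (url, exc))


def slow_send(url, payload):
    # Stand-in for a Webhook that takes too long to answer.
    time.sleep(0.3)
    return 200


def fast_send(url, payload):
    # Stand-in for a healthy Webhook.
    return 200
```

Calling call_with_timeout(slow_send, url, payload, 0.05) raises WebhookError,
so the caller can count it towards quarantine exactly like an unreachable
Webhook.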
- should we let the rating service app do this or have a separate
notification service app?
I don't have a concrete opinion on this, but both sides have their merits.
With a separate notification app, notification logic stays outside of
something like aggregator, so aggregator keeps doing aggregator logic
and doesn't expand into something that it might not want to do. One thing
to consider is dealing with the load of everything coming to a single
notification app. I'm also not sure whether this approach would leave the
logic pluggable into other apps, so that someone could use it without
having to run the notifications application.
I don't have as much to say on putting the logic inside of rating. The
only thing I can bring up is my previous point about having rating do
notification logic instead of strictly rating logic, and it does give a
bigger separation of logic in terms of determining whether a criterion is
met or not.
Right, so I'd vote for a separate service initially for a cleaner
separation of concerns, and we can always merge it back in later if we want.
- with partitioning of the orgs across multiple deployments of our apps
(for scalability or regional deployments for example) do I need to first
find the right service to register with (e.g. register with the us-south
notification service or the eu-west notification service)? or can I
register with a central notification service that will then figure out
which deployment instance will call me back?
It might work better on a central notification service to prevent things
like a client registering to an incorrect region. If we keep it separate
and configured criteria continue to evolve, would a central notification
service help along the lines of something like "notify if all my
organizations across all the regions exceed X amount?"
Good point, I had not thought about notifications on total usage across
multiple regional organizations.
- how do we secure the registration calls and Webhook callbacks?
The security on registration would probably just validate if the user's
token has access to that organization/space/etc.
OK sounds good, that'll also be consistent with how our usage reporting
works.
For the webhook, I'm not sure if this falls within abacus.usage.read or
abacus.usage.write, or if we would need a new scope to handle this case.
It'd be good to get some use case input from Subhash, Piotr etc on this.
- do we replay notifications when we can't deliver them?
Ideally, it would make sense, whether it's a simple series of retries or
replay logic that would later play through the set of unsent
notifications. I think this is related to the quarantine logic. Some
notifications when replayed will probably go through, some will fail and
may continue to fail. Those that continue to fail may be due to the webhook
they're attached to, and thus, it may make sense to quarantine that webhook
based upon these replay failures.
+1
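
Here is a small Python sketch of how replay and quarantine could fit
together, under assumptions of my own: the Webhook class, the max_failures
threshold of 3 and the in-memory unsent list are all hypothetical and only
there to show the flow.

```python
class Webhook:
    def __init__(self, url, max_failures=3):
        self.url = url
        self.max_failures = max_failures  # assumed threshold, not a decided value
        self.failures = 0                 # consecutive failed deliveries
        self.quarantined = False
        self.unsent = []                  # notifications kept for later replay


def deliver(hook, notification, send):
    # While quarantined we don't call the Webhook at all, we only queue.
    if hook.quarantined:
        hook.unsent.append(notification)
        return False
    try:
        send(hook.url, notification)
    except Exception:
        hook.failures += 1
        hook.unsent.append(notification)
        # Repeated failures put the Webhook in quarantine.
        if hook.failures >= hook.max_failures:
            hook.quarantined = True
        return False
    hook.failures = 0
    return True


def replay(hook, send):
    # Try the unsent notifications again, in their original order; the ones
    # that fail again are queued again by deliver().
    pending, hook.unsent = hook.unsent, []
    for notification in pending:
        deliver(hook, notification, send)
```

Replay failures go through the same counter as first-time failures, so a
Webhook that keeps failing on replay ends up quarantined without any extra
logic.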
- can I register to receive all notifications from a certain point in the
logical stream of notifications matching some criteria (e.g. call me back if
this org consumed too much per hour at any point since last week)
I'm not sure if I completely understand this question. Is this pretty much
setting an ad hoc time range over which notifications should be sent?
I was thinking about a kind of 'cursor' mechanism or something along the
lines of what CF app events provide, where you can request notifications
from a sequence number, a timestamp, or a page number for example... That
cursor mechanism will be handy too when you'll want to replay missed
notifications after a failure. Makes sense?
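
To illustrate the cursor idea, a minimal Python sketch. The NotificationLog
class and its read parameters (after_seq, since, limit) are hypothetical
names; this is not how CF app events are implemented, just the general shape
of reading a stream from a sequence number or a timestamp.

```python
class NotificationLog:
    """Append-only log of notifications with increasing sequence numbers."""

    def __init__(self):
        self._entries = []

    def append(self, timestamp, body):
        # Sequence numbers start at 1 and never repeat.
        seq = len(self._entries) + 1
        self._entries.append({"seq": seq, "timestamp": timestamp, "body": body})
        return seq

    def read(self, after_seq=0, since=None, limit=100):
        # Return up to 'limit' notifications after the given sequence number
        # and, optionally, at or after the given timestamp.
        matches = [e for e in self._entries
                   if e["seq"] > after_seq
                   and (since is None or e["timestamp"] >= since)]
        page = matches[:limit]
        # The cursor to pass back in on the next call.
        next_cursor = page[-1]["seq"] if page else after_seq
        return page, next_cursor


log = NotificationLog()
log.append(1447300000, "org abc over hourly limit")
log.append(1447303600, "org abc over hourly limit")
log.append(1447307200, "app xyz usage over 1000")

# A client that last saw sequence number 1 picks up from there.
page, cursor = log.read(after_seq=1)
```

After a failure the client just calls read again with the last cursor it
processed, which gives us replay of missed notifications for free.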
- Jean-Sebastien