BPMN Modelling: Parallel Processes, Dependency on Status of an Incomplete Process - parallel-processing

I am trying to model a process that splits into 2 parallel threads, where thread 1 progresses independently through milestones, while thread 2 needs to take into consideration its own progress + the status of thread 1 to progress through the milestones. At the end, both threads need to complete. How do I model it? (my best try below)

What you modeled would work. However, you don't need the intermediate events. You can directly connect to the tasks. And you don't need an inclusive gateway. It would work, but a parallel gateway would do the same and be less complicated.

In short
There is an issue in the way you merge the incoming event with the normal flow on the lower branch. The symbol used is ambiguous and does not guarantee compliance with the execution semantics.
More details
The diagram will probably be understood as you expect. But it is not correct from the point of view of the BPMN execution semantics due to a missing synchronisation.
Let's analyse the flow with the concept of token, according to the execution semantics (chapter 13 of the specs):
A Process is instantiated when one of its Start Events occurs.
Each Start Event that occurs creates a token on its outgoing Sequence Flows
For a Process instance to become completed, all tokens in that instance MUST reach an end node, i.e., a node without outgoing Sequence Flows
So at the start of your process, a token is created, and it is passed to the first task. You then have a parallel gateway for a fork:
The Parallel Gateway consumes exactly one token from each incoming Sequence Flow and produces exactly one token at each outgoing Sequence Flow.
You then have 2 tokens, which flow to the first upper and the first lower task. The upper token continues to the "none" intermediate event. The lower token reaches the entry of a "merge gate". The question is whether we are guaranteed to keep exactly one token on each parallel branch.
The "none" intermediate event will throw and pass the token down its outgoing flows. Two tokens are hence generated: one to the next upper task, and one to the "merge gate".
What I colloquially called a "merge gate" is in fact ambiguous in your diagram:
It cannot be an exclusive gateway, since this would route each incoming token through it. In the lower branch we would then end up with two tokens. This would not be legal.
It could be an inclusive gateway. But the symbol inside should be a simple circle and not a double circle as you have used. The inclusive gateway consumes all tokens AVAILABLE on its inputs, but it requires only one to become active and does not wait for all tokens to be there. There is no synchronisation guarantee, and you could end up with more than one token on the lower flow if there is the slightest delay in one of the branches. This is not acceptable.
Event-based gateways are 2-step gateways. The first is an event with a pentagon inside, and it must have several outgoing flows, each leading to a different kind of event to be received. In this case, it makes no sense, since we do not expect several kinds of events.
According to the book "Real-Life BPMN" by Freund & Rücker from Camunda, the solution would be to use a complex gateway, i.e. one with a large internal '*' symbol and a condition stating that all inputs must be available. You'd then be guaranteed to have only one outgoing token in the lower flow.
I personally would recommend a parallel join gateway: the two outgoing flows from the intermediate events are uncontrolled flows and are to be understood as implicitly starting a new parallel branch. The join gateway would then clearly show the merge of the new implicit branch with the lower branch and clearly document the synchronisation (i.e. waiting for both tokens to be available). This seems to be the most appropriate alternative so far.
An even easier alternative would be to get rid of the lower merge gate and have two incoming flows for the second lower task. These are then understood as two incoming uncontrolled flows, similar to an implicit join. It's equivalent to the previous solution but with fewer symbols.
The last two options are the only ones that guarantee that exactly one token stays on the upper branch and on the lower branch. The rest of the flow is then trivial until the end.
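To make the token argument concrete, here is a minimal sketch in Python (not BPMN tooling) that counts how many tokens leave a merge point. The flow names `lower` and `from_upper` and both function names are hypothetical; the sketch only contrasts the exclusive-merge behaviour with the parallel-join behaviour discussed above.

```python
from collections import Counter


def exclusive_merge(arrivals):
    """An exclusive merge routes every arriving token straight through."""
    return len(arrivals)


def parallel_join(arrivals, incoming_flows):
    """A parallel join fires once it holds a token on every incoming flow."""
    waiting = Counter()
    fired = 0
    for flow in arrivals:
        waiting[flow] += 1
        if all(waiting[f] > 0 for f in incoming_flows):
            # Consume exactly one token per incoming flow, emit one token.
            for f in incoming_flows:
                waiting[f] -= 1
            fired += 1
    return fired


flows = ["lower", "from_upper"]
arrivals = ["lower", "from_upper"]  # one token from each branch
print("exclusive merge emits", exclusive_merge(arrivals), "tokens")
print("parallel join emits", parallel_join(arrivals, flows), "token(s)")
```

With one token arriving from each branch, the exclusive merge lets two tokens continue on the lower branch, while the parallel join waits for both and emits a single one.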


Logoot CRDT: interleaving of data on concurrent edits to the same spot?

I want to implement Logoot for eventually-convergent P2P text editing and I've run into a bit of a problem.
My understanding of Logoot is that the intervals between objects (lines of text in the original paper, but could be characters or words) can be divided infinitely on account of an unbounded identifier. This means that the position of an object is determined not by its neighbors as in WOOT (which would require tombstones) but by a fixed numerical point along the length of the string. Combined with a unique site identifier, this also gives us a total order and enables eventual convergence.
However... doesn't this cause a problem when concurrent edits are made to the same spot? If two disconnected clients start writing new sentences at the same cursor position and then merge, their sentences have a good chance of interleaving.
Below is a whiteboard example of what I'm talking about:
As you can see, both site B and site C divide the interval between "I" and "conquered" according to the rules of Logoot, giving us random points between the positions of (20,A) and (25,A). But nothing orders these points relative to each other, causing them to mix when merged. Meanwhile, neighbor-based algorithms can account for this issue since the causality chain of each object is preserved.
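The interleaving can be reproduced with a tiny sketch. It assumes single-level identifiers of the form (position, site) that are compared lexicographically; the inserted words and the positions picked by sites B and C are hand-chosen for illustration, standing in for the random choices Logoot would make.

```python
# Shared starting document, created by site A.
base = {(20, "A"): "I", (25, "A"): "conquered"}

# Two disconnected sites each insert two words between (20, A) and (25, A).
# The positions are hand-picked stand-ins for Logoot's random allocation.
site_b = {(21, "B"): "came", (23, "B"): "and"}
site_c = {(22, "C"): "saw", (24, "C"): "then"}


def merge(*replicas):
    """Union all replicas and read them back in identifier order."""
    merged = {}
    for replica in replicas:
        merged.update(replica)
    return " ".join(merged[ident] for ident in sorted(merged))


text = merge(base, site_b, site_c)
print(text)
```

Both sites converge on the same text, but nothing keeps B's words next to each other, so the two insertions end up woven together.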
The above is a baby example, but in the more general case, imagine if two users wanted to insert a different sentence between two existing sentences. If one of the users happened to be offline, they shouldn't come back to a garbled mess! Clearly, to preserve intent, one sentence should follow the other.
Am I missing something in my reading of the paper, or is this an inherent downside to Logoot?
(Also, why is there a recorded clock value that's seemingly unused in the algorithm? The paper even points out that each object's identifier is necessarily unique without the clock.)
You're correct, this is a real anomaly in Logoot and LSEQ. Whether or not it constitutes an intention violation depends on what your definition of intention is. An extension to the definition requiring that contiguous sequences remain contiguous unless they are split by a causally subsequent operation would make intuitive sense.
The clock is unnecessary. Most likely the authors used the (site, clock) pair or Lamport timestamp as their UUIDs out of convention. One site can never create two identical positions, so clocks will never need to be compared. (Assuming messages are received from a site in order, which is required for other aspects of Logoot/LSEQ too.)

algorithm: match an user available to join the game to the closest user to him

My project matches two stranger users. I have a database containing the status of the users, and when a user is available for a match, I match him with another stranger.
I added a feature: if users share their location, the user asking for a match is matched with the user closest to him.
In practice, although I have more than 600 active users per minute, the queue never holds more than 2 users because there are no other preferences. So the "closest" user is simply the first one who is still available and unmatched.
Example: I join the game (I am available) and wait for another user. Another user joins the game and he is the closest to me, but only because he is the only one.
I would like to make it more realistic. I was thinking, for example, of matching users only if the distance is less than 200 km. But the problem remains: the user is matched with the first user inside a range of 200 km, while one second later other users at a shorter distance might have become available.
Example: I join the game; other users join, but they are not within 200 km. Finally a user within 200 km joins and is matched with me, but a user arriving one second later might have been much nearer to me.
How could I make it more realistic? I'm looking for an idea for a better algorithm.
I would introduce some parameters:
match quality: in your case this just seems to be the distance between the two users, but you could also make it depend on the waiting time, e.g. by multiplying by a function of the elapsed seconds, so that the longer you have to wait, the more acceptable longer distances become (with the significant disadvantage of having to recalculate distances constantly)
quality threshold: if the match quality between two users is below this threshold they cannot be matched
maximum waiting time: prevents users from waiting too long; if the match quality improves over time then this parameter could be left out (it would be implicitly defined by the maximal distance, the quality formula and the threshold), but the behavior of the queue is more transparent with it
minimum waiting time: allows the queue to fill up (only for the user to be matched, not for the one he is matched with)
queue threshold: if the number of users in the queue is above this threshold, match the first user in the queue immediately, ignoring the minimum waiting time
optional: a second, higher threshold for the quality above which users are matched immediately to reduce waiting times a little, so every time you add a user to the queue you could calculate the match quality with the first one in the queue and match immediately if above the threshold
If your goal is to make it fair/balanced for all the users, then it is probably best to always match the first user in the queue with his best match (when he meets the matching conditions). The disadvantage is that someone from a remote area might block the queue for "maximum waiting time". But this seems to be the most feasible way, because users are guaranteed to be matched and not left waiting forever (which could happen if you always tried to find the best pairings in the queue, not just the best match for the first user in the queue).
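A minimal sketch of that queue policy in Python follows. Everything concrete in it is an assumption made for illustration: the names `Waiting` and `try_match`, the default parameter values, and the use of flat x/y coordinates in km instead of real geographic distance. Match quality is simply the distance, and the optional second threshold is left out.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Waiting:
    user: str
    x: float       # position in km (flat plane, an illustrative simplification)
    y: float
    joined: float  # time the user entered the queue, in seconds


def distance(a, b):
    return hypot(a.x - b.x, a.y - b.y)


def try_match(queue, now, min_wait=5.0, max_wait=30.0,
              max_distance=200.0, queue_threshold=50):
    """Try to match the first user in the queue with his closest candidate.

    Returns a (user, partner) pair and removes both from the queue,
    or returns None if the head of the queue should keep waiting.
    """
    if len(queue) < 2:
        return None
    head = queue[0]
    waited = now - head.joined
    # Let the queue fill up first, unless it is already crowded.
    if waited < min_wait and len(queue) <= queue_threshold:
        return None
    best = min(queue[1:], key=lambda other: distance(head, other))
    # Reject distant partners until the head has waited long enough.
    if distance(head, best) > max_distance and waited < max_wait:
        return None
    queue.remove(head)
    queue.remove(best)
    return head.user, best.user
```

Calling `try_match` periodically (for example once per second, and whenever a user joins) always serves the head of the queue, which is what keeps every user from waiting forever.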

Algorithm for concurrent queue (single consumer, multiple producers) based on shared dictionary data structure

I am looking for a queue algorithm that fulfills the following properties:
Processes communicate using only a shared dictionary (key-value-store)
Does not use any atomic operations other than load and store (no CAS, for example)
Supports multiple producers
Supports a single consumer
Producers can die at any time and queue must remain operational
The consumer can also die at any time and be restarted later, but there will never be more than one consumer-process running at a time
This is meant as a general question about a suitable algorithm, since I'd like to use it in a couple of different scenarios. But to help visualize the requirements, here is an example use-case:
I have a website with two pages: producer.html and consumer.html
producer.html can be opened in multiple tabs simultaneously
Each producer.html adds events to the queue
One copy of consumer.html is open and consumes these events (to aggregate and stream them to a webserver, for example)
If the multiple producer-tabs are opened by the user rather than the page, these tabs do not have references to each other available, so the usual communication methods (postMessage or calling directly into the other tab's JS code) are out. One of the ways they can still communicate with each other is via LocalStorage as suggested here: Javascript; communication between tabs/windows with same origin. But LocalStorage is not "thread-safe" as detailed here.
Note: There may be other ways to implement cross-tab communication in the browser (Flash, ...), but these are NOT the aim of this question as they won't translate to my other use-cases. This is really just an example use-case for the general queue algorithm that I am trying to find.
A couple more parameters:
The number of producers will never be very large (10s or 100s maybe), so the scaling of the number of reads and writes needed with respect to the number of producers is not really a concern.
I don't know before hand how many producers I might have and there is no immediately obvious way to assign a number or index to them. (Many mutex algorithms (Lamport's Bakery, Eisenberg&McGuire, Szymański's, ...) maintain an array of state for each process, which wouldn't necessarily be a natural approach here, although I do not want to exclude these approaches ex ante, if they can be implemented using the shared dictionary in some way...)
The algorithm should be 100% reliable. So, I'd like to avoid things like the delay in Lamport's first Fast Mutex algorithm (page 2 in the PDF) since I don't have any kind of real-time guarantees.
It would be very helpful if the queue was FIFO, but it's not strictly required.
The algorithm should not be encumbered by any patents, etc.
Update:
The Two-Lock Concurrent Queue Algorithm by Michael and Scott looks like it could work, but I would need two things to implement it:
A locking mechanism using the shared dictionary that can survive the crash of a lock-holder
A reliable way to allocate a new node (if I move the allocation into the locked section, I could just generate new random keys until I find one that's not in use yet, but there might be a better way?)
Update 2:
It seems, I wasn't being specific enough about the dictionary:
It's really nothing more than a trivial key-value-store. It provides the functions get(key) to read the value of a key, put(key, value) to change the value of a key, and delete(key) to remove a key. In some of my use-cases, I can also iterate over keys, but if possible, I'd like to avoid it for generality. Keys are arbitrary and the producers and consumers can create or calculate them as needed. The dictionary does not provide any facilities for automatically generating unique keys.
Examples are HTML LocalStorage, Google AppEngine's Datastore, a Java Map, a Python dictionary, or even a file-system with only a single directory (where the keys would be the file-names and the values the content of the files).
After quite a bit of further reading and sleeping on things for a night, I came up with one way that should be able to accomplish what I need, but it might not be the most elegant:
The paper Wait-Free Algorithms for Fast, Long-Lived Renaming by Moir and Anderson generalizes Lamport's Fast Mutex Algorithm #2 (page 6 here) into the following building block (Figure 2):
When n processes enter this section of code, at most one of them will stop, at most n-1 will move right and at most n-1 will move down.
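The figure itself is not reproduced here. As a rough sketch, the building block is commonly presented as the "splitter" below, and this is assumed to match what the figure shows. It is written in Python against a plain dict that stands in for the shared dictionary; the function name `splitter` and the key names `X_<box>` and `Y_<box>` are illustrative.

```python
def splitter(store, box, p):
    """Run one process with unique id p through the building block for `box`.

    Only plain reads and writes on the shared dictionary are used.
    Returns "stop", "right" or "down".
    """
    x_key, y_key = f"X_{box}", f"Y_{box}"
    store[x_key] = p              # announce ourselves
    if store.get(y_key):          # someone already passed this point
        return "right"
    store[y_key] = True           # close the door behind us
    if store.get(x_key) != p:     # another process announced itself meanwhile
        return "down"
    return "stop"


store = {}
print(splitter(store, 1, "p1"))   # first process, no contention
print(splitter(store, 1, "p2"))   # arrives while Y is still set
```

Run sequentially like this, the first process stops and the second one moves right; the "down" outcome only appears when two processes interleave between the write to X and the final read of X.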
In Lamport's algorithm, stopping means the process acquired the lock, whereas moving right or down will simply send the process back to the beginning of this section of code. To release the lock, a process simply sets Y back to false. (Not quite correct, actually... See "Update" below...)
The big problem with this is that if any of the processes ever die while holding the lock (i.e. before releasing it), the block will simply stay locked forever.
Another problem is that every process needs to be assigned a unique process ID p.
The locked-forever problem can be fixed by borrowing an idea from Moir and Anderson, namely to send processes that end up moving right or down into a different building block rather than back to this one, leading to a structure like this (Figure 3 in the paper):
Except that in this case, I won't be using this grid to assign process IDs as M&A did (although I could probably solve the problem of the unique values for p with this). Instead, every box in the grid will correspond to a very simple queue. If a process stops on a box, it acquired the tail-lock for the corresponding queue (e.g. as per the algorithm by Michael and Scott) and proceeds to enqueue a new element to that queue. Upon completion, it sets the Y value of the box back to false to allow other processes to use this queue. This way, if there is high contention or if processes die before releasing locks, new queues will be created dynamically as needed.
The consumer-process doesn't need to worry about locking the heads of the queues when dequeuing elements, since it's the only process to ever do so. So, it simply traverses the tree of boxes to find all queues and trivially helps itself to their contained elements. One thing to note is that while each individual queue will be FIFO, there is no synchronization between the queues, so the combined queue will not necessarily be FIFO.
If we now change the boolean Y to a time-stamp (or null/0 to indicate false), the consumer can also expire locks after some safe timeout to re-activate dead queues.
A note about implementation using the dictionary:
The shared variables X and Y can be entries in the dictionaries with key-names X_123 and Y_123, where 123 is the number of the box.
p can simply be any unique random string and will be stored as the value of key X_123.
The boolean or time-stamp is also simply stored as the value of key Y_123. The producer-processes interpret a missing entry for Y_123 as false or null/0.
The box-numbers 123 need to be calculated from the move-pattern. One way to do this would be to start with 1 in the top-left corner. If the process stops in that box, we're done. If not, the current number (starting with 1) is shifted left by 1 (i.e. multiplied by 2) and, if the process moved down, also incremented by 1. Smaller (and fewer) numbers can be calculated with a different numbering scheme (I still need to work it out), but this one should work.
The queues then consist of one entry with key H_123 that holds the index of the current head of the queue in its value and one entry with key T_123 that holds the index of the tail. Both default to 0 if they don't exist.
To enqueue an item into queue 123, the tail index is read from T_123 (let's say it yields 48) and an entry with key Q_123_48 is put into the dictionary with its value containing the enqueued item. Afterwards, T_123 is incremented by 1.
After the item is enqueued, the Y_123 entry is set back to false or null/0 (not deleted!)
To dequeue an item, the head index is read from H_123 (let's say it yields 39) and compared to the tail index T_123. If it is smaller, an item is available at Q_123_39, which is then read and deleted from the dictionary. Afterwards, H_123 is incremented by 1.
To traverse the box-tree, the consumer starts with the box in the top left corner. For each box (e.g. 123), if a key Y_123 exists in the dictionary (even if it contains values null/0 or false), the consumer dequeues items from the corresponding queue, and then recursively moves right and down to the adjacent boxes. If no key Y_123 exists, this box hasn't been used by any processes yet and doesn't need to be considered (and neither do the boxes below or to its right).
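The per-box queue operations described above can be sketched as follows. This is an illustration only: a Python dict stands in for the shared dictionary, the function names are hypothetical, and it assumes the caller of `enqueue` already holds the lock for that box.

```python
def enqueue(store, box, item):
    """Append item to the queue of `box`; caller must hold the box's lock."""
    tail = store.get(f"T_{box}", 0)       # missing key means index 0
    store[f"Q_{box}_{tail}"] = item
    store[f"T_{box}"] = tail + 1


def release(store, box):
    """Give the box back by resetting Y (the key is kept, not deleted)."""
    store[f"Y_{box}"] = None


def dequeue(store, box):
    """Single consumer only: return the oldest item, or None if empty."""
    head = store.get(f"H_{box}", 0)
    tail = store.get(f"T_{box}", 0)
    if head >= tail:
        return None
    key = f"Q_{box}_{head}"
    item = store[key]
    del store[key]
    store[f"H_{box}"] = head + 1
    return item


store = {"Y_123": True}                   # pretend a producer holds box 123
enqueue(store, 123, "first")
enqueue(store, 123, "second")
release(store, 123)
print(dequeue(store, 123), dequeue(store, 123), dequeue(store, 123))
```

Because the item is written before T is incremented, a producer that dies between the two writes leaves at most an orphaned Q entry that the next producer overwrites.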
I haven't actually implemented this yet, but I'll do that next. I just wanted to post this already to see if it could inspire other approaches or if anyone can see anything wrong with this idea.
Update:
I just noticed one complication: It is possible that if two processes are trying to acquire the lock for a queue simultaneously, both will fail and move on to the next block. This will leave that queue locked forever as no-one will be left to set Y back to false or null/0.
This is the reason why the "Long-Lived Renaming" algorithm by M&A as well as Lamport's algorithm #2 use an array of Y-values in which every process has its own entry that it resets also if it moves on to another block. Y is then only considered false if all entries are false.
Since I don't know before-hand how many processes I will have, I could implement this only if the dictionary had some way of enumerating keys (the keys would then be Y_123_456 where 456 is the value of p for each process).
But, with rare contention and the above described timeout-mechanism for reactivating dead queues, the issue might lead to only a little bit of memory inefficiency, rather than a major problem.
Update 2:
A better way to label the boxes would be this pattern:
If we call the total number of moves n (counting the move into the top left box also, i.e. n ≥ 1) and the number of moves to the right r, then the box-number can be calculated using
box = (n × (n - 1))/2 + r
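Both numbering schemes are easy to compare in code. The sketch below is illustrative; it represents a process's path as a list of "right"/"down" moves made after entering the top-left box, and the function names are hypothetical.

```python
def box_number_binary(moves):
    """First scheme: start at 1, shift left per move, add 1 for a move down."""
    number = 1
    for move in moves:
        number = number * 2 + (1 if move == "down" else 0)
    return number


def box_number_triangular(moves):
    """Second scheme: box = n * (n - 1) / 2 + r."""
    n = len(moves) + 1           # also counts the move into the top-left box
    r = moves.count("right")
    return n * (n - 1) // 2 + r


for path in ([], ["right"], ["down"], ["right", "down"], ["down", "right"]):
    print(path, box_number_binary(path), box_number_triangular(path))
```

The first scheme gives every path its own number, so the same grid position gets different labels depending on how it was reached. The second scheme depends only on n and r, so paths that end in the same box share one label and the numbers stay much smaller.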
Just use an RDBMS. It's pretty simple in MS SQL; for PostgreSQL you'd have to use the RETURNING keyword, and for MySQL you'd probably have to use triggers.
    CREATE TABLE Q ([Key] BIGINT IDENTITY(1,1) PRIMARY KEY, [Message] NVARCHAR(4000));
    INSERT INTO Q ([Message]) OUTPUT inserted.* VALUES (@message);
    DELETE TOP (1) FROM Q WITH (READPAST) OUTPUT deleted.*;
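If you were really hoping for an algorithmic solution, just use a ring buffer.
    using System.Threading;

    static class RingQueue {  // class name is illustrative
        const int MAX_Q_SIZE = 20000000;
        static readonly string[] Q = new string[MAX_Q_SIZE];
        static long ProducerID = 0;
        static long ConsumerID = 0;  // only touched by the single consumer

        // Reserve the next slot atomically, then publish the message into it.
        // Producers overwrite unread slots if they get MAX_Q_SIZE ahead of the consumer.
        public static long Produce(string message) {
            long key = Interlocked.Increment(ref ProducerID);
            int idx = (int)(key % MAX_Q_SIZE);
            Volatile.Write(ref Q[idx], message);
            return key;
        }

        // Returns null while the next slot is still empty, without skipping it.
        public static string Consume() {
            long key = ConsumerID + 1;
            int idx = (int)(key % MAX_Q_SIZE);
            string message = Volatile.Read(ref Q[idx]);
            if (message == null) return null;
            Q[idx] = null;
            ConsumerID = key;
            return message;
        }
    }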

Design/Code Dispatcher for a Publish-Subscribe System

A friend of mine was asked this problem in an interview. I would like to discuss it here.
What would be an efficient implementation for this problem?
A simple idea that comes to mind is a plain memqueue: use Memcache machines to scale to many requests, with a consumer job that writes entries from Memcache to the DB.
Later on, for the second part, we can just run a SQL query to find the list of matching subscribers.
PROBLEM:-
Events get published to this system. Each event can be thought of as containing a fixed number (N) of string columns called C1, C2, … CN. Each event can thus be passed around as an array of Strings (C1 being the 0th element in the array, C2 the 1st and so on).
There are M subscribers – S1, … SM
Each subscriber registers a predicate that specifies what subset of the events it’s interested in. Each predicate can contain:
Equality clause on columns, for example: (C1 == "US")
Conjunctions of such clauses, example:
(C1 == "IN") && (C2 == "home.php")
(C1 == "IN") && (C2 == "search.php") && (C3 == "nytimes.com")
(In the above examples, C1 stands for the country code of an event, C2 for the web page of the site and C3 for the referrer code.)
I.e. each predicate is a conjunction of some number of equality conditions. Note that a predicate does not necessarily have an equality clause for ALL columns (i.e. a predicate may not care about the value of some or all columns). (In the examples above, the first conjunction does not care about the columns C3, … CN.)
We have to design and code a Dispatcher that can match incoming events to registered subscribers. The incoming event rate is in millions per second. The number of subscribers is in thousands. So this dispatcher has to be very efficient. In plain words:
When the system boots, all the subscribers register their predicates to the dispatcher
After this events start coming to the dispatcher
For each event, the dispatcher has to emit the id of the matching subscribers.
In terms of an interface specification, the following can be roughly spelt out (in Java):
    class Dispatcher {
        public Dispatcher(int N /* number of columns in each event – fixed up front */);
        public void registerSubscriber(String subscriberId /* assume no conflicts */,
                                       String predicate /* predicate for this subscriberId */);
        public List<String> findMatchingIds(String[] event /* assume each event has N Strings */);
    }
Ie.: the dispatcher is constructed, then a bunch of registerSubscriber calls are made. After this we continuously invoke the method findMatchingIds() and the goal of this exercise is to make this function as efficient as possible.
As Hanno Binder implied, the problem is clearly set up to allow pre-processing the subscriptions to obtain an efficient lookup structure. Hanno says the lookup should be a map
(N, K) -> set of subscribers who specified K in field N
(N, "") -> set of subscribers who omitted a predicate for field N
When an event arrives, just look up all the applicable sets and find their intersection. A lookup failure returns the empty set. I'm only recapping Hanno's fine answer to point out that a hash table is O(1) and perhaps faster in this application than a tree. On the other hand, intersecting trees can be faster, O(S + log N) where S is the intersection size. So it depends on the nature of the sets.
Alternative
Here is my alternative lookup structure, again created only once during preprocessing. Begin by compiling a map
(N, K) -> unique token T (small integer)
There is also a distinguished token 0 that stands for "don't care."
Now every predicate can be thought of as a regular expression-like pattern with N tokens, either representing a specific event string key or "don't care."
We can now build a decision tree in advance. You can also think of this tree as a Deterministic Finite Automaton (DFA) for recognizing the patterns. Edges are labeled with tokens, including "don't care". A don't care edge is taken if no other edge matches. Accepting states contain the respective subscriber set.
Processing an event starts with converting the keys to a token pattern. If this fails due to a missing map entry, there are no subscribers. Otherwise feed the pattern to the DFA. If the DFA consumes the pattern without crashing, the final state contains the subscriber set. Return this.
For the example, we would have the map:
(1, "IN") -> 1
(2, "home.php") -> 2
(2, "search.php") -> 3
(3, "nytimes.com") -> 4
For N=4, the DFA would look like this:
    o --1--> o --2--> o --0--> o --0--> o
              \
               -3--> o --4--> o --0--> o
Note that since there are no subscribers who don't care about e.g. C1, the starting state doesn't have a don't care transition. Any event without "IN" in C1 will cause a crash, and the null set will be properly returned.
With only thousands of subscribers, the size of this DFA ought to be reasonable.
Processing time here is of course O(N) and could be very fast in practice. For real speed, the preprocessing could generate and compile a nest of C switch statements. In this fashion you might actually get millions of events per second with a small number of processors.
You might even be able to coax a standard tool like the flex scanner generator to do most of the work for you.
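As a rough illustration of the decision-tree idea, here is a Python sketch. It deviates from the description above in two labelled ways: it skips the token map and uses the column values directly as edge labels, and it follows both the matching edge and the "don't care" edge at every level, so that overlapping subscriptions are all reported. The class and method names are hypothetical, and predicates are assumed to arrive already parsed as a dict from 0-based column index to required value.

```python
ANY = object()  # edge label for "don't care"


class Node:
    def __init__(self):
        self.children = {}     # edge label -> Node
        self.subscribers = []  # filled only at depth N


class TrieDispatcher:
    def __init__(self, n):
        self.n = n
        self.root = Node()

    def register(self, subscriber_id, predicate):
        """Insert one path of length N; unspecified columns use the ANY edge."""
        node = self.root
        for col in range(self.n):
            label = predicate.get(col, ANY)
            node = node.children.setdefault(label, Node())
        node.subscribers.append(subscriber_id)

    def find_matching_ids(self, event):
        """Walk the tree one column at a time, following value and ANY edges."""
        frontier = [self.root]
        for value in event:
            next_frontier = []
            for node in frontier:
                for label in (value, ANY):
                    child = node.children.get(label)
                    if child is not None:
                        next_frontier.append(child)
            if not next_frontier:
                return []
            frontier = next_frontier
        return sorted(s for node in frontier for s in node.subscribers)


d = TrieDispatcher(4)
d.register("S1", {0: "IN", 1: "home.php"})
d.register("S2", {0: "IN", 1: "search.php", 2: "nytimes.com"})
print(d.find_matching_ids(["IN", "home.php", "google.com", "x"]))
```

The work per event is bounded by the number of distinct tree paths that match, not by the total number of subscribers.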
A solution that comes to my mind would be:
For each Cn we have a mapping from values to sets of subscribers for those subscribers who subscribed for a value of Cn. Additionally, for each Cn we have a set of subscribers who don't care for the value of Cn ('ANY').
When receiving an event, we look up all the subscribers with matching subscriptions for Cn and receive a set with 0 or more subscribers. To this set we add those subscribers from the 'ANY' set for this Cn.
We do this for every n <= N, yielding N sets of subscribers. The intersection of all N sets is the set of subscribers matching this event.
The mapping from Cn to subscribers can efficiently be stored as a tree, which gives a complexity of O(log k) to look up the subscribers for a single Cn, given that there are subscriptions to k different values.
Thus, for N columns we have a complexity of O(N log k).
Intersecting the N sets can also be done in O(N log m), so that we end up with a logarithmic complexity in total, which shouldn't be too bad.
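The per-column lookup and intersection can be sketched in Python as follows. This is an illustration under assumptions: it uses hash maps rather than trees for the per-column mapping, the class and method names are hypothetical, and predicates are assumed to be already parsed into a dict from 0-based column index to required value.

```python
from collections import defaultdict


class SetDispatcher:
    def __init__(self, n):
        self.n = n
        # by_value[col][value] -> subscribers requiring that value in col
        self.by_value = [defaultdict(set) for _ in range(n)]
        # any_value[col] -> subscribers that do not care about col ('ANY')
        self.any_value = [set() for _ in range(n)]

    def register(self, subscriber_id, predicate):
        for col in range(self.n):
            if col in predicate:
                self.by_value[col][predicate[col]].add(subscriber_id)
            else:
                self.any_value[col].add(subscriber_id)

    def find_matching_ids(self, event):
        result = None
        for col, value in enumerate(event):
            # .get avoids creating empty entries for unseen values
            candidates = self.by_value[col].get(value, set()) | self.any_value[col]
            result = candidates if result is None else result & candidates
            if not result:
                return []   # no subscriber can match any more
        return sorted(result or ())


d = SetDispatcher(3)
d.register("S1", {0: "IN", 1: "home.php"})
d.register("S2", {0: "US"})
print(d.find_matching_ids(["IN", "home.php", "nytimes.com"]))
```

Stopping as soon as the running intersection is empty keeps the common case cheap when most events match few subscribers.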
Interesting.
My initial thoughts.
I feel it would be easier if the subscriber predicates, e.g.
(C1 == "IN") && (C2 == "search.php") && (C3 == "nytimes.com")
that come to the Dispatcher's registerSubscriber method were flattened into a form that is cheaper to compare. Something like below (wild guess):
C1IN|C2search.php|C3nytimes.com
Then a map from flattened event string to subscriber ids needs to be maintained in memory.
In the findMatchingIds method, the String array of the event also needs to be flattened with the same rules, so that a lookup can be done for the matching subscriber ids.
This way the Dispatchers can be scaled horizontally serving many events in parallel
I think this is more of a design question- I don't think the interviewer would have been looking for working code . The general problem is called Content based Publish Subscribe , and if you search for papers in the same area, you would get a lot of results :
For instance- this paper also
Here are few things the system would need
1) A data-store for the subscriptions which needs to store:
a)Store the list of subscribers
b)Store the list of subscriptions
2) A means for authenticating the requests for subscriptions and the nodes themselves
a) Server-Subscribers communicate over ssl. In the case of the server handling thousands of SSL connections - It's a CPU intensive task, especially if lots of connections are set up in bursts.
b) If all the subscriber nodes are in the same trusted network, need not have ssl.
3) Whether we want a Push or Pull based model:
a) Server can maintain the latest timestamp seen per node, per filter matched. When an event matches a filter, send a notification to the subscriber. Let the client then send a request. The server then initiates sending matching events.
b) Server matches and sends the filtered events to clients in one shot.
Difference between (a) and (b) is that, in (a) you have more state maintained on the client side. Easier to extend a subscriber-specific logic later on. In (b) the client is dumb. It does not have any means to say if it does not want to receive events for whatever reason. (say, network clog).
4) How are the events maintained in memory at the server-side?
a)The logical model here is table with columns of strings (C1..CN), and each new row added is a new event.
b) We could have a hash table per column storing a tuple of (timestamp, pointer to event structure), with each event given a unique id. With different data structures, we can come up with different schemes.
c) Events here are considered as infinite stream. If we have a 32-bit eventId, we have chances of integer-overflow.
d) If we have a timer function on the server, matching and dispatching events,what is the actual resolution of the system timer? Does that have any implication?
e) Memory allocation is a very expensive operation. If your filter-matching logic is going to do frequent allocations/freeing, it will adversely affect performance. How can we manage the memory pool for this particular operation? Would we need different size-buckets of page-aligned memory?
5) What should happen if the subscriber node loses connectivity or goes down?
(a)Is it acceptable for the client to lose events during the period, or should the server buffer everything?
(b)If the subscriber goes down,till what historical time in the past can it request matching events.
6) More details of the messaging layer between (Server,Subscriber)
(a) Is the communication between the server and subscribers synchronous or asynchronous?
(b)Do we need a binary-protocol or text-based protocol between the client/server? (There are trade-off's in both)
7) Should we need any rate-limiting logic in server side? What should we do if we starve some of the clients while serving data to few others?
8) How would the change of subscriptions be managed? If some client wishes to change it's subsciption then, should it be updated in-memory first before updating the permanent data-store? Or vice-versa? What would happen if the server goes down, before the data-store is written-to? How would we ensure consistency of the data-store- the subscriptions/server list?
9) This assumed that we have a single server. What if we need a cluster of servers that the subscribers can connect to? (A whole bunch of issues here:)
a) How can network partitioning be handled? (Example: of, say, 5 nodes, 3 are reachable from each other, and the other 2 can only reach each other.)
b) How are events/workload distributed among the members of the cluster?
10) Is absolute correctness of the information sent to the subscriber a requirement, i.e., can the client receive more information than its subscription rules indicate? This can determine the choice of data structure, for example using a probabilistic data structure such as a Bloom filter on the server side while doing the filtering.
11) How is time-ordering of events maintained on the server side? (A time-sorted linked list? Timestamps?)
12) Will the predicate-logic parser for the subscriptions need Unicode support?
In conclusion, content-based pub-sub is a pretty vast area. It is a distributed system that involves the interaction of databases, networking, algorithms, and node behavior (systems go down, disks go bad, a system runs out of memory because of a memory leak, etc.), and we have to look at all of these aspects. Most importantly, we have to look at the time available for the actual implementation, and then determine how we want to go about solving this problem.

Algorithm/Heuristic for grouping chat message histories by 'conversation'/implicit sessions from time stamps?

The problem: I have a series of chat messages -- between two users -- with time stamps. I could present, say, an entire day's worth of chat messages at once. During the entire day, however, there were multiple, discrete conversations/sessions...and it would be more useful to the user to see these divided up as opposed to the whole day as one continuous stream.
Is there an algorithm or heuristic that can 'deduce' implicit session/conversation starts/breaks from time stamps? Besides an arbitrary 'if the gap is more than x minutes, it's a separate session'. And if that is the only case, how is this interval determined? In any case, I'd like to avoid this.
For example, there are...fifty messages that get sent between 2:00 and 3:00, and then a break, and then twenty messages sent between 4:00 and 5:00. There would be a break inserted between there...but how would the break be determined?
I'm sure that there is already literature on this subject, but I just don't know what to search for.
I was playing around with things like edge detection algorithms and gradient-based approaches for a while.
(see comments for more clarification)
EDIT (Better idea):
You can view each message as being of two types:
A continuation of a previous conversation
A brand new conversation
You can model these two types of messages as independent Poisson processes, where the time difference between adjacent messages follows an exponential distribution.
You can then empirically determine the exponential parameters for these two types of messages by hand (wouldn't be too hard to do given some initial data). Now you have a model for these two events.
Finally when a new message comes along, you can calculate the probability of the message being of type 1 or type 2. If type 2, then you have a new conversation.
Clarification:
The probability of the message being a new conversation, given that the delay is some time T, is:
P(new conversation | delay=T) = P(new conversation AND delay=T)/P(delay=T)
Using Bayes' Rule:
= P(delay=T | new conversation)*P(new conversation)/P(delay=T)
The same calculation goes for P(old conversation | delay=T).
P(delay=T | new conversation) comes from the model. P(new conversation) is easily calculable from the data used to generate your model. P(delay=T) you don't need to calculate at all since all you want to do is compare the two probabilities.
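A minimal sketch of this comparison, assuming both message types have exponentially distributed delays and that the rates and the prior have already been estimated from hand-labelled data. The function names (`fit_rate`, `prob_new_conversation`, `split_sessions`) and the 0.5 cutoff are illustrative assumptions, not part of the answer above. The computation is done in log space so that very long delays do not underflow.

```python
import math

def fit_rate(delays):
    """Maximum-likelihood rate of an exponential distribution: 1 / mean delay."""
    return len(delays) / sum(delays)

def prob_new_conversation(delay, rate_cont, rate_new, prior_new):
    """Posterior P(new conversation | delay) under the two-exponential model."""
    # Log of P(delay | type) * P(type) for each type; P(delay) cancels out.
    log_new = math.log(prior_new) + math.log(rate_new) - rate_new * delay
    log_cont = math.log(1.0 - prior_new) + math.log(rate_cont) - rate_cont * delay
    log_odds = log_new - log_cont
    # Numerically stable logistic function of the log-odds.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)

def split_sessions(timestamps, rate_cont, rate_new, prior_new, cutoff=0.5):
    """Group sorted timestamps (in seconds) into sessions.

    A message starts a new session when the posterior probability of
    'new conversation' exceeds `cutoff` (an assumed, tunable value).
    """
    sessions = []
    for i, ts in enumerate(timestamps):
        if i == 0:
            sessions.append([ts])
            continue
        delay = ts - timestamps[i - 1]
        if prob_new_conversation(delay, rate_cont, rate_new, prior_new) > cutoff:
            sessions.append([ts])
        else:
            sessions[-1].append(ts)
    return sessions
```

For instance, with a mean gap of 30 seconds inside a conversation (`rate_cont=1/30`), a mean gap of one hour before a new one (`rate_new=1/3600`), and `prior_new=0.1`, a 10-second delay is classified as a continuation while a one-hour delay starts a new session. These numbers are made up for illustration and would have to come from your own labelled data.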
The difference in timestamps between adjacent messages depends on the type of conversation and the people participating. Thus you'll want an algorithm that takes into account local characteristics, as opposed to a global threshold parameter.
My proposition would be as follows:
Get the time differences between the last 10 adjacent messages.
Compute the mean (or median).
If the delay until the next message is more than 30 times the mean, it's a new conversation.
Of course, I came up with these numbers on the spot. They would have to be tuned to fit your purpose.
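The steps above could be sketched roughly as follows. This is only one reading of the proposal: the function name `split_by_local_gap` and the `min_baseline` floor are assumptions added for illustration, the median is used instead of the mean, and the gap that triggers a break is deliberately left out of the running window so that it does not inflate the baseline.

```python
from statistics import median

def split_by_local_gap(timestamps, window=10, factor=30.0, min_baseline=1.0):
    """Group sorted timestamps (in seconds) into sessions using a local threshold.

    window       -- how many recent gaps form the baseline (10 in the proposal)
    factor       -- multiple of the baseline that counts as a break (30 in the proposal)
    min_baseline -- assumed floor so that bursts of near-simultaneous messages
                    do not make every later gap look like a break
    """
    sessions = []
    recent_gaps = []  # gaps between messages that stayed in the same session
    for i, ts in enumerate(timestamps):
        if i == 0:
            sessions.append([ts])
            continue
        gap = ts - timestamps[i - 1]
        if recent_gaps:
            baseline = max(median(recent_gaps[-window:]), min_baseline)
            is_break = gap > factor * baseline
        else:
            # No history yet: assume the second message continues the first.
            is_break = False
        if is_break:
            sessions.append([ts])
        else:
            sessions[-1].append(ts)
            recent_gaps.append(gap)
    return sessions
```

Swapping `median` for `statistics.mean` gives the mean-based variant. The median is less sensitive to a single unusually long pause inside a conversation, which is why it is used here.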

Resources