More complicated correlations by rules - apache-storm

I've seen some support for aggregations and joins, but there isn't much written about it.
I wonder whether Storm can correlate events when there is no explicit correlation id.
For example, assume I have 3 (maybe more) spouts that emit tuples representing a Person from different sources.
Spout 1:
Person: name, security_id
Spout 2:
Person: fullName, secId, email
Spout 3:
Person: email
The end of the pipeline should be one list of merged tuples (fields should be combined from all matching tuples). I would like to merge the Person tuples based on conditions such as:
Spout1.security_id = Spout2.secId
||
Spout2.email = Spout3.email
(may be more rules)

In your case, it seems that you need to do a "windowed Cartesian product" (which is quite expensive). For this, you need to use the allGrouping connection pattern from all spouts to a single join bolt. Furthermore, in your join bolt, you need to distinguish incoming tuples (i.e., from which spout a tuple was emitted) using input.getSourceComponent() or input.getSourceStreamId(). See here for a discussion of both methods: How to send output of two different Spout to the same Bolt?
The trickiest part is the buffering. Because you have no ordering guarantees and you don't know whether a tuple might still join in the future, you need to buffer each incoming tuple for some time (best to use distinct buffers for the different sources). Each time you receive a tuple, you evaluate your complex predicate against all buffered tuples. The most difficult question is how long to keep a tuple in the buffer. This is application dependent, as it is a purely semantic question; you need to answer it for yourself.
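Below is a minimal sketch of such a join bolt, assuming the Storm 2.x package layout. The spout component ids, the output field, and the merge format are placeholders, the predicate mirrors the rules from the question, and buffer eviction is only stubbed out because the retention window is application dependent:

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PersonJoinBolt extends BaseRichBolt {
    private OutputCollector collector;
    private Map<String, List<Tuple>> buffers;   // source component id -> buffered tuples

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.buffers = new HashMap<>();
    }

    @Override
    public void execute(Tuple input) {
        String source = input.getSourceComponent();   // which spout emitted this tuple
        buffers.computeIfAbsent(source, s -> new ArrayList<>()).add(input);

        // evaluate the join predicate against everything buffered from the *other* sources
        for (Map.Entry<String, List<Tuple>> entry : buffers.entrySet()) {
            if (entry.getKey().equals(source)) {
                continue;
            }
            for (Tuple other : entry.getValue()) {
                if (matches(input, other)) {
                    collector.emit(new Values(merge(input, other)));
                }
            }
        }
        // TODO: evict tuples older than your application-defined retention window
        collector.ack(input);
    }

    // Spout1.security_id == Spout2.secId  ||  Spout2.email == Spout3.email
    private boolean matches(Tuple a, Tuple b) {
        return fieldsEqual(a, "security_id", b, "secId")
            || fieldsEqual(a, "secId", b, "security_id")
            || fieldsEqual(a, "email", b, "email");
    }

    private boolean fieldsEqual(Tuple a, String fieldA, Tuple b, String fieldB) {
        return a.contains(fieldA) && b.contains(fieldB)
            && a.getStringByField(fieldA) != null
            && a.getStringByField(fieldA).equals(b.getStringByField(fieldB));
    }

    private String merge(Tuple a, Tuple b) {
        return a.getValues() + "+" + b.getValues();   // combine the fields however your schema requires
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("mergedPerson"));
    }
}

In the topology you would then wire every spout to this bolt with allGrouping, along the lines of builder.setBolt("join", new PersonJoinBolt()).allGrouping("spout1").allGrouping("spout2").allGrouping("spout3").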

Related

Parallel Stream toArray maintains order

I read about concurrent collectors maintaining the order of the input list. So if I use a collector to an ArrayList, it can guarantee an ordered collection.
Also, map functions on an ordered list maintain the order.
I could not find any documentation about order preservation in toArray.
Even when a pipeline is constrained to produce a result that is consistent with the encounter order of the stream source (for example, IntStream.range(0,5).parallel().map(x -> x*2).toArray() must produce [0, 2, 4, 6, 8]), no guarantees are made as to the order in which the mapper function is applied to individual elements, or in what thread any behavioral parameter is executed for a given element.
So will
Stream.map(x->x).toArray()
produce ordered results? Or should I use collectors?
The cited part of the documentation already states, by example, that both map and toArray maintain the encounter order.
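For instance, the exact example cited above can be verified directly; this small program should print true on every run, even though the mapper may be invoked on the elements in any order and on any thread:

import java.util.Arrays;
import java.util.stream.IntStream;

public class EncounterOrderDemo {
    public static void main(String[] args) {
        // map and toArray preserve the encounter order of the parallel stream
        int[] result = IntStream.range(0, 5).parallel().map(x -> x * 2).toArray();
        System.out.println(Arrays.equals(result, new int[] {0, 2, 4, 6, 8}));   // always true
    }
}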
When you go through the Stream API documentation, you'll see that it never makes an explicit statement about operations that maintain the encounter order; it does it the other way round, explicitly stating when an operation is unordered or has special policies depending on the ordered state:
obviously unordered() retracts the encounter order explicitly
forEach and findAny do not respect the encounter order
Stream.concat returns an unordered stream if at least one of the two input streams is unordered (a debatable behavior, but that’s how it is)
Stream.generate() generates an unordered stream
skip, limit, takeWhile, and dropWhile respect the encounter order, which may cause significant performance penalties in parallel executions
distinct() and sorted() are stable for ordered streams, while distinct() may have significantly better parallel performance when the stream is unordered
collect(Collector) may behave as unordered if the collector is unordered, which is only hinted at by the statement that the operation will be concurrent if the collector is concurrent and either the stream is unordered or the collector is unordered. For more details, we have to refer to the Collector documentation and the built-in collectors.
Note that while the operations count(), allMatch, anyMatch, and noneMatch have no statement about the encounter order, these operations have a semantic that implies that the result should not depend on the encounter order at all.

RxJS event order guarantee

While exploring Rx for our project, we ran into the following puzzler:
We have one stream S1 that can receive two distinct events (A and B).
If we create two separate streams (Sx1 and Sx2) from that stream (S1) that subscribe specifically for either A or B events (Sx1 for A and Sx2 for B), is there any guarantee that the subscribers will receive the events
in the order they arrive in S1?
It all depends on which merging method you choose, as that determines how the results are given back.
Take a look at RxMarbles; it has great visual examples.
For this case I'd say concat would keep the events in the same order they went in, but if you are dealing with async data this might not be the best option. Look at the combining operators on RxMarbles.

Dynamic tuple in Pig?

I obtained data with tuples in Pig:
0,(0),(zero)
1,(1,2),(first,second)
Can I transform it into this?
0,0,zero
1,1,first
1,2,second
To start off, I'm going to correct your terminology: you should be treating (0) and (1,2) as bags, not tuples. Tuples are intended to be fixed-length data structures that represent some sort of entity, say (name, address, year of birth), for example. If you have a list of similar objects, like {(apple), (orange), (banana)}, you want a bag.
There is no built-in behavior that allows you to "zip" up multiple bags/lists. The reason is that, from a design perspective, Pig treats bags as unordered collections, hence the term "bag", not "list". This assumption really helps out with parallelism since you don't have to be concerned with order. Therefore, it's really hard to match up 1 with first.
What you could try is writing an eval function UDF that takes two bags as parameters, zips the two up, and returns one bag with the pairs combined.
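A rough sketch of such a UDF (the class name is made up; it assumes both bags contain single-field tuples, as in your example, and omits schema declaration and error handling):

import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class ZipBags extends EvalFunc<DataBag> {
    private static final TupleFactory TUPLES = TupleFactory.getInstance();
    private static final BagFactory BAGS = BagFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2) {
            return null;
        }
        DataBag left = (DataBag) input.get(0);    // e.g. {(1),(2)}
        DataBag right = (DataBag) input.get(1);   // e.g. {(first),(second)}
        DataBag result = BAGS.newDefaultBag();

        Iterator<Tuple> l = left.iterator();
        Iterator<Tuple> r = right.iterator();
        // pair the elements positionally; remember that bags carry no ordering guarantee,
        // so this only makes sense if both bags were built in a matching order
        while (l.hasNext() && r.hasNext()) {
            Tuple pair = TUPLES.newTuple(2);
            pair.set(0, l.next().get(0));
            pair.set(1, r.next().get(0));
            result.add(pair);
        }
        return result;
    }
}

You could then call it with something like id, FLATTEN(ZipBags(numbers, words)) in a FOREACH (the field names here are hypothetical), which would yield rows of the form you asked for, subject to the ordering caveat above.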

Design/Code Dispatcher for a Publish-Subscribe System

A friend of mine was asked this problem in an interview. I would like to discuss it here.
What would be an efficient implementation for this problem?
A simple idea that comes to me is a normal memqueue, using Memcache machines to scale to many requests, with a consumer job running that writes things from Memcache to the DB.
Later on, for the second part, we can just run a SQL query to find the list of matching subscribers.
PROBLEM:
Events get published to this system. Each event can be thought of as containing a fixed number (N) of string columns called C1, C2, … CN. Each event can thus be passed around as an array of Strings (C1 being the 0th element in the array, C2 the 1st and so on).
There are M subscribers – S1, … SM
Each subscriber registers a predicate that specifies what subset of the events it’s interested in. Each predicate can contain:
Equality clause on columns, for example: (C1 == “US”)
Conjunctions of such clauses, example:
(C1 == “IN”) && (C2 == “home.php”)
(C1 == “IN”) && (C2 == “search.php”) && (C3 == “nytimes.com”)
(In the above examples, C1 stands for the country code of an event and C2 stands for the web page of the site and C3 the referrer code.)
i.e., each predicate is a conjunction of some number of equality conditions. Note that a predicate does not necessarily have an equality clause for ALL columns (i.e., a predicate may not care about the value of some or all columns). (In the examples above, the first predicate does not care about the columns C3, … CN.)
We have to design and code a Dispatcher that can match incoming events to registered subscribers. The incoming event rate is in millions per second. The number of subscribers is in thousands. So this dispatcher has to be very efficient. In plain words:
When the system boots, all the subscribers register their predicates to the dispatcher
After this events start coming to the dispatcher
For each event, the dispatcher has to emit the id of the matching subscribers.
In terms of an interface specification, the following can be roughly spelt out (in Java):
class Dispatcher {
    public Dispatcher(int N /* number of columns in each event – fixed up front */);
    public void registerSubscriber(String subscriberId /* assume no conflicts */,
                                   String predicate /* predicate for this subscriberId */);
    public List<String> findMatchingIds(String[] event /* assume each event has N Strings */);
}
I.e.: the dispatcher is constructed, then a bunch of registerSubscriber calls are made. After this we continuously invoke the method findMatchingIds(), and the goal of this exercise is to make that function as efficient as possible.
As Hanno Binder implied, the problem is clearly set up to allow pre-processing the subscriptions to obtain an efficient lookup structure. Hanno says the lookup should be a map
(N, K) -> set of subscribers who specified K in field N
(N, "") -> set of subscribers who omitted a predicate for field N
When an event arrives, just look up all the applicable sets and find their intersection. A lookup failure returns the empty set. I'm only recapping Hanno's fine answer to point out that a hash table is O(1) and perhaps faster in this application than a tree. On the other hand, intersecting trees can be faster, O(S + log N) where S is the intersection size. So it depends on the nature of the sets.
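A minimal sketch of that lookup structure, assuming each predicate is handed in already parsed as a map from column index to required value (parsing the string predicate from the original interface is a separate concern):

import java.util.*;

class Dispatcher {
    private final int n;                                   // number of columns per event
    private final List<Map<String, Set<String>>> byValue;  // column -> (value -> subscriber ids)
    private final List<Set<String>> dontCare;              // column -> subscribers with no clause on it

    Dispatcher(int n) {
        this.n = n;
        byValue = new ArrayList<>();
        dontCare = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            byValue.add(new HashMap<>());
            dontCare.add(new HashSet<>());
        }
    }

    void registerSubscriber(String subscriberId, Map<Integer, String> predicate) {
        for (int col = 0; col < n; col++) {
            String required = predicate.get(col);
            if (required == null) {
                dontCare.get(col).add(subscriberId);       // the (N, "") bucket
            } else {
                byValue.get(col).computeIfAbsent(required, v -> new HashSet<>()).add(subscriberId);
            }
        }
    }

    List<String> findMatchingIds(String[] event) {
        Set<String> candidates = null;
        for (int col = 0; col < n; col++) {
            // subscribers matching this column: exact value hits plus the don't-cares
            Set<String> matching = new HashSet<>(dontCare.get(col));
            matching.addAll(byValue.get(col).getOrDefault(event[col], Collections.emptySet()));
            if (candidates == null) {
                candidates = matching;
            } else {
                candidates.retainAll(matching);            // set intersection
            }
            if (candidates.isEmpty()) {
                return Collections.emptyList();            // short-circuit on an empty intersection
            }
        }
        return new ArrayList<>(candidates);
    }
}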
Alternative
Here is my alternative lookup structure, again created only once during preprocessing. Begin by compiling a map
(N, K) -> unique token T (small integer)
There is also a distinguished token 0 that stands for "don't care."
Now every predicate can be thought of as a regular expression-like pattern with N tokens, either representing a specific event string key or "don't care."
We can now build a decision tree in advance. You can also think of this tree as a Deterministic Finite Automaton (DFA) for recognizing the patterns. Edges are labeled with tokens, including "don't care". A don't-care edge is taken if no other edge matches. Accepting states contain the respective subscriber set.
Processing an event starts with converting the keys to a token pattern. If this fails due to a missing map entry, there are no subscribers. Otherwise feed the pattern to the DFA. If the DFA consumes the pattern without crashing, the final state contains the subscriber set. Return this.
For the example, we would have the map:
(1, "IN") -> 1
(2, "home.php") -> 2
(2, "search.php") -> 3
(3, "nytimes.com") -> 4
For N=4, the DFA would look like this:
o --1--> o --2--> o --0--> o --0--> o
          \
           -3--> o --4--> o --0--> o
Note that since there are no subscribers who don't care about e.g. C1, the starting state doesn't have a don't care transition. Any event without "IN" in C1 will cause a crash, and the null set will be properly returned.
With only thousands of subscribers, the size of this DFA ought to be reasonable.
Processing time here is of course O(N) and could be very fast in practice. For real speed, the preprocessing could generate and compile a nest of C switch statements. In this fashion you might actually get millions of events per second with a small number of processors.
You might even be able to coax a standard tool like the flex scanner generator to do most of the work for you.
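For illustration, here is a rough sketch of the event-side lookup only; it assumes the per-column token maps and the state graph have already been built from the predicates during preprocessing, which is where the real work sits:

import java.util.*;

class DfaDispatcher {
    static final int DONT_CARE = 0;                        // the distinguished token

    static class State {
        Map<Integer, State> edges = new HashMap<>();       // token -> next state
        Set<String> subscribers = new HashSet<>();         // filled only on accepting states
    }

    private final List<Map<String, Integer>> tokenMap;     // per column: event value -> token
    private final State start;

    DfaDispatcher(List<Map<String, Integer>> tokenMap, State start) {
        this.tokenMap = tokenMap;
        this.start = start;
    }

    Set<String> findMatchingIds(String[] event) {
        State state = start;
        for (int col = 0; col < event.length; col++) {
            // convert the column value to its token; values no predicate mentions
            // can only ever match via the don't-care edge
            Integer token = tokenMap.get(col).getOrDefault(event[col], DONT_CARE);
            State next = state.edges.get(token);
            if (next == null) {
                next = state.edges.get(DONT_CARE);         // fall back to the don't-care edge
            }
            if (next == null) {
                return Collections.emptySet();             // the DFA "crashed": no subscribers
            }
            state = next;
        }
        return state.subscribers;
    }
}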
A solution that comes to my mind would be:
For each Cn we have a mapping from values to sets of subscribers for those subscribers who subscribed for a value of Cn. Additionally, for each Cn we have a set of subscribers who don't care for the value of Cn ('ANY').
When receiving an event, we look up all the subscribers with matching subscriptions for Cn and receive a set with 0 or more subscribers. To this set we add those subscribers from the 'ANY' set for this Cn.
We do this for every n <= N, yielding n sets of subscribers. The intersection of all n sets is the set of subscribers matching this event.
The mapping from Cn to subscribers can efficiently be stored as a tree, which gives a complexity of O(log k) to look up the subscribers for a single Cn, given that there are subscriptions to k different values.
Thus, for n values we have a complexity of O(n log k).
Intersecting the n sets can also be done in O(n log m) (m being the size of the sets), so we end up with logarithmic complexity overall, which shouldn't be too bad.
Interesting.
My initial thoughts.
I feel it would be easier if the subscriber predicates, e.g.
(C1 == “IN”) && (C2 == “search.php”) && (C3 == “nytimes.com”)
that come to the Dispatcher's
public void registerSubscriber
method were flattened so that they are more performance-friendly for comparison. Something like below (wild guess):
C1IN|C2search.php|C3nytimes.com
Then a map needs to be maintained in memory from the flattened event string to the subscriber ids.
In the
findMatchingIds
method, the String array of the event also needs to be flattened with similar rules so that a lookup can be done for the matching subscriber ids (a rough sketch of this idea follows below).
This way the Dispatchers can be scaled horizontally, serving many events in parallel.
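Here is a hedged sketch of that idea. To cope with predicates that do not constrain every column (which a single flattened key alone cannot express), this version also remembers which column combinations occur among the registered predicates and builds one candidate key per combination at lookup time; the key format and all names are made up:

import java.util.*;
import java.util.function.IntFunction;

class FlattenedDispatcher {
    // flattened predicate key, e.g. "C1=IN|C2=search.php|C3=nytimes.com|" -> subscriber ids
    private final Map<String, Set<String>> byKey = new HashMap<>();
    // distinct sets of column indexes used by at least one predicate, e.g. [0, 1, 2]
    private final Set<List<Integer>> columnCombos = new HashSet<>();

    void registerSubscriber(String subscriberId, SortedMap<Integer, String> predicate) {
        List<Integer> columns = new ArrayList<>(predicate.keySet());
        columnCombos.add(columns);
        byKey.computeIfAbsent(flatten(columns, predicate::get), k -> new HashSet<>()).add(subscriberId);
    }

    List<String> findMatchingIds(String[] event) {
        Set<String> matches = new HashSet<>();
        for (List<Integer> columns : columnCombos) {
            // build the event's key over exactly the columns this combination constrains
            Set<String> subscribers = byKey.get(flatten(columns, col -> event[col]));
            if (subscribers != null) {
                matches.addAll(subscribers);
            }
        }
        return new ArrayList<>(matches);
    }

    private String flatten(List<Integer> columns, IntFunction<String> valueOf) {
        StringBuilder key = new StringBuilder();
        for (int col : columns) {
            key.append("C").append(col + 1).append("=").append(valueOf.apply(col)).append("|");
        }
        return key.toString();
    }
}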
I think this is more of a design question; I don't think the interviewer would have been looking for working code. The general problem is called content-based publish-subscribe, and if you search for papers in that area, you will get a lot of results.
For instance, see this paper as well.
Here are a few things the system would need:
1) A data-store for the subscriptions, which needs to store:
a) The list of subscribers
b) The list of subscriptions
2) A means of authenticating the subscription requests and the nodes themselves
a) Server and subscribers communicate over SSL. A server handling thousands of SSL connections is a CPU-intensive task, especially if lots of connections are set up in bursts.
b) If all the subscriber nodes are in the same trusted network, SSL is not needed.
3) Whether we want a push- or pull-based model:
a) The server can maintain the latest timestamp seen per node, per filter matched. When an event matches a filter, it sends a notification to the subscriber and lets the client then send a request; the server then initiates sending the matching events.
b) The server matches and sends the filtered events to the clients in one shot.
The difference between (a) and (b) is that in (a) more state is maintained on the client side, which makes it easier to add subscriber-specific logic later on. In (b) the client is dumb; it has no means to say that it does not want to receive events for whatever reason (say, a network clog).
4) How are the events maintained in memory on the server side?
a) The logical model here is a table with columns of strings (C1..CN), and each new row added is a new event.
b) We could have a hash-table per column storing a tuple of (timestamp, pointer to event structure), and each event is given a unique id. With different data structures, we can come up with different schemes.
c) Events here are considered an infinite stream. If we have a 32-bit eventId, we have a chance of integer overflow.
d) If we have a timer function on the server matching and dispatching events, what is the actual resolution of the system timer? Does that have any implication?
e) Memory allocation is a very expensive operation. If your filter-matching logic is going to do frequent allocations/freeing, it will adversely affect performance. How can we manage the memory pool for this particular operation? Would we want different size buckets of page-aligned memory?
5) What should happen if the subscriber node loses connectivity or goes down?
(a) Is it acceptable for the client to lose events during that period, or should the server buffer everything?
(b) If the subscriber goes down, how far back in time can it request matching events?
6) More details of the messaging layer between server and subscribers:
(a) Is the communication between the server and subscribers synchronous or asynchronous?
(b) Do we need a binary or a text-based protocol between the client and server? (There are trade-offs in both.)
7) Do we need any rate-limiting logic on the server side? What should we do if we starve some of the clients while serving data to a few others?
8) How would changes to subscriptions be managed? If a client wishes to change its subscription, should it be updated in memory first before updating the permanent data-store, or vice versa? What happens if the server goes down before the data-store is written to? How would we ensure consistency of the data-store (the subscriptions/server list)?
9) This was assuming that we have a single server. What if we need a cluster of servers that the subscribers can connect to? (There is a whole bunch of issues here:)
a) How can network partitioning be handled? (Example: of, say, 5 nodes, 3 nodes are reachable from each other, and the other 2 nodes can only reach each other.)
b) How are events/workload distributed among the members of the cluster?
10) Is absolute correctness of the information sent to the subscriber a requirement, i.e., can the client receive more information than its subscription rules indicate? This can determine the choice of data structure, for example using a probabilistic data structure like a Bloom filter on the server side while doing the filtering.
11) How is the time-ordering of events maintained on the server side? (A time-ordered sorted linked list? Timestamps?)
12) Will the predicate-logic parser for the subscriptions need Unicode support?
In conclusion, content-based pub-sub is a pretty vast area, and it is a distributed system involving the interaction of databases, networking, algorithms, and node behavior (systems go down, disks go bad, systems run out of memory because of a memory leak, etc.). We have to look at all these aspects. Most importantly, we have to look at the time available for the actual implementation, and then determine how we want to go about solving the problem.

How to implement self-join/cross-product with hadoop?

It is a common task to perform some evaluation on pairs of items.
Examples: de-duplication, collaborative filtering, similar items, etc.
This is basically a self-join or cross-product with the same source of data.
To do a self join, you can follow the "reduce-side join" pattern. The mapper emits the join/foreign key as key, and the record as the value.
So, let's say we wanted to do a self-join on "city" (the middle column) on the following data:
don,baltimore,12
jerry,boston,19
bob,baltimore,99
cameron,baltimore,13
james,seattle,1
peter,seattle,2
The mapper would emit the key->value pairs:
(baltimore -> don,12)
(boston -> jerry,19)
(baltimore -> bob,99)
(baltimore -> cameron,13)
(seattle -> james,1)
(seattle -> peter,2)
In the reducer, we'll get this:
(baltimore -> [(don,12), (bob,99), (cameron,13)])
(boston -> [(jerry,19)])
(seattle -> [(james,1), (peter,2)])
From here, you can do the inner-join logic, if you so choose: just match up every item with every other item. To do this, load the data into an ArrayList, then do an N x N loop over the items to compare each one to the others.
Realize that reduce-side joins are expensive. They send pretty much all of the data to the reducers if you don't filter anything out. Also, be careful about loading the data into memory in the reducers; you may blow your heap on a hot join key by loading all of its data into an ArrayList.
The above is a bit different from the typical reduce-side join. The idea is the same when joining two data sets: the foreign key is the key, and the record is the value. The only difference is that the values could be coming from two or more data sets. You can use MultipleInputs to have different mappers parse different input sets, then have the reducer collect data from both.
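A minimal sketch of the mapper and reducer for the city self-join above (new MapReduce API; it assumes comma-separated name,city,value lines and omits the job setup):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CitySelfJoin {

    public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");              // name, city, value
            // emit the join key (city) with the rest of the record as the value
            context.write(new Text(fields[1]), new Text(fields[0] + "," + fields[2]));
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text city, Iterable<Text> records, Context context)
                throws IOException, InterruptedException {
            // buffer the group, then pair every record with every other one (N x N);
            // watch the heap on hot keys, as noted above
            List<String> buffered = new ArrayList<>();
            for (Text record : records) {
                buffered.add(record.toString());                       // copy: Hadoop reuses the Text object
            }
            for (int i = 0; i < buffered.size(); i++) {
                for (int j = 0; j < buffered.size(); j++) {
                    if (i != j) {
                        context.write(city, new Text(buffered.get(i) + "|" + buffered.get(j)));
                    }
                }
            }
        }
    }
}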
Cross product in the case where you don't have any constraints is a nightmare. I.e.,
select * from tablea, tableb;
There are a number of ways to do this. None of them are particularly efficient. If you want this type of behavior, leave me a comment and I'll spend more time explaining a way to do this.
If you can figure out some sort of join key which is a fundamental key to similarity, you are much better off.
Plug for my book: MapReduce Design Patterns. It should be published in a few months, but if you are really interested I can email you the chapter on joins.
One typically uses the reducer to perform whatever logic is required on the join. The trick is to map the dataset twice, possibly adding some marker to the value indicating which run it is. Then a self join is no different from any other kind of join.
