RxJava2: how to have flatMap emit items in the order they were called?

Imagine there is an Observable A emitting
a1, a2, a3, a4...
A.flatMap(a -> f(a)) will emit items in unpredictable order, for example:
fa3, fa1, fa2, fa4...
How could I get the results in order, like below?
fa1, fa2, fa3, fa4...
concatMap returns the result I want, but it processes the inner streams sequentially, which is inefficient when f(a) is time-consuming.
I need something like concatMap with parallel processing ability. Any solution? Thanks.

You can use concatMapEager, which subscribes to all inner sources eagerly (so they run in parallel) but buffers their emissions and produces the results in sequential order.
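A minimal sketch of that approach; the slowF helper, the class name, and the use of the computation scheduler are illustrative assumptions, not part of the question:

import io.reactivex.Observable;
import io.reactivex.schedulers.Schedulers;

public class ConcatMapEagerDemo {
    public static void main(String[] args) {
        Observable.range(1, 4)
            // Each inner source is subscribed to eagerly on the computation
            // scheduler, so the expensive work runs in parallel...
            .concatMapEager(a ->
                Observable.just(a)
                    .subscribeOn(Schedulers.computation())
                    .map(ConcatMapEagerDemo::slowF))
            // ...but the results are emitted in the original order:
            // fa1, fa2, fa3, fa4
            .blockingSubscribe(System.out::println);
    }

    // Stand-in for the time-consuming f(a) from the question.
    private static String slowF(int a) {
        try {
            Thread.sleep((long) (Math.random() * 100));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "fa" + a;
    }
}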

Related

Parallel Stream toArray maintains order

I read that concurrent collectors maintain the order of the input list. So if I use a Collector into an ArrayList, it can guarantee an ordered collection.
Map functions on an ordered stream also maintain the order.
However, I could not find any documentation about order preservation in toArray.
Even when a pipeline is constrained to produce a result that is consistent with the encounter order of the stream source (for example, IntStream.range(0,5).parallel().map(x -> x*2).toArray() must produce [0, 2, 4, 6, 8]), no guarantees are made as to the order in which the mapper function is applied to individual elements, or in what thread any behavioral parameter is executed for a given element.
So will
Stream.map(x->x).toArray()
produce ordered results, or should I use a collector?
The cited part of the documentation already states by example that both map and toArray maintain the encounter order.
When you go through the Stream API documentation, you'll see that it never makes an explicit statement about operations that maintain the encounter order; it does it the other way around and explicitly states when an operation is unordered or has special policies depending on the ordered state:
obviously unordered() retracts the encounter order explicitly
forEach and findAny do not respect the encounter order
Stream.concat returns an unordered stream if at least one of the two input streams is unordered (a debatable behavior, but that’s how it is)
Stream.generate() generates an unordered stream
skip, limit, takeWhile, and dropWhile respect the encounter order, which may cause significant performance penalties in parallel executions
distinct() and sorted() are stable for ordered streams; distinct() may have significantly better parallel performance when the stream is unordered
collect(Collector) may behave as unordered if the collector is unordered, which is only hinted at by the statement that the operation will be concurrent if the collector is concurrent and either the stream is unordered or the collector is unordered. For more details, we have to refer to the Collector documentation and the built-in collectors.
Note that while the operations count(), allMatch, anyMatch, and noneMatch make no statement about the encounter order, these operations have a semantic that implies that the result should not depend on the encounter order at all.
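As a quick, self-contained demonstration of the guarantee the quoted documentation describes (the class name is chosen for illustration):

import java.util.Arrays;
import java.util.stream.IntStream;

public class EncounterOrderDemo {
    public static void main(String[] args) {
        // The mapper may run on any thread, in any order, but toArray()
        // must still assemble the results in encounter order.
        int[] result = IntStream.range(0, 5)
                .parallel()
                .map(x -> x * 2)
                .toArray();

        // Always prints [0, 2, 4, 6, 8], regardless of thread scheduling.
        System.out.println(Arrays.toString(result));
    }
}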

Is the output of an aggregator in Informatica always sorted?

Is the output of an aggregator component in Informatica always going to be sorted by the specified group, even when the sorted input box is not ticked and the component does the grouping itself? Or is there no guarantee?
If your input to the aggregator transformation is not sorted, then the output will also not be sorted.
As for your other question: even if you do not use sorted input, the aggregator transformation will group just fine. It might only impact performance.
Hi Rusty, if the data is not sorted, the aggregator will not sort it. If it is already sorted before being passed to the aggregator transformation, check the sorted input option, as it increases execution speed. To have fun with Informatica, play with https://etlinfromatica.wordpress.com/

More complicated correlations by rules

I've seen some support for aggregations and joins, but there isn't much written about it.
I wonder if Storm can correlate events when there is no explicit correlation id.
For example, assume I have 3 (maybe more) spouts that emit tuples representing a Person from different sources.
Spout 1:
Person: name, security_id
Spout 2:
Person: fullName, secId, email
Spout 3:
Person: email
The end of the pipe should be one list of merged tuples (with fields combined from all tuples). I would like to merge the Person tuples based on conditions such as:
Spout1.security_id = Spout2.secId
||
Spout2.email = Spout3.email
(may be more rules)
In your case, it seems that you need to do a "windowed Cartesian product" (which is quite expensive). For this, you need to use the allGrouping connection pattern from all spouts to a single join bolt. Furthermore, in your join bolt, you need to distinguish incoming tuples (i.e., from which spout a tuple was emitted) using input.getSourceComponent() or input.getSourceStreamId(). See here for a discussion of both methods: How to send output of two different Spout to the same Bolt?
The most tricky part is the buffering. Because you do not have any ordering guarantees and you don't know whether a tuple might join in the future or not, you need to buffer each incoming tuple for some time (best to use distinct buffers for the different sources). Each time you receive a tuple, you need to evaluate your complex predicate against all buffered tuples. The most difficult question to answer is how long to keep a tuple in the buffer. This question is application dependent, as it is purely a semantic question. You need to answer it for yourself. A sketch of such a bolt is shown below.
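A rough sketch of such a join bolt against Storm's standard bolt API. The field names (security_id, secId, email) come from the question; the class name, the merged output format, and the missing eviction logic are assumptions for illustration:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class RuleJoinBolt extends BaseRichBolt {
    private OutputCollector collector;
    // One buffer per source spout, so tuples from different sources stay separate.
    private Map<String, List<Tuple>> buffers;

    @Override
    public void prepare(Map<String, Object> topoConf, TopologyContext context,
                        OutputCollector collector) {
        this.collector = collector;
        this.buffers = new HashMap<>();
    }

    @Override
    public void execute(Tuple input) {
        // Distinguish tuples by the spout that emitted them.
        String source = input.getSourceComponent();
        buffers.computeIfAbsent(source, s -> new ArrayList<>()).add(input);

        // Windowed Cartesian product: evaluate the join predicate against
        // every buffered tuple from the *other* sources.
        for (Map.Entry<String, List<Tuple>> entry : buffers.entrySet()) {
            if (entry.getKey().equals(source)) continue;
            for (Tuple other : entry.getValue()) {
                if (matches(input, other)) {
                    collector.emit(new Values(merge(input, other)));
                }
            }
        }
        collector.ack(input);
        // TODO: evict old tuples; how long to buffer them is an
        // application-specific, semantic decision (see the answer above).
    }

    // The rule set from the question: security_id == secId || email == email,
    // checked in both directions since either tuple may come from either spout.
    private boolean matches(Tuple a, Tuple b) {
        return fieldsEqual(a, "security_id", b, "secId")
            || fieldsEqual(a, "secId", b, "security_id")
            || fieldsEqual(a, "email", b, "email");
    }

    private boolean fieldsEqual(Tuple a, String fa, Tuple b, String fb) {
        return a.contains(fa) && b.contains(fb)
            && a.getValueByField(fa).equals(b.getValueByField(fb));
    }

    // Combine the fields of both tuples into one merged record.
    private Map<String, Object> merge(Tuple a, Tuple b) {
        Map<String, Object> merged = new HashMap<>();
        for (String f : a.getFields()) merged.put(f, a.getValueByField(f));
        for (String f : b.getFields()) merged.put(f, b.getValueByField(f));
        return merged;
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("mergedPerson"));
    }
}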

RxJS event order guarantee

While exploring Rx for our project, we ran into the following puzzler:
We have one stream S1 that can receive two distinct events (A and B).
If we create two separate streams (Sx1 and Sx2) from that stream (S1) that subscribe specifically for either A or B events (Sx1 for A and Sx2 for B), is there any guarantee that the subscribers will receive the events
in the order they arrive in S1?
It all depends on which merging method you choose; that determines how the results are given back.
Take a look at RxMarbles; it has great visual examples.
For this case I'd say concat would keep the events in the same order they went in, but if you are dealing with async data this might not be the best option. Look at the COMBINING OPERATORS section on RxMarbles.
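The question is about RxJS, but the delivery semantics are the same across Rx implementations. A hypothetical illustration in RxJava (not from the thread): as long as no asynchronous boundary (observeOn or a Scheduler) is introduced, each filtered subscriber receives its events synchronously, in exactly the order they arrive on S1.

import io.reactivex.subjects.PublishSubject;

public class SplitOrderDemo {
    public static void main(String[] args) {
        PublishSubject<String> s1 = PublishSubject.create();

        // Sx1 and Sx2 subscribe to the same source, each filtering one event type.
        s1.filter(e -> e.startsWith("A"))
          .subscribe(e -> System.out.println("Sx1 got " + e));
        s1.filter(e -> e.startsWith("B"))
          .subscribe(e -> System.out.println("Sx2 got " + e));

        // Delivered synchronously: Sx1 sees A1, A2 and Sx2 sees B1, B2,
        // each in arrival order.
        s1.onNext("A1");
        s1.onNext("B1");
        s1.onNext("A2");
        s1.onNext("B2");
    }
}

Once the two derived streams are recombined (merge, concat, etc.) or moved onto schedulers, the relative ordering depends on the combining operator, as the answer notes.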

Pig order preservation across RANK and in general

In the following code:
dsBodyStartStopSort =
order dsBodyStartStop
by terminal_id, point_date_stamp, point_time_stamp, event_code
;
dsBodyStartStopRank =
rank dsBodyStartStopSort
;
store dsBodyStartStopRank
into 'xStartStopSort.csv'
using PigStorage(';')
;
I know that if I don't do the RANK in the middle, the sort order will make it to the STORE command; that is guaranteed by Pig.
And it appears from the testing I've done that the RANK does not mess up the sort order, but is that guaranteed? I don't want to just be running on luck.
More generally, what is Pig's rule for preserving a sort once it's done? Does it hold until some reduce event occurs? Will it survive a FILTER? Certainly not a GROUP? I'm just wondering whether there is a well-defined set of rules for when and how Pig does or does not guarantee order.
To summarize: 1) Is order preserved across RANK? 2) How is order preserved generally?
The best piece of documentation I found on the topic:
Bags are disordered unless you explicitly apply a nested ORDER BY operation as demonstrated below. A nested FOREACH will preserve ordering, letting you order by one combination of fields then project out just the values you'd like to concatenate.
From looking at unofficial examples and comments, I conclude the following:
If you do an order right before a rank, it should preserve the order. Personally I prefer to just use RANK xxx BY a,b,c; and only use ORDER afterwards if it is really needed.
If you do an order right before a LIMIT, it should feed LIMIT with the top lines. However, the output will be sorted rather than in the original order.
