Should StreamEx parallelism work when using takeWhile? - java-8

I have a stream that I create like this:
StreamEx.generate(new MySupplier<List<Entity>>())
.flatMap(List::stream)
.map(Entity::getName)
.map(name -> ...)
.. // more stuff
I can change this to work in parallel by just adding parallel:
StreamEx.generate(new MySupplier<List<Entity>>())
.flatMap(List::stream)
.map(Entity::getName)
.map(name -> ...)
.parallel()
.. // more stuff
But I also want to add a takeWhile condition to make the stream stop:
StreamEx.generate(new MySupplier<List<Entity>>())
.takeWhile(not(List::isEmpty))
.flatMap(List::stream)
.map(Entity::getName)
.map(name -> ...)
.parallel()
.. // more stuff
But as soon as I add the takeWhile, the stream seems to become sequential (at least it's only processed by one thread). According to the javadoc of takeWhile, if I understand it correctly, it should work with parallel streams. Am I doing something wrong, or is this by design?

As in the normal Stream API, the fact that something works in parallel does not mean it works efficiently. The javadoc states:
While this operation is quite cheap for sequential stream, it can be quite expensive on parallel pipelines.
Actually you want to use takeWhile with an unordered stream, which could be specially optimized, but is not optimized currently, so this could be considered a defect. I will try to fix this (I'm the StreamEx author).
Update: fixed in version 0.6.5
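For reference, a sketch of how the pipeline from the question might look once the unordered case is handled (assuming StreamEx 0.6.5+; MySupplier, Entity and the elided steps are the question's own placeholders, and explicitly marking the stream unordered is only appropriate if you don't care about encounter order):
StreamEx.generate(new MySupplier<List<Entity>>())
    .unordered()                          // encounter order is irrelevant here
    .parallel()
    .takeWhile(list -> !list.isEmpty())   // make the infinite stream finite
    .flatMap(List::stream)
    .map(Entity::getName)
    .map(name -> ...)
    .. // more stuff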

Related

Why does a Beam IO need beam.AddFixedKey+beam.GroupByKey to work properly?

I'm working on a Beam IO for Elasticsearch in Golang. At the moment I have a working draft version, but I only managed to make it work by doing something whose purpose isn't clear to me.
Basically I looked at existing IOs and found that writes only work if I add the following:
x := beam.AddFixedKey(s, pColl)
y := beam.GroupByKey(s, x)
A full example is in the existing BigQuery IO
Basically I would like to understand why I need AddFixedKey followed by GroupByKey to make it work. I also checked the issue BEAM-3860, but it doesn't have many more details about it.
Those two transforms essentially function as a way to group all elements in a PCollection into one list. For example, their usage in the BigQuery example you posted allows grouping the entire input PCollection into a list that gets iterated over in the ProcessElement method.
Whether to use this approach depends on how you are implementing the IO. The BigQuery example you posted performs its writes as a batch once all elements are available, but that may not be the best approach for your use case. You might prefer to write elements one at a time as they come in, especially if you can parallelize writes among different workers. In that case you would want to avoid grouping the input PCollection together.

Best way to remove cache entry based on predicate in infinispan?

I want to remove a few cache entries if the key in the cache matches some pattern.
For example, I've the following key-value pair in the cache,
("key-1", "value-1"), ("key-2", "value-2"), ("key-3", "value-3"), ("key-4", "value-4")
Since the cache implements the Map interface, I can do this:
cache.entrySet().removeIf(entry -> entry.getKey().startsWith("key-"));
Is there a better way to do this in Infinispan (maybe using the functional or cache stream API)?
The removeIf method on entrySet should work just fine. It will be pretty slow for a distributed cache, though, as it will pull down every entry in the cache, evaluate the predicate locally, and then perform a remove for each entry that matches. Even in a replicated cache it still has to do all of the remove calls (at least the iterator will be local, though). This method may be changed in the future, as we are already updating the Map methods [a].
Another option is to use the functional API instead, as you said [1]. Unfortunately, the way this is implemented, it will still pull all the entries locally first. This may change at a later point if the Functional Map APIs become more popular.
Yet another choice is the cache stream API, which may be a little more cumbersome to use but will give you the best performance of all the options. Glad you mentioned it :) What I would recommend is to apply any intermediate operations first (luckily, in your case you can use filter, since your keys won't change concurrently). Then use the forEach terminal operation that passes the Cache on each node [2] (note this is an override). Inside the forEach callback you can call remove just as you wanted.
cache.entrySet().parallelStream() // or stream() if you want a single thread per node
    .filter(e -> e.getKey().startsWith("key-"))
    .forEach((c, e) -> c.remove(e.getKey())); // the BiConsumer receives the local Cache on each node
You could also use indexing to avoid the iteration of the container as well, but I won't go into that here. Indexing is a whole different beast.
[a] https://issues.jboss.org/browse/ISPN-5728
[1] https://docs.jboss.org/infinispan/9.0/apidocs/org/infinispan/commons/api/functional/FunctionalMap.ReadWriteMap.html#evalAll-java.util.function.Function-
[2] https://docs.jboss.org/infinispan/9.0/apidocs/org/infinispan/CacheStream.html#forEach-org.infinispan.util.function.SerializableBiConsumer-

Parallel stream syntax prior to Java 8 release

Prior to the official Java 8 release, when it was still in development, am I correct in thinking that the syntax for getting streams and parallel streams was slightly different? Now we have the option of saying either:
stream().parallel() or parallelStream()
I remember reading tutorials before its release when there was a subtle difference here - can anyone remind me of what it was, as it has been bugging me!
The current implementation has no difference: .stream() creates a pipeline with the parallel field set to false, then .parallel() just sets this field to true and returns the same object. When using .parallelStream(), it creates the pipeline with the parallel field set to true in the constructor. So both versions are the same. Any subsequent calls to .parallel() or .sequential() do the same thing: change the stream mode flag to true or false and return the same object.
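A quick way to observe this in the current implementation (this relies on implementation details, not on anything the specification guarantees):
List<Integer> list = Arrays.asList(1, 2, 3);
Stream<Integer> s = list.stream();
System.out.println(s.parallel() == s);                   // true: same pipeline object, flag flipped
System.out.println(list.parallelStream().isParallel());  // true: flag was set in the constructor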
The early implementation of the Stream API was different. Here's the source code of AbstractPipeline (the parent of all Stream, IntStream, LongStream and DoubleStream implementations) in lambda-dev just before the logic was changed. Setting the mode to parallel() right after the stream was created from a spliterator was relatively cheap: it just extracts the spliterator from the original stream (the depth == 0 branch in spliteratorSupplier()), then creates a new stream on top of this spliterator, discarding the original stream (at that time there was no close()/onClose(), so there was no need to delegate close handlers).
Nevertheless, if your stream source included intermediate steps (for example, consider the Collections.nCopies implementation, which includes a map step), things were worse: using .stream().parallel() would create a new spliterator with a poor-man's splitting strategy (which involves buffering). So for such a collection, using .parallelStream() was actually better, as it internally applied .parallel() before the intermediate operation. Currently, even for nCopies() you can use .stream().parallel() and .parallelStream() interchangeably.
Going even further back, you may notice that .parallelStream() was initially called simply .parallel(). It was renamed in this changeset.

Will a Java 8 Stream.forEach( x -> {} ); do anything?

I am controlling the Consumer that gets passed to this forEach, so it may or may not be asked to perform an action.
list.parallelStream().forEach(x -> {});
Streams being lazy, the stream won't iterate, right? Nothing will happen is what I expect. Tell me if I am wrong, please.
It will traverse the whole stream, submitting tasks to the fork-join pool, splitting the list into parts and passing all the list elements to this empty lambda. Currently it's impossible to check at runtime whether a lambda expression is empty or not, thus it cannot be optimized away.
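A quick illustrative snippet to see that the elements really are visited (the side effect in peek fires even though the forEach consumer does nothing):
Arrays.asList(1, 2, 3, 4).parallelStream()
      .peek(x -> System.out.println("visited " + x))  // prints four lines
      .forEach(x -> {});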
A similar problem appears with Collector. All collectors have a finisher operation, but in many cases it's an identity function like x -> x. In such cases the code which uses collectors can sometimes be greatly optimized, but you cannot robustly detect whether the supplied lambda is an identity function or not. To solve this, an additional collector characteristic called IDENTITY_FINISH was introduced instead. Were it possible to robustly detect whether the supplied lambda is an identity function, this characteristic would be unnecessary.
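For example, you can see the characteristic on a standard collector (Collectors.toList() declares IDENTITY_FINISH, so the framework knows it may skip the finisher without having to inspect the lambda itself):
System.out.println(Collectors.toList().characteristics()
        .contains(Collector.Characteristics.IDENTITY_FINISH)); // prints true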
Also look at the JDK-8067971 discussion. It proposes creating static constants like Predicate.TRUE (always true) or Predicate.FALSE (always false) to optimize operations like Stream.filter. For example, if Predicate.TRUE is supplied, the filtering step can be removed, and if Predicate.FALSE is supplied, the stream can be replaced with an empty stream at that point. Again, were it possible to detect at runtime that the supplied predicate is always true, it would be unnecessary to create such constants.

Storm Trident 'merge' function that preserves time order

Say I have two streams:
Stream 1: [1,3],[2,4]
Stream 2: [2,5],[3,2]
A regular merge would produce a Stream 3, like this:
[1,3],[2,4],[2,5],[3,2]
I would like to merge the streams whilst preserving the order in which the tuples were emitted, so if [2,5] was emitted at time 1, [1,3] at time 2, [3,2] at time 3 and [2,4] at time 4, the resulting stream would be:
[2,5],[1,3],[3,2],[2,4]
Is there any way to do this and, if so, how? Some sample code would be appreciated as I'm a complete Trident rookie who has recently been thrust into a Trident-based project.
Thanks in advance for your help,
Eli
You have to use external data storage via Trident's persistent state. A Redis sorted set should serve your purpose, I guess.
MORE INFO
If you go through https://github.com/nathanmarz/storm/wiki/Trident-tutorial, you can see how to use memcached as the store for the word counts.
Similarly, you can back the stream up on Redis (if you are not familiar with Redis, try http://redis.io/commands#sorted_set). I think a Redis sorted set will serve the purpose in your case.
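To give a rough idea of the sorted-set approach (this is not Trident code, just an illustration of the data model using the Jedis client; the key name and tuple strings are made up):
Jedis jedis = new Jedis("localhost");
// store each tuple with its emit timestamp as the score
jedis.zadd("merged-stream", 1, "[2,5]");
jedis.zadd("merged-stream", 2, "[1,3]");
jedis.zadd("merged-stream", 3, "[3,2]");
jedis.zadd("merged-stream", 4, "[2,4]");
// reading the range back returns the tuples in emit-time order
System.out.println(jedis.zrange("merged-stream", 0, -1)); // [2,5], [1,3], [3,2], [2,4]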
If you want persistent storage for your data, you can think of using another NoSQL solution like MongoDB, and then you can always index your final data on time, which will easily provide the sort functionality you want. What's more, someone has already written a MongoDB Trident state: https://github.com/sjoerdmulder/trident-mongodb.
Let me know if you are still confused, and if so, about what.
