Parallel stream syntax prior to Java 8 release - parallel-processing

Prior to the official Java 8 release, while it was still in development, am I correct in thinking that the syntax for getting streams and parallel streams was slightly different? Now we have the option of either saying:
stream().parallel() or parallelStream()
I remember reading tutorials before its release when there was a subtle difference here - can anyone remind me of what it was, as it has been bugging me!

The current implementation has no difference: .stream() creates a pipeline with the parallel field set to false, then .parallel() just sets this field to true and returns the same object. When using .parallelStream(), the pipeline is created with the parallel field set to true in the constructor. So both versions are the same. Any subsequent calls to .parallel() or .sequential() do the same thing: change the stream mode flag to true or false and return the same object.
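This equivalence is easy to check. A small sketch follows; note that the same-object identity in the last check reflects the OpenJDK implementation described above, not a specification guarantee:

```java
import java.util.List;
import java.util.stream.Stream;

public class ParallelEquivalence {
    public static void main(String[] args) {
        List<Integer> list = List.of(1, 2, 3);
        // Both forms produce a parallel pipeline over the same source
        System.out.println(list.stream().parallel().isParallel()); // true
        System.out.println(list.parallelStream().isParallel());    // true
        // In the OpenJDK implementation, parallel() flips the mode flag
        // and returns the same pipeline object
        Stream<Integer> s = list.stream();
        System.out.println(s.parallel() == s); // true
    }
}
```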
The early implementation of the Stream API was different. Here's the source code of AbstractPipeline (the parent of all Stream, IntStream, LongStream and DoubleStream implementations) in lambda-dev just before the logic was changed. Setting the mode to parallel() right after the stream was created from the spliterator was relatively cheap: it just extracted the spliterator from the original stream (the depth == 0 branch in spliteratorSupplier()), then created a new stream on top of this spliterator, discarding the original stream (at that time there was no close()/onClose(), so there was no need to delegate close handlers).
Nevertheless, if your stream source included intermediate steps (for example, consider the Collections.nCopies implementation, which includes a map step), things were worse: using .stream().parallel() would create a new spliterator with a poor man's splitting strategy (which involves buffering). So for such a collection, using .parallelStream() was actually better, as it internally applied .parallel() before the intermediate operation. Currently, even for nCopies() you can use .stream().parallel() and .parallelStream() interchangeably.
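A quick sketch confirming that both forms now behave identically for nCopies():

```java
import java.util.Collections;
import java.util.List;

public class NCopiesDemo {
    public static void main(String[] args) {
        // nCopies() builds its stream through an internal map step;
        // today both forms below yield the same result
        List<String> copies = Collections.nCopies(4, "x");
        long a = copies.stream().parallel().count();
        long b = copies.parallelStream().count();
        System.out.println(a + " " + b); // 4 4
    }
}
```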
Going even further back, you may notice that .parallelStream() was initially called simply .parallel(). It was renamed in this changeset.

Related

Confused about performance implications of Sync

I have a question about the marker trait Sync after reading Extensible Concurrency with the Sync and Send Traits.
Java's synchronized means blocking, so I was very confused about how a Rust struct implementing Sync, whose methods are executed on multiple threads, would be effective.
I searched but found no meaningful answer. I'm thinking about it this way: every thread will get the struct's reference synchronously (blocking), but call the method in parallel - is that true?
Java: Accesses to this object from multiple threads become a synchronized sequence of actions when going through this codepath.
Rust: It is safe to access this type synchronously through a reference from multiple threads.
(The two points above are not canonical definitions; they are just demonstrations of how similar words can be used in sentences to obtain different meanings.)
synchronized is implemented as a mutual exclusion lock at runtime. Sync is a compile-time promise about the runtime properties of a specific type, which allows other types to depend on those properties through trait bounds. A Mutex just happens to be one way to provide Sync behavior. Immutable types usually provide this behavior too, without any runtime cost.
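To make the Java side of the contrast concrete, here is a minimal sketch (mine, not from the original answer) showing synchronized acting as a runtime mutual-exclusion lock: without it, the two threads' increments could interleave and lose updates.

```java
public class SyncDemo {
    private static int counter = 0;
    private static final Object lock = new Object();

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                // synchronized acquires a mutual exclusion lock at runtime,
                // serializing the increments across the two threads
                synchronized (lock) {
                    counter++;
                }
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println(counter); // 200000: no updates lost
    }
}
```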
Generally you shouldn't rely on words having exactly the same meaning in different contexts. Java I/O stream != Java collection stream != RxJava reactive stream ~= tokio Stream. C volatile != Java volatile. Etc.
Ultimately the prose matters a lot more than the keywords, which are just shorthands.

Forcing map() over a Java 8 Stream

I'm confused on this situation:
I have a Producer which produces an undetermined number of items from an underlying iterator, possibly a large number of them.
Each item must be mapped to a different interface (e.g., a wrapper, or a JavaBean built from a JSON structure).
So I'm thinking it would be good for Producer to return a stream: it's easy to write code that converts an Iterator to a Stream (using Spliterators and StreamSupport.stream()), then apply Stream.map() and return the final stream.
The problem is that I have an invoker that does nothing with the resulting stream (e.g., a unit test), yet I still want the mapping code to be invoked for every item. At the moment I'm simply calling Stream.count() from the invoker to force that.
Questions are:
Am I doing it wrong? Should I use different interfaces? Note that I find implementing next()/hasNext() for Iterator cumbersome, mainly because it forces you to create a new class (even if it can be anonymous) and keep a pointer and check it. The same goes for collection views; returning a collection that is created eagerly rather than a dynamic view over the underlying iterator is out of the question (the input data set might be very large). The only alternative I like so far is a Java implementation of yield(). Nor do I want the stream to be consumed inside Producer (i.e., forEach()), since some other invoker might want it to perform some real operation.
Is there a best practice for forcing the stream processing?
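A minimal sketch of the Iterator-to-Stream approach described in the question (all names here are hypothetical), using a no-op forEach as a cheap terminal operation instead of count():

```java
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class ProducerSketch {
    // Hypothetical producer: wraps an underlying iterator as a lazily mapped Stream
    static Stream<String> produce(Iterator<Integer> source, AtomicInteger mapped) {
        Spliterator<Integer> split =
                Spliterators.spliteratorUnknownSize(source, Spliterator.ORDERED);
        return StreamSupport.stream(split, false)
                .map(i -> {                    // the mapping runs lazily, per element
                    mapped.incrementAndGet();
                    return "item-" + i;
                });
    }

    public static void main(String[] args) {
        AtomicInteger mapped = new AtomicInteger();
        Stream<String> s = produce(List.of(1, 2, 3).iterator(), mapped);
        System.out.println(mapped.get());  // 0: nothing is mapped until a terminal op runs
        s.forEach(x -> {});                // no-op terminal op forcing the mapping
        System.out.println(mapped.get());  // 3
    }
}
```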

Best way to remove cache entry based on predicate in infinispan?

I want to remove a few cache entries if the key in the cache matches some pattern.
For example, I have the following key-value pairs in the cache:
("key-1", "value-1"), ("key-2", "value-2"), ("key-3", "value-3"), ("key-4", "value-4")
Since the cache implements the Map interface, I can do this:
cache.entrySet().removeIf(entry -> entry.getKey().startsWith("key-"));
Is there a better way to do this in Infinispan (maybe using the functional or cache stream API)?
The removeIf method on entrySet should work just fine. It will be pretty slow for a distributed cache, though, as it will pull down every entry in the cache, evaluate the predicate locally, and then perform a remove for each entry that matches. Even in a replicated cache it still has to do all of the remove calls (at least the iterator will be local, though). This method may be changed in the future, as we are updating the Map methods already [a].
Another option is to use the functional API instead, as you said [1]. Unfortunately, the way this is implemented, it will still pull all the entries locally first. This may be changed at a later point if the Functional Map APIs become more popular.
Yet another choice is the cache stream API, which may be a little more cumbersome to use but will give you the best performance of all the options. Glad you mentioned it :) What I would recommend is to apply any intermediate operations first (luckily, in your case you can use filter, since your keys won't change concurrently). Then use the forEach terminal operation, which passes in the Cache on that node [2] (note this is an override). Inside the forEach callback you can call remove just as you wanted.
cache.entrySet().parallelStream() // stream() if you want a single thread per node
    .filter(e -> e.getKey().startsWith("key-"))
    .forEach((c, e) -> c.remove(e.getKey())); // c is the Cache on the local node
You could also use indexing to avoid iterating the container, but I won't go into that here. Indexing is a whole different beast.
[a] https://issues.jboss.org/browse/ISPN-5728
[1] https://docs.jboss.org/infinispan/9.0/apidocs/org/infinispan/commons/api/functional/FunctionalMap.ReadWriteMap.html#evalAll-java.util.function.Function-
[2] https://docs.jboss.org/infinispan/9.0/apidocs/org/infinispan/CacheStream.html#forEach-org.infinispan.util.function.SerializableBiConsumer-

Should StreamEx parallelism work when using takeWhile?

I have a stream that I create like this:
StreamEx.generate(new MySupplier<List<Entity>>())
.flatMap(List::stream)
.map(Entity::getName)
.map(name -> ...)
.. // more stuff
I can change this to work in parallel by just adding parallel:
StreamEx.generate(new MySupplier<List<Entity>>())
.flatMap(List::stream)
.map(Entity::getName)
.map(name -> ...)
.parallel()
.. // more stuff
But I also want to add a takeWhile condition to make the stream stop:
StreamEx.generate(new MySupplier<List<Entity>>())
.takeWhile(not(List::isEmpty))
.flatMap(List::stream)
.map(Entity::getName)
.map(name -> ...)
.parallel()
.. // more stuff
But as soon as I add the takeWhile, the stream seems to become sequential (at least it's only processed by one thread). According to the javadoc of takeWhile, if I understand it correctly, it should work with parallel streams. Am I doing something wrong, or is this by design?
As in the normal Stream API, the fact that something works in parallel does not mean that it works efficiently. The javadoc states:
While this operation is quite cheap for sequential stream, it can be quite expensive on parallel pipelines.
Actually, you want to use takeWhile with an unordered stream, which could be specially optimized but currently is not, so this can be considered a defect. I will try to fix this (I'm the StreamEx author).
Update: fixed in version 0.6.5
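For comparison, the plain JDK 9+ Stream.takeWhile has the same ordered semantics (elements are taken until the predicate first fails); a minimal sketch:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TakeWhileDemo {
    public static void main(String[] args) {
        // takeWhile keeps elements until the predicate first fails;
        // the 4 and 5 after the 0 are dropped even though they match
        List<Integer> result = Stream.of(1, 2, 3, 0, 4, 5)
                .takeWhile(n -> n > 0)
                .collect(Collectors.toList());
        System.out.println(result); // [1, 2, 3]
    }
}
```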

Will a Java 8 Stream.forEach( x -> {} ); do anything?

I am controlling the Consumer that gets passed to this forEach, so it may or may not be asked to perform an action.
list.parallelStream().forEach( x-> {} );
Since streams are lazy, the stream won't iterate, right? I expect nothing to happen. Please tell me if I am wrong.
It will traverse the whole stream, submitting tasks to the fork-join pool, splitting the list into parts and passing all the list elements to this empty lambda. Currently it's impossible to check at runtime whether a lambda expression is empty, so it cannot be optimized away.
A similar problem appears with Collector. All collectors have a finisher operation, but in many cases it's an identity function like x -> x. In such cases the code that uses collectors can sometimes be greatly optimized, but you cannot robustly detect whether the supplied lambda is the identity. To solve this, an additional collector characteristic called IDENTITY_FINISH was introduced instead. Were it possible to robustly detect whether a supplied lambda is the identity function, this characteristic would be unnecessary.
Also look at the JDK-8067971 discussion. It proposes creating static constants like Predicate.TRUE (always true) or Predicate.FALSE (always false) to optimize operations like Stream.filter. For example, if Predicate.TRUE is supplied, the filtering step can be removed, and if Predicate.FALSE is supplied, the stream can be replaced with an empty stream at that point. Again, were it possible to detect at runtime that a supplied predicate is always true, it would be unnecessary to create such constants.
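The full traversal is easy to observe by counting elements with a peek step; a small sketch:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class EmptyForEachDemo {
    public static void main(String[] args) {
        AtomicLong visited = new AtomicLong();
        List.of(1, 2, 3, 4).parallelStream()
                .peek(x -> visited.incrementAndGet()) // counts elements actually traversed
                .forEach(x -> {});                    // empty consumer, traversal still happens
        System.out.println(visited.get()); // 4: every element was passed through the pipeline
    }
}
```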
