Nondeterministic forEach on Stream class, Java 8 - java-8

The Java docs say this about the forEach method on the Stream class:
The behavior of this operation is explicitly nondeterministic.
So what I understand is that it could be possible to have different outcomes from repeated sequential (and thus not parallel) executions of identical stream pipelines on an identical source, even if the source has an encounter order. Is that right? Could someone give me an example to illustrate that?

The elements inside the pipeline may be processed in parallel (see the Parallelism entry), independently of each other, so the intermediate operations applied to the individual elements may complete in varying order - hence the order is not predictable when the terminal operation is invoked on a stream working in parallel mode.
It may also depend on the data source the stream originates from (a HashSet, for instance) and the intermediate operations in between - see the Ordering entry.
Check whether any intermediate operation introduces parallelism (myStream.isParallel() == true) or breaks ordering (myStream.unordered()).
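Here is a minimal sketch illustrating the point (the class name ForEachDemo is invented for the example). A sequential forEach on an ordered source will, in practice, follow the encounter order, but only forEachOrdered actually guarantees it; with a parallel stream, repeated runs of forEach may print the elements in different orders:

import java.util.Arrays;
import java.util.List;

public class ForEachDemo {
    public static void main(String[] args) {
        List<Integer> list = Arrays.asList(1, 2, 3, 4, 5);

        // forEach on a parallel stream: each element is printed by whichever
        // worker thread happens to process it, so the order may differ
        // between runs.
        list.parallelStream().forEach(System.out::println);

        // forEachOrdered respects the encounter order even in parallel mode.
        list.parallelStream().forEachOrdered(System.out::println);
    }
}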

Related

Effect of splittability and statefulness on the Stream parallel processing

I am working with Stream parallel processing and got to know that if I am using a plain array stream, it gets processed very fast. But if I am using an ArrayList, then the processing gets a bit slower. And if I use a LinkedList or some binary tree, the processing gets even slower.
All that sounds like the more splittable the stream is, the faster the processing would be. That means arrays and ArrayList are most efficient in case of parallel streams. Is it true? If so, shall we always use an ArrayList or array if we want to process a stream in parallel? If so, how should LinkedList and BlockingQueue be used in case of parallel streams?
Another thing is the statefulness of the intermediate functions chosen. If I perform stateless operations like filter() and map(), the performance is high, but if I perform stateful operations like distinct(), sorted(), limit(), or skip(), it takes a lot of time. So again, the parallel stream gets slower. Does that mean we should not go for stateful intermediate functions in parallel streams? If so, then what is the workaround for that?
Well, as discussed in this question, there is hardly any reason to use LinkedList at all. The higher iteration costs apply to all operations, not just parallel streams.
Generally, the splitting support indeed has a big impact on parallel performance: first, whether the source has genuine, hopefully cheap, splitting support rather than inheriting the buffering default behavior of AbstractSpliterator; second, how balanced the splits are.
In this regard, there is no reason why a binary tree should perform badly. A tree can be split into sub-trees easily and if the tree is balanced at the beginning, the splits will be balanced too. Of course, this requires that the actual Collection implementation implements the spliterator() method returning a suitable Spliterator implementation rather than inheriting the default method. E.g. TreeSet has a dedicated spliterator. Still, iterating the sub-trees might be more expensive than iterating an array, but that’s not a property of the parallel processing, as that would apply to sequential processing as well or any kind of iteration over the elements in general.
The question, how to use LinkedList and BlockingQueue in case of parallel streams, is moot. You choose the collection type depending on the application’s needs and if you really need one of these (in case of LinkedList hard to imagine), then you use it and live with the fact that its parallel stream performance would be less than that of ArrayList, which apparently didn’t fit your other needs. There is no general trick to make the parallel stream performance of badly splittable collections better. If there was, it would be part of the library.
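As a rough illustration of how differently these sources split (the class name SplitDemo is invented for the example), you can ask the spliterators directly. An ArrayList splits approximately in half, while a LinkedList, which has no cheap way to find its midpoint, splits off small array batches first, so its early splits are very unbalanced (exact batch sizes are an implementation detail of the JDK):

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Spliterator;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SplitDemo {
    public static void main(String[] args) {
        List<Integer> data = IntStream.range(0, 100_000).boxed()
                                      .collect(Collectors.toList());

        // ArrayList: one split yields two halves of ~50,000 elements each.
        Spliterator<Integer> a = new ArrayList<>(data).spliterator();
        Spliterator<Integer> aLeft = a.trySplit();
        System.out.println(aLeft.estimateSize() + " / " + a.estimateSize());

        // LinkedList: the first split carves off only a small batch,
        // leaving the bulk of the work unsplit.
        Spliterator<Integer> l = new LinkedList<>(data).spliterator();
        Spliterator<Integer> lLeft = l.trySplit();
        System.out.println(lLeft.estimateSize() + " / " + l.estimateSize());
    }
}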
There are some corner cases where the JRE doesn't provide the maximum performance, which will be addressed in Java 9, like String.chars(), Files.lines() or the default spliterator for 3rd-party RandomAccess lists, but none of these apply to LinkedList, BlockingQueue, or custom binary tree implementations.
In other words, if you have a particular use case with a particular collection, there might be something to improve, but there is no trick that could improve the parallel performance of all tasks with all collections.
It is correct that stateful intermediate operations like distinct(), sorted(), limit() and skip() have higher costs for parallel streams, and their documentation even says so. So we could give the general advice to avoid them, especially for parallel streams, but that would be kind of pointless, as you wouldn't use them if you didn't need them. And again, there is no general work-around, as there wouldn't be much sense in offering these operations if there were a generally better alternative.
Not a bad question, IMO.
Of course an array and an ArrayList are going to split much better than a LinkedList or some kind of tree. You can look at how their spliterators are implemented to convince yourself. Spliterators without cheap random access usually start with some batch size (1024 elements) and increase from there; LinkedList does that, and Files.lines too, if I remember correctly. So yes, using arrays and ArrayList will give very good parallelization.
If you want better parallel support for a structure like LinkedList, you could write your own spliterator - I think StreamEx did that for Files.lines, starting with a smaller batch size... And this is a related question, btw.
The other thing is that when you use a stateful intermediate operation, you effectively make the intermediate operations upstream of that stateful one stateful too... Let me provide an example:
IntStream.of(1, 3, 5, 2, 6)
        .filter(x -> {
            System.out.println("Filtering : " + x);
            return x > 2;
        })
        .sorted()
        .peek(x -> System.out.println("Peek : " + x))
        .boxed()
        .collect(Collectors.toList());
This will print:
Filtering : 1
Filtering : 3
Filtering : 5
Filtering : 2
Filtering : 6
Peek : 3
Peek : 5
Peek : 6
Because you have used sorted() and filter() sits above it, filter() has to take all the elements and process them before anything flows further - so that sorted() is applied to the correct ones.
On the other hand, if you drop sorted():
IntStream.of(1, 3, 5, 2, 6)
        .filter(x -> {
            System.out.println("Filtering : " + x);
            return x > 2;
        })
        // .sorted()
        .peek(x -> System.out.println("Peek : " + x))
        .boxed()
        .collect(Collectors.toList());
The output is going to be:
Filtering : 1
Filtering : 3
Peek : 3
Filtering : 5
Peek : 5
Filtering : 2
Filtering : 6
Peek : 6
Generally I do agree; I try to avoid stateful intermediate operations if I can - maybe you don't actually need sorted(), say, because you can collect to a TreeSet instead... etc. But I don't overthink it - if I need to use one, I just do, and maybe measure to see if it's really a bottleneck.
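For instance, a minimal sketch of that TreeSet idea, swapping the stateful sorted() for a collector that keeps its elements sorted as they arrive (whether this is actually faster depends entirely on the workload; it just moves the ordering work out of the pipeline):

import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class TreeSetDemo {
    public static void main(String[] args) {
        // No stateful .sorted() step: the TreeSet maintains sorted order
        // on insertion instead. (Note: unlike sorted(), a TreeSet also
        // drops duplicates.)
        TreeSet<Integer> result = IntStream.of(1, 3, 5, 2, 6)
                .filter(x -> x > 2)
                .boxed()
                .collect(Collectors.toCollection(TreeSet::new));
        System.out.println(result); // prints [3, 5, 6]
    }
}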
Unless you are really hitting some performance problems around this - I would not take that much into account; especially since you would need lots of elements to actually have some speed benefit from parallel.
Here is a related question that shows that you really really need lots of elements to see a performance gain.

What is the difference between codata and data?

There is some explanation here. Intuitively I understand how finite data structures differ from infinite data structures like streams. Nevertheless, it's interesting to see other explanations of differences, characteristics, types of codata.
I stumbled upon the term "codata" when reading about streams.
This answer isn't terribly precise, but I'm going to post it anyway.
The real distinction...
... is between data and computations.
Data
The fundamental property of data is that it has a structure. Data can be passed as input and returned as output by computations. The structure of data can be used by computations. However, by itself, data doesn't do anything. Data just is.
Examples of data types are booleans, numbers, strings, algebraic data types (tagged unions), etc. Correspondingly, examples of values are false, 2, "pyon", SOME 2. It makes sense for computations to operate on values: for example, a computation can branch, depending on whether a number is even or odd. However, it doesn't make sense to ask what values can do: the number 2 can't do anything, it just is.
Computations
The fundamental property of computations is that they have behavior. In other words, computations do. However, computations are "too active" to be passed around or stored in variables. For example, you can't store in a variable the very act of printing "Hello, world!" to the screen.
You may object that you can store a reference to a function in a variable. But a reference to a function isn't quite the same thing as the function's behavior when it's executed! The former is data, the latter is a computation.
Back to codata...
What exactly is codata? Before giving a proper definition, I'll use an example:
Streams are codata
What exactly is a stream? A stream is a reference[1] to a computation that, when executed, produces either:
The first element ("head") of a sequence, together with another stream ("tail") that is logically the remainder of the sequence. Or...
A special symbol ("nil") indicating the end of the sequence.
Streams (and, more generally, codata) are data, because they are references to computations, rather than the computations themselves. However, what makes streams (and, more generally, codata) special is that, when the underlying computations are executed, they may produce other streams (and, more generally, codata).
Now I can finally give a proper definition:
Codata is a reference to a computation that, when executed, may produce (amongst other things) more codata.
[1] The correct technical term is "thunk", not "reference".
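To make the definition concrete, here is a minimal Java sketch (the CoStream type and all its names are invented for illustration, and records require Java 16+). A stream is represented as a thunk - a Supplier - that, when forced, yields one element plus more codata, never the whole infinite sequence at once:

import java.util.function.Supplier;

// A stream as codata: CoStream<T> is a thunk that, when executed,
// produces either a Cons (a head plus another CoStream as the tail)
// or null, standing for "nil", the end of the sequence.
interface CoStream<T> extends Supplier<CoStream.Cons<T>> {
    record Cons<T>(T head, CoStream<T> tail) {}

    // The naturals from n: an infinite sequence. Nothing is computed
    // until get() is called, and each call produces one element and
    // a new thunk for the rest.
    static CoStream<Integer> from(int n) {
        return () -> new Cons<>(n, from(n + 1));
    }
}

Forcing it step by step - CoStream.from(0).get() yields head 0 and a tail that is itself codata - executes exactly one step of the computation at a time.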

How is a prefix sum a bulk-synchronous algorithmic primitive?

Concerning NVIDIA GPUs, the author of the paper "High Performance and Scalable GPU Graph Traversal" says:
1- A sequence of kernel invocations is bulk-synchronous: each kernel is initially presented with a consistent view of the results from the previous.
2- Prefix-sum is a bulk-synchronous algorithmic primitive.
I am unable to understand these two points (I know GPU-based prefix sum, though). Can someone help me with this concept?
1- A sequence of kernel invocations is bulk-synchronous: each kernel is initially presented with a consistent view of the results from the previous.
It's about the bulk-synchronous parallel (BSP) computation model: each processor has its own fast memory (like a CPU cache) and performs its computation using the values stored there, without any synchronization. Then non-blocking communication takes place - each processor puts the data it has computed so far and gets data from its neighbours. Then there's another synchronization step - a barrier, which makes all of them wait for each other.
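A minimal sketch of that superstep structure in Java, using a CyclicBarrier as the synchronization point (the class name BspStep and the worker count are invented for the example; a real BSP runtime would also model the communication step):

import java.util.concurrent.CyclicBarrier;

public class BspStep {
    public static void main(String[] args) {
        int workers = 4;
        // The barrier ends the superstep once all workers arrive.
        CyclicBarrier barrier = new CyclicBarrier(workers,
                () -> System.out.println("superstep done"));
        for (int i = 0; i < workers; i++) {
            int id = i;
            new Thread(() -> {
                try {
                    // 1. compute on local (fast) memory, no synchronization
                    System.out.println("worker " + id + " computing");
                    // 2. exchange data with neighbours (non-blocking)
                    // 3. wait for every other worker before the next superstep
                    barrier.await();
                } catch (Exception e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }
}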
2-Prefix-sum is a bulk-synchronous algorithmic primitive
I believe that's about the second step of the BSP model - synchronization. That's the way processors store and get data for the next step.
The name of the model implies that it is highly concurrent (many, many processes working synchronously relative to each other). And this is how we get to the second point.
Since we want to live up to the name (be highly concurrent), we want to get rid of sequential parts wherever possible. We can achieve that with a prefix sum.
Consider a prefix sum with the associative operator +. A scan over the sequence [5 2 0 3 1] then returns [0 5 7 7 10 11]: each output is the sum of everything before that position, with the grand total appended. So now we can replace this sequential pseudocode:
foreach i = 1...n
    foo[i] = foo[i-1] + bar(i);
with this pseudocode, which can now run in parallel(!):
foreach(i)
    baz[i] = bar(i);
scan(foo, baz);
That is a very naive version, but it's okay for the explanation.
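As a concrete illustration (in Java, like the other examples in this document), the JDK ships this primitive as Arrays.parallelPrefix, which computes the inclusive variant of the scan in place; the [0 5 7 7 10 11] above is the exclusive variant with the grand total appended:

import java.util.Arrays;

public class ScanDemo {
    public static void main(String[] args) {
        int[] a = {5, 2, 0, 3, 1};
        // In-place inclusive prefix sum; the JDK parallelizes the scan
        // for sufficiently large arrays.
        Arrays.parallelPrefix(a, Integer::sum);
        System.out.println(Arrays.toString(a)); // [5, 7, 7, 10, 11]
    }
}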

Linq to Objects: filtering performance question

I was thinking about the way LINQ computes things, and it made me wonder:
If I write
var count = collection.Count(o => o.Category == 3);
Will that perform any differently than:
var count = collection.Where(o => o.Category == 3).Count();
After all, IEnumerable<T>.Where() returns an IEnumerable<T>, which doesn't expose a Count property, so a subsequent Count() would actually have to walk through the items to determine the count, which should cause extra time to be spent on this.
I wrote some quick test code to get some metrics, but the two approaches seem to beat each other at random. I won't put the test code in here initially, but if anyone requests it, I'll add it.
So, am I missing something?
There won't be a lot in it, really - both forms will iterate over the collection, check the predicate against each item, and count the matches. Both approaches will stream the data - it's not like Where is actually building an in-memory list of all matches, for example.
The first form has one fewer (thin) layer of indirection in, that's all. The main reason for using it (IMO) is for readability/simplicity, rather than performance.
As Jon Skeet says, both techniques will essentially have to do the same thing - enumerate the sequence while conditionally incrementing a counter when the predicate is matched. Any performance differences between the two should be slight: insignificant for almost all use-cases. If there is a token winner though, I would think it should be the first one, since from Reflector it appears that the overload of Count that takes a predicate uses its own foreach to enumerate, rather than the more obvious way of offloading the work by streaming a Where into a parameterless Count as in your second example. This means technique #1 is likely to have two minor performance benefits:
Fewer argument-validation checks (null tests etc.). Technique #2's Count will also check whether its (piped) input is an ICollection or ICollection<T>, which it can't possibly be.
A single constructed enumerator vs. two enumerators piped together (an additional state machine has costs).
There is one minor point in favour of technique #2 though: Where is slightly more sophisticated in constructing an enumerator for the source sequence; it uses a different one for lists and arrays. This may make it more performant in certain scenarios.
Of course, I should reiterate that I might be plain wrong about my analysis - reasoning about performance through static code analysis, especially when the differences are likely to be slight, is not a good idea. There is only one way to find out - measuring the execution times for your specific setup.
FYI, the source I reflected was from .NET 3.5 SP1.
I know what you are thinking here. At least, I think I do; Count() will look to see if Count is available as a property, and will simply return that if so. Otherwise, it has to enumerate the items to get its return value.
The version of Count() which accepts the predicate, though, will always cause the collection to be iterated, since it has to do it to see which ones match.
The answers above make good points; consider also that if you break away into any LINQ-to-X implementation that defers execution (LINQ to SQL being the primary one), the Expression parameters used in these methods may cause different results.

Coregions in UML

What are coregions in UML sequence diagrams?
Coregions are used when the sequence of events does not matter, that is, they can safely occur in any order.
This is one of the first few pages I found when I searched for "coregion sequence diagram" on Google.
The coregion is a notational/syntax choice for representing parallel CombinedFragments. The UML 2.2 Superstructure spec (14.3.3) says:
Parallel: The interactionOperator par designates that the CombinedFragment represents a parallel merge between the behaviors of the operands. The OccurrenceSpecifications of the different operands can be interleaved in any way as long as the ordering imposed by each operand as such is preserved. A parallel merge defines a set of traces that describes all the ways that OccurrenceSpecifications of the operands may be interleaved without obstructing the order of the OccurrenceSpecifications within the operand.
The answer above is correct; this is just more context.
The UML is specified by the OMG in two documents (http://www.omg.org/spec/uml): the UML Infrastructure and the UML Superstructure. Any other documentation may not be official.
In the UML superstructure section 14.3.3 it is said:
A notational shorthand for parallel combined fragments are available for the common situation where the order of event occurrences (or other nested fragments) on one Lifeline is insignificant. This means that in a given “coregion” area of a Lifeline all the directly contained fragments are considered separate operands of a parallel combined fragment.
