How "unordered" helps "distinct()" and "groupingBy" - java-8

I am taking Oracle's Stream API Java 1.8 course and stumbled upon this while perusing the lecture notes:
– Inherited from BaseStream
– Returns a stream that is unordered (used internally)
– Can improve efficiency of operations like distinct() and groupingBy()
Here is my question: how can the property of being unordered lead to more efficient calculation of distinct() and groupingBy()?

It is only significant for parallel streams. For an ordered parallel stream, the distinct() operation has to do extra work to keep its stability guarantee, that is,
for duplicated elements, the element appearing first in the encounter
order is preserved
(see the API Note part in the javadoc for Stream.distinct()).
In case of unordered parallel streams, no such guarantee needs to be kept, as the stream is already unordered. This way, removing the ordered characteristic from an ordered parallel stream can greatly improve the performance of the distinct() operation.
Similarly, for the groupingBy() operation, lifting the requirement that the stream order should be preserved can greatly improve the efficiency of the operation in case of parallel streams as the reduction itself can be performed concurrently. Note that this will only happen while collecting from parallel streams with concurrent collectors, with either the collector or the stream itself being unordered. Practically, you'll need to use Stream.collect(groupingByConcurrent(..)) instead of Stream.collect(groupingBy(..)). See the javadoc for Stream.collect() and Collector for more details.
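A minimal sketch of both cases, assuming Java 9 or later (for List.of). The class and method names (UnorderedDemo, distinctUnordered, countByLength) are made up for illustration. It only shows where unordered() and groupingByConcurrent() go in the pipeline and makes no claim about the speedup you will actually measure.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;

class UnorderedDemo {
    // Distinct elements of a parallel stream with the ordering constraint dropped.
    static Set<Integer> distinctUnordered(List<Integer> values) {
        return values.parallelStream()
                .unordered()   // distinct() no longer has to keep the first occurrence in encounter order
                .distinct()
                .collect(Collectors.toSet());
    }

    // Concurrent grouping: all worker threads accumulate into one shared ConcurrentMap
    // instead of building per-thread maps that are merged afterwards.
    static ConcurrentMap<Integer, Long> countByLength(List<String> words) {
        return words.parallelStream()
                .collect(Collectors.groupingByConcurrent(String::length, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(distinctUnordered(List.of(3, 1, 3, 2, 1)));
        System.out.println(countByLength(List.of("a", "bb", "cc", "ddd")));
    }
}
```

Since groupingByConcurrent() is itself an unordered concurrent collector, the second method does not need an explicit unordered() call on the stream.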


Enforcing well balanced parallelism in an unkeyed Flink stream

Based on my understanding of Flink, it introduces parallelism based on keys (keygroups). However, suppose one had a massive unkeyed stream and would like the work to be done in parallel, what would be the best way to achieve this?
If the stream has some fields, one might think about keying by one of the fields arbitrarily. However, this does not guarantee that the workload will be balanced properly, for instance because one value in that field may occur in 90% of the messages. Hence my question:
How to enforce well balanced parallelism in Flink, without prior knowledge of what is in the stream?
One potential solution I could think of is to assign a random number to each message (say 1-3 if you want to have a parallelism of 3, or 1-1000 if you want parallelism to be more flexible). However, I wondered if this was the recommended approach as it does not feel very elegant.
keyBy is one way to specify stream partitioning, and it is especially useful, since you are guaranteed that all stream elements with the same key will be processed together. This is the basis for stateful stream processing with Flink.
However, if you don't need to use key-partitioned state, and instead care about ensuring that the partitions are well balanced, you can use shuffle() or rebalance() to cause a random or round-robin partitioning. See the docs for more details. You can also implement a custom partitioner, if you want more explicit control.
BTW, if you do want to key the stream by a random number, do not do something like keyBy(event -> new Random().nextInt(n)). It's imperative that the key selector be deterministic. This is necessary because the keys do not travel with the stream records. Instead, the key selector function is used to compute the key whenever it is needed. So for random keying, add another field to your events, populate it with a random number, and use that as the key. This technique is useful when you want to use keyed state or timers, but don't have anything suitable to use as a key.
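The random-field idea can be sketched in plain Java, without any Flink dependency. SaltedEvent, withRandomSalt and KEY_SELECTOR are hypothetical names invented for this illustration. In a real job, a function with the same logic as KEY_SELECTOR would be what you hand to keyBy.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Function;

class RandomKeyDemo {
    // Hypothetical event type: the payload plus a random "salt" assigned exactly once.
    static final class SaltedEvent {
        final String payload;
        final int salt;

        SaltedEvent(String payload, int salt) {
            this.payload = payload;
            this.salt = salt;
        }
    }

    // The randomness happens here, when the event is created, not in the key selector.
    static SaltedEvent withRandomSalt(String payload, int buckets) {
        return new SaltedEvent(payload, ThreadLocalRandom.current().nextInt(buckets));
    }

    // Deterministic selector: it only reads the stored field, so it returns
    // the same key no matter how often it is evaluated for the same event.
    static final Function<SaltedEvent, Integer> KEY_SELECTOR = event -> event.salt;

    public static void main(String[] args) {
        SaltedEvent event = withRandomSalt("some message", 3);
        // Both calls print the same key for this event.
        System.out.println(KEY_SELECTOR.apply(event));
        System.out.println(KEY_SELECTOR.apply(event));
    }
}
```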

Parallel counting using a functional approach and immutable data structures?

I have heard and bought the argument that mutation and state is bad for concurrency. But I struggle to understand what the correct alternatives actually are?
For example, when looking at the simplest of all tasks: counting, e.g. word counting in a large corpus of documents. Accessing and parsing the document takes a while so we want to do it in parallel using k threads or actors or whatever the abstraction for parallelism is.
What would be the correct but also practical pure functional way, using immutable data structures to do this?
The general approach to analyzing data sets in a functional way is to partition the data set in some way that makes sense. For a document, you might cut it up into sections based on size, i.e. four threads means the document is split into four pieces.
Each thread or process then executes its algorithm on its section of the data set and generates an output. All the outputs are gathered together and then merged. For word counts, for example, each collection of word counts is sorted by word, and then the lists are stepped through together, looking for the same words. If a word occurs in more than one list, the counts are summed. In the end, a new list with the sums for all the words is output.
This approach is commonly referred to as map/reduce. The step of converting a document into word counts is a "map" and the aggregation of the outputs is a "reduce".
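As a rough illustration of the two steps, here is a small Java sketch (Java 9+ assumed). WordCountDemo and its methods are hypothetical names, and splitting on non-word characters is a simplifying assumption about what counts as a word. Each section is mapped to its own sorted count table, and tables are merged pairwise into new tables without modifying the inputs.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class WordCountDemo {
    // "Map" step: turn one section of text into its own word-count table, sorted by word.
    static Map<String, Long> countWords(String section) {
        return Arrays.stream(section.toLowerCase().split("\\W+"))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    // "Reduce" step: combine two tables into a new one; neither input is modified.
    static Map<String, Long> merge(Map<String, Long> left, Map<String, Long> right) {
        Map<String, Long> merged = new TreeMap<>(left);
        right.forEach((word, count) -> merged.merge(word, count, Long::sum));
        return Collections.unmodifiableMap(merged);
    }

    // Sections are processed in parallel; no table is shared between threads while it is being built.
    static Map<String, Long> countAll(List<String> sections) {
        return sections.parallelStream()
                .map(WordCountDemo::countWords)
                .reduce(Collections.emptyMap(), WordCountDemo::merge);
    }

    public static void main(String[] args) {
        System.out.println(countAll(List.of("the cat", "the dog, the bird")));
    }
}
```

Copying a table on every merge is wasteful for large vocabularies. A persistent (structurally shared) map would avoid that, but the standard library does not provide one, so the sketch favors clarity.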
In addition to the advantage of eliminating the overhead to prevent data conflicts, a functional approach enables the compiler to optimize to a faster approach. Not all languages and compilers do this, but because a compiler knows its variables are not going to be modified by an outside agent it can apply transforms to the code to increase its performance.
In addition, functional programming lets systems like Spark dynamically create threads because the boundaries of change are clearly defined. That's why you can write a single function chain in Spark, and then just throw servers at it without having to change the code. Pure functional languages can do this in a general way, making every application intrinsically multi-threaded.
One of the reasons functional programming is "hot" is because of this ability to enable multiprocessing transparently and safely.
Mutation and state are bad for concurrency only if mutable state is shared between multiple threads for communication, because it's very hard to reason about impure functions and methods that silently trash some shared memory in parallel.
One possible alternative is using message passing for communication between threads/actors (as is done in Akka), and building ("reasonably pure") functional data analysis frameworks like Apache Spark on top of it. Apache Spark is known to be rather suitable for counting words in a large corpus of documents.

What are the main advantages in using promises in Scheme?

Pragmatically, what are the main advantages of using promises? Can you show me some examples of real-life useful usage of promises?
In Scheme a promise is just a value that has a task that is not necessarily done yet and if you never use the value it will never be calculated. In short it is a way to do lazy evaluation in the otherwise eager Scheme. A typical way is to do computations on streams instead of lists.
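The delay/force idea carries over to other eager languages. Below is a minimal Java sketch of it, not Scheme's actual implementation. PromiseDemo, Promise and force are hypothetical names chosen to mirror the Scheme vocabulary. The computation runs on the first force() and the result is cached for later calls.

```java
import java.util.function.Supplier;

class PromiseDemo {
    // A promise wraps a computation that has not been run yet.
    public static final class Promise<T> {
        private Supplier<T> computation;
        private T value;
        private boolean forced;

        public Promise(Supplier<T> computation) {
            this.computation = computation;
        }

        // Runs the computation on the first call, returns the cached value afterwards.
        public synchronized T force() {
            if (!forced) {
                value = computation.get();
                forced = true;
                computation = null; // the closure is no longer needed
            }
            return value;
        }
    }

    public static void main(String[] args) {
        Promise<Integer> answer = new Promise<>(() -> {
            System.out.println("computing...");
            return 6 * 7;
        });
        // Nothing has been computed yet; "computing..." appears only once, on the first force().
        System.out.println(answer.force());
        System.out.println(answer.force());
    }
}
```

If force() is never called, the supplier never runs, which is the "never calculated if never used" property described above.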
With lists you can use higher-order functions: take a list, filter it for the values you are interested in, transform those values, and perhaps at some point you have enough to produce the value you needed. This is nice because you can abstract each step, write logic that does only one step, and compose the steps into the whole program. In this scenario, however, the first step needs to finish in full before the next step can handle its result. If you are searching for the first prime number between 0 and 1000, iterating over all the numbers in every step is not very efficient. This is where streams come in.
With streams the code looks the same, but the intermediate results are produced on demand. A stream is a pair whose parts are promises, so the code that would otherwise build the pair is delayed until the values are used. Every step produces just enough data for the next step. So if the first step only needs to iterate over 20% of the elements for the last step to compute the final result, the remaining 80% will never be processed in any of the steps. With such a structure the initial stream can also be infinite, like all the numbers from 0 upward in steps of 1.
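The same by-need behaviour can be observed with Java's java.util.stream pipelines, used here only as an analogy: they are not built from promise pairs like Scheme streams, but they also pull elements one at a time through all steps. LazyStreamDemo and its methods are hypothetical names. The source below is infinite, yet the search stops as soon as the first match is found.

```java
import java.util.stream.IntStream;

class LazyStreamDemo {
    // Simple trial-division primality test.
    static boolean isPrime(int n) {
        if (n < 2) {
            return false;
        }
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) {
                return false;
            }
        }
        return true;
    }

    // The source is an infinite sequence start, start + 1, ...;
    // findFirst() short-circuits, so elements past the first prime are never generated.
    static int firstPrimeAtLeast(int start) {
        return IntStream.iterate(start, n -> n + 1)
                .filter(LazyStreamDemo::isPrime)
                .findFirst()
                .getAsInt();
    }

    public static void main(String[] args) {
        System.out.println(firstPrimeAtLeast(90)); // prints 97
    }
}
```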
There are penalties involved in using streams. Imagine you write an algorithm that would visit all the elements anyway. Then a stream version of the algorithm would be slower, since creating and forcing the promises adds overhead compared with doing the computation without laziness.
You might be interested in seeing Hal Abelson explaining streams and their pros and cons.
There are other alternatives to streams and lazy evaluation. One is generators. Here you can also make composable procedures that take a generator and produce a generator. The iteration will be by need, like with streams.
Another alternative would be transducers. These are also composable and iterate like streams and generators, but unlike with streams and generators, the initial data cannot be an infinite sequence unless the underlying structure supports it.
The advantages of using promises or any other technique in this answer are not Scheme-specific. They hold for all eager programming languages!

Why was Segment dropped in Java 8 ConcurrentHashMap? [duplicate]

Can any concurrency expert explain which concurrency features of ConcurrentHashMap were improved compared with previous JDKs?
Well, the ConcurrentHashMap has been entirely rewritten. Before Java 8, each ConcurrentHashMap had a “concurrency level” which was fixed at construction time. For compatibility reasons, there is still a constructor accepting such a level, though it is no longer used in the original way. The map was split into as many segments as its concurrency level, each of them having its own lock. So in theory, there could be up to concurrency level concurrent updates, if they all happened to target different segments, which depends on the hashing.
In Java 8, each hash bucket can get updated individually, so as long as there are no hash collisions, there can be as many concurrent updates as its current capacity. This is in line with new features like the compute methods, which guarantee atomic updates and hence require locking of at least the hash bucket being updated. In the best case, they indeed lock only that single bucket.
Further, the ConcurrentHashMap benefits from the general hash improvements applied to all kinds of hash maps. When there are hash collisions for a certain bucket, the implementation will resort to a sorted-map-like structure within that bucket, thus degrading to O(log(n)) complexity rather than the O(n) complexity of the old implementation when searching the bucket.
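A short sketch of such an atomic per-key update, assuming Java 9+ for List.of. AtomicUpdateDemo and tally are hypothetical names. Many threads increment counters through compute(), and no increment is lost because each call runs atomically for its key.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

class AtomicUpdateDemo {
    // Counts occurrences of each key while several threads update the same map.
    static ConcurrentHashMap<String, Integer> tally(List<String> keys) {
        ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();
        keys.parallelStream().forEach(key ->
                // compute() applies the remapping function atomically for this key;
                // counts.merge(key, 1, Integer::sum) would be an equivalent shorthand.
                counts.compute(key, (k, old) -> old == null ? 1 : old + 1));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(tally(List.of("a", "b", "a", "c", "a")));
    }
}
```

A plain get() followed by put() would not be safe here, because another thread could update the key between the two calls.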
I think there are several changes compared with JDK7:
Lazy initialization: in JDK 8, the memory for the backing table is allocated only when the first entry is added to the map. In JDK 7, this is done when the map is created.
New functions were added in JDK 8, such as forEach, reduce, and search.
Inner structure change: the TreeBin (red-black tree) is used in JDK 8 to improve search efficiency.

Data Structure Parallel Add Serial Remove Needed

I'm working on a dynamically branching particle system on the GPU. I need a parallel data structure with the following properties:
One thread must be able to remove elements one by one, in constant time. The element returned isn't important to the algorithm--just so long as some element is returned when nonempty. For extra awesomeness, change to any number of threads.
Any number of threads must be able to add elements to the data structure in constant time. Note that some locking is allowed (and necessary), but it must still scale independently of the number of threads, i.e. more threads shouldn't slow it down.
Basic synchronization primitives (mutexes, semaphores), and anything that can be implemented using them, are available.
I had toyed with the idea of a linked list, but this violates condition two (since adding would be O(m) for m threads, since locking must be taken into consideration). I'm not sure such a data structure exists--but I thought I would ask.
Without knowing more about how you want your data organized (sorted? FIFO? LIFO?) I'm not sure whether I can give you an exact answer. However, what you're describing sounds like the definition of a lock-free structure. Lock-free implementations of stacks and queues exist, which do support O(1) insertions and deletions even when there are a lot of threads modifying the structure concurrently. They're usually based on atomic test-and-set operations.
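To illustrate the access pattern on the CPU side only (this is not GPU code and says nothing about what a particular GPU framework offers), here is a Java sketch using java.util.concurrent.ConcurrentLinkedQueue, a non-blocking queue from the standard library. ParallelAddSerialRemoveDemo and addInParallelThenDrain are hypothetical names.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.stream.IntStream;

class ParallelAddSerialRemoveDemo {
    // Many threads add elements concurrently, then a single thread removes them one by one.
    static int addInParallelThenDrain(int count) {
        ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();

        // Parallel producers: add() does not take a lock.
        IntStream.range(0, count).parallel().forEach(i -> queue.add(i));

        // Serial consumer: poll() returns some element, or null once the queue is empty.
        int removed = 0;
        while (queue.poll() != null) {
            removed++;
        }
        return removed;
    }

    public static void main(String[] args) {
        System.out.println(addInParallelThenDrain(10_000)); // prints 10000
    }
}
```

The order in which elements come out depends on how the producer threads interleave, which matches the requirement that the returned element isn't important.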
If locks are okay and you just want a highly-concurrent data structure that's sorted, consider looking into concurrent skip lists, which provide O(log n) sorted insertion and deletion with multiple active threads.
Hope this helps!
