Parallel Stream non-concurrent unordered collector - java-8

Suppose I have this custom collector :
public class CustomToListCollector<T> implements Collector<T, List<T>, List<T>> {

    @Override
    public Supplier<List<T>> supplier() {
        return ArrayList::new;
    }

    @Override
    public BiConsumer<List<T>, T> accumulator() {
        return List::add;
    }

    @Override
    public BinaryOperator<List<T>> combiner() {
        return (l1, l2) -> {
            l1.addAll(l2);
            return l1;
        };
    }

    @Override
    public Function<List<T>, List<T>> finisher() {
        return Function.identity();
    }

    @Override
    public Set<java.util.stream.Collector.Characteristics> characteristics() {
        return EnumSet.of(Characteristics.IDENTITY_FINISH, Characteristics.UNORDERED);
    }
}
This is exactly the Collectors#toList implementation, with one minor difference: the UNORDERED characteristic is added as well.
I would assume that running this code:
List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);
for (int i = 0; i < 100_000; i++) {
    List<Integer> result = list.parallelStream().collect(new CustomToListCollector<>());
    if (!result.equals(list)) {
        System.out.println(result);
        break;
    }
}
should eventually produce a result whose order differs from the source. But it does not.
I've looked under the hood a bit. ReferencePipeline#collect first checks whether the stream is parallel, whether the collector is concurrent and whether the collector is unordered. CONCURRENT is missing here, so it delegates to the evaluate method by creating a TerminalOp out of this collector. Under the hood this is a ReducingSink that actually cares whether the collector is unordered or not:
return new ReduceOp<T, I, ReducingSink>(StreamShape.REFERENCE) {
    @Override
    public ReducingSink makeSink() {
        return new ReducingSink();
    }

    @Override
    public int getOpFlags() {
        return collector.characteristics().contains(Collector.Characteristics.UNORDERED)
                ? StreamOpFlag.NOT_ORDERED
                : 0;
    }
};
I have not debugged further since it gets pretty complicated fast.
So maybe there is a shortcut here and someone could explain what I am missing. It is a parallel stream that collects elements into a non-concurrent, unordered collector. Shouldn't there be no order in how the threads combine their results? If not, how (and by whom) is the order imposed here?

Note that the result is the same when using list.parallelStream().unordered().collect(Collectors.toList()); in either case, the unordered property is not used within the current implementation.
But let’s change the setup a little bit:
List<Integer> list = Collections.nCopies(10, null).stream()
    .flatMap(ig -> IntStream.range(0, 100).boxed())
    .collect(Collectors.toList());
List<Integer> reference = new ArrayList<>(new LinkedHashSet<>(list));
for (int i = 0; i < 100_000; i++) {
    List<Integer> result = list.parallelStream()
        .distinct()
        .collect(characteristics(Collectors.toList(), Collector.Characteristics.UNORDERED));
    if (!result.equals(reference)) {
        System.out.println(result);
        break;
    }
}
using the characteristics collector factory of this answer.
The interesting thing is that in Java 8 versions prior to 1.8.0_60, this has a different outcome. If we use objects with distinct identities instead of the canonical Integer instances, we can detect that in these earlier versions not only does the order of the list differ, but the objects in the result list may not be the first encountered instances.
So the unordered characteristic of a terminal operation was propagated to the stream, affecting the behavior of distinct(), similar to that of skip and limit, as discussed here and here.
As discussed in the second linked thread, the back-propagation has been removed completely, which is reasonable when thinking about it a second time. For distinct, skip and limit, the order of the source is relevant and ignoring it just because the order will be ignored in subsequent stages is not right. So the only remaining stateful intermediate operation that could benefit from back-propagation would be sorted, which would be rendered obsolete when the order is being ignored afterwards. But combining sorted with an unordered sink is more like a programming error anyway…
For stateless intermediate operations the order is irrelevant anyway. The stream processing works by splitting the source into chunks, applying all stateless intermediate operations to their elements independently and collecting into a local container, before merging into the result container. So the merging step is the only place where respecting or ignoring the order (of the chunks) has an impact on the result and perhaps on the performance.
But the impact isn't very big. When you implement such an operation, e.g. via ForkJoinTasks, you simply split a task into two, wait for their completion and merge them. Alternatively, a task may split off a chunk into a sub-task, process its remaining chunk in place, wait for the sub-task and merge. In either case, merging the results in order comes naturally, because the initiating task has the references to the adjacent tasks at hand. To merge with different chunks instead, the associated sub-tasks would first have to be found somehow.
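A minimal sketch of that natural in-order merge with a RecursiveTask; the task class, element type and chunk threshold below are made up for illustration and are not the actual stream internals:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.RecursiveTask;

// Hypothetical collect-into-a-list task; the real stream implementation is more involved.
class OrderedCollectTask extends RecursiveTask<List<Integer>> {
    private final int[] source;
    private final int from, to;

    OrderedCollectTask(int[] source, int from, int to) {
        this.source = source;
        this.from = from;
        this.to = to;
    }

    @Override
    protected List<Integer> compute() {
        if (to - from <= 2) {                        // small chunk: process in place
            List<Integer> local = new ArrayList<>();
            for (int i = from; i < to; i++) {
                local.add(source[i]);
            }
            return local;
        }
        int mid = (from + to) >>> 1;
        OrderedCollectTask left = new OrderedCollectTask(source, from, mid);
        OrderedCollectTask right = new OrderedCollectTask(source, mid, to);
        left.fork();                                 // left half runs in another worker thread
        List<Integer> rightResult = right.compute(); // right half is processed in place
        List<Integer> leftResult = left.join();
        leftResult.addAll(rightResult);              // merging in encounter order comes naturally
        return leftResult;
    }
}
Invoking it, e.g. with ForkJoinPool.commonPool().invoke(new OrderedCollectTask(array, 0, array.length)), always yields the chunks in source order, simply because each task holds references to its own two halves.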
The only benefit from merging with a different task would be that you can merge with the first completed task, if the tasks need different time to complete. But when waiting for a sub-task in the Fork/Join framework, the thread won’t be idle, the framework will use the thread for working on other pending tasks in-between. So as long as the main task has been split into enough sub-tasks, there will be full CPU utilization. Also, the spliterators attempt to split into even chunks to reduce the differences between the computing times. It’s very likely that the benefit of an alternative unordered merging implementation doesn’t justify the code duplication, at least with the current implementation.
Still, reporting an unordered characteristic allows the implementation to utilize it when beneficial and implementations can change.

This is not an actual answer per se, but if I add more code and comments it will get too long for a comment, I guess.
Here is another interesting thing; it actually made me realize I was wrong in the comments.
A spliterator's flags need to be merged with the flags of the terminal operation and of all intermediate ones.
Our spliterator's flags are (as reported by StreamOpFlag): 95; this can be debugged from AbstractSpliterator#sourceSpliterator(int terminalFlags).
That is why the line below reports true:
System.out.println(StreamOpFlag.ORDERED.isKnown(95)); // true
At the same time our terminal collector's characteristics are 32:
System.out.println(StreamOpFlag.ORDERED.isKnown(32)); // false
The result:
int result = StreamOpFlag.combineOpFlags(32, 95); // 111
System.out.println(StreamOpFlag.ORDERED.isKnown(result)); // false
If you think about it, this makes complete sense: the List has order, my custom collector does not => the combined flags no longer report ORDERED.
Bottom line: the UNORDERED flag is propagated into the resulting stream, but internally nothing is done with it. The implementation probably could make use of it, but it chooses not to.

Related

Difference between Iterator and Spliterator in Java 8

While studying, I came to know that parallelism is a main advantage of Spliterator.
This may be a basic question, but can anyone explain to me the main differences between Iterator and Spliterator and give some examples?
An Iterator is a simple representation of a series of elements that can be iterated over.
eg:
List<String> list = Arrays.asList("Apple", "Banana", "Orange");
Iterator<String> i = list.iterator();
i.next();
i.forEachRemaining(System.out::println);
Output:
Banana
Orange
A Spliterator can be used to split given element set into multiple sets so that we can perform some kind of operations/calculations on each set in different threads independently, possibly taking advantage of parallelism. It is designed as a parallel analogue of Iterator. Other than collections, the source of elements covered by a Spliterator could be, for example, an array, an IO channel, or a generator function.
There are 2 main methods in the Spliterator interface.
- tryAdvance() and forEachRemaining()
With tryAdvance(), we can traverse underlying elements one by one (just like Iterator.next()). If a remaining element exists, this method performs the consumer action on it, returning true; else returns false.
For sequential bulk traversal we can use forEachRemaining():
List<String> list = Arrays.asList("Apple", "Banana", "Orange");
Spliterator<String> s = list.spliterator();
s.tryAdvance(System.out::println);
System.out.println(" --- bulk traversal");
s.forEachRemaining(System.out::println);
System.out.println(" --- attempting tryAdvance again");
boolean b = s.tryAdvance(System.out::println);
System.out.println("Element exists: "+b);
Output:
Apple
--- bulk traversal
Banana
Orange
--- attempting tryAdvance again
Element exists: false
- Spliterator trySplit()
Splits this spliterator in two and returns the new one, covering a portion of the elements; the original keeps the rest:
List<String> list = Arrays.asList("Apple", "Banana", "Orange");
Spliterator<String> s = list.spliterator();
Spliterator<String> s1 = s.trySplit();
s.forEachRemaining(System.out::println);
System.out.println("-- traversing the other half of the spliterator --- ");
s1.forEachRemaining(System.out::println);
Output:
Banana
Orange
-- traversing the other half of the spliterator ---
Apple
An ideal trySplit method should divide its elements exactly in half, allowing balanced parallel computation.
The splitting process is also termed 'partitioning' or 'decomposition'.
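To make the "different threads" point concrete, here is a rough hand-rolled sketch; real parallel streams do this through the Fork/Join framework rather than raw threads, so treat it as an illustration only:
import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;

public class SplitDemo {
    public static void main(String[] args) throws InterruptedException {
        List<String> list = Arrays.asList("Apple", "Banana", "Orange", "Mango");

        Spliterator<String> second = list.spliterator();
        Spliterator<String> first = second.trySplit(); // may return null for very small sources

        // Each half is traversed independently in its own thread.
        Thread t1 = new Thread(() -> first.forEachRemaining(s -> System.out.println("t1: " + s)));
        Thread t2 = new Thread(() -> second.forEachRemaining(s -> System.out.println("t2: " + s)));
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}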
The names are pretty much self-explanatory, to me. Spliterator == splittable iterator: it can split some source, and it can iterate it too. It roughly has the same functionality as an Iterator, but with the addition that it can potentially be split into multiple pieces: this is what trySplit is for. Splitting is needed for parallel processing.
An Iterator always has an unknown size: you can traverse elements only via hasNext/next; a Spliterator can provide the size (thus internally improving other operations too), either an exact one via getExactSizeIfKnown or an approximate one via estimateSize.
On the other hand, tryAdvance is the equivalent of an Iterator's hasNext/next, but as a single method, which is much easier to reason about, IMO. Related to this is forEachRemaining, which in the default implementation delegates to tryAdvance, but it does not always have to (see ArrayList's spliterator, for example).
A Spliterator is also a "smarter" Iterator, via its internal characteristics like DISTINCT or SORTED, etc. (which you need to report correctly when implementing your own Spliterator). These flags are used internally to skip unnecessary operations; see for example this optimization:
someStream().map(x -> y).count();
Because map does not change the size of the stream, the map step can be skipped entirely, since all we do is count the elements.
You can create a Spliterator around an Iterator if you need to, via:
Spliterators.spliteratorUnknownSize(yourIterator, properties)
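For instance, a small sketch of wrapping a plain Iterator so it can feed a Stream; the ORDERED characteristic passed here is just an assumption about the source:
import java.util.Arrays;
import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class IteratorToStream {
    public static void main(String[] args) {
        Iterator<String> iterator = Arrays.asList("Apple", "Banana", "Orange").iterator();

        // Size is unknown, so splitting will be poor, but the stream still works.
        Spliterator<String> spliterator =
                Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED);

        Stream<String> stream = StreamSupport.stream(spliterator, false); // false = sequential
        stream.map(String::toUpperCase).forEach(System.out::println);
    }
}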

Algorithm - Implement two functions that assign/release unique id's from a pool

I am trying to find a good solution for this question -
Implement two functions that assign/release unique id's from a pool. Memory usage should be minimized and the assign/release should be fast, even under high contention.
alloc() returns available ID
release(id) releases previously assigned ID
The first thought was to maintain a map of IDs to availability (as a boolean). Something like this:
Map<Integer, Boolean> availabilityMap = new HashMap<>();

public Integer alloc() {
    for (Map.Entry<Integer, Boolean> es : availabilityMap.entrySet()) {
        if (es.getValue() == false) {
            Integer key = es.getKey();
            availabilityMap.put(key, true);
            return key;
        }
    }
    return null; // no free ID found
}

public void release(Integer id) {
    availabilityMap.put(id, false);
}
However this is not ideal for multiple threads and "Memory usage should be minimized and the assign/release should be fast, even under high contention."
What would be a good way to optimize both memory usage and speed?
For memory usage, I think the map should be replaced with some other data structure, but I am not sure which. Something like a bitmap or a bit set? How can I maintain the id and its availability in that case?
For concurrency I will have to use locks, but I am not sure how I can effectively handle contention. Maybe put the available ids in separate chunks so that each of them can be accessed independently? Any good suggestions?
First of all, you do not want to run over the entire map in order to find an available ID.
So you can maintain two sets of IDs: the first one for available IDs, and the second one for allocated IDs.
That makes allocation/release pretty easy and fast.
Also, you can use ConcurrentMap-backed sets for both containers; it will reduce the contention.
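A minimal sketch of that two-set idea, using sets backed by ConcurrentHashMap; the pre-filled capacity and the class name are my assumptions for illustration:
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class TwoSetIdPool {
    private final Set<Integer> available = ConcurrentHashMap.newKeySet();
    private final Set<Integer> allocated = ConcurrentHashMap.newKeySet();

    public TwoSetIdPool(int capacity) {
        for (int id = 1; id <= capacity; id++) {
            available.add(id);               // pre-fill the pool with the whole ID range
        }
    }

    public Integer alloc() {
        for (Integer id : available) {
            if (available.remove(id)) {      // remove() is atomic; if we lose the race, try the next ID
                allocated.add(id);
                return id;
            }
        }
        return null;                         // pool exhausted
    }

    public void release(Integer id) {
        if (allocated.remove(id)) {
            available.add(id);
        }
    }
}
Note that pre-filling the whole range trades memory for simplicity, which is exactly the tension the question is about; the counter-plus-free-list approach in the answer below avoids that.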
Edit: Changed bottom sentinel, fixed a bug
First, don't iterate the entire map to find an available ID. You should only need constant time to do it.
What you could do to make it fast is to do this:
Create an int index = 1; for your counter. This is technically the number of IDs generated + 1, and is always > 0.
Create an ArrayDeque<Integer> free = new ArrayDeque<>(); to house the free IDs. Guaranteed constant-time access.
When you allocate an ID, if the free ID queue is empty, you can just return the counter and increment it (i.e. return index++;). Otherwise, grab its head and return that.
When you release an ID, push the previously used ID to the free deque.
Remember to synchronize your methods.
This guarantees O(1) allocation and release, and it also keeps allocation quite low (literally once per free). Although it's synchronized, it's fast enough that it shouldn't be a problem.
An implementation might look like this:
import java.util.ArrayDeque;

public class IDPool {
    int index = 1;
    ArrayDeque<Integer> free = new ArrayDeque<>();

    public synchronized int acquire() {
        if (free.isEmpty()) return index++;
        else return free.pop();
    }

    public synchronized void release(int id) {
        free.push(id);
    }
}
Additionally, if you want to ensure the free ID list is unique (as you should for anything important) as well as persistent, you can do the following:
Use a HashMap<Integer, Integer> mapping id -> prev to hold all generated IDs. Remember it doesn't need to be ordered or even iterated.
This is technically going to be a stack encoded inside a hash map.
Highly efficient implementations of this exist.
In reality, any unordered int -> int map will do here.
Track the top ID for the free ID set. Remember that 1 can represent nothing and zero used, so you don't have to box it. (IDs are always positive.) Initially, this would just be int top = 1;
When allocating an ID, if there are free IDs (i.e. top >= 2), do the following:
Set the new top to the old head's value in the free map.
Set the old top's value in the map to 0, marking it used.
Return the old top.
When releasing an old ID, do this instead:
If the old ID is already in the pool, return early, so we don't corrupt it.
Set the ID's value in the map to the old top.
Set the new top to the ID, since it's always the last one to use.
The optimized implementation would end up looking like this:
import java.util.HashMap;

public class IDPool {
    int index = 2;
    int top = 1;
    HashMap<Integer, Integer> pool = new HashMap<>();

    public synchronized int acquire() {
        int id = top;
        if (id == 1) return index++;
        top = pool.replace(id, 0);
        return id;
    }

    public synchronized void release(int id) {
        if (pool.getOrDefault(id, 1) == 0) return;
        pool.put(id, top);
        top = id;
    }
}
If need be, you could use a growable integer array instead of the hash map (it's always contiguous), and realize significant performance gains. Matter of fact, that is how I'd likely implement it. It'd just require a minor amount of bit twiddling to do so, because I'd maintain the array's size to be rounded up to the next power of 2.
Yeah...I had to actually write a similar pool in JavaScript because I actually needed moderately fast IDs in Node.js for potentially high-frequency, long-lived IPC communication.
The good thing about this is that it generally avoids allocations (worst case being once per acquired ID when none are released), and it's very amenable to later optimization where necessary.
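A rough sketch of that array-backed variant; the field names, growth policy and bounds checks are my assumptions, not the author's actual implementation:
import java.util.Arrays;

public class ArrayIDPool {
    private int index = 2;             // next never-used ID; IDs start at 2, 1 is the "empty" sentinel
    private int top = 1;               // head of the free-ID stack, 1 means "no free IDs"
    private int[] pool = new int[16];  // pool[id] = next free ID below it, or 0 if the ID is in use

    public synchronized int acquire() {
        int id = top;
        if (id == 1) {                          // free stack is empty: hand out a fresh ID
            id = index++;
            ensureCapacity(id);
            pool[id] = 0;                       // mark as in use
            return id;
        }
        top = pool[id];                         // pop: the new top is whatever this ID pointed to
        pool[id] = 0;                           // mark as in use
        return id;
    }

    public synchronized void release(int id) {
        if (id < 2 || id >= index || pool[id] != 0) return;  // unknown or already free: ignore
        pool[id] = top;                         // push onto the free stack
        top = id;
    }

    private void ensureCapacity(int id) {
        if (id >= pool.length) {
            // grow to the next power of two that fits, keeping the array contiguous
            pool = Arrays.copyOf(pool, Integer.highestOneBit(id) << 1);
        }
    }
}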

Lazy sorted() in Java8 Streams, need for resorting at each iteration

I'm looking for a way to emulate the following behavior with Java 8 streams.
Given a stream of years, sort them so that the top 10 values are output, such that
after outputting a year, it is decreased and the iteration restarts, re-sorting again.
If I input the years 2005, 2020, 2000, 1967 and 2018, I expect the following results for a limit of 10:
2020
2019
2018 2018
2017 2017
2016 2016 2016
2015 ...
The test I am using is:
public class LazyTest {
    public static void main(String[] args) {
        Stream.of(2005, 2020, 2000, 1967, 2018)
              .map(YearWrapper::new)
              .sorted()
              .limit(10)
              .peek(year -> year.decreaseYear())
              .forEach(System.out::println);
    }

    public static class YearWrapper implements Comparable<YearWrapper> {
        private int currentYear;

        public YearWrapper(int year) {
            this.currentYear = year;
        }

        public void decreaseYear() {
            currentYear--;
        }

        @Override
        public int compareTo(YearWrapper yearsToCompare) {
            return Integer.compare(yearsToCompare.currentYear, this.currentYear);
        }

        @Override
        public String toString() {
            return String.valueOf(currentYear);
        }
    }
}
But it seems sorted() is not lazy at all. The whole sorting is done once at the beginning, so the order is calculated before any further operation; therefore, the 5 values of the example are passed on, already ordered, one by one, and decreaseYear() has no real effect on the iteration.
Is there any way to make sorted() lazy and being applied again to the remaining elements before streaming the next one?
Any other close approach would be much appreciated!!
The documentation of Stream.sorted() says:
This is a stateful intermediate operation.
which is in turn described as
Stateful operations may need to process the entire input before producing a result. For example, one cannot produce any results from sorting a stream until one has seen all elements of the stream.
This documents the non-lazy nature of sorting; however, this has nothing to do with your problem. Even if sorting were lazy, it would not change the fundamental principle of streams that each item is streamed to the terminal operation at most once.
You said that you expected the sorting "to be lazy", but what you actually expected is the sorting to happen again for each item after consumption, which would imply that sorting a stream of n elements actually means sorting n items n times, which no one else would expect, especially as peek is not meant to have a side effect that affects the ongoing operation.
Do I understand you correctly that you want to take the largest number from the list, decrement it, put it back into the list, repeat 10 times?
If so, then this is a job for PriorityQueue:
// assumes static imports of Collectors.toCollection and Comparator.reverseOrder
PriorityQueue<Integer> queue = Stream.of(2005, 2020, 2000, 1967, 2018)
        .collect(toCollection(() -> new PriorityQueue<>(reverseOrder())));

Stream.generate(() -> {
    Integer year = queue.poll();
    queue.add(year - 1);
    return year;
}).limit(10).forEach(System.out::println);

Hadoop / MapReduce - Optimizing "Top N" Word Count MapReduce Job

I'm working on something similar to the canonical MapReduce example - the word count, but with a twist in that I'm looking to only get the Top N results.
Let's say I have a very large set of text data in HDFS. There are plenty of examples that show how to build a Hadoop MapReduce job that will provide you with a word count for every word in that text. For example, if my corpus is:
"This is a test of test data and a good one to test this"
The result set from the standard MapReduce word count job would be:
test:3, a:2, this:2, is: 1, etc..
But what if I ONLY want to get the Top 3 words that were used in my entire set of data?
I can still run the exact same standard MapReduce word-count job, and then just take the Top 3 results once it is ready and is spitting out the count for EVERY word, but that seems a little inefficient, because a lot of data needs to be moved around during the shuffle phase.
What I'm thinking is that, if this sample is large enough and the data is random and well distributed in HDFS, each Mapper does not need to send ALL of its word counts to the Reducers, but rather only some of the top data. So if one mapper has this:
a:8234, the: 5422, man: 4352, ...... many more words ... , rareword: 1, weirdword: 1, etc.
Then what I'd like to do is only send the Top 100 or so words from each Mapper to the Reducer phase - since there is very little chance that "rareword" will suddenly end up in the Top 3 when all is said and done. This seems like it would save on bandwidth and also on Reducer processing time.
Can this be done in the Combiner phase? Is this sort of optimization prior to the shuffle phase commonly done?
This is a very good question, because you have hit the inefficiency of Hadoop's word count example.
The tricks to optimize your problem are the following:
Do a HashMap-based grouping in your local map stage; you can also use a combiner for that. It can look like this; I'm using Guava's HashMultiset, which facilitates a nice counting mechanism.
public static class WordFrequencyMapper extends
        Mapper<LongWritable, Text, Text, LongWritable> {

    private final HashMultiset<String> wordCountSet = HashMultiset.create();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] tokens = value.toString().split("\\s+");
        for (String token : tokens) {
            wordCountSet.add(token);
        }
    }
And you emit the result in your cleanup stage:
    @Override
    protected void cleanup(Context context) throws IOException,
            InterruptedException {
        Text key = new Text();
        LongWritable value = new LongWritable();
        for (Entry<String> entry : wordCountSet.entrySet()) {
            key.set(entry.getElement());
            value.set(entry.getCount());
            context.write(key, value);
        }
    }
}
So you have grouped the words in a local block of work, thus reducing network usage by using a bit of RAM. You can also do the same with a Combiner, but it sorts to group, so this would be slower (especially for strings!) than using a HashMultiset.
To just get the Top N, you will only have to write the Top N from that local HashMultiset to the output collector and aggregate the results in your normal way on the reduce side.
This saves you a lot of network bandwidth as well; the only drawback is that you need to sort the word-count tuples in your cleanup method.
A part of the code might look like this:
Set<String> elementSet = wordCountSet.elementSet();
String[] array = elementSet.toArray(new String[elementSet.size()]);
Arrays.sort(array, new Comparator<String>() {
    @Override
    public int compare(String o1, String o2) {
        // sort descending
        return Long.compare(wordCountSet.count(o2), wordCountSet.count(o1));
    }
});

Text key = new Text();
LongWritable value = new LongWritable();
// just emit the first n records
for (int i = 0; i < N; i++) {
    key.set(array[i]);
    value.set(wordCountSet.count(array[i]));
    context.write(key, value);
}
Hope you get the gist: do as much of the word counting locally as possible, and then just aggregate the top N of the top N's ;)
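The reduce side isn't shown in the answer; a rough sketch of what it could look like, assuming a single reducer so that the top N is global, and using a TreeMap keyed by count (which, as a simplification, collapses words with equal counts; a real job would need a composite key or secondary structure):
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TopNReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    private static final int N = 3;
    // count -> word, sorted ascending so the smallest entry is easy to evict
    private final TreeMap<Long, String> topN = new TreeMap<>();

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable value : values) {
            sum += value.get();           // add up the per-mapper partial counts
        }
        topN.put(sum, key.toString());
        if (topN.size() > N) {
            topN.remove(topN.firstKey()); // drop the currently smallest count
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // emit the largest counts first
        for (Map.Entry<Long, String> entry : topN.descendingMap().entrySet()) {
            context.write(new Text(entry.getValue()), new LongWritable(entry.getKey()));
        }
    }
}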
Quoting Thomas
To just get the Top N, you will only have to write the Top N in that
local HashMultiset to the output collector and aggregate the results
in your normal way on the reduce side. This saves you a lot of network
bandwidth as well, the only drawback is that you need to sort the
word-count tuples in your cleanup method.
If you write only the top N from the local HashMultiset, then there is a possibility that you will miss part of the count of an element that, if passed on from this local HashMultiset, could become one of the overall top N elements.
For example, consider the following three maps in the format MapName : elementName,elementCount:
Map A : Ele1,4 : Ele2,5 : Ele3,5 : Ele4,2
Map B : Ele1,1 : Ele5,7 : Ele6, 3 : Ele7,6
Map C : Ele5,4 : Ele8,3 : Ele1,1 : Ele9,3
Now, if we consider only the top 3 of each mapper, we will miscount the element "Ele1", whose total count should have been 6; since we are taking each mapper's top 3, we see "Ele1"'s total count as only 4.
I hope that makes sense. Please let me know what you think about it.

Scala: Mutable vs. Immutable Object Performance - OutOfMemoryError

I wanted to compare the performance characteristics of immutable.Map and mutable.Map in Scala for a similar operation (namely, merging many maps into a single one. See this question). I have what appear to be similar implementations for both mutable and immutable maps (see below).
As a test, I generated a List containing 1,000,000 single-item Map[Int, Int] and passed this list into the functions I was testing. With sufficient memory, the results were unsurprising: ~1200ms for mutable.Map, ~1800ms for immutable.Map, and ~750ms for an imperative implementation using mutable.Map -- not sure what accounts for the huge difference there, but feel free to comment on that, too.
What did surprise me a bit, perhaps because I'm being a bit thick, is that with the default run configuration in IntelliJ 8.1, both mutable implementations hit an OutOfMemoryError, but the immutable collection did not. The immutable test did run to completion, but it did so very slowly -- it takes about 28 seconds. When I increased the max JVM memory (to about 200MB, not sure where the threshold is), I got the results above.
Anyway, here's what I really want to know:
Why do the mutable implementations run out of memory, but the immutable implementation does not? I suspect that the immutable version allows the garbage collector to run and free up memory before the mutable implementations do -- and all of those garbage collections explain the slowness of the immutable low-memory run -- but I'd like a more detailed explanation than that.
Implementations below. (Note: I don't claim that these are the best implementations possible. Feel free to suggest improvements.)
def mergeMaps[A,B](func: (B,B) => B)(listOfMaps: List[Map[A,B]]): Map[A,B] =
  (Map[A,B]() /: (for (m <- listOfMaps; kv <- m) yield kv)) { (acc, kv) =>
    acc + (if (acc.contains(kv._1)) kv._1 -> func(acc(kv._1), kv._2) else kv)
  }

def mergeMutableMaps[A,B](func: (B,B) => B)(listOfMaps: List[mutable.Map[A,B]]): mutable.Map[A,B] =
  (mutable.Map[A,B]() /: (for (m <- listOfMaps; kv <- m) yield kv)) { (acc, kv) =>
    acc + (if (acc.contains(kv._1)) kv._1 -> func(acc(kv._1), kv._2) else kv)
  }

def mergeMutableImperative[A,B](func: (B,B) => B)(listOfMaps: List[mutable.Map[A,B]]): mutable.Map[A,B] = {
  val toReturn = mutable.Map[A,B]()
  for (m <- listOfMaps; kv <- m) {
    if (toReturn contains kv._1) {
      toReturn(kv._1) = func(toReturn(kv._1), kv._2)
    } else {
      toReturn(kv._1) = kv._2
    }
  }
  toReturn
}
Well, it really depends on the actual type of Map you are using. Probably HashMap. Now, mutable structures like that gain performance by pre-allocating memory they expect to use. You are joining one million maps, so the final map is bound to be somewhat big. Let's see how these key/values get added:
protected def addEntry(e: Entry) {
  val h = index(elemHashCode(e.key))
  e.next = table(h).asInstanceOf[Entry]
  table(h) = e
  tableSize = tableSize + 1
  if (tableSize > threshold)
    resize(2 * table.length)
}
See the 2 * in the resize line? The mutable HashMap grows by doubling each time it runs out of space, while the immutable one is pretty conservative in memory usage (though existing keys will usually occupy twice the space when updated).
Now, as for other performance problems, you are creating a list of keys and values in the first two versions. That means that, before you join any maps, you already have each Tuple2 (the key/value pairs) in memory twice! Plus the overhead of List, which is small, but we are talking about more than one million elements times the overhead.
You may want to use a projection, which avoids that. Unfortunately, projection is based on Stream, which isn't very reliable for our purposes on Scala 2.7.x. Still, try this instead:
for (m <- listOfMaps.projection; kv <- m) yield kv
A Stream doesn't compute a value until it is needed. The garbage collector ought to collect the unused elements as well, as long as you don't keep a reference to the Stream's head, which seems to be the case in your algorithm.
EDIT
To complement this: a for/yield comprehension takes one or more collections and returns a new collection. Whenever it makes sense, the returned collection is of the same type as the original collection. So, for example, in the following code, the for-comprehension creates a new list, which is then stored inside l2. It is not val l2 = which creates the new list, but the for-comprehension.
val l = List(1,2,3)
val l2 = for (e <- l) yield e*2
Now, let's look at the code being used in the first two algorithms (minus the mutable keyword):
(Map[A,B]() /: (for (m <- listOfMaps; kv <-m) yield kv))
The foldLeft operation, here written with its /: synonym, will be invoked on the object returned by the for-comprehension. Remember that a : at the end of an operator inverts the order of the object and the parameters.
Now, let's consider what object is this, on which foldLeft is being called. The first generator in this for-comprehension is m <- listOfMaps. We know that listOfMaps is a collection of type List[X], where X isn't really relevant here. The result of a for-comprehension on a List is always another List. The other generators aren't relevant.
So, you take this List, get all the key/values inside each Map which is a component of this List, and make a new List with all of that. That's why you are duplicating everything you have.
(in fact, it's even worse than that, because each generator creates a new collection; the collections created by the second generator are just the size of each element of listOfMaps though, and are immediately discarded after use)
The next question -- actually, the first one, but it was easier to invert the answer -- is how the use of projection helps.
When you call projection on a List, it returns a new object, of type Stream (on Scala 2.7.x). At first you may think this will only make things worse, because you'll now have three copies of the List, instead of a single one. But a Stream is not pre-computed. It is lazily computed.
What that means is that the resulting object, the Stream, isn't a copy of the List, but, rather, a function that can be used to compute the Stream when required. Once computed, the result will be kept so that it doesn't need to be computed again.
Also, map, flatMap and filter on a Stream all return a new Stream, which means you can chain them all together without making a single copy of the List which created them. Since for-comprehensions with yield use these very functions, the use of a Stream inside them prevents unnecessary copies of the data.
Now, suppose you wrote something like this:
val kvs = for (m <- listOfMaps.projection; kv <-m) yield kv
(Map[A,B]() /: kvs) { ... }
In this case you aren't gaining anything. After assigning the Stream to kvs, the data hasn't been copied yet. Once the second line is executed, though, kvs will have computed each of its elements, and, therefore, will hold a complete copy of the data.
Now consider the original form:
(Map[A,B]() /: (for (m <- listOfMaps.projection; kv <-m) yield kv))
In this case, the Stream is used at the same time it is computed. Let's briefly look at how foldLeft for a Stream is defined:
override final def foldLeft[B](z: B)(f: (B, A) => B): B = {
  if (isEmpty) z
  else tail.foldLeft(f(z, head))(f)
}
If the Stream is empty, just return the accumulator. Otherwise, compute a new accumulator (f(z, head)) and then pass it and the function to the tail of the Stream.
Once f(z, head) has executed, though, there will be no remaining reference to the head. Or, in other words, nothing anywhere in the program will be pointing to the head of the Stream, and that means the garbage collector can collect it, thus freeing memory.
The end result is that each element produced by the for-comprehension will exist just briefly, while you use it to compute the accumulator. And this is how you save keeping a copy of your whole data.
Finally, there is the question of why the third algorithm does not benefit from it. Well, the third algorithm does not use yield, so no copy of any data whatsoever is being made. In this case, using projection only adds an indirection layer.
