merge multiple large hashes - ruby

I'm trying to merge 3 hashes using the .merge method. It works perfectly with small hashes, but when I try with larger hashes I get an error, probably a memory overflow:
[1] 50734 killed ruby bin/app.rb
Example:
a = {sheet_01: { 1=>"One", 2=>"Two", 3=>"Three"} }
b = {sheet_02: { 1=>"aaa", 2=>"bbb", 3=>"ccc"} }
c = {sheet_03: { 1=>"zzz", 2=>"www", 3=>"yyy"} }

a[:sheet_01].merge(b[:sheet_02], c[:sheet_03]) do |_key, v1, v2|
  v1 + v2
end
# {1=>"Oneaaazzz", 2=>"Twobbbwww", 3=>"Threecccyyy"}
but if I test these hashes with 600 values each, my program crashes

Avoid Copying Large In-Memory Data Structures Under Memory Pressure
Without a lot more information about your system or your real data, no one can really debug this for you. However, it seems likely that Ruby or its parent process is running out of memory and being reaped by the OOM killer, although the limited error message you provided can't confirm that. It's not a Ruby problem per se; you'll have to look at both your memory and swap usage at the system level to verify it.
However, it's safe to say that merging large hashes the way you are is potentially memory intensive. This isn't just about the size of the hashes, but also (potentially) about their contents. If you're under memory pressure, you may want to consider:
Using Hash#merge! rather than Hash#merge, as the latter will make a new copy of the merged hash rather than mutate the existing hashes (see the sketch after this list).
Using scope gates to ensure that excess memory from your hash variables is reclaimable by the garbage collector as soon as you're done with them.
Switching to a different storage mechanism, as large in-memory hashes getting passed around is inherently more memory intensive than calls to a database, on-disk structure, or external key/value store.
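As a rough illustration of the first two points, here is a minimal sketch (merge_into! is a made-up helper name, and the string values just mirror the question): merging in place avoids allocating a whole new merged hash, and doing the work inside a method lets the intermediate hashes go out of scope and become collectable once the method returns.

# Sketch only: merges the other sheets into `target` in place.
def merge_into!(target, *others)
  others.each do |other|
    target.merge!(other) { |_key, v1, v2| v1 + v2 }  # merge! mutates `target`
  end
  target  # locals in this method become collectable once it returns
end

a = { 1 => "One", 2 => "Two" }
b = { 1 => "aaa", 2 => "bbb" }
c = { 1 => "zzz", 2 => "www" }
p merge_into!(a, b, c)  # => {1=>"Oneaaazzz", 2=>"Twobbbwww"}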
You may also want to revisit why your hashes are so big, and whether that's really the best representation of your data or your business logic. Large, persistent data structures in memory are sometimes an indication that you're not representing or manipulating your data structures as efficiently as possible, but your mileage may vary.

So, I wrote a little benchmark to try and reproduce it, but I failed.
As you can see, I am using random strings of ~1000 characters (768 random bytes encode to roughly 1024 Base64 characters), which is 100x the size of your strings. I am using 10,000 lines per sheet, which is more than 10x what you have. And I am using 26 hashes, which is almost 10x what you have. All in all, my memory usage should be roughly 10,000x yours.
With this benchmark, the merge itself takes about 1.3s, and the entire memory usage for the Ruby process never even touches 1GB. I also tried it with 100000 lines per sheet, and the memory usage went to a little over 8GB, but still no crash.
#!/usr/bin/env ruby

require 'securerandom'
require 'benchmark/ips'

def generate_sheet
  Array.new(10_000) { |i| [i, SecureRandom.base64(768)] }.to_h
end

def generate_hash
  { sheet: generate_sheet }
end

a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z =
  Array.new(26) { generate_hash }

Benchmark.ips do |bm|
  bm.config warmup: 20, time: 50

  bm.report do
    a[:sheet].merge(
      b[:sheet], c[:sheet], d[:sheet], e[:sheet], f[:sheet], g[:sheet],
      h[:sheet], i[:sheet], j[:sheet], k[:sheet], l[:sheet], m[:sheet],
      n[:sheet], o[:sheet], p[:sheet], q[:sheet], r[:sheet], s[:sheet],
      t[:sheet], u[:sheet], v[:sheet], w[:sheet], x[:sheet], y[:sheet],
      z[:sheet]
    ) { |_key, v1, v2| v1 + v2 }
  end
end

Related

Julia: Parallel for loop over partitions iterator

So I'm trying to iterate over the list of partitions of something, say 1:n for some n between 13 and 21. The code that I ideally want to run looks something like this:
valid_num = @parallel (+) for p in partitions(1:n)
    int(is_valid(p))
end
println(valid_num)
This would use the @parallel for to map-reduce my problem. For example, compare this to the example in the Julia documentation:
nheads = @parallel (+) for i=1:200000000
    Int(rand(Bool))
end
However, if I try my adaptation of the loop, I get the following error:
ERROR: `getindex` has no method matching getindex(::SetPartitions{UnitRange{Int64}}, ::Int64)
in anonymous at no file:1433
in anonymous at multi.jl:1279
in run_work_thunk at multi.jl:621
in run_work_thunk at multi.jl:630
in anonymous at task.jl:6
which I think is because I am trying to iterate over something that is not of the form 1:n (EDIT: I think it's because you cannot call p[3] if p=partitions(1:n)).
I've tried using pmap to solve this, but because the number of partitions can get really big, really quickly (there are more than 2.5 million partitions of 1:13, and when I get to 1:21 things will be huge), constructing such a large array becomes an issue. I left it running over night and it still didn't finish.
Does anyone have any advice for how I can efficiently do this in Julia? I have access to a ~30 core computer and my task seems easily parallelizable, so I would be really grateful if anyone knows a good way to do this in Julia.
Thank you so much!
The below code gives 511, the number of partitions of size 2 of a set of 10.
using Iterators

s = [1,2,3,4,5,6,7,8,9,10]
is_valid(p) = length(p)==2

valid_num = @parallel (+) for i = 1:30
    sum(map(is_valid, takenth(chain(1:29, drop(partitions(s), i-1)), 30)))
end
This solution combines the takenth, drop, and chain iterators to get the same effect as the take_every iterator below under PREVIOUS ANSWER. Note that in this solution, every process must compute every partition. However, because each process uses a different argument to drop, no two processes will ever call is_valid on the same partition.
Unless you want to do a lot of math to figure out how to actually skip partitions, there is no way to avoid computing partitions sequentially on at least one process. I think Simon's answer does this on one process and distributes the partitions. Mine asks each worker process to compute the partitions itself, which means the computation is being duplicated. However, it is being duplicated in parallel, which (if you actually have 30 processors) will not cost you time.
Here is a resource on how iterators over partitions are actually computed: http://www.informatik.uni-ulm.de/ni/Lehre/WS03/DMM/Software/partitions.pdf.
PREVIOUS ANSWER (More complicated than necessary)
I noticed Simon's answer while writing mine. Our solutions seem similar to me, except mine uses iterators to avoid storing partitions in memory. I'm not sure which would actually be faster for what size sets, but I figure it's good to have both options. Assuming it takes you significantly longer to compute is_valid than to compute the partitions themselves, you can do something like this:
s = [1,2,3,4]
is_valid(p) = length(p)==2

valid_num = @parallel (+) for i = 1:30
    foldl((x,y)->(x + int(is_valid(y))), 0, take_every(partitions(s), i-1, 30))
end
which gives me 7, the number of partitions of size 2 for a set of 4. The take_every function returns an iterator that returns every 30th partition starting with the ith. Here is the code for that:
import Base: start, done, next

immutable TakeEvery{Itr}
    itr::Itr
    start::Any
    value::Any
    flag::Bool
    skip::Int64
end

function take_every(itr, offset, skip)
    value, state = Nothing, start(itr)
    for i = 1:(offset+1)
        if done(itr, state)
            return TakeEvery(itr, state, value, false, skip)
        end
        value, state = next(itr, state)
    end
    if done(itr, state)
        TakeEvery(itr, state, value, true, skip)
    else
        TakeEvery(itr, state, value, false, skip)
    end
end

function start{Itr}(itr::TakeEvery{Itr})
    itr.value, itr.start, itr.flag
end

function next{Itr}(itr::TakeEvery{Itr}, state)
    value, state_, flag = state
    for i=1:itr.skip
        if done(itr.itr, state_)
            return state[1], (value, state_, false)
        end
        value, state_ = next(itr.itr, state_)
    end
    if done(itr.itr, state_)
        state[1], (value, state_, !flag)
    else
        state[1], (value, state_, false)
    end
end

function done{Itr}(itr::TakeEvery{Itr}, state)
    done(itr.itr, state[2]) && !state[3]
end
One approach would be to divide the problem up into pieces that are not too big to realize and then process the items within each piece in parallel, e.g. as follows:
function my_take(iter, state, n)
    i = n
    arr = Array[]
    while !done(iter,state) && (i>0)
        a,state = next(iter,state)
        push!(arr,a)
        i = i-1
    end
    return arr, state
end

function get_part(npart, npar)
    valid_num = 0
    p = partitions(1:npart)
    s = start(p)
    while !done(p,s)
        arr,s = my_take(p,s,npar)
        valid_num += @parallel (+) for a in arr
            length(a)
        end
    end
    return valid_num
end

valid_num = @time get_part(10,30)
I was going to use the take() method to realize up to npar items from the iterator, but take() appears to be deprecated, so I've included my own implementation, which I've called my_take(). The get_part() function therefore uses my_take() to obtain up to npar partitions at a time and carry out a calculation on them. In this case, the calculation just adds up their lengths, because I don't have the code for the OP's is_valid() function. get_part() then returns the result.
Because the length() calculation isn't very time-consuming, this code is actually slower when run on parallel processors than it is on a single processor:
$ julia -p 1 parpart.jl
elapsed time: 10.708567515 seconds (373025568 bytes allocated, 6.79% gc time)
$ julia -p 2 parpart.jl
elapsed time: 15.70633439 seconds (548394872 bytes allocated, 9.14% gc time)
Alternatively, pmap() could be used on each piece of the problem instead of the parallel for loop.
With respect to the memory issue, realizing 30 items from partitions(1:10) took nearly 1 gigabyte of memory on my PC when I ran Julia with 4 worker processes so I expect realizing even a small subset of partitions(1:21) will require a great deal of memory. It may be desirable to estimate how much memory would be needed to see if it would be at all possible before trying such a computation.
With respect to the computation time, note that:
julia> length(partitions(1:10))
115975
julia> length(partitions(1:21))
474869816156751
... so even efficient parallel processing on 30 cores might not be enough to make the larger problem solvable in a reasonable time.

Operating in parallel on a large constant datastructure in Julia

I have a large vector of vectors of strings:
There are around 50,000 vectors of strings,
each of which contains 2-15 strings of length 1-20 characters.
MyScoringOperation is a function which operates on a vector of strings (the datum) and returns an array of 10100 scores (as Float64s). It takes about 0.01 seconds to run MyScoringOperation (depending on the length of the datum)
function MyScoringOperation(state::State, datum::Vector{String})
    ...
    score::Vector{Float64} #Size of score = 10000
I have what amounts to a nested loop.
The outer loop typically runs for 500 iterations:
data::Vector{Vector{String}} = loaddata()
for ii in 1:500
    score_total = zeros(10100)
    for datum in data
        score_total += MyScoringOperation(datum)
    end
end
On one computer, on a small test case of 3000 (rather than 50,000) this takes 100-300 seconds per outer loop.
I have 3 powerful servers with Julia 3.9 installed (and can get 3 more easily, and then can get hundreds more at the next scale).
I have basic experience with @parallel; however, it seems like it is spending a lot of time copying the constant (it more or less hangs on the smaller testing case).
That looks like:
data::Vector{Vector{String}} = loaddata()
state = init_state()
for ii in 1:500
    score_total = @parallel(+) for datum in data
        MyScoringOperation(state, datum)
    end
    state = update(state, score_total)
end
My understanding of the way this implementation works with @parallel is that it:
For each ii:
partitions data into a chunk for each worker
sends that chunk to each worker
all workers process their chunks
the main procedure sums the results as they arrive.
I would like to remove step 2, so that instead of sending a chunk of data to each worker,
I just send a range of indexes to each worker and they look it up from their own copy of data. Or, even better, give each worker only its own chunk once and have it reuse that chunk each time (saving a lot of RAM).
Profiling backs up my belief about the functioning of @parallel.
For a similarly scoped problem (with even smaller data),
the non-parallel version runs in 0.09 seconds,
while the parallel version runs in 185 seconds.
The profiler shows almost 100% of that time is spent interacting with network IO.
This should get you started:
function get_chunks(data::Vector, nchunks::Int)
    base_len, remainder = divrem(length(data), nchunks)
    chunk_len = fill(base_len, nchunks)
    chunk_len[1:remainder] += 1 # remainder will always be less than nchunks
    function _it()
        for ii in 1:nchunks
            chunk_start = sum(chunk_len[1:ii-1]) + 1
            chunk_end = chunk_start + chunk_len[ii] - 1
            chunk = data[chunk_start:chunk_end]
            produce(chunk)
        end
    end
    Task(_it)
end

function r_chunk_data(data::Vector)
    all_chunks = get_chunks(data, nworkers()) |> collect;
    remote_chunks = [put!(RemoteRef(pid)::RemoteRef, all_chunks[ii]) for (ii, pid) in enumerate(workers())]
    # Have to add the type annotation as otherwise it thinks that RemoteRef(pid) might return a RemoteValue
end

function fetch_reduce(red_acc::Function, rem_results::Vector{RemoteRef})
    total = nothing
    # TODO: consider strongly wrapping total in a lock, when in 0.4, so that it is guaranteed safe
    @sync for rr in rem_results
        function gather(rr)
            res = fetch(rr)
            if total === nothing
                total = res
            else
                total = red_acc(total, res)
            end
        end
        @async gather(rr)
    end
    total
end

function prechunked_mapreduce(r_chunks::Vector{RemoteRef}, map_fun::Function, red_acc::Function)
    rem_results = map(r_chunks) do rchunk
        function do_mapred()
            @assert rchunk.where == myid()
            @pipe rchunk |> fetch |> map(map_fun, _) |> reduce(red_acc, _)
        end
        remotecall(rchunk.where, do_mapred)
    end
    @pipe rem_results |> convert(Vector{RemoteRef}, _) |> fetch_reduce(red_acc, _)
end
r_chunk_data breaks the data into chunks (defined by the get_chunks method) and sends each chunk to a different worker, where it is stored in a RemoteRef.
The RemoteRefs are references to memory on your other processes (and potentially other computers).
prechunked_mapreduce does a variation on map-reduce: each worker first runs map_fun on each element of its chunk, then reduces over all the elements in its chunk using red_acc (a reduction accumulator function). Finally, each worker returns its result, and these are combined by reducing them all together with red_acc, this time via fetch_reduce so that we can add the first ones completed first.
fetch_reduce is a nonblocking fetch and reduce operation. I believe it has no race conditions, though this may be because of an implementation detail in @async and @sync. When Julia 0.4 comes out, it would be easy enough to put a lock in to make it obviously free of race conditions.
This code isn't really battle hardened.
You also might want to look at making the chunk size tunable, so that you can send more data to faster workers (if some have better network or faster CPUs).
You need to reexpress your code as a map-reduce problem, which doesn't look too hard.
Testing that with:
data = [float([eye(100),eye(100)])[:] for _ in 1:3000] #480Mb
chunk_data(:data, data)
@time prechunked_mapreduce(:data, mean, (+))
Took ~0.03 seconds, when distributed across 8 workers (none of them on the same machine as the launcher)
vs running just locally:
@time reduce(+,map(mean,data))
took ~0.06 seconds.

Unexpected access performance differences between arrays and hashes

I have evaluated access times for a two-dimensional array, implemented as
an array of arrays
a hash of arrays
a hash with arrays as keys
My expectation was to see similar access times for all 3. I also expected the measurements to yield similar results for MRI and JRuby.
However, for a reason that I do not understand, on MRI accessing elements within an array of arrays or within a hash of arrays is an order of magnitude faster than accessing elements of the flat hash (the hash with arrays as keys).
On JRuby, instead of being 10 times as expensive, hash access was about 50 times as expensive as with an array of arrays.
The results:
MRI (2.1.1):
user system total real
hash of arrays: 1.300000 0.000000 1.300000 ( 1.302235)
array of arrays: 0.890000 0.000000 0.890000 ( 0.896942)
flat hash: 16.830000 0.000000 16.830000 ( 16.861716)
JRuby (1.7.5):
user system total real
hash of arrays: 0.280000 0.000000 0.280000 ( 0.265000)
array of arrays: 0.250000 0.000000 0.250000 ( 0.182000)
flat hash: 77.450000 0.240000 77.690000 ( 75.156000)
Here are two of my benchmarks:
ary = (0...n).map { Array.new(n, 1) }

bm.report('array of arrays:') do
  iterations.times do
    (0...n).each { |x|
      (0...n).each { |y|
        v = ary[x][y]
      }
    }
  end
end
.
hash = {}
(0...n).each { |x|
  (0...n).each { |y|
    hash[[x, y]] = 1
  }
}

prepared_indices = (0...n).each_with_object([]) { |x, ind|
  (0...n).each { |y|
    ind << [x, y]
  }
}

bm.report('flat hash:') do
  iterations.times do
    prepared_indices.each { |i|
      v = hash[i]
    }
  end
end
All container elements are initialized with a numeric value and have the same total number of elements.
The arrays for accessing the hash are preinitialized in order to benchmark the element access only.
Here is the complete code
I have consulted this thread and this article but still have no clue about the unexpected performance differences.
Why are the results so different from my expectations? What am I missing?
Consider the memory layout of an array of arrays, say with dimensions 3x3... you've got something like this:
memory address usage/content
base [0][0]
base+sizeof(int) [0][1]
base+2*sizeof(int) [0][2]
base+3*sizeof(int) [1][0]
base+4*sizeof(int) [1][1]
...
Given an array of dimensions [M][N], all that's needed to access an element at indices [i][j] is to add the base memory address to the data element size times (i * N + j)... a tiny bit of simple arithmetic, and therefore extremely fast.
Hashes are far more complicated data structures and inherently slower. With a hash, you first need to take time to hash the key. The harder the hash function works to make sure different keys will - statistically - scatter pretty randomly throughout the hash output range even when the keys are similar, the slower the hashing tends to be; if the hash function doesn't make that effort, you'll have more collisions in the hash table and slower performance there. Then the hash value needs to be mapped onto the current hash table size (usually using "%"), and then you need to compare keys to see whether you've found the hoped-for key, a colliding element, or an empty element. It's a far more involved process than array indexing. You should probably do some background reading about hash functions and hash table implementations.
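To make those steps concrete, here is a toy Ruby sketch (this is not how MRI actually implements Hash; TABLE_SIZE, bucket_index and toy_lookup are made-up names purely for illustration):

# Toy illustration only -- not MRI's real Hash implementation.
TABLE_SIZE = 16
buckets = Array.new(TABLE_SIZE) { [] }

# Steps 1 and 2: hash the key, then map the hash value onto the table size.
def bucket_index(key, table_size)
  key.hash % table_size
end

# Insert one entry, keyed by the coordinate pair [2, 3].
key, value = [2, 3], 1
buckets[bucket_index(key, TABLE_SIZE)] << [key, value]

# Step 3: scan the bucket, comparing keys until a match (or nothing) is found.
def toy_lookup(buckets, key)
  bucket = buckets[bucket_index(key, buckets.size)]
  pair = bucket.find { |k, _v| k == key }
  pair && pair.last
end

p toy_lookup(buckets, [2, 3]) # => 1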
The reason hashes are so often useful is that the key doesn't need to be numeric (you can always work out some formula to generate a number from arbitrary key data) and need not be near-contiguous for memory efficiency (i.e. a hash table with say memory capacity for 5 integers could happily store keys 1, 1000 and 12398239 - whereas for an array keyed on those values there would be a lot of virtual address space wasted for all the indices in between, which have no data anyway, and anyway more data packed into a memory page means more cache hits).
Further - you should be careful with benchmarks - when you do clearly repetitive work with unchanging values overwriting the same variable, an optimiser may avoid it and you may not be timing what you think you are. It's good to use some run-time inputs (e.g. storing different values in the containers) and accumulate some dependent result (e.g. summing the element accesses rather than overwriting it), then outputting the result so any lazy evaluation is forced to conclude. With things like JITs and VMs involved there can also be kinks in your benchmarks as compilation kicks in or branch prediction results are incorporated.
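For example, here is a hedged sketch of how the 'flat hash' benchmark above could be adjusted so the reads can't be skipped; it reuses the bm, hash, iterations and prepared_indices variables from the benchmark code, and the report label is made up:

bm.report('flat hash (summed):') do
  total = 0
  iterations.times do
    # Accumulate a value that depends on every access, then print it,
    # so the element reads cannot be optimised away.
    prepared_indices.each { |i| total += hash[i] }
  end
  puts total
end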

Can Ruby handle large hash objects

I want to compare two hashes. Each can have more than 20,000 objects.
I have the following questions:
Can Ruby handle such a large amount of objects?
Will comparing these two hashes take a lot of time?
Can indexing be applied to reduce enumerations?
The hashes themselves are fast and not bound to low limits. E.g. this does not even take a millisecond here (Ruby 1.9.2 on Windows):
irb(main):008:0> hash1 = (0...20000).inject({}) { | r, i | r[rand(100)*100000 + i] = rand; r } ; 23
=> 23
irb(main):009:0> hash2 = (0...20000).inject({}) { | r, i | r[rand(100)*100000 + i] = rand; r } ; 23
=> 23
irb(main):010:0> hash3 = hash1.dup ; 23
=> 23
irb(main):011:0> hash1 == hash2
=> false
irb(main):012:0> hash1 == hash3
=> true
Everything else depends on what you are stuffing into the hashes.
Rails is a framework and has little to do with object comparison. Ruby is certainly capable of comparing 20,000 objects, assuming they fit well into memory or you are comparing them in a batch process that limits how many are instantiated at any time.
If you are talking about comparing 20,000 ActiveRecord objects in-memory you will probably run out of memory, and may experience pretty slow results even if you don't. ActiveRecord is pretty heavy-weight and not the best tool for working with large numbers of objects. However, I don't know what these 20,000 objects are or exactly how you are comparing them, so maybe they don't have to all be in-memory simultaneously and a batch process could complete this in a time frame you find acceptable.
If these are simple objects in a simple Ruby hash you can certainly iterate through them pretty quickly (though what counts as fast is completely dependent on what this is for). If your comparison logic is pretty simple, it shouldn't be too time consuming, assuming each object in your first hash is compared to a single corresponding object in the second hash. If every object in hash 1 is compared to each of the 20,000 in hash 2, then your total number of comparisons (20,000 * 20,000) is much larger, and this might not be as fast as you need.
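A minimal sketch of the pairwise case (diff_hashes is a hypothetical helper; it assumes corresponding objects live under the same key, so each entry is compared exactly once):

def diff_hashes(hash_a, hash_b)
  # One comparison per key: ~20,000 comparisons for 20,000 entries,
  # rather than 20,000 * 20,000 for an all-pairs comparison.
  (hash_a.keys | hash_b.keys).reject { |k| hash_a[k] == hash_b[k] }
end

a = { 1 => "one", 2 => "two", 3 => "three" }
b = { 1 => "one", 2 => "TWO", 4 => "four" }
p diff_hashes(a, b) # => [2, 3, 4] (keys whose values differ or are missing)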

Scala: Mutable vs. Immutable Object Performance - OutOfMemoryError

I wanted to compare the performance characteristics of immutable.Map and mutable.Map in Scala for a similar operation (namely, merging many maps into a single one. See this question). I have what appear to be similar implementations for both mutable and immutable maps (see below).
As a test, I generated a List containing 1,000,000 single-item Map[Int, Int] and passed this list into the functions I was testing. With sufficient memory, the results were unsurprising: ~1200ms for mutable.Map, ~1800ms for immutable.Map, and ~750ms for an imperative implementation using mutable.Map -- not sure what accounts for the huge difference there, but feel free to comment on that, too.
What did surprise me a bit, perhaps because I'm being a bit thick, is that with the default run configuration in IntelliJ 8.1, both mutable implementations hit an OutOfMemoryError, but the immutable collection did not. The immutable test did run to completion, but it did so very slowly -- it takes about 28 seconds. When I increased the max JVM memory (to about 200MB, not sure where the threshold is), I got the results above.
Anyway, here's what I really want to know:
Why do the mutable implementations run out of memory, but the immutable implementation does not? I suspect that the immutable version allows the garbage collector to run and free up memory before the mutable implementations do -- and all of those garbage collections explain the slowness of the immutable low-memory run -- but I'd like a more detailed explanation than that.
Implementations below. (Note: I don't claim that these are the best implementations possible. Feel free to suggest improvements.)
def mergeMaps[A,B](func: (B,B) => B)(listOfMaps: List[Map[A,B]]): Map[A,B] =
  (Map[A,B]() /: (for (m <- listOfMaps; kv <- m) yield kv)) { (acc, kv) =>
    acc + (if (acc.contains(kv._1)) kv._1 -> func(acc(kv._1), kv._2) else kv)
  }

def mergeMutableMaps[A,B](func: (B,B) => B)(listOfMaps: List[mutable.Map[A,B]]): mutable.Map[A,B] =
  (mutable.Map[A,B]() /: (for (m <- listOfMaps; kv <- m) yield kv)) { (acc, kv) =>
    acc + (if (acc.contains(kv._1)) kv._1 -> func(acc(kv._1), kv._2) else kv)
  }

def mergeMutableImperative[A,B](func: (B,B) => B)(listOfMaps: List[mutable.Map[A,B]]): mutable.Map[A,B] = {
  val toReturn = mutable.Map[A,B]()
  for (m <- listOfMaps; kv <- m) {
    if (toReturn contains kv._1) {
      toReturn(kv._1) = func(toReturn(kv._1), kv._2)
    } else {
      toReturn(kv._1) = kv._2
    }
  }
  toReturn
}
Well, it really depends on what actual type of Map you are using. Probably HashMap. Now, mutable structures like that gain performance by pre-allocating memory they expect to use. You are joining one million maps, so the final map is bound to be somewhat big. Let's see how these key/values get added:
protected def addEntry(e: Entry) {
  val h = index(elemHashCode(e.key))
  e.next = table(h).asInstanceOf[Entry]
  table(h) = e
  tableSize = tableSize + 1
  if (tableSize > threshold)
    resize(2 * table.length)
}
See the 2 * in the resize line? The mutable HashMap grows by doubling each time it runs out of space, while the immutable one is pretty conservative in memory usage (though existing keys will usually occupy twice the space when updated).
Now, as for other performance problems, you are creating a list of keys and values in the first two versions. That means that, before you join any maps, you already have each Tuple2 (the key/value pairs) in memory twice! Plus the overhead of List, which is small, but we are talking about more than one million elements times the overhead.
You may want to use a projection, which avoids that. Unfortunately, projection is based on Stream, which isn't very reliable for our purposes on Scala 2.7.x. Still, try this instead:
for (m <- listOfMaps.projection; kv <- m) yield kv
A Stream doesn't compute a value until it is needed. The garbage collector ought to collect the unused elements as well, as long as you don't keep a reference to the Stream's head, which seems to be the case in your algorithm.
EDIT
Complementing, a for/yield comprehension takes one or more collections and return a new collection. As often as it makes sense, the returning collection is of the same type as the original collection. So, for example, in the following code, the for-comprehension creates a new list, which is then stored inside l2. It is not val l2 = which creates the new list, but the for-comprehension.
val l = List(1,2,3)
val l2 = for (e <- l) yield e*2
Now, let's look at the code being used in the first two algorithms (minus the mutable keyword):
(Map[A,B]() /: (for (m <- listOfMaps; kv <-m) yield kv))
The foldLeft operator, here written with its /: synonym, will be invoked on the object returned by the for-comprehension. Remember that a : at the end of an operator inverts the order of the object and the parameters.
Now, let's consider what object is this, on which foldLeft is being called. The first generator in this for-comprehension is m <- listOfMaps. We know that listOfMaps is a collection of type List[X], where X isn't really relevant here. The result of a for-comprehension on a List is always another List. The other generators aren't relevant.
So, you take this List, get all the key/values inside each Map which is a component of this List, and make a new List with all of that. That's why you are duplicating everything you have.
(in fact, it's even worse than that, because each generator creates a new collection; the collections created by the second generator are just the size of each element of listOfMaps though, and are immediately discarded after use)
The next question -- actually, the first one, but it was easier to invert the answer -- is how the use of projection helps.
When you call projection on a List, it returns a new object, of type Stream (on Scala 2.7.x). At first you may think this will only make things worse, because you'll now have three copies of the List, instead of a single one. But a Stream is not pre-computed. It is lazily computed.
What that means is that the resulting object, the Stream, isn't a copy of the List, but, rather, a function that can be used to compute the Stream when required. Once computed, the result will be kept so that it doesn't need to be computed again.
Also, map, flatMap and filter of a Stream all return a new Stream, which means you can chain them all together without making a single copy of the List which created them. Since for-comprehensions with yield use these very functions, the use of Stream inside the for-comprehension prevents unnecessary copies of the data.
Now, suppose you wrote something like this:
val kvs = for (m <- listOfMaps.projection; kv <-m) yield kv
(Map[A,B]() /: kvs) { ... }
In this case you aren't gaining anything. After assigning the Stream to kvs, the data hasn't been copied yet. Once the second line is executed, though, kvs will have computed each of its elements, and, therefore, will hold a complete copy of the data.
Now consider the original form:
(Map[A,B]() /: (for (m <- listOfMaps.projection; kv <-m) yield kv))
In this case, the Stream is used at the same time it is computed. Let's briefly look at how foldLeft for a Stream is defined:
override final def foldLeft[B](z: B)(f: (B, A) => B): B = {
  if (isEmpty) z
  else tail.foldLeft(f(z, head))(f)
}
If the Stream is empty, just return the accumulator. Otherwise, compute a new accumulator (f(z, head)) and then pass it and the function to the tail of the Stream.
Once f(z, head) has executed, though, there will be no remaining reference to the head. Or, in other words, nothing anywhere in the program will be pointing to the head of the Stream, and that means the garbage collector can collect it, thus freeing memory.
The end result is that each element produced by the for-comprehension will exist only briefly, while you use it to compute the accumulator. And this is how you avoid keeping a copy of your whole data.
Finally, there is the question of why the third algorithm does not benefit from it. Well, the third algorithm does not use yield, so no copy of any data whatsoever is being made. In this case, using projection only adds an indirection layer.
