For a study project I have written a Scala application that uses a bunch of futures to do a parallel computation. I noticed that on my local machine (4 cores) the code runs faster than on the many-core server of our computer science institute (64 cores). Now I want to know why this is.
Task in Detail
The task was to create random boolean k-CNF formulas with n different variables randomly distributed over m clauses and then see at which m/n combination the probability that a formula is solvable drops below 50% for different random distributions. For this I have implemented a probabilistic k-SAT algorithm, a clause generator and some other code. The core is a function that takes n and m as well as the generator function, runs 100 futures and waits for the results. The function looks like this:
Code in question
def avgNonvalidClauses(n: Int, m: Int)(implicit clauseGenerator: ClauseGenerator) = {
val startTime = System.nanoTime
/** how many iterations to use for the average **/
val TRIES = 100
// do TRIES iterations in parallel
val tasks = for (i <- 0 until TRIES) yield future[Option[Config]] {
val clause = clauseGenerator(m, n)
val solution = CNFSolver.probKSat(clause)
solution
}
/* wait for all threads to finish and collect the results. we will only wait
* at most TRIES * 100ms (note: flatten filters out all
* None's) */
val results = awaitAll(100 * TRIES, tasks: _*).asInstanceOf[List[Option[Option[Config]]]].flatten
val millis = Duration(System.nanoTime - startTime, NANOSECONDS).toMillis
val avg = (results count (_.isDefined)) / results.length.toFloat
println(s"n=$n, m=$m => $avg ($millis ms)")
avg
}
Problem
On my local machine I get these results
[info] Running Main
n=20, m=120 => 0.0 (8885 ms)
n=21, m=121 => 0.0 (9115 ms)
n=22, m=122 => 0.0 (8724 ms)
n=23, m=123 => 0.0 (8433 ms)
n=24, m=124 => 0.0 (8544 ms)
n=25, m=125 => 0.0 (8858 ms)
[success] Total time: 53 s, completed Jan 9, 2013 8:21:30 PM
On the 64-core server I get:
[info] Running Main
n=20, m=120 => 0.0 (43200 ms)
n=21, m=121 => 0.0 (38826 ms)
n=22, m=122 => 0.0 (38728 ms)
n=23, m=123 => 0.0 (32737 ms)
n=24, m=124 => 0.0 (41196 ms)
n=25, m=125 => 0.0 (42323 ms)
[success] Total time: 245 s, completed 09.01.2013 20:28:22
However, I see full load on both machines (the server averages a load of around 60 to 65), so enough threads are running. Why is this? Am I doing something completely wrong?
My local machine has an "AMD Phenom(tm) II X4 955 Processor" CPU; the server uses "AMD Opteron(TM) Processor 6272". The local CPU has 6800 bogomips, the server's 4200. So, while the local CPU is faster per core (by roughly 60% going by bogomips), the server has 16 times as many cores.
Additional
I have pushed a trimmed-down example of my code to GitHub so you can try it for yourselves if you are interested: https://github.com/Blattlaus/algodemo (It's an sbt project using Scala 2.10).
Updates
I've eliminated any randomness by seeding the random number generators with 42. This changes nothing.
I've changed the test set. Now the results are even more astonishing (the server is 5 times slower!). Note: all outputs for the average percentage of unsolvable clauses are zero because of the input. This is normal and expected.
Added info about CPUs
I've noticed that calls to Random.nextInt() are a factor of 10 slower on the server. I have wrapped all calls in a helper that measures the runtime and prints it to the console if a call takes longer than 10ms. On my local machine I get a few, and they are typically around 10-20ms. On the server I get many more such calls, and they tend to be above 100ms. Could this be the issue?
You have already figured out the answer: the problem is Random.nextInt(), which uses an AtomicLong. If this is being accessed frequently from different threads then you will get cache thrashing, which will be worse on your 64-core computer because the caches are further apart (electrically), so it takes longer to acquire the necessary cache line locks.
See this stackoverflow answer for more details, and the solution on how to avoid this problem (which is basically to use a thread local random number generator): Contention in concurrent use of java.util.Random
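To make the "thread local random number generator" idea concrete, here is a minimal sketch of the pattern in Python. On the JVM the analogous tool is `java.util.concurrent.ThreadLocalRandom`; the helper name `thread_rng` and the seeding scheme below are assumptions made up for illustration and are not taken from the linked answer. It only shows the pattern (one generator per thread, no shared state), it is not a benchmark.

```python
import random
import threading

# One storage slot per thread; nothing in here is shared between threads.
_local = threading.local()

def thread_rng(seed_base=42):
    """Return a random.Random instance private to the calling thread."""
    rng = getattr(_local, "rng", None)
    if rng is None:
        # Hypothetical seeding scheme: offset a base seed by the thread id.
        rng = random.Random(seed_base + threading.get_ident())
        _local.rng = rng
    return rng

def worker(results, slot):
    # Each worker draws from its own generator, so there is no shared seed
    # for the threads to fight over.
    results[slot] = thread_rng().randrange(1000)

results = [None] * 4
threads = [threading.Thread(target=worker, args=(results, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

The design choice is simply to trade a little memory (one generator per thread) for the absence of contention on a single shared seed.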
Operations on denormalized floating-point numbers can take an order of magnitude longer on x86 architecture. See:
Why does changing 0.1f to 0 slow down performance by 10x?
Haven't examined your code, but given that you return NaN that might be the case. Try removing randomness from your test to verify that hypothesis.
Related
This problem originally came up on mathematica.SE, but since multiple programming languages have become involved in the discussion, I think it's better to rephrase it a bit and post it here.
In short, michalkvasnicka found that in the following MATLAB sample
s = 15000;
tic
% for-loop version
H = zeros(s,s);
for c = 1:s
for r = 1:s
H(r,c) = 1/(r+c-1);
end
end
toc
%Elapsed time is 1.359625 seconds.... For-loop
tic;
% vectorized version
c = 1:s;
r = c';
HH=1./(r+c-1);
toc
%Elapsed time is 0.047916 seconds.... Vectorized
isequal(H,HH)
the vectorized code is more than 25 times faster than the pure for-loop code. Though I don't have access to MATLAB and so cannot test the sample myself, the timing of 1.359625 seconds suggests it was tested on an average PC, just like mine.
But I cannot reproduce the timing with other languages like fortran or julia! (We know, both of them are famous for their performance of numeric calculation. Well, I admit I'm by no means an expert of fortran or julia. )
The following are the samples I used for testing. I'm using a laptop with an i7-8565U CPU and Win 10.
fortran
fortran code is compiled with gfortran (TDM-GCC-10.3.0-2, with compile option -Ofast).
program tst
use, intrinsic :: iso_fortran_env
implicit none
integer,parameter::s=15000
integer::r,c
real(real64)::hmn(s,s)
do r=1,s
do c=1, s
hmn(r,c)=1._real64/(r + c - 1)
end do
end do
print *, hmn(s,s)
end program
compilation timing: 0.2057823 seconds
execution timing: 0.7179657 seconds
julia
Version of julia is 1.6.3.
@time (s=15000; Hmm=[1. /(r+c-1) for r=1:s,c=1:s];)
Timing: 0.7945998 seconds
Here comes the question:
Is the timing of MATLAB reliable?
If the answer to 1st question is yes, then how can we reproduce the performance (for 2 GHz CPU, the timing should be around 0.05 seconds) with julia, fortran, or any other programming languages?
Just to add on the Julia side: make sure you use BenchmarkTools to benchmark, wrap the code you want to benchmark in functions so as not to benchmark in global scope, and interpolate any variables you pass to @btime.
Here's how I would do it:
julia> s = 15_000;
julia> function f_loop!(H)
for c ∈ 1:size(H, 1)
for r ∈ 1:size(H, 1)
H[r, c] = 1 / (r + c - 1)
end
end
end
f_loop! (generic function with 1 method)
julia> function f_vec!(H)
c = 1:size(H, 1)
r = c'
H .= 1 ./ (r .+ c .- 1)
end
f_vec! (generic function with 1 method)
julia> H = zeros(s, s);
julia> using BenchmarkTools
julia> @btime f_loop!($H);
625.891 ms (0 allocations: 0 bytes)
julia> H = zeros(s, s);
julia> @btime f_vec!($H);
625.248 ms (0 allocations: 0 bytes)
So both versions come in at the same time, which is what I'd expect for such a straightforward operation where a properly type-inferred code should compile down to roughly the same machine code.
tic/toc should be fine, but it looks like the timing is being skewed by memory pre-allocation.
I can reproduce similar timings to your MATLAB example, however
On first run (clear workspace)
Loop approach takes 2.08 sec
Vectorised approach takes 1.04 sec
Vectorisation saves 50% execution time
On second run (workspace not cleared)
Loop approach takes 2.55 sec
Vectorised approach takes 0.065 sec
Vectorisation "saves" 97.5% execution time
My guess would be that since the loop approach explicitly creates a new matrix via zeros, the memory is reallocated from scratch on every run and you don't see the speed improvement on subsequent runs.
However, when HH remains in memory and the HH=___ line outputs a matrix of the same size, I suspect MATLAB is doing some clever memory allocation to speed up the operation.
We can prove this theory with the following test:
Test Num | Workspace cleared | s | Loop (sec) | Vectorised (sec)
1 | Yes | 15000 | 2.10 | 1.41
2 | No | 15000 | 2.73 | 0.07
3 | No | 15000 | 2.50 | 0.07
4 | No | 15001 | 2.74 | 1.73
Note the variation between tests 2 and 3; this is why timeit would have been helpful for an average runtime (see footnote). The difference in output sizes between tests 3 and 4 is pretty small, but the execution time for the vectorised approach returns to a similar magnitude to that in test 1, suggesting that the re-allocation to create HH costs most of the time.
Footnote: tic/toc timings in MATLAB can be improved by using the in-built timeit function, which essentially takes an average over several runs. One interesting thing to observe from the workings of timeit though is that it explicitly "warms up" (quoting a comment) the tic/toc function by calling it a couple of times. You can see when running tic/toc a few times from a clear workspace (with no intermediate code) that the first call takes longer than subsequent calls, as there must be some overhead for getting the timer initialised.
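The warm-up-then-repeat idea from the footnote can be sketched outside MATLAB as well. Below is a minimal Python version; the helper name `time_median`, the warm-up count, the number of runs and the choice of the median as the summary value are all assumptions for illustration, not a description of how MATLAB's timeit is implemented.

```python
import statistics
import time

def time_median(fn, warmup=2, runs=5):
    """Call fn a few times untimed, then return the median of `runs` timed calls."""
    for _ in range(warmup):
        fn()  # warm-up calls: their timings are discarded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def build_table(s=200):
    # Small pure-Python stand-in for the 1/(r+c-1) matrix from the question.
    return [[1.0 / (r + c - 1) for c in range(1, s + 1)] for r in range(1, s + 1)]

print(f"median: {time_median(build_table):.6f} s")
```

A single tic/toc pair corresponds to `warmup=0, runs=1`, which is exactly the case where a first-call overhead or a one-off allocation can dominate the reading.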
I hope that the following modified benchmark could bring some new light to the problem:
s = 15000;
tic
% for-loop version
H = zeros(s,s);
for i =1:10
for c = 1:s
for r = 1:s
H(r,c) = H(r,c) + 1/(r+c-1+i);
end
end
end
toc
tic;
% vectorized version
HH = zeros(s,s);
c = 1:s;
r = c';
for i=1:10
HH= HH + 1./(r+c-1+i);
end
toc
isequal(H,HH)
In this case any kind of caching is avoided by changing the matrix H (HH) at each iteration of the outer for-loop (over "i").
In this case we get:
Elapsed time is 3.737275 seconds. (for-loop)
Elapsed time is 1.143387 seconds. (vectorized)
So, there is still a performance improvement (~ 3x) due to the vectorization, which is probably due to the implicit multi-threaded implementation of vectorized MATLAB commands.
Yes, tic/toc vs timeit is not strictly consistent, but the overall timing functionality is very similar.
To add to this, here is a simple python script which does the vectorized operation with numpy:
from timeit import default_timer
import numpy as np
s = 15000
start = default_timer()
# for-loop
H = np.zeros([s, s])
# 0-based indices: entry (r, c) holds 1 / ((r + 1) + (c + 1) - 1)
for c in range(s):
    for r in range(s):
        H[r, c] = 1 / (r + c + 1)
end = default_timer()
print(end - start)
start = default_timer()
# vectorized
c = np.arange(1, s + 1).reshape([1, -1])
r = c.T
HH = 1 / (c + r - 1)
end = default_timer()
print(end - start)
for-loop: 32.94566780002788 seconds
vectorized: 0.494859800033737 seconds
While the for-loop version is terribly slow, the vectorized version is faster than the posted fortran/julia times. Numpy internally tries to use special SIMD hardware instructions to speed up arithmetic on vectors, which can make a significant difference. It's possible that the fortran/julia compilers weren't able to generate those instructions from the provided code, but numpy/matlab were able to. However, Matlab is still about 10x faster than the numpy code, which I don't think would be explained by better use of SIMD instructions. Instead, they may also be using multiple threads to parallelize the computation, since the matrix is fairly large.
Ultimately, I think the matlab numbers are plausible, but I'm not sure exactly how they're getting their speedup.
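The MATLAB sample ends with isequal(H,HH); the Python script has no such check. As a small sketch (the function names and the reduced size are my own choices, so that the loop finishes quickly), the two approaches can be compared like this:

```python
import numpy as np

def hilbert_loop(s):
    """Fill the s x s matrix element by element (0-based indices)."""
    H = np.zeros((s, s))
    for c in range(s):
        for r in range(s):
            H[r, c] = 1.0 / (r + c + 1)
    return H

def hilbert_vectorized(s):
    """Build the same matrix via broadcasting a column against a row."""
    c = np.arange(1, s + 1).reshape(1, -1)  # shape (1, s)
    r = c.T                                  # shape (s, 1)
    return 1.0 / (r + c - 1)                 # broadcasts to (s, s)

small = 200  # assumption: a small size just for the comparison
print(np.allclose(hilbert_loop(small), hilbert_vectorized(small)))
```

Only once both versions agree is a timing comparison between them meaningful.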
I want to perform N=1000 bootstrap resamplings with replacement on gridded data. One computation takes about 0.5s. I have exclusive access to a supercomputer node with 48 cores. Because the resamplings are independent of each other, I naively hoped to distribute the workload over all or at least many cores and get a speedup of about .8 * ncores. But I don't get it.
I still lack a proper understanding of dask. Based on Best practices in setting number of dask workers, I use:
from dask.distributed import Client
client = Client(processes=False, threads_per_worker=8, n_workers=6, memory_limit='32GB')
I also tried with SLURMCluster, but I guess I first need to understand what I do and then scale.
My MWE:
create sample data
write function I want to apply
write resampling inits function
write bootstrapping function with bootstrap (=N) as argument: see many implementations below
perform bootstrapping
import dask
import numpy as np
import xarray as xr
from dask.distributed import Client
inits = np.arange(50)
lats = np.arange(96)
lons = np.arange(192)
data = np.random.rand(len(inits), len(lats), len(lons))
a = xr.DataArray(data,
coords=[inits, lats, lons],
dims=['init', 'lat', 'lon'])
data = np.random.rand(len(inits), len(lats), len(lons))
b = xr.DataArray(data,
coords=[inits, lats, lons],
dims=['init', 'lat', 'lon'])
def func(a,b, dim='init'):
    return (a-b).std(dim)
bootstrap=96
def resample(a):
    smp_init = np.random.choice(inits, len(inits))
    smp_a = a.sel(init=smp_init)
    smp_a['init'] = inits
    return smp_a
# serial function
def bootstrap_func(bootstrap=bootstrap):
    res = (func(resample(a),b) for _ in range(bootstrap))
    res = xr.concat(res,'bootstrap')
    # leave out quantile because not issue here yet
    #res_ci = res.quantile([.05,.95],'bootstrap')
    return res
@dask.delayed
def bootstrap_func_delayed_decorator(bootstrap=bootstrap):
    return bootstrap_func(bootstrap=bootstrap)
def bootstrap_func_delayed(bootstrap=bootstrap):
    res = (dask.delayed(func)(resample(a),b) for _ in range(bootstrap))
    res = xr.concat(dask.compute(*res),'bootstrap')
    #res_ci = res.quantile([.05,.95],'bootstrap')
    return res
for scheduler in ['synchronous','distributed','multiprocessing','processes','single-threaded','threads']:
    print('scheduler:',scheduler)
    def bootstrap_func_delayed_processes(bootstrap=bootstrap):
        res = (dask.delayed(func)(resample(a),b) for _ in range(bootstrap))
        res = xr.concat(dask.compute(*res, scheduler=scheduler),'bootstrap')
        res = res.quantile([.05,.95],'bootstrap')
        return res
    %time c = bootstrap_func_delayed_processes()
The following results are from my 4-core laptop. But on the supercomputer I also see no speedup, rather a slowdown of about 50%.
Results for serial:
%time c = bootstrap_func()
CPU times: user 814 ms, sys: 58.7 ms, total: 872 ms
Wall time: 862 ms
Results for parallel:
%time c = bootstrap_func_delayed_decorator().compute()
CPU times: user 96.2 ms, sys: 50 ms, total: 146 ms
Wall time: 906 ms
Results for parallelized from the loop:
scheduler: synchronous
CPU times: user 2.57 s, sys: 330 ms, total: 2.9 s
Wall time: 2.95 s
scheduler: distributed
CPU times: user 4.51 s, sys: 2.74 s, total: 7.25 s
Wall time: 8.86 s
scheduler: multiprocessing
CPU times: user 4.18 s, sys: 2.53 s, total: 6.71 s
Wall time: 7.95 s
scheduler: processes
CPU times: user 3.97 s, sys: 2.1 s, total: 6.07 s
Wall time: 7.39 s
scheduler: single-threaded
CPU times: user 2.26 s, sys: 275 ms, total: 2.54 s
Wall time: 2.47 s
scheduler: threads
CPU times: user 2.84 s, sys: 341 ms, total: 3.18 s
Wall time: 2.66 s
Expected results:
- speedup (by .8 * ncores)
Other considerations:
- I also checked whether I should chunk my data. too sample chunks. chunked arrays take longer.
My questions:
- What did I get wrong about dask parallelization?
- Is the client setup not useful that way?
- Did I not implement dask.delayed cleverly enough?
- Is my serial function already executed in parallel because of dask? I think not.
I finally solved this. When posting this challenge, I obviously didn't understand a few aspects of it:
I ran the timings on a laptop with two physical cores. This doesn't allow much parallelization in a CPU-bound problem. Now I ran this on a node with 48 logical CPUs.
I should have thought about which parts of the algorithm are easily parallelizable and which parts are not. Only then can I chunk accordingly.
See my solution here: https://gist.github.com/aaronspring/118abd7b9bf81e555b1fced42eef427f
The game-changers wrt. the code posted initially:
I chunk a dimension (here x) which is not involved in the func (which uses time)
I still use the client as mentioned above: Best practices in setting number of dask workers
I only try to parallelize the iteration part. The quantile method is done in memory.
Conclusion: It is simpler than expected. The gist shows an implementation with dask.delayed and dask.futures, but that's not even needed in my use case. First try to understand parallelism (https://realpython.com/python-concurrency/) and read the dask documentation (https://dask.org/).
Much faster solution with multidimensional indexing
https://xskillscore.readthedocs.io/en/latest/api/xskillscore.core.resampling.resample_iterations_idx.html#xskillscore.core.resampling.resample_iterations_idx
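To illustrate what "multidimensional indexing" buys here, below is a minimal NumPy-only sketch of the general idea: draw the indices for all bootstrap iterations up front and resample with one fancy-indexing operation instead of a Python loop. This is not the xskillscore implementation; the function name `bootstrap_std_diff`, the seed handling and the reduced array sizes are assumptions for illustration.

```python
import numpy as np

def bootstrap_std_diff(a, b, n_boot, seed=0):
    """Resample `a` along axis 0 with replacement, n_boot times at once,
    and return std(a_resampled - b) over that axis for every iteration."""
    rng = np.random.default_rng(seed)
    n_init = a.shape[0]
    # idx[i] holds the member indices drawn for bootstrap iteration i.
    idx = rng.integers(0, n_init, size=(n_boot, n_init))
    # Fancy indexing builds an array of shape (n_boot, n_init, lat, lon).
    resampled = a[idx]
    # b broadcasts against the leading bootstrap axis.
    return (resampled - b).std(axis=1)

# Smaller grid than in the question to keep memory use modest.
a = np.random.default_rng(1).random((50, 8, 16))
b = np.random.default_rng(2).random((50, 8, 16))
res = bootstrap_std_diff(a, b, n_boot=20)
print(res.shape)
```

The price is memory: the intermediate array is n_boot times the size of the input, so for large grids it may still need chunking along a dimension that the statistic does not reduce over.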
I ran a MapReduce job on a Hadoop cluster. The job's running times that I saw in the browser at master:8088 and master:19888 (job history server web UI) are shown below:
master:8088
master:19888
I have two questions:
Why are the elapsed times in the two pictures different?
Why is the Average Reduce Time sometimes a negative number?
It looks like the Average Reduce Time is based on the times the previous tasks (shuffle/merge) took to finish and not necessarily the amount of time the reduce actually took to run.
Looking at this source code you can see the relevant calculations occurring around line 300.
if (attempt.getState() == TaskAttemptState.SUCCEEDED) {
numReduces++;
avgShuffleTime += (attempt.getShuffleFinishTime() - attempt.getLaunchTime());
avgMergeTime += attempt.getSortFinishTime() - attempt.getShuffleFinishTime();
avgReduceTime += (attempt.getFinishTime() - attempt.getSortFinishTime());
}
Followed by:
if (numReduces > 0) {
avgReduceTime = avgReduceTime / numReduces;
avgShuffleTime = avgShuffleTime / numReduces;
avgMergeTime = avgMergeTime / numReduces;
}
Looking at your numbers, they seem to be generally in-line with this approach to calculating the run times (everything converted to seconds):
Total Pre-reduce time = Map Run Time + Ave Shuffle + Ave Merge
143 = 43 + 83 + 17
Ave Reduce Time = Elapsed Time - Total Pre-reduce
-10 = 133 - 143
So looking at how long the Map, Shuffle and Merge took compared with the Elapsed we end up with a negative number close to your -8.
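The arithmetic above can be written down as a tiny Python sketch. The function name is made up for illustration, and the formula is only the back-of-the-envelope reading given in this answer, not something taken from the Hadoop source:

```python
def implied_reduce_time(elapsed, map_time, avg_shuffle, avg_merge):
    """Return (total pre-reduce time, implied reduce time), all in seconds."""
    pre_reduce = map_time + avg_shuffle + avg_merge
    return pre_reduce, elapsed - pre_reduce

# Numbers quoted in the answer above (seconds).
pre_reduce, reduce_time = implied_reduce_time(
    elapsed=133, map_time=43, avg_shuffle=83, avg_merge=17
)
print(pre_reduce, reduce_time)
```

Whenever the shuffle and merge phases overlap with the map phase, the pre-reduce sum can exceed the elapsed time, and the difference goes negative.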
This is a partial answer, only for question 1!
I see a difference in "Submitted" and "Started" of 8 seconds in the second picture, while the time "Started" in the first picture is equal to the "Submitted" time of the second. I guess this covers the 8-second difference that you see as "Elapsed" time.
I am very curious about the second question as well, but it may not be a coincidence that it is also 8 seconds.
I'm trying to calculate the execution time of an application, assuming the only stall penalty occurs on memory access instructions (100 cycles being the penalty).
How am I supposed to find out execution time in seconds with this info?
CPI (CPUCycles?) = 1.0
ClockRate = 1GHZ
TotalInstructions = 59880
MemoryAccessInstructions = 8467
CacheMissRate = 62% (0.62) (5290/8467)
CacheHits = 3117
CacheMisses = 5290
CacheMissPenalty = 100 (cycles)
Assuming no other penalties.
totalCycles = TotalInstructions + CacheMisses * CacheMissPenalty ?
I assume that cache hits cost same as other opcodes, so those are included in TotalInstructions.
That's then 588880 cycles; 1GHz is 1000000000 cycles per second.
So that code will take 0.58888ms to execute (5.8888e-4 seconds).
This value is of course a purely theoretical estimate, as modern CPUs don't work like that (1 instruction = 1 cycle). If you are interested in real-world values, just profile it.
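For reference, the same estimate as a short Python sketch. The function and parameter names are made up for illustration; the model is the simple one above (every instruction costs CPI cycles, every cache miss adds a fixed penalty):

```python
def execution_time(total_instructions, cache_misses, miss_penalty_cycles,
                   clock_hz, cpi=1.0):
    """Return (total cycles, seconds) under the simple stall model."""
    total_cycles = total_instructions * cpi + cache_misses * miss_penalty_cycles
    return total_cycles, total_cycles / clock_hz

cycles, seconds = execution_time(
    total_instructions=59880,
    cache_misses=5290,
    miss_penalty_cycles=100,
    clock_hz=1e9,  # 1 GHz
)
print(cycles, seconds)
```

The miss penalty dominates: 529000 of the 588880 cycles come from the 5290 cache misses.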
There don't seem to be any preexisting questions on this, at least from a title search. I am seeking to find the optimal number of passes for an external merge. So, if we have 1000 chunks of data, one pass would be a 1000-way merge. Two passes could be 5 groups of 200 chunks, then a final merge of 1 group of 5 chunks. And so on. I've done some math, which must have a flaw, because it looks like two passes never beat one pass. It could very well be a misunderstanding of how data is read, though.
First, a numerical example:
Data: 100 GB
Ram: 1 GB
Since we have 1GB of memory, we can load 1GB at a time and sort it using quicksort or mergesort. Now we have 100 sorted chunks to merge. We can do a 100-way merge. This is done by making buckets of size RAM/(chunks+1) = 1024MB/101 = 10.14MB. There are 100 input buckets of 10.14MB, one for each of the 100 chunks, and one output bucket also of size 10.14MB. As we merge, if any input bucket empties, we do a disk seek to refill that bucket. Likewise, when the output bucket gets full, we write it to disk and empty it. I claim that the number of "times the disk needs to read" is (data/ram)*(chunks+1). I get this from the fact that we have input buckets of size ram/(chunks+1), and we must read in the entire data for a given pass, so we read (data/bucket_size) times. In other words, every time an input bucket empties we must refill it. We do this over 100 chunks here, so numChunks*(chunk_size/bucket_size) = datasize/bucket_size, or 100*(1024MB/10.14MB). Since bucket_size = ram/(chunks+1), we get 100*(1024/10.14) = (data/ram) * (chunks+1) = 1024*100MB/1024MB * 101 = 10100 reads.
For a two pass system, we do A groups of B #chunks, then a final merge of 1 group of A #chunks. Using previous logic, we have numReads = A*( (data/ram)*(B+1)) + 1*( (data/ram)*(A+1)). We also have A*B = Data/Ram. For instance, 10 groups of 10 chunks, where each chunk is a GB. Here, A = 10 B = 10. 10*10 = 100/1 = 100, which is Data/Ram. This is because Data/Ram was the original number of chunks. For 2 pass, we want to break Data/Ram into A groups of B #chunks.
I'll try to break down the formula here, let D = data, A = #groups, B = #chunks/group, R = RAM
A*(D/R)*(B+1) + 1*(D/R)*(A+1) - This is A times the number of reads of an external merge on B #chunks plus the final merge on A #chunks.
A = D/(R*B) => D^2/(B*R^2) * (B+1) + D/R * (D/(R*B)+1)
(D^2/R^2)*[1 + 2/B] + D/R is number of reads for a 2 pass external merge. For 1 pass, we have (data/ram)*(chunks+1) where chunks = data/ram for 1 pass. Thus, for one pass we have D^2/R^2 + D/R. We see that a 2 pass only reaches that as the chunk size B goes to infinity, and even then the additional final merge gives us D^2/R^2 + D/R. So there must be something about the reads I'm missing, or my math is flawed. Thanks to anyone who takes the time to help me!
You ignore the fact that the total time it takes to read a block of data from disk is the sum of
The access time which is roughly constant and on the order of several milliseconds for rotating hard disk drives.
The transfer time which depends on the size of the data block and the transfer rate.
As the number of chunks increases, the size of the input buffers (you call them buckets) decreases. The smaller the input buffers get, the more pronounced the effect of the constant access time on the total time it takes to fill a buffer. At a certain point, the time to fill a buffer will be almost completely dominated by the access time. So the total time for a merge pass begins to scale with the number of buffers and not the amount of data read.
That's where additional merge passes can speed up the process. They allow the use of fewer and larger input buffers, which mitigates the effect of access time.
Edit: Here's a quick back-of-the-envelope calculation to give an idea about where the break-even point is.
The total transfer time can be calculated easily. All the data has to be read and written once per pass:
total_transfer_time = num_passes * 2 * data / transfer_rate
The total access time for buffer reads is:
total_access_time = num_passes * num_buffer_reads * access_time
Since there's only a single output buffer, it can be made larger than the input buffers without wasting too much memory, so I'll ignore the access time for writes. The number of buffer reads is data / buffer_size, buffer size is about ram / num_chunks for the one-pass approach, and the number of chunks is data / ram. So we have:
total_access_time1 = num_chunks^2 * access_time
For the two-pass solution, it makes sense to use sqrt(num_chunks) buffers to minimize access time. So buffer size is ram / sqrt(num_chunks) and we have:
total_access_time2 = 2 * (data / (ram / sqrt(num_chunks))) * access_time
= 2 * num_chunks^1.5 * access_time
So if we use transfer_rate = 100 MB/s, access_time = 10 ms, data = 100 GB, ram = 1 GB, the total time is:
total_time1 = (2 * 100 GB / 100 MB/s) + 100^2 * 10 ms
= 2000 s + 100 s = 2100 s
total_time2 = (2 * 2 * 100 GB / 100 MB/s) + 2 * 100^1.5 * 10 ms
= 4000 s + 20 s = 4020 s
The effect of access time is still very small. So let's change data to 1000 GB:
total_time1 = (2 * 1000 GB / 100 MB/s) + 1000^2 * 10 ms
= 20000 s + 10000 s = 30000 s
total_time2 = (2 * 2 * 1000 GB / 100 MB/s) + 2 * 1000^1.5 * 10 ms
= 40000 s + 632 s = 40632 s
Now a third of the time in the one-pass version is spent on disk seeks. Let's try with 5000 GB:
total_time1 = (2 * 5000 GB / 100 MB/s) + 5000^2 * 10 ms
= 100000 s + 250000 s = 350000 s
total_time2 = (2 * 2 * 5000 GB / 100 MB/s) + 2 * 5000^1.5 * 10 ms
= 200000 s + 7071 s = 207071 s
Now the two-pass version is faster.
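The three calculations above can be reproduced with a short Python sketch. The function name and parameter names are made up for illustration; the model and the default values (100 MB/s, 10 ms, 1 GB of RAM, 1 GB = 1000 MB) are the ones used in this answer:

```python
def merge_times(data_gb, ram_gb=1.0, transfer_mb_s=100.0, access_s=0.010):
    """Return (one-pass seconds, two-pass seconds) for the simple disk model."""
    num_chunks = data_gb / ram_gb
    # Every pass reads and writes all the data once.
    transfer_per_pass = 2 * data_gb * 1000 / transfer_mb_s
    # One pass: num_chunks input buffers, so num_chunks^2 buffer reads.
    one_pass = transfer_per_pass + num_chunks ** 2 * access_s
    # Two passes with sqrt(num_chunks) buffers each.
    two_pass = 2 * transfer_per_pass + 2 * num_chunks ** 1.5 * access_s
    return one_pass, two_pass

for data_gb in (100, 1000, 5000):
    t1, t2 = merge_times(data_gb)
    print(f"{data_gb} GB: one pass {t1:.0f} s, two passes {t2:.0f} s")
```

Varying `access_s` or `transfer_mb_s` shows how strongly the break-even point depends on the ratio of seek time to transfer rate.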
To get an optimum you need a more sophisticated model of the disk. Let the time to fill a block of size S be rS + k, where k is the seek time and r is the transfer time per unit of data (the inverse of the read rate).
If you divide RAM of size M into C+1 buffers of size M/(C+1), then the time to load RAM once is (C+1) (r M/(C+1) + k) = rM + k(C+1). So as you'd expect, making C smaller speeds up read time by eliminating seeks. It's fastest to read all of memory in one sequential block, but merging doesn't allow it. We must make a tradeoff. That's where we need to look for the optimum.
With total data size of c times RAM size, there are c chunks to be merged.
In the one pass scheme, C=c, and the total read time must be just the time to fill RAM c times over: c (rM + k(c+1)) = c(rM + kc + k).
In the two pass scheme with an N-way division of data for the first pass, that pass has C=c/N and in the second pass, C=N. So total cost is
c ( rM + k(c/N+1) ) + c ( rM + k(N+1) ) = c ( 2rM + k(c/N + N) + 2k )
Note this model omits write time. You should fill that in eventually unless you're assuming it's overlapped I/O on a different device and thus can be ignored.
It's not hard to see here that if c and k are suitably large, then the c/N+N term in the 2-pass model can be so small compared to the c in the one-pass that the 2-pass model will be faster.
I'm going to stop now, but you can carry this logic on to (probably) get a closed-form approximation for an arbitrary number of passes. This will require solving an infinite series. Then you can set the derivative to zero and solve for an estimate of the optimal pass number. If life is good you'll also learn the optimal value of N by setting the gradient of a 2d function in pass number and N to zero. My intuition says N ~ sqrt(c).
If the math gets intractable, you could still simulate a reasonable range of numbers of passes with the kind of simple algebra above at the start and pick an optimum that way.
This is an interesting problem and I'm sorry I don't have more time to work on it at the moment. I hope the analysis framework is enough to let you punch through to a nice result.
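As a starting point for the simulation suggested above, here is a minimal Python sketch of the one-pass and two-pass read-time formulas, with a brute-force search over N. The function names and the parameter values (r = 0.01 s/MB, M = 1024 MB, k = 0.01 s) are assumptions chosen for illustration; like the formulas, the sketch omits write time.

```python
def one_pass_read_time(c, r, M, k):
    """c chunks merged in one pass: c * (rM + k(c + 1))."""
    return c * (r * M + k * (c + 1))

def two_pass_read_time(c, N, r, M, k):
    """First pass merges c/N chunks at a time, second pass merges N."""
    return c * (r * M + k * (c / N + 1)) + c * (r * M + k * (N + 1))

def best_split(c, r, M, k):
    """Brute-force the N in 1..c that minimises the two-pass read time."""
    return min(range(1, c + 1), key=lambda N: two_pass_read_time(c, N, r, M, k))

# Assumed disk parameters: 0.01 s per MB, 1024 MB of RAM, 10 ms per seek.
r, M, k = 0.01, 1024, 0.01
for c in (100, 5000):
    N = best_split(c, r, M, k)
    print(c, N, one_pass_read_time(c, r, M, k), two_pass_read_time(c, N, r, M, k))
```

With these assumed numbers the search lands near sqrt(c), and the two-pass scheme only wins once c is large enough for the k*c seek term to outweigh the extra rM of a second pass.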