What is the complexity of a parallel external sort?

I'm wondering what the complexity is when I perform a parallel external sort.
Suppose I have a big array of N entries and limited memory, e.g. 1 billion entries to sort and room for only 1k entries in memory.
For this case I split the big array into K sorted files with chunk size B using parallel threads, and saved them to disk.
After that I read from all the files and merged them back into a new array using a priority queue and threads.
I need to calculate the complexity in big O notation.
And what happens to the complexity if I use multiple processors, let's say N processors?
Is it ~O(N/10 * log N)?

The time complexity is going to be O(n log(n)) regardless of the number of processors and/or the number of external drives. The total time will be T(n/a logb(n)), but since a and b are constants, the time complexity remains O(n log(n)), even if the sort runs, say, 10 times as fast.
It's not clear to me what you mean by "parallel" external sort. I'll assume multiple cores or multiple processors, but are there also multiple drives? Do all N cores or processors share the same memory that only holds 1k elements or does each core or processor have its own "1k" of memory (in effect having "Nk" of memory)?
external merge sort in general
On the initial pass, the input array is read in chunks of size B, (1k elements), sorted, then written to K sorted files. The end result of this initial pass is K sorted files of size B (1k elements). All of the remaining passes will repeatedly merge the sorted files until a single sorted file is produced.
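The two phases described above can be sketched in Python. This is a minimal illustration rather than a tuned implementation: the function names (`write_sorted_runs`, `read_run`, `external_sort`), the one-integer-per-line file format, and the tiny chunk size are assumptions made for the example, and the merged result is returned as a list only to keep the sketch short (a real external sort would write it to a file).

```python
import heapq
import os
import tempfile
from itertools import islice


def write_sorted_runs(values, chunk_size, tmp_dir):
    """Initial pass: sort chunks of `chunk_size` items, write each to its own file."""
    paths = []
    it = iter(values)
    while True:
        chunk = list(islice(it, chunk_size))  # only one chunk is held in memory
        if not chunk:
            break
        chunk.sort()
        path = os.path.join(tmp_dir, f"run_{len(paths)}.txt")
        with open(path, "w") as f:
            f.writelines(f"{v}\n" for v in chunk)
        paths.append(path)
    return paths


def read_run(path):
    """Stream one sorted run back from disk, one integer per line."""
    with open(path) as f:
        for line in f:
            yield int(line)


def external_sort(values, chunk_size):
    """Sort `values` via sorted runs on disk followed by a single k-way merge."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        paths = write_sorted_runs(values, chunk_size, tmp_dir)
        # heapq.merge keeps one current item per run in a heap (a k-way merge)
        return list(heapq.merge(*(read_run(p) for p in paths)))


print(external_sort([5, 3, 9, 1, 7, 2, 8], chunk_size=3))
```

With `chunk_size=3` the seven inputs become three sorted runs, which are then merged in one pass.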
The initial pass is normally CPU bound, and using multiple cores or processors for sorting each chunk of size B will reduce the time. Any sorting method (or any stable sorting method, if stability is required) can be used for the initial pass.
For the merge phase, being able to perform I/O in parallel with doing merge operations will reduce the time. Using multi-threading to overlap I/O with merge operations will reduce time and be simpler than using asynchronous I/O to do the same thing. I'm not aware of a way to use multi-threading to reduce the time for a k-way merge operation.
For a k-way merge, the files are read in smaller chunks of size B/(k+1). This allows for k input buffers and 1 output buffer for a k-way merge operation.
For hard drives, random access overhead is an issue. Say the transfer rate is 200 MB/s and the average random access overhead is 0.01 seconds, which is the same amount of time it takes to transfer 2 MB. If the buffer size is 2 MB, then random access overhead cuts the effective transfer rate to 1/2, ~100 MB/s. If the buffer size is 8 KB, then random access overhead cuts the effective transfer rate to about 1/250, ~0.8 MB/s. With a small buffer, a 2-way merge will be faster, due to the overhead of random access.
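The hard drive arithmetic can be reproduced with a small model. This is a sketch under the assumption that every buffer-sized I/O pays exactly one random access; `effective_rate_mb_s` is a name made up for the example.

```python
def effective_rate_mb_s(buffer_mb, transfer_mb_s, access_s):
    """Effective throughput when every buffer-sized I/O pays one random access."""
    transfer_s = buffer_mb / transfer_mb_s
    return buffer_mb / (access_s + transfer_s)


# Hard drive figures from the text: 200 MB/s, 0.01 s per random access
for buffer_mb in (2.0, 0.008):  # 2 MB and 8 KB (taking 1 MB = 1000 KB)
    rate = effective_rate_mb_s(buffer_mb, transfer_mb_s=200.0, access_s=0.01)
    print(f"{buffer_mb} MB buffer: {rate:.2f} MB/s")
```

Plugging in other drive parameters shows how quickly the effective rate collapses once the buffer is much smaller than the amount of data that could be transferred during one access.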
For SSDs in a non-server setup, usually there's no command queuing, and command overhead is about 0.0001 second on reads and 0.000025 second on writes. Transfer rate is about 500 MB/s for SATA interface SSDs. If the buffer size is 2 MB, the command overhead is insignificant. If the buffer size is 4 KB, the transfer itself takes only about 0.000008 second, so the read rate is cut to about 1/13.5, ~37 MB/s, and the write rate to about 1/4, ~120 MB/s. So if the buffer size is small enough, again a 2-way merge will be faster.
On a PC, these small buffer scenarios are unlikely. In the case of the gnu sort for huge text files, with default settings, it allocates a bit over 1GB of ram, creating 1GB sorted files on the initial pass, and does a 16-way merge, so buffer size is 1GB/17 ~= 60 MB. (The 17 is for 16 input buffers, 1 output buffer).
Consider the case of where all of the data fits in memory, and that the memory consists of k sorted lists. The time complexity for merging the lists will be O(n log(k)), regardless if a 2-way merge sort is used, merging pairs of lists in any order or if a k-way merge sort is used to merge all the lists in one pass.
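One way to see this in practice is to count comparisons for both strategies. The sketch below is an illustration only: the `Counted` wrapper and the helper names are made up for the example, and Python's `heapq.merge` stands in for both the heap-based k-way merge and the 2-way merge. Both counts should land within a small constant factor of n * log2(k), although the constants differ.

```python
import heapq
import math
import random


class Counted:
    """Wraps a value and counts every `<` comparison made on it."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value


def count_comparisons(merge_fn, lists):
    """Run merge_fn on wrapped copies of `lists`; return (values, comparisons)."""
    wrapped = [[Counted(v) for v in lst] for lst in lists]
    Counted.comparisons = 0
    merged = merge_fn(wrapped)
    return [c.value for c in merged], Counted.comparisons


def k_way(lists):
    """Merge all lists in one pass with a heap."""
    return list(heapq.merge(*lists))


def pairwise(lists):
    """Merge lists two at a time until a single list remains."""
    while len(lists) > 1:
        merged = [list(heapq.merge(a, b)) for a, b in zip(lists[0::2], lists[1::2])]
        if len(lists) % 2:
            merged.append(lists[-1])  # odd list out waits for the next round
        lists = merged
    return lists[0]


random.seed(0)
k, per_list = 16, 1000
lists = [sorted(random.random() for _ in range(per_list)) for _ in range(k)]
n = k * per_list

for name, fn in (("k-way", k_way), ("pairwise", pairwise)):
    values, comps = count_comparisons(fn, lists)
    print(name, comps, "comparisons; n*log2(k) =", int(n * math.log2(k)))
```

Equal complexity does not mean equal speed, which is what the measurements that follow are about.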
I did some actual testing of this on my system, Intel 3770K 3.5ghz, Windows 7 Pro 64 bit. For a heap based k-way merge, with k = 16, transfer rate ~ 235 MB/sec, with k = 4, transfer rate ~ 495 MB/sec. For a non-heap 4-way merge, transfer rate ~ 1195 MB/sec. Hard drive transfer rates are typically 70 MB/sec to 200 MB/sec. Typical SSD transfer rate is ~500 MB/sec. Expensive server type SSDs (SAS or PCIe) are up to ~2GB/sec read, ~1.2GB/sec write.


Why is the execution time of tf.nn.conv2d different when the number of multiplications is the same?

I am using TensorFlow to build a CNN for an image classification experiment, and I found the following phenomenon:
operation 1: tf.nn.conv2d(x, [3,3,32,32], strides=[1,1,1,1], padding='SAME')
The shape of x is [128,128,32], meaning a convolution with a 3x3 kernel on x, where both the input channels and the output channels are 32. The total number of multiplications is 128 * 128 * 3 * 3 * 32 * 32.
operation 2: tf.nn.conv2d(x, [3,3,64,64], strides=[1,1,1,1], padding='SAME')
The shape of x is [64,64,64], meaning a convolution with a 3x3 kernel on x, where both the input channels and the output channels are 64. The total number of multiplications is 64 * 64 * 3 * 3 * 64 * 64.
Compared with operation 1, the feature map of operation 2 is scaled down by 1/2 in each dimension and the channel count is doubled. The number of multiplications is the same, so the running time should be the same. But in practice the running time of operation 1 is longer than that of operation 2.
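As a quick check of the claim that both operations involve the same number of multiplications, the count can be computed from the shapes. This is only a sketch: `conv2d_multiplications` is a hypothetical helper, not a TensorFlow function, and it assumes stride 1 with 'SAME' padding so that the output has the same height and width as the input.

```python
def conv2d_multiplications(height, width, in_channels, out_channels, kernel=3):
    """Multiplications for a stride-1, 'SAME'-padded convolution.

    Every output position (height * width of them) and every output channel
    needs kernel * kernel * in_channels multiplications. Positions that fall
    in the padding are counted too, which slightly overstates the border work.
    """
    return height * width * out_channels * kernel * kernel * in_channels


op1 = conv2d_multiplications(128, 128, 32, 32)
op2 = conv2d_multiplications(64, 64, 64, 64)
print(op1, op2, op1 == op2)
```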
My measurement method was as follows:
Removing one convolution of operation 1 reduced the training time for one epoch by 23 seconds, which means the running time of operation 1 is 23 seconds.
Removing one convolution of operation 2 reduced the training time for one epoch by 13 seconds, which means the running time of operation 2 is 13 seconds.
The phenomenon can be reproduced every time.
My GPU is an NVIDIA GTX 980 Ti and the OS is Ubuntu 16.04.
So the question is: why is the running time of operation 1 longer than that of operation 2?
If I had to guess, it has to do with how the image is ordered in memory. Remember that in memory everything is stored in a flattened format. This means that if you have a tensor of shape [128, 128, 32], the 32 features/channels of one pixel are stored next to each other, followed by the next pixel in the same row, and then the next row. https://en.wikipedia.org/wiki/Row-major_order
Accessing closely packed memory is very important to performance, especially on a GPU, which has a wide memory bus and is optimized for aligned, in-order memory access. In the case with the larger image you have to skip around the image more, and the memory access is more out of order. In case 2 you can do more in-order memory access, which gives you more speed. Multiplications are very fast operations. I bet that with a convolution, memory access is the bottleneck that limits performance.
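To make the flattened layout concrete, here is a small sketch that computes row-major offsets for the two shapes in the question. It assumes a plain [height, width, channels] layout with no batch dimension; `flat_offset` is a made-up helper, and the layout a GPU library actually uses internally may differ.

```python
def flat_offset(row, col, channel, shape):
    """Row-major offset (in elements) of x[row, col, channel] for shape [H, W, C]."""
    height, width, channels = shape
    return (row * width + col) * channels + channel


for name, shape in (("operation 1", (128, 128, 32)), ("operation 2", (64, 64, 64))):
    base = flat_offset(0, 0, 0, shape)
    next_channel = flat_offset(0, 0, 1, shape) - base
    next_col = flat_offset(0, 1, 0, shape) - base
    next_row = flat_offset(1, 0, 0, shape) - base
    print(name, "strides in elements (row, col, channel):",
          next_row, next_col, next_channel)
```

Under this assumption, each pixel in operation 2 contributes a contiguous run of 64 values instead of 32, which fits the guess that case 2 gets more in-order access per element.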
chasep255's answer is good and probably correct.
Another possibility (or alternative way of thinking about chasep255's answer) is to consider how caching (all the little hardware tricks that can speed up memory fetches, address mapping, etc) could be producing what you see...
You have basically two things: a stream of X input data and a static filter matrix. In case 1, you have 9*1024 static elements; in case 2 you have 4 times as many. Both cases have the same total multiplication count, but in case 2 the process is finding more of its data where it expects (i.e. where it was the last time it was asked for). Net result: fewer memory access stalls, more speed.

How is disk seek faster in a column-oriented database?

I have recently started working with BigQuery. I have learned that it is a column-oriented database and that disk seek is much faster in this type of database.
Can anyone explain how disk seek is faster in a column-oriented database compared to a relational DB?
The big difference is in the way the data is stored on disk.
Let's look at an (over)simplified example:
Suppose we have a table with 50 columns, some are numbers (stored binary) and others are fixed width text - with a total record size of 1024 bytes. Number of rows is around 10 million, which gives a total size of around 10GB - and we're working on a PC with 4GB of RAM. (while those tables are usually stored in separate blocks on disk, we'll assume the data is stored in one big block for simplicity).
Now suppose we want to sum all the values in a certain column (integers stored as 4 bytes in the record). To do that we have to read an integer every 1024 bytes (our record size).
The smallest amount of data that can be read from disk is a sector and is usually 4kB. So for every sector read, we only have 4 values. This also means that in order to sum the whole column, we have to read the whole 10GB file.
In a column store on the other hand, data is stored in separate columns. This means that for our integer column, we have 1024 values in a 4096 byte sector instead of 4! (and sometimes those values can be further compressed) - The total data we need to read now is around 40MB instead of 10GB, and that will also stay in the disk cache for future use.
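The numbers in this example can be checked with a few lines of arithmetic. The sketch below uses only the figures given above (10 million rows, 1024-byte records, 4-byte integers, 4096-byte sectors); `sectors_to_scan` is a name made up for the illustration, and compression is ignored.

```python
import math

ROWS = 10_000_000      # rows in the example table
RECORD_BYTES = 1024    # full record size in the row store
VALUE_BYTES = 4        # one integer column value
SECTOR_BYTES = 4096    # smallest unit read from disk


def sectors_to_scan(rows, bytes_per_row):
    """Sectors that must be read to touch `bytes_per_row` bytes for every row."""
    return math.ceil(rows * bytes_per_row / SECTOR_BYTES)


row_store = sectors_to_scan(ROWS, RECORD_BYTES)     # value is buried in each record
column_store = sectors_to_scan(ROWS, VALUE_BYTES)   # values are packed back to back

print("row store:   ", row_store, "sectors,", row_store * SECTOR_BYTES / 1e9, "GB")
print("column store:", column_store, "sectors,", column_store * SECTOR_BYTES / 1e6, "MB")
```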
It gets even better if we look at the CPU cache (assuming data is already cached from disk): one integer every 1024 bytes is far from optimal for the CPU (L1) cache, whereas 1024 integers in one block will speed up the calculation dramatically (those will be in the L1 cache, which is around 50 times faster than a normal memory access).
The claim that "disk seek is much faster" is wrong. The real question is "how do column-oriented databases store data on disk?", and the answer usually is "by sequential writes only" (e.g. they usually don't update data in place), which produces fewer disk seeks, hence the overall speed gain.

Why is 2 way merge sorting more efficient than one way merge sorting?

I am reading about external sorting on Wikipedia, and need to understand why 2 phase merging is more efficient than 1 phase merging.
Wiki : However, there is a limitation to single-pass merging. As the
number of chunks increases, we divide memory into more buffers, so
each buffer is smaller, so we have to make many smaller reads rather
than fewer larger ones.
Thus, for sorting, say, 50 GB in 100 MB of RAM, using a single merge
pass isn't efficient: the disk seeks required to fill the input
buffers with data from each of the 500 chunks (we read 100MB / 501 ~
200KB from each chunk at a time) take up most of the sort time. Using
two merge passes solves the problem. Then the sorting process might
look like this:
Run the initial chunk-sorting pass as before.
Run a first merge pass combining 25 chunks at a time, resulting in 20 larger sorted chunks.
Run a second merge pass to merge the 20 larger sorted chunks.
Could anyone give me a simple example to help me understand this concept? I am particularly confused about allocating more buffers in 2 phase merging.
The issue is random access overhead. Average rotational delay is around 4 ms and average seek time around 9 ms, so say 10 ms average access time, versus an average transfer rate of around 150 megabytes per second. So the average access overhead takes about the same time as it does to read or write 1.5 megabytes.
Large I/Os reduce the relative overhead of access time. If data is read or written 10 megabytes at a time, the overhead is reduced to 15%. Using the wiki example of a 100 MB working buffer and 25 chunks to merge, plus one chunk for writes, that's 3.8 megabyte chunks, in which case random access overhead is about 39%.
On most PCs, it's possible to allocate enough memory for a 1 GB working buffer, so 26 chunks at about 40 megabytes per chunk, reducing random access overhead to about 3.75%. 50 chunks would be about 20 megabytes per chunk, with random access overhead around 7.5%.
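The percentages above come from dividing the access time by the time needed to transfer one buffer. A small sketch, assuming the 10 ms and 150 MB/s figures from this answer, one output buffer in addition to the input buffers, and 1 GB treated as 1000 MB (`relative_overhead` is a made-up helper name), applies the same calculation to the single-pass case from the Wikipedia example as well:

```python
ACCESS_S = 0.010     # ~10 ms average access time
RATE_MB_S = 150.0    # ~150 MB/s average transfer rate


def relative_overhead(memory_mb, chunks_merged):
    """Random access time as a fraction of the transfer time for one buffer.

    Memory is split into one input buffer per chunk plus one output buffer.
    """
    buffer_mb = memory_mb / (chunks_merged + 1)
    transfer_s = buffer_mb / RATE_MB_S
    return ACCESS_S / transfer_s


for memory_mb, chunks in ((100, 500), (100, 25), (100, 20), (1000, 25), (1000, 50)):
    overhead = relative_overhead(memory_mb, chunks)
    print(f"{memory_mb} MB, {chunks}-way merge: {overhead:.1%}")
```

The 500-way case shows why a single merge pass with 100 MB of RAM is dominated by seeks: the access time is several times larger than the transfer time of each small buffer.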
Note that even with 50 chunks, it's still a two pass sort: the first pass uses all of the 1 GB buffer to do a memory sort, then writes sorted 1 GB chunks. The second pass merges all 50 chunks to produce the sorted file.
The other issue with 50 chunks, is that a min heap type method is needed to maintain sorted information in order to determine which element of which chunk is the smallest of the 50 and gets moved to the output buffer. After an element is moved to the output buffer, a new element is moved into the heap and then the heap is re-heapified, which takes about 2 log2(50) operations, or about 12 operations. This is better than the simple approach of doing 49 compares to determine the smallest element from a group of 50 elements when doing the 50 way merge. The heap is composed of structures, the current element, which chunk this is (file position or file), the number of elements left in the chunk, ... .
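A minimal sketch of such a heap, using Python's `heapq` module: each heap entry is a small structure holding the current element, the chunk it came from, and the position within that chunk. In-memory lists stand in for buffered chunk files here, which is an assumption made to keep the example short, and `heap_merge` is a made-up name.

```python
import heapq


def heap_merge(chunks):
    """Merge sorted chunks using a min heap of (element, chunk index, position)."""
    heap = [(chunk[0], i, 0) for i, chunk in enumerate(chunks) if chunk]
    heapq.heapify(heap)
    output = []
    while heap:
        element, i, pos = heap[0]      # smallest current element of any chunk
        output.append(element)
        if pos + 1 < len(chunks[i]):
            # Replace the root with the next element of the same chunk,
            # then let heapreplace restore the heap order.
            heapq.heapreplace(heap, (chunks[i][pos + 1], i, pos + 1))
        else:
            heapq.heappop(heap)        # this chunk is exhausted
    return output


print(heap_merge([[1, 4, 7], [2, 5, 8], [3, 6, 9]]))
```

Including the chunk index in each entry also breaks ties between equal elements, so the comparison never has to look past the second field.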

Memory benchmark plot: understanding cache behaviour

I've tried every kind of reasoning I can possibly come up with, but I don't really understand this plot.
It basically shows the performance of reading and writing arrays of different sizes with different strides.
I understand that for a small stride like 4 bytes I read all the cells in the cache, so I get good performance. But what happens when I have the 2 MB array and the 4k stride? Or the 4M array and the 4k stride? Why is the performance so bad? Finally, why is it that when I have the 1 MB array and the stride is 1/8 of the size, performance is decent, when it is 1/4 of the size performance gets worse, and then at half the size performance is super good?
Please help me, this thing is driving me mad.
At this link, the code: https://dl.dropboxusercontent.com/u/18373264/membench/membench.c
Your code loops for a given time interval instead of a constant number of accesses, so you're not comparing the same amount of work, and not all sizes/strides enjoy the same number of repetitions (so they get different chances for caching).
Also note that the second loop will probably get optimized away (the internal for) since you don't use temp anywhere.
Another effect in place here is TLB utilization:
On a 4k page system, as you grow your strides while they're still <4k, you'll enjoy less and less utilization of each page (finally reaching one access per page on the 4k stride), meaning growing access times as you'll have to access the 2nd level TLB on each access (possibly even serializing your accesses, at least partially).
Since you normalize your iteration count by the stride size, you'll have in general (size / stride) accesses in your innermost loop, but * stride outside. However, the number of unique pages you access differs: for a 2M array and 2k stride, you'll have 1024 accesses in the inner loop, but only 512 unique pages, so 512*2k accesses to TLB L2. On the 4k stride, there would be 512 unique pages still, but 512*4k TLB L2 accesses.
For the 1M array case, you'll have 256 unique pages overall, so the 2k stride would have 256 * 2k TLB L2 accesses, and the 4k would again have twice.
This explains both why there's a gradual perf drop on each line as you approach 4k, as well as why each doubling in array size doubles the time for the same stride. The lower array sizes may still partially enjoy the L1 TLB, so you don't see the same effect (although I'm not sure why 512k is there).
Now, once you start growing the stride above 4k, you suddenly start benefiting again since you're actually skipping whole pages. 8K stride would access only every other page, taking half the overall TLB accesses as 4k for the same array size, and so on.
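The page counts used in this explanation can be reproduced with a short sketch. It assumes 4 KB pages and an array that starts on a page boundary; `access_pattern` is a made-up helper, and it only counts accesses and pages, it does not model the TLB itself.

```python
PAGE_BYTES = 4096  # assuming 4 KB pages, as in the text


def access_pattern(array_bytes, stride_bytes):
    """Return (accesses per sweep, distinct pages touched) for a strided sweep."""
    offsets = range(0, array_bytes, stride_bytes)
    pages = {offset // PAGE_BYTES for offset in offsets}
    return len(offsets), len(pages)


MB = 1024 * 1024
for array_bytes in (1 * MB, 2 * MB):
    for stride in (1024, 2048, 4096, 8192):
        accesses, pages = access_pattern(array_bytes, stride)
        print(f"{array_bytes // MB} MB array, stride {stride}: "
              f"{accesses} accesses, {pages} pages")
```

Below the page size the number of pages stays constant while the accesses per page shrink; above it, both the accesses and the pages halve with each doubling of the stride.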

How to compute the time of an in-place external merge sort?

The original problem is like this:
You are to sort 1 PB of integers ranging from -2^31 to 2^31 - 1 (int). You have 1024 machines, each having 1 TB of disk space and 16 GB of memory. Assume disk speed is 128 MB/s (r/w) and memory speed is 8 GB/s (r/w). CPU time can be ignored. Network transfer time can be ignored for simplicity. Compute the approximate time needed.
I know that with an external sort we can sort the 1 TB of data on a single machine in roughly 10 hrs, computed as follows:
Disk access (2r2w): 1T * 4 / 128MB/s = 2 ^ 15 sec ~ 9 hrs
Mem access:
sorting 2^48 Integers in 64 parts (2 ^ 42 each) roughly takes 1.3 min each. So totally 1.4 hr.
63 way merging takes several seconds, and thus is ignored.
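The disk part of this estimate is simple enough to check directly. The sketch below only restates the figures above (1 TB per machine, 128 MB/s, two reads and two writes of the full data); the constant names are made up for the example.

```python
DATA_BYTES = 2**40          # 1 TB on one machine
DISK_BYTES_PER_S = 2**27    # 128 MB/s
PASSES = 4                  # 2 reads + 2 writes of the full data set

disk_seconds = DATA_BYTES * PASSES / DISK_BYTES_PER_S
print(disk_seconds, "s =", round(disk_seconds / 3600, 1), "hours")
```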
But what about the next step: combining the 1024 TB of data? I have no idea how this is computed. Any help, please?
2^31 is about 2 billion (2 "giga"), so you are looking at a lot of duplicate numbers in a fixed range. So consider Radix Sort ( http://en.wikipedia.org/wiki/Radix_sort ).
Each processor, for its subset of the data, creates a 'count' array (x[0] contains the count of 0s, etc.). Then you can merge all the results into one array. Later you can "construct" the sorted array.
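The counting approach described here (which is closer to a counting sort than to a digit-by-digit radix sort) can be sketched as follows. This is an illustration under assumptions: a `collections.Counter` stands in for the fixed-size count array, the three small lists stand in for the data held by each processor, and the function names are made up.

```python
from collections import Counter


def count_values(part):
    """Per-processor step: count how often each value occurs in its subset."""
    return Counter(part)


def merge_counts(counters):
    """Combine the per-processor counts into one table."""
    total = Counter()
    for c in counters:
        total.update(c)  # adds counts value by value
    return total


def construct_sorted(total):
    """Rebuild the sorted output by emitting each value `count` times."""
    result = []
    for value in sorted(total):
        result.extend([value] * total[value])
    return result


parts = [[3, -1, 3, 7], [0, 7, -1], [3, 0]]  # hypothetical data on 3 processors
print(construct_sorted(merge_counts(count_values(p) for p in parts)))
```

Only the counts travel between machines, so the amount of data exchanged depends on the number of distinct values rather than on the number of input integers.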
