Which is faster to process a 1TB file: a single machine or 5 networked machines? - performance

Which is faster to process a 1TB file: a single machine or 5 networked
machines? ("To process" refers to finding the single UTF-16 character
with the most occurrences in that 1TB file). The rate of data
transfer is 1Gbit/sec, the entire 1TB file resides in 1 computer, and
each computer has a quad core CPU.
Below is my attempt at the question, using an array of longs (with an array size of 2^16) to keep track of the character counts. This should fit into the memory of a single machine, since 2^16 x 2^3 bytes (size of a long) = 2^19 bytes = 0.5 MB. Any help (links, comments, suggestions) would be much appreciated. I used the latency numbers cited by Jeff Dean, and I tried my best to use the best approximations that I knew of. (A rough sketch of the counting loop itself is shown after the calculation below.) The final answers are:
Single Machine: 5.8 hrs (due to slowness of reading from disk)
5 Networked Machines: 7.64 hrs (due to reading from disk and network)
1) Single Machine
a) Time to Read File from Disk --> 5.8 hrs
-If it takes 20ms to read 1MB seq from disk,
then to read 1TB from disk takes:
20ms/1MB x 1024MB/GB x 1024GB/TB = 20,972 secs
= 350 mins = 5.8 hrs
b) Time needed to fill array w/complete count data
--> 0 sec since it is computed while doing step 1a
-At 0.5 MB, the count array fits into L2 cache.
Since L2 cache takes only 7 ns to access,
the CPU can read & write to the count array
while waiting for the disk read.
Time: 0 sec since it is computed while doing step 1a
c) Iterate thru entire array to find max count --> 0.00625ms
-Since it takes 0.0125ms to read & write 1MB from
L2 cache and array size is 0.5MB, then the time
to iterate through the array is:
0.0125ms/MB x 0.5MB = 0.00625ms
d) Total Time
Total=a+b+c=~5.8 hrs (due to slowness of reading from disk)
2) 5 Networked Machines
a) Time to transfer 1TB over 1Gbit/s --> 6.48 hrs
1TB x 1024GB/TB x 8 Gbit/GB x 1s/Gbit
= 8,192 s = 137 min = 2.3 hrs
But since the original machine keeps a fifth of the data, it
only needs to send (4/5)ths of data, so the time required is:
2.3 hr x 4/5 = 1.84 hrs
*But to send the data, the data needs to be read, which
is (4/5)(answer 1a) = (4/5)(5.8 hrs) = 4.64 hrs
So total time = 1.84hrs + 4.64 hrs = 6.48 hrs
b) Time to fill array w/count data from original machine --> 1.16 hrs
-The original machine (that had the 1TB file) still needs to
read the remainder of the data in order to fill the array with
count data. So this requires (1/5)(answer 1a)=1.16 hrs.
The CPU time to read & write to the array is negligible, as
shown in 1b.
c) Time to fill the other machines' arrays w/counts --> not counted
-As the file is being transferred, the count array can be
computed. This time is not counted.
d) Time required to receive 4 arrays --> (2^-6)s
-Each count array is 0.5MB
0.5MB x 4 arrays x 8 bits/B x 1s/Gbit
= 2^19 B x 2^2 x 2^3 bits/B x 1s/2^30 bits
= 2^24/2^30 s = (2^-6)s
e) Time to merge arrays
--> 0 sec (since they can be merged while being received)
f) Total time
Total = a+b+c+d+e =~ a+b =~ 6.48 hrs + 1.16 hrs = 7.64 hrs
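For what it's worth, here is a minimal sketch of the counting loop described above (Python; the file path and chunk size are just placeholders): a flat array of 2^16 64-bit counters is updated while the file is streamed sequentially.

    import array

    def count_utf16_units(path, chunk_size=1 << 20):
        """Count occurrences of each 16-bit code unit in the file at `path`."""
        counts = array.array("Q", [0] * (1 << 16))  # 2^16 64-bit counters ~= 0.5 MB
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)          # sequential ~1 MB reads from disk
                if not chunk:
                    break
                units = array.array("H")            # view the bytes as 16-bit code units
                units.frombytes(chunk)              # assumes an even number of bytes per chunk
                for u in units:
                    counts[u] += 1
        return counts

    # Hypothetical usage: find the most frequent code unit.
    # counts = count_utf16_units("huge_file.bin")
    # winner = max(range(1 << 16), key=counts.__getitem__)

(The pure-Python inner loop would itself be slower than the disk; the point is only the structure: a 0.5 MB counter array updated while the file is streamed, which in a compiled language really would overlap the disk reads as estimated in step 1b.)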

This is not an answer but just a longer comment. You have miscalculated the size of the frequency array. A 1 TiB file contains about 550 G symbols (2^40 bytes at 2 bytes per UTF-16 code unit), and because nothing is said about their expected frequency, you would need a count array of at least 64-bit integers (that is, 8 bytes per element). The total size of this frequency array would be 2^16 * 8 = 2^19 bytes, or just 512 KiB, and not 4 GiB as you had miscalculated. It would only take ≈4.3 ms to send this data over a 1 Gbps link (protocol headers take roughly 3% if you use TCP/IP over Ethernet with an MTU of 1500 bytes; less with jumbo frames, but they are not widely supported). Also, this array size fits perfectly in the CPU cache.
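A quick back-of-the-envelope check of those figures (the 3% protocol overhead is the rough number quoted above):

    elements    = 1 << 16                    # one counter per possible 16-bit code unit
    elem_bytes  = 8                          # 64-bit counters
    array_bytes = elements * elem_bytes      # 524,288 B = 512 KiB

    link_bps = 1_000_000_000                 # 1 Gbit/s
    overhead = 1.03                          # ~3% for TCP/IP + Ethernet framing, MTU 1500
    send_s   = array_bytes * 8 * overhead / link_bps
    print(array_bytes, send_s)               # 524288 bytes, ~0.0043 s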
You have grossly overestimated the time it would take to process the data and extract the frequencies, and you have also overlooked the fact that this work can overlap the disk reads. In fact, updating the frequency array, which resides in the CPU cache, is so fast that the computation time is negligible, as most of it will overlap the slow disk reads. But you have underestimated the time it takes to read the data. Even with a multicore CPU you still have only one path to the hard drive, and hence you would still need the full 5.8 hrs to read the data in the single-machine case.
In fact, this is exactly the kind of data processing that benefits neither from parallel networked processing nor from having more than one CPU core. This is why supercomputers and other fast networked processing systems use distributed parallel file storage that can deliver many GB/s of aggregate read/write bandwidth.
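As a rough illustration of the overlap argument above, here is a sketch in which a reader thread keeps the next disk read in flight while the main thread updates the counters (names, buffer sizes and the assumption of little-endian UTF-16 are all illustrative):

    import queue, threading

    def overlapped_count(path, chunk_size=1 << 20):
        chunks = queue.Queue(maxsize=4)            # small hand-off buffer

        def reader():                              # runs in its own thread
            with open(path, "rb") as f:
                while chunk := f.read(chunk_size):
                    chunks.put(chunk)              # blocks only if the counter falls behind
            chunks.put(None)                       # end-of-file sentinel

        counts = [0] * (1 << 16)
        threading.Thread(target=reader, daemon=True).start()
        while (chunk := chunks.get()) is not None:
            for i in range(0, len(chunk) - 1, 2):  # little-endian 16-bit code units
                counts[chunk[i] | (chunk[i + 1] << 8)] += 1
            # while this loop runs, the reader thread already has the next read queued
        return counts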

You only need to send 0.8 TB if your source machine is one of the 5.
It may not even make sense sending the data to other machines. Consider this:
In order for the source machine to send the data, it must first hit the disk to read the data into main memory before sending it over the network. If the data is already in main memory and not being processed, you are wasting that opportunity.
So under the assumption that loading into CPU cache is much less expensive than disk-to-memory or data over the network (which is true, unless you are dealing with alien hardware), you are better off just doing it on the source machine, and the only case where splitting up the task makes sense is if the "file" is somehow created/populated in a distributed way to start with.
So you should only count the disk read time of a 1 TB file, with a tiny bit of overhead for L1/L2 cache and CPU ops. The cache access pattern is optimal since it is sequential, so you only cache-miss once per piece of data.
The primary point here is that disk is the primary bottleneck which overshadows everything else.

Related

Scan time of cache, RAM and disk?

Consider a computer system that has cache memory, main memory (RAM), and disk, and an OS that uses virtual memory. It takes 2 nsec to access a byte from the cache, 20 nsec to access a byte from RAM, and 10 msec to access a block of 1000 bytes from the disk. If a book has 1000 pages, each with 50 lines of 80 characters, how long will it take to electronically scan the text when the master copy is at each level of the memory hierarchy, proceeding from inboard memory down to offline storage?
If a book has 1000 pages, each with 50 lines of 80 characters, then the book has 1000 * 50 * 80 = 4,000,000 characters. We don't know how big a character is (it could be UTF-8, where different characters are different sizes), we don't know how much meta-data there is (e.g. if it's a word-processor file with extra information in a header about page margins, tab stops, plus more data for fonts and styles, etc.), and we don't know if there's any extra processing (e.g. compression of pieces within the file).
If we make an unfounded assumption that the file happens to be 4000000 bytes; then we might say that it'll be 4000 blocks (with 1000 bytes per block) on disk.
Then we get into trouble. A CPU can't access data on disk (and can only access data in RAM or cache); so it needs to be loaded into RAM (e.g. by a disk controller) before the CPU can access it.
If it takes the disk controller 10 msec to access a block of 1000 bytes from disk, then we might say it will take at least 10 msec * 4000 = 40000 msec = 40 seconds to read the whole file into RAM. However this would be wrong - the disk controller (acting on requests by file system support code) will have to find the file (e.g. read directory info, etc.), and the file may be fragmented, so the disk controller will need to read (and then follow) a "list of where the pieces of the file are".
Of course while the CPU is scanning the first part of the file the disk controller can be reading the last part of the file; either because the software is designed to use asynchronous IO or because the OS detected a sequential access pattern and started pre-fetching the file before the program asked for it. In other words, the ideal case is that when the disk controller finishes loading the last block the CPU has already scanned the first 3999 blocks and only has to scan 1 more (and the worst case is that the disk controller and CPU never do anything at the same time, so it becomes "40 seconds to load the file into RAM plus however long it takes the CPU to scan the data in RAM").
Of course we also don't know things like whether the file is actually loaded 1 block at a time (if it's split into 400 transfers with 10 blocks per transfer then the "ideal case" would be worse, as the CPU would have to scan the last 10 blocks after they're loaded and not just the last one block), or how many reads the disk controller does before a pre-fetcher detects that it's a sequential pattern.
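Under those same unfounded assumptions (4,000,000 one-byte characters, 1000-byte blocks, 10 msec per block, 20 nsec per byte from RAM), the two extremes work out roughly like this:

    chars       = 1000 * 50 * 80          # 4,000,000 characters, assumed 1 byte each
    block_bytes = 1000
    block_s     = 0.010                   # 10 msec per 1000-byte block
    ram_ns      = 20                      # nsec per byte from RAM (ignoring the cache for now)

    blocks = chars // block_bytes                      # 4000 blocks
    load_s = blocks * block_s                          # 40 s just to get the file into RAM
    scan_s = chars * ram_ns / 1e9                      # 0.08 s to scan it from RAM

    worst_case = load_s + scan_s                       # ~40.08 s: disk and CPU never overlap
    best_case  = load_s + block_bytes * ram_ns / 1e9   # ~40.00002 s: only the last block left to scan

Either way the disk time dominates; the interesting uncertainty is everything listed above, not the arithmetic.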
Once the file is in RAM we have more problems.
Specifically; anyone that understands how caches work will know that "It takes 2 nsec to access a byte from the cache, 20 nsec to access a byte from RAM" means that when you access one byte in RAM it takes 18 nsec to transfer a cache line (a group of consecutive bytes) from RAM to cache plus 2 nsec to obtain that 1 byte from cache; and then the next byte you access will have already been transferred to cache (as it's part of the same group of consecutive bytes) and will only cost 2 nsec.
After the file's data is loaded into RAM by disk controller; because we don't know the cache line size, we don't know how many of software's accesses will take 20 nsec and how many will take 2 nsec.
The final thing is that we don't actually know anything useful about the caches. The general rule is that the larger a cache is the slower it is; and "large enough to contain the entire book (plus the program's code, stack, parts of the OS, etc) but fast enough to have a 2 nsec access time" is possibly an order of magnitude better than any cache that has ever existed. Essentially, the words "the cache" (in the question) cannot be taken literally as that would be implausible. If we look at anything that's close to being able to provide a 2 nsec access time we see multiple levels - e.g. a fast and small L1 cache with a slower but larger L2 cache (with an even slower but larger L3 cache). To make sense of the question you must assume that "the cache" meant "the set of caches", and that the L1 cache has an access time of 2 nsec (but is too small to hold the whole file and everything else) while other levels of the cache hierarchy (e.g. L2 cache) have unknown, slower access times but may be large enough to hold the whole file (and everything else).
Mostly; if I had to guess; I'd assume that the question was taken from a university course (because universities have a habit of tricking students into paying $$$ for worthless "fictional knowledge").

what is the complexity of parallel external sort

I'm wondering what the complexity is when doing a parallel external sort.
Suppose I have a big array of N entries and limited memory, e.g. 1 billion entries to sort and memory for only 1k entries.
For this case I split the big array into K sorted files with chunk size B using parallel threads, and save them to disk.
After that I read from all the files and merge them back into a new array using a PriorityQueue and threads.
I need to calculate the complexity in big O notation.
And what happens to the complexity if I use multiple processes, let's say N processors?
Is it ~O(N/10 * log N)?
Thanks
The time complexity is going to be O(n log(n)) regardless of the number of processors and/or the number of external drives. The total time will be on the order of (n/a)·log_b(n), but since a and b are constants, the time complexity remains O(n log(n)), even if the actual time is, say, 10 times faster.
It's not clear to me what you mean by "parallel" external sort. I'll assume multiple cores or multiple processors, but are there also multiple drives? Do all N cores or processors share the same memory that only holds 1k elements or does each core or processor have its own "1k" of memory (in effect having "Nk" of memory)?
external merge sort in general
On the initial pass, the input array is read in chunks of size B, (1k elements), sorted, then written to K sorted files. The end result of this initial pass is K sorted files of size B (1k elements). All of the remaining passes will repeatedly merge the sorted files until a single sorted file is produced.
The initial pass is normally CPU bound, and using multiple cores or processors for sorting each chunk of size B will reduce the time. Any sorting method (stable or not) can be used for the initial pass.
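A minimal sketch of that initial pass (Python; the text-file run format, run size and temp-file naming are just assumptions for illustration):

    import os, tempfile

    def make_sorted_runs(values, run_size):
        """Initial pass: split `values` into sorted runs of `run_size` items,
        writing each run to its own temporary file."""
        run_paths, buf = [], []

        def flush():
            buf.sort()                               # in-memory sort of one chunk of size B
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as f:
                f.writelines(f"{v}\n" for v in buf)
            run_paths.append(path)
            buf.clear()

        for v in values:
            buf.append(v)
            if len(buf) == run_size:
                flush()
        if buf:
            flush()
        return run_paths

The buf.sort() call is where multiple cores would help in this phase, e.g. by sorting several chunks concurrently.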
For the merge phase, being able to perform I/O in parallel with doing merge operations will reduce the time. Using multi-threading to overlap I/O with merge operations will reduce time and be simpler than using asynchronous I/O to do the same thing. I'm not aware of a way to use multi-threading to reduce the time for a k-way merge operation.
For a k-way merge, the files are read in smaller chunks of size B/(k+1). This allows for k input buffers and 1 output buffer for a k-way merge operation.
For hard drives, random access overhead is an issue, say transfer rate is 200 MB/s, and average random access overhead is 0.01 seconds, which is the same amount of time to transfer 2 MB. If buffer size is 2 MB, then random access overhead effectively cuts transfer rate by 1/2 to ~100 MB/s. If buffer size is 8 KB, then random access overhead effectively cuts transfer rate by 1/250 to ~0.8 MB/s. With a small buffer, a 2-way merge will be faster, due to the overhead of random access.
For SSDs in a non-server setup, usually there's no command queuing, and command overhead is about .0001 second on reads, .000025 second on writes. Transfer rate is about 500 MB/s for Sata interface SSDs. If buffer size is 2MB, the command overhead is insignificant. If buffer size is 4KB, then read rate is cut by 1/12.5 to ~ 40 MB/s, and write rate cut by 1/3.125 to ~160 MB/s. So if buffer size is small enough, again a 2-way merge will be faster.
On a PC, these small buffer scenarios are unlikely. In the case of the gnu sort for huge text files, with default settings, it allocates a bit over 1GB of ram, creating 1GB sorted files on the initial pass, and does a 16-way merge, so buffer size is 1GB/17 ~= 60 MB. (The 17 is for 16 input buffers, 1 output buffer).
Consider the case of where all of the data fits in memory, and that the memory consists of k sorted lists. The time complexity for merging the lists will be O(n log(k)), regardless if a 2-way merge sort is used, merging pairs of lists in any order or if a k-way merge sort is used to merge all the lists in one pass.
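A minimal k-way merge of such runs, using a heap (this is the O(n log(k)) merge just described; it assumes the one-integer-per-line run files from the sketch of the initial pass):

    import heapq

    def merge_runs(run_paths, out_path):
        # heapq.merge keeps one "head" element per run in a heap,
        # so each of the n output elements costs O(log k) heap work.
        files = [open(p) for p in run_paths]
        try:
            streams = [(int(line) for line in f) for f in files]
            with open(out_path, "w") as out:
                for v in heapq.merge(*streams):
                    out.write(f"{v}\n")
        finally:
            for f in files:
                f.close()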
I did some actual testing of this on my system, Intel 3770K 3.5ghz, Windows 7 Pro 64 bit. For a heap based k-way merge, with k = 16, transfer rate ~ 235 MB/sec, with k = 4, transfer rate ~ 495 MB/sec. For a non-heap 4-way merge, transfer rate ~ 1195 MB/sec. Hard drive transfer rates are typically 70 MB/sec to 200 MB/sec. Typical SSD transfer rate is ~500 MB/sec. Expensive server type SSDs (SAS or PCIe) are up to ~2GB/sec read, ~1.2GB/sec write.

How is disk seek faster in a column-oriented database?

I have recently started working with BigQuery. I've come to know that it is a column-oriented database and that disk seek is much faster in this type of database.
Can anyone explain to me how disk seek is faster in a column-oriented database compared to a relational (row-oriented) db?
The big difference is in the way the data is stored on disk.
Let's look at an (over)simplified example:
Suppose we have a table with 50 columns, some are numbers (stored binary) and others are fixed width text - with a total record size of 1024 bytes. Number of rows is around 10 million, which gives a total size of around 10GB - and we're working on a PC with 4GB of RAM. (while those tables are usually stored in separate blocks on disk, we'll assume the data is stored in one big block for simplicity).
Now suppose we want to sum all the values in a certain column (integers stored as 4 bytes in the record). To do that we have to read an integer every 1024 bytes (our record size).
The smallest amount of data that can be read from disk is a sector and is usually 4kB. So for every sector read, we only have 4 values. This also means that in order to sum the whole column, we have to read the whole 10GB file.
In a column store on the other hand, data is stored in separate columns. This means that for our integer column, we have 1024 values in a 4096 byte sector instead of 4! (and sometimes those values can be further compressed) - The total data we need to read now is around 40MB instead of 10GB, and that will also stay in the disk cache for future use.
It gets even better if we look at the CPU cache (assuming data is already cached from disk): one integer every 1024 bytes is far from optimal for the CPU (L1) cache, whereas 1024 integers in one block will speed up the calculation dramatically (those will be in the L1 cache, which is around 50 times faster than a normal memory access).
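A scaled-down sketch of the difference (Python; row width, row count and values are made up, only the access pattern matters):

    import array, random

    ROWS, ROW_BYTES = 100_000, 1024           # a scaled-down version of the 10M-row example

    values    = [random.randrange(1000) for _ in range(ROWS)]
    col_store = array.array("i", values)      # ~400 KB of contiguous 4-byte integers
    row_store = bytearray(ROWS * ROW_BYTES)   # ~100 MB; the integer is buried in each record
    for i, v in enumerate(values):
        row_store[i * ROW_BYTES:i * ROW_BYTES + 4] = v.to_bytes(4, "little")

    def sum_row_store():                      # touches one cache line per 1024-byte record
        return sum(int.from_bytes(row_store[off:off + 4], "little")
                   for off in range(0, len(row_store), ROW_BYTES))

    def sum_col_store():                      # streams a small, dense, cache-friendly array
        return sum(col_store)

    assert sum_row_store() == sum_col_store()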
The "disk seek is much faster" is wrong. The real question is "how column oriented databases store data on disk?", and the answer usually is "by sequential writes only" (eg they usually don't update data in place), and that produces less disk seeks, hence the overall speed gain.

Why Is a Block in HDFS So Large?

Can somebody explain this calculation and give a lucid explanation?
A quick calculation shows that if the seek time is around 10 ms and the transfer rate is 100 MB/s, to make the seek time 1% of the transfer time, we need to make the block size around 100 MB. The default is actually 64 MB, although many HDFS installations use 128 MB blocks. This figure will continue to be revised upward as transfer speeds grow with new generations of disk drives.
A block will be stored as a contiguous piece of information on the disk, which means that the total time to read it completely is the time to locate it (seek time) + the time to read its content without doing any more seeks, i.e. sizeOfTheBlock / transferRate = transferTime.
If we keep the ratio seekTime / transferTime small (close to .01 in the text), it means we are reading data from the disk almost as fast as the physical limit imposed by the disk, with minimal time spent looking for information.
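The quoted calculation, spelled out with the numbers from the quote (the 1% figure is the chosen seek/transfer ratio):

    seek_s        = 0.010      # 10 ms average seek
    transfer_MBps = 100        # 100 MB/s sequential transfer rate
    target_ratio  = 0.01       # want seekTime / transferTime ~= 1%

    # seek_s / (block_MB / transfer_MBps) = target_ratio  =>  solve for block_MB
    block_MB = seek_s * transfer_MBps / target_ratio
    print(block_MB)            # 100.0 MB, so the 64-128 MB defaults are in the right ballpark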
This is important since in map reduce jobs we are typically traversing (reading) the whole data set (represented by an HDFS file or folder or set of folders) and doing logic on it, so since we have to spend the full transferTime anyway to get all the data out of the disk, let's try to minimise the time spent doing seeks and read by big chunks, hence the large size of the data blocks.
In more traditional disk access software, we typically do not read the whole data set every time, so we'd rather spend more time doing plenty of seeks on smaller blocks rather than losing time transferring too much data that we won't need.
Since 100 MB is divided into 10 blocks, you have to do 10 seeks, and each block takes 10 MB / (100 MB/s) to transfer.
(10 ms * 10) + (10 MB / (100 MB/s)) * 10 = 1.1 sec, which is greater than 1.01 sec anyway.
Since 100 MB is divided among 10 blocks, each block holds only 10 MB, as it is HDFS. Then it should be 10 * 10 ms + 10 MB/(100 MB/s) = 0.1 s + 0.1 s = 0.2 s, or even less time.
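For what it's worth, here is the serial-read arithmetic with those numbers (10 ms per seek, 100 MB/s transfer), reading everything from a single disk; reading the blocks in parallel from different datanodes would of course shrink the second figure:

    seek_s, rate_MBps = 0.010, 100

    one_100MB_block = seek_s + 100 / rate_MBps            # 0.01 + 1.0  = 1.01 s
    ten_10MB_blocks = 10 * (seek_s + 10 / rate_MBps)      # 10 * 0.11   = 1.10 s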

Random write Vs Seek time

I have a very weird question here...
I am trying to write the data randomly to a file of 100 MB.
The data size is 4 KB and the random offset is page-aligned (4 KB).
I am trying to write 1 GB of data at random offsets within a 100 MB file.
If I remove the actual code that writes the data to the disk, the entire operation takes less than a second (say 0.04 sec).
If I keep the code that writes the data, it takes several seconds.
In the case of a random write operation, what happens internally? Is the cost the seek time or the write time? From the above scenario it's really confusing!
Can anybody explain in depth please?
The same procedure applied with sequential offsets writes very fast.
Thank you ......
If you're writing all over the file, then the disk (I presume this is on a disk) needs to seek to a new place on every write.
Also, the write speed of hard disks isn't particularly stunning.
Say for the sake of example (taken from a WD Raptor EL150) that we have a 5.9 ms seek time. If you are writing 1GB randomly everywhere in 4KB chunks, you're seeking 1,000,000,000 ÷ 4,000 × 0.0059 seconds = a total seeking time of ~1,400 seconds!
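The same estimate as a couple of lines of arithmetic (the 5.9 ms seek time is from above; the ~70 MB/s sequential write rate used for contrast is an assumption, at the low end of typical hard-drive transfer rates):

    data_bytes  = 1_000_000_000      # 1 GB of data to write
    chunk_bytes = 4_000              # 4 KB random writes
    seek_s      = 0.0059             # average seek time

    writes       = data_bytes // chunk_bytes   # 250,000 random writes
    seek_total_s = writes * seek_s             # ~1,475 s spent purely on seeking

    sequential_s = data_bytes / 70e6           # ~14 s if written sequentially at ~70 MB/s

So virtually all of the random-write time is seek overhead, not the writes themselves.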
