Why Is a Block in HDFS So Large? - hadoop

Can somebody explain this calculation and give a lucid explanation?
A quick calculation shows that if the seek time is around 10 ms and the transfer rate is 100 MB/s, to make the seek time 1% of the transfer time, we need to make the block size around 100 MB. The default is actually 64 MB, although many HDFS installations use 128 MB blocks. This figure will continue to be revised upward as transfer speeds grow with new generations of disk drives.

A block will be stored as a contiguous piece of information on the disk, which means that the total time to read it completely is the time to locate it (seek time) + the time to read its content without doing any more seeks, i.e. sizeOfTheBlock / transferRate = transferTime.
If we keep the ratio seekTime / transferTime small (close to .01 in the text), it means we are reading data from the disk almost as fast as the physical limit imposed by the disk, with minimal time spent looking for information.
This is important because in MapReduce jobs we typically traverse (read) the whole data set (represented by an HDFS file, folder, or set of folders) and apply logic to it. Since we have to spend the full transferTime anyway to get all the data off the disk, we try to minimise the time spent seeking and read in big chunks, hence the large size of the data blocks.
In more traditional disk access software we typically do not read the whole data set every time, so we would rather spend more time doing plenty of seeks on smaller blocks than lose time transferring data we won't need.
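As a back-of-envelope sketch of the quoted calculation (not HDFS code; the 10 ms seek time, 100 MB/s transfer rate and 1% target are simply the figures from the quote):

    # How large must a block be so that seek time is only a given
    # fraction of the transfer time?
    seek_time_s = 0.010        # ~10 ms average seek
    transfer_rate_mb_s = 100   # ~100 MB/s sequential transfer
    target_ratio = 0.01        # want seekTime / transferTime ~= 1%

    # seekTime / transferTime = seek_time_s / (block_mb / transfer_rate_mb_s)
    # Solving for block_mb:
    block_mb = seek_time_s * transfer_rate_mb_s / target_ratio
    print(round(block_mb))     # 100 -> a block size of around 100 MB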

If the same 100 MB were instead split into ten 10 MB blocks, you would have to do 10 seeks, and each block takes 10 MB / 100 MB/s = 0.1 s to transfer:
(10 ms x 10) + (0.1 s x 10) = 1.1 s, which is greater than the 1.01 s for a single 100 MB block anyway.

A common follow-up confusion: since the 100 MB is divided among 10 blocks, each block holds only 10 MB, so shouldn't it be 10 x 10 ms + 10 MB / (100 MB/s) = 0.1 s + 0.1 s = 0.2 s, i.e. even less time? The catch is that all ten blocks still have to be transferred, so the transfer term is 10 x 0.1 s = 1 s, not 0.1 s, which gives the 1.1 s above.
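To make the arithmetic in these comments concrete, here is a small sketch (assuming one seek per block and the 10 ms / 100 MB/s figures from the quoted text):

    # Total time to read 100 MB when it is split into N blocks,
    # assuming one 10 ms seek per block and a 100 MB/s transfer rate.
    def read_time_s(total_mb=100, n_blocks=1, seek_s=0.010, rate_mb_s=100):
        # The whole 100 MB is transferred either way; only the number
        # of seeks changes with the block size.
        return n_blocks * seek_s + total_mb / rate_mb_s

    print(read_time_s(n_blocks=1))    # 1.01 s -> one 100 MB block
    print(read_time_s(n_blocks=10))   # 1.1 s  -> ten 10 MB blocks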

Related

Scan time of cache, RAM and disk?

Consider a computer system that has cache memory, main memory (RAM), and disk, and whose OS uses virtual memory. It takes 2 nsec to access a byte from the cache, 20 nsec to access a byte from RAM, and 10 msec to access a block of 1000 bytes from the disk. If a book has 1000 pages, each with 50 lines of 80 characters, how long will it take to electronically scan the text when the master copy resides at each level of the memory hierarchy in turn, proceeding downward (from inboard memory to offline storage)?
If a book has 1000 pages, each with 50 lines of 80 characters each, then the book has 1000 * 50 * 80 = 4000000 characters. We don't know how big a character is (and it could be UTF8 where different characters are different sizes), we don't know how much meta-data there is (e.g. if it's a word-processor file with extra information in a header about page margins, tab stops; plus more data for which font and style, etc) or if there's any extra processing (e.g. compression of pieces within the file).
If we make an unfounded assumption that the file happens to be 4000000 bytes; then we might say that it'll be 4000 blocks (with 1000 bytes per block) on disk.
Then we get into trouble. A CPU can't access data on disk (and can only access data in RAM or cache); so it needs to be loaded into RAM (e.g. by a disk controller) before the CPU can access it.
If it takes the disk controller 10 msec to access a block of 1000 bytes from disk, then we might say it will take at least 10 msec * 4000 = 40000 msec = 40 seconds to read the whole file into RAM. However this would be wrong - the disk controller (acting on requests by file system support code) will have to find the file (e.g. read directory info, etc), and the file may be fragmented, so the disk controller will need to read (and then follow) a "list of where the pieces of the file are".
Of course while the CPU is scanning the first part of the file the disk controller can be reading the last part of the file; either because the software is designed to use asynchronous IO or because the OS detected a sequential access pattern and started pre-fetching the file before the program asked for it. In other words; the ideal case is that when the disk controller finishes loading the last block the CPU has already scanned the first 3999 blocks and only has to scan 1 more (and the worst case is that the disk controller and CPU never do anything at the same time, in which case it becomes "40 seconds to load the file into RAM plus however long it takes the CPU to scan the data in RAM").
Of course we also don't know things like (e.g.) whether the file is actually loaded 1 block at a time (if it's split into 400 transfers with 10 blocks per transfer then the "ideal case" would be worse, as the CPU would have to scan the last 10 blocks after they're loaded and not just the last block), or how many reads the disk controller does before a pre-fetcher detects that it's a sequential pattern.
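For what it's worth, the raw numbers in the question work out like this (a sketch that ignores directory reads, fragmentation and pre-fetch details, as discussed above; the 1 byte per character is the same unfounded assumption):

    # Naive estimate of the time to load the book from disk into RAM.
    chars = 1000 * 50 * 80            # pages * lines * chars = 4,000,000
    bytes_per_char = 1                # unfounded assumption (could be UTF-8, etc.)
    block_bytes = 1000                # "a block of 1000 bytes from the disk"
    block_access_s = 0.010            # 10 msec per block

    blocks = chars * bytes_per_char // block_bytes   # 4000 blocks
    load_s = blocks * block_access_s                 # 40 seconds, best case
    print(blocks, load_s)             # 4000 blocks, 40.0 s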
Once the file is in RAM we have more problems.
Specifically; anyone that understands how caches work will know that "It takes 2 nsec to access a byte from the cache, 20 nsec to access a byte from RAM" means that when you access one byte in RAM it takes 18 nsec to transfer a cache line (a group of consecutive bytes) from RAM to cache plus 2 nsec to obtain that 1 byte from cache; and then the next byte you access will have already been transferred to cache (as it's part of "group of consecutive bytes") and will only cost 2 nsec.
After the file's data is loaded into RAM by disk controller; because we don't know the cache line size, we don't know how many of software's accesses will take 20 nsec and how many will take 2 nsec.
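As a rough illustration of why the cache line size matters (the line sizes below are assumptions, since the question doesn't specify one):

    # Average cost per byte of a sequential scan through RAM, for a few
    # assumed cache line sizes.
    ram_byte_ns = 20      # first access to a line: RAM -> cache -> CPU
    cache_byte_ns = 2     # remaining bytes of that line: already in cache

    for line_bytes in (32, 64, 128):
        avg_ns = (ram_byte_ns + (line_bytes - 1) * cache_byte_ns) / line_bytes
        print(line_bytes, round(avg_ns, 2))   # e.g. 64-byte lines -> ~2.28 ns/byte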
The final thing is that we don't actually know anything useful about the caches. The general rule is that the larger a cache is the slower it is; and "large enough to contain the entire book (plus the program's code, stack, parts of the OS, etc) but fast enough to have a 2 nsec access times" is possibly an order of magnitude better than any cache that has ever existed. Essentially, the words "the cache" (in the question) can not be taken literally as it would be implausible. If we look at anything that's close to being able to provide a 2 nsec access time we see multiple levels - e.g. a fast and small L1 cache with a slower but larger L2 cache (with an even slower but larger L3 cache). To make sense of the question you must assume that "the cache" meant "the set of caches", and that the L1 cache has an access time of 2 nsec (but is too small to hold the whole file and everything else) and other levels of the cache hierarchy (e.g. L2 cache) have unknown slower access times but may be large enough to hold the whole file (and everything else).
Mostly; if I had to guess; I'd assume that the question was taken from a university course (because universities have a habit of tricking students into paying $$$ for worthless "fictional knowledge").

How is disk seek faster in a column-oriented database?

I have recently started working with BigQuery. I've come to know that it is a column-oriented database and that disk seek is much faster in this type of database.
Can anyone explain how disk seek is faster in a column-oriented database compared to a relational (row-oriented) database?
The big difference is in the way the data is stored on disk.
Let's look at an (over)simplified example:
Suppose we have a table with 50 columns, some are numbers (stored binary) and others are fixed width text - with a total record size of 1024 bytes. Number of rows is around 10 million, which gives a total size of around 10GB - and we're working on a PC with 4GB of RAM. (while those tables are usually stored in separate blocks on disk, we'll assume the data is stored in one big block for simplicity).
Now suppose we want to sum all the values in a certain column (integers stored as 4 bytes in the record). To do that we have to read an integer every 1024 bytes (our record size).
The smallest amount of data that can be read from disk is a sector and is usually 4kB. So for every sector read, we only have 4 values. This also means that in order to sum the whole column, we have to read the whole 10GB file.
In a column store on the other hand, data is stored in separate columns. This means that for our integer column, we have 1024 values in a 4096 byte sector instead of 4! (and sometimes those values can be further compressed) - The total data we need to read now is around 40MB instead of 10GB, and that will also stay in the disk cache for future use.
It gets even better if we look at the CPU cache (assuming data is already cached from disk): one integer every 1024 bytes is far from optimal for the CPU (L1) cache, whereas 1024 integers in one block will speed up the calculation dramatically (those will be in the L1 cache, which is around 50 times faster than a normal memory access).
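A quick sketch of the arithmetic in this example (the 1024-byte record, 4-byte integer, 4 KB sector and 10 million rows are the assumptions stated above):

    # How much data must be read from disk to sum one 4-byte integer column?
    rows = 10_000_000
    record_bytes = 1024        # row store: the full record is read for every row
    value_bytes = 4            # the integer column we want to sum
    sector_bytes = 4096        # smallest unit the disk can read

    print(sector_bytes // record_bytes, sector_bytes // value_bytes)  # 4 vs 1024 useful values per sector
    row_store_bytes = rows * record_bytes      # every sector holds only 4 useful values
    column_store_bytes = rows * value_bytes    # the column is stored contiguously
    print(row_store_bytes / 1e9, column_store_bytes / 1e6)            # ~10 GB vs ~40 MB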
The "disk seek is much faster" is wrong. The real question is "how column oriented databases store data on disk?", and the answer usually is "by sequential writes only" (eg they usually don't update data in place), and that produces less disk seeks, hence the overall speed gain.

Why is two-pass merge sorting more efficient than one-pass merge sorting?

I am reading about external sorting on Wikipedia and need to understand why two-pass merging is more efficient than one-pass merging.
Wiki: However, there is a limitation to single-pass merging. As the number of chunks increases, we divide memory into more buffers, so each buffer is smaller, so we have to make many smaller reads rather than fewer larger ones.
Thus, for sorting, say, 50 GB in 100 MB of RAM, using a single merge pass isn't efficient: the disk seeks required to fill the input buffers with data from each of the 500 chunks (we read 100 MB / 501 ≈ 200 KB from each chunk at a time) take up most of the sort time. Using two merge passes solves the problem. Then the sorting process might look like this:
Run the initial chunk-sorting pass as before.
Run a first merge pass combining 25 chunks at a time, resulting in 20 larger sorted chunks.
Run a second merge pass to merge the 20 larger sorted chunks.
Could anyone give me a simple example to help me understand this concept? I am particularly confused about allocating more buffers in two-pass merging.
The issue is random access overhead: average rotational delay is around 4 ms and average seek time around 9 ms, so call it roughly 10 ms average access time, versus an average transfer rate of around 150 megabytes per second. The average access overhead therefore costs about the same time as reading or writing 1.5 megabytes.
Large I/Os reduce the relative overhead of access time. If data is read or written 10 megabytes at a time, the overhead is reduced to 15%. Using the wiki example of a 100 MB working buffer and 25 chunks to merge, plus one chunk for writes, that's about 3.8 megabyte chunks, in which case random access overhead is about 39%.
On most PCs it's possible to allocate enough memory for a 1 GB working buffer, so 26 chunks at about 40 megabytes per chunk, reducing random access overhead to about 3.75%. 50 chunks would be about 20 megabytes per chunk, with random access overhead around 7.5%.
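A quick sketch of that overhead arithmetic (assuming the 10 ms average access and ~150 MB/s transfer rate used above):

    # Random access overhead relative to transfer time, for the chunk sizes
    # discussed above (the access costs about as much as moving 1.5 MB).
    def overhead_pct(chunk_mb, access_s=0.010, rate_mb_s=150):
        transfer_s = chunk_mb / rate_mb_s        # time actually moving data
        return 100 * access_s / transfer_s       # access time as % of that

    for chunk_mb in (100 / 26, 10, 1024 / 51, 1024 / 26):
        print(f"{chunk_mb:5.1f} MB -> {overhead_pct(chunk_mb):4.1f}% overhead")
    # ~3.8 MB -> ~39%, 10 MB -> 15%, ~20 MB -> ~7.5%, ~39 MB -> ~3.8%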
Note that even with 50 chunks it's still a two-pass sort: the first pass uses all of the 1 GB buffer to do an in-memory sort, then writes sorted 1 GB chunks; the second pass merges all 50 chunks into the sorted file.
The other issue with 50 chunks is that a min-heap type method is needed to keep track of which element of which chunk is the smallest of the 50 and gets moved to the output buffer. After an element is moved to the output buffer, a new element from that chunk is pushed into the heap and the heap is re-heapified, which takes about 2 log2(50) operations, or about 12 operations. This is better than the simple approach of doing 49 compares to determine the smallest element from a group of 50 when doing the 50-way merge. Each heap entry is a structure holding the current element, which chunk it came from (file or file position), the number of elements left in that chunk, and so on.
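The heap-based k-way merge described above looks roughly like this (a sketch using Python's heapq, with in-memory lists standing in for the sorted chunk files):

    import heapq

    def k_way_merge(sorted_chunks):
        """Merge already-sorted chunks using a min heap of (value, chunk index)."""
        heap = []
        iters = [iter(c) for c in sorted_chunks]
        for i, it in enumerate(iters):
            first = next(it, None)
            if first is not None:
                heapq.heappush(heap, (first, i))
        while heap:
            value, i = heapq.heappop(heap)       # smallest of the k current heads
            yield value                          # goes to the output buffer
            nxt = next(iters[i], None)
            if nxt is not None:
                heapq.heappush(heap, (nxt, i))   # re-heapify: ~2*log2(k) operations

    print(list(k_way_merge([[1, 4, 9], [2, 3, 8], [5, 6, 7]])))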

Which is faster to process a 1TB file: a single machine or 5 networked machines?

Which is faster to process a 1TB file: a single machine or 5 networked machines? ("To process" refers to finding the single UTF-16 character with the most occurrences in that 1TB file). The rate of data transfer is 1Gbit/sec, the entire 1TB file resides in 1 computer, and each computer has a quad core CPU.
Below is my attempt at the question using an array of longs (with array size of 2^16) to keep track of the character count. This should fit into memory of a single machine, since 2^16 x 2^3 (size of long) = 2^19 = 0.5MB. Any help (links, comments, suggestions) would be much appreciated. I used the latency times cited by Jeff Dean, and I tried my best to use the best approximations that I knew of. The final answer is:
Single Machine: 5.8 hrs (due to slowness of reading from disk)
5 Networked Machines: 7.64 hrs (due to reading from disk and network)
1) Single Machine
a) Time to Read File from Disk --> 5.8 hrs
-If it takes 20ms to read 1MB seq from disk,
then to read 1TB from disk takes:
20ms/1MB x 1024MB/GB x 1024GB/TB = 20,972 secs
= 350 mins = 5.8 hrs
b) Time needed to fill array w/complete count data
--> 0 sec since it is computed while doing step 1a
-At 0.5 MB, the count array fits into L2 cache.
Since L2 cache takes only 7 ns to access,
the CPU can read & write to the count array
while waiting for the disk read.
Time: 0 sec since it is computed while doing step 1a
c) Iterate thru entire array to find max count --> 0.00625ms
-Since it takes 0.0125ms to read & write 1MB from
L2 cache and array size is 0.5MB, then the time
to iterate through the array is:
0.0125ms/MB x 0.5MB = 0.00625ms
d) Total Time
Total=a+b+c=~5.8 hrs (due to slowness of reading from disk)
2) 5 Networked Machines
a) Time to transfer 1TB over 1Gbit/s --> 6.48 hrs
1TB x 1024GB/TB x 8Gbit/GB x 1s/Gbit
= 8,192s = 137min = 2.3hr
But since the original machine keeps a fifth of the data, it
only needs to send (4/5)ths of data, so the time required is:
2.3 hr x 4/5 = 1.84 hrs
*But to send the data, the data needs to be read, which
is (4/5)(answer 1a) = (4/5)(5.8 hrs) = 4.64 hrs
So total time = 1.84hrs + 4.64 hrs = 6.48 hrs
b) Time to fill array w/count data from original machine --> 1.16 hrs
-The original machine (that had the 1TB file) still needs to
read the remainder of the data in order to fill the array with
count data. So this requires (1/5)(answer 1a)=1.16 hrs.
The CPU time to read & write to the array is negligible, as
shown in 1b.
c) Time to fill other machine's array w/counts --> not counted
-As the file is being transferred, the count array can be
computed. This time is not counted.
d) Time required to receive 4 arrays --> (2^-6)s
-Each count array is 0.5MB
0.5MB x 4 arrays x 8bits/B x 1s/Gbit
= 2^19B x 2^2 x 2^3 bits/B x 1s/2^30bits
= 2^24/2^30 s = (2^-6)s
e) Time to merge arrays
--> 0 sec (since they can be merged while receiving)
f) Total time
Total=a+b+c+d+e =~ a+b =~ 6.48 hrs + 1.16 hrs = 7.64 hrs
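The totals above can be reproduced in a few lines (same assumptions as the attempt: 20 ms to read 1 MB sequentially from disk, a 1 Gbit/s link, source machine keeps one fifth of the data):

    # Reproduce the back-of-envelope totals from the attempt above.
    read_s_per_mb = 0.020
    tb_mb = 1024 * 1024
    single_read_s = read_s_per_mb * tb_mb                 # ~20,972 s

    net_s = 1024 * 8                                      # 1 TB over 1 Gbit/s = 8,192 s
    send_s = net_s * 4 / 5                                # source keeps 1/5 of the data
    read_for_send_s = single_read_s * 4 / 5               # data must be read before sending
    local_read_s = single_read_s * 1 / 5                  # source's own fifth

    print(round(single_read_s / 3600, 2))                 # ~5.8 hr (single machine)
    print(round((send_s + read_for_send_s + local_read_s) / 3600, 2))  # ~7.6 hr (5 machines, per this model)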
This is not an answer but just a longer comment. You have miscalculated the size of the frequency array. A 1 TiB file contains 550 Gsyms and, because nothing is said about their expected frequency, you would need a count array of at least 64-bit integers (that is 8 bytes/element). The total size of this frequency array would be 2^16 * 8 = 2^19 bytes or just 512 KiB and not 4 GiB as you have miscalculated. It would only take ≈4.3 ms to send this data over a 1 Gbps link (protocol headers take roughly 3% if you use TCP/IP over Ethernet with an MTU of 1500 bytes; less with jumbo frames, but they are not widely supported). Also this array size fits perfectly in the CPU cache.
You have grossly overestimated the time it would take to process the data and extract the frequency and you have also overlooked the fact that it can overlap disk reads. In fact it is so fast to update the frequency array, which resides in the CPU cache, that the computation time is negligible as most of it will overlap the slow disk reads. But you have underestimated the time it takes to read the data. Even with a multicore CPU you still have only one path to the hard drive and hence you would still need the full 5.8 hrs to read the data in the single machine case.
In fact, this is an example of the kind of data processing that benefits neither from parallel networked processing nor from having more than one CPU core. This is why supercomputers and other fast networked processing systems use distributed parallel file storage that can deliver many GB/s of aggregate read/write speed.
You only need to send 0.8 TB if your source machine is part of the 5.
It may not even make sense sending the data to other machines. Consider this:
In order for the source machine to send the data, it must first hit the disk to read the data into main memory before sending it over the network. If the data is already in main memory and not being processed, you are wasting that opportunity.
So under the assumption that loading to CPU cache is much less expensive than disk to memory or data over network (which is true, unless you are dealing with alien hardware), then you are better off just doing it on the source machine, and the only place splitting up the task makes sense is if the "file" is somehow created/populated in a distributed way to start with.
So you should only count the disk read time of a 1Tb file, with a tiny bit of overhead for L1/L2 cache and CPU ops. The cache access pattern is optimal since it is sequential so you only cache miss once per piece of data.
The primary point here is that disk is the primary bottleneck which overshadows everything else.
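Incidentally, the counting step that everyone agrees is negligible looks roughly like this (a sketch, assuming little-endian UTF-16 code units and the 2^16-entry count array described in the question; in the real scenario it would run over each chunk as it is read from disk, overlapping the I/O):

    # Count occurrences of each 16-bit code unit in a chunk of UTF-16 data,
    # then report the most frequent one.
    def count_chunk(chunk: bytes, counts=None):
        counts = counts or [0] * (1 << 16)        # 2^16 counters, ~0.5 MB as 64-bit ints
        for i in range(0, len(chunk) - 1, 2):
            code_unit = chunk[i] | (chunk[i + 1] << 8)   # little-endian UTF-16 code unit
            counts[code_unit] += 1
        return counts

    counts = count_chunk("hello world".encode("utf-16-le"))
    print(max(range(1 << 16), key=counts.__getitem__))   # 108 -> 'l' is the most frequent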

Random write Vs Seek time

I have a very weird question here...
I am trying to write the data randomly to a file of 100 MB.
The data size is 4 KB and the random offset is page-aligned (4 KB).
I am trying to write 1 GB of data at random offset on 100 MB file.
If I remove the actual code that writes the data to the disk, the entire operation takes less than a second (say 0.04 sec).
If I keep the code that writes the data, it takes several seconds.
In the case of a random write operation, what happens internally? Is the cost the seek time or the write time? From the above scenario it's really confusing!
Can anybody explain in depth please?
When the same procedure is applied with sequential offsets, the write is very fast.
Thank you.
If you're writing all over the file, then the disk (I presume this is on a disk) needs to seek to a new place on every write.
Also, the write speed of hard disks isn't particularly stunning.
Say for the sake of example (taken from a WD Raptor EL150) that we have a 5.9 ms seek time. If you are writing 1 GB randomly everywhere in 4 KB chunks, you're seeking 1,000,000,000 ÷ 4,000 × 0.0059 seconds = a total seeking time of roughly 1,475 seconds!
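The same arithmetic as a sketch (assuming one seek per 4 KB write, which is the worst case described above):

    # Rough cost of 1 GB of 4 KB random writes when every write needs a seek,
    # using the ~5.9 ms seek time quoted above.
    seek_s = 0.0059
    writes = 1_000_000_000 // 4_000        # 250,000 random 4 KB writes
    print(round(writes * seek_s))          # ~1,475 s spent purely seeking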
