Number of seeks() when reading a file in Hadoop?

I want to make sure I am getting this concept right:
In Hadoop: The Definitive Guide it is stated that: "the goal while designing a file system is always to reduce the number of seeks in comparison to the amount of data to be transferred." In this statement the author is referring to the "seeks()" of Hadoop logical blocks, right?
I am thinking that no matter how big the Hadoop block size is (64MB, 128MB, or bigger), the number of seeks over the physical blocks (which are usually 4KB or 8KB) that the underlying filesystem (e.g. ext3/fat) will have to perform will be the same regardless of the Hadoop block size.
Example: to keep numbers simple, assume the underlying file system block size is 1MB. We want to read a file of size 128MB. If the Hadoop block size is 64MB, the file occupies 2 blocks. When reading, the underlying file system still performs 128 seeks. If the Hadoop block size is increased to 128MB, the number of seeks performed by the file system is still 128. In the second case, Hadoop will perform 1 seek instead of 2.
Is my understanding correct?
If I am correct, a substantial performance improvement from increasing the block size will only be observed for very large files, right? I am thinking that in the case of files in the ~1GB size range, reducing the number of seeks from ~20 (64MB block size) to ~10 (128MB block size) shouldn't make much of a difference, right?
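For concreteness, here is a rough sketch (in C, with illustrative figures only) of the arithmetic in the question for a ~1GB file: the number of underlying-filesystem block reads stays the same, while the number of Hadoop-block boundaries halves when the HDFS block size doubles.

    /* Illustrative arithmetic only: a ~1GB file, a 1MB underlying FS block
     * (as in the example above), and HDFS block sizes of 64MB and 128MB. */
    #include <stdio.h>

    int main(void) {
        const long long file_mb = 1024;          /* ~1GB file                        */
        const long long fs_block_mb = 1;         /* underlying filesystem block size */

        for (long long hdfs_mb = 64; hdfs_mb <= 128; hdfs_mb *= 2) {
            long long hdfs_blocks = (file_mb + hdfs_mb - 1) / hdfs_mb;
            long long fs_reads    = file_mb / fs_block_mb;
            printf("HDFS block %3lldMB: %2lld HDFS-block seeks, %lld FS-block reads\n",
                   hdfs_mb, hdfs_blocks, fs_reads);
        }
        return 0;
    }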

You are correct that increasing the file system block size will improve performance. Linux requires that the block size be less than or equal to the page size. The x86 page size is limited to 4K; therefore, the largest block size that you can use is 4K even if the file system can support larger block sizes. The performance benefits of a large block size and page size are significant: reduction in read/write system calls, reduction in rotational delays and seeks (don't begin to consider SSDs), fewer context switches, improved cache locality, fewer TLB misses, etc. This is all goodness.
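If you want to check these values on your own system, a minimal POSIX sketch is below; sysconf() reports the page size and statvfs() reports a filesystem's block size (the "/" mount point is just an example).

    /* Minimal check of page size and filesystem block size on a POSIX system. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/statvfs.h>

    int main(void) {
        printf("page size: %ld bytes\n", sysconf(_SC_PAGESIZE));

        struct statvfs vfs;
        if (statvfs("/", &vfs) == 0)
            printf("filesystem block size: %lu bytes\n", (unsigned long)vfs.f_bsize);
        return 0;
    }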
I analytically modeled the benefits of various block sizes based on our disk usage pattern and in some cases predicted order of magnitude improvements from the disk subsystem. This would shift the performance bottleneck elsewhere.
You are correct that substantial performance gains are possible. Unfortunately, a certain engineer who controls such improvements sees no value in page sizes larger than 4K. He mocks enterprise users who need high performance from largely homogeneous workloads on big iron and focuses on heterogeneous workloads that are run interactively on desktop or laptop systems where high performance is unimportant.

Related

What is the advantage of caching an entire line instead of a single byte or word at a time?

To use cache memory, main memory is divided into cache lines, typically 32 or 64 bytes long. An entire cache line is cached at once. What is the advantage of caching an entire line instead of a single byte or word at a time?
This is done to exploit the principle of locality; spatial locality to be precise. This principle states that the data bytes which lie close together in memory are likely to be referenced together in a program. This is immediately apparent when accessing large arrays in loops. However, this is not always true (e.g. pointer based memory access) and hence it is not advisable to fetch data from memory at more than the granularity of cache lines (in case the program does not have locality of reference) since cache is a very limited and important resource.
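As a small illustration of that point (names and sizes here are made up), a sequential array sweep benefits from whole-line fills, while pointer chasing through scattered nodes may take a miss on every hop:

    /* Contrast between an access pattern with spatial locality (array sweep)
     * and one without (pointer chasing). Purely illustrative. */
    #include <stddef.h>

    long sum_array(const int *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];      /* consecutive ints share a cache line: one miss brings
                               in the next several elements for free */
        return s;
    }

    struct node { int value; struct node *next; };

    long sum_list(const struct node *p) {
        long s = 0;
        for (; p != NULL; p = p->next)
            s += p->value;  /* nodes may be scattered across the heap, so each hop
                               can be a fresh cache miss */
        return s;
    }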
Having a cache block size equal to the smallest addressable size would mean that, if larger accesses are supported, multiple tags would have to be checked for those larger accesses. While parallel tag checking is often used for set-associative caches, a four-fold increase (8-bit compared to 32-bit) in the number of tags to check would increase access latency and greatly increase energy cost. In addition, such a design introduces the possibility of partial hits for larger accesses, increasing the complexity of sending the data to a dependent operation or internal storage. While data can be speculatively sent by assuming a full hit (so latency need not be hurt by the possibility of partial hits), the complexity budget is better not spent on supporting partial hits.
32-bit cache blocks, when the largest access size is 32 bits, would avoid the above-mentioned issues, but would use a significant fraction of storage for tags. E.g., a 16KiB direct-mapped cache in a 32-bit address space would use 18 bits for the address portion of the tag; even without additional metadata such as coherence state, tags would use 36% of the storage. (Additional metadata might be avoided by having a 16KiB region of the address space be non-cacheable; a tag matching this address region would be interpreted as "invalid".)
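A quick reproduction of that 36% figure (assuming 4-byte blocks, a 16KiB direct-mapped cache, and a 32-bit address space, as in the example above):

    /* Tag overhead for a 16KiB direct-mapped cache with 32-bit (4-byte) blocks
     * in a 32-bit address space. */
    #include <stdio.h>

    int main(void) {
        const int addr_bits   = 32;
        const int block_bytes = 4;       /* 32-bit blocks                        */
        const int offset_bits = 2;       /* log2(4)                              */
        const int index_bits  = 12;      /* 16KiB / 4B = 4096 blocks, log2(4096) */
        const int tag_bits    = addr_bits - index_bits - offset_bits;   /* = 18  */

        double tag_share = (double)tag_bits / (tag_bits + 8 * block_bytes);
        printf("tag bits per block: %d\n", tag_bits);
        printf("tag share of storage: %.0f%%\n", 100.0 * tag_share);    /* = 36% */
        return 0;
    }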
Besides the storage overhead, having more tag data tends to increase latency (smaller tag storage facilitates earlier way selection) and access energy. In addition, having a smaller number of blocks for a cache of a given size makes way prediction and memoization easier; these techniques are used to reduce latency and/or access energy.
(The storage overhead can be a significant factor when it allows tags to be on chip while data is too large to fit on chip. If data uses a denser storage type than tags — e.g., data in DRAM and tags in SRAM with a four-fold difference in storage density — lower tag overhead becomes more significant.)
If caches only exploited temporal locality (the reuse of a memory location within a "short" period of time), this would typically be the most attractive block size. However, spatial locality of access (accesses to locations near an earlier access often being close in time) is common. Taken control flow instructions are typically less than a sixth of all instructions, and many branches and jumps are short (so the branch/jump target is somewhat likely to be within the same cache block as the branch/jump instruction if each cache block holds four or more instructions). Stack frames are local to a function (concentrating the timing of accesses, especially for leaf functions, which are common). Array accesses often use unit stride or very small strides. Members of a structure/object tend to be accessed nearby in time (conceptually related data tends to be related in action/purpose and so accessed nearer in time). Even some memory allocation patterns bias access toward spatial locality; related structures/objects are often allocated nearby in time, and if the preferred free memory is not fragmented (which is likely if spatially local allocations are freed nearby in time, if little memory has been freed, or if the allocator is clever in reducing fragmentation), then such allocations are more likely to be spatially local.
With multiple caches, coherence overhead also tends to be reduced with larger cache blocks (under the assumption of spatial locality). False sharing increases coherence overhead (similar to lack of spatial locality increasing capacity and conflict misses).
In this sense, larger cache blocks can be viewed as a simple form of prefetching (even with respect to coherence). Prefetching trades bandwidth and cache capacity for a reduction in latency via cache hits (as well as from increasing the useful queue size and scheduling flexibility). One could gain the same benefit by always prefetching a chunk of memory into multiple small cache blocks, but the capacity benefit of finer-grained eviction would be modest because spatial locality of use is common. In addition, to avoid prefetching data that is already in the cache, the tags for the other blocks would have to be probed to check for hits.
With simple modulo-power-of-two indexing and modest associativity, two spatially nearby blocks are more likely to conflict with, and evict earlier, other blocks that also have spatial locality (index A and index B will have the same spatial locality relationship for all addresses mapping to indexes within a larger address range). With LRU-oriented replacement, accesses within a larger cache block reduce the chance of a too-early eviction when spatial locality is common, at the cost of some capacity and conflict misses.
(For a direct-mapped cache, there is no difference between always prefetching a multi-block aligned chunk and using a larger cache block, so paying the extra tag overhead would be pointless.)
Prefetching into a smaller buffer would avoid cache pollution from used data, increasing the benefit of smaller block size, but such a buffer also reduces the temporal scope of the spatial locality. A four-entry prefetch buffer would only support spatial locality within four cache misses; this would catch most stream-like accesses (rarely will more than four new "streams" be active at the same time) and many other cases of spatial locality, but some spatial locality extends over a larger span of time.
Mandatory prefetching (whether from larger cache blocks or a more flexible mechanism) provides significant bandwidth advantages. First, the address and request type overhead is spread over a larger amount of data. 32 bits of address and request type overhead per 32 bit access uses 50% of the bandwidth for non-data but less than 12% when 256 bits of data are transferred.
Second, the memory controller processing and scheduling overhead can be more easily averaged over more transferred data.
Finally, DRAM chips can provide greater bandwidth by exploiting internal prefetch. Even in the days of Fast Page Mode DRAM, accesses within the same DRAM page were faster and higher bandwidth (less page precharge and activation overhead); while non-mandatory prefetch could exploit such behavior and be more general, the control and communication overheads would be larger. Modern DRAMs have minimum burst lengths (and burst chop merely drops part of the DRAM-chip-internal prefetch — the internal access energy and array occupation are not reduced).
The ideal cache block size depends on workload ('natural' algorithm choices and legacy optimization assumptions, data set sizes and complexity, etc.), cache sizes and associativity (larger and more associative caches encourage larger blocks), available bandwidth, use of in-cache data compression (which tends to encourage larger blocks), cache block sectoring (where validity/coherence state is tracked at finer granularity than the address), and other factors.
The main advantage of caching an entire line is that the probability of the next cache hit is increased.
From Tanenbaum's "Modern Operating Systems" book:
Cache-hit: When the program needs to read a memory word, the cache hardware checks to see if the line needed is in the cache.
If we don't have a cache hit, a cache miss occurs and a memory request is sent to main memory.
As a result, more time is spent completing the access, since fetching from main memory is costly.
We can say that caching an entire line increases the probability that nearby accesses complete within about two clock cycles (as cache hits) rather than requiring a trip to main memory.

How is the circular buffer used for the spill process in Hadoop?

From "Hadoop the definitive guide"
[Each map task has a circular memory buffer that it writes the output to. The buffer is 100 MB by default, a size that can be tuned by changing the io.sort.mb property. When the contents of the buffer reaches a certain threshold size (io.sort.spill.percent, which has the default 0.80, or 80%), a background thread will start to spill the contents to disk]
The question here is: since each map task works on a single input split (which is more or less equal to the HDFS block size, i.e. 64 MB), the condition for spilling to disk should never arise. Am I missing something? Please help.
Why do you assume the split size or the block size would be 64 MB? In practice I have seen that a small block size reduces performance (for the scale of data I analyse); I have seen better performance with a block size/split size of 256 MB in my use case.
Coming back to your question,
Having way too many mappers is also an overhead on the framework. Going by the use case mentioned in the question, we might not spill keys and values from the 100 MB circular buffer. But consider the case where the split size is 64 MB and the mapper does some calculations based on the input and emits additional calculation results as part of the map output; there is a good chance that the map output is larger than the configured circular buffer size. In another use case, we have 64 MB of block-compressed data that simply bursts up in size when processed. Also consider mappers that fetch additional data from "side data distribution" or the "distributed cache" in the map phase.
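To make the mechanism from the quoted passage concrete, here is a deliberately simplified sketch (not Hadoop's actual MapOutputBuffer code, and single-threaded where Hadoop uses a background spill thread): map output accumulates in a fixed buffer and is flushed to a numbered spill file once the fill level crosses the threshold. The sizes mirror the quoted defaults; everything else is made up for illustration.

    /* Simplified, single-threaded illustration of the spill idea. */
    #include <stdio.h>
    #include <string.h>

    #define BUF_SIZE   (100u * 1024 * 1024)   /* stands in for io.sort.mb            */
    #define SPILL_FRAC 0.80                   /* stands in for io.sort.spill.percent */

    static unsigned char buffer[BUF_SIZE];
    static size_t used = 0;
    static int spill_count = 0;

    static void spill_to_disk(void) {
        char name[64];
        snprintf(name, sizeof name, "spill%d.out", spill_count++);
        FILE *f = fopen(name, "wb");
        if (f) { fwrite(buffer, 1, used, f); fclose(f); }
        used = 0;                              /* buffer is free for new records     */
    }

    /* Called for every map output record; spills once the threshold is reached. */
    static void emit(const void *record, size_t len) {
        if (used + len > BUF_SIZE || used >= (size_t)(BUF_SIZE * SPILL_FRAC))
            spill_to_disk();
        memcpy(buffer + used, record, len);
        used += len;
    }

    int main(void) {
        static unsigned char record[1024 * 1024];   /* dummy 1 MB output record      */
        for (int i = 0; i < 101; i++)
            emit(record, sizeof record);            /* crosses the 80 MB threshold   */
        spill_to_disk();                            /* final flush at "task end"     */
        printf("spills written: %d\n", spill_count);
        return 0;
    }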
Just an additional Note:
From my experience I can clearly say that when we work with a framework, the default configuration will never suit our requirements; we need to tweak and tune the system to give us the best possible performance.

cache memory size limitations

I know that cache memory stores frequently used data to speed up execution instead of fetching it from the (slower) main memory every time, and that its size is always small in comparison with main memory because it is an expensive technology and because the data actually being processed at any one time is much smaller than the whole data set held in main memory.
But are there any limitations or constraints on cache memory size for a given CPU speed or a given main memory size? Theoretically, if we increased the cache size a lot, would that have an adverse effect, or would it just be a wasted increase?
Indeed, the performance gain becomes less and less significant beyond a cache size of 64KB.
Here is a graph from Wikipedia showing that, regardless of the set-associativity scheme, the miss rate decreases only slightly as the cache size increases past 64KB.
Caches are small because the silicon used to build them is quite expensive and, especially on CISC-type CPUs, there might not be enough space on the chip to hold them. Also, making chips bigger has its cost, and there's the possibility that the chip won't fit in its socket, which adds many more issues. It's not that simple ;)
EDIT:
Well, I haven't got any papers about this, but I'll explain my opinion anyway with a simple question: if a program needs x bytes of memory, what would be the difference if the cache's size is 10 * x bytes or 100 * x? Once all the data is loaded in the cache (which doesn't depend on its size at all), the difference is all in the cache's access speed. And given locality of reference, it's not necessary to have everything in the cache.
Also, having big caches requires better algorithms for locating the requested data in them. For example, accessing data in a fully associative cache will become slower than accessing main memory as the cache size increases (which implies there are more and more places to look for the data). Considering multitasking systems, though, introduces other issues which I don't actually know much about.
To conclude, the performance gain from increasing the cache size becomes smaller as the cache size approaches the usual amount of data used by all the software running on a given machine.

What are the most efficient idioms for streaming data from disk with constant space usage?

Problem Description
I need to stream large files from disk. Assume the files are larger than will fit in memory. Furthermore, suppose that I'm doing some calculation on the data and the result is small enough to fit in memory. As a hypothetical example, suppose I need to calculate an md5sum of a 200GB file and I need to do so with guarantees about how much RAM will be used.
In summary:
Needs to be constant space
Fast as possible
Assume very large files
Result fits in memory
Question
What are the fastest ways to read/stream data from a file using constant space?
Ideas I've had
If the file were small enough to fit in memory, then mmap on POSIX systems would be very fast; unfortunately that's not the case here. Is there any performance advantage to using mmap with a small buffer size to buffer successive chunks of the file? Would the system call overhead of moving the mmap buffer down the file dominate any advantages? Or should I use a fixed buffer that I read into with fread?
I wouldn't be so sure that mmap would be very fast (where very fast is defined as significantly faster than fread).
Grep used to use mmap, but switched back to fread. One of the reasons was stability (strange things happen with mmap if the file shrinks whilst it is mapped or an IO error occurs). This page discusses some of the history about that.
You can compare the performance on your system with the option --mmap to grep. On my system the difference in performance on a 200GB file is negligible, but your mileage might vary!
In short, I'd use fread with a fixed size buffer. It's simpler to code, easier to handle errors and will almost certainly be fast enough.
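A minimal sketch of that approach, with the per-chunk work left as a placeholder (here it just counts bytes; swap in an incremental hash/checksum update as needed):

    /* Constant-space streaming read with a fixed buffer. */
    #include <stdio.h>

    int main(int argc, char **argv) {
        if (argc != 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }
        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror("fopen"); return 1; }

        enum { BUF_SIZE = 64 * 1024 };       /* fixed 64 KiB buffer: constant space */
        static unsigned char buf[BUF_SIZE];
        unsigned long long total = 0;
        size_t n;

        while ((n = fread(buf, 1, sizeof buf, f)) > 0)
            total += n;                      /* placeholder for the real per-chunk work */

        if (ferror(f)) { perror("fread"); fclose(f); return 1; }
        fclose(f);
        printf("%llu bytes processed\n", total);
        return 0;
    }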
Depending on the language you are using, a C-like fread() loop over a file with a buffer of a particular size will require exactly that buffer size, no more, no less.
We typically choose a buffer size of 4 to 128 kBytes; there is little gain, if any, with bigger buffers.
If performance were extremely important, for relatively little gain (and at the risk of re-inventing something), one could consider a two-thread implementation, whereby one thread reads the file into a pair of buffers, and the other thread performs the calculations in sequential fashion on one buffer at a time. In this fashion the disk access delays can be largely hidden.
mjv is right. You can use double-buffers and overlapped I/O. That way your crunching and the disk reading can be happening at the same time. Then I would profile or stack-shot the crunching to make it as fast as possible. With luck it will be faster than the I/O, so you will end up running the I/O at top speed without pause. Then things like file fragmentation come into the picture.
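A hedged sketch of the two-buffer scheme described above, assuming POSIX threads and semaphores (the 1 MiB buffer size and the process() stub are arbitrary): the reader thread fills one buffer while the main thread crunches the other.

    /* Two-buffer overlapped I/O sketch: a reader thread and a consumer alternate
     * over two buffers, synchronized with semaphores. Illustrative only. */
    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    #define BUF_SIZE (1 << 20)            /* 1 MiB per buffer (arbitrary choice)    */

    static unsigned char bufs[2][BUF_SIZE];
    static size_t lens[2];
    static sem_t filled[2], emptied[2];
    static FILE *in;

    static void *reader(void *arg) {
        (void)arg;
        for (int i = 0; ; i = 1 - i) {
            sem_wait(&emptied[i]);                     /* wait until buffer i is free   */
            lens[i] = fread(bufs[i], 1, BUF_SIZE, in);
            sem_post(&filled[i]);                      /* hand buffer i to the consumer */
            if (lens[i] == 0)                          /* EOF (or error): stop reading  */
                return NULL;
        }
    }

    static void process(const unsigned char *p, size_t n) {
        (void)p; (void)n;                 /* placeholder for the real computation    */
    }

    int main(int argc, char **argv) {
        if (argc != 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }
        in = fopen(argv[1], "rb");
        if (!in) { perror("fopen"); return 1; }

        for (int i = 0; i < 2; i++) { sem_init(&filled[i], 0, 0); sem_init(&emptied[i], 0, 1); }

        pthread_t t;
        pthread_create(&t, NULL, reader, NULL);

        for (int i = 0; ; i = 1 - i) {
            sem_wait(&filled[i]);                      /* wait for data in buffer i     */
            if (lens[i] == 0) break;                   /* reader hit EOF                */
            process(bufs[i], lens[i]);                 /* crunch while the other fills  */
            sem_post(&emptied[i]);                     /* give buffer i back to reader  */
        }

        pthread_join(t, NULL);
        fclose(in);
        return 0;
    }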

Difference between sequential write and random write

What is the difference between sequential write and random write in case of :-
1)Disk based systems
2)SSD [Flash Device ] based systems
When the application writes something and the information/data needs to be modified on the disk, how do we know whether it is a sequential write or a random write? Up to this point a write cannot be distinguished as "sequential" or "random"; the write is just buffered and then applied to the disk when we flush the buffer.
Please correct me if I am wrong.
When people talk about sequential vs random writes to a file, they're generally drawing a distinction between writing without intermediate seeks ("sequential"), vs. a pattern of seek-write-seek-write-seek-write, etc. ("random").
The distinction is very important in traditional disk-based systems, where each disk seek will take around 10ms. Sequentially writing data to that same disk takes about 30ms per MB. So if you sequentially write 100MB of data to a disk, it will take around 3 seconds. But if you do 100 random writes of 1MB each, that will take a total of 4 seconds (3 seconds for the actual writing, and 10ms*100 == 1 second for all the seeking).
As each random write gets smaller, you pay more and more of a penalty for the disk seeks. In the extreme case where you perform 100 million random 1-byte writes, you'll still net 3 seconds for all the actual writes, but you'd now have 11.57 days worth of seeking to do! So clearly the degree to which your writes are sequential vs. random can really affect the time it takes to accomplish your task.
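Plugging the answer's ballpark figures (10ms per seek, 30ms per MB transferred; illustrative, not measured) into a quick calculation reproduces those numbers:

    /* Back-of-the-envelope timing with the ballpark figures from the text. */
    #include <stdio.h>

    int main(void) {
        const double seek_s        = 0.010;   /* ~10 ms per seek                  */
        const double xfer_s_per_mb = 0.030;   /* ~30 ms per MB written            */
        const double total_mb      = 100.0;   /* 100 MB of data in every scenario */

        double sequential = total_mb * xfer_s_per_mb;          /* ~3 s            */
        double random_1mb = sequential + 100.0 * seek_s;       /* ~4 s            */
        double random_1b  = sequential + 100e6 * seek_s;       /* ~1,000,003 s    */

        printf("sequential 100 MB   : %.1f s\n", sequential);
        printf("100 x 1 MB random   : %.1f s\n", random_1mb);
        printf("100M x 1-byte random: %.0f s (~%.2f days)\n",
               random_1b, random_1b / 86400.0);
        return 0;
    }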
The situation is a bit different when it comes to flash. With flash, you don't have a physical disk head that you must move around. (This is where the 10ms seek cost comes from for a traditional disk). However, flash devices tend to have large page sizes (the smallest "typical" page size is around 512 bytes according to wikipedia, and 4K page sizes appear to be common as well). So if you're writing a small number of bytes, flash still has overhead in that you must read out an entire page, modify the bytes you're writing, and then write back the entire page. I don't know the characteristic numbers for flash off the top of my head. But the rule of thumb is that on flash if each of your writes is generally comparable in size to the device's page size, then you won't see much performance difference between random and sequential writes. If each of your writes is small compared to the device page size, then you'll see some overhead when doing random writes.
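The read-modify-write overhead for sub-page writes can be expressed as a rough write-amplification factor (the 4K page and 16-byte write below are just example numbers):

    /* Rough write-amplification arithmetic for a small write to a flash page. */
    #include <stdio.h>

    int main(void) {
        const int page_bytes  = 4096;   /* example flash page size                 */
        const int write_bytes = 16;     /* small application write                 */

        /* A sub-page write forces a read-modify-write of the whole page. */
        printf("writing %d bytes touches a %d-byte page: ~%.0fx amplification\n",
               write_bytes, page_bytes, (double)page_bytes / write_bytes);
        return 0;
    }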
Now for all of the above, it's true that at the application layer much is hidden from you. There are layers in the kernel, disk/flash controller, etc. that could for example interject non-obvious seeks in the middle of your "sequential" writing. But in most cases, writing that "looks" sequential at the application layer (no seeks, lots of continuous I/O) will have sequential-write performance while writing that "looks" random at the application layer will have the (generally worse) random-write performance.
