HDFS read big file in parallel - hadoop

I want to read a big file whose size is 500GB from my Hadoop cluster with 5 nodes. Can I read the blocks in parallel, or do I have to read them one by one?

If you are using MapReduce/Hive/Pig, then the blocks will automatically be read in parallel, based on the number of blocks.
Assume you are performing a wordcount on your 500GB file with a block size of 128MB: then there will be about 4,000 blocks, and hence 4,000 mappers (preferably launched as close to the data as possible - data locality) will be started by MapReduce to perform the wordcount in parallel.
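If you want to read block ranges in parallel yourself (outside MapReduce), the HDFS client API exposes where each block lives. A minimal sketch in Scala, assuming a hypothetical path /data/bigfile.txt and that fs.defaultFS points at your cluster:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object BlockLocations {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())                  // connects to fs.defaultFS
    val status = fs.getFileStatus(new Path("/data/bigfile.txt"))  // hypothetical path
    // One BlockLocation per block, listing the DataNodes holding its replicas
    val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
    blocks.foreach { b =>
      println(s"offset=${b.getOffset} length=${b.getLength} hosts=${b.getHosts.mkString(",")}")
    }
  }
}

Each (offset, length) range can then be read concurrently with positioned reads on the stream returned by fs.open(path); this is essentially the locality information MapReduce itself uses when scheduling mappers.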

Related

How does Hadoop HDFS decide what data to put into each block?

I have been trying to dive into how Hadoop HDFS decides what data to put into one block and don't seem to find any solid answer. We know that Hadoop will automatically distribute data into blocks in HDFS across the cluster; however, what data of each file should be put together in a block? Is it just split arbitrarily? And is this the same for Spark RDDs?
HDFS block behavior
I'll attempt to highlight, by way of example, the differences in block splits with reference to file size. In HDFS you have:
Splittable FileA size 1GB
dfs.block.size=67108864 (~64MB)
MapRed job against this file:
16 splits and in turn 16 mappers.
Let's look at this scenario with a compressed (non-splittable) file:
Non-Splittable FileA.gzip size 1GB
dfs.block.size=67108864 (~64MB)
MapRed job against this file:
16 Blocks will converge on 1 mapper.
It's best to proactively avoid this situation, since it means that the tasktracker will have to fetch 16 blocks of data, most of which will not be local to it.
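You can observe this difference from Spark as well; a quick sketch, with hypothetical file paths:

val plain = sc.textFile("hdfs:///data/FileA")    // splittable: ~16 partitions for 1GB at 64MB blocks
val gz    = sc.textFile("hdfs:///data/FileA.gz") // gzip is not splittable: a single partition
println(plain.getNumPartitions)
println(gz.getNumPartitions)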
Spark reading an HDFS splittable file:
sc.textFile doesn't commence any reading. It simply defines a driver-resident data structure which can be used for further processing.
It is not until an action is called on an RDD that Spark will build up a strategy to perform all the required transforms (including the read) and then return the result.
If there is an action called to run the sequence, and your next transformation after the read is to map, then Spark will need to read a small section of lines of the file (according to the partitioning strategy based on the number of cores) and then immediately start to map it until it needs to return a result to the driver, or shuffle before the next sequence of transformations.
If your partitioning strategy (defaultMinPartitions) seems to be swamping the workers because the java representation of your partition (an InputSplit in HDFS terms) is bigger than available executor memory, then you need to specify the number of partitions to read as the second parameter to textFile. You can calculate the ideal number of partitions by dividing your file size by your target partition size (allowing for memory growth). A simple check that the file can be read would be:
sc.textFile(file, numPartitions)
.count()
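As a concrete version of that calculation, here is a minimal sketch; the file size, target partition size, and path are assumptions:

val fileSizeBytes = 500L * 1024 * 1024 * 1024    // e.g. a 500 GB file
val targetPartitionBytes = 128L * 1024 * 1024    // desired partition size, leaving room for memory growth
val numPartitions = math.ceil(fileSizeBytes.toDouble / targetPartitionBytes).toInt

sc.textFile("hdfs:///data/bigfile.txt", numPartitions)  // hypothetical path
  .count()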

Blocks in MapReduce

I have a very important question, because I must give a presentation about MapReduce.
My Question is:
I have read that in MapReduce the file is divided into blocks, and every block is replicated on 3 different nodes. A block can be 128 MB. Is this block the input file? I mean, will this 128 MB block be split into parts, with every part going to a single map? If yes, into what size will this 128 MB be divided?
Or does the file break into blocks, and are these blocks the input for the mappers?
I'm a little bit confused.
Could you look at the photo and tell me which one is right?
Scenario 1: The HDFS file is divided into blocks, and every single 128 MB block is the input for one map.
Scenario 2: The HDFS file is one block, and this 128 MB is split so that every part is the input for one map.
Let's say you have a file of 2GB and you want to place that file in HDFS: then there will be 2GB/128MB = 16 blocks, and these blocks will be distributed across the different DataNodes.
Data splitting happens based on file offsets. The goal of splitting the file and storing it in different blocks is parallel processing and failover of data.
A split is a logical split of the data, basically used during data processing with a Map/Reduce program or other data-processing techniques in Hadoop. Split size is a user-defined value, and one can choose a split size based on the volume of data (how much data you are processing).
A split is basically used to control the number of mappers in a Map/Reduce program. If you have not defined an input split size in your Map/Reduce program, then the default HDFS block size is taken as the input split (i.e., input split = input block, so 16 mappers will be triggered for a 2 GB file). If the split size is defined as 100 MB (let's say), then 21 mappers will be triggered (20 mappers for 2,000 MB and a 21st mapper for the remaining 48 MB).
Hope this clears your doubt.
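If you want to pin the split size down in code rather than rely on the block size, a minimal sketch using the standard FileInputFormat knobs (the 100 MB figure just mirrors the example above):

import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

val job = Job.getInstance()
// Cap each input split at 100 MB; also settable via mapreduce.input.fileinputformat.split.maxsize
FileInputFormat.setMaxInputSplitSize(job, 100L * 1024 * 1024)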
HDFS stores the file as blocks, and each block is 128 MB in size (by default).
Mapreduce processes this HDFS file. Each mapper processes a block (input split).
So, to answer your question, 128 Mb is a single block size which will not be further split.
Note: the input split size used in the MapReduce context is a logical split, whereas the split size mentioned for HDFS is a physical split.

Why should I avoid storing lots of small files in Hadoop HDFS?

I have read that lots of small files stored in HDFS can be a problem, because lots of small files means lots of objects in Hadoop NameNode memory.
However, since each block is stored in the NameNode as an object, how is it different for a large file? Whether you store 1,000 blocks from a single file in memory or 1,000 blocks for 1,000 files, is the amount of NameNode memory used the same?
A similar question for map jobs: since they operate on blocks, how does it matter whether the blocks belong to small files or to bigger ones?
At a high-level, you can think of a Hadoop NameNode as a tracker for where blocks composing 'files' stored in HDFS are located; blocks are used to break down large files into smaller pieces when stored in an HDFS cluster.
When you have lots of small files stored in HDFS, there are also lots of blocks, and the NameNode must keep track of all of those files and blocks in memory.
When you have a large file, for example -- if you combined all of those files into bigger files, first -- you would have fewer files stored in HDFS, and you would also have fewer blocks.
First let's discuss how file size, HDFS blocks, and NameNode memory relate:
This is easier to see with examples and numbers.
Our HDFS NameNode's block size for this example is 100 MB.
Let's pretend we have a thousand (1,000) 1 MB files and we store them in HDFS. When storing these 1,000 1 MB files in HDFS, we would also have 1,000 blocks composing those files in our HDFS cluster.
Each block stored in HDFS requires about 150 bytes of NameNode memory, which is about 150 KB of memory for those 1,000 blocks representing 1,000 1 MB files.
Now, consider that we consolidate or concatenate those 1,000 1 MB files into a single 1,000 MB file and store that single file in HDFS. When storing the 1,000 MB file in HDFS, it would be broken down into blocks based on our HDFS cluster block size; in this example our block size was 100 MB, which means our 1,000 MB file would be stored as ten (10) 100 MB blocks in the HDFS cluster.
Each block stored in HDFS requires about 150 bytes of NameNode memory, which is about 1.5 KB of memory for those 10 blocks representing the single 1,000 MB file.
With the larger file, we have the same data stored in the HDFS cluster, but use 1% of the NameNode memory compared to the situation with many small files.
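The arithmetic above as a tiny sketch, using the ~150 bytes/block approximation from this example:

val bytesPerBlockEntry = 150L             // approximate NameNode memory per block object
println(1000L * bytesPerBlockEntry)       // 1,000 x 1 MB files -> 1,000 blocks -> ~150,000 bytes (~150 KB)
println(10L * bytesPerBlockEntry)         // one 1,000 MB file at 100 MB blocks -> 10 blocks -> ~1,500 bytes (~1.5 KB)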
Input blocks and the number of map tasks for a job are related.
When it comes to map tasks, generally you will have one map task per input block. The size of input blocks matters here because there is overhead in starting and finishing new tasks; i.e., when map tasks finish too quickly, this overhead becomes a greater portion of each task's completion time, and the overall job can complete more slowly than the same job with fewer, bigger input blocks. For a MapReduce2-based job, each map task also involves starting and stopping a YARN container at the resource management layer, which adds overhead. (Note that you can also instruct MapReduce jobs to enforce a minimum input size threshold when dealing with many small input blocks, to address some of these inefficiencies; see the sketch below.)
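One concrete way to apply that last note is CombineTextInputFormat, which packs many small files into fewer, larger splits. A minimal sketch; the 256 MB cap is an assumption:

import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{CombineTextInputFormat, FileInputFormat}

val job = Job.getInstance()
job.setInputFormatClass(classOf[CombineTextInputFormat])
// Cap combined splits at ~256 MB (key: mapreduce.input.fileinputformat.split.maxsize)
FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024)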

Spark RDD partitions vs. Hadoop Splits

I am having a hard time understanding the difference between the RDD partitions and the HDFS Input Splits. So essentially when you submit a Spark application:
When the Spark application wants to read from HDFS, the file on HDFS will have input splits (of, let's say, 64 MB each, with each of these input splits present on different data nodes).
Now let's say the Spark application wants to load that file from HDFS using sc.textFile(PATH_IN_HDFS). The file is about 256 MB and has 4 input splits, where 2 of the splits are on data node 1 and the other 2 splits are on data node 2.
Now when Spark loads this 256 MB into its RDD abstraction, will it load each of the input splits (64 MB) into 4 separate RDDs (where you would have 2 RDDs with 64 MB of data on data node 1 and the other two RDDs of 64 MB of data on data node 2)? Or will the RDD further partition those input splits? Also, how would those partitions then be redistributed? I do not understand whether there is a correlation between RDD partitions and HDFS input splits.
I'm pretty new to Spark, but splits are strictly related to MapReduce jobs. Spark loads the data into memory in a distributed fashion, and which machines load the data can depend on where the data is (read: it somewhat depends on where the data blocks are, which is very close to the split idea).
Spark's APIs allow you to think in terms of RDDs and no longer in terms of splits.
You work on the RDD; how the data is distributed within the RDD is no longer the programmer's problem.
Your whole dataset, under Spark, is called an RDD.
Hope the answer below helps you.
When Spark reads a file from HDFS, it creates a single partition for a single input split.
If you have a 30 GB text file stored on HDFS, then with the default HDFS block size setting (128 MB) it would be stored in about 240 blocks, which means that the RDD you read from this file would have about 240 partitions.
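You can verify the split-to-partition mapping directly; a short sketch, with a hypothetical path:

val rdd = sc.textFile("hdfs:///data/30gb.txt")
println(rdd.getNumPartitions)   // ~240 with the default 128 MB block size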

What should be the size of the file in HDFS for best MapReduce job performance

I want to copy text files from external sources to HDFS. Let's assume that I can combine and split the files based on their size; what should the size of the text files be for the best custom MapReduce job performance? Does size matter?
HDFS is designed to support very large files, not small files. Applications that are compatible with HDFS are those that deal with large data sets.
These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds.
HDFS supports write-once-read-many semantics on files. In the HDFS architecture there is a concept of blocks; a typical block size used by HDFS is 64 MB.
When we place a large file into HDFS, it is chopped up into 64 MB chunks (based on the default block configuration). Suppose you have a file of 1 GB
and you want to place that file in HDFS: then there will be 1GB/64MB = 16 splits/blocks, and these blocks will be distributed across the DataNodes.
The goal of splitting the file is parallel processing and failover of data. These blocks/chunks will reside on different DataNodes based on your
cluster configuration.
How mappers get assigned
The number of mappers is determined by the number of splits of your data in the MapReduce job.
In a typical InputFormat, it is directly proportional to the number of files and the file sizes.
Suppose your HDFS block size is configured to 64 MB (the default size) and you have a file of 100 MB:
then there will be 2 splits, occupying 2 blocks, and 2 mappers will be assigned based on those blocks; but suppose
you have 2 files of 30 MB each: then each file will occupy one block, and mappers will be assigned based on that.
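The block/mapper counts above are just ceiling division; as a small sketch:

def blockCount(fileBytes: Long, blockBytes: Long): Long =
  (fileBytes + blockBytes - 1) / blockBytes                   // ceiling division
println(blockCount(100L * 1024 * 1024, 64L * 1024 * 1024))    // 100 MB file -> 2 blocks/mappers
println(blockCount(30L * 1024 * 1024, 64L * 1024 * 1024))     // 30 MB file -> 1 block/mapper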
So you don't need to split the large file, but if you are dealing with very small files, it is worth combining them.
This link will be helpful to understand the problem with small files.
Please refer below link to get more detail about HDFS design.
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
