Hadoop uses 2 mappers for a small file

I tried to run a MapReduce job on a small file (200 KB, only 10 lines). I used Hadoop Streaming, as the MapReduce job is written as a shell script. I assumed it would use only 1 map task, but the job tracker shows 'Map Total' as 2.
Please advise what could be the reason, as I assumed the number of mappers depends on the input file size and the configured block size.
Thanks

Related

How does Hadoop HDFS decide what data to put into each block?

I have been trying to dive into how Hadoop HDFS decides what data to put into one block, and I can't seem to find any solid answer. We know that Hadoop will automatically distribute data into blocks in HDFS across the cluster, but which parts of each file are put together in a block? Is it arbitrary? And is this the same for Spark RDDs?
HDFS block behavior
I'll attempt to highlight, by way of example, how splits differ depending on file size and splittability. In HDFS you have:
Splittable FileA size 1GB
dfs.block.size=67108864 (~64 MB)
MapRed job against this file:
16 splits and in turn 16 mappers.
Let's look at this scenario with a compressed (non-splittable) file:
Non-Splittable FileA.gzip size 1GB
dfs.block.size=67108864 (~64 MB)
MapRed job against this file:
16 blocks will converge on 1 mapper.
It's best to proactively avoid this situation, since it means the tasktracker will have to fetch 16 blocks of data, most of which will not be local to it.
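To make the arithmetic concrete, here is a rough sketch in plain Java (not Hadoop's actual source) of the rule FileInputFormat follows when sizing splits; the 1 GB file and ~64 MB block size are the figures from the example above:

// Sketch of the split-sizing rule: splitSize = max(minSize, min(maxSize, blockSize)).
// A non-splittable (e.g. gzip) file always becomes a single split, no matter how many blocks it spans.
public class SplitMath {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long fileSize = 1024L * 1024 * 1024;      // 1 GB, as in the example
        long blockSize = 67108864L;               // dfs.block.size, ~64 MB
        long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);

        boolean splittable = true;                // set to false for FileA.gzip
        long numSplits = splittable
                ? (long) Math.ceil((double) fileSize / splitSize)
                : 1;                              // the whole gzip converges on 1 mapper
        System.out.println(numSplits + " splits -> " + numSplits + " mappers"); // 16 for the splittable case
    }
}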
Spark reading an HDFS splittable file:
sc.textFile doesn't commence any reading. It simply defines a driver-resident data structure which can be used for further processing.
It is not until an action is called on an RDD that Spark will build up a strategy to perform all the required transforms (including the read) and then return the result.
If there is an action called to run the sequence, and your next transformation after the read is to map, then Spark will need to read a small section of lines of the file (according to the partitioning strategy based on the number of cores) and then immediately start to map it until it needs to return a result to the driver, or shuffle before the next sequence of transformations.
If your partitioning strategy (defaultMinPartitions) seems to be swamping the workers because the Java representation of your partition (an InputSplit in HDFS terms) is bigger than the available executor memory, then you need to specify the number of partitions to read as the second parameter to textFile. You can calculate the ideal number of partitions by dividing your file size by your target partition size (allowing for memory growth). A simple check that the file can be read would be:
sc.textFile(file, numPartitions)
.count()
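As a rough illustration of that advice using Spark's Java API (the file path, file size, and 64 MB target partition size below are made-up assumptions, not values from the question):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionCheck {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("partition-check"));

        String file = "hdfs:///data/big.txt";             // hypothetical input path
        long fileSizeBytes = 10L * 1024 * 1024 * 1024;    // assume a ~10 GB file
        long targetPartitionBytes = 64L * 1024 * 1024;    // aim for ~64 MB per partition, leaving memory headroom

        int numPartitions = (int) Math.max(1, fileSizeBytes / targetPartitionBytes);

        // textFile is lazy; count() is the action that forces the read and
        // confirms the file can be read with this partitioning.
        long lines = sc.textFile(file, numPartitions).count();
        System.out.println("lines = " + lines);
        sc.stop();
    }
}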

MapReduce: why is the number of splits (text file) more than 1, even for a tiny file?

I know the difference between a physical block and an InputSplit in Hadoop.
BTW, I am using Hadoop 2.0 (YARN).
I have a very tiny input dataset, maybe 1.5 MB in size. When I run a MapReduce program that consumes this tiny dataset, it shows 2 input splits during the run. Why should the tiny dataset be split into two when it is less than 128 MB in size?
In my understanding, the block size is configured to be 128 MB, and an input split is a logical division of data, i.e. where each split starts (in which node and which block) and where it ends. The starting and ending locations of the data define the split.
I don't get the reason for splits in such a tiny dataset.
Can someone explain?
Thanks,
nath
First, try to understand how the number of splits is decided; it depends on two things:
1) If you have not defined a custom split size, it will take the default, which is the block size, in your case 128 MB.
2) This is important: if you have two small files, they will be stored in two different blocks, and a split never spans files. So the number of splits will be two.
Your answer is in the above two points. As extra info: the relation between the number of mappers and the number of splits is one to one, so the number of splits will be the same as the number of mappers.
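For completeness, if you do want to define a custom split size in a Java driver, a minimal sketch (assuming the new mapreduce API; the 64 MB figure is just an example) looks like this. Note that each input file still contributes at least one split of its own:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Override the defaults that otherwise fall back to the block size.
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024); // 64 MB floor (example value)
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024); // 64 MB ceiling (example value)

        // ... set mapper/reducer/output classes as usual, then:
        // job.waitForCompletion(true);
    }
}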
I had the same problem.
I tried it on a file of 2.5 MB, and running the job printed this on the console:
number of splits:2
Job Counters
Launched map tasks=2
Logically speaking, it should be one split, as I am using the default settings:
split_size 128M, block_size 128M
However, I think the reason the MapReduce job is splitting the input is this config param:
mapreduce.job.maps: The default number of map tasks per job. Ignored when mapreduce.framework.name is "local".
The default value is 2, as you can see in the MapReduce defaults.
You can set it if you are using MRJob for Python:
MRJob.JOBCONF = {'mapreduce.job.maps': '1'}
By doing so, I got one split. Then I ran another file of about 450 MB with the same setting, and it was split into 4 (ceil(450/128)), which means this setting is ignored when more splits are needed.

Does Hadoop create InputSplits in parallel?

I have a large text file, around 13 GB in size. I want to process the file using Hadoop. I know that Hadoop uses FileInputFormat to create InputSplits which are assigned to mapper tasks. I want to know whether Hadoop creates these InputSplits sequentially or in parallel. I mean, does it read the large text file sequentially on a single host and create split files which are then distributed to datanodes, or does it read chunks of, say, 50 MB in parallel?
Does Hadoop replicate the big file on multiple hosts before splitting it up?
Is it recommended that I split up the file into 50 MB chunks to speed up the processing? There are many questions on the appropriate split size for mapper tasks, but not on the exact split process itself.
Thanks
InputSplits are created on the client side, and a split is just a logical representation of the file, in the sense that it only contains the file path and the start and end offset values (calculated in LineRecordReader's initialize function). Calculating this logical representation does not take much time, so there is no need to split the file into chunks yourself; the real execution happens on the mapper side, where it is done in parallel. The client then places the input splits into HDFS, and the jobtracker takes it from there, allocating a tasktracker depending on the splits. One mapper's execution is not dependent on another's: each mapper knows exactly where it has to start processing its split, so the mapper executions run in parallel.
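To see how lightweight that logical representation is, here is roughly what a single split boils down to, sketched with the new-API FileSplit (the path, offsets, and host name are made up):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitIsJustMetadata {
    public static void main(String[] args) {
        // A split is only (path, start offset, length, preferred hosts);
        // no file data is copied when splits are computed on the client.
        FileSplit split = new FileSplit(
                new Path("hdfs:///data/big.txt"),    // hypothetical file
                0L,                                  // start offset
                128L * 1024 * 1024,                  // length (about one block)
                new String[] {"datanode1"});         // hosts that hold this block (hypothetical)
        System.out.println(split);
    }
}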
I suppose you want to process the file using MapReduce, not 'Hadoop'. Hadoop is a platform which provides tools to process and store large amounts of data.
When you store the file in HDFS (the Hadoop filesystem), it splits the file into multiple blocks. The size of a block is defined in the hdfs-site.xml file as dfs.block.size. For example, if dfs.block.size is set to 134217728 (128 MB), then your input file will be split into 128 MB blocks. This is how HDFS stores the data internally; to the user it always appears as a single file.
When you provide the input file (stored in HDFS) to MapReduce, it launches a mapper task for each block/split of the file. This is the default behavior.
You do not need to split the file into chunks; just store the file in HDFS and it will do the rest for you.
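If you want to see how HDFS physically divided your file into blocks (while it still looks like one file to you), a small sketch using the FileSystem API (the path is a placeholder you would pass in):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path(args[0])); // e.g. the 13 GB input file
        // One BlockLocation per block: its offset, length, and the datanodes holding it.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}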
First let us understand what is meant by input split.
When your text file is divided by HDFS into blocks of 128 MB (the default), assume that the 10th line of the file is cut in two: the first half is in the first block and the other half is in the second block. When you submit a map program, Hadoop understands that the last line of the 1st block (which becomes an input split here) is not complete, so it carries the second half of the 10th line over to the first input split. Which implies:
1) 1st input split = 1st Block + 2nd part of 10th line from 2nd block
2) 2nd input split = 2nd Block - 2nd part of 10th line from 2nd block.
This boundary handling is built into Hadoop, and it is not something you set or change yourself. The block size of Hadoop v2 is 128 MB by default; you can change it in the HDFS configuration.
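Below is a simplified, local-filesystem illustration of that boundary rule (plain Java, not Hadoop's LineRecordReader): a reader given the byte range [start, end] skips its partial first line unless it starts at offset 0, and reads past end to finish its last line. That is how the second half of the 10th line ends up with the first split.

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class SplitLineDemo {
    // Read the records belonging to the byte range [start, end] of a text file:
    // skip the partial first line (the previous split carries it) and keep
    // reading until the line that crosses 'end' is complete.
    static void readSplit(String path, long start, long end) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
            f.seek(start);
            if (start != 0) {
                f.readLine(); // discard the partial line; it belongs to the previous split
            }
            String line;
            while (f.getFilePointer() <= end && (line = f.readLine()) != null) {
                System.out.println(line); // "process" the record
            }
        }
    }

    public static void main(String[] args) throws IOException {
        long mid = new File(args[0]).length() / 2;  // pretend the file has two "blocks"
        readSplit(args[0], 0, mid);                 // 1st split: includes the whole line that crosses 'mid'
        readSplit(args[0], mid, Long.MAX_VALUE);    // 2nd split: starts at the next full line
    }
}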

Reducing number of Map tasks during Hadoop Streaming

I have a folder with 3072 files, each ~50 MB. I'm running a Python script over this input using Hadoop Streaming and extracting some data.
On a single file, the script doesn't take more than 2 seconds. However, running this on an EMR cluster with 40 m1.large task nodes and 3072 files takes 12 minutes.
Hadoop streaming does this:
14/11/11 09:58:51 INFO mapred.FileInputFormat: Total input paths to process : 3072
14/11/11 09:58:52 INFO mapreduce.JobSubmitter: number of splits:3072
And hence 3072 map tasks are created.
Of course the MapReduce overhead comes into play. From some initial research, it seems that it's very inefficient if map tasks take less than 30-40 seconds.
What can I do to reduce the number of map tasks here? Ideally, if each task handled around 10-20 files it would greatly reduce the overhead.
I've tried playing around with the block size; but since the files are all around 50 MB in size, they're already in separate blocks, and increasing the block size makes no difference.
Unfortunately you can't. The number of map tasks for a given job is driven by the number of input splits. For each input split a map task is spawned, so over the lifetime of a MapReduce job the number of map tasks equals the number of input splits.
mapred.min.split.size specifies the minimum split size to be processed by a mapper.
So, increasing the split size should reduce the number of mappers.
Check out the link
Behavior of the parameter "mapred.min.split.size" in HDFS
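Note, though, that with the stock TextInputFormat a split never spans more than one file, so raising mapred.min.split.size by itself will not merge 3072 separate ~50 MB files into fewer splits. The usual fix is a combining input format; a hedged sketch for a Java driver is below (for streaming you would need an equivalent combining input format passed via -inputformat, if your Hadoop version ships one). The 512 MB target is an arbitrary assumption:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class PackSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pack-small-files");
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // CombineTextInputFormat packs several small files into one split,
        // so roughly 10 of the ~50 MB files end up in a single map task.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024); // ~512 MB per split (assumed target)

        // ... mapper/reducer/output setup as usual, then:
        // job.waitForCompletion(true);
    }
}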

Hadoop - how the total number of mappers is determined

I am new to Hadoop and just installed Oracle's VirtualBox and Hortonworks' Sandbox. I then downloaded the latest version of Hadoop and imported the jar files into my Java program. I copied a sample WordCount program and created a new jar file. I ran this jar file as a job using the sandbox. The WordCount works perfectly fine as expected. However, on my job status page, I see that the number of mappers for my input file is determined as 28. In my input file, I have the following line:
Ramesh is studying at XXXXXXXXXX XX XXXXX XX XXXXXXXXX.
How is the total number of mappers determined as 28?
I added the line below to my wordcount.java program to check:
FileInputFormat.setMaxInputSplitSize(job, 2);
Also, I would like to know whether the input file can contain only 2 rows. (i.e.) Suppose I have an input file like the one below:
row1,row2,row3,row4,row5,row6.......row20
Should I split the input file into 20 different files, each having only 2 rows?
HDFS blocks and MapReduce splits are 2 different things. Blocks are a physical division of the data, while a split is just a logical division done during an MR job. It is the duty of InputFormat to create the splits from a given set of data, and the number of mappers is decided based on the number of splits. When you use setMaxInputSplitSize, you overrule this behavior and give a split size of your own. But giving a very small value to setMaxInputSplitSize would be overkill, as there will be a lot of very small splits and you'll end up with a lot of unnecessary map tasks.
Actually, I don't see any need for you to use FileInputFormat.setMaxInputSplitSize(job, 2); in your WC program. Also, it looks like you have misread the 2 here: it is not the number of lines in a file, it is the split size, in bytes (a long), which you would like to have for your MR job. You can have any number of lines in the file which you are going to use as your MR input.
Does this sound OK?
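As a side note, if you ever do want to cap the split size in that driver, a minimal sketch of a more sensible call is below; the 128 MB figure is just an example, not something from the question:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WordCountSplitConfig {
    // Hypothetical helper: the second argument is a byte count, not a line count,
    // so passing 2 asks for 2-byte splits and a flood of tiny map tasks.
    public static void capSplitSize(Job job) {
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024); // 128 MB, for example
    }
}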
That means your input file is split into roughly 28 parts (blocks) in HDFS, since you said 28 map tasks were scheduled; but there may not be 28 parallel map tasks running at once. Parallelism will depend on the number of slots you have in your cluster. I'm talking in terms of Apache Hadoop; I don't know whether Hortonworks made any modification to this.
Hadoop likes to work with large files, so do you really want to split your input file into 20 different files?
