Why are only 1 map task, 1 reduce task and 1 node used in a Hadoop job?

I have configured a 3-node cluster to run the wordcount MapReduce program. I am using a book of 659 KB (http://www.gutenberg.org/ebooks/20417) as the test data. Interestingly, the web UI for that job shows only 1 map, 1 reduce and 1 node involved. I am wondering if this is because the data size is too small. If so, could I manually configure the data to be split into different maps on multiple nodes?
Thanks,
Allen

The default block size is 64 MB, so yes, the framework assigns only one task of each kind because your input data is smaller than one block.
1) You can either give it input data that is larger than 64 MB and see what happens,
2) change the value of mapred.max.split.size, which is specific to MapReduce jobs (in mapred-site.xml, or by running the job with -D mapred.max.split.size=noOfBytes; see the sketch below),
or
3) change the value of dfs.block.size, which has a more global scope and applies to all of HDFS (in hdfs-site.xml).
Don't forget to restart your cluster to apply the changes if you modify the conf files.
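For option 2), here is a minimal sketch of setting the split size programmatically, assuming an old-API (Hadoop 1.x) WordCount-style driver; the class name, the 128 KB value and the omitted Mapper/Reducer are placeholders, not part of the original answer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), WordCountDriver.class);
        conf.setJobName("wordcount");
        // Cap each split at 128 KB so even a 659 KB input yields several map tasks.
        // Alternatively, drop this line and pass -D mapred.max.split.size=noOfBytes on
        // the command line; implementing Tool makes the generic -D options take effect.
        conf.setLong("mapred.max.split.size", 128 * 1024);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // Mapper/Reducer classes omitted here; the identity defaults will run.
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new WordCountDriver(), args));
    }
}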

Related

Only one mapper and reducer are running even though I changed to 5 mappers and 2 reducers

I am new to Hadoop, and I have set up a multi-node Hadoop cluster with version 2.5.1.
I run a MapReduce job using the command
hadoop jar jarFile <ClassName> <InputFile> <outputDirectory> -D mapreduce.job.reduces=2 mapreduce.job.maps=5
But when I look at the output, I see only one mapper and one reducer running.
I also see that there is no concept of map slots and reducer slots in Hadoop 2.5.1.
My file size is 78 MB. Is the reason that my file is quite small, so there are very few blocks and hence only one mapper running?
Any help to move forward would be greatly appreciated.
Thanks & Regards,
Srilatha K.
That's because the default block size is 128 MB, and hence your 78 MB file never got split across multiple blocks. See this, which says the default block size is 128 MB.
If you want to see two mappers, add the following lines to $HADOOP_HOME/conf/hdfs-site.xml (the block size is applied when a file is written, so re-upload the file for it to be stored in 64 MB blocks):
<property>
  <name>dfs.blocksize</name>
  <value>64M</value>
</property>
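Since a file keeps the block size it was written with, another route (a sketch under the assumption that re-uploading the 78 MB file is acceptable; the class name and args-based paths are made up) is to copy the file back into HDFS with a 64 MB per-file block size, without touching the cluster-wide dfs.blocksize:

import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReuploadWithSmallBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem local = FileSystem.getLocal(conf);
        FileSystem hdfs = FileSystem.get(conf);
        Path src = new Path(args[0]);   // the 78 MB file on the local disk
        Path dst = new Path(args[1]);   // where it should live in HDFS
        // create(path, overwrite, bufferSize, replication, blockSize):
        // the last argument stores this one file in 64 MB blocks.
        try (InputStream in = local.open(src);
             OutputStream out = hdfs.create(dst, true, 4096, (short) 3, 64L * 1024 * 1024)) {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
    }
}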

What effects do dfs.blocksize, file.blocksize, kfs.blocksize, etc. have on a Hadoop MapReduce job?

When I check the job.xml file of a Hadoop (version 0.21.0) MapReduce job, I find that multiple blocksize settings exist:
dfs.blocksize = 134217728 (i.e. 128MB)
file.blocksize = 67108864 (i.e. 64MB)
kfs.blocksize = 67108864
s3.blocksize = 67108864
s3native.blocksize = 67108864
ftp.blocksize = 67108864
I am hoping for answers that explain the following related questions:
What do dfs, file, kfs, s3, etc. mean in this context?
What are the differences among them?
What effects do they have when running a mapreduce job?
Thank you very much!
MapReduce can work on data stored on different types of storage systems. The settings above are the default block sizes for those storage systems. dfs (the distributed file system), which is what we commonly use in Hadoop, has a default block size of 128 MB. The other settings are for file (the local filesystem), kfs (the Kosmos distributed filesystem), s3 (Amazon cloud storage) and ftp (files on an FTP server).
You may research them further for a better understanding of each and of how they are used with Hadoop features. When a MapReduce job runs, the block-size setting belonging to the particular storage system in use is the one that applies.
I hope this is helpful.
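As a small illustration (my own sketch, not from the answer), you can ask each FileSystem implementation what default block size it would use for new files; the hdfs:/// entry assumes fs.defaultFS points at a running cluster:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class PrintDefaultBlockSizes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // file:/// is the local filesystem; hdfs:/// resolves through fs.defaultFS.
        for (String uri : new String[] { "file:///", "hdfs:///" }) {
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            System.out.println(uri + " default block size = " + fs.getDefaultBlockSize());
        }
    }
}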

Change dfs.block.size on application execution

Since dfs.block.size is an HDFS setting, it shouldn't make a difference if I change it during an application execution, right?
For example, if the block size of a job's files is 128 and I call
hadoop jar /path/to/.jar xxx -D dfs.block.size=256
would it make a difference or would I need to change the block size before saving the files to HDFS?
Are dfs.block.size and the split size of tasks directly related? If I'm correct and they are not, is there a way to specify the size of a split?
The parameters that decide your split size for each MR job are mapred.max.split.size and mapred.min.split.size.
mapred.max.split.size can be set per job individually through your conf object. Don't change dfs.block.size, which affects your HDFS too and also changes the block size of the job's output.
If mapred.min.split.size is less than the block size and mapred.max.split.size is greater than the block size, then one block is sent to each map task. The block data is split into key/value pairs based on the input format you use.
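To make that last point concrete, here is a small sketch (my own, using the same max(min, min(max, block)) rule that FileInputFormat's computeSplitSize applies) showing how the three values interact; the byte values are just examples:

public class SplitSizeDemo {
    // The rule FileInputFormat uses to derive a split size from the three settings.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // a 128 MB HDFS block

        // min < block < max  ->  split size == block size: one block per map task.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));

        // max below the block size -> smaller splits, several map tasks per block.
        System.out.println(computeSplitSize(blockSize, 1L, 32L * 1024 * 1024));

        // min above the block size -> larger splits, each map task spans blocks.
        System.out.println(computeSplitSize(blockSize, 256L * 1024 * 1024, Long.MAX_VALUE));
    }
}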

What is the default size that each Hadoop mapper will read?

Is it the block size of 64 MB for HDFS? Is there any configuration parameter that I can use to change it?
For a mapper reading gzip files, is it true that the number of gzip files must be equal to the number of mappers?
This is dependent on your:
Input format - some input formats (NLineInputFormat, WholeFileInputFormat) work on boundaries other than the block size. In general, though, anything extending FileInputFormat will use the block boundaries as guides.
File block size - individual files don't need to have the same block size as the default block size. This is set when the file is uploaded into HDFS; if not explicitly set, then the default block size (at the time of upload) is applied. Any changes to the default / system block size after the file is uploaded will have no effect on the already uploaded file.
The two FileInputFormat configuration properties mapred.min.split.size and mapred.max.split.size usually default to 1 and Long.MAX_VALUE, but if this is overridden in your system configuration, or in your job, then this will change the amount of data processed by each mapper, and the number of mapper tasks spawned.
Non-splittable compression - such as gzip, cannot be processed by more than a single mapper, so you'll get 1 mapper per gzip file (unless you're using something like CombineFileInputFormat, CompositeInputFormat)
So if you have a file with a block size of 64 MB, but either want to process more or less than this per map task, then you should just be able to set the following job configuration properties:
mapred.min.split.size - larger than the default, if you want to use fewer mappers, at the expense of (potentially) losing data locality (all data processed by a single map task may now be on 2 or more data nodes)
mapred.max.split.size - smaller than the default, if you want to use more mappers (say you have a CPU-intensive mapper) to process each file
If you're using MR2 / YARN then the above properties are deprecated and replaced by:
mapreduce.input.fileinputformat.split.minsize
mapreduce.input.fileinputformat.split.maxsize
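If you prefer not to hard-code those property names, the new-API FileInputFormat has static helpers that set exactly those two keys. The driver below is only a sketch (hypothetical class name, paths taken from args, Mapper/Reducer left out), not code from the answer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitTuningDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split tuning");
        job.setJarByClass(SplitTuningDriver.class);

        // Writes mapreduce.input.fileinputformat.split.minsize:
        // raise it above the block size for fewer, larger splits.
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);

        // Writes mapreduce.input.fileinputformat.split.maxsize:
        // lower it below the block size for more, smaller splits.
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Mapper/Reducer classes omitted; the identity defaults will run.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

In practice you would normally set only one of the two, depending on whether you want fewer or more mappers than the block boundaries would give you.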

Hadoop dfs replicate

Sorry guys, just a simple question, but I cannot find the exact question on Google.
The question is about what dfs.replication means. If I create one file named filmdata.txt in HDFS and set dfs.replication=1, is there just one file in total (one filmdata.txt), or will Hadoop create another replica file besides the main file (filmdata.txt)?
In short: if dfs.replication=1 is set, is there one filmdata.txt in total, or two?
Thanks in Advance
The total number of copies of the file in the file system will be what's specified by the dfs.replication factor. So, if you set dfs.replication=1, then there will be only one copy of the file in the file system.
Check the Apache Documentation for the other configuration parameters.
To ensure high availability of data, Hadoop replicates the data.
When we store files in HDFS, the Hadoop framework splits each file into a set of blocks (64 MB or 128 MB), and these blocks are then replicated across the cluster nodes. The configuration dfs.replication specifies how many replicas are required.
The default value for dfs.replication is 3, but this is configurable depending on your cluster setup.
Hope this helps.
The link provided by Praveen is now broken.
Here is the updated link describing the parameter dfs.replication.
Refer to Hadoop Cluster Setup for more information on configuration parameters.
You may want to note that files can span multiple blocks and each block will be replicated the number of times specified in dfs.replication (default value is 3). The size of such blocks is specified in the parameter dfs.block.size.
In the HDFS framework we use commodity machines to store the data. These commodity machines are not high-end machines like servers with lots of RAM, so there is a chance of losing a data node (d1, d2, d3) or a block (b1, b2, b3). To cope with this, HDFS keeps three replicas (by default) of each block of data (64 MB or 128 MB), and each replica is stored on a separate data node (d1, d2, d3). Now suppose block b1 gets corrupted on data node d1: a copy of b1 is still available on data nodes d2 and d3, so the client can ask d2 to process the data of b1 and return the result, and likewise, if d2 fails, the client can ask d3 to process b1. This is what dfs.replication means.
Hope you got some clarity.
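As a small follow-up sketch (my own; the /data/filmdata.txt path is hypothetical), the replication factor can also be changed for a single existing file without touching the cluster-wide dfs.replication:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path film = new Path("/data/filmdata.txt"); // hypothetical HDFS path

        System.out.println("before: " + fs.getFileStatus(film).getReplication());
        // Ask HDFS to keep exactly one copy of each block of this file.
        fs.setReplication(film, (short) 1);
        System.out.println("after:  " + fs.getFileStatus(film).getReplication());
    }
}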
