What effects do dfs.blocksize, file.blocksize, kfs.blocksize and etc have in hadoop mapreduce job? - hadoop

When I check the job.xml file of a hadoop (version 0.21.0) mapreduce job, I found there are multiple blocksize settings exist:
dfs.blocksize = 134217728 (i.e. 128MB)
file.blocksize = 67108864 (i.e. 64MB)
kfs.blocksize = 67108864
s3.blocksize = 67108864
s3native.blocksize = 67108864
ftp.blocksize = 67108864
I am expecting some answers to explain following related questions:
What are the dfs, file, kfs, s3 and etc mean in this context?
What are the differences among them?
What effects do they have when running a mapreduce job?
Thank you very much!

Map reduce can work on data stored on different types of storage systems.The settings above are the default block sizes on the storage techniques used. dfs(distributed file system) is what we commonly use in hadoop has default block size 128MB. Other settings are for file(local), kfs(kosmos distributed filesystem), s3(amazon cloud storage) and ftp(files on ftp server).
You may research them further for a better understanding of each and using them with hadoop features.While running the map reduce job,the settings which are for the particular storage technique being used,are identified for block size.
I hope it was helpful.

Related

Why is Spark setting partitions to the file size in bytes?

I have a very simple pyspark program that is supposed to read CSV files from S3:
r = sc.textFile('s3a://some-bucket/some-file.csv')
.map(etc... you know the drill...)
This was failing when running a local Spark node (it works in EMR). I was getting OOM errors and GC crashes. Upon further inspection, I realized that the number of partitions was insanely high. In this particular case r.getNumPartitions() would return 2358041.
I realized that that's exactly the size of my file in bytes. This, of course, makes Spark crash miserably.
I've tried different configurations, like chaning mapred.min.split.size:
conf = SparkConf()
conf.setAppName('iRank {}'.format(datetime.now()))
conf.set("mapred.min.split.size", "536870912")
conf.set("mapred.max.split.size", "536870912")
conf.set("mapreduce.input.fileinputformat.split.minsize", "536870912")
I've also tried using repartition or changing passing a partitions argument to textFile, to no avail.
I would love to know what makes Spark think that it's a good idea to derive the number of partitions from the file size.
In general it doesn't. As nicely explained by eliasah in his answer to Spark RDD default number of partitions it uses max of minPartitions (2 if not provided) and splits computed by Hadoop input format.
The latter one will by unreasonably high, only if instructed by the configuration. This suggests that some configuration file interferes with your program.
The possible problem with your code is that you use wrong configuration. Hadoop options should be set using hadoopConfiguration not Spark configuration. It looks like you use Python so you have to use private JavaSparkContext instance:
sc = ... # type: SparkContext
sc._jsc.hadoopConfiguration().setInt("mapred.min.split.size", min_value)
sc._jsc.hadoopConfiguration().setInt("mapred.max.split.size", max_value)
There was actually a bug in Hadoop 2.6 which would do this; the initial S3A release didn't provide a block size to Spark to split up, the default of "0" meant one-byte-per-job.
Later version should all take fs.s3a.block.size as the config option specifying the block size...something like 33554432 (= 32 MB) would be a start.
If you are using Hadoop 2.6.x. Don't use S3A. That's my recommendation.

Default size of input split in Hadoop

What is the default size of input split in Hadoop. As I know default size of block is 64 MB.
Is there any file in Hadoop jar in which we can see the default values of all such things ? like default replication factor etc. like anything default in Hadoop.
Remember these two parameters: mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize. I refer these as minSize, maxSize respectively. By default minSize is 1 byte and maxSize is Long.MAX_VALUE. The block size can be 64MB or 128MB or more.
The input split size is calculated by a formula like this during runtime:
max(minSize, min(maxSize, blockSize)
Courtesy: Hadoop:The definitive guide.
Yes, you can see all these configurations in hadoop etc/conf folder.
There are various files : core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml.
It contains all the default configuration for hadoop cluster which can be overridden as well.
You can refer following links:
https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
And, if you have not defined any input split size in Map/Reduce program then default HDFS block split will be considered as input split.

Why only 1 map and 1 reduce task and 1 node is used in a Hadoop job?

I have configured a 3-nodes-cluster to run wordcount mapreduce program. I am using a book, whose size is 659 kb (http://www.gutenberg.org/ebooks/20417) as the test data. Interestingly, in the web UI of that Job, only 1 map, 1 reduce and 1 node is involved. I am wondering if this is because the data size is too small. If yes, could I set manually to split the data into different maps on multi nodes?
Thanks,
Allen
The default block size is 64 MB. So yes, the framework does assign only one task of each kind because your input data is smaller.
1) You can either give input data that are more than 64 MB and see what happens.
2) Change the value of mapred.max.split.size which is specific for the mapreduce jobs
(in mapred-site.xml or running the job with the -D mapred.max-split.size=noOfBytes)
or
3) Change the value of dfs.block.size which has a more global scope and applies for all the HDFS. (in hdfs-site.xml)
Don't forget to restart your cluster to apply changes in case you are modifying the conf files.

How to achieve desired block size with Hadoop with data on local filesystem

I have a 2TB sequence file that I am trying to process with Hadoop which resides on a cluster set up to use a local (lustre) filesystem for storage instead of HDFS. My problem is that no matter what I try, I am always forced to have about 66000 map tasks when I run a map/reduce jobs with this data as input. This seems to correspond with a block size of 2TB/66000 =~ 32MB. The actual computation in each map task executes very quickly, but the overhead associated with so many map tasks slows things down substantially.
For the job that created the data and for all subsequent jobs, I have dfs.block.size=536870912 and fs.local.block.size=536870912 (512MB). I also found suggestions that said to try this:
hadoop fs -D fs.local.block.size=536870912 -put local_name remote_location
to make a new copy with larger blocks, which I did to no avail. I have also changed the stripe size of the file on lustre. It seems that any parameters having to do with block size are ignored for local file system.
I know that using lustre instead of HDFS is a non-traditional use of hadoop, but this is what I have to work with. I'm wondering if others either have experience with this, or have any ideas to try other than what I have mentioned.
I am using cdh3u5 if that is useful.

Hadoop dfs replicate

Sorry guys,just a simple question but I cannot find exact question on google.
The question about what's dfs.replication mean? If I made one file named filmdata.txt in hdfs, if I set dfs.replication=1,so is it totally one file(one filmdata.txt)?or besides the main file(filmdata.txt) hadoop will create another replication file.
shortly say:if set dfs.replication=1,there are totally one filmdata.txt,or two filmdata.txt?
Thanks in Advance
The total number of files in the file system will be what's specified in the dfs.replication factor. So, if you set dfs.replication=1, then there will be only one copy of the file in the file system.
Check the Apache Documentation for the other configuration parameters.
To ensure high availability of data, Hadoop replicates the data.
When we are storing the files into HDFS, hadoop framework splits the file into set of blocks( 64 MB or 128 MB) and then these blocks will be replicated across the cluster nodes.The configuration dfs.replication is to specify how many replications are required.
The default value for dfs.replication is 3, But this is configurable depends on your cluster setup.
Hope this helps.
The link provided by Praveen is now broken.
Here is the updated link describing the parameter dfs.replication.
Refer Hadoop Cluster Setup. for more information on configuration parameters.
You may want to note that files can span multiple blocks and each block will be replicated number of times specified in dfs.replication (default value is 3). The size of such blocks is specified in the parameter dfs.block.size.
In HDFS framework, we use commodity machines to store the data, these commodity machines are not high end machines like servers with high RAM, there will be a chance of loosing the data-nodes(d1, d2, d3) or a block(b1,b2,b3), as a result HDFS framework splits the each block of data(64MB, 128MB) into three replications(as a default) and each block will be stored in a separate data-nodes(d1, d2, d3). Now consider block(b1) gets corrupted in data-node(d1) the copy of block(b1) is available in data-node(d2) and data-node(d3) as well so that client can request data-node(d2) to process the block(b1) data and provide the result and same as if data-node(d2) fails client can request data-node(d3) to process block(b1) data . This is called-dfs.replication mean.
Hope you got some clarity.

Resources