HDFS BLOCK SIZE - hadoop

What is the default block size for HDFS: 128 MB or 64 MB? In Hadoop: The Definitive Guide it is mentioned that "HDFS, too, has the concept of a block, but it is a much larger unit—128 MB by default". Can anyone tell me which is the default size?

The best place to look up the default values of the Hadoop configuration is the Apache Hadoop website. The default configurations and values are listed in the default XML files published there.
The answer to your question is in hdfs-default.xml. The latest stable version of Hadoop as of now is 2.9.0, and the value of the block size (dfs.blocksize) is 128 MB (134217728 bytes).
In Hadoop 1.x the default block size was 64 MB. With today's high-speed networks and low-cost storage, the default was raised to 128 MB.
Note: the hdfs-default.xml, core-default.xml, and yarn-default.xml files on the Apache website can be used as the reference for the default values of properties in Apache Hadoop.
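If you prefer to check it programmatically, here is a minimal sketch (not from the original answer) that prints the effective value; it assumes the Hadoop HDFS jars, which bundle hdfs-default.xml, are on the classpath, and the class name PrintBlockSize is just an illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class PrintBlockSize {
    public static void main(String[] args) {
        // HdfsConfiguration pulls in hdfs-default.xml and hdfs-site.xml,
        // so the value reflects the shipped default plus any site override.
        Configuration conf = new HdfsConfiguration();
        long blockSize = conf.getLong("dfs.blocksize", -1);
        System.out.println("dfs.blocksize = " + blockSize + " bytes"); // 134217728 on stock Hadoop 2.x
    }
}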

This document states that it is 128 MB (search for dfs.blocksize).

Related

Default size of input split in Hadoop

What is the default size of an input split in Hadoop? As far as I know, the default block size is 64 MB.
Is there any file in the Hadoop jar in which we can see the default values of all such things, like the default replication factor and anything else that has a default in Hadoop?
Remember these two parameters: mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize. I refer to them as minSize and maxSize respectively. By default minSize is 1 byte and maxSize is Long.MAX_VALUE. The block size can be 64 MB, 128 MB, or more.
The input split size is calculated at runtime with the following formula:
max(minSize, min(maxSize, blockSize))
Courtesy: Hadoop: The Definitive Guide.
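As a quick sketch of that formula with the default values plugged in (the variable names mirror the text above, not the actual FileInputFormat source):

// Default values described in the answer above.
long minSize   = 1L;                   // mapreduce.input.fileinputformat.split.minsize
long maxSize   = Long.MAX_VALUE;       // mapreduce.input.fileinputformat.split.maxsize
long blockSize = 128L * 1024 * 1024;   // dfs.blocksize (128 MB)

long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
// With the defaults, splitSize == blockSize == 134217728 bytes.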
Yes, you can see all these configurations in the Hadoop etc/conf folder.
There are various files: core-default.xml, hdfs-default.xml, yarn-default.xml, and mapred-default.xml.
They contain all the default configuration for a Hadoop cluster, and the values can be overridden as well.
You can refer following links:
https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
And if you have not defined any input split size in your MapReduce program, then the default HDFS block size will be used as the input split size.

How to configure Hadoop parameters on Amazon EMR?

I run an MR job with one master and two slaves on Amazon EMR, but I get lots of error messages like running beyond physical memory limits. Current usage: 3.0 GB of 3 GB physical memory used; 3.7 GB of 15 GB virtual memory used. Killing container after map 100% reduce 35%.
I modified my code by adding the following lines to the Hadoop 2.6.0 MR configuration, but I still get the same error messages.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "jobtest2");
//conf.set("mapreduce.input.fileinputformat.split.minsize","3073741824");
conf.set("mapreduce.map.memory.mb", "8192");
conf.set("mapreduce.map.java.opts", "-Xmx8192m");
conf.set("mapreduce.reduce.memory.mb", "8192");
conf.set("mapreduce.reduce.java.opts", "-Xmx8192m");
What is the correct way to configure those parameters(mapreduce.map.memory.mb, mapreduce.map.java.opts, mapreduce.reduce.memory.mb, mapreduce.reduce.java.opts) on Amazon EMR? Thank you!
Hadoop 2.x allows you to set the map and reduce settings per job, so you are setting them in the correct place. The problem is that the Java opts -Xmx heap must be less than mapreduce.map.memory.mb / mapreduce.reduce.memory.mb, because those properties represent the total memory for heap and off-heap usage. Take a look at the defaults as an example: http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-hadoop-task-config.html. If YARN was killing off the containers for exceeding the memory limit when using the default settings, then you need to give more memory to the off-heap portion, thus increasing the gap between -Xmx and the total mapreduce.map/reduce.memory.mb.
Take a look at the documentation for the AWS CLI. There is a section on Hadoop and how to map to specific XML config files on EMR instance creation. I have found this to be the best approach available on EMR.
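For illustration, a minimal sketch of the same job setup with roughly 20% of the container left for off-heap use. The -Xmx6553m figure is just an example ratio, not an EMR requirement, and the properties are set before Job.getInstance so they are actually copied into the job's configuration.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobMemoryConfig {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // 8 GB containers, heap capped at ~80% so the off-heap portion has headroom.
        conf.set("mapreduce.map.memory.mb", "8192");
        conf.set("mapreduce.map.java.opts", "-Xmx6553m");
        conf.set("mapreduce.reduce.memory.mb", "8192");
        conf.set("mapreduce.reduce.java.opts", "-Xmx6553m");
        // Job.getInstance copies the Configuration, so any conf.set after this
        // line would not reach the job.
        Job job = Job.getInstance(conf, "jobtest2");
        // ... set mapper, reducer, input and output paths as usual ...
    }
}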

What effects do dfs.blocksize, file.blocksize, kfs.blocksize and etc have in hadoop mapreduce job?

When I checked the job.xml file of a Hadoop (version 0.21.0) MapReduce job, I found that multiple block size settings exist:
dfs.blocksize = 134217728 (i.e. 128MB)
file.blocksize = 67108864 (i.e. 64MB)
kfs.blocksize = 67108864
s3.blocksize = 67108864
s3native.blocksize = 67108864
ftp.blocksize = 67108864
I am hoping for answers that explain the following related questions:
What do dfs, file, kfs, s3, etc. mean in this context?
What are the differences among them?
What effects do they have when running a MapReduce job?
Thank you very much!
MapReduce can work on data stored on different types of storage systems. The settings above are the default block sizes of the storage systems used. dfs (the distributed file system) is what we commonly use in Hadoop and has a default block size of 128 MB. The other settings are for file (the local file system), kfs (the Kosmos distributed file system), s3 (Amazon cloud storage), and ftp (files on an FTP server).
You may research them further for a better understanding of each and of how to use them with Hadoop. When running a MapReduce job, the block size setting for the particular storage system being used is the one that takes effect.
I hope it was helpful.
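If it helps to see this at runtime, here is a minimal sketch (not from the thread) that asks a FileSystem instance which default block size it would use; the class name and the file:/// URI are just placeholders for illustration.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class DefaultBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Local file system here; pointing the URI at hdfs://<namenode>:8020/
        // instead (with a reachable cluster) reports the dfs.blocksize value,
        // while other schemes report their own defaults.
        FileSystem fs = FileSystem.get(URI.create("file:///"), conf);
        System.out.println(fs.getUri() + " -> default block size "
                + fs.getDefaultBlockSize() + " bytes");
    }
}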

Hadoop dfs replicate

Sorry guys, just a simple question, but I cannot find the exact question on Google.
The question is about what dfs.replication means. If I put one file named filmdata.txt in HDFS and set dfs.replication=1, is there just one file in total (one filmdata.txt), or will Hadoop create another replica in addition to the main file (filmdata.txt)?
In short: if dfs.replication=1, is there one filmdata.txt in total, or two?
Thanks in Advance
The total number of copies of each block in the file system will be whatever dfs.replication specifies. So, if you set dfs.replication=1, then there will be only one copy of the file in the file system.
Check the Apache Documentation for the other configuration parameters.
To ensure high availability of data, Hadoop replicates the data.
When we store files in HDFS, the Hadoop framework splits each file into a set of blocks (64 MB or 128 MB), and these blocks are then replicated across the cluster nodes. The configuration dfs.replication specifies how many replicas are required.
The default value for dfs.replication is 3, but this is configurable depending on your cluster setup.
Hope this helps.
The link provided by Praveen is now broken.
Here is the updated link describing the parameter dfs.replication.
Refer to Hadoop Cluster Setup for more information on configuration parameters.
You may want to note that files can span multiple blocks, and each block will be replicated the number of times specified in dfs.replication (default value 3). The size of these blocks is specified by the parameter dfs.block.size.
In HDFS we use commodity machines to store the data. These commodity machines are not high-end servers with lots of RAM, so there is a chance of losing a data node (d1, d2, d3) or a block (b1, b2, b3). To guard against this, HDFS makes three replicas of each block of data (64 MB or 128 MB) by default, and each replica is stored on a separate data node (d1, d2, d3). Now suppose block b1 gets corrupted on data node d1: a copy of b1 is still available on d2 and d3, so the client can ask d2 to process b1 and return the result, and likewise, if d2 fails, the client can ask d3 to process b1. This is what dfs.replication means.
Hope you got some clarity.
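To make the "one copy or two" question concrete, here is a minimal sketch (not from the thread); it assumes a running HDFS with fs.defaultFS pointing at it, and the path is a hypothetical stand-in for the filmdata.txt from the question. With dfs.replication=1 there is exactly one physical copy of each block of the file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "1");   // files created through this conf get 1 replica
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path, standing in for the filmdata.txt from the question.
        Path file = new Path("/user/demo/filmdata.txt");

        // Replication can also be changed per file after it has been written.
        fs.setReplication(file, (short) 1);
        System.out.println("filmdata.txt replicas: "
                + fs.getFileStatus(file).getReplication());
    }
}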

Is the HDFS sink in Flume using an "anti-pattern" with its default config?

Looking at the HDFS sink's default parameters in Apache Flume, it seems that they will produce tons of very small files (1 kB rolls). From what I have learned about GFS/HDFS, block sizes are 64 MB and file sizes should rather be in the gigabytes to make sure everything runs efficiently.
So I'm curious whether the default parameters of Flume are just misleading or whether I missed something else here.
Cheers.
