How do we change the block size in Hadoop?

What is the difference between Cloudera CDH3 cluster and Cloudera CDH4 cluster
What is the default hdfs block size in CDH3
What is the default hdfs block size in CDH4
How to change the hdfs block size in cloudera CDH3 and CDH4 cluster

You can see the HDFS block size in the hdfs-site.xml file. The default is generally 64 MB or 128 MB, but you can change it in that file by setting the dfs.blocksize property (on CDH3-era Hadoop the equivalent property is named dfs.block.size; dfs.blocksize is the Hadoop 2.x / CDH4 name):
<property>
<name>dfs.blocksize</name>
<value>SIZE_IN_BYTES</value>
</property>
Bear in mind that the value you write must be in bytes, so 128MB for example would be 134217728 Bytes.
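If you only need a different block size for particular files rather than cluster-wide, the property can also be passed as a client-side override at write time. A minimal sketch, assuming a Hadoop 2.x / CDH4-style client; the file and path names are just placeholders:
hadoop fs -D dfs.blocksize=268435456 -put largefile.dat /user/hadoop/largefile.dat
Here 268435456 bytes is 256 MB, and the override only affects this particular write.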

Related

Hadoop replication factor precedence

I have this only in my namenode:
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
In my data nodes, I have this:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
Now my question is, will the replication factor be 3 or 1?
At the moment, the output of hdfs dfs -ls hdfs:///user/hadoop-user/data/0/0/0 shows a replication factor of 1:
-rw-r--r-- 1 hadoop-user supergroup 68313 2015-11-06 19:32 hdfs:///user/hadoop-user/data/0/0/0/00099954tnemhcatta.bin
Appreciate your answer.
The default replication factor is 3, which is the standard in most distributed systems: with a replication factor of 3 there is one original block and two replicas. On a single-node cluster (a single machine) we usually set it to 1, because with 3 there would be no benefit, as all the copies would sit on the same machine. So, simply put: on a multi-node cluster the replication factor should be 3, to cope with failures, and on a single machine it should be 1. Note also that dfs.replication is effectively a client-side setting applied when a file is created, which is why the file you listed shows a replication factor of 1 rather than the value configured on the namenode.
Open the hdfs-site.xml file. This file is usually found in the conf/ folder of the Hadoop installation directory. Change or add the following property to hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Block Replication</description>
</property>
You can also change the replication factor on a per-file basis using the Hadoop FS shell.
[jpanda@localhost ~]$ hadoop fs -setrep -w 3 /my/file
Alternatively, you can change the replication factor of all the files under a directory.
[jpanda@localhost ~]$ hadoop fs -setrep -R -w 3 /my/dir
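Since the replication factor is applied by the writing client, you can also override it for a single upload and then read it back from the listing. A sketch, assuming the same shell as above; localfile and /my/file are placeholder names, and the second column of the -ls output shows the current replication factor:
[jpanda@localhost ~]$ hadoop fs -D dfs.replication=2 -put localfile /my/file
[jpanda@localhost ~]$ hadoop fs -ls /my/file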

Hive job always running in-process (local Hadoop)

When I set this property in hive-site.xml
<property>
<name>hive.exec.mode.local.auto</name>
<value>false</value>
</property>
Hive always runs the hadoop job locally.
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 55
Job running in-process (local Hadoop)
Why does this happen?
As mentioned in HIVE-2585, going forward Hive will assume that the metastore is operating in local mode if the configuration property hive.metastore.uris is unset, and will assume remote mode otherwise.
Ensure the following properties are set in hive-site.xml:
<property>
<name>hive.metastore.uris</name>
<value>thrift://<host of metastore server>:9083</value>
</property>
<property>
<name>hive.metastore.local</name>
<value>false</value>
</property>
The hive.metastore.local property is no longer supported as of Hive 0.10; setting hive.metastore.uris is sufficient to indicate that you are using a remote metastore.
EDIT:
Starting with release 0.7, Hive also supports a mode to run map-reduce jobs in local-mode automatically. The relevant options are hive.exec.mode.local.auto, hive.exec.mode.local.auto.inputbytes.max, and hive.exec.mode.local.auto.tasks.max:
hive> SET hive.exec.mode.local.auto=false;
Note that this feature is disabled by default. If enabled, Hive analyzes the size of each map-reduce job in a query and may run it locally if the following thresholds are satisfied:
1. The total input size of the job is lower than: hive.exec.mode.local.auto.inputbytes.max (128MB by default)
2. The total number of map-tasks is less than: hive.exec.mode.local.auto.tasks.max (4 by default)
3. The total number of reduce tasks required is 1 or 0.
So for queries over small data sets, or for queries with multiple map-reduce jobs where the input to subsequent jobs is substantially smaller (because of reduction/filtering in the prior job), jobs may be run locally.
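These thresholds can be inspected or adjusted per session from the Hive CLI, in the same way as the SET statement shown above. A sketch; the values here are simply the documented defaults quoted in the list above:
hive> SET hive.exec.mode.local.auto=true;
hive> SET hive.exec.mode.local.auto.inputbytes.max=134217728;
hive> SET hive.exec.mode.local.auto.tasks.max=4;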
Reference: Hive Getting started

Use of core-site.xml in mapreduce program

I have seen MapReduce programs using/adding core-site.xml as a resource in the program. What is core-site.xml, and how can it be used in MapReduce programs?
From the documentation:
Unless explicitly turned off, Hadoop by default specifies two resources, loaded in order from the classpath:
core-default.xml: read-only defaults for Hadoop
core-site.xml: site-specific configuration for a given Hadoop installation
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
// Resources added here are loaded after the built-in *-default.xml files and override them
Configuration config = new Configuration();
config.addResource(new Path("/user/hadoop/core-site.xml"));
config.addResource(new Path("/user/hadoop/hdfs-site.xml"));
core-site.xml and hdfs-site.xml tell your MapReduce program which cluster it should point to (for example via fs.defaultFS) and how to reach HDFS, so the job knows where it will be submitted and executed.
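If you want to double-check which cluster a client-side configuration actually resolves to, you can print the effective value of a key. A sketch, assuming a Hadoop 2.x installation where the hdfs getconf utility is available:
hdfs getconf -confKey fs.defaultFS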

Yarn container lauch failed exception and mapred-site.xml configuration

I have 7 nodes in my Hadoop cluster [8GB RAM and 4 vCPUs on each node], 1 NameNode + 6 DataNodes.
EDIT-1 @ARNON: I followed the link, made the calculations according to the hardware configuration of my nodes, and have added the updated mapred-site.xml and yarn-site.xml files to my question. Still my application is crashing with the same exception.
My mapreduce application has 34 input splits with a block size of 128MB.
mapred-site.xml has the following properties:
mapreduce.framework.name = yarn
mapred.child.java.opts = -Xmx2048m
mapreduce.map.memory.mb = 4096
mapreduce.map.java.opts = -Xmx2048m
yarn-site.xml has the following properties:
yarn.resourcemanager.hostname = hadoop-master
yarn.nodemanager.aux-services = mapreduce_shuffle
yarn.nodemanager.resource.memory-mb = 6144
yarn.scheduler.minimum-allocation-mb = 2048
yarn.scheduler.maximum-allocation-mb = 6144
EDIT-2 @ARNON: Setting yarn.scheduler.minimum-allocation-mb to 4096 puts all the map tasks in a suspended state, and setting it to 3072 crashes with the following:
Exception from container-launch: ExitCodeException exitCode=134: /bin/bash: line 1: 3876 Aborted (core dumped) /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx8192m -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1424264025191_0002/container_1424264025191_0002_01_000011/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_000011
-Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 192.168.0.12 50842 attempt_1424264025191_0002_m_000005_0 11 >
/home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_000011/stdout 2>
/home/ubuntu/hadoop/logs/userlogs/application_1424264025191_0002/container_1424264025191_0002_01_000011/stderr
How can I avoid this? Any help is appreciated.
Is there an option to restrict the number of containers on Hadoop nodes?
It seems you are allocating too much memory to your tasks (even without looking at all the configurations): the nodes have 8GB of RAM, yet the container launch command shows an 8GB heap (-Xmx8192m) per map task, all of it heap.
Try lower allocations, e.g. a 2GB container with a 1GB heap, or something along those lines.
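As a rough sketch only (the numbers are illustrative for 8GB nodes, not tuned values), that advice could translate into something like:
mapreduce.map.memory.mb = 2048
mapreduce.map.java.opts = -Xmx1024m
mapreduce.reduce.memory.mb = 2048
mapreduce.reduce.java.opts = -Xmx1024m
yarn.scheduler.minimum-allocation-mb = 1024
yarn.nodemanager.resource.memory-mb = 6144
With yarn.nodemanager.resource.memory-mb at 6144 and 2048MB map containers, at most three such containers fit on a node, which also effectively restricts the number of containers per node.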

Hadoop: no DataNode started

I am following this tutorial.
http://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation
I got to this point and started the nodes.
Start NameNode daemon and DataNode daemon:
$ sbin/start-dfs.sh
But then when I run the next steps, it looks like no data node is running (as I get errors saying so).
Why is the data node down? And how can I fix this?
Here is the log from my data node.
hduser@test02:/usr/local/hadoop$ jps
3792 SecondaryNameNode
3929 Jps
3258 NameNode
hduser@test02:/usr/local/hadoop$ cat /usr/local/hadoop/logs/hadoop-hduser-datanode-test02.out
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /usr/local/hadoop/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 3781
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
hduser@test02:/usr/local/hadoop$
EDIT:
Seems I had this port number wrong.
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
Now that I have made it right (i.e. set it to 9000), the NameNode is not starting up.
hduser@test02:/usr/local/hadoop$ jps
10423 DataNode
10938 Jps
10703 SecondaryNameNode
and I cannot browse:
http://my-server-name:50070/
any more.
Hope this gives you some hint what is happening.
I am total beginner with Hadoop and kind of lost now.
[core-site.xml]
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/var/lib/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
</configuration>
[hdfs-site.xml]
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>
In mapred-site.xml I have nothing.
1. First stop all the daemons (NameNode, DataNode, etc.); you will have some script or command to do that.
2. Clean the tmp directory: go to /var/cache/hadoop-hdfs/hdfs/dfs/ and delete all the contents of the directory manually.
3. Now format your NameNode again.
4. Start all the daemons, then use the jps command to confirm that the DataNode has started.
5. Now run whichever application you like. A command-level sketch of these steps is shown below.
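The sketch below assumes the Hadoop 2.x layout used in the tutorial and that /var/cache/hadoop-hdfs/hdfs/dfs/ really is your DataNode data directory; adjust the paths to your setup:
sbin/stop-dfs.sh
rm -rf /var/cache/hadoop-hdfs/hdfs/dfs/*
bin/hdfs namenode -format
sbin/start-dfs.sh
jps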
Hope this helps.
Add this configuration
conf/core-site.xml
<property>
<name>hadoop.tmp.dir</name>
<value>/var/lib/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
stop hadoop
bin/stop-all.sh
change permission and remove temp directory data
chmod 755 /var/lib/hadoop/tmp
rm -Rf /var/lib/hadoop/tmp/*
format name node
bin/hadoop namenode -format
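Then bring the daemons back up and confirm the DataNode appears; a sketch, assuming the same Hadoop 1.x-style scripts as the commands above:
bin/start-all.sh
jps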
After 1 day of struggle, I just removed version 2.4 and installed Hadoop 2.2 (as I realized 2.2 is the latest stable version). Then I got it all working by following this nice tutorial.
http://codesfusion.blogspot.com/2013/10/setup-hadoop-2x-220-on-ubuntu.html?m=1
Something is not right with the 2.4 document I was reading.
Not to mention that it is not suitable for beginners, and it is usually beginners who stumble upon it.
Maybe your slaves' data and your master's data are not in sync. Delete the data and name folders in ./hadoop/hdfs and recreate them, then re-format the NameNode and start DFS.
