Hadoop replication factor precedence

Hadoop replication factor precedence - hadoop

I have this only in my namenode:
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
In my data nodes, I have this:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
Now my question is, will the replication factor be 3 or 1?
At the moment, the output of hdfs dfs -ls hdfs:///user/hadoop-user/data/0/0/0 shows 1 replication factor:
-rw-r--r-- 1 hadoop-user supergroup 68313 2015-11-06 19:32 hdfs:///user/hadoop-user/data/0/0/0/00099954tnemhcatta.bin
Appreciate your answer.

by default replication factor is 3, it is standard in most of the distributed system. if the replication factor was set to 3 (default value in HDFS) there would be one original block and two replicas. Most of time when we working on single node cluster(single machine) that time we put it 1. because if we will take 3 then there will be no benefit as all the copy are on single machine. so simple understanding. in multi node cluster replication factor should be 3 used in failure and in single machine replication factor should be 1.

Open the hdfs-site.xml file. This file is usually found in the conf/ folder of the Hadoop installation directory. Change or add the following property to hdfs-site.xml:
<property>
<name>dfs.replication<name>
<value>3<value>
<description>Block Replication<description>
<property>
You can also change the replication factor on a per-file basis using the Hadoop FS shell.
[jpanda#localhost ~]$ hadoop fs –setrep –w 3 /my/file
Alternatively, you can change the replication factor of all the files under a directory.
[jpanda#localhost ~]$ hadoop fs –setrep –w 3 -R /my/dir

Related

what would happen if nodes in hadoop change their IP address?

my hadoop clusters do not work fine because of the network conditions.What if i change the entire network,like another router,thus change the IP addresses? could the clusters still work by updating some configurations? or i must torn it down and rebuilt everything?
Thanks in advance

It works once you change the ip addresses into the configuration, why did not you use the DNS?
Ok, it was not a good answer, let me apologize and give a better answer.
If you need to change configuration on a running cluster you can decommission and commission the data nodes.
Switch off the data node is not a good idea.
Data Node Decomissioning
The fist step is tell to yarn you are going to remove some nodes, then you have to say the same to node manager.
I don't know if your system is configured for decommissioning, if it so you have the key yarn.resourcemanager.nodes.exclude-path into the yarn-site.xml and dfs.hosts.exclude into hdfs-site.xml
hdfs-site.xml
<property>
<name>dfs.hosts.exclude</name>
<value>$YOUR_PATH/dfs.exclude</value>
<final>true</final>
</property>
yarn-site.xml
<property>
<name>dfs.hosts.exclude</name>
<value>$YOUR_PATH/dfs.exclude</value>
<final>true</final>
</property>
Open the file $YOUR_PATH/dfs.exclude and add hostnames / ip addresses of node you need to stop.
execute
yarn rmadmin -refreshNodes
hdfs dfsadmin -refreshNodes
Check if the data nodes are in decommission checking the web interface.
Data Node Comissioning
Works in the same way of the Decommissioning
yarn-site.xml
<property>
<name>yarn.resourcemanager.nodes.include-path</name>
<value>$YOUR_PATH/dfs.include</value>
<final>true</final>
</property>
hdfs-site.xml
<property>
<name>dfs.hosts</name>
<value>$YOUR_PATH/dfs.include</value>
<final>true</final>
</property>
Open the file $YOUR_PATH/dfs.include and add hostnames / ip addresses of node you need to add.
yarn rmadmin -refreshNodes
hdfs dfsadmin -refreshNodes
wait some time
hdfs dfsadmin -report
Now the hosts you added are into the list.
If your configurations are missing the above keys you need to halt/restart the node manager and yarn after adding them.
Using these procedure you can halt data nodes in a safe way.

Cluster configuration and hdfs

Im trying to configure my cluster by following this tutorial -
https://developer.yahoo.com/hadoop/tutorial/module2.html
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://192.168.71.128:9000</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop-user/hdfs/data</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop-user/hdfs/name</value>
</property>
</configuration>
I have also copied a local file to /user/prema/ using the below commands
hadoop-user#hadoop-desk:~/hadoop$ bin/hadoop dfs -put /home/hadoop-user/googlebooks-eng-all-1gram-20120701-0 /user/prema
hadoop-user#hadoop-desk:~/hadoop$ bin/hadoop dfs -ls /user/prema
Found 1 items
-rw-r--r-- 1 hadoop-user supergroup 192403080 2014-11-19 02:43 /user/prema
Now, I'm confused. I have datafiles here- /user/prema but the data node in the cluster config points to this - /home/hadoop-user/hdfs/data..How does it get related?

The /user/prema is a folder within HDFS. The folder /home/hadoop-user/hdfs/data is a folder within the regular filesystem.
The regular filesystem folder is the place where HDFS stores its data. So when you read data from HDFS, it actually goes to the physical regular filesystem folder to read the data. You should never need to touch this data as its format is not very user-friendly - the HDFS takes care of data manipulation for you.

hadoop no data node started

I am following this tutorial.
http://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation
I got to this point and started the nodes.
Start NameNode daemon and DataNode daemon:
$ sbin/start-dfs.sh
But then when I run the next steps, it looks like no data node is running (as I get errors saying so).
Why is the data node down? And how can I fix this?
Here is the log from my data node.
hduser#test02:/usr/local/hadoop$ jps
3792 SecondaryNameNode
3929 Jps
3258 NameNode
hduser#test02:/usr/local/hadoop$ cat /usr/local/hadoop/logs/hadoop-hduser-datanode-test02.out
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /usr/local/hadoop/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
-m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 3781
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
hduser#test02:/usr/local/hadoop$
EDIT:
Seems I had this port number wrong.
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
Now when I made it right (i.e. equal to 9000) I have no name node starting up.
hduser#test02:/usr/local/hadoop$ jps
10423 DataNode
10938 Jps
10703 SecondaryNameNode
and I cannot browse:
http://my-server-name:50070/
any more.
Hope this gives you some hint what is happening.
I am total beginner with Hadoop and kind of lost now.
[core-site.xml]
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/var/lib/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
</configuration>
[hdfs-site.xml]
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>
In mapred-site.xml I have nothing.

1.first stop all the entities like namenode, datanode etc. (you will be having some script or command to do that)
Format tmp directory
Go to /var/cache/hadoop-hdfs/hdfs/dfs/ and delete all the contents in the directory manually
Now format your namenode again
start all the entities then use jps command to confirm that the datanode has been started
Now run whichever application you may like or have.
Hope this helps.

Add this configuration
conf/core-site.xml
<property>
<name>hadoop.tmp.dir</name>
<value>/var/lib/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
stop hadoop
bin/stop-all.sh
change permission and remove temp directory data
chmod 755 /var/lib/hadoop/tmp
rm -Rf /var/lib/hadoop/tmp/*
format name node
bin/hadoop namenode -format

After 1 day of struggle, I just removed version 2.4 and installed Hadoop 2.2 (as I realized 2.2 is the latest stable version). Then I got it all working by following this nice tutorial.
http://codesfusion.blogspot.com/2013/10/setup-hadoop-2x-220-on-ubuntu.html?m=1
Something is not right with this document about 2.4 which I was reading.
Not to talk that it's not suitable for beginners, and it's usually beginners who stumble upon it.

Maybe your slave's data master's data are not synced, delete data & name folder in ./hadoop/hdfs and recreate them. re-format namenode. Than start dfs.

Namenode stops working after hadoop restart

I have a server with Hadoop installed on.
I wanted to change some configuration (about the mapreduce.map.output.compress); therefore, I changed the configuration file, and I restarted Hadoop, with:
stop-all.sh
start-all.sh
After that, I was not able to use it again, becouse it was in Safe Mode:
The reported blocks is only 0 but the threshold is 0.9990 and the total blocks 11313. Safe mode will be turned off automatically.
Please, notice that the number of reported blocks is 0, and it was not increasing at all.
Therefore, I forced it to leave the Safe Mode with:
bin/hadoop dfsadmin -safemode leave
Now, I get errors like this:
2014-03-09 18:16:40,586 [Thread-1] ERROR org.apache.hadoop.hdfs.DFSClient - Failed to close file /tmp/temp-39739076/tmp2073328134/GQL.jar
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /tmp/temp-39739076/tmp2073328134/GQL.jar could only be replicated to 0 nodes, instead of 1
If it helps, my hdfs-site.xml is:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/hduser/hadoop/name/data</value>
</property>
</configuration>

I've run into this problem many times. Whenever you get the error stating x could only be replicated to 0 nodes, instead of 1, the following steps should fix the problem:
Stop all Hadoop services with: stop-all.sh
Delete the dfs/name and dfs/data directories
Format the NameNode with: hadoop namenode -format
Start Hadoop again with: start-all.sh

Hadoop / Yarn (v0.23.3) Psuedo-Distributed Mode setup :: No job node

I just setup Hadoop/Yarn 2.x (specifically, v0.23.3) in Psuedo-Distributed mode.
I followed the instructions of a few blogs & websites which, more-or-less provide the
same prescription for setting it up. I also followed the 3rd-Edition of O'reilly's
Hadoop book (which ironically was the least helpful).
THE PROBLEM:
After running "start-dfs.sh" and then "start-yarn.sh", while all of the daemons
do start (as indicated by jps(1)), the Resource Manager web portal
(Here: http://localhost:8088/cluster/nodes) indicates 0 (zero) job-nodes in the
cluster. So while submitting the example/test Hadoop job indeed does get
scheduled, it pends forever because, I assume, the configuration doesn't see a
node to run it on.
Below are the steps I performed, including resultant configuration files.
Hopefully the community help me out... (And thank you in advance).
THE CONFIGURATION:
The following environment variables are set in both my and hadoop's UNIX account profiles: ~/.profile:
export HADOOP_HOME=/home/myself/APPS.d/APACHE_HADOOP.d/latest
# Note: /home/myself/APPS.d/APACHE_HADOOP.d/latest -> hadoop-0.23.3
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_INSTALL=${HADOOP_HOME}
export HADOOP_CLASSPATH=${HADOOP_HOME}/lib
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop/conf
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop/conf
export JAVA_HOME=/usr/lib/jvm/jre
hadoop$ java -version
java version "1.7.0_06-icedtea<br>
OpenJDK Runtime Environment (fedora-2.3.1.fc17.2-x86_64)<br>
OpenJDK 64-Bit Server VM (build 23.2-b09, mixed mode)<br>
# Although the above shows OpenJDK, the same problem happens with Sun's JRE/JDK.
The NAMENODE & DATANODE directories, also specified in etc/hadoop/conf/hdfs-site.xml:
/home/myself/APPS.d/APACHE_HADOOP.d/latest/YARN_DATA.d/HDFS.d/DATANODE.d/
/home/myself/APPS.d/APACHE_HADOOP.d/latest/YARN_DATA.d/HDFS.d/NAMENODE.d/
Next, the various XML configuration files (again, YARN/MRv2/v0.23.3 here):
hadoop$ pwd; ls -l
/home/myself/APPS.d/APACHE_HADOOP.d/latest/etc/hadoop/conf
lrwxrwxrwx 1 hadoop hadoop 16 Sep 20 13:14 core-site.xml -> ../core-site.xml
lrwxrwxrwx 1 hadoop hadoop 16 Sep 20 13:14 hdfs-site.xml -> ../hdfs-site.xml
lrwxrwxrwx 1 hadoop hadoop 18 Sep 20 13:14 httpfs-site.xml -> ../httpfs-site.xml
lrwxrwxrwx 1 hadoop hadoop 18 Sep 20 13:14 mapred-site.xml -> ../mapred-site.xml
-rw-rw-r-- 1 hadoop hadoop 10 Sep 20 15:36 slaves
lrwxrwxrwx 1 hadoop hadoop 16 Sep 20 13:14 yarn-site.xml -> ../yarn-site.xml
core-site.xml
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost/</value>
</property>
</configuration>
mapred-site.xml
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
<!-- Same problem whether this (legacy) stanza is included or not. -->
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
hdfs-site.xml
<!-- hdfs-site.xml -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/myself/APPS.d/APACHE_HADOOP.d/YARN_DATA.d/HDFS.d/NAMENODE.d</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/myself/APPS.d/APACHE_HADOOP.d/YARN_DATA.d/HDFS.d/DATANODE.d</value>
</property>
</configuration>
yarn-site.xml
<?xml version="1.0"?>
<!-- yarn-site.xml -->
<configuration>
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:8032</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce.shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/home/myself/APPS.d/APACHE_HADOOP.d/YARN_DATA.d/TEMP.d</value>
</property>
</configuration>
etc/hadoop/conf/saves
localhost
# Community/friends, is this entry correct/needed for my psuedo-dist mode?
Miscellaneous wrap-up notes:
(1) As you may have gleaned from above, all files/directories are owned
by the 'hadoop' UNIX user. There is a hadoop:hadoop, UNIX User and
Group, respectively.
(2) The following command was run after the NAMENODE & DATANODE directories
(listed above) were created (and whose paths were entered into
hdfs-site.xml):
hadoop$ hadoop namenode -format
(3) Next, I ran "start-dfs.sh", then "start-yarn.sh".
Here is jps(1) output:
hadoop#e6510$ jps
21979 DataNode
22253 ResourceManager
22384 NodeManager
22156 SecondaryNameNode
21829 NameNode
22742 Jps
Thank you!

After much toil on this problem without success (and trust me I tried it all), I instituted
hadoop using a different solution. Whereas above I downloaded a gzip/tar ball
of the hadoop distribution (again v0.23.3) from one of the download mirrors, this
time I used the Caldera CDH distribution of RPM packages, which I installed via
their YUM repos. In hopes that this will help someone, here are the detailed steps.
Step-1:
For Hadoop 0.20.x (MapReduce version 1):
# rpm -Uvh http://archive.cloudera.com/redhat/6/x86_64/cdh/cdh3-repository-1.0-1.noarch.rpm
# rpm --import http://archive.cloudera.com/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
# yum install hadoop-0.20-conf-pseudo
-or-
For Hadoop 0.23.x (MapReduce version 2):
# rpm -Uvh http://archive.cloudera.com/cdh4/one-click-install/redhat/6/x86_64/cloudera-cdh-4-0.noarch.rpm
# rpm --import http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
# yum install hadoop-conf-pseudo
In both cases above, installing that "psuedo" package (which stands for "pseudo-distributed
Hadoop" mode), will alone conveniently trigger the installation of all the other necessary packages you'll need (via dependency resolution).
Step-2:
Install Sun/Oracle's Java JRE (if you haven't already done so). You can
install it via the RPM that they provide, or the gzip/tar ball portable
version. It doesn't matter which as long as you set and export the "JAVA_HOME"
environment appropriately, and ensure ${JAVA_HOME}/bin/java is in your path.
# echo $JAVA_HOME; which java
/home/myself/APPS.d/JAVA-JRE.d/jdk1.7.0_07
/home/myself/APPS.d/JAVA-JRE.d/jdk1.7.0_07/bin/java
Note: I actually create a symlink called "latest" and point/re-point it to the JAVA
version specific directory whenever I update the JAVA. I was explicit above for
the reader's understanding.
Step-3: Format hdfs as the "hdfs" Unix user (created during "yum install" above).
# sudo su hdfs -c "hadoop namenode -format"
Step-4:
Manually start the hadoop daemons.
for file in `ls /etc/init.d/hadoop*`
do
{
${file} start
}
done
Step-5:
Check to see if things are working. The following is for MapReduce v1
(It's not that much different for MapReduce v2 at this superficial level).
root# jps
23104 DataNode
23469 TaskTracker
23361 SecondaryNameNode
23187 JobTracker
23267 NameNode
24754 Jps
# Do the next commands as yourself (not as "root").
myself$ hadoop fs -mkdir /foo
myself$ hadoop fs -rmr /foo
myself$ hadoop jar /usr/lib/hadoop-0.20/hadoop-0.20.2-cdh3u5-examples.jar pi 2 100000
I hope this helped!

Noel,
I followed this other day the steps in this tutorial http://www.thecloudavenue.com/search?q=0.23 and I managed to set up a small cluster of 3 centos 6.3 machines

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Hadoop replication factor precedence - hadoop

Related

what would happen if nodes in hadoop change their IP address?

Cluster configuration and hdfs

hadoop no data node started

Namenode stops working after hadoop restart

Hadoop / Yarn (v0.23.3) Psuedo-Distributed Mode setup :: No job node

Categories

Resources