HDFS Federation Unknown Namespace - hadoop

Let's say that I have configured two NameNodes to manage /marketing and /finance respectively. I am wondering what would happen if I put a file in an /accounting directory. Would HDFS accept the file? If so, which namespace would manage it?

The write will fail. Neither namespace will manage the file.
You will get an IOException with a No such file or directory error from the ViewFs client.
For example, given the following ViewFs config in core-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>viewfs:///</value>
</property>
<property>
<name>fs.viewfs.mounttable.default.link./namenode-a</name>
<value>hdfs://namenode-a</value>
</property>
<property>
<name>fs.viewfs.mounttable.default.link./namenode-b</name>
<value>hdfs://namenode-b</value>
</property>
</configuration>
The following behavior is exhibited:
$ bin/hdfs dfs -ls /
-r--r--r-- - sirianni gopher 0 2013-10-22 15:58 /namenode-a
-r--r--r-- - sirianni gopher 0 2013-10-22 15:58 /namenode-b
$ bin/hdfs dfs -copyFromLocal /tmp/bar.txt /foo/bar.txt
copyFromLocal: `/foo/bar.txt': No such file or directory
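To make an /accounting path usable, you would add a mount-table entry for it as well. The snippet below is only a sketch; hdfs://namenode-c is a hypothetical third namespace, and you could just as well point the link at namenode-a or namenode-b:
<property>
<name>fs.viewfs.mounttable.default.link./accounting</name>
<!-- namenode-c is hypothetical; any existing namespace could back this mount point -->
<value>hdfs://namenode-c</value>
</property>
Whichever NameNode the link points to becomes the namespace that manages everything written under /accounting.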

Related

Incompatible clusterIDs in datanode and namenode

I checked the solutions on this site.
I went to (hadoop folder)/data/dfs/datanode to change the ID,
but there is nothing in the datanode folder.
What can I do?
Thanks for reading; I would appreciate any help.
PS
2017-04-11 20:24:05,507 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to add storage directory [DISK]file:/tmp/hadoop-knu/dfs/data/
java.io.IOException: Incompatible clusterIDs in /tmp/hadoop-knu/dfs/data: namenode clusterID = CID-4491e2ea-b0dd-4e54-a37a-b18aaaf5383b; datanode clusterID = CID-13a3b8e1-2f8e-4dd2-bcf9-c602420c1d3d
2017-04-11 20:24:05,509 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool (Datanode Uuid unassigned) service to localhost/127.0.0.1:9010. Exiting.
java.io.IOException: All specified directories are failed to load.
2017-04-11 20:24:05,509 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool (Datanode Uuid unassigned) service to localhost/127.0.0.1:9010
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9010</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/knu/hadoop/hadoop-2.7.3/data/dfs/namenode</value>
</property>
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>/home/knu/hadoop/hadoop-2.7.3/data/dfs/namesecondary</value>
</property>
<property>
<name>dfs.dataode.data.dir</name>
<value>/home/knu/hadoop/hadoop-2.7.3/data/dfs/datanode</value>
</property>
<property>
<name>dfs.http.address</name>
<value>localhost:50070</value>
</property>
<property>
<name>dfs.secondary.http.address</name>
<value>localhost:50090</value>
</property>
</configuration>
PS2
[knu@localhost ~]$ ls -l /home/knu/hadoop/hadoop-2.7.3/data/dfs/
drwxrwxr-x. 2 knu knu 6 Apr 11 21:28 datanode
drwxrwxr-x. 3 knu knu 40 Apr 11 22:15 namenode
drwxrwxr-x. 3 knu knu 40 Apr 11 22:15 namesecondary
The problem is with the property name dfs.datanode.data.dir: it is misspelled as dfs.dataode.data.dir. Because the property is not recognised, the default location of ${hadoop.tmp.dir}/dfs/data is used as the data directory, which here expands to /tmp/hadoop-knu/dfs/data.
hadoop.tmp.dir defaults to a directory under /tmp, whose contents are deleted on every reboot; the DataNode is then forced to recreate the folder on startup and the cluster IDs fall out of sync, hence the Incompatible clusterIDs error.
Fix this property name in hdfs-site.xml before formatting the NameNode and starting the services.
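For reference, the corrected property (keeping the same path as in the question) should read:
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/knu/hadoop/hadoop-2.7.3/data/dfs/datanode</value>
</property>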
My solution:
rm -rf ./tmp/hadoop-${user}/dfs/data/*
./bin/hadoop namenode -format
./sbin/hadoop-daemon.sh start datanode
Try formatting the NameNode and then restarting HDFS.
Alternatively, copy the existing cluster ID and, from the hadoop/bin directory, reformat the NameNode with it:
./hdfs namenode -format -clusterId CID-a5a9348a-3890-4dce-94dc-0fec2ba999a9
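If you take this route, the cluster ID to reuse can be read from the VERSION files in the storage directories; the paths below are the ones from the question and its logs, and the NameNode and DataNode values must end up matching:
# each VERSION file contains a clusterID=CID-... line
cat /home/knu/hadoop/hadoop-2.7.3/data/dfs/namenode/current/VERSION
cat /tmp/hadoop-knu/dfs/data/current/VERSION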

Uneven Data Replication on Hadoop DataNodes

I am working to create a small Hadoop cluster on my network. I have 1 NameNode and 2 DataNodes:
garage => NameNode
garage2 => DataNode
garage3 => DataNode
On the NameNode, I formatted hdfs using:
hadoop namenode -format
I then created the user directories:
hadoop dfs -mkdir /user
hadoop dfs -mkdir /user/erik
hadoop dfs -mkdir movielens
I then uploaded a few files to test it out:
hadoop dfs -put * movielens
My expectation was that both datanodes would contain full copies of the data since my replication factor is set to 2 in hdfs-site.xml (Same config file on all 3 nodes):
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/mnt/data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/mnt/data/hdfs/datanode</value>
</property>
<property>
<name>dfs.namenode.datanode.registration.ip-hostname-check</name>
<value>false</value>
</property>
However, I am seeing an uneven distribution of data files in the hdfs folders on disk:
garage2 (DataNode):
erik@garage2:/mnt/data/hdfs$ du -h
4.0K ./datanode/current/BP-152062109-192.168.0.100-1475633473579/tmp
4.0K ./datanode/current/BP-152062109-192.168.0.100-1475633473579/current/rbw
619M ./datanode/current/BP-152062109-192.168.0.100-1475633473579/current/finalized/subdir0/subdir0
619M ./datanode/current/BP-152062109-192.168.0.100-1475633473579/current/finalized/subdir0
619M ./datanode/current/BP-152062109-192.168.0.100-1475633473579/current/finalized
619M ./datanode/current/BP-152062109-192.168.0.100-1475633473579/current
619M ./datanode/current/BP-152062109-192.168.0.100-1475633473579
619M ./datanode/current
619M ./datanode
619M .
And from garage3 (DataNode):
erik@garage3:/mnt/data/hdfs$ du -h
4.0K ./datanode
8.0K .
Am I missing something in my configuration that would even out this distribution/replication?
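One way to check how the blocks were actually placed is to ask HDFS itself: dfsadmin reports per-DataNode usage and fsck lists the replica locations of every block. This assumes the relative movielens upload resolved to /user/erik/movielens:
hdfs dfsadmin -report
hdfs fsck /user/erik/movielens -files -blocks -locations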

What's the standard way to create files in your hdfs filesystem?

I learned that I have to configure the NameNode and DataNode dir in hdfs-site.xml. So that's my hdfs-site.xml configuration on the NameNode:
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file://usr/local/hadoop-2.6.0/hadoop_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
</configuration>
I did almost the same on my DataNode and changed dfs.namenode to dfs.datanode.
Then I formatted the filesystem via
hadoop namenode -format
Everything seemed to finish without errors.
Then I wanted to create a directory in my HDFS filesystem by using:
hdfs dfs -mkdir test
And I got an error:
mkdir: `test': No such file or directory
What did I miss or what's the common process from formatting to creating files/directories with HDFS?
It turned out to be simple: a relative path such as test resolves to /user/<username>/test, and that parent directory did not exist yet. With an absolute path,
hdfs dfs -mkdir /test
succeeds, and
hdfs dfs -put myFile /test/myFile
works as well.
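The relative form also works once the HDFS home directory exists, since relative paths resolve against /user/<username>; a minimal sketch:
hdfs dfs -mkdir -p /user/$(whoami)
# the relative path now resolves to /user/<username>/test
hdfs dfs -mkdir test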
Create a directory:
hdfs dfs -mkdir directoryName
Create a new file in the directory:
hdfs dfs -touchz directoryName/Newfilename
Write into the newly created file (note that this edits a local file, not the one in HDFS):
nano filename
Save it with Ctrl+X, then Y.
Read the file back, either locally:
nano fileName
Or from HDFS:
hdfs dfs -cat directoryName/fileName
HDFS is not a POSIX-compliant file system, so you can't edit files directly inside HDFS; however, you can copy a file from your local system into HDFS using the following command:
hdfs dfs -put /path/in/source/system/filename /path/in/HDFS/system/destination
If you want to create nested sub-directories, also use the -p flag:
hdfs dfs -mkdir -p /test/another_test/one_more_test
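Putting the pieces together, a typical round trip looks like this (the file and directory names are just placeholders): edit locally, push the file into HDFS, read it back, and pull a copy out again:
nano myFile.txt                                    # edit the file locally
hdfs dfs -mkdir -p /test
hdfs dfs -put myFile.txt /test/myFile.txt          # copy it into HDFS
hdfs dfs -cat /test/myFile.txt                     # read it from HDFS
hdfs dfs -get /test/myFile.txt ./myFile.copy.txt   # copy it back out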

Cluster configuration and hdfs

I'm trying to configure my cluster by following this tutorial:
https://developer.yahoo.com/hadoop/tutorial/module2.html
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://192.168.71.128:9000</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop-user/hdfs/data</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop-user/hdfs/name</value>
</property>
</configuration>
I have also copied a local file to /user/prema/ using the below commands
hadoop-user@hadoop-desk:~/hadoop$ bin/hadoop dfs -put /home/hadoop-user/googlebooks-eng-all-1gram-20120701-0 /user/prema
hadoop-user@hadoop-desk:~/hadoop$ bin/hadoop dfs -ls /user/prema
Found 1 items
-rw-r--r-- 1 hadoop-user supergroup 192403080 2014-11-19 02:43 /user/prema
Now I'm confused. I have data files here at /user/prema, but the DataNode in the cluster config points to /home/hadoop-user/hdfs/data. How are the two related?
/user/prema is a path within HDFS, while /home/hadoop-user/hdfs/data is a folder on the regular (local) filesystem.
The local folder is where the DataNode physically stores HDFS data, so when you read a file from HDFS it is actually served from blocks kept in that folder. You should never need to touch that directory yourself; its layout is not user-friendly, and HDFS takes care of all data manipulation for you.
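To see the relationship concretely, fsck can list the blocks behind an HDFS path, and those block IDs show up as blk_* files under the DataNode's local directory (paths taken from the question):
bin/hadoop fsck /user/prema -files -blocks -locations
find /home/hadoop-user/hdfs/data -name 'blk_*' | head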

Hadoop / YARN (v0.23.3) Pseudo-Distributed Mode setup :: No job node

I just set up Hadoop/YARN 2.x (specifically, v0.23.3) in pseudo-distributed mode.
I followed the instructions of a few blogs and websites which, more or less, give the
same prescription for setting it up. I also followed the 3rd edition of O'Reilly's
Hadoop book (which, ironically, was the least helpful).
THE PROBLEM:
After running "start-dfs.sh" and then "start-yarn.sh", all of the daemons
do start (as indicated by jps(1)), but the ResourceManager web portal
(here: http://localhost:8088/cluster/nodes) reports 0 (zero) nodes in the
cluster. So while the example/test Hadoop job does get scheduled, it pends
forever because, I assume, the configuration doesn't see a node to run it on.
Below are the steps I performed, including the resulting configuration files.
Hopefully the community can help me out... (and thank you in advance).
THE CONFIGURATION:
The following environment variables are set in both my and hadoop's UNIX account profiles: ~/.profile:
export HADOOP_HOME=/home/myself/APPS.d/APACHE_HADOOP.d/latest
# Note: /home/myself/APPS.d/APACHE_HADOOP.d/latest -> hadoop-0.23.3
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_INSTALL=${HADOOP_HOME}
export HADOOP_CLASSPATH=${HADOOP_HOME}/lib
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop/conf
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop/conf
export JAVA_HOME=/usr/lib/jvm/jre
hadoop$ java -version
java version "1.7.0_06-icedtea<br>
OpenJDK Runtime Environment (fedora-2.3.1.fc17.2-x86_64)<br>
OpenJDK 64-Bit Server VM (build 23.2-b09, mixed mode)<br>
# Although the above shows OpenJDK, the same problem happens with Sun's JRE/JDK.
The NAMENODE & DATANODE directories, also specified in etc/hadoop/conf/hdfs-site.xml:
/home/myself/APPS.d/APACHE_HADOOP.d/latest/YARN_DATA.d/HDFS.d/DATANODE.d/
/home/myself/APPS.d/APACHE_HADOOP.d/latest/YARN_DATA.d/HDFS.d/NAMENODE.d/
Next, the various XML configuration files (again, YARN/MRv2/v0.23.3 here):
hadoop$ pwd; ls -l
/home/myself/APPS.d/APACHE_HADOOP.d/latest/etc/hadoop/conf
lrwxrwxrwx 1 hadoop hadoop 16 Sep 20 13:14 core-site.xml -> ../core-site.xml
lrwxrwxrwx 1 hadoop hadoop 16 Sep 20 13:14 hdfs-site.xml -> ../hdfs-site.xml
lrwxrwxrwx 1 hadoop hadoop 18 Sep 20 13:14 httpfs-site.xml -> ../httpfs-site.xml
lrwxrwxrwx 1 hadoop hadoop 18 Sep 20 13:14 mapred-site.xml -> ../mapred-site.xml
-rw-rw-r-- 1 hadoop hadoop 10 Sep 20 15:36 slaves
lrwxrwxrwx 1 hadoop hadoop 16 Sep 20 13:14 yarn-site.xml -> ../yarn-site.xml
core-site.xml
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost/</value>
</property>
</configuration>
mapred-site.xml
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
<!-- Same problem whether this (legacy) stanza is included or not. -->
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
hdfs-site.xml
<!-- hdfs-site.xml -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/myself/APPS.d/APACHE_HADOOP.d/YARN_DATA.d/HDFS.d/NAMENODE.d</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/myself/APPS.d/APACHE_HADOOP.d/YARN_DATA.d/HDFS.d/DATANODE.d</value>
</property>
</configuration>
yarn-site.xml
<?xml version="1.0"?>
<!-- yarn-site.xml -->
<configuration>
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:8032</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce.shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/home/myself/APPS.d/APACHE_HADOOP.d/YARN_DATA.d/TEMP.d</value>
</property>
</configuration>
etc/hadoop/conf/slaves
localhost
# Community/friends, is this entry correct/needed for my pseudo-dist mode?
Miscellaneous wrap-up notes:
(1) As you may have gleaned from the above, all files/directories are owned
by the 'hadoop' UNIX user; there is a corresponding hadoop:hadoop UNIX user and
group.
(2) The following command was run after the NAMENODE & DATANODE directories
(listed above) were created (and their paths entered into
hdfs-site.xml):
hadoop$ hadoop namenode -format
(3) Next, I ran "start-dfs.sh", then "start-yarn.sh".
Here is jps(1) output:
hadoop@e6510$ jps
21979 DataNode
22253 ResourceManager
22384 NodeManager
22156 SecondaryNameNode
21829 NameNode
22742 Jps
Thank you!
After much toil on this problem without success (and trust me, I tried everything), I set up
Hadoop using a different approach. Whereas above I downloaded a gzip/tar ball
of the Hadoop distribution (again v0.23.3) from one of the download mirrors, this
time I used the Cloudera CDH distribution of RPM packages, which I installed via
their YUM repos. In the hope that this helps someone, here are the detailed steps.
Step-1:
For Hadoop 0.20.x (MapReduce version 1):
# rpm -Uvh http://archive.cloudera.com/redhat/6/x86_64/cdh/cdh3-repository-1.0-1.noarch.rpm
# rpm --import http://archive.cloudera.com/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
# yum install hadoop-0.20-conf-pseudo
-or-
For Hadoop 0.23.x (MapReduce version 2):
# rpm -Uvh http://archive.cloudera.com/cdh4/one-click-install/redhat/6/x86_64/cloudera-cdh-4-0.noarch.rpm
# rpm --import http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
# yum install hadoop-conf-pseudo
In both cases above, installing that "pseudo" package (which stands for "pseudo-distributed
Hadoop" mode) will, on its own, conveniently pull in all the other necessary packages you'll need (via dependency resolution).
Step-2:
Install Sun/Oracle's Java JRE (if you haven't already done so). You can
install it via the RPM that they provide, or the gzip/tar ball portable
version. It doesn't matter which, as long as you set and export the JAVA_HOME
environment variable appropriately and ensure ${JAVA_HOME}/bin/java is on your PATH.
# echo $JAVA_HOME; which java
/home/myself/APPS.d/JAVA-JRE.d/jdk1.7.0_07
/home/myself/APPS.d/JAVA-JRE.d/jdk1.7.0_07/bin/java
Note: I actually create a symlink called "latest" and re-point it to the version-specific
Java directory whenever I update Java. I spelled out the full path above for
the reader's benefit.
Step-3: Format hdfs as the "hdfs" Unix user (created during "yum install" above).
# sudo su hdfs -c "hadoop namenode -format"
Step-4:
Manually start the hadoop daemons.
for file in /etc/init.d/hadoop*
do
  "${file}" start
done
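To confirm the daemons came up, the same loop can be run with "status" instead of "start" (assuming these init scripts follow the usual start/stop/status convention):
for file in /etc/init.d/hadoop*
do
  "${file}" status
done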
Step-5:
Check to see if things are working. The following is for MapReduce v1
(It's not that much different for MapReduce v2 at this superficial level).
root# jps
23104 DataNode
23469 TaskTracker
23361 SecondaryNameNode
23187 JobTracker
23267 NameNode
24754 Jps
# Do the next commands as yourself (not as "root").
myself$ hadoop fs -mkdir /foo
myself$ hadoop fs -rmr /foo
myself$ hadoop jar /usr/lib/hadoop-0.20/hadoop-0.20.2-cdh3u5-examples.jar pi 2 100000
I hope this helped!
Noel,
the other day I followed the steps in this tutorial http://www.thecloudavenue.com/search?q=0.23 and managed to set up a small cluster of three CentOS 6.3 machines.
