Cluster configuration and hdfs - hadoop

I'm trying to configure my cluster by following this tutorial:
https://developer.yahoo.com/hadoop/tutorial/module2.html
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://192.168.71.128:9000</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop-user/hdfs/data</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop-user/hdfs/name</value>
</property>
</configuration>
I have also copied a local file to /user/prema using the commands below:
hadoop-user@hadoop-desk:~/hadoop$ bin/hadoop dfs -put /home/hadoop-user/googlebooks-eng-all-1gram-20120701-0 /user/prema
hadoop-user@hadoop-desk:~/hadoop$ bin/hadoop dfs -ls /user/prema
Found 1 items
-rw-r--r-- 1 hadoop-user supergroup 192403080 2014-11-19 02:43 /user/prema
Now I'm confused. My data file is at /user/prema, but the DataNode in the cluster config points to /home/hadoop-user/hdfs/data. How are the two related?

/user/prema is a folder within HDFS. The folder /home/hadoop-user/hdfs/data is a folder within the regular filesystem.
The regular filesystem folder is the place where HDFS stores its data. So when you read data from HDFS, it actually goes to that physical folder to read the blocks. You should never need to touch this data directly, as its format is not user-friendly; HDFS takes care of data manipulation for you.
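For example (a sketch; the exact block-pool subdirectories and blk_... names will differ per cluster), the same data is visible in two ways:
bin/hadoop dfs -ls /user/prema
ls /home/hadoop-user/hdfs/data/current
The first command shows the logical HDFS file; the second shows the raw blk_... block files and their .meta checksum files that HDFS writes to the local disk, which are not meant to be read directly.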

Related

Uneven Data Replication on Hadoop DataNodes

I am working to create a small Hadoop cluster on my network. I have 1 NameNode and 2 DataNodes:
garage => NameNode
garage2 => DataNode
garage3 => DataNode
On the NameNode, I formatted hdfs using:
hadoop namenode -format
I then created the user directories:
hadoop dfs -mkdir /user
hadoop dfs -mkdir /user/erik
hadoop dfs -mkdir movielens
I then uploaded a few files to test it out:
hadoop dfs -put * movielens
My expectation was that both datanodes would contain full copies of the data since my replication factor is set to 2 in hdfs-site.xml (Same config file on all 3 nodes):
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/mnt/data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/mnt/data/hdfs/datanode</value>
</property>
<property>
<name>dfs.namenode.datanode.registration.ip-hostname-check</name>
<value>false</value>
</property>
However, I am seeing an uneven distribution of data files in the hdfs folders on disk:
garage2 (DataNode):
erik@garage2:/mnt/data/hdfs$ du -h
4.0K ./datanode/current/BP-152062109-192.168.0.100-1475633473579/tmp
4.0K ./datanode/current/BP-152062109-192.168.0.100-1475633473579/current/rbw
619M ./datanode/current/BP-152062109-192.168.0.100-1475633473579/current/finalized/subdir0/subdir0
619M ./datanode/current/BP-152062109-192.168.0.100-1475633473579/current/finalized/subdir0
619M ./datanode/current/BP-152062109-192.168.0.100-1475633473579/current/finalized
619M ./datanode/current/BP-152062109-192.168.0.100-1475633473579/current
619M ./datanode/current/BP-152062109-192.168.0.100-1475633473579
619M ./datanode/current
619M ./datanode
619M .
And from garage3 (DataNode):
erik@garage3:/mnt/data/hdfs$ du -h
4.0K ./datanode
8.0K .
Am I missing something in my configuration that will even this distribution/data replication out?
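One way to see where each block actually landed (a sketch, assuming the cluster is running; the path is the one from the commands above):
hdfs fsck /user/erik -files -blocks -locations
The report prints every block together with the DataNodes holding its replicas, so an under-replicated file shows up immediately.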

Hadoop replication factor precedence

I have this only in my namenode:
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
In my data nodes, I have this:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
Now my question is, will the replication factor be 3 or 1?
At the moment, the output of hdfs dfs -ls hdfs:///user/hadoop-user/data/0/0/0 shows a replication factor of 1:
-rw-r--r-- 1 hadoop-user supergroup 68313 2015-11-06 19:32 hdfs:///user/hadoop-user/data/0/0/0/00099954tnemhcatta.bin
Appreciate your answer.
By default the replication factor is 3; this is standard in most distributed systems. With a replication factor of 3 (the HDFS default) there is one original block and two replicas. On a single-node cluster (a single machine) we usually set it to 1, because keeping 3 copies on the same machine gives no benefit. Put simply: in a multi-node cluster a replication factor of 3 protects against node failure, while on a single machine a replication factor of 1 is enough.
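To check the replication factor of a single file (a sketch, assuming a running cluster; the path is the one from the question), the stat command can be used:
hdfs dfs -stat '%r' /user/hadoop-user/data/0/0/0/00099954tnemhcatta.bin
The %r format specifier prints the file's replication factor; it is also the second column of hdfs dfs -ls output.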
Open the hdfs-site.xml file. This file is usually found in the conf/ folder of the Hadoop installation directory. Change or add the following property to hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Block Replication</description>
</property>
You can also change the replication factor on a per-file basis using the Hadoop FS shell.
[jpanda@localhost ~]$ hadoop fs -setrep -w 3 /my/file
Alternatively, you can change the replication factor of all the files under a directory.
[jpanda@localhost ~]$ hadoop fs -setrep -R -w 3 /my/dir

What's the standard way to create files in your hdfs filesystem?

I learned that I have to configure the NameNode and DataNode dir in hdfs-site.xml. So that's my hdfs-site.xml configuration on the NameNode:
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file://usr/local/hadoop-2.6.0/hadoop_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
</configuration>
I did almost the same on my DataNode and changed dfs.namenode to dfs.datanode.
Then I formatted the filesystem via
hadoop namenode -format
Everything seems to be finished without an error.
Then I wanted to create a directory in my HDFS filesystem by using:
hdfs dfs -mkdir test
And I got an error:
mkdir: `test': No such file or directory
What did I miss or what's the common process from formatting to creating files/directories with HDFS?
Well, it's easy; the path just needed a leading slash (a relative path like test resolves against /user/<username>, which didn't exist yet).
hdfs dfs -mkdir /test
was created successfully, and
hdfs dfs -put myFile /test/myFile
works as well.
Create a directory:
hdfs dfs -mkdir directoryName
Create a new file in directory
hdfs dfs -touchz directoryName/Newfilename
Write into the file: HDFS files cannot be opened in an editor directly, so edit a local copy (for example with nano, saving with Ctrl+X then Y) and upload it with hdfs dfs -put.
Read the newly created file from HDFS:
hdfs dfs -cat directoryName/fileName
HDFS is a non-POSIX-compliant file system, so you can't edit files directly inside HDFS. However, you can copy a file from your local system to HDFS using the following command:
hdfs dfs -put /path/in/source/system/filename /path/in/HDFS/system/destination
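Since in-place editing is not possible, a common workaround (a sketch; the file names are illustrative) is to pull the file down, edit it locally, and push it back, overwriting the original with -f:
hdfs dfs -get /path/in/HDFS/file.txt ./file.txt
nano ./file.txt
hdfs dfs -put -f ./file.txt /path/in/HDFS/file.txt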
If you want to create multiple nested sub-directories, you should also use the -p flag:
hdfs dfs -mkdir -p /test/another_test/one_more_test

hadoop file system list my own root directory

I ran into a very weird situation when trying to install single-node Hadoop YARN 2.2.0 on my Mac. I followed the tutorial at this link: http://raseshmori.wordpress.com/2012/09/23/install-hadoop-2-0-1-yarn-nextgen/.
When I start Hadoop and run jps to check the status, it shows (which looks normal, I think):
5552 Jps
7162 ResourceManager
7512 Jps
7243 NodeManager
6962 DataNode
7060 SecondaryNameNode
6881 NameNode
However, after entering
hadoop fs -ls /
the files listed are the files in my own root directory, not the Hadoop file system root. There must be some error in my Hadoop setup that mixes my own fs with HDFS. Could anyone give me a hint about it?
Use the following command for accessing HDFS
hadoop fs -ls hdfs://localhost:9000/
Or
Populate ${HADOOP_CONF_DIR}/core-site.xml as follows. If you do so, you will be able to access HDFS even without specifying the hdfs:// URI.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Add the following line at the start of the file $HOME/yarn/hadoop-2.0.1-alpha/libexec/hadoop-config.sh
export HADOOP_CONF_DIR=$HOME/yarn/hadoop-2.0.1-alpha/etc/hadoop

HDFS Federation Unknown Namespace

Let's say that I have configured two NameNodes to manage /marketing and /finance respectively. I am wondering what would happen if I were to put a file in an /accounting directory. Would HDFS accept the file? If so, which namespace would manage it?
The write will fail. Neither namespace will manage the file.
You will get an IOException with a No such file or directory error from the ViewFs client.
For example, given the following ViewFs config in core-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>viewfs:///</value>
</property>
<property>
<name>fs.viewfs.mounttable.default.link./namenode-a</name>
<value>hdfs://namenode-a</value>
</property>
<property>
<name>fs.viewfs.mounttable.default.link./namenode-b</name>
<value>hdfs://namenode-b</value>
</property>
</configuration>
The following behavior is exhibited:
$ bin/hdfs dfs -ls /
-r--r--r-- - sirianni gopher 0 2013-10-22 15:58 /namenode-a
-r--r--r-- - sirianni gopher 0 2013-10-22 15:58 /namenode-b
$ bin/hdfs dfs -copyFromLocal /tmp/bar.txt /foo/bar.txt
copyFromLocal: `/foo/bar.txt': No such file or directory
