datanode does not have "namenode" directory under hdfs - hadoop

So I have a small Hadoop cluster with 1 master and 5 workers. My hdfs-site.xml on the master and the workers looks like this:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/username/hadoop/yarn/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/username/hadoop/yarn/hdfs/datanode</value>
  </property>
</configuration>
My cluster is running smoothly, and all daemons are running fine. I am able to access HDFS to import and export data, run word count jobs, etc. However, on my workers there is no "namenode" folder under the "/home/username/hadoop/yarn/hdfs/" path. Is this normal behaviour?

NameNode folder should be available on namenode hosts

Your namenode folder is present only on the node where the namenode daemon is running.
According to the hdfs-site.xml shown above, it is created at
/home/username/hadoop/yarn/hdfs/namenode

It's been a while since you posted your question, but yes, that is the way it should be.
In other words, a worker is a data node and therefore only needs a "datanode" directory. Conversely, depending on your configuration, a name node usually has only a namenode directory. An exception would be a name node that is also acting as a data node, which is possible but not recommended.
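As a quick sanity check, you can confirm which role each host plays by looking at the running daemons and the storage directories. A rough sketch, assuming the directory layout from the hdfs-site.xml above:
# on the master (namenode) host
jps                                    # should list NameNode
ls /home/username/hadoop/yarn/hdfs/    # expect a "namenode" directory

# on a worker (datanode) host
jps                                    # should list DataNode, but no NameNode
ls /home/username/hadoop/yarn/hdfs/    # expect only a "datanode" directory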

Related

Raspberry Pi Hadoop Cluster Configuration

I've recently been trying to build and configure an 8-Pi Raspberry Pi 3 Hadoop cluster (as a personal project over the summer). Please bear with me (unfortunately I am a little new to Hadoop). I am using Hadoop version 2.9.2. I think it's important to note that right now I am trying to get just one Namenode and one Datanode completely functional with one another, before moving ahead and replicating the same procedure on the remaining seven Pis.
The issue: My Namenode (alias: master) is the only node that is displayed as a 'Live Datanode', both in the dfs-health interface and through the use of:
dfsadmin -report
This is even though the Datanode is displayed as an 'Active Node' (within the Nodes of the cluster Hadoop UI) and 'master' is not listed within the slaves file. The configuration I am aiming for is that the Namenode should not perform any Datanode operations. Additionally, I am trying to configure the cluster in such a way that the command above will display my Datanode (alias: slave-01) as a 'Live Datanode'.
I suspect that my issue is caused by the fact that both my Namenode and Datanode use the same hostname (raspberrypi); however, I am unsure of the configuration changes I need to make in order to correct the issue. After having looked into the documentation, I unfortunately couldn't find a conclusive answer as to whether this is allowed or not.
If someone could please help me solve this issue it would be extremely appreciated! I have provided any relevant file-information below (which I thought may be useful for solving the issue). Thank you :)
PS: All files are identical within the Namenode and Datanode unless otherwise specified.
===========================================================================
Update 1
I have removed localhost from the slaves file on both the Namenode and Datanode, and changed their respective hostnames to 'master' and 'slave-01' as well.
After running jps, I noticed that all of the correct processes are running on the master node; however, there is an error on the Datanode, for which the log shows:
ExitCodeException exitCode=1: chmod: changing permissions of '/opt/hadoop_tmp/hdfs/datanode': Operation not permitted.
Unfortunately the issue persists despite changing permissions using 'chmod 777'. If someone could please help me solve this issue it would be extremely appreciated! Thanks in advance :)
===========================================================================
Hosts File
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
127.0.1.1 raspberrypi
192.168.1.2 master
192.168.1.3 slave-01
Master File
master
Slaves File
localhost
slave-01
Core-Site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000/</value>
  </property>
  <property>
    <name>fs.default.FS</name>
    <value>hdfs://master:9000/</value>
  </property>
</configuration>
HDFS-Site.xml
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/hadoop_tmp/hdfs/datanode</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/hadoop_tmp/hdfs/namenode</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>master:50070</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Mapred-Site.xml
<configuration>
  <property>
    <name>mapreduce.job.tracker</name>
    <value>master:5431</value>
  </property>
  <property>
    <name>mapred.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Yarn-Site.xml
<configuration>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8035</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8050</value>
  </property>
</configuration>
You could let your local router serve up the host names rather than manipulating /etc/hosts yourself, but in order to change each Pi's name, edit /etc/hostname and reboot.
Before and after rebooting, check by running hostname -f, for example:
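A rough sketch of renaming one Pi (assuming a Raspbian-style setup where the 127.0.1.1 entry in /etc/hosts mirrors the hostname; the name 'namenode' here is just an example):
sudo sh -c 'echo namenode > /etc/hostname'         # pick a unique, descriptive name per Pi
sudo sed -i 's/raspberrypi/namenode/' /etc/hosts   # keep the 127.0.1.1 entry in sync
sudo reboot
hostname -f                                        # after the reboot, verify the new name took effect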
Note: "master" is really meaningless once you have a "YARN master", "HDFS master", "Hive Master", etc. Best to literally say namenode, data{1,2,3}, yarn-rm, and so on
Regarding the permissions issue: you could run everything as root, but that's insecure outside a homelab. Instead, run a few adduser commands to create at least an hduser account (as documented elsewhere, but it can be any name) and a yarn account, chown -R the data and log directories so they are owned by those users and the Unix groups they belong to, and then run the daemons as those users.
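A minimal sketch of those steps, assuming a Debian-style system; the group name, user names, and the /opt/hadoop install path are assumptions here, and only the /opt/hadoop_tmp/hdfs path comes from the hdfs-site.xml above:
sudo addgroup hadoop                                # shared Unix group for the Hadoop users
sudo adduser --ingroup hadoop hduser                # user for the HDFS daemons
sudo adduser --ingroup hadoop yarn                  # user for the YARN daemons
sudo chown -R hduser:hadoop /opt/hadoop_tmp/hdfs    # data directories from hdfs-site.xml
sudo chown -R hduser:hadoop /opt/hadoop/logs        # log directory (install path is an assumption)
sudo -u hduser /opt/hadoop/sbin/start-dfs.sh        # start HDFS as the hduser account
sudo -u yarn /opt/hadoop/sbin/start-yarn.sh         # start YARN as the yarn account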

Hadoop fs -ls outputs current working directory's files rather than hdfs volume's files

I have set up a single pseudo-distributed node (localhost) with Hadoop 2.8.2 on OpenSuse Tumbleweed 20170703. The Java version is 1.8.0_151. Generally, it seems to be set up correctly; I can format the namenode with no errors, etc.
However, when I try hadoop fs -ls, files/dirs from the current working directory are returned rather than the expected behaviour of returning the hdfs volume files (which should be nothing at the moment).
Was originally following this guide for CentOS (making changes as required) and the Apache Hadoop guide.
I'm assuming that it's a config issue, but I can't see why it would be. I've played around with core-site.xml and hdfs-site.xml as per below with no luck.
/opt/hadoop-hdfs-volume/ exists and is assigned to user hadoop in user group hadoop. As is the /opt/hadoop/ directory (for bin stuff).
EDIT:
/tmp/hadoop-hadoop/dfs/name is where the hdfs namenode -format command writes its files. /tmp/ also seems to hold directories for my own user (/tmp/hadoop-dijksterhuis) and for the hadoop user.
This seems odd to me considering the *-site.xml config files below.
Have tried restarting the dfs and yarn services with the .sh scripts in the hadoop/sbin/ directory. Have also rebooted. No luck!
core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop-hdfs-volume/${user.name}</value>
  </property>
</configuration>
hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>${hadoop.tmp.dir}/dfs/data</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>${hadoop.tmp.dir}/dfs/name</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Anyone have any ideas? I can provide more details if needed.
Managed to hack a fix via another SO answer:
Add export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop to the hadoop user's .bashrc.
This has the effect of overriding the value in etc/hadoop/hadoop-env.sh, which otherwise keeps pointing the namenode to the default /tmp/hadoop-${user.name} directory.
source .bashrc et voilà! Problem fixed.
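Roughly, the fix looks like this (a sketch; the /opt/hadoop paths follow the directories mentioned in the question):
# appended to the hadoop user's ~/.bashrc
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop

# reload the environment and restart HDFS so the setting takes effect
source ~/.bashrc
/opt/hadoop/sbin/stop-dfs.sh
/opt/hadoop/sbin/start-dfs.sh
hadoop fs -ls /    # should now list the (currently empty) HDFS root, not the local working directory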

how to configure hadoop so that namenode gets started without reformatting on reboot

The namenode is missing every time Hadoop is started after a reboot. Running hadoop namenode -format corrects this problem, at the cost of losing all files through the format. How can I avoid reformatting and preserve the namenode files?
Put the following configuration in your hdfs-site.xml:
$ vi hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.
    </description>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///home/hduser/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///home/hduser/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
dfs.name.dir determines where on the local file system the DFS name node should store the name table (fsimage). The name table is replicated in all of the directories, for redundancy.
dfs.data.dir determines where on the local file system a DFS data node should store its blocks. Data will be stored in all of the named directories, typically on different devices. Directories that do not exist are ignored.
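The reason this helps is that, without these settings, dfs.name.dir falls back to a path under hadoop.tmp.dir, which defaults to a directory under /tmp, and /tmp is typically cleared on reboot; that is why the namenode disappears. A minimal sketch of preparing the new directories before a one-time format (the hduser account and paths follow the configuration above):
mkdir -p /home/hduser/hadoopdata/hdfs/namenode
mkdir -p /home/hduser/hadoop/hadoopdata/hdfs/datanode
sudo chown -R hduser:hduser /home/hduser/hadoopdata /home/hduser/hadoop/hadoopdata
hadoop namenode -format    # run this once only; the metadata now survives reboots
start-dfs.sh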

Configuration of Hadoop xml file

What is the right configuration of the hdfs-site.xml file when configuring Hadoop?
On all the websites I see this :
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
I used the same configuration but was unable to start the datanode.
Later I changed the datanode configuration to
<name>dfs.datanode.name.dir</name>
instead of
<name>dfs.datanode.data.dir</name>
and it worked.
Which one is right, name.dir or data.dir?
All the websites say data.dir, but that does not work in my case.
Thanks, guys.
The correct options are dfs.namenode.name.dir and dfs.datanode.data.dir. The first defines the directory where namenode information is held; the second defines where datanode information is held.
If the node is a namenode, you need the namenode folder configuration. If it's a datanode, you need the datanode folder configuration. If it's a single-node cluster, you need both, as the node acts as both namenode and datanode.
If that doesn't work, what is the error message you get? Have you formatted the namenode?
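If the datanode still refuses to start, a quick sanity check is something like the following (a sketch, assuming the hadoop_store paths from the question and a single-node setup):
sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode /usr/local/hadoop_store/hdfs/datanode
sudo chown -R $USER /usr/local/hadoop_store    # the directories must be writable by the user running Hadoop
hdfs namenode -format                          # one-time format of the name directory
start-dfs.sh
jps                                            # NameNode, DataNode and SecondaryNameNode should all appear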

Failed to get system directory - hadoop

Using a Hadoop multinode setup (1 master, 1 slave).
After starting start-mapred.sh on the master, I found the error below in the TaskTracker logs on the slave:
org.apache.hadoop.mapred.TaskTracker: Failed to get system directory
Can someone help me understand what can be done to avoid this error?
I am using
Hadoop 1.2.0
jetty-6.1.26
java version "1.6.0_23"
mapred-site.xml file
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
    <description>The host and port that the MapReduce job tracker runs
    at. If "local", then jobs are run in-process as a single map
    and reduce task.
    </description>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>1</value>
    <description>
    define mapred.map tasks to be number of slave hosts
    </description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>1</value>
    <description>
    define mapred.reduce tasks to be number of slave hosts
    </description>
  </property>
</configuration>
core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
    <description>The name of the default file system. A URI whose
    scheme and authority determine the FileSystem implementation. The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class. The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hduser/workspace</value>
  </property>
</configuration>
It seems that you just added hadoop.tmp.dir and started the job. You need to restart the Hadoop daemons after adding any property to the configuration files. You have specified in your comment that you added this property at a later stage. This means that all the data and metadata, along with the other temporary files, are still in the /tmp directory. Copy all of those from there into your /home/hduser/workspace directory, restart Hadoop, and re-run the job.
Do let me know the result. Thank you.
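Roughly, that looks like the following (a sketch for Hadoop 1.x; it assumes the daemons run as the hduser account, so the old temporary directory would be /tmp/hadoop-hduser):
stop-mapred.sh && stop-dfs.sh                        # stop the JobTracker/TaskTrackers and HDFS
cp -r /tmp/hadoop-hduser/* /home/hduser/workspace/   # move the existing data and metadata out of /tmp
start-dfs.sh && start-mapred.sh
hadoop dfsadmin -report                              # the DataNode should report in; then re-check the TaskTracker log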
If it is a Windows PC and you are using Cygwin to run Hadoop, then the TaskTracker will not work.
