I'am trying to set up the pseudo distributed mode for Hadoop. all first steps are ok, but when I format the namenode and browse the filesystem (50070), it shows "there are no datanode in the cluster"
http://hadoop.apache.org/docs/stable/single_node_setup.html
should I do other configuration? change directory path?
thanks
From your comments it looks like you've formatted your name node twice, but not deleted the underlying data for the data node.
I suggest you clean up as follows:
Ensure that all hadoop services are stopped (bin/stop-all.sh)
Remove all data in the directories named in conf/hdfs-site.xml conf file
dfs.name.dir
dfs.data.dir
Reformat the name node again
Start HDFS only (bin/start-dfs.sh)
Check the logs for both the name node and data node to ensure that everything started without error
If you're still having problems, post your name node and data node logs back into your original question
Related
I've recently been settings up hadoop in pseudo distributed mode and I have created data and loaded that into HDFS. Later I have formatted namenode because of a problem. Now when I do that, I find that the directories and the files which were already there before on the datanodes don't show up anymore. (the word "Formatting" makes sense though) But now, I do have this doubt. As the namenode doesn't hold the metadata of the files anymore, is access to the previously loaded files cut-off? If that's a yes, then how do we delete the data already there on the datanodes?
Your previous datanode directories are now stale, yes.
You need to manually go through each datanode and delete the contents of those directories. There is no such format command via the Hadoop CLI
By default, the data node directory is a single folder under /tmp
Otherwise, you've configured your XML files where to store data
Where HDFS stores data
The screenshot added below shows the output of hdfs fsck /. It shows that the "/" directory is corrupted. This is the masternode of my Hadoop cluster. What to do?
If you are using Hadoop 2, you can run a Standby namenode to achieve High Availability. Without that, your cluster's master will be a Single Point of Failure.
You can not retrieve the data of Namenode from anywhere else since it is different from the usual data you store. If your namenode goes down, your blocks and files will still be there, but you won't be able to access them since there would be no related metadata in the namenode.
I want to delete datanode from my hadoop cluster, but don't want to lose my data. Is there any technique so that data which are there on the node which I am going to delete may get replicated to the reaming datanodes?
What is the replication factor of your hadoop cluster?
If it is default which is generally 3, you can delete the datanode directly since the data automatically gets replicated. this process is generally controlled by name node.
If you changed the replication factor of the cluster to 1, then if you delete the node, the data in it will be lost. You cannot replicate it further.
Check all the current data nodes are healthy, for these you can go to the Hadoop master admin console under the Data nodes tab, the address is normally something link http://server-hadoop-master:50070
Add the server you want to delete to the files /opt/hadoop/etc/hadoop/dfs.exclude using the full domain name in the Hadoop master and all the current datanodes (your config directory installation can be different, please double check this)
Refresh the cluster nodes configuration running the command hdfs dfsadmin -refreshNodes from the Hadoop name node master
Check the Hadoop master admin home page to check the state of the server to remove at the "Decommissioning" section, this may take from couple of minutes to several hours and even days depending of the volume of data you have.
Once the server is shown as decommissioned complete, you may delete the server.
NOTE: if you have other services like Yarn running on the same server, the process is relative similar but with the file /opt/hadoop/etc/hadoop/yarn.exclude and then running yarn rmadmin -refreshNodes from the Yarn master node
I have installed a hadoop cluster with total 3 machines, with 2 nodes acting as datanodes and 1 node acting as Namenode and as well as a Datanode.
I wanted to clear certain doubts regarding hadoop cluster installation and architecture.
Here is a list of questions I am looking answers for----
I uploaded a data file around 500mb size in the cluster and then checked the hdfs report.
I noticed that the namenode I made is also occupying 500mb size in the hdfs, along with datanodes with a replication factor of 2.
The problem here is that I want the namenode not to store any data on it, in short i dont want it to work as a datanode as it is also storing the file I am uploading. So what is the way of making it only act as a Master Node and not like a datanode?
I tried running the command hadoop -daemon.sh stop on the Namenode to stop the datanode services on it but it wasnt of any help.
How much metadata does a Namenode generate for a filesize typically of 1 GB? Any approximations?
Go to conf directory inside your $HADOOP_HOME directory on your master. Edit the file named slaves and remove the entry corresponding to your name node from it. This way you are only asking the other two nodes to act as slaves and name node as only the master.
Please explain what's dfs.include file purpose and how to define it.
I've added a new node to the Hadoop cluster but it's not identified by the namenode. In one of the posts I found that dfs.include can resolve this issue.
Thank you in advance,
Vladi
Just including the node name in the dfs.include and mapred.include is not sufficient. The slave file has to be updated on the namenode/jobtracker. The tasktracker and the datanode have to be started on the new node and the refreshNodes command has to be run on the NameNode and the JobTracker to make them aware of the new node.
Here are the instructions on how to do this.
According to the 'Hadoop : The Definitive Guide'
The file (or files) specified by the dfs.hosts and mapred.hosts properties is different from the slaves file. The former is used by the namenode and jobtracker to determine which worker nodes may connect. The slaves file is used by the Hadoop control scripts to perform cluster-wide operations, such as cluster restarts. It is never used by the Hadoop
daemons.