0 datanodes when copying a file from local to Hadoop

My OS is Windows 10.
Ubuntu 20.04.3 LTS (GNU/Linux 4.4.0-19041-Microsoft x86_64) is installed on Windows 10.
When I copy a local file to Hadoop, I get an error saying that 0 datanodes are available.
I am able to copy a file from Hadoop to a local folder, and I can see the file in the local directory using the command $ ls -l
I am also able to create directories and files in Hadoop. But if I restart the Ubuntu terminal, those directories and files no longer exist; the listing is empty.
The steps I followed:
1. start-all.sh
2. jps (the datanode is missing from the output)
3. Copy the local file to Hadoop: fails with the error that 0 datanodes are available
4. Copy files from Hadoop to the local directory: succeeds

If you close or restart the WSL terminal without first running stop-dfs.sh or stop-all.sh, you run the risk of corrupting the namenode. If that happens, it needs to be reformatted using hadoop namenode -format (hdfs namenode -format on current releases), rather than by deleting the namenode directory with rm.
After formatting, you can restart the datanodes and they should become healthy again.
The same logic applies in a production environment, which is why you should always have a standby namenode for failover.
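To make step 2 of the question easier to check after each start-all.sh, here is a minimal sketch that reads jps output and reports which HDFS daemons are absent. The function name missing_hdfs_daemons is hypothetical, and the list of expected daemons assumes a single-node setup.

```shell
# Hypothetical helper: reads `jps` output on stdin and prints the HDFS
# daemons expected on a single-node setup that are not running.
missing_hdfs_daemons() {
    jps_output=$(cat)
    missing=""
    for daemon in NameNode DataNode SecondaryNameNode; do
        # jps prints "<pid> <ClassName>"; compare the class name as a whole
        # field so that "SecondaryNameNode" does not count as "NameNode".
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            missing="$missing $daemon"
        fi
    done
    # Drop the leading space before printing.
    printf '%s\n' "${missing# }"
}

# Usage, on a machine with a JDK on the PATH:
#   jps | missing_hdfs_daemons
```

An empty line means all three daemons were found. If DataNode is listed, the datanode log is the next place to look.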

Related

How to copy a file from HDFS to a Windows machine?

I want to copy a .csv file from our Hadoop cluster to my local Desktop, so I can edit the file and upload it back (replacing the original).
I tried:
hadoop fs -copyToLocal /c_transaction_label.csv C:/Users/E_SJIRAK/Desktop
which yielded:
copyToLocal: '/Users/E_SJIRAK/Desktop': No such file or directory:
file:////Users/E_SJIRAK/Desktop
Help would be appreciated.
If you have SSH'd into the Hadoop cluster, then you cannot copyToLocal into Windows; "local" there means the Linux machine you are logged in to.
You need a two-step process. First download the file from HDFS to the Linux environment. Then use SFTP (WinSCP, FileZilla, etc.) or PuTTY's pscp command from the Windows host to get the file onto your Windows machine.
Otherwise, you need to set up the hadoop CLI on Windows itself.
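The two-step process could look roughly like this. It is a sketch only: the function name hdfs_to_staging, the default staging directory, and the HDFS_CMD override (there purely so the function can be dry-run without a cluster) are all assumptions, not part of Hadoop.

```shell
# Step 1, run on the Linux machine you SSH into: copy a file out of HDFS
# into a local staging directory and print where it landed.
hdfs_to_staging() {
    src=$1
    staging=${2:-$HOME/hdfs_export}   # hypothetical default location
    mkdir -p "$staging" || return 1
    # Send any command output to stderr so stdout carries only the path.
    ${HDFS_CMD:-hadoop} fs -copyToLocal "$src" "$staging/" >&2 || return 1
    printf '%s/%s\n' "$staging" "$(basename "$src")"
}

# Usage on the Linux side:
#   hdfs_to_staging /c_transaction_label.csv
#
# Step 2, run from a Windows command prompt (placeholders in <>):
#   pscp <user>@<linux-host>:<path printed by step 1> <Windows destination folder>
```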

Whenever I restart my Ubuntu system (Vbox) and start Hadoop, my name node is not working

Whenever I restart my Ubuntu system (Vbox) and start Hadoop, my name node is not working.
To resolve this, I always have to delete the namenode and datanode folders and format Hadoop every time I restart my system.
I have been trying to resolve the issue for two days, but nothing works. I tried giving 777 permissions to the namenode and datanode folders again, and I also tried changing their paths.
My error is
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /blade/Downloads/Hadoop/data/datanode is in an inconsistent state: storage directory does not exist or is not accessible
Please help me to resolve the issue.
You cannot just shut down the VM. You need to cleanly stop the datanode and namenode processes, in that order; otherwise HDFS may end up corrupted and need to be reformatted, assuming that you don't have a backup system.
I'd also suggest putting the Hadoop data for a VM on its own virtual drive and mount point, not in a shared host folder under Downloads.
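A minimal sketch of that stop order, to run before shutting the VM down. It assumes Hadoop 3.x, where daemons are stopped with hdfs --daemon stop; on 2.x the equivalent is hadoop-daemon.sh stop datanode followed by hadoop-daemon.sh stop namenode. The function name and the HDFS_CMD override are hypothetical.

```shell
# Stop the HDFS daemons in the order given above: datanode first, then
# namenode. HDFS_CMD exists only so the sketch can be dry-run.
stop_hdfs_in_order() {
    for daemon in datanode namenode; do
        # Keep going even if one daemon was not running, so the namenode
        # still gets a clean stop.
        ${HDFS_CMD:-hdfs} --daemon stop "$daemon" ||
            echo "warning: could not stop $daemon" >&2
    done
}

# Usage, before powering off the VM:
#   stop_hdfs_in_order
```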

Cannot start running on browser the namenode for Hadoop

This is my first time installing Hadoop on Linux (Fedora distro) running in a VM (using Parallels on my Mac). I followed every step in this video, including the text version of it. Then, when I open localhost (or the equivalent value from hostname) on port 50070, I get the following message:
...can't establish a connection to the server at localhost:50070
By the way, when I run the jps command, the datanode and namenode are not listed, unlike in the output shown at the end of the text version of the tutorial.
Mine has only the following processes running:
6021 NodeManager
3947 SecondaryNameNode
5788 ResourceManager
8941 Jps
When I run the hadoop namenode command, I get errors including the following (redacted):
Cannot access storage directory /usr/local/hadoop_store/hdfs/namenode
16/10/11 21:52:45 WARN namenode.FSNamesystem: Encountered exception loading fsimage
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /usr/local/hadoop_store/hdfs/namenode is in an inconsistent state: storage directory does not exist or is not accessible.
By the way, I checked the directories mentioned above and they do exist.
Any hint for this newbie? ;-)
You need to give read and write permission on the directory /usr/local/hadoop_store/hdfs/namenode to the user that runs the services.
Once done, run the format command: hadoop namenode -format
Then try to start your services.
Delete the files under /app/hadoop/tmp/*
then try formatting the namenode again, and run start-dfs.sh and start-yarn.sh
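Since the directory exists but Hadoop still reports it as not accessible, it may help to check it as the same user that starts the services. This is a rough sketch; check_storage_dir is a hypothetical helper, and it only tests the current user's read, write, and traverse rights on the directory itself.

```shell
# Report whether a storage directory exists and is usable by the current
# user. Run it as the user that starts the Hadoop services.
check_storage_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
    elif [ ! -r "$dir" ] || [ ! -w "$dir" ] || [ ! -x "$dir" ]; then
        echo "not accessible by $(id -un): $dir"
    else
        echo "ok: $dir"
    fi
}

# Usage:
#   check_storage_dir /usr/local/hadoop_store/hdfs/namenode
```

If it reports "not accessible", changing the owner of the directory to the service user is usually a narrower fix than opening it up to everyone.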

Need help adding multiple DataNodes in pseudo-distributed mode (one machine), using Hadoop-0.18.0

I am a student interested in Hadoop, and I started to explore it recently.
I tried adding an additional DataNode in pseudo-distributed mode but failed.
I am following the Yahoo developer tutorial, so the version of Hadoop I am using is hadoop-0.18.0.
I tried to start up using 2 methods I found online:
Method 1 (link)
I have a problem with this line
bin/hadoop-daemon.sh --script bin/hdfs $1 datanode $DN_CONF_OPTS
--script bin/hdfs doesn't seem to be valid in the version I am using. I changed it to --config $HADOOP_HOME/conf2, with all the configuration files in that directory, but when the script was run it gave the error:
Usage: Java DataNode [-rollback]
Any idea what the error means? The log files are created, but the DataNode did not start.
Method 2 (link)
Basically, I duplicated the conf folder to a conf2 folder, making the necessary changes documented on the website to hadoop-site.xml and hadoop-env.sh. Then I ran the command
./hadoop-daemon.sh --config ..../conf2 start datanode
It gives the error:
datanode running as process 4190. stop it first.
So I guess this is the first DataNode that was started, and the command failed to start another DataNode.
Is there anything I can do to start an additional DataNode in the Yahoo VM Hadoop environment? Any help or advice would be greatly appreciated.
The Hadoop start/stop scripts use /tmp as the default directory for storing the PIDs of already started daemons. In your situation, when you start the second datanode, the startup script finds the /tmp/hadoop-someuser-datanode.pid file from the first datanode and assumes that the datanode daemon is already running.
The plain solution is to set the HADOOP_PID_DIR environment variable to something other than /tmp for the second datanode. Also, do not forget to update all the network port numbers in conf2.
The smart solution is to start a second VM with a Hadoop environment and join the two in a single cluster. That is the way Hadoop is intended to be used.
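To see which PID file the startup script will look at, here is a small sketch. It assumes the naming pattern implied by the example above (hadoop-<user>-datanode.pid in HADOOP_PID_DIR, falling back to /tmp and to the current user name); datanode_pid_file is a hypothetical helper, not part of Hadoop.

```shell
# Print the PID file path the datanode start script is assumed to check,
# given the current HADOOP_PID_DIR and HADOOP_IDENT_STRING settings.
datanode_pid_file() {
    pid_dir=${HADOOP_PID_DIR:-/tmp}
    ident=${HADOOP_IDENT_STRING:-$(id -un)}
    printf '%s/hadoop-%s-datanode.pid\n' "$pid_dir" "$ident"
}

# Usage: compare the default with a separate directory for the second
# datanode (the directory name is only an example):
#   datanode_pid_file
#   HADOOP_PID_DIR=$HOME/dn2-pids datanode_pid_file
# Then start the second datanode with the same setting:
#   HADOOP_PID_DIR=$HOME/dn2-pids ./hadoop-daemon.sh --config <conf2 path> start datanode
```

As long as the two paths differ, the second start should no longer trip over the first datanode's PID file.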

Hadoop DFSClient installation

I run a Hadoop cluster and I'd like to set up one more machine with DFSClient only.
This machine (let's call it machine X) will not be part of the cluster.
Machine X will run DFSClient and I should be able to see HDFS from it.
In order to install DFSClient, I copied the Hadoop home directory from one of the cluster's nodes to machine X (including the .jar files and configuration).
Then I ran:
hadoop fs -ls /
I get the local root directory (not the HDFS root).
What am I doing wrong?
Copy hdfs-site.xml and place it in a folder under your local Linux account's home directory. Then ensure that the namenode setting (fs.default.name, or fs.defaultFS in newer releases; it is normally set in core-site.xml, so copy that file as well) points to the remote namenode. Then try hadoop --config <your_config_folder> fs -ls / where your_config_folder is the folder where you placed the configuration files.
Technically it should work if the following conditions are met:
1. You have copied the configuration files (*.xml) from the Hadoop cluster.
2. HADOOP_HOME is set to the copied Hadoop path.
3. Machine X has access to the cluster network.
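If copying the cluster's files is not convenient, a client-only config folder can be sketched by hand. This is an illustration under assumptions: write_client_core_site is a hypothetical helper, the namenode host and port in the usage lines are made up, and very old releases expect the property name fs.default.name instead of fs.defaultFS.

```shell
# Write a minimal core-site.xml that points a client at a remote namenode.
write_client_core_site() {
    conf_dir=$1
    namenode_uri=$2
    mkdir -p "$conf_dir" || return 1
    cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>$namenode_uri</value>
  </property>
</configuration>
EOF
}

# Usage on machine X (host and port are placeholders):
#   write_client_core_site "$HOME/hdfs-client-conf" hdfs://namenode.example.com:8020
#   hadoop --config "$HOME/hdfs-client-conf" fs -ls /
```

If the listing still shows the local root directory, the client is most likely not reading this folder, so check the --config path first.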

Resources