Datanode and tasktracker die when executing hadoop fs -put command

I have a Hadoop cluster with 1 master (running the namenode and jobtracker) and 2 slaves (each running a datanode and tasktracker). Whenever I execute
hadoop fs -put localfile /user/root/tmp/input
for 4-8 GB of data, the command executes perfectly.
But when I increase the data to 30 GB, one of the slaves dies: I get a NoRouteToHostException and the command exits unsuccessfully. When I ping that slave immediately afterwards, I find that even its Ethernet connection is down, so I have to manually run
ifup eth0
on that slave to bring the interface back up.
I am not able to figure out the problem here. I also changed the following properties:
dfs.socket.timeout, for the read timeout
dfs.datanode.socket.write.timeout, for the write timeout
I increased the read timeout to 600000 and set the write timeout to 0 to make it infinite. Any suggestions? I've been stuck on this for a couple of days.
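For reference, these are client-side properties that could also be passed per command with the generic -D options (values in milliseconds; 0 disables the write timeout). This is just a hedged sketch of how the settings above might be applied, not a fix for the underlying problem:
hadoop fs -D dfs.socket.timeout=600000 -D dfs.datanode.socket.write.timeout=0 -put localfile /user/root/tmp/input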

try using "distCp" to copy large data.

Got the solution: the problem was with the hardware. Although my NICs were gigabit, the switch that all the nodes were plugged into only supported 100 Mbps. I changed the switch to a gigabit one and it worked perfectly.

I faced a similar problem and used -copyFromLocal instead of -put, which resolved the issue.
hadoop fs -copyFromLocal localfile /user/root/tmp/input

Related

Hadoop localhost:9870 doesn't work before formatting the HDFS namenode

I have installed Hadoop. When I start DFS and YARN, only the YARN web UI on localhost works. To get the DFS web UI on localhost working, I have to run "bin/hdfs namenode -format" every time I start my laptop, then start DFS, and then it works.
How can I fix this?
Sorry for my bad English.
You only have to format the namenode before the very first start.
If you need to do it more than once, look at the logs to find out why HDFS is not starting. More than likely, you're just shutting down your computer without stopping the HDFS processes, and the file blocks are becoming corrupt.
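A minimal sketch of the clean shutdown/startup cycle being described here, assuming a standard $HADOOP_HOME layout:
$HADOOP_HOME/sbin/stop-dfs.sh   # run this before powering the machine off
$HADOOP_HOME/sbin/start-dfs.sh  # on the next boot, start HDFS again without reformatting
tail -n 100 $HADOOP_HOME/logs/hadoop-*-namenode-*.log   # if it still fails, the real error is here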

Connection refused error for Hadoop

When I start my system and open Hadoop, it gives a "Connection refused" error.
When I format my namenode using hadoop namenode -format, I'm able to access my Hadoop directory using hadoop dfs -ls /.
But I have to format my namenode every time.
You can't just turn off your computer and expect Hadoop to pick up where it left off when you turn the system back on.
You need to actually run stop-dfs to prevent corruption in the namenode and datanode directories.
Check both the namenode and datanode logs to see why they are not starting if you still get "connection refused"; otherwise it's a network problem.
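A quick sanity check before blaming the network, assuming the default fs.defaultFS port of 9000 (yours may differ):
jps | grep -i namenode     # is the NameNode process running at all?
ss -ltn | grep 9000        # or: netstat -ntlp | grep 9000 -- is anything listening on the RPC port?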

Datanode is in dead state as DFS used is 100 percent

I have a standalone setup of Apache Hadoop with the namenode and datanode running on the same machine.
I am currently running Apache Hadoop 2.6 (I cannot upgrade it) on Ubuntu 16.04.
Although my system has more than 400 GB of disk space left, my Hadoop dashboard shows DFS Used at 100%.
Why is Apache Hadoop not consuming the rest of the disk space available to it? Can anybody help me figure out a solution?
There can be several reasons for this.
First, make sure the datanode is actually running:
Go to $HADOOP_HOME/bin
./hadoop-daemon.sh --config $HADOOP_HOME/conf start datanode
Then you can try the following (a consolidated sketch of these commands follows below):
If any directory other than your namenode and datanode directories is taking up too much space, start cleaning up.
Run hadoop fs -du -s -h /user/hadoop to see the usage of the directories.
Identify all the unnecessary directories and start cleaning up by running hadoop fs -rm -R /user/hadoop/raw_data (-rm deletes, -R deletes recursively; be careful with -R).
Run hadoop fs -expunge to empty the trash immediately (sometimes you need to run it multiple times).
Run hadoop fs -du -s -h / to see the HDFS usage of the entire file system, or run hdfs dfsadmin -report as well, to confirm whether the storage has been reclaimed.
Many times it also shows missing blocks (with replication 1).
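Putting those cleanup commands together, a minimal pass might look like this (/user/hadoop/raw_data is just the example path from above; substitute your own directories):
hadoop fs -du -s -h /user/hadoop          # find the big directories
hadoop fs -rm -R /user/hadoop/raw_data    # remove what you don't need (recursive, so be careful)
hadoop fs -expunge                        # empty the trash so the space is actually freed
hdfs dfsadmin -report                     # confirm DFS Used% has dropped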

Hadoop: hadoop fs -put error: "There are 2 datanode(s) running and 2 node(s) are excluded in this operation."

I have installed Hadoop 2.6.5. When I try to put a file from local to HDFS, I get this exception and I don't know how to solve the problem. Need help...
This is going to be a networking problem. The client process (where you ran the hdfs dfs -put command) failed to connect to the DataNode hosts. I can tell from the stack trace that at this point, you have already passed the point of interacting with the NameNode, so connectivity from client to NameNode is fine.
I recommend handling this as a basic network connectivity troubleshooting problem between client and all DataNode hosts. Use tools like ping or nc or telnet to test connectivity. If basic connectivity fails, then resolve it by fixing network configuration.
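A minimal connectivity check from the client to each DataNode, assuming the default data-transfer port 50010 (check dfs.datanode.address if it has been changed) and a hypothetical host name datanode1:
ping -c 3 datanode1
nc -vz datanode1 50010    # or: telnet datanode1 50010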

Hadoop: Datanodes available: 0 (0 total, 0 dead)

Each time I run:
hadoop dfsadmin -report
I get the following output:
Configured Capacity: 0 (0 KB)
Present Capacity: 0 (0 KB)
DFS Remaining: 0 (0 KB)
DFS Used: 0 (0 KB)
DFS Used%: NaN%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 0 (0 total, 0 dead)
There is no data directory in my dfs/ folder.
A lock file exists in this folder: in_use.lock
The master, job tracker and data nodes are running fine.
Please check the datanode logs. They will log errors when the datanode is unable to report to the namenode. If you post those errors, people will be able to help.
I had exactly the same problem. When I checked the datanode logs, there were lots of "could not connect to master:9000" errors, and when I checked the ports on the master via netstat -ntlp I had this in the output:
tcp 0 0 127.0.1.1:9000 ...
I realized that I should either change my master machine's name or change master in all the configs. I decided to do the first because it seemed much easier.
So I modified /etc/hosts, changed 127.0.1.1 master to 127.0.1.1 master-machine, and added an entry at the end of the file like this:
192.168.1.1 master
Then I changed master to master-machine in /etc/hostname and restarted the machine.
The problem was gone.
Did you check the firewall?
When I use Hadoop, I turn off the firewall (iptables -F on all nodes)
and then try again.
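If you would rather inspect the rules before flushing them (iptables -F removes every rule, which is fine on a test cluster but heavy-handed elsewhere), a quick sketch:
sudo iptables -L -n    # list the current rules on each node
sudo iptables -F       # flush all rules, as suggested above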
This has happened to us when we restarted the cluster, but after a while the datanodes were automatically detected, possibly because of the block report delay property.
Usually there are namespace ID mismatch errors in the datanode logs.
In that case, delete the name dir from the master and the data dir from the datanodes.
Then format the namenode and try start-dfs.
The report usually takes some time to reflect all the datanodes.
Even I was getting 0 datanodes, but after some time the master detected the slaves.
I had the same problem and I just solved it.
/etc/hosts of all nodes should look like this:
127.0.0.1 localhost
xxx.xxx.xxx.xxx master
xxx.xxx.xxx.xxx slave-1
xxx.xxx.xxx.xxx slave-2
Just resolved the issue by following the steps below:
Make sure the IP addresses for the master and slave nodes are correct in the /etc/hosts file.
Unless you really need the data, run stop-dfs.sh, delete all the data directories on the master/slave nodes, then run hdfs namenode -format and start-dfs.sh. This should recreate HDFS and fix the issue (see the sketch below).
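A minimal sketch of that reset, assuming the namenode and datanode directories live under /hadoop/data (a hypothetical path; check dfs.namenode.name.dir and dfs.datanode.data.dir in hdfs-site.xml for the real locations). This wipes all HDFS data:
$HADOOP_HOME/sbin/stop-dfs.sh
rm -rf /hadoop/data/namenode/* /hadoop/data/datanode/*   # on every node; assumed example paths
hdfs namenode -format
$HADOOP_HOME/sbin/start-dfs.sh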
Just formatting the namenode didn't work for me, so I checked the logs at $HADOOP_HOME/logs. In the secondarynamenode log, I found this error:
ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint
java.io.IOException: Inconsistent checkpoint fields.
LV = -64 namespaceID = 2095041698 cTime = 1552034190786 ; clusterId = CID-db399b3f-0a68-47bf-b798-74ed4f5be097 ; blockpoolId = BP-31586866-127.0.1.1-1552034190786.
Expecting respectively: -64; 711453560; 1550608888831; CID-db399b3f-0a68-47bf-b798-74ed4f5be097; BP-2041548842-127.0.1.1-1550608888831.
at org.apache.hadoop.hdfs.server.namenode.CheckpointSignature.validateStorageInfo(CheckpointSignature.java:143)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:550)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:360)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$1.run(SecondaryNameNode.java:325)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:482)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:321)
at java.lang.Thread.run(Thread.java:748)
So I stopped hadoop and then specifically formatted the given cluster id:
hdfs namenode -format -clusterId CID-db399b3f-0a68-47bf-b798-74ed4f5be097
This solved the problem.
There's another obscure reason this could happen as well: Your datanode did not start properly, but everything else was working.
In my case, when going through the log, I found that the bound port, 50010, was already in use by SideSync (on macOS). I found this via
sudo lsof -iTCP -n -P | grep 0010
but you can use similar techniques to determine what might have already taken your well-known datanode port.
Killing that process and restarting fixed the problem.
Additionally, if you've installed Hadoop/YARN as root but have data dirs in individual home directories, and then try to run it as an individual user, you'll have to make the datanode directory accessible to that user (see the sketch below).
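A hedged sketch of that last point, assuming a hypothetical datanode directory of /home/hduser/hdfs/datanode (use the value of dfs.datanode.data.dir from your hdfs-site.xml) and assumed user/group names hduser:hadoop. Handing the directory to the user actually running the datanode is usually cleaner than opening the permissions wide:
sudo chown -R hduser:hadoop /home/hduser/hdfs/datanode   # hduser/hadoop are assumed names for the user running the datanode
sudo chmod -R 755 /home/hduser/hdfs/datanode             # relax further only if the datanode still refuses to start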
