HDFS live node showing as decommissioned

I have 5 datanodes in my Cloudera cluster (CDH4) and all are showing as healthy.
The only service with an issue is HDFS, which is reporting under-replication. When I go to host:50070/dfsnodelist.jsp?whatNodes=LIVE I see all 5 nodes, but only one of them is showing as In Service; the rest are decommissioned.
I'm finding a lot of information about removing decommissioned nodes but very little on how to recommission them. As far as Cloudera Manager is concerned, these nodes are active.

Well, that is strange. Although Cloudera Manager thought they were okay, I tried decommissioning the datanodes and then recommissioning them as a last-ditch effort. That seems to have fixed the nodes being reported as decommissioned.
In addition, it fixed my under-replication issue as well.
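For reference, a quick way to confirm what HDFS itself thinks of each datanode, independent of Cloudera Manager, is the stock dfsadmin report (run as the hdfs superuser on CDH; adjust for your setup):

    # Each datanode's section includes a "Decommission Status" line
    # (Normal, Decommissioned, or Decommission in progress)
    sudo -u hdfs hdfs dfsadmin -report

    # fsck summarizes under-replicated blocks at the end of its output
    sudo -u hdfs hdfs fsck /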

Related

YARN applications getting launched as soon as Hadoop services come up. Cluster is 4 nodes, i.e. a Hadoop HA cluster

Hadoop HA cluster - 4 nodes
As soon as I start the Hadoop services, unnecessary YARN applications get launched and no application logs get generated. I'm not able to debug the problem without logs. Can anyone help me resolve this issue?
[Screenshot of the unexpected YARN applications: https://i.stack.imgur.com/RjvkB.png]
I've never come across such an issue, but it seems that some script, or maybe an Oozie job, is triggering these apps. Try yarn-clean if that is of any help.
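If you need to clear the stray applications while you look for the trigger, the stock YARN CLI can list and kill them (the application id below is only an example):

    # List applications that are currently queued or running
    yarn application -list -appStates ACCEPTED,RUNNING

    # Kill a stray application by its id
    yarn application -kill application_1234567890123_0001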

hadoop 50070 Datanode tab won't show both data nodes

I know it is a duplicate question, but since the other questions did not have answers, I am reposting it. I recently installed a Hadoop cluster using 2 VMs on my laptop. I can check out port 50070, and under the Datanodes tab I can see only one datanode, but I have 2 datanodes: one on the master node and the other on the slave node. What could be the reasons?
Sorry, it feels like it's been a while, but I'd still like to share my answer: the root cause is in hadoop/etc/hadoop/hdfs-site.xml. That file has a property named dfs.datanode.data.dir. If you set all the datanodes to the same name, then Hadoop assumes the cluster has only one datanode. So the proper way of doing it is to give every datanode a unique name:
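For illustration, the property might look like this on each VM (the local paths here are made up; the point is that each datanode points at its own directory):

    <!-- hadoop/etc/hadoop/hdfs-site.xml on the master VM -->
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:///usr/local/hadoop/data/datanode1</value>
    </property>

    <!-- hadoop/etc/hadoop/hdfs-site.xml on the slave VM -->
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:///usr/local/hadoop/data/datanode2</value>
    </property>

Restart the datanodes after the change so each one registers with the namenode separately.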
Regards, YUN HANXUAN

Hortonworks HDP, heartbeat lost in one of the 3 nodes

I have installed HDP with Ambari on three nodes in VMs. I restarted one of the three nodes (datanode2), and after that I lost the heartbeat from that node in Ambari. I restarted ambari-agent on all three nodes, but it still isn't working. Kindly help me find a solution.
Well, the provided information is not sufficient; anyway, I will try to describe the normal approach I take to debug this.
First, check whether all the ambari-agents are running, using the command ambari-agent status.
Check the logs of both ambari-agent and ambari-server. Normally the logs are available at /var/log/ambari-agent and /var/log/ambari-server. The logs should tell you the exact reason for the heartbeat loss.
The most common reasons for agent failure are connection issues between the machines, a version mismatch, or a corrupt database entry.
I think the log files should help you.
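For example, assuming a standard HDP layout (the paths are the defaults mentioned above):

    # On the node that lost its heartbeat: check and restart the agent
    ambari-agent status
    ambari-agent restart

    # Look for the failure reason in the agent log
    tail -n 100 /var/log/ambari-agent/ambari-agent.log

    # On the Ambari server host, check the server side as well
    tail -n 100 /var/log/ambari-server/ambari-server.log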

How to clear Dead Region Servers in HBase UI?

I have removed a regionserver from my HBase cluster. I removed its hostname from $HBASE_HOME/conf/regionservers and restarted the HBase cluster, but the HBase UI still shows the removed regionserver as a 'dead' region server.
The 'status' command in the hbase shell also shows it as a dead region server. How do I get rid of it?
Cluster getting haunted by dead regionserver :D
HBase may sometimes still show a decommissioned regionserver as dead. This is because the WAL (Write-Ahead Log) of the dead regionserver is still in HDFS in the “splitting” state, so from HBase's perspective it's not dead!
So the solution is to go to the WALs directory in HDFS (usually at /hbase/WALs) and remove the files of the old regionserver.
I found this, after much digging, at a wonderful blog post: kill zombie dead regionservers.
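A sketch of that cleanup (the directory name below is an example; WAL directories are named hostname,port,startcode, and the stale one often carries a -splitting suffix, so double-check before deleting anything):

    # List the WAL directories and find the removed regionserver's entry
    hdfs dfs -ls /hbase/WALs

    # Remove the stale directory (example name; verify it first!)
    hdfs dfs -rm -r /hbase/WALs/old-rs.example.com,16020,1500000000000-splitting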

"LOST" node in EMR Cluster

How do I troubleshoot and recover a LOST node in my long-running EMR cluster?
The node stopped reporting a few days ago. The host seems to be fine, and so does HDFS. I noticed the issue only from the Hadoop applications UI.
EMR nodes are ephemeral, and you cannot recover them once they are marked as LOST. You can avoid this in the first place by enabling the 'Termination Protection' feature during cluster launch.
As for finding the reason for the LOST node, you can check the YARN ResourceManager logs and/or the instance controller logs of your cluster to find out more about the root cause.
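For the log digging, something like this should work (the log paths are the usual EMR defaults and may vary by release):

    # From the master node: see which nodes YARN considers LOST
    yarn node -list -all

    # ResourceManager logs on the master node
    less /var/log/hadoop-yarn/yarn-yarn-resourcemanager-*.log

    # Instance controller logs on the affected node itself
    less /emr/instance-controller/log/instance-controller.log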
