How do I troubleshoot and recover a Lost Node in my long-running EMR cluster?
The node stopped reporting a few days ago. The host itself seems to be fine, and so does HDFS. I only noticed the issue from the Hadoop applications UI.
EMR nodes are ephemeral, and you cannot recover them once they are marked as LOST. You can avoid this in the first place by enabling the 'Termination Protection' feature when you launch the cluster.
As for finding the reason a node was marked LOST, check the YARN ResourceManager logs and/or the instance-controller logs of your cluster to learn more about the root cause.
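A rough sketch of where to look (the log paths are from memory and can vary by EMR release, so treat them as assumptions):

    # List node states as YARN sees them; LOST nodes show up here
    yarn node -list -all

    # On the master node: ResourceManager logs
    less /var/log/hadoop-yarn/yarn-yarn-resourcemanager-*.log

    # On the affected core/task node: instance-controller logs
    less /emr/instance-controller/log/instance-controller.log

    # Termination protection can also be toggled on a running cluster
    aws emr set-termination-protection --cluster-ids j-XXXXXXXXXXXXX --termination-protected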
Related
I am trying to set up HDFS on minikube (for now) and later on a dev Kubernetes cluster so I can use it with Spark. I want Spark to run locally on my machine so I can run in debug mode during development, so it needs access to my HDFS on K8s.
I have already set up one namenode deployment and a datanode StatefulSet (3 replicas), and those work fine when I use HDFS from within the cluster. I am using a headless service for the datanodes and a ClusterIP service for the namenode.
The problem starts when I try to expose HDFS. I was thinking of using an Ingress for that, but an Ingress only exposes port 80 outside of the cluster and maps paths to different services inside the cluster, which is not what I'm looking for. As far as I understand, my local Spark jobs (or an HDFS client) talk to the namenode, which replies with an address for each block of data. That address, though, is something like 172.17.0.x:50010, and of course my local machine can't reach those.
Is there any way to make this work? Thanks in advance!
I know this question is just about getting it to run in a dev environment, but HDFS is very much a work in progress on K8s, so I wouldn't by any means run it in production (as of this writing). It's quite tricky to get it working on a container orchestration system because:
You are talking about a lot of data and a lot of nodes (namenodes/datanodes) that are not meant to start and stop in different places in your cluster.
You risk a constantly unbalanced cluster if you are not pinning your namenodes/datanodes to a K8s node (which defeats the purpose of having a container orchestration system).
If you run your namenodes in HA mode and for any reason your namenodes die and restart, you run the risk of corrupting the namenode metadata, which would make you lose all your data. It's also risky if you have a single namenode and don't pin it to a K8s node.
You can't scale up and down easily without ending up with an unbalanced cluster. Running an unbalanced cluster defeats one of the main purposes of HDFS.
If you look at DC/OS they were able to make it work on their platform, so that may give you some guidance.
In K8s you basically need to create services for all your namenode ports and all your datanode ports. Your client needs to be able to find every namenode and datanode so that it can read/write from them. Also, some ports cannot go through an Ingress because they are layer-4 (TCP) ports, for example the IPC port 8020 on the namenode and 50020 on the datanodes.
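As a rough sketch of what that looks like for a minikube/dev setup (the names namenode and datanode-0..2 are assumptions based on the Deployment/StatefulSet described in the question):

    # Expose the namenode IPC port (8020) outside the cluster
    kubectl expose deployment namenode --name=namenode-ext --type=NodePort --port=8020

    # Each datanode pod needs its own reachable endpoint, e.g. one NodePort
    # service per pod (StatefulSet pods carry a unique pod-name label, so
    # exposing the pod selects just that datanode)
    kubectl expose pod datanode-0 --name=datanode-0-ext --type=NodePort --port=50010
    kubectl expose pod datanode-1 --name=datanode-1-ext --type=NodePort --port=50010
    kubectl expose pod datanode-2 --name=datanode-2-ext --type=NodePort --port=50010

The client also usually has to address datanodes by hostname rather than by the 172.17.0.x pod IPs (dfs.client.use.datanode.hostname=true on the client side), and those hostnames have to resolve from your laptop, e.g. via /etc/hosts entries pointing at the minikube IP; the randomly assigned NodePort numbers are part of what makes this fiddly outside of a dev environment.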
Hope it helps!
Hadoop-HA cluster - 4 nodes
As soon as I start the Hadoop services, unnecessary YARN applications get launched and no application logs get generated. I am not able to debug the problem without logs. Can anyone help me resolve this issue?
https://i.stack.imgur.com/RjvkB.png
I have never come across such an issue, but it seems there is some script, or maybe an Oozie job, triggering these apps. Try Yarn-Clean if that is of any help.
Yarn-Clean
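In the meantime, a sketch of how you could inspect and stop the unwanted applications from the command line (the application ID below is just a placeholder):

    # See which applications are being submitted, and by which user and queue
    yarn application -list -appStates ACCEPTED,RUNNING

    # Kill a specific unwanted application
    yarn application -kill application_1234567890123_0001

    # The ResourceManager log records every submission; its location depends
    # on your install, often under $HADOOP_HOME/logs
    less $HADOOP_HOME/logs/yarn-*-resourcemanager-*.log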
I am working with a hadoop-2.6.0 single-node cluster on Windows. When I submit any MapReduce job, it always stays in the ACCEPTED state. It seems my NodeManager is in an unhealthy state. How do I make it healthy? Why is the NodeManager in an unhealthy state, and when will it go back to a healthy state?
Found the solution here
It seems the cause of the problem was low disk space on the drive where Hadoop is installed. Once I cleaned up and freed more space, the NodeManager automatically changed to a healthy state. From what I can tell, we can't change the state of Hadoop nodes manually with any command.
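A quick sketch of how to confirm that disk space is the culprit (the 90% figure is the default threshold and may differ in your configuration):

    # Show node states together with their health report; a node that ran out
    # of space usually reports something like "local-dirs are bad"
    yarn node -list -all

    # The disk health checker marks a node unhealthy once utilization of the
    # NodeManager local/log dirs passes
    # yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage
    # (default 90%). Freeing space below that threshold lets the node recover
    # on its own; on Windows, just check and free space on that drive.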
When a job is stuck in the ACCEPTED state, it means it is waiting for the cluster to accept it and allocate resources so it can start processing.
The following should be done:
Check for available slots.
If slots are available and it is still taking time for the status to change to RUNNING, check the datanodes' health using either Cloudera Manager or the hadoop dfsadmin command (see the sketch after this list).
If there are dead nodes, restarting them should solve the issue.
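A sketch of the equivalent command-line checks (output format varies a bit by version):

    # Check how many NodeManagers are available and their state
    yarn node -list

    # Check datanode health, dead nodes and remaining capacity
    hdfs dfsadmin -report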
Please try adding this property to yarn-site.xml (it disables the NodeManager disk health checker):

    <property>
      <name>yarn.nodemanager.disk-health-checker.enable</name>
      <value>false</value>
    </property>
I have 5 datanodes in my Cloudera cluster (CDH4) and all are showing as healthy.
The only service with an issue is HDFS, which is showing under-replication. When I go to host:50070/dfsnodelist.jsp?whatNodes=LIVE I see all 5 nodes, of which only one is showing as In Service. The rest of them are Decommissioned.
I'm seeing a lot of information about removing decommissioned nodes but very little on how to recommission them. As far as Cloudera Manager is concerned, these nodes are active.
Well, that is strange. Although Cloudera Manager thinks they're okay, I thought I'd try decommissioning the datanodes and then recommissioning them as a last-ditch effort. It seems to have fixed the nodes reporting as decommissioned.
In addition, it fixed my under-replication issue as well.
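For anyone who cannot use the Cloudera Manager workflow, a rough sketch of the equivalent manual recommission on the namenode (the exclude-file path is an assumption; it is whatever dfs.hosts.exclude points to in your hdfs-site.xml):

    # Remove the node from the exclude file referenced by dfs.hosts.exclude
    vi /etc/hadoop/conf/dfs.exclude

    # Tell the namenode to re-read the include/exclude lists
    hdfs dfsadmin -refreshNodes

    # Confirm the datanode's decommission status is back to Normal
    hdfs dfsadmin -report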
We are running Hadoop on an Amazon EC2 cluster. We start the master and slaves, attach the EBS volumes, and finally wait for the Hadoop JobTracker, TaskTrackers, etc. to start, with a timeout of 3600 seconds. We are noticing that 50% of the time the JobTracker is not able to start before the timeout. The reason is that HDFS is not initialized properly and is still in safe mode, so the JobTracker is unable to start. I noticed a few connectivity issues between nodes on EC2 when I tried manually pinging the slaves.
Did anyone face a similar issue and know how to solve it?
I'm not sure whether this issue is related to Amazon EC2.
I had this problem very often too, although I had a pseudo-distributed installation on my machine.
In these cases I could turn safe mode off manually and safely.
Try this command: bin/hadoop dfsadmin -safemode leave
I don't think you can do anything wrong here. It seems to be a buggy feature of Hadoop. I used 0.18.3; what version do you run?
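If you want to check whether it is actually safe to force it, a quick sketch using the same dfsadmin interface:

    # See whether the namenode is still in safe mode
    bin/hadoop dfsadmin -safemode get

    # Check whether all datanodes have reported in; if some are missing
    # (e.g. the EC2 connectivity issues mentioned above), the namenode
    # will not leave safe mode on its own
    bin/hadoop dfsadmin -report

    # Or have the startup script block until safe mode ends instead of
    # relying on a fixed timeout
    bin/hadoop dfsadmin -safemode wait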