Hadoop on Amazon EC2: Job tracker not starting properly

We are running Hadoop on an Amazon EC2 cluster. We start the master and slaves, attach the EBS volumes, and then wait for the Hadoop jobtracker, tasktracker, etc. to start, with a timeout of 3600 seconds. About 50% of the time the jobtracker is not able to start before the timeout. The reason is that HDFS is not initialized properly and is still in safe mode, so the jobtracker cannot start. I also noticed a few connectivity issues between nodes on EC2 when I tried manually pinging the slaves.
Has anyone faced a similar issue and knows how to solve it?

I'm not sure whether this issue is related to Amazon EC2.
I had this problem very often too, although with a pseudo-distributed installation on my machine.
In those cases I could turn safe mode off manually and safely.
Try this command:
bin/hadoop dfsadmin -safemode leave
I don't think you can do much harm here; it seems to be a buggy behaviour of Hadoop. I used 0.18.3, what version do you run?
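A minimal sketch of how this might fit into a startup script, assuming the hadoop binary lives at bin/hadoop as in the command above (newer releases use hdfs dfsadmin instead):
bin/hadoop dfsadmin -safemode get    # report whether the namenode is still in safe mode
bin/hadoop dfsadmin -safemode wait   # block until the namenode leaves safe mode on its own
bin/hadoop dfsadmin -safemode leave  # force the namenode out of safe mode
Using the wait form first avoids forcing the namenode out of safe mode before block reports have actually arrived.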

Related

A number of YARN applications get launched as soon as the Hadoop services come up, on a 4-node Hadoop HA cluster

Hadoop HA cluster, 4 nodes.
As soon as I start the Hadoop services, unnecessary YARN applications get launched and no application logs are generated. I am not able to debug the problem without logs. Can anyone help me resolve this issue?
https://i.stack.imgur.com/RjvkB.png
I have never come across such an issue, but it seems that some script, or maybe an Oozie job, is triggering these applications. Try yarn-clean if that is of any help.
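To see what is actually being launched and by whom, a sketch assuming the yarn CLI is on the PATH (<application_id> is a placeholder taken from the list output):
yarn application -list -appStates RUNNING    # show currently running applications, their types and submitting users
yarn logs -applicationId <application_id>    # fetch aggregated logs for one of them (needs log aggregation enabled)
The submitting user and application type in the list output usually point to the script or scheduler that triggered the job.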

"LOST" node in EMR Cluster

How do I troubleshoot and recover a Lost Node in my long running EMR cluster?
The node stopped reporting a few days ago. The host seems to be fine and HDFS too. I noticed the issue only from the Hadoop Applications UI.
EMR nodes are ephemeral and you cannot recover them once they are marked as LOST. You can avoid this in the first place by enabling the 'Termination Protection' feature during cluster launch.
As for finding the reason the node was marked LOST, you can check the YARN ResourceManager logs and/or the instance controller logs of your cluster to find out more about the root cause.
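A sketch of how to confirm which nodes YARN currently considers lost, assuming shell access to the master node:
yarn node -list -all            # list every node the ResourceManager knows about, with its state
yarn node -list -states LOST    # restrict the listing to nodes currently marked LOST
The node IDs reported here can then be matched against the ResourceManager and instance controller logs mentioned above.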

How do I safely remove a Hadoop datanode for maintenance?

I want to take a single machine out of a Hadoop cluster temporarily.
Most documentation says to take it out by adding it to the yarn and dfs .exclude files and decommissioning it with hdfs dfsadmin -refreshNodes. I don't want to do that, though, because I want to take it out, make some changes to the machine, and bring it back online as soon as possible. I don't want hundreds of gigabytes of data copied around just to avoid under-replicated blocks!
Instead, I'd like to be able to power off the machine quickly while making sure:
The cluster as a whole is still operational.
No data is lost by the journalnode or nodemanager processes.
No YARN jobs fail or go AWOL when the process dies.
My best guess at how to do this is by issuing:
./hadoop-daemon.sh --hosts hostname stop datanode
./hadoop-daemon.sh --hosts hostname stop journalnode
./yarn-daemon.sh --hosts hostname stop nodemanager
And then starting each of these processes individually again when the machine comes back online.
Is that safe? And is there a more efficient way to do this?
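And, mirroring those stop commands, a sketch of bringing the daemons back up by running the scripts locally on the node once it is back online (script locations, and dropping --hosts in favour of running on the machine itself, are assumptions):
./hadoop-daemon.sh start datanode      # rejoin HDFS; the namenode marks the node live again once heartbeats resume
./hadoop-daemon.sh start journalnode   # only needed if this machine is one of the HA journal nodes
./yarn-daemon.sh start nodemanager     # re-register with the ResourceManager so it can receive containers again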

How does Apache Spark handle system failure when deployed on YARN?

Preconditions
Let's assume Apache Spark is deployed on a Hadoop cluster using YARN, and a Spark job is running. How does Spark handle the situations listed below?
Cases & Questions
One node of the Hadoop cluster fails due to a disk error, but replication is high enough and no data is lost.
What will happen to the tasks that were running on that node?
One node of the Hadoop cluster fails due to a disk error, replication was not high enough, and data is lost. Spark simply cannot find a file any more that was pre-configured as a resource for the workflow.
How will it handle this situation?
During execution the primary namenode fails over.
Does Spark automatically use the failover namenode?
What happens when the secondary namenode fails as well?
For some reason, during a workflow, the cluster is totally shut down.
Will Spark restart with the cluster automatically?
Will it resume from the last "save" point in the workflow?
I know some of these questions might sound odd. Anyway, I hope you can answer some or all of them.
Thanks in advance. :)
Here are the answers given on the mailing list to these questions (answers were provided by Sandy Ryza of Cloudera):
"Spark will rerun those tasks on a different node."
"After a number of failed task attempts trying to read the block, Spark would pass up whatever error HDFS is returning and fail the job."
"Spark accesses HDFS through the normal HDFS client APIs. Under an HA configuration, these will automatically fail over to the new namenode. If no namenodes are left, the Spark job will fail."
Restarting is part of administration, and "Spark has support for checkpointing to HDFS, so you would be able to go back to the last time checkpoint was called that HDFS was available."
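For the retry behaviour in the first two answers, a sketch of the knobs typically set at submission time on YARN (the class and jar names are placeholders, and the defaults vary by Spark version):
# spark.task.maxFailures: how many times a single task may fail before the job is failed
# spark.yarn.maxAppAttempts: how many times YARN may relaunch the application master after a failure
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.task.maxFailures=4 \
  --conf spark.yarn.maxAppAttempts=2 \
  --class com.example.MyJob my_job.jar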

Region server going down frequently after system start

I am running HBase on HDP on an Amazon machine.
When I reboot my system and start all the HBase services, they start fine,
but after some time my region server goes down.
The latest error I am getting in its log file is:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /apps/hbase/data/usertable/dd5a251551619e0109349a0dce855e1b/recovered.edits/0000000000000001172.temp could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1657)
Now I am not able to start it at all.
Any suggestions as to why this is happening?
Thanks in advance.
Make sure your datanodes are up and running. Also, set "dfs.data.dir" to a permanent location if you haven't done so yet; it defaults to a directory under "/tmp", which gets emptied at each restart. Also make sure that your datanodes are able to talk to the namenode, that there are no network-related issues, and that the datanode machines have enough free space left.
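A quick sketch of how to verify both points from the shell (the config file path assumes an HDP-style layout and is only an example):
bin/hadoop dfsadmin -report                          # shows how many live datanodes the namenode sees and their remaining space
grep -A1 dfs.data.dir /etc/hadoop/conf/hdfs-site.xml # check that the data directory is not under /tmp
If the report shows zero live datanodes, the "could only be replicated to 0 nodes" error above follows directly.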
