Unwanted YARN applications launch as soon as Hadoop services start on a 4-node Hadoop HA cluster

Hadoop-HA cluster - 4 nodes
As soon as I start the Hadoop services, unnecessary YARN applications get launched and no application logs are generated. I am not able to debug the problem without logs. Can anyone help me resolve this issue?
https://i.stack.imgur.com/RjvkB.png

I have never come across such an issue, but it seems there is some script, or maybe an Oozie job, triggering these apps. Try yarn-clean if that is of any help.
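If you want to inspect and stop the unexpected applications while hunting for whatever triggers them, the standard YARN CLI is enough; the application ID below is a placeholder:
# List whatever is currently running or accepted on the cluster
yarn application -list -appStates RUNNING,ACCEPTED
# Inspect one of the suspicious applications (note the user and queue that submitted it)
yarn application -status application_1234567890123_0001
# Kill it once you have captured the details
yarn application -kill application_1234567890123_0001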

Related

Yarn UI shows no information about applications

I know a similar question was asked (Applications not shown in yarn UI when running mapreduce hadoop job?), but the answers did not solve my problem.
I am running Hadoop Streaming on Linux 17.01. I set up a cluster with three worker nodes and one master node.
When I start Hadoop, I can access localhost:50070 to see other nodes (all nodes are alive).
However, I see no information under "Applications" at localhost:8088,
nor from the command "yarn application -list -appStates ALL".
Here is my configuration.
My yarn-site.xml (for all nodes)
Here are all the processes on the master node
The problem may be due to the YARN services running on IPv6. I followed this thread
https://askubuntu.com/questions/440649/how-to-disable-ipv6-in-ubuntu-14-04
to change all YARN services to IPv4. However, there are still no tasks displayed in the YARN UI, even though all the nodes in my cluster are marked as "active" there.
So I do not know why this happened. Do you have any suggestions?
Thank you very much.
I haven't typically seen YARN being configured specifically for IPv4, but this property is usually added to hadoop-env.sh:
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true"
You would also add a similar variable to yarn-env.sh for YARN_OPTS, I think.
However, it's not really clear from your question when, or if, you've even submitted an application for anything to appear.
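For reference, a minimal sketch of the two env files mentioned above; the YARN_OPTS line follows the same pattern but is an assumption, not something verified here:
# hadoop-env.sh
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
# yarn-env.sh (assumed equivalent for the YARN daemons)
export YARN_OPTS="$YARN_OPTS -Djava.net.preferIPv4Stack=true"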

Spark Shell stuck in YARN Accepted state

Running Spark 1.3.1 on YARN on EMR. When I run spark-shell everything looks normal until I start seeing messages like INFO yarn.Client: Application report for application_1439330624449_1561 (state: ACCEPTED). These messages are generated endlessly, once per second. Meanwhile, I am unable to use the Spark shell.
I don't understand why this is happening.
Seeing (near) endless Accepted messages from YARN has always been a sure sign that there were not enough cluster resources to allocate for my Spark jobs / shell. YARN will continue trying to schedule your Spark application, but will eventually time-out if not enough resources become available in a certain amount of time.
Are you providing any command line options to spark-shell that override the defaults provided? When I ask for too many executors/cores/memory YARN will accept my request but never transition to a Running ApplicationMaster.
Try running a spark-shell with no options (other than perhaps --master yarn) and see if it gets past Accepted.
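For example, something as bare as this is a reasonable baseline test (on Spark 1.3.x the client-mode master is spelled yarn-client):
# spark-shell with no executor/core/memory overrides
spark-shell --master yarn-client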
Realized there were a couple of streaming jobs I had killed in the terminal, but I guess they were somehow still running. I was able to find them in the YARN UI listing all running applications (I wasn't able to execute Hive queries either). Once I killed the jobs using the command below, the spark-shell started as usual.
yarn application -kill application_1428487296152_25597
I guess that YARN does not have enough resources to run the jobs.
Please check
https://www.cloudera.com/documentation/enterprise/5-3-x/topics/cdh_ig_yarn_tuning.html
to calculate how many resources you can provide to YARN.
Check the number of cores and the amount of RAM, which are controlled by the following properties:
yarn.nodemanager.resource.cpu-vcores
yarn.nodemanager.resource.memory-mb
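As a rough, hypothetical sizing example (numbers are illustrative, not taken from the linked guide): on a worker with 16 GB of RAM and 8 cores you might reserve a quarter of each for the OS and the Hadoop daemons:
# Illustrative values only; derive yours from the tuning guide
#   yarn.nodemanager.resource.memory-mb  -> 12288   (12 of 16 GB)
#   yarn.nodemanager.resource.cpu-vcores -> 6       (6 of 8 cores)
# Then confirm what each NodeManager actually advertises to the ResourceManager
yarn node -list -all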

"LOST" node in EMR Cluster

How do I troubleshoot and recover a Lost Node in my long running EMR cluster?
The node stopped reporting a few days ago. The host seems to be fine, and so does HDFS. I noticed the issue only from the Hadoop applications UI.
EMR nodes are ephemeral and you cannot recover them once they are marked as LOST. You can avoid this in the first place by enabling the 'Termination Protection' feature during cluster launch.
To find the reason for the LOST node, you can check the YARN ResourceManager logs and/or the instance controller logs of your cluster to find out more about the root cause.
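On a typical EMR node those logs live roughly in the locations below; the paths can vary by EMR release, so treat them as a starting point rather than a guarantee:
# YARN ResourceManager log on the master node
less /var/log/hadoop-yarn/yarn-yarn-resourcemanager-*.log
# Instance controller log on the affected node
less /emr/instance-controller/log/instance-controller.log
# Nodes that YARN currently marks as LOST
yarn node -list -states LOST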

Spark EC2 deployment error: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up

I have a question regarding deploying a Spark application on a standalone EC2 cluster. I followed the Spark tutorial and was able to successfully deploy a standalone EC2 cluster. I verified that by connecting to the cluster UI and making sure everything is as it is supposed to be. I developed a simple application and tested it locally; everything works fine. When I submit it to the cluster (just changing --master local[4] into --master spark://....) I get the following error: ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up. Does anyone know how to overcome this problem? My deploy-mode is client.
Make sure that you have provided the correct URL for the master.
Basically, the exact Spark master URL is displayed on the page when you connect to the web UI.
The URL on that page looks like: Spark Master at spark://IPAddress:port
Also note that the web UI port and the port the Spark master actually listens on may be different.
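In practice the submission ends up looking something like this; the address, port, and application file are placeholders (the standalone master defaults to port 7077, while the web UI usually listens on 8080):
# Master URL copied verbatim from the "Spark Master at ..." line in the web UI
spark-submit \
  --master spark://<master-ip>:7077 \
  --deploy-mode client \
  my_app.py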

How does Apache Spark handle system failure when deployed on YARN?

Preconditions
Let's assume Apache Spark is deployed on a Hadoop cluster using YARN, and a Spark job is running. How does Spark handle the situations listed below?
Cases & Questions
One node of the Hadoop cluster fails due to a disk error. However, replication is high enough and no data was lost.
What will happen to tasks that were running on that node?
One node of the Hadoop cluster fails due to a disk error. Replication was not high enough and data was lost. Spark simply can't find a file anymore that was pre-configured as a resource for the workflow.
How will it handle this situation?
During execution the primary namenode fails over.
Does Spark automatically use the failover namenode?
What happens when the secondary namenode fails as well?
For some reason the cluster is shut down entirely in the middle of a workflow.
Will Spark restart with the cluster automatically?
Will it resume from the last "save" point in the workflow?
I know some questions might sound odd. Anyway, I hope you can answer some or all of them.
Thanks in advance. :)
Here are the answers given on the mailing list (answers were provided by Sandy Ryza of Cloudera):
"Spark will rerun those tasks on a different node."
"After a number of failed task attempts trying to read the block, Spark would pass up whatever error HDFS is returning and fail the job."
"Spark accesses HDFS through the normal HDFS client APIs. Under an HA configuration, these will automatically fail over to the new namenode. If no namenodes are left, the Spark job will fail."
Restart is part of administration and "Spark has support for checkpointing to HDFS, so you would be able to go back to the last time checkpoint was called that HDFS was available."
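If you want to verify the namenode failover behaviour yourself, the HDFS HA admin CLI shows which namenode is currently active; the service IDs nn1 and nn2 are placeholders for whatever dfs.ha.namenodes defines in your hdfs-site.xml:
# Ask each configured namenode for its current HA state (active / standby)
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2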