SparkException: Master removed our application - amazon-ec2

I know there are other very similar questions on Stackoverflow but those either didn't get answered or didn't help me out. In contrast to those questions I put much more stack trace and log file information into this question. I hope that helps, although it made the question to become sorta long and ugly. I'm sorry.
Setup
I'm running a 9 node cluster on Amazon EC2 using m3.xlarge instances with DSE (DataStax Enterprise) version 4.6 installed. For each workload (Cassandra, Search and Analytics) 3 nodes are used. DSE 4.6 bundles Spark 1.1 and Cassandra 2.0.
Issue
The application (Spark/Shark-Shell) gets removed after ~3 minutes even if I do not run any query. Queries on small datasets run successful as long as they finish within ~3 minutes.
I would like to analyze much larger datasets. Therefore I need the application (shell) not to get removed after ~3 minutes.
Error description
On the Spark or Shark shell, after idling ~3 minutes or while executing (long-running) queries, Spark will eventually abort and give the following stack trace:
15/08/25 14:58:09 ERROR cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: Master removed our application: FAILED
org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
FAILED: Execution Error, return code -101 from shark.execution.SparkTask
This is not very helpful (to me), that's why I'm going to show you more log file information.
Error Details / Log Files
Master
From the master.log I think the interesing parts are
INFO 2015-08-25 09:19:59 org.apache.spark.deploy.master.DseSparkMaster: akka.tcp://sparkWorker#172.31.46.48:46715 got disassociated, removing it.
INFO 2015-08-25 09:19:59 org.apache.spark.deploy.master.DseSparkMaster: akka.tcp://sparkWorker#172.31.33.35:42136 got disassociated, removing it.
and
ERROR 2015-08-25 09:21:01 org.apache.spark.deploy.master.DseSparkMaster: Application Shark::ip-172-31-46-49 with ID app-20150825091745-0007 failed 10 times, removing it
INFO 2015-08-25 09:21:01 org.apache.spark.deploy.master.DseSparkMaster: Removing app app-20150825091745-0007
Why do the worker nodes get disassociated?
In case you need to see it, I attached the master's executor (ID 1) stdout as well. The executors stderr is empty. However, I think it shows nothing useful to tackle the issue.
On the Spark Master UI I verified to see all worker nodes to be ALIVE. The second screenshot shows the application details.
There is one executor spawned on the master instance while executors on the two worker nodes get respawned until the whole application is removed. Is that okay or does it indicate some issue? I think it might be related to the "(it) failed 10 times" error message from above.
Worker logs
Furthermore I can show you logs of the two Spark worker nodes. I removed most of the class path arguments to shorten the logs. Let me know if you need to see it. As each worker node spawns multiple executors I attached links to some (not all) executor stdout and stderr dumps. Dumps of the remaining executors look basically the same.
Worker I
worker.log
Executor (ID 10) stdout
Executor (ID 10) stderr
Worker II
worker.log
Executor (ID 3) stdout
Executor (ID 3) stderr
The executor dumps seem to indicate some issue with permission and/or timeout. But from the dumps I can't figure out any details.
Attempts
As mentioned above, there are some similar questions but none of those got answered or it didn't help me to solve the issue. Anyway, things I tried and verified are:
Opened port 2552. Nothing changes.
Increased spark.akka.askTimeout which results in the Spark/Shark app to live longer but eventually it still gets removed.
Ran the Spark shell locally with spark.master=local[4]. On the one hand this allowed me to run queries longer than ~3 minutes successfully, on the other hand it obviously doesn't take advantage of the distributed environment.
Summary
To sum up, one could say that the timeouts and the fact long-running queries are successfully executed in local mode all indicate some misconfiguration. Though I cannot be sure and I don't know how to fix it.
Any help would be very much appreciated.
Edit: Two of the Analytics and two of the Solr nodes were added after the initial setup of the cluster. Just in case that matters.
Edit (2): I was able to work around the issue described above by replacing the Analytics nodes with three freshly installed Analytics nodes. I can now run queries on much larger datasets without the shell being removed. I intend not to put this as an answer to the question as it is still unclear what is wrong with the three original Analytics nodes. However, as it is a cluster for testing purposes, it was okay to simply replace the nodes (after replacing the nodes I performed a nodetool rebuild -- Cassandra on each of the new nodes to recover their data from the Cassandra datacenter).

As mentioned in the attempts, the root cause is a timeout between the master node, and one or more workers.
Another thing to try: Verify that all workers are reachable by hostname from the master, either via dns or an entry in the /etc/hosts file.
In my case, the problem was that the cluster was running in an AWS subnet without DNS. The cluster grew over time by spinning up a node, the adding the node to the cluster. When the master was built, only a subset of the addresses in the cluster was known, and only that subset was added to the /etc/hosts file.
When dse spark was run from a "new" node, then communication from the master using the worker's hostname failed and the master killed the job.

Related

HDFS Showing 0 Blocks after cluster reboot

I've setup a small cluster for testing / academic proposes, I have 3 nodes, one of which is acting both as namenode and datanode (and secondarynamenode).
I've uploaded 60GB of files (about 6.5 Million files) and uploads started to get really slow, so I read on the internet that I could stop the secondary namenode service on the main machine, at the moment it had no effect on anything.
After I rebooted all 3 computers, two of my datanodes show 0 blocks (despite showing disk usage in web interface) even with both namenodes services running.
One of the nodes with problem is the one running the namenode as well so I am guessing it is not a network problem.
any ideas on how can I get these blocks to be recognized again? (without start it all over again which took about two weeks to upload all)
Update
After half an hour after another reboot this showed in the logs:
2018-03-01 08:22:50,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x199d1a180e357c12, containing 1 storage report(s), of which we sent 0. The reports had 6656617 total blocks and used 0 RPC(s). This took 679 msec to generate and 94 msecs for RPC and NN processing. Got back no commands.
2018-03-01 08:22:50,212 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService
java.io.EOFException: End of File Exception between local host is: "Warpcore/192.168.15.200"; destination host is: "warpcore":9000; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
And the EOF stack trace, after searching the web I discovered this [http://community.cloudera.com/t5/CDH-Manual-Installation/CDH-5-5-0-datanode-failed-to-send-a-large-block-report/m-p/34420] but still can't understand how to fix this.
The report block is too big and need to be split, but I don't know how or where should I configure this. I´m googling...
The problem seems to be low RAM on my namenode, as a workaround I added more directories to the namenode configuration as if I had multiple disks and rebalanced the files manually as instructed ins the comments here.
As hadoop 3.0 reports each disk separately the datenode was able to report and I was able to retrieve the files, this is an ugly workaround and not for production, but good enough for my academic purposes.
An interesting side effect was the datanode reporting multiple times the available disk space wich could lead into serious problems on production.
It seems a better solution is using HAR to reduce the number of blocks as described here and here

IPython Notebook with Spark on EC2 : Initial job has not accepted any resources

I am trying to run the simple WordCount job in IPython notebook with Spark connected to an AWS EC2 cluster. The program works perfectly when I use Spark in the local standalone mode but throws the problem when I try to connect it to the EC2 cluster.
I have taken the following steps
I have followed instructions given in this Supergloo blogpost.
No errors are found until the last line where I try to write the output to a file. [The lazyloading feature of Spark means that this when the program really starts to execute]
This is where I get the error
[Stage 0:> (0 + 0) / 2]16/08/05 15:18:03 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Actually there is no error, we have this warning and the program goes into an indefinite wait state. Nothing happens until I kill the IPython notebook.
I have seen this Stackoverflow post and have reduced the number of cores to 1 and memory to 512 by using this options after the main command
--total-executor-cores 1 --executor-memory 512m
The screen capture from the SparkUI is as follows
sparkUI
This clearly shows that both core and UI is not being fully utilized.
Finally, I see from this StackOverflow post that
The spark-ec2 script configure the Spark Cluster in EC2 as standalone,
which mean it can not work with remote submits. I've been struggled
with this same error you described for days before figure out it's not
supported. The message error is unfortunately incorrect.
So you have to copy your stuff and log into the master to execute your
spark task.
If this is indeed the case, then there is nothing more to be done, but since this statement was made in 2014, I am hoping that in the last 2 years the script has been rectified or there is a workaround. If there is any workaround, I would be grateful if someone can point it out to me please.
Thank you for your reading till this point and for any suggestions offered.
You can not submit jobs except on the Master - as you see - unless you set up a REST based Spark job server.

Hortonworks HDP , heartbeat lost in one of the 3 nodes

I have installed HDP Ambari with three nodes in VM, i restarted one of three nodes i.e., datanode2 after that, i lost heart beat from that node in Ambari. I restarted ambari-agent in all three nodes, then also not working. Kindly find me a solution.
Well the provided information is not sufficient, anyway i will try to tell you the normal approach I take to debug this.
First check if all the ambari-agents are running, use the command ambari-agent status.
Check the logs of both ambari-agent and ambari-server. Normally the logs are available at /var/log/ambari-agent and /var/log/ambari-server. Logs should tell you the exact reason for heartbeat lost.
Most common reasons for the agent failure would be Connection issues between the machines, version mismatch or corrupt database entry.
I think log files should help you.

Oozie Hive action on AWS - unpredictable ip sources break the job

I've been having a few days of unalloyed torture getting Hive jobs to run via Oozie on an AWS 5 machine cluster. The simplest job that involved the live metastore succeeds or fails unpredictably. The error messages are pretty unhelpful:
Hive failed, error message[Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [1]]
Thanks Oozie!
After a lot of fun changing just about every imaginable setting, I studied hivemetastore.log carefully (we have mySQL as the metastore) and realised that every successful request came from 172.31.40.3. Unsuccessful requests came from 172.31.40.2,172.31.40.4 and 172.31.40.5 . The Hive console app makes requests without problems on 172.31.40.1
This is getting somewhere after nearly week of having no idea whatsover is going on. The question is now, what do I need to change to allow all requests from 172.31.40.1-5 in? Or funnel Oozie requests solely through 172.31.40.1 or 172.31.40.3, either.
Why would only 172.31.40.1 and 172.31.40.3 work?
all ideas and suggestions warmly received.
many thanks
Toby
this was so simple in the end - the Oozie client was only installed on 2 of the 5 machines in the cluster. Corresponding, of course, to the 2 IP addresses that could make successful requests to the hive metastore
Once we installed the Oozie client onto all the machines in the cluster, all the jobs were automatically accepted and ran OK
obvious when you know the answer ...

Hadoop mapreduce getMapOutput failed

Current setup:
- Hadoop 0.20.2-cdh3u3
- Hbase Version 0.90.4-cdh3u3
- Jetty-6.1.14
- Running on VM (Debian Squeeze)
Problem appears during mapreduce process on Hbase table. On the Reduce phase it crashes every time at the very same point with these logs in tasktracker.log:
ERROR org.apache.hadoop.mapred.TaskTracker: getMapOutput(attempt_201205290717_0001_m_000010_0,3) failed:
org.mortbay.jetty.EofException
WARN org.mortbay.log: Committed before 410 getMapOutput(attempt_201205290717_0001_m_000010_0,3) failed :
org.mortbay.jetty.EofException
ERROR org.mortbay.log: /mapOutput
java.lang.IllegalStateException: Committed
Hoping anyone faced the same or similar problem before, looking for a solution.
I am facing the same issue here.
On my cluster, this happens on all slaves (datanode & tasttrackers) except for one, which results in the general reduce process to first progress very slowly and at a certain point in a reroll of the reduce progress so far due to some error. the reduce process then starts all over again: the job never finishes.
There is an open major issue in the bugtracker. See https://issues.apache.org/jira/browse/MAPREDUCE-5
Let us hope, it will be fixed some day, but at the very moment, i can not use my hadoop program with huge files > 3 GB at all. In my case i hope, i can fix it by additional data cleaning and more efficient data structures (trove, fastutils), so the problem doesnt occur at all, but honestly, this feels kind of like the wrong approach here. Not to do those smaller tweaks was the main reason starting with hadoop anyways.
The Jetty EOFException is observed when the reduce Task prematurely closes a connection to a jetty server. Restart the tasktrackers and run the job again. See if it works for you.

Resources