BlockMissingException on running Spark job - hadoop

I am running a spark job via pyspark, which consistently returns an error:
Diagnostics: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-908041201-10.122.103.38-1485808002236:blk_1073741831_1007 file=/hdp/apps/2.5.3.0-37/spark/spark-hdp-assembly.jar
The error is always on the same block, namely BP-908041201-10.122.103.38-1485808002236:blk_1073741831_1007.
When i look in the hadoop tracking url, the message reads:
Application application_1505726128034_2371 failed 2 times due to AM Container
for appattempt_1505726128034_2371_000002 exited with exitCode: -1000
I can only assume from this that there is some corrupted data? How can i view the data / block via hadoop command line and see exactly which data is on this potentially corrupted block.
Unfortunately there doesnt appear to be more detailed logs on the specific nodes of failure when looking in the web based logs.
Also - is there a way in pyspark to ignore any 'corrupted' blocks and simply ignore any files/blocks it cannot fully read?
Thanks

Related

Yarn shows the jobs is succeeded but the EMR shows the step is still running

Yarn shows the jobs is succeeded(in Yarn UI) but the EMR shows the step(in EMR console UI) is still running and it shows like tat forever. Any thought ?
I am writing to s3 as json part files and I see this in driver logs :
Caused by: java.io.IOException: File already exists:s3n:
But the driver is still running but yarn shows as successed.
I ran into the same issue, where s3 was telling me that the file already existed and the job was finishing as expected. First of all, don't use s3n://, use s3:// instead, as recommended in this issue.
In order to get rid of the IOException, I enabled EMRFS consistent view, which is recommended for "clusters that run quick, sequential steps using Amazon S3 as a data store", which was my case. YMMV.

Max limit of oozie workflows

Does anyone have any idea on what's the maximum limit of oozie workflows that can execute in parallel?
I'm running 35 workflows in parallel (or that's what oozie UI mentions that they all got started in parallel). All the subworkflows perform ingestion of files from local to HDFS & do some validation checks henceforth on the metadata of file. Simple as that.
However, I see some subworkflows get failed during execution; the step in which they fail tries to put the files into HDFS location, i.e., the process wasn't able to execute hdfs dfs -put command. However, when I rerun these subworkflows they run successfully.
Not sure what caused them to execute and fail on hdfs dfs -put.
Any clues/suggestions on what could be happening?
First limitation does not depends on Oozie, but on resources available in YARN to execute Oozie actions as each action is executed in one map. But this limit will not fail your workflow: they will just wait for resources.
The major limit we've faced, leading to troubles, was on the callable queue of oozie services. Sometime, on heavy loads created by plenty of coordinator submitting plenty of worklow, Oozie was loosing more time in processing its internal callable queue than running workflows :/
Check oozie.service.CallableQueueService settings for informations about this.

IPython Notebook with Spark on EC2 : Initial job has not accepted any resources

I am trying to run the simple WordCount job in IPython notebook with Spark connected to an AWS EC2 cluster. The program works perfectly when I use Spark in the local standalone mode but throws the problem when I try to connect it to the EC2 cluster.
I have taken the following steps
I have followed instructions given in this Supergloo blogpost.
No errors are found until the last line where I try to write the output to a file. [The lazyloading feature of Spark means that this when the program really starts to execute]
This is where I get the error
[Stage 0:> (0 + 0) / 2]16/08/05 15:18:03 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Actually there is no error, we have this warning and the program goes into an indefinite wait state. Nothing happens until I kill the IPython notebook.
I have seen this Stackoverflow post and have reduced the number of cores to 1 and memory to 512 by using this options after the main command
--total-executor-cores 1 --executor-memory 512m
The screen capture from the SparkUI is as follows
sparkUI
This clearly shows that both core and UI is not being fully utilized.
Finally, I see from this StackOverflow post that
The spark-ec2 script configure the Spark Cluster in EC2 as standalone,
which mean it can not work with remote submits. I've been struggled
with this same error you described for days before figure out it's not
supported. The message error is unfortunately incorrect.
So you have to copy your stuff and log into the master to execute your
spark task.
If this is indeed the case, then there is nothing more to be done, but since this statement was made in 2014, I am hoping that in the last 2 years the script has been rectified or there is a workaround. If there is any workaround, I would be grateful if someone can point it out to me please.
Thank you for your reading till this point and for any suggestions offered.
You can not submit jobs except on the Master - as you see - unless you set up a REST based Spark job server.

SparkException: Master removed our application

I know there are other very similar questions on Stackoverflow but those either didn't get answered or didn't help me out. In contrast to those questions I put much more stack trace and log file information into this question. I hope that helps, although it made the question to become sorta long and ugly. I'm sorry.
Setup
I'm running a 9 node cluster on Amazon EC2 using m3.xlarge instances with DSE (DataStax Enterprise) version 4.6 installed. For each workload (Cassandra, Search and Analytics) 3 nodes are used. DSE 4.6 bundles Spark 1.1 and Cassandra 2.0.
Issue
The application (Spark/Shark-Shell) gets removed after ~3 minutes even if I do not run any query. Queries on small datasets run successful as long as they finish within ~3 minutes.
I would like to analyze much larger datasets. Therefore I need the application (shell) not to get removed after ~3 minutes.
Error description
On the Spark or Shark shell, after idling ~3 minutes or while executing (long-running) queries, Spark will eventually abort and give the following stack trace:
15/08/25 14:58:09 ERROR cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: Master removed our application: FAILED
org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
FAILED: Execution Error, return code -101 from shark.execution.SparkTask
This is not very helpful (to me), that's why I'm going to show you more log file information.
Error Details / Log Files
Master
From the master.log I think the interesing parts are
INFO 2015-08-25 09:19:59 org.apache.spark.deploy.master.DseSparkMaster: akka.tcp://sparkWorker#172.31.46.48:46715 got disassociated, removing it.
INFO 2015-08-25 09:19:59 org.apache.spark.deploy.master.DseSparkMaster: akka.tcp://sparkWorker#172.31.33.35:42136 got disassociated, removing it.
and
ERROR 2015-08-25 09:21:01 org.apache.spark.deploy.master.DseSparkMaster: Application Shark::ip-172-31-46-49 with ID app-20150825091745-0007 failed 10 times, removing it
INFO 2015-08-25 09:21:01 org.apache.spark.deploy.master.DseSparkMaster: Removing app app-20150825091745-0007
Why do the worker nodes get disassociated?
In case you need to see it, I attached the master's executor (ID 1) stdout as well. The executors stderr is empty. However, I think it shows nothing useful to tackle the issue.
On the Spark Master UI I verified to see all worker nodes to be ALIVE. The second screenshot shows the application details.
There is one executor spawned on the master instance while executors on the two worker nodes get respawned until the whole application is removed. Is that okay or does it indicate some issue? I think it might be related to the "(it) failed 10 times" error message from above.
Worker logs
Furthermore I can show you logs of the two Spark worker nodes. I removed most of the class path arguments to shorten the logs. Let me know if you need to see it. As each worker node spawns multiple executors I attached links to some (not all) executor stdout and stderr dumps. Dumps of the remaining executors look basically the same.
Worker I
worker.log
Executor (ID 10) stdout
Executor (ID 10) stderr
Worker II
worker.log
Executor (ID 3) stdout
Executor (ID 3) stderr
The executor dumps seem to indicate some issue with permission and/or timeout. But from the dumps I can't figure out any details.
Attempts
As mentioned above, there are some similar questions but none of those got answered or it didn't help me to solve the issue. Anyway, things I tried and verified are:
Opened port 2552. Nothing changes.
Increased spark.akka.askTimeout which results in the Spark/Shark app to live longer but eventually it still gets removed.
Ran the Spark shell locally with spark.master=local[4]. On the one hand this allowed me to run queries longer than ~3 minutes successfully, on the other hand it obviously doesn't take advantage of the distributed environment.
Summary
To sum up, one could say that the timeouts and the fact long-running queries are successfully executed in local mode all indicate some misconfiguration. Though I cannot be sure and I don't know how to fix it.
Any help would be very much appreciated.
Edit: Two of the Analytics and two of the Solr nodes were added after the initial setup of the cluster. Just in case that matters.
Edit (2): I was able to work around the issue described above by replacing the Analytics nodes with three freshly installed Analytics nodes. I can now run queries on much larger datasets without the shell being removed. I intend not to put this as an answer to the question as it is still unclear what is wrong with the three original Analytics nodes. However, as it is a cluster for testing purposes, it was okay to simply replace the nodes (after replacing the nodes I performed a nodetool rebuild -- Cassandra on each of the new nodes to recover their data from the Cassandra datacenter).
As mentioned in the attempts, the root cause is a timeout between the master node, and one or more workers.
Another thing to try: Verify that all workers are reachable by hostname from the master, either via dns or an entry in the /etc/hosts file.
In my case, the problem was that the cluster was running in an AWS subnet without DNS. The cluster grew over time by spinning up a node, the adding the node to the cluster. When the master was built, only a subset of the addresses in the cluster was known, and only that subset was added to the /etc/hosts file.
When dse spark was run from a "new" node, then communication from the master using the worker's hostname failed and the master killed the job.

Marathon kills container when requests increase

I deploy Docker containers on Mesos(0.21) and Marathon(0.7.6) on Google Cloud Engine.
I use JMeter to test a REST service that run on Marathon. When the concurrent requests are less than 10, it works normal, but when the concurrent requests are over 50, the container is killed and Mesos start another container. I increase RAM, CPU but it still happens.
This is log in /var/log/mesos/
E0116 09:33:31.554816 19298 slave.cpp:2344] Failed to update resources for container 10e47946-4c54-4d64-9276-0ce94af31d44 of executor dev_service.2e25332d-964f-11e4-9004-42010af05efe running task dev_service.2e25332d-964f-11e4-9004-42010af05efe on status update for terminal task, destroying container: Failed to determine cgroup for the 'cpu' subsystem: Failed to read /proc/612/cgroup: Failed to open file '/proc/612/cgroup': No such file or directory
The error message you're seeing is actually another symptom, not the root cause of the issue. There's a good explanation/discussion in this Apache Jira bug report:
https://issues.apache.org/jira/browse/MESOS-1837
Basically, your container is crashing for one reason or another and the /proc/pid#/ directory is getting cleared without Mesos being aware, so it throws the error message you found when it goes to check that /proc directory.
Try setting your allocated CPU higher in the JSON file describing your task.

Resources