Apache Spark EC2 job not running. No space left on device - amazon-ec2

I have been running my program multiple times on a 20-node cluster. All of a sudden, every time I run the program I get the following error:
15/04/19 16:52:35 WARN scheduler.TaskSetManager: Lost task 35.0 in stage 9.0 (TID 384, ip-XXX.XXX.compute.internal): java.io.FileNotFoundException: /mnt/spark/spark-local-XXX-ebd3/18/shuffle_2_35_64 (No space left on device)
java.io.FileOutputStream.open(Native Method)
java.io.FileOutputStream.<init>(FileOutputStream.java:221)
org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)
org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)
org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Checking the UI, it says there's absolutely nothing on the nodes. I have run the program maybe 15 times, and this has only just started happening. Why has this occurred out of the blue, and how do I fix it?

"No space left on device" is a quite clear exception: That node has no space left on the mount where the spark local files are being written: /mnt/spark/
Solution: go to the node (or nodes) and clean that up. rm -rf FTW.
If jobs are breaking before they terminate, due to manual intervention or failure, they will often leave temp data behind.
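A rough sketch of that cleanup on an affected node (the spark-local-* pattern matches the path in the error above; make sure no job is currently using those directories before removing them):
# see how full the Spark scratch mount is
df -h /mnt/spark
# find what is taking the space
du -sh /mnt/spark/*
# remove leftover scratch directories from previous runs
rm -rf /mnt/spark/spark-local-*
If old application data keeps piling up, it may also be worth looking at Spark's standalone worker cleanup settings (spark.worker.cleanup.enabled and related properties), though those only cover directories of stopped applications.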

Related

Hadoop streaming job using MXNet failing in AWS EMR

I have set up an EMR step in AWS Data Pipeline. The step command looks like this:
/usr/lib/hadoop-mapreduce/hadoop-streaming.jar,-input,s3n://input-bucket/input-file,-output,s3://output/output-dir,-mapper,/bin/cat,-reducer,reducer.py,-file,/scripts/reducer.py,-file,/params/parameters.bin
I am getting the following error:
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134)
at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:244)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:467)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:393)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134)
at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:244)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:467)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:393)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
I have tried running the reducer step separately on my desktop (on a single-node Hadoop setup) and it works. I have already included #!/usr/bin/env python in the reducer script. I suspect that I am not writing the EMR step correctly.
EMR version: 5.5.0
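For reference, a step defined like the one above corresponds roughly to the following direct hadoop-streaming invocation (arguments copied from the step; the exact streaming jar path can differ between EMR releases), which can be handy for reproducing the failure from a shell on the master node:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -input s3n://input-bucket/input-file \
  -output s3://output/output-dir \
  -mapper /bin/cat \
  -reducer reducer.py \
  -file /scripts/reducer.py \
  -file /params/parameters.bin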
EDIT:
After further investigation, I have found the exact line of code where the reducer fails on EMR.
I am doing machine-learning predictions with the MXNet library in the reducer. The reducer fails when I load the model parameters. The API doc reference is here:
module.load_params('parameters.bin')
I have checked the current working directory of the EMR node [using os.listdir(os.getcwd())] and it contains the parameters.bin file (I have even printed the file contents successfully).
I want to point out again that the streaming job is working fine on my single-node local setup.
EDIT 2: I set the number of reducer tasks to 2. I enclosed my reducer code in a try/except block, and I see the following error in one of the tasks (the other one runs fine):
[10:27:25] src/ndarray/ndarray.cc:299: Check failed: from.shape() == to->shape() operands shape mismatchfrom.shape = (119,) to.shape=(111,)
Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0xc72fc) [0x7f81443842fc]
[bt] (1) /usr/local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0xc166f4) [0x7f8144ed36f4]
[bt] (2) /usr/local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0xc74c24) [0x7f8144f31c24]
[bt] (3) /usr/local/lib/python2.7/site-packages/mxnet/libmxnet.so(MXImperativeInvoke+0x2cd) [0x7f8144db935d]
[bt] (4) /usr/lib64/libffi.so.6(ffi_call_unix64+0x4c) [0x7f8150b8acec]
[bt] (5) /usr/lib64/libffi.so.6(ffi_call+0x1f5) [0x7f8150b8a615]
[bt] (6) /usr/lib64/python2.7/lib-dynload/_ctypes.so(_ctypes_callproc+0x30b) [0x7f8150d9d97b]
[bt] (7) /usr/lib64/python2.7/lib-dynload/_ctypes.so(+0xa915) [0x7f8150d97915]
[bt] (8) /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7f815a69e183]
[bt] (9) /usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x337d) [0x7f815a73107d]
I figured out the issue. The shapes expected by MXNet depended on the data set (specifically, on the maximum value in the data set). Training happens on a single GPU box that holds the whole data set. Prediction works on the single-node setup because it also sees all of the data used in training, but when a multi-node cluster is used, the data set gets split, so the maximum value differs on each node. This was causing the error.
I have now made the expected shapes independent of the data set, and the error no longer occurs. I hope this clarifies things.

FileNotFoundException while executing a Spark job

I am trying to execute a program on Spark. I have a cluster with a master and two slave nodes. I am receiving the following error during execution.
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 4.0 failed 4 times, most recent failure: Lost task 3.3 in stage 4.0 (TID 44, hadoopslave3): java.lang.RuntimeException: java.io.FileNotFoundException: File /home/ubuntu/hadoop/hadoop-te/dl4j/1485860107978_-4ccc8c8/0/data/dataset_4-4ccc8c8_68.bin does not exist
Driver stacktrace is as follows:
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
17/01/31 10:56:08 INFO scheduler.TaskSetManager: Lost task 1.3 in stage 4.0 (TID 45) on executor hadoopslave3: java.lang.RuntimeException (java.io.FileNotFoundException: File /home/ubuntu/hadoop/hadoop-te/dl4j/1485860107978_-4ccc8c8/0/data/dataset_2-4ccc8c8_77.bin does not exist) [duplicate 3]
However, I can see all the dataset objects (.bin files) created on the HDFS. Any suggestions?
Since you have a cluster with "two slave nodes" set up: do you also have a Hadoop filesystem set up? If not, then that's your issue.
The example you're linking to, when using a non-local cluster, transfers references to datasets using Hadoop. The behavior of this example has been made more predictable (with a big error message) in release 0.8.0.
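A quick way to check that a distributed filesystem is actually up and reachable (run these from the master and from each slave; both assume the standard Hadoop client tools are on the PATH):
hdfs dfsadmin -report   # should list the live datanodes
hadoop fs -ls /         # should list the root of HDFS, not the local filesystem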

Spark jobs failing because HDFS is caching jars

I upload Scala/Spark jars to HDFS to test them on our cluster. After running, I frequently realize there are changes that need to be made, so I make the changes locally and push the new jar back up to HDFS. However, often (not always) when I do this, Hadoop throws an error essentially saying that this jar is not the same as the old jar (duh).
I have tried clearing my Trash, .staging, and .sparkstaging directories, but that doesn't do anything. I have tried renaming the jar, which sometimes works and sometimes doesn't (it's still ridiculous that I have to do this in the first place).
Does anyone know why this is happening and how I can prevent it? Thanks for any help. Here are some logs, if that helps (I have edited out some paths):
Application application_1475165877428_124781 failed 2 times due to AM Container for appattempt_1475165877428_124781_000002 exited with exitCode: -1000
For more detailed output, check application tracking page: http://examplelogsite/ Then, click on links to logs of each attempt.
Diagnostics: Resource MYJARPATH/EXAMPLE.jar changed on src filesystem (expected 1475433291946, was 1475433292850
java.io.IOException: Resource MYJARPATH/EXAMPLE.jar changed on src filesystem (expected 1475433291946, was 1475433292850
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Failing this attempt. Failing the application.
I haven't seen that exit code before, so to me it doesn't say anything. I would suggest you check the logs, like this:
yarn logs -applicationId <your_application_ID>
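For example, with the application ID from the log above, something along these lines pulls the aggregated logs and narrows in on the relevant message:
yarn logs -applicationId application_1475165877428_124781 | grep -B 1 -A 5 'changed on src filesystem'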
According to your log, I'm sure it comes from the YARN side.
As a workaround, you can modify YARN yourself to skip this exception.
I ran into this thread because of the "changed on src filesystem" error; I hit the same issue and skipped it by modifying the YARN source code.
For more details, you can refer to how-to-fix-resource-changed-on-src-filesystem-issue

Failed to get broadcast_1_piece0 of broadcast_1 in Spark Streaming job

I am running Spark jobs on YARN in cluster mode. The job gets messages from a Kafka direct stream. I am using broadcast variables and checkpointing every 30 seconds. When I start the job the first time it runs fine without any issue. If I kill the job and restart it, it throws the exception below in the executor upon receiving a message from Kafka:
java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1178)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at net.juniper.spark.stream.LogDataStreamProcessor$2.call(LogDataStreamProcessor.java:177)
at net.juniper.spark.stream.LogDataStreamProcessor$2.call(LogDataStreamProcessor.java:1)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$fn$1$1.apply(JavaDStreamLike.scala:172)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$fn$1$1.apply(JavaDStreamLike.scala:172)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1298)
at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1298)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Does anyone have an idea how to resolve this error?
Spark version: 1.5.0
CDH 5.5.1
Whenever I have encountered issues where only the first run works, they have come down to problems with the checkpoint data. Moreover, the checkpoint is only used when there is something to check, which is the first message from Kafka.
I suggest you check whether the job is indeed dead; the process may still be running on the machine that executed it.
Try running a simple ps -fe and see if something is still running. If there are two processes trying to use the same checkpoint folder, it will always fail.
Hope this helps.
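A rough sketch of that check, using a class name taken from the stack trace above (adjust the pattern to your own driver class):
# look for a lingering driver or executor from the previous run
ps -ef | grep -i LogDataStreamProcessor
# on YARN, also check whether a duplicate application is still listed as running
yarn application -list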

Storm-YARN: Application container fails to launch

I am running a Storm (Trident) topology that reads Avro from Kafka and writes the records to HBase.
The topology runs as expected in LocalCluster mode, but when using StormSubmitter I'm facing the issues below.
In distributed Hadoop mode, I'm getting the error below [1] while launching the YARN application.
In Hadoop local mode (one box only), YARN spawns the Nimbus server and storm-ui, but there are no supervisors running to execute the spouts/bolts of the topology. I guess the reason might be insufficient memory (4 GB to run the topology plus HBase, HDFS, Kafka, ZooKeeper, etc.).
Can you help me understand the reason for this container failure? There is no error info present in the application logs.
[1] The YARN container fails to launch with the error below when running:
storm-yarn launch /homext/storm-yarn.yml --queue default -appname storm-yarn-demo --stormZip /tmp/storm-0.9.zip
Application application_1415038356032_0304 failed 2 times due to AM Container for appattempt_1415038356032_0304_000002 exited with exitCode: 127 due to: Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException:
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 127
.Failing this attempt.. Failing the application.
This log is insufficient to diagnose the problem; all it says is that the container failed to launch. You should look at the container output. Check ${yarn.nodemanager.log-dirs} on the nodes: there will be an application folder (application_1415038356032_0304), and inside it a container folder for each attempt (...1415038356032_0304_000002) containing the stderr, stdout, and syslog of that attempt. Read those and you'll likely identify the problem.
If those don't exist, look in ${yarn.nodemanager.local-dirs}; there you'll find the container launch script (usually named launch_container.sh) for this app/container attempt. It contains the actual command used to launch the container. Try running it from a shell prompt and see what you get.
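A hedged sketch of those two checks (the <...> placeholders stand for the values configured in yarn-site.xml; the application ID is the one from the error above):
# container logs for the failed application, on the node that ran it
ls <yarn.nodemanager.log-dirs>/application_1415038356032_0304/container_*/
cat <yarn.nodemanager.log-dirs>/application_1415038356032_0304/container_*/stderr
# the launch script lives under the local dirs
find <yarn.nodemanager.local-dirs> -name launch_container.sh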
If it fails at an early stage, then the logs can be found in HDFS under:
/tmp/logs/<user>/logs/
This should give enough information to diagnose the problem.
In my case I found a log file:
/tmp/logs/hdfs/logs/application_1426618997634_0004/vagrant-cdh-node4_8041
With some errors like:
/bin/bash: /usr/lib/jvm/java-7-oracle/bin/java: No such file or directory
And fixing the JAVA_HOME environment variable did the trick.
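A hedged example of that fix (the JDK path and the hadoop-env.sh location below are illustrative and vary by distribution):
# find where the JVM actually lives on the node
readlink -f "$(which java)"
# then point Hadoop/YARN at that JDK on every node, e.g. in hadoop-env.sh
echo 'export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64' >> /etc/hadoop/conf/hadoop-env.sh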
