Hadoop streaming job using Mxnet failing in AWS Emr

Hadoop streaming job using Mxnet failing in AWS Emr - hadoop

I have setup an emr step in AWS datapipeline. The step command looks like this:
/usr/lib/hadoop-mapreduce/hadoop-streaming.jar,-input,s3n://input-bucket/input-file,-output,s3://output/output-dir,-mapper,/bin/cat,-reducer,reducer.py,-file,/scripts/reducer.py,-file,/params/parameters.bin
I am getting the following error
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134)
at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:244)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:467)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:393)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134)
at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:244)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:467)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:393)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
I have tried running reducer step separately on my desktop
(on a single node hadoop setup) and its working. I have already included the #!/usr/bin/env python in the reducer script. I suspect that I am not writing the EMR step correctly.
EMR version: 5.5.0
EDIT:
After further investigation, I have found out the exact line of code where the reducer code is failing in emr.
I am doing Machine Learning predictions using mxnet library in the reducer. When I load the model parameters, the reducer fails. Reference to API doc is here
module.load_params('parameters.bin')
I have checked the current working directory of the EMR node [using os.listdir(os.getcwd())] and it contains the parameters.bin file (I have even printed the file contents successfully).
I want to point out again that the streaming job is working fine on my single-node local setup.
EDIT2: I set the number of reducer tasks to 2. I enclosed my reducer code in a try-except block and I see the following error in one of the tasks (the other one runs fine)
[10:27:25] src/ndarray/ndarray.cc:299: Check failed: from.shape() == to->shape() operands shape mismatchfrom.shape = (119,) to.shape=(111,)
Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0xc72fc) [0x7f81443842fc]
[bt] (1) /usr/local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0xc166f4) [0x7f8144ed36f4]
[bt] (2) /usr/local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0xc74c24) [0x7f8144f31c24]
[bt] (3) /usr/local/lib/python2.7/site-packages/mxnet/libmxnet.so(MXImperativeInvoke+0x2cd) [0x7f8144db935d]
[bt] (4) /usr/lib64/libffi.so.6(ffi_call_unix64+0x4c) [0x7f8150b8acec]
[bt] (5) /usr/lib64/libffi.so.6(ffi_call+0x1f5) [0x7f8150b8a615]
[bt] (6) /usr/lib64/python2.7/lib-dynload/_ctypes.so(_ctypes_callproc+0x30b) [0x7f8150d9d97b]
[bt] (7) /usr/lib64/python2.7/lib-dynload/_ctypes.so(+0xa915) [0x7f8150d97915]
[bt] (8) /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7f815a69e183]
[bt] (9) /usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x337d) [0x7f815a73107d]

I figured out the issue. Actually, the shapes expected by mxnet were dependent on the data-set (it actually depended on the maximum value in the data-set). Training happens on a single gpu box and has the whole dataset. The prediction however, works well with single node setup because it has all the data used in training. But when multi-node cluster is used, the data-set gets split which made the max-value different for each node. This was causing the error.
I have now made the expected shapes independent of the data-set and this error is not occurring anymore. I hope this clarifies things.

Related

FileNotFoundException while executing spark job

I am trying to execute a program on Spark. I have a cluster with a master and two slave nodes. I am receiving the following error during execution.
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 4.0 failed 4 times, most recent failure: Lost task 3.3 in stage 4.0 (TID 44, hadoopslave3): java.lang.RuntimeException: java.io.FileNotFoundException: File /home/ubuntu/hadoop/hadoop-te/dl4j/1485860107978_-4ccc8c8/0/data/dataset_4-4ccc8c8_68.bin does not exist
Driver stacktrace is as follows:
Driver stacktrace:
at og.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
17/01/31 10:56:08 INFO scheduler.TaskSetManager: Lost task 1.3 in stage 4.0 (TID 45) on executor hadoopslave3: java.lang.RuntimeException (java.io.FileNotFoundException: File /home/ubuntu/hadoop/hadoop-te/dl4j/1485860107978_-4ccc8c8/0/data/dataset_2-4ccc8c8_77.bin does not exist) [duplicate 3]
However, I can see all the dataset objects (.bin files) created on the HDFS. Any suggesstions ?

Since you have a cluster with "two slave nodes" set up: do you also have a Hadoop filesystem set up? If not, then that's your issue.
The example you're linking to, when using a non-local cluster, transfers references to datasets using Hadoop. The behavior of this example has been made more predictable (with a big error message) in release 0.8.0.

spark timesout maybe due to binaryFiles() with more than 1 million files in HDFS

I am reading millions of xml files via
val xmls = sc.binaryFiles(xmlDir)
The operation runs fine locally but on yarn it fails with:
client token: N/A
diagnostics: Application application_1433491939773_0012 failed 2 times due to ApplicationMaster for attempt appattempt_1433491939773_0012_000002 timed out. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1433750951883
final status: FAILED
tracking URL: http://controller01:8088/cluster/app/application_1433491939773_0012
user: ariskk
Exception in thread "main" org.apache.spark.SparkException: Application finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:622)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:647)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
On hadoops/userlogs logs I am frequently getting these messages:
15/06/08 09:15:38 WARN util.AkkaUtils: Error sending message [message = Heartbeat(1,[Lscala.Tuple2;#2b4f336b,BlockManagerId(1, controller01.stratified, 58510))] in 2 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427)
I run my spark job via spark-submit and it works for an other HDFS directory that contains only 37k files. Any ideas how to resolve this?

Ok after getting some help on sparks mailing list, I found out there were 2 issues:
the src directory, if it is given as /my_dir/ it makes spark fail and creates the heartbeat issues. Instead it should be given as hdfs:///my_dir/*
An out of memory error appears in the logs after fixing #1. This is the spark driver running on yarn running out of memory due to the number of files (apparently it keeps all file info in memory). So I spark-submit'ed the job with --conf spark.driver.memory=8g which fixed the issue.

Unable to initialize any output collector in CDH5.3

15/05/24 06:11:40 INFO mapreduce.Job: Task Id : attempt_1432456238397_0004_m_000000_0, Status : FAILED
Error: java.io.IOException: Unable to initialize any output collector
at org.apache.hadoop.mapred.MapTask.createSortingCollector(MapTask.java:412)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:439)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
I am using CDH 5.3 cloudera quickstart, I wrote MapReduce Program. When i run that on shell i getting above exception.
Can any one please help me on this, how to resolve

The error "Unable to initialize any output collector" indicates that the job failed to start the container's, there can be multiple reasons for the same. However, one must review the container logs at hdfs to identify the cause the error.
In this specific instance, the value of mapreduce.task.io.sort.mb value was entered greater than 2047 MB, however the maximum value which it allows is 2047 MB, thus anything above its causes the jobs to fail marking the value provided as Invalid.
Solution:
Set the value of mapreduce.task.io.sort.mb < 2048MB
Reference:
https://support.pivotal.io/hc/en-us/articles/205649987-Map-Reduce-job-failed-with-Unable-to-initialize-any-output-collector-
CDH5.2: MR, Unable to initialize any output collector
https://community.cloudera.com/t5/Storage-Random-Access-HDFS/HBase-MapReduce-Job-Error-java-io-IOException-Unable-to/td-p/23786

Apache Spark EC2 job not running. No space left on device

I have been running my program multiple times on a 20 node cluster. All of a sudden every time I run the program I get the following error:
15/04/19 16:52:35 WARN scheduler.TaskSetManager: Lost task 35.0 in stage 9.0 (TID 384, ip-XXX.XXX.compute.internal): java.io.FileNotFoundException: /mnt/spark/spark-local-XXX-ebd3/18/shuffle_2_35_64 (No space left on device)
java.io.FileOutputStream.open(Native Method)
java.io.FileOutputStream.<init>(FileOutputStream.java:221)
org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)
org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)
org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Checking the UI and it says there's absolutely nothing on the nodes. I have run the program maybe 15 times and only all of a sudden has this started. Why has this occurred out of the blue? And how do I fix it?

"No space left on device" is a quite clear exception: That node has no space left on the mount where the spark local files are being written: /mnt/spark/
Solution: go to the node (or nodes) and clean that up. rm -rf FTW.
If jobs are breaking before they terminate, due to manual intervention or failure, they will often leave temp data behind.

Cassandra hadoop word count example not working

I am trying to run the hadoop examples provided in the cassandra examples/hadoop_cql3_word_count. I am using the cassandra-2.0 branch. I have a single node cassandra running locally. I was able to run the ./bin/word_count_setup step successfully but when I run the ./bin/word_count step I am getting the following error :
java.lang.RuntimeException
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:661)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.<init>(CqlPagingRecordReader.java:297)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader.initialize(CqlPagingRecordReader.java:163)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:522)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: InvalidRequestException(why:consistency level LOCAL_ONE not compatible with replication strategy (org.apache.cassandra.locator.SimpleStrategy))
at org.apache.cassandra.thrift.Cassandra$execute_prepared_cql3_query_result$execute_prepared_cql3_query_resultStandardScheme.read(Cassandra.java:52627)
at org.apache.cassandra.thrift.Cassandra$execute_prepared_cql3_query_result$execute_prepared_cql3_query_resultStandardScheme.read(Cassandra.java:52604)
at org.apache.cassandra.thrift.Cassandra$execute_prepared_cql3_query_result.read(Cassandra.java:52519)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_execute_prepared_cql3_query(Cassandra.java:1785)
at org.apache.cassandra.thrift.Cassandra$Client.execute_prepared_cql3_query(Cassandra.java:1770)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:631)
... 6 more
----------
Has anyone seen this before ? Am I missing something ?

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Hadoop streaming job using Mxnet failing in AWS Emr - hadoop

Related

FileNotFoundException while executing spark job

spark timesout maybe due to binaryFiles() with more than 1 million files in HDFS

Unable to initialize any output collector in CDH5.3

Apache Spark EC2 job not running. No space left on device

Cassandra hadoop word count example not working

Categories

Resources