Spark rdd.first throws KryoException - IndexOutOfBoundsException

I'm trying to read a Hadoop file as follows:
SparkConf sparkConf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("test")
    .set("spark.ui.enabled", "false")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(new Class<?>[]{
        scala.Tuple2.class,
        org.apache.hadoop.hbase.client.Put.class,
        org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
        org.apache.hadoop.hbase.client.Mutation.class,
        java.util.Map.class,
        java.util.NavigableMap.class,
        java.util.List.class,
        java.util.TreeMap.class,
    })
    .set("spark.app.id", appID());
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
JavaPairRDD<ImmutableBytesWritable, Put> putRdd = jsc.newAPIHadoopRDD(hadoopConf, SequenceFileInputFormat.class, ImmutableBytesWritable.class, Put.class);
Tuple2<ImmutableBytesWritable, Put> tuple1 = putRdd.first();
Then I get the following exception, even though I explicitly registered the Kryo classes:
2017-04-06 17:31:35,287 ERROR [task-result-getter-3] scheduler.TaskResultGetter: Exception while getting task result
com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index: 48, Size: 8
Serialization trace:
familyMap (org.apache.hadoop.hbase.client.Put)
at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:275)
at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:97)
at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:60)
at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:50)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IndexOutOfBoundsException: Index: 48, Size: 8
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:42)
at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:773)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:727)
at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:134)
at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
... 18 more
Any idea how I could correctly register the classes and avoid this serialisation problem?
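One hedged workaround, sketched here in Scala (the Java API equivalent goes through mapToPair): the ImmutableBytesWritable and Put instances handed out by newAPIHadoopRDD are reused by Hadoop between records, which is a classic way to confuse Kryo's reference tracking when records are shipped back to the driver. Deep-copying each record, or better, extracting only the plain values you need before collecting, sidesteps the problem. Put's copy constructor and ImmutableBytesWritable.copyBytes() are standard HBase client API; treat the rest as an untested sketch.
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable

// Deep-copy the reused instances before anything leaves the executor
val copied = putRdd.rdd.map { case (k, v) =>
  (new ImmutableBytesWritable(k.copyBytes()), new Put(v)) // Put(Put) copy constructor
}
val tuple1 = copied.first()

// Or avoid round-tripping HBase objects through Kryo entirely,
// e.g. when only the first row key is needed:
val firstRowKey: Array[Byte] = putRdd.rdd.map { case (k, _) => k.copyBytes() }.first()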

Related

[SPARK]: java.lang.IllegalArgumentException: java.net.UnknownHostException: plumber

I built a Spark Streaming application that reads data from a socket, computes word counts, and writes the result to HDFS. The application runs in Hadoop cluster A, but the HDFS is in Hadoop cluster B. Below is my code:
if (args.length < 2) {
  System.out.println("Usage: StreamingWriteHdfs hostname port")
  System.exit(-1)
}
val conf = new SparkConf()
conf.setAppName("StreamingWriteHdfs")
val ssc = new StreamingContext(conf, Durations.seconds(10))
ssc.checkpoint("/tmp")
val hostname: String = args(0)
val port :Int = Integer.parseInt(args(1))
val lines = ssc.socketTextStream(hostname, port)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
//TODO write to hdfs
wordCounts.saveAsHadoopFiles("hdfs://plumber/tmp/test/streaming",
  "out",
  classOf[Text],
  classOf[IntWritable],
  classOf[TextOutputFormat[Text, IntWritable]])
ssc.start()
ssc.awaitTermination()
When I run this application in cluster A, I get this exception:
java.lang.IllegalArgumentException: java.net.UnknownHostException:plumber
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:251)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: plumber
Cluster B's fs.defaultFS is hdfs://plumber.
Can someone help me out? Thanks.
I think you need to include the NameNode host and port in the path, e.g.
"hdfs://plumber:8020/tmp/test/streaming".

Inline Script error

I am trying to run the Update API using inline scripting. My code is:
client.prepareUpdate("result", typeName, "1")
    .setScript(new Script("ctx._source.gender = doc['" + AggregateValue_First + "'].value * doc['" + AggregateValue_Second + "'].value", ScriptType.INLINE, null, null))
    .get();
When I execute it, I get:
java.lang.IllegalArgumentException: failed to execute script
My log looks like:
Caused by: ScriptException[failed to run inline script [ctx._source.gender = doc['AVG_PRICE_PER_UNIT'].value*doc['NUMBER_OF_UNITS'].value] using lang [groovy]]; nested: NotSerializableExceptionWrapper[missing_property_exception: No such property: doc for class: af9b76c11012333a0eeba6af6df35125322f36b8];
at org.elasticsearch.script.groovy.GroovyScriptEngineService$GroovyScript.run(GroovyScriptEngineService.java:320)
at org.elasticsearch.action.update.UpdateHelper.executeScript(UpdateHelper.java:252)
... 14 more
Caused by: NotSerializableExceptionWrapper[missing_property_exception: No such property: doc for class: af9b76c11012333a0eeba6af6df35125322f36b8]
at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.unwrap(ScriptBytecodeAdapter.java:53)
at org.codehaus.groovy.vmplugin.v7.IndyGuardsFiltersAndSignatures.unwrap(IndyGuardsFiltersAndSignatures.java:177)
at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:228)
at af9b76c11012333a0eeba6af6df35125322f36b8.run(af9b76c11012333a0eeba6af6df35125322f36b8:1)
at org.elasticsearch.script.groovy.GroovyScriptEngineService$GroovyScript$1.run(GroovyScriptEngineService.java:313)
at java.security.AccessController.doPrivileged(Native Method)
at org.elasticsearch.script.groovy.GroovyScriptEngineService$GroovyScript.run(GroovyScriptEngineService.java:310)
... 15 more
Can someone help me with this?
I tried:
client.prepareUpdate("result", typeName, "1")
    .setScript(new Script("ctx._source.gender.value = ctx._source['" + AggregateValue_First + "'].value * ctx._source['" + AggregateValue_Second + "'].value", ScriptType.INLINE, null, null))
    .get();
Now the error is something like:
log4j:WARN No appenders could be found for logger (org.elasticsearch.node).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
java.lang.IllegalArgumentException: failed to execute script
at org.elasticsearch.action.update.UpdateHelper.executeScript(UpdateHelper.java:257)
at org.elasticsearch.action.update.UpdateHelper.prepare(UpdateHelper.java:197)
at org.elasticsearch.action.update.UpdateHelper.prepare(UpdateHelper.java:80)
at org.elasticsearch.action.update.TransportUpdateAction.shardOperation(TransportUpdateAction.java:174)
at org.elasticsearch.action.update.TransportUpdateAction.shardOperation(TransportUpdateAction.java:168)
at org.elasticsearch.action.update.TransportUpdateAction.shardOperation(TransportUpdateAction.java:66)
at org.elasticsearch.action.support.single.instance.TransportInstanceSingleOperationAction$ShardTransportHandler.messageReceived(TransportInstanceSingleOperationAction.java:244)
at org.elasticsearch.action.support.single.instance.TransportInstanceSingleOperationAction$ShardTransportHandler.messageReceived(TransportInstanceSingleOperationAction.java:240)
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: ScriptException[failed to run inline script [ctx._source.gender.value = ctx._source['AVG_PRICE_PER_UNIT'].value * ctx._source['5'].value] using lang [groovy]]; nested: AssertionError[BUG! UNCAUGHT EXCEPTION: member is private: java.lang.Integer.value/int/getField, from org.codehaus.groovy.vmplugin.v7.IndyInterface]; nested: NotSerializableExceptionWrapper[illegal_access_exception: member is private: java.lang.Integer.value/int/getField, from org.codehaus.groovy.vmplugin.v7.IndyInterface];
at org.elasticsearch.script.groovy.GroovyScriptEngineService$GroovyScript.run(GroovyScriptEngineService.java:320)
at org.elasticsearch.action.update.UpdateHelper.executeScript(UpdateHelper.java:252)
... 14 more
Caused by: java.lang.AssertionError: BUG! UNCAUGHT EXCEPTION: member is private: java.lang.Integer.value/int/getField, from org.codehaus.groovy.vmplugin.v7.IndyInterface
at org.codehaus.groovy.vmplugin.v7.Selector$PropertySelector.chooseMeta(Selector.java:311)
at org.codehaus.groovy.vmplugin.v7.Selector$MethodSelector.setCallSiteTarget(Selector.java:954)
at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:224)
at a45669ea4b74cc2cb4371072fd14ab69cb5dd5f6.run(a45669ea4b74cc2cb4371072fd14ab69cb5dd5f6:1)
at org.elasticsearch.script.groovy.GroovyScriptEngineService$GroovyScript$1.run(GroovyScriptEngineService.java:313)
at java.security.AccessController.doPrivileged(Native Method)
at org.elasticsearch.script.groovy.GroovyScriptEngineService$GroovyScript.run(GroovyScriptEngineService.java:310)
... 15 more
Caused by: NotSerializableExceptionWrapper[illegal_access_exception: member is private: java.lang.Integer.value/int/getField, from org.codehaus.groovy.vmplugin.v7.IndyInterface]
at java.lang.invoke.MemberName.makeAccessException(Unknown Source)
at java.lang.invoke.MethodHandles$Lookup.checkAccess(Unknown Source)
at java.lang.invoke.MethodHandles$Lookup.checkField(Unknown Source)
at java.lang.invoke.MethodHandles$Lookup.getDirectFieldCommon(Unknown Source)
at java.lang.invoke.MethodHandles$Lookup.getDirectFieldNoSecurityManager(Unknown Source)
at java.lang.invoke.MethodHandles$Lookup.unreflectField(Unknown Source)
at java.lang.invoke.MethodHandles$Lookup.unreflectGetter(Unknown Source)
at org.codehaus.groovy.vmplugin.v7.Selector$PropertySelector.chooseMeta(Selector.java:302)
... 21 more
Try this (i.e. use ctx._source instead of doc — doc[...] only exists in search and aggregation scripts, while update scripts operate on the document through ctx._source; dropping the .value accessors also stops Groovy from trying to read java.lang.Integer's private value field, which caused the second error):
client.prepareUpdate("result", typeName, "1")
    .setScript(new Script("ctx._source.gender = ctx._source['" + AggregateValue_First + "'] * ctx._source['" + AggregateValue_Second + "']", ScriptType.INLINE, null, null))
    .get();

How to write an avro file with Spark?

I have an Array[Byte] containing Avro-encoded data. I'm trying to write it to HDFS as an Avro file with Spark. This is the code:
val values = messages.map(row => (null, AvroUtils.decode(row._2, topic)))
  .saveAsHadoopFile(
    outputPath,
    classOf[org.apache.hadoop.io.NullWritable],
    classOf[CrashPacket],
    classOf[AvroOutputFormat[SpecificRecordBase]]
  )
row._2 is an Array[Byte].
I'm getting this error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 1.0 failed 4 times, most recent failure: Lost task 4.3 in stage 1.0 (TID 98, bdac1nodec06.servizi.gr-u.it): java.lang.NullPointerException
at java.io.StringReader.<init>(StringReader.java:50)
at org.apache.avro.Schema$Parser.parse(Schema.java:958)
at org.apache.avro.Schema.parse(Schema.java:1010)
at org.apache.avro.mapred.AvroJob.getOutputSchema(AvroJob.java:143)
at org.apache.avro.mapred.AvroOutputFormat.getRecordWriter(AvroOutputFormat.java:153)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The NullPointerException happens because no output schema was ever set: AvroJob.getOutputSchema reads it from the job configuration and hands Avro's Schema.Parser a null string. Consider an Avro-generated class StringPair with a constructor StringPair(String a, String b). Then code that writes records to Avro files could look like this:
import com.test.StringPair
import org.apache.avro.Schema
import org.apache.avro.mapred.{AvroKey, AvroValue}
import org.apache.avro.mapreduce.{AvroJob, AvroKeyValueOutputFormat}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

object TestWriteAvro {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf()
    val sc = new SparkContext(sparkConf)
    val job = Job.getInstance(sc.hadoopConfiguration)

    // Register the key and value schemas with the job before writing
    AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.STRING))
    AvroJob.setOutputValueSchema(job, StringPair.getClassSchema)

    val myRdd = sc
      .parallelize(List("1,2", "3,4"))
      .map(x => (x.split(",")(0), x.split(",")(1)))
      .map { case (x, y) => (new AvroKey[String](x), new AvroValue[StringPair](new StringPair(x, y))) }

    myRdd.saveAsNewAPIHadoopFile(args(0), classOf[AvroKey[_]], classOf[AvroValue[_]], classOf[AvroKeyValueOutputFormat[_, _]], job.getConfiguration)
    sc.stop()
  }
}
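Alternatively, staying with the old mapred API used in the question's snippet: the NPE should disappear once the output schema is registered in a JobConf and that JobConf is passed to saveAsHadoopFile. A hedged sketch (CrashPacket and AvroUtils.decode are the question's own classes; note that the mapred AvroOutputFormat expects AvroWrapper keys and NullWritable values, not the other way around):
import org.apache.avro.mapred.{AvroJob, AvroOutputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.JobConf

val jobConf = new JobConf(sc.hadoopConfiguration)
AvroJob.setOutputSchema(jobConf, CrashPacket.getClassSchema) // the schema the NPE was missing
messages
  .map(row => (new AvroWrapper[CrashPacket](AvroUtils.decode(row._2, topic)), NullWritable.get()))
  .saveAsHadoopFile(outputPath,
    classOf[AvroWrapper[CrashPacket]],
    classOf[NullWritable],
    classOf[AvroOutputFormat[CrashPacket]],
    jobConf)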

Saving RDD using a Proprietary OutputFormatter

I am using a proprietary database that provides its own OutputFormat. Using this OutputFormat I can write a MapReduce job and save data from MR into the database.
Now I am trying to use that OutputFormat inside Spark to save an RDD to the database.
The code I have written is:
object VerticaSpark extends App {
  val scConf = new SparkConf
  val sc = new SparkContext(scConf)
  val conf = new Configuration()
  val job = new Job(conf)
  job.setInputFormatClass(classOf[VerticaInputFormat])
  job.setOutputKeyClass(classOf[Text])
  job.setOutputValueClass(classOf[VerticaRecord])
  job.setOutputFormatClass(classOf[VerticaOutputFormat])
  VerticaInputFormat.setInput(job, "select * from Foo where key = ?", "1", "2", "3", "4")
  VerticaOutputFormat.setOutput(job, "Bar", true, "name varchar", "total int")
  val rddVR: RDD[VerticaRecord] = sc.newAPIHadoopRDD(job.getConfiguration, classOf[VerticaInputFormat], classOf[LongWritable], classOf[VerticaRecord]).map(_._2)
  val rddTup = rddVR.map(x => (x.get(1).toString(), x.get(2).toString().toInt))
  val rddGroup = rddTup.reduceByKey(_ + _)
  val rddVROutput = rddGroup.map {
    case (x, y) => (new Text("Bar"), getVerticaRecord(x, y, job.getConfiguration))
  }
  //rddVROutput.saveAsNewAPIHadoopFile("Bar", classOf[Text], classOf[VerticaRecord], classOf[VerticaOutputFormat], job.getConfiguration)
  rddVROutput.saveAsNewAPIHadoopDataset(job.getConfiguration)

  def getVerticaRecord(name: String, value: Int, conf: Configuration): VerticaRecord = {
    val retVal = new VerticaRecord(conf)
    //println(s"going to build Vertica Record with ${name} and ${value}")
    retVal.set(0, new Text(name))
    retVal.set(1, new IntWritable(value))
    retVal
  }
}
The entire solution can be downloaded from here:
https://github.com/abhitechdojo/VerticaSpark.git
My code works perfectly until the saveAsNewAPIHadoopFile function is reached. At this line it throws a NullPointerException.
The same logic and the same input and output formats work perfectly in a MapReduce program, and I can write to the DB successfully using the MR program:
https://my.vertica.com/docs/7.2.x/HTML/index.htm#Authoring/HadoopIntegrationGuide/HadoopConnector/ExampleHadoopConnectorApplication.htm%3FTocPath%3DIntegrating%2520with%2520Hadoop%7CUsing%2520the%2520%2520MapReduce%2520Connector%7C_____7
The stack trace of the error is:
16/01/15 16:42:53 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 5, machine): java.lang.NullPointerException
at com.abhi.VerticaSpark$$anonfun$4.apply(VerticaSpark.scala:39)
at com.abhi.VerticaSpark$$anonfun$4.apply(VerticaSpark.scala:38)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:999)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 12, machine): java.lang.NullPointerException
at com.abhi.VerticaSpark$$anonfun$4.apply(VerticaSpark.scala:39)
at com.abhi.VerticaSpark$$anonfun$4.apply(VerticaSpark.scala:38)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:999)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
16/01/15 16:42:54 INFO TaskSetManager: Lost task 3.1 in stage 1.0 (TID 11) on executor machine: java.lang.NullPointerException (null) [duplicate 7]
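No answer is recorded here, but one hedged diagnosis fits the trace: the NPE is raised inside the map closure (VerticaSpark.scala:39), where job.getConfiguration is evaluated on the executors. Because the object extends App, its fields are initialized in delayedInit, which never runs when the singleton is re-instantiated on an executor, so job can be null there. A minimal sketch of the usual fix, moving the logic into an explicit main and shipping the configuration via a broadcast wrapper (SerializableWritable is Spark's wrapper for non-serializable Hadoop types; everything else stays as in the question):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SerializableWritable, SparkConf, SparkContext}

object VerticaSpark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf)
    val job = Job.getInstance(new Configuration())
    // ... identical VerticaInputFormat / VerticaOutputFormat setup as above ...

    // Configuration is not java.io.Serializable; ship it wrapped and read it
    // back inside the closure instead of touching `job` there.
    val confWrapper = sc.broadcast(new SerializableWritable(job.getConfiguration))

    // rddGroup is built exactly as above, then:
    // val rddVROutput = rddGroup.map { case (name, total) =>
    //   (new Text("Bar"), getVerticaRecord(name, total, confWrapper.value.value))
    // }
    // rddVROutput.saveAsNewAPIHadoopDataset(job.getConfiguration)
    sc.stop()
  }
}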

Unable to run distributed shell on YARN

I am trying to run the distributed shell example on a YARN cluster.
@Test
public void realClusterTest() throws Exception {
    System.setProperty("HADOOP_USER_NAME", "hdfs");
    String[] args = {
        "--jar", APPMASTER_JAR,
        "--num_containers", "1",
        "--shell_command", "ls",
        "--master_memory", "512",
        "--container_memory", "128"
    };
    LOG.info("Initializing DS Client");
    Client client = new Client(new Configuration());
    boolean initSuccess = client.init(args);
    Assert.assertTrue(initSuccess);
    LOG.info("Running DS Client");
    boolean result = client.run();
    LOG.info("Client run completed. Result=" + result);
    Assert.assertTrue(result);
}
But it fails with:
2013-09-17 11:45:28,338 INFO [main] distributedshell.Client (Client.java:monitorApplication(600)) - Got application report from ASM for, appId=11, clientToAMToken=null, appDiagnostics=Application application_1379338026167_0011 failed 2 times due to AM Container for appattempt_1379338026167_0011_000002 exited with exitCode: 1 due to: Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:458)
at org.apache.hadoop.util.Shell.run(Shell.java:373)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:578)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
................
.Failing this attempt.. Failing the application., appMasterHost=N/A, appQueue=default, appMasterRpcPort=0, appStartTime=1379407525237, yarnAppState=FAILED, distributedFinalState=FAILED, appTrackingUrl=ip-10-232-149-222.us-west-2.compute.internal:8088/proxy/application_1379338026167_0011/, appUser=hdfs
Here is what I see in server logs:
2013-09-17 08:45:26,870 WARN nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(213)) - Exception from container-launch with container ID: container_1379338026167_0011_02_000001 and exit code: 1
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:458)
at org.apache.hadoop.util.Shell.run(Shell.java:373)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:578)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:258)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:74)
The question is: how can I get more details to identify what is going wrong?
PS: we are using HDP 2.0.5
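A hedged debugging tip: exit code 1 from container-launch usually has its real cause in the container's own stdout/stderr. With log aggregation enabled, those can be pulled with the YARN CLI, and raising yarn.nodemanager.delete.debug-delay-sec in yarn-site.xml keeps the container's launch directory (including the generated launch_container.sh) on disk long enough to inspect:
yarn logs -applicationId application_1379338026167_0011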
