How to write an Avro file with Spark?

I have an Array[Byte] that represents an Avro schema, and I'm trying to write it to HDFS as an Avro file with Spark. This is the code:
val values = messages.map(row => (null,AvroUtils.decode(row._2,topic)))
.saveAsHadoopFile(
outputPath,
classOf[org.apache.hadoop.io.NullWritable],
classOf[CrashPacket],
classOf[AvroOutputFormat[SpecificRecordBase]]
)
Here, row._2 is an Array[Byte].
I'm getting this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 1.0 failed 4 times, most recent failure: Lost task 4.3 in stage 1.0 (TID 98, bdac1nodec06.servizi.gr-u.it): java.lang.NullPointerException
at java.io.StringReader.<init>(StringReader.java:50)
at org.apache.avro.Schema$Parser.parse(Schema.java:958)
at org.apache.avro.Schema.parse(Schema.java:1010)
at org.apache.avro.mapred.AvroJob.getOutputSchema(AvroJob.java:143)
at org.apache.avro.mapred.AvroOutputFormat.getRecordWriter(AvroOutputFormat.java:153)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Suppose there is an Avro class StringPair with the constructor StringPair(String a, String b). The code that writes records to Avro files could then look like this:
import com.test.{StringPair}
import org.apache.avro.Schema
import org.apache.avro.mapred.{AvroValue, AvroKey}
import org.apache.avro.mapreduce.{AvroKeyValueOutputFormat, AvroJob}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.mapreduce.Job
object TestWriteAvro {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    val sc = new SparkContext(sparkConf)
    // The Job keeps its own copy of the configuration; the schemas below are stored in that copy.
    val job = Job.getInstance(sc.hadoopConfiguration)
    AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.STRING))
    AvroJob.setOutputValueSchema(job, StringPair.getClassSchema)
    // Split each "a,b" line and wrap the two parts in Avro key/value holders.
    val myRdd = sc
      .parallelize(List("1,2", "3,4"))
      .map(x => (x.split(",")(0), x.split(",")(1)))
      .map { case (x, y) => (new AvroKey[String](x), new AvroValue[StringPair](new StringPair(x, y))) }
    // Pass job.getConfiguration so the output format can find the schemas set above.
    myRdd.saveAsNewAPIHadoopFile(args(0), classOf[AvroKey[_]], classOf[AvroValue[_]], classOf[AvroKeyValueOutputFormat[_, _]], job.getConfiguration)
    sc.stop()
  }
}
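The stack trace in the question ends in StringReader.<init>, called from AvroJob.getOutputSchema. That is what one would expect if no output schema was ever stored in the job configuration, so the schema text handed to the parser is null. Below is a minimal, JDK-only sketch of that failure mode. It makes no Avro or Hadoop calls: a plain java.util.Properties object stands in for the Hadoop configuration, and both the key name and the MissingSchemaDemo object are assumptions made for illustration.

```scala
import java.io.StringReader
import java.util.Properties

object MissingSchemaDemo {
  // Assumed key name, used here only for illustration.
  val OutputSchemaKey = "avro.output.schema"

  // Mimics "look up the schema text in the config, then hand it to a parser".
  // Returns Right(length of the schema text) or Left(exception class name).
  def readSchema(conf: Properties): Either[String, Int] = {
    val json: String = conf.getProperty(OutputSchemaKey) // null when the key was never set
    try {
      val reader = new StringReader(json) // throws NullPointerException for a null string
      reader.close()
      Right(json.length)
    } catch {
      case e: NullPointerException => Left(e.getClass.getName)
    }
  }

  def main(args: Array[String]): Unit = {
    // No schema configured: the lookup yields null and the reader fails.
    println(readSchema(new Properties()))

    // Schema configured: the lookup succeeds.
    val configured = new Properties()
    configured.setProperty(OutputSchemaKey, "\"string\"")
    println(readSchema(configured))
  }
}
```

In the answer above, the AvroJob.setOutputKeySchema and AvroJob.setOutputValueSchema calls are what store the schemas, and passing job.getConfiguration to saveAsNewAPIHadoopFile is what makes them visible to the output format.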

Related

Flink is not adding any data to Elasticsearch but no errors

Folks, I'm new to data streaming, but I was able to build and submit a Flink job that is supposed to read CSV data from Kafka, aggregate it, and then put it in Elasticsearch.
The first two parts work, and I can print my aggregation to STDOUT. But when I added the code to write to Elasticsearch, nothing seems to happen there (no data is added). The Flink job manager log looks fine (no errors) and says:
2020-03-03 16:18:03,877 INFO org.apache.flink.streaming.connectors.elasticsearch7.Elasticsearch7ApiCallBridge - Created Elasticsearch RestHighLevelClient connected to [http://elasticsearch-elasticsearch-coordinating-only.default.svc.cluster.local:9200]
Here is my code at this point:
/*
* This Scala source file was generated by the Gradle 'init' task.
*/
package flinkNamePull
import java.time.LocalDateTime
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer010, FlinkKafkaProducer010}
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.{DataTypes, Table}
import org.apache.flink.table.api.scala.StreamTableEnvironment
import org.apache.flink.table.descriptors.{Elasticsearch, Json, Schema}
object Demo {
/**
* MapFunction to generate Transfers POJOs from parsed CSV data.
*/
class TransfersMapper extends RichMapFunction[String, Transfers] {
private var formatter = null
@throws[Exception]
override def open(parameters: Configuration): Unit = {
super.open(parameters)
//formatter = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss")
}
@throws[Exception]
override def map(csvLine: String): Transfers = {
//var splitCsv = csvLine.stripLineEnd.split("\n")(1).split(",")
var splitCsv = csvLine.stripLineEnd.split(",")
val arrLength = splitCsv.length
val i = 0
if (arrLength != 13) {
for (i <- arrLength + 1 to 13) {
if (i == 13) {
splitCsv = splitCsv :+ "0.0"
} else {
splitCsv = splitCsv :+ ""
}
}
}
var trans = new Transfers()
trans.rowId = splitCsv(0)
trans.subjectId = splitCsv(1)
trans.hadmId = splitCsv(2)
trans.icuStayId = splitCsv(3)
trans.dbSource = splitCsv(4)
trans.eventType = splitCsv(5)
trans.prev_careUnit = splitCsv(6)
trans.curr_careUnit = splitCsv(7)
trans.prev_wardId = splitCsv(8)
trans.curr_wardId = splitCsv(9)
trans.inTime = splitCsv(10)
trans.outTime = splitCsv(11)
trans.los = splitCsv(12).toDouble
return trans
}
}
def main(args: Array[String]) {
// Create streaming execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
// Set properties per KafkaConsumer API
val properties = new Properties()
properties.setProperty("bootstrap.servers", "kafka.kafka:9092")
properties.setProperty("group.id", "test")
// Add Kafka source to environment
val myKConsumer = new FlinkKafkaConsumer010[String]("raw.data3", new SimpleStringSchema(), properties)
// Read from beginning of topic
myKConsumer.setStartFromEarliest()
val streamSource = env
.addSource(myKConsumer)
// Transform CSV (with a header row per Kafka event) into a Transfers object
val streamTransfers = streamSource.map(new TransfersMapper())
// create a TableEnvironment
val tEnv = StreamTableEnvironment.create(env)
println("***** NEW EXECUTION STARTED AT " + LocalDateTime.now() + " *****")
// register a Table
val tblTransfers: Table = tEnv.fromDataStream(streamTransfers)
tEnv.createTemporaryView("transfers", tblTransfers)
tEnv.connect(
new Elasticsearch()
.version("7")
.host("elasticsearch-elasticsearch-coordinating-only.default.svc.cluster.local", 9200, "http") // required: one or more Elasticsearch hosts to connect to
.index("transfers-sum")
.documentType("_doc")
.keyNullLiteral("n/a")
)
.withFormat(new Json().jsonSchema("{type: 'object', properties: {curr_careUnit: {type: 'string'}, sum: {type: 'number'}}}"))
.withSchema(new Schema()
.field("curr_careUnit", DataTypes.STRING())
.field("sum", DataTypes.DOUBLE())
)
.inUpsertMode()
.createTemporaryTable("transfersSum")
val result = tEnv.sqlQuery(
"""
|SELECT curr_careUnit, sum(los)
|FROM transfers
|GROUP BY curr_careUnit
|""".stripMargin)
result.insertInto("transfersSum")
// Elasticsearch elasticsearch-elasticsearch-coordinating-only.default.svc.cluster.local:9200
env.execute("Flink Streaming Demo Dump to Elasticsearch")
}
}
I'm not sure how I can debug this beast... Wondering if somebody can help me figure out why the Flink job is not adding data to Elasticsearch :(
From my Flink cluster, I'm able to query Elasticsearch just fine (manually) and add records to my index:
curl -XPOST "http://elasticsearch-elasticsearch-coordinating-only.default.svc.cluster.local:9200/transfers-sum/_doc" -H 'Content-Type: application/json' -d'{"curr_careUnit":"TEST123","sum":"123"}'
A kind soul on the Flink mailing list pointed out that Elasticsearch could be buffering my records... Well, it was. ;)
I have added the following options to the Elasticsearch connector:
.bulkFlushMaxActions(2)
.bulkFlushInterval(1000L)
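To see why a handful of aggregate rows can sit unsent, here is a small self-contained sketch of bulk buffering. BufferingSink is a hypothetical stand-in written for this illustration, not the Flink or Elasticsearch API, and the threshold of 1000 is just an illustrative value: documents are queued and only sent once the action threshold is reached or a flush is triggered.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a bulk-buffering sink; not the Flink or Elasticsearch API.
final class BufferingSink(maxActions: Int) {
  private val pending = ArrayBuffer.empty[String]
  private val delivered = ArrayBuffer.empty[String]

  // Queue one document; send the whole batch only once the threshold is reached.
  def add(doc: String): Unit = {
    pending += doc
    if (pending.size >= maxActions) flush()
  }

  // Send everything that is currently queued (what a flush interval would trigger).
  def flush(): Unit = {
    delivered ++= pending
    pending.clear()
  }

  def deliveredCount: Int = delivered.size
  def pendingCount: Int = pending.size
}

object BufferingSinkDemo {
  def main(args: Array[String]): Unit = {
    // With a high threshold, three documents stay queued and nothing is delivered.
    val lazySink = new BufferingSink(maxActions = 1000)
    Seq("a", "b", "c").foreach(lazySink.add)
    println(s"threshold 1000: delivered=${lazySink.deliveredCount}, pending=${lazySink.pendingCount}")

    // With a threshold of 2, the first two documents are delivered right away.
    val eagerSink = new BufferingSink(maxActions = 2)
    Seq("a", "b", "c").foreach(eagerSink.add)
    println(s"threshold 2: delivered=${eagerSink.deliveredCount}, pending=${eagerSink.pendingCount}")
  }
}
```

In this model, lowering the action threshold or flushing on a timer is what gets a small, slowly changing result set out of the buffer, which matches the effect of the two connector options above.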
For a working, detailed answer, see the one I provided under the question "Flink Elasticsearch Connector 7 using Scala".

java.lang.NullPointerException: writeSupportClass should not be null while writing parquet file in a spark streaming job

In a Spark Streaming job, I am saving my RDD data to a Parquet file in HDFS using the code snippet below:
readyToSave.foreachRDD((VoidFunction<JavaPairRDD<Void, MyProtoRecord>>) rdd -> {
Configuration configuration = rdd.context().hadoopConfiguration();
Job job = Job.getInstance(configuration);
ParquetOutputFormat.setWriteSupportClass(job, ProtoWriteSupport.class);
ProtoParquetOutputFormat.setProtobufClass(job, MyProtoRecord.class);
rdd.saveAsNewAPIHadoopFile("path-to-hdfs", Void.class, MyProtoRecord.class, ParquetOutputFormat.class, configuration);
});
and I get the exception below:
java.lang.NullPointerException: writeSupportClass should not be null
at parquet.Preconditions.checkNotNull(Preconditions.java:38)
at parquet.hadoop.ParquetOutputFormat.getWriteSupport(ParquetOutputFormat.java:326)
at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:272)
at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1112)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1095)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
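How can I solve the problem?
Found the problem!
When calling saveAsNewAPIHadoopFile(), you should pass your job's configuration (job.getConfiguration()). Job.getInstance(configuration) works on its own copy of the configuration, so the write-support class set through the job never reaches the original configuration object: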
readyToSave.foreachRDD((VoidFunction<JavaPairRDD<Void, MyProtoRecord>>) rdd -> {
Configuration configuration = rdd.context().hadoopConfiguration();
Job job = Job.getInstance(configuration);
ParquetOutputFormat.setWriteSupportClass(job, ProtoWriteSupport.class);
ProtoParquetOutputFormat.setProtobufClass(job, MyProtoRecord.class);
rdd.saveAsNewAPIHadoopFile("path-to-hdfs", Void.class, MyProtoRecord.class, ParquetOutputFormat.class, job.getConfiguration());
});
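The fix works because the job object holds a copy of the configuration it was created from. The sketch below models only that copy-on-create behaviour with plain Scala collections. CopyingJob is a hypothetical stand-in, not the Hadoop Job class, and the key name is an assumption used for illustration.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a job that copies its configuration on creation.
final class CopyingJob(original: mutable.Map[String, String]) {
  // Private copy: later changes go here, not into `original`.
  private val conf = mutable.HashMap.empty[String, String] ++= original

  def set(key: String, value: String): Unit = conf(key) = value
  def getConfiguration: mutable.Map[String, String] = conf
}

object CopyingJobDemo {
  // Assumed key name, used here only for illustration.
  val WriteSupportKey = "parquet.write.support.class"

  def main(args: Array[String]): Unit = {
    val original = mutable.HashMap.empty[String, String]
    val job = new CopyingJob(original)
    job.set(WriteSupportKey, "ProtoWriteSupport")

    // The original map never sees the setting; only the job's copy does.
    println(original.get(WriteSupportKey))
    println(job.getConfiguration.get(WriteSupportKey))
  }
}
```

Under this model, handing the original configuration to the save call corresponds to the failing version of the code, and handing over the job's own configuration corresponds to the working one.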

IllegalStateException when trying to run spark streaming with twitter

I am new to Spark and Scala. I am trying to run an example I found through Google, and I am encountering the following exception when running the program.
Exception is:
17/05/25 11:13:42 ERROR ReceiverTracker: Deregistered receiver for stream 0: Restarting receiver with delay 2000ms: Error starting Twitter stream - java.lang.IllegalStateException: Authentication credentials are missing.
Code that I am executing is as follows:
PrintTweets.scala
package example
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.StreamingContext._
import org.apache.log4j.Level
import Utilities._
object PrintTweets {
def main(args: Array[String]) {
// Configure Twitter credentials using twitter.txt
setupTwitter()
val appName = "TwitterData"
val conf = new SparkConf()
conf.setAppName(appName).setMaster("local[3]")
val ssc = new StreamingContext(conf, Seconds(5))
//val ssc = new StreamingContext("local[*]", "PrintTweets", Seconds(10))
setupLogging()
// Create a DStream from Twitter using our streaming context
val tweets = TwitterUtils.createStream(ssc, None)
// Now extract the text of each status update into RDD's using map()
val statuses = tweets.map(status => status.getText())
statuses.print()
ssc.start()
ssc.awaitTermination()
}
}
Utilities.scala
package example
import org.apache.log4j.Level
import java.util.regex.Pattern
import java.util.regex.Matcher
object Utilities {
/** Makes sure only ERROR messages get logged to avoid log spam. */
def setupLogging() = {
import org.apache.log4j.{Level, Logger}
val rootLogger = Logger.getRootLogger()
rootLogger.setLevel(Level.ERROR)
}
/** Configures Twitter service credentials using twitter.txt in the main workspace directory */
def setupTwitter() = {
import scala.io.Source
for (line <- Source.fromFile("../twitter.txt").getLines) {
val fields = line.split(" ")
if (fields.length == 2) {
System.setProperty("twitter4j.oauth." + fields(0), fields(1))
}
}
}
/** Retrieves a regex Pattern for parsing Apache access logs. */
def apacheLogPattern():Pattern = {
val ddd = "\\d{1,3}"
val ip = s"($ddd\\.$ddd\\.$ddd\\.$ddd)?"
val client = "(\\S+)"
val user = "(\\S+)"
val dateTime = "(\\[.+?\\])"
val request = "\"(.*?)\""
val status = "(\\d{3})"
val bytes = "(\\S+)"
val referer = "\"(.*?)\""
val agent = "\"(.*?)\""
val regex = s"$ip $client $user $dateTime $request $status $bytes $referer $agent"
Pattern.compile(regex)
}
}
When I check using print statements, I find that the exception happens at this line:
val tweets = TwitterUtils.createStream(ssc, None)
I provide the credentials in a twitter.txt file, which the program reads properly. When twitter.txt is not in the expected directory, the program shows an explicit error, and when I leave the consumer key, secret, etc. blank in twitter.txt, it shows an explicit unauthorized-access error.
If you need more details about error related information or versions of software let me know.
Thanks,
Madhu.
I could reproduce the issue with your code. I believe it's your problem.
You might not have configured twitter.txt properly. Your twitter.txt file should look like this:
consumerKey your_consumerKey
consumerSecret your_consumerSecret
accessToken your_accessToken
accessTokenSecret your_accessTokenSecret
I hope it helps.
After changing the twitter.txt syntax to the following, with a single space between key and value, it worked:
consumerKey your_consumerKey
consumerSecret your_consumerSecret
accessToken your_accessToken
accessTokenSecret your_accessTokenSecret
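The single-space requirement follows from how setupTwitter() in the question reads the file: it splits each line on one space and silently skips any line that does not yield exactly two fields, so a badly formatted line never sets its twitter4j.oauth property. The sketch below reproduces that rule as parseStrict and adds a more forgiving parseLenient. The object name and the lenient variant are assumptions for illustration, not part of the original code.

```scala
object TwitterConfigParsing {
  // Same rule as setupTwitter() in the question:
  // split on a single space and accept only lines with exactly two fields.
  def parseStrict(line: String): Option[(String, String)] = {
    val fields = line.split(" ")
    if (fields.length == 2) Some(fields(0) -> fields(1)) else None
  }

  // More forgiving variant (an assumption, not in the original code):
  // trim the line and split on any run of whitespace.
  def parseLenient(line: String): Option[(String, String)] = {
    val fields = line.trim.split("\\s+")
    if (fields.length == 2) Some(fields(0) -> fields(1)) else None
  }

  def main(args: Array[String]): Unit = {
    val samples = Seq(
      "consumerKey your_consumerKey",   // one space: accepted by both
      "consumerKey  your_consumerKey",  // two spaces: skipped by the strict rule
      "consumerKey\tyour_consumerKey",  // tab: skipped by the strict rule
      "consumerKey=your_consumerKey"    // no whitespace: skipped by both
    )
    samples.foreach { s =>
      println(s"strict=${parseStrict(s)} lenient=${parseLenient(s)}")
    }
  }
}
```

Logging the skipped lines instead of ignoring them would surface this kind of formatting problem before the stream tries to authenticate.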

Getting NPE when trying to do spark streaming with Twitter

I am new to Spark Streaming. When I tried to submit the Spark-Twitter streaming job, I got the following error:
Lost task 0.0 in stage 0.0 (TID 0,sandbox.hortonworks.com):java.lang.NullPointerException
at org.apache.spark.util.Utils$.decodeFileNameInURI(Utils.scala:340)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:365)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:404)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:396)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:396)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:192)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Here is the code snippet:
val Array(consumerKey, consumerSecret, accessToken, accessTokenSecret) = args.take(4)
val filters = args.takeRight(args.length - 4)
System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret)
System.setProperty("twitter4j.oauth.accessToken", accessToken)
System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret)
val sparkConf = new SparkConf().setAppName("TwitterPopularTags")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val stream = TwitterUtils.createStream(ssc,None, filters)
val hashTags = stream.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))
val topCounts60 = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(60))
.map{case (topic, count) => (count, topic)}
.transform(_.sortByKey(false))
topCounts60.foreachRDD(rdd => {
val topList = rdd.take(10)
println("\nPopular topics in last 60 seconds (%s total):".format(rdd.count()))
topList.foreach{case (count, tag) => println("%s (%s tweets)".format(tag, count))}
})
ssc.start()
ssc.awaitTermination()
Any clue why I am getting this NPE?? Any help on how to debug this further?
After debugging a bit, I found that my spark-submit script was adding the job jar file to the --jars list, which caused this error. Still, this seems to be a bug in the spark-core package.

Saving RDD using a Proprietary OutputFormatter

I am using a proprietary database that provides its own OutputFormatter. Using this OutputFormatter, I can write a MapReduce job and save the data from MR into this database.
However, I am now trying to use the OutputFormatter inside Spark to save an RDD to the database.
The code I have written is:
object VerticaSpark extends App {
val scConf = new SparkConf
val sc = new SparkContext(scConf)
val conf = new Configuration()
val job = new Job(conf)
job.setInputFormatClass(classOf[VerticaInputFormat])
job.setOutputKeyClass(classOf[Text])
job.setOutputValueClass(classOf[VerticaRecord])
job.setOutputFormatClass(classOf[VerticaOutputFormat])
VerticaInputFormat.setInput(job, "select * from Foo where key = ?", "1", "2", "3", "4")
VerticaOutputFormat.setOutput(job, "Bar", true, "name varchar", "total int")
val rddVR : RDD[VerticaRecord] = sc.newAPIHadoopRDD(job.getConfiguration, classOf[VerticaInputFormat], classOf[LongWritable], classOf[VerticaRecord]).map(_._2)
val rddTup = rddVR.map(x => (x.get(1).toString(), x.get(2).toString().toInt))
val rddGroup = rddTup.reduceByKey(_ + _)
val rddVROutput = rddGroup.map({
case(x, y) => (new Text("Bar"), getVerticaRecord(x, y, job.getConfiguration))
})
//rddVROutput.saveAsNewAPIHadoopFile("Bar", classOf[Text], classOf[VerticaRecord], classOf[VerticaOutputFormat], job.getConfiguration)
rddVROutput.saveAsNewAPIHadoopDataset(job.getConfiguration)
def getVerticaRecord(name : String, value : Int , conf: Configuration) : VerticaRecord = {
var retVal = new VerticaRecord(conf)
//println(s"going to build Vertica Record with ${name} and ${value}")
retVal.set(0, new Text(name))
retVal.set(1, new IntWritable(value))
retVal
}
}
The entire solution can be downloaded from here:
https://github.com/abhitechdojo/VerticaSpark.git
My code works perfectly until the saveAsNewAPIHadoopFile function is reached. At this line it throws a NullPointerException.
The same logic and the same input and output formatters work perfectly in a MapReduce program, and I can write successfully to the database using the MR program:
https://my.vertica.com/docs/7.2.x/HTML/index.htm#Authoring/HadoopIntegrationGuide/HadoopConnector/ExampleHadoopConnectorApplication.htm%3FTocPath%3DIntegrating%2520with%2520Hadoop%7CUsing%2520the%2520%2520MapReduce%2520Connector%7C_____7
The stack trace of the error is
16/01/15 16:42:53 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 5, machine): java.lang.NullPointerException
at com.abhi.VerticaSpark$$anonfun$4.apply(VerticaSpark.scala:39)
at com.abhi.VerticaSpark$$anonfun$4.apply(VerticaSpark.scala:38)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:999)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 12, machine): java.lang.NullPointerException
at com.abhi.VerticaSpark$$anonfun$4.apply(VerticaSpark.scala:39)
at com.abhi.VerticaSpark$$anonfun$4.apply(VerticaSpark.scala:38)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:999)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
16/01/15 16:42:54 INFO TaskSetManager: Lost task 3.1 in stage 1.0 (TID 11) on executor machine: java.lang.NullPointerException (null) [duplicate 7]
