Creating single object DataFrame for predictions - spark-streaming

Once I have trained my classification models, I'd like to use them in my web application to make classification predictions on the data that has been collected for a given session.
That is:
1) I have some session data structure that I need to map to a DataFrame row
2) feed that DataFrame row into my ML model to predict the classification
3) use the prediction together with the originating session to show it to the user at the browser.
The examples I've seen so far of creating a DataFrame as input to a Spark pipeline build it from a data source such as a file. It seems a bit unwieldy to first create a single POJO or JsonNode, serialize it to a file containing just one record, and then use that file to create the DataFrame to feed the model.
Writing this I also get the feeling that it might not be a great idea to create and tear down the ML pipeline for each request, which seems to follow from this approach.
So maybe I should rather be thinking "Spark Streaming"?
Feed the mapped session data into some kind of message queue and feed that into my Spark pipeline? What kind of "stream" would be appropriate here?
I read somewhere that Spark streaming consumes the stream in micro batches and not record by record - that implies some delay until enough records have been collected to fill up the micro batch (or some preconfigured delay to wait until the micro batch is considered to be "full enough"). What does that mean for the responsiveness of the web application? Can I trigger the micro batches like every 100 milliseconds?
I would appreciate if someone could point me in the right direction.
Maybe Spark is not a good fit here and I should switch to Apache Flink?
Thanks in advance, Bernd
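On the micro-batch question: Spark Streaming cuts the stream into batches by a fixed time interval rather than waiting for a batch to be "full", so a record waits at most one interval before its batch starts, plus however long that batch takes to process. Below is a minimal back-of-the-envelope sketch of that bound. The function name and the example numbers are made up for illustration, scheduling overhead is ignored, and it assumes processing keeps up with the interval.

```python
def micro_batch_latency_ms(batch_interval_ms, processing_time_ms):
    """Return (best, worst) end-to-end latency for one record, in milliseconds.

    Assumption: each batch finishes before the next one is due. If it does
    not, batches queue up and latency grows without bound.
    """
    if processing_time_ms > batch_interval_ms:
        raise ValueError("processing is slower than the batch interval")
    # Best case: the record arrives just before its batch is cut.
    best = processing_time_ms
    # Worst case: the record arrives just after a batch was cut and has to
    # wait a full interval before its own batch starts.
    worst = batch_interval_ms + processing_time_ms
    return best, worst


# Hypothetical numbers: 100 ms interval, 40 ms to score one batch.
best, worst = micro_batch_latency_ms(100, 40)
print("latency between", best, "and", worst, "ms")
```

So even with a 100 ms interval, the user-visible delay is driven by the interval plus the per-batch processing time, not by the interval alone.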

OK, by now I have found some ways to solve my problem; maybe that
helps someone else:
Use a Sequence containing one tuple and name the columns separately:
val df = spark.createDataFrame(
  Seq(("val1", "val2"))
).toDF("label1", "label2")
Using a JSON string:
val sqlContext = spark.sqlContext
val jsonData = """{ "label1": "val1", "label2": "val2" }"""
val rdd = spark.sparkContext.parallelize(Seq(jsonData))
val df = sqlContext.read.json(rdd)
NOT working: create from a Sequence of case class objects:
val sqlContext = spark.sqlContext
import sqlContext.implicits._
val myData= Seq(Feat("value1", "value2"))
val ds: Dataset[Feat]= myData.toDS()
ds.show(10, false)
This compiles ok, but yields an Exception at runtime:
[error] a.a.OneForOneStrategy - java.lang.RuntimeException:
Error while encoding: java.lang.ClassCastException:
es.core.recommender.Feat cannot be cast to es.core.recommender.Feat
I'd love to include more of the stacktrace, but this glorious editor
won't let me...
It would be nice to know why this alternative did not work...

Related

How to speed up basic pyspark statements

As a new Spark/PySpark user, I have a script running on an AWS t2.small EC2 instance in local mode (for testing purposes only).
For example:
from __future__ import print_function
from pyspark.ml.classification import NaiveBayesModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession
import ritc  # my library

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("NaiveBayesExample")\
        .getOrCreate()
    ...
    request_dataframe = spark.createDataFrame(ritc.request_parameters, ["features"])
    model = NaiveBayesModel.load(ritc.model_path)
    ...
    # score the DataFrame created above
    prediction = model.transform(request_dataframe)
    prediction.createOrReplaceTempView("result")
    df = spark.sql("SELECT prediction FROM result")
    p = map(lambda row: row.asDict(), df.collect())
    ...
I have left out code so as to focus on my question, relating to the speed of basic spark statements such as spark = SparkSession...
Using the datetime library (not shown above), I have timings for the three biggest 'culprits':
'spark = SparkSession...' -- 3.7 secs
'spark.createDataFrame()' -- 2.6 secs
'NaiveBayesModel.load()' -- 3.4 secs
Why are these times so long?
To give a little background, I would like to provide the capability to expose scripts such as the above as REST services.
In a supervised context:
- service #1: train a model and save the model in the filesystem
- service #2: load the model from the filesystem and get a prediction for a single instance
(Note: the #2 REST requests would arrive at different, unanticipated (random) times.) The general pattern would be:
-> once: train the model - expecting a long turnaround time
-> multiple times: request a prediction for a single instance - expecting a turnaround time in milliseconds, e.g. < 400 ms.
Is there a flaw in my thinking? Can I expect to increase performance dramatically to achieve this goal of sub-second turnaround time?
In almost every article/video/discussion on Spark performance that I have come across, the emphasis has been on 'heavy' tasks. The 'train model' task above may indeed be a 'heavy' one - I expect this will be the case when run in production. But the 'request a prediction for a single instance' task needs to be responsive.
Can anyone help?
Thanks in anticipation.
Colin Goldberg
Apache Spark is not really designed to be used in this way. You might want to look at Spark Streaming if your goal is to handle streaming input data for predictions. You may also want to look at other options for serving Spark models, like PMML or MLeap.
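Independently of which serving option is chosen, the timings in the question are start-up costs (session creation and model loading), so a long-running service can pay them once rather than on every request. The sketch below shows only that load-once pattern in plain Python. `load_model`, `FakeModel`, and the path `/models/nb` are hypothetical stand-ins (a real service would call something like `NaiveBayesModel.load` in the loader), and it makes no claim about how fast the prediction step itself would be.

```python
import threading


class ModelCache:
    """Loads each model at most once per process and reuses it afterwards."""

    def __init__(self, loader):
        self._loader = loader      # hypothetical: callable(path) -> model
        self._models = {}
        self._lock = threading.Lock()

    def get(self, path):
        # The lock keeps two concurrent requests from loading the same model twice.
        with self._lock:
            if path not in self._models:
                self._models[path] = self._loader(path)
            return self._models[path]


class FakeModel:
    """Stand-in for a trained model; only here to make the sketch runnable."""

    def __init__(self, path):
        self.path = path

    def predict(self, features):
        return float(sum(features) > 1.0)


load_calls = []


def load_model(path):
    # In a real service this is the slow step (seconds, per the question).
    load_calls.append(path)
    return FakeModel(path)


cache = ModelCache(load_model)


def handle_request(features, model_path="/models/nb"):
    model = cache.get(model_path)   # slow only on the first request
    return model.predict(features)


print(handle_request([0.2, 0.3]), handle_request([0.9, 0.8]), len(load_calls))
```

The same idea applies to the SparkSession: create it when the service starts and keep it alive between requests.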

Do I need caching after repartitioning

I have made a DataFrame which I repartitioned across the nodes based on its primary key:
val config=new SparkConf().setAppName("MyHbaseLoader").setMaster("local[10]")
val context=new SparkContext(config)
val sqlContext=new SQLContext(context)
val rows="sender,time,time(utc),reason,context-uuid,rat,cell-id,first-pkt,last-pkt,protocol,sub-proto,application-id,server-ip,server-domain-name, http-proxy-ip,http-proxy-domain-name, video,packets-dw, packets-ul, bytes-dw, bytes-ul"
val scheme= new StructType(rows.split(",").map(e=>new StructField(e.trim,StringType,true)))
val dFrame=sqlContext.read
.schema(scheme)
.format("csv")
.load("E:\\Users\\Mehdi\\Downloads\\ProbDocument\\ProbDocument\\ggsn_cdr.csv")
dFrame.registerTempTable("GSSN")
dFrame.persist(StorageLevel.MEMORY_AND_DISK)
val distincCount=sqlContext.sql("select count(distinct sender) as SENDERS from GSSN").collectAsList().get(0).get(0).asInstanceOf[Long]
dFrame.repartition(distincCount.toInt/3,dFrame("sender"))
Do I need to call persist again after repartitioning, for the subsequent reduce jobs on the DataFrame?
Yes, repartition returns a new DataFrame so you would need to cache it again.
While the answer provided by Dikei seems to address your direct question it is important to note that in a case like this there is typically no reason to explicitly cache at all.
Every shuffle in Spark (here, the repartition) serves as an implicit caching point. If some part of the lineage has to be re-executed and none of the executors has been lost, Spark won't have to go further back than the last shuffle; it can read the shuffle files.
This means that caching just before or just after a shuffle is typically a waste of time and resources, especially if you're not interested in in-memory-only storage or some non-standard caching mechanism.
You would need to persist the repartitioned DataFrame, since DataFrames are immutable and repartition returns a new DataFrame.
One approach is to persist dFrame and, after repartitioning it, assign the returned DataFrame to dFrameRepart. At this stage you could persist dFrameRepart and unpersist dFrame to free up memory, provided that you won't be using dFrame again. If you are still using dFrame after the repartition operation, both DataFrames can be persisted.
dFrame.registerTempTable("GSSN")
dFrame.persist(StorageLevel.MEMORY_AND_DISK)
val distincCount=sqlContext.sql("select count(distinct sender) as SENDERS from GSSN").collectAsList().get(0).get(0).asInstanceOf[Long]
val dFrameRepart=dFrame.repartition(distincCount.toInt/3, dFrame("sender")).persist(StorageLevel.MEMORY_AND_DISK)
dFrame.unpersist

Apache Spark: Apply existing mllib model on Incoming DStreams/DataFrames

Using Apache Spark's mllib, I have a Logistic Regression model that I store in HDFS. This Logistic Regression model is trained on historical data coming in from some sensors.
I have another spark program that consumes streaming data from these sensors. I want to be able to use the pre-existing trained model to do predictions on incoming data stream. Note: I don't want my model to be updated by this data.
To load the training model, I'd have to use the following line in my code:
val logisticModel = LogisticRegressionModel.load(sc, <location of model>)
sc: spark context.
However, this application is a streaming application and hence already has a "StreamingContext" setup. Now, from what I've read, it is bad practice to have two contexts in the same program (even though it is possible).
Does this mean that my approach is wrong and I can't do what I'm trying to do?
Also, would it make more sense to keep storing the stream data in a file and keep running logistic regression on that, rather than trying to do it directly in the streaming application?
StreamingContext can be created in a few ways, including two constructors which take an existing SparkContext:
StreamingContext(path: String, sparkContext: SparkContext) - where path is a path to a checkpoint file
StreamingContext(sparkContext: SparkContext, batchDuration: Duration)
So you can simply create SparkContext, load required models, and create StreamingContext:
val sc: SparkContext = ???
...
val ssc = new StreamingContext(sc, Seconds(1))
You can also get SparkContext using StreamingContext.sparkContext method:
val ssc: StreamingContext = ???
ssc.sparkContext: SparkContext

Spark Streaming and ElasticSearch - Could not write all entries

I'm currently writing a Scala application made of a Producer and a Consumer. The Producer gets some data from an external source and writes it to Kafka. The Consumer reads from Kafka and writes to Elasticsearch.
The consumer is based on Spark Streaming and every 5 seconds fetches new messages from Kafka and writes them to ElasticSearch. The problem is I'm not able to write to ES because I get a lot of errors like the one below :
[ERROR] [2015-04-24 11:21:14,734] [org.apache.spark.TaskContextImpl]: Error in TaskCompletionListener
org.elasticsearch.hadoop.EsHadoopException: Could not write all entries [3/26560] (maybe ES was overloaded?). Bailing out...
at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:225) ~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3]
at org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:236) ~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3]
at org.elasticsearch.hadoop.rest.RestService$PartitionWriter.close(RestService.java:125) ~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3]
at org.elasticsearch.spark.rdd.EsRDDWriter$$anonfun$write$1.apply$mcV$sp(EsRDDWriter.scala:33) ~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3]
at org.apache.spark.TaskContextImpl$$anon$2.onTaskCompletion(TaskContextImpl.scala:57) ~[spark-core_2.10-1.2.1.jar:1.2.1]
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:68) [spark-core_2.10-1.2.1.jar:1.2.1]
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:66) [spark-core_2.10-1.2.1.jar:1.2.1]
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) [na:na]
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) [na:na]
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:66) [spark-core_2.10-1.2.1.jar:1.2.1]
at org.apache.spark.scheduler.Task.run(Task.scala:58) [spark-core_2.10-1.2.1.jar:1.2.1]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) [spark-core_2.10-1.2.1.jar:1.2.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_65]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_65]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65]
Consider that the producer is writing 6 messages every 15 seconds so I really don't understand how this "overload" can possibly happen (I even cleaned the topic and flushed all old messages, I thought it was related to an offset issue). The task executed by Spark Streaming every 5 seconds can be summarized by the following code :
val result = KafkaUtils.createStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, Map("wasp.raw" -> 1), StorageLevel.MEMORY_ONLY_SER_2)
val convertedResult = result.map(k => (k._1 ,AvroToJsonUtil.avroToJson(k._2)))
//TO-DO : Remove resource (yahoo/yahoo) hardcoded parameter
log.info(s"*** EXECUTING SPARK STREAMING TASK + ${java.lang.System.currentTimeMillis()}***")
convertedResult.foreachRDD(rdd => {
rdd.map(data => data._2).saveToEs("yahoo/yahoo", Map("es.input.json" -> "true"))
})
If I try to print the messages instead of sending to ES, everything is fine and I actually see only 6 messages. Why can't I write to ES?
For the sake of completeness, I'm using this library to write to ES : elasticsearch-spark_2.10 with the latest beta version.
After many retries, I found a way to write to Elasticsearch without getting any error. Basically, passing the parameter "es.batch.size.entries" -> "1" to the saveToEs method solved the problem. I don't understand why using the default or any other batch size leads to the aforementioned error; I would expect an error message if I were trying to write more than the allowed maximum batch size, not less.
Moreover, I noticed that I was actually writing to ES, but not all of my messages: I was losing between 1 and 3 messages per batch.
When I pushed a DataFrame to ES from Spark, I got the same error message, even with the "es.batch.size.entries" -> "1" configuration.
Once I increased the thread pool size in ES, the issue went away.
for example,
Bulk pool
threadpool.bulk.type: fixed
threadpool.bulk.size: 600
threadpool.bulk.queue_size: 30000
As was already mentioned here, this is a document write conflict.
Your convertedResult data stream contains multiple records with the same id. Writing them to Elasticsearch as part of the same batch produces the error above.
Possible solutions:
1) Generate a unique id for each record. Depending on your use case, this can be done in a few different ways. For example, one common solution is to create a new field by combining the id and lastModifiedDate fields and to use that field as the id when writing to Elasticsearch.
2) De-duplicate records based on id: select only one record with a particular id and discard the other duplicates. Depending on your use case, this could be the most current record (based on a timestamp field), the most complete one (most of the fields contain data), etc.
The #1 solution will store all records that you receive in the stream.
The #2 solution will store only the unique records for a specific id based on your de-duplication logic. This result would be the same as setting "es.batch.size.entries" -> "1", except you will not limit the performance by writing one record at a time.
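The two options above can be sketched in plain Python, independent of Spark. This is only an illustration under assumptions: records are dicts, the field names `id` and `lastModifiedDate` are taken from the description above, the `doc_id` field name is made up, and "most current record wins" is just one of the possible de-duplication rules.

```python
def with_composite_id(records, id_key="id", ts_key="lastModifiedDate"):
    """Solution #1: keep every record and give each a combined document id.

    The combined id is only unique if no two records share both the id and
    the timestamp.
    """
    return [
        dict(rec, doc_id="{}-{}".format(rec[id_key], rec[ts_key]))
        for rec in records
    ]


def dedupe_latest(records, id_key="id", ts_key="lastModifiedDate"):
    """Solution #2: keep one record per id, the one with the latest timestamp."""
    latest = {}
    for rec in records:
        key = rec[id_key]
        if key not in latest or rec[ts_key] > latest[key][ts_key]:
            latest[key] = rec
    return list(latest.values())


# Hypothetical batch in which id "a" appears twice.
batch = [
    {"id": "a", "lastModifiedDate": 1, "price": 10},
    {"id": "b", "lastModifiedDate": 1, "price": 20},
    {"id": "a", "lastModifiedDate": 2, "price": 11},
]

print([r["doc_id"] for r in with_composite_id(batch)])
print(dedupe_latest(batch))
```

In a Spark job the same logic would run on the RDD or DataFrame before the write, so that no two documents in one batch target the same id.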
Another possibility is that the cluster/shard status is RED, which may be due to unassigned replicas. In my case, once the status turned GREEN, the API call succeeded just fine.
This is a document write conflict.
For example:
Multiple documents specify the same _id for Elasticsearch to use.
These documents are located in different partitions.
Spark writes multiple partitions to ES simultaneously.
Result is Elasticsearch receiving multiple updates for a single Document at once - from multiple sources / through multiple nodes / containing different data
"I was losing between 1 and 3 messages per batch."
Fluctuating number of failures when batch size > 1
Success if batch write size "1"
Just adding another potential reason for this error, hopefully it helps someone.
If your Elasticsearch index has child documents then:
if you are using a custom routing field (not _id), then according to
the documentation the uniqueness of the documents is not guaranteed.
This might cause issues while updating from spark.
If you are using the standard _id, the uniqueness will be preserved, however you need to make sure the following options are provided while writing from Spark to Elasticsearch:
es.mapping.join
es.mapping.routing

spark map(func).cache slow

When I use cache to store data, I find that Spark runs very slowly. However, when I don't use the cache method, the speed is very good. My main configuration is as follows:
SPARK_JAVA_OPTS+="-Dspark.local.dir=/home/wangchao/hadoop-yarn-spark/tmp_out_info
-Dspark.rdd.compress=true -Dspark.storage.memoryFraction=0.4
-Dspark.shuffle.spill=false -Dspark.executor.memory=1800m -Dspark.akka.frameSize=100
-Dspark.default.parallelism=6"
And my test code is:
val file = sc.textFile("hdfs://10.168.9.240:9000/user/bailin/filename")
val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).cache().reduceByKey(_+_)
count.collect()
Any answers or suggestions on how I can resolve this are greatly appreciated.
cache is useless in the context in which you are using it. Here, cache says: keep the result of the map, .map(word => (word, 1)), in memory. If you didn't call it, the reduce could be chained onto the end of the map and the map's results discarded once they have been used. cache is better used when multiple transformations/actions will be called on the RDD after it is created. For example, if you create a dataset that you want to join to two different datasets, it is helpful to cache it; if you don't, the whole RDD will be recalculated on the second join. Here is an easily understandable example from Spark's website.
val file = spark.textFile("hdfs://...")
val errors = file.filter(line => line.contains("ERROR")).cache() //errors is cached to prevent recalculation when the two filters are called
// Count all the errors
errors.count()
// Count errors mentioning MySQL
errors.filter(line => line.contains("MySQL")).count()
// Fetch the MySQL errors as an array of strings
errors.filter(line => line.contains("MySQL")).collect()
Internally, cache keeps the RDD in memory or saves it to disk (depending on the storage level), so that its ancestors no longer need to be recomputed. An RDD otherwise has to keep track of its ancestors so that it can be recalculated on demand; this is the recovery mechanism of RDDs.
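To make the recalculation point concrete, here is a toy model in plain Python. It is not Spark and not how Spark is implemented: `LazyDataset` is an invented class that simply recomputes its contents on every action unless `cache()` was called, and it counts how often that happens. The three actions mirror the errors example above.

```python
class LazyDataset:
    """Toy stand-in for an RDD: recomputed on every action unless cached."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument function returning a list
        self._cached = None
        self._use_cache = False
        self.computations = 0        # how many times this dataset was computed

    def cache(self):
        self._use_cache = True
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.computations += 1
        data = self._compute()
        if self._use_cache:
            self._cached = data
        return data

    def filter(self, pred):
        # The child keeps a reference to its parent (its "lineage").
        return LazyDataset(lambda: [x for x in self._materialize() if pred(x)])

    def count(self):
        return len(self._materialize())

    def collect(self):
        return list(self._materialize())


lines = ["ERROR MySQL down", "INFO ok", "ERROR disk full"]


def run(use_cache):
    errors = LazyDataset(lambda: lines).filter(lambda l: "ERROR" in l)
    if use_cache:
        errors.cache()
    errors.count()
    errors.filter(lambda l: "MySQL" in l).count()
    errors.filter(lambda l: "MySQL" in l).collect()
    return errors.computations


print("without cache:", run(False), "with cache:", run(True))
```

With a single action, as in the word-count code from the question, there is nothing to reuse, which is why caching there only adds overhead.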

Resources