Apache Spark: Apply existing mllib model on Incoming DStreams/DataFrames - hadoop

Using Apache Spark's MLlib, I have a Logistic Regression model that I store in HDFS. This Logistic Regression model is trained on historical data coming in from some sensors.
I have another Spark program that consumes streaming data from these sensors. I want to be able to use the pre-existing trained model to do predictions on the incoming data stream. Note: I don't want my model to be updated by this data.
To load the trained model, I'd have to use the following line in my code:
val logisticModel = LogisticRegressionModel.load(sc, <location of model>)
where sc is the SparkContext.
However, this application is a streaming application and hence already has a StreamingContext set up. From what I've read, it is bad practice to have two contexts in the same program (even though it is possible).
Does this mean that my approach is wrong and I can't do what I'm trying to do?
Also, would it make more sense to keep storing the stream data in a file and keep running logistic regression on that, rather than trying to do it directly in the streaming application?

A StreamingContext can be created in a few ways, including two constructors which take an existing SparkContext:
StreamingContext(path: String, sparkContext: SparkContext) - where path is a path to a checkpoint file
StreamingContext(sparkContext: SparkContext, batchDuration: Duration)
So you can simply create a SparkContext, load the required models, and then create the StreamingContext:
val sc: SparkContext = ???
...
val ssc = new StreamingContext(sc, Seconds(1))
You can also get the SparkContext using the StreamingContext.sparkContext method:
val ssc: StreamingContext = ???
ssc.sparkContext: SparkContext
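Putting it together, here is a minimal sketch of what the streaming application could look like; the model path, the socket source, and the comma-separated feature format are hypothetical placeholders, not from the original question:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingPrediction {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingPrediction")
    val sc = new SparkContext(conf)

    // Load the pre-trained model once, before the streaming context starts.
    // "/models/logistic" is a placeholder for the actual HDFS location.
    val logisticModel = LogisticRegressionModel.load(sc, "/models/logistic")

    // Reuse the same SparkContext for the StreamingContext.
    val ssc = new StreamingContext(sc, Seconds(1))

    // Hypothetical source: comma-separated feature vectors arriving on a socket.
    val lines = ssc.socketTextStream("localhost", 9999)
    val predictions = lines
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .map(features => logisticModel.predict(features))

    predictions.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
The model is only read here, never updated, so the streaming data does not change it.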

Related

Once in a while Spark Structured Streaming write stream is getting IllegalStateException: Race while writing batch 4

I have multiple queries running in the same Spark Structured Streaming session.
The queries write Parquet records to a Google Cloud Storage bucket and keep their checkpoints in the same bucket.
val query1 = df1
  .select(col("key").cast("string"), from_json(col("value").cast("string"), schema, Map.empty[String, String]).as("data"))
  .select("key", "data.*")
  .writeStream.format("parquet").option("path", path).outputMode("append")
  .option("checkpointLocation", checkpoint_dir1)
  .partitionBy("key") /*.trigger(Trigger.ProcessingTime("5 seconds"))*/
  .queryName("query1").start()

val query2 = df2
  .select(col("key").cast("string"), from_json(col("value").cast("string"), schema, Map.empty[String, String]).as("data"))
  .select("key", "data.*")
  .writeStream.format("parquet").option("path", path).outputMode("append")
  .option("checkpointLocation", checkpoint_dir2)
  .partitionBy("key") /*.trigger(Trigger.ProcessingTime("5 seconds"))*/
  .queryName("query2").start()
Problem: Sometimes the job fails with java.lang.IllegalStateException: Race while writing batch 4
Logs:
Caused by: java.lang.IllegalStateException: Race while writing batch 4
at org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol.commitJob(ManifestFileCommitProtocol.scala:67)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:187)
... 20 more
20/07/24 19:40:15 INFO SparkContext: Invoking stop() from shutdown hook
This error occurs because there are two writers writing to the same output path. The file streaming sink doesn't support multiple writers; it assumes there is only one writer per path.
Hence, to fix this, make each query use its own output directory, as sketched below. When reading the data back, you can load each output directory and union them.
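A minimal sketch of that fix, where df1Projected and df2Projected stand for the already-projected DataFrames from the question and the bucket paths are illustrative placeholders:
// Each query writes to its own output directory and keeps its own checkpoint.
val query1 = df1Projected.writeStream.format("parquet")
  .option("path", "gs://my-bucket/output/query1")                 // illustrative path
  .option("checkpointLocation", "gs://my-bucket/checkpoints/query1")
  .outputMode("append")
  .partitionBy("key")
  .queryName("query1").start()

val query2 = df2Projected.writeStream.format("parquet")
  .option("path", "gs://my-bucket/output/query2")                 // illustrative path
  .option("checkpointLocation", "gs://my-bucket/checkpoints/query2")
  .outputMode("append")
  .partitionBy("key")
  .queryName("query2").start()

// When reading the data back, load both directories and union the results.
val combined = spark.read.parquet("gs://my-bucket/output/query1")
  .union(spark.read.parquet("gs://my-bucket/output/query2"))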
You can also use a streaming sink that supports multiple concurrent writers, such as the Delta Lake library. It's also supported on Google Cloud: https://cloud.google.com/blog/products/data-analytics/getting-started-with-new-table-formats-on-dataproc . That page has instructions for using Delta Lake on Google Cloud. It doesn't cover the streaming case, but all you need to do is change format("parquet") to format("delta") in your code.
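If you go the Delta route, the change is essentially just the sink format; a sketch under the assumption that the delta-core library is on the classpath (df1Projected again stands for the projected DataFrame from the question):
// Same streaming write as before, but with the Delta sink instead of the file sink.
val query1 = df1Projected.writeStream
  .format("delta")                              // was format("parquet")
  .option("checkpointLocation", checkpoint_dir1)
  .outputMode("append")
  .partitionBy("key")
  .queryName("query1")
  .start(path)                                  // Delta table location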

How to use a Python bolt in StormCrawler?

I have some image classifiers written in Python. A lot of examples on the web describe how to use Python in a Storm bolt via stdin/stdout. I want to integrate my Python image classifier with a StormCrawler topology. Is it possible or not?
Thanks
Definitely possible. I did that a few years ago to integrate a TensorFlow image classifier into a StormCrawler topology. The code stayed with the customer I wrote it for, so I can't remember the details, but it was based on the multilang protocol.
Yes, you can. If you are using Flux, this is a sample definition of how to use a Python bolt in your topology:
- id: "pythonbolt"
  className: "org.apache.storm.flux.wrappers.bolts.FluxShellBolt"
  constructorArgs:
    - ["python", "/absolute/path/to/your/python_file.py"]
    # declare your outputs here:
    - ["output0", "output1", "output2"]
  parallelism: 1
NOTE: Make sure you emit simple data types (like string, integer, etc.) to your Python bolt, not Java data types, or it will throw errors!
First, download storm.py from here.
And this is a sample Python bolt:
import storm

class SampleBolt(storm.BasicBolt):
    # Initialize this instance
    def initialize(self, conf, context):
        self._conf = conf
        self._context = context

    def process(self, tup):
        # Do some processing here, then emit your outputs
        # (output0, output1, output2 are placeholders for values you compute).
        storm.emit([output0, output1, output2])

# Start the bolt when it's invoked
SampleBolt().run()

How to speed up basic pyspark statements

As a new Spark/PySpark user, I have a script running on an AWS t2.small EC2 instance in local mode (for testing purposes only).
For example:
from __future__ import print_function
from pyspark.ml.classification import NaiveBayesModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession
import ritc  # (my library)

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("NaiveBayesExample")\
        .getOrCreate()
    ...
    request_dataframe = spark.createDataFrame(ritc.request_parameters, ["features"])
    model = NaiveBayesModel.load(ritc.model_path)
    ...
    prediction = model.transform(ritc.request_dataframe)
    prediction.createOrReplaceTempView("result")
    df = spark.sql("SELECT prediction FROM result")
    p = map(lambda row: row.asDict(), df.collect())
    ...
I have left out code so as to focus on my question, which relates to the speed of basic Spark statements such as spark = SparkSession...
Using the datetime library (not shown above), I have timings for the three biggest 'culprits':
'spark = SparkSession...' -- 3.7 secs
'spark.createDataFrame()' -- 2.6 secs
'NaiveBayesModel.load()' -- 3.4 secs
Why are these times so long??
To give a little background, I would like to provide the capability to expose scripts such as the above as REST services.
In a supervised learning context:
- service #1: train a model and save the model in the filesystem
- service #2: load the model from the filesystem and get a prediction for a single instance
(Note: the #2 REST requests would run at different, unanticipated (random) times. The general pattern would be:
-> once: train the model - expecting a long turnaround time
-> multiple times: request a prediction for a single instance - expecting a turnaround time in milliseconds, e.g. < 400 ms.)
Is there a flaw in my thinking? Can I expect to increase performance dramatically to achieve this goal of sub-second turnaround time?
In almost every article/video/discussion on Spark performance that I have come across, the emphasis has been on 'heavy' tasks. The 'train model' task above may indeed be a 'heavy' one - I expect this will be the case when run in production. But the 'request a prediction for a single instance' task needs to be responsive.
Can anyone help?
Thanks in anticipation.
Colin Goldberg
So Apache Spark is designed to be used in this way. You might want to look at Spark Streaming if your goal is to handle streaming input data for predictions. You may also want to look at other options for serving Spark models, like PMML or MLeap.

Creating single object DataFrame for predictions

Once I have my classification models trained, I'd like to use them in my web application to make classification predictions on the data that has been collected for a given session.
That is:
1) I have some session data structure that I need to map to a DataFrame row
2) feed that DataFrame row into my ML model to predict the classification
3) use the prediction together with the originating session to show it to the user in the browser.
The examples I've seen so far of creating a DataFrame as input to a Spark pipeline create it from a data source like a file. It seems a bit unwieldy to first create a single POJO or JsonNode, serialize it to a file containing just one record, and then use that file to create the DataFrame to feed the model.
Writing this I also get the feeling that it might not be a great idea to create and tear down the ML pipeline for each request, which seems to follow from this approach.
So maybe I should better think "Spark Streaming"?
Feed the mapped session data into some kind of message queue and feed that into my Spark pipeline? What kind of "stream" would be appropriate here?
I read somewhere that Spark streaming consumes the stream in micro batches and not record by record - that implies some delay until enough records have been collected to fill up the micro batch (or some preconfigured delay to wait until the micro batch is considered to be "full enough"). What does that mean for the responsiveness of the web application? Can I trigger the micro batches like every 100 milliseconds?
I would appreciate if someone could point me in the right direction.
Maybe Spark is not a good fit here and I should switch to Apache Flink?
Thanks in advance, Bernd
Ok, by now I have found some ways to solve my problem, maybe that
helps someone else:
Use a Sequence containing one tuple and name the columns separately:
val df = spark.createDataFrame(
  Seq(("val1", "val2"))
).toDF("label1", "label2")
Using a JSON string:
val sqlContext = spark.sqlContext
val jsonData = """{ "label1": "val1", "label2": "val2" }"""
val rdd = spark.sparkContext.parallelize(Seq(jsonData))
val df = sqlContext.read.json(rdd)
NOT working: create from a Sequence of case class objects:
val sqlContext = sparkSession.sqlContext
import sqlContext.implicits._
val myData= Seq(Feat("value1", "value2"))
val ds: Dataset[Feat]= myData.toDS()
ds.show(10, false)
This compiles ok, but yields an Exception at runtime:
[error] a.a.OneForOneStrategy - java.lang.RuntimeException:
Error while encoding: java.lang.ClassCastException:
es.core.recommender.Feat cannot be cast to es.core.recommender.Feat
I'd love to include more of the stacktrace, but this glorious editor
won't let me...
It would be nice to know why this alternative did not work...
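For completeness, step 2 of the original question (feeding the single-row DataFrame into the trained model) could look roughly like this; the model path and column names are hypothetical, and the pipeline is assumed to end in a classifier that produces a prediction column:
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SessionPrediction").getOrCreate()

// Load the trained pipeline once at application startup, not per request,
// so the pipeline is not created and torn down for every prediction.
// "/models/session-classifier" is a placeholder path.
val model = PipelineModel.load("/models/session-classifier")

// Map the session data to a single-row DataFrame (columns are illustrative).
val sessionDf = spark.createDataFrame(
  Seq(("val1", "val2"))
).toDF("label1", "label2")

// Apply the pipeline and read back the prediction for this one row.
val prediction = model.transform(sessionDf)
  .select("prediction")
  .head()
  .getDouble(0)
Keeping the loaded PipelineModel in memory and reusing it across requests also addresses the concern above about creating and tearing down the pipeline for each request.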

Spark streaming jobs duration in program

How do I get, in my program (which is running the Spark streaming job), the time taken for each RDD job?
For example:
val streamrdd = KafkaUtils.createDirectStream[String, String, StringDecoder,StringDecoder](ssc, kafkaParams, topicsSet)
val processrdd = streamrdd.map(some operations...).savetoxyz
In the above code, for each micro-batch RDD a job is run for the map and save operations.
I want to get the time taken for each streaming job. I can see the jobs in the UI on port 4040, but I want to get it in the Spark code itself.
Pardon if my question is not clear.
You can use a StreamingListener in your Spark app. This interface provides a method onBatchCompleted that can give you the total time taken by a batch's jobs.
context.addStreamingListener(new StatusListenerImpl());
StatusListenerImpl is the implementation class that you have to write by extending StreamingListener.
There are other methods available in the listener as well; you should explore them too.
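A minimal sketch of such a listener in Scala (the log format is just illustrative):
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class StatusListenerImpl extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    // processingDelay = time spent running the batch's jobs (ms)
    // totalDelay = scheduling delay + processing time (ms)
    println(s"Batch ${info.batchTime}: " +
      s"processing=${info.processingDelay.getOrElse(-1L)} ms, " +
      s"total=${info.totalDelay.getOrElse(-1L)} ms")
  }
}

// Register it on the StreamingContext before calling start():
// ssc.addStreamingListener(new StatusListenerImpl())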
