How to speed up basic pyspark statements - performance

As a new Spark/PySpark user, I have a script running on an AWS t2.small EC2 instance in local mode (for testing purposes only).
As an example:
from __future__ import print_function
from pyspark.ml.classification import NaiveBayesModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession
import ritc  # my library
if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("NaiveBayesExample")\
        .getOrCreate()
    ...
    request_dataframe = spark.createDataFrame(ritc.request_parameters, ["features"])
    model = NaiveBayesModel.load(ritc.model_path)
    ...
    prediction = model.transform(ritc.request_dataframe)
    prediction.createOrReplaceTempView("result")
    df = spark.sql("SELECT prediction FROM result")
    p = map(lambda row: row.asDict(), df.collect())
    ...
I have left out code so as to focus on my question, relating to the speed of basic spark statements such as spark = SparkSession...
Using the datetime library (not shown above), I have timings for the three biggest 'culprits':
'spark = SparkSession...' -- 3.7 secs
'spark.createDataFrame()' -- 2.6 secs
'NaiveBayesModel.load()' -- 3.4 secs
Why are these times so long??
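For reference, timings like the ones above could be collected with a small helper along these lines. This is only a sketch: the question says the datetime library was used, whereas this version uses time.perf_counter, and the `timed` helper, the `timings` dict and the "sleep" label are hypothetical names introduced for illustration. The sleep call stands in for a statement such as `spark = SparkSession...`.

```python
import time
from contextlib import contextmanager

timings = {}  # label -> elapsed seconds (hypothetical container)

@contextmanager
def timed(label):
    # Record the wall-clock time spent inside the "with" block under the given label.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

# Stand-in for an expensive statement such as building the SparkSession.
with timed("sleep"):
    time.sleep(0.05)

for label, seconds in timings.items():
    print("'{0}' -- {1:.2f} secs".format(label, seconds))
```

Wrapping each of the three statements in its own `with timed(...)` block would give one entry per statement.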
To give a little background, I would like to provide the capability to expose scripts such as the above as REST services.
In a supervised-learning context:
- service #1: train a model and save the model in the filesystem
- service #2: load the model from the filesystem and get a prediction for a single instance
Note: the #2 REST requests would run at different, unanticipated (random) times. The general pattern would be:
-> once: train the model - expecting a long turnaround time
-> multiple times: request a prediction for a single instance - expecting a turnaround time in milliseconds, e.g. < 400 ms.
Is there a flaw in my thinking? Can I expect to increase performance dramatically to achieve this goal of sub-second turnaround time?
In almost every article/video/discussion on Spark performance that I have come across, the emphasis has been on 'heavy' tasks. The 'train model' task above may indeed be a 'heavy' one - I expect this will be the case when run in production. But the 'request a prediction for a single instance' task needs to be responsive.
Can anyone help?
Thanks in anticipation.
Colin Goldberg

Apache Spark is not designed to be used in this way. You might want to look at Spark Streaming if your goal is to handle streaming input data for predictions. You may also want to look at other options for serving Spark models, like PMML or MLeap.
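Whichever serving route is taken, a long-lived service process does not have to repeat expensive start-up work on every request: it can load the model once and reuse it. Below is a minimal, Spark-free sketch of that load-once pattern. Everything in it is a hypothetical stand-in: `load_model` simulates an expensive call such as NaiveBayesModel.load(...), the returned lambda is a fake "model", and "/tmp/model" is a made-up path. It does not remove per-request costs such as createDataFrame.

```python
import functools
import time

load_calls = []  # records every real (expensive) load, for illustration only

@functools.lru_cache(maxsize=None)
def load_model(model_path):
    # Hypothetical stand-in for an expensive load such as NaiveBayesModel.load(model_path).
    load_calls.append(model_path)
    time.sleep(0.1)  # simulate slow start-up work
    return lambda features: sum(features) > 1.0  # fake "model"

def predict(model_path, features):
    # The first call pays the load cost; later calls reuse the cached model.
    model = load_model(model_path)
    return model(features)

first = predict("/tmp/model", [0.2, 0.3])
second = predict("/tmp/model", [0.9, 0.8])
print(first, second, len(load_calls))
```

Only the first request triggers the simulated load; the second one is served from the cache.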

Related

Creating single object DataFrame for predictions

Once my classification models are trained, I'd like to use them in my web application to make classification predictions on the data that has been collected for a given session.
That is:
1) I have some session data structure that I need to map to a DataFrame row
2) feed that DataFrame row into my ML model to predict the classification
3) use the prediction with the originating session to show it to the user in the browser.
The examples I've seen so far for creating a DataFrame as input to a Spark pipeline build it from a data source such as a file. It seems a bit unwieldy to first create a single POJO or JsonNode, serialize it to a file containing just one record, and then use that file to create the DataFrame to feed the model.
Writing this I also get the feeling that it might not be a great idea to create and tear down the ML pipeline for each request, which seems to follow from this approach.
So maybe I should think in terms of "Spark Streaming" instead?
Feed the mapped session data into some kind of message queue and feed that into my Spark pipeline? What kind of "stream" would be appropriate here?
I read somewhere that Spark streaming consumes the stream in micro batches and not record by record - that implies some delay until enough records have been collected to fill up the micro batch (or some preconfigured delay to wait until the micro batch is considered to be "full enough"). What does that mean for the responsiveness of the web application? Can I trigger the micro batches like every 100 milliseconds?
I would appreciate if someone could point me in the right direction.
Maybe Spark is not a good fit here and I should switch to Apache Flink?
Thanks in advance, Bernd
Ok, by now I have found some ways to solve my problem, maybe that
helps someone else:
Use a sequence containing one tuple and name the columns separately:
val df = spark.createDataFrame(
  Seq(("val1", "val2"))
).toDF("label1", "label2")
Using a JSON-String
val sqlContext = spark.sqlContext
val jsonData= """{ "label1": "val1", "label2": "val2" }"""
val rdd= spark.sparkContext.parallelize(Seq(jsonData))
val df= sqlContext.read.json(rdd)
NOT Working: create from Sequence case class Objects:
val sqlContext = sparkSession.sqlContext
import sqlContext.implicits._
val myData= Seq(Feat("value1", "value2"))
val ds: Dataset[Feat]= myData.toDS()
ds.show(10, false)
This compiles ok, but yields an Exception at runtime:
[error] a.a.OneForOneStrategy - java.lang.RuntimeException:
Error while encoding: java.lang.ClassCastException:
es.core.recommender.Feat cannot be cast to es.core.recommender.Feat
I'd love to include more of the stacktrace, but this glorious editor
won't let me...
It would be nice to know why this alternative did not work...

How can I handle TensorFlow sessions to train multiple Keras models at the same time?

I need to train multiple Keras models at the same time. I'm using the TensorFlow backend. The problem is that when I try to train, say, two models at the same time, I get the error "Attempting to use uninitialized value".
The error itself is not really relevant; the main problem seems to be that Keras forces the two models to be created in the same session with the same graph, so they conflict.
I am a newbie in TensorFlow, but my gut feeling is that the answer is pretty straightforward: you would have to create a different session for each Keras model and train each model in its own session. Could someone explain to me how this would be done?
I really hope it is possible to solve this problem while still using Keras and not coding everything in pure TensorFlow. Any workaround would be appreciated too.
You are right, Keras automatically works with the default session.
You could use tf.compat.v1.keras.backend.get_session() or tf.compat.v1.keras.backend.set_session(sess) to manually set the global Keras session (see documentation).
For instance:
import tensorflow as tf
sess1 = tf.compat.v1.Session()
tf.compat.v1.keras.backend.set_session(sess1)
# Train your first Keras model here ...
sess2 = tf.compat.v1.Session()
tf.compat.v1.keras.backend.set_session(sess2)
# Train your second Keras model here ...
I train multiple models in parallel by using Python's multiprocessing module, https://docs.python.org/3.4/library/multiprocessing.html.
I have a function that takes two parameters, an input queue and an output queue. This function runs in each process and has the following structure:
def worker(in_queue, out_queue):
    import keras
    while True:
        parameters = in_queue.get()
        network_parameters = parameters[0]
        train_inputs = parameters[1]
        train_outputs = parameters[2]
        test_inputs = parameters[3]
        test_outputs = parameters[4]
        # build the network based on the given parameters
        # train the network
        # test the network if required
        # (the steps above are expected to produce `result`)
        out_queue.put(result)
From the main python script start as many processes (and create as many in and out queues) as required. Add jobs to a worker by calling put on its in queue and get the results by calling get on its out queue.
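The driver side described above could look roughly like the following sketch. It is not the answer's actual code: the `run_jobs` helper, the `(job_id, values)` job format, the `None` stop sentinel and the summing "work" are all assumptions made for illustration, with the sum standing in for building and training a network. Each worker gets its own in queue and out queue, as described.

```python
import multiprocessing as mp

def worker(in_queue, out_queue):
    # Stand-in for the training worker: a "job" here is (job_id, list of numbers).
    while True:
        job = in_queue.get()
        if job is None:  # assumed sentinel meaning "no more work"
            break
        job_id, values = job
        out_queue.put((job_id, sum(values)))  # placeholder for the trained result

def run_jobs(jobs, n_workers=2):
    # One (in_queue, out_queue) pair and one process per worker.
    queues = [(mp.Queue(), mp.Queue()) for _ in range(n_workers)]
    procs = [mp.Process(target=worker, args=pair) for pair in queues]
    for proc in procs:
        proc.start()

    # Hand out jobs round-robin and remember how many each worker received.
    counts = [0] * n_workers
    for i, job in enumerate(jobs):
        k = i % n_workers
        queues[k][0].put(job)
        counts[k] += 1

    # Collect exactly as many results as were submitted, then stop the worker.
    results = {}
    for (in_q, out_q), n in zip(queues, counts):
        for _ in range(n):
            job_id, value = out_q.get()
            results[job_id] = value
        in_q.put(None)

    for proc in procs:
        proc.join()
    return results
```

From the main script this would be called as `run_jobs([(0, [1, 2, 3]), (1, [4, 5])])`, which should return `{0: 6, 1: 9}`. The call belongs inside an `if __name__ == "__main__":` guard so that child processes can import the module safely.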

python 3 requests_futures requests to same server in different processes

I am looking into parallelization of url requests onto one single webserver in python for the first time.
I would like to use requests_futures for this task as it seems that one can really split up processes onto several cores with the ProcessPoolExecutor.
The example code from the module documentation is:
from concurrent.futures import ThreadPoolExecutor
from requests_futures.sessions import FuturesSession
session = FuturesSession(executor=ThreadPoolExecutor(max_workers=2))
future_one = session.get('http://httpbin.org/get')
future_two = session.get('http://httpbin.org/get?foo=bar')
response_one = future_one.result()
print('response one status: {0}'.format(response_one.status_code))
print(response_one.content)
response_two = future_two.result()
print('response two status: {0}'.format(response_two.status_code))
print(response_two.content)
The above code works for me, however, I need some help with getting it customized to my needs.
I want to query the same server, let's say, 50 times (e.g. 50 different httpbin.org/get?... requests). What would be a good way to split these up onto different futures other than just defining future_one, ..._two and so on?
I am thinking about using different processes. According to the module documentation, it should be just a change in the first three lines of the above code:
from concurrent.futures import ProcessPoolExecutor
from requests_futures.sessions import FuturesSession
session = FuturesSession(executor=ProcessPoolExecutor(max_workers=2))
If I execute this I get the following error:
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
How do I get this running properly?
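On the first part of the question (avoiding future_one, future_two, ...), one common approach is to keep the futures in a collection keyed by URL. The sketch below uses only the standard library so that it runs without network access: `fetch` is a hypothetical stand-in for `session.get(url)` that returns a fake response, and `fetch_all` and its parameters are names made up for this example. Since HTTP requests are mostly I/O-bound, a thread pool is usually sufficient for this kind of work.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    # Hypothetical stand-in for session.get(url); returns a fake response dict.
    return {"url": url, "status_code": 200}

def fetch_all(urls, fetcher=fetch, max_workers=8):
    # Submit one task per URL and collect the results as they complete.
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(fetcher, url): url for url in urls}
        for future in as_completed(future_to_url):
            results[future_to_url[future]] = future.result()
    return results

# 50 different query strings against the same server.
urls = ['http://httpbin.org/get?n={0}'.format(i) for i in range(50)]
responses = fetch_all(urls)
print(len(responses))
```

With requests_futures itself, the same idea would be a list such as `futures = [session.get(url) for url in urls]`, followed by a loop calling `.result()` on each future.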

Spark streaming jobs duration in program

How do I get, from within my program (which is running the Spark Streaming job), the time taken for each RDD job?
For example:
val streamrdd = KafkaUtils.createDirectStream[String, String, StringDecoder,StringDecoder](ssc, kafkaParams, topicsSet)
val processrdd = streamrdd.map(some operations...).savetoxyz
In the above code, a job covering the map and save operations is run for each micro-batch RDD.
I want to get the time taken for each streaming job. I can see the jobs in the UI on port 4040, but I want to get the timings in the Spark code itself.
Pardon if my question is not clear.
You can use a StreamingListener in your Spark app. This interface provides a method, onBatchCompleted, that can give you the total time taken by the batch jobs.
context.addStreamingListener(new StatusListenerImpl());
StatusListenerImpl is a class that you have to write yourself as an implementation of StreamingListener.
The listener has other methods as well, which are worth exploring.

Apache Spark: Apply existing mllib model on Incoming DStreams/DataFrames

Using Apache Spark's mllib, I have a Logistic Regression model that I store in HDFS. This Logistic Regression model is trained on historical data coming in from some sensors.
I have another spark program that consumes streaming data from these sensors. I want to be able to use the pre-existing trained model to do predictions on incoming data stream. Note: I don't want my model to be updated by this data.
To load the training model, I'd have to use the following line in my code:
val logisticModel = LogisticRegressionModel.load(sc, <location of model>)
sc: spark context.
However, this application is a streaming application and hence already has a "StreamingContext" setup. Now, from what I've read, it is bad practice to have two contexts in the same program (even though it is possible).
Does this mean that my approach is wrong and I can't do what I'm trying to ?
Also, would it make more sense if I keep storing the stream data in a file and keep running logistic regression on that rather than trying to do it directly in the streaming application ?
StreamingContext can be created in a few ways, including two constructors which take an existing SparkContext:
StreamingContext(path: String, sparkContext: SparkContext) - where path is a path to a checkpoint file
StreamingContext(sparkContext: SparkContext, batchDuration: Duration)
So you can simply create SparkContext, load required models, and create StreamingContext:
val sc: SparkContext = ???
...
val ssc = new StreamingContext(sc, Seconds(1))
You can also get SparkContext using StreamingContext.sparkContext method:
val ssc: StreamingContext = ???
ssc.sparkContext: SparkContext

Resources