For example, suppose I have the following code:
vectorizer = CountVectorizer(input=u'filename', decode_error=u'replace')
classifier = OneVsRestClassifier(LinearSVC())
pipeline = Pipeline([
    ('vect', vectorizer),
    ('clf', classifier)])

with parallel_backend('distributed', scheduler_host=host_port):
    scores = cross_val_score(pipeline, X, y, cv=10)
If I execute this code I can see in the dask web view (through Bokeh) that 10 tasks are created, one for each fold. (I know X and y should be split into training and test sets, but this is just for testing purposes.) However, if I execute:
with parallel_backend('distributed', scheduler_host=host_port):
    pipeline.fit(X, y)
I can see one task being created for each y class (20 in my case). Is there a way to have cross_val_score run in parallel AND the underlying OneVsRestClassifier run in parallel? Or is the original code,
with parallel_backend('distributed', scheduler_host=host_port):
    scores = cross_val_score(pipeline, X, y, cv=10)
already running the OneVsRestClassifier in parallel along with cross_val_score, and I'm just not seeing it? Or will I have to implement this manually with dask-distributed?
The design of the parallel backends of joblib is currently too limited to handle nested parallel calls. This problem is tracked here: https://github.com/joblib/joblib/pull/538
We will also need to extend the distributed backend of joblib to use http://distributed.readthedocs.io/en/latest/api.html#distributed.get_client
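For reference, here is a minimal sketch (not the joblib backend itself; the scheduler address and function names are placeholders) of the pattern that distributed.get_client enables: a task running on a worker obtains the client and submits further tasks, which is the building block for this kind of nested parallelism.

from dask.distributed import Client, get_client, secede, rejoin

def fit_one_class(cls):
    # stand-in for fitting one binary sub-problem
    return cls

def fit_all_classes(classes):
    client = get_client()                 # client obtained from inside a worker task
    futures = client.map(fit_one_class, classes)
    secede()                              # free the worker thread while waiting
    results = client.gather(futures)
    rejoin()
    return results

client = Client('scheduler-address:8786')  # placeholder scheduler address
result = client.submit(fit_all_classes, list(range(20))).result()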
Related
A simple dask cache example: the cache does not work as expected. Assume we have a list of data and a series of delayed functions; the expectation is that when a function encounters the same input, its result is cached/memoized according to the cachey score. This example demonstrates that this is not the case.
import time

import numpy as np

import dask
from dask.cache import Cache
from dask.diagnostics import visualize
from dask.diagnostics import Profiler, ResourceProfiler, CacheProfiler

def slow_func(x):
    time.sleep(5)
    return x + 1

output = []
data = np.ones((100))
for x in data:
    a = dask.delayed(slow_func)(x)
    output.append(a)

total = dask.delayed(sum)(output)

cache = Cache(2e9)
cache.register()

with Profiler() as prof, ResourceProfiler(dt=0.25) as rprof, CacheProfiler() as cprof:
    total.compute()

visualize([prof, rprof, cprof])
[cache cprof plot]
After the initial parallel execution of the function, I would expect the next call with the same value to use a cached version, but it obviously does not. dask_key_name is for designating the same output, but I want to apply this function to a variety of inputs and, whenever the same input appears again, use the cached version. We can tell very easily whether this is happening with this function because of the 5-second delay: once the first value is cached, the computation should finish in roughly 5 seconds. Instead, this example delays every single function call by 5 seconds. I am able to create a memoized version using the cachey library directly, but this should work using dask.cache.
In dask.delayed you may need to specify the pure=True keyword.
You can verify that this worked because all of your dask delayed values will have the same key.
You don't need to use Cache for this if they are all in the same dask.compute call.
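A short sketch applying that suggestion to the example above (the inputs here are illustrative):

import time
import dask

def slow_func(x):
    time.sleep(5)
    return x + 1

# pure=True means calls with equal arguments share the same key
a = dask.delayed(slow_func, pure=True)(1)
b = dask.delayed(slow_func, pure=True)(1)
print(a.key == b.key)    # True, so the work is shared

total = dask.delayed(sum)([a, b])
print(total.compute())   # slow_func runs once, takes ~5 seconds, prints 4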
What is the best way to execute parallel queries from a gremlin-python Jupyter notebook against a Neptune cluster? I am trying to solve this using the multiprocessing package in Python. However, my three db.r5.4xlarge readers max out very soon at 100% CPU, as shown in the graphs below (graph 1 is CPU utilisation, graph 2 is Gremlin errors). Below is my code. Is there a way this can be tackled better using websockets? If so, could you please help me with that, since I am very new to Gremlin and Neptune.
from multiprocessing import Pool
# `neptune`, `__`, and `T` come from the Neptune/gremlin_python helper imports used in the notebook (not shown here).

def process_vertex(vertex_id, reg_date):
    g = neptune.graphTraversal(neptune_endpoint='neptune-endpoint', neptune_port=xxx1x)
    vertices = g.V(str(vertex_id)).repeat(__.both().dedup()).emit().project('id').by(T.id).toList()
    return vertices

params = [tuple(x) for x in new_registrations_list[['id', 'createddate']].values]
pool = Pool(42)
df = pool.starmap(process_vertex, params)
pool.close()
As a new Spark/PySpark user, I have a script running on an AWS t2.small EC2 instance in local mode (for testing purposes only). For example:
from __future__ import print_function
from pyspark.ml.classification import NaiveBayesModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession
import ritc  # my library

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("NaiveBayesExample")\
        .getOrCreate()

    ...

    request_dataframe = spark.createDataFrame(ritc.request_parameters, ["features"])
    model = NaiveBayesModel.load(ritc.model_path)

    ...

    prediction = model.transform(ritc.request_dataframe)
    prediction.createOrReplaceTempView("result")
    df = spark.sql("SELECT prediction FROM result")
    p = map(lambda row: row.asDict(), df.collect())

    ...
I have left out code so as to focus on my question, which relates to the speed of basic Spark statements such as spark = SparkSession...
Using the datetime library (not shown above), I have timings for the three biggest 'culprits':
'spark = SparkSession...' -- 3.7 secs
'spark.createDataFrame()' -- 2.6 secs
'NaiveBayesModel.load()' -- 3.4 secs
Why are these times so long?
To give a little background, I would like to provide the capability to expose scripts such as the above as REST services.
In a supervised context:
- service #1: train a model and save the model in the filesystem
- service #2: load the model from the filesystem and get a prediction for a single instance
(Note: the #2 REST requests would run at different, unanticipated (random) times. The general pattern would be:
-> once: train the model, expecting a long turnaround time
-> multiple times: request a prediction for a single instance, expecting a turnaround time in milliseconds, e.g. < 400 ms.)
Is there a flaw in my thinking? Can I expect to increase performance dramatically to achieve this goal of sub-second turnaround time?
In almost every article/video/discussion on Spark performance that I have come across, the emphasis has been on 'heavy' tasks. The 'train model' task above may indeed be a 'heavy' one, and I expect this will be the case when run in production. But the 'request a prediction for a single instance' task needs to be responsive.
Can anyone help?
Thanks in anticipation.
Colin Goldberg
Apache Spark is not really designed to be used in this way. You might want to look at Spark Streaming if your goal is to handle streaming input data for predictions. You may also want to look at other options for serving Spark models, such as PMML or MLeap.
I need to train multiple Keras models at the same time. I'm using the TensorFlow backend. The problem is, when I try to train, say, two models at the same time, I get "Attempting to use uninitialized value".
The error is not really relevant, the main problem seems to be that Keras is forcing the two models to be created in the same session with the same graph so it conflicts.
I am a newbie in TensorFlow, but my gut feeling is that the answer is pretty straightforward: you would have to create a different session for each Keras model and train each model in its own session. Could someone explain to me how this would be done?
I really hope it is possible to solve this problem while still using Keras and not coding everything in pure TensorFlow. Any workaround would be appreciated too.
You are right, Keras automatically works with the default session.
You could use tf.compat.v1.keras.backend.get_session() or tf.compat.v1.keras.backend.set_session(sess) to manually set the global Keras session (see documentation).
For instance:
import tensorflow as tf

sess1 = tf.compat.v1.Session()
tf.compat.v1.keras.backend.set_session(sess1)
# Train your first Keras model here ...

sess2 = tf.compat.v1.Session()
tf.compat.v1.keras.backend.set_session(sess2)
# Train your second Keras model here ...
I train multiple models in parallel by using Python's multiprocessing module (https://docs.python.org/3.4/library/multiprocessing.html). I have a function that takes two parameters, an input queue and an output queue; this function runs in each process. The function has the following structure:
def worker(in_queue, out_queue):
    import keras
    while True:
        parameters = in_queue.get()
        network_parameters = parameters[0]
        train_inputs = parameters[1]
        train_outputs = parameters[2]
        test_inputs = parameters[3]
        test_outputs = parameters[4]
        # build the network based on the given parameters
        # train the network
        # test the network if required
        out_queue.put(result)
From the main Python script, start as many processes (and create as many in and out queues) as required. Add jobs to a worker by calling put on its in queue, and get the results by calling get on its out queue, as in the sketch below.
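A minimal sketch of that main-script side (the number of workers, the data variables, and network_parameters are illustrative; worker is the function defined above):

from multiprocessing import Process, Queue

n_workers = 2
in_queues = [Queue() for _ in range(n_workers)]
out_queues = [Queue() for _ in range(n_workers)]

# One process per model, each with its own queue pair; daemon=True so the
# endless worker loops are killed when the main script exits.
processes = [Process(target=worker, args=(in_queues[i], out_queues[i]), daemon=True)
             for i in range(n_workers)]
for p in processes:
    p.start()

# Submit one job per worker; each tuple holds the five items worker unpacks
for i, q in enumerate(in_queues):
    q.put((network_parameters[i], train_inputs, train_outputs,
           test_inputs, test_outputs))

# Collect one result from each worker
results = [q.get() for q in out_queues]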
I've got some questions about the Spark framework.
First, if I want to write applications that run on Spark clusters, is it unavoidable to follow the map-reduce procedure? Since following the map-reduce procedure means a lot of code has to be rewritten into a parallelized form, I'm looking for a simple way to move my current project to a cluster with few changes to the code.
Second is about the spark-shell. I've tried to launch the spark-shell on a cluster using the following command: MASTER=spark://IP:PORT ./bin/spark-shell. Then I write some Scala code in the spark-shell, for example:
var count1 = 0
var ntimes = 10000
var index = 0
while (index < ntimes) {
  index += 1
  val t1 = Math.random()
  val t2 = Math.random()
  if (t1 * t1 + t2 * t2 < 1)
    count1 += 1
}
var pi = 4.0 * count1 / ntimes

val NUM_SAMPLES = 10000  // sample count, so the snippet runs standalone
val count2 = sc.parallelize(1 to NUM_SAMPLES).map { i =>  // sc is the SparkContext provided by spark-shell
  val x = Math.random()
  val y = Math.random()
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count2 / NUM_SAMPLES)
These snippets contain two different Pi calculation programs. I'm wondering whether all of this code runs on the cluster. My guess is that only the code inside the map{} function is executed on the cluster, while the other code is only executed on the master node, but I'm not sure whether that's correct.
Spark provides a more generic framework than simply Map & Reduce. If you examine the API you can find quite a few other functions that are more generic, such as aggregate. In addition, Spark supports features such as broadcast variables and accumulators that make parallel programming much more effective.
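A short PySpark sketch of those features (illustrative, not taken from the question; the data and names are made up), showing aggregate, a broadcast variable, and an accumulator:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FeaturesSketch").getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize(range(1, 101))

# aggregate: compute sum and count in a single pass (more general than map + reduce)
total, count = rdd.aggregate((0, 0),
                             lambda acc, x: (acc[0] + x, acc[1] + 1),
                             lambda a, b: (a[0] + b[0], a[1] + b[1]))

# broadcast variable: ship a read-only lookup table to every executor once
lookup = sc.broadcast({1: "one", 2: "two"})
labelled = rdd.map(lambda x: lookup.value.get(x, "other")).take(5)

# accumulator: executors add to it, the driver reads the final value
evens = sc.accumulator(0)
rdd.foreach(lambda x: evens.add(1) if x % 2 == 0 else None)

print(total, count, labelled, evens.value)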
On the second question (you really should separate the two):
Yes, the two codes are executed differently. If you want to take advantage of Spark's parallel capabilities, you have to use the RDD data structures. Until you understand how the RDD is distributed and how operations affect the RDD, it will be difficult to use Spark effectively.
Any code that is not executing in a method over an RDD is not parallel.