Neptune | Gremlin Python | Parallel queries using websockets

What is the best way to execute parallel queries from a Gremlin Python Jupyter notebook against a Neptune cluster? I am trying to solve this with Python's multiprocessing package, but my three db.r5.4xlarge readers max out at 100% CPU very quickly, as shown in the graphs below (graph 1 is CPU utilisation, graph 2 is Gremlin errors). My code follows. Is there a better way to tackle this using websockets? If so, could you please help me with it, since I am very new to Gremlin and Neptune?
from multiprocessing import Pool

def process_vertex(vertex_id, reg_date):
    # A new Neptune connection is opened for every vertex processed.
    g = neptune.graphTraversal(neptune_endpoint='neptune-endpoint', neptune_port=xxx1x)
    return g.V(str(vertex_id)).repeat(__.both().dedup()).emit().project('id').by(T.id).toList()

params = [tuple(x) for x in new_registrations_list[['id', 'createddate']].values]
pool = Pool(42)
df = pool.starmap(process_vertex, params)
pool.close()
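One way to cut connection churn, sketched under assumptions (the endpoint, port, and pool size below are placeholders, not real values): gremlin-python's driver Client keeps a pool of websocket connections, so each worker process can create one Client up front and reuse it for many queries instead of opening a fresh connection per vertex:

from gremlin_python.driver import client

# Sketch: one websocket-backed Client per worker process, reused across queries.
gremlin_client = client.Client(
    'wss://neptune-endpoint:8182/gremlin',  # placeholder endpoint and port
    'g',
    pool_size=4)  # number of websocket connections this client keeps open

def fetch_connected_ids(vertex_id):
    query = ("g.V('{0}').repeat(__.both().dedup()).emit()"
             ".project('id').by(T.id)").format(vertex_id)
    # submit() sends the query over an already-open websocket;
    # all().result() blocks until the full result list arrives.
    return gremlin_client.submit(query).all().result()

Note that if the traversal itself is expensive (repeat(both()) can touch large neighbourhoods), the readers may still saturate; reducing the number of concurrent workers may help as much as connection reuse.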

Related

How to speed up basic pyspark statements

As a new Spark/PySpark user, I have a script running on an AWS t2.small EC2 instance in local mode (for testing purposes only). For example:
from __future__ import print_function
from pyspark.ml.classification import NaiveBayesModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession
import ritc  # my library

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("NaiveBayesExample")\
        .getOrCreate()
    ...
    request_dataframe = spark.createDataFrame(ritc.request_parameters, ["features"])
    model = NaiveBayesModel.load(ritc.model_path)
    ...
    prediction = model.transform(request_dataframe)
    prediction.createOrReplaceTempView("result")
    df = spark.sql("SELECT prediction FROM result")
    p = map(lambda row: row.asDict(), df.collect())
    ...
I have left out code so as to focus on my question, which relates to the speed of basic Spark statements such as spark = SparkSession...
Using the datetime library (not shown above), I have timings for the three biggest 'culprits':
'spark = SparkSession...' -- 3.7 secs
'spark.createDataFrame()' -- 2.6 secs
'NaiveBayesModel.load()' -- 3.4 secs
Why are these times so long?
To give a little background, I would like to provide the capability to expose scripts such as the above as REST services.
In a supervised learning context:
- service #1: train a model and save the model in the filesystem
- service #2: load the model from the filesystem and get a prediction for a single instance
Note: the #2 REST requests would arrive at different, unanticipated (random) times. The general pattern would be:
-> once: train the model, expecting a long turnaround time
-> multiple times: request a prediction for a single instance, expecting a turnaround time in milliseconds, e.g. < 400 ms
Is there a flaw in my thinking? Can I expect to increase performance dramatically to achieve this goal of sub-second turnaround time?
In almost every article/video/discussion on Spark performance that I have come across, the emphasis has been on 'heavy' tasks. The 'train model' task above may indeed be a 'heavy' one - I expect this will be the case when run in production. But the 'request a prediction for a single instance' task needs to be responsive.
Can anyone help?
Thanks in anticipation.
Colin Goldberg
So Apache Spark is designed to be used in this way. You might want to look at Spark Streaming if your goal is to handle streaming input data for predictions. You may also want to look at other options for serving Spark models, such as PMML or MLeap.
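To illustrate the serving pattern hinted at above, a minimal sketch (assuming Flask; the route, model path, and feature format are illustrative, not from the original post): create the SparkSession and load the model once at process startup, so each request pays only the transform cost rather than the multi-second initialization measured above.

from flask import Flask, jsonify, request
from pyspark.ml.classification import NaiveBayesModel
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

app = Flask(__name__)
# Paid once at startup, not per request.
spark = SparkSession.builder.appName("NaiveBayesService").getOrCreate()
model = NaiveBayesModel.load("/path/to/saved/model")  # placeholder path

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]  # e.g. [0.1, 2.0, ...]
    df = spark.createDataFrame([(Vectors.dense(features),)], ["features"])
    row = model.transform(df).select("prediction").first()
    return jsonify(prediction=row["prediction"])

Even with a warm session, each request still pays a createDataFrame/transform round trip through the Spark scheduler, so sub-400 ms is plausible but not guaranteed on a t2.small.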

How can I handle TensorFlow sessions to train multiple Keras models at the same time?

I need to train multiple Keras models at the same time. I'm using the TensorFlow backend. The problem is that when I try to train, say, two models at the same time, I get an 'Attempting to use uninitialized value' error.
The error is not really relevant, the main problem seems to be that Keras is forcing the two models to be created in the same session with the same graph so it conflicts.
I am a newbie in TensorFlow, but my gut feeling is that the answer is pretty straightforward: you would have to create a different session for each Keras model and train each model in its own session. Could someone explain to me how this would be done?
I really hope it is possible to solve this problem while still using Keras and not coding everything in pure TensorFlow. Any workaround would be appreciated too.
You are right, Keras automatically works with the default session.
You could use tf.compat.v1.keras.backend.get_session() or tf.compat.v1.keras.backend.set_session(sess) to manually set the global Keras session (see documentation).
For instance:
sess1 = tf.compat.v1.Session()
tf.compat.v1.keras.backend.set_session(sess1)
# Train your first Keras model here ...

sess2 = tf.compat.v1.Session()
tf.compat.v1.keras.backend.set_session(sess2)
# Train your second Keras model here ...
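One hedged addition (based on TF1-style semantics, where each session is bound to a graph): if the uninitialized-value error persists, give each model its own graph as well, for instance:

import tensorflow as tf

# Sketch: pair each session with its own graph so the two models cannot
# collide in the default graph. Names are illustrative.
graph1 = tf.Graph()
with graph1.as_default():
    sess1 = tf.compat.v1.Session(graph=graph1)
    tf.compat.v1.keras.backend.set_session(sess1)
    # build and train the first Keras model here ...

graph2 = tf.Graph()
with graph2.as_default():
    sess2 = tf.compat.v1.Session(graph=graph2)
    tf.compat.v1.keras.backend.set_session(sess2)
    # build and train the second Keras model here ...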
I train multiple models in parallel by using Python's multiprocessing module, https://docs.python.org/3.4/library/multiprocessing.html.
I have a function that takes two parameters, an input queue and an output queue, this function runs in each process. The function has the following structure:
def worker(in_queue, out_queue):
    import keras  # imported inside the worker so each process gets its own backend state
    while True:
        parameters = in_queue.get()
        network_parameters = parameters[0]
        train_inputs = parameters[1]
        train_outputs = parameters[2]
        test_inputs = parameters[3]
        test_outputs = parameters[4]
        # build the network based on the given parameters
        # train the network
        # test the network if required
        out_queue.put(result)
From the main Python script, start as many processes (and create as many in and out queues) as required. Add jobs to a worker by calling put on its in queue, and get the results by calling get on its out queue.
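A sketch of that main-script side, under assumptions (the worker count and the job payload names, such as network_parameter_sets, are illustrative, not from the original answer):

from multiprocessing import Process, Queue

n_workers = 2
workers = []
for _ in range(n_workers):
    in_q, out_q = Queue(), Queue()
    Process(target=worker, args=(in_q, out_q), daemon=True).start()
    workers.append((in_q, out_q))

# Submit one job per worker, then collect the results.
for (in_q, _), params in zip(workers, network_parameter_sets):  # hypothetical list
    in_q.put((params, train_inputs, train_outputs, test_inputs, test_outputs))
results = [out_q.get() for (_, out_q) in workers]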

Scikit-Learn with Dask-Distributed using nested parallelism?

For example, suppose I have the following code:
vectorizer = CountVectorizer(input=u'filename', decode_error=u'replace')
classifier = OneVsRestClassifier(LinearSVC())
pipeline = Pipeline([
    ('vect', vectorizer),
    ('clf', classifier)])

with parallel_backend('distributed', scheduler_host=host_port):
    scores = cross_val_score(pipeline, X, y, cv=10)
If I execute this code I can see in the dask web view (through Bokeh) that 10 tasks are created (one for each fold). However, if I execute the following (I know X and y should be split into training and testing sets, but this is just for testing purposes):

with parallel_backend('distributed', scheduler_host=host_port):
    pipeline.fit(X, y)
I can see one task being created for each y class (20 in my case). Is there a way to have cross_val_score run in parallel AND the underlying OneVsRestClassifier run in parallel? Or is the original code,

with parallel_backend('distributed', scheduler_host=host_port):
    scores = cross_val_score(pipeline, X, y, cv=10)

already running the OneVsRestClassifier in parallel along with cross_val_score, and I'm just not seeing it? Will I have to implement this manually with dask-distributed?
The design of the parallel backends of joblib is currently too limited to handle nested parallel calls. This problem is tracked here: https://github.com/joblib/joblib/pull/538
We will also need to extend the distributed backend of joblib to use http://distributed.readthedocs.io/en/latest/api.html#distributed.get_client
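For reference, a sketch of the dask-native joblib backend that later replaced the scheduler_host style (the scheduler address is a placeholder; note that, per the answer above, nested parallel calls may still not fan out):

import joblib
from dask.distributed import Client
from sklearn.model_selection import cross_val_score

# Creating a Client registers the 'dask' joblib backend.
client = Client('scheduler-host:8786')  # placeholder scheduler address
with joblib.parallel_backend('dask'):
    scores = cross_val_score(pipeline, X, y, cv=10)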

Mahout - ParallelALSFactorizationJob running too long?

I am trying to run a Mahout ALS recommendation job on an AWS EMR cluster; however, it takes much longer than I expected.
The following is the command I run:
aws emr add-steps --cluster-id <cluster_id> \
--steps Type=CUSTOM_JAR,\
Name="Mahout ALS Factorization Job",\
Jar=s3://<my_bucket>/recproto/mahout-mr-0.10.0-job.jar,\
MainClass=org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob,\
Args=["--input","s3://<my_bucket>/recproto/trainingdata/userClicks.csv.gz",\
"--output","s3://<my_bucket>/recproto/als-output/",\
"--implicitFeedback","true",\
"--lambda","150",\
"--alpha","0.05",\
"--numFeatures","100",\
"--numIterations","3",\
"--numThreadsPerSolver","4",\
"--usesLongIDs","true"]
The userClicks.csv file contains 1,567,808 ratings from 335,636 users across 23,934 items.
The job runs on a 10-node c3.xlarge EMR cluster and has now been running for more than 2 hours. Is this normal? For a rating file of this scale, what EMR cluster size and parameters should I use to get a more acceptable running time?
I solved this problem by simply using Spark ALS. The training process takes less than 2 minutes on my laptop with the same dataset and the same parameters.
I can now understand why some machine learning algorithms are deprecated due to performance issues... (e.g., the MinHash algorithm)
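For comparison, a sketch of what the Spark ALS equivalent might look like (the CSV path, schema, and column names are assumptions; the hyperparameters mirror the Mahout flags above):

from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ALSOnClicks").getOrCreate()
ratings = spark.read.csv("userClicks.csv",  # placeholder path
                         schema="user INT, item INT, clicks FLOAT")
# rank=numFeatures, maxIter=numIterations, regParam=lambda, alpha=alpha.
als = ALS(rank=100, maxIter=3, regParam=150.0, alpha=0.05,
          implicitPrefs=True,
          userCol="user", itemCol="item", ratingCol="clicks")
model = als.fit(ratings)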

Neo4j becomes extremely slow after serving a query

I have a medium-sized Neo4j database with about 700,000 nodes and 1-5 outgoing relationships per node.
If I use the browser interface to query nodes on an indexed attribute and find adjacent nodes, it takes about 1500 ms, which is fine for me.
MATCH (n {id_str : 'some_id'})-->(child) return child.id_str
...
Returned 2 rows in 1655 ms
But if I run a similar Cypher query that mentions the relationship type, using the Ruby Neography library, it takes a couple of minutes to complete.
lookup_links = "MATCH (n {id_str : {id_str}})-[:internal_link]->(child) return child.id_str"
links = @neo.execute_query(lookup_links, :id_str => id_str)
And after that regular browser queries become extremely slow too, taking about two minutes each.
MATCH (n2 {id_str : 'some_id'})-->(child) return child.id_str
Returned 2 rows in 116201 ms
I ran the experiments on a 64-bit Ubuntu 14.04 laptop with 8 GB RAM and a 1 GB heap for Neo4j. The Neo4j version is 2.1.3, installed from the official deb package. The Neography version is 1.6.0. I use MRI 1.9.3.
I've taken a stack dump using kill -3 while Neo4j is busy serving the query.
https://gist.github.com/akamaus/a06bc9e04c7209c480e9
Any ideas what's going wrong and how to investigate it?
