I've been reading about PySpark caching and how execution works. It is clear to me how .cache() can help when multiple actions trigger the same computation:
df = sc.sql("select * from table")
df.count()
df = df.where({something})
df.count()
can be improved by doing:
df = sc.sql("select * from table").cache()
df.count()
df = df.where({something})
df.count()
However, it is not clear to me whether, and why, it would be advantageous without intermediate actions:
df = sc.sql("select * from table")
df2 = sc.sql("select * from table2")
df = df.where({something})
df2 = df2.where({something})
df3 = df.join(df2).where({something})
df3.count()
In this type of code (where we have only one final action) is cache() useful?
Being straight to the point: no, in that case it would not be useful.
Transformations are lazily evaluated in Spark, i.e., they are only recorded, and execution needs to be triggered by an action (such as your count).
So, when you execute df3.count() it will evaluate all the transformations up to that point.
If you do not perform another action, then it is certain that adding .cache() anywhere will not provide any performance improvement.
However, even if you do more than one action, .cache() [or .checkpoint(), depending on your problem] sometimes does not provide any performance increase. It highly depends on your problem and on the transformation costs you have - e.g., a join can be very costly.
Also, if you are running Spark in its interactive shell, .checkpoint() can sometimes be better suited after costly transformations.
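To make the lazy evaluation concrete, here is a small sketch (assuming a SparkSession named spark; the range and the transformations below are just stand-ins for your spark.sql queries):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(10**8)                    # transformation: nothing is computed yet
df = df.where(df.id % 2 == 0)              # transformation: still nothing computed
df = df.selectExpr("id * 2 AS doubled")    # transformation: still nothing computed
print(df.count())                          # action: the whole chain above is evaluated now

Only the final count() does any work, which is why caching an intermediate result pays off only when a second action would otherwise recompute that same chain.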
So I have a project idea that requires me to process incoming real-time data and constantly track some metrics about it. Then, every now and then, I want to be able to query the metrics I am calculating and do some stuff with that data.
Currently I have a simple Python script that uses the socket library to get the realtime data. It is basically just...
metric1 = 0
metric2 = ''
while True:
    response = sock.recv(512).decode('utf-8')
    if response.startswith('PING'):
        sock.send("PONG\n".encode('utf-8'))
    else:
        process(response)
In the above, process(response) will update metric1 and metric2 with data from each response (for example, they might be the mean len(response) and the most common response, respectively).
What I want to do is run the above script constantly after starting up the project, and occasionally query metric1 and metric2 from a script I have running locally. I am guessing that I will have to look into running code on a server, which I have very little experience with.
What are the most accessible tools to do what I want? I am pretty comfortable with a variety of languages, so if there is a library or tool in another language that is better suited for all of this, please tell me about it.
Thanks!
I worked on a similar project, not sure if it specifically can be applied to your case, but maybe it can give you a starting point.
I am very aware it's not best practice to use pandas DataFrames for real-time purposes, but in my case it's just fast enough (I am actually open to suggestions on how to improve my workflow!). Here is my code:
all_prices = pd.DataFrame()

def readprice():
    global all_prices
    msg = mysock.recv(16384)
    msg_stringa = str(msg, 'utf-8')
    new_price = pd.read_csv(StringIO(msg_stringa), sep=";", error_bad_lines=False,
                            index_col=None, header=None, engine='c',
                            names=range(33), decimal='.')
    ...
    ...
    all_prices = all_prices.append(new_price, ignore_index=True).copy()
So the 'all_prices' pandas DataFrame is global, and new prices get appended to it. This global DF can then be used by other functions to read the content, etc. Be very careful about sharing variables between two or more threads; it can lead to errors.
More info here: http://www.laurentluce.com/posts/python-threads-synchronization-locks-rlocks-semaphores-conditions-events-and-queues/
In my case, I don't share the DF with a parallel thread; other threads are launched after the append, not while it is happening.
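If you ever do need to read the DF from a parallel thread while it is being updated, the simplest protection is a lock. A minimal sketch (the threading.Lock and the function names below are my additions, not part of my code above):

import threading
import pandas as pd

all_prices = pd.DataFrame()
prices_lock = threading.Lock()

def append_price(new_price):
    global all_prices
    with prices_lock:                          # only one thread touches all_prices at a time
        all_prices = pd.concat([all_prices, new_price], ignore_index=True)

def read_metrics():
    with prices_lock:                          # take a consistent snapshot for readers
        snapshot = all_prices.copy()
    return len(snapshot)                       # compute whatever metrics you need on the snapshot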
I have a program written to parallelize the process, and cache has been applied after certain transformations on DataFrames. Let's say:
df1 = df.filter()
df3 = df1.join(df2, join_cond, "left")
df3.cache() #ex: it has col1, col2, col3, col4 columns
After the cache, we have some other steps to take care of:
#1
df4 = df3.select(df3.col1, df3.col2)
df4.filter(df3.col1 > 500).show()
#2
df5 = df3.select(df3.col3, df3.col4)
df5.filter(df3.col4 > 2000)
df3.unpersist()
So, in this process, if any issue or error occurs, do we have to uncache the dataframe, or will the old cache be discarded automatically when we rerun the program?
Could you please help me understand how cache() behaves if there is any kind of failure in the program at a certain point in time.
Thanks
cache() persists the result of the lazy evaluation in memory, so after the cache, any transformation can start directly from scanning the df in memory instead of recomputing it.
Action vs. transformation: an action produces a non-RDD, non-DataFrame object (like .show() in your code), while a transformation produces another RDD/Spark DataFrame (like .filter, .select, and .join in your code).
Based on your code snippet alone, there is no problem: df4's dependency is just the scanned df3 -> df4, and there is only one action. But if you want to call df5.filter().show() or df4.show() again, it becomes a problem. Because you unpersisted df3, there is no data in memory, and in order to regenerate df4 the Spark application needs to start again from df1 -> df2 -> df3 -> df4.
Does unpersist break your code? No, but it definitely influences the performance of your application. I would double-check whether a persisted df is still needed by any further downstream job, and only then unpersist it.
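A minimal sketch of the ordering I would use, based on your snippet (the try/finally is my addition; it just guarantees that unpersist also runs if one of the actions fails, and in any case a cache never outlives the Spark application itself):

df3 = df1.join(df2, join_cond, "left").cache()   # col1, col2, col3, col4

try:
    # run every action that reads from df3 while it is still cached
    df4 = df3.select(df3.col1, df3.col2)
    df4.filter(df3.col1 > 500).show()            # action 1: served from the cached df3

    df5 = df3.select(df3.col3, df3.col4)
    df5.filter(df3.col4 > 2000).show()           # action 2: also served from the cache
finally:
    df3.unpersist()                              # safe now: nothing downstream needs df3 anymore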
I am trying to run a CNN on the cloud (Google Cloud ML) because my laptop does not have a GPU card.
So I uploaded my data on Google Cloud Storage. A .csv file with 1500 entries, like so:
| label   | img_path   |
| label_1 | /img_1.jpg |
| label_2 | /img_2.jpg |
and the corresponding 1500 jpgs.
My input_fn looks like so:
def input_fn(filename,
             batch_size,
             num_epochs=None,
             skip_header_lines=1,
             shuffle=False):
    filename_queue = tf.train.string_input_producer(filename, num_epochs=num_epochs)
    reader = tf.TextLineReader(skip_header_lines=skip_header_lines)
    _, row = reader.read(filename_queue)
    row = parse_csv(row)
    pt = row.pop(-1)
    pth = filename.rpartition('/')[0] + pt
    img = tf.image.decode_jpeg(tf.read_file(tf.squeeze(pth)), 1)
    img = tf.to_float(img) / 255.
    img = tf.reshape(img, [IMG_SIZE, IMG_SIZE, 1])
    row = tf.concat(row, 0)
    if shuffle:
        return tf.train.shuffle_batch(
            [img, row],
            batch_size,
            capacity=2000,
            min_after_dequeue=2 * batch_size + 1,
            num_threads=multiprocessing.cpu_count(),
        )
    else:
        return tf.train.batch([img, row],
                              batch_size,
                              allow_smaller_final_batch=True,
                              num_threads=multiprocessing.cpu_count())
Here is what the full graph looks like (very simple CNN indeed):
Running the training with a batch size of 200 on my laptop (where the data is stored locally), most of the compute time is spent on the gradients node, which is what I would expect. The batch node has a compute time of ~12ms.
When I run it on the cloud (scale-tier is BASIC), the batch node takes more than 20s, and the bottleneck seems to be coming from the QueueDequeueUpToV2 subnode, according to TensorBoard:
Does anyone have any clue why this happens? I am pretty sure I am getting something wrong here, so I'd be happy to learn.
A few remarks:
-Switching between batch/shuffle_batch with different min_after_dequeue values has no effect.
-When using BASIC_GPU, the batch node is also on the CPU, which is normal according to what I read, and it takes roughly 13s.
-Adding a time.sleep after queues are started to ensure no starvation also has no effect.
-Compute time is indeed linear in batch_size, so with a batch_size of 50, the compute time would be 4 times smaller than with a batch_size of 200.
Thanks for reading and would be happy to give more details if anyone needs.
Best,
Al
Update:
-The Cloud ML instance and the buckets were not in the same region; putting them in the same region improved the result 4x.
-Creating a .tfrecords file brought the batching down to 70ms, which seems acceptable. I used this blog post as a starting point to learn about it, and I recommend it.
I hope this will help others to create a fast data input pipeline!
Try converting your images to TFRecord format and reading them directly from the graph. The way you are doing it, there is no possibility of caching, and if your images are small, you are not taking advantage of the high sustained reads from Cloud Storage. Saving all your jpg images into a single TFRecord file, or a small number of files, will help.
Also, make sure your bucket is a single-region bucket in a region that has GPUs, and that you are submitting to Cloud ML in that region.
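A rough sketch of what the conversion could look like with the TF 1.x API (the file names, CSV layout, and feature keys below are assumptions based on your question, not tested code):

import csv
import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

with tf.python_io.TFRecordWriter('train.tfrecords') as writer:
    with open('labels.csv') as f:
        rows = csv.reader(f)
        next(rows)                                   # skip the header line
        for label, img_path in rows:
            with open(img_path, 'rb') as img_file:
                img_bytes = img_file.read()          # raw JPEG bytes, decoded later in the graph
            example = tf.train.Example(features=tf.train.Features(feature={
                'label': _bytes_feature(label.encode('utf-8')),
                'image': _bytes_feature(img_bytes),
            }))
            writer.write(example.SerializeToString())

At training time you read this back with tf.TFRecordReader (or tf.data.TFRecordDataset) and tf.parse_single_example, so the images arrive in a few large sequential reads instead of 1500 tiny ones.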
I've had a similar problem before. I solved it by changing tf.train.batch() to tf.train.batch_join(). In my experiment, with a batch size of 64 and 4 GPUs, it took 22 minutes using tf.train.batch(), whilst it only took 2 minutes using tf.train.batch_join().
From the TensorFlow documentation:
If you need more parallelism or shuffling of examples between files, use multiple reader instances using the tf.train.shuffle_batch_join
https://www.tensorflow.org/programmers_guide/reading_data
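For reference, here is a sketch of how the question's input_fn could be adapted to batch_join (parse_csv, IMG_SIZE, and the path handling are assumed unchanged from the question; I have not run this):

import multiprocessing
import tensorflow as tf

def read_one_example(filename, filename_queue):
    reader = tf.TextLineReader(skip_header_lines=1)
    _, row = reader.read(filename_queue)
    row = parse_csv(row)
    pt = row.pop(-1)
    pth = filename.rpartition('/')[0] + pt
    img = tf.image.decode_jpeg(tf.read_file(tf.squeeze(pth)), 1)
    img = tf.to_float(img) / 255.
    img = tf.reshape(img, [IMG_SIZE, IMG_SIZE, 1])
    return img, tf.concat(row, 0)

filename_queue = tf.train.string_input_producer(filename, num_epochs=None)

# one (img, row) pair per entry; batch_join gives each entry its own reader thread
example_list = [read_one_example(filename, filename_queue)
                for _ in range(multiprocessing.cpu_count())]

img_batch, row_batch = tf.train.batch_join(example_list,
                                           batch_size=200,
                                           capacity=2000,
                                           allow_smaller_final_batch=True)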
I am using H2O on a large dataset: 8 million rows and 10 columns. I trained my random forest using h2o.randomForest. The model trained fine and prediction also worked correctly. Now I would like to convert my predictions to a data.frame. I did this:
A2=h2o.predict(m1,Tr15_h2o)
pred2=as.data.frame(A2)
but it is too slow; it takes forever. Is there any faster way to do the conversion from H2O to a data.frame or data.table?
Here is some code which demonstrates how to use the data.table package on the backend, along with some benchmarks on my macbook:
library(h2o)
h2o.init(nthreads = -1, max_mem_size = "16G")
hf <- h2o.createFrame(rows = 10000000)
options("h2o.use.data.table"=FALSE) #no data.table
system.time(df <- as.data.frame(hf))
# user system elapsed
# 224.387 13.274 272.252
options("datatable.verbose"=TRUE)
options("h2o.use.data.table"=TRUE) # use data.table
system.time(df2 <- as.data.frame(hf))
# user system elapsed
# 50.686 4.020 82.946
You can get more detailed info when using data.table if you turn on this option: options("datatable.verbose"=TRUE).
We have seen this issue with large prediction datasets, where exporting to a prediction dataframe or converting it to other types takes a long time. I have opened the following JIRA to track it:
https://0xdata.atlassian.net/browse/PUBDEV-4166
Yes, there are some new options that turn on the use of data.table::fread to speed it up. Type h2o:::as.data.frame.H2OFrame to see the small amount of R source code containing the options, or see the H2O release notes. Please also try the latest fread from dev, which is now parallel as of yesterday.
Once users have reported success, we can turn the option on by default.
Please excuse the broadness of this question. Maybe once I know more I can ask something more specific.
I have a performance-sensitive piece of TensorFlow code. From the perspective of someone who knows little about GPU programming, I would like to know what guides or strategies would be a "good place to start" for optimizing my code (single GPU).
Perhaps even a readout of how long is spent on each TensorFlow op would be nice...
I have a vague understanding that:
Some operations go faster when assigned to a CPU rather than a GPU, but it's not clear which.
There is a piece of Google software called "EEG" that I read about in a paper that may one day be open sourced.
There may also be other common factors at play that I am not aware of...
I wanted to give a more complete answer about how to use the Timeline object to get the time of execution for each node in the graph:
you use a classic sess.run() but specify the arguments options and run_metadata
you then create a Timeline object with the run_metadata.step_stats data
Here is some example code:
import tensorflow as tf
from tensorflow.python.client import timeline

x = tf.random_normal([1000, 1000])
y = tf.random_normal([1000, 1000])
res = tf.matmul(x, y)

# Run the graph with full trace option
with tf.Session() as sess:
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(res, options=run_options, run_metadata=run_metadata)

    # Create the Timeline object, and write it to a json
    tl = timeline.Timeline(run_metadata.step_stats)
    ctf = tl.generate_chrome_trace_format()
    with open('timeline.json', 'w') as f:
        f.write(ctf)
You can then open Google Chrome, go to the page chrome://tracing and load the timeline.json file.
You should see something like: