tensorflow code optimization strategy

tensorflow code optimization strategy - performance

Please excuse the broadness of this question. Maybe once I know more perhaps I can ask more specifically.
I have performance sensitive piece of tensorflow code. From the perspective of someone who knows little about gpu programming, I would like to know what guides or strategies would be a "good place to start" to optimizing my code. (single gpu)
Perhaps even a readout of how long was spent on each tensorflow op would be nice...
I have a vague understanding that
Some operations go faster when assigned to a cpu rather than a gpu, but it's not clear which
There is a piece of google software called "EEG" that I read about in a
paper that may one day be open sourced.
There may also be other common factors at play that I am not aware of..

I wanted to give a more complete answer about how to use the Timeline object to get the time of execution for each node in the graph:
you use a classic sess.run() but specifying arguments options and run_metadata
you then create a Timeline object with the run_metadata.step_stats data
Here is in example code:
import tensorflow as tf
from tensorflow.python.client import timeline
x = tf.random_normal([1000, 1000])
y = tf.random_normal([1000, 1000])
res = tf.matmul(x, y)
# Run the graph with full trace option
with tf.Session() as sess:
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(res, options=run_options, run_metadata=run_metadata)
# Create the Timeline object, and write it to a json
tl = timeline.Timeline(run_metadata.step_stats)
ctf = tl.generate_chrome_trace_format()
with open('timeline.json', 'w') as f:
f.write(ctf)
You can then open Google Chrome, go to the page chrome://tracing and load the timeline.json file.
You should something like:

Related

Working with a constant stream of realtime data

So I have a project idea that requires me to process incoming realtime data and constantly track some metrics about the realtime data. Then every now and then I want to be able to request for the metrics I am calculating and do some stuff with that data.
Currently I have a simple Python script that uses the socket library to get the realtime data. It is basically just...
metric1 = 0
metric2 = ''
while True:
response = socket.recv(512).decode('utf-8')
if response.startswith('PING'):
sock.send("PONG\n".encode('utf-8'))
else:
process(response)
In the above process(response) will update metric1 and metric2 with data from each response. (For example they might be mean len(response) and most common response respectively)
What I want to do is run the above script constantly after starting up the project and occasionally query for metric1 and metric2 in a script I have running locally. I am guessing that I will have to look into running code on a server which I have very little experience with.
What are the most accessible tools to do what I want? I am pretty comfortable with a variety of languages so if there is a library or tool in another language that is better suited for all of this, please tell me about it
Thanks!

I worked on a similar project, not sure if it specifically can be applied to your case, but maybe it can give you a starting point.
Although I am very aware it's not best practice to use Pandas Dataframes for real-time purposes, in my case it's just fast enough (I am actually open to suggestions on how to improve my workflow!), here is my code:
all_prices = pd.Dataframe()
readprice():
global all_prices
msg = mysock.recv(16384)
msg_stringa=str(msg,'utf-8')
new_price = pd.read_csv(StringIO(msg_stringa) , sep=";", error_bad_lines=False,
index_col=None, header=None, engine='c', names=range(33),
decimal = '.')
...
...
all_prices = all_prices.append(new_price, ignore_index=True).copy()
So 'all_prices' Pandas Dataframe is global, new prices get appended to the general 'all_prices' DF . This global DF can be used by other functions in order to read the content ect. Be very careful about the variable sharing between two or more threads, it can lead to errors.
More info here: http://www.laurentluce.com/posts/python-threads-synchronization-locks-rlocks-semaphores-conditions-events-and-queues/
In my case, I don't share the DF to a parallel thread, other threads are launched after the append, not in the meantime.

Tensorflow dequeue is very slow on Cloud ML

I am trying to run a CNN on the cloud (Google Cloud ML) because my laptop does not have a GPU card.
So I uploaded my data on Google Cloud Storage. A .csv file with 1500 entries, like so:
| label | img_path |
| label_1| /img_1.jpg |
| label_2| /img_2.jpg |
and the corresponding 1500 jpgs.
My input_fn looks like so:
def input_fn(filename,
batch_size,
num_epochs=None,
skip_header_lines=1,
shuffle=False):
filename_queue = tf.train.string_input_producer(filename, num_epochs=num_epochs)
reader = tf.TextLineReader(skip_header_lines=skip_header_lines)
_, row = reader.read(filename_queue)
row = parse_csv(row)
pt = row.pop(-1)
pth = filename.rpartition('/')[0] + pt
img = tf.image.decode_jpeg(tf.read_file(tf.squeeze(pth)), 1)
img = tf.to_float(img) / 255.
img = tf.reshape(img, [IMG_SIZE, IMG_SIZE, 1])
row = tf.concat(row, 0)
if shuffle:
return tf.train.shuffle_batch(
[img, row],
batch_size,
capacity=2000,
min_after_dequeue=2 * batch_size + 1,
num_threads=multiprocessing.cpu_count(),
)
else:
return tf.train.batch([img, row],
batch_size,
allow_smaller_final_batch=True,
num_threads=multiprocessing.cpu_count())
Here is what the full graph looks like (very simple CNN indeed):
Running the training with a batch size of 200, then most of the compute time on my laptop (on my laptop, the data is stored locally) is spent on the gradients node which is what I would expect. The batch node has a compute time of ~12ms.
When I run it on the cloud (scale-tier is BASIC), the batch node takes more than 20s. And the bottleneck seems to be coming from the QueueDequeueUpToV2 subnode according to tensorboard:
Anyone has any clue why this happens? I am pretty sure I am getting something wrong here, so I'd be happy to learn.
Few remarks:
-Changing between batch/shuffle_batch with different min_after_dequeue does not affect.
-When using BASIC_GPU, the batch node is also on the CPU which is normal according to what I read and it takes roughly 13s.
-Adding a time.sleep after queues are started to ensure no starvation also has no effect.
-Compute time is indeed linear in batch_size, so with a batch_size of 50, the compute time would be 4 times smaller than with a batch_size of 200.
Thanks for reading and would be happy to give more details if anyone needs.
Best,
Al
Update:
-Cloud ML instance and Buckets were not in the same region, making them in the same region improved result 4x.
-Creating a .tfrecords file made the batching take 70ms which seems to be acceptable. I used this blog post as a starting point to learn about it, I recommend it.
I hope this will help others to create a fast data input pipeline!

Try converting your images to tfrecord format and read them directly from graph. The way you are doing it, there is no possibility of caching and if your images are small, you are not taking advantage of the high sustained reads from cloud storage. Saving all your jpg images into a tfrecord file or small number of files will help.
Also, make sure your bucket is a single region bucket in a region that had gpus and that you are submitting to cloudml in that region.

I've got the similar problem before. I solved it by changing tf.train.batch() to tf.train.batch_join(). In my experiment, with 64 batch size and 4 GPUs, it took 22 mins by using tf.train.batch() whilst it only took 2 mins by using tf.train.batch_join().
In Tensorflow doc:
If you need more parallelism or shuffling of examples between files, use multiple reader instances using the tf.train.shuffle_batch_join
https://www.tensorflow.org/programmers_guide/reading_data

How to check Matplotlib's speed in Xcode and increase performance?

I'm running into some considerable speed bottlenecks with a Python-Matplotlib-Xcode combination. I know some immediate responses will probably ask "Why are you doing python stuff in Xcode, just man up and use vim" --> I like the organizing ability and the built in version control, it makes elements of my work easier to deal with.
Getting python to run in xcode in the first place was a bit more tricky than I had hoped, but its possible. Now I have the following scenario:
A master file, 'main.py' does all the import stuff for me and sets up some universal formatting to make all the figures (for eventual inclusion in my PhD thesis) nice and uniform. Afterwards it runs a series of execfile commands to generate whichever graphics I need. Two things I can think of right off the bat:
1) at the very beginning of main.py after I import all the normal python stuff you tend to need, I call a system script which checks whether a certain filesystem is mounted. I keep all my climate model data on there since my local hard drive is too small to deal with all of it at once. Python pauses itself and waits for the system to do its thing, but once the filesystem has been found, it keeps going. Usually this only needs to happen once in the morning when I get to work, or if the VPN server kicked me off for whatever reason. (Side question, it'd be cool to know if theres a trick to automate an VPN login to reconnect as soon as it notices its not connected)
2) I'm not sure how much xcode is using on its own. running the same program from terminal is (somewhat) faster. I've tried to be memory conscience and turn off stuff I don't need while running the python/xcode combination.
Also, python launches a little window whenever I call plt.show(), this in itself takes time, I've considered just saving them as quick png files and opening them with some other viewer, although I guess that would also have to somehow take time to open up. Given how often these graphics change as I add model runs or think of nicer ways of displaying the data, it'd be nice to not waste something on the order of 15 to 30 minutes (possibly more) out of the entire day twiddling my thumbs and waiting for a window to pop up.

Benchmark it!
import datetime
start = datetime.datetime.now()
# your plotting code
td = datetime.datetime.now() - start
print td.total_seconds() # requires python version >= 2.7
Run it in xcode and from the command line, see what the difference is.

Profiling a multi-tiered, distributed, web application (server side)

I would like to profile a complex web application from the server PoV.
According to the wikipedia link above, and the Stack Overflow profiling tag description, profiling (in one of its forms) means getting a list (or a graphical representation) of APIs/components of the application, each with the number of calls and time spent in it during run-time.
Note that unlike a traditional one-program/one-language a web server application may be:
Distributed over multiple machines
Different components may be written in different languages
Different components may be running on top of different OSes, etc.
So the traditional "Just use a profiler" answer is not easily applicable to this problem.
I'm not looking for:
Coarse performance stats like the ones provided by various log-analysis tools (e.g. analog) nor for
client-side, per-page performance stats like the ones presented by tools like Google's Pagespeed, or Yahoo! Y!Slow, waterfall diagrams, and browser component load times)
Instead, I'm looking for a classic profiler-style report:
number of calls
call durations
by function/API/component-name, on the server-side of the web application.
Bottom line, the question is:
How can one profile a multi-tiered, multi-platform, distributed web application?
A free-software based solution is much preferred.
I have been searching the web for a solution for a while and couldn't find anything satisfactory to fit my needs except some pretty expensive commercial offerings. In the end, I bit the bullet, thought about the problem, and wrote my own solution which I wanted to freely share.
I'm posting my own solution since this practice is encouraged on SO
This solution is far from perfect, for example, it is at very high level (individual URLs) which may not good for all use-cases. Nevertheless, it has helped me immensely in trying to understand where my web-app spends its time.
In the spirit on open source and knowledge sharing, I welcome any other, especially superior, approaches and solutions from others.

Thinking of how traditional profilers work, it should be straight-forward to come up with a general free-software solution to this challenge.
Let's break the problem into two parts:
Collecting the data
Presenting the data
Collecting the data
Assume we can break our web application into its individual
constituent parts (API, functions) and measure the time it takes
each of these parts to complete. Each part is called thousands of
times a day, so we could collect this data over a full day or so on
multiple hosts. When the day is over we would have a pretty big and
relevant data-set.
Epiphany #1: substitute 'function' with 'URL', and our existing
web-logs are "it". The data we need is already there:
Each part of a web API is defined by the request URL (possibly
with some parameters)
The round-trip times (often in microseconds) appear on each line
We have a day, (week, month) worth of lines with this data handy
So if we have access to standard web-logs for all the distributed
parts of our web application, part one of our problem (collecting
the data) is solved.
Presenting the data
Now we have a big data-set, but still no real insight.
How can we gain insight?
Epiphany #2: visualize our (multiple) web-server logs directly.
A picture is worth a 1000 words. Which picture can we use?
We need to condense 100s of thousands or millions lines of multiple
web-server logs into a short summary which would tell most of the
story about our performance. In other words: the goal is to generate
a profiler-like report, or even better: a graphical profiler report,
directly from our web logs.
Imagine we could map:
Call-latencies to one dimension
Number of calls to another dimension, and
The function identities to a color (essentially a 3rd dimension)
One such picture: a stacked-density chart of latencies by API
appears below (functions names were made-up for illustrative purposes).
The Chart:
Some observations from this example
We have a tri-modal distribution representing 3 radically
different 'worlds' in our application:
The fastest responses, are centered around ~300 microseconds
of latency. These responses are coming from our varnish cache
The second fastest, taking a bit less than 0.01 seconds on
average, are coming from various APIs served by our middle-layer
web application (Apache/Tomcat)
The slowest responses, centered around 0.1 seconds and
sometimes taking several seconds to respond to, involve round-trips
to our SQL database.
We can see how dramatic caching effects can be on an application
(note that the x-axis is on a log10 scale)
We can specifically see which APIs tend to be fast vs slow, so
we know what to focus on.
We can see which APIs are most often called each day.
We can also see that some of them are so rarely called, it is hard to even see their color on the chart.
How to do it?
The first step is to pre-process and extract the subset needed-data
from the logs. A trivial utility like Unix 'cut' on multiple logs
may be sufficient here. You may also need to collapse multiple
similar URLs into shorter strings describing the function/API like
'registration', or 'purchase'. If you have a multi-host unified log
view generated by a load-balancer, this task may be easier. We
extract only the names of the APIs (URLs) and their latencies, so we
end up with one big file with a pair of columns, separated by TABs
*API_Name Latency_in_microSecs*
func_01 32734
func_01 32851
func_06 598452
...
func_11 232734
Now we run the R script below on the resulting data pairs to produce
the wanted chart (using Hadley Wickham's wonderful ggplot2 library).
Voilla!
The code to generate the chart
Finally, here's the code to produce the chart from the API+Latency TSV data file:
#!/usr/bin/Rscript --vanilla
#
# Generate stacked chart of API latencies by API from a TSV data-set
#
# ariel faigon - Dec 2012
#
.libPaths(c('~/local/lib/R',
'/usr/lib/R/library',
'/usr/lib/R/site-library'
))
suppressPackageStartupMessages(library(ggplot2))
# grid lib needed for 'unit()':
suppressPackageStartupMessages(library(grid))
#
# Constants: width, height, resolution, font-colors and styles
# Adapt to taste
#
wh.ratio = 2
WIDTH = 8
HEIGHT = WIDTH / wh.ratio
DPI = 200
FONTSIZE = 11
MyGray = gray(0.5)
title.theme = element_text(family="FreeSans", face="bold.italic",
size=FONTSIZE)
x.label.theme = element_text(family="FreeSans", face="bold.italic",
size=FONTSIZE-1, vjust=-0.1)
y.label.theme = element_text(family="FreeSans", face="bold.italic",
size=FONTSIZE-1, angle=90, vjust=0.2)
x.axis.theme = element_text(family="FreeSans", face="bold",
size=FONTSIZE-1, colour=MyGray)
y.axis.theme = element_text(family="FreeSans", face="bold",
size=FONTSIZE-1, colour=MyGray)
#
# Function generating well-spaced & well-labeled y-axis (count) breaks
#
yscale_breaks <- function(from.to) {
from <- 0
to <- from.to[2]
# round to 10 ceiling
to <- ceiling(to / 10.0) * 10
# Count major breaks on 10^N boundaries, include the 0
n.maj = 1 + ceiling(log(to) / log(10))
# if major breaks are too few, add minor-breaks half-way between them
n.breaks <- ifelse(n.maj < 5, max(5, n.maj*2+1), n.maj)
breaks <- as.integer(seq(from, to, length.out=n.breaks))
breaks
}
#
# -- main
#
# -- process the command line args: [tsv_file [png_file]]
# (use defaults if they aren't provided)
#
argv <- commandArgs(trailingOnly = TRUE)
if (is.null(argv) || (length(argv) < 1)) {
argv <- c(Sys.glob('*api-lat.tsv')[1])
}
tsvfile <- argv[1]
stopifnot(! is.na(tsvfile))
pngfile <- ifelse(is.na(argv[2]), paste(tsvfile, '.png', sep=''), argv[2])
# -- Read the data from the TSV file into an internal data.frame d
d <- read.csv(tsvfile, sep='\t', head=F)
# -- Give each data column a human readable name
names(d) <- c('API', 'Latency')
#
# -- Convert microseconds Latency (our weblog resolution) to seconds
#
d <- transform(d, Latency=Latency/1e6)
#
# -- Trim the latency axis:
# Drop the few 0.001% extreme-slowest outliers on the right
# to prevent them from pushing the bulk of the data to the left
Max.Lat <- quantile(d$Latency, probs=0.99999)
d <- subset(d, Latency < Max.Lat)
#
# -- API factor pruning
# Drop rows where the APIs is less than small % of total calls
#
Rare.APIs.pct <- 0.001
if (Rare.APIs.pct > 0.0) {
d.N <- nrow(d)
API.counts <- table(d$API)
d <- transform(d, CallPct=100.0*API.counts[d$API]/d.N)
d <- d[d$CallPct > Rare.APIs.pct, ]
d.N.new <- nrow(d)
}
#
# -- Adjust legend item-height &font-size
# to the number of distinct APIs we have
#
API.count <- nlevels(as.factor(d$API))
Legend.LineSize <- ifelse(API.count < 20, 1.0, 20.0/API.count)
Legend.FontSize <- max(6, as.integer(Legend.LineSize * (FONTSIZE - 1)))
legend.theme = element_text(family="FreeSans", face="bold.italic",
size=Legend.FontSize,
colour=gray(0.3))
# -- set latency (X-axis) breaks and labels (s.b made more generic)
lat.breaks <- c(0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10)
lat.labels <- sprintf("%g", lat.breaks)
#
# -- Generate the chart using ggplot
#
p <- ggplot(data=d, aes(x=Latency, y=..count../1000.0, group=API, fill=API)) +
geom_bar(binwidth=0.01) +
scale_x_log10(breaks=lat.breaks, labels=lat.labels) +
scale_y_continuous(breaks=yscale_breaks) +
ggtitle('APIs Calls & Latency Distribution') +
xlab('Latency in seconds - log(10) scale') +
ylab('Call count (in 1000s)') +
theme(
plot.title=title.theme,
axis.title.y=y.label.theme,
axis.title.x=x.label.theme,
axis.text.x=x.axis.theme,
axis.text.y=y.axis.theme,
legend.text=legend.theme,
legend.key.height=unit(Legend.LineSize, "line")
)
#
# -- Save the plot into the png file
#
ggsave(p, file=pngfile, width=WIDTH, height=HEIGHT, dpi=DPI)

Your discussion of "back in the day" profiling practice is true.
There's just one little problem it always had:
In non-toy software, it may find something, but it doesn't find much, for a bunch of reasons.
The thing about opportunities for higher performance is, if you don't find them, the software doesn't break, so you just can pretend they don't exist.
That is, until a different method is tried, and they are found.
In statistics, this is called a type 2 error - a false negative.
An opportunity is there, but you didn't find it.
What it means is if somebody does know how to find it, they're going to win, big time.
Here's probably more than you ever wanted to know about that.
So if you're looking at the same kind of stuff in a web app - invocation counts, time measurements, you're not liable to do better than the same kind of non-results.
I'm not into web apps, but I did a fair amount of performance tuning in a protocol-based factory automation app many years ago.
I used a logging technique.
I won't say it was easy, but it did work.
The people I see doing something similar is here, where they use what they call a waterfall chart.
The basic idea is rather than casting a wide net and getting a lot of measurements, you trace through a single logical thread of transactions, analyzing where delays are occurring that don't have to.
So if results are what you're after, I'd look down that line of thinking.

Wisdom in FFTW doesn't import/export

I am using FFTW for FFTs, it's all working well but the optimisation takes a long time with the FFTW_PATIENT flag. However, according to the FFTW docs, I can improve on this by reusing wisdom between runs, which I can import and export to file. (I am using the floating point fftw routines, hence the fftwf_ prefix below instead of fftw_)
So, at the start of my main(), I have :
char wisdom_file[] = "optimise.fft";
fftwf_import_wisdom_from_filename(wisdom_file);
and at the end, I have:
fftwf_export_wisdom_to_filename(wisdom_file);
(I've also got error-checking to check the return is non-zero, omitted for simplicity above, so I know the files are reading and writing correctly)
After one run I get a file optimise.fft with what looks like ASCII wisdom. However, subsequent runs do not get any faster, and if I create my plans with the FFTW_WISDOM_ONLY flag, I get a null plan, showing that it doesn't see any wisdom there.
I am using 3 different FFTs (2 real to complex and 1 inverse complex to real), so have also tried import/export in each FFT, and to separate files, but that doesn't help.
I am using FFTW-3.3.3, I can see that FFTW-2 seemed to need more setting up to reuse wisdom, but the above seems sufficient now- what am I doing wrong?

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio