Related
So I have a very simple NN script written in Tensorflow, and I am having a hard time trying to trace down where some "randomness" is coming in from.
I have recorded the
Weights,
Gradients,
Logits
of my network as I train, and for the first iteration, it is clear that everything starts off the same. I have a SEED value both for how data is read in, and a SEED value for initializing the weights of the net. Those I never change.
My problem is that on say the second iteration of every re-run I do, I start to see the gradients diverge, (by a small amount, like say, 1e-6 or so). However over time, this of course leads to non-repeatable behaviour.
What might the cause of this be? I dont know where any possible source of randomness might be coming from...
Thanks
There's a good chance you could get deterministic results if you run your network on CPU (export CUDA_VISIBLE_DEVICES=), with single-thread in Eigen thread pool (tf.Session(config=tf.ConfigProto(intra_op_parallelism_threads=1)), one Python thread (no multi-threaded queue-runners that you get from ops like tf.batch), and a single well-defined operation order. Also using inter_op_parallelism_threads=1 may help in some scenarios.
One issue is that floating point addition/multiplication is non-associative, so one fool-proof way to get deterministic results is to use integer arithmetic or quantized values.
Barring that, you could isolate which operation is non-deterministic, and try to avoid using that op. For instance, there's tf.add_n op, which doesn't say anything about the order in which it sums the values, but different orders produce different results.
Getting deterministic results is a bit of an uphill battle because determinism is in conflict with performance, and performance is usually the goal that gets more attention. An alternative to trying to have exact same numbers on reruns is to focus on numerical stability -- if your algorithm is stable, then you will get reproducible results (ie, same number of misclassifications) even though exact parameter values may be slightly different
The tensorflow reduce_sum op is specifically known to be non-deterministic. Furthermore, reduce_sum is used for calculating bias gradients.
This post discusses a workaround to avoid using reduce_sum (ie taking the dot product of any vector w/ a vector of all 1's is the same as reduce_sum)
I have faced the same problem..
The working solution for me was to:
1- use tf.set_random_seed(1) in order to make all tf functions have the same seed every new run
2- Training the model using CPU not the GPU to avoid GPU non-deterministic operations due to precision.
For benchmarking the efficiency of different algorithms for a simple task and comparing them, the way I find most often is to set a constant number of times to iterate over the task, and measure the time interval spent for each algorithm.
But, if the number of times is set too small, the interval difference among the algorithms will be too small, and may be masked by external factors. If you set the number of times too large, then it will take too much time to execute. So you have to guess the right number of times by trial end error.
Rather than doing it this way, I think it makes more sense to set a constant time interval that you want to run each algorithm, and then measure how many iterations can be made in that interval for each algorithm.
By doing so, the reliability of the benchmark will be more stable. In the conventional way, the benchmark will be more reliable for tasks that take time.
I haven't seen this way of benchmarking. Do people actually do this way, and is there a benchmarking framework for this way of measurement? I am asking this as a non-language-specific question, but if there is such framework, especially for Ruby, please introduce some. Or am I wrong about this idea?
I found this gem: benchmark/ips.
Take a look at perfer:
https://github.com/jruby/perfer
This has a couple of mechanisms including iterations/s. Don't worry that this is a jruby repo, it works on all Ruby implementations and was written as part of GSoC 2012.
Lets say I am going to run process X and see how long it takes.
I am going to save into a database a date I ran this process, and the time it took. I want to know what to put into the DB.
Process X almost always runs under 1500ms, so this is a short process. It usually runs between 500 and 1500ms, quite a range (3x difference).
My question is, how many "runs" should be saved into the DB as a single run?
Every run saved into the DB as its
own row?
5 Runs, averaged, then save that
time?
10 Runs averaged?
20 Runs, remove anything more than 2
std deviations away, and save
everything inside that range?
Does anyone have any good info backing them up on this?
Save the data for every run into its own row. Then later you can use and analyze the data however you like... ie, all you the other options you listed can be performed after the fact. It's not really possible for someone else to draw meaningful conclusions about how to average/analyze the data without knowing more about what's going on.
The fastest run is the one that most accurately times only your code.
All slower runs are slower because of noise introduced by the operating system scheduler.
The variance you experience is going to differ from machine to machine, and even on identical machines, the set of runnable processes will introduce noise.
None of the above. Bran is close though. You should save every measurment. But don't average them. The average (arithmetic mean) can be very misleading in this type of analysis. The reason is that some of your measurments will be much longer than the others. This will happen becuse things can interfere with your process - even on 'clean' test systems. It can also happen becuse your process may not be as deterministic as you might thing.
Some people think that simply taking more samples (running more iterations) and averaging the measurmetns will give them better data. It doesn't. The more you run, the more likelty it is that you will encounter a perturbing event, thus making the average overly high.
A better way to do this is to run as many measurments as you can (time permitting). 100 is not a bad number, but 30-ish can be enough.
Then, sort these by magnitude and graph them. Note that this is not a standard distribution. Compute compute some simple statistics: mean, median, min, max, lower quaertile, upper quartile.
Contrary to some guidance, do not 'throw away' outside vaulues or 'outliers'. These are often the most intersting measurments. For example, you may establish a nice baseline, then look for departures. Understanding these departures will help you fully understand how your process works, how the sytsem affecdts your process, and what can interfere with your process. It will often readily expose bugs.
Depends what kind of data you want. I'd say one line per run initially, then analyze the data, go from there. Maybe store a min/max/average of X runs if you want to consolidate it.
http://en.wikipedia.org/wiki/Sample_size
Bryan is right - you need to investigate more. if your code has that much variance even "most" of the time then you might have a lot of fluctuation in your test environment because of other processes, os paging or other factors. If not it seems that you have code paths doing wildly varying amount of work and coming up with a single number/run data to describe the performance of such a multi-modal system is not going to tell you much. So i'd say isolate your setup as much as possible, run at least 30 trials and get a feel for what your performance curve looks like. Once you have that, you can use that wikipedia page to come up with a number that will tell you how many trials you need to run per code-change to see if the performance has increased/decreased with some level of statistical significance.
While saying, "Save every run," is nice, it might not be practical in your case. However, I do think that storing only the average eliminates too much data. I like storing the average of ten runs, but instead of storing just the average, I'd also store the max and min values, so that I can get a feel for the spread of the data in addition to its center.
The max and min information in particular will tell you how often corner cases arise. Is the 1500ms case a one-in-1000 outlier? Or is it something that recurs on a regular basis?
We've all poked fun at the 'X minutes remaining' dialog which seems to be too simplistic, but how can we improve it?
Effectively, the input is the set of download speeds up to the current time, and we need to use this to estimate the completion time, perhaps with an indication of certainty, like '20-25 mins remaining' using some Y% confidence interval.
Code that did this could be put in a little library and used in projects all over, so is it really that difficult? How would you do it? What weighting would you give to previous download speeds?
Or is there some open source code already out there?
Edit: Summarising:
Improve estimated completion time via better algo/filter etc.
Provide interval instead of single time ('1h45-2h30 mins'), or just limit the precision ('about 2 hours').
Indicate when progress has stalled - although if progress consistently stalls and then continues, we should be able to deal with that. Perhaps 'about 2 hours, currently stalled'
More generally, I think you are looking for a way to give an instant mesure of the transfer speed, which is generally obtained by an average over a small period.
The problem is generally that in order to be reactive, the period is usually extremely small, which leads to the yoyo effect.
I would propose a very simple scheme, let's model it.
Think of a curve speed (y) over time (x).
the Instant Speed, is no more than reading y for the current x (x0).
the Average Speed, is no more than Integral(f(x), x in [x0-T,x0]) / T
the scheme I propose is to apply a filter, to give more weight to the last moments, while still taking into account the past moments.
It can be easily implement as g(x,x0,T) = 2 * (x - x0) + 2T which is a simple triangle of surface T.
And now you can compute Integral(f(x)*g(x,x0,T), x in [x0-T,x0]) / T, which should work because both functions are always positive.
Of course you could have a different g as long as it's always positive in the given interval and that its integral on the interval is T (so that its own average is exactly 1).
The advantage of this method is that because you give more weight to immediate events, you can remain pretty reactive even if you consider larger time intervals (so that the average is more precise, and less susceptible to hiccups).
Also, what I have rarely seen but think would provide more precise estimates would be to correlate the time used for computing the average to the estimated remaining time:
if I download a 5ko file, it's going to be loaded in an instant, no need to estimate
if I download a 15 Mo file, it's going to take between 2 minutes roughly, so I would like estimates say... every 5 seconds ?
if I download a 1.5 Go file, it's going to take... well around 200 minutes (with the same speed)... which is to say 3h20m... perhaps that an estimates every minute would be sufficient ?
So, the longer the download is going to take, the less reactive I need to be, and the more I can average out. In general, I would say that a window could cover 2% of the total time (perhaps except for the few first estimates, because people appreciate immediate feedback). Also, indicating progress by whole % at a time is sufficient. If the task is long, I was prepared to wait anyway.
I wonder, would a state estimation technique produce good results here? Something like a Kalman Filter?
Basically you predict the future by looking at your current model, and change the model at each time step to reflect the changes to the real world. I think this kind of technique is used for estimating the time left on your laptop battery, which can also vary according to use, age of battery, etc'.
see http://en.wikipedia.org/wiki/Kalman_filter for a more in depth description of the algorithm.
The filter also gives a variance measure, which could be used to indicate your confidence of the estimate (allthough, as was mentioned by other answers, it might not be the best idea to show this to the end user)
Does anyone know if this is actually used somewhere for download (or file copy) estimation?
Don't confuse your users by providing more information than they need. I'm thinking of the confidence interval. Skip it.
Internet download times are highly variable. The microwave interferes with WiFi. Usage varies by time of day, day of week, holidays, and releases of new exciting games. The server may be heavily loaded right now. If you carry your laptop to cafe, the results will be different than at home. So, you probably can't rely on historical data to predict the future of download speeds.
If you can't accurately estimate the time remaining, then don't lie to your user by offering such an estimate.
If you know how much data must be downloaded, you can provide % completed progress.
If you don't know at all, provide a "heartbeat" - a piece of moving UI that shows the user that things are working, even through you don't know how long remains.
Improving the estimated time itself: Intuitively, I would guess that the speed of the net connection is a series of random values around some temporary mean speed - things tick along at one speed, then suddenly slow or speed up.
One option, then, could be to weight the previous set of speeds by some exponential, so that the most recent values get the strongest weighting. That way, as the previous mean speed moves further into the past, its effect on the current mean reduces.
However, if the speed randomly fluctuates, it might be worth flattening the top of the exponential (e.g. by using a Gaussian filter), to avoid too much fluctuation.
So in sum, I'm thinking of measuring the standard deviation (perhaps limited to the last N minutes) and using that to generate a Gaussian filter which is applied to the inputs, and then limiting the quoted precision using the standard deviation.
How, though, would you limit the standard deviation calculation to the last N minutes? How do you know how long to use?
Alternatively, there are pattern recognition possibilities to detect if we've hit a stable speed.
I've considered this off and on, myself. I the answer starts with being conservative when computing the current (and thus, future) transfer rate, and includes averaging over longer periods, to get more stable estimates. Perhaps low-pass filtering the time that is displayed, so that one doesn't get jumps between 2 minutes and 2 days.
I don't think a confidence interval is going to be helpful. Most people wouldn't be able to interpret it, and it would just be displaying more stuff that is a guess.
We've all poked fun at the 'X minutes remaining' dialog which seems to be too simplistic, but how can we improve it?
Effectively, the input is the set of download speeds up to the current time, and we need to use this to estimate the completion time, perhaps with an indication of certainty, like '20-25 mins remaining' using some Y% confidence interval.
Code that did this could be put in a little library and used in projects all over, so is it really that difficult? How would you do it? What weighting would you give to previous download speeds?
Or is there some open source code already out there?
Edit: Summarising:
Improve estimated completion time via better algo/filter etc.
Provide interval instead of single time ('1h45-2h30 mins'), or just limit the precision ('about 2 hours').
Indicate when progress has stalled - although if progress consistently stalls and then continues, we should be able to deal with that. Perhaps 'about 2 hours, currently stalled'
More generally, I think you are looking for a way to give an instant mesure of the transfer speed, which is generally obtained by an average over a small period.
The problem is generally that in order to be reactive, the period is usually extremely small, which leads to the yoyo effect.
I would propose a very simple scheme, let's model it.
Think of a curve speed (y) over time (x).
the Instant Speed, is no more than reading y for the current x (x0).
the Average Speed, is no more than Integral(f(x), x in [x0-T,x0]) / T
the scheme I propose is to apply a filter, to give more weight to the last moments, while still taking into account the past moments.
It can be easily implement as g(x,x0,T) = 2 * (x - x0) + 2T which is a simple triangle of surface T.
And now you can compute Integral(f(x)*g(x,x0,T), x in [x0-T,x0]) / T, which should work because both functions are always positive.
Of course you could have a different g as long as it's always positive in the given interval and that its integral on the interval is T (so that its own average is exactly 1).
The advantage of this method is that because you give more weight to immediate events, you can remain pretty reactive even if you consider larger time intervals (so that the average is more precise, and less susceptible to hiccups).
Also, what I have rarely seen but think would provide more precise estimates would be to correlate the time used for computing the average to the estimated remaining time:
if I download a 5ko file, it's going to be loaded in an instant, no need to estimate
if I download a 15 Mo file, it's going to take between 2 minutes roughly, so I would like estimates say... every 5 seconds ?
if I download a 1.5 Go file, it's going to take... well around 200 minutes (with the same speed)... which is to say 3h20m... perhaps that an estimates every minute would be sufficient ?
So, the longer the download is going to take, the less reactive I need to be, and the more I can average out. In general, I would say that a window could cover 2% of the total time (perhaps except for the few first estimates, because people appreciate immediate feedback). Also, indicating progress by whole % at a time is sufficient. If the task is long, I was prepared to wait anyway.
I wonder, would a state estimation technique produce good results here? Something like a Kalman Filter?
Basically you predict the future by looking at your current model, and change the model at each time step to reflect the changes to the real world. I think this kind of technique is used for estimating the time left on your laptop battery, which can also vary according to use, age of battery, etc'.
see http://en.wikipedia.org/wiki/Kalman_filter for a more in depth description of the algorithm.
The filter also gives a variance measure, which could be used to indicate your confidence of the estimate (allthough, as was mentioned by other answers, it might not be the best idea to show this to the end user)
Does anyone know if this is actually used somewhere for download (or file copy) estimation?
Don't confuse your users by providing more information than they need. I'm thinking of the confidence interval. Skip it.
Internet download times are highly variable. The microwave interferes with WiFi. Usage varies by time of day, day of week, holidays, and releases of new exciting games. The server may be heavily loaded right now. If you carry your laptop to cafe, the results will be different than at home. So, you probably can't rely on historical data to predict the future of download speeds.
If you can't accurately estimate the time remaining, then don't lie to your user by offering such an estimate.
If you know how much data must be downloaded, you can provide % completed progress.
If you don't know at all, provide a "heartbeat" - a piece of moving UI that shows the user that things are working, even through you don't know how long remains.
Improving the estimated time itself: Intuitively, I would guess that the speed of the net connection is a series of random values around some temporary mean speed - things tick along at one speed, then suddenly slow or speed up.
One option, then, could be to weight the previous set of speeds by some exponential, so that the most recent values get the strongest weighting. That way, as the previous mean speed moves further into the past, its effect on the current mean reduces.
However, if the speed randomly fluctuates, it might be worth flattening the top of the exponential (e.g. by using a Gaussian filter), to avoid too much fluctuation.
So in sum, I'm thinking of measuring the standard deviation (perhaps limited to the last N minutes) and using that to generate a Gaussian filter which is applied to the inputs, and then limiting the quoted precision using the standard deviation.
How, though, would you limit the standard deviation calculation to the last N minutes? How do you know how long to use?
Alternatively, there are pattern recognition possibilities to detect if we've hit a stable speed.
I've considered this off and on, myself. I the answer starts with being conservative when computing the current (and thus, future) transfer rate, and includes averaging over longer periods, to get more stable estimates. Perhaps low-pass filtering the time that is displayed, so that one doesn't get jumps between 2 minutes and 2 days.
I don't think a confidence interval is going to be helpful. Most people wouldn't be able to interpret it, and it would just be displaying more stuff that is a guess.