Determinism in tensorflow gradient updates? - random

So I have a very simple NN script written in Tensorflow, and I am having a hard time trying to trace down where some "randomness" is coming in from.
I have recorded the
Weights,
Gradients,
Logits
of my network as I train, and for the first iteration, it is clear that everything starts off the same. I have a SEED value both for how data is read in, and a SEED value for initializing the weights of the net. Those I never change.
My problem is that on say the second iteration of every re-run I do, I start to see the gradients diverge, (by a small amount, like say, 1e-6 or so). However over time, this of course leads to non-repeatable behaviour.
What might the cause of this be? I dont know where any possible source of randomness might be coming from...
Thanks

There's a good chance you could get deterministic results if you run your network on CPU (export CUDA_VISIBLE_DEVICES=), with single-thread in Eigen thread pool (tf.Session(config=tf.ConfigProto(intra_op_parallelism_threads=1)), one Python thread (no multi-threaded queue-runners that you get from ops like tf.batch), and a single well-defined operation order. Also using inter_op_parallelism_threads=1 may help in some scenarios.
One issue is that floating point addition/multiplication is non-associative, so one fool-proof way to get deterministic results is to use integer arithmetic or quantized values.
Barring that, you could isolate which operation is non-deterministic, and try to avoid using that op. For instance, there's tf.add_n op, which doesn't say anything about the order in which it sums the values, but different orders produce different results.
Getting deterministic results is a bit of an uphill battle because determinism is in conflict with performance, and performance is usually the goal that gets more attention. An alternative to trying to have exact same numbers on reruns is to focus on numerical stability -- if your algorithm is stable, then you will get reproducible results (ie, same number of misclassifications) even though exact parameter values may be slightly different

The tensorflow reduce_sum op is specifically known to be non-deterministic. Furthermore, reduce_sum is used for calculating bias gradients.
This post discusses a workaround to avoid using reduce_sum (ie taking the dot product of any vector w/ a vector of all 1's is the same as reduce_sum)

I have faced the same problem..
The working solution for me was to:
1- use tf.set_random_seed(1) in order to make all tf functions have the same seed every new run
2- Training the model using CPU not the GPU to avoid GPU non-deterministic operations due to precision.

Related

Multiple parameter optimization with a stochastic element

I am looking for a method to find the best parameters for a simulation. It's about break-shots in billiards / pool. A shot is defined by 7 parameters, I can simulate the shot and then rate the outcome and I would like to compute the best parameters.
I have found the following link here:
Multiple parameter optimization with lots of local minima
suggesting 4 kinds of algorithms. In the pool simulator I am using, the shots are altered by a little random value each time it is simulated. If I simulate the same shot twice, the outcome will be different. So I am looking for an algorithm like the ones in the link above, only with the addition of a stochastical element, optimizing for the 7 parameters that will on average yield the best parameters, i.e. a break shot that most likely will be a success. My initial idea was simulating the shot 100 or 1000 times and just take the average as rating for the algorithms above, but I still feel like there is a better way. Does anyone have an idea?
The 7 parameters are continuous but within different ranges (one from 0 to 10, another from 0.0 to 0.028575 and so on).
Thank you
At least for some of the algorithms, simulating the same shot repeatedly might not be neccessary. As long as your alternatives have some form of momentum, like in the swarm simulation approach, you can let that be affected by the outcome of each individual simulation. In that case, a single unlucky simulation would slow the movement in parameter space only slightly, whereas a serious loss of quality should be enough to stop and reverse the movement. Thos algorithms which don't use momentum might be tweaked to have momentum. If not, then repeated simulation seems the best approach. Unless you can get your hands on the internals of the simulator, and rate the shot as a whole without having to simulate it over and over again.
You can use the algorithms you mentioned in your non-deterministic scenario with independent stochastic runs. Your idea with repeated simulations is good, you can read more about how many repeats you might have to consider for your simulations (unfortunately, there is no trivial answer). If you are not so much into maths, and the runs go fast, do 1.000 repeats, then 10.000 repeats, and see if the results differ largely. If yes, you have to collect more samples, if not, you are probably on the safe side (the central limit theorem states that the results converge).
Further, do not just consider the average! Make sure to look into the standard deviation for each algorithm's results; you might want to use box plots to compare their quartiles. If you rely on the average only, you could pick an algorithm that produces very varying results, sometimes excellent, sometimes terrible in performance.
I don't know what language you are using, but if you use Java, I am maintaining a tool that could simplify your "monte carlo" style experiments.

When timing how long a quick process runs, how many runs should be used?

Lets say I am going to run process X and see how long it takes.
I am going to save into a database a date I ran this process, and the time it took. I want to know what to put into the DB.
Process X almost always runs under 1500ms, so this is a short process. It usually runs between 500 and 1500ms, quite a range (3x difference).
My question is, how many "runs" should be saved into the DB as a single run?
Every run saved into the DB as its
own row?
5 Runs, averaged, then save that
time?
10 Runs averaged?
20 Runs, remove anything more than 2
std deviations away, and save
everything inside that range?
Does anyone have any good info backing them up on this?
Save the data for every run into its own row. Then later you can use and analyze the data however you like... ie, all you the other options you listed can be performed after the fact. It's not really possible for someone else to draw meaningful conclusions about how to average/analyze the data without knowing more about what's going on.
The fastest run is the one that most accurately times only your code.
All slower runs are slower because of noise introduced by the operating system scheduler.
The variance you experience is going to differ from machine to machine, and even on identical machines, the set of runnable processes will introduce noise.
None of the above. Bran is close though. You should save every measurment. But don't average them. The average (arithmetic mean) can be very misleading in this type of analysis. The reason is that some of your measurments will be much longer than the others. This will happen becuse things can interfere with your process - even on 'clean' test systems. It can also happen becuse your process may not be as deterministic as you might thing.
Some people think that simply taking more samples (running more iterations) and averaging the measurmetns will give them better data. It doesn't. The more you run, the more likelty it is that you will encounter a perturbing event, thus making the average overly high.
A better way to do this is to run as many measurments as you can (time permitting). 100 is not a bad number, but 30-ish can be enough.
Then, sort these by magnitude and graph them. Note that this is not a standard distribution. Compute compute some simple statistics: mean, median, min, max, lower quaertile, upper quartile.
Contrary to some guidance, do not 'throw away' outside vaulues or 'outliers'. These are often the most intersting measurments. For example, you may establish a nice baseline, then look for departures. Understanding these departures will help you fully understand how your process works, how the sytsem affecdts your process, and what can interfere with your process. It will often readily expose bugs.
Depends what kind of data you want. I'd say one line per run initially, then analyze the data, go from there. Maybe store a min/max/average of X runs if you want to consolidate it.
http://en.wikipedia.org/wiki/Sample_size
Bryan is right - you need to investigate more. if your code has that much variance even "most" of the time then you might have a lot of fluctuation in your test environment because of other processes, os paging or other factors. If not it seems that you have code paths doing wildly varying amount of work and coming up with a single number/run data to describe the performance of such a multi-modal system is not going to tell you much. So i'd say isolate your setup as much as possible, run at least 30 trials and get a feel for what your performance curve looks like. Once you have that, you can use that wikipedia page to come up with a number that will tell you how many trials you need to run per code-change to see if the performance has increased/decreased with some level of statistical significance.
While saying, "Save every run," is nice, it might not be practical in your case. However, I do think that storing only the average eliminates too much data. I like storing the average of ten runs, but instead of storing just the average, I'd also store the max and min values, so that I can get a feel for the spread of the data in addition to its center.
The max and min information in particular will tell you how often corner cases arise. Is the 1500ms case a one-in-1000 outlier? Or is it something that recurs on a regular basis?

Modeling distribution of performance measurements

How would you mathematically model the distribution of repeated real life performance measurements - "Real life" meaning you are not just looping over the code in question, but it is just a short snippet within a large application running in a typical user scenario?
My experience shows that you usually have a peak around the average execution time that can be modeled adequately with a Gaussian distribution. In addition, there's a "long tail" containing outliers - often with a multiple of the average time. (The behavior is understandable considering the factors contributing to first execution penalty).
My goal is to model aggregate values that reasonably reflect this, and can be calculated from aggregate values (like for the Gaussian, calculate mu and sigma from N, sum of values and sum of squares). In other terms, number of repetitions is unlimited, but memory and calculation requirements should be minimized.
A normal Gaussian distribution can't model the long tail appropriately and will have the average biased strongly even by a very small percentage of outliers.
I am looking for ideas, especially if this has been attempted/analysed before. I've checked various distributions models, and I think I could work out something, but my statistics is rusty and I might end up with an overblown solution. Oh, a complete shrink-wrapped solution would be fine, too ;)
Other aspects / ideas: Sometimes you get "two humps" distributions, which would be acceptable in my scenario with a single mu/sigma covering both, but ideally would be identified separately.
Extrapolating this, another approach would be a "floating probability density calculation" that uses only a limited buffer and adjusts automatically to the range (due to the long tail, bins may not be spaced evenly) - haven't found anything, but with some assumptions about the distribution it should be possible in principle.
Why (since it was asked) -
For a complex process we need to make guarantees such as "only 0.1% of runs exceed a limit of 3 seconds, and the average processing time is 2.8 seconds". The performance of an isolated piece of code can be very different from a normal run-time environment involving varying levels of disk and network access, background services, scheduled events that occur within a day, etc.
This can be solved trivially by accumulating all data. However, to accumulate this data in production, the data produced needs to be limited. For analysis of isolated pieces of code, a gaussian deviation plus first run penalty is ok. That doesn't work anymore for the distributions found above.
[edit] I've already got very good answers (and finally - maybe - some time to work on this). I'm starting a bounty to look for more input / ideas.
Often when you have a random value that can only be positive, a log-normal distribution is a good way to model it. That is, you take the log of each measurement, and assume that is normally distributed.
If you want, you can consider that to have multiple humps, i.e. to be the sum of two normals having different mean. Those are a bit tricky to estimate the parameters of, because you may have to estimate, for each measurement, its probability of belonging to each hump. That may be more than you want to bother with.
Log-normal distributions are very convenient and well-behaved. For example, you don't deal with its average, you deal with it's geometric mean, which is the same as its median.
BTW, in pharmacometric modeling, log-normal distributions are ubiquitous, modeling such things as blood volume, absorption and elimination rates, body mass, etc.
ADDED: If you want what you call a floating distribution, that's called an empirical or non-parametric distribution. To model that, typically you save the measurements in a sorted array. Then it's easy to pick off the percentiles. For example the median is the "middle number". If you have too many measurements to save, you can go to some kind of binning after you have enough measurements to get the general shape.
ADDED: There's an easy way to tell if a distribution is normal (or log-normal). Take the logs of the measurements and put them in a sorted array. Then generate a QQ plot (quantile-quantile). To do that, generate as many normal random numbers as you have samples, and sort them. Then just plot the points, where X is the normal distribution point, and Y is the log-sample point. The results should be a straight line. (A really simple way to generate a normal random number is to just add together 12 uniform random numbers in the range +/- 0.5.)
The problem you describe is called "Distribution Fitting" and has nothing to do with performance measurements, i.e. this is generic problem of fitting suitable distribution to any gathered/measured data sample.
The standard process is something like that:
Guess the best distribution.
Run hypothesis tests to check how well it describes gathered data.
Repeat 1-3 if not well enough.
You can find interesting article describing how this can be done with open-source R software system here. I think especially useful to you may be function fitdistr.
In addition to already given answers consider Empirical Distributions. I have successful experience in using empirical distributions for performance analysis of several distributed systems. The idea is very straightforward. You need to build histogram of performance measurements. Measurements should be discretized with given accuracy. When you have histogram you could do several useful things:
calculate the probability of any given value (you are bound by accuracy only);
build PDF and CDF functions for the performance measurements;
generate sequence of response times according to a distribution. This one is very useful for performance modeling.
Try whit gamma distribution http://en.wikipedia.org/wiki/Gamma_distribution
From wikipedia
The gamma distribution is frequently a probability model for waiting times; for instance, in life testing, the waiting time until death is a random variable that is frequently modeled with a gamma distribution.
The standard for randomized Arrival times for performance modelling is either Exponential distribution or Poisson distribution (which is just the distribution of multiple Exponential distributions added together).
Not exactly answering your question, but relevant still: Mor Harchol-Balter did a very nice analysis of the size of jobs submitted to a scheduler, The effect of heavy-tailed job size distributions on computer systems design (1999). She found that the size of jobs submitted to her distributed task assignment system took a power-law distribution, which meant that certain pieces of conventional wisdom she had assumed in the construction of her task assignment system, most importantly that the jobs should be well load balanced, had awful consequences for submitters of jobs. She's done good follor-up work on this issue.
The broader point is, you need to ask such questions as:
What happens if reasonable-seeming assumptions about the distribution of performance, such as that they take a normal distribution, break down?
Are the data sets I'm looking at really representative of the problem I'm trying to solve?

How to automatically tune parameters of an algorithm?

Here's the setup:
I have an algorithm that can succeed or fail.
I want it to succeed with highest probability possible.
Probability of success depends on some parameters (and some external circumstances):
struct Parameters {
float param1;
float param2;
float param3;
float param4;
// ...
};
bool RunAlgorithm (const Parameters& parameters) {
// ...
// P(return true) is a function of parameters.
}
How to (automatically) find best parameters with a smallest number of calls to RunAlgorithm ?
I would be especially happy with a readl library.
If you need more info on my particular case:
Probability of success is smooth function of parameters and have single global optimum.
There are around 10 parameters, most of them independently tunable (but some are interdependent)
I will run the tunning overnight, I can handle around 1000 calls to Run algorithm.
Clarification:
Best parameters have to found automatically overnight, and used during the day.
The external circumstances change each day, so computing them once and for all is impossible.
More clarification:
RunAlgorithm is actually game-playing algorithm. It plays a whole game (Go or Chess) against fixed opponent. I can play 1000 games overnight. Every night is other opponent.
I want to see whether different opponents need different parameters.
RunAlgorithm is smooth in the sense that changing parameter a little does change algorithm only a little.
Probability of success could be estimated by large number of samples with the same parameters.
But it is too costly to run so many games without changing parameters.
I could try optimize each parameter independently (which would result in 100 runs per parameter) but I guess there are some dependencies.
The whole problem is about using the scarce data wisely.
Games played are very highly randomized, no problem with that.
Maybe you are looking for genetic algorithms.
Why not allow the program fight with itself? Take some vector v (parameters) and let it fight with v + (0.1,0,0,0,..,0), say 15 times. Then, take the winner and modify another parameter and so on. With enough luck, you'll get a strong player, able to defeat most others.
Previous answer (much of it is irrevelant after the question was edited):
With these assumptions and that level of generality, you will achieve nothing (except maybe an impossiblity result).
Basic question: can you change the algorithm so that it will return probability of success, not the result of a single experiment? Then, use appropriate optimization technique (nobody will tell you which under such general assumptions). In Haskell, you can even change code so that it will find the probability in simple cases (probability monad, instead of giving a single result. As others mentioned, you can use a genetic algorithm using probability as fitness function. If you have a formula, use a computer algebra system to find the maximum value.
Probability of success is smooth function of parameters and have single global optimum.
Smooth or continuous? If smooth, you can use differential calculus (Lagrange multipliers?). You can even, with little changes in code (assuming your programming language is general enough), compute derivatives automatically using automatic differentiation.
I will run the tunning overnight, I can handle around 1000 calls to Run algorithm.
That complex? This will allow you to check two possible values (210=1024), out of many floats. You won't even determine order of magnitude, or even order of order of magnitude.
There are around 10 parameters, most of them independently tunable (but some are interdependent)
If you know what is independent, fix some parameters and change those that are independent of them, like in divide-and-conquer. Obviously it's much better to tune two algorithms with 5 parameters.
I'm downvoting the question unless you give more details. This has too much noise for an academic question and not enough data for a real-world question.
The main problem you have is that, with ten parameters, 1000 runs is next to nothing, given that, for each run, all you have is a true/false result rather than a P(success) associated with the parameters.
Here's an idea that, on the one hand, may make best use of your 1000 runs and, on the other hand, also illustrates the the intractability of your problem. Let's assume the ten parameters really are independent. Pick two values for each parameter (e.g. a "high" value and a "low" value). There are 1024 ways to select unique combinations of those values; run your method for each combination and store the result. When you're done, you'll have 512 test runs for each value of each parameter; with the independence assumption, that might give you a decent estimate on the conditional probability of success for each value. An analysis of that data should give you a little information about how to set your parameters, and may suggest refinements of your "high" and "low" values for future nights. The back of my mind is dredging up ANOVA as a possibly useful statistical tool here.
Very vague advice... but, as has been noted, it's a rather vague problem.
Specifically for tuning parameters for game-playing agents, you may be interested in CLOP
http://remi.coulom.free.fr/CLOP/
Not sure if I understood correctly...
If you can choose the parameters for your algorithm, does it mean that you can choose it once for all?
Then, you could simply:
have the developper run all/many cases only once, find the best case, and replace the parameters with the best value
at runtime for your real user, the algorithm is already parameterized with the best parameters
Or, if the best values change for each run ...
Are you looking for Genetic Algorithms type of approach?
The answer to this question depends on:
Parameter range. Can your parameters have a small or large range of values?
Game grading. Does it have to be a boolean, or can it be a smooth function?
One approach that seems natural to this problem is Hill Climbing.
A possible way to implement would be to start with several points, and calculate their "grade". Then figure out a favorable direction for the next point, and try to "ascend".
The main problems that I see in this question, as you presented it, is the huge range of parameter values, and the fact that the result of the run is boolean (and not a numeric grade). This will require many runs to figure out whether a set of chosen parameters are indeed good, and on the other hand, there is a huge set of parameters values yet to check. Just checking all directions will result in a (too?) large number of runs.

How to quantify the quality of a pseudorandom number generator?

This is based on this question. A number of answers were proposed that generate non-uniform distributions and I started wondering how to quantify the non uniformity of the output. I'm not looking for patterning issues, just single value aspects.
What are the accepted procedures?
My current thinking is to computer the average Shannon entropy per call by computing the entropy of each value and taking a weighted average. This can then be compered to the expected value.
My concerns are
Is this correct?
How to compute these value without loosing precision?
For #1 I'm wondering if I've got it correct.
For #2 the concern is that I would be processing numbers with magnitudes like 1/7 +/- 1e-18 and I'm worried that the floating point errors will kill me for any but the smallest problems. The exact form of the computation could result in some major differences here and I seem to recall that there are some ASM options for some special log cases but I can't seem to find the docs about this.
In this case the use is take a "good" PRNG for the range [1,n] and generate a SRNG for the range [1,m]. The question is how much worse is the results than the input?
What I have is expected occurrence rates for each output value.
NIST has a set of documents and tools for statistically analyzing random number generators cross a variety of metrics.
http://csrc.nist.gov/groups/ST/toolkit/rng/index.html
Many of these tests are also incorporated into the Dieharder PRNG test suite.
http://www.phy.duke.edu/~rgb/General/rand_rate.php
There are a ton of different metrics, because there are many, many different ways to use PRNGs. You can't analyze a PRNG in a vacuum - you have to understand the use case. These tools and documents provide a lot of information to help you in this, but at the end of the day you'll still have to understand what you actually need before you can determine of the algorithm is suitable. The NIST documentation is thorough, if somewhat dense.
-Adam
This page discusses one way of checking if you are getting a bad distribution: plotting the pseudo-random values in a field and then just looking at them.
TestU01 has an even more exacting test set than Dieharder. The largest test set is called "BigCrush", but it takes a long time to execute, so there are also subsets called just "Crush" and "SmallCrush". The idea is to first try SmallCrush, and if the PRNG passes that, try Crush, and if it passes that, BigCrush. If it passes that too, it should be good enough.
You can get TestU01 here.

Resources