I am trying to make sense of the different distribution objects in C++11 and I am finding it overwhelming. I hope some of you can and will help.
This is why I am looking into all this:
I need a random number generator that I can adjust every time it is used so that it is more likely to produce the same number again. The second requirement I need to fulfill is that the random numbers generated must be only these numbers:
{1, 2, 4, 8, 16, ..., 128}
Third and last requirement is that on certain occasions I need to skip one or more numbers from the above set.
My problem is that I don't understand the descriptions of the various distribution objects, so I cannot determine which tools I need to use to meet the needs above.
Can somebody tell me what tools I need and how to use them? The clearer, more concise, and more detailed the response, the better.
Your range can be generated from a random number j in the range [0, 7]; you then compute:
1 << j
to get your number. std::uniform_int_distribution<> would be handy for generating the value in [0, 7].
Additionally you could use a std::bernoulli_distribution (which returns a random bool) to decide if the next number is going to be the same as the last one, or if you should generate a new number. The std::bernoulli_distribution defaults to a 50/50 chance of true/false, but you can customize that distribution in the bernoulli_distribution constructor to anything you like (e.g. 80/20 or whatever).
If this isn't clear enough, just jump in with some code. Try coding it up, and if it isn't working, post what you have, and I'm sure somebody will help.
Oh, forgot about your 3rd requirement: For that just put your [0, 7] generation in a loop, and if you come up with a number you're supposed to skip, then iterate the loop, else break out of it.
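Putting all three pieces together, here is a minimal sketch of that approach (the 0.8 repeat probability and the skipped set {4, 32} are just placeholders, not part of your requirements):
#include <iostream>
#include <random>
#include <set>

int main()
{
    std::mt19937 eng{std::random_device{}()};
    std::uniform_int_distribution<int> pick{0, 7};  // exponent j in [0, 7]
    std::bernoulli_distribution repeat{0.8};        // e.g. 80% chance to repeat the last number

    std::set<int> skipped{4, 32};                   // numbers to skip on this occasion (example)
    int last = -1;

    for (int n = 0; n < 20; ++n)
    {
        int value;
        if (last != -1 && skipped.count(last) == 0 && repeat(eng))
            value = last;                           // produce the same number again
        else
        {
            do
                value = 1 << pick(eng);             // one of {1, 2, 4, ..., 128}
            while (skipped.count(value) != 0);      // re-roll if it must be skipped
        }
        last = value;
        std::cout << value << '\n';
    }
}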
For skipping numbers I completely agree with Howard that manual checking is probably the way to go, but for adjusting the probability of a given number being generated there may be a better approach.
Another way to do this would be to use a std::discrete_distribution object, which allows you to specify the probability of each possible value separately (here, one weight per exponent j), so for your example it would be something like:
std::default_random_engine entropy;
std::array<double, 8> probs;   // one weight per exponent j, i.e. per value in {1, 2, 4, ..., 128}
probs.fill(1.0);
std::discrete_distribution<int> choose(probs.begin(), probs.end());
Then, inside your loop, in addition to deciding whether or not to skip, you can increment one of those weights by some amount to increase the odds of that value coming up again, making sure to reinitialize the discrete distribution, like this:
int x;
double increment = 0.2; // or whatever increment you want
for (/* init */; /* condition */; /* update */)
{
    x = choose(entropy);             // x is the exponent; the generated number is 1 << x
    if (skip(x))
        continue;                    // alternatively, set probs.at(x) = 0 here if you
                                     // never want to generate that value again
    probs.at(x) += increment;        // make this value more likely next time
    choose = std::discrete_distribution<int>(probs.begin(), probs.end());
    output(1 << x);
}
where skip and output are your own functions: skip decides whether x should be skipped, and output does whatever you want with the generated value.
I've been looking at the ways to check arguments of functions. I noticed that
MatrixQ can take 2 arguments; the second is a test to apply to each element.
But ListQ only takes one argument. (Also, for some reason, ?ListQ does not have a help page the way ?MatrixQ does.)
So, for example, to check that an argument to a function is a matrix of numbers, I write
ClearAll[foo]
foo[a_?(MatrixQ[#, NumberQ] &)] := Module[{}, a + 1]
What would be a good way to do the same for a List? The code below only checks that the input is a List:
ClearAll[foo]
foo[a_?(ListQ[#] &)] := Module[{}, a + 1]
I could do something like this:
ClearAll[foo]
foo[a_?(ListQ[#] && (And @@ Map[NumberQ[#] &, #]) &)] := Module[{}, a + 1]
so that foo[{1, 2, 3}] will work, but foo[{1, 2, x}] will not (assuming x is a symbol). But this seems to me a somewhat complicated way to do it.
Question: Do you know a better way to check that an argument is a list and also check that its contents are numbers (or of any other Head known to Mathematica)?
And a related question: Are there any major run-time performance issues with adding such checks to each argument? If so, do you recommend that these checks be removed after testing and development are completed so that the final program runs faster? (For example, keep a version of the code with all the checks in for development/testing, and a version without them for production.)
You might use VectorQ in a way completely analogous to MatrixQ. For example,
f[vector_ /; VectorQ[vector, NumericQ]] := ...
Also note two differences between VectorQ and ListQ (see the examples after this list):
A plain VectorQ (with no second argument) gives True only if no elements of the list are lists themselves (i.e. only for 1D structures)
VectorQ will handle SparseArrays while ListQ will not
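A quick illustration of both points (a minimal sketch):
VectorQ[{1, {2}, 3}]      (* False: an element is itself a list *)
ListQ[{1, {2}, 3}]        (* True: the head is List *)
sa = SparseArray[{1, 0, 2}];
VectorQ[sa, NumericQ]     (* True: SparseArray is handled *)
ListQ[sa]                 (* False: the head is SparseArray, not List *)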
I am not sure about the performance impact of using these in practice, I am very curious about that myself.
Here's a naive benchmark. I am comparing two functions: one that only checks the arguments, but does nothing, and one that adds two vectors (this is a very fast built-in operation, i.e. anything faster than this could be considered negligible). I am using NumericQ which is a more complex (therefore potentially slower) check than NumberQ.
In[2]:= add[a_ /; VectorQ[a, NumericQ], b_ /; VectorQ[b, NumericQ]] := a + b
In[3]:= nothing[a_ /; VectorQ[a, NumericQ], b_ /; VectorQ[b, NumericQ]] := Null
Packed array. It can be verified that the check is constant time (not shown here).
In[4]:= rr = RandomReal[1, 10000000];
In[5]:= Do[add[rr, rr], {10}]; // Timing
Out[5]= {1.906, Null}
In[6]:= Do[nothing[rr, rr], {10}]; // Timing
Out[6]= {0., Null}
Homogeneous non-packed array. The check is linear time, but very fast.
In[7]:= rr2 = Developer`FromPackedArray@RandomInteger[10000, 1000000];
In[8]:= Do[add[rr2, rr2], {10}]; // Timing
Out[8]= {1.75, Null}
In[9]:= Do[nothing[rr2, rr2], {10}]; // Timing
Out[9]= {0.204, Null}
Non-homogeneous non-packed array. The check takes the same time as in the previous example.
In[10]:= rr3 = Join[rr2, {Pi, 1.0}];
In[11]:= Do[add[rr3, rr3], {10}]; // Timing
Out[11]= {5.625, Null}
In[12]:= Do[nothing[rr3, rr3], {10}]; // Timing
Out[12]= {0.282, Null}
Conclusion based on this very simple example:
VectorQ is highly optimized, at least when using common second arguments. It's much faster than e.g. adding two vectors, which itself is a well optimized operation.
For packed arrays VectorQ is constant time.
@Leonid's answer is very relevant too, please see it.
Regarding the performance hit (since your first question has been answered already): by all means do the checks, but only in your top-level functions, which receive data directly from the user of your functionality (the user can also be another independent module, written by you or someone else). Don't put these checks in all your intermediate functions, since such checks would be duplicated and indeed unjustified.
EDIT
To address the problem of errors in intermediate functions, raised by @Nasser in the comments: there is a very simple technique which allows one to switch pattern-checks on and off in "one click". You can store your patterns in variables inside your package, defined prior to your function definitions.
Here is an example, where f is a top-level function, while g and h are "inner functions". We define two patterns: for the main function and for the inner ones, like so:
Clear[nlPatt, innerNLPatt];
nlPatt = _?(VectorQ[#, NumericQ] &);
innerNLPatt = nlPatt;
Now, we define our functions:
ClearAll[f,g,h];
f[vector : nlPatt] := g[vector] + h[vector];
g[nv : innerNLPatt] := nv^2;
h[nv : innerNLPatt] := nv^3;
Note that the patterns are substituted inside the definitions at definition time, not run time, so this is exactly equivalent to coding those patterns by hand. Once you are done testing, you just have to change one line: from
innerNLPatt = nlPatt
to
innerNLPatt = _
and reload your package.
A final question is: how do you quickly find errors? I answered that here, in the sections "Instead of returning $Failed, one can throw an exception, using Throw" and "Meta-programming and automation".
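For completeness, here is a minimal sketch of the Throw-based idea mentioned above (the function names are illustrative and not taken from the linked answer):
ClearAll[inner, top];
inner[nv_?(VectorQ[#, NumericQ] &)] := nv^2;
inner[___] := Throw[$Failed, "badArgs"];      (* catch-all: signal unexpected arguments *)
top[v_] := Catch[inner[v] + 1, "badArgs"];

top[{1, 2, 3}]      (* {2, 5, 10} *)
top[{1, 2, "x"}]    (* $Failed *)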
END EDIT
I included a brief discussion of this issue in my book here. In that example, the performance hit was on the level of a 10% increase in running time, which IMO is borderline acceptable. In the case at hand, the check is simpler and the performance penalty is much smaller. Generally, for a function which is at all computationally intensive, correctly written type checks cost only a small fraction of the total run time.
A few tricks which are good to know:
The pattern-matcher can be very fast when used syntactically (no Condition or PatternTest present in the pattern).
For example:
randomString[]:=FromCharacterCode@RandomInteger[{97,122},5];
rstest = Table[randomString[],{1000000}];
In[102]:= MatchQ[rstest,{__String}]//Timing
Out[102]= {0.047,True}
In[103]:= MatchQ[rstest,{__?StringQ}]//Timing
Out[103]= {0.234,True}
Simply because PatternTest was used in the latter case, the check is much slower: the evaluator is invoked by the pattern-matcher for every element, while in the first case everything is purely syntactic and is done entirely inside the pattern-matcher.
The same is true for unpacked numerical lists (the timing difference is similar). However, for packed numerical lists, MatchQ and other pattern-testing functions don't unpack for certain special patterns; moreover, for some of them the check is instantaneous.
Here is an example:
In[113]:=
test = RandomInteger[100000,1000000];
In[114]:= MatchQ[test,{__?IntegerQ}]//Timing
Out[114]= {0.203,True}
In[115]:= MatchQ[test,{__Integer}]//Timing
Out[115]= {0.,True}
In[116]:= Do[MatchQ[test,{__Integer}],{1000}]//Timing
Out[116]= {0.,Null}
The same apparently holds for functions like VectorQ, MatrixQ and ArrayQ with certain predicates (such as NumericQ): these tests are extremely efficient.
A lot depends on how you write your test, i.e. to what degree you reuse the efficient Mathematica structures.
For example, we want to test that we have a real numeric matrix:
In[143]:= rm = RandomInteger[10000,{1500,1500}];
Here is the most straightforward, and slowest, way:
In[144]:= MatrixQ[rm,NumericQ[#]&&Im[#]==0&]//Timing
Out[144]= {4.125,True}
This is better, since we reuse the pattern-matcher better:
In[145]:= MatrixQ[rm,NumericQ]&&FreeQ[rm,Complex]//Timing
Out[145]= {0.204,True}
We did not utilize the packed nature of the matrix however. This is still better:
In[146]:= MatrixQ[rm,NumericQ]&&Total[Abs[Flatten[Im[rm]]]]==0//Timing
Out[146]= {0.047,True}
However, this is not the end. The following one is near instantaneous:
In[147]:= MatrixQ[rm,NumericQ]&&Re[rm]==rm//Timing
Out[147]= {0.,True}
Since ListQ just checks that the head is List, the following is a simple solution:
foo[a:{___?NumberQ}] := Module[{}, a + 1]
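A quick check of this pattern:
foo[{1, 2, 3}]     (* {2, 3, 4} *)
foo[{1, 2, x}]     (* stays unevaluated, since x is not a number *)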
In Mathematica, as in other computer algebra systems, numbers are stored internally in binary form. However, when you export them with functions such as Put and PutAppend, they are converted into approximate decimals. When you import them back with functions such as Get, they are restored from this approximate decimal representation to binary form.
The question is whether the recovered number is always identical to the original binary number and, if not always, in which cases it is not and how large the difference can be. I am particularly interested in the Put - Get cycle (on the same computer system).
The following two simple experiments suggest that the Put - Get cycle in Mathematica probably always restores the original numbers exactly, even for arbitrary-precision numbers:
In[1]:= list=RandomReal[{-10^6,10^6},10000];
Put[list,"test.txt"];
list2=Get["test.txt"];
Order[list,list2]===0
Order[Total@Abs[list-list2],0.]===0
Out[4]= True
Out[5]= True
In[6]:= list=SetPrecision[RandomReal[{-10^6,10^6},10000],50];
Put[list,"test.txt"];
list2=Get["test.txt"];
Order[list,list2]===0
Total@Abs[list-list2]//InputForm
Out[9]= True
Out[10]//InputForm=
0``39.999515496936205
But maybe I am missing something?
UPDATE
With more careful test code I have found that in reality these tests show only that the restored numbers have identical binary RealDigits, but their Precisions may differ, even in the sense of Equal. Here are the improved tests:
test := (Put[list, "test.txt"];
  list2 = Get["test.txt"];
  {Order[list, list2] === 0,
   Order[Total@Abs[list - list2], 0.] === 0,
   Total[Order @@@ RealDigits[Transpose[{list, list2}], 2]],
   Total[Order @@@ Map[Precision, Transpose[{list, list2}], {-1}]],
   Total[1 - Boole[Equal @@@ Map[Precision, Transpose[{list, list2}], {-1}]]]})
In[8]:= list=RandomReal[NormalDistribution[],10000]^1001;
test
Out[9]= {False,True,0,1,3}
In[6]:= list=RandomReal[NormalDistribution[],10000,WorkingPrecision->50]^1001;
test
Out[7]= {False,False,0,-2174,1}
I'm afraid I can't give a definitive answer. If you look into the text file you see it's stored as something like the InputForm of the values, including the precision indication for non-machine precision numbers.
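For instance, here is a minimal sketch of what Put actually writes for an arbitrary-precision number (the file name is just an example):
Put[SetPrecision[1/3, 50], "test.txt"];
FilePrint["test.txt"]
(* prints something like 0.3333333...`50. : the decimal digits followed by the precision mark *)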
Assuming that Get uses the same conversion routines as ImportString and ExportString, your test can be sped up a tiny bit.
Monitor[
 Do[
  i = RandomReal[{$MinMachineNumber, 10 $MinMachineNumber}, 100000];
  If[i =!= ToExpression[ImportString[ExportString[i, "Text"], "List"]],
   Print[i]],
  {n, 100}],
 n]
I have tested this for several hundreds of millions of numbers in various ranges between $MinMachineNumber and $MaxMachineNumber and I always get back the original numbers. It's no proof, of course, but it seems unlikely that you're going to see numbers for which this is not true if there are any (and in that case the difference would be so tiny as to be negligible).
One important thing to know is that Put[] / Get[] doesn't keep packed arrays packed. You should check out DumpSave[]. It's much faster as it's a binary format and keeps arrays packed.
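For example, a minimal sketch of the DumpSave route (the file name is illustrative):
list = RandomReal[1, 10^6];       (* a packed array *)
DumpSave["test.mx", list];        (* binary .mx format; saves the definition of list *)
Clear[list];
Get["test.mx"];                   (* restores the definition *)
Developer`PackedArrayQ[list]      (* True: the array comes back packed *)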
I have 2 sets of data:
d1= {0.119894,0.430666,0.0831885,0.0319174,0.120422,0.113005,0.396407,0.286316,0.0846212,0.0380193,0.047136,0.0362305,0.0445161,0.142403,0.0540607,0.133119,0.10831,0.173586,0.162465,0.0704632,0.0856676,0.086322,0.31334,0.210488,0.165907,0.119317,0.0995894,0.103821,0.135736,0.245069,0.0814167,0.142331,0.321499,0.0576824,0.0535766,0.0546975,0.121395,0.0608112,0.0606295,0.133289,0.0468469,0.0501325,0.0641351,0.0846396,0.317252,0.0779754,0.105217,0.0749865,0.302625,0.301864,0.0929992,0.12178,0.279253,0.245539,0.198353,0.107202,0.17784,0.145572,0.055006,0.0770127,0.0861758,0.189966,0.21403,0.0834313,0.206845,0.2087,0.263422,0.0767717,0.162445,0.0542824,0.0553086,0.141381,0.052898,0.0945407,0.0776741,0.0367623,0.0565677,0.166219,0.035447,0.120121,0.0418321,0.11264,0.0540176,0.120358,0.074417,0.242225,0.398622,0.308373,0.15192,0.278717};
d2={0.170719,0.099203,0.0539713,0.15749,0.150455,0.142714,0.0705496,0.0690684,0.0630756,0.0372223,0.0885515,0.0305229,0.0869673,0.0426363,0.0504665,0.0371966,0.0766164,0.0402321,0.0334813,0.0489499,0.0753463,0.0942363,0.0786223,0.335095,0.0706324,0.0764047,0.0682716,0.0699429,0.0355438,0.0755698,0.10206,0.199187,0.0560379,0.0342713,0.0500202,0.0558365,0.0624332,0.0418887,0.0531662,0.0499419,0.0273659,0.0228881,0.0893776,0.0643183,0.0171277,0.0373337,0.0457631,0.0764322,0.0963383,0.0633643,0.107952,0.0570244,0.19336,0.0428824,0.0629954,0.120787,0.0924894,0.0562895,0.125588,0.116919,0.196895,0.264337,0.0787541,0.318374,0.193144,0.147134,0.0456675,0.0419496,0.057378,0.0577714,0.0706519,0.0410366,0.0716635,0.0547774,0.0157382,0.030444,0.0769898,0.0121786,0.0586156,0.0314843,0.0942514,0.1627,0.0781299,0.148406,0.423559,0.276206,0.0708934,0.0812794,0.159947};
Now I want to find an estimated distribution using StableDistribution[].
For the first data set I do the following:
dist1 = EstimatedDistribution[d1, StableDistribution[alpha, beta, mu, sigma]]
I get a message and output
FindMaximum::sdprec: Line search unable to find a sufficient increase in the function value with MachinePrecision digit precision. >>
StableDistribution[1,0.863446,1.,-0.0781627,0.0345779]
The output looks OK (not a great fit for the data, but not too bad), but what does the message imply about the output?
For the second data set, d2
dist2 = EstimatedDistribution[d2, StableDistribution[alpha, beta, mu, sigma]]
I get a different message.
Optimization`ModifiedCholeskyDecomposition::herm: The matrix {{2.76856*10^157,-1.75574*10^159,-1.84519*10^157,-2.26892*10^157},{7.88598*10^159,0.,6.41507*10^159,7.88598*10^159},{1.82386*10^157,6.41507*10^159,1.13495*10^157,1.82386*10^157},{-2.26892*10^157,-1.75574*10^159,-1.84519*10^157,1.68961*10^157}} is not Hermitian or real and symmetric.
and output:
StableDistribution[1,0.834688,1.,-0.0101189,0.0181306]
So, I've got a couple of questions. Can anyone explain these messages and their relevance? It looks to me as if Mathematica tries a number of different ways to estimate the distribution and some just don't work very well.
Thx.
J.
In order to make parameter estimation for the stable distribution efficient, a multivariate interpolation of the pdf(alpha, beta, x) is constructed, and the resulting interpolation is used for the estimation. Polynomial interpolation exhibits small-scale oscillations, which can throw off the maximization routines. Thus, when working with stable estimation, it is better to use PrecisionGoal->3, AccuracyGoal->3.
Doing this does not get rid of your messages, but it will speed up estimation, which matters for larger problems.
Since your data size is small, the statistical uncertainties of the estimators are large anyway.
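For example (a sketch only; this assumes EstimatedDistribution accepts PrecisionGoal and AccuracyGoal as options and passes them through to the underlying optimizer):
dist1 = EstimatedDistribution[d1,
  StableDistribution[alpha, beta, mu, sigma],
  PrecisionGoal -> 3, AccuracyGoal -> 3]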
The first message is benign, but the second probably indicates a bug, since the log-likelihood of the estimated distribution on the data is too low.
As an aside, it seems that StableDistribution is not a very good fit for your data:
In[44]:= LogLikelihood[
EstimatedDistribution[d1, StableDistribution[a, b, c, d]],
d1] // Quiet
Out[44]= 101.926
In[45]:= LogLikelihood[
EstimatedDistribution[d1, HyperbolicDistribution[a, b, c, d]],
d1] // Quiet
Out[45]= 111.847
In[46]:= LogLikelihood[
EstimatedDistribution[d2, StableDistribution[a, b, c, d]],
d2] // Quiet
Out[46]= -10.2194
In[47]:= LogLikelihood[
EstimatedDistribution[d2, HyperbolicDistribution[a, b, c, d]],
d2] // Quiet
Out[47]= 143.04
A general comment about numerical optimizer warnings: I had a similar issue using FindMaximum and getting "sufficient decrease" warnings, even though the output seemed fine. It had to do with the fact that the default AccuracyGoal of 6 could not be guaranteed, but a smaller goal could be met without warnings.
You can globally turn the warning off with Off[FindMaximum::sdprec], or suppress it on a per-command basis with
Quiet[EstimatedDistribution[d1,StableDistribution[alpha, beta, mu, sigma]], FindMaximum::sdprec]
I'm trying to run the following program, which calculates the roots of polynomials of degree up to d with coefficients only +1 or -1, and then stores them in files.
d = 20; n = 18000;
f[z_, i_] := Sum[(2 Mod[Floor[(i - 1)/2^k], 2] - 1) z^(d - k), {k, 0, d}];
Here f[z,i] gives a polynomial in z whose plus or minus signs count in binary. For d=2, say, we would have
f[z,1] = -z^2 - z - 1
f[z,2] = -z^2 - z + 1
f[z,3] = -z^2 + z - 1
f[z,4] = -z^2 + z + 1
DistributeDefinitions[d, n, f]
ParallelDo[
 Do[
  root = N[Root[f[z, i], j]];
  {a, b} = Round[n ({Re[root], Im[root]}/1.5 + 1)/2],
  {i, 1, 2^d}],
 {j, 1, d}]
I realise reading this probably isn't too enjoyable, but it's relatively short anyway. I would've tried to cut it down to the relevant parts, but here I really have no clue what the trouble is. I'm calculating all the roots of f[z,i], rounding them so they correspond to points in an n by n grid, and saving that data in various files.
For some reason, the memory usage in Mathematica creeps up until it fills all the memory (6 GB on this machine); then the computation continues extremely slowly; why is this?
I am not sure what is using up the memory here. My only guess was that the stream of files used up memory, but that's not the case: I tried appending data to 2 GB files and there was no noticeable memory usage for that. There seems to be absolutely no reason for Mathematica to be using large amounts of memory here.
For small values of d (15 for example), the behaviour is the following: I have 4 kernels running. As they all run through the ParallelDo loop (each doing a value of j at a time), the memory use increases, until they all finish going through that loop once. Then the next times they go through that loop, the memory use does not increase at all. The calculation eventually finishes and everything is fine.
Also, quite importantly, once the calculation stops, the memory use does not go back down.
If I start another calculation, the following happens:
- If the previous calculation stopped when memory use was still increasing, it continues to increase (it might take a while to start increasing again, basically to get to the same point in the computation).
- If the previous calculation stopped when memory use was not increasing, it does not increase further.
Edit: The issue seems to come from the relative complexity of f: changing it into some simpler polynomial seems to fix the issue. I thought the problem might be that Mathematica remembers f[z,i] for specific values of i, but executing f[z, i] =. just after calculating a root of f[z,i] complains that the assignment did not exist in the first place, and the memory is still used.
It's quite puzzling really, as f is the only remaining thing I can imagine taking up memory, but defining f in the inner Do loop and clearing it each time after a root is calculated does not solve the problem.
Ouch, this is a nasty one.
What's going on is that N will cache results in order to speed up future calculations if you need them again. Sometimes this is absolutely what you want, but sometimes it just breaks the world. Fortunately, you do have some options. One is to use the ClearSystemCache command, which does just what it says on the tin. After I ran your un-parallelized loop for a little while (before getting bored and aborting the calculation), MemoryInUse reported ~160 MiB in use. Using ClearSystemCache got that down to about 14 MiB.
One thing you should look at doing, instead of calling ClearSystemCache programmatically, is to use SetSystemOptions to change the caching behavior. You should take a look at SystemOptions["CacheOptions"] to see what the possibilities are.
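A minimal sketch of inspecting those options, together with the ClearSystemCache call mentioned above (the exact sub-option names vary between versions, so check the output of SystemOptions before passing anything to SetSystemOptions):
SystemOptions["CacheOptions"]   (* shows the available caching sub-options *)
MemoryInUse[]                   (* memory before clearing *)
ClearSystemCache[];             (* discard cached numerical results *)
MemoryInUse[]                   (* memory after clearing *)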
EDIT: It's not terribly surprising that the caching causes a bigger problem for more complex expressions. It's got to be stashing copies of those expressions somewhere, and more complex expressions require more memory.