Generate Array of Numbers that fit to a Probability Distribution in Ruby?

Say I have 100 records, and I want to mock out the created_at date so it fits on some curve. Is there a library to do that, or what formula could I use? I think this is along the same track:
Generate Random Numbers with Probabilistic Distribution
I don't know much about how they are classified in mathematics, but I'm looking at things like:
bell curve
logarithmic (typical biology/evolution) curve?
...
Just looking for some formulas in code so I can say this:
Given 100 records, a timespan of 1.week, and an interval of 12.hours
set created_at for each record such that it fits, roughly, to curve
Thanks so much!
Update
I found this forum post about ruby algorithms, which led me to rsruby, an R/Ruby bridge, but that seems like too much.
Update 2
I wrote this little snippet trying out the gsl library, getting there...
Generate test data in Rails where created_at falls along a Statistical Distribution

I recently came across croupier, a ruby gem that aims to generate numbers according to a variety of statistical distributions.
I have yet to try it but it sounds quite promising.

You can generate UNIX timestamps which are really just integers. First figure out when you want to start, for example now:
start = DateTime::now().to_time.to_i
Find out when the end of your interval should be (say 1 week later):
finish = (DateTime::now()+1.week).to_time.to_i
Ruby's built-in generator is a Mersenne Twister, and its output is very close to uniform. Note that this gives a flat spread of dates rather than a curve. Generate a random integer between the two:
r = Random.new.rand(start..finish)
Then convert that back to a date:
d = Time.at(r)
This looks promising as well:
http://rb-gsl.rubyforge.org/files/rdoc/randist_rdoc.html
And this too:
http://rb-gsl.rubyforge.org/files/rdoc/rng_rdoc.html
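The uniform timestamps above can be bent into a bell curve by drawing the offset from a normal distribution instead. Here is a minimal sketch in Python rather than Ruby; the helper name `bell_curve_timestamps`, the start date, and the choice of sigma (one sixth of the span) are assumptions for illustration, not something from the question.

```python
import random
from datetime import datetime, timedelta

def bell_curve_timestamps(n, start, span, step, rng):
    """Return n datetimes in [start, start + span], snapped to multiples
    of step and clustered around the middle of the span."""
    slots = int(span / step)        # e.g. 1 week / 12 hours = 14 slots
    mean = slots / 2.0              # centre of the bell
    sigma = slots / 6.0             # assumption: +/- 3 sigma covers the span
    stamps = []
    while len(stamps) < n:
        slot = round(rng.gauss(mean, sigma))
        if 0 <= slot <= slots:      # redraw the rare values outside the span
            stamps.append(start + slot * step)
    return stamps

rng = random.Random(42)             # fixed seed so the mock data is repeatable
start = datetime(2024, 1, 1)        # placeholder start date
created_at = bell_curve_timestamps(
    100, start, timedelta(weeks=1), timedelta(hours=12), rng
)
```

Each record would then take one value from `created_at`. Swapping `rng.gauss` for another sampler (for example `rng.expovariate`) would give a different curve with the same snapping logic.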

From wiki:
There are a couple of methods to generate a random number based on a probability density function. These methods involve transforming a uniform random number in some way. Because of this, these methods work equally well in generating both pseudo-random and true random numbers.
One method, called the inversion method, involves integrating up to an area greater than or equal to the random number (which should be generated between 0 and 1 for proper distributions).
A second method, called the acceptance-rejection method, involves choosing an x and y value and testing whether the function of x is greater than the y value. If it is, the x value is accepted. Otherwise, the x value is rejected and the algorithm tries again.
The first method is the one used in the accepted answer in your SO linked question: Generate Random Numbers with Probabilistic Distribution
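Both methods from the quote can be sketched in a few lines of Python. The exponential and triangular densities below are examples chosen for illustration only, and the function names are hypothetical.

```python
import math
import random

def sample_exponential(rate, rng):
    """Inversion method: solve CDF(x) = u for x, with CDF(x) = 1 - exp(-rate * x)."""
    u = rng.random()                    # uniform in [0, 1)
    return -math.log(1.0 - u) / rate    # 1 - u lies in (0, 1], so the log is defined

def sample_by_rejection(pdf, lo, hi, pdf_max, rng):
    """Acceptance-rejection: draw (x, y) uniformly in the bounding box
    and keep x only if y falls under the curve."""
    while True:
        x = rng.uniform(lo, hi)
        y = rng.uniform(0.0, pdf_max)
        if y < pdf(x):
            return x

def triangle_pdf(x):
    """Example density on [0, 2] that peaks at x = 1 with height 1."""
    return 1.0 - abs(x - 1.0)

rng = random.Random(1)
exp_draws = [sample_exponential(2.0, rng) for _ in range(20000)]
tri_draws = [sample_by_rejection(triangle_pdf, 0.0, 2.0, 1.0, rng) for _ in range(20000)]
```

Inversion needs a CDF you can invert in closed form (or numerically); rejection only needs the density and an upper bound on it, at the cost of discarding some draws.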

Another option is the Distribution gem under SciRuby. You can generate normal numbers by:
require 'distribution'
rng = Distribution::Normal.rng
random_numbers = Array.new(100).map { rng.call }
There are RNGs for various other distributions as well.

Related

Why the decision tree algorithm in python change every run?

I am following a course on udemy about data science with python.
The course is focused on the output of the algorithm and less on the algorithm by itself.
In particular I am building a decision tree. Every time I run the algorithm in Python, even with the same samples, it gives me a slightly different decision tree. I asked the tutors and they told me "The decision trees does not guarantee the same results each run because of its nature." Can someone explain why in more detail, or maybe recommend a good book about it?
I built the decision tree of my data with these imports:
import numpy as np
import pandas as pd
from sklearn import tree
and doing this command:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X,y)
where X are my feature data and y is my target data
Thank you
The DecisionTreeClassifier() function is apparently documented here:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
So this function has many arguments. But in Python, function arguments may have default values. Here, all arguments have default values, so you can even call the function with an empty argument list, like this:
clf = tree.DecisionTreeClassifier()
The parameter of interest, random_state is documented like this:
random_state: int, RandomState instance or None, default=None
So your call is equivalent to, among many other things:
clf = tree.DecisionTreeClassifier(random_state=None)
The None value tells the library that you don't want to bother with providing a seed (that is, an initial state) to the underlying pseudo-random number generator. Hence, the library has to come up with some seed.
Typically, it will take the current time value, with microsecond precision if possible, and apply some hash function. So at every call you will get a different initial state, and so a different sequence of pseudo-random numbers. Hence, a different tree.
You might want to try forcing the seed. For example:
clf = tree.DecisionTreeClassifier(random_state=42)
and see if your problem persists.
Now, as to why the decision tree requires pseudo-random numbers at all, this is discussed for example here:
According to scikit-learn’s “best” and “random” implementation [4], both the “best” splitter and the “random” splitter use a Fisher-Yates-based algorithm to compute a permutation of the features array.
The Fisher-Yates algorithm is the most common way to compute a random permutation. Also, if stopped before completion, it can be used to extract a random subset of the data sample, for example if you need a random 10% of the sample to be excluded from the data fitting and set aside for a later cross-validation step.
Side note: in some circumstances, non-reproducibility can become a pain point, for example if you want to study the influence of an external parameter, say some global Y values bias. In that case, you don't want uncontrolled changes in the random numbers to blur the effects of your parameter changes. Hence the need for the API to provide some way to control the seed value.
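To make the role of the seed concrete, here is a small stand-alone sketch of a (partial) Fisher-Yates shuffle in plain Python. It is not scikit-learn's internal code; `partial_fisher_yates` is a hypothetical helper that only illustrates why a fixed seed reproduces the same permutation and how stopping early yields a random subset.

```python
import random

def partial_fisher_yates(items, k, rng):
    """Return k items chosen uniformly at random, using k steps of Fisher-Yates."""
    pool = list(items)
    n = len(pool)
    for i in range(k):
        j = rng.randrange(i, n)     # pick from the not-yet-fixed tail
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:k]

features = list(range(10))
a = partial_fisher_yates(features, 10, random.Random(42))  # full permutation
b = partial_fisher_yates(features, 10, random.Random(42))  # same seed, same result
c = partial_fisher_yates(features, 3, random.Random())     # unseeded: varies per run
```

Here `a` and `b` are always identical, while `c` changes from run to run, which mirrors the difference between `random_state=42` and `random_state=None`.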

Random number that prefers a range

I'm trying to write a program that requires me to generate a random number.
I also need to make it so there's a variable chance to pick a set range.
In this case, I would be generating between 1-10 and the range with the percent chance is 7-10.
How would I do this? Could I be supplied with a formula or something like that?
So if I'm understanding your question, you want two number ranges, and a variable-defined probability that the one range will be selected. This can be described mathematically as a probability density function (PDF), which in this case would also take your "chance" variable as an argument. If, for example, your 7-10 range is more likely than the rest of the 1-10 range, your PDF might look something like a step function: flat and low over 1-7, then flat and higher over 7-10.
One PDF such as a flat distribution can be transformed into another via a transformation function, which would allow you to generate a uniformly random number and transform it to your own density function. See here for the rigorous mathematics:
http://www.stat.cmu.edu/~shyun/probclass16/transformations.pdf
But since you're writing a program and not a mathematics thesis, I suggest that the easiest way is to just generate two random numbers. The first decides which range to use, and the second generates the number within your chosen range. Keep in mind that if your ranges overlap (1-10 and 7-10 obviously do) then the overlapping region will be even more likely, so you will probably want to make your ranges exclusive so you can more easily control the probabilities. You haven't said what language you're using, but here's a simple example in Python:
import random
range_chance = 30  # 30 percent chance of the 7-10 range
if random.uniform(0, 100) < range_chance:
    print(random.uniform(7, 10))
else:
    print(random.uniform(1, 7))  # 1-7 so that there is no overlapping region

Test the randomness of a black box that outputs random 64-bit floats

I got this interview question and need to write a function for it. I failed.
Because it is a phone interview question, I don't think what I am supposed to code really needs to be a perfect randomness tester.
Any ideas?
How to write some code to be a reasonable randomness tester within like 30 minutes during an interview?
edit
The numbers in this question are uniformly distributed.
As this is an interview question, I think the interviewers are looking to assess in two ways:
Ability to understand what the requirements of the problem really are.
Ability to think of some code that would address those requirements.
This could be a really good interview question in certain settings, especially if the interviewer were willing to prompt the candidate with questions as and when necessary.
In terms of understanding the requirements of the question, it helps if you know that this is a really difficult problem, witness the Diehard tests mentioned in pjs's answer. Fundamentally I think a candidate would need to demonstrate appreciation of two things:
(a) The overall distribution of the numbers should match the desired distribution (I'm assuming it is uniform in this case, but as @pjs points out in comments this assumption should be made explicit).
(b) Each number drawn should be independent from the previous numbers drawn.
With half an hour to code something up in a phone interview, you can't go very far. If I were answering this question I would try to suggest something like:
(a) To test the distribution, come up with a set of equal-sized bins for the floating point numbers, and count the numbers that fall into each bin. Plot a histogram and eyeball it (plotting the data is always a good idea). To extend this, you could use a chi-squared test, as described in amit's answer.
However, as discussed in the comments, and here
The main problem with chi squared test is the choice of number and size of the intervals. Although rules of thumb can help produce good results, there is no panacea for all kinds of applications.
To this end, the Kolmogorov-Smirnov test can be used. The idea behind this test is that a plot of the ordered data should be a good fit to the ideal ordered data (that is, to the cumulative distribution function). For a uniform distribution the ideal ordered data is a straight line: you expect the 10th percentile of the data to be 10% of the way through the range, the 20th to be 20% of the way through the range and so on. So, programmatically, you could sort the data, plot it against the ideal values and you should get a straight line. There is also a formal, quantitative statistical test you can apply, which is based on the largest difference between the actual and ideal values.
(b) To test independence, there are multiple approaches. Autocorrelation at various time lags is one fairly obvious one: to what extent is the value at time t similar to the value at time t+1, for example. The runs test is another nice one: you convert all the numbers into 1 or 0 depending on whether they fall above or below the median, and then the distribution of the length of runs can be used to construct a statistical test. The runs test can also be used to test for runs in one direction or another, as described here and here (this might be more useful in your case). Both of these have fairly straightforward implementations so long as you have the formulas to hand!
Apart from the diehard tests, other good sources discussing random number generators include here and here.
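As a rough idea of what could be typed up in half an hour, here is a sketch of the Kolmogorov-Smirnov statistic for (a) and a lag autocorrelation for (b). It assumes the black box returns floats in [0, 1) and uses Python's own generator as a stand-in for it; the function names and the thresholds in the comment are rules of thumb, not a full test suite.

```python
import random

def ks_statistic_uniform(samples):
    """Largest gap between the empirical CDF and the uniform CDF on [0, 1]."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - x, x - i / n)
    return d

def lag_autocorrelation(samples, lag=1):
    """Sample correlation between the sequence and itself shifted by lag."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples)
    cov = sum((samples[i] - mean) * (samples[i + lag] - mean) for i in range(n - lag))
    return cov / var

rng = random.Random(7)                     # stand-in for the black box
data = [rng.random() for _ in range(10000)]
d = ks_statistic_uniform(data)
r1 = lag_autocorrelation(data, 1)
# Rough 5% thresholds for large n: d above about 1.36 / sqrt(n),
# or |r1| above about 2 / sqrt(n), would be suspicious.
```

A generator that always returns 0.5 gives `d = 0.5`, and a sorted sequence gives an autocorrelation close to 1, so both checks catch the obvious failures.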
The way to check whether a random number generator (or any other random process, for that matter) matches a desired model (in your case, a uniform distribution) is to use a statistical test such as Pearson's chi-squared test.
The test is based on collecting observations and comparing them with the counts expected under the theoretical model you assume the numbers come from.
At the end, the test gives you a p-value: the probability of seeing a sample at least this far from the model if the numbers really did come from it.
A simple example:
Given a die and the draws [5,3,5,5,1,1], is the die balanced? (p=1/6 for each of {1,...,6})
Given the above observations we create the Expected vector: E = [1,1,1,1,1,1] (each entry is N/6, where 6 is the number of outcomes and N is the number of draws, also 6 in the above example). And the Observed vector: O = [2,0,1,0,3,0]
From this we compute the statistic:
X^2 = sum((O_i - E_i)^2 / E_i) = 1/1 + 1/1 + 0/1 + 1/1 + 4/1 + 1/1 = 8
Now we need to find the probability P(X^2 >= 8) according to the chi^2 distribution with 5 degrees of freedom (the number of outcomes minus one). This probability is about 0.16, so we cannot reject the hypothesis that the sample comes from an unbiased die. With only 6 draws the expected counts are also far too small for the chi^2 approximation to be reliable; a common rule of thumb asks for at least 5 expected observations per outcome.
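The arithmetic of the die example can be checked with a short script. This is a sketch using only the standard library; the tail probability uses the closed form for a chi-squared variable with k - 1 = 5 degrees of freedom (six outcomes minus one), and the helper names are made up for this example.

```python
import math
from collections import Counter

def chi_squared_statistic(draws, faces=6):
    """Pearson's statistic for draws from a die assumed to be fair."""
    counts = Counter(draws)
    expected = len(draws) / faces
    return sum(
        (counts.get(face, 0) - expected) ** 2 / expected
        for face in range(1, faces + 1)
    )

def chi2_sf_df5(x):
    """P(X >= x) for a chi-squared variable with 5 degrees of freedom."""
    t = x / 2.0
    tail = math.exp(-t) * (2.0 * math.sqrt(t) + (4.0 / 3.0) * t ** 1.5)
    return math.erfc(math.sqrt(t)) + tail / math.sqrt(math.pi)

draws = [5, 3, 5, 5, 1, 1]
stat = chi_squared_statistic(draws)
p_value = chi2_sf_df5(stat)
print(stat, round(p_value, 3))
```

With a statistics library available, a general chi-squared survival function would replace the hand-written `chi2_sf_df5`.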
You're saying that they wanted you to recreate/reinvent the "diehard" battery of tests that it took Marsaglia many years to develop? I'd call them on unreasonable expectations.
Whatever distribution the random floats are supposed to have, say a uniform distribution over the interval [0,1], you can use the Kolmogorov-Smirnov test http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test to check whether a sample fails to follow the desired distribution. This can have advantages over the chi-squared test if you have many possible values (because if you have more possible values than samples, you have to define buckets for the chi-squared test, which makes it less powerful than a general distribution check like Kolmogorov-Smirnov).

Need an Algorithm to generate Serialnumber

I want to generate 16-digit hexadecimal serial numbers like: F204-8BE2-17A2-CFF3.
(This pattern gives me 16^16 distinct serial numbers, but I don't need all of them.)
I'd like you to suggest an algorithm that generates these serial numbers randomly, with one special property:
any two serial numbers differ in at least 6 digits
(that is, even the two most similar serial numbers should still differ at 6 positions).
I know that a strict algorithm with this property needs to remember previously generated serial numbers, and I don't want that overhead.
In fact, I need an algorithm that does this with a low probability of any chosen pair colliding (less than 0.001 seems sufficient).
PS:
I've just tried creating 10K strings randomly using an MD5 hash, and it gave similar strings (similar = more than 3 digits the same) with probability 0.00018.
It is possible to construct a correct generator without having to remember all previously generated codes. You can generate serial numbers that differ in at least 6 positions by using an error-correcting code in the spirit of the Hamming code. The classic Hamming codes only guarantee a minimum distance of 3, but other families, such as Reed-Solomon codes over the 16 hex symbols, can be designed for a larger minimum distance. Obviously, the greater the distance, the more redundancy you will have to use, resulting in a more complex code and longer numbers (or fewer usable serials for a fixed length).
First you design a code to your liking that encodes a number into a sequence of hexadecimal digits; then you can take any sequence of distinct numbers, such as a simple counter or the prime numbers, and feed it in. You only need to remember which number was used last and use the next one.
That being said, if you don't need to properly ensure minimal distance of two serials, and would settle for a small error, I would suggest that any half decent hash function or cypher should produce decently spaced out outputs. Therefore the first thing I would try to do is to take MD5 or SHA hashes and test-drive them on numbers 1 - 1000. My hopes are, the results will be quite satisfactory.
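The test drive suggested above might look like the following sketch. It assumes the serial is simply the first 16 hex digits of the MD5 digest of a counter; that mapping, and the helper names, are illustrative choices rather than part of the answer.

```python
import hashlib
from itertools import combinations

def serial_from_counter(n):
    """Format the first 16 hex digits of MD5(n) as XXXX-XXXX-XXXX-XXXX."""
    digest = hashlib.md5(str(n).encode("ascii")).hexdigest().upper()
    digits = digest[:16]
    return "-".join(digits[i:i + 4] for i in range(0, 16, 4))

def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

serials = [serial_from_counter(n) for n in range(1, 1001)]
# Dashes sit at the same positions in every serial, so they never count.
min_distance = min(hamming_distance(a, b) for a, b in combinations(serials, 2))
print(min_distance)
```

If `min_distance` comes out at 6 or more for the range you care about, the hash-based approach already meets the requirement for those counters, although unlike a designed code it offers no guarantee for the next one.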
I suggest you look into the ANSI X9.17 pseudorandom bit generator. An algorithmic sketch is given in these slides. ANSI X9.17 generates 64-bit pseudorandom strings which is what you want.
A revised and enhanced version of this generator was approved by NIST. Please have a look at this page.
Now whether you use ANSI X9.17 generator, another generator, or develop your own, it's a good idea to have the generator pass some statistical tests in order to ensure the quality of its pseudorandom bits.
Example tests include the ENT battery, the DIEHARD battery, and the NIST battery.

How do you make an algorithm for a Random Number Generator?

According to a teacher of mine, in order to do this you make two arrays of numbers with several decimal places: one positive array and one negative.
Array 1 [0] = e.g 1.5739
Array 2 [0] = e.g -5.31729
Then you find current time
201305220957 or May 22,2013 at 9:57 AM
And use this equation:
(201305211647*1.5739)--5.31729
Then you take the absolute value and round to one decimal place, and you have your number.
Is it true that in most generators the value is dependent on time?
Bottom line up front - Generating random numbers is really hard to do right, and has badly burned some seriously smart people (John Von Neumann for one). Normal people shouldn't try to create their own RNG algorithms. It requires expertise in number theory, probability & statistics, and numerical computation. Unless you qualify in all three fields, you're much better off using algorithms developed by people who are. If you want to know how to do it right you can find lots of good info at http://en.wikipedia.org/wiki/Random_number_generation and http://en.wikipedia.org/wiki/Pseudorandom_number_generator.
Speaking bluntly, your teacher is totally clueless about this topic.
If you need random numbers for statistical purposes, use a well-studied generator. One I like is the Mersenne Twister, which has very nice statistical properties, is fast to run and easy to code. It is not suitable for encryption, though: for that, use a cryptographically secure generator such as the one your operating system provides.
If you just need a reasonably random generator, for example to make things appear random in a game, you can use the classic Linear Congruential Generator (LCG), which is trivial to write and produces fairly random-looking output (not safe for heavy-duty computation).
The LCG generator:
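#include <stdint.h>
uint32_t seed = 0x333; // choose any starting value
// Unsigned arithmetic wraps modulo 2^32; overflow of a signed int would be undefined behaviour.
uint32_t next_random(void) { seed = seed * 69069u + 1u; return seed; }
There are several numbers you can use instead of 69069. But don't pick your own. Choose one from here if you don't like 69069.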
http://en.wikipedia.org/wiki/Linear_congruential_generator
Wikipedia has a lot of great pages devoted to random number generation (RNG). One page is devoted to just listing the various types of random number generators used throughout history. One of the earliest and weakest is known as the Middle Square method: easy to implement in programming and suitable for many low level tasks. Some computers have a linear feedback shift register (LFSR) built into the circuitry for random number generation, but it's not very advanced. One of the more modern generators (not the most modern) is the Mersenne Twister. It has excellent statistical properties, but it is not cryptographically secure: its internal state can be reconstructed from 624 consecutive 32-bit outputs, after which every later value is predictable.
Properly, these are pseudorandom number generators (PRNG), because they aren't truly random. They aren't truly random because computers are deterministic machines (state machines); no predetermined algorithm can be programmed to generate truly random numbers from a known prior state.
That said, true random number generator (TRNG) hardware circuitry (typically analog) does exist, and it is approached in different ways. Approaches range from sampling ambient conditions such as temperature and pressure to more subtle phenomena at the atomic/quantum level, such as which state a multistable circuit with feedback settles into. Most modern personal computers do not use this and, if they do, probably only use it to find the seed value for PRNGs. Then you also have RNG programs that go through an internet connection to random number servers online, most of which use TRNGs. You can even rely on lookup tables of documented real world phenomena; lookup tables are very old school.
Pseudorandom number generators only have the appearance of randomness, namely, they follow a particular distribution and the ability to predict future values from prior ones is not easy. There are a set of diehard tests which have the sole purpose of testing the quality of a random number generator. In general, though, all a PRNG is, is an algorithm that produces a sequence of integers. Ideally, we are looking for an algorithm that produces a sequence of numbers which are suitably unpredictable (this is the hardest part) from prior terms in the sequence, while also following a particular distribution (uniform, typically), meaning that every value in a range is produced in equal proportion.
In general, it's a trivial task to convert one distribution into another, regardless of whether the source is a TRNG or a PRNG. Uniform discrete integer distributions (which is what PRNGs generate) can easily be extended or compressed to span any arbitrary interval of integers using a variety of scaling techniques that preserve uniformity, or converted into uniform floating point distributions by randomly choosing a large integer and scaling it down into a float, etc.
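A minimal sketch of both conversions, assuming the underlying generator emits uniform 32-bit integers. `next_u32` is a stand-in built on Python's own generator, and rejection of the uneven tail is one of several scaling techniques that preserve uniformity.

```python
import random

def next_u32(rng):
    """Stand-in for a PRNG core that emits uniform 32-bit integers."""
    return rng.getrandbits(32)

def uniform_int(lo, hi, rng):
    """Uniform integer in [lo, hi] (span up to 2**32), avoiding the bias
    that a plain modulo would introduce."""
    span = hi - lo + 1
    limit = (2 ** 32 // span) * span    # largest multiple of span that fits
    while True:
        x = next_u32(rng)
        if x < limit:                   # reject the uneven tail and redraw
            return lo + x % span

def uniform_float(rng):
    """Uniform float in [0, 1) obtained by scaling a 32-bit integer down."""
    return next_u32(rng) / 2 ** 32

rng = random.Random(3)
rolls = [uniform_int(1, 6, rng) for _ in range(6000)]
fracs = [uniform_float(rng) for _ in range(1000)]
```

Without the rejection step, values of `x % span` below `2**32 % span` would be slightly more likely than the rest whenever `span` does not divide 2**32.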
Uniform floating point numbers can easily be converted to any other non-uniform distribution such as the Normal, Chi-Square, exponential, etc., using a variety of methods, such as Inverse Transform sampling, Rejection sampling, or simple algebraic relationships between distributions (e.g. a Chi-Square variable is simply the sum of the squares of independent standard normal variables). Additionally, some distributions can be suitably approximated using simple mathematical functions applied to a uniform float.
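As an illustration of the algebraic route, the sketch below turns uniform floats into standard normals and then into Chi-Square values. The Box-Muller transform used for the first step is my choice of method here, not one named in the answer, and the function names are hypothetical.

```python
import math
import random

def standard_normal(rng):
    """Box-Muller transform: two uniforms in, one standard normal out."""
    u1 = 1.0 - rng.random()     # shift to (0, 1] so that log(u1) is defined
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

def chi_square(k, rng):
    """Chi-Square with k degrees of freedom: sum of k squared standard normals."""
    return sum(standard_normal(rng) ** 2 for _ in range(k))

rng = random.Random(11)
normals = [standard_normal(rng) for _ in range(20000)]
chis = [chi_square(3, rng) for _ in range(20000)]
```

The sample mean of `normals` should sit near 0 and the sample mean of `chis` near 3, since a Chi-Square variable has mean equal to its degrees of freedom.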
Ultimately the hardest part and the heart of most studies in the subject is in the generation of those uniformly distributed integers, that form the basis of all other distributions.
There is nothing fundamentally wrong with using clock time to get an initial seed value, for example. I'd favor using the lowest three or four digits of the time expressed in microseconds, however, for something suitably unpredictable. Or using the LFSR in a computer for the same purpose, to get a more advanced algorithm jump-started.
