I am using PCA to visualize clusters and noticed Sklearn added "random_state" parameter to the PCA method (http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html), my question is what does the random_state parameter do there?
My understanding is that PCA should return the same principle components despite the random state, so, what is the purpose of having it built in?
As the docs state, this param is only relevant for some of the solvers:
random_state : int, RandomState instance or None, optional (default None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when svd_solver == ‘arpack’ or ‘randomized’.
The randomized solver is described in Finding Structure With Randomness:
Stochastic Algorithms For Constructing
Approximate Matrix Decompositions and it's clear that it's an approximation with guaranteed error-bounds.
Now what's with arpack? It's an iterative method and wiki's pseudo-code already shows some random-initialization and the arpack-page also mentions Implicitly Restarted Arnoldi Method, where restart is one more indication of randomness.
So be prepared to see, that not all methods output the same result. arpack and especially randomized are targeting bigger datasets, where approximation is all we can do.
Also observe, that both of these solvers were not available in version 0.17 when there was no random_state parameter.
Related
I have noticed that my gensim Doc2Vec (DBOW) model is sensitive to document tags. My understanding was that these tags are cosmetic and so they should not influence the learned embeddings. Am I misunderstanding something? Here is a minimal example:
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np
import os
os.environ['PYTHONHASHSEED'] = '0'
reps = []
for a in [0,500]:
documents = [TaggedDocument(doc, [i + a])
for i, doc in enumerate(common_texts)]
model = Doc2Vec(documents, vector_size=100, window=2, min_count=0,
workers=1, epochs=10, dm=0, seed=0)
reps.append(np.array([model.docvecs[k] for k in range(len(common_texts))])
reps[0].sum() == reps[1].sum()
This last line returns False. I am working with gensim 3.8.3 and Python 3.5.2. More generally, is there any role that the values of the tags play (assuming they are unique)? I ask because I have found that using different tags for documents in a classification task leads to widely varying performance.
Thanks in advance.
First & foremost, your test isn't even comparing vectors corresponding to the same texts!
In run #1, the vector for the 1st text in in model.docvecs[0]. In run #2, the vector for the 1st text is in model.docvecs[1].
And, in run #2, the vector at model.docvecs[0] is just a randomly-initialized, but never-trained, vector - because none of the training texts had a document tag of (int) 0. (If using pure ints as the doc-tags, Doc2Vec uses them as literal indexes - potentially leaving any unused slots less than your highest tag allocated-and-initialized, but never-trained.)
Since common_texts only has 11 entries, by the time you reach run #12, all the vectors in your reps array of the first 11 vectors are garbage uncorrelated with any of your texts/
However, even after correcting that:
As explained in the Gensim FAQ answer #11, determinism in this algorithm shouldn't generally be expected, given many sources of potential randomness, and the fuzzy/approximate nature of the whole approach. If you're relying on it, or testing for it, you're probably making some unwarranted assumptions.
In general, tests of these algorithms should be evaluating "roughly equivalent usefulness in comparative uses" rather than "identical (or even similar) specific vectors". For example, a test whether apple and orange are roughly at the same positions in each others' nearest-neighbor rankings makes more sense than checking their (somewhat arbitrary) exact vector positions or even cosine-similarity.
Additionally:
tiny toy datasets like common_texts won't show the algorithm's usual behavior/benefits
PYTHONHASHSEED is only consulted by the Python interpreter at startup; setting it from Python can't have any effect. But also, the kind of indeterminism it introduces only comes up with separate interpreter launches: a tight loop within a single interpreter run like this wouldn't be affected by that in any case.
Have you checked the magnitude of the differences?
Just running:
delta = reps[0].sum() - reps[1].sum()
for the aggregate differences results with -1.2598932e-05 when I run it.
Comparison dimension-wise:
eps = 10**-4
over = (np.abs(diff) <= eps).all()
Returns True on a vast majority of the runs which means that you are getting quite reproducible results given the complexity of the calculations.
I would blame numerical stability of the calculations or uncontrolled randomness. Even though you do try to control the random seed, there is a different random seed in NumPy and different in random standard library so you are not controlling for all of the sources of randomness. This can also have an influence on the results but I did not check the actual implementation in gensim and it's dependencies.
Change
import os
os.environ['PYTHONHASHSEED'] = '0'
to
import os
import sys
hashseed = os.getenv('PYTHONHASHSEED')
if not hashseed:
os.environ['PYTHONHASHSEED'] = '0'
os.execv(sys.executable, [sys.executable] + sys.argv)
I am trying to port a model from Infer.NET, and I am struggling with
how can I make a Deterministic variable observed in pymc3?
M,L ~ Bernoulli
# doesn't work ...
Deterministic("U %i" % i, switch(M[i], ~L[i], L[i]), observed=True)
It's not quite clear what you are trying to model (you are more likely to get replies with a complete description of the problem and attempt at code), but in pymc3 you pass data via the 'observed' argument to specify the likelihood function. For example, if you want to estimate the probability of success for Bernoulli-distributed random variable, the likelihood for the model would be
likelihood = pm.Bernoulli('likelihood', prior_p_success, observed=data)
where prior_p_success is the prior probability of success and data is a vector of your binary data.
I want to generate a sequence of random numbers that will be used to pick tiles for a "maze". Each maze will have an id and I want to use that id as a seed to a pseudo random function. That way I can generate the same maze over and over given it's maze id. Preferably I do not want to use a built in pseudo random function in a language since I do not have control over the algorithm and it could change from platform to platform. As such, I would like to know:
How should I go about implementing my own pseudo random function?
Is it even feasible to generate platform independent pseudo random numbers?
Yes, it is possible.
Here is an example of such an algorithm (and its use) for noise generation.
Those particular random functions (Noise1, Noise2, Noise3, ..) use input parameters and calculate the pseudo random values from there.
Their output range is from 0.0 to 1.0.
And there are many more out there (Like mentioned in the comments).
UPDATE 2019
Looking back at this answer, a better suited choice would be the below-mentioned mersenne twister. Or you could find any implementation of xorshift.
The Mersenne Twister may be a good pick for this. As you can see from the pseudocode on wikipedia, you can seed the RNG with whatever you prefer to produce identical values for any instance with that seed. In your case, the maze ID or the hash of the maze ID.
If you are using Python, you can use the random module by typing at the beginning,
import random. Then, to use it, you type-
var = random.randint(1000, 9999)
This gives the var a 4 digit number that can be used for its id
If you are using another language, there is likely a similar module
I'm looking for a determenistic psuedo random generator that takes two inputs and always returns the same output. I'm looking for things like uniform distribution, unpredictable as possible, and doesn't repeat for a long long time. Ideally the function doesn't rely on previous values. The reason that is a problem is I'm generating terrain data for an extremely large procedurely generated world and can't afford to store previous values.
Any help is appreciated.
i think what you're looking for is perlin noise - it's a way of generating "random" values in 2d (typically) that look like terrain / clouds / etc.
note that this doesn't have much to do with cryptography etc, but a "real" random number source is probably not what you want for synthetic terrain (it looks too noisy/spikey).
there's a good article on perlin noise here.
the implementation of perlin noise does use a source of random numbers, but typically you can use whatever is present on your system (starting with a known seed if you want to reproduce it later).
Is the problem deciding on a PRNG algorithm to use or an algorithm that accepts 2 inputs?
If it's the former, why not use the built in random class - such as Random class in .NET - since it strives for uniform distribution and long cycles. Also, given the same seed it will generate the same sequence of numbers.
If it's the latter, what you can do is map the 2 inputs to a single ouput and use that as a seed to your random algorithm. You can define a simple hash function that takes a string and calculates an integer from it:
s[0] + s[1]^1 + s[2]^2 + ... s[n]^n = seed
Combination of two inputs (by concatenating each other, provided the inputs are binary integers) into one seed will do, for a PRNG, such as Mersenne Twister.
I came across an article about Car remote entry system at http://auto.howstuffworks.com/remote-entry2.htm In the third bullet, author says,
Both the transmitter and the receiver use the same pseudo-random number generator. When the transmitter sends a 40-bit code, it uses the pseudo-random number generator to pick a new code, which it stores in memory. On the other end, when the receiver receives a valid code, it uses the same pseudo-random number generator to pick a new one. In this way, the transmitter and the receiver are synchronized. The receiver only opens the door if it receives the code it expects.
Is it possible to have two PRNG functions producing same random numbers at the same time?
In PRNG functions, the output of the function is dependent on a 'seed' value, such that the same output will be provided from successive calls given the same seed value. So, yes.
An example (using C#) would be something like:
// Provide the same seed value for both generators:
System.Random r1 = new System.Random(1);
System.Random r2 = new System.Random(1);
// Will output 'True'
Console.WriteLine(r1.Next() == r2.Next());
This is all of course dependent on the random number generator using some sort of deterministic formula to generate its values. If you use a so-called 'true random' number generator that uses properties of entropy or noise in its generation, then it would be very difficult to produce the same values given some input, unless you're able to duplicate the entropic state for both calls into the function - which, of course, would defeat the purpose of using such a generator...
In the case of remote keyless entry systems, they very likely use a PRNG function that is deterministic in order to take advantage of this feature. There are many ICs that provide this sort of functionality to produce random numbers for electronic circuits.
Edit: upon request, here is an example of a non-deterministic random number generator which doesn't rely upon a specified seed value: Quantum Random Number Generator. Of course, as freespace points out in the comments, this is not a pseudorandom number generator, since it generates truly random numbers.
Most PRNGs have an internal state in the form of a seed, which they use to generate their next values. The internal logic goes something like this:
nextNumber = function(seed);
seed = nextNumber;
So every time you generate a new number, the seed is updated. If you give two PRNGs that use the same algorithm the same seed, function(seed) is going to evaluate to the same number (given that they are deterministic, which most are).
Applied to your question directly: the transmitter picks a code, and uses it as a seed. The receiver, after receiving it, uses this to seed its generator. Now the two are aligned, and they will generate the same values.
As Erik and Claudiu have said, ad long as you seed your PRNG with the same value you'll end up with the same output.
An example can be seen when using AES (or any other encryption algorithm) as the basis of your PRNG. As long as you keep using an inputs that match on both device (transmitter and receiver) then the outputs will also match.