Why is my Doc2Vec model in gensim not reproducible?

I have noticed that my gensim Doc2Vec (DBOW) model is sensitive to document tags. My understanding was that these tags are cosmetic and so they should not influence the learned embeddings. Am I misunderstanding something? Here is a minimal example:
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np
import os

os.environ['PYTHONHASHSEED'] = '0'
reps = []
for a in [0, 500]:
    documents = [TaggedDocument(doc, [i + a])
                 for i, doc in enumerate(common_texts)]
    model = Doc2Vec(documents, vector_size=100, window=2, min_count=0,
                    workers=1, epochs=10, dm=0, seed=0)
    reps.append(np.array([model.docvecs[k] for k in range(len(common_texts))]))
reps[0].sum() == reps[1].sum()
This last line returns False. I am working with gensim 3.8.3 and Python 3.5.2. More generally, is there any role that the values of the tags play (assuming they are unique)? I ask because I have found that using different tags for documents in a classification task leads to widely varying performance.
Thanks in advance.

First & foremost, your test isn't even comparing vectors corresponding to the same texts!
In run #1, the vector for the 1st text is in model.docvecs[0]. In run #2, the vector for the 1st text is in model.docvecs[500].
And, in run #2, the vector at model.docvecs[0] is just a randomly-initialized, never-trained vector, because none of the training texts had a document tag of (int) 0. (If pure ints are used as doc-tags, Doc2Vec uses them as literal indexes into the vector array, so any unused slots below your highest tag are allocated and initialized but never trained.)
Since common_texts only has 11 entries, in run #2 all 11 of the vectors you collect into reps are just such untrained garbage, uncorrelated with any of your texts.
However, even after correcting that:
As explained in the Gensim FAQ answer #11, determinism in this algorithm shouldn't generally be expected, given many sources of potential randomness, and the fuzzy/approximate nature of the whole approach. If you're relying on it, or testing for it, you're probably making some unwarranted assumptions.
In general, tests of these algorithms should evaluate "roughly equivalent usefulness in comparative uses" rather than "identical (or even similar) specific vectors". For example, a test of whether apple and orange are at roughly the same positions in each other's nearest-neighbor rankings makes more sense than checking their (somewhat arbitrary) exact vector positions or even cosine-similarities.
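As a rough sketch of such a rank-based check (the model names, the offset handling, and the topn value here are illustrative assumptions, not part of the question's code):
def neighbour_tags(model, tag, offset, topn=5):
    # nearest neighbours of a document, with the tag offset stripped so the
    # two runs (tags 0..10 vs. 500..510) can be compared directly
    sims = model.docvecs.most_similar(tag + offset, topn=topn)
    return [t - offset for t, _ in sims]

# Two runs are "roughly equivalent" if these lists largely agree:
# neighbour_tags(model_run1, 0, 0) vs. neighbour_tags(model_run2, 0, 500)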
Additionally:
tiny toy datasets like common_texts won't show the algorithm's usual behavior/benefits
PYTHONHASHSEED is only consulted by the Python interpreter at startup; setting it from Python can't have any effect. But also, the kind of indeterminism it introduces only comes up with separate interpreter launches: a tight loop within a single interpreter run like this wouldn't be affected by that in any case.

Have you checked the magnitude of the differences?
Just running:
delta = reps[0].sum() - reps[1].sum()
gives an aggregate difference of -1.2598932e-05 when I run it.
Comparing dimension-wise:
diff = reps[0] - reps[1]
eps = 10**-4
over = (np.abs(diff) <= eps).all()
Here over comes out True on the vast majority of runs, which means you are getting quite reproducible results given the complexity of the calculations.
I would blame the numerical stability of the calculations or uncontrolled randomness. Even though you do try to control the random seed, NumPy has its own random seed and the standard-library random module yet another, so you are not controlling all of the sources of randomness. This can also influence the results, but I did not check the actual implementation in gensim and its dependencies.
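A minimal sketch of pinning those other generators as well (whether gensim actually consults them is not verified here):
import random
import numpy as np

random.seed(0)     # standard-library RNG
np.random.seed(0)  # NumPy's global RNG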

Change
import os
os.environ['PYTHONHASHSEED'] = '0'
to
import os
import sys

hashseed = os.getenv('PYTHONHASHSEED')
if not hashseed:
    os.environ['PYTHONHASHSEED'] = '0'
    os.execv(sys.executable, [sys.executable] + sys.argv)
Since PYTHONHASHSEED is only read at interpreter startup, this re-launches the script in a fresh interpreter with the variable already set.

Related

Bug omits data interval - possible causes?

I have encountered a strange bug and wanted to ask if someone has any idea what might be the cause.
The bug:
When I correlate the facial width-to-height ratio (FWHR) of NHL players with their penalty minutes per games played (PIM/GP), a section of the FWHR distribution is blank (between 1.98 and 2, and between 2 and 2.022; see Figure 1). The FWHR is an int/int ratio where each int has two digits. It is extremely unlikely this reflects a true signal, so it is most likely a bug in the code I am using.
Context:
I know my PIM/GP data is correct (retrieved from NHL's website), but the FWHR was calculated using an algorithm. The problem most likely lies within this facial measuring algorithm. I have not been able to locate the bug and therefore turn to you for advice.
Question:
While the code for the facial measuring algorithm is far too long to be presented here, I wanted to ask if someone has any ideas about what might have caused this or what I could check for.
The Nature of Ratio Distributions
Idea: It should be impossible for a ratio of two 2-digit integers to fill all 2-decimal values between two integers. Could such impossible values be especially pronounced around 2.0? For example, maybe 1.99 cannot be represented?
Method: Loop through 2-digit ints and append the ratio to a list.
Then check if the list lacks values around 2.0 (e.g., 1.99).
import numpy as np
from matplotlib import pyplot as plt

def int_ratio_generator():
    ratio_list = []
    for i in range(1, 100):
        for j in range(1, 100):
            ratio = i / j
            ratio_list.append(ratio)
    return ratio_list

ratio_list = int_ratio_generator()
key = 1.99 in ratio_list
print('\nis 1.99 a possible ratio from 2-digit ints?', key)

fig, ax = plt.subplots()
X = ratio_list
Y = np.random.rand(len(ratio_list), 1)
plt.scatter(X, Y, color='C0')
plt.xlim(1.8, 2.2)
plt.show()
Conclusion:
Ratios from positive 2-digit integers do not fill all possible 2-decimal values between integers, and impossible values include 1.99.
It follows that previously impossible values can be filled by including a larger range of ints, or by introducing decimal numbers within the same range.
Furthermore, as shown by the simulation above, ratio distributions with 2-digit integers will have relatively large ranges of impossible values on either side of each integer.
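A quick sketch of the first point in the conclusion: widening the integer range closes the gap at 1.99, since e.g. 199/100 == 1.99 exactly (the 3-digit range here is just an illustration).
ratios_3digit = {i / j for i in range(1, 1000) for j in range(1, 1000)}
print(1.99 in ratios_3digit)  # True once 3-digit integers are allowed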

In Tensorflow, what is the difference between Session.partial_run and Session.run?

I always thought that Session.run required all placeholders in the graph to be fed, while Session.partial_run needed only the ones specified through Session.partial_run_setup, but on looking further that is not the case.
So how exactly do the two methods differentiate? What are the advantages/disadvantages of using one over the other?
With tf.Session.run, you usually give some inputs and some requested outputs, and TensorFlow runs the operations in the graph to compute and return those outputs. If you later want to get some other output, even with the same input, you have to run all the necessary operations in the graph again, even though some intermediate results will be the same as in the previous call. For example, consider something like this:
import tensorflow as tf

input_ = tf.placeholder(tf.float32)
result1 = some_expensive_operation(input_)
result2 = another_expensive_operation(result1)

with tf.Session() as sess:
    x = ...
    sess.run(result1, feed_dict={input_: x})
    sess.run(result2, feed_dict={input_: x})
Computing result2 will require to run both the operations from some_expensive_operation and another_expensive_operation, but actually most of the computation is repeated from when result1 was calculated. tf.Session.partial_run allows you to evaluate part of a graph, leave that evaluation "on hold" and complete it later. For example:
import tensorflow as tf

input_ = tf.placeholder(tf.float32)
result1 = some_expensive_operation(input_)
result2 = another_expensive_operation(result1)

with tf.Session() as sess:
    x = ...
    h = sess.partial_run_setup([result1, result2], [input_])
    sess.partial_run(h, result1, feed_dict={input_: x})
    sess.partial_run(h, result2)
Unlike before, here the operations from some_expensive_operation will only be run once in total, because the computation of result2 is just a continuation of the computation of result1.
This can be useful in several contexts, for example if you want to split the computational cost of a run into several steps, but also if you need to do some mid-evaluation checks out of TensorFlow, such as computing an input to the second half of the graph that depends on an output of the first half, or deciding whether or not to complete an evaluation depending on an intermediate result (these may also be implemented within TensorFlow, but there may be cases where you do not want that).
Note too that it is not only a matter of avoiding repeated computation. Many operations have a state that changes on each evaluation, so the result of two separate evaluations and of one evaluation divided into two partial ones may actually be different. This is the case with random operations, where you get a new value per run, and with other stateful objects like iterators. Variables are also obviously stateful, so operations that change variables (like tf.assign or optimizers) will not produce the same results when they are run once and when they are run twice.
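A small sketch of that statefulness point, under the TF 1.x API (the op names and the placeholder here are illustrative, and the behaviour described in the comments follows the explanation above rather than a verified run):
import tensorflow as tf

scale = tf.placeholder(tf.float32)
r = tf.random_uniform([])
scaled = r * scale

with tf.Session() as sess:
    # two separate run() calls: two independent random draws
    print(sess.run(r), sess.run(scaled, feed_dict={scale: 2.0}))
    # one partial run split in two: both fetches share a single draw
    h = sess.partial_run_setup([r, scaled], [scale])
    a = sess.partial_run(h, r)
    b = sess.partial_run(h, scaled, feed_dict={scale: 2.0})
    print(a, b)  # b should equal 2 * a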
In any case, note that, as of v1.12.0, partial_run is still an experimental feature and is subject to change.

Speed up numpy matrix inverse

I am using NumPy/SciPy to invert a 20k x 20k matrix, and it's slow.
I tried:
(1) M_inv = M.I
(2) Ident = np.identity(len(M))
    M_inv = scipy.linalg.solve(M, Ident)
(3) M_inv = scipy.linalg.inv(M)
but didn't see any speedup.
Is there any other way to speed this up?
This is a big matrix, and inverting it is going to be slow. Some options:
Use a numpy linked against Intel MKL (e.g. the Enthought distribution, or you can compile it yourself), which should be faster than one linked against standard BLAS/ATLAS.
If your matrix is sufficiently sparse, use scipy.sparse and scipy.sparse.linalg. (This will probably be slower if there are only a few zeros, though.)
Figure out if you really need an explicit representation of the inverted matrix to do whatever it is you're trying to do with it – often you can get away without explicitly inverting it, but it's hard to tell without knowing what it is you're doing with this matrix.
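As a sketch of that last point (the sizes and the random test matrix here are stand-ins, not the asker's data): if the inverse is only needed to apply M^-1 to some right-hand side b, solving the linear system directly is usually faster and more stable than forming the inverse.
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
M = rng.standard_normal((1000, 1000))  # small stand-in for the 20k case
b = rng.standard_normal(1000)

x_solve = linalg.solve(M, b)   # no explicit inverse formed
x_inv = linalg.inv(M) @ b      # explicit inverse, then a multiply
print(np.allclose(x_solve, x_inv))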

How can I prove the experiment data follows heavy-tail distribution?

I have several test results of server response delay. According to our theoretical analysis, the delay distribution should have heavy-tail behavior. But how can I prove that the test results do follow a heavy-tailed distribution?
I think the most straightforward way is to fit a half-normal distribution to the data and look at how well it describes the tail.
In Python you can do this with the help of scipy.stats.halfnorm.fit(), or use the longtail module, which is dedicated to plotting and analysing heavy tails (https://github.com/Mottl/longtail):
import numpy as np
import longtail
# generate random values from heavy tailed distribution (let's take Laplace)
X = np.random.laplace(size=10000)
X = X[X>0] # take only right half of the distribution
# get best fit of half normal distribution to our data:
params = longtail.fit_distributions(X, distributions=['halfnorm'])
# visualize X and best fit:
longtail.plot(X, params=params)
Since the points at the tail lie above the half-normal approximation in the resulting plot, the given distribution can be considered to have a heavier tail than the half-normal.
I am not an expert, but I think that estimating the kurtosis of your delay distribution would be a good start.
If you know the theoretical delay distribution, you can also do a goodness of fit test.
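A small sketch of both suggestions on synthetic stand-in data (the truncated Laplace sample below plays the role of the measured delays; the choice of test and reference distribution is an assumption):
import numpy as np
from scipy import stats

delays = np.abs(np.random.laplace(size=10000))  # stand-in for measured delays

# Excess kurtosis well above 0 is a quick hint at heavy tails.
print("excess kurtosis:", stats.kurtosis(delays))

# Goodness-of-fit test against a fitted half-normal; a tiny p-value argues
# against the light-tailed fit.
params = stats.halfnorm.fit(delays)
print(stats.kstest(delays, 'halfnorm', args=params))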

Time delay estimation between two audio signals

I have two audio recordings of the same signal from two different microphones (for example, in WAV format), but one of them is recorded with a delay of, say, several seconds.
It's easy to identify such a delay visually when viewing these signals in some kind of waveform viewer - i.e. just spotting the first visible peak in each signal and checking that they're the same shape:
[figure: two waveforms with the same peak, offset by a delay t; source: greycat.ru]
But how do I do it programmatically - find out what this delay (t) is? The two digitized signals are slightly different (because the microphones are different, were at different positions, due to ADC setups, etc.).
I've dug around a bit and found that this problem is usually called "time-delay estimation" and that there are myriad approaches to it - for example, one of them.
But are there any simple and ready-made solutions, such as a command-line utility, a library, or a straightforward algorithm?
Conclusion: I've found no simple implementation and done a simple command-line utility myself - available at https://bitbucket.org/GreyCat/calc-sound-delay (GPLv3-licensed). It implements a very simple search-for-maximum algorithm described at Wikipedia.
The technique you're looking for is called cross correlation. It's a very simple, if somewhat compute intensive technique which can be used for solving various problems, including measuring the time difference (aka lag) between two similar signals (the signals do not need to be identical).
If you have a reasonable idea of your lag value (or at least the range of lag values that are expected) then you can reduce the total amount of computation considerably. Ditto if you can put a definite limit on how much accuracy you need.
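A minimal sketch of the cross-correlation approach (the noise-plus-delay signals below are synthetic stand-ins for the two recordings, and a common sample rate fs is assumed):
import numpy as np
from scipy import signal

fs = 44100
rng = np.random.default_rng(0)
s1 = rng.standard_normal(fs)                   # 1 s of noise as a stand-in signal
s2 = np.concatenate([np.zeros(int(0.1 * fs)), s1])  # same signal, delayed 0.1 s

corr = signal.correlate(s2, s1, mode='full')
lag_samples = np.argmax(corr) - (len(s1) - 1)  # zero lag sits at index len(s1)-1
print("estimated delay:", lag_samples / fs, "seconds")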
Having had the same problem, and not having found a tool to sync the start of video/audio recordings automatically,
I decided to make syncstart (github).
It is a command line tool. The basic code behind it is this:
import numpy as np
from scipy import fft
from scipy.io import wavfile

# in1, in2: paths to the two WAV files (assumed to be defined by the caller)
r1, s1 = wavfile.read(in1)
r2, s2 = wavfile.read(in2)
assert r1 == r2, "syncstart normalizes using ffmpeg"
fs = r1
ls1 = len(s1)
ls2 = len(s2)
padsize = ls1 + ls2 + 1
padsize = 2 ** (int(np.log(padsize) / np.log(2)) + 1)
s1pad = np.zeros(padsize)
s1pad[:ls1] = s1
s2pad = np.zeros(padsize)
s2pad[:ls2] = s2
corr = fft.ifft(fft.fft(s1pad) * np.conj(fft.fft(s2pad)))
ca = np.absolute(corr)
xmax = np.argmax(ca)
if xmax > padsize // 2:
    file, offset = in2, (padsize - xmax) / fs
else:
    file, offset = in1, xmax / fs
A very straightforward thing to do is just to check whether the peaks exceed some threshold; the time between the high peak on line A and the high peak on line B is probably your delay. Just try tinkering a bit with the thresholds, and if the graphs are usually as clear as the picture you posted, you should be fine.
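A rough sketch of that threshold idea (s_a, s_b, fs, and the threshold value are assumptions to be adapted to the actual recordings):
import numpy as np

def first_peak_time(x, fs, threshold=0.5):
    above = np.abs(x) > threshold
    # np.argmax returns the first True index (and 0 if nothing exceeds the threshold)
    return np.argmax(above) / fs

# delay = first_peak_time(s_b, fs) - first_peak_time(s_a, fs)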
