word2vec's probabilistic output - gensim

I'm new to the world of word2vec and have just started using gensim's implementation.
I use two trivial sentences as my first document set:
[['first', 'sentence'], ['second', 'sentence']]
The vectors I get are like this:
'first', -0.07386458, -0.17405555
'second', 0.0761444 , -0.21217766
'sentence', 0.0545655 , -0.07535963
However, when I type in another toy document set:
[['a', 'c'], ['b', 'c']]
I get the following result:
'a', 0.02936198, -0.05837455
'b', -0.05362414, -0.06813956
'c', 0.11918657, -0.10411404
Again, I'm new to word2vec, but to my understanding my two document sets are structurally identical, so the vectors for the corresponding words should be the same.
Why am I getting different results?
Does the algorithm always give probabilistic output, or are the document sets too small?
The call I used is the following:
model = word2vec.Word2Vec(sentences, size=2, min_count=1, window=2)

The prime reason you are getting different vectors is the random initialisation of the vectors in word2vec (other sources of nondeterminism, such as negative sampling and multi-threading, can also lead to differences in vector values).
The philosophy behind word2vec is that if the number of documents (training data) >> the number of unique words (vocabulary size), the word vectors will stabilise after a few iterations.
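If you want repeatable vectors while experimenting, here is a minimal sketch along those lines (it reuses the question's old-style size= parameter; seed and workers are standard Word2Vec arguments):
from gensim.models import word2vec

sentences = [['first', 'sentence'], ['second', 'sentence']]

# Fix the seed and use a single worker thread to remove the two main sources
# of run-to-run variation (random initialisation and thread scheduling).
# Note: fully identical runs also require launching Python with a fixed
# PYTHONHASHSEED, since the default per-word seeding uses Python's string hash.
model = word2vec.Word2Vec(sentences, size=2, min_count=1, window=2,
                          seed=42, workers=1)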

Related

How to interpret doc2vec results on previously seen data?

I'm using gensim 4.0.1 and training a Doc2Vec model:
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
sentences = [['hello', 'world'], ['james', 'bond'], ['adam', 'smith']]
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(sentences)]
model = Doc2Vec(documents, vector_size=5, window=5, min_count=0, workers=4)
documents
[TaggedDocument(words=['hello', 'world'], tags=[0]),
TaggedDocument(words=['james', 'bond'], tags=[1]),
TaggedDocument(words=['adam', 'smith'], tags=[2])]
model.dv[0],model.dv[1],model.dv[2]
(array([-0.10461631, -0.11958256, -0.1976151 , 0.1710569 , 0.0713223 ],
dtype=float32),
array([ 0.00526548, -0.19761242, -0.10334401, -0.19437183, 0.04021204],
dtype=float32),
array([ 0.05662392, 0.09290017, -0.08597242, -0.06293383, -0.06159503],
dtype=float32))
I expect to get a match on TaggedDocument #1
seen = ['james','bond']
Surprisingly, that known text (james bond) produces a completely "unseen" vector:
new_vector = model.infer_vector(seen)
new_vector
array([-0.07762126, 0.03976333, -0.02985927, 0.07899596, -0.03556045],
dtype=float32)
most_similar() does not point to the expected tag 1. Moreover, all 3 scores are quite weak, implying completely unseen data.
model.dv.most_similar_cosmul(positive=[new_vector])
[(0, 0.5322251915931702), (2, 0.4972134530544281), (1, 0.46321794390678406)]
What is wrong here, any ideas?
Five dimensions is still too many for a toy-sized dataset of just 6 words (all of them unique) spread across three 2-word texts.
None of the Word2Vec/Doc2Vec/FastText-type algorithms works well on tiny amounts of contrived data. They only learn their patterns from many, subtly-contrasting usages of words in varied contexts.
Their real strengths only emerge with vectors that are 50, 100, or hundreds-of-dimensions wide - and training that many dimensions requires a unique vocabulary of (at least) many thousands of words – ideally tens or hundreds of thousands of words – with many usage examples of each. (For a variant like Doc2Vec, you'd similarly want many thousands of varied documents.)
You'll see improved correlations with expected results when using sufficient training data.
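If you just want to see the effect on the toy model above, here is a small sketch (assuming gensim 4's infer_vector(..., epochs=...) signature): inferring the same short text several times will typically give noticeably different vectors on a corpus this small, and raising epochs reduces the jitter but cannot substitute for more data.
# Infer the same tiny text repeatedly; on a toy corpus the resulting vectors
# (and the nearest tag) tend to vary from call to call.
for _ in range(3):
    vec = model.infer_vector(['james', 'bond'], epochs=50)
    print(model.dv.most_similar([vec], topn=1))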

does doc2vec (gensim) infer_vector need window-size padded sentences?

According to the original paper, Distributed Representations of Sentences and Documents, inference on an unseen paragraph can be done by
training "the inference stage" to get paragraph vectors D for new paragraphs (never seen before) by adding more columns in D and gradient descending on D while holding W, U, b fixed
This inference stage can be done in gensim by infer_vector().
If I have window = 5 for my Doc2Vec model and attempt to infer a paragraph in which some sentences have len(sentence) < 5, such as:
model = Doc2Vec(window=5)
paragraph = [['I', 'am', 'groot'], ['I', 'am', 'groot', 'I', 'am', 'groot']]
model.infer_vector(paragraph)
In this case, should I pre-pad the sentences with a special NULL word symbol so that every sentence in the paragraph is at least as long as the window size? Such as:
paragraph = [['I', 'am', 'groot', NULL, NULL], ['I', 'am', 'groot', 'I', 'am', 'groot']]
You never need to do any explicit padding.
In the default and common Doc2Vec modes, if there's not enough context on either side of a focal word, the effective window simply shrinks on that side to match what is available.
(In the non-default dm=1, dm_concat=1 mode, there's automatic padding when necessary. But this mode results in larger, slower models requiring a lot more data to train, and whose value isn't very clear in any proven settings. That mode is unlikely to get good results except for advanced users with a lot of data and ability to tinker with non-default parameters.)
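To make the default behaviour concrete, here is a minimal sketch (parameter names like vector_size and epochs assume gensim 4): the short text goes to infer_vector() as one flat list of tokens, with no NULL padding, even though it is shorter than the window.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(['i', 'am', 'groot'], [0]),
        TaggedDocument(['i', 'am', 'groot', 'i', 'am', 'groot'], [1])]
model = Doc2Vec(docs, vector_size=10, window=5, min_count=1, epochs=20)

# No padding needed: for a text shorter than the window, the effective
# window simply shrinks to the context that is actually available.
vec = model.infer_vector(['i', 'am', 'groot'])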
I found that gensim automatically pre-pads documents at both the training and inference stages.
gensim.models.doc2vec.train_document_dm_concat
null_word = model.vocab['\0']
pre_pad_count = model.window
post_pad_count = model.window
padded_document_indexes = (
    (pre_pad_count * [null_word.index])  # pre-padding
    + [word.index for word in word_vocabs if word is not None]  # elide out-of-Vocabulary words
    + (post_pad_count * [null_word.index])  # post-padding
)

Selecting only a small amount of trials in a possibly huge condition file in a pseudo-randomized way

I am using the PsychoPy Builder and have only used code components in a rudimentary way.
Now I'm facing a problem for which I think coding is inevitable, but I have no idea how to do it, and so far I haven't found helpful answers on the net.
I have an experiment with pictures of 3 valences (negative, neutral, positive).
In one of the corners of the pictures, additional pictures (letters and numbers) can appear (randomly in one of the 4 positions) in random latencies.
All in all, with all combinations (taken the identity of the letters/numbers into account), I have more than 2000 trial possibilities.
But I only need 72 trials, with the condition that each valence appears 24 times (or: each of the 36 pictures 2 times) and each latency 36 times. Thus, valence and latency should be counterbalanced, but the positions and the identities of the letters and numbers can be random. However, at a specific rate (in 25% of the trials) no letters/numbers should appear in the corners.
Is there a way to do it?
Adding a pretty simple code component in Builder will do this for you. I'm a bit confused about the conditions, but you'll probably get the general idea. Let's assume that you have your 72 "fixed" conditions in a conditions file and a loop with a routine that runs for each of these conditions.
I assume that you have a TextStim in your stimulus routine. Let's say that you called it 'letternumbers'. Then the general strategy is to pre-compute a list of randomized characters and positions for each of the 72 trials and then just display them as we move through the experiment. To do this, add a code component to the top of your stimulus routine and add under "begin experiment":
import random # we'll use this module to pick random elements from below
# Indicator sequence, specifying whether letter/number should be shown. False= do not show. True = do show.
show_letternumber = [False] * 18 + [True] * 54 # 18/72=25%, 54/72=75%.
random.shuffle(show_letternumber)
# Sets of letters and numbers to present
char_set = ['1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'g'] # ... and so on.
char_trial = [random.choice(char_set) if show_char else '' for show_char in show_letternumber] # one character (or blank) per trial
# List of positions
pos_set = [(0.5, 0.5),(-0.5, 0.5),(-0.5,-0.5),(0.5, -0.5)] # coordinates of your four corners
pos_trial = [random.choice(pos_set) for char in char_trial]
Then, under "begin routine" in the code component, set the letternumbers component to show the value of char_trial for that trial, at the position in pos_trial:
letternumbers.pos = pos_trial[trials.thisN] # set position. trials.thisN is the current trial number
letternumbers.text = char_trial[trials.thisN] # set text
# Save to data/log
trials.addData('pos', pos_trial[trials.thisN])
trials.addData('char', char_trial[trials.thisN])
You may need to tick "set every repeat" for the letternumbers component in Builder for the text to actually show.
Here is a strategy you could try, but as I don't use Builder I can't integrate it into that workflow.
Prepare a list that has the types of trials you want in the right numbers. You could type this by hand if needed, for example mytrials = ['a','a',...'d','d'], where those letters represent some label for the combination of trial types you want.
Then open up the console and permute that list (i.e. shuffle it).
import random
random.shuffle(mytrials)
That will shuffle mytrials in place; you can check it by just printing the list. When you are happy with the order, paste it into your code with some sort of loop like:
for t in mytrials:
    if t == 'a':
        <grab a picture of type 'a'>
    elif t == 'b':
        <grab a picture of type 'b'>
    else:
        <grab a picture of type 'c'>
    <then show the picture you grabbed>
There are programmatic ways to build the list with the right number of repeats, but for what you are doing it may be easier to just get going with a hand written list, and then worry about making it fancier once that works.
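For example, a minimal sketch of building such a list programmatically and shuffling it (the labels here are just placeholders for whatever condition labels you use):
import random

# 24 trials per valence gives the required 72 trials in total.
mytrials = ['negative'] * 24 + ['neutral'] * 24 + ['positive'] * 24
random.shuffle(mytrials)  # one shuffle gives the pseudo-random order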

Generating Boolean Searches Against an Array of Sentences to Group Results into n Results or Fewer

I feel this is a strange one. It comes from nowhere specific, but it's a problem I've started trying to solve and now I just want to know the answer, or at least a starting place.
I have an array of x sentences;
I have a count of how many sentences each word appears in;
I have a count of how many sentences each word appears in together with every other word;
I can search for a sentence using typical case-insensitive boolean search clauses (AND +/- Word).
My data structure looks like this:
{ words: [{ word: '', count: x, concurrentWords: [{ word: '', count: x }] }] }
I need to generate an array of searches which will group the sentences into groups of n sentences or fewer.
I don't know if it's even possible to do this in a predictable way, so approximations are fine. The solution doesn't have to use the fact that I have my array of words and their counts. I'm doing this in JavaScript, not that that should matter.
Thanks in advance

Create an array of strings of all possible combination of the elements of a series of arrays

I would like to do reverse DNA translation using BioRuby, which offers a nice codon table for bacteria.
Here is a code snippet describing the series of arrays I have (there are a lot more!).
# Arrays Sample
a = table.revtrans("A") # ["gct", "gcc"]
b = table.revtrans("M") # ["atg"]
c = table.revtrans("L") # ["tta", "ttg", "ctt", "ctc", "cta", "ctg"]
d = ...
I would like to create an array or hash with all possible combinations of the above strings.
["gctatgtta", "gccatgtta", "gtcatgttg", "gctatgctt", etc]
Any idea how I can achieve this using Ruby? I tried using the combination method, but failed to produce any sensible result. Also, I'd like to be able to predetermine the number of computations if possible! So please offer some mathematical explanation if you can!
Some Explanation
These 3-letter strings are DNA codons. Each triplet can be translated into an amino acid from a pre-determined table. What I'm doing is essentially creating a (huge) series of potential DNA sequences from which a protein could be produced theoretically.
Thanks!
What you want to use is product.
Returns an array of all combinations of elements from all arrays. The length of the returned array is the product of the length of self and the argument arrays.
%w(gct gcc).product(%w(atg), %w(tta ttg)).map(&:join)
# => ["gctatgtta", "gctatgttg", "gccatgtta", "gccatgttg"]
For comparison, pooling all the codons and taking combination picks any 3 of them regardless of which amino acid they encode, which is why it doesn't give a sensible result here:
[*a, *b, *c].combination(3).map(&:join)
#=> ["gctgccatg", "gctgcctta", #...

Resources