Why are the two embedding vectors for the same key from two Word2Vec models so similar? - gensim

I am using two toy word sets to train two Word2Vec models with Gensim. The vocabulary in set 1 is 'x', 'y', 'c' and in set 2 is 'a', 'b', 'c'. After training the two sets separately with two different models, I found that the embedding vectors for the word 'c' are very similar. My understanding is that the embeddings are randomly initialized, so you would normally even need to align the vectors for the same words trained by separate models in order to put them in the same space. Why, then, are my two vectors so similar? Here is my code.
import gensim

common_texts_1 = [
    ['y', 'x', 'c', 'x', 'y', 'y'],
    ['y', 'c', 'c', 'c', 'x', 'y'],
    ['c', 'x', 'c', 'y', 'x', 'y'],
    ['y', 'c', 'x', 'c', 'c', 'y'],
    ['c', 'x', 'x', 'y', 'y', 'y'],
    ['c', 'x', 'x', 'y', 'y', 'y'],
    ['x', 'x', 'x', 'c', 'y', 'y'],
    ['y', 'x', 'c', 'y', 'y', 'y'],
    ['c', 'x', 'x', 'y', 'c', 'y'],
    ['c', 'y', 'y', 'y', 'y', 'y'],
    ['c', 'x', 'x', 'y', 'c', 'y'],
    ['c', 'x', 'x', 'y', 'y', 'y'],
    ['x', 'x', 'x', 'c', 'y', 'y'],
    ['x', 'x', 'x', 'y', 'y', 'c'],
    ['c', 'x', 'c', 'y', 'y', 'c'],
    ['x', 'x', 'c', 'y', 'y', 'y'],
    ['x', 'x', 'x', 'y', 'y', 'c'],
]
common_texts_2 = [
    ['a', 'a', 'b', 'b', 'c', 'c'],
    ['a', 'c', 'b', 'b', 'c', 'c'],
    ['c', 'a', 'b', 'b', 'a', 'c'],
    ['a', 'c', 'b', 'b', 'b', 'c'],
    ['c', 'a', 'b', 'b', 'c', 'b'],
    ['b', 'a', 'b', 'c', 'c', 'a'],
    ['c', 'b', 'b', 'b', 'b', 'c'],
    ['c', 'a', 'b', 'b', 'c', 'c'],
    ['a', 'c', 'b', 'b', 'c', 'c'],
    ['a', 'c', 'b', 'b', 'c', 'c'],
    ['a', 'a', 'b', 'b', 'a', 'c'],
]
base_embed = gensim.models.Word2Vec(common_texts_1,
                                    min_count=1,
                                    size=2,
                                    window=2,
                                    workers=4)
other_embed = gensim.models.Word2Vec(common_texts_2,
                                     min_count=1,
                                     size=2,
                                     window=2,
                                     workers=4)
print(base_embed.wv.word_vec('c'))
print(other_embed.wv.word_vec('c'))

You shouldn't expect toy-sized tests like this to show the qualities that make the word2vec algorithm useful, nor to teach much about its operation – other than the limits of small, unrepresentative corner-cases.
The useful characteristics of word2vec word-vectors arise from large, varied datasets, with many subtly-contrasting word-usages, in natural contexts. You're unlikely to see that with a 3-word language, and it's even possible your synthetic 'texts' have a distribution of neighboring-words that largely cancel-out.
In particular, even when trying to make the tiniest workable training data, you'd want:
A vocabulary that's significantly larger than the vector dimensionality, so that the model won't 'overfit' on a representation that's closer to one-hot than the 'dense embedding' word2vec tries to create. (It's actually the challenge of fitting many words into a smaller space that helps push-and-pull words into interesting relative configurations.)
Training data whose word frequencies, and co-occurrences, resemble natural-language patterns. (Word2vec can often be useful on other data, too, but the reliable territory for exploring its characteristics will look like the richness of language.)
Also, note that with real language data you essentially never want to run word2vec with min_count=1 - those rare words don't have enough varied usage examples to get good generalizable vectors, but since (in usual Zipfian language distributions) there can nonetheless be a lot of them, they serve as 'noise' making the surrounding words' vectors worse. The default is min_count=5 because, on adequately-sized corpora, that value (or even higher!) usually gives better results.
Finally, while the word-vectors are randomly initialized at the start, Gensim does choose to use the string tokens themselves as initialization seeds (combined with the optional seed parameter). So the string 'c' will in fact be initialized the same way in any model that (a) uses the same seed; & (b) has the same dimensionality. In a real-sized dataset, training will tend to move the final word-vector arbitrarily far from that low-magnitude initialization - but in this sort of tiny dataset, where its neighbors are almost always the exact same 2 other words, it's not going to be getting a lot of meaningful nudges to new positions. I suspect that's why your wv['c'] is so similar in your two models.
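You can see the seeding effect directly by building each model's vocabulary without any training and comparing the pre-training vector for 'c'. This is a minimal sketch assuming the same gensim 3.x-style API (size, seed) used in the question; the two printed vectors should match exactly, since both models share the seed, dimensionality, and token string:
import gensim

# Two untrained models with identical seed and dimensionality.
m1 = gensim.models.Word2Vec(min_count=1, size=2, window=2, seed=1)
m2 = gensim.models.Word2Vec(min_count=1, size=2, window=2, seed=1)

# build_vocab() initializes the word-vectors deterministically from
# each token string plus the seed - no training happens here.
m1.build_vocab(common_texts_1)
m2.build_vocab(common_texts_2)

print(m1.wv.word_vec('c'))  # identical to the next line, before any training
print(m2.wv.word_vec('c'))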
I'd suggest running experiments with a dimensionality of at least 100, a unique vocabulary (after enforcing min_count=5) of at least 10,000 tokens, and enough raw text that all those 10,000 tokens have, on average, many dozens of subtly-varying, realistically-contrasting usage examples. Only then will the results start to reflect why people use word2vec.

Related

Permutation of arrays

There are multiple sets of elements like:
[['a', 'b', 'c'], ['e', 'f'], ['g', 'h', 'i']]
I need an algorithm to get all possible combinations of each element from each set.
E.g.
['a', 'e', 'g']
['a', 'f', 'g']
['a', 'f', 'h']
['a', 'f', 'i']
['b', 'e', 'g']
...etc
You can use backtracking to solve this:
def get_perms(arr):
    def backtrack(idx, partial_res, res):
        if idx == len(arr):
            res.append(partial_res[:])
            return
        for i in range(0, len(arr[idx])):
            partial_res.append(arr[idx][i])
            backtrack(idx + 1, partial_res, res)
            partial_res.pop()
    res = []
    backtrack(0, [], res)
    return res
A quick test:
arr = [['a', 'b', 'c'], ['e', 'f'], ['g', 'h', 'i']]
get_perms(arr)
[['a', 'e', 'g'],
['a', 'e', 'h'],
['a', 'e', 'i'],
['a', 'f', 'g'],
['a', 'f', 'h'],
['a', 'f', 'i'],
['b', 'e', 'g'],
['b', 'e', 'h'],
['b', 'e', 'i'],
['b', 'f', 'g'],
['b', 'f', 'h'],
['b', 'f', 'i'],
['c', 'e', 'g'],
['c', 'e', 'h'],
['c', 'e', 'i'],
['c', 'f', 'g'],
['c', 'f', 'h'],
['c', 'f', 'i']]
The algorithm just goes over each inner list and adds each element to partial_res and calls itself recursively while incrementing the index to go to the next inner list.
What you want is a Cartesian product. Use itertools:
from itertools import product

matrix = [['a', 'b', 'c'], ['e', 'f'], ['g', 'h', 'i']]
for res in product(*matrix):
    print(list(res))

Algorithm to translate list operations to be index based

Assume you have an unsorted list of distinct items. for example:
['a', 'z', 'g', 'i', 'w', 'p', 't']
You also get a list of Insert and Remove operations. Insert operations are composed of the item to insert and the index to insert at. For example: Insert('s', 5)
Remove operations are expressed using the element to remove. For example: Remove('s')
So a list of operations may look like this:
Insert ('s', 5)
Remove ('p')
Insert ('j', 0)
Remove ('a')
I am looking for the most efficient algorithm that can translate the list of operations so that they are index based. That means that there is no need to modify the insert operations, but the remove operations should be replaced with a remove operation stating the current index of the item to be removed (not the original one).
So the output of the example should look like this:
Starting set: ['a', 'z', 'g', 'i', 'w', 'p', 't']
Insert('s', 5)  (list is now: ['a', 'z', 'g', 'i', 'w', 's', 'p', 't'])
Remove(6)       (list is now: ['a', 'z', 'g', 'i', 'w', 's', 't'])
Insert('j', 0)  (list is now: ['j', 'a', 'z', 'g', 'i', 'w', 's', 't'])
Remove(1)       (list is now: ['j', 'z', 'g', 'i', 'w', 's', 't'])
Obviously, we can scan for the next item to remove in the set after each operation, and that would mean the entire algorithm would take O(n*m) where n is the size of the list, and m is the number of operations.
The question is - is there a more efficient algorithm?
You can make this more efficient if you have access to all of the remove operations ahead of time, and they are significantly (context-defined) shorter than the object list.
You can maintain a list of items of interest: those that will be removed. Look up their initial positions - either in the original list, or when they are inserted. Whenever an insertion is made at position n, each tracked item past that position gets its index increased by one; for each deletion, decrease the indices past it by one.
This is not much different from the obvious method; it's merely quantitatively faster - a potentially smaller m in the O(n*m) complexity.
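A sketch of that bookkeeping in Python (the tuple-based operation format is my own, not from the question):
def translate_operations(initial, ops):
    """Rewrite ('remove', item) ops as ('remove', index); ('insert', item, index)
    ops pass through unchanged."""
    # Track the current index of every item that will eventually be removed.
    to_remove = {op[1] for op in ops if op[0] == 'remove'}
    pos = {item: i for i, item in enumerate(initial) if item in to_remove}

    out = []
    for op in ops:
        if op[0] == 'insert':
            _, item, idx = op
            # Tracked items at or after the insertion point shift right.
            for k in pos:
                if pos[k] >= idx:
                    pos[k] += 1
            if item in to_remove:
                pos[item] = idx
            out.append(op)
        else:
            idx = pos.pop(op[1])
            # Tracked items after the removal point shift left.
            for k in pos:
                if pos[k] > idx:
                    pos[k] -= 1
            out.append(('remove', idx))
    return out

ops = [('insert', 's', 5), ('remove', 'p'), ('insert', 'j', 0), ('remove', 'a')]
print(translate_operations(['a', 'z', 'g', 'i', 'w', 'p', 't'], ops))
# [('insert', 's', 5), ('remove', 6), ('insert', 'j', 0), ('remove', 1)]
Only the tracked items are scanned on each operation, which is where the "smaller m" saving comes from.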

Use dynamic programming to merge two arrays such that the number of repetitions of the same element is minimised

Let's say we have two arrays m and n containing characters from the set {a, b, c, d, e}. Assume each character in the set has a cost associated with it; consider the costs to be a=1, b=3, c=4, d=5, e=7.
for example
m = ['a', 'b', 'c', 'd', 'd', 'e', 'a']
n = ['b', 'b', 'b', 'a', 'c', 'e', 'd']
Suppose we would like to merge m and n to form a larger array s.
An example of s array could be
s = ['a', 'b', 'c', 'd', 'd', 'e', 'a', 'b', 'b', 'b', 'a', 'c', 'e', 'd']
or
s = ['b', 'a', 'd', 'd', 'd', 'b', 'e', 'c', 'b', 'a', 'b', 'a', 'c', 'e']
If there are two or more identical characters adjacent to each other, a penalty is applied equal to: the number of adjacent characters of the same type * the cost of that character. Consider the second example for s above, which contains the sub-array ['d', 'd', 'd']. In this case a penalty of 3*5 is applied, because the cost associated with d is 5 and the number of repetitions of d is 3.
Design a dynamic programming algorithm which minimises the cost associated with s.
Does anyone have any resources, papers, or algorithms they could share to help point me in the right direction?
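To make the objective concrete, here is one way to score a candidate merge under the penalty rule as described above (the helper name and cost dict are illustrative, not from the question):
def merge_penalty(s, costs):
    # Each maximal run of identical adjacent characters with length >= 2
    # adds (run length) * (cost of that character), per the rule above.
    total, i = 0, 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        if j - i >= 2:
            total += (j - i) * costs[s[i]]
        i = j
    return total

costs = {'a': 1, 'b': 3, 'c': 4, 'd': 5, 'e': 7}
s = ['b', 'a', 'd', 'd', 'd', 'b', 'e', 'c', 'b', 'a', 'b', 'a', 'c', 'e']
print(merge_penalty(s, costs))  # the ['d', 'd', 'd'] run contributes 3 * 5 = 15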

Efficiently generate permutations with ordering restrictions(possibly without backtracking)?

I need to generate permutations with ordering restrictions.
For example, in the list [A, B, C, D]:
A must always come before B, and C must always come before D. There may also be E, F, G, ... that have no restrictions.
The input would look like this: [[A,B],[C,D],[E],[F]]
Is there a way to do this without computing unnecessary permutations or backtracking?
Normally, a permutations algorithm might look somewhat like this (Python):
def permutations(elements):
    if elements:
        for i, current in enumerate(elements):
            front, back = elements[:i], elements[i+1:]
            for perm in permutations(front + back):
                yield [current] + perm
    else:
        yield []
You iterate the list, taking each of the elements as the first element, and combining them with all the permutations of the remaining elements. You can easily modify this so that the elements are actually lists of elements, and instead of just using the current element, you pop the first element off that list and insert the rest back into the recursive call:
def ordered_permutations(elements):
    if elements:
        for i, current in enumerate(elements):
            front, back = elements[:i], elements[i+1:]
            first, rest = current[0], current[1:]
            for perm in ordered_permutations(front + ([rest] if rest else []) + back):
                yield [first] + perm
    else:
        yield []
Results for ordered_permutations([['A', 'B'], ['C', 'D'], ['E'], ['F']]):
['A', 'B', 'C', 'D', 'E', 'F']
['A', 'B', 'C', 'D', 'F', 'E']
['A', 'B', 'C', 'E', 'D', 'F']
[ ... some 173 more ... ]
['F', 'E', 'A', 'C', 'D', 'B']
['F', 'E', 'C', 'A', 'B', 'D']
['F', 'E', 'C', 'A', 'D', 'B']
['F', 'E', 'C', 'D', 'A', 'B']
Note, though, that this will create a lot of intermediate lists in each recursive call. Instead, you could use stacks, popping the first element off the stack and putting it back on after the recursive calls.
def ordered_permutations_stack(elements):
    if any(elements):
        for current in elements:
            if current:
                first = current.pop()
                for perm in ordered_permutations_stack(elements):
                    yield [first] + perm
                current.append(first)
    else:
        yield []
The code might be a bit easier to grasp, too. In this case, you have to reverse the sublists, i.e. call it as ordered_permutations_stack([['B', 'A'], ['D', 'C'], ['E'], ['F']]).
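If reversing the sublists by hand feels error-prone, you can do it programmatically before the call; a small usage sketch (the variable names are mine):
groups = [['A', 'B'], ['C', 'D'], ['E'], ['F']]

# pop() removes from the end of each sublist, so reverse each group first
stacks = [list(reversed(g)) for g in groups]

for perm in ordered_permutations_stack(stacks):
    print(perm)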

Finding number of occurrences of char from a short List/Array in an infinitely large List/Array

I have been working on a practical situation in which I require an algorithm, and have made a generic problem out of it. Consider that there are two arrays:
Source[10] = {'a', 'v', 'l', 'r', 'p', 's', 'x', 'd', 'q', 'o', 'g', 'm'}
Target[N]  = {'a', 'v', 'l', 'r', 'p', 's', 'x', 'd', 'q', 'o', 'g', 'm',
              'a', 'v', 'l', 'r', 'p', 'a', 'v', 'l', 'r', 'p', 'a', 'v', 'l', 'r', 'p',
              'a', 'v', 'l', 'r', 'p', 'a', 'v', 'l', 'r', 'p', 'a', 'v', 'l', 'r', 'p', .... }
We need to have an efficient algorithm to find the frequency of occurrences of characters from Source in Target.
I have thought of hashing the complete Target list and then iterating through the Source, doing lookups in the hashed list. Can people comment on / validate this approach?
If your character set is reasonably limited, you can use character codes as indexes into an array of counts. Let's say you have 16-bit characters. You can do this:
int[] counts = new int[65536];
foreach (char c in Target)
    counts[c]++;
With the array of counts in hand, you can easily find the frequency by looking up a code from the Source in the counts array.
This solution is asymptotically as fast as it could possibly get, but it may not be the most memory-efficient one.
I don't know what a hashed list is, so I can't comment on that. For efficiency, I would suggest turning the target array into a multiset. Guava has a nice implementation of such a thing (although the Java Collections Framework does not). So does Apache Commons (where it's called a Bag). You can then simply iterate through the source and look up the frequency of each element in the multiset. As described in this thread, using a multiset is easier than using a HashMap from elements to frequencies, although it does require using a third-party library.
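Although the answers above are framed in C#/Java terms, the same count-then-look-up idea in Python (the language used elsewhere on this page) is only a few lines with collections.Counter; this sketch is illustrative, not from either answer:
from collections import Counter

source = ['a', 'v', 'l', 'r', 'p', 's', 'x', 'd', 'q', 'o', 'g', 'm']
target = ['a', 'v', 'l', 'r', 'p', 's', 'x', 'd', 'q', 'o', 'g', 'm',
          'a', 'v', 'l', 'r', 'p', 'a', 'v', 'l', 'r', 'p']

# One pass over the (large) target builds the multiset of counts ...
counts = Counter(target)

# ... then each source character is a constant-time lookup.
for ch in source:
    print(ch, counts[ch])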
