Why are the two embedding vectors for the same key from two Word2Vec models so similar? - gensim

I am using two toy word sets to train my Word2Vec model with Gensim. The vocabulary in set 1 is 'x','y','c' and in set 2 is 'a','b','c'. After I trained the two sets separately with two different models, I found that the embedding vectors for the word 'c' are very similar. My understanding is that the embedding is randomly initialized, so you probably even need to align the vectors for the same words trained with separate models in order to put them in the same space. Then why are my two vectors so similar? Here is my code.
common_texts_1 = [['y', 'x', 'c', 'x', 'y', 'y']] +\
[['y', 'c', 'c', 'c', 'x', 'y']] +\
[['c', 'x', 'c', 'y', 'x', 'y']] +\
[['y', 'c', 'x', 'c', 'c', 'y']] +\
[['c', 'x', 'x', 'y', 'y', 'y']] +\
[['c', 'x', 'x', 'y', 'y', 'y']] +\
[['x', 'x', 'x', 'c', 'y', 'y']] +\
[['y', 'x', 'c', 'y', 'y', 'y']] +\
[['c', 'x', 'x', 'y', 'c', 'y']] +\
[['c', 'y', 'y', 'y', 'y', 'y']] +\
[['c', 'x', 'x', 'y', 'c', 'y']] +\
[['c', 'x', 'x', 'y', 'y', 'y']] +\
[['x', 'x', 'x', 'c', 'y', 'y']] +\
[['x', 'x', 'x', 'y', 'y', 'c']] +\
[['c', 'x', 'c', 'y', 'y', 'c']] +\
[['x', 'x', 'c', 'y', 'y', 'y']] +\
[['x', 'x', 'x', 'y', 'y', 'c']]
common_texts_2 = [['a', 'a', 'b', 'b', 'c', 'c']] +\
[['a', 'c', 'b', 'b', 'c', 'c']] +\
[['c', 'a', 'b', 'b', 'a', 'c']] +\
[['a', 'c', 'b', 'b', 'b', 'c']] +\
[['c', 'a', 'b', 'b', 'c', 'b']] +\
[['b', 'a', 'b', 'c', 'c', 'a']] +\
[['c', 'b', 'b', 'b', 'b', 'c']] +\
[['c', 'a', 'b', 'b', 'c', 'c']] +\
[['a', 'c', 'b', 'b', 'c', 'c']] +\
[['a', 'c', 'b', 'b', 'c', 'c']] +\
[['a', 'a', 'b', 'b', 'a', 'c']]
base_embed = gensim.models.Word2Vec(common_texts_1,
other_embed = gensim.models.Word2Vec(common_texts_2,

You shouldn't expect toy-sized tests like this to show the qualities that make the word2vec algorithm useful, nor to teach much about its operation – other than the limits of small, unrepresentative corner-cases.
The useful characteristics of word2vec word-vectors arise from large, varied datasets, with many subtly-contrasting word-usages, in natural contexts. You're unlikely to see that with a 3-word language, and it's even possible that the neighboring-word distributions in your synthetic 'texts' largely cancel out.
In particular, even if trying to make the tiniest workable training data, you'd want:
A vocabulary that's significantly larger than the vector-dimensionality, so that the model won't 'overfit' on a representation that's closer to one-hot than the 'dense-embedding' word2vec tries to create. (It's actually the challenge of fitting many words into a smaller space that helps push-and-pull words into interesting relative-configurations.)
Training data whose word frequencies, and co-occurrences, resemble natural-language patterns. (Word2vec can often be useful on other data, too, but the reliable territory for exploring its characteristics will look like the richness of language.)
Also, note that in real language data, you essentially never want to run word2vec with min_count=1 - those rare words don't have enough varied usage examples to get good generalizable vectors, but since (in usual Zipfian language distributions) there can nonetheless be a lot of them, they serve as 'noise' making the surrounding words worse. The default is min_count=5 because on adequately-sized corpora, that value (or even higher!) usually gives better results.
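To make the effect of min_count concrete, here is a minimal standard-library sketch of that kind of frequency filter. The helper name apply_min_count and the sample sentences are made up for illustration; this is not Gensim's own code, only an approximation of what discarding rare tokens before training looks like.

```python
from collections import Counter

def apply_min_count(sentences, min_count=5):
    """Hypothetical helper: drop tokens seen fewer than min_count times overall."""
    counts = Counter(token for sentence in sentences for token in sentence)
    kept = {token for token, n in counts.items() if n >= min_count}
    # Rare tokens are removed from each text, so the remaining words become neighbors.
    filtered = [[t for t in sentence if t in kept] for sentence in sentences]
    return filtered, kept

# Made-up sample data, small enough to check by hand.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["the", "cat", "saw", "the", "dog"],
]
filtered, kept = apply_min_count(sentences, min_count=2)
print(sorted(kept))
print(filtered[0])
```

Here 'mat', 'rug' and 'saw' each appear only once, so they are dropped and never get a vector of their own.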
Finally, while the word-vectors are randomly-initialized at the start, Gensim does choose to use the string tokens as initialization seeds (combined with the optional seed parameter). So, the string 'c' will in fact be initialized the same way, in any model that (a) uses the same seed; & (b) has the same dimensionality. In a real-sized dataset, training will tend to move the final word-vector arbitrarily far from the low-magnitude initialization - but in this sort of tiny dataset, where its neighbors are almost always the exact same 2 other words, it's not going to be getting a lot of meaningful nudges to new positions. I suspect that's why your wv['c'] is so similar in your two models.
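The reasoning above can be illustrated without Gensim at all. The following is a rough standard-library sketch, not Gensim's implementation: seeded_init, nudge, and the noise scales are all assumptions chosen for illustration. It derives a starting vector only from the token string and a seed, then compares what happens when "training" (simulated here as random noise) moves the vector a little versus a lot.

```python
import math
import random

def seeded_init(token, seed, size):
    """Hypothetical init: a small random vector that depends only on token + seed."""
    rng = random.Random(f"{token}|{seed}")  # str seeds are deterministic in Python 3
    return [(rng.random() - 0.5) / size for _ in range(size)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nudge(vec, scale, rng):
    """Stand-in for training updates: add uniform noise of the given magnitude."""
    return [x + rng.uniform(-scale, scale) for x in vec]

size, seed = 100, 1
start_a = seeded_init("c", seed, size)  # 'c' in the first model
start_b = seeded_init("c", seed, size)  # 'c' in the second model
print(start_a == start_b)  # same token, seed and size give the same start

init_scale = 0.5 / size  # largest magnitude of any starting coordinate
# Weak training: each model moves 'c' by a tenth of its starting scale.
weak_a = nudge(start_a, 0.1 * init_scale, random.Random(10))
weak_b = nudge(start_b, 0.1 * init_scale, random.Random(20))
# Strong training: each model moves 'c' far from where it started.
strong_a = nudge(start_a, 100 * init_scale, random.Random(10))
strong_b = nudge(start_b, 100 * init_scale, random.Random(20))

print(cosine(weak_a, weak_b))      # stays close to 1
print(cosine(strong_a, strong_b))  # the shared starting point is washed out
```

With weak updates the shared starting point dominates, so the two vectors stay nearly parallel, which matches what the toy corpora show. With strong, independent updates the common initialization contributes almost nothing to the final similarity.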
I'd suggest running experiments with a dimensionality of at least 100, a unique vocabulary (after enforcing min_count=5) of at least 10,000 tokens, and enough raw text so that all those 10,000 tokens have, on average, many dozens of subtly-varying, realistically-contrasting usage examples. Only then will the results start to reflect why people use word2vec.
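Before training, a quick check of the corpus against these rules of thumb might look like the sketch below. The function corpus_report and its field names are hypothetical, and it uses only the standard library. It reports the mean number of uses per surviving token, but deliberately sets no cutoff for "many dozens".

```python
from collections import Counter

def corpus_report(sentences, vector_size=100, min_count=5):
    """Hypothetical helper: summarize a tokenized corpus against the guidance above."""
    counts = Counter(token for sentence in sentences for token in sentence)
    surviving = {t: n for t, n in counts.items() if n >= min_count}
    vocab_size = len(surviving)
    mean_uses = sum(surviving.values()) / vocab_size if vocab_size else 0.0
    return {
        "vocab_size": vocab_size,
        "mean_uses_per_token": mean_uses,
        "vocab_at_least_10k": vocab_size >= 10_000,
        "vocab_exceeds_dimensions": vocab_size > vector_size,
    }

# Two of the toy texts from the question, kept at min_count=1 so nothing is dropped.
toy = [
    ["a", "a", "b", "b", "c", "c"],
    ["a", "c", "b", "b", "c", "c"],
]
report = corpus_report(toy, vector_size=100, min_count=1)
print(report)
```

On the toy texts the report shows a 3-token vocabulary that is far smaller than the 100 dimensions, which is exactly the situation described above as unrepresentative.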


