word2vec recommendation system KeyError: "word '21883' not in vocabulary" - gensim

The code works absolutely fine for the data set containing 500000+ instances, but whenever I reduce the data set to 5000/10000/15000 it throws a key error: word "***" not in vocabulary. Not for every data point, but for most of them it throws the error. The data set is in Excel format. (Screenshot of the error: https://i.stack.imgur.com/YCBiQ.png)
I don't know how to fix this problem since I have very little knowledge about it; I am still learning. Please help me fix this problem!
from gensim.models import Word2Vec
from tqdm import tqdm

purchases_train = []
for i in tqdm(customers_train):
    temp = train_df[train_df["CustomerID"] == i]["StockCode"].tolist()
    purchases_train.append(temp)

purchases_val = []
for i in tqdm(validation_df['CustomerID'].unique()):
    temp = validation_df[validation_df["CustomerID"] == i]["StockCode"].tolist()
    purchases_val.append(temp)

model = Word2Vec(window=10, sg=1, hs=0,
                 negative=10,  # for negative sampling
                 alpha=0.03, min_alpha=0.0007,
                 seed=14)

model.build_vocab(purchases_train, progress_per=200)
model.train(purchases_train, total_examples=model.corpus_count,
            epochs=10, report_delay=1)
model.save("word2vec_2.model")
model.init_sims(replace=True)

# extract all vectors
X = model[model.wv.vocab]
X.shape

products = train_df[["StockCode", "Description"]]
products.drop_duplicates(inplace=True, subset='StockCode', keep="last")
products_dict = products.groupby('StockCode')['Description'].apply(list).to_dict()

def similar_products(v, n=6):
    ms = model.similar_by_vector(v, topn=n+1)[1:]
    new_ms = []
    for j in ms:
        pair = (products_dict[j[0]][0], j[1])
        new_ms.append(pair)
    return new_ms

similar_products(model['21883'])

If you get a KeyError saying a word is not in the vocabulary, that's a reliable indicator that the word you're looking up was not in the training data fed to Word2Vec, or did not appear enough times (default min_count=5).
So, your error indicates the word-token '21883' did not appear at least 5 times in the texts (purchases_train) supplied to Word2Vec. You should do either or both of:
Ensure all words you're going to look up appear enough times, either with more training data or a lower min_count. (However, words with only one or a few occurrences tend not to get good vectors & instead just drag the quality of surrounding words' vectors down - so keeping this value above 1, or even raising it above the default of 5 to discard more rare words, is a better path whenever you have sufficient data.)
If your later code will be looking up words that might not be present, either check for their presence first (word in model.wv.vocab) or set up a try: ... except: ... to catch & handle the case where they're not present, as sketched below.
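For example, a minimal guard around the lookup might look like this (assuming gensim 3.x, where the vocabulary dict is exposed as model.wv.vocab; '21883' is just the hypothetical StockCode being looked up):

item = '21883'
if item in model.wv.vocab:
    recommendations = similar_products(model.wv[item])
else:
    recommendations = []  # the item was dropped for appearing fewer than min_count times

# or, equivalently, catch the error:
try:
    recommendations = similar_products(model.wv[item])
except KeyError:
    recommendations = []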

Related

Copying embeddings for gensim word2vec

I wanted to see if I can simply set new weights for gensim's Word2Vec without training. I got the 20 Newsgroups data set from scikit-learn (from sklearn.datasets import fetch_20newsgroups) and trained an instance of Word2Vec on it:
model_w2v = models.Word2Vec(sg = 1, size=300)
model_w2v.build_vocab(all_tokens)
model_w2v.train(all_tokens, total_examples=model_w2v.corpus_count, epochs = 30)
Here all_tokens is the tokenized data set.
Then I created a new instance of Word2Vec without training
model_w2v_new = models.Word2Vec(sg = 1, size=300)
model_w2v_new.build_vocab(all_tokens)
and set the embeddings of the new Word2Vec equal to the first one
model_w2v_new.wv.vectors = model_w2v.wv.vectors
Most of the functions work as expected, e.g.
model_w2v.wv.similarity( w1='religion', w2 = 'religions')
> 0.4796233
model_w2v_new.wv.similarity( w1='religion', w2 = 'religions')
> 0.4796233
and
model_w2v.wv.words_closer_than(w1='religion', w2 = 'judaism')
> ['religions']
model_w2v_new.wv.words_closer_than(w1='religion', w2 = 'judaism')
> ['religions']
and
entities_list = list(model_w2v.wv.vocab.keys())
entities_list.remove('religion')
model_w2v.wv.most_similar_to_given(entity1='religion',entities_list = entities_list)
> 'religions'
model_w2v_new.wv.most_similar_to_given(entity1='religion',entities_list = entities_list)
> 'religions'
However, most_similar doesn't work:
model_w2v.wv.most_similar(positive=['religion'], topn=3)
[('religions', 0.4796232581138611),
('judaism', 0.4426296651363373),
('theists', 0.43141329288482666)]
model_w2v_new.wv.most_similar(positive=['religion'], topn=3)
>[('roderick', 0.22643062472343445),
> ('nci', 0.21744996309280396),
> ('soviet', 0.20012077689170837)]
What am I missing?
Disclaimer: I posted this question on datascience.stackexchange but got no response; hoping to have better luck here.
Generally, your approach should work.
It's likely the specific problem you're encountering was caused by an extra probing step you took that is not shown in your code, because you had no reason to think it significant: some sort of most_similar()-like operation on model_w2v_new after its build_vocab() call but before the later, malfunctioning operations.
Traditionally, most_similar() calculations operate on a version of the vectors that has been normalized to unit-length. The 1st time these unit-normed vectors are needed, they're calculated – and then cached inside the model. So, if you then replace the raw vectors with other values, but don't discard those cached values, you'll see results like you're reporting – essentially random, reflecting the randomly-initialized-but-never-trained starting vector values.
If this is what happened, just discarding the cached values should cause the next most_similar() to refresh them properly, and then you should get the results you expect:
model_w2v_new.wv.vectors_norm = None
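Putting it together, the full sequence would be something like this (gensim 3.x attribute names, matching the code above):

# copy the trained weights into the untrained model
model_w2v_new.wv.vectors = model_w2v.wv.vectors
# discard the cached unit-normed vectors so they are recomputed from the new weights
model_w2v_new.wv.vectors_norm = None

# the next call re-normalizes lazily and should now match the trained model
model_w2v_new.wv.most_similar(positive=['religion'], topn=3)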

get_document_topics() returns empty list of topics

I have a one-word document that I want to transform to its bag-of-words representation:
so doc is ['party'] and id2word.doc2bow(doc) is [(229, 1)] which means the word is known.
However, if I call get_document_topics() with doc_bow, the result is an empty list:
id2word = lda.id2word
# ..
doc_bow = id2word.doc2bow(doc)
t = lda.get_document_topics(doc_bow)
try:
    label, prob = sorted(t, key=lambda x: -x[1])[0]
except Exception as e:
    print('Error!')
    raise e
The only possible explanation I'd have here is that this document (the single word) cannot be assigned to any topic. Is this the reason why I am seeing this?
Here is how I solved this issue:
In the file gensim/models/ldamodel.py, you need to change the value of epsilon to a larger value:
DTYPE_TO_EPS = {
    np.float16: 1e-5,
    np.float32: 1e-35,  # line to change
    np.float64: 1e-100,
}
Also, make sure to set the minimum_probability parameter of get_document_topics to a very small value.
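For the minimum_probability part, a minimal sketch using the code from the question (no library edit needed for this step):

doc_bow = id2word.doc2bow(['party'])
# ask for every topic, however small its weight, instead of the default cutoff
t = lda.get_document_topics(doc_bow, minimum_probability=1e-8)
label, prob = max(t, key=lambda x: x[1])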

Boost Confidence of Overlapping Observations In Apache Spark

I'm fairly new to scala/spark, so forgive me if my question is elementary but I've searched everywhere and can't find the answer.
Problem
I'm trying to boost the confidence scores of a bunch of network router observations (observations of probable router types at different network junctions).
I have a type NetblockObservation that combines a device type seen on a network with an associated netblock and a confidence. The confidence is the confidence that we accurately identified the device we saw.
case class NetblockObservation(
  device_type: String,
  ip_start: Long,
  ip_end: Long,
  confidence_score: Double
)
If the confidence is above some threshold thresh, then I want that observation to be in the returned dataset. If it's below thresh, it should not be.
In addition, if I have two observations with the same device_type and one contains the other, the containee should have its confidence increased by the confidence of the container.
Example
Let's say I have 3 Netblock Observations
// 0.0.0.0/28
NetblockObservation(device_type = "x", ip_start = 0, ip_end = 15, confidence_score = 0.4)
// 0.0.0.0/29
NetblockObservation(device_type = "x", ip_start = 0, ip_end = 7, confidence_score = 0.4)
// 0.0.0.0/30
NetblockObservation(device_type = "x", ip_start = 0, ip_end = 3, confidence_score = 0.4)
With a confidence threshold of 1, I would expect a single output of NetblockObservation(device_type = "x", ip_start = 0, ip_end = 3, confidence_score = 1.2)
Explanation: I am allowed to add the confidence scores of NetblockObservations together if one is contained in the other and they have the same device_type.
I was allowed to add the confidence score of the 0.0.0.0/29 to the confidence of the 0.0.0.0/30 because the /30 is contained within it (and likewise the /28's score, which is how the /30 reaches 0.4 + 0.4 + 0.4 = 1.2).
I was not allowed to add the confidence score of 0.0.0.0/30 to the 0.0.0.0/29 because 0.0.0.0/29 is not contained within 0.0.0.0/30.
My (pitiful) Attempt
Failure reason: Too slow / never completed
I attempted to implement this while simultaneously learning scala/spark so I'm not sure if it's the idea or the implementation which is wrong. I think it would eventually work but after an hour, it hadn't completed on a dataset of size 300,000 (small compared to production scale) so I gave up on it.
The idea is to find the largest netblock and separate the data into netblocks which are contained and netblocks which are not contained. The netblocks which are not contained are recursively passed back into the same function. If the largest netblock has a confidence_score of 1, the entire contained dataset is disregarded and the largest is added to the return dataset. If the confidence_score is less than 1, then its confidence_score is added to everything in the contained dataset and that group is recursively passed back into the same function. Eventually, you should only be left with the data which has a confidence_score greater than 1. This algorithm also has the issue of not taking device_type into account.
def handleDataset(largestInNetData: Option[NetblockObservation], netData: RDD[NetblockObservation]): RDD[NetblockObservation] = {
  if (netData.isEmpty) spark.sparkContext.emptyRDD else largestInNetData match {
    case Some(largest) =>
      val grouped = netData.groupBy(item =>
        if (item.ip_start >= largest.ip_start && item.ip_end <= largest.ip_end) largestInNetData
        else None)
      def lookup(k: Option[NetblockObservation]) = grouped.filter(_._1 == k).flatMap(_._2)
      val nos = handleDataset(None, lookup(None))
      // Threshold is assumed to be 1
      val next = if (largest.confidence_score >= 1) spark.sparkContext.parallelize(Seq(largest)) else
        handleDataset(None, lookup(largestInNetData)
          .filter(x => x != largest)
          .map(x => x.copy(confidence_score = x.confidence_score + largest.confidence_score)))
      nos ++ next
    case None =>
      val largest = netData.reduce((a: NetblockObservation, b: NetblockObservation) => if ((a.ip_end - a.ip_start) > (b.ip_end - b.ip_start)) a else b)
      handleDataset(Option(largest), netData)
  }
}
It is a fairly involved bit of code, so here is a general algorithm that I hope will help:
Forget about Spark for a moment and write a Scala function, probably in the companion object for NetblockObservation, that takes a collection of them and returns a subset of that collection that is contained (see the sketch after this list). You should unit test the heck out of this function, and again this is pure Scala.
Moving now to Spark. Do a groupBy on your RDD[NetblockObservation] with device_type as the key producing essentially a map of String to Iterable[NetblockObservation].
Filter out all the entries in the map that have a value of size 1 and have a confidence below thresh.
For the entries that remain, apply your function from the first step to the collections of NetblockObservations with a mapValues.
Do a reduceByKey or similar to simply add up the confidence_scores of the contained values.
Enjoy a refreshing beverage.
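If it helps, here is a rough, non-Spark illustration of the containment-and-boosting idea behind steps 1 and 5 (written in Python with plain tuples purely for compactness; the real version would be the pure-Scala function described above):

from collections import defaultdict

def boost_and_filter(observations, thresh=1.0):
    # observations: iterable of (device_type, ip_start, ip_end, confidence_score) tuples
    by_type = defaultdict(list)
    for obs in observations:
        by_type[obs[0]].append(obs)

    kept = []
    for group in by_type.values():
        for dev, start, end, conf in group:
            # add the confidence of every other observation that contains this one
            boosted = conf + sum(
                c for (_, s, e, c) in group
                if s <= start and end <= e and (s, e) != (start, end)
            )
            if boosted >= thresh:
                kept.append((dev, start, end, boosted))
    return kept

# With the three /28, /29, /30 observations above and thresh=1,
# only ("x", 0, 3, 1.2) survives.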

How would I find an unknown pattern in an array of bytes?

I am building a tool to help me reverse engineer database files. I am targeting my tool towards fixed record length flat files.
What I know:
1) Each record has an index(ID).
2) Each record is separated by a delimiter.
3) Each record is fixed width.
4) Each column in each record is separated by at least one x00 byte.
5) The file header is at the beginning (I say this because the header does not contain the delimiter..)
Delimiters I have found in other files are ( xFAxFA, xFExFE, xFDxFD ), but this is kind of irrelevant considering that I may use the tool on a different database in the future. So I will need something that will be able to pick out a 'pattern' regardless of how many bytes it is made of. Probably no more than 6 bytes? It would probably eat up too much data if it were more. But my experience doing this is limited.
So I guess my question is: how would I find UNKNOWN delimiters in a large file? I feel that, given 'what I know', I should be able to program something, I just don't know where to begin...
# Really loose pseudo code
def begin_some_how
  # THIS IS THE PART I NEED HELP WITH...
  # find all non-zero non-ascii sets of 2 or more bytes that repeat more than twice.
end

def check_possible_record_lengths
  possible_delimiter = begin_some_how
  # test if any of the above are always the same number of bytes apart from each other
  # (except one instance, the header...)
  possible_records = file.split(possible_delimiter)
  rec_length_count = possible_records.map{ |record| record.length }.uniq.count
  if rec_length_count == 2 # The header will most likely not be the same size.
    puts "Success! We found the fixed record delimiter: #{possible_delimiter}"
  else
    puts "Wrong delimiter found"
  end
end
possible = [",", "."]
result = [0, ""]
possible.each do |delimiter|
  sizes = file.split( delimiter ).map{ |record| record.size }
  next if sizes.size < 2
  average = 0.0 + sizes.inject{ |sum,x| sum + x }
  average /= sizes.size # This should be the record length if this is the right delimiter
  deviation = 0.0 + sizes.inject{ |sum,x| sum + (x-average)**2 }
  matching_value = average / (deviation**2)
  if matching_value > result[0] then
    result[0] = matching_value
    result[1] = delimiter
  end
end
Take advantage of the fact that the records have constant size. Take every possible delimiter and check how much each record deviates from the usual record length. If the header is small enough compared to the rest of the file, this should work.
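For the part marked "THIS IS THE PART I NEED HELP WITH", one brute-force way to enumerate candidate delimiters is to count every short non-zero, non-ASCII byte sequence and keep the ones that repeat. A rough sketch (in Python rather than Ruby; the file name and thresholds are made up) whose output could feed the record-length check above:

from collections import Counter

def candidate_delimiters(data, min_len=2, max_len=6, min_repeats=3):
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(data) - n + 1):
            chunk = data[i:i + n]
            # per the notes above: skip anything containing the x00 column padding,
            # and skip runs of plain printable ASCII
            if 0 in chunk or all(32 <= b < 127 for b in chunk):
                continue
            counts[chunk] += 1
    return [(seq, c) for seq, c in counts.most_common() if c >= min_repeats]

with open("unknown.db", "rb") as f:  # hypothetical input file
    data = f.read()
for seq, count in candidate_delimiters(data)[:10]:
    print(seq.hex(), count)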

Random iteration to fill a table in Lua

I'm attempting to fill a table of 26 values randomly. That is, I have a table called rndmalpha, and I want to randomly insert the values throughout the table. This is the code I have:
rndmalpha = {}
for i = 1, 26 do
  rndmalpha[i] = 0
end

valueadded = 0
while valueadded == 0 do
  a = math.random(1,26)
  if rndmalpha[a] == 0 then
    rndmalpha[a] = "a"
    valueadded = 1
  end
end

valueadded = 0
while valueadded == 0 do
  a = math.random(1,26)
  if rndmalpha[a] == 0 then
    rndmalpha[a] = "b"
    valueadded = 1
  end
end
...
The code repeats itself until "z", so this is just a general idea. The problem I'm running into, however, is that as the table fills up, the random picks hit an empty slot less and less often. This has the potential to freeze up the program, especially for the final letters, because only 2-3 positions still hold 0. So, what happens if the while loop goes through a million calls before it finally hits that last number? Is there an efficient way to say, "Hey, disregard positions 6, 13, 17, 24, and 25, and focus on filling the others"? For that matter, is there a much more efficient way to do what I'm doing overall?
The algorithm you are using seems pretty inefficient. It seems to me that all you need is to initialize a table with the whole alphabet:
math.randomseed(os.time())
local t = {"a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"}
and then shuffle the elements:
for i = 1, #t*2 do
  local a = math.random(#t)
  local b = math.random(#t)
  t[a], t[b] = t[b], t[a]
end
Swapping the elements #t*2 times gives pretty good randomness. If you need more randomness, increase the number of swaps, or use a better random number generator; the random() function provided by the C library is usually not that good.
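For what it's worth, a single backward pass (a Fisher-Yates shuffle) is enough to get a uniformly random permutation. A small sketch, shown in Python purely for illustration since the same swap loop ports directly to Lua:

import random

letters = list("abcdefghijklmnopqrstuvwxyz")
# walk backwards, swapping each position with a randomly chosen earlier (or same) position
for i in range(len(letters) - 1, 0, -1):
    j = random.randint(0, i)
    letters[i], letters[j] = letters[j], letters[i]
# letters is now a uniformly random permutation (random.shuffle does the same thing)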
Instead of drawing random positions for each letter until one lands, go through the table once and pick something random for each position. The method you're using could take forever because you might never hit that last empty slot.
Never repeat yourself. Never repeat yourself! If you're copying and pasting too often, it's a sure sign something has gone wrong. Use a second table to contain all the possible letters you can choose from, and then randomly pick from that.
letters = {"a","b","c","d","e"}
numberOfLetters = 5
rndmalpha = {}
for i = 1, 26 do
  rndmalpha[i] = letters[math.random(1, numberOfLetters)]
end

Resources