Should I split sentences in a document for Doc2Vec? - gensim

I am building a Doc2Vec model with 1000 documents using Gensim.
Each document consists of several sentences, and each sentence is a list of words.
Example:
Doc1: [[word1, word2, word3], [word4, word5, word6, word7],[word8, word9, word10]]
Doc2: [[word7, word3, word1, word2], [word1, word5, word6, word10]]
Initially, to train Doc2Vec, I split the documents into sentences and tagged each sentence with its document's tag using TaggedDocument. That gave me the following training input:
TaggedDocument(words=[word1, word2, word3], tags=['Doc1'])
TaggedDocument(words=[word4, word5, word6, word7], tags=['Doc1'])
TaggedDocument(words=[word8, word9, word10], tags=['Doc1'])
TaggedDocument(words=[word7, word3, word1, word2], tags=['Doc2'])
TaggedDocument(words=[word1, word5, word6, word10], tags=['Doc2'])
However, would it be okay to train the model with the document as a whole without splitting sentences?
TaggedDocument(words=[word1, word2, word3, word4, word5, word6, word7, word8, word9, word10], tags=['Doc1'])
TaggedDocument(words=[word7, word3, word1, word2, word1, word5, word6, word10], tags=['Doc2'])
Thank you in advance :)

Both approaches are going to be very similar in their effect.
The slight difference is that in PV-DM modes (dm=1), or PV-DBOW with added skip-gram training (dm=0, dbow_words=1), if you split by sentence, words in different sentences will never be within the same context-window.
For example, your 'Doc1' words 'word3' and 'word4' would never be averaged-together in the same PV-DM context-window-average, nor be used to PV-DBOW skip-gram predict-each-other, if you split by sentences. If you just run the whole doc's words together into a single TaggedDocument example, they would interact more, via appearing in shared context-windows.
Whether one or the other is better for your purposes is something you'd have to evaluate in your own analysis - it could depend a lot on the nature of the data & desired similarity results.
But, I can say that your second option, all the words in one TaggedDocument, is the more common/traditional approach.
(That is, as long as the document is still no more than 10,000 tokens long. If longer, splitting the doc's words into multiple TaggedDocument instances, each with the same tags, is a common workaround for an internal 10,000-token implementation limit.)

Related

Best Data Structure For Text Processing

Given a sentence and the following data structure (a dictionary of lists, with unique keys):
{'cat': ['feline', 'kitten'], 'brave': ['courageous', 'fearless'], 'little': ['mini']}
A courageous feline was recently spotted in the neighborhood protecting her mini kitten
How would I efficiently process the text so that every synonym of the word cat is converted to cat itself, giving output like this:
A brave cat was recently spotted in the neighborhood protecting her little cat
The algorithm I want should process the initial text and convert each synonym into its root word (the key in the dictionary); the lists of keywords and synonyms will keep getting longer as well.
So first, I want to ask whether the data structure I am using can do this efficiently, and whether there is a more efficient structure.
For now, I can only think of looping through each list inside the dictionary, searching for the synonym, and then mapping it back to its keyword.
edit: Refined the question
Your dictionary is organised in the wrong way. It will allow you to quickly find a target word, but that is not helpful when you have an input that does not have the target word, but some synonyms of it.
So organise your dictionary in the opposite sense:
d = {
    'feline': 'cat',
    'kitten': 'cat'
}
To make the replacements, you could create a regular expression and call re.sub with a callback function that will look up the translation of the found word:
import re
regex = re.compile(rf"\b(?:{ '|'.join(map(re.escape, d)) })\b")
s = "A feline was recently spotted in the neighborhood protecting her little kitten"
print(regex.sub(lambda match: d[match[0]], s))
The regular expression makes sure that the match is with a complete word, and not with a substring -- "cafeline" as input will not give a match for "feline".
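If you keep the original keyword-to-synonyms structure around, the inverted lookup table can be derived from it; a minimal sketch using the question's example data:

```python
import re

# Original structure from the question: root word -> list of synonyms.
synonyms = {'cat': ['feline', 'kitten'],
            'brave': ['courageous', 'fearless'],
            'little': ['mini']}

# Invert it: synonym -> root word (assumes each synonym maps to one root).
d = {syn: root for root, syns in synonyms.items() for syn in syns}

regex = re.compile(rf"\b(?:{'|'.join(map(re.escape, d))})\b")
s = ("A courageous feline was recently spotted in the "
     "neighborhood protecting her mini kitten")
print(regex.sub(lambda m: d[m[0]], s))
# -> A brave cat was recently spotted in the neighborhood protecting her little cat
```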

Gensim most_similar() with Fasttext word vectors return useless/meaningless words

I'm using Gensim with FastText word vectors to return similar words.
This is my code:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('cc.it.300.vec')
words = model.most_similar(positive=['sole'],topn=10)
print(words)
This will return:
[('sole.', 0.6860659122467041), ('sole.Ma', 0.6750558614730835), ('sole.Il', 0.6727924942970276), ('sole.E', 0.6680260896682739), ('sole.A', 0.6419174075126648), ('sole.È', 0.6401025652885437), ('splende', 0.6336565613746643), ('sole.La', 0.6049465537071228), ('sole.I', 0.5922051668167114), ('sole.Un', 0.5904430150985718)]
The problem is that "sole" ("sun" in English) returns a series of words with a dot in them (like sole., sole.Ma, etc.). Where is the problem? Why does most_similar return these meaningless words?
EDIT
I tried with the English word vectors and the word "sun" returns this:
[('sunlight', 0.6970556974411011), ('sunshine', 0.6911839246749878), ('sun.', 0.6835992336273193), ('sun-', 0.6780728101730347), ('suns', 0.6730450391769409), ('moon', 0.6499731540679932), ('solar', 0.6437565088272095), ('rays', 0.6423950791358948), ('shade', 0.6366724371910095), ('sunrays', 0.6306195259094238)] 
Is it impossible to reproduce results like relatedwords.org?
Perhaps the bigger question is: why does the Facebook FastText cc.it.300.vec model include so many meaningless words? (I haven't noticed that before – is there any chance you've downloaded a peculiar model that has decorated words with extra analytical markup?)
To gain the unique benefits of FastText – including the ability to synthesize plausible (better-than-nothing) vectors for out-of-vocabulary words – you may not want to use the general load_word2vec_format() on the plain-text .vec file, but rather a Facebook-FastText specific load method on the .bin file. See:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_vectors
(I'm not sure that will help with these results, but if choosing to use FastText, you may be interested in using it "fully".)
Finally, given the source of this training – common-crawl text from the open web, which may contain lots of typos/junk – these might be legitimate word-like tokens, essentially typos of sole, that appear often enough in the training data to get word-vectors. (And because they really are typo-synonyms for 'sole', they're not necessarily bad results for all purposes, just for your desired purpose of only seeing "real-ish" words.)
You might find it helpful to try using the restrict_vocab argument of most_similar(), to only receive results from the leading (most-frequent) part of all known word-vectors. For example, to only get results from among the top 50000 words:
words = model.most_similar(positive=['sole'], topn=10, restrict_vocab=50000)
Picking the right value for restrict_vocab might help in practice to leave out long-tail 'junk' words, while still providing the real/common similar words you seek.

Extracting food items from sentences

Given a sentence:
I had peanut butter and jelly sandwich and a cup of coffee for
breakfast
I want to be able to extract the following food items from it:
peanut butter and jelly sandwich
coffee
Till now, using POS tagging, I have been able to extract the individual food items, i.e.
peanut, butter, jelly, sandwich, coffee
But like I said, what I need is peanut butter and jelly sandwich instead of the individual items.
Is there some way of doing this without having a corpus or database of food items in the backend?
You can attempt it with a trained set that contains a corpus of food items, but the approach should work without one too.
Instead of doing simple POS tagging, do dependency parsing combined with POS tagging.
That way you can find relations between multiple tokens of the phrase, and by parsing the dependency tree with restricted conditions, like noun-noun dependencies, you can find the relevant chunk.
You can use spaCy for dependency parsing. Here is the output from displaCy:
https://demos.explosion.ai/displacy/?text=peanut%20butter%20and%20jelly%20sandwich%20is%20delicious&model=en&cpu=1&cph=1
You can use freely available data such as https://en.wikipedia.org/wiki/Lists_of_foods (or something better) as a training set to create a base set of food items (following the hyperlinks in the crawled tree).
Based on the dependency parsing of your new data, you can keep enriching the base data. For example: if 'butter' exists in your corpus, and 'peanut butter' is a frequently encountered pair of tokens, then 'peanut' and 'peanut butter' also get added to the corpus.
The corpus can be maintained in a file which can be loaded into memory while processing, or in a database like Redis or Aerospike.
Make sure you work with normalized text, i.e. lowercased, special characters cleaned, words lemmatized/stemmed, both in the corpus and in the data being processed. That will increase your coverage and accuracy.
First extract all Noun phrases using NLTK's Chunking (code copied from here):
import nltk
import re
import pprint
from nltk import Tree
import pdb

patterns = """
    NP: {<JJ>*<NN*>+}
        {<JJ>*<NN*><CC>*<NN*>+}
        {<NP><CC><NP>}
        {<RB><JJ>*<NN*>+}
    """
NPChunker = nltk.RegexpParser(patterns)

def prepare_text(input):
    sentences = nltk.sent_tokenize(input)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    sentences = [NPChunker.parse(sent) for sent in sentences]
    return sentences

def parsed_text_to_NP(sentences):
    nps = []
    for sent in sentences:
        tree = NPChunker.parse(sent)
        print(tree)
        for subtree in tree.subtrees():
            if subtree.label() == 'NP':
                t = subtree
                t = ' '.join(word for word, tag in t.leaves())
                nps.append(t)
    return nps

def sent_parse(input):
    sentences = prepare_text(input)
    nps = parsed_text_to_NP(sentences)
    return nps

if __name__ == '__main__':
    print(sent_parse('I ate peanut butter and beef burger and a cup of coffee for breakfast.'))
This will POS-tag your sentences and use a regex parser to extract noun phrases.
1. Define and refine your noun-phrase regex
You'll need to change the patterns regex to define and refine your noun phrases. For example, the rule {<NP><CC><NP>} tells the parser that an NP followed by a coordinator (CC) like "and" and another NP is itself an NP.
2. Change from the NLTK POS tagger to the Stanford POS tagger
I also noticed that NLTK's POS tagger is not performing very well (e.g. it considers "had peanut" a verb phrase). You can change the POS tagger to the Stanford parser if you want.
3. Remove smaller noun phrases
After you have extracted all the noun phrases for a sentence, you can remove the ones that are part of a bigger noun phrase. For example, in the output below "beef burger" and "peanut butter" should be removed because they're part of the bigger noun phrase "peanut butter and beef burger".
4. Remove noun phrases in which none of the words is in a food lexicon
You will get noun phrases like "school bus". If neither "school" nor "bus" is in a food lexicon that you can compile from Wikipedia or WordNet, then remove the noun phrase. In this case, "cup" and "breakfast" would be removed because they're hopefully not in your food lexicon.
The current code returns
['peanut butter and beef burger', 'peanut butter', 'beef burger', 'cup', 'coffee', 'breakfast']
for input
print(sent_parse('I ate peanut butter and beef burger and a cup of coffee for breakfast.'))
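Step 3 (dropping noun phrases subsumed by a longer one) can be sketched as a standalone helper (not part of the original code):

```python
def remove_subsumed(phrases):
    """Keep only noun phrases that are not substrings of a longer phrase."""
    return [p for p in phrases
            if not any(p != q and p in q for q in phrases)]

# Output of the chunker for the example sentence above.
nps = ['peanut butter and beef burger', 'peanut butter', 'beef burger',
       'cup', 'coffee', 'breakfast']
print(remove_subsumed(nps))
# -> ['peanut butter and beef burger', 'cup', 'coffee', 'breakfast']
```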
Too much for a comment, but not really an answer:
I think you would at least get closer if, when you got two foods without a proper separator, you combined them into one food. That would give peanut butter, jelly sandwich, coffee.
If you have correct English you could detect this case by count/non-count nouns. Correcting the original to "I had a peanut butter and jelly sandwich and a cup of coffee for breakfast": butter is non-count, so you can't have "a butter", but you can have "a sandwich". Thus the "a" must apply to sandwich, and despite the "and", "peanut butter" and "jelly sandwich" must be the same item: "peanut butter and jelly sandwich". Your mistaken sentence would parse the other way, though!
I would be very surprised if you could come up with general rules that cover every case, though. I would come at this sort of thing figuring that a few would leak and need a database to catch.
You could search for n-grams in your text where you vary the value of n. For example, if n=5 then you would extract "peanut butter and jelly sandwich" and "cup of coffee for breakfast", depending on where you start your search in the text for groups of five words. You won't need a corpus of text or a database to make the algorithm work.
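A sketch of that n-gram idea (the helper name is mine; deciding which of the generated n-grams are actually food items still needs some signal):

```python
def ngrams(tokens, n):
    """Return all contiguous n-word windows over the token list."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I had peanut butter and jelly sandwich and a cup of coffee for breakfast".split()
print(ngrams(tokens, 5))
# 'peanut butter and jelly sandwich' is among the windows.
```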
A rule-based approach with a lexicon of all food items would work here.
You can use GATE for this, with JAPE rules.
In the above example your JAPE rule would have a condition to find all (NP CC NP) && NP in "FOOD LEXICON".
I can share detailed JAPE code in the event you plan to go this route.

Wiktionary/MediaWiki Search & Suffix Filtering

I'm building an application that will hopefully use Wiktionary words and definitions as a data source. In my queries, I'd like to be able to search for all Wiktionary entries that are similar to user provided terms in either the title or definition, but also have titles ending with a specified suffix (or one of a set of suffixes).
For example, I want to find all Wiktionary entries that contain the words "large dog", like this:
https://en.wiktionary.org/w/api.php?action=query&list=search&srsearch=large%20dog
But further filter the results to only contain entries with titles ending with "d". So in that example, "boarhound", "Saint Bernard", and "unleashed" would be returned.
Is this possible with the MediaWiki search API? Do you have any recommendations?
This is mostly possible with ElasticSearch/CirrusSearch, but disabled for performance reasons. You can still use it on your wiki, or attempt smart search queries.
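If the server-side regex search is unavailable, one workaround is to page through `list=search` results and filter titles client-side; a sketch using only the standard library (the function name is mine):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def search_with_suffix(term, suffix, limit=50):
    """Search Wiktionary via the MediaWiki API and keep only result
    titles that end with `suffix` (client-side filtering)."""
    params = urlencode({"action": "query", "list": "search",
                        "srsearch": term, "srlimit": limit,
                        "format": "json"})
    with urlopen(f"https://en.wiktionary.org/w/api.php?{params}") as resp:
        data = json.load(resp)
    return [hit["title"] for hit in data["query"]["search"]
            if hit["title"].endswith(suffix)]

# Example (requires network access):
# print(search_with_suffix("large dog", "d"))
```

For larger result sets you would also pass `sroffset` to page through continuation results, at the cost of many requests for a broad suffix like a single letter.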
Usually for Wiktionary I use yanker, which can access the page table of the database. Your example (one-letter suffix) would be huge, but for instance .*hound$ finds:
Afghan_hound
Bavarian_mountain_hound
Foxhound
Irish_Wolfhound
Mahound
Otterhound
Russian_Wolfhound
Scottish_Deerhound
Tripehound
basset_hound
bearhound
black_horehound
bloodhound
boarhound
bookhound
boozehound
buckhound
chowhound
coon_hound
coonhound
covert-hound
covert_hound
coverthound
deerhound
double-nosed_andean_tiger_hound
elkhound
foxhound
gazehound
gorehound
grayhound
greyhound
harehound
heckhound
hell-hound
hell_hound
hellhound
hoarhound
horehound
hound
limehound
lyam-hound
minkhound
newshound
nursehound
otterhound
powder_hound
powderhound
publicity-hound
publicity_hound
rock_hound
rockhound
scent_hound
scenthound
shag-hound
sighthound
sleuth-hound
sleuthhound
slot-hound
slowhound
sluthhound
smooth_hound
smoothhound
smuthound
staghound
war_hound
whorehound
wolfhound

Fuzzy matching of words with phrases with ruby

I want to match a bunch of data with a short number of services
My data would look something like this
{"title" : "blorb",
 "category" : "zurb",
 "description" : "Massage is the manipulation of superficial and deeper layers of muscle and connective tissue using various techniques, to enhance function, aid in the healing process, decrease muscle reflex activity..."
}
and I have to match it with
["Swedish Massage", "Haircut"]
Clearly "Swedish Massage" should be the winner, but running a benchmark shows that "Haircut" scores higher:
require 'amatch'

arr = [:levenshtein_similar, :hamming_similar, :pair_distance_similar,
       :longest_subsequence_similar, :longest_substring_similar,
       :jaro_similar, :jarowinkler_similar]

arr.each do |method|
  ["Swedish Massage", "Haircut"].each do |sh|
    pp ">>> #{sh} matched with #{method.to_s}"
    pp sh.send(method, description)
  end
end and nil
result:
">>> Swedish Massage matched with jaro_similar"
# 0.5246896118183247
">>> Haircut matched with jaro_similar"
# 0.5353606789250354
">>> Swedish Massage matched with jarowinkler_similar"
# 0.5246896118183247
">>> Haircut matched with jarowinkler_similar"
# 0.5353606789250354
The rest of the indices are well below 0.1
What would be a better approach to solving this problem?
Search is a constant battle between precision and recall. One thing you could try is splitting your input by words - this will result in a much stronger match on Massage but with the consequence of broadening out the result set. You will now find sentences returned with only words close to Swedish. You could then try to control that broadening by averaging the results for multiple words, using stop lists to avoid common words like and, boosts for finding tokens close to each other etc, but you will never see truly perfect results. If you're really interested in fine tuning this I recommend ElasticSearch - relatively easy to learn and powerful.
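As an illustration of the word-splitting idea (sketched in Python with the stdlib `difflib` rather than Ruby's amatch; the averaging scheme is an assumption, not a drop-in fix):

```python
from difflib import SequenceMatcher

# Shortened description from the question's data.
description = ("Massage is the manipulation of superficial and deeper "
               "layers of muscle and connective tissue using various "
               "techniques, to enhance function, aid in the healing process")

def word_level_score(candidate, text):
    """Score a candidate phrase by averaging, for each of its words,
    the best similarity against any single word of the text."""
    text_words = text.lower().split()
    best_per_word = [
        max(SequenceMatcher(None, word, tw).ratio() for tw in text_words)
        for word in candidate.lower().split()
    ]
    return sum(best_per_word) / len(best_per_word)

for service in ["Swedish Massage", "Haircut"]:
    print(service, word_level_score(service, description))
```

Because "massage" matches a text word exactly, "Swedish Massage" now scores well above "Haircut", at the cost of the broadened matching the answer above warns about.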
