Best Data Structure For Text Processing - algorithm

I have this data structure (a dictionary of lists, with unique keys):
{'cat': ['feline', 'kitten'], 'brave': ['courageous', 'fearless'], 'little': ['mini']}
and a sentence like this:
A courageous feline was recently spotted in the neighborhood protecting her mini kitten
How would I efficiently process this text so that each synonym is converted to its root word (for example, every synonym of cat becomes cat), with output like this:
A brave cat was recently spotted in the neighborhood protecting her little cat
I want an algorithm that processes the initial text and converts each synonym into its root word (the key in the dictionary); the lists of keywords and synonyms will also grow longer over time.
Hence, first, I want to ask whether the data structure I am using can perform efficiently, and whether there is a more efficient structure.
For now, all I can think of is looping through each list inside the dictionary, searching for the synonym and then mapping it back to its keyword.
edit: Refined the question

Your dictionary is organised the wrong way around. It lets you quickly find a target word, but that is not helpful when your input does not contain the target word, only synonyms of it.
So organise your dictionary in the opposite sense:
d = {
    'feline': 'cat',
    'kitten': 'cat'
}
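If you are starting from the forward mapping in the question, the reversed dictionary can be built programmatically rather than by hand. A minimal sketch, where forward is a hypothetical name for the question's original root-to-synonyms dict:
# Invert the root -> synonyms mapping into a synonym -> root mapping
forward = {'cat': ['feline', 'kitten'], 'brave': ['courageous', 'fearless'], 'little': ['mini']}
d = {synonym: root for root, synonyms in forward.items() for synonym in synonyms}
# d == {'feline': 'cat', 'kitten': 'cat', 'courageous': 'brave', 'fearless': 'brave', 'mini': 'little'}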
To make the replacements, you can build a regular expression from the dictionary's keys and call re.sub with a callback function that looks up the translation of the found word:
import re
regex = re.compile(rf"\b(?:{ '|'.join(map(re.escape, d)) })\b")
s = "A feline was recently spotted in the neighborhood protecting her little kitten"
print(regex.sub(lambda match: d[match[0]], s))
The regular expression makes sure that the match is with a complete word, and not with a substring -- "cafeline" as input will not give a match for "feline".
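Putting it together with the full reversed dictionary produces the output the question asks for. A usage sketch, assuming the mapping built above:
import re

d = {'feline': 'cat', 'kitten': 'cat', 'courageous': 'brave', 'fearless': 'brave', 'mini': 'little'}
regex = re.compile(rf"\b(?:{'|'.join(map(re.escape, d))})\b")
s = "A courageous feline was recently spotted in the neighborhood protecting her mini kitten"
print(regex.sub(lambda match: d[match[0]], s))
# A brave cat was recently spotted in the neighborhood protecting her little cat
Each replacement is a single dictionary lookup and the whole text is scanned once, so this stays efficient as the keyword and synonym lists grow.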

Related

IFTTT JavaScript filter - How to make case insensitive searches + How to search Include and Exclude sets of terms

First off, I'm a total novice at JavaScript, so please go gently. I'm aware of how people feel about having to pay for IFTTT now, but it's perfect for what I need.
I am using a more expansive version of the code below to capture certain keywords from tweets and then generate emails if the search returns a positive result. This search works very nicely, except that it is case sensitive, which is a problem.
Yes, I know you can manipulate the Twitter search to pick up specific words or phrases, and I am very proficient at achieving searches that way. But I am casting a wide net to pick up approx 120 search words or phrases, which is too long to achieve through "OR" Twitter search parameters alone, which is why I'm using this.
Q1 - I have tried adding item.toLowerCase() and just .toLowerCase() in various parts of the code, so that it wouldn't matter if the case of the search term differed from the case of the original tweet text. I just can't get it to work, though. I've seen various posts on here, but I can't get any of them to work in IFTTT. I believe IFTTT doesn't accept regex either, which is annoying.
Any advice on how to get this code running so it's case-insensitive for text within IFTTT?
Q2 - I have approx 120 search terms for the tweet text to return positive results, and a lot of junk comes through with them. Does anyone know how to add a second layer of 'and exclude' search terms?
I have something like 300-400 words and specific phrases which would be used to stop the email from being triggered - so it'd be something like "IF tweet text contains a, b, c BUT text ALSO contains x, y, z... do not send the email"
let str=Twitter.newTweetFromSearch.Text;
let searchTerms=[
    "Northbound",
    "Westbound",
    "Southbound",
    "Eastbound"
]
let foundOne=0;
if(searchTerms.some(function(v){return str.indexOf(v)>=0;})){
    foundOne=1;
}
if(foundOne==0){
    Email.sendMeEmail.skip();
}
I have looked at the Twitter API, but that is a step too far for my coding ability, which is why I'm using IFTTT.
Any help is very much appreciated
Thank you.
I'm playing with IFTTT Filter myself at the moment, so here are some thoughts on solving your problem.
If you want to do a case-insensitive search on the original text, convert the original text to lowercase, then have all your search terms in lowercase.
Plus I think you want to iterate over the searchTerms array and use the includes() method. OK, I just realised that .some() does the iteration for you, but I prefer includes() over indexOf().
let str=Twitter.newTweetFromSearch.Text.toLowerCase();
let searchTerms=[
    "northbound",
    "westbound",
    "southbound",
    "eastbound"
]
let foundOne=0;
if(searchTerms.some(function(term){return str.includes(term);})){
    foundOne=1;
}
if(foundOne==0){
    Email.sendMeEmail.skip();
}
Or you could just skip having the foundOne variable, and do the search in the if() statement.
let str=Twitter.newTweetFromSearch.Text.toLowerCase();
let searchTerms=[
    "northbound",
    "westbound",
    "southbound",
    "eastbound"
]
if(!searchTerms.some(function(term){return str.includes(term);})){
    Email.sendMeEmail.skip();
}

Gensim most_similar() with Fasttext word vectors return useless/meaningless words

I'm using Gensim with FastText word vectors to return similar words.
This is my code:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('cc.it.300.vec')
words = model.most_similar(positive=['sole'],topn=10)
print(words)
This will return:
[('sole.', 0.6860659122467041), ('sole.Ma', 0.6750558614730835), ('sole.Il', 0.6727924942970276), ('sole.E', 0.6680260896682739), ('sole.A', 0.6419174075126648), ('sole.È', 0.6401025652885437), ('splende', 0.6336565613746643), ('sole.La', 0.6049465537071228), ('sole.I', 0.5922051668167114), ('sole.Un', 0.5904430150985718)]
The problem is that "sole" ("sun", in english) return a series of words with a dot in it (like sole., sole.Ma, ecc...). Where is the problem? Why most_similar return this meaningless word?
EDIT
I tried with the English word vectors, and the word "sun" returns this:
[('sunlight', 0.6970556974411011), ('sunshine', 0.6911839246749878), ('sun.', 0.6835992336273193), ('sun-', 0.6780728101730347), ('suns', 0.6730450391769409), ('moon', 0.6499731540679932), ('solar', 0.6437565088272095), ('rays', 0.6423950791358948), ('shade', 0.6366724371910095), ('sunrays', 0.6306195259094238)] 
Is it impossible to reproduce results like relatedwords.org?
Perhaps the bigger question is: why does the Facebook FastText cc.it.300.vec model include so many meaningless words? (I haven't noticed that before – is there any chance you've downloaded a peculiar model that has decorated words with extra analytical markup?)
To gain the unique benefits of FastText – including the ability to synthesize plausible (better-than-nothing) vectors for out-of-vocabulary words – you may not want to use the general load_word2vec_format() on the plain-text .vec file, but rather a Facebook-FastText specific load method on the .bin file. See:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_vectors
(I'm not sure that will help with these results, but if you choose to use FastText, you may be interested in using it "fully".)
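For reference, loading the native binary that way looks roughly like this. A sketch, assuming you have downloaded the corresponding cc.it.300.bin file (the Facebook-native binary companion of the plain-text .vec):
from gensim.models.fasttext import load_facebook_vectors

# Load the FastText vectors, including subword (character n-gram) information
wv = load_facebook_vectors('cc.it.300.bin')
# Out-of-vocabulary words now get vectors synthesized from their n-grams
print(wv.most_similar(positive=['sole'], topn=10))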
Finally, given the source of this training – common-crawl text from the open web, which may contain lots of typos/junk – these might be legitimate word-like tokens, essentially typos of sole, that appear often enough in the training data to get word-vectors. (And because they really are typo-synonyms for 'sole', they're not necessarily bad results for all purposes, just for your desired purpose of only seeing "real-ish" words.)
You might find it helpful to try using the restrict_vocab argument of most_similar(), to only receive results from the leading (most-frequent) part of all known word-vectors. For example, to only get results from among the top 50000 words:
words = model.most_similar(positive=['sole'], topn=10, restrict_vocab=50000)
Picking the right value for restrict_vocab might help in practice to leave out long-tail 'junk' words, while still providing the real/common similar words you seek.

Condense nested for loop to improve processing time with text analysis python

I am working on an untrained classifier model in Python 2.7, and I have a loop that looks like this:
features = [0 for i in xrange(len(dictionary))]
for bgrm in new_scored:
    for i in xrange(len(dictionary)):
        if bgrm[0] == dictionary[i]:
            features[i] = int(bgrm[1])
            break
I have a "dictionary" of bigrams that I have collected from a data set containing customer reviews and I would like to construct feature arrays of each review corresponding to the dictionary I have created. It would contain the frequencies of the bigrams found within the review of the features in the dictionary (I hope that makes sense). new_scored is a list of tuples which contains the bigrams found within a particular review paired with their relative frequency of occurrence in that review. The final feature arrays will be the same length as the original dictionary with few non zero entries.
The above works fine but I am looking at a data set of 13000 reviews, for each review to loop through this code is going to take for eeever (if my computer doesnt run out of RAM first). I have been sitting with it for a while and cannot see how I can condense it.
I am very new to python so I was hoping a more experienced could help with condensing it or perhaps point me in the right direction towards a library that will contain the function I need.
Thank you in advance!
Consider making dictionary an actual dict object (or some fancier subclass of dict if it better suits your needs), as opposed to an iterable (list or tuple seems like what it is now). dictionary could map bigrams as keys to an integer identifier that would identify a feature position.
If you refactor dictionary that way, then the loop can be rewritten as:
features = [0 for key in dictionary]
for bgram in new_scored:
    try:
        features[dictionary[bgram[0]]] = int(bgram[1])
    except KeyError:
        # do something if the bigram is not in the dictionary for some reason
        pass
This should convert what was an O(n) traversal through dictionary into an O(1) hash lookup.
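For illustration, building that dict from the existing sequence of bigrams might look like this; bigram_list is a hypothetical name for your current list-style dictionary:
# Map each bigram to its position in the original sequence,
# so the feature positions keep their existing meaning
dictionary = {bigram: index for index, bigram in enumerate(bigram_list)}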
Hope this helps.

Parsing one large array into several sub-arrays

I have a list of adjectives (found here) that I would like to be the basis for a "random_adjective(category)" method.
I'm really just taking a stab at this, as my first real attempt at a useful program.
Step 1: Open file, remove formatting. No problem.
list=File.read('adjectivelist')
list.gsub!(/\n/, " ")
The next step is to break the string up by category.
words = list.split(" ")
Now I have an array of every word in the file. Neat. The ones with a tilde before them represent the category names.
Now I would like to break up this LARGE array into several smaller ones, based on category.
I need help with the syntax here, although the pseudocode for this would be something like
Scan the array for an element which begins with a tilde.
Now create a new array based on the name of that element sans the tilde, and ALSO place this "category name" into the "categories" array. Now pull all the elements from the main array, and pop them into the sub-array, until you meet another tilde. Then repeat the process until there are no more elements in the array.
Finally I would pull a random word from the category named in the parameter. If there was no category name matching the parameter, it would return false and exit (this is simply in case I want to add more categories later.)
Tips would be appreciated
You may want to go back and split first time around like this:
categories = list.split(" ~")
Then each list item will start with the category name. This will save you having to go back through your data structure as you suggest. Consider that a tip: sometimes it's better to re-think the start of a coding problem than to head inexorably forwards.
The structure you are reaching towards is probably a Hash, where the keys are category names, and the values are arrays of all the matching adjectives. It might look like this:
{
    'category' => [ 'word1', 'word2', 'word3' ]
}
So you might do this:
words_in_category = Hash.new
categories.each do |category_string|
  cat_name, *words = category_string.split(" ")
  words_in_category[cat_name] = words
end
Finally, to pick a random element from an array, Ruby provides the very useful method sample, so you can just do this:
words_in_category[ chosen_category ].sample
. . . assuming chosen_category contains the string name of an actual category. I'll leave it to you to figure out how to put this all together and handle errors, bad input, etc.
Use slice_before:
categories = list.split(" ").slice_before(/~\w+/)
This will create a sub-array for each word starting with ~, containing all the words before the next matching word.
If this file format is your own and you have the freedom to change it, then I recommend saving the data in YAML or JSON format and reading it when needed. There are libraries to do this. That is all; no need to worry about the mess. Don't spend time reinventing the wheel.

Selecting random phrase from a list

I've been playing around with a .lua file which passes a random phrase using the following line:
SendChatMessage(GetRandomArgument("text1", "text2", "text3", "text4"), "RAID")
My problem is that I have a lot of phrases and the one line of code is very long indeed.
Is there a way to hold
text1
text2
text3
text4
in a list somewhere else in the code (or externally) and call a random value from the main code? That would make maintaining the list of text options easier.
For lists of up to a few hundred elements, the following will work:
messages = {
    "text1",
    "text2",
    "text3",
    "text4",
    -- ...
}
SendChatMessage(GetRandomArgument(unpack(messages)), "RAID")
For longer lists, you would be well served to replace GetRandomArgument with GetRandomElement that would take a single table as its argument and return a random entry from the table.
Edit: Olle's answer shows one way that something like GetRandomElement might be implemented, but it uses table.getn on every call, which is deprecated in Lua 5.1; its replacement (table.maxn) has a runtime cost proportional to the number of elements in the table.
The function table.maxn is only required if the table in use might have missing elements in its array part. However, for a list of items to choose among, there is likely no reason to allow holes in the list. If you need to edit the list at run time, you can always use table.remove to remove an item, since it will also close the gap.
With a guarantee of no gaps in the array of text, you can implement GetRandomElement like this:
function GetRandomElement(a)
    return a[math.random(#a)]
end
So that you send the message like this:
SendChatMessage(GetRandomElement(messages), "RAID")
You want a table to contain your phrases, like this:
phrases = { "tex1", "text2", "text3" }
table.insert(phrases ,"text4") -- alternative syntax
SendChatMessage(phrases[math.random(table.getn(phrases))], "RAID")
Note: getn gets the size of the table; math.random returns a random number (here with a maximum of the size of the phrases table); and the phrases[] syntax returns the table element at the index inside [].
