Matching keywords with sentence database, how to avoid duplicated keywords in results? - ruby

I'm very new to programming and am a beginner in Ruby. I've done a lot of searching to try to find the answers I need, but nothing seems to match what I'm looking for.
I need to make a program for work that will:
1. Get keywords from the user
2. Match those keywords with the same keywords in a database of sentences, and then
3. Spit out randomized sentences that:
   - contain all the keywords 1 time
   - do NOT contain keywords not listed
   - do NOT duplicate keywords
Important to know: Sentences all have a mix of several keywords, NOT one per sentence
1 & 2 are OK, I've been able to do those. My problem is with part 3. I've tried long lists of "if include?" parameters, but it never ends up working and I know there must be a better way to do this.
My grasp of Ruby (and programming generally) is basic and I don't really know what it can and can't do, so any tips or hints in what functions would be useful would be very very much appreciated.

If the match is found, why don't you consecutively pop it out of your array/db? It will ensure no duplication, since that record would not be present to be matched later. No?
Consider this snippet:
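db = [%q(It is hot today), %q(It is going to rain), %q(Where are you, sonny?), %q(sentence contains is and are)]
keyw = %w(is am are)
de = []  # indices of the sentences matched by the current keyword
keyw.each do |word|
  db.each_index do |index|
    # include? is a substring test, so "is" would also match "this"
    if db[index].include?(word)
      puts "Matched #{word} with #{db[index]}"
      de << index
    end
  end
  # Delete from the highest index down so the remaining indices stay valid
  db.delete_at(de.pop) until de.empty?
end
db is an example database and keyw contains the keywords.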
Corresponding output:
Matched is with It is hot today
Matched is with It is going to rain
Matched is with sentence contains is and are
Matched are with Where are you, sonny?
No duplication. :)
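A minimal sketch of the same "remove it once it has matched" idea, written without index bookkeeping. The helper names whole_word? and claim_sentences are made up for illustration, and matching on word boundaries (so that "is" does not match "this") is an assumption about what the asker wants rather than something stated in the question:

```ruby
# Hypothetical helper: true when `word` occurs in `sentence` as a whole word.
def whole_word?(sentence, word)
  sentence.match?(/\b#{Regexp.escape(word)}\b/i)
end

# Hypothetical helper: hands each sentence to the first keyword that matches it,
# so no sentence can be returned twice. The input array is left untouched.
def claim_sentences(sentences, keywords)
  remaining = sentences.dup
  keywords.each_with_object({}) do |word, claimed|
    # partition splits into [matching, non-matching]; only the
    # non-matching sentences stay available for later keywords.
    hits, remaining = remaining.partition { |s| whole_word?(s, word) }
    claimed[word] = hits
  end
end

db = ["It is hot today", "It is going to rain",
      "Where are you, sonny?", "sentence contains is and are"]
claimed = claim_sentences(db, %w[is am are])
claimed.each do |word, hits|
  hits.each { |s| puts "Matched #{word} with #{s}" }
end
```

Calling sample on one of the returned arrays would pick a random sentence for that keyword.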

Related

Is there an algorithm to filter out sentence fragments?

I have in a database thousands of sentences (highlights from kindle books) and some of them are sentence fragments (e.g. "You can have the nicest, most") which I am trying to filter out.
As per some definition I found, a sentence fragment is missing either its subject or its main verb.
I tried to find some kind of sentence fragment algorithm but without success.
But anyway in the above example, I can see the subject (You) and the verb (have) but it still doesn't look like a full sentence to me.
I thought about restricting on the length (for example, excluding strings shorter than 30 characters) but I don't think it's a good idea.
Any suggestions on how you would do it?

Rasa RegexFeaturizer is it based on token or whole sentence?

- regex: regex features for intent classification
  examples: |
    - \bon road pric/i
    - \bonroad pric/i
I have tested the regexes above and they work fine, so I am sure there is no issue with the expressions themselves.
Example:
training-row-1] Please tell me on road price now.
training-row-2] Please tell me price now.
Based on the regex patterns above, the regex features that should be added are:
training-row-1] Please tell me on road price now. ==> TRUE (the regex matches)
training-row-2] Please tell me price now. ==> FALSE (the regex doesn't match)
My question is: in RegexFeaturizer, does the regex match happen on the whole sentence or on each token?
It makes sense to have it on the whole sentence.
Is the featurization I have assumed above correct or not?
I've found the following docstring in the code for the RegexFeaturizer.
"""
Given a sentence, returns a vector of {1,0} values indicating which
regexes did match. Furthermore, if the message is tokenized, the
function will mark all tokens with a dict relating the name of the
regex to whether it was matched.
"""
So I think it's taking the entire sentence as input. It's hard to see inside of the feature space in Rasa but I've confirmed that the correct entity is picked up across tokens when using the RegexEntityExtractor. This is easily verified by temporarily adding entity examples in your NLU data (make sure it appears at least twice in intents) and running rasa interactive.
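To make the distinction concrete, here is a small illustration of why a pattern that spans a space can only fire when it is applied to the whole sentence. This is not Rasa code (Rasa is written in Python); it is a plain Ruby sketch, and splitting on whitespace is only an assumed stand-in for a tokenizer:

```ruby
# The first pattern from the question, with /i as the case-insensitive flag.
pattern = /\bon road pric/i
sentence = "Please tell me on road price now."

# Applied to the whole sentence, the pattern can span several words.
whole_sentence_match = sentence.match?(pattern)

# Applied to one token at a time, no single token contains "on road pric".
tokens = sentence.split
per_token_match = tokens.any? { |token| token.match?(pattern) }

puts "whole sentence: #{whole_sentence_match}"
puts "per token:      #{per_token_match}"
```

The whole-sentence check succeeds and the per-token check does not, which is consistent with the docstring's description of matching against the sentence first.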

Is Regex faster than array comparison in this case?

Say I have an incoming string that I want to scan to see if it contains any of the words I have chosen to be "bad." :)
Is it faster to split the string into an array, as well as keep the bad words in an array, and then iterate through each bad word as well as each incoming word and see if there's a match, kind of like:
badwords.each do |badword|
  incoming.each do |word|
    trigger = true if badword == word
  end
end
OR is it faster to do this:
incoming.each do |word|
  trigger = true if badwords.include? word
end
OR is it faster to leave the string as it is and run a .match() with a regex that looks something like:
/\bbadword1\b|\bbadword2\b|\bbadword3\b/
Or is the performance difference almost completely negligible? Been wondering this for a while.
You're giving the regex an advantage by not stopping your loop when it finds a match. Try:
incoming.find{|word| badwords.include? word}
My money is still on the regex though which should be simplified to:
/\b(badword1|badword2|badword3)\b/
or to make it a fair fight:
/\A(badword1|badword2|badword3)\z/
Once it is compiled, the Regex is the fastest in real life (i.e. really long incoming string, many similar bad words, etc.) since it can run on incoming in situ and will handle overlapping parts of your "bad words" really well.
The answer probably depends on the number of bad words to check: if there is only one bad word it probably doesn't make a huge difference, but if there are 50 then checking an array would probably get slow. On the other hand, with tens or hundreds of thousands of words the regexp probably won't be too fast either.
If you need to handle large numbers of bad words, you might want to consider splitting into individual words and then using a bloomfilter to test whether the word is likely to be bad or not.
This does not exactly answer your question, but it will definitely help you solve it.
Take some examples of what you are trying to achieve and benchmark them.
You can find out how to do benchmarking in Ruby at the links below.
Just put the various forms inside the report block, get the benchmarks, and decide for yourself what suits you best.
http://ruby.about.com/od/tasks/f/benchmark.htm
http://ruby-doc.org/stdlib-1.9.3/libdoc/benchmark/rdoc/Benchmark.html
For better solutions use the real data to test.
Benchmarks are always better than discussions :)
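A minimal sketch of such a benchmark using the standard library's Benchmark module. The word list, the sample text, and the repetition count are made-up placeholders, and the Set variant is an extra candidate that was not part of the question; timings will vary by machine and Ruby version, so none are quoted here:

```ruby
require "benchmark"
require "set"

# Placeholder data: swap in your real word list and real input.
badwords = %w[badword1 badword2 badword3]
bad_set  = badwords.to_set
pattern  = /\b(?:#{badwords.map { |w| Regexp.escape(w) }.join("|")})\b/
incoming = ("some harmless filler text " * 1_000) + "badword3"

# Each candidate as a lambda so the benchmark and a sanity check share code.
array_check = -> { incoming.split.any? { |word| badwords.include?(word) } }
set_check   = -> { incoming.split.any? { |word| bad_set.include?(word) } }
regex_check = -> { incoming.match?(pattern) }

# Sanity check: all three approaches should agree before we time them.
results = [array_check.call, set_check.call, regex_check.call]

Benchmark.bm(8) do |x|
  x.report("array") { 50.times { array_check.call } }
  x.report("set")   { 50.times { set_check.call } }
  x.report("regex") { 50.times { regex_check.call } }
end
```

Note that the array and set variants pay for splitting the string on every run, which is part of their real cost when the input arrives as a single string.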
If you want to scan a string for occurrences of words, use scan to find them.
Use Regexp.union to build a pattern that will find the strings in your black-list. You will want to wrap the result with \b to force matching word-boundaries, and use a case-insensitive search.
To give you an idea of how Regexp.union can help:
words = %w[foo bar]
Regexp.union(words)
=> /foo|bar/
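# Interpolate .source rather than the Regexp itself: an interpolated Regexp is
# embedded as (?-mix:foo|bar), which switches case-insensitivity off inside the group.
'Daniel Foo killed him a bar'.scan(/\b(?:#{Regexp.union(words).source})\b/i)
=> ["Foo", "bar"]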
You could also build the pattern using Regexp.new or /.../ if you want a bit more control:
Regexp.new('\b(?:' + words.join('|') + ')\b', Regexp::IGNORECASE)
=> /\b(?:foo|bar)\b/i
/\b(?:#{words.join('|')})\b/i
=> /\b(?:foo|bar)\b/i
'Daniel Foo killed him a bar'.scan(/\b(?:#{words.join('|')})\b/i)
=> ["Foo", "bar"]
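One practical difference between the two ways of building the pattern: Regexp.union escapes regex metacharacters in the strings it is given, while a plain join does not. A small sketch with a made-up word list that contains a dot:

```ruby
# Hypothetical list where one entry contains a regex metacharacter.
words = %w[a.b foo]

# Regexp.union escapes each string, so the dot only matches a literal dot.
escaped = Regexp.union(words)

# Joining by hand leaves the dot as "any character".
unescaped = Regexp.new(words.join("|"))

puts escaped.source    # the dot appears escaped in the pattern source
puts unescaped.source  # the dot is left as-is

puts "axb".match?(escaped)    # the literal-dot pattern rejects "axb"
puts "axb".match?(unescaped)  # the unescaped pattern accepts it
```

If the pattern is built by hand, mapping each word through Regexp.escape before joining gives the same protection.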
As a word of advice, black-listing words you find offensive is easily tricked by a user, and often gives results that are wrong because many "offensive" words are only offensive in a certain context. A user can deliberately misspell them or use "l33t" speak and have an almost inexhaustible supply of alternate spellings that will make you constantly update your list. It's a source of enjoyment to some people to fool a system.
I was once given a similar task and wrote a translator to supply alternate spellings for "offensive" words. I started with a list of words and terms I'd gleaned from the Internet and started my code running. After several million alternates were added to the database I pulled the plug and showed management it was a fool's errand because it was trivial to fool it.

What Ruby Regex code can I use for obtaining "out of sight" from the input "outofsight"?

I'm building an application that returns results based on a movie input from a user. If the user messes up and forgets to space out the title of the movie, is there a way I can still take the input and return the correct data? For example, "outofsight" should still be interpreted as "out of sight".
There is no regex that can do this in a good and reliable way. You could try a search server like Solr.
Alternatively, you could do auto-complete in the GUI (if you have one) on the input of the user, and this way mitigate some of the common errors users can end up doing.
Example:
User wants to search for "outofsight"
Starts typing "out"
Sees "out of sight" as suggestion
Selects "out of sight" from suggestions
????
PROFIT!!!
There's no regex that can tell you where the word breaks were supposed to be. For example, if the input is "offlight", is it supposed to return "Off Light" or "Of Flight"?
This is impossible without a dictionary and some kind of fuzzy-search algorithm. For the latter see How can I do fuzzy substring matching in Ruby?.
You could take a string and put \s* in between each character.
So outofsight would be converted to:
o\s*u\s*t\s*o\s*f\s*s\s*i\s*g\s*h\s*t
... and match out of sight.
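A minimal sketch of that idea. The helper name spaced_pattern and the title list are made up for illustration; the pattern is anchored and case-insensitive, which are choices of this sketch rather than part of the suggestion:

```ruby
# Hypothetical helper: builds a pattern that allows optional whitespace
# between every character of the user's input.
def spaced_pattern(input)
  body = input.delete(" ").chars.map { |c| Regexp.escape(c) }.join('\s*')
  Regexp.new("\\A#{body}\\z", Regexp::IGNORECASE)
end

# Made-up catalogue of titles.
titles = ["Out of Sight", "Outland", "Off Light", "Of Flight"]

puts titles.grep(spaced_pattern("outofsight")).inspect
puts titles.grep(spaced_pattern("offlight")).inspect
```

The second lookup returns both "Off Light" and "Of Flight", which shows the ambiguity mentioned earlier: the pattern can find candidate titles but cannot decide between them.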
You can't do this with regular expressions, unless you want to store one or more patterns to match for each movie record. That would be silly.
A better approach for catching minor misspellings would be to calculate Levenshtein distances between what the user is typing and your movie titles. However, when your list of movies is large, this will become a rather slow operation, so you're better off using a dedicated search engine like Lucene/Solr that excels at this sort of thing.
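For a small list of titles, the Levenshtein idea can be sketched in plain Ruby. The function below is a textbook dynamic-programming implementation written for this example (not a library call), and the title list and closest_title helper are made up; a large catalogue would still be better served by a search engine, as noted above:

```ruby
# Edit distance between two strings: the minimum number of single-character
# insertions, deletions, and substitutions needed to turn `a` into `b`.
def levenshtein(a, b)
  prev = (0..b.length).to_a          # distances from "" to each prefix of b
  a.each_char.with_index(1) do |ca, i|
    curr = [i]                       # distance from a[0, i] to ""
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1,          # deletion
               curr[j - 1] + 1,      # insertion
               prev[j - 1] + cost].min # substitution or match
    end
    prev = curr
  end
  prev.last
end

# Hypothetical helper: the title with the smallest distance to the query.
def closest_title(query, titles)
  titles.min_by { |title| levenshtein(query.downcase, title.downcase) }
end

titles = ["Out of Sight", "Outland", "Insight"]
puts closest_title("out of sigt", titles)
```

In practice a maximum acceptable distance would also be needed, so that unrelated input does not get mapped to whichever title happens to be least far away.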

ALL CAPS to Normal case

I'm trying to find an elegant solution on how to convert something like this
ALL CAPS TEXT. "WHY ANYONE WOULD USE IT?" THIS IS RIDICULOUS! HELP.
...to regular-case. I could more or less find all sentence-starting characters with:
(?<=^|(\. \"?)|(! ))[A-Z] #this regex sure should be more complex
but (standard) Ruby neither allows lookbehinds, nor is it possible to apply .capitalize to, say, gsub replacements. I wish I could do this:
"mytext".gsub(/my(regex)/, '\1'.capitalize)
but the current working solution would be to
"mytext".split(/\. /).each {|x| p x.capitalize } #but this solution sucks
First of all, notice that what you are trying to do will only be an approximation.
You cannot correctly tell where the sentence boundaries are. You can approximate a boundary as the beginning of the entire string, or the position right after a period, question mark, or exclamation mark followed by spaces. But then you will incorrectly capitalize "economy" in "U.S. economy".
You cannot correctly tell which words should be capitalized. For example, "John" will become "john".
You may want to do some natural language processing to get a close-to-correct result in many cases, but these methods are only probabilistically correct. You will never get a perfect result.
Understanding these limitations, you might want to do:
mytext.gsub(/.*?(?:[.?!]\s+|\z)/, &:capitalize)
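Because capitalize only upcases the very first character of each chunk, a sentence that opens with a quotation mark stays in lower case with the one-liner above. A variant that handles that case, offered as a sketch: the sentence_case name and the exact boundary rule (terminal punctuation, an optional closing quote, then whitespace) are assumptions of this example, and the limitations listed above still apply:

```ruby
# Hypothetical helper: lower-cases everything, then upcases the first letter
# after the start of the string or after ., ?, or ! (optionally followed by
# a closing quote) and whitespace. An opening quote before the letter is kept.
def sentence_case(text)
  text.downcase.gsub(/(\A|[.?!]["']?\s+)(["']?)([a-z])/) do
    "#{$1}#{$2}#{$3.upcase}"
  end
end

input = 'ALL CAPS TEXT. "WHY ANYONE WOULD USE IT?" THIS IS RIDICULOUS! HELP.'
puts sentence_case(input)
```

Proper nouns and abbreviations such as "U.S." are still handled incorrectly, exactly as the answer warns.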
