Fuzzy matching of words with phrases in Ruby

I want to match a bunch of data against a small number of services.
My data looks something like this:
{"title" : "blorb",
"category" : "zurb"
"description" : "Massage is the manipulation of superficial and deeper layers of muscle and connective tissue using various techniques, to enhance function, aid in the healing process, decrease muscle reflex activity..."
}
and I have to match it with
["Swedish Massage", "Haircut"]
Clearly the "Swedish Massage" would be the winner, but running a benchmark shows that "Haircut" is:
require 'amatch'

# description holds the long text from the data above
arr = [:levenshtein_similar, :hamming_similar, :pair_distance_similar,
       :longest_subsequence_similar, :longest_substring_similar,
       :jaro_similar, :jarowinkler_similar]

arr.each do |method|
  ["Swedish Massage", "Haircut"].each do |sh|
    pp ">>> #{sh} matched with #{method}"
    pp sh.send(method, description)
  end
end and nil
result:
">>> Swedish Massage matched with jaro_similar"
# 0.5246896118183247
">>> Haircut matched with jaro_similar"
# 0.5353606789250354
">>> Swedish Massage matched with jarowinkler_similar"
# 0.5246896118183247
">>> Haircut matched with jarowinkler_similar"
# 0.5353606789250354
The rest of the measures score well below 0.1.
What would be a better approach to solving this problem?

Search is a constant battle between precision and recall. One thing you could try is splitting your input into words: this will produce a much stronger match on Massage, at the cost of broadening the result set, since you will now also get back sentences whose only close words are near Swedish. You can then rein in that broadening by averaging the results for multiple words, using stop lists to skip common words like and, boosting tokens found close to each other, and so on, but you will never see truly perfect results. If you're really interested in fine-tuning this, I recommend Elasticsearch: relatively easy to learn and powerful.
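For instance, here is a minimal sketch of that word-splitting idea, reusing the amatch methods from the question (the best-match-per-word averaging is just one possible scheme; stop lists and proximity boosts are left out):
require 'amatch'

services    = ['Swedish Massage', 'Haircut']
description = 'Massage is the manipulation of superficial and deeper layers ' \
              'of muscle and connective tissue using various techniques...'

tokens = description.downcase.split(/\W+/)

scores = services.map do |service|
  # For each word of the service name, keep its best score against any
  # token of the description, then average over the service's words.
  word_scores = service.downcase.split.map do |word|
    tokens.map { |t| word.pair_distance_similar(t) }.max
  end
  [service, word_scores.sum / word_scores.size]
end

p scores.max_by(&:last)  # "Swedish Massage" should now win on the word level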

Related

Best Data Structure For Text Processing

Given a sentence like the one below, and this data structure (a dictionary of lists with unique keys):
{'cat': ['feline', 'kitten'], 'brave': ['courageous', 'fearless'], 'little': ['mini']}
A courageous feline was recently spotted in the neighborhood protecting her mini kitten
How would I efficiently process this text so that each synonym is converted to its key, e.g. the synonyms of cat to the word cat itself, producing output like this:
A brave cat was recently spotted in the neighborhood protecting her little cat
The algorithm I want is something that can process the initial text and convert each synonym into its ROOT word (the key inside the dictionary); the lists of keywords and synonyms will also keep growing.
Hence, first, I want to ask whether the data structure I am using can perform efficiently, and whether there is a more efficient structure.
For now, I can only think of looping through each list inside the dictionary, searching for the synonym, then mapping it back to its keyword.
edit: Refined the question
Your dictionary is organised the wrong way around. It lets you quickly find a target word, but that is not helpful when your input does not contain the target word itself, only some synonym of it.
So organise your dictionary in the opposite sense:
d = {
    'feline': 'cat',
    'kitten': 'cat'
}
To make the replacements, you could create a regular expression and call re.sub with a callback function that will look up the translation of the found word:
import re
regex = re.compile(rf"\b(?:{ '|'.join(map(re.escape, d)) })\b")
s = "A feline was recently spotted in the neighborhood protecting her little kitten"
print(regex.sub(lambda match: d[match[0]], s))
The regular expression makes sure that the match is with a complete word, and not with a substring -- "cafeline" as input will not give a match for "feline".
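Incidentally, the same reverse-lookup idea ports directly to Ruby (the language of the surrounding thread), since String#gsub accepts a hash of replacements; a small sketch using the full dictionary from the question:
# Inverted mapping: synonym => root word.
d = {
  'feline'     => 'cat',   'kitten'   => 'cat',
  'courageous' => 'brave', 'fearless' => 'brave',
  'mini'       => 'little'
}

s = 'A courageous feline was recently spotted in the neighborhood protecting her mini kitten'

# \b restricts matches to whole words; gsub looks each match up in the hash.
regex = /\b(?:#{d.keys.map { |k| Regexp.escape(k) }.join('|')})\b/
puts s.gsub(regex, d)
# => A brave cat was recently spotted in the neighborhood protecting her little cat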

Crate fulltext query syntax

I'm thinking about migrating from Sphinx to Crate, but I can't find any documentation for its fulltext query syntax. In Sphinx I can search:
("black cat" -catalog) | (awesome creature)
this stands for EITHER the exact phrase "black cat" with no term "catalog" in the document, OR both "awesome" and "creature" at any position in the document
black << big << cat
this requires the document to contain all of the terms "black", "big" and "cat", and also requires the match position of "black" to be less than that of "big", and so on.
I also need to search at a specific place in the document. In Sphinx I was able to use the proximity operator as follows:
hello NEAR/10 (mother|father -dear)
this requires document to contain "hello" term and "mother" or "father" term at most 10 terms away from "hello" and also term "dear" must not be closer than 10 terms to "hello"
The last construction with NEAR is heavily used in my application. Is it all possible in Crate?
Unfortunately I cannot comment on how it compares to Sphinx, but I will stick to your questions :)
Crate's fulltext search comes with SQL and Lucene's matching power, so it should be able to handle complex queries. I'll just provide the queries matching your examples; I think they should be quite readable.
("black cat" -catalog) | (awesome creature)
select *
from mytable
where
  (match(indexed_column, 'black cat') using phrase
   and not match(indexed_column, 'catalog'))
  or match(indexed_column, 'awesome creature') using best_fields with (operator = 'and');
black << big << cat
select *
from mytable
where
  match(indexed_column, 'black big cat') using phrase with (slop = 100000);
This one is tricky: there doesn't seem to be an operator that does exactly the same as in Sphinx, but it can be approximated with a "slop" value. Depending on the use case there might be another (better) solution as well...
hello NEAR/10 (mother|father -dear)
select *
from mytable
where
  (match(indexed_column, 'hello mother') using phrase with (slop = 10)
   or match(indexed_column, 'hello father') using phrase with (slop = 10))
  and not match(indexed_column, 'hello dear') using phrase with (slop = 10);
They might look a bit clunky compared to Sphinx's language, but they work fine :)
Performance-wise, they should still be super fast, thanks to Lucene.
Cheers, Claus

Algorithm in Ruby to trigger a method based on presence of certain texts

I don't know if it can be called an algorithm, but I think it's close.
I will be pulling data from an API that will have certain words in the title, e.g.:
Great Software 2.0 Download Now
Buy Great Software for just $10
Great Software Torrent Download
So I want to do different things based on the presence of certain words such as Download, Buy, etc. For example, if the title has the word 'buy' in it, I would like to extract the word buy and the amount value present in the title and show both in another div, so in this case it would be "Buy for $10" or "Buy $10", etc. I could use if/else as well, but I don't want to, because there could be more such conditions in the future. So what I am thinking about is using the send method, e.g.:
def buy(string)
  'Buy for just ' + string.scan(/\$\d+/).first
end

def whichkeyword(title)
  send (title.scan(/(download|buy)/i)[0][0]).downcase.to_sym, title
end

whichkeyword('Buy this software for $10 now')
Is there a better way to do this? Or is this even a good way to do it? Any help would be appreciated.
First of all, use send if and only if you are calling a private method; use public_send otherwise.
In this particular case metaprogramming is overkill. It requires too much redundant code, and it requires code changes for new items. I would go with building a hash like:
@hash = { 'buy' => { text: 'Buy for just %{placeholder}', re: /\$\d+/ } }
This hash might be placed somewhere outside of the code, e.g. stored in a YAML file next to it and loaded in advance. That way you can change the behaviour without modifying the code, which is handy, for instance, in a gem.
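A minimal sketch of that YAML round-trip, assuming a hypothetical keywords.yml; the regexps are stored as plain strings and compiled after loading, since YAML.safe_load will not deserialize Regexp objects by default:
require 'yaml'

# Hypothetical keywords.yml sitting next to the code:
#
#   buy:
#     text: 'Buy for just %{placeholder}'
#     re: '\$\d+'
#
raw = YAML.safe_load(File.read('keywords.yml'))
@hash = raw.transform_values do |v|
  { text: v['text'], re: Regexp.new(v['re']) }
end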
As we have a hash defined/loaded, I would call the method:
def format string
  key = string[/#{Regexp.union(@hash.keys).source}/i].downcase
  puts @hash[key][:text] % { placeholder: string[@hash[key][:re]] }
end
Yielding:
▶ format("Buy this software for $10 now")
#⇒ Buy for just $10
There are many advantages over declaring methods: e.g. matches may now contain spaces, and you can easily add/remove matchers.
First of all, your algorithm can work, but it has some trouble spots, such as what happens when no keyword matches.
I have two solutions for you:
NLP
If you want it to be much more dynamic, you can use NLP (Natural Language Processing). NLP will find the main words in your sentence, and then you can pick the right action for each.
A good gem for that is Treat, which you can use with stanford-core-nlp. After processing the data you can find the verbs, and even synonyms, in the sentence and figure out what to do.
sentence('Buy this software for $10 now').verbs # ['buy']
Simple Hash
This solution is less dynamic, but much simpler. As you did with scan, use a constant to manage your keywords and the output for each (I would do it with lambdas). You can also give the hash a default.
KEYWORDS = Hash.new('Default Title').merge(
  # String keys, so the downcased match below actually hits them;
  # .to_s turns a nil (no match) into "", which falls back to the default.
  'buy'      => -> { },
  'download' => -> { }
)
KEYWORDS[sentence[/#{KEYWORDS.keys.join('|')}/i].to_s.downcase]
I think this solution is good enough.
The only thing that looks strange is scan(/(download|buy)/i)[0][0].
Personally, I don't much like that [] indexing style in Ruby, and I don't think scan is necessary here.
What about
def whichkeyword(title)
  title =~ /(download|buy)/i
  send $1.downcase.to_sym, title unless $1.nil?
end
UPDATE
def whichkeyword(title)
  action = title[/(download|buy)/i]
  public_send action.downcase.to_sym, title if action
end

Regex causing high CPU load, causing Rails to not respond

I have a Ruby 1.8.7 script to parse iOS localization files:
singleline_comment = /\/\/(.*)$/
multiline_comment = /\/\*(.*?)\*\//m
string_line = /\s*"(.*?)"\s*=\s*"(.*?)"\s*\;\s*/xm
out = decoded_src.scan(/(?:#{singleline_comment}|#{multiline_comment})?\s*?#{string_line}/)
It used to work fine, but today we tested it with a file that is 800Kb, and that doesn't have ; at the end of each line. The result was a high CPU load and no response from the Rails server. My assumption is that it took the whole file as a single string in the capturing group and that blocked the server.
The solution was to add ? (the regex quantifier for 0 or 1 occurrences) after the ; literal character:
/\s*"(.*?)"\s*=\s*"(.*?)"\s*\;?\s*/xm
Now it works fine again, even with those files in the old iOS format, but my fear now is: what if a user submits a malformed file, like one with no ending "? Will my server get blocked again?
And how do I prevent this? Is there any way to run this for only five seconds? What can I do to avoid halting my whole Rails application?
It looks like you're trying to parse an entire configuration as if it was a string. While that is doable, it's error-prone. Regular expression engines have to do a lot of looking forward and backward, and poorly written patterns can end up wasting a huge amount of CPU time. Sometimes a minor tweak will fix the problem, but the more text being processed, and the more complex the expression, the higher the chance of something happening that will mess you up.
From benchmarking different ways of getting at data for my own work, I've learned that anchoring regexp patterns can make a huge difference in speed. If you can't anchor a pattern somehow, then you are going to suffer from the backtracking and greediness of patterns unless you can limit what the engine wants to do by default.
I have to parse a lot of device configurations, but instead of trying to treat them as a single string, I break them down into logical blocks consisting of arrays of lines, and then I can provide logic to extract data from those blocks based on knowledge that blocks contain certain types of information. Small blocks are faster to search, and it's a lot easier to write patterns that can be anchored, providing huge speedups.
Also, don't hesitate to use Ruby's String methods, like split to tear apart lines, and sub-string matching to find lines containing what you want. They're very fast and less likely to induce slowdowns.
If I had a string like:
config = "name:\n foo\ntype:\n thingie\nlast update:\n tomorrow\n"
chunks = config.split("\n").slice_before(/^\w/).to_a
# => [["name:", " foo"], ["type:", " thingie"], ["last update:", " tomorrow"]]
command_blocks = chunks.map{ |k, v| [k[0..-2], v.strip] }.to_h
command_blocks['name'] # => "foo"
command_blocks['last update'] # => "tomorrow"
slice_before is a very useful method for this sort of task as it lets us define a pattern that is then used to test for breaks in the master array, and group by those. The Enumerable module has lots of useful methods in it, so be sure to look through it.
The same data could be parsed in other ways, of course. Without sample data for what you're trying to do it's difficult to suggest something that works better, but the idea is: break your input into small, manageable chunks and go from there.
One comment on how you're defining your patterns: instead of using /\/.../ (which produces "leaning toothpick syndrome"), use %r, which allows you to define a different delimiter:
singleline_comment = /\/\/(.*)$/ # => /\/\/(.*)$/
singleline_comment = %r#//(.*)$# # => /\/\/(.*)$/
multiline_comment = /\/\*(.*?)\*\//m # => /\/\*(.*?)\*\//m
multiline_comment = %r#/\*(.*?)\*/#m # => /\/\*(.*?)\*\//m
The first line in each sample above is how you're doing it, and the second is how I'd do it. They result in identical regexp objects, but the second ones are easier to understand.
You can even have Regexp help you by escaping things for you:
NONGREEDY_CAPTURE_NONE_TO_ALL_CHARS = '(.*?)'
GREEDY_CAPTURE_NONE_TO_ALL_CHARS = '(.*)'
EOL = '$'
Regexp.new(Regexp.escape('//') + GREEDY_CAPTURE_NONE_TO_ALL_CHARS + EOL) # => /\/\/(.*)$/
Regexp.new(Regexp.escape('/*') + NONGREEDY_CAPTURE_NONE_TO_ALL_CHARS + Regexp.escape('*/'), Regexp::MULTILINE) # => /\/\*(.*?)\*\//m
Doing this you can iteratively build up extremely complex expressions while keeping them relatively easy to maintain.
As far as halting your Rails app goes, don't try to process the files in the same Ruby process. Run a separate job that watches for the files, processes them, and stores whatever you're looking for, to be accessed as needed later. That way your server will continue to respond rather than lock up. I wouldn't do it in a thread; instead I'd write a separate Ruby script that looks for incoming data and, if nothing is found, sleeps for some interval of time and then looks again. Ruby's sleep method will help with that, or you could use your OS's cron capability.
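As a direct answer to the "only run this for five seconds" part: Ruby 3.2 and later ship a built-in regexp timeout that raises instead of spinning (not available on the 1.8.7 interpreter in the question); a sketch reusing the patterns from the question:
# Ruby 3.2+: cap how long any single regexp match may run.
Regexp.timeout = 5.0

begin
  out = decoded_src.scan(/(?:#{singleline_comment}|#{multiline_comment})?\s*?#{string_line}/)
rescue Regexp::TimeoutError
  out = []  # treat a runaway match as a malformed file instead of hanging
end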

How to scan a string for an exact keyword match?

I'm scanning names and descriptions of different items in order to see if there are any keyword matches.
The code below returns things like 'googler' or 'applecobbler', when what I'm trying to get is exact matches only:
[name, description].join(" ").downcase.scan(/apple|microsoft|google/)
How should I do this?
My regex skills are pretty weak, but I think you need to use a word boundary:
[name, description].join(" ").downcase.scan(/\b(apple|microsoft|google)\b/)
Rubular example
It depends on what information you want, but if you just want exact matches, you do not need a regex for the comparison. Just compare the relevant strings:
splitted_strings = [name, description].join(" ").downcase.split(/\b/)
splitted_strings & %w[apple microsoft google]
# => the matching words, in order of appearance
Add proper word-boundary anchors (\b) to your regexp. You can also use the #grep method instead of joining:
array.grep(your_regexp)
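For example (a small sketch with made-up strings; note that grep returns the matching elements, not the matched words):
strings = ['From: Apple', 'nothing here', 'we love google']
strings.grep(/\b(?:apple|microsoft|google)\b/i)
# => ["From: Apple", "we love google"]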
Looking at the question, and at the situations where I'd want to do this sort of thing: for an actual program, where I had lists of sources and their associated texts and wanted to know the hits, I'd probably write something like this:
require 'pp'

names = ['From: Apple', 'From: Microsoft', 'From: Google.com']
descriptions = [
  '"an apple a day..."',
  'Microsoft Excel flight simulator... according to Microsoft',
  'Searches of Google revealed multiple hits for "google"'
]
targets = %w[apple microsoft google]
regex = /\b(?:#{ Regexp.union(targets).source })\b/i

names.zip(descriptions) do |n, d|
  name_hits, description_hits = [n, d].map { |s| s.scan(regex) }
  pp [name_hits, description_hits]
end
Which outputs:
[["Apple"], ["apple"]]
[["Microsoft"], ["Microsoft", "Microsoft"]]
[["Google"], ["Google", "google"]]
This would let me know the letter-case of the words, so I could try to differentiate the apple fruit from Apple the company, and get word counts, helping to show relevance of the text.
The regex looks like:
/\b(?:apple|microsoft|google)\b/i
It's case-insensitive, but scan returns the words in their original case.
names, descriptions and targets could all come from a database or separate files, which helps separate the data from the code and avoids having to modify the code as the targets change.
I'd use a list of target words and use Regexp.union to quickly build the pattern.
