How to scan a string for an exact keyword match? - ruby

I'm scanning names and descriptions of different items in order to see if there are any keyword matches.
In the code below it will return things like 'googler' or 'applecobbler', when what I'm trying to do is get exact matches only:
[name, description].join(" ").downcase.scan(/apple|microsoft|google/)
How should I do this?

My regex skills are pretty weak, but I think you need to use a word boundary:
[name, description].join(" ").downcase.scan(/\b(apple|microsoft|google)\b/)
Rubular example

Depends on what information you want, but if you just want exact match, you do not need regex for the comparing part. Just compare the relevant strings.
splitted_strings = [name, description].join(" ").downcase.split(/\b/)
splitted_strings & %w[apple microsoft google]
# => the words that match given in the order of appearance

Add word boundaries (\b) to your regexp. You can also use the #grep method instead of joining:
array.grep(your_regexp)
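As a rough illustration of the grep suggestion, here is a minimal sketch. The `items` array and the `KEYWORDS_RE` constant are made-up names for this example, not something from the question.

```ruby
# Hypothetical sample data standing in for the names and descriptions.
items = ['Apple pie recipe', 'applecobbler', 'A googler wrote this', 'Google search tips']

# \b on both sides restricts the alternation to whole words; /i ignores case.
KEYWORDS_RE = /\b(?:apple|microsoft|google)\b/i

# grep keeps only the elements in which the pattern matches.
matches = items.grep(KEYWORDS_RE)
p matches
```

Note that grep returns the matching elements themselves, whereas scan on a joined string returns the matched words.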

Looking at the question and the situations where I'd want to do this, here's what I'd do for an actual program. If I had lists of sources and their associated texts and wanted to know the hits, I'd probably write something like this:
require 'pp'
names = ['From: Apple', 'From: Microsoft', 'From: Google.com']
descriptions = [
'"an apple a day..."',
'Microsoft Excel flight simulator... according to Microsoft',
'Searches of Google revealed multiple hits for "google"'
]
targets = %w[apple microsoft google]
regex = /\b(?:#{ Regexp.union(targets).source })\b/i
names.zip(descriptions) do |n,d|
name_hits, description_hits = [n, d].map{ |s| s.scan(regex) }
pp [name_hits, description_hits]
end
Which outputs:
[["Apple"], ["apple"]]
[["Microsoft"], ["Microsoft", "Microsoft"]]
[["Google"], ["Google", "google"]]
This would let me know the letter-case of the words, so I could try to differentiate the apple fruit from Apple the company, and get word counts, helping to show relevance of the text.
The regex looks like:
/\b(?:apple|microsoft|google)\b/i
It's case insensitive, but scan will return words in their original case.
names, descriptions and targets could all come from a database or separate files, helping to separate the data from the code and the need to modify the code as the targets change.
I'd use a list of target words and use Regexp.union to quickly build the pattern.

Related

IFTTT JavaScript filter - How to make case insensitive searches + How to search Include and Exclude sets of terms

First off I'm a total novice for Javascript, so please go gently. I'm aware of how people feel about having to now pay for IFTTT, but it's perfect for what I need.
I am using a more expansive version of this code below to capture certain keywords from Tweets to then generate emails if the search returns a positive result. This search works very nicely, except it is case sensitive which is a problem.
Yes, I know you can manipulate the twitter search to pick up specific words or phrases. I am very proficient in achieving searches this way. I am casting a wide net to pick up approx 120 search words or phrases which is too long to achieve through "OR" Twitter search parameters alone which is why I'm using this.
Q1 - I have tried adding item.toLowerCase() and just .toLowerCase() in various parts of the code so it wouldn't matter if the sentence case of the search term is different to that of the original tweet text case. I just can't get it to work though. I've seen various posts on here but I can't get any of them to work in IFTTT. I believe IFTTT doesn't accept REGEX either, which is annoying.
Any advice of how to get this code running so it's case-insensitive for text within IFTTT?
Q2 - I have approx 120 search terms for the tweet text to return positive results. There is a lot of junk that comes through with that. Does anyone know how to add a second layer of 'and exclude' search terms?
I have something like 300-400 words and specific phrases which would be used to stop the email from being triggered - so it'd be something like "IF tweet text contains a, b, c BUT text ALSO contains x, y, z... do not send the email"
let str=Twitter.newTweetFromSearch.Text;
let searchTerms=[
"Northbound",
"Westbound",
"Southbound",
"Eastbound"
]
let foundOne=0;
if(searchTerms.some(function(v){return str.indexOf(v)>=0;})){
foundOne=1;
}
if(foundOne==0){
Email.sendMeEmail.skip();
}
I have looked at the Twitter API, but that is a step too far for my coding ability which is why I'm using IFTTT.
Any help is very much appreciated
Thank you.
I'm playing with IFTTT Filter myself at the moment, so here are some thoughts about solving your problem.
If you want to do a case-insensitive search on the original text, convert the original text to lowercase, then have all your search terms in lowercase.
Plus I think you want to iterate over the searchTerms array, and use the includes() method. OK, I just realised that .some() does the iteration for you, but I prefer includes() over indexOf().
let str=Twitter.newTweetFromSearch.Text.toLowerCase();
let searchTerms=[
"northbound",
"westbound",
"southbound",
"eastbound"
]
let foundOne=0;
if(searchTerms.some(function(term){return str.includes(term);})){
foundOne=1;
}
if(foundOne==0){
Email.sendMeEmail.skip();
}
Or you could just skip having the foundOne variable, and do the search in the if() statement.
let str=Twitter.newTweetFromSearch.Text.toLowerCase();
let searchTerms=[
"northbound",
"westbound",
"southbound",
"eastbound"
]
if(!searchTerms.some(function(term){return str.includes(term);})){
Email.sendMeEmail.skip();
}

How to check for multiple words inside a folder

I have a list of words in a text file called words.txt, and I need to check if any of those words are in my Source folder, which also contains sub-folders and files.
I was able to get all of the words into an array using this code:
array_of_words = []
File.readlines('words.txt').map do |word|
array_of_words << word
end
And I also have (kinda) figured out how to search through the whole Source folder including the sub-folders and sub-files for a specific word using:
Dir['Source/**/*'].select{|f| File.file?(f) }.each do |filepath|
puts filepath
puts File.readlines(filepath).any?{ |l| l['api'] }
end
Instead of searching for one word like api, I want to search the Source folder for the whole array of words (if that is possible).
Consider this:
File.readlines('words.txt').map do |word|
array_of_words << word
end
will read the entire file into memory, then convert it into individual elements in an array. You could accomplish the same thing using:
array_of_words = File.readlines('words.txt')
A potential problem is that it's not scalable. If "words.txt" is larger than the available memory your code will have problems, so be careful.
Searching a file for an array of words can be done a number of ways, but I've always found it easiest to use a regular expression. Perl has a great module called Regexp::Assemble that makes it easy to convert a list of words into a very efficient pattern, but Ruby is missing that sort of functionality. See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for one solution I put together in the past to help with that.
Ruby does have Regexp.union however it's only a partial help.
words = %w(foo bar)
re = Regexp.union(words) # => /foo|bar/
The pattern generated has flags for the expression so you have to be careful with interpolating it into another pattern:
/#{re}/ # => /(?-mix:foo|bar)/
(?-mix: will cause you problems so don't do that. Instead use:
/#{re.source}/ # => /foo|bar/
which will generate the pattern and behave like we expect.
Unfortunately, that's not a complete solution either, because the words could be found as sub-strings in other words:
'foolish'[/#{re.source}/] # => "foo"
The way to work around that is to set word-boundaries around the pattern:
/\b(?:#{re.source})\b/ # => /\b(?:foo|bar)\b/
which then looks for whole words:
'foolish'[/\b(?:#{re.source})\b/] # => nil
More information is available in Ruby's Regexp documentation.
Once you have a pattern you want to use then it becomes a simpler matter to search. Ruby has the Find module, which makes it easy to recursively search directories for files. The documentation covers how to use it.
Alternately, you can cobble your own method using the Dir class. Again, it has examples in the documentation to use it, but I usually go with Find.
When reading the files you're scanning I'd recommend using foreach to read the files line-by-line. File.read and File.readlines are not scalable and can make your program behave erratically as Ruby tries to read a big file into memory. Instead, foreach will result in very scalable code that runs more quickly. See "Why is "slurping" a file not a good practice?" for more information.
Using the links above you should be able to put something together quickly that'll run efficiently and be flexible.
This untested code should get you started:
WORD_ARRAY = File.readlines('words.txt').map(&:chomp)
WORD_RE = /\b(?:#{Regexp.union(WORD_ARRAY).source})\b/
Dir['Source/**/*'].select{|f| File.file?(f) }.each do |filepath|
puts "#{filepath}: #{!!File.read(filepath)[WORD_RE]}"
end
It will output the file it's reading, and "true" or "false" whether there is a hit finding one of the words in the list.
It's not scalable because of readlines and read and could suffer serious slowdown if any of the files are huge. Again, see the caveats in the "slurp" link above.
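For comparison, here is a sketch of the line-by-line variant recommended above, using Find and File.foreach. The helper name `files_with_hits` and the throwaway demo tree are invented for illustration, and the sketch assumes the files are readable text in the default encoding.

```ruby
require 'find'
require 'tmpdir'

# Returns the paths under `root` that contain at least one whole-word hit.
# File.foreach yields one line at a time, so a large file is never read
# into memory in one piece, and any? stops at the first matching line.
def files_with_hits(root, words)
  pattern = /\b(?:#{Regexp.union(words).source})\b/
  Find.find(root).select do |path|
    File.file?(path) && File.foreach(path).any? { |line| line.match?(pattern) }
  end
end

# Throwaway demo tree; the file names and contents are made up.
demo_hits = Dir.mktmpdir do |dir|
  Dir.mkdir(File.join(dir, 'sub'))
  File.write(File.join(dir, 'a.txt'), "nothing here\nfoolish text\n")
  File.write(File.join(dir, 'sub', 'b.txt'), "first line\nbar is here\n")
  files_with_hits(dir, %w[foo bar]).map { |path| path.delete_prefix("#{dir}/") }
end

puts demo_hits
```

Here "foolish" in a.txt is not a hit because of the word boundaries, so only sub/b.txt is listed.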
Recursively searches directory for any of the words contained in words.txt
re = /#{File.readlines('words.txt').map { |word| Regexp.quote(word.strip) }.join('|')}/
Dir['Source/**/*.{cpp,txt,html}'].select{|f| File.file?(f) }.each do |filepath|
puts filepath
puts File.readlines(filepath, encoding: 'ascii').grep(re).any?
end

Algorithm in Ruby to trigger a method based on presence of certain texts

I don't know if it can be called an algorithm, but I think it's close.
I will be pulling data from an API that will have certain words in the title, e.g.:
Great Software 2.0 Download Now
Buy Great Software for just $10
Great Software Torrent Download
So, I want to do different things based on the presence of certain words such as Download, Buy, etc. For example, if the title has the word 'buy' in it, I would like to extract the word buy and the amount that is present in the title and show them in another div, so in this case it would be "Buy for $10" or "Buy $10". I could use if/else, but I don't want to because there could be more such conditions in the future. So what I am thinking about is using the send method, e.g.:
def buy(string)
'Buy for just' + string.scan(/\$\d+/).first
end
def whichkeyword(title)
send (title.scan(/(download|buy)/i)[0][0]).downcase.to_sym, title
end
whichkeyword('Buy this software for $10 now')
Is there a better way to do this? Or is this even a good way to do it? Any help would be appreciated.
First of all, use send if and only if you need to call a private method; use public_send otherwise.
In this particular case metaprogramming is an overkill. It requires too much redundant code, plus it requires the code to be changed for new items. I would go with building a hash like:
@hash = { 'buy' => { text: 'Buy for just %{placeholder}', re: /\$\d+/ } }
This hash might be placed somewhere outside of the code, e.g. it might be stored in a YAML file near the code and loaded in advance. That way you can change the behaviour without modifying the code, which is handy, for instance, in a gem.
As we have a hash defined/loaded, I would call the method:
def format string
key = string[/#{Regexp.union(@hash.keys).source}/i].downcase
puts @hash[key][:text] % { placeholder: string[@hash[key][:re]] }
end
Yielding:
▶ format("Buy this software for $10 now")
#⇒ Buy for just $10
There are many advantages over declaring methods, e. g. now matches might contain spaces, you might easily add/remove matchers etc.
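To make the "stored in a YAML file" idea a bit more concrete, here is a minimal sketch. The `RULES_YAML` content, the `describe` helper and the key names are assumptions for illustration; in practice the text would be read from a file next to the code. The regexp is stored as a string and compiled with Regexp.new, since YAML.safe_load does not load Regexp objects by default.

```ruby
require 'yaml'

# Hypothetical rules; in a real program this could be the content of a .yml file.
RULES_YAML = <<~'YAML'
  buy:
    text: "Buy for just %{placeholder}"
    re: '\$\d+'
YAML

rules = YAML.safe_load(RULES_YAML)

# Finds the first known keyword in the title and fills in that rule's template.
# Returns nil when no keyword is present.
def describe(title, rules)
  key = title[/\b(?:#{Regexp.union(rules.keys).source})\b/i]
  return nil unless key

  rule = rules[key.downcase]
  format(rule['text'], placeholder: title[Regexp.new(rule['re'])])
end

puts describe('Buy this software for $10 now', rules)
```

Adding a new keyword then only means adding an entry to the YAML text, with no change to the method.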
First of all, your algorithm can work, but it has some problems, such as what happens if no keyword is found.
I have two solutions for you:
NLP
If you want something much more dynamic, you can use NLP (natural language processing). NLP will find the main words in your sentence, and then you can choose the right handling for each.
A good gem for that is Treat, which you can use with stanford-core-nlp. After processing the data you can find the verbs and even synonyms in the sentence and figure out what to do.
sentence('Buy this software for $10 now').verbs # ['buy']
Simple Hash
This solution is less dynamic, but much simpler. Like you did with the scan, just use a constant to manage your keywords and the output for each of them (I would do it with lambdas). You can also add a default to the hash.
KEYWORDS = Hash.new('Default Title').merge(
buy: -> { },
download: -> { }
)
KEYWORDS[sentence[/(#{KEYWORDS.keys.join('|')})/i].to_s.downcase.to_sym]
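Since the lambdas above are left empty, here is one way the hash could be filled in. This is only a sketch: the `HANDLERS` and `DEFAULT_HANDLER` names, the handler bodies and the `handle` helper are all assumptions for illustration.

```ruby
# Fallback used when a title contains none of the keywords.
DEFAULT_HANDLER = ->(title) { title }

# Hypothetical handlers, one per keyword; each takes the full title.
HANDLERS = Hash.new(DEFAULT_HANDLER).merge(
  buy:      ->(title) { "Buy for #{title[/\$\d+/]}" },
  download: ->(_title) { 'Download' }
)

# Looks up the first keyword found in the title and calls its handler.
# A missing keyword becomes the symbol :"" and falls through to the default.
def handle(title)
  keyword = title[/\b(?:#{HANDLERS.keys.join('|')})\b/i]
  HANDLERS[keyword.to_s.downcase.to_sym].call(title)
end

puts handle('Buy Great Software for just $10')
puts handle('Great Software Torrent Download')
puts handle('Great Software 2.0')
```

The keys are symbols, so the matched text is converted with to_sym before the lookup.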
I think this solution is good enough.
The only thing that looks strange is scan(/(download|buy)/i)[0][0].
Personally, I don't much like using the [] syntax in Ruby.
I think using scan here is not necessary.
What about
def whichkeyword(title)
title =~ /(download|buy)/i
send $1.downcase.to_sym, title unless $1.nil?
end
UPDATE
def whichkeyword(title)
action = title[/(download|buy)/i]
public_send action.downcase.to_sym, title if action
end

Ruby, regex, sentences

I'm currently building a code generator, which aims to generate boilerplate for me once I write the templates and/or translations, in whatever language I have to work with.
I have a problem with a regex in Ruby. The regex aims to select whatever is between {{{ and }}}, so I can generate functions according to my needs.
My regex is currently :
/\{\{\{(([a-zA-Z]|\s)+)\}\}\}/m
My test data set is:
{{{Demande aaa}}} => {{{tagadatsouintsouin tutu}}}
The results are:
[["Demande aaa", "a"], ["tagadatsouintsouin tutu", "u"]]
Each time the regex picks the last character twice. That's not exactly what I want, I need something more like this:
/\{\{\{((\w|\W)+)\}\}\}/m
But this has a flaw too, the results are:
[["Demande aaa}}} => {{{tagadatsouintsouin tutu", "u"]]
Whereas, I wish to get:
[["Demande aaa"],["tagadatsouintsouin tutu"]]
How do I correct these regexes? I could use two sets of delimiters, but it won't teach me anything.
Edit :
All of your regexes work against my data sample, so you all get a point.
Regexes are probably overkill for my purpose, so I have two questions.
First, do the regexes preserve the exact indentation? This should be able to handle whole functions.
Second, is there something better suited to that task?
Detailed explanation of the purpose of this tool: I'm bored of writing boilerplate code in PHP/Symfony, so I wish to generate it from templates.
My intent is to build some views, some controllers, and even parts of the model this way.
Practical example: in my model, I wish to generate some functions according to the type of an object's attribute. For example, I have functions that display money correctly. So I need to build the correct function according to my attribute, and then put it inside my output file.
So there are some translations which themselves need translations.
So a fictitious example:
{{{euro}}} => {{{ function getMyAttributeEuro()
{
return formating($this->get[[MyAttribute]]);
} }}}
In order to stock my translations, should i use regex, like
I wish to build something a bit clever, so it can build most of the basic code with no bugs, and I can work on the interesting code.
You have one set of capturing parentheses too many.
/\{\{\{([a-zA-Z\s]+)\}\}\}/
Also, you don't need the /m modifier because there is no dot (.) in your regex whose behaviour would be affected by it.
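To see the effect of the extra parentheses, here is a small comparison run against the question's sample data. The variable names are made up for this sketch.

```ruby
data = '{{{Demande aaa}}} => {{{tagadatsouintsouin tutu}}}'

# The inner (...) is a second capture group; because it repeats,
# it ends up holding only its last repetition (the final character).
nested = data.scan(/\{\{\{(([a-zA-Z]|\s)+)\}\}\}/)

# Making the inner group non-capturing leaves one capture per match.
non_capturing = data.scan(/\{\{\{((?:[a-zA-Z]|\s)+)\}\}\}/)

# A character class removes the need for an inner group altogether.
char_class = data.scan(/\{\{\{([a-zA-Z\s]+)\}\}\}/)

p nested
p non_capturing
p char_class
```

The last two forms both return one-element sub-arrays, which is the shape the question asks for.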
I'm partial to:
data = '{{{Demande aaa}}} => {{{tagadatsouintsouin tutu}}}'
data.scan(/\{{3}(.+?)}{3}/).flatten.map{ |r| r.squeeze(' ') }
=> ["Demande aaa", "tagadatsouintsouin tutu"]
or:
data.scan(/\{{3}(.+?)}{3}/).flatten.map{ |r| [ r.squeeze(' ') ] }
=> [["Demande aaa"], ["tagadatsouintsouin tutu"]]
or:
data.scan(/\{{3}(.+?)}{3}/).map{ |r| [ r[0].squeeze(' ') ] }
=> [["Demande aaa"], ["tagadatsouintsouin tutu"]]
if you need the sub-arrays.
I'm not big on trying to do everything possible inside the regex. I prefer to keep it short and sweet, then polish the output once I've found what I was looking for. It's a maintenance issue, because regexes make my head hurt, and I stopped thinking of them as a macho thing years ago. Regexes are a very useful tool, but too often they are seen as the answer to every problem, which they're not.
Some people, when confronted with a problem, think “I know,
I'll use regular expressions.” Now they have two problems.
-- Jamie Zawinski
You want non capturing groups (?:...), but here is another way.
/\{\{\{(.*?)\}\}\}/m
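On the question of whether indentation survives: with the /m modifier the dot also matches newlines, so the lazy capture returns the body exactly as written, line breaks and leading spaces included. A minimal sketch, using a template modelled on the question's fictitious example (the indentation shown is an assumption):

```ruby
# Non-interpolating heredoc so that $this and the braces are taken literally.
template = <<~'TPL'
  {{{euro}}} => {{{ function getMyAttributeEuro()
      {
          return formating($this->get[[MyAttribute]]);
      } }}}
TPL

# .*? is lazy, so each capture stops at the first closing }}}.
blocks = template.scan(/\{\{\{(.*?)\}\}\}/m).flatten

puts blocks[0]
puts blocks[1]
```

The second capture spans four lines and keeps the original leading whitespace of each one.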
Just a shot
/\{\{\{([\w\W]+?)\}\}\}/
Added non-greedyness to your regex
Here this seems to work

Find email addresses in large data stream

STILL NOT RESOLVED :( [Feb 11th]
I have a large text file full of random data and want to pull out all the email addresses from it.
I would like to do this in Ruby, with pseudo code like this:
monster_data_string = "asfsfsdfsdfsf sfda **joe@example.com** sdfdsf"
monster_data_string.match(EMAIL_REGEX)
Does anyone know what Ruby email regular expression I would use to accomplish this?
Please keep in mind that I'm looking for a Ruby answer to this. I have already tried numerous regexes found by googling, but most of them cause Ruby runtime errors stating that characters like "+" and "*" are invalid/unrecognized.
What I have already tried is:
monster_data_string.match(/^([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})$/i)
but I receive Ruby errors stating that "+" is an invalid character
Thanks in advance
Watch this...
require 'yaml'
content = File.read("content.txt")
r = /\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b/
emails = content.scan(r).uniq
puts YAML.dump(emails)
If you're getting an error message about + or * being invalid in regexes, you're doing something very wrong. This is a valid regex in Ruby, although it's not the one you want:
/^([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})$/i
For one thing, you don't want to anchor the regex to the start and end of lines (^ and $) if you're trying to pluck the addresses from "random" text. But once you've gotten rid of the anchors, your regex will match **joe@example.com in your test string, which I presume you don't want. This regex from Regular-Expressions.info does a better job, but read that page for tips on tweaking it to meet your particular needs.
/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i
Finally (and you may already know this), you won't want to use the match() method because that will only find the first match. Try scan() instead.
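The difference between match and scan can be seen in a short sketch. The sample text is invented, and the pattern is of the Regular-Expressions.info kind mentioned in this answer, written with a literal @.

```ruby
EMAIL_RE = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

# Made-up sample text containing three addresses, one of them repeated.
text = 'asfsf **joe@example.com** sdfdsf bob@example.org, joe@example.com'

# match stops at the first hit.
first_only = text.match(EMAIL_RE)[0]

# scan collects every hit; uniq drops the repeated address.
all_unique = text.scan(EMAIL_RE).uniq

puts first_only
p all_unique
```

Because the pattern has no capture groups, scan returns a flat array of the matched strings.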
Given that it is not possible to parse every valid email address using a regexp you are left with two choices:
Make a regexp that matches as many valid email addresses as possible and live with the fact that some valid but rarely used forms of email address might get overlooked.
or
Make a regexp that matches anything that "might be" an email address and then live with the false positives.
I use the second approach to weed out obviously wrong email addresses when validating user sign up email addresses on a web page
Gleaned from Ruby Cookbook which has a very good section on email address validation:
valid = '[^ @]+'
/^#{valid}@#{valid}\.#{valid}/
Apparently there is a 6343 character Perl regexp written by Paul Warren that does a very good job and also works in Ruby, but even that is not foolproof (I think it might also have some performance implications).
What kind of runtime error messages are you getting? Is it regarding the regexps as invalid, or is it breaking due to the target string being too large?
To try and help you get there (though not very elegantly, I admit):
I think the start and end anchors (^ and $) aren't helping. You may also want to filter the asterisks?:
irb(main):001:0> mds = "asfsfsdfsdfsf sfda **joe@example.com** sdfdsf"
=> "asfsfsdfsdfsf sfda **joe@example.com** sdfdsf"
irb(main):003:0> mds.match(/^([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})$/i)
=> nil
irb(main):004:0> mds.match(/([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})/i)
=> #<MatchData "**joe@example.com" 1:"**joe" 2:"example.com">
irb(main):005:0> mds.match(/([^@\s*]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})/i)
=> #<MatchData "joe@example.com" 1:"joe" 2:"example.com">
Even better,
require 'yaml'
content = "asfsfsdfsdfsf sfda **joe@example.com.au** sdfdsf cool_me@example.com.fr"
r = Regexp.new(/\b([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+?)(\.[a-zA-Z.]*)\b/)
emails = content.scan(r).uniq
puts YAML.dump(emails)
will give you
---
- - joe
  - example
  - .com.au
- - cool_me
  - example
  - .com.fr

Resources