Create regular expression from Array of search terms (Ruby)

Is there a way (or a gem) to create regular expressions from some basic search parameters?
e.g.
Search = ["\"German Shepherd\"","Collie","poodle", "Miniature Schnauzer"]
Such that the regexp will search (case insensitively) for:
"German Shepherd" - exactly
OR
"Collie"
OR
"poodle"
OR
"Miniature" AND "Schnauzer"
So in this case something like:
/German\ Shepherd|Collie|poodle|(?=.*Miniature)(?=.*Schnauzer).+/i
(Open to suggestions of better ways of doing the last bit...)

If I understood the question properly, here you go:
regexps = ["\"German Shepherd\"","Collie","poodle", "Miniature Schnauzer"]
# those in quotes
greedy = regexps.select { |re| re =~ /\A['"].*['"]\z/ } # c'"mon, parser
# the rest unquoted
non_greedy = (regexps - greedy).map(&:split).flatten
# concatenating... ⇓⇓⇓ get rid of quotes
all = Regexp.union(non_greedy + greedy.map { |re| re[1...-1] })
#⇒ /Collie|poodle|Miniature|Schnauzer|German\ Shepherd/
UPD
I finally understood what should happen with Miniature Schnauzer: both words must appear, in either order. The words are therefore permuted and joined with a non-greedy .*?:
non_greedy = (regexps - greedy).map(&:split).map do |re|
  # single word? YES : NO, permute and join
  re.length < 2 ? re : re.permutation.map { |p| Regexp.new p.join('.*?') }
end.flatten
all = Regexp.union(non_greedy + greedy.map { |re| re[1...-1] })
#=> /Collie|poodle|(?-mix:Miniature.*?Schnauzer)|(?-mix:Schnauzer.*?Miniature)|German\ Shepherd/
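The union above is case-sensitive, while the question asked for a case-insensitive search. A minimal sketch of one way to get that, assuming it is acceptable to build the pattern from escaped strings rather than with Regexp.union; the helper name search_regexp is hypothetical and not part of any gem:

```ruby
# Hypothetical helper: builds one case-insensitive Regexp from search terms.
def search_regexp(terms)
  alternatives = terms.flat_map do |term|
    if (quoted = term.match(/\A(['"])(.*)\1\z/))
      # Quoted term: match the phrase literally.
      [Regexp.escape(quoted[2])]
    else
      # Unquoted term: every word must appear, in any order.
      words = term.split.map { |word| Regexp.escape(word) }
      words.permutation.map { |perm| perm.join('.*?') }
    end
  end
  Regexp.new(alternatives.join('|'), Regexp::IGNORECASE)
end

dog_regexp = search_regexp(["\"German Shepherd\"", "Collie", "poodle", "Miniature Schnauzer"])
dog_regexp.match?("a german shepherd puppy") #=> true
dog_regexp.match?("Miniature pinscher")      #=> false
```

The number of permutations grows factorially with the number of words in a term, so this is only reasonable for short terms.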

Insert multiple characters in string at once

Whereas str[]= replaces a character, str.insert inserts a character at a given position. But inserting two characters takes two lines of code:
str = "COSO17123456"
str.insert 4, "-"
str.insert 7, "-"
=> "COSO-17-123456"
I was thinking how to do this in one line of code. I came up with the following solution:
str = "COSO17123456"
str.each_char.with_index.reduce("") { |acc,(c,i)| acc += c + ( (i == 3 || i == 5) ? "-" : "" ) }
=> "COSO-17-123456"
Is there a built-in Ruby helper for this task? If not, should I stick with the insert option rather than combining several iterators?
Use each to iterate over an array of indices:
str = "COSO17123456"
[4, 7].each { |i| str.insert i, '-' }
str #=> "COSO-17-123456"
You can use slices and .join:
> [str[0..3], str[4..5],str[6..-1]].join("-")
=> "COSO-17-123456"
Note that every index after the first one (the first falls between 3 and 4) differs from the insert version, because no earlier insertion has shifted the string. That feels more natural (to me, anyway).
You insert at absolute indices of the original string, not at relative indices that move as insertions are made.
If you want to insert at specific absolute index values, you can also use .each_with_index and control the behavior character by character:
str2 = ""
tgts=[3,5]
str.split("").each_with_index { |c,idx| str2+=c; str2+='-' if tgts.include? idx }
Both of the above create a new string.
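If you want to keep String#insert but still think in positions of the original string, one option is to apply the insertions from right to left, so that earlier positions are never shifted. A minimal sketch; the helper name insert_at_original_positions is hypothetical:

```ruby
# Hypothetical helper: inserts sep before each given position of the
# ORIGINAL string and returns a new string.
def insert_at_original_positions(str, positions, sep = "-")
  result = str.dup
  # Work from the right so that earlier positions are not shifted.
  positions.sort.reverse_each { |pos| result.insert(pos, sep) }
  result
end

insert_at_original_positions("COSO17123456", [4, 6]) #=> "COSO-17-123456"
```

Here 4 and 6 are positions in the original string, unlike the 4 and 7 passed to successive insert calls.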
String#insert returns the string itself.
This means you can chain the method calls, which can be prettier and more efficient if you only have to do it a couple of times, as in your example:
str = "COSO17123456".insert(4, "-").insert(7, "-")
puts str
COSO-17-123456
Your reduce version can therefore be written more concisely as:
[4,7].reduce(str) { |s, idx| s.insert(idx, '-') }
I'll bring one more variation to the table, String#unpack:
new_str = str.unpack("A4A2A*").join('-')
# or with String#%
new_str = "%s-%s-%s" % str.unpack("A4A2A*")

Regex to match a specific sequence of strings

Assume I have 2 arrays of strings:
position1 = ['word1', 'word2', 'word3']
position2 = ['word4', 'word1']
Inside a text/string, I want to check whether the substring #{target} is preceded by one of the words in position1, followed by one of the words in position2, or both at the same time, as if I were looking to the left and right of #{target}.
For example, in the sentence "Writing reports and inputting data onto internal systems, with regards to enforcement and immigration papers", if the target word is data, I would like to check whether the words to its left (inputting) and right (onto) are included in the arrays, or whether any of the words in the arrays produces a regex match. Any suggestions? I am using Ruby and I have tried some regexes, but I can't make it work yet. I also have to ignore any special characters in between.
One of them:
/^.*\b(#{joined_position1})\b.*$[\s,.:-_]*\b#{target}\b[\s,.:-_\\\/]*^.*\b(#{joined_position2})\b.*$/i
Edit:
I figured out this way with regex to capture the word left and right:
(\S+)\s*#{target}\s*(\S+)
However, what could I change if I would like to capture more than one word to the left and right?
If you have two arrays of strings, what you can do is something like this:
matches = /^.+ (\S+) #{target} (\S+) .+$/.match(text)
if matches and (position1.include?(matches[1]) or position2.include?(matches[2]))
  do_something()
end
What this regex does is match the target word in your text and extract the words next to it using capture groups. The code then compares those words against your arrays, and does something if they're in the right places. A more general version of this might look like:
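def checkWords(target, text, leftArray, rightArray, numLeft = 1, numRight = 1)
  # Build the regex. Single quotes keep \S intact; in a double-quoted
  # string "\S" would collapse to a plain "S".
  regex = '^.+'
  regex += ' (\S+)' * numLeft
  regex += " #{Regexp.escape(target)}"
  regex += ' (\S+)' * numRight
  regex += ' .+$'
  pattern = Regexp.new(regex)
  matches = pattern.match(text)
  return false if !matches
  # Captures 1..numLeft are the words to the left of the target
  for i in 1..numLeft
    return false if (!leftArray.include?(matches[i]))
  end
  # The remaining captures are the words to the right of the target
  for i in 1..numRight
    return false if (!rightArray.include?(matches[numLeft + i]))
  end
  return true
end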
Which can then be invoked like this:
do_something() if checkWords("data", text, position1, position2, 2, 2)
I'm pretty sure it's not terribly idiomatic, but it gives you a sense of how you would do what you want in a more general way.
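Since the question also asks to ignore special characters between words, another option is to tokenize the text first and then compare neighbours by index instead of building one large regex. A rough sketch, assuming the word lists are lowercase and that "word" means a run of \w characters; the helper name neighbour_match? is hypothetical:

```ruby
# Hypothetical helper: true if a word from left_words directly precedes the
# target, or a word from right_words directly follows it.
def neighbour_match?(text, target, left_words, right_words)
  words = text.downcase.scan(/\w+/) # punctuation between words is dropped
  wanted = target.downcase
  words.each_index.any? do |i|
    next false unless words[i] == wanted
    left = i.zero? ? nil : words[i - 1] # avoid wrapping around to words[-1]
    right = words[i + 1]                # nil when the target is the last word
    left_words.include?(left) || right_words.include?(right)
  end
end

text = "Writing reports and inputting data onto internal systems"
neighbour_match?(text, "data", ["inputting"], ["onto"])  #=> true
neighbour_match?(text, "data", ["reports"], ["systems"]) #=> false
```

Replacing || with && would require both neighbours to match at the same time.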

Need better regex solution in Ruby

I have the following code:
date_time = Time.now.strftime('%Y%m%d%H%M%S')
name = "builder-#{date_time}" # builder-20150923125450
if some_condition
  name.sub!("#{date_time}", "one-#{date_time}") # builder-one-20150923125450
end
The above code works fine.
But I think it could be better, as I feel like I am repeating #{date_time} here.
I have heard of regex capture and replace. Can we use it here? If yes, how?
To use the capturing mechanism, put parentheses around the subpattern that you would like to refer to with a back-reference in the replacement string.
Here is an example:
date_time = Time.now.strftime('%Y%m%d%H%M%S')
name = "builder-#{date_time}"
puts name.sub(/^([^-]*-)/, "\\1one-")
The pattern ^([^-]*-) matches and captures, from the start of the line (^), every character other than - followed by a hyphen. The replacement string then refers to the captured text with \\1 (in a double-quoted string, this is the back-reference \1).
Refer to Use Parentheses for Grouping and Capturing at Regular-Expressions.info for more details.
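For illustration, here are three equivalent ways to refer to a captured group in String#sub. This is only a sketch; the method names are hypothetical and the input is assumed to look like "builder-<digits>":

```ruby
# Numbered groups; single quotes keep the backslashes exactly as written.
def with_numbered_groups(name)
  name.sub(/\A([^-]*-)(\d+)\z/, '\1one-\2')
end

# Named group, referenced with \k<...> in the replacement string.
def with_named_group(name)
  name.sub(/\A(?<prefix>[^-]*-)/, '\k<prefix>one-')
end

# Block form: the capture is available as $1, so no back-reference
# escaping is needed at all.
def with_block(name)
  name.sub(/\A([^-]*-)/) { "#{$1}one-" }
end

with_numbered_groups("builder-20150923125450") #=> "builder-one-20150923125450"
```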
A simpler way is to use a ternary operator when initializing the name variable:
date_time = Time.now.strftime('%Y%m%d%H%M%S')
name = "builder-" + (some_condition ? "one-" : "") + date_time
Strategy one - precalculate the prefix:
date_time = Time.now.strftime('%Y%m%d%H%M%S')
prefix = some_condition ? 'builder-one-' : 'builder-'
name = "#{prefix}#{date_time}"
The string 'builder-' appears twice here. Obviously, you can DRY it up even more, but that's overkill IMHO.
Strategy two - use a lookahead:
date_time = Time.now.strftime('%Y%m%d%H%M%S')
name = "builder-#{date_time}"
name.sub!(/(?=#{date_time})/, "one-") if some_condition
Now date_time appears only twice. I wouldn't say it's a great improvement. I wouldn't say there is much of a problem to begin with.
"builder-" + ("one-" if some_condition).to_s + date_time
date_time = "right now"
some_condition = true
"builder-" + ("one-" if some_condition).to_s + date_time
#=> "builder-one-right now"
some_condition = false
"builder-" + ("one-" if some_condition).to_s + date_time
#=> "builder-right now"
Note that:
("one-" if false).to_s #=> nil.to_s => ""
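The same nil-to-empty-string behaviour applies to string interpolation, so the expression can also be written without + and to_s. A small sketch, wrapped in a hypothetical method builder_name so that it is self-contained:

```ruby
# Hypothetical wrapper: interpolating nil yields an empty string, so the
# "one-" part simply disappears when some_condition is false.
def builder_name(date_time, some_condition)
  "builder-#{'one-' if some_condition}#{date_time}"
end

builder_name("20150923125450", true)  #=> "builder-one-20150923125450"
builder_name("20150923125450", false) #=> "builder-20150923125450"
```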

Grouping regex based on the previous grouping result

I have some parameters that I have to sort into different lists. The prefix determines which list each one belongs to.
I use prefixes like c, a, n, o, plus an optional leading hyphen (-) that determines whether the parameter goes into the include list or the exclude list.
I use the regex grouped as:
/^(-?)([o|a|c|n])(\w+)/
But here the third group (\w+) is not generic, and it should actually be dependent on the second group's result. I.e, if the prefix is:
'c' or 'a' -> /\w{3}/
'o' -> /\w{2}/
else -> /\w+/
Can I do this with a single regex? Currently I am using an if condition to do so.
Example input:
Valid:
"-cABS", "-aXYZ", "-oWE", "-oqr", "-ncanbeanyting", "nstillanything", "a123", "-conT" (will go to c_exclude_list)
Invalid:
"cmorethan3chars", "c1", "-a1234", "prefizisnotvalid", "somethingelse", "oABC"
Output: for each arg push to the correct list, ignore the invalid.
c_include_list, c_exclude_list, a_include_list, a_exclude_list etc.
You can use this pattern:
/(-?)\b([aocn])((?:(?<=[ac])\w{3}|(?<=o)\w{2}|(?<=n)\w+))\b/
The idea is to use lookbehinds to check the previous character without including it in the capture group.
Since version 2.0, Ruby has switched from Oniguruma to Onigmo (a fork of Oniguruma), which adds support for conditional regex, among other features.
So you can use the following regex to customize the pattern based on the prefix:
^-(?:([ca])|(o)|(n))?(?(1)\w{3}|(?(2)\w{2}|(?(3)\w+)))$
Demo at rubular
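As a rough sketch of how a conditional pattern could drive the sorting into lists: the variant below makes the hyphen optional (the question lists unhyphenated inputs such as "a123" as valid), requires a prefix, and captures the body. The constant PARAM, the helper sort_params and the list names are assumptions made for illustration only:

```ruby
# Groups: 1 = "-", 2 = "a" or "c", 3 = "o", 4 = "n", 5 = the body.
# (?(2)...) checks whether group 2 took part in the match.
PARAM = /\A(-)?(?:([ac])|(o)|(n))((?(2)\w{3}|(?(3)\w{2}|\w+)))\z/

# Hypothetical helper: returns a hash such as {"c_exclude" => ["ABS"]}.
def sort_params(args)
  lists = Hash.new { |hash, key| hash[key] = [] }
  args.each do |arg|
    m = PARAM.match(arg)
    next unless m # invalid arguments are ignored
    prefix = m[2] || m[3] || m[4]
    lists["#{prefix}_#{m[1] ? 'exclude' : 'include'}"] << m[5]
  end
  lists
end

sort_params(%w[-cABS a123 oABC c1])
#=> {"c_exclude"=>["ABS"], "a_include"=>["123"]}
```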
Is a single, mind-bending regex the best way to deal with this problem?
Here's a simpler approach that does not employ a regex at all. I suspect that it would be at least as efficient as a single regex, considering that with the latter you must still assign matching strings to their respective arrays. I think it also reads better and would be easier to maintain. The code below should be easy to modify if I have misunderstood some fine points of the question.
Code
def devide_em_up(str)
  h = { a_exclude: [], a_include: [], c_exclude: [], c_include: [],
        o_exclude: [], o_include: [], other_exclude: [], other_include: [] }
  str.split.each do |s|
    exclude = (s[0] == ?-)
    s = s[1..-1] if exclude
    first = s[0]
    # Strip the prefix for 'c', 'a' and 'o'; 'n' arguments keep theirs
    s = s[1..-1] if 'cao'.include?(first)
    len = s.size
    case first
    when 'a'
      (exclude ? h[:a_exclude] : h[:a_include]) << s if len == 3
    when 'c'
      (exclude ? h[:c_exclude] : h[:c_include]) << s if len == 3
    when 'o'
      (exclude ? h[:o_exclude] : h[:o_include]) << s if len == 2
    when 'n'
      # Any other prefix is invalid and is ignored
      (exclude ? h[:other_exclude] : h[:other_include]) << s
    end
  end
  h
end
Example
Let's try it:
str = "-cABS cABT -cDEF -aXYZ -oWE -oQR oQT -ncanbeany nstillany a123 " +
      "-conT cmorethan3chars c1 -a1234 prefizisnotvalid somethingelse oABC"
devide_em_up(str)
#=> {:a_exclude=>["XYZ"], :a_include=>["123"],
#    :c_exclude=>["ABS", "DEF", "onT"], :c_include=>["ABT"],
#    :o_exclude=>["WE", "QR"], :o_include=>["QT"],
#    :other_exclude=>["ncanbeany"], :other_include=>["nstillany"]}

regex replace [ with \[

I want to write a regex in Ruby that will add a backslash prior to any open square brackets.
str = "my.name[0].hello.line[2]"
out = str.gsub(/\[/,"\\[")
# desired out = "my.name\[0].hello.line\[2]"
I've tried multiple combinations of backslashes in the substitution string and can't get it to leave a single backslash.
You don't need a regular expression here.
str = "my.name[0].hello.line[2]"
puts str.gsub('[', '\[')
# my.name\[0].hello.line\[2]
I tried your code and it worked correctly:
str = "my.name[0].hello.line[2]"
out = str.gsub(/\[/,"\\[")
puts out #my.name\[0].hello.line\[2]
If you replace puts with p, you get the inspect version of the string:
p out #"my.name\\[0].hello.line\\[2]"
Note the surrounding quotes and the escaped backslash. Maybe this is the result you saw.
As Daniel already answered: you can also define the string with single quotes, so you don't need to escape the backslash.
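To convince yourself that the result really contains a single backslash per bracket, you can count characters instead of relying on how the string is displayed. A small sketch; the helper name escape_open_brackets is hypothetical, and the block form of gsub is used so that the replacement is inserted verbatim:

```ruby
# Hypothetical helper: puts one backslash in front of every "[".
def escape_open_brackets(str)
  str.gsub("[") { "\\[" } # "\\[" is two characters: a backslash and "["
end

out = escape_open_brackets("my.name[0].hello.line[2]")
out.chars.count("\\") #=> 2, one backslash per bracket
puts out              # prints my.name\[0].hello.line\[2]
p out                 # prints "my.name\\[0].hello.line\\[2]" (inspect form)
```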
