Overlapping string matching using regular expressions - ruby

Imagine we have some sequence of letters in the form of a string, call it
str = "gcggcataa"
The regular expression
r = /(...)/
matches any three characters, and when I execute the code
str.scan(r)
I get the following output:
["gcg", "gca", "taa"]
However, what if I wanted to scan through and instead of the distinct, non-overlapping strings as above but instead wanted to get this output:
["gcg", "cgg", "ggc", "gca", "cat", "ata", "taa"]
What regular expression would allow this?
I know I could do this with a loop but I don't want to do that

str = "gcggcataa"
str.chars.each_cons(3).map(&:join) # => ["gcg", "cgg", "ggc", "gca", "cat", "ata", "taa"]

Related

Ruby regex: union 2K values in one regex,

I code a process to process bunch of text files and capture its name if any of 2000 literals exists in it (1 or many). So I'm thinking to combine that many values into one regex, do you think it's doable, I did test for 100 and looks like it's OK. Tx all
Code below depics my flow and sample code, just without looping.
# 1. read regex value list as file [alpha,fox, delta] # 2000 values
# 2. read file into s #5000 files
# 3. find if any of #1 values exists in each #2 file. *with regex tweaks to match format dbname.dob.table
s = '1 dbName.dbo.ALPHA 2 DBNAME.bcd.ALPHA 3 dbName..ALPHA 4 ALPHA 5x dbName.alphA 6x alpha.XX 7x ###dbName.###a.alpha --alpha
dbName..FOX dbName.dbo.DELTA clarity.aba..fox '
value1 = '(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:alpha)(?=\s|$)'
value2 = '(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:fox)(?=\s|$)'
##...
value2000 = '(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:delta)(?=\s|$)'
regex = /#{value1}|#{value2}|#{value2000}/i ## can I union 2000 regex's ???
puts 'reg1: ' + regex.to_s
puts 'result: ' + s.scan(regex).to_s
if s.scan(regex) then puts '...Match!!!d' end
Declaring 2000 variables is highly unnecessary; you should define all values in a single array, then somehow loop through them.
Also, the regular expression is highly repetitive - e.g. the use of (?:dbName\.[a-z]*\.) 2000 times. This can be simplified by grouping all of your values within the non-capture group as follows:
values = %w(alpha fox delta)
regex = /(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:#{Regexp.union(values)})(?=\s|$)/
This is the result:
/(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:(?-mix:alpha|fox|delta))(?=\s|$)/
If you extend that values array to contain 2000 strings, the other code does not need to change.
Provided two conditions are met, I would do it as follows, which I think would be far more efficient than using a gigantic regular expression, which, by its nature, requires that a linear search of the "bad words" be performed for each word in the string, until a match is found or it is determined that there are no matches.
We are given a file whose path is contained in a variable fname and an array of bad words:
arr = ["alpha", "fox", "delta", "charlie", "mabel"]
The first condition that I spoke of above is that, by way of example, "ALPHA" and "Alpha" match "alpha", but "aLPha" does not (or some variant of that).
The second condition is that there is a regular expression with a capture group that would capture a bad word if a bad word were present at the given location in a match. For example:
regex = (?<=^|\s)(?:dbName\.[a-z]*\.)?(\p{Alpha}+)(?=\s|$)
Wherever there is a match, the capture group (\p{Alpha}+) would capture a string of one or more alphanumeric characters whose value is assigned to the global variable $1. We will then check to see if the value of $1 is a bad word. (The regular expression might have other capture groups as well, in which case we might be looking for $2 or $3, say, or a named capture group.)
If there were more than one such regular expression to check for, the code below could be executed for each of them until a match is found or it is determined that there are no more matches.
The first step is to convert the array of bad words to a set:
require 'set'
bad_words = arr.flat_map { |w| [w, w.capitalize, w.upcase] }.to_set
#=> #<Set: {"alpha", "Alpha", "ALPHA", "fox", "Fox", "FOX",
# "delta", "Delta", "DELTA", "charlie", "Charlie", "CHARLIE",
# "mabel", "Mabel", "MABEL"}>
This allows very fast word lookups--much faster than stepping through an array. We may then search the file as follows.
rv = IO.foreach(fname).any? do |line|
line.gsub(regex).any? { bad_words.include?($1) }
end
IO::foreach without a block is seen to return an enumerator. We can then chain that to any? to determine if there is a line that contains a match of the regular expression and the value of its capture group is contained in the set bad_words. If such a line is found the search terminates and true is returned; else, false is returned.
It is seen that String#gsub without a block returns an enumerator, which here I've chained to any?. This form of gsub has nothing to do with string replacements; it just generates matches. Those matches are passed to the block, but we are only interested in the contents of the capture group, which are held by $1. Hence the expression bad_words.include?($1).

Regex to obfuscate substring of a repeating substring

Given a string like:
abc_1234 xyz def_123aa4a56
I want to replace parts of it so the output is:
abc_*******z def_*******56
The rules are:
abc_ and def_ are kind of delimiters, so anything between the two are part of the previous delimiter string.
The string between the abc_ and def_, and the next delimited string should be replaced by *, except for the last 2 characters of that substring. In the above example, abc_1234 xyz (note trailing space), got turned into abc_*******z
prefixes = %w|abc_ def_|
input = "Hello abc_111def_frg def_333World abc_444"
input.gsub(/(#{Regexp.union(prefixes)})../, "\\1**")
#⇒ "Hello abc_**1def_**g def_**3World abc_**4"
Is this what you are looking for?
str = "Hello abc_111def_frg def_333World abc_444"
str.scan(/(?<=abc_|def_)(?:[[:alpha:]]+|[[:digit:]]+)/)
# => ["111", "frg", "333", "444"]
I've assumed the string following "abc_" or "def_" is either all digits or all letters. It won't work if, for example, you wished to extract "a1b" from "abc_a1b cat". You need to better define the rules for what terminates the strings you want.
The regular expression reads, "Following the string "abc_" or "def_" (a positive lookbehind that is not part of the match), match a string of digits or a string of letters".
Given:
> s
=> "abc_1234 xyz def_123aa4a56"
You can do:
> s.gsub(/(?<=abc_|def_)(.*?)(..)(?=(?:abc_|def_|$))/) { |m| "*" * $1.length<<$2 }
=> "abc_*******z def_*******56"

Substring extraction issues with regex in Ruby

I'm attempting to do some substring extraction in Ruby through the use of regular expressions, and running into some issues with the regexp being "overly selective".
Here's the target string I'm attempting to match:
"Exam­ple strin­g with 3 numbe­rs, 2 comma­s, and 6,388­ other­ value­s that are not inclu­ded."
What I'm attempting to extract is the numerical values in the statement provided. In order to account for the comma, I came up with the expression /(\d{1,3}(,\d{1,3})*)/.
Testing the following in IRB, this is the code and result:
string = "Exam­ple strin­g with 3 numbe­rs, 2 comma­s, and 6,388­ other­ value­s that are not inclu­ded."
puts strin­g.scan(/(\­d{1,3}(,\d­{1,3})*)/)­
=> "[[\"3\", nil], [\"2\", nil], [\"6,388\", \",388\"]]"
What I'm looking for is something along the lines of ["3", "2", "6,388"]. Here's the issues I need help correcting:
Why does Ruby include nil for each match group that is not comma-delimited, and how do I adjust the regular expression/match strategy to remove that and get a "flat" array?
How do I prevent the regular expression from matching a sub-expression of the substring I'm attempting to match (that is, ",388" in "6,388")?
I did attempt to use .match(), but ran into the issue that it simply returned "3" (presumably, the first value matched) with no other information apparent. Attempting to index that with [1] or [2] resulted in nil.
If there's a capturing group in pattern, String#scan returns array of arrays to express all groups.
For each match, a result is generated and either added to the result
array or passed to the block. If the pattern contains no groups, each
individual result consists of the matched string, $&. If the pattern
contains groups, each individual result is itself an array containing
one entry per group.
By removing capturing group or by replacing (...) with non-capturing group (?:...), you will get a different result:
string = "Example string with 3 numbers, 2 commas, and 6,388 other values ..."
string.scan(/\d{1,3}(?:,\d{1,3})*/) # no capturing group
# => ["3", "2", "6,388"]

best way to find substring in ruby using regular expression

I have a string https://stackverflow.com. I want a new string that contains the domain from the given string using regular expressions.
Example:
x = "https://stackverflow.com"
newstring = "stackoverflow.com"
Example 2:
x = "https://www.stackverflow.com"
newstring = "www.stackoverflow.com"
"https://stackverflow.com"[/(?<=:\/\/).*/]
#⇒ "stackverflow.com"
(?<=..) is a positive lookbehind.
If string = "http://stackoverflow.com",
a really easy way is string.split("http://")[1]. But this isn't regex.
A regex solution would be as follows:
string.scan(/^http:\/\/(.+)$/).flatten.first
To explain:
String#scan returns the first match of the regex.
The regex:
^ matches beginning of line
http: matches those characters
\/\/ matches //
(.+) sets a "match group" containing any number of any characters. This is the value returned by the scan.
$ matches end of line
.flatten.first extracts the results from String#scan, which in this case returns a nested array.
You might want to try this:
#!/usr/bin/env ruby
str = "https://stackoverflow.com"
if mtch = str.match(/(?::\/\/)(/S)/)
f1 = mtch.captures
end
There are two capturing groups in the match method: the first one is a non-capturing group referring to your search pattern and the second one referring to everything else afterwards. After that, the captures method will assign the desired result to f1.
I hope this solves your problem.

How can I extract a variable number of sub-matches from a Ruby regex?

I have some strings that I would like to pattern match and then extract out the matches as variables $1, $2, etc.
The pattern matching code I have is
a = /^([\+|\-]?[1-9]?)([C|P])(?:([\+|\-][1-9]?)([C|P]))*$/i.match(field)
puts result = #{a.to_a.inspect}
With the above I am able to easily match the following sample strings:
"C", "+2C", "2c-P", "2C-3P", "P+C"
And I have confirmed all of these work on the Rubular website.
However, when I try to match "+2P-c-3p", it matches however, the MatchData "array-like object" looks like this:
result = ["+2P-C-3P", "+2", "P", "-3", "P"]
The problem is that I am unable to extract into the array, the middle pattern "-C".
What I would expect to see is:
result = ["+2P-C-3P", "+2", "P", "-", "C", "-3", "P"]
It seems to extract only the end part "-3P" as "-3" and "P"
Does anyone know how I can modify my pattern to capture the middle matches ?
So as an other example, +3c+2p-c-4p, I would expect should create:
["+3c+2p-c-4p", "+3", "C", "+2", "P", "-", "C", "-4", "P"]
but what I get is
["+3c+2p-c-4p", "+3", "C", "-4", "P"]
which completely misses the middle part.
You have a profound (but common) misunderstanding how character classes work. This:
[C|P]
is wrong. Unless you want to match pipe | characters. There is no alternation in character classes - they are not like groups. This would be correct:
[CP]
Also, there are no meta-characters in a character class, so you only need to escape very few characters (namely, the closing square bracket ] and the dash -, unless you put it at the end of the group). So your regex reduces to:
^([+-]?\d?)([CP])(?:([+-]?\d?)([CP]))*$
Your second misunderstanding is that group count is dynamic - that you somehow have more groups in the result because more matches occurred in the string. This is not the case.
You have exactly as many groups in your result as you have parentheses pairs in your regex (less the number of non-capturing groups of course). In this case, that number is 4. No more, no less.
If a group matches multiple times, only the contents of the last match occurrence will be retained. There is no way (in Ruby) to get the contents of previous match occurrences for that group.
As an alternative, you could regex-split the string into its meaningful parts and then parse them in a loop to extract all info.
This is what I managed to do :
([+-]?\d?)(C|P)(?=(?:[+-]?\d?[CP])*$)
This way you capture multiple elements.
The only problem is the validity of the string. As ruby doesn't have look-behind I can't check the start of the string, so zerhyju+2P-C-3P is valid (but will only capture +2P-C-3P) whereas +2P-C-3Pzertyuio isn't valid.
If you want to both capture and check if your string is valid, the best way (IMO) is to use two regexes, one to check the value ^(?:[+-]?\d?[CP])*$ and a second one to capture ([+-]?\d?)(C|P) (You could also use ([CP]) for the last part).

Resources