Regex: text before multiple matches - ruby

Idea. Given the string, return all the matches (with overlaps) and the text before these matches.
Example. For the text atatgcgcatatat and the query atat there are three matches, and the desired output is atat, atatgcgcatat and atatgcgcatatat.
Problem. I use Ruby 2.2 and String#scan method to get multiple matches. I've tried to use lookahead, but the regex /(?=(.*?atat))/ returns every substring that ends with atat. There must be some regex magic to solve this problem, but I can't figure out the right spell.

I believe this is at least better than the OP's answer:
text = "atatgcgcatatat"
query = "atat"
res = []
text.scan(/(?=#{query})/){res.push($` + query)} #`
res # => ["atat", "atatgcgcatat", "atatgcgcatatat"]

Given the nature and purpose of regex, there is no way to do that. When a regex matches text, there is no way to include the same text in another match. Therefore, the best option that I can think of is to use a look-behind to find the ending position of each match:
(?<=atat)
With your example input of atatgcgcatatat, that would return the following three matches:
Position 4, Length 0
Position 12, Length 0
Position 14, Length 0
You could then loop through those results, get the position for each one, and then get the sub-string that starts at the beginning of the input string and ends at that position. If you don't know how to get the positions of each match, you may find the answers to this question helpful.

You could do this:
str = 'atatgcgcatatat'
target = 'atat'
[].tap do |a|
str.gsub(/(?=#{target})/) { a << str[0, $~.end(0)+target.size] }
end
#=> ["atat", "atatgcgcatat", "atatgcgcatatat"]
Notice that the string returned by gsub is discarded.

It seems, there's no way to solve the problem in just one go.
One possible solution is to use this knowledge to get indices of matches when using String#scan, and then return the array of sliced strings:
def find_by_end text, query
res = []
n = query.length
text.scan( /(?=(#{query}))/ ) do |m|
res << text.slice(0, $~.offset(0).first + n)
end
res
end
find_by_end "atatgcgcatatat", "atat" #=> ["atat", "atatgcgcatat", "atatgcgcatatat"]
A slightly different solution was proposed by #StevenDoggart. Here's a nice and short code which uses this hack to solve the problem:
"atatgcatatat".to_enum(:scan, /(?<=atat)/).map { $` } #`
#=> ["atat", "atatgcatat", "atatgcatatat"]
As #CasimiretHippolyte notes, reversing the string might help to solve the problem. It actually does, but it's hardly the prettiest solution:
"atatgcatatat".reverse.scan(/(?=(tata.*))/).flatten.map(&:reverse).reverse
#=> ["atat", "atatgcatat", "atatgcatatat"]

Related

How to use gsubstitution with more letters

I've printed the code, wit ruby
string = "hahahah"
pring string.gsub("a","b")
How do I add more letter replacements into gsub?
string.gsub("a","b")("h","l") and string.gsub("a","b";"h","l")
didnt work...
*update I have tried this too but without any success .
letters = {
"a" => "l"
"b" => "n"
...
"z" => "f"
}
string = "hahahah"
print string.gsub(\/w\,letters)
You're overcomplicating. As with most method calls in Ruby, you can simply chain #gsub calls together, one after the other:
str = 'adfh'
print str.gsub("a","b").gsub("h","l") #=> 'bdfl'
What you're doing here is applying the second #gsub to the result of the first one.
Of course, that gets a bit long-winded if you do too many of them. So, when you find yourself stringing too many together, you'll want to look for a regex solution. Rubular is a great place to tinker with them.
The way to use your hash trick with #gsub and a regex expression is to provide a hash for all possible matches. This has the same result as the two #gsub calls:
print str.gsub(/[ah]/, {'a'=>'b', 'h'=>'l'}) #=> 'bdfl'
The regex matches either a or h (/[ah]/), and the hash is saying what to substitute for each of them.
All that said, str.tr('ah', 'bl') is the simplest way to solve your problem as specified, as some commenters have mentioned, so long as you are working with single letters. If you need to work with two or more characters per substitution, you'll need to use #gsub.

How to verify that the last character in a string is a number

I need to check if the last character in a string is a digit, and if so, increment it.
I have a directory structure of /u01/app/oracle/... and that's where it goes off the rails. Sometimes it ends with the version number, sometimes it ends with dbhome_1 (or 2, or 3), and sometimes, I have to assume, it will take some other form. If it ends with dbhome_X, I need to parse that and bump that final digit, if it is a digit.
I use split to split the directory structure on '/', and use include? to check if the final element is something like "dbhome". As long as my directory structure ends with dbhome_X it seems to work. As I was testing, though, I tried a path that ended with dbhome, and found that my check for the last character being a digit didn't work.
db_home = '/u01/app/oracle/product/11.2.0/dbhome'
if db_home.split('/')[-1].include?('dbhome')
homedir=db_home.split('/')[-1]
if homedir[-1].to_i.is_a? Numeric
homedir=homedir[0...-1]+(homedir[-1].to_i+1).to_s
new_path="/"+db_home.split('/')[1...-1].join("/")+"/"+homedir.to_s
end
else
new_path=db_home+"/dbhome_1"
end
puts new_path
I did not expect the output to be /u01/app/oracle/11.2.0/product/dbhom1 - it seems to have fallen into the if block that added 1 to the final character.
If I set the initial path to /u01/app/.../dbhome_1, I get the expected /u01/app/.../dbhome_2 as the output.
You could use a regular expression to make matching a tad bit easier
if !!(db_home[/.*dbhome.*\z]) ..
You could use regex's
/[0-9]$/.match("How3").nil?
I need to check if the last character in a string is a digit, and if
so, increment it.
This is one option:
s = 'string9'
s[-1].then { |last| last.to_i.to_s == last ? [s[0..-2], last.to_i+1].join : s }
#=> "string10"
'/u01/app/11.2.0/dbhome'.sub(/\d\z/) { |s| s.succ }
#=> "/u01/app/11.2.0/dbhome"
'/u01/app/11.2.0/dbhome9'.sub(/\d\z/) { |s| s.succ }
#=> "/u01/app/11.2.0/dbhome10"
This is a starting point if you're running Ruby v2.6+:
fname = 'filename1'
fname[/\d+$/].then { |digits|
fname[/\d+$/] = digits.to_i.next.to_s if digits
}
fname # => "filename2"
And it's safe if the filename doesn't end with a digit:
fname = 'filename'
fname[/\d+$/].then { |digits|
fname[/\d+$/] = digits.to_i.next.to_s if digits
}
fname # => "filename"
I'm not sure if I like doing it that way better than the more traditional way which works with much older Rubies:
digits = fname[/\d+$/]
fname[/\d+$/] = digits.to_i.next.to_s if digits
except for the fact that digits gets stuck into the variable space after only being used once. There's probably worse things that happen in my code though.
This is taking advantage of String's [] and []= methods.

Regex returning weird arrays

I want to make an array of results from a string like this one, using a regular expression:
results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo\ntimestamps||friday
Here’s my regex as it stands. It works in Sublime Text’s regex search but not in Ruby:
(results)\|.*?\\n(?=((results\|)|(timestamps\|\|)))
and this would be the desired result:
1. results|foofoofoo
2. results|barbarbar
3. results|googoogoo
Instead I’m getting these weird returns, and I can’t understand it. Why does this not select the result lines?
Match 1
1. results
2. results|
3. results|
4.
Match 2
1. results
2. results|
3. results|
4.
Match 3
1. results
2. timestamps||
3.
4. timestamps||
Here’s the actual code using the regex:
#create new lines for each regex'd line body with that body set as the raw attribute
host_scan.raw.scan(/(?:results)\|.*?\\n(?=((?:results\|)|(?:timestamps\|\|)))/).each do |body|
#lines << Line.new({:raw => body})
end
As Kendall Frey already stated, you are creating too many capture groups. No need to group the first literal “results|”, and no need to group the elements of your alternate group in individual non backreferencing groups. What you are intending to do is this regex:
/results\|.*?(?=\\n(?:results\||timestamps\|\|))/
or, if you don’t mind repeating the \\n part, you can do away with the non-capturing subgroup:
/results\|.*?(?=\\nresults\||\\ntimestamps\|\|)/
– both will return an array of matched values as specified in your question.
I'm guessing it has something to do with capturing groups. If you change all your (...) to (?:...) it will eliminate capturing groups.
Rather than jump to a regex, which is a much more complicated way to get at the data, use split("\n").
text = "results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo\ntimestamps||friday"
ary = text.split("\n")
ary is:
[
"results|foofoofoo",
"results|barbarbarbar",
"results|googoogoo",
"timestamps||friday"
]
Slice that and you can get:
ary[0..2]
=> ["results|foofoofoo", "results|barbarbarbar", "results|googoogoo"]
EDIT:
Based on the comment that there are more carriage returns and complex characters in the strings:
require 'awesome_print'
text = "results|foofoofoo\nmorefoo\nandevenmorefoo\nresults|barbarbarbar\nandmorebar\nandyetagainmorebar\nresults|googoogoo\ntimestamps||friday"
ap text.sub(/\|\|friday$/, '').split('results')[1..-1].map{ |l| 'results' << l }
Which outputs:
[
[0] "results|foofoofoo\nmorefoo\nandevenmorefoo\n",
[1] "results|barbarbarbar\nandmorebar\nandyetagainmorebar\n",
[2] "results|googoogoo\ntimestamps"
]
The answer turned out to lie in the parentheses. Wrapping in parentheses caused it to return the entire match instead of just the tail delimiter.
host_scan.raw.scan(/((?:results\|.*?\\n)(?=(?:results\|)|(?:timestamps\|\|)))/).each do |body|
#lines << Line.new({:raw => body})
end

How can I do fuzzy substring matching in Ruby?

I found lots of links about fuzzy matching, comparing one string to another and seeing which gets the highest similarity score.
I have one very long string, which is a document, and a substring. The substring came from the original document, but has been converted several times, so weird artifacts might have been introduced, such as a space here, a dash there. The substring will match a section of the text in the original document 99% or more. I am not matching to see from which document this string is, I am trying to find the index in the document where the string starts.
If the string was identical because no random error was introduced, I would use document.index(substring), however this fails if there is even one character difference.
I thought the difference would be accounted for by removing all characters except a-z in both the string and the substring, compare, and then use the index I generated when compressing the string to translate the index in the compressed string to the index in the real document. This worked well where the difference was whitespace and punctuation, but as soon as one letter is different it failed.
The document is typically a few pages to a hundred pages, and the substring from a few sentences to a few pages.
You could try amatch. It's available as a ruby gem and, although I haven't worked with fuzzy logic for a long time, it looks to have what you need. The homepage for amatch is: https://github.com/flori/amatch.
Just bored and messing around with the idea, a completely non-optimized and untested hack of a solution follows:
include 'amatch'
module FuzzyFinder
def scanner( input )
out = [] unless block_given?
pos = 0
input.scan(/(\w+)(\W*)/) do |word, white|
startpos = pos
pos = word.length + white.length
if block_given?
yield startpos, word
else
out << [startpos, word]
end
end
end
def find( text, doc )
index = scanner(doc)
sstr = text.gsub(/\W/,'')
levenshtein = Amatch::Levensthtein.new(sstr)
minlen = sstr.length
maxndx = index.length
possibles = []
minscore = minlen*2
index.each_with_index do |x, i|
spos = x[0]
str = x[1]
si = i
while (str.length < minlen)
i += 1
break unless i < maxndx
str += index[i][1]
end
str = str.slice(0,minlen) if (str.length > minlen)
score = levenshtein.search(str)
if score < minscore
possibles = [spos]
minscore = score
elsif score == minscore
possibles << spos
end
end
[minscore, possibles]
end
end
Obviously there are numerous improvements possible and probably necessary! A few off the top:
Process the document once and store
the results, possibly in a database.
Determine a usable length of string
for an initial check, process
against that initial substring first
before trying to match the entire
fragment.
Following up on the previous,
precalculate starting fragments of
that length.
A simple one is fuzzy_match
require 'fuzzy_match'
FuzzyMatch.new(['seamus', 'andy', 'ben']).find('Shamus') #=> seamus
A more elaborated one (you wouldn't say it from this example though) is levenshein, which computes the number of differences.
require 'levenshtein'
Levenshtein.distance('test', 'test') # => 0
Levenshtein.distance('test', 'tent') # => 1
You should look at the StrikeAMatch implementation detailed here:
A better similarity ranking algorithm for variable length strings
Instead of relying on some kind of string distance (i.e. number of changes between two strings), this one looks at the character pairs patterns. The more character pairs occur in each string, the better the match. It has worked wonderfully for our application, where we search for mistyped/variable length headings in a plain text file.
There's also a gem which combines StrikeAMatch (an implementation of Dice's coefficient on character-level bigrams) and Levenshtein distance to find matches: https://github.com/seamusabshere/fuzzy_match
It depends on the artifacts that can end up in the substring. In the simpler case where they are not part of [a-z] you can use parse the substring and then use Regexp#match on the document:
document = 'Ulputat non nullandigna tortor dolessi illam sectem laor acipsus.'
substr = "tortor - dolessi _%&# +illam"
re = Regexp.new(substr.split(/[^a-z]/i).select{|e| !e.empty?}.join(".*"))
md = document.match re
puts document[md.begin(0) ... md.end(0)]
# => tortor dolessi illam
(Here, as we do not set any parenthesis in the Regexp, we use begin and end on the first (full match) element 0 of MatchData.
If you are only interested in the start position, you can use =~ operator:
start_pos = document =~ re
I have used none of them, but I found some libraries just by doing a search for 'diff' in rubygems.org. All of them can be installed by gem. You might want to try them. I myself is interested, so if you already know these or if you try them out, it would be helpful if you leave your comment.
diff
diff-lcs
differ
difflcs
pretty_diff
diffy
kronk
khtmldiff
gdiff
ruby_diff
tdiff
diffrenderer
diffplex
dbdiff
diff_dirs
rsyncdiff
wdiff
diff4all
davidtrogers-htmldiff
edouard-htmldiff
diff2xml
dirdiff
rrdiff
nokogiri-diff
pretty-diff
easy_diff
smartdiff

how to remove all [d+] except the last [d+]?

i have a string like
/root/children[2]/header[1]/something/some[4]/table/tr[1]/links/a/b
and
/root/children[2]/header[1]/something/some[4]/table/tr[2]
how can i reproduce the string so that all the /\[\d+\]/ are removed except for the last /\[\d+\]/ ?
so i should end up with .
/root/children/header/something/some/table/tr[1]/links/a/b
and
/root/children/header/something/some/table/tr[2]
No loops for you. Use a lookahead assertion (?= ... ):
s.gsub(/\[\d+\](?=.*\[)/, "")
There's a reasonable explanation of the very useful lookaround operators here
We will have to use while loop, I guess. And here comes good ol' C-style-loop solution:
while s.gsub!(/(\[\d+\])(.*?)(\[\d+\])/, '\2\3'); end
It's a bit hard to read, so I'll explain. The idea is that we match the string with a pattern that requires two [\d+] blocks to persist in a string. In the replacement, we just delete the first one. We repeat it until string doesn't match (so it contains only one such block) and utilize the fact that gsub! doesn't perform substitution when string is unmatched.
I'm absolutely certain there's a more elegant solution, but this ought to get you going:
string = "/root/children[2]/header[1]/something/some[4]/table/tr[1]/links/a/b"
count = string.scan(/\[\d+\]/).size
index = 0
string.gsub(/\[\d+\]/) do |capture|
index += 1
index == count ? capture : ""
end
Try this:
str.scan(/\[\d+\]/)[0..-2].each {|match| str.sub!(match, '')}

Resources