String diff to show positions of changed characters? - ruby

I'm looking for a way to diff two strings and return the index value of where the changes start and finish.
I'm already using diff-lcs to find out which lines have changed, but I need to figure out the positions of which characters have changed. I need the positions of the new characters so I can handle them with JavaScript, not the actual text, which is what most diff tools seem to give.
So, for example if I have this string:
The brown fox jumps over the lazy dog
and compare to this string:
The red fox jumps over the crazy dog
I would like to see something like:
[[5,8],[28,33]]
Those numbers being the position where the new characters are found.
Does anyone have any idea how I might get this done?

How about the Google diff-match-patch code? https://github.com/elliotlaster/Ruby-Diff-Match-Patch
I've used it in the past and been happy with the results.
Taken from the documentation linked above:
# Diff-ing
dmp.diff_main("Apples are a fruit.", "Bananas are also fruit.", false)
=> [[-1, "Apple"], [1, "Banana"], [0, "s are a"], [1, "lso"], [0, " fruit."]]
You would just need to iterate through the non-matches and find the character position in the appropriate string.
pos_ary = s.enum_for(:scan, /search_string/).map { Regexp.last_match.begin(0) }
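As a rough sketch of that iteration, assuming the diff comes back as an array of [op, text] pairs in the form shown in the documentation excerpt above (-1 for delete, 1 for insert, 0 for equal), the positions of the inserted text in the new string can be accumulated by walking the pairs. The helper name inserted_ranges is hypothetical, and the gem itself is not called here; the pairs are pasted in by hand.

```ruby
# Hypothetical helper: takes diff pairs shaped like the documentation
# excerpt ([op, text] with -1 = delete, 1 = insert, 0 = equal) and returns
# [start, end) index pairs of the inserted text within the NEW string.
def inserted_ranges(diffs)
  pos = 0
  diffs.each_with_object([]) do |(op, text), ranges|
    next if op == -1            # deleted text does not exist in the new string
    ranges << [pos, pos + text.length] if op == 1
    pos += text.length          # equal and inserted text both advance the cursor
  end
end

diffs = [[-1, "Apple"], [1, "Banana"], [0, "s are a"], [1, "lso"], [0, " fruit."]]
p inserted_ranges(diffs)
# prints [[0, 6], [13, 16]]
```

The end index is exclusive here. The diff may split changes at a finer or coarser granularity than expected, so the ranges are only as good as the diff that produced them.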

Related

How to use match in Ruby?

I'm trying to get the uppercase words from a text. How can I use .match() for this?
Example
text = "Pediatric stroke (PS) is a relatively rare disease, having an estimated incidence of 2.5–13/100,000/year [1–4], but remains one of the most common causes of death in childhood, with a mortality rate of 0.6/100,000 dead/year [5, 6]"
and I need something like:
r = /[A-Z]/
puts r.match(text)
I have never used match, and I need a method that gets all uppercase words (acronyms).
If you only want acronyms, you can use something like:
text = "Pediatric stroke (PS) is a relatively rare disease, having an estimated incidence of 2.5–13/100,000/year [1–4], but remains one of the most common causes of death in childhood, with a mortality rate of 0.6/100,000 dead/year [5, 6]"
text.scan(/\b[A-Z]+\b/)
# => ["PS"]
It's important to match entire words, which is where \b helps, as it marks word boundaries.
The problem is when your text contains single, stand-alone capital letters:
text = "Pediatric stroke (PS) I U.S.A"
text.scan(/\b[A-Z]+\b/)
# => ["PS", "I", "U", "S", "A"]
At that point we need a bit more intelligence and foreknowledge of the text content being searched. The question is, are single-letter acronyms valid? If not, then a minor modification will help:
text.scan(/\b[A-Z]{2,}\b/)
# => ["PS"]
{2,} is explained in the Regexp documentation, so read that for more information.
I only want acronyms of the type " (ACRONYM) ", in this case PS
It's not easy to tell what you want by your description. An acronym is defined as:
An acronym is an abbreviation used as a word which is formed from the initial components in a phrase or a word. Usually these components are individual letters (as in NATO or laser) or parts of words or names (as in Benelux).
according to Wikipedia. By that definition, lowercase, all caps and mixed case can be valid.
If you mean you only want all-caps within parentheses, then you can easily modify the regex to honor that, but you'll fail on other acronyms you could encounter, either by missing ones you should want or by capturing others you should want to ignore.
text = "(PS) (CT/CAT scan)"
text.scan(/\([A-Z]+\)/) # => ["(PS)"]
text.scan(/\([A-Z]+\)/).map{ |s| s[1..-2] } # => ["PS"]
text.scan(/\(([A-Z]+)\)/) # => [["PS"]]
text.scan(/\(([A-Z]+)\)/).flatten # => ["PS"]
are various ways to grab the text, but this only opens a new can of worms when you look at "List of medical abbreviations" and "Medical Acronyms / Abbreviations".
Typically I'd have a table of the ones I'll accept, use a simple pattern to capture anything that looks like something I'd want, check to see if it's in the table then keep it or reject it. How to do that is for you to figure out as it's a completely different question and doesn't belong in this one.
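For illustration only, a minimal sketch of that table-lookup idea might look like the following. The ACCEPTED set, its contents, and the method name known_acronyms are all assumptions made up for this example, not part of the answer above.

```ruby
require 'set'

# Hypothetical table of acronyms we are willing to accept.
ACCEPTED = Set['PS', 'CT', 'CAT'].freeze

# Capture anything that looks like an acronym (two or more capitals),
# then keep only the candidates found in the table.
def known_acronyms(text, accepted = ACCEPTED)
  text.scan(/\b[A-Z]{2,}\b/).select { |word| accepted.include?(word) }
end

p known_acronyms("Pediatric stroke (PS) needs a CT or MRI in the USA")
# prints ["PS", "CT"]
```

The loose pattern deliberately over-captures (MRI and USA are found by scan) and the table does the filtering, so changing what is accepted means editing data rather than the regex.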
Wrong function for the job. Use String#scan.
To get all words that start with uppercase, use String#scan with \b\p{Lu}\w*\b:
text = "Pediatric stroke (PS) is a relatively rare disease, having an estimated incidence of 2.5–13/100,000/year [1–4], but remains one of the most common causes of death in childhood, with a mortality rate of 0.6/100,000 dead/year [5, 6]"
puts text.scan(/\b\p{Lu}\w*\b/).flatten
See demo
String#match will only get you the first match, while scan will return all matches.
The regex \b\p{Lu}\w*\b matches:
\b - word boundary
\p{Lu} - an uppercase Unicode letter
\w* - 0 or more word characters (letters, digits, or underscore)
\b - a trailing word boundary
To only match linguistic words (made of letters) you can use
puts text.scan(/\b\p{Lu}\p{M}*+(?>\p{L}\p{M}*+)*\b/).flatten
See another demo
Here, \p{Lu}\p{M}*+ matches any Unicode uppercase letter together with any combining marks that follow it (\p{M} matches diacritics), and (?>\p{L}\p{M}*+)* matches 0 or more letters.
To only get words in ALLCAPS, use
puts text.scan(/\b(?>\p{Lu}\p{M}*+)+\b/).flatten
See the 3rd demo
Yes, you can use String#match for this. It may not be the best way, but you didn't ask if it was. You'd have to do something like this:
text.split.map { |s| s.match(/[A-Z]\w*/) }.compact.map { |md| md[0] }
#=> ["Pediatric", "PS"]
If you knew in advance that text contained two words beginning with a capital letter, you could write:
text.match(/([A-Z]\w*).*?([A-Z]\w*)/)
[$1,$2]
#=> ["Pediatric", "PS"]
Note that using a regex is not your only option:
text.delete('.,!?()[]{}').split.select { |str| ('A'..'Z').cover?(str[0]) }
#=> ["Pediatric", "PS"]

Regex: text before multiple matches

Idea. Given the string, return all the matches (with overlaps) and the text before these matches.
Example. For the text atatgcgcatatat and the query atat there are three matches, and the desired output is atat, atatgcgcatat and atatgcgcatatat.
Problem. I use Ruby 2.2 and String#scan method to get multiple matches. I've tried to use lookahead, but the regex /(?=(.*?atat))/ returns every substring that ends with atat. There must be some regex magic to solve this problem, but I can't figure out the right spell.
I believe this is at least better than the OP's answer:
text = "atatgcgcatatat"
query = "atat"
res = []
text.scan(/(?=#{query})/){res.push($` + query)} #`
res # => ["atat", "atatgcgcatat", "atatgcgcatatat"]
Given the nature and purpose of regex, there is no way to do that. When a regex matches text, there is no way to include the same text in another match. Therefore, the best option that I can think of is to use a look-behind to find the ending position of each match:
(?<=atat)
With your example input of atatgcgcatatat, that would return the following three matches:
Position 4, Length 0
Position 12, Length 0
Position 14, Length 0
You could then loop through those results, get the position for each one, and then get the sub-string that starts at the beginning of the input string and ends at that position. If you don't know how to get the positions of each match, you may find the answers to this question helpful.
You could do this:
str = 'atatgcgcatatat'
target = 'atat'
[].tap do |a|
  str.gsub(/(?=#{target})/) { a << str[0, $~.end(0)+target.size] }
end
#=> ["atat", "atatgcgcatat", "atatgcgcatatat"]
Notice that the string returned by gsub is discarded.
It seems there's no way to solve the problem in just one go.
One possible solution is to use this knowledge to get indices of matches when using String#scan, and then return the array of sliced strings:
def find_by_end text, query
  res = []
  n = query.length
  text.scan( /(?=(#{query}))/ ) do |m|
    res << text.slice(0, $~.offset(0).first + n)
  end
  res
end
find_by_end "atatgcgcatatat", "atat" #=> ["atat", "atatgcgcatat", "atatgcgcatatat"]
A slightly different solution was proposed by @StevenDoggart. Here's a nice, short snippet that uses this hack to solve the problem:
"atatgcatatat".to_enum(:scan, /(?<=atat)/).map { $` } #`
#=> ["atat", "atatgcatat", "atatgcatatat"]
As @CasimiretHippolyte notes, reversing the string might help to solve the problem. It actually does, but it's hardly the prettiest solution:
"atatgcatatat".reverse.scan(/(?=(tata.*))/).flatten.map(&:reverse).reverse
#=> ["atat", "atatgcatat", "atatgcatatat"]

In ruby, how do I use string.scan(/regex/) method for numbers from 1 to 12?

That's what I am doing:
c.scan(/[1-9]|1[0-2]/)
For some reason, it returns only numbers from 1 to 9, ignoring the second part. I tried experimenting a little bit, it seems that the method will search for 10-12 only if 1 is excluded from [1-9] part, e.g., c.scan(/[2-9]|1[0-2]/) will do. What is the reason?
P.S. I know that this method lacks lookbehinds and will search for numbers and "part of numbers" as well
Change the order of your patterns and add word boundaries if necessary.
c.scan(/\b(?:1[0-2]|[1-9])\b/)
The alternative before | is tried first at each position, so here the engine first tries to match a number from 10 to 12, and only if that fails does it try the alternative after |, which matches 1 to 9. Without boundaries this would also match the 9 in 59, so I suggest putting the alternation inside a capturing or non-capturing group and adding a word boundary \b (which matches between a word character and a non-word character) before and after the group.
DEMO
| matches left to right, and the first part of the right side (1) is always matched by the left side. Reverse them:
c.scan(/1[0-2]|[1-9]/)
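To make the ordering effect concrete, here is a small comparison of the three patterns discussed in this thread, run against a sample string chosen for this illustration (the string and variable names are not from the question):

```ruby
c = "3 10 12 7 59"

# [1-9] comes first, so the "1" of "10" and "12" is consumed on its own.
original = c.scan(/[1-9]|1[0-2]/)

# 1[0-2] comes first, so two-digit numbers win, but "59" still leaks digits.
reordered = c.scan(/1[0-2]|[1-9]/)

# Word boundaries reject digits that are part of a longer number.
bounded = c.scan(/\b(?:1[0-2]|[1-9])\b/)

p original   # prints ["3", "1", "1", "2", "7", "5", "9"]
p reordered  # prints ["3", "10", "12", "7", "5", "9"]
p bounded    # prints ["3", "10", "12", "7"]
```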
Here's another way you might consider extracting numbers between 1 and 12 (assuming that's what you want to do):
c = '14 0 11x 15 003 y12'
c.scan(/\d+/).map(&:to_i).select { |n| (1..12).cover?(n) }
#=> [11, 3, 12]
I've returned an array of integers, rather than strings, thinking that probably would be more useful, but if you want strings:
c.scan(/\d+/).map { |s| s.to_i.to_s }
.select { |s| ['10', '11', '12', *'1'..'9'].include?(s) }
#=> ["11", "3", "12"]
I see several advantages to this approach, versus using a single regex:
it's easy to understand;
the regex is simple;
it's easy to modify if the permissible values change; and
it can be broken into three pieces to facilitate testing.

How to know if a match is adjacent to the previous match

In a construction like
string.scan(regex){...}
or
string.gsub(regex){...}
how can I check whether the match for a loop cycle is adjacent to the previous one in the original string? For example, in
"abaabcaaab".scan(/a+b/){|match|
...
continued = ...
...
}
there will be three matches "ab", "aab", and "aaab". During each cycle, I want them to have the variable continued to be false, true, and false respectively because "ab" is the first match cycle, "aab" is adjacent to it, and "c" interrupts before the next match "aaab".
"ab" #=> continued = false
"aab" #=> continued = true
"aaab" #=> continued = false
Is there an anchor in Oniguruma that refers to the end of the previous matching position? If so, that may be used in the regex. If not, I probably need to use something like MatchData#offset and do some calculation in the loop.
By the way, what is \G in Oniguruma regex? I had the impression that it might be the anchor that I want, but I am not sure what it is.
I don't believe the offset data is available using those methods. You'll probably have to use Regexp#match, passing along the location each time. The returned MatchData object contains all the info you need to do any substitutions etc too.
Of course, you'll have to be careful if you are incrementing offsets in combination with doing string substitutions, if the length of the replacement is not the same as the length of the match. A common pattern here is to walk the string backwards, but I don't think you'll be able to follow that pattern with these methods, so you'll need to adjust the offsets.
EDIT | Actually, you would be able to walk the string backwards, if you do the replacement in a completely separate step. First find everything you need to replace, along with the offsets. Next, iterate that list in reverse order, doing your substitutions.
StringScanner would be well suited to this task: http://corelib.rubyonrails.org/classes/StringScanner.html
require 'strscan'
s = StringScanner.new('abaabcaaab')
begin
puts s.pos
s.scan_until(/a+b/)
puts s.matched
end while !s.matched.nil?
outputs
0
ab
2
aab
5
aaab
10
nil
So you could then just keep track of the length of the last match and the position and do the math to see if they are adjacent.
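As one possible way to do that bookkeeping, the sketch below stays with String#scan and reads the offsets from Regexp.last_match, which is set inside the scan block. The method name adjacent_matches is made up for this example, and this is only one of several ways to track the previous end position.

```ruby
# Returns [matched_text, continued] pairs, where continued is true when the
# match starts exactly where the previous match ended.
def adjacent_matches(string, regex)
  prev_end = nil
  results = []
  string.scan(regex) do
    m = Regexp.last_match            # MatchData for the current match
    results << [m[0], m.begin(0) == prev_end]
    prev_end = m.end(0)              # remember where this match stopped
  end
  results
end

p adjacent_matches("abaabcaaab", /a+b/)
# prints [["ab", false], ["aab", true], ["aaab", false]]
```

The first match is never "continued" because prev_end starts as nil.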

Ruby Regex, Only One Capture (Very Simple!)

I guess this will be a silly mistake but for me, the following returns an array containing only "M". See this:
/(.)+?/.match("Many many characters!").captures
=> ["M"]
Why doesn't it return an array of every character? I must have missed something blatantly obvious, because I can't see what's wrong with this.
Edit: Just realised, I don't need the +? but it still doesn't work without it.
Edit: Apologies! I will clarify: my goal is to allow users to enter a regular expression and styling and an input text file, wherever there is a match, the text will be surrounded with a html element and styling will be applied, I am not just splitting the string into characters, I only used the given regex because it was the simplest although that was stupid on my part. How do I get capture groups from scan() or is that not possible? I see that $1 contains "!" (last match?) and not any others.
Edit: Gosh, it really isn't my day. As injekt has informed me, the captures are stored in separate arrays. How do I get the offset of these captures from the original string? I would like to be able to get the offset of a captures then surround it with another string. Or is that what gsub is for? (I thought that only replaced the match, not a capture group)
Hopefully final edit: Right, let me just start this again :P
So, I have a string. The user will use a configuration file to enter a regular expression, then a style associated with each capture group. I need to be able to scan the entire string and get the start and finish or offset and size of each group match.
So if a user had configured ([\w-\.]+)#((?:[\w]+\.)+)([a-zA-Z]{2,4}) (email address) then I should be able to get:
[ ["elliotpotts", 0, 11],
["sample.", 12, 7],
["com", 19, 3] ]
from the string:
"elliotpotts#sample.com"
If that is not clear, there is simply something wrong with me :P. Thanks a lot so far guys, and thank you for being so patient!
Because your capture is only matching one single character. (.)+ is not the same as (.+)
>> /(.)+?/.match("Many many characters!").captures
=> ["M"]
>> /(.+)?/.match("Many many characters!").captures
=> ["Many many characters!"]
>> /(.+?)/.match("Many many characters!").captures
=> ["M"]
If you want to match every character individually, use String#scan, or String#split if you don't care about capture groups.
Using scan:
"Many many characters!".scan(/./)
#=> ["M", "a", "n", "y", " ", "m", "a", "n", "y", " ", "c", "h", "a", "r", "a", "c", "t", "e", "r", "s", "!"]
Note that other answers use (.). That's fine if you care about the capture group, but it's a little pointless if you don't, because it returns every character in its own separate Array, like this:
[["M"], ["a"], ["n"], ["y"], [" "], ["m"], ["a"], ["n"], ["y"], [" "], ["c"], ["h"], ["a"], ["r"], ["a"], ["c"], ["t"], ["e"], ["r"], ["s"], ["!"]]
Otherwise, just use split: "Many many characters!".split('')
EDIT In reply to your edit:
reg = /([\w-\.]+)#((?:[\w]+\.)+)([a-zA-Z]{2,4})/
str = "elliotpotts#sample.com"
str.scan(reg).flatten.map { |capture| [capture, str.index(capture), capture.size] }
#=> [["elliotpotts", 0, 11], ["sample.", 12, 7], ["com", 19, 3]]
Oh, and you don't need scan here: you're not really scanning, so you don't need to traverse the string, at least not with the example you provided:
str.match(reg).captures.map { |capture| [capture, str.index(capture), capture.size] }
Will also work
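One caveat with str.index(capture) is that it finds the first occurrence of the captured text, which is not necessarily where the group actually matched. A hedged alternative sketch is to read each group's position from the MatchData via begin(i) and end(i); the method name capture_spans and the sample inputs are invented for this illustration.

```ruby
# Returns [text, offset, size] for every capture group of every match,
# using the MatchData offsets rather than searching for the text again.
def capture_spans(str, regex)
  spans = []
  str.scan(regex) do
    m = Regexp.last_match
    (1...m.size).each do |i|
      next if m[i].nil?              # skip groups that did not participate
      start = m.begin(i)
      spans << [m[i], start, m.end(i) - start]
    end
  end
  spans
end

p capture_spans("a=1 b=22", /(\w+)=(\d+)/)
# prints [["a", 0, 1], ["1", 2, 1], ["b", 4, 1], ["22", 6, 2]]

# Repeated text: the third group is the "b" at offset 3, not the one at 1.
p capture_spans("ab b", /(a)(b) (b)/)
# prints [["a", 0, 1], ["b", 1, 1], ["b", 3, 1]]
```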
Yes, something important was missed ;-)
(...) only introduces ONE capture group: the number of times the group matches is irrelevant as the index is determined only by the regular expression itself and not the input.
The key is a "global regular expression", which applies the regular expression multiple times in order. In Ruby this is done by switching from Regexp#match to String#scan (many other languages have a "/g" regular expression modifier):
"Many many characters!".scan(/(.)+?/)
# but more simply (or see answers using String#split)
"Many many characters!".scan(/(.)/)
Happy coding
It's only returning one character because that's all you've asked it to match. You probably want to use scan instead:
str = "Many many characters!"
matches = str.scan(/(.)/)
The following code is from Get index of string scan results in ruby and modified for my liking.
[].tap {|results|
"abab".scan(/a/) {|capture|
results.push(([capture, Regexp::last_match.offset(0)]).flatten)
}
}
=> [["a", 0], ["a", 2]]
