Ruby Regex, Only One Capture (Very Simple!) - ruby

This is probably a silly mistake on my part, but the following returns an array containing only "M":
/(.)+?/.match("Many many characters!").captures
=> ["M"]
Why doesn't it return an array of every character? I must have missed something blatantly obvious, because I can't see what's wrong with this.
Edit: Just realised I don't need the +?, but it still doesn't work without it.
Edit: Apologies! I will clarify. My goal is to let users enter a regular expression, some styling and an input text file; wherever there is a match, the text will be surrounded with an HTML element and the styling applied. I am not just splitting the string into characters. I only used the given regex because it was the simplest, although that was stupid on my part. How do I get capture groups from scan(), or is that not possible? I see that $1 contains "!" (the last match?) and not any of the others.
Edit: Gosh, it really isn't my day. As injekt has informed me, the captures are stored in separate arrays. How do I get the offset of these captures in the original string? I would like to be able to get the offset of a capture and then surround it with another string. Or is that what gsub is for? (I thought that only replaced the whole match, not a capture group.)
Hopefully final edit: Right, let me just start this again :P
So, I have a string. The user will use a configuration file to enter a regular expression, then a style associated with each capture group. I need to be able to scan the entire string and get the start and finish or offset and size of each group match.
So if a user had configured ([\w-\.]+)#((?:[\w]+\.)+)([a-zA-Z]{2,4}) (email address) then I should be able to get:
[ ["elliotpotts", 0, 11],
["sample.", 12, 7],
["com", 19, 3] ]
from the string:
"elliotpotts#sample.com"
If that is not clear, there is simply something wrong with me :P. Thanks a lot so far guys, and thank you for being so patient!

Because your capture is only matching one single character. (.)+ is not the same as (.+)
>> /(.)+?/.match("Many many characters!").captures
=> ["M"]
>> /(.+)?/.match("Many many characters!").captures
=> ["Many many characters!"]
>> /(.+?)/.match("Many many characters!").captures
=> ["M"]
If you want to match every character, use String#scan, or String#split if you don't care about capture groups.
Using scan:
"Many many characters!".scan(/./)
#=> ["M", "a", "n", "y", " ", "m", "a", "n", "y", " ", "c", "h", "a", "r", "a", "c", "t", "e", "r", "s", "!"]
Note that other answers use (.). That's fine if you care about the capture group, but a little pointless if you don't, because it returns every character in its own separate Array, like this:
[["M"], ["a"], ["n"], ["y"], [" "], ["m"], ["a"], ["n"], ["y"], [" "], ["c"], ["h"], ["a"], ["r"], ["a"], ["c"], ["t"], ["e"], ["r"], ["s"], ["!"]]
Otherwise, just use split: "Many many characters!".split('')
EDIT In reply to your edit:
reg = /([\w-\.]+)#((?:[\w]+\.)+)([a-zA-Z]{2,4})/
str = "elliotpotts#sample.com"
str.scan(reg).flatten.map { |capture| [capture, str.index(capture), capture.size] }
#=> [["elliotpotts", 0, 11], ["sample.", 12, 7], ["com", 19, 3]]
Oh, and you don't need scan here. With the example you provided there is only one match, so there is nothing to traverse:
str.match(reg).captures.map { |capture| [capture, str.index(capture), capture.size] }
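This will also work. Note that str.index(capture) returns the first occurrence of the captured text, so the offsets are only right when that text does not also appear earlier in the string.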
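If the captured text can occur more than once, the offsets that MatchData records for each group avoid the str.index lookup entirely. A minimal sketch, assuming a helper name (capture_spans) that is mine and not from the question, and using a lightly tidied version of the email regex:

```ruby
# Hypothetical helper: [text, start, length] for each capture group of the
# first match, read from the offsets stored in MatchData.
def capture_spans(str, regex)
  m = regex.match(str)
  return [] if m.nil?

  (1...m.size).each_with_object([]) do |i, spans|
    next if m[i].nil? # optional group that did not take part in the match
    spans << [m[i], m.begin(i), m[i].size]
  end
end

reg = /([\w.-]+)#((?:\w+\.)+)([a-zA-Z]{2,4})/
p capture_spans("elliotpotts#sample.com", reg)
# => [["elliotpotts", 0, 11], ["sample.", 12, 7], ["com", 19, 3]]

# Repeated text: str.index("ab") would report 0 for both groups.
p capture_spans("ab-ab", /(ab)-(ab)/)
# => [["ab", 0, 2], ["ab", 3, 2]]
```

MatchData#begin(i) and MatchData#end(i) give the start and (exclusive) end of group i, so the length could equally be computed as m.end(i) - m.begin(i).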

Yes, something important was missed ;-)
(...) only introduces ONE capture group: the number of times the group matches is irrelevant as the index is determined only by the regular expression itself and not the input.
The key is a "global regular expression", which applies the regular expression repeatedly across the input. In Ruby this is done by switching from Regexp#match to String#scan (many other languages have a "/g" regular expression modifier):
"Many many characters!".scan(/(.)+?/)
# but more simply (or see answers using String#split)
"Many many characters!".scan(/(.)/)
Happy coding
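To make the "one group, one slot" point concrete, here is a small sketch on a shorter string. It also suggests why $1 ended up holding "!" in the question: a repeated group keeps only its last repetition.

```ruby
m = /(.)+/.match("abc")

p m[0]               # the whole match
# => "abc"
p m.captures         # one group, holding only its last repetition
# => ["c"]
p "abc".scan(/(.)/)  # scan re-applies the regex, one result per match
# => [["a"], ["b"], ["c"]]
```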

It's only returning one character because that's all you've asked it to match. You probably want to use scan instead:
str = "Many many characters!"
matches = str.scan(/(.)/)

The following code is from Get index of string scan results in ruby and modified for my liking.
[].tap {|results|
"abab".scan(/a/) {|capture|
results.push(([capture, Regexp::last_match.offset(0)]).flatten)
}
}
=> [["a", 0, 1], ["a", 2, 3]]
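The same Regexp.last_match trick is available inside a gsub block, which gets close to the original goal of surrounding each capture with markup. This is only a sketch under assumptions of mine: the helper name wrap_captures and the class names are made up, the groups must not be nested, and nothing is HTML-escaped.

```ruby
# Hypothetical helper: wraps each capture group of every match in a <span>,
# using classes[0] for group 1, classes[1] for group 2, and so on.
# Assumes the groups are not nested; does no HTML escaping.
def wrap_captures(str, regex, classes)
  str.gsub(regex) do
    m = Regexp.last_match
    out = "".dup
    pos = m.begin(0)
    (1...m.size).each do |i|
      next if m[i].nil?               # group did not take part in this match
      out << str[pos...m.begin(i)]    # text between the previous group and this one
      out << %(<span class="#{classes[i - 1]}">#{m[i]}</span>)
      pos = m.end(i)
    end
    out << str[pos...m.end(0)]        # rest of the match after the last group
  end
end

reg = /([\w.-]+)#((?:\w+\.)+)([a-zA-Z]{2,4})/
puts wrap_captures("elliotpotts#sample.com", reg, %w[user domain tld])
# <span class="user">elliotpotts</span>#<span class="domain">sample.</span><span class="tld">com</span>
```

The block's return value replaces the whole match, so text outside any match is left untouched by gsub.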

Related

Better way to tokenize string in Ruby?

I have a string of arbitrary characters. I would like to turn it into an array, where each character is in a single array element, EXCEPT of successive word-characters (\w+), which should end up together in one array element. Example:
ab.:u/87z
should become
['ab','.',':','u','/','87z']
My first approach went like this:
mystring.split(/\b/)
Of course this groups together non-word characters:
['ab','.:','u','/','87z']
I can take them apart in a subsequent step, but I'm looking for a more elegant way. Next I tried these:
mystring.split(/(\w+|\W)/)
mystring.split(/(\b|\W)/)
Both return nearly the desired result, only that they also return array elements containing empty strings, so I have to write something like
mystring.split(/(\b|\W)/).reject(&:empty?)
Now my question: Is there a simpler way to do this?
UPDATE: I made a silly mistake when I explained my example. Of course '87' and 'z' should be together, i.e. '87z'. I fixed my example.
'ab.:u/87z'.scan(/\w+|./) #=>["ab", ".", ":", "u", "/", "87z"]
I'm not exactly sure what you want, because you said word characters (\w+) but split the 87 and z. If I'm correct, \w matches letters, digits and underscores, hence the "87z". You could always do this to achieve what you showed there, though:
'ab.:u/87z'.scan(/[A-Za-z]+|\d+|./) #=>["ab", ".", ":", "u", "/", "87", "z"]
Don't use split, use the scan method:
> "ab.:u/87z".scan(/\w+|\W/)
=> ["ab", ".", ":", "u", "/", "87z"]
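One difference between the two scan patterns above shows up only when the input contains newlines: in Ruby, . does not match "\n" unless the /m flag is given, whereas \W does. A small sketch with a made-up input string:

```ruby
input = "ab\n.c"

p input.scan(/\w+|./)   # the newline is silently dropped
# => ["ab", ".", "c"]
p input.scan(/\w+|\W/)  # the newline is kept as its own token
# => ["ab", "\n", ".", "c"]
p input.scan(/\w+|./m)  # /m lets . match the newline as well
# => ["ab", "\n", ".", "c"]
```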

String diff to show positions of changed characters?

I'm looking for a way to diff two strings and return the index value of where the changes start and finish.
I'm already using diff-lcs to find out which lines have changed, but I need to figure out the positions of which characters have changed. I need the positions of the new characters so I can handle them with JavaScript, not the actual text, which is what most diff tools seem to give.
So, for example if I have this string:
The brown fox jumps over the lazy dog
and compare to this string:
The red fox jumps over the crazy dog
I would like to see something like:
[[5,8],[28,33]]
Those numbers being the position where the new characters are found.
Does anyone have any idea how I might get this done?
How about the Google diff-match-patch code? https://github.com/elliotlaster/Ruby-Diff-Match-Patch
I've used it in the past and been happy with the results.
Taken from the documentation linked above:
# Diff-ing
dmp.diff_main("Apples are a fruit.", "Bananas are also fruit.", false)
=> [[-1, "Apple"], [1, "Banana"], [0, "s are a"], [1, "lso"], [0, " fruit."]]
You would just need to iterate through the non-matches and find the character position in the appropriate string.
pos_ary = s.enum_for(:scan, /search_string/).map { Regexp.last_match.begin(0) }
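Alternatively, the positions can be computed from the diff itself by walking it and keeping a running offset into the new string. A rough sketch, assuming the diff is an array of [op, text] pairs shaped like the documentation excerpt above (-1 delete, 1 insert, 0 equal); the helper name inserted_ranges is mine, and it does not call the gem:

```ruby
# Hypothetical helper: [start, end) index pairs, relative to the NEW string,
# for every inserted chunk in a diff of [op, text] pairs.
def inserted_ranges(diffs)
  pos = 0
  diffs.each_with_object([]) do |(op, text), ranges|
    case op
    when 1                      # insertion: present in the new string
      ranges << [pos, pos + text.size]
      pos += text.size
    when 0                      # equal: present in both, just advance
      pos += text.size
    end                         # deletion (-1): not in the new string, skip
  end
end

diffs = [[-1, "Apple"], [1, "Banana"], [0, "s are a"], [1, "lso"], [0, " fruit."]]
p inserted_ranges(diffs)
# => [[0, 6], [13, 16]]
```

Here the end index is exclusive, so "Bananas are also fruit."[0...6] is "Banana" and [13...16] is "lso".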

ruby - %w(%q(words in a row)) returns wrong elements

How can I get this expression to return an array of words?
%w(%q(words in a row))
I thought that %q would give me a string and then %w would give me an array of the words.
But instead I get
["%q(words", "in", "a", "row")"]
This is part of some larger code so just using %w on its own will not help.
I want to be able to interpolate the %q expression first.
Lower case %w and %q do not interpolate. You need to use upper case %W and %Q, and you also need to wrap the expression in #{} to interpolate it.
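Your code with interpolation enabled (note that the interpolated text is not split again, so this yields ["words in a row"]; call split on the string if you need the individual words):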
%W{#{%q{words in a row}}}
However just as Justin said, I do not understand the point of this. You can directly put raw string without quotes in %w{}.
Maybe I am misunderstanding the problem. But if you want to interpolate the string and then convert to a word array, I believe the %W does this, noting the capitalization (see here).
%W(words in a row)
#=> ["words", "in", "a", "row"]
I doubt the usefulness of doing such a thing, and there is definitely a smell in your code, but here is a way:
eval("%w(#{%q(words in a row)})")
# => ["words", "in", "a", "row"]
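If the string really does come from elsewhere in the larger code, a plain String#split avoids both the nested percent literals and eval. A minimal sketch; the variable name phrase is just for illustration:

```ruby
phrase = %q(words in a row)

p phrase.split
# => ["words", "in", "a", "row"]

# For comparison: text interpolated into %W stays a single element.
p %W[#{phrase}]
# => ["words in a row"]
```

String#split with no arguments splits on runs of whitespace, which is the same splitting rule %w applies to its literal contents.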

Ruby: How can I process a CSV file with "bad commas"?

I need to process a CSV file from FedEx.com containing shipping history. Unfortunately FedEx doesn't seem to actually test its CSV files as it doesn't quote strings that have commas in them.
For instance, a company name might be "Dog Widgets, Inc." but the CSV doesn't quote that string, so any CSV parser thinks that comma before "Inc." is the start of a new field.
Is there any way I can reliably parse those rows using Ruby?
The only differentiating characteristic that I can find is that the commas that are part of a string have a space after them. Commas that separate fields have no spaces. No clue how that helps me parse this, but it is something I noticed.
You can use a negative lookahead:
>> "foo,bar,baz,pop, blah,foobar".split(/,(?![ \t])/)
=> ["foo", "bar", "baz", "pop, blah", "foobar"]
Well, here's an idea: You could replace each instance of comma-followed-by-a-space with a unique character, then parse the CSV as usual, then go through the resulting rows and reverse the replace.
Perhaps something along these lines..
using gsub to change the ', ' to something else
ruby-1.9.2-p0 > "foo,bar,baz,pop, blah,foobar".gsub(/,\ /,'| ').split(',')
[
[0] "foo",
[1] "bar",
[2] "baz",
[3] "pop| blah",
[4] "foobar"
]
and then remove the | afterwards.
If you are so lucky as to only have one field like that, you can parse the leading fields off the start, the trailing fields off the end and assume whatever is left is the offending field. In python (no habla ruby) this would look something like:
fields = line.split(',') # doesn't work if some fields are quoted
fields = fields[:5] + [','.join(fields[5:-3])] + fields[-3:]
Whatever you do, you should be able at a minimum determine the number of offending commas and that should give you something (a sanity check if nothing else).
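For reference, a rough Ruby equivalent of the Python idea above, with the count of offending commas available as a sanity check. The helper name rejoin_middle and the sample row are made up, and it assumes exactly one field can contain stray commas:

```ruby
# Hypothetical helper: keep `head` leading and `tail` trailing fields and
# glue everything in between back into a single field.
def rejoin_middle(line, head, tail)
  fields = line.chomp.split(",", -1)        # -1 keeps trailing empty fields
  extra = fields.size - (head + tail + 1)   # number of offending commas
  raise ArgumentError, "too few fields in: #{line.inspect}" if extra < 0

  middle = fields[head, extra + 1].join(",")
  fields.first(head) + [middle] + fields.last(tail)
end

row = "a,b,c,d,e,Dog Widgets, Inc.,x,y,z"
p rejoin_middle(row, 5, 3)
# => ["a", "b", "c", "d", "e", "Dog Widgets, Inc.", "x", "y", "z"]
```

Like the Python version, this does not work if some fields are quoted.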

How can I extract a variable number of sub-matches from a Ruby regex?

I have some strings that I would like to pattern match and then extract out the matches as variables $1, $2, etc.
The pattern matching code I have is
a = /^([\+|\-]?[1-9]?)([C|P])(?:([\+|\-][1-9]?)([C|P]))*$/i.match(field)
puts "result = #{a.to_a.inspect}"
With the above I am able to easily match the following sample strings:
"C", "+2C", "2c-P", "2C-3P", "P+C"
And I have confirmed all of these work on the Rubular website.
However, when I try to match "+2P-c-3p", it does match, but the MatchData "array-like object" looks like this:
result = ["+2P-C-3P", "+2", "P", "-3", "P"]
The problem is that I am unable to extract into the array, the middle pattern "-C".
What I would expect to see is:
result = ["+2P-C-3P", "+2", "P", "-", "C", "-3", "P"]
It seems to extract only the end part "-3P" as "-3" and "P"
Does anyone know how I can modify my pattern to capture the middle matches ?
As another example, for +3c+2p-c-4p I would expect:
["+3c+2p-c-4p", "+3", "C", "+2", "P", "-", "C", "-4", "P"]
but what I get is
["+3c+2p-c-4p", "+3", "C", "-4", "P"]
which completely misses the middle part.
You have a profound (but common) misunderstanding of how character classes work. This:
[C|P]
is wrong. Unless you want to match pipe | characters. There is no alternation in character classes - they are not like groups. This would be correct:
[CP]
Also, most meta-characters lose their special meaning inside a character class, so you only need to escape very few characters (namely the closing square bracket ] and the dash -, unless you put the dash at the end of the class). So your regex reduces to:
^([+-]?\d?)([CP])(?:([+-]?\d?)([CP]))*$
Your second misunderstanding is that group count is dynamic - that you somehow have more groups in the result because more matches occurred in the string. This is not the case.
You have exactly as many groups in your result as you have parentheses pairs in your regex (less the number of non-capturing groups of course). In this case, that number is 4. No more, no less.
If a group matches multiple times, only the contents of the last match occurrence will be retained. There is no way (in Ruby) to get the contents of previous match occurrences for that group.
As an alternative, you could regex-split the string into its meaningful parts and then parse them in a loop to extract all info.
This is what I managed to do :
([+-]?\d?)(C|P)(?=(?:[+-]?\d?[CP])*$)
This way you capture multiple elements.
The only problem is the validity of the string. As Ruby 1.8 doesn't have look-behind (Ruby 1.9 and later do), I can't check the start of the string, so zerhyju+2P-C-3P is valid (but will only capture +2P-C-3P) whereas +2P-C-3Pzertyuio isn't valid.
If you want to both capture and check if your string is valid, the best way (IMO) is to use two regexes, one to check the value ^(?:[+-]?\d?[CP])*$ and a second one to capture ([+-]?\d?)(C|P) (You could also use ([CP]) for the last part).
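The two-regex approach could be sketched roughly like this. The constant and method names are mine, and I've used \A...\z with + (rather than ^...$ with *) so that empty and multi-line input is rejected; treat that as an assumption about what "valid" should mean:

```ruby
# One regex validates the whole field, a second one extracts every term.
VALID_FIELD = /\A(?:[+-]?\d?[CP])+\z/i
TERM        = /([+-]?\d?)([CP])/i

# Hypothetical helper: array of [coefficient, letter] pairs, or nil if the
# field as a whole is not valid.
def parse_terms(field)
  return nil unless field =~ VALID_FIELD
  field.scan(TERM)
end

p parse_terms("+2P-c-3p")
# => [["+2", "P"], ["-", "c"], ["-3", "p"]]
p parse_terms("zerhyju+2P-C-3P")
# => nil
```

Because scan returns one array per match, the number of terms no longer depends on the number of groups in the regex.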
