What's wrong with this RegEx? - ruby

I'm trying to implement this in a small ruby script, and tested it on http://www.rubular.com/, where it worked perfectly. Not sure why its not performing in the actual script.
The RegEx: /(motion|links|sound|button|symbol)|(0.\d{8})|(\s\d{1}\s)|(\d{10}\s)/
The Text it's Against:
Trial ID: 1 | Trial Type: motion | Trick? 1
Click Time: 0.87913100 1302969732
Trial ID: 7 | Trial Type: button | Trick? 0
Click Time: 0.19817800 1302987043
etc. etc.
What I am trying to grab: Only the numbers, and the single word after "Trial Type". So for the first line of the example, I would only want " 1 motion 1 0.87913100 1302969732" to be returned. I also want to keep the space before the first number in each trial.
My short ruby script:
File.open('log.txt', 'r') do |file|
contents = file.readlines.to_s
regex = Regexp.new(/(motion|links|sound|button|symbol)|(0\.\d{8})|(\s\d{1}\s)|(\d{10}\s)/)
matchdata = regex.match(contents).to_a
matchdata.each do |match|
if match != nil
puts match
end
end
end
It only outputs two "1"s though. Hmm... I know its reading the file contents right, and when I tried an alternate simplet regex it worked fine.
Thanks for any help I get here!! : )

You want to use String#scan
matchdata = contents.scan(regex)
Also #Mike Penington is correct, you shouldn't have to do the if match != nil if you do it right. You have to clean up your regex as well. The pipe character in regex is a special character to denote match the left side OR the right side, and you have the litteral pipe character that you must escape.

You need to escape the literal pipes inside the regex, fill in other missing literals (like Trick, \?, Click\sTime:, remove some of the spaces, etc...), and insert regex spaces where appropriate... i.e.
regex = Regexp.new(/(motion|links|sound|button|symbol)\s\|\sTrick\?\s*\d\s*Click\s+Time:\s+(0\.\d{,8})\s(\d{10}))/)
EDIT: fixed parenthesis nesting in the original

If you know that the data follows a particular pattern, you can just follow that pattern in the regex, and pick up the portions you want with ( ).
/Trial ID: (\d+) \| Trial Type: (\w+) \| Trick\? (\d+) Click Time: ([\.\d]+) ([\.\d]+)/
The more you know previously about the data, the more specifically you can make the regex.
If you see some variations in the data, and the regex fails to match, then just relax the pattern:
If the Trail ID, Trail ID may include a decimal point, use [\.\d]+ instead of \d+.
If the space can be more than one, then replace it with []+
If the space can be a tab, or can be absent, use \s* or [ \t]*.
If the Trial ID: part may appear as a different phrase, replace it with .*?,
and so on.
If you are not sure how many spaces/tabs appear, use this:
/Trial\s*ID:\s*(\d+)\s*\|\s*Trial\s*Type:\s*(\w+)\s*\|\s*Trick\?\s*(\d+)\s*Click\s*Time:\s*([\.\d]+)\s+([\.\d]+)/

This is one of those times that trying to everything in a big regex makes you work too hard. Simplify things:
ary = [
'Trial ID: 1 | Trial Type: motion | Trick? 1 Click Time: 0.87913100 1302969732',
'Trial ID: 7 | Trial Type: button | Trick? 0 Click Time: 0.19817800 1302987043'
]
ary.each do |li|
numbers = li.scan(/[\d.]+/)
trial_type = li[/Trial Type: (\w+)/, 1]
puts "%d %s %d %f %d\n" % [numbers.first, trial_type, *numbers[1 .. -1]]
end
# >> 1 motion 1 0.879131 1302969732
# >> 7 button 0 0.198178 1302987043
Regex patterns are powerful, but people think it's macho to do everything in one big line. You have to weigh doing that with the increased work necessary to put together the regex in the first place, plus maintain it if something changes in the text being parsed later.

Related

Ruby Regex Match Between "foo" and "bar"

I have unfortunately wandered into a situation where I need regex using Ruby. Basically I want to match this string after the underscore and before the first parentheses. So the end result would be 'table salt'.
_____ table salt (1) [F]
As usual I tried to fight this battle on my own and with rubular.com. I got the first part
^_____ (Match the beginning of the string with underscores ).
Then I got bolder,
^_____(.*?) ( Do the first part of the match, then give me any amount of words and letters after it )
Regex had had enough and put an end to that nonsense and crapped out. So I was wondering if anyone on stackoverflow knew or would have any hints on how to say my goal to the Ruby Regex parser.
EDIT: Thanks everyone, this is the pattern I ended up using after creating it with rubular.
ingredientNameRegex = /^_+([^(]*)/;
Everything got better once I took a deep breath, and thought about what I was trying to say.
str = "_____ table salt (1) [F]"
p str[ /_{3}\s(.+?)\s+\(/, 1 ]
#=> "table salt"
That says:
Find at least three underscores
and a whitespace character (\s)
and then one or more (+) of any character (.), but as little as possible (?), up until you find
one or more whitespace characters,
and then a literal (
The parens in the middle save that bit, and the 1 pulls it out.
Try this: ^[_]+([^(]*)\(
It will match lines starting with one or more underscores followed by anything not equal to an opening bracket: http://rubular.com/r/vthpGpVr4y
Here's working regex:
str = "_____ table salt (1) [F]"
match = str.match(/_([^_]+?)\(/)
p match[1].strip # => "table salt"
You could use
^_____\s*([^(]+?)\s*\(
^_____ match the underscore from the beginning of string
\s* matches any whitespace character
( grouping start
[^(]+ matches all non ( character at least once
? matches the shortest possible string (non greedy)
) grouping end
\s* matches any whitespace character
\( find the (
"_____ table salt (1) [F]".gsub(/[_]\s(.+)\s\(/, ' >>>\1<<< ')
# => "____ >>>table salt<<< 1) [F]"
It seems to me the simplest regex to do what you want is:
/^_____ ([\w\s]+) /
That says:
leading underscores, space, then capture any combination of word chars or spaces, then another space.

Ruby regex: split string with match beginning with either a newline or the start of the string?

Here's my regular expression that I have for this. I'm in Ruby, which — if I'm not mistaken — uses POSIX regular expressions.
regex = /(?:\n^)(\*[\w+ ?]+\*)\n/
Here's my goal: I want to split a string with a regex that is *delimited by asterisks*, including those asterisks. However: I only want to split by the match if it is prefaced with a newline character (\n), or it's the start of the whole string. This is the string I'm working with.
"*Friday*\nDo not *break here*\n*But break here*\nBut again, not this"
My regular expression is not splitting properly at the *Friday* match, but it is splitting at the *But break here* match (it's also throwing in a here split). My issue is somewhere in the first group, I think: (?:\n^) — I know it's wrong, and I'm not entirely sure of the correct way to write it. Can someone shed some light? Here's my complete code.
regex = /(?:\n^)(\*[\w+ ?]+\*)\n/
str = "*Friday*\nDo not *break here*\n*But break here*\nBut again, not this"
str.split(regex)
Which results in this:
>>> ["*Friday*\nDo not *break here*", "*But break here*", "But again, not this"]
I want it to be this:
>>> ["*Friday*", "Do not *break here*", "*But break here*", "But again, not this"]
Edit #1: I've updated my regex and result. (2011/10/18 16:26 CST)
Edit #2: I've updated both again. (16:32 CST)
What if you just add a '\n' to the front of each string. That simplifies the processing quite a bit:
regex = /(?:\n)(\*[\w+ ?]+\*)\n/
str = "*Friday*\nDo not *break here*\n*But break here*\nBut again, not this"
res = ("\n"+str).split(regex)
res.shift if res[0] == ""
res
=> [ "*Friday*", "Do not *break here*",
"*But break here*", "But again, not this"]
We have to watch for the initial extra match but it's not too bad. I suspect someone can shorten this a bit.
Groups 1 & 2 of the regex below :
(?:\A|\\n)(\*.*?\*)|(?:\A|\\n)(.*?)(?=\\n|\Z)
Will give you your desired output. I am no ruby expert so you will have to create the list yourself :)
Why not just split at newlines? From your example, it looks that's what you're really trying to do.
str.split("\n")

ruby regex matching cent ¢

I am having difficulty matching string "79¢ /lb" with this regex: (\$|¢)\d+(.\d{1,2})?
It works fine when the cent symbol appears in the beginning, but I don't know what needs to be added near the end of the string.
Basically I'm planning to extract a float value from this price tag, that is, 0.79, thanks in advance, I'm using ruby.
Well, that regex requires the $ or ¢ to be at the start of the string. To match 79¢ /lb, you'll need something like:
(\d+)¢
where the ¢ comes after the digits.
A single regex to match the many varied formats that you're likely to see will be a little more complex. I would suggest either doing it as multiple regexes (for simplicity), or asking another question here specifying the full range of strings you want to capture the prices from.
It's easiest to figure out the right regex when you consider each case separately. If I understand your question correctly, there are 4 cases:
cents, with the ¢ symbol before the price
cents, with the ¢ symbol after the price
dollars (and optional cents), with the $ symbol before the price
dollars (and optional cents), with the $ symbol after the price
First, write a regex for each case separately:
¢(\d{1,2})\b
\b(\d{1,2})¢
\$(\d+(?:\.\d{2})?)\b
\b(\d+(?:\.\d{2})?)\$
Then, combine them into a single regex:
regex = %r{
¢(\d{1,2})\b | # case 1
\b(\d{1,2})¢ | # case 2
\$(\d+(?:\.\d{2})?)\b | # case 3
\b(\d+(?:\.\d{2})?)\$ # case 4
}x
Then, match to your heart's content:
string_with_prices.scan(regex) do |match|
# If there was a match in the first two groups, it's for cents
cents = $1 || $2
# ...and the last two groups are dollars.
dollars = $3 || $4
if cents
puts "found price (cents): #{cents}"
elsif dollars
puts "found price (dollars): #{dollars}"
else
puts 'unknown match!'
end
end
Note: To test this code, I had to use 'c' instead of '¢' because Ruby was telling me invalid multibyte char (US-ASCII). To avoid this issue, use a different character encoding, or else figure out the encoded value of the '¢' character and embed it directly in the regex, e.g. %r{\x42} instead of %r{¢}.
Maybe you don't need to do everything in your reg exp;
#price is the string that contains the price
if price =~ /\$|¢/
value = string.match(/\d+/)
end
Or something along those lines.

Ruby MatchData class is repeating captures, instead of including additional captures as it "should"

Ruby 1.9.1, OSX 10.5.8
I'm trying to write a simple app that parses through of bunch of java based html template files to replace a period (.) with an underscore if it's contained within a specific tag. I use ruby all the time for these types of utility apps, and thought it would be no problem to whip up something using ruby's regex support. So, I create a Regexp.new... object, open a file, read it in line by line, then match each line against the pattern, if I get a match, I create a new string using replaceString = currentMatch.gsub(/./, '_'), then create another replacement as whole string by newReplaceRegex = Regexp.escape(currentMatch) and finally replace back into the current line with line.gsub(newReplaceRegex, replaceString) Code below, of course, but first...
The problem I'm having is that when accessing the indexes within the returned MatchData object, I'm getting the first result twice, and it's missing the second sub string it should otherwise be finding. More strange, is that when testing this same pattern and same test text using rubular.com, it works as expected. See results here
My pattern:
(<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+.)+(?:[a-zA-Z0-9]+)(?:>))
Text text:
<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>moreNonMatchingText<WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText
Here's the relevant code:
tagRegex = Regexp.new('(<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>))+')
testFile = File.open('RegexTestingCompFix.txt', "r+")
lineCount=0
testFile.each{|htmlLine|
lineCount += 1
puts ("Current line: #{htmlLine} at line num: #{lineCount}")
tagMatch = tagRegex.match(htmlLine)
if(tagMatch)
matchesArray = tagMatch.to_a
firstMatch = matchesArray[0]
secondMatch = matchesArray[1]
puts "First match: #{firstMatch} and second match #{secondMatch}"
tagMatch.captures.each {|lineMatchCapture|
puts "Current capture for tagMatches: #{lineMatchCapture} of total match count #{matchesArray.size}"
#create a new regex using the match results; make sure to use auto escape method
originalPatternString = Regexp.escape(lineMatchCapture)
replacementRegex = Regexp.new(originalPatternString)
#replace any periods with underscores in a copy of lineMatchCapture
periodToUnderscoreCorrection = lineMatchCapture.gsub(/\./, '_')
#replace original match with underscore replaced copy within line
htmlLine.gsub!(replacementRegex, periodToUnderscoreCorrection)
puts "The modified htmlLine is now: #{htmlLine}"
}
end
}
I would think that I should get the first tag in matchData[0] then the second tag in matchData1, or, what I'm really doing because I don't know how many matches I'll get within any given line is matchData.to_a.each. And in this case, matchData has two captures, but they're both the first tag match
which is: <WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>
So, what the heck am I doing wrong, why does rubular test give me the expected results?
You want to use the on String#scan instead of the Regexp#match:
tag_regex = /<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>)/
lines = "<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>moreNonMatchingText\
<WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText"
lines.scan(tag_regex)
# => ["<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>", "<WEBOBJECT NAME=admin.SecondLineMatch>"]
A few recommendations for next ruby questions:
newlines and spaces are your friends, you don't loose points for using more lines on your code ;-)
use do-end on blocks instead of {}, improves readability a lot
declare variables in snake case (hello_world) instead of camel case (helloWorld)
Hope this helps
I ended up using the String.scan approach, the only tricky point there was figuring out that this returns an array of arrays, not a MatchData object, so there was some initial confusion on my part, mostly due to my ruby green-ness, but it's working as expected now. Also, I trimmed the regex per Trevoke's suggestion. But snake case? Never...;-) Anyway, here goes:
tagRegex = /(<(?:webobject) (?:name)=(?:\w+\.)+(?:\w+)(?:>))/i
testFile = File.open('RegexTestingCompFix.txt', "r+")
lineCount=0
testFile.each do |htmlLine|
lineCount += 1
puts ("Current line: #{htmlLine} at line num: #{lineCount}")
oldMatches = htmlLine.scan(tagRegex) #oldMatches thusly named due to not explicitly using Regexp or MatchData, as in "the old way..."
if(oldMatches.size > 0)
oldMatches.each_index do |index|
arrayMatch = oldMatches[index]
aMatch = arrayMatch[0]
#create a new regex using the match results; make sure to use auto escape method
replacementRegex = Regexp.new(Regexp.escape(aMatch))
#replace any periods with underscores in a copy of lineMatchCapture
periodToUnderscoreCorrection = aMatch.gsub(/\./, '_')
#replace original match with underscore replaced copy within line, matching against the new escaped literal regex
htmlLine.gsub!(replacementRegex, periodToUnderscoreCorrection)
puts "The modified htmlLine is now: #{htmlLine}"
end # I kind of still prefer the brackets...;-)
end
end
Now, why does MatchData work the way it does? It seems like it's behavior is a bug really, and certainly not very useful in general if you can't get it provide a simple means of accessing all the matches. Just my $.02
Small bits:
This regexp helps you get "normalMode" .. But not "secondLineMatch":
<webobject name=\w+\.((?:\w+)).+> (with option 'i', for "case insensitive")
This regexp helps you get "secondLineMatch" ... But not "normalMode":
<webobject name=\w+\.((?:\w+))> (with option 'i', for "case insensitive").
I'm not really good at regexpt but I'll keep toiling at it.. :)
And I don't know if this helps you at all, but here's a way to get both:
<webobject name=admin.(\w+) (with option 'i').

Ruby Regex match unless escaped with \

Using Ruby I'm trying to split the following text with a Regex
~foo\~\=bar =cheese~monkey
Where ~ or = denotes the beginning of match unless it is escaped with \
So it should match
~foo\~\=bar
then
=cheese
then
~monkey
I thought the following would work, but it doesn't.
([~=]([^~=]|\\=|\\~)+)(.*)
What is a better regex expression to use?
edit To be more specific, the above regex matches all occurrences of = and ~
edit Working solution. Here is what I came up with to solve the issue. I found that Ruby 1.8 has look ahead, but doesn't have lookbehind functionality. So after looking around a bit, I came across this post in comp.lang.ruby and completed it with the following:
# Iterates through the answer clauses
def split_apart clauses
reg = Regexp.new('.*?(?:[~=])(?!\\\\)', Regexp::MULTILINE)
# need to use reverse since Ruby 1.8 has look ahead, but not look behind
matches = clauses.reverse.scan(reg).reverse.map {|clause| clause.strip.reverse}
matches.each do |match|
yield match
end
end
What does "remove the head" mean in this context?
If you want to remove everything before a certain char, this will do:
.*?(?<!\\)= // anything up to the first "=" that is not preceded by "\"
.*?(?<!\\)~ // same, but for the squiggly "~"
.*?(?<!\\)(?=~) // same, but excluding the separator itself (if you need that)
Replace by "", repeat, done.
If your string has exactly three elements ("1=2~3") and you want to match all of them at once, you can use:
^(.*?(?<!\\)(?:=))(.*?(?<!\\)(?:~))(.*)$
matches: \~foo\~\=bar =cheese~monkey
| 1 | 2 | 3 |
Alternatively, you split the string using this regex:
(?<!\\)[=~]
returns: ['\~foo\~\=bar ', 'cheese', 'monkey'] for "\~foo\~\=bar =cheese~monkey"
returns: ['', 'foo\~\=bar ', 'cheese', 'monkey'] for "~foo\~\=bar =cheese~monkey"

Resources