Problem capturing data inside of a capture that is optional - ruby

It's best to start with an example and what I've gotten so far.
Sample Data:
FOO foo#acme.com 5545
<Data><Name>tester</Name><Foo>bar</Foo></Data>
Current regex:
/FOO\s(.{1,20}#[^\s]+)\s.{0,20}\s{1,2}(<Data>.{0,100}<Name>(.{0,20})<\/Name>.{0,100}<\/Data>)?/m
Matches from regex:
foo#acme.com
testerbar
tester
I've wrapped the <Data> section in parenthesis followed-by a ? because the entire data section may or may not exist. However, the <Name> section is also optional, it may or may not exist. So I tried putting parenthesis around <Name> with a question mark as well but then I don't get the matches:
/FOO\s(.{1,20}#[^\s]+)\s.{0,20}\s{1,2}(<Data>.{0,100}(<Name>(.{0,20})<\/Name>)?.{0,100}<\/Data>)?/m
I've posted my regex and sample data on a regex site to make it easier to test/validate what I'm trying to do: http://www.rubular.com/r/ZhQzlNp1vv
In the <Data> section there is <Name> and even <Foo>. The point is, there may be many different elements in <Data> and I only care about extracting data from some of them. I need to use regex for my particular situation so please don't suggest using some XML parsing library (thanks!).
Thanks in advance.

/FOO\s(\S+#\S+).*?\n(?:.{0,100}(.{0,20})</Name>.{0,100}</Data>)?/m
http://www.rubular.com/r/IhisH7HYJR

To capture an optional group, use a non-capturing group to indicate the optionality inside a capturing group.
i.e.
((?:content)?)
The outer parentheses form the capturing group - if the optional group doesn't match you get an empty string. The (?:...) is the non-capturing group, which allows you to group the content (so it can all be made optional) without capturing it.
Update:
Whenever you have a complex regex, use free-spacing comment mode (flag=x) to make it readable (and thus far easier to figure out what's going on), like this:
FOO\s(.{1,20}#[^\s]+)\s.{0,20}\s{1,2}
((?:<Data>
# upto 200 chars, excluding captured tags or end tag (repeated below)
(?:(?!<Name>|<Foo>|<Bob>|<\/Data>).){0,200}
# Capture 3:
((?:<Name>.{0,20}<\/Name>)?)
(?:(?!<Name>|<Foo>|<Bob>|<\/Data>).){0,200}
# Capture 4:
((?:<Foo>.{0,20}<\/Foo>)?)
(?:(?!<Name>|<Foo>|<Bob>|<\/Data>).){0,200}
# Capture 5:
((?:<Bob>.{0,20}<\/Bob>)?)
(?:(?!<Name>|<Foo>|<Bob>|<\/Data>).){0,200}
<\/Data>)?)
Which at rubular results in:
1. foo#acme.com
2. <Data><Name>tester</Name><Foo>bar</Foo></Data>
3. <Name>tester</Name>
4. <Foo>bar</Foo>
5.
Annoyingly rubular doesn't seem to provide a multi-line editor when x is turned on, which sucks, and it also doesn't support standard comment syntax, so I had to change those #... to (?#...) which is less readable. Oh well.
If you need the values without the tags, you'll need a separate expression to strip those.
( Or, y'know, use a tool actually designed for the job. ;) )

Related

Multi-Line Regex: Find A where B is absent

I have been looking through a lot on Regex lately and have seen a lot of answers involving the matching of one word, where a second word is absent. I have seen a lot of Regex Examples where I can have a Regex search for a given word (or any more complex regex in its place) and find where a word is missing.
It seems like the works very well on a line by line basis, but after including the multi-line mode it still doesn't seem to match properly.
Example: Match an entire file string where the word foo is included, but the word bar is absent from the file. What I have so far is (?m)^(?=.*?(foo))((?!bar).)*$ which is based off the example link. I have been testing with a Ruby Regex tester, but I think it is a open ended regex problem/question. It seems to match smaller pieces, I would like to have it either match/not match on the entire string as one big chunk.
In the provided example above, matches are found on a line by line basis it seems. What changes need to be made to the regex so it applies over the ENTIRE string?
EDIT: I know there are other more efficient ways to solve this problem that doesn't involve using a regex. I am not looking for a solution to the problem using other means, I am asking from a theoretical regex point of view. It has a multi-line mode (which looks to "work"), it has negative/positive searching which can be combined on a line by line basis, how come combining these two principals doesn't yield the expected result?
Sawa's answer can be simplified, all that's needed is a positive lookahead, a negative lookahead, and since you're in multiline mode, .* takes care of the rest:
/(?=.*foo)(?!.*bar).*/m
Multiline means that . matches \n also, and matches are greedy. So the whole string will match without the need for anchors.
Update
#Sawa makes a good point for the \A being necessary but not the \Z.
Actually, looking at it again, the positive lookahead seems unnecessary:
/\A(?!.*bar).*foo.*/m
A regex that matches an entire string that does not include foo is:
/\A(?!.*foo.*).*\z/m
and a regex that matches from the beginning of an entire string that includes bar is:
/\A.*bar/m
Since you want to satisfy both of these, take a conjunction of these by putting one of them in a lookahead:
/\A(?=.*bar)(?!.*foo.*).*\z/m

Block Indent Regex

I'm having problems about a regexp.
I'm trying to implement a regex to select just the tab indent blocks, but i cant find a way of make it work:
Example:
INDENT(1)
INDENT(2)
CONTENT(a)
CONTENT(b)
INDENT(3)
CONTENT(c)
So I need blocks like:
INDENT(2)
CONTENT(a)
CONTENT(b)
AND
INDENT(3)
CONTENT(c)
How I can do this?
really tks, its almost that, here is my original need:
table
tr
td
"joao"
"joao"
td
"marcos"
I need separated "td" blocks, could i adapt your example to that?
It depends on exactly what you are trying to do, but maybe something like this:
^(\t+)(\S.*)\n(?:\1\t.*\n)*
Working example: http://www.rubular.com/r/qj3WSWK9JR
The pattern searches for:
^(\t+)(\S.*)\n - a line that begins with a tab (I've also captured the first line in a group, just to see the effect), followed by
(?:\1\t.*\n)* - lines with more tabs.
Similarly, you can use ^( +)(\S.*)\n(?:\1 .*\n)* for spaces (example). Mixing spaces and tabs may be a little problematic though.
For the updated question, consider using ^(\t{2,})(\S.*)\n(?:\1\t.*\n)*, for at least 2 tabs at the beginning of the line.
You could use the following regex to get the groups...
[^\s]*.*\r\n(?:\s+.*\r*\n*)*
this requires that your lines not begin with white space for the beginning of the blocks.

Ruby RegEx issue

I'm having a problem getting my RegEx to work with my Ruby script.
Here is what I'm trying to match:
http://my.test.website.com/{GUID}/{GUID}/
Here is the RegEx that I've tested and should be matching the string as shown above:
/([-a-zA-Z0-9#:%_\+.~#?&\/\/=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9#:%_\+.~#?&\/\/=]*)([\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/])*?\/)/
3 capturing groups:
group 1: ([-a-zA-Z0-9#:%_\+.~#?&\/\/=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9#:%_\+.~#?&\/\/=]*)([\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/])*?\/)
group 2: (\/[-a-zA-Z0-9#:%_\+.~#?&\/\/=]*)
group 3: ([\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/])
Ruby is giving me an error when trying to validate a match against this regex:
empty range in char class: (My RegEx goes here) (SyntaxError)
I appreciate any thoughts or suggestions on this.
You could simplify things a bit by using URI to deal parsing the URL, \h in the regex, and scan to pull out the GUIDs:
uri = URI.parse(your_url)
path = uri.path
guids = path.scan(/\h{8}-\h{4}-\h{4}-\h{4}-\h{12}/)
If you need any of the non-path components of the URL the you can easily pull them out of uri.
You might need to tighten things up a bit depending on your data or it might be sufficient to check that guids has two elements.
You have several errors in your RegEx. I am very sleepy now, so I'll just give you a hint instead of a solution:
...[\/\/[0-9a-fA-F]....
the first [ does not belong there. Also, having \/\/ inside [] is unnecessary - you only need each character once inside []. Also,
...[-a-zA-Z0-9#:%_\+.~#?&\/\/=]{2,256}...
is greedy, and includes a period - indeed, includes all chars (AFAICS) that can come after it, effectively swallowing the whole string (when you get rid of other bugs). Consider {2,256}? instead.

Regular expression syntax

I have a similar problem, to a previously asked question. But similar practices apparently do not produce similar results.
Previous Question
New question - I want to match the lines beginning in T as the first match, and the following lines beginning with X as the second match (as a whole string, to be later matched by another regex)
What I have so far is (^T(\d+)\n(.*?)(?:the_problem)/m) I don't know what to replace "the_problem" with, or even if that is the issue. I assumed some rendition (?:\n|\z), but apparently not. Everything I tried, would not count the next occurrence of ^T(\d+) as the start of a new group, and continue to capture all of the lines between each occurrence, at the same time.
Sample text;
T01C0.025
T02C0.035
T03C0.055
T04C0.150
T05C0.065
T06C0.075
%
G05
G90
T01
X011200Y004700
X011200Y009700
X018500Y011200
X013500Y-011200
X023800Y019500
T02
X034800Y017800
X-033800Y-017800
X032800Y017800
T03
X036730Y003000
X038700Y003000
X040668Y-003000
X059230Y003000
T04
X110580Y017800
X023800Y027300
X095500Y028500
X005500Y-006500
X021500Y-006500
T05
X003950Y002000
X003950Y004500
X003950Y007000
T06
X026300Y027300
M30
I only want to capture the shorter version of T01, T02,...T0n, not the longer version at the top, then the entire collection of ^X(-?\d+)Y(-?\d+) that follows it, as another match.
Result 1.
Match 1. T01
Match 2. X011200Y004700
X011200Y009700
X018500Y011200
X013500Y-011200
X023800Y019500
Result 2.
Match 1. T02
Match 2. X034800Y017800
X-033800Y-017800
X032800Y017800
Result 3.
Match 1. T03
Match 2. X036730Y003000
X038700Y003000
....etc....
Thanks in advance for any help ;-) Note: I prefer to use raw Ruby, without extensions or plugins. My version of ruby is 1.8.6.
Try this instead:
^(T[^\s]+)[\n\r\s]((?:(?:X\S+)[\n\r\s])+)
It makes the groups for the X lines into non-capturing groups, then puts all the repetitions of the final pattern into a single group. All the X lines will be in a single capture.
You can test this using Rubular (an indispensable tool for developing regular expressions) http://rubular.com/r/PRnurKy64Q
this seems to work...
^(T[^\s]+)[\n\r\s]((X[^\s]+)[\n\r\s]){1,}
I'm not totally sure I understand your problem, but I'll give this a shot. It looks like you want:
/(^T\d+$(^X[-A-Z\d]+$)+)*/g
This will have to be run under multiline mode so that ^ and $ match after and before newlines. Word of caution: I don't have much practice with mulitline regex, so you might want to do a sanity check on the use of ^ and $.
Also, I notice you didn't include the lines similar to T01C0.025 in your sample results, so I made the T\d+ assumption based on that.

Very odd issue with Ruby and regex

I am getting completely different reults from string.scan and several regex testers...
I am just trying to grab the domain from the string, it is the last word.
The regex in question:
/([a-zA-Z0-9\-]*\.)*\w{1,4}$/
The string (1 single line, verified in Ruby's runtime btw)
str = 'Show more results from software.informer.com'
Work fine, but in ruby....
irb(main):050:0> str.scan /([a-zA-Z0-9\-]*\.)*\w{1,4}$/
=> [["informer."]]
I would think that I would get a match on software.informer.com ,which is my goal.
Your regex is correct, the result has to do with the way String#scan behaves. From the official documentation:
"If the pattern contains groups, each individual result is itself an array containing one entry per group."
Basically, if you put parentheses around the whole regex, the first element of each array in your results will be what you expect.
It does not look as if you expect more than one result (especially as the regex is anchored). In that case there is no reason to use scan.
'Show more results from software.informer.com'[ /([a-zA-Z0-9\-]*\.)*\w{1,4}$/ ]
#=> "software.informer.com"
If you do need to use scan (in which case you obviously need to remove the anchor), you can use (?:) to create non-capturing groups.
'foo.bar.baz lala software.informer.com'.scan( /(?:[a-zA-Z0-9\-]*\.)*\w{1,4}/ )
#=> ["foo.bar.baz", "lala", "software.informer.com"]
You are getting a match on software.informer.com. Check the value of $&. The return of scan is an array of the captured groups. Add capturing parentheses around the suffix, and you'll get the .com as part of the return value from scan as well.
The regex testers and Ruby are not disagreeing about the fundamental issue (the regex itself). Rather, their interfaces are differing in what they are emphasizing. When you run scan in irb, the first thing you'll see is the return value from scan (an Array of the captured subpatterns), which is not the same thing as the matched text. Regex testers are most likely oriented toward displaying the matched text.
How about doing this :
/([a-zA-Z0-9\-]*\.*\w{1,4})$/
This returns
informer.com
On your test string.
http://rubular.com/regexes/13670

Resources