Regular expression syntax - ruby

I have a problem similar to a previously asked question, but similar practices apparently do not produce similar results.
Previous Question
New question - I want to match the lines beginning with T as the first match, and the following lines beginning with X as the second match (as a whole string, to be matched later by another regex).
What I have so far is (^T(\d+)\n(.*?)(?:the_problem)/m). I don't know what to replace "the_problem" with, or even if that is the issue. I assumed some rendition of (?:\n|\z), but apparently not. Everything I tried failed to treat the next occurrence of ^T(\d+) as the start of a new group while still capturing all of the lines between occurrences.
Sample text:
T01C0.025
T02C0.035
T03C0.055
T04C0.150
T05C0.065
T06C0.075
%
G05
G90
T01
X011200Y004700
X011200Y009700
X018500Y011200
X013500Y-011200
X023800Y019500
T02
X034800Y017800
X-033800Y-017800
X032800Y017800
T03
X036730Y003000
X038700Y003000
X040668Y-003000
X059230Y003000
T04
X110580Y017800
X023800Y027300
X095500Y028500
X005500Y-006500
X021500Y-006500
T05
X003950Y002000
X003950Y004500
X003950Y007000
T06
X026300Y027300
M30
I only want to capture the shorter version of T01, T02,...T0n, not the longer version at the top, then the entire collection of ^X(-?\d+)Y(-?\d+) that follows it, as another match.
Result 1.
Match 1. T01
Match 2. X011200Y004700
X011200Y009700
X018500Y011200
X013500Y-011200
X023800Y019500
Result 2.
Match 1. T02
Match 2. X034800Y017800
X-033800Y-017800
X032800Y017800
Result 3.
Match 1. T03
Match 2. X036730Y003000
X038700Y003000
....etc....
Thanks in advance for any help ;-) Note: I prefer to use raw Ruby, without extensions or plugins. My version of ruby is 1.8.6.

Try this instead:
^(T[^\s]+)[\n\r\s]((?:(?:X\S+)[\n\r\s])+)
It makes the groups for the X lines into non-capturing groups, then puts all the repetitions of the final pattern into a single group. All the X lines will be in a single capture.
You can test this using Rubular (an indispensable tool for developing regular expressions) http://rubular.com/r/PRnurKy64Q
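A minimal sketch of how this pattern could be applied with String#scan in plain Ruby. The sample string is a shortened copy of the question's data, and the variable names are illustrative assumptions rather than part of the original answer.

```ruby
# Shortened version of the sample text from the question.
sample = [
  "T01C0.025", "T02C0.035", "%", "G05",
  "T01", "X011200Y004700", "X011200Y009700",
  "T02", "X034800Y017800", "X-033800Y-017800",
  "M30"
].join("\n") + "\n"

# Group 1 is the T line, group 2 is the whole run of X lines after it.
pattern = /^(T[^\s]+)[\n\r\s]((?:(?:X\S+)[\n\r\s])+)/

# scan returns one [tool, coords] pair per match.
results = sample.scan(pattern)

results.each do |tool, coords|
  # coords is a single string; split it for the second-stage regex.
  puts "#{tool}: #{coords.split.inspect}"
end
```

The long header lines such as T01C0.025 are skipped because no X line follows them directly, so the pattern cannot complete there.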

This seems to work:
^(T[^\s]+)[\n\r\s]((X[^\s]+)[\n\r\s]){1,}

I'm not totally sure I understand your problem, but I'll give this a shot. It looks like you want:
/(^T\d+$(^X[-A-Z\d]+$)+)*/g
This will have to be run under multiline mode so that ^ and $ match after and before newlines. Word of caution: I don't have much practice with multiline regex, so you might want to do a sanity check on the use of ^ and $.
Also, I notice you didn't include the lines similar to T01C0.025 in your sample results, so I made the T\d+ assumption based on that.

Related

Ruby Group regular expression to match only first occurence [duplicate]

I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?
Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:
Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: Greedy matches generally go as far as they can (here, until the end of the text!) and then trace back character after character to try and match the part coming afterwards.
However, consider using a negative character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means "everything except whitespace", and this is exactly what you want.
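A small Ruby sketch comparing the three patterns discussed above. It assumes the log is one long string with the entries separated by single spaces, which the question does not state explicitly; the variable names are illustrative.

```ruby
# Assumption: the "gigantic ugly string" is a single line joined by spaces.
log = [
  "J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM",
  "J0000010: Project name: E:\\foo.pf",
  "J0000011: Job name: MBiek Direct Mail Test",
  "J0000020: Document 1 - Completed successfully"
].join(" ")

# Greedy: .* runs to the end and backtracks to the LAST J-number.
greedy = log[/Project name:\s+(.*)\s+J[0-9]{7}:/, 1]

# Non-greedy: .*? stops at the FIRST J-number.
lazy = log[/Project name:\s+(.*?)\s+J[0-9]{7}:/, 1]

# Negated class: \S* cannot cross whitespace at all.
non_space = log[/Project name:\s+(\S*)\s+J[0-9]{7}:/, 1]

puts greedy     # E:\foo.pf J0000011: Job name: MBiek Direct Mail Test
puts lazy       # E:\foo.pf
puts non_space  # E:\foo.pf
```

The \S* version only works when the project path contains no spaces, so the non-greedy form is the safer choice if that is not guaranteed.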
Well, ".*" is a greedy selector. You make it non-greedy by using ".*?". With the latter construct, each time the regex engine matches a character into the ".", it first attempts to match whatever may come after the ".*?". This means that if, for instance, nothing comes after the ".*?", then it matches nothing.
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, @"Project name: (?<name>.*?) J\d+").Groups["name"].Value;
I would also recommend you experiment with regular expressions using "Expresso" - it's a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people inexperienced with regex might not be familiar with, in a way that makes it easy for them to learn these new concepts.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm
(Project name:\s+[A-Z]:(?:\\\w+)+\.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Using (?:\\\w+)+\.[a-zA-Z]+ is more restrictive than .*

ruby regex: match URL recurring pattern

I want to be able to match all the following cases below using Ruby 1.8.7.
/pages/multiedit/16801,16809,16817,16825,16833
/pages/multiedit/16801,16809,16817
/pages/multiedit/16801
/pages/multiedit/1,3,5,7,8,9,10,46
I currently have:
\/pages\/multiedit\/\d*
This matches up to the first set of numbers. So for example:
"/pages/multiedit/16801,16809,16817,16825,16833"[/\/pages\/multiedit\/\d*/]
# => "/pages/multiedit/16801"
See http://rubular.com/r/ruFPx5yIAF for example.
Thanks for the help, regex gods.
\/pages\/multiedit\/\d+(?:,\d+)*
Example: http://rubular.com/r/0nhpgki6Gy
Edit: Updated to not capture anything... Although the performance hit would be negligible. (Thanks Tin Man)
The currently accepted answer of
\/pages\/multiedit\/[\d,]+
may not be a good idea because that will also match the following strings
.../pages/multiedit/,,,
.../pages/multiedit/,1,
My answer requires there be at least one digit before the first comma, and at least one digit between commas, and it must end with a digit.
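A short Ruby sketch of the difference between the two patterns. The constant and variable names here are illustrative assumptions, not taken from either answer.

```ruby
STRICT = /\/pages\/multiedit\/\d+(?:,\d+)*/  # digits separated by single commas
LOOSE  = /\/pages\/multiedit\/[\d,]+/        # any mix of digits and commas

good     = "/pages/multiedit/16801,16809,16817"
commas   = "/pages/multiedit/,,,"
trailing = "/pages/multiedit/1,3,"

# Both patterns accept a well-formed list.
puts good[STRICT]            # /pages/multiedit/16801,16809,16817
puts good[LOOSE]             # /pages/multiedit/16801,16809,16817

# Only the loose pattern accepts a path made of commas alone.
puts commas[STRICT].inspect  # nil
puts commas[LOOSE]           # /pages/multiedit/,,,

# The strict pattern stops before a trailing comma.
puts trailing[STRICT]        # /pages/multiedit/1,3
```

Neither pattern is anchored, so both can match a prefix of a longer string; adding \z would be needed to reject the trailing-comma case outright.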
I'd use:
/\/pages\/multiedit\/[\d,]+/
Here's a demonstration of the pattern at http://rubular.com/r/h7VLZS1W1q
[\d,]+ means "find one or more digits or commas"
The reason \d* doesn't work is that it means "find zero or more digits". As soon as the pattern search runs into a comma, it stops. You have to tell the engine that it's OK to find digits and commas.

Multi-Line Regex: Find A where B is absent

I have been looking through a lot on Regex lately and have seen a lot of answers involving the matching of one word, where a second word is absent. I have seen a lot of Regex Examples where I can have a Regex search for a given word (or any more complex regex in its place) and find where a word is missing.
It seems like this works very well on a line-by-line basis, but after enabling multi-line mode it still doesn't seem to match properly.
Example: match an entire file string where the word foo is included but the word bar is absent from the file. What I have so far is (?m)^(?=.*?(foo))((?!bar).)*$, which is based on the linked example. I have been testing with a Ruby regex tester, but I think it is an open-ended regex question. It seems to match smaller pieces; I would like it to either match or not match the entire string as one big chunk.
In the example above, matches seem to be found on a line-by-line basis. What changes need to be made to the regex so that it applies over the ENTIRE string?
EDIT: I know there are other, more efficient ways to solve this problem that don't involve a regex. I am not looking for a solution by other means; I am asking from a theoretical regex point of view. Regex has a multi-line mode (which looks to "work"), and it has negative/positive lookarounds that can be combined on a line-by-line basis, so why doesn't combining these two principles yield the expected result?
Sawa's answer can be simplified, all that's needed is a positive lookahead, a negative lookahead, and since you're in multiline mode, .* takes care of the rest:
/(?=.*foo)(?!.*bar).*/m
Multiline means that . matches \n also, and matches are greedy. So the whole string will match without the need for anchors.
Update
@Sawa makes a good point for the \A being necessary but not the \Z.
Actually, looking at it again, the positive lookahead seems unnecessary:
/\A(?!.*bar).*foo.*/m
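A small Ruby sketch of why the \A anchor matters. The test strings are made up for illustration and do not come from the question.

```ruby
anchored   = /\A(?!.*bar).*foo.*/m
unanchored = /(?=.*foo)(?!.*bar).*/m

foo_only    = "line one\nfoo here\nline three"
foo_and_bar = "bar\nfoo"

# With only foo present, the anchored pattern matches the entire string.
puts anchored.match(foo_only)[0] == foo_only  # true

# With bar present, the anchored pattern fails outright.
puts anchored.match(foo_and_bar).inspect      # nil

# Without \A the engine retries one character later, where "bar" is
# no longer ahead of the cursor, and reports a partial match.
puts foo_and_bar[unanchored].inspect          # "ar\nfoo"
```

In Ruby the m flag makes . match newlines, which is why .* inside the lookahead can see the whole string from the start.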
A regex that matches an entire string that does not include foo is:
/\A(?!.*foo.*).*\z/m
and a regex that matches from the beginning of an entire string that includes bar is:
/\A.*bar/m
Since you want to satisfy both of these, take a conjunction of these by putting one of them in a lookahead:
/\A(?=.*bar)(?!.*foo.*).*\z/m

Problem capturing data inside of a capture that is optional

It's best to start with an example and what I've gotten so far.
Sample Data:
FOO foo#acme.com 5545
<Data><Name>tester</Name><Foo>bar</Foo></Data>
Current regex:
/FOO\s(.{1,20}#[^\s]+)\s.{0,20}\s{1,2}(<Data>.{0,100}<Name>(.{0,20})<\/Name>.{0,100}<\/Data>)?/m
Matches from regex:
foo#acme.com
testerbar
tester
I've wrapped the <Data> section in parentheses followed by a ? because the entire data section may or may not exist. However, the <Name> section is also optional; it may or may not exist. So I tried putting parentheses around <Name> with a question mark as well, but then I don't get the matches:
/FOO\s(.{1,20}#[^\s]+)\s.{0,20}\s{1,2}(<Data>.{0,100}(<Name>(.{0,20})<\/Name>)?.{0,100}<\/Data>)?/m
I've posted my regex and sample data on a regex site to make it easier to test/validate what I'm trying to do: http://www.rubular.com/r/ZhQzlNp1vv
In the <Data> section there is <Name> and even <Foo>. The point is, there may be many different elements in <Data> and I only care about extracting data from some of them. I need to use regex for my particular situation so please don't suggest using some XML parsing library (thanks!).
Thanks in advance.
/FOO\s(\S+#\S+).*?\n(?:<Data>.{0,100}<Name>(.{0,20})<\/Name>.{0,100}<\/Data>)?/m
http://www.rubular.com/r/IhisH7HYJR
To capture an optional group, use a non-capturing group to indicate the optionality inside a capturing group.
i.e.
((?:content)?)
The outer parentheses form the capturing group - if the optional group doesn't match you get an empty string. The (?:...) is the non-capturing group, which allows you to group the content (so it can all be made optional) without capturing it.
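A minimal Ruby sketch of the difference, using a made-up "id=...;name=..." format that is purely hypothetical and not from the question.

```ruby
# Optional content wrapped as ((?:...)?): the capture always participates.
wrapped = /id=(\d+)((?:;name=\w+)?)/

# Optional capturing group (...)?: the capture may not participate at all.
plain = /id=(\d+)(;name=\w+)?/

puts "id=7;name=bob".match(wrapped).captures.inspect  # ["7", ";name=bob"]
puts "id=7".match(wrapped).captures.inspect           # ["7", ""]
puts "id=7".match(plain).captures.inspect             # ["7", nil]
```

The wrapped form yields an empty string instead of nil when the optional part is missing, which can simplify code that processes the captures.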
Update:
Whenever you have a complex regex, use free-spacing comment mode (flag=x) to make it readable (and thus far easier to figure out what's going on), like this:
FOO\s(.{1,20}#[^\s]+)\s.{0,20}\s{1,2}
((?:<Data>
# up to 200 chars, excluding captured tags or end tag (repeated below)
(?:(?!<Name>|<Foo>|<Bob>|<\/Data>).){0,200}
# Capture 3:
((?:<Name>.{0,20}<\/Name>)?)
(?:(?!<Name>|<Foo>|<Bob>|<\/Data>).){0,200}
# Capture 4:
((?:<Foo>.{0,20}<\/Foo>)?)
(?:(?!<Name>|<Foo>|<Bob>|<\/Data>).){0,200}
# Capture 5:
((?:<Bob>.{0,20}<\/Bob>)?)
(?:(?!<Name>|<Foo>|<Bob>|<\/Data>).){0,200}
<\/Data>)?)
Which at rubular results in:
1. foo#acme.com
2. <Data><Name>tester</Name><Foo>bar</Foo></Data>
3. <Name>tester</Name>
4. <Foo>bar</Foo>
5.
Annoyingly rubular doesn't seem to provide a multi-line editor when x is turned on, which sucks, and it also doesn't support standard comment syntax, so I had to change those #... to (?#...) which is less readable. Oh well.
If you need the values without the tags, you'll need a separate expression to strip those.
( Or, y'know, use a tool actually designed for the job. ;) )
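For reference, a reduced sketch of the free-spacing approach as a Ruby regex literal with the x flag. It handles only the <Name> element, copies the sample data as shown in the question, and escapes the literal # because an unescaped # starts a comment in x mode. The variable names are illustrative assumptions.

```ruby
pattern = /
  FOO\s(\S+\#\S+)                      # capture 1: address-like token, literal hash escaped
  [^\n]*\n                             # remainder of the first line
  ((?:<Data>                           # capture 2: the whole optional Data block
    (?:(?!<Name>|<\/Data>).){0,200}    # skip ahead, stopping before Name or the end tag
    ((?:<Name>.{0,20}<\/Name>)?)       # capture 3: optional Name element, empty if absent
    (?:(?!<\/Data>).){0,200}           # anything else before the end tag
  <\/Data>)?)
/mx

with_name    = "FOO foo#acme.com 5545\n<Data><Name>tester</Name><Foo>bar</Foo></Data>"
without_name = "FOO foo#acme.com 5545\n<Data><Foo>bar</Foo></Data>"

m = pattern.match(with_name)
puts m[1]  # foo#acme.com
puts m[3]  # <Name>tester</Name>

n = pattern.match(without_name)
puts n[2]          # <Data><Foo>bar</Foo></Data>
puts n[3].inspect  # ""
```

Comments inside a Ruby regex literal should avoid the / character, since it would end the literal early.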

Optimal Regular Expression: match sets of lines starting with

Alright, this one's interesting. I have a solution, but I don't like it.
The goal is to be able to find a set of lines that start with 3 periods - not an individual line, mind you, but a collection of all the lines in a row that match. For example, here's some matches (each match is separated by a blank line):
...
...hello
...
...hello
...world
...
...wazzup?
...
My solution is as follows:
^\.\.\..*(\n\.\.\..*)*$
It matches all those, so it's what I'm using for now - however, it looks kinda silly to repeat the \.\.\..* pattern. Is there a simpler way?
Please test your regex before submitting it, rather than submit what "should work." For example, I tried the following first:
(^\.\.\..*$)+
which only returned individual lines, even though in my mind it looks like it would do the trick - I guess I just don't understand regex internals. (And no, I didn't need to set any flags to get ^ and $ to match line boundaries, since I'm implementing this in Ruby.)
So I'm not totally sure there's a good answer, but one would be much appreciated - thanks in advance!
In most regex implementations you can shorten \.\.\. using \.{3} so your solution would turn into \.{3}.*(\n\.{3}.*)*.
What you already have is already simple and understandable. Keep in mind that more "clever" RegExps may very well be slower and undoubtedly less readable.
Assuming lines are terminated by a \n:
((^|\n)\.{3}[^\n]*)+
I am not familiar with Ruby, so depending on how it returns matches you might need to "nonmatch" groups:
((?:(?:^|\n)\.{3}[^\n]*)+)
^([.]{3}.*$\n?)+
This doesn't really need $ in there.
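A short Ruby sketch of why consuming the newline matters. It uses a variation of the pattern above with a non-capturing group and no $, so that String#scan returns whole matches; the sample text is made up for illustration.

```ruby
text = "intro\n...\n...hello\nbreak\n...hello\n...world\nend\n"

# The newline is consumed inside the repeated group, so the next
# repetition can start at the beginning of the following line.
blocks = text.scan(/^(?:\.{3}.*\n?)+/)
puts blocks.inspect  # ["...\n...hello\n", "...hello\n...world\n"]

# Here $ asserts the end of the line without consuming the newline,
# so the group cannot repeat and each line is matched separately.
lines = text.scan(/^(?:\.{3}.*$)+/)
puts lines.inspect   # ["...", "...hello", "...hello", "...world"]
```

Each element of blocks keeps its trailing newline; calling chomp on it would drop the last one if needed.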
You are pretty close to a solution with (^\.\.\..*$)+, but because the + modifier is on the outside of the group, it is getting overwritten each time and you are only left with the last line. Try wrapping it in an outer group: ((^\.\.\..*$)+) and looking at the first submatch and ignoring the inner one.
Combined with the other suggestion: ((^\.{3}.*$)+)

Resources