(Ruby) parsing a string with RegEx - ruby

This is the string that I want to parse: 2 Sep 27 Sep 28 SOME TEXT HERE 35.00
I want to parse it into a list so that the values look like:
list[0] = 'Sep 28'
list[1] = 'SOME TEXT HERE'
list[2] = '35.00'
The RegEx that I've been working on:
^\d{1}\s{1}[a-zA-Z]{3}\s{1}\d{2}\s{1}([a-zA-Z]{3}\s{1}\d{2})\s{1}([a-zA-Z0-9]*\s{1})+(\d+.\d+)
My values are:
list[0] = 'Sep 28'
list[1] = 'HERE'
list[2] = '35.00'
The list[1] value is off. I'm also probably not parsing the spaces right, but I couldn't find any guidance in the "Pickaxe" book or online.

Your problem is in your second capture group:
([a-zA-Z0-9]*\s{1})+
The parenthesized group is repeated, matching each of the words 'SOME', 'TEXT', and 'HERE' individually, leaving your second capture group with only the final match, 'HERE'.
You need to put the + inside the capturing parenthesized groups, and use non-capturing parentheses (?:...) to enclose your existing group. Non-capturing parentheses, which use (?: to start the group and ) to end the group, are a way in a regular expression to group parts of your match together without capturing the group. You can use repetition operators (+, *, {n}, or {n,m}) on a non-capturing group and then capture the entire expression:
((?:[a-zA-Z0-9]*\s{1})+)
In total:
/^\d{1}\s{1}[a-zA-Z]{3}\s{1}\d{2}\s{1}([a-zA-Z]{3}\s{1}\d{2})\s{1}((?:[a-zA-Z0-9]*\s{1})+)(\d+.\d+)/
As a side note, this is a pretty clunky regex. You never really need to specify {1} in a regex as a single match is the default. Similarly, \d\d is one character less typing than \d{2}. Also, you probably just want \w instead of [a-zA-Z0-9]. Since you don't seem to care about case, you probably just want to use the /i option and simplify the letter character classes. Something like this is a more idiomatic regular expression:
/^\d [a-z]{3} \d\d ([a-z]{3} \d\d) ((?:\w* )+)(\d+.\d+)/i
Finally, though the Ruby documentation for regular expressions is a little thin, Ruby uses somewhat standard Perl-compatible regular expressions, and you can find more information about regular expressions generally at regular-expressions.info

You may have also been here and tried this tool, but I would highly recommend Rubular. It offers very quick string parsing.
It looks like you already got the specific answer to your question, so I just wanted to drop this in for other people coming by so they can know where to go test their regex or just practice.

Related

Regular expression to get value in between parentheses

I am trying to write a regular expression to get the value in between parentheses. I expect a value without parentheses. For example, given:
value = "John sinu.s(14)"
I expected to get 14.
I tried the following:
value[/\(.*?\)/]
but it gives the result (14). Please help me.
You may do that using
value[/\((.*?)\)/, 1]
or
value[/\(([^()]*)\)/, 1]
Use a capturing group and a second argument to extract just the group value.
Note that \((.*?)\) will also match a substring that contains ( char in it, and the second option will only match a substring between parentheses that does not contain ( nor ) since [^()] is a negated character class that matches any char but ( and ).
See the Ruby demo online.
From the Ruby docs:
str[regexp, capture] → new_str or nil
If a Regexp is supplied, the matching portion of the string is returned. If a capture follows the regular expression, which may be a capture group index or name, follows the regular expression that component of the MatchData is returned instead.
In case you need to extract multiple occurrences, use String#scan:
value = "John sinu.s(14) and Jack(156)"
puts value.scan(/\(([^()]*)\)/)
# => [ 14, 156 ]
See another Ruby demo.
Another option is to use non-capturing look arounds like this
value[/(?<=\().*(?=\))/]
(?<=\() - positive look behind make sure there is ( but don't capture it
(?=\)) - positive look ahead make sure the regex ends with ) but don't capture it
You can use
/(?<=\\()[^\\)]+/g
which selects string inside brackets without brackets
Only thing you need is "positive lookahead" feature
Follow this link for more info about positive lookahead in special groups.
I don't know if it is supported in ruby
Try using this regular expression
/\((.*?)\)/
\( will match your opening parenthesis in the string
(.*?) creates a capturing group
\) will match your closing parenthesis
Do you wish to extract the string between the parentheses or do that using a regular expression? You specify the latter in the question but it's conceivable your question is really the former and you are assuming that a regular expression must be used.
If you just want the value, without any restriction on the method used to obtain it, you could do that quite simply using String#index and String#rindex.
s = "John sinu.s(14)"
s[s.index('(')+1 .. s.rindex(')')-1]
#=> "14"

Ruby regex - gsub only captured group

I'm not quite sure I understand how non-capturing groups work. I am looking for a regex to produce this result: 5.214. I thought the regex below would work, but it is replacing everything including the non-capture groups. How can I write a regex to only replace the capture groups?
"5,214".gsub(/(?:\d)(,)(?:\d)/, '.')
# => ".14"
My desired result:
"5,214".gsub(some_regex)
#=> "5.214
non capturing groups still consumes the match
use
"5,214".gsub(/(\d+)(,)(\d+)/, '\1.\3')
or
"5,214".gsub(/(?<=\d+)(,)(?=\d+)/, '.')
You can't. gsub replaces the entire match; it does not do anything with the captured groups. It will not make any difference whether the groups are captured or not.
In order to achieve the result, you need to use lookbehind and lookahead.
"5,214".gsub(/(?<=\d),(?=\d)/, '.')
It is also possible to use Regexp.last_match (also available via $~) in the block version to get access to the full MatchData:
"5,214".gsub(/(\d),(\d)/) { |_|
match = Regexp.last_match
"#{match[1]}.#{match[2]}"
}
This scales better to more involved use-cases.
Nota bene, from the Ruby docs:
the ::last_match is local to the thread and method scope of the method that did the pattern match.
gsub replaces the entire match the regular expression engine produces. Both capturing/non-capturing group constructs are not retained. However, you could use lookaround assertions which do not "consume" any characters on the string.
"5,214".gsub(/\d\K,(?=\d)/, '.')
Explanation: The \K escape sequence resets the starting point of the reported match and any previously consumed characters are no longer included. That being said, we then look for and match the comma, and the Positive Lookahead asserts that a digit follows.
I know nothing about ruby.
But from what i see in the tutorial
gsub mean replace,
the pattern should be /(?<=\d+),(?=\d+)/ just replace the comma with dot
or, use capture /(\d+),(\d+)/ replace the string with "\1.\2"?
You can easily reference capture groups in the replacement string (second argument) like so:
"5,214".gsub(/(\d+)(,)(\d+)/, '\1.\3')
#=> "5.214"
\0 will return the whole matched string.
\1 will be replaced by the first capturing group.
\2 will be replaced by the second capturing group etc.
You could rewrite the example above using a non-capturing group for the , char.
"5,214".gsub(/(\d+)(?:,)(\d+)/, '\1.\2')
#=> "5.214"
As you can see, the part after the comma is now the second capturing group, since we defined the middle group as non-capturing.
Although it's kind of pointless in this case. You can just omit the capturing group for , altogether
"5,214".gsub(/(\d+),(\d+)/, '\1.\2')
#=> "5.214"
You don't need regexp to achieve what you need:
'1,200.00'.tr('.','!').tr(',','.').tr('!', ',')
Periods become bangs (1,200!00)
Commas become periods (1.200!00)
Bangs become commas (1.200,00)

Regex - Matching text AFTER certain characters

I want to scrape data from some text and dump it into an array. Consider the following text as example data:
| Example Data
| Title: This is a sample title
| Content: This is sample content
| Date: 12/21/2012
I am currently using the following regex to scrape the data that is specified after the 'colon' character:
/((?=:).+)/
Unfortunately this regex also grabs the colon and the space after the colon. How do I only grab the data?
Also, I'm not sure if I'm doing this right.. but it appears as though the outside parens causes a match to return an array. Is this the function of the parens?
EDIT: I'm using Rubular to test out my regex expressions
You could change it to:
/: (.+)/
and grab the contents of group 1. A lookbehind works too, though, and does just what you're asking:
/(?<=: ).+/
In addition to #minitech's answer, you can also make a 3rd variation:
/(?<=: ?)(.+)/
The difference here being, you create/grab the group using a look-behind.
If you still prefer the look-ahead rather than look-behind concept. . .
/(?=: ?(.+))/
This will place a grouping around your existing regex where it will catch it within a group.
And yes, the outside parenthesis in your code will make a match. Compare that to the latter example I gave where the entire look-ahead is 'grouped' rather than needlessly using a /( ... )/ without the /(?= ... )/, since the first result in most regular expression engines return the entire matched string.
I know you are asking for regex but I just saw the regex solution and found that it is rather hard to read for those unfamiliar with regex.
I'm also using Ruby and I decided to do it with:
line_as_string.split(": ")[-1]
This does what you require and IMHO it's far more readable.
For a very long string it might be inefficient. But not for this purpose.
In Ruby, as in PCRE and Boost, you may make use of the \K match reset operator:
\K keeps the text matched so far out of the overall regex match. h\Kd matches only the second d in adhd.
So, you may use
/:[[:blank:]]*\K.+/ # To only match horizontal whitespaces with `[[:blank:]]`
/:\s*\K.+/ # To match any whitespace with `\s`
Seee the Rubular demo #1 and the Rubular demo #2 and
Details
: - a colon
[[:blank:]]* - 0 or more horizontal whitespace chars
\K - match reset operator discarding the text matched so far from the overall match memory buffer
.+ - matches and consumes any 1 or more chars other than line break chars (use /m modifier to match any chars including line break chars).

Replacing partial regex matches in place with Ruby

I want to transform the following text
This is a ![foto](foto.jpeg), here is another ![foto](foto.png)
into
This is a ![foto](/folder1/foto.jpeg), here is another ![foto](/folder2/foto.png)
In other words I want to find all the image paths that are enclosed between brackets (the text is in Markdown syntax) and replace them with other paths. The string containing the new path is returned by a separate real_path function.
I would like to do this using String#gsub in its block version. Currently my code looks like this:
re = /!\[.*?\]\((.*?)\)/
rel_content = content.gsub(re) do |path|
real_path(path)
end
The problem with this regex is that it will match ![foto](foto.jpeg) instead of just foto.jpeg. I also tried other regexen like (?>\!\[.*?\]\()(.*?)(?>\)) but to no avail.
My current workaround is to split the path and reassemble it later.
Is there a Ruby regex that matches only the path inside the brackets and not all the contextual required characters?
Post-answers update: The main problem here is that Ruby's regexen have no way to specify zero-width lookbehinds. The most generic solution is to group what the part of regexp before and the one after the real matching part, i.e. /(pre)(matching-part)(post)/, and reconstruct the full string afterwards.
In this case the solution would be
re = /(!\[.*?\]\()(.*?)(\))/
rel_content = content.gsub(re) do
$1 + real_path($2) + $3
end
A quick solution (adjust as necessary):
s = 'This is a ![foto](foto.jpeg)'
s.sub!(/!(\[.*?\])\((.*?)\)/, '\1(/folder1/\2)' )
p s # This is a [foto](/folder1/foto.jpeg)
You can always do it in two steps - first extract the whole image expression out and then second replace the link:
str = "This is a ![foto](foto.jpeg), here is another ![foto](foto.png)"
str.gsub(/\!\[[^\]]*\]\(([^)]*)\)/) do |image|
image.gsub(/(?<=\()(.*)(?=\))/) do |link|
"/a/new/path/" + link
end
end
#=> "This is a ![foto](/a/new/path/foto.jpeg), here is another ![foto](/a/new/path/foto.png)"
I changed the first regex a bit, but you can use the same one you had before in its place. image is the image expression like ![foto](foto.jpeg), and link is just the path like foto.jpeg.
[EDIT] Clarification: Ruby does have lookbehinds (and they are used in my answer):
You can create lookbehinds with (?<=regex) for positive and (?<!regex) for negative, where regex is an arbitrary regex expression subject to the following condition. Regexp expressions in lookbehinds they have to be fixed width due to limitations on the regex implementation, which means that they can't include expressions with an unknown number of repetitions or alternations with different-width choices. If you try to do that, you'll get an error. (The restriction doesn't apply to lookaheads though).
In your case, the [foto] part has a variable width (foto can be any string) so it can't go into a lookbehind due to the above. However, lookbehind is exactly what we need since it's a zero-width match, and we take advantage of that in the second regex which only needs to worry about (fixed-length) compulsory open parentheses.
Obviously you can put real_path in from here, but I just wanted a test-able example.
I think that this approach is more flexible and more readable than reconstructing the string through the match group variables
In your block, use $1 to access the first capture group ($2 for the second and so on).
From the documentation:
In the block form, the current match string is passed in as a parameter, and variables such as $1, $2, $`, $&, and $' will be set appropriately. The value returned by the block will be substituted for the match on each call.
As a side note, some people think '\1' inappropriate for situations where an unconfirmed number of characters are matched. For example, if you want to match and modify the middle content, how can you protect the characters on both sides?
It's easy. Put a bracket around something else.
For example, I hope replace a-ruby-porgramming-book-531070.png to a-ruby-porgramming-book.png. Remove context between last "-" and last ".".
I can use /.*(-.*?)\./ match -531070. Now how should I replace it? Notice
everything else does not have a definite format.
The answer is to put brackets around something else, then protect them:
"a-ruby-porgramming-book-531070.png".sub(/(.*)(-.*?)\./, '\1.')
# => "a-ruby-porgramming-book.png"
If you want add something before matched content, you can use:
"a-ruby-porgramming-book-531070.png".sub(/(.*)(-.*?)\./, '\1-2019\2.')
# => "a-ruby-porgramming-book-2019-531070.png"

Working with Regular Expressions - Repeating Patterns

I am trying to use regular expressions to match some text.
The following pattern is what I am trying to gather.
#Identifier('VariableA', 'VariableB', 'VariableX', ..., 'VariableZ')
I would like to grab a dynamic number of variables rather than a fixed set of two or three.
Is there any way to do this? I have an existing Regular Expression:
\#(\w+)\W+(\w+)\W+(\w+)\W+(\w+)
This captures the Identifier and up to three variables.
Edit: Is it just me, or are regular expressions not as powerful as I'm making them out to be?
You want to use scan for this sort of thing. The basic pattern would be this:
s.scan(/\w+/)
That would give you an array of all the contiguous sequences for word characters:
>> "#Identifier('VariableA', 'VariableB', 'VariableX', 'VariableZ')".scan(/\w+/)
=> ["Identifier", "VariableA", "VariableB", "VariableX", "VariableZ"]
You say you might have multiple instances of your pattern with arbitrary stuff surrounding them. You can deal with that with nested scans:
s.scan(/#(\w+)\(([^)]+?)\)/).map { |m| [ m.first, m.last.scan(/\w+/) ] }
That will give you an array of arrays, each inner array will have the "Identifier" part as the first element and that "Variable" parts as an array in the second element. For example:
>> s = "pancakes #Identifier('VariableA', 'VariableB', 'VariableX', 'VariableZ') pancakes #Pancakes('one','two','three') eggs"
>> s.scan(/#(\w+)\(([^)]+?)\)/).map { |m| [ m.first, m.last.scan(/\w+/) ] }
=> [["Identifier", ["VariableA", "VariableB", "VariableX", "VariableZ"]], ["Pancakes", ["one", "two", "three"]]]
If you might be facing escaped quotes inside your "Variable" bits then you'll need something more complex.
Some notes on the expression:
# # A literal "#".
( # Open a group
\w+ # One more more ("+") word characters ("\w").
) # Close the group.
\( # A literal "(", parentheses are used for group so we escape it.
( # Open a group.
[ # Open a character class.
^) # The "^" at the beginning of a [] means "not", the ")" isn't escaped because it doesn't have any special meaning inside a character class.
] # Close a character class.
+? # One more of the preceding pattern but don't be greedy.
) # Close the group.
\) # A literal ")".
You don't really need [^)]+? here, just [^)]+ would do but I use the non-greedy forms by habit because that's usually what I mean. The grouping is used to separate the #Identifier and Variable parts so that we can easily get the desired nested array output.
But alex thinks that you meant you wanted to capture the same thing four times. If you want to capture the same pattern, but different things, then you may want to consider two things:
Iteration. In perl, you can say
while ($variable =~ /regex/g) {
the 'g' stands for 'global', and means that each time the regex is called, it matches the /next/ instance.
The other option is recursion. Write your regex like this:
/(what you want)(.*)/
Then, you have backreference 1 containing the first thing, which you can push to an array, and backreference 2 which you'll then recurse over until it no longer matches.
You may use simply (\w+).
Given the input string
#Identifier('VariableA', 'VariableB', 'VariableX', 'VariableZ')
The results would be:
Identifier
VariableA
VariableB
VariableX
VariableZ
This would work for an arbitrary number of variables.
For future reference, it's easy and fun to play around with regexp ideas on Rubular.
So you are asking if there is a way to capture both the identifier and an arbitrary number of variables. I am afraid that you can only do this with regex engines that support captures. Note here that captures and capturing groups are not the one and the same thing. You want to remember all the "variables". This can't be done with simple capturing groups.
I am unaware whether Ruby supports this or not, but I am sure that .NET and the new PERL 6 support it.
In your case you could use two regexes. One to capture the identifier e.g. ^\s*#(\w+)
and another one to capture all variables e.g. result = subject.scan(/'[^']+'/)

Resources