Ruby regex to extract match_group value? - ruby

I have two questions about regex.
The match string is:
"FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA, user_email=admin#example.com"
When extracting the user_email value, my regexp is:
\s+(?<email_from_header>\S+)
and the match group value is:
(space)user_email=admin#example.com"
What do I use to omit the first (space) char and the last " quote?
When extracting the token, my regex is:
AUTH-TOKEN\s+(?<auth_token>\S+)
and the match group value is:
FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA,
What do I use to delete that last trailing comma ,?

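For the email, your regex would be:
\s+\K(?<email_from_header>[^"]*)
\K discards the characters matched so far, so the leading whitespace is not part of the result. The negated character class [^"]* matches any character other than " zero or more times, so the match stops before the closing quote.
For the token, your regex would be:
AUTH-TOKEN\s+(?<auth_token>[^,]*)
[^,]* matches any character other than a comma zero or more times, so the trailing comma is left out.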
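Here is a minimal sketch of both patterns in use. The value string is taken from the question; the header line with the AUTH-TOKEN prefix is an assumption, since the question does not show the full input.

```ruby
# The quoted value from the question (leading quote already stripped).
value = 'FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA, user_email=admin#example.com"'
# Assumed header line: the second pattern expects an AUTH-TOKEN prefix.
header = "AUTH-TOKEN #{value}"

# \K drops the leading whitespace; [^"]* stops before the closing quote.
email = value.match(/\s+\K(?<email_from_header>[^"]*)/)[:email_from_header]

# [^,]* stops before the first comma.
token = header.match(/AUTH-TOKEN\s+(?<auth_token>[^,]*)/)[:auth_token]

puts email
puts token
```

Note that the first pattern still returns the user_email= prefix; capture after the = sign if only the address is wanted.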

If your string has embedded double-quotes:
str = '"FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA, user_email=admin#example.com"'
str # => "\"FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA, user_email=admin#example.com\""
str[/^"(.+),/, 1] # => "FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA"
str[/^"(.+?),/, 1] # => "FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA"
str[/^"([^,]+),/, 1] # => "FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA"
str[/(user_email=.+)"/, 1] # => "user_email=admin#example.com"
str[/(user_email=[^"]+)"/, 1] # => "user_email=admin#example.com"
str[/user_email=([^"]+)"/, 1] # => "admin#example.com"
match = str.match(/(?<user_email>user_email=(?<addr>.+))"/)
match # => #<MatchData "user_email=admin#example.com\"" user_email:"user_email=admin#example.com" addr:"admin#example.com">
match['user_email'] # => "user_email=admin#example.com"
match['addr'] # => "admin#example.com"
If it doesn't:
str = 'FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA, user_email=admin#example.com'
str # => "FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA, user_email=admin#example.com"
str[/^(.+),/, 1] # => "FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA"
str[/^(.+?),/, 1] # => "FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA"
str[/^([^,]+),/, 1] # => "FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA"
str[/(user_email=.+)/, 1] # => "user_email=admin#example.com"
str[/(user_email=(.+))/, 2] # => "admin#example.com"
str[/user_email=(.+)/, 1] # => "admin#example.com"
Or, having more regex fun:
match = str.match(/(?<user_email>user_email=(?<addr>.+))/)
match # => #<MatchData "user_email=admin#example.com" user_email:"user_email=admin#example.com" addr:"admin#example.com">
match['user_email'] # => "user_email=admin#example.com"
match['addr'] # => "admin#example.com"
Regular expressions are a very rich language, and there are usually many ways to write the same thing. The problem then becomes maintaining the pattern as the program "matures". I recommend starting simply and expanding the pattern as the needs dictate. Don't start with a complex pattern and hope it works right away, because it usually won't.
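As a rough illustration of "start simply, then expand", here is a sketch. The second input (quoted, with trailing text) and the stricter pattern are hypothetical; they only stand in for a requirement that might show up later.

```ruby
str = 'FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA, user_email=admin#example.com'

# Step 1: the simplest pattern that works for the input at hand.
simple = /user_email=(.+)/
first_try = str[simple, 1]

# Step 2 (hypothetical new requirement): the value may be followed by a
# quote and more text, so stop at whitespace, quotes, or commas instead.
stricter = /user_email=([^\s",]+)/
quoted = %("#{str}" trailing text)

too_greedy = quoted[simple, 1]   # runs on past the closing quote
tightened  = quoted[stricter, 1] # stops at the closing quote

puts first_try
puts too_greedy
puts tightened
```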

Related

How to embed regular expressions in other regular expressions in Ruby

I have a string:
'A Foo'
and want to find "Foo" in it.
I have a regular expression:
/foo/
that I'm embedding into another case-insensitive regular expression, so I can build the pattern in steps:
foo_regex = /foo/
pattern = /A #{ foo_regex }/i
But it won't match correctly:
'A Foo' =~ pattern # => nil
If I embed the text directly into the pattern it works:
'A Foo' =~ /A foo/i # => 0
What's wrong?
On the surface it seems that embedding a pattern inside another pattern would simply work, but that's based on a bad assumption of how patterns work in Ruby, that they're simply strings. Using:
foo_regex = /foo/
creates a Regexp object:
/foo/.class # => Regexp
As such it has knowledge of the optional flags used to create it:
( /foo/ ).options # => 0
( /foo/i ).options # => 1
( /foo/x ).options # => 2
( /foo/ix ).options # => 3
( /foo/m ).options # => 4
( /foo/im ).options # => 5
( /foo/mx ).options # => 6
( /foo/imx ).options # => 7
or, if you like binary:
'%04b' % ( /foo/ ).options # => "0000"
'%04b' % ( /foo/i ).options # => "0001"
'%04b' % ( /foo/x ).options # => "0010"
'%04b' % ( /foo/xi ).options # => "0011"
'%04b' % ( /foo/m ).options # => "0100"
'%04b' % ( /foo/mi ).options # => "0101"
'%04b' % ( /foo/mx ).options # => "0110"
'%04b' % ( /foo/mxi ).options # => "0111"
and remembers those whenever the Regexp is used, whether as a standalone pattern or if embedded in another.
You can see this in action if we look to see what the pattern looks like after embedding:
/#{ /foo/ }/ # => /(?-mix:foo)/
/#{ /foo/i }/ # => /(?i-mx:foo)/
?-mix: and ?i-mx: are how those options are represented in an embedded-pattern.
According to the Regexp documentation for Options:
i, m, and x can also be applied on the subexpression level with the (?on-off) construct, which enables options on, and disables options off for the expression enclosed by the parentheses.
So, Regexp is remembering those options, even inside the outer pattern, causing the overall pattern to fail the match:
pattern = /A #{ foo_regex }/i # => /A (?-mix:foo)/i
'A Foo' =~ pattern # => nil
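The (?on-off) construct quoted from the documentation can also be written by hand. A small sketch of how it scopes the i flag to one part of a pattern (the sample strings are made up for illustration):

```ruby
# Only the "foo" part is case-insensitive; the "A" is still case-sensitive.
inner_i = /A (?i:foo)/
inner_on_upper = inner_i.match?('A Foo') # inner group ignores case
inner_on_lower = inner_i.match?('a foo') # outer "A" does not

# The reverse: the outer pattern ignores case, the group switches it off.
# This is the shape the interpolated /A #{ /foo/ }/i ends up with.
outer_i = /A (?-i:foo)/i
outer_on_lower = outer_i.match?('a foo')
outer_on_upper = outer_i.match?('A Foo')

puts inner_on_upper, inner_on_lower, outer_on_lower, outer_on_upper
```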
It's possible to make sure that all sub-expressions match their surrounding patterns, however that can quickly become too convoluted or messy:
foo_regex = /foo/i
pattern = /A #{ foo_regex }/i # => /A (?i-mx:foo)/i
'A Foo' =~ pattern # => 0
Instead we have the source method which returns the text of a pattern:
/#{ /foo/.source }/ # => /foo/
/#{ /foo/i.source }/ # => /foo/
The problem with the embedded pattern remembering the options also appears when using other Regexp methods, such as union:
/#{ Regexp.union(%w[a b]) }/ # => /(?-mix:a|b)/
and again, source can help:
/#{ Regexp.union(%w[a b]).source }/ # => /a|b/
Knowing all that:
foo_regex = /foo/
pattern = /#{ foo_regex.source }/i # => /foo/i
'A Foo' =~ pattern # => 2
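The same idea scales to several sub-patterns. This is only a sketch: the parts array and the choice to join with | are assumptions for illustration. Keep in mind that source drops each part's flags, so a part written with /x or /m may no longer mean the same thing once it is combined.

```ruby
# Hypothetical sub-patterns built up in separate steps.
parts = [/foo/, /bar/]

# Join the raw sources and apply one set of flags to the whole pattern.
pattern = Regexp.new(parts.map(&:source).join('|'), Regexp::IGNORECASE)

combined_source = pattern.source
foo_index = 'A Foo' =~ pattern
bar_index = 'BAR' =~ pattern

puts combined_source
puts foo_index
puts bar_index
```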
"what's wrong?"
Your assumption on how a Regexp is interpolated is wrong.
Interpolation via #{...} is done by calling to_s on the interpolated object:
d = Date.new(2017, 9, 8)
#=> #<Date: 2017-09-08 ((2458005j,0s,0n),+0s,2299161j)>
d.to_s
#=> "2017-09-08"
"today is #{d}!"
#=> "today is 2017-09-08!"
and not just for string literals, but also for regular expression literals:
/today is #{d}!/
#=> /today is 2017-09-08!/
In your example, the object-to-be-interpolated is a Regexp:
foo_regex = /foo/
And Regexp#to_s returns:
[...] the regular expression and its options using the (?opts:source) notation.
foo_regex.to_s
#=> "(?-mix:foo)"
Therefore:
/A #{foo_regex}/i
#=> /A (?-mix:foo)/i
Just like:
"A #{foo_regex}"
#=> "A (?-mix:foo)"
In other words: because of the way Regexp#to_s is implemented, you can interpolate patterns without losing their flags. It's a feature, not a bug.
If Regexp#to_s returned just the source (without options), it would work the way you expect:
def foo_regex.to_s
source
end
/A #{foo_regex}/i
#=> /A foo/i
The above code is just for demonstration purposes, don't do that.

ruby and regex grouping

Here is the code
string = "Looking for the ^[cows]"
footnote = string[/\^\[(.*?)\]/]
I was hoping that footnote would equal cows
What I get is footnote equals ^[cows]
Any help?
Thanks!
You can specify which capture group you want with a second argument to []:
string = "Looking for the ^[cows]"
footnote = string[/\^\[(.*?)\]/, 1]
# footnote == "cows"
According to the String documentation, the #[] method takes a second parameter, an integer, which determines the matching group returned:
a = "hello there"
a[/[aeiou](.)\1/] #=> "ell"
a[/[aeiou](.)\1/, 0] #=> "ell"
a[/[aeiou](.)\1/, 1] #=> "l"
a[/[aeiou](.)\1/, 2] #=> nil
You should use footnote = string[/\^\[(.*?)\]/, 1]
If you want to capture subgroups, you can use Regexp#match:
r = /\^\[(.*?)\]/
r.match(string) # => #<MatchData "^[cows]" 1:"cows">
r.match(string)[0] # => "^[cows]"
r.match(string)[1] # => "cows"
An alternative to using a capture group and then retrieving its contents is to match only what you want. Here are three ways of doing that.
#1 Use a positive lookbehind and a positive lookahead
string[/(?<=\[).*?(?=\])/]
#=> "cows"
#2 Use match but forget (\K) and a positive lookahead
string[/\[\K.*?(?=\])/]
#=> "cows"
#3 Use String#gsub
string.gsub(/.*?\[|\].*/,'')
#=> "cows"
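Two more variations, as a sketch: String#[] also accepts a capture name, and String#scan collects every footnote when there is more than one. The second footnote in the sample string and the group name note are made up for the example.

```ruby
string = "Looking for the ^[cows] and the ^[sheep]"

# Named capture: the second argument to [] can be the group's name.
first_note = string[/\^\[(?<note>.*?)\]/, "note"]

# scan returns one array per match when the pattern has a capture group,
# so flatten gives a plain list of footnote texts.
all_notes = string.scan(/\^\[(.*?)\]/).flatten

puts first_note
p all_notes
```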

Trim a trailing .0

I have an Excel column containing part numbers. Here is a sample
As you can see, it can be many different datatypes: Float, Int, and String. I am using roo gem to read the file. The problem is that roo interprets integer cells as Float, adding a trailing zero to them (16431 => 16431.0). I want to trim this trailing zero. I cannot use to_i because it will trim all the trailing numbers of the cells that require a decimal in them (the first row in the above example) and will cut everything after a string char in the String rows (the last row in the above example).
Currently, I have a method that checks the last two characters of the cell and trims them if they are ".0"
def trim(row)
if row[0].to_s[-2..-1] == ".0"
row[0] = row[0].to_s[0..-3]
end
end
This works, but it feels terrible and hacky. What is the proper way of getting my Excel file contents into a Ruby data structure?
def trim num
i, f = num.to_i, num.to_f
i == f ? i : f
end
trim(2.5) # => 2.5
trim(23) # => 23
or, from string:
def convert x
Float(x)
i, f = x.to_i, x.to_f
i == f ? i : f
rescue ArgumentError
x
end
convert("fjf") # => "fjf"
convert("2.5") # => 2.5
convert("23") # => 23
convert("2.0") # => 2
convert("1.00") # => 1
convert("1.10") # => 1.1
For those using Rails, ActionView has the number_with_precision method that takes a strip_insignificant_zeros: true argument to handle this.
number_with_precision(13.00, precision: 2, strip_insignificant_zeros: true)
# => "13"
number_with_precision(13.25, precision: 2, strip_insignificant_zeros: true)
# => "13.25"
See the number_with_precision documentation for more information.
This should cover your needs in most cases: some_value.gsub(/(\.)0+$/, '').
It removes a decimal point that is followed only by zeroes up to the end of the line. Otherwise, it leaves the string alone, so the trailing zero in a value like "123.560" is kept.
It's also very performant, as it is entirely string-based, requiring no floating point or integer conversions, assuming your input value is already a string:
Loading development environment (Rails 3.2.19)
irb(main):001:0> '123.0'.gsub(/(\.)0+$/, '')
=> "123"
irb(main):002:0> '123.000'.gsub(/(\.)0+$/, '')
=> "123"
irb(main):003:0> '123.560'.gsub(/(\.)0+$/, '')
=> "123.560"
irb(main):004:0> '123.'.gsub(/(\.)0+$/, '')
=> "123."
irb(main):005:0> '123'.gsub(/(\.)0+$/, '')
=> "123"
irb(main):006:0> '100'.gsub(/(\.)0+$/, '')
=> "100"
irb(main):007:0> '127.0.0.1'.gsub(/(\.)0+$/, '')
=> "127.0.0.1"
irb(main):008:0> '123xzy45'.gsub(/(\.)0+$/, '')
=> "123xzy45"
irb(main):009:0> '123xzy45.0'.gsub(/(\.)0+$/, '')
=> "123xzy45"
irb(main):010:0> 'Bobby McGee'.gsub(/(\.)0+$/, '')
=> "Bobby McGee"
irb(main):011:0>
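A slightly tighter variant, offered as a sketch rather than a replacement: since the pattern can match at most once at the end of the string, sub is enough, and \z anchors to the end of the string where $ anchors to the end of any line. The method name is hypothetical.

```ruby
# Hypothetical helper: drop a ".0", ".00", ... suffix at the very end.
def strip_trailing_point_zero(text)
  text.sub(/\.0+\z/, '')
end

puts strip_trailing_point_zero('123.0')
puts strip_trailing_point_zero('123.560')
puts strip_trailing_point_zero('127.0.0.1')

# With a multi-line value the two anchors behave differently:
multi = "1.0\n2.5"
with_dollar = multi.gsub(/(\.)0+$/, '')        # also trims the first line
with_z      = strip_trailing_point_zero(multi) # only looks at the string end
p with_dollar
p with_z
```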
Numeric values are returned as type :float
def convert_cell(cell)
if cell.is_a?(Float)
i = cell.to_i
cell == i.to_f ? i : cell
else
cell
end
end
convert_cell("foobar") # => "foobar"
convert_cell(123) # => 123
convert_cell(123.4) # => 123.4
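Applied to a whole row, that could look like the sketch below. The row values are made up apart from 16431.0 from the question, and the finite? guard is an addition of mine, since calling to_i on Infinity or NaN raises FloatDomainError.

```ruby
# Same idea as convert_cell above, with a guard for non-finite floats.
def convert_cell(cell)
  return cell unless cell.is_a?(Float) && cell.finite?

  whole = cell.to_i
  cell == whole ? whole : cell
end

# Hypothetical row as a spreadsheet reader might return it.
row = [16431.0, 2.5, "AB-100", 7]
converted = row.map { |cell| convert_cell(cell) }

p converted
```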

Regexp for finding href in <a> open-uri ruby

I need to find the distance between two websites using Ruby's open-uri. Using
def check(url)
site = open(url.base_url)
link = %r{^<([a])([^"]+)*([^>]+)*(?:>(.*)<\/\1>|\s+\/>)$}
site.each_line {|line| puts $&,$1,$2,$3,$4 if (line=~link)}
p url.links
end
Finding the links is not working properly. Any ideas why?
If you want to find the a tags' href parameters, use the right tool, which usually isn't a regex. More likely you should use an HTML/XML parser.
Nokogiri is the parser of choice with Ruby:
require 'nokogiri'
require 'open-uri'
# On Ruby 3.0+ Kernel#open no longer opens URLs, so use URI.open from open-uri
doc = Nokogiri::HTML(URI.open('http://www.example.org/index.html'))
doc.search('a').map{ |a| a['href'] }
pp doc.search('a').map{ |a| a['href'] }
# => [
# => "/",
# => "/domains/",
# => "/numbers/",
# => "/protocols/",
# => "/about/",
# => "/go/rfc2606",
# => "/about/",
# => "/about/presentations/",
# => "/about/performance/",
# => "/reports/",
# => "/domains/",
# => "/domains/root/",
# => "/domains/int/",
# => "/domains/arpa/",
# => "/domains/idn-tables/",
# => "/protocols/",
# => "/numbers/",
# => "/abuse/",
# => "http://www.icann.org/",
# => "mailto:iana#iana.org?subject=General%20website%20feedback"
# => ]
I see several issues with this regular expression:
It is not necessarily the case that a space must come before the trailing slash in an empty tag, yet your regexp requires it
Your regexp is very verbose and redundant
Try the following instead; it extracts the URL from <a> tags:
link = /<a \s # Start of tag
[^>]* # Some whitespace, other attributes, ...
href=" # Start of URL
([^"]*) # The URL, everything up to the closing quote
" # The closing quotes
/x # We stop here, as regular expressions wouldn't be able to
# correctly match nested tags anyway
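To try the pattern without fetching a page, here is a sketch that runs it over an inline HTML string with String#scan. The HTML snippet is made up, and the pattern still only handles double-quoted href values.

```ruby
link = %r{<a\s       # start of tag
          [^>]*      # other attributes, if any
          href="     # start of URL
          ([^"]*)    # the URL, everything up to the closing quote
          "          # the closing quote
         }x

# Hypothetical HTML fragment standing in for the downloaded page.
html = <<~HTML
  <p><a href="/about/">About</a> and
  <a class="ext" href="http://www.example.org/">Example</a></p>
HTML

# scan yields one single-element array per match; flatten unwraps them.
hrefs = html.scan(link).flatten
p hrefs
```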

How can I check a word is already all uppercase?

I want to be able to check if a word is already all uppercase. And it might also include numbers.
Example:
GO234 => yes
Go234 => no
You can compare the string with the same string but in uppercase:
'go234' == 'go234'.upcase #=> false
'GO234' == 'GO234'.upcase #=> true
a = "Go234"
a.match(/\p{Lower}/) # => #<MatchData "o">
b = "GO234"
b.match(/\p{Lower}/) # => nil
c = "123"
c.match(/\p{Lower}/) # => nil
d = "µ"
d.match(/\p{Lower}/) # => #<MatchData "µ">
So when the match result is nil, it is in uppercase already, else something is in lowercase.
Thanks to @mu is too short, who pointed out that we should use /\p{Lower}/ so that non-English lowercase letters are matched as well.
I am using the solution by @PeterWong and it works great as long as the string you're checking against doesn't contain any special characters (as pointed out in the comments).
However if you want to use it for strings like "Überall", just add this slight modification:
utf_pattern = Regexp.new("\\p{Lower}".force_encoding("UTF-8"))
a = "Go234"
a.match(utf_pattern) # => #<MatchData "o">
b = "GO234"
b.match(utf_pattern) # => nil
b = "ÜÖ234"
b.match(utf_pattern) # => nil
b = "Über234"
b.match(utf_pattern) # => #<MatchData "b">
Have fun!
You could either compare the string and string.upcase for equality (as shown by JCorc..)
irb(main):007:0> str = "Go234"
=> "Go234"
irb(main):008:0> str == str.upcase
=> false
OR
you could call arg.upcase! and check for nil. (But this will modify the original argument, so you may have to create a copy)
irb(main):001:0> "GO234".upcase!
=> nil
irb(main):002:0> "Go234".upcase!
=> "GO234"
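Update: If you want this to work for Unicode (multi-byte) strings on Ruby versions before 2.4, then String#upcase won't work, because it only converts ASCII letters there; you'd need the unicode-util gem mentioned in this SO question. From Ruby 2.4 on, String#upcase handles Unicode case mapping itself.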
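Pulling the answers together, a small predicate could look like this sketch. The method name is hypothetical, and it assumes Ruby 2.4 or later for match?. A string with no letters at all, such as "123", counts as uppercase here because it contains no lowercase letter.

```ruby
# Hypothetical helper: true when the word has no lowercase letters.
def all_uppercase?(word)
  !word.match?(/\p{Lower}/)
end

puts all_uppercase?("GO234")
puts all_uppercase?("Go234")
puts all_uppercase?("ÜÖ234")
puts all_uppercase?("Über234")
puts all_uppercase?("123")
```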

Resources