ruby regex string $ - ruby

I need to extract the price from strings like these:
s1 = "somefing $ 100"
s2 = "$ 19081 words $"
s3 = "30$"
s4 = "hi $90"
s5 = "wow 150"
Output should be:
s1 = "100"
s2 = "19081"
s3 = "30"
s4 = "90"
s5 = nil
I use the following regex:
price = str[/\$\s*(\d+)|(\d+)\s*\$/, 1]
But it doesn't work for all types of strings.

Your code always returns the result of the first capture group, whereas in the failing case it is the second capture group that you are interested in. I don't think the [] method has a good way of dealing with this (when using numbered capture groups). You could write it like so:
price = str =~ /\$\s*(\d+)|(\d+)\s*\$/ && ($1 || $2)
Although this isn't very legible. If instead you use a named capture group, then you can do
price = str[/\$\s*(?<amount>\d+)|(?<amount>\d+)\s*\$/, 'amount']
Duplicate named capture groups won't always do what you want but when they are in separate alternation branches (as they are here) then it should work.
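As a minimal sketch of the numbered-group variant, the lookup can be wrapped in a helper and run against all five sample strings from the question. The names PRICE_RE and extract_price are illustrative and not part of the question; the helper simply returns whichever of the two groups took part in the match.

```ruby
# Hypothetical helper: PRICE_RE and extract_price are names chosen for illustration.
PRICE_RE = /\$\s*(\d+)|(\d+)\s*\$/

def extract_price(str)
  m = PRICE_RE.match(str)
  # Only one alternation branch can match, so at most one capture is non-nil.
  m && m.captures.compact.first
end

samples = ["somefing $ 100", "$ 19081 words $", "30$", "hi $90", "wow 150"]
samples.each { |s| puts "#{s.inspect} -> #{extract_price(s).inspect}" }
```

Returning nil when nothing matches keeps the behaviour the question asks for in the "wow 150" case.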

The problem is that you're always taking the value from the first regex group and never checking the second. So you're not handling the case after | (the one where the digits come before the $ sign).
If you look at the graphical representation of your regex, passing 1 as the second argument in the square brackets covers only the upper path (first case); the lower one (second case) is never checked.
Basically, try:
price = str[/\$\s*(\d+)|(\d+)\s*\$/, 1] || str[/\$\s*(\d+)|(\d+)\s*\$/, 2]
Use || rather than or here: or binds more loosely than =, so price would only ever receive the result of the first lookup.
P.S. I'm not that experienced in Ruby, there might be some more optimal way to type this, but this should do the trick

Try this; it's simpler, though it may not be the most efficient. Note that you still need to check both groups:
p1 = s1.gsub(' ', '')[/\$(\d+)|(\d+)\$/] && ($1 || $2)

Related

What is the best way to split a CSV line that contains commas and double quotes?

Let's say I have the following string and I want the output below without requiring the csv library.
this, "what I need", to, do, "i, want, this", to, work
this
what I need
to
do
i, want, this
to
work
This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."
We can solve it with a beautifully-simple regex:
"([^"]+)"|[^, ]+
The left side of the alternation | matches complete "quotes" and captures the contents to Group1. The right side matches characters that are neither commas nor spaces, and we know they are the right ones because they were not matched by the expression on the left.
Option 2: Allowing Multiple Words
In your input, all tokens are single words, but if you also want the regex to work for my cat scratches, "what I need", your dog barks, use this:
"([^"]+)"|[^, ]+(?:[ ]*[^, ]+)*
The only difference is the addition of (?:[ ]*[^, ]+)* which optionally adds spaces + characters, zero or more times.
This program shows how to use the regex (see the results at the bottom of the online demo):
subject = 'this, "what I need", to, do, "i, want, this", to, work'
regex = /"([^"]+)"|[^, ]+/
# collect the Group 1 capture for quoted fields, or the whole match otherwise
mymatches = []
subject.scan(regex) do
  mymatches << ($1 || $&)
end
mymatches.each { |x| puts x }
Output
this
what I need
to
do
i, want, this
to
work
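The multi-word variant from Option 2 can be exercised the same way. This is only a sketch: split_fields and MULTI_WORD_RE are hypothetical names, and the input is the example sentence from the Option 2 paragraph.

```ruby
# Option 2 regex from the answer; the constant and method names are illustrative.
MULTI_WORD_RE = /"([^"]+)"|[^, ]+(?:[ ]*[^, ]+)*/

def split_fields(line)
  fields = []
  line.scan(MULTI_WORD_RE) do
    # Group 1 is set only when the quoted branch matched;
    # otherwise fall back to the whole match.
    fields << (Regexp.last_match(1) || Regexp.last_match(0))
  end
  fields
end

puts split_fields('my cat scratches, "what I need", your dog barks')
```

The block form of scan is used because, with a capture group present, the non-block form would return only the group values and lose the unquoted fields.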
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
Article about matching a pattern unless...

RegEx to remove new line characters and replace with comma

I scraped a website using Nokogiri and after using xpath I was left with the following string (which is a few td's pushed into one string).
"Total First Downs\n\t\t\t\t\t\t\t\t359\n\t\t\t\t\t\t\t\t274\n\t\t\t\t\t\t\t"
My goal is to make this into an array that looks like the following (it will be part of a nested array):
["Total First Downs", "359", "274"]
The issue is creating a regex that removes the escaped characters and substitutes a single "," for each run, but does not add a "," after the last set of integers. If the comma after the last set of integers is unavoidable, I could use #compact to get rid of the nil that occurs in the array. In case you need it, here is the code I used to scrape the website (please note that I saved the webpage for testing so that my IP address does not get burned during the trial phase):
require 'nokogiri'

f = File.open('page')
doc = Nokogiri::HTML(f)
f.close
number = doc.xpath('//tr[@class="tbdy1"]').count
stats = Array.new(number) { Array.new }
i = 0
doc.xpath('//tr[@class="tbdy1"]').each do |tr|
  stats[i] << tr.text
  i += 1
end
Thanks for your help
I don't fully understand your problem, but the result can be easily achieved with this:
"Total First Downs\n\t\t\t\t\t\t\t\t359\n\t\t\t\t\t\t\t\t274\n\t\t\t\t\t\t\t"
.split(/[\n\t]+/)
# => ["Total First Downs", "359", "274"]
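To get the nested array the question asks for, the same split can be applied to each row's text. The following is a sketch under assumptions: split_cells is a hypothetical helper, and row_texts stands in for the tr.text values collected from the scrape (shortened here).

```ruby
# Hypothetical helper name; not part of the original answer.
def split_cells(row_text)
  # strip guards against leading whitespace, which would otherwise
  # produce an empty first cell; split drops trailing empties itself.
  row_text.strip.split(/[\n\t]+/)
end

# Stand-in for the text of each scraped <tr> (assumed sample data).
row_texts = ["Total First Downs\n\t\t\t359\n\t\t\t274\n\t\t\t"]
stats = row_texts.map { |text| split_cells(text) }
p stats
```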
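Try gsub with a regex literal (a quoted "/.../" pattern is treated as a plain string and never matches). Note that this leaves a trailing comma:
"Total First Downs\n\t\t\t\t\t\t\t\t359\n\t\t\t\t\t\t\t\t274\n\t\t\t\t\t\t\t".gsub(/[\n\t]+/, ",")
# => "Total First Downs,359,274,"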

Ruby Regexp group matching, assign variables on 1 line

I'm currently trying to parse a string into multiple variables with a regexp. Example string:
ryan_string = "RyanOnRails: This is a test"
I've matched it with this regexp, with 3 groups:
ryan_group = ryan_string.scan(/(^.*)(:)(.*)/i)
Now to access each group I have to do something like this:
ryan_group[0][0] (first group) RyanOnRails
ryan_group[0][1] (second group) :
ryan_group[0][2] (third group) This is a test
This seems pretty ridiculous and it feels like I'm doing something wrong. I would expect to be able to do something like this:
g1, g2, g3 = ryan_string.scan(/(^.*)(:)(.*)/i)
Is this possible? Or is there a better way than how I'm doing it?
You don't want scan for this, as it makes little sense. You can use String#match which will return a MatchData object, you can then call #captures to return an Array of captures. Something like this:
#!/usr/bin/env ruby
string = "RyanOnRails: This is a test"
one, two, three = string.match(/(^.*)(:)(.*)/i).captures
p one #=> "RyanOnRails"
p two #=> ":"
p three #=> " This is a test"
Be aware that if no match is found, String#match will return nil, so something like this might work better:
if match = string.match(/(^.*)(:)(.*)/i)
  one, two, three = match.captures
end
Although scan makes little sense for this, it does still do the job; you just need to flatten the returned Array first:
one, two, three = string.scan(/(^.*)(:)(.*)/i).flatten
You could use match or =~ instead, which gives you a single match; you can then either access the match data the same way or use the special match variables $1, $2, $3.
Something like:
if ryan_string =~ /(^.*)(:)(.*)/i
  first = $1
  third = $3
end
You can name your captured matches
string = "RyanOnRails: This is a test"
/(?<one>^.*)(?<two>:)(?<three>.*)/i =~ string
puts one, two, three
This only works with the regex literal on the left-hand side of =~; if you reverse the order of the string and the regex, the local variables are not assigned.
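If relying on operand order feels fragile, the named groups can also be read from the MatchData object. This is a sketch rather than the answerer's method: LABEL_RE and split_label are hypothetical names, and the pattern is the one from the answer above.

```ruby
# Same pattern as above; the constant and method names are illustrative.
LABEL_RE = /(?<one>^.*)(?<two>:)(?<three>.*)/i

def split_label(str)
  m = LABEL_RE.match(str)
  return nil unless m # no colon in the input
  # MatchData#[] accepts a group name as a symbol or a string.
  [m[:one], m[:two], m[:three]]
end

one, two, three = split_label("RyanOnRails: This is a test")
puts one, two, three
```

Returning nil for a failed match lets the caller decide how to handle input without a colon.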
You have to decide whether it is a good idea, but a Ruby regexp can (automagically) define local variables for you!
I am not yet sure whether this feature is awesome or just totally crazy.
ryan_string = "RyanOnRails: This is a test"
/^(?<webframework>.*)(?<colon>:)(?<rest>.*)/ =~ ryan_string
# This defined three variables for you. Crazy, but true.
webframework # => "RyanOnRails"
puts "W: #{webframework} , C: #{colon}, R: #{rest}"
(Take a look at http://ruby-doc.org/core-2.1.1/Regexp.html , search for "local variable").
Note:
As pointed out in a comment, I see that there is a similar and earlier answer to this question by @toonsend (https://stackoverflow.com/a/21412455). I do not think I was "stealing", but if you want to be fair with praises and honor the first answer, feel free :) I hope no animals were harmed.
scan() will find all non-overlapping matches of the regex in your string, so instead of returning an array of your groups like you seem to be expecting, it is returning an array of arrays.
You are probably better off using match(), and then getting the array of captures using MatchData#captures:
g1, g2, g3 = ryan_string.match(/(^.*)(:)(.*)/i).captures
However you could also do this with scan() if you wanted to:
g1, g2, g3 = ryan_string.scan(/(^.*)(:)(.*)/i)[0]

Ruby String pad zero OPE ID

I'm working with OPE IDs. One file has them with two trailing zeros, eg, [998700, 1001900]. The other file has them with one or two leading zeros for a total length of six, eg, [009987, 010019]. I want to convert every OPE ID (in both files) to an eight-digit string with exactly two leading zeros and however many zeros at the end to get it to be eight digits long.
Try this:
a = [ "00123123", "077934", "93422", "1231234", "12333" ]
a.map { |n| n.gsub(/^0*/, '00').ljust(8, '0') }
=> ["00123123", "00779340", "00934220", "001231234", "00123330"]
If you have your data parsed and stored as strings, it could be done like this, for example.
n = ["998700", "1001900", "009987", "0010019"]
puts n.map { |i|
  i =~ /^0*([0-9]+?)0*$/
  "00" + $1 + "0" * [0, 6 - $1.length].max
}
Output:
00998700
00100190
00998700
00100190
This example on codepad.
I'm not very sure, though, that I got the description exactly right. Please check the comments and I'll correct it in case it's not exactly what you were looking for.
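The same logic can be packaged as a small method, which makes it easier to apply to both files. This is a sketch under a stated assumption: the significant digits of an OPE ID never end in zero, because trailing zeros are treated as padding. The name normalize_ope_id is hypothetical.

```ruby
# Hypothetical helper; assumes trailing zeros are always padding.
def normalize_ope_id(id)
  # Capture the digits between any leading and trailing zeros.
  core = id.to_s[/\A0*([0-9]+?)0*\z/, 1]
  return nil unless core # input was not purely digits
  # Two leading zeros, then the core right-padded with zeros to six digits.
  "00" + core.ljust(6, "0")
end

puts ["998700", "1001900", "009987", "0010019"].map { |i| normalize_ope_id(i) }
```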
With the help of the answers given by @detunized & @nimblegorilla, I came up with:
"998700"[0..-3].rjust(6, '0').to_sym
to make the first format I described (always with two trailing zeros) equal to the second.

format string (postcode) in ruby

I need to re-format a list of UK postcodes and have started with the following to strip whitespace and capitalize:
postcode.upcase.gsub(/\s/,'')
I now need to change the postcode so the new postcode will be in a format that will match the following regexp:
^([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)$
I would be grateful for any assistance.
If this standards doc is to be believed (and Wikipedia concurs), formatting a valid post code for output is straightforward: the last three characters are the second part, everything before is the first part!
So assuming you have a valid postcode, without any pre-embedded space, you just need
def format_post_code(pc)
  pc.strip.sub(/([A-Z0-9]+)([A-Z0-9]{3})/, '\1 \2')
end
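Combined with the upcasing and whitespace stripping from the question, the "last three characters" rule can be sketched as follows. The name tidy_post_code and the minimum-length check are assumptions for illustration; the method does not validate that the result is a real postcode.

```ruby
# Hypothetical helper: formats only, does not validate.
def tidy_post_code(raw)
  compact = raw.upcase.gsub(/\s+/, "")
  # Assumed sanity check: a compact postcode has at least five characters.
  return nil if compact.length < 5
  # The last three characters form the second part.
  compact.sub(/\A(.+)(.{3})\z/, '\1 \2')
end

puts tidy_post_code("m11aa")
puts tidy_post_code(" ec1a 1bb ")
```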
If you want to validate an input post code first, then the regex you gave looks like a good starting point. Perhaps something like this?
NORMAL_POSTCODE_RE = /^([A-PR-UWYZ][A-HK-Y0-9][A-HJKS-UW0-9]?[A-HJKS-UW0-9]?)\s*([0-9][ABD-HJLN-UW-Z]{2})$/i
GIROBANK_POSTCODE_RE = /^GIR\s*0AA$/i
def format_post_code(pc)
  return pc.strip.upcase.sub(NORMAL_POSTCODE_RE, '\1 \2') if pc =~ NORMAL_POSTCODE_RE
  return 'GIR 0AA' if pc =~ GIROBANK_POSTCODE_RE
end
Note that I removed the '0-9' part of the first character, which appears unnecessary according to the sources I quoted. I also changed the alpha sets to match the first-cited document. It's still not perfect: a code of the format 'AAA ANN' validates, for example, and I think a more complex RE is probably required.
I think this might cover it (constructed in stages for easier fixing!)
A1 = "[A-PR-UWYZ]"
A2 = "[A-HK-Y]"
A34 = "[A-HJKS-UW]" # assume rule for alpha in fourth char is same as for third
A5 = "[ABD-HJLN-UW-Z]"
N = "[0-9]"
AANN = A1 + A2 + N + N # the six possible first-part combos
AANA = A1 + A2 + N + A34
ANA = A1 + N + A34
ANN = A1 + N + N
AAN = A1 + A2 + N
AN = A1 + N
PART_ONE = [AANN, AANA, ANA, ANN, AAN, AN].join('|')
PART_TWO = N + A5 + A5
NORMAL_POSTCODE_RE = Regexp.new("^(#{PART_ONE})[ ]*(#{PART_TWO})$", Regexp::IGNORECASE)
UK Postcodes aren't consistent, but they are finite - you might be better with a look-up table.
Reformat or pattern match? I suspect the latter, although upcasing it first is a good idea.
Before we proceed, though, I would point out that you are stripping spaces but your regex contains " {1,2}", which means "one or two space characters". Because you have already stripped the whitespace, every post code will fail the match.
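A quick check makes the effect visible. This sketch uses the regex from the question unchanged; the constant and variable names are illustrative.

```ruby
# Regex copied from the question; POSTCODE_RE is an illustrative name.
POSTCODE_RE = /^([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)$/

with_space    = "m1 1aa".upcase                # "M1 1AA"
without_space = "m1 1aa".upcase.gsub(/\s/, '') # "M11AA"

puts POSTCODE_RE.match?(with_space)    # true
puts POSTCODE_RE.match?(without_space) # false, the required space is gone
```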
Given a post code as input we can check whether it matches the regex using =~
Here we create some example post codes (taken from the Wikipedia page), and test each one against the regex:
post_codes = ["M1 1AA", "M60 1NW", "CR2 6XH", "DN55 1PT", "W1A 1HQ", "EC1A 1BB", "bad one", "cc93h29r2"]
r = /^([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)$/
post_codes.each do |pc|
  # pc =~ r will return something true if we have a match (specifically the integer of first match position)
  # We use !! to display it as true|false
  puts "#{pc}: #{!!(pc =~ r)}"
end
M1 1AA: true
M60 1NW: true
CR2 6XH: true
DN55 1PT: true
W1A 1HQ: true
EC1A 1BB: true
bad one: false
cc93h29r2: false

Resources