Getting some elements in a string using a regex - ruby

Context
Using Ruby I am parsing strings looking like this:
A type with an ID...
[Image=4b5da003ee133e8368000002]
[Video=679hfpam9v56dh800khfdd32]
...with between 0 and n additional options separated with #...
[Image=4b5da003ee133e8368000002#size:small]
[Image=4b5da003ee133e8368000002#size:small#media:true]
In this example:
[Image=4b5da003ee133e8368000002#size:small#media:true]
I want to retrieve:
[Image=4b5da003ee133e8368000002#size:small#media:true]
Image
4b5da003ee133e8368000002
size:small
media:true
Problem
Right now using this regex:
(\[([a-zA-Z]+)=([a-zA-Z0-9]+)(#[a-zA-Z]+:[a-zA-Z]+)*\])
I get...
[Image=4b5da003ee133e8368000002#size:small#media:true]
Image
4b5da003ee133e8368000002
#media:true
What am I doing wrong? How can I get what I want?
PS: All the results are copied from http://rubular.com/ which is nice to debug regex. Please use it if it can help you help me :)
Edit : if it's impossible to get all options separated, how could I get this:
[Image=4b5da003ee133e8368000002#size:small#media:true]
Image
4b5da003ee133e8368000002
#size:small#media:true

Edit:
Ruby's regex engine, like most others, keeps only the last capture of a repeated group rather than every repetition. Therefore, you'll have to do it in two steps: first capture all the #key:value options as one string, then split that string.
To get all of them, this should work:
(\[([a-zA-Z]+)=([a-zA-Z0-9]+)((?:#[a-zA-Z]+:[a-zA-Z]+)*)\])
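A minimal sketch of that two-step approach, assuming every option has the key:value shape shown in the question. The variable names and the choice to collect the options into a Hash are illustrative and not part of the original answer.

```ruby
str = "[Image=4b5da003ee133e8368000002#size:small#media:true]"

# Step 1: capture the type, the ID and the whole option tail in one match.
pattern = /\[([a-zA-Z]+)=([a-zA-Z0-9]+)((?:#[a-zA-Z]+:[a-zA-Z]+)*)\]/
type, id, tail = str.match(pattern).captures

# Step 2: pull every "#key:value" pair out of the tail.
# scan returns [[key, value], ...]; to_h turns that into a Hash.
options = tail.scan(/#([a-zA-Z]+):([a-zA-Z]+)/).to_h

puts type
puts id
options.each { |key, value| puts "#{key}:#{value}" }
```

With no options present, the tail is an empty string and options ends up as an empty Hash, so the zero-option case needs no special handling.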

To get the "tail" of options, you could fetch it from $4 with
/(\[([a-zA-Z]+)=([a-zA-Z0-9]+)((#[a-zA-Z]+:[a-zA-Z]+)*)\])/
and then split on hash signs (#).
For example:
#! /usr/bin/ruby
str = "[Image=4b5da003ee133e8368000002#size:small#media:true]"
if /(\[([a-zA-Z]+)=([a-zA-Z0-9]+)((#[a-zA-Z]+:[a-zA-Z]+)*)\])/.match(str)
  print $1, "\n",
        $2, "\n",
        $3, "\n",
        $4, "\n";
  $4[1..-1].split(/#/).each do |s|
    print s, "\n";
  end
end
Output:
[Image=4b5da003ee133e8368000002#size:small#media:true]
Image
4b5da003ee133e8368000002
#size:small#media:true
size:small
media:true

(\[([a-zA-Z]+)=([a-zA-Z0-9]+)(?:#([a-zA-Z]+:[a-zA-Z]+))*\])
will give you media:true. Note that media:true is overwriting the previous size:small match. I don't think there's a way to get exactly what you want in a single match call.

It looks like the regex only keeps the last match. I think to get the list of matches will require a different approach.
"a=b#c:d#e:f".split(/=|#/)
which creates a list:
["a", "b", "c:d", "e:f"]
which is close to what you want...
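Applied to the strings from the question, the split idea could look like the sketch below. It assumes the input is already known to be well formed, since split does no validation; the variable names are illustrative.

```ruby
str = "[Image=4b5da003ee133e8368000002#size:small#media:true]"

# Drop the surrounding brackets, then split on "=" and "#".
# The splat collects everything after the first two fields.
type, id, *options = str[1..-2].split(/=|#/)

puts type
puts id
puts options
```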

Although it can be tricky to do it purely within a regexp, it's not too hard to split it out as a two-step operation:
while (line = DATA.gets)
  line.chomp!
  if (m = line.match(/\[([a-zA-Z]+)=([a-zA-Z0-9]+)((?:#[a-zA-Z]+:[a-zA-Z]+)*)\]/))
    (type, hash, options) = m.to_a[1, 3]
    options = options.split(/#/).reject { |s| s.empty? }
    puts [ type, hash, options.join(',') ].join(' / ')
  end
end
__END__
[Image=4b5da003ee133e8368000002]
[Video=679hfpam9v56dh800khfdd32]
[Image=4b5da003ee133e8368000002#size:small]
[Image=4b5da003ee133e8368000002#size:small#media:true]
[Image=4b5da003ee133e8368000002#size:small#media:true#foo:bar]
This produces the output:
Image / 4b5da003ee133e8368000002 /
Video / 679hfpam9v56dh800khfdd32 /
Image / 4b5da003ee133e8368000002 / size:small
Image / 4b5da003ee133e8368000002 / size:small,media:true
Image / 4b5da003ee133e8368000002 / size:small,media:true,foo:bar

Related

How do I regex-match an unknown number of repeating elements?

I'm trying to write a Ruby script that replaces all rem values in a CSS file with their px equivalents. This would be an example CSS file:
body{font-size:1.6rem;margin:4rem 7rem;}
The MatchData I'd like to get would be:
# Match 1 Match 2
# 1. font-size 1. margin
# 2. 1.6 2. 4
# 3. 7
However I'm entirely clueless as to how to get multiple and different MatchData results. The RegEx that got me closest is this (you can also take a look at it at Rubular):
/([^}{;]+):\s*([0-9.]+?)rem(?=\s*;|\s*})/i
This will match single instances of value declarations (so it will properly return the desired Match 1 result), but entirely disregards multiples.
I also tried something along the lines of ([0-9.]+?rem\s*)+, but that didn't return the desired result either, and doesn't feel like I'm on the right track, as it won't return multiple result data sets.
EDIT After the suggestions in the answers, I ended up solving the problem like this:
# search for any declarations that contain rem unit values and modify blockwise
@output.gsub!(/([^ }{;]+):\s*([^}{;]*[0-9.]rem+[^;]*)(?=\s*;|\s*})/i) do |match|
  # search for any single rem value
  string = match.gsub(/([0-9.]+)rem/i) do |value|
    # convert the rem value to px by multiplying by 10 (this is not universal!)
    value = sprintf('%g', Regexp.last_match[1].to_f * 10).to_s + 'px'
  end
  string += ';' + match # append the original match result to the replacement
  match = string # overwrite the matched result
end
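The inner conversion step can be tried on its own with the example CSS from above. This is only a sketch: it assumes 1rem equals 10px, as the comment in the code says, and it replaces the rem values outright instead of keeping the original declaration as a fallback.

```ruby
css = 'body{font-size:1.6rem;margin:4rem 7rem;}'

# Replace every "<number>rem" with its px equivalent.
# Assumes 1rem == 10px, which is not universal.
px_css = css.gsub(/([0-9.]+)rem/i) do
  format('%g', Regexp.last_match(1).to_f * 10) + 'px'
end

puts px_css  # body{font-size:16px;margin:40px 70px;}
```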
You can't capture a dynamic number of match groups (at least not in ruby).
Instead you could do either one of the following:
Capture the whole value and split on space
Use multilevel matching to capture first the whole key/value pair and secondly match the value. You can use blocks on the match method in ruby.
This regex will do the job for your example :
([^}{;]+):(?:([0-9\.]+?)rem\s?)?(?:([0-9\.]+?)rem\s?)
But with this you can't match something like: margin:4rem 7rem 9rem
This is what I've been able to do: DEMO
Regex: (?<={|;)([^:}]+)(?::)([^A-Za-z]+)
And this is what my result looks like:
# Match 1 Match 2
# 1. font-size 1. margin
# 2. 1.6 2. 4
As @koffeinfrei says, dynamic capture isn't possible in Ruby. It would be smarter to capture the whole value string and split it on spaces.
str = 'body{font-size:1.6rem;margin:4rem 7rem;}'
str.scan(/(?<=[{; ]).+?(?=[;}])/)
.map { |e| e.match /(?<prop>.+):(?<value>.+)/ }
#⇒ [
# [0] #<MatchData "font-size:1.6rem" prop:"font-size" value:"1.6rem">,
# [1] #<MatchData "margin:4rem 7rem" prop:"margin" value:"4rem 7rem">
# ]
The latter match might be easily adapted to return whatever you want, value.split(/\s+/) will return all the values, \d+ instead of .+ will match digits only etc.
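As a sketch of that adaptation, the value can be split on whitespace so that each declaration yields its property plus a list of values. The result layout (an array of pairs) is an assumption chosen for illustration.

```ruby
str = 'body{font-size:1.6rem;margin:4rem 7rem;}'

declarations = str.scan(/(?<=[{; ]).+?(?=[;}])/).map do |e|
  m = e.match(/(?<prop>.+):(?<value>.+)/)
  # One property name, any number of whitespace-separated values.
  [m[:prop], m[:value].split(/\s+/)]
end

declarations.each do |prop, values|
  puts "#{prop}: #{values.join(', ')}"
end
```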

RegEx to remove new line characters and replace with comma

I scraped a website using Nokogiri and after using xpath I was left with the following string (which is a few td's pushed into one string).
"Total First Downs\n\t\t\t\t\t\t\t\t359\n\t\t\t\t\t\t\t\t274\n\t\t\t\t\t\t\t"
My goal is to make this into an array that looks like the following(it will be a nested array):
["Total First Downs", "359", "274"]
The issue is writing a regex that removes the escaped characters and substitutes a single "," for each run, but does not add a "," after the last set of integers. If the comma after the last set of integers is unavoidable, I could use #compact to get rid of the nil that occurs in the array. If you need the code I used to scrape the website, here it is (please note I saved the web page for testing so that my IP address does not get burned during the trial phase):
require 'nokogiri'

f = File.open('page')
doc = Nokogiri::HTML(f)
f.close
number = doc.xpath('//tr[@class="tbdy1"]').count
stats = Array.new(number) {Array.new}
i = 0
doc.xpath('//tr[@class="tbdy1"]').each do |tr|
  stats[i] << tr.text
  i += 1
end
Thanks for your help
I don't fully understand your problem, but the result can be easily achieved with this:
"Total First Downs\n\t\t\t\t\t\t\t\t359\n\t\t\t\t\t\t\t\t274\n\t\t\t\t\t\t\t"
.split(/[\n\t]+/)
# => ["Total First Downs", "359", "274"]
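For the nested array the question asks for, the same split can be mapped over every row's text. A small sketch, assuming the row texts have already been extracted into an array of strings; the second row is made-up sample data, not from the scraped page.

```ruby
rows = [
  "Total First Downs\n\t\t\t\t\t\t\t\t359\n\t\t\t\t\t\t\t\t274\n\t\t\t\t\t\t\t",
  "Made Up Row\n\t\t1\n\t\t2\n\t" # hypothetical second row
]

# strip guards against leading whitespace, which would otherwise
# produce an empty first element; split drops trailing empties itself.
stats = rows.map { |text| text.strip.split(/[\n\t]+/) }

stats.each { |row| puts row.join(' | ') }
```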
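Try with gsub (the pattern has to be a regex literal, not a string, and the result is still a single string with a trailing comma):
"Total First Downs\n\t\t\t\t\t\t\t\t359\n\t\t\t\t\t\t\t\t274\n\t\t\t\t\t\t\t".gsub(/[\n\t]+/, ",")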

Regex returning weird arrays

I want to make an array of results from a string like this one, using a regular expression:
results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo\ntimestamps||friday
Here’s my regex as it stands. It works in Sublime Text’s regex search but not in Ruby:
(results)\|.*?\\n(?=((results\|)|(timestamps\|\|)))
and this would be the desired result:
1. results|foofoofoo
2. results|barbarbarbar
3. results|googoogoo
Instead I’m getting these weird returns, and I can’t understand it. Why does this not select the result lines?
Match 1
1. results
2. results|
3. results|
4.
Match 2
1. results
2. results|
3. results|
4.
Match 3
1. results
2. timestamps||
3.
4. timestamps||
Here’s the actual code using the regex:
# create new lines for each regex'd line body with that body set as the raw attribute
host_scan.raw.scan(/(?:results)\|.*?\\n(?=((?:results\|)|(?:timestamps\|\|)))/).each do |body|
  @lines << Line.new({:raw => body})
end
As Kendall Frey already stated, you are creating too many capture groups. No need to group the first literal “results|”, and no need to group the elements of your alternate group in individual non backreferencing groups. What you are intending to do is this regex:
/results\|.*?(?=\\n(?:results\||timestamps\|\|))/
or, if you don’t mind repeating the \\n part, you can do away with the non-capturing subgroup:
/results\|.*?(?=\\nresults\||\\ntimestamps\|\|)/
– both will return an array of matched values as specified in your question.
I'm guessing it has something to do with capturing groups. If you change all your (...) to (?:...) it will eliminate capturing groups.
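The effect of capture groups on String#scan can be seen in a short sketch. It assumes, as the \\n in the question's pattern suggests, that the input contains a literal backslash followed by n rather than real newlines, hence the single-quoted string.

```ruby
# Single quotes: \n here is a literal backslash plus "n".
text = 'results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo\ntimestamps||friday'

# With a capture group, scan returns only the captured parts.
with_group = text.scan(/(results)\|.*?(?=\\n)/)

# Without capture groups, scan returns the whole matches.
without_group = text.scan(/results\|.*?(?=\\n(?:results\||timestamps\|\|))/)

p with_group
p without_group
```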
Rather than jump to a regex, which is a much more complicated way to get at the data, use split("\n").
text = "results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo\ntimestamps||friday"
ary = text.split("\n")
ary is:
[
"results|foofoofoo",
"results|barbarbarbar",
"results|googoogoo",
"timestamps||friday"
]
Slice that and you can get:
ary[0..2]
=> ["results|foofoofoo", "results|barbarbarbar", "results|googoogoo"]
EDIT:
Based on the comment that there are more carriage returns and complex characters in the strings:
require 'awesome_print'
text = "results|foofoofoo\nmorefoo\nandevenmorefoo\nresults|barbarbarbar\nandmorebar\nandyetagainmorebar\nresults|googoogoo\ntimestamps||friday"
ap text.sub(/\|\|friday$/, '').split('results')[1..-1].map{ |l| 'results' << l }
Which outputs:
[
[0] "results|foofoofoo\nmorefoo\nandevenmorefoo\n",
[1] "results|barbarbarbar\nandmorebar\nandyetagainmorebar\n",
[2] "results|googoogoo\ntimestamps"
]
The answer turned out to lie in the parentheses. Wrapping in parentheses caused it to return the entire match instead of just the tail delimiter.
host_scan.raw.scan(/((?:results\|.*?\\n)(?=(?:results\|)|(?:timestamps\|\|)))/).each do |body|
  @lines << Line.new({:raw => body})
end

replacing lines in ruby string

I'm trying to loop through a Ruby string containing many lines using the each_line method, but I also want to change them. I'm using the following code, but it doesn't seem to work:
string.each_line{|line| line=change_line(line)}
I suppose that Ruby is passing a copy of my line and not the line itself, but unfortunately there is no each_line! method. I also tried the gsub! method, using /^.*$/ to detect each line, but it seems that it calls the change_line method only once and replaces all lines with its result. Any ideas how to do that?
Thanks in advance :)
@azlisum: You are not storing the result of your concatenation. Use:
output = string.lines.map{|line|change_line(line)}.join
Comparing four ways to process by line in a string:
# Inject method (proposed by @steenslag)
output = string.each_line.inject(""){|s, line| s << change_line(line)}
# Join method (proposed by @Lars Haugseth)
output = string.lines.map{|line|change_line(line)}.join
# REGEX method (proposed by @olistik)
output = string.gsub!(/^(.*)$/) {|line| change_line(line)}
# String concatenation += method (proposed by @Erik Hinton)
output = ""
string.each_line{|line| output += change_line(line)}
The timing with Benchmark:
                  user     system      total        real
Inject Time:  7.920000   0.010000   7.930000 (  7.920128)
Join Time:    7.150000   0.010000   7.160000 (  7.155957)
REGEX Time:  11.660000   0.010000  11.670000 ( 11.661059)
+= Time:      7.080000   0.010000   7.090000 (  7.076423)
As @steenslag pointed out, 's += a' will generate a new string for each concatenation and is therefore not usually the best choice.
So given that, and given the times, your best bet is:
output = string.lines.map{|line|change_line(line)}.join
Also, this is the cleaner looking choice IMHO.
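A small self-contained sketch of the difference, using a hypothetical change_line that just upcases its argument (the asker's real method is not shown):

```ruby
# Hypothetical stand-in for the asker's change_line method.
def change_line(line)
  line.upcase
end

string = "first line\nsecond line\n"

# Reassigning the block variable changes nothing outside the block.
string.each_line { |line| line = change_line(line) }

# Collecting the changed lines and joining them keeps the result.
output = string.lines.map { |line| change_line(line) }.join

puts string
puts output
```

String#lines keeps the trailing newline on each line, so join needs no separator here.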
Notes:
Using Benchmark
Ruby-Doc: Benchmark
You could also start with a blank string, iterate over the original string with each_line, and append each result to the blank string.
output = ""
string.each_line{|line| output += change_line(line)}
In your original example, you are correct. Your changes are occurring but they are not being saved anywhere. each in Ruby does not alter anything by default.
You could use gsub! passing a block to it:
string.gsub!(/^(.*)$/) {|line| change_line(line)}
source: String#gsub!
String#each_line is meant for reading lines in a string, not writing them. You can use this to get the result you want like so:
changed_string = ""
string.each_line{ |line| changed_string += change_line(line) }
If you don't give each_line a block, you'll get an enumerator, which has the inject method.
str = <<HERE
smestring dsfg
line 2
HERE
res = str.each_line.inject(""){|m,line|m << line.upcase}

Ruby: Escaping special characters in a string

I am trying to write a method that is the same as mysqli_real_escape_string in PHP. It takes a string and escapes any 'dangerous' characters. I have looked for a method that will do this for me but I cannot find one. So I am trying to write one on my own.
This is what I have so far (I tested the pattern at Rubular.com and it worked):
# Finds the following characters and escapes them by preceding them with a backslash. Characters: ' " . * / \ -
def escape_characters_in_string(string)
pattern = %r{ (\'|\"|\.|\*|\/|\-|\\) }
string.gsub(pattern, '\\\0') # <-- Trying to take the currently found match and add a \ before it I have no idea how to do that).
end
And I am using start_string as the string I want to change, and correct_string as what I want start_string to turn into:
start_string = %("My" 'name' *is* -john- .doe. /ok?/ C:\\Drive)
correct_string = %(\"My\" \'name\' \*is\* \-john\- \.doe\. \/ok?\/ C:\\\\Drive)
Can somebody try and help me determine why I am not getting my desired output (correct_string) or tell me where I can find a method that does this, or even better tell me both? Thanks a lot!
Your pattern isn't defined correctly in your example. This is as close as I can get to your desired output.
Output
"\\\"My\\\" \\'name\\' \\*is\\* \\-john\\- \\.doe\\. \\/ok?\\/ C:\\\\Drive"
It's going to take some tweaking on your part to get it 100% but at least you can see your pattern in action now.
def self.escape_characters_in_string(string)
pattern = /(\'|\"|\.|\*|\/|\-|\\)/
string.gsub(pattern){|match|"\\" + match} # <-- Trying to take the currently found match and add a \ before it I have no idea how to do that).
end
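One possible tweak, sketched under the assumption that only the seven characters listed in the question (' " . * / \ -) should be escaped. It uses a character class instead of the alternation and prints the result with puts, so the backslashes appear once rather than in inspect form. For real SQL input, the sanitization methods of the database library remain the safer route.

```ruby
# Prefix each of ' " . * / \ - with a backslash.
def escape_characters_in_string(string)
  string.gsub(/['".*\/\\-]/) { |match| "\\" + match }
end

start_string = %("My" 'name' *is* -john- .doe. /ok?/ C:\\Drive)
escaped = escape_characters_in_string(start_string)

puts start_string
puts escaped
```

Note that %(...) behaves like a double-quoted string, so sequences such as \" written inside it lose their backslash; %q(...) or a single-quoted string is needed to write the expected result literally.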
I have changed above function like this:
def self.escape_characters_in_string(string)
pattern = /(\'|\"|\.|\*|\/|\-|\\|\)|\$|\+|\(|\^|\?|\!|\~|\`)/
string.gsub(pattern){|match|"\\" + match}
end
This is working great for regex
This should get you started:
print %("'*-.).gsub(/["'*.-]/){ |s| '\\' + s }
\"\'\*\-\.
Take a look at the ActiveRecord sanitization methods: http://api.rubyonrails.org/classes/ActiveRecord/Base.html#method-c-sanitize_sql_array
Take a look at escape_string / quote method in Mysql class here

Resources