RegEx to remove new line characters and replace with comma - ruby

I scraped a website using Nokogiri and after using xpath I was left with the following string (which is a few td's pushed into one string).
"Total First Downs\n\t\t\t\t\t\t\t\t359\n\t\t\t\t\t\t\t\t274\n\t\t\t\t\t\t\t"
My goal is to make this into an array that looks like the following(it will be a nested array):
["Total First Downs", "359", "274"]
The issue is creating a regex that removes the escaped characters and substitutes a single "," for each run, but does not add a "," after the last set of integers. If the comma after the last set of integers is unavoidable, I could use #compact to get rid of the nil that occurs in the array. If you need the code I used to scrape the website, here it is (please note I saved the webpage for testing so that my IP address would not get burned during the trial phase):
require 'nokogiri'

f = File.open('page')
doc = Nokogiri::HTML(f)
f.close
number = doc.xpath('//tr[@class="tbdy1"]').count
stats = Array.new(number) { Array.new }
i = 0
doc.xpath('//tr[@class="tbdy1"]').each do |tr|
  stats[i] << tr.text
  i += 1
end
Thanks for your help

I don't fully understand your problem, but the result can be easily achieved with this:
"Total First Downs\n\t\t\t\t\t\t\t\t359\n\t\t\t\t\t\t\t\t274\n\t\t\t\t\t\t\t"
.split(/[\n\t]+/)
# => ["Total First Downs", "359", "274"]
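Applied to every scraped row, the same split yields the nested array the question asks for. This is a minimal sketch: the rows array and its second entry are made-up stand-ins for the tr.text values, and strip is added only as a guard against leading whitespace, which would otherwise produce an empty first element.

```ruby
# Hypothetical sample rows standing in for the text of each scraped <tr>.
rows = [
  "Total First Downs\n\t\t\t\t\t\t\t\t359\n\t\t\t\t\t\t\t\t274\n\t\t\t\t\t\t\t",
  "Rushing\n\t\t\t\t\t\t\t\t120\n\t\t\t\t\t\t\t\t98\n\t\t\t\t\t\t\t"
]

# Trim the ends, then split each row on runs of newlines and tabs.
# The result is one inner array per row.
stats = rows.map { |row| row.strip.split(/[\n\t]+/) }

p stats
```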

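Try gsub with a regex literal (a quoted "/.../" is treated as a plain string and never matches), and strip first so that no trailing comma is left:
"Total First Downs\n\t\t\t\t\t\t\t\t359\n\t\t\t\t\t\t\t\t274\n\t\t\t\t\t\t\t".strip.gsub(/[\n\t]+/, ",")
# => "Total First Downs,359,274"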

Related

Regex to extract last number portion of varying URL

I'm creating a URL parser and have three kind of URLs from which I would like to extract the number portion from the end of the URL and increment the extracted number by 10 and update the URL. I'm trying to use regex to extract but I'm new to regex and having trouble.
These are three URL structures of which I'd like to increment the last number portion of:
Increment last number 20 by 10:
http://forums.scamadviser.com/site-feedback-issues-feature-requests/20/
Increment last number 50 by 10:
https://forums.questionablecontent.net/index.php/board,1.50.html
Increment last number 30 by 10:
https://forums.comodo.com/how-can-i-help-comodo-please-we-need-you-b39.30/
With the regex \d+(?!.*\d), you will get the last chunk of digits in the string. Then use s.gsub with a block to modify the number and put the result back into the string.
See this Ruby demo:
strs = ['http://forums.scamadviser.com/site-feedback-issues-feature-requests/20/', 'https://forums.questionablecontent.net/index.php/board,1.50.html', 'https://forums.comodo.com/how-can-i-help-comodo-please-we-need-you-b39.30/']
arr = strs.map {|item| item.gsub(/\d+(?!.*\d)/) {$~[0].to_i+10}}
Note: $~ is a MatchData object, and using the [0] index we can access the whole match value.
Results:
http://forums.scamadviser.com/site-feedback-issues-feature-requests/30/
https://forums.questionablecontent.net/index.php/board,1.60.html
https://forums.comodo.com/how-can-i-help-comodo-please-we-need-you-b39.40/
Try this regex:
\d+(?=(\/)|(.html))
It will extract the last number.
Demo: https://regex101.com/r/zqUQlF/1
Substitute back with this regex:
(.*?)(\d+)((\/)|(.html))
Demo: https://regex101.com/r/zqUQlF/2
This regex matches only the last whole number in each URL by using a lookahead (which 'sees' patterns but doesn't consume any characters):
\d+(?=\D*$)
online demo here.
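In Ruby, that lookahead could be used roughly as follows. This is only a sketch: bump_last_number and its step parameter are hypothetical names introduced for illustration.

```ruby
# Hypothetical helper: adds `step` to the last run of digits in a string.
def bump_last_number(url, step = 10)
  # \d+(?=\D*$) matches digits that are followed only by non-digits
  # up to the end of the line, i.e. the last number in the URL.
  url.sub(/\d+(?=\D*$)/) { |digits| (digits.to_i + step).to_s }
end

bumped_html  = bump_last_number('https://forums.questionablecontent.net/index.php/board,1.50.html')
bumped_slash = bump_last_number('https://forums.comodo.com/how-can-i-help-comodo-please-we-need-you-b39.30/')

puts bumped_html
puts bumped_slash
```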
Like this:
urls = ['http://forums.scamadviser.com/site-feedback-issues-feature-requests/20/', 'https://forums.questionablecontent.net/index.php/board,1.50.html', 'https://forums.comodo.com/how-can-i-help-comodo-please-we-need-you-b39.30/']
pattern = /(\d+)(?=[^\d]+$)/
urls.each do |url|
url.gsub!(pattern) {|m| m.to_i + 10}
end
puts urls
You can also test it online here: https://ideone.com/smBJCQ

How do I regex-match an unknown number of repeating elements?

I'm trying to write a Ruby script that replaces all rem values in a CSS file with their px equivalents. This would be an example CSS file:
body{font-size:1.6rem;margin:4rem 7rem;}
The MatchData I'd like to get would be:
# Match 1 Match 2
# 1. font-size 1. margin
# 2. 1.6 2. 4
# 3. 7
However I'm entirely clueless as to how to get multiple and different MatchData results. The RegEx that got me closest is this (you can also take a look at it at Rubular):
/([^}{;]+):\s*([0-9.]+?)rem(?=\s*;|\s*})/i
This will match single instances of value declarations (so it will properly return the desired Match 1 result), but entirely disregards multiples.
I also tried something along the lines of ([0-9.]+?rem\s*)+, but that didn't return the desired result either, and doesn't feel like I'm on the right track, as it won't return multiple result data sets.
EDIT After the suggestions in the answers, I ended up solving the problem like this:
# search for any declarations that contain rem unit values and modify blockwise
@output.gsub!(/([^ }{;]+):\s*([^}{;]*[0-9.]rem+[^;]*)(?=\s*;|\s*})/i) do |match|
  # search for any single rem value
  string = match.gsub(/([0-9.]+)rem/i) do |value|
    # convert the rem value to px by multiplying by 10 (this is not universal!)
    value = sprintf('%g', Regexp.last_match[1].to_f * 10).to_s + 'px'
  end
  string += ';' + match # append the original match result to the replacement
  match = string # overwrite the matched result
end
You can't capture a dynamic number of match groups (at least not in ruby).
Instead you could do either one of the following:
Capture the whole value and split on space
Use multilevel matching to capture first the whole key/value pair and secondly match the value. You can use blocks on the match method in ruby.
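The second option might look something like the sketch below: an outer gsub matches each whole declaration, and an inner gsub rewrites every rem value inside it. The factor of 10 is an assumption taken from the question's own edit (a 10px root font size), and the variable names are illustrative.

```ruby
css = 'body{font-size:1.6rem;margin:4rem 7rem;}'

# First level: match each whole "property:value" declaration.
converted = css.gsub(/([^{};:]+):([^{};]+)/) do
  # Copy the outer captures before the inner gsub overwrites $1 and $2.
  prop, value = $1, $2
  # Second level: rewrite every rem number inside the value.
  # Assumes 1rem == 10px, which is not universal.
  px_value = value.gsub(/([0-9.]+)rem/i) { format('%g', $1.to_f * 10) + 'px' }
  "#{prop}:#{px_value}"
end

puts converted
```

Because the inner gsub runs once per declaration, it does not matter how many rem values a single property carries.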
This regex will do the job for your example :
([^}{;]+):(?:([0-9\.]+?)rem\s?)?(?:([0-9\.]+?)rem\s?)
But with this you can't match something like: margin:4rem 7rem 9rem
This is what I've been able to do: DEMO
Regex: (?<={|;)([^:}]+)(?::)([^A-Za-z]+)
And this is what my result looks like:
# Match 1 Match 2
# 1. font-size 1. margin
# 2. 1.6 2. 4
As @koffeinfrei says, dynamic capture isn't possible in Ruby. It would be smarter to capture the whole string and split on spaces.
str = 'body{font-size:1.6rem;margin:4rem 7rem;}'
str.scan(/(?<=[{; ]).+?(?=[;}])/)
.map { |e| e.match /(?<prop>.+):(?<value>.+)/ }
#⇒ [
# [0] #<MatchData "font-size:1.6rem" prop:"font-size" value:"1.6rem">,
# [1] #<MatchData "margin:4rem 7rem" prop:"margin" value:"4rem 7rem">
# ]
The latter match might be easily adapted to return whatever you want, value.split(/\s+/) will return all the values, \d+ instead of .+ will match digits only etc.
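As a rough illustration of that adaptation, the sketch below splits each value on whitespace and collects the numbers per property. The hash layout and variable names are assumptions, not part of the answer above.

```ruby
str = 'body{font-size:1.6rem;margin:4rem 7rem;}'

# Same two-step approach: scan out the declarations, then match each one.
pairs = str.scan(/(?<=[{; ]).+?(?=[;}])/).map do |decl|
  m = decl.match(/(?<prop>.+):(?<value>.+)/)
  # String#to_f reads the leading number and ignores the "rem" suffix.
  [m[:prop], m[:value].split(/\s+/).map(&:to_f)]
end

values_by_prop = pairs.to_h
p values_by_prop
```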

Incrementing numeric parameter in a URL parameter string?

I've had a look round and can't find what I need on Stack Overflow, and was wondering if someone had a simple solution.
I want to find a parameter within a URL and increment its value, so, as an example:
?kws=&pstc=&cty=&prvnm=1
I want to be able to locate the prvnm parameter no matter where it is in the string and increment its value by 1.
I know I could split the parameters into an array, find the key, increment it and write it back but that seems rather long winded and wondered if someone else had any ideas!
require "uri"
url = "http://example.com/?kws=&pstc=&cty=&prvnm=1"
def new_url(url)
uri = URI.parse(url)
hsh = Hash[URI.decode_www_form(uri.query)]
hsh['prvnm'] = hsh['prvnm'].next
uri.query = URI.encode_www_form(hsh).to_s
uri.to_s
end
new_url(url) # => "http://example.com/?kws=&pstc=&cty=&prvnm=2"
There are already four answers, so I had to come up with something a little different:
s = "?kws=&pstc=&cty=&prvnm=1"
head, sep, tail = s.partition(/(?<=[?&]prvnm=)\d+/)
head + (sep.to_i + 1).to_s + tail # => "?kws=&pstc=&cty=&prvnm=2"
String#partition returns an array of three strings [head, sep, tail] such that head + sep + tail == s, where sep is the part of s matched by partition's argument, which can be a string or a regex.
We want the separator to be the digits following ?prvnm= or &prvnm=. We therefore use a regex with \d+ preceded by that string, which we want to treat as having zero length so that it is not included in the separator. That calls for a "positive look-behind": (?<=[?&]prvnm=). \d+ is "greedy", so it takes all consecutive digits.
For the given value of s, head, sep, tail = s.partition(/(?<=[?&]prvnm=)\d+/)
=> ["?kws=&pstc=&cty=&prvnm=", "1", ""].
Edit: my thanks to @quetzalcoatl for pointing out that I needed to change (?<=&prvnm=) in my regex to what I have now, as what I had would fail when ?prvnm= was at the beginning of the string.
split the string by `&`
then iterate over the parts
then split each part by `=` and inspect the results
when found `prvnm`, parse the integer and increment it
then join the bits by '='
then join the parts by '&'
Or, use regex like:
/[?&]prvnm=\d+/
and parse the result and then do a replacement.
Or, get some URL-parsing library..
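The split-and-join steps listed in this answer can be sketched as follows. increment_param is a hypothetical helper name, and the sketch assumes the input is a bare query string with an optional leading "?".

```ruby
# Hypothetical helper; `key` defaults to the parameter from the question.
def increment_param(query, key = 'prvnm')
  # Drop the leading "?" and split the string by "&".
  parts = query.sub(/\A\?/, '').split('&')

  updated = parts.map do |part|
    # Split each part by the first "=" only, keeping empty values.
    name, value = part.split('=', 2)
    # When the key is found, parse the integer and increment it.
    value = (value.to_i + 1).to_s if name == key
    # Join the bits by "=" again (parts without "=" are left alone).
    value.nil? ? name : "#{name}=#{value}"
  end

  # Join the parts by "&".
  '?' + updated.join('&')
end

result = increment_param('?kws=&pstc=&cty=&prvnm=1')
puts result
```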
Try something like this:
params = "?kws=&pstc=&cty=&prvnm=1"
num = params.scan(/prvnm=(\d+)/)[0].join.to_i
puts num + 1
Use:
require 'uri'
require 'cgi'
Then:
parsed_url = URI.parse(url) # url is your full URL string
r = CGI.parse(parsed_url.query)
r is now a hash that maps each query parameter to an array of its values.
You can easily get the incremented value by using:
r["prvnm"].first.to_i + 1

Read from a file into an array and stop if a ":" is found in ruby

How can I in Ruby read a string from a file into an array and only read and save in the array until I get a certain marker such as ":" and stop reading?
Any help would be much appreciated =)
For example:
10.199.198.10:111 test/testing/testing (EST-08532522)
10.199.198.12:111 test/testing/testing (EST-08532522)
10.199.198.13:111 test/testing/testing (EST-08532522)
Should only read the following and be contained in the array:
10.199.198.10
10.199.198.12
10.199.198.13
This is a rather trivial problem, using String#split:
results = open('a.txt').map { |line| line.split(':')[0] }
p results
Output:
["10.199.198.10", "10.199.198.12", "10.199.198.13"]
String#split breaks a string at the specified delimiter and returns an array; so line.split(':')[0] takes the first element of that generated array.
In the event that there is a line without a : in it, String#split will return an array with a single element that is the whole line. So if you need to do a little more error checking, you could write something like this:
results = []
open('a.txt').each do |line|
results << line.split(':')[0] if line.include? ':'
end
p results
which will only add split lines to the results array if the line has a : character in it.
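An alternative sketch uses String#partition, which reports whether the marker was found at all. StringIO stands in for the file here so the example runs on its own, and the middle sample line is made up; with a real file, File.foreach('a.txt') yields lines in the same way.

```ruby
require 'stringio'

# Stand-in for the file contents; the second line is a made-up
# example of a line that has no ":" marker.
input = StringIO.new(<<~DATA)
  10.199.198.10:111 test/testing/testing (EST-08532522)
  a line without the marker
  10.199.198.13:111 test/testing/testing (EST-08532522)
DATA

results = []
input.each_line do |line|
  # partition returns [before, separator, after]; the separator is
  # an empty string when ":" does not occur in the line.
  head, sep, _tail = line.partition(':')
  results << head unless sep.empty?
end

p results
```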

Ruby MatchData class is repeating captures, instead of including additional captures as it "should"

Ruby 1.9.1, OSX 10.5.8
I'm trying to write a simple app that parses a bunch of Java-based HTML template files and replaces a period (.) with an underscore if it's contained within a specific tag. I use Ruby all the time for these types of utility apps, and thought it would be no problem to whip up something using Ruby's regex support. So I create a Regexp.new... object, open a file, read it in line by line, then match each line against the pattern. If I get a match, I create a new string using replaceString = currentMatch.gsub(/\./, '_'), then build a regex for the whole matched string with newReplaceRegex = Regexp.escape(currentMatch), and finally replace back into the current line with line.gsub(newReplaceRegex, replaceString). Code below, of course, but first...
The problem I'm having is that when accessing the indexes within the returned MatchData object, I'm getting the first result twice, and it's missing the second sub string it should otherwise be finding. More strange, is that when testing this same pattern and same test text using rubular.com, it works as expected. See results here
My pattern:
(<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+.)+(?:[a-zA-Z0-9]+)(?:>))
Test text:
<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>moreNonMatchingText<WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText
Here's the relevant code:
tagRegex = Regexp.new('(<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>))+')
testFile = File.open('RegexTestingCompFix.txt', "r+")
lineCount=0
testFile.each{|htmlLine|
lineCount += 1
puts ("Current line: #{htmlLine} at line num: #{lineCount}")
tagMatch = tagRegex.match(htmlLine)
if(tagMatch)
matchesArray = tagMatch.to_a
firstMatch = matchesArray[0]
secondMatch = matchesArray[1]
puts "First match: #{firstMatch} and second match #{secondMatch}"
tagMatch.captures.each {|lineMatchCapture|
puts "Current capture for tagMatches: #{lineMatchCapture} of total match count #{matchesArray.size}"
#create a new regex using the match results; make sure to use auto escape method
originalPatternString = Regexp.escape(lineMatchCapture)
replacementRegex = Regexp.new(originalPatternString)
#replace any periods with underscores in a copy of lineMatchCapture
periodToUnderscoreCorrection = lineMatchCapture.gsub(/\./, '_')
#replace original match with underscore replaced copy within line
htmlLine.gsub!(replacementRegex, periodToUnderscoreCorrection)
puts "The modified htmlLine is now: #{htmlLine}"
}
end
}
I would think that I should get the first tag in matchData[0] and then the second tag in matchData[1]; or, since I don't know how many matches I'll get within any given line, what I'm really doing is matchData.to_a.each. And in this case, matchData has two captures, but they're both the first tag match,
which is: <WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>
So, what the heck am I doing wrong, why does rubular test give me the expected results?
You want to use String#scan instead of Regexp#match:
tag_regex = /<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>)/
lines = "<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>moreNonMatchingText\
<WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText"
lines.scan(tag_regex)
# => ["<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>", "<WEBOBJECT NAME=admin.SecondLineMatch>"]
A few recommendations for next ruby questions:
newlines and spaces are your friends, you don't lose points for using more lines in your code ;-)
use do-end on blocks instead of {}, improves readability a lot
declare variables in snake case (hello_world) instead of camel case (helloWorld)
Hope this helps
I ended up using the String.scan approach, the only tricky point there was figuring out that this returns an array of arrays, not a MatchData object, so there was some initial confusion on my part, mostly due to my ruby green-ness, but it's working as expected now. Also, I trimmed the regex per Trevoke's suggestion. But snake case? Never...;-) Anyway, here goes:
tagRegex = /(<(?:webobject) (?:name)=(?:\w+\.)+(?:\w+)(?:>))/i
testFile = File.open('RegexTestingCompFix.txt', "r+")
lineCount=0
testFile.each do |htmlLine|
lineCount += 1
puts ("Current line: #{htmlLine} at line num: #{lineCount}")
oldMatches = htmlLine.scan(tagRegex) #oldMatches thusly named due to not explicitly using Regexp or MatchData, as in "the old way..."
if(oldMatches.size > 0)
oldMatches.each_index do |index|
arrayMatch = oldMatches[index]
aMatch = arrayMatch[0]
#create a new regex using the match results; make sure to use auto escape method
replacementRegex = Regexp.new(Regexp.escape(aMatch))
#replace any periods with underscores in a copy of lineMatchCapture
periodToUnderscoreCorrection = aMatch.gsub(/\./, '_')
#replace original match with underscore replaced copy within line, matching against the new escaped literal regex
htmlLine.gsub!(replacementRegex, periodToUnderscoreCorrection)
puts "The modified htmlLine is now: #{htmlLine}"
end # I kind of still prefer the brackets...;-)
end
end
Now, why does MatchData work the way it does? It seems like its behavior is really a bug, and certainly not very useful in general if you can't get it to provide a simple means of accessing all the matches. Just my $.02
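On the MatchData question, here is a minimal sketch of what is likely going on (the shortened sample line and the variable names are assumptions for illustration). Regexp#match stops at the first match in the string, and MatchData#to_a returns the whole match followed by each capture group. Because the whole pattern is wrapped in one capture group, both entries hold the same tag; String#scan is the method that keeps walking the rest of the line.

```ruby
# Shortened, made-up sample line with two tags.
line = '<WEBOBJECT NAME=admin.normalMode>text<WEBOBJECT NAME=admin.SecondLineMatch>more'
tag  = /(<webobject name=(?:\w+\.)+\w+>)/i

# match finds only the first tag; to_a is [whole match, capture group 1].
first_pair = tag.match(line).to_a

# scan finds every tag; with one capture group it returns an array of
# one-element arrays, so flatten gives plain strings.
all_tags = line.scan(tag).flatten

p first_pair
p all_tags
```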
Small bits:
This regexp helps you get "normalMode" .. But not "secondLineMatch":
<webobject name=\w+\.((?:\w+)).+> (with option 'i', for "case insensitive")
This regexp helps you get "secondLineMatch" ... But not "normalMode":
<webobject name=\w+\.((?:\w+))> (with option 'i', for "case insensitive").
I'm not really good at regexps but I'll keep toiling at it.. :)
And I don't know if this helps you at all, but here's a way to get both:
<webobject name=admin.(\w+) (with option 'i').
