Parsing & replacing multiple links, but not when one contains another - Ruby

I can't figure out how to (easily) prevent link (2) from replacing the beginning of link (1). I'd appreciate an answer in Ruby, but if you can describe the logic that's good too.
The output should be:
message = "For Last Minute rentals, please go to:
<span class='external_link' href-web='http://www.mydomain.com/thepage'>http://www.mydomain.com/thepage</span> (1)
For more information about our events, please visit our website:
<span class='external_link' href-web='http://www.mydomain.com'>http://www.mydomain.com</span> (2)"
But it is:
message = "For Last Minute rentals, please go to:
<span class='external_link' href-web='<span class='external_link' href-web='http://www.mydomain.com'>http://www.mydomain.com</span>/thepage'><span class='external_link' href-web='http://www.mydomain.com'>http://www.mydomain.com</span>/thepage</span> (1)
For more information about our events, please visit our website:
<span class='external_link' href-web='http://www.mydomain.com'>http://www.mydomain.com</span> (2)"
Here's the code (edited: took out the spans):
message = "For Last Minute rentals, please go to:
http://www.mydomain.com/thepage
For more information about our events, please visit our website:
http://www.mydomain.com"
require 'uri'

links_found = URI.extract(message, ['http', 'https'])
links_found.each do |link_found|
  message.gsub!(link_found, "<span class='external_link' href-web='#{link_found}'>#{link_found}</span>")
end
Thoughts?

I would guess that your problem is related to URI.extract combined with gsub!. links_found contains both "http://www.mydomain.com/thepage" and "http://www.mydomain.com", and after the first gsub! pass the message already contains the <span> markup for link (1).
To further clarify: since you're only passing link_found to gsub! as a plain pattern, it replaces every occurrence of that string in message. The second, shorter URL therefore also matches inside the span you already inserted for the first one (both in its href-web attribute and its text), wrapping it a second time.
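One way to sidestep the nesting entirely (my suggestion, not part of the original answer) is to do the replacement in a single pass, so each URL in the original text is matched and wrapped exactly once:
require 'uri'

message = "For Last Minute rentals, please go to:
http://www.mydomain.com/thepage
For more information about our events, please visit our website:
http://www.mydomain.com"

# A single gsub pass never re-scans text it has already replaced,
# so the shorter URL can't match inside the span inserted for the longer one.
message = message.gsub(URI.regexp(['http', 'https'])) do |link|
  "<span class='external_link' href-web='#{link}'>#{link}</span>"
end
puts message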

First, rule one: don't bother with string manipulation or regular expressions for anything but the most trivial tasks when dealing with HTML or XML. Doing otherwise is a sure recipe for madness.
Instead, save your sanity and go for a real parser. For Ruby I strongly suggest you look at Nokogiri - it just works.
Consider this code:
require 'nokogiri'
message = "For Last Minute rentals, please go to:
<span class='external_link' href-web='http://www.mydomain.com/thepage'>http://www.mydomain.com/thepage</span> (1)
For more information about our events, please visit our website:
<span class='external_link' href-web='http://www.mydomain.com'>http://www.mydomain.com</span> (2)"
doc = Nokogiri::HTML(message)
external_spans = doc.search('span.external_link')
url1 = external_spans[0]['href-web'] # => "http://www.mydomain.com/thepage"
text1 = external_spans[0].text # => "http://www.mydomain.com/thepage"
url2 = external_spans[1]['href-web'] # => "http://www.mydomain.com"
text2 = external_spans[1].text # => "http://www.mydomain.com"
url1 and text1 are the URL and text from span 1, and url2 and text2 are the same for span 2.
I'm not sure what you want to do with them, because after a more-than-cursory glance I couldn't see a difference between your source and desired output, but once you have them you're pretty much free to do anything. A parser like Nokogiri lets you retrieve information from the HTML or XML DOM, replace it, move things around, or even splice in new content.
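As a tiny illustration of that round trip (my own sketch, not from the original answer), continuing from the code above: rewrite each span's visible text from its href-web attribute and serialize the document back out:
# Set each span's text content from its attribute,
# then emit the modified HTML.
external_spans.each do |span|
  span.content = span['href-web']
end
puts doc.to_html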

Related

Having trouble parsing these data in watir-webdriver

See hierarchy below:
All I need here is "Company Title", "Company Owner", "Company Owner Title", "Street Number Street Name", and "City, State Zipcode".
I tried b.div.span.bs, but that didn't work (bs because there are multiple blocks I'm gathering data from). I also thought I'd try something like b.tds.split('<br>'), then replace all instances of tags and somehow delete empty array cells, but I found that each block is different, so the data don't align. For example, Company Title might be in cell 1 for the first array, but if Company Title isn't present for the second block, then its cell 1 would be Company Owner, which conflicts. Anyway, I'm just trying to find a clever way to get these data. Thank you.
Here is the actual HTML; however you must first click "View All".
You can split out everything inside the <div> and then split that by <br>. The first part is the Company Title (if it exists), and then the Company Owner is last/second.
The rest is ... trickier. Some parts are pretty straightforward in that Fax and Member Since have labels, so those are easy. The <a> is easy.
You could probably test for the phone number with a regex and then back up from there. If the item before the phone number isn't an <a>, then it's the city, state, zip, and the one before that is the address. If one exists before that, it's the Company Owner Title.
Everything after the phone number in your examples has labels, so those are easy.
I'm not sure about all of your use cases, but often for pages where the DOM is not very helpful I just get the text and parse it with Ruby:
browser.td.text.split("\n").reject(&:empty?)
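Combining both answers, here is a rough sketch of the back-up-from-the-phone-number heuristic applied to that array (my own illustration; the phone number format is an assumption, so adjust it to the real data):
lines = browser.td.text.split("\n").reject(&:empty?)
# Assumed US phone format like "(555) 123-4567"
phone_idx = lines.index { |line| line =~ /\(\d{3}\)\s*\d{3}-\d{4}/ }
if phone_idx
  city_state_zip = lines[phone_idx - 1] # "City, State Zipcode"
  street_address = lines[phone_idx - 2] # "Street Number Street Name"
end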
This doesn't directly answer the question, but it shows how I'd go about doing this using Nokogiri, which is the standard HTML/XML parser for Ruby:
require 'nokogiri'
doc = Nokogiri::HTML('<td><div></div><br>a<br>b<br>c</td>')
doc is Nokogiri's internal representation of the document.
We use landmarks in the markup to navigate and find things we want. In this case <div> is a good starting point:
doc.at('div').next_sibling.next_sibling.text # => "a"
next_sibling is how we tell Nokogiri to look at the next node. In this case it's stepping past the first <br> and looking at the a TextNode.
That'd result in unworkable code though, so there's a better way to go:
doc.search('td br').to_html # => "<br><br><br>"
That shows we can find all the <br> tags inside the <td>, so we just have to iterate over them and use them as our landmarks:
doc.search('td br').map{ |br| br.next_sibling.text } # => ["a", "b", "c"]

Nokogiri: How can I extract text from HTML with correct spacing?

I'm trying to extract the text of a document to index it for search. The code below mostly works, except that various words and punctuation run together. When it removes tags, I need it to replace them with spaces so I don't get this issue. I've been trying to figure out the most efficient way to do this, but I'm coming up empty so far.
require 'nokogiri'

doc = Nokogiri::HTML(html)
doc.xpath("//script").remove
doc.xpath("//style").remove
doc.xpath("//a").remove
text = doc.text.gsub(/\s+/, ' ')
Here is some sample text I extracted from http://www.washingtontimes.com/blog/redskins-watch/2012/oct/18/redskins-linemen-respond-jason-pierre-paul-rg3-com/
Before the season it was New York Giants defensive end Osi Umenyiora
who made waves by saying he wouldn't call Robert Griffin III by “RG3”
until he did something. Until then, it was “Bob Griffin.”After
Griffin's 76-yard touchdown run in the Washington Redskins' victory
over the Minnesota Vikings, fellow Giants defensive end Jason
Pierre-Paul was the one who had some comments about Griffin.“Don’t
bring it to my side," Pierre-Paul told New York media. “Go the other
way. …“Yes, it'll be a very good matchup. Not on my side, though. Not
on my side. Or the other side.”Griffin, asked jokingly Wednesday about
running for office, said: “I’ve got a lot other guys to be running
away from right now, Pierre-Paul, Osi, all those guys.”But according
to a couple of Redskins linemen, Griffin shouldn't have much to worry
about Sunday if he gets into the open field.“If Robert gets into that
situation, I don't think there's many people that can run him down,”
right guard Chris Chester said. “I'm still going to go out there and
try to block and make sure no one touches Robert at all. But he's a
plenty good athlete to be able to outrun a lot of people in this
league.”Prompted with Pierre-Paul's comments, left tackle Trent
Williams responded: “What do you want me to say about that?”“Robert's
my guy. I don't know Pierre-Paul. I don't know why he would say
something like that,” he said. “Maybe he knows something I don't.”
You could try inserting a space before each <p> tag:
doc.search('p').each{|el| el.before ' '}
but a better approach is probably something like:
text = doc.search('div.story p').map{|p| p.text}.join(" ")
Other answers discuss inserting whitespace into the document, but if (as the question asks) your requirement is to replace those nodes with whitespace, Nokogiri has a replace method. So, to replace script tags with spaces, do:
doc.xpath('//script').each do |node|
  node.replace(' ')
end
The question also asks about 'correct' spacing. Most browsers will not insert a space when rendering text around a <script> tag, so while this is useful for text extraction, it is not necessarily the 'correct' thing to do.
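Putting the question's code and this answer together, a minimal sketch (assuming, as in the question, that html holds the page source):
require 'nokogiri'

doc = Nokogiri::HTML(html)
# Replace the unwanted nodes with a space instead of removing them outright,
# so words on either side don't run together.
doc.xpath('//script | //style | //a').each { |node| node.replace(' ') }
text = doc.text.gsub(/\s+/, ' ').strip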

Using Ruby and Mechanize to make a new Olympic medals count

I want to remake the Olympic medals count on London2012 to better reflect the value of the medals. Currently it is sorted only by gold medals. I'd like to re-rank it by points, where gold = 4, silver = 2, and bronze = 1, to make a new, more rational list. I probably want to remember the previous rank, then add a new rank column as well.
I'd like to try Mechanize to get the raw data from the site, then parse the data into rows and columns, apply the new point counts, and remake the list.
From source at http://www.london2012.com/medals/medal-count/ each country has a block with medals like so:
<span class="countryName">Canada</span></a></div></div></td><td class="gold c">0</td><td class="silver c">2</td><td class="bronze c">5</td>
If I use agent.get('http://www.london2012.com/medals/medal-count') it shows the whole list. How do I parse specific spans and table data?
I also need to remember the rank, so that when I make the new page I can put the new rank beside it.
Any tips on Mechanize parsing and remembering data would be really helpful. More importantly, I'd appreciate your thinking process for doing something like this, to get me started. This doesn't have to be a code answer.
Thanks
First, identify the table. In Chrome, load the page and right-click anywhere on the table. Go to Inspect Element and go up the hierarchy until you're on the table. Now select it and you'll see it looks like this:
<table class="or-tbl overall_medals sortable" summary="Schedule">
The overall_medals class looks like it will be unique, so that's a good one to use. Now start irb and do:
require 'mechanize'
agent = Mechanize.new
page = agent.get 'http://www.london2012.com/medals/medal-count/'
Double-check that the table is unique:
page.search('table.overall_medals').size
#=> 1 (good, it is)
You can get all the data from the table into an array with:
page.search('table.overall_medals tr').map{|tr| tr.search('td').map(&:text)}
Notice that the first two rows are empty; let's get rid of them by using a range:
data = page.search('table.overall_medals tr')[2..-1].map{|tr| tr.search('td').map(&:text)}
The second row isn't really empty; it has the column names (in <th>s instead of <td>s). You can get those with:
columns = page.search('table.overall_medals tr[2] th').map{|th| th.text.strip}
You can get these into hashes with:
rows = data.map{|row| Hash[columns.zip row]}
Now you can do
rows[0]['Country']
#=> "United States of America"
Or even one big hash:
countries = rows.map{|row| {row['Country'] => row}}.reduce &:merge
now:
countries['France']['Gold']
#=> "8"
You might find this Medals API useful (assuming your question is not specifically about Mechanize):
http://apify.heroku.com/resources/5014626da8cdbb0002000006
It uses Nokogiri to parse the site and the output is available as JSON:
http://apify.heroku.com/api/olympics2012_medals.json
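A minimal sketch of consuming that feed (my addition; the exact JSON structure is an assumption on my part, so inspect a record first):
require 'json'
require 'open-uri'

medals = JSON.parse(open('http://apify.heroku.com/api/olympics2012_medals.json').read)
puts medals.first.inspect # peek at one record to see which keys are available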

Scraping a website with Nokogiri

I am using Nokogiri to scrape a website and am running into an issue when I try to grab a field from a table. I am using SelectorGadget to find the CSS selector of the table. I am grabbing data from a government website that details information on motor carriers.
The method that I am using looks like:
def scrape_database
url = "http://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=#{self.dot}#Inspections"
doc = Nokogiri::HTML(open(url))
self.name = doc.at_css("tr:nth-child(4) .queryfield").text
self.address = doc.at_css("tr:nth-child(6) .queryfield").text
end
I grab all of the fields in the upper table using that syntax and the method operates fine; however, I am having issues with the crash rate/inspections table below it.
Here is what I am using to grab that info:
self.vehicle_inspections = doc.at_css("center:nth-child(13) tr:nth-child(2) :nth-child(2)").text
This raises: undefined method `text' for nil:NilClass
If I remove .text from the end of this, the method runs but doesn't grab any relevant information (obviously). I am assuming this is due to the complicated selector that I am using to grab the field, but am not quite sure.
Has anyone run into a similar problem and can you give me some advice?
Yes, that error means that your CSS selector is not finding the information; at_css is returning nil, and nil.text is not valid. You can guard against it like so:
insp = doc.at_css("long example css selector")
self.vehicle_inspections = insp && insp.text
However, it sounds to me like you "need" this data. Since you have not provided the HTML page nor the CSS selectors you tried, I can't help you craft a working CSS or XPath selector.
For future questions, or an edit to this one, note that actual (pared-down) code is strongly preferred over hand waving and loose descriptions of what your code looks like. If you show us the HTML page, or a relevant snippet, and describe which element/text/attribute you want, we can tell you how to select it.
I see six tables on that page. Which is the "crash rate/inspections" table? Given that your URL includes #Inspections on the end, I'm assuming you're talking about the two tables immediately underneath the "Inspections/Crashes In US" section. Here are XPath selectors that match each:
require 'nokogiri'
require 'open-uri'
url = "http://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=800585"
doc = Nokogiri::HTML(open(url))
table1 = doc.at_xpath('//table[@summary="Inspections"][preceding::h4[.//a[@name="Inspections"]]]')
table2 = doc.at_xpath('//table[@summary="Crashes"][preceding::h4[.//a[@name="Inspections"]]]')
# Find a row by index (1 is the first row)
vehicle_inspections = table1.at_xpath('.//tr[2]/td').text.to_i
# Find a row by header text
out_of_service_drivers = table1.at_xpath('.//tr[th="Out of Service"]/td[2]').text.to_i
p [ vehicle_inspections, out_of_service_drivers ]
#=> [6, 0]
tow_crashes = table2.at_xpath('.//tr[th="Crashes"]/td[3]').text.to_i
p tow_crashes
#=> 0
The XPath queries may look intimidating. Let me explain how they work:
//table[@summary="Inspections"][preceding::h4[.//a[@name="Inspections"]]]
//table find a <table> at any level of the document
[#summary="Inspections"] …but only if it has a summary attribute with this value
[preceding::h4…] …and only if you can find an <h4> element earlier in the document
[.//a…] …specifically, a <h4> that has an <a> somewhere underneath it
[#name="Inspections"] …and that <a> has to have a name attribute with this text.
This would actually match two tables (there's another summary="Inspections" table later on the page), but using at_xpath finds the first matching table.
.//tr[2]/td
. Starting at the current node (this table)
//tr[2] …find the second <tr> that is a descendant at any level
/td …and then find the <td> children of that.
Again, because we're using at_xpath we find the first matching <td>.
.//tr[th="Out of Service"]/td[2]
. Starting at the current node (this table)
//tr …find any <tr> that is a descendant at any level
[th="Out of Service] …but only those <tr> that have a <th> child with this text
/td[2] …and then find the second <td> child of those.
In this case there is only one <tr> that matches the criteria, and thus only one <td> that matches, but we still use at_xpath so that we get that node directly instead of a NodeSet with a single element in it.
The goal here (and with any screen scraping) is to latch onto meaningful values on the page, not arbitrary indices.
For example, I could have written my table1 xpath as:
# Find the first table with this summary
table1 = doc.at_xpath('//table[@summary="Inspections"][1]')
…or even…
# Find the 20th table on the page
//table[20]
However, those are fragile. Someone adding a new section to the page, or code that happens to add or remove a formatting table would cause those expressions to break. You want to hunt for strong attributes and text that likely won't change, and anchor your searches based on that.
The vehicle_inspections XPath is similarly fragile, relying on the ordering of rows instead of the label text for the row.
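Following that advice, the first lookup can anchor on its row label too. A sketch, assuming the row is labeled "Inspections" the same way the "Out of Service" row is:
# Same pattern as the out_of_service_drivers query above:
# find the row by its <th> text rather than by position.
vehicle_inspections = table1.at_xpath('.//tr[th="Inspections"]/td[1]').text.to_i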

Find email addresses in large data stream

STILL NOT RESOLVED :( [Feb 11th]
I have a large text file full of random data and want to pull out all the email addresses from it.
I would like to do this in Ruby, with pseudo code like this:
monster_data_string = "asfsfsdfsdfsf sfda **joe@example.com** sdfdsf"
monster_data_string.match(EMAIL_REGEX)
Does anyone know what Ruby email regular expression I would use to accomplish this?
Please keep in mind that I'm looking for a Ruby answer to this. I have already tried numerous regexes found by googling, but most of them cause Ruby runtime errors stating that characters like "+" and "*" are invalid/unrecognized.
What I have already tried is:
monster_data_string.match(/^([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})$/i)
but I receive Ruby errors stating that "+" is an invalid character
Thanks in advance
Watch this...
require 'yaml'

f = File.open("content.txt")
content = f.read
r = Regexp.new(/\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b/)
emails = content.scan(r).uniq
puts YAML.dump(emails)
If you're getting an error message about + or * being invalid in regexes, you're doing something very wrong. This is a valid regex in Ruby, although it's not the one you want:
/^([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})$/i
For one thing, you don't want to anchor the regex to the start and end of lines (^ and $) if you're trying to pluck the addresses from "random" text. But once you've gotten rid of the anchors, your regex will match **joe@example.com in your test string, which I presume you don't want. This regex from Regular-Expressions.info does a better job, but read that page for tips on tweaking it to meet your particular needs.
/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i
Finally (and you may already know this), you won't want to use the match() method because that will only find the first match. Try scan() instead.
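A quick illustration of the difference, using the regex above (the example strings are my own):
text = "junk a@example.com more junk b@example.org"
pattern = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i
text.match(pattern)[0] #=> "a@example.com" (match finds only the first)
text.scan(pattern)     #=> ["a@example.com", "b@example.org"]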
Given that it is not possible to parse every valid email address using a regexp, you are left with two choices:
Make a regexp that matches as many valid email addresses as possible, and live with the fact that some valid but rarely used forms of email address might get overlooked.
or
Make a regexp that matches anything that "might be" an email address, and then live with the false positives.
I use the second approach to weed out obviously wrong email addresses when validating user sign-up email addresses on a web page.
Gleaned from Ruby Cookbook which has a very good section on email address validation:
valid = '[^ @]+'
/^#{valid}@#{valid}\.#{valid}/
Apparently there is a 6343 character Perl regexp written by Paul Warren that does a very good job and also works in Ruby, but even that is not foolproof (I think it might also have some performance implications).
What kind of runtime error messages are you getting? Is it regarding the regexps being invalid, or is it breaking due to the target string being too large?
To try and help you get there (though not very elegantly, I admit):
I think the start and end anchors (^ and $) aren't helping. You may also want to filter out the asterisks:
irb(main):001:0> mds = "asfsfsdfsdfsf sfda **joe@example.com** sdfdsf"
=> "asfsfsdfsdfsf sfda **joe@example.com** sdfdsf"
irb(main):003:0> mds.match(/^([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})$/i)
=> nil
irb(main):004:0> mds.match(/([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})/i)
=> #<MatchData "**joe@example.com" 1:"**joe" 2:"example.com">
irb(main):005:0> mds.match(/([^@\s*]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})/i)
=> #<MatchData "joe@example.com" 1:"joe" 2:"example.com">
Even better,
require 'yaml'
content = "asfsfsdfsdfsf sfda **joe#example.com.au** sdfdsf cool_me#example.com.fr"
r = Regexp.new(/\b([a-zA-Z0-9._%+-]+)#([a-zA-Z0-9.-]+?)(\.[a-zA-Z.]*)\b/)
emails = content.scan(r).uniq
puts YAML.dump(emails)
will give you
---
- - joe
  - example
  - .com.au
- - cool_me
  - example
  - .com.fr
