Scraping a website with Nokogiri - ruby

I am using Nokogiri to scrape a website and am running into an issue when I try to grab a field from a table. I am using selector gadget to find the CSS selector of the table. I am grabbing data from a government website that details information on motor carriers.
The method that I am using looks like:
def scrape_database
url = "http://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=#{self.dot}#Inspections"
doc = Nokogiri::HTML(open(url))
self.name = doc.at_css("tr:nth-child(4) .queryfield").text
self.address = doc.at_css("tr:nth-child(6) .queryfield").text
end
I grab all of the fields in the upper table using that syntax and the method operates fine, however I am having issues with the crash rate/inspections table below it.
Here is what I am using to grab that info:
self.vehicle_inspections = doc.at_css("center:nth-child(13) tr:nth-child(2) :nth-child(2)").text
undefined method `text' for nil:NilClass
If I remove text from the end of this, the method runs but doesn't grab any relevant information (obviously). I am assuming this is due to the complicated selector that I am using to grab the field, but am not quite sure.
Has anyone run into a similar problem and can you give me some advice?

Yes, that error means that your CSS selector is not finding the information; at_css is returning nil, and nil.text is not valid. You can guard against it like so:
insp = doc.at_css("long example css selector")
self.vehicle_inspections = insp && insp.text
However, it sounds to me like you "need" this data. Since you have not provided with the HTML page nor the CSS selectors, I can't help you craft a working CSS or XPath selector.
For future questions, or an edit to this one, note that actual (pared-down) code is strongly preferred over hand waving and loose descriptions of what your code looks like. If you show us the HTML page, or a relevant snippet, and describe which element/text/attribute you want, we can tell you how to select it.
I see six tables on that page. Which is the "crash rate/inspections" table? Given that your URL includes #Inspections on the end, I'm assuming you're talking about the two tables immediately underneath the "Inspections/Crashes In US" section. Here are XPath selectors that match each:
require 'nokogiri'
require 'open-uri'
url = "http://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=800585"
doc = Nokogiri::HTML(open(url))
table1 = doc.at_xpath('//table[#summary="Inspections"][preceding::h4[.//a[#name="Inspections"]]]')
table2 = doc.at_xpath('//table[#summary="Crashes"][preceding::h4[.//a[#name="Inspections"]]]')
# Find a row by index (1 is the first row)
vehicle_inspections = table1.at_xpath('.//tr[2]/td').text.to_i
# Find a row by header text
out_of_service_drivers = table1.at_xpath('.//tr[th="Out of Service"]/td[2]').text.to_i
p [ vehicle_inspections, out_of_service_drivers ]
#=> [6, 0]
tow_crashes = table2.at_xpath('.//tr[th="Crashes"]/td[3]').text.to_i
p tow_crashes
#=> 0
The XPath queries may look intimidating. Let me explain how they work:
//table[#summary="Inspections"][preceding::h4[.//a[#name="Inspections"]]]
//table find a <table> at any level of the document
[#summary="Inspections"] …but only if it has a summary attribute with this value
[preceding::h4…] …and only if you can find an <h4> element earlier in the document
[.//a…] …specifically, a <h4> that has an <a> somewhere underneath it
[#name="Inspections"] …and that <a> has to have a name attribute with this text.
This would actually match two tables (there's another summary="Inspections" table later on the page), but using at_xpath finds the first matching table.
.//tr[2]/td
. Starting at the current node (this table)
//tr[2] …find the second <tr> that is a descendant at any level
/td …and then find the <td> children of that.
Again, because we're using at_xpath we find the first matching <td>.
.//tr[th="Out of Service"]/td[2]
. Starting at the current node (this table)
//tr …find any <tr> that is a descendant at any level
[th="Out of Service] …but only those <tr> that have a <th> child with this text
/td[2] …and then find the second <td> children of those.
In this case there is only one <tr> that matches the criteria, and thus only one <td> that matches, but we still use at_xpath so that we get that node directly instead of a NodeSet with a single element in it.
The goal here (and with any screen scraping) is to latch onto meaningful values on the page, not arbitrary indices.
For example, I could have written my table1 xpath as:
# Find the first table with this summary
table1 = doc.at_xpath('//table[#summary="Inspections"][1]')
…or even…
# Find the 20th table on the page
//table[20]
However, those are fragile. Someone adding a new section to the page, or code that happens to add or remove a formatting table would cause those expressions to break. You want to hunt for strong attributes and text that likely won't change, and anchor your searches based on that.
The vehicle_inspections XPath is similarly fragile, relying on the ordering of rows instead of the label text for the row.

Related

Extract a specific node from an XML file

I want to extract only the body node/tag from an XML file using doc.xpath in Ruby
The node to extract from the XML file:
<wcm:element name="Body"><p>A new study suggests that <a href="ssNODELINK/SmokingAndCancer">tobacco</a> companies may be using online video portals, such as YouTube, to get around advertising restrictions and market their products to young people.</p>
</wcm:element>
I have tried the following:
page_content = doc.xpath("/wcm:root/wcm:element").inner_text
But this extracts every node everything
Then I tried this:
page_content = doc.xpath("/wcm:root/wcm:element/Body")
But does not work.
Anyone has any suggestions how to extract exactly the body section of an XML file using doc.xpath in Ruby?
I'm not 100% certain I've understood what you mean but… let's not let that stop us. You want to get the content of a particular node from the input. Your first XPath statement:
/wcm:root/wcm:element
is extracting every element with name wcm:element that is a child of the wcm:root element which is the root element.
Your second:
/wcm:root/wcm:element/Body
is similar but looks for elements with name Body which are children of the wcm:element.
What you need to is to get the values of the wcm:element element where the attribute name is set to the value Body. You access attributes in XPath by prefixing them with an # sign and to express a where condition you use [...] - a predicate. You XPath statement needs to be:
/wcm:root/wcm:element[#name = 'Body']
I'm assuming that your XPath execution environment is fine the namespace prefixes (wcm) because you say that your first query returned content.

Dealing with duplicate ids in selenium webdriver

I am trying to automate some tests using selenium webdriver. I am dealing with a third-party login provider (OAuth) who is using duplicate id's in their html. As a result I cannot "find" the input fields correctly. When I just select on an id, I get the wrong one.
This question has already been answered for JQuery. But I would like an answer (I am presuming using Xpath) that will work in Selenium webdriver.
On other questions about this issue, answers typically say "you should not have duplicate id's in html". Preaching to the choir there. I am not in control of the webpage in question. If it was, I would use class and id properly and just fix the problem that way.
Since I cannot do that. What options do I get with xpath etc?
you can do it by driver.find_element_by_id, for example ur duplicate "duplicate_ID" is inside "div_ID" wich is unique :
driver.find_element_by_id("div_ID").find_element_by_id("duplicate_id")
for other duplicate id under another div :
driver.find_element_by_id("div_ID2").find_element_by_id("duplicate_id")
This XPath expression:
//div[#id='something']
selects all div elements in the XML document, the string value of whose id attribute is the string "something".
This Xpath expression:
count(//div[#id='something'])
produces the number of the div elements selected by the first XPath expression.
And this XPath expression:
(//div[#id='something'])[3]
selects the third (in document order) div element that is selected by the first XPath expression above.
Generally:
(//div[#id='something'])[$k]
selects the $k-th such div element ($k must be substituted with a positive integer).
Equipped with this knowledge, one can get any specific div whose id attribute has string value "something".
Which language are you working on? Dublicate id's shouldn't be a problem as you can virtually grab any attribute not just the id tag using xpath. The syntax will differ slightly in other languages (let me know if you want something else than Ruby) but this is how you do it:
driver.find_element(:xpath, "//input[#id='loginid']"
The way you go about constructing the xpath locator is the following:
From the html code you can pick any attribute:
<input id="gbqfq" class="gbqfif" type="text" value="" autocomplete="off" name="q">
Let's say for example that you want to consturct your xpath with the html code above (Google's search box) using name attribute. Your xpath will be:
driver.find_element(:xpath, "//input[#name='q']"
In other words when the id's are the same just grab another attribute available!
Improvement:
To avoid fragile xpath locators such as order in the XML document (which can change easily) you can use something even more robust. Two xpath locators instead of one. This can also be useful when dealing with hmtl tags that are really similar. You can locate an element by 2 of its attributes like this:
driver.find_element(:id, 'amount') and driver.find_element(xpath: "//input[#maxlength='50']")
or in pure xpath one liner if you prefer:
//input[#id="amount" and #maxlength='50']
Alternatively (and provided your xpath will only return one unique element) you can move one more step higher in the abstraction level; completely omitting the attribute values:
//input[#id and #maxlength]
It's not listed at http://selenium-python.readthedocs.io/locating-elements.html but I'm able access a method find_elements_by_id
This returns a list of all elements with the duplicate ID.
links = browser.find_elements_by_id("link")
for link in links:
print(link.get_attribute("href"))
you should use driver.findElement(By.xpath() but while locating element with firebug you should select absolute path for particular element instead of getting relative path this is how you will get the element even with duplicate ID's

Using Ruby and Mechanize to make new Olympic medals count

I want to remake the Olympic medals count on London2012 to better reflect the value of the medals. Currently it is only sorted by gold medals. I'd like to relist it by points, so gold=4, silver=2 and bronze=1 to make a new more rational list. I probably want to remember the previous rank then add a new rank column as well.
I'd like to try mechanize to get raw data from site, then parse the data into rows and cols, apply the new counts, then remake the list.
From source at http://www.london2012.com/medals/medal-count/ each country has a block with medals like so:
<span class="countryName">Canada</span></a></div></div></td><td class="gold c">0</td><td class="silver c">2</td><td class="bronze c">5</td>
If I use agent.get('http://www.london2012.com/medals/medal-count') It shows the whole list. How to parse specific spans and table data?
I also need to remember the rank, then when I make the new page put the new rank beside it.
Any tips on mechanize parsing and remembering data would be really helpful. More importantly your thinking process in doing something like this, I'd appreciate the help to get me started. This doesn't have to be a code answer
Thanks
First to identify the table. In chrome load the page and right click anywhere on the table. Go to inspect element. Go up the heirarchy until you're on the table. Now select it and you'll see it looks like this:
<table class="or-tbl overall_medals sortable" summary="Schedule">
The overall_medals class looks like it will be unique so that's a good one to use. Now start irb and do:
require 'mechanize'
agent = Mechanize.new
page = agent.get 'http://www.london2012.com/medals/medal-count/'
double check that the table is unique:
page.search('table.overall_medals').size
#=> 1 (good, it is)
You can get all the data from the table into an array with:
page.search('table.overall_medals tr').map{|tr| tr.search('td').map(&:text)}
Notice that the first 2 rows are empty, let's get rid of them by using a range:
data = page.search('table.overall_medals tr')[2..-1].map{|tr| tr.search('td').map(&:text)}
The second row isn't really empty, it has the column names (in th's instead of td's). You can get those with:
columns = page.search('table.overall_medals tr[2] th').map{|th| th.text.strip}
You can get these into hashes with:
rows = data.map{|row| Hash[columns.zip row]}
Now you can do
rows[0]['Country']
#=> "United States of America"
Or even one big hash:
countries = rows.map{|row| {row['Country'] => row}}.reduce &:merge
now:
countries['France']['Gold']
#=> "8"
You might find this Medals API useful (Assuming your question is not specifically about Mechanize)
http://apify.heroku.com/resources/5014626da8cdbb0002000006
It uses Nokogiri to parse the site and the output is available as JSON:
http://apify.heroku.com/api/olympics2012_medals.json

Matching text with xpath?

I'm screen-scraping an HTML page which contains:
<table border=1 class="searchresult" cellpadding=2>
<tr><th colspan=2>Last search</th></tr>
<tr><th align=left>Search term</th><td>xxxxxx</td></tr>
<tr><th align=left>Result</th><td>yyyyyyyy/td></tr>
</table>
I want to write an XPATH expression which gets me the data cell containing "yyyyyyyy". I've gotten as far as
.//table[#class='searchresult']//tr/th
which gets me a list of all the table-header nodes in the table. I can iterate over them in user code, find the one whose .text is "Results" and then call .getnext() on that to get the table-data. But, is there a cleaner way to do this by writing a more specific XPATH pattern? It seems like there should be, but I haven't gotten my head that far around XPATH yet to figure out how.
If it matters, I'm doing this in Python with lxml.
.//table[#class='searchresult']//tr/td[preceding-sibling::th] might give you what you need.
Two comprehensive papers on semi-automatically creating XPath statements like this one, specifically for screen scraping purposes can be found here:
http://tobiasanton.com/Tobias_Anton/Academia.html
Use:
//table/tr[last()]/td
This selects any td element that is a child of any tr that is the last tr child of any table in this XHTML document.
This may select more than one td element, depending on whether or not there is only one table in the XHTML document. You need to make this expression more precise, if more than one table element is present.
For example, if the table in question is the first in the document, use:
(//table)[1]/tr[last()]/td

How do I use XPath in Nokogiri?

I have not found any documentation nor tutorial for that. Does anything like that exist?
doc.xpath('//table/tbody[#id="threadbits_forum_251"]/tr')
The code above will get me any table, anywhere, that has a tbody child with the attribute id equal to "threadbits_forum_251". But why does it start with double //? Why there is /tr at the end? See "Ruby Nokogiri Parsing HTML table II" for more details.
Can anybody tell me how to extract href, id, alt, src, etc., using Nokogiri?
td[3]/div[1]/a/text()' <--- extracts text
How can I extract other things?
Seems you need to read a XPath Tutorial
Your //table/tbody[#id="threadbits_forum_251"]/tr expression means:
// - Anywhere in your XML document
table/tbody - take a table element with a tbody child
[#id="threadbits_forum_251"] - where id attribute are equals to "threadbits_forum_251"
tr - and take its tr elements
So, basically, you need to know:
attributes begins with #
conditions go inside [] brackets
If I correcly understood that API, you can go with doc.xpath("td[3]/div[1]/a")["href"], or td[3]/div[1]/a/#href if there is just one <a> element.
Your XPath is correct and you seem to have answered your own question's first part (almost):
doc.xpath('//table/tbody[#id="threadbits_forum_251"]/tr')
"the code above will get me any table table's tr, anywhere, that has a tbody child with the attribute id equal to threadbits_forum_251"
// means the following element can appear anywhere in the document.
/tr at the end means, get the tr node of the matching element.
You dont need to extract each attribute one by one. Just get the entire node containing all four attributes in Nokogiri, and get the attributes using:
theNode['href']
theNode['src']
Where theNode is your Nokogiri Node object.
Edit:
Sorry I haven't used these libraries, but I think the XPath evaluation and parsing is being done by Mechanize. So here's how you would get the entire element and its attributes in one go.
doc.xpath("td[3]/div[1]/a").each do |anchor|
puts anchor['href']
puts anchor['src']
...
end

Resources