See hierarchy below:
All I need here is "Company Title", "Company Owner", "Company Owner Title", "Street Number Street Name", and "City, State Zipcode".
I tried b.div.span.bs, but that didn't work (bs because there are multiple blocks I'm gathering data from). I also thought I'd just try something like b.tds.split('<br>') and then replace all instances of tags and somehow delete empty array cells, but I found that each block is different, so the data don't align, i.e., Company Title might be in cell 1 for the first array, but then if Company Title isn't present (for the second block) then cell 1 would be Company Owner, which is conflicting... Anyway, just trying to find a clever way to get these data. Thank you.
Here is the actual HTML; however you must first click "View All".
You can split out everything inside the <div> and then split that by <br>. The first part is Company Title (if exists) and then Company Owner is last/second.
The rest is ... trickier. Some are pretty straightforward in that Fax and Member Since have labels, so those are easy. The <a> is easy.
You could probably test the phone number with a regex and then back up from there. If the one before the phone number isn't <a> then it's city, state zip and the one before that is the address. If one exists before that, it's the Company Owner Title.
Everything after the phone number in your examples has a label, so those are easy.
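The "back up from the phone number" idea above could be sketched roughly like this. It assumes the lines after the <div> have already been split on <br> and stripped of tags; the PHONE pattern, the fields_before_phone helper, and the sample lines are all hypothetical, and the regex only covers common US phone formats.

```ruby
# Hypothetical pattern for US-style phone numbers, e.g. "(555) 123-4567" or "555-123-4567".
PHONE = /\A\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\z/

# Locate the phone number, then walk backwards from it:
# one line back is "City, State Zipcode", two lines back is the street address,
# and a third line back (if present) is the Company Owner Title.
def fields_before_phone(lines)
  phone_at = lines.index { |line| line.match?(PHONE) }
  return nil unless phone_at && phone_at >= 2

  {
    owner_title:    phone_at >= 3 ? lines[phone_at - 3] : nil,
    street:         lines[phone_at - 2],
    city_state_zip: lines[phone_at - 1],
    phone:          lines[phone_at]
  }
end

# Made-up sample blocks, one with an owner title and one without.
with_title    = ["President", "12 Main St", "Springfield, IL 62701", "(555) 123-4567", "Fax: (555) 123-4568"]
without_title = ["12 Main St", "Springfield, IL 62701", "555-123-4567"]

p fields_before_phone(with_title)
p fields_before_phone(without_title)
```

Because every field is located relative to the phone number, a missing Company Owner Title no longer shifts the other values into the wrong slot.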
I'm not sure about all of your use cases, but often for pages where the DOM is not very helpful I just get the text and parse it with Ruby:
browser.td.text.split("\n").reject(&:empty?)
This doesn't directly answer the question, but it shows how I'd go about doing this using Nokogiri, which is the standard HTML/XML parser for Ruby:
require 'nokogiri'
doc = Nokogiri::HTML('<td><div></div><br>a<br>b<br>c</td>')
doc is Nokogiri's internal representation of the document.
We use landmarks in the markup to navigate and find things we want. In this case <div> is a good starting point:
doc.at('div').next_sibling.next_sibling.text # => "a"
next_sibling is how we tell Nokogiri to look at the next node. In this case it's stepping past the first <br> and looking at the text node containing "a".
That'd result in unworkable code though, so there's a better way to go:
doc.search('td br').to_html # => "<br><br><br>"
That shows we can find all the <br> tags inside the <td>, so we just have to iterate over them and use them as our landmarks:
doc.search('td br').map{ |br| br.next_sibling.text } # => ["a", "b", "c"]
I have looked through several posts about this, but have failed to apply the principles used to get the result I desire, so I'm going to just post my specific problem.
I am building a Google Sheet that enables the user to pull up Bible verses.
I have it all working, however I am running into an issue with a hidden element being pulled into my text().
FUNCTION:
=IMPORTXML("http://www.biblestudytools.com/ESV/Numbers/5-3.html",
"//*[@class='scripture']//span[2]//text()")
RESULT: You shall put out both male and female, putting them outside the camp, that they may not defile their camp, 1in the midst of which I dwell."
You can see the "1" that is showing up before the word "in"
I have found the xPath that pulls only that "1"
//*[@class='scripture']//span[2]//sup//text()
I am trying to remove that "1" from the text.
HELP PLEASE!!! :)
You can add a predicate to the end to exclude text nodes that are inside sup elements:
=IMPORTXML("http://www.biblestudytools.com/ESV/Numbers/5-3.html",
"//*[@class='scripture']//span[2]//text()[not(ancestor::sup)]")
This will retrieve only the text nodes that are not inside a sup element, but it will still result in having the verse spread out across two cells, because there are two text nodes. You can rectify this by wrapping this expression in a JOIN():
=JOIN("", IMPORTXML("http://www.biblestudytools.com/ESV/Numbers/5-3.html",
"//*[@class='scripture']//span[2]//text()[not(ancestor::sup)]"))
I originally wrote 800 lines to do this, site by site. However, on talking to a couple of people, it seems like my code is way longer than it needs to be.
So, I've got an idea of what you'd do in Python, with a particular Egg, but I'm working with Ruby. So, does anyone have any idea how to enter details in a form field, based on what the label for it is, rather than the id/name? Using Mechanize.
Let's say your html looks like:
<label>Foo</label>
<input name="foo_field">
You can get the name of the input following a specific label:
name = page.at('label[text()="Foo"] ~ *[name]')[:name]
#=> "foo_field"
and use that to set the form value
form[name] = 'bar'
I want to remake the Olympic medals count on London2012 to better reflect the value of the medals. Currently it is only sorted by gold medals. I'd like to relist it by points, so gold=4, silver=2 and bronze=1 to make a new more rational list. I probably want to remember the previous rank then add a new rank column as well.
I'd like to try mechanize to get raw data from site, then parse the data into rows and cols, apply the new counts, then remake the list.
From source at http://www.london2012.com/medals/medal-count/ each country has a block with medals like so:
<span class="countryName">Canada</span></a></div></div></td><td class="gold c">0</td><td class="silver c">2</td><td class="bronze c">5</td>
If I use agent.get('http://www.london2012.com/medals/medal-count') It shows the whole list. How to parse specific spans and table data?
I also need to remember the rank, then when I make the new page put the new rank beside it.
Any tips on mechanize parsing and remembering data would be really helpful. More importantly your thinking process in doing something like this, I'd appreciate the help to get me started. This doesn't have to be a code answer
Thanks
First, identify the table. In Chrome, load the page and right-click anywhere on the table. Go to Inspect Element. Go up the hierarchy until you're on the table. Now select it and you'll see it looks like this:
<table class="or-tbl overall_medals sortable" summary="Schedule">
The overall_medals class looks like it will be unique so that's a good one to use. Now start irb and do:
require 'mechanize'
agent = Mechanize.new
page = agent.get 'http://www.london2012.com/medals/medal-count/'
double check that the table is unique:
page.search('table.overall_medals').size
#=> 1 (good, it is)
You can get all the data from the table into an array with:
page.search('table.overall_medals tr').map{|tr| tr.search('td').map(&:text)}
Notice that the first 2 rows are empty, let's get rid of them by using a range:
data = page.search('table.overall_medals tr')[2..-1].map{|tr| tr.search('td').map(&:text)}
The second row isn't really empty, it has the column names (in th's instead of td's). You can get those with:
columns = page.search('table.overall_medals tr[2] th').map{|th| th.text.strip}
You can get these into hashes with:
rows = data.map{|row| Hash[columns.zip row]}
Now you can do
rows[0]['Country']
#=> "United States of America"
Or even one big hash:
countries = rows.map{|row| {row['Country'] => row}}.reduce &:merge
now:
countries['France']['Gold']
#=> "8"
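Once the rows are in hashes, the re-ranking by points that the question asks for (gold=4, silver=2, bronze=1) could look something like the sketch below. The WEIGHTS constant, the rerank helper, the symbol keys, and the sample rows are all hypothetical; only the Canada counts (0, 2, 5) come from the snippet in the question. The scraped values are strings, hence the to_i.

```ruby
# Point values proposed in the question.
WEIGHTS = { gold: 4, silver: 2, bronze: 1 }.freeze

# Add a :points total to each row, sort by points (highest first),
# and record the resulting position as :new_rank next to the old rank.
def rerank(rows)
  rows
    .map { |row| row.merge(points: WEIGHTS.sum { |medal, weight| row[medal].to_i * weight }) }
    .sort_by { |row| -row[:points] }
    .each_with_index
    .map { |row, i| row.merge(new_rank: i + 1) }
end

# Made-up sample rows, apart from the Canada medal counts.
medals = [
  { country: "A",      old_rank: 1, gold: "3", silver: "0", bronze: "0" },
  { country: "B",      old_rank: 2, gold: "2", silver: "4", bronze: "1" },
  { country: "Canada", old_rank: 3, gold: "0", silver: "2", bronze: "5" }
]

rerank(medals).each do |row|
  puts format("%-8s old %d  new %d  points %d", row[:country], row[:old_rank], row[:new_rank], row[:points])
end
```

Note that sort_by is not guaranteed to be stable, so countries with equal points would need a tie-breaker (for example the old rank) if their relative order matters.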
You might find this Medals API useful (Assuming your question is not specifically about Mechanize)
http://apify.heroku.com/resources/5014626da8cdbb0002000006
It uses Nokogiri to parse the site and the output is available as JSON:
http://apify.heroku.com/api/olympics2012_medals.json
I am using Nokogiri to scrape a website and am running into an issue when I try to grab a field from a table. I am using selector gadget to find the CSS selector of the table. I am grabbing data from a government website that details information on motor carriers.
The method that I am using looks like:
def scrape_database
url = "http://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=#{self.dot}#Inspections"
doc = Nokogiri::HTML(open(url))
self.name = doc.at_css("tr:nth-child(4) .queryfield").text
self.address = doc.at_css("tr:nth-child(6) .queryfield").text
end
I grab all of the fields in the upper table using that syntax and the method operates fine, however I am having issues with the crash rate/inspections table below it.
Here is what I am using to grab that info:
self.vehicle_inspections = doc.at_css("center:nth-child(13) tr:nth-child(2) :nth-child(2)").text
undefined method `text' for nil:NilClass
If I remove text from the end of this, the method runs but doesn't grab any relevant information (obviously). I am assuming this is due to the complicated selector that I am using to grab the field, but am not quite sure.
Has anyone run into a similar problem and can you give me some advice?
Yes, that error means that your CSS selector is not finding the information; at_css is returning nil, and nil.text is not valid. You can guard against it like so:
insp = doc.at_css("long example css selector")
self.vehicle_inspections = insp && insp.text
However, it sounds to me like you "need" this data. Since you have not provided the HTML page or the relevant markup, I can't help you craft a working CSS or XPath selector.
For future questions, or an edit to this one, note that actual (pared-down) code is strongly preferred over hand waving and loose descriptions of what your code looks like. If you show us the HTML page, or a relevant snippet, and describe which element/text/attribute you want, we can tell you how to select it.
I see six tables on that page. Which is the "crash rate/inspections" table? Given that your URL includes #Inspections on the end, I'm assuming you're talking about the two tables immediately underneath the "Inspections/Crashes In US" section. Here are XPath selectors that match each:
require 'nokogiri'
require 'open-uri'
url = "http://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=800585"
doc = Nokogiri::HTML(open(url))
table1 = doc.at_xpath('//table[@summary="Inspections"][preceding::h4[.//a[@name="Inspections"]]]')
table2 = doc.at_xpath('//table[@summary="Crashes"][preceding::h4[.//a[@name="Inspections"]]]')
# Find a row by index (1 is the first row)
vehicle_inspections = table1.at_xpath('.//tr[2]/td').text.to_i
# Find a row by header text
out_of_service_drivers = table1.at_xpath('.//tr[th="Out of Service"]/td[2]').text.to_i
p [ vehicle_inspections, out_of_service_drivers ]
#=> [6, 0]
tow_crashes = table2.at_xpath('.//tr[th="Crashes"]/td[3]').text.to_i
p tow_crashes
#=> 0
The XPath queries may look intimidating. Let me explain how they work:
//table[@summary="Inspections"][preceding::h4[.//a[@name="Inspections"]]]
//table find a <table> at any level of the document
[@summary="Inspections"] …but only if it has a summary attribute with this value
[preceding::h4…] …and only if you can find an <h4> element earlier in the document
[.//a…] …specifically, an <h4> that has an <a> somewhere underneath it
[@name="Inspections"] …and that <a> has to have a name attribute with this text.
This would actually match two tables (there's another summary="Inspections" table later on the page), but using at_xpath finds the first matching table.
.//tr[2]/td
. Starting at the current node (this table)
//tr[2] …find the second <tr> that is a descendant at any level
/td …and then find the <td> children of that.
Again, because we're using at_xpath we find the first matching <td>.
.//tr[th="Out of Service"]/td[2]
. Starting at the current node (this table)
//tr …find any <tr> that is a descendant at any level
[th="Out of Service"] …but only those <tr> that have a <th> child with this text
/td[2] …and then find the second <td> child of each.
In this case there is only one <tr> that matches the criteria, and thus only one <td> that matches, but we still use at_xpath so that we get that node directly instead of a NodeSet with a single element in it.
The goal here (and with any screen scraping) is to latch onto meaningful values on the page, not arbitrary indices.
For example, I could have written my table1 xpath as:
# Find the first table with this summary
table1 = doc.at_xpath('//table[@summary="Inspections"][1]')
…or even…
# Find the 20th table on the page
//table[20]
However, those are fragile. Someone adding a new section to the page, or code that happens to add or remove a formatting table would cause those expressions to break. You want to hunt for strong attributes and text that likely won't change, and anchor your searches based on that.
The vehicle_inspections XPath is similarly fragile, relying on the ordering of rows instead of the label text for the row.
I'm having a really strange issue with watir-webdriver.
Here's a snapshot of the input tag I'm trying to reach (couldn't figure out a way to get the source after the javascripts created the popup, lol)
Anyway, here's some of my code that uses xpath to locate these elements (there are two text fields and a select tag)
firstname = b.element(:xpath, "//div[@class='ap_popover']/input[@name='firstName']")
lastname = b.element(:xpath, "//div[@class='ap_popover']/input[@name='lastName']")
authorselector = b.element(:xpath, "//div[@class='ap_popover']/select")
puts firstname
puts lastname
puts authorselector
This code successfully returns the watir element objects. However when I try to cast them:
puts firstname.to_subtype
it freaks out:
C:/Ruby192/lib/ruby/gems/1.9.1/gems/watir-webdriver-0.4.1/lib/watir-webdriver/elements/element.rb:262:in
`assert_exists': unable to locate element, using
{:xpath=>"//div[@class='ap_popover']/input[@name='lastName']"}
(Watir::Exception::UnknownObjectException)
So, what's going on? It can find them via xpath no problem, but then when I try to cast them, all of a sudden the xpath search fails?
It's worth mentioning the HTML I'm perusing is created in its entirety by JavaScript, which is why I couldn't just copy/paste it here and had to take a screenshot.
Thanks!
XPath is evil; avoid it if at all possible. It's too easy to make mistakes, hard to read, and generally slower.
Have you tried something like
b.div(:id => 'contributors-table').text_field(:name => 'firstName')
If you have some wacky invalid HTML with two copies of all this stuff (and thus duplicated ID values, which the HTML standard does not allow), then you can add the INDEX of the element. In this case that might be needed both for the div container and for the input field, if there is more than one of them.
b.divs(:id => 'contributors-table').size # how many are there?
# example: second instance of the contributors table, third instance in that table of a text input field with the name 'firstName'
b.div(:id => 'contributors-table', :index => 1).text_field(:name => 'firstName', :index => 2)