Extract href in table with importxml in Google spreadsheet - xpath

I am trying to pull the href for each row of each table from this website:
http://www.epa.gov/region4/superfund/sites/sites.html#KY
I can pull the table data using =IMPORTHTML(A1,"table",1) (changing the index for each of the 7 tables), but I also need the href pointing to each site's detail page.
Using =IMPORTXML(A1,"//div[@class='box']") I can pull the information needed from a site like:
http://www.epa.gov/region4/superfund/sites/fedfacs/alarmyaplal.html
but I need to extract the fedfacs/alarmyaplal.html portion for each row on the original page.
I've tried using //@href, but it is not returning any results. I'm thinking it is because the data is structured in a table, but I'm stuck on where to go from here.

I'm not sure about any of the Google Spreadsheet functionality, but here's an XPath to select all href attributes of the Kentucky sites (since your first link included the 'ky' anchor):
//body//a[@id='ky']/following-sibling::table[1]/tbody/tr/td[1]/strong/a/@href
This is very specific to the Kentucky table: following-sibling::table[1] means the first table element that follows a[@id='ky'] at the same level of the tree.
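In a spreadsheet, that XPath can go straight into IMPORTXML. Note that Google's parser works on the raw HTML, which may not include tbody elements, so a slightly looser variant (a sketch, untested against the live page, assuming the page URL is in A1) would be:
=IMPORTXML(A1, "//a[@id='ky']/following-sibling::table[1]//tr/td[1]//a/@href")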

Related

Keep having errors with importxml + xpath

I've spent hours trying to fix this but can't find where the issue is.
I'm trying to import data into a Google spreadsheet using importxml.
Here is the URL:
http://www.journaldesfemmes.com/maman/creches/3-pom/creche-3098
I'm interested in extracting the email and phone number, for example. I used the Chrome inspector to copy the XPath, plus a few Chrome plugins. I guess the issue is the XPath. Here is the formula I used in the spreadsheet:
=importxml("http://www.journaldesfemmes.com/maman/creches/3-pom/creche-3098";"/html/body/div[4]/div/div[1]/div[2]/div[1]/div/div/div/div/div[10]/table/tbody/tr[2]/td[2]")
Hope someone can help
Since the data you want is in tables, it might be easier to use importhtml.
You can get the table you want with this:
=IMPORTHTML("http://www.journaldesfemmes.com/maman/creches/3-pom/creche-3098","table",2)
To get just the phone number, wrap it in index (giving the row and column within the table):
=index(IMPORTHTML("http://www.journaldesfemmes.com/maman/creches/3-pom/creche-3098","table",2),3,2)
The email is:
=index(IMPORTHTML("http://www.journaldesfemmes.com/maman/creches/3-pom/creche-3098","table",2),4,2)
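If you want both values from a single formula, you can stack the two index calls in an array literal (a sketch; this uses the US-locale syntax where ; separates rows, and other locales use a different row separator):
={index(IMPORTHTML("http://www.journaldesfemmes.com/maman/creches/3-pom/creche-3098","table",2),3,2); index(IMPORTHTML("http://www.journaldesfemmes.com/maman/creches/3-pom/creche-3098","table",2),4,2)}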

Web Crawling using import.io

I am trying to crawl the website https://goo.gl/THqDhD using the import.io tool. I used the connector tool to parse the whole search result for a specific query (including the pagination), and successfully chose all the rows in the search result, but was unable to select the items' image box (as a column).
import.io allows manual XPath overriding for the selection, so I tried to select the images in the search results using the following XPath:
.//*[@id='container-inner']/div[3]/div[4]/div[*]/div[1]/div/a/
which should represent the columns of the table, but I got the following problem
What you have selected is not within a result
The result here is the previously selected rows, but I inspected the item box and made sure the selection is inside it. Any help, please?

XPath: Forms with Unique IDs

I am trying to use XPath as part of a data scraper in order to scrape random comments from reddit for a project. The problem is, the comment forms have unique IDs that change on every page and within comment indent levels. I'm not sure how to make XPath target all of the comment fields with these different IDs.
An example is shown below:
//form[@id='form-t1_cj8cyupxa3']/div
//form[@id='form-t1_cj8e0iyx6w']/div
If there is some pattern to the id, then try e.g. //form[starts-with(@id, 'form-')]/div
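Assuming the stable part of reddit's comment-form ids is the longer form-t1_ prefix shown in your examples, you can tighten the match; contains() also works if the stable text isn't at the start of the attribute:
//form[starts-with(@id, 'form-t1_')]/div
//form[contains(@id, 'form-t1_')]/div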

How do I quickly scan my DB and update it only if an external source has changed?

So I have a Links table of links that was initially populated using Nokogiri.
I just crawled a site, got all the links in the site and dumped them into a table.
I don't expect some of them to change too often - maybe once per month. Some will never change. But basically I want to run a method that will execute Nokogiri and come back with a list of links.
I want to check each of the links against my database and only add a new record when a link is found that is not in the db.
How do I go about doing that in the most efficient way possible?
Assume I have an array new_links of the newest links that I got from Nokogiri.
Thanks.
To insert only the new links
# Remove already-stored URLs from the new_links array, then insert the rest
links_to_insert = new_links - Link.where(url: new_links).pluck(:url)
links_to_insert.each { |url| Link.create!(url: url) }
Elegant?
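For large batches, one INSERT per link adds up. A bulk variant (a sketch, assuming Rails 6+ for insert_all, which skips validations and callbacks):
# Fetch only the URLs we already have, then insert the rest in one statement
existing = Link.where(url: new_links).pluck(:url)
rows = (new_links - existing).map { |url| { url: url } }
Link.insert_all(rows) if rows.any?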

Selecting table data from a webpage

I'm trying to get the results from the Empire magazine website (the Film Reviews (Popular Matches) table) using YQL - http://www.empireonline.com/search/default.asp?search=Dragonheart (as an example). I'm using Firebug to get the XPath, but it doesn't seem to return results. This is what I'm using:
select * from html where url='http://www.empireonline.com/search/default.asp?search=cars' and xpath='/html/body/table[3]/tbody/tr[5]/td[2]/table[2]/tbody/tr/td/table[2]/tbody/tr/td/table[2]'
Now it seems to be able to use;
select * from html where url='http://www.empireonline.com/search/default.asp?search=cars' and xpath='//table'
But that's a whole lot of data I don't need to chuck about.
You just need to be mindful when crafting the appropriate XPath query. The following gets the link and name of each of the reviews listed in that HTML table by first targeting the "Film Reviews (Popular Matches)" paragraph, then navigating to the list of films.
SELECT href, strong
FROM html
WHERE url = 'http://www.empireonline.com/search/default.asp?search=Thor'
AND xpath = '
//p[.="Film Reviews (Popular Matches)"]
/ancestor::table[1]
/following-sibling::table[1]
//td[2]/a
'
(Try this query in the YQL console.)
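The same XPath also ports back to the IMPORTXML approach from the first question (a sketch, untested against the live page; appending /@href returns the links rather than the link text):
=IMPORTXML("http://www.empireonline.com/search/default.asp?search=Thor", "//p[.='Film Reviews (Popular Matches)']/ancestor::table[1]/following-sibling::table[1]//td[2]/a/@href")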
