Selecting table data from a webpage - xpath

I'm trying to get the results from the "Film Reviews (Popular Matches)" table on the Empire magazine website using YQL, for example http://www.empireonline.com/search/default.asp?search=Dragonheart. I used Firebug to get the XPath, but the query doesn't return any results. This is what I'm using:
select * from html where url='http://www.empireonline.com/search/default.asp?search=cars' and xpath='/html/body/table[3]/tbody/tr[5]/td[2]/table[2]/tbody/tr/td/table[2]/tbody/tr/td/table[2]'
This query, on the other hand, does return results:
select * from html where url='http://www.empireonline.com/search/default.asp?search=cars' and xpath='//table'
But that's a whole lot of data I don't need to chuck about.

You just need to be mindful when crafting the appropriate XPath query. The following gets the link and name of each of the reviews listed in that HTML table by first targeting the "Film Reviews (Popular Matches)" paragraph, then navigating to the list of films.
SELECT href, strong
FROM html
WHERE url = 'http://www.empireonline.com/search/default.asp?search=Thor'
AND xpath = '
//p[.="Film Reviews (Popular Matches)"]
/ancestor::table[1]
/following-sibling::table[1]
//td[2]/a
'
(Try this query in the YQL console.)
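The navigation in that XPath (find the heading paragraph, climb to its enclosing table, step to the next sibling table, then take the link in each second cell) can be sketched outside YQL as well. The following is a minimal sketch using only Python's standard library, which does not support the ancestor or following-sibling axes, so those two steps are emulated by hand with a parent map. The SAMPLE markup and the function name popular_matches are assumptions made for illustration; the real page's structure may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; not the real Empire page.
SAMPLE = """
<html><body>
  <table><tr><td><p>Film Reviews (Popular Matches)</p></td></tr></table>
  <table>
    <tr><td>1</td><td><a href="/reviews/thor"><strong>Thor</strong></a></td></tr>
    <tr><td>2</td><td><a href="/reviews/thor2"><strong>Thor: The Dark World</strong></a></td></tr>
  </table>
  <table><tr><td>x</td><td><a href="/other"><strong>Other</strong></a></td></tr></table>
</body></html>
"""


def popular_matches(markup, heading="Film Reviews (Popular Matches)"):
    root = ET.fromstring(markup)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}

    # //p[.="heading"]
    p = next(el for el in root.iter("p")
             if "".join(el.itertext()).strip() == heading)

    # ancestor::table[1]: climb until the nearest enclosing table.
    node = p
    while node.tag != "table":
        node = parent[node]

    # following-sibling::table[1]: first later sibling that is a table.
    siblings = list(parent[node])
    later = siblings[siblings.index(node) + 1:]
    target = next(el for el in later if el.tag == "table")

    # //td[2]/a, returning (href, strong text) like "SELECT href, strong".
    return [(a.get("href"), a.findtext("strong"))
            for a in target.findall(".//td[2]/a")]
```

Anchoring on the heading text rather than on absolute positions such as table[3] is the design choice the answer relies on: the query keeps working if unrelated tables are added or removed elsewhere on the page.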

How can I do web scraping to get prices for my products which I have on my Google spreadsheet?

could you please give me an idea about how I can get thi
Many sites go to great lengths to actively prevent scraping. Giving you just the data you want entirely undermines their business model. If you're a consumer, they're denied the chance to show you advertising. If you're a reseller, you can use fairly simple programming and marketing to undercut their prices.
If you find yourself unable to scrape, it may be because it's not going to be possible.
Unfortunately, that won't be possible because the site's content is rendered by JavaScript, and Google Sheets can't execute JavaScript when importing a page. You can test this simply by disabling JavaScript for a given link: you will see a blank page.
A workaround: you can import the data with the ImportJSON script (credits to Brad Jasper), then filter it with a QUERY formula. This is an example with "iPhone 8" and "Playstation 4".
In column A, write the product to search for. The URL for the JSON data is built automatically in column B with the concatenation operator.
="https://wss2.cex.uk.webuy.io/v3/boxes?q="&A2
In column C, you have the QUERY formula combined with the ImportJSON data step.
=QUERY(ImportJSON(B2);"SELECT Col4,Col20 WHERE Col4 CONTAINS 'Plus' AND Col4 CONTAINS '64' AND Col4 CONTAINS 'Unlocked' LIMIT 1 label Col4'',Col20''";1)
Col4 is the product description and Col20 is the price of the product. Since the JSON will return a lot of results (multiple iPhone 8 versions), this is the step where you can refine your search. I've searched for "Plus", "64" and "Unlocked" in the product description.
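The same two steps (build the URL from the search term, then keep the first row whose description contains every required word) can be sketched in Python. This is only an illustration of the filtering logic: the field names "description" and "price" and the SAMPLE_ROWS data are assumptions, not the actual shape of the JSON response, and unlike the spreadsheet concatenation this sketch URL-encodes the search term.

```python
from urllib.parse import quote

BASE_URL = "https://wss2.cex.uk.webuy.io/v3/boxes?q="


def build_url(product):
    """Column B: append the (URL-encoded) search term to the endpoint."""
    return BASE_URL + quote(product)


def first_match(rows, required, limit=1):
    """Column C: keep rows whose description contains every required word.

    Matching is case-sensitive, like CONTAINS in the QUERY formula.
    """
    hits = [
        (row["description"], row["price"])
        for row in rows
        if all(word in row["description"] for word in required)
    ]
    return hits[:limit]


# Hypothetical rows standing in for the flattened JSON response.
SAMPLE_ROWS = [
    {"description": "Apple iPhone 8 64GB Grey, Unlocked B", "price": 100},
    {"description": "Apple iPhone 8 Plus 64GB Gold, Unlocked A", "price": 150},
    {"description": "Apple iPhone 8 Plus 256GB Silver, Unlocked A", "price": 200},
]
```

Calling first_match(SAMPLE_ROWS, ["Plus", "64", "Unlocked"]) mirrors the three CONTAINS conditions and the LIMIT 1 clause of the formula.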

Web Crawling using import.io

I am trying to crawl the website https://goo.gl/THqDhD using the import.io tool. I used the connector tool to parse the whole search result for a specific query (including pagination) and successfully selected all the rows in the search result, but I was unable to select the items' image box as a column.
import.io allows manually overriding the XPath for the selected element, so I tried to select the images in the search results using the following XPath:
.//*[@id='container-inner']/div[3]/div[4]/div[*]/div[1]/div/a/
which should represent the column of the table, but I got the following error:
What you have selected is not within a result
The "result" here is the previously selected rows, but I inspected the item box and made sure that the selection is inside them. Any help, please?

Extract href in table with importxml in Google spreadsheet

I am trying to pull the href for each row of each table from this website:
http://www.epa.gov/region4/superfund/sites/sites.html#KY
I can pull the table information off using =IMPORTHTML(A1,"table",1) for all 7 tables, but I need the href to the site with the detailed information.
Using =IMPORTXML(A1,"//div[@class='box']") I can pull the information needed from a site like:
http://www.epa.gov/region4/superfund/sites/fedfacs/alarmyaplal.html
but I need to extract the fedfacs/alarmyaplal.html portion for each row on the original page.
I've tried using //@href, but it is not returning any results. I'm thinking it is because the data is structured in a table, but I'm stuck on where to go from here.
I'm not sure about any of the Google Spreadsheet functionality, but here's an XPath to select all href attributes of the Kentucky sites (since your first link included the 'ky' anchor):
//body//a[@id='ky']/following-sibling::table[1]/tbody/tr/td[1]/strong/a/@href
This is very specific to the Kentucky table: following-sibling::table[1] means the first table node after, and at the same level as, a[@id='ky'].
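To see what that expression selects, here is a minimal sketch in Python using only the standard library. ElementTree understands the [@id='ky'] and td[1] predicates but not the following-sibling axis, so that step is emulated by scanning the anchor's later siblings. The SAMPLE markup, the Kentucky href values and the function name hrefs_after_anchor are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the Kentucky hrefs are made up.
SAMPLE = """
<body>
  <a id="al"/>
  <table><tbody>
    <tr><td><strong><a href="fedfacs/alarmyaplal.html">Site</a></strong></td></tr>
  </tbody></table>
  <a id="ky"/>
  <table><tbody>
    <tr><td><strong><a href="ky/site-a.html">Site A</a></strong></td><td>City</td></tr>
    <tr><td><strong><a href="ky/site-b.html">Site B</a></strong></td><td>City</td></tr>
  </tbody></table>
</body>
"""


def hrefs_after_anchor(markup, anchor_id):
    root = ET.fromstring(markup)
    # Map each element to its parent so siblings can be inspected.
    parent = {child: node for node in root.iter() for child in node}

    # //a[@id='...']
    anchor = root.find(f".//a[@id='{anchor_id}']")

    # following-sibling::table[1]
    siblings = list(parent[anchor])
    later = siblings[siblings.index(anchor) + 1:]
    table = next(el for el in later if el.tag == "table")

    # tbody/tr/td[1]/strong/a/@href
    return [a.get("href") for a in table.findall("./tbody/tr/td[1]/strong/a")]
```

Passing a different anchor id (for example "al" in this sample) selects the table that follows that anchor instead, which is how the expression would be adapted for the other states.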

YQL + TABLE + XPATH

I'm working with YQL. I understand how to make a simple query to a web page and select content with xpath.
For example: select * from html where url="http://www.animeclick.it/manga.php?xtit=Ranmaru+XXX" and xpath="/html/body/div/table/tr/td/table/tr/td/div/div/img[contains(@src,'manga')]".
Now, there are limitations to this approach. I can't log in to the site, I can't repeat different information from the page (I know I can make more queries or add new XPath expressions), and I can't format the output. For example, for a div with this content:
"<p> Hello <a src="#"> Boy!</a></p>"
I need just the text "Hello Boy!".
How can I use a YQL Open Data Table for this purpose?
Please take the time to have a thorough read through the Creating YQL Open Data Tables chapters in the YQL docs.
In particular, an <execute> block (docs) will enable you to do all of the things that you mentioned above.
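As for the output formatting from the question (reducing a fragment to its plain text), the transformation itself is small. The following is a minimal sketch of that step in Python rather than in a YQL <execute> block; the function name flatten_text is an assumption, and the sketch presumes the fragment is well-formed markup.

```python
import xml.etree.ElementTree as ET


def flatten_text(fragment):
    """Return the text of a fragment with tags removed and whitespace collapsed."""
    root = ET.fromstring(fragment)
    # itertext() yields every text node in document order, including
    # text inside nested elements such as the <a> in the example.
    raw = "".join(root.itertext())
    # split() with no arguments drops leading, trailing and repeated whitespace.
    return " ".join(raw.split())
```

For the fragment quoted in the question, flatten_text('<p> Hello <a src="#"> Boy!</a></p>') yields the string "Hello Boy!".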

Using Yahoo APIs, how to get list of locations matching certain prefix that have weather data available

I have an app that (among other things) uses the Yahoo Weather API to display weather conditions for a location selected by the user.
In the configuration dialog where the user can enter the location, I'd love to offer autocompletion, so that while the user is typing a location name, a list of matching cities is suggested.
I can use YQL to fetch locations matching the prefix, i.e.:
select * from geo.places where text = 'Vie*'
but the problem is that not every location has a weather station associated with it and I'd love to skip these in my autocompletion list.
Using the community table weather.woeid, the following query joins the previous query with the weather API, returning only locations that do have weather stations:
select location from weather.woeid where w in (select woeid from geo.places where text = 'Vie*')
This almost solves my problem, except that this query (which produces the same result as a weather API call) returns neither the WOEID nor any other identifier I can use to query the Weather API directly after configuration. How can I capture the value of the join parameter w? I tried something like select w, location ... but that doesn't seem to work.
Is there any other way to get list of locations (incl. WOEID) matching certain prefix that have weather data associated with them?
As far as I know, it is not possible in YQL to pass values through from the sub-select (the inner SELECT statement) to the outer SELECT, which is what you want to do, if I understand you correctly.
Based on your use case I want to propose another solution though:
I assume that the list of locations that have a weather station associated with them is relatively static, meaning this list does not change very often. If that is the case then it would not be very optimal in terms of performance to regenerate that list every time with YQL. Instead I would generate that list offline, store it in a file or MySQL or elsewhere and then just use that static list to answer to the AJAX call of your autocomplete field.
The data in that static list could look something like this:
{
"Vienna" => 72342,
"Hamburg" => 12334,
...
}
Once the user has selected a location and pressed enter, then you can send the YQL query to weather.woeid to look up the current weather based on the WOEID.
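Answering the autocomplete request from such a static list could look like the following minimal sketch in Python. It keeps a sorted, case-folded index and uses binary search to find the first name matching the typed prefix. The WOEID values are placeholders in the spirit of the list above (the "Vientiane" entry is added purely for illustration), and the names LOCATIONS and suggest are assumptions.

```python
from bisect import bisect_left

# Placeholder data: location name -> WOEID, generated offline.
LOCATIONS = {
    "Vienna": 72342,
    "Hamburg": 12334,
    "Vientiane": 99999,  # hypothetical entry for illustration
}

# Sorted (case-folded name, original name) pairs for prefix lookup.
_INDEX = sorted((name.casefold(), name) for name in LOCATIONS)
_KEYS = [folded for folded, _ in _INDEX]


def suggest(prefix, limit=10):
    """Return up to `limit` (name, woeid) pairs whose name starts with prefix."""
    key = prefix.casefold()
    # Every matching name sorts at or after the prefix itself.
    start = bisect_left(_KEYS, key)
    results = []
    for folded, name in _INDEX[start:]:
        if len(results) == limit or not folded.startswith(key):
            break
        results.append((name, LOCATIONS[name]))
    return results
```

Because each suggestion already carries its WOEID, the selected entry can be used directly for the later weather lookup, which is the identifier the sub-select approach could not return.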
