YQL + TABLE + XPATH - xpath

I'm working with YQL. I understand how to make a simple query to a web page and select content with xpath.
For example: select * from html where url="http://www.animeclick.it/manga.php?xtit=Ranmaru+XXX" and xpath="/html/body/div/table/tr/td/table/tr/td/div/div/img[contains(@src,'manga')]".
However, this approach has limitations. I can't log in to the site, I can't easily pull several different pieces of information from the page (I know I can run more queries or add more XPath expressions), and I can't format the output.
For example, if a div contains
"<p> Hello <a src="#"> Boy!</a></p>",
I need just the text "Hello boy".
How can I use a YQL Open Data Table for this?

Please take the time to have a thorough read through the Creating YQL Open Data Tables chapters in the YQL docs.
In particular, an <execute> block (docs) will enable you to do all of the things that you mentioned above.
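To make the output-formatting part of the question concrete, here is a minimal sketch of the text-flattening step on its own. It is written in Python with the standard library purely for illustration; it is not YQL and not the JavaScript that an <execute> block would contain. The fragment is the one from the question, and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The fragment from the question (it happens to be well-formed XML).
fragment = '<p> Hello <a src="#"> Boy!</a></p>'

node = ET.fromstring(fragment)

# itertext() yields every text node under <p>, including the link text;
# split()/join() then collapses the runs of whitespace.
flat_text = " ".join("".join(node.itertext()).split())

print(flat_text)
```

The same idea (concatenate all descendant text nodes, then normalize whitespace) is what an <execute> block would need to do before returning the result.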

Related

Web Crawling using import.io

I am trying to crawl the following website https://goo.gl/THqDhD using the import.io tool. I used the connector tool to parse the whole search result for a specific query (including pagination) and successfully selected all the rows in the search result, but I was unable to select each item's image box as a column.
import.io allows manually overriding the XPath for a selection, so I tried to select the images in the search results using the following XPath:
.//*[@id='container-inner']/div[3]/div[4]/div[*]/div[1]/div/a/
which should represent the column of the table, but I got the following error:
What you have selected is not within a result
The result here is the previously selected rows, but I inspected the item box and made sure that the selection is inside it. Any help, please?

XPath: Forms with Unique IDs

I am trying to use XPath as part of a data scraper in order to scrape random comments from reddit for a project. The problem is, the comment forms have unique IDs that change on every page and within comment indent levels. I'm not sure how to make XPath target all of the comment fields with these different IDs.
An example is shown below:
//form[@id='form-t1_cj8cyupxa3']/div
//form[@id='form-t1_cj8e0iyx6w']/div
If there is some pattern to the id then try e.g. //form[starts-with(@id, 'form-')]/div
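As a rough illustration of what the starts-with() predicate selects, here is a sketch in Python. The standard library's ElementTree supports only a subset of XPath and has no starts-with(), so the prefix test is done with a plain string check instead; a full XPath engine would accept the expression above directly. The markup is a hypothetical, simplified stand-in for the real page, using the two ids from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; only the two form ids come from the question.
PAGE = """
<div>
  <form id="form-t1_cj8cyupxa3"><div>first comment</div></form>
  <form id="form-t1_cj8e0iyx6w"><div>second comment</div></form>
  <form id="login"><div>not a comment</div></form>
</div>
"""

root = ET.fromstring(PAGE)

# Equivalent of //form[starts-with(@id, 'form-')]/div:
# keep forms whose id begins with the prefix, then take their child <div>s.
comment_divs = [
    div
    for form in root.iter("form")
    if form.get("id", "").startswith("form-")
    for div in form.findall("div")
]

comment_texts = [div.text for div in comment_divs]
print(comment_texts)
```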

Extract href in table with importxml in Google spreadsheet

I am trying to pull the href for each row of each table from this website:
http://www.epa.gov/region4/superfund/sites/sites.html#KY
I can pull the table information off using =IMPORTHTML(A1,"table",1) for all 7 tables, but I need the href to the site with the detailed information.
Using =IMPORTXML(A1,"//div[@class='box']") I can pull the information needed from a site like:
http://www.epa.gov/region4/superfund/sites/fedfacs/alarmyaplal.html
but I need to extract the fedfacs/alarmyaplal.html portion for each row on the original page.
I've tried using //@href, but it is not returning any results. I'm thinking it is because the data is structured in a table, but I'm stuck on where to go from here.
I'm not sure about any of the Google Spreadsheet functionality, but here's an XPath to select all href attributes of the Kentucky sites (since your first link included the 'ky' anchor):
//body//a[@id='ky']/following-sibling::table[1]/tbody/tr/td[1]/strong/a/@href
This is very specific to the Kentucky table: following-sibling::table[1] means the first table node after, and at the same level as, a[@id='ky'].
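The following-sibling step can be sketched in Python to show which table it picks. ElementTree in the standard library does not implement the following-sibling axis, so this emulates it by walking the anchor's siblings in document order. The markup is a hypothetical, heavily simplified stand-in for the EPA page, and the Kentucky href is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page; the Kentucky href is made up.
PAGE = """
<body>
  <a id="al"/>
  <table><tbody>
    <tr><td><strong><a href="fedfacs/alarmyaplal.html">Alabama site</a></strong></td></tr>
  </tbody></table>
  <a id="ky"/>
  <table><tbody>
    <tr><td><strong><a href="example-ky-site.html">Kentucky site</a></strong></td></tr>
  </tbody></table>
</body>
"""

body = ET.fromstring(PAGE)
children = list(body)

# Find a[@id='ky'], then the first <table> that follows it at the same level
# (the emulated following-sibling::table[1]).
anchor_index = next(
    i for i, el in enumerate(children)
    if el.tag == "a" and el.get("id") == "ky"
)
ky_table = next(el for el in children[anchor_index + 1:] if el.tag == "table")

# Remainder of the path: tbody/tr/td[1]/strong/a/@href
hrefs = [a.get("href") for a in ky_table.findall("tbody/tr/td[1]/strong/a")]
print(hrefs)
```

Only the table immediately after the 'ky' anchor contributes links; the Alabama table before it is skipped.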

better selenium xpath is expecting

I'm trying to create an XPath expression that will work with Selenium, using the following HTML snippet.
Below is a table containing various rows, each of which gets a uniquely generated, incrementing id (in the following snippet that id is 1000).
Selenium created the following expressions when the row with id 1000 was added to the table. However, instead of using the id, I want to build the XPath from the third data element in the row, which is (MyName) in the HTML snippet.
A possible suggestion is to not use xpath whenever possible.
http://saucelabs.com/blog/index.php/2011/05/why-css-locators-are-the-way-to-go-vs-xpath/
You need to convert the places in the XPATH where it is referring to the row by its ID to its relative position in the table.
In all of your XPATHs, you would change tr[@id='1000'] to tr[3]
Your first example XPATH would look like this:
//tr[3]/td[1]/a[1]/img
Your second example would follow similarly:
//tr[3]/td[1]/span/a/img
As would your third:
//tr[3]/td[1]/a[2]/img
Hopefully you are now able to change the rest of them.
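As a quick check of the id-to-position conversion, the sketch below runs both forms against a small table in Python and also tries a text-based lookup, which is what the question asked for. The table markup is hypothetical, since the original snippet is not shown here; only the id 1000 and the cell text MyName come from the question. ElementTree's XPath subset can test a child cell's text with tr[td='MyName'], which matches any cell in the row; in full XPath 1.0, //tr[td[3]='MyName'] would pin the test to the third cell.

```python
import xml.etree.ElementTree as ET

# Hypothetical table; only id 1000 and the text "MyName" come from the question.
TABLE = """
<table>
  <tr id="998"><td><a><img src="edit.png"/></a></td><td>x</td><td>Other</td></tr>
  <tr id="999"><td><a><img src="edit.png"/></a></td><td>y</td><td>Another</td></tr>
  <tr id="1000"><td><a><img src="edit.png"/></a></td><td>z</td><td>MyName</td></tr>
</table>
"""

root = ET.fromstring(TABLE)

# Three ways of addressing the same <img>: by generated id, by row position,
# and by the text of a cell in the row.
by_id = root.find(".//tr[@id='1000']/td[1]/a[1]/img")
by_position = root.find(".//tr[3]/td[1]/a[1]/img")
by_text = root.find(".//tr[td='MyName']/td[1]/a[1]/img")

print(by_id is by_position, by_id is by_text)
```

The positional form breaks if rows are inserted above the target, while the text-based form keeps working as long as the cell text stays unique.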

Selecting table data from a webpage

I'm trying to get the results from the Empire magazine website (the Film Reviews (Popular Matches) table) using YQL - http://www.empireonline.com/search/default.asp?search=Dragonheart (as an example). I'm using Firebug to get the XPath, but it doesn't return any results. This is what I'm using:
select * from html where url='http://www.empireonline.com/search/default.asp?search=cars' and xpath='/html/body/table[3]/tbody/tr[5]/td[2]/table[2]/tbody/tr/td/table[2]/tbody/tr/td/table[2]'
It does seem to work with:
select * from html where url='http://www.empireonline.com/search/default.asp?search=cars' and xpath='//table'
But that's a whole lot of data I don't need to chuck about.
You just need to be mindful when crafting the appropriate XPath query. The following gets the link and name of each of the reviews listed in that HTML table by first targeting the "Film Reviews (Popular Matches)" paragraph, then navigating to the list of films.
SELECT href, strong
FROM html
WHERE url = 'http://www.empireonline.com/search/default.asp?search=Thor'
AND xpath = '
//p[.="Film Reviews (Popular Matches)"]
/ancestor::table[1]
/following-sibling::table[1]
//td[2]/a
'
(Try this query in the YQL console.)

Resources