XPath query returns no results in Yahoo Pipes - xpath

Trying to get all titles from http://www.112.ru/services/wanted/people/index.shtml?roztype=1
using Yahoo Pipes Xfetch module.
My query //span[@class='uchbold'] selects all titles successfully in FirePath, but in Yahoo Pipes and Hpple there are no results.

These class attributes are inserted by JavaScript, which is not executed by Yahoo Pipes or Hpple.
The contents are also loaded via Ajax, so you will have to trace the Ajax calls and develop against that interface.
Using Firebug I could trace it loading
http://www.112.ru/publish/00/01/0508.01/2012/08//contents.xml
and lots of other "contents.xml" files, which returned 404 errors. The file that did load contains entries in the form of elements like
<view file="0901156380089d71_0508.01_00_01.full.shtml" format="full" indexed="true"/>
which seem to link again to some HTML snippets containing the actual data.
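As a rough illustration of developing against that interface, the sketch below pulls the file names out of such a contents.xml document with Python's standard library. The sample string and its root element name are assumptions made for the example; only the view element is taken from the trace above, and the real file may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the fragment quoted above; the real
# file's root element and any other attributes are not known here.
SAMPLE = """<contents>
  <view file="0901156380089d71_0508.01_00_01.full.shtml" format="full" indexed="true"/>
</contents>"""


def snippet_files(xml_text, fmt="full"):
    """Return the file attribute of every <view> element with the given format."""
    root = ET.fromstring(xml_text)
    return [v.get("file") for v in root.iter("view") if v.get("format") == fmt]


print(snippet_files(SAMPLE))
```

Each returned name would then have to be resolved against the directory of the contents.xml file to fetch the HTML snippet itself.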

Related

How can I get my xpath provided by chrome to pull proper text versus an empty string?

I am trying to scrape property data from "http://web6.seattle.gov/DPD/ParcelData/parcel.aspx?pin=9906000005".
I identified the element that I am interested in (the "Base Zone" data in the table) and copied its XPath from the Chrome developer tools. When I run it through Scrapy I get an empty list.
I used the Scrapy shell to load the site and tried several queries against the response. The page loads and I can scrape the header, but nothing in the body of the page comes back; every query returns an empty list.
My scrapy script is as follows:
import scrapy

class ZoneSpider(scrapy.Spider):
    name = 'zone'
    allowed_domains = ['web6.seattle.gov']
    start_urls = ['http://web6.seattle.gov/DPD/ParcelData/parcel.aspx?pin=9906000005']

    def parse(self, response):
        # XPaths copied from Chrome DevTools, including the tbody steps
        self.log("base_zone: %s" % response.xpath('//*[@id="ctl00_cph_p_i1_i0_vwZoning"]/tbody/tr/td/table/tbody/tr[1]/td[2]/span/text()').extract())
        self.log("use: %s" % response.xpath('//*[@id="ctl00_cph_p_i3_i0_vwKC"]/tbody/tr/td/table/tbody/tr[3]/td[2]/text()').extract())
You will see that the logs return an empty list. In the Scrapy shell, when I query the XPath for the header, I get a valid response:
response.xpath('//*[@id="ctl00_headSection"]/title/text()').extract()
['\r\n\tSeattle Parcel Data\r\n']
But when I query anything in the body I get an empty list:
response.xpath('/body').extract()
[]
What I would like to see in my scrapy code is a response like the following:
base_zone: "SF 5000"
use: "Duplex"
If you remove tbody from your XPath, it will work.
Since Developer Tools operate on a live browser DOM, what you’ll
actually see when inspecting the page source is not the original HTML,
but a modified one after applying some browser clean up and executing
JavaScript code. Firefox, in particular, is known for adding <tbody>
elements to tables. Scrapy, on the other hand, does not modify the
original page HTML, so you won’t be able to extract any data if you
use <tbody> in your XPath expressions.
Source: https://docs.scrapy.org/en/latest/topics/developer-tools.html#caveats-with-inspecting-the-live-browser-dom
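A minimal sketch of that fix, assuming the raw HTML really has no tbody elements: a small helper (strip_tbody is a hypothetical name, not part of Scrapy) that drops the tbody steps from an XPath copied out of the developer tools.

```python
import re


def strip_tbody(xpath):
    """Remove every tbody location step (with an optional [n] predicate)."""
    return re.sub(r"/tbody(\[\d+\])?(?=/|$)", "", xpath)


# XPath as copied from the browser, taken from the question above.
devtools_xpath = ('//*[@id="ctl00_cph_p_i1_i0_vwZoning"]'
                  '/tbody/tr/td/table/tbody/tr[1]/td[2]/span/text()')
print(strip_tbody(devtools_xpath))
```

The cleaned expression can then be passed to response.xpath() as before. If a page does ship tbody in its original HTML, the steps should be left in place.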

Xpath to url for import.io

I'm getting list of offered jobs on this site: http://telekom.jobs/global-careers
I'm trying to get XPath of link to get more info about job.
Here is the whole XPath to the first link:
/html/body/div[3]/div/div[2]/div[3]/table/tbody/tr[2]/td/div/a/@href
and this is what I should paste to import.io:
tr[2]/td/div/a/@href
But it doesn't work, I don't know why.
The links to the "more info" pages for each job offer have these XPaths:
tr[2]/td/div/a/@href
tr[4]/td/div/a/@href
tr[6]/td/div/a/@href
tr[8]/td/div/a/@href
and so on.
Maybe that's why it doesn't work? Because the numbers aren't 1, 2, 3, etc. but 2, 4, 6? Or am I doing something wrong?
If you create an API from URL 2.0 and reload the website with JS on but CSS off, you should be able to see the collapsible menu:
The DOM on this website is constructed so that all the odd rows hold job titles, whereas the additional information about each job is hidden in the even rows. To select those rows we can use XPath's position() function, so you can use the following XPath for manual row training:
/html/body/div[3]/div/div[2]/div[3]/table/tbody/tr[position() mod 2 = 0]
This highlights only the more-information boxes, giving you access to the data inside. From here you can simply target the specific elements that hold the title and the link.
Link xpath: .//a[@class='forward jobadview']/@href
Title xpath: .//div[@class='info']//h3
That said, because the website relies heavily on JS, your API may fail to publish, so we have created an API for you to query; you can retrieve the same data using it here.
https://import.io/data/mine/?id=0626d49d-5233-469d-9429-707f73f1757a
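To make the even-row selection concrete, here is a small sketch using only Python's standard library. ElementTree does not evaluate position() mod 2 = 0, so the example swaps in list slicing, which picks the same rows. The sample table and its href values are invented for illustration; only the class name comes from the answer above.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: odd rows carry titles, even rows carry the details link.
SAMPLE = """<table>
  <tr><td>Job A</td></tr>
  <tr><td><div><a class="forward jobadview" href="/job-a">more</a></div></td></tr>
  <tr><td>Job B</td></tr>
  <tr><td><div><a class="forward jobadview" href="/job-b">more</a></div></td></tr>
</table>"""


def detail_links(table_xml):
    """Collect the href of the details link in every even-positioned row."""
    rows = ET.fromstring(table_xml).findall("tr")
    # XPath positions are 1-based, so positions 2, 4, ... are indexes 1, 3, ...
    even_rows = rows[1::2]
    links = []
    for row in even_rows:
        anchor = row.find(".//a[@class='forward jobadview']")
        if anchor is not None:
            links.append(anchor.get("href"))
    return links


print(detail_links(SAMPLE))
```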

Yahoo Pipes to loop through all pages

I am looking to pull job postings from a site that has multiple pages of postings. I can pull the content from one page.
With a simple example I can get it to iterate and grab page content (this is a simple example site base).
However, when I take the first example and try to clean the data, I can't use the XPath filter to grab the HTML id, and I can't seem to find a way to limit the scope elsewhere. Here is what I am trying (regex, rename...):
http://pipes.yahoo.com/pipes/pipe.edit?_id=3619ea93d66e47442659a1976746ba6c
Any thoughts?

AJAX, JQuery, Parse - which one will get me my array?

In my Codecademy and Code School lessons, I've been fetching data from Google RSS and simulated Twitter feeds.
My newest exercise, however, involves fetching an array of text data from a REST API.
When I try
$.get('https://api.parse.com/1/classes/chats?order=-createdAt', function(x){$('.messages').append('<li>'+x.responseText+'</li>');});
I get
which has the text and username I need. Sort of...
But when I try to alert or console.log either x.responseText or x.responseText.results, I get undefined instead of an array.
What am I missing?
Should I study more AJAX until I find a technique?
Or do I have to send special instructions to the Parse server using some commands found here?
You are not using XMLHttpRequest directly; you are using jQuery, which reads the responseText and handles it for you, so the callback argument is already the parsed data.
Just use x (or, rather, x.results).

Scraping pages with asynchronous responses with Hpricot

I'm trying to scrape a page but the initial response has nothing in the body as the content is pumped in asynchronously, e.g. the results from a search on the apple website: http://www.apple.com/uk/search/?q=searching+for+something&sec=global
Any ideas on how I can successfully grab the results from the search with hpricot?
Thanks.
When the search page you refer to is loaded, it makes a request via javascript/ajax to some other location, then populates the search results. This is what you're seeing in the page. Hpricot itself can't help you here because it has no way to interpret the javascript that comes with the page in order to fetch the actual search results list.
Now, if what you're interested in are the search results, you'd need to analyze a bit what happens when you enter that page and type a search query. Some javascript in the page takes your query, and calls (via XMLHttpRequest or similar, AJAX techniques) some other script in Apple's server. This is the one that actually does the search in a database and returns the result.
I suggest you install Firefox with the Firebug plugin, or find some other way of seeing the actual requests a page and its javascript components send and receive. You'll see that, for the search page you referred to, it fetches two parts. First, the "featured" results come from this URL:
http://www.apple.com/global/scripts/search_featured.php?q=mac+mini&section=global&geo=uk
Notice the search string is in the "q" parameter.
Second, a long results list comes from here:
http://www.apple.com/search/service/nph-search10?site=uk_www&filter=1&snum=50&q=mac+mini
Both of these are XML documents; you might have better luck parsing these URLs with Hpricot.
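For illustration, the second URL above can be rebuilt for an arbitrary query as sketched below (in Python rather than Ruby). The parameter names and values are copied from that URL; whether the endpoint accepts other values, or still exists, is an assumption, and build_search_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Endpoint and fixed parameters copied from the results URL quoted above.
SEARCH_ENDPOINT = "http://www.apple.com/search/service/nph-search10"


def build_search_url(query):
    """Return the results URL with the search string placed in the q parameter."""
    params = {"site": "uk_www", "filter": 1, "snum": 50, "q": query}
    return SEARCH_ENDPOINT + "?" + urlencode(params)


print(build_search_url("mac mini"))
```

The resulting URL can be fetched and handed to any XML parser, Hpricot included.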
