How can I get my xpath provided by chrome to pull proper text versus an empty string? - xpath

I am trying to scrape property data from "http://web6.seattle.gov/DPD/ParcelData/parcel.aspx?pin=9906000005".
I identified the element I am interested in (the "Base Zone" data in the table) and copied its XPath from the Chrome developer tools. When I run it through Scrapy I get an empty list.
I used the scrapy shell to load the site and tried several queries against the response. The page loads and I can scrape the header, but nothing in the body of the page comes through; every query returns an empty list.
My scrapy script is as follows:
import scrapy

class ZoneSpider(scrapy.Spider):
    name = 'zone'
    allowed_domains = ['web']
    start_urls = ['http://web6.seattle.gov/DPD/ParcelData/parcel.aspx?pin=9906000005']

    def parse(self, response):
        # XPaths copied from Chrome; note the tbody steps
        self.log("base_zone: %s" % response.xpath('//*[@id="ctl00_cph_p_i1_i0_vwZoning"]/tbody/tr/td/table/tbody/tr[1]/td[2]/span/text()').extract())
        self.log("use: %s" % response.xpath('//*[@id="ctl00_cph_p_i3_i0_vwKC"]/tbody/tr/td/table/tbody/tr[3]/td[2]/text()').extract())
You will see that the logs return empty lists. In the scrapy shell, when I query the XPath for the header I get a valid response:
response.xpath('//*[@id="ctl00_headSection"]/title/text()').extract()
['\r\n\tSeattle Parcel Data\r\n']
But when I query anything in the body I get an empty list:
response.xpath('/body').extract()
[]
What I would like to see in my scrapy code is a response like the following:
base_zone: "SF 5000"
use: "Duplex"

If you remove tbody from your XPath, it will work.
Since Developer Tools operate on a live browser DOM, what you’ll
actually see when inspecting the page source is not the original HTML,
but a modified one after applying some browser clean up and executing
JavaScript code. Firefox, in particular, is known for adding <tbody>
elements to tables. Scrapy, on the other hand, does not modify the
original page HTML, so you won’t be able to extract any data if you
use <tbody> in your XPath expressions.
Source: https://docs.scrapy.org/en/latest/topics/developer-tools.html#caveats-with-inspecting-the-live-browser-dom
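As a rough illustration of the advice above, a small helper can strip the browser-inserted tbody steps from an XPath copied out of Chrome before it is passed to response.xpath(). The name strip_tbody is made up for this sketch and is not part of Scrapy. It assumes the raw HTML really contains no <tbody> tags; if the server's HTML does include them, the copied XPath should be left alone.

```python
import re

# Hypothetical helper (not a Scrapy API): matches "/tbody" or "/tbody[n]"
# only when it is a complete path step.
_TBODY_STEP = re.compile(r"/tbody(?:\[\d+\])?(?=/|$)")


def strip_tbody(xpath):
    """Return the XPath with every tbody step removed."""
    return _TBODY_STEP.sub("", xpath)


# XPath as copied from the browser's developer tools.
copied = ('//*[@id="ctl00_cph_p_i1_i0_vwZoning"]/tbody/tr/td/table'
          '/tbody/tr[1]/td[2]/span/text()')
cleaned = strip_tbody(copied)
print(cleaned)
```

The cleaned string can then be used in the spider in place of the copied one, for example response.xpath(cleaned).extract().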

Related

Scrapy: How to extract data from a page that loads it via ajax?

I am trying to extract data from a search result that is partially built via AJAX:
https://www.vitalsana.com/catalogsearch/result/?q=ibuprofen
The wanted data, PZN: 16336937, is somehow injected after page load.
This XPath returns an empty result:
//*[@id="maincontent"]/div[3]/div[1]/div[2]/div[4]/ol/li[1]/form/div/div[2]/p[2]/span[2]/span
The same goes for the "verfügbar" (available) data. I guess it is loaded after page load via this API:
https://www.vitalsana.com/catalogsearch/searchTermsLog/save/?q=ibuprofen
I noticed that some of the info is within inline JS, but it is difficult to get just this JS. I tried last(), but it seems to be ignored: the query returns all the JS, including the desired info:
response.xpath('//script[last()]/text()').extract()
I am using scrapy 2.1.0. Is there a way to retrieve this data?
PZN: 16336937 is not present in the search results (Vitamin D3 != Ibuprofen).
To get the PZN number of a product (8 digits), you can extract it from the img element of each product. For example, for the first search result ([1]) :
response.xpath('substring(substring-before((//img[@class="product-image-photo img-fluid"])[1]/@src,"_"),string-length(substring-before((//img[@class="product-image-photo img-fluid"])[1]/@src,"_"))-7,8)').extract()
Output : 07728561
You could also extract the value directly from the script element, but you'll have to figure out how to escape single quotes in scrapy. The XPath :
substring-after(substring-before(//script[contains(.,"Suche")],'",'),'"id": "')
Output : 07728561
Note : using regex instead of substring functions might be cleaner.
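A minimal sketch of the regex idea, assuming (as the XPath above does) that the PZN is the 8 digits immediately before the first underscore in the image URL. The sample URL and the helper name pzn_from_src are invented for illustration; only the PZN value comes from the output above.

```python
import re

# Lazily skip non-underscore characters, then capture the 8 digits that sit
# directly before the first underscore.
PZN_PATTERN = re.compile(r"[^_]*?(\d{8})_")


def pzn_from_src(src):
    """Return the 8-digit PZN found in an image URL, or None if absent."""
    match = PZN_PATTERN.match(src)
    return match.group(1) if match else None


# Hypothetical src value; the real URL layout may differ.
sample_src = "https://example.com/media/catalog/product/0/7/07728561_1.jpg"
print(pzn_from_src(sample_src))
```

In a spider, the src values would come from the img elements, for example via response.xpath('//img[@class="product-image-photo img-fluid"]/@src').getall().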
What you could try also is to "rebuild" the json from the script element, load the json, then query on it. Something like this should work :
import json
# .get() returns a single string; .extract() would return a list, which json.loads() rejects
products = response.xpath('substring(substring-after(//script[contains(.,"Suche")],"] ="),1,string-length(substring-after(//script[contains(.,"Suche")],"] ="))-1)').get()
result = json.loads(products)
for i in result:
    print(i['id'])
Last option: request the data directly from the API (with a well-formed payload, a valid token and the appropriate method).

Web Scraping Return Empty Value Using Xpath in Scrapy

Really need the help from this community.
My question: when I use the code
response.xpath("//div[contains(@class,'check-prices-widget-not-sponsored')]/a/div[contains(@class,'check-prices-widget-not-sponsored-link')]").extract()
to extract the vendor name in the scrapy shell, the output is empty. I do not know why that happens; could the problem be that the website's content is updated dynamically?
The URL for this web scraping is: https://cruiseline.com/cruise/7-night-bahamas-florida-new-york-roundtrip-32860, and what I need is the vendor name and price for each vendor. The attached picture is a screenshot of the browser's Inspect panel.
Really appreciate the help!
You always need to check the HTML source code in your browser (usually with Ctrl+U).
This way you'll find that the information you want is embedded inside JavaScript variables as JSON:
var partnerPrices = [{"pool":"9a316391b6550eef969c8559c14a380f","partner":"ncl.com","priority":0,"currency":"USD","data":{"32860":{"2018-02-25":{"Inside":579,"Suite":1199,"Balcony":699,"Oceanview":629},....
var sponsored_partners = [{"code":"CDCNA","name":"cruises.com","value":"cruises.com","logo":"\/images\/partner-logo-cruises-sm.png","logo_sprite":"partner-logo-cruises-com"},...
So you need to import json, pull the JSON strings out of response.body (using re or another method), and then pass them to json.loads() so you can iterate through the two arrays.
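One possible way to do this is sketched below. It assumes the variables are declared as "var name = [...]" exactly as in the excerpt above; the helper name extract_js_array is made up, and the sample string is a shortened version of that excerpt rather than the real page. In a spider, response.text would take the place of sample_html.

```python
import json
import re


def extract_js_array(html, var_name):
    """Find `var <var_name> = ...` in the page text and decode the JSON value."""
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and ignores
    # whatever follows it (the trailing ";" and the rest of the script).
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


# Shortened sample based on the excerpt above (not the full page source).
sample_html = (
    'var partnerPrices = [{"partner":"ncl.com","currency":"USD",'
    '"data":{"32860":{"2018-02-25":{"Inside":579,"Suite":1199}}}}];\n'
    'var sponsored_partners = [{"code":"CDCNA","name":"cruises.com"}];'
)

prices = extract_js_array(sample_html, "partnerPrices")
partners = extract_js_array(sample_html, "sponsored_partners")
for entry in prices:
    print(entry["partner"], entry["currency"])
```

Using raw_decode avoids having to write a regex that finds the end of the array, which is fragile when the JSON contains nested brackets.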

Assign a variable to xpath scrapy

I'm using Scrapy to crawl a web page. The page has 10+ links to crawl using LinkExtractor. Everything works fine, but when crawling the extracted links I need to get the page URL. I have no other way to get the URL but to use
response.request.url
How do I assign that value to
il.add_xpath('url', response.request.url)
If I do it like this I get an error:
File "C:\Python27\lib\site-packages\scrapy\selector\unified.py", line 100, in xpath
    raise ValueError(msg if six.PY3 else msg.encode("unicode_escape"))
exceptions.ValueError: Invalid XPath: http://www.someurl.com/news/45539/title-of-the-news
And for the description it is like this (just for reference):
il.add_xpath('descrip', './/div[@class="main_text"]/p/text()')
Thanks
The loader comes with two ways of adding values to the item: add_xpath, which evaluates an XPath expression, and add_value, which takes a literal value. Since the URL is a plain value, you should use something like:
...
il.add_value('url', response.url) # yes, response also has the url attribute

Scrapy xpath returns an empty list although tag and syntax are correct

In my parse function, here is the code I have written:
hs = Selector(response)
links = hs.xpath(".//*[@id='requisitionListInterface.listRequisition']")
items = []
for x in links:
    item = CrawlsiteItem()
    item["title"] = x.xpath('.//*[contains(@title, "View this job description")]/text()').extract()
    items.append(item)
return items
and title returns an empty list.
I select the element by its id into links, and then within links I want to get a list of all the values whose title contains "View this job description".
Please help me fix the error in the code.
If you request the URL you provided with curl "https://cognizant.taleo.net/careersection/indapac_itbpo_ext_career/moresearch.ftl?lang=en" you get back a page quite different from the one you see in your browser. Your search matches the following <a> element, which has no text node for text() to select:
<a id="requisitionListInterface.reqTitleLinkAction"
title="View this job description"
href="#"
onclick="javascript:setEvent(event);requisition_openRequisitionDescription('requisitionListInterface','actOpenRequisitionDescription',_ftl_api.lstVal('requisitionListInterface', 'requisitionListInterface.listRequisition', 'requisitionListInterface.ID5645', this),_ftl_api.intVal('requisitionListInterface', 'requisitionListInterface.ID5649', this));return ftlUtil_followLink(this);">
</a>
This is because the site loads the displayed information with an XHR request (you can see this in Chrome's developer tools, for example) and then updates the page dynamically with the returned information.
For the information you want to extract you should find this XHR request (it is not hard because this is the only one) and call it from your scraper. Then from the resulting dataset you can extract the required data -- you just have to create a parsing algorithm which goes through this pipe separated format and splits it up into job postings and then extracts the information you need like position, id, date and location.
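The parsing step could look roughly like the sketch below. The actual layout of the XHR response is not shown here, so the field names, the number of fields per posting and the sample payload are all hypothetical; they only illustrate splitting a flat pipe-separated string into fixed-size records.

```python
def split_records(payload, fields):
    """Split a pipe-separated string into dicts of len(fields) values each.

    A trailing group with fewer values than `fields` is dropped.
    """
    values = payload.split("|")
    size = len(fields)
    return [
        dict(zip(fields, values[start:start + size]))
        for start in range(0, len(values) - size + 1, size)
    ]


# Hypothetical field order and payload, for illustration only.
FIELDS = ["position", "id", "date", "location"]
sample_payload = "Developer|101|2015-01-10|Pune|Tester|102|2015-01-12|Chennai"

postings = split_records(sample_payload, FIELDS)
for posting in postings:
    print(posting["id"], posting["position"], posting["location"])
```

With the real response, the field order would first have to be worked out by comparing the raw payload with what the browser displays.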

XPath query returns no results in Yahoo Pipes

Trying to get all titles from http://www.112.ru/services/wanted/people/index.shtml?roztype=1
using Yahoo Pipes Xfetch module.
My query //span[@class='uchbold'] selects all titles in FirePath successfully, but in Yahoo Pipes and Hpple there are no results.
These class attributes are inserted by JavaScript, which Yahoo Pipes and Hpple do not execute.
Also, the contents are loaded via AJAX, so you will have to trace the AJAX calls and develop against that interface.
Using Firebug I could trace it loading
http://www.112.ru/publish/00/01/0508.01/2012/08//contents.xml
and lots of other "contents.xml" files which returned 404 errors. The file that does load contains elements like
<view file="0901156380089d71_0508.01_00_01.full.shtml" format="full" indexed="true"/>
which seem to link again to some HTML snippets containing the actual data.
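A small sketch of reading those entries with Python's standard library, under the assumption that the view elements sit inside a single root element. The root name contents and the helper name full_view_files are guesses made for this example; only the view element itself is taken from the file shown above.

```python
import xml.etree.ElementTree as ET


def full_view_files(xml_text):
    """Return the file attribute of every <view format="full"> element."""
    root = ET.fromstring(xml_text)
    return [
        view.get("file")
        for view in root.iter("view")
        if view.get("format") == "full"
    ]


# The <contents> wrapper is an assumption; the <view> line is from the thread.
sample_xml = (
    "<contents>"
    '<view file="0901156380089d71_0508.01_00_01.full.shtml" '
    'format="full" indexed="true"/>'
    "</contents>"
)

files = full_view_files(sample_xml)
print(files)
```

Each returned file name would then have to be resolved against the directory of the contents.xml URL to fetch the HTML snippet that holds the actual data.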
