Scrapy xpath returns an empty list although tag and syntax are correct

In my parse function, here is the code I have written:
hs = Selector(response)
links = hs.xpath(".//*[@id='requisitionListInterface.listRequisition']")
items = []
for x in links:
    item = CrawlsiteItem()
    item["title"] = x.xpath('.//*[contains(@title, "View this job description")]/text()').extract()
    items.append(item)
return items
and title returns an empty list.
I am selecting the element with that id into links, and then within it I want to get the list of all elements whose title attribute contains "View this job description".
Please help me fix the error in the code.

If you cURL the URL you provided with curl "https://cognizant.taleo.net/careersection/indapac_itbpo_ext_career/moresearch.ftl?lang=en" you get back a page quite different from the one you see in your browser. Your search matches the following <a> element, which contains no text() node to select:
<a id="requisitionListInterface.reqTitleLinkAction"
title="View this job description"
href="#"
onclick="javascript:setEvent(event);requisition_openRequisitionDescription('requisitionListInterface','actOpenRequisitionDescription',_ftl_api.lstVal('requisitionListInterface', 'requisitionListInterface.listRequisition', 'requisitionListInterface.ID5645', this),_ftl_api.intVal('requisitionListInterface', 'requisitionListInterface.ID5649', this));return ftlUtil_followLink(this);">
</a>
This is because the site loads the displayed information with an XHR request (you can see this in Chrome's developer tools, for example) and then updates the page dynamically with the returned data.
For the information you want to extract, you should find this XHR request (which is not hard, because it is the only one) and call it from your scraper. Then you can extract the required data from the resulting dataset -- you just have to write a parsing algorithm that walks through this pipe-separated format, splits it up into job postings, and extracts the information you need, like position, id, date and location.
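A minimal sketch of that parsing step, assuming a hypothetical line-based layout of title|id|date|location -- the real Taleo payload must be inspected in the browser's Network tab first, so the field names and order here are assumptions:

```python
# Hypothetical parser: the field layout (title|id|date|location) is an
# assumption for illustration, not the actual Taleo response format.
def parse_requisitions(payload):
    """Split a pipe-separated payload into a list of job-posting dicts."""
    postings = []
    for line in payload.strip().splitlines():
        fields = line.split("|")
        if len(fields) < 4:
            continue  # skip malformed rows
        title, job_id, date, location = fields[:4]
        postings.append({"title": title, "id": job_id,
                         "date": date, "location": location})
    return postings
```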

Scrapy: How to extract data from a page that loads it via ajax?

I am trying to extract data from a search result that is partially built via ajax:
https://www.vitalsana.com/catalogsearch/result/?q=ibuprofen
The wanted data, PZN: 16336937, is somehow injected after page onload, so the XPath returns an empty result:
//*[@id="maincontent"]/div[3]/div[1]/div[2]/div[4]/ol/li[1]/form/div/div[2]/p[2]/span[2]/span
The same goes for the data "verfügbar" (German for "available"). It is loaded after pageload, via this API I guess:
https://www.vitalsana.com/catalogsearch/searchTermsLog/save/?q=ibuprofen
I noticed that some info is within inline JS, but it is difficult to get just this piece of JS. I tried last(), but this seems to be ignored; it gets all the JS, including the desired info:
response.xpath('//script[last()]/text()').extract()
I am using scrapy 2.1.0. Is there a way to retrieve this data?
PZN: 16336937 is not present in the search results (Vitamin D3 != Ibuprofen).
To get the PZN number of a product (8 digits), you can extract it from the img element of each product. For example, for the first search result ([1]):
response.xpath('substring(substring-before((//img[@class="product-image-photo img-fluid"])[1]/@src,"_"),string-length(substring-before((//img[@class="product-image-photo img-fluid"])[1]/@src,"_"))-7,8)').extract()
Output: 07728561
You could also extract the value directly from the script element, but you'll have to figure out how to escape single quotes in scrapy. The XPath:
substring-after(substring-before(//script[contains(.,"Suche")],'",'),'"id": "')
Output: 07728561
Note: using regex instead of substring functions might be cleaner.
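For example, a regex version of the img-based extraction could look like this (the src value below is a made-up sample following the pattern described above, not fetched from the live site):

```python
import re

# Made-up sample src: the 8-digit PZN sits before the first underscore
# in the image filename.
src = "https://www.vitalsana.com/media/catalog/product/07728561_1.jpg"

match = re.search(r"(\d{8})_", src)  # grab the 8 digits before the "_"
pzn = match.group(1) if match else None
```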
What you could also try is to "rebuild" the JSON from the script element, load the JSON, then query it. Something like this should work:
import json

products = response.xpath('substring(substring-after(//script[contains(.,"Suche")],"] ="),1,string-length(substring-after(//script[contains(.,"Suche")],"] ="))-1)').get()
result = json.loads(products)
for i in result:
    print(i["id"])
Last option: request the data directly from the API (with a well-formed payload, a valid token and the appropriate method).

How can I get my xpath provided by chrome to pull proper text versus an empty string?

I am trying to scrape property data from "http://web6.seattle.gov/DPD/ParcelData/parcel.aspx?pin=9906000005".
I identified the element that I am interested in (the "Base Zone" data in the table) and copied the XPath from the Chrome developer tools. When I run it through scrapy I get an empty list.
I used the scrapy shell to load the site and tried several response queries. The page loads and I can scrape the header, but nothing in the body of the page loads; it all comes up as empty lists.
My scrapy script is as follows:
class ZoneSpider(scrapy.Spider):
    name = 'zone'
    allowed_domains = ['web']
    start_urls = ['http://web6.seattle.gov/DPD/ParcelData/parcel.aspx?pin=9906000005']

    def parse(self, response):
        self.log("base_zone: %s" % response.xpath('//*[@id="ctl00_cph_p_i1_i0_vwZoning"]/tbody/tr/td/table/tbody/tr[1]/td[2]/span/text()').extract())
        self.log("use: %s" % response.xpath('//*[@id="ctl00_cph_p_i3_i0_vwKC"]/tbody/tr/td/table/tbody/tr[3]/td[2]/text()').extract())
You will see that the logs return an empty list. In the scrapy shell, when I query the XPath for the header, I get a valid response:
response.xpath('//*[@id="ctl00_headSection"]/title/text()').extract()
['\r\n\tSeattle Parcel Data\r\n']
But when I query anything in the body I get an empty list:
response.xpath('/body').extract()
[]
What I would like to see in my scrapy code is a response like the following:
base_zone: "SF 5000"
use: "Duplex"
If you remove tbody from your XPath it will work:
Since Developer Tools operate on a live browser DOM, what you'll
actually see when inspecting the page source is not the original HTML,
but a modified one after applying some browser clean up and executing
Javascript code. Firefox, in particular, is known for adding <tbody>
elements to tables. Scrapy, on the other hand, does not modify the
original page HTML, so you won't be able to extract any data if you
use <tbody> in your XPath expressions.
Source: https://docs.scrapy.org/en/latest/topics/developer-tools.html#caveats-with-inspecting-the-live-browser-dom
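A small self-contained demonstration of the caveat, using only the Python standard library on a made-up HTML fragment (the real page's ids and layout are omitted):

```python
import xml.etree.ElementTree as ET

# The raw HTML the server sends usually has no <tbody>; the browser adds it.
raw_html = "<table><tr><td><span>SF 5000</span></td></tr></table>"
root = ET.fromstring(raw_html)

# The browser-copied path (with tbody) matches nothing in the raw markup...
assert root.findall("./tbody/tr") == []

# ...while the same path without tbody finds the data.
spans = root.findall("./tr/td/span")  # one element with text "SF 5000"
```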

GSA get latest results in a collection without q param

I'm trying to show the latest results inserted into a collection (ordered by date) on the homepage. I don't have a 'q' parameter because the user hasn't made a search yet on the homepage. So, is there a way to do this? Maybe a special character? I didn't find anything in the documentation.
You could use the site: query to get all content from your site, like
q=site%3Ahttp%3A%2F%2Fwww.yoururl.com&sort=date%3AD%3AS%3Ad1
(site:http://www.yoururl.com, URL-encoded)
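The encoded query above can be reproduced with the standard library, which is handy if you build these URLs programmatically (www.yoururl.com is the placeholder from the example, not a real site):

```python
from urllib.parse import quote

# Percent-encode the site: restrict and the sort expression exactly as above.
q = "site:http://www.yoururl.com"
sort = "date:D:S:d1"
query_string = "q=" + quote(q, safe="") + "&sort=" + quote(sort, safe="")
# query_string == "q=site%3Ahttp%3A%2F%2Fwww.yoururl.com&sort=date%3AD%3AS%3Ad1"
```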
Finally I found this way: I used the parameter requiredfields and link to it all the results that I want to show. For example:
www.gsa.it/search?q=&sort=date:D:S:d1&requiredfields=client
This will return any results that have a meta tag with this name:
<meta name="client" content="lorem ipsum">
Reference: Restricts the search results to documents that contain the exact meta tag names or name-value pairs.

Xpath to url for import.io

I'm getting the list of offered jobs on this site: http://telekom.jobs/global-careers
I'm trying to get the XPath of the link to get more info about each job.
Here is the whole XPath to the first link:
/html/body/div[3]/div/div[2]/div[3]/table/tbody/tr[2]/td/div/a/@href
and this is what I should paste into import.io:
tr[2]/td/div/a/@href
But it doesn't work, and I don't know why.
The links to the pages with more info about a job offer have these XPaths:
tr[2]/td/div/a/@href
tr[4]/td/div/a/@href
tr[6]/td/div/a/@href
tr[8]/td/div/a/@href
and so on.
Maybe that's why it doesn't work? Because the numbers aren't 1, 2, 3, etc. but 2, 4, 6? Or am I doing something wrong?
If you create an API from URL 2.0 and reload the website with JS on but CSS off, you should be able to see the collapsible menu:
The DOM of this website is constructed in such a way that all the odd rows have job titles, whereas more information about the job is hidden in the even rows. For that we can use the position() function of XPath, so you can use the following XPath in manual row training:
/html/body/div[3]/div/div[2]/div[3]/table/tbody/tr[position() mod 2 = 0]
This highlights only the "more information" boxes, giving you access to the data inside. From there you can simply target the specific attributes of the elements that hold the title and the link.
Link xpath: .//a[@class='forward jobadview']/@href
Title xpath: .//div[@class='info']//h3
Having said that, due to the heavy use of JS on the website, it may fail to publish, so we have created an API for you to query, and you can retrieve the same data using it here:
https://import.io/data/mine/?id=0626d49d-5233-469d-9429-707f73f1757a

Xpath table changes as combobox changes too

I'm working on an application in C# that goes to a website and gets some content out of a table. It's working fine, but here is the problem: the table that I'm getting the content of changes as I select a different value in a combobox. The Xpath that I use always gets the table that is first shown on the website and I don't know how to get the other ones. I'm posting here everything I think is useful for you to help me.
The webpage is:
http://br.soccerway.com/national/brazil/serie-a/2012/regular-season/
xpath/C# code:
HtmlNodeCollection no2 = doc.DocumentNode
    .SelectNodes("//*[@id='page_competition_1_block_competition_matches_summary_6']/div[2]/table/tbody/tr/td[@class='team team-a ' or @class='date no-repetition' or @class='score-time score' or @class='team team-b ']");
On the website, you have to click on the "Por semana de jogo" option, right above the scores, for the combobox to be visible.
I need to get all the scores from all the tables, not just the one that appears.
So when you select a game week from the drop down (or click the "anterior" or "proximo" links above the drop down), the JavaScript in the page makes a call to the server to get the data for the selected game week. It just sends a URL to the server via GET.
The data is returned in the form of a JSON object, and inside this object is the table HTML. This HTML is loaded into the DOM in the right place and presto, the browser displays the data for that week.
It is a bit of work to get this programmatically, but it can be done. What you can do is determine what the URL is for each week. Hopefully, most of the query strings are constant except for the week in question. So you will have a boilerplate URL that you tweak for the week you want, and send it off to the server. You get the JSON back and parse out the table HTML. Then, you're golden: you just feed that HTML into the Agility Pack and work with it as usual.
I did a little investigation, and using Chrome's Developer Tools, in the Network tab, I found that when I selected a game week, the URL that is sent off to the server looks like so (this is for week 14):
http://br.soccerway.com/a/block_competition_matches_summary?block_id=page_competition_1_block_competition_matches_summary_6&callback_params=%7B%22page%22%3A%229%22%2C%22round_id%22%3A%2217449%22%2C%22outgroup%22%3A%22%22%2C%22view%22%3A%221%22%7D&action=changePage&params=%7B%22page%22%3A13%7D
(Note that you can also use other tools, such as Firebug in FireFox or Fiddler to get the URL).
By trying other weeks and comparing, it looks like (selected week - 1) is found near the end of the params query string: "...%3A13...". So for week 15 you'd use "...%3A14...". Fortunately, it looks like there is only one other area of difference among the URLs for different weeks, and it is in the callback_params query string. Unfortunately, I wasn't able to figure out how it connects to the selected week, but hopefully you can.
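One detail that helps here: %7B%22page%22%3A13%7D is simply the URL-encoded JSON object {"page": 13}, so the params query string for a given week can be generated rather than spliced by hand. A sketch, assuming the week-minus-one rule guessed above:

```python
import json
from urllib.parse import quote

def params_for_week(week):
    # Assumption from the answer: the page number is the selected week minus 1.
    return quote(json.dumps({"page": week - 1}, separators=(",", ":")), safe="")

# params_for_week(14) reproduces the "%7B%22page%22%3A13%7D" seen in the
# week-14 URL above.
```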
So when you feed that URL into your browser, you get back the JSON block. If you search for "<table" and "/table>" you'll see the HTML that you want. In your C# code, you can just use a simple regular expression to parse it out of the JSON string:
string json = "..."; // load the JSON string here
RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
Regex regx = new Regex("(?<theTable><table.*/table>)", options);
Match match = regx.Match(json);
if (match.Success) {
    string tableHtml = match.Groups["theTable"].Value;
}
Feed the HTML string into the Agility Pack and you should be on your way.
