My Scrapy script returns results when XPaths are hard coded but does not work with variables. What am I missing?
The following works:
response.selector.xpath('//*[(@id = "abc")]').extract()
The following does NOT work:
response.xpath("{}".format(xpath_variable)).extract()
Can someone please tell me what I'm doing wrong. Thanks!
Just figured this out. There was a mistake with my code since I was replacing text within the xpath. Variables do in fact work. Thanks for looking into this Tomas!
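For anyone landing here with the same symptom, a quick check is to compare the variable-built XPath string with the hard-coded one before querying. The sketch below uses Python's standard-library xml.etree.ElementTree as a stand-in for Scrapy's selector (it supports only a subset of XPath). The sample markup and the names target_id and xpath_variable are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; not taken from the page in the question.
SAMPLE = '<html><body><div id="abc">hello</div><div id="xyz">other</div></body></html>'
root = ET.fromstring(SAMPLE)

hard_coded = ".//*[@id='abc']"

# Build the same expression from a variable.
target_id = "abc"
xpath_variable = ".//*[@id='{}']".format(target_id)

# If these two differ, the string handling is at fault, not the selector.
print(repr(hard_coded))
print(repr(xpath_variable))

hard_coded_hits = [el.text for el in root.findall(hard_coded)]
variable_hits = [el.text for el in root.findall(xpath_variable)]
print(hard_coded_hits == variable_hits)
```

Printing the repr of the final string usually exposes the kind of accidental text replacement described above.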
I am trying to write a web scraper using scrapy and xpath but I am experiencing a frustrating problem.
I need the text in a paragraph which has HTML
<p class="list-details__item__date" id="match-date">04.03.2017 - 15:00</p>
I might be wrong, but since the p has an id attribute, it should be referable simply using
response.xpath('//p[@id="match-date"]/text()').extract()
However, this doesn't work.
I know a little XPath and I was able to write scrapers in the past, but this one is giving me trouble. I tried many solutions, but none seems to work:
response.xpath('//p[contains(@class, "list-details__item__date") and contains(@id,"match-date")]/text()').extract()
response.xpath('//p[@class="list-details__item__date" and @id="match-date"]/text()').extract()
I also tried using "contains" as suggested in many answers, but that did not work either. This might be a stupid mistake on my part... it would be great if someone could help me!
Thank you so much
Maybe match-date is loaded via AJAX/JS ... Please disable Javascript in your browser and then see if match-date is there or not.
Also, for the sake of simplicity, use CSS selectors instead of XPaths.
response.css('#match-date::text').extract()
EDIT:
To get the value of the data-dt attribute, do this:
response.css('#match-date::attr(data-dt)').extract()
Or, with XPath:
response.xpath('//p[@id="match-date"]/@data-dt').extract()
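The id-based lookup can be checked offline against the snippet from the question. This is only a sketch using the standard-library xml.etree.ElementTree rather than Scrapy itself, and the data-dt value is a placeholder, since the real attribute value is not shown in the question.

```python
import xml.etree.ElementTree as ET

# Markup from the question, with a placeholder data-dt value (assumption).
SNIPPET = (
    '<div>'
    '<p class="list-details__item__date" id="match-date" '
    'data-dt="PLACEHOLDER">04.03.2017 - 15:00</p>'
    '</div>'
)
root = ET.fromstring(SNIPPET)

# Rough equivalent of //p[@id="match-date"]/text()
node = root.find(".//p[@id='match-date']")
match_text = node.text

# Rough equivalent of //p[@id="match-date"]/@data-dt
data_dt = node.get("data-dt")

print(match_text)
print(data_dt)
```

If the lookup works offline like this but Scrapy still returns an empty list, the element is probably missing from the raw HTML that Scrapy downloads, which points back to the AJAX/JS suggestion.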
I am having some difficulty querying a node in an XML document. The document is http://ods.od.nih.gov/api/index.aspx?resourcename=BotanicalBackground&readinglevel=Health%20Professional
I am trying to get the text of the first node.
I have tried these queries and none of them seemed to work.
*[name()='ImageURL']
//captionedimage[1]
//Factsheet/RelatedImages/captionedimage[1]/ImageURL/text()
//RelatedImages/*[1]
I would greatly appreciate any help.
Your last three XPath expressions seem to work (you can quickly check them at http://www.xpathtester.com/test or http://www.freeformatter.com/xpath-tester.html). The problem is probably linked to the environment you use.
When I tried them in Scrapy, the XPaths containing uppercase letters retrieved nothing; only //factsheet/relatedimages/captionedimage[1]/imageurl/text() worked. This behavior surprised me and I have no idea why it happens, but you should definitely gather more information about the environment you're using.
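One possible explanation, offered as an assumption rather than something confirmed for this page: if Scrapy treated the response as HTML instead of XML, the HTML parser would have lowercased the element names, and XPath name tests are case-sensitive. The sketch below shows both effects with the Python standard library only. The XML is a made-up fragment shaped like the paths in the question, and the image URL is a placeholder.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical fragment shaped like the paths in the question.
DOC = (
    "<Factsheet><RelatedImages>"
    "<captionedimage><ImageURL>http://example.com/a.jpg</ImageURL></captionedimage>"
    "</RelatedImages></Factsheet>"
)

# Parsed as XML, names keep their case, so only the exact-case path matches.
root = ET.fromstring(DOC)
exact_case = root.findall("./RelatedImages/captionedimage[1]/ImageURL")
lower_case = root.findall("./relatedimages/captionedimage[1]/imageurl")


class TagCollector(HTMLParser):
    """Record tag names as an HTML parser reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# Parsed as HTML, tag names come back lowercased.
collector = TagCollector()
collector.feed(DOC)
collector.close()

print([el.text for el in exact_case])
print(lower_case)
print(collector.tags)
```

If this is the cause, building the selector explicitly as XML, for example Selector(text=response.text, type="xml"), should let the original mixed-case paths match.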
Try this...
./Factsheet/RelatedImages/*[local-name() = 'captionedimage' and position()=1]/*[local-name() = 'ImageURL']
I have a script that is basically a search/filter, and it runs in all browsers except Firefox. I don't know what is wrong. I have been trying since Saturday to find the problem, searching here to see if anyone had the same issue, and found nothing. I'm LEARNING JavaScript, so I'm hoping someone can point me in the right direction to find what I'm doing wrong or what I'm missing. Any help will be appreciated.
http://jsfiddle.net/ccarizzo/GYcbE/
online here
The problem, as you can tell by looking in the error console, is this code:
$(listaProdutos).find('a:Contains(' + filter + ')').parent();
There is no "listaProdutos" variable in the script. You're relying on a non-standard behavior in other browsers that reflects all IDs into the global scope.
This should work:
$("#listaProdutos").find('a:Contains(' + filter + ')').parent();
You need a similar change in some other places too.
Use the W3C validator to check interoperability of your web-scripts.
Click here to get yours validated.
I'm relatively new to Watir but can find no good documentation (examples) regarding how to check if an element exists. There are the API specs, of course, but these make precious little sense to me if I don't find an example.
I've tried both combinations but nothing seems to work...
if browser.image (:src "/media/images/icons/reviewertools/editreview.jpg").exists
then...
if browser.image (:src "/media/images/icons/reviewertools/editreview.jpg").exists?
then...
If anyone has a concrete suggestion as per how to implement this, please help! Thanks!
It seems you are missing a comma between parameters.
Should be
if browser.image(:src, "/media/images/icons/reviewertools/editreview.jpg").exists?
Also you can find this page useful in future to know what attributes are supported.
The code you posted should work just fine.
Edit: Oops, wrong. As Katmoon pointed out, there is a missing comma.
browser.image(:src "/media/images/icons/reviewertools/editreview.jpg").exists?
One problem you may get caught up in is if the browser variable you specified is actually an element that doesn't exist.
e.g.
b = Watir::IE.start(ipAddress)
b.frame(:name, "doesntExist").image(:src, "/media/images/icons/reviewertools/editreview.jpg").exists?
The above code will throw a Watir::UnknownFrameException. You can get around this by first verifying the frame exists or by surrounding the code in a begin/rescue block.
Seems like you are using it correctly. Here is an old RDoc of Watir.
Does it not work because Watir cannot find it? It is hard to tell because there is no source or link to the page being tested. I think that I only use image.exists?. In general, when the image exists but is not found, the cause is one of the following:
The how is not compatible with the element type. There is a cheatsheet to help you see which object types can be found with different attributes here.
The what is not correct. You may have to play with that a little bit. Consider trying a regex string to match it such as browser.image(:src, /editreview.jpg/). As a last resort, maybe use element_by_xpath, but there are maintenance costs with that.
The location is not correct. Maybe the element is in a frame or something like that. browser.frame("detail").image(:src, /editreview.jpg/).
Try those, but please let me know what worked. One more thing: what are you checking for? If it's part of the test criteria, you can handle it that way. If you need to click on it, then forget the .exists? and just click on it. Ruby will let you know if it's not there. If you need it to fail gracefully, learn about begin/rescue.
Good luck,
Dave
I've written a scrubyt extractor based on the 'learning' technique - that is, specifying the current text on the page and getting it to work out the XPath expressions itself. However, I now want to export the extractor so that it can be used even when the page has changed.
The documentation for scrubyt seems to be all over the place now, but from what I can find I should be able to put the line extractor.export(__FILE__) and it should work. It doesn't - I just get an error saying that there is the wrong number of arguments for export, it should have 0. I've tried it without any arguments and it still fails.
I would ask on the scrubyt forum, but it seems like no-one's been there for ages!
Any ideas what to do here?
Just had the same problem and tried "puts google_data.export()" (trying to get some stuff from google)
This gave me the following:
=== Extractor tree ===
export() is not working at the moment, due to the removal or
ParseTree, ruby2ruby and RubyInline.
For now, in case you are using examples, you can replace them by hand
based on the output below.
So if your pattern in the learning extractor looks like
book "Ruby Cookbook"
and you see the following below:
[book] /table[1]/tr/td[2]
then replace "Ruby Cookbook" with "/table[1]/tr/td[2]" (and all the
other XPaths) and you are ready!
[link] /body/div/div/div/div/div/ol/li/h3/a
which gave me the xpath I was looking for
scrubyt version is 0.4.06
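Swapping the examples for XPaths by hand gets tedious with many patterns. Here is a small sketch, in Python rather than Ruby, that collects the name/XPath pairs from the output so they can be pasted into the extractor. It assumes every pattern line has the form "[name] xpath", as in the output above.

```python
import re

# Lines copied from the extractor tree output above.
OUTPUT = """\
[book] /table[1]/tr/td[2]
[link] /body/div/div/div/div/div/ol/li/h3/a
"""

# Each pattern line is assumed to look like "[name] xpath".
LINE = re.compile(r"^\[(?P<name>[^\]]+)\]\s+(?P<xpath>/\S+)$")

xpaths = {}
for line in OUTPUT.splitlines():
    m = LINE.match(line.strip())
    if m:
        xpaths[m.group("name")] = m.group("xpath")

# Print each pattern in the form used by the extractor definition.
for name, xpath in xpaths.items():
    print('{} "{}"'.format(name, xpath))
```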