scrapy xpath unsupported indirect child syntax - xpath

I want to select all the 'a' elements inside 'li's of class 'foo', so the XPath I used is li[@class="foo"]//a, which works in the XPath tester and in JavaScript.
However, I'm trying to get this to work under a CrawlSpider built under Scrapy, specifically as one of its link extractor rules in a fashion such as
Rule(SgmlLinkExtractor(restrict_xpaths=('//li[@class="foo"]//a | //a[contains(.,"Next")]')), callback='parse_foo', follow=True)
It returns a much larger set than expected.
For example, on this data.gc.ca page there are 10 divs of class dataset-item. Selecting //div[@class="dataset-item"] gives me 10 items. However, selecting //div[@class="dataset-item"]//a gives me 68 items. Per the spec, //a should match all the a elements within these divs.
How can I implement the desired function in Scrapy?
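To check what the expression should return independently of Scrapy, it can be run against a small document. The sketch below uses Python's standard xml.etree.ElementTree rather than Scrapy's link extractor, and the markup is made up for illustration (it is not the real data.gc.ca page). It shows that the descendant step returns every anchor nested anywhere inside a matching div, so the link count can legitimately be larger than the div count, while anchors outside those divs are left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, not the real data.gc.ca page.
SAMPLE = """
<html><body>
  <div class="nav"><a href="/home">Home</a></div>
  <div class="dataset-item">
    <h3><a href="/d/1">Dataset 1</a></h3>
    <ul>
      <li><a href="/d/1.csv">CSV</a></li>
      <li><a href="/d/1.xml">XML</a></li>
    </ul>
  </div>
  <div class="dataset-item">
    <h3><a href="/d/2">Dataset 2</a></h3>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# ElementTree paths are relative to the root element, hence the leading dot.
items = root.findall(".//div[@class='dataset-item']")
links = root.findall(".//div[@class='dataset-item']//a")
hrefs = [a.get("href") for a in links]

print(len(items), len(links), hrefs)
```

If the same comparison on the real page still yields links from outside the divs, the extra links are more likely to come from how the rule is applied than from the XPath itself; that is an assumption to verify, not a confirmed diagnosis.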

Related

Watir: Retrieve all dynamic HTML elements that match an attribute?

I am trying to scrape dynamic content with Watir and I am stuck.
Basically, I know that I can use
browser.element(css: ".some_class").wait_until_present
in order to scrape only when "some_class" is loaded.
The problem is that it is only giving me the first element having this class name and I want all of them.
I also know I can use
browser.spans(css: ".some_class")
in order to collect ALL the elements having this class name. The problem is that I can't combine it with "wait_until_present" (it gives me an error), and spans on its own does not work because the content is not loaded yet; the page is using JavaScript.
Is there a way to combine both? That means waiting for the class_name to be loaded AND select all the elements matching this class name, not just the first one?
I've been stuck for ages...
Thanks a lot for your help
There currently isn't anything in Watir for waiting for a collection of elements (though I had been recently thinking about adding something). For now, you just have to manually wait for an element to appear and then get the collection.
The simplest one is to call both of your lines:
browser.element(css: ".some_class").wait_until_present
browser.spans(css: ".some_class")
If you wanted to one-liner it, you could use #tap:
browser.spans(css: ".some_class").tap { |c| c[0].wait_until_present }
#=> Watir::SpanCollection
Note that if you are just checking the class name, you might want to avoid writing the CSS selector. Not only is the code easier to read without it, the CSS selector also won't be as performant.
browser.spans(class: "some_class").tap { |c| c[0].wait_until_present }

Xpath Scrub of Website Incomplete Results

I am trying to use the Google Spreadsheet function "importXML" to pull in all links and titles from a Khan Academy Website:
https://www.khanacademy.org/commoncore/grade-HSA-A-SSE
So far I have tried:
=IMPORTXML("https://www.khanacademy.org/commoncore/grade-HSA-A-SSE", "//a[@class='standard-preview']")
It brings in 29 results, but not all of the "a" elements with class "standard-preview". On the webpage, there are many more elements with that class than just the 29 results.
How do I grab all the elements with the class "standard-preview"? Why would my XPath not return some of the values?
My spreadsheet is below:
https://docs.google.com/spreadsheets/d/1pP-WMnoCYzG38VyT_0tYpdblSKjNGvDpa8dRMnraQ7w/edit?usp=sharing

Select Nokogiri element after an element with particular attribute

I have been at this for hours and I cannot make any progress.
I do not know how to do the following, I am used to arrays and loops, not nokogiri objects.
I want to select the table element immediately after the h2 containing span with id == "filmography"
<h2><span id="filmography">...
<table> # What I want to find
<tr>
<td>...
So far I have used
objects = page.xpath("//h2" | "//table")
to get an array of Nokogiri objects. I test each one for id == "Filmography" and would then work with the next object. However, the elements returned are not in the order they appear on the page; they come back as all the h2's, then all the tables.
Could I somehow have all 'h2's and 'table's as element objects in the order they appear on the page, and test the child object 'span' for its id attribute?
All advice appreciated, as I am thoroughly stuck.
This looks like it should work:
page.xpath('h2//span[@id="filmography"]').first.next_element
Nokogiri supports CSS selectors, which make this easy:
doc.at('span#filmography table').to_html
=> "<table><tr>\n<td>...</td>\n </tr></table>"
doc.at('#filmography table').to_html
=> "<table><tr>\n<td>...</td>\n </tr></table>"
at returns the first matching node, using either a CSS or XPath selector.
The "NodeSet" equivalent is search, which returns a NodeSet, which is like an Array, but would force you to use first after it, which only really makes for a longer command:
doc.search('span#filmography table').first.to_html
doc.search('#filmography table').first.to_html
Because the span tag contains an id parameter, you're safe to use at and only look for #filmography, since IDs are unique in a page.
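For the layout shown in the question, where the table follows the h2 as a sibling rather than sitting inside the span, another option is to walk the children of the shared parent in document order. The sketch below illustrates that idea in Python with the standard xml.etree.ElementTree instead of Nokogiri; the sample markup and the helper name table_after_heading are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup following the structure described in the question.
SAMPLE = """
<body>
  <h2><span id="awards">Awards</span></h2>
  <table><tr><td>award row</td></tr></table>
  <h2><span id="filmography">Filmography</span></h2>
  <table><tr><td>film row</td></tr></table>
</body>
"""

def table_after_heading(root, span_id):
    """Return the first table sibling that follows the h2 holding span_id."""
    for parent in root.iter():
        seen_heading = False
        for child in parent:  # children are yielded in document order
            if child.tag == "h2" and child.find(f".//span[@id='{span_id}']") is not None:
                seen_heading = True
            elif seen_heading and child.tag == "table":
                return child
    return None

root = ET.fromstring(SAMPLE)
table = table_after_heading(root, "filmography")
cell_text = table.find(".//td").text
print(cell_text)
```

Because the h2 and table elements are visited in the order they appear under their parent, there is no need to collect the two element types separately and re-sort them afterwards.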

Exclude certain elements with an xpath query

I'm using XPath to extract certain elements from the following URL:
http://gizmodo.com/how-often-cities-appear-in-books-from-the-past-200-year-1040700553
To extract the main content, I'm using the query:
//p[@class='has-media media-640']
However, I'd like to exclude all spans from within this main content that have the class "magnifier lightBox". I've looked through StackOverflow and tried all sorts of methods such as:
//div[@class='row post-content']/*[not(self::span[@class='magnifier lightBox'])]
to no avail.
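One reason the attempt above has no effect is that the not(self::span[...]) predicate only filters the direct children of the div; a span nested inside a selected p still comes along with its parent. A possible workaround is to select the paragraphs first and then drop the unwanted spans from a copy of each one. The sketch below does this in Python with the standard xml.etree.ElementTree; the markup and the helper name strip_spans are hypothetical and only loosely modelled on the question.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical markup loosely modelled on the question, not the real page.
SAMPLE = """
<div class="row post-content">
  <p class="has-media media-640"><span class="magnifier lightBox">zoom</span><img src="chart.png" />City mentions over time</p>
  <p>Body text</p>
</div>
"""

def strip_spans(element, unwanted_class):
    """Return a deep copy of element without spans whose class equals unwanted_class."""
    clone = copy.deepcopy(element)
    for parent in list(clone.iter()):  # snapshot, since the tree is edited below
        # Walk children backwards so earlier indexes stay valid after a removal.
        for index, child in reversed(list(enumerate(parent))):
            if child.tag == "span" and child.get("class") == unwanted_class:
                if child.tail:  # keep text that followed the removed span
                    if index == 0:
                        parent.text = (parent.text or "") + child.tail
                    else:
                        previous = parent[index - 1]
                        previous.tail = (previous.tail or "") + child.tail
                parent.remove(child)
    return clone

root = ET.fromstring(SAMPLE)
paragraphs = root.findall(".//p[@class='has-media media-640']")
cleaned = [strip_spans(p, "magnifier lightBox") for p in paragraphs]
texts = ["".join(p.itertext()).strip() for p in cleaned]
print(texts)
```

The class comparison here is an exact string match, just like @class='magnifier lightBox' in XPath, so a span whose class attribute lists the same names in a different order would not be removed.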

How to get content between HTML tags that have been loaded by jQuery?

I'm loading data using jQuery (AJAX), which is then being loaded into a table (so this takes place after page load).
In each table row there is a 'select' link allowing users to select a row from the table. I then need to grab the information in this row and put it into a form further down the page.
$('#selection_table').on('click', '.select_link', function() {
$('#booking_address').text = $(this).closest('.address').text();
$('#booking_rate').text = $(this).closest('.rate').val();
});
As I understand it, the 'closest' function traverses up the DOM tree, so since my link is in the last cell of each row, it should get the elements 'address' and 'rate' from the previous row (the classes are assigned to the correct cells).
I've tried debugging myself using quick and dirty 'alert($(this).closest(etc...' in many variations, but nothing seems to work.
Do I need to do something differently to target data that was loaded after the original page load? where am I going wrong?
You are making wrong assumptions about .closest() and how .text() works. Please make a habit of studying the documentation when in doubt, it gives clear descriptions and examples on how to use jQuery's features.
.closest() will traverse the parents of the given element, trying to match the selector you have provided it. If your .select_link is not "inside" .address, your code will not work.
Also, .text() is a method, not a property (semantically speaking, since methods are in fact properties in JavaScript). x.text = 1; simply overwrites the method on that object, which is not a good idea; you want to invoke the method instead: x.text(1);.
Something along these lines might work:
var t = $(this).closest('tr').find('.address').text();
$('#booking_address').text(t);
If #booking_address is a form element, use .val() on it instead.
If it does not work, please provide the HTML structure you are using (edit your question, use jsFiddle or a similar service) and I will help you. When asking questions like this, it is a good habit anyways to provide the relevant HTML structure.
You can try using the parent() and find() functions to locate the data directly; how many parent() and find() calls you need depends on your HTML.
For example, to get the previous row's data, that would be:
$('#selection_table').on('click', '.select_link', function(){
$('#booking_address').text($(this).parent().parent().prev().find('.address').text());
});
Here the two parent() calls climb from the link to its row (tr), prev() moves to the previous row, and find() locates the element inside it.
Is there a demo of the code somewhere? Check when you are calling the code. It should be after the 'success' of the AJAX call.
