I'm using XPath to extract certain elements from the following URL:
http://gizmodo.com/how-often-cities-appear-in-books-from-the-past-200-year-1040700553
To extract the main content, I'm using the query:
//p[@class='has-media media-640']
However, I'd like to exclude all spans within this main content that have the class "magnifier lightBox". I've looked through Stack Overflow and tried all sorts of methods, such as:
//div[@class='row post-content']/*[not(self::span[@class='magnifier lightBox'])]
to no avail.
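For what it's worth, the exclusion itself can be expressed in XPath 1.0 with a not(self::…) predicate on the paragraph's children. A small, self-contained sketch using Ruby's REXML on invented markup (the real page structure may differ):

```ruby
require 'rexml/document'

# Invented markup that mimics the page's structure.
html = <<~XML
  <p class="has-media media-640">
    Intro text
    <span class="magnifier lightBox">zoom widget</span>
    <em>kept child</em>
  </p>
XML

doc = REXML::Document.new(html)

# Child elements of the paragraph, minus spans with the unwanted class.
kept = REXML::XPath.match(
  doc,
  "//p[@class='has-media media-640']/*[not(self::span[@class='magnifier lightBox'])]"
)
kept.map(&:name)  # => ["em"]
```

The same predicate should drop into any XPath 1.0 engine; only the surrounding Ruby is illustrative.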
I am trying to scrape dynamic content with Watir and I am stuck.
Basically, I know that I can use
browser.element(css: ".some_class").wait_until_present
in order to scrape only when "some_class" is loaded.
The problem is that it only gives me the first element with this class name, and I want all of them.
I also know I can use
browser.spans(css: ".some_class")
in order to collect ALL the elements with this class. The problem is that I can't combine it with "wait_until_present" (it gives me an error), and spans on its own doesn't work because the content is not loaded yet; the page uses JavaScript.
Is there a way to combine both? That means waiting for the class_name to be loaded AND select all the elements matching this class name, not just the first one?
I've been stuck for ages...
Thanks a lot for your help
There currently isn't anything in Watir for waiting for a collection of elements (though I have recently been thinking about adding something). For now, you have to manually wait for an element to appear and then get the collection.
The simplest approach is to call your two lines in sequence:
browser.element(css: ".some_class").wait_until_present
browser.spans(css: ".some_class")
If you wanted to one-liner it, you could use #tap:
browser.spans(css: ".some_class").tap { |c| c[0].wait_until_present }
#=> Watir::SpanCollection
Note that if you are just checking the class name, you might want to avoid writing a CSS selector. Not only is it easier to read without one, it is also more performant:
browser.spans(class: "some_class").tap { |c| c[0].wait_until_present }
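Until a collection-wait exists in Watir, the same two-step pattern can be sketched in plain Ruby. Everything below is illustrative: `fetch_items` is a stand-in for a lazily evaluated lookup like `browser.spans(class: "some_class")`, and the polling helper is a hand-rolled substitute, not Watir's implementation.

```ruby
# Poll a block until it returns true, or give up after a timeout.
def wait_until(timeout: 2, interval: 0.01)
  deadline = Time.now + timeout
  until yield
    raise "condition not met within #{timeout}s" if Time.now > deadline
    sleep interval
  end
end

# Simulated page: matches appear only after a few polls, as with
# JavaScript-rendered content.
poll_count = 0
fetch_items = -> do
  poll_count += 1
  poll_count >= 3 ? ["span-a", "span-b"] : []
end

wait_until { fetch_items.call.any? }  # wait for at least one match...
items = fetch_items.call              # ...then collect all of them
items  # => ["span-a", "span-b"]
```

The point is simply that the wait and the collection are two separate calls; once any element is present, re-fetching the collection returns everything.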
So I have a case where I need to be able to work on the actual Hyperlink element inside the body of the docx, not just the target URL or the internal/externality of the link.
As a possible additional wrinkle this hyperlink wasn't present in the docx when it was opened but instead was added by the docx4j-xhtmlImporter.
I've iterated the list of relationships here: wordMLPackage.getMainDocumentPart().getRelationshipsPart().getRelationships().getRelationship()
And found the relationship ID of the hyperlink I want. I'm trying to use an XPath query: List<Object> results = wordMLPackage.getMainDocumentPart().getJAXBNodesViaXPath("//w:hyperlink[@r:id='rId11']", false);
But the list is empty. I also thought it might need a refresh because I added the hyperlink at runtime, so I tried with the refreshXMLFirst parameter set to true. On the off chance it wasn't a real node (because it's an inner class of P), I also tried getJAXBAssociationsForXPath with the same parameters, and that doesn't return anything either.
Additionally, even XPath like "//w:hyperlink" fails to match anything.
I can see the hyperlinks in the XML if I unzip it after saving to a file, so I know the ID is right: <w:hyperlink r:id="rId11">
Is XPath the right way to find this? If it is, what am I doing wrong? If it's not, what should I be doing?
Thanks
XPathHyperlinkTest.java is a simple test case which works for me
You might be having problems because of JAXB, or possibly because of the specific way the binder is being set up in your case (do you start by opening an existing docx, or by creating a new one?). Which docx4j version are you using?
Which JAXB implementation are you using? If it's the Sun/Oracle implementation (the reference implementation, or the one included in their JDK/JRE), that might be what's causing the problem, in which case you could try MOXy instead.
An alternative to using XPath is to traverse the docx; see finders/ClassFinder.java
Try it without namespace binding:
List<Object> results = wordMLPackage.getMainDocumentPart().getJAXBNodesViaXPath("//*:hyperlink[@*:id='rId11']", false);
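Note that `*:name` wildcards are XPath 2.0 syntax; if the XPath engine in play only supports XPath 1.0, the namespace-agnostic form is usually spelled with local-name() instead. A quick illustration of the shape using Ruby's REXML on an invented fragment (docx4j itself is Java; only the XPath is the point here):

```ruby
require 'rexml/document'

# Invented, minimal stand-in for document.xml; only the shape matters.
xml = <<~XML
  <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
              xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
    <w:body>
      <w:p>
        <w:hyperlink r:id="rId11"><w:r><w:t>link text</w:t></w:r></w:hyperlink>
      </w:p>
    </w:body>
  </w:document>
XML

doc = REXML::Document.new(xml)

# local-name() ignores the namespace prefix entirely.
hyperlinks = REXML::XPath.match(doc, "//*[local-name()='hyperlink']")
hit = hyperlinks.find { |el| el.attributes['r:id'] == 'rId11' }
hit.nil?  # => false
```

In a pure XPath 1.0 setting the attribute test can also go into the predicate, e.g. `//*[local-name()='hyperlink'][@*[local-name()='id'] = 'rId11']`.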
I want to select all the 'a' elements inside 'li' elements of class 'foo', so the XPath used is li[@class="foo"]//a, which works in an XPath tester and in JavaScript.
However, I'm trying to get this to work under a CrawlSpider built under Scrapy, specifically as one of its link extractor rules in a fashion such as
Rule(SgmlLinkExtractor(restrict_xpaths=('//li[@class="foo"]//a | //a[contains(.,"Next")]')), callback='parse_foo', follow=True)
It returns a much larger set than expected.
For example, on this data.gc.ca page there are 10 divs of class dataset-item. Selecting //div[@class="dataset-item"] gives me 10 items, but selecting //div[@class="dataset-item"]//a gives 68. Per the spec, //a should be all the a elements within these divs.
How can I implement the desired function in Scrapy?
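One thing worth checking before blaming Scrapy: `//div[@class="dataset-item"]//a` matching far more nodes than there are divs is expected XPath behavior, since it returns every anchor inside every matched div, and each dataset item typically carries several links (title, format downloads, and so on). A small illustration with Ruby's REXML on invented markup:

```ruby
require 'rexml/document'

# Invented markup: two "dataset-item" divs with several links each.
html = <<~XML
  <root>
    <div class="dataset-item">
      <a href="/d/1">title</a><a href="/d/1.csv">csv</a><a href="/d/1.json">json</a>
    </div>
    <div class="dataset-item">
      <a href="/d/2">title</a><a href="/d/2.csv">csv</a>
    </div>
  </root>
XML

doc = REXML::Document.new(html)

divs    = REXML::XPath.match(doc, "//div[@class='dataset-item']")
anchors = REXML::XPath.match(doc, "//div[@class='dataset-item']//a")
[divs.size, anchors.size]  # => [2, 5]
```

If only one link per item is wanted, replacing `//a` with `/descendant::a[1]` selects the first anchor of each matched div instead of all of them.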
I have been at this for hours and cannot make any progress. I do not know how to do the following; I am used to arrays and loops, not Nokogiri objects.
I want to select the table element immediately after the h2 containing a span with id == "filmography":
<h2><span id="filmography">...
<table> # What I want to find
<tr>
<td>...
So far I have used
objects = page.xpath("//h2", "//table")
to get an array of Nokogiri objects, testing each one for id == "filmography" and then working with the next object. However, the elements returned are not in the order they appear on the page: they come back as all h2s first, then all tables.
Could I somehow have all 'h2's and 'table's as element objects in the order they appear on the page, and test the child object 'span' for its id attribute?
All advice appreciated, as I am thoroughly stuck.
This looks like it should work:
page.xpath('//h2[span/@id="filmography"]').first.next_element
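An XPath-only alternative to next_element is the following-sibling axis, which directly expresses "the first table after the h2 whose span has that id". A minimal sketch with Ruby's stdlib REXML on markup shaped like the question's:

```ruby
require 'rexml/document'

html = <<~XML
  <body>
    <h2><span id="filmography">Filmography</span></h2>
    <table><tr><td>entry</td></tr></table>
  </body>
XML

doc = REXML::Document.new(html)

# The first table sibling following the h2 that contains the span.
table = REXML::XPath.first(
  doc,
  "//h2[span/@id='filmography']/following-sibling::table[1]"
)
table.name  # => "table"
```

Unlike next_element, this still finds the table if other elements happen to sit between it and the h2.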
Nokogiri supports CSS selectors, which can make this easy. Since the table is a sibling of the h2 rather than a descendant of the span, use the adjacent-sibling combinator:
doc.at('h2:has(span#filmography) + table').to_html
=> "<table><tr>\n<td>...</td>\n </tr></table>"
doc.at('h2:has(#filmography) + table').to_html
=> "<table><tr>\n<td>...</td>\n </tr></table>"
at returns the first matching node, using either a CSS or XPath selector.
The "NodeSet" equivalent is search, which returns a NodeSet. A NodeSet is like an Array, but it forces you to call first afterwards, which only makes for a longer command:
doc.search('h2:has(span#filmography) + table').first.to_html
doc.search('h2:has(#filmography) + table').first.to_html
Because the span tag has an id attribute, the second form can match on #filmography alone, since IDs are unique within a page.
I am executing a raven query in C#, and utilising both the Where() and Search() extension methods.
I need both these functionalities, because I need to only return indices with a specific Guid field, AND text that exists in a body of text.
Unfortunately, the Where extension method seems not to be compatible with the Search extension method. When I combine them I get a Lucene query like this:
Query: FeedOwner:25eb541c\-b04a\-4f08\-b468\-65714f259ac2 MessageBody:<<request*>>
This seems to completely ignore the 'MessageBody' part of the criteria - no matter what constraint I use in the free text, it isn't applied.
I have tested with 'Search' alone, and it works - so it's not a problem with free-text searching by itself, just with combining the two.
Thanks to @Tobias on the RavenDB Google Group, who pointed me in the right direction - Search takes an options parameter that defines how the Where and Search clauses are combined:
session.Query<T>()
    .Where(...)
    .Search(candidate => candidate.MessageBody, queryString + "*", options: SearchOptions.And);