Xpath query with scrapy

I want to extract the string "Professor Dr. Michael Eiermann" from the category "beteiligter Wissenschaftler" on the page http://gepris.dfg.de/gepris/projekt/268853. What is the correct XPath query? With my current query, the output I get is the string "Fachliche Zuordnung".
Code snippet with the XPath query:
start_urls = ['http://gepris.dfg.de/gepris/projekt/268853']

def parse(self, response):
    for sel in response.xpath("//*[contains(@class,'content_frame')]/*[@id='projektbeschreibung']"):
        lianjia = lianjiaItem()
        lianjia['Projekttitel'] = sel.xpath("//div/div/span/text()").extract()

This XPath
response.xpath("//span[contains(., 'beteiligter Wissenschaftler')]/following-sibling::span[@class='value']/a/text()").extract()
returns a list of the professors' names. The first entry is Professor Dr. Michael Eiermann.

Related

Xpath: return all nodes that match any one of the conditions

I am trying to fetch two nodes from XML as a combined result using an OR condition.
Nodes where name="John" or name="Jim" should both be returned, so I expect the following result:
<person name="John"></person>
<person name="Jim"></person>
I have tried the XPath expression //person[@name="John"] or //person[@name="Jim"]
but it gives me only one node.
How should I construct the XPath expression in this case?
regards,
Venky

Assuming Saxon means a Saxon 9 version, where XPath 2 or 3 is supported, I would use the predicate person[@name = ('John', 'Jim')]. In any case, the right place for your or expression is inside the square brackets: person[@name = 'Jim' or @name = 'John'].
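
For readers without an XPath 2 processor, the same selection can be sketched in Python. This is only an illustration under stated assumptions: the people wrapper element and the third person are made up, and because the standard library's xml.etree.ElementTree implements only a subset of XPath without the or operator, the condition is applied in Python rather than inside the predicate.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; only the John and Jim elements come from the question.
XML = """
<people>
  <person name="John"></person>
  <person name="Jim"></person>
  <person name="Mary"></person>
</people>
"""


def select_people(xml_text, wanted):
    """Return the name attribute of every person whose name is in `wanted`."""
    root = ET.fromstring(xml_text)
    # Same intent as //person[@name = 'John' or @name = 'Jim'],
    # with the "or" expressed as a set membership test.
    return [p.get("name") for p in root.iter("person") if p.get("name") in wanted]


print(select_people(XML, {"John", "Jim"}))
```

With a full XPath engine, the single predicate from the answer above is the more direct choice; the Python filter is just a fallback for limited implementations.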

How to find same elements with xpath

Given the following XML, how could I get the list of directors where two directors of the same movie have the same LastName?
<MoviesLib>
  <Movie Title="Batman" Year="2013">
    <Directors>
      <Director>
        <Name>Robert</Name>
        <LastName>Zemeckis</LastName>
      </Director>
    </Directors>
  </Movie>
  <Movie Title="Gru" Year="2012">
    <Directors>
      <Director>
        <Name>john</Name>
        <LastName>tailer</LastName>
      </Director>
      <Director>
        <Name>Emma</Name>
        <LastName>Smith</LastName>
      </Director>
      <Director>
        <Name>Lana</Name>
        <LastName>Smith</LastName>
      </Director>
    </Directors>
  </Movie>
</MoviesLib>
For example, in this case the result would be: Emma Smith, Lana Smith
Thanks

The following XPath 2.0 expression should work:
for $d in //Director
return $d[../Director[not(. is $d) and LastName = $d/LastName]]
I can't come up with a single XPath 1.0 expression since it doesn't support for expressions (see the question How to get the context of outer predicate? for some background).
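
Where no XPath 2.0 processor is available, one option is to do the comparison in host code instead. The following is a minimal sketch in Python, assuming the document has exactly the structure shown in the question; the function name directors_sharing_last_name is made up for illustration. It counts last names per movie and keeps the directors whose last name occurs more than once.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sample data copied from the question.
XML = """<MoviesLib>
  <Movie Title="Batman" Year="2013">
    <Directors>
      <Director><Name>Robert</Name><LastName>Zemeckis</LastName></Director>
    </Directors>
  </Movie>
  <Movie Title="Gru" Year="2012">
    <Directors>
      <Director><Name>john</Name><LastName>tailer</LastName></Director>
      <Director><Name>Emma</Name><LastName>Smith</LastName></Director>
      <Director><Name>Lana</Name><LastName>Smith</LastName></Director>
    </Directors>
  </Movie>
</MoviesLib>"""


def directors_sharing_last_name(xml_text):
    """Return 'Name LastName' for directors who share a last name within one movie."""
    root = ET.fromstring(xml_text)
    result = []
    for movie in root.iter("Movie"):
        directors = movie.findall("./Directors/Director")
        # Count how often each last name occurs among this movie's directors.
        counts = Counter(d.findtext("LastName") for d in directors)
        for d in directors:
            if counts[d.findtext("LastName")] > 1:
                result.append(f"{d.findtext('Name')} {d.findtext('LastName')}")
    return result


print(directors_sharing_last_name(XML))
```

The comparison is scoped to a single movie, matching the ../Director step in the XPath 2.0 expression.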

Replacing <a> tags that have two pairs of double quotes

I have asked a similar question before but this one is slightly different
I have content with this sort of links in:
Professor Steve Jackson
[UPDATE]
And this is how I read it:
content = doc.xpath("/wcm:root/wcm:element[@name='Body']").inner_text
The links have two pairs of double quotes after the href=.
I am trying to strip out the tag and retrieve only the text, like so:
Professor Steve Jackson
To do this I'm using the same method which works for this sort of link which has only a single pair of double quotes:
World
This returns World:
content = Nokogiri::XML.fragment(content_with_link)
content.css('a[href^="ssLINK"]')
  .each{|a| a.replace("<>#{a.content}</>")}
=>World
When I try to do the same for the link that has two pairs of double quotes, it complains:
content = Nokogiri::XML.fragment(content_with_link)
content.css('a[href^=""ssLINK""]')
  .each{|a| a.replace("<>#{a.content}</>")}
Error:
/var/lib/gems/1.9.1/gems/nokogiri-1.6.0/lib/nokogiri/css/parser_extras.rb:87:in
`on_error': unexpected 'ssLINK' after '[:prefix_match, "\"\""]' (Nokogiri::CSS::SyntaxError)
Anyone know how I can overcome this issue?

I can suggest two ways to do it, depending on whether every <a> tag has an href enclosed in two pairs of double quotes or only the one with ssLINK.
Assume
output = []
input_text = 'Professor Steve Jackson'
1) If only the a tags with ssLINK have an href with "", then just do
Nokogiri::HTML(input_text).css('a[href=""]').each do |nokogiri_obj|
  output << nokogiri_obj.text
end
# => output = ["Professor Steve Jackson"]
2) If all the a tags have an href with "", then you can try this
nokogiri_a_tag_obj = Nokogiri::HTML(input_text).css('a[href=""]')
nokogiri_a_tag_obj.each do |nokogiri_obj|
  output << nokogiri_obj.text if nokogiri_obj.has_attribute?('sslink')
end
# => output = ["Professor Steve Jackson"]
With this second approach, if
input_text = 'Professor Steve Jackson Some other TextSecond link'
then the output will still be ["Professor Steve Jackson"]

Your content is not XML, so any attempt to solve the problem using XML tools such as XSLT and XPath is doomed to failure. Use a regex approach, e.g. awk or Perl. However, it's not immediately obvious to me how to match
<a href="" sometext"">
without also matching
<a href="" sometext="">
so we need to know a bit more about this syntax that you are trying to parse.
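
As a rough illustration of the regex route, here is a minimal Python sketch. The sample markup is hypothetical (the exact links from the question are not reproduced here), and the pattern assumes that no > character occurs inside an attribute value and that <a> tags are not nested. It drops every <a ...> wrapper and keeps only the link text.

```python
import re

# Opening <a ...> tag, lazily captured link text, closing </a>.
# Assumption: attribute values never contain '>'.
A_TAG = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def strip_links(text):
    """Replace each <a ...>text</a> with just its text."""
    return A_TAG.sub(r"\1", text)


# Hypothetical input with two pairs of double quotes after href=.
sample = 'Contact <a href=""ssLINK/hypothetical-id"">Professor Steve Jackson</a> today.'
print(strip_links(sample))
```

Because the pattern never inspects the attribute text, it treats single-quoted, double-quoted and doubled-quote href values alike; it would need tightening if only ssLINK anchors should be stripped.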

CSS/Xpath sibling selector in Nokogiri

I have the following XML tree and need to get out the first name and surname only for the contrib tags with child xref nodes of ref-type "corresp".
<pmc-articleset>
  <article>
    <front>
      <article-meta>
        <contrib-group>
          <contrib contrib-type="author">
            <name>
              <surname>Wereszczynski</surname>
              <given-names>Jeff</given-names>
            </name>
            <xref rid="aff1" ref-type="aff"/>
          </contrib>
          <contrib contrib-type="author">
            <name>
              <surname>Andricioaei</surname>
              <given-names>Ioan</given-names>
            </name>
            <xref rid="aff1" ref-type="aff"/>
            <xref ref-type="corresp" rid="cor1">*</xref>
          </contrib>
        </contrib-group>
      </article-meta>
    </front>
  </article>
</pmc-articleset>
I saw "Getting the siblings of a node with Nokogiri" which points out the CSS sibling selectors that can be used in Nokogiri, but, following the example given, my code gives siblings indiscriminately.
require "net/http"
require "nokogiri"
url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=PMC1637560&db=pmc"
xml_data = Net::HTTP.get_response(URI.parse(url)).body
parsedoc = Nokogiri::XML.parse(xml_data)
corrdetails = parsedoc.at('contrib:has(xref[text()="*"])')
puts surname = corrdetails.xpath( "//surname" ).text
puts givennames = corrdetails.xpath("//given-names").text
=> WereszczynskiAndricioaei
=> JeffIoan
I only want the sibling nodes of the contrib that contains <xref ref-type="corresp">*</xref>, that is, an output of:
=> Andricioaei
=> Ioan
I've currently implemented this without referring to ref-type but rather selecting the asterisk within the xref tag (either is appropriate).

The problem is actually with your XPath for getting the surname and given name, i.e., the XPath is incorrect in these lines:
puts surname = corrdetails.xpath( "//surname" ).text
puts givennames = corrdetails.xpath("//given-names").text
Starting the XPath with // means to look for the node anywhere in the document. You only want to look within the corrdetails node, which means the XPath needs to start with a dot, e.g., .//.
Change the two lines to:
puts surname = corrdetails.xpath( ".//surname" ).text
puts givennames = corrdetails.xpath(".//given-names").text
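
The same idea, searching relative to an already selected node rather than the whole document, can be sketched in Python with the standard library. This is an illustration only: the XML is a trimmed copy of the sample from the question, and the function name corresponding_authors is made up.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the contrib-group from the question.
XML = """<contrib-group>
  <contrib contrib-type="author">
    <name><surname>Wereszczynski</surname><given-names>Jeff</given-names></name>
    <xref rid="aff1" ref-type="aff"/>
  </contrib>
  <contrib contrib-type="author">
    <name><surname>Andricioaei</surname><given-names>Ioan</given-names></name>
    <xref rid="aff1" ref-type="aff"/>
    <xref ref-type="corresp" rid="cor1">*</xref>
  </contrib>
</contrib-group>"""


def corresponding_authors(xml_text):
    """Return (surname, given names) for each contrib with a corresp xref child."""
    root = ET.fromstring(xml_text)
    authors = []
    for contrib in root.iter("contrib"):
        # Keep only contribs that have a direct xref child of ref-type "corresp".
        if contrib.find("xref[@ref-type='corresp']") is None:
            continue
        # The leading "." keeps the search inside this contrib element.
        authors.append((contrib.findtext(".//surname"),
                        contrib.findtext(".//given-names")))
    return authors


print(corresponding_authors(XML))
```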

Searching for tags while parsing Wordpress XML with Nokogiri

I have an XML file of a Wordpress blog that consists of quotes:
<item>
<title>Brothers Karamazov</title>
<content:encoded><![CDATA["I think that if the Devil doesn't exist and, consequently, man has created him, he has created him in his own image and likeness."]]></content:encoded>
<category domain="post_tag" nicename="dostoyevsky"><![CDATA[Dostoyevsky]]></category>
<category domain="post_tag" nicename="humanity"><![CDATA[humanity]]></category>
<category domain="category" nicename="quotes"><![CDATA[quotes]]></category>
<category domain="post_tag" nicename="the-devil"><![CDATA[the Devil]]></category>
</item>
The things I'm trying to extract are title, author, content and tags. Here's my code so far:
require "rubygems"
require "nokogiri"
doc = Nokogiri::XML(File.open("/Users/charliekim/Downloads/quotesfromtheunderground.wordpress.2013-04-14.xml"))
doc.css("item").each do |item|
  title = item.at_css("title").text
  tag = item.at_xpath("category").text
  content = item.at_xpath("content:encoded").text
  # each post will later be pushed to an array, but I'm not worried about that yet, so for now...
  puts "#{title} #{tag}"
end
I'm struggling to get all the tags from each item. I'm getting returns of something like Brothers Karamazov Dostoyevsky. I'm not worried about how it's formatted as it's only a test to see that it's picking things up correctly. Anyone know how I can go about this?
I also want to treat any capitalized tag as the author, so if you know how to do that it would help, too, although I haven't even tried it yet.
EDIT: I changed the code to this:
doc.css("item").each do |item|
  title = item.at_css("title").text
  content = item.at_xpath("content:encoded").text
  tag = item.at_xpath("category").each do |category|
    category
  end
  puts "#{title}: #{tag}"
end
which returns:
Brothers Karamazov: [#<Nokogiri::XML::Attr:0x80878518 name="domain" value="post_tag">, #<Nokogiri::XML::Attr:0x80878504 name="nicename" value="dostoyevsky">]
and which seems a bit more manageable. It screws up my plans for taking the Author from a capitalized tag, but, well, it's not so big of a deal. How could I pull just the second value?

You're using at_xpath and expecting it to return more than one result, when the at_ methods only return the first result.
You want something like:
tags = item.xpath("category").map(&:text)
which will return an array.
As for identifying the author, you can use a regex to select the items that start with a capital letter:
author = tags.select{|w| w =~ /^[A-Z]/}
Which will choose any capitalized tags. This leaves the tags untouched. If you wanted instead to separate the authors from the tags, you can use partition:
author, tags = item.xpath("category").map(&:text).partition{|w| w =~ /^[A-Z]/}
Note that in the above examples, author is an array and will contain all matching items (i.e. more than one capitalized tag).
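
For comparison, the collect-then-partition step can be sketched in Python with the standard library. This is a sketch under assumptions: the item is a cut-down copy of the one in the question (the content:encoded element is left out because its namespace prefix is not declared in the snippet), and split_authors_and_tags is a made-up helper name.

```python
import re
import xml.etree.ElementTree as ET

# Cut-down copy of the item from the question.
ITEM = """<item>
  <title>Brothers Karamazov</title>
  <category domain="post_tag" nicename="dostoyevsky"><![CDATA[Dostoyevsky]]></category>
  <category domain="post_tag" nicename="humanity"><![CDATA[humanity]]></category>
  <category domain="category" nicename="quotes"><![CDATA[quotes]]></category>
  <category domain="post_tag" nicename="the-devil"><![CDATA[the Devil]]></category>
</item>"""


def split_authors_and_tags(item_xml):
    """Return (authors, tags); labels starting with a capital letter count as authors."""
    item = ET.fromstring(item_xml)
    # findall returns every category child, not just the first one.
    labels = [c.text for c in item.findall("category")]
    authors = [t for t in labels if re.match(r"[A-Z]", t)]
    tags = [t for t in labels if not re.match(r"[A-Z]", t)]
    return authors, tags


print(split_authors_and_tags(ITEM))
```

As in the Ruby version, authors is a list and may hold more than one entry if several tags are capitalized.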
