Obtain an XPath element containing another element with a specific class - xpath

Hello I have this HTML:
<div class="_3Vhpd"><span>Your commerce Data</span>
<a class="n3G0C" href='http://www.webadress.......'><span>Some Text</span></a>
</div>
I tried to obtain the href as follows:
parser.xpath('//div[contains(@class,"_3Vhpd")]//following-sibling::*[a[@class="n3G0C"]]/@href')
but I received an empty list '[]'. Maybe because the <a> is not directly after the div but after a span...

First, your sample HTML doesn't have a class="n3G0C", but assuming you fix it, this XPath expression should work:
//div[contains(@class,"_3Vhpd")]//following-sibling::a/@href
Output:
http://www.webadress.......
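As a rough cross-check, here is a minimal sketch of the same lookup using only Python's standard-library xml.etree.ElementTree. ElementTree supports just a subset of XPath (no contains(), no following-sibling axis, no attribute steps), so this swaps in an equivalent child path rather than the expression above. The <root> wrapper is a hypothetical container added for illustration, and the href is the elided address from the question, left as-is.

```python
import xml.etree.ElementTree as ET

# The question's markup with the closing </span> repaired, wrapped in a
# hypothetical <root> element so the <div> can be searched for.
snippet = """
<root>
<div class="_3Vhpd"><span>Your commerce Data</span>
<a class="n3G0C" href="http://www.webadress......."><span>Some Text</span></a>
</div>
</root>
""".strip()

root = ET.fromstring(snippet)

# The <a> is a child of the <div>, so a plain child step with attribute
# predicates is enough; the href is then read from the element itself.
links = [a.get("href")
         for a in root.findall(".//div[@class='_3Vhpd']/a[@class='n3G0C']")]
print(links)
```

Note that [@class='_3Vhpd'] is an exact string comparison, whereas contains(@class, "_3Vhpd") in full XPath is a substring test.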

Related

XPATH Grab following sibling and stop at the next sibling in the tree

<div class="season-rate season-summer">
<p class="heading">Summer</p>
<p class="subHeading">from</p>
<p class="price">€180,000<span>p/week + expenses</span><span class="approx">Approx
$211,500</span></p>
</div>
I am trying to grab the price here (€180,000), based on the heading text being "Summer":
//p[contains(.,'Summer')]/following-sibling::p[2]
This returns:
€180,000p/week + expensesApprox
$211,500
But I only want:
€180,000
So I want to stop the XPATH before this next span class:
<span class="approx">Approx
$211,500</span>
I am trying variations of this without any luck!
//p[contains(.,'Summer')]/following-sibling::p[2] [not(preceding-sibling::span[contains(.,'p/week')])]
You can try this expression to get the price only:
//p[.="Summer"]/following-sibling::p[@class="price"]/text()
I think this should do it:
//div[p[.="Summer"]]/p[@class="price"]/text()[not(self::span)]
or even simpler:
//div[p[.="Summer"]]/p[@class="price"]/text()
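The idea behind these answers is that text() selects only the text nodes that are direct children of the price paragraph, so the <span> contents are left out. A minimal sketch of that idea, assuming Python's standard-library xml.etree.ElementTree (which has no text() step; its .text attribute, the text before the first child element, plays that role here):

```python
import xml.etree.ElementTree as ET

snippet = """
<div class="season-rate season-summer">
<p class="heading">Summer</p>
<p class="subHeading">from</p>
<p class="price">€180,000<span>p/week + expenses</span><span class="approx">Approx
$211,500</span></p>
</div>
""".strip()

div = ET.fromstring(snippet)  # the <div> is the root element here

price = None
# Only read the price when this block's heading is "Summer".
if div.findtext("p[@class='heading']") == "Summer":
    # .text holds the text that precedes the first child element,
    # so the text inside both <span> elements is not included.
    price = div.find("p[@class='price']").text.strip()

print(price)
```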

Scrapy / XPATH : how to extract ONLY text from descendants and self

I have the following simple, nested structure:
<main>
<em>bla-bla</em>
<div class="1">1.1</div>
<div class="2">2.1</div>
<div class="2">2.2</div>
<div class="1">1.2</div>
<div class="2">
<span>
<em>2.3</em>
</span>
</div>
<div class="2">2.4</div>
</main>
I would now like to extract all text from all nodes, but I struggle with the nested node (<span><em> etc.).
The expected output should be:
2.1
2.2
2.3
2.4
Trying something like:
//div[contains(@class,"2")]/text()
gives
2.1
2.2
<div class="2"><span><em>2.3</em></span></div>
<div class="2"><span><em>2.3</em></span></div>
2.4
Instead of using straight XPATH, I also tried using several steps in Scrapy, like:
divs = response.xpath('//div[contains(@class,"2")]')
for div in divs:
    # now check somehow that the div contains an "em" node
Using
div.xpath("//em")
does not work, since it gives all nodes. Using div.extract() here and looking at the returned string, I could of course find it using string search, but this is rather a hack and does not look like the proper Scrapy solution.
Any suggestions on how to solve this either directly with XPath or with Scrapy in general would be greatly appreciated.
What do you think about [i.strip() for i in response.xpath('//div[contains(@class, "2")]//text()').extract() if i.strip()]?
Without stripping it gives some empty cases also:
>>> response.xpath('//div[contains(@class, "2")]//text()').extract()
[u'2.1', u'2.2', u'\n ', u'\n ', u'2.3', u'\n ', u'\n ', u'2.4']
So I filter them with strip:
>>> [i.strip() for i in response.xpath('//div[contains(@class, "2")]//text()').extract() if i.strip()]
[u'2.1', u'2.2', u'2.3', u'2.4']
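The same collect-then-strip approach can be sketched without Scrapy, assuming Python's standard-library xml.etree.ElementTree. Its itertext() method walks an element and all of its descendants, which is roughly what //text() does in the answer above; the predicate [@class='2'] is an exact match rather than contains(), because ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

snippet = """<main>
<em>bla-bla</em>
<div class="1">1.1</div>
<div class="2">2.1</div>
<div class="2">2.2</div>
<div class="1">1.2</div>
<div class="2">
<span>
<em>2.3</em>
</span>
</div>
<div class="2">2.4</div>
</main>"""

root = ET.fromstring(snippet)

texts = []
for div in root.findall(".//div[@class='2']"):
    # itertext() yields the div's own text and the text of every
    # descendant, including whitespace-only chunks between tags.
    for chunk in div.itertext():
        chunk = chunk.strip()
        if chunk:  # drop the whitespace-only chunks
            texts.append(chunk)

print(texts)
```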

What is a valid XPath to extract a link by div class name?

Here is the HTML code:
<div class="poster">
<a href="/title/tt2091935/mediaviewer/rm4278707200?ref_=tt_ov_i"> <img alt="Mr. Right Poster" title="Mr. Right Poster" src="http://ia.media-imdb.com/images/M/MV5BOTcxNjUyOTMwOV5BMl5BanBnXkFtZTgwMzUxMDk4NzE@._V1_UX182_CR0,0,182,268_AL_.jpg" itemprop="image">
</a> </div>
I want to know the exact XPath to get the href link.
I tried //a/@href[@class='poster'] but it doesn't work.
The <div> contains the <a> so you can use that to navigate:
//div[@class='poster']/a/@href
Remember that the "poster" class is defined on the <div> not on the <a> so that's where you need to apply the predicate.
//div returns all <div> elements
[@class='poster'] is a predicate that filters by class
/a returns all <a> elements that are children of those <div>s
/@href gives us the attribute we want
Depending on the system you're using, you might need to wrap the whole expression in string() to bring back the attribute value rather than the attribute node.
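The step-by-step breakdown above can be tried out with a small sketch, assuming Python's standard-library xml.etree.ElementTree. It cannot select attributes directly, so the path stops at the <a> element and the href is read with get(). The markup is a trimmed, well-formed version of the question's HTML; the <body> wrapper and the second link are hypothetical additions for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed markup: the <img> is self-closed and most attributes are dropped
# so that the snippet parses as XML.
snippet = """
<body>
<div class="poster">
<a href="/title/tt2091935/mediaviewer/rm4278707200?ref_=tt_ov_i"> <img alt="Mr. Right Poster"/>
</a> </div>
<a href="/elsewhere">a link outside the poster div</a>
</body>
""".strip()

root = ET.fromstring(snippet)

# Filter on the <div> that carries the class, then step down to its <a> child.
hrefs = [a.get("href") for a in root.findall(".//div[@class='poster']/a")]

# Putting the class test on the <a> finds nothing, because no link has that class.
on_link = root.findall(".//a[@class='poster']")

print(hrefs)
print(on_link)
```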

Access two elements simultaneously in Nokogiri

I have some weirdly formatted HTML files which I have to parse.
This is my Ruby code:
require 'nokogiri'

File.open('2.html', 'r:utf-8') do |f|
  @parsed = Nokogiri::HTML(f, nil, 'windows-1251')
  puts @parsed.xpath('//span[@id="f5"]//div[@id="f5"]').inner_text
end
I want to parse a file containing:
<span style="position:absolute;top:156pt;left:24pt" id=f6>36.4.1.1. варенье, джемы, конфитюры, сиропы</span>
<div style="position:absolute;top:167.6pt;left:24.7pt;width:709.0;height:31.5;padding-top:23.8;font:0pt Arial;border-width:1.4; border-style:solid;border-color:#000000;"><table></table></div>
<span style="position:absolute;top:171pt;left:28pt" id=f5>003874</span>
<div style="position:absolute;top:171pt;left:99pt" id=f5>ВАРЕНЬЕ "ЭКОПРОДУКТ" ЧЕРНАЯ СМОРОДИНА</div>
<div style="position:absolute;top:180pt;left:99pt" id=f5>325гр. </div>
<div style="position:absolute;top:167.6pt;left:95.8pt;width:2.8;height:31.5;padding-top:23.8;font:0pt Arial;border-width:0 0 0 1.4; border-style:solid;border-color:#000000;"><table></table></div>
I need to select either <div> or <span> with id "f5". With my current XPath selector that is not possible. If I remove //span[@id="f5"], for example, then the divs are selected correctly. I can output them one after another:
puts @parsed.xpath('//div[@id="f5"]').inner_text
puts @parsed.xpath('//span[@id="f5"]').inner_text
but then the order would be a complete mess. The parsed span has to be directly underneath the div, as in the original file.
Am I missing some basics? I haven't found anything on the web regarding parallel parsing of two elements. Most posts are concerned with parsing two classes of a div for example, but not two different elements at a time.
If I understand this correctly, you can use the following XPath:
//*[self::div or self::span][@id="f5"]
The XPath above will find every element named either div or span whose id attribute equals "f5", in document order.
output :
<span id="f5" style="position:absolute;top:171pt;left:28pt">003874</span>
<div id="f5" style="position:absolute;top:171pt;left:99pt">ВАРЕНЬЕ "ЭКОПРОДУКТ" ЧЕРНАЯ СМОРОДИНА</div>
<div id="f5" style="position:absolute;top:180pt;left:99pt">325гр.</div>
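The point of the single expression is that both element names are matched in one pass, so the results keep their document order. A minimal sketch of that behaviour, written in Python with the standard-library xml.etree.ElementTree rather than Ruby and Nokogiri: ElementTree has no self:: axis, so the name test is done in plain Python. The <body> wrapper and the shortened text values are placeholders, not the original file content.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the question's file: quoted ids, no style
# attributes, placeholder text for the two descriptions.
snippet = """
<body>
<span id="f6">heading</span>
<div><table></table></div>
<span id="f5">003874</span>
<div id="f5">first description</div>
<div id="f5">325 g</div>
</body>
""".strip()

root = ET.fromstring(snippet)

# iter() visits elements in document order, so one filtered pass keeps
# the span and the divs interleaved exactly as they appear in the source.
matches = [el for el in root.iter()
           if el.tag in ("div", "span") and el.get("id") == "f5"]

texts = [el.text for el in matches]
print(texts)
```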

Xpath get text of nested item not working but css does

I'm making a crawler with Scrapy and wondering why my xpath doesn't work when my CSS selector does? I want to get the number of commits from this html:
<li class="commits">
<a data-pjax="" href="/samthomson/flot/commits/master">
<span class="octicon octicon-history"></span>
<span class="num text-emphasized">
521
</span>
commits
</a>
</li>
Xpath:
response.xpath('//li[@class="commits"]//a//span[@class="text-emphasized"]//text()').extract()
CSS:
response.css('li.commits a span.text-emphasized').css('::text').extract()
CSS returns the number (unescaped), but XPath returns nothing. Am I using the // for nested elements correctly?
The span's class attribute has two values ("num text-emphasized"), and @class="text-emphasized" only matches when the attribute is exactly that string. Use the contains function to check whether text-emphasized is present:
response.xpath('//li[@class="commits"]//a//span[contains(@class, "text-emphasized")]//text()').extract()[0].strip()
Otherwise, include num as well:
response.xpath('//li[@class="commits"]//a//span[@class="num text-emphasized"]//text()').extract()[0].strip()
Also, I use extract()[0] to retrieve the first string returned by the XPath and strip() to remove the surrounding whitespace, resulting in just the number.
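The difference between comparing the whole class attribute and checking for one class name can be seen in a small sketch, assuming Python's standard-library xml.etree.ElementTree instead of Scrapy. ElementTree has no contains(), so the check is done by splitting the attribute into tokens, which is slightly stricter than a substring test.

```python
import xml.etree.ElementTree as ET

snippet = """
<li class="commits">
<a data-pjax="" href="/samthomson/flot/commits/master">
<span class="octicon octicon-history"></span>
<span class="num text-emphasized">
521
</span>
commits
</a>
</li>
""".strip()

li = ET.fromstring(snippet)

# Exact comparison: the attribute is "num text-emphasized", so nothing matches.
exact = li.findall(".//span[@class='text-emphasized']")

# Token check: split the attribute on whitespace and look for the class name.
by_token = [s for s in li.iter("span")
            if "text-emphasized" in s.get("class", "").split()]

number = by_token[0].text.strip()
print(len(exact), number)
```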
