Scrapy / XPath: how to extract ONLY text from descendants and self

I have the following simple, nested structure:
<main>
<em>bla-bla</em>
<div class="1">1.1</div>
<div class="2">2.1</div>
<div class="2">2.2</div>
<div class="1">1.2</div>
<div class="2">
<span>
<em>2.3</em>
</span>
</div>
<div class="2">2.4</div>
</main>
I would now like to extract all text from all div nodes with class="2", but I struggle with the nested nodes (span, em, etc.).
The expected output should be:
2.1
2.2
2.3
2.4
Trying something like:
//div[contains(@class,"2")]/text()
gives
2.1
2.2
<div class="2"><span><em>2.3</em></span></div>
<div class="2"><span><em>2.3</em></span></div>
2.4
Instead of using straight XPATH, I also tried using several steps in Scrapy, like:
divs = response.xpath('//div[contains(@class,"2")]')
for div in divs:
    # now check somehow that the div contains an "em" node
Using
div.xpath("//em")
does not work, since it returns all em nodes in the document. Using div.extract() here and searching the returned string, I could of course find it, but that is rather a hack and does not look like the proper Scrapy solution.
Any suggestions on how to solve this, either directly with XPath or with Scrapy in general, would be greatly appreciated.

What do you think about [i.strip() for i in response.xpath('//div[contains(@class, "2")]//text()').extract() if i.strip()]?
Without stripping, it also returns some whitespace-only strings:
>>> response.xpath('//div[contains(@class, "2")]//text()').extract()
[u'2.1', u'2.2', u'\n ', u'\n ', u'2.3', u'\n ', u'\n ', u'2.4']
So I filter them with strip:
>>> [i.strip() for i in response.xpath('//div[contains(@class, "2")]//text()').extract() if i.strip()]
[u'2.1', u'2.2', u'2.3', u'2.4']
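The same idea can be tried without a Scrapy response. The sketch below is only an illustration: it uses Python's standard-library ElementTree rather than Scrapy selectors, and the helper name descendant_texts is made up for this example. ElementTree supports only a subset of XPath, so it matches the class by exact equality instead of contains(), and it uses itertext() in place of //text().

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, parsed as XML.
MARKUP = """<main>
<em>bla-bla</em>
<div class="1">1.1</div>
<div class="2">2.1</div>
<div class="2">2.2</div>
<div class="1">1.2</div>
<div class="2">
<span>
<em>2.3</em>
</span>
</div>
<div class="2">2.4</div>
</main>"""


def descendant_texts(root, css_class):
    """Return the stripped, non-empty text of every div with the given class."""
    texts = []
    for div in root.findall(f".//div[@class='{css_class}']"):
        # itertext() yields the text of the element and of all its
        # descendants, which is what //text() selects in full XPath.
        for chunk in div.itertext():
            chunk = chunk.strip()
            if chunk:  # drop whitespace-only text nodes
                texts.append(chunk)
    return texts


root = ET.fromstring(MARKUP)
print(descendant_texts(root, "2"))  # ['2.1', '2.2', '2.3', '2.4']
```

The whitespace-only chunks come from the line breaks around the nested span and em, which is why the filter on the stripped value is needed.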

Related

XPATH Grab following sibling and stop at the next sibling in the tree

<div class="season-rate season-summer">
<p class="heading">Summer</p>
<p class="subHeading">from</p>
<p class="price">€180,000<span>p/week + expenses</span><span class="approx">Approx
$211,500</span></p>
</div>
I am trying to grab the price here (€180,000), given that the heading text is "Summer":
//p[contains(.,'Summer')]/following-sibling::p[2]
This returns:
€180,000p/week + expensesApprox
$211,500
But I only want:
€180,000
So I want to stop the XPATH before this next span class:
<span class="approx">Approx
$211,500</span>
I am trying variations of this without any luck!
//p[contains(.,'Summer')]/following-sibling::p[2] [not(preceding-sibling::span[contains(.,'p/week')])]
You can try this expression to get price only
//p[.="Summer"]/following-sibling::p[@class="price"]/text()
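I think this should do it:
//div[p="Summer"]/p[@class="price"]/text()
The text() step selects only the text nodes that are direct children of the p, so the content of the nested span elements is already excluded. To be explicit about taking only the leading text node:
//div[p="Summer"]/p[@class="price"]/text()[1]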
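To see why selecting the direct text of the price paragraph is enough, here is a small sketch using Python's standard-library ElementTree (an assumption of this example, not something the question uses). The function name summer_price is hypothetical. In ElementTree, Element.text is the text that appears before the first child element, which corresponds to the first text() node of the p.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question.
MARKUP = """<div class="season-rate season-summer">
<p class="heading">Summer</p>
<p class="subHeading">from</p>
<p class="price">€180,000<span>p/week + expenses</span><span class="approx">Approx
$211,500</span></p>
</div>"""


def summer_price(div):
    """Return the leading text of p.price if the div's heading is 'Summer'."""
    heading = div.find("p[@class='heading']")
    if heading is None or (heading.text or "").strip() != "Summer":
        return None
    price = div.find("p[@class='price']")
    if price is None:
        return None
    # Element.text holds only the text before the first child element,
    # so the content of the two span elements is not included.
    return (price.text or "").strip()


div = ET.fromstring(MARKUP)
print(summer_price(div))  # €180,000
```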

Obtain an XPath element containing another element with a specific class

Hello I have this HTML:
<div class="_3Vhpd"><span>Your commerce Data</span>
<a class="n3G0C" href='http://www.webadress.......'><span>Some Text</span></a>
</div>
I tried to obtain the href of the a tag as follows:
parser.xpath('//div[contains(@class,"_3Vhpd")]//following-sibling::*[a[@class="n3G0C"]]/@href ')
but I received an empty list ('[]'). Maybe that is because the a is not directly after the div but after a span...
First, your sample HTML doesn't have a class="n3G0C", but assuming you fix it, this XPath expression should work:
//div[contains(@class,"_3Vhpd")]//following-sibling::a/@href
Output:
http://www.webadress.......
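The a element is a descendant of the div (and a sibling of the span), so a plain descendant search from the div also reaches it; in full XPath that would be //div[contains(@class,"_3Vhpd")]//a[@class="n3G0C"]/@href. The sketch below shows the same lookup with Python's standard-library ElementTree, which is an assumption of this example (the question's parser object is not shown). The helper name link_hrefs is hypothetical, the closing span tag in the sample is repaired, and the elided URL is kept exactly as posted.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, with the closing </span> tag repaired.
MARKUP = """<div class="_3Vhpd"><span>Your commerce Data</span>
<a class="n3G0C" href='http://www.webadress.......'><span>Some Text</span></a>
</div>"""


def link_hrefs(root, div_class, a_class):
    """Collect href values of a.a_class elements inside div.div_class."""
    hrefs = []
    for div in root.iter("div"):
        # Treat the class attribute as a space-separated token list.
        if div_class in (div.get("class") or "").split():
            # .//a searches all descendants of the div, not its siblings.
            for a in div.findall(f".//a[@class='{a_class}']"):
                hrefs.append(a.get("href"))
    return hrefs


root = ET.fromstring(MARKUP)
print(link_hrefs(root, "_3Vhpd", "n3G0C"))
```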

Access two elements simultaneously in Nokogiri

I have some weirdly formatted HTML files which I have to parse.
This is my Ruby code:
require 'nokogiri'

File.open('2.html', 'r:utf-8') do |f|
  @parsed = Nokogiri::HTML(f, nil, 'windows-1251')
  puts @parsed.xpath('//span[@id="f5"]//div[@id="f5"]').inner_text
end
I want to parse a file containing:
<span style="position:absolute;top:156pt;left:24pt" id=f6>36.4.1.1. варенье, джемы, конфитюры, сиропы</span>
<div style="position:absolute;top:167.6pt;left:24.7pt;width:709.0;height:31.5;padding-top:23.8;font:0pt Arial;border-width:1.4; border-style:solid;border-color:#000000;"><table></table></div>
<span style="position:absolute;top:171pt;left:28pt" id=f5>003874</span>
<div style="position:absolute;top:171pt;left:99pt" id=f5>ВАРЕНЬЕ "ЭКОПРОДУКТ" ЧЕРНАЯ СМОРОДИНА</div>
<div style="position:absolute;top:180pt;left:99pt" id=f5>325гр. </div>
<div style="position:absolute;top:167.6pt;left:95.8pt;width:2.8;height:31.5;padding-top:23.8;font:0pt Arial;border-width:0 0 0 1.4; border-style:solid;border-color:#000000;"><table></table></div>
I need to select either <div> or <span> elements with id="f5". With my current XPath selector that is not possible. If I remove //span[@id="f5"], for example, then the divs are selected correctly. I can output them one after another:
puts @parsed.xpath('//div[@id="f5"]').inner_text
puts @parsed.xpath('//span[@id="f5"]').inner_text
but then the order would be a complete mess. Each parsed span has to stay in the same position relative to the divs as in the original file.
Am I missing some basics? I haven't found anything on the web regarding parallel parsing of two elements. Most posts are concerned with parsing two classes of a div for example, but not two different elements at a time.
If I understand this correctly, you can use the following XPath:
//*[self::div or self::span][@id="f5"]
The XPath above finds every element named either div or span whose id attribute equals "f5", and returns the matches in document order.
Output:
<span id="f5" style="position:absolute;top:171pt;left:28pt">003874</span>
<div id="f5" style="position:absolute;top:171pt;left:99pt">ВАРЕНЬЕ "ЭКОПРОДУКТ" ЧЕРНАЯ СМОРОДИНА</div>
<div id="f5" style="position:absolute;top:180pt;left:99pt">325гр.</div>
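The key point is that a single query over both element names preserves document order, whereas two separate queries do not. As a rough illustration in Python (standard-library ElementTree, not Nokogiri), the sketch below walks a simplified, well-formed stand-in for the question's file. The sample markup, its placeholder texts, and the helper name texts_in_document_order are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the question's HTML
# (attribute values quoted, style attributes omitted, texts shortened).
MARKUP = """<body>
<span id="f6">36.4.1.1.</span>
<div><table></table></div>
<span id="f5">003874</span>
<div id="f5">product name</div>
<div id="f5">325</div>
<div><table></table></div>
</body>"""


def texts_in_document_order(root, names, wanted_id):
    """Return the text of every element in `names` with the given id."""
    # Element.iter() visits elements in document order, so spans and divs
    # come out interleaved exactly as they appear in the source.
    return [
        el.text
        for el in root.iter()
        if el.tag in names and el.get("id") == wanted_id
    ]


root = ET.fromstring(MARKUP)
print(texts_in_document_order(root, {"div", "span"}, "f5"))
# ['003874', 'product name', '325']
```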

Hidden XPATH to find element of text? Ruby-selenium

I am trying to find the xpath of the element below, so that I can later get the text using Ruby Selenium-webdriver (ie. helloPage.mainHeader.get_text).
<div class="container">
<div class="template-section">
<div class="front">
<h3 class="containerHeading">
<i class="icon_image"></i>
"Hello world <-----------------------3 whitespaces
"
</h3>
</div>
</div>
</div>
I've worked on the XPaths, but every time I rerun the test it times out, as if the element did not exist. It is clearly visible in the UI and not hidden.
Why is my XPath wrong? I have tried the following:
//div[@class='container']//div[@class='template-section']//div[@class='front']//h3[@class='containerHeading']
//div[@class='front']//h3[@class='containerHeading']
//h3[@class='containerHeading']
I did put sleep prior to executing helloPage.mainHeader.get_text, where mainHeader has the XPath expression, and that didn't work. Is there something mysterious about the Hello World text? The format is indeed like the way I typed it out.
All of your XPaths look correct to me. I think the element has not finished loading when you try to find it, so try an explicit wait, for example:
require 'selenium-webdriver'
wait = Selenium::WebDriver::Wait.new(:timeout => 10)
# find_element raises NoSuchElementError until the element exists, and Wait retries on that error
wait.until { driver.find_element(:xpath, "Any of your above mentioned xpaths") }
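For readers unfamiliar with explicit waits, the mechanism is a polling loop: call the lookup repeatedly, swallow "not found yet" errors, and give up after a timeout. The Python sketch below illustrates only that idea and is not Selenium's implementation. The names wait_until and find_heading, and the use of LookupError as the "not found" error, are assumptions made for this example.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.2, ignored=(LookupError,)):
    """Poll condition() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except ignored:
            pass  # "not found yet" is a reason to retry, not to fail
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Hypothetical stand-in for a page that renders its heading late:
# the first two lookups fail, the third succeeds.
attempts = {"count": 0}


def find_heading():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise LookupError("heading not rendered yet")
    return "Hello world"


print(wait_until(find_heading, timeout=2.0, interval=0.01))  # Hello world
```

Note that the condition must either raise or return a falsy value while the element is missing; a lookup that returns an empty collection which counts as truthy would end the wait immediately.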

How to get a list of concatenated text nodes

My goal is to query an XML structure, using only one XPath evaluation, to get a list of strings, each containing the concatenation of text3 and text5 for one "my_class" div.
The structure example is given below:
<div>
<div>
<div class="my_class">
<div class="my_class_1"></div>
<div class="my_class_2">text2</div>
<div class="my_class_3">
text3
<div class="my_class_4">text4</div>
<div class="my_class_5">text5</div>
</div>
</div>
<div class="my_class_6"></div>
</div>
<div>
<div class="my_class">
<div class="my_class_1"></div>
<div class="my_class_2">text12</div>
<div class="my_class_3">
text13
<div class="my_class_4">text14</div>
<div class="my_class_5">text15</div>
</div>
</div>
</div>
</div>
This means I want to get this list of results:
- in index 0 => text3 text5
- in index 1 => text13 text15
Currently I can only get either the my_class nodes, which include the text12 that I want to exclude, or a list of the individual strings, not concatenated.
How could I proceed?
Thanks in advance for helping.
EDIT: I removed text4 and text14 from my search to make the example exact.
EDIT: Now the question has changed...
XPath 1.0: There is no such thing as a "list of strings" data type. You can use this expression to select all the container elements of the text nodes you want:
/div/div/div[@class='my_class']/div[@class='my_class_3']
Then, for each selected element, use a relative XPath or the proper DOM method of your host language to get the text nodes you want and concatenate their string values:
text()[1]|div[@class='my_class_5']
XPath 2.0: There is a sequence data type.
/div/div/div[@class='my_class']
/div[@class='my_class_3']
/concat(text()[1],div[@class='my_class_5'])
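The XPath 1.0 approach above (select the my_class_3 containers, then combine the wanted pieces in the host language) might look like the following in Python. This is a sketch using the standard-library ElementTree, which is an assumption of this example since the question does not name a host language. The function name concatenated_texts is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample structure from the question.
MARKUP = """<div>
<div>
<div class="my_class">
<div class="my_class_1"></div>
<div class="my_class_2">text2</div>
<div class="my_class_3">
text3
<div class="my_class_4">text4</div>
<div class="my_class_5">text5</div>
</div>
</div>
<div class="my_class_6"></div>
</div>
<div>
<div class="my_class">
<div class="my_class_1"></div>
<div class="my_class_2">text12</div>
<div class="my_class_3">
text13
<div class="my_class_4">text14</div>
<div class="my_class_5">text15</div>
</div>
</div>
</div>
</div>"""


def concatenated_texts(root):
    """Return one 'own text + my_class_5 text' string per my_class_3 div."""
    results = []
    path = "./div/div[@class='my_class']/div[@class='my_class_3']"
    for container in root.findall(path):
        # Element.text is the text before the first child: text()[1].
        own_text = (container.text or "").strip()
        fifth = container.find("div[@class='my_class_5']")
        fifth_text = (fifth.text or "").strip() if fifth is not None else ""
        results.append(f"{own_text} {fifth_text}".strip())
    return results


root = ET.fromstring(MARKUP)
print(concatenated_texts(root))  # ['text3 text5', 'text13 text15']
```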
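Could you not just use:
//div[@class='my_class']/div[@class='my_class_3']
And then get the .innerText from that? There might be a bit of spacing cleanup to do, but it should contain all the inside text (including that from the class 4 and 5 divs) without the tags.
Edit: After clarification
concat(/div/div/div[@class='my_class']/div[@class='my_class_3']/text()[1], ' ', /div/div/div[@class='my_class']/div[@class='my_class_3']/div[@class='my_class_5']/text())
That might work, but note that in XPath 1.0 concat() converts each node-set argument to the string value of its first node, so this yields a single string for the first my_class div only.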

Resources