getting text inside <p> before and after <strong> - xpath

Given the example below, how can I get all the text (i.e. flatten the structure)?
<div class="container">
<p>before_strong
<strong>inside_strong</strong>
after_strong
</p>
<p>just_p</p>
</div>
I tried with the following ones, but they do not return "after_strong", only "before_strong inside_strong":
//div//p
//div/descendant::*
//div/descendant-or-self::*
I'm using this with Python + lxml

Two options:
//div[@class="container"]//text()[normalize-space()]
Output: 4 values (one per non-empty text node)
or
string(//div[@class="container"])
Output: one string
before_strong inside_strong after_strong just_p
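For reference, a minimal sketch of the same flattening in Python. It uses the standard library's xml.etree.ElementTree rather than lxml and assumes the snippet is well-formed XML; itertext() plays the role of //text() here. With lxml, the two expressions above can instead be passed to the xpath() method.

```python
import xml.etree.ElementTree as ET

html = """<div class="container">
<p>before_strong
<strong>inside_strong</strong>
after_strong
</p>
<p>just_p</p>
</div>"""

root = ET.fromstring(html)

# itertext() yields every descendant text node in document order,
# including the text that follows </strong> (its "tail").
pieces = [t.strip() for t in root.itertext() if t.strip()]

# One flattened string, comparable to string(//div[@class="container"])
# after whitespace cleanup.
flat = " ".join(pieces)

print(pieces)
print(flat)
```

The list corresponds to the first option (one value per text node) and the joined string to the second.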

Related

XPath contains whole word only

I saw the existing question with the same title but that was a different question.
Let's say that I want to find elements that have "conGraph" in the class. I have tried
//div[contains(@class,'conGraph')]
It correctly got
<div class='conGraph mr'>
but it also falsely got
<div class='conGraph_wrap'>
which is not the same class at all. For this case only, I could use 'conGraph ' and get away with it, but I would like to know the general solution for future use.
In short, I want to get elements whose class contains "word" like "word", "word word2" or "word3 word", etc, but not like "words" or "fake_word" or "sword". Is that possible?
One option is to use four conditions: an exact match plus three contains() calls that cover the whitespace variants.
The first condition matches an attribute that is exactly the term. The second, third and fourth match the term followed by a space, preceded by a space, or surrounded by spaces.
Data :
<div class='word'></div>
<div class='word word2'></div>
<div class='word word3'></div>
<div class='swords word'></div>
<div class='swords word words'></div>
<div class='words'></div>
<div class='fake_word'></div>
<div class='sword'></div>
XPath :
//div[@class="word" or contains(@class,"word ") or contains(@class," word") or contains(@class," word ")]
Output :
<div class='word'></div>
<div class='word word2'></div>
<div class='word word3'></div>
<div class='swords word'></div>
<div class='swords word words'></div>
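Note that the substring tests above can still match an unwanted class such as 'sword x', because it contains "word ". A commonly used alternative pads the normalized attribute with spaces and tests for the padded word: contains(concat(" ", normalize-space(@class), " "), " word "). The sketch below mirrors both predicates in plain Python to compare them; the function names and the extra 'sword x' value are made up for illustration and are not part of the data above.

```python
def four_condition_match(cls: str, word: str) -> bool:
    # Mirrors: @class="word" or contains(@class,"word ")
    #          or contains(@class," word") or contains(@class," word ")
    return (
        cls == word
        or f"{word} " in cls
        or f" {word}" in cls
        or f" {word} " in cls
    )


def token_match(cls: str, word: str) -> bool:
    # Mirrors: contains(concat(" ", normalize-space(@class), " "), " word ")
    padded = " " + " ".join(cls.split()) + " "
    return f" {word} " in padded


classes = [
    "word", "word word2", "word word3", "swords word",
    "swords word words", "words", "fake_word", "sword",
    "sword x",  # hypothetical extra case, not in the original data
]

print([c for c in classes if four_condition_match(c, "word")])
print([c for c in classes if token_match(c, "word")])
```

On the original eight values both predicates agree; they only differ on the extra 'sword x' case, which the padded version rejects.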

Scrapy / XPATH : how to extract ONLY text from descendants and self

I have the following simple, nested structure:
<main>
<em>bla-bla</em>
<div class="1">1.1</div>
<div class="2">2.1</div>
<div class="2">2.2</div>
<div class="1">1.2</div>
<div class="2">
<span>
<em>2.3</em>
</span>
</div>
<div class="2">2.4</div>
</main>
I would now like to extract all text from the class "2" nodes, but I am struggling with the nested nodes (<span>, <em>, etc.).
The expected output should be:
2.1
2.2
2.3
2.4
Trying something like:
//div[contains(@class,"2")]/text()
gives
2.1
2.2
<div class="2"><span><em>2.3</em></span></div>
<div class="2"><span><em>2.3</em></span></div>
2.4
Instead of using straight XPATH, I also tried using several steps in Scrapy, like:
divs = response.xpath('//div[contains(@class, "2")]')
for div in divs:
    # now check somehow that the div contains an "em" node
Using
div.xpath("//em")
does not work, since it gives all the em nodes in the document. Using div.extract() here and looking at the returned string, I could of course find it with a string search, but this is rather a hack and does not look like the proper Scrapy solution.
Any suggestions on how to solve this, either directly with XPath or with Scrapy in general, would be greatly appreciated.
What do you think about [i.strip() for i in response.xpath('//div[contains(@class, "2")]//text()').extract() if i.strip()]?
Without stripping, it also returns some whitespace-only entries:
>>> response.xpath('//div[contains(@class, "2")]//text()').extract()
[u'2.1', u'2.2', u'\n ', u'\n ', u'2.3', u'\n ', u'\n ', u'2.4']
So I filter them with strip:
>>> [i.strip() for i in response.xpath('//div[contains(@class, "2")]//text()').extract() if i.strip()]
[u'2.1', u'2.2', u'2.3', u'2.4']
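A small standard-library sketch of the two ideas in this thread, assuming the markup is well-formed XML. ElementTree only supports an exact [@class='2'] test rather than contains(), so that part is a simplification. It also shows why //em returned every em node: a path starting with // is evaluated from the document root, while .//em is relative to the current div.

```python
import xml.etree.ElementTree as ET

doc = """<main>
<em>bla-bla</em>
<div class="1">1.1</div>
<div class="2">2.1</div>
<div class="2">2.2</div>
<div class="1">1.2</div>
<div class="2">
<span>
<em>2.3</em>
</span>
</div>
<div class="2">2.4</div>
</main>"""

root = ET.fromstring(doc)
divs = root.findall(".//div[@class='2']")

# Collect the text of each div and of all its descendants,
# dropping whitespace-only nodes (same effect as the strip() filter above).
texts = []
for div in divs:
    texts.extend(t.strip() for t in div.itertext() if t.strip())

# ".//em" is relative to the div, so only nested <em> elements count.
has_em = [div.find(".//em") is not None for div in divs]

print(texts)
print(has_em)
```

In Scrapy the equivalent relative query would be div.xpath('.//em'), with the leading dot.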

Access two elements simultaneously in Nokogiri

I have some weirdly formatted HTML files which I have to parse.
This is my Ruby code:
require 'nokogiri'

File.open('2.html', 'r:utf-8') do |f|
  @parsed = Nokogiri::HTML(f, nil, 'windows-1251')
  puts @parsed.xpath('//span[@id="f5"]//div[@id="f5"]').inner_text
end
I want to parse a file containing:
<span style="position:absolute;top:156pt;left:24pt" id=f6>36.4.1.1. варенье, джемы, конфитюры, сиропы</span>
<div style="position:absolute;top:167.6pt;left:24.7pt;width:709.0;height:31.5;padding-top:23.8;font:0pt Arial;border-width:1.4; border-style:solid;border-color:#000000;"><table></table></div>
<span style="position:absolute;top:171pt;left:28pt" id=f5>003874</span>
<div style="position:absolute;top:171pt;left:99pt" id=f5>ВАРЕНЬЕ "ЭКОПРОДУКТ" ЧЕРНАЯ СМОРОДИНА</div>
<div style="position:absolute;top:180pt;left:99pt" id=f5>325гр. </div>
<div style="position:absolute;top:167.6pt;left:95.8pt;width:2.8;height:31.5;padding-top:23.8;font:0pt Arial;border-width:0 0 0 1.4; border-style:solid;border-color:#000000;"><table></table></div>
I need to select either <div> or <span> with id="f5". With my current XPath selector that is not possible. If I remove //span[@id="f5"], for example, then the divs are selected correctly. I can output them one after another:
puts @parsed.xpath('//div[@id="f5"]').inner_text
puts @parsed.xpath('//span[@id="f5"]').inner_text
but then the order would be a complete mess. The parsed spans and divs have to keep the order they have in the original file.
Am I missing some basics? I haven't found anything on the web regarding parallel parsing of two elements. Most posts are concerned with parsing two classes of a div for example, but not two different elements at a time.
If I understand this correctly, you can use the following XPath:
//*[self::div or self::span][@id="f5"]
The XPath above finds every element named either div or span whose id attribute equals "f5".
Output:
<span id="f5" style="position:absolute;top:171pt;left:28pt">003874</span>
<div id="f5" style="position:absolute;top:171pt;left:99pt">ВАРЕНЬЕ "ЭКОПРОДУКТ" ЧЕРНАЯ СМОРОДИНА</div>
<div id="f5" style="position:absolute;top:180pt;left:99pt">325гр.</div>
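To illustrate why a single query keeps the original order, here is a rough standard-library Python analogue. ElementTree does not support the self:: axis, so the sketch filters root.iter(), which walks elements in document order, which is also the order in which XPath engines typically return node-sets. The markup is a simplified, hypothetical stand-in for the file above.

```python
import xml.etree.ElementTree as ET

# Simplified placeholder markup; the real file has style attributes
# and unquoted ids, which an HTML parser would normalize.
doc = """<body>
<span id="f6">heading</span>
<div><table></table></div>
<span id="f5">003874</span>
<div id="f5">product name</div>
<div id="f5">325</div>
</body>"""

root = ET.fromstring(doc)

# Equivalent in spirit to //*[self::div or self::span][@id="f5"]:
# one pass over all elements, keeping div/span with the wanted id.
matches = [
    el for el in root.iter()
    if el.tag in ("div", "span") and el.get("id") == "f5"
]

result = [(el.tag, el.text) for el in matches]
print(result)
```

Because both element types are selected in one pass, the span stays ahead of the divs exactly as in the source.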

XPath / Selenium can't locate an element using a partial id with contains / start-with

I have the following HTML generated with an AjaxFormLoop.
<div id="phones">
<div class="t-forminjector tapestry-forminjector" id="rowInjector_13b87fdd8b6">
<input id="number_13b87fdd8b6" name="number_13b87fdd8b7" type="text"/>
<a id="removerowlink_13b87fdd8b6" href="#" name="removerowlink_13b87fdd8b6">remove</a>
</div>
<div class="t-forminjector tapestry-forminjector" id="rowInjector_13b87fdda70" style="background-image: none; background-color: rgb(255, 255, 251);">
<input id="number_13b87fdda70" name="number_13b87fdda70" type="text" />
<a id="removerowlink_13b87fdda70" href="#" name="removerowlink_13b87fdda70">remove</a>
</div>
</div>
I'm trying to access the second input field in child 2 using a partial ID, however I have not been successful in getting this to work.
What I've tried thus far.
String path = "//input[contains(@id,'number_')][2]";
String path = "(//input[contains(@id,'number_')])[2]";
I can't even access input 1 using 1 instead of 2. However, if I remove [2] and only use
String path = "//input[contains(@id,'number_')]";
I'm able to access the first field without issue.
If I use the exact id, I'm able to access either field without issue.
I do need to use the id if possible, as there are many more fields in each t-forminjector row that are not present in this example.
Implementation with Selenium.
final String path = "(//input[starts-with(@id,'quantity_')])[2]";
new Wait() {
    @Override
    public boolean until() {
        return isElementPresent(path);
    }
}.wait("Element should be present", TIMEOUT);
Resolved
I've noticed that I can't use starts-with / contains to locate any element within the DOM; however, if I use a complete id, it works.
// Partial ID - fails
//*[starts-with(@id,"quantity_")]
// Exact ID - works
//*[starts-with(@id,"quantity_-112409575185705")]
The generated output you pasted here simply does not contain the string number_ anywhere in it. It does contain Number_ -- note the capital N -- but it's not the first part of the string. Perhaps you meant something like this (which at least selects something):
(//input[contains(@id, 'Number_')])[2]
Or:
(//input[starts-with(@id,'catalogNumber_')])[2]
As Iwburk stated, this was a namespace issue. According to the Selenium API,
http://release.seleniumhq.org/selenium-remote-control/0.9.0/doc/java/com/thoughtworks/selenium/Selenium.html
when using an XPath expression, I needed to use xpath=xpathExpression, changing my query string to:
String path = "xpath=(//input[starts-with(@id,'quantity_')])[2]";
I found a related post here,
Element is found in XPath Checker but not in Selenium
You can't access it because your locator does not identify a unique element on the page.
Use an XPath that makes it unique.
- Your XPath itself looks OK.
More info here:
http://www.seleniumhq.org/docs/appendix_locating_techniques.jsp
Besides the Selenium syntax problem, there's an XPath issue related to the markup structure.
xpath 1: //input[starts-with(@id,'number_')][1]
xpath 2: (//input[starts-with(@id,'number_')])[1]
In the sample below, xpath 1 returns 2 nodes (incorrect) because the input nodes are not siblings, so [1] is applied separately within each parent. xpath 2 is correct: the surrounding parentheses make [1] apply to the whole resulting node-set.
<div id="phones">
<div>
<input id="number_1" name="number_1" type="text"/>
</div>
<div>
<input id="number_2" name="number_2" type="text" />
</div>
</div>
Result without parentheses
/ > xpath //input[starts-with(@id,'number_')][1]
Object is a Node Set :
Set contains 2 nodes:
1 ELEMENT input
ATTRIBUTE id
TEXT
content=number_1
ATTRIBUTE name
TEXT
content=number_1
ATTRIBUTE type
TEXT
content=text
2 ELEMENT input
ATTRIBUTE id
TEXT
content=number_2
ATTRIBUTE name
TEXT
content=number_2
ATTRIBUTE type
TEXT
content=text
In this next sample, the parentheses make no difference because the nodes are siblings
<div id="other">
<input id="pre_1" type="text"/>
<input id="pre_2" type="text" />
<div>a</div>
</div>
With parentheses
/ > xpath (//input[starts-with(@id,'pre_')])[1]
Object is a Node Set :
Set contains 1 nodes:
1 ELEMENT input
ATTRIBUTE id
TEXT
content=pre_1
ATTRIBUTE type
TEXT
content=text
Without parentheses
/ > xpath //input[starts-with(@id,'pre_')][1]
Object is a Node Set :
Set contains 1 nodes:
1 ELEMENT input
ATTRIBUTE id
TEXT
content=pre_1
ATTRIBUTE type
TEXT
content=text
Testing was done with xmllint shell
xmllint --html --shell test.html
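The same positional behaviour can be reproduced with a short Python sketch. It uses the standard library's ElementTree, whose limited XPath has no starts-with(), so the predicate is reduced to a plain input[1]; this is an illustration of the per-parent rule, not a replacement for the xmllint session above.

```python
import xml.etree.ElementTree as ET

doc = """<div id="phones">
<div>
<input id="number_1" name="number_1" type="text"/>
</div>
<div>
<input id="number_2" name="number_2" type="text"/>
</div>
</div>"""

root = ET.fromstring(doc)

# Like //input[1]: the position is counted among siblings of each parent,
# so the first <input> of every <div> is returned.
per_parent = [el.get("id") for el in root.findall(".//input[1]")]

# Like (//input)[1]: collect all matches first, then take the first one.
overall_first = root.findall(".//input")[0].get("id")

print(per_parent)
print(overall_first)
```

Indexing the collected list afterwards is the Python counterpart of wrapping the expression in parentheses.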

How to get a list of concatenated text nodes

My goal is to query an XML structure, using only one XPath evaluation, to get a list of strings, each containing the concatenation of text3 and text5 for one "my_class" div.
The structure example is given below:
<div>
<div>
<div class="my_class">
<div class="my_class_1"></div>
<div class="my_class_2">text2</div>
<div class="my_class_3">
text3
<div class="my_class_4">text4</div>
<div class="my_class_5">text5</div>
</div>
</div>
<div class="my_class_6"></div>
</div>
<div>
<div class="my_class">
<div class="my_class_1"></div>
<div class="my_class_2">text12</div>
<div class="my_class_3">
text13
<div class="my_class_4">text14</div>
<div class="my_class_5">text15</div>
</div>
</div>
</div>
</div>
This means I want to get this list of results:
- in index 0 => text3 text5
- in index 1 => text13 text15
Currently I can only get the my_class nodes, but including the text12 that I want to exclude, or a list of the individual strings, not concatenated.
How could I proceed?
Thanks in advance for helping.
EDIT: I removed text4 and text14 from my search to be exact in my example
EDIT: Now the question has changed...
XPath 1.0: There is no such thing as a "list of strings" data type. You can use this expression to select all the container elements of the text nodes you want:
/div/div/div[@class='my_class']/div[@class='my_class_3']
Then, with the proper DOM method of your host language, get the descendant text nodes you want from each selected element using a relative XPath, and concatenate their string values:
text()[1]|div[@class='my_class_5']
XPath 2.0: There is a sequence data type.
/div/div/div[@class='my_class']
/div[@class='my_class_3']
/concat(text()[1],div[@class='my_class_5'])
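The XPath 1.0 route (select the containers, then concatenate in the host language) could look roughly like this in Python. The sketch uses the standard library's ElementTree and a trimmed copy of the sample document; el.text stands in for text()[1], and joining with a single space is an assumption about the desired separator.

```python
import xml.etree.ElementTree as ET

doc = """<div>
<div>
<div class="my_class">
<div class="my_class_2">text2</div>
<div class="my_class_3">
text3
<div class="my_class_4">text4</div>
<div class="my_class_5">text5</div>
</div>
</div>
</div>
<div>
<div class="my_class">
<div class="my_class_2">text12</div>
<div class="my_class_3">
text13
<div class="my_class_4">text14</div>
<div class="my_class_5">text15</div>
</div>
</div>
</div>
</div>"""

root = ET.fromstring(doc)

# Step 1: select the container elements (relative to the root <div>).
containers = root.findall("./div/div[@class='my_class']/div[@class='my_class_3']")

# Step 2: per container, take its first text node and the my_class_5 text.
results = []
for el in containers:
    first_text = (el.text or "").strip()
    fifth = el.find("div[@class='my_class_5']")
    results.append(f"{first_text} {fifth.text.strip()}")

print(results)
```

Each list entry corresponds to one my_class div, which is the "list of strings" that XPath 1.0 alone cannot return.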
Could you not just use:
//div[@class='my_class']/div[@class='my_class_3']
And then get the .innerText from that? There might be a bit of spacing cleanup to do, but it should contain all the inside text (including that from classes 4 and 5) without the tags.
Edit: After clarification
concat(/div/div/div[@class='my_class']/div[@class='my_class_3']/text(), ' ', /div/div/div[@class='my_class']/div[@class='my_class_3']/div[@class='my_class_5']/text())
That might work

Resources