How to get concatenated child text nodes in lxml - xpath

This is the HTML sample:
<div class="wpb_text_column">
<div class="wpb_wrapper">
<p style="text-align: center;">First text part </p>
<p style="text-align: center;">Second text part </p>
<p style="text-align: center;">Third text part</p>
</div>
</div>
<div class="wpb_text_column">
<div class="wpb_wrapper">
<p style="text-align: center;">First text part </p>
<p style="text-align: center;">Second text part</p>
</div>
</div>
With the code below
from lxml import html
tree = html.fromstring(html_sample)
tree.xpath('//div[@class="wpb_text_column"]/div[@class="wpb_wrapper"]/p/text()')
I can get a list of text values:
['First text part ', 'Second text part ', 'Third text part', 'First text part ', 'Second text part']
However, I want to get all values from each div as a single string, like
['First text part Second text part Third text part', 'First text part Second text part']
and
//div[@class="wpb_text_column"]/div[@class="wpb_wrapper"]/normalize-space()
seems to be the exact XPath to solve the problem, but lxml doesn't support the /normalize-space() syntax:
lxml.etree.XPathEvalError: Invalid expression
So how can I get the desired output in lxml?

Solved with the code below:
[" ".join(string.text_content().split()) for string in tree.xpath('//div[@class="wpb_text_column"]/div[@class="wpb_wrapper"]')]
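For reference, here is a self-contained sketch of that approach, assuming lxml is installed and `html_sample` holds the markup from the question. It also shows an alternative that stays within XPath 1.0: `normalize-space(.)` is accepted when it is evaluated as a whole expression on each matched element, just not as a trailing path step. The variable names `wrappers`, `joined` and `normalized` are only illustrative.

```python
from lxml import html

html_sample = """
<div class="wpb_text_column">
<div class="wpb_wrapper">
<p style="text-align: center;">First text part </p>
<p style="text-align: center;">Second text part </p>
<p style="text-align: center;">Third text part</p>
</div>
</div>
<div class="wpb_text_column">
<div class="wpb_wrapper">
<p style="text-align: center;">First text part </p>
<p style="text-align: center;">Second text part</p>
</div>
</div>
"""

tree = html.fromstring(html_sample)
wrappers = tree.xpath('//div[@class="wpb_text_column"]/div[@class="wpb_wrapper"]')

# Option 1: collect all descendant text in Python and collapse whitespace.
joined = [" ".join(w.text_content().split()) for w in wrappers]

# Option 2: let XPath 1.0 do the same, one element at a time.
normalized = [w.xpath("normalize-space(.)") for w in wrappers]

print(joined)
```

Both lists should contain one string per wrapper div.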

Related

How to get elements between tags with XPATH

I need to get each subtitle of an article and its text. Each subheading is inside an <h3>, so I need to get everything between the first <h3> and the second, then everything between the second and the third, and so on until I finish.
The structure is similar to this:
<article>
<p> introduction </p>
<h3>1. Subtitle </h3>
<p> text text </p>
<div> <p>other text</p> </div>
<h3>2. Subtitle </h3>
<p> text text </p>
<div> <p>other text</p> </div>
<h3>3. Subtitle </h3>
<p> text text </p>
<div> <p>other text</p> </div>
</article>
Currently I can get to the first subtitle like this: //h3[1]
But how can I get everything between the first and the second?
This XPath expression gets the nodes between //h3[1] and //h3[2], inclusive:
//article/*[position()>= count(//h3[1]/preceding-sibling::*)+1 and position()<= count(//h3[2]/preceding-sibling::*)+1]
Result in the browser console:
$x('//article/*[position()>= count(//h3[1]/preceding-sibling::*)+1 and position()<= count(//h3[2]/preceding-sibling::*)+1]')
Array(4) [ h3, p, div, h3]
0: <h3>
1: <p>
2: <div>
3: <h3>
length: 4
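If the extraction happens in Python rather than in the browser, the same grouping can be done without position arithmetic. The sketch below assumes lxml and the article structure from the question. It walks the following siblings of each h3 and stops at the next h3. The `sections` dictionary is a hypothetical container chosen for illustration.

```python
from lxml import html

article = html.fromstring("""<article>
<p> introduction </p>
<h3>1. Subtitle </h3>
<p> text text </p>
<div> <p>other text</p> </div>
<h3>2. Subtitle </h3>
<p> text text </p>
<div> <p>other text</p> </div>
<h3>3. Subtitle </h3>
<p> text text </p>
<div> <p>other text</p> </div>
</article>""")

sections = {}
for h3 in article.xpath("//article/h3"):
    parts = []
    # itersiblings() yields the elements that follow h3 under the same parent.
    for sibling in h3.itersiblings():
        if sibling.tag == "h3":
            break  # the next subtitle starts a new section
        parts.append(" ".join(sibling.text_content().split()))
    sections[h3.text_content().strip()] = parts

print(sections)
```

Unlike the XPath above, this excludes the closing h3 from each group and also handles the last section, which has no h3 after it.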

Xpath Sibling Text

I have been trying to figure this out for a while and can't get my head around it. I have tried using following-sibling but it's not working for me. The classes are really generic across the board. I was trying to use the text within the <strong> tag to identify the right block and then pull the sibling content:
<div class="generic-class">
<p class="generic-class2">
<strong>Content title</strong>
"
Dont Need "
<br>
</p>
</div>
<div class="generic-class">
<p class="generic-class2">
<strong>Content title2</strong>
"
Needed Content "
<br>
</p>
</div>
<div class="generic-class">
<p class="generic-class2">
<strong>Content title3</strong>
"
Dont Need "
<br>
</p>
</div>
<div class="generic-class">
<p class="generic-class2">
<strong>Content title4</strong>
"
Dont Need "
<br>
</p>
</div>
I tried the expression below with no success. I then realised that the text is actually in the <p> tag, so it's not a sibling:
normalize-space(//*[@class="generic-class"]/p/strong/following-sibling::text())
Would there be a way of me finding the text in the <strong> tag "Content title2" and then getting the text in the parent?
Any help would be amazing, thanks!
This one should return "Needed Content":
normalize-space(//p/strong[.="Content title2"]/following-sibling::text())
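A quick way to check this expression outside the browser is to run it with lxml. The sketch below is a minimal example under a few assumptions: the markup is trimmed to two blocks, the literal quote characters shown by the DOM inspector are left out, and `text_after_title` is a hypothetical helper name. The title is passed as an XPath variable (`$title`) so it does not need to be escaped into the expression.

```python
from lxml import html

page = html.fromstring("""
<div class="generic-class">
  <p class="generic-class2"><strong>Content title</strong> Dont Need <br></p>
</div>
<div class="generic-class">
  <p class="generic-class2"><strong>Content title2</strong> Needed Content <br></p>
</div>
""")

def text_after_title(title):
    # following-sibling::text()[1] is the first text node after <strong>
    # inside the same <p>; normalize-space() trims and collapses whitespace.
    return page.xpath(
        "normalize-space(//p/strong[. = $title]/following-sibling::text()[1])",
        title=title,
    )

print(text_after_title("Content title2"))
```

The text node is a sibling of the `<strong>` element even though both are children of the `<p>`, which is why the following-sibling axis works here once the predicate selects the right `<strong>`.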

Enter text in input in Watir with same class

I am trying to enter text into an input field and can not successfully get it working. I have two inputs that look like this:
<div class="outerParentClass">
<div class="classLabel">From</div>
<div class="classA classB classD">
<div class="classE">
<div class="classText"> TEXT HERE </div>
<input class="classInputA classInoutB" type="text">
</div>
</div>
</div>
<div class="classLabel">To</div>
<div class="classA classB classD">
<div class="classE">
<div class="classText"> DIFFERENT TEXT HERE </div>
<input class="classInputA classInoutB" type="text">
</div>
</div>
</div>
</div>
Both inputs have exactly the same format as above. There are no IDs and both have the same classes. I am struggling to enter text into these, or even to find them correctly.
When I do this:
browser.text_field(:class => "classInputA").size
It returns 20
When I do this:
browser.text_field(:class => "classInputA")
It returns:
#<Watir::TextField:0x..fbccafb7ed2e9b85e located=false selector={:class=>"classInputA", :tag_name=>"input"}>
Not sure how to locate either of these inputs. Any suggestions?
The text adjacent to the field provides a label and context for the field. As it is likely unique, you can use this to identify the element.
To do this, find the div containing the label text. Then navigate to the adjacent div that contains the text field.
browser.div(text: 'From', class: 'classLabel')        # label of interest
  .element(xpath: './following-sibling::div[1]')      # adjacent div containing text field
  .text_field                                         # the text field
Note that in the next release of Watir, .element(xpath: './following-sibling::div[1]') will be replaceable by just .following_sibling.
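The label-then-sibling navigation is plain XPath, so it can be tried offline before wiring it into a browser test. The following sketch uses Python with lxml rather than Watir, on a cleaned-up copy of the markup. The `name` attributes and the `input_for_label` helper are hypothetical additions, used only to tell the two inputs apart.

```python
from lxml import html

page = html.fromstring("""
<div class="outerParentClass">
  <div class="classLabel">From</div>
  <div class="classA classB classD">
    <div class="classE">
      <div class="classText"> TEXT HERE </div>
      <input class="classInputA classInoutB" type="text" name="from_field">
    </div>
  </div>
  <div class="classLabel">To</div>
  <div class="classA classB classD">
    <div class="classE">
      <div class="classText"> DIFFERENT TEXT HERE </div>
      <input class="classInputA classInoutB" type="text" name="to_field">
    </div>
  </div>
</div>
""")

def input_for_label(label):
    # 1. find the label div by its text, 2. step to the adjacent div,
    # 3. take the input nested inside it.
    matches = page.xpath(
        '//div[@class="classLabel"][normalize-space() = $label]'
        "/following-sibling::div[1]//input",
        label=label,
    )
    return matches[0] if matches else None

print(input_for_label("From").get("name"))
print(input_for_label("To").get("name"))
```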

Need span title in Xpath

I need the span title text (RS_GPO) as my XPath output.
Here is the code:
<TD id="celleditableGrid07" nowrap="nowrap" style='padding:0px;' >
<DIV class='stacked-row'>
<span id="form(202567).form(TITLE).text" >
<span title='RPS_AEM3'>RPS_AEM3</span>
</span>
</DIV>
<DIV class='stacked-row-bottom'>
<span id="form(202567).form(CONTENT).text" >
<span title='RS_GPO'>RS_GPO</span>
</span>
</DIV>
My intention is to capture the text "RS_GPO" in a variable using XPath,
because this is system-generated text.
Thanks in advance.
//span[@title='RS_GPO']
OR
//div[@class='stacked-row-bottom']/span[@id='form(202567).form(CONTENT).text']/span[@title='RS_GPO']
//span[contains(@id,'form(CONTENT).text')]/span
If you want the content of the title attribute instead of the element's text content, then:
//span[contains(@id,'form(CONTENT).text')]/span/@title
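To capture the value in a variable, the expressions can be evaluated from code. The sketch below uses Python with lxml as an assumption, since the question does not say which tool runs the XPath. It reproduces only the two stacked-row blocks from the question. Matching on the `form(CONTENT).text` part of the id avoids hard-coding the generated number and the generated value itself.

```python
from lxml import html

cell = html.fromstring("""
<div class="stacked-row">
  <span id="form(202567).form(TITLE).text">
    <span title="RPS_AEM3">RPS_AEM3</span>
  </span>
</div>
<div class="stacked-row-bottom">
  <span id="form(202567).form(CONTENT).text">
    <span title="RS_GPO">RS_GPO</span>
  </span>
</div>
""")

inner = "//span[contains(@id, 'form(CONTENT).text')]/span"

# Text content of the inner span, as a plain string.
content_text = cell.xpath(f"string({inner})")

# Value of its title attribute; an attribute step returns a list of strings.
title_values = cell.xpath(f"{inner}/@title")

print(content_text, title_values)
```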

How can I get several similar tags data with HtmlAgilityPack?

Before explaining, I am using VB.net and HtmlAgilityPack.
I have the HTML below; all three sections have the same format. I am using HtmlAgilityPack to extract the data from the title and date. My code extracts the title correctly, but the date is only extracted from the first instance and repeated 3 times:
HtmlAgilityPack code:
For Each h4 As HtmlNode In docnews.DocumentNode.SelectNodes("//h4[(@class='title')]")
Dim date1 As HtmlNode = docnews.DocumentNode.SelectSingleNode("//span[starts-with(@class, 'date ')]")
Dim newsdate As String = date1.InnerText
MessageBox.Show(h4.InnerText)
MessageBox.Show(newsdate)
Next
I thought that, being inside the loop over each h4, I would get its associated date accordingly...
HTML code:
<div class="article-header" style="" data-itemid="920729" data-source="ABC" data-preview="Text 1">
<h4 class="title">Text for Mr. A</h4>
<div class="byline">
<span class="date timestamp"><span title="29 November 2013">29-11-2013</span></span>
<span class="source" title="AGE">18</span>
</div>
<div class="preview">Text 1 Preview</div>
</div>
<div class="article-header" style="" data-itemid="920720" data-source="ABC" data-preview="Text 2">
<h4 class="title">Text for Mr. B</h4>
<div class="byline">
<span class="date timestamp"><span title="27 November 2013">27-11-2013</span></span>
<span class="source" title="AGE">25</span>
</div>
<div class="preview">Text 2 Preview</div>
</div>
<div class="article-header" style="" data-itemid="920719" data-source="ABC" data-preview="Text 3">
<h4 class="title">Text for Mr. C</h4>
<div class="byline">
<span class="date timestamp"><span title="22 October 2013">22-10-2013</span></span>
<span class="source" title="AGE">20</span>
</div>
<div class="preview">Text 3 Preview</div>
</div>
Final Output should be:
Text for Mr. A
29-11-2013
Text for Mr. B
27-11-2013
Text for Mr. C
22-10-2013
What I am getting with my code:
Text for Mr. A
29-11-2013
Text for Mr. B
29-11-2013
Text for Mr. C
29-11-2013
Any help is much appreciated.
You need to anchor your second XPath to look 'below' the h4:
Dim date1 As HtmlNode = h4.Parent.SelectSingleNode(".//span[starts-with(@class, 'date ')]")
                        ^^^^^^^^^                   ^^^
The .// tells XPath to search only under the node the expression is executed on, whereas a leading // always searches from the document root. Thus, by calling SelectSingleNode on h4.Parent, you get the date below the parent div of the h4.
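The difference between a document-wide `//` and a context-relative `.//` is not specific to HtmlAgilityPack. The sketch below reproduces it in Python with lxml, which is an assumption made only for illustration, on a shortened copy of the markup. `absolute` repeats the mistake from the question and `relative` applies the fix.

```python
from lxml import html

doc = html.fromstring("""
<div class="article-header">
  <h4 class="title">Text for Mr. A</h4>
  <div class="byline"><span class="date timestamp"><span title="29 November 2013">29-11-2013</span></span></div>
</div>
<div class="article-header">
  <h4 class="title">Text for Mr. B</h4>
  <div class="byline"><span class="date timestamp"><span title="27 November 2013">27-11-2013</span></span></div>
</div>
<div class="article-header">
  <h4 class="title">Text for Mr. C</h4>
  <div class="byline"><span class="date timestamp"><span title="22 October 2013">22-10-2013</span></span></div>
</div>
""")

date_span = "span[starts-with(@class, 'date ')]"
absolute, relative = [], []
for h4 in doc.xpath("//h4[@class='title']"):
    # '//' starts at the document root, so it always finds the first date.
    absolute.append(h4.xpath(f"string(//{date_span})"))
    # './/' starts at the context node, here the parent div of this h4.
    relative.append(h4.getparent().xpath(f"string(.//{date_span})"))

print(absolute)
print(relative)
```

The first list repeats the first date three times, while the second pairs each title with its own date.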
