How to exclude from a contains query all the information from a child class and after some sibling text? - xpath

<root>
<a></a>
<b></b>
<c></c>
<a></a>
<d></d>
<e></e>
<a></a>
<a></a>
</root>
In an XML document, how can I exclude from a contains search all the information in the nodes after <d>,
to get results only from:
<a></a>
<b></b>
<c></c>
<a></a>
<d></d>
I can't just take the first two results from <a> and the first from <b> and <c>, because sometimes a value will exist only after the <d>.
I have this code that is working:
//div[contains(@class,'class searched')]/*[contains(text(), 'Text Searched')] | //div[contains(@class,'class searched')]/*[not(contains(@class,'class excluded'))]/*[contains(text(), 'Text Searched')]
Thanks for your help :)
EDIT for clarity:
<div Class="TopClass">
<div Class="ClassA">
<div Class="ClassB">
<h3> Text Researched</h3>
<u1 Class="ClassC">
<h3> Text Researched</h3>
</u1>
</div>
</div>
<h4>Other Text</h4>
<div Class="ClassA">
<div Class="ClassB">
<h3> Text Researched</h3>
<u1 Class="ClassC">
<h3> Text Researched</h3>
</u1>
</div>
</div>
I would like to get only the "Text Researched" that is between ClassB and ClassC and that is above the "Other Text". Sometimes the "Text Researched" will only appear below the "Other Text", and I don't want that result, so a [1] will not work here. Also, <h3> and <h4> are used elsewhere in the code.

Given this html
<div class="TopClass">
<div class="ClassA">
<div class="ClassB">
<h3> Text Researched 1</h3>
<u1 class="ClassC">
<h3> Text Researched 2</h3>
</u1>
</div>
</div>
<h4>Other Text</h4>
<div class="ClassA">
<div class="ClassB">
<h3> Text Researched 3</h3>
<u1 class="ClassC">
<h3> Text Researched 4</h3>
</u1>
</div>
</div>
</div>
This XPath expression only looks inside a ClassA div that has the "Other Text" h4 as a following sibling (so only the first 2 h3 tags are candidates), then filters them with contains():
//div[@class="ClassA" and following-sibling::h4[.="Other Text"]]//h3[contains(.,"Text Researched 1")]/text()
Result:
echo -e 'cat //div[@class="ClassA" and following-sibling::h4[.="Other Text"]]//h3[contains(.,"Text Researched 1")]/text()\nbye' | xmllint --shell test.html
/ > cat //div[@class="ClassA" and following-sibling::h4[.="Other Text"]]//h3[contains(.,"Text Researched 1")]/text()
-------
Text Researched 1
/ > bye
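For readers without an XPath 1.0 engine at hand, the same "only before the marker" idea can be sketched with Python's standard library. This is only an illustration under assumptions: xml.etree.ElementTree has no following-sibling axis, so the sketch walks the children of the top div in document order instead, and the function name h3_before_marker is made up for this example.

```python
import xml.etree.ElementTree as ET

HTML = """<div class="TopClass">
  <div class="ClassA"><div class="ClassB">
    <h3> Text Researched 1</h3>
    <u1 class="ClassC"><h3> Text Researched 2</h3></u1>
  </div></div>
  <h4>Other Text</h4>
  <div class="ClassA"><div class="ClassB">
    <h3> Text Researched 3</h3>
    <u1 class="ClassC"><h3> Text Researched 4</h3></u1>
  </div></div>
</div>"""


def h3_before_marker(root, marker="Other Text", needle="Text Researched"):
    """Return h3 texts from ClassA divs located before the marker h4.

    Hypothetical helper: it mimics the following-sibling::h4 predicate by
    walking the direct children of root in document order.
    """
    found = []
    for child in root:
        if child.tag == "h4" and (child.text or "").strip() == marker:
            return found  # stop here: everything after the marker is ignored
        if child.tag == "div" and child.get("class") == "ClassA":
            for h3 in child.iter("h3"):
                text = (h3.text or "").strip()
                if needle in text:
                    found.append(text)
    # No marker at all: no div has the h4 as a following sibling
    return []


results = h3_before_marker(ET.fromstring(HTML))
print(results)
```

If the searched text only occurs below the "Other Text" heading, the list comes back empty, which is the behaviour asked for in the question.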

Related

How to get elements between tags with XPATH

I need to get each subtitle of an article and its text. Each subheading is inside an <h3>, and I need to get everything between the first and the second. Then I will do the same between the second and third, and so on until I finish.
The structure is similar to this:
<article>
<p> introducion </p>
<h3>1. Subtitle </h3>
<p> text text </p>
<div> <p>other text</p> </div>
<h3>2. Subtitle </h3>
<p> text text </p>
<div> <p>other text</p> </div>
<h3>3. Subtitle </h3>
<p> text text </p>
<div> <p>other text</p> </div>
</article>
Currently I can get to the first subtitle like this: //h3[1]
But how can I get everything between the first and the second ???
This XPath expression gets the nodes between //h3[1] and //h3[2], including both h3 elements:
//article/*[position()>= count(//h3[1]/preceding-sibling::*)+1 and position()<= count(//h3[2]/preceding-sibling::*)+1]
Result on browser console
$x('//article/*[position()>= count(//h3[1]/preceding-sibling::*)+1 and position()<= count(//h3[2]/preceding-sibling::*)+1]')
Array(4) [ h3, p, div, h3]
0: <h3>
1: <p>
2: <div>
3: <h3>
length: 4
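Since the goal is to repeat this for every subtitle, a loop over the children of the article may be simpler than one XPath per pair of headings. The following is a rough sketch with Python's standard library, not the XPath approach above: the function name sections is invented here, and unlike the XPath result it does not include the next h3 in each group.

```python
import xml.etree.ElementTree as ET

ARTICLE = """<article>
<p> introduction </p>
<h3>1. Subtitle </h3>
<p> text text </p>
<div> <p>other text</p> </div>
<h3>2. Subtitle </h3>
<p> text text </p>
<div> <p>other text</p> </div>
<h3>3. Subtitle </h3>
<p> text text </p>
<div> <p>other text</p> </div>
</article>"""


def sections(article):
    """Group the siblings that follow each h3 under that h3's text."""
    grouped = []
    for child in article:
        if child.tag == "h3":
            # A new subtitle starts a new group
            grouped.append(((child.text or "").strip(), []))
        elif grouped:
            # Anything before the first h3 (the introduction) is skipped
            grouped[-1][1].append(child)
    return grouped


summary = [(title, [node.tag for node in nodes])
           for title, nodes in sections(ET.fromstring(ARTICLE))]
for title, tags in summary:
    print(title, tags)
```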

How to combine two SED regex commands?

I've read all the answers I could find on SOF, but unfortunately none of them led me to a solution.
I have thousands of files with address information, and each of my SED commands works on its own.
Matches address
sed -n -e 's/^.*address23ca storeh2..\(.*\) Address<\/h2.*optmob..\(.*\)<br>\(.*\)<br>\(.*\)<br>\(.*\)<\/p><p class="addressbox23.*Telephone: \(.-...-...-....\)\(.*\).*v1\/place?..\(.*\)&key.*$/\1,\2,\3,\4,\5,\6/p' afile.html
$ 142 Wayne Street,Abbey,Saskatchewan,S0N 0A0,1-232-321-4321
Matches GPS
sed -n -e 's/^.*v1\/place?..\(.*\)&key.*$/\1/p' abbey.html
$ 50.736301,-108.757103
I've tried the following but it doesn't stop matching after the telephone number, it instead continues until it matches v1\/place? and then stops. I can't figure out how to stop matching at the phone number and start the match again for the GPS.
How can I combine these two matches?
sed -n -e 's/^.*address23ca storeh2..\(.*\) Address<\/h2.*optmob..\(.*\)<br>\(.*\)<br>\(.*\)<br>\(.*\)<\/p><p class="addressbox23.*Telephone: \(.-...-...-....\)\(.*\).^*v1\/place?..\(.*\)&key.*$/\1,\2,\3,\4,\5,\6,\7/p' afile.html
$ 142 Wayne Street,Abbey,Saskatchewan,S0N 0A0,1-232-321-4321 LOADS OF unnecessary HTML src="https://www.google.com/maps/embed
A Trimmed version of a file
<!DOCTYPE html>
<html lang="en"> <!--<![endif]-->
<head></head><body> <div class="large-7 columns small-12 addWrap23ca"> <div class="storeH2Wrap23ca"> <h2 class="address23ca storeh2">Canada Post Abbey Address</h2></div><p class="addressbox23ca optmob">142 Wayne Street<br>Abbey<br>Saskatchewan<br>S0N 0A0</p><p class="addressbox23ca optmob">Telephone: 1-866-607-6301</p></div></div><div class="row"> <div class="large-12 medium-12 columns small-12"> <div class="row"> <div class="large-12 columns small-12"> <div class="storeH2Wrap23ca"> <h2 class="hours23ca storeh2">Canada Post Abbey Opening Hours</h2></div><div class="hoursCont23ca"> 13:00-16:30</p><p>Closed</p><p>Closed</p></div></div><div class="notesWrap23ca"><div class="notesTitle23ca"><p class="noteHeading23ca">Post Office Notes</p></div><div class="notesContent23ca"><p class="note23ca">This Post Office Branch closes for lunch on certain days - please see opening hours.</p></div></div></div></div></div></div><div class="row"> <div class="mapadCont23ca"> <div class="large-12 medium-12 columns small-12 map23ca"> <div class="storeH2Wrap23ca storeH2WrapMap23ca"> <h2 class="maptitle23ca storeh2">Canada Post Abbey Map Location</h2></div><div class="mapBreadCrumbs23ca"><ul><li>Canada Post Locator</li><li>></li><li>Canada Post Saskatchewan</li><li>></li><li>Canada Post in Abbey</li></ul></div> <div class="mapCont23ca"> <iframe width="100%" height="434" frameborder="0" src="https://www.google.com/maps/embed/v1/place?q=50.736301,-108.757103&key=AIzaSyDmJApckRpAR1uhfdfz_QedneaF5lAlrQU"></iframe></div><div class="searchagainouter23ca"> <div class="adddivclear" style="clear:both;"></div></body></html>
You can combine two sed commands with a semicolon:
$ echo "etts" | sed 's/et/te/; s/ts/st/'
test
You can also use a better tool for the job, like Python's HTMLParser. Here is an example that prints every tag it finds; you can add whatever filters you want:
#! /usr/bin/env python3
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    # Called for every opening tag; attrs is a list of (name, value) pairs
    def handle_starttag(self, tag, attrs):
        print("Found a start tag:", tag)
        print("\tattrs:", attrs)

    # Called for every closing tag
    def handle_endtag(self, tag):
        print("Found an end tag:", tag)

    # Called for the text between tags
    def handle_data(self, data):
        print("Found data:", data)

MyHTMLParser().feed('''
<!DOCTYPE html>
<html lang="en"> <!--<![endif]-->
<head></head><body> <div class="large-7 columns small-12 addWrap23ca"> <div class="storeH2Wrap23ca"> <h2 class="address23ca storeh2">Canada Post Abbey Address</h2></div><p class="addressbox23ca optmob">142 Wayne Street<br>Abbey<br>Saskatchewan<br>S0N 0A0</p><p class="addressbox23ca optmob">Telephone: 1-866-607-6301</p></div></div><div class="row"> <div class="large-12 medium-12 columns small-12"> <div class="row"> <div class="large-12 columns small-12"> <div class="storeH2Wrap23ca"> <h2 class="hours23ca storeh2">Canada Post Abbey Opening Hours</h2></div><div class="hoursCont23ca"> 13:00-16:30</p><p>Closed</p><p>Closed</p></div></div><div class="notesWrap23ca"><div class="notesTitle23ca"><p class="noteHeading23ca">Post Office Notes</p></div><div class="notesContent23ca"><p class="note23ca">This Post Office Branch closes for lunch on certain days - please see opening hours.</p></div></div></div></div></div></div><div class="row"> <div class="mapadCont23ca"> <div class="large-12 medium-12 columns small-12 map23ca"> <div class="storeH2Wrap23ca storeH2WrapMap23ca"> <h2 class="maptitle23ca storeh2">Canada Post Abbey Map Location</h2></div><div class="mapBreadCrumbs23ca"><ul><li>Canada Post Locator</li><li>></li><li>Canada Post Saskatchewan</li><li>></li><li>Canada Post in Abbey</li></ul></div> <div class="mapCont23ca"> <iframe width="100%" height="434" frameborder="0" src="https://www.google.com/maps/embed/v1/place?q=50.736301,-108.757103&key=AIzaSyDmJApckRpAR1uhfdfz_QedneaF5lAlrQU"></iframe></div><div class="searchagainouter23ca"> <div class="adddivclear" style="clear:both;"></div></body></html>
''')
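As one possible way to add those filters, here is a sketch that keeps only the address lines, the phone number and the GPS coordinates. It is an assumption-laden illustration rather than a drop-in replacement for the sed commands: the class name AddressParser is invented, the sample is a shortened copy of the file above, and EXAMPLE_KEY is a placeholder for the real key.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs

# Shortened sample; EXAMPLE_KEY is a placeholder, not a real API key
SAMPLE = (
    '<p class="addressbox23ca optmob">142 Wayne Street<br>Abbey<br>'
    'Saskatchewan<br>S0N 0A0</p>'
    '<p class="addressbox23ca optmob">Telephone: 1-866-607-6301</p>'
    '<iframe src="https://www.google.com/maps/embed/v1/place'
    '?q=50.736301,-108.757103&key=EXAMPLE_KEY"></iframe>'
)


class AddressParser(HTMLParser):
    """Hypothetical filter: collects address fields and the map coordinates."""

    def __init__(self):
        super().__init__()
        self.in_address = False
        self.fields = []
        self.gps = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        src = attrs.get("src") or ""
        if tag == "p" and "addressbox23ca" in classes:
            self.in_address = True
        elif tag == "iframe" and "/maps/embed/" in src:
            # The coordinates are the q parameter of the embed URL
            self.gps = parse_qs(urlparse(src).query).get("q", [None])[0]

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_address = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_address and text:
            if text.startswith("Telephone: "):
                text = text[len("Telephone: "):]
            self.fields.append(text)


parser = AddressParser()
parser.feed(SAMPLE)
parser.close()
line = ",".join(parser.fields + [parser.gps or ""])
print(line)
```

Because the address and the map URL are picked up by separate handlers, nothing between the phone number and the iframe can leak into the output, which was the problem with the single greedy regex.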

Thymeleaf switch block returns incorrect value

I have a switch block in my Thymeleaf page where I show an image depending on the reputation score of the user:
<h1>
    <span th:text="#{user.reputation} + ${reputation}">Reputation</span>
</h1>
<div th:if="${reputation lt 0}">
    <img th:src="@{/css/img/troll.png}"/>
</div>
<div th:if="${reputation} == 0">
    <img th:src="@{/css/img/smeagol.jpg}"/>
</div>
<div th:if="${reputation gt 0} and ${reputation le 5}">
    <img th:src="@{/css/img/samwise.png}"/>
</div>
<div th:if="${reputation gt 5} and ${reputation le 15}">
    <img th:src="@{/css/img/frodo.png}"/>
</div>
<div th:if="${reputation gt 15}">
    <img th:src="@{/css/img/gandalf.jpg}"/>
</div>
This statement always returns smeagol (so reputation 0), even though the reputation of this user is 7.
EDIT:
I was wrong, the image showing was a rogue line:
<!--<img th:src="@{/css/img/smeagol.jpg}"/>-->
but I commented it out. Now there is no image showing.
EDIT2:
changed my comparators (see original post) and now I get the following error:
The value of attribute "th:case" associated with an element type "div" must not contain the '<' character.
EDIT3:
Works now, updated original post to working code
According to the documentation, Thymeleaf's switch statement works just like Java's - and the example suggests the same.
In other words: you cannot do
<th:block th:switch="${reputation}">
<div th:case="${reputation} < 0">
[...]
but would need to do
<th:block th:switch="${reputation}">
<div th:case="0">
[...]
which is not what you want.
Instead, you will have to use th:if, i.e. something like this:
<div th:if="${reputation lt 0}">
    <img th:src="@{/css/img/troll.png}"/>
</div>
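To make the boundaries of the th:if chain explicit, here are the same thresholds written as plain range checks in Python. This is only a sketch of the logic, not Thymeleaf code: the function name reputation_image is made up, and the file names are the ones from the question.

```python
def reputation_image(reputation):
    """Return the image file name for a reputation score.

    Each branch corresponds to one th:if condition from the question;
    a switch/case on exact values could not express these ranges.
    """
    if reputation < 0:
        return "troll.png"
    if reputation == 0:
        return "smeagol.jpg"
    if reputation <= 5:   # 1 to 5
        return "samwise.png"
    if reputation <= 15:  # 6 to 15
        return "frodo.png"
    return "gandalf.jpg"  # above 15


print(reputation_image(7))
```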
Change
<div th:case="0">
    <img th:src="@{/css/img/smeagol.jpg}"/>
</div>
to
<div th:case="${reputation == 0}">
    <img th:src="@{/css/img/smeagol.jpg}"/>
</div>

Need span title in Xpath

I need span title text (RS_GPO) as my xpath output
Here is code:
<TD id="celleditableGrid07" nowrap="nowrap" style='padding:0px;' >
<DIV class='stacked-row'>
<span id="form(202567).form(TITLE).text" >
<span title='RPS_AEM3'>RPS_AEM3</span>
</span>
</DIV>
<DIV class='stacked-row-bottom'>
<span id="form(202567).form(CONTENT).text" >
<span title='RS_GPO'>RS_GPO</span>
</span>
</DIV>
My intention is to capture the text "RS_GPO" with XPath and store it in a variable,
because this text is generated by the system.
Thanks in Advance.
//span[@title='RS_GPO']
OR
//div[@class='stacked-row-bottom']/span[@id='form(202567).form(CONTENT).text']/span[@title='RS_GPO']
//span[contains(@id,'form(CONTENT).text')]/span
If you want the content of the title attribute instead of the element's text content, then:
//span[contains(@id,'form(CONTENT).text')]/span/@title
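To show the difference between the element's text and its title attribute, here is a small sketch using Python's standard library. It rests on a few assumptions: the snippet has been closed with </TD> so that it parses as XML, and because xml.etree.ElementTree does not support contains(), the id check is done in Python instead of in the path.

```python
import xml.etree.ElementTree as ET

# The cell from the question, closed with </TD> so it is well-formed
CELL = """<TD id="celleditableGrid07" nowrap="nowrap" style="padding:0px;">
  <DIV class="stacked-row">
    <span id="form(202567).form(TITLE).text">
      <span title="RPS_AEM3">RPS_AEM3</span>
    </span>
  </DIV>
  <DIV class="stacked-row-bottom">
    <span id="form(202567).form(CONTENT).text">
      <span title="RS_GPO">RS_GPO</span>
    </span>
  </DIV>
</TD>"""

root = ET.fromstring(CELL)

inner = None
for outer in root.iter("span"):
    # Stand-in for contains(@id,'form(CONTENT).text')
    if "form(CONTENT).text" in outer.get("id", ""):
        inner = outer.find("span")  # the nested span, like .../span
        break

text_value = inner.text           # the element's text content
title_value = inner.get("title")  # the title attribute, like .../span/@title
print(text_value, title_value)
```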

How can I get several similar tags data with HtmlAgilityPack?

Before explaining: I am using VB.net and HtmlAgilityPack.
I have the HTML below; all three sections have the same format. I am using HtmlAgilityPack to extract the data from the title and date. My code extracts the title correctly, but the date is only extracted from the first instance and repeated 3 times:
HtmlAgilityPack code:
For Each h4 As HtmlNode In docnews.DocumentNode.SelectNodes("//h4[(@class='title')]")
    Dim date1 As HtmlNode = docnews.DocumentNode.SelectSingleNode("//span[starts-with(@class, 'date ')]")
    Dim newsdate As String = date1.InnerText
    MessageBox.Show(h4.InnerText)
    MessageBox.Show(newsdate)
Next
I thought that, being inside the loop over each h4, I would get the date associated with that h4...
HTML code:
<div class="article-header" style="" data-itemid="920729" data-source="ABC" data-preview="Text 1">
<h4 class="title">Text for Mr. A</h4>
<div class="byline">
<span class="date timestamp"><span title="29 November 2013">29-11-2013</span></span>
<span class="source" title="AGE">18</span>
</div>
<div class="preview">Text 1 Preview</div>
</div>
<div class="article-header" style="" data-itemid="920720" data-source="ABC" data-preview="Text 2">
<h4 class="title">Text for Mr. B</h4>
<div class="byline">
<span class="date timestamp"><span title="27 November 2013">27-11-2013</span></span>
<span class="source" title="AGE">25</span>
</div>
<div class="preview">Text 2 Preview</div>
</div>
<div class="article-header" style="" data-itemid="920719" data-source="ABC" data-preview="Text 3">
<h4 class="title">Text for Mr. C</h4>
<div class="byline">
<span class="date timestamp"><span title="22 October 2013">22-10-2013</span></span>
<span class="source" title="AGE">20</span>
</div>
<div class="preview">Text 3 Preview</div>
</div>
Final Output should be:
Text for Mr. A
29-11-2013
Text for Mr. B
27-11-2013
Text for Mr. C
22-10-2013
What I am getting with my code:
Text for Mr. A
29-11-2013
Text for Mr. B
29-11-2013
Text for Mr. C
29-11-2013
Any help is much appreciated.
You need to anchor your second XPath so that it looks 'below' the parent of the h4:
Dim date1 As HtmlNode = h4.Parent.SelectSingleNode(".//span[starts-with(@class, 'date ')]")
                        ^^^^^^^^^                   ^^^
The .// tells XPath to search under the node the XPath is executed on. Thus, by calling SelectSingleNode on h4.Parent, you get the date below the parent div tag of that h4. A path starting with // always searches from the document root, which is why the original code returned the first date every time.
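The same relative-versus-absolute distinction can be sketched outside HtmlAgilityPack. The example below uses Python's standard library purely as an illustration, on a reduced copy of the HTML wrapped in a body element (an assumption, so that it has a single root); each lookup starts from the current article div rather than from the document.

```python
import xml.etree.ElementTree as ET

# Reduced copy of the page, wrapped in <body> so it has a single root
PAGE = """<body>
<div class="article-header">
  <h4 class="title">Text for Mr. A</h4>
  <div class="byline">
    <span class="date timestamp"><span title="29 November 2013">29-11-2013</span></span>
  </div>
</div>
<div class="article-header">
  <h4 class="title">Text for Mr. B</h4>
  <div class="byline">
    <span class="date timestamp"><span title="27 November 2013">27-11-2013</span></span>
  </div>
</div>
<div class="article-header">
  <h4 class="title">Text for Mr. C</h4>
  <div class="byline">
    <span class="date timestamp"><span title="22 October 2013">22-10-2013</span></span>
  </div>
</div>
</body>"""

root = ET.fromstring(PAGE)

pairs = []
for header in root.findall(".//div[@class='article-header']"):
    # Both searches are relative to this header, not to the whole document
    title = header.find("h4[@class='title']")
    date = header.find(".//span[@class='date timestamp']/span")
    pairs.append((title.text, date.text))

for title_text, date_text in pairs:
    print(title_text)
    print(date_text)
```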
