Targeting part of a comment using XPath - xpath

I'm trying to use xpath to return the value "Vancouver", from either the comment or the text after it. Can anyone point me in the right direction?
The location li is always the first item but is not always present, and the number of list items after it varies for each item.
<item>
<title>
<description>
<!-- Comment #1 -->
<ul class="class1">
<li> <!-- ABC Location=Vancouver -->Location: Vancouver</li>
<li> <!-- More comments -->Text</li>
<li> text</li>
</ul>
</description>
</item>

This will pull it from the text after the comment:
substring-after(//ul[@class='class1']/li[position()=1 and contains(.,'Location:')],'Location: ')
This selects the first <li> inside the <ul> of class 'class1', only when it contains 'Location:', and takes the string after 'Location: '. If you want to relax the requirement that it be the first li, use this:
substring-after(//ul[@class='class1']/li[contains(.,'Location:')],'Location: ')

This isn't elegant, and because it relies on fixed character positions it will break if the "Location: #####" text ever changes structure, but it works for the sample above:
substring(//item//li[1],12,string-length(//item//li[1])-10)
Note that this returns a string, not a node. It simply strips off the leading "Location: " and returns whatever comes after it.
I rushed this one a bit, so I'll post a better solution when I have time; for now it's just something to think about.

Use:
substring-after(/*/description/ul
/li[1]/text()[starts-with(., 'Location: ')],
'Location: '
)
To extract the location from the comment use:
substring-after(/*/description/ul
/li[1]/comment()[starts-with(., ' ABC Location=')],
' ABC Location='
)
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:copy-of select=
"substring-after(/*/description/ul
/li[1]/text()[starts-with(., 'Location: ')],
'Location: '
)
"/>
==========
<xsl:copy-of select=
"substring-after(/*/description/ul
/li[1]/comment()[starts-with(., ' ABC Location=')],
' ABC Location='
)
"/>
</xsl:template>
</xsl:stylesheet>
when this transformation is applied on the provided XML document:
<item>
<title/>
<description>
<!-- Comment #1 -->
<ul class="class1">
<li>
<!-- ABC Location=Vancouver -->Location: Vancouver
</li>
<li>
<!-- More comments -->Text
</li>
<li> text</li>
</ul>
</description>
</item>
the two XPath expressions are evaluated and the results of the evaluations are copied to the output:
Vancouver
==========
Vancouver
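For anyone who has to do this without an XPath engine, the two expressions above can be mirrored in plain Python. This is only a minimal sketch using the standard library's xml.etree.ElementTree (Python 3.8 or later, for insert_comments); the function names location_from_text and location_from_comment are made up for illustration, and the sample is the document from this answer.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<item>
<title/>
<description>
<!-- Comment #1 -->
<ul class="class1">
<li> <!-- ABC Location=Vancouver -->Location: Vancouver</li>
<li> <!-- More comments -->Text</li>
<li> text</li>
</ul>
</description>
</item>"""


def location_from_text(xml_text):
    """Return what follows 'Location: ' in the first <li>, or '' if absent."""
    root = ET.fromstring(xml_text)
    first_li = root.find("description/ul/li")
    if first_li is None:
        return ""
    text = "".join(first_li.itertext())
    _, found, after = text.partition("Location: ")
    return after.strip() if found else ""


def location_from_comment(xml_text):
    """Return what follows ' ABC Location=' in a comment of the first <li>."""
    # Comments are dropped by default, so ask the tree builder to keep them.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xml_text, parser=parser)
    first_li = root.find("description/ul/li")
    if first_li is None:
        return ""
    prefix = " ABC Location="
    for node in first_li:
        if node.tag is ET.Comment and (node.text or "").startswith(prefix):
            return node.text[len(prefix):].strip()
    return ""


print(location_from_text(SAMPLE))     # Vancouver
print(location_from_comment(SAMPLE))  # Vancouver
```

Both helpers return an empty string when the location li is missing, which matches what substring-after() gives for an empty node-set.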

Related

Find and conditionally edit text in an XML file

I have a (XML-)file that has the following content:
<class>OverAll</class>
<char>
<rank> 1</rank>
<name> yyy</name>
<level> 9</level>
<experience>53842</experience>
<class>xxx</class>
</char>
<char>
<rank> 2</rank>
<name>aaa</name>
<level> 9</level>
<experience>53074</experience>
<class>zzz</class>
</char>
..and so on. I want to extract the number between the <experience> </experience> lines and replace it with a modified version of the number I found between the tag. For example, the file should look like this after the script:
<class>OverAll</class>
<char>
<rank> 1</rank>
<name> yyy</name>
<level> 9</level>
<experience>53.842</experience>
<class>xxx</class>
</char>
<char>
<rank> 2</rank>
<name>aaa</name>
<level> 9</level>
<experience>53.074</experience>
<class>zzz</class>
</char>
(I want to add a thousands separator, and values above 1 million must be supported, so up to two thousands separators :)
I am able to find and replace the number, but I don't know how to take the number I found, modify it, and write it back to the line.
Perhaps someone can help here?
Thank you very much :)
A one-liner sed can do it, assuming the last three digits are always decimal:
sed -zE 's#([[:digit:]]{7,})([[:digit:]]{1})[[:space:]]*(</experience[[:space:]]*>)#\1.\2\3#g;s#([[:digit:]]{3})[[:space:]]*(</experience[[:space:]]*>)#.\1\2#g'
sed parameters breakdown:
-zE
-z or --null-data: Separate lines by NULL characters to allow pattern matching across lines, because spaces, tabs and newlines are allowed by the XML syntax before the > bracket of a tag.
-E or --regexp-extended: Use extended regular expressions in the script (for portability use POSIX -E).
s#([[:digit:]]{7,})([[:digit:]]{1})[[:space:]]*(</experience[[:space:]]*>)#\1.\2\3#g:
Insert a decimal point before the last digit of experience numbers that have seven plus one (eight) or more digits (a million or more, with one extra decimal digit).
s#([[:digit:]]{3})[[:space:]]*(</experience[[:space:]]*>)#.\1\2#g:
Insert a decimal point before the last three digits of experience numbers ending with three digits (this automatically excludes the million-plus values already processed by the previous sed command).
Now keep in mind that it is not parsing the XML either, because it will replace numbers in the <experience> tag anywhere in the XML tree.
Regular expressions are not meant to parse markup languages. There are better, more efficient and dedicated tools to manipulate XML with XSLT/XPATH like saxon, xsltproc, xmllint...
Using proper XML processing with xsltproc:
decimal-experience.xsl
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<!-- Cosmetic sugar to have the xml declaration header and indent -->
<xsl:output omit-xml-declaration="no" indent="yes"/>
<!-- Cosmetic sugar to remove unneeded spaces in elements -->
<xsl:strip-space elements="*"/>
<!-- Copy all the nodes as-is from the source xml -->
<xsl:template match="*">
<xsl:copy>
<xsl:apply-templates select="node()"/>
</xsl:copy>
</xsl:template>
<!-- Process the content of the experience tag within the char tag -->
<xsl:template match="char/experience">
<!-- Keep the experience element itself and rewrite only its value -->
<xsl:copy>
<xsl:choose>
<!-- The experience is already in decimal form: keep it as-is -->
<xsl:when test="contains(., '.')">
<xsl:value-of select="."/>
</xsl:when>
<!-- Up to seven digits: the last three digits are decimals -->
<xsl:when test=". &lt; 10000000">
<xsl:value-of select="format-number(. div 1000, '0.000')"/>
</xsl:when>
<!-- Eight digits or more: the last digit is decimal -->
<xsl:otherwise>
<xsl:value-of select="format-number(. div 10, '0.0')"/>
</xsl:otherwise>
</xsl:choose>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Running the XSLT transformation above:
xsltproc decimal-experience.xsl characters.xml
Example output:
I created a valid, fictitious characters.xml with a span root element, because your extract was not well-formed XML on its own.
<?xml version="1.0"?>
<span>
<class>OverAll</class>
<char>
<rank> 1</rank>
<name> yyy</name>
<level> 9</level>
<experience>53.842</experience>
<class>xxx</class>
</char>
<char>
<rank> 2</rank>
<name>aaa</name>
<level> 9</level>
<experience>53.074</experience>
<class>zzz</class>
</char>
<char>
<rank> 3</rank>
<name>Million</name>
<level>42</level>
<experience>5585307.4</experience>
<class>zzz</class>
</char>
</span>
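If what you want is literally the thousands separator described in the question (53842 becoming 53.842, and a seven-digit value getting two separators) rather than a decimal point, here is a minimal Python sketch using only the standard library. The function names group_thousands and rewrite_experience are made up for illustration, and the sample assumes the same span root element as above.

```python
import xml.etree.ElementTree as ET


def group_thousands(digits, sep="."):
    """Insert sep between groups of three digits, counted from the right."""
    return f"{int(digits):,}".replace(",", sep)


def rewrite_experience(xml_text):
    """Return the XML with every all-digit <experience> value regrouped."""
    root = ET.fromstring(xml_text)
    for exp in root.iter("experience"):
        value = (exp.text or "").strip()
        # Skip values that are empty or already contain a separator.
        if value.isdigit():
            exp.text = group_thousands(value)
    return ET.tostring(root, encoding="unicode")


SAMPLE = (
    "<span>"
    "<char><experience>53842</experience></char>"
    "<char><experience>1234567</experience></char>"
    "</span>"
)
print(rewrite_experience(SAMPLE))
```

Because the value is parsed as an integer first, this works for any number of digits and leaves already formatted values untouched.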

XPath to find element with a HTML line break

I need an xpath that will find some text containing HTML line breaks <br/>. For example:
<ul>
<li>ABC<br/>DEF</li>
<li>XYZ<br/>NOP</li>
</ul>
Let's say I'm trying to find the li that contains ABC<br/>DEF. I've tried the following:
$x("//li[normalize-space(.)='ABC DEF']")
$x("//li[text() ='ABC<br/>DEF']")
$x("//li[contains(., 'ABC DEF')]")
But they return nothing. I saw this answer XPath contains(text(),'some string') doesn't work when used with node with more than one Text subnode but I couldn't figure out how to use it in my case.
The following expression will get you close:
li[br[preceding-sibling::node()[1] = 'ABC']
[starts-with(following-sibling::node()[1], 'DEF')]]
If you need to match only items where the text ends with ABC, it will be a little longer.
The following transform will select the first matching li:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes" />
<xsl:template match="/">
<matches>
<xsl:copy-of select="(//li[br[preceding-sibling::node()[1] = 'ABC']
[starts-with(following-sibling::node()[1], 'DEF')]
])
[1]" />
</matches>
</xsl:template>
</xsl:stylesheet>
Input:
<ul>
<li>ABC<br/>DEF</li>
<li>XYZ<br/>NOP</li>
<li><p>XYZ<br/>NOP</p></li>
<li>ABC<br/>DEF</li>
<li>DEF GHI</li>
<li>ABC<![CDATA[<br/>]]>DEF</li>
</ul>
Output:
<?xml version="1.0" encoding="utf-8"?>
<matches>
<li>ABC<br />DEF</li>
</matches>
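Outside an XPath engine, the same test (a br whose preceding text is exactly ABC and whose following text starts with DEF) can be approximated in Python with the standard library. This is only a sketch, assuming the input is well-formed XML like the sample above; the helper name has_break_between is made up for illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""<ul>
<li>ABC<br/>DEF</li>
<li>XYZ<br/>NOP</li>
<li><p>XYZ<br/>NOP</p></li>
<li>ABC<br/>DEF</li>
<li>DEF GHI</li>
<li>ABC<![CDATA[<br/>]]>DEF</li>
</ul>""")


def has_break_between(li, before, after):
    """True if li has a direct <br> child with `before` right in front of it
    and text starting with `after` right behind it."""
    children = list(li)
    for i, child in enumerate(children):
        if child.tag != "br":
            continue
        # Text in front of the br: the li's own text, or the previous child's tail.
        preceding = li.text if i == 0 else children[i - 1].tail
        following = child.tail or ""
        if (preceding or "") == before and following.startswith(after):
            return True
    return False


matches = [
    index
    for index, li in enumerate(doc.findall("li"))
    if has_break_between(li, "ABC", "DEF")
]
print(matches)  # [0, 3]
```

The last li is not matched because its <br/> sits inside a CDATA section and is therefore plain text, not an element.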
//li[br]
This should work. It means: select all li elements that have a br child.

how can I get a list of indexes of nodes that have a value using xpath

using the following;
<a>
<b>false</b>
<b>true</b>
<b>false</b>
<b>false</b>
<b>true</b>
</a>
I want to get the following result using something like
/a/b[.='true'].position()
for a result like
2,5 (as in a collection of the 2 positions)
I. XPath 1.0 solution:
Use:
count(/*/*[.='true'][1]/preceding-sibling::*)+1
This produces the position of the first b element whose string value is "true":
2
Repeat the evaluation with [1] replaced by [2], [3], and so on, up to count(/*/*[.='true']).
XSLT - based verification:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:for-each select="/*/*[.='true']">
<xsl:variable name="vPos" select="position()"/>
<xsl:value-of select=
"count(/*/*[.='true'][$vPos]
/preceding-sibling::*) +1"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<a>
<b>false</b>
<b>true</b>
<b>false</b>
<b>false</b>
<b>true</b>
</a>
The XPath expression is constructed and evaluated for every b whose string value is "true". The results of these evaluations are copied to the output:
2
5
II. XPath 2.0 solution:
Use:
index-of(/*/*, 'true')
XSLT 2.0 - based verification:
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:sequence select="index-of(/*/*, 'true')"/>
</xsl:template>
</xsl:stylesheet>
When this XSLT 2.0 transformation is applied on the same XML document (above), the XPath 2.0 expression is evaluated and the result of this evaluation is copied to the output:
2 5
A basic (and working) approach in Python:
from lxml import etree
root = etree.XML("""
<a>
<b>false</b>
<b>true</b>
<b>false</b>
<b>false</b>
<b>true</b>
</a>
""")
# Collect the 1-based positions of the <b> elements whose text is 'true'
lst = []
for c, b in enumerate(root.xpath('/a/b'), start=1):
    if b.text == 'true':
        lst.append(str(c))
print(",".join(lst))

XPath ignore span

I have a HTML which contains some tags like below:
<div id="SNT">text1</div>
<div id="SNT">text2</div>
<div id="SNT">textbase1<span style='color: #EFFFFF'>text3</span></div>
<div id="SNT">textbase2<span style='color: #EFFFFF'>text4</span></div>
how can I get all the texts included in all <div> tags using XPath, ignoring the span fields?
i.e.:
text1
text2
textbase1text3
textbase2text4
This cannot be specified with a single XPath 1.0 expression.
You need to first select all relevant div elements:
//div[@id='SNT']
then for each selected node get its string value:
string(.)
In XPath 2.0 this can be specified with a single expression:
//div[@id='SNT']/string(.)
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="div[@id='SNT']">
<xsl:copy-of select="string()"/>
========
</xsl:template>
</xsl:stylesheet>
When this XSLT 1.0 transformation is applied on the following XML document (the provided XML fragment, wrapped into a single top element):
<t>
<div id="SNT">text1</div>
<div id="SNT">text2</div>
<div id="SNT">textbase1<span style='color: #EFFFFF'>text3</span></div>
<div id="SNT">textbase2<span style='color: #EFFFFF'>text4</span></div>
</t>
the relevant div elements are selected (matched) and processed by the only specified template, in which the string(.) XPath expression is evaluated and its result is copied to the output:
text1
========
text2
========
textbase1text3
========
textbase2text4
========
And for the XPath 2.0 expression:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:copy-of select="//div[@id='SNT']/string(.)"/>
</xsl:template>
</xsl:stylesheet>
When this XSLT 2.0 transformation is applied on the same XML document (above), the XPath 2.0 expression is evaluated and the result (four strings) is copied to the output:
text1 text2 textbase1text3 textbase2text4
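The "select the divs, then take the string value of each" idea can also be sketched in Python with the standard library, where itertext() plays the role of string(.). This is a minimal sketch, assuming the fragment is wrapped in a single top element as above; the function name div_texts is made up for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<t>
<div id="SNT">text1</div>
<div id="SNT">text2</div>
<div id="SNT">textbase1<span style='color: #EFFFFF'>text3</span></div>
<div id="SNT">textbase2<span style='color: #EFFFFF'>text4</span></div>
</t>"""


def div_texts(xml_text, div_id="SNT"):
    """Return the concatenated text of each matching div, spans included."""
    root = ET.fromstring(xml_text)
    divs = root.findall(f".//div[@id='{div_id}']")
    # itertext() walks the div and all of its descendants in document order.
    return ["".join(div.itertext()) for div in divs]


texts = div_texts(SAMPLE)
print("\n".join(texts))
```

For real-world HTML that is not well-formed XML you would need an HTML parser instead of ET.fromstring.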
You could simply use:
//div/text()
or
div/text()
Hope this helps.
See The lxml.etree Tutorial and search for "Using XPath to find text".
For example:
from lxml import etree
html = """
<span class='demo'>
Hi,
<span>Tom</span>
</span>
"""
tree = etree.HTML(html)
node = tree.xpath('//span[@class="demo"]')[0]
print(node.xpath('string()'))
If there is no other content in the HTML files, just those <div>s inside the usual HTML root elements, the following stylesheet will be sufficient to extract the text:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
</xsl:stylesheet>
If you only want the <div>s, and only with those particular IDs, use the following code - it also makes sure the linebreaks are like in your example:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="//div[@id='SNT']">
<xsl:copy-of select="node()|text()"/><xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>

XPath: how to get text from this and next tag?

I have HTML like this:
<h1>Hello1</h1>
<p>World1</p>
<h1>Hello2</h1>
<p>World2</p>
<h1>Hello2</h1>
<p>World2</p>
So I need to get Hello1 together with World1, Hello2 with World2, and so on, in one pass.
UPDATE: I use Ruby Mechanize library
The Ruby library "Mechanize" uses the Nokogiri parsing library, so you can call Nokogiri directly. One potential solution might look something like this:
require 'mechanize'
require 'pp'
html = "<h1>Hello1</h1>
<p>World1</p>
<h1>Hello2</h1>
<p>World2</p>
<h1>Hello2</h1>
<p>World2</p>"
results = []
Nokogiri::HTML(html).xpath("//h1").each do |header|
p = header.xpath("following-sibling::p[1]").text
results << [header.text, p]
end
pp results
EDIT:
This example was tested with Mechanize v2.0.1 which uses Nokogiri ~v1.4. I also tested directly against Nokogiri v1.5.0 without issue.
EDIT #2:
This example answers a follow-up question to the original solution:
require 'nokogiri'
require 'pp'
html = <<HTML
<h1>
<p>
<font size="4">
<b>abide by (something)</b>
</font>
</p>
</h1>
<p>
<font size="3">- to follow the rules of something</font>
</p>
The cleaning staff must abide by the rules of the school.
<br>
<h1>
<p>
<font size="4">
<b>able to breathe easily again</b>
</font>
</p>
</h1>
<p>
My friend was able to breathe easily again when his company did not go bankrupt.
<br>
HTML
doc = Nokogiri::HTML(html)
results = []
doc.xpath("//h1").each do |header|
h1 = header.xpath("following-sibling::p/font/b").text
results << h1
end
pp results
H1 tags with nested elements are invalid, so Nokogiri corrects the error during the parsing process. The process to get at the formerly nested elements is very similar to the original solution.
Note: I glossed over the XPath part of this request. This answer is for an XSLT style sheet instead.
Expanding your XML example to give it a root element:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<h1>Hello1</h1>
<p>World1</p>
<h1>Hello2</h1>
<p>World2</p>
<h1>Hello3</h1>
<p>World3</p>
</root>
You could use a for-each loop along with "following-sibling" to get the elements with something like this:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output encoding="UTF-8" method="text"/>
<xsl:template match="/">
<!-- start looking for <h1> nodes -->
<xsl:for-each select="/root/h1">
<!-- output the h1 text -->
<xsl:value-of select="."/>
<!-- print a dash for spacing -->
<xsl:text> - </xsl:text>
<!-- select the next <p> node -->
<xsl:value-of select="following-sibling::p[1]"/>
<!-- print a new line -->
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The output would look like this:
Hello1 - World1
Hello2 - World2
Hello3 - World3
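The same "each h1 with its first following p sibling" pairing can be sketched in Python with the standard library, which has no following-sibling axis, so the siblings are walked by hand. This is a minimal sketch, assuming the input has been given a root element as above; the function name pair_headings is made up for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<root>
<h1>Hello1</h1>
<p>World1</p>
<h1>Hello2</h1>
<p>World2</p>
<h1>Hello3</h1>
<p>World3</p>
</root>"""


def pair_headings(xml_text):
    """Return (h1 text, text of the first following <p> sibling) pairs."""
    children = list(ET.fromstring(xml_text))
    pairs = []
    for i, node in enumerate(children):
        if node.tag != "h1":
            continue
        # Equivalent of following-sibling::p[1]: the first later sibling named p.
        following_p = next((s for s in children[i + 1:] if s.tag == "p"), None)
        pairs.append((node.text, following_p.text if following_p is not None else None))
    return pairs


for heading, paragraph in pair_headings(SAMPLE):
    print(f"{heading} - {paragraph}")
```

An h1 with no p after it is paired with None rather than being dropped.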

Resources