I have an HTML document that contains tags like the following:
<div id="SNT">text1</div>
<div id="SNT">text2</div>
<div id="SNT">textbase1<span style='color: #EFFFFF'>text3</span></div>
<div id="SNT">textbase2<span style='color: #EFFFFF'>text4</span></div>
How can I get all the text contained in every <div> tag using XPath, ignoring the <span> markup?
That is, the expected result is:
text1
text2
textbase1text3
textbase2text4
This cannot be specified with a single XPath 1.0 expression.
You need to first select all relevant div elements:
//div[@id='SNT']
then, for each selected node, get its string value:
string(.)
In XPath 2.0 this can be specified with a single expression:
//div[@id='SNT']/string(.)
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="div[@id='SNT']">
<xsl:copy-of select="string()"/>
========
</xsl:template>
</xsl:stylesheet>
When this XSLT 1.0 transformation is applied on the following XML document (the provided XML fragment, wrapped into a single top element):
<t>
<div id="SNT">text1</div>
<div id="SNT">text2</div>
<div id="SNT">textbase1<span style='color: #EFFFFF'>text3</span></div>
<div id="SNT">textbase2<span style='color: #EFFFFF'>text4</span></div>
</t>
the relevant div elements are selected (matched) and processed by the only specified template, in which the string(.) XPath expression is evaluated and its result is copied to the output:
text1
========
text2
========
textbase1text3
========
textbase2text4
========
And for the XPath 2.0 expression:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:copy-of select="//div[@id='SNT']/string(.)"/>
</xsl:template>
</xsl:stylesheet>
When this XSLT 2.0 transformation is applied on the same XML document (above), the XPath 2.0 expression is evaluated and the result (four strings) is copied to the output:
text1 text2 textbase1text3 textbase2text4
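For readers working outside XSLT, the two-step XPath 1.0 approach (select the div elements, then take each one's string value) can be sketched in Python. This is a minimal sketch that assumes the standard-library xml.etree.ElementTree module rather than a full XPath engine; div_texts is a hypothetical helper name, and joining itertext() stands in for string(.).

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in a single root element.
DOC = """<t>
<div id="SNT">text1</div>
<div id="SNT">text2</div>
<div id="SNT">textbase1<span style='color: #EFFFFF'>text3</span></div>
<div id="SNT">textbase2<span style='color: #EFFFFF'>text4</span></div>
</t>"""


def div_texts(xml_text):
    """Return the string value of every div whose id is 'SNT' (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # Step 1: select the relevant div elements.
    divs = root.findall(".//div[@id='SNT']")
    # Step 2: concatenate all descendant text of each div, like string(.).
    return ["".join(div.itertext()) for div in divs]


texts = div_texts(DOC)
print("\n".join(texts))
```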
You could simply use:
//div/text()
or
div/text()
Hope this helps.
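One caveat worth checking: text() selects only the text nodes that are direct children of each div, so text nested inside a span is not part of the result. The sketch below illustrates the difference in Python. It assumes the standard-library ElementTree module, which has no text() step, so the hypothetical helper direct_text_nodes emulates div/text() by hand.

```python
import xml.etree.ElementTree as ET

# Two of the divs from the question, wrapped in a single root element.
DOC = """<t>
<div id="SNT">text1</div>
<div id="SNT">textbase1<span style='color: #EFFFFF'>text3</span></div>
</t>"""


def direct_text_nodes(elem):
    """Emulate elem/text(): only text nodes that are direct children of elem."""
    parts = [elem.text] + [child.tail for child in elem]
    return [part for part in parts if part]


root = ET.fromstring(DOC)
divs = root.findall(".//div")
direct = [direct_text_nodes(div) for div in divs]  # like //div/text()
full = ["".join(div.itertext()) for div in divs]   # like string(.) per div
print(direct)
print(full)
```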
See the "Using XPath to find text" section of The lxml.etree Tutorial.
For example:
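from lxml import etree

html = """
<span class='demo'>
Hi,
<span>Tom</span>
</span>
"""
tree = etree.HTML(html)
# Select the outer span by its class attribute.
node = tree.xpath('//span[@class="demo"]')[0]
# string() returns the concatenated text of the node and all its descendants.
print(node.xpath('string()'))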
If there is no other content in the HTML files, just those <div>s inside the usual HTML root elements, the following stylesheet will be sufficient to extract the text:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
</xsl:stylesheet>
If you only want the <div>s, and only with those particular IDs, use the following code - it also makes sure the linebreaks are like in your example:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="//div[@id='SNT']">
<xsl:copy-of select="node()|text()"/><xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
I want to transform XML using Nokogiri. I built an XSL stylesheet and it all works fine; I also tested it in IntelliJ. My data comes from two XML files.
My problem occurs when I try to get Nokogiri to do the transform. I can't seem to find a way to get it to parse multiple source files.
This is the code I am using from the documentation:
require 'nokogiri'
doc1 = Nokogiri::XML(File.read('F:/transcoder/xslt_repo/core_xml.xml'))
xslt = Nokogiri::XSLT(File.read('F:/transcoder/xslt_repo/google.xsl'))
puts xslt.transform(doc1)
I tried:
require 'nokogiri'
doc1 = Nokogiri::XML(File.read('F:/transcoder/xslt_repo/core_xml.xml'))
doc2 = Nokogiri::XML(File.read('F:/transcoder/xslt_repo/file_data.xml'))
xslt = Nokogiri::XSLT(File.read('F:/transcoder/xslt_repo/test.xsl'))
puts xslt.transform(doc1,doc2)
However it seems transform only takes one argument, so at the moment I am only able to parse half the data I need:
<?xml version="1.0"?>
<package package_id="LB000001">
<asset_metadata>
<series_title>test asset 1</series_title>
<season_title>Number 1</season_title>
<episode_title>ET 1</episode_title>
<episode_number>1</episode_number>
<license_start_date>21-07-2016</license_start_date>
<license_end_date>31-07-2016</license_end_date>
<rating>15</rating>
<synopsis>This is a test asset</synopsis>
</asset_metadata>
<video_file>
<file_name/>
<file_size/>
<check_sum/>
</video_file>
<image_1>
<file_name/>
<file_size/>
<check_sum/>
</image_1>
</package>
How can I get this to work?
Edit:
This is the core_metadata.xml which is created via a PHP code block and the data comes from a database.
<?xml version="1.0" encoding="utf-8"?>
<manifest task_id="00000000373">
<asset_metadata>
<material_id>LB111111</material_id>
<series_title>This is a test</series_title>
<season_title>This is a test</season_title>
<season_number>1</season_number>
<episode_title>that test</episode_title>
<episode_number>2</episode_number>
<start_date>23-08-2016</start_date>
<end_date>31-08-2016</end_date>
<ratings>15</ratings>
<synopsis>this is a test</synopsis>
</asset_metadata>
<file_info>
<source_filename>LB111111</source_filename>
<number_of_segments>2</number_of_segments>
<segment_1 seg_1_start="00:00:10.000" seg_1_dur="00:01:00.000"/>
<segment_2 seg_2_start="00:02:00.000" seg_2_dur="00:05:00.000"/>
<conform_profile definition="hd" aspect_ratio="16f16">ffmpeg -i S_PATH/F_NAME.mp4 SEG_CONFORM 2> F:/Transcoder/logs/transcode_logs/LOG_FILE.txt</conform_profile>
<transcode_profile profile_name="xbox" package_type="tar">ffmpeg -f concat -i T_PATH/CONFORM_LIST TRC_PATH/F_NAME.mp4 2> F:/Transcoder/logs/transcode_logs/LOG_FILE.txt</transcode_profile>
<target_path>F:/profiles/xbox</target_path>
</file_info>
</manifest>
The second XML file (file_data.xml) is created dynamically during the transcode process by Nokogiri:
<?xml version="1.0"?>
<file_data>
<video_file>
<file_name>LB111111_xbox_230816114438.mp4</file_name>
<file_size>141959922</file_size>
<md5_checksum>bac7670e55c0694059d3742285079cbf</md5_checksum>
</video_file>
<image_1>
<file_name>test</file_name>
<file_size>test</file_size>
<md5_checksum>test</md5_checksum>
</image_1>
</file_data>
I managed to work around this issue by hard-coding the file_data.xml reference into the XSLT file:
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/">
<package>
<xsl:attribute name="package_id">
<xsl:value-of select="manifest/asset_metadata/material_id"/>
</xsl:attribute>
<asset_metadata>
<series_title>
<xsl:value-of select="manifest/asset_metadata/series_title"/>
</series_title>
<season_title>
<xsl:value-of select="manifest/asset_metadata/season_title"/>
</season_title>
<episode_title>
<xsl:value-of select="manifest/asset_metadata/episode_title"/>
</episode_title>
<episode_number>
<xsl:value-of select="manifest/asset_metadata/episode_number"/>
</episode_number>
<license_start_date>
<xsl:value-of select="manifest/asset_metadata/start_date"/>
</license_start_date>
<license_end_date>
<xsl:value-of select="manifest/asset_metadata/end_date"/>
</license_end_date>
<rating>
<xsl:value-of select="manifest/asset_metadata/ratings"/>
</rating>
<synopsis>
<xsl:value-of select="manifest/asset_metadata/synopsis"/>
</synopsis>
</asset_metadata>
<video_file>
<file_name>
<xsl:value-of select="document('file_data.xml')/file_data/video_file/file_name"/>
</file_name>
<file_size>
<xsl:value-of select="document('file_data.xml')/file_data/video_file/file_size"/>
</file_size>
<check_sum>
<xsl:value-of select="document('file_data.xml')/file_data/video_file/md5_checksum"/>
</check_sum>
</video_file>
<image_1>
<file_name>
<xsl:value-of select="document('file_data.xml')/file_data/image_1/file_name"/>
</file_name>
<file_size>
<xsl:value-of select="document('file_data.xml')/file_data/image_1/file_size"/>
</file_size>
<check_sum>
<xsl:value-of select="document('file_data.xml')/file_data/image_1/md5_checksum"/>
</check_sum>
</image_1>
</package>
</xsl:template>
</xsl:stylesheet>
I then use Saxon to do the transform:
xslt = "java -jar C:/SaxonHE9-7-0-7J/saxon9he.jar #{temp}core_metadata.xml #{temp}#{profile}.xsl > #{temp}#{file_name}.xml"
system("#{xslt}")
I would love to find a way to do this without having to hard-code file_data.xml into the XSLT.
Merge XML Documents and Transform
You'll have to do a bit of work to combine the XML content prior to your XSL transformation. @the-Tin-Man has a nice answer to a similar question in the archives, which can be adapted for your use case.
Let's say we have the following sample content:
<!--a.xml-->
<?xml version="1.0"?>
<xml>
<packages>
<package>Data here for A</package>
<package>Another Package</package>
</packages>
</xml>
<!--end a.xml-->
<!--b.xml-->
<?xml version="1.0"?>
<xml>
<packages>
<package>B something something</package>
</packages>
</xml>
<!--end b.xml-->
And we want to apply the following XSLT template:
<!--transform.xslt-->
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="//packages">
<html>
<body>
<h2>Packages</h2>
<ol>
<xsl:for-each select="./package">
<li><xsl:value-of select="text()"/></li>
</xsl:for-each>
</ol>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
<!--end transform.xslt-->
If we have parallel document structure, as in this case, we can merge the two XML documents' content together and pass that along for transformation.
require 'nokogiri'
doc1 = Nokogiri::XML(File.read('./a.xml'))
doc2 = Nokogiri::XML(File.read('./b.xml'))
moved_packages = doc2.search('package')
doc1.at('/descendant::packages[1]').add_child(moved_packages)
xslt = Nokogiri::XSLT(File.read('./transform.xslt'))
puts xslt.transform(doc1)
This would generate the following output:
<html><body>
<h2>Packages</h2>
<ol>
<li>Data here for A</li>
<li>Another Package</li>
<li>B something something</li>
</ol>
</body></html>
If your XML documents have varying structure, you may benefit from an intermediary XML nodeset that you add your content to, rather than the shortcut of merging document 2 content into document 1.
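For the two files in the question, whose structures differ, one possible shape of that intermediary document is a new root element that holds both source documents. The sketch below shows the idea in Python with the standard-library ElementTree module rather than Nokogiri. The wrapper element name combined and the helper combine are assumptions made for illustration, and the sample documents are trimmed versions of the ones in the question.

```python
import xml.etree.ElementTree as ET

# Trimmed versions of the two source documents from the question.
MANIFEST = """<manifest task_id="00000000373">
  <asset_metadata><material_id>LB111111</material_id></asset_metadata>
</manifest>"""

FILE_DATA = """<file_data>
  <video_file><file_name>LB111111_xbox_230816114438.mp4</file_name></video_file>
</file_data>"""


def combine(*xml_texts, root_tag="combined"):
    """Wrap each parsed document under one new root (hypothetical tag name)."""
    wrapper = ET.Element(root_tag)
    for text in xml_texts:
        wrapper.append(ET.fromstring(text))
    return wrapper


combined = combine(MANIFEST, FILE_DATA)
print(ET.tostring(combined, encoding="unicode"))
```

A stylesheet applied to such a combined document could then address both inputs from one tree, for example /combined/manifest/asset_metadata/material_id and /combined/file_data/video_file/file_name, instead of relying on a hard-coded document() call.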
I need an XPath expression that will find text containing HTML line breaks (<br/>). For example:
<ul>
<li>ABC<br/>DEF</li>
<li>XYZ<br/>NOP</li>
</ul>
Let's say I'm trying to find the li that contains ABC<br/>DEF. I've tried the following:
$x("//li[normalize-space(.)='ABC DEF']")
$x("//li[text() ='ABC<br/>DEF']")
$x("//li[contains(., 'ABC DEF')]")
But they return nothing. I saw the answer to "XPath contains(text(),'some string') doesn't work when used with node with more than one Text subnode", but I couldn't figure out how to apply it to my case.
The following expression will get you close:
li[br[preceding-sibling::node()[1] = 'ABC']
[starts-with(following-sibling::node()[1], 'DEF')]]
If you need to match only items where the text ends with ABC, it will be a little longer.
The following transform will select the first matching li:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes" />
<xsl:template match="/">
<matches>
<xsl:copy-of select="(//li[br[preceding-sibling::node()[1] = 'ABC']
[starts-with(following-sibling::node()[1], 'DEF')]
])
[1]" />
</matches>
</xsl:template>
</xsl:stylesheet>
Input:
<ul>
<li>ABC<br/>DEF</li>
<li>XYZ<br/>NOP</li>
<li><p>XYZ<br/>NOP</p></li>
<li>ABC<br/>DEF</li>
<li>DEF GHI</li>
<li>ABC<![CDATA[<br/>]]>DEF</li>
</ul>
Output:
<?xml version="1.0" encoding="utf-8"?>
<matches>
<li>ABC<br />DEF</li>
</matches>
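The same test (a br whose preceding text is exactly ABC and whose following text starts with DEF) can be sketched in Python. This is a simplified sketch that assumes the standard-library ElementTree module; the helper names are hypothetical, and unlike the XPath expression it only looks at text nodes adjacent to the br, not at neighbouring elements.

```python
import xml.etree.ElementTree as ET

# The input document used in the transform above.
DOC = """<ul>
<li>ABC<br/>DEF</li>
<li>XYZ<br/>NOP</li>
<li><p>XYZ<br/>NOP</p></li>
<li>ABC<br/>DEF</li>
<li>DEF GHI</li>
<li>ABC<![CDATA[<br/>]]>DEF</li>
</ul>"""


def text_before(parent, index):
    """Text immediately preceding the child at `index`, or '' if there is none."""
    if index == 0:
        return parent.text or ""
    return parent[index - 1].tail or ""


def items_split_by_br(ul, before, after_prefix):
    """Return 1-based positions of li children that match the br test."""
    hits = []
    for position, li in enumerate(ul.findall("li"), start=1):
        for index, child in enumerate(li):
            if (child.tag == "br"
                    and text_before(li, index) == before
                    and (child.tail or "").startswith(after_prefix)):
                hits.append(position)
                break
    return hits


ul = ET.fromstring(DOC)
matches = items_split_by_br(ul, "ABC", "DEF")
print(matches)
```

The CDATA item is not reported because its <br/> is plain text rather than an element, which mirrors the behaviour of the XPath expression.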
//li[br]
This should work. It means: select all li elements having br child
Using the following:
<a>
<b>false</b>
<b>true</b>
<b>false</b>
<b>false</b>
<b>true</b>
</a>
I want to get the following result using something like
/a/b[.='true'].position()
for a result like
2,5 (as in a collection of the 2 positions)
I. XPath 1.0 solution:
Use:
count(/*/*[.='true'][1]/preceding-sibling::*)+1
This produces the position of the first b element whose string value is "true":
2
Repeat the evaluation with a similar expression in which [1] is replaced by [2], [3], and so on, up to count(/*/*[.='true']).
XSLT - based verification:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:for-each select="/*/*[.='true']">
<xsl:variable name="vPos" select="position()"/>
<xsl:value-of select=
"count(/*/*[.='true'][$vPos]
/preceding-sibling::*) +1"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<a>
<b>false</b>
<b>true</b>
<b>false</b>
<b>false</b>
<b>true</b>
</a>
The XPath expression is constructed and evaluated for every b whose string value is "true". The results of these evaluations are copied to the output:
2
5
II. XPath 2.0 solution:
Use:
index-of(/*/*, 'true')
XSLT 2.0 - based verification:
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:sequence select="index-of(/*/*, 'true')"/>
</xsl:template>
</xsl:stylesheet>
When this XSLT 2.0 transformation is applied on the same XML document (above), the XPath 2.0 expression is evaluated and the result of this evaluation is copied to the output:
2 5
A basic, working approach in Python:
from lxml import etree
root = etree.XML("""
<a>
<b>false</b>
<b>true</b>
<b>false</b>
<b>false</b>
<b>true</b>
</a>
""")
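positions = []
# Count from 1 because XPath positions are 1-based.
for position, text in enumerate(root.xpath('/a/b/text()'), start=1):
    if text == 'true':
        positions.append(str(position))
print(",".join(positions))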
From the given HTML:
<span class="flag_16 left_16 armenia_16_left"> First League</span>
how can I get only the string armenia, or at least armenia_16_left?
Thanks in advance.
Use this XPath 1.0 expression:
substring-before(substring-after(substring-after(/span/@class, ' '), ' '), '_')
In XPath 2.0 one can simply use:
tokenize(tokenize(/span/@class, ' ')[last()], '_')[1]
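The same string manipulation can be sketched in Python for comparison. This is a minimal sketch that assumes the standard-library ElementTree module; last_class_token and name_from_class are hypothetical helper names, and they mirror the tokenize-based XPath 2.0 expression (last whitespace-separated token, then the part before the first underscore).

```python
import xml.etree.ElementTree as ET

# The element from the question.
span = ET.fromstring(
    '<span class="flag_16 left_16 armenia_16_left"> First League</span>'
)


def last_class_token(class_value):
    """Return the last whitespace-separated class name."""
    return class_value.split()[-1]


def name_from_class(class_value):
    """Return the part of the last class name before its first underscore."""
    return last_class_token(class_value).split("_")[0]


token = last_class_token(span.get("class"))
name = name_from_class(span.get("class"))
print(token, name)
```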
XSLT-based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
"<xsl:value-of select=
"substring-before(substring-after(substring-after(/span/@class, ' '), ' '), '_')
"/>"
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<span class="flag_16 left_16 armenia_16_left"> First League</span>
the XPath expression is evaluated and the result is copied to the output:
"armenia"
When this XSLT 2.0 transformation is applied on the same XML document (above):
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
"<xsl:sequence select=
"tokenize(tokenize(/span/@class, ' ')[last()], '_')[1]"/>"
</xsl:template>
</xsl:stylesheet>
again the same correct result is produced:
"armenia"
Using only an XPath expression (and not in XSLT or DOM - just pure XPath), I'm trying to create a relative path from the current node (in a td) to an associated td in the same column of the same HTML table.
For example, suppose I have this type of data:
<table>
<tr> <td><a>Blue Jeans</a></td> <td><a>Shirt</a></td> </tr>
<tr> <td><span>$21.50</span></td> <td><span>$18.99</span></td> </tr>
</table>
and I'm on the a with "Blue Jeans" and want to find the price ($21.50). In XSLT, I could use the current() function to get the answer like this:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" />
<xsl:template match="/">
<xsl:apply-templates select="//a" />
</xsl:template>
<xsl:template match="a">
Name: <xsl:value-of select="."/>
Price: <xsl:value-of select="../../following-sibling::tr[1]/td[position() = count(current()/../preceding-sibling::td) + 1]" />
</xsl:template>
</xsl:stylesheet>
But the problem I'm running into is that there is no current() defined in XPath 1.0. I tried using the self:: axis, but like the "." shorthand, that only points to the "context" node, not the "current" node. The language that I'm seeing in the XPath standard suggests that XPath doesn't have a concept of "current node."
Is there perhaps another way to form this path or is this a limitation of XPath?
In XPath 1.0 you could do:
/table/tr/td/a[.='Blue Jeans']/following::td[count(../td)]/span
Of course, this assumes there is no colspan.
EDIT: The proof. This stylesheet:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text"/>
<xsl:param name="pProduct" select="'Blue Jeans'"/>
<xsl:template match="/">
<xsl:value-of select="/table/tr/td/a[.=$pProduct]
/following::td[count(../td)]/span"/>
</xsl:template>
</xsl:stylesheet>
Output:
$21.50
With param pProduct set to 'Shirt', output:
$18.99
Note: Of course, you need the a element in context in order to select the span element. So, with your stylesheet:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text"/>
<xsl:template match="text()"/>
<xsl:template match="a">
Name: <xsl:value-of select="."/>
Price: <xsl:value-of select="following::td[count(../td)]/span" />
</xsl:template>
</xsl:stylesheet>
Output:
Name: Blue Jeans
Price: $21.50
Name: Shirt
Price: $18.99
This cannot be achieved with a single XPath 1.0 expression.
In XPath 2.0 one could write:
for $vPreceeding in count(../preceding-sibling::td)
return ../../following-sibling::tr[1]/td[$vPreceeding + 1]
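The column-index idea behind these expressions (count the cells before the current one, then pick the cell at the same position in the next row) can also be sketched in Python. This is a minimal sketch that assumes the standard-library ElementTree module and, like the XPath answers, a table without colspan; price_for is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The table from the question.
TABLE = """<table>
<tr> <td><a>Blue Jeans</a></td> <td><a>Shirt</a></td> </tr>
<tr> <td><span>$21.50</span></td> <td><span>$18.99</span></td> </tr>
</table>"""


def price_for(table, product):
    """Return the span text in the cell directly below the product's cell."""
    rows = table.findall("tr")
    for row_index, row in enumerate(rows[:-1]):
        for column, td in enumerate(row.findall("td")):
            link = td.find("a")
            if link is not None and link.text == product:
                # Same column index in the following row.
                cells_below = rows[row_index + 1].findall("td")
                if column < len(cells_below):
                    span = cells_below[column].find("span")
                    return span.text if span is not None else None
    return None


table = ET.fromstring(TABLE)
print(price_for(table, "Blue Jeans"))
print(price_for(table, "Shirt"))
```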