Preserve whitespace when using xinclude with docbook

I use docbook to generate documents. The structure of the main document is modular using xinclude for the different modules.
My problem concerns verbatim elements (elements with significant whitespace) that are included in the main document via xinclude.
If I use a literallayout directly in the main document, the output is as expected: whitespace is preserved.
I want to use an included file that contains a section with a literallayout element.
If I generate a document with the included file, the whitespace is stripped from the output.
Can anyone tell me how to keep the whitespace in verbatim elements like literallayout or programlisting?
File1.xml:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE section PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<section id="someid">
<title>section title</title>
<para>
<literallayout>
This shall show a small picture with '0':
0
000
00000
</literallayout>
</para>
</section>
If I generate it as standalone document the output is as expected:
0
000
00000
If I use it as follows:
File2.xml:
<?xml version="1.0" encoding="ISO-8859-15"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<book>
<bookinfo>
...minimum input here
</bookinfo>
<chapter id="First_Chapter">
<title>Introduction</title>
<section id="First_Section">
<title>literallayout and programlisting in the main xml file</title>
<para><literallayout>This should look like a triangle built out of the character '0'
0
000
00000
</literallayout></para>
</section>
<xi:include href="File1.xml"
xmlns:xi="http://www.w3.org/2001/XInclude" />
</chapter>
</book>
If I generate this document, the first section is rendered as expected (as a pyramid),
but in the section included with xinclude all '0' characters are output on one line: 0 000 00000.

Probable solution
I think I found a solution to the problem myself:
It should help to add the attribute xml:space="preserve" to the included literallayout elements. This should be the default, but it seems that xinclude somehow "loses" or "replaces" it.
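One way to apply this workaround in bulk is to post-process the included files before the build. The sketch below is an assumption and not part of the original answer: it uses Python's standard xml.etree.ElementTree module, and the function name mark_verbatim and the list of verbatim tag names are hypothetical choices made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI that the reserved "xml:" prefix is bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"
# Assumed set of DocBook verbatim elements to mark; adjust as needed.
VERBATIM_TAGS = ("literallayout", "programlisting", "screen")


def mark_verbatim(xml_text):
    """Return xml_text with xml:space="preserve" set on verbatim elements."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.tag in VERBATIM_TAGS:
            elem.set("{%s}space" % XML_NS, "preserve")
    return ET.tostring(root, encoding="unicode")


sample = """<section id="someid">
<title>section title</title>
<literallayout>
  0
 000
00000
</literallayout>
</section>"""

marked = mark_verbatim(sample)
print(marked)
```

ElementTree does not keep a DOCTYPE declaration when it re-serializes, so a real DocBook file would need that declaration added back after this step.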

Here is an easy way to add CSS styling to the HTML that Docbook outputs:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:m="http://docbook.org/ns/docbook/modes"
version="2.0">
<!-- Use Docbook from online content delivery network -->
<xsl:import href="https://cdn.docbook.org/release/xsltng/current/xslt/docbook.xsl"/>
<xsl:param name="docbook-xsltng-base" select="'https://cdn.docbook.org/release/xsltng/current'"/>
<xsl:param name="resource-base-uri" select="concat($docbook-xsltng-base, '/resources/')"/>
<!-- Some CSS customizations -->
<xsl:template match="*" mode="m:html-head-last">
<style type="text/css">
pre.literallayout {
white-space: pre-wrap;
background: #D0E0FF;
padding: 1em;
border-radius: 0.5em;
line-height: normal;
font-size: smaller;
font-family: var(--mono-family);
}
</style>
</xsl:template>
</xsl:transform>
The m:html-head-last mode is one of many points of customization.

Related

Find and conditionally edit text in an XML file

I have a (XML-)file that has the following content:
<class>OverAll</class>
<char>
<rank> 1</rank>
<name> yyy</name>
<level> 9</level>
<experience>53842</experience>
<class>xxx</class>
</char>
<char>
<rank> 2</rank>
<name>aaa</name>
<level> 9</level>
<experience>53074</experience>
<class>zzz</class>
</char>
...and so on. I want to extract the number between the <experience> and </experience> tags and replace it with a modified version of that number. For example, the file should look like this after the script has run:
<class>OverAll</class>
<char>
<rank> 1</rank>
<name> yyy</name>
<level> 9</level>
<experience>53.842</experience>
<class>xxx</class>
</char>
<char>
<rank> 2</rank>
<name>aaa</name>
<level> 9</level>
<experience>53.074</experience>
<class>zzz</class>
</char>
(I want to add a thousands separator, and values above 1 million must be supported, so up to two thousands separators :)
I am able to find and replace the number, but I don't know how to take the number I found, modify it, and write it back to the line.
Perhaps someone can help here?
Thank you very much :)
A one-liner sed can do it, assuming the last three digits are always decimal:
sed -zE 's#([[:digit:]]{7,})([[:digit:]]{1})[[:space:]]*(</experience[[:space:]]*>)#\1.\2\3#g;s#([[:digit:]]{3})[[:space:]]*(</experience[[:space:]]*>)#.\1\2#g'
sed parameters breakdown:
-zE
-z or --null-data: Separate lines by NULL characters to allow pattern matching across lines, because spaces, tabs and newlines are allowed by the XML syntax before the > bracket of a tag.
-E or --regexp-extended: Use extended regular expressions in the script (for portability use POSIX -E).
s#([[:digit:]]{7,})([[:digit:]]{1})[[:space:]]*(</experience[[:space:]]*>)#\1.\2\3#g:
Insert a decimal point before the last digit of experience values containing eight or more digits (seven plus one: a million or more, with one extra decimal digit).
s#([[:digit:]]{3})[[:space:]]*(</experience[[:space:]]*>)#.\1\2#g:
Insert a decimal point before the last three digits of experience values ending in three digits (this automatically excludes the million-plus values already processed by the previous sed command).
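For readers who prefer to experiment outside sed, the two substitutions can be approximated with Python's re module. This is a sketch under assumptions: the function name add_separators is hypothetical, and \d and \s stand in for the POSIX classes [[:digit:]] and [[:space:]].

```python
import re

# Optional whitespace, then the closing tag (captured so it can be re-emitted).
CLOSE = r"\s*(</experience\s*>)"


def add_separators(xml_text):
    """Apply the two substitutions from the sed script, in the same order."""
    # Step 1: eight or more digits, so put the point before the last digit.
    out = re.sub(r"(\d{7,})(\d)" + CLOSE, r"\1.\2\3", xml_text)
    # Step 2: put the point before three digits that directly precede the
    # closing tag. Values changed in step 1 now end in ".d" and do not match.
    out = re.sub(r"(\d{3})" + CLOSE, r".\1\2", out)
    return out


small = add_separators("<experience>53842</experience>")
large = add_separators("<experience>55853074</experience>")
print(small)
print(large)
```

Like the sed version, this drops any whitespace between the number and the closing tag and does not check where in the tree the tag occurs.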
Now keep in mind that it is not parsing the XML either, because it will replace numbers in the <experience> tag anywhere in the XML tree.
Regular expressions are not meant to parse markup languages. There are better, more efficient and dedicated tools to manipulate XML with XSLT/XPATH like saxon, xsltproc, xmllint...
Using proper XML processing with xsltproc:
decimal-experience.xsl
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<!-- Cosmetic sugar to have the xml declaration header and indent -->
<xsl:output omit-xml-declaration="no" indent="yes"/>
<!-- Cosmetic sugar to remove unneeded spaces in elements -->
<xsl:strip-space elements="*"/>
<!-- Copy all the nodes as-is from the source xml -->
<xsl:template match="*">
<xsl:copy>
<xsl:apply-templates select="node()"/>
</xsl:copy>
</xsl:template>
<!-- Process the content of the experience tag within the char tag -->
<xsl:template match="char/experience/text()">
<!-- If the experience is not already in decimal form -->
<xsl:if test="not(contains(., '.'))">
<xsl:choose>
<!-- When the experience is less than a Million -->
<xsl:when test=". &lt; 9999999">
<!-- The last three digits are decimals -->
<xsl:value-of select="format-number(. div 1000, '0.000')"/>
</xsl:when>
<!-- Otherwise the experience is a Million or more -->
<xsl:otherwise>
<!-- The last digit is decimal -->
<xsl:value-of select="format-number(. div 10, '0.0')"/>
</xsl:otherwise>
</xsl:choose>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
Running the XSLT transformation above:
xsltproc decimal-experience.xsl characters.xml
Example output:
I created a valid fictitious characters.xml with a span root tag, because your extract was invalid XML.
<?xml version="1.0"?>
<span>
<class>OverAll</class>
<char>
<rank> 1</rank>
<name> yyy</name>
<level> 9</level>
<experience>53.842</experience>
<class>xxx</class>
</char>
<char>
<rank> 2</rank>
<name>aaa</name>
<level> 9</level>
<experience>53.074</experience>
<class>zzz</class>
</char>
<char>
<rank> 3</rank>
<name>Million</name>
<level>42</level>
<experience>5585307.4</experience>
<class>zzz</class>
</char>
</span>
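The same parse-then-rewrite idea can be sketched without an XSLT processor. The following is an illustration only, assuming Python's standard xml.etree.ElementTree module; the names format_experience and convert are hypothetical, and the threshold mirrors the one used in the stylesheet above.

```python
import xml.etree.ElementTree as ET


def format_experience(value):
    """Insert the decimal point using integer arithmetic only."""
    text = value.strip()
    if "." in text:
        return text  # already in decimal form, leave untouched
    number = int(text)
    if number < 9999999:
        whole, frac = divmod(number, 1000)  # last three digits are decimals
        return "%d.%03d" % (whole, frac)
    whole, frac = divmod(number, 10)  # last digit is the decimal
    return "%d.%d" % (whole, frac)


def convert(xml_text):
    """Rewrite every experience element that is a child of a char element."""
    root = ET.fromstring(xml_text)
    for exp in root.iterfind(".//char/experience"):
        if exp.text and exp.text.strip():
            exp.text = format_experience(exp.text)
    return ET.tostring(root, encoding="unicode")


source = (
    "<span><char><experience>53842</experience></char>"
    "<char><experience>55853074</experience></char></span>"
)
result = convert(source)
print(result)
```

Using divmod avoids floating-point formatting, so the digits are moved exactly as they appear in the input.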

Nokogiri XSLT transform using multiple source XML files

I want to translate XML using Nokogiri. I built an XSL and it all works fine. I ALSO tested it in Intellij. My data comes from two XML files.
My problem occurs when I try to get Nokogiri to do the transform. I can't seem to find a way to get it to parse multiple source files.
This is the code I am using from the documentation:
require 'nokogiri'
doc1 = Nokogiri::XML(File.read('F:/transcoder/xslt_repo/core_xml.xml',))
xslt = Nokogiri::XSLT(File.read('F:/transcoder/xslt_repo/google.xsl'))
puts xslt.transform(doc1)
I tried:
require 'nokogiri'
doc1 = Nokogiri::XML(File.read('F:/transcoder/xslt_repo/core_xml.xml',))
doc2 = Nokogiri::XML(File.read('F:/transcoder/xslt_repo/file_data.xml',))
xslt = Nokogiri::XSLT(File.read('F:/transcoder/xslt_repo/test.xsl'))
puts xslt.transform(doc1,doc2)
However it seems transform only takes one argument, so at the moment I am only able to parse half the data I need:
<?xml version="1.0"?>
<package package_id="LB000001">
<asset_metadata>
<series_title>test asset 1</series_title>
<season_title>Number 1</season_title>
<episode_title>ET 1</episode_title>
<episode_number>1</episode_number>
<license_start_date>21-07-2016</license_start_date>
<license_end_date>31-07-2016</license_end_date>
<rating>15</rating>
<synopsis>This is a test asset</synopsis>
</asset_metadata>
<video_file>
<file_name/>
<file_size/>
<check_sum/>
</video_file>
<image_1>
<file_name/>
<file_size/>
<check_sum/>
</image_1>
</package>
How can I get this to work?
Edit:
This is the core_metadata.xml which is created via a PHP code block and the data comes from a database.
<?xml version="1.0" encoding="utf-8"?>
<manifest task_id="00000000373">
<asset_metadata>
<material_id>LB111111</material_id>
<series_title>This is a test</series_title>
<season_title>This is a test</season_title>
<season_number>1</season_number>
<episode_title>that test</episode_title>
<episode_number>2</episode_number>
<start_date>23-08-2016</start_date>
<end_date>31-08-2016</end_date>
<ratings>15</ratings>
<synopsis>this is a test</synopsis>
</asset_metadata>
<file_info>
<source_filename>LB111111</source_filename>
<number_of_segments>2</number_of_segments>
<segment_1 seg_1_start="00:00:10.000" seg_1_dur="00:01:00.000"/>
<segment_2 seg_2_start="00:02:00.000" seg_2_dur="00:05:00.000"/>
<conform_profile definition="hd" aspect_ratio="16f16">ffmpeg -i S_PATH/F_NAME.mp4 SEG_CONFORM 2> F:/Transcoder/logs/transcode_logs/LOG_FILE.txt</conform_profile>
<transcode_profile profile_name="xbox" package_type="tar">ffmpeg -f concat -i T_PATH/CONFORM_LIST TRC_PATH/F_NAME.mp4 2> F:/Transcoder/logs/transcode_logs/LOG_FILE.txt</transcode_profile>
<target_path>F:/profiles/xbox</target_path>
</file_info>
</manifest>
The second XML (file_data.xml) is dynamically created during the transcode process by Nokogiri:
<?xml version="1.0"?>
<file_data>
<video_file>
<file_name>LB111111_xbox_230816114438.mp4</file_name>
<file_size>141959922</file_size>
<md5_checksum>bac7670e55c0694059d3742285079cbf</md5_checksum>
</video_file>
<image_1>
<file_name>test</file_name>
<file_size>test</file_size>
<md5_checksum>test</md5_checksum>
</image_1>
</file_data>
I managed to work around this issue by hard-coding a call to file_data.xml in the XSLT file:
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/">
<package>
<xsl:attribute name="package_id">
<xsl:value-of select="manifest/asset_metadata/material_id"/>
</xsl:attribute>
<asset_metadata>
<series_title>
<xsl:value-of select="manifest/asset_metadata/series_title"/>
</series_title>
<season_title>
<xsl:value-of select="manifest/asset_metadata/season_title"/>
</season_title>
<episode_title>
<xsl:value-of select="manifest/asset_metadata/episode_title"/>
</episode_title>
<episode_number>
<xsl:value-of select="manifest/asset_metadata/episode_number"/>
</episode_number>
<license_start_date>
<xsl:value-of select="manifest/asset_metadata/start_date"/>
</license_start_date>
<license_end_date>
<xsl:value-of select="manifest/asset_metadata/end_date"/>
</license_end_date>
<rating>
<xsl:value-of select="manifest/asset_metadata/ratings"/>
</rating>
<synopsis>
<xsl:value-of select="manifest/asset_metadata/synopsis"/>
</synopsis>
</asset_metadata>
<video_file>
<file_name>
<xsl:value-of select="document('file_data.xml')/file_data/video_file/file_name"/>
</file_name>
<file_size>
<xsl:value-of select="document('file_data.xml')/file_data/video_file/file_size"/>
</file_size>
<check_sum>
<xsl:value-of select="document('file_data.xml')/file_data/video_file/md5_checksum"/>
</check_sum>
</video_file>
<image_1>
<file_name>
<xsl:value-of select="document('file_data.xml')/file_data/image_1/file_name"/>
</file_name>
<file_size>
<xsl:value-of select="document('file_data.xml')/file_data/image_1/file_size"/>
</file_size>
<check_sum>
<xsl:value-of select="document('file_data.xml')/file_data/image_1/md5_checksum"/>
</check_sum>
</image_1>
</package>
</xsl:template>
</xsl:stylesheet>
I then use Saxon to do the transform:
xslt = "java -jar C:/SaxonHE9-7-0-7J/saxon9he.jar #{temp}core_metadata.xml #{temp}#{profile}.xsl > #{temp}#{file_name}.xml"
system("#{xslt}")
I would love to find a way to do this without having to hard-code file_data.xml in the XSLT.
Merge XML Documents and Transform
You'll have to do a bit of work to combine the XML content prior to your XSL transformation. @the-Tin-Man has a nice answer to a similar question in the archives, which can be adapted for your use case.
Let's say we have the following sample content:
<!--a.xml-->
<?xml version="1.0"?>
<xml>
<packages>
<package>Data here for A</package>
<package>Another Package</package>
</packages>
</xml>
<!--end a.xml-->
<!--b.xml-->
<?xml version="1.0"?>
<xml>
<packages>
<package>B something something</package>
</packages>
</xml>
<!--end b.xml-->
And we want to apply the following XSLT template:
<!--transform.xslt-->
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="//packages">
<html>
<body>
<h2>Packages</h2>
<ol>
<xsl:for-each select="./package">
<li><xsl:value-of select="text()"/></li>
</xsl:for-each>
</ol>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
<!--end transform.xslt-->
If we have parallel document structure, as in this case, we can merge the two XML documents' content together and pass that along for transformation.
require 'nokogiri'
doc1 = Nokogiri::XML(File.read('./a.xml'))
doc2 = Nokogiri::XML(File.read('./b.xml'))
moved_packages = doc2.search('package')
doc1.at('/descendant::packages[1]').add_child(moved_packages)
xslt = Nokogiri::XSLT(File.read('./transform.xslt'))
puts xslt.transform(doc1)
This would generate the following output:
<html><body>
<h2>Packages</h2>
<ol>
<li>Data here for A</li>
<li>Another Package</li>
<li>B something something</li>
</ol>
</body></html>
If your XML documents have varying structure, you may benefit from an intermediary XML nodeset that you add your content to, rather than the shortcut of merging document 2 content into document 1.
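The merge step itself is not specific to Nokogiri. As a rough sketch, assuming Python's standard xml.etree.ElementTree module and a hypothetical helper named merge_packages, the same move of package elements from the second document into the first looks like this. The standard library has no XSLT processor, so only the merge is shown.

```python
import xml.etree.ElementTree as ET

A_XML = """<xml>
  <packages>
    <package>Data here for A</package>
    <package>Another Package</package>
  </packages>
</xml>"""

B_XML = """<xml>
  <packages>
    <package>B something something</package>
  </packages>
</xml>"""


def merge_packages(first_xml, second_xml):
    """Append every package element of the second document to the first."""
    first = ET.fromstring(first_xml)
    second = ET.fromstring(second_xml)
    target = first.find(".//packages")  # first packages element, document order
    target.extend(second.findall(".//package"))
    return first


merged = merge_packages(A_XML, B_XML)
names = [package.text for package in merged.iter("package")]
print(names)
```

The merged tree can then be handed to whichever XSLT processor is in use, in the same way doc1 is passed to transform above.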

XPath to find element with a HTML line break

I need an xpath that will find some text containing HTML line breaks <br/>. For example:
<ul>
<li>ABC<br/>DEF</li>
<li>XYZ<br/>NOP</li>
</ul>
Let's say I'm trying to find the li that contains ABC<br/>DEF. I've tried the following:
$x("//li[normalize-space(.)='ABC DEF']")
$x("//li[text() ='ABC<br/>DEF']")
$x("//li[contains(., 'ABC DEF')]")
But they return nothing. I saw this answer XPath contains(text(),'some string') doesn't work when used with node with more than one Text subnode but I couldn't figure out how to use it in my case.
The following expression will get you close:
li[br[preceding-sibling::node()[1] = 'ABC']
[starts-with(following-sibling::node()[1], 'DEF')]]
If you need to match only items where the text ends with ABC, it will be a little longer.
The following transform will select the first matching li:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes" />
<xsl:template match="/">
<matches>
<xsl:copy-of select="(//li[br[preceding-sibling::node()[1] = 'ABC']
[starts-with(following-sibling::node()[1], 'DEF')]
])
[1]" />
</matches>
</xsl:template>
</xsl:stylesheet>
Input:
<ul>
<li>ABC<br/>DEF</li>
<li>XYZ<br/>NOP</li>
<li><p>XYZ<br/>NOP</p></li>
<li>ABC<br/>DEF</li>
<li>DEF GHI</li>
<li>ABC<![CDATA[<br/>]]>DEF</li>
</ul>
Output:
<?xml version="1.0" encoding="utf-8"?>
<matches>
<li>ABC<br />DEF</li>
</matches>
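The reason the plain string comparisons fail is that the text on either side of the br lives in separate text nodes. A small sketch in Python makes this visible, assuming the standard xml.etree.ElementTree module, where the text before the first child is stored in text and the text after a child is stored in that child's tail. The function name find_items is hypothetical, and the check is a simplification that only looks at text directly adjacent to the br.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<ul>
<li>ABC<br/>DEF</li>
<li>XYZ<br/>NOP</li>
<li><p>XYZ<br/>NOP</p></li>
<li>ABC<br/>DEF</li>
<li>DEF GHI</li>
<li>ABC<![CDATA[<br/>]]>DEF</li>
</ul>"""


def find_items(xml_text, before, after):
    """Return li elements with a br child preceded by `before` and
    followed by text starting with `after`."""
    root = ET.fromstring(xml_text)
    hits = []
    for li in root.iter("li"):
        children = list(li)
        for index, child in enumerate(children):
            if child.tag != "br":
                continue
            # Text right before the br: li.text for the first child,
            # otherwise the tail of the previous sibling element.
            preceding = li.text if index == 0 else children[index - 1].tail
            if (preceding or "") == before and (child.tail or "").startswith(after):
                hits.append(li)
                break
    return hits


matches = find_items(SAMPLE, "ABC", "DEF")
print(len(matches))
```

The CDATA item is not matched because its br is literal text and not an element.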
//li[br]
This should work. It means: select all li elements that have a br child.

XPath ignore span

I have a HTML which contains some tags like below:
<div id="SNT">text1</div>
<div id="SNT">text2</div>
<div id="SNT">textbase1<span style='color: #EFFFFF'>text3</span></div>
<div id="SNT">textbase2<span style='color: #EFFFFF'>text4</span></div>
how can I get all the texts included in all <div> tags using XPath, ignoring the span fields?
i.e.:
text1
text2
textbase1text3
textbase2text4
This cannot be specified with a single XPath 1.0 expression.
You need to first select all relevant div elements:
//div[@id='SNT']
then for each selected node get its string value:
string(.)
In XPath 2.0 this can be specified with a single expression:
//div[@id='SNT']/string(.)
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="div[@id='SNT']">
<xsl:copy-of select="string()"/>
========
</xsl:template>
</xsl:stylesheet>
When this XSLT 1.0 transformation is applied on the following XML document (the provided XML fragment, wrapped into a single top element):
<t>
<div id="SNT">text1</div>
<div id="SNT">text2</div>
<div id="SNT">textbase1<span style='color: #EFFFFF'>text3</span></div>
<div id="SNT">textbase2<span style='color: #EFFFFF'>text4</span></div>
</t>
the relevant div elements are selected (matched) and processed by the only specified template, in which the string(.) XPath expression is evaluated and its result is copied to the output:
text1
========
text2
========
textbase1text3
========
textbase2text4
========
And for the XPath 2.0 expression:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:copy-of select="//div[@id='SNT']/string(.)"/>
</xsl:template>
</xsl:stylesheet>
When this XSLT 2.0 transformation is applied on the same XML document (above), the XPath 2.0 expression is evaluated and the result (four strings) is copied to the output:
text1 text2 textbase1text3 textbase2text4
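The two-step approach (select the div elements, then take the string value of each) can also be sketched in Python. This is an illustration under assumptions, using the standard xml.etree.ElementTree module, where joining itertext() plays the role of string(.); the wrapper element t is the same one used in the verification above.

```python
import xml.etree.ElementTree as ET

DOC = """<t>
<div id="SNT">text1</div>
<div id="SNT">text2</div>
<div id="SNT">textbase1<span style='color: #EFFFFF'>text3</span></div>
<div id="SNT">textbase2<span style='color: #EFFFFF'>text4</span></div>
</t>"""

root = ET.fromstring(DOC)

# Step 1: select the relevant div elements.
divs = root.findall(".//div[@id='SNT']")

# Step 2: concatenate all descendant text of each div, spans included.
texts = ["".join(div.itertext()) for div in divs]

for line in texts:
    print(line)
```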
You could simply use:
//div/text()
or
div/text()
Hope this helps.
See the section "Using XPath to find text" in The lxml.etree Tutorial.
For example:
from lxml import etree
html = """
<span class='demo'>
Hi,
<span>Tom</span>
</span>
"""
tree = etree.HTML(html)
node = tree.xpath('//span[@class="demo"]')[0]
print(node.xpath('string()'))
If there is no other content in the HTML files, just those <div>s inside the usual HTML root elements, the following stylesheet will be sufficient to extract the text:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
</xsl:stylesheet>
If you only want the <div>s, and only with those particular IDs, use the following code - it also makes sure the linebreaks are like in your example:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="//div[@id='SNT']">
<xsl:copy-of select="node()|text()"/><xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>

XPath: how to get text from this and next tag?

I have HTML like this:
<h1>Hello1</h1>
<p>World1</p>
<h1>Hello2</h1>
<p>World2</p>
<h1>Hello2</h1>
<p>World2</p>
So I need to get each pair together: Hello1 with World1, Hello2 with World2, etc.
UPDATE: I use Ruby Mechanize library
The Ruby library "Mechanize" uses the Nokogiri parsing library, so you can call Nokogiri directly. One potential solution might look something like this:
require 'mechanize'
require 'pp'
html = "<h1>Hello1</h1>
<p>World1</p>
<h1>Hello2</h1>
<p>World2</p>
<h1>Hello2</h1>
<p>World2</p>"
results = []
Nokogiri::HTML(html).xpath("//h1").each do |header|
p = header.xpath("following-sibling::p[1]").text
results << [header.text, p]
end
pp results
EDIT:
This example was tested with Mechanize v2.0.1 which uses Nokogiri ~v1.4. I also tested directly against Nokogiri v1.5.0 without issue.
EDIT #2:
This example answers a follow-up question to the original solution:
require 'nokogiri'
require 'pp'
html = <<HTML
<h1>
<p>
<font size="4">
<b>abide by (something)</b>
</font>
</p>
</h1>
<p>
<font size="3">- to follow the rules of something</font>
</p>
The cleaning staff must abide by the rules of the school.
<br>
<h1>
<p>
<font size="4">
<b>able to breathe easily again</b>
</font>
</p>
</h1>
<p>
My friend was able to breathe easily again when his company did not go bankrupt.
<br>
HTML
doc = Nokogiri::HTML(html)
results = []
doc.xpath("//h1").each do |header|
h1 = header.xpath("following-sibling::p/font/b").text
results << h1
end
pp results
H1 tags with nested elements are invalid, so Nokogiri corrects the error during the parsing process. The process to get at the formerly nested elements is very similar to the original solution.
Note: I glossed over the XPath part of this request. This answer uses an XSLT style sheet instead.
Expanding your XML example to give it a root element:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<h1>Hello1</h1>
<p>World1</p>
<h1>Hello2</h1>
<p>World2</p>
<h1>Hello3</h1>
<p>World3</p>
</root>
You could use a for-each loop along with "following-sibling" to get the elements with something like this:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output encoding="UTF-8" method="text"/>
<xsl:template match="/">
<!-- start looking for <h1> nodes -->
<xsl:for-each select="/root/h1">
<!-- output the h1 text -->
<xsl:value-of select="."/>
<!-- print a dash for spacing -->
<xsl:text> - </xsl:text>
<!-- select the next <p> node -->
<xsl:value-of select="following-sibling::p[1]"/>
<!-- print a new line -->
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The output would look like this:
Hello1 - World1
Hello2 - World2
Hello3 - World3
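The pairing logic can be sketched in Python as well. This is an assumption-laden illustration using the standard xml.etree.ElementTree module, which has no following-sibling axis, so the sketch walks the children of the root in document order; the function name pair_headings is hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<root>
<h1>Hello1</h1>
<p>World1</p>
<h1>Hello2</h1>
<p>World2</p>
<h1>Hello3</h1>
<p>World3</p>
</root>"""


def pair_headings(xml_text):
    """Pair each h1 with the first p sibling that follows it."""
    children = list(ET.fromstring(xml_text))
    pairs = []
    for index, node in enumerate(children):
        if node.tag != "h1":
            continue
        # Emulates following-sibling::p[1]: first later sibling named p.
        following = next(
            (sib for sib in children[index + 1:] if sib.tag == "p"), None
        )
        pairs.append((node.text, following.text if following is not None else None))
    return pairs


pairs = pair_headings(DOC)
for heading, paragraph in pairs:
    print("%s - %s" % (heading, paragraph))
```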

Resources