Boost read/write XML file: how to change the character encoding?

I'm trying to read and write an XML file using the Boost functions read_xml and write_xml.
The original encoding of the XML file is "windows-1252", but after the read/write operations the encoding becomes "utf-8".
This is the original XML file:
<?xml version="1.0" encoding="windows-1252" standalone="no" ?>
<lot>
<name>Lot1</name>
<lot_id>123</lot_id>
<descr></descr>
<job>
<name>TEST</name>
<num_items>2</num_items>
<item>
<label>Item1</label>
<descr>Item First Test</descr>
</item>
<item>
<label>Item2</label>
<descr>Item Second Test</descr>
</item>
</job>
</lot>
And this is the output file:
<?xml version="1.0" encoding="utf-8"?>
<lot>
<name>Lot1</name>
<lot_id>123</lot_id>
<descr></descr>
<job>
<name>TEST</name>
<num_items>2</num_items>
<item>
<label>Item1</label>
<descr>Item First Test</descr>
</item>
<item>
<label>Item2</label>
<descr>Item Second Test</descr>
</item>
</job>
</lot>
This is my C++ code (just a test code):
#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/xml_parser.hpp>
#include <locale>
#include <string>
using boost::property_tree::ptree;
ptree xmlTree;
read_xml(FILE_XML, xmlTree);
// Assumption: xmlTreeChild is the <job> node that holds the <item> elements.
ptree& xmlTreeChild = xmlTree.get_child("lot.job");
for (auto it = xmlTreeChild.begin(); it != xmlTreeChild.end();)
{
    // Erase every <item> whose <label> is not "item3".
    if (it->first == "item" && it->second.get<std::string>("label") != "item3")
    {
        it = xmlTreeChild.erase(it); // erase() returns the next valid iterator
    }
    else
    {
        ++it; // advance only when nothing was erased
    }
}
auto settings = boost::property_tree::xml_writer_make_settings<std::string>('\t', 1);
write_xml(FILE_XML, xmlTree, std::locale(), settings);
I need to read and rewrite the file using the same encoding as the original file.
I've also tried changing the locale settings, using:
std::locale newlocale1("English_USA.1252");
read_xml(FILE_XML, xmlTree, 0, newlocale1);
...
auto settings = boost::property_tree::xml_writer_make_settings<std::string>('\t', 1);
write_xml(FILE_XML, xmlTree, newlocale1, settings);
but I've got the same result.
How can I read and write the file in its original encoding with Boost functions?
Thank you

You can pass an encoding via the writer settings:
auto settings = boost::property_tree::xml_writer_make_settings<std::string>(
'\t', 1, "windows-1252");
Note that this only changes the encoding named in the XML declaration; Boost.PropertyTree does not transcode the data, so make sure the keys and values really are Latin-1/cp1252 compatible. That holds as long as all the information comes from the source file, but take care when, for example, assigning user input to a property tree node: you may need to convert it from the input encoding to cp1252 first.
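As a rough illustration of the conversion step mentioned above, here is a minimal sketch that turns UTF-8 text into cp1252 bytes before it is stored in a tree node. The helper name utf8_to_cp1252 is hypothetical (it is not part of Boost), and the sketch assumes only code points that map one-to-one are needed (ASCII and U+00A0 to U+00FF); it rejects everything else.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Boost): converts UTF-8 text to Windows-1252.
// Assumption: only ASCII and U+00A0..U+00FF are supported, because those code
// points have the same numeric value in cp1252. Returns false for any other
// code point and for malformed input; `out` is then left partially filled.
bool utf8_to_cp1252(const std::string& in, std::string& out)
{
    out.clear();
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            out.push_back(static_cast<char>(c)); // plain ASCII, copy as-is
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < in.size()) {
            // Two-byte sequence that encodes U+0080..U+00FF.
            const unsigned char c2 = static_cast<unsigned char>(in[i + 1]);
            if ((c2 & 0xC0) != 0x80) {
                return false; // not a valid continuation byte
            }
            const unsigned cp = ((c & 0x1Fu) << 6) | (c2 & 0x3Fu);
            if (cp < 0xA0) {
                return false; // U+0080..U+009F do not map one-to-one in cp1252
            }
            out.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
            ++i; // skip the continuation byte
        } else {
            return false; // code point above U+00FF, or malformed UTF-8
        }
    }
    return true;
}
```

Characters that cp1252 places in the 0x80 to 0x9F range (the Euro sign, curly quotes and so on) would need a lookup table or a conversion library such as iconv or Boost.Locale; this sketch simply reports them as unsupported.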

To fix the problem you are experiencing, replace this line:
read_xml(FILE_XML, xmlTree);
with
read_xml(FILE_XML,
xmlTree,
boost::property_tree::xml_parser::trim_whitespace);
As far as I know, your issue cannot be fixed just by modifying the settings of the write_xml function.
I tried it and it worked: when I compare the files ignoring whitespace, the input and output XML files are identical.

You can also write to a string stream as follows:
#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/xml_parser.hpp>
#include <sstream>
#include <string>
boost::property_tree::ptree pt;
std::ostringstream oss;
// Boost 1.56 and later take the string type here; older releases used <char>.
write_xml(
oss, pt,
boost::property_tree::xml_writer_make_settings<std::string>(
'\t', 0, "ASCII"));
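A side note on the loop in the question: it calls erase() and then still runs ++it, which skips the element that follows each erased one and, if the last child is erased, increments the end iterator. The usual pattern is to advance only when nothing was erased. The sketch below shows that pattern on a std::list used as a stand-in for the ptree children; the Children alias and the erase_other_items function are hypothetical names for illustration only.

```cpp
#include <list>
#include <string>
#include <utility>

// Hypothetical stand-in for the ptree children: (tag, label) pairs.
using Children = std::list<std::pair<std::string, std::string>>;

// Removes every "item" entry whose label differs from `keep`.
void erase_other_items(Children& children, const std::string& keep)
{
    for (auto it = children.begin(); it != children.end();) {
        if (it->first == "item" && it->second != keep) {
            it = children.erase(it); // already points at the next element
        } else {
            ++it;                    // advance only when nothing was erased
        }
    }
}
```

The same shape should carry over to a property tree, since its erase(iterator) also returns the iterator that follows the removed child.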

Related

I want to omit specifying default namespace using libxml-ruby

I have a couple of questions about libxml-ruby.
There is an XML file, "sample.xml":
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="http://***" xmlns:r="http://???">
<sheetData>
<row><v>1</v></row>
</sheetData>
</worksheet>
I want to work with nodes without specifying the default namespace, like this:
xml = XML::Document.file('sample.xml')
sheet_data = xml.find_first('sheetData')
Of course, I can do it like below.
NS = {
main: 'http://***',
r: 'http://???',
}
sheet_data = xml.find_first('main:sheetData', NS)
But I want to omit the default namespace prefix.
I tried some properties and methods of XML::Namespace[s], but they had no effect.
There is one more problem when I save the XML file:
ns = XML::Namespace.new(xml.root, 'main', 'http://***')
row = XML::Node.new('row', nil, ns)
sheet_data << row
xml.save("sample.xml")
The output looks like this:
<row><v>1</v></row>
<main:row/>
I want the "main:" prefix to be omitted.
So I do this, but it's really ugly:
open('sample.xml', 'wb') do |f|
f.write(xml.to_s.gsub(/(<\/?)main:/, '\1'))
end
Do you have any better ideas?

Testing Nokogiri XML generation with blank nodes

I'm having a bit of trouble testing some XML generation using Nokogiri when the node is blank. I'm using Minitest to compare the generated XML string with a template fixture file. My test fails with the blank node as Minitest is comparing <Node></Node> with <Node />.
XML Generation
builder = Nokogiri::XML::Builder.new encoding: "UTF-8" do |xml|
xml.Header
xml.FileName @object.filename
end
Template file
This is the file I'm using as a fixture in my tests
<?xml version="1.0" encoding="UTF-8"?>
<Header/>
<FileName></FileName>
Minitest output
3) Failure:
--- expected
+++ actual
@@ -25,7 +25,7 @@
<Header />
- <FileName/>
+ <FileName></FileName>
As you can see, MiniTest is trying to compare a self-closing tag with a non-self-closing tag and making the test fail. Changing the fixture tag to a self-closing one results, strangely, in exactly the same error message.
It's because sometimes @object.filename is nil. If I have a blank XML node (as in xml.Header above), using a self-closing tag in my fixture works with no problem.
I would use an XML schema in this case:
def test_that_xml_data_conforms_to_schema
xml_data = ...
schema_data = ...
fragment = Nokogiri::XML.parse(xml_data)
schema = Nokogiri::XML::Schema(schema_data)
assert schema.valid?(fragment)
end

Reading xml nodes using VB script

I have an XML file that I want to read using VBScript (a technology limitation). Below are the code and the XML file. I am able to read the file if no DTD is involved, but the code doesn't work for a file that has a DTD and an xml-stylesheet element.
Code-
Dim xmlDoc1:Set xmlDoc1 = CreateObject("MSXML2.DomDocument")
xmlDoc1.async=False
xmlDoc1.load "C:\ABC.xml"
Dim xmlTCID:Set xmlTCID = xmlDoc1.selectNodes("//*")
For nNodeCount = 0 To xmlTCID.length - 1
MsgBox(xmlTCID(nNodeCount).nodeName)
Next
ABC.xml -
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE RESULT SYSTEM "Result.dtd"[]>
<?xml-stylesheet type="text/xsl" href="Result.xsl"?>
<SUMMARY>
<TITLE>Test</TITLE>
</SUMMARY>
<IDS>
<DATA>
<NAME>A</NAME>
<VALUE>PASS</VALUE>
</DATA>
<DATA>
<NAME>B</NAME>
<VALUE>PASS</VALUE>
</DATA
<DATA>
<NAME>C</NAME>
<VALUE>FAIL</VALUE>
</DATA
</IDS>
<IDS>
<DATA>
<NAME>A</NAME>
<VALUE>PASS</VALUE>
</DATA>
<DATA>
<NAME>B</NAME>
<VALUE>FAIL</VALUE>
</DATA
</IDS>
Note - If I avoid -
<!DOCTYPE RESULT SYSTEM "Result.dtd"[]>
<?xml-stylesheet type="text/xsl" href="Result.xsl"?>
The code above is able to read the nodes, but with those two lines in the XML file it gives the below error -
Requirement - I need to read the name of the last DATA node with FAIL for each IDS node.
Any suggestions on how to get the code working even with -
<!DOCTYPE RESULT SYSTEM "Result.dtd"[]>
<?xml-stylesheet type="text/xsl" href="Result.xsl"?>
As there are problems with your XML (more than one top-level element, missing ">"), setting the ProhibitDTD property to False won't solve all of your problems.
xmlDoc.validateOnParse=False
worked for me.

Get value of XML attribute with namespace

I'm parsing a pptx file and ran into an issue. This is a sample of the source XML:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<p:presentation xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
<p:sldMasterIdLst>
<p:sldMasterId id="2147483648" r:id="rId2"/>
</p:sldMasterIdLst>
<p:sldIdLst>
<p:sldId id="256" r:id="rId3"/>
</p:sldIdLst>
<p:sldSz cx="10080625" cy="7559675"/>
<p:notesSz cx="7772400" cy="10058400"/>
</p:presentation>
I need to get the r:id attribute value in the sldMasterId tag.
doc = Nokogiri::XML(path_to_pptx)
doc.xpath('p:presentation/p:sldMasterIdLst/p:sldMasterId').attr('id').value
returns 2147483648 but I need rId2, which is the r:id attribute value.
I found the attribute_with_ns(name, namespace) method, but
doc.xpath('p:presentation/p:sldMasterIdLst/p:sldMasterId').attribute_with_ns('id', 'r')
returns nil.
You can reference the namespace of attributes in your xpath the same way you reference element namespaces:
doc.xpath('p:presentation/p:sldMasterIdLst/p:sldMasterId/@r:id')
If you want to use attribute_with_ns, you need to use the actual namespace, not just the prefix:
doc.at_xpath('p:presentation/p:sldMasterIdLst/p:sldMasterId')
.attribute_with_ns('id', "http://schemas.openxmlformats.org/officeDocument/2006/relationships")
http://nokogiri.org/Nokogiri/XML/Node.html#method-i-attributes
If you need to distinguish attributes that have the same name but different namespaces, use attribute_nodes instead.
doc.xpath('p:presentation/p:sldMasterIdLst/p:sldMasterId').each do |element|
element.attribute_nodes().select do |node|
puts node if node.namespace && node.namespace.prefix == "r"
end
end

Get Nokogiri to not add "default" namespace when adding nodes

Background:
I want to take some XML from one file, put it in a template file, and then save the modified template as a new file. It works, but when I save the file out, all the nodes that I added have a default namespace prefix prepended, i.e.
<default:ComponentRef Id="C__AD1817F9C64A42F0A14DDDDC82DFC8D9"/>
<default:ComponentRef Id="C__157DD41D70854617A3D6D1E4A39B589F"/>
<default:ComponentRef Id="C__2E6D8662F38FE62CAFA9F8842A28F510"/>
<default:ComponentRef Id="C__54E5E2181323D4A5F37293DAA87B4230"/>
Which I want to be just:
<ComponentRef Id="C__AD1817F9C64A42F0A14DDDDC82DFC8D9"/>
<ComponentRef Id="C__157DD41D70854617A3D6D1E4A39B589F"/>
<ComponentRef Id="C__2E6D8662F38FE62CAFA9F8842A28F510"/>
<ComponentRef Id="C__54E5E2181323D4A5F37293DAA87B4230"/>
The following is my ruby code:
file = "wixmain/generated/DarkOutput.wxs"
template = "wixmain/generated/MsiComponentTemplate.wxs"
output = "wixmain/generated/MSIComponents.wxs"
dark_output = Nokogiri::XML(File.open(file))
template_file = Nokogiri::XML(File.open(template))
#get stuff from dark output
components = dark_output.at_css("Directory[Id='TARGETDIR']")
component_ref = dark_output.at_css("Feature[Id='DefaultFeature']")
#where to insert in template doc
template_component_insert_point = template_file.at_css("DirectoryRef[Id='InstallDir']")
template_ref_insert_point = template_file.at_css("ComponentGroup[Id='MSIComponentGroup']")
template_component_insert_point.children= components.children()
template_ref_insert_point.children= component_ref.children()
#write out filled template to output file
File.open(output, 'w') { |f| template_file.write_xml_to f }
Update
Example of my template file:
<?xml version="1.0" encoding="utf-8"?>
<Wix xmlns='http://schemas.microsoft.com/wix/2006/wi'>
<Fragment>
<ComponentGroup Id='MSIComponentGroup'>
</ComponentGroup>
</Fragment>
<Fragment Id='MSIComponents'>
<DirectoryRef Id='InstallDir'>
</DirectoryRef>
</Fragment>
</Wix>
The workaround was to remove the xmlns attribute in the input file,
or to use the remove_namespaces! method when opening the input file:
input_file = Nokogiri::XML(File.open(input))
input_file.remove_namespaces!
I think you are missing a sample of the template file. Also, is the sample from the input complete?
Nokogiri is either finding the default: namespace during its parsing of one of the two files, and you are inheriting it, or maybe it is not happy with the sample during parsing and is unable to parse cleanly, and as a result somehow adding the default: namespace. You can check the emptiness of the errors array after parsing the dark_output and template_file to see if Nokogiri is happy.
dark_output = Nokogiri::XML(File.open(file))
template_file = Nokogiri::XML(File.open(template))
if (dark_output.errors.any? || template_file.errors.any?)
[... do something here ...]
end
For the fastest answer, you might want to take this question directly to the developers via the Nokogiri-Talk mailing list.
