How do I get Nokogiri to add the right XML encoding?

I have created an XML document with Nokogiri (a Nokogiri::XML::Document).
The header of my file is <?xml version="1.0"?>, but I'd expect <?xml version="1.0" encoding="UTF-8"?>. Is there an option I can use so that the encoding appears?

Are you using Nokogiri::XML::Builder? You can pass an encoding option to the new() method:
new(options = {})
Create a new Builder object. options are sent to the top level Document that is being built.
Building a document with a particular encoding, for example:
Nokogiri::XML::Builder.new(:encoding => 'UTF-8') do |xml|
...
end
The documentation also shows that you can do the following (when not using Builder):
doc = Nokogiri.XML('<foo><bar /></foo>', nil, 'EUC-JP')
Presumably you could change 'EUC-JP' to 'UTF-8'.

When parsing the doc you can set the encoding like this:
doc = Nokogiri::XML::Document.parse(xml_input, nil, "UTF-8")
For me that returns
<?xml version="1.0" encoding="UTF-8"?>

If you're not using Nokogiri::XML::Builder but rather creating a document object directly, you can just set the encoding with Document#encoding=:
doc = Nokogiri::XML::Document.new
# => #<Nokogiri::XML::Document:0x1180 name="document">
puts doc.to_s
# <?xml version="1.0"?>
doc.encoding = 'UTF-8'
puts doc.to_s
# <?xml version="1.0" encoding="UTF-8"?>

Related

How to compact existing XML using Nokogiri

I'm trying to compact an existing XML file using Nokogiri. I have the following demo code:
#!/usr/bin/env ruby
require 'nokogiri'
doc = Nokogiri.XML <<-XML.strip
<?xml version="1.0" encoding="UTF-8"?>
<root>
<foo>
<bar>test</bar>
</foo>
</root>
XML
doc.write_xml_to($stdout, indent: 0)
I expected to see:
<?xml version="1.0" encoding="UTF-8"?>
<root><foo><bar>test</bar></foo></root>
but instead I saw:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<foo>
<bar>test</bar>
</foo>
</root>
I tried:
doc.write_to($stdout, indent: 0, save_with: Nokogiri::XML::Node::SaveOptions::AS_XML)
but that doesn't work either.
How can I remove the ignorable whitespace?

You can tell Nokogiri to ignore empty text nodes and then to output without indentation:
require 'nokogiri'
xml = <<EOT
<?xml version="1.0" encoding="UTF-8"?>
<root>
<foo>
<bar>test</bar>
</foo>
</root>
EOT
doc = Nokogiri::XML(xml) { |opts|
  opts.strict.noblanks
}
doc.to_xml(:indent_text => '', :indent => 0)
# => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
# "<root>\n" +
# "<foo>\n" +
# "<bar>test</bar>\n" +
# "</foo>\n" +
# "</root>\n"

Okay, I'll answer my own question.
Nokogiri does not remove the white spaces because Nokogiri doesn't know if the white spaces are ignorable or not (no DTD, no schema), so it keeps all the whitespace-only text as text nodes. I should remove them manually before writing the XML doc to the IO device.
#!/usr/bin/env ruby
require 'bundler'
Bundler.require :default
doc = Nokogiri.XML <<-XML.strip
<?xml version="1.0" encoding="UTF-8"?>
<root>
<foo>
<bar>test</bar>
</foo>
</root>
XML
# remove ignorable white spaces
doc.xpath('//text()').each do |node|
node.content = '' if node.text =~ /\A\s+\z/m
end
doc.write_xml_to($stdout, indent: 0)
This is big progress for me, but I still haven't reached my goal: the XML file I'm working on has inline self-closing tags, and the whitespace-only text nodes between those tags should not be compacted. I'm now trying to figure out a way to handle this corner case.

How do I use xpath on nodes with a prefix but without a namespace?

I have an XML file that I need to parse. I have no control over the format of the file and cannot change it.
The file makes use of a prefix (call it a), but it doesn't define a namespace for that prefix anywhere. I can't seem to use xpath to query for nodes with the a namespace.
Here's the contents of the xml document
<?xml version="1.0" encoding="UTF-8"?>
<a:root>
<a:thing>stuff0</a:thing>
<a:thing>stuff1</a:thing>
<a:thing>stuff2</a:thing>
<a:thing>stuff3</a:thing>
<a:thing>stuff4</a:thing>
<a:thing>stuff5</a:thing>
<a:thing>stuff6</a:thing>
<a:thing>stuff7</a:thing>
<a:thing>stuff8</a:thing>
<a:thing>stuff9</a:thing>
</a:root>
I am using Nokogiri to query the document:
doc = Nokogiri::XML(open('text.xml'))
things = doc.xpath('//a:thing')
This fails with the following error:
Nokogiri::XML::XPath::SyntaxError: Undefined namespace prefix: //a:thing
From my research, I found out that I could specify the namespace for the prefix in the xpath method:
things = doc.xpath('//a:thing', a: 'nobody knows')
This returns an empty array.
What would be the best way for me to get the nodes that I need?

The problem is that the namespace is not properly defined in the XML document. As a result, Nokogiri sees the node names as being "a:root" instead of "a" being a namespace and "root" being the node name:
xml = %Q{
<?xml version="1.0" encoding="UTF-8"?>
<a:root>
<a:thing>stuff0</a:thing>
<a:thing>stuff1</a:thing>
</a:root>
}
doc = Nokogiri::XML(xml)
puts doc.at_xpath('*').node_name
#=> "a:root"
puts doc.at_xpath('*').namespace
#=> ""
Solution 1 - Specify node name with colon
One solution is to search for nodes with the name "a:thing". You cannot do //a:thing since the XPath will treat the "a" as a namespace. You can get around this by doing //*[name()="a:thing"]:
xml = %Q{
<?xml version="1.0" encoding="UTF-8"?>
<a:root>
<a:thing>stuff0</a:thing>
<a:thing>stuff1</a:thing>
</a:root>
}
doc = Nokogiri::XML(xml)
things = doc.xpath('//*[name()="a:thing"]')
puts things
#=> <a:thing>stuff0</a:thing>
#=> <a:thing>stuff1</a:thing>
Solution 2 - Modify the XML document to define the namespace
An alternative solution is to modify the XML file that you get to properly define the namespace. The document will then behave with namespaces as expected:
xml = %Q{
<?xml version="1.0" encoding="UTF-8"?>
<a:root>
<a:thing>stuff0</a:thing>
<a:thing>stuff1</a:thing>
</a:root>
}
xml.gsub!('<a:root>', '<a:root xmlns:a="foo">')
doc = Nokogiri::XML(xml)
things = doc.xpath('//a:thing')
puts things
#=> <a:thing>stuff0</a:thing>
#=> <a:thing>stuff1</a:thing>

Nokogiri XSLT tagging document as XML type when using JSON

I am using Nokogiri to transform an XML document to JSON. The code is straightforward:
@document = Nokogiri::XML(entry.data)
xslt = Nokogiri::XSLT(File.read("#{File.dirname(__FILE__)}/../../xslt/my.xslt"))
transform = xslt.transform(@document)
entry in this case is a Mongoid based model and data is an XML blob attribute stored as a string on MongoDB.
When I dump the contents of transform, the JSON is there. The problem is, Nokogiri is tagging the top of the document with:
<?xml version="1.0"?>
What's the correct way of addressing that?

Try the #apply_to method, as below:
require 'nokogiri'
doc = Nokogiri::XML('<?xml version="1.0"?><root />')
xslt = Nokogiri::XSLT("<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'/>")
puts xslt.transform(doc)
puts "######"
puts xslt.apply_to(doc)
# >> <?xml version="1.0"?>
# >> ######
# >>

Validating DTD-String with Nokogiri

I am switching from LibXML to Nokogiri. I have a method in my code that checks whether an XML document matches a DTD. The DTD is read from a database (as a string).
This is an example from an irb session:
require 'xml'
doc = LibXML::XML::Document.string('<foo bar="baz" />') #=> <?xml version="1.0" encoding="UTF-8"?>
dtd = LibXML::XML::Dtd.new('<!ELEMENT foo EMPTY><!ATTLIST foo bar ID #REQUIRED>') #=> #<LibXML::XML::Dtd:0x000000026f53b8>
doc.validate dtd #=> true
As I understand it, Nokogiri::XML::Document#validate can only check against a DTD that is declared within the document. How would I achieve the same result?

I think what you need is internal_subset:
require 'nokogiri'
doc = Nokogiri::HTML("<!DOCTYPE html>")
# then you can get the info you want
doc.internal_subset # Nokogiri::XML::DTD
# for example you can get name, system_id, external_id, etc
doc.internal_subset.name
doc.internal_subset.system_id
doc.internal_subset.external_id
See the Nokogiri::XML::DTD documentation for the full API.

Ruby: How do I get attribute values from XML with Nokogiri?

How do I get the value of the message attribute ("ready to use")?
<?xml version="1.0" encoding="UTF-8"?>
<response status="ok" permission_level="admin" message="ready to use" cached="0">
<title>kit</title>
</response>
Thanks

require 'rubygems'
require 'nokogiri'
string = %Q{
<?xml version="1.0" encoding="UTF-8"?>
<response status="ok" permission_level="admin" message="ready to use" cached="0">
<title>kit</title>
</response>
}
doc = Nokogiri::XML(string)
doc.css("response").each do |response_node|
puts response_node["message"]
end
Save and run this Ruby file, and you will get this result:
#=> ready to use

You subscript them, using the attribute name as the key:
require 'open-uri'
doc = Nokogiri::HTML(URI.open('http://google.com'))
doc.css('img:first').first['alt']
=> "Google"
