Ruby doesn't treat Hebrew letters well

I am trying to read an XML file that also contains Hebrew letters. Its content is:
<?xml version="1.0" encoding="UTF-8"?>
<keywords type="array">
<keyword>seo software</keyword>
<keyword>ipad</keyword>
<keyword>muffuletta manhattanization</keyword>
<keyword>cheap motels</keyword>
<keyword>שפות תכנות</keyword>
</keywords>
And my code to do it is:
# encoding: UTF-8
def use
#require "rexml/document"
file = File.new( "sources/rankabove-test.xml" )
puts file.read
end
However, this doesn't help: the output of the 'puts' call is gibberish for the Hebrew letters:
╫⌐╫ñ╫ץ╫¬ ╫¬╫¢╫á╫ץ╫¬
I am using Windows XP 32-bit. Is anyone familiar with this problem? Is there anything I can do?

I don't think the problem is Ruby:
# encoding: UTF-8
puts RUBY_VERSION
# >> 1.9.2
xml = '
<?xml version="1.0" encoding="UTF-8"?>
<keywords type="array">
<keyword>seo software</keyword>
<keyword>ipad</keyword>
<keyword>muffuletta manhattanization</keyword>
<keyword>cheap motels</keyword>
<keyword>שפות תכנות</keyword>
</keywords>
'
require 'nokogiri'
doc = Nokogiri::XML(xml)
puts doc.search('//keyword').last.text
# >> שפות תכנות
require "rexml/document"
require 'rexml/node'
require 'rexml/xpath'
doc = REXML::Document.new(xml)
puts REXML::XPath.match(doc, '//keyword').last.text
# >> שפות תכנות
Using both Nokogiri and REXML I get the same output on Mac OS.
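If Ruby reads the bytes correctly, the remaining suspect is the console that displays them. Here is a minimal sketch of that idea, assuming the Windows console uses the Hebrew DOS code page IBM862 (the question does not say which code page is active, and `misread_as` is a made-up helper name):

```ruby
# encoding: UTF-8

# Hypothetical helper: show how UTF-8 bytes look when a console decodes
# them with a different single-byte code page.
def misread_as(text, code_page)
  text.dup.force_encoding(code_page).encode("UTF-8")
end

hebrew = "שפות תכנות"
garbled = misread_as(hebrew, "IBM862")
puts garbled # box-drawing gibberish like the output in the question

# The bytes were never damaged: undoing the misreading restores the text.
restored = garbled.encode("IBM862").force_encoding("UTF-8")
puts restored == hebrew
```

If that assumption holds, the file and the Ruby code are fine, and the output would have to be transcoded to the console's code page (for example `puts text.encode("IBM862")`) or the console switched to UTF-8.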

Related

How to compact existing XML using Nokogiri

I'm trying to compact an existing XML file using Nokogiri. I have the following demo code:
#!/usr/bin/env ruby
require 'nokogiri'
doc = Nokogiri.XML <<-XML.strip
<?xml version="1.0" encoding="UTF-8"?>
<root>
<foo>
<bar>test</bar>
</foo>
</root>
XML
doc.write_xml_to($stdout, indent: 0)
I expected to see:
<?xml version="1.0" encoding="UTF-8"?>
<root><foo><bar>test</bar></foo></root>
but instead I saw:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<foo>
<bar>test</bar>
</foo>
</root>
I tried:
doc.write_to($stdout, indent: 0, save_with: Nokogiri::XML::Node::SaveOptions::AS_XML)
but that doesn't work either.
How can I remove the ignorable whitespace?
You can tell Nokogiri to ignore empty text nodes and then to output without indentation:
require 'nokogiri'
xml = <<EOT
<?xml version="1.0" encoding="UTF-8"?>
<root>
<foo>
<bar>test</bar>
</foo>
</root>
EOT
doc = Nokogiri::XML(xml) { |opts|
# strict disables error recovery; noblanks drops whitespace-only text nodes
opts.strict.noblanks
}
doc.to_xml(:indent_text => '', :indent => 0)
# => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
# "<root>\n" +
# "<foo>\n" +
# "<bar>test</bar>\n" +
# "</foo>\n" +
# "</root>\n"
Okay, I'll answer my own question.
Nokogiri does not remove the whitespace because it cannot know whether the whitespace is ignorable (there is no DTD or schema), so it keeps all whitespace-only text as text nodes. I have to remove them manually before writing the XML document to the IO device.
#!/usr/bin/env ruby
require 'bundler'
Bundler.require :default
doc = Nokogiri.XML <<-XML.strip
<?xml version="1.0" encoding="UTF-8"?>
<root>
<foo>
<bar>test</bar>
</foo>
</root>
XML
# remove ignorable white spaces
doc.xpath('//text()').each do |node|
node.content = '' if node.text =~ /\A\s+\z/m
end
doc.write_xml_to($stdout, indent: 0)
This is big progress for me, but I still haven't reached my goal, because the XML file I'm working on has inline self-closing tags, and the whitespace-only text nodes between those tags should not be compacted. I'm now trying to figure out a way to handle this corner case.

Building blank XML tags with Nokogiri?

I'm trying to build up an XML document using Nokogiri. Everything is pretty standard so far; most of my code just looks something like:
builder = Nokogiri::XML::Builder.new do |xml|
...
xml.Tag1(object.attribute_1)
xml.Tag2(object.attribute_2)
xml.Tag3(object.attribute_3)
xml.Tag4(nil)
end
builder.to_xml
However, that results in a tag like <Tag4/> instead of <Tag4></Tag4>, which is what my end user has specified that the output needs to be.
How do I tell Nokogiri to put full tags around a nil value?
SaveOptions::NO_EMPTY_TAGS will get you what you want.
require 'nokogiri'
builder = Nokogiri::XML::Builder.new do |xml|
xml.blah(nil)
end
puts 'broken:'
puts builder.to_xml
puts 'fixed:'
puts builder.to_xml(save_with: Nokogiri::XML::Node::SaveOptions::NO_EMPTY_TAGS)
output:
(511)-> ruby derp.rb
broken:
<?xml version="1.0"?>
<blah/>
fixed:
<?xml version="1.0"?>
<blah></blah>

Search node in XML by using Nokogiri XPath (with XML namespace)

I've found Nokogiri quite powerful for dealing with XML, but I've run into a special case.
I am trying to search for a node in an XML file like this:
<?xml version="1.0" encoding="utf-8" ?>
<ConfigurationSection>
<Configuration xmlns="clr-namespace:Newproject.Framework.Server.Store.Configuration;assembly=Newproject.Framework.Server" >
<Configuration.Store>SqlServer</Configuration.Store>
<Configuration.Engine>Staging</Configuration.Engine>
</Configuration>
</ConfigurationSection>
When I do a
xml = File.new(webconfig,"r")
doc = Nokogiri::XML(xml.read)
nodes = doc.search("//Configuration.Store")
xml.close
I get an empty node set. Am I missing something? I have tried
nodes = doc.search("//Configuration\.Store")
still no luck.
Updated: I have attached the whole XML file.
Updated the XML again: my mistake, it does have a namespace.
EDIT #2: Solution now includes #parse_with_namespace
You can find a number of Nokogiri methods pertaining to namespaces in the Nokogiri::XML::Node documentation.
# encoding: UTF-8
require 'rspec'
require 'nokogiri'
XML = <<XML
<?xml version="1.0" encoding="utf-8" ?>
<ConfigurationSection>
<Configuration xmlns="clr-namespace:Newproject.Framework.Server.Store.Configuration;assembly=Newproject.Framework.Server" >
<Configuration.Store>SqlServer</Configuration.Store>
<Configuration.Engine>Staging</Configuration.Engine>
</Configuration>
</ConfigurationSection>
XML
class ConfigParser
def parse(xml)
doc = Nokogiri::XML(xml).remove_namespaces!
configuration = doc.at('/ConfigurationSection/Configuration')
store = configuration.at("./Configuration.Store").text
engine = configuration.at("./Configuration.Engine").text
{store: store, engine: engine}
end
def parse_with_namespace(xml)
doc = Nokogiri::XML(xml)
configuration = doc.at('/ConfigurationSection/xmlns:Configuration', 'xmlns' => 'clr-namespace:Newproject.Framework.Server.Store.Configuration;assembly=Newproject.Framework.Server')
store = configuration.at("./xmlns:Configuration.Store", 'xmlns' => 'clr-namespace:Newproject.Framework.Server.Store.Configuration;assembly=Newproject.Framework.Server').text
engine = configuration.at("./xmlns:Configuration.Engine", 'xmlns' => 'clr-namespace:Newproject.Framework.Server.Store.Configuration;assembly=Newproject.Framework.Server').text
{store: store, engine: engine}
end
end
describe ConfigParser do
before(:each) do
# instance variables so that each example below can read the results
@parsed = subject.parse XML
@parsed_with_ns = subject.parse_with_namespace XML
end
it "should be able to parse the Configuration Store" do
@parsed[:store].should eq "SqlServer"
end
it "should be able to parse the Configuration Engine" do
@parsed[:engine].should eq "Staging"
end
it "should be able to parse the Configuration Store with namespace" do
@parsed_with_ns[:store].should eq "SqlServer"
end
it "should be able to parse the Configuration Engine with namespace" do
@parsed_with_ns[:engine].should eq "Staging"
end
end
require 'nokogiri'
XML = "<Configuration>
<Configuration.Store>SqlServer</Configuration.Store>
<Configuration.Engine>Staging</Configuration.Engine>
</Configuration>"
p Nokogiri::VERSION, Nokogiri.XML(XML).search('//Configuration.Store')
#=> "1.5.0"
#=> [#<Nokogiri::XML::Element:0x8103f0f8 name="Configuration.Store" children=[#<Nokogiri::XML::Text:0x81037524 "SqlServer">]>]
p RUBY_DESCRIPTION
#=> "ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-darwin10.7.0]"

rexml and nokogiri XML parsing

Can someone please explain why the Nokogiri and REXML outputs differ in the code below?
require 'rubygems'
require 'nokogiri'
require 'rexml/document'
xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>
<yml>
<a>TM and © 2009</a>
</yml>"
puts 'nokogiri'
doc = Nokogiri::XML(xml)
puts doc.to_s, "\n"
puts 'rexml'
doc = REXML::Document.new(xml)
puts doc.to_s
outputs:
nokogiri
<?xml version="1.0" encoding="ISO-8859-1"?>
<yml>
<a>TM and ? 2009</a>
</yml>
rexml
<?xml version='1.0' encoding='ISO-8859-1'?>
<yml>
<a>TM and © 2009</a>
</yml>
Sure. Nokogiri is converting the text using ISO-8859-1, as the XML declaration asks, whereas REXML is just outputting what you put in. If you change the XML to UTF-8 encoding then you'll get:
nokogiri:
<?xml version="1.0" encoding="utf-8"?>
<yml>
<a>TM and © 2009</a>
</yml>
rexml:
<?xml version='1.0' encoding='UTF-8'?>
<yml>
<a>TM and © 2009</a>
</yml>
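To see why the declared encoding changes what reaches the terminal, here is a small sketch in plain Ruby with no XML library involved (`hex_bytes` is a made-up helper). It only shows that the same character is a different byte sequence in each encoding; that a UTF-8 terminal then prints `?` for the ISO-8859-1 byte is one plausible reading of the question's output, not something the thread confirms.

```ruby
# encoding: UTF-8

# Hypothetical helper: hex bytes of +text+ once transcoded to +encoding+.
def hex_bytes(text, encoding)
  text.encode(encoding).bytes.map { |b| format("%02X", b) }.join(" ")
end

puts hex_bytes("©", "UTF-8")      # two bytes
puts hex_bytes("©", "ISO-8859-1") # one byte

# The single ISO-8859-1 byte is not a valid UTF-8 sequence on its own,
# so a terminal expecting UTF-8 has nothing sensible to draw for it.
latin1 = "©".encode("ISO-8859-1")
puts latin1.dup.force_encoding("UTF-8").valid_encoding?
```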

ruby malformed XML: missing tag start

I have a very weird problem: I run the same code on two XML files, the second of which is a copy of the first (I copied the contents, which may be the problem).
The code uses REXML to parse the XML file. The first file parses fine, but on the second I get this error:
Failed: malformed XML: missing tag start
Line: 2
Position: 102
Last 80 unconsumed characters:
<t>dede</t>
The content of the XML file is:
<?xml version="1.0" standalone="yes"?>
<t>dede</t>
Any ideas?
Thanks a lot
I do not have any such problem using this code:
require 'rexml/document'
doc = REXML::Document.new <<ENDXML
<?xml version="1.0" standalone="yes"?>
<t>dede</t>
ENDXML
doc.each_element('//t'){ |e| puts e }
#=> <t>dede</t>
What version of Ruby are you using, and what does your code actually look like?
Edit: Based off the new information that you're using the stream parser, here's another piece of code that also works for me using Ruby 1.8.7:
class Listener
def method_missing( name, *args ); puts "I don't support '#{name}'"; end
def tag_start( name, attrs ); puts "<#{name} #{attrs.inspect}>"; end
def text( str ); p str; end
def tag_end( name ); puts "</#{name}>"; end
end
require 'stringio'
xml = StringIO.new <<ENDXML
<?xml version="1.0" standalone="yes"?>
<t>dede</t>
ENDXML
require 'rexml/document'
doc = REXML::Document.parse_stream( xml, Listener.new )
#=> "\t"
#=> I don't support 'xmldecl'
#=> "\n\t"
#=> <t {}>
#=> "dede"
#=> </t>
#=> "\n"
It's because of the file encoding. I had the same problem and found out the file was UCS-2 encoded. Either UTF-8 or ANSI works, but UCS-2 doesn't, it seems; it probably needs a specialized parser first. I just converted the XML file in Notepad++ to test the different encodings.
REXML seems a bit too eager to throw a ParseException. Encoding is definitely a major culprit. Check the encoding of your files.
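If the encoding is the culprit, one option is to convert the file's bytes to UTF-8 before handing them to the parser. The following is a minimal sketch in plain Ruby, assuming the file starts with a UTF-16 byte order mark (UCS-2 text is valid UTF-16); `xml_to_utf8` is a hypothetical helper, not part of REXML.

```ruby
# Hypothetical helper: return +raw+ (the file's bytes) as UTF-8 text,
# transcoding from UTF-16 when a byte order mark says so.
def xml_to_utf8(raw)
  bytes = raw.b # binary copy, so the caller's string is left alone
  source_encoding =
    if bytes.start_with?("\xFF\xFE".b)
      "UTF-16LE"
    elsif bytes.start_with?("\xFE\xFF".b)
      "UTF-16BE"
    else
      "UTF-8" # assumption: anything without a UTF-16 mark is UTF-8
    end
  text = bytes.force_encoding(source_encoding).encode("UTF-8")
  text.sub(/\A\uFEFF/, "") # drop the byte order mark if present
end

# Simulate a file saved as UCS-2 little endian with a byte order mark.
ucs2 = "\uFEFF<?xml version=\"1.0\" standalone=\"yes\"?>\n<t>dede</t>".encode("UTF-16LE")
puts xml_to_utf8(ucs2)
```

With a real file, the bytes could be read with `File.binread` and the converted string passed to the parser in place of the file handle.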

Resources