How do I use Nokogiri to parse a single file with multiple XML documents in it? - ruby

I have a single file that contains multiple concatenated XML files like so:
<?xml version ... ?>
<!DOCTYPE ... >
...
<?xml version ... ?>
<!DOCTYPE ... >
...
<?xml version ... ?>
<!DOCTYPE ... >
...
Is there any way to parse the file as is, using Nokogiri, as opposed to slicing the file up?

You need to slice it into individual documents, but that is an easy thing to do.
Ruby's String#split method makes it easy. For instance, if the variable foo contains the text, then foo.split("<?xml version ... ?>\n") returns an array you can loop over:
foo.split("<?xml version ... ?>\n")
[
[0] "",
[1] "<!DOCTYPE ... >\n...\n",
[2] "<!DOCTYPE ... >\n...\n",
[3] "<!DOCTYPE ... >\n..."
]
Skip the empty first element, parse each of the remaining chunks, and you'll be on your way. You might need to prepend the XML declaration to each chunk to make Nokogiri happy, but I think it'll do OK without it.
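If you would rather keep the declaration on each chunk, splitting on a zero-width lookahead is one option. This is only a sketch: split_xml_documents and the sample text are made up for illustration, and it assumes the string "<?xml" never appears inside a document body (for example in a CDATA section).

```ruby
# Split concatenated XML documents while keeping each XML declaration.
# The lookahead (?=...) matches the position just before "<?xml",
# so nothing is consumed by the split.
def split_xml_documents(text)
  text.split(/(?=<\?xml\b)/).map(&:strip).reject(&:empty?)
end

sample = <<~XML
  <?xml version="1.0"?>
  <a>1</a>
  <?xml version="1.0"?>
  <b>2</b>
  <?xml version="1.0"?>
  <c>3</c>
XML

chunks = split_xml_documents(sample)
chunks.each_with_index do |chunk, i|
  puts "document #{i}: #{chunk.lines.last}"
end
```

Each element of chunks can then be handed to the parser, e.g. chunks.map { |c| Nokogiri::XML(c) }.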

This wouldn't be a valid XML file, so you can't parse it all in one go. But you may be able to create a class that inherits from File, and has the smarts to return end-of-file when you get to the end of each XML document. With that, you should be able to open your file once, but you would still make multiple calls to your XML parser.
If the XML fragments are not very large, it may be best to slurp one fragment at a time into a string variable (perhaps using a regexp), and parse that.
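A rough sketch of the one-fragment-at-a-time idea, without subclassing File. It assumes every XML declaration starts at the beginning of a line, as in the question's sample; each_xml_document is a hypothetical helper name and StringIO stands in for the real file.

```ruby
require 'stringio'

# Yield one document's text at a time. Only the lines of the
# current document are held in memory.
def each_xml_document(io)
  return enum_for(:each_xml_document, io) unless block_given?

  io.each_line
    .slice_before { |line| line.start_with?('<?xml') }
    .each { |lines| yield lines.join }
end

io = StringIO.new(<<~DATA)
  <?xml version="1.0"?>
  <a>1</a>
  <?xml version="1.0"?>
  <b>2</b>
DATA

docs = each_xml_document(io).to_a
puts docs.size
```

With a real file you would wrap this in File.open(path) { |f| each_xml_document(f) { |xml| ... } } and parse each string inside the block.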

Related

How to replace the first few bytes of a file in Ruby without opening the whole file?

I have a 30MB XML file that contains some gibberish in the beginning, and so typically I have to remove that in order for Nokogiri to be able to parse the XML document properly.
Here's what I currently have:
contents = File.open(file_path).read
if contents[0..123].include? 'authenticate_response'
fixed_contents = File.open(file_path).read[123..-1]
File.open(file_path, 'w') { |f| f.write(fixed_contents) }
end
However, this actually causes the ruby script to open up the large XML file twice. Once to read the first 123 characters, and another time to read everything but the first 123 characters.
To solve the first issue, I was able to accomplish this:
contents = File.open(file_path).read(123)
However, now I need to remove these characters from the file without reading the entire file. How can I "trim" the beginning of this file without having to open the entire thing in memory?
You can open the file once, then read and check the "garbage", and finally pass the opened file directly to Nokogiri for parsing. That way, you only need to read the file once and don't need to write it at all.
File.open(file_path) do |xml_file|
if xml_file.read(123).include? 'authenticate_response'
# header found, nothing to do
else
# no header found. We rewind and let nokogiri parse the whole file
xml_file.rewind
end
xml = Nokogiri::XML.parse(xml_file)
# Now do whatever you want with the parsed XML document
end
Please refer to the documentation of IO#read, IO#rewind and Nokogiri::XML::Document.parse for details about those methods.
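If a trimmed copy really has to exist on disk, the bytes after the header can be streamed into a new file so that the whole document never sits in memory. A file cannot generally be shortened from the front in place, so this sketch writes a second file. copy_without_prefix is a hypothetical helper, and the 123-byte header length is taken from the question.

```ruby
require 'tempfile'

# Copy everything after the first prefix_bytes bytes of src_path
# into dst_path. IO.copy_stream copies in chunks from the current
# read position, so memory use stays small.
def copy_without_prefix(src_path, dst_path, prefix_bytes)
  File.open(src_path, 'rb') do |src|
    src.seek(prefix_bytes)
    IO.copy_stream(src, dst_path)
  end
end

# Demo with temporary files standing in for the 30MB XML file.
src = Tempfile.new('with_header')
src.binmode
src.write('x' * 123 + '<root/>')
src.close

dst = Tempfile.new('trimmed')
dst.close

copy_without_prefix(src.path, dst.path, 123)
result = File.binread(dst.path)
puts result
```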

Finding XML text values with spaces using Nokogiri

How to find the Fetus Summary node using xpath?
<container flag="SEPARATE">
<relationship>CONTAINS</relationship>
<concept>
<value>125008</value>
<scheme>
<designator>DCM</designator>
</scheme>
<meaning>Fetus Summary</meaning>
</concept>
</container>
This doesn't work:
xml.xpath( '//*[.="Fetus Summary"]' )
But similar code does, when using text values without spaces. Can someone please help?
Try to use normalize-space() function, e.g.:
//*[normalize-space(.)="Fetus Summary"]
Your original code works just fine with the XML that you have shared:
require 'nokogiri'
doc = Nokogiri.XML '
<container flag="SEPARATE">
<relationship>CONTAINS</relationship>
<concept>
<value>125008</value>
<scheme>
<designator>DCM</designator>
</scheme>
<meaning>Fetus Summary</meaning>
</concept>
</container>
'
puts doc.at_xpath( '//*[.="Fetus Summary"]' )
#=> <meaning>Fetus Summary</meaning>
This works even if you have the XML declaration <?xml version="1.0" encoding="ISO-8859-1"?> at the start of the document.
My guess is that your document has whitespace characters that look like spaces or tabs but are not. From where are you getting your XML? (Over the Internet using open-uri, or from a file?) Can you please post the result of p my_xml_string to your question, which will encode any complex characters?
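To check that guess, it can help to list every non-ASCII character in the string together with its code point. The sketch below assumes the culprit is a no-break space (U+00A0), which is only one possibility; non_ascii_chars is a hypothetical helper. Note that XPath's normalize-space() only treats space, tab, CR and LF as whitespace, so it would not fix a no-break space.

```ruby
# Return [index, code point] pairs for every non-ASCII character.
def non_ascii_chars(str)
  str.each_char.with_index
     .reject { |ch, _| ch.ascii_only? }
     .map { |ch, i| [i, format('U+%04X', ch.ord)] }
end

text = "Fetus\u00A0Summary" # looks like a space, but is a no-break space

found = non_ascii_chars(text)
found.each { |index, code| puts "index #{index}: #{code}" }

# One possible fix: replace the no-break space before querying.
cleaned = text.tr("\u00A0", ' ')
puts cleaned == 'Fetus Summary'
```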

convert xml to utf-8 encoding

I have an xml that starts with
<?xml version='1.0' encoding='ISO-8859-8'?>
when I attempt to do
Hash.from_xml(my_xml)
I get a #<REXML::ParseException: No close tag for /root/response/message> (REXML::ParseException)
In the message tag there are indeed characters in the above encoding. I need to parse that XML, so I am guessing that I need to convert it all to UTF-8 or something else that the parser will like.
Is there a way to do this? (other uses like with Nokogiri are also good)
Nokogiri seems to do the right thing:
# test.xml
<?xml version='1.0' encoding='ISO-8859-8'?>
<what>
<body>דה</body>
</what>
xml = Nokogiri::XML(File.read 'test.xml')
puts xml.at_xpath('//body').content
# => "דה"
You can also tell Nokogiri what encoding to use (e.g., Nokogiri::XML(File.read('test.xml'), nil, 'ISO-8859-8')), but that doesn't seem to be necessary here.
If that doesn't help, you might want to check that your XML is well-formed.
You can then convert the XML to UTF-8 if you like:
xml2 = xml.serialize(:encoding => 'UTF-8') {|c| c.format.as_xml }
If you just want to convert your Nokogiri XML to a hash, take a look at some of the solutions here: Convert a Nokogiri document to a Ruby Hash, or you can just do: Hash.from_xml(xml2).
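If you would rather convert the raw text before any parser sees it, plain Ruby (1.9 or later) can transcode the bytes and rewrite the declaration. This is a sketch: to_utf8_xml is a hypothetical helper, and it assumes the declaration names the encoding exactly as shown in the question.

```ruby
# Transcode raw XML bytes to UTF-8 and update the encoding
# named in the XML declaration to match.
def to_utf8_xml(bytes, source_encoding)
  text = bytes.dup.force_encoding(source_encoding).encode('UTF-8')
  text.sub(/(encoding=)(['"])#{Regexp.escape(source_encoding)}\2/i,
           '\1\2UTF-8\2')
end

# 0xE3 and 0xE4 are the ISO-8859-8 bytes for the two Hebrew
# letters in the test.xml example above.
raw = "<?xml version='1.0' encoding='ISO-8859-8'?>\n<what><body>".b +
      [0xE3, 0xE4].pack('C*') +
      '</body></what>'.b

utf8 = to_utf8_xml(raw, 'ISO-8859-8')
puts utf8
```

The resulting string can then be passed to Hash.from_xml or to Nokogiri::XML.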

How to convert test report log files into Jenkins Junit XML File format?

I've been assigned a task to do with Ruby, but as I'm new to it I'm searching over the internet for a proper solution to solve it:
I have a list of log file reports on a web page, divided into passed and failed.
I need to take the log files and convert them into a format compatible with the Jenkins JUnit XML file format.
Everything, including the conversion from log to XML, has to be written in Ruby.
Any ideas?!
I looked all over the internet and found information but still have no clear ideas how to solve this.
Are there any Ruby libraries that make this easier?
Has anyone ever handled anything like this?
You don't show the format you need, and I don't know what Jenkins needs, but creating XML is easy. Unfortunately, what you want will take a book, or several articles, which is beyond the scope of Stack Overflow. Basically though...
You can use a templating system, like ERB where you create templates for your overall XML document, or Nokogiri::Builder can be used to generate XML, or you can do it old school and use simple string interpolation to create your XML.
A syslog file is typically fairly well structured, at least for the first several fields, followed by free-form text which is the output of various commands. A log file from Apache is similar, with columns of text, followed by some free-form, but easily parsable text. There are gems here and there, along with tutorials on how to parse a log, so search around and you'll find something. The idea is you want to break down each line read into text you can assign to an XML node.
Once you have your fields, you can substitute them into the template or have Ruby interpolate the variables into strings, or use Builder to add the text between the tags.
It's not really hard, but is going to take several small tasks to accomplish.
Using string interpolation, if you wanted XML like:
<xml>
<tag1>
<tag2>some text</tag2>
<tag2>some more text</tag2>
</tag1>
</xml>
You could create it like:
var1 = "some text"
var2 = "some more text"
xml = %Q{
<xml>
<tag1>
<tag2>#{var1}</tag2>
<tag2>#{var2}</tag2>
</tag1>
</xml>
}
puts xml
Similarly, if you want to use ERB:
require 'erb'
var1 = "some text"
var2 = "some more text"
template = ERB.new <<-EOF
<xml>
<tag1>
<tag2><%= var1 %></tag2>
<tag2><%= var2 %></tag2>
</tag1>
</xml>
EOF
puts template.result(binding)
Which outputs:
<xml>
<tag1>
<tag2>some text</tag2>
<tag2>some more text</tag2>
</tag1>
</xml>
Or, using Nokogiri::Builder:
require 'nokogiri'
var1 = "some text"
var2 = "some more text"
builder = Nokogiri::XML::Builder.new do |node|
node.xml {
node.tag1 {
[var1, var2].each do |t|
node.tag2(t)
end
}
}
end
puts builder.to_xml
Which outputs:
<?xml version="1.0"?>
<xml>
<tag1>
<tag2>some text</tag2>
<tag2>some more text</tag2>
</tag1>
</xml>
This link has a short explanation that may help:
https://pzolee.blogs.balabit.com/2012/11/jenkins-vs-junit-xml-format/
Basically, you just use the xUnit plugin (https://wiki.jenkins-ci.org/display/JENKINS/xUnit+Plugin)
to import XML that follows the JUnit schema. Once you have the schema, you can convert your own XML to the JUnit format.
A good description of the JUnit format can be found at llg.cubic.org/docs/junit/
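For a rough idea of what the conversion could look like, here is a sketch that turns log lines into a minimal JUnit-style report using only core Ruby. The log format ("PASSED name: message" or "FAILED name: message") is invented for illustration, since the question does not show the real one, and only the common testsuite, testcase and failure elements are emitted. Check the output against the schema your Jenkins setup expects.

```ruby
# Hypothetical log format, one result per line:
#   "<PASSED|FAILED> <test name>: <message>"
LOG_LINE = /\A(PASSED|FAILED) (\w+): (.*)\z/

def junit_xml(suite_name, log_text)
  results = log_text.each_line.map { |line| LOG_LINE.match(line.chomp) }.compact
  failures = results.count { |m| m[1] == 'FAILED' }
  # encode(xml: :attr) escapes the value and wraps it in double quotes.
  suite = suite_name.encode(xml: :attr)

  out = ['<?xml version="1.0" encoding="UTF-8"?>']
  out << "<testsuite name=#{suite} tests=\"#{results.size}\" failures=\"#{failures}\">"
  results.each do |m|
    name = m[2].encode(xml: :attr)
    if m[1] == 'FAILED'
      out << "  <testcase classname=#{suite} name=#{name}>"
      out << "    <failure message=#{m[3].encode(xml: :attr)}/>"
      out << '  </testcase>'
    else
      out << "  <testcase classname=#{suite} name=#{name}/>"
    end
  end
  out << '</testsuite>'
  out.join("\n") + "\n"
end

log = <<~LOG
  PASSED login_test: ok
  FAILED upload_test: expected 200 but got 500 <timeout>
LOG

report = junit_xml('nightly', log)
puts report
```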

Invalid characters before my XML in Ruby

When I look in an XML file, it looks fine, and starts with <?xml version="1.0" encoding="utf-16le" standalone="yes"?>
But when I read it in Ruby and print it to stdout, there are two ?s in front of that: ??<?xml version="1.0" encoding="utf-16le" standalone="yes"?>
Where do these come from, and how do I remove them? Parsing it like this with REXML fails immediately. Removing the first two characters and then parsing it gives me this error:
REXML::ParseException: #<REXML::ParseException: malformed XML: missing tag start
Line:
Position:
Last 80 unconsumed characters:
<?xml version="1.0" encoding="utf-16le" s>
What is the right way to handle this?
Edit: Below is my code. The ftp.get downloads the xml from an ftp server. (I wonder if that might be relevant.)
xml = ftp.get
puts xml
until xml[0,1] == "<" # to remove the 2 invalid characters
puts xml[0,2]
xml.slice! 0
end
puts xml
document = REXML::Document.new(xml)
The last puts prints the correct xml. But because of the two invalid characters, I've got the feeling something else went wrong. It shouldn't be necessary to remove anything. I'm at a loss what the problem might be, though.
Edit 2: I'm using Net::FTP to download the XML, but with this new method that lets me read the contents into a string instead of a file:
class Net::FTP
def gettextcontent(remotefile, &block) # :yield: line
f = StringIO.new()
begin
retrlines("RETR " + remotefile) do |line|
f.puts(line)
yield(line) if block
end
ensure
f.close
return f
end
end
end
Edit 3: It seems to be caused by StringIO (in Ruby 1.8.7) not supporting unicode. I'm not sure if there's a workaround for that.
Those 2 characters are most likely a Unicode BOM (byte order mark): bytes that tell whoever is reading the file what the byte order is.
As long as you know what the encoding of the file is, it should be safe to strip them, since they aren't actual content.
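On Ruby 1.9 or later (not the 1.8.7 from the question), stripping the BOM and decoding could look roughly like this. It assumes the data really is UTF-16LE, whose BOM is the two bytes FF FE; decode_utf16le is a hypothetical helper.

```ruby
# Drop a leading UTF-16LE BOM (bytes FF FE) if present,
# then transcode the remaining bytes to UTF-8.
def decode_utf16le(bytes)
  data = bytes.b # binary copy, so the caller's string is untouched
  data = data.byteslice(2..-1) if data.start_with?("\xFF\xFE".b)
  data.force_encoding('UTF-16LE').encode('UTF-8')
end

# Simulated download: BOM followed by UTF-16LE encoded XML.
raw = "\xFF\xFE".b + '<?xml version="1.0"?><a/>'.encode('UTF-16LE').b

xml = decode_utf16le(raw)
puts xml
```

If the declaration still names utf-16le after decoding, as in the question, it may need to be adjusted before the string is handed to a parser.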
To answer my own question, the real problem here is that encoding support in Ruby 1.8.7 is lacking. StringIO in particular seems to make a mess of it. REXML also has trouble handling Unicode in Ruby 1.8.7.
The most attractive solution would be of course to upgrade to 1.9.3, but that's not practical for this project right now.
So what I ended up doing was to avoid StringIO and simply download to a file on disk, and then process the XML with Nokogiri instead of REXML.
Together, that solves all my problems.
