Parsing of big file with Ruby - ruby

I need to parse an extremely big XML file (nearly 50 GB). How can I do it with Ruby? It's not possible to split it into chunks; I've already tried.

I parsed a 40GB file using Nokogiri::XML::Reader.
Structure of my XML file:
<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="4" />
  <row Id="5" />
</posts>
Code:
require 'nokogiri'

fname = "Posts.xml"

xml = Nokogiri::XML::Reader(File.open(fname))
xml.each do |post|
  # skip the whitespace-only nodes that appear between the <row> elements
  next if post.node_type == Nokogiri::XML::Reader::TYPE_SIGNIFICANT_WHITESPACE
  # do something with post
end
I think the answer depends on how you plan to use the data. In my case, I simply needed to stream the post nodes.
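For example, if all you need is one attribute per row, a sketch along these lines keeps memory flat; the Id attribute and the CSV output are just assumptions based on the sample above, not part of the original:
require 'nokogiri'
require 'csv'

# Stream the <row> elements and write one attribute out as we go,
# so the full 50GB file never has to fit in memory.
CSV.open("ids.csv", "w") do |csv|
  Nokogiri::XML::Reader(File.open("Posts.xml")).each do |node|
    next unless node.name == "row" &&
                node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    csv << [node.attribute("Id")]
  end
end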

Finding XML text values with spaces using Nokogiri

How to find the Fetus Summary node using xpath?
<container flag="SEPARATE">
  <relationship>CONTAINS</relationship>
  <concept>
    <value>125008</value>
    <scheme>
      <designator>DCM</designator>
    </scheme>
    <meaning>Fetus Summary</meaning>
  </concept>
</container>
This doesn't work:
xml.xpath( '//*[.="Fetus Summary"]' )
But similar code works when the text value contains no spaces. Can someone please help?
Try using the normalize-space() function, e.g.:
//*[normalize-space(.)="Fetus Summary"]
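For instance, with Nokogiri (assuming doc is your already-parsed document) the lookup would look like this:
# normalize-space() collapses runs of whitespace and trims the ends,
# so stray tabs or newlines inside the node no longer break the match
node = doc.at_xpath('//*[normalize-space(.)="Fetus Summary"]')
puts node  #=> <meaning>Fetus Summary</meaning>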
Your original code works just fine with the XML that you have shared:
require 'nokogiri'
doc = Nokogiri.XML '
<container flag="SEPARATE">
  <relationship>CONTAINS</relationship>
  <concept>
    <value>125008</value>
    <scheme>
      <designator>DCM</designator>
    </scheme>
    <meaning>Fetus Summary</meaning>
  </concept>
</container>
'
puts doc.at_xpath( '//*[.="Fetus Summary"]' )
#=> <meaning>Fetus Summary</meaning>
This works even if you have the XML declaration <?xml version="1.0" encoding="ISO-8859-1"?> at the start of the document.
My guess is that your document has whitespace characters that look like spaces or tabs but are not. From where are you getting your XML? (Over the Internet using open-uri, or from a file?) Can you please post the result of p my_xml_string to your question, which will encode any complex characters?
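If it helps, here is one way to dump the exact characters of the suspect node (a sketch for Ruby 1.9+, assuming doc is your parsed document):
text = doc.at_xpath('//meaning').text
p text                                                   # inspect-style output escapes odd characters
text.each_char { |c| printf("%s U+%04X\n", c, c.ord) }   # show each code point explicitly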

Ruby - Reading and editing XML file

I am writing a Ruby (1.9.3) script that reads XML files from a folder and then edits them if necessary.
My issue is that I was given XML files converted by Tidy, but its output is a little strange. For example:
<?xml version="1.0" encoding="utf-8"?>
<XML>
<item>
<ID>000001</ID>
<YEAR>2013</YEAR>
<SUPPLIER>Supplier name test,
Coproration</SUPPLIER>
...
As you can see, the <SUPPLIER> value has an extra CRLF. I don't know why it behaves this way, but I am addressing it with a Ruby script. I am having trouble, though, because I need to check whether the last character of a line is ">" or whether the first is "<", so that I can tell if there is something wrong with the markup.
I have tried:
Dir.glob("C:/testing/corrected/*.xml").each do |file|
  puts file
  File.open(file, 'r+').each_with_index do |line, index|
    first_char = line[0,1]
    if first_char != "<"
      # copy this line to the previous line and delete this one?
    end
  end
end
I also feel like I should copy the original file's content to a temporary file as I read it, and then overwrite the original. Is that the best way? Any tips are welcome, as I do not have much experience in altering a file's contents.
Regards
Does that extra \n always appear in the <SUPPLIER> node? As others have suggested, Nokogiri is a great choice for parsing XML (or HTML). You could iterate through each <SUPPLIER> node and remove the \n character, then save the XML as a new file.
require 'nokogiri'

# read and parse the old file
file = File.read("old.xml")
xml = Nokogiri::XML(file)

# replace \n and any additional whitespace with a space
xml.xpath("//SUPPLIER").each do |node|
  node.content = node.content.gsub(/\n\s+/, " ")
end

# save the output into a new file
File.open("new.xml", "w") do |f|
  f.write xml.to_xml
end

How to convert test report log files into the Jenkins JUnit XML file format?

I've been assigned a task to do in Ruby, but as I'm new to it, I'm searching the internet for a proper way to solve it:
I have a list of log file reports on a web page, divided into passed and failed.
I need to take the log files and convert them into a format compatible with the Jenkins JUnit XML file format.
Everything, including the conversion from log to XML, has to be written in Ruby.
Any ideas?!
I looked all over the internet and found information but still have no clear ideas how to solve this.
Are there any Ruby libraries that make this easier?
Has anyone ever handled anything like this?
You don't show the format you need, and I don't know what Jenkins needs, but creating XML is easy. Unfortunately, what you want will take a book, or several articles, which is beyond the scope of Stack Overflow. Basically though...
You can use a templating system like ERB, where you create templates for your overall XML document; Nokogiri::XML::Builder can be used to generate the XML programmatically; or you can do it old school and use simple string interpolation to build your XML.
A syslog file is typically fairly well structured, at least for the first several fields, followed by free-form text which is the output of various commands. A log file from Apache is similar, with columns of text, followed by some free-form, but easily parsable text. There are gems here and there, along with tutorials on how to parse a log, so search around and you'll find something. The idea is you want to break down each line read into text you can assign to an XML node.
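As a rough example of that breakdown step, here is how one might pull fields out of an Apache common-log-format line; the regexp and field names here are mine, not anything your reports necessarily use:
# a sketch: split one common-log-format line into named fields
LOG_LINE = /\A(\S+) \S+ (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)/

def fields_for(line)
  return nil unless (m = line.match(LOG_LINE))
  { host: m[1], user: m[2], time: m[3], request: m[4], status: m[5], size: m[6] }
end

p fields_for('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.0" 200 2326')
#=> {:host=>"127.0.0.1", :user=>"frank", :time=>"10/Oct/2000:13:55:36 -0700", ...}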
Once you have your fields, you can substitute them into the template or have Ruby interpolate the variables into strings, or use Builder to add the text between the tags.
It's not really hard, but is going to take several small tasks to accomplish.
Using string interpolation, if you wanted XML like:
<xml>
  <tag1>
    <tag2>some text</tag2>
    <tag2>some more text</tag2>
  </tag1>
</xml>
You could create it like:
var1 = "some text"
var2 = "some more text"

xml = %Q{
<xml>
  <tag1>
    <tag2>#{var1}</tag2>
    <tag2>#{var2}</tag2>
  </tag1>
</xml>
}

puts xml
Similarly, if you want to use ERB:
require 'erb'
var1 = "some text"
var2 = "some more text"
template = ERB.new <<-EOF
<xml>
  <tag1>
    <tag2><%= var1 %></tag2>
    <tag2><%= var2 %></tag2>
  </tag1>
</xml>
EOF

puts template.result(binding)
Which outputs:
<xml>
  <tag1>
    <tag2>some text</tag2>
    <tag2>some more text</tag2>
  </tag1>
</xml>
Or, using Nokogiri::XML::Builder:
require 'nokogiri'
var1 = "some text"
var2 = "some more text"
builder = Nokogiri::XML::Builder.new do |node|
  node.xml {
    node.tag1 {
      [var1, var2].each do |t|
        node.tag2(t)
      end
    }
  }
end

puts builder.to_xml
Which outputs:
<?xml version="1.0"?>
<xml>
  <tag1>
    <tag2>some text</tag2>
    <tag2>some more text</tag2>
  </tag1>
</xml>
This link has a short explanation that may help: https://pzolee.blogs.balabit.com/2012/11/jenkins-vs-junit-xml-format/
Basically, you use the xUnit plugin (https://wiki.jenkins-ci.org/display/JENKINS/xUnit+Plugin) to import an XML file that follows the JUnit schema. Once you have the schema, you can convert your own XML into the JUnit format.
A good description of the JUnit format can be found at llg.cubic.org/docs/junit/
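For a concrete picture of the target, here is a hand-rolled sketch of the kind of JUnit-style report the xUnit plugin consumes; the suite and case names, timings and file path are invented, and a real report may need more attributes than this:
require 'nokogiri'

builder = Nokogiri::XML::Builder.new do |x|
  x.testsuite(name: "smoke_tests", tests: "2", failures: "1") {
    x.testcase(classname: "smoke_tests", name: "login", time: "0.01")
    x.testcase(classname: "smoke_tests", name: "checkout", time: "0.02") {
      # the element text is where a log excerpt or stack trace would go
      x.failure("log excerpt goes here", message: "expected 200, got 500")
    }
  }
end

File.write("reports/smoke_tests.xml", builder.to_xml)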

Invalid characters before my XML in Ruby

When I look in an XML file, it looks fine, and starts with <?xml version="1.0" encoding="utf-16le" standalone="yes"?>
But when I read it in Ruby and print it to stout, there are two ?s in front of that: ??<?xml version="1.0" encoding="utf-16le" standalone="yes"?>
Where do these come from, and how do I remove them? Parsing the file as-is with REXML fails immediately. Removing the first two characters and then parsing gives me this error:
REXML::ParseException: #<REXML::ParseException: malformed XML: missing tag start
Line:
Position:
Last 80 unconsumed characters:
<?xml version="1.0" encoding="utf-16le" s>
What is the right way to handle this?
Edit: Below is my code. The ftp.get downloads the xml from an ftp server. (I wonder if that might be relevant.)
xml = ftp.get
puts xml
until xml[0,1] == "<" # to remove the 2 invalid characters
  puts xml[0,2]
  xml.slice! 0
end
puts xml
document = REXML::Document.new(xml)
The last puts prints the correct xml. But because of the two invalid characters, I've got the feeling something else went wrong. It shouldn't be necessary to remove anything. I'm at a loss what the problem might be, though.
Edit 2: I'm using Net::FTP to download the XML, but with this new method that lets me read the contents into a string instead of a file:
require 'net/ftp'
require 'stringio'

class Net::FTP
  def gettextcontent(remotefile, &block) # :yield: line
    f = StringIO.new()
    begin
      retrlines("RETR " + remotefile) do |line|
        f.puts(line)
        yield(line) if block
      end
    ensure
      f.close
      return f
    end
  end
end
Edit 3: It seems to be caused by StringIO (in Ruby 1.8.7) not supporting unicode. I'm not sure if there's a workaround for that.
Those two characters are most likely a Unicode BOM (byte order mark): bytes that tell whoever is reading the file what the byte order is.
As long as you know what the encoding of the file is, it should be safe to strip them; they aren't actual content.
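If you end up on a newer Ruby (1.9 or later), one way to strip it is to let the encoding machinery do the work. This is only a sketch of that idea, assuming xml holds the raw downloaded bytes; it will not run on 1.8.7:
require 'rexml/document'

# reinterpret the raw bytes as UTF-16LE, convert to UTF-8, then drop
# the byte order mark (U+FEFF) if it is still sitting at the front
xml = xml.force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
xml.sub!(/\A\uFEFF/, "")
document = REXML::Document.new(xml)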
To answer my own question: the real problem here is that encoding support in Ruby 1.8.7 is lacking. StringIO in particular seems to make a mess of it, and REXML also has trouble handling Unicode under 1.8.7.
The most attractive solution would be of course to upgrade to 1.9.3, but that's not practical for this project right now.
So what I ended up doing is avoiding StringIO and simply downloading to a file on disk, and then, instead of processing the XML with REXML, using Nokogiri instead.
Together, that solves all my problems.
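In case it is useful to someone else, the final shape of that approach looks roughly like this; the host, credentials and file names are placeholders:
require 'net/ftp'
require 'nokogiri'

Net::FTP.open('ftp.example.com') do |ftp|
  ftp.login('user', 'password')
  # binary mode copies the bytes verbatim, BOM and all, straight to disk
  ftp.getbinaryfile('remote.xml', 'local.xml')
end

doc = Nokogiri::XML(File.open('local.xml'))
puts doc.root.name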

How do you remove illegal characters from an XML file in HTTParty?

I was trying to download an XML file that has '&' symbols in it using the HTTParty gem, and I am getting this error:
"treeparser.rb:95:in `rescue in parse' <RuntimeError: Illegal character '&'
in raw string "4860 BOOMM 10x20 MD&"> (MultiXml::ParseError)"
Here is my code:
class SAPOrders
  include HTTParty
  default_params :output => 'xml'
  format :xml
  base_uri '<webservice url>'
end

xml = SAPOrders.get('/<nameOfFile.xml>').inspect
What am I missing?
If you are using HTTParty and it's trying to parse the incoming XML before you can get your hands on it, then you'll need to split that process into the GET and the parse, so you can put code between the two.
I use OpenURI and Nokogiri for just those reasons, but whether you use those two or their equivalents, you will have the opportunity to pre-process the XML before parsing it. A bare '&' is an illegal character; it should be encoded or placed in a CDATA block, but unfortunately, in the wilds of the internet there are lots of malformed XML feeds and files.
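For instance, a sketch of that two-step flow with OpenURI and Nokogiri might look like this; the URL and the ampersand clean-up regexp are illustrative only (on older Rubies the fetch was plain open() rather than URI.open):
require 'open-uri'
require 'nokogiri'

raw = URI.open('http://example.com/nameOfFile.xml').read   # fetch without parsing
# escape ampersands that are not already part of an entity
raw = raw.gsub(/&(?!(?:amp|lt|gt|quot|apos|#\d+|#x\h+);)/, '&amp;')
doc = Nokogiri::XML(raw)
puts doc.errors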
The thing I like about Nokogiri for this task is it keeps on chugging, at least as far as it can. You can look to see if you had errors after the document is parsed, and you can tweak some of its parser settings to control what it will do or complain about:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<a>
<b parm="4860 BOOMM 10x20 MD&">foobar</b>
</a>
EOT
puts doc.errors
puts doc.to_xml
Which will output:
xmlParseEntityRef: no name
<?xml version="1.0"?>
<a>
<b parm="4860 BOOMM 10x20 MD">foobar</b>
</a>
Notice that Nokogiri stripped the & but I was still able to get usable output. You have to decide whether you want an error and to halt using the STRICT option, or to continue, but Nokogiri can do either, depending on your needs.
You can massage the incoming XML:
require 'nokogiri'
xml = <<EOT
<a>
<b parm="4860 BOOMM 10x20 MD&">foobar</b>
</a>
EOT
xml['MD&'] = 'MD&amp;'
doc = Nokogiri::XML(xml) do |config|
config.strict
end
puts doc.errors
puts doc.to_xml
Which now outputs:
<?xml version="1.0"?>
<a>
<b parm="4860 BOOMM 10x20 MD&amp;">foobar</b>
</a>
I know this isn't a perfect answer, but from my experience dealing with a lot of RSS/Atom and XML/HTML parsing, sometimes we have to open the dirty-tricks bag and go with whatever works instead of what was elegant.
Another path to nirvana in HTTParty would be to sub-class the parser. You should be able to get into the flow of the XML to the parser and massage it there. From the docs:
# Intercept the parsing for all formats
class SimpleParser < HTTParty::Parser
  def parse
    perform_parsing
  end
end
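Building on that, a sketch of a parser that scrubs bare ampersands before the normal XML parsing runs could look like this; the class name and the regexp are mine, so treat it as a starting point rather than a drop-in:
require 'httparty'

class CleaningXmlParser < HTTParty::Parser
  def parse
    # escape ampersands that are not already part of an entity,
    # then let HTTParty carry on with its usual XML parsing
    body.gsub!(/&(?!(?:amp|lt|gt|quot|apos|#\d+|#x\h+);)/, '&amp;') if body
    super
  end
end

class SAPOrders
  include HTTParty
  format :xml
  base_uri '<webservice url>'
  parser CleaningXmlParser
end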
