Why can't REXML parse CDATA preceded by a line break? - ruby

I'm very new to Ruby, and trying to parse an XML document with REXML that has been previously pretty-printed (by REXML) with some slightly erratic results.
Some CDATA sections have a line break after the opening XML tag but before the start of the CDATA block; in these cases REXML parses the text of the tag as empty.
Any idea how I can get REXML to read these values?
If not, could I rewrite them beforehand with a regex or something?
Is this even valid XML?
Here's an example XML document (much abridged):
<?xml version="1.0" encoding="utf-8"?>
<root-tag>
<content type="base64"><![CDATA[V2VsbCBkb25lISBJdCB3b3JrcyA6KQ==]]></content>
<content type="base64">
<![CDATA[VGhpcyB3b250IHdvcms=]]></content>
<content><![CDATA[This will work]]></content>
<content>
<![CDATA[This will not appear]]></content>
<content>
Seems happy</content>
<content>Obviously no problem</content>
</root-tag>
and here's my Ruby script (distilled down to a minimal example):
require 'rexml/document'
require 'base64'
include REXML

module RexmlSpike
  file = File.new("ex.xml")
  doc = Document.new file
  doc.elements.each("root-tag/content") do |contentElement|
    if contentElement.attributes["type"] == "base64"
      puts "decoded: " << Base64.decode64(contentElement.text)
    else
      puts "raw: " << contentElement.text
    end
  end
  puts "Finished."
end
The output I get is:
>> ruby spike.rb
decoded: Well done! It works :)
decoded:
raw: This will work
raw:
raw:
Seems happy
raw: Obviously no problem
Finished.
I'm using Ruby 1.9.3p392 on OS X Lion. The object of the exercise is ultimately to parse comments from some BlogML into the custom import XML used by Disqus.

Why
Having anything before the <![CDATA[]]> overrides whatever is in the <![CDATA[]]>: anything from a letter to a single space or a newline (as you've discovered). This makes sense, because your example is getting the text of the element, and whitespace counts as text. In the examples where you are able to access the <![CDATA[]]>, it is because the CDATA section is the first text-like child of the element, so text returns its value.
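This is easy to confirm with REXML itself; a minimal sketch (the <content> markup just mirrors the question):

```ruby
require 'rexml/document'
include REXML

# With a line break before the CDATA, the first text child is "\n",
# so that whitespace is what #text returns:
with_break = Document.new("<content>\n<![CDATA[hidden]]></content>")
p with_break.root.text # => "\n"

# With the CDATA immediately after the tag, it is the first text-like
# child, and #text returns its value:
without_break = Document.new("<content><![CDATA[shown]]></content>")
p without_break.root.text # => "shown"
```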
Solution
If you look at the documentation for Element, you'll see that it has a function called cdatas() that:
Get an array of all CData children. IMMUTABLE.
So, in your example, if you do an inner loop on contentElement.cdatas() you would see the content of all your missing tags.
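Applied to the question's script, that might look like the following sketch, which falls back to cdatas() whenever text is nil or only whitespace:

```ruby
require 'rexml/document'
require 'base64'
include REXML

doc = Document.new(<<~XML)
  <root-tag>
    <content type="base64"><![CDATA[V2VsbCBkb25lISBJdCB3b3JrcyA6KQ==]]></content>
    <content type="base64">
    <![CDATA[VGhpcyB3b250IHdvcms=]]></content>
  </root-tag>
XML

doc.elements.each("root-tag/content") do |el|
  # Fall back to the CDATA children when #text is nil or only whitespace
  text = el.text.to_s
  text = el.cdatas.map(&:value).join if text.strip.empty?
  puts "decoded: " << Base64.decode64(text)
end
```

Both elements now decode, including the one with a line break before its CDATA block.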

I'd recommend using Nokogiri, which is the de facto XML/HTML parser for Ruby. Using it to access the contents of the <content> tags, I get:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="utf-8"?>
<root-tag>
<content type="base64"><![CDATA[V2VsbCBkb25lISBJdCB3b3JrcyA6KQ==]]></content>
<content type="base64">
<![CDATA[VGhpcyB3b250IHdvcms=]]></content>
<content><![CDATA[This will work]]></content>
<content>
<![CDATA[This will not appear]]></content>
<content>
Seems happy</content>
<content>Obviously no problem</content>
</root-tag>
EOT
doc.search('content').each do |n|
  puts n.content
end
Which outputs:
V2VsbCBkb25lISBJdCB3b3JrcyA6KQ==
VGhpcyB3b250IHdvcms=
This will work
This will not appear
Seems happy
Obviously no problem

Your XML is valid, but not in the way you expect, as lightswitch05 pointed out. You can check it with the W3C XML validator.
If you are using XML from the wild web, it is a good idea to use Nokogiri, because it usually works the way you think it should, not just the way the spec says it should.
Side note: this is exactly why I avoid XML and use JSON instead: XML has a proper definition, but no one seems to follow it anyway.

Related

How can I keep CDATA while updating the content of an xml node?

In a Ruby script, I want to update the CDATA content while keeping the format as CDATA.
doc = Nokogiri::XML(File.open('text.xml'))
doc.xpath('//Test').each do |test|
test.content = 'new string'
end
This is my test.xml file
<?xml version="1.0" encoding="UTF-8"?>
<Test><![CDATA[<p>Some content</p>]]></Test>
The problem is that in my doc the CDATA converts to Text. Is there any way I can keep the CDATA property?
Thanks
Nokogiri::XML::Element#content= will replace the content of Test with a plain text node (destroying whatever content used to be there). You need to find the CDATA node itself, then call content= on that. For example:
doc.xpath('//Test').each do |test|
  cdata = test.children.find(&:cdata?)
  cdata.content = 'new string' if cdata
end
(It would be more straightforward if you could tell XPath to directly select the CDATA node, but I don't know that it can do that.)

Problem reading XML with Nokogiri

My Ruby script is supposed to read in an XML doc from a URL and check it for well-formedness, returning any errors. I have a sample bad XML document hosted with the following text (taken from the Nokogiri tutorial):
<?xml version="1.0"?>
<root>
<open>foo
<closed>bar</closed>
</root>
My test script is as follows (url refers to the above xml file hosted on my personal server):
require 'nokogiri'
document = Nokogiri::XML(url)
puts document
puts document.errors
The output is:
<?xml version="1.0"?>
Start tag expected, '<' not found
Why is it only capturing the first line of the XML file? It does this even with known-good XML files.
It is trying to parse the URL itself, not its content. Take into account that the first parameter to Nokogiri::XML must be a string containing the document, or an IO object, since it is just a shortcut to Nokogiri::XML::Document.parse, as stated here.
EDIT: To read from a URI:
require 'open-uri'
open(uri).read
I'm not too sure what code you are using to actually output the contents of the XML. I only see error printing code. However, I have posted some sample code to effectively move through XML with Nokogiri below:
<item>
Something
</item>
<item>
Else
</item>
doc = Nokogiri::XML(open(url))
set = doc.xpath('//item')
set.each { |item| puts item.to_s }
#=> Something
#=> Else
In general, the tutorial here should help you.
If you are getting the XML from an existing Nokogiri document, make sure you use .to_s before passing it to the XML function. For example:
xml = Nokogiri::XML(existing_nokogiri_xml_doc.to_s)

How to tidy up malformed xml in ruby

I'm having issues tidying up malformed XML I'm getting back from the SEC's EDGAR database.
For some reason they have horribly formed XML. Tags that contain any sort of string aren't closed, and a tag can actually contain other XML or HTML documents inside it. Normally I'd hand this off to Tidy, but that isn't being maintained anymore.
I've tried using Nokogiri::XML::SAX::Parser, but that seems to choke because the tags aren't closed. It seems to work alright until it hits the first ending tag, and then it doesn't fire on any more of them. But it is spitting out the right characters.
class Filing < Nokogiri::XML::SAX::Document
  def start_element name, attrs = []
    puts "starting: #{name}"
  end

  def characters str
    puts "chars: #{str}"
  end

  def end_element name
    puts "ending: #{name}"
  end
end
It seems like this would be the best option because I can simply have it ignore the other xml or html doc. Also it would make the most sense because some of these documents can get quite large so storing the whole dom in memory would probably not work.
Here are some example files: 1 2 3
I'm starting to think I'll just have to write my own custom parser.
Nokogiri's normal DOM mode is able to automatically fix up the XML so it is syntactically correct, or a reasonable facsimile of that. It sometimes gets confused and will shift closing tags around, but you can preprocess the file to give it a nudge in the right direction if need be.
I saved the XML #1 out to a document and loaded it:
require 'nokogiri'

doc = ''
File.open('./test.xml') do |fi|
  doc = Nokogiri::XML(fi)
end

puts doc.to_xml
After parsing, you can check the Nokogiri::XML::Document instance's errors method to see what errors were generated, for perverse pleasure.
doc.errors
If using Nokogiri's DOM model isn't good enough, have you considered using XMLLint to preprocess and clean the data, emitting clean XML so the SAX will work? Its --recover option might be of use.
xmllint --recover test.xml
It will output errors on stderr, and the code on stdout, so you can pipe it easily to another file.
As for writing your own parser... why? You have other options available to you, and reinventing a nicely implemented wheel is not a good use of time.

Parsing SEC Edgar XML file using Ruby into Nokogiri

I'm having problems parsing the SEC Edgar files
Here is an example of this file.
The end result is I want the stuff between <XML> and </XML> into a format I can access.
Here is my code so far that doesn't work:
scud = open("http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt")
full = scud.read
full.match(/<XML>(.*)<\/XML>/)
Ok, there are a couple of things wrong:
sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt is NOT XML, so Nokogiri will be of no use to you unless you strip off all the garbage from the top of the file, down to where the true XML starts, then trim off the trailing tags to keep the XML correct. So, you need to attack that problem first.
You don't say what you want from the file. Without that information we can't recommend a real solution. You need to take more time to define the question better.
Here's a quick piece of code to retrieve the page, strip the garbage, and parse the resulting content as XML:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::XML(
  open('http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt')
    .read
    .gsub(/\A.+<xml>\n/im, '')
    .gsub(/<\/xml>.+/mi, '')
)
puts doc.at('//schemaVersion').text
# >> X0603
I recommend practicing in IRB and reading the docs for Nokogiri.
> require 'nokogiri'
=> true
> require 'open-uri'
=> true
> doc = Nokogiri::HTML(open('http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt'))
> doc.xpath('//firstname')
=> [#<Nokogiri::XML::Element:0x80c18290 name="firstname" children=[#<Nokogiri::XML::Text:0x80c18010 "Joshua">]>, #<Nokogiri::XML::Element:0x80c14d48 name="firstname" children=[#<Nokogiri::XML::Text:0x80c14ac8 "Patrick">]>, #<Nokogiri::XML::Element:0x80c11fd0 name="firstname" children=[#<Nokogiri::XML::Text:0x80c11d50 "Brian">]>]
that should get you going
Given this was asked a year back, the answer is probably OBE (overtaken by events), but what the asker should do is examine all of the documents on the site and notice that the actual filing details can be found at:
http://sec.gov/Archives/edgar/data/1475481/000147548109000001/0001475481-09-000001-index.htm
Within this, you will see that the XML document he is after is already parsed out, ready for further manipulation, at:
http://sec.gov/Archives/edgar/data/1475481/000147548109000001/primary_doc.xml
Be warned, however, that the actual file name at the end is determined by the submitter of the document, not by the SEC. Therefore, you cannot depend on the document always being named 'primary_doc.xml'.

Remove all but certain tags in an XML document with Ruby

require 'nokogiri'
doc = Nokogiri::XML "<root>
<a>foo<c>bar</c></a>
<b>jim<d>jam></d></b>
<a>more</a>
<x>no no no</x>
</root>"
doc.css("a, b").each {|o| p o.to_s}
# "<a>foo<c>bar</c></a>"
# "<a>more</a>"
# "<b>jim<d>jam></d></b>"
How can I keep tags in their original order? Or also remove nested tags?
You might want to look at whitelist/blacklist/scrubbing gems. Sanitize and Loofah come to mind.
From Sanitize's description:
Given a list of acceptable elements and attributes, Sanitize will remove all unacceptable HTML from a string.
From Loofah's description:
Loofah excels at HTML sanitization (XSS prevention). It includes some nice HTML sanitizers, which are based on HTML5lib’s whitelist, so it most likely won’t make your codes less secure. (These statements have not been evaluated by Netexperts.)
In either case, they'll save you from reinventing a wheel.
require 'nokogiri'
doc = Nokogiri::XML "
<root>
<a>foo<c>bar</c></a>
<b>jim<d>jam></d></b>
<a>more</a>
<x>no no no</x>
</root>"
doc.xpath('root//*[name()!="a"][name()!="b"]').remove
puts doc
#=> <?xml version="1.0"?>
#=> <root>
#=> <a>foo</a>
#=> <b>jim</b>
#=> <a>more</a>
#=>
#=> </root>
If this is just an issue of order and none of the tags you need to isolate are nested, using XPath instead of CSS selectors in Nokogiri should return the tags in the same order they are in the document:
doc.xpath("//a | //b").each { |o| puts o }
I'm not sure if this behavior is in any spec for Nokogiri, so you may want to be careful, but in my experience it is true.
Of course, if the tags you're after are ever nested you may need to define what it means to "remove all but certain tags" (e.g. what happens to removed tags and their contents that exist inside non-removed tags and their contents, etc.).
If your requirement is sufficiently complicated such that XPath queries won't cut it, you may need to "walk the DOM" using something like doc.root.children and recursively examine the children of each node.