Problem reading XML with Nokogiri - ruby

My Ruby script is supposed to read in an XML doc from a URL and check it for well-formedness, returning any errors. I have a sample bad XML document hosted with the following text (from the Nokogiri tutorial):
<?xml version="1.0"?>
<root>
<open>foo
<closed>bar</closed>
</root>
My test script is as follows (url refers to the above xml file hosted on my personal server):
require 'nokogiri'
document = Nokogiri::XML(url)
puts document
puts document.errors
The output is:
<?xml version="1.0"?>
Start tag expected, '<' not found
Why is it only capturing the first line of the XML file? It does this even with known good XML files.

It is trying to parse the URL itself, not the content behind it. The first parameter to Nokogiri::XML must be a string containing the document, or an IO object, since it is just a shortcut to Nokogiri::XML::Document.parse, as stated here.
EDIT: To read the document from a URI:
require 'open-uri'
URI.open(uri).read # on Ruby 3.0+, plain open(uri) no longer fetches URLs

I'm not too sure what code you are using to actually output the contents of the XML. I only see error printing code. However, I have posted some sample code to effectively move through XML with Nokogiri below:
<items>
<item>
Something
</item>
<item>
Else
</item>
</items>
doc = Nokogiri::XML(URI.open(url)) # needs require 'nokogiri' and require 'open-uri'
set = doc.xpath('//item')
set.each { |item| puts item.text.strip }
#=> Something
#=> Else
In general, the tutorial here should help you.

If you are getting the XML from an existing Nokogiri document, make sure you call '.to_s' on it before passing it to the XML function.
For example:
xml = Nokogiri::XML(existing_nokogiri_xml_doc.to_s)

Related

How can I keep CDATA while updating the content of an xml node?

In a Ruby script, I want to update the CDATA content while keeping the format as CDATA.
doc = Nokogiri::XML(File.open('test.xml'))
doc.xpath('//Test').each do |test|
  test.content = 'new string'
end
This is my test.xml file
<?xml version="1.0" encoding="UTF-8"?>
<Test><![CDATA[<p>Some content</p>]]></Test>
The problem is that in my doc the CDATA is converted to Text. Is there any way I can keep the CDATA section?
Thanks
Calling Nokogiri::XML::Element#content= replaces the content of Test with a plain text node (and destroys whatever content used to be there). You need to find the CDATA node and call content= on that instead. For example:
doc.xpath('//Test').each do |test|
  cdata = test.children.find(&:cdata?)
  cdata.content = 'new string' if cdata
end
(It would be more straightforward if you could tell XPath to directly select the CDATA node, but I don't know that it can do that.)

Removing XML tags when parsing XML

Using Ruby with Nokogiri is there an easy way to remove tags around returned results? I can't find one in the docs.
Example from the Nokogiri site:
characters[0].to_s # => "<character>Al Bundy</character>"
I was hoping to get:
Al Bundy
Try using the text method:
characters[0].text
You can use the .inner_html method. Here is an example you can use from a basic xml sitemap:
parse_content.css("url").each do |x|
  location = x.css("loc").inner_html
  last_mod = x.css("lastmod").inner_html
end
You can read about sitemaps here: https://www.sitemaps.org/protocol.html

Why can't REXML parse CDATA preceded by a line break?

I'm very new to Ruby, and trying to parse an XML document with REXML that has been previously pretty-printed (by REXML) with some slightly erratic results.
Some CDATA sections have a line break after the opening XML tag but before the opening of the CDATA block. In these cases REXML parses the text of the tag as empty.
Any idea if I can get REXML to read these lines?
If not, could I rewrite them beforehand with a regex or something?
Is this even valid XML?
Here's an example XML document (much abridged):
<?xml version="1.0" encoding="utf-8"?>
<root-tag>
<content type="base64"><![CDATA[V2VsbCBkb25lISBJdCB3b3JrcyA6KQ==]]></content>
<content type="base64">
<![CDATA[VGhpcyB3b250IHdvcms=]]></content>
<content><![CDATA[This will work]]></content>
<content>
<![CDATA[This will not appear]]></content>
<content>
Seems happy</content>
<content>Obviously no problem</content>
</root-tag>
and here's my Ruby script (distilled down to a minimal example):
require 'rexml/document'
require 'base64'
include REXML
module RexmlSpike
  file = File.new("ex.xml")
  doc = Document.new file
  doc.elements.each("root-tag/content") do |contentElement|
    if contentElement.attributes["type"] == "base64"
      puts "decoded: " << Base64.decode64(contentElement.text)
    else
      puts "raw: " << contentElement.text
    end
  end
  puts "Finished."
end
The output I get is:
>> ruby spike.rb
decoded: Well done! It works :)
decoded:
raw: This will work
raw:
raw:
Seems happy
raw: Obviously no problem
Finished.
I'm using Ruby 1.9.3p392 on OSX Lion. The object of the exercise is ultimately to parse comments from some BlogML into the custom import XML used by Disqus.
Why
Anything that precedes the <![CDATA[]]> section hides it: a letter, a newline (as you've discovered), or even a single space. This makes sense, because your example asks for the text of the element, which is its first text node, and whitespace counts as text. In the examples where you can read the <![CDATA[]]>, it is because no other text comes before it.
Solution
If you look at the documentation for Element, you'll see that it has a method called cdatas(), described as:
Get an array of all CData children. IMMUTABLE.
So, in your example, if you do an inner loop on contentElement.cdatas() you would see the content of all your missing tags.
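A rough sketch of that inner loop, using only REXML. The helper names element_payload and content_payloads, and the shortened sample document, are made up for this example. It prefers the CDATA children when there are any and falls back to the stripped element text otherwise.

```ruby
require 'rexml/document'

# Returns the CDATA content of an element, or its stripped plain text when
# the element has no CDATA children.
def element_payload(element)
  cdatas = element.cdatas
  return cdatas.map(&:value).join unless cdatas.empty?

  element.text.to_s.strip
end

# Collects the payload of every <content> element under <root-tag>.
def content_payloads(xml_text)
  doc = REXML::Document.new(xml_text)
  payloads = []
  doc.elements.each('root-tag/content') { |el| payloads << element_payload(el) }
  payloads
end

# Shortened version of the document from the question.
CDATA_SAMPLE = <<~XML
  <root-tag>
    <content><![CDATA[This will work]]></content>
    <content>
      <![CDATA[This will not appear]]></content>
    <content>
      Seems happy</content>
  </root-tag>
XML

puts content_payloads(CDATA_SAMPLE)
# This will work
# This will not appear
# Seems happy
```

For the base64 elements you would pass the returned string to Base64.decode64 exactly as in the original script.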
I'd recommend using Nokogiri, which is the de facto XML/HTML parser for Ruby. Using it to access the contents of the <content> tags, I get:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="utf-8"?>
<root-tag>
<content type="base64"><![CDATA[V2VsbCBkb25lISBJdCB3b3JrcyA6KQ==]]></content>
<content type="base64">
<![CDATA[VGhpcyB3b250IHdvcms=]]></content>
<content><![CDATA[This will work]]></content>
<content>
<![CDATA[This will not appear]]></content>
<content>
Seems happy</content>
<content>Obviously no problem</content>
</root-tag>
EOT
doc.search('content').each do |n|
  puts n.content
end
Which outputs:
V2VsbCBkb25lISBJdCB3b3JrcyA6KQ==
VGhpcyB3b250IHdvcms=
This will work
This will not appear
Seems happy
Obviously no problem
Your XML is valid, but it is not parsed the way you expect, as @lightswitch05 pointed out. You can check it with the W3C XML validator.
If you are using XML from the wild world web, it is a good idea to use Nokogiri, because it usually works as you think it should, not as it really should.
Side note: this is exactly why I avoid XML and use JSON instead. XML has a proper definition, but no one seems to follow it anyway.

XPath-REXML-Ruby: Selecting multiple siblings/ancestors/descendants

This is my first post here. I have just started working with Ruby and am using REXML for some XML handling. I present a small sample of my xml file here:
<record>
<header>
<identifier>oai:lcoa1.loc.gov:loc.gmd/g3195.ct000379</identifier>
<datestamp>2004-08-13T15:32:50Z</datestamp>
<setSpec>gmd</setSpec>
</header>
<metadata>
<titleInfo>
<title>Meet-konstige vertoning van de grote en merk-waardige zons-verduistering</title>
</titleInfo>
</metadata>
</record>
My objective is to match the last numerical value in the <identifier> tag against a list of values that I have in an array. I have achieved this with the following code snippet:
ids = XPath.match(xmldoc, "//identifier[text()='oai:lcoa1.loc.gov:loc.gmd/"+mapid+"']")
Having got a particular identifier that I wish to investigate, I now want to go back up to <header>, select its sibling <metadata>, and then select <titleInfo> to get the value in the <title> node for that particular identifier.
I have looked at the XPath tutorials and expressions and many of the related questions on this website as well and learnt about axes and the different concepts such as ancestor/following sibling etc. However, I am really confused and cannot figure this out easily.
I was wondering if I could get any help or if someone could point me towards an online resource "easy" to read.
Thank you.
UPDATE:
I have been trying various combinations of code such as:
idss = XPath.match(xmldoc, "//identifier[text()='oai:lcoa1.loc.gov:loc.gmd/"+mapid+"']/parent::header/following-sibling::metadata/child::mods/child::titleInfo/child::title")
The code compiles but does not output anything. I am wondering what I am doing so wrong.
Here's a way to accomplish it using XPath, then going up to the record, then XPath to get the title:
require 'rexml/document'
include REXML
xml=<<END
<record>
<header>
<identifier>oai:lcoa1.loc.gov:loc.gmd/g3195.ct000379</identifier>
<datestamp>2004-08-13T15:32:50Z</datestamp>
<setSpec>gmd</setSpec>
</header>
<metadata>
<titleInfo>
<title>Meet-konstige</title>
</titleInfo>
</metadata>
</record>
END
doc = Document.new(xml)
mapid = "ct000379"
text = "oai:lcoa1.loc.gov:loc.gmd/g3195.#{mapid}"
identifier_nodes = XPath.match(doc, "//identifier[text()='#{text}']")
record_node = identifier_nodes.first.parent.parent
record_node.elements['metadata/titleInfo/title'].text
# => "Meet-konstige"
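If you need to do this for a whole array of ids, the same steps can be wrapped in a helper that also copes with identifiers that are not in the file. This is only a sketch: the method name title_for, the RECORDS_XML constant, the <records> wrapper element, and the second record are invented for the example. It also assumes the identifier contains no single quote, since it is interpolated into the XPath string.

```ruby
require 'rexml/document'

# Finds the <identifier> with exactly this text, walks up to its <record>,
# and returns the text of metadata/titleInfo/title. Returns nil when the
# identifier or the title is missing.
def title_for(doc, identifier)
  node = REXML::XPath.first(doc, "//identifier[text()='#{identifier}']")
  return nil unless node

  record = node.parent.parent # <identifier> -> <header> -> <record>
  title = record.elements['metadata/titleInfo/title']
  title && title.text
end

RECORDS_XML = <<~XML
  <records>
    <record>
      <header>
        <identifier>oai:lcoa1.loc.gov:loc.gmd/g3195.ct000379</identifier>
      </header>
      <metadata>
        <titleInfo><title>Meet-konstige</title></titleInfo>
      </metadata>
    </record>
    <record>
      <header>
        <identifier>oai:example:second</identifier>
      </header>
      <metadata>
        <titleInfo><title>Second map</title></titleInfo>
      </metadata>
    </record>
  </records>
XML

doc = REXML::Document.new(RECORDS_XML)
ids = ['oai:lcoa1.loc.gov:loc.gmd/g3195.ct000379', 'oai:example:unknown']
ids.each { |id| puts "#{id}: #{title_for(doc, id).inspect}" }
```

Doing the upward step in Ruby with parent keeps the XPath expression short, which helps because REXML's XPath support can be picky with long axis chains.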

Parsing SEC Edgar XML file using Ruby into Nokogiri

I'm having problems parsing the SEC Edgar files
Here is an example of this file.
The end result is I want the stuff between <XML> and </XML> into a format I can access.
Here is my code so far that doesn't work:
scud = open("http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt")
full = scud.read
full.match(/<XML>(.*)<\/XML>/)
Ok, there are a couple of things wrong:
sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt is NOT XML, so Nokogiri will be of no use to you unless you strip off all the garbage from the top of the file, down to where the true XML starts, then trim off the trailing tags to keep the XML correct. So, you need to attack that problem first.
You don't say what you want from the file. Without that information we can't recommend a real solution. You need to take more time to define the question better.
Here's a quick piece of code to retrieve the page, strip the garbage, and parse the resulting content as XML:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::XML(
  URI.open('http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt').read.gsub(/\A.+<xml>\n/im, '').gsub(/<\/xml>.+/mi, '')
)
puts doc.at('//schemaVersion').text
# >> X0603
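The regex in the question most likely returns nothing because, in Ruby, "." does not match newlines unless the /m flag is set. Here is a sketch of an extraction step that captures the block instead of deleting everything around it. The method name extract_xml_block and the tiny SAMPLE_FILING text are invented stand-ins for illustration, not the real filing.

```ruby
# Extracts the body of the first <XML>...</XML> block from an EDGAR-style
# text filing. Returns nil when there is no such block.
def extract_xml_block(filing_text)
  # /m lets "." span newlines, /i accepts <xml> as well as <XML>,
  # and the lazy (.*?) stops at the first closing tag.
  match = filing_text.match(%r{<XML>\s*(.*?)\s*</XML>}mi)
  match && match[1]
end

# Invented stand-in for the downloaded filing text.
SAMPLE_FILING = <<~TEXT
  header line that is not XML
  <XML>
  <?xml version="1.0"?>
  <edgarSubmission>
    <schemaVersion>X0603</schemaVersion>
  </edgarSubmission>
  </XML>
  trailing line that is not XML
TEXT

puts extract_xml_block(SAMPLE_FILING)
```

The returned string starts at the XML declaration, so it can be passed straight to Nokogiri::XML or any other parser.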
I recommend practicing in IRB and reading the docs for Nokogiri
> require 'nokogiri'
=> true
> require 'open-uri'
=> true
> doc = Nokogiri::HTML(URI.open('http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt'))
> doc.xpath('//firstname')
=> [#<Nokogiri::XML::Element:0x80c18290 name="firstname" children=[#<Nokogiri::XML::Text:0x80c18010 "Joshua">]>, #<Nokogiri::XML::Element:0x80c14d48 name="firstname" children=[#<Nokogiri::XML::Text:0x80c14ac8 "Patrick">]>, #<Nokogiri::XML::Element:0x80c11fd0 name="firstname" children=[#<Nokogiri::XML::Text:0x80c11d50 "Brian">]>]
That should get you going.
Given that this was asked a year ago, the answer is probably overtaken by events (OBE), but what the asker should do is examine all of the documents on the site and notice that the actual filing details can be found at:
http://sec.gov/Archives/edgar/data/1475481/000147548109000001/0001475481-09-000001-index.htm
Within this, you will see that the XML document he is after is already parsed out, ready for further manipulation, at:
http://sec.gov/Archives/edgar/data/1475481/000147548109000001/primary_doc.xml
Be warned, however, the actual file name at the end is determined by the submitter of the document, not by the SEC. Therefore, you cannot depend on the document always being 'primary_doc.xml'.

Resources