Why are there blank nodes/attributes when using LibXML Ruby? - ruby

Using the Gem libxml-ruby, when we parse XML like so:
document = LibXML::XML::Parser.string( xmlData ).parse
for n in document.root.children
# Do something
end
What we actually get is something like this:
root
-node empty
-node with data
-node empty
Same thing with attributes, there's a blank one padding between those we actually care about. What we end up needing to use is :options => LibXML::XML::Parser::Options::NOBLANKS
Why? :(

(Not necessarily an answer, but need formatting.)
What does the XML look like?
This XML:
<baz>
<plugh>ohai</plugh>
</baz>
may contain whitespace text nodes for the CR/LF and indentation between the <baz> and <plugh> opening tags, and likewise between the closing tags. Whether that whitespace is significant depends on the nature of the XML. Structurally, it's different from:
<baz><plugh>ohai</plugh></baz>
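A minimal sketch of that structural difference. It uses REXML, which ships with Ruby, instead of libxml-ruby (a substitution made here only so the example needs no extra gem), and child_kinds is a hypothetical helper name. The point is that the indented document has extra text nodes around the element:

```ruby
require 'rexml/document'

# Hypothetical helper: lists the class of each child node of the root element.
def child_kinds(xml)
  REXML::Document.new(xml).root.children.map { |node| node.class.name }
end

indented = "<baz>\n  <plugh>ohai</plugh>\n</baz>"
compact  = "<baz><plugh>ohai</plugh></baz>"

p child_kinds(indented)  # text node, element, text node
p child_kinds(compact)   # just the element
```

In libxml-ruby, the NOBLANKS option mentioned in the question asks the parser to drop those whitespace-only text nodes for you.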

Related

Preserving whitespace / line breaks with REXML

I'm using Ruby 1.9.3 and REXML to parse an XML document, make a few changes (additions/subtractions), then re-output the file. Within this file is a block that looks like this:
<someElement>
some.namespace.something1=somevalue1
some.namespace.something2=somevalue2
some.namespace.something3=somevalue3
</someElement>
The problem is that after re-writing the file, this block always ends up looking like this:
<someElement>
some.namespace.something1=somevalue1
some.namespace.something2=somevalue2 some.namespace.something3=somevalue3
</someElement>
The newline after the second value (but never the first!) has been lost and turned into a space. Later, some other code which I have no control or influence over will read this file and depend on those newlines to parse the content properly. Generally in this situation I'd use a CDATA section to preserve the whitespace, but that isn't an option here, because the code that parses this data later is not expecting one. It's essential that the inner text of this element is preserved exactly as-is.
My read/write code looks like this:
xmlFile = File.open(myFile)
contents = xmlFile.read
xmlDoc = REXML::Document.new(contents, { :respect_whitespace => :all })
xmlFile.close
{perform some tasks}
out = ""
xmlDoc.write(out, 2)
File.open(filePath, "w"){|file| file.puts(out)}
I'm looking for a way to preserve the whitespace of text between elements when reading/writing a file in this manner using REXML. I've read a number of other questions here on stackoverflow on this subject, but none that quite replicate this scenario. Any ideas or suggestions are welcome.
I get correct behavior by removing the indent (second) parameter to Document.write():
#xmlDoc.write(out, 2)
xmlDoc.write(out)
That seems like a bug in Document.write() according to my reading of the docs, but if you don't really need to set the indentation, then leaving that argument off should solve your problem.
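Here is a small round-trip sketch of that suggestion. It assumes an in-memory string instead of the file from the question, and round_trip is a hypothetical helper name. The document is parsed with :respect_whitespace => :all and written back without the indent argument, so the text between the tags should come out with its newlines intact:

```ruby
require 'rexml/document'

# Hypothetical helper: parse +xml+ and serialize it again with no indent
# argument, so REXML's default (non-pretty-printing) formatter is used.
def round_trip(xml)
  doc = REXML::Document.new(xml, { :respect_whitespace => :all })
  out = +""
  doc.write(out)
  out
end

source = "<someElement>\n" \
         "some.namespace.something1=somevalue1\n" \
         "some.namespace.something2=somevalue2\n" \
         "some.namespace.something3=somevalue3\n" \
         "</someElement>"

puts round_trip(source)
```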

Ruby Regex: Return just the match

When I do
puts /<title>(.*?)<\/title>/.match(html)
I get
<h2>foobar</h2>
But I want just
foobar
What's the most elegant method for doing so?
The most elegant way would be to parse HTML with an HTML parser:
require 'nokogiri'
html = '<title><h2>Pancakes</h2></title>'
doc = Nokogiri::HTML(html)
title = doc.at('title').text
# title is now 'Pancakes'
If you try to do this with a regular expression, you will probably fail. For example, if you have an <h2> in your <title> what's to prevent you from having something like this:
<title><strong>Where</strong> is <span>pancakes</span> <em>house?</em></title>
Trying to handle something like that with a single regex is going to be ugly but doc.at('title').text handles that as easily as it handles <title>Pancakes</title> or <title><h2>Pancakes</h2></title>.
Regular expressions are great tools but they shouldn't be the only tool in your toolbox.
Something of this style will return just the contents of the match.
html[/<title>(.*?)<\/title>/,1]
Maybe you need to tell us more, like what html might contain, but right now, you are capturing the contents of the title block, irrespective of the internal tags. I think that is the way you should do it, rather than assuming that there is an internal tag you want to handle, especially because what would happen if you had two internal tags? This is why everyone is telling you to use an html parser, which you really should do.
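A short sketch of the String#[] form, wrapped in a hypothetical helper called title_of. The /m flag is an addition not in the answer above; it lets the title span several lines. The second argument selects the first capture group, so only the group's text comes back, and nil is returned when nothing matches:

```ruby
# Hypothetical helper: returns the text of the first capture group,
# or nil when the pattern does not match.
def title_of(html)
  html[/<title>(.*?)<\/title>/m, 1]
end

puts title_of("<html><head><title>foobar</title></head></html>")  # prints foobar
puts title_of("<title><h2>foobar</h2></title>")                   # prints <h2>foobar</h2>
p title_of("<p>no title here</p>")                                 # prints nil
```

The second call shows why the parser approach is still worth it: the regex returns whatever markup sits inside the title, inner tags included.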

How do I count a sub string using a regex in ruby?

I have a very large XML file which I load as a string.
My XML looks like this:
<publication ID="7728" contentstatus="Unchanged" idID="0b000064800e9e39">
<volume contentstatus="Unchanged" idID="0b0000648151c35d">
<article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
</volume>
I want to count the number of occurrences of the string
article ID="5705641" contentstatus="Changed"
how can I convert the ID to a regex
Here is what I have tried doing
searchstr = 'article ID=\"/[1-9]{7}/\" contentstatus=\"Changed\"'
count = ((xml.scan(searchstr).length)).to_s
puts count
Please let me know how can I achieve this?
Thanks
I'm going to go out on a limb and guess that you're new to Ruby. First, it's not necessary to convert count into a string to puts it. Puts automatically calls to_s on anything you send to it.
Second, it's rarely a good idea to handle XML with string manipulation. I would strongly advise that you use a full-fledged XML parser such as Nokogiri.
That said, you can't embed a regex in a string like that. The entire query string would need to be a regex.
Something like
/article ID="[1-9]{7}" contentstatus="Changed"/
Quotation marks aren't special characters in a regex, so you don't need to escape them.
When in doubt about regex in Ruby, I recommend checking out Rubular.com.
And once again, I can't emphasize enough that I really don't condone trying to manipulate XML via regex. Nokogiri will make dealing with XML a billion times easier and more reliable.
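If you do go the regex route, a minimal sketch of the count looks like this. The helper name changed_article_count and the sample data are made up for illustration. Note one deliberate change: it uses \d{7} rather than [1-9]{7}, because [1-9] excludes 0 and the ID in the question (5705641) contains a zero:

```ruby
# Hypothetical helper: counts article tags with a 7-digit ID
# that are marked contentstatus="Changed".
def changed_article_count(xml)
  xml.scan(/article ID="\d{7}" contentstatus="Changed"/).length
end

xml = <<~XML
  <volume>
    <article ID="5756261" contentstatus="Changed" doi="x"/>
    <article ID="5705641" contentstatus="Changed" doi="y"/>
    <article ID="5756262" contentstatus="Unchanged" doi="z"/>
  </volume>
XML

puts changed_article_count(xml)  # prints 2
```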
If XPath is an option, it is a preferred way of selecting XML elements. You can use the selector:
//article[@contentstatus="Changed"]
Or, if possible:
count(//article[@contentstatus="Changed"])
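A sketch of that selector in Ruby, assuming it is meant to test the contentstatus attribute (XPath marks attribute tests with @). It uses REXML's XPath support from the standard distribution rather than a particular gem, and changed_articles is a hypothetical helper name:

```ruby
require 'rexml/document'

# Hypothetical helper: returns the article elements whose
# contentstatus attribute equals "Changed".
def changed_articles(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//article[@contentstatus="Changed"]')
end

xml = <<~XML
  <publication ID="7728" contentstatus="Unchanged">
    <volume contentstatus="Unchanged">
      <article ID="5756261" contentstatus="Changed"/>
      <article ID="5756262" contentstatus="Unchanged"/>
      <article ID="5756263" contentstatus="Changed"/>
    </volume>
  </publication>
XML

matches = changed_articles(xml)
puts matches.size
puts matches.map { |node| node.attributes['ID'] }
```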
Nokogiri is my recommended Ruby XML parser. It's very robust, and is probably the standard for the language now.
I added two more "articles" to show how easily you can find and manipulate the contents, without having to rely on a regex.
require 'nokogiri'
xml =<<EOT
<publication ID="7728" contentstatus="Unchanged" idID="0b000064800e9e39">
<volume contentstatus="Unchanged" idID="0b0000648151c35d">
<article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
<article ID="5756262" contentstatus="Unchanged" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
<article ID="5756263" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
</volume>
EOT
doc = Nokogiri::XML(xml)
puts doc.search('//article[@contentstatus="Changed"]').size.to_s + ' found'
puts doc.search('//article[@contentstatus="Changed"]').map{ |n| "#{ n['ID'] } #{ n['doi'] } #{ n['idID'] }" }
>> 2 found
>> 5756261 10.1109/TNB.2011.2145270 0b0000648151d8ca
>> 5756263 10.1109/TNB.2011.2145270 0b0000648151d8ca
The problem with using regex on HTML or XML is that the patterns break really easily if the XML changes, or if your XML comes from different sources or is malformed. Regex was never designed to handle that sort of problem, but a parser was. You could have XML with line ends after every tag, or none at all, and the parser won't really care as long as the XML is well-formed. A good parser like Nokogiri can even do fixups if the XML is broken, in order to try to make sense of it.
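Your current string is almost right. Remove the errant / from around the numbers and write the pattern as a regex literal rather than a quoted string, because scan treats a String argument as literal text:
searchstr = /article ID="[1-9]{7}" contentstatus="Changed"/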

How can I make empty tags self-closing with Nokogiri?

I've created an XML template in ERB. I fill it in with data from a database during an export process.
In some cases, there is a null value, in which case an element may be empty, like this:
<someitem>
</someitem>
In that case, the client receiving the export wants it to be converted into a self-closing tag:
<someitem/>
I'm trying to see how to get Nokogiri to do this, but I don't see it yet. Does anybody know how to make empty XML tags self-closing with Nokogiri?
Update
A regex was sufficient to do what I specified above, but the client now also wants tags whose children are all empty to be self-closing. So this:
<someitem>
<subitem>
</subitem>
<subitem>
</subitem>
</someitem>
... should also be
<someitem/>
I think that this will require using Nokogiri.
Search for
<([^>]+)>\s*</\1>
and replace with
<\1/>
In Ruby:
result = subject.gsub(/<([^>]+)>\s*<\/\1>/, '<\1/>')
Explanation:
< # Match opening bracket
( # Match and remember...
[^>]+ # One or more characters except >
) # End of capturing group
> # Match closing bracket
\s* # Match optional whitespace & newlines
< # Match opening bracket
/ # Match /
\1 # Match the contents of the opening tag
> # Match closing bracket
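A quick sketch of that substitution in action, using a hypothetical helper named self_close_empty. It also shows a limitation: because \1 must repeat everything between < and >, an opening tag that carries attributes is not matched.

```ruby
# Hypothetical helper: rewrites elements whose content is only
# whitespace as self-closing tags.
def self_close_empty(xml)
  xml.gsub(/<([^>]+)>\s*<\/\1>/, '<\1/>')
end

puts self_close_empty("<someitem>\n</someitem>")       # prints <someitem/>
puts self_close_empty("<a><b>text</b><c>  </c></a>")   # prints <a><b>text</b><c/></a>
puts self_close_empty('<item id="1"></item>')          # unchanged: the tag has attributes
```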
A couple questions:
<foo></foo> is the same as <foo />, so why worry about such a tiny detail? If it is syntactically significant because the text node between the two is a "\n", then put a test in your ERB template that checks for the value that would go there, and if it's not initialized output the self-closing tag instead? See "Yak shaving".
Why involve Nokogiri? You should be able to generate correct XML in ERB since you're in control of the template.
EDIT - Nokogiri's behavior is to not-rewrite parsed XML unless it has to. I suspect you'd have to remove the node in question, then reinsert it as an empty node to get Nokogiri to output what you want.
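For the updated requirement (collapse a tag when all of its children are empty), here is a rough sketch of the "empty the node and let the serializer self-close it" idea. It uses REXML from Ruby's standard distribution rather than Nokogiri, which is a substitution on my part, and the helper names are hypothetical. It also assumes attributes do not count as content, which the question does not specify:

```ruby
require 'rexml/document'

# True when +el+ has no non-whitespace text and every child element
# is itself blank. Attributes are ignored (an assumption).
def blank_element?(el)
  el.texts.all? { |t| t.value.strip.empty? } &&
    el.elements.to_a.all? { |child| blank_element?(child) }
end

# Removes all children of blank elements; REXML writes an element
# with no children as <name/>.
def collapse_blank!(el)
  if blank_element?(el)
    el.children.dup.each { |child| el.delete(child) }
  else
    el.elements.each { |child| collapse_blank!(child) }
  end
  el
end

# Parses +xml+, collapses blank elements, and returns the root as a string.
def collapse_blank_xml(xml)
  doc = REXML::Document.new(xml)
  collapse_blank!(doc.root)
  doc.root.to_s
end

puts collapse_blank_xml("<someitem>\n<subitem>\n</subitem>\n<subitem>\n</subitem>\n</someitem>")
```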

Ruby - Writing Hpricot data to a file

I am currently doing some XML parsing and I've chosen to use Hpricot because of its ease of use and syntax. However, I am running into some problems. I need to write a piece of XML data that I have found out to another file, but when I do this the format is not preserved. For example, if the content should look like this:
<dict>
<key>item1</key><value>12345</value>
<key>item2</key><value>67890</value>
<key>item3</key><value>23456</value>
</dict>
And assuming that there are many entries like this in the document. I am iterating through the 'dict' items by using
hpricot_element = Hpricot(xml_document_body)
f = File.new('some_new_file.xml', 'w')
(hpricot_element/:dict).each { |dict| f.write( dict.to_original_html ) }
After using the above code, I would expect the output to look exactly like the XML shown above. However, to my surprise, the output of the file looks more like this:
<dict>\n", " <key>item1</key><value>12345</value>\n", " <key>item2</key><value>67890</value>\n", " <key>item3</key><value>23456</value\n", " </dict>
I've tried splitting at the "\n" characters and writing to the file one line at a time, but that didn't seem to work either, as it did not recognize the "\n" characters. Any help is greatly appreciated. It might be a very simple solution, but I am having trouble finding it. Thanks!
hpricot_element = Hpricot::XML(xml_document_body)
File.open('some_new_file.xml', 'w') {|f| f.write xml_document_body }
Don't use an XML parser if you just want the original XML to be written; it is unnecessary. You should still use one if you want to process the data further, though.
Also, for XML, you should be using Hpricot::XML instead of just Hpricot.
My solution was to just replace the literal '\n' characters with line breaks and remove the extra punctuation by simply adding two gsubs that looked like the following:
f.write( dict.to_original_html.gsub('\n', "\n").gsub('" ,"', '') )
I don't know why I didn't see this before. Like I said, it might be an easy answer that I wasn't seeing and that's exactly how it turned out. Thanks for all the answers!
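A guess at what was going on, since the thread never shows what to_original_html actually returned: the stray quotes, commas and literal \n are what you get on Ruby 1.9 and later when an Array of lines is written directly, because the array is converted with its inspect-style to_s. If that assumption holds, joining the array before writing avoids the gsub cleanup. The sketch below is plain Ruby with made-up data and hypothetical helper names, and it writes to a StringIO instead of a real file:

```ruby
require 'stringio'

# Writes the Array itself; IO#write converts it with Array#to_s,
# which includes brackets, quotes, commas and escaped newlines.
def write_raw(lines)
  io = StringIO.new
  io.write(lines)
  io.string
end

# Joins the lines first, so only the original text is written.
def write_joined(lines)
  io = StringIO.new
  io.write(lines.join)
  io.string
end

lines = ["<dict>\n", "  <key>item1</key><value>12345</value>\n", "</dict>\n"]

puts write_raw(lines)
puts write_joined(lines)
```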
