Minify XML with Ruby - ruby

Given an XML string:
xml = "<org><people> <person>Joe Shmoe</person> <person>Bo Bob</person>
<person>New Guy</person> </people><other><![CDATA[ This string might
have tags < > < > and stuff, don't touch this ]]></other></org>"
How can I get rid of newlines and spaces between the tags, without affecting tag text, CDATA, etc?
Result should be:
xml = "<org><people><person>Joe Shmoe</person><person>Bo Bob</person><person>New Guy</person></people><other><![CDATA[ This string might
have tags < > < > and stuff, don't touch this ]]></other></org>"
UPDATE:
This is what I've come up with so far- I just can't figure out how to have it ignore CDATA content...
xml.gsub(/>\s+</,"><")
Also, would much rather use an XML parser for this, as from what I hear regexing XML is a bad thing.

Yes! What you want is canonicalization!
http://xml4r.github.io/libxml-ruby/rdoc/classes/LibXML/XML/Document.html#method-i-canonicalize
The LibXML-Ruby gem can do this. Since the docs are shitty and don't even say what the method does, here is the spec:
http://www.w3.org/TR/xml-c14n
This is used a lot in XML signing.
And yes! Using regular expressions on XML is bad.
BTW you can also print your xml object as a string, and set indentation:
http://xml4r.github.io/libxml-ruby/rdoc/classes/LibXML/XML/Document.html#method-i-to_s
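If no parser is at hand, the regex from the question can be made to skip CDATA by splitting the string on CDATA sections first and only touching the pieces in between. This is a rough sketch rather than a robust solution: the helper name minify_xml is made up for illustration, and comments or processing instructions get no special treatment.

```ruby
# Hypothetical helper (not part of any library): remove whitespace between
# tags while leaving CDATA sections exactly as they are.
def minify_xml(xml)
  # The capture group makes split keep the CDATA sections in the result.
  parts = xml.split(/(<!\[CDATA\[.*?\]\]>)/m)
  parts.map { |part|
    part.start_with?("<![CDATA[") ? part : part.gsub(/>\s+</, "><")
  }.join
end

sample = "<a> <b>x</b>\n<c><![CDATA[ keep > < this ]]></c> </a>"
puts minify_xml(sample)
```

Whitespace that sits directly next to a CDATA section is left alone, since it may be part of the element's text.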

Related

Add unescaped entities to document with Nokogiri [duplicate]

I would like to add things like bullet points "•" to HTML using the XML Builder in Nokogiri, but everything is being escaped. How do I prevent it from being escaped?
I would like the result to be:
<span>•</span>
rather than:
<span>&#8226;</span>
I'm just doing this:
xml.span {
xml.text "•\ "
}
What am I missing?
If you define
class Nokogiri::XML::Builder
  def entity(code)
    doc = Nokogiri::XML("<?xml version='1.0'?><root>&##{code};</root>")
    insert(doc.root.children.first)
  end
end
then this
builder = Nokogiri::XML::Builder.new do |xml|
  xml.span {
    xml.text "I can has "
    xml.entity 8226
    xml.text " entity?"
  }
end
puts builder.to_xml
yields
<?xml version="1.0"?>
<span>I can has • entity?</span>
PS: this is only a workaround. For a clean solution, please refer to the libxml2 documentation (Nokogiri is built on libxml2). However, even these folks admit that handling entities can be quite, err, cumbersome sometimes.
When you're setting the text of an element, you really are setting text, not HTML source. < and & don't have any special meaning in plain text.
So just type a bullet: '•'. Of course, your source code and your XML file have to use the same encoding for that to come out right. If your XML file is UTF-8 but your source code isn't, you'd probably have to write "\xe2\x80\xa2", the UTF-8 byte sequence for the bullet character, as a double-quoted string literal (Ruby does not interpret \x escapes inside single quotes).
(In general non-ASCII characters in Ruby 1.8 are tricky. The byte-based interfaces don't mesh too well with XML's world of all-text-is-Unicode.)
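As a small illustration of the encoding point, here are several spellings of the same bullet character. This sketch assumes Ruby 1.9 or later with a UTF-8 source file; the variable names are only for illustration.

```ruby
# Four ways to produce the bullet character U+2022 (decimal 8226).
bullet_literal   = "•"                # typed directly; source must be UTF-8
bullet_escape    = "\u2022"           # Unicode escape, double quotes required
bullet_codepoint = [8226].pack("U")   # from the decimal code point in &#8226;
bullet_bytes     = "\xE2\x80\xA2".dup.force_encoding("UTF-8")  # raw UTF-8 bytes

puts bullet_escape == bullet_literal
puts bullet_codepoint == bullet_literal
puts bullet_bytes == bullet_literal
puts bullet_literal.bytes.inspect
```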

Preserving whitespace / line breaks with REXML

I'm using Ruby 1.9.3 and REXML to parse an XML document, make a few changes (additions/subtractions), then re-output the file. Within this file is a block that looks like this:
<someElement>
some.namespace.something1=somevalue1
some.namespace.something2=somevalue2
some.namespace.something3=somevalue3
</someElement>
The problem is that after re-writing the file, this block always ends up looking like this:
<someElement>
some.namespace.something1=somevalue1
some.namespace.something2=somevalue2 some.namespace.something3=somevalue3
</someElement>
The newline after the second value (but never the first!) has been lost and turned into a space. Later, some other code that I have no control or influence over will read this file and depend on those newlines to parse the content properly. Generally in this situation I'd use a CDATA section to preserve the whitespace, but that isn't an option because the code that parses this data later is not expecting one. It's essential that the inner text of this element is preserved exactly as-is.
My read/write code looks like this:
xmlFile = File.open(myFile)
contents = xmlFile.read
xmlDoc = REXML::Document.new(contents, { :respect_whitespace => :all })
xmlFile.close
{perform some tasks}
out = ""
xmlDoc.write(out, 2)
File.open(filePath, "w"){|file| file.puts(out)}
I'm looking for a way to preserve the whitespace of text between elements when reading/writing a file in this manner using REXML. I've read a number of other questions here on stackoverflow on this subject, but none that quite replicate this scenario. Any ideas or suggestions are welcome.
I get correct behavior by removing the indent (second) parameter to Document.write():
#xmlDoc.write(out, 2)
xmlDoc.write(out)
That seems like a bug in Document.write() according to my reading of the docs, but if you don't really need to set the indentation, then leaving it off should solve your problem.
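A minimal round trip showing the suggestion, assuming the rexml library is available (it ships with standard Ruby installs). The sample document is made up to mirror the question's element.

```ruby
require 'rexml/document'

# Made-up sample mirroring the question: line-oriented text inside an element.
source = "<someElement>\n" \
         "some.namespace.something1=somevalue1\n" \
         "some.namespace.something2=somevalue2\n" \
         "some.namespace.something3=somevalue3\n" \
         "</someElement>"

doc = REXML::Document.new(source, respect_whitespace: :all)

out = +""
doc.write(out)   # no indent argument, so no pretty-printing is applied
puts out
```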

How to use Regular Expression to insert text in between text?

I have a unique scenario. There is a web application that acts as a simulator: it sends data as XML, gets XML back, and verifies a few details in the response.
The XML I am sending has a lot of details. I need to insert into it a parameter that I have defined in my test, but I can't figure out how to put the parameter's value into the XML before sending it.
The XML structure looks like this:
id='12345'><version>1.3.4<</version><accno>1234567890</accno>add<address details</> ..........
In this XML structure I have parameterized <accno>1234567890</accno>, meaning that at the beginning of the script I declare accno='1234567890'.
Now I want to use accno as a parameter in the XML instead of the hard-coded value. Please suggest how to do this.
XML is not regular, but context-free. Use a proper parser like Nokogiri instead of regex. See RegEx match open tags except XHTML self-contained tags.
Posting as an answer, as requested.
Editing XML with a regex is a bad idea,
but to answer the direct question: use gsub, e.g.
str.gsub(/reg_match/, newstring)
A better way of doing it would be to use Hpricot.
Or you can use Ruby templates:
require 'erb'
require 'ostruct'
data = {:accno => "1234567890"}
variables = OpenStruct.new(data)
template = "<id='12345'><version>1.3.4</version><accno><%= accno%></accno>"
res = ERB.new(template).result(variables.instance_eval { binding })
puts res
First identify the pattern, then replace it using gsub! (note: no space before the parenthesis):
xml_data.gsub!(pattern, replacement)
http://ruby-doc.org/docs/ProgrammingRuby/html/ref_c_string.html#String.gsub_oh
The fast way to do it is with gsub (like Rajkaran says). The right way to do it is rexml or some other xml library. Investment should be related to how much you will use this kind of thing in the future.
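For the quick gsub route, a sketch along these lines might do. The template string here is a made-up stand-in for the real payload, and the pattern assumes a single, non-nested accno element.

```ruby
accno = '1234567890'   # declared at the start of the test script

# Made-up stand-in for the real XML payload.
xml_data = "<version>1.3.4</version><accno>PLACEHOLDER</accno>"

# Replace whatever sits between the accno tags with the parameter.
# The non-greedy .*? keeps the match inside one element, and the block form
# avoids backslash sequences in the value being read as backreferences.
result = xml_data.gsub(%r{<accno>.*?</accno>}) { "<accno>#{accno}</accno>" }
puts result
```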

How do I count a sub string using a regex in ruby?

I have a very large XML file which I load as a string.
My XML looks like this:
<publication ID="7728" contentstatus="Unchanged" idID="0b000064800e9e39">
<volume contentstatus="Unchanged" idID="0b0000648151c35d">
<article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
</volume>
I want to count the number of occurrences of the string
article ID="5705641" contentstatus="Changed"
How can I turn the ID into a regex?
Here is what I have tried doing
searchstr = 'article ID=\"/[1-9]{7}/\" contentstatus=\"Changed\"'
count = ((xml.scan(searchstr).length)).to_s
puts count
Please let me know how can I achieve this?
Thanks
I'm going to go out on a limb and guess that you're new to Ruby. First, it's not necessary to convert count into a string to puts it. Puts automatically calls to_s on anything you send to it.
Second, it's rarely a good idea to handle XML with string manipulation. I would strongly advise that you use a full fledged XML parser such as Nokogiri.
That said, you can't embed a regex in a string like that. The entire query string would need to be a regex.
Something like
/article ID="[1-9]{7}" contentstatus="Changed"/
Quotation marks aren't special characters in a regex, so you don't need to escape them.
When in doubt about regex in Ruby, I recommend checking out Rubular.com.
And once again, I can't emphasize enough that I really don't condone trying to manipulate XML via regex. Nokogiri will make dealing with XML a billion times easier and more reliable.
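For completeness, here is how the regex version of the count might look. The sample data is trimmed down from the question, and \d is swapped in for [1-9] as an assumption, because an ID such as 5705641 contains zeros and [1-9]{7} would skip it.

```ruby
# Trimmed-down sample based on the question's XML.
xml = <<~XML
  <article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270"/>
  <article ID="5756262" contentstatus="Unchanged" doi="10.1109/TNB.2011.2145270"/>
  <article ID="5705641" contentstatus="Changed" doi="10.1109/TNB.2011.2145270"/>
XML

# The whole pattern is a regex literal; \d{7} matches any seven digits.
pattern = /article ID="\d{7}" contentstatus="Changed"/
count = xml.scan(pattern).length
puts count
```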
If XPath is an option, it is a preferred way of selecting XML elements. You can use the selector:
//article[@contentstatus="Changed"]
Or, if possible:
count(//article[@contentstatus="Changed"])
Nokogiri is my recommended Ruby XML parser. It's very robust, and is probably the standard for the language now.
I added two more "articles" to show how easily you can find and manipulate the contents, without having to rely on a regex.
require 'nokogiri'
xml =<<EOT
<publication ID="7728" contentstatus="Unchanged" idID="0b000064800e9e39">
<volume contentstatus="Unchanged" idID="0b0000648151c35d">
<article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
<article ID="5756262" contentstatus="Unchanged" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
<article ID="5756263" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
</volume>
EOT
doc = Nokogiri::XML(xml)
puts doc.search('//article[@contentstatus="Changed"]').size.to_s + ' found'
puts doc.search('//article[@contentstatus="Changed"]').map{ |n| "#{ n['ID'] } #{ n['doi'] } #{ n['idID'] }" }
>> 2 found
>> 5756261 10.1109/TNB.2011.2145270 0b0000648151d8ca
>> 5756263 10.1109/TNB.2011.2145270 0b0000648151d8ca
The problem with using a regex on HTML or XML is that it will break really easily if the XML changes, or if your XML comes from different sources or is malformed. Regex was never designed to handle that sort of problem, but a parser was. You could have XML with line ends after every tag, or none at all, and the parser won't really care as long as the XML is well-formed. A good parser like Nokogiri can even do fixups if the XML is broken, in order to try to make sense of it.
Your current string is almost there: remove the errant / from around the numbers and turn the whole thing into a regex, because scan treats a plain string as literal text:
searchstr = /article ID="[1-9]{7}" contentstatus="Changed"/

xml tag with a dot in haml

I have a tag that contains a dot (.) that I want haml to preserve:
Haml:
%text
  %text.resource
    ...
I would like Haml to expand to:
<text>
  <text.resource>...
  </text.resource>
</text>
but it keeps doing:
<text>
  <text class="resource">...
  </text>
</text>
Is there any easy way to "escape" "class" expansion in Haml?
HAML is made to generate HTML of various forms, but you can trick it to generate other things by being creative. Putting in what you want to get back out:
<text>
<text.resource>...
</text.resource>
</text>
will work, because if HAML sees a line that doesn't start with one of its reserved characters it'll output it as is. You can't indent though, or it will get mad.
From the docs:
Note that HTML tags are passed through unmodified as well. If you have some HTML you don’t want to convert to Haml, or you’re converting a file line-by-line, you can just include it as-is. For example:
%p
  <div id="blah">Blah!</div>
is compiled to:
<p>
  <div id="blah">Blah!</div>
</p>
You could do:
<text>
= " <text.resource>..."
= " </text.resource>"
</text>
if you insist on indentation:
>> <text>
>> <text.resource>...
>> </text.resource>
>> </text>
EDIT:
The OP says:
the problem I have is that the elypsis (...) means that I have to add more haml code there (a bunch of xml tags that would be "children" of and therefore I need to "indent" the lines after the comments...
XML doesn't care about indentation; indentation is a for-human-eyes-only aesthetic. I'd worry more about being functionally and syntactically correct. If you absolutely have to have "pretty" XML, then consider running the HAML output through xmllint, or tidy with the XML flags set.
Or, abandon HAML because you're starting to abuse it, and use something like ERB and/or Erubis which is more free form and less caring about syntax, or go old-school and generate the XML via print and puts statements. If you insist on using HAML and having your indentation, then I'd suggest consulting with the HAML developers and see if they have a recommendation. There might be a HAML filter that would be of use, or some other way of forcing the indentation level inline.
My advice, as someone who's been doing this a long time and been there too many times, is this: we, as software developers, can lose sight of the end goal of being functional and spin off into some yak-shaving exercise, worrying about minutiae that don't accomplish anything real. Unless it's a specification that every indenting space is sacred, I'd worry more about getting correct XML and move on, then later return to it and see if it can be tweaked to perfection.
