Add unescaped entities to document with Nokogiri [duplicate] - ruby

I would like to add things like bullet points "•" to HTML using the XML Builder in Nokogiri, but everything is being escaped. How do I prevent it from being escaped?
I would like the result to be:
<span>•</span>
rather than:
<span>&#8226;</span>
I'm just doing this:
xml.span {
xml.text "•\ "
}
What am I missing?

If you define
class Nokogiri::XML::Builder
def entity(code)
doc = Nokogiri::XML("<?xml version='1.0'?><root>&##{code};</root>")
insert(doc.root.children.first)
end
end
then this
builder = Nokogiri::XML::Builder.new do |xml|
xml.span {
xml.text "I can has "
xml.entity 8665
xml.text " entity?"
}
end
puts builder.to_xml
yields
<?xml version="1.0"?>
<span>I can has • entity?</span>
PS this a workaround only, for a clean solution please refer to the libxml2 documentation (Nokogiri is built on libxml2) for more help. However, even these folks admit that handling entities can be quite ..err, cumbersome sometimes.

When you're setting the text of an element, you really are setting text, not HTML source. < and & don't have any special meaning in plain text.
So just type a bullet: '•'. Of course your source code and your XML file will have to be using the same encoding for that to come out right. If your XML file is UTF-8 but your source code isn't, you'd probably have to say '\xe2\x80\xa2' which is the UTF-8 byte sequence for the bullet character as a string literal.
(In general non-ASCII characters in Ruby 1.8 are tricky. The byte-based interfaces don't mesh too well with XML's world of all-text-is-Unicode.)

Related

ruby builder: can't generate xml tags like <fu-ba:r>

Hy Folks
I got the Problem that i have to create an xml in ruby with builder, running on a sinatra server.
The Xml is filled with xml tags like this one:
<fu-ba:r test="test1" source="h1">
somthing
</fu-ba:r>
now i don't know how to get builder to create a tag like this one (the attributes are no Problem).
i Tried:
xml.fu-ba:r(......)
xml."fu-ba:r"(.......)
xml. << "fu-ba:r"(......)
Every idea or solution would help a lot, thanks Folks
Ruby identifiers are consist of alphabets, decimal digits, and the
underscore character, and begin with a alphabets(including
underscore). There are no restrictions on the lengths of Ruby
identifiers.
Since ruby identifiers don't allow the use of special characters builder has a method called tag! for this very purpose.
For example
x.tag!("fu-ba:r") {
x.text! "something"
}
Outputs
# <fu-ba:r>
# something
# </fu-ba:r>

Minify XML with Ruby

Given an XML string:
xml = "<org><people> <person>Joe Shmoe</person> <person>Bo Bob</person>
<person>New Guy</person> </people><other><![CDATA[ This string might
have tags < > < > and stuff, don't touch this ]]></other></org>"
How can I get rid of newlines and spaces between the tags, without affecting tag text, CDATA, etc?
Result should be:
xml = "<org><people><person>Joe Shmoe</person><person>Bo Bob</person><person>New Guy</person></people><other><![CDATA[ This string might
have tags < > < > and stuff, don't touch this ]]></other></org>"
UPDATE:
This is what I've come up with so far- I just can't figure out how to have it ignore CDATA content...
xml.gsub(/>\s+</,"><")
Also, would much rather use an XML parser for this, as from what I hear regexing XML is a bad thing.
Yes! What you want is canonicalization!
http://xml4r.github.io/libxml-ruby/rdoc/classes/LibXML/XML/Document.html#method-i-canonicalize
LibXML-Ruby gem can do this. Since the docs are shitty and doesn't even say what it does, here are the specs
http://www.w3.org/TR/xml-c14n
This is used a lot in XML signing.
And yes! Using regular expressions on XML is bad.
BTW you can also print your xml object as a string, and set indentation:
http://xml4r.github.io/libxml-ruby/rdoc/classes/LibXML/XML/Document.html#method-i-to_s

Why are there blank nodes/attributes when using LibXML Ruby?

Using the Gem libxml-ruby, when we parse XML like so:
document = LibXML::XML::Parser.string( xmlData ).parse
for n in document.root.children
# Do something
end
What we actually get is something like this:
root
-node empty
-node with data
-node empty
Same thing with attributes, there's a blank one padding between those we actually care about. What we end up needing to use is :options => LibXML::XML::Parser::Options::NOBLANKS
Why? :(
(Not necessarily an answer, but need formatting.)
What does the XML look like?
This XML:
<baz>
<plugh>ohai</plugh>
</baz>
may contain whitespace text nodes for the CR/LF and indentation between the <baz> and <plugh> opening tags, and the same for between the closing tags. This may or may not be significant whitespace depending on the nature of the XML. Structurally, it's different than:
<baz><plugh>ohai</plugh></baz>

How do I count a sub string using a regex in ruby?

I have a very large xml file which I load as a string
so my XML lools like
<publication ID="7728" contentstatus="Unchanged" idID="0b000064800e9e39">
<volume contentstatus="Unchanged" idID="0b0000648151c35d">
<article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
</volume>
I want to count the number of occurrences the string
article ID="5705641" contentstatus="Changed"
how can I convert the ID to a regex
Here is what I have tried doing
searchstr = 'article ID=\"/[1-9]{7}/\" contentstatus=\"Changed\"'
count = ((xml.scan(searchstr).length)).to_s
puts count
Please let me know how can I achieve this?
Thanks
I'm going to go out on a limb and guess that you're new to Ruby. First, it's not necessary to convert count into a string to puts it. Puts automatically calls to_s on anything you send to it.
Second, it's rarely a good idea to handle XML with string manipulation. I would strongly advise that you use a full fledged XML parser such as Nokogiri.
That said, you can't embed a regex in a string like that. The entire query string would need to be a regex.
Something like
/article ID="[1-9]{7}" contentstatus="Changed"/
Quotation marks aren't special characters in a regex, so you don't need to escape them.
When in doubt about regex in Ruby, I recommend checking out Rubular.com.
And once again, I can't emphasize enough that I really don't condone trying to manipulate XML via regex. Nokogiri will make dealing with XML a billion times easier and more reliable.
If XPath is an option, it is a preferred way of selecting XML elements. You can use the selector:
//article[#contentstatus="Changed"]
Or, if possible:
count(//article[#contentstatus="Changed"])
Nokogiri is my recommended Ruby XML parser. It's very robust, and is probably the standard for the language now.
I added two more "articles" to show how easily you can find and manipulate the contents, without having to rely on a regex.
require 'nokogiri'
xml =<<EOT
<publication ID="7728" contentstatus="Unchanged" idID="0b000064800e9e39">
<volume contentstatus="Unchanged" idID="0b0000648151c35d">
<article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
<article ID="5756262" contentstatus="Unchanged" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
<article ID="5756263" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
</volume>
EOT
doc = Nokogiri::XML(xml)
puts doc.search('//article[#contentstatus="Changed"]').size.to_s + ' found'
puts doc.search('//article[#contentstatus="Changed"]').map{ |n| "#{ n['ID'] } #{ n['doi'] } #{ n['idID'] }" }
>> 2 found
>> 5756261 10.1109/TNB.2011.2145270 0b0000648151d8ca
>> 5756263 10.1109/TNB.2011.2145270 0b0000648151d8ca
The problem with using regex with HTML or XML, is they'll break really easily if the XML changes, or if your XML comes from different sources or is malformed. Regex was never designed to handle that sort of problem, but a parser was. You could have XML with line ends after every tag, or none at all, and the parser won't really care as long as the XML is well-formed. A good parser, like Nokogiri can even do fixups if the XML is broken, in order to try to make sense of it, but
Your current string looks almost perfect to me, just remove the errant / from around the numbers:
searchstr = 'article ID=\"[1-9]{7}\" contentstatus=\"Changed\"'

How do I extract links from HTML using regex?

I want to extract links from google.com; My HTML code looks like this:
<a href="http://www.test.com/" class="l"
I took me around five minutes to find a regex that works using www.rubular.com.
It is:
"(.*?)" class="l"
The code is:
require "open-uri"
url = "http://www.google.com/search?q=ruby"
source = open(url).read()
links = source.scan(/"(.*?)" class="l"/)
links.each { |link| puts #{link}
}
The problem is, is it not outputting the websites links.
Those links actually have class=l not class="l". By the way, to figure this put I added some logging to the method so that you can see the output at various stages and debug it. I searched for the string you were expecting to find and didn't find it, which is why your regex failed. So I looked for the right string you actually wanted and changed the regex accordingly. Debugging skills are handy.
require "open-uri"
url = "http://www.google.com/search?q=ruby"
source = open(url).read
puts "--- PAGE SOURCE ---"
puts source
links = source.scan(/<a.+?href="(.+?)".+?class=l/)
puts "--- FOUND THIS MANY LINKS ---"
puts links.size
puts "--- PRINTING LINKS ---"
links.each do |link|
puts "- #{link}"
end
I also improved your regex. You are looking for some text that starts with the opening of an a tag (<a), then some characters of some sort that you dont care about (.+?), an href attribute (href="), the contents of the href attribute that you want to capture ((.+?)), some spaces or other attributes (.+?), and lastly the class attrubute (class=l).
I have .+? in three places there. the . means any character, the + means there must be one or more of the things right before it, and the ? means that the .+ should try to match as short a string as possible.
To put it bluntly, the problem is that you're using regexes. The problem is that HTML is what is known as a context-free language, while regular expressions can only the class of languages that are known as regular languages.
What you should do is send the page data to a parser that can handle HTML code, such as Hpricot, and then walk the parse tree you get from the parser.
What im going wrong?
You're trying to parse HTML with regex. Don't do that. Regular expressions cannot cover the range of syntax allowed even by valid XHTML, let alone real-world tag soup. Use an HTML parser library such as Hpricot.
FWIW, when I fetch ‘http://www.google.com/search?q=ruby’ I do not receive ‘class="l"’ anywhere in the returned markup. Perhaps it depends on which local Google you are using and/or whether you are logged in or otherwise have a Google cookie. (Your script, like me, would not.)

Resources