Nokogiri producing different results on heroku? - ruby

I'm having a very strange problem and I'd appreciate help tracking it down.
I'm using the nokogiri gem to parse some HTML, and the file I'm parsing has a weird character in it. I'm not entirely sure what this character is; in vim it shows as ^Q.
On my own computer everything works fine, but on Heroku, Nokogiri inserts a </body></html><html> when it hits the character, and selectors only return the elements before it.
To illustrate:
Nokogiri::HTML(open("http://thoms.net.nz/e2.html")).css("body div").count is 1 on Heroku and 2 on my computer. The file containing the character can be downloaded from http://thoms.net.nz/e2.html.
Both my computer and Heroku are running Nokogiri 1.5.5 with Ruby 1.9.3.

The ^Q is the XON software flow-control character (0x11), which isn't supposed to be in HTML. I suspect its unexpected presence is confusing both Nokogiri and Heroku, but in different ways.
HTML documents from the wilds of the internet can be corrupted in any number of ways. I've seen all sorts of garbage in them, and when I couldn't make sense of it using Iconv or a Unicode transliteration, I'd resort to a quick global search-and-replace to remove anything outside the normal ASCII range before further processing.
In Ruby, global search and replace uses String#gsub.
doc = Nokogiri::HTML(html.gsub("\u0011", ''))
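A slightly broader cleanup along those lines, stripping every ASCII control character except tab, newline, and carriage return before parsing (a sketch using the URL from the question):
require 'open-uri'
require 'nokogiri'

html = open("http://thoms.net.nz/e2.html").read
# Strip ASCII control characters (other than \t, \n, \r) that can
# confuse the parser, such as the stray XON (\u0011).
clean = html.gsub(/[\u0000-\u0008\u000B\u000C\u000E-\u001F]/, '')
doc = Nokogiri::HTML(clean)
puts doc.css("body div").count  # expect 2 on both machines once the character is gone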

Related

Nokogiri outputs different strings on different systems

I am reading a local .html file using the following line:
myDoc = File.open("Ina.html") { |f| Nokogiri::HTML(f) }
I get a Node using xpath and then I simply print it
divNode = myDoc.at_xpath('//div[@id="mw-content-text"]/p[1]')
puts divNode
Fragment of output on one system, using ruby 2.3:
<p><b>Ina:</b> Ñe’êpehê , ñe’ẽtéva rire (aha´aína)</p>
Fragment of output on another system, using ruby 2.1:
<p><b>Ina:</b> &Ntilde;e&rsquo;&ecirc;peh&ecirc; , &ntilde;e&rsquo;&#x1EBD;t&eacute;va rire (aha&acute;a&iacute;na)</p>
Any thoughts on what is going on with the encoding? All the suggestions of forcing the encoding and/or specifying the encoding have not been successful.
Well, I fixed the problem, but I still don't fully understand why the original way didn't work.
The solution was simply to read the whole .html file and then instantiate the Nokogiri object by parsing the file's contents as a string.
file = File.open(outputFolder + "/" + htmlName, "rb")
content = file.read
doc = Nokogiri::HTML.parse(content, nil, "UTF-8")
To me, this seemed equivalent to either one of the statements I tried:
myDoc = File.open("Ina.html") { |f| Nokogiri::HTML(f) }
myDoc = File.open("Ina.html", nil, "UTF-8") { |f| Nokogiri::HTML(f) }
nokogiri does weird stuff sometimes. I couldn't explain what nokogiri is "supposed" to do here -- both versions are 'correct' in representing the same thing in an HTML document. Is it exactly the same version of nokogiri? If so, it could be a different version of libxml, which nokogiri uses under the hood, and in some cases will use an existing system install. Or the ruby 2.1 vs 2.3 difference could matter, although that seems unlikely.
Basically, if you want exactly the same behavior, you need to use exactly the same version of everything -- ruby, nokogiri, libxml.
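A quick way to compare the two stacks is Nokogiri's built-in version report; run this on both machines and diff the output:
require 'nokogiri'

puts RUBY_VERSION
puts Nokogiri::VERSION
# VERSION_INFO is a hash that reports, among other things, the
# libxml2 version in use and whether it is bundled or system-wide.
puts Nokogiri::VERSION_INFO.inspect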
The first is just the straight unicode bytes, the second has non-ascii characters replaced by html character entities. Both should be rendered the same in a browser. If you want one of those behaviors and not the other (personally I think I'd rather have the unicode), that's kind of a different question, but there's probably a way to force nokogiri to do it. But I don't know it.
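If I had to guess, though, I'd start by passing an explicit output encoding when serializing -- libxml tends to fall back to character entities only when the target encoding can't represent a character (an untested guess, using the divNode from the question):
# With a UTF-8 target, the serializer can emit the raw characters
# instead of escaping them to entities.
puts divNode.to_html(encoding: 'UTF-8')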
If you use Nokogiri::XML instead of Nokogiri::HTML, I'd wager it won't replace non-ascii with html character entities, but you also, if I recall right, won't get some "forgiving of not quite legal syntax" behavior the HTML parser uses.
Wait, now looking closer, I'm thinking maybe the second one doesn't represent the same thing: it's html character entities, but I'm not sure they're really the right ones. Could the encoding have gotten messed up? Depending on how you are reading the data in, the OS, and what the LANG env variable is set to (if it's a unix machine), the encoding could be getting mangled.
Also, are you positive that the Ina.html file you are opening is really truly identical on both systems? Could it have become corrupted or transformed differently in the download process? Copy the file from one machine to the other to make sure the two files are really identical.

convert ascii characters to ruby encoding

I'm testing a feature with watir and running into an issue validating special (non-ASCII) characters in the html.
I'm grabbing the product description from a database, like so: 'Company® Some Product', and using it as the string I'm validating against,
and it shows up that way in the html. However, Ruby is looking for Company\u00AE Some Product, so my test is failing.
Anyone have any solutions for getting around these special characters when they turn up?
The HTMLEntities gem may help:
http://htmlentities.rubyforge.org/
http://htmlentities.rubyforge.org/doc/
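For instance, a sketch of what the round trip might look like here (assuming the page's html contains the entity &reg; while the database hands back the literal ® character):
require 'htmlentities'

coder = HTMLEntities.new
# Decode the entity-encoded text taken from the page...
page_text = coder.decode("Company&reg; Some Product")
# ...and it now compares equal to the database value.
page_text == "Company® Some Product"  # => true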

How can I render XML character entity references in Ruby?

I am reading some data from an XML webservice with Ruby, something like this:
<phrases>
<phrase language="en_US">¡I'm highly annoyed with character references!</phrase>
</phrases>
I'm parsing the XML and grabbing an array of phrases. As you can see, the phrase text contains some XML character entity references. I'd like to replace them with the actual character being referenced. This is simple enough with the numeric references, but nasty with the XML and HTML ones. I'd like to avoid having a big hash in my code that holds the character for each XML or HTML character reference, i.e. http://www.java2s.com/Code/Java/XML/Resolvesanentityreferenceorcharacterreferencetoitsvalue.htm
Surely there's a library for this out there, right?
Update
Yes, there is a library out there, and it's called HTMLEntities:
: jmglov@laurana; sudo gem install htmlentities
Successfully installed htmlentities-4.2.4
: jmglov@laurana; irb
irb(main):001:0> require 'htmlentities'
=> []
irb(main):002:0> HTMLEntities.new.decode "&#161;I&apos;m highly annoyed with character references!"
=> "¡I'm highly annoyed with character references!"
REXML can do it, though it won't handle "&iexcl;" or "&nbsp;". The list of predefined XML entities (aside from Unicode numeric entities) is actually quite small. See http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
Given this input XML:
<phrases>
<phrase language="en_US">"I'm highly annoyed with character references!©</phrase>
</phrases>
you can parse the XML and the embedded entities like this (for example):
require 'rexml/document'
doc = REXML::Document.new(File.read('/tmp/foo.xml'))
phrase = REXML::XPath.first(doc, '//phrases/phrase')
text = phrase.first # Type is REXML::Text
puts(text.value)
Obviously, that example assumes that the XML is in file /tmp/foo.xml. You can just as easily pass a string of XML. On my Mac and Ubuntu systems, running it produces:
$ ruby /tmp/foo.rb
"I'm highly annoyed with character references!©
This isn't an attempt to provide a solution, it's to relate some of my own experiences dealing with XML from the wild. I was using Perl at first, then later using Ruby, and the experiences are something you can encounter easily if you grab enough XML or RDF/RSS/Atom feeds.
I've often seen XML CDATA contain HTML, both encoded and unencoded. The encoded HTML was probably the result of someone doing things the right way, via some API or library to generate XML. The unencoded HTML was probably someone using a script to wrap the HTML with tags, resulting in invalid XML, but I had to deal with it anyway.
I've also seen XML CDATA containing HTML that had been encoded multiple times, requiring me to unencode everything, even after the XML engine had done its thing. Sometimes during an intermediate pass I'd suddenly have non-UTF8 characters in the string along with encoded ones, as a result of someone appending comments or joining multiple HTML streams together that were from different character-sets. For whatever the reason, it was really ugly and caused XML parsing to break or emit a lot of warnings. I'd have to loop over the content, decoding and checking to see if the previous pass was the same as the current decoding pass, and bailing if nothing had changed. There was no guarantee I'd have a string in a valid character-set at the time though, so I'd have to tell iconv to convert it to UTF8 and throw away characters that wouldn't convert cleanly.
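That decode-until-stable loop looked roughly like this (a sketch from memory using the HTMLEntities gem; content stands in for the CDATA text being cleaned):
require 'htmlentities'

coder = HTMLEntities.new
decoded = content
loop do
  previous = decoded
  decoded = coder.decode(previous)
  # Bail once a pass changes nothing -- the string has been
  # unwrapped from however many layers of encoding it had.
  break if decoded == previous
end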
Nokogiri can decode the content of a node in various ways, through creative use of the to_xml and to_html methods. You can also look at the HTMLEntities gem, Loofah, and others to go after the CDATA contents. Loofah is nice because it's designed to whitelist/blacklist the tags you might encounter.
The XML spec is supposed to protect us from such shenanigans, but, as one of my co-workers used to tell me, "We can make it fool-proof, but not damn-fool-proof". People are SO inventive and the specs mean nothing to someone who didn't bother to read them or doesn't care.

In Ruby, how to automatically convert non-supported characters in text-processing?

(Using Ruby 1.8)
I only have a brief understanding of encoding and such... but what I want to know is: in any given script handling any given text file, is there some universal library or call I can make to turn non-standard characters into their nearest printable equivalents? I realize there's no "all-in-one" fix, but this is for an English (U.S. gov't) text file, so I'm wondering if there's something that mitigates what must be a relatively common issue in English text formatting.
For example, in a text file, I have an entry like this:
0-8­23
That hyphen is literally just a hyphen as I've typed it out here. In the file, though, it's something that looks like a hyphen (an en-dash?), but when copying and pasting it -- for example, into this browser text box -- it doesn't show up.
Printing it out via a Ruby script gets this:
08�23
How do I get my script to resolve it into a dash, or at least something other than a gremlin?
It's very common to run into hyphen-like characters and dashes, especially in the output of word processors. Converting them isn't too hard if you know which byte represents the character, but it gets to be a pain when a document has several different ones. It gets worse as you throw other accented characters into the mix.
Ruby 1.8 doesn't support multibyte and Unicode character sets as well as 1.9+, but you can work around that somewhat by using the Iconv library.
Iconv lets you convert between various character sets, such as US-ASCII, ISO-8859-1 and WIN-1252. It's smarter than a regex because it knows how to convert accented characters to similar-looking ASCII characters, or ignore them if nothing similar exists, allowing your transliteration to degrade gracefully.
I have some example code in an answer to a related question. Also read James Grey's article linked in the answer. It explains the problem and ways to fix it, ending up with recommending Iconv too.
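A minimal sketch of that approach (assuming the file is WIN-1252, a common guess for word-processor output; dirty stands in for the offending line from the file):
require 'iconv'

# '//TRANSLIT' asks Iconv to approximate unmappable characters
# (an en-dash becomes '-'), and '//IGNORE' silently drops anything
# that has no reasonable ASCII equivalent instead of raising.
clean = Iconv.conv('US-ASCII//TRANSLIT//IGNORE', 'WINDOWS-1252', dirty)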
You could whitelist with gsub:
string.gsub(/[^a-zA-Z0-9]/, '')
Without knowing more information, I can't build the perfect regex for you, but the general idea is to replace anything that's not what you're expecting (anything not a letter or number or expected symbols).

clean up strange encoding in ruby

I'm currently playing a bit with couchdb.
I'm trying to migrate some blog data from redis (key value store) to couchdb (key value store).
Seeing as I probably migrated this data a gazillion times from and to different blogging engines (everybody has got to have a hobby :) ), there seem to be some encoding snafus.
I'm using CouchREST to access CouchDB from ruby and I'm getting this:
<JSON::GeneratorError: source sequence is illegal/malformed>
the problem seems to be the body_html part of the object:
#<Post:0x00000000e9ee18 @body_html="[.....]Wie Sie bereits wissen, m\xF6chte EUserv k\xFCnftig seine [...]
Those are supposed to be Umlauts ("möchte" and "künftig").
Any idea how to get rid of these problems? I tried some conversions using the Ruby 1.9 encoding API or Iconv before inserting, but haven't had any luck yet :(
If I try to e.g. convert that stuff to ISO-8859-1 using Ruby 1.9's .encode() method, this is what happens (different text, same problem):
#<Encoding::UndefinedConversionError: "\xC6\x92" from UTF-8 to ISO-8859-1>
"I try to e.g. convert that stuff to ISO-8859-1"
Close. You actually want to do it the other way around: you've got ISO-8859-1(*), you want UTF-8(**). So str.encode('utf-8', 'iso-8859-1') would be more likely to do the trick.
*: actually you might well have Windows code page 1252, which is like ISO-8859-1, but with extra smart-quotes and things in the range 0x80-0x9F which ISO-8859-1 uses for control codes. If so, use 'cp1252' instead.
**: well, you probably do. Working with UTF-8 is the best way forward so you can store all possible characters. If you really want to keep working in ISO-8859-1/cp1252, then presumably the problem is just that Ruby has mis-guessed the character set in use and you can fix it by calling str.force_encoding('iso-8859-1').
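A sketch of both fixes, using the umlaut bytes from the question:
# The bytes \xF6 and \xFC are "ö" and "ü" in ISO-8859-1/cp1252,
# but they were mislabeled as UTF-8 somewhere along the way.
raw = "m\xF6chte EUserv k\xFCnftig".force_encoding('ASCII-8BIT')

# Re-interpret the bytes as ISO-8859-1 and transcode to UTF-8:
utf8 = raw.encode('UTF-8', 'ISO-8859-1')  # => "möchte EUserv künftig"

# Or, to keep working in ISO-8859-1, just correct the label:
latin1 = raw.force_encoding('ISO-8859-1')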
