Nokogiri, open-uri, and Unicode Characters - ruby

I'm using Nokogiri and open-uri to grab the contents of the title tag on a webpage, but am having trouble with accented characters. What's the best way to deal with these? Here's what I'm doing:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open(link))
title = doc.at_css("title")
At this point, the title looks like this:
Rag\303\271
Instead of:
Ragù
How can I have nokogiri return the proper character (e.g. ù in this case)?
Here's an example URL:
http://www.epicurious.com/recipes/food/views/Tagliatelle-with-Duck-Ragu-242037

Summary: When feeding UTF-8 to Nokogiri through open-uri, use open(...).read and pass the resulting string to Nokogiri.
Analysis:
If I fetch the page using curl, the headers properly show Content-Type: text/html; charset=UTF-8 and the file content includes valid UTF-8, e.g. "Genealogía de Jesucristo". But even with a magic comment on the Ruby file and setting the doc encoding, it's no good:
# encoding: UTF-8
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI'))
doc.encoding = 'utf-8'
h52 = doc.css('h5')[1]
puts h52.text, h52.text.encoding
#=> Genealogà a de Jesucristo
#=> UTF-8
We can see that this is not the fault of open-uri:
html = open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI')
gene = html.read[/Gene\S+/]
puts gene, gene.encoding
#=> Genealogía
#=> UTF-8
This is a Nokogiri issue when dealing with open-uri, it seems. This can be worked around by passing the HTML as a raw string to Nokogiri:
# encoding: UTF-8
require 'nokogiri'
require 'open-uri'
html = open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI')
doc = Nokogiri::HTML(html.read)
doc.encoding = 'utf-8'
h52 = doc.css('h5')[1].text
puts h52, h52.encoding, h52 == "Genealogía de Jesucristo"
#=> Genealogía de Jesucristo
#=> UTF-8
#=> true

I was having the same problem and the Iconv approach wasn't working. Nokogiri::HTML is an alias to Nokogiri::HTML.parse(thing, url, encoding, options).
So, you just need to do:
doc = Nokogiri::HTML(open(link).read, nil, 'utf-8')
and it'll convert the page encoding properly to utf-8. You'll see Ragù instead of Rag\303\271.

When you say "looks like this," are you viewing this value IRB? It's going to escape non-ASCII range characters with C-style escaping of the byte sequences that represent the characters.
If you print them with puts, you'll get them back as you expect, presuming your shell console is using the same encoding as the string in question (Apparently UTF-8 in this case, based on the two bytes returned for that character). If you are storing the values in a text file, printing to a handle should also result in UTF-8 sequences.
If you need to translate between UTF-8 and other encodings, the specifics depend on whether you're in Ruby 1.9 or 1.8.6.
For 1.9: http://blog.grayproductions.net/articles/ruby_19s_string
for 1.8, you probably need to look at Iconv.
Also, if you need to interact with COM components in Windows, you'll need to tell ruby to use the correct encoding with something like the following:
require 'win32ole'
WIN32OLE.codepage = WIN32OLE::CP_UTF8
If you're interacting with mysql, you'll need to set the collation on the table to one that supports the encoding that you're working with. In general, it's best to set the collation to UTF-8, even if some of your content is coming back in other encodings; you'll just need to convert as necessary.
Nokogiri has some features for dealing with different encodings (probably through Iconv), but I'm a little out of practice with that, so I'll leave explanation of that to someone else.

Try setting the encoding option of Nokogiri, like so:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open(link))
doc.encoding = 'utf-8'
title = doc.at_css("title")

Changing Nokogiri::HTML(...) to Nokogiri::HTML5(...) fixed issues I was having with parsing certain special character, specifically em-dashes.
(The accented characters in your link came through fine in both, so don't know if this would help you with that.)
EXAMPLE:
url = 'https://www.youtube.com/watch?v=4r6gr7uytQA'
doc = Nokogiri::HTML(open(url))
doc.title
=> "Josh Waitzkin â\u0080\u0094 How to Cram 2 Months of Learning into 1 Day | The Tim Ferriss Show - YouTube"
doc = Nokogiri::HTML5(open(url))
doc.title
=> "Josh Waitzkin — How to Cram 2 Months of Learning into 1 Day | The Tim Ferriss Show - YouTube"

You need to convert the response from the website being scraped (here epicurious.com) into utf-8 encoding.
as per the html content from the page being scraped, its "ISO-8859-1" for now. So, you need to do something like this:
require 'iconv'
doc = Nokogiri::HTML(Iconv.conv('utf-8//IGNORE', 'ISO-8859-1', open(link).read))
Read more about it here: http://www.quarkruby.com/2009/9/22/rails-utf-8-and-html-screen-scraping

Just to add a cross-reference, this SO page gives some related information:
How to make Nokogiri transparently return un/encoded Html entities untouched?

Tip: you could also use the Scrapifier gem to get metadata, as the page title, from URIs in a very simple way. The data are all encoded in UTF-8.
Check it out: https://github.com/tiagopog/scrapifier
Hope it's useful for you.

Related

Ruby URI.extract returns empty array or ArgumentError: invalid byte sequence in UTF-8

I'm trying to get a list of files from url like this:
require 'uri'
require 'open-uri'
url = 'http://www.wmprof.com/media/niti/download'
html = open(url).read
puts URI.extract(html).select{ |link| link[/(PL)/]}
This code returns ArgumentError: invalid byte sequence in UTF-8 in line with URI.extract (even though html.encoding returns utf-8)
I've found some solutions to encoding problems, but when I'm changing the code to
html.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
URI.extract returns empty string, even when I'm not calling the select method on it. Any suggestions?
The character encoding of the website might be ISO-8859-1 or a related one. We can't tell for sure since there are only two occurrences of the same non-US-ASCII-character and it doesn't really matter anyway.
html.each_char.reject(&:ascii_only?) # => ["\xDC", "\xDC"]
Finding the actual encoding is done by guessing. The age of HTML 3.2 or the used language/s might be a clue. And in this case especially the content of the PDF file is helpful (it contains SPRÜH-EX and the file has the name TI_DE_SPR%dcH_EX.pdf). Then we only need to find the encoding for which "\xDC" and "Ü" are equal. Either by knowing it or writing some Ruby:
Encoding.list.select { |e| "Ü" == "\xDC".encode!(Encoding::UTF_8, e) rescue next }.map(&:name)
Of course, letting a program do the guessing is an option too. There is the libguess library. The web browser can do it too. However you you need to download the file though unless the server might tell the browser it's UTF-8 even if it isn't (like in this case). Any decent text editor will also try to detect the file encoding: e.g. ST3 thinks it's Windows 1252 which is a superset of ISO-8859-1 (like UTF-8 is of US-ASCII).
Possible solutions are manually setting the string encoding to ISO-8859-1:
html.force_encoding(Encoding::ISO_8859_1)
Or (preferably) transcoding the string from ISO-8859-1 to UTF-8:
html.encode!(Encoding::UTF_8, Encoding::ISO_8859_1)
To answer the other question: URI.extract isn't the method you're looking for. Apparently it's obsolete and more importantly, it doesn't extract relative URI.
A simple alternative is using a regular expression with String#scan. It works with this site but it might not with other ones. You have to use a HTML parser for the best reliability (there might be also a gem). Here's an example that should do what you want:
html.scan(/href="(.*?PL.*?)"/).flatten # => ["SI_PL_ACTIV_bicompact.pdf", ...]

Some spaces are removed when scraping a string using Nokogiri

I am very new to Ruby and I am currently working on site-scraping using Nokogiri to practice. I would like to scrape the details from 'deals' from a random group-buying site. I have been able to successfully scrape a site but I am having problems in parsing the output. I tried the solutions suggested in here and also using regex. So far, I have failed.
I am trying to parse the following title/description from this page:
Frosty Frappes starting at P100 for P200 worth at Café Tavolo – up to 55% off
This is what I got:
FrostyFrappes starting at P100 for P200 worth at Caf Tavolo up to 55% off
Here are the snippets in my code:
require 'uri'
require 'nokogiri'
html = open(url)
doc = Nokogiri::HTML(html.read)
doc.encoding = "utf-8"
title = doc.at_xpath('/html/body/div/div[9]/div[2]/div/div/div/h1/a')
puts title.content.to_s.strip.gsub(/[^0-9a-z%&!\n\/(). ]/i, '')
Please do tell me if I missed something out. Thank you.
Your xpath is too rigid and your regex is removing chars you want to keep. Here's how I would do it:
title = doc.at('div#contentDealTitle h1 a').text.strip.gsub(/\s+/,' ')
That says take the text from the first a tag that comes after div#contentDealTitle and h1, strip it (remove leading and trailing spaces) and replace all sequences of 1 or more whitespace char with a single space.

Incompatible encodings with ruby and Nokogiri HTML

I'm parsing an external HTML page with Nokogiri. That page is encoded with ISO-8859-1. Part of the data I want to extract, contains some – (dash) html entities:
xml = Nokogiri.HTML(open("http://flybynight.com.br/agenda.php"), nil, 'ISO-8859-1')
f = xml.xpath("//div[#style='background-color:#D9DBD9; padding:15px 12px 10px 10px;']//div[#class='tit_inter_cnz']/text()")
f[0].text #=> Preview M/E/C/A \u0096 John Digweed
In the last line, the String should be rendered on the browser with a dash. The browser correctly renders it if I specify my page as ISO-8859-1 encoding, however, my Sinatra app uses UTF-8. How can I correctly display that text in the browser? Today is is being displayed as a square with a small number inside.
I tried force_encoding('ISO-8859-1'), but then I get a CompatibilityError from Sinatra.
Any clues?
[Edit]
Below are screenshots of the app:
-> Firefox with character encoding UTF-8
-> [Firefox with character encoding Western (ISO-8859-1)
It's worth mentioning that in the ISO-8859-1 mode above, the dash is shown correctly, but there is another incorrect character with it just before the dash. Weird :(
After parsing a document in Nokogiri you can tell it to assume a different encoding. Try:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML((open("http://flybynight.com.br/agenda.php"), nil, 'ISO-8859-1')
doc.encoding = 'UTF-8'
I can't see that page from here, to confirm this fixes the problem, but it's worked for similar problems.
Summary: The problematic characters are control characters from ISO-8859-1, not intended for display.
Details and Investigation:
Here's a test showing that you are getting valid UTF-8 from Nokogiri and Sinatra:
require 'sinatra'
require 'open-uri'
get '/' do
html = open("http://flybynight.com.br/agenda.php").read
p [ html.encoding, html.valid_encoding? ]
#=> [#<Encoding:ISO-8859-1>, true]
str = html[ /Preview.+?John Digweed/ ]
p [ str, str.encoding, str.valid_encoding? ]
#=> ["Preview M/E/C/A \x96 John Digweed", #<Encoding:ISO-8859-1>, true]
utf8 = str.encode('UTF-8')
p [ utf8, utf8.encoding, utf8.valid_encoding? ]
#=> ["Preview M/E/C/A \xC2\x96 John Digweed", #<Encoding:UTF-8>, true]
require 'nokogiri'
doc = Nokogiri.HTML(html, nil, 'ISO-8859-1')
p doc.encoding
#=> "ISO-8859-1"
dig = doc.xpath("//div[#class='tit_inter_cnz']")[1]
p [ dig.text, dig.text.encoding, dig.text.valid_encoding? ]
#=> ["Preview M/E/C/A \xC2\x96 John Digweed", #<Encoding:UTF-8>, true]
<<-ENDHTML
<!DOCTYPE html>
<html><head><title>Dig it!</title></head><body>
<p>Here it comes...</p>
<p>#{dig.text}</p>
</body></html>
ENDHTML
end
This properly serves up content with Content-Type:text/html;charset=utf-8 on my computer. Chrome does not show my this character in the browser, however.
Analyzing that response, the same Unicode byte pair comes back for the dash as is seen in the above: \xC2\x96. This appears to be this Unicode character which seem to be an odd dash.
I would chalk this up to bad source data, and simply throw:
#encoding: UTF-8
at the top of your Ruby source file(s), and then put in:
f = ...text.gsub( "\xC2\x96", "-" ) # Or a better Unicode character
Edit: If you look at the browser test page for that character you will see (at least in in Chrome and Firefox for me) that the UTF-8 literal version is blank, but the hex and decimal escape versions show up. I cannot fathom why this is, but there you have it. The browsers are simply not displaying your character correctly when presented in raw form.
Either make it an HTML entity, or a different Unicode dash. Either way a gsub is called for.
Edit #2: One more odd note: the character in the source encoding has a hexadecimal byte value of 0x96. As far as I can tell, this does not appear to be a printable ISO-8859-1 character. As shown in the official spec for ISO-8859-1, this falls in one of the two non-printing regions.
I work in publishing of scientific manuscripts and there are many dashes. The dash that you are using is not an ASCII dash, it is a unicode dash. Forcing the ISO encoding is probably having the effect of making the dash change.
http://www.fileformat.info/info/unicode/char/96/index.htm
That site is excellent for unicode issues.
The reason you are getting a square is that perhaps your browser does not support this. It is probably correctly rendered. I would keep UTF-8 encoding, and if you want to make that dash so everyone can see it, convert it to an ascii dash.
You may want to try Iconv to convert the characters to ASCII/UTF-8 http://craigjolicoeur.com/blog/ruby-iconv-to-the-rescue

How do I import using FasterCSV a row with a name like "Ciarán"?

I am trying to load in my data migration a member database. Quite a few of the names have special characters such as "Ciarán". I've set up a simple example like this:
require 'rubygems'
require 'fastercsv'
FasterCSV.foreach("/Users/developer/Work/madmin/db/data/Members.csv") do |row|
puts row.inspect
end
and I get the following:
/usr/local/lib/ruby/gems/1.8/gems/fastercsv-1.5.0/lib/faster_csv.rb:1616:in `shift': FasterCSV::MalformedCSVError (FasterCSV::MalformedCSVError)
when I hit the row with this name.
I have been googling character encoding and UTF-8, but have not yet found a solution. I'd like to keep the special characters but would rather not have to edit each member name that fails.
Many thanks,
Brett
It works right off the bat for me, but if you need to change the encoding, you can pass an encoding option to FasterCSV. For example, to tell it to use UTF-8, you can do this:
require 'rubygems'
require 'fastercsv'
FasterCSV.foreach("some file.csv", :encoding => 'u') do |row|
puts row.inspect
end
The encoding options are listed in the documentation for new.
I've read elsewhere that this can be fixed by setting KCODE. For example:
$KCODE = "U"
Stick this at the top.
James Edward Gray has also said he's added encoding support to FasterCSV but it's in trunk only.

How can I get Nokogiri to parse and return an XML document?

Here's a sample of some oddness:
#!/usr/bin/ruby
require 'rubygems'
require 'open-uri'
require 'nokogiri'
print "without read: ", Nokogiri(open('http://weblog.rubyonrails.org/')).class, "\n"
print "with read: ", Nokogiri(open('http://weblog.rubyonrails.org/').read).class, "\n"
Running this returns:
without read: Nokogiri::XML::Document
with read: Nokogiri::HTML::Document
Without the read returns XML, and with it is HTML? The web page is defined as "XHTML transitional", so at first I thought Nokogiri must have been reading OpenURI's "content-type" from the stream, but that returns 'text/html':
(rdb:1) doc = open(('http://weblog.rubyonrails.org/'))
(rdb:1) doc.content_type
"text/html"
which is what the server is returning. So, now I'm trying to figure out why Nokogiri is returning two different values. It doesn't appear to be parsing the text and using heuristics to determine whether the content is HTML or XML.
The same thing is happening with the ATOM feed pointed to by that page:
(rdb:1) doc = Nokogiri.parse(open('http://feeds.feedburner.com/RidingRails'))
(rdb:1) doc.class
Nokogiri::XML::Document
(rdb:1) doc = Nokogiri.parse(open('http://feeds.feedburner.com/RidingRails').read)
(rdb:1) doc.class
Nokogiri::HTML::Document
I need to be able to parse a page without knowing what it is in advance, either HTML or a feed (RSS or ATOM) and reliably determine which it is. I asked Nokogiri to parse the body of either a HTML or XML feed file, but I'm seeing those inconsistent results.
I thought I could write some tests to determine the type but then I ran into xpaths not finding elements, but regular searches working:
(rdb:1) doc = Nokogiri.parse(open('http://feeds.feedburner.com/RidingRails'))
(rdb:1) doc.class
Nokogiri::XML::Document
(rdb:1) doc.xpath('/feed/entry').length
0
(rdb:1) doc.search('feed entry').length
15
I figured xpaths would work with XML but the results don't look trustworthy either.
These tests were all done on my Ubuntu box, but I've seen the same behavior on my Macbook Pro. I'd love to find out I'm doing something wrong, but I haven't seen an example for parsing and searching that gave me consistent results. Can anyone show me the error of my ways?
It has to do with the way Nokogiri's parse method works. Here's the source:
# File lib/nokogiri.rb, line 55
def parse string, url = nil, encoding = nil, options = nil
doc =
if string =~ /^\s*<[^Hh>]*html/i # Probably html
Nokogiri::HTML::Document.parse(string, url, encoding, options || XML::ParseOptions::DEFAULT_HTML)
else
Nokogiri::XML::Document.parse(string, url, encoding, options || XML::ParseOptions::DEFAULT_XML)
end
yield doc if block_given?
doc
end
The key is the line if string =~ /^\s*<[^Hh>]*html/i # Probably html. When you just use open, it returns an object that doesn't work with regex, thus it always returns false. On the other hand, read returns a string, so it could be regarded as HTML. In this case it is, because it matches that regex. Here's the start of that string:
<!DOCTYPE html PUBLIC
The regex matches the "!DOCTYPE " to [^Hh>]* and then matches the "html", thus assuming it's HTML. Why someone selected this regex to determine if the file is HTML is beyond me. With this regex, a file that begins with a tag like <definitely-not-html> is considered HTML, but <this-is-still-not-html> is considered XML. You're probably best off staying away from this dumb function and invoking Nokogiri::HTML::Document#parse or Nokogiri::XML::Document#parse directly.
Responding to this part of your question:
I thought I could write some tests to
determine the type but then I ran into
xpaths not finding elements, but
regular searches working:
I've just come across this problem using Nokogiri to parse an Atom feed. The problem seemed down to the anonymous name-space declaration:
<feed xmlns="http://www.w3.org/2005/Atom">
Removing the XMLNS declaration from the source XML would enable Nokogiri to search with XPath as usual. Removing that declaration from the feed obviously wasn't an option here, so instead I just removed the namespaces from the document after parsing:
doc = Nokogiri.parse(open('http://feeds.feedburner.com/RidingRails'))
doc.remove_namespaces!
doc.xpath('/feed/entry').length
Ugly I know, but it did the trick.

Resources