Iconv::IllegalSequence when using www::mechanize - ruby

I'm trying to do a little bit of web scraping, but the WWW::Mechanize gem doesn't seem to like the encoding and crashes.
The post request results in a 302 redirect (which mechanize follows, so far so good) and the resulting page seems to crash it.
I googled quite a bit, but so far nothing has come up on how to solve this. Do any of you have an idea?
Code:
require 'rubygems'
require 'mechanize'
agent = WWW::Mechanize.new
agent.user_agent_alias = 'Mac Safari'
answer = agent.post('https://www.budget.de/de/reservierung/privatkunden/step1/schnellbuchung',
{"Country" => "Deutschland",
"Abholstation" => "Aalen",
"Abgabestation" => "Aalen",
"Abholdatum" => "26.02.2009",
"Abholzeit_stunde" => "13",
"Abholzeit_minute" => "30",
"Abgabedatum" => "28.02.2009",
"Abgabezeit_stunde" => "13",
"Abgabezeit_minute" => "30",
"CountryID" => "DE",
"AbholstationID"=>"AA1",
"AbgabestationID"=>"AA1"
}
)
puts answer.body
Error:
D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/util.rb:29:in `iconv': "\204nderungen vorbe"... (Iconv::IllegalSequence)
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/util.rb:29:in `to_native_charset'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/response_header_handler.rb:29:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:30:in `pass'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/handler.rb:6:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/response_body_parser.rb:35:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:30:in `pass'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/handler.rb:6:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/pre_connect_hook.rb:14:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:25:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:494:in `fetch_page'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:545:in `fetch_page'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:403:in `post_form'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:322:in `post'
from test.rb:7

That page is most certainly UTF-8; however, Mechanize uses NKF (a core Ruby library) to guess the encoding, and for some reason it comes up as Shift JIS. The quickest way to work around the problem is to override the encoding mapping for Mechanize, so that when it attempts to convert the body to UTF-8 using Iconv it passes in the source encoding as UTF-8 as well. You can do it like this:
WWW::Mechanize::Util::CODE_DIC[:SJIS] = "UTF-8"
Place that just after the line where you require the Mechanize library. You may want to set the value back immediately after, or even better, find the root cause of the problem and submit a patch if necessary.
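A minimal sketch of that restore-afterwards idea, assuming url and params hold the request data from the question:
old_mapping = WWW::Mechanize::Util::CODE_DIC[:SJIS]
WWW::Mechanize::Util::CODE_DIC[:SJIS] = "UTF-8"
answer = agent.post(url, params) # the request whose response was being misdetected
WWW::Mechanize::Util::CODE_DIC[:SJIS] = old_mapping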
Note: The way I solved this was by debugging the Mechanize library using the backtrace. The to_native_charset method calls detect_charset, which is where the problem was.
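If you want to see the misdetection yourself, you can ask NKF (the library Mechanize uses for the guess, as noted above) what it thinks of the body; a quick check might look like this, assuming page holds the fetched response:
require 'nkf'
p NKF.guess(page.body) # prints NKF's charset guess (an NKF constant on Ruby 1.8, an Encoding on 1.9)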

In my case a Mechanize::File was returned by the get method which doesn't use encoding at all.
I was able to fix it by manually converting with Iconv, but this only works if you know the encoding already.
result = @agent.get uri
# Mechanize::File instead of Mechanize::Page is returned
# so we have to convert manually
result = Iconv.conv("utf-8", "iso-8859-1", result.body)
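If you don't know up front whether get will hand you a Mechanize::Page or a Mechanize::File, one option is to branch on the class (a sketch; the iso-8859-1 source encoding is an assumption, as above):
require 'iconv'

result = agent.get(uri)
body = if result.is_a?(Mechanize::Page)
  result.body # Mechanize already handled the charset conversion
else
  Iconv.conv("utf-8", "iso-8859-1", result.body) # Mechanize::File: convert by hand
end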

Related

ruby and curl: skipping invalid pages

I am building a script to parse multiple page titles. Thanks to another question on Stack Overflow I now have this working bit:
curl = %x(curl http://odin.1.ai)
simian = curl.match(/<title>(.*)<\/title>/)[1]
puts simian
but if you try the same on a page that has no title, for example
curl = %x(curl http://zales.1.ai)
it dies with undefined method for nil class, as it has no title ...
I can't check if curl is nil as it is not in this case (it contains another line)
Do you have any solution to keep this working even if the title is not present, and to move on to check the next page? I would appreciate it if we stick to this code, as I did try other solutions with Nokogiri and open-uri (Nokogiri::HTML(open("http:/.....")), but those are not working either, since subdomains like byname_meee.1.ai do not work with the default open-uri, so I am thankful if we can stick to this code that uses curl.
UPDATE
I realize that I probably left out some specific cases that ought to be clarified. This is for parsing 300-400 pages. In the first run I noticed at least two cases where Nokogiri, Hpricot, and even the more basic open-uri do not work:
1) open-uri simply fails on a domain with an underscore, like http://levant_alejandro.1.ai. This is a valid domain and works with curl, but not with open-uri or with Nokogiri using open-uri.
2) The second case is a page that has no title, like
http://zales.1.ai
3) Third is a page with an image and no valid HTML like http://voldemortas.1.ai/
A fourth case would be a page that has nothing but an internal server error or passenger/rack error.
The first three cases can be sorted with this solution (thanks to Havenwood in #ruby IRC channel)
curl = %x(curl http://voldemortas.1.ai/)
begin
  simian = curl.match(/<title>(.*)<\/title>/)[1]
rescue NoMethodError
  simian = "" # curl was nil?
rescue ArgumentError
  simian = "" # not html?
end
puts simian
Now I am aware that this is neither elegant nor optimal.
REPHRASED QUESTION
Do you have a better way to achieve the same with Nokogiri or another gem that covers these cases (no title, no valid HTML page, or even a 404 page)? Given that the pages I am parsing have a fairly simple title structure, is the above solution suitable? For the sake of knowledge, it would be useful to know why an extra gem for the parsing, like Nokogiri, would be the better option (note: I try to have few gem dependencies, as often and over time they tend to break).
You're making it much too hard on yourself.
Nokogiri doesn't care where you get the HTML; it just wants the body of the document. You can use Curb, Open-URI, or a raw Net::HTTP connection, and it will parse the content returned.
Try Curb:
require 'curb'
require 'nokogiri'
doc = Nokogiri::HTML(Curl.get('http://odin.1.ai').body_str)
doc.at('title').text
=> "Welcome to Dotgeek.org * 1.ai"
If you don't know whether you'll have a <title> tag, then don't try to do it all at once:
title = doc.at('title')
next if (!title)
puts title.text
Take a look at "equivalent of curl for Ruby?" for more ideas.
You just need to check for the match before accessing it. If curl.match is nil, then you can't access the grouping:
curl = %x(curl http://odin.1.ai)
simian = curl.match(/<title>(.*)<\/title>/)
simian &&= simian[1] # only access the matched group if available
puts simian
Do heed the Tin Man's advice and use Nokogiri. Your regexp is really only suitable for a brittle solution -- it fails when the title element is spread over multiple lines.
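For example, a quick illustration of that failure mode, plus a marginally less brittle pattern:
html = "<head><title>\nMy Title\n</title></head>"
html.match(/<title>(.*)<\/title>/)      # => nil, because . does not match newlines
html.match(/<title>(.*?)<\/title>/m)[1] # => "\nMy Title\n" with the multiline flag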
Update
If you really don't want to use an HTML parser and if you promise this is for a quick script, you can use OpenURI (wrapper around net/http) in the standard library. It's at least a little cleaner than parsing curl output.
require 'open-uri'
def extract_title_content(line)
  title = line.match(%r{<title>(.*)</title>})
  title &&= title[1]
end

def extract_title_from(uri)
  title = nil
  open(uri) do |page|
    page.lines.each do |line|
      return title if title = extract_title_content(line)
    end
  end
rescue OpenURI::HTTPError => e
  STDERR.puts "ERROR: Could not download #{uri} (#{e})"
end
puts extract_title_from 'http://odin.1.ai'
What you're really looking for, it seems, is a way to skip non-HTML responses. That's much easier with a curl wrapper such as curb, as the Tin Man suggested, than dropping to the shell and using curl there:
1.9.3p125 :001 > require 'curb'
=> true
1.9.3p125 :002 > response = Curl.get('http://odin.1.ai')
=> #<Curl::Easy http://odin.1.ai?>
1.9.3p125 :003 > response.content_type
=> "text/html"
1.9.3p125 :004 > response = Curl.get('http://voldemortas.1.ai')
=> #<Curl::Easy http://voldemortas.1.ai?>
1.9.3p125 :005 > response.content_type
=> "image/png"
1.9.3p125 :006 >
So your code could look something like this:
response = Curl.get(url)
if response.content_type == "text/html" # or more fuzzy: =~ /text/
  match = response.body_str.match(/<title>(.*)<\/title>/)
  title = match && match[1]
  # or use Nokogiri for heavier lifting
end
No more exceptions when you then print the result.

How to insert a string to a text field using mechanize in ruby?

I know this is a very simple question, but I've been stuck for an hour and I just can't understand how this works.
I need to scrape some stuff from my school's library, so I need to insert 'CE' into a text field and then click on a link with the text 'Clasificación'. The output is what I am going to work with. So here is my code:
require 'rubygems'
require 'open-uri'
require 'nokogiri'
require 'mechanize'
url = 'http://biblio02.eld.edu.mx/janium-bin/busqueda_rapida.pl?Id=20110720161008#'
searchStr = 'CE'
agent = Mechanize.new
page = agent.get(url)
searchForm = page.form_with(:method => 'post')
searchForm['buscar'] = searchStr
clasificacionLink = page.link_with(:href => "javascript:onClick=set_index_and_submit(\'51\');").click
page = agent.submit(searchForm,clasificacionLink)
When I run it, it gives me this error
janium.rb:31: undefined method `[]=' for nil:NilClass (NoMethodError)
Thanks!
I think your problem is actually on line 13, not 31, and I'll even tell you why I think that. Not only does your script not have 31 lines, but, from the fine manual:
form_with(criteria)
Find a single form matching criteria.
There are several forms on that page that have method="post". Apparently Mechanize returns nil when it can't exactly match the form_with criteria, including the single part mentioned in the documentation; so, if your criteria match more than one form, form_with returns nil instead of choosing one of the options, and you end up trying to do this:
nil['buscar'] = searchStr
But nil doesn't have a []= method so you get your NoMethodError.
If you use this:
searchForm = page.form_with(:name => 'forma')
you'll get past the first part as there is exactly one form with name="forma" on that page. Then you'll have trouble with this:
clasificacionLink = page.link_with(:href => "javascript:onClick=set_index_and_submit(\'51\');").click
page = agent.submit(searchForm, clasificacionLink)
as Mechanize doesn't know what to do with JavaScript (at least mine doesn't). But if you use just this:
page = agent.submit(searchForm)
you'll get a page and then you can continue building and debugging your script.
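Putting those two changes together, a sketch of the whole script might look like this (URL, form name, and field name are taken from the question and the page):
require 'rubygems'
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://biblio02.eld.edu.mx/janium-bin/busqueda_rapida.pl?Id=20110720161008#')

searchForm = page.form_with(:name => 'forma') # unique on the page, unlike :method => 'post'
searchForm['buscar'] = 'CE'
page = agent.submit(searchForm)
puts page.body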
mu's answer sounds reasonable. I am not sure if this is strictly necessary, but you might also try putting square brackets around searchStr:
searchForm['buscar'] = [searchStr]

Nokogiri and Mechanize problem

I am doing one of the examples at the Mechanize doc site and I want to parse the results using Nokogiri.
My problem is that when the following line gets executed:
doc = Nokogiri::HTML(search_results, 'UTF-8' )
the following error occurs:
C:/Ruby192/lib/ruby/gems/1.9.1/gems/nokogiri-1.4.4.1-x86-mingw32/lib/nokogiri/html/document.rb:71:in `parse': undefined method `name' for "UTF-8":String (NoMethodError)
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/nokogiri-1.4.4.1-x86-mingw32/lib/nokogiri/html.rb:13:in `HTML'
from mechanize_test.rb:16:in `<main>'
I have installed Ruby 1.9 on a Windows Vista machine.
The results returned by Mechanize are non-Latin (UTF-8).
The code sample follows.
# encoding: UTF-8
require 'rubygems'
require 'mechanize'
require 'nokogiri'
agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'
page = agent.get("http://www.google.com/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = "invitations"
search_results = agent.submit(search_form)
puts search_results.body
doc = Nokogiri::HTML(search_results, 'UTF-8')
@Douglas Drouillard
Thanks for looking into this. I found out I made a mistake. The call to Nokogiri should have been:
doc = Nokogiri::HTML(search_results.body, 'UTF-8')
Note that search_results is different from search_results.body.
search_results contains info coming right out of the Mechanize instantiation, while search_results.body contains the HTML UTF-8 info that Nokogiri can parse with no problem.
This appears to be an issue with what Nokogiri expects as parameters to the parse method that is being called. The first issue I see is that you are passing the encoding option in the wrong parameter slot.
A parsing example from the Nokogiri project page that specifies the encoding:
Nokogiri.XML('<foo><bar /></foo>', nil, 'EUC-JP')
Notice the encoding is the third parameter, not the second. But that still does not fully explain the behavior you are seeing, as the encoding should simply be ignored.
Per the Nokogiri documentation a call to Nokogiri::HTML() is a convenience method for the parse method.
Code for Nokogiri::HTML::parse
def parse thing, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML, &block
document.parse(thing, url, encoding, options, &block)
end
The source for the Nokogiri::HTML::Document parse method is a bit long, but here is the relevant part:
if string_or_io.respond_to?(:encoding)
  unless string_or_io.encoding.name == "ASCII-8BIT"
    encoding ||= string_or_io.encoding.name
  end
end
Notice string_or_io.encoding.name; this matches the error you saw: undefined method 'name' for "UTF-8":String (NoMethodError).
Does your search_results object have an attribute with a key-value pair of {:encoding => 'UTF-8'}? It appears Nokogiri is looking for the encoding to be stored in an object that then has a name attribute of 'UTF-8'.
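Either way, a call that matches the signatures quoted above passes the body string and puts the encoding in the third slot, for example:
doc = Nokogiri::HTML(search_results.body, nil, 'UTF-8') # thing, url, encoding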

ruby 1.9: invalid byte sequence in UTF-8

I'm writing a crawler in Ruby (1.9) that consumes lots of HTML from a lot of random sites.
When trying to extract links, I decided to just use .scan(/href="(.*?)"/i) instead of nokogiri/hpricot (major speedup). The problem is that I now receive a lot of "invalid byte sequence in UTF-8" errors.
From what I understood, the net/http library doesn't have any encoding-specific options, and the stuff that comes in is basically not properly tagged.
What would be the best way to actually work with that incoming data? I tried .encode with the replace and invalid options set, but no success so far...
In Ruby 1.9.3 it is possible to use String#encode to "ignore" the invalid UTF-8 sequences. Here is a snippet that will work in both 1.8 (Iconv) and 1.9 (String#encode):
require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
file_contents.encode!('UTF-8', 'UTF-8', :invalid => :replace)
else
ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
file_contents = ic.iconv(file_contents)
end
or if you have really troublesome input you can do a double conversion from UTF-8 to UTF-16 and back to UTF-8:
require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
file_contents.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
file_contents.encode!('UTF-8', 'UTF-16')
else
ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
file_contents = ic.iconv(file_contents)
end
Neither the accepted answer nor the other answer worked for me. I found this post, which suggested
string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
This fixed the problem for me.
My current solution is to run:
my_string.unpack("C*").pack("U*")
This will at least get rid of the exceptions, which was my main problem.
Try this:
def to_utf8(str)
str = str.force_encoding('UTF-8')
return str if str.valid_encoding?
str.encode("UTF-8", 'binary', invalid: :replace, undef: :replace, replace: '')
end
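A usage sketch, assuming raw_body holds whatever your HTTP client returned:
html  = to_utf8(raw_body)
links = html.scan(/href="(.*?)"/i).flatten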
I recommend you use an HTML parser. Just find the fastest one.
Parsing HTML is not as easy as it may seem.
Browsers handle invalid UTF-8 sequences in UTF-8 HTML documents by just putting in the "�" symbol, so once the invalid UTF-8 sequence in the HTML gets parsed, the resulting text is a valid string.
Even inside attribute values you have to decode HTML entities like &amp;.
Here is a great question that sums up why you can not reliably parse HTML with a regular expression:
RegEx match open tags except XHTML self-contained tags
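As a sketch of the parser route, here is how extracting hrefs with Nokogiri might look, assuming html holds the fetched body:
require 'nokogiri'

doc   = Nokogiri::HTML(html)
links = doc.css('a[href]').map { |a| a['href'] } # attribute values come back entity-decoded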
attachment = file.read
begin
  # Try it as UTF-8 directly
  cleaned = attachment.dup.force_encoding('UTF-8')
  unless cleaned.valid_encoding?
    # Some of it might be old Windows code page
    cleaned = attachment.encode( 'UTF-8', 'Windows-1252' )
  end
  attachment = cleaned
rescue EncodingError
  # Force it to UTF-8, throwing out invalid bits
  attachment = attachment.force_encoding("ISO-8859-1").encode("utf-8", replace: nil)
end
This seems to work:
def sanitize_utf8(string)
return nil if string.nil?
return string if string.valid_encoding?
string.chars.select { |c| c.valid_encoding? }.join
end
I've encountered a string which had a mix of English, Russian and some other alphabets, which caused an exception. I need only Russian and English, and this currently works for me:
ec1 = Encoding::Converter.new "UTF-8","Windows-1251",:invalid=>:replace,:undef=>:replace,:replace=>""
ec2 = Encoding::Converter.new "Windows-1251","UTF-8",:invalid=>:replace,:undef=>:replace,:replace=>""
t = ec2.convert ec1.convert t
While Nakilon's solution works, at least as far as getting past the error, in my case I had this weird f-ed up character, originating from Microsoft Excel converted to CSV, that was registering in Ruby as (get this) a Cyrillic K, which in Ruby was a bolded K. To fix this I used 'iso-8859-1', viz. CSV.parse(f, :encoding => "iso-8859-1"), which turned my freaky deaky Cyrillic Ks into a much more manageable /\xCA/, which I could then remove with string.gsub!(/\xCA/, '').
Before you use scan, make sure that the requested page's Content-Type header is text/html, since there can be links to things like images, which are not encoded in UTF-8. The page could also be non-HTML if you picked up an href in something like a <link> element. How to check this varies depending on which HTTP library you are using. Then, make sure the result is only ASCII with String#ascii_only? (not UTF-8, because HTML is only supposed to be using ASCII; entities can be used otherwise). If both of those tests pass, it is safe to use scan.
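A sketch of those two checks with net/http (the URL is a placeholder):
require 'net/http'
require 'uri'

res = Net::HTTP.get_response(URI.parse('http://example.com/'))
if res['Content-Type'].to_s.include?('text/html') && res.body.ascii_only?
  links = res.body.scan(/href="(.*?)"/i)
end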
There is also the scrub method (Ruby 2.1+) to filter invalid bytes:
string.scrub('')
If you don't "care" about the data you can just do something like:
search_params = params[:search].valid_encoding? ? params[:search].gsub(/\W+/, '') : "nothing"
I just used valid_encoding? to get past it. Mine is a search field, and I was finding the same weirdness over and over, so I used something like the above just to have the system not break. Since I don't control the user experience to auto-validate prior to sending this info (like auto feedback to say "dummy up!"), I can just take it in, strip it out and return blank results.

How to set the mechanize page encoding?

I'm trying to get a page with ISO-8859-1 encoding by clicking on a link, so the code is similar to this:
page_result = page.link_with( :text => 'link_text' ).click
So far I get the result with a wrong encoding, so I see characters like:
'T�tulo:' instead of 'Título:'
I've tried several approaches, including:
Stating the encoding in the first request using the agent like:
@page_search = @agent.get(
:url => 'http://www.server.com',
:headers => { 'Accept-Charset' => 'ISO-8859-1' } )
Stating the encoding for the page itself
page_result.encoding = 'ISO-8859-1'
But I must be doing something wrong: a simple puts always shows the wrong characters.
Do you know how to state the encoding?
Thanks in advance,
Added: Executable example:
require 'rubygems'
require 'mechanize'
WWW::Mechanize::Util::CODE_DIC[:SJIS] = "ISO-8859-1"
@agent = WWW::Mechanize.new
@page = @agent.get(
:url => 'http://www.mcu.es/webISBN/tituloSimpleFilter.do?cache=init&layout=busquedaisbn&language=es',
:headers => { 'Accept-Charset' => 'utf-8' } )
puts @page.body
Hey, you can just do:
agent.page.encoding = 'utf-8'
Hope it helps!
The previous answer is correct, but in my code it looks slightly different:
agent = Mechanize.new
page = agent.get('http://example.com')
page.encoding = 'windows-1251'
page.search('p').each do |para|
puts para.text
end
Sorry, it was my mistake: I come from a Java background, where strings are internally converted to UTF-16, and I forgot Ruby doesn't do that. Mechanize was recovering the page flawlessly, but I needed to convert the data via Iconv.
Mental note: Ruby stores strings without converting their encoding.
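For reference, the manual conversion via Iconv would look something like this (assuming the ISO-8859-1 source encoding from the question):
require 'iconv'
utf8_body = Iconv.conv('UTF-8', 'ISO-8859-1', page_result.body)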
Yeah, Mechanize will try to detect the encoding itself (using the NKF core Ruby library to guess the encoding) and sometimes fails.
Maybe this might help:
WWW::Mechanize::Util::CODE_DIC[:SJIS] = "ISO-8859-1"
I'm not too sure about the exact syntax, but I think the CODE_DIC Hash might be a good place to look :)
I had a similar problem a while back.
