How to get Mechanize to auto-convert body to UTF8? - ruby

I found some solutions using post_connect_hook and pre_connect_hook, but it seems like they don't work. I'm using the latest Mechanize version (2.1). There are no [:response] fields in the new version, and I don't know where to get them in the new version.
https://gist.github.com/search?q=pre_connect_hooks
https://gist.github.com/search?q=post_connect_hooks
Is it possible to make Mechanize return a UTF8 encoded version, instead of having to convert it manually using iconv?

Since Mechanize 2.0, the arguments passed to the pre_connect_hooks and post_connect_hooks callbacks have changed.
See the Mechanize documentation:
pre_connect_hooks()
A list of hooks to call before making a request. Hooks are called with the agent and the request to be performed.

post_connect_hooks()
A list of hooks to call after retrieving a response. Hooks are called with the agent, the URI, the response, and the response body.
You can no longer change the internal response body from a hook, because the body is now passed as a plain argument rather than inside a collection you could modify in place. The next best option is to replace the internal parser with your own:
class MyParser
  def self.parse(thing, url = nil, encoding = nil, options = Nokogiri::XML::ParseOptions::DEFAULT_HTML, &block)
    # insert your conversion code here. For example:
    # thing = NKF.nkf("-wm0X", thing).sub(/Shift_JIS/,"utf-8") # you need to rewrite content charset if it exists.
    Nokogiri::HTML::Document.parse(thing, url, encoding, options, &block)
  end
end
agent = Mechanize.new
agent.html_parser = MyParser
page = agent.get('http://somewhere.com/')
...

I found a solution that works pretty well:
class HtmlParser
  def self.parse(body, url, encoding)
    body.encode!('UTF-8', encoding, invalid: :replace, undef: :replace, replace: '')
    Nokogiri::HTML::Document.parse(body, url, 'UTF-8')
  end
end
Mechanize.new.tap do |web|
  web.html_parser = HtmlParser
end
I haven't run into any issues with it so far.
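As a standalone illustration of the conversion step used in the parser above, here is a minimal sketch that transcodes a byte string to UTF-8 with String#encode, without Mechanize or Nokogiri. The helper name to_utf8 and the sample bytes are assumptions made for this example only.

```ruby
# Hypothetical helper: transcode raw bytes from a known source encoding to UTF-8,
# dropping any bytes that cannot be converted.
def to_utf8(body, source_encoding)
  body.encode('UTF-8', source_encoding, invalid: :replace, undef: :replace, replace: '')
end

# "café" as Windows-1252 bytes (0xE9 is "é" in that encoding).
raw = "caf\xE9".b
utf8 = to_utf8(raw, 'Windows-1252')

puts utf8                  # café
puts utf8.encoding         # UTF-8
puts utf8.valid_encoding?  # true
```

Unlike encode!, this version returns a new string and leaves the original bytes untouched.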

In your script, just enter: page.encoding = 'utf-8'
However, depending on your scenario, you may alternatively need to enter the reverse (the encoding of the website Mechanize is working with) instead. For that, open Firefox, open the website you want Mechanize to work with, select Tools in the menubar, and then open Page Info. Determine what the page is encoded in from there.
Using that info, you would instead enter what the page is encoded in (such as page.encoding = 'windows-1252').
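Why the declared encoding matters can be seen without Mechanize. In this sketch (the sample bytes are an assumption), the same bytes are invalid when labelled as UTF-8 but valid when labelled with the encoding the site actually uses, which is roughly the situation that setting page.encoding addresses.

```ruby
bytes = "na\xEFve".b   # "naïve" as Windows-1252 bytes

as_utf8   = bytes.dup.force_encoding('UTF-8')         # wrong label
as_cp1252 = bytes.dup.force_encoding('Windows-1252')  # correct label

puts as_utf8.valid_encoding?     # false
puts as_cp1252.valid_encoding?   # true
puts as_cp1252.encode('UTF-8')   # naïve
```

force_encoding only relabels the bytes, while encode actually converts them.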

How about something like this:
class Mechanize
  alias_method :original_get, :get
  def get(*args)
    doc = original_get(*args)
    doc.encoding = 'utf-8'
    doc
  end
end

Related

Scraping a webpage with Mechanize and Nokogiri and storing data in XML doc

I am trying to scrape a website and store data in XML using Mechanize and Nokogiri. I didn't set up a Rails project and I am only using Ruby and IRB.
I wrote this method:
def mechanize_club
  agent = Mechanize.new
  agent.get("http://www.rechercheclub.applipub-fft.fr/rechercheclub/")
  form = agent.page.forms.first
  form.field_with(:name => 'codeLigue').options[0].select
  form.submit
  page2 = agent.get('http://www.rechercheclub.applipub-fft.fr/rechercheclub/club.do?codeClub=01670001&millesime=2015')
  body = page2.body
  html_body = Nokogiri::HTML(body)
  codeclub = html_body.search('.form').children("tr:first").children("th:first").to_i
  @codeclubs << codeclub
  filepath = '/davidgeismar/Documents/codeclubs.xml'
  builder = Nokogiri::XML::Builder.new(encoding: 'UTF-8') do |xml|
    xml.root {
      xml.codeclubs {
        @codeclubss.each do |c|
          xml.codeclub {
            xml.code_ c.code
          }
        end
      }
    }
  end
  puts builder.to_xml
end
My first problem is that I don't know how to test my code.
I call ruby webscraper.rb in my console, the file is treated I think, but it doesn't create an XML file in the specified path.
Then, more specifically I am quite sure this code is wrong as I didn't get a chance to test it.
Basically what I am trying to do is to submit a form several times:
agent = Mechanize.new
agent.get("http://www.rechercheclub.applipub-fft.fr/rechercheclub/")
form = agent.page.forms.first
form.field_with(:name => 'codeLigue').options[0].select
form.submit
I think this code is OK, but I don't want it to select only options[0]. I want it to select an option, scrape all the data I need, go back to the page, then select options[1], and so on until there are no more options (an iteration, I guess).
"the file is treated I think, but it doesn't create an XML file in the specified path."
There is nothing in your code that creates a file. You print some output, but don't do anything to open or write a file.
Perhaps you should read the IO and File documentation and review how you are using your filepath variable?
The second problem is that you don't call your method anywhere. Though it's defined and Ruby will see it and parse the method, it has no idea what you want to do with it unless you invoke the method:
def mechanize_club
...
end
mechanize_club()
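Both points can be sketched with the standard library alone. The method name save_codeclubs, the sample XML, and the temporary directory are assumptions for illustration. In the real script, the XML text would come from builder.to_xml and the path from the filepath variable.

```ruby
require 'tmpdir'

# Hypothetical stand-in for the scraping method: it receives the XML text
# and writes it to disk instead of only printing it.
def save_codeclubs(xml, filepath)
  File.write(filepath, xml)  # creates or overwrites the file
  filepath
end

xml = %(<?xml version="1.0" encoding="UTF-8"?>\n<root><codeclubs/></root>\n)

Dir.mktmpdir do |dir|
  # Defining the method is not enough; it has to be called.
  path = save_codeclubs(xml, File.join(dir, 'codeclubs.xml'))
  puts File.exist?(path)  # true
  puts File.read(path)
end
```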

Reading a Gmail Message with ruby-gmail

I am looking for an instance method from the ruby-gmail gem that would allow me to read either:
the body
or
subject
of a Gmail message.
After reviewing the documentation, found here, I couldn't find anything!?
There is a .message instance method found in the Gmail::Message class section; but it only returns, for lack of a better term, email "mumbo-jumbo," for the body.
My attempt:
#!/usr/local/bin/ruby
require 'gmail'
gmail = Gmail.connect('username', 'password')
emails = gmail.inbox.emails(:from => 'someone@mail.com')
emails.each do |email|
  email.read
  email.message
end
Now:
email.read does not work
email.message returns that, "mumbo-jumbo," mentioned above
Somebody else asked this question on SO but didn't get an answer.
This probably isn't exactly the answer to your question, but I will tell you what I have done in the past. I tried using the ruby-gmail gem but it didn't do what I wanted it to do in terms of reading a message. Or, at least, I couldn't get it to work. Instead I use the built-in Net::IMAP class to log in and get a message.
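require 'net/imap'
require 'mail' # the mail gem provides Mail.read_from_string
imap = Net::IMAP.new('imap.gmail.com', port: 993, ssl: true)
imap.login('<username>','<password>')
imap.select('INBOX')
# The original used a search_mail helper that was not shown; Net::IMAP#search is
# substituted here. It returns an array of message sequence numbers.
subject_id = imap.search(['SUBJECT', '<mail_subject>']).first
subject_message = imap.fetch(subject_id,'RFC822')[0].attr['RFC822']
mail = Mail.read_from_string subject_message
body_message = mail.html_part.body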
From here your message is stored in body_message and is HTML. If you want the entire email body you will probably need to learn how to use Nokogiri to parse it. If you just want a small bit of the message where you know some of the surrounding characters you can use a regex to find the part you are interested in.
I did find one page associated with the ruby-gmail gem that talks about using ruby-gmail to read a Gmail message. I made a cursory attempt at testing it tonight but apparently Google upped the security on my account and I couldn't get in using irb without tinkering with my Gmail configuration (according to the warning email I received). So I was unable to verify what is stated on that page, but as I mentioned my past attempts were unfruitful whereas Net::IMAP works for me.
EDIT:
I found this, which is pretty cool. You will need to add in
require 'cgi'
to your class.
I was able to implement it in this way. After I have my body_message, call the html2text method from that linked page (which I modified slightly and included below since you have to convert body_message to a string):
plain_text = html2text(body_message)
puts plain_text #Prints nicely formatted plain text to the terminal
Here is the slightly modified method:
def html2text(html)
  text = html.to_s.
    gsub(/(&nbsp;|\n|\s)+/im, ' ').squeeze(' ').strip.
    gsub(/<([^\s]+)[^>]*(src|href)=\s*(.?)([^>\s]*)\3[^>]*>\4<\/\1>/i,
         '\4')
  links = []
  linkregex = /<[^>]*(src|href)=\s*(.?)([^>\s]*)\2[^>]*>\s*/i
  while linkregex.match(text)
    links << $~[3]
    text.sub!(linkregex, "[#{links.size}]")
  end
  text = CGI.unescapeHTML(
    text.
      gsub(/<(script|style)[^>]*>.*<\/\1>/im, '').
      gsub(/<!--.*-->/m, '').
      gsub(/<hr(| [^>]*)>/i, "___\n").
      gsub(/<li(| [^>]*)>/i, "\n* ").
      gsub(/<blockquote(| [^>]*)>/i, '> ').
      gsub(/<(br)(| [^>]*)>/i, "\n").
      gsub(/<(\/h[\d]+|p)(| [^>]*)>/i, "\n\n").
      gsub(/<[^>]*>/, '')
  ).lstrip.gsub(/\n[ ]+/, "\n") + "\n"
  for i in (0...links.size).to_a
    text = text + "\n [#{i+1}] <#{CGI.unescapeHTML(links[i])}>" unless
      links[i].nil?
  end
  links = nil
  text
end
You also mentioned in your original question that you got mumbo-jumbo with this step:
email.message *returns mumbo-jumbo*
If the mumbo-jumbo is HTML, you can probably just use your existing code with this html2text method instead of switching over to Net::IMAP as I had discussed when I posted my original answer.
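If the full html2text method is more than you need, the core idea (drop the tags, then decode the entities) fits in a few lines. This is a much reduced sketch, not a replacement for the method above. The name strip_html and the sample markup are assumptions.

```ruby
require 'cgi'

# Hypothetical, much smaller cousin of html2text: turn <br> into newlines,
# drop every other tag, then decode HTML entities.
def strip_html(html)
  text = html.to_s.gsub(/<br\s*\/?>/i, "\n").gsub(/<[^>]*>/, '')
  CGI.unescapeHTML(text).strip
end

puts strip_html('<p>Tom &amp; Jerry<br>say &quot;hi&quot;</p>')
```

It ignores links, lists, and script or style blocks, which html2text handles.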
Nevermind, it's:
email.subject
email.body
silly me
ok, so how do I get the body in "readable" text? without all the encoding stuff and html?
Subject, text body and HTML body:
email.subject
if email.message.multipart?
  text_body = email.message.text_part.body.decoded
  html_body = email.message.html_part.body.decoded
else
  # Only multipart messages contain an HTML body
  text_body = email.message.body.decoded
  html_body = text_body
end
Attachments:
email.message.attachments.each do |attachment|
  path = "/tmp/#{attachment.filename}"
  File.write(path, attachment.decoded)
  # The MIME type might be useful
  content_type = attachment.mime_type
end
require 'gmail'
gmail = Gmail.connect('username', 'password')
emails = gmail.inbox.emails(:from => 'someone@mail.com')
emails.each do |email|
  puts email.subject
  puts email.text_part.body.decoded
end

Is it possible to convert Mechanize::File into Mechanize::Page?

I'm having trouble with the Mechanize gem: how do I convert a Mechanize::File into a Mechanize::Page?
Here's my piece of code:
link = page.link_with(:href => %r{/en/users}).click
When the users link is clicked, it goes to the page with the list of users. Now I want to click the first user, but I can't, because link is a Mechanize::File object.
Any help or suggestions would be great, thanks.
Mechanize uses Content-Type to determine how the resource should be handled. Occasionally websites will not set the mime-types for their resources. Mechanize::File is the default for unset Content-Type.
If you are only dealing with 'text/html', you can follow Jimm Stout's suggestion of using post_connect_hooks:
agent = Mechanize.new do |a|
  a.post_connect_hooks << ->(_,_,response,_) do
    # content_type is nil when the header is missing, so guard with to_s
    if response.content_type.to_s.empty?
      response.content_type = 'text/html'
    end
  end
end
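The hook logic can be tried in isolation against a bare Net::HTTP response object, without Mechanize or a network connection. This is a sketch under the assumption that the hook receives a Net::HTTPResponse as its third argument. The lambda name default_to_html is made up for the example.

```ruby
require 'net/http'

# Same shape as a post_connect_hooks entry: (agent, uri, response, body).
default_to_html = lambda do |_agent, _uri, res, _body|
  # content_type is nil when the header is absent, so guard with to_s.
  res.content_type = 'text/html' if res.content_type.to_s.empty?
end

response = Net::HTTPOK.new('1.1', '200', 'OK')  # a response with no headers set
p response.content_type                         # nil
default_to_html.call(nil, nil, response, '')
p response.content_type                         # "text/html"
```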
Just parse the body with nokogiri:
link = page.link_with(:href => %r{/en/users}).click
doc = Nokogiri::HTML link.body
agent.get doc.at('a')[:href]

In Ruby/Rails, how can I encode/escape special characters in URLs?

How do I encode or 'escape' the URL before I use OpenURI to open(url)?
We're using OpenURI to open a remote url and return the xml:
getresult = open(url).read
The problem is the URL contains some user-input text that contains spaces and other characters, including "+", "&", "?", etc. potentially, so we need to safely escape the URL. I saw lots of examples when using Net::HTTP, but have not found any for OpenURI.
We also need to be able to un-escape a similar string we receive in a session variable, so we need the reciprocal function.
Don't use URI.escape: it was deprecated in Ruby 1.9 and removed in Ruby 3.0.
Rails' Active Support adds Hash#to_query:
{foo: 'asd asdf', bar: '"<#$dfs'}.to_query
# => "bar=%22%3C%23%24dfs&foo=asd+asdf"
Also, as you can see it tries to order query parameters always the same way, which is good for HTTP caching.
Ruby Standard Library to the rescue:
require 'uri'
user_text = URI.escape(user_text)
url = "http://example.com/#{user_text}"
result = open(url).read
See the docs for the URI::Escape module, which also has a method for the inverse (unescape). Note that this only works on older Ruby versions where URI.escape is still available.
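On Ruby versions where URI.escape is not available, the uri standard library still covers both directions for query-string values. A minimal sketch follows. The sample text and the example.com URL are assumptions.

```ruby
require 'uri'

user_text = 'Tom & Jerry?'

# Escape a single value for use in a query string (a space becomes "+").
escaped = URI.encode_www_form_component(user_text)
puts escaped                                   # Tom+%26+Jerry%3F

# The reciprocal function.
puts URI.decode_www_form_component(escaped)    # Tom & Jerry?

# Build a whole query string from key/value pairs.
query = URI.encode_www_form(q: user_text, page: 2)
puts "http://example.com/search?#{query}"      # http://example.com/search?q=Tom+%26+Jerry%3F&page=2
```

These methods use form encoding, so they are meant for query keys and values rather than for path segments.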
The main thing you have to consider is that you have to escape the keys and values separately before you compose the full URL.
All the methods which get the full URL and try to escape it afterwards are broken, because they cannot tell whether any & or = character was supposed to be a separator, or maybe a part of the value (or part of the key).
The CGI library seems to do a good job, except for the space character, which was traditionally encoded as +, and nowadays should be encoded as %20. But this is an easy fix.
Please, consider the following:
require 'cgi'
def encode_component(s)
  # The space-encoding is a problem:
  CGI.escape(s).gsub('+','%20')
end
def url_with_params(path, args = {})
  return path if args.empty?
  path + "?" + args.map do |k,v|
    "#{encode_component(k.to_s)}=#{encode_component(v.to_s)}"
  end.join("&")
end
def params_from_url(url)
  path,query = url.split('?',2)
  return [path,{}] unless query
  q = query.split('&').inject({}) do |memo,p|
    k,v = p.split('=',2)
    memo[CGI.unescape(k)] = CGI.unescape(v)
    memo
  end
  return [path, q]
end
u = url_with_params( "http://example.com",
                     "x[1]" => "& ?=/",
                     "2+2=4" => "true" )
# "http://example.com?x%5B1%5D=%26%20%3F%3D%2F&2%2B2%3D4=true"
params_from_url(u)
# ["http://example.com", {"x[1]"=>"& ?=/", "2+2=4"=>"true"}]
Ruby has the built-in URI library, and the Addressable gem, in particular Addressable::URI
I prefer Addressable::URI. It's very full featured and handles the encoding for you when you use the query_values= method.
I've seen some discussions about URI going through some growing pains so I tend to leave it alone for handling encoding/escaping until these things get sorted out:
http://osdir.com/ml/ruby-core/2010-06/msg00324.html
http://osdir.com/ml/lang-ruby-core/2009-06/msg00350.html
http://osdir.com/ml/ruby-core/2011-06/msg00748.html

How to access html request parameters for a .rhtml page served by webrick?

I'm using webrick (the built-in ruby webserver) to serve .rhtml
files (html with ruby code embedded --like jsp).
It works fine, but I can't figure out how to access parameters
(e.g. http://localhost/mypage.rhtml?foo=bar)
from within the ruby code in the .rhtml file.
(Note that I'm not using the rails framework, only webrick + .rhtml files)
Thanks
According to the source code of erbhandler it runs the rhtml files this way:
Module.new.module_eval{
  meta_vars = servlet_request.meta_vars
  query = servlet_request.query
  erb.result(binding)
}
So the binding should contain a query (which contains a hash of the query string) and a meta_vars variable (which contains a hash of the environment, like SERVER_NAME) that you can access inside the rhtml files (and the servlet_request and servlet_response might be available too, but I'm not sure about them).
If that is not the case you can also try querying the CGI parameter ENV["QUERY_STRING"] and parse it, but this should only be as a last resort (and it might only work with CGI files).
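If you do end up parsing a raw query string yourself, the uri standard library can do it. A minimal sketch, where the helper name parse_query_string and the sample string are assumptions:

```ruby
require 'uri'

# Hypothetical fallback: parse a raw query string into a hash of strings.
def parse_query_string(raw)
  URI.decode_www_form(raw.to_s).to_h
end

params = parse_query_string('foo=bar&name=Tom+%26+Jerry')
puts params['foo']    # bar
puts params['name']   # Tom & Jerry
```

With repeated keys, to_h keeps only the last value. Use the array of pairs returned by URI.decode_www_form directly if you need all of them.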
This is the solution:
(suppose the request is http://your.server.com/mypage.rhtml?foo=bar)
<html>
<body>
This is my page (mypage.rhtml, served by webrick)
<%
# embedded ruby code
puts servlet_request.query["foo"] # prints "bar" on the server console
%>
</body>
</html>
You don't give many details, but I imagine you have a servlet to serve the files you process with ERB, and by default the web server serves any static file in a public directory.
require 'webrick'
include WEBrick
require 'erb'
s = HTTPServer.new( :Port => 8080,:DocumentRoot => Dir::pwd + "/public" )
class MyServlet < HTTPServlet::AbstractServlet
  def do_GET(req, response)
    File.open('public/my.rhtml','r') do |f|
      @template = ERB.new(f.read)
    end
    response.body = @template.result(binding)
    response['Content-Type'] = "text/html"
  end
end
s.mount("/my", MyServlet)
trap("INT"){
  s.shutdown
}
s.start
This example is limited: whenever you go to /my, the same file is always processed. In practice you should construct the file path based on the request path. Note the important word here: "request". Everything you need is there.
To get the HTTP header parameters, use req[header_name]. To get the parameters in the query string, use req.query[param_name]. req is the HTTPRequest object passed to the servlet.
Once you have the parameter you need, you have to bind it to the template. In the example we pass the binding object from self (binding is defined in Kernel, and it represents the context where the code is executing), so every local variable defined in the do_GET method is available in the template. However, you can also create your own binding, for example from a Proc object, and pass it to the ERB processor when calling result.
Everything together, your solution would look like:
def do_GET(req, response)
  File.open('public/my.rhtml','r') do |f|
    @template = ERB.new(f.read)
  end
  foo = req.query["foo"]
  response.body = @template.result(binding)
  response['Content-Type'] = "text/html"
end
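The binding mechanism can be tried without WEBrick. In this sketch the render method is a hypothetical stand-in for do_GET, and a plain hash plays the role of req.query.

```ruby
require 'erb'

# Hypothetical stand-in for do_GET: `query` plays the role of req.query.
def render(template_source, query)
  foo = query['foo']  # local variable, visible to the template through binding
  ERB.new(template_source).result(binding)
end

puts render('<p>foo is <%= foo %></p>', { 'foo' => 'bar' })  # <p>foo is bar</p>
```

Because the template is evaluated with the method's binding, it sees foo exactly as code inside the method would.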
Browsing the documentation, it looks like you should have an HTTPRequest from which you can get the query string. You can then use parse_query to get a name/value hash.
Alternatively, it's possible that just calling query() will give you the hash directly... my Ruby-fu isn't quite up to it, but you might want to at least give it a try.

Resources