Using mechanize and getting "uninitialized constant Object::WWW (NameError)" - Ruby

I'm using mechanize on Windows 7 x64, but I get the uninitialized constant Object::WWW (NameError).
The code is very simple:
require 'rubygems'
require 'mechanize'
agent = WWW::Mechanize.new
The error occurs at the line agent = WWW::Mechanize.new.
Any help is appreciated!

Remove the WWW:: - that namespace was dropped from Mechanize a long time ago; the class now lives at the top level as just Mechanize.
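A minimal corrected version of your snippet:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new # no WWW:: prefix; Mechanize is now a top-level constant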

While googling I found the following code, which may be useful to you:
a = Mechanize.new { |agent|
  agent.user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; es-ES; rv:1.9.2.3) Gecko/20100401 Firefox/6.0.2'
}
a.get("http://www.somesite.com") do |page|
  page.search("//a[@id='id-name']").each do |link|
    puts link
  end
end

Related

Using a splat to catch errors is not working

I have a lot of errors that I need to catch, so I put them all into two arrays and made constants to hold them. However, when I run the program I receive this exception:
C:/Users/thomas_j_perkins/bin/ruby/tool/sql_tool/whitewidow/lib/imports/constants_and_requires.rb:62:in `<top (required)>': uninitialized constant RestClient::MaxRedirectsReached (NameError)
from whitewidow.rb:6:in `require_relative'
from whitewidow.rb:6:in `<main>'
Here's how the constants look:
LOADING_ERRORS = [RestClient::ResourceNotFound, RestClient::InternalServerError, RestClient::RequestTimeout,
                  RestClient::Gone, RestClient::SSLCertificateNotVerified, RestClient::Forbidden,
                  OpenSSL::SSL::SSLError, Errno::ECONNREFUSED, URI::InvalidURIError, Errno::ECONNRESET,
                  Timeout::Error, OpenSSL::SSL::SSLError, Zlib::GzipFile::Error, RestClient::MultipleChoices,
                  RestClient::Unauthorized, SocketError, RestClient::BadRequest, RestClient::ServerBrokeConnection,
                  RestClient::MaxRedirectsReached]
FATAL_ERRORS = [Mechanize::ResponseCodeError, RestClient::ServiceUnavailable, OpenSSL::SSL::SSLError,
                RestClient::BadGateway]
Here's how I'm using them:
begin
  # Do some cool stuff
rescue *FATAL_ERRORS => e
  puts e
end
--
begin
  # Do some more cool stuff
rescue *LOADING_ERRORS => e
  puts e
end
Am I doing something wrong that makes the error come from <top (required)>? Just in case you need it, here's the entire requires file that the error refers to:
# Built in libraries
require 'rubygems'
require 'bundler/setup'
require 'mechanize'
require 'nokogiri'
require 'rest-client'
require 'timeout'
require 'uri'
require 'fileutils'
require 'yaml'
require 'date'
require 'optparse'
require 'tempfile'
require 'socket'
require 'net/http'
# Created libraries
require_relative '../../lib/modules/format'
require_relative '../../lib/misc/credits'
require_relative '../../lib/misc/legal'
require_relative '../../lib/misc/spider'
require_relative '../../lib/modules/copy'
require_relative '../../lib/modules/site_info'
require_relative '../../lib/modules/expansion/string_expan'
# Modules that need to be included
include Format
include Credits
include Legal
include Whitewidow
include Copy
include SiteInfo
# Constants used throughout the program
=begin
USER_AGENTS = { # Temporary fix for user agents until I can refactor the YAML file
1 => 'Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html) Gecko/2008032620',
2 => 'Mozilla/5.0 (compatible; U; ABrowse 0.6; Syllable) AppleWebKit/420+ (KHTML, like Gecko)',
3 => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.3pre) Gecko/20100403 Lorentz/3.6.3plugin2pre (.NET CLR 4.0.20506)',
4 => 'Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)',
5 => 'igdeSpyder (compatible; igde.ru; +http://igde.ru/doc/tech.html)',
6 => 'larbin_2.6.3 (ltaa_web_crawler#groupes.epfl.ch)',
7 => 'Mozilla/5.0 (Linux; Android 5.0.2; SAMSUNG SM-T550 Build/LRX22G) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/3.3 Chrome/38.0.2125.102 Safari/537.36',
8 => 'Dalvik/2.1.0 (Linux; U; Android 6.0.1; Nexus Player Build/MMB29T)',
9 => 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1',
10 => 'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)',
}
=end
FORMAT = Format::StringFormat.new
PATH = Dir.pwd
VERSION = Whitewidow.version
SEARCH = File.readlines("#{PATH}/lib/lists/search_query.txt").sample
USER_AGENTS = YAML.load_file("#{PATH}/lib/lists/rand-age.yml")
OPTIONS = {}
USER_AGENT = USER_AGENTS[rand(1..10)]
SKIP = %w(/webcache.googleusercontent.com stackoverflow.com github.com)
LOADING_ERRORS = [RestClient::ResourceNotFound, RestClient::InternalServerError, RestClient::RequestTimeout,
                  RestClient::Gone, RestClient::SSLCertificateNotVerified, RestClient::Forbidden,
                  OpenSSL::SSL::SSLError, Errno::ECONNREFUSED, URI::InvalidURIError, Errno::ECONNRESET,
                  Timeout::Error, OpenSSL::SSL::SSLError, Zlib::GzipFile::Error, RestClient::MultipleChoices,
                  RestClient::Unauthorized, SocketError, RestClient::BadRequest, RestClient::ServerBrokeConnection,
                  RestClient::MaxRedirectsReached]
FATAL_ERRORS = [Mechanize::ResponseCodeError, RestClient::ServiceUnavailable, OpenSSL::SSL::SSLError,
                RestClient::BadGateway]
I installed mechanize and rest-client:
gem install mechanize
gem install rest-client
then I opened an IRB session:
require 'mechanize'
require 'rest-client'
then I tested your FATAL_ERRORS array and was able to raise the errors and handle them with your code.
So there is no problem with the way you are using the * splat operator.
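As a sanity check, here is a minimal, self-contained demonstration of rescuing a splatted array, using built-in exception classes rather than your constants:
ERRORS = [ZeroDivisionError, ArgumentError]
begin
  1 / 0 # raises ZeroDivisionError
rescue *ERRORS => e
  puts "caught #{e.class}" # prints "caught ZeroDivisionError"
end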
The problem is in your LOADING_ERRORS array.
When I tried doing the same thing with your LOADING_ERRORS array, I got the same error message as you.
I cloned the rest-client git repository and searched the lib/restclient/exceptions.rb file, and it seems there is no RestClient::MaxRedirectsReached defined.
If you remove that exception from your array, the code works.
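You can confirm this from IRB without cloning anything; a quick check, assuming rest-client is installed:
require 'rest-client'
RestClient.const_defined?(:MaxRedirectsReached)   # => false on current rest-client releases
RestClient.const_defined?(:ExceptionWithResponse) # => true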
After further research in the repository, I found a history.md file which states:
Changes to redirection behavior: (#381, #484)
Remove RestClient::MaxRedirectsReached in favor of the normal
ExceptionWithResponse subclasses. This makes the response accessible on
the exception object as .response, making it possible for callers to tell
what has actually happened when the redirect limit is reached.
When following HTTP redirection, store a list of each previous response on
the response object as .history. This makes it possible to access the
original response headers and body before the redirection was followed.
Follow redirection consistently, regardless of whether the HTTP method was
passed as a symbol or string. Under the hood rest-client now normalizes the
HTTP request method to a lowercase string.
So it seems that exception has been removed from the rest-client library.
You may want to replace it with RestClient::ExceptionWithResponse.
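For example, the last entry of LOADING_ERRORS could be swapped out like this (a sketch, trimmed to a few entries for brevity):
LOADING_ERRORS = [RestClient::ResourceNotFound, RestClient::InternalServerError,
                  RestClient::RequestTimeout, RestClient::ExceptionWithResponse]
Note that RestClient::ExceptionWithResponse is an ancestor of most of the RestClient error classes, so rescuing it also covers several of the other entries in the array.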

Ruby - nokogiri, open-uri - Fail to parse page [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 7 years ago.
This code works on some pages, like klix.ba, but I can't figure out why it doesn't work for others.
There is no error to explain what went wrong, nothing.
If puts page works, which means I can reach the page and parse it, why can't I get single elements?
require 'nokogiri'
require 'open-uri'
url = 'http://www.olx.ba/'
user_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"
page = Nokogiri::XML(open(url,'User-Agent' => user_agent), nil, "UTF-8")
#puts page - This line works
puts page.xpath('a')
First of all, why are you parsing it as XML?
The following should be correct, considering your page is an HTML website:
page = Nokogiri::HTML(open(url, 'User-Agent' => user_agent), nil, "UTF-8")
Furthermore, if you want to pull out all the links (a tags), this is how:
page.css('a').each do |element|
  puts element
end
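If you want the URLs rather than the whole tags, a small sketch along the same lines:
links = page.css('a').map { |element| element['href'] }.compact # skip anchors with no href
puts links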
If you want to parse content from a web page, you need to do this:
require 'nokogiri'
require 'open-uri'
url = 'http://www.olx.ba/'
user_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"
page = Nokogiri::HTML(open(url,'User-Agent' => user_agent), nil, "UTF-8")
#puts page - This line works
puts page.xpath('//a') # '//a' searches the whole document; a bare 'a' only matches children of the root
Here, take a look at the Nokogiri documentation.
One thing I would suggest is to use a debugger breakpoint in your code (probably after assigning page). Look at the pry-debugger gem.
So I would do something like this:
require 'nokogiri'
require 'open-uri'
require 'pry' # require the necessary library
url = 'http://www.olx.ba/'
user_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"
page = Nokogiri::HTML(open(url,'User-Agent' => user_agent), nil, "UTF-8")
binding.pry # stop at this moment in your code (breakpoint)
#puts page - This line works
puts page.xpath('a')

Trouble scraping Google trends using Capybara and Poltergeist

I want to get the top trending queries in a particular category on Google Trends. I could download the CSV for that category but that is not a viable solution because I want to branch into each query and find the trending sub-queries for each.
I am unable to capture the contents of the following table, which contains the top 10 trending queries for a topic. Also, for some weird reason, taking a screenshot using Capybara returns a darkened image.
<div id="TOP_QUERIES_0_0table" class="trends-table">
Please run the code on the Ruby console to see it working. Capturing elements/screenshot works fine for facebook.com or google.com but doesn't work for trends.
I am guessing this has to do with the table getting generated dynamically on page load but I'm not sure if that should block capybara from capturing the elements already loaded on the page. Any hints would be very valuable.
require 'capybara/poltergeist'
require 'capybara/dsl'
require 'csv'

class PoltergeistCrawler
  include Capybara::DSL

  def initialize
    Capybara.register_driver :poltergeist_crawler do |app|
      Capybara::Poltergeist::Driver.new(app, {
        :js_errors => false,
        :inspector => false,
        phantomjs_logger: open('/dev/null')
      })
    end
    Capybara.default_wait_time = 3
    Capybara.run_server = false
    Capybara.default_driver = :poltergeist_crawler
    page.driver.headers = {
      "DNT" => 1,
      "User-Agent" => "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:22.0) Gecko/20100101 Firefox/22.0"
    }
  end

  # handy to peek into what the browser is doing right now
  def screenshot(name = "screenshot")
    page.driver.render("public/#{name}.jpg", full: true)
  end

  # find("path") and all("path") work ok for most cases. Sometimes I need more control, like finding hidden fields
  def doc
    Nokogiri.parse(page.body)
  end
end

crawler = PoltergeistCrawler.new
url = "http://www.google.com/trends/explore#cat=0-45&geo=US&date=today%2012-m&cmpt=q"
crawler.visit url
crawler.screenshot
crawler.find(:xpath, "//div[@id='TOP_QUERIES_0_0table']")
Capybara::ElementNotFound: Unable to find xpath "//div[@id='TOP_QUERIES_0_0table']"
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/node/finders.rb:41:in `block in find'
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/node/base.rb:84:in `synchronize'
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/node/finders.rb:30:in `find'
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/session.rb:676:in `block (2 levels) in <class:Session>'
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/dsl.rb:51:in `block (2 levels) in <module:DSL>'
from (irb):45
from /Users/karan/.rbenv/versions/1.9.3-p484/bin/irb:12:in `<main>'
The JavaScript error was due to an incorrect User-Agent. Once I changed the User-Agent to that of my Chrome browser, it worked:
"User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36"

Switching ip while parsing the site with ruby mechanize

Is there any way to change or hide the request IP while I'm parsing a website with my Ruby mechanize program, to avoid a ban from the site's server?
I've seen sites that change IP addresses, like http://www.newipnow.com/, but I can't figure out how to use that in my program.
Here is my code:
require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'logger'
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8
agent = Mechanize.new do |a|
  a.ssl_version = 'SSLv3'
  a.verify_mode = OpenSSL::SSL::VERIFY_NONE
  a.user_agent_alias = 'Windows Mozilla'
end
authorization = agent.get("http://vk.com/")
vk_form = authorization.forms.first
vk_form.email = 'myaccount'
vk_form.pass = 'mypassword'
authorization = agent.submit(vk_form, vk_form.buttons.first)
Yes, you can set a proxy like this:
agent.set_proxy host, port, user, pass
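Putting it together with the code above (a sketch; the proxy host, port, and credentials are placeholders you would get from your proxy provider):
require 'mechanize'

agent = Mechanize.new
agent.set_proxy('203.0.113.10', 8080, 'proxyuser', 'proxypass') # user/pass are optional
page = agent.get('http://vk.com/') # all requests now go through the proxy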

How can I redirect pretty-print in IRB

I am trying to redirect pretty-print output in IRB, but pp page >> results.txt does not work.
How can I redirect pretty-print output to a file? I am using Windows.
My code
require 'nokogiri'
require 'mechanize'
agent = Mechanize.new
agent.user_agent_alias = 'Windows Mozilla'
page = agent.get('http://www.asus.com/Search/')
pp page
You can't redirect output to a file inside a Ruby script using >>; that trick only works in the shell, at the command line.
To write pretty-printed output to a file, pass the file handle to PP.pp (plain pp always writes to standard output):
File.open('results.txt', 'a') { |fo| PP.pp(page, fo) }
See the documentation for pp for more information.
OK, I got it to work. For anyone curious, this is based on another pretty-print question I found:
require 'nokogiri'
require 'mechanize'
agent = Mechanize.new
agent.user_agent_alias = 'Windows Mozilla'
page = agent.get('http://www.asus.com/Search/')
pp page
File.open("results.txt","w") do |f|
PP.pp(page,f)
end
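An equivalent one-liner, since requiring pp also gives every object a pretty_inspect method that returns the formatted output as a string:
require 'pp'
File.write("results.txt", page.pretty_inspect)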
