Ruby Mechanize: Follow a Link

In Mechanize on Ruby, I have to assign a new variable to every new page I come to. For example:
page2 = page1.link_with(:text => "Continue").click
page3 = page2.link_with(:text => "About").click
...etc
Is there a way to run Mechanize without a variable holding every page state? Something like:
my_only_page.link_with(:text => "Continue").click!
my_only_page.link_with(:text => "About").click!

I don't know if I understand your question correctly, but if it's a matter of looping through a lot of pages dynamically and processing them, you can reuse a single variable and reassign it on every iteration:
require 'mechanize'
url = "http://example.com"
agent = Mechanize.new
page = agent.get(url) # Get the starting page
loop do
  # What you want to do on the page - ex. extract something...
  item = page.parser.css('.some_item').text
  puts item # Placeholder: print, store, or otherwise process the extracted text
  if link = page.link_with(:text => "Continue") # As long as there is still a next-page link...
    page = link.click
  else # If no link is left, break out of the loop
    break
  end
end

Related

Repeated results from youtube while crawling

I am trying to fetch results from google and saving them to a file. But the results are getting repeated.
Also when I save them to file, only the last one link is getting printed to file.
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.google.com/videohp')
google_form = page.form('f')
google_form.q = 'ruby'
page = agent.submit(google_form, google_form.buttons.first)
linky = page.links
for link in linky do
if link.href.to_s =~/url.q/
str=link.href.to_s
strList=str.split(%r{=|&})
$url=strList[1].gsub("h%3Fv%3D", "h?v=")
$heading = link.text
$res = $url
if ($url.to_s.include? "webcache")
next
elsif ($url.to_s.include? "channel")
next
end
puts $res
end
end
for link in linky do
File.open("aaa.htm", 'w') { |file| file.write($res) }
end
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.google.com/videohp')
google_form = page.form('f')
google_form.q = 'ruby'
page = agent.submit(google_form, google_form.buttons.first)
linky = page.links
for link in linky do
if link.href.to_s =~/url.q/
str=link.href.to_s
strList=str.split(%r{=|&})
$url=strList[1].gsub("h%3Fv%3D", "h?v=")
$heading = link.text
$res = $url
if ($url.to_s.include? "webcache")
next
elsif ($url.to_s.include? "channel")
next
end
puts $res
File.open("aaa.htm", 'w') { |file| file.write($res) }
end
end
This is really two questions, and it's clear you're just starting out with Ruby. You will get better with practice, but it would help to keep reading up on the fundamentals of the language; this looks a bit like PHP written in Ruby.
First, the links are quite probably showing up multiple times because they are present more than once in the page. You aren't doing anything to catch that.
Second, you are putting each URL into a global variable (these tend to cause problems and should only be used if you can't find an alternative), and every assignment overwrites what was there before. So each time you run $res = $url, whatever was in $res is replaced, and only the last $url survives.
If you used an array instead of the single value $res (it can be a local variable too), you could call myArray.push(url) to add each new URL to it.
Once you have all the URLs in your array, you can use myArray.uniq to get rid of the duplicates before you write them out to your file.
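The array-plus-uniq idea can be sketched roughly as follows. This is only an illustration under assumptions: the sample hrefs are made up, the helper name extract_target is hypothetical, and the hrefs are assumed to have the /url?q=<target>&... shape seen in the question. The file is written to the system temp directory so the sketch runs anywhere.

```ruby
require 'uri'
require 'tmpdir'

# Hypothetical helper: pull the target URL out of a redirect-style href
# such as "/url?q=http://example.com/watch%3Fv%3Dabc&sa=U".
def extract_target(href)
  query = href.to_s.split('?', 2)[1]
  return nil if query.nil?
  URI.decode_www_form(query).to_h['q']
end

# Made-up sample data standing in for page.links.map(&:href).
hrefs = [
  '/url?q=http://example.com/watch%3Fv%3Dabc&sa=U',
  '/url?q=http://example.com/watch%3Fv%3Dabc&sa=U', # duplicate on purpose
  '/url?q=http://example.com/channel/xyz&sa=U',
  '/preferences'
]

urls = [] # a local array instead of the global $res
hrefs.each do |href|
  next unless href =~ /url.q/
  url = extract_target(href)
  next if url.nil? || url.include?('webcache') || url.include?('channel')
  urls.push(url)
end

urls = urls.uniq # drop duplicates before writing

# Open the file once and write every collected URL, one per line.
out_path = File.join(Dir.tmpdir, 'aaa.htm')
File.open(out_path, 'w') { |file| file.puts(urls) }
```

Opening the file a single time after the loop avoids the "only the last link is saved" problem, which came from reopening the file in 'w' mode (truncating it) on every iteration.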
It looks like you don't really know Ruby yet.
Please do not use global variables unless you really need them. In this case you don't; it's not PHP. Simple assignment to a local variable is enough. :)
To iterate through a collection, use the dedicated #each method. In your case you want to filter the collection of links and keep only those that match your needs: valid_links = links.select { |link| ... }.
The block should return true for links that match your conditions and false for those that don't.
The iteration over the collection has to happen inside the File.open block (you will have valid_links to go through).
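A rough sketch of that filter-then-write pattern, using a small Struct as a stand-in for Mechanize's link objects (the Struct, the sample links, and the temp-directory path are assumptions made so the example runs without a network connection):

```ruby
require 'tmpdir'

# Stand-in for Mechanize link objects, which also respond to href and text.
Link = Struct.new(:href, :text)

links = [
  Link.new('/url?q=http://example.com/a', 'A'),
  Link.new('/url?q=http://webcache.example.com/b', 'B'),
  Link.new('/about', 'About')
]

# Keep only the links whose block result is truthy.
valid_links = links.select do |link|
  href = link.href.to_s
  href =~ /url.q/ && !href.include?('webcache') && !href.include?('channel')
end

# Iterate inside the File.open block so the file is opened only once.
path = File.join(Dir.tmpdir, 'aaa.htm')
File.open(path, 'w') do |file|
  valid_links.each { |link| file.puts(link.href) }
end
```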

Clicking through google pages with mechanize

I'm trying to figure out how to use the link_with function in Ruby's mechanize gem. I've got the basic concept down:
page = <site>
blah blah blah
next_page = page.link_with(:text => "Next")
page = link.click
However, when I try this in a small test it runs very slowly. What I'm trying to do is loop through the first ten pages of Google results using loop do with a counter that starts at 10 and counts down; when the counter hits 0, the program should break out of the loop. It seems to be working, but it only pulls the first link off Google and then just sits there.
Source:
require 'mechanize'
require 'uri'
SEARCH = "test"
@agent = Mechanize.new
page = @agent.get('http://www.google.com/')
google_form = page.form('f')
google_form.q = "#{SEARCH}"
url = @agent.submit(google_form, google_form.buttons.first)
url.links.each do |link|
if link.href.to_s =~ /url.q/
str = link.href.to_s
str_list = str.split(%r{=|&})
urls = str_list[1]
urls_to_log = URI.decode(urls)
puts urls_to_log
time = 10
loop do
next_page = page.link_with(:text => 'Next')
page = link.click
time -= 1
end
if time == 0
break
end
end
end
I found a bit of a reference here. However it doesn't really explain it in terms that I understand.
What am I doing wrong to where this just sits on the first link, and goes nowhere?
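All you need to do to follow Next links is something like this (check that the link exists before clicking it, since link_with returns nil when nothing matches):
while (next_link = page.link_with(:text => 'Next'))
  page = next_link.click
  # do something with page
end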
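Since the question also wants to stop after ten pages, the same loop can carry a page limit. The following is a minimal sketch, not a definitive implementation: the method name each_page and its limit parameter are hypothetical, and page is assumed to be any object that responds to link_with the way a Mechanize page does.

```ruby
# Yields the starting page and then each page reached by following links
# whose text matches `text`, visiting at most `limit` pages in total.
# Returns the number of pages visited.
def each_page(page, text: 'Next', limit: 10)
  visited = 0
  while page && visited < limit
    yield page
    visited += 1
    next_link = page.link_with(text: text)
    break if next_link.nil? # no further "Next" link: stop
    page = next_link.click
  end
  visited
end

# Intended use with Mechanize (requires the gem and network access):
#   agent = Mechanize.new
#   page  = agent.get('http://www.google.com/')
#   each_page(page, text: 'Next', limit: 10) { |p| puts p.uri }
```

Keeping the counter and the nil check in one place avoids the nested loop from the question, which never terminated because its break sat outside the inner loop do.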

Stripping out results from a website that doesn't have differing URLs

I'm trying to automate the process of searching for alternative telephone numbers using SayNoTo0870 . Every time one searches for an alternate number or name it brings up the '/companysearch.php' page.
Clearly this page has no reference, and in my mind you can't just link to this page.
What I'm hoping to do is use the code below, to automate the opening of a browser, searching of a name/number, stripping out the HTML and then providing the top 5 results. I've got the automation part down, but clearly when trying to save the webpage using Hpricot it only brings up the 'Sorry nothing can be found page' because I can't link directly to the search result page.
Here is my code thus far:
(I've removed comments to shorten it)
require 'rubygems'
require 'watir'
require 'hpricot'
require 'open-uri'
class OH870
def searchName(name)
browser = Watir::Browser.new
browser.goto 'http://www.saynoto0870.com/search.php'
browser.text_field(:name => 'search_name').set name
browser.button(:name => 'submit').click
end
def searchNumber(number)
browser = Watir::Browser.new
browser.goto 'http://www.saynoto0870.com/search.php'
browser.text_field(:name => 'number').set number
browser.button(:name => 'submit').click
end
def loadNew(website)
doc = Hpricot(open(website))
puts(doc)
end
def strip_tags
stripped = website.gsub( %r{</?[^>]+?>}, '' )
puts stripped
end
end # class
class Main < OH870
puts "What is the name of the place you want?"
website = 'http://www.saynoto0870.com/companysearch.php'
question = gets.chomp
whichNumber = OH870.new
whichNumber.searchName(question)
#result = OH870.new
#withoutTags = website.strip_tags
#result.loadNew(withoutTags)
end
Now I'm not sure whether there's a way of "asking" Watir to follow through to the companysearch.php page and dump the results without having to pass this page as a variable.
I wonder if anyone has any suggestions here?
With WATIR, minus the extraneous libraries, here's all it takes to accomplish what you've described (using the 'name' test case only). I've pulled it out of the function format since you already know how to do that, and this will be a clearer test case path.
require 'watir'
@browser = Watir::Browser.new :firefox #open a browser called @browser
@browser.goto "http://(your search page here)" #go to the search page
@browser.text_field(:name => 'name').value = "Awesome" #fill in the 'name' field
@browser.button(:name => 'submit').click #submit the form
If all goes well, we should now be looking at the search results. WATIR already knows it's on a new page - we don't have to specify a URL. In the case that the results are in a frame, we do need to access that frame before we can view its content. Let's pretend they're in a DIV element with an ID of "search_results":
results = @browser.div(:id => "search_results").text
resultsFrame = @browser.frame(:index => 1) #in the case of a frame
results = resultsFrame.div(:id => "search_results").text
As you can see, you do not need to save the entire page to parse the results. They could be in table cells, they could be in a different div per line, or a new frame. All are easily accessible with WATIR to be stored in a variable, array, or immediately written to the console or log file.
@results = Array.new #create an Array to store our results
@browser.divs.each do |div| #for each div element on the page
  if div.id == "search_results" #if the div ID equals "search_results"
    @results << div.text #add it to our array named @results
  end
end
Now, if you just wanted the top 5 there are many ways to access them.
@results[0] #first element
@results[0..4] #first 5 elements
I'd also suggest you look into a few programming principles like DRY (Don't Repeat Yourself). In your function definitions where you see that they share code, like opening the browser and visiting the same URL - you can consolidate those:
def search(how, what)
  @browser = Watir::Browser.new :firefox
  @browser.goto "(that search url again)"
  @browser.text_field(:name => how).value = what
  # etc...
end
search("name", "Hilton")
search("number", "555555")
Since we know that the two available text_field names are "name" and "number", and those make good logical sense as a 'how', we can parameterize them and use a single function for both the Search by Name and Search by Number test cases. This is more efficient, as long as the test cases remain similar enough to be shared.

Mechanize: picking right submit from multiple in same form

I use Mechanize to loop through a table, which is paginated.
I have a problem with a form that holds multiple submit inputs. The input tags are used as pagination and they are generated dynamically. When I loop through the pages I need to scrape, I need to be able to pick the right input, since only one of them will take me to the “next page”. The right tag can be identified by different attributes such as name, class, value etc. My problem is though, that I can’t find out how to tell mechanize which one to use.
I tried this:
require 'mechanize'
require 'yaml'
url = "http://www.somewhere.com"
agent = Mechanize.new
page = agent.get(url)
loop do
puts "some content from site using nokogiri"
if next_page = page.form_with(:action => /.*/)
page = next_page.submit(page.form_with(:action => /.*/).submits[3])
else
break
end
end
This is based on this post: http://rubyforge.org/pipermail/mechanize-users/2008-November/000314.html. But as mentioned, the number of submit tags changes, so picking a hardcoded index into submits is not a good idea.
What I would like to know is if there is a way like this:
loop do
puts "some content from site using nokogiri"
if next_page = page.form_with(:action => /.*/)
page = next_page.submit(:name => /the_right_submit_button/)
else
break
end
end
or something like that, maybe with a css or xpath selector.
I usually use form.button_with to select the right button to click:
form = results_page.forms[0]
results_page = form.submit(form.button_with(:name=>'ctl00$ContentBody$ResultsPager$NextButton'))
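Combined with the loop from the question, this could look roughly like the sketch below. The method name scrape_paginated is hypothetical, and page is assumed to behave like a Mechanize page (forms returning form objects that respond to button_with and submit). The button name is passed in, so a string or a regular expression can be tried depending on how the pager buttons are named.

```ruby
# Yields each page, then submits the first form using the button whose
# name matches `button_name`; stops when no form or no such button exists.
def scrape_paginated(page, button_name)
  loop do
    yield page
    form = page.forms.first
    break if form.nil?
    button = form.button_with(name: button_name)
    break if button.nil? # last page: the "next" button is gone
    page = form.submit(button)
  end
end

# Intended use with Mechanize (requires the gem and network access):
#   agent = Mechanize.new
#   page  = agent.get('http://www.somewhere.com')
#   scrape_paginated(page, /NextButton/) { |pg| puts pg.title }
```

Looking the button up by name on every iteration sidesteps the hardcoded submits[3] index, which breaks as soon as the number of pager buttons changes.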

How to get redirect log in Mechanize?

In ruby, if you use mechanize following 301/302 redirects like this
require 'mechanize'
m = WWW::Mechanize.new
m.get('http://google.com')
how do you get the list of pages Mechanize was redirected through? (Like http://google.com => http://www.google.com => http://google.com.ua)
OK, here is the code in mechanize responsible for redirection
elsif res_klass <= Net::HTTPRedirection
return page unless follow_redirect?
log.info("follow redirect to: #{ response['Location'] }") if log
from_uri = page.uri
raise RedirectLimitReachedError.new(page, redirects) if redirects + 1 > redirection_limit
redirect_verb = options[:verb] == :head ? :head : :get
page = fetch_page( :uri => response['Location'].to_s,
:referer => page,
:params => [],
:verb => redirect_verb,
:redirects => redirects + 1
)
@history.push(page, from_uri)
return page
but trying to m.history.map {|p| puts p.uri} shows 3 times the uri of last page..
The key here is to take advantage of the built-in logging in Mechanize. Here's a full code sample using Ruby's standard Logger class (the same one Rails uses for its logs).
require 'mechanize'
require 'logger'
mechanize_logger = Logger.new('log/mechanize.log')
mechanize_logger.level = Logger::INFO
url = 'http://google.com'
agent = Mechanize.new
agent.log = mechanize_logger
agent.get(url)
And then check the output of log/mechanize.log in your log directory and you'll see the whole mechanize process including the intermediate urls.
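If you want the redirect chain as data rather than reading the log by eye, the logger can write into a StringIO and the targets can be pulled out afterwards. This is a sketch under one assumption: that the log message has the "follow redirect to: <url>" form shown in the Mechanize source quoted in the question. The helper name redirect_targets is hypothetical, and the two log entries are written by hand here so the example runs offline.

```ruby
require 'logger'
require 'stringio'

# Returns the redirect targets found in Mechanize-style log output, in order.
# Relies on the "follow redirect to: <url>" message format (an assumption
# taken from the source snippet quoted in the question).
def redirect_targets(log_text)
  log_text.scan(/follow redirect to: (\S+)/).flatten
end

# Offline demonstration: emit the same message format through a Logger.
io = StringIO.new
logger = Logger.new(io)
logger.level = Logger::INFO
logger.info('follow redirect to: http://www.google.com/')
logger.info('follow redirect to: http://google.com.ua/')

targets = redirect_targets(io.string)

# With Mechanize, the logger would be attached instead of written to by hand:
#   agent = Mechanize.new
#   agent.log = logger
#   agent.get('http://google.com')
#   redirect_targets(io.string)
```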
I'm not certain, but here are a couple of things to try:
see what's in m.history[i].uri after the get()
You might need something like:
for m.redirection_limit in 0..99
begin
m.get(url)
break
rescue WWW::Mechanize::RedirectLimitReachedError
# code here could get control at
# intermediate redirection levels
end
end

Resources