Clicking through google pages with mechanize - ruby

I'm trying to figure out how to use the link_with method in Ruby's Mechanize gem. I've got the basic concept down:
page = <site>
blah blah blah
next_page = page.link_with(:text => "Next")
page = link.click
However, when I use this in a little test it goes very slowly. What I'm trying to do is loop through the first ten pages of Google results using a loop with a counter variable that counts down from 10; when the counter hits 0 the program should break out of the loop. It seems like it's working, but it only pulls the first link off of Google and just sits there.
Source:
require 'mechanize'
require 'uri'

SEARCH = "test"

@agent = Mechanize.new
page = @agent.get('http://www.google.com/')
google_form = page.form('f')
google_form.q = "#{SEARCH}"
url = @agent.submit(google_form, google_form.buttons.first)

url.links.each do |link|
  if link.href.to_s =~ /url.q/
    str = link.href.to_s
    str_list = str.split(%r{=|&})
    urls = str_list[1]
    urls_to_log = URI.decode(urls)
    puts urls_to_log
    time = 10
    loop do
      next_page = page.link_with(:text => 'Next')
      page = link.click
      time -= 1
    end
    if time == 0
      break
    end
  end
end
I found a bit of a reference here, but it doesn't really explain it in terms I understand.
What am I doing wrong that makes this just sit on the first link and go nowhere?

All you need to do to follow "Next" links is something like:
while link = page.link_with(:text => 'Next')
  page = link.click
  # do something with page
end
Assigning the link first and letting the loop stop on nil avoids the NoMethodError you would otherwise get from calling click on a missing link once the last page is reached.
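Applied to the original goal of walking the first ten result pages, a minimal sketch (reusing the question's Google form setup; Google's markup may have changed since) could look like:
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.google.com/')
google_form = page.form('f')
google_form.q = 'test'
page = agent.submit(google_form, google_form.buttons.first)

pages_left = 10
while pages_left > 0
  # Print every result link on the current page.
  page.links.each do |link|
    puts link.href if link.href.to_s =~ /url.q/
  end
  next_link = page.link_with(:text => 'Next')
  break unless next_link # the last page has no "Next" link
  page = next_link.click
  pages_left -= 1
end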

Related

Repeated results from youtube while crawling

I am trying to fetch results from Google and save them to a file, but the results are getting repeated.
Also, when I save them to the file, only the last link gets written.
require 'rubygems'
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.google.com/videohp')
google_form = page.form('f')
google_form.q = 'ruby'
page = agent.submit(google_form, google_form.buttons.first)
linky = page.links
for link in linky do
  if link.href.to_s =~ /url.q/
    str = link.href.to_s
    strList = str.split(%r{=|&})
    $url = strList[1].gsub("h%3Fv%3D", "h?v=")
    $heading = link.text
    $res = $url
    if ($url.to_s.include? "webcache")
      next
    elsif ($url.to_s.include? "channel")
      next
    end
    puts $res
  end
end
for link in linky do
  File.open("aaa.htm", 'w') { |file| file.write($res) }
end
I also tried writing the file inside the loop:
require 'rubygems'
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.google.com/videohp')
google_form = page.form('f')
google_form.q = 'ruby'
page = agent.submit(google_form, google_form.buttons.first)
linky = page.links
for link in linky do
  if link.href.to_s =~ /url.q/
    str = link.href.to_s
    strList = str.split(%r{=|&})
    $url = strList[1].gsub("h%3Fv%3D", "h?v=")
    $heading = link.text
    $res = $url
    if ($url.to_s.include? "webcache")
      next
    elsif ($url.to_s.include? "channel")
      next
    end
    puts $res
    File.open("aaa.htm", 'w') { |file| file.write($res) }
  end
end
This is really two questions, and it's clear you're just starting out with Ruby. You will get better with practice, but it would help to keep reading up on the fundamentals of the language; this looks a bit like PHP written in Ruby.
First up, the links are quite probably showing up multiple times because they are present more than once in the page. You aren't doing anything to catch that.
Secondly, you have a global variable (these tend to cause problems and should only be used if you can't find an alternative) into which you are putting each URL, but every time you do that you overwrite what you had before. So every time you execute $res = $url you overwrite whatever was in $res with the latest $url you got.
If you made an array instead of the single value $res (it can be a local variable too), then you could just use myArray.push(url) to add each new URL to it.
When you have all the URLs in your array, you could call myArray.uniq to get rid of the duplicates before you write it out to your file.
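A minimal sketch of that approach, keeping the question's parsing logic but collecting into a local array instead of the global $res:
urls = []
page.links.each do |link|
  next unless link.href.to_s =~ /url.q/
  url = link.href.to_s.split(%r{=|&})[1].gsub("h%3Fv%3D", "h?v=")
  next if url.include?("webcache") || url.include?("channel")
  urls.push(url)
end

# Drop duplicates, then write one URL per line.
File.open("aaa.htm", "w") do |file|
  urls.uniq.each { |url| file.puts(url) }
end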
It looks like you don't really know Ruby yet.
Please do not use global variables unless you really need them; in this case you don't, this isn't PHP. Simple assignment is enough. :)
To iterate through a collection, use the dedicated #each method. In your case you want to filter the collection of links and keep the valid ones: valid_links = links.select { |link| ... }.
Return false from the block for links that don't match your needs, true for those that do.
For the file, iterate through the collection inside the File.open block (you will have valid_links to go through).
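Putting that together, a sketch of the select-then-write approach (same matching rules as in the question):
valid_links = page.links.select do |link|
  href = link.href.to_s
  href =~ /url.q/ && !href.include?("webcache") && !href.include?("channel")
end

File.open("aaa.htm", "w") do |file|
  valid_links.each { |link| file.puts(link.href) }
end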

Nokogiri Throwing Exception in Function but not outside of Function

I'm new to Ruby and am using Nokogiri to parse HTML web pages. An error is thrown in a function when it gets to the line:
currentPage = Nokogiri::HTML(open(url))
I have verified the function's inputs; url is a string containing a web address. The line I mentioned works exactly as intended when used outside the function, but not inside it. When it reaches that line inside the function, the following error is thrown:
WebCrawler.rb:25:in `explore': undefined method `+@' for #<Nokogiri::HTML::Document:0x007f97ea0cdf30> (NoMethodError)
from WebCrawler.rb:43:in `<main>'
The function containing the problematic line is pasted below.
def explore(url)
  if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
    return
  end
  CRAWLED_PAGES_COUNTER++
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do |link|
    puts link
    explore(link)
  end
end
Here is the full program (It's not much longer):
require 'nokogiri'
require 'open-uri'

# Crawler Params
START_URL = "https://en.wikipedia.org"
CRAWLED_PAGES_COUNTER = 0
CRAWLED_PAGES_LIMIT = 5

# Crawler Functions
def explore(url)
  if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
    return
  end
  CRAWLED_PAGES_COUNTER++
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do |link|
    puts link
    explore(link)
  end
end

def eval_page(page)
  puts page.title
end

# Start Crawling
explore(START_URL)
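The culprit is CRAWLED_PAGES_COUNTER++: Ruby has no ++ operator. The ++ followed by a newline is parsed as a binary + whose right operand is the unary +(currentPage = Nokogiri::HTML(open(url))), so Ruby ends up calling the unary +@ method on the Nokogiri document, which is exactly the NoMethodError in the message. Use += 1 instead. Also, Ruby forbids reassigning a constant inside a method (dynamic constant assignment is a SyntaxError), which is why the fixed version below switches to global variables: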
require 'nokogiri'
require 'open-uri'

# Crawler Params
$START_URL = "https://en.wikipedia.org"
$CRAWLED_PAGES_COUNTER = 0
$CRAWLED_PAGES_LIMIT = 5

# Crawler Functions
def explore(url)
  if $CRAWLED_PAGES_COUNTER > $CRAWLED_PAGES_LIMIT
    return
  end
  $CRAWLED_PAGES_COUNTER += 1
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do |link|
    puts link
    explore(link)
  end
end

def eval_page(page)
  puts page.title
end

# Start Crawling
explore($START_URL)
Just to give you something to build from, this is a simple spider that only harvests and visits links. Modifying it to do other things would be easy.
require 'nokogiri'
require 'open-uri'
require 'set'

BASE_URL = 'http://example.com'
URL_FORMAT = '%s://%s:%s'
SLEEP_TIME = 30 # in seconds

urls = [BASE_URL]
last_host = BASE_URL
visited_urls = Set.new
visited_hosts = Set.new

until urls.empty?
  this_uri = URI.join(last_host, urls.shift)
  next if visited_urls.include?(this_uri)

  puts "Scanning: #{this_uri}"

  doc = Nokogiri::HTML(this_uri.open)
  visited_urls << this_uri

  if visited_hosts.include?(this_uri.host)
    puts "Sleeping #{SLEEP_TIME} seconds to reduce server load..."
    sleep SLEEP_TIME
  end

  visited_hosts << this_uri.host

  urls += doc.search('[href]').map { |node|
    node['href']
  }.select { |url|
    extension = File.extname(URI.parse(url).path)
    extension[/\.html?$/] || extension.empty?
  }

  last_host = URL_FORMAT % [:scheme, :host, :port].map { |s| this_uri.send(s) }
  puts "#{urls.size} URLs remain."
end
It:
Works on http://example.com. That site is designed and designated for experimenting.
Checks to see if a page was visited previously and won't scan it again. It's a naive check and will be fooled by URLs containing queries or queries that are not in a consistent order.
Checks to see if a site was previously visited and automatically throttles the page retrieval if so. It could be fooled by aliases.
Checks to see if a page ends with ".htm", ".html" or has no extension. Anything else is ignored.
The actual code for an industrial-strength spider is much more involved: robots.txt files need to be honored, figuring out how to deal with pages that redirect to other pages via HTTP timeouts or JavaScript redirects is a fun task, and dealing with malformed pages is a challenge....
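As an aside, if you crawl with Mechanize instead of bare open-uri, it can honor robots.txt for you. A minimal sketch (assuming a reasonably recent Mechanize):
require 'mechanize'

agent = Mechanize.new
agent.robots = true # obey robots.txt; disallowed fetches raise an error

begin
  page = agent.get('http://example.com/')
  puts page.title
rescue Mechanize::RobotsDisallowedError => e
  puts "Blocked by robots.txt: #{e}"
end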

How to scrape a web page with dynamic content added by JavaScript?

I am trying to scrape this web page. It uses lazy loading: as you scroll, more content gets loaded. Using Nokogiri I am able to scrape the initial page, but not the rest of the page, which loads after scrolling.
To get the lazy-loaded content, scrape the AJAX pages the site itself requests as you scroll:
http://www.flipkart.com/mens-footwear/shoes/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&sid=osp%2Ccil%2Cnit%2Ce1f&start=31&ajax=true
http://www.flipkart.com/mens-footwear/shoes/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&sid=osp%2Ccil%2Cnit%2Ce1f&start=46&ajax=true
http://www.flipkart.com/mens-footwear/shoes/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&sid=osp%2Ccil%2Cnit%2Ce1f&start=61&ajax=true
...
require 'rubygems'
require 'nokogiri'
require 'mechanize'
require 'open-uri'

number = 1
while true
  url = "http://www.flipkart.com/mens-footwear/shoes" +
        "/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&" +
        "sid=osp%2Ccil%2Cnit%2Ce1f&start=#{number}&ajax=true"
  doc = Nokogiri::HTML(open(url))
  doc = Nokogiri::HTML(doc.at_css('#ajax').text)
  products = doc.css(".browse-product")
  break if products.size == 0

  products.each do |item|
    title = item.at_css(".fk-display-block,.title").text.strip
    price = (item.at_css(".pu-final").text || '').strip
    link  = item.at_xpath(".//a[@class='fk-display-block']/@href")
    image = item.at_xpath(".//div/a/img/@src")
    puts number
    puts "#{title} - #{price}"
    puts "http://www.flipkart.com#{link}"
    puts image
    puts "========================"
    number += 1
  end
end

Ruby Mechanize: Follow a Link

In Mechanize on Ruby, I have to assign a new variable to every new page I come to. For example:
page2 = page1.link_with(:text => "Continue").click
page3 = page2.link_with(:text => "About").click
...etc
Is there a way to run Mechanize without a variable holding every page state? Something like:
my_only_page.link_with(:text => "Continue").click!
my_only_page.link_with(:text => "About").click!
I don't know if I understand your question correctly, but if it's a matter of looping through a lot of pages dynamically and processing them, you could do it like this:
require 'mechanize'

url = "http://example.com"
agent = Mechanize.new
page = agent.get(url) # Get the starting page

loop do
  # What you want to do on the page - e.g. extract something...
  item = page.parser.css('.some_item').text
  puts item # ...or persist it however you like

  if link = page.link_with(:text => "Continue") # As long as there is still a next-page link...
    page = link.click
  else # If no link is left, break out of the loop
    break
  end
end

Ruby - Mechanize: Select link by classname and other questions

At the moment I'm having a look at Mechanize.
I am pretty new to Ruby, so please be patient.
I wrote a little test script:
require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
page = agent.get('http://www.google.de')
pp page.title
google_form = page.form_with(:name => 'f')
google_form.q = 'test'
page = agent.submit(google_form)
pp page.title
page_links = Array.new
page.links.each do |ll|
  page_links << ll
end
puts page_links.size
This works, but page_links includes not only the search results; it also includes Google's own links like Login, Pictures, ...
The result links carry the style class "l". Is it possible to select only the links with class == 'l'? How do I achieve this?
Is it possible to modify the agent alias? If I own a website that includes Google Analytics or similar, which browser client will I see in GA when Mechanize visits my site?
Can I select elements by their ID instead of their name? I tried to use
my_form = page.form_with(:id => 'myformid')
but this does not work.
In cases like yours I use Nokogiri DOM search.
Here is your code, rewritten a little:
require 'rubygems'
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.google.de')
pp page.title
google_form = page.form_with(:name => 'f')
google_form.q = 'test'
page = agent.submit(google_form)
pp page.title
page_links = Array.new
# Maybe you'd better use 'h3.r > a.l' here;
# page.parser here is a Nokogiri::HTML::Document.
page.parser.css("a.l").each do |ll|
  page_links << ll
  puts ll.text + " => " + ll["href"]
end
puts page_links.size
Probably this article is a good place to start:
getting-started-with-nokogiri
By the way, the samples in the article also deal with Google search ;)
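On the agent-alias question: Mechanize can impersonate common browsers via its built-in aliases, and that User-Agent string is what server logs will record. Note that Google Analytics is JavaScript-based and Mechanize doesn't execute JavaScript, so Mechanize visits typically won't show up in GA at all. A quick sketch:
require 'mechanize'

agent = Mechanize.new
# Pick one of Mechanize's built-in aliases, e.g. 'Mac Safari' or 'Windows IE 7';
# this sets the User-Agent header sent with every request.
agent.user_agent_alias = 'Mac Safari'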
You can build a list of just the search result links by changing your code as follows:
page.links.each do |ll|
  cls = ll.attributes.attributes['class']
  page_links << ll if cls && cls.value == 'l'
end
For each element ll in page.links, ll.attributes is a Nokogiri::XML::Element, and ll.attributes.attributes is a Hash containing the attributes of the link. Hence the need for ll.attributes.attributes to get at the actual class, and for the nil check before comparing the value to 'l'.
The problem with using :id in the criteria to find a form is that it clashes with Ruby's Object#id method for returning a Ruby object's internal id. I'm not sure what the workaround for this is. You would have no problem selecting the form by some other attribute (e.g. its action).
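For example, you could match on the action, or drop down to the parsed document and wrap the node yourself (a sketch; the action string and 'myformid' are placeholders from the question):
# Select the form by another attribute, e.g. its action:
my_form = page.form_with(:action => '/search')

# Or find the node by id in the underlying Nokogiri document and wrap it:
node = page.parser.at_css('form#myformid')
my_form = Mechanize::Form.new(node, agent, page) if node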
I believe the selector you are looking for is :dom_id, e.g. in your case:
my_form = page.form_with(:dom_id => 'myformid')
