I wrote a simple web crawler using Mechanize, and now I'm stuck on how to get the next page recursively. Below is the code.
def self.generate_page # generate a Mechanize page object for the first page
agent = Mechanize.new
url = "http://www.baidu.com/s?wd=intitle:#{URI.encode(WORD)}%20site:sina.com.cn&rn=50&gpc=stf#{URI.encode(TIME)}"
page = agent.get(url)
page
end
def self.next_page(n_page) # get the next page recursively by clicking the "next" link shown on each page
puts n_page
# if I don't use puts, I get nothing; when using puts, I get
#<Mechanize::Page:0x007fd341c70fd0>
#<Mechanize::Page:0x007fd342f2ce08>
#<Mechanize::Page:0x007fd341d0cf70>
#<Mechanize::Page:0x007fd3424ff5c0>
#<Mechanize::Page:0x007fd341e1f660>
#<Mechanize::Page:0x007fd3425ec618>
#<Mechanize::Page:0x007fd3433f3e28>
#<Mechanize::Page:0x007fd3433a2410>
#<Mechanize::Page:0x007fd342446ca0>
#<Mechanize::Page:0x007fd343462490>
#<Mechanize::Page:0x007fd341c2fe18>
#<Mechanize::Page:0x007fd342d18040>
#<Mechanize::Page:0x007fd3432c76a8>
#which are the results I want
np = Mechanize.new.click(n_page.link_with(:text=>/next/)) unless n_page.link_with(:text=>/next/).nil?
result = next_page(np) unless np.nil?
result # here the value is empty, I don't know what is wrong
end
def self.get_page # trying to pass the result of next_page() method
puts next_page(generate_page)
# it seems result is never passed here,
end
I followed these two links, "What is recursion and how does it work?" and "Ruby recursive function", but I still can't figure out what's wrong. I hope someone can help me out. Thanks.
There are a few issues with your code:
You shouldn't be calling Mechanize.new more than once.
From a stylistic perspective, you are doing too many nil checks.
Unless you have a preference for recursion, it'll probably be easier to do it iteratively.
To have your next_page method return an array containing every link page in the chain, you could write this:
# store the Mechanize agent in a constant so it is created only once
Agent = Mechanize.new
# a helper method to DRY up the code
# returns the next page, or nil when there is no "next" link
def click_to_next_page(page)
  link = page.link_with(:text => /next/)
  link && Agent.click(link)
end
# repeatedly visits the next page until none exists
# returns all seen pages as an array
def get_all_next_pages(page)
  results = []
  # stop as soon as click_to_next_page returns nil, so nil is never stored
  while (page = click_to_next_page(page))
    results.push(page)
  end
  results
end
# testing it out (i'm not actually running this)
base_url = "http://www.baidu.com/s?wd=intitle:#{URI.encode(WORD)}%20site:sina.com.cn&rn=50&gpc=stf#{URI.encode(TIME)}"
root_page = Agent.get(base_url)
next_pages = get_all_next_pages(root_page)
puts next_pages
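If you do prefer recursion, the reason the original next_page returned nothing is that the last call returns nil (there is no next link, so result is never assigned), and that nil is passed back up through every caller. Below is a minimal sketch of a recursive version that returns an array instead. FakePage, collect_pages and demo_chain are hypothetical names used only so the example runs without a network; with Mechanize you would follow the "next" link where this sketch reads next_page. Unlike get_all_next_pages, this version includes the starting page in the result.

```ruby
# Hypothetical stand-in for a Mechanize page: each page knows its successor.
FakePage = Struct.new(:name, :next_page)

# Returns the given page followed by every page reachable from it.
def collect_pages(page)
  # Base case: no page contributes an empty array rather than nil,
  # so every caller up the chain always has an array to append to.
  return [] if page.nil?
  [page] + collect_pages(page.next_page)
end

# Builds a three-page chain: page 1 -> page 2 -> page 3.
def demo_chain
  third = FakePage.new("page 3", nil)
  second = FakePage.new("page 2", third)
  FakePage.new("page 1", second)
end

puts collect_pages(demo_chain).map(&:name).inspect
# prints ["page 1", "page 2", "page 3"]
```

The important part is that every branch of the method returns an array, so the caller always has something to combine with its own page.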
Hi, I am refactoring a small CLI web scraping project I wrote in Ruby, and I was wondering if there is a cleaner way to write a particular section without repeating so much code.
Basically, with the code below I pull data from a website, but I had to do this per page. You will notice that the two methods differ only in their name and the source URL.
def self.scrape_first_page
html = open("https://www.texasblackpages.com/united-states/san-antonio")
doc = Nokogiri::HTML(html)
doc.css('div.grid_element').each do |business|
biz = Business.new
biz.name = business.css('a b').text
biz.type = business.css('span.hidden-xs').text
biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
end
end
def self.scrape_second_page
html = open('https://www.texasblackpages.com/united-states/san-antonio?page=2')
doc = Nokogiri::HTML(html)
doc.css('div.grid_element').each do |business|
biz = Business.new
biz.name = business.css('a b').text
biz.type = business.css('span.hidden-xs').text
biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
end
end
Is there a way for me to streamline this process with just one method pulling from one source, but with the ability to access different pages within the same site, or is this pretty much the best and only way? The owners of the website do not have a public API for me to pull from, in case anyone is wondering.
Remember that in programming you want to steer towards code that follows the Zero, One or Infinity Rule: avoid the dreaded two. In other words, write methods that take no arguments, a fixed argument (one), or an array of unspecified size (infinity).
So the first step is to clean up the scraping function to make it as generic as possible:
def scrape(page)
doc = Nokogiri::HTML(open(page))
# Use map here to return an array of Business objects
doc.css('div.grid_element').map do |business|
Business.new.tap do |biz|
# Use tap to modify this object before returning it
biz.name = business.css('a b').text
biz.type = business.css('span.hidden-xs').text
biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
end
end
end
Note that apart from the extraction code, there's nothing specific about this. Takes a URL, returns Business objects in an Array.
In order to generate pages 1..N, consider this:
def pages(base_url, start: 1)
page = start
Enumerator.new do |y|
loop do
y << base_url % page
page += 1
end
end
end
Now that's an infinite series, but you can always cap it to whatever you want with take(n) or by instead looping until you get an empty list:
# Collect all business from each of the pages...
businesses = pages('https://www.texasblackpages.com/united-states/san-antonio?page=%d').lazy.map do |page|
# ...by scraping the page...
scrape(page)
end.take_while do |results|
# ...and iterating until there's no results, as in Array#any? is false.
results.any?
end.to_a.flatten
The .lazy part means "evaluate each part of the chain sequentially" as opposed to the default behaviour of trying to evaluate each stage to completion. This is important or else it will try and download an infinite number of pages before moving to the next test.
The .to_a on the end forces that chain to run to completion. The .flatten squishes all the page-wise results into a single result set.
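To see the lazy behaviour without touching the network, here is a small sketch that swaps scrape for a stub. fake_scrape, scrape_all, FETCHED and the example.com URL are assumptions made up for this illustration; pages is the same generator as above, repeated so the snippet is self-contained.

```ruby
# Same generator as above: an endless stream of page URLs.
def pages(base_url, start: 1)
  page = start
  Enumerator.new do |y|
    loop do
      y << base_url % page
      page += 1
    end
  end
end

# Records every URL the stub was asked for, to show how many pages get "downloaded".
FETCHED = []

# Hypothetical stand-in for scrape: pages 1..3 have two results each, later pages are empty.
def fake_scrape(url)
  FETCHED << url
  number = url[/page=(\d+)/, 1].to_i
  number <= 3 ? ["biz #{number}a", "biz #{number}b"] : []
end

# Walks the pages lazily and stops at the first page with no results.
def scrape_all(base_url)
  pages(base_url).lazy.map { |url| fake_scrape(url) }.take_while(&:any?).to_a.flatten
end

results = scrape_all('https://example.com/list?page=%d')
puts results.length  # prints 6
puts FETCHED.length  # prints 4: three pages with results plus the empty one that ends the chain
```

Without .lazy, the map step would never finish, because it would try to scrape every page of the endless series before take_while saw a single result.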
Of course if you want to scrape the first N pages, it's a lot easier:
pages('https://www.texasblackpages.com/.../san-antonio?page=%d').take(n).flat_map do |page|
scrape(page)
end
It's almost no code!
This was suggested by @Todd A. Jacobs
def self.scrape(url)
html = open(url)
doc = Nokogiri::HTML(html)
doc.css('div.grid_element').each do |business|
biz = Business.new
biz.name = business.css('a b').text
biz.type = business.css('span.hidden-xs').text
biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
end
end
The downside is that, with no public API, I have to invoke the method as many times as I need it, since the URLs represent different pages within the website. This is fine, though, because I was able to get rid of the repeated methods.
def make_listings
Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio")
Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio?page=2")
Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio?page=3")
Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio?page=4")
end
I once had the same problem as you, and I used a loop. Usually, if a site supports pagination, the first page can also be requested with the page query parameter.
def self.scrape
page = 1
loop do
url = "https://www.texasblackpages.com/united-states/san-antonio?page=#{page}"
# parse the page with Nokogiri
# scrape the data
page += 1
end
end
You can break out of the loop when a certain condition on the page is met.
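As a rough sketch of what such a break condition could look like, the version below stops when a page comes back empty or when a page limit is reached. scrape_until_empty, fetch_page and max_pages are hypothetical names; fetch_page stands in for the download and Nokogiri parsing, so the example runs without a network.

```ruby
# fetch_page is any callable that takes a page number and returns an array of items.
def scrape_until_empty(fetch_page, max_pages: 50)
  results = []
  page = 1
  loop do
    break if page > max_pages      # safety limit so the loop cannot run forever
    items = fetch_page.call(page)
    break if items.empty?          # an empty page means we ran past the last page
    results.concat(items)
    page += 1
  end
  results
end

# Stub: only pages 1 and 2 contain anything.
fake_fetch = ->(page) { page <= 2 ? ["item #{page}"] : [] }
puts scrape_until_empty(fake_fetch).inspect
# prints ["item 1", "item 2"]
```

Whether "empty result list" is the right stop signal depends on the site; some sites show a "no results" element or repeat the last page instead.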
I am trying to scrape the next page of the website called https://www.jobsatosu.com/postings/search. Because there are many jobs, there are many pages. Our team successfully scraped the first page like this:
def initialize
  @agent_menu = Mechanize.new
  @page = @agent_menu.get(PAGE_URL)
  @form = @page.forms[0]
end
I am working on trying to scrape the next page. Also, we were told to use Nokogiri and Mechanize in Ruby. I just have to scrape the next page and do not have to parse it.
This is what I did:
def next_page
  @page_num += 1
  new_url = "https://www.jobsatosu.com/postings/search?page=#{@page_num}"
  @new_page = @agent_menu.get(new_url)
  @new_form = @new_page.forms[0]
end
I made one page_num for all methods to share. Each time someone calls the method, it is incremented by 1, the new URL is built, and the fetched page is stored in @new_page.
I haven't tested this out, but any thoughts on this code?
You need to initialize @page_num = 0 before use.
The first time through, @page_num is nil, so @page_num += 1 raises an exception:
NoMethodError: undefined method '+' for nil:NilClass
Ruby normally doesn't require you to declare variables before using them, but in this case you need to give the variable an initial value.
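A minimal sketch of the difference, without Mechanize: JobPager and BrokenPager are hypothetical classes written only for this illustration, and starting at 0 follows the suggestion above (whether the site's first page is page=0 or page=1 is an assumption you would need to check).

```ruby
class JobPager
  BASE_URL = "https://www.jobsatosu.com/postings/search"

  def initialize
    @page_num = 0 # give the counter a value before anything increments it
  end

  # Advances the shared counter and returns the URL of the next page.
  def next_url
    @page_num += 1
    "#{BASE_URL}?page=#{@page_num}"
  end
end

class BrokenPager
  # @page_num was never assigned, so it is nil here and nil has no + method.
  def next_page_num
    @page_num += 1
  end
end

pager = JobPager.new
puts pager.next_url # prints https://www.jobsatosu.com/postings/search?page=1
puts pager.next_url # prints https://www.jobsatosu.com/postings/search?page=2

begin
  BrokenPager.new.next_page_num
rescue NoMethodError => e
  puts "NoMethodError: #{e.message}"
end
```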
I'm trying to scrape a website; however, I cannot seem to get my while loop to break out once it hits a page with no more information:
def scrape_verse_items(keyword)
pg = 1
while pg < 1000
puts "page #{pg}"
url = "https://www.bible.com/search/bible?page=#{pg}&q=#{keyword}&version_id=1"
doc = Nokogiri::HTML(open(url))
items = doc.css("ul.search-result li.reference")
error = doc.css('div#noresults')
until error.any? do
if keyword != ''
item_hash = {}
items.each do |item|
title = item.css("h3").text.strip
content = item.css("p").text.strip
item_hash[title] = content
end
else
puts "Please enter a valid search"
end
if error.any?
break
end
end
pg += 1
end
item_hash
end
puts scrape_verse_items('joy')
I know this doesn't exactly answer your question, but perhaps you might consider using a different approach altogether.
Using while and until loops can get a bit confusing, and usually isn't the most performant way of doing things.
Maybe you would consider using recursion instead.
I've written a small script that seems to work :
class MyScrapper
def initialize;end
def call(keyword)
return puts("Please enter a valid search") if keyword.nil? || keyword.empty?
scrape({}, keyword, 1)
end
private
def scrape(results, keyword, page)
doc = load_page(keyword, page)
return results if doc.css('div#noresults').any?
build_new_items(doc).merge(scrape(results, keyword, page+1))
end
def load_page(keyword, page)
url = "https://www.bible.com/search/bible?page=#{page}&q=#{keyword}&version_id=1"
Nokogiri::HTML(open(url))
end
def build_new_items(doc)
items = doc.css("ul.search-result li.reference")
items.reduce({}) do |list, item|
title = item.css("h3").text.strip
content = item.css("p").text.strip
list[title] = content
list
end
end
end
You call it by doing MyScrapper.new.call("Keyword"). (It might make more sense to have this as a module you include, or even to make these class methods, to avoid the need to instantiate the class.)
What this does is call a method named scrape, giving it the starting results, the keyword, and the page. It loads the page; if there are no results, it returns the results it has found so far.
Otherwise it builds a hash from the page it loaded, then the method calls itself and merges the results with the new hash it just built. It does this until there are no more results.
If you want to limit the number of pages, you can change this line:
return results if doc.css('div#noresults').any?
to this:
return results if doc.css('div#noresults').any? || page > 999
Note: You might want to double-check the results that are being returned are correct. I think they should be but I wrote this quite quickly, so there could always be a small bug hiding somewhere in there.
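One way to double-check the recursion is to run the same shape of code against in-memory data. This is only a sketch: FAKE_RESULTS and scrape_pages are hypothetical, a missing page number plays the role of the div#noresults check, and max_page plays the role of the page > 999 limit.

```ruby
# Hypothetical in-memory "site": page number => { title => content }.
FAKE_RESULTS = {
  1 => { "Verse A" => "text A" },
  2 => { "Verse B" => "text B" }
}.freeze

# Merges the items of this page with the items of all following pages.
def scrape_pages(pages, page = 1, max_page: 999)
  items = pages[page]
  # Base case: no such page, or the page limit was exceeded.
  return {} if items.nil? || page > max_page
  items.merge(scrape_pages(pages, page + 1, max_page: max_page))
end

puts scrape_pages(FAKE_RESULTS).keys.inspect
# prints ["Verse A", "Verse B"]
puts scrape_pages(FAKE_RESULTS, max_page: 1).keys.inspect
# prints ["Verse A"]
```

Because the results are combined with Hash#merge, a title that appears on more than one page keeps only the content from the later page.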
I want to collect the names of users in a particular group, called Nature, in the photo-sharing website Fotolog. This is my code:
require 'rubygems'
require 'mechanize'
require 'csv'
def getInitUser()
agent1 = Mechanize.new
number = 0
while number<=500
address = 'http://http://www.fotolog.com/nature/participants/#{number}/'
logfile2 = File.new("Fotolog/Users.csv","a")
tryConut = 0
begin
page = agent1.get(address)
rescue
tryConut=tryConut+1
if tryConut<5
retry
end
return
end
arrayUsers= []
# search for the users
page.search("a[class=img_border_radius").map do |opt|
link = opt.attributes['href'].text
link = link.gsub("http://www.fotolog.com/","").gsub("/","")
arrayUsers << link
logfile2.print("#{link}\n")
end
number = number+100
end
return arrayUsers
end
arrayUsers = getInitUser()
arrayUsers.each do |user|
getFriend(user)
end
But the Users.csv file I am getting is empty. What's wrong here? I suspect it might have something to do with the "class" tag I am using. But from the inspect element, it seems to be the correct class, isn't it? I am just getting started with web crawling, so I apologise if this is a silly query.
I am pulling bitbucket repo list using Ruby. The response from bitbucket will contain only 10 repositories and a marker for the next page where there will be another 10 repos and so on ... (they call it pagination)
So, I wrote a recursive function which calls itself if the next page marker is present. This will continue until it reaches the last page.
Here is my code:
#!/usr/local/bin/ruby
require 'net/http'
require 'json'
require 'awesome_print'
@repos = Array.new
def recursive(url)
### here goes my net/http code which connects to bitbucket and pulls back the response in a JSON as request.body
### hence, removing this code for brevity
hash = JSON.parse(response.body)
hash["values"].each do |x|
@repos << x["links"]["self"]["href"]
end
if hash["next"]
puts "next page exists"
puts "calling recursive with: #{hash["next"]}"
recursive(hash["next"])
else
puts "this is the last page. No more recursions"
end
end
repo_list = recursive('https://my_bitbucket_url')
@repos.each {|x| puts x}
Now, my code works fine and it lists all the repos.
Question:
I am new to Ruby, so I am not sure about the way I have used the global variable @repos = Array.new above. If I define the array inside the function, then each call to the function will create a new array, overwriting its contents from the previous call.
So, how do Ruby programmers use a global symbol in such cases? Does my code follow Ruby conventions, or is it a really amateur (yet correct, because it works) way of doing it?
The consensus is to avoid global variables as much as possible.
I would either build the collection recursively like this:
def recursive(url)
### ...
result = []
hash["values"].each do |x|
result << x["links"]["self"]["href"]
end
if hash["next"]
result += recursive(hash["next"])
end
result
end
or hand over the collection to the function:
def recursive(url, result = [])
### ...
hash["values"].each do |x|
result << x["links"]["self"]["href"]
end
if hash["next"]
recursive(hash["next"], result)
end
result
end
Either way you can call the function like this:
repo_list = recursive(url)
And I would write it like this:
def recursive(url)
# ...
result = hash["values"].map { |x| x["links"]["self"]["href"] }
result += recursive(hash["next"]) if hash["next"]
result
end
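For completeness, here is a runnable sketch of that last version with the elided HTTP call replaced by canned data. FAKE_RESPONSES, collect_repos and the page1/page2 keys are hypothetical; in real code the JSON body would come from your net/http request instead of a hash lookup.

```ruby
require 'json'

# Hypothetical canned responses keyed by URL, standing in for the elided net/http call.
FAKE_RESPONSES = {
  "page1" => JSON.generate({
    "values" => [{ "links" => { "self" => { "href" => "repo-a" } } }],
    "next" => "page2"
  }),
  "page2" => JSON.generate({
    "values" => [{ "links" => { "self" => { "href" => "repo-b" } } }]
  })
}.freeze

# Returns the hrefs on this page plus those of every following page.
def collect_repos(url, responses)
  hash = JSON.parse(responses.fetch(url))
  result = hash["values"].map { |x| x["links"]["self"]["href"] }
  result += collect_repos(hash["next"], responses) if hash["next"]
  result
end

puts collect_repos("page1", FAKE_RESPONSES).inspect
# prints ["repo-a", "repo-b"]
```

No state lives outside the method: each call builds its own array and the caller concatenates, which is why no global or top-level instance variable is needed.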