Why is this ruby code returning a blank page instead of filling it up with user names? - ruby

I want to collect the names of users in a particular group, called Nature, in the photo-sharing website Fotolog. This is my code:
require 'rubygems'
require 'mechanize'
require 'csv'
def getInitUser()
agent1 = Mechanize.new
number = 0
while number<=500
address = 'http://http://www.fotolog.com/nature/participants/#{number}/'
logfile2 = File.new("Fotolog/Users.csv","a")
tryConut = 0
begin
page = agent1.get(address)
rescue
tryConut=tryConut+1
if tryConut<5
retry
end
return
end
arrayUsers= []
# search for the users
page.search("a[class=img_border_radius").map do |opt|
link = opt.attributes['href'].text
link = link.gsub("http://www.fotolog.com/","").gsub("/","")
arrayUsers << link
logfile2.print("#{link}\n")
end
number = number+100
end
return arrayUsers
end
arrayUsers = getInitUser()
arrayUsers.each do |user|
getFriend(user)
end
But the Users.csv file I am getting is empty. What's wrong here? I suspect it might have something to do with the "class" tag I am using. But from the inspect element, it seems to be the correct class, isn't it? I am just getting started with web crawling, so I apologise if this is a silly query.
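A few things in the code above may be worth checking before blaming the class name; this is a sketch, not a confirmed diagnosis. The address is written in single quotes, so `#{number}` is never interpolated, and the string also starts with a doubled `http://http://`. The selector `"a[class=img_border_radius"` is missing its closing `]`. Finally, `arrayUsers = []` sits inside the loop, so it is reset on every pass. The snippet below only illustrates the quoting difference and builds the page addresses; the variable names `literal`, `interpolated` and `addresses` are made up for the illustration.

```ruby
# Single quotes keep "#{...}" as literal text; double quotes interpolate it.
number = 100
literal      = 'http://www.fotolog.com/nature/participants/#{number}/'
interpolated = "http://www.fotolog.com/nature/participants/#{number}/"

puts literal       # still contains the characters #{number}
puts interpolated  # contains 100

# Build all page addresses up front (0, 100, ..., 500), with a single "http://".
addresses = (0..500).step(100).map do |n|
  "http://www.fotolog.com/nature/participants/#{n}/"
end
puts addresses.size

# A well-formed version of the selector would be 'a[class=img_border_radius]'
# (exact attribute match) or 'a.img_border_radius' (class among several).
```

With the addresses built this way, the results array could be created once before the loop so that names from every page accumulate in it.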

Related

Nokogiri Throwing Exception in Function but not outside of Function

I'm new to Ruby and am using Nokogiri to parse html webpages. An error is thrown in a function when it gets to the line:
currentPage = Nokogiri::HTML(open(url))
I have verified the inputs of the function; url is a string with a web address. The line I mentioned previously works exactly as intended when used outside of the function, but not inside. When it gets to that line inside the function, the following error is thrown:
WebCrawler.rb:25:in `explore': undefined method `+@' for #<Nokogiri::HTML::Document:0x007f97ea0cdf30> (NoMethodError)
from WebCrawler.rb:43:in `<main>'
The function the problematic line is in is pasted below.
def explore(url)
if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
return
end
CRAWLED_PAGES_COUNTER++
currentPage = Nokogiri::HTML(open(url))
links = currentPage.xpath('//@href').map(&:value)
eval_page(currentPage)
links.each do|link|
puts link
explore(link)
end
end
Here is the full program (It's not much longer):
require 'nokogiri'
require 'open-uri'
#Crawler Params
START_URL = "https://en.wikipedia.org"
CRAWLED_PAGES_COUNTER = 0
CRAWLED_PAGES_LIMIT = 5
#Crawler Functions
def explore(url)
if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
return
end
CRAWLED_PAGES_COUNTER++
currentPage = Nokogiri::HTML(open(url))
links = currentPage.xpath('//@href').map(&:value)
eval_page(currentPage)
links.each do|link|
puts link
explore(link)
end
end
def eval_page(page)
puts page.title
end
#Start Crawling
explore(START_URL)
require 'nokogiri'
require 'open-uri'
#Crawler Params
$START_URL = "https://en.wikipedia.org"
$CRAWLED_PAGES_COUNTER = 0
$CRAWLED_PAGES_LIMIT = 5
#Crawler Functions
def explore(url)
if $CRAWLED_PAGES_COUNTER > $CRAWLED_PAGES_LIMIT
return
end
$CRAWLED_PAGES_COUNTER+=1
currentPage = Nokogiri::HTML(open(url))
links = currentPage.xpath('//@href').map(&:value)
eval_page(currentPage)
links.each do|link|
puts link
explore(link)
end
end
def eval_page(page)
puts page.title
end
#Start Crawling
explore($START_URL)
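For context on the error: Ruby has no `++` operator. Written as `CRAWLED_PAGES_COUNTER++`, the trailing plus signs leave the expression unfinished, so the parser carries on into the next line and ends up calling unary plus on the Nokogiri document, which matches the NoMethodError in the question. Ruby also rejects reassigning a constant inside a method, which is presumably why the version above switches to globals. As an alternative to globals, the counter could live in a small object. The following is a minimal sketch; the class name `PageBudget` and its methods are hypothetical.

```ruby
# Hypothetical helper that tracks how many pages have been crawled.
class PageBudget
  attr_reader :count

  def initialize(limit)
    @limit = limit
    @count = 0
  end

  # True once the limit has been reached.
  def exhausted?
    @count >= @limit
  end

  # Ruby's way of incrementing: += 1 (there is no ++).
  def record_visit
    @count += 1
  end
end

budget = PageBudget.new(5)
7.times { budget.record_visit unless budget.exhausted? }
puts budget.count
```

An `explore(url, budget)` method could then return early when `budget.exhausted?` is true and call `budget.record_visit` before fetching each page.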
Just to give you something to build from, this is a simple spider that only harvests and visits links. Modifying it to do other things would be easy.
require 'nokogiri'
require 'open-uri'
require 'set'
BASE_URL = 'http://example.com'
URL_FORMAT = '%s://%s:%s'
SLEEP_TIME = 30 # in seconds
urls = [BASE_URL]
last_host = BASE_URL
visited_urls = Set.new
visited_hosts = Set.new
until urls.empty?
this_uri = URI.join(last_host, urls.shift)
next if visited_urls.include?(this_uri)
puts "Scanning: #{this_uri}"
doc = Nokogiri::HTML(this_uri.open)
visited_urls << this_uri
if visited_hosts.include?(this_uri.host)
puts "Sleeping #{SLEEP_TIME} seconds to reduce server load..."
sleep SLEEP_TIME
end
visited_hosts << this_uri.host
urls += doc.search('[href]').map { |node|
node['href']
}.select { |url|
extension = File.extname(URI.parse(url).path)
extension[/\.html?$/] || extension.empty?
}
last_host = URL_FORMAT % [:scheme, :host, :port].map{ |s| this_uri.send(s) }
puts "#{urls.size} URLs remain."
end
It:
Works on http://example.com. That site is designed and designated for experimenting.
Checks to see if a page was visited previously and won't scan it again. It's a naive check and will be fooled by URLs containing queries or queries that are not in a consistent order.
Checks to see if a site was previously visited and automatically throttles the page retrieval if so. It could be fooled by aliases.
Checks to see if a page ends with ".htm", ".html" or has no extension. Anything else is ignored.
The actual code to write an industrial-strength spider is much more involved. Robots.txt files need to be honored, figuring out how to deal with pages that redirect to other pages either via HTTP timeouts or JavaScript redirects is a fun task, dealing with malformed pages is a challenge....
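The naive visited-URL check described in the list above could be tightened by normalizing each URL before it is stored. The sketch below is one possible approach, not part of the spider itself: `normalize_url` is a hypothetical helper that lowercases the host, drops the fragment, and sorts the query parameters so that the same page reached with a different parameter order compares equal.

```ruby
require 'uri'
require 'set'

# Hypothetical helper: returns a canonical string form of a URL.
def normalize_url(url)
  uri = URI.parse(url)
  uri.host = uri.host.downcase if uri.host
  uri.fragment = nil                       # "#section" never reaches the server
  if uri.query
    pairs = URI.decode_www_form(uri.query).sort
    uri.query = pairs.empty? ? nil : URI.encode_www_form(pairs)
  end
  uri.path = '/' if uri.path.empty?        # treat "http://host" like "http://host/"
  uri.to_s
end

visited = Set.new
visited << normalize_url('http://Example.com/a?b=2&a=1#top')
puts visited.include?(normalize_url('http://example.com/a?a=1&b=2'))
```

This still does not catch host aliases or redirects, so it only narrows the gap rather than closing it.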

GitHub Archive - Issues with retrieving data with ranges

I am trying to retrieve data from GitHub Archive (https://www.githubarchive.org/) and am having trouble retrieving data when I add a range. It works when I use http://data.githubarchive.org/2015-01-01-15.json.gz, but I get an `open_http': 404 Not Found (OpenURI::HTTPError) message when using http://data.githubarchive.org/2015-01-01-{0..23}.json.gz.
Using curl http://data.githubarchive.org/2015-01-01-{0..23}.json.gz seems to be working.
Basically, my goal is to write a program to retrieve the top 42 most active repositories over a certain time range.
Here's my code. Please let me know if I'm using the API incorrectly or if there are issues in the code.
require 'open-uri'
require 'zlib'
require 'yajl'
require 'pry'
require 'date'
events = Hash.new(0)
type = 'PushEvent'
after = '2015-01-01T13:00:00Z'
before = '2015-01-02T03:12:14-03:00'
f_after_time = DateTime.parse(after).strftime('%Y-%m-%d-%H')
f_before_time = DateTime.parse(before).strftime('%Y-%m-%d-%H')
base = 'http://data.githubarchive.org/'
# query = '2015-01-01-15.json.gz'
query = '2015-01-01-{0..23}.json.gz'
url = base + query
uri = URI.encode(url)
gz = open(uri)
js = Zlib::GzipReader.new(gz).read
Yajl::Parser.parse(js) do |event|
if event['type'] == type
if event['repository']
repo_name = event['repository']['url'].gsub('https://github.com/', '')
events[repo_name] +=1
elsif event['repo'] #to account for older api
repo_name = event['repo']['url'].gsub('https://github.com/', '')
events[repo_name] +=1
end
end
end
# Sort events based on # of events and return top 42 repos
sorted_events = events.sort_by {|_key, value| value}.reverse.first(42)
sorted_events.each { |e| puts "#{e[0]} - #{e[1]} events" }
I believe brackets are not allowed in URL, so maybe you should try urlencoding it?
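Another possibility worth considering: on the command line, a shell such as bash expands `{0..23}` into 24 separate URLs before curl ever runs, whereas open-uri sends the braces to the server literally, which would explain the 404. Under that assumption, the range has to be expanded in Ruby. The sketch below only builds the URLs; `hourly_archive_urls` and `BASE` are names invented for the illustration.

```ruby
require 'date'

BASE = 'http://data.githubarchive.org/'

# Hypothetical helper: one archive URL per hour of the given day.
def hourly_archive_urls(day, hours = 0..23)
  hours.map { |hour| "#{BASE}#{day.strftime('%Y-%m-%d')}-#{hour}.json.gz" }
end

urls = hourly_archive_urls(Date.new(2015, 1, 1))
puts urls.first
puts urls.last
puts urls.size
```

Each URL would then be opened and passed through Zlib::GzipReader on its own, with the event counts accumulated in the same `events` hash. Note also that `URI.encode` was removed in Ruby 3.0, so that call would need to go on newer Rubies.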

Ruby script with mechanize and nokogiri

I asked a question earlier and got an answer which solved it, but I now realise I need more help: I have spent the last few hours trying to fix this but keep getting a "nil:NilClass" error. Basically I need to go through every show listed on this site (so through each letter, and through each page this letter has) and get the following:
1. The show's title (this part I have done)
2. Then either copy the page URL for each show and add "/episodes/" to the end of it, or click the show and then the episode tab and copy that URL.
This is what I have so far:
require 'mechanize'
shows = Array.new
agent = Mechanize.new
agent.get 'http://www.tv.com/shows/sort/a_z/'
agent.page.search('//div[@class="alphabet"]//li[not(contains(@class, "selected"))]/a').each do |letter_link|
agent.get letter_link[:href]
agent.page.search('//li[@class="show"]/a').each { |show_link| shows << show_link.text }
while next_page_link = agent.page.at('//div[@class="_pagination"]//a[@class="next"]') do
agent.get next_page_link[:href]
agent.page.search('//li[@class="show"]/a').each { |show_link| shows << show_link.text }
end
end
require 'pp'
pp shows
So an end result would look something like the following:
Title: Game of Thrones
URL: http://www.tv.com/shows/game-of-thrones/episodes/
I have tried everything (even writing it from scratch) but just can't seem to add the extra parts, so I was hoping someone here may be able to help me do so. Thanks
How about this
require 'mechanize'
shows = {}
base_uri = "http://www.tv.com/"
agent = Mechanize.new
agent.get 'http://www.tv.com/shows/sort/a_z/'
agent.page.search('//div[@class="alphabet"]//li[not(contains(@class, "selected"))]/a').each do |letter_link|
agent.get letter_link[:href]
letter = letter_link.text.upcase
# Use + rather than << so base_uri itself is not modified on every show
shows[letter] = agent.page.search('//li[@class="show"]/a').map{ |show_link| {show_link.text => base_uri + show_link[:href].to_s + 'episodes/'} }
while next_page_link = agent.page.at('//div[@class="_pagination"]//a[@class="next"]') do
agent.get next_page_link[:href]
shows[letter] << agent.page.search('//li[@class="show"]/a').map{ |show_link| {show_link.text => base_uri + show_link[:href].to_s + 'episodes/'} }
end
shows[letter].flatten!
end
puts shows
This will create the following structure Hash[letter] => Array[{ShowName => LinktoEpisodes}]
Example:
{"A"=>[{"A Show"=>"http://www.tv.com/shows/a-show/episodes/"},{"Another Show"=>"http://www.tv.com/shows/another-show/episodes/"},...],"B"=>[B SHOWS....],....}
Hope this helps.

Ruby EOFError with open-uri and loop

I'm attempting to build a web crawler and ran into a bit of a snag. Basically what I'm doing is extracting the links from a web page and pushing each link to a queue. Whenever the Ruby interpreter hits this section of code:
links.each do |link|
url_frontier.push(link)
end
I receive the following error:
/home/blah/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/protocol.rb:141:in `read_nonblock': end of file reached (EOFError)
If I comment out the above block of code I get no errors. Please, any help would be appreciated. Here is the rest of the code:
require 'open-uri'
require 'net/http'
require 'uri'
class WebCrawler
def self.Spider(root)
eNDCHARS = %{.,'?!:;}
num_documents = 0
token_list = []
url_repository = Hash.new
url_frontier = Queue.new
url_frontier.push(root.to_s)
while !url_frontier.empty? && num_documents < 10
url = url_frontier.pop
if !url_repository.has_key?(url)
document = open(url)
html = document.read
# extract url's
links = URI.extract(html, ['http']).collect { |u| eNDCHARS.index(u[-1]) ? u.chop : u }
links.each do |link|
url_frontier.push(link)
end
# tokenize
Tokenizer.tokenize(document).each do |word|
token_list.push(IndexStructures::Term.new(word, url))
end
# add to the repository
url_repository[url] = true
num_documents += 1
end
end
# sort by term (primary) and document id (secondary) in reverse to aid in the construction of the inverted index
return num_documents, token_list.sort_by! { |term| [term.term, term.document_id]}.reverse!
end
end
I encountered the same error but with Watir-webdriver, running firefox in headless mode. What I found out was, if I was running two of my applications in parallel and I destroy "headless" in one of the applications, it automatically kills the other one as well with the exact error you quoted. Though my situation is not the same as yours, I think the issue is related to prematurely closing the file handle externally while your application is still using it. I removed the destroy command from my application and the error disappeared.
Hope this helps.
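One other reading of the symptom, offered as a guess rather than a diagnosis: pushing links onto the queue is harmless in itself, but once the queue is populated, a later pass through the loop calls `open(url)` on one of those extracted links, and that is where a server closing the connection would surface as EOFError. Assuming that reading, failing URLs could be rescued and skipped. In the sketch below the fetching and link extraction are passed in as lambdas so the loop can be exercised without a network; `crawl`, `fetch` and `extract` are hypothetical names.

```ruby
require 'set'

# Crawl loop that skips URLs whose fetch fails instead of aborting.
def crawl(root, fetch:, extract:, limit: 10)
  frontier = [root]
  seen = Set.new
  fetched = []
  until frontier.empty? || fetched.size >= limit
    url = frontier.shift
    next unless seen.add?(url)             # add? returns nil if already seen
    begin
      html = fetch.call(url)
    rescue EOFError, IOError, SystemCallError => e
      warn "Skipping #{url}: #{e.class}"
      next
    end
    fetched << url
    frontier.concat(extract.call(html))
  end
  fetched
end

# Stand-ins for open-uri and URI.extract, for illustration only.
pages = {
  'http://a.example/' => 'http://b.example/ http://bad.example/',
  'http://b.example/' => ''
}
fetch = lambda do |url|
  raise EOFError, 'end of file reached' if url.include?('bad')
  pages.fetch(url)
end
extract = ->(html) { html.split }

result = crawl('http://a.example/', fetch: fetch, extract: extract)
puts result.inspect
```

In the real crawler, `fetch` would wrap the open-uri call and `extract` would wrap `URI.extract`.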

Ruby - Mechanize: Select link by classname and other questions

At the moment I'm having a look at Mechanize.
I am pretty new to Ruby, so please be patient.
I wrote a little test script:
require 'rubygems'
require 'mechanize'
agent = WWW::Mechanize.new
page = agent.get('http://www.google.de')
pp page.title
google_form = page.form_with(:name => 'f')
google_form.q = 'test'
page = agent.submit(google_form)
pp page.title
page_links = Array.new
page.links.each do |ll|
page_links << ll
end
puts page_links.size
This works. But page_links includes not only the search results. It also includes the Google links like Login, Pictures, ...
The result links have the style class "1". Is it possible to select only the links with class == 1? How do I achieve this?
Is it possible to modify the "agentalias"? If I own a website that includes Google Analytics or something similar, which browser client will I see in Google Analytics when I visit my site with Mechanize?
Can I select elements by their ID instead of their name? I tried to use
my_form = page.form_with(:id => 'myformid')
But this does not work.
In cases like yours, I use Nokogiri's DOM search.
Here is your code, slightly rewritten:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.google.de')
pp page.title
google_form = page.form_with(:name => 'f')
google_form.q = 'test'
page = agent.submit(google_form)
pp page.title
page_links = Array.new
#maybe you better use 'h3.r > a.l' here
page.parser.css("a.l").each do |ll|
#page.parser here is Nokogiri::HTML::Document
page_links << ll
puts ll.text + "=>" + ll["href"]
end
puts page_links.size
Probably this article is a good place to start:
getting-started-with-nokogiri
By the way samples in the article also deal with Google search ;)
You can build a list of just the search result links by changing your code as follows:
page.links.each do |ll|
cls = ll.attributes.attributes['class']
page_links << ll if cls && cls.value == 'l'
end
For each element ll in page.links, ll.attributes is a Nokogiri::XML::Element and ll.attributes.attributes is a Hash containing the attributes on the link. Hence the need for ll.attributes.attributes to get at the actual class, and the need for the nil check before comparing the value to 'l'.
The problem with using :id in the criteria to find a form is that it clashes with Ruby's Object#id method for returning a Ruby object's internal id. I'm not sure what the workaround for this is. You would have no problem selecting the form by some other attribute (e.g. its action).
I believe the selector you are looking for is:
:dom_id e.g. in your case:
my_form = page.form_with(:dom_id => 'myformid')

Resources