I need to scrape 10k URLs from this website, and some of them seem to be out of service (I think; the server does not return the JSON I'm looking for, so rest-client raises a 500 Internal Server Error in my program).
The error looks like this: `exception_with_response': 500 Internal Server Error (RestClient::InternalServerError)
To loop through the URLs, I'm using a range, (1..30).each do |id|, and concatenating the URL with the current value from the range:
response = RestClient.get(url + id.to_s)
The problem is that sometimes the URL I'm requesting does not exist and/or the page returns an error.
How can I protect my code so that it just skips the problematic URLs and keeps scraping?
Here's my current code (I put the whole body of the loop in a begin/rescue block, but I don't know how to write the rescue so it skips and continues):
require 'nokogiri'
require 'csv'
require 'rest-client'
require 'json'

link = "https://webfec.org.br/Utils/GetCentrobyId?cod="

CSV.open('data2.csv', 'ab') do |csv|
  csv << ['Name', 'Street', 'Info', 'E-mail', 'Site']
  (1..30).each do |id|
    begin
      response = RestClient.get(link + id.to_s)
      json = JSON.parse(response)
      html = json["Data"]
      doc = Nokogiri::HTML.parse(html)
      name = doc.xpath("/html/body/table/tbody/tr[1]").text
      # REMOVER is a list of patterns defined elsewhere in my script
      street = doc.xpath("/html/body/table/tbody/tr[2]").text.gsub(Regexp.union(REMOVER), " ")
      info = doc.xpath("/html/body/table/tbody/tr[3]").text.gsub(Regexp.union(REMOVER), " ")
      email = doc.xpath("/html/body/table/tbody/tr[4]").text.gsub(Regexp.union(REMOVER), " ")
      site = doc.xpath("/html/body/table/tbody/tr[5]").text.gsub(Regexp.union(REMOVER), " ")
      csv << [name, street, info, email, site]
    rescue
    end
  end
end
As you can see, everything in the loop is inside a begin block, with a rescue at the end, but I'm a bit lost on how to actually handle the error there.
You should just rescue the exception, for example:
[*1..3].each { |i| begin; RestClient.get('https://fooboton.free.beeceptor.com'); rescue RestClient::InternalServerError; next; end }
So for your case do:
CSV.open('data2.csv', 'ab') do |csv|
  csv << ['Name', 'Street', 'Info', 'E-mail', 'Site']
  (1..30).each do |id|
    begin
      response = RestClient.get(link + id.to_s)
    rescue RestClient::InternalServerError
      next # skip this iteration of the loop
    end
    ... # rest of your code
  end
end
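If some of those 10k IDs can also come back with other HTTP errors (a 404, say), rescuing only RestClient::InternalServerError will still crash the loop. rest-client raises subclasses of RestClient::ExceptionWithResponse for non-2xx responses, so a slightly broader rescue catches them all; a minimal sketch (the warn message is just illustrative):
begin
  response = RestClient.get(link + id.to_s)
rescue RestClient::ExceptionWithResponse => e
  warn "Skipping id #{id}: #{e.message}" # e.g. "500 Internal Server Error"
  next
end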
I want to collect the names of users in a particular group, called Nature, on the photo-sharing website Fotolog. This is my code:
require 'rubygems'
require 'mechanize'
require 'csv'

def getInitUser()
  agent1 = Mechanize.new
  number = 0
  arrayUsers = [] # initialize once, outside the loop, so results accumulate
  while number <= 500
    address = 'http://http://www.fotolog.com/nature/participants/#{number}/'
    logfile2 = File.new("Fotolog/Users.csv", "a")
    tryCount = 0
    begin
      page = agent1.get(address)
    rescue
      tryCount += 1
      retry if tryCount < 5
      return
    end
    # search for the users
    page.search("a[class=img_border_radius]").map do |opt|
      link = opt.attributes['href'].text
      link = link.gsub("http://www.fotolog.com/", "").gsub("/", "")
      arrayUsers << link
      logfile2.print("#{link}\n")
    end
    number += 100
  end
  return arrayUsers
end

arrayUsers = getInitUser()
arrayUsers.each do |user|
  getFriend(user) # defined elsewhere
end
But the Users.csv file I am getting is empty. What's wrong here? I suspect it might have something to do with the "class" attribute I am using, but from inspecting the element it seems to be the correct class, doesn't it? I am just getting started with web crawling, so I apologise if this is a silly query.
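One thing worth checking before the selector, though: the address string itself. Single-quoted Ruby strings do not interpolate #{number}, and the URL above also repeats the scheme (http://http://), so every request goes to the same malformed, literal address. Assuming the rest of the method is unchanged, that one line would need to be:
# Double quotes enable #{number} interpolation; the scheme appears only once.
address = "http://www.fotolog.com/nature/participants/#{number}/"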
I got Ruby to travel to a web site, iterate through a list of campaigns, and scrape the pages for specific data. The problem I have now is getting the data out of the structure Nokogiri gives me and outputting it in a readable form.
require 'watir'
require 'nokogiri'

campaign_list = [1042360, 1042386, 1042365, 992307]

browser = Watir::Browser.new :chrome
browser.goto '<redacted>'
browser.text_field(:id => 'email').set '<redacted>'
browser.text_field(:id => 'password').set '<redacted>'
browser.send_keys :enter

file = File.new('hourlysales.csv', 'w')
data = {}

campaign_list.each do |campaign|
  browser.goto "<redacted>"
  if browser.text.include? "Application Error"
    puts "Error loading page, I recommend restarting script"
    # Possibly automatic restart of script
  else
    hourly_data = Nokogiri::HTML.parse(browser.html).text
    # file.write data
    puts hourly_data
  end
end
This is the output I get:
{"views":[[17,145],[18,165],[19,99],[20,71],[21,31],[22,26],[23,10],[0,15],[1,1], [2,18],[3,19],[4,35],[5,47],[6,44],[7,67],[8,179],[9,141],[10,112],[11,95],[12,46],[13,82],[14,79],[15,70],[16,103]],"orders":[[17,10],[18,9],[19,5],[20,1],[21,1],[22,0],[23,0],[0,1],[1,0],[2,1],[3,0],[4,1],[5,2],[6,1],[7,5],[8,11],[9,6],[10,5],[11,3],[12,1],[13,2],[14,4],[15,6],[16,7]],"conversion_rates":[0.06870229007633588,0.05442176870748299,0.050505050505050504,0.014084507042253521,0.03225806451612903,0.0,0.0,0.06666666666666667,0.0,0.05555555555555555,0.0,0.02857142857142857,0.0425531914893617,0.022727272727272728,0.07462686567164178,0.06134969325153374,0.0425531914893617,0.044642857142857144,0.031578947368421054,0.021739130434782608,0.024390243902439025,0.05063291139240506,0.08571428571428572,0.06741573033707865]}
The arrays stand for { "views": [[hour, number of views], [hour, number of views], ...] }, and the same for orders. I don't need the conversion rates.
I also need to add the values up for each key, so that after doing this for 5 pages I have one key for each hour of the day and the total number of views for that hour. I tried a couple of each loops, but couldn't make any progress.
I appreciate any help you guys can give me.
It looks like the output (which from your code I assume is the content of hourly_data) is JSON. In that case, it's easy to parse and add up the numbers. Something like this:
require "json" # at the top of your script
# ...
def sum_hours_values(data, hours_values=nil)
# Start with an empty hash that automatically initializes missing keys to `0`
hours_values ||= Hash.new {|hsh,hour| hsh[hour] = 0 }
# Iterate through the [hour, value] arrays, adding `value` to the running
# count for that `hour`, and return `hours_values`
data.each_with_object(hours_values) do |(hour, value), hsh|
hsh[hour] += value
end
end
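For instance, feeding it the first two "views" pairs from the sample output above, then calling it again with the matching "orders" pairs, accumulates into the same hash:
totals = sum_hours_values([[17, 145], [18, 165]])
totals # => {17=>145, 18=>165}
totals = sum_hours_values([[17, 10], [18, 9]], totals)
totals # => {17=>155, 18=>174}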
# ... Watir/Nokogiri stuff here...

# Initialize these so they persist outside the loop
hours_views, hours_orders = nil

campaign_list.each do |campaign|
  browser.goto "<redacted>"
  if browser.text.include? "Application Error"
    # ...
  else
    # ...
    hourly_data_parsed = JSON.parse(hourly_data)
    hours_views = sum_hours_values(hourly_data_parsed["views"], hours_views)
    hours_orders = sum_hours_values(hourly_data_parsed["orders"], hours_orders)
  end
end
puts "Views by hour:"
puts hours_views.sort.map {|hour_views| "%2i\t%4i" % hour_views }
puts "Orders by hour:"
puts hours_orders.sort.map {|hour_orders| "%2i\t%4i" % hour_orders }
P.S. There's a really nice recursive version of sum_hours_values I didn't include since the iterative version is clearer to most Ruby programmers. If you're into recursion I leave it as an exercise for you. ;)
I'm attempting to build a web crawler and ran into a bit of a snag. Basically what I'm doing is extracting the links from a web page and pushing each link to a queue. Whenever the Ruby interpreter hits this section of code:
links.each do |link|
  url_frontier.push(link)
end
I receive the following error:
/home/blah/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/protocol.rb:141:in `read_nonblock': end of file reached (EOFError)
If I comment out the above block of code I get no errors. Please, any help would be appreciated. Here is the rest of the code:
require 'open-uri'
require 'net/http'
require 'uri'
require 'thread' # Queue lives in the thread library on Ruby 1.9

class WebCrawler
  def self.Spider(root)
    end_chars = %{.,'?!:;}
    num_documents = 0
    token_list = []
    url_repository = Hash.new
    url_frontier = Queue.new

    url_frontier.push(root.to_s)
    while !url_frontier.empty? && num_documents < 10
      url = url_frontier.pop
      if !url_repository.has_key?(url)
        document = open(url)
        html = document.read

        # extract URLs, chopping trailing punctuation
        links = URI.extract(html, ['http']).collect { |u| end_chars.index(u[-1]) ? u.chop : u }
        links.each do |link|
          url_frontier.push(link)
        end

        # tokenize (Tokenizer and IndexStructures are defined elsewhere)
        Tokenizer.tokenize(document).each do |word|
          token_list.push(IndexStructures::Term.new(word, url))
        end

        # add to the repository
        url_repository[url] = true
        num_documents += 1
      end
    end

    # sort by term (primary) and document id (secondary) in reverse to aid in
    # the construction of the inverted index
    return num_documents, token_list.sort_by! { |term| [term.term, term.document_id] }.reverse!
  end
end
I encountered the same error, but with watir-webdriver, running Firefox in headless mode. What I found was that if I ran two of my applications in parallel and destroyed "headless" in one of them, it automatically killed the other as well, with the exact error you quoted. Though my situation is not the same as yours, I think the issue is related to a file handle being closed externally, prematurely, while your application is still using it. I removed the destroy command from my application and the error disappeared.
Hope this helps.
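If you can't track down what is closing the connection on your side, a pragmatic workaround is to treat a dropped connection like any other dead link and skip it. A minimal sketch, assuming the crawler loop from the question: wrap the fetch in a rescue for EOFError (and open-uri's HTTP errors) and move on to the next URL:
begin
  document = open(url)
  html = document.read
rescue EOFError, OpenURI::HTTPError => e
  warn "Skipping #{url}: #{e.class}" # connection dropped or HTTP error status
  next
end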
I'm trying to open a CSV file, look up a string, and then return the 2nd column of the CSV file, but only the first instance of it. I've gotten as far as the following, but unfortunately, it returns every instance. I'm a bit flummoxed.
Can the gods of Ruby help? Thanks much in advance.
For the purpose of this example, let's say names.csv is a file with the following:
foo, happy
foo, sad
bar, tired
foo, hungry
foo, bad
#!/usr/local/bin/ruby -w
require 'rubygems'
require 'fastercsv'
require 'pp'

FasterCSV.open('newfile.csv', 'w') do |output|
  FasterCSV.foreach('names.csv') do |lookup|
    index_PL = lookup.index('foo')
    if index_PL
      output << lookup[1]
    end
  end
end
OK, so if I want to return all instances of foo, but in a CSV, then how does that work?
What I'd like as an outcome is happy, sad, hungry, bad. I thought it would be:
FasterCSV.open('newfile.csv', 'w') do |output|
  FasterCSV.foreach('names.csv') do |lookup|
    index_PL = lookup.index('foo')
    if index_PL
      build_str << "," << lookup[1]
    end
    output << build_str
  end
end
But it does not seem to work.
Replace foreach with open (to get an Enumerable) and find:
FasterCSV.open('newfile.csv', 'w') do |output|
  output << FasterCSV.open('names.csv').find { |r| r.index('foo') }[1]
end
The index call will return nil if it doesn't find anything; that means the find will give you the first row that has 'foo', and you can pull out the column at index 1 from the result.
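To make the truthiness explicit, using the sample rows above:
row = ['foo', ' happy']
row.index('foo') # => 0, which is truthy, so find keeps this row
row.index('bar') # => nil, so find would skip it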
If you're not certain that names.csv will have what you're looking for then a bit of error checking would be advisable:
FasterCSV.open('newfile.csv', 'w') do |output|
  foos_row = FasterCSV.open('names.csv').find { |r| r.index('foo') }
  if foos_row
    output << foos_row[1]
  else
    # complain or something
  end
end
Or, if you want to silently ignore the lack of 'foo' and use an empty string instead, you could do something like this:
FasterCSV.open('newfile.csv', 'w') do |output|
  output << (FasterCSV.open('names.csv').find { |r| r.index('foo') } || ['', ''])[1]
end
I'd probably go with the "complain if it isn't found" version though.
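As for the follow-up about collecting all instances: the build_str attempt fails because build_str is never initialized, and it writes to the output on every row, matching or not. A minimal sketch that gathers every match into a single CSV row, assuming the same names.csv sample (strip removes the leading space FasterCSV keeps after each comma):
FasterCSV.open('newfile.csv', 'w') do |output|
  matches = []
  FasterCSV.foreach('names.csv') do |lookup|
    matches << lookup[1].strip if lookup.index('foo')
  end
  output << matches # writes the row: happy,sad,hungry,bad
end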