I am building a simple web spider using Sidekiq and Mechanize.
When I run this for one domain, it works fine. When I run it for multiple domains, it fails. I believe the reason is that @web_page gets overwritten when it is instantiated by another Sidekiq worker, but I am not sure if that's true or how to fix it.
# my scrape_search controller's create action searches on google.
def create
  @scrape = ScrapeSearch.build(keywords: params[:keywords], profession: params[:profession])
  agent = Mechanize.new
  scrape_search = agent.get('http://google.com/') do |page|
    search_result = page.form...
    search_result.css("h3.r").map do |link|
      result = link.at_css('a')['href'] # Narrowing down to real search results
      @domain = Domain.new(some params)
      ScrapeDomainWorker.perform_async(@domain.url, @domain.id, remaining_keywords)
    end
  end
end
I'm creating a Sidekiq job per domain. Most of the domains I'm looking for should contain just a few pages, so there's no need for sub-jobs per page.
This is my worker:
class ScrapeDomainWorker
  include Sidekiq::Worker
  ...

  def perform(domain_url, domain_id, keywords)
    @domain = Domain.find(domain_id)
    @domain_link = @domain.protocol + '://' + domain_url
    @keywords = keywords

    # First we scrape the homepage and get the first links
    @domain.to_parse = ['/'] # to_parse is an array of PATHS to parse for the domain
    mechanize_path('/')
    @domain.verified << '/' # verified is an Array field containing valid domain paths
    get_paths(@web_page) # Now we should have to_scrape populated with homepage links

    @domain.scraped = 1 # Loop counter
    while @domain.scraped < 100
      @domain.to_parse.each do |path|
        @domain.to_parse.delete(path)
        @domain.scraped += 1
        mechanize_path(path) # We create a Nokogiri HTML doc with mechanize for the valid path
        ...
        get_paths(@web_page) # Fire this to repopulate to_scrape !!!
      end
    end

    @domain.save
  end

  def mechanize_path(path)
    agent = Mechanize.new
    begin
      @web_page = agent.get(@domain_link + path)
    rescue Exception => e
      puts "Mechanize Exception for #{path} :: #{e.message}"
    end
  end

  def get_paths(web_page)
    paths = web_page.links.map {|link| link.href.gsub((@domain.protocol + '://' + @domain.url), "") } ## This works when I scrape a single domain, but fails with ".gsub for nil" when I scrape a few domains.
    paths.uniq.each do |path|
      @domain.to_parse << path
    end
  end
end
This works when I scrape a single domain, but fails with ".gsub for nil" on @web_page when I scrape a few domains.
You can wrap your code in another class, and then create an object of that class within your worker:
class ScrapeDomainWrapper
  def initialize(domain_url, domain_id, keywords)
    # ...
  end

  def mechanize_path(path)
    # ...
  end

  def get_paths(web_page)
    # ...
  end
end
And your worker:
class ScrapeDomainWorker
  include Sidekiq::Worker

  def perform(domain_url, domain_id, keywords)
    ScrapeDomainWrapper.new(domain_url, domain_id, keywords)
  end
end
Also, bear in mind that Mechanize::Page#links may be nil.
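For example, a defensive version of get_paths could look like this (a sketch that assumes the same instance variables as your worker; it also skips links whose href is nil, which is another common source of the ".gsub for nil" error):

def get_paths(web_page)
  return if web_page.nil? # mechanize_path may have rescued an exception and left @web_page unset

  web_page.links.each do |link|
    next if link.href.nil? # skip anchors without an href instead of calling gsub on nil

    path = link.href.gsub(@domain.protocol + '://' + @domain.url, "")
    @domain.to_parse << path unless @domain.to_parse.include?(path)
  end
end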
I'm building a small web scraper using Ruby and now I'm trying to refactor my code. Unfortunately, I'm encountering some errors while refactoring; this is one of them.
Basically, I'm calling two separate methods from the first method, which is whey_scrapper. Each of these two methods is responsible for scraping a specific item on the webpage. When I run and debug this code with byebug and try to display the products or prices I've scraped, I get an error message saying that 'products' or 'prices' is undefined. This is my current code:
require 'open-uri'
require 'nokogiri'
require 'httparty'
require 'byebug'
require 'csv'

class Wheyscrapper
  def whey_scrapper
    company = 'Body+%26+fit'
    url = "https://www.bodyenfitshop.nl/afslanken/afslank-toppers/?manufacturer=#{company}"
    unparsed_page = open(url).read
    parsed_page = Nokogiri::HTML(unparsed_page)
    product_scrapper
    prices_scrapper
    # csv = CSV.open('wheyprotein.csv', 'wb')
  end

  def product_scrapper
    products = Array.new
    product_names = parsed_page.css('div.product-primary')
    product_names.each do |product_name|
      product = {
        name: product_name.css('h2.product-name').text
      }
      products << product
    end
  end

  def prices_scrapper
    prices = Array.new
    product_prices = parsed_page.css('div.price-box')
    product_prices.each do |product_price|
      price = {
        amount: product_price.css('span.price').text
      }
      prices << price
    end
  end

  byebug
  whey_scrapper
end
There's a lot going on here, but to make it more idiomatic Ruby you'd consider making those lazy-initialized and giving them names that reflect that:
require 'open-uri'
require 'nokogiri'

class Wheyscrapper
  URL = "https://www.bodyenfitshop.nl/afslanken/afslank-toppers/?%s"

  def initialize(company:)
    @company = company

    # Use encode_www_form to encode query-string parameters
    @url = URL % URI.encode_www_form(manufacturer: company)
  end

  def document
    # Lazy-initialize a parsed version of the page
    @document ||= Nokogiri::HTML(open(@url).read)
  end

  def products
    document.css('div.product-primary').map do |product_name|
      {
        name: product_name.css('h2.product-name').text
      }
    end
  end

  def prices
    document.css('div.price-box').map do |product_price|
      {
        amount: product_price.css('span.price').text
      }
    end
  end
end
This fixes a lot of the data propagation problems you had in your original. When you declare a variable inside a method it's a local variable, meaning it doesn't exist outside of that particular call of that particular method. If you want to persist it for longer you need to use instance variables, such as @products, or you need to define methods that return the data you need.
The above approach combines that, using a lazy-initialized instance variable to persist the parsed document, and exposes that as a method the other methods can use.
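To make the scoping rule concrete, here's a tiny standalone example (a hypothetical Demo class, purely for illustration):

class Demo
  def set_local
    value = 42     # local variable: gone as soon as this method returns
  end

  def set_ivar
    @value = 42    # instance variable: persists for the lifetime of this object
  end

  def read
    [defined?(value), @value]
  end
end

d = Demo.new
d.set_local
d.set_ivar
d.read # => [nil, 42] -- the local never leaks between methods, the instance variable does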
Now you can spin this up:
scraper = Wheyscrapper.new(company: "Body & Fit")
That should make everything available directly:
scraper.prices
scraper.products
When you learn how to use Ruby effectively you'll often find solutions to your problems that are really minimal. Usually a lot of Ruby code is a sign that it's not being used properly.
This could be refactored further, but it should at least work without restructuring, based on my comments above:
require 'open-uri'
require 'nokogiri'
require 'httparty'
require 'csv'

class Wheyscrapper
  def whey_scrapper
    company = 'Body+%26+fit'
    url = "https://www.bodyenfitshop.nl/afslanken/afslank-toppers/?manufacturer=#{company}"
    unparsed_page = open(url).read
    @parsed_page = Nokogiri::HTML(unparsed_page)
    product_scrapper
    prices_scrapper
    # csv = CSV.open('wheyprotein.csv', 'wb')
  end

  def product_scrapper
    @products = Array.new
    product_names = @parsed_page.css('div.product-primary')
    product_names.each do |product_name|
      product = {
        name: product_name.css('h2.product-name').text
      }
      @products << product
    end
  end

  def prices_scrapper
    @prices = Array.new
    @product_prices = @parsed_page.css('div.price-box')
    @product_prices.each do |product_price|
      price = {
        amount: product_price.css('span.price').text
      }
      @prices << price
    end
  end
end
w = Wheyscrapper.new.whey_scrapper
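If you go with this minimal fix, you'll probably also want to read the scraped data back out after calling whey_scrapper. One option (my addition, not part of the original code) is to expose the instance variables with attr_reader:

class Wheyscrapper
  # Reopening the class here just to add readers; you could equally put this
  # line inside the class definition above.
  attr_reader :products, :prices
end

w = Wheyscrapper.new
w.whey_scrapper
w.products # => array of { name: ... } hashes
w.prices   # => array of { amount: ... } hashes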
I have some experience with Selenium in Python and Cucumber/Watir/RSpec in Ruby, and can write scripts that execute successfully, but they aren't using classes, so I am trying to learn more about classes and splitting the scripts up into page objects.
I found this example to learn from: http://watir.com/guides/page-objects/ so I copied the script and made some minor edits, as you'll see below.
I'm using SublimeText 3.x with Ruby 2.4.x on Win10, so you know what tools I'm using.
I put the whole script into a single .rb file (the only differences are that I replaced the URL and the elements to enter the username and password), tried to execute it, and got the following error:
C:/selenium/ruby/lotw/lotwlogin.rb:3:in `<main>': uninitialized constant Site (NameError).
I added the top line (require 'watir') and it made no difference to the error encountered.
So I have in lotwlogin.rb essentially the structure and syntax of the original script with custom elements. However, the core structure is reporting an error and I don't know what to do about it.
Here is my script:
require 'watir'

site = Site.new(Watir::Browser.new :chrome) # was :firefox but that no longer works since FF63
login_page = site.login_page.open
user_page = login_page.login_as "testuser", "testpassword" # dummy user and password for now
user_page.should be_logged_in

class BrowserContainer
  def initialize(browser)
    @browser = browser
  end
end

class Site < BrowserContainer
  def login_page
    @login_page = LoginPage.new(@browser)
  end

  def user_page
    @user_page = UserPage.new(@browser)
  end

  def close
    @browser.close
  end
end

class LoginPage < BrowserContainer
  URL = "https://lotw.arrl.org/lotw/login"

  def open
    @browser.goto URL
    # @browser.window.maximize
    self # no idea what this is for
  end

  def login_as(user, pass)
    user_field.set user
    password_field.set pass
    login_button.click

    next_page = UserPage.new(@browser)
    Watir::Wait.until { next_page.loaded? }
    next_page
  end

  private

  def user_field
    @browser.text_field(:name => "login")
  end

  def password_field
    @browser.text_field(:name => "password")
  end

  def login_button
    @browser.button(:value => "Log On")
  end
end # LoginPage

class UserPage < BrowserContainer
  def logged_in?
    logged_in_element.exists?
  end

  def loaded?
    @browser.h3 == "Welcome to Your Logbook of the World User Account Home Page"
  end

  private

  def logged_in_element
    @browser.div(:text => "Log off")
  end
end # UserPage
Any assistance on how to avoid the Site error would be appreciated.
Thanks
Mike
You define class Site only a few lines further down, but at the point where you call Site.new, the constant isn't yet known.
Move this logic to after all class definitions:
site = Site.new(Watir::Browser.new :chrome) # was :firefox but that no longer works since FF63
login_page = site.login_page.open
user_page = login_page.login_as "testuser", "testpassword" # dummy user and password for now
user_page.should be_logged_in
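A minimal standalone illustration of why the order matters (a made-up Greeter class, nothing to do with your page objects):

# Referencing a constant before Ruby has executed its definition raises the
# same error you are seeing:
#
#   g = Greeter.new   # => NameError: uninitialized constant Greeter
#   class Greeter; end
#
# Defining the class first (or moving the calling code to the bottom of the
# file, as suggested above) resolves it:

class Greeter
  def hello
    "hello"
  end
end

g = Greeter.new
puts g.hello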
I need to:
Open a Rakefile
Find if a certain task is defined
Find if a certain variable is defined
This works to find tasks defined inside a Rakefile, but it pollutes the global namespace (i.e. if you run it twice, all tasks defined in the first one will show up in the second one):
sub_rake = Rake::DefaultLoader.new
sub_rake.load("Rakefile")
puts Rake.application.tasks
In Rake, here is where it loads the Rakefile:
https://github.com/ruby/rake/blob/master/lib/rake/rake_module.rb#L28
How do I get access to the variables that are loaded there?
Here is an example Rakefile I am parsing:
load '../common.rake'
@source_dir = 'source'
desc "Run all build and deployment tasks, for continuous delivery"
task :deliver => ['git:pull', 'jekyll:build', 'rsync:push']
Here are some things I tried that didn't work. Using eval on the Rakefile:
safe_object = Object.new
safe_object.instance_eval("Dir.chdir('" + f + "')\n" + File.read(folder_rakefile))

if safe_object.instance_variable_defined?("@staging_dir")
  puts " Staging directory is " + f.yellow + safe_object.instance_variable_get("@staging_dir").yellow
else
  puts " Staging directory is not specified".red
end
This failed when parsing the desc parts of the Rakefile. I also tried things like:
puts Rake.instance_variables
puts Rake.class_variables
But these are not getting the @source_dir that I am looking for.
You can parse the Rakefile's source with the parser gem and walk the AST for the assignment, without ever evaluating the file:

rakefile_body = <<-RUBY
load '../common.rake'
@source_dir = 'some/source/dir'
desc "Run all build and deployment tasks, for continuous delivery"
task :deliver => ['git:pull', 'jekyll:build', 'rsync:push']
RUBY

def source_dir(ast)
  return nil unless ast.kind_of? AST::Node

  if ast.type == :ivasgn && ast.children[0] == :@source_dir
    rhs = ast.children[1]
    if rhs.type != :str
      raise "@source_dir is not a string literal! #{rhs.inspect}"
    else
      return rhs.children[0]
    end
  end

  ast.children.each do |child|
    value = source_dir(child)
    return value if value
  end

  nil
end

require 'parser/ruby22'
body = Parser::Ruby22.parse(rakefile_body)
source_dir body # => "some/source/dir"
Rake runs load() on the Rakefile inside load_rakefile in the Rake module, and you can easily get the tasks with the public API:
Rake.load_rakefile("Rakefile")
puts Rake.application.tasks
Apparently that load() invocation causes the loaded instance variables to be captured by the top-level main object of Ruby. (I expected them to be captured by Rake, since the load call is made in the context of the Rake module.)
Therefore, it is possible to access instance variables from the main object using this ugly code:
main = eval 'self', TOPLEVEL_BINDING
puts main.instance_variable_get('@staging_dir')
Here is a way to encapsulate the parsing of the Rakefile so that opening two files will not have all the things from the first one show up when you are analyzing the second one:
require 'rake'

class RakeBrowser
  attr_reader :tasks
  attr_reader :variables

  include Rake::DSL

  def task(*args, &block)
    if args.first.respond_to?(:id2name)
      @tasks << args.first.id2name
    elsif args.first.keys.first.respond_to?(:id2name)
      @tasks << args.first.keys.first.id2name
    end
  end

  def initialize(file)
    @tasks = []
    Dir.chdir(File.dirname(file)) do
      eval(File.read(File.basename(file)))
    end

    @variables = Hash.new
    instance_variables.each do |name|
      @variables[name] = instance_variable_get(name)
    end
  end
end

browser = RakeBrowser.new(f + "Rakefile")
puts browser.tasks
puts browser.variables[:@staging_dir]
We have a Rails 3.2 website which is fairly large, with thousands of URLs. We implemented the cache_digests gem for Russian doll caching and it is working well. We want to further optimize by warming up the cache overnight so that users get a better experience during the day. I have seen the answer to this question: Rails: Scheduled task to warm up the cache?
Could it be modified for warming up a large number of URLs?
To trigger cache hits for many pages with expensive load times, just create a rake task to iteratively send web requests to all record/url combinations within your site. (Here is one implementation)
Iteratively request every site URL/record with Net::HTTP:
To simply visit every page, you can run a nightly Rake task to make sure that early-morning users still get a snappy page with refreshed content.
lib/tasks/visit_every_page.rake:
namespace :visit_every_page do
  include Net
  include Rails.application.routes.url_helpers

  task :specializations => :environment do
    puts "Visiting specializations..."
    Specialization.all.sort{ |a,b| a.id <=> b.id }.each do |s|
      begin
        puts "Specialization #{s.id}"
        City.all.sort{ |a,b| a.id <=> b.id }.each do |c|
          puts "Specialization City #{c.id}"
          Net::HTTP.get( URI("http://#{APP_CONFIG[:domain]}/specialties/#{s.id}/#{s.token}/refresh_city_cache/#{c.id}.js") )
        end
        Division.all.sort{ |a,b| a.id <=> b.id }.each do |d|
          puts "Specialization Division #{d.id}"
          Net::HTTP.get( URI("http://#{APP_CONFIG[:domain]}/specialties/#{s.id}/#{s.token}/refresh_division_cache/#{d.id}.js") )
        end
      end
    end
  end

  # The following methods are defined to fake out the ActionController
  # requirements of the Rails cache
  def cache_store
    ActionController::Base.cache_store
  end

  def self.benchmark( *params )
    yield
  end

  def cache_configured?
    true
  end
end
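To actually run this nightly, hook the task up to whatever scheduler you use; for example, with the whenever gem (my assumption here, any cron wrapper works), the schedule could look like this:

# config/schedule.rb -- assumes the whenever gem manages your crontab;
# run `whenever --update-crontab` after editing.
every 1.day, at: '3:00 am' do
  rake 'visit_every_page:specializations'
end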
(If you want to directly include cache expiration/recaching into this task, check out this implementation.)
Via a custom controller action:
If you need to bypass user authentication restrictions to get to your pages, and/or you don't want to screw up your website's tracking analytics (too badly), you can create a custom controller action for hitting those cache digests that uses tokens to bypass authentication:
app/controllers/specializations_controller.rb:
class SpecializationsController < ApplicationController
  ...
  before_filter :check_token, :only => [:refresh_cache, :refresh_city_cache, :refresh_division_cache]
  skip_authorization_check :only => [:refresh_cache, :refresh_city_cache, :refresh_division_cache]
  ...

  def refresh_cache
    @specialization = Specialization.find(params[:id])
    @feedback = FeedbackItem.new
    render :show, :layout => 'ajax'
  end

  def refresh_city_cache
    @specialization = Specialization.find(params[:id])
    @city = City.find(params[:city_id])
    render 'refresh_city.js'
  end

  def refresh_division_cache
    @specialization = Specialization.find(params[:id])
    @division = Division.find(params[:division_id])
    render 'refresh_division.js'
  end
end
The custom controller actions render the views of other expensive-to-load pages, causing cache hits on those pages. For example, refresh_cache renders the same view and data as controller#show, so requests to refresh_cache will warm up the same cache digests as controller#show for those records.
Security Note:
For security reasons, I recommend that before granting access to any custom refresh_cache request, you pass in a token and check that it matches a unique token stored on that record. Matching URL tokens to database records before granting access (as seen above) is trivial, because your Rake task has access to each record's unique token -- just pass the record's token in with each request.
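A check_token filter along these lines would work (a sketch of mine; the controller above references the filter but its body isn't shown in the original, and it assumes each record stores a token column and that the route captures the token segment as params[:token]):

# app/controllers/specializations_controller.rb (hypothetical filter body)
def check_token
  specialization = Specialization.find(params[:id])
  unless params[:token].present? && params[:token] == specialization.token
    render :nothing => true, :status => :forbidden
  end
end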
tl;dr:
To trigger thousands of site URLs/cache digests, create a rake task to iteratively request every record/url combination in your site. You can bypass your app's user authentication restrictions for this task by creating a custom controller action that authenticates access via tokens instead.
I realize this question is about a year old, but I just worked out my own answer, after scouring a bunch of partial & incorrect solutions.
Hopefully this will help the next person...
Per my own utility class, which can be found here:
https://raw.githubusercontent.com/JayTeeSF/cmd_notes/master/automated_action_runner.rb
You can simply run this (per its .help method) and pre-cache your pages without tying up your own web server in the process.
class AutomatedActionRunner
  class StatusObject
    def initialize(is_valid, error_obj)
      @is_valid = !! is_valid
      @error_obj = error_obj
    end

    def valid?
      @is_valid
    end

    def error
      @error_obj
    end
  end

  def self.help
    puts <<-EOH
      Instead of tying up the frontend of your production site with:
        `curl http://your_production_site.com/some_controller/some_action/1234`
        `curl http://your_production_site.com/some_controller/some_action/4567`
      Try:
        `rails r 'AutomatedActionRunner.run(SomeController, "some_action", [{id: "1234"}, {id: "4567"}])'`
    EOH
  end

  def self.common_env
    {"rack.input" => "", "SCRIPT_NAME" => "", "HTTP_HOST" => "localhost:3000" }
  end
  REQUEST_ENV = common_env.freeze

  def self.run(controller, controller_action, params_ary=[], user_obj=nil)
    success_objects = []
    error_objects = []
    autorunner = new(controller, controller_action, user_obj)
    Rails.logger.warn %Q|[AutomatedAction Kickoff]: Preheating cache for #{params_ary.size} #{autorunner.controller.name}##{controller_action} pages.|
    params_ary.each do |params_hash|
      status = autorunner.run(params_hash)
      if status.valid?
        success_objects << params_hash
      else
        error_objects << status.error
      end
    end
    return process_results(success_objects, error_objects, user_obj.try(:id), autorunner.controller.name, controller_action)
  end

  def self.process_results(success_objects=[], error_objects=[], user_id, controller_name, controller_action)
    message = %Q|AutomatedAction Summary|
    backtrace = (error_objects.first.try(:backtrace)||[]).join("\n\t").inspect
    num_errors = error_objects.size
    num_successes = success_objects.size
    log_message = %Q|[#{message}]: Generated #{num_successes} #{controller_name}##{controller_action}, pages; Failed #{num_errors} times; 1st Fail: #{backtrace}|
    Rails.logger.warn log_message
    # all the local variables above are because I typically call Sentry or something with extra parameters!
  end

  attr_reader :controller

  def initialize(controller, controller_action, user_obj)
    @controller = controller
    @controller = controller.constantize unless controller.respond_to?(:name)
    @controller_instance = @controller.new
    @controller_action = controller_action
    @env_obj = REQUEST_ENV.dup
    @user_obj = user_obj
  end

  def run(params_hash)
    Rails.logger.warn %Q|[AutomatedAction]: #{@controller.name}##{@controller_action}(#{params_hash.inspect})|
    extend_with_autorun unless @controller_instance.respond_to?(:autorun)
    @controller_instance.autorun(@controller_action, params_hash, @env_obj, @user_obj)
  end

  private

  def extend_with_autorun
    def @controller_instance.autorun(action_name, action_params, action_env, current_user_value=nil)
      self.params = action_params # suppress strong parameters exception
      self.request = ActionDispatch::Request.new(action_env)
      self.response = ActionDispatch::Response.new
      define_singleton_method(:current_user, -> { current_user_value })
      send(action_name) # do it
      return StatusObject.new(true, nil)
    rescue Exception => e
      return StatusObject.new(false, e)
    end
  end
end
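For example, mirroring the .help text above (ProductsController, "show", and Product are placeholders for whatever controller, action, and model you want to pre-cache):

# One-off from the command line, per the .help output:
#   rails r 'AutomatedActionRunner.run(SomeController, "some_action", [{id: "1234"}, {id: "4567"}])'
#
# Or from a nightly rake task or runner, building the params list from the database:
AutomatedActionRunner.run(ProductsController, "show", Product.pluck(:id).map { |id| { id: id.to_s } })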
I'm reading Redis sets within an EventMachine reactor loop using a suitable Redis EM gem ('em-hiredis' in my case) and have to check in a cascade whether certain Redis sets contain members. My aim is to get the name of the first set that is not empty:
require 'eventmachine'
require 'em-hiredis'

def fetch_queue
  @redis.scard('todo').callback do |scard_todo|
    if scard_todo.zero?
      @redis.scard('failed_1').callback do |scard_failed_1|
        if scard_failed_1.zero?
          @redis.scard('failed_2').callback do |scard_failed_2|
            if scard_failed_2.zero?
              @redis.scard('failed_3').callback do |scard_failed_3|
                if scard_failed_3.zero?
                  EM.stop
                else
                  queue = 'failed_3'
                end
              end
            else
              queue = 'failed_2'
            end
          end
        else
          queue = 'failed_1'
        end
      end
    else
      queue = 'todo'
    end
  end
end

EM.run do
  @redis = EM::Hiredis.connect "redis://#{HOST}:#{PORT}"

  # How to get the value of fetch_queue?
  foo = fetch_queue
  puts foo
end
My question is: how can I tell EM to return the value of 'queue' from 'fetch_queue' so I can use it in the reactor loop? A simple "return queue = 'todo'", "return queue = 'failed_1'", etc. in fetch_queue results in an "unexpected return (LocalJumpError)" error.
Please, for the love of debugging, use some more methods; you wouldn't factor other code like this, would you?
Anyway, this is essentially what you probably want to do, so you can both factor and test your code:
require 'eventmachine'
require 'em-hiredis'

# This is a simple class that represents an extremely simple, linear state
# machine. It just walks the "from" parameter one by one, until it finds a
# non-empty set by that name. When a non-empty set is found, the given callback
# is called with the name of the set.
class Finder
  def initialize(redis, from, &callback)
    @redis = redis
    @from = from.dup
    @callback = callback
  end

  def do_next
    # If the from list is empty, we terminate, as we have no more steps
    unless @current = @from.shift
      EM.stop # or @callback.call :error, whatever
      return
    end

    @redis.scard(@current).callback do |scard|
      if scard.zero?
        do_next
      else
        @callback.call @current
      end
    end
  end

  alias go do_next
end

EM.run do
  @redis = EM::Hiredis.connect "redis://#{HOST}:#{PORT}"

  finder = Finder.new(@redis, %w[todo failed_1 failed_2 failed_3]) do |name|
    puts "Found non-empty set: #{name}"
  end
  finder.go
end