Ruby: multiple identical or synced instances of Mechanize?

I've read elsewhere that Ruby's Mechanize is not thread safe, so to speed up some 'gets' I opted to instantiate several independent Mechanize objects and use them in parallel. This seems to work OK.
However, I would like all instances to be as similar as possible, ideally sharing 'everything' they know (cookies, etc.).
Is there any way to make deep copies of an already configured Mechanize object? My aim is to configure only one of them and then make clones of it.
For instance, suppose I create a Mechanize object like this (only an example; assume many more attributes are configured):
agent = Mechanize.new { |a| a.read_timeout = 20; a.max_history = 1 }
How can I get copies of it that don't interfere with each other while 'get'ing?
agent2 = agent.dup # copies are not thread safe
agent2 = Marshal.load(Marshal.dump(agent)) # throws an error

This appears to work, until you change max_history or read_timeout on one of the copies (those settings are copied by value, while the cookie_jar is shared).
class Mechanize
  def clone
    Mechanize.new do |a|
      a.cookie_jar   = cookie_jar
      a.max_history  = max_history
      a.read_timeout = read_timeout
    end
  end
end
Testing:
agent1 = Mechanize.new { |a| a.max_history = 30; a.read_timeout = 30 }
agent2 = agent1.clone
agent2.max_history == 30 # true
agent2.cookie_jar == agent1.cookie_jar # true
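If the goal is parallel gets, here is a minimal sketch of how the clones could be used from threads (the URLs and the Thread handling are illustrative, not part of the original question):
require 'mechanize'

agent = Mechanize.new { |a| a.read_timeout = 20; a.max_history = 1 }

urls = ['https://example.com/a', 'https://example.com/b']  # placeholder URLs
threads = urls.map do |url|
  # each thread works with its own clone, so no Mechanize state is mutated across threads
  Thread.new(agent.clone, url) { |own_agent, u| own_agent.get(u) }
end
pages = threads.map(&:value)  # wait for all threads and collect the fetched pages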

Related

Looking for a cleaner way to scrape a website without repeating code

Hi, I am doing a bit of refactoring on a small CLI web scraping project I wrote in Ruby, and I was wondering if there is a cleaner way to write a particular section without repeating the code too much.
With the code below, I pulled data from a website, but I had to do this per page. You will notice that the two methods differ only in their name and source URL.
def self.scrape_first_page
  html = open("https://www.texasblackpages.com/united-states/san-antonio")
  doc = Nokogiri::HTML(html)
  doc.css('div.grid_element').each do |business|
    biz = Business.new
    biz.name = business.css('a b').text
    biz.type = business.css('span.hidden-xs').text
    biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
  end
end

def self.scrape_second_page
  html = open('https://www.texasblackpages.com/united-states/san-antonio?page=2')
  doc = Nokogiri::HTML(html)
  doc.css('div.grid_element').each do |business|
    biz = Business.new
    biz.name = business.css('a b').text
    biz.type = business.css('span.hidden-xs').text
    biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
  end
end
Is there a way for me to streamline this process with just one method pulling from one source, but with the ability to access different pages within the same site, or is this pretty much the best and only way? The owners of the website do not have a public API for me to pull from, in case anyone is wondering.
Remember that in programming you want to steer towards code that follows the Zero, One or Infinity Rule and avoid the dreaded two. In other words, write methods that take no arguments, a single fixed argument (one), or an array of unspecified size (infinity).
So the first step is to clean up the scraping function to make it as generic as possible:
def scrape(page)
  doc = Nokogiri::HTML(open(page))

  # Use map here to return an array of Business objects
  doc.css('div.grid_element').map do |business|
    Business.new.tap do |biz|
      # Use tap to modify this object before returning it
      biz.name = business.css('a b').text
      biz.type = business.css('span.hidden-xs').text
      biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
    end
  end
end
Note that apart from the extraction code, there's nothing page-specific about this: it takes a URL and returns Business objects in an Array.
In order to generate pages 1..N, consider this:
def pages(base_url, start: 1)
  page = start

  Enumerator.new do |y|
    loop do
      y << base_url % page
      page += 1
    end
  end
end
Now that's an infinite series, but you can always cap it to whatever you want with take(n) or by instead looping until you get an empty list:
# Collect all businesses from each of the pages...
businesses = pages('https://www.texasblackpages.com/united-states/san-antonio?page=%d').lazy.map do |page|
  # ...by scraping the page...
  scrape(page)
end.take_while do |results|
  # ...and iterating until there are no results, i.e. Array#any? is false.
  results.any?
end.to_a.flatten
The .lazy part means each element is pulled through the whole chain as it's needed, as opposed to the default behaviour of evaluating each stage to completion before moving on to the next. This is important, or else it would try to download an infinite number of pages before ever reaching the take_while test.
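A tiny illustration of the difference, independent of the scraper:
(1..Float::INFINITY).map { |n| n * 2 }.first(3)       # never returns: map tries to build the whole array first
(1..Float::INFINITY).lazy.map { |n| n * 2 }.first(3)  # => [2, 4, 6], each element flows through the chain on demand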
The .to_a on the end forces that chain to run to completion. The .flatten squishes all the page-wise results into a single result set.
Of course if you want to scrape the first N pages, it's a lot easier:
pages('https://www.texasblackpages.com/.../san-antonio?page=%d').take(n).flat_map do |page|
scrape(page)
end
It's almost no code!
This was suggested by @Todd A. Jacobs:
def self.scrape(url)
  html = open(url)
  doc = Nokogiri::HTML(html)
  doc.css('div.grid_element').each do |business|
    biz = Business.new
    biz.name = business.css('a b').text
    biz.type = business.css('span.hidden-xs').text
    biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
  end
end
The downside is that, without a public API, I have to invoke the method as many times as I need it, since the URLs represent different pages within the website, but this is fine because I was able to get rid of the repeated methods.
def make_listings
  Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio")
  Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio?page=2")
  Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio?page=3")
  Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio?page=4")
end
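If you want to trim that repetition too, the same calls can be driven from a range (a sketch; the page count is just the example's four pages):
def make_listings
  base = "https://www.texasblackpages.com/united-states/san-antonio"
  (1..4).each do |page_number|
    # page 1 has no query string in the original calls; later pages use ?page=N
    url = page_number == 1 ? base : "#{base}?page=#{page_number}"
    Scraper.scrape(url)
  end
end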
I've had this problem before too; I just use a loop. Usually, if the site supports pagination, the first page can often be requested with the page query parameter as well.
def self.scrape
  page = 1
  loop do
    url = "https://www.texasblackpages.com/united-states/san-antonio?page=#{page}"
    # do nokogiri parse
    # do data scraping
    page += 1
  end
end
You can break out of the loop on a certain page condition, as sketched below.
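A minimal sketch of such a break condition, assuming that a page with no div.grid_element entries means you have run past the last page:
def self.scrape
  page = 1
  loop do
    url = "https://www.texasblackpages.com/united-states/san-antonio?page=#{page}"
    # assumes open-uri (used elsewhere in the question) and Nokogiri are required
    doc = Nokogiri::HTML(open(url))
    listings = doc.css('div.grid_element')
    break if listings.empty?  # assumed stop condition: an empty page ends the loop
    # build Business objects from listings here, as in the other answers
    page += 1
  end
end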

How to use user input across classes in Ruby?

I’m writing an app that scrapes genius.com to show a user the top ten songs. The user can then pick a song to see the lyrics.
I'd like to know how to use the user input collected in my CLI class inside a method in my Scraper class.
Right now I have part of the scrape happening outside the scraper class, but I'd like a clean division of responsibility.
Here’s part of my code:
class CLI
  def get_user_song
    chosen_song = gets.strip.to_i
    if chosen_song > 10 || chosen_song < 1
      puts "Only the hits! Choose a number from 1-10."
    end
  end
end
I’d like to be able to do something like the below.
class Scraper
  def self.scrape_lyrics
    page = Nokogiri::HTML(open("https://genius.com/#top-songs"))
    @url = page.css('div#top-songs a').map { |link| link['href'] }
    user_selection = @input_from_cli # <-- this is where I'd like to use the output
                                     #     of the 'gets' method above.
    @print_lyrics = @url[user_selection - 1]
    scrape_2 = Nokogiri::HTML(open(@print_lyrics))
    puts scrape_2.css(".lyrics").text
  end
end
I'm basically wondering how I can pass the chosen_song variable into the Scraper class. I've tried writing a class method, but I was having trouble doing it in a way that didn't break the rest of my program.
Thanks for any help!
I see two possible solutions to your problem. Which one is appropriate depends on your design goals; I'll try to explain each option.
From a plain reading of your code, the user inputs the number without seeing the content of the page (through your program). In this case the simple way would be to pass in the selected number as a parameter to the scrape_lyrics method:
def self.scrape_lyrics(user_selection)
  page = Nokogiri::HTML(open("https://genius.com/#top-songs"))
  @url = page.css('div#top-songs a').map { |link| link['href'] }
  @print_lyrics = @url[user_selection - 1]
  scrape_2 = Nokogiri::HTML(open(@print_lyrics))
  puts scrape_2.css(".lyrics").text
end
All sequencing happens in the CLI class, and the scraper is called with all the necessary data up front.
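For illustration, the CLI side could then look something like this (the else branch is an assumption about how you want to handle a valid pick):
class CLI
  def get_user_song
    chosen_song = gets.strip.to_i
    if chosen_song > 10 || chosen_song < 1
      puts "Only the hits! Choose a number from 1-10."
    else
      Scraper.scrape_lyrics(chosen_song)  # pass the validated selection straight to the scraper
    end
  end
end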
When imagining your tool more interactively, I was thinking it could be useful to have the scraper download the current top 10 and present the list to the user to choose from. In this case the interaction is a little bit more back-and-forth.
If you still want a strict separation, you can split scrape_lyrics into scrape_top_ten and scrape_lyrics_by_number(song_number) and sequence them from the CLI class, along the lines sketched below.
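A rough sketch of that split, reusing the selectors from your code (the method names are only suggestions):
class Scraper
  def self.scrape_top_ten
    page = Nokogiri::HTML(open("https://genius.com/#top-songs"))
    page.css('div#top-songs a').map { |link| link['href'] }  # one lyrics URL per chart entry
  end

  def self.scrape_lyrics_by_number(song_number)
    url = scrape_top_ten[song_number - 1]
    puts Nokogiri::HTML(open(url)).css(".lyrics").text
  end
end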
If you expect the interaction flow to be very dynamic it might be better to inject the interaction methods into the scraper and invert the dependency:
def self.scrape_lyrics(cli)
  page = Nokogiri::HTML(open("https://genius.com/#top-songs"))
  titles = page.css('div#top-songs h3:first-child').map { |t| t.text }
  user_selection = cli.choose(titles) # presents a choice to the user, returning the selected number
  @url = page.css('div#top-songs a').map { |link| link['href'] }
  @print_lyrics = @url[user_selection - 1]
  scrape_2 = Nokogiri::HTML(open(@print_lyrics))
  puts scrape_2.css(".lyrics").text
end
See the tty-prompt gem for an example implementation of the latter approach.
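A possible shape for that cli.choose method using tty-prompt (this is a sketch of an assumed CLI class, not code from the question):
require 'tty-prompt'

class CLI
  def choose(titles)
    prompt = TTY::Prompt.new
    # present the titles as a menu and return the 1-based index of the selected song
    prompt.select("Which song's lyrics do you want?") do |menu|
      titles.each_with_index { |title, i| menu.choice title, i + 1 }
    end
  end
end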

How to reuse a query result for faster export to CSV and XLS without using a global or session variable

I have a feature that first shows the results of a query as an HTML report, which can then be exported to CSV and XLS.
The idea is to reuse the results of the query used to render the HTML to export the same records, without re-running the query.
The closest I have come is this: storing the result in the global variable $last_consult.
I have the following index method in a Rails controller:
def index
  begin
    respond_to do |format|
      format.html {
        @filters = {}
        @filters['email_enterprise'] = session[:enterprise_email]
        # Add the selected filters
        if (params[:f_passenger].to_s != '')
          @filters['id_passenger'] = params[:f_passenger]
        end
        if (session[:role] == 2)
          @filters['cost_center'] = session[:cc_name]
        end
        # Apply the filters and assign the result to $last_consult, which the export uses
        $last_consult = InfoVoucher.where(@filters)
        @info_vouchers = $last_consult.paginate(:page => params[:page], :per_page => 10)
        estimate_destinations(@info_vouchers)
        @cost_centers = fill_with_cost_centers(@info_vouchers)
      }
      format.csv {
        send_data InfoVoucher.export
      }
      format.xls {
        send_data InfoVoucher.export(col_sep: "\t")
      }
    end
  end
end
The .export method is defined like this
class InfoVoucher < ActiveRecord::Base
  include ActiveModel::Serializers::JSON
  default_scope { order('voucher_created_at DESC') }

  def attributes
    instance_values
  end

  # Export to CSV or XLS
  def self.export(options = {})
    column_names = ["...","...","...","...","...",...]
    exported_col_names = ["Solicitud", "Inicio", "Final", "Duracion", "Pasajero", "Unidades", "Recargo", "Propina", "Costo", "Centro de costo", "Origen", "Destino", "Proyecto", "Conductor", "Placas"]
    CSV.generate(options) do |csv|
      csv << exported_col_names
      $last_consult.each do |row_export|
        csv << row_export.attributes['attributes'].values_at(*column_names)
      end
    end
  end
end
But this approach only works as long as there are no concurrent users between viewing the report and exporting it, which in this case is unacceptable.
I tried using a session variable to store the query result, but since the result of the query can be quite large, it fails with this error:
ActionDispatch::Cookies::CookieOverflow: ActionDispatch::Cookies::CookieOverflow
I have read about flash, but I don't consider it a good choice for this.
Can you please point me in the right direction on how to persist the query results, currently stored in $last_consult, and make them available for the CSV and XLS export without using a global or session variable?
Rails 4 has a bunch of cache solutions:
SQL query caching: caches the query result set for the duration of the request.
Memory caching: limited to 32 MB. An example use is small sets, such as a list of object ids that were time-consuming to generate, e.g. the result of a complex select.
File caching: Great for huge results. Probably not what you want for your particular DB query, unless your results are huge and also you're using a RAM disk or SSD.
Memcache and dalli: an excellent fast distributed cache that's independent of your app. For your question, memcache can be a very good solution for apps that return the same results or reports to multiple users.
Terracotta Ehcache: this is enterprise and JRuby. I haven't personally used it. Looks like it could be good if you're building a serious workhorse app.
When you use any of these, you don't store the information in a global variable, nor a controller variable. Instead, you store the information by creating a unique cache key.
If your information is specific to a particular user, such as the user's most recent query, then a decent choice for the unique cache key is "#{current_user.id}-last-consult".
If your information is generic across users, such as a report that depends on your filters and not on a particular user, then a decent choice for the unique cache key is @filters.hash.
If your information is specific to a particular user and also to the specific filters, then a decent choice for the unique cache key is "#{current_user.id}-#{@filters.hash}". This is a powerful, generic way to cache user-specific information.
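As a sketch of the pattern (the key and expiry here are only illustrative):
cache_key = "#{current_user.id}-#{@filters.hash}"
@last_consult = Rails.cache.fetch(cache_key, expires_in: 1.hour) do
  InfoVoucher.where(@filters).to_a  # run the query only on a cache miss; to_a materializes the relation
end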
I have had excellent success with the Redis cache gem, which can work separately from Rails 4 caching. https://github.com/redis-store/redis-rails
After reading joelparkerhenderson's answer I read this great article about most of the caching strategies he mentioned:
http://hawkins.io/2012/07/advanced_caching_part_1-caching_strategies/
I decided to use the Dalli gem, which depends on memcached 1.4+.
And in order to configure and use Dalli I read
https://www.digitalocean.com/community/tutorials/how-to-use-memcached-with-ruby-on-rails-on-ubuntu-12-04-lts and
https://github.com/mperham/dalli/wiki/Caching-with-Rails
And this is how it ended up being implemented:
Installation/Configuration
sudo apt-get install memcached
The installation can be verified by running:
memcached -h
Then install the Dalli gem and configure it
gem install dalli
Add these lines to the Gemfile:
# To cache query results or any other long-running-task results
gem 'dalli'
Set these lines in your config/environments/production.rb file:
# Configure the cache to use the dalli gem and expire the contents after 1 hour and enable compression
config.perform_caching = true
config.cache_store = :dalli_store, 'localhost:11211', {:expires_in => 1.hour, :compress => true }
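You can sanity-check the store from a Rails console before wiring it into the controller (illustrative values):
Rails.cache.write('some_key', [1, 2, 3])
Rails.cache.read('some_key')  # => [1, 2, 3] if memcached is reachable, nil otherwise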
Code
In the controller I created a new method called query_info_vouchers, which runs the query and stores the result in the cache by calling Rails.cache.write.
In the index method I call Rails.cache.fetch to see whether any cached data is available; this is only done for the CSV and XLS export formats.
def index
  begin
    add_breadcrumb "Historial de carreras", info_vouchers_path
    respond_to do |format|
      format.html {
        query_info_vouchers
      }
      format.csv {
        @last_consult = Rails.cache.fetch("#{session[:passenger_key]}_last_consult") do
          query_info_vouchers
        end
        send_data InfoVoucher.export(@last_consult)
      }
      format.xls {
        @last_consult = Rails.cache.fetch("#{session[:passenger_key]}_last_consult") do
          query_info_vouchers
        end
        send_data InfoVoucher.export(@last_consult, col_sep: "\t")
      }
    end
  rescue Exception => e
    Airbrake.notify(e)
    redirect_to manager_login_company_path, flash: {notice: GlobalConstants::ERROR_MESSAGES[:no_internet_conection]}
  end
end
def query_info_vouchers
  # By default, filter the rides that belong to the company
  @filters = {}
  @filters['email_enterprise'] = session[:enterprise_email]
  # Add the selected filters
  if (params[:f_passenger].to_s != '')
    @filters['id_passenger'] = params[:f_passenger]
  end
  if (session[:role] == 2)
    @filters['cost_center'] = session[:cc_name]
  end
  # Apply the filters and store the result in the cache to make it available when exporting
  @last_consult = InfoVoucher.where(@filters)
  Rails.cache.write "#{session[:passenger_key]}_last_consult", @last_consult
  @info_vouchers = @last_consult.paginate(:page => params[:page], :per_page => 10)
  estimate_destinations(@info_vouchers)
  @cost_centers = fill_with_cost_centers(@last_consult)
end
And in the model, the .export method:
def self.export(data, options = {})
  column_names = ["..","..","..","..","..","..",]
  exported_col_names = ["Solicitud", "Inicio", "Final", "Duracion", "Pasajero", "Unidades", "Recargo", "Propina", "Costo", "Centro de costo", "Origen", "Destino", "Proyecto", "Conductor", "Placas"]
  CSV.generate(options) do |csv|
    csv << exported_col_names
    data.each do |row_export|
      csv << row_export.attributes['attributes'].values_at(*column_names)
    end
  end
end

De-dupe Sidekiq queues

How can I de-dupe all Sidekiq queues, i.e. ensure each job in a queue has a unique worker class and arguments?
(This arises because, for example, an object is saved twice, triggering a new job each time; but we only want it to be processed once. So I'm looking to periodically de-dupe the queues.)
You can use the sidekiq-unique-jobs gem - it looks like it actually does what you need.
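As an illustration of what that looks like on a worker (the worker name here is made up, and the exact option name varies between versions of the gem, so check its README):
class MyWorker
  include Sidekiq::Worker
  # with sidekiq-unique-jobs, a lock option prevents enqueuing a second job
  # with the same worker class and arguments while one is already queued
  sidekiq_options lock: :until_executed

  def perform(record_id)
    # process the record
  end
end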
Added later:
Here is a basic implementation of what you are asking for - it would not be fast, but should be OK for small queues. I also ran into a problem when repacking the JSON - in my environment it was necessary to re-encode the JSON the same way.
# for proper json packing (I had an issue with it while testing)
require 'bigdecimal'

class BigDecimal
  def as_json(options = nil) #:nodoc:
    if finite?
      self
    else
      NilClass::AS_JSON
    end
  end
end
Sidekiq.redis do |connection|
  # getting items from redis (the first 100 jobs in the queue)
  items_count = connection.llen('queue:background')
  items = connection.lrange('queue:background', 0, 99)
  # jobs are in json - decode them
  items_decoded = items.map { |item| ActiveSupport::JSON.decode(item) }
  # group them by class and arguments
  grouped = items_decoded.group_by { |item| [item['class'], item['args']] }
  # keep only the groups with duplicates, and mark all but the last entry of each group
  duplicated = grouped.values.delete_if { |mini_list| mini_list.length < 2 }
  for_deletion = duplicated.map { |a| a[0...-1] }.flatten
  for_deletion_packed = for_deletion.map { |item| JSON.generate(item) }
  # removing duplicates one by one (count 1 removes a single matching entry per call)
  for_deletion_packed.each do |packed_item|
    connection.lrem('queue:background', 1, packed_item)
  end
end

Ruby/Mechanize Any way to drain RAM? -> failed to allocate memory

I've built a script that votes for me on a website...
The Ruby script works quite well, but after a few minutes it stops with these errors: link to the screenshot
So I inspected the Windows Task Manager, and the memory allocated to ruby.exe grows after each loop!
Here is the incriminating piece of code:
class VoteWebsite
  def self.main
    agent = Mechanize.new
    agent.user_agent_alias = 'Windows Mozilla'
    while $stop != true
      page = agent.get 'http://website.com/vote.php'
      reports_divs = page.search(".//div[@class='Pad1Color2']")
      tds = reports_divs.search("td")
      i = 3; j = 0; ouiboucle = 0; voteboucle = 0
      while i < tds.length
        result = tds[i].to_s.scan(/<font class="ColorRed"><b>(.*?) :<\/b><\/font>/)
        type = result[0].to_s[2..-3]
        k = i
        case type
        when "Type of vote"
          j = i + 1; i += 4
          result2 = tds[j].to_s.scan(/<div id="btn(.*?)">/)
          id = result2[0].to_s[2..-3]
          monvote = define_vote($vote_type, tds[k].to_s, $vote_auto)
          page2 = agent.get 'http://website.com/AJAX_Vote.php?id=' + id + '&vote=' + monvote
          voteboucle += 1
          .
          .
          .
        else
          .
          .
          .
        end
      end
    end
  end
end

VoteWebsite.main
I think that converting all the variables inside the method to global variables should fix this problem, but the code is quite big and there are plenty of variables inside this method.
So is there any way (any Ruby instruction) to drain all these variables at the end of each loop?
The problem actually came from Mechanize's history: see this answer, or use the Mechanize::History#clear method, or simply set the history's max_size attribute to a reasonable value.
#!/usr/bin/env ruby
require 'mechanize'

class GetContent
  def initialize url
    agent = Mechanize.new
    agent.user_agent_alias = 'Windows Mozilla'
    agent.history.max_size = 0
    while true
      page = agent.get url
    end
  end
end
myPage = GetContent.new('http://www.nypost.com/')
Hope it helps!
You can always force the garbage collector to kick in:
GC.start
As a note, this doesn't look like very idiomatic Ruby. Packing multiple statements onto one line using ; is bad form, and using $-prefixed variables is probably a relic of the code being ported from something else.
Remember that $-prefixed variables are global variables in Ruby; they can cause tons of problems if used carelessly and should be reserved for very specific circumstances. The best alternative is an @-prefixed instance variable, or, if you must, a declared CONSTANT.
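If you do go the GC route, a minimal illustration of where the call could sit in the question's loop (the rest of the loop body is elided):
while $stop != true
  page = agent.get 'http://website.com/vote.php'
  # ... process the page and cast votes ...
  GC.start  # ask the garbage collector to run at the end of each iteration
end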
