How to refresh a large database? - ruby

I built a rake task to download a zip from the Awin datafeed and import it into my Product model via activerecord-import.
require 'zip'
require 'httparty'
require 'active_record'
require 'activerecord-import'
require 'csv'

namespace :affiliate_datafeed do
  desc "Import products data from Awin"
  task import_product_awin: :environment do
    url = "https://productdata.awin.com"
    dir = "db/affiliate_datafeed/awin.zip"
    File.open(dir, "wb") do |f|
      f.write HTTParty.get(url).body
    end
    zip_file = Zip::File.open(dir)
    entry = zip_file.glob('*.csv').first
    csv_text = entry.get_input_stream.read
    products = []
    CSV.parse(csv_text, :headers => true).each do |row|
      products << Product.new(row.to_h)
    end
    Product.import(products)
  end
end
How do I update the product DB only if a product doesn't exist yet, or if there is a newer date in the last_updated field? What is the best way to refresh a large DB?

You could use something like the following in your rake task to keep checking the last_updated field (or a Last-Modified HTTP response header).
def get_date
  date_row = CSV.foreach('CSV_raw.csv', :headers => false).first # read only the first row
  $last_modified = Date.parse(date_row.compact[1]) # if last_updated is in the first CSV row, or use your HTTP response header instead
end

run_once = ARGV.length > 0 # pass any argument to run once and test that it works

puts "Daemon Mode" unless run_once

begin
  get_date
  date_in_file =
    if File.exist?('last_update.txt') && !File.read('last_update.txt').empty?
      Date.parse(File.read('last_update.txt'))
    else
      Date.parse('2001-02-03') # fallback so the first run always updates
    end
  if $last_modified > date_in_file
    # your db updating method goes here
    File.write('last_update.txt', $last_modified.to_s)
  end
  sleep UPDATE_INTERVAL unless run_once # whatever value you want the interval to be
end until run_once
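For the insert-or-update part, activerecord-import can push the upsert down to the database with its on_duplicate_key_update option, assuming the products table has a unique index on some external product key (the aw_product_id, price and last_updated columns below are illustrative, not from your schema):

# Sketch only: assumes a unique index on products.aw_product_id and that the
# CSV headers line up with your column names.
Product.import(
  products,
  batch_size: 1_000,                                        # keep memory bounded on a large feed
  on_duplicate_key_update: [:name, :price, :last_updated]   # refresh these columns when the key already exists
)

On PostgreSQL you also have to name the index, e.g. on_duplicate_key_update: { conflict_target: [:aw_product_id], columns: [:name, :price, :last_updated] }. Rows whose feed values haven't changed are simply rewritten, which is usually cheaper than comparing last_updated row by row in Ruby.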

Related

NameError Exception: undefined local variable or method `products' for Wheyscrapper:Class

I'm building a small web scraper using Ruby and now I'm trying to refactor my code. Unfortunately, I'm running into some errors while doing so; this is one of them.
Basically, I'm calling two separate methods from the first method, whey_scrapper. Each of these two methods is responsible for scraping a specific item on the webpage. When I run and debug this code with byebug, I try to display the products or prices I've scraped, but I get an error message saying that 'products' or 'prices' is undefined. This is my current code:
require 'open-uri'
require 'nokogiri'
require 'httparty'
require 'byebug'
require 'csv'

class Wheyscrapper
  def whey_scrapper
    company = 'Body+%26+fit'
    url = "https://www.bodyenfitshop.nl/afslanken/afslank-toppers/?manufacturer=#{company}"
    unparsed_page = open(url).read
    parsed_page = Nokogiri::HTML(unparsed_page)

    product_scrapper
    prices_scrapper
    # csv = CSV.open('wheyprotein.csv', 'wb')
  end

  def product_scrapper
    products = Array.new
    product_names = parsed_page.css('div.product-primary')
    product_names.each do |product_name|
      product = {
        name: product_name.css('h2.product-name').text
      }
      products << product
    end
  end

  def prices_scrapper
    prices = Array.new
    product_prices = parsed_page.css('div.price-box')
    product_prices.each do |product_price|
      price = {
        amount: product_price.css('span.price').text
      }
      prices << price
    end
  end

  byebug
  whey_scrapper
end
There's a lot going on here, but to make it more idiomatic Ruby you'd consider making those values lazy-initialized and giving them names that reflect that:
class Wheyscrapper
  URL = "https://www.bodyenfitshop.nl/afslanken/afslank-toppers/?%s"

  def initialize(company:)
    @company = company
    # Use encode_www_form to encode query-string parameters
    @url = URL % URI.encode_www_form(manufacturer: company)
  end

  def document
    # Lazy-initialize a parsed version of the page
    @document ||= Nokogiri::HTML(open(@url).read)
  end

  def products
    document.css('div.product-primary').map do |product_name|
      {
        name: product_name.css('h2.product-name').text
      }
    end
  end

  def prices
    document.css('div.price-box').map do |product_price|
      {
        amount: product_price.css('span.price').text
      }
    end
  end
end
This fixes a lot of the data propagation problems you had in your original. When you assign a bare variable inside a method it's a local variable, meaning it doesn't exist outside of that particular call of that particular method. If you want it to persist for longer you need to use instance variables, as in @products, or you need to define methods that return the data you need.
The above approach combines that, using a lazy-initialized instance variable to persist the parsed document, and exposes that as a method the other methods can use.
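As a tiny illustration of the scoping difference (class and variable names here are made up):

class Example
  def set_local
    value = 1      # local variable: gone as soon as this method returns
  end

  def set_ivar
    @value = 1     # instance variable: lives as long as this object does
  end

  def read
    @value         # nil until set_ivar has run; a bare `value` here would raise NameError
  end
end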
Now you can spin this up:
scraper = Wheyscrapper.new(company: "Body & Fit")
Where that should enable everything to be available directly:
scraper.prices
scraper.products
When you learn how to use Ruby effectively you'll often find solutions to your problems that are really minimal. Usually a lot of Ruby code is a sign that it's not being used properly.
This could be refactored further, but based on my comments above it should at least work without a bigger rewrite:
require 'open-uri'
require 'nokogiri'
require 'httparty'
require 'csv'

class Wheyscrapper
  def whey_scrapper
    company = 'Body+%26+fit'
    url = "https://www.bodyenfitshop.nl/afslanken/afslank-toppers/?manufacturer=#{company}"
    unparsed_page = open(url).read
    @parsed_page = Nokogiri::HTML(unparsed_page)

    product_scrapper
    prices_scrapper
    # csv = CSV.open('wheyprotein.csv', 'wb')
  end

  def product_scrapper
    @products = Array.new
    product_names = @parsed_page.css('div.product-primary')
    product_names.each do |product_name|
      product = {
        name: product_name.css('h2.product-name').text
      }
      @products << product
    end
  end

  def prices_scrapper
    @prices = Array.new
    @product_prices = @parsed_page.css('div.price-box')
    @product_prices.each do |product_price|
      price = {
        amount: product_price.css('span.price').text
      }
      @prices << price
    end
  end
end
w = Wheyscrapper.new.whey_scrapper

Sidekiq mechanize overwritten instance

I am building a simple web spider using Sidekiq and Mechanize.
When I run this for one domain, it works fine. When I run it for multiple domains, it fails. I believe the reason is that web_page gets overwritten when instantiated by another Sidekiq worker, but I am not sure if that's true or how to fix it.
# my scrape_search controller's create action searches on google.
def create
  @scrape = ScrapeSearch.build(keywords: params[:keywords], profession: params[:profession])
  agent = Mechanize.new
  scrape_search = agent.get('http://google.com/') do |page|
    search_result = page.form...
    search_result.css("h3.r").map do |link|
      result = link.at_css('a')['href'] # Narrowing down to real search results
      @domain = Domain.new(some params)
      ScrapeDomainWorker.perform_async(@domain.url, @domain.id, remaining_keywords)
    end
  end
end
I'm creating a Sidekiq job per domain. Most of the domains I'm looking for should contain just a few pages, so there's no need for sub-jobs per page.
This is my worker:
class ScrapeDomainWorker
  include Sidekiq::Worker
  ...

  def perform(domain_url, domain_id, keywords)
    @domain = Domain.find(domain_id)
    @domain_link = @domain.protocol + '://' + domain_url
    @keywords = keywords

    # First we scrape the homepage and get the first links
    @domain.to_parse = ['/'] # to_parse is an array of PATHS to parse for the domain
    mechanize_path('/')
    @domain.verified << '/' # verified is an Array field containing valid domain paths
    get_paths(@web_page) # Now we should have to_scrape populated with homepage links

    @domain.scraped = 1 # Loop counter
    while @domain.scraped < 100
      @domain.to_parse.each do |path|
        @domain.to_parse.delete(path)
        @domain.scraped += 1
        mechanize_path(path) # We create a Nokogiri HTML doc with mechanize for the valid path
        ...
        get_paths(@web_page) # Fire this to repopulate to_scrape !!!
      end
    end
    @domain.save
  end

  def mechanize_path(path)
    agent = Mechanize.new
    begin
      @web_page = agent.get(@domain_link + path)
    rescue Exception => e
      puts "Mechanize Exception for #{path} :: #{e.message}"
    end
  end

  def get_paths(web_page)
    paths = web_page.links.map {|link| link.href.gsub((@domain.protocol + '://' + @domain.url), "") } ## This works when I scrape a single domain, but fails with ".gsub for nil" when I scrape a few domains.
    paths.uniq.each do |path|
      @domain.to_parse << path
    end
  end
end
This works when I scrape a single domain, but fails with .gsub for nil for web_page when I scrape a few domains.
You can wrap your code in another class and then create an object of that class within your worker:
class ScrapeDomainWrapper
  def initialize(domain_url, domain_id, keywords)
    # ...
  end

  def mechanize_path(path)
    # ...
  end

  def get_paths(web_page)
    # ...
  end
end
And your worker:
class ScrapeDomainWorker
  include Sidekiq::Worker

  def perform(domain_url, domain_id, keywords)
    ScrapeDomainWrapper.new(domain_url, domain_id, keywords)
  end
end
Also, bear in mind that entries in Mechanize::Page#links may have a nil href, which is what produces the undefined method `gsub' for nil error.
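Either way, a defensive get_paths avoids the nil crash when a page failed to load or a link has no href (a sketch based on the worker above):

def get_paths(web_page)
  return if web_page.nil?                        # mechanize_path rescued an error and left it unset
  base = @domain.protocol + '://' + @domain.url
  paths = web_page.links.map(&:href).compact.map { |href| href.gsub(base, "") }
  paths.uniq.each { |path| @domain.to_parse << path }
end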

Sqlite3 library won't open after 250 inserts

I'm trying to insert a large amount of information into an SQLite3 database using a Ruby script. After roughly 250 calls to db_prepare_location.execute, it stops working with:
.rvm/gems/ruby-1.9.2-p290/gems/sqlite3-1.3.6/lib/sqlite3/statement.rb:67:in `step': unable to open database file (SQLite3::CantOpenException)
from /Users/ashley/.rvm/gems/ruby-1.9.2-p290/gems/sqlite3-1.3.6/lib/sqlite3/statement.rb:67:in `execute'
from programs.rb:57:in `get_program_details'
from programs.rb:22:in `block in get_link'
from /Users/ashley/.rvm/rubies/ruby-1.9.2-p290/lib/ruby/1.9.1/csv.rb:1768:in `each'
from /Users/ashley/.rvm/rubies/ruby-1.9.2-p290/lib/ruby/1.9.1/csv.rb:1202:in `block in foreach'
from /Users/ashley/.rvm/rubies/ruby-1.9.2-p290/lib/ruby/1.9.1/csv.rb:1340:in `open'
from /Users/ashley/.rvm/rubies/ruby-1.9.2-p290/lib/ruby/1.9.1/csv.rb:1201:in `foreach'
from programs.rb:20:in `get_link'
from programs.rb:63:in `<module:Test>'
from programs.rb:15:in `<main>'
And here's my code:
require 'net/http'
require 'json'
require 'nokogiri'
require 'open-uri'
require 'csv'
require 'sqlite3'
require "bundler/setup"
require "capybara"
require "capybara/dsl"

Capybara.run_server = false
Capybara.default_driver = :selenium
Capybara.current_driver = :selenium

module Test
  class Tree
    include Capybara::DSL

    def get_link
      CSV.foreach("links.csv") do |row|
        link = row[0]
        get_details(link)
      end
    end

    def get_details(link)
      db = SQLite3::Database.open "development.sqlite3"
      address = []
      address_text = []
      visit("#{link}")
      name = find("#listing_detail_header").find("h3").text
      page.find(:xpath, "//div[@id='listing_detail_header']").all(:xpath, "//span/span").each {|span| address << span }
      if address.size == 4
        street_address = address[0].text
        address.shift
        address.each {|a| address_text << a.text }
        city_state_address = address_text.join(", ")
      else
        puts link
        street_address = ""
        city_state_address = ""
      end
      if page.has_css?('.provider-click_to_call')
        find(".provider-click_to_call").click
        phone_number = find("#phone_number").text.gsub(/[()]/, "").gsub(" ", "-")
      else
        phone_number = ""
      end
      if page.has_css?('.provider-website_link')
        website = find(".provider-website_link")[:href]
      else
        website = ""
      end
      description = find(".listing_details_list").find("p").text

      db_prepare_location = db.prepare("INSERT INTO programs(name, city_state_address, street_address, phone_number, website, description) VALUES (?, ?, ?, ?, ?, ?)")
      db_prepare_location.bind_params name, city_state_address, street_address, phone_number, website, description
      db_prepare_location.execute
    end
  end

  test = Test::Tree.new
  test.get_link
end
What is the problem here and what can I do to fix it? Let me know if additional info is needed.
You could be running out of file descriptors. Every time you call get_details, you open the SQLite database:
db = SQLite3::Database.open "development.sqlite3"
but you never explicitly close it; instead, you're relying on the garbage collector to clean up all your databases and close all your file descriptors. Each time you open the database a file descriptor is allocated, and closing the database frees it. If you're calling get_details faster than the GC can clean things up, you will run out of file descriptors, and subsequent SQLite3::Database.open calls will fail.
Try adding db.close at the end of get_details.
You'll probably have to close the prepared statement as well, so call db_prepare_location.close before db.close:
def get_details(link)
  # ...
  db_prepare_location.close
  db.close
end
Yes, Ruby has garbage collection but that doesn't mean that you don't have to manage your resources by hand.
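If you keep the open-per-call structure, an ensure clause guarantees the cleanup even when the scraping raises part-way through (a sketch; the scraping code in the middle is unchanged):

def get_details(link)
  db = SQLite3::Database.open "development.sqlite3"
  stmt = nil
  # ... scraping exactly as before ...
  stmt = db.prepare("INSERT INTO programs(name, city_state_address, street_address, phone_number, website, description) VALUES (?, ?, ?, ?, ?, ?)")
  stmt.bind_params name, city_state_address, street_address, phone_number, website, description
  stmt.execute
ensure
  stmt.close if stmt
  db.close if db
end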
Another option (which DGM was hinting at) would be to open a connection to the database in your constructor:
def initialize
  @db = SQLite3::Database.open "development.sqlite3"
end
and then drop your SQLite3::Database.open call in get_details and use @db instead. You wouldn't need a db.close in get_details anymore, but you'd still want the db_prepare_location.close call.
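A sketch of that shared-connection version. One further simplification, not required but convenient: Database#execute prepares, binds, and finalizes the statement internally, so the explicit close calls go away entirely:

def initialize
  @db = SQLite3::Database.open "development.sqlite3"   # one connection for the whole run
end

def get_details(link)
  # ... scraping as before ...
  @db.execute(
    "INSERT INTO programs(name, city_state_address, street_address, phone_number, website, description) VALUES (?, ?, ?, ?, ?, ?)",
    [name, city_state_address, street_address, phone_number, website, description]
  )
end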

Ruby scraper. How to export to CSV?

I wrote this Ruby script to scrape product info from the manufacturer's website. The scraping and storage of the product objects in an array works, but I can't figure out how to export the array data to a CSV file. This error is being thrown:
scraper.rb:45: undefined method `send_data' for main:Object (NoMethodError)
I do not understand this piece of code. What's this doing and why isn't it working right?
send_data csv_data,
  :type => 'text/csv; charset=iso-8859-1; header=present',
  :disposition => "attachment; filename=products.csv"
Full code:
#!/usr/bin/ruby
require 'rubygems'
require 'anemone'
require 'fastercsv'

productsArray = Array.new

class Product
  attr_accessor :name, :sku, :desc
end

# Scraper Code
Anemone.crawl("http://retail.pelicanbayltd.com/") do |anemone|
  anemone.on_every_page do |page|
    currentPage = Product.new
    # Product info parsing
    currentPage.name = page.doc.css(".page_headers").text
    currentPage.sku = page.doc.css("tr:nth-child(2) strong").text
    currentPage.desc = page.doc.css("tr:nth-child(4) .item").text
    if currentPage.sku =~ /#\d\d\d\d/
      currentPage.sku = currentPage.sku[1..-1]
      productsArray.push(currentPage)
    end
  end
end

# CSV Export Code
products = productsArray.find(:all)
csv_data = FasterCSV.generate do |csv|
  # header row
  csv << ["sku", "name", "desc"]
  # data rows
  productsArray.each do |product|
    csv << [product.sku, product.name, product.desc]
  end
end

send_data csv_data,
  :type => 'text/csv; charset=iso-8859-1; header=present',
  :disposition => "attachment; filename=products.csv"
If you are new to Ruby, you should be using Ruby 1.9 or later, in which case you can use the built-in CSV library, which incorporates FasterCSV plus i18n support:
require 'csv'

CSV.open('filename.csv', 'w') do |csv|
  csv << [sku, name, desc]
end
http://ruby-doc.org/stdlib-1.9.2/libdoc/csv/rdoc/CSV.html
send_data is a Rails controller helper for streaming a response to the browser; it isn't defined in a plain Ruby script, which is why you get the NoMethodError. Outside Rails, just write the generated CSV string to a file:
File.open('filename.csv', 'w') do |f|
  f.write(csv_data)
end
It probably makes more sense to do:
@csv = FasterCSV.open('filename.csv', 'w')
and then write to it as you go along:
@csv << [sku, name, desc]
that way if your script crashes halfway through you've at least got half of the data.
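Putting the pieces together with the stdlib CSV, a sketch of the whole script (same selectors as the original; each row is written as soon as it is found, so a crash part-way through keeps what was already scraped):

require 'anemone'
require 'csv'

CSV.open('products.csv', 'w') do |csv|
  csv << ["sku", "name", "desc"]                                # header row
  Anemone.crawl("http://retail.pelicanbayltd.com/") do |anemone|
    anemone.on_every_page do |page|
      sku  = page.doc.css("tr:nth-child(2) strong").text
      name = page.doc.css(".page_headers").text
      desc = page.doc.css("tr:nth-child(4) .item").text
      csv << [sku[1..-1], name, desc] if sku =~ /#\d\d\d\d/     # same filter as the original script
    end
  end
end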

How do I query a MS Access database table, and export the information to Excel using Ruby and win32ole?

I'm new to Ruby, and I'm trying to query an existing MS Access database for information for a report. I want this information stored in an Excel file. How would I do this?
Try one of these:
OLE:
require 'win32ole'

class AccessDbExample
  @ado_db = nil

  # Setup the DB connections
  def initialize filename
    @ado_db = WIN32OLE.new('ADODB.Connection')
    @ado_db['Provider'] = "Microsoft.Jet.OLEDB.4.0"
    @ado_db.Open(filename)
  rescue Exception => e
    puts "ADO failed to connect"
    puts e
  end

  def table_to_csv table
    sql = "SELECT * FROM #{table};"
    results = WIN32OLE.new('ADODB.Recordset')
    results.Open(sql, @ado_db)
    File.open("#{table}.csv", 'w') do |file|
      fields = []
      results.Fields.each{|f| fields << f.Name}
      file.puts fields.join(',')
      results.GetRows.transpose.each do |row|
        file.puts row.join(',')
      end
    end unless results.EOF
    self
  end

  def cleanup
    @ado_db.Close unless @ado_db.nil?
  end
end

AccessDbExample.new('test.mdb').table_to_csv('colors').cleanup
ODBC:
require 'odbc'
include ODBC

class AccessDbExample
  @odbc_db = nil

  # Setup the DB connections
  def initialize filename
    drv = Driver.new
    drv.name = 'AccessOdbcDriver'
    drv.attrs['driver'] = 'Microsoft Access Driver (*.mdb)'
    drv.attrs['dbq'] = filename
    @odbc_db = Database.new.drvconnect(drv)
  rescue
    puts "ODBC failed to connect"
  end

  def table_to_csv table
    sql = "SELECT * FROM #{table};"
    result = @odbc_db.run(sql)
    return nil if result == -1
    File.open("#{table}.csv", 'w') do |file|
      header_row = result.columns(true).map{|c| c.name}.join(',')
      file.puts header_row
      result.fetch_all.each do |row|
        file.puts row.join(',')
      end
    end
    self
  end

  def cleanup
    @odbc_db.disconnect unless @odbc_db.nil?
  end
end

AccessDbExample.new('test.mdb').table_to_csv('colors').cleanup
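If you really want an Excel workbook rather than a CSV, the same win32ole approach can drive Excel directly. A sketch, assuming Excel is installed; the rows variable here is illustrative and would come from one of the query examples above:

require 'win32ole'

excel = WIN32OLE.new('Excel.Application')
excel.Visible = false
book  = excel.Workbooks.Add
sheet = book.Worksheets(1)

rows = [['Name', 'Color'], ['apple', 'red']]       # substitute the header + data rows from your query
rows.each_with_index do |row, i|
  row.each_with_index do |value, j|
    sheet.Cells(i + 1, j + 1).Value = value        # Excel cells are 1-indexed
  end
end

book.SaveAs('C:\\colors.xls')
excel.Quit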
Why do you want to do this? You can simply query your db from Excel directly. Check out this tutorial.
As Johannes said, you can query the database from Excel.
If, however, you would prefer to work with Ruby...
You can find info on querying Access/Jet databases with Ruby here.
Lots of info on automating Excel with Ruby can be found here.
David
