Ruby scraper. How to export to CSV? - ruby

I wrote this Ruby script to scrape product info from the manufacturer's website. Scraping and storing the product objects in an array works, but I can't figure out how to export the array data to a CSV file. This error is being thrown:
scraper.rb:45: undefined method `send_data' for main:Object (NoMethodError)
I do not understand this piece of code. What is it doing, and why isn't it working?
send_data csv_data,
:type => 'text/csv; charset=iso-8859-1; header=present',
:disposition => "attachment; filename=products.csv"
Full code:
#!/usr/bin/ruby
require 'rubygems'
require 'anemone'
require 'fastercsv'
productsArray = Array.new
class Product
attr_accessor :name, :sku, :desc
end
# Scraper Code
Anemone.crawl("http://retail.pelicanbayltd.com/") do |anemone|
anemone.on_every_page do |page|
currentPage = Product.new
#Product info parsing
currentPage.name = page.doc.css(".page_headers").text
currentPage.sku = page.doc.css("tr:nth-child(2) strong").text
currentPage.desc = page.doc.css("tr:nth-child(4) .item").text
if currentPage.sku =~ /#\d\d\d\d/
currentPage.sku = currentPage.sku[1..-1]
productsArray.push(currentPage)
end
end
end
# CSV Export Code
products = productsArray.find(:all)
csv_data = FasterCSV.generate do |csv|
# header row
csv << ["sku", "name", "desc"]
# data rows
productsArray.each do |product|
csv << [product.sku, product.name, product.desc]
end
end
send_data csv_data,
:type => 'text/csv; charset=iso-8859-1; header=present',
:disposition => "attachment; filename=products.csv"

If you are new to Ruby, you should be using Ruby 1.9 or later, in which case you can use the built-in CSV library, which is FasterCSV merged into the standard library, with i18n support:
require 'csv'
CSV.open('filename.csv', 'w') do |csv|
csv << [sku, name, desc]
end
http://ruby-doc.org/stdlib-1.9.2/libdoc/csv/rdoc/CSV.html
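As for the error in the question: send_data is a Rails controller helper that streams data back to a browser, so it is undefined in a standalone script, and writing the file directly is enough. A minimal sketch, assuming the scraped products respond to sku, name and desc; the Struct, the export_products helper and the sample row are stand-ins, not part of the original script:

```ruby
require 'csv'
require 'tmpdir'

# Stand-in for the Product class from the question.
Product = Struct.new(:name, :sku, :desc)

# Write a header row followed by one row per product.
def export_products(products, path)
  CSV.open(path, 'w') do |csv|
    csv << %w[sku name desc]
    products.each { |product| csv << [product.sku, product.name, product.desc] }
  end
end

products = [Product.new('Widget', '1234', 'A small widget')]
export_products(products, File.join(Dir.tmpdir, 'products.csv'))
```

CSV takes care of quoting, so descriptions that contain commas or quotes need no special handling.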

File.open('filename.csv', 'w') do |f|
f.write(csv_data)
end

It probably makes more sense to open the file before you start crawling:
@csv = FasterCSV.open('filename.csv', 'w')
and then write to it as you go along:
@csv << [sku, name, desc]
That way, if your script crashes halfway through, you've at least got half of the data.
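The write-as-you-go idea can be sketched with the standard CSV library as follows. stream_rows and the sample row are hypothetical, and flushing after every row is an assumption about how much data you are willing to lose; it trades some speed for durability.

```ruby
require 'csv'
require 'tmpdir'

# Write each row as soon as it is produced, so rows written before a
# failure are already in the file.
def stream_rows(path, rows)
  CSV.open(path, 'w') do |csv|
    csv << %w[sku name desc]
    rows.each do |row|
      csv << row
      csv.flush # hand the buffered row to the operating system
    end
  end
end

# A row source that fails after the first row, to simulate a crash.
rows = Enumerator.new do |yielder|
  yielder << ['1234', 'Widget', 'A small widget']
  raise 'simulated crash'
end

path = File.join(Dir.tmpdir, 'partial_products.csv')
begin
  stream_rows(path, rows)
rescue RuntimeError => e
  puts "crawl stopped: #{e.message}"
end
puts File.read(path)
```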

Related

How to refresh a large database?

I built a rake task to download a zip from the Awin datafeed and import it into my Product model via activerecord-import.
require 'csv'
require 'zip'
require 'httparty'
require 'active_record'
require 'activerecord-import'
namespace :affiliate_datafeed do
desc "Import products data from Awin"
task import_product_awin: :environment do
url = "https://productdata.awin.com"
dir = "db/affiliate_datafeed/awin.zip"
File.open(dir, "wb") do |f|
f.write HTTParty.get(url).body
end
zip_file = Zip::File.open(dir)
entry = zip_file.glob('*.csv').first
csv_text = entry.get_input_stream.read
products = []
CSV.parse(csv_text, :headers=>true).each do |row|
products << Product.new(row.to_h)
end
Product.import(products)
end
end
How to update the product db only if the product doesn't exist or if there is a new date in the last_updated field? What is the best way to refresh a large db?
One option is to compare the feed's last_updated value (or the Last-Modified HTTP header) against the date of your last import inside the rake task, along these lines:
require 'csv'
require 'date'
UPDATE_INTERVAL = 3600 # seconds between checks; placeholder, use whatever interval you want
def get_date
  # read the first row of the CSV
  date = CSV.foreach('CSV_raw.csv', :headers => false).first
  $last_modified = Date.parse(date.compact[1]) # if last_updated is in the first row of the CSV; otherwise use your HTTP response header
end
run_once = ARGV.length > 0 # run once to test that it works; not sure whether rake tasks accept args this way
puts "Daemon Mode" unless run_once
begin
  get_date
  if not File.read('last_update.txt').empty?
    date_in_file = Date.parse(File.read('last_update.txt'))
  else
    date_in_file = Date.parse('2001-02-03')
  end
  if $last_modified > date_in_file
    "your db updating method"
  end
  unless run_once
    sleep UPDATE_INTERVAL
  end
end until run_once
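One way to decide which rows need writing is to compare each feed row with what is already stored. A rough sketch in plain Ruby, assuming the feed has an id column (hypothetical; only last_updated is named in the question) and using a Hash as a stand-in for the products table. In the rake task the lookup would go through the Product model instead:

```ruby
require 'csv'
require 'date'

# Returns the feed rows that are new or carry a newer last_updated date
# than the stored copy. +existing+ maps id => stored row (a Hash here,
# standing in for a database lookup).
def rows_to_refresh(csv_text, existing)
  CSV.parse(csv_text, headers: true).map(&:to_h).select do |row|
    stored = existing[row['id']]
    stored.nil? ||
      Date.parse(row['last_updated']) > Date.parse(stored['last_updated'])
  end
end

# Made-up sample data for illustration.
feed = <<~FEED
  id,name,last_updated
  1,Kettle,2020-01-01
  2,Toaster,2020-03-01
  3,Blender,2020-02-01
FEED

existing = {
  '1' => { 'id' => '1', 'name' => 'Kettle', 'last_updated' => '2020-01-01' },
  '2' => { 'id' => '2', 'name' => 'Toaster', 'last_updated' => '2020-02-01' }
}

rows_to_refresh(feed, existing).each { |row| puts row['id'] }
```

Only the changed and new rows are then passed on to the import. activerecord-import also documents an on_duplicate_key_update option that may let the database handle the update case; check its README for your adapter.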

Getting all unique URLs using Nokogiri

I've been trying for a while to use the .uniq method to generate a unique list of URLs from a website (within the /informatics path). No matter what I try, I get a method error when trying to generate the list. I'm sure it's a syntax issue, and I was hoping someone could point me in the right direction.
Once I get the list I'm going to need to store it in a database via ActiveRecord, but I need the unique list before I start to wrap my head around that.
require 'nokogiri'
require 'open-uri'
require 'active_record'
ARGV[0]="https://www.nku.edu/academics/informatics.html"
ARGV.each do |arg|
open(arg) do |f|
# Display connection data
puts "#"*25 + "\nConnection: '#{arg}'\n" + "#"*25
[:base_uri, :meta, :status, :charset, :content_encoding,
:content_type, :last_modified].each do |method|
puts "#{method.to_s}: #{f.send(method)}" if f.respond_to? method
end
# Display the href links
base_url = /^(.*\.nku\.edu)\//.match(f.base_uri.to_s)[1]
puts "base_url: #{base_url}"
Nokogiri::HTML(f).css('a').each do |anchor|
href = anchor['href']
# Make Unique
if href =~ /.*informatics/
puts href
#store stuff to active record
end
end
end
end
Replace the Nokogiri::HTML part so that it selects only the anchors whose href attribute matches /.*informatics/. The result is already an array, so you can map it to the href values and call uniq:
require 'nokogiri'
require 'open-uri'
require 'active_record'
ARGV[0] = 'https://www.nku.edu/academics/informatics.html'
ARGV.each do |arg|
open(arg) do |f|
puts "#{'#' * 25} \nConnection: '#{arg}'\n #{'#' * 25}"
%i[base_uri meta status charset content_encoding content_type last_modified].each do |method|
puts "#{method.to_s}: #{f.send(method)}" if f.respond_to? method
end
puts "base_url: #{/^(.*\.nku\.edu)\//.match(f.base_uri.to_s)[1]}"
anchors = Nokogiri::HTML(f).css('a').select { |anchor| anchor['href'] =~ /.*informatics/ }
puts anchors.map { |anchor| anchor['href'] }.uniq
end
end
See output.
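The filter-then-uniq step can be tried without any network access. In this sketch the href values are made-up examples and unique_links is a hypothetical helper. It also resolves relative links against the page URL with URI.join, an extra step not in the answer above, so that the same page written two ways collapses to one entry:

```ruby
require 'uri'

# Keep hrefs matching +pattern+, make them absolute, and drop duplicates.
def unique_links(base_url, hrefs, pattern)
  hrefs.compact
       .select { |href| href =~ pattern }
       .map { |href| URI.join(base_url, href).to_s }
       .uniq
end

# Made-up href values, including a nil for an anchor without an href.
hrefs = [
  '/academics/informatics/about.html',
  'https://www.nku.edu/academics/informatics/about.html',
  nil,
  '/admissions.html',
  'informatics/programs.html',
  '/academics/informatics/about.html'
]

puts unique_links('https://www.nku.edu/academics/informatics.html', hrefs, /informatics/)
```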

No such file or directory @ rb_sysopen

I'm using Ruby 2.1.1. When I run this code:
CSV.foreach("public/data/original/example_data.csv", headers: true, converters: :numeric) do |info|
I get an error:
No such file or directory @ rb_sysopen
It works if I place example_data.csv in the same directory, as shown below, but my boss wants all *.csv files kept in a different directory:
CSV.foreach("example_data.csv", headers: true, converters: :numeric) do |info|
I had to use a workaround that bypassed FileUtils. Using thoughtbot/paperclip generated a directory called csvcontroller, and I placed the CSV file in that directory.
class Uploader < ActiveRecord::Base
attr_accessible :purchase_name, :item_description, :item_price, :purchase_count,
:merchant_address, :merchant_name, :csvdata
has_attached_file :csvdata, :url => "/csvcontroller/:basename.:extension",
:path => ":rails_root/csvcontroller/:basename.:extension"
#:default_url => "/controllers/original/example_data.csv"
validates_attachment_content_type :csvdata, :content_type => ["text/csv"]
end
Then I placed my parser in that directory to avoid using FileUtils:
require 'csv'
#total_cost = 0
#errors out FileUtils.move '/public/data/original/example_data.csv', '/controllers'
#errors out require File.expand_path('../app/public/data/original/', __FILE__)
# errors out CSV.foreach("any_path_name_outside_the_same_directory/example_data.csv",
#headers: true, converters: :numeric) do |info|
CSV.foreach("example_data.csv", headers: true, converters: :numeric) do |info|
a =(info["item price"]).to_f
b = (info["purchase count"]).to_i
#total_cost += a * b
#store = []
customer = []
customer << info["purchaser name"]
#store << info["item description"]
#store << (info["item price"]).to_f
#store << (info["purchase count"]).to_i
#store << info["merchant address"]
#store << info["merchant name"]
puts #customer
puts #store
puts #total_cost
end
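It looks ugly, but it works.
I could not get FileUtils to work properly and suspected a bug in Ruby 2.1.1, although this error usually just means that a relative path was resolved against a different working directory than expected.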
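One way to make the lookup independent of the working directory is to build an absolute path from an explicit base directory. In this sketch data_path and total_cost are hypothetical helpers, and a temporary directory stands in for the application root; in a Rails app the base would typically be Rails.root.

```ruby
require 'csv'
require 'fileutils'
require 'tmpdir'

# Build an absolute path from a known base directory so the result does
# not depend on where the process was started.
def data_path(base_dir, relative)
  File.expand_path(relative, base_dir)
end

# Sum price * count over the rows of the CSV at +path+.
def total_cost(path)
  total = 0.0
  CSV.foreach(path, headers: true, converters: :numeric) do |info|
    total += info['item price'] * info['purchase count']
  end
  total
end

Dir.mktmpdir do |app_root|
  csv_dir = File.join(app_root, 'public', 'data', 'original')
  FileUtils.mkdir_p(csv_dir)
  File.write(File.join(csv_dir, 'example_data.csv'),
             "item price,purchase count\n10.0,2\n2.5,4\n")

  # Switch to an unrelated directory to show the lookup still works.
  Dir.chdir(Dir.tmpdir) do
    puts total_cost(data_path(app_root, 'public/data/original/example_data.csv'))
  end
end
```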

Access instance variables from outside a class in Ruby

I can't seem to figure this out. I have the following class:
class CSV_Email
  attr_accessor :client_array, :email_array
  def load(file)
    @file = file
    # Parse csv file into ruby arrays...
    # Column Headers - Email, Client
    @client_array = []
    @email_array = []
    CSV.foreach(file, :col_sep => ",", :headers => :first_row, :return_headers => false) do |column|
      client_array << column[0]
      email_array << column[1]
    end
  end
end
Now I need to access client_array and email_array. I tried this:
test = CSV_Email.new
test.load("Email_Test.csv")
puts client_array
But I receive an undefined local variable client_array error. How can I access that variable?
I am using Ruby 1.9.3.
client_array is an accessor method on the instance, not a local variable, so you need to call it on the object you created:
puts test.client_array
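A self-contained sketch of the same pattern, assuming the Email and Client column headers mentioned in the question's comments. EmailList and load_text are hypothetical names, and the CSV is parsed from a string here so the example needs no file:

```ruby
require 'csv'

class EmailList
  # Readers expose the instance variables to code outside the class.
  attr_reader :client_array, :email_array

  def initialize
    @client_array = []
    @email_array = []
  end

  # Parse CSV text with a header row and collect both columns.
  def load_text(csv_text)
    CSV.parse(csv_text, headers: true) do |row|
      @email_array << row['Email']
      @client_array << row['Client']
    end
    self
  end
end

list = EmailList.new.load_text("Email,Client\na@example.com,Acme\nb@example.com,Globex\n")
puts list.client_array.inspect
puts list.email_array.inspect
```

Outside the class, the arrays are reachable only through the object (list.client_array); a bare client_array is looked up as a local variable or a method on the current object, which is why the original call failed.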

Receiving errors when saving Tweets to a database using Sinatra

I'm using Sinatra, EventMachine, DataMapper, SQLite3 and the Twitter Stream API to capture and save tweets. When I run the application from my command line, it seems to continually fail at tweet 50. If I'm not saving the tweets, it can run seemingly forever.
Below is the app code that captures tweets matching 'oscars', which provides a very fast stream. Just enter your Twitter username and password and run it from the command line.
require 'rubygems'
require 'sinatra'
require 'em-http'
require 'json'
require 'dm-core'
require 'dm-migrations'
USERNAME = '<your twitter username>'
PASSWORD = '<your secret password>'
STREAMING_URL = 'http://stream.twitter.com/1/statuses/filter.json'
DataMapper.setup(:default, ENV['DATABASE_URL'] || "sqlite3://#{Dir.pwd}/db/development.db")
class Tweet
include DataMapper::Resource
property :id, Serial
property :tweet_id, String
property :username, String
property :avatar_url, String
property :text, Text
end
DataMapper.auto_upgrade!
get '/' do
@tweets = Tweet.all
erb :index
end
def rip_tweet(line)
@count += 1
tweet = Tweet.new :tweet_id => line['id'],
:username => line['user']['screen_name'],
:avatar_url => line['user']['profile_image_url'],
:text => line['text']
if tweet.save
puts @count
else
puts "F"
end
end
EM.schedule do
@count = 0
http = EM::HttpRequest.new(STREAMING_URL).get({
:head => {
'Authorization' => [ USERNAME, PASSWORD]
},
:query => {
'track' => 'oscars'
}
})
buffer = ""
http.stream do |chunk|
buffer += chunk
while line = buffer.slice!(/.+\r?\n/)
rip_tweet JSON.parse(line)
end
end
end
helpers do
alias_method :h, :escape_html
end
I'm not sure you can safely mix EM and Sinatra in the same process. You might want to try splitting the Sinatra viewer and the EventMachine downloader into separate programs and processes.
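If the downloader is split out into its own process, its core is just the line buffering from the question. A sketch of that piece using only the standard library follows. LineBuffer is a hypothetical class, and unlike the original loop it skips blank lines, on the assumption that the stream may contain empty keep-alive lines:

```ruby
require 'json'

# Collects raw stream chunks and yields parsed JSON documents once a
# complete newline-terminated line has arrived.
class LineBuffer
  def initialize
    @buffer = +''
  end

  # Append a chunk and return every JSON document it completed.
  def feed(chunk)
    @buffer << chunk
    records = []
    while (line = @buffer.slice!(/\A[^\n]*\n/))
      line = line.strip
      next if line.empty? # ignore blank lines instead of parsing them
      records << JSON.parse(line)
    end
    records
  end
end

buffer = LineBuffer.new
# The second document is split across two chunks.
buffer.feed("{\"id\":1}\r\n{\"id\"").each { |tweet| puts tweet['id'] }
buffer.feed(":2}\r\n\r\n").each { |tweet| puts tweet['id'] }
```

The downloader would call feed from its stream callback and save each returned record, while the Sinatra app only reads from the database.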
