Create dynamic variables from the th class names in tables and move td values into that row's array or hash? - ruby

I'm an amateur programmer wanting to scrape data from a site that is similar to this site: http://www.highschoolsports.net/massey/ (I have permission to scrape the site, by the way.)
The target site has 'th' classes for each 'th' in row[0], but I want to ensure that each 'td' I pull from each table is somehow linked to that th's class name, because the tables are inconsistent. For example, one table might be:
row[0] - >>th.name, th.place, th.team
row[1] - >>td[0], td[1] , td[2]
while another might be:
row[0] - >>th.place, th.team, th.name
row[1] - >>td[0], td[1] , td[2] etc..
My question: How do I capture the 'th' class name across many hundreds of tables which are inconsistent (in 'th' class order), create the 10-14 variables (arrays), and then link the 'td' corresponding to that column in the table to that dynamic variable? Please let me know if this is confusing. There are multiple tables on a given page.
Currently my code is something like:
require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'open-uri'
require 'uri'

class Result
  def initialize(row)
    @attrs = {}
    @attrs[:raw] = row.text
  end
end

class Race
  def initialize(page, table)
    @handle = page
    @table = table
    @results = []
    @attrs = {}
    parse!
  end

  def parse!
    @attrs[:name] = @handle.css('div.caption').text
    get_rows
  end

  def get_rows
    # get all of the rows ..
    @handle.css('tr').each do |tr|
      @results << Result.new(tr)
    end
  end
end
class Event
  class << self
    def all(index_url)
      events = []
      ourl = Nokogiri::HTML(open(index_url))
      ourl.css('a.event').each do |url|
        abs_url = MAIN + url.attributes["href"]
        events << Event.new(abs_url)
      end
      events
    end
  end

  def initialize(url)
    @url = url
    @handle = nil
    @attrs = {}
    @races = []
    @sub_events = []
    parse!
  end

  def parse!
    @handle = Nokogiri::HTML(open(@url))
    get_page_meta
    if @handle.css('table.base.event_results').length > 0
      @handle.search('div.table_container.event_results').each do |table|
        @races << Race.new(@handle, table)
      end
    else
      @handle.css('div.centered a.obvious').each do |ol|
        @sub_events << Event.new(BASE_URL + ol.attributes["href"])
      end
    end
  end

  def get_page_meta
    @attrs[:name] = @handle.css('html body div.content h2 text()')[0] # event name
    @attrs[:date] = @handle.xpath("html/body/div/div/text()[2]").text.strip # date
  end
end
A friend has been helping me with this and I'm just starting to get a grasp on OOP, but I'm only capturing the tables; they're not split into td's and stored in some kind of variable/array/hash. I need help understanding this process or how to do this. The critical piece would be dynamically assigning variable names according to the classes of the data and moving the td's from that column (all td[2]'s, for example) into that dynamic variable. I can't tell you how amazing it would be if someone could help me solve this problem and understand how to make this work. Thank you in advance for any help!

It's easy once you realize that the th contents are the keys of your hash. Example:
@items = []
doc.css('table.masseyStyleTable').each do |table|
  fields = table.css('th').map { |x| x.text.strip }
  table.css('tr').each do |tr|
    item = {}
    fields.each_with_index do |field, i|
      item[field] = tr.css('td')[i].text.strip rescue ''
    end
    @items << item
  end
end
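For instance, assuming one of the headers is literally the text "Name" (a hypothetical header, adjust to whatever the th text actually is), you could later pull that whole column out of @items regardless of which position it had in each table:

# Hypothetical usage: every td that sat under a "Name" header, across all tables,
# ends up under the "Name" key, no matter which column it was in.
names = @items.map { |item| item["Name"] }
puts names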

Related

Delete method in plain Ruby is not working

Please see below.
The delete method is not working and I do not know why.
I am trying to delete a customer without using Rails, just plain Ruby.
Please can you help?
wrong number of arguments (given 0, expected 1) (ArgumentError)
from /Users/mustafaalomer/code/MustafaAlomer711/fullstack-challenges/02-OOP/05-Food-Delivery-Day-One/01-Food-Delivery/app/repositories/customer_repository.rb:28:in `delete'
from /Users/mustafaalomer/code/MustafaAlomer711/fullstack-challenges/02-OOP/05-Food-Delivery-Day-One/01-Food-Delivery/app/controllers/customers_controller.rb:33:in `destroy'
from /Users/mustafaalomer/code/MustafaAlomer711/fullstack-challenges/02-OOP/05-Food-Delivery-Day-One/01-Food-Delivery/router.rb:36:in `route_action'
from /Users/mustafaalomer/code/MustafaAlomer711/fullstack-challenges/02-OOP/05-Food-Delivery-Day-One/01-Food-Delivery/router.rb:13:in `run'
from app.rb:19:in `<main>'
require_relative "../views/customers_view"
require_relative "../models/customer"
class CustomersController
def initialize(customer_repository)
#customer_repository = customer_repository
#customers_view = CustomersView.new
end
def add
# ask user for a name
name = #customers_view.ask_user_for(:name)
# ask user for a address
address = #customers_view.ask_user_for(:address)
# make a new instance of a customer
customer = Customer.new(name: name, address: address)
# add the customer to the repository
#customer_repository.create(customer)
list
end
def list
customers = #customer_repository.all
#customers_view.display_list(customers)
end
def destroy
# ask user for the id to delete
list
id = #customers_view.ask_user_to_delete(:id)
# customer = #customer_repository.find(id)
# #customer_repository.delete(customer)
end
end
require 'csv'
require_relative '../models/customer'

class CustomerRepository
  def initialize(csv_file)
    @csv_file = csv_file
    @customers = []
    @next_id = 1
    load_csv if File.exist?(csv_file)
  end

  def all
    @customers
  end

  def create(customer)
    customer.id = @next_id
    @customers << customer
    @next_id += 1
    save_csv
  end

  def find(id)
    @customers.find { |customer| customer.id == id }
  end

  def delete(id)
    @customers.delete { |customer| customer.id == id }
  end

  private

  def save_csv
    CSV.open(@csv_file, "wb") do |csv|
      csv << %w[id name address]
      @customers.each do |customer|
        csv << [customer.id, customer.name, customer.address]
      end
    end
  end

  def load_csv
    CSV.foreach(@csv_file, headers: :first_row, header_converters: :symbol) do |row|
      row[:id] = row[:id].to_i
      @customers << Customer.new(row)
    end
    @next_id = @customers.last.id + 1 unless @customers.empty?
  end
end
Array#delete always takes an argument (the object to remove); a block passed to it is only used to compute the return value when nothing is found.
delete_if takes a block and seems to be what you're looking for here.
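A minimal sketch of what that could look like in the repository (also persisting the change, like create does), plus the matching controller call; treating the id from the view as a string that needs to_i is an assumption:

def delete(id)
  @customers.delete_if { |customer| customer.id == id }
  save_csv
end

# and in CustomersController#destroy:
def destroy
  list
  id = @customers_view.ask_user_to_delete(:id)
  @customer_repository.delete(id.to_i) # assumes the view returns the id as a string
end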

NameError Exception: undefined local variable or method `products' for Wheyscrapper:Class

I'm building a small web scraper using Ruby and I'm now trying to refactor my code. Unfortunately, I'm running into some errors while refactoring; this is one of them.
I'm calling two separate methods from the first method, whey_scrapper. Each of these methods is responsible for scraping a specific item on the webpage. When I run and debug this code with byebug and try to display the products or prices I've scraped, I get an error message saying that 'products' or 'prices' is undefined. This is my current code:
require 'open-uri'
require 'nokogiri'
require 'httparty'
require 'byebug'
require 'csv'

class Wheyscrapper
  def whey_scrapper
    company = 'Body+%26+fit'
    url = "https://www.bodyenfitshop.nl/afslanken/afslank-toppers/?manufacturer=#{company}"
    unparsed_page = open(url).read
    parsed_page = Nokogiri::HTML(unparsed_page)
    product_scrapper
    prices_scrapper
    # csv = CSV.open('wheyprotein.csv', 'wb')
  end

  def product_scrapper
    products = Array.new
    product_names = parsed_page.css('div.product-primary')
    product_names.each do |product_name|
      product = {
        name: product_name.css('h2.product-name').text
      }
      products << product
    end
  end

  def prices_scrapper
    prices = Array.new
    product_prices = parsed_page.css('div.price-box')
    product_prices.each do |product_price|
      price = {
        amount: product_price.css('span.price').text
      }
      prices << price
    end
  end

  byebug
  whey_scrapper
end
There's a lot going on here, but to make it more idiomatic Ruby you could consider making those values lazy-initialized and giving the methods names that reflect that:
class Wheyscrapper
  URL = "https://www.bodyenfitshop.nl/afslanken/afslank-toppers/?%s"

  def initialize(company:)
    @company = company
    # Use encode_www_form to encode query-string parameters
    @url = URL % URI.encode_www_form(manufacturer: company)
  end

  def document
    # Lazy-initialize a parsed version of the page
    @document ||= Nokogiri::HTML(open(@url).read)
  end

  def products
    document.css('div.product-primary').map do |product_name|
      {
        name: product_name.css('h2.product-name').text
      }
    end
  end

  def prices
    document.css('div.price-box').map do |product_price|
      {
        amount: product_price.css('span.price').text
      }
    end
  end
end
This fixes a lot of the data propagation problems you had in your original. When you declare a variable it's a local variable, meaning it doesn't exist outside of that particular call of that particular method. If you want to persist it for longer you need to use instance variables, as in @products, or you need to define methods that return the data you need.
The above approach combines that, using a lazy-initialized instance variable to persist the parsed document, and exposes that as a method the other methods can use.
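As a contrived side example of that scoping difference (a hypothetical class, not part of the scraper):

class Example
  def set_values
    local = 1     # local variable: gone as soon as this method returns
    @instance = 2 # instance variable: persists on the object
  end

  def read_values
    @instance # => 2 after set_values has run; local is not visible here
  end
end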
Now you can spin this up:
scraper = Wheyscrapper.new(company: "Body & Fit")
That should make everything available directly:
scraper.prices
scraper.products
When you learn how to use Ruby effectively you'll often find solutions to your problems that are really minimal. Usually a lot of Ruby code is a sign that it's not being used properly.
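If you still want the CSV output hinted at by the commented-out line in the original code, here is a hedged sketch on top of the refactored class; the wheyprotein.csv filename comes from that comment, and pairing products with prices by position is an assumption:

require 'csv'

scraper = Wheyscrapper.new(company: "Body & Fit")

CSV.open('wheyprotein.csv', 'wb') do |csv|
  csv << %w[name price]
  # Pair products and prices by position; assumes both selectors match row-for-row.
  scraper.products.zip(scraper.prices).each do |product, price|
    csv << [product[:name], price && price[:amount]]
  end
end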
This could be refactored in a better way, but based on my comments above it should at least work without a refactor:
require 'open-uri'
require 'nokogiri'
require 'httparty'
require 'csv'

class Wheyscrapper
  def whey_scrapper
    company = 'Body+%26+fit'
    url = "https://www.bodyenfitshop.nl/afslanken/afslank-toppers/?manufacturer=#{company}"
    unparsed_page = open(url).read
    @parsed_page = Nokogiri::HTML(unparsed_page)
    product_scrapper
    prices_scrapper
    # csv = CSV.open('wheyprotein.csv', 'wb')
  end

  def product_scrapper
    @products = Array.new
    product_names = @parsed_page.css('div.product-primary')
    product_names.each do |product_name|
      product = {
        name: product_name.css('h2.product-name').text
      }
      @products << product
    end
  end

  def prices_scrapper
    @prices = Array.new
    @product_prices = @parsed_page.css('div.price-box')
    @product_prices.each do |product_price|
      price = {
        amount: product_price.css('span.price').text
      }
      @prices << price
    end
  end
end

w = Wheyscrapper.new.whey_scrapper
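Note that w here ends up holding whatever the last call (prices_scrapper) returned, not the scraper object. If you want to read the scraped data afterwards, one small addition (not in the answer above, just a sketch) is a pair of readers:

class Wheyscrapper
  attr_reader :products, :prices
  # ... same methods as above ...
end

w = Wheyscrapper.new
w.whey_scrapper
w.products # => [{ name: "..." }, ...]
w.prices   # => [{ amount: "..." }, ...]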

Adding a specific URL to BASE_PATH in order to scrape a webpage using Nokogiri

I am new to Ruby and this site, so please bear with me! I have googled endlessly to no avail.
I am trying to pass in a college object to my class method scrape_college_info that I created in the previous class method scrape_illinois_index_page, so that I may scrape the next level of information for the specific college the user selects using Pry and Nokogiri. Unfortunately, I keep getting an argument error.
I know it isn't the prettiest, but this is my code right now:
class College
  attr_accessor :name, :location, :size, :type, :url, :link

  BASE_PATH = "https://www.collegesimply.com/colleges/illinois/"

  def self.college
    self.scrape_colleges
  end

  def self.scrape_colleges
    colleges = self.scrape_illinois_index_page
    colleges
  end

  def self.scrape_illinois_index_page
    doc = Nokogiri::HTML(open(BASE_PATH))
    # binding.pry
    colleges = []
    doc.xpath("//tr").each do |doc|
      college = self.new
      if doc.css("td")[0] != nil
        college.name = doc.css("td")[0].text.strip
      end
      if doc.css("td")[1] != nil
        college.location = doc.css("td")[1].text.strip
      end
      if doc.css('table.table tbody tr td:nth-child(1) a')[0] != nil
        college.link = doc.css('table.table tbody tr td:nth-child(1) a')[0]['href']
      end
      colleges << college
    end
    colleges
  end

  def self.scrape_college_info(college)
    doc = Nokogiri::HTML(open(BASE_PATH + "#{college.link}"))
  end
end
Try the code below to get college.link:
if doc.css("td")[0] != nil
  college.name = doc.css("td")[0].text.strip
  college.link = doc.css("td")[0].css("a").map { |a| a['href'] }[0]
end
Now you can use the college link like this:
def self.scrape_college_info(college)
  doc = Nokogiri::HTML(open("https://www.collegesimply.com" + "#{college.link}"))
end
Hope this solves your problem. Please let me know if it works for you.
Try using URI.join:
new_url = URI.join(BASE_PATH, college.link).to_s
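Applied to the method from the question, that could look like the sketch below; URI.join comes from the uri library and open from open-uri (as in the original code), and college.link is assumed to hold the href scraped from the index page:

require 'uri'
require 'open-uri'

def self.scrape_college_info(college)
  # Resolve the scraped href (absolute or relative) against BASE_PATH
  Nokogiri::HTML(open(URI.join(BASE_PATH, college.link).to_s))
end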

How does this Ruby script using the Twitter API scrape user IDs?

Hello, I have modified this older code to scrape Twitter usernames, but for some reason it also scrapes user IDs. I don't understand how it does that, since I don't see "user_id" anywhere in the code, which is what you should use to get user IDs according to the Twitter API documentation.
Here is the code
def my_usernames
  "UHDTelevisions"
end

def my_userinfo(names)
  @client.followers(names)
end

def my_userhash(users)
  userhash = {}
  users.each do |user|
    userhash[user.screen_name] = user.id.to_s
  end
  return userhash
end

def my_users
  my_userhash(my_userinfo(my_usernames))
end

def my_csv(my_users)
  CSV.open('./my_users.csv', 'a+') do |csv|
    my_users.each do |k, v|
      csv << [k, v]
    end
  end
end
Here is the line that builds a hash of {name => id}:
userhash[user.screen_name] = user.id.to_s
At that point we already have the user object, which contains the id among its other attributes. To return just the list of names, one might simply use:
@client.followers("UHDTelevisions").map(&:screen_name)
instead of all the code above.
If you wanted to keep a parallel structure, you could change it to have a my_userarray method (since you only need values, not key-value pairs, I assume):
def my_userarray(users)
  userarray = []
  users.each do |user|
    userarray << user.screen_name
  end
  return userarray
end
You would need to update the my_users method as well, of course, to reflect the new my_userarray method name, as shown in the sketch below.
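A hedged sketch of those follow-up changes, including my_csv, which currently expects key/value pairs:

def my_users
  my_userarray(my_userinfo(my_usernames))
end

def my_csv(my_users)
  CSV.open('./my_users.csv', 'a+') do |csv|
    my_users.each do |screen_name|
      csv << [screen_name] # one column now: just the screen name
    end
  end
end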

Sidekiq mechanize overwritten instance

I am building a simple web spider using Sidekiq and Mechanize.
When I run this for one domain, it works fine. When I run it for multiple domains, it fails. I believe the reason is that web_page gets overwritten when instantiated by another Sidekiq worker, but I am not sure if that's true or how to fix it.
# my scrape_search controller's create action searches on google.
def create
  @scrape = ScrapeSearch.build(keywords: params[:keywords], profession: params[:profession])
  agent = Mechanize.new
  scrape_search = agent.get('http://google.com/') do |page|
    search_result = page.form...
    search_result.css("h3.r").map do |link|
      result = link.at_css('a')['href'] # Narrowing down to real search results
      @domain = Domain.new(some params)
      ScrapeDomainWorker.perform_async(@domain.url, @domain.id, remaining_keywords)
    end
  end
end
I'm creating a Sidekiq job per domain. Most of the domains I'm looking for should contain just a few pages, so there's no need for sub-jobs per page.
This is my worker:
class ScrapeDomainWorker
  include Sidekiq::Worker
  ...

  def perform(domain_url, domain_id, keywords)
    @domain = Domain.find(domain_id)
    @domain_link = @domain.protocol + '://' + domain_url
    @keywords = keywords
    # First we scrape the homepage and get the first links
    @domain.to_parse = ['/'] # to_parse is an array of PATHS to parse for the domain
    mechanize_path('/')
    @domain.verified << '/' # verified is an Array field containing valid domain paths
    get_paths(@web_page) # Now we should have to_scrape populated with homepage links
    @domain.scraped = 1 # Loop counter
    while @domain.scraped < 100
      @domain.to_parse.each do |path|
        @domain.to_parse.delete(path)
        @domain.scraped += 1
        mechanize_path(path) # We create a Nokogiri HTML doc with mechanize for the valid path
        ...
        get_paths(@web_page) # Fire this to repopulate to_scrape !!!
      end
    end
    @domain.save
  end

  def mechanize_path(path)
    agent = Mechanize.new
    begin
      @web_page = agent.get(@domain_link + path)
    rescue Exception => e
      puts "Mechanize Exception for #{path} :: #{e.message}"
    end
  end

  def get_paths(web_page)
    paths = web_page.links.map { |link| link.href.gsub((@domain.protocol + '://' + @domain.url), "") } ## This works when I scrape a single domain, but fails with ".gsub for nil" when I scrape a few domains.
    paths.uniq.each do |path|
      @domain.to_parse << path
    end
  end
end
This works when I scrape a single domain, but fails with .gsub for nil for web_page when I scrape a few domains.
You can wrap your code in another class, and then create an object of that class within your worker:
class ScrapeDomainWrapper
  def initialize(domain_url, domain_id, keywords)
    # ...
  end

  def mechanize_path(path)
    # ...
  end

  def get_paths(web_page)
    # ...
  end
end
And your worker:
class ScrapeDomainWorker
  include Sidekiq::Worker

  def perform(domain_url, domain_id, keywords)
    ScrapeDomainWrapper.new(domain_url, domain_id, keywords)
  end
end
Also, bear in mind that @web_page can be nil (mechanize_path rescues the exception without setting it), and that a link's href from Mechanize::Page#links can be nil as well, which is where the .gsub for nil error comes from.
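For the nil error specifically, a defensive version of get_paths might look like this (a sketch reusing the worker's instance variables; it simply skips links without an href):

def get_paths(web_page)
  return if web_page.nil? # mechanize_path may have rescued an error and left @web_page unset

  base = @domain.protocol + '://' + @domain.url
  # Drop links with no href, then strip the domain prefix to keep only paths
  paths = web_page.links.map(&:href).compact.map { |href| href.gsub(base, "") }
  paths.uniq.each do |path|
    @domain.to_parse << path
  end
end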
