Adding specific URL to BASE PATH in order to scrape webpage using Nokogiri - ruby

I am new to Ruby and this site, so please bear with me! I have googled endlessly to no avail.
I am trying to pass in a college object to my class method scrape_college_info that I created in the previous class method scrape_illinois_index_page, so that I may scrape the next level of information for the specific college the user selects using Pry and Nokogiri. Unfortunately, I keep getting an argument error.
I know it isn't the prettiest, but this is my code right now:
class College
attr_accessor :name, :location, :size, :type, :url
BASE_PATH = "https://www.collegesimply.com/colleges/illinois/"
def self.college
self.scrape_colleges
end
def self.scrape_colleges
colleges = self.scrape_illinois_index_page
colleges
end
def self.scrape_illinois_index_page
doc = Nokogiri::HTML(open(BASE_PATH))
# binding.pry
colleges = []
doc.xpath("//tr").each do |doc|
college = self.new
if doc.css("td")[0] != nil
college.name = doc.css("td")[0].text.strip
end
if doc.css("td")[1] != nil
college.location = doc.css("td")[1].text.strip
end
if doc.css('table.table tbody tr td:nth-child(1) a')[0] != nil
college.link = doc.css('table.table tbody tr td:nth-child(1) a')[0]['href']
end
colleges << college
end
colleges
end
def self.scrape_college_info(college)
doc = Nokogiri::HTML(open(BASE_PATH + "#{college.link}"))
end
end

Try the code below to get college.link. Note that College also needs a link accessor: the attr_accessor list in the question declares :url but not :link, so college.link = ... would raise NoMethodError until :link is added.
if doc.css("td")[0] != nil
college.name = doc.css("td")[0].text.strip
college.link = doc.css("td")[0].css("a").map{|a| a['href']}[0]
end
Now you can pass the college link like this:
def self.scrape_college_info(college)
doc = Nokogiri::HTML(open("https://www.collegesimply.com" + "#{college.link}"))
end
Hope this solves your problem. Please let me know if it works for you.

Try using URI.join:
new_url = URI.join(BASE_PATH, college.link).to_s
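To illustrate why URI.join helps, here is a minimal sketch. It assumes the scraped href is a root-relative path; the href value and the helper name college_url are made up for illustration and are not taken from the site or the question.

```ruby
require 'uri'

# Resolve an href against the base URL the way a browser would.
# (college_url is a hypothetical helper name.)
def college_url(base, href)
  URI.join(base, href).to_s
end

base = "https://www.collegesimply.com/colleges/illinois/"
href = "/colleges/illinois/example-college/" # hypothetical href value

# Plain concatenation repeats the path segment.
puts base + href
# URI.join replaces the base path because the href starts with "/".
puts college_url(base, href)
```

With a root-relative href, concatenation produces a doubled path, while URI.join keeps only the scheme and host from the base.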

Related

Boolean method not returning in different situations [RUBY]

I'm building a simple web-scraper (scraping jobs from indeed.com) for practice and I'm trying to implement the following method (low_salary?(salary)). The aim is for the method to compare a minimum (i.e. desired) salary with the offered salary contained in the job object (@salary):
class Job
attr_reader :title, :company, :location, :salary, :url
def initialize(title, company, location, salary, url)
@title = title
@company = company
@location = location
@salary = salary
@url = url
end
def low_salary?(minimum_salary)
return if !@salary
minimum_salary < @salary.split(/[^\d]/)[1..2].join.to_i
end
end
In the following test program, the method works fine when comparing @salary with the min_salary variable given to it: delete_if appropriately deletes the elements that return true for low_salary?, and the method returns correctly when @salary is nil (Indeed listings don't always include the salary, so my assumption is that there will be some nil values). (Also: I am unsure why minimum_salary < @salary works but @salary < minimum_salary doesn't, but this does exactly what I want it to do.) The test program:
require_relative('job_class.rb')
job = Job.new("designer", "company", "location", "£23,000 a year", "url")
job_results = []
job_results.push(job)
min_salary = 50000
print job.low_salary?(min_salary)
job_results.delete_if { |job| job.low_salary?(min_salary) }
print job_results
However in my scraper program, I get a no method error when calling the method: job_class.rb:16:in "low_salary?": undefined method `join' for nil:NilClass (NoMethodError)
require 'nokogiri'
require 'open-uri'
require_relative 'job_class.rb'
class JobSearchTool
def initialize(job_title, location, salary)
@document = Nokogiri::HTML(open("https://uk.indeed.com/jobs?q=#{job_title.gsub('-', '+')}&l=#{location}"))
@job_listings = @document.css('div.mosaic-provider-jobcards > a')
@salary = salary.to_i
@job_results = []
end
def scrape_jobs
@job_listings.each do |job_card|
@job_results.push(Job.new(
job_card.css('h2 > span').text, #title
job_card.css('span.companyName').text, #company
job_card.css('div.companyLocation').text, #location
job_card.css('span.salary-snippet').text, #salary
job_card['href']) #url
)
end
end
def format_jobs
@job_results.each do |job|
puts <<~JOB
#{job.title} - #{job.company} in #{job.location} :#{job.salary}
Apply at: #{job.url}
---------------------------------------------------------------------------------
JOB
end
end
def check_salary
@job_results.delete_if { |job| job.low_salary?(@salary) }
end
def run
scrape_jobs
check_salary
format_jobs
end
end
if __FILE__ == $0
job_search_tool = JobSearchTool.new(ARGV[0], ARGV[1], ARGV[2])
job_search_tool.run
end
Obviously something from the scraper programme is influencing the method somehow, but I can't understand what it could be. I'm using the method in the exact same way as the test program, so what difference is causing the method not to return when @salary is nil?
A quick search on the URL you're scraping shows there are job posts that don't have a salary, so, when you get the data from that HTML element and initialize a new Job object, the salary is an empty string, and knowing that "".split(/[^\d]/)[1..2] returns nil, that's the error you get.
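A small sketch of that failure mode, using the same split expression as low_salary? (the helper name salary_digits is hypothetical, introduced only for this demonstration):

```ruby
# Extract the digit groups the same way the original low_salary? does.
def salary_digits(salary)
  salary.split(/[^\d]/)[1..2]
end

p salary_digits("£23,000 a year") # an array of digit groups
p salary_digits("")               # nil, so calling .join on it raises NoMethodError
```

An empty string splits into an empty array, and slicing an empty array with 1..2 yields nil rather than an empty array, which is why the nil check on @salary never triggers.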
You must add a way to handle job posts without a salary:
class Job
attr_reader :title, :company, :location, :salary, :url
def initialize(title, company, location, salary, url)
@title = title
@company = company
@location = location
@salary = salary.to_s # Explicit conversion of nil to string
@url = url
end
def low_salary?(minimum_salary)
return if parsed_salary.zero? # parsed_salary always returns an integer,
# so you can check for zero,
# and not just for a falsy value
minimum_salary < parsed_salary
end
private
def parsed_salary
salary[/(?<=£)(\d|,)*(?=\s)/]
.to_s # converts nil to "" if the regex doesn't capture anything
.tr(",", "") # removes the commas to parse the string as an integer
.to_i # parses the string to its corresponding integer representation
end
end
Notice the regex isn't meant to capture everything, but it works with the salary as rendered in the website.
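To see what the regex does and does not capture, here is the same extraction as a standalone function. The name parse_salary and the sample strings are assumptions for illustration; real listings may be formatted differently.

```ruby
# Same steps as the private parsed_salary method above, as a plain function.
def parse_salary(text)
  text.to_s[/(?<=£)(\d|,)*(?=\s)/] # digits and commas right after "£", up to whitespace
    .to_s                          # nil becomes "" when nothing matched
    .tr(",", "")                   # drop thousands separators
    .to_i                          # "" becomes 0
end

p parse_salary("£23,000 a year")
p parse_salary("£30,000 - £35,000 a year") # only the first figure is captured
p parse_salary("£12.50 an hour")           # the decimal point prevents a match
p parse_salary(nil)
```

A salary with a decimal point, or no salary at all, parses to 0 and is therefore treated as "no salary" by low_salary?.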

NameError Exception: undefined local variable or method `products' for Wheyscrapper:Class

I'm building a small web scraper using Ruby and now I'm trying to refactor my code. Unfortunately, I'm encountering some errors while I'm refactoring my code. This is one of the errors.
Basically, I'm calling two separate methods from the first method, whey_scrapper. Each of these two methods is responsible for scraping a specific item on the webpage. When I run and debug this code with byebug and try to display the products or prices I've scraped, I get an error message saying that 'products' or 'prices' is undefined. This is my current code:
require 'open-uri'
require 'nokogiri'
require 'httparty'
require 'byebug'
require 'csv'
class Wheyscrapper
def whey_scrapper
company = 'Body+%26+fit'
url = "https://www.bodyenfitshop.nl/afslanken/afslank-toppers/?manufacturer=#{company}"
unparsed_page = open(url).read
parsed_page = Nokogiri::HTML(unparsed_page)
product_scrapper
prices_scrapper
# csv = CSV.open('wheyprotein.csv', 'wb')
end
def product_scrapper
products = Array.new
product_names = parsed_page.css('div.product-primary')
product_names.each do |product_name|
product = {
name: product_name.css('h2.product-name').text
}
products << product
end
end
def prices_scrapper
prices = Array.new
product_prices = parsed_page.css('div.price-box')
product_prices.each do |product_price|
price = {
amount: product_price.css('span.price').text
}
prices << price
end
end
byebug
whey_scrapper
end
There's a lot going on here, but to make it more Ruby you'd consider making those lazy-initialized and giving them names that reflect that:
require 'open-uri'
require 'nokogiri'
class Wheyscrapper
  URL = "https://www.bodyenfitshop.nl/afslanken/afslank-toppers/?%s"
  # Readers so that `url` and `company` can be called as methods below
  attr_reader :company, :url
  def initialize(company:)
    @company = company
    # Use encode_www_form to encode query-string parameters
    @url = URL % URI.encode_www_form(manufacturer: company)
  end
  def document
    # Lazy-initialize a parsed version of the page.
    # URI.open comes from open-uri; plain open(url) no longer handles URLs on Ruby 3.0+.
    @document ||= Nokogiri::HTML(URI.open(url).read)
  end
  def products
    document.css('div.product-primary').map do |product_name|
      {
        name: product_name.css('h2.product-name').text
      }
    end
  end
  def prices
    document.css('div.price-box').map do |product_price|
      {
        amount: product_price.css('span.price').text
      }
    end
  end
end
This fixes a lot of the data propagation problems you had in your original. When you declare a variable it's a local variable, meaning it doesn't exist outside of that particular call of that particular method. If you want to persist it for longer you need to use instance variables, as in @products, or you need to define methods that return the data you need.
The above approach combines that, using a lazy-initialized instance variable to persist the parsed document, and exposes that as a method the other methods can use.
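The two ideas in these paragraphs (locals vanish when the method returns; a lazy-initialized instance variable persists) can be seen in isolation in the sketch below. PageDemo and its methods are hypothetical names, and fetch is a stand-in for the real download-and-parse step, so nothing here touches the network.

```ruby
class PageDemo
  attr_reader :fetch_count

  def initialize
    @fetch_count = 0
  end

  # `products` is a local variable: it is gone once this method returns.
  def build_local
    products = ["whey"]
    products.size
  end

  # No local or method named `products` exists here, so Ruby raises
  # NameError, the same kind of error reported in the question.
  def read_products
    products
  rescue NameError
    :undefined
  end

  # Stand-in for the expensive download-and-parse step.
  def fetch
    @fetch_count += 1
    "<html>page</html>"
  end

  # Lazy-initialized: fetch runs on the first call only.
  def document
    @document ||= fetch
  end
end

demo = PageDemo.new
demo.build_local
p demo.read_products # :undefined
demo.document
demo.document
p demo.fetch_count   # 1
```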
Now you can spin this up:
scraper = Wheyscrapper.new(company: "Body & Fit")
Where that should enable everything to be available directly:
scraper.prices
scraper.products
When you learn how to use Ruby effectively you'll often find solutions to your problems that are really minimal. Usually a lot of Ruby code is a sign that it's not being used properly.
This should be refactored in a better way, but based on my comments above, this should at least work without a refactor:
require 'open-uri'
require 'nokogiri'
require 'httparty'
require 'csv'
class Wheyscrapper
def whey_scrapper
company = 'Body+%26+fit'
url = "https://www.bodyenfitshop.nl/afslanken/afslank-toppers/?manufacturer=#{company}"
unparsed_page = open(url).read
@parsed_page = Nokogiri::HTML(unparsed_page)
product_scrapper
prices_scrapper
# csv = CSV.open('wheyprotein.csv', 'wb')
end
def product_scrapper
@products = Array.new
product_names = @parsed_page.css('div.product-primary')
product_names.each do |product_name|
product = {
name: product_name.css('h2.product-name').text
}
@products << product
end
end
def prices_scrapper
@prices = Array.new
@product_prices = @parsed_page.css('div.price-box')
@product_prices.each do |product_price|
price = {
amount: product_price.css('span.price').text
}
@prices << price
end
end
end
w = Wheyscrapper.new.whey_scrapper

How do I implement hashids in ruby on rails

I will go ahead and apologize upfront as I am new to ruby and rails and I cannot for the life of me figure out how to implement using hashids in my project. The project is a simple image host. I have it already working using Base58 to encode the sql ID and then decode it in the controller. However I wanted to make the URLs more random hence switching to hashids.
I have placed the hashids.rb file in my lib directory from here: https://github.com/peterhellberg/hashids.rb
Now some of the confusion starts here. Do I need to initialize hashids on every page that uses hashids.encode and hashids.decode via
hashids = Hashids.new("mysalt")
I found this post (http://zogovic.com/post/75234760043/youtube-like-ids-for-your-activerecord-models) which leads me to believe I can put it into an initializer however after doing that I am still getting NameError (undefined local variable or method `hashids' for ImageManager:Class)
so in my ImageManager.rb class I have
require 'hashids'
class ImageManager
class << self
def save_image(imgpath, name)
mime = %x(/usr/bin/exiftool -MIMEType #{imgpath})[34..-1].rstrip
if mime.nil? || !VALID_MIME.include?(mime)
return { status: 'failure', message: "#{name} uses an invalid format." }
end
hash = Digest::MD5.file(imgpath).hexdigest
image = Image.find_by_imghash(hash)
if image.nil?
image = Image.new
image.mimetype = mime
image.imghash = hash
unless image.save!
return { status: 'failure', message: "Failed to save #{name}." }
end
unless File.directory?(Rails.root.join('uploads'))
Dir.mkdir(Rails.root.join('uploads'))
end
#File.open(Rails.root.join('uploads', "#{Base58.encode(image.id)}.png"), 'wb') { |f| f.write(File.open(imgpath, 'rb').read) }
File.open(Rails.root.join('uploads', "#{hashids.encode(image.id)}.png"), 'wb') { |f| f.write(File.open(imgpath, 'rb').read) }
end
link = ImageLink.new
link.image = image
link.save
#return { status: 'success', message: Base58.encode(link.id) }
return { status: 'success', message: hashids.encode(link.id) }
end
private
VALID_MIME = %w(image/png image/jpeg image/gif)
end
end
And in my controller I have:
require 'hashids'
class MainController < ApplicationController
MAX_FILE_SIZE = 10 * 1024 * 1024
MAX_CACHE_SIZE = 128 * 1024 * 1024
@links = Hash.new
@files = Hash.new
@tstamps = Hash.new
@sizes = Hash.new
@cache_size = 0
class << self
attr_accessor :links
attr_accessor :files
attr_accessor :tstamps
attr_accessor :sizes
attr_accessor :cache_size
attr_accessor :hashids
end
def index
end
def transparency
end
def image
# @imglist = params[:id].split(',').map{ |id| ImageLink.find(Base58.decode(id)) }
@imglist = params[:id].split(',').map{ |id| ImageLink.find(hashids.decode(id)) }
end
def image_direct
#linkid = Base58.decode(params[:id])
linkid = hashids.decode(params[:id])
file =
if Rails.env.production?
puts "#{Base58.encode(ImageLink.find(linkid).image.id)}.png"
File.open(Rails.root.join('uploads', "#{Base58.encode(ImageLink.find(linkid).image.id)}.png"), 'rb') { |f| f.read }
else
puts "#{hashids.encode(ImageLink.find(linkid).image.id)}.png"
File.open(Rails.root.join('uploads', "#{hashids.encode(ImageLink.find(linkid).image.id)}.png"), 'rb') { |f| f.read }
end
send_data(file, type: ImageLink.find(linkid).image.mimetype, disposition: 'inline')
end
def upload
imgparam = params[:image]
if imgparam.is_a?(String)
name = File.basename(imgparam)
imgpath = save_to_tempfile(imgparam).path
else
name = imgparam.original_filename
imgpath = imgparam.tempfile.path
end
File.chmod(0666, imgpath)
%x(/usr/bin/exiftool -all= -overwrite_original #{imgpath})
logger.debug %x(which exiftool)
render json: ImageManager.save_image(imgpath, name)
end
private
def save_to_tempfile(url)
uri = URI.parse(url)
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = uri.scheme == 'https'
http.start do
resp = http.get(uri.path)
file = Tempfile.new('urlupload', Dir.tmpdir, :encoding => 'ascii-8bit')
file.write(resp.body)
file.flush
return file
end
end
end
Then in my image.html.erb view I have this:
<%
@imglist.each_with_index { |link, i|
id = hashids.encode(link.id)
ext = link.image.mimetype.split('/')[1]
if ext == 'jpeg'
ext = 'jpg'
end
puts id + '.' + ext
%>
Now if I add
hashids = Hashids.new("mysalt")
in ImageManager.rb, main_controller.rb, and my image.html.erb, I get this error:
ActionView::Template::Error (undefined method `id' for #<Array:0x000000062f69c0>)
So all in all implementing hashids.encode/decode is not as easy as implementing Base58.encode/decode and I am confused on how to get it working... Any help would be greatly appreciated.
I would suggest loading it as a gem by including it into your Gemfile and running bundle install. It will save you the hassle of requiring it in every file and allow you to manage updates using Bundler.
Yes, you do need to initialize it wherever it is going to be used with the same salt. Would suggest that you define the salt as a constant, perhaps in application.rb.
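One way to avoid re-creating the encoder in every file is to wrap a single shared instance in a small module. The sketch below is an illustration under stated assumptions, not the gem's API: IdCodec and StandInEncoder are hypothetical names, and StandInEncoder only mimics the shape of a Hashids instance (encode returns a string, decode returns an array) so that the example runs without the gem. In a real app you would build the instance with Hashids.new and your salt instead.

```ruby
# Hypothetical stand-in with the same shape as a Hashids instance:
# encode(id) returns a String, decode(str) returns an Array of integers.
class StandInEncoder
  def initialize(salt)
    @salt = salt # unused here; kept only to mirror the constructor shape
  end

  def encode(id)
    id.to_s(36)
  end

  def decode(str)
    [str.to_i(36)]
  end
end

module IdCodec
  SALT = "mysalt"

  # Build the encoder once and reuse it everywhere.
  def self.encoder
    @encoder ||= StandInEncoder.new(SALT)
  end

  def self.encode(id)
    encoder.encode(id)
  end

  # decode returns an array, so take the first element
  # before handing the value to something like find.
  def self.decode(str)
    encoder.decode(str).first
  end
end

p IdCodec.decode(IdCodec.encode(42))
```

The .first matters: passing the whole array from decode to ImageLink.find returns an array of records rather than a single record, which is consistent with the undefined method `id' for Array error in the question.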
The link you provided injects hashids into ActiveRecord, which means it will not work anywhere else. I would not recommend the same approach as it will require a high level of familiarity with Rails.
You might want to spend some time understanding ActiveRecord and ActiveModel. Will save you a lot of reinventing the wheel. :)
Before anything else, you should test whether Hashids is loaded in your project. Run rails c in your project folder and do a small test:
>> my_id = ImageLink.last.id
>> puts Hashids.new("mysalt").encode(my_id)
If that doesn't work, add the gem to your Gemfile (which makes a lot more sense anyway).
Then, I think you should add a getter for your hash_id in your ImageLink model.
Even if you don't want to save your hash in the database, this hash has its place in your model. See virtual attributes for more info.
Remember "Skinny Controller, Fat Model".
class ImageLink < ActiveRecord::Base
def hash_id
# cache the hash; Hashids.new takes the salt, and encode takes the id
@hash_id ||= Hashids.new("mysalt").encode(id)
end
def extension()
# you could add the logic of extension here also.
ext = image.mimetype.split('/')[1]
if ext == 'jpeg'
'jpg'
else
ext
end
end
end
Change the return in your ImageManager#save_image
link = ImageLink.new
link.image = image
# Be sure your image have been saved (validation errors, etc.)
if link.save
{ status: 'success', message: link.hash_id }
else
{ status: 'failure', message: link.errors.full_messages.join(", ") }
end
In your template
<%
@imglist.each_with_index do |link, i|
puts link.hash_id + '.' + link.extension
end # <- I prefer do..end so that I don't forget the closing brace
%>
All this code is not tested...
I was looking for something similar, where I could disguise the ids of my records. I came across acts_as_hashids.
https://github.com/dtaniwaki/acts_as_hashids
This little gem integrates seamlessly. You can still find your records through the ids. Or with the hash. On nested records you can use the method with_hashids.
To get the hash, call to_param on the object itself, which results in a string similar to this: ePQgabdg.
Since I just implemented this I can't tell how useful this gem will be. So far I just had to adjust my code a little bit.
I also gave the records a virtual attribute hashid so I can access it easily.
attr_accessor :hashid
after_find :set_hashid
private
def set_hashid
self.hashid = self.to_param
end

Ruby: Chatterbot can't load bot data

I'm picking up the Ruby language and got stuck playing with the chatterbot I have developed. A similar issue has been asked here before, and I did what was suggested: I changed the rescue in order to see the full message. But it doesn't seem right. I was running basic_client.rb in the rubybot directory, and fred.bot is also generated in that directory. Please see the error message below. Your help would be very much appreciated.
Snailwalkers-MacBook-Pro:~ snailwalker$ cd rubybot
Snailwalkers-MacBook-Pro:rubybot snailwalker$ ruby basic_client.rb
/Users/snailwalker/rubybot/bot.rb:12:in `rescue in initialize': Can't load bot data because: No such file or directory - bot_data (RuntimeError)
from /Users/snailwalker/rubybot/bot.rb:9:in `initialize'
from basic_client.rb:3:in `new'
from basic_client.rb:3:in `<main>'
basic_client.rb
require_relative 'bot.rb'
bot = Bot.new(:name => 'Fred', :data_file => 'fred.bot')
puts bot.greeting
while input = gets and input.chomp != 'end'
puts '>> ' + bot.response_to(input)
end
puts bot.farewell
bot.rb:
require 'yaml'
require './wordplay'
class Bot
attr_reader :name
def initialize(options)
@name = options[:name] || "Unnamed Bot"
begin
@data = YAML.load(File.read('bot_data'))
rescue => e
raise "Can't load bot data because: #{e}"
end
end
def greeting
random_response :greeting
end
def farewell
random_response :farewell
end
def response_to(input)
prepared_input = preprocess(input).downcase
sentence = best_sentence(prepared_input)
reversed_sentence = WordPlay.switch_pronouns(sentence)
responses = possible_responses(sentence)
responses[rand(responses.length)]
end
private
def possible_responses(sentence)
responses = []
@data[:responses].keys.each do |pattern|
next unless pattern.is_a?(String)
if sentence.match('\b' + pattern.gsub(/\*/, '') + '\b')
if pattern.include?('*')
responses << @data[:responses][pattern].collect do |phrase|
matching_section = sentence.sub(/^.*#{pattern}\s+/, '')
phrase.sub('*', WordPlay.switch_pronouns(matching_section))
end
else
responses << @data[:responses][pattern]
end
end
end
responses << @data[:responses][:default] if responses.empty?
responses.flatten
end
def preprocess(input)
perform_substitutions input
end
def perform_substitutions(input)
@data[:presubs].each {|s| input.gsub!(s[0], s[1])}
input
end
# select best_sentence by looking at longest sentence
def best_sentence(input)
hot_words = @data[:responses].keys.select do |k|
k.class == String && k =~ /^\w+$/
end
WordPlay.best_sentence(input.sentences, hot_words)
end
def random_response(key)
random_index = rand(@data[:responses][key].length)
@data[:responses][key][random_index].gsub(/\[name\]/, @name)
end
end
I'm assuming that you are trying to load the :data_file passed into Bot.new, but right now you are statically loading a file named bot_data every time. You never mentioned bot_data in the question, so if I'm right it should be like this:
@data = YAML.load(File.read(options[:data_file]))
Instead of:
@data = YAML.load(File.read('bot_data'))
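A minimal, self-contained sketch of that change follows. MiniBot and demo_bot are hypothetical names used only for this illustration, the YAML content is made up, and the sketch uses string keys rather than the bot's real data format.

```ruby
require 'yaml'
require 'tempfile'

class MiniBot
  attr_reader :name, :data

  def initialize(options)
    @name = options[:name] || "Unnamed Bot"
    data_file = options[:data_file]
    begin
      # Load the file the caller asked for instead of a hard-coded name.
      @data = YAML.load_file(data_file)
    rescue => e
      raise "Can't load bot data from #{data_file.inspect} because: #{e}"
    end
  end
end

# Demo helper: writes a tiny YAML file to a temp file,
# builds a bot from it, and returns the bot.
def demo_bot
  Tempfile.create(['fred', '.bot']) do |file|
    file.write({ "greeting" => ["Hi, I'm [name]"] }.to_yaml)
    file.flush
    MiniBot.new(name: 'Fred', data_file: file.path)
  end
end

p demo_bot.data
```

Including the file name in the error message makes it obvious which path the bot actually tried to read, which is the detail that was hidden in the original trace.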

Create dynamic variables from th class name in tables, move td values into that row's array or hash?

I'm an amateur programmer wanting to scrape data from a site that is similar to this site: http://www.highschoolsports.net/massey/ (I have permission to scrape the site, by the way.)
The target site has 'th' classes for each 'th' in row[0] but I want to ensure that each 'TD' I pull from each table is somehow linked to that th's class name, because the tables are inconsistent, for example one table might be:
row[0] - >>th.name, th.place, th.team
row[1] - >>td[0], td[1] , td[2]
while another might be:
row[0] - >>th.place, th.team, th.name
row[1] - >>td[0], td[1] , td[2] etc..
My Question: How do I capture the 'th' class name across many hundreds of tables which are inconsistent(in 'th' class order) and create the 10-14 variables(arrays), then link the 'td' corresponding to that column in the table to that dynamic variable? Please let me know if this is confusing.. there are multiple tables on a given page..
Currently my code is something like:
require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'uri'
class Result
def initialize(row)
@attrs = {}
@attrs[:raw] = row.text
end
end
class Race
def initialize(page, table)
@handle = page
@table = table
@results = []
@attrs = {}
parse!
end
def parse!
@attrs[:name] = @handle.css('div.caption').text
get_rows
end
def get_rows
# get all of the rows ..
@handle.css('tr').each do |tr|
@results << RaceResult.new(tr)
end
end
end
class Event
class << self
def all(index_url)
events = []
ourl = Nokogiri::HTML(open(index_url))
ourl.css('a.event').each do |url|
abs_url = MAIN + url.attributes["href"]
events << Event.new(abs_url)
end
events
end
end
def initialize(url)
@url = url
@handle = nil
@attrs = {}
@races = []
@sub_events = []
parse!
end
def parse!
@handle = Nokogiri::HTML(open(@url))
get_page_meta
if(@handle.css('table.base.event_results').length > 0)
@handle.search('div.table_container.event_results').each do |table|
@races << Race.new(@handle, table)
end
else
@handle.css('div.centered a.obvious').each do |ol|
@sub_events << Event.new(BASE_URL + ol.attributes["href"])
end
end
end
def get_page_meta
@attrs[:name] = @handle.css('html body div.content h2 text()')[0] # event name
@attrs[:date] = @handle.xpath("html/body/div/div/text()[2]").text.strip # date
end
end
A friend has been helping me with this and I'm just starting to get a grasp on OOP, but I'm only capturing the tables; they're not split into td's and stored in some kind of variable/array/hash etc. I need help understanding this process or how to do this. The critical piece would be dynamically assigning variable names according to the classes of the data and moving the 'td's' from that column (all td[2]'s for example) into that dynamic variable name. I can't tell you how amazing it would be if someone actually could help me solve this problem and understand how to make this work. Thank you in advance for any help!
It's easy once you realize that the th contents are the keys of your hash. Example:
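@items = []
doc.css('table.masseyStyleTable').each do |table|
  # The header texts become the hash keys for every row of this table
  fields = table.css('th').map { |x| x.text.strip }
  table.css('tr').each do |tr|
    cells = tr.css('td')
    next if cells.empty? # skip the header row, which has no td cells
    item = {}
    fields.each_with_index do |field, i|
      # a missing cell becomes an empty string
      item[field] = cells[i] ? cells[i].text.strip : ''
    end
    @items << item
  end
end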
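The core idea (pair each row's cells with the header names of its own table) can be sketched without Nokogiri. In the example below, plain arrays stand in for the scraped th and td texts; rows_to_hashes and the sample data are hypothetical.

```ruby
# Pair each row's cells with the header names of its own table,
# so the column order of a given table no longer matters.
def rows_to_hashes(headers, rows)
  rows.map { |cells| headers.zip(cells).to_h }
end

# Two tables with the same columns in a different order.
table_a = rows_to_hashes(%w[name place team], [["Ann", "1", "Massey"]])
table_b = rows_to_hashes(%w[place team name], [["2", "Central", "Bob"]])

p table_a.first["name"]
p table_b.first["name"]
```

Because every row is keyed by header name rather than by position, the same lookup works on both tables. If the th class names are preferred over the header text, the headers array could be built from those class attributes instead.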

Resources