undefined method `each' for nil:NilClass when using Nokogiri - Ruby

I am trying to fetch all the links on a given page, but it gives me the error undefined method `each' for nil:NilClass
require 'nokogiri'

def find_links(link)
  page = Nokogiri::HTML(open(link))
  link_size = page.css('li')
  (0..link_size.length).each do |index|
    b = link_size[index]['href']
    return b
  end
end

find_links('http://code.tutsplus.com/tutorials/you-dont-know-anything-about-regular-expressions-a-complete-guide--net-7869').each do |url|
  puts url
end

There are a couple of issues in your code. Explanations are inline below:
def find_links(link)
  page = Nokogiri::HTML(open(link))
  link_size = page.css('li')
  (0..link_size.length).each do |index|
    b = link_size[index]['href'] # You are reading 'href' off the list item itself, which is wrong; the attribute lives on the anchor inside it.
    return b # Don't return from this block if you want all the links: returning here exits the whole method after the first iteration.
    # That's also why you get the nil error: the first list item has no href attribute, so link_size[index]['href'] is nil.
  end
end
Change your code to the following (explanations inline):
require 'nokogiri'
require 'open-uri'

def find_links(link)
  page = Nokogiri::HTML(open(link))
  # You want to iterate over anchor tags rather than list items.
  # Note the use of `map`: it returns an array, and since this is the last
  # statement in the method, that array of links is the method's return value.
  # .css('li a') selects all the anchor tags that have a list item in their parent chain.
  page.css('li a').map { |x| x['href'] }
end

find_links('http://code.tutsplus.com/tutorials/you-dont-know-anything-about-regular-expressions-a-complete-guide--net-7869').each do |url|
  puts url
end

require 'nokogiri'
require 'open-uri'

def find_links(link)
  page = Nokogiri::HTML(open(link))
  link_array = page.css('li a')
  # Index from 0 and stop before length; (1..length) would both skip the
  # first link and run one past the end of the array.
  (0...link_array.length).each do |f|
    puts link_array[f]['href']
  end
end

find_links('http://code.tutsplus.com/tutorials/you-dont-know-anything-about-regular-expressions-a-complete-guide--net-7869')

Related

NameError Exception: undefined local variable or method `products' for Wheyscrapper:Class

I'm building a small web scraper using Ruby and now I'm trying to refactor my code. Unfortunately, I'm encountering some errors while refactoring. This is one of them.
Basically, I'm calling two separate methods from the first method, whey_scrapper. Each of these two methods is responsible for scraping a specific item on the webpage. When I run and debug this code with byebug and try to display the products or prices I've scraped, I get an error message saying that 'products' or 'prices' is undefined. This is my current code:
require 'open-uri'
require 'nokogiri'
require 'httparty'
require 'byebug'
require 'csv'

class Wheyscrapper
  def whey_scrapper
    company = 'Body+%26+fit'
    url = "https://www.bodyenfitshop.nl/afslanken/afslank-toppers/?manufacturer=#{company}"
    unparsed_page = open(url).read
    parsed_page = Nokogiri::HTML(unparsed_page)
    product_scrapper
    prices_scrapper
    # csv = CSV.open('wheyprotein.csv', 'wb')
  end

  def product_scrapper
    products = Array.new
    product_names = parsed_page.css('div.product-primary')
    product_names.each do |product_name|
      product = {
        name: product_name.css('h2.product-name').text
      }
      products << product
    end
  end

  def prices_scrapper
    prices = Array.new
    product_prices = parsed_page.css('div.price-box')
    product_prices.each do |product_price|
      price = {
        amount: product_price.css('span.price').text
      }
      prices << price
    end
  end

  byebug
  whey_scrapper
end
There's a lot going on here, but to make it more idiomatic Ruby you'd consider making those values lazy-initialized and giving them names that reflect that:
require 'nokogiri'
require 'open-uri'

class Wheyscrapper
  URL = "https://www.bodyenfitshop.nl/afslanken/afslank-toppers/?%s"

  def initialize(company:)
    @company = company
    # Use encode_www_form to encode query-string parameters
    @url = URL % URI.encode_www_form(manufacturer: company)
  end

  def document
    # Lazy-initialize a parsed version of the page
    @document ||= Nokogiri::HTML(open(@url).read)
  end

  def products
    document.css('div.product-primary').map do |product_name|
      {
        name: product_name.css('h2.product-name').text
      }
    end
  end

  def prices
    document.css('div.price-box').map do |product_price|
      {
        amount: product_price.css('span.price').text
      }
    end
  end
end
This fixes a lot of the data-propagation problems in your original. When you declare a variable it's a local variable, meaning it doesn't exist outside of that particular call of that particular method. If you want data to persist for longer you need to use instance variables, as in @products, or you need to define methods that return the data you need.
The above approach combines both, using a lazy-initialized instance variable to persist the parsed document, and exposes it as a method the other methods can use.
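As a tiny illustration of the ||= idiom (with a hypothetical expensive_lookup standing in for the parsing step):

def products
  # expensive_lookup runs only on the first call; every later call
  # returns the cached @products instead of recomputing it
  @products ||= expensive_lookup
end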
Now you can spin this up:
scraper = Wheyscrapper.new(company: "Body & Fit")
Where that should enable everything to be available directly:
scraper.prices
scraper.products
When you learn how to use Ruby effectively, you'll often find that solutions to your problems are really minimal. A lot of Ruby code is usually a sign that the language isn't being used properly.
This could be refactored further, but based on my comments above it should at least work without a larger refactor:
require 'open-uri'
require 'nokogiri'
require 'httparty'
require 'csv'

class Wheyscrapper
  def whey_scrapper
    company = 'Body+%26+fit'
    url = "https://www.bodyenfitshop.nl/afslanken/afslank-toppers/?manufacturer=#{company}"
    unparsed_page = open(url).read
    @parsed_page = Nokogiri::HTML(unparsed_page)
    product_scrapper
    prices_scrapper
    # csv = CSV.open('wheyprotein.csv', 'wb')
  end

  def product_scrapper
    @products = Array.new
    product_names = @parsed_page.css('div.product-primary')
    product_names.each do |product_name|
      product = {
        name: product_name.css('h2.product-name').text
      }
      @products << product
    end
  end

  def prices_scrapper
    @prices = Array.new
    @product_prices = @parsed_page.css('div.price-box')
    @product_prices.each do |product_price|
      price = {
        amount: product_price.css('span.price').text
      }
      @prices << price
    end
  end
end

w = Wheyscrapper.new.whey_scrapper
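For the CSV step that is commented out in the question, here is a minimal sketch, assuming attr_reader :products, :prices is added to Wheyscrapper so the scraped arrays can be read from outside:

require 'csv'

scrapper = Wheyscrapper.new
scrapper.whey_scrapper

CSV.open('wheyprotein.csv', 'wb') do |csv|
  csv << %w[name amount]
  # zip pairs each product hash with the price hash at the same index;
  # price is nil when the two lists differ in length
  scrapper.products.zip(scrapper.prices).each do |product, price|
    csv << [product[:name], price && price[:amount]]
  end
end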

Getting all unique URLs using Nokogiri

I've been working for a while trying to use the .uniq method to generate a unique list of URLs from a website (within the /informatics path). No matter what I try, I get a method error when generating the list. I'm sure it's a syntax issue, and I was hoping someone could point me in the right direction.
Once I have the list I'll need to store the URLs in a database via ActiveRecord, but I need the unique list before I start to wrap my head around that.
require 'nokogiri'
require 'open-uri'
require 'active_record'

ARGV[0] = "https://www.nku.edu/academics/informatics.html"

ARGV.each do |arg|
  open(arg) do |f|
    # Display connection data
    puts "#" * 25 + "\nConnection: '#{arg}'\n" + "#" * 25
    [:base_uri, :meta, :status, :charset, :content_encoding,
     :content_type, :last_modified].each do |method|
      puts "#{method.to_s}: #{f.send(method)}" if f.respond_to? method
    end

    # Display the href links
    base_url = /^(.*\.nku\.edu)\//.match(f.base_uri.to_s)[1]
    puts "base_url: #{base_url}"

    Nokogiri::HTML(f).css('a').each do |anchor|
      href = anchor['href']
      # Make unique
      if href =~ /.*informatics/
        puts href
        # store stuff to ActiveRecord
      end
    end
  end
end
Replace the Nokogiri::HTML part to select only those anchors whose href attribute matches /.*informatics/, and then you can call uniq on the result, since it's already an array:
require 'nokogiri'
require 'open-uri'
require 'active_record'

ARGV[0] = 'https://www.nku.edu/academics/informatics.html'

ARGV.each do |arg|
  open(arg) do |f|
    puts "#{'#' * 25} \nConnection: '#{arg}'\n #{'#' * 25}"
    %i[base_uri meta status charset content_encoding content_type last_modified].each do |method|
      puts "#{method}: #{f.send(method)}" if f.respond_to? method
    end
    puts "base_url: #{/^(.*\.nku\.edu)\//.match(f.base_uri.to_s)[1]}"
    anchors = Nokogiri::HTML(f).css('a').select { |anchor| anchor['href'] =~ /.*informatics/ }
    puts anchors.map { |anchor| anchor['href'] }.uniq
  end
end
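For the ActiveRecord step mentioned in the question, a minimal sketch, assuming a hypothetical Link model backed by an existing links table with a url string column (the connection settings are illustrative):

require 'active_record'

ActiveRecord::Base.establish_connection(
  adapter: 'sqlite3',
  database: 'links.db'
)

class Link < ActiveRecord::Base
end

# inside the open(arg) block, after building `anchors`:
anchors.map { |anchor| anchor['href'] }.uniq.each do |url|
  # find_or_create_by keeps the table free of duplicates across re-runs
  Link.find_or_create_by(url: url)
end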

Sidekiq mechanize overwritten instance

I am building a simple web spider using Sidekiq and Mechanize.
When I run this for one domain, it works fine. When I run it for multiple domains, it fails. I believe the reason is that web_page gets overwritten when another Sidekiq worker instantiates it, but I'm not sure if that's true or how to fix it.
# my scrape_search controller's create action searches on google.
def create
  @scrape = ScrapeSearch.build(keywords: params[:keywords], profession: params[:profession])
  agent = Mechanize.new
  scrape_search = agent.get('http://google.com/') do |page|
    search_result = page.form...
    search_result.css("h3.r").map do |link|
      result = link.at_css('a')['href'] # Narrowing down to real search results
      @domain = Domain.new(some params)
      ScrapeDomainWorker.perform_async(@domain.url, @domain.id, remaining_keywords)
    end
  end
end
I'm creating a Sidekiq job per domain. Most of the domains I'm looking for should contain just a few pages, so there's no need for sub-jobs per page.
This is my worker:
class ScrapeDomainWorker
  include Sidekiq::Worker
  ...

  def perform(domain_url, domain_id, keywords)
    @domain = Domain.find(domain_id)
    @domain_link = @domain.protocol + '://' + domain_url
    @keywords = keywords

    # First we scrape the homepage and get the first links
    @domain.to_parse = ['/'] # to_parse is an array of PATHS to parse for the domain
    mechanize_path('/')
    @domain.verified << '/' # verified is an Array field containing valid domain paths
    get_paths(@web_page) # Now we should have to_parse populated with homepage links

    @domain.scraped = 1 # Loop counter
    while @domain.scraped < 100
      @domain.to_parse.each do |path|
        @domain.to_parse.delete(path)
        @domain.scraped += 1
        mechanize_path(path) # We create a Nokogiri HTML doc with Mechanize for the valid path
        ...
        get_paths(@web_page) # Fire this to repopulate to_parse !!!
      end
    end
    @domain.save
  end

  def mechanize_path(path)
    agent = Mechanize.new
    begin
      @web_page = agent.get(@domain_link + path)
    rescue Exception => e
      puts "Mechanize Exception for #{path} :: #{e.message}"
    end
  end

  def get_paths(web_page)
    paths = web_page.links.map { |link| link.href.gsub((@domain.protocol + '://' + @domain.url), "") } ## This works when I scrape a single domain, but fails with ".gsub for nil" when I scrape a few domains.
    paths.uniq.each do |path|
      @domain.to_parse << path
    end
  end
end
This works when I scrape a single domain, but fails with .gsub for nil on web_page when I scrape a few domains.
You can wrap your code in another class and then create an object of that class within your worker:
class ScrapeDomainWrapper
  def initialize(domain_url, domain_id, keywords)
    # ...
  end

  def mechanize_path(path)
    # ...
  end

  def get_paths(web_page)
    # ...
  end
end
And your worker:
class ScrapeDomainWorker
  include Sidekiq::Worker

  def perform(domain_url, domain_id, keywords)
    ScrapeDomainWrapper.new(domain_url, domain_id, keywords)
  end
end
Also, bear in mind that Mechanize::Page#links may be nil.
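On that note, here is a minimal nil-safe variant of get_paths, a sketch assuming the same @domain attributes as above; compact drops links whose href is nil before gsub ever runs on them, which is exactly the ".gsub for nil" failure described:

def get_paths(web_page)
  base = @domain.protocol + '://' + @domain.url
  # compact removes nil hrefs (e.g. anchors without an href attribute)
  paths = web_page.links.map(&:href).compact.map { |href| href.gsub(base, '') }
  paths.uniq.each { |path| @domain.to_parse << path }
end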

Ruby - Getting page content even if it doesn't exist

I am trying to put together a series of custom 404 pages.
require 'uri'

def open(url)
  page_content = Net::HTTP.get(URI.parse(url))
  puts page_content.content
end

open('http://somesite.com/1ygjah1761')
The above code exits the program with an error. How can I get the page content from a website regardless of whether it's a 404?
You need to rescue the error:
def open(url)
  require 'net/http'
  page_content = ""
  begin
    page_content = Net::HTTP.get(URI.parse(url))
    puts page_content
  rescue Net::HTTPNotFound
    puts "THIS IS 404" + page_content
  end
end
You can find more information on something like this here: http://tammersaleh.com/posts/rescuing-net-http-exceptions/
Net::HTTP.get returns the page content directly as a string, so there is no need to call .content on the result:
page_content = Net::HTTP.get(URI.parse(url))
puts page_content
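If you also want to detect the 404 rather than just print whatever comes back, a minimal sketch using Net::HTTP.get_response (which returns a response object instead of a bare string, so the status is inspectable while the body stays available even for a 404):

require 'net/http'
require 'uri'

def fetch(url)
  response = Net::HTTP.get_response(URI.parse(url))
  puts "status: #{response.code}" # "404" for a missing page
  puts response.body              # the body is present either way
end

fetch('http://somesite.com/1ygjah1761')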

Create dynamic variables from th class names in tables, and move td values into that row's array or hash?

I'm an amateur programmer wanting to scrape data from a site similar to this one: http://www.highschoolsports.net/massey/ (I have permission to scrape the site, by the way.)
The target site has th classes for each th in row[0], but I want to ensure that each td I pull from each table is somehow linked to that th's class name, because the tables are inconsistent. For example, one table might be:
row[0] -> th.name, th.place, th.team
row[1] -> td[0], td[1], td[2]
while another might be:
row[0] -> th.place, th.team, th.name
row[1] -> td[0], td[1], td[2]
My question: how do I capture the th class name across many hundreds of tables that are inconsistent (in th class order), create the 10-14 variables (arrays), and then link the td corresponding to that column of the table to that dynamic variable? Please let me know if this is confusing; there are multiple tables on a given page.
Currently my code is something like:
require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'open-uri'
require 'uri'

class Result
  def initialize(row)
    @attrs = {}
    @attrs[:raw] = row.text
  end
end

class Race
  def initialize(page, table)
    @handle = page
    @table = table
    @results = []
    @attrs = {}
    parse!
  end

  def parse!
    @attrs[:name] = @handle.css('div.caption').text
    get_rows
  end

  def get_rows
    # get all of the rows ..
    @handle.css('tr').each do |tr|
      @results << Result.new(tr) # was RaceResult, which isn't defined; the class above is Result
    end
  end
end

class Event
  class << self
    def all(index_url)
      events = []
      ourl = Nokogiri::HTML(open(index_url))
      ourl.css('a.event').each do |url|
        abs_url = MAIN + url.attributes["href"]
        events << Event.new(abs_url)
      end
      events
    end
  end

  def initialize(url)
    @url = url
    @handle = nil
    @attrs = {}
    @races = []
    @sub_events = []
    parse!
  end

  def parse!
    @handle = Nokogiri::HTML(open(@url))
    get_page_meta
    if @handle.css('table.base.event_results').length > 0
      @handle.search('div.table_container.event_results').each do |table|
        @races << Race.new(@handle, table)
      end
    else
      @handle.css('div.centered a.obvious').each do |ol|
        @sub_events << Event.new(BASE_URL + ol.attributes["href"])
      end
    end
  end

  def get_page_meta
    @attrs[:name] = @handle.css('html body div.content h2 text()')[0] # event name
    @attrs[:date] = @handle.xpath("html/body/div/div/text()[2]").text.strip # date
  end
end
A friend has been helping me with this and I'm just starting to get a grasp on OOP, but I'm only capturing the tables; they're not split into td's and stored in some kind of variable/array/hash. I need help understanding this process. The critical piece is dynamically assigning variable names according to the classes of the data and moving the td's from that column (all td[2]'s, for example) into that dynamic variable. I can't tell you how amazing it would be if someone could help me solve this problem and understand how to make it work. Thank you in advance for any help!
It's easy once you realize that the th contents are the keys of your hash. Example:
@items = []
doc.css('table.masseyStyleTable').each do |table|
  fields = table.css('th').map { |x| x.text.strip }
  table.css('tr').each do |tr|
    item = {}
    fields.each_with_index do |field, i|
      item[field] = tr.css('td')[i].text.strip rescue ''
    end
    @items << item
  end
end
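A variant of the same idea that skips the header row explicitly instead of relying on the inline rescue (a sketch, assuming the same table structure as above):

@items = []
doc.css('table.masseyStyleTable').each do |table|
  fields = table.css('th').map { |th| th.text.strip }
  # drop(1) skips the header row, which has th cells instead of td cells
  table.css('tr').drop(1).each do |tr|
    cells = tr.css('td').map { |td| td.text.strip }
    # zip pairs each header with the cell in the same column position
    @items << fields.zip(cells).to_h unless cells.empty?
  end
end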
