I'm trying to scrape a website's content to instantiate objects from the data, and I'm running into a problem with a dead link on the page I'm scraping. I want to figure out how I can simply skip that link and avoid scraping it altogether.
I tried using this, but it didn't work:
name = li.css("strong a").text.strip unless li.nil?
url = li.css("a")[0].attr("href") unless li.nil?
Player.new(name,url)
require 'nokogiri'
require 'open-uri'

class HomepageScraper
  BASE_URL = "https://www.nba.com/history/nba-at-50/top-50-players"

  def self.scrape_players
    page = open(BASE_URL)
    parsed_html = Nokogiri::HTML(page)
    name_lis = parsed_html.css("div.field-item li")
    name_lis.each do |li|
      name = li.css("strong a").text.strip
      url = li.css("a")[0].attr("href")
      Player.new(name, url)
    end
  end
end
I expected example output to be:
#name = "Shaquille o neal", #url = "www.nba..."
But received:
#name = "Shaquille o neal", #url = nil
The error message is:
undefined method `attr' for nil:NilClass (NoMethodError)
If you run at least Ruby 2.3, you can use the safe navigation operator:
url = li.css("a")[0]&.attr("href")
This sets url to nil if the part to the left of &. is nil, and calls attr otherwise.
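Applied to the loop from the question, a minimal sketch; skipping entries whose url came back nil is an assumption about what you want to do with them:

name_lis.each do |li|
  name = li.css("strong a").text.strip
  url  = li.css("a")[0]&.attr("href")   # nil when the <li> contains no <a>
  Player.new(name, url) unless url.nil? # skip items with a missing/dead link
end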
You should use the compact method on Array.
It is a useful method if you need to remove nil values from an array.
For example:
[1, nil, 2, nil].compact # => [1, 2]
In your case:
name_lis.compact.each do |li|
end
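Fleshed out, a minimal sketch; note that the nil in your error actually comes from li.css("a")[0] when a list item has no anchor, not from nils in the NodeSet itself, so a guard inside the block is still useful:

name_lis.compact.each do |li|
  anchor = li.css("a")[0]
  next if anchor.nil?  # skip list items without a link
  Player.new(li.css("strong a").text.strip, anchor.attr("href"))
end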
I am working on a CLI project and trying to open up a web page by using a url variable declared in another method.
def self.open_deal_page(input)
  index = input.to_i - 1
  @deals = PopularDeals::NewDeals.new_deals
  @deals.each do |info|
    d = info[index]
    @product_url = "#{d.url}"
  end
  @product_url.to_s
  puts "They got me!"
end
def self.deal_page(product_url)
  # self.open_deal_page(input)
  deal = {}
  html = Nokogiri::HTML(open(@product_url))
  doc = Nokogiri::HTML(html)
  deal[:name] = doc.css(".dealTitle h1").text.strip
  deal[:description] = doc.css(".textDescription").text.strip
  deal[:purchase] = doc.css("div a.button").attribute("href")
  deal
  # binding.pry
end
but I am receiving this error:
`open': no implicit conversion of nil into String (TypeError)
Any possible solution? Thank you so much in advance.
Try returning your @product_url within your open_deal_page method; right now you're returning puts "They got me!" (which evaluates to nil). Also note that your product_url is being created inside your each block, so it won't be accessible afterwards: create it before the block as an empty string, and then you can return it.
def open_deal_page(input)
  ...
  # Create the variable
  product_url = ''
  # Assign it the value
  deals.each do |info|
    product_url = "#{info[index].url}"
  end
  # And return it
  product_url
end
In your deal_page method, tell Nokogiri to open the product_url that you're passing as an argument.
def deal_page(product_url)
  ...
  html = Nokogiri::HTML(open(product_url))
  ...
end
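Put together, a minimal usage sketch (assuming both methods live on the same object, and that input is the user's menu choice as in the question):

url  = open_deal_page(input)  # now returns the product URL
deal = deal_page(url)         # Nokogiri opens the URL passed in as the argument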
Sorry, but I didn't find the documentation enlightening at all. Basically, I am trying to iterate through a <select> where some options are not valid. The ones I want have 'class="active"'. Can I do that with Mechanize? Here's what I have so far:
require 'mechanize'

class Scraper
  def init
    mech = Mechanize.new
    page = mech.get('url')
    # Now go through the <select> to get product numbers for the different flavors
    form = page.form_with(:id => 'twister')
    select = form.field_with(:name => 'dropdown_selected_flavor_name')
    select.options.each do |o|
      if (o.text != "")
        value = o
      end
      productNumber = trim_pn(value.to_s[2..12])
      puts productNumber
    end
  end

  # Checks validity of product number and removes excess characters if necessary
  def trim_pn(pn)
    if (pn[0] == ",")
      pn = pn[1..-1]
    end
    return pn
  end
end

p = Scraper.new
p.init
All that does is grab the product number and remove some extra info that I don't want. I tried replacing the .each do with this:
select.options_with(:class => 'active').each do |o|
  if (o.text != "")
    value = o
  end
end
But that throws "undefined method 'dom_class' for Mechanize::Form::Option blah blah." Is there a different way I should be approaching this?
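One possible direction, sketched as an assumption rather than a tested answer: a Mechanize page exposes its parsed HTML through page.search, which accepts Nokogiri CSS selectors, so you can filter the option elements by class directly instead of going through the form API (the selector names below are copied from the question):

# Select only <option class="active"> inside the flavor dropdown.
active_options = page.search('select[name="dropdown_selected_flavor_name"] option.active')
active_options.each do |opt|
  puts trim_pn(opt['value'].to_s)  # opt['value'] reads the option's value attribute
end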
I'm trying to create a simple web-crawler, so I wrote this:
(The get_links method takes a parent link from which we will seek further links.)
require 'nokogiri'
require 'open-uri'

def get_links(link)
  link = "http://#{link}"
  doc = Nokogiri::HTML(open(link))
  links = doc.css('a')
  hrefs = links.map { |link| link.attribute('href').to_s }.uniq.delete_if { |href| href.empty? }
  array = hrefs.select { |i| i[0] == "/" }
  host = URI.parse(link).host
  links_list = array.map { |a| "#{host}#{a}" }
end
(The search_links method takes an array from get_links and searches each of the links in it.)
def search_links(urls)
  urls = get_links(link)
  urls.uniq.each do |url|
    begin
      links = get_links(url)
      compare = urls & links
      urls << links - compare
      urls.flatten!
    rescue OpenURI::HTTPError
      warn "Skipping invalid link #{url}"
    end
  end
  return urls
end
This method finds most of the links on the website, but not all of them.
What did I do wrong? Which algorithm should I use?
Some comments about your code:
def get_links(link)
  link = "http://#{link}"
  # You're assuming the protocol is always http.
  # This isn't the only protocol used on the web.
  doc = Nokogiri::HTML(open(link))
  links = doc.css('a')
  hrefs = links.map { |link| link.attribute('href').to_s }.uniq.delete_if { |href| href.empty? }
  # You can write these two lines more compactly as
  # hrefs = doc.xpath('//a/@href').map(&:to_s).uniq.delete_if(&:empty?)
  array = hrefs.select { |i| i[0] == "/" }
  # I guess you want to handle URLs that are relative to the host.
  # However, URLs relative to the protocol (starting with '//')
  # will also be selected by this condition.
  host = URI.parse(link).host
  links_list = array.map { |a| "#{host}#{a}" }
  # The value assigned to links_list will implicitly be returned.
  # (The assignment itself is futile; the right-hand part alone would
  # suffice.) Because this builds on `array`, all absolute URLs will be
  # missing from the return value.
end
Explanation for
hrefs = doc.xpath('//a/@href').map(&:to_s).uniq.delete_if(&:empty?)
.xpath('//a/@href') uses the attribute syntax of XPath to get directly to the href attributes of <a> elements
.map(&:to_s) is an abbreviated notation for .map { |item| item.to_s }
.delete_if(&:empty?) uses the same abbreviated notation
And comments about the second function:
def search_links(urls)
  urls = get_links(link)
  urls.uniq.each do |url|
    begin
      links = get_links(url)
      compare = urls & links
      urls << links - compare
      urls.flatten!
      # How about using a Set instead of an Array and
      # thus have the collection provide uniqueness of
      # its items, so that you don't have to?
    rescue OpenURI::HTTPError
      warn "Skipping invalid link #{url}"
    end
  end
  return urls
  # This function isn't recursive; it just calls `get_links` on two
  # 'levels'. Thus you search only two levels deep and return findings
  # from the first and second level combined. (Without the "zero'th"
  # level - the URL passed into `search_links`, unless of course it
  # also occurred on the first or second level.)
  #
  # Is this what you intended?
end
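To make the Set suggestion concrete, a minimal sketch; it keeps the two-level search of the original and only swaps the collection type:

require 'set'

def search_links(start_url)
  found = Set.new(get_links(start_url))  # a Set keeps its items unique for us
  found.to_a.each do |url|               # iterate over a snapshot, since we mutate `found`
    begin
      found.merge(get_links(url))        # merge adds only links we haven't seen
    rescue OpenURI::HTTPError
      warn "Skipping invalid link #{url}"
    end
  end
  found.to_a
end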
You should probably be using mechanize:
require 'mechanize'

agent = Mechanize.new
page = agent.get url
links = page.search('a[href]').map { |a| page.uri.merge(a[:href]).to_s }
# if you want to remove links with a different host (hyperlinks?)
links.reject! { |l| URI.parse(l).host != page.uri.host }
Otherwise you'll have trouble converting relative urls to absolute properly.
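For example, URI#merge (which page.uri.merge calls above) performs RFC 3986 reference resolution, which naive string concatenation gets wrong for ../ paths and protocol-relative URLs (the URLs below are made-up illustrations):

require 'uri'

base = URI.parse('http://example.com/history/top-50/')
base.merge('../players').to_s                 # => "http://example.com/history/players"
base.merge('//cdn.example.com/logo.png').to_s # => "http://cdn.example.com/logo.png"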
Here's an extract of the code that I am using:
def retrieve(user_token, quote_id, check = "quotes")
  end_time = Time.now + 15
  match = false
  until Time.now > end_time || match
    @response = http_request.get(quote_get_url(quote_id, user_token))
    eval("match = !JSON.parse(@response.body)#{field(check)}.nil?")
  end
  match.eql?(false) ? nil : @response
end
private

def field(check)
  hash = {
    "quotes"            => '["quotes"][0]',
    "transaction-items" => '["quotes"][0]["links"]["transactionItems"]'
  }
  hash[check]
end
I was informed that using eval in this manner is not good practice. Could anyone suggest a better way of dynamically checking the existence of a JSON node (field?)? I want it to do:
pseudo: match = !JSON.parse(@response.body) + dynamic-path + .nil?
Store paths as arrays of path elements (['quotes', 0]). With a little helper function you'll be able to avoid eval. It is, indeed, completely inappropriate here.
Something along these lines:
class Hash
  def deep_get(path)
    path.reduce(self) do |memo, path_element|
      return unless memo
      memo[path_element]
    end
  end
end
path = ['quotes', 0]
hash = JSON.parse(response.body)
match = !hash.deep_get(path).nil?
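On Ruby 2.3+ you can get the same nil-safe traversal without monkey-patching Hash, because the built-in Hash#dig and Array#dig chain across nested structures:

path = ['quotes', 0, 'links', 'transactionItems']  # the "transaction-items" path as array elements
match = !JSON.parse(response.body).dig(*path).nil?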
I have a query string that looks as follows:
http://localhost:3000/events?appointment_practices%5B10%5D=Injury&appointment_practices%5B18%5D=Immigration&appointment_practices%5B8%5D=Bankruptcy
appointment_practices is actually a hash I inserted into the query string during a redirect:
appointment_practices = practices.reduce({}) do |acc, practice|
  acc[practice.id] = practice.class.name
  acc
end

redirect_to events_path(appointment_practices: appointment_practices)
Now I want to parse that query string. When I tried to parse it with decode_www_form, it returned an array with a nil element:
[nil]
This is the code that is giving me the nil element:
@http_refer = @_env['HTTP_REFERER']

begin
  uri = URI.parse @http_refer
  practices = Hash[URI::decode_www_form(uri.query)].values_at('appointment_practices')
  puts "practices: #{practices}"
rescue StandardError
end
I am trying to extract the hash. For example, in appointment_practices%5B10%5D=Injury, the id is 10 and the practice is Injury.
What other options do I have besides regex?
You can use Rack::Utils.parse_nested_query:
require 'uri'
require 'rack'
uri = URI.parse('http://localhost:3000/events?appointment_practices%5B10%5D=Injury&appointment_practices%5B18%5D=Immigration&appointment_practices%5B8%5D=Bankruptcy')
Rack::Utils.parse_nested_query(uri.query)
#=> {"appointment_practices"=>{"10"=>"Injury", "18"=>"Immigration", "8"=>"Bankruptcy"}}