`initialize': No such file or directory # rb_sysopen when using Nokogiri to open site - ruby

I created a CLI program that uses Scraper class to scrape site. I am Using Nokogiri and Open-URI. The error on top is popping up. I looked online and did not find help.
I made sure the site doesn't have typos.
from the CLI class I create a new Scraper class using the site as arg
class KefotoScraper::CLI
attr_accessor :kefoto_scraper
def initialize
site = "https://www.kefotos.mx"
#kefoto_scraper = Scraper.new(site)
end
end
In Scraper I have the following code:
class Scraper
attr_accessor :doc, :product_names, :site, :name, :link
def initialize(site)
#site = site
#doc = doc
#product_names = product_names
#name = name
#link = link
#price_range = [].uniq
scrape_product
end
def get_html
#doc = Nokogiri::HTML(open(#site))
#product_names = doc.css(".navbar-nav li")
product_names
end
def scrape_product
get_html.each {|product|
#name = product.css("span").text
plink = product.css("a").attr("href").text
#link = "#{site}#{link}"
link_doc = Nokogiri::HTML(open(#link))
pr = link_doc.scan(/[\$£](\d{1,3}(,\d{3})*(\.\d*)?)/)
prices = pr_link.text
prices.each {|price|
if #price_range.include?(price[0]) == false
#price_range << price[0]
end
}
new_product = Products.new(#name, #price_range)
puts new_product
}
end
end
I get the following error:
scraper.rb:18:in `initialize': No such file or directory # rb_sysopen - https://www.kefotos.mx (Errno::ENOENT)

open by default operates on local files, not URLs. That error means "I can't find a file on your hard drive named https://www.kefotos.mx".
You can let it work on URIs by requiring the open-uri library:
require 'open-uri'
This will make your code work, but it is a much better practice to use a proper HTTP client to read HTTP resources, as an attacker could potentially use an overloaded open() to access files on your machine's hard drive.
For example, if you were to use just net/http:
# At the top of your scraper.rb:
require 'net/http'
# Then, in your class:
link_doc = Nokogiri::HTML(Net::HTTP.get(URI(#link)))

Related

Ruby TempFile behaviour among different classes

Our processing server works mainly with TempFiles as it makes things easier on our side: no need to take care of deleting them as they get garbage collected or handle name collisions, etc.
Lately, we are having problems with TempFiles getting GCed too early in the process. Specially with one of our services that will convert a Foo file from a url to some Bar file and upload it to our servers.
For sake of clarity I added bellow a case scenario in order to make discussion easier and have an example at hand.
This workflow does the following:
Get a url as parameter
Download the Foo file as a TempFile
Duplicate it to a new TempFile
Download the related assets to TempFiles
Link the related assets into the local dup TempFile
Convert the Foo to Bar format
Upload it to our server
At times the conversion fail and everything points to the fact that our local Foo file is pointing to related assets that have been created and GCed before the conversion.
My two questions:
Is it possible that my TempFiles get GCed too early? I read about Ruby GCed system it was very conservative to avoid those scenarios.
How can I avoid this from happening? I could try to save all related assets from download_and_replace_uri(node) and passing them as a return to keep it alive while the instance of ConvertService is still existing. But I'm not sure if this would solve it.
myfile.foo
{
"buffers": [
{ "uri": "http://example.com/any_file.jpg" },
{ "uri": "http://example.com/any_file.png" },
{ "uri": "http://example.com/any_file.jpmp3" }
]
}
main.rb
ConvertService.new('http://example.com/myfile.foo')
ConvertService
class ConvertService
def initialize(url)
#url = url
#bar_file = Tempfile.new
end
def call
import_foo
convert_foo
upload_bar
end
private
def import_foo
#foo_file = ImportService.new(#url).call.edited_file
end
def convert_foo
`create-bar "#{#foo_file.path}" "#{#bar_file.path}"`
end
def upload_bar
UploadBarService.new(#bar_file).call
end
end
ImportService
class ImportService
def initialize(url)
#url = url
#edited_file ||= Tempfile.new
end
def call
download
duplicate
replace
end
private
def download
#original = DownloadFileService.new(#url).call.file
end
def duplicate
FileUtils.cp(#original.path, #edited_file.path)
end
def replace
file = File.read(#edited_file.path)
json = JSON.parse(file, symbolize_names: true)
json[:buffers]&.each do |node|
node[:uri] = DownloadFileService.new(node[:uri]).call.file.path
end
write_to_disk(#edited_file.path, json.to_json)
end
end
DownloadFileService
module Helper
class DownloadFileService < ApplicationHelperService
def initialize(url)
#url = url
#file = Tempfile.new
end
def call
uri = URI.parse(#url)
Net::HTTP.start(
uri.host,
uri.port,
use_ssl: uri.scheme == 'https'
) do |http|
response = http.request(Net::HTTP::Get.new(uri.path))
#file.binmode
#file.write(response.body)
#file.flush
end
end
end
end
UploadBarService
module Helper
class UploadBarService < ApplicationHelperService
def initialize(file)
#file = file
end
def call
HTTParty.post('http://example.com/upload', body: { file: #file })
# NOTE: End points returns the url for the uploaded file
end
end
end
Because of the complexity of your code and missing parts which may be obfuscated to us, the simple answer to your problem is to insure that your tempfile instance objects remain in memory throughout the lifecycle in which they are needed, otherwise they will get garbage collected immediately, removing the tempfile from the file system, and will lead to the the missing tempfile state you've encountered.
The Ruby Document for Tempfile states "When a Tempfile object is garbage collected, or when the Ruby interpreter exits, its associated temporary file is automatically deleted."
As per comments, others may find this conversation helpful when running into this problem.

NameError Exception: undefined local variable or method `products' for Wheyscrapper:Class

I'm building a small web scraper using Ruby and now I'm trying to refactor my code. Unfortunately, I'm encountering some errors while I'm refactoring my code. This is one of the errors.
Basically, I'm calling two separate methods in the first method which is whey_scrapper. Each of these two methods are basically responsible of scraping a specific item on the webpage. When I run and debug this code with byebug, I basically try to display the products or prices I've scraped but I get an error message saying that 'products' or 'prices' is undefined. This is my current code:
require 'open-uri'
require 'nokogiri'
require 'httparty'
require 'byebug'
require 'csv'
class Wheyscrapper
def whey_scrapper
company = 'Body+%26+fit'
url = "https://www.bodyenfitshop.nl/afslanken/afslank-toppers/?manufacturer=#{company}"
unparsed_page = open(url).read
parsed_page = Nokogiri::HTML(unparsed_page)
product_scrapper
prices_scrapper
# csv = CSV.open('wheyprotein.csv', 'wb')
end
def product_scrapper
products = Array.new
product_names = parsed_page.css('div.product-primary')
product_names.each do |product_name|
product = {
name: product_name.css('h2.product-name').text
}
products << product
end
end
def prices_scrapper
prices = Array.new
product_prices = parsed_page.css('div.price-box')
product_prices.each do |product_price|
price = {
amount: product_price.css('span.price').text
}
prices << price
end
end
byebug
whey_scrapper
end
There's a lot going on here, but to make it more Ruby you'd consider making those lazy-initialized and giving them names that reflect that:
class Wheyscrapper
URL = "https://www.bodyenfitshop.nl/afslanken/afslank-toppers/?%s"
def initialize(company:)
#company = company
# Use encode_www_form to encode query-string parameters
#url = URL % URI.encode_www_form(manufacturer: company)
end
def document
# Lazy-initialize a parsd version of the page
#document ||= Nokogiri::HTML(open(url).read)
end
def products
document.css('div.product-primary').map do |product_name|
{
name: product_name.css('h2.product-name').text
}
end
end
def prices
document.css('div.price-box').map do |product_price|
{
amount: product_price.css('span.price').text
}
end
end
end
This fixes a lot of the data propagation problems you had in your original. When you declare a variable it's a local variable, meaning it doesn't exist outside of that particular call of that particular method. If you want to persist it for longer you need to use instance variables, as in #products, or you need to define methods that return the data you need.
The above approach combines that, using a lazy-initialized instance variable to persist the parsed document, and exposes that as a method the other methods can use.
Now you can spin this up:
scraper = WheyScraper.new(company: "Body & Fit")
Where that should enable everything to be available directly:
scraper.prices
scraper.products
When you learn how to use Ruby effectively you'll often find solutions to your problems that are really minimal. Usually a lot of Ruby code is a sign that it's not being used properly.
This should be refactored in a better way but this should at least work without refactor, based on my comments above
require 'open-uri'
require 'nokogiri'
require 'httparty'
require 'csv'
class Wheyscrapper
def whey_scrapper
company = 'Body+%26+fit'
url = "https://www.bodyenfitshop.nl/afslanken/afslank-toppers/?manufacturer=#{company}"
unparsed_page = open(url).read
#parsed_page = Nokogiri::HTML(unparsed_page)
product_scrapper
prices_scrapper
# csv = CSV.open('wheyprotein.csv', 'wb')
end
def product_scrapper
#products = Array.new
product_names = #parsed_page.css('div.product-primary')
product_names.each do |product_name|
product = {
name: product_name.css('h2.product-name').text
}
#products << product
end
end
def prices_scrapper
#prices = Array.new
#product_prices = #parsed_page.css('div.price-box')
#product_prices.each do |product_price|
price = {
amount: product_price.css('span.price').text
}
#prices << price
end
end
end
w = Wheyscrapper.new.whey_scrapper

Trying to learn to use PageObjects with Ruby - getting error "uninitialized constant Site (NameError)"

I have some experience of Selenium in Python and Cucumber/Watir/RSpec in Ruby, and can write scripts that execute successfully, but they aren't using classes, so I am trying to learn more about classes and splitting the scripts up in to pageobejcts.
I found this example to learn from: http://watir.com/guides/page-objects/ so copied the script and made some minor edits as you'll see below.
I'm using SublimeText 3.x with Ruby 2.4.x on Win10, so you know what tools I'm using.
I put the whole script in to a single .rb file (the only differences are that I replaced the URL and the elements to enter the username and password) and tried to execute it and get the following error:
C:/selenium/ruby/lotw/lotwlogin.rb:3:in `<main>': uninitialized constant Site (NameError).
I added the top line (required 'watir') line and it made no difference to the error encountered.
So I have in lotwlogin.rb essentilly the structure and syntax of the original script with custom elements. However, the core structure is reporting an error and I don't know what to do about it.
Here is my script:
require 'watir'
site = Site.new(Watir::Browser.new :chrome) # was :firefox but that no longer works since FF63
login_page = site.login_page.open
user_page = login_page.login_as "testuser", "testpassword" # dummy user and password for now
user_page.should be_logged_in
class BrowserContainer
def initialize(browser)
#browser = browser
end
end
class Site < BrowserContainer
def login_page
#login_page = LoginPage.new(#browser)
end
def user_page
#user_page = UserPage.new(#browser)
end
def close
#browser.close
end
end
class LoginPage < BrowserContainer
URL = "https://lotw.arrl.org/lotw/login"
def open
#browser.goto URL
##browser.window.maximize
self # no idea what this is for
end
def login_as(user, pass)
user_field.set user
password_field.set pass
login_button.click
next_page = UserPage.new(#browser)
Watir::Wait.until { next_page.loaded? }
next_page
end
private
def user_field
#browser.text_field(:name => "login")
end
def password_field
#browser.text_field(:name => "password")
end
def login_button
#browser.button(:value => "Log On")
end
end # LoginPage
class UserPage < BrowserContainer
def logged_in?
logged_in_element.exists?
end
def loaded?
#browser.h3 == "Welcome to Your Logbook of the World User Account Home Page"
end
private
def logged_in_element
#browser.div(:text => "Log off")
end
end # UserPage
Any assistance how to not get the Site error would be appreciated.
Thanks
Mike
You define class Site only a few lines below. But at that point, it's not yet known.
Move this logic to after all class definitions:
site = Site.new(Watir::Browser.new :chrome) # was :firefox but that no longer works since FF63
login_page = site.login_page.open
user_page = login_page.login_as "testuser", "testpassword" # dummy user and password for now
user_page.should be_logged_in

Sidekiq mechanize overwritten instance

I am building a simple web spider using Sidekiq and Mechanize.
When I run this for one domain, it works fine. When I run it for multiple domains, it fails. I believe the reason is that web_page gets overwritten when instantiated by another Sidekiq worker, but I am not sure if that's true or how to fix it.
# my scrape_search controller's create action searches on google.
def create
#scrape = ScrapeSearch.build(keywords: params[:keywords], profession: params[:profession])
agent = Mechanize.new
scrape_search = agent.get('http://google.com/') do |page|
search_result = page.form...
search_result.css("h3.r").map do |link|
result = link.at_css('a')['href'] # Narrowing down to real search results
#domain = Domain.new(some params)
ScrapeDomainWorker.perform_async(#domain.url, #domain.id, remaining_keywords)
end
end
end
I'm creating a Sidekiq job per domain. Most of the domains I'm looking for should contain just a few pages, so there's no need for sub-jobs per page.
This is my worker:
class ScrapeDomainWorker
include Sidekiq::Worker
...
def perform(domain_url, domain_id, keywords)
#domain = Domain.find(domain_id)
#domain_link = #domain.protocol + '://' + domain_url
#keywords = keywords
# First we scrape the homepage and get the first links
#domain.to_parse = ['/'] # to_parse is an array of PATHS to parse for the domain
mechanize_path('/')
#domain.verified << '/' # verified is an Array field containing valid domain paths
get_paths(#web_page) # Now we should have to_scrape populated with homepage links
#domain.scraped = 1 # Loop counter
while #domain.scraped < 100
#domain.to_parse.each do |path|
#domain.to_parse.delete(path)
#domain.scraped += 1
mechanize_path(path) # We create a Nokogiri HTML doc with mechanize for the valid path
...
get_paths(#web_page) # Fire this to repopulate to_scrape !!!
end
end
#domain.save
end
def mechanize_path(path)
agent = Mechanize.new
begin
#web_page = agent.get(#domain_link + path)
rescue Exception => e
puts "Mechanize Exception for #{path} :: #{e.message}"
end
end
def get_paths(web_page)
paths = web_page.links.map {|link| link.href.gsub((#domain.protocol + '://' + #domain.url), "") } ## This works when I scrape a single domain, but fails with ".gsub for nil" when I scrape a few domains.
paths.uniq.each do |path|
#domain.to_parse << path
end
end
end
This works when I scrape a single domain, but fails with .gsub for nil for web_page when I scrape a few domains.
You can wrap you code in another class, and then create and object of that class within your worker:
class ScrapeDomainWrapper
def initialize(domain_url, domain_id, keywords)
# ...
end
def mechanize_path(path)
# ...
end
def get_paths(web_page)
# ...
end
end
And your worker:
class ScrapeDomainWorker
include Sidekiq::Worker
def perform(domain_url, domain_id, keywords)
ScrapeDomainWrapper.new(domain_url, domain_id, keywords)
end
end
Also, bear in mind that Mechanize::Page#links may be a nil.

How do I implement hashids in ruby on rails

I will go ahead and apologize upfront as I am new to ruby and rails and I cannot for the life of me figure out how to implement using hashids in my project. The project is a simple image host. I have it already working using Base58 to encode the sql ID and then decode it in the controller. However I wanted to make the URLs more random hence switching to hashids.
I have placed the hashids.rb file in my lib directory from here: https://github.com/peterhellberg/hashids.rb
Now some of the confusion starts here. Do I need to initialize hashids on every page that uses hashids.encode and hashids.decode via
hashids = Hashids.new("mysalt")
I found this post (http://zogovic.com/post/75234760043/youtube-like-ids-for-your-activerecord-models) which leads me to believe I can put it into an initializer however after doing that I am still getting NameError (undefined local variable or method `hashids' for ImageManager:Class)
so in my ImageManager.rb class I have
require 'hashids'
class ImageManager
class << self
def save_image(imgpath, name)
mime = %x(/usr/bin/exiftool -MIMEType #{imgpath})[34..-1].rstrip
if mime.nil? || !VALID_MIME.include?(mime)
return { status: 'failure', message: "#{name} uses an invalid format." }
end
hash = Digest::MD5.file(imgpath).hexdigest
image = Image.find_by_imghash(hash)
if image.nil?
image = Image.new
image.mimetype = mime
image.imghash = hash
unless image.save!
return { status: 'failure', message: "Failed to save #{name}." }
end
unless File.directory?(Rails.root.join('uploads'))
Dir.mkdir(Rails.root.join('uploads'))
end
#File.open(Rails.root.join('uploads', "#{Base58.encode(image.id)}.png"), 'wb') { |f| f.write(File.open(imgpath, 'rb').read) }
File.open(Rails.root.join('uploads', "#{hashids.encode(image.id)}.png"), 'wb') { |f| f.write(File.open(imgpath, 'rb').read) }
end
link = ImageLink.new
link.image = image
link.save
#return { status: 'success', message: Base58.encode(link.id) }
return { status: 'success', message: hashids.encode(link.id) }
end
private
VALID_MIME = %w(image/png image/jpeg image/gif)
end
end
And in my controller I have:
require 'hashids'
class MainController < ApplicationController
MAX_FILE_SIZE = 10 * 1024 * 1024
MAX_CACHE_SIZE = 128 * 1024 * 1024
#links = Hash.new
#files = Hash.new
#tstamps = Hash.new
#sizes = Hash.new
#cache_size = 0
class << self
attr_accessor :links
attr_accessor :files
attr_accessor :tstamps
attr_accessor :sizes
attr_accessor :cache_size
attr_accessor :hashids
end
def index
end
def transparency
end
def image
##imglist = params[:id].split(',').map{ |id| ImageLink.find(Base58.decode(id)) }
#imglist = params[:id].split(',').map{ |id| ImageLink.find(hashids.decode(id)) }
end
def image_direct
#linkid = Base58.decode(params[:id])
linkid = hashids.decode(params[:id])
file =
if Rails.env.production?
puts "#{Base58.encode(ImageLink.find(linkid).image.id)}.png"
File.open(Rails.root.join('uploads', "#{Base58.encode(ImageLink.find(linkid).image.id)}.png"), 'rb') { |f| f.read }
else
puts "#{hashids.encode(ImageLink.find(linkid).image.id)}.png"
File.open(Rails.root.join('uploads', "#{hashids.encode(ImageLink.find(linkid).image.id)}.png"), 'rb') { |f| f.read }
end
send_data(file, type: ImageLink.find(linkid).image.mimetype, disposition: 'inline')
end
def upload
imgparam = params[:image]
if imgparam.is_a?(String)
name = File.basename(imgparam)
imgpath = save_to_tempfile(imgparam).path
else
name = imgparam.original_filename
imgpath = imgparam.tempfile.path
end
File.chmod(0666, imgpath)
%x(/usr/bin/exiftool -all= -overwrite_original #{imgpath})
logger.debug %x(which exiftool)
render json: ImageManager.save_image(imgpath, name)
end
private
def save_to_tempfile(url)
uri = URI.parse(url)
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = uri.scheme == 'https'
http.start do
resp = http.get(uri.path)
file = Tempfile.new('urlupload', Dir.tmpdir, :encoding => 'ascii-8bit')
file.write(resp.body)
file.flush
return file
end
end
end
Then in my image.html.erb view I have this:
<%
#imglist.each_with_index { |link, i|
id = hashids.encode(link.id)
ext = link.image.mimetype.split('/')[1]
if ext == 'jpeg'
ext = 'jpg'
end
puts id + '.' + ext
%>
Now if I add
hashids = Hashids.new("mysalt")
in ImageManager.rb main_controller.rb and in my image.html.erb I am getting this error:
ActionView::Template::Error (undefined method `id' for #<Array:0x000000062f69c0>)
So all in all implementing hashids.encode/decode is not as easy as implementing Base58.encode/decode and I am confused on how to get it working... Any help would be greatly appreciated.
I would suggest loading it as a gem by including it into your Gemfile and running bundle install. It will save you the hassle of requiring it in every file and allow you to manage updates using Bundler.
Yes, you do need to initialize it wherever it is going to be used with the same salt. Would suggest that you define the salt as a constant, perhaps in application.rb.
The link you provided injects hashids into ActiveRecord, which means it will not work anywhere else. I would not recommend the same approach as it will require a high level of familiarity with Rails.
You might want to spend some time understanding ActiveRecord and ActiveModel. Will save you a lot of reinventing the wheel. :)
Before everythink you should just to test if Hashlib is included in your project, you can run command rails c in your project folder and make just a small test :
>> my_id = ImageLink.last.id
>> puts Hashids.new(my_id)
If not working, add the gem in gemfile (that anyway make a lot more sence).
Then, I think you should add a getter for your hash_id in your ImageLink model.
Even you don't want to save your hash in the database, this hash have it's pllace in your model. See virtual property for more info.
Remember "Skinny Controller, Fat Model".
class ImageLink < ActiveRecord::Base
def hash_id()
# cache the hash
#hash_id ||= Hashids.new(id)
end
def extension()
# you could add the logic of extension here also.
ext = image.mimetype.split('/')[1]
if ext == 'jpeg'
'jpg'
else
ext
end
end
end
Change the return in your ImageManager#save_image
link = ImageLink.new
link.image = image
# Be sure your image have been saved (validation errors, etc.)
if link.save
{ status: 'success', message: link.hash_id }
else
{status: 'failure', message: link.errors.join(", ")}
end
In your template
<%
#imglist.each_with_index do |link, i|
puts link.hash_id + '.' + link.extension
end # <- I prefer the do..end to not forgot the ending parenthesis
%>
All this code is not tested...
I was looking for something similar where I can disguise the ids of my records. I came across act_as_hashids.
https://github.com/dtaniwaki/acts_as_hashids
This little gem integrates seamlessly. You can still find your records through the ids. Or with the hash. On nested records you can use the method with_hashids.
To get the hash you use to_param on the object itself which result in a string similar to this ePQgabdg.
Since I just implemented this I can't tell how useful this gem will be. So far I just had to adjust my code a little bit.
I also gave the records a virtual attribute hashid so I can access it easily.
attr_accessor :hashid
after_find :set_hashid
private
def set_hashid
self.hashid = self.to_param
end

Resources