Find first in array - ruby

I'm trying to use the Wikipedia gem in a Rake task to find the first image that is either .jpg, .png or .gif and save it to my Institute instance. I'm using Ruby 2.3 and Rails 5.
My current code is as follows:
namespace :import do
  desc "Import images from Wikipedia"
  task institutes: :environment do
    require 'wikipedia'
    Institute.all.each do |institute|
      school = institute.name
      page = Wikipedia.find(school)
      next if page.content.nil?
      accepted_formats = [".jpg", ".png", ".gif"]
      images = page.image_urls
      image = images.find { |i| i.image_type }
      institute.update!(image_url: image)
    end

    def image_type
      accepted_formats = File.extname(i)
    end
  end
end
This is giving the error NoMethodError: private method 'image_type' called for #<String....>
Is there a more efficient way (and one that works!) of doing this? Sorry, I'm not that experienced in Ruby! I can't work out the best way to get this working: should the method be defined elsewhere, or is there a better approach altogether?

First, I'd recommend checking whether the institute actually needs to be updated. Next, if you want to reuse accepted_formats, you should either define it as a constant such as ACCEPTED_IMAGE_FORMATS or pass it in as an argument.
Then move the logic that returns the first accepted image into its own method, something like first_valid_image(images, accepted_formats). In my opinion, it should look like this:
namespace :import do
  desc "Import images from Wikipedia"
  task institutes: :environment do
    require 'wikipedia'

    def first_valid_image(images, accepted_formats)
      images.find do |image|
        File.extname(image).in? accepted_formats
      end
    end

    Institute.all.each do |institute|
      school = institute.name
      page = Wikipedia.find(school)
      next if page.content.nil?
      accepted_formats = [".jpg", ".png", ".gif"]
      images = page.image_urls
      image = first_valid_image(images, accepted_formats)
      institute.update!(image_url: image) if image # this runs only if image.nil? == false
    end
  end
end
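If you prefer the constant approach mentioned above, a minimal sketch might look like this (the ACCEPTED_IMAGE_FORMATS name follows the suggestion above; placing the constant and helper at the top of the rake file is just one option):
ACCEPTED_IMAGE_FORMATS = %w[.jpg .png .gif].freeze

def first_valid_image(images)
  # Pick the first URL whose extension is on the allow-list.
  images.find { |url| ACCEPTED_IMAGE_FORMATS.include?(File.extname(url)) }
end

namespace :import do
  desc "Import images from Wikipedia"
  task institutes: :environment do
    require 'wikipedia'
    Institute.all.each do |institute|
      page = Wikipedia.find(institute.name)
      next if page.content.nil?
      image = first_valid_image(page.image_urls)
      institute.update!(image_url: image) if image
    end
  end
end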

Related

Looking for a cleaner way to scrape from a website by avoiding repetition

Hi, I am doing a bit of refactoring on a small CLI web scraping project I wrote in Ruby, and I was wondering if there is a cleaner way to write a particular section without repeating so much code.
Basically, with the code below I pulled data from a website, but I had to do this per page. You will notice that the two methods differ only in their name and the source URL.
def self.scrape_first_page
  html = open("https://www.texasblackpages.com/united-states/san-antonio")
  doc = Nokogiri::HTML(html)
  doc.css('div.grid_element').each do |business|
    biz = Business.new
    biz.name = business.css('a b').text
    biz.type = business.css('span.hidden-xs').text
    biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
  end
end

def self.scrape_second_page
  html = open('https://www.texasblackpages.com/united-states/san-antonio?page=2')
  doc = Nokogiri::HTML(html)
  doc.css('div.grid_element').each do |business|
    biz = Business.new
    biz.name = business.css('a b').text
    biz.type = business.css('span.hidden-xs').text
    biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
  end
end
Is there a way for me to streamline this process with just one method pulling from one source, but with the ability to access different pages within the same site, or is this pretty much the best and only way? The owners of the website do not have a public API for me to pull from, in case anyone is wondering.
Remember that in programming you want to steer towards code that follows the Zero, One or Infinity Rule and avoid the dreaded two. In other words, write methods that take no arguments, a fixed argument (one), or an array of unspecified size (infinity).
So the first step is to clean up the scraping function to make it as generic as possible:
def scrape(page)
  doc = Nokogiri::HTML(open(page))

  # Use map here to return an array of Business objects
  doc.css('div.grid_element').map do |business|
    # Use tap to modify this object before returning it
    Business.new.tap do |biz|
      biz.name = business.css('a b').text
      biz.type = business.css('span.hidden-xs').text
      biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
    end
  end
end
Note that apart from the extraction code, there's nothing site-specific about this: it takes a URL and returns Business objects in an Array.
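For instance, assuming open-uri and Nokogiri are required and the Business class exists as in the question, calling it for a single page might look like this:
require 'open-uri'
require 'nokogiri'

# Scrape one page and print a line per business found on it.
businesses = scrape('https://www.texasblackpages.com/united-states/san-antonio')
businesses.each { |biz| puts "#{biz.name} (#{biz.type}): #{biz.number}" }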
In order to generate pages 1..N, consider this:
def pages(base_url, start: 1)
  page = start
  Enumerator.new do |y|
    loop do
      y << base_url % page
      page += 1
    end
  end
end
Now that's an infinite series, but you can always cap it to whatever you want with take(n) or by instead looping until you get an empty list:
# Collect all businesses from each of the pages...
businesses = pages('https://www.texasblackpages.com/united-states/san-antonio?page=%d').lazy.map do |page|
  # ...by scraping the page...
  scrape(page)
end.take_while do |results|
  # ...and iterating until there are no results, i.e. Array#any? is false.
  results.any?
end.to_a.flatten
The .lazy part means each element flows through the whole chain one at a time, as opposed to the default behaviour of evaluating each stage to completion before moving on to the next. This is important, or else it would try to download an infinite number of pages before ever reaching the take_while test.
The .to_a on the end forces that chain to run to completion. The .flatten squishes all the page-wise results into a single result set.
Of course if you want to scrape the first N pages, it's a lot easier:
pages('https://www.texasblackpages.com/.../san-antonio?page=%d').take(n).flat_map do |page|
  scrape(page)
end
It's almost no code!
This was suggested by @Todd A. Jacobs
def self.scrape(url)
  html = open(url)
  doc = Nokogiri::HTML(html)
  doc.css('div.grid_element').each do |business|
    biz = Business.new
    biz.name = business.css('a b').text
    biz.type = business.css('span.hidden-xs').text
    biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
  end
end
The downside is that, with there being no public API, I had to invoke the method as many times as I needed, since the URLs represent different pages within the website; but this is fine because I was able to get rid of the repeated methods.
def make_listings
  Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio")
  Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio?page=2")
  Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio?page=3")
  Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio?page=4")
end
I once had the same problem as you, and I solved it with a loop. Usually, if the site supports pagination, the first page can also be requested with the page query param.
def self.scrape
  page = 1
  loop do
    url = "https://www.texasblackpages.com/united-states/san-antonio?page=#{page}"
    # do nokogiri parse
    # do data scraping
    page += 1
  end
end
You can break out of the loop on a suitable condition, for example when a page returns no results.
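A rough sketch of that idea, reusing the extraction code from the question and stopping as soon as a page has no .grid_element entries (the stopping condition is an assumption about how the site signals the last page):
require 'open-uri'
require 'nokogiri'

def self.scrape
  page = 1
  loop do
    url = "https://www.texasblackpages.com/united-states/san-antonio?page=#{page}"
    doc = Nokogiri::HTML(open(url))
    entries = doc.css('div.grid_element')
    break if entries.empty? # no more listings: stop paginating

    entries.each do |business|
      biz = Business.new
      biz.name = business.css('a b').text
      biz.type = business.css('span.hidden-xs').text
      biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n", "").strip
    end

    page += 1
  end
end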

How to use user input across classes in Ruby?

I’m writing an app that scrapes genius.com to show a user the top ten songs. The user can then pick a song to see the lyrics.
I'd like to know how to use the user input collected in my CLI class inside a method in my Scraper class.
Right now I have part of the scrape happening outside the Scraper class, but I'd like a clean division of responsibility.
Here’s part of my code:
class CLI
  def get_user_song
    chosen_song = gets.strip.to_i
    if chosen_song > 10 || chosen_song < 1
      puts "Only the hits! Choose a number from 1-10."
    end
  end
end
I’d like to be able to do something like the below.
class Scraper
  def self.scrape_lyrics
    page = Nokogiri::HTML(open("https://genius.com/#top-songs"))
    @url = page.css('div#top-songs a').map { |link| link['href'] }
    user_selection = @input_from_cli # <-- this is where I'd like to use the output
                                     #     of the 'gets' method above.
    @print_lyrics = @url[user_selection - 1]
    scrape_2 = Nokogiri::HTML(open(@print_lyrics))
    puts scrape_2.css(".lyrics").text
  end
end
I'm basically wondering how I can pass the chosen song variable into the Scraper class. I've tried writing a class method, but I was having trouble writing it in a way that didn't break the rest of my program.
Thanks for any help!
I see two possible solutions to your problem. Which one is appropriate depends on your design goals; I'll try to explain each option:
From a plain reading of your code, the user inputs the number without seeing the content of the page (through your program). In this case the simple way would be to pass in the selected number as a parameter to the scrape_lyrics method:
def self.scrape_lyrics(user_selection)
  page = Nokogiri::HTML(open("https://genius.com/#top-songs"))
  @url = page.css('div#top-songs a').map { |link| link['href'] }
  @print_lyrics = @url[user_selection - 1]
  scrape_2 = Nokogiri::HTML(open(@print_lyrics))
  puts scrape_2.css(".lyrics").text
end
All sequencing happens in the CLI class, and the scraper is called with all the necessary data from the get-go.
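A minimal sketch of that sequencing on the CLI side, reusing the names from the question (everything here mirrors the snippets above rather than any required structure):
class CLI
  def get_user_song
    chosen_song = gets.strip.to_i
    if chosen_song > 10 || chosen_song < 1
      puts "Only the hits! Choose a number from 1-10."
    else
      # Hand the validated choice straight to the scraper as an argument.
      Scraper.scrape_lyrics(chosen_song)
    end
  end
end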
When imagining your tool more interactively, I was thinking it could be useful to have the scraper download the current top 10 and present the list to the user to choose from. In this case the interaction is a little bit more back-and-forth.
If you still want a strict separation, you can split scrape_lyrics into scrape_top_ten and scrape_lyrics_by_number(song_number) and sequence that in the CLI class.
If you expect the interaction flow to be very dynamic it might be better to inject the interaction methods into the scraper and invert the dependency:
def self.scrape_lyrics(cli)
  page = Nokogiri::HTML(open("https://genius.com/#top-songs"))
  titles = page.css('div#top-songs h3:first-child').map { |t| t.text }
  user_selection = cli.choose(titles) # presents a choice to the user, returning the selected number
  @url = page.css('div#top-songs a').map { |link| link['href'] }
  @print_lyrics = @url[user_selection - 1]
  scrape_2 = Nokogiri::HTML(open(@print_lyrics))
  puts scrape_2.css(".lyrics").text
end
See the tty-prompt gem for an example implementation of the latter approach.
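For illustration, the cli object's choose method could be backed by tty-prompt roughly like this (a sketch, not required usage; choose and the titles array come from the snippet above):
require 'tty-prompt'

class CLI
  def choose(titles)
    prompt = TTY::Prompt.new
    # select returns the chosen title; map it back to a 1-based number
    selection = prompt.select("Pick a song to see its lyrics:", titles)
    titles.index(selection) + 1
  end
end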

How to write a while loop properly

I'm trying to scrape a website; however, I cannot seem to get my while loop to break out once it hits a page with no more information:
def scrape_verse_items(keyword)
  pg = 1
  while pg < 1000
    puts "page #{pg}"
    url = "https://www.bible.com/search/bible?page=#{pg}&q=#{keyword}&version_id=1"
    doc = Nokogiri::HTML(open(url))
    items = doc.css("ul.search-result li.reference")
    error = doc.css('div#noresults')
    until error.any? do
      if keyword != ''
        item_hash = {}
        items.each do |item|
          title = item.css("h3").text.strip
          content = item.css("p").text.strip
          item_hash[title] = content
        end
      else
        puts "Please enter a valid search"
      end
      if error.any?
        break
      end
    end
    pg += 1
  end
  item_hash
end

puts scrape_verse_items('joy')
puts scrape_verse_items('joy')
I know this doesn't exactly answer your question, but perhaps you might consider using a different approach altogether.
Using while and until loops can get a bit confusing, and usually isn't the most performant way of doing things.
Maybe you would consider using recursion instead.
I've written a small script that seems to work:
require 'open-uri'
require 'nokogiri'

class MyScrapper
  def initialize; end

  def call(keyword)
    return puts("Please enter a valid search") unless keyword

    scrape({}, keyword, 1)
  end

  private

  def scrape(results, keyword, page)
    doc = load_page(keyword, page)
    return results if doc.css('div#noresults').any?

    build_new_items(doc).merge(scrape(results, keyword, page + 1))
  end

  def load_page(keyword, page)
    url = "https://www.bible.com/search/bible?page=#{page}&q=#{keyword}&version_id=1"
    Nokogiri::HTML(open(url))
  end

  def build_new_items(doc)
    items = doc.css("ul.search-result li.reference")
    items.reduce({}) do |list, item|
      title = item.css("h3").text.strip
      content = item.css("p").text.strip
      list[title] = content
      list
    end
  end
end
You call it by doing MyScrapper.new.call("Keyword"). (It might make more sense to have this as a module you include, or even to use class methods, to avoid the need to instantiate the class.)
What this does is call a method named scrape, passing it the starting results, the keyword, and the page. It loads the page and, if there are no results, returns the results it has found so far.
Otherwise it builds a hash from the page it loaded, then the method calls itself and merges those results with the new hash it just built. It does this until there are no more results.
If you want to limit the page results you can just change this like:
return results if doc.css('div#noresults').any?
to this:
return results if doc.css('div#noresults').any? || page > 999
Note: You might want to double-check that the results being returned are correct. I think they should be, but I wrote this quite quickly, so there could always be a small bug hiding somewhere in there.
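A quick usage sketch with the keyword from the question (the output format is just illustrative):
results = MyScrapper.new.call("joy")
results.each do |title, content|
  puts "#{title}: #{content}"
end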

Save names and iterate through them in Ruby

Hey, this is my first post and I'm a complete Ruby noob.
This is my existing source code for my Dashing/Roo project.
require 'roo-xls'

SCHEDULER.every '10m' do
  file_path = "/home/numbers.xlsx"

  def fetch_spreadsheet_data(path)
    s = Roo::Excelx.new(path)
    # This should be edited
    send_event('Department1', { value: s.cell('C', 5, s.sheets[0]) })
  end

  # Checker if file has been modified
  module Handler
    def file_modified
      fetch_spreadsheet_data(path)
    end
  end

  fetch_spreadsheet_data(file_path)
end
I want to add a few departments (for example Department1, Factory2, ...).
For Department1 it should use 'C',1,s.sheets[0]; for Factory2 it should use 'C',2,s.sheets[0], and so on.
I want to save the names into an array and then iterate through it.
So how could I implement this logic?
thanks a lot
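A minimal sketch of the idea described above: keep each name together with its row number and iterate over the pairs (the names and rows come from the question; the hash-based structure is just one option):
def fetch_spreadsheet_data(path)
  s = Roo::Excelx.new(path)

  # Each entry pairs a dashing widget name with the spreadsheet row it reads from.
  departments = { 'Department1' => 1, 'Factory2' => 2 }

  departments.each do |name, row|
    send_event(name, { value: s.cell('C', row, s.sheets[0]) })
  end
end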

Parsing large file with SaxMachine seems to be loading the whole file into memory

I have a 1.6 GB XML file, and when I parse it with SAX Machine it does not seem to be streaming or eating the file in chunks; rather, it appears to be loading the whole file into memory (or maybe there is a memory leak somewhere?), because my Ruby process climbs upwards of 2.5 GB of RAM. I don't know where it stops growing because I ran out of memory.
On a smaller file (50 MB) it also appears to load the whole file. My task iterates over the records in the XML file and saves each record to a database. It takes about 30 seconds of "idling" and then, all of a sudden, the database queries start executing.
I thought SAX was supposed to allow you to work with large files like this without loading the whole thing in memory.
Is there something I am overlooking?
Many thanks
Update to add code sample
class FeedImporter

  class FeedListing
    include ::SAXMachine

    element :id
    element :title
    element :description
    element :url

    def to_hash
      {}.tap do |hash|
        self.class.column_names.each do |key|
          hash[key] = send(key)
        end
      end
    end
  end

  class Feed
    include ::SAXMachine
    elements :listing, :as => :listings, :class => FeedListing
  end

  def perform
    open('~/feeds/large_feed.xml') do |file|
      # I think that SAXMachine is trying to load all of the listing elements into this one ruby object.
      puts 'Parsing'
      feed = Feed.parse(file)

      # We are now iterating over each of the listing elements, but they have been "parsed" from the feed already.
      puts 'Importing'
      feed.listings.each do |listing|
        Listing.import(listing.to_hash)
      end
    end
  end
end
As you can see, I don't care about the <listings> element in the feed. I just want the attributes of each <listing> element.
The output looks like this:
Parsing
... wait forever
Importing (actually, I don't ever see this on the big file (1.6gb) because too much memory is used :(
Here's a Reader that will yield each listing's XML to a block, so you can process each Listing without loading the entire document into memory:
reader = Nokogiri::XML::Reader(file)
while reader.read
  if reader.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT and reader.name == 'listing'
    listing = FeedListing.parse(reader.outer_xml)
    Listing.import(listing.to_hash)
  end
end
If listing elements could be nested, and you wanted to parse the outermost listings as single documents, you could do this:
require 'rubygems'
require 'nokogiri'

# Monkey-patch Nokogiri to make this easier
class Nokogiri::XML::Reader
  def element?
    node_type == TYPE_ELEMENT
  end

  def end_element?
    node_type == TYPE_END_ELEMENT
  end

  def opens?(name)
    element? && self.name == name
  end

  def closes?(name)
    (end_element? && self.name == name) ||
      (self_closing? && opens?(name))
  end

  def skip_until_close
    raise "node must be TYPE_ELEMENT" unless element?
    name_to_close = self.name

    if self_closing?
      # DONE!
    else
      level = 1
      while read
        level += 1 if opens?(name_to_close)
        level -= 1 if closes?(name_to_close)
        return if level == 0
      end
    end
  end

  def each_outer_xml(name, &block)
    while read
      if opens?(name)
        yield(outer_xml)
        skip_until_close
      end
    end
  end
end
Once you have it monkey-patched, it's easy to deal with each listing individually:
open('~/feeds/large_feed.xml') do |file|
  reader = Nokogiri::XML::Reader(file)
  reader.each_outer_xml('listing') do |outer_xml|
    listing = FeedListing.parse(outer_xml)
    Listing.import(listing.to_hash)
  end
end
Unfortunately there are now three different repos for sax-machine. And worse, the gemspec version was not bumped.
Despite the comment on Greg Weber's blog, I don't think this code was integrated into pauldix's or ezkl's forks. To use the lazy, fiber-based version of the code, I think you need to specifically reference gregwebs' version in your Gemfile like this:
gem 'sax-machine', :git => 'https://github.com/gregwebs/sax-machine'
I forked sax-machine so that it uses constant memory: https://github.com/gregwebs/sax-machine
Good news: there is a new maintainer who is planning to merge my changes.
The new maintainer and I have been using my fork without issue for a year now.
You are right, SAXMachine reads the whole document eagerly. Have a look at its handler source: https://github.com/pauldix/sax-machine/blob/master/lib/sax-machine/sax_handler.rb
To solve your problem, I would use Nokogiri::XML::SAX::Parser (http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/SAX/Parser.html) directly and implement the handler yourself.
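A minimal sketch of that approach, assuming <listing> elements with id/title/description/url children as in the question (the handler class name and the field buffering are illustrative):
require 'nokogiri'

# Streams the document and imports one listing at a time,
# never holding more than a single record in memory.
class ListingHandler < Nokogiri::XML::SAX::Document
  FIELDS = %w[id title description url].freeze

  def start_element(name, attrs = [])
    @listing = {} if name == 'listing'
    @current = name if @listing && FIELDS.include?(name)
  end

  def characters(string)
    (@listing[@current] ||= '') << string if @listing && @current
  end

  def end_element(name)
    if name == 'listing'
      Listing.import(@listing)
      @listing = nil
    end
    @current = nil
  end
end

parser = Nokogiri::XML::SAX::Parser.new(ListingHandler.new)
parser.parse(File.open(File.expand_path('~/feeds/large_feed.xml')))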
