How to use user input across classes in Ruby? - ruby

I’m writing an app that scrapes genius.com to show a user the top ten songs. The user can then pick a song to see the lyrics.
I’d like to know how to use the user input collected in my CLI class inside a method in my Scraper class.
Right now I have part of the scrape happening outside the scraper class, but I'd like a clean division of responsibility.
Here’s part of my code:
class CLI
  def get_user_song
    chosen_song = gets.strip.to_i
    if chosen_song > 10 || chosen_song < 1
      puts "Only the hits! Choose a number from 1-10."
    end
I’d like to be able to do something like the below.
class Scraper
  def self.scrape_lyrics
    page = Nokogiri::HTML(open("https://genius.com/#top-songs"))
    @url = page.css('div#top-songs a').map {|link| link['href']}
    user_selection = @input_from_cli # <--- this is where I'd like to use the output
                                     #      of the 'gets' method above.
    @print_lyrics = @url[user_selection - 1]
    scrape_2 = Nokogiri::HTML(open(@print_lyrics))
    puts scrape_2.css(".lyrics").text
  end
I'm basically wondering how I can pass the chosen song variable into the Scraper class. I've tried writing a class method, but was having trouble writing it in a way that didn't break the rest of my program.
Thanks for any help!

I see two possible solutions to your problem. Which one is appropriate for this depends on your design goals. I'll try to explain with each option:
From a plain reading of your code, the user inputs the number without seeing the content of the page (through your program). In this case the simple way would be to pass in the selected number as a parameter to the scrape_lyrics method:
def self.scrape_lyrics(user_selection)
  page = Nokogiri::HTML(open("https://genius.com/#top-songs"))
  @url = page.css('div#top-songs a').map {|link| link['href']}
  @print_lyrics = @url[user_selection - 1]
  scrape_2 = Nokogiri::HTML(open(@print_lyrics))
  puts scrape_2.css(".lyrics").text
end
All the sequencing happens in the CLI class, and the scraper is called with all the necessary data from the get-go.
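For example, the CLI method from the question could validate the number and then hand it straight to the scraper (a minimal sketch reusing the names from the question):

class CLI
  def get_user_song
    chosen_song = gets.strip.to_i
    if chosen_song < 1 || chosen_song > 10
      puts "Only the hits! Choose a number from 1-10."
    else
      Scraper.scrape_lyrics(chosen_song) # pass the validated selection in
    end
  end
end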
If you imagine your tool as more interactive, it could be useful to have the scraper download the current top ten and present the list to the user to choose from. In this case the interaction is a little more back-and-forth.
If you still want a strict separation, you can split scrape_lyrics into scrape_top_ten and scrape_lyrics_by_number(song_number) and sequence that in the CLI class.
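A minimal sketch of that split, reusing the selectors from the question (the CLI shows the list returned by scrape_top_ten, then calls scrape_lyrics_by_number with the chosen number):

class Scraper
  def self.scrape_top_ten
    page = Nokogiri::HTML(open("https://genius.com/#top-songs"))
    page.css('div#top-songs a').map { |link| link['href'] }
  end

  def self.scrape_lyrics_by_number(song_number)
    url = scrape_top_ten[song_number - 1]
    puts Nokogiri::HTML(open(url)).css(".lyrics").text
  end
end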
If you expect the interaction flow to be very dynamic it might be better to inject the interaction methods into the scraper and invert the dependency:
def self.scrape_lyrics(cli)
  page = Nokogiri::HTML(open("https://genius.com/#top-songs"))
  titles = page.css('div#top-songs h3:first-child').map {|t| t.text}
  user_selection = cli.choose(titles) # presents a choice to the user, returning the selected number
  @url = page.css('div#top-songs a').map {|link| link['href']}
  @print_lyrics = @url[user_selection - 1]
  scrape_2 = Nokogiri::HTML(open(@print_lyrics))
  puts scrape_2.css(".lyrics").text
end
See the tty-prompt gem for an example implementation of the latter approach.
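For instance, the cli object passed into scrape_lyrics could wrap tty-prompt. A rough sketch, where choose is the hypothetical interface assumed in the code above:

require 'tty-prompt'

class CLI
  def choose(titles)
    prompt = TTY::Prompt.new
    # Map each title to its 1-based position so the caller gets a number back.
    choices = titles.each_with_index.map { |title, i| [title, i + 1] }.to_h
    prompt.select("Which song's lyrics would you like to see?", choices)
  end
end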

Related

Looking for a cleaner way to scrape from website by avoiding repeating

Hi, I am just doing a bit of refactoring on a small CLI web scraping project I did in Ruby, and I was wondering if there is a cleaner way to write a particular section without repeating the code too much.
Basically, with the code below I pulled data from a website, but I had to do this per page. You will notice that the two methods differ only in their name and the source URL.
def self.scrape_first_page
  html = open("https://www.texasblackpages.com/united-states/san-antonio")
  doc = Nokogiri::HTML(html)
  doc.css('div.grid_element').each do |business|
    biz = Business.new
    biz.name = business.css('a b').text
    biz.type = business.css('span.hidden-xs').text
    biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
  end
end
def self.scrape_second_page
  html = open('https://www.texasblackpages.com/united-states/san-antonio?page=2')
  doc = Nokogiri::HTML(html)
  doc.css('div.grid_element').each do |business|
    biz = Business.new
    biz.name = business.css('a b').text
    biz.type = business.css('span.hidden-xs').text
    biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
  end
end
Is there a way for me to streamline this with just one method pulling from one source, but with the ability to access different pages within the same site, or is this pretty much the best and only way? The owners of the website do not have a public API for me to pull from, in case anyone is wondering.
Remember that in programming you want to steer towards code that follows the Zero, One or Infinity Rule and avoid the dreaded two. In other words, write methods that take no arguments, fixed arguments (one), or an array of unspecified size (infinity).
So the first step is to clean up the scraping function to make it as generic as possible:
def scrape(page)
  doc = Nokogiri::HTML(open(page))

  # Use map here to return an array of Business objects
  doc.css('div.grid_element').map do |business|
    Business.new.tap do |biz|
      # Use tap to modify this object before returning it
      biz.name = business.css('a b').text
      biz.type = business.css('span.hidden-xs').text
      biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
    end
  end
end
Note that apart from the extraction code, there's nothing site-specific about this: it takes a URL and returns Business objects in an Array.
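Used on its own it might look like this (URL taken from the question, attribute names from the Business objects above):

businesses = scrape("https://www.texasblackpages.com/united-states/san-antonio")
businesses.each { |biz| puts "#{biz.name} (#{biz.type}): #{biz.number}" }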
In order to generate pages 1..N, consider this:
def pages(base_url, start: 1)
  page = start
  Enumerator.new do |y|
    loop do
      y << base_url % page
      page += 1
    end
  end
end
Now that's an infinite series, but you can always cap it to whatever you want with take(n) or by instead looping until you get an empty list:
# Collect all businesses from each of the pages...
businesses = pages('https://www.texasblackpages.com/united-states/san-antonio?page=%d').lazy.map do |page|
  # ...by scraping the page...
  scrape(page)
end.take_while do |results|
  # ...and iterating until there are no results, i.e. Array#any? is false.
  results.any?
end.to_a.flatten
The .lazy part means "evaluate the chain one element at a time" as opposed to the default behaviour of evaluating each stage to completion. This is important, or else it would try to download an infinite number of pages before moving on to the next step.
The .to_a on the end forces that chain to run to completion. The .flatten squishes all the page-wise results into a single result set.
Of course if you want to scrape the first N pages, it's a lot easier:
pages('https://www.texasblackpages.com/.../san-antonio?page=%d').take(n).flat_map do |page|
  scrape(page)
end
It's almost no code!
This was suggested by @Todd A. Jacobs:
def self.scrape(url)
  html = open(url)
  doc = Nokogiri::HTML(html)
  doc.css('div.grid_element').each do |business|
    biz = Business.new
    biz.name = business.css('a b').text
    biz.type = business.css('span.hidden-xs').text
    biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
  end
end
The downside is that, without a public API, I have to invoke the method once per page, since each URL represents a different page within the website, but this is fine because I was able to get rid of the repeated methods.
def make_listings
  Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio")
  Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio?page=2")
  Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio?page=3")
  Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio?page=4")
end
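If the page count is known, the same thing could be written as a loop rather than one call per page (a small sketch; 4 is just the page count from the example above):

def make_listings
  base = "https://www.texasblackpages.com/united-states/san-antonio"
  Scraper.scrape(base)
  (2..4).each { |n| Scraper.scrape("#{base}?page=#{n}") }
end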
I once had a similar problem, and I used a loop. Usually, if a site supports pagination, the first page can also be requested with the page query param.
def self.scrape
  page = 1
  loop do
    url = "https://www.texasblackpages.com/united-states/san-antonio?page=#{page}"
    # do nokogiri parse
    # do data scraping
    page += 1
  end
end
You can break out of the loop on a certain page condition.
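For instance, the loop could stop as soon as a page yields no listings (a sketch using the same selector as the other answers):

def self.scrape
  page = 1
  loop do
    url = "https://www.texasblackpages.com/united-states/san-antonio?page=#{page}"
    doc = Nokogiri::HTML(open(url))
    listings = doc.css('div.grid_element')
    break if listings.empty? # stop once a page has no more businesses
    # ... build Business objects from `listings` here ...
    page += 1
  end
end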

How to write a while loop properly

I'm trying to scrape a website, but I cannot seem to get my while loop to break once it hits a page with no more information:
def scrape_verse_items(keyword)
  pg = 1
  while pg < 1000
    puts "page #{pg}"
    url = "https://www.bible.com/search/bible?page=#{pg}&q=#{keyword}&version_id=1"
    doc = Nokogiri::HTML(open(url))
    items = doc.css("ul.search-result li.reference")
    error = doc.css('div#noresults')

    until error.any? do
      if keyword != ''
        item_hash = {}
        items.each do |item|
          title = item.css("h3").text.strip
          content = item.css("p").text.strip
          item_hash[title] = content
        end
      else
        puts "Please enter a valid search"
      end
      if error.any?
        break
      end
    end

    pg += 1
  end
  item_hash
end

puts scrape_verse_items('joy')
I know this doesn't exactly answer your question, but perhaps you might consider using a different approach altogether.
Using while and until loops can get a bit confusing, and usually isn't the most performant way of doing things.
Maybe you would consider using recursion instead.
I've written a small script that seems to work:
class MyScrapper
  def initialize; end

  def call(keyword)
    return puts("Please enter a valid search") unless keyword
    scrape({}, keyword, 1)
  end

  private

  def scrape(results, keyword, page)
    doc = load_page(keyword, page)
    return results if doc.css('div#noresults').any?
    build_new_items(doc).merge(scrape(results, keyword, page + 1))
  end

  def load_page(keyword, page)
    url = "https://www.bible.com/search/bible?page=#{page}&q=#{keyword}&version_id=1"
    Nokogiri::HTML(open(url))
  end

  def build_new_items(doc)
    items = doc.css("ul.search-result li.reference")
    items.reduce({}) do |list, item|
      title = item.css("h3").text.strip
      content = item.css("p").text.strip
      list[title] = content
      list
    end
  end
end
You call it by doing MyScrapper.new.call("Keyword"). (It might make more sense to have this as a module you include, or even to make these class methods, to avoid the need to instantiate the class.)
What this does is call a method named scrape, giving it the starting results, keyword, and page. It loads the page, and if there are no results it returns the existing results it has found.
Otherwise it builds a hash from the page it loaded, then the method calls itself and merges the results with the new hash it just built. It does this until there are no more results.
If you want to limit the page results you can just change this like:
return results if doc.css('div#noresults').any?
to this:
return results if doc.css('div#noresults').any? || page > 999
Note: You might want to double-check the results that are being returned are correct. I think they should be but I wrote this quite quickly, so there could always be a small bug hiding somewhere in there.

How should I use a recursive method in Ruby?

I wrote a simple web crawler using Mechanize; now I'm stuck on how to get the next page recursively. Below is the code.
def self.generate_page # generate a Mechanize page object, the first page
  agent = Mechanize.new
  url = "http://www.baidu.com/s?wd=intitle:#{URI.encode(WORD)}%20site:sina.com.cn&rn=50&gpc=stf#{URI.encode(TIME)}"
  page = agent.get(url)
  page
end
def self.next_page(n_page) # get the next page recursively by clicking the 'next' link shown on each page
  puts n_page
  # if I don't use puts, I get nothing; when using puts, I get
  #<Mechanize::Page:0x007fd341c70fd0>
  #<Mechanize::Page:0x007fd342f2ce08>
  #<Mechanize::Page:0x007fd341d0cf70>
  #<Mechanize::Page:0x007fd3424ff5c0>
  #<Mechanize::Page:0x007fd341e1f660>
  #<Mechanize::Page:0x007fd3425ec618>
  #<Mechanize::Page:0x007fd3433f3e28>
  #<Mechanize::Page:0x007fd3433a2410>
  #<Mechanize::Page:0x007fd342446ca0>
  #<Mechanize::Page:0x007fd343462490>
  #<Mechanize::Page:0x007fd341c2fe18>
  #<Mechanize::Page:0x007fd342d18040>
  #<Mechanize::Page:0x007fd3432c76a8>
  # which are the results I want
  np = Mechanize.new.click(n_page.link_with(:text=>/next/)) unless n_page.link_with(:text=>/next/).nil?
  result = next_page(np) unless np.nil?
  result # here the value is empty, I don't know what is wrong
end
def self.get_page # trying to pass the result of the next_page() method
  puts next_page(generate_page)
  # it seems the result is never passed here
end
I followed these two links, What is recursion and how does it work? and Ruby recursive function, but still can't figure out what's wrong. Hope someone can help me out. Thanks!
There are a few issues with your code:
You shouldn't be calling Mechanize.new more than once.
From a stylistic perspective, you are doing too many nil checks.
Unless you have a preference for recursion, it'll probably be easier to do it iteratively.
To have your next_page method return an array containing every link page in the chain, you could write this:
# you should store the Mechanize agent in a constant shared by all methods
Agent = Mechanize.new

# a helper method to DRY up the code
def click_to_next_page(page)
  next_link = page.link_with(:text => /next/)
  Agent.click(next_link) if next_link
end

# repeatedly visits the next page until none exists
# returns all seen pages as an array
def get_all_next_pages(n_page)
  results = []
  np = click_to_next_page(n_page)
  results.push(np) if np
  until !np
    np = click_to_next_page(np)
    np && results.push(np)
  end
  results
end
# testing it out (i'm not actually running this)
base_url = "http://www.baidu.com/s?wd=intitle:#{URI.encode(WORD)}%20site:sina.com.cn&rn=50&gpc=stf#{URI.encode(TIME)}"
root_page = Agent.get(base_url)
next_pages = get_all_next_pages(root_page)
puts next_pages

Creating a Ruby API

I have been tasked with creating a Ruby API that retrieves YouTube URLs. However, I am not sure of the proper way to create an 'API'... I wrote the code below as a Sinatra server that serves up JSON, but what exactly is the definition of an API, and would this qualify as one? If this is not an API, how can I make it one? Thanks in advance.
require 'open-uri'
require 'json'
require 'sinatra'
# get user input
puts "Please enter a search (seperate words by commas):"
search_input = gets.chomp
puts
puts "Performing search on YOUTUBE ... go to '/videos' API endpoint to see the results and use the output"
puts
# define query parameters
api_key = 'my_key_here'
search_url = 'https://www.googleapis.com/youtube/v3/search'
params = {
  part: 'snippet',
  q: search_input,
  type: 'video',
  videoCaption: 'closedCaption',
  key: api_key
}
# use search_url and query parameters to construct a url, then open and parse the result
uri = URI.parse(search_url)
uri.query = URI.encode_www_form(params)
result = JSON.parse(open(uri).read)
# class to define attributes of each video and format into eventual json
class Video
  attr_accessor :title, :description, :url

  def initialize
    @title = nil
    @description = nil
    @url = nil
  end

  def to_hash
    {
      'title' => @title,
      'description' => @description,
      'url' => @url
    }
  end

  def to_json
    self.to_hash.to_json
  end
end
# create an array with top 3 search results
results_array = []
result["items"].take(3).each do |video|
  @video = Video.new
  @video.title = video["snippet"]["title"]
  @video.description = video["snippet"]["description"]
  @video.url = video["snippet"]["thumbnails"]["default"]["url"]
  results_array << @video.to_json.gsub!(/\"/, '\'')
end
# define the API endpoint
get '/videos' do
  results_array.to_json
end
An "API = Application Program Interface" is, simply, something that another program can reliably use to get a job done, without having to busy its little head about exactly how the job is done.
Perhaps the simplest thing to do now, if possible, is to go back to the person who "tasked" you with this task, and to ask him/her, "well, what do you have in mind?" The best API that you can design, in this case, will be the one that is most convenient for the people (who are writing the programs which ...) will actually have to use it. "Don't guess. Ask!"
A very common strategy for an API, in a language like Ruby, is to define a class which represents "this application's connection to this service." Anyone who wants to use the API does so by calling some function which will return a new instance of this class. Thereafter, the program uses this object to issue and handle requests.
The requests, also, are objects. To issue a request, you first ask the API-connection object to give you a new request-object. You then fill-out the request with whatever particulars, then tell the request object to "go!" At some point in the future, and by some appropriate means (such as a callback ...) the request-object informs you that it succeeded or that it failed.
"A whole lot of voodoo-magic might have taken place," between the request object and the connection object which spawned it, but the client does not have to care. And that, most of all, is the objective of any API. "It Just Works.™"
I think they want you to create a third-party library. Imagine you are two people for a while.
Joe wants to build a Sinatra application to list some YouTube videos, but he is lazy and he does not want to do the dirty work, he just wants to drop something in, give it some credentials, ask for urls and use them, finito.
Joe asks Bob to implement it for him and gives him his requirements: "Bob, I need a YouTube library. I need it to do:"
# Please note that I don't know how YouTube API works, just guessing.
client = YouTube.new(api_key: 'hola')
video_urls = client.videos # => ['https://...', 'https://...', ...]
And Bob says "OK." and spends a day in his interactive console.
So first, you should figure out how you are going to use your not-yet-existing lib, if you can – sometimes you just don't know yet.
Next, build that library based on the requirements, then drop it in your Sinatra app and you're done. Does that help?
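A rough sketch of what Bob might hand back, wrapping the same search endpoint as the question (the class is hypothetical, and unlike Joe's spec it takes the query as an argument):

require 'open-uri'
require 'json'

class YouTube
  SEARCH_URL = 'https://www.googleapis.com/youtube/v3/search'

  def initialize(api_key:)
    @api_key = api_key
  end

  # Returns an array of video thumbnail URLs for the top matching results.
  def videos(query, count: 3)
    uri = URI.parse(SEARCH_URL)
    uri.query = URI.encode_www_form(part: 'snippet', q: query, type: 'video', key: @api_key)
    items = JSON.parse(open(uri).read)["items"]
    items.take(count).map { |v| v["snippet"]["thumbnails"]["default"]["url"] }
  end
end

# Joe's Sinatra app then only needs:
#   client = YouTube.new(api_key: 'my_key_here')
#   client.videos('ruby tutorials')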

Using rspec to test a method on an object that's a property of an object

If I have a method like this:
require 'tweetstream'

# client is an instance of TweetStream::Client
# twitter_ids is an array of up to 1000 integers
def add_first_users_to_stream(client, twitter_ids)
  # Add the first 100 ids to sitestream.
  client.sitestream(twitter_ids.slice!(0,100))

  # Add any extra IDs individually.
  twitter_ids.each do |id|
    client.control.add_user(id)
  end

  return client
end
I want to use rspec to test that:
client.sitestream is called, with the first 100 Twitter IDs.
client.control.add_user() is called with the remaining IDs.
The second point is trickiest for me -- I can't work out how to stub (or whatever) a method on an object that is itself a property of an object.
(I'm using Tweetstream here, although I expect the answer could be more general. If it helps, client.control would be an instance of TweetStream::SiteStreamClient.)
(I'm also not sure a method like my example is best practice, accepting and returning the client object like that, but I've been trying to break my methods down so that they're more testable.)
This is actually a pretty straightforward situation for RSpec. The following will work, as an example:
describe "add_first_users_to_stream" do
it "should add ids to client" do
bulk_add_limit = 100
twitter_ids = (0..bulk_add_limit+rand(50)).collect { rand(4000) }
extras = twitter_ids[bulk_add_limit..-1]
client = double('client')
expect(client).to receive(:sitestream).with(twitter_ids[0...bulk_add_limit])
client_control = double('client_control')
expect(client).to receive(:control).exactly(extras.length).times.and_return(client_control)
expect(client_control).to receive(:add_user).exactly(extras.length).times.and_return {extras.shift}
add_first_users_to_stream(client, twitter_ids)
end
end
