I'm trying to scrape a website, but I cannot seem to get my while loop to break once it hits a page with no more information:
require 'nokogiri'
require 'open-uri'

def scrape_verse_items(keyword)
  pg = 1
  while pg < 1000
    puts "page #{pg}"
    url = "https://www.bible.com/search/bible?page=#{pg}&q=#{keyword}&version_id=1"
    doc = Nokogiri::HTML(open(url))
    items = doc.css("ul.search-result li.reference")
    error = doc.css('div#noresults')
    until error.any? do
      if keyword != ''
        item_hash = {}
        items.each do |item|
          title = item.css("h3").text.strip
          content = item.css("p").text.strip
          item_hash[title] = content
        end
      else
        puts "Please enter a valid search"
      end
      if error.any?
        break
      end
    end
    pg += 1
  end
  item_hash
end

puts scrape_verse_items('joy')
I know this doesn't exactly answer your question, but perhaps you might consider using a different approach altogether.
Nested while and until loops can get a bit confusing, and usually aren't the clearest way of doing things.
Maybe you would consider using recursion instead.
I've written a small script that seems to work:
require 'open-uri'
require 'nokogiri'

class MyScrapper
  def call(keyword)
    if keyword.nil? || keyword.empty?
      puts "Please enter a valid search"
      return
    end
    scrape({}, keyword, 1)
  end

  private

  # Recursively scrape pages, accumulating results until a page
  # with no results is reached.
  def scrape(results, keyword, page)
    doc = load_page(keyword, page)
    return results if doc.css('div#noresults').any?
    build_new_items(doc).merge(scrape(results, keyword, page + 1))
  end

  def load_page(keyword, page)
    url = "https://www.bible.com/search/bible?page=#{page}&q=#{keyword}&version_id=1"
    Nokogiri::HTML(open(url))
  end

  # Build a { title => content } hash from one page of results.
  def build_new_items(doc)
    items = doc.css("ul.search-result li.reference")
    items.reduce({}) do |list, item|
      title = item.css("h3").text.strip
      content = item.css("p").text.strip
      list[title] = content
      list
    end
  end
end
You call it with MyScrapper.new.call("Keyword"). (It might make more sense to have this as a module you include, or to use class methods, to avoid the need to instantiate the class.)
What this does is call a method named scrape, passing it the starting results, the keyword, and the page number. scrape loads the page; if there are no results it returns the results it has found so far.
Otherwise it builds a hash from the page it just loaded, then calls itself and merges the existing results with the hash it just built. It does this until there are no more results.
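For example, if you went the class-method route, a minimal sketch (just a thin wrapper around the instance version above) might look like this:

class MyScrapper
  # Class-level entry point so callers don't have to instantiate the class themselves
  def self.call(keyword)
    new.call(keyword)
  end

  # ... instance methods exactly as above ...
end

MyScrapper.call("joy") # instead of MyScrapper.new.call("joy")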
If you want to limit the number of pages, you can just change this line:
return results if doc.css('div#noresults').any?
to this:
return results if doc.css('div#noresults').any? || page > 999
Note: You might want to double-check that the results being returned are correct. I think they should be, but I wrote this quite quickly, so there could always be a small bug hiding somewhere.
I am trying to handle a POST request and run some if statements. What I want to do is:
check all fields are filled
if all fields are filled, move on to the next step, otherwise reload the page
check if the movie is already in the database
add it if it is not already in the database
post "/movies/new" do
title = params[:title]
year = params[:year]
gross = params[:gross]
poster = params[:poster]
trailer = params[:trailer]
if title && year && gross && poster && trailer
movie = Movie.find_by(title: title, year: year, gross: gross)
if movie
redirect "/movies/#{movie.id}"
else
movie = Movie.new(title: title, year: year, gross: gross, poster: poster, trailer: trailer)
if movie.save
redirect "/movies/#{movie.id}"
else
erb :'movies/new'
end
end
else
erb :'movies/new'
end
end
I don't think my if statement is correct: it passes even when not all of my fields are filled.
Your code is doing a lot of work in one single method. I would suggest restructuring it into smaller chunks to make it easier to manage. I mostly code for Rails, so apologies if parts of this do not apply to your framework.
post "/movies/new" do
movie = find_movie || create_movie
if movie
redirect "/movies/#{movie.id}"
else
erb :'movies/new'
end
end
def find_movie
# guard condition to ensure that the required parameters are there
required_params = [:title, :year, :gross]
return nil unless params_present?(required_params)
Movie.find_by(params_from_keys(required_params))
end
def create_movie
required_params = [:title, :year, :gross, :poster, :trailer]
return nil unless params_present?(required_params)
movie = Movie.new(params_from_keys(required_params))
movie.save ? movie : nil # only return the movie if it is successfully saved
end
# utility method to check whether all provided params are present
def params_present?(keys)
keys.each {|key| return false if params[key].blank? }
true
end
# utility method to convert params into the hash format required to create / find a record
def params_from_keys(keys)
paras = {}
keys.each { |key| paras.merge!(key: params[key]) }
paras
end
Even if you type nothing in the HTML fields, they will still be submitted as empty strings.
You can avoid having empty parameters by, for example, filtering them:
post '/movies/new' do
  params.reject! { |key, value| value.empty? }
  # rest of your code
end
Also, I would rather POST to /movies than to /movies/new; that's more RESTful.
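For example, the conventional split would look something like this (a sketch of the usual Sinatra/REST routes, not taken from your code):

get "/movies/new" do
  erb :'movies/new' # render the form for a new movie
end

post "/movies" do
  # create the movie from the submitted form data
end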
Try a condition that checks whether the fields are blank, like below:
unless [title, year, gross, poster, trailer].any?(&:blank?)
This checks that none of the fields are nil or blank ("").
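Put together, a minimal sketch of that guard (note that blank? comes from ActiveSupport rather than core Ruby, so a plain Sinatra app needs the extra require; the field names are taken from your code):

require "active_support/core_ext/object/blank" # provides blank? on String, NilClass, etc.

post "/movies/new" do
  fields = [params[:title], params[:year], params[:gross], params[:poster], params[:trailer]]
  if fields.any?(&:blank?)
    erb :'movies/new' # at least one field is missing or empty, so re-render the form
  else
    # all fields are filled in; find or create the movie here
  end
end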
I'm banging my head trying to understand how the Twitter gem's pagination works.
I've tried max_id and cursor and they both strangely don't work.
Basically the maximum I can get out of search results is 100, and I would like to get 500.
Current code:
max_page = 5
max_id = -1
@data = []

for i in (1..max_page)
  t = twt_client.search("hello world", :count => 100, :result_type => :recent, :max_id => max_id)
  t.each do |tweet|
    @data << tweet
  end
  max_id = t.next_results[:max_id]
end
This actually tells me that next_results is a private method. Does anyone have a working solution?
Without knowing which gem you're referencing (please specify a URL), I'd say intuitively that cursor and max_id wouldn't get you what you want, but count would. Since you say you're only retrieving 100 results and count is set to 100, that would make sense to me.
t = twt_client.search("hello world", :count => 500, :result_type => :recent, :max_id => max_id)
I'm assuming you're talking about the Twitter client referenced here. My first question is: what is twt_client, and for that matter, what does its search method return? It's also possible that you've unwittingly updated the gem and a code-base change has made your current script out of date.
Take a look at your installed gem version and another look at the README here.
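If you'd rather check from Ruby itself, something like this (using RubyGems' standard API) should print the version that actually gets loaded:

require "twitter"
# Which version of the twitter gem did Ruby load?
puts Gem.loaded_specs["twitter"].version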
Twitter::SearchResults#next_results is private because the gem tries to provide a uniform interface for enumeration.
Look: Twitter::Enumerable is included in search_results.rb:
module Twitter
  class SearchResults
    include Twitter::Enumerable
    ...
    private

    def last?
      !next_page?
    end
    ...
    def fetch_next_page
      response = @client.send(@request_method, @path, next_page).body
      self.attrs = response
    end
    ...
  end
end
And if you look at enumerable.rb, you'll see that the methods Twitter::SearchResults#last? and Twitter::SearchResults#fetch_next_page are used by the Twitter::SearchResults#each method:
module Twitter
  module Enumerable
    include ::Enumerable

    # @return [Enumerator]
    def each(start = 0)
      return to_enum(:each, start) unless block_given?
      Array(@collection[start..-1]).each do |element|
        yield(element)
      end
      unless last?
        start = [@collection.size, start].max
        fetch_next_page
        each(start, &Proc.new)
      end
      self
    end
    ...
  end
end
So Twitter::SearchResults#each will keep fetching pages as long as @attrs[:search_metadata][:next_results] is present in Twitter's responses. That means you just need to break out of the iteration once you reach the 500th element.
I think you just need to use each:
@data = []
tweet_number = 0
search_results = twt_client.search("hello world", count: 100, result_type: :recent)

search_results.each do |tweet|
  @data << tweet
  tweet_number += 1
  break if tweet_number == 500 # stop once we have 500 tweets
end
This post is the result of looking into the gem's sources and Twitter's API. I could have made a mistake somewhere, since I haven't checked my reasoning in the console.
Try this (I basically only updated the calculation of max_id in the loop):

max_page = 5
max_id = -1
@data = []

for i in (1..max_page)
  t = twt_client.search("hello world", :count => 100, :result_type => :recent, :max_id => max_id)
  t.each do |tweet|
    @data << tweet
  end
  # max_id is an inclusive upper bound, so step just below the lowest id seen
  max_id = t.to_a.map(&:id).min - 1 # or maybe max_id = t.map(&:id).min - 1
end
I got Ruby to travel to a web site, iterate through a list of campaigns, and scrape the pages for specific data. The problem I have now is getting that data out of the structure Nokogiri gives me and into a readable form.
require 'watir'
require 'nokogiri'

campaign_list = [1042360, 1042386, 1042365, 992307]

browser = Watir::Browser.new :chrome
browser.goto '<redacted>'
browser.text_field(:id => 'email').set '<redacted>'
browser.text_field(:id => 'password').set '<redacted>'
browser.send_keys :enter

file = File.new('hourlysales.csv', 'w')
data = {}

campaign_list.each do |campaign|
  browser.goto "<redacted>"
  if browser.text.include? "Application Error"
    puts "Error loading page, I recommend restarting script"
    # Possibly automatic restart of script
  else
    hourly_data = Nokogiri::HTML.parse(browser.html).text
    # file.write data
    puts hourly_data
  end
end
This is the output I get:
{"views":[[17,145],[18,165],[19,99],[20,71],[21,31],[22,26],[23,10],[0,15],[1,1], [2,18],[3,19],[4,35],[5,47],[6,44],[7,67],[8,179],[9,141],[10,112],[11,95],[12,46],[13,82],[14,79],[15,70],[16,103]],"orders":[[17,10],[18,9],[19,5],[20,1],[21,1],[22,0],[23,0],[0,1],[1,0],[2,1],[3,0],[4,1],[5,2],[6,1],[7,5],[8,11],[9,6],[10,5],[11,3],[12,1],[13,2],[14,4],[15,6],[16,7]],"conversion_rates":[0.06870229007633588,0.05442176870748299,0.050505050505050504,0.014084507042253521,0.03225806451612903,0.0,0.0,0.06666666666666667,0.0,0.05555555555555555,0.0,0.02857142857142857,0.0425531914893617,0.022727272727272728,0.07462686567164178,0.06134969325153374,0.0425531914893617,0.044642857142857144,0.031578947368421054,0.021739130434782608,0.024390243902439025,0.05063291139240506,0.08571428571428572,0.06741573033707865]}
The arrays stand for { "views" => [[hour, # of views], [hour, # of views], ...] }, and the same goes for orders. I don't need the conversion rates.
I also need to add up the values for each key, so that after doing this for 5 pages I have one key for each hour of the day and the total number of views for that hour. I tried a couple of each loops, but couldn't make any progress.
I appreciate any help you guys can give me.
It looks like the output (which from your code I assume is the content of hourly_data) is JSON. In that case, it's easy to parse and add up the numbers. Something like this:
require "json" # at the top of your script
# ...
def sum_hours_values(data, hours_values=nil)
# Start with an empty hash that automatically initializes missing keys to `0`
hours_values ||= Hash.new {|hsh,hour| hsh[hour] = 0 }
# Iterate through the [hour, value] arrays, adding `value` to the running
# count for that `hour`, and return `hours_values`
data.each_with_object(hours_values) do |(hour, value), hsh|
hsh[hour] += value
end
end
# ... Watir/Nokogiri stuff here...

# Initialize these so they persist outside the loop
hours_views = hours_orders = nil

campaign_list.each do |campaign|
  browser.goto "<redacted>"
  if browser.text.include? "Application Error"
    # ...
  else
    # ...
    hourly_data_parsed = JSON.parse(hourly_data)
    hours_views = sum_hours_values(hourly_data_parsed["views"], hours_views)
    hours_orders = sum_hours_values(hourly_data_parsed["orders"], hours_orders)
  end
end

puts "Views by hour:"
puts hours_views.sort.map {|hour_views| "%2i\t%4i" % hour_views }

puts "Orders by hour:"
puts hours_orders.sort.map {|hour_orders| "%2i\t%4i" % hour_orders }
P.S. There's a really nice recursive version of sum_hours_values I didn't include since the iterative version is clearer to most Ruby programmers. If you're into recursion I leave it as an exercise for you. ;)