Nest output of recursive function - ruby

I have written this piece of code, which outputs a list of job descriptions (in Danish). It works fine; however, I would like to alter the output a bit. The function is recursive because the jobs are nested, but the output does not show the nesting.
How do I configure the function to show an output like this:
Job 1
- Job 1.1
- Job 1.2
-- Job 1.2.1
And so on...
require 'nokogiri'
require 'open-uri'

def crawl(url)
  basePath = 'http://www.ug.dk'
  doc = Nokogiri::HTML(open(basePath + url))
  doc.css('.maplist li').each do |listitem|
    listitem.css('.txt').each do |txt|
      puts txt.content
    end
    listitem.css('a[href]').each do |link|
      crawl(link['href'])
    end
  end
end

crawl('/Job.aspx')

require 'nokogiri'
require 'open-uri'

def crawl(url, nesting_level = 0)
  basePath = 'http://www.ug.dk'
  doc = Nokogiri::HTML(open(basePath + url))
  doc.css('.maplist li').each do |listitem|
    listitem.css('.txt').each do |txt|
      puts " " * nesting_level + txt.content
    end
    listitem.css('a[href]').each do |link|
      crawl(link['href'], nesting_level + 1)
    end
  end
end

crawl('/Job.aspx')

I see two options:
Pass an additional argument to the recursive function to indicate the level you are currently in. Initialize the value to 0 and each time you call the function increment this value. Something like this:
def crawl(url, level = 0)
  basePath = 'http://www.ug.dk'
  doc = Nokogiri::HTML(open(basePath + url))
  doc.css('.maplist li').each do |listitem|
    listitem.css('.txt').each do |txt|
      # prefix each title with one dash per nesting level
      puts "#{'-' * level} #{txt.content}".strip
    end
    listitem.css('a[href]').each do |link|
      crawl(link['href'], level + 1)
    end
  end
end
Make use of the caller array that holds the call stack. Use the size of this array to determine how deep in the recursion you currently are.
def try
  puts caller.inspect
end

try
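A rough sketch of that second option applied to the crawl method from the question. The blocks passed to each add their own stack frames, so the sketch counts only the frames that belong to crawl itself; the exact frame format varies between Ruby versions, which is another reason the explicit argument is easier to maintain:

require 'nokogiri'
require 'open-uri'

def crawl(url)
  # Count how many call-stack frames belong to crawl itself; blocks passed
  # to #each add extra frames, so match only the plain "in `crawl'" entries.
  level = caller.count { |frame| frame.end_with?("in `crawl'") }
  basePath = 'http://www.ug.dk'
  doc = Nokogiri::HTML(open(basePath + url))
  doc.css('.maplist li').each do |listitem|
    listitem.css('.txt').each { |txt| puts "#{'-' * level} #{txt.content}".strip }
    listitem.css('a[href]').each { |link| crawl(link['href']) }
  end
end

crawl('/Job.aspx')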
I would personally stick to the first version as it seems easier to read, though it requires you to modify the interface.

Related

Nokogiri Throwing Exception in Function but not outside of Function

I'm new to Ruby and am using Nokogiri to parse HTML web pages. An error is thrown in a function when it gets to the line:
currentPage = Nokogiri::HTML(open(url))
I have verified the inputs of the function; url is a string with a web address. The line I previously mentioned works exactly as intended when used outside of the function, but not inside. When it gets to that line inside the function, the following error is thrown:
WebCrawler.rb:25:in `explore': undefined method `+@' for #<Nokogiri::HTML::Document:0x007f97ea0cdf30> (NoMethodError)
from WebCrawler.rb:43:in `<main>'
The function the problematic line is in is pasted below.
def explore(url)
  if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
    return
  end
  CRAWLED_PAGES_COUNTER++
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do |link|
    puts link
    explore(link)
  end
end
Here is the full program (It's not much longer):
require 'nokogiri'
require 'open-uri'

#Crawler Params
START_URL = "https://en.wikipedia.org"
CRAWLED_PAGES_COUNTER = 0
CRAWLED_PAGES_LIMIT = 5

#Crawler Functions
def explore(url)
  if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
    return
  end
  CRAWLED_PAGES_COUNTER++
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do |link|
    puts link
    explore(link)
  end
end

def eval_page(page)
  puts page.title
end

#Start Crawling
explore(START_URL)
Ruby has no ++ operator (that is what triggers the NoMethodError), and reassigning a constant only earns you a warning, so keep the counter in a global variable and increment it with += 1:

require 'nokogiri'
require 'open-uri'

#Crawler Params
$START_URL = "https://en.wikipedia.org"
$CRAWLED_PAGES_COUNTER = 0
$CRAWLED_PAGES_LIMIT = 5

#Crawler Functions
def explore(url)
  if $CRAWLED_PAGES_COUNTER > $CRAWLED_PAGES_LIMIT
    return
  end
  $CRAWLED_PAGES_COUNTER += 1
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do |link|
    puts link
    explore(link)
  end
end

def eval_page(page)
  puts page.title
end

#Start Crawling
explore($START_URL)
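To see why the error message complains about a Nokogiri::HTML::Document at all, note how Ruby parses the ++ line together with the line that follows it; a minimal illustration:

counter = 0

# "counter++" followed by another line is parsed as one expression:
#   counter + +(<value of the next line>)
# so the original code ended up calling the unary +@ method on the
# Nokogiri::HTML::Document returned by the following line.
counter += 1            # the idiomatic way to increment
counter = counter.succ  # an equivalent alternative

puts counter            # => 2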
Just to give you something to build from, this is a simple spider that only harvests and visits links. Modifying it to do other things would be easy.
require 'nokogiri'
require 'open-uri'
require 'set'

BASE_URL = 'http://example.com'
URL_FORMAT = '%s://%s:%s'
SLEEP_TIME = 30 # in seconds

urls = [BASE_URL]
last_host = BASE_URL
visited_urls = Set.new
visited_hosts = Set.new

until urls.empty?
  this_uri = URI.join(last_host, urls.shift)
  next if visited_urls.include?(this_uri)

  puts "Scanning: #{this_uri}"

  doc = Nokogiri::HTML(this_uri.open)
  visited_urls << this_uri

  if visited_hosts.include?(this_uri.host)
    puts "Sleeping #{SLEEP_TIME} seconds to reduce server load..."
    sleep SLEEP_TIME
  end

  visited_hosts << this_uri.host

  urls += doc.search('[href]').map { |node|
    node['href']
  }.select { |url|
    extension = File.extname(URI.parse(url).path)
    extension[/\.html?$/] || extension.empty?
  }

  last_host = URL_FORMAT % [:scheme, :host, :port].map { |s| this_uri.send(s) }
  puts "#{urls.size} URLs remain."
end
It:
Works on http://example.com. That site is designed and designated for experimenting.
Checks to see if a page was visited previously and won't scan it again. It's a naive check and will be fooled by URLs containing queries or queries that are not in a consistent order.
Checks to see if a site was previously visited and automatically throttles the page retrieval if so. It could be fooled by aliases.
Checks to see if a page ends with ".htm", ".html" or has no extension. Anything else is ignored.
The actual code for an industrial-strength spider is much more involved: robots.txt files need to be honored, figuring out how to deal with pages that redirect to other pages (via HTTP timeouts or JavaScript redirects) is a fun task, and dealing with malformed pages is a challenge....
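As a taste of the robots.txt part, here is a deliberately naive sketch that fetches each host's robots.txt once and skips any path under a Disallow rule; a real crawler should use a proper parser that understands User-agent sections, Allow rules, and wildcards:

require 'open-uri'
require 'uri'

ROBOTS_CACHE = {}

# Very naive robots.txt check: collect every "Disallow:" path, ignoring
# which User-agent section it belongs to, and refuse URLs under those paths.
def allowed?(uri)
  disallowed = ROBOTS_CACHE[uri.host] ||= begin
    URI.join("#{uri.scheme}://#{uri.host}/", 'robots.txt')
       .open.read
       .scan(/^Disallow:\s*(\S+)/i)
       .flatten
  rescue StandardError
    [] # no robots.txt (or unreadable): treat everything as allowed
  end
  disallowed.none? { |path| uri.path.start_with?(path) }
end

In the loop above you would then guard the fetch with next unless allowed?(this_uri).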

Ruby: Decorator pattern slows simple program by a lot

I recently wrote a program to return a bunch of stocks from the stock market that are unhealthy. The basic algorithm is this:
Look up all the quotes of every stock in an exchange (either NYSE or NASDAQ)
Find the ones that are less than 5 dollars from step 1
Find the ones from step 2 that are down 3 days and have large volume (expensive because I have to make a request for each stock, which is like ~700 currently for nasdaq).
Scan the news for the ones returned by step 3.
I had this all in one file:
Original implementation (https://github.com/EdmundMai/minion/blob/aa14bc3234a4953e7273ec502276c6f0073b459d/lib/minion.rb):
require 'bundler/setup'
require "minion/version"
require "yahoo-finance"
require "business_time"
require 'nokogiri'
require 'open-uri'

module Minion
  class << self
    def query(exchange)
      client = YahooFinance::Client.new
      all_companies = CSV.read("#{exchange}.csv")
      small_caps = []
      ticker_symbols = all_companies.map { |row| row[0] }
      ticker_symbols.each_slice(200) do |batch|
        data = client.quotes(batch, [:symbol, :last_trade_price, :average_daily_volume])
        small_caps << data.select { |stock| stock.last_trade_price.to_f < 5.0 }
      end

      attractive = []
      small_caps.flatten!.each_with_index do |small_cap, index|
        begin
          data = client.historical_quotes(small_cap.symbol, { start_date: 2.business_days.ago, end_date: Time.now })
          closing_prices = data.map(&:close).map(&:to_f)
          volumes = data.map(&:volume).map(&:to_i)
          negative_3_days_in_a_row = closing_prices == closing_prices.sort
          larger_than_average_volume = volumes.reduce(:+) / volumes.count > small_cap.average_daily_volume.to_i
          if negative_3_days_in_a_row && larger_than_average_volume
            attractive << small_cap.symbol
            puts "Qualified: #{small_cap.symbol}, finished with #{index} out of #{small_caps.count}"
          else
            puts "Not qualified: #{small_cap.symbol}, finished with #{index} out of #{small_caps.count}"
          end
        rescue => e
          puts e.inspect
        end
      end

      final_results = []
      attractive.each do |symbol|
        rss_feed = Nokogiri::HTML(open("http://feeds.finance.yahoo.com/rss/2.0/headline?s=#{symbol}&region=US&lang=en-US"))
        html_body = rss_feed.css('body')[0].text
        diluting = false
        ['warrant', 'cashless exercise'].each do |keyword|
          diluting = true if html_body.match(/#{keyword}/i)
        end
        final_results << symbol if diluting
      end
      final_results
    end
  end
end
This was really fast and would finish processing like ~700 stocks in a minute or less.
Then I tried refactoring, splitting the algorithm up into different classes and files without changing the algorithm itself at all. I decided on the decorator pattern since it seemed to fit. However, when I run the program now, it makes each request really slowly (15+ minutes in total). I know this because my puts statements get printed out really slowly.
New and slower implementation (https://github.com/EdmundMai/minion/blob/master/lib/minion.rb)
require 'bundler/setup'
require "minion/version"
require "yahoo-finance"
require "minion/dilution_finder"
require "minion/negative_finder"
require "minion/small_cap_finder"
require "minion/market_fetcher"

module Minion
  class << self
    def query(exchange)
      all_companies = CSV.read("#{exchange}.csv")
      all_tickers = all_companies.map { |row| row[0] }
      short_finder = DilutionFinder.new(NegativeFinder.new(SmallCapFinder.new(MarketFetcher.new(all_tickers))))
      short_finder.results
    end
  end
end
The part it's lagging at according to my puts:
require "yahoo-finance"
require "business_time"
require_relative "stock_finder"
class NegativeFinder < StockFinder
def results
client = YahooFinance::Client.new
results = []
finder.results.each_with_index do |stock, index|
begin
data = client.historical_quotes(stock.symbol, { start_date: 2.business_days.ago, end_date: Time.now })
closing_prices = data.map(&:close).map(&:to_f)
volumes = data.map(&:volume).map(&:to_i)
negative_3_days_in_a_row = closing_prices == closing_prices.sort
larger_than_average_volume = volumes.reduce(:+) / volumes.count > stock.average_daily_volume.to_i
if negative_3_days_in_a_row && larger_than_average_volume
results << stock
puts "Qualified: #{stock.symbol}, finished with #{index} out of #{finder.results.count}"
else
puts "Not qualified: #{stock.symbol}, finished with #{index} out of #{finder.results.count}"
end
rescue => e
puts e.inspect
end
end
results
end
end
It's lagging on step 3 (making one request for each stock). I'm not sure what's going on, so any advice would be appreciated. If you want to clone the program and run it, just uncomment the last line in lib/minion.rb and run ruby lib/minion.rb.
After debugging it, I figured it out. It was because I was calling finder.results (results being the decorated method) inside the loop in NegativeFinder, as shown below:
require "yahoo-finance"
require "business_time"
require_relative "stock_finder"
class NegativeFinder < StockFinder
def results
client = YahooFinance::Client.new
results = []
finder.results.each_with_index do |stock, index|
begin
data = client.historical_quotes(stock.symbol, { start_date: 2.business_days.ago, end_date: Time.now })
closing_prices = data.map(&:close).map(&:to_f)
volumes = data.map(&:volume).map(&:to_i)
negative_3_days_in_a_row = closing_prices == closing_prices.sort
larger_than_average_volume = volumes.reduce(:+) / volumes.count > stock.average_daily_volume.to_i
if negative_3_days_in_a_row && larger_than_average_volume
results << stock
// HERE!!!!!!!!!!!!!!!!!!!!!!!!!
puts "Qualified: #{stock.symbol}, finished with #{index} out of #{finder.results.count}" <------------------------------------
else
// AND HERE!!!!!!!!!!!!!!!!!!!!!!!!!
puts "Not qualified: #{stock.symbol}, finished with #{index} out of #{finder.results.count}" <-----------------------------------------------------------
end
rescue => e
puts e.inspect
end
end
results
end
end
This caused a cascade of requests every time I iterated through the loop in NegativeFinder. Removing that call fixed it. Lesson: when using the decorator pattern, call the decorated method only once, especially when each call does something expensive; alternatively, hold the returned value in an instance variable so you don't have to recalculate it each time.
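A minimal sketch of that caching idea, assuming StockFinder exposes the wrapped object as finder (as the code above implies); negative_with_volume? is a hypothetical helper standing in for the historical-quotes check:

class NegativeFinder < StockFinder
  def results
    # Memoize: compute once and return the cached array on later calls,
    # so logging or other callers never re-trigger the upstream chain.
    @results ||= compute_results
  end

  private

  def compute_results
    upstream = finder.results # the one and only call to the decorated method
    upstream.select { |stock| negative_with_volume?(stock) }
  end
end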
Also, as a side note, I've decided not to go with the decorator pattern because I don't think it applies well here. Something like SmallCapFinder.new(SmallCapFinder.new(MarketFetcher.new(all_tickers))) doesn't add functionality at all (adding functionality being the primary point of the decorator pattern), so chaining decorators doesn't do anything. Therefore, I'm just going to make them methods instead of adding unnecessary complexity.
There are some things missing from the code you gave us (the base class StockFinder and MarketFetcher), but I think you are now instantiating more than one YahooFinance::Client. Input/output to other systems is very often the cause of speed problems.
I suggest that you first encapsulate the finance client and access to financial data. This makes it easier when you want to switch your financial data provider, or add another one. Instead of the decorator pattern, I would just use plain old methods for finding small caps, finding negative, etc.
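A rough sketch of that plain-methods shape, keeping the four steps of the original implementation; small_caps_for, negative_with_volume?, and diluting? are hypothetical helpers that would wrap the existing step 2, 3, and 4 logic:

module Minion
  class << self
    # One shared client instead of a new YahooFinance::Client per step
    def client
      @client ||= YahooFinance::Client.new
    end

    def query(exchange)
      tickers    = CSV.read("#{exchange}.csv").map { |row| row[0] }
      small_caps = small_caps_for(tickers)                            # steps 1-2
      negatives  = small_caps.select { |s| negative_with_volume?(s) } # step 3
      negatives.map(&:symbol).select { |sym| diluting?(sym) }         # step 4
    end
  end
end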

Ruby + recursive function + defining global variable

I am pulling a Bitbucket repo list using Ruby. The response from Bitbucket will contain only 10 repositories and a marker for the next page, where there will be another 10 repos, and so on (they call it pagination).
So, I wrote a recursive function which calls itself if the next page marker is present. This will continue until it reaches the last page.
Here is my code:
#!/usr/local/bin/ruby

require 'net/http'
require 'json'
require 'awesome_print'

@repos = Array.new

def recursive(url)
  ### here goes my net/http code which connects to bitbucket and pulls back the response in a JSON as request.body
  ### hence, removing this code for brevity

  hash = JSON.parse(response.body)

  hash["values"].each do |x|
    @repos << x["links"]["self"]["href"]
  end

  if hash["next"]
    puts "next page exists"
    puts "calling recursive with: #{hash["next"]}"
    recursive(hash["next"])
  else
    puts "this is the last page. No more recursions"
  end
end

repo_list = recursive('https://my_bitbucket_url')

@repos.each { |x| puts x }
Now, my code works fine and it lists all the repos.
Question:
I am new to Ruby, so I am not sure about the way I have used the global variable @repos = Array.new above. If I define the array inside the function, then each call to the function will create a new array, overwriting its contents from the previous call.
So, how do Ruby programmers use a global variable in such cases? Does my code follow Ruby conventions, or is it a really amateur (yet correct, because it works) way of doing it?
The consensus is to avoid global variables as much as possible.
I would either build the collection recursively like this:
def recursive(url)
  ### ...
  result = []
  hash["values"].each do |x|
    result << x["links"]["self"]["href"]
  end
  if hash["next"]
    result += recursive(hash["next"])
  end
  result
end
or hand over the collection to the function:
def recursive(url, result = [])
  ### ...
  hash["values"].each do |x|
    result << x["links"]["self"]["href"]
  end
  if hash["next"]
    recursive(hash["next"], result)
  end
  result
end
Either way, you can call the function:
repo_list = recursive(url)
And I would write it like this:
def recursive(url)
  # ...
  result = hash["values"].map { |x| x["links"]["self"]["href"] }
  result += recursive(hash["next"]) if hash["next"]
  result
end

Ruby CGI won't return method calls, but will return parameters

My environment: ruby 1.9.3p392 (2013-02-22 revision 39386) [x86_64-linux]
The thing is, I can make Ruby return the parameters sent over GET, but when I try to use them as arguments to my methods in an if/else, Ruby won't return anything and I end up with a blank page.
ph and pm return correctly:
http://127.0.0.1/cgi-bin/test.rb?hostname=node00.abit.dk&macadd=23:14:41:51:63
returns:
node00.abit.dk 23:14:41:51:63
The connection to the database (MySQL) works fine.
When I test the method newHostName, it outputs correctly:
puts newHostName
returns (which is correct)
node25.abit.dk
the code:
#!/usr/bin/ruby
require 'cgi'
require 'sequel'
require 'socket'
require 'timeout'

DB = Sequel.connect(:adapter=>'mysql', :host=>'localhost', :database=>'nodes', :user=>'nodeuser', :password=>'...')

#cgi-part to work
#takes 2 parameters:
#hostname & macadd
cgi = CGI.new
puts cgi.header
p = cgi.params
ph = p['hostname']
pm = p['macadd']

def nodeLookup(hostnameargv)
  hostname = DB[:basenode]
  h = hostname[:hostname => hostnameargv]
  h1 = h[:hostname]
  h2 = h[:macadd]
  ary = [h1, h2]
  return ary
end

def lastHostName()
  #TODO: replace with correct sequel-code and NOT raw SQL
  DB.fetch("SELECT hostname FROM basenode ORDER BY id DESC LIMIT 1") do |row|
    return row[:hostname]
  end
end

def newHostName()
  org = lastHostName
  #Need this 'hack' to make ruby grep for the number
  #nodename e.g 'node01.abit.dk'
  var1 = org[4]
  var2 = org[5]
  var3 = var1 + var2
  sum = var3.to_i + 1
  #puts sum
  sum = "node" + sum.to_s + ".abit.dk"
  return sum
end

def insertNewNode(newhost, newmac)
  newnode = DB[:basenode]
  newnode.insert(:hostname => newhost, :macadd => newmac)
  return "#{newnode.count}"
end
#puts ph
#puts pm
#puts newHostName
cgi.out() do
  cgi.html do
    begin
      if ph == "node00.abit.dk"
        puts newHostName
      else
        puts nodeLookup(ph)
      end
    end
  end
end
I feel like I'm missing something here. Any help is very much appreciated!
//M00kaw
What about modifying the last lines of your code as follows? CGI HTML generation methods take a block and use the block's return value as their content. So you should make newHostName or nodeLookup(ph) the return value of the block passed to cgi.html(), rather than puts-ing them, which prints the content to your terminal and returns nil. That's why cgi.html() got an empty string (nil.to_s).
#puts newHostName
cgi.out() do
  cgi.html do
    if ph == "node00.abit.dk"
      newHostName
    else
      nodeLookup(ph)
    end
  end
end
p.s. It's conventional to indent your ruby code with 2 spaces :-)

Why is only the first link fetched?

I'm trying to fetch news from Hacker News and write each link's title and URL to an HTML file. However, only the first link is getting written and the others are not. What am I doing wrong?
require 'httparty'
def fetch(source)
  response = HTTParty.get(source)
  response["items"].each do |item|
    return '<a href="' + item["url"] + '">' + item["title"] + '</a>'
  end
end

links = fetch('http://api.ihackernews.com/page')

File.open("/tmp/news.html", "w") do |f|
  f.puts links
end
You shouldn't use the return keyword in this case. It exits the method on the first iteration, so only the first link is returned. Use map instead:
require 'httparty'

def fetch(source)
  response = HTTParty.get(source)
  # convert response['items'] array to an array of link strings
  response["items"].map do |item|
    '<a href="' + item["url"] + '">' + item["title"] + '</a>'
  end
end
links = fetch('http://api.ihackernews.com/page')
links.length # => 30
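With that change, the file-writing part from the question works as-is, since puts writes each element of an array on its own line:

links = fetch('http://api.ihackernews.com/page')

File.open("/tmp/news.html", "w") do |f|
  f.puts links # each link string ends up on its own line in the file
end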
