Ruby + recursive function + defining global variable

I am pulling a list of Bitbucket repositories using Ruby. Each response from Bitbucket contains only 10 repositories plus a marker for the next page, which holds another 10 repositories, and so on (they call it pagination).
So I wrote a recursive function that calls itself whenever the next-page marker is present. This continues until it reaches the last page.
Here is my code:
#!/usr/local/bin/ruby
require 'net/http'
require 'json'
require 'awesome_print'
#repos = Array.new
def recursive(url)
### here goes my net/http code, which connects to Bitbucket and returns the response as JSON in response.body
### (removed for brevity)
hash = JSON.parse(response.body)
hash["values"].each do |x|
#repos << x["links"]["self"]["href"]
end
if hash["next"]
puts "next page exists"
puts "calling recursive with: #{hash["next"]}"
recursive(hash["next"])
else
puts "this is the last page. No more recursions"
end
end
repo_list = recursive('https://my_bitbucket_url')
#repos.each {|x| puts x}
Now, my code works fine and it lists all the repos.
Question:
I am new to Ruby, so I am not sure about the way I have used the global variable (#repos = Array.new) above. If I define the array inside the function, each call will create a new array, overwriting the contents from the previous call.
So how do Ruby programmers handle a shared collection in such cases? Does my code follow Ruby conventions, or is it an amateur (yet working) way of doing it?

The consensus is to avoid global variables as much as possible.
I would either build the collection recursively like this:
def recursive(url)
### ...
result = []
hash["values"].each do |x|
result << x["links"]["self"]["href"]
end
if hash["next"]
result += recursive(hash["next"])
end
result
end
or hand over the collection to the function:
def recursive(url, result = [])
### ...
hash["values"].each do |x|
result << x["links"]["self"]["href"]
end
if hash["next"]
recursive(hash["next"], result)
end
result
end
Either way, you can call the function like this:
repo_list = recursive(url)
And I would write it like this:
def recursive(url)
# ...
result = hash["values"].map { |x| x["links"]["self"]["href"] }
result += recursive(hash["next"]) if hash["next"]
result
end
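To try the approach without contacting Bitbucket, here is a minimal sketch. The FAKE_PAGES hash and the fetch_body helper are hypothetical stand-ins for the net/http call, and the page keys and repository names are made up for illustration:

```ruby
require 'json'

# Hypothetical canned responses standing in for the Bitbucket API.
FAKE_PAGES = {
  "page1" => {
    "values" => [{ "links" => { "self" => { "href" => "repo-a" } } }],
    "next"   => "page2"
  },
  "page2" => {
    "values" => [{ "links" => { "self" => { "href" => "repo-b" } } }]
  }
}

# Stand-in for the net/http request: returns the JSON body for a URL.
def fetch_body(url)
  JSON.generate(FAKE_PAGES.fetch(url))
end

# Collects the links of this page, then appends those of all following pages.
def repo_links(url)
  hash = JSON.parse(fetch_body(url))
  links = hash["values"].map { |x| x["links"]["self"]["href"] }
  links += repo_links(hash["next"]) if hash["next"]
  links
end

p repo_links("page1")
```

Because every call returns its own array, no state has to live outside the function.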

Related

How to write a while loop properly

I'm trying to scrape a website however I cannot seem to get my while-loop to break out once it hits a page with no more information:
def scrape_verse_items(keyword)
pg = 1
while pg < 1000
puts "page #{pg}"
url = "https://www.bible.com/search/bible?page=#{pg}&q=#{keyword}&version_id=1"
doc = Nokogiri::HTML(open(url))
items = doc.css("ul.search-result li.reference")
error = doc.css('div#noresults')
until error.any? do
if keyword != ''
item_hash = {}
items.each do |item|
title = item.css("h3").text.strip
content = item.css("p").text.strip
item_hash[title] = content
end
else
puts "Please enter a valid search"
end
if error.any?
break
end
end
pg += 1
end
item_hash
end
puts scrape_verse_items('joy')
I know this doesn't exactly answer your question, but perhaps you might consider using a different approach altogether.
Nested while and until loops can get a bit confusing; here the inner until loop never re-fetches the page, so its condition never changes.
Maybe you would consider using recursion instead.
I've written a small script that seems to work:
require 'nokogiri'
require 'open-uri'

class MyScrapper
  def initialize; end

  def call(keyword)
    # Reject a nil or empty keyword before making any requests
    if keyword.to_s.empty?
      puts "Please enter a valid search"
      return
    end
    scrape({}, keyword, 1)
  end

  private

  # Loads one page and merges its items with those of all following pages
  def scrape(results, keyword, page)
    doc = load_page(keyword, page)
    return results if doc.css('div#noresults').any?
    build_new_items(doc).merge(scrape(results, keyword, page + 1))
  end

  def load_page(keyword, page)
    url = "https://www.bible.com/search/bible?page=#{page}&q=#{keyword}&version_id=1"
    # URI.open comes from open-uri; plain Kernel#open no longer accepts URLs in Ruby 3
    Nokogiri::HTML(URI.open(url))
  end

  # Builds a hash of title => content from the search results on one page
  def build_new_items(doc)
    items = doc.css("ul.search-result li.reference")
    items.reduce({}) do |list, item|
      title = item.css("h3").text.strip
      content = item.css("p").text.strip
      list[title] = content
      list
    end
  end
end
You call it by doing MyScrapper.new.call("Keyword"). (It might make more sense to have this as a module you include, or even to make these class methods, to avoid the need to instantiate the class.)
What this does is call a method named scrape, giving it the starting results, the keyword, and the page number. It loads the page; if there are no results, it returns the results it has found so far.
Otherwise it builds a hash from the page it loaded, then the method calls itself and merges the results with the new hash it just built. It does this until there are no more results.
If you want to limit the number of pages, you can just change this line:
return results if doc.css('div#noresults').any?
to this:
return results if doc.css('div#noresults').any? || page > 999
Note: You might want to double-check the results that are being returned are correct. I think they should be but I wrote this quite quickly, so there could always be a small bug hiding somewhere in there.
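The recursion and the page limit can be checked offline with a small sketch. Here FAKE_RESULTS, MAX_PAGE and items_on are hypothetical stand-ins for the real pages and for load_page plus build_new_items, and the titles and contents are placeholders:

```ruby
# Hypothetical search results keyed by page number. A missing page plays the
# role of the "div#noresults" page.
FAKE_RESULTS = {
  1 => { "verse 1" => "text 1", "verse 2" => "text 2" },
  2 => { "verse 3" => "text 3" }
}

MAX_PAGE = 999

# Stand-in for load_page + build_new_items: returns a page's items or nil.
def items_on(page)
  FAKE_RESULTS[page]
end

def scrape(page = 1)
  items = items_on(page)
  # Base case: no more results, or the page limit has been passed
  return {} if items.nil? || page > MAX_PAGE
  # Recursive case: this page's items merged with everything after it
  items.merge(scrape(page + 1))
end

puts scrape.keys
```

Each call makes one more stack frame, so a hard page limit also keeps the recursion depth bounded.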

How should I use recursive method in ruby

I wrote a simple web crawler using Mechanize, and now I'm stuck on how to get the next page recursively. The code is below.
def self.generate_page # generate a Mechanize page object (the first page)
agent = Mechanize.new
url = "http://www.baidu.com/s?wd=intitle:#{URI.encode(WORD)}%20site:sina.com.cn&rn=50&gpc=stf#{URI.encode(TIME)}"
page = agent.get(url)
page
end
def self.next_page(n_page) # get the next page recursively by clicking the "next" link shown on each page
puts n_page
# if I don't use puts, I get nothing; when using puts, I get
#<Mechanize::Page:0x007fd341c70fd0>
#<Mechanize::Page:0x007fd342f2ce08>
#<Mechanize::Page:0x007fd341d0cf70>
#<Mechanize::Page:0x007fd3424ff5c0>
#<Mechanize::Page:0x007fd341e1f660>
#<Mechanize::Page:0x007fd3425ec618>
#<Mechanize::Page:0x007fd3433f3e28>
#<Mechanize::Page:0x007fd3433a2410>
#<Mechanize::Page:0x007fd342446ca0>
#<Mechanize::Page:0x007fd343462490>
#<Mechanize::Page:0x007fd341c2fe18>
#<Mechanize::Page:0x007fd342d18040>
#<Mechanize::Page:0x007fd3432c76a8>
# which are the results I want
np = Mechanize.new.click(n_page.link_with(:text=>/next/)) unless n_page.link_with(:text=>/next/).nil?
result = next_page(np) unless np.nil?
result # here the value is empty, I don't know what is wrong
end
def self.get_page # trying to pass the result of next_page() method
puts next_page(generate_page)
# it seems the result is never passed here
end
I followed these two links, "What is recursion and how does it work?" and "Ruby recursive function", but I still can't figure out what's wrong. I hope someone can help me out. Thanks.
There are a few issues with your code:
You shouldn't be calling Mechanize.new more than once.
From a stylistic perspective, you are doing too many nil checks.
Unless you have a preference for recursion, it'll probably be easier to do it iteratively.
To have your next_page method return an array containing every linked page in the chain, you could write this:
# Store the Mechanize agent in a constant so that a single agent is reused
Agent = Mechanize.new

# A helper method to DRY up the code.
# Returns the next page, or nil when the page has no "next" link.
def click_to_next_page(page)
  link = page.link_with(:text => /next/)
  link && Agent.click(link)
end

# Repeatedly visits the next page until none exists.
# Returns all pages seen after the given one as an array.
def get_all_next_pages(page)
  results = []
  while (page = click_to_next_page(page))
    results.push(page)
  end
  results
end

# Testing it out (I'm not actually running this)
base_url = "http://www.baidu.com/s?wd=intitle:#{URI.encode(WORD)}%20site:sina.com.cn&rn=50&gpc=stf#{URI.encode(TIME)}"
root_page = Agent.get(base_url)
next_pages = get_all_next_pages(root_page)
puts next_pages
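The iterative loop can be exercised without Mechanize or a network connection. In this sketch FakePage is a hypothetical stand-in for Mechanize::Page, and click_to_next_page simply follows a stored reference instead of clicking a link:

```ruby
# Hypothetical stand-in for Mechanize::Page: each page knows its successor.
FakePage = Struct.new(:title, :next_page)

# Returns the page behind the "next" link, or nil on the last page.
def click_to_next_page(page)
  page.next_page
end

# Follows "next" links until none is left and returns every page visited.
def get_all_next_pages(page)
  results = []
  while (page = click_to_next_page(page))
    results << page
  end
  results
end

third  = FakePage.new("page 3", nil)
second = FakePage.new("page 2", third)
first  = FakePage.new("page 1", second)

puts get_all_next_pages(first).map(&:title)
```

The starting page itself is not part of the result, which matches the "all next pages" naming.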

Increasing Ruby Resolv Speed

I'm trying to build a sub-domain brute forcer for use with my clients (I work in security/pen testing).
Currently, I am able to get Resolv to look up around 70 hosts in 10 seconds, give or take, and I wanted to know if there is a way to get it to do more. I have seen alternative scripts out there, mainly Python based, that achieve far greater speeds than this. I don't know how to increase the number of requests Resolv makes in parallel, or whether I should split the list up. Please note I have put Google's DNS servers in the sample code, but will be using internal ones for live usage.
My rough code for debugging this issue is:
require 'resolv'
def subdomains
puts "Subdomain enumeration beginning at #{Time.now.strftime("%H:%M:%S")}"
subs = []
domains = File.open("domains.txt", "r") #list of domain names line by line.
Resolv.new(:nameserver => ['8.8.8.8', '8.8.4.4'])
File.open("tiny.txt", "r").each_line do |subdomain|
subdomain.chomp!
domains.each do |d|
puts "Checking #{subdomain}.#{d}"
ip = Resolv.new.getaddress "#{subdomain}.#{d}" rescue ""
if ip != nil
subs << subdomain+"."+d << ip
end
end
end
test = subs.each_slice(4).to_a
test.each do |z|
if !z[1].nil? and !z[3].nil?
puts z[0] + "\t" + z[1] + "\t\t" + z[2] + "\t" + z[3]
end
end
puts "Finished at #{Time.now.strftime("%H:%M:%S")}"
end
subdomains
domains.txt is my list of client domain names, for example google.com, bbc.co.uk, apple.com and 'tiny.txt' is a list of potential subdomain names, for example ftp, www, dev, files, upload. Resolv will then lookup files.bbc.co.uk for example and let me know if it exists.
One thing is you are creating a new Resolv instance with the Google nameservers, but never using it; you create a brand new Resolv instance to do the getaddress call, so that instance is probably using some default nameservers and not the Google ones. You could change the code to something like this:
# Resolv::DNS is the class that accepts a :nameserver option
resolv = Resolv::DNS.new(:nameserver => ['8.8.8.8', '8.8.4.4'])
# ...
ip = resolv.getaddress "#{subdomain}.#{d}" rescue ""
In addition, I suggest using the File.readlines method to simplify your code:
domains = File.readlines("domains.txt").map(&:chomp)
subdomains = File.readlines("tiny.txt").map(&:chomp)
Also, you're rescuing the bad ip and setting it to the empty string, but then in the next line you test for not nil, so all results should pass, and I don't think that's what you want.
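The difference between rescuing to "" and rescuing to nil can be seen in a small sketch. fake_getaddress is a hypothetical helper that fails for unknown names the way a resolver would, and 192.0.2.1 is an address reserved for documentation:

```ruby
# Hypothetical lookup: raises for every name except www.example.com.
def fake_getaddress(name)
  raise ArgumentError, "no address for #{name}" unless name == "www.example.com"
  "192.0.2.1"
end

# Rescuing to an empty string: the nil check still passes for a failed lookup.
ip = fake_getaddress("missing.example.com") rescue ""
puts ip != nil   # true

# Rescuing to nil: a failed lookup can now be filtered with a plain `if ip`.
ip = fake_getaddress("missing.example.com") rescue nil
puts ip.nil?     # true
```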
I've refactored your code, but not tested it. Here is what I came up with, which may be clearer:
def subdomains
  puts "Subdomain enumeration beginning at #{Time.now.strftime("%H:%M:%S")}"
  domains = File.readlines("domains.txt").map(&:chomp)
  subdomains = File.readlines("tiny.txt").map(&:chomp)
  # Resolv::DNS is the class that accepts a :nameserver option
  resolv = Resolv::DNS.new(:nameserver => ['8.8.8.8', '8.8.4.4'])

  # Flat list of name, ip, name, ip, ... for every name that resolves
  valid_subdomains = subdomains.each_with_object([]) do |subdomain, found|
    domains.each do |domain|
      combined_name = "#{subdomain}.#{domain}"
      puts "Checking #{combined_name}"
      ip = resolv.getaddress(combined_name) rescue nil
      found << combined_name << ip if ip
    end
  end

  # Print two name/ip pairs per line, as in the original code
  valid_subdomains.each_slice(4).each do |z|
    if z[1] && z[3]
      puts "#{z[0]}\t#{z[1]}\t\t#{z[2]}\t#{z[3]}"
    end
  end
  puts "Finished at #{Time.now.strftime("%H:%M:%S")}"
end
Also, you might want to check out the dnsruby gem (https://github.com/alexdalitz/dnsruby). It might do what you want to do better than Resolv.
[Note: I've rewritten the code so that it fetches the IP addresses in chunks. Please see https://gist.github.com/keithrbennett/3cf0be2a1100a46314f662aea9b368ed. You can modify the RESOLVE_CHUNK_SIZE constant to balance performance with resource load.]
I've rewritten this code using the dnsruby gem (written mainly by Alex Dalitz in the UK, and contributed to by myself and others). This version uses asynchronous message processing so that all requests are being processed pretty much simultaneously. I've posted a gist at https://gist.github.com/keithrbennett/3cf0be2a1100a46314f662aea9b368ed but will also post the code here.
Note that since you are new to Ruby, there are lots of things in the code that might be instructive to you, such as method organization, use of Enumerable methods (e.g. the amazing 'partition' method), the Struct class, rescuing a specific Exception class, %w, and Benchmark.
Note: it looks like Stack Overflow enforces a maximum message size, so this code is truncated. Go to the gist linked above for the complete code.
#!/usr/bin/env ruby
# Takes a list of subdomain prefixes (e.g. %w(ftp xyz)) and a list of domains (e.g. %w(nytimes.com afp.com)),
# creates the subdomains combining them, fetches their IP addresses (or nil if not found).
require 'dnsruby'
require 'awesome_print'
RESOLVER = Dnsruby::Resolver.new(:nameserver => %w(8.8.8.8 8.8.4.4))
# Experiment with this to get fast throughput but not overload the dnsruby async mechanism:
RESOLVE_CHUNK_SIZE = 50
IpEntry = Struct.new(:name, :ip) do
def to_s
"#{name}: #{ip ? ip : '(nil)'}"
end
end
def assemble_subdomains(subdomain_prefixes, domains)
domains.each_with_object([]) do |domain, subdomains|
subdomain_prefixes.each do |prefix|
subdomains << "#{prefix}.#{domain}"
end
end
end
def create_query_message(name)
Dnsruby::Message.new(name, 'A')
end
def parse_response_for_address(response)
begin
a_answer = response.answer.detect { |a| a.type == 'A' }
a_answer ? a_answer.rdata.to_s : nil
rescue Dnsruby::NXDomain
return nil
end
end
def get_ip_entries(names)
queue = Queue.new
names.each do |name|
query_message = create_query_message(name)
RESOLVER.send_async(query_message, queue, name)
end
# Note: although map is used here, the record in the output array will not necessarily correspond
# to the record in the input array, since the order of the messages returned is not guaranteed.
# This is indicated by the lack of block variable specified (normally w/map you would use the element).
# That should not matter to us though.
names.map do
_id, result, error = queue.pop
name = _id
case error
when Dnsruby::NXDomain
IpEntry.new(name, nil)
when NilClass
ip = parse_response_for_address(result)
IpEntry.new(name, ip)
else
raise error
end
end
end
def main
# domains = File.readlines("domains.txt").map(&:chomp)
domains = %w(nytimes.com afp.com cnn.com bbc.com)
# subdomain_prefixes = File.readlines("subdomain_prefixes.txt").map(&:chomp)
subdomain_prefixes = %w(www xyz)
subdomains = assemble_subdomains(subdomain_prefixes, domains)
start_time = Time.now
ip_entries = subdomains.each_slice(RESOLVE_CHUNK_SIZE).each_with_object([]) do |ip_entries_chunk, results|
results.concat get_ip_entries(ip_entries_chunk)
end
duration = Time.now - start_time
found, not_found = ip_entries.partition { |entry| entry.ip }
puts "\nFound:\n\n"; puts found.map(&:to_s); puts "\n\n"
puts "Not Found:\n\n"; puts not_found.map(&:to_s); puts "\n\n"
stats = {
duration: duration,
domain_count: ip_entries.size,
found_count: found.size,
not_found_count: not_found.size,
}
ap stats
end
main
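The Struct and partition idioms used in the script can be tried in isolation, without dnsruby. In this sketch the entries are placeholders (the 192.0.2.x addresses are reserved for documentation), not real lookup results:

```ruby
# A small value object with a readable string form, as in the script above.
IpEntry = Struct.new(:name, :ip) do
  def to_s
    "#{name}: #{ip || '(nil)'}"
  end
end

# Placeholder entries standing in for resolver output.
entries = [
  IpEntry.new("www.example.com", "192.0.2.1"),
  IpEntry.new("xyz.example.com", nil),
  IpEntry.new("ftp.example.com", "192.0.2.2")
]

# partition splits a collection in two: elements for which the block is
# truthy, and the rest.
found, not_found = entries.partition { |entry| entry.ip }

puts found.map(&:to_s)
puts not_found.map(&:to_s)
```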

Structuring Nokogiri output without HTML tags

I got Ruby to travel to a web site, iterate through a list of campaigns and scrape the pages for specific data. The problem I have now is getting it from the structure Nokogiri gives me, and outputting it into a readable form.
campaign_list = Array.new
campaign_list.push(1042360, 1042386, 1042365, 992307)
browser = Watir::Browser.new :chrome
browser.goto '<redacted>'
browser.text_field(:id => 'email').set '<redacted>'
browser.text_field(:id => 'password').set '<redacted>'
browser.send_keys :enter
file = File.new('hourlysales.csv', 'w')
data = {}
campaign_list.each do |campaign|
browser.goto "<redacted>"
if browser.text.include? "Application Error"
puts "Error loading page, I recommend restarting script"
# Possibly automatic restart of script
else
hourly_data = Nokogiri::HTML.parse(browser.html).text
# file.write data
puts hourly_data
end
This is the output I get:
{"views":[[17,145],[18,165],[19,99],[20,71],[21,31],[22,26],[23,10],[0,15],[1,1], [2,18],[3,19],[4,35],[5,47],[6,44],[7,67],[8,179],[9,141],[10,112],[11,95],[12,46],[13,82],[14,79],[15,70],[16,103]],"orders":[[17,10],[18,9],[19,5],[20,1],[21,1],[22,0],[23,0],[0,1],[1,0],[2,1],[3,0],[4,1],[5,2],[6,1],[7,5],[8,11],[9,6],[10,5],[11,3],[12,1],[13,2],[14,4],[15,6],[16,7]],"conversion_rates":[0.06870229007633588,0.05442176870748299,0.050505050505050504,0.014084507042253521,0.03225806451612903,0.0,0.0,0.06666666666666667,0.0,0.05555555555555555,0.0,0.02857142857142857,0.0425531914893617,0.022727272727272728,0.07462686567164178,0.06134969325153374,0.0425531914893617,0.044642857142857144,0.031578947368421054,0.021739130434782608,0.024390243902439025,0.05063291139240506,0.08571428571428572,0.06741573033707865]}
The arrays stand for { views [[hour, # of views], [hour, # of views], etc. }. Same with orders. I don't need conversion rates.
I also need to add the values up for each key, so after doing this for 5 pages, I have one key for each hour of the day, and the total number of views for that hour. I tried a couple each loops, but couldn't make any progress.
I appreciate any help you guys can give me.
It looks like the output (which from your code I assume is the content of hourly_data) is JSON. In that case, it's easy to parse and add up the numbers. Something like this:
require "json" # at the top of your script
# ...
def sum_hours_values(data, hours_values=nil)
# Start with an empty hash that automatically initializes missing keys to `0`
hours_values ||= Hash.new {|hsh,hour| hsh[hour] = 0 }
# Iterate through the [hour, value] arrays, adding `value` to the running
# count for that `hour`, and return `hours_values`
data.each_with_object(hours_values) do |(hour, value), hsh|
hsh[hour] += value
end
end
# ... Watir/Nokogiri stuff here...
# Initialize these so they persist outside the loop
hours_views, hours_orders = nil
campaign_list.each do |campaign|
  browser.goto "<redacted>"
  if browser.text.include? "Application Error"
    # ...
  else
    # ...
    hourly_data_parsed = JSON.parse(hourly_data)
    # Pass the running totals back in so they accumulate across campaigns
    hours_views = sum_hours_values(hourly_data_parsed["views"], hours_views)
    hours_orders = sum_hours_values(hourly_data_parsed["orders"], hours_orders)
  end
end
puts "Views by hour:"
puts hours_views.sort.map {|hour_views| "%2i\t%4i" % hour_views }
puts "Orders by hour:"
puts hours_orders.sort.map {|hour_orders| "%2i\t%4i" % hour_orders }
P.S. There's a really nice recursive version of sum_hours_values I didn't include since the iterative version is clearer to most Ruby programmers. If you're into recursion I leave it as an exercise for you. ;)
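To see the accumulation work end to end without Watir, here is a sketch that feeds sum_hours_values two made-up payloads in the same shape as the output in the question (the pages array and its numbers are assumptions for illustration):

```ruby
require "json"

# Adds each [hour, value] pair to a running per-hour total.
def sum_hours_values(data, hours_values = nil)
  hours_values ||= Hash.new { |hsh, hour| hsh[hour] = 0 }
  data.each_with_object(hours_values) do |(hour, value), hsh|
    hsh[hour] += value
  end
end

# Two made-up page payloads standing in for the scraped campaign pages.
pages = [
  '{"views":[[17,145],[18,165]],"orders":[[17,10],[18,9]]}',
  '{"views":[[17,5],[19,99]],"orders":[[17,1],[19,5]]}'
]

hours_views = hours_orders = nil
pages.each do |body|
  parsed = JSON.parse(body)
  # Pass the previous totals back in so they accumulate across pages
  hours_views  = sum_hours_values(parsed["views"], hours_views)
  hours_orders = sum_hours_values(parsed["orders"], hours_orders)
end

puts "Views by hour:"
puts hours_views.sort.map { |pair| "%2i\t%4i" % pair }
```

Hour 17 appears on both pages, so its totals are summed, while hours 18 and 19 keep the single value they were given.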

Testing a lambda

I am creating an import feature that imports CSV files into several tables. I made a module called CsvParser which parses a CSV file and creates records. The models that receive the create actions extend CsvParser. They call CsvParser.create and pass the correct attribute order and an optional lambda called value_parser. This lambda transforms values in a hash to a preferred format.
class Mutation < ActiveRecord::Base
extend CsvParser
def self.import_csv(csv_file)
attribute_order = %w[reg_nr receipt_date reference_number book_date is_credit sum balance description]
value_parser = lambda do |h|
h["is_credit"] = ((h["is_credit"] == 'B') if h["is_credit"].present?)
h["sum"] = -1 * h["sum"].to_f unless h["is_credit"]
return [h]
end
CsvParser.create(csv_file, self, attribute_order, value_parser)
end
end
The reason that I'm using a lambda instead of checks inside the CsvParser.create method is because the lambda is like a business rule that belongs to this model.
My question is how I should test this lambda. Should I test it in the model or in the CsvParser? Should I test the lambda itself, or the array returned by the self.import_csv method? Or should I restructure the code?
My CsvParser looks as follows:
require "csv"
module CsvParser
def self.create(csv_file, klass, attribute_order, value_parser = nil)
parsed_csv = CSV.parse(csv_file, col_sep: "|")
records = []
ActiveRecord::Base.transaction do
parsed_csv.each do |row|
record = Hash.new {|h, k| h[k] = []}
row.each_with_index do |value, index|
record[attribute_order[index]] = value
end
if value_parser.blank?
records << klass.create(record)
else
value_parser.call(record).each do |parsed_record|
records << klass.create(parsed_record)
end
end
end
end
return records
end
end
I'm testing the module itself:
require 'spec_helper'
describe CsvParser do
it "should create relations" do
file = File.new(Rails.root.join('spec/fixtures/files/importrelaties.txt'))
Relation.should_receive(:create).at_least(:once)
Relation.import_csv(file).should be_kind_of Array
end
it "should create mutations" do
file = File.new(Rails.root.join('spec/fixtures/files/importmutaties.txt'))
Mutation.should_receive(:create).at_least(:once)
Mutation.import_csv(file).should be_kind_of Array
end
it "should create strategies" do
file = File.new(Rails.root.join('spec/fixtures/files/importplan.txt'))
Strategy.should_receive(:create).at_least(:once)
Strategy.import_csv(file).should be_kind_of Array
end
it "should create reservations" do
file = File.new(Rails.root.join('spec/fixtures/files/importreservering.txt'))
Reservation.should_receive(:create).at_least(:once)
Reservation.import_csv(file).should be_kind_of Array
end
end
Some interesting questions. A couple of notes:
You probably shouldn't have a return within the lambda. Just make the last statement [h].
If I understand the code correctly, the first and second lines of your lambda are overcomplicated. Reduce them to make them more readable and easier to refactor:
h["is_credit"] = (h['is_credit'] == 'B') # I *think* that will do the same
h['sum'] = h['sum'].to_f # Your original code would have left this a string
h['sum'] *= -1 unless h['is_credit']
It looks like your lambda doesn't depend on anything external (aside from h), so I would test it separately. You could even make it a constant:
class Mutation < ActiveRecord::Base
  extend CsvParser # <== See the last point below
  PARSE_CREDIT_AND_SUM = lambda do |h|
    h["is_credit"] = (h['is_credit'] == 'B')
    h['sum'] = h['sum'].to_f
    h['sum'] *= -1 unless h['is_credit']
    [h]
  end
  # ...
end
Without knowing the rationale, it's hard to say where you should put this code. My gut instinct is that it is not the job of the CSV parser (although a good parser may detect floating point numbers and convert them from strings?) Keep your CSV parser reusable. (Note: Re-reading, I think you've answered this question yourself - it is business logic, tied to the model. Go with your gut!)
Lastly, you are defining and calling CsvParser.create as a method on the module itself, so you don't need to extend CsvParser to get access to it. If you have other facilities in CsvParser, though, consider making CsvParser.create a normal module method called something like create_from_csv_file.
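As a rough sketch of testing the lambda on its own, assuming it has been extracted to a constant as suggested above: Mutation is a plain class here rather than an ActiveRecord model, and the input hashes are made-up rows, so this runs without Rails or RSpec.

```ruby
# Plain-Ruby stand-in for the model; only the constant matters for the test.
class Mutation
  PARSE_CREDIT_AND_SUM = lambda do |h|
    h["is_credit"] = (h["is_credit"] == "B")
    h["sum"] = h["sum"].to_f
    h["sum"] *= -1 unless h["is_credit"]
    [h]
  end
end

# Made-up rows: "B" marks a credit, anything else is treated as a debit.
credit = Mutation::PARSE_CREDIT_AND_SUM.call({ "is_credit" => "B", "sum" => "12.50" })
debit  = Mutation::PARSE_CREDIT_AND_SUM.call({ "is_credit" => "A", "sum" => "12.50" })

puts credit.first["sum"]
puts debit.first["sum"]
```

In a real spec the same calls would sit inside examples for the Mutation model, leaving the CsvParser specs to cover only parsing and record creation.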

Resources