Ruby Mechanize Stops Working while in Each Do Loop - ruby

I am using a Mechanize Ruby script to loop through about 1,000 records in a tab-delimited file. Everything works as expected until I reach about 300 records.
Once I get to about 300 records, my script hits the rescue clause on every attempt and eventually stops working. I thought it was because I had not set max_history properly, but that doesn't seem to make a difference.
Here is the error message that I start getting:
getaddrinfo: nodename nor servname provided, or not known
Any ideas on what I might be doing wrong here?
require 'mechanize'
result_counter = 0
used_file = File.open(ARGV[0])
total_rows = used_file.readlines.size
mechanize = Mechanize.new { |agent|
  agent.open_timeout = 10
  agent.read_timeout = 10
  agent.max_history = 0
}
File.open(ARGV[0]).each do |line|
  item = line.split("\t").map {|item| item.strip}
  website = item[16]
  name = item[11]
  if website
    begin
      tries ||= 3
      page = mechanize.get(website)
      primary1 = page.link_with(text: 'text')
      secondary1 = page.link_with(text: 'other_text')
      contains_primary = true
      contains_secondary = true
      unless contains_primary || contains_secondary
        1.times do |count|
          result_counter+=1
          STDERR.puts "Generate (#{result_counter}/#{total_rows}) #{name} - No"
        end
      end
      for i in [primary1]
        if i
          page_to_visit = i.click
          page_found = page_to_visit.uri
          1.times do |count|
            result_counter+=1
            STDERR.puts "Generate (#{result_counter}/#{total_rows}) #{name}"
          end
          break
        end
      end
    rescue Timeout::Error
      STDERR.puts "Generate (#{result_counter}/#{total_rows}) #{name} - Timeout"
    rescue => e
      STDERR.puts e.message
      STDERR.puts "Generate (#{result_counter}/#{total_rows}) #{name} - Rescue"
    end
  end
end

You are most likely getting this error because the connections are not closed after they are used.
Disabling keep-alive should fix your problem:
mechanize = Mechanize.new { |agent|
  agent.open_timeout = 10
  agent.read_timeout = 10
  agent.max_history = 0
  agent.keep_alive = false
}
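The question's loop sets `tries ||= 3` but never uses it. Independently of the keep-alive setting, a bounded retry around each request could look like the sketch below. `with_retries` is a hypothetical helper, not part of Mechanize, and the failing block only simulates a lookup error instead of making a real request.

```ruby
require 'socket' # defines SocketError, used here only to simulate the failure

# Hypothetical helper (not a Mechanize API): run the block and retry on any
# StandardError until `attempts` tries are used up, then re-raise the error.
def with_retries(attempts: 3)
  tries = attempts
  begin
    yield
  rescue StandardError
    tries -= 1
    retry if tries > 0
    raise
  end
end

# Simulated request: fails twice, then succeeds on the third call.
calls = 0
result = with_retries(attempts: 3) do
  calls += 1
  raise SocketError, 'getaddrinfo: simulated failure' if calls < 3
  "page fetched on attempt #{calls}"
end
puts result
```

In the original script, the block would wrap the `mechanize.get(website)` call, so a single transient failure would not count the record as lost.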

Related

Add multithreading/concurrency to a script

I created a script that checks the healthcheck and port status of the microservices listed in a .json file.
For every microservice in the .json file, the script outputs the HTTP status, the healthcheck body and a few other details. I want to add multithreading so that all the output is returned at once. Please see the script below:
#!/usr/bin/env ruby
... get the environment argument part...
file = File.read('./services.json')
data_hash = JSON.parse(file)
threads = []
service = data_hash.keys
service.each do |microservice|
threads << Thread.new do
begin
puts "Microservice: #{microservice}"
port = data_hash["#{microservice}"]['port']
puts "Port: #{port}"
nodes = "knife search 'chef_environment:#{env} AND recipe:#{microservice}' -i"
node = %x[ #{nodes} ].split
node.each do |n|
puts "Node: #{n}"
uri = URI("http://#{n}:#{port}/healthcheck?count=10")
res = Net::HTTP.get_response(uri)
status = Net::HTTP.get(uri)
puts res.code
puts status
puts res.message
end
rescue Net::ReadTimeout
puts "ReadTimeout Error"
next
end
end
end
threads.each do |thread|
thread.join
end
With this approach, however, the script first prints the puts "Microservice: #{microservice}" and puts "Port: #{port}" lines, then the nodes, and only after that the status.
How can I print all the data for each loop iteration together?
Instead of calling puts, write the output to a variable (a hash).
If you want to wait for all threads to finish their job before showing the output, use the ThreadsWait class.
require 'json'
require 'net/http'
require 'thwait'
file = File.read('./services.json')
data_hash = JSON.parse(file)
# one output buffer per thread, keyed by thread id
h = {}
threads = []
service = data_hash.keys
service.each do |microservice|
  threads << Thread.new do
    thread_id = Thread.current.object_id.to_s(36)
    begin
      h[thread_id] = "Microservice: #{microservice}"
      port = data_hash["#{microservice}"]['port']
      h[thread_id] << "Port: #{port}"
      # env comes from the argument-parsing part elided in the question
      nodes = "knife search 'chef_environment:#{env} AND recipe:#{microservice}' -i"
      node = %x[ #{nodes} ].split
      node.each do |n|
        h[thread_id]<< "Node: #{n}"
        uri = URI("http://#{n}:#{port}/healthcheck?count=10")
        res = Net::HTTP.get_response(uri)
        status = Net::HTTP.get(uri)
        h[thread_id] << res.code
        h[thread_id] << status
        h[thread_id] << res.message
      end
    rescue Net::ReadTimeout
      h[thread_id] << "ReadTimeout Error"
      next
    end
  end
end
threads.each do |thread|
  thread.join
end
# wait until all threads finish their job
ThreadsWait.all_waits(*threads)
p h
[edit]
ThreadsWait.all_waits(*threads) is redundant in the code above and can be omitted, since the threads.each do |thread| thread.join end loop does exactly the same thing.
Instead of outputting the data as you get it using puts, you can collect it all in a string and then puts it once at the end. Strings can take the << operator (implemented as a method in Ruby), so you can just initialize the string, add to it, and then output it at the end, like this:
report = ''
report << 'first thing'
report << 'second thing'
puts report
You could even save them all up together and print them all after all were finished if you want.
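As a minimal sketch of that idea, each thread below builds its own report string and hands it back through `Thread#value`, so nothing is printed until every thread has finished. The `services` hash is a hypothetical stand-in for the parsed services.json, and the knife and HTTP calls from the question are left out.

```ruby
# Hypothetical data standing in for the parsed services.json.
services = {
  'alpha' => { 'port' => 8080 },
  'beta'  => { 'port' => 9090 }
}

# Start one thread per service; each thread builds and returns its own report.
def collect_reports(services)
  threads = services.map do |name, config|
    Thread.new do
      report = +''
      report << "Microservice: #{name}\n"
      report << "Port: #{config['port']}\n"
      report # the block's last expression becomes Thread#value
    end
  end
  # Thread#value joins the thread and returns the block's result.
  threads.map(&:value)
end

# Print each service's lines together, in the original order.
collect_reports(services).each { |report| puts report }
```

Because every thread writes only to its own string, the lines of one service can never interleave with those of another.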

Put contents of array all at once

I don't understand why this won't do what the title states.
#!/usr/bin/env ruby
require 'socket'
require 'timeout'
class Scanner
  def initialize(host, port)
    @host = host
    @port = port
  end
  def popen
    begin
      array = []
      sock = Socket.new(:INET, :STREAM)
      sockaddr = Socket.sockaddr_in(@port, @host)
      Timeout::timeout(5) do
        array.push("Port #{@port}: Open") if sock.connect(sockaddr)
      end
      puts array
    rescue Timeout::Error
      puts "Port #{@port}: Filtered"
    rescue Errno::ECONNREFUSED
    end
  end
end # end Scanner
def main
begin
p = 1
case ARGV[0]
when '-p'
eport = ARGV[1]
host = ARGV[2]
else
eport = 65535
host = ARGV[0]
end
t1 = Time.now
puts "\n"
puts "-" * 70
puts "Scanning #{host}..."
puts "-" * 70
while p <= eport.to_i do
scan = Scanner.new(host, p)
scan.popen
p += 1
end
t2 = Time.now
time = t2 - t1
puts "\nScan completed: #{host} scanned in #{time} seconds."
rescue Errno::EHOSTUNREACH
puts "This host appears to be unreachable"
rescue Interrupt
puts "Connection terminated."
end
end
main
What I'm trying to achieve is an output similar to nmap, in the way that it scans everything, and then shows all open or closed ports at the end. Instead what happens is that it prints them out as it discovers them. I figured pushing the output into an array then printing the array would achieve such an output, yet it still prints out the ports one at a time. Why is this happening?
Also, I apologize for the formatting, the code tags are a little weird.
Your loop calls popen once per iteration. Your popen method sets array = [] each time it is called, then populates it with one item, then you print it with puts. On the next loop iteration, you reset array to [] and do it all again.
You only asked "why," but – you could solve this by setting array just once in the body of main and then passing it to popen (or any number of ways).
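A minimal sketch of that fix: the array is created once by the caller and passed into `popen`, which only appends to it. To keep the example runnable without a network, the socket check is replaced by a hypothetical `open_ports` list, so this `Scanner` is a simplified stand-in for the one in the question.

```ruby
# Simplified stand-in for the question's Scanner: `open_ports` is a
# hypothetical list used here in place of a real socket connection.
class Scanner
  def initialize(host, port, open_ports)
    @host = host
    @port = port
    @open_ports = open_ports
  end

  # Append to the caller's array instead of creating a new one per call.
  def popen(results)
    results << "Port #{@port}: Open" if @open_ports.include?(@port)
  end
end

results = [] # created once, outside the loop
(20..25).each do |port|
  Scanner.new('example.test', port, [22, 25]).popen(results)
end

# Printed once, after the whole scan has finished.
puts results
```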

Ruby output is not displayed on the sinatra browser

I want to build a multi-threaded application. If I do not use threads, everything works fine. When I try to use threads, nothing is displayed in the browser. When I use the syntax 'puts "%s" %io.read', the output is displayed on the command prompt and not in the browser. Any help would be appreciated.
require 'sinatra'
require 'thread'
set :environment, :production
get '/price/:upc/:rtype' do
Webupc = "#{params[:upc]}"
Webformat = "#{params[:rtype]}"
MThread = Thread.new do
puts "inside thread"
puts "a = %s" %Webupc
puts "b = %s" %Webformat
#call the price
Maxupclen = 16
padstr = ""
padupc = ""
padlen = (Maxupclen - Webupc.length)
puts "format type: #{params[:rtype]}"
puts "UPC: #{params[:upc]}"
puts "padlen: %s" %padlen
if (Webformat == 'F')
puts "inside format"
if (padlen == 0 ) then
IO.popen("tstprcpd.exe #{Webupc}")
{ |io|
"%s" %io.read
}
elsif (padlen > 0 ) then
for i in 1 .. padlen
padstr = padstr + "0"
end
padupc = padstr + Webupc
puts "padupc %s" %padupc
IO.popen("tstprcpd.exe #{padupc}") { |io|
"%s" %io.read
}
elsif (padlen < 0 ) then
IO.popen("date /T") { |io|
"UPC length must be 16 digits or less." %io.read
}
end
end
end
end
Your code has several problems:
It is not formatted properly
You are using Uppercase names for variables; that makes them constants!
puts will not output to the browser, but to the console. The browser will receive the return value of the block, i.e. the value of the last statement in the block. Therefore, you need to build your output differently (see below).
You are never joining the thread
Here's a minimal sinatra app that uses a thread. However, the thread makes no sense in this case because you must wait for its termination anyway before you can output the result to the browser. In order to build the output I have used StringIO, which you can use with puts to build a multiline string conveniently. However, you could also simply initialize res with an empty string with res = "" and then append your lines to this string with res << "new line\n".
require 'sinatra'
require 'thread'
require 'stringio'
get '/' do
  res = StringIO.new
  th = Thread.new do
    res.puts 'Hello, world!'
  end
  th.join
  res.string
end
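The plain-string alternative mentioned above can be tried without Sinatra. In the sketch below, `build_response` is a hypothetical method standing in for the body of a route: it uses lowercase local variables, appends to a string inside a thread, joins the thread, and returns the string as its last expression. The zero-padding with `rjust` is an assumption about what the question's padding loop is meant to do.

```ruby
# Hypothetical stand-in for the body of a Sinatra route.
def build_response(upc)
  res = +'' # lowercase name: a local variable, not a constant
  worker = Thread.new do
    res << "UPC: #{upc}\n"
    # Left-pad with zeros to 16 characters, like the question's padding loop.
    res << "padded: #{upc.rjust(16, '0')}\n"
  end
  worker.join # wait for the thread before returning
  res         # the last expression is what a route would send to the browser
end

puts build_response('12345')
```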

Using ruby to retrieve a document from a website

I have written a Ruby script that navigates through a website and gets to a form page. Once the form is filled out, the script hits the submit button, and then a dialog box opens asking where to save the file. I am having trouble getting this file. I have searched the web and can't find anything. How would I go about retrieving the file name of the document?
I would really appreciate it if someone could help me.
My code is below:
browser = Mechanize.new
## CONSTANTS
LOGIN_URL = 'https://business.airtricity.com/ews/welcome.jsp'
HOME_PAGE_URL = 'https://business.airtricity.com/ews/welcome.jsp'
CONSUMPTION_REPORT_URL = 'https://business.airtricity.com/ews/touConsChart.jsp?custid=209495'
LOGIN = ""
PASS = ""
MPRN_GPRN_LCIS = "10000001534"
CONSUMPTION_DATE = "20/01/2013"
END_DATE = "27/01/2013"
DOWNLOAD = "DL"
### Login page
begin
login_page = browser.get(LOGIN_URL)
rescue Mechanize::ResponseCodeError => exception
login_page = exception.page
end
puts "+++++++++"
puts login_page.links
puts "+++++++++"
login_form = login_page.forms.first
login_form['userid'] = LOGIN
login_form['password'] = PASS
login_form['_login_form_'] = "yes"
login_form['ipAddress'] = "137.43.154.176"
login_form.submit
## home page
begin
home_page = browser.get(HOME_PAGE_URL)
rescue Mechanize::ResponseCodeError => exception
home_page = exception.page
end
puts "----------"
puts home_page.links
puts "----------"
# Consumption Report
begin
Report_Page = browser.get(CONSUMPTION_REPORT_URL)
rescue Mechanize::ResponseCodeError => exception
Report_Page = exception.page
end
puts "**********"
puts Report_Page.links
pp Report_Page
puts "**********"
Report_Form = Report_Page.forms.first
Report_Form['entity1'] = MPRN_GPRN_LCIS
Report_Form['start'] = CONSUMPTION_DATE
Report_Form['end'] = END_DATE
Report_Form['charttype'] = DOWNLOAD
Report_Form.submit
## Download Report
begin
browser.pluggable_parser.csv = Mechanize::Download
Download_Page = browser.get('https://business.airtricity.com/ews/touConsChart.jsp?custid=209495/meter_read_download_2013-1-20_2013-1-27.csv').save('Hello')
rescue Mechanize::ResponseCodeError => exception
Download_Page = exception.page
end
http://mechanize.rubyforge.org/Mechanize.html#method-i-get_file
Downloading a file from a URL is pretty straightforward with Mechanize:
browser = Mechanize.new
file_url = 'https://raw.github.com/ragsagar/ragsagar.github.com/c5caa502f8dec9d5e3738feb83d86e9f7561bd5e/.html'
downloaded_file = browser.get_file file_url
File.open('new_file.txt', 'w') { |file| file.write downloaded_file }
I've seen automation fail because of the browser's user agent. Perhaps you could try
browser.user_agent_alias = "Windows Mozilla"
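On the file name itself: when a server offers a download, the suggested name usually arrives in a Content-Disposition response header. The sketch below parses such a header value in plain Ruby. `filename_from_disposition` is a hypothetical helper, and both the assumption that this site sends the header and the sample header value are illustrative only.

```ruby
# Hypothetical helper: extract the filename from a Content-Disposition
# header value, with or without surrounding double quotes.
def filename_from_disposition(header)
  return nil unless header
  match = header.match(/filename="?([^";]+)"?/i)
  match && match[1]
end

# Illustrative header value, assuming the server names the CSV this way.
header = 'attachment; filename="meter_read_download_2013-1-20_2013-1-27.csv"'
puts filename_from_disposition(header)
```

With a response object at hand, the header value would be read from the response and passed to this helper. How the header is exposed depends on the HTTP library in use.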

mechanize html scraping problem

I am trying to extract the emails from my website using Ruby Mechanize and Hpricot.
What I am trying to do is loop over all the pages of my administration side and parse them with Hpricot. So far so good. Then I get:
Exception `Net::HTTPBadResponse' at /usr/lib/ruby/1.8/net/http.rb:2022 - wrong status line: *SOME HTML CODE HERE*
After it has parsed a bunch of pages, it starts with a timeout and then prints the HTML code of the page.
I can't understand why. How can I debug that?
It seems like Mechanize can't get more than 10 pages in a row. Is that possible?
Thanks
require 'logger'
require 'rubygems'
require 'mechanize'
require 'hpricot'
require 'open-uri'
class Harvester
  def initialize(page)
    @page=page
    @agent = WWW::Mechanize.new{|a| a.log = Logger.new("logs.log") }
    @agent.keep_alive=false
    @agent.read_timeout=15
  end
  def login
    f = @agent.get( "http://****.com/admin/index.asp") .forms.first
    f.set_fields(:username => "user", :password =>"pass")
    f.submit
  end
  def harvest(s)
    pageNumber=1
    #@agent.read_timeout =
    s.upto(@page) do |pagenb|
      puts "*************************** page= #{pagenb}/#{@page}***************************************"
      begin
        @time=Time.now
        @search=@agent.get( "http://****.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}")
        extract(pagenb)
      rescue => e
        puts "unknown #{e.to_s}"
        #puts "url:http://****.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}"
        #sleep(2)
        extract(pagenb)
      rescue Net::HTTPBadResponse => e
        puts "net exception"+ e.to_s
      rescue WWW::Mechanize::ResponseCodeError => ex
        puts "mechanize error: "+ex.response_code
      rescue Timeout::Error => e
        puts "timeout: "+e.to_s
      end
    end
  end
  def extract(page)
    #puts search.body
    search=@agent.get( "http://***.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}")
    doc = Hpricot(search.body)
    #remove titles
    #~ doc.search("/html/body/div/table[2]/tr/td[2]/table[3]/tr[1]").remove
    (doc/"/html/body/div/table[2]/tr/td[2]/table[3]//tr").each do |tr|
      #delete the phone number from the html
      temp = tr.search("/td[2]").inner_html
      index = temp.index('<')
      email = temp[0..index-1]
      puts email
      f=File.open("./emails", 'a')
      f.puts(email)
      f.close
    end
  end
end
puts "starting extacting emails ... "
start =ARGV[0].to_i
h=Harvester.new(186)
h.login
h.harvest(start)
Mechanize keeps the full content of each page in its history, which may cause problems when browsing through many pages. To limit the size of the history, try
@mech = WWW::Mechanize.new do |agent|
  agent.history.max_size = 1
end
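Separately from the history limit, one thing worth checking when debugging the harvest method is the order of its rescue clauses. Ruby tries them from top to bottom, and a bare `rescue => e` matches any StandardError, so a clause for Net::HTTPBadResponse placed after it is never reached. The sketch below shows the effect in plain Ruby. `classify_shadowed` and `classify` are hypothetical names, and the error is raised by hand instead of by a real request.

```ruby
require 'net/http' # defines Net::HTTPBadResponse, a StandardError subclass

# Generic clause first: it catches everything, so the specific one is dead code.
def classify_shadowed(error_class)
  raise error_class, 'simulated failure'
rescue => e
  :unknown
rescue Net::HTTPBadResponse
  :bad_response
end

# Specific clause first: each error reaches the handler written for it.
def classify(error_class)
  raise error_class, 'simulated failure'
rescue Net::HTTPBadResponse
  :bad_response
rescue StandardError
  :unknown
end

p classify_shadowed(Net::HTTPBadResponse)
p classify(Net::HTTPBadResponse)
```

Listing the specific exceptions first and the generic clause last makes the log messages say which kind of failure actually occurred.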
