I have written a script in Ruby that navigates through a website and gets to a form page. Once the form is filled out, the script hits the submit button and a dialog box opens asking where to save the file. I am having trouble getting hold of this file. I have searched the web and can't find anything. How would I go about retrieving the file name of the document?
I would really appreciate it if someone could help me.
My code is below:
require 'mechanize'

browser = Mechanize.new
## CONSTANTS
LOGIN_URL = 'https://business.airtricity.com/ews/welcome.jsp'
HOME_PAGE_URL = 'https://business.airtricity.com/ews/welcome.jsp'
CONSUMPTION_REPORT_URL = 'https://business.airtricity.com/ews/touConsChart.jsp?custid=209495'
LOGIN = ""
PASS = ""
MPRN_GPRN_LCIS = "10000001534"
CONSUMPTION_DATE = "20/01/2013"
END_DATE = "27/01/2013"
DOWNLOAD = "DL"
### Login page
begin
  login_page = browser.get(LOGIN_URL)
rescue Mechanize::ResponseCodeError => exception
  login_page = exception.page
end
puts "+++++++++"
puts login_page.links
puts "+++++++++"
login_form = login_page.forms.first
login_form['userid'] = LOGIN
login_form['password'] = PASS
login_form['_login_form_'] = "yes"
login_form['ipAddress'] = "137.43.154.176"
login_form.submit
## home page
begin
  home_page = browser.get(HOME_PAGE_URL)
rescue Mechanize::ResponseCodeError => exception
  home_page = exception.page
end
puts "----------"
puts home_page.links
puts "----------"
# Consumption Report
begin
  report_page = browser.get(CONSUMPTION_REPORT_URL)
rescue Mechanize::ResponseCodeError => exception
  report_page = exception.page
end
puts "**********"
puts report_page.links
pp report_page
puts "**********"
report_form = report_page.forms.first
report_form['entity1'] = MPRN_GPRN_LCIS
report_form['start'] = CONSUMPTION_DATE
report_form['end'] = END_DATE
report_form['charttype'] = DOWNLOAD
report_form.submit
## Download Report
begin
  browser.pluggable_parser.csv = Mechanize::Download
  download_page = browser.get('https://business.airtricity.com/ews/touConsChart.jsp?custid=209495/meter_read_download_2013-1-20_2013-1-27.csv').save('Hello')
rescue Mechanize::ResponseCodeError => exception
  download_page = exception.page
end
http://mechanize.rubyforge.org/Mechanize.html#method-i-get_file
Downloading a file from a URL is pretty straightforward with Mechanize:
require 'mechanize'

browser = Mechanize.new
file_url = 'https://raw.github.com/ragsagar/ragsagar.github.com/c5caa502f8dec9d5e3738feb83d86e9f7561bd5e/.html'
downloaded_file = browser.get_file file_url
File.open('new_file.txt', 'w') { |file| file.write downloaded_file }
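As far as I know, get_file simply GETs the URL and returns the response body as a string, which is why it can be written straight to a file. If the file is binary (a PDF or an image, say), write it in binary mode to avoid newline mangling on Windows:
File.open('new_file.pdf', 'wb') { |file| file.write downloaded_file }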
I've seen automation fail because of the browser user agent. Perhaps you could try:
browser.user_agent_alias = "Windows Mozilla"
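If the goal is to get hold of the file Mechanize downloads after the form submit, one option (a sketch, not from the original post) is to capture the return value of the submit itself instead of guessing the CSV URL. This assumes the server answers with a CSV that Mechanize maps to Mechanize::Download once the pluggable parser is set; the fallback file name is illustrative:
require 'mechanize'

browser = Mechanize.new
# Treat CSV responses as downloads instead of parsed pages
browser.pluggable_parser.csv = Mechanize::Download

# report_form is the consumption report form filled in above;
# submitting it returns the download object directly
download = report_form.submit

puts download.filename                           # name suggested by the server
download.save(download.filename || 'report.csv') # fallback name is illustrative
Mechanize::Download#filename is taken from the Content-Disposition header (or derived from the URI), so this also covers the "how do I get the file name" part of the question.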
I can read and scan the files fine in Google Drive, but I can't seem to access the web_content_link, no matter what I do. My auth permissions look fine. I'm at a complete loss.
I abstracted some of the Google Drive API logic into google_setup.rb.
Then, in my rake file, all I want to do is access the download file. I can scan through the files just fine, but even though it looks to me like I should have permission to access the download link, I always get a 403 Forbidden error.
If you can be of any help, please let me know!
google_setup.rb
require 'google/apis/drive_v3'
require 'googleauth'
require 'googleauth/stores/file_token_store'
require 'fileutils'
require 'open-uri'
module GoogleSetup
  OOB_URI = "urn:ietf:wg:oauth:2.0:oob".freeze

  def self.authorize
    client_id = Google::Auth::ClientId.from_file Rails.root.join('lib', 'assets', 'credentials.json').freeze
    token_store = Google::Auth::Stores::FileTokenStore.new file: 'token.yaml'
    authorizer = Google::Auth::UserAuthorizer.new client_id, Google::Apis::DriveV3::AUTH_DRIVE, token_store
    user_id = 'default'
    credentials = authorizer.get_credentials user_id

    if credentials.nil?
      url = authorizer.get_authorization_url base_url: 'https://seeburg.herokuapp.com/'
      puts 'Open the following URL in the browser and enter the ' \
           "resulting code after authorization:\n" + url
      code = ENV["GOOGLE_CODE"]
      credentials = authorizer.get_and_store_credentials_from_code(
        user_id: user_id, code: code, base_url: OOB_URI
      )
    end

    drive_service = Google::Apis::DriveV3::DriveService.new
    drive_service.client_options.application_name = 'Seeburg Google Drive Integration'
    drive_service.authorization = credentials
    drive_service
  end
  def self.get_files(query)
    drive_service = GoogleSetup.authorize
    files = []
    page_token = ''

    while page_token
      begin
        response = drive_service.list_files(page_size: 100, q: query, page_token: page_token, fields: 'nextPageToken, files')
        page_token = response.next_page_token || false
        if response.files.empty?
          puts 'No files found'
        else
          files = response.files
        end
        sleep 0.5
      rescue StandardError => e
        puts e
      end
    end

    files
  end
end
test_google.rake
require 'google_setup'
require 'fileutils'
require 'open-uri'
desc 'Test Google Drive API'
task test_google: :environment do
  folder_id = "1j1Ly_NveiCtfrolzSxmrbHS1DenPZagV"
  query = "name contains 'MP3' and '#{folder_id}' in parents"

  GoogleSetup.get_files(query).each do |file|
    # If I can get this section to work, everything else I need will work
    begin
      puts download = URI.open(file.web_content_link)
    rescue StandardError => e
      puts e
    end
  end
end
It ended up being a timeout issue. Google only lets you take certain actions within a certain window after authenticating.
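As an aside, not part of the fix above: a hedged sketch of pulling the bytes through the Drive API itself, reusing the GoogleSetup module and the query from the rake task, instead of opening web_content_link with open-uri. get_file with download_dest streams the content using the already-authorized service:
require 'stringio'

drive_service = GoogleSetup.authorize
folder_id = "1j1Ly_NveiCtfrolzSxmrbHS1DenPZagV"
query = "name contains 'MP3' and '#{folder_id}' in parents"

GoogleSetup.get_files(query).each do |file|
  io = StringIO.new
  # Download the file content with the same credentials used for listing
  drive_service.get_file(file.id, download_dest: io)
  File.binwrite(file.name, io.string)
end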
I am doing some web scraping with the Kimurai Ruby gem. I have this script that works great:
require 'kimurai'
class SimpleSpider < Kimurai::Base
  @name = "simple_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://apply.workable.com/taxjar/"]

  def parse(response, url:, data: {})
    # Update response to current response after interaction with a browser
    count = 0
    # browser.click_button "Show more"
    doc = browser.current_response
    returned_jobs = doc.css('.careers-jobs-list-styles__jobsList--3_v12')
    returned_jobs.css('li').each do |char_element|
      # puts char_element
      title = char_element.css('a')[0]['aria-label']
      link = "https://apply.workable.com" + char_element.css('a')[0]['href']
      # click on the job link and get the description
      browser.visit(link)
      job_page = browser.current_response
      description = job_page.xpath('/html/body/div[1]/div/div[1]/div[2]/div[2]/div[2]').text
      puts '*******'
      puts title
      puts link
      puts description
      puts count += 1
    end
    puts "There are #{count} jobs total"
  end
end

SimpleSpider.crawl!
However, I want all of this to return an array of objects, or jobs in this case. I'd like to create a jobs array in the parse method, do something like jobs << [title, link, description, company] inside the returned_jobs loop, and have that returned when I call SimpleSpider.crawl!, but that doesn't work.
Any help appreciated.
You can slightly modify your code like this:
class SimpleSpider < Kimurai::Base
  @name = "simple_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://apply.workable.com/taxjar/"]

  def parse(response, url:, data: {})
    # Update response to current response after interaction with a browser
    count = 0
    # browser.click_button "Show more"
    doc = browser.current_response
    returned_jobs = doc.css('.careers-jobs-list-styles__jobsList--3_v12')
    jobs = []
    returned_jobs.css('li').each do |char_element|
      # puts char_element
      title = char_element.css('a')[0]['aria-label']
      link = "https://apply.workable.com" + char_element.css('a')[0]['href']
      # click on the job link and get the description
      browser.visit(link)
      job_page = browser.current_response
      description = job_page.xpath('/html/body/div[1]/div/div[1]/div[2]/div[2]/div[2]').text
      jobs << [title, link, description]
    end
    puts "There are #{jobs.count} jobs total"
    puts jobs
  end
end
I am not sure about company as I don't see that variable in your code, but you can see the idea above: collect the results in an array and work with that. I also have a blog post about how to use the Kimurai framework from a Ruby on Rails application.
It turns out there is a parse! method that allows a value to be returned. Here is a working example:
require 'open-uri'
require 'nokogiri'
require 'kimurai'
class TaxJar < Kimurai::Base
  @name = "tax_jar"
  @engine = :selenium_chrome
  @start_urls = ["https://apply.workable.com/taxjar/"]

  def parse(response, url:, data: {})
    jobs = Array.new
    doc = browser.current_response
    returned_jobs = doc.css('.careers-jobs-list-styles__jobsList--3_v12')
    returned_jobs.css('li').each do |char_element|
      title = char_element.css('a')[0]['aria-label']
      link = "https://apply.workable.com" + char_element.css('a')[0]['href']
      # click on the job link and get the description
      browser.visit(link)
      job_page = browser.current_response
      description = job_page.xpath('/html/body/div[1]/div/div[1]/div[2]/div[2]/div[2]').text
      company = 'TaxJar'
      puts "title is: #{title}, link is: #{link}, \n description is: #{description}"
      jobs << [title, link, description, company]
    end
    return jobs
  end
end
jobs = TaxJar.parse!(:parse, url: "https://apply.workable.com/taxjar/")
puts jobs.inspect
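Since the question asked for an array of objects, a small optional tweak (just a sketch) is to push hashes instead of arrays inside the loop, so each job can be read by key:
jobs << { title: title, link: link, description: description, company: company }
# later: jobs.first[:title], jobs.first[:link], etc.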
If you are scraping JS websites, this gem seems pretty robust compared with others (Watir/Selenium) I have tried.
I am using a Mechanize Ruby script to loop through about 1,000 records in a tab-delimited file. Everything works as expected until I reach about 300 records.
Once I get to about 300 records, my script keeps calling rescue on every attempt and eventually stops working. I thought it was because I had not properly set max_history, but that doesn't seem to be making a difference.
Here is the error message that I start getting:
getaddrinfo: nodename nor servname provided, or not known
Any ideas on what I might be doing wrong here?
require 'mechanize'
result_counter = 0
used_file = File.open(ARGV[0])
total_rows = used_file.readlines.size
mechanize = Mechanize.new { |agent|
  agent.open_timeout = 10
  agent.read_timeout = 10
  agent.max_history = 0
}
File.open(ARGV[0]).each do |line|
  item = line.split("\t").map { |item| item.strip }
  website = item[16]
  name = item[11]

  if website
    begin
      tries ||= 3
      page = mechanize.get(website)
      primary1 = page.link_with(text: 'text')
      secondary1 = page.link_with(text: 'other_text')
      contains_primary = true
      contains_secondary = true

      unless contains_primary || contains_secondary
        1.times do |count|
          result_counter += 1
          STDERR.puts "Generate (#{result_counter}/#{total_rows}) #{name} - No"
        end
      end

      for i in [primary1]
        if i
          page_to_visit = i.click
          page_found = page_to_visit.uri
          1.times do |count|
            result_counter += 1
            STDERR.puts "Generate (#{result_counter}/#{total_rows}) #{name}"
          end
          break
        end
      end
    rescue Timeout::Error
      STDERR.puts "Generate (#{result_counter}/#{total_rows}) #{name} - Timeout"
    rescue => e
      STDERR.puts e.message
      STDERR.puts "Generate (#{result_counter}/#{total_rows}) #{name} - Rescue"
    end
  end
end
You get this error because you don't close the connection after you use it.
This should fix your problem:
mechanize = Mechanize.new { |agent|
  agent.open_timeout = 10
  agent.read_timeout = 10
  agent.max_history = 0
  agent.keep_alive = false
}
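If you would rather keep persistent connections on, another option (just a sketch, not from the original answer) is to recreate the agent every so often so idle sockets get released; the 100-row interval here is arbitrary:
require 'mechanize'

build_agent = lambda do
  Mechanize.new { |agent|
    agent.open_timeout = 10
    agent.read_timeout = 10
    agent.max_history = 0
  }
end

mechanize = build_agent.call
rows_processed = 0

File.open(ARGV[0]).each do |line|
  # ... same per-row processing as in the question ...
  rows_processed += 1
  # Recycle the agent every 100 rows so stale connections are dropped
  mechanize = build_agent.call if (rows_processed % 100).zero?
end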
I'm trying to retrieve the Google Checkout report (Download data to spreadsheet (.csv)). Unfortunately I can't use the API (it's reserved for UK and US accounts only!).
I have a script made with Mechanize and Ruby, but I get an error: "Net::HTTPBadRequest 1.1 400 Bad Request".
Here is my code:
require 'rubygems'
require 'mechanize'
require 'logger'
agent = Mechanize.new { |a| a.log = Logger.new(STDERR) }
agent.user_agent_alias = 'Mac Safari'
page = agent.get 'https://checkout.google.com/sell/orders'
form = page.forms.first
form.Email = 'email@gmail.com'
form.Passwd = 'password'
page = agent.submit(form, form.buttons.first)
form = page.forms.last
p form
form['start-date'] = "2012-11-16"
form['end-date'] = "2012-11-17"
form['column-style'] = "EXPANDED"
#form['_type'] = "order-list-request"
#form['date-time-zone'] = "America/Los_Angeles"
#form['financial-state'] = ""
#form['query-type'] = ""
p form
begin
  page = agent.submit(form, form.buttons.first)
rescue Mechanize::ResponseCodeError => ex
  puts ex.page.body
end
Thanks to pguardiario and Charles proxy, I found my error... There was a superfluous field!
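In case it helps anyone hitting the same 400: a small sketch of how to spot and drop a stray field before submitting. The field name '_type' below is only an example; print form.fields to find the real culprit in your own form:
# List every field Mechanize is about to post
form.fields.each { |field| puts "#{field.name} => #{field.value.inspect}" }

# Remove the offending field (illustrative name) and resubmit
form.fields.delete_if { |field| field.name == '_type' }
page = agent.submit(form, form.buttons.first)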
So I am trying to extract the emails from my website using Ruby Mechanize and Hpricot.
What I am trying to do is loop over all the pages of my administration side and parse them with Hpricot. So far so good. Then I get:
Exception `Net::HTTPBadResponse' at /usr/lib/ruby/1.8/net/http.rb:2022 - wrong status line: *SOME HTML CODE HERE*
When it has parsed a bunch of pages, it starts with a timeout and then prints the HTML code of the page.
I can't understand why. How can I debug this?
It seems like Mechanize can't get more than 10 pages in a row, is that possible?
Thanks
require 'logger'
require 'rubygems'
require 'mechanize'
require 'hpricot'
require 'open-uri'
class Harvester
  def initialize(page)
    @page = page
    @agent = WWW::Mechanize.new { |a| a.log = Logger.new("logs.log") }
    @agent.keep_alive = false
    @agent.read_timeout = 15
  end

  def login
    f = @agent.get("http://****.com/admin/index.asp").forms.first
    f.set_fields(:username => "user", :password => "pass")
    f.submit
  end

  def harvest(s)
    pageNumber = 1
    # @agent.read_timeout =
    s.upto(@page) do |pagenb|
      puts "*************************** page= #{pagenb}/#{@page}***************************************"
      begin
        # time = Time.now
        # search = @agent.get("http://****.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}")
        extract(pagenb)
      rescue => e
        puts "unknown #{e.to_s}"
        # puts "url:http://****.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}"
        # sleep(2)
        extract(pagenb)
      rescue Net::HTTPBadResponse => e
        puts "net exception" + e.to_s
      rescue WWW::Mechanize::ResponseCodeError => ex
        puts "mechanize error: " + ex.response_code
      rescue Timeout::Error => e
        puts "timeout: " + e.to_s
      end
    end
  end

  def extract(page)
    # puts search.body
    search = @agent.get("http://***.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}")
    doc = Hpricot(search.body)
    # remove titles
    # doc.search("/html/body/div/table[2]/tr/td[2]/table[3]/tr[1]").remove
    (doc/"/html/body/div/table[2]/tr/td[2]/table[3]//tr").each do |tr|
      # delete the phone number from the html
      temp = tr.search("/td[2]").inner_html
      index = temp.index('<')
      email = temp[0..index-1]
      puts email
      f = File.open("./emails", 'a')
      f.puts(email)
      f.close
    end
  end
end
puts "starting extacting emails ... "
start =ARGV[0].to_i
h=Harvester.new(186)
h.login
h.harvest(start)
Mechanize puts the full content of a page into its history, and this may cause problems when browsing through many pages. To limit the size of the history, try:
@mech = WWW::Mechanize.new do |agent|
  agent.history.max_size = 1
end
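On current Mechanize (2.x) the WWW:: prefix is gone and there is a shortcut for the same setting; a quick sketch:
require 'mechanize'

agent = Mechanize.new
agent.max_history = 1   # equivalent to agent.history.max_size = 1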