Ruby: How to set Feedjira configuration options?

In the Feedjira 2.0 announcement blog post, it says that if you want to set the user agent, that should be a configuration option, but it is not clear how to do this. Ideally, I would like to mimic the options previously provided in Feedjira 1.0, including user_agent, if_modified_since, timeout, and ssl_verify_peer.
http://feedjira.com/blog/2014/04/14/thoughts-on-version-two-point-oh.html
With Feedjira 1.0, you could set those options by making the following call (as described here):
feed_parsed = Feedjira::Feed.fetch_and_parse("http://sports.espn.go.com/espn/rss/news", {:if_modified_since => Time.now, :ssl_verify_peer => false, :timeout => 5, :user_agent => "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"})
The only example I have seen where configuration options are set comes from a comment on a GitHub pull request, which is as follows:
Feedjira::Feed.configure do |faraday|
faraday.request :user_agent, app: "MySite", version: APP_VERSION
end
But when I tried something similar, I received the following error:
undefined method `configure' for Feedjira::Feed:Class

It looks like a patch was added to allow a timeout option to be passed to the fetch_and_parse function:
https://github.com/feedjira/feedjira/pull/318/commits/fbdb85b622f72067683508b1d7cab66af6303297#diff-a29beef397e3d8624e10af065da09a14
However, until that is released, you can pass a timeout and an open_timeout by bypassing Feedjira for the fetching and instead using Faraday (or any library that can make HTTP requests, such as Net::HTTP). You can also disable SSL verification and set the user agent, like this:
require 'faraday'
require 'feedjira'
require 'pp'
url = "http://www.espn.com/espnw/rss/?sectionKey=athletes-life"
conn = Faraday.new :ssl => {:verify => false}
response = conn.get do |request|
request.url url
request.options.timeout = 5
request.options.open_timeout = 5
request.headers = {'User-Agent' => "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"}
end
feed_parsed = Feedjira::Feed.parse response.body
pp feed_parsed.entries.first
I haven't seen a way to replicate "if_modified_since", but I will update this answer if I do.
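Since the fetch goes through a plain HTTP client anyway, one stopgap is to send the If-Modified-Since header yourself and check for a 304 Not Modified response. The sketch below uses only the standard library; fetch_if_modified and its feed_url argument are illustrative names, not part of Feedjira's API:

```ruby
require 'time'
require 'net/http'

# Build a conditional-GET header from the time of the last successful fetch.
# Time#httpdate emits the RFC 1123 timestamp format that HTTP expects.
def conditional_headers(last_fetched)
  { 'If-Modified-Since' => last_fetched.httpdate }
end

# Hedged sketch of the fetch itself (feed_url is a placeholder):
def fetch_if_modified(feed_url, last_fetched)
  uri = URI(feed_url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    response = http.get(uri.request_uri, conditional_headers(last_fetched))
    # A 304 Not Modified means the feed is unchanged; skip re-parsing.
    response.is_a?(Net::HTTPNotModified) ? nil : response.body
  end
end
```

If the body comes back non-nil, it can be handed to Feedjira::Feed.parse as in the snippet above.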

Related

Mechanize Rails - Web Scraping - Server responds with JSON - How to Parse URL from to Download CSV

I am new to Mechanize and trying to overcome what is probably a very obvious problem.
I put together a short script to authenticate on an external site, then click a link that generates a CSV file dynamically.
I finally got it to click the export button; however, it returns an AWS URL.
I'm trying to get the script to download that CSV from the JSON response (seen below).
Myscript.rb
require 'mechanize'
require 'logger'
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'zlib'
USERNAME = "myemail"
PASSWORD = "mysecret"
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
mechanize = Mechanize.new do |a|
a.user_agent = USER_AGENT
end
form_page = mechanize.get('https://XXXX.XXXXX.com/signin')
form = form_page.form_with(:id =>'login')
form.field_with(:id => 'user_email').value=USERNAME
form.field_with(:id => 'user_password').value=PASSWORD
page = form.click_button
donations = mechanize.get('https://XXXXX.XXXXXX.com/pages/ACCOUNT/statistics')
puts donations.body
donations = mechanize.get('https://xxx.siteimscraping.com/pages/myaccount/statistics')
bs_csv_download = page.link_with(:text => 'Download CSV')
JSON response from the website containing the link to the CSV, which I need to parse and download via Mechanize and/or Nokogiri:
{"message":"Find your report at https://s3.amazonaws.com/reports.XXXXXXX.com/XXXXXXX.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256\u0026X-Amz-Credential=AKIAIKW4BJKQUNOJ6D2A%2F20190228%2Fus-east-1%2Fs3%2Faws4_request\u0026X-Amz-Date=20190228T025844Z\u0026X-Amz-Expires=86400\u0026X-Amz-SignedHeaders=host\u0026X-Amz-Signature=b19b6f1d5120398c850fc03c474889570820d33f5ede5ff3446b7b8ecbaf706e"}
I very much appreciate any help.
You could parse it as JSON and then retrieve a substring from the response (assuming it always responds in the same format):
require 'json'
...
bs_csv_download = page.link_with(:text => 'Download CSV')
# Clicking the link returns the JSON response; parse its body
json_response = JSON.parse(bs_csv_download.click.body)
direct_link = json_response["message"][20..-1]
mechanize.get(direct_link).save('file.csv')
We're taking everything from the 20th character of the "message" value onward with [20..-1] (the -1 means "to the end of the string"), which skips the "Find your report at " prefix.
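A fixed character offset breaks the moment the prefix wording changes; pulling the URL out with a regex is more robust. A small hedged sketch (the sample body below just mirrors the shape of the response above, with a made-up URL):

```ruby
require 'json'

# Sample body shaped like the AWS response in the question (URL is made up).
body = '{"message":"Find your report at https://example.com/report.csv?X-Amz-Expires=86400"}'

message = JSON.parse(body)['message']
# String#[] with a regex returns the first match or nil;
# \S+ grabs the URL up to the next whitespace character.
direct_link = message[%r{https?://\S+}]
# => "https://example.com/report.csv?X-Amz-Expires=86400"
```

The resulting string can then be passed to mechanize.get as before.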

Multiple matches in hash table

I'm learning Ruby (v2.5) on Coursera.
The aim is to write a simple parser in Ruby that counts which host IP is responsible for the most queries in the Apache logs.
Apache logs:
87.99.82.183 - - [01/Feb/2018:18:50:06 +0000] "GET /favicon.ico HTTP/1.1" 404 504 "http://35.225.14.147/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36"
87.99.82.183 - - [01/Feb/2018:18:50:52 +0000] "GET /secret.html HTTP/1.1" 404 505 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36"
Ruby code:
class ApacheLogAnalyzer
  def initialize
    @total_hits_by_ip = {}
  end

  def analyze(file_name)
    ip_regex = /^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/
    file = File.open(file_name, "r")
    file.each_line do |line|
      count_hits(ip_regex.match(line))
    end
  end

  def count_hits(ip)
    if ip
      if @total_hits_by_ip[ip]
        @total_hits_by_ip[ip] += 1
      else
        @total_hits_by_ip[ip] = 1
      end
    end
  end
end
Result is following:
{#<MatchData "87.99.82.183">=>1, #<MatchData "87.99.82.183">=>1}
The result contains duplicates (it should contain one key "87.99.82.183" with the value 2). Where could the issue be?
The result contains duplicates in your case because the hash keys are different objects that merely have the same value. Look at these examples:
a = "hello world foo".match(/he/) # => #<MatchData "he">
b = "hello world bar".match(/he/) # => #<MatchData "he">
a == b # => false
You can replace the hash keys with plain strings, for example, to avoid this entirely:
class ApacheLogAnalyzer
  def analyze(file_name)
    File.open(file_name).each_line.inject(Hash.new(0)) do |result, line|
      ip = line.split.first
      result[ip] += 1
      result
    end
  end
end
Thank you for your comment. I found that calling the to_s method on the match resolves the issue.
So improved code looks like this:
count_hits(ip_regex.match(line).to_s)
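Putting the pieces together, a hedged sketch of the full analyzer might look like the following. It tallies plain strings in a Hash.new(0), so the MatchData-duplicate problem cannot occur; taking an IO instead of a file name is an illustrative choice that also makes it easy to exercise without a real log file:

```ruby
require 'stringio'

class ApacheLogAnalyzer
  # Dotted-quad at the start of the line, with the dots escaped.
  IP_REGEX = /^\d{1,3}(?:\.\d{1,3}){3}/

  # io is anything that responds to each_line (a File, StringIO, etc.).
  def analyze(io)
    io.each_line.with_object(Hash.new(0)) do |line, hits|
      ip = line[IP_REGEX]      # String#[] returns the match as a String, or nil
      hits[ip] += 1 if ip
    end
  end
end

log = StringIO.new("87.99.82.183 - - [01/Feb/2018] ...\n" \
                   "87.99.82.183 - - [01/Feb/2018] ...\n")
ApacheLogAnalyzer.new.analyze(log)
# => {"87.99.82.183"=>2}
```

Because the keys are strings, repeated hits from the same host collapse into one entry, and `max_by { |_, count| count }` will give the busiest IP.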

Using a splat to catch errors is not working

I have a lot of errors that I need to catch, so I put them all into two arrays held in constants; however, when I run the program I receive this exception:
C:/Users/thomas_j_perkins/bin/ruby/tool/sql_tool/whitewidow/lib/imports/constants_and_requires.rb:62:in `<top (required)>': uninitialized constant RestClient::MaxRedirectsReached (NameError)
from whitewidow.rb:6:in `require_relative'
from whitewidow.rb:6:in `<main>'
Here's how the constants look:
LOADING_ERRORS = [RestClient::ResourceNotFound, RestClient::InternalServerError, RestClient::RequestTimeout,
RestClient::Gone, RestClient::SSLCertificateNotVerified, RestClient::Forbidden,
OpenSSL::SSL::SSLError, Errno::ECONNREFUSED, URI::InvalidURIError, Errno::ECONNRESET,
Timeout::Error, OpenSSL::SSL::SSLError, Zlib::GzipFile::Error, RestClient::MultipleChoices,
RestClient::Unauthorized, SocketError, RestClient::BadRequest, RestClient::ServerBrokeConnection,
RestClient::MaxRedirectsReached]
FATAL_ERRORS = [Mechanize::ResponseCodeError, RestClient::ServiceUnavailable, OpenSSL::SSL::SSLError,
RestClient::BadGateway]
Here's how I'm using them:
begin
# Do some cool stuff
rescue *FATAL_ERRORS => e
puts e
end
--
begin
# Do some more cool stuff
rescue *LOADING_ERRORS => e
puts e
end
Am I doing something wrong that would cause a top (required) error? Just in case you need it, here's the entire requires file that the error points at:
# Built in libraries
require 'rubygems'
require 'bundler/setup'
require 'mechanize'
require 'nokogiri'
require 'rest-client'
require 'timeout'
require 'uri'
require 'fileutils'
require 'yaml'
require 'date'
require 'optparse'
require 'tempfile'
require 'socket'
require 'net/http'
# Created libraries
require_relative '../../lib/modules/format'
require_relative '../../lib/misc/credits'
require_relative '../../lib/misc/legal'
require_relative '../../lib/misc/spider'
require_relative '../../lib/modules/copy'
require_relative '../../lib/modules/site_info'
require_relative '../../lib/modules/expansion/string_expan'
# Modules that need to be included
include Format
include Credits
include Legal
include Whitewidow
include Copy
include SiteInfo
# Constants used throughout the program
=begin
USER_AGENTS = { # Temporary fix for user agents until I can refactor the YAML file
1 => 'Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html) Gecko/2008032620',
2 => 'Mozilla/5.0 (compatible; U; ABrowse 0.6; Syllable) AppleWebKit/420+ (KHTML, like Gecko)',
3 => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.3pre) Gecko/20100403 Lorentz/3.6.3plugin2pre (.NET CLR 4.0.20506)',
4 => 'Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)',
5 => 'igdeSpyder (compatible; igde.ru; +http://igde.ru/doc/tech.html)',
6 => 'larbin_2.6.3 (ltaa_web_crawler#groupes.epfl.ch)',
7 => 'Mozilla/5.0 (Linux; Android 5.0.2; SAMSUNG SM-T550 Build/LRX22G) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/3.3 Chrome/38.0.2125.102 Safari/537.36',
8 => 'Dalvik/2.1.0 (Linux; U; Android 6.0.1; Nexus Player Build/MMB29T)',
9 => 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1',
10 => 'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)',
}
=end
FORMAT = Format::StringFormat.new
PATH = Dir.pwd
VERSION = Whitewidow.version
SEARCH = File.readlines("#{PATH}/lib/lists/search_query.txt").sample
USER_AGENTS = YAML.load_file("#{PATH}/lib/lists/rand-age.yml")
OPTIONS = {}
USER_AGENT = USER_AGENTS[rand(1..10)]
SKIP = %w(/webcache.googleusercontent.com stackoverflow.com github.com)
LOADING_ERRORS = [RestClient::ResourceNotFound, RestClient::InternalServerError, RestClient::RequestTimeout,
RestClient::Gone, RestClient::SSLCertificateNotVerified, RestClient::Forbidden,
OpenSSL::SSL::SSLError, Errno::ECONNREFUSED, URI::InvalidURIError, Errno::ECONNRESET,
Timeout::Error, OpenSSL::SSL::SSLError, Zlib::GzipFile::Error, RestClient::MultipleChoices,
RestClient::Unauthorized, SocketError, RestClient::BadRequest, RestClient::ServerBrokeConnection,
RestClient::MaxRedirectsReached]
FATAL_ERRORS = [Mechanize::ResponseCodeError, RestClient::ServiceUnavailable, OpenSSL::SSL::SSLError,
RestClient::BadGateway]
I installed mechanize and rest-client:
gem install mechanize
gem install rest-client
then I opened an IRB session:
require 'mechanize'
require 'rest-client'
then tested your FATAL_ERRORS array and was able to raise the errors and handle them with your code.
So there is no problem with the way you are using the * splat operator.
The problem is in your LOADING_ERRORS array.
When I tried doing the same thing with your LOADING_ERRORS array, I got the same error message as you.
I cloned the rest-client git repository, searched the lib/restclient/exceptions.rb file, and it seems there is no RestClient::MaxRedirectsReached defined.
If you remove that exception from your array, the code works.
After further research in the repository, there is a history.md file and it states:
Changes to redirection behavior: (#381, #484)
Remove RestClient::MaxRedirectsReached in favor of the normal
ExceptionWithResponse subclasses. This makes the response accessible on
the exception object as .response, making it possible for callers to tell
what has actually happened when the redirect limit is reached.
When following HTTP redirection, store a list of each previous response on
the response object as .history. This makes it possible to access the
original response headers and body before the redirection was followed.
Follow redirection consistently, regardless of whether the HTTP method was
passed as a symbol or string. Under the hood rest-client now normalizes the
HTTP request method to a lowercase string.
So it seems like that exception has been removed from the rest-client library.
You may want to replace it with RestClient::ExceptionWithResponse.
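The splat-rescue mechanics can be verified in isolation, and the same failure can be guarded against by filtering the list down to constants that actually exist in the installed gem version. A hedged sketch with stand-in error classes (the real ones come from rest-client and friends):

```ruby
# Stand-in error classes for illustration only.
class FakeTimeout < StandardError; end
class FakeGone < StandardError; end

LOADING_ERRORS = [FakeTimeout, FakeGone].freeze

def risky
  raise FakeGone, 'resource removed'
rescue *LOADING_ERRORS => e
  # The splat expands the array into a multi-class rescue clause.
  "rescued: #{e.message}"
end

risky # => "rescued: resource removed"

# To survive constants being removed in newer gem versions, build the
# list only from names the module actually defines, e.g. (commented out
# because it assumes rest-client is installed):
#
# names = %w[ResourceNotFound MaxRedirectsReached]
# LOADING_ERRORS = names.select { |n| RestClient.const_defined?(n) }
#                       .map { |n| RestClient.const_get(n) }
```

The const_defined? guard means a missing exception class silently drops out of the list instead of raising NameError at load time.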

Trouble scraping Google trends using Capybara and Poltergeist

I want to get the top trending queries in a particular category on Google Trends. I could download the CSV for that category, but that is not a viable solution because I want to branch into each query and find the trending sub-queries for each.
I am unable to capture the contents of the following table, which contains the top 10 trending queries for a topic. Also, for some weird reason, taking a screenshot using Capybara returns a darkened image.
<div id="TOP_QUERIES_0_0table" class="trends-table">
Please run the code in a Ruby console to see it working. Capturing elements and screenshots works fine for facebook.com or google.com, but doesn't work for Trends.
I am guessing this has to do with the table being generated dynamically on page load, but I'm not sure why that would prevent Capybara from capturing the elements already loaded on the page. Any hints would be very valuable.
require 'capybara/poltergeist'
require 'capybara/dsl'
require 'csv'
class PoltergeistCrawler
include Capybara::DSL
def initialize
Capybara.register_driver :poltergeist_crawler do |app|
Capybara::Poltergeist::Driver.new(app, {
:js_errors => false,
:inspector => false,
phantomjs_logger: open('/dev/null')
})
end
Capybara.default_wait_time = 3
Capybara.run_server = false
Capybara.default_driver = :poltergeist_crawler
page.driver.headers = {
"DNT" => 1,
"User-Agent" => "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:22.0) Gecko/20100101 Firefox/22.0"
}
end
# handy to peek into what the browser is doing right now
def screenshot(name="screenshot")
page.driver.render("public/#{name}.jpg",full: true)
end
# find("path") and all("path") work ok for most cases. Sometimes I need more control, like finding hidden fields
def doc
Nokogiri.parse(page.body)
end
end
crawler = PoltergeistCrawler.new
url = "http://www.google.com/trends/explore#cat=0-45&geo=US&date=today%2012-m&cmpt=q"
crawler.visit url
crawler.screenshot
crawler.find(:xpath, "//div[@id='TOP_QUERIES_0_0table']")
Capybara::ElementNotFound: Unable to find xpath "//div[@id='TOP_QUERIES_0_0table']"
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/node/finders.rb:41:in `block in find'
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/node/base.rb:84:in `synchronize'
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/node/finders.rb:30:in `find'
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/session.rb:676:in `block (2 levels) in '
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/dsl.rb:51:in `block (2 levels) in <module:DSL>'
from (irb):45
from /Users/karan/.rbenv/versions/1.9.3-p484/bin/irb:12:in `<main>'
The JavaScript error was due to the incorrect User-Agent. Once I changed the user agent to that of my Chrome browser, it worked!
"User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36"

How to capture HTTP full headers with watir

I need to capture the HTTP response that comes back after the request is made.
I have tried the "net/http" library, but it is not giving me the full response headers.
The code I have tried is
uri = URI("http://example.com")
res = Net::HTTP.get_response(uri)
res.to_hash
I am getting some response headers but not all of them; I checked the same request in Firebug and it shows extra headers beyond what my code returns.
Can anyone help me get the full HTTP response headers, or suggest a trick to do that by invoking a browser?
Maybe this would help: WebDriver: Y U NO HAVE HTTP Status Codes?!
Try this:
uri = URI("http://example.com")
res = Net::HTTP.get_response(uri)
res.header.to_hash
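To see what to_hash actually returns without hitting the network, you can build a response object by hand; headers come back as a hash of lowercased names mapped to arrays of values. This is a hedged illustration of the return shape, not how you would normally obtain a response:

```ruby
require 'net/http'

# Construct a bare 200 OK response and attach headers manually.
res = Net::HTTPOK.new('1.1', '200', 'OK')
res['Server'] = 'ExampleServer'        # []= replaces any existing value
res.add_field('Set-Cookie', 'a=1')     # add_field appends, so repeated
res.add_field('Set-Cookie', 'b=2')     # headers become multiple entries

res.to_hash
# => {"server"=>["ExampleServer"], "set-cookie"=>["a=1", "b=2"]}
```

Note the array values: headers that can appear multiple times (like Set-Cookie) keep every occurrence, so nothing is silently dropped by to_hash itself; any "missing" headers are ones the server never sent to this client.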
If you want to get header information, watir is probably the wrong library for you. What problem are you trying to solve?
Not a full answer but more in your direction :)
Impossible using Webdriver (see http://jimevansmusic.blogspot.nl/2012/07/webdriver-y-u-no-have-http-status-codes.html).
Possible solutions:
Use Selenium::Client
Use a proxy
Solution 1 (Selenium::Client):
You can do it using Selenium (also used by Watir Webdriver).
Check here: http://blog.testingbot.com/2011/12/21/capture-network-traffic-with-selenium
require "rubygems"
gem "selenium-client"
require "selenium/client"
gem 'test-unit'
require 'test/unit'
# since this code comes from their site (should not be needed)
gem "testingbot"
require "testingbot"
class ExampleTest < TestingBot::TestCase
attr_reader :browser
def setup
@browser = Selenium::Client::Driver.new \
:host => "hub.testingbot.com",
:port => 4444,
:browser => "firefox",
:version => "8",
:platform => "WINDOWS",
:url => "http://www.google.com",
:timeout_in_second => 60
browser.start_new_browser_session(:captureNetworkTraffic => true)
end
def teardown
browser.close_current_browser_session
end
def test_command
browser.open "/"
p browser.browser_network_traffic
end
end
According to the article this will open Google in Firefox 8 and return the network traffic. An example of a response would be:
"403 GET http://localhost:5555/favicon.ico1333 bytes 94ms
(2011-12-21T15:53:06.352+0100 - 2011-12-21T15:53:06.446+0100
Request Headers - Host => localhost:5555 -
User-Agent => Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0.1) Gecko/20100101
Firefox/8.0.1 - Accept => image/png,image/*;q=0.8,*/*;q=0.5 -
Accept-Language => en-us,en;q=0.5 - Accept-Encoding => gzip, deflate - Accept-Charset => ISO-8859-1,utf-8;q=0.7,*;q=0.7 -
Proxy-Connection => keep-aliveResponse Headers - Date => Wed, 21 Dec 2011 14:53:06 GMT -
Server => Jetty/5.1.x (Windows 7/6.1 x86 java/1.6.0_26 - Content-Type => text/html -
Content-Length => 1333 - Via => 1.1 (jetty)
Solution 2 (Proxy):
Check http://bmp.lightbody.net/ together with https://github.com/jarib/browsermob-proxy-rb.