I'm working on a Ruby script to automatically collect tweets from a list of users or terms, and it works except when Twitter is over capacity. Then, Twitter returns an HTML page with no error code for me to capture. The HTML throws off the parser and the script fails. How can I check for that HTML response and handle it gracefully? I tried using an "until success" approach (commented out in my code below), but that ended up in the rescue every time.
I'm using the logger, twitter and twitter4r gems and authenticating via OAuth. The code below works unless Twitter is over capacity; I only get to this section if no RESTErrors are raised.
Here's my code:
# until success
# begin
if search_type == "users"
  # begin
  tweets_array = client.timeline_for(:user, :id => row.chomp, :since_id => since_status_id, :count => 200)
  success = true
  # rescue
  #   log.info("error getting tweets. waiting to try again.")
  #   sleep 180
  # end
elsif search_type == "terms"
  # begin
  tweets_array = client.search(:q => row.chomp, :since_id => since_status_id, :count => 200)
  success = true
  # rescue
  #   log.info("error getting tweets. waiting to try again.")
  #   sleep 180
  # end
else
  log.fatal("unsupported search type. exiting.")
  break
end # search type
# end
Code I'm trying based on answers:
begin
  tweets_array = client.timeline_for(:user, :id => row.chomp, :since_id => since_status_id, :count => 200)
  success = true
rescue Twitter::RESTError => re
  log.info(row.chomp + ": " + re.code.to_s + " " + re.message.to_s + " " + re.uri.to_s)
  if re.code.to_i == 503
    sleep 180
  end
end
You can check the HTTP status code of the response; when Twitter is over capacity it should be 503.
http://twitter.com/503
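A minimal sketch of how that check could be wired into a wait-and-retry loop, based on the snippet above (the MAX_WAITS cap is hypothetical, and re.code carrying the HTTP status is assumed from twitter4r's RESTError as used earlier; if the over-capacity HTML instead surfaces as a parse error rather than a RESTError, a broader rescue with the same wait would be needed):

MAX_WAITS = 5
waits = 0
begin
  tweets_array = client.timeline_for(:user, :id => row.chomp, :since_id => since_status_id, :count => 200)
  success = true
rescue Twitter::RESTError => re
  log.info(row.chomp + ": " + re.code.to_s + " " + re.message.to_s)
  if re.code.to_i == 503 && (waits += 1) <= MAX_WAITS
    # Twitter is over capacity; wait, then retry the same request.
    sleep 180
    retry
  else
    raise
  end
end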
The following code, when run on Jenkins, throws the error:
"The set password url is
invalid argument (Session info: chrome=100.0.4896.88) (Selenium::WebDriver::Error::InvalidArgumentError)
Backtrace: Ordinal0 [0x00A67413+2389011]"
STEP FILE:
When(/^the user clicks on activate online account link$/) do
  on(CheckoutPage) do |page|
    # sleep for 30 seconds for the email to be received
    sleep 30
    p @set_password_link = page.get_password_token
    puts "The set password url is #{@set_password_link}"
    page.navigate_to(@set_password_link)
  end
end
Code FILE:
def get_password_token
  begin
    retries ||= 0
    Gmail.new("xxxxxxx@gmail.com", "xxxxxxxx") do |gmail|
      email = gmail.inbox.emails(:from => 'orders@cottonon.com', :subject => 'Activate your online account').last
      html = email.html_part.body.to_s
      urls = URI.extract(html, %w(https))
      return urls[1]
    end
  rescue
    retry if (retries += 1) < $code_retry
  end
end
It could be a number of things; maybe you just need URI.parse(urls[1]), or the fetched URL is invalid.
Also, your Gmail code always fetches the last email, which can return the wrong one if the expected email has not arrived yet.
Here is a gmail_check method that should be more resistant to mail content and arrival time:
def gmail_check(url_part, receiver, timeout = 30)
  time = (Time.now - 5.minutes).to_i
  Gmail.connect("xxxxxxx@gmail.com", "xxxxxxxx") do |gmail|
    puts("Reading emails to: #{receiver}")
    while (timeout > 0)
      gmail.inbox.find(:gm => "\"after:#{time}\"").each do |mail|
        if mail.message.to.first == receiver
          content = mail.multipart? ? mail.html_part.decoded : mail.message.decoded
          Nokogiri::HTML(content).css("a").each do |a|
            href = a.attributes["href"].to_s
            return href if href.include?(url_part)
          end
        end
      end
      puts("Waiting 5 seconds before reading mail again.")
      timeout = timeout - 5
      sleep 5
    end
  end
end
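For reference, a sketch of how the step definition could use this helper instead of a fixed sleep (assumes gmail_check is available to the step, and the "set_password" URL fragment is only an illustrative guess at what the activation link contains):

When(/^the user clicks on activate online account link$/) do
  on(CheckoutPage) do |page|
    # Poll Gmail for up to 60 seconds instead of sleeping a fixed 30 seconds.
    @set_password_link = gmail_check("set_password", "xxxxxxx@gmail.com", 60)
    raise "Activation email not received" if @set_password_link.nil?
    page.navigate_to(@set_password_link)
  end
end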
But you should be able to debug the problem easily by SSH-ing into the Jenkins machine:
type irb
type require 'gmail'
paste your code there
check the url
Good luck :)
I have the BrowserMob Proxy set up correctly with Watir, and it is capturing traffic and saving the HAR file; however, it is not capturing the traffic continuously. The following is what I'm trying to achieve:
Go to homepage
Click on a link to go to another page where I need to wait for some events to happen
Once on the second page, start capturing traffic after the event happens and wait for a specific call to occur and capture its contents.
What I'm noticing, however, is that it follows all of the above steps, but on step 3 the proxy stops capturing traffic before that call is even made on the page. The HAR that is returned doesn't have that call in it, so the test fails before it even does its job. The following is what the code looks like.
class BMP
  attr_accessor :server, :proxy, :net_har, :sel_proxy

  def initialize
    bm_path = File.path(Support::Paths.cucumber_root + "/browsermob-proxy-2.1.4/bin/browsermob-proxy")
    @server = BrowserMob::Proxy::Server.new(bm_path, {:port => 9999, :log => false, :use_little_proxy => true, :timeout => 100})
    @server.start
    @proxy = @server.create_proxy
    @sel_proxy = @proxy.selenium_proxy
    @proxy.timeouts(:read => 50000, :request => 50000, :dns_cache => 50000)
    @net_har = @proxy.new_har("new_har", :capture_binary_content => true, :capture_headers => true, :capture_content => true)
  end

  def fetch_har_entries(target_url)
    har_logs = File.join(Support::Paths.har_logs, "har_file_#{Time.now.strftime("%m%d%y_%H%M%S")}.har")
    @net_har.save_to har_logs
    index = 0
    while (@net_har.entries.count > index) do
      entry = @net_har.entries[index]
      if entry.request.url.include?(target_url) && entry.request.method.eql?("GET")
        logs = JSON.parse(entry.response.content.text) if not entry.response.content.text.nil?
        har_logs = File.join(Support::Paths.har_logs, "json_file_#{Time.now.strftime("%m%d%y_%H%M%S")}.json")
        File.open(har_logs, "w") do |json|
          json.write(logs)
        end
        break
      end
      index += 1
    end
  end
end
In my test file I have the following:
Then("I navigate to the homepage") do
visit(HomePage) do |page|
page.element.click
end
end
And("I should wait for event to capture traffic") do
visit(SecondPage) do |page|
page.wait_until{page.element2.present?)
BMP.fetch_har_entries("target/url")
end
end
What am I missing that is causing the proxy to not capture traffic in its entirety?
In case anyone gets here from a Google search, I figured out how to resolve this on my own (thanks stackoverflow community for nothing, lol). To resolve the issue, I used a custom retriable loop built around an eventually method.
logs = nil
eventually(timeout: 110, interval: 1) do
  @net_har = @proxy.new_har("har", capture_binary_content: true, capture_headers: true, capture_content: true)
  @net_har.entries.each do |entry|
    begin
      break if @net_har.entries.index(entry) == @net_har.entries.count
      next unless entry.request.url.include?(target_url) &&
                  entry.request.post_data.text.include?(target_body_text)
      logs = entry.request.post_data.text
      break
    rescue TypeError
      fail("Response body for the network call came back empty")
    end
  end
  raise EOFError if logs.nil?
end
logs
end
Basically, I'm assuming what was happening was that BMP would only cache or capture about 30 seconds' worth of HAR logs, and if my network event didn't occur during those 30 seconds, I was out of luck. So what the above code does is wait for the logs variable to be non-nil; if it is nil, it raises an EOFError and goes back into the loop, initializes the HAR again, and looks for the network call again. It keeps doing that until it finds the call or the 110 seconds are up. The following is the eventually method I'm using.
def eventually(options = {})
  timeout = options[:timeout] || 30
  interval = options[:interval] || 0.1
  time_limit = Time.now + timeout
  loop do
    error = nil # clear any error left over from the previous attempt
    begin
      yield
    rescue EOFError => error
    end
    return if error.nil?
    raise error if Time.now >= time_limit
    sleep interval
  end
end
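As a side note, the same helper can be reused for other polling waits; a small hypothetical example (using the element2 accessor from the step above) would look like this:

# Hypothetical reuse of eventually: raise EOFError to signal "not ready yet",
# so the helper sleeps and runs the block again until the timeout expires.
eventually(timeout: 30, interval: 2) do
  raise EOFError unless page.element2.present?
end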
I am using the Ruby mechanize web crawler to pull data from popular real estate websites. I'm using the home address as keywords to scrape the public data on Zillow, Redfin, etc.
I'm basically trying to bypass any HTTP and network errors. The following rescue function doesn't seem to do the job.
def scrape_single(key_word)
  # setup agent
  agent = Mechanize.new{ |agent|
    agent.user_agent_alias = 'Mac Safari'
  }
  agent.ignore_bad_chunking = true
  agent.verify_mode = OpenSSL::SSL::VERIFY_NONE
  agent.request_headers = { "Accept-Encoding" => "" }
  agent.follow_meta_refresh = true
  agent.keep_alive = false

  # page setup
  begin
    agent.get(@@search_engine) do |page|
      @@search_result = page.form('f') do |search|
        search.q = key_word
      end.submit
    end
  rescue Timeout::Error
    puts "Timeout"
    retry
  rescue Net::HTTPGatewayTimeOut => e
    if e.response_code == '504' || '502'
      e.skip
      sleep 5
    end
  rescue Net::HTTPBadGateway => e
    if e.response_code == '504' || '502'
      e.skip
      sleep 5
    end
  rescue Net::HTTPNotFound => e
    if e.response_code == '404'
      e.skip
      sleep 5
    end
  rescue Net::HTTPFatalError => e
    if e.response_code == '503'
      e.skip
    end
  rescue Mechanize::ResponseCodeError => e
    if e.response_code == '404'
      e.skip
      sleep 5
    elsif e.response_code == '502'
      e.skip
      sleep 5
    else
      retry
    end
  rescue Errno::ETIMEDOUT
    retry
  end
  return @@search_result # returns Mechanize::Page
end
The following is an example of the error message I get for a keyword with an address in MA.
/home/ec2-user/.gem/ruby/2.1/gems/mechanize-2.7.5/lib/mechanize/http/agent.rb:323:in `fetch': 404 => Net::HTTPNotFound for https://www.redfin.com/MA/WASHINGTON/306-WERDEN-RD-Unknown/home/134059623 -- unhandled response (Mechanize::ResponseCodeError)
The actual message you see when you input the above URL is:
Cannot GET /MA/WASHINGTON/306-WERDEN-RD-Unknown/home/134059623
My goal is to simply ignore and skip sporadic errors and move onto next keyword. I couldn't really find a working solution online and any feedback would be greatly appreciated.
If I understand correctly, the error raised is Mechanize::ResponseCodeError, and it clearly carries a 404 response_code, but in your script you don't handle that 404 response_code from Mechanize::ResponseCodeError:
all_response_code = ['403', '404', '502']
rescue Mechanize::ResponseCodeError => e
  if all_response_code.include? e.response_code
    e.skip
    sleep 5
  else
    retry
  end
Maybe if you add a condition for the 404 response_code, it will do the trick
EDIT
I changed the code a little bit in order to have fewer lines.
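Put together with the question's scrape_single, the relevant rescue could look roughly like this (a sketch only; the RETRYABLE_CODES name and the cap of 3 attempts are illustrative, and e.skip is omitted):

RETRYABLE_CODES = ['403', '404', '502']

begin
  attempts ||= 0
  agent.get(@@search_engine) do |page|
    @@search_result = page.form('f') { |search| search.q = key_word }.submit
  end
rescue Mechanize::ResponseCodeError => e
  if RETRYABLE_CODES.include?(e.response_code)
    # Sporadic 403/404/502: give up on this keyword after a short pause.
    sleep 5
  elsif (attempts += 1) < 3
    retry
  else
    raise
  end
rescue Timeout::Error, Errno::ETIMEDOUT
  retry
end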
I am having trouble printing out a list of people I am following on twitter. This code worked at 250, but fails now that I am following 320 people.
Failure Description: The request exceeds Twitter's rate limit. The code sleeps for the time required for the limit to reset, and then tries again.
I think that, the way it's written, it just keeps retrying the same entire (rejected) request, rather than picking up where it left off.
MAX_ATTEMPTS = 3
num_attempts = 0
begin
  num_attempts += 1
  @client.friends.each do |user|
    puts "#{user.screen_name}"
  end
rescue Twitter::Error::TooManyRequests => error
  if num_attempts <= MAX_ATTEMPTS
    sleep error.rate_limit.reset_in
    retry
  else
    raise
  end
end
Thanks!
The following code will return an array of usernames. The vast majority of the code was written by the author of: http://workstuff.tumblr.com/post/4556238101/a-short-ruby-script-to-pull-your-twitter-followers-who
First create the following definition.
def get_cursor_results(action, items, *args)
  result = []
  next_cursor = -1
  until next_cursor == 0
    begin
      t = @client.send(action, args[0], args[1], {:cursor => next_cursor})
      result = result + t.send(items)
      next_cursor = t.next_cursor
    rescue Twitter::Error::TooManyRequests => error
      puts "Rate limit error, sleeping for #{error.rate_limit.reset_in} seconds...".color(:yellow)
      sleep error.rate_limit.reset_in
      retry
    end
  end
  return result
end
Second, gather your Twitter friends using the following two lines:
friends = get_cursor_results('friends', 'users', 'twitterusernamehere')
screen_names = friends.collect{|x| x.screen_name}
Try using a cursor: http://rdoc.info/gems/twitter/Twitter/API/FriendsAndFollowers#friends-instance_method (for example, https://gist.github.com/kent/451413)
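A minimal sketch of that cursor idea, mirroring the get_cursor_results helper above (it assumes the same twitter gem version used there, where friends accepts a :cursor option and returns an object exposing users and next_cursor):

# Fetch friends one page at a time so a rate-limit retry only repeats
# the current page instead of the whole listing.
cursor = -1
until cursor == 0
  begin
    page = @client.friends('twitterusernamehere', {:cursor => cursor})
    page.users.each { |user| puts user.screen_name }
    cursor = page.next_cursor
  rescue Twitter::Error::TooManyRequests => error
    sleep error.rate_limit.reset_in
    retry
  end
end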
I'm using curb to test some URLs in Ruby:
require 'curb'

def test_url()
  c = Curl::Easy.new("http://www.wikipedia.org/wiki/URL_redirection") do |curl|
    curl.follow_location = true
    curl.head = true
  end
  c.perform
  puts "status => " + c.status
  puts "body => " + c.body_str
  puts "final url => " + c.last_effective_url
end

test_url
This outputs:
status => 301 Moved Permanently
body =>
final url => http://en.wikipedia.org/wiki/URL_redirection
In this case, www.wikipedia.org/wiki/URL_redirection redirects to en.wikipedia.org/wiki/URL_redirection.
As you can see, I am getting a 301 status. How can I get the status of the final response code?
In this case, it is 200 because the document is found. I checked the libcurl documentation and found a flag CURLINFO_RESPONSE_CODE.
What is the equivalent in the curb library?
Found it.
I cloned the curb source and grepped for:
last_effective_url
The function just below it, in curb_easy.c at line 2435, was the equivalent for the response code.
Note to self: "Use the source, Luke!"
UPDATE:
The answer is response_code
In my case the code looks like so:
c = Curl::Easy.new(HOST_NAME) do |curl|
  curl.follow_location = true
  curl.head = true
end
c.perform
puts url + " => " + c.response_code.to_s
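Putting the two pieces together, a small sketch reusing the Wikipedia URL from the question shows both the final URL and the final status once redirects are followed:

require 'curb'

c = Curl::Easy.new("http://www.wikipedia.org/wiki/URL_redirection") do |curl|
  curl.follow_location = true # follow the 301 to the final document
  curl.head = true
end
c.perform
puts "final url => " + c.last_effective_url
puts "final status => " + c.response_code.to_s # 200 once the redirect is followed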