I am using Ruby curb to call multiple URLs at once, e.g.:
require 'rubygems'
require 'curb'
easy_options = {:follow_location => true}
multi_options = {:pipeline => true}
Curl::Multi.get(['http://www.example.com','http://www.trello.com','http://www.facebook.com','http://www.yahoo.com','http://www.msn.com'], easy_options, multi_options) do |easy|
  # do something interesting with the easy response
  puts easy.last_effective_url
end
The problem I have is that I want to break off the remaining async calls as soon as any URL times out. Is that possible?
As far as I know, the current API doesn't expose the Curl::Multi instance; otherwise you could do:
stop_everything = proc { multi.cancel! }
multi = Curl::Multi.get(array_of_urls, on_failure: stop_everything)
The easiest way might be to patch Curl::Multi.http to return the m variable.
See https://github.com/taf2/curb/blob/master/lib/curl/multi.rb#L85
I think this will do exactly what you ask for:
require 'rubygems'
require 'curb'

responses = {}
requests = ['http://www.example.com','http://www.trello.com','http://www.facebook.com','http://www.yahoo.com','http://www.msn.com']
m = Curl::Multi.new

requests.each do |url|
  responses[url] = ""
  c = Curl::Easy.new(url) do |curl|
    curl.follow_location = true
    curl.on_body { |data| responses[url] << data; data.size }
    curl.on_success { |easy| puts easy.last_effective_url }
    curl.on_failure { |easy| puts "ERROR:#{easy.last_effective_url}"; @should_stop = true }
  end
  m.add(c)
end

m.perform { m.cancel! if @should_stop }
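One caveat worth adding (not part of the original snippet): on_failure only fires for a timed-out request if the handle actually has a timeout configured, so you would set one on each Curl::Easy before adding it to the multi handle. The values below are placeholders for illustration:

c = Curl::Easy.new(url) do |curl|
  curl.follow_location = true
  # assumed values; tune them for your endpoints
  curl.timeout = 5          # overall per-request deadline, in seconds
  curl.connect_timeout = 2  # time allowed to establish the connection
  # ... on_body / on_success / on_failure handlers as above ...
end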
I'm new to Ruby and am using Nokogiri to parse HTML webpages. An error is thrown in a function when it gets to the line:
currentPage = Nokogiri::HTML(open(url))
I have verified the function's inputs; url is a string with a web address. The line I previously mentioned works exactly as intended when used outside of the function, but not inside it. When it gets to that line inside the function, the following error is thrown:
WebCrawler.rb:25:in `explore': undefined method `+@' for #<Nokogiri::HTML::Document:0x007f97ea0cdf30> (NoMethodError)
from WebCrawler.rb:43:in `<main>'
The function containing the problematic line is pasted below.
def explore(url)
  if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
    return
  end
  CRAWLED_PAGES_COUNTER++
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do |link|
    puts link
    explore(link)
  end
end
Here is the full program (It's not much longer):
require 'nokogiri'
require 'open-uri'

# Crawler Params
START_URL = "https://en.wikipedia.org"
CRAWLED_PAGES_COUNTER = 0
CRAWLED_PAGES_LIMIT = 5

# Crawler Functions
def explore(url)
  if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
    return
  end
  CRAWLED_PAGES_COUNTER++
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do |link|
    puts link
    explore(link)
  end
end

def eval_page(page)
  puts page.title
end

# Start Crawling
explore(START_URL)
Ruby has no ++ operator, so CRAWLED_PAGES_COUNTER++ followed by a newline is parsed as CRAWLED_PAGES_COUNTER + +currentPage = ..., which is where the undefined method +@ error comes from. Constants also aren't meant to be reassigned (Ruby warns when you do), so switch to global variables and += 1:

require 'nokogiri'
require 'open-uri'

# Crawler Params
$START_URL = "https://en.wikipedia.org"
$CRAWLED_PAGES_COUNTER = 0
$CRAWLED_PAGES_LIMIT = 5

# Crawler Functions
def explore(url)
  if $CRAWLED_PAGES_COUNTER > $CRAWLED_PAGES_LIMIT
    return
  end
  $CRAWLED_PAGES_COUNTER += 1
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do |link|
    puts link
    explore(link)
  end
end

def eval_page(page)
  puts page.title
end

# Start Crawling
explore($START_URL)
Just to give you something to build from, this is a simple spider that only harvests and visits links. Modifying it to do other things would be easy.
require 'nokogiri'
require 'open-uri'
require 'set'

BASE_URL = 'http://example.com'
URL_FORMAT = '%s://%s:%s'
SLEEP_TIME = 30 # in seconds

urls = [BASE_URL]
last_host = BASE_URL
visited_urls = Set.new
visited_hosts = Set.new

until urls.empty?
  this_uri = URI.join(last_host, urls.shift)
  next if visited_urls.include?(this_uri)

  puts "Scanning: #{this_uri}"

  doc = Nokogiri::HTML(this_uri.open)
  visited_urls << this_uri

  if visited_hosts.include?(this_uri.host)
    puts "Sleeping #{SLEEP_TIME} seconds to reduce server load..."
    sleep SLEEP_TIME
  end

  visited_hosts << this_uri.host

  urls += doc.search('[href]').map { |node|
    node['href']
  }.select { |url|
    extension = File.extname(URI.parse(url).path)
    extension[/\.html?$/] || extension.empty?
  }

  last_host = URL_FORMAT % [:scheme, :host, :port].map { |s| this_uri.send(s) }
  puts "#{urls.size} URLs remain."
end
It:
Works on http://example.com. That site is designed and designated for experimenting.
Checks to see if a page was visited previously and won't scan it again. It's a naive check and will be fooled by URLs containing queries or queries that are not in a consistent order.
Checks to see if a site was previously visited and automatically throttles the page retrieval if so. It could be fooled by aliases.
Checks to see if a page ends with ".htm", ".html" or has no extension. Anything else is ignored.
The actual code to write an industrial-strength spider is much more involved: robots.txt files need to be honored, figuring out how to deal with pages that redirect to other pages, either via HTTP timeouts or JavaScript redirects, is a fun task, and dealing with malformed pages is a challenge...
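For instance, honoring robots.txt could start from something as small as the following naive check. This is only a sketch to illustrate the idea, not part of the spider above; a real crawler should use a dedicated robots.txt parser.

require 'net/http'
require 'uri'

# Naive robots.txt check: only understands "User-agent: *" groups and literal
# "Disallow:" path prefixes, and fails open if robots.txt can't be fetched.
def allowed_by_robots?(uri)
  robots_uri = URI.join("#{uri.scheme}://#{uri.host}/", 'robots.txt')
  res = Net::HTTP.get_response(robots_uri)
  return true unless res.is_a?(Net::HTTPSuccess)

  in_star_group = false
  res.body.each_line do |line|
    line = line.split('#').first.to_s.strip   # drop comments and whitespace
    next if line.empty?
    field, value = line.split(':', 2).map(&:strip)
    next unless field && value
    case field.downcase
    when 'user-agent'
      in_star_group = (value == '*')
    when 'disallow'
      return false if in_star_group && !value.empty? && uri.path.start_with?(value)
    end
  end
  true
rescue StandardError
  true
end

puts allowed_by_robots?(URI.parse('http://example.com/some/page.html'))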
Hello, I'm new to Chef and Ruby. I'm trying to write a Chef recipe that creates a cron job on a server based on the value of a variable that I get inside my Ruby code.
Gem.clear_paths
node.default["value"] = "nil"

require 'net/http'

ruby_block "do-http-request-with-cutom-header" do
  block do
    Net::HTTP.get('example.com', '/index.html') # => String
    uri = URI('http://example.com/index.html')
    params = { :limit => 10, :page => 3 }
    uri.query = URI.encode_www_form(params)
    res = Net::HTTP.get_response(uri)
    puts res.body if res.is_a?(Net::HTTPSuccess)
    value = res.code
    node["value"] = value
  end
end
if node["value"] == "nil" then
cron "cassandra repair job" do
action :delete
end
else
cron "cassandra repair job" do
hour "0"
minute "55"
weekday node["value"]
mailto "root#localhost"
user "root"
command "/opt/cassandra/bin/nodetool repair -par -inc -pr"
end
end
I know that Chef has lazy evaluation and that the Ruby code in a ruby_block executes during the converge phase, but I cannot figure out how to modify my code.
How can I use lazy evaluation in my code?
node['value'] = value would create an attribute at the normal level; the caveat is that it is saved on the node object and stays there forever.
Since you're using a volatile value coming from an external source, you should use node.run_state['value'], whose purpose is to keep transient values during the run.
Now, as you said, you need to use lazy evaluation in your later resources, both for the value and for the action, since you want a different action depending on what the external service returns.
Example with the updated use of run_state (untested code):
node.run_state['value'] = nil
node.run_state['action'] = :delete

require 'net/http'

ruby_block "do-http-request-with-cutom-header" do
  block do
    Net::HTTP.get('example.com', '/index.html') # => String
    uri = URI('http://example.com/index.html')
    params = { :limit => 10, :page => 3 }
    uri.query = URI.encode_www_form(params)
    res = Net::HTTP.get_response(uri)
    puts res.body if res.is_a?(Net::HTTPSuccess)
    value = res.code
    node.run_state['value'] = value
    node.run_state['action'] = :create
  end
end

cron 'cassandra repair job' do
  hour '0'
  minute '55'
  weekday lazy { node.run_state['value'] }
  mailto 'root@localhost'
  user 'root'
  command '/opt/cassandra/bin/nodetool repair -par -inc -pr'
  action lazy { node.run_state['action'] }
end
Using lazy on the action parameter is possible since Chef 12.4; if you're on an earlier version, you'll have to craft the resource and run it within the ruby_block.
Example crafted from answer here (still untested):
ruby_block "do-http-request-with-cutom-header" do
block do
Net::HTTP.get('example.com', '/index.html') # => String
uri = URI('http://example.com/index.html')
params = { :limit => 10, :page => 3 }
uri.query = URI.encode_www_form(params)
res = Net::HTTP.get_response(uri)
puts res.body if res.is_a?(Net::HTTPSuccess)
if res.code.nil?
r = Chef::Resource::Cron.new "cassandra repair job"
r.run_action :delete
else
r = Chef::Resource::Cron.new "cassandra repair job"
r.hour "0"
r.minute "55"
r.weekday res.code
r.mailto "root#localhost"
r.user "root"
r.command "/opt/cassandra/bin/nodetool repair -par -inc -pr"
r.run_action :create
end
end
end
But at this point, it seems you would be better off getting this information from a custom Ohai plugin and using your original code without lazy.
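For reference, such an Ohai plugin could look roughly like the untested sketch below; the plugin name, file path and attribute names are made up for illustration. It populates node['cassandra_repair']['weekday'] before recipes are compiled, so the cron resource could read it without lazy.

# Hypothetical plugin file, e.g. <cookbook>/ohai/cassandra_repair.rb
Ohai.plugin(:CassandraRepair) do
  provides 'cassandra_repair'

  collect_data(:default) do
    require 'net/http'
    require 'uri'

    cassandra_repair Mash.new
    begin
      uri = URI('http://example.com/index.html')
      uri.query = URI.encode_www_form(:limit => 10, :page => 3)
      res = Net::HTTP.get_response(uri)
      # mirror the original recipe: store the response code as the weekday value
      cassandra_repair['weekday'] = res.code if res.is_a?(Net::HTTPSuccess)
    rescue StandardError
      # leave the attribute empty if the external service is unreachable
    end
  end
end

The recipe could then use weekday node['cassandra_repair']['weekday'] directly and pick the :create or :delete action at compile time.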
I know that timeouts are currently not supported with Http.rb and Celluloid[1], but is there an interim workaround?
Here's the code I'd like to run:
def fetch(url, options = {})
  puts "Request -> #{url}"
  begin
    options = options.merge({ socket_class: Celluloid::IO::TCPSocket,
                              timeout_class: HTTP::Timeout::Global,
                              timeout_options: {
                                connect_timeout: 1,
                                read_timeout: 1,
                                write_timeout: 1
                              }
                            })
    HTTP.get(url, options)
  rescue HTTP::TimeoutError => e
    # [do more stuff]
  end
end
Its goal is to test whether a server is live and healthy. I'd be open to alternatives (e.g. %x(ping <server>)), but these seem less efficient and less able to get at what I'm looking for.
[1] https://github.com/httprb/http.rb#celluloidio-support
You can set a timeout on the future call when you fetch the request.
Here is how to use a timeout with Http.rb and celluloid-io:
require 'celluloid/io'
require 'http'

TIMEOUT = 10 # in sec

class HttpFetcher
  include Celluloid::IO

  def fetch(url)
    HTTP.get(url, socket_class: Celluloid::IO::TCPSocket)
  rescue Exception => e
    # error
  end
end

fetcher = HttpFetcher.new

urls = %w(http://www.ruby-lang.org/ http://www.rubygems.org/ http://celluloid.io/)

# Kick off a bunch of future calls to HttpFetcher to grab the URLs in parallel
futures = urls.map { |u| [u, fetcher.future.fetch(u)] }

# Consume the results as they come in
futures.each do |url, future|
  # Wait for HttpFetcher#fetch to complete for this request
  response = future.value(TIMEOUT)
  puts "*** Got #{url}: #{response.inspect}\n\n"
end
I've been playing with EventMachine for some days now, and it has a steep learning curve IMHO ;-). I'm trying to return a hash from HttpHeaderCrawler.query(), which I need within the callback, but what I get in that case is not the hash {'http_status' => xxx, 'http_version' => xxx} but an EventMachine::HttpClient object itself.
I want to keep the EM.run block clean and do all the logic within my own classes/modules, so how can I return such a value to the main loop and access it in the callback? Many thanks in advance ;-)
#!/usr/bin/env ruby

require 'eventmachine'
require 'em-http-request'

class HttpHeaderCrawler
  include EM::Deferrable

  def query(uri)
    http = EM::HttpRequest.new(uri).get
    http.callback do
      http_header = {
        "http_status" => http.response_header.http_status,
        "http_version" => http.response_header.http_version
      }
      puts "Returns to EM main loop: #{http_header}"
      succeed(http_header)
    end
  end
end

EM.run do
  domains = ['http://www.google.com', 'http://www.facebook.com', 'http://www.twitter.com']
  domains.each do |domain|
    hdr = HttpHeaderCrawler.new.query(domain)
    hdr.callback do |header|
      puts "Received from HttpHeaderCrawler: #{header}"
    end
  end
end
This snippet produces the following output:
Returns to EM main loop: {"http_status"=>302, "http_version"=>"1.1"}
Received from HttpHeaderCrawler: #<EventMachine::HttpClient:0x00000100d57388>
Returns to EM main loop: {"http_status"=>301, "http_version"=>"1.1"}
Received from HttpHeaderCrawler: #<EventMachine::HttpClient:0x00000100d551a0>
Returns to EM main loop: {"http_status"=>200, "http_version"=>"1.1"}
Received from HttpHeaderCrawler: #<EventMachine::HttpClient:0x00000100d56280>
I think the problem is that #query returns http.callback, which returns the http object itself, whereas it should return self, i.e. the HttpHeaderCrawler. See if this works:
def query(uri)
  http = EM::HttpRequest.new(uri).get
  http.callback do
    http_header = {
      "http_status" => http.response_header.http_status,
      "http_version" => http.response_header.http_version
    }
    puts "Returns to EM main loop: #{http_header}"
    succeed(http_header)
  end
  self
end
How can I send an HTTP GET request with parameters in Ruby?
I have tried a lot of examples, but all of them failed.
I know this post is old, but for the sake of those brought here by Google, there is an easier way to encode your parameters in a URL-safe manner. I'm not sure why I haven't seen this elsewhere, as the method is documented on the Net::HTTP page. I have seen the method described by Arsen7 as the accepted answer on several other questions as well.
Mentioned in the Net::HTTP documentation is URI.encode_www_form(params):
# Let's say we have a path and params that look like this:
path = "/search"
params = {:q => "answer"}

# Example 1: Replacing the #path_with_params method from Arsen7
def path_with_params(path, params)
  encoded_params = URI.encode_www_form(params)
  [path, encoded_params].join("?")
end

# Example 2: A shortcut for the entire example by Arsen7
uri = URI.parse("http://localhost.com" + path)
uri.query = URI.encode_www_form(params)

response = Net::HTTP.get_response(uri)
Which example you choose is very much dependent on your use case. In my current project I am using a method similar to the one recommended by Arsen7 along with the simpler #path_with_params method and without the block format.
# Simplified example implementation without response
# decoding or error handling.
require "net/http"
require "uri"

class Connection
  VERB_MAP = {
    :get    => Net::HTTP::Get,
    :post   => Net::HTTP::Post,
    :put    => Net::HTTP::Put,
    :delete => Net::HTTP::Delete
  }

  API_ENDPOINT = "http://dev.random.com"

  attr_reader :http

  def initialize(endpoint = API_ENDPOINT)
    uri = URI.parse(endpoint)
    @http = Net::HTTP.new(uri.host, uri.port)
  end

  def request(method, path, params)
    case method
    when :get
      full_path = path_with_params(path, params)
      request = VERB_MAP[method].new(full_path)
    else
      request = VERB_MAP[method].new(path)
      request.set_form_data(params)
    end

    http.request(request)
  end

  private

  def path_with_params(path, params)
    encoded_params = URI.encode_www_form(params)
    [path, encoded_params].join("?")
  end
end

con = Connection.new
con.request(:post, "/account", {:email => "test@test.com"})
=> #<Net::HTTPCreated 201 Created readbody=true>
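Since the question is about GET, here is the same Connection helper exercising the :get branch; the path and parameter are just placeholders:

con = Connection.new
res = con.request(:get, "/account", {:email => "test@test.com"})
# sends GET /account?email=test%40test.com to http://dev.random.com
puts res.code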
I assume that you understand the examples on the Net::HTTP documentation page but you do not know how to pass parameters to the GET request.
You just append the parameters to the requested address, in exactly the same way you would type such an address in the browser:
require 'net/http'

res = Net::HTTP.start('localhost', 3000) do |http|
  http.get('/users?id=1')
end

puts res.body
If you need some generic way to build the parameters string from a hash, you may create a helper like this:
require 'cgi'

def path_with_params(page, params)
  return page if params.empty?
  page + "?" + params.map { |k, v| CGI.escape(k.to_s) + '=' + CGI.escape(v.to_s) }.join("&")
end
path_with_params("/users", :id => 1, :name => "John&Sons")
# => "/users?name=John%26Sons&id=1"