For instance, the following example code returns 0:
require 'feedjira'
feed_parsed = Feedjira::Feed.fetch_and_parse("https://news.yahoo.com/rss/topstories")
puts feed_parsed
Set ssl_verify_peer to false and that successfully accesses the file. For instance:
require 'feedjira'
feed_parsed = Feedjira::Feed.fetch_and_parse("https://news.yahoo.com/rss/topstories", {:ssl_verify_peer => false})
puts feed_parsed
Related
I have a working program that searches Google using Mechanize, however when the program searches Google it also pulls sites that look something like http://webcache.googleusercontent.com/.
I would like to reject that site from being stored in the file. All the sites' URLs are structured differently.
Source code:
require 'mechanize'
PATH = Dir.pwd
SEARCH = "test"
def info(input)
puts "[INFO]#{input}"
end
def get_urls
info("Searching for sites.")
agent = Mechanize.new
page = agent.get('http://www.google.com/')
google_form = page.form('f')
google_form.q = "#{SEARCH}"
url = agent.submit(google_form, google_form.buttons.first)
url.links.each do |link|
if link.href.to_s =~ /url.q/
str = link.href.to_s
str_list = str.split(%r{=|&})
urls_to_log = str_list[1]
success("Site found: #{urls_to_log}")
File.open("#{PATH}/temp/sites.txt", "a+") {|s| s.puts("#{urls_to_log}")}
end
end
info("Sites dumped into #{PATH}/temp/sites.txt")
end
get_urls
Text file:
http://www.speedtest.net/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:M47_v0xF3m8J
http://www.speedtest.net/%252Btest%26gbv%3D1%26%26ct%3Dclnk
http://www.speedtest.net/results.php
http://www.speedtest.net/mobile/
http://www.speedtest.net/about.php
https://support.speedtest.net/
https://en.wikipedia.org/wiki/Test
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:R94CAo00wOYJ
https://en.wikipedia.org/wiki/Test%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.test.com/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:S92tylTr1V8J
https://www.test.com/%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.speakeasy.net/speedtest/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:sCEGhiP0qxEJ:https://www.speakeasy.net/speedtest/%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.google.com/webmasters/tools/mobile-friendly/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:WBvZnqZfQukJ:https://www.google.com/webmasters/tools/mobile-friendly/%252Btest%26gbv%3D1%26%26ct%3Dclnk
http://www.humanmetrics.com/cgi-win/jtypes2.asp
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:w_lAt3mgXcoJ:http://www.humanmetrics.com/cgi-win/jtypes2.asp%252Btest%26gbv%3D1%26%26ct%3Dclnk
http://speedtest.xfinity.com/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:snNGJxOQROIJ:http://speedtest.xfinity.com/%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.act.org/content/act/en/products-and-services/the-act/taking-the-test.html
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:1sMSoJBXydo
https://www.act.org/content/act/en/products-and-services/the-act/taking-the-test.html%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.16personalities.com/free-personality-test
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:SQzntHUEffkJ
https://www.16personalities.com/free-personality-test%252Btest%26gbv%3D%26%26ct%3Dclnk
https://www.xamarin.com/test-cloud
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:ypEu7XAFM8QJ:
https://www.xamarin.com/test-cloud%252Btest%26gbv%3D1%26%26ct%3Dclnk
It works now. I had issue with success('log'), i dont know why but commented it.
str_list = str.split(%r{=|&})
next if str_list[1].split('/')[2] == "webcache.googleusercontent.com"
# success("Site found: #{urls_to_log}")
File.open("#{PATH}/temp/sites.txt", "a+") {|s| s.puts("#{urls_to_log}")}
There are well-tested wheels used to tear apart URLs into the component parts so use them. Ruby comes with URI, which allows us to easily extract the host, path or query:
require 'uri'
URL = 'http://foo.com/a/b/c?d=1'
URI.parse(URL).host
# => "foo.com"
URI.parse(URL).path
# => "/a/b/c"
URI.parse(URL).query
# => "d=1"
Ruby's Enumerable module includes reject and select which make it easy to loop over an array or enumerable object and reject or select elements from it:
(1..3).select{ |i| i.even? } # => [2]
(1..3).reject{ |i| i.even? } # => [1, 3]
Using all that you could check the host of a URL for sub-strings and reject any you don't want:
require 'uri'
%w[
http://www.speedtest.net/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:M47_v0xF3m8J
].reject{ |url| URI.parse(url).host[/googleusercontent\.com$/] }
# => ["http://www.speedtest.net/"]
Using these methods and techniques you can reject or select from an input file, or just peek into single URLs and choose to ignore or honor them.
This is my hyperresource client:
require 'rubygems'
require 'hyperresource'
require 'json'
api = HyperResource.new(root: 'http://127.0.0.1:9393/todos',
headers: {'Accept' => 'application/vnd.127.0.0.1:9393/todos.v1+hal+json'})
string = '{"todo":{"title":"test"}}'
hash = JSON.parse(string)
api.post(hash)
puts hash
The hash output is: {"todo"=>{"title"=>"test"}}
At my Sinatra with Roar API I have this post function:
post "/todos" do
params.to_json
puts params
#todo = Todo.new(params[:todo])
if #todo.save
#todo.extend(TodoRepresenter)
#todo.to_json
else
puts 'FAIL'
end
end
My puts 'params' over here gets: {"{\"todo\":{\"title\":\"test\"}}"=>nil}
I found out, these are 'escaped strings' but I don't know where it goes wrong.
EDIT:
I checked my api with curl and postman google extension, both work fine. It's just hyperresource I guess
You are posting JSON, ergo you either need to register a Sinatra middleware that will automatically parse incoming JSON requests, or you need to do it yourself.
require 'rubygems'
require 'hyperresource'
require 'json'
api = HyperResource.new(root: 'http://127.0.0.1:9393/todos',
headers: {'Accept' => 'application/vnd.127.0.0.1:9393/todos.v1+hal+json'})
string = '{"todo":{"title":"test"}}'
hash = JSON.parse(string)
api.post({:data => hash})
puts hash
---
post "/todos" do
p = JSON.parse(params[:data])
puts p.inspect
#todo = Todo.new(p[:todo])
if #todo.save
#todo.extend(TodoRepresenter)
#todo.to_json
else
puts 'FAIL'
end
end
Should do what you need.
I am using Ruby curb to call multiple urls at once, e.g.
require 'rubygems'
require 'curb'
easy_options = {:follow_location => true}
multi_options = {:pipeline => true}
Curl::Multi.get(['http://www.example.com','http://www.trello.com','http://www.facebook.com','http://www.yahoo.com','http://www.msn.com'], easy_options, multi_options) do|easy|
# do something interesting with the easy response
puts easy.last_effective_url
end
The problem I have is I want to break the subsequent async calls when any url timeout occurred, is it possible?
As far as I know the current API doesn't expose the Curl::Multi instance, since otherwise you could do:
stop_everything = proc { multi.cancel! }
multi = Curl::Multi.get(array_of_urls, on_failure: stop_everything)
The easiest way might be to patch the Curl::Multi.http to return the m variable.
See https://github.com/taf2/curb/blob/master/lib/curl/multi.rb#L85
I think this will do exactly what you ask for:
require 'rubygems'
require 'curb'
responses = {}
requests = ['http://www.example.com','http://www.trello.com','http://www.facebook.com','http://www.yahoo.com','http://www.msn.com']
m = Curl::Multi.new
requests.each do |url|
responses[url] = ""
c = Curl::Easy.new(url) do|curl|
curl.follow_location = true
curl.on_body{|data| responses[url] << data; data.size }
curl.on_success {|easy| puts easy.last_effective_url }
curl.on_failure {|easy| puts "ERROR:#{easy.last_effective_url}"; #should_stop = true}
end
m.add(c)
end
m.perform { m.cancel! if #should_stop }
There seem to be several options for establishing Redis connections for use within EventMachine, and I'm having a hard time understanding the core differences between them.
My goal is to implement Redis within Goliath
The way I establish my connection now is through em-synchrony:
require 'em-synchrony'
require 'em-synchrony/em-redis'
config['redis'] = EventMachine::Synchrony::ConnectionPool.new(:size => 20) do
EventMachine::Protocols::Redis.connect(:host => 'localhost', :port => 6379)
end
What is the difference between the above, and using something like em-hiredis?
If I'm using Redis for sets and basic key:value storage, is em-redis the best solution for my scenario?
We use em-hiredis very successfully inside Goliath. Here's a sample of how we coded publishing:
config/example_api.rb
# These give us direct access to the redis connection from within the API
config['redisUri'] = 'redis://localhost:6379/0'
config['redisPub'] ||= EM::Hiredis.connect('')
example_api.rb
class ExampleApi < Goliath::API
use Goliath::Rack::Params # parse & merge query and body parameters
use Goliath::Rack::Formatters::JSON # JSON output formatter
use Goliath::Rack::Render # auto-negotiate response format
def response(env)
env.logger.debug "\n\n\nENV: #{env['PATH_INFO']}"
env.logger.debug "REQUEST: Received"
env.logger.debug "POST Action received: #{env.params} "
#processing of requests from browser goes here
resp =
case env.params["action"]
when 'SOME_ACTION' then process_action(env)
when 'ANOTHER_ACTION' then process_another_action(env)
else
# skip
end
env.logger.debug "REQUEST: About to respond with: #{resp}"
[200, {'Content-Type' => 'application/json', 'Access-Control-Allow-Origin' => "*"}, resp]
end
# process an action
def process_action(env)
# extract message data
data = Hash.new
data["user_id"], data["object_id"] = env.params['user_id'], env.params['object_id']
publishData = { "action" => 'SOME_ACTION_RECEIVED',
"data" => data }
redisPub.publish("Channel_1", Yajl::Encoder.encode(publishData))
end
end
return data
end
# process anothr action
def process_another_action(env)
# extract message data
data = Hash.new
data["user_id"], data["widget_id"] = env.params['user_id'], env.params['widget_id']
publishData = { "action" => 'SOME_OTHER_ACTION_RECEIVED',
"data" => data }
redisPub.publish("Channel_1", Yajl::Encoder.encode(publishData))
end
end
return data
end
end
Handling subscriptions are left as an exercise for the reader.
what em-synchrony does is patch the em-redis gem to allow using it with fibers which effectively allows it to run in goliath.
Here is a project using Goliath + Redis which can guide you on how to make all this works: https://github.com/igrigorik/mneme
Example with em-hiredis, what goliath do is wrap your request in a fiber so a way to test it is:
require 'rubygems'
require 'bundler/setup'
require 'em-hiredis'
require 'em-synchrony'
EM::run do
Fiber.new do
## this is what you can use in goliath
redis = EM::Hiredis.connect
p EM::Synchrony.sync redis.keys('*')
## end of goliath block
end.resume
end
and the Gemfile I used:
source :rubygems
gem 'em-hiredis'
gem 'em-synchrony'
If you run this example you will get the list of defined keys in your redis database printed on screen.
Without the EM::Synchrony.sync call you would get a deferrable but here the fiber is suspended until the calls return and you get the result.
I'm running a simple thin server, that publish some messages to different queues, the code looks like :
require "rubygems"
require "thin"
require "amqp"
require 'msgpack'
app = Proc.new do |env|
params = Rack::Request.new(env).params
command = params['command'].strip rescue "no command"
number = params['number'].strip rescue "no number"
p command
p number
AMQP.start do
if command =~ /\A(create|c|r|register)\z/i
MQ.queue("create").publish(number)
elsif m = (/\A(Answer|a)\s?(\d+|\d+-\d+)\z/i.match(command))
MQ.queue("answers").publish({:number => number,:answer => "answer" }.to_msgpack )
end
end
[200, {'Content-Type' => "text/plain"} , command ]
end
Rack::Handler::Thin.run(app, :Port => 4001)
Now when I run the server, and do something like http://0.0.0.0:4001/command=r&number=123123123
I'm always getting duplicate outputs, something like :
"no command"
"no number"
"no command"
"no number"
The first thing is why I'm getting like duplicate requests ? is it something has to do with the browser ? since when I use curl I'm not having the same behavior , and the second thing why I can't get the params ?
Any tips about the best implementation for such a server would be highly appreciated
Thanks in advance .
The second request comes from the browser looking for the favicon.ico. You can inspect the requests by adding the following code in your handler:
params = Rack::Request.new(env).params
p env # add this line to see the request in your console window
Alternatively you could use Sinatra:
require "rubygems"
require "amqp"
require "msgpack"
require "sinatra"
get '/:command/:number' do
command = params['command'].strip rescue "no command"
number = params['number'].strip rescue "no number"
p command
p number
AMQP.start do
if command =~ /\A(create|c|r|register)\z/i
MQ.queue("create").publish(number)
elsif m = (/\A(Answer|a)\s?(\d+|\d+-\d+)\z/i.match(command))
MQ.queue("answers").publish({:number => number,:answer => "answer" }.to_msgpack )
nd
end
return command
end
and then run ruby the_server.rb at the command line to start the http server.