When trying to get this page:
resp = RestClient.get("http://www.radios.com.br/aovivo/XXXX/24924")
I get this error:
URI::InvalidURIError: bad URI(is not URI?): http://www.radios.com.br/aovivo/Radio-Gospel-Ajduk?s/24924
from /Users/danicuki/.rvm/rubies/ruby-2.0.0-p353/lib/ruby/2.0.0/uri/common.rb:176:in `split'
from /Users/danicuki/.rvm/rubies/ruby-2.0.0-p353/lib/ruby/2.0.0/uri/common.rb:211:in `parse'
I think this happens because the redirect URL in the response has an encoding problem. How can I fix it?
Non-ASCII characters in URIs must be urlencoded:
url = "http://www.radios.com.br/aovivo/XXXX/24924"
resp = RestClient.get(URI::encode(url))
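Note that URI::encode is no longer available in Ruby 3.0 and later. The following is a minimal sketch of the same idea without it, assuming that only bytes outside printable ASCII need escaping. escape_non_ascii is a hypothetical helper name and the example.com URL is a placeholder, not the page from the question.

```ruby
require 'uri'

# Hypothetical helper: percent-encode every byte outside printable ASCII
# (spaces, control characters and non-ASCII bytes). Existing %XX escapes
# are left untouched because "%" itself is printable ASCII.
def escape_non_ascii(url)
  url.b.gsub(/[^\x21-\x7E]/n) { |byte| format('%%%02X', byte.ord) }
     .force_encoding(Encoding::US_ASCII)
end

# Placeholder URL with a non-ASCII character and a space in the path.
escaped = escape_non_ascii("http://example.com/aovivo/R\u00E1dio Gospel/24924")
puts escaped
puts URI.parse(escaped).host
```

Working on bytes rather than characters keeps the helper independent of the string's declared encoding, at the cost of not validating that the input is well-formed UTF-8.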
You need to patch RestClient (as of version 2.1.0 this is not fixed yet):
RestClient::AbstractResponse.module_eval do
  alias _origin_follow_redirection _follow_redirection
  def _follow_redirection(new_args, &block)
    # cannot follow redirection if there is no location header
    raise exception_with_response unless headers[:location]
    # Fix URI::InvalidURIError: URI must be ascii only
    headers[:location] = URI::encode headers[:location]
    _origin_follow_redirection new_args, &block
  end
end
I'm using httparty to unshorten short URIs and I happened upon:
HTTParty.get('http://bit.ly/19NoFfn', limit: 50 )
which when expanded yields:
https://sublime.wbond.net/packages/PhpSpec Snippets
which obviously throws a: URI::InvalidURIError.
Would it be possible to pass some parameter to httparty so that it would automatically try to encode URIs before trying to follow them?
I sort of solved my issue:
def unshorten(uri)
  begin
    response = HTTParty.get(uri, limit: 50)
  rescue URI::InvalidURIError => error
    bad_uri = error.message.match(/^bad\sURI\(is\snot\sURI\?\)\:\s(.*)$/)[1]
    good_uri = URI.encode bad_uri
    response = self.unshorten good_uri
  end
  response
end
I don't feel particularly comfortable fetching the URI from the error message string but it seems there's no other way. Or is there? :)
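One possible alternative, sketched here under assumptions rather than as a confirmed HTTParty feature: turn off automatic redirect following (HTTParty documents a follow_redirects option), read the Location header yourself, and escape it before following it. The helper below covers only the escaping and resolving step and uses nothing beyond Ruby's standard library; next_location is a hypothetical name.

```ruby
require 'uri'

# Hypothetical helper: turn a raw Location header value into a parseable URI,
# resolving it against the URL that produced the redirect.
def next_location(current_url, location)
  # Percent-encode spaces, control characters and non-ASCII bytes.
  escaped = location.strip.b.gsub(/[^\x21-\x7E]/n) { |byte| format('%%%02X', byte.ord) }
  URI.join(current_url, escaped.force_encoding(Encoding::US_ASCII))
end

puts next_location('http://bit.ly/19NoFfn',
                   'https://sublime.wbond.net/packages/PhpSpec Snippets')
```

Because the result is resolved with URI.join, the same helper also copes with servers that send a relative Location value.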
I'm using Ruby 1.9.3 and would like to get the host name from the video URL below.
I tried this code:
require 'uri'
url = "https://ferrari-view.4me.it/view-share/playerp/?plContext=http://ferrari-%201363948628-stream.4mecloud.it/live/ferrari/ngrp:livegenita/manifest.f4m&cartellaConfig=http://ferrari-4me.weebo.it/static/player/config/&cartellaLingua=http://ferrari-4me.weebo.it/static/player/config/&poster=http://pusher.newvision.it:8080/resources/img1.jpg&urlSkin=http://ferrari-4me.weebo.it/static/player/swf/skin.swf?a=1363014732171&method=GET&target_url=http://ferrari-4me.weebo.it/static/player/swf/player.swf&userLanguage=IT&styleTextColor=#000000&autoPlay=true&bufferTime=2&isLive=true&highlightColor=#eb2323&gaTrackerList=UA-23603234-4"
puts URI.parse(url).host
It throws the exception URI::InvalidURIError: bad URI(is not URI?):
I then tried encoding the URL before parsing it, as below:
puts URI.parse(URI.parse(url)).host
It throws the same exception, URI::InvalidURIError: bad URI(is not URI?)
But the code above works for the URL below.
url = "http://www.youtube.com/v/GpQDa3PUAbU?version=3&autohide=1&autoplay=1"
How can I fix this? Any suggestions, please.
Thanks
This URL is not valid, but it works in a browser because the browser itself is less strict about special characters like :, /, etc.
You should encode your URI first:
encoded_url = URI.encode(url)
And then parse it
URI.parse(encoded_url)
Addressable::URI is a better, more rfc-compliant replacement for URI:
require "addressable/uri"
Addressable::URI.parse(url).host
#=> "ferrari-view.4me.it"
Run gem install addressable first.
Try this:
safeurl = URI.encode(url.strip)
response = RestClient.get(safeurl)
Your URI query is not valid. Several characters must be encoded with URI::encode() before they can appear in it; for instance, # and the space character are not valid in a query.
Below is a working version of your code:
require 'uri'
plContext = URI::encode("http://ferrari-%201363948628-stream.4mecloud.it/live/ferrari/ngrp:livegenita/manifest.f4m")
cartellaConfig = URI::encode("http://ferrari-4me.weebo.it/static/player/config/")
cartellaLingua = URI::encode("http://ferrari-4me.weebo.it/static/player/config/")
poster = URI::encode("http://pusher.newvision.it:8080/resources/img1.jpg")
urlSkin = URI::encode("http://ferrari-4me.weebo.it/static/player/swf/skin.swf?a=1363014732171")
target_url = URI::encode("http://ferrari-4me.weebo.it/static/player/swf/player.swf")
url = "https://ferrari-view.4me.it/view-share/playerp/?"
url << "plContext=#{plContext}"
url << "&cartellaConfig=#{cartellaConfig}"
url << "&cartellaLingua=#{cartellaLingua}"
url << "&poster=#{poster}"
url << "&urlSkin=#{urlSkin}"
url << "&method=GET"
url << "&target_url=#{target_url}"
url << "&userLanguage=IT"
url << "&styleTextColor=#{URI::encode("#000000")}"
url << "&autoPlay=true&bufferTime=2&isLive=true&gaTrackerList=UA-23603234-4"
url << "&highlightColor=#{URI::encode("#eb2323")}"
puts url
puts URI.parse(url).host
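As a variation on the hand-built query above, the standard library's URI.encode_www_form can escape every value in one step. This is only a sketch: it uses a subset of the parameters from the question, and it treats each value as raw text, so the literal %20 inside plContext is itself escaped.

```ruby
require 'uri'

# A subset of the query parameters from the question, as raw values.
params = {
  'plContext'      => 'http://ferrari-%201363948628-stream.4mecloud.it/live/ferrari/ngrp:livegenita/manifest.f4m',
  'urlSkin'        => 'http://ferrari-4me.weebo.it/static/player/swf/skin.swf?a=1363014732171',
  'styleTextColor' => '#000000',
  'autoPlay'       => 'true'
}

# encode_www_form escapes reserved characters such as "#", ":", "/" and "?".
url = 'https://ferrari-view.4me.it/view-share/playerp/?' + URI.encode_www_form(params)
uri = URI.parse(url)

puts uri.host
# Decoding the query gives back the original values.
puts URI.decode_www_form(uri.query).to_h['styleTextColor']
```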
URI.parse is right: that URI is illegal. Just because it accidentally happens to work in your browser doesn't make it legal. You cannot parse that URI, because it isn't a URI.
uri = URI.parse(URI.encode(url.strip))
I found some solutions using post_connect_hook and pre_connect_hook, but it seems like they don't work. I'm using the latest Mechanize version (2.1). There are no [:response] fields in the new version, and I don't know where to get them in the new version.
https://gist.github.com/search?q=pre_connect_hooks
https://gist.github.com/search?q=post_connect_hooks
Is it possible to make Mechanize return a UTF8 encoded version, instead of having to convert it manually using iconv?
Since Mechanize 2.0, the arguments of pre_connect_hooks() and post_connect_hooks() have changed.
See the Mechanize documentation:
pre_connect_hooks()
A list of hooks to call before retrieving a response. Hooks are called with the agent, the URI, the response, and the response body.
post_connect_hooks()
A list of hooks to call after retrieving a response. Hooks are called with the agent, the URI, the response, and the response body.
You can no longer change the internal response-body value from a hook, because the argument is not an array anymore. So the next best way is to replace the internal parser with your own:
class MyParser
  def self.parse(thing, url = nil, encoding = nil, options = Nokogiri::XML::ParseOptions::DEFAULT_HTML, &block)
    # insert your conversion code here. For example:
    # thing = NKF.nkf("-wm0X", thing).sub(/Shift_JIS/,"utf-8") # you need to rewrite content charset if it exists.
    Nokogiri::HTML::Document.parse(thing, url, encoding, options, &block)
  end
end
agent = Mechanize.new
agent.html_parser = MyParser
page = agent.get('http://somewhere.com/')
...
I found a solution that works pretty well:
class HtmlParser
  def self.parse(body, url, encoding)
    body.encode!('UTF-8', encoding, invalid: :replace, undef: :replace, replace: '')
    Nokogiri::HTML::Document.parse(body, url, 'UTF-8')
  end
end
Mechanize.new.tap do |web|
  web.html_parser = HtmlParser
end
No issues were found yet.
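The transcoding step can be tried on its own, without Mechanize or Nokogiri. A minimal sketch, assuming the source encoding is already known; to_utf8 is a hypothetical helper name and Windows-1252 is just an example encoding.

```ruby
# Hypothetical helper: interpret the raw bytes as source_encoding and
# transcode them to UTF-8, dropping anything that cannot be converted.
def to_utf8(body, source_encoding)
  body.dup.force_encoding(source_encoding)
      .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: '')
end

latin = "caf\xE9".b # the bytes of "café" in Windows-1252
converted = to_utf8(latin, 'Windows-1252')
puts converted
puts converted.encoding
```

Unlike encode!, this version works on a copy, so the caller's string is left unmodified.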
In your script, just enter: page.encoding = 'utf-8'
However, depending on your scenario, you may alternatively need to enter the reverse (the encoding of the website Mechanize is working with) instead. For that, open Firefox, open the website you want Mechanize to work with, select Tools in the menubar, and then open Page Info. Determine what the page is encoded in from there.
Using that info, you would instead enter what the page is encoded in (such as page.encoding = 'windows-1252').
How about something like this:
class Mechanize
  alias_method :original_get, :get
  def get *args
    doc = original_get *args
    doc.encoding = 'utf-8'
    doc
  end
end
I'm using the following method to translate a simple word from English to Russian by calling:
translate("hello")
This is my method:
def translate(text)
  begin
    uri = "http://api.microsofttranslator.com/V2/Ajax.svc/GetTranslations?appId=#{@appid}&text=#{text.strip}&from=en&to=ru&maxTranslations=1"
    page = HTTParty.get(uri).body
    show_info = JSON.parse(page) # this line throws the error
  rescue
    puts $!
  end
end
The JSON output:
{"From":"en","Translations":[{"Count":0,"MatchDegree":100,"MatchedOriginalText":"","Rating":5,"TranslatedText":"Привет"}]}
The error:
unexpected token at '{"From":"en","Translations":[{"Count":0,"MatchDegree":100,"MatchedOriginalText":"","Rating":5,"TranslatedText":"Привет"}]}'
Not sure what it means by unexpected token. It's the only error I'm receiving. Unfortunately I can't modify the JSON output as it's returned by the API itself.
UPDATE:
Looks like the API is returning some illegal characters (bad Microsoft):
'´╗┐{"From":"en","Translations":[{"Count":0,"MatchDegree":0,"Matched OriginalText":"","Rating":5,"TranslatedText":"Hello"}]}'
Full error:
C:/Ruby193/lib/ruby/1.9.1/json/common.rb:148:in `parse': 743: unexpected token at '´╗┐{"From":"en","Translations":[{"Count":0,"MatchDegree":0,"Matched
OriginalText":"","Rating":5,"TranslatedText":"Hello"}]}' (JSON::ParserError)
from C:/Ruby193/lib/ruby/1.9.1/json/common.rb:148:in `parse'
from trans.rb:13:in `translate'
from trans.rb:17:in `<main>'
Try ensuring UTF-8 encoding and stripping any leading BOM indicators in the string:
# encoding: UTF-8
# ^-- Make sure this is on the first line!
def translate(text)
  begin
    uri = "http://api.microsofttranslator.com/V2/Ajax.svc/GetTranslations?appId=#{@appid}&text=#{text.strip}&from=en&to=ru&maxTranslations=1"
    page = HTTParty.get(uri).body
    page.force_encoding("UTF-8").gsub!("\xEF\xBB\xBF", '')
    show_info = JSON.parse(page) # previously raised JSON::ParserError
  rescue
    puts $!
  end
end
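The BOM handling can be checked without calling the API. A minimal sketch, assuming the body arrives as raw bytes that are valid UTF-8 apart from the leading byte order mark; parse_json_without_bom is a hypothetical helper and the sample payload is made up to resemble the one in the question.

```ruby
require 'json'

# Hypothetical helper: drop a leading UTF-8 byte order mark, then parse.
def parse_json_without_bom(raw)
  text = raw.dup.force_encoding(Encoding::UTF_8)
  JSON.parse(text.sub(/\A\uFEFF/, ''))
end

# Simulated response body: BOM bytes followed by a JSON document.
raw = "\xEF\xBB\xBF".b + '{"From":"en","TranslatedText":"\u041f\u0440\u0438\u0432\u0435\u0442"}'.b
result = parse_json_without_bom(raw)
puts result['From']
puts result['TranslatedText']
```

Anchoring the pattern with \A removes the mark only at the start of the body, so the same code point appearing later in the text is left alone.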
My users submit urls (to mixes on mixcloud.com) and my app uses them to perform web requests.
A good url returns a 200 status code:
uri = URI.parse("http://www.mixcloud.com/ErolAlkan/hard-summer-mix/")
request = Net::HTTP.get_response(uri)
#<Net::HTTPOK 200 OK readbody=true>
But if you forget the trailing slash then our otherwise good url returns a 301:
uri = "http://www.mixcloud.com/ErolAlkan/hard-summer-mix"
#<Net::HTTPMovedPermanently 301 MOVED PERMANENTLY readbody=true>
The same thing happens with 404's:
# bad path returns a 404
"http://www.mixcloud.com/bad/path/"
# bad path minus trailing slash returns a 301
"http://www.mixcloud.com/bad/path"
How can I 'drill down' into the 301 to see if it takes us on to a valid resource or an error page?
Is there a tool that provides a comprehensive overview of the rules that a particular domain might apply to their urls?
301 redirects are fairly common if you do not type the URL exactly as the web server expects it. They happen much more frequently than you'd think, you just don't normally ever notice them while browsing because the browser does all that automatically for you.
Two alternatives come to mind:
1: Use open-uri
open-uri handles redirects automatically. So all you'd need to do is:
require 'open-uri'
...
response = open('http://xyz...').read
If you have trouble redirecting between HTTP and HTTPS, then have a look here for a solution:
Ruby open-uri redirect forbidden
2: Handle redirects with Net::HTTP
def get_response_with_redirect(uri)
  r = Net::HTTP.get_response(uri)
  if r.code == "301"
    r = Net::HTTP.get_response(URI.parse(r['location']))
  end
  r
end
If you want to be even smarter, you could try adding or removing the trailing slash on the URL when you get a 404 response. You could do that by creating a method like get_response_smart which handles this URL fiddling in addition to the redirects.
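The URL fiddling part of that idea might look like the sketch below. toggle_trailing_slash is a hypothetical helper, not part of Net::HTTP, and it only rewrites the URL; deciding when to retry is left to the caller.

```ruby
require 'uri'

# Hypothetical helper: add a trailing slash to the path if it is missing,
# or remove it if it is present. Query and fragment are left alone.
def toggle_trailing_slash(url)
  uri = URI.parse(url)
  path = uri.path.empty? ? '/' : uri.path
  uri.path = path.end_with?('/') ? path.chomp('/') : "#{path}/"
  uri.to_s
end

puts toggle_trailing_slash('http://www.mixcloud.com/ErolAlkan/hard-summer-mix')
puts toggle_trailing_slash('http://www.mixcloud.com/ErolAlkan/hard-summer-mix/')
```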
I can't figure out how to comment on the accepted answer (this question might be closed), but I should note that r.header is now obsolete, so r.header['location'] should be replaced by r['location'] (per https://stackoverflow.com/a/6934503/1084675 )
rest-client follows redirections for GET and HEAD requests without any additional configuration. It works very nicely.
for result codes between 200 and 207, a RestClient::Response will be returned
for result codes 301, 302 or 307, the redirection will be followed if the request is a GET or a HEAD
for result code 303, the redirection will be followed and the request transformed into a GET
example of usage:
require 'rest-client'
RestClient.get 'http://example.com/resource'
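To make the rules listed above concrete, here is a small model of them as a pure function. It is an illustration of the described behaviour only, not rest-client's actual code; redirect_action and the returned symbols are hypothetical names, and the fallback branch assumes that every other case ends in an exception.

```ruby
# Hypothetical model of the redirect rules described above.
# status is the numeric HTTP status, http_method a symbol such as :get.
def redirect_action(status, http_method)
  case status
  when 200..207
    :return_response
  when 301, 302, 307
    # Followed automatically only for GET and HEAD requests.
    %i[get head].include?(http_method) ? :follow : :raise_error
  when 303
    # Followed, with the request turned into a GET.
    :follow_as_get
  else
    :raise_error
  end
end

puts redirect_action(301, :get)
puts redirect_action(301, :post)
puts redirect_action(303, :post)
```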
The rest-client README also gives an example of following redirects with POST requests:
begin
  RestClient.post('http://example.com/redirect', 'body')
rescue RestClient::MovedPermanently,
       RestClient::Found,
       RestClient::TemporaryRedirect => err
  err.response.follow_redirection
end
Here is the code I came up with (derived from different examples) which will bail out if there are too many redirects (note that ensure_success is optional):
require "net/http"
require "uri"
class Net::HTTPResponse
  def ensure_success
    unless kind_of? Net::HTTPSuccess
      warn "Request failed with HTTP #{@code}"
      each_header do |h,v|
        warn "#{h} => #{v}"
      end
      abort
    end
  end
end
def do_request(uri_string)
  response = nil
  tries = 0
  loop do
    uri = URI.parse(uri_string)
    http = Net::HTTP.new(uri.host, uri.port)
    request = Net::HTTP::Get.new(uri.request_uri)
    response = http.request(request)
    uri_string = response['location'] if response['location']
    unless response.kind_of? Net::HTTPRedirection
      response.ensure_success
      break
    end
    if tries == 10
      puts "Timing out after 10 tries"
      break
    end
    tries += 1
  end
  response
end
Not sure if anyone is looking for this exact solution, but if you are trying to download an image over HTTP or HTTPS and store it in a variable, this works:
require 'open_uri_redirections'
require 'net/https'
web_contents = open('file_url_goes_here', :ssl_verify_mode => OpenSSL::SSL::VERIFY_NONE, :allow_redirections => :all) {|f| f.read }
puts web_contents