Ruby open-uri, returns error when opening a png URL - ruby

I am writing a crawler that downloads the images of the Gantz manga, starting at http://manga.bleachexile.com/gantz-chapter-1.html and continuing from there.
It worked until the crawler tried to open an image in chapter 273:
bad URI(is not URI?): http://static.bleachexile.com/manga/gantz/273/Gantz[0273]_p001[Whatever-Illuminati].png
But I believe this URL is valid, because I can open it in Firefox. Any thoughts?
Partial code:
img_link = nav.page.image_urls.find {|x| x.include?("manga/gantz")}
img_name = RAILS_ROOT+"/public/#{nome}/#{cap}/"+nome+((template).sub('::cap::', cap.to_s).sub('::pag::', i.to_s))
img = File.new( img_name, 'w' )
img.write( open(img_link) {|f| f.read} )
img.close

It is not a valid URI. Only certain characters are allowed in URIs, and the square brackets in this path are not among them. Firefox, like all browsers, tries to do as much as possible for the user instead of complaining when a URL is not standards-compliant.
It is valid in the following form:
open("http://static.bleachexile.com/manga/gantz/273/Gantz%5B0273%5D_p001%5BWhatever-Illuminati%5D.png") # => #<File:/tmp/open-uri20100226-3342-clj08a-0>
You could try to escape it like this:
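uri.gsub(/\/.*/) do |t|
  # percent-encode every byte of each character outside the allowed set;
  # c.bytes is used because c[0] returns a String, not an Integer, on Ruby 1.9+
  t.gsub(/[^.\/a-zA-Z0-9\-_ ]/) do |c|
    c.bytes.map { |b| "%%%02X" % b }.join
  end.gsub(" ", "+")
end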
But be careful: if the website already uses correctly escaped URIs and you escape them a second time, the URIs won't point to the same location anymore.
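One way to sidestep the double-escaping problem is to leave existing %XX sequences alone and percent-encode only the characters that may not appear in a URI at all. The sketch below uses a hypothetical helper name, escape_unsafe, and assumes that any % already in the input belongs to a valid escape sequence; it is an illustration, not a complete URI normalizer.

```ruby
require "uri"

# Characters left untouched: RFC 3986 unreserved and reserved characters
# (except "[" and "]", which are only legal inside an IPv6 host) plus "%",
# so that sequences that are already escaped are not escaped again.
UNSAFE = /[^A-Za-z0-9\-._~:\/?@!$&'()*+,;=%#]/

# Hypothetical helper (not part of open-uri): percent-encode each byte of
# every character outside the allowed set.
def escape_unsafe(url)
  url.gsub(UNSAFE) { |c| c.bytes.map { |b| format("%%%02X", b) }.join }
end

raw = "http://static.bleachexile.com/manga/gantz/273/Gantz[0273]_p001[Whatever-Illuminati].png"
safe = escape_unsafe(raw)
puts safe
URI.parse(safe) # parses without raising URI::InvalidURIError
```

Because "%" is in the allowed set, running the helper on an already escaped URL returns it unchanged.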

Related

Get answer from Google Dictionary API using Ruby with Portuguese and accented characters

I'm trying to get results from the Google Dictionary API with Ruby. It works well with non-accented characters but does not work with accented characters (i.e. the URL as you would type it directly into the address bar of the browser).
If you use the Chrome browser, you get good answers with or without accents.
I already got past the problem of the URI parser not liking URLs with accents by using the following code:
require "addressable"
require "net/http"
begin
uri = Addressable::URI.convert_path('https://api.dictionaryapi.dev/api/v2/entries/pt-BR/há')
p uri
rescue => error
p error
end
response = Net::HTTP.get(uri)
p response
I get an empty response, while using the browser I get the correct response.
Can somebody suggest some workaround? What am I doing wrong?
I didn't dig deep into the addressable gem.
But here is a working example with URI and JSON:
require "net/http"
require "json"
begin
uri = URI.parse(URI.encode('https://api.dictionaryapi.dev/api/v2/entries/pt-BR/há'))
p uri
rescue => error
p error
end
response = Net::HTTP.get(uri)
p JSON.parse(response)
=>
[
{
"word"=>"ha",
"phonetics"=>[{}],
"meanings"=>[
{
"partOfSpeech"=>"undefined",
"definitions"=>[{"definition"=>"símb. de HECTARE.", "synonyms"=>[], "antonyms"=>[]}]}
]
},
{
"word"=>"hã",
"phonetics"=>[{}],
"origin"=>"⊙ ETIM voc.onom.",
"meanings"=>[
{
"partOfSpeech"=>"interjeição",
"definitions"=>[
{
"definition"=>"expressa reflexão, esclarecimento, admiração.", "synonyms"=>[], "antonyms"=>[]
}
]
}
]
}
]
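On Ruby 3.0 and later, URI.encode is no longer available, so the snippet above raises NoMethodError there. A minimal sketch of one alternative, assuming only the last path segment (the looked-up word) needs escaping; dictionary_uri is a hypothetical helper name, and the network call is left commented out:

```ruby
require "uri"
require "net/http"
require "json"

BASE = "https://api.dictionaryapi.dev/api/v2/entries/pt-BR/"

# Hypothetical helper: percent-encode the word as UTF-8 bytes and append it
# to the base URL taken from the question.
def dictionary_uri(word)
  URI.parse(BASE + URI.encode_www_form_component(word))
end

uri = dictionary_uri("há")
puts uri # the accented character is sent as percent-encoded UTF-8 bytes

# response = Net::HTTP.get(uri)
# p JSON.parse(response)
```

Note that URI.encode_www_form_component turns a space into "+", which is meant for form data; for words that contain spaces a different escaping function would be needed.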
Thanks for all your answers so far, but it seems that the problem is in the API (which is no longer maintained by Google).
What happens in your last example is that the word 'há' is transformed into 'ha', i.e. the accent is removed and the meaning is lost.
I will try another way.

why is content_length in Net::HTTP.get_response sometimes nil even on good results?

I have the following ruby code (was trying to write a simple http-ping)
require 'net/http'
res1 = Net::HTTP.get_response 'www.google.com' , '/'
res2 = Net::HTTP.get_response 'www.google.com' , '/search?q=abc'
res1.code #200
res2.code #200
res1.content_length #5213
res2.content_length #nil **<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< WHY**
res2.body[0..60]
=> "<!doctype html><html itemscope=\"\" itemtype=\"http://schema.org"
Why does res2's content_length not show up? Is it in some other attribute of res2 (and how does one see those)?
I am a newcomer to Ruby, using irb 0.9.6 on AWS Linux.
Thanks a lot.
It appears that the value returned is not necessarily the length of the body, but the fixed length of the content, when that fixed length is known in advance and stored in the content-length header.
See the source for the implementation of HTTPHeader#content_length (taken from http://ruby-doc.org/stdlib-2.3.1/libdoc/net/http/rdoc/Net/HTTPHeader.html):
# File net/http/header.rb, line 262
def content_length
return nil unless key?('Content-Length')
len = self['Content-Length'].slice(/\d+/) or
raise Net::HTTPHeaderSyntaxError, 'wrong Content-Length format'
len.to_i
end
What this most likely means here is that the server sent the response without a Content-Length header, for example because it used chunked transfer encoding (Transfer-Encoding: chunked), in which case the length is not known in advance.
What you most likely want in this case is body.length, since that's the only real way to tell the actual length of the response body when no Content-Length header is present.
Note that there may be performance implications in always using body.length to find the content length; you may choose to try the content_length approach first and, if it's nil, fall back to body.length.
Here's an example modification to your code:
require 'net/http'
res1 = Net::HTTP.get_response 'www.google.com' , '/'
res2 = Net::HTTP.get_response 'www.google.com' , '/search?q=abc'
res1.code #200
res2.code #200
res1.content_length #5213
res2.content_length.nil? ? res2.body.length : res2.content_length #57315 **<<<<<<<<<<<<<<< Works now **
res2.body[0..60]
=> "<!doctype html><html itemscope=\"\" itemtype=\"http://schema.org"
or, better yet, capture the content_length and use the captured value for comparison:
res2_content_length = res2.content_length
if res2_content_length.nil?
res2_content_length = res2.body.length
end
Personally, I'd just stick with always checking body.length and deal with any potential performance issue if and when it arises.
This should reliably retrieve the actual length of the content for you, regardless of whether the response carried a Content-Length header.
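The fallback described above can be wrapped in a small helper. This is a sketch under the assumption that the size in bytes is wanted (Content-Length counts bytes, so bytesize is used rather than length); response_length and dump_headers are hypothetical names.

```ruby
require "net/http"

# Hypothetical helper: use the Content-Length header when the server sent
# one, otherwise measure the body that was actually received.
def response_length(res)
  res.content_length || res.body.bytesize
end

# Hypothetical helper: print every response header, which also shows
# whether Content-Length or Transfer-Encoding is present.
def dump_headers(res)
  res.each_header { |name, value| puts "#{name}: #{value}" }
end
```

Calling dump_headers(res2) is one way to answer the "how does one see those?" part of the question, and response_length(res2) returns a number in both cases.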

Good request from browser but bad request from ruby?

I'm using the google custom search api and I'm trying to access it through some ruby code:
Here is a snippet of the code
req = Typhoeus::Request.new("https://www.googleapis.com/customsearch/v1?key={my_key}&cx=017576662512468239146:omuauf_lfve&q=" + keyword, followlocation: true)
res = req.run
It appears that the body of the answer is this one:
<p>Your client has issued a malformed or illegal request. <ins>That’s all we know.</ins>
'
from /usr/local/lib/ruby/2.1.0/json/common.rb:155:in `parse'
from main.rb:20:in `initialize'
from main.rb:41:in `new'
from main.rb:41:in `<main>'
When I try to do the same thing from the browser it works like a charm. Even more confusing is that this same code worked 12 hours ago. I only changed the keyword that it should look for, however it started returning the error.
Any suggestions? I'm sure that I have enough credits for more requests
You probably have problems with special characters in your GET parameter keyword. If you enter the URL in your browser, the browser escapes these for you. In Ruby, however, you need to escape these characters yourself, so that a string like "sky line" becomes "sky+line" and so on. There is a utility function, CGI::escape, which is used like this:
require 'cgi'
CGI::escape("sky line")
=> "sky+line"
Your fixed code would look something like this:
req = Typhoeus::Request.new("https://www.googleapis.com/customsearch/v1?key={my_key}&cx=017576662512468239146:omuauf_lfve&q=" + CGI::escape(keyword), followlocation: true)
res = req.run
However, since you're using Typhoeus anyway, you should be able to use its params parameter and let Typhoeus handle the escaping:
req = Typhoeus::Request.new(
  "https://www.googleapis.com/customsearch/v1?cx=017576662512468239146:omuauf_lfve",
  followlocation: true,
  params: {q: keyword, key: my_key}
)
res = req.run
There are more examples on Typhoeus' GitHub page.
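If you would rather not depend on the HTTP client for escaping, the query string can also be built with Ruby's standard library. A minimal sketch, in which search_uri is a hypothetical helper and the key and cx values are placeholders:

```ruby
require "uri"

# Hypothetical helper: build the request URI and let URI.encode_www_form
# escape every parameter value.
def search_uri(keyword, key:, cx:)
  uri = URI.parse("https://www.googleapis.com/customsearch/v1")
  uri.query = URI.encode_www_form(key: key, cx: cx, q: keyword)
  uri
end

uri = search_uri("sky line", key: "MY_KEY", cx: "MY_CX")
puts uri
```

The resulting URI object can be converted with to_s and passed to whichever HTTP client is in use.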

Making a URL in a string usable by Ruby's Net::HTTP

Ruby's Net::HTTP needs to be given a full URL in order for it to connect to the server and get the file properly. By "full URL" I mean a URL including the http:// part and the trailing slash if it needs it. For instance, Net::HTTP won't connect to a URL looking like this: example.com, but will connect just fine to http://example.com/. Is there any way to make sure a URL is a full URL, and add the required parts if it isn't?
EDIT: Here is the code I am using:
parsed_url = URI.parse(url)
req = Net::HTTP::Get.new(parsed_url.path)
res = Net::HTTP.start(parsed_url.host, parsed_url.port) {|http|
http.request(req)
}
If this is only doing what the sample code shows, Open-URI would be an easier approach.
require 'open-uri'
res = URI.open(url).read # Ruby 2.5+; as of Ruby 3.0 plain open(url) no longer opens URLs
This would do a simple check for http/https:
if !(url =~ /^https?:/i)
url = "http://" + url
end
This could be a more general one to handle multiple protocols (ftp, etc.):
# \w+ matches a whole scheme name such as "ftp", not just one character
if !(url =~ /^\w+:\/\//i)
url = "http://" + url
end
In order to make sure parsed_url.path gives you a proper value (it should be / when no specific path was provided), you could do something like this:
req = Net::HTTP::Get.new(parsed_url.path.empty? ? '/' : parsed_url.path)
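Putting the pieces together, here is a sketch of a helper that adds a missing scheme and fills in an empty path. normalize_url is a hypothetical name, and defaulting to http:// is an assumption.

```ruby
require "uri"

# Hypothetical helper: return a URI object that has a scheme and a
# non-empty path.
def normalize_url(url)
  # Prepend http:// unless the string already starts with "scheme://"
  url = "http://#{url}" unless url =~ %r{\A[a-z][a-z0-9+.-]*://}i
  uri = URI.parse(url)
  uri.path = "/" if uri.path.empty?
  uri
end

puts normalize_url("example.com")
puts normalize_url("https://example.com/search?q=abc")
```

With the result in hand, uri.request_uri (path plus query string) can be passed to Net::HTTP::Get.new in place of parsed_url.path, so that query parameters are not dropped.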

Is there a way to remove the BOM from a UTF-8 encoded file?

Is there a way to remove the BOM from a UTF-8 encoded file?
I know that all of my JSON files are encoded in UTF-8, but the data entry person who edited the JSON files saved it as UTF-8 with the BOM.
When I run my Ruby scripts to parse the JSON, it is failing with an error.
I don't want to manually open 58+ JSON files and convert to UTF-8 without the BOM.
With Ruby >= 1.9.2 you can use the mode r:bom|utf-8.
This should work (I haven't tested it in combination with JSON):
json = nil #define the variable outside the block to keep the data
File.open('file.txt', "r:bom|utf-8"){|file|
json = JSON.parse(file.read)
}
It doesn't matter whether the BOM is present in the file or not.
Andrew remarked that File#rewind can't be used with BOM.
If you need a rewind-function you must remember the position and replace rewind with pos=:
#Prepare test file
File.open('file.txt', "w:utf-8"){|f|
f << "\xEF\xBB\xBF" #add BOM
f << 'some content'
}
#Read file and skip BOM if available
File.open('file.txt', "r:bom|utf-8"){|f|
  pos = f.pos #position just after the BOM, if there is one
  p content = f.read #read and print file content
  f.pos = pos #f.rewind would go to pos 0, before the BOM
  p content = f.read #(re)read and print file content
}
So, the solution was to do a search and replace on the BOM via gsub!
I forced the encoding of the string to UTF-8 and also forced the regex pattern to be encoded in UTF-8.
I was able to derive a solution by looking at http://self.d-struct.org/195/howto-remove-byte-order-mark-with-ruby-and-iconv and http://blog.grayproductions.net/articles/ruby_19s_string
def read_json_file(file_name, index)
  # File.read opens the file, reads it, and closes it again
  content = File.read("#{file_name}\\game.json").force_encoding("UTF-8")
  content.gsub!("\xEF\xBB\xBF".force_encoding("UTF-8"), '')
  json = JSON.parse(content)
  print json
end
You can also specify the encoding with the File.read and CSV.read methods, where you don't have to specify the read mode.
File.read(path, :encoding => 'bom|utf-8')
CSV.read(path, :encoding => 'bom|utf-8')
The "bom|UTF-8" encoding works well if you only read the file once, but fails if you ever call File#rewind, as I was doing in my code. To address this, I did the following:
def ignore_bom
  # assumes @file was opened with UTF-8 encoding, so getc returns the BOM as one character
  return unless @file.pos == 0
  c = @file.getc
  @file.ungetc(c) if c && c != "\uFEFF"
end
which seems to work well. I'm not sure whether there are other similar characters to look out for, but they could easily be built into this method, which can be called any time you rewind or open.
Server side cleanup of utf-8 bom bytes that worked for me:
csv_text.gsub!("\xEF\xBB\xBF".force_encoding(Encoding::BINARY), '')
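If the goal is to fix the files once rather than to tolerate the BOM on every read, the files can also be rewritten in place. A minimal sketch, assuming the files are small enough to read into memory; strip_bom! is a hypothetical helper name.

```ruby
# The three bytes of the UTF-8 byte order mark, as a binary string
BOM = "\xEF\xBB\xBF".b

# Hypothetical helper: rewrite the file without its leading BOM.
# Returns true if the file was changed, false if it had no BOM.
def strip_bom!(path)
  bytes = File.binread(path)
  return false unless bytes.start_with?(BOM)
  File.binwrite(path, bytes.byteslice(BOM.bytesize..-1))
  true
end
```

It could then be applied to a whole directory with something like Dir.glob("path/to/json/**/*.json") { |path| strip_bom!(path) }, where the glob pattern is a placeholder for wherever the 58+ JSON files live.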

Resources