Ruby 1.9.3 add unsafe characters to URI.escape - ruby

I am using Sinatra and get parameters from the URL using the get '/foo/:bar' {} method. Unfortunately, the value in :bar can contain nasty things like /, which leads to a 404, since no route matches /foo/:bar/baz/. I use URI.escape to escape the URL parameter, but it considers / a valid character. As mentioned here, this is because the default Regexp to check against does not differentiate between unsafe and reserved characters. I would like to change this and did this:
URI.escape("foo_<_>_&_3_#_/_+_%_bar", Regexp.union(URI::REGEXP::UNSAFE, '/'))
just to test it.
URI::REGEXP::UNSAFE is the default regexp to match against according to the Ruby 1.9.3 documentation:
escape(*arg)
Synopsis
URI.escape(str [, unsafe])
Args
str
String to replaces in.
unsafe
Regexp that matches all symbols that must be replaced with
codes. By default uses REGEXP::UNSAFE. When this argument is
a String, it represents a character set.
Description
Escapes the string, replacing all unsafe characters with codes.
Unfortunately, I get this error:
uninitialized constant URI::REGEXP::UNSAFE
And as this GitHub issue suggests, this Regexp was removed from Ruby with 1.9.3. The URI module's documentation is generally kind of bad, and I really cannot figure this out. Any hints?
Thanks in advance!

URI.escape is not what you are looking for. You want CGI.escape:
require 'cgi'
CGI.escape("foo_<_>_&_3_#_/_+_%_bar")
# => "foo_%3C_%3E_%26_3_%23_%2F_%2B_%25_bar"
This will properly encode it to allow Sinatra to retrieve it.
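As a rough sketch of how the escaped value could be used, the snippet below builds a path for the '/foo/:bar' route and checks that the value round-trips through CGI.unescape. The variable names are only illustrative, and whether a given server or router decodes %2F inside a path segment is an assumption you would need to verify for your setup. Note also that CGI.escape encodes a space as "+".

```ruby
require 'cgi'

# Hypothetical value destined for the :bar segment of '/foo/:bar'.
value = "foo_<_>_&_3_#_/_+_%_bar"

# CGI.escape percent-encodes reserved characters too, including "/".
escaped = CGI.escape(value)
path = "/foo/#{escaped}"

# CGI.unescape reverses the encoding.
restored = CGI.unescape(escaped)

puts path
puts restored == value
```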

Perhaps you would have better luck with CGI.escape?
>> require 'uri'; URI.escape("foo_<_>_&_3_#_/_+_%_bar")
=> "foo_%3C_%3E_&_3_%23_/_+_%25_bar"
>> require 'cgi'; CGI.escape("foo_<_>_&_3_#_/_+_%_bar")
=> "foo_%3C_%3E_%26_3_%23_%2F_%2B_%25_bar"

Related

Simple alternative to URI.escape

While using URI.parse on a URL, I was confronted with an error message:
URI::InvalidURIError: URI must be ascii only
I found a StackOverflow question that recommended using URI.escape, which works. Using the URL in that question as an example:
URI.parse('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
=> URI::InvalidURIError: URI must be ascii only "http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/\u041E\u0443\u044D\u043D-\u041C\u044D\u0442\u044C\u044E\u0441.jpg"
URI.encode('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
=> "http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/%D0%9E%D1%83%D1%8D%D0%BD-%D0%9C%D1%8D%D1%82%D1%8C%D1%8E%D1%81.jpg"
However, URI.escape is obsolete, as Rubocop warns:
URI.escape method is obsolete and should not be used. Instead, use CGI.escape, URI.encode_www_form or URI.encode_www_form_component depending on your specific use case.
But while URI.escape gives us a usable result, the alternatives don’t:
CGI.escape('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
=> "http%3A%2F%2Fdxczjjuegupb.cloudfront.net%2Fwp-content%2Fuploads%2F2017%2F10%2F%D0%9E%D1%83%D1%8D%D0%BD-%D0%9C%D1%8D%D1%82%D1%8C%D1%8E%D1%81.jpg"
This is a bother because in my case I’m constructing a URL from data I get via Nokogiri:
my_url = page.at('.someclass').at('img').attr('src')
Since I only need to escape the last part of the resulting URL, but CGI.escape and similar transform the whole string (including necessary characters, such as : and /), getting the escaped result now becomes a multiple-lines-of-code ordeal, having to split the path and using several variables to achieve what could be previously done with a single function (URI.escape).
Is there a simple alternative I’m not seeing? It needs to be done without external gems.
I tend to use Addressable for parsing URLs since the standard URI has flaws:
require 'addressable'
Addressable::URI.parse('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
#<Addressable::URI:0x3fc37ecc1c40 URI:http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg>
Addressable::URI.parse('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg').path
# "/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg"
It isn't part of the Ruby core or the standard library but it should be and it always ends up in my Gemfiles.
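If external gems are not an option, one possible standard-library sketch is to split the URL at its last slash and escape only the final segment. The helper name escape_last_segment is made up for this example, and the approach assumes that only the file name contains non-ASCII characters. URI.encode_www_form_component encodes a space as "+", which is not what a path usually wants, so treat this as a starting point rather than a general replacement for URI.escape.

```ruby
require 'uri'

# Hypothetical helper: percent-encode only the part after the last "/".
def escape_last_segment(url)
  head, sep, tail = url.rpartition('/')
  head + sep + URI.encode_www_form_component(tail)
end

url = 'http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg'
escaped = escape_last_segment(url)

# The scheme, host and directory part are left untouched, so URI.parse accepts it.
parsed = URI.parse(escaped)
puts escaped
puts parsed.host
```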

Remove &thinsp; from Ruby String

I am trying to parse some data and am having trouble cleaning out a &thinsp; symbol. I know that this is just a "space", but I am really having trouble removing it from the string.
My code:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('my_page.hmtl')
price = page.search('#product_buy .price').text.to_s.gsub(/\s+/, "").gsub("&thinsp;","").gsub(" ", "")
puts price
As a result I always get "4 162", with those spaces still in it. I don't know what to do.
Please help if you have run into this issue before. Thank you.
HTML escape codes don't mean anything to Ruby's regex engine. Looking for "&thinsp;" will look for those literal characters, not a thin space. Instead, versions of Ruby >= 1.9 support Unicode escapes in strings, meaning that you can use the Unicode code point corresponding to a thin space to make your substitution. The Unicode code point for a thin space is 0x2009, meaning that you can reference it in a double-quoted Ruby string as \u2009.
Additionally, instead of calling some_string.gsub('some_string', ''), you can just call some_string.delete('some_string').
Note that this isn't appropriate for all situations, because delete removes every occurrence of every character listed in its arguments, while gsub removes only segments matching the pattern provided. For example, 'hellohi'.gsub('hello', '') == 'hi', while 'hellohi'.delete('hello') == 'i'.
In your specific case, I'd use something like:
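# Double quotes are required so that \u2009 (thin space) and \s (space) are interpreted as escapes.
price = page.search('#product_buy .price').text.delete("\u2009\s")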
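To make the difference concrete, here is a small self-contained sketch that does not need Mechanize. The sample string is made up and stands in for the scraped text. It shows that \s in a regex does not match the thin space, that delete with a double-quoted "\u2009\s" does remove it, and that the POSIX bracket [[:space:]] is another option that covers Unicode whitespace.

```ruby
# Hypothetical scraped text: a thin space (U+2009) inside, a regular space at the end.
price_text = "4\u2009162 "

# \s only matches ASCII whitespace, so the thin space survives.
ascii_only = price_text.gsub(/\s+/, "")

# delete removes every character listed: thin space and regular space.
deleted = price_text.delete("\u2009\s")

# [[:space:]] also matches Unicode whitespace such as U+2009.
posix = price_text.gsub(/[[:space:]]/, "")

# delete works on a set of characters, gsub on a whole pattern.
gsub_result = 'hellohi'.gsub('hello', '')
delete_result = 'hellohi'.delete('hello')

puts deleted
puts posix
```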

Convert Ruby string to *nix filename-compatible string

In Ruby I have an arbitrary string, and I'd like to convert it to something that is a valid Unix/Linux filename. It doesn't matter what it looks like in its final form, as long as it is visually recognizable as the string it started as. Some possible examples:
"Here's my string!" => "Heres_my_string"
"* is an asterisk, you see" => "is_an_asterisk_you_see"
Is there anything built-in (maybe in the file libraries) that will accomplish this (or close to this)?
By your specifications, you could accomplish this with a regex replacement. This regex will match all characters other than basic letters and digits:
s/[^\w\s_-]+//g
This will remove any extra whitespace in between words, as shown in your examples:
s/(^|\b\s)\s+($|\s?\b)/\\1\\2/g
And lastly, replace the remaining spaces with underscores:
s/\s+/_/g
Here it is in Ruby:
def friendly_filename(filename)
  filename.gsub(/[^\w\s_-]+/, '')
          .gsub(/(^|\b\s)\s+($|\s?\b)/, '\\1\\2')
          .gsub(/\s+/, '_')
end
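For the two example strings in the question, a shorter variant appears to give the same results: strip the unwanted characters, trim the ends, and collapse the remaining whitespace runs into underscores. This is only a sketch, not the answer's method, and the name simple_filename is made up. Since \w is ASCII-only in Ruby regexes, non-ASCII letters are dropped as well.

```ruby
# Hypothetical simplified variant of the approach above.
def simple_filename(name)
  name.gsub(/[^\w\s-]+/, '') # drop everything except word chars, whitespace and "-"
      .strip                 # trim leading/trailing whitespace
      .gsub(/\s+/, '_')      # collapse whitespace runs into a single underscore
end

puts simple_filename("Here's my string!")
puts simple_filename("* is an asterisk, you see")
```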
First, I see that it was asked purely in ruby, and second that it's not the same purpose (*nix filename compatible), but if you are using Rails, there is a method called parameterize that should help.
In rails console:
"Here's my string!".parameterize => "here-s-my-string"
"* is an asterisk, you see".parameterize => "is-an-asterisk-you-see"
I think that parameterize, since its output is URL-safe, may work for filenames as well :)
You can read more about it here:
http://api.rubyonrails.org/classes/ActiveSupport/Inflector.html#method-i-parameterize
There are also a lot of other helpful methods there.

Ruby 1.9 regex encoding

I am parsing this feed http://www.sixapart.com/labs/update/developers/ with nokogiri and then running some regex on the contents of some tags. The content is UTF-8 mostly, but is occasionally corrupt. However, for my case I don't really care and just need to pass the right parts of the content through, so I'm happy to treat the data as binary/ASCII-8BIT. The problem is that no matter what I do, regexes in my script are treated as either UTF-8 or ASCII. No matter what I set the encoding comment to, or what I do to create the regex.
Is there a solution to this? Can I force the regex to binary? Can I do a gsub without a regex easily? (I am just replacing & with &amp;)
You need to encode the initial string and use the FIXEDENCODING option.
1.9.3-head :018 > r = Regexp.new("chars".force_encoding("binary"), Regexp::FIXEDENCODING)
=> /chars/
1.9.3-head :019 > r.encoding
=> #<Encoding:ASCII-8BIT>
Strings carry an encoding. Try calling String#force_encoding on the string before applying the regex.
UPD: To make your regexp ASCII, see the accepted answer here: Ruby 1.9: Regular Expressions with unknown input encoding
def get_regex(pattern, encoding='ASCII', options=0)
  Regexp.new(pattern.encode(encoding), options)
end
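Putting the pieces together, the sketch below treats the input as binary and replaces & with &amp; in two ways: with a plain string pattern, and with a binary regex built using Regexp::FIXEDENCODING. The sample data is invented and contains bytes that are not valid UTF-8. String#b returns a binary (ASCII-8BIT) copy of a string.

```ruby
# Hypothetical input: mostly text, with bytes that are not valid UTF-8.
data = "caf\xE9 & cr\xE8me".b

# Option 1: a plain string pattern, no regex needed.
escaped = data.gsub("&", "&amp;")

# Option 2: a regex whose encoding is fixed to binary.
pattern = Regexp.new("&".b, Regexp::FIXEDENCODING)
escaped_re = data.gsub(pattern, "&amp;")

puts escaped.encoding
puts escaped == escaped_re
```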

Strange \n in base64 encoded string in Ruby

The inbuilt Base64 library in Ruby is adding some '\n's. I'm unable to find out the reason. For this special example:
irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'base64'
=> true
irb(main):003:0> str = "1110--ad6ca0b06e1fbeb7e6518a0418a73a6e04a67054"
=> "1110--ad6ca0b06e1fbeb7e6518a0418a73a6e04a67054"
irb(main):004:0> Base64.encode64(str)
=> "MTExMC0tYWQ2Y2EwYjA2ZTFmYmViN2U2NTE4YTA0MThhNzNhNmUwNGE2NzA1\nNA==\n"
The \n's are at the last and 6th position from end. The decoder (Base64.decode64) returns back the old string perfectly. Strange thing is, these \n's don't add any value to the encoded string. When I remove the newlines from the output string, the decoder decodes it again perfectly.
irb(main):005:0> Base64.decode64(Base64.encode64(str).gsub("\n", '')) == str
=> true
Moreover, when I used a JS library to produce the base64-encoded output of the same input string, the output came without the \n's.
Is this a bug or anything else? Has anybody faced this issue before?
FYI,
$ ruby -v
ruby 1.8.7 (2008-08-11 patchlevel 72) [i486-linux]
Edit: Since I wrote this answer Base64.strict_encode64() was added, which does not add newlines.
The docs are somewhat confusing: the b64encode method is supposed to add a newline after every 60th character, and the example for the encode64 method is actually using the b64encode method.
It seems the pack("m") method for the Array class used by encode64 also adds the newlines. I would consider it a design bug that this is not optional.
You could either remove the newlines yourself, or if you're using rails, there's ActiveSupport::CoreExtensions::Base64::Encoding with the encode64s method.
In ruby-1.9.2 you have Base64.strict_encode64 which doesn't add that \n (newline) at the end.
Use the strict_encode64 method. encode64 adds a \n after every 60 encoded characters.
Yeah, this is quite normal. The doc gives an example demonstrating the line-splitting. base64 does the same thing in other languages too (eg. Python).
The reason content-free newlines are added at the encode stage is because base64 was originally devised as an encoding mechanism for sending binary content in e-mail, where the line length is limited. Feel free to replace them away if you don't need them.
Seems they've got to be stripped/ignored, like:
Base64.encode64(str).gsub(/\n/, '')
The \n added when using Base64.encode64 is correct, check this post out: https://glaucocustodio.github.io/2014/09/27/a-reminder-about-base64encode64-in-ruby/
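The options mentioned in the answers above can be compared side by side using the string from the question. This is a minimal sketch. It assumes Ruby 1.9.2 or later, where Base64.strict_encode64 and the "m0" pack directive (base64 without line feeds) are available.

```ruby
require 'base64'

str = "1110--ad6ca0b06e1fbeb7e6518a0418a73a6e04a67054"

wrapped = Base64.encode64(str)        # newline after every 60 encoded chars, plus a trailing one
strict  = Base64.strict_encode64(str) # no newlines at all
packed  = [str].pack("m0")            # same output as strict_encode64

puts wrapped.inspect
puts strict

# Both forms decode back to the original string.
puts Base64.decode64(wrapped) == str
puts Base64.strict_decode64(strict) == str
```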

Resources