Strange \n in base64 encoded string in Ruby - ruby

The inbuilt Base64 library in Ruby is adding some '\n's. I'm unable to find out the reason. For this special example:
irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'base64'
=> true
irb(main):003:0> str = "1110--ad6ca0b06e1fbeb7e6518a0418a73a6e04a67054"
=> "1110--ad6ca0b06e1fbeb7e6518a0418a73a6e04a67054"
irb(main):004:0> Base64.encode64(str)
=> "MTExMC0tYWQ2Y2EwYjA2ZTFmYmViN2U2NTE4YTA0MThhNzNhNmUwNGE2NzA1\nNA==\n"
The \n's are at the last and 6th position from end. The decoder (Base64.decode64) returns back the old string perfectly. Strange thing is, these \n's don't add any value to the encoded string. When I remove the newlines from the output string, the decoder decodes it again perfectly.
irb(main):005:0> Base64.decode64(Base64.encode64(str).gsub("\n", '')) == str
=> true
More of this, I used an another JS library to produce the base64 encoded output of the same input string, the output comes without the \n's.
Is this a bug or anything else? Has anybody faced this issue before?
FYI,
$ ruby -v
ruby 1.8.7 (2008-08-11 patchlevel 72) [i486-linux]

Edit: Since I wrote this answer Base64.strict_encode64() was added, which does not add newlines.
The docs are somewhat confusing, the b64encode method is supposed to add a newline for every 60th character, and the example for the encode64 method is actually using the b64encode method.
It seems the pack("m") method for the Array class used by encode64 also adds the newlines. I would consider it a design bug that this is not optional.
You could either remove the newlines yourself, or if you're using rails, there's ActiveSupport::CoreExtensions::Base64::Encoding with the encode64s method.

In ruby-1.9.2 you have Base64.strict_encode64 which doesn't add that \n (newline) at the end.

Use strict_encode64 method. encode64 adds \n every 60 symbols

Yeah, this is quite normal. The doc gives an example demonstrating the line-splitting. base64 does the same thing in other languages too (eg. Python).
The reason content-free newlines are added at the encode stage is because base64 was originally devised as an encoding mechanism for sending binary content in e-mail, where the line length is limited. Feel free to replace them away if you don't need them.

Seems they've got to be stripped/ignored, like:
Base64.encode64(str).gsub(/\n/, '')

The \n added when using Base64#encode64 is correct, check this post out: https://glaucocustodio.github.io/2014/09/27/a-reminder-about-base64encode64-in-ruby/

Related

why does psych yaml interpreter add line breaks around 80 characters?

Psych is the default yaml engine since ruby 1.9.3
Why, oh why does psych add a line break in its output? Check the example below.
ruby -v # => ruby 1.9.3p374 (2013-01-15 revision 38858) [x86_64-linux]
require 'yaml'
"this absolutely normal sentence is more than eighty characters long because it IS".to_yaml
# => "--- this absolutely normal sentence is more than eighty characters long because it\n IS\n...\n"
YAML::ENGINE.yamler = 'syck'
"this absolutely normal sentence is more than eighty characters long because it IS".to_yaml
# => "--- this absolutely normal sentence is more than eighty characters long because it IS\n"
You'll have to configure psych’s #to_yaml options. You'll most likely find it here:
ruby-1.9.3-p125/ext/psych/emitter.c
And then you can do something like this:
yaml.to_yaml(options = {:line_width => -1})
yaml.to_yaml(options = {:line_width => -1})
is ok to solve the problem.
but RuboCop say
Useless assignment to variable - options.
so
yaml.to_yaml(line_width: -1)
is better.
Why does it matter whether YAML wraps the line or not when it serializes the data?
The question is, after wrapping it, can YAML reconstruct the correct line later when it reloads the file? And, the answer is, yes, it can:
require 'yaml'
puts '"' + YAML.load("this absolutely normal sentence is more than eighty characters long because it IS".to_yaml) + '"'
Which outputs:
"this absolutely normal sentence is more than eighty characters long because it IS"
Data that has been serialized, is in a format that YAML understands. That's an important concept, as the data is YAML's at that point. We can mess with it in an editor, and add/subtract/edit, but the data is still YAML's, because it has to reload and reparse the data in order for our applications to use it. So, after the data makes a round-trip through YAML-land, if the data returns in the same form as it left, then everything is OK.
We'd have a problem if it was serialized and then corrupted during the parsing stage, but that doesn't happen.
You can modify some of YAML's Psych driver's behavior when it's serializing data. See the answers for "Documentation for Psych to_yaml options?" for more information.

Ruby 1.9.3 add unsafe characters to URI.escape

I am using Sinatra and get parameters from the url using the get '/foo/:bar' {} method. Unfortunately, the value in :bar can contain nasty things like / which leads to an 404, since no route matches /foo/:bar/baz/. I use URI.escape to escape the URL paramter, but it considers / valid a valid character. As it is mentioned here this is because the default Regexp to check against does not differentiate between unsafe and reserved characters. I would like to change this and did this:
URI.escape("foo_<_>_&_3_#_/_+_%_bar", Regexp.union(URI::REGEXP::UNSAFE, '/'))
just to test it.
URI::REGEXP::UNSAFE is the default regexp to match against according to the Ruby 1.9.3 Documentaton:
escape(*arg)
Synopsis
URI.escape(str [, unsafe])
Args
str
String to replaces in.
unsafe
Regexp that matches all symbols that must be replaced with
codes. By default uses REGEXP::UNSAFE. When this argument is
a String, it represents a character set.
Description
Escapes the string, replacing all unsafe characters with codes.
Unfortunatelly I get this error:
uninitialized constant URI::REGEXP::UNSAFE
And as this GitHub Issue suggests, this Regexp was removed from Ruby with 1.9.3. Unfortunately, the URI modules documentation is generally kind of bad, but I really cannot figure this out. Any hints?
Thanks in advance!
URI#escape is not what you are looking for. You want CGI#escape:
require 'cgi'
CGI.escape("foo_<_>_&_3_#_/_+_%_bar")
# => "foo_%3C_%3E_%26_3_%23_%2F_%2B_%25_bar"
This will properly encode it to allow Sinatra to retrieve it.
Perhaps you would have better luck with CGI.escape?
>> require 'uri'; URI.escape("foo_<_>_&_3_#_/_+_%_bar")
=> "foo_%3C_%3E_&_3_%23_/_+_%25_bar"
>> require 'cgi'; CGI.escape("foo_<_>_&_3_#_/_+_%_bar")
=> "foo_%3C_%3E_%26_3_%23_%2F_%2B_%25_bar"

Convert Ruby string to *nix filename-compatible string

In Ruby I have an arbitrary string, and I'd like to convert it to something that is a valid Unix/Linux filename. It doesn't matter what it looks like in its final form, as long as it is visually recognizable as the string it started as. Some possible examples:
"Here's my string!" => "Heres_my_string"
"* is an asterisk, you see" => "is_an_asterisk_you_see"
Is there anything built-in (maybe in the file libraries) that will accomplish this (or close to this)?
By your specifications, you could accomplish this with a regex replacement. This regex will match all characters other than basic letters and digits:
s/[^\w\s_-]+//g
This will remove any extra whitespace in between words, as shown in your examples:
s/(^|\b\s)\s+($|\s?\b)/\\1\\2/g
And lastly, replace the remaining spaces with underscores:
s/\s+/_/g
Here it is in Ruby:
def friendly_filename(filename)
filename.gsub(/[^\w\s_-]+/, '')
.gsub(/(^|\b\s)\s+($|\s?\b)/, '\\1\\2')
.gsub(/\s+/, '_')
end
First, I see that it was asked purely in ruby, and second that it's not the same purpose (*nix filename compatible), but if you are using Rails, there is a method called parameterize that should help.
In rails console:
"Here's my string!".parameterize => "here-s-my-string"
"* is an asterisk, you see".parameterize => "is-an-asterisk-you-see"
I think that parameterize, as being compliant with URL specifications, may work as well with filenames :)
You can see more about here:
http://api.rubyonrails.org/classes/ActiveSupport/Inflector.html#method-i-parameterize
There's also a whole lot of another helpful methods.

Ruby hexacode to unicode conversion

I crawled a website which contains unicode, an the results look something like, if in code
a = "\\u2665 \\uc624 \\ube60! \\uc8fd \\uae30 \\uc804 \\uc5d0"
May I know how do I do it in Ruby to convert it back to the original Unicode text which is in UTF-8 format?
If you have ruby 1.9, you can try:
a.force_encoding('UTF-8')
Otherwise if you have < 1.9, I'd suggest reading this article on converting to UTF-8 in Ruby 1.8.
short answer: you should be able to 'puts a', and see the string printed out. for me, at least, I can print out that string in both 1.8.7 and 1.9.2
long answer:
First thing: it depends on if you're using ruby 1.8.7, or 1.9.2, since the way strings and encodings were handled changed.
in 1.8.7:
strings are just lists of bytes. when you print them out, if your OS can handle it, you can just 'puts a' and it should work correctly. if you do a[0], you'll get the first byte. if you want to get each character, things are pretty darn tricky.
in 1.9.2
strings are lists of bytes, with an encoding. If the webpage was sent with the correct encoding, your string should already be encoded correctly. if not, you'll have to set it (as per Mike Lewis's answer). if you do a[0], you'll get the first character (the heart). if you want each byte, you can do a.bytes.
If your OS, for whatever reason, is giving you those literal ascii characters,my previous answer is obviously invalid, disregard it. :P
here's what you can do:
a.gsub(/\\u([a-z0-9]+)/){|p| [$1.to_i(16)].pack("U")}
this will scan for the ascii string '\u' followed by a hexadecimal number, and replace it with the correct unicode character.
You can also specify the encoding when you open a new IO object: http://www.ruby-doc.org/core/classes/IO.html#M000889
Compared to Mike's solution, this may prevent troubles if you forget to force the encoding before exposing the string to the rest of your application, if there are multiple mechanisms for retrieving strings from your module or class. However, if you begin crawling SJIS or KOI-8 encoded websites, then Mike's solution will be easier to adapt for the character encoding name returned by the web server in its headers.

Ruby 1.9 regex encoding

I am parsing this feed http://www.sixapart.com/labs/update/developers/ with nokogiri and then running some regex on the contents of some tags. The content is UTF-8 mostly, but is occasionally corrupt. However, for my case I don't really care and just need to pass the right parts of the content through, so I'm happy to treat the data as binary/ASCII-8BIT. The problem is that no matter what I do, regexes in my script are treated as either UTF-8 or ASCII. No matter what I set the encoding comment to, or what I do to create the regex.
Is there a solution to this? Can I force the regex to binary? Can I do a gsub without a regex easily? (I am just replacing & with &)
You need to encode the initial string and use the FIXEDENCODING option.
1.9.3-head :018 > r = Regexp.new("chars".force_encoding("binary"), Regexp::FIXEDENCODING)
=> /chars/
1.9.3-head :019 > r.encoding
=> #<Encoding:ASCII-8BIT>
Strings have a property of encoding. Try to use method String#force_encoding before applying regex.
UPD: To make your regexp be ascii, look on accepted answer here: Ruby 1.9: Regular Expressions with unknown input encoding
def get_regex(pattern, encoding='ASCII', options=0)
Regexp.new(pattern.encode(encoding),options)
end

Resources