I am stripping the padding from the end of a base64-encoded string. I have found that I can successfully decode the string without adding the padding back first (in PHP at least). Is this a result of the base64 specification, or is it implementation dependent?
A related question: can an = sign appear in a base64-encoded string other than as padding? That is, if decoding without padding is not implementation dependent, is it safe to simply strip the trailing equals signs, or could the last "real" character have been one?
While padding isn't necessary to decode a base64 string, there are some implementations that will throw an error if the padding has been stripped away.
The RFC states:
Implementations MUST include appropriate pad characters at the end of
encoded data unless the specification referring to this document
explicitly states otherwise.
The = character will only ever be used for padding.
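Ruby's standard library happens to show both behaviours side by side. The sketch below assumes the `base64` library is available; `repad` is a hypothetical helper written for this illustration, not part of any library.

```ruby
require "base64"

# Hypothetical helper: restore the "=" padding that was stripped.
# Padded Base64 always has a length that is a multiple of 4.
def repad(unpadded)
  unpadded + "=" * (-unpadded.length % 4)
end

padded   = Base64.strict_encode64("hello")   # 5 bytes, so padding is needed
stripped = padded.delete("=")

# The lenient decoder accepts the stripped form.
lenient = Base64.decode64(stripped)

# The strict decoder rejects it with an ArgumentError.
strict = begin
  Base64.strict_decode64(stripped)
rescue ArgumentError
  :rejected
end

# Restoring the padding satisfies the strict decoder again.
restored = Base64.strict_decode64(repad(stripped))
```

Because = never appears inside the data itself, stripping and restoring it this way is lossless.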
I have an external system written in Ruby which sends data over the wire encoded as ASCII_8BIT. How should I decode and encode it in Scala?
I couldn't find a library for decoding and encoding ASCII_8BIT strings in Scala.
As I understand it, ASCII_8BIT is something similar to Base64. However, there is more than one Base64 encoding. Which one should I use to be sure I cover all corner cases?
What is ASCII-8BIT?
ASCII-8BIT is Ruby's binary encoding (the name "BINARY" is accepted as an alias for "ASCII-8BIT" when specifying the name of an encoding). It is used both for binary data and for text whose real encoding you don't know.
Any sequence of bytes is a valid string in the ASCII-8BIT encoding, but unlike other 8bit-encodings, only the bytes in the ASCII range are considered printable characters (and of course only those that are printable in ASCII). The bytes in the 128-255 range are considered special characters that don't have a representation in other encodings. So trying to convert an ASCII-8BIT string to any other encoding will fail (or replace the non-ASCII characters with question marks depending on the options you give to encode) unless it only contains ASCII characters.
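On the Ruby side, that behaviour can be sketched as follows. The sample bytes and the guess of ISO-8859-1 are assumptions made purely for illustration.

```ruby
bytes = "caf\xE9".b   # "café" as ISO-8859-1 bytes, tagged ASCII-8BIT

# Converting to another encoding fails because of the non-ASCII byte.
strict_failed = begin
  bytes.encode("UTF-8")
  false
rescue EncodingError
  true
end

# With replacement options, the non-ASCII byte becomes a question mark.
replaced = bytes.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")

# If the real encoding is known (here assumed to be ISO-8859-1), relabel first.
guessed = bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
```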
What's its equivalent in the Scala/JVM world?
There is no strict equivalent. If you're dealing with binary data, you should be using binary streams that don't have an encoding and aren't treated as containing text.
If you're dealing with text, you'll either need to know (or somehow figure out) its encoding or just arbitrarily pick an 8-bit ASCII-superset encoding. That way non-ASCII characters may come out as the wrong character (if the text was actually encoded with a different encoding), but you won't get any errors because any byte is a valid character. You can then replace the non-ASCII characters with question marks if you want.
What does this have to do with Base64?
Nothing. Base64 is a way to represent binary data as ASCII text. It is not itself a character encoding. Knowing that a string has the character encoding ASCII or ASCII-8BIT or any other encoding, doesn't tell you whether it contains Base64 data or not.
But do note that a Base64 string will consist entirely of ASCII characters (and not just any ASCII characters, but only letters, numbers, +, / and =). So if your string contains any non-ASCII character or any character except the aforementioned, it's not Base64.
Therefore any Base64 string can be represented as ASCII. So if you have an ASCII-8BIT string containing Base64 data in Ruby, you should be able to convert it to ASCII without any problems. If you can't, it's not Base64.
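A rough check along those lines, in Ruby, might look like this. `looks_like_base64?` is a hypothetical helper that assumes padded, standard-alphabet Base64 with no line breaks.

```ruby
# Hypothetical helper: true if the bytes could be padded standard Base64.
def looks_like_base64?(str)
  str.ascii_only? &&
    (str.length % 4).zero? &&
    str.match?(/\A[A-Za-z0-9+\/]*={0,2}\z/)
end

b64_bytes = "aGVsbG8=".b        # Base64 text tagged as ASCII-8BIT
raw_bytes = "\xFF\xFE\x00".b    # arbitrary binary data

# ASCII-only content converts to US-ASCII without any problems.
as_ascii = b64_bytes.encode("US-ASCII")

puts looks_like_base64?(b64_bytes)
puts looks_like_base64?(raw_bytes)
```

Passing the check does not prove the string is Base64 (plenty of ordinary words pass it); failing it proves that it is not.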
My apologies if this is a beginner's question.
May I ask what is the difference between RFC2047 4.2 The "Q" encoding and RFC 2045 6.7 Quoted-Printable Content-Transfer-Encoding?
From RFC2047:
The "Q" encoding is similar to the "Quoted-Printable" content-transfer-encoding defined in RFC 2045.
I am trying to implement decode logic using Ruby. I have read the answer below and am trying to understand why first.gsub('_',' ') is required for Q encoding.
Is there a way to decode q-encoded strings in Ruby?
By reading RFC2047 again, I have realized that the approach below does not decode underscore properly in cases where underscore is encoded as =5F.
decoded = m[3].unpack('M').first.gsub('_',' ')
Instead, as described in the last sentence of RFC2047 4.2 (2):
Note that the "_" always represents hexadecimal 20, even if the SPACE character occupies a different code position in the character set in use.
I now replace each literal underscore with =20 first and then unpack the result:
decoded = m[3].gsub('_', '=20').unpack('M').first()
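The difference between the two orderings shows up as soon as the payload contains =5F. A small sketch, where `encoded_text` is a made-up payload standing in for m[3]:

```ruby
encoded_text = "snake=5Fcase_name"   # hypothetical Q-encoded payload

# Replacing "_" after decoding also clobbers the underscore decoded from =5F.
replace_after = encoded_text.unpack("M").first.gsub("_", " ")

# Replacing "_" with =20 before decoding leaves the =5F underscore intact.
replace_before = encoded_text.gsub("_", "=20").unpack("M").first

puts replace_after
puts replace_before
```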
I have some binary data that I want to convert to something more easily readable and copy/pastable.
The binary data shows up like this
?Q?O?,???W%ʐ):?g????????
Which is pretty ugly. I can convert it to hex with:
value.unpack("H*").first
But since hexadecimal only has 16 characters, it isn't very compressed. I end up with a string that is hundreds of chars long.
I'd prefer a format that uses letters (capitalized and lowercase), numbers, and basic symbols, to make best use of the possible values. What can I use?
I'd also prefer something that comes built-in to Ruby, that doesn't require another library. Unfortunately I can't require another library unless it's really well known and popular, or ideally built-in to Ruby.
I tried the stuff from http://apidock.com/ruby/String/unpack and couldn't find anything.
A simple method uses Base64 encoding to encode the value. It's very similar to Hex encoding (which is Base16), but uses a larger alphabet.
Base64 strings, when properly prepared, contain only printable characters. This is a benefit for copy/paste and for sharing.
The secondary benefit is that it has a 3:4 encoding ratio, which makes it reasonably efficient. A 3:4 encoding ratio means that every 3 bytes of input are encoded as 4 bytes of output (75% efficient); Hex encoding has a less efficient 1:2 ratio, where every byte of input is encoded as 2 bytes of output (50% efficient).
You can use the Ruby standard library Base64 implementation to encode and decode, like so:
require "base64"
encoded = Base64.encode64("Taste the thunder!") # <== "VGFzdGUgdGhlIHRodW5kZXIh\n"
decoded = Base64.decode64(encoded) # <== "Taste the thunder!"
Note that there is a (mostly) URL-safe version, as well, so that you can include an encoded value anywhere in a URL without requiring any additional URL encoding. This would allow you to pass information in a URL in an obscured way, and especially information that normally wouldn't be easily passed in that manner.
Try this to encode your data:
encoded_url_param = Base64.urlsafe_encode64("cake+pie=yummy!") # <== "Y2FrZStwaWU9eXVtbXkh"
decoded_url_param = Base64.urlsafe_decode64(encoded_url_param) # <== "cake+pie=yummy!"
Using Base64 in a URL is not "security", since anyone can decode the value, but it does keep your data and intent from being read at a glance. The only potential downside to using Base64 values in a URL is that the URL must remain case-sensitive, and some applications don't honor that requirement. See the Should URL be case sensitive SO question for more information.
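The "(mostly)" refers to the = padding, which the URL-safe variant still emits by default. A short sketch, assuming Ruby 2.3 or later (where the `padding:` keyword and unpadded decoding are available); the sample bytes are chosen only because they produce + and / in the standard alphabet:

```ruby
require "base64"

bytes = "\xFB\xFF\xFE".b

standard = Base64.strict_encode64(bytes)    # standard alphabet
url_safe = Base64.urlsafe_encode64(bytes)   # "-" and "_" replace "+" and "/"

# A single input byte needs padding; padding: false leaves it off.
padded   = Base64.urlsafe_encode64("a")
unpadded = Base64.urlsafe_encode64("a", padding: false)

# urlsafe_decode64 accepts the unpadded form as well.
round_trip = Base64.urlsafe_decode64(unpadded)
```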
Sounds to me like you want base 64. It is part of the standard library:
require 'base64'
Base64.encode64(some_data)
Or using pack,
[some_data].pack("m")
The resulting data is about 4/3 the size of the input.
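To see the size difference concretely, here is a small comparison on 30 random bytes. It uses the newline-free variants so that line breaks don't skew the counts.

```ruby
require "base64"
require "securerandom"

bytes = SecureRandom.random_bytes(30)     # 30 arbitrary bytes

hex    = bytes.unpack("H*").first         # 2 characters per byte
b64    = Base64.strict_encode64(bytes)    # 4 characters per 3 bytes
packed = [bytes].pack("m0")               # "m0" is pack's newline-free Base64

puts "hex: #{hex.length} chars, base64: #{b64.length} chars"
```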
Base36 string encoding is a reasonable alternative to both Base64 and Hex encoding, as well. In this encoding method, only 36 characters are used, typically the ASCII lowercase letters and the ASCII numbers.
There is no Ruby API that does this specifically; however, the SO answer Base36 Encode a String shows how to do it efficiently in Ruby:
Encoding to Base36:
encoded = data.unpack('H*')[0].to_i(16).to_s(36)
Decoding from Base36:
# Left-pad to an even number of hex digits so pack('H*') rebuilds whole bytes
hex = encoded.to_i(36).to_s(16)
decoded = [hex.rjust(hex.length + hex.length % 2, '0')].pack('H*')
Base36 encoding works well in URLs, much like Base64, but it avoids Base64's case-sensitivity issue because it uses only one letter case.
Note that Base36 string encoding should not be confused with base 36 radix integer encoding, which simply converts an integer value to the corresponding base 36 encoding. The integer technique uses String#to_i(36) and Integer#to_s(36) (Fixnum#to_s(36) before Ruby 2.4) to accomplish its goals.
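One caveat with the string technique: because the data passes through an integer, leading zero bytes are dropped on the way back. The sketch below works around that by assuming the original byte length is known in advance (for example, 16 for a UUID). Both helper names are hypothetical.

```ruby
# Hypothetical helpers wrapping the one-liners above.
def base36_encode(data)
  data.unpack("H*")[0].to_i(16).to_s(36)
end

# byte_length is the known size of the original data; the hex string is
# left-padded to that size so leading zero bytes survive the round trip.
def base36_decode(encoded, byte_length)
  hex = encoded.to_i(36).to_s(16).rjust(byte_length * 2, "0")
  [hex].pack("H*")
end

data    = "\x00\x01\xFF".b
encoded = base36_encode(data)
decoded = base36_decode(encoded, data.bytesize)
```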
I am decoding a base64 string, modifying it, and re-encoding it with Ruby. The problem when I re-encode it is that the ruby encode library is adding a linebreak after 60 or so characters. How can I tell it to not have max characters per line limit?
val = "QmFzZTY0IGlzIGEgZ2VuZXJpYyB0ZXJtIGZvciBhIG51bWJlciBvZiBzaW1pbGFyIGVuY29kaW5nIHNjaGVtZXMgdGhhdCBlbmNvZGUgYmluYXJ5IGRhdGEgYnkgdHJlYXRpbmcgaXQgbnVtZXJpY2FsbHkgYW5kIHRyYW5zbGF0aW5nIGl0IGludG8gYSBiYXNlIDY0IHJlcHJlc2VudGF0aW9uLiBUaGUgQmFzZTY0IHRlcm0gb3JpZ2luYXRlcyBmcm9tIGEgc3BlY2lmaWMgTUlNRSBjb250ZW50IHRyYW5zZmVyIGVuY29kaW5nLg0KDQpCYXNlNjQgZW5jb2Rpbmcgc2NoZW1lcyBhcmUgY29tbW9ubHkgdXNlZCB3aGVuIHRoZXJlIGlzIGEgbmVlZCB0byBlbmNvZGUgYmluYXJ5IGRhdGEgdGhhdCBuZWVkcyBiZSBzdG9yZWQgYW5kIHRyYW5zZmVycmVkIG92ZXIgbWVkaWEgdGhhdCBhcmUgZGVzaWduZWQgdG8gZGVhbCB3aXRoIHRleHR1YWwgZGF0YS4gVGhpcyBpcyB0byBlbnN1cmUgdGhhdCB0aGUgZGF0YSByZW1haW5zIGludGFjdCB3aXRob3V0IG1vZGlmaWNhdGlvbiBkdXJpbmcgdHJhbnNwb3J0LiBCYXNlNjQgaXMgdXNlZCBjb21tb25seSBpbiBhIG51bWJlciBvZiBhcHBsaWNhdGlvbnMgaW5jbHVkaW5nIGVtYWlsIHZpYSBNSU1FLCBhbmQgc3RvcmluZyBjb21wbGV4IGRhdGEgaW4gWE1MLg=="
decoded_val = Base64.decode64(val)
encoded_val = Base64.encode64(decoded_val)
#=> QmFzZTY0IGlzIGEgZ2VuZXJpYyB0ZXJtIGZvciBhIG51bWJlciBvZiBzaW1p
# bGFyIGVuY29kaW5nIHNjaGVtZXMgdGhhdCBlbmNvZGUgYmluYXJ5IGRhdGEg
# YnkgdHJlYXRpbmcgaXQgbnVtZXJpY2FsbHkgYW5kIHRyYW5zbGF0aW5nIGl0
# IGludG8gYSBiYXNlIDY0IHJlcHJlc2VudGF0aW9uLiBUaGUgQmFzZTY0IHRl
# cm0gb3JpZ2luYXRlcyBmcm9tIGEgc3BlY2lmaWMgTUlNRSBjb250ZW50IHRy
# YW5zZmVyIGVuY29kaW5nLg0KDQpCYXNlNjQgZW5jb2Rpbmcgc2NoZW1lcyBh
# cmUgY29tbW9ubHkgdXNlZCB3aGVuIHRoZXJlIGlzIGEgbmVlZCB0byBlbmNv
# ZGUgYmluYXJ5IGRhdGEgdGhhdCBuZWVkcyBiZSBzdG9yZWQgYW5kIHRyYW5z
# ZmVycmVkIG92ZXIgbWVkaWEgdGhhdCBhcmUgZGVzaWduZWQgdG8gZGVhbCB3
# aXRoIHRleHR1YWwgZGF0YS4gVGhpcyBpcyB0byBlbnN1cmUgdGhhdCB0aGUg
# ZGF0YSByZW1haW5zIGludGFjdCB3aXRob3V0IG1vZGlmaWNhdGlvbiBkdXJp
# bmcgdHJhbnNwb3J0LiBCYXNlNjQgaXMgdXNlZCBjb21tb25seSBpbiBhIG51
# bWJlciBvZiBhcHBsaWNhdGlvbnMgaW5jbHVkaW5nIGVtYWlsIHZpYSBNSU1F
# LCBhbmQgc3RvcmluZyBjb21wbGV4IGRhdGEgaW4gWE1MLg==
RFC 4648: The Base16, Base32, and Base64 Data Encodings has this to say:
3.3. Interpretation of Non-Alphabet Characters in Encoded Data
[...]
Implementations MUST reject the encoded data if it contains
characters outside the base alphabet when interpreting base-encoded
data, unless the specification referring to this document explicitly
states otherwise. Such specifications may instead state, as MIME
does, that characters outside the base encoding alphabet should
simply be ignored when interpreting data ("be liberal in what you
accept"). Note that this means that any adjacent carriage return/
line feed (CRLF) characters constitute "non-alphabet characters" and
are ignored.
So the newlines are fine and pretty much everything will ignore them even if they're not strictly compliant with RFC 4648.
Also, the fine manual has this to say:
encode64(bin)
Returns the Base64-encoded version of bin. This method complies with RFC 2045. Line feeds are added to every 60 encoded charactors [sic].
So the 60 character line length is intentional and specified. If you want strict RFC 4648 Base64 (i.e. no newlines), then there is strict_encode64:
strict_encode64(bin)
Returns the Base64-encoded version of bin. This method complies with RFC 4648. No line feeds are added.
So you can say Base64.strict_encode64(val) to get the output you're looking for.
And for reference, here's the relevant section of RFC 2045:
6.8. Base64 Content-Transfer-Encoding
[...]
The encoded output stream must be represented in lines of no more
than 76 characters each. All line breaks or other characters not
found in Table 1 must be ignored by decoding software.
So the 60 character line length is somewhat arbitrary but compliant with RFC 2045 since 60 < 76.
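The difference between the two methods is easy to check. The sample text here is arbitrary; any input longer than 45 bytes produces at least one wrapped line.

```ruby
require "base64"

text = "Base64 is a generic term for a number of similar encoding schemes. " * 3

wrapped = Base64.encode64(text)           # RFC 2045 style, with line feeds
strict  = Base64.strict_encode64(text)    # RFC 4648 style, no line feeds

longest_line = wrapped.lines.map { |line| line.chomp.length }.max

# Both forms decode back to the same data with the lenient decoder.
from_wrapped = Base64.decode64(wrapped)
from_strict  = Base64.decode64(strict)
```

Stripping the line feeds from the encode64 output gives exactly the strict_encode64 output.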
In my web application, one model uses an identifier that was generated by a UUID tool. Since I want that identifier to be part of the URL, I am looking into ways to shorten the UUID string. It is currently in hexadecimal format, so I thought about converting it to ASCII somehow. The result should contain only letters and digits ([\d\w]+), so a plain hex-to-ASCII conversion doesn't work (ugly characters).
Do you know of some nice algorithm or tool (Ruby) to do that?
A UUID is a 128-bit binary number, in the end. If you represent it as 16 unencoded bytes, there's no way to avoid "ugly characters". What you probably want to do is decode it from hex and then encode it using base64. Note that base64 encoding uses the characters + / = as well as A-Za-z0-9, so you'll want to do a little postprocessing. I suggest s/\+/-/g; s/\//_/g; s/==$// (a base64ed UUID will always end with two equals signs).
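In Ruby, that suggestion could be sketched as below. It assumes Ruby 2.4 or later and leans on urlsafe_encode64 to do the + / = postprocessing; `shorten_uuid` and `expand_uuid` are hypothetical helper names, and the sample UUID is made up.

```ruby
require "base64"

# Hypothetical helper: 36-character UUID string to 22 URL-safe characters.
def shorten_uuid(uuid)
  bytes = [uuid.delete("-")].pack("H*")            # hex to 16 raw bytes
  Base64.urlsafe_encode64(bytes, padding: false)   # "-", "_" alphabet, no "=="
end

# Hypothetical helper: reverse of shorten_uuid.
def expand_uuid(short)
  hex = Base64.urlsafe_decode64(short).unpack1("H*")
  [hex[0, 8], hex[8, 4], hex[12, 4], hex[16, 4], hex[20, 12]].join("-")
end

uuid     = "123e4567-e89b-12d3-a456-426614174000"
short    = shorten_uuid(uuid)
restored = expand_uuid(short)
```

The result can contain - as well as _, so a route constraint of [\d\w]+ would need to allow the hyphen too.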