Ruby: Base64 encode a 128-bit number starting with a 2-bit character to prevent padding at the end

This question is a follow-up to my previous question here: How can I convert a UUID to a string using a custom character set in Ruby?
But I will try to formulate it as a separate and specific question.
I have a 128-bit UUID in Ruby as a hex value:
SecureRandom.uuid #=> "2d931510-d99f-494a-8c67-87feb05e1594"
If I read the IFC specification correctly (http://www.buildingsmart-tech.org/ifc/IFC2x3/TC1/html/ifcutilityresource/lexical/ifcgloballyuniqueid.htm), I want to Base64 encode this, but instead of getting padding at the end, I want the output to begin with a 2-bit character (4 options) instead of a 6-bit one (64 options).
This way I think I can end up with a string of 22 characters (1 character carrying 2 bits and 21 carrying 6 bits, for a total of 128 bits).
Is it possible to tweak the Ruby base64 in this way?

Short answer: no. Technically speaking, that is not standard Base64, so Ruby's standard library will not deal with it.
Ruby's Base64 lib takes its input as bytes, so your input needs to be a whole number of bytes. But you want 4 zero bits in front of your UUID, so that's 4 + 128 = 132 bits; the next multiple of 8 is 136, i.e. 17 bytes. You can discard the extra randomness at the end:
require 'securerandom'
require 'base64'

x = SecureRandom.gen_random 17        # get a little extra randomness
x[0] = (x[0].ord & 0x0f).chr          # zero out the first four bits
Base64.strict_encode64(x)[0...22]     # discard the extra randomness and padding
The one downside of this approach is that your 128-bit UUID is oddly aligned inside x and hard to see on its own. If you want to get the 128 bits back out, you can do that with some pack/unpack:
[x.unpack("B*")[0][4...132]].pack("B*")

Related

Why Is Huffman Encoding Optional in HTTP/2 HPACK?

I want to make sure I understand this correctly:
This is from section 5.2 of RFC 7541:
Header field names and header field values can be represented as
string literals. A string literal is encoded as a sequence of
octets, either by directly encoding the string literal's octets or by
using a Huffman code (see [HUFFMAN]).
0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+
| H | String Length (7+) |
+---+---------------------------+
| String Data (Length octets) |
+-------------------------------+
This means I can either send Header string literals with H being 1 with Huffman Encoded string; or with H being 0 and the original string octets; and the existing HTTP/2 server/implementation should parse them correctly, right?
HTTP headers are basically made up of ASCII codes. ASCII uses fixed-length codes where each character is 8 bits long (well, actually only 7 bits, since HTTP headers only use the first 128 codes of the original ASCII character set, but the 8th bit is set to 0).
Huffman encoding uses variable-length codes. More frequently used characters have shorter codes of fewer than 8 bits, and less frequently used characters have codes longer than 8 bits. The theory is that most text is made up of the more frequently used characters, so the total length should be shorter than ASCII in most cases. This is especially true since ASCII "wastes" a bit: the basic character set needs only 7 bits but is stored in 8 bits of space.
So there will be some pieces of text which are actually longer than ASCII if using Huffman encoding.
The Huffman coding table used in HPACK is shown here, and as an example you can see that < is encoded as 111111111111100, which is 15 bits. Therefore the string <<<< would take 4 octets in ASCII but 60 bits, or 8 octets, in Huffman encoding.
Therefore HPACK allows you to use ASCII in this case as that is more efficient.
Maybe this is a little overcomplicated and we should just accept the slightly less efficient encoding in these rare edge cases - some say the IETF is obsessed with saving bits - but that's why it's there.
Note that receivers can’t control what the other side uses, so every HTTP/2 implementation needs to understand Huffman encoding. So it’s not optional in the sense that you can make an HTTP/2 implementation without it, but the use of it for individual header names or values is optional.
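To make the H flag concrete, here is a minimal Ruby sketch (not a real HPACK parser - the method name is made up) that reads the H bit and the 7-bit prefixed length from the first octet of a string literal:

# Sketch only: split the first octet of an HPACK string literal into the
# H flag and the 7-bit length prefix.
def read_string_literal_prefix(octet)
  huffman = (octet & 0x80) != 0   # H = 1 means the string data is Huffman coded
  length  = octet & 0x7f          # 7-bit prefix; 127 means more length octets follow
  [huffman, length]
end

read_string_literal_prefix(0x8a)  #=> [true, 10]   Huffman-coded, 10 octets of data
read_string_literal_prefix(0x0a)  #=> [false, 10]  raw octets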
Btw if interested in understanding HPACK in more detail than the spec gives, then I cover it (including the answer to this question!) in chapter 8 of my book: https://www.manning.com/books/http2-in-action.

Why does MQTT use such a strange encoding scheme for remaining length?

I've recently started writing an MQTT library for a microcontroller. I've been following the specification document. Section 2.2.3 explains how the remaining length field (part of the fixed header) encodes the number of bytes to follow in the rest of the packet.
It uses a slightly odd encoding scheme:
Byte 0 = a mod 128, a /= 128, if a > 0, set top bit and add byte 1
Byte 1 = a mod 128, a /= 128, if a > 0, set top bit... etc
This variable length encoding seems odd in this application. You could easily transmit the same number using fewer bytes, especially once you get into numbers that take 2-4 bytes using this scheme. MQTT was designed to be simple to use and implement. So why did they choose this scheme?
For example, decimal 15026222 would be encoded as 0xae 0x90 0x95 0x7, however in hexadecimal it's 0xE5482E -- 3 bytes instead of four. The overhead in calculating the encoding scheme and decoding it at the other end seems to contradict the idea that MQTT is supposed to be fast and simple to implement on an 8-bit microcontroller.
What are the benefits to this encoding scheme? Why is it used? The only blog post I could find that even mentions any motivation is this one, which says:
The encoding of the remaining length field requires a bit of additional bit and byte handling, but the benefit is that only a single byte is needed for most messages while preserving the capability to send larger message up to 268’435’455 bytes.
But that doesn't make sense to me. You could have even more messages be only a single byte if you used the entire first byte to represent 0-255 instead of 0-127. And if you used straight hexadecimal, you could represent a number as large as 4 294 967 295 instead of only 268 435 455.
Does anyone have any idea why this was used?
As the comment you cited explains, the scheme was chosen under the assumption that "only a single byte is needed for most messages"; in other words, under the assumption that most of the time a <= 127, so a single byte is enough to represent the value.
The alternatives are:
Use a value that explicitly indicates how many bytes (or bits) a needs. This would require dedicating at least 2 bits in every message just to support a values up to 4 bytes in size.
Dedicate a fixed size to a, probably 4 bytes, for all messages. This is inferior if many (read: most) messages don't need that size, and it can't support larger values if that ever becomes a requirement.
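For reference, here is a small Ruby sketch of the scheme described above (the helper names are made up); it reproduces the example from the question:

# Encode: 7 data bits per byte, top bit set while more bytes follow.
def encode_remaining_length(n)
  bytes = []
  loop do
    byte, n = n % 128, n / 128
    byte |= 0x80 if n > 0        # continuation bit
    bytes << byte
    break if n.zero?
  end
  bytes
end

# Decode: strip the continuation bit and accumulate 7 bits per byte.
def decode_remaining_length(bytes)
  bytes.each_with_index.sum { |byte, i| (byte & 0x7f) * 128**i }
end

encode_remaining_length(15_026_222)                #=> [174, 144, 149, 7], i.e. 0xAE 0x90 0x95 0x07
decode_remaining_length([0xAE, 0x90, 0x95, 0x07])  #=> 15026222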

Compress many numbers into a string

I was wondering if there's a way to compress 20 or so large numbers (~10^8) into a string of reasonable length. For instance, if the numbers were stored as hex and concatenated, the result would be at least 160 characters long. I wonder if there's a smart way to compress the numbers and get them back out. I was thinking about having the sequence 0-9 as a reference and letting one part of the input string be a number < 1024. That number would be converted to binary and serve as a mask, i.e. indicating which digits exist in the number. It's still not clear where to go from here.
Are there any better alternatives?
Thanks
If these large numbers are all the same size in bytes, and if you always know the count of those numbers, there is an easy way to do it. You simply have an array of your bytes and, instead of reading them out as integers, you read them out as characters. Are you trying to obfuscate your values, or just pack them to be easily transferred?
When I'm compacting a lot of values into one reversible String, I usually go with Base64 conversion. This can really cut quite a lot of the length off a String, but note that it may take up just as much memory to represent it.
Example
This number in decimal:
10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
is the following in Base 64:
Yki8xQRRVqd403ldXJUT8Ungkh/A3Th2TMtNlpwLPYVgct2eE8MAn0bs4o/fv1bmo4oUNQa/9WtZ8gRE7IG+UHX+LniaQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
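In Ruby, that kind of conversion could look like the following sketch (the helper names are made up): turn the number into its raw bytes, then Base64-encode those bytes.

require 'base64'

# Sketch: big integer -> raw bytes -> Base64, and back.
def number_to_base64(n)
  hex = n.to_s(16)
  hex = '0' + hex if hex.length.odd?       # pad to a whole number of bytes
  Base64.strict_encode64([hex].pack('H*'))
end

def base64_to_number(s)
  Base64.strict_decode64(s).unpack1('H*').to_i(16)
end

encoded = number_to_base64(123_456_789)   #=> "B1vNFQ=="
base64_to_number(encoded)                 #=> 123456789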
Why you can't do this to an extreme level
Think about it for a second. Let's say you've got a number that is 10 bits long, and you want to represent it with 5 bits, so a 50% compression scheme. First, we work out how many possible numbers you can represent with 10 bits, which is:
2^10 = 1024
Okay, that's fine. How many numbers can we express with 5 bits?
2^5 = 32
So, you can only display 32 different numbers with 5 bits, whereas you can display 1024 numbers with 10 bits. For compression to work, there needs to be some mapping between the compressed value and the extracted value. Let's try and make that mapping happen..
Normal    Compressed
0         0
1         1
2         2
..        ...
31        31
32        ??
33        ??
34        ??
...       ...
1023      ??
There is no mapping for most of the numbers that can be represented by the expanded value.
This is known as the Pigeonhole Principle: in this example our value for n is greater than our value for m, so some compressed values would have to map to more than one normal value, which makes lossless decompression impossible (thank you Oli for reminding me).
You need to be much more descriptive about what you mean by "string" and "~10^8". Can your "string" contain any sequence of bytes? Or is it restricted to a subset of possible bytes? If so, how exactly is it restricted? What are the limits on your "large numbers"? What do they represent?
Numbers up to 10^8 can be represented in 27 bits. 20 of them would be 540 bits, which could be stored in a string of 68 bytes, if any sequence of bytes is permitted. If the contents of the string are limited, it will take more bits. If your range of numbers is larger, it will take more bits.
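As an illustration of that bit-packing estimate, here is a rough Ruby sketch (the helper names are made up), assuming the receiver knows the count and the bits per number:

# Pack each number into a fixed 27-bit field, then join the bits into bytes.
def pack_numbers(numbers, bits_per_number = 27)
  bit_string = numbers.map { |n| n.to_s(2).rjust(bits_per_number, '0') }.join
  [bit_string].pack('B*')                  # the final byte is padded with zero bits
end

def unpack_numbers(packed, count, bits_per_number = 27)
  bits = packed.unpack1('B*')
  (0...count).map { |i| bits[i * bits_per_number, bits_per_number].to_i(2) }
end

numbers = Array.new(20) { rand(10**8) }
packed  = pack_numbers(numbers)
packed.bytesize                        #=> 68
unpack_numbers(packed, 20) == numbers  #=> true

If the result has to be printable text rather than arbitrary bytes, those 68 bytes can then be Base64-encoded, at the cost of growing to 92 characters.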
Store all the numbers as strings in a marisa trie: https://code.google.com/p/marisa-trie/
Base64 the resulting trie dictionary.
It depends a lot on your input, of course, but this is one way to build a (very) compact representation.

Rubyist way to decode this encoded string assuming invariant ASCII encoding

My program is a decoder for a binary protocol. One of the fields in that binary protocol is an encoded String. Each character in the String is printable, and represents an integral value. According to the spec of the protocol I'm decoding, the integral value it represents is taken from the following table, where all possible characters are listed:
Character Value
========= =====
0 0
1 1
2 2
3 3
[...]
: 10
; 11
< 12
= 13
[...]
B 18
So for example, the character = represents an integral 13.
My code was originally using ord to get the ASCII code for the character, and then subtracting 48 from that, like this:
def Decode(val)
  val[0].ord - 48
end
...which works perfectly, assuming that val consists only of characters listed in that table (this is verified elsewhere).
However, in another question, I was told that:
You are asking for a Ruby way to use ord, where using it is against
the Ruby way.
It seems to me that ord is exactly what I need here, so I don't understand why using ord here is not a Rubyist way to do what I'm trying to do.
So my questions are:
First and foremost, what is the Rubyist way to write my function above?
Secondary, why is using ord here a non-Rubyist practice?
A note on encoding: This protocol which I'm decoding specifies precisely that these strings are ASCII encoded. No other encoding is possible here. Protocols like this are extremely common in my industry (stock & commodity markets).
I guess the Rubyist way, and a faster one, to decode the string into an array of integers is the unpack method:
"=01:".unpack("C*").map { |v| v - 48 }
#=> [13, 0, 1, 10]
The unpack method, with the "C*" directive, converts each character to an 8-bit unsigned integer.
Probably ord is entirely safe and appropriate in your case, as the source data should always be encoded the same way, especially if, when reading the data, you set the encoding to 'US-ASCII' (although the format used looks safe for 'ASCII-8BIT', 'UTF-8' and 'ISO-8859', which may be the point of it - it seems resilient to many conversions, and it does not use all possible byte values). However, ord is intended to be used with character semantics, and technically you want byte semantics. With basic ASCII and its variants there is no practical difference: all byte values below 128 map to the same character codes.
I would suggest using String#unpack as a general method for converting binary input to Ruby data types, but there is no unpack directive for "use this byte with an offset", so that becomes a two-step process.
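A sketch of that two-step approach - read the field with byte semantics, then apply the offset from the table above (the method name decode_field is made up):

def decode_field(val)
  val.b.unpack('C*').map { |byte| byte - 48 }   # String#b returns an ASCII-8BIT copy
end

decode_field('=01:')  #=> [13, 0, 1, 10]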

Minimum length of source string for cipher

Is there a minimum length for the source string for cipher? I am trying to create unique session ids, and am doing the following:
cipherer = OpenSSL::Cipher::Cipher.new("aes-256-cbc").encrypt
cipherer.update(Time.now.to_f.*(1000).to_i.to_s(36))
but it returns an empty string. I changed the multiplication factor 1000 to larger ones like 100000000000000, and then it sometimes returns an empty string and sometimes a long string. I suppose this is because strings shorter than some length are mapped to empty strings by the cipher. Is this understanding correct? If so, what is the minimum length, and how can I make it so that it never returns an empty string? I don't need to decipher it, so there is no worry about whether the rounding error or hardware dependency will make it impossible to decipher.
The AES block size is 128 bits = 16 bytes, which is 16 characters here. 16 bytes go in and 16 bytes come out.
You are using the streaming mode by calling OpenSSL::Cipher#update. This means that the input to each update call is buffered by the cipher until at least 16 bytes have accumulated. Only then can a block be encrypted, and only then will the method return an actual value.
This means that a call to update will only produce output every second time if you feed it 8-byte values, which is what you do:
cipherer = OpenSSL::Cipher::Cipher.new("aes-256-cbc").encrypt
#=> #<OpenSSL::Cipher::Cipher:0x007fb67426e558>
cipherer.update('X'*8)
#=> ""
cipherer.update('X'*8)
#=> "\xA0+\xD8Y\xA5\xBC68\x972\x86!\xC7\xE5\xA2\xDE"
cipherer.update('X'*8)
#=> ""
cipherer.update('X'*8)
#=> "\xFB\xB1I\xE2\x01\xB8Z\x10\xDC\x96m?\xC3\x00\x19+"
On the other hand, two blocks will be encrypted at once if you provide a 32 Byte value, e.g.
cipherer = OpenSSL::Cipher::Cipher.new("aes-256-cbc").encrypt
#=> #<OpenSSL::Cipher::Cipher:0x007fb6741579f8>
cipherer.update('X'*32)
#=> "\xA0+\xD8Y\xA5\xBC68\x972\x86!\xC7\xE5\xA2\xDE\xFB\xB1I\xE2\x01\xB8Z\x10\xDC\x96m?\xC3\x00\x19+"
cipherer.update('X'*32)
#=> "\xB2!\xBA>M\x13t\xEBv^\xCE\xAE\x18\x9A\xE3S\xD96\x95\x89\xC1\xB4\xAA\xDD\xD3\xDCp\e<\x90\xA79"
cipherer.update('X'*32)
#=> "\x95\xA3-\xB9\x93D\x1D\xC0\xB3mh\xE3\xB5N\x9C\x8C\xEA\xF3\x80\xD3\xBDU\xCB'\xC0E\xDA\x02\xA8L\a\xB3"
The direct solution to your problem is to use an input string that is guaranteed to be 16 bytes long.
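For example, something like the following sketch always produces output from the first update call. Note that the snippets above omit key and IV setup, which a real cipher needs; padding the timestamp to 16 bytes here is purely illustrative:

require 'openssl'

cipher = OpenSSL::Cipher.new('aes-256-cbc').encrypt
cipher.key = OpenSSL::Random.random_bytes(32)   # 32-byte key for AES-256
cipher.iv  = OpenSSL::Random.random_bytes(16)   # 16-byte IV for CBC mode

input = Time.now.to_f.*(1000).to_i.to_s(36).ljust(16, ' ')  # pad to a full block
cipher.update(input).empty?   #=> false - a full 16-byte block comes straight out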
I strongly encourage you to use the UUID implementation that comes with ruby, though. There are so many things to think about when implementing crypto stuff. You're always better off sticking to the standards and using well-known, widely-used and battle-tested implementations. That said, here's how to get a UUID:
require 'securerandom'
SecureRandom.uuid
#=> "c5059a24-25fc-4617-aaf7-280df52cd8d3"
Simple as that.
