Minimum length of source string for cipher - ruby

Is there a minimum length for the source string for cipher? I am trying to create unique session ids, and am doing the following:
cipherer = OpenSSL::Cipher::Cipher.new("aes-256-cbc").encrypt
cipherer.update(Time.now.to_f.*(1000).to_i.to_s(36))
but it returns an empty string. I changed the multiplication factor from 1000 to larger values like 100000000000000, and then it sometimes returns an empty string and sometimes a long string. I suppose this is because strings shorter than some length are mapped to empty strings by the cipher. Is this understanding correct? If so, what is the minimum length, and how can I make sure it never returns an empty string? I don't need to decipher the result, so there is no worry about whether rounding error or hardware dependency would make it impossible to decipher.

The AES block size is 128 bits = 16 bytes, which is 16 single-byte characters. This means 16 bytes go in and 16 bytes come out.
You are using the streaming mode by calling OpenSSL::Cipher#update. The input you pass to update is buffered internally until at least 16 bytes have accumulated. Only then can a block be encrypted, and only then will the method return actual output.
This means that a call to update will produce output only every second time if you feed it 8-byte values, which is what you do:
cipherer = OpenSSL::Cipher::Cipher.new("aes-256-cbc").encrypt
#=> #<OpenSSL::Cipher::Cipher:0x007fb67426e558>
cipherer.update('X'*8)
#=> ""
cipherer.update('X'*8)
#=> "\xA0+\xD8Y\xA5\xBC68\x972\x86!\xC7\xE5\xA2\xDE"
cipherer.update('X'*8)
#=> ""
cipherer.update('X'*8)
#=> "\xFB\xB1I\xE2\x01\xB8Z\x10\xDC\x96m?\xC3\x00\x19+"
On the other hand, two blocks will be encrypted at once if you provide a 32 Byte value, e.g.
cipherer = OpenSSL::Cipher::Cipher.new("aes-256-cbc").encrypt
#=> #<OpenSSL::Cipher::Cipher:0x007fb6741579f8>
cipherer.update('X'*32)
#=> "\xA0+\xD8Y\xA5\xBC68\x972\x86!\xC7\xE5\xA2\xDE\xFB\xB1I\xE2\x01\xB8Z\x10\xDC\x96m?\xC3\x00\x19+"
cipherer.update('X'*32)
#=> "\xB2!\xBA>M\x13t\xEBv^\xCE\xAE\x18\x9A\xE3S\xD96\x95\x89\xC1\xB4\xAA\xDD\xD3\xDCp\e<\x90\xA79"
cipherer.update('X'*32)
#=> "\x95\xA3-\xB9\x93D\x1D\xC0\xB3mh\xE3\xB5N\x9C\x8C\xEA\xF3\x80\xD3\xBDU\xCB'\xC0E\xDA\x02\xA8L\a\xB3"
The direct solution to your problem is to use an input string that is guaranteed to be at least 16 bytes long, i.e. one full block.
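For example, a minimal sketch of that padding approach (the key/IV setup is my own addition; the snippet in the question omits it, and recent versions of Ruby's OpenSSL bindings require a key before update may be called):
require 'openssl'

cipherer = OpenSSL::Cipher.new("aes-256-cbc").encrypt
cipherer.random_key  # assumed setup; the question's snippet never sets a key
cipherer.random_iv

token = Time.now.to_f.*(1000).to_i.to_s(36)
cipherer.update(token.ljust(16, "0"))  # pad to one full 16-byte block
#=> always returns a non-empty 16-byte ciphertext block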
I strongly encourage you to use the UUID implementation that comes with Ruby, though. There are so many things to think about when implementing crypto; you're always better off sticking to the standards and using well-known, widely used, battle-tested implementations. That said, here's how to get a UUID:
require 'securerandom'
SecureRandom.uuid
#=> "c5059a24-25fc-4617-aaf7-280df52cd8d3"
Simple as that.

Related

Tiny URL system design

I have read many links and watched many YouTube videos, which all provide the same solution, namely:
Use a distributed counter like zookeeper
Counter max limit can be 3.5 trillion
Convert the Counter value to Base62
which is all fine when the counter value is small.
e.g.
generated counter value: 120001 => base62 value FMJQmhBR
but when the counter produces a large value like the one below, the length of the base62 value also increases.
generated counter value: 120003658 => base62 value HRGZF8RiHC6y
So how can this be a solution for a tiny URL with an exact length of 8?
https://www.linqz.io/2018/10/how-to-build-a-tiny-url-service-that-scales-to-billions.html
https://www.youtube.com/watch?v=eCLqmPBIEYs
https://www.youtube.com/watch?v=JQDHz72OA3c&t=1862s
First: there absolutely is a compression limit. If your chosen representation has a maximum length, that imposes a hard limit on your key space.
Let's unpack that a little. Let's say you've got 80 guests at a party, and you want to give each guest a unique label (for their drink cup or something). If you've decided that each label will be a single letter from the English alphabet, then you only have enough unique labels for 26 guests.
Second: FMJQmhBR is not the most efficient way to represent the number 120001. It takes 17 bits in binary: 11101010011000001 (not sure which endianness that is). 16 bits is just two 8-bit characters, and three 8-bit characters can accommodate nearly 17 million unique values (2^24 ≈ 16.7 million). And that's without any kind of special, ZIP-like compression.
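To make that concrete, here's a quick Ruby sketch showing that 120001 needs only three base62 characters (the 0-9A-Za-z alphabet ordering is an assumption; any fixed 62-character alphabet works):
ALPHABET = [*"0".."9", *"A".."Z", *"a".."z"].join  # assumed 62-char ordering

def to_base62(n)
  return ALPHABET[0] if n.zero?
  out = ""
  while n > 0
    out.prepend(ALPHABET[n % 62])  # build the string least-significant digit first
    n /= 62
  end
  out
end

to_base62(120001) #=> "VDV" (three characters, not eight)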
--
I think most URL shorteners work essentially by assigning a counting number to each URL that someone shortens. So, the very first URL that gets submitted will be given ID=1: they save the whole URL in the database and associate it with that number. The second URL gets ID=2, etc.
That's pretty crude, though. For a variety of reasons, they don't want to hand those IDs out in order. But if they know how long they want the identifiers to be, it's not hard to hand them out in random order:
When someone submits a URL, the system picks a random number between 0 and the highest-possible ID. If the URL identifiers are all supposed to be 8 ASCII characters, that means they pick a random number between 0 and 2^(8*8) = 1.844674407e19.
Then they check their DB to see if they've handed out that ID. If they have, they pick a different random number. They repeat this until they pick an ID that hasn't been handed out. (I think there are more efficient algorithms for this, but the effect is the same and this is easiest to understand.)
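A hedged sketch of that pick-and-retry loop, in Ruby (the Set stands in for whatever database lookup you actually use):
require 'securerandom'
require 'set'

MAX_ID = 2**64  # the 1.844674407e19 key space from above

def next_id(taken)
  loop do
    id = SecureRandom.random_number(MAX_ID)  # uniform pick in 0...MAX_ID
    return id unless taken.include?(id)      # retry until the ID is unused
  end
end

taken = Set.new
3.times { taken << next_id(taken) }  # three distinct random IDs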
Given that you are not hashing every URL but rather a vaguely predictable number, you could hash that number and take the first N bits.
However, there are many options for handling collisions:
ignore them - they will be rare (ideally)
choose the next value
hash the result again (with your input)
increment the size of the returned string
...
Here's a great video on cuckoo hashing (a hash-based structure relevant here):
https://www.youtube.com/watch?v=HRzg0SzFLQQ
Here's an example in Python which derives an 8-character string from the hash; the result should be fairly unique (it could then be collected into a sorted data structure mapping it to a URL).
It works by first hashing the value with an avalanching hash (SHA-256) and then looping to find the minimum slice from the front of the hex string that forms an 8-character base62 string.
This could be made much more efficient (for example, by bisecting), but it may be clearer as-is, and the right approach depends hugely on unspecified algorithm requirements.
import hashlib

BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

m = hashlib.sha256()
m.update("https://stackoverflow.com/questions/65714033/tiny-url-system-design".encode())
digest = m.digest()  # hash as bytes b',\xdb3\x8c\x98g\xd6\x8b\x99\xb6\x98#.\\\xd1\x07\xa0\x8f\x1e\xb4\xab\x1eg\xdd\xda\xd6\xa3\x1d\xb0\xb2`9'
hex_str = digest.hex()  # string of hex chars 2cdb338c9867d68b99b698232e5cd107a08f1eb4ab1e67dddad6a31db0b26039
for hashlen in range(100, 1, -1):
    number = int(hex_str[:hashlen], 16)  # first hashlen hex chars -> decimal
    val = ""
    while number != 0:
        val = "{}{}".format(BASE62[number % 62], val)  # append new chars to front
        number = number // 62  # integer division
    if len(val) <= 8:
        break
print(val)  # E0IxW0zn
base62 logic from How to fix the code for base62 encoding with Python3?

ruby base64 encode 128 bit number by starting with a 2 bit character to prevent padding at the end

This question is a follow up to my previous question here: How can I convert a UUID to a string using a custom character set in Ruby?
But I will try to formulate it as a separate and specific question.
I do have a Ruby 128 bit UUID as hex value:
SecureRandom.uuid #=> "2d931510-d99f-494a-8c67-87feb05e1594"
If I understand the IFC specification correctly (http://www.buildingsmart-tech.org/ifc/IFC2x3/TC1/html/ifcutilityresource/lexical/ifcgloballyuniqueid.htm), I want to Base64 encode this, but instead of getting padding at the end, I want the output to begin with a 2-bit character (4 options) instead of a 6-bit one (needed for 64 options).
This way I think I can end up with a string of 22 characters (1 character of 2 bits, and 21 of 6 bits, for a total of 128 bits).
Is it possible to tweak the Ruby base64 in this way?
Short answer: no. Technically speaking, that is not standard Base64 so Ruby's standard lib will not deal with it.
Ruby's Base64 lib takes its input as bytes, so your input data needs to be a whole number of bytes. But you want 4 zero bits in front of your UUID, and 4 + 128 = 132 bits; the next multiple of 8 is 136 bits, i.e. 17 bytes. You can discard the extra randomness at the end:
require 'securerandom'
require 'base64'

x = SecureRandom.gen_random(17)       # get a little extra randomness
x[0] = (x[0].ord & 0x0f).chr          # zero out the first four bits
Base64.strict_encode64(x)[0...22]     # discard the extra randomness
The one downside of this approach is that your 128 bit UUID is weirdly aligned inside x and hard to see on its own. If you want to get the 128 bits out you can do that with some pack/unpack:
[x.unpack("B*")[0][4...132]].pack("B*")
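If you want to inspect those recovered 16 bytes in the familiar UUID shape, here's a small follow-up sketch (the H8H4H4H4H12 grouping is simply the standard 8-4-4-4-12 hex layout; formatting it this way is my assumption, not part of the answer):
raw = [x.unpack("B*")[0][4...132]].pack("B*")  # the 16 UUID bytes from above
raw.unpack("H8H4H4H4H12").join("-")
#=> e.g. "0d931510-d99f-494a-8c67-87feb05e1594" (shape only; the bytes are random)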

Pack/Unpack and base64 in Ruby

I have a string a = "hello". I can convert it to base 2 or base 16 using unpack:
a.unpack('B*')
# => ["0110100001100101011011000110110001101111"]
a.unpack('H*')
# => ["68656c6c6f"]
To convert to base 64, I tried pack:
[a].pack('m0')
# => "aGVsbG8="
but the result is not what I expected. I thought that if I have some binary representation of a string and want to represent it in parts, I should use unpack, but that turned out to be wrong. Please help me understand this.
Per OP's clarified question, "Why do we use #pack to get base64 and #unpack to get other representations of raw data?"
The surface level reason is because Array#pack is a method that returns a String, while String#unpack is a method that returns an Array.
There are stronger conceptual reasons underlying this. The key principle is that base64 is not an array of raw bytes. Rather, it's a 7-bit-ASCII-safe string that can represent arbitrary bytes if properly (de)coded.
Each base64 character maps to a sequence of six bits. At the byte level, that's a 4:3 ratio of characters to raw bytes. Since integer powers of 2 don't divide by 3, we end up with padding more often than not, and you can't slice base64 in arbitrary places to get ranges of bytes out of it (you'd have to figure out which bytes you want in groups of three and go get the associated base64 characters in groups of four).
Arbitrary sequences of data are, fundamentally, arrays of bytes. Base64-encoded sequences are, fundamentally, strings: data sequences constrained to the range of bytes safely transmissible and displayable as text.
Base64 is the encapsulation (or "packing") of a data array into a string.
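A quick round trip illustrates that packing direction (note the = padding, which appears because 5 bytes is not a multiple of 3):
encoded = ["hello"].pack("m0")  #=> "aGVsbG8=" (5 raw bytes become 8 base64 chars)
decoded = encoded.unpack("m0")  #=> ["hello"] (back to the array holding raw bytes)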
The encoded text is correct. To validate it, use the online tool below:
https://www.base64encode.org/
text:
hello
Encoded Base64:
aGVsbG8=
Useful resource:
https://idiosyncratic-ruby.com/4-what-the-pack.html

Rubyist way to decode this encoded string assuming invariant ASCII encoding

My program is a decoder for a binary protocol. One of the fields in that binary protocol is an encoded String. Each character in the String is printable, and represents an integral value. According to the spec of the protocol I'm decoding, the integral value it represents is taken from the following table, where all possible characters are listed:
Character  Value
=========  =====
0          0
1          1
2          2
3          3
[...]
:          10
;          11
<          12
=          13
[...]
B          18
So for example, the character = represents an integral 13.
My code was originally using ord to get the ASCII code for the character, and then subtracting 48 from that, like this:
def Decode(val)
  val[0].ord - 48
end
...which works perfectly, assuming that val consists only of characters listed in that table (this is verified elsewhere).
However, in another question, I was told that:
You are asking for a Ruby way to use ord, where using it is against the Ruby way.
It seems to me that ord is exactly what I need here, so I don't understand why using ord here is not a Rubyist way to do what I'm trying to do.
So my questions are:
First and foremost, what is the Rubyist way to write my function above?
Secondary, why is using ord here a non-Rubyist practice?
A note on encoding: This protocol which I'm decoding specifies precisely that these strings are ASCII encoded. No other encoding is possible here. Protocols like this are extremely common in my industry (stock & commodity markets).
I guess the Rubyistic way, and a faster one, to decode the string into an array of integers is the unpack method:
"=01:".unpack("C*").map { |v| v - 48 }
#=> [13, 0, 1, 10]
The unpack method, with the "C*" parameter, converts each character to an 8-bit unsigned integer.
Probably ord is entirely safe and appropriate in your case, as the source data should always be encoded the same way. This is especially true if, when reading the data, you set the encoding to 'US-ASCII' (although the format used looks safe for 'ASCII-8BIT', 'UTF-8' and 'ISO-8859', which may be the point of it: it seems resilient to many conversions, and does not use all possible byte values). However, ord is intended to be used with character semantics, and technically you want byte semantics. With basic ASCII and its variants there is no practical difference, since all byte values below 128 map to the same character code.
I would suggest using String#unpack as a general method for converting binary input to Ruby data types, but there is no unpack directive for "use this byte with an offset", so that becomes a two-part process, sketched below.
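A hedged sketch of that two-part process (raw_field is a hypothetical string read from the protocol):
# Part 1: pin down the encoding we were promised; Part 2: decode by byte value.
data = raw_field.force_encoding(Encoding::US_ASCII)  # raw_field is hypothetical
values = data.each_byte.map { |b| b - 48 }           # byte semantics, offset 48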

Converting a unique seed string into a random, yet deterministic, float value in Ruby

I'm having a hard time with this, conceptually.
Basically, I need to accept some arbitrary unique string, and be able to convert that to a normalized float value. What the output float value is doesn't really matter, so long as the same string input always results in the same normalized float output.
So this is a hashing algorithm right? I'm familiar with SHA1 or MD5, and this seems similar to password hashing where the result is the same for the correct password. But those methods output strings of characters, I believe. And what I'm not getting is how I would turn the result of a SHA1 or MD5 into a consistent float value.
# Goal
def string_to_float(seed_string)
  # ...
end
string_to_float('abc-123') #=> 0.15789
string_to_float('abc-123') #=> 0.15789
string_to_float('def-456') #=> 0.57654
string_to_float('def-456') #=> 0.57654
So what kind of approach in Ruby can I take that would turn an arbitrary string into a random but consistent float value?
The key part you want is a way of converting SHA1 or MD5 hash output into a float that is both deterministic and 1-to-1. Here's a simple solution based on MD5; the same idea works for integers, too.
require 'digest/md5'

class String
  def float_hash
    Digest::MD5.hexdigest(self).to_i(16).to_f
  end
end

puts "example_string".float_hash # returns 1.3084281619666243e+38
This generates a hexadecimal hash, then converts it to an integer, then converts that to a float. Each step is deterministic.
Note: as pointed out by #emboss, this reduces collision resistance because a double is 8 bytes and the hash is 16 bytes. It shouldn't be a big deal though by the sounds of your application.
If security is no issue, what you are describing is in my opinion not a hash function. A hash function is a one-way function, meaning computing the hash is easy but reverting it is "hard" or, ideally, impossible.
Your requirements instead describe an injective function. Given any x1, x2 in your domain X, the following holds:
For all x1, x2 in X: x1 != x2 => f(x1) != f(x2)
f(x) = x is such a function, f(x) = x² is not. In plain English: you want to have different results if your inputs are different, same results only if the inputs are the same. It is true that this also is true for secure hashes, but they additionally provide the one-way characteristics such as the property of not being able (easily) to find x if you are only given f(x), among others. As far as I understood, you don't need these security properties.
Trivially, such an injective mapping from String to Float would be given by simply interpreting the "String bytes" as "Float bytes" from now on, i.e. you reinterpret the same bytes (think C:
unsigned char *bytes = "...";
double d = *(double *)bytes;
). But there is a downside: the real trouble is that Float has a maximum precision, so you will run into an overflow situation if your strings are too long (Floats are internally represented as double values, which are 8 bytes). So there is not enough space for almost any use case. Even MD5-ing your strings first doesn't solve the problem, since MD5 output is already 16 bytes long.
So this could be a real problem, depending on your exact requirements. Although MD5 (or any other hash) will mess with the input enough to make it as random as possible, you still cut the range of possible values from 16 bytes down to effectively 8 bytes. (Note: truncating random 16-byte output at 8 bytes is generally considered "secure" in terms of preserving the randomness; Elliptic Curve Cryptography does something similar. But as far as I know nobody can really prove it, though neither has anyone proved the contrary so far.) So a collision is much more likely within your restricted Float range. By the birthday paradox, finding a collision takes about sqrt(number of values in the range) tries. For MD5 this is 2^64, but for your scheme it's only 2^32. That's still very, very unlikely to produce a collision; it's probably on the order of winning the lottery while simultaneously being struck by lightning. If you can live with this minimal possibility, go for it:
require 'digest/md5'

def string_to_float(str)
  Digest::MD5.digest(str).unpack('D').first
end
If uniqueness is of absolute priority I would recommend to move from floats to integers. Ruby has built-in support for large integers that are not restricted by the internal constraints of a long value (that's what a Fixnum boils down to). So any arbitrary hash output could be represented as a large integer number.
Yes, you are describing a hashing algorithm. You could use an MD5 or SHA1 digest (since they just produce random-looking bits) to generate a floating point number simply by using the String#unpack method with an argument of "G" (double-precision float, network byte order) on the digest:
require 'digest/sha1'

def string_to_float(str)
  Digest::SHA1.digest(str).unpack("G")[0]
end
string_to_float("abc-123") # => -2.86011943713676e-154
string_to_float("def-456") # => -1.13232994606094e+214
string_to_float("abc-123") # => -2.86011943713676e-154 OK!
string_to_float("def-456") # => -1.13232994606094e+214 OK!
Note that if you want the resulting floats to be in a particular range then you'll need to do some massaging.
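For instance, one hedged way to massage the result into the range 0.0...1.0 is to read the first 8 digest bytes as a big-endian 64-bit unsigned integer and divide by 2**64 (the helper name is mine, not from the answer):
require 'digest/sha1'

def string_to_unit_float(str)  # hypothetical helper
  Digest::SHA1.digest(str).unpack1("Q>") / 2.0**64  # uint64 -> float in [0, 1)
end

string_to_unit_float("abc-123")  #=> same value in 0.0...1.0 on every call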
Also note that the unpacked number doesn't use all of the bits from the digest, so you might want to fold the digest down to the number of bytes in a double (although you'll have to be careful not to decrease the entropy of the hash function, if you care about that sort of thing), e.g.:
def str2float(s)
  d = Digest::SHA1.digest(s)
  x, y = d[0..9], d[10..19]
  # XOR the 1st (x) and 2nd (y) halves together so all 20 bytes contribute.
  (0..9).map { |i| x[i].ord ^ y[i].ord }.pack("C*").unpack("G")[0]
end
