I wanted to encrypt a string from an input using the Encryption Library in CodeIgniter. I want it to generate 32 characters regardless of how long the input is, but the number of characters generated by encrypt() depends on how many characters the input...
If you could encrypt any string down to 32 characters, 50-gigabyte games and three-hour 8K movies could be compressed down to 32 characters. Obviously, that's not possible.
Consider an MD5 or SHA-1 hash of the string instead. It won't be decryptable, and it isn't guaranteed to be unique, but it will have a fixed, predictable length.
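For illustration, here is a minimal Ruby sketch using the standard digest library (the question itself concerns CodeIgniter/PHP, but the idea is language-agnostic):

require "digest"

# An MD5 hex digest is always 32 characters, regardless of input length.
Digest::MD5.hexdigest("short input").length    #=> 32
Digest::MD5.hexdigest("x" * 1_000_000).length  #=> 32

# A SHA-1 hex digest is always 40 characters.
Digest::SHA1.hexdigest("short input").length   #=> 40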
This question is a follow up to my previous question here: How can I convert a UUID to a string using a custom character set in Ruby?
But I will try to formulate it as a separate and specific question.
I have a 128-bit UUID in Ruby as a hex value:
SecureRandom.uuid #=> "2d931510-d99f-494a-8c67-87feb05e1594"
If I understand the IFC specification correctly (http://www.buildingsmart-tech.org/ifc/IFC2x3/TC1/html/ifcutilityresource/lexical/ifcgloballyuniqueid.htm), I want to Base64-encode this, but instead of getting padding at the end, I want the output to begin with a 2-bit character (4 options) instead of a 6-bit one (64 options).
This way I think I can end up with a string of 22 characters (1 character of 2 bits and 21 characters of 6 bits, for a total of 128 bits).
Is it possible to tweak the Ruby base64 in this way?
Short answer: no. Technically speaking, that is not standard Base64, so Ruby's standard library will not deal with it.
Ruby's Base64 library takes its input as bytes, so you need your input to be a whole number of bytes. You want 4 zero bits in front of your 128-bit UUID, which gives 4 + 128 = 132 bits; the next multiple of 8 is 136 bits, i.e. 17 bytes. You can discard the extra randomness at the end:
require "securerandom"
require "base64"

x = SecureRandom.gen_random 17    # 17 bytes = 136 bits: 4 spare bits + 128 bits + a little extra randomness
x[0] = (x[0].ord & 0x0f).chr      # zero out the first four bits
Base64.strict_encode64(x)[0...22] # keep 22 characters, discarding the extra randomness
The one downside of this approach is that your 128-bit UUID is oddly aligned inside x and hard to see on its own. If you want to get the 128 bits out, you can do that with some pack/unpack:
[x.unpack("B*")[0][4...132]].pack("B*") # bits 4..131 are the 128-bit UUID
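As a quick sanity check (a sketch assuming the standard securerandom and base64 libraries), the first character of the truncated encoding can only take four values, which matches the 2-bit leading character the question asks for:

require "securerandom"
require "base64"

x = SecureRandom.gen_random 17
x[0] = (x[0].ord & 0x0f).chr
Base64.strict_encode64(x)[0] # always "A", "B", "C" or "D": the four 2-bit options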
Is there a way to compress/encode a string to a specified length (8-10 characters)?
I have a combination of a secret key and a 16-digit numeric value, and I want to create a unique ID from the combination of the two. Its length should be between 8 and 12 characters, and it should not change if the combination is the same.
Please suggest a way.
If it's 16 decimal digits and your string can contain any characters, then sure. If you want ten characters out, you need 40 different characters, since 40^10 > 10^16. For nine characters out, you need 60 different characters, since 60^9 > 10^16. E.g. some subset of the upper-case letters, lower-case letters, and digits (62 characters to choose your 40 or 60 from). Then it is simply a matter of base conversion either way: convert from base 10 to base 40 or 60, and then back.
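For instance, a minimal Ruby sketch of that base conversion (the 60-character alphabet here is an arbitrary choice for illustration, not something prescribed by the answer):

# An arbitrary 60-character alphabet drawn from digits and letters.
ALPHABET = (("0".."9").to_a + ("a".."z").to_a + ("A".."Z").to_a).first(60)

def encode_base60(n)
  return ALPHABET[0] if n.zero?
  out = ""
  while n > 0
    n, r = n.divmod(60)
    out = ALPHABET[r] + out
  end
  out
end

def decode_base60(s)
  s.chars.reduce(0) { |n, c| n * 60 + ALPHABET.index(c) }
end

# Any 16-digit number fits in nine characters, since 60**9 > 10**16.
encode_base60(9_999_999_999_999_999).length #=> 9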
Many languages already have Base64 coding routines, which will get you to nine characters (64^9 > 10^16).
Eight is a problem, since you would need 100 different characters (100^8 == 10^16), and there are only 95 printable ASCII characters.
You could use a secure hash function, like SHA-512, and truncate the resulting hex string to the desired length.
If you want slightly more entropy per character, you can Base64-encode the digest before truncating.
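A minimal sketch of that idea in Ruby (the key and value here are placeholders; note that truncating a digest weakens collision resistance accordingly):

require "digest"

secret_key = "my-secret"        # placeholder
value      = "1234567890123456" # the 16-digit numeric value

# Truncated hex digest: 10 characters, stable for the same inputs.
Digest::SHA512.hexdigest(secret_key + value)[0, 10]

# Base64 packs ~6 bits per character instead of 4, so a 10-character
# prefix of the Base64 digest carries more entropy.
Digest::SHA512.base64digest(secret_key + value)[0, 10]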
If UTF-8 is 8 bits, does that not mean there can be a maximum of only 256 different characters?
The first 128 code points are the same as in ASCII. But it says UTF-8 can support up to a million characters?
How does this work?
UTF-8 does not always use one byte; it uses 1 to 4 bytes per character.
The first 128 characters (US-ASCII) need one byte.
The next 1,920 characters need two bytes to encode. This covers the remainder of almost all Latin alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets, as well as Combining Diacritical Marks.
Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use, including most Chinese, Japanese and Korean (CJK) characters.
Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).
source: Wikipedia
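You can see these byte counts directly in Ruby, where string literals are UTF-8 by default:

"A".bytesize  #=> 1 (US-ASCII)
"é".bytesize  #=> 2 (Latin-1 Supplement)
"€".bytesize  #=> 3 (rest of the Basic Multilingual Plane)
"😀".bytesize #=> 4 (supplementary planes, e.g. emoji)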
UTF-8 uses 1-4 bytes per character: one byte for ASCII characters (the first 128 Unicode values are the same as ASCII), which requires only 7 bits. If the highest bit is set, this indicates the start of a multi-byte sequence; the number of consecutive high bits set indicates the number of bytes, then comes a 0, and the remaining bits of that byte contribute to the value. In each continuation byte, the highest two bits are 1 and 0, and the remaining 6 bits are for the value.
So a four-byte sequence begins with 11110 (with 3 bits left for the value), followed by three continuation bytes with 6 value bits each, yielding a 21-bit value. 2^21 exceeds the number of Unicode characters, so all of Unicode can be expressed in UTF-8.
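To make the bit layout concrete, here is a hand-rolled single-code-point encoder in Ruby (a sketch only; in practice String#encode and String#unpack handle this for you):

# Encode one Unicode code point to its UTF-8 byte sequence.
def utf8_bytes(cp)
  case cp
  when 0..0x7F       then [cp]                  # 0xxxxxxx
  when 0x80..0x7FF   then [0xC0 | (cp >> 6),    # 110xxxxx
                           0x80 | (cp & 0x3F)]  # 10xxxxxx
  when 0x800..0xFFFF then [0xE0 | (cp >> 12),   # 1110xxxx
                           0x80 | ((cp >> 6) & 0x3F),
                           0x80 | (cp & 0x3F)]
  else                    [0xF0 | (cp >> 18),   # 11110xxx
                           0x80 | ((cp >> 12) & 0x3F),
                           0x80 | ((cp >> 6) & 0x3F),
                           0x80 | (cp & 0x3F)]
  end
end

utf8_bytes(0x1F600).map { |b| b.to_s(2) }
#=> ["11110000", "10011111", "10011000", "10000000"]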
Unicode vs UTF-8
Unicode maps code points to characters. UTF-8 is a storage mechanism for Unicode. Unicode has a spec; UTF-8 has a spec. They have different limits: UTF-8 has a different (higher) upper bound.
Unicode
Unicode is organized into "planes." Each plane carries 2^16 code points. There are 17 planes in Unicode, for a total of 17 * 2^16 code points. The first plane, plane 0 or the BMP, is special in terms of what it carries.
Rather than explain all the nuances, let me just quote the above article on planes.
The 17 planes can accommodate 1,114,112 code points. Of these, 2,048 are surrogates, 66 are non-characters, and 137,468 are reserved for private use, leaving 974,530 for public assignment.
UTF-8
Now let's go back to the article linked above,
The encoding scheme used by UTF-8 was designed with a much larger limit of 2^31 code points (32,768 planes), and can encode 2^21 code points (32 planes) even if limited to 4 bytes. Since Unicode limits the code points to the 17 planes that can be encoded by UTF-16, code points above 0x10FFFF are invalid in UTF-8 and UTF-32.
So you can see that you can put stuff into UTF-8 that isn't valid Unicode. Why? Because UTF-8 accommodates code points that Unicode doesn't even support.
UTF-8, even with the four-byte limitation, supports 2^21 code points, which is far more than 17 * 2^16.
According to this table* UTF-8 should support:
2^31 = 2,147,483,648 characters
However, RFC 3629 restricted the possible values, so now we're capped at 4 bytes, which gives us
2^21 = 2,097,152 characters
Note that a good chunk of those code points are reserved for private use, which is actually pretty handy for icon fonts.
* Wikipedia used to show a table with 6 bytes; they've since updated the article.
2017-07-11: Corrected for double-counting the same code point encoded with multiple bytes
2,164,864 “characters” can be potentially coded by UTF-8.
This number is 2^7 + 2^11 + 2^16 + 2^21, which comes from the way the encoding works:
1-byte chars have 7 bits for encoding
0xxxxxxx (0x00-0x7F)
2-byte chars have 11 bits for encoding
110xxxxx 10xxxxxx (0xC0-0xDF for the first byte; 0x80-0xBF for the second)
3-byte chars have 16 bits for encoding
1110xxxx 10xxxxxx 10xxxxxx (0xE0-0xEF for the first byte; 0x80-0xBF for continuation bytes)
4-byte chars have 21 bits for encoding
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (0xF0-0xF7 for the first byte; 0x80-0xBF for continuation bytes)
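Summing the value bits available at each sequence length reproduces that figure (a quick check in Ruby):

[7, 11, 16, 21].sum { |bits| 2**bits } #=> 2164864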
As you can see, this is significantly larger than current Unicode (1,112,064 characters).
UPDATE
My initial calculation is wrong because it doesn't consider additional rules. See comments to this answer for more details.
UTF-8 is a variable length encoding with a minimum of 8 bits per character.
Characters with higher code points will take up to 32 bits.
Quote from Wikipedia: "UTF-8 encodes each of the 1,112,064 code points in the Unicode character set using one to four 8-bit bytes (termed "octets" in the Unicode Standard)."
Some links:
http://www.utf-8.com/
http://www.joelonsoftware.com/articles/Unicode.html
http://www.icu-project.org/docs/papers/forms_of_unicode/
http://en.wikipedia.org/wiki/UTF-8
Check out the Unicode Standard and related information, such as the FAQ entry UTF-8, UTF-16, UTF-32 & BOM. It's not entirely smooth sailing, but it's authoritative information, and much of what you might read about UTF-8 elsewhere is questionable.
The "8" in "UTF-8" refers to the length of code units in bits. Code units are entities used to encode characters, though not necessarily as a simple one-to-one mapping. UTF-8 uses a variable number of code units to encode a character.
The collection of characters that can be encoded in UTF-8 is exactly the same as for UTF-16 or UTF-32, namely all Unicode characters. They all encode the entire Unicode coding space, which even includes noncharacters and unassigned code points.
While I agree with mpen about the current maximum number of UTF-8 codes (2,164,864) (listed below; I couldn't comment on his answer), he is off by 2 levels if you remove the 2 major restrictions of UTF-8: the 4-byte limit, and the rule that bytes 254 and 255 cannot be used (he only removed the 4-byte limit).
A starting byte of 254 follows the basic arrangement of starting bits (multi-byte flag set to 1, a count of six 1s, and a terminating 0, with no spare bits), giving you 6 additional bytes to work with (six 10xxxxxx groups, an additional 2^36 codes).
A starting byte of 255 doesn't exactly follow the basic setup: it has no terminating 0 because all of its bits are used (multi-byte flag set to 1 and a count of seven 1s), giving you 7 additional bytes (seven 10xxxxxx groups, an additional 2^42 codes).
Adding these in gives a final maximum presentable character set of 4,468,982,745,216. This is more than all characters in current use, old or dead languages, and any believed lost languages. Angelic or Celestial script anyone?
There are also single-byte codes, in addition to 254 and 255, that are overlooked/ignored in the UTF-8 standard: 128-191 and a few others. Some are used locally by the keyboard; for example, code 128 is usually a deleting backspace. The other starting codes (and their associated ranges) are invalid for one or more reasons (https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences).
Unicode is firmly married to UTF-8. Unicode code points live in a 21-bit code space (U+0000..U+10FFFF, i.e. 1,114,112 code points), which is exactly the range that 4-byte UTF-8 is limited to. Both systems reserve the same "dead" space and restricted zones for code points, etc. As of June 2018 the most recent version, Unicode 11.0, contains a repertoire of 137,439 characters.
From the unicode standard. Unicode FAQ
The Unicode Standard encodes characters in the range U+0000..U+10FFFF, which amounts to a 21-bit code space.
From the UTF-8 Wikipedia page. UTF-8 Description
Since the restriction of the Unicode code-space to 21-bit values in 2003, UTF-8 is defined to encode code points in one to four bytes, ...
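As a quick arithmetic check (here in Ruby), the code space quoted above is exactly 17 planes, and it fits within 21 bits:

0x10FFFF + 1 #=> 1114112 code points in the Unicode code space
17 * 2**16   #=> 1114112, i.e. 17 planes of 2**16 code points each
2**21        #=> 2097152, the full 21-bit space four-byte UTF-8 can address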
I need to create a database column which will store a string encrypted using Triple DES. How do I determine the length of the encrypted string column?
(Answers for algorithms other than Triple DES are also welcome.)
Block ciphers such as DES, 3DES and AES can only operate on whole blocks of bytes. DES and 3DES operate on blocks of 8 bytes, and AES on blocks of 16 bytes.
To handle this, you usually apply a reversible padding to your plaintext before encrypting it. This will almost always be "PKCS" padding (also called PKCS#5 or PKCS#7 padding).
PKCS padding adds at least one byte, such that the padded text has a length divisible by the block length (8 bytes for 3DES). The value of each padding byte is the number of bytes added. E.g. ABCDEF is padded to ABCDEF0505050505, and 0011223344556677 is padded to 00112233445566770808080808080808. Note that this is easy to remove: you just look at the final byte of the padded text, verify that it is between 1 and the block length, and remove that number of bytes from the end (verifying that each removed byte has the correct value).
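A minimal Ruby sketch of this padding scheme (for illustration only; in practice the crypto library applies and removes the padding for you):

BLOCK = 8 # 3DES block size in bytes

def pkcs_pad(data)
  pad = BLOCK - (data.bytesize % BLOCK) # always between 1 and BLOCK
  data + pad.chr * pad
end

def pkcs_unpad(data)
  pad = data[-1].ord
  raise "bad padding" unless pad.between?(1, BLOCK) && data[-pad, pad] == pad.chr * pad
  data[0...-pad]
end

pkcs_pad("\xAB\xCD\xEF".b).unpack1("H*") #=> "abcdef0505050505"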
Finally, to answer your question: assuming you are using 3DES in CBC mode with PKCS padding (which you probably are), the ciphertext for a plaintext of n bytes will have length:
n + 8 - (n % 8)
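For example, a sketch using Ruby's OpenSSL bindings (which apply PKCS padding by default; note the IV must also be stored, adding another 8 bytes if it goes in the same column):

require "openssl"

cipher = OpenSSL::Cipher.new("des-ede3-cbc") # 3DES in CBC mode
cipher.encrypt
cipher.key = OpenSSL::Random.random_bytes(24)
iv = cipher.random_iv                        # store this alongside the ciphertext

plaintext  = "hello world"                   # n = 11
ciphertext = cipher.update(plaintext) + cipher.final
ciphertext.bytesize                          #=> 16, i.e. 11 + 8 - (11 % 8)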
Using Triple DES keeps the data roughly the same length, but rounds it up to the next 64-bit boundary. If you intend to "display" it, you'll have to encode it (in Base64, for example).
As for other algorithms, it is difficult to answer, as there are plenty. Block ciphers will always pad input to match their block size, whereas many stream ciphers will not.
Triple DES uses three 56-bit DES keys, giving a 168-bit key. Its block size is 64 bits.