What is the SHA-1 of 255? - utf-8

Assume you have an SHA-1 implementation that accepts text as input instead of a byte array.
(For instance, some JavaScript libraries work like that.)
When you want to apply SHA-1 to a text (say the text is a password), you first do a UTF-8 encoding, because the text can contain multi-byte characters; in other words, the integer value corresponding to a character in the text can be larger than an 8-bit byte can hold. Since the SHA-1 algorithm works on 8-bit units, it helps to encode the text as UTF-8 first.
My question is this:
When you have non-textual binary data where the value of each byte is between 0 and 255, are you still expected to apply UTF-8 encoding to the binary data before you pass it to the SHA-1 algorithm? I know that when the values are between 0 and 127, UTF-8 does not modify the data at all.
However, if the values are between 128 and 255, UTF-8 encoding does modify the data (each such value, treated as a codepoint, becomes two bytes).
In summary, here is my question: what is the SHA-1 of the single byte containing the value 255 (all 1s)?
With UTF-8 encoding: 730cf30d408ecf51aad876f5c491f837f7ddea4c
Without UTF-8 encoding: 85e53271e14006f0265921d02d4d736cdc580b0b
Which one is the right one?

No, don't UTF-8 encode binary data, it makes no sense. If you want a hash of a piece of binary data, you should SHA-1 exactly that, not some random transformation of that data.
You shouldn't UTF-8 encode strings either unless what you want is the SHA-1 of the UTF-8 representation of that string.
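As a concrete illustration (a minimal Python sketch, assuming only the standard hashlib module; the question itself is language-neutral), hashing the raw byte 0xFF and hashing the UTF-8 encoding of the codepoint U+00FF give different digests, because UTF-8 turns U+00FF into the two bytes C3 BF:

import hashlib

raw = bytes([0xFF])                  # the single byte 1111 1111
as_text = "\u00ff".encode("utf-8")   # codepoint U+00FF -> two bytes c3 bf

print(raw.hex(), hashlib.sha1(raw).hexdigest())          # digest of the raw byte
print(as_text.hex(), hashlib.sha1(as_text).hexdigest())  # digest of its UTF-8 encoding

The two digests differ, which is exactly the discrepancy in the question; for binary data, hashing the bytes exactly as they are is the meaningful choice.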

Related

Converting an UTF-16 index to a UTF-8 compatible one

I am currently working with the Telegram API, and in one of its methods it returns the following information:
a piece of text
offset in UTF-16 Code Units
length in UTF-16 Code Units
In my programming language, Rust, all strings are valid UTF-8. This means that the UTF-16 offsets are not immediately useful, as they can be off by a variable amount (due to 1- or 3-byte code points). A code point that is one byte in UTF-8 corresponds to a two-byte code unit in UTF-16, so I cannot simply index the UTF-8 string with these offsets, as I may land outside a code-point boundary.
I am wondering now: is there a way to convert such an offset to a valid UTF-8 index without iterating through the UTF-8 string, or is the information useless once the text is in UTF-8?
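To make the mismatch concrete, here is a small sketch (in Python rather than Rust, purely for illustration; the example string is made up) showing that an offset counted in UTF-16 code units does not line up with a byte index into the UTF-8 form of the same text once a multi-byte character appears:

text = "a€b"   # '€' (U+20AC) is 3 bytes in UTF-8 but a single UTF-16 code unit

print(len(text.encode("utf-16-le")) // 2)   # 3 UTF-16 code units
print(len(text.encode("utf-8")))            # 5 UTF-8 bytes

# 'b' sits at UTF-16 offset 2 but at UTF-8 byte offset 4, so a UTF-16
# offset cannot be used directly as an index into the UTF-8 bytes.
print(text.encode("utf-8")[4:5].decode("utf-8"))   # b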

What are surrogate characters in UTF-8?

I have a strange validation program that checks whether a UTF-8 string is a valid host name (the Zend Framework Hostname validator in PHP). It allows IDNs (internationalized domain names). It compares each subdomain with sets of characters defined by their hex byte representations. Two such sets are D800-DB7F and DC00-DFFF. The PHP regex function preg_match fails during these comparisons and reports that DC00-DFFF characters are not allowed in this function. From Wikipedia I learned that these are called surrogate characters in UTF-8. What are they, and which characters do they actually correspond to? I have read about them in several places, but I still don't understand what they are.
What are surrogate characters in UTF-8?
This is almost like a trick question.
Approximate answer #1: 4 bytes (if paired and encoded in UTF-8).
Approximate answer #2: Invalid (if not paired).
Approximate answer #3: It's not UTF-8; It's Modified UTF-8.
Synopsis: The term doesn't apply to UTF-8.
Unicode codepoints have a range that needs 21 bits of data.
UTF-16 code units are 16 bits. UTF-16 encodes some ranges of Unicode codepoints as one code unit and others as pairs of two code units, the first from a "high" range, the second from a "low" range. Unicode reserves the codepoints that match the ranges of the high and low pairs as invalid. They are sometimes called surrogates but they are not characters. They don't mean anything by themselves.
UTF-8 code units are 8 bits. UTF-8 encodes several distinct ranges of codepoints in one to four code units, respectively.
#1 It happens that the codepoints that UTF-16 encodes with two 16-bit code units, UTF-8 encodes with 4 8-bit code units, and vice versa.
#2 You can mechanically apply the UTF-8 encoding algorithm to these reserved codepoints, but the result is invalid UTF-8: it can't be decoded back to a valid codepoint. A compliant reader would throw an exception or discard the bytes and insert a replacement character (�).
#3 Java provides a way of implementing functions in external code with a system called JNI. The Java String API exposes String and char as UTF-16 code units. In certain places in JNI, presumably as a convenience, string values are passed as Modified UTF-8. Modified UTF-8 is the UTF-8 encoding algorithm applied to UTF-16 code units instead of Unicode codepoints.
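A short Python sketch (illustrative only; Python exposes lone surrogates through its error handlers, which most languages do not) shows the three points above: a codepoint above U+FFFF is two 16-bit code units in UTF-16 but four bytes in UTF-8, a lone surrogate cannot be encoded by a compliant UTF-8 encoder, and only a non-standard escape hatch produces the surrogate-derived byte patterns that schemes like Modified UTF-8 rely on:

ch = "\U0001F600"   # a codepoint above U+FFFF

# #1: two UTF-16 code units (a surrogate pair) vs. four UTF-8 code units
print(ch.encode("utf-16-be").hex())   # d83dde00 -> code units D83D, DE00
print(ch.encode("utf-8").hex())       # f09f9880 -> four bytes

# #2: a lone surrogate codepoint is not encodable as UTF-8
try:
    "\ud83d".encode("utf-8")
except UnicodeEncodeError as e:
    print("compliant encoder refuses:", e.reason)

# #3: the non-standard 'surrogatepass' handler emits bytes anyway, which is
# the kind of output surrogate-based schemes such as Modified UTF-8 produce
print("\ud83d".encode("utf-8", "surrogatepass").hex())   # eda0bd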
Regardless, the fundamental rule of character encodings is to read with the encoding that was used to write. If any sequence of bytes is to be considered text, you must know the encoding; Otherwise, you have data loss.

Pack/Unpack and base64 in Ruby

I have a string a = "hello". I can convert it to base 2 or base 16 using unpack:
a.unpack('B*')
# => ["0110100001100101011011000110110001101111"]
a.unpack('H*')
# => ["68656c6c6f"]
To convert to base 64, I tried pack:
[a].pack('m0')
# => "aGVsbG8="
but the result is not what I expected. I thought that if I have some binary representation of a string and want to represent it in divided parts, I should use unpack, but it turned out that is not the case. Please help me understand this.
Per OP's clarified question, "Why do we use #pack to get base64 and #unpack to get other representations of raw data?"
The surface-level reason is that Array#pack is a method that returns a String, while String#unpack is a method that returns an Array.
There are stronger conceptual reasons underlying this. The key principle is that base64 is not an array of raw bytes. Rather, it's a 7-bit-ASCII-safe string that can represent arbitrary bytes if properly (de)coded.
Each base64 character maps to a sequence of six bits. At the byte level, that's a 4:3 ratio of characters to raw bytes. Since integer powers of 2 don't divide by 3, we end up with padding more often than not, and you can't slice base64 in arbitrary places to get ranges of bytes out of it (you'd have to figure out which bytes you want in groups of three and go get the associated base64 characters in groups of four).
Arbitrary sequences of data are, fundamentally, arrays of bytes. Base64-encoded sequences are, fundamentally, strings: data sequences constrained to the range of bytes safely transmissible and displayable as text.
Base64 is the encapsulation (or "packing") of a data array into a string.
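A small sketch (in Python for illustration; the question's Ruby calls do the same thing) makes the 4:3 ratio visible: three 8-bit bytes are regrouped into four 6-bit values, and each 6-bit value selects one character from the 64-character alphabet:

import base64

data = b"hello"
print(base64.b64encode(data))   # b'aGVsbG8=' -- same result as [a].pack('m0')

# Regroup the first three bytes (24 bits) into four 6-bit values by hand
bits = "".join(f"{byte:08b}" for byte in data[:3])    # 'hel' -> 24 bits
groups = [bits[i:i + 6] for i in range(0, 24, 6)]     # four 6-bit groups
alphabet = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz0123456789+/")
print("".join(alphabet[int(g, 2)] for g in groups))   # 'aGVs', the first four characters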
The encoded text is correct. To validate it, you can use the online tool below:
https://www.base64encode.org/
text:
hello
Encoded Base64:
aGVsbG8=
Useful resource:
https://idiosyncratic-ruby.com/4-what-the-pack.html

ASCII characters set

I am reading a .txt file; the contents of the first line are just the four alphabet letters: "abcd".
When I display the ASCII codes of these letters, I expected to find 97, 98, 99 and 100 respectively for a, b, c and d. But I found two special characters whose codes are 255 and 254, for ÿ and þ.
Therefore the length of the line read is 6, not 4, because of "ÿþabcd". Must these special characters be inserted at the start of every sequential text file, or is there a way to avoid them?
ASCII is only used in niche or archaic systems. Your data proved to you that your file is not ASCII. You must find out which character set and encoding the file was stored in.
Character Sets
All text is an encoding of elements of a character set. Elements of a character set are called codepoints. A character set consists of a list of codepoints and their descriptions. The description states how the codepoint is used semantically in text, such as LATIN CAPITAL LETTER A (A) or N-ARY PRODUCT (∏). (The style in which the codepoint is rendered is the purview of typefaces/fonts.)
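For instance (a quick Python illustration using the standard unicodedata module), the two descriptions quoted above can be looked up directly from the codepoints:

import unicodedata

for ch in ("A", "\u220f"):   # LATIN CAPITAL LETTER A, N-ARY PRODUCT
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), ch)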
Encodings
Codepoints are numbered with non-negative integers. The number is encoded into bytes. Most character sets have only one encoding, which is the number as an unsigned integer in the smallest size that can represent all of the codepoints. For example, Windows-1252 has 251 codepoints, with numbers between 0 and 255. A byte is big enough to represent any of them. The Unicode character set has about 1.1 million codepoints, numbered from 0 to 1,114,111. A 32-bit integer is big enough to represent all of them. That's the UTF-32 encoding.
Byte order
Computer memory is usually byte-addressable and files are byte sequences, so for a larger integer the question becomes in which order the bytes are stored: most significant byte first (big-endian) or least significant byte first (little-endian). Software adapts to or assumes one way or the other. So, UTF-32 actually identifies one of two encodings: UTF-32BE or UTF-32LE. UTF-32 alone is shorthand for whichever endianness the software assumes. Typically, the OS assumes the endianness of the hardware it is running on and programs follow suit.
Unicode Encodings
UTF-32 takes a lot of space. The most commonly used codepoints are numbered below 65,536. So there can be savings if codepoints are represented by a variable number of smaller integers. The size of that integer is called the code unit. The value of a code unit contains some of the bits of the codepoint and indicates whether more code units follow that carry the rest of the bits. So, there are UTF-16LE, UTF-16BE and UTF-8 (and more) encodings for Unicode. UTF-16 uses one or two 16-bit code units per codepoint and UTF-8 uses one to four 8-bit code units per codepoint.
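A brief Python sketch shows this for a codepoint in the ASCII range, one below U+FFFF and one above it (the lengths printed are byte counts, so divide by 2 for UTF-16 code units and by 4 for UTF-32):

for ch in ("A", "\u20ac", "\U0001F600"):          # U+0041, U+20AC, U+1F600
    print(f"U+{ord(ch):04X}",
          "utf-8:",  len(ch.encode("utf-8")),      # 1, 3, 4 bytes
          "utf-16:", len(ch.encode("utf-16-le")),  # 2, 2, 4 bytes
          "utf-32:", len(ch.encode("utf-32-le")))  # always 4 bytes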
Files are data outside of programs. So for a program to read text, it has to know the character set and encoding. Often this metadata is not stored with the file (within or beside). That's how you made the mistake of believing your file is ASCII. If you don't know the encoding of a file, you've lost data. You might be able to recover it through guessing. It is notable that the CP437 character set has 256 codepoints, numbered 0 to 255 and encoded in one byte. So every file can be read as CP437; The question is, is that right? Even if it looks right, it's probably not right unless it's from a Western culture circa 1990.
Unicode Byte-order Mark
A strong clue about which character set and encoding to guess is called the byte-order mark (BOM). Recall that encodings with code units larger than one byte have an endianness. Endianness is a hardware concern, so although a file can be passed between systems with agreement on which character set and encoding scheme is used, the endianness attribute of the encoding is critical to each system. It has become standard to indicate the byte order within the file itself, as the first bytes. Unicode specifies a codepoint (U+FEFF) to use for this purpose when it appears at the very beginning of the file. (That means programs reading Unicode from a file must separate this metadata from the data.) Many file-writing libraries write the BOM codepoint regardless of the code unit size, so you'll also see it at the beginning of UTF-8 files. Since the Unicode BOM looks different in each of the Unicode encodings, it completely identifies which Unicode encoding is being used.
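The distinct BOM byte sequences can be inspected directly (a Python sketch using the standard codecs module); each is the encoding of the same codepoint, U+FEFF, so the bytes reveal both the encoding and the byte order:

import codecs

print("UTF-8:   ", codecs.BOM_UTF8.hex())      # efbbbf
print("UTF-16LE:", codecs.BOM_UTF16_LE.hex())  # fffe
print("UTF-16BE:", codecs.BOM_UTF16_BE.hex())  # feff
print("UTF-32LE:", codecs.BOM_UTF32_LE.hex())  # fffe0000
print("UTF-32BE:", codecs.BOM_UTF32_BE.hex())  # 0000feff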
Guessing
Your file begins with the UTF-16LE BOM. Read it as UTF-16LE (and discard the BOM codepoint if your library doesn't already.)
Given the specificness of the Unicode BOM, its presence is a strong indicator that the file is encoded in Unicode and the actual bytes tell which Unicode encoding. However, as noted above, it's possible that this guess is wrong.
As #Lưu Vĩnh Phúc points out, it is unclear how you are reading "ÿþabcd" from what you say is a 6-byte file. Open the file in a hex editor. UTF-16LE should be FF FE 61 00 62 00 63 00 64 00.
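To see this concretely (a Python sketch; the file name is made up, and the assumption is that the file really is UTF-16LE with a BOM), writing "abcd" with the utf-16 codec produces exactly those bytes on little-endian hardware, and reading it back with the same codec strips the BOM automatically:

with open("sample.txt", "w", encoding="utf-16") as f:
    f.write("abcd")   # Python's utf-16 codec prepends a BOM; little-endian on typical hardware

with open("sample.txt", "rb") as f:
    print(f.read().hex(" "))   # ff fe 61 00 62 00 63 00 64 00

with open("sample.txt", encoding="utf-16") as f:   # the BOM is detected and stripped
    text = f.read()
print(repr(text), len(text))   # 'abcd' 4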

Encrypt printable text so result is still printable (can be typed)

I want to encrypt some info for a licensing system and I want the result to be able to be typed in by the user.
Update: This operation must be reversible (decrypt-able)
E.g.,
Encrypt ( ComputerID+ProductID) -> (any standard ASCII character that can be typed. Ideally maybe even just A-Z).
So far what I did was to convert the encrypted text to HEX (so it's any character from 0-F) but that doubles the number of characters.
I'm using VB6.
I'm thinking I'd do some operation on each pair of (Input$(x) and Key$(x)) and then do a MOD to keep it within a range of ASCII values (maybe 0-9 and A-Z).
Any suggestions of a good algorithm?
Look into Base64 "encryption."
Base64 represents data using 64 different ASCII characters, versus hex, which uses only 16 different ASCII characters, making Base64 more compact and closer to what you are looking for.
EDIT:
Code to do this in VB6 is available here: http://www.nonhostile.com/howto-encode-decode-base64-vb6.asp
Per Fuzzy Lollipop, below, Base32 looks like an even better option. Bonus points if you can find an example of that.
EDIT: I found an example of Base32 for VB6 although I've not tried it yet. -Clay
encode the encrypted bytes in HEX, or Base32 or Base64
Do you want this to be reversible -- to recover the IDs from the encrypted text? If so then it matters how you combine the key and input strings.
Usually you'd XOR each byte pair (work with byte arrays to avoid Unicode issues), cycling through the key string if it's shorter than the input. You can then use Base-N encoding (Base32, Base64, etc.) to generate the license string.
Both operations are reversible: you can recover the XORed strings from the Base N string, then XOR with the key again to get the original IDs.
If you don't care about reversing the operations, then any convolution of key and ID will do. XOR is just the simplest.
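Here is a compact sketch of that approach (in Python rather than VB6, just to show the round trip; the key and ID values are made up, and XOR with a fixed key is obfuscation rather than real encryption):

import base64
from itertools import cycle

def xor_bytes(data: bytes, key: bytes) -> bytes:
    # XOR each data byte with a key byte, cycling the key if it is shorter
    return bytes(d ^ k for d, k in zip(data, cycle(key)))

key = b"SECRET"                 # hypothetical key
ids = b"COMPUTERID|PRODUCTID"   # hypothetical ComputerID + ProductID

license_text = base64.b32encode(xor_bytes(ids, key)).decode("ascii")
print(license_text)             # typeable: only A-Z, 2-7 and '=' padding

# Reversing: decode the Base32, then XOR with the same key again
recovered = xor_bytes(base64.b32decode(license_text), key)
print(recovered)                # b'COMPUTERID|PRODUCTID'

A real licensing scheme would substitute a proper cipher or MAC for the XOR step; the Base32 step is what keeps the result typeable.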
