String representation of binary data? - image

What is the correct term for the characters seen here?
data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wCEAAkGBhQSERUUExQVFRUWGRwaGBgYGB0dGxkcHxccGxocGxoYHCYgGxojHBsdHy8gJCgpLSwsGB4xNTAqNSYrLCkBCQoKDgwOGg8PGiklHx8pLCwpKSkpKSwpLCkpKSwpLCksKSwsLCkpKSkpLCkpKSwpLCwsKSwsLCksKSkpKSksKf/AABEIAJ0BQQMBIgACEQEDEQH/xAAcAAACAwEBAQEAAAAAAAAAAAAEBQIDBgEABwj/xABFEAACAQIEBAQDBgMGAwcFAAABAhEDIQAEEjEFQVFhBhMicTKBkRRCobHB8CNS0QcVM1Ni4RaS8TRygpOiwtIkQ1RVY//EABoBAAIDA..
Is this a string representation of binary data? String representation of an image?

That is called a Data URI. It is a technique used to inline data directly into an HTML document rather than requiring a separate request to fetch a file.
The characters are a Base64 encoding of binary data - in this case it is data for an image.

Is this a string representation of binary data? String representation of an image?
Yes. This is a data URI, composed of two parts - a header telling the type of data and the data itself (a base64 encoded binary).
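To make the two parts concrete, here is a minimal Python sketch that builds a tiny data URI and takes it apart again. The four payload bytes are only an assumed stand-in for a real image (they are the usual opening bytes of a JPEG file), and the variable names are made up for illustration. The truncated URI from the question is not decoded here.

```python
import base64

# Assumed stand-in payload: the first four bytes of a typical JPEG file.
payload = b"\xff\xd8\xff\xe0"

# Header ("data:<media type>;base64") and data are separated by a comma.
uri = "data:image/jpeg;base64," + base64.b64encode(payload).decode("ascii")

# Split at the first comma: everything before it is the header.
header, _, encoded = uri.partition(",")
media_type = header[len("data:"):].split(";")[0]

# Decoding the Base64 text gives back the original binary data.
decoded = base64.b64decode(encoded)

print(uri)
print(media_type, decoded)
```

Note that the encoded text starts with /9j/, just like the string in the question, which is what you would expect for JPEG data.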

The image is encoded in base64.
Base64 encoding takes three bytes, each consisting of eight bits, and represents them as four printable characters in the ASCII standard. It does that in essentially two steps.
The first step is to convert three bytes to four numbers of six bits. Each character in the ASCII standard consists of seven bits. Base64 only uses 6 bits (corresponding to 2^6 = 64 characters) to ensure encoded data is printable and humanly readable. None of the special characters available in ASCII are used. The 64 characters (hence the name Base64) are 10 digits, 26 lowercase characters, 26 uppercase characters as well as '+' and '/'.
If, for example, the three bytes are 155, 162 and 233, the corresponding (and frightening) bit stream is 100110111010001011101001, which in turn corresponds to the 6-bit values 38, 58, 11 and 41.
From Base64
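The worked example above can be reproduced with a short Python sketch. The helper name to_sextets is hypothetical; the alphabet is the standard Base64 one, and the result is cross-checked against Python's own base64 module.

```python
import base64

# Standard Base64 alphabet: index 0..63 -> printable character.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def to_sextets(three_bytes):
    """Split exactly three bytes (24 bits) into four 6-bit values."""
    if len(three_bytes) != 3:
        raise ValueError("expected exactly three bytes")
    n = int.from_bytes(three_bytes, "big")
    # Take the 24-bit number apart, 6 bits at a time, most significant first.
    return [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]


sample = bytes([155, 162, 233])
values = to_sextets(sample)

# Second step: look each 6-bit value up in the alphabet.
encoded = "".join(ALPHABET[v] for v in values)

print(values)   # the four 6-bit values
print(encoded)  # the four Base64 characters
print(base64.b64encode(sample).decode("ascii"))  # library result for comparison
```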

Related

What are surrogate characters in UTF-8?

I have a strange validation program that validates whether a UTF-8 string is a valid host name (the Zend Framework Hostname validator in PHP). It allows IDNs (internationalized domain names). It compares each subdomain with sets of characters defined by their hex byte representation. Two such sets are D800-DB7F and DC00-DFFF. The PHP regexp comparison function preg_match fails during these comparisons and says that DC00-DFFF characters are not allowed in this function. From Wikipedia I learned that these bytes are called surrogate characters in UTF-8. What are they and which characters do they actually correspond to? I have read about them in several places, but I still don't understand what they are.
What are surrogate characters in UTF-8?
This is almost like a trick question.
Approximate answer #1: 4 bytes (if paired and encoded in UTF-8).
Approximate answer #2: Invalid (if not paired).
Approximate answer #3: It's not UTF-8; It's Modified UTF-8.
Synopsis: The term doesn't apply to UTF-8.
Unicode codepoints have a range that needs 21 bits of data.
UTF-16 code units are 16 bits. UTF-16 encodes some ranges of Unicode codepoints as one code unit and others as pairs of two code units, the first from a "high" range, the second from a "low" range. Unicode reserves the codepoints that match the ranges of the high and low pairs as invalid. They are sometimes called surrogates but they are not characters. They don't mean anything by themselves.
UTF-8 code units are 8 bits. UTF-8 encodes several distinct ranges of codepoints in one to four code units, respectively.
#1 It happens that the codepoints that UTF-16 encodes with two 16-bit code units, UTF-8 encodes with 4 8-bit code units, and vice versa.
#2 You can apply the UTF-8 encoding algorithm to the invalid codepoints, which is invalid. They can't be decoded to a valid codepoint. A compliant reader would throw an exception or throw out the bytes and insert a replacement character (�).
#3 Java provides a way of implementing functions in external code with a system called JNI. The Java String API provides access to String and char as UTF-16 code units. In certain places in JNI, presumably as a convenience, string values are modified UTF-8. Modified UTF-8 is the UTF-8 encoding algorithm applied to UTF-16 code units instead of Unicode codepoints.
Regardless, the fundamental rule of character encodings is to read with the encoding that was used to write. If any sequence of bytes is to be considered text, you must know the encoding; Otherwise, you have data loss.
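Points #1 and #2 can be observed from Python, as a rough sketch. U+1F600 is just an arbitrarily chosen code point outside the BMP, and "surrogatepass" is Python's own escape hatch for forcing the UTF-8 bit layout onto a surrogate; other languages behave differently.

```python
ch = "\U0001F600"  # arbitrary example code point outside the BMP

utf16 = ch.encode("utf-16-be")
utf8 = ch.encode("utf-8")

# UTF-16 uses two 16-bit code units: one from the high range, one from the low.
high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")
print(hex(high), hex(low), len(utf8))

# A lone surrogate is not a valid code point to encode as UTF-8.
lone = "\ud83d"
try:
    lone.encode("utf-8")
    lone_encodes = True
except UnicodeEncodeError:
    lone_encodes = False

# Forcing the UTF-8 algorithm onto it anyway produces bytes that a
# compliant decoder rejects.
forced = lone.encode("utf-8", "surrogatepass")
try:
    forced.decode("utf-8")
    forced_decodes = True
except UnicodeDecodeError:
    forced_decodes = False

print(lone_encodes, forced, forced_decodes)
```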

Pack/Unpack and base64 in Ruby

I have a string a = "hello". I can convert it to base 2 or base 16 using unpack:
a.unpack('B*')
# => ["0110100001100101011011000110110001101111"]
a.unpack('H*')
# => ["68656c6c6f"]
To convert to base 64, I tried pack:
[a].pack('m0')
# => "aGVsbG8="
but the result is not what I expected. I thought that if I have some binary representation or a string, to represent it in divided parts, I should use unpack. But it turned out not. Please help me understand it.
Per OP's clarified question, "Why do we use #pack to get base64 and #unpack to get other representations of raw data?"
The surface level reason is because Array#pack is a method that returns a String, while String#unpack is a method that returns an Array.
There are stronger conceptual reasons underlying this. The key principle is that base64 is not an array of raw bytes. Rather, it's a 7-bit-ASCII-safe string that can represent arbitrary bytes if properly (de)coded.
Each base64 character maps to a sequence of six bits. At the byte level, that's a 4:3 ratio of characters to raw bytes. Since integer powers of 2 don't divide by 3, we end up with padding more often than not, and you can't slice base64 in arbitrary places to get ranges of bytes out of it (you'd have to figure out which bytes you want in groups of three and go get the associated base64 characters in groups of four).
Arbitrary sequences of data are, fundamentally, arrays of bytes. Base64-encoded sequences are, fundamentally, strings: data sequences constrained to the range of bytes safely transmissible and displayable as text.
Base64 is the encapsulation (or "packing") of a data array into a string.
The encoded text is correct. To validate it, use the online tool below:
https://www.base64encode.org/
text:
hello
Encoded Base64:
aGVsbG8=
Useful resource:
https://idiosyncratic-ruby.com/4-what-the-pack.html

How to check if all letters in UTF-8 string are ASCII?

How to check that all string characters are ASCII? It is said in the documentation:
Unicode characters U+0000 to U+007F (ASCII) are encoded simply as
bytes 00h to 7Fh (ASCII compatibility). This means that files and
strings which contain only 7-bit ASCII characters have the same
encoding under both ASCII and UTF-8. All characters >U+007F are
encoded as a sequence of several bytes, each of which has the two most
significant bits set.
So I wonder how to check string to be ASCII?
A string is ASCII if all the characters it contains are in the range 0-127.
With Guava:
CharMatcher.ascii().matchesAllOf(string);
(Older Guava releases exposed this matcher as the constant CharMatcher.ASCII.)
An easy way is to check whether the length of the string in bytes equals the number of Unicode characters (code points). If these values are cached, this might even be the fastest way.
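Both ideas (range check and byte-length comparison) are easy to sketch in Python. The function names below are made up for illustration; on Python 3.7+ the built-in str.isascii() does the same job. The length-based variant assumes the string contains no lone surrogates, since those cannot be encoded as UTF-8.

```python
def is_ascii_by_code_point(s):
    """True if every character is in the range 0-127."""
    return all(ord(c) < 128 for c in s)


def is_ascii_by_length(s):
    """True if the UTF-8 byte length equals the number of code points.

    Only ASCII characters take exactly one byte in UTF-8, so the two
    lengths match only when every character is ASCII.
    """
    return len(s.encode("utf-8")) == len(s)


for text in ("hello", "h\u00e9llo", ""):
    print(repr(text), is_ascii_by_code_point(text), is_ascii_by_length(text))
```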

How many characters can UTF-8 encode?

If UTF-8 is 8 bits, does it not mean that there can be only maximum of 256 different characters?
The first 128 code points are the same as in ASCII. But it is said that UTF-8 can support over a million characters?
How does this work?
UTF-8 does not use one byte all the time, it's 1 to 4 bytes.
The first 128 characters (US-ASCII) need one byte.
The next 1,920 characters need two bytes to encode. This covers the remainder of almost all Latin alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets, as well as Combining Diacritical Marks.
Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use, including most Chinese, Japanese and Korean [CJK] characters.
Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).
source: Wikipedia
UTF-8 uses 1-4 bytes per character: one byte for ASCII characters (the first 128 Unicode values are the same as ASCII), which only requires 7 bits. If the highest bit of a byte is set, that byte is part of a multi-byte sequence. In the first byte of such a sequence, the number of consecutive high bits set gives the number of bytes in the sequence; these are followed by a 0, and the remaining bits contribute to the value. In each of the other bytes, the highest two bits are 1 and 0, and the remaining 6 bits are for the value.
So a four-byte sequence begins with 11110... (where ... is three bits of the value), followed by three bytes with 6 bits each for the value, yielding a 21-bit value. 2^21 exceeds the number of Unicode characters, so all of Unicode can be expressed in UTF-8.
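The bit layout described above can be written out by hand. This is only an illustrative sketch (the function name utf8_encode is hypothetical, and it does not reject surrogates the way a strict encoder would); the results are compared with Python's built-in encoder.

```python
def utf8_encode(cp):
    """Encode one code point using the UTF-8 bit layout (no surrogate check)."""
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of Unicode range")


# One example code point per sequence length, compared with the built-in encoder.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    manual = utf8_encode(cp)
    print(hex(cp), manual.hex(), manual == chr(cp).encode("utf-8"))
```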
Unicode vs UTF-8
Unicode resolves code points to characters. UTF-8 is a storage mechanism for Unicode. Unicode has a spec. UTF-8 has a spec. They both have different limits. UTF-8 has a different upwards-bound.
Unicode
Unicode is designated with "planes." Each plane carries 2^16 code points. There are 17 planes in Unicode, for a total of 17 * 2^16 code points. The first plane, plane 0 or the BMP, is special in the weight of what it carries.
Rather than explain all the nuances, let me just quote the above article on planes.
The 17 planes can accommodate 1,114,112 code points. Of these, 2,048 are surrogates, 66 are non-characters, and 137,468 are reserved for private use, leaving 974,530 for public assignment.
UTF-8
Now let's go back to the article linked above,
The encoding scheme used by UTF-8 was designed with a much larger limit of 2^31 code points (32,768 planes), and can encode 2^21 code points (32 planes) even if limited to 4 bytes. Since Unicode limits the code points to the 17 planes that can be encoded by UTF-16, code points above 0x10FFFF are invalid in UTF-8 and UTF-32.
So you can see that you can put stuff into UTF-8 that isn't valid Unicode. Why? Because UTF-8 accommodates code points that Unicode doesn't even support.
UTF-8, even with a four byte limitation, supports 2^21 code points, which is far more than 17 * 2^16
According to this table* UTF-8 should support:
2^31 = 2,147,483,648 characters
However, RFC 3629 restricted the possible values, so now we're capped at 4 bytes, which gives us
2^21 = 2,097,152 characters
Note that a good chunk of those characters are "reserved" for custom use, which is actually pretty handy for icon-fonts.
* Wikipedia used to show a table with 6 bytes -- they've since updated the article.
2017-07-11: Corrected for double-counting the same code point encoded with multiple bytes
2,164,864 “characters” can be potentially coded by UTF-8.
This number is 2^7 + 2^11 + 2^16 + 2^21, which comes from the way the encoding works:
1-byte chars have 7 bits for encoding
0xxxxxxx (0x00-0x7F)
2-byte chars have 11 bits for encoding
110xxxxx 10xxxxxx (0xC0-0xDF for the first byte; 0x80-0xBF for the second)
3-byte chars have 16 bits for encoding
1110xxxx 10xxxxxx 10xxxxxx (0xE0-0xEF for the first byte; 0x80-0xBF for continuation bytes)
4-byte chars have 21 bits for encoding
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (0xF0-0xF7 for the first byte; 0x80-0xBF for continuation bytes)
As you can see this is significantly larger than current Unicode (1,112,064 characters).
UPDATE
My initial calculation is wrong because it doesn't consider additional rules. See comments to this answer for more details.
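For reference, the arithmetic behind both figures can be checked in a few lines of Python. This is only a sketch of the counting, and the variable names are made up. The naive total counts every bit pattern of each sequence length, so it includes overlong forms, surrogates and values above U+10FFFF, which is why it overshoots the number of code points UTF-8 may actually encode.

```python
# Payload bits available per sequence length, as listed above.
by_length = {1: 2**7, 2: 2**11, 3: 2**16, 4: 2**21}
naive_total = sum(by_length.values())

# The Unicode code space: 17 planes of 2^16 code points each.
code_space = 17 * 2**16

# Surrogate code points U+D800..U+DFFF are excluded from UTF-8.
surrogates = 0xDFFF - 0xD800 + 1
encodable = code_space - surrogates

print(naive_total)  # every bit pattern, valid or not
print(code_space)   # U+0000..U+10FFFF
print(encodable)    # code points that valid UTF-8 can represent
```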
UTF-8 is a variable length encoding with a minimum of 8 bits per character.
Characters with higher code points will take up to 32 bits.
Quote from Wikipedia: "UTF-8 encodes each of the 1,112,064 code points in the Unicode character set using one to four 8-bit bytes (termed "octets" in the Unicode Standard)."
Some links:
http://www.utf-8.com/
http://www.joelonsoftware.com/articles/Unicode.html
http://www.icu-project.org/docs/papers/forms_of_unicode/
http://en.wikipedia.org/wiki/UTF-8
Check out the Unicode Standard and related information, such as their FAQ entry, UTF-8, UTF-16, UTF-32 & BOM. It’s not that smooth sailing, but it’s authoritative information, and much of what you might read about UTF-8 elsewhere is questionable.
The “8” in “UTF-8” relates to the length of code units in bits. Code units are entities used to encode characters, not necessarily as a simple one-to-one mapping. UTF-8 uses a variable number of code units to encode a character.
The collection of characters that can be encoded in UTF-8 is exactly the same as for UTF-16 or UTF-32, namely all Unicode characters. They all encode the entire Unicode coding space, which even includes noncharacters and unassigned code points.
While I agree with mpen on the current maximum UTF-8 codes (2,164,864) (listed below, I couldn't comment on his), he is off by 2 levels if you remove the 2 major restrictions of UTF-8: only 4 bytes limit and codes 254 and 255 can not be used (he only removed the 4 byte limit).
Starting code 254 follows the basic arrangement of starting bits (multi-bit flag set to 1, a count of 6 1's, and terminal 0, no spare bits) giving you 6 additional bytes to work with (6 10xxxxxx groups, an additional 2^36 codes).
Starting code 255 doesn't exactly follow the basic setup, no terminal 0 but all bits are used, giving you 7 additional bytes (multi-bit flag set to 1, a count of 7 1's, and no terminal 0 because all bits are used; 7 10xxxxxx groups, an additional 2^42 codes).
Adding these in gives a final maximum presentable character set of 4,468,982,745,216. This is more than all characters in current use, old or dead languages, and any believed lost languages. Angelic or Celestial script anyone?
Also there are single byte codes that are overlooked/ignored in the UTF-8 standard in addition to 254 and 255: 128-191, and a few others. Some are used locally by the keyboard, example code 128 is usually a deleting backspace. The other starting codes (and associated ranges) are invalid for one or more reasons (https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences).
Unicode is firmly married to UTF-8. Unicode specifically supports 2^21 code points (2,097,152 characters) which is exactly the same number of code points supported by UTF-8. Both systems reserve the same 'dead' space and restricted zones for code points etc. ...as of June 2018 the most recent version, Unicode 11.0, contains a repertoire of 137,439 characters
From the unicode standard. Unicode FAQ
The Unicode Standard encodes characters in the range U+0000..U+10FFFF,
which amounts to a 21-bit code space.
From the UTF-8 Wikipedia page. UTF-8 Description
Since the restriction of the Unicode code-space to 21-bit values in
2003, UTF-8 is defined to encode code points in one to four bytes, ...

how to represent a n-byte array in less than 2*n characters

Given that an n-byte array can be represented as a 2*n character string using hex, is there a way to represent the n-byte array in fewer than 2*n characters?
For example, an integer (int32) can typically be considered a 4-byte array of data.
The advantage of hex is that splitting an 8-bit byte into two equal halves is about the simplest thing you can do to map a byte to printable ASCII characters. More efficient methods consider multiple bytes as a block:
Base-64 uses 64 ASCII characters to represent 6 bits at a time. Every 3 bytes (i.e. 24 bits) are split into 4 6-bit base-64 digits, where the "digits" are:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
(and if the input is not a multiple of 3 bytes long, a 65th character, "=", is used for padding at the end). Note that there are some variant forms of base-64 that use different characters for the last two "digits".
Ascii85 is another representation, which is somewhat less well-known, but commonly used: it's often the way that binary data is encoded within PostScript and PDF files. This considers every 4 bytes (big-endian) as an unsigned integer, which is represented as a 5-digit number in base 85, with each base-85 digit encoded as ASCII code 33+n (i.e. "!" for 0, up to "u" for 84) - plus a special case where the single character "z" may be used (instead of "!!!!!") to represent 4 zero bytes.
(Why 85? Because 84^5 < 2^32 < 85^5.)
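To compare the three representations side by side, here is a small Python sketch using the standard library's base64 module (b64encode and a85encode). The 12-byte sample is arbitrary, chosen so that it is a multiple of both 3 and 4 bytes and needs no padding.

```python
import base64

data = bytes(range(12))  # arbitrary 12-byte sample

hex_text = data.hex()                                 # 2 characters per byte
b64_text = base64.b64encode(data).decode("ascii")     # 4 characters per 3 bytes
a85_text = base64.a85encode(data).decode("ascii")     # 5 characters per 4 bytes

print(len(data), len(hex_text), len(b64_text), len(a85_text))

# Four zero bytes collapse to the single character "z" in Ascii85.
zeros = base64.a85encode(b"\x00\x00\x00\x00")
print(zeros)

# 85 is the smallest base in which five digits cover every 32-bit value.
print(84**5 < 2**32 < 85**5)
```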
Yes, using binary (in which case it takes n bytes, not surprisingly), or using any base higher than 16; a common one is base 64.
It might depend on the exact numbers you want to represent. For instance, the number 9223372036854775808, which requires 8 bytes to represent in binary, takes only 4 bytes in ASCII if you use the product of primes representation (which is "2^63").
How about base-64?
It all depends on what characters you're willing to use in your encoding (i.e. representation).
Base64 fits 6 bits in each character, which means that 3 bytes will fit in 4 characters.
Using 65536 of about 90000 defined Unicode characters you may represent binary string in N/2 characters.
Yes. Use more characters than just 0-9 and a-f. A single character (assuming 8-bit) can have 256 values, so you can represent an n-byte number in n characters.
If it needs to be printable, you can just choose some set of characters to represent various values. A good option is base-64 in that case.

Resources