Why is Huffman encoded text bigger than the actual text? - algorithm

I am trying to understand how Huffman coding works. It is supposed to compress data so that it takes less memory than the actual text, but when I encode, for example,
"Text to be encoded"
which has 18 characters, the result I get is
"100100110100101110101011111000001110011011110010101100011"
Am I supposed to divide those result bits by 8, since a character has 8 bits?

You should compare the same units (bits, as after the compression, or characters, as in the text before the compression), e.g.
before: "Text to be encoded" == 18 * 8 bits = 144 bits
== 18 * 7 bits = 126 bits (in case of 7-bit characters)
after: 100100110100101110101011111000001110011011110010101100011 = 57 bits
so you have 144 (or 126) bits before and 57 bits after the compression. Or
before: "Text to be encoded" == 18 characters
after: 10010011
01001011
10101011
11100000
11100110
11110010
10110001
00000001 /* the last chunk is padded */ == 8 characters
so you have 18 ASCII characters before and only 8 one-byte characters after the compression. If characters are supposed to be 7-bit (the 0..127 range of the ASCII table), we have 9 characters after the compression:
after: 1001001 'I'
1010010 'R'
1110101 'u'
0111110 '>'
0000111 '\x07'
0011011 '\x1B'
1100101 'e'
0110001 '1'
0000001 '\x01'
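For anyone who wants to check the arithmetic, here is a small Python sketch (just an illustration, not part of the compression itself) that packs the encoded bit string the same way as above:
bits = "100100110100101110101011111000001110011011110010101100011"
text = "Text to be encoded"

print(len(text) * 8)   # 144 bits before (8-bit characters)
print(len(bits))       # 57 bits after

# Pack into 8-bit chunks; the last chunk is padded with leading zeros,
# as in the listing above.
chunks8 = [bits[i:i + 8] for i in range(0, len(bits), 8)]
packed = bytes(int(c.zfill(8), 2) for c in chunks8)
print(len(packed))     # 8 bytes, vs. 18 one-byte characters before

# Or split into 7-bit chunks for the 7-bit character view.
chunks7 = [bits[i:i + 7] for i in range(0, len(bits), 7)]
print(len(chunks7))    # 9 seven-bit characters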

Related

character encoding - how UTF-8 handles characters

So, I know that when we type characters, each character maps to a number in a character set, and then that number is transformed into a binary format so a computer can understand it. The way that number is transformed into a binary format (how many bits get allocated) depends on the character encoding.
So, if I type L, it represents 76. Then 76 gets transformed into a 1-byte binary format because of, let's say, UTF-8.
Now, I've read the following somewhere:
The Devanagari character क, with code point 2325 (which is 915 in
hexadecimal notation), will be represented by two bytes when using the
UTF-16 encoding (09 15), three bytes with UTF-8 (E0 A4 95), or four
bytes with UTF-32 (00 00 09 15).
So, as you can see, it says three bytes with UTF-8 (E0 A4 95). How are E0 A4 95 bytes? I am asking because I have no idea where E0 A4 95 came from... Why do we need this? If we know that the code point is 2325, then to use UTF-8 all we have to know is that UTF-8 will need 3 bytes to transform 2325 into binary... Why do we need E0 A4 95, and what is it?
E0 A4 95 is the 3-byte UTF-8 encoding of U+0915. In binary:
E0       A4       95        (hex)
11100000 10100100 10010101  (binary)
1110xxxx 10xxxxxx 10xxxxxx  (3-byte UTF-8 encoding pattern)
    0000   100100   010101  (data bits)
00001001 00010101           (data bits regrouped into 8-bit bytes)
      09       15           (hex)
U+0915                      (Unicode code point)
The first byte's binary pattern 1110xxxx marks a lead byte: it indicates a 3-byte encoding and carries 4 bits of data. Follow-on bytes start with 10xxxxxx and each provide 6 more bits of data. A 3-byte lead byte is followed by exactly two such continuation bytes.
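If it helps, the same bit arithmetic can be reproduced in a few lines of Python (an illustrative sketch, not part of the original answer):
cp = 0x0915                                  # code point of क

b1 = 0b11100000 | (cp >> 12)                 # 1110xxxx : top 4 bits of the code point
b2 = 0b10000000 | ((cp >> 6) & 0b111111)     # 10xxxxxx : next 6 bits
b3 = 0b10000000 | (cp & 0b111111)            # 10xxxxxx : lowest 6 bits

print(bytes([b1, b2, b3]).hex())             # 'e0a495'
print("\u0915".encode("utf-8").hex())        # 'e0a495' -- matches the built-in encoder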
For more information read the Wikipedia article on UTF-8 and the standard RFC-3629.

utf8 second byte lower bound in golang

I was recently going through the go source code of utf8 decoding.
Apparently, when decoding UTF-8 bytes, if the first byte has the value 224
(0xE0), the next byte maps to an accept range of [0xA0; 0xBF].
https://github.com/golang/go/blob/master/src/unicode/utf8/utf8.go#L81
https://github.com/golang/go/blob/master/src/unicode/utf8/utf8.go#L94
If I understand the UTF-8 spec (https://www.rfc-editor.org/rfc/rfc3629) correctly, every continuation byte has a minimum value of 0x80, or 1000 0000. Why is the minimum value of the second byte higher when the opening byte is 0xE0, i.e. 0xA0 instead of 0x80?
The reason is to prevent so-called overlong sequences. Quoting the RFC:
Implementations of the decoding algorithm above MUST protect against
decoding invalid sequences. For instance, a naive implementation may
decode the overlong UTF-8 sequence C0 80 into the character U+0000,
or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding
invalid sequences may have security consequences or cause other
problems.
[...]
A particularly subtle form of this attack can be carried out against
a parser which performs security-critical validity checks against the
UTF-8 encoded form of its input, but interprets certain illegal octet
sequences as characters. For example, a parser might prohibit the
NUL character when encoded as the single-octet sequence 00, but
erroneously allow the illegal two-octet sequence C0 80 and interpret
it as a NUL character. Another example might be a parser which
prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
illegal octet sequence 2F C0 AE 2E 2F. This last exploit has
actually been used in a widespread virus attacking Web servers in
2001; thus, the security threat is very real.
Also note the syntax rules in section 4, which explicitly allow only the bytes A0-BF after E0:
UTF8-2 = %xC2-DF UTF8-tail
UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
%xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
%xF4 %x80-8F 2( UTF8-tail )
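As a quick illustration (a Python sketch, not from the RFC itself), a strict decoder rejects exactly the sequences the RFC warns about:
# C0 80 is an overlong encoding of U+0000, and 2F C0 AE 2E 2F is the
# overlong-smuggled "/../" mentioned in the RFC; both must be rejected.
for seq in (b"\xc0\x80", b"\x2f\xc0\xae\x2e\x2f"):
    try:
        print(seq.decode("utf-8"))
    except UnicodeDecodeError as e:
        print(seq.hex(), "rejected:", e.reason)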
If the first byte of a UTF-8 sequence is 0xe0, that means it's a 3-byte sequence representing/encoding a Unicode code point (because 0xe0 = 1110 0000 in binary).
Wikipedia: UTF-8:
Number of bytes  Bits for code point  First code point  Last code point  Byte 1    Byte 2    Byte 3
---------------------------------------------------------------------------------------------------
3                16                   U+0800            U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
The first Unicode codepoint that is encoded using a 3-byte UTF-8 sequence is U+0800, so the codepoint is 0x0800, in binary: 0000 1000 0000 0000
If you try to insert these bits into the ones marked with x:
1110xxxx 10xxxxxx 10xxxxxx
11100000 10100000 10000000
As you can see, the second byte is 1010 0000, which is 0xa0. So in a valid UTF-8 sequence that starts with 0xe0, the 2nd byte cannot be lower than 0xa0 (because even the lowest Unicode codepoint whose UTF-8 encoded sequence starts with 0xe0 has 0xa0 as its 2nd byte).
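The same substitution can be checked in Python (an illustrative sketch):
cp = 0x0800                                  # lowest code point that needs 3 bytes
second = 0b10000000 | ((cp >> 6) & 0x3F)     # the 10xxxxxx byte for its middle 6 bits
print(hex(second))                           # '0xa0'
print("\u0800".encode("utf-8").hex())        # 'e0a080' -- so after e0, a0 is the floor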

How to use urandom to generate all possible base64 characters?

In a base64 digit you can store up to 6 bits (2**6 == 64).
Which means that you can fit 3 bytes in 4 base64 digits.
64**4 == 2**24
That's why:
0x000000 == 'AAAA'
0xFFFFFF == '////'
This means that a random string of 3 bytes is equivalent to a base64 string of 4 characters.
However, if I am converting a number of bytes which is not a multiple of 3 to a base64 string, I will not be able to generate all the combinations of the base64 string.
Let's take an example:
If I want a random 7-character base64 string, I would need to generate 42 random bits (64**7 == 2**42).
If I use urandom to get 5 random bytes, I will get only 40 bits (5*8), and if I ask for 6, I will get 48 bits (6*8).
Can I ask for 6 bytes and use a mask to cut it down to 5, or will that break my random distribution?
One solution:
hex(0x123456789012 & 0xFFFFFFFFFF)
'0x3456789012'
Another one:
hex(0x123456789012 >> 8)
'0x1234567890'
What do you think?
A base64 string 7 characters long
is the encoding of a file of 5 bytes (40 bits: no less, no more).
40 % 6 = 4,
so base64 needs to add 2 more bits; then, with 42 bits, 42 % 6 = 0
and the encoding is possible. But beware:
"If I want a random 7-character base64 string, I would need to
generate 42 random bits (64**7 == 2**42)."
The 2 additional bits are not random; they are constant (zero, in fact).
The cardinality of your key space doesn't change: it is 2**40 = 1099511627776, not 64**7 == 2**42.
64**7 == 2**42 is the cardinality of the key space containing all possible combinations of 64 characters with a length of 7; but with the last two bits fixed (to zero in this case, though the value doesn't matter) you don't get all possible combinations.
6 random bytes (48 bits), or 42 random bits, would enlarge your original key space; you should use 5 random bytes (40 bits) and feed them to base64.
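You can see the fixed bits directly in Python (an illustrative sketch using os.urandom and the standard base64 module): the 7th character of the encoding of 5 random bytes always has a 6-bit value that is a multiple of 4, because its low 2 bits are the padding zeros.
import os, base64, string

# Standard base64 alphabet in value order 0..63
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

for _ in range(5):
    raw = os.urandom(5)                   # 40 random bits
    enc = base64.b64encode(raw).decode()  # 8 chars, ending with one '=' pad
    value = ALPHABET.index(enc[6])        # 6-bit value of the 7th character
    print(enc, value, value % 4 == 0)     # the low 2 bits are always zero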

Understanding Hadoop Text byteoffset

I ran the below program.
Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");
System.out.println(t.getLength());
System.out.println(t.find("\u0041"));
System.out.println(t.find("\u00DF"));
System.out.println(t.find("\u6771"));
System.out.println(t.find("\uD801"));
System.out.println(t.find("\uD801\uDC00"));
Output
10
0
1
3
-1
6
From my understanding, find returns the byte offset in Text.
0041 -> 01000001, 00DF -> 11011111, 6771 -> 0110011101110001
I am not able to understand the output.
Also, why does
t.find("\uD801")
return -1?
This example is explained in the book "Hadoop: The Definitive Guide".
The Text class stores data using UTF-8 encoding. Because of that, indexing inside a Text is based on the byte offsets of the UTF-8 encoded characters (unlike a Java String, which is indexed per char, i.e. per UTF-16 code unit).
You can see this answer, to understand difference between Text and String in Hadoop:
Difference between Text and String in Hadoop
The text "\u0041\u00DF\u6771\uD801\uDC00" is interpreted as follows:
\u0041 ==> The Latin letter "A". Its UTF-8 code units: 41 (1 byte)
\u00DF ==> The Latin letter "sharp S" (ß). Its UTF-8 code units: c3 9f (2 bytes)
\u6771 ==> A unified Han ideograph (Chinese). Its UTF-8 code units: e6 9d b1 (3 bytes)
\uD801\uDC00 ==> A Deseret letter (https://en.wikipedia.org/wiki/Deseret_alphabet). Its UTF-8 code units: f0 90 90 80 (4 bytes)
Following are the byte offsets, when it is stored in Text (which is UTF-8 encoded):
Offset of "\u0041" ==> 0
Offset of "\u00DF" ==> 1 (since the previous UTF-8 character occupied 1 byte: 41)
Offset of "\u6771" ==> 3 (since the previous UTF-8 character occupied 2 bytes: c3 9f)
Offset of "\uD801\uDC00" ==> 6 (since the previous UTF-8 character occupied 3 bytes: e6 9d b1)
Finally, the last UTF-8 character (DESERET CAPITAL LETTER LONG I) occupies 4 bytes (f0 90 90 80).
So total length is: 1 + 2 + 3 + 4 = 10.
When you do t.find("\uD801"), you get -1 because no such character exists in the string as far as the UTF-8 encoding is concerned ("\uD801" on its own is an unpaired surrogate and has no UTF-8 form).
"\uD801\uDC00" is considered as a single character (DESERET CAPITAL LETTER LONG I). Hence when you query for offset of "\uD801\uDC00", you get a proper answer of 6.

Invalid UTF-8 bytes

According to Wikipedia:
Not all sequences of bytes are valid UTF-8. A UTF-8 decoder should be
prepared for:
1. the red invalid bytes in the above table
2. an unexpected continuation byte
3. a start byte not followed by enough continuation bytes
4. an Overlong Encoding as described above
5. A 4-byte sequence (starting with 0xF4) that decodes to a value greater than U+10FFFF
As per the Codepage layout, 0xC0 and 0xC1 are invalid and must never appear in a valid UTF-8 sequence. Here is what I have for code points 0xC0 and 0xC1:
Byte 1    Byte 2    Num  Char
11000011  10000000  192  À
11000011  10000001  193  Á
There are characters corresponding to these byte sequences, while there should not be. Am I doing it wrong?
You're just confusing terms:
The codepoint U+00C0 is the character "À", U+00C1 is "Á".
Encoded in UTF-8, they're the byte sequences C3 80 and C3 81, respectively.
The bytes C0 and C1 should never appear in the UTF-8 encoding.
Codepoints denote characters independently of bytes. Bytes are bytes.
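A short Python sketch (illustrative only) makes the distinction concrete:
# The code points U+00C0 and U+00C1 become two-byte sequences in UTF-8,
# both starting with C3 -- the bytes C0 and C1 themselves never appear:
print("\u00C0".encode("utf-8").hex())   # 'c380'
print("\u00C1".encode("utf-8").hex())   # 'c381'
print(b"\xc3\x80".decode("utf-8"))      # 'À'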
