Character code representation of a decimal digit - ASCII

Differentiate between the character code representation of a decimal digit and its pure binary representation.
I study computer science, and this is a concept I need to know for the exam, but I am not sure I fully understand it.
Character code representation of 261 (for example)
Would this just be the ASCII code equivalent?
Meaning:
2 has ASCII code 50
6 has ASCII code 54
1 has ASCII code 49
So the character code representation is 50, 54, 49
Pure Binary code representation
Is this just the binary conversion of 261?
So 100000101?

ASCII defines the digit characters 0 to 9 with the decimal codes 48 to 57.
So there is a binary representation for the character, and a different one for the decimal value itself.
The binary representation of the character string "46" is: 00110100 00110110.
The character 4 is code 52 in ASCII; hence, you get 00110100. While character 6 is 54, for which you get 00110110.
Meanwhile, the decimal number 46 is stored in a 16-bit word with the following representation: 00000000 00101110.
For the character string "261", you would need the ASCII code for 2, 6 and 1.
2: 50
6: 54
1: 49
So for 50, 54, 49 you get: 00110010 00110110 00110001
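As a quick sanity check, here is a small Python sketch (my own illustration, not part of the original answer) that prints both representations of 261:

text = "261"
# character code representation: one ASCII code (one byte) per digit character
char_codes = [ord(c) for c in text]
print(char_codes)                                    # [50, 54, 49]
print([format(code, '08b') for code in char_codes])  # ['00110010', '00110110', '00110001']

# pure binary representation: the value 261 as a single number
value = 261
print(format(value, 'b'))     # 100000101
print(format(value, '016b'))  # 0000000100000101 (261 in a 16-bit word)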

Related

Representing decimal numbers in Big and Little Endian?

Say I have an unsigned int which consists of 4 bytes stored at addresses 10000 to 10011. If the storage representation is Big Endian, what decimal value is stored in the variable?
ADDRESS INSTRUCTION
10000: 0b01000010
10001: 137
10010: 0x13
10011: 0b11000011
So the decimal numbers are: 66, 137, 19, 195.
I thought that the Big Endian representation would just be 6 613 719 195, but apparently that is wrong. If it were Little Endian, I would have expected 1 951 913 766, but that is wrong too. So what am I missing here? Yes, this is a quiz question that I got wrong and I just don't fully get it. The question is literally:
"In a high-level language a variable is declared as an unsigned int and consists of 4 bytes which are stored in the address 10000-10011. If the representation of the storage is in Big.Endian, which decimal value is stored in the variable?
ADRESS INSTRUCTION
10000: 0b01000010
10001: 137
10010: 0x13
10011: 0b11000011
"
Your calculation is wrong; it should go like this:
1116279747 = 66 * 2^24 + 137 * 2^16 + 19 * 2^8 + 195
Big Endian means the most significant byte (MSB) is stored first, i.e. at the lowest address. So the byte at address 10000 (66) is the most significant and the byte at 10011 (195) is the least significant, giving the bytes 66, 137, 19, 195 in order of decreasing significance.
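To double-check the arithmetic, here is a small Python sketch (my own, not part of the quiz) that interprets the four bytes in both byte orders:

# the four bytes stored at addresses 10000..10011
raw = bytes([0b01000010, 137, 0x13, 0b11000011])   # 66, 137, 19, 195

print(int.from_bytes(raw, byteorder='big'))     # 1116279747 (0x428913C3)
print(int.from_bytes(raw, byteorder='little'))  # 3272837442 (0xC3138942)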

Convert a two-letter String to a 3-digit number

I am working on a software problem and I found myself needing to convert a 2-letter string to a 3-digit number. We're talking about English alphabet only (26 letters).
So essentially I need to convert something like AA, AR, ZF, ZZ etc. to a number in the range 0-999.
We have 676 combinations of letters and 1000 numbers, so the range is covered.
Now, I could just write up a map manually, saying that AA = 1, AB = 2 etc., but I was wondering if maybe there is a better, more "mathematical" or "logical" solution to this.
The order of numbers is of course not relevant, as long as the conversion from letters to numbers is unique and always yields the same results.
The conversion should work both ways (from letters to numbers and from numbers to letters).
Does anyone have an idea?
Thanks a lot
Treat A-Z as 1-26 in base 27, with 0 reserved for blanks.
E.g. 'CD' -> 3 * 27 + 4 = 85
85 -> 85 / 27 = 3, 85 % 27 = 4 -> C, D
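A minimal Python sketch of this scheme (my own illustration; with A-Z as 1-26, 'AA' maps to 28 and 'ZZ' to 728, so every result stays within 0-999):

def letters_to_num(s):
    # A-Z -> 1-26, read the two letters as a two-digit base-27 number
    hi = ord(s[0]) - ord('A') + 1
    lo = ord(s[1]) - ord('A') + 1
    return hi * 27 + lo

def num_to_letters(n):
    hi, lo = divmod(n, 27)
    return chr(hi + ord('A') - 1) + chr(lo + ord('A') - 1)

print(letters_to_num('CD'))   # 85
print(num_to_letters(85))     # CD
print(letters_to_num('ZZ'))   # 728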
If you don't have to use consecutive numbers, you can view a two-letter string as a base-36 number and just use Python's built-in int function to convert it to an integer.
int('AA', 36) # 370
int('AB', 36) # 371
#...
int('ZY', 36) # 1294
int('ZZ', 36) # 1295
As for how to convert the number back to a string, you can refer to the method on How to convert an integer to a string in any base?
@furry12: because the difference between the first and the last number is 1295 - 370 = 925 < 999. That is quite lucky: you can just subtract roughly 300 from every number and the results will be in the range 0-999.
def str2num(s):
    return int(s, 36) - 300

print(str2num('AA')) # 70
print(str2num('ZZ')) # 995
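For the reverse direction there is no built-in, but since only the letters A-Z occur (so both base-36 digits are at least 10), a small sketch could look like this:

DIGITS = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'

def num2str(n):
    # undo the -300 shift, then split into the two base-36 digits
    n += 300
    return DIGITS[n // 36] + DIGITS[n % 36]

print(num2str(70))   # AA
print(num2str(995))  # ZZ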

character encoding - how UTF-8 handles characters

So, I know that when we type characters, each character maps to a number in a character set, and that number is then transformed into a binary format so a computer can understand it. The way that number is transformed into binary (how many bits get allocated) depends on the character encoding.
So, if I type L, it maps to 76. Then 76 gets transformed into a 1-byte binary format because of, let's say, UTF-8.
Now, I've read the following somewhere:
The Devanagari character क, with code point 2325 (which is 915 in
hexadecimal notation), will be represented by two bytes when using the
UTF-16 encoding (09 15), three bytes with UTF-8 (E0 A4 95), or four
bytes with UTF-32 (00 00 09 15).
So, as you can see, it says three bytes with UTF-8 (E0 A4 95). How are E0 A4 95 bytes? I have no idea where E0 A4 95 came from. Why do we need this? If we know the code point is 2325, and we know UTF-8 will need 3 bytes for it, why can't we just store 2325 in binary across those 3 bytes? Why do we need E0 A4 95 and what is it?
E0 A4 95 is the 3-byte UTF-8 encoding of U+0915. In binary:
E   0    A   4    9   5     (hex)
11100000 10100100 10010101  (binary)
1110xxxx 10xxxxxx 10xxxxxx  (3-byte UTF-8 encoding pattern)
    0000   100100   010101  (data bits)
00001001 00010101           (data bits regrouped into 8-bit bytes)
0   9    1   5              (hex)
U+0915                      (Unicode code point)
The first byte's binary pattern 1110xxxx is a lead byte indicating a 3-byte encoding and carrying 4 bits of data. Continuation bytes start with 10xxxxxx and each provides 6 more bits of data, so a 3-byte lead byte is followed by exactly two continuation bytes.
For more information read the Wikipedia article on UTF-8 and the standard RFC-3629.
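If you want to see this for yourself, here is a quick Python check (my own, not from the quoted text):

ch = '\u0915'                      # Devanagari letter KA, code point 2325
encoded = ch.encode('utf-8')
print(' '.join(format(b, '02x') for b in encoded))   # e0 a4 95
print([format(b, '08b') for b in encoded])           # ['11100000', '10100100', '10010101']
print(encoded.decode('utf-8') == ch)                 # True, decoding recovers the character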

Why is Huffman-encoded text bigger than the actual text?

I am trying to understand how Huffman coding works. It is supposed to compress data so that it takes less memory than the actual text, but when I encode, for example,
"Text to be encoded"
which has 18 characters, the result I get is
"100100110100101110101011111000001110011011110010101100011"
Am I supposed to divide the number of result bits by 8, since a character has 8 bits?
You should compare like with like (bits after the compression against bits before, or characters after against characters before), e.g.
before: "Text to be encoded" == 18 * 8 bits = 144 bits
== 18 * 7 bits = 126 bits (in case of 7-bit characters)
after: 100100110100101110101011111000001110011011110010101100011 = 57 bits
so you have 144 (or 126) bits before and 57 bits after the compression. Or
before: "Text to be encoded" == 18 characters
after: 10010011
01001011
10101011
11100000
11100110
11110010
10110001
00000001 /* the last chunk is padded */ == 8 characters
so you have 18 ASCII characters before and only 8 one-byte characters after the compression. If characters are supposed to be 7-bit (the 0..127 ASCII range), we have 9 characters after the compression:
after: 1001001 'I'
1010010 'R'
1110101 'u'
0111110 '>'
0000111 '\x07'
0011011 '\x1B'
1100101 'e'
0110001 '1'
0000001 '\x01'
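To see the compressed size in whole bytes, you can pack the bit string yourself; a small Python sketch (left-padding the final partial chunk with zeros, as above):

bits = "100100110100101110101011111000001110011011110010101100011"
text = "Text to be encoded"

# split into 8-bit chunks and left-pad the last one with zeros
chunks = [bits[i:i + 8] for i in range(0, len(bits), 8)]
packed = bytes(int(chunk.zfill(8), 2) for chunk in chunks)

print(len(text) * 8)  # 144 bits before compression
print(len(bits))      # 57 bits after compression
print(len(packed))    # 8 bytes once the bits are packed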

Understanding Hadoop Text byte offset

I ran the below program.
Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");
System.out.println(t.getLength());
System.out.println(t.find("\u0041"));
System.out.println(t.find("\u00DF"));
System.out.println(t.find("\u6771"));
System.out.println(t.find("\uD801"));
System.out.println(t.find("\uD801\uDC00"));
Output
10
0
1
3
-1
6
From my understanding, find returns the byte offset within the Text.
0041 -> 01000001, 00DF -> 11011111, 6771 -> 0110011101110001
I am not able to understand the output.
Also why
t.find("\uD801")
is -1 ?
This example is explained in the book Hadoop: The Definitive Guide.
The Text class stores data using UTF-8 encoding, so indexing inside a Text is based on the byte offset of the UTF-8 encoded characters (unlike Java's String, which is indexed by char, i.e. UTF-16 code units).
You can see this answer, to understand difference between Text and String in Hadoop:
Difference between Text and String in Hadoop
The text: "\u0041\u00DF\u6771\uD801\uDC00", is interpreted as follows:
\u0041 ==> Latin letter "A". Its UTF-8 code units: 41 (1 byte)
\u00DF ==> Latin letter "Sharp S" (ß). Its UTF-8 code units: c3 9f (2 bytes)
\u6771 ==> A unified Han ideograph (Chinese). Its UTF-8 code units: e6 9d b1 (3 bytes)
\uD801\uDC00 ==> A Deseret letter (https://en.wikipedia.org/wiki/Deseret_alphabet). Its UTF-8 code units: f0 90 90 80 (4 bytes)
Following are the byte offsets, when it is stored in Text (which is UTF-8 encoded):
Offset of "\u0041" ==> 0
Offset of "\u00DF" ==> 1 (the previous UTF-8 character occupies 1 byte: 41)
Offset of "\u6771" ==> 3 (the previous UTF-8 character occupies 2 bytes: c3 9f)
Offset of "\uD801\uDC00" ==> 6 (the previous UTF-8 character occupies 3 bytes: e6 9d b1)
Finally, the last UTF-8 character (DESERET CAPITAL LETTER LONG I) occupies 4 bytes (f0 90 90 80).
So total length is: 1 + 2 + 3 + 4 = 10.
When you do t.find("\uD801"), you get -1 because no such character exists in the string: \uD801 is only half of a surrogate pair, and on its own it does not correspond to any character in the UTF-8 encoded data.
"\uD801\uDC00" together is a single character (DESERET CAPITAL LETTER LONG I), hence querying for the offset of "\uD801\uDC00" gives the proper answer of 6.
