Bytes and bits in computer science and the C language

A language has 28 different letters in total. Each word in the language is
composed of at most 7 letters. You want to create a data type to store a word of this language, and you decide to store the word as an array of letters. How many bits will you assign to the data type to be able to store all possible words of the language?
Op 1: 7
Op 2: 35
Op 3: 28
Op 4: 196

With n bits, you can represent one of a set of 2**n characters, so 5 bits will be sufficient to represent one of the 28 characters in your set (2**5=32).
For each character in the string, you will need five bits, so a seven-character string will require 35 bits.
Note that you could map all 28^7 words onto 34 bits (ceiling(log(28^7)/log(2)) = 34), but I wouldn't recommend it.
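To sanity-check the arithmetic, here is a minimal Python sketch (the variable names are just for illustration; only the formula matters):
import math

letters_in_alphabet = 28
letters_per_word = 7

# Bits to distinguish one of 28 letters: ceil(log2(28)) = 5
bits_per_letter = math.ceil(math.log2(letters_in_alphabet))

# Fixed-width array of 7 letters: 7 * 5 = 35 bits
bits_per_word = letters_per_word * bits_per_letter

# Theoretical minimum if the whole word is packed into one number: ceil(log2(28**7)) = 34
bits_packed = math.ceil(math.log2(letters_in_alphabet ** letters_per_word))

print(bits_per_letter, bits_per_word, bits_packed)  # 5 35 34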

Related

Sequence of binary input file for NIST

I've developed an RNG program, for now, I want to test whether my numbers are random. Thus, I decided to use NIST Test Suite.
I'm still confused about the input file format, they say: "The user may want to construct as many files of arbitrary length as desired. Files should contain binary sequences stored as either ASCII characters consisting of zeroes and ones, or
as binary data where each byte contains eight bits worth of 0’s and 1’s"
My python RNG program will return a sequence of numbers line by line as:
69
11
68
55
33
20
75
96
How can I convert them to the proper input file for NIST?
Your first random number is 69, which is 1000101 in binary. You can either put that in your test file as the ASCII string "1000101" or as seven bits in a binary file 1000101... The ASCII option is probably easier, but the file will be eight times the size. In either case you might have to be careful with leading zeros in binary; I am not sure what NIST wants without reading a lot more of SP 800-22 than I currently have time for.
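A rough Python sketch of the ASCII option, assuming a fixed width of 7 bits per number so leading zeros are kept (the file name and the width are my assumptions, not anything NIST prescribes):
numbers = [69, 11, 68, 55, 33, 20, 75, 96]

width = 7  # enough for values up to 127; adjust to your RNG's actual range
with open("nist_input.txt", "w") as f:
    for n in numbers:
        # write each number as a fixed-width ASCII string of '0' and '1' characters
        f.write(format(n, "0{}b".format(width)))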
The appropriate input for NIST would be a set of binary sequences. You can simply convert your integers into binary numbers, write them to a file and use the file as input for the NIST program. Converting to binary and feeding them to NIST does not necessarily mean your RNG will pass all the tests there. Let's try to answer the following question.
How many bits do you produce for each integer?
For example, let's say your RNG generates integers between 0 to 5 (uniform distribution, all values are equiprobable). As representing 5 would require at least 3 bits, we will use 3 bits for each integer.
0: 000
1: 001
2: 010
3: 011
4: 100
5: 101
Look at the first (most significant) bit for each of the numbers. Four of them are 0, and the remaining two are 1. So, whenever you pick a random integer from 0 to 5, the probability of the first bit being 0 is higher than it being 1. Remember that, for an RNG, we need p(0)=p(1)=0.5 for each of the bits.
Now, if an RNG produced values from 0 to 7 (uniformly), we could convert each of them to 3 bits and maintain p(0)=p(1)=0.5 at all the indices. Why is that? Because we have all 2^3 different values (i.e., 0 to 2^3 - 1), there is no bias at any index (an equal number of zeros and ones).
The above discussion leads us to the conclusion that, if you have integer values coming out of an RNG, they range from 0 to 2^n - 1, and each of them is equiprobable, you can convert them to n bits and concatenate them for NIST evaluation. If those conditions do not hold (e.g., the number of outcomes is not a power of 2), one way is to settle for the largest power of 2 that fits in the output range of your RNG and discard the rest of the values.
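A minimal sketch of that last idea in Python, assuming (purely for illustration) that the RNG emits integers in 0-99: keep only the values below the largest power of two that fits, and emit each survivor as a fixed number of bits.
def to_bits(values, upper):
    # values are assumed to lie in 0 .. upper-1
    n = upper.bit_length() - 1      # e.g. upper=100 -> n=6, keep only 0..63
    bits = []
    for v in values:
        if v < 2 ** n:              # discard values outside the power-of-two range
            bits.append(format(v, "0{}b".format(n)))
    return "".join(bits)

print(to_bits([69, 11, 68, 55, 33, 20, 75, 96], 100))
# 69, 68, 75 and 96 are discarded; 11, 55, 33 and 20 each become 6 bits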

How does UTF-8 represent characters?

I'm reading UTF-8 Encoding, and I don't understand the following sentence.
For characters equal to or below 2047 (hex 0x07FF), the UTF-8
representation is spread across two bytes. The first byte will have
the two high bits set and the third bit clear (i.e. 0xC2 to 0xDF). The
second byte will have the top bit set and the second bit clear (i.e.
0x80 to 0xBF).
If I'm not mistaken, this means UTF-8 requires two bytes to represent 2048 characters. In other words, we need to choose 2048 candidates from 2 to the power of 16 to represent each character.
For characters equal to or below 2047 (hex 0x07FF), the UTF-8
representation is spread across two bytes.
Choosing 2048 candidates out of 65,536 would be no big deal on its own. However, UTF-8 explicitly sets a boundary on each byte.
According to the following statements, the number of combinations is 30 (0xDF - 0xC2 + 0x01) for the first byte and 64 (0xBF - 0x80 + 0x01) for the second byte.
The first byte will have
the two high bits set and the third bit clear (i.e. 0xC2 to 0xDF). The
second byte will have the top bit set and the second bit clear (i.e.
0x80 to 0xBF).
How do 1920 combinations (30 times 64) accommodate 2048 code points?
As you already know, 2047 (0x07FF) contains the raw bits
00000111 11111111
If you look at the bit distribution chart for UTF-8, you will see that 0x07FF falls in the second row, so it is encoded as 2 bytes using this bit pattern:
110xxxxx 10xxxxxx
Substitute the raw bits into the xs and you get this result:
11011111 10111111 (0xDF 0xBF)
Which is exactly as the description you quoted says:
The first byte will have the two high bits set and the third bit clear (11011111). The second byte will have the top bit set and the second bit clear (10111111).
Think of it as a container, where the encoding reserves a few bits for its own synchronization, and you get to use the remaining bits.
So for the range in question, the encoding "template" is
110 abcde 10 fghijk
(where I have left a single space to mark the boundary between the template and the value from the code point we want to encode, and two spaces between the actual bytes)
and you get to use the 11 bits abcdefghijk for the value you actually want to transmit.
So for the code point U+07EB you get
0x07 00000111
0xEB 11101011
where the top five zero bits are masked out (remember, we only get 11 -- because the maximum value that the encoding can accommodate in two bytes is 0x07FF. If you have a larger value, the encoding will use a different template, which is three bytes) and so
0x07 = _____ 111 (template: _____ abc)
0xEB = 11 101011 (template: de fghijk)
abc de = 111 11 (where the first three come from 0x07, and the next two from 0xEB)
fghijk = 101011 (the remaining bits from 0xEB)
yielding the value
110 11111 10 101011
aka 0xDF 0xAB.
Wikipedia's article on UTF-8 contains more examples with nicely colored numbers to see what comes from where.
The range 0x00-0x7F, which can be represented in a single byte, contains 128 code points; the two-byte range thus needs to accommodate 1920 = 2048-128 code points.
The raw encoding would allow values in the range 0xC0-0xDF in the first byte, but the values 0xC0 and 0xC1 are never needed, because those would represent code points that can be represented in a single byte and are thus invalid per the encoding spec. In other words, the 0x02 in 0xC2 comes from the fact that at least one of the high four bits of the 11 that this segment of the encoding can represent (one of abcd) must be set for the value to require two bytes.
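Here is a short Python sketch of the two-byte case, following the 110xxxxx 10xxxxxx template described above and cross-checked against Python's own encoder:
def utf8_two_byte(code_point):
    # valid only for the two-byte range U+0080 .. U+07FF
    assert 0x80 <= code_point <= 0x7FF
    first = 0xC0 | (code_point >> 6)       # 110xxxxx: top 5 of the 11 bits
    second = 0x80 | (code_point & 0x3F)    # 10xxxxxx: low 6 bits
    return bytes([first, second])

print(utf8_two_byte(0x07EB).hex())         # dfab
print(chr(0x07EB).encode("utf-8").hex())   # dfab, same result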

Lossless compression of an ordered series of 29 digits (each on a 5-point Likert scale, 0 to 4)

I have a survey with 29 questions, each with a 5-point Likert scale (0=None of the time; 4=Most of the time). I'd like to compress the total set of responses to a small number of alpha or alphanumeric characters, adding a check digit to the end.
So, the set of responses 00101244231023110242231421211 would get turned into something like A2CR7HW4. This output would be part of a printout that a non-techie user would enter on a website as a shortcut to entering the entire string. I'd want to avoid ambiguous characters, such as 0,O,D,I,l,5,S, leaving me with 21 or 22 characters to use (uppercase only). Alternatively, I could just stick with capital alpha only and use all 26 characters.
I'm thinking to convert each pair of digits to a letter (5^2=25, so the whole alphabet is adequate). That would reduce the sequence to 15 characters, which is still longish to type without errors.
Any other suggestions on how to minimize the length of the output?
EDIT: BTW, for context, the survey asks 29 questions about mental health symptoms, generating a predictive risk for 4 psychiatric conditions. Need a code representing all responses.
If the five answers are all equally likely, then the best you can do is ceiling(29 * log(5) / log(n)) symbols, where n is the number of symbols in your alphabet. (The base of the logarithm doesn't matter, so long as they're both the same.)
So for your 22 symbols, the best you can do is 16. For 26 symbols, the best is 15, as you described for 25. If you use 49 characters (e.g. some subset of the upper and lower case characters and the digits), you can get down to 12. The best you'll be able to do with printable ASCII characters would be 11, using 70 of the 94 characters.
The only way to make it smaller would be if the responses are not all equally likely and are heavily skewed. Though if that's the case, then there's probably something wrong with the survey.
First, choose a set of permissible characters, i.e.
characters = "ABC..."
Then, prefix the input digits with a 1 and interpret the result as a quinary (base-5) number:
100101244231023110242231421211
Now, convert this quinary number to a number in base-"strlen(characters)", i.e. base26 if 26 characters are to be used:
02 23 18 12 10 24 04 19 00 15 14 20 00 03 17
Then, use these numbers as index in "characters", and you have your encoding:
CVSMKWETAPOUADR
For decoding, just reverse the steps.
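A small Python sketch of this scheme, assuming A=0, B=1, ... and most significant digit first (so the exact letters may differ from the example above, but the round trip holds):
characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode(responses):                       # responses: string of digits 0-4
    value = int("1" + responses, 5)          # the leading 1 preserves leading zeros
    out = ""
    while value:
        value, rem = divmod(value, len(characters))
        out = characters[rem] + out
    return out

def decode(code):
    value = 0
    for ch in code:
        value = value * len(characters) + characters.index(ch)
    digits = ""
    while value:
        value, rem = divmod(value, 5)
        digits = str(rem) + digits
    return digits[1:]                        # drop the prefixed 1

answers = "00101244231023110242231421211"
print(encode(answers), decode(encode(answers)) == answers)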
Are you doing this in a specific language?
If you want to be really thrifty about it you might want to consider encoding the data at bit level.
Since there are only 5 possible answers per question you could do this with only 3 bits:
000
001
010
011
100
Your end result would be a string of bits, at 3 bits per answer, so a total of 87 bits, or 10 and a bit bytes.
EDIT - misread the question slightly, there are 5 possible answers not 4, my mistake.
The only problem now is that for 4 of your 5 answers you're wasting a bit... you aren't going to gain much from going to this much trouble, I'd say, but it's worth considering.
EDIT:
I've been playing about with it and it's difficult to work out a mechanism that allows you to use both 2 and 3 bit values.
Since your output would be an 87-bit binary value, you'd need to be able to make the distinction between 2- and 3-bit values when converting back to the original values.
If you're working with a larger number of values there are some methods you could use, like having a reserved bit for each value that can be used to sort of type a value and give it some meaning. But working with so few bits as it is, it's hard to shave anything off.
Your output at 87 bits could be padded out to 128 bits, which would give you four 32-bit values if you wanted to simplify it. This 128-bit value would be like a unique fingerprint representing a specific set of answers. There are many ways you can represent 128 bits.
But in the end, working at bit level is about as good as it gets when it comes to actual compression and encoding of data... if you can express 5 unique values in less than 3 bits, I'd be suitably impressed.
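If you did go the 3-bits-per-answer route, a Python sketch might look like the following; padding the 87 bits up to 11 whole bytes is my own choice here:
def pack(answers):                       # answers: list of 29 ints, each 0-4
    value = 0
    for a in answers:
        value = (value << 3) | a         # 3 bits per answer, 87 bits total
    return value.to_bytes(11, "big")     # 87 bits fit in 11 bytes

def unpack(data, count=29):
    value = int.from_bytes(data, "big")
    answers = []
    for _ in range(count):
        answers.append(value & 0b111)    # peel off 3 bits at a time
        value >>= 3
    return answers[::-1]

answers = [int(d) for d in "00101244231023110242231421211"]
print(unpack(pack(answers)) == answers)  # True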

Can I pack 2 digits (32-bit) into a 12-character string [0-9a-z]{13}?

I want to pack two 32-bit digits into a 12-character string.
example
123 456 -> abcdef0000000
The choice [0-9a-z] gives 36 symbols. With 12 such symbols, you get 36^12 = 4.7e18 different strings.
3 32-bit digits is the same as one 96-bit digit, which has 2^96 = 7.9e28 different values.
There are more values than you can represent with that alphabet in 12 positions, so no, you cannot achieve the packing you want.
If you use [0-9a-zA-Z] you have 62 symbols. 62^12 = 3.2e21 - still short.
Given a number of symbols S and string length L, you can figure out how many bits of info you can store by solving S^L >= 2^B. For your case, B = 96. Let's use the Base64 alphabet, so S = 64 (this also makes the math easier). Then:
64^L >= 2^96
take log2 of both sides
L * log2(64) >= 96 * log2(2)
L * 6 >= 96
L >= 16
So with a Base64 alphabet, you can store 96 bits in 16 symbols.
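The same calculation as a tiny Python helper (the function name is mine):
import math

def symbols_needed(bits, alphabet_size):
    # smallest L with alphabet_size**L >= 2**bits, i.e. L >= bits / log2(alphabet_size)
    return math.ceil(bits / math.log2(alphabet_size))

print(symbols_needed(96, 36))  # 19 -- three 32-bit numbers in [0-9a-z]
print(symbols_needed(96, 62))  # 17 -- [0-9a-zA-Z]
print(symbols_needed(96, 64))  # 16 -- Base64 alphabet, as above
print(symbols_needed(64, 36))  # 13 -- two 32-bit numbers in [0-9a-z]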
Assuming you mean 3 numbers:
Whether you can get them in the string depends on the language. You can't have ANY limitations on the character set, though. A 32 bit number is 4 bytes, three of them make 12 bytes. Zeros are a possibility--any language that won't let you store a zero byte in the string won't allow this. However, there are languages that don't care--Delphi comes to mind. It stores strings with length codes rather than terminators and thus you can put anything in a string.
Assuming you mean 2 numbers:
[0-9][A-Z] is 36 symbols. 36^6 < 2^32, it won't work.
[0-9][A-Z][a-z] is 62 symbols. 62^6 > 2^32, you can do it.
On the other hand, if you're talking strings that hold wide characters of some type even the three numbers can easily be packed in, although you'll have to use some non-ASCII characters to do the job. Pick any 256 symbols and you're good to go. You can even pick 2048 symbols (you'll find more than that if you look at the east Asian languages) and pack each number into three characters.
You practically want to pack three 4-byte numbers into one 12-byte string. Yes, you can do this, but you will have to use all 256 ANSI symbols (some of which are unprintable, like #13, #7, #10), and it won't even be a "pack": the first 4 bytes will hold one number, the second 4 bytes will hold the second number, and the third 4 bytes will hold the third number.
If you want to pack TWO 4-byte numbers into a READABLE (printable) string of length 12, you need to be able to stuff 4 bytes into 6 "digits" of a string; thankfully 12 is divisible by 2, so you don't have to deal with uneven pack lengths. Solving 256^4 <= X^6 gives X = 41, meaning yes, if you use a Base64 alphabet (as in the comments to the previous answer) you can build such a string.
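A Python sketch of the two-number case, using a 64-symbol alphabet so each 32-bit value maps onto exactly 6 printable characters (the particular alphabet string is only an example):
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode32(n):                          # one 32-bit number -> 6 symbols
    out = ""
    for _ in range(6):                    # 64**6 = 2**36 > 2**32, so 6 symbols suffice
        n, rem = divmod(n, 64)
        out = ALPHABET[rem] + out
    return out

def decode32(s):
    n = 0
    for ch in s:
        n = n * 64 + ALPHABET.index(ch)
    return n

a, b = 123, 456
packed = encode32(a) + encode32(b)        # 12 printable characters
print(packed, decode32(packed[:6]), decode32(packed[6:]))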

Decoding letters ('a' .. 'z') from a bit sequence without waste

I seek an algorithm that will let me represent an incoming sequence of bits as letters ('a' .. 'z'), in a minimal manner, such that the stream of bits can be regenerated from the letters, without ever holding the entire sequence in memory.
That is, given an external bit source (each read returns a practically random bit), and user input of a number of bits, I would like to print out the minimal number of characters that can represent those bits.
Ideally there should be a parameterization - how much memory versus maximum bits before some waste is necessary.
Efficiency Goal - The same number of characters as the base-26 representation of the bits.
Non-solutions:
If sufficient storage was present, store the entire sequence and use a big-integer MOD 26 operation.
Convert every 9 bits to 2 characters - this seems suboptimal, using only 2^9 = 512 of the 676 possible two-letter combinations and so wasting about a quarter of them.
If you assign a different number of bits per letter, you should be able to exactly encode the bits in the twenty-six letters allowed without wasting any bits. (This is a lot like a Huffman code, only with a pre-built balanced tree.)
To encode bits into letters: Accumulate bits until you match exactly one of the bit codes in the lookup table. Output that letter, clear the bit buffer, and keep going.
To decode letters into bits: For each letter, output the bit sequence in the table.
Implementing in code is left as an exercise to the reader. (Or to me, if I get bored later.)
a 0000
b 0001
c 0010
d 0011
e 0100
f 0101
g 01100
h 01101
i 01110
j 01111
k 10000
l 10001
m 10010
n 10011
o 10100
p 10101
q 10110
r 10111
s 11000
t 11001
u 11010
v 11011
w 11100
x 11101
y 11110
z 11111
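As a rough sketch of those encode/decode steps (one possible Python rendering, with the table above built as a dictionary):
# prefix-code table from above: 'a'-'f' use 4 bits, 'g'-'z' use 5 bits
CODES = {}
for i, letter in enumerate("abcdef"):
    CODES[letter] = format(i, "04b")
for i, letter in enumerate("ghijklmnopqrstuvwxyz"):
    CODES[letter] = format(i + 12, "05b")
DECODE = {bits: letter for letter, bits in CODES.items()}

def bits_to_letters(bitstream):           # bitstream: iterable of '0'/'1' characters
    out, buf = [], ""
    for bit in bitstream:
        buf += bit
        if buf in DECODE:                 # accumulated exactly one code from the table
            out.append(DECODE[buf])
            buf = ""
    return "".join(out), buf              # buf holds any leftover (incomplete) bits

def letters_to_bits(letters):
    return "".join(CODES[ch] for ch in letters)

print(bits_to_letters("0000010111111"))   # ('afz', '')
print(letters_to_bits("afz"))             # '0000010111111'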
Convert each block of 47 bits to a base 26 number of 10 digits. This gives you more than 99.99% efficiency.
This method, as well as others like Huffman, needs a padding mechanism to support variable-length input. This introduces some inefficiency which is less significant with longer inputs.
At the end of the bit stream, append an extra 1 bit. This must be done in all cases, even when the length of the bit stream is a multiple of 47. Any high-order letters of "zero" value can be skipped in the last block of encoded output.
When decoding the letters, a truncated final block can be filled out with "zero" letters and converted to a 47-bit base 2 representation. The final 1 bit is not data, but marks the end of the bit stream.
Could Huffman coding be what you're looking for? It's a compression algorithm, which pretty much represents any information with a minimum of wasted bits.
Zero waste would be log_2(26) bits per letter. As pointed out earlier, you can get to 4.7 by reading 47 bits and converting them to 10 letters. However, you can get to 4.67 by converting every 14 bits into 3 characters. This has the advantage that it fits into an integer. If you have storage space and run time is important, you can create a lookup table with 17,576 entries mapping the possible 14 bits into 3 letters. Otherwise, you can do mod and div operations to compute the 3 letters.
number of letters    number of bits    bits/letter
        1                   4              4
        2                   9              4.5
        3                  14              4.67
        4                  18              4.5
        5                  23              4.6
        6                  28              4.67
        7                  32              4.57
        8                  37              4.63
        9                  42              4.67
       10                  47              4.7
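A Python sketch of the 14-bit / 3-letter row from the table above, using div/mod (a 16,384-entry lookup table would simply precompute these results):
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def block_to_letters(block14):            # block14: integer in 0 .. 2**14 - 1
    # 26**3 = 17576 >= 2**14 = 16384, so three letters always suffice
    q, c = divmod(block14, 26)
    a, b = divmod(q, 26)
    return LETTERS[a] + LETTERS[b] + LETTERS[c]

def letters_to_block(three):
    a, b, c = (LETTERS.index(ch) for ch in three)
    return (a * 26 + b) * 26 + c

print(block_to_letters(0b10101010101010))                  # qec
print(letters_to_block(block_to_letters(12345)) == 12345)  # True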
Any solution you use is going to be space-inefficient because 26 is not a power of 2. As far as an algorithm goes, I'd rather use a lookup table than an on-the-fly calculation for each series of 9 bits. Your lookup table would be 512 entries long.
If you want the binary footprint of each letter to have the same size, the optimal solution would be given by arithmetic encoding. However, it will not reach your goal of a mean representation of 4.5 bits/char. Given 26 different characters (not including space etc.), 4.7 bits per letter is the best you can reach without using variable-length encoding (Huffman, for instance; see Jaegers's answer) or other compression algorithms.
A suboptimal, although simpler, solution could be to find a feasible number of characters to fit into a big integer. For instance, if you form a 32-bit integer out of every 6-character chunk (which is possible as 26^6 < 2^32), you use 5.33 bits/char. You can actually even fit 13 letters into a 64-bit integer (4.92 bits/char). This is quite close to the optimal solution, and still rather easy to implement. Using integers bigger than 64 bits can be tricky due to missing native support in many programming languages.
If you want even better compression rates for text, you should definitely also look into dictionary-based compression algorithms, such as LZW or Deflate.

Resources