Can I pack 2 digits(32bit) in the 12 strings [0-9a-z]{13} - algorithm

I want to pack two 32-bit numbers into a 12-character string.
example
123 456 -> abcdef0000000

The choice [0-9a-z] gives 36 symbols. With 12 such symbols, you get 36^12 = 4.7e18 different strings.
Three 32-bit numbers are the same as one 96-bit number, which has 2^96 = 7.9e28 different values.
There are more values than you can represent with that alphabet in 12 positions, so no, you cannot achieve the packing you want.
If you use [0-9a-zA-Z] you have 62 symbols. 62^12 = 3.2e21 - still short.
Given a number of symbols S and string length L, you can figure out how many bits of info you can store by solving S^L >= 2^B. For your case, B = 96. Let's use the Base64 alphabet, so S = 64 (this also makes the math easier). Then:
64^L >= 2^96
take log2 of both sides
L * log2(64) >= 96 * log2(2)
L * 6 >= 96
L >= 16
So with a Base64 alphabet, you can store 96 bits in 16 symbols.
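The same inequality can be checked for any alphabet size with a few lines of Python. This is only a sketch: the name `min_length` is made up for illustration, and it multiplies integers rather than taking logarithms so that exact powers are not affected by floating-point rounding.

```python
def min_length(symbols: int, bits: int) -> int:
    """Smallest L such that symbols**L >= 2**bits."""
    if symbols < 2:
        raise ValueError("need at least two symbols")
    target = 1 << bits
    length = 0
    capacity = 1  # number of distinct strings of the current length
    while capacity < target:
        capacity *= symbols
        length += 1
    return length


if __name__ == "__main__":
    for s in (36, 62, 64):
        print(s, "symbols ->", min_length(s, 96), "characters for 96 bits")
```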

Assuming you mean 3 numbers:
Whether you can get them in the string depends on the language. You can't have ANY limitations on the character set, though. A 32 bit number is 4 bytes, three of them make 12 bytes. Zeros are a possibility--any language that won't let you store a zero byte in the string won't allow this. However, there are languages that don't care--Delphi comes to mind. It stores strings with length codes rather than terminators and thus you can put anything in a string.
Assuming you mean 2 numbers:
[0-9][A-Z] is 36 symbols. 36^6 < 2^32, it won't work.
[0-9][A-Z][a-z] is 62 symbols. 62^6 > 2^32, you can do it.
On the other hand, if you're talking strings that hold wide characters of some type even the three numbers can easily be packed in, although you'll have to use some non-ASCII characters to do the job. Pick any 256 symbols and you're good to go. You can even pick 2048 symbols (you'll find more than that if you look at the east Asian languages) and pack each number into three characters.

You are effectively trying to fit three 4-byte numbers into one 12-byte string. Yes, you can do this, but you will have to use all 256 ANSI symbols (some of which are unprintable, like #13, #7, #10), and it won't even be a "pack": the first 4 bytes hold the first number, the next 4 bytes hold the second, and the last 4 bytes hold the third.
If you want to pack TWO 4-byte numbers into a READABLE (printable) string of length 12, you need to be able to fit 4 bytes into 6 "digits" of the string. Thankfully 12 is divisible by 2, so you don't have to deal with uneven pack lengths. Solving 256^4 <= X^6 gives X >= 41, meaning yes, if you use a Base64 alphabet (as in the comments to the previous answer) you can build such a string.
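For the two-number case, here is a minimal sketch using the 62-symbol alphabet [0-9A-Za-z] mentioned above, six characters per number. The function names are invented for this example, and any alphabet with at least 41 symbols could be substituted for `ALPHABET`. Note that it does not satisfy the original [0-9a-z] restriction, which is too small for 12 characters.

```python
import string

# 62 symbols; 62**6 > 2**32, so six symbols hold one 32-bit value.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase
WIDTH = 6


def encode_u32(value):
    """Encode one unsigned 32-bit integer as exactly WIDTH symbols."""
    if not 0 <= value < 2**32:
        raise ValueError("value does not fit in 32 bits")
    chars = []
    for _ in range(WIDTH):
        value, digit = divmod(value, len(ALPHABET))
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


def decode_u32(text):
    """Inverse of encode_u32."""
    value = 0
    for ch in text:
        value = value * len(ALPHABET) + ALPHABET.index(ch)
    return value


def pack(a, b):
    """Two 32-bit numbers -> one 12-character string."""
    return encode_u32(a) + encode_u32(b)


def unpack(text):
    """12-character string -> the two original numbers."""
    return decode_u32(text[:WIDTH]), decode_u32(text[WIDTH:])


if __name__ == "__main__":
    packed = pack(123, 456)
    print(packed, unpack(packed))
```

Each half is zero-padded to a fixed width, so the split point is always known when decoding.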

Related

Sequence of binary input file for NIST

I've developed an RNG program, for now, I want to test whether my numbers are random. Thus, I decided to use NIST Test Suite.
I'm still confused about the input file format, they say: "The user may want to construct as many files of arbitrary length as desired. Files should contain binary sequences stored as either ASCII characters consisting of zeroes and ones, or as binary data where each byte contains eight bits worth of 0’s and 1’s"
My python RNG program will return a sequence of numbers line by line as:
69
11
68
55
33
20
75
96
How can I convert them to the proper input file for NIST?
Your first random number is 69, which is 1000101 in binary. You can either put that in your test file as the ASCII string "1000101" or as seven bits in a binary file 1000101... The ASCII option is probably easier, but the file will be eight times the size. With either case you might have to be careful with leading zeros in binary, I am not sure what NIST wants without reading a lot more of SP 800-22 than I currently have time for.
The appropriate input for NIST would be a set of binary sequences. You can simply convert your integers into binary numbers, write them to a file and use the file as input for the NIST program. Converting to binary and feeding them to NIST does not necessarily mean your RNG will pass all the tests there. Let's try to answer the following question.
How many bits do you produce for each integer?
For example, let's say your RNG generates integers between 0 to 5 (uniform distribution, all values are equiprobable). As representing 5 would require at least 3 bits, we will use 3 bits for each integer.
0: 000
1: 001
2: 010
3: 011
4: 100
5: 101
Look at the first (most significant) bit for each of the numbers. Four of them are 0, and the remaining two are 1. So, whenever you pick a random integer from 0 to 5, the probability of the first bit being 0 is higher than it being 1. Remember that, for an RNG, we need p(0)=p(1)=0.5 for each of the bits.
Now, if an RNG produced values from 0 to 7 (uniformly), we could convert each of them to 3 bits and maintain p(0)=p(1)=0.5 at all the indices. Why is that? Because we have all 2^3 different values (i.e., 0 to 2^3-1), it does not face any bias at any index (equal number of zeros and ones).
The above discussion leads us to the conclusion that, if you have integer values coming out of an RNG and they range from 0 to 2^n-1, and each of them is equiprobable, you can convert them to n bits and concatenate them for NIST evaluation. If those conditions do not hold (e.g., the number of outcomes is not a power of 2), one way is to settle for the maximum power of 2 that can be accommodated in the output range of your RNG and discard the rest of the values.
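The discard-and-concatenate idea might look like the sketch below. The question does not state the RNG's range, so the choice of 6 bits (keeping 0..63 and dropping everything larger) is an assumption made for the sample values, and `to_nist_bits` is a made-up helper name.

```python
def to_nist_bits(values, n_bits):
    """Return an ASCII '0'/'1' string with n_bits per accepted value.

    Values outside 0 .. 2**n_bits - 1 are discarded so that every bit
    position stays unbiased for a uniform source.
    """
    limit = 1 << n_bits
    return "".join(format(v, f"0{n_bits}b") for v in values if 0 <= v < limit)


if __name__ == "__main__":
    sample = [69, 11, 68, 55, 33, 20, 75, 96]  # the numbers from the question
    # Assumption: the generator is uniform on 0..99, so only 0..63 is kept.
    print(to_nist_bits(sample, 6))
```

The resulting string can be written to a text file and used as ASCII input. Zero-padding every value to the same width also settles the leading-zeros question for this format.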

Lossless compression of an ordered series of 29 digits (each 0 to 5 Likert scale)

I have a survey with 29 questions, each with a 5-point Likert scale (0=None of the time; 4=Most of the time). I'd like to compress the total set of responses to a small number of alpha or alphanumeric characters, adding a check digit to the end.
So, the set of responses 00101244231023110242231421211 would get turned into something like A2CR7HW4. This output would be part of a printout that a non-techie user would enter on a website as a shortcut to entering the entire string. I'd want to avoid ambiguous characters, such as 0,O,D,I,l,5,S, leaving me with 21 or 22 characters to use (uppercase only). Alternatively, I could just stick with capital alpha only and use all 26 characters.
I'm thinking to convert each pair of digits to a letter (5^2=25, so the whole alphabet is adequate). That would reduce the sequence to 15 characters, which is still longish to type without errors.
Any other suggestions on how to minimize the length of the output?
EDIT: BTW, for context, the survey asks 29 questions about mental health symptoms, generating a predictive risk for 4 psychiatric conditions. Need a code representing all responses.
If the five answers are all equally likely, then the best you can do is ceiling(29 * log(5) / log(n)) symbols, where n is the number of symbols in your alphabet. (The base of the logarithm doesn't matter, so long as they're both the same.)
So for your 22 symbols, the best you can do is 16. For 26 symbols, the best is 15, as you described for 25. If you use 49 characters (e.g. some subset of the upper and lower case characters and the digits), you can get down to 12. The best you'll be able to do with printable ASCII characters would be 11, using 70 of the 94 characters.
The only way to make it smaller would be if the responses are not all equally likely and are heavily skewed. Though if that's the case, then there's probably something wrong with the survey.
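To see where the alphabet sizes 49 and 70 come from, one can search for the smallest alphabet that reaches a given code length. This is a sketch with an invented function name, and it compares exact integer powers instead of using logarithms.

```python
def smallest_alphabet(length, questions=29, choices=5):
    """Smallest n such that n**length >= choices**questions."""
    total = choices ** questions  # number of distinct response sets
    n = 2
    while n ** length < total:
        n += 1
    return n


if __name__ == "__main__":
    for length in (16, 15, 12, 11):
        print(length, "characters need at least", smallest_alphabet(length), "symbols")
```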
First, choose a set of permissible characters, i.e.
characters = "ABC..."
Then, prefix the input-digits with a 1 and interpret it as a quinary number:
100101244231023110242231421211
Now, convert this quinary number to a number in base-"strlen(characters)", i.e. base26 if 26 characters are to be used:
02 23 18 12 10 24 04 19 00 15 14 20 00 03 17
Then, use these numbers as index in "characters", and you have your encoding:
CVSMKWETAPOUADR
For decoding, just reverse the steps.
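The steps above could be sketched in Python as follows. The alphabet here is assumed to be the 26 uppercase letters in order. An unambiguous subset can be passed in instead. No check digit is included.

```python
import string

CHARACTERS = string.ascii_uppercase  # assumed alphabet; replace as needed


def encode(responses, characters=CHARACTERS):
    """Digits 0-4 as a string -> short code over `characters`."""
    number = int("1" + responses, 5)  # the leading 1 preserves leading zeros
    base = len(characters)
    out = []
    while number:
        number, digit = divmod(number, base)
        out.append(characters[digit])
    return "".join(reversed(out))


def decode(code, characters=CHARACTERS):
    """Inverse of encode: code -> original digit string."""
    base = len(characters)
    number = 0
    for ch in code:
        number = number * base + characters.index(ch)
    digits = []
    while number:
        number, digit = divmod(number, 5)
        digits.append(str(digit))
    return "".join(reversed(digits))[1:]  # drop the sentinel 1


if __name__ == "__main__":
    answers = "00101244231023110242231421211"
    code = encode(answers)
    print(code, decode(code) == answers)
```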
Are you doing this in a specific language?
If you want to be really thrifty about it you might want to consider encoding the data at bit level.
Since there are only 5 possible answers per question you could do this with only 3 bits:
000
001
010
011
100
Your end result would be a string of bits, at 3-bits per answer so a total of 87 bits or 10 and a bit bytes.
EDIT - misread the question slightly, there are 5 possible answers not 4, my mistake.
The only problem now is that for 4 of your 5 answers you're wasting a bit...you ain't gonna benefit much from going to this much trouble I wouldn't say but it's worth considering.
EDIT:
I've been playing about with it and it's difficult to work out a mechanism that allows you to use both 2 and 3 bit values.
Since your output would be an 87-bit binary value, you'd need to be able to make the distinction between 2-bit and 3-bit values when converting back to the original values.
If you're working with a larger number of values there are some methods you could use, like having a reserved bit for each values that can be used to sort of type a value and give it some meaning. But working with so few bits as it is, it's hard to shave anything off.
Your output at 87 bits could be padded out to 128 bits, which would give you four 32-bit values if you wanted to simplify it. This 128-bit value would be like a unique fingerprint representing a specific set of answers. There are many ways you can represent 128 bits.
But in the end working at bit level is about as good as it gets when it comes to actual compression and encoding of data...if you can express 5 unique values in less than 3 bits I'd be suitably impressed.

Compress many numbers into a string

I was wondering if there's a way to compress 20 or so large numbers (~10^8) into a string of a reasonable length. For instance, if the numbers were stored as hex and concatenated, it'd be at least 160 characters long. I wonder if there's a smart way to compress the numbers in and get them back out. I was thinking about having a sequence 0-9 as reference and let one part of the input string be a number <1024. That number is to be converted to binary, which serves as a mask, i.e. indicating which digits exist in the number. It's still not clear where to go on from here.
Are there any better alternatives?
Thanks
If these large numbers are of the same size in bytes, and if you always know the count of those numbers, there is an easy way to do it. You simply have an array of your bytes, and instead of reading them out as integers, you read them out as characters. Are you trying to obfuscate your values or just pack them to be easily transferred?
When I'm compacting a lot of values into one, reversible String, I usually go with base 64 conversion. This can really cut off quite a lot of the length from a String, but note that it may take up just as much memory in representing it.
Example
This number in decimal:
10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
is the following in Base 64:
Yki8xQRRVqd403ldXJUT8Ungkh/A3Th2TMtNlpwLPYVgct2eE8MAn0bs4o/fv1bmo4oUNQa/9WtZ8gRE7IG+UHX+LniaQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Why you can't do this to an extreme level
Think about it for a second. Let's say you've got a number that is 10 bits long, and you want to represent it with 5 bits, so a 50% compression rate. First, we work out how many different numbers you can represent with 10 bits, which is:
2^10 = 1024
Okay, that's fine. How many numbers can we express with 5 bits?
2^5 = 32
So, you can only display 32 different numbers with 5 bits, whereas you can display 1024 numbers with 10 bits. For compression to work, there needs to be some mapping between the compressed value and the extracted value. Let's try and make that mapping happen..
Normal - Compressed
0 0
1 1
2 2
.. ...
31 31
32 ??
33 ??
34 ??
... ...
1023 ??
There is no mapping for most of the numbers that can be represented by the expanded value.
This is known as the Pigeonhole Principle, and in this example our value for n is greater than our value for m, hence we would need to map each compressed value to more than one normal value, which makes things incredibly complex. (Thank you, Oli, for reminding me.)
You need to be much more descriptive about what you mean by "string" and "~10^8". Can your "string" contain any sequence of bytes? Or is it restricted to a subset of possible bytes? If so, how exactly is it restricted? What are the limits on your "large numbers"? What do they represent?
Numbers up to 10^8 can be represented in 27 bits. 20 of them would be 540 bits, which could be stored in a string of 68 bytes, if any sequence of bytes is permitted. If the contents of a string are limited, it will take more bits. If your range of numbers is larger, it will take more bits.
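A minimal sketch of that 27-bit packing, assuming every number is below 2^27 and the count is known when unpacking (the function names are invented for this example):

```python
BITS = 27  # 2**27 = 134,217,728 > 10**8


def pack(numbers):
    """Concatenate the numbers at BITS bits each and return raw bytes."""
    acc = 0
    for n in numbers:
        if not 0 <= n < (1 << BITS):
            raise ValueError(f"{n} does not fit in {BITS} bits")
        acc = (acc << BITS) | n  # first number ends up in the highest bits
    total_bits = BITS * len(numbers)
    return acc.to_bytes((total_bits + 7) // 8, "big")


def unpack(data, count):
    """Recover `count` numbers from bytes produced by pack."""
    acc = int.from_bytes(data, "big")
    mask = (1 << BITS) - 1
    values = [(acc >> (BITS * i)) & mask for i in range(count)]
    return values[::-1]  # i = 0 is the last number, so reverse


if __name__ == "__main__":
    numbers = [10**8 - i for i in range(20)]
    blob = pack(numbers)
    print(len(blob), unpack(blob, 20) == numbers)
```

If the string must be printable, Base64-encoding those 68 bytes gives 92 characters including padding.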
store all numbers as strings to a marisa trie: https://code.google.com/p/marisa-trie/
Base64 the resulting trie dictionary
It depends of course a lot on your input. But it is a possibility to build a (very) compact representation this way.

how to represent a n-byte array in less than 2*n characters

given that a n-byte array can be represented as a 2*n character string using hex, is there a way to represent the n-byte array in less than 2*n characters?
for example, typically, an integer(int32) can be considered as a 4-byte array of data
The advantage of hex is that splitting an 8-bit byte into two equal halves is about the simplest thing you can do to map a byte to printable ASCII characters. More efficient methods consider multiple bytes as a block:
Base-64 uses 64 ASCII characters to represent 6 bits at a time. Every 3 bytes (i.e. 24 bits) are split into 4 6-bit base-64 digits, where the "digits" are:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
(and if the input is not a multiple of 3 bytes long, a 65th character, "=", is used for padding at the end). Note that there are some variant forms of base-64 that use different characters for the last two "digits".
Ascii85 is another representation, which is somewhat less well-known, but commonly used: it's often the way that binary data is encoded within PostScript and PDF files. This considers every 4 bytes (big-endian) as an unsigned integer, which is represented as a 5-digit number in base 85, with each base-85 digit encoded as ASCII code 33+n (i.e. "!" for 0, up to "u" for 84) - plus a special case where the single character "z" may be used (instead of "!!!!!") to represent 4 zero bytes.
(Why 85? Because 84^5 < 2^32 < 85^5.)
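Python's standard `base64` module implements both encodings, so the size difference is easy to compare. The 12 sample bytes below are arbitrary.

```python
import base64

# 12 arbitrary bytes: a multiple of both 3 and 4, so neither encoding pads.
data = bytes([0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC,
              0xDE, 0xF0, 0x01, 0x23, 0x45, 0x67])

hex_text = data.hex()              # 2 characters per byte
b64_text = base64.b64encode(data)  # 4 characters per 3 bytes
a85_text = base64.a85encode(data)  # 5 characters per 4 bytes

print(len(hex_text), len(b64_text), len(a85_text))

# The "z" shortcut for four zero bytes:
print(base64.a85encode(b"\x00\x00\x00\x00"))
```

For these 12 bytes the lengths are 24, 16 and 15 characters respectively.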
yes, using binary (in which case it takes n bytes, not surprisingly), or using any base higher than 16, a common one is base 64.
It might depend on the exact numbers you want to represent. For instance, the number 9223372036854775808, which requires 8 bytes to represent in binary, takes only 4 bytes in ASCII if you use the product of primes representation (which is "2^63").
How about base-64?
It all depends on what characters you're willing to use in your encoding (i.e. representation).
Base64 fits 6 bits in each character, which means that 3 bytes will fit in 4 characters.
Using 65536 of about 90000 defined Unicode characters you may represent binary string in N/2 characters.
Yes. Use more characters than just 0-9 and a-f. A single character (assuming 8-bit) can have 256 values, so you can represent an n-byte number in n characters.
If it needs to be printable, you can just choose some set of characters to represent various values. A good option is base-64 in that case.

Decoding letters ('a' .. 'z') from a bit sequence without waste

I seek an algorithm that will let me represent an incoming sequence of bits as letters ('a' .. 'z' ), in a minimal manner such that the stream of bits can be regenerated from the letters, without ever holding the entire sequence in memory.
That is, given an external bit source (each read returns a practically random bit), and user input of a number of bits, I would like to print out the minimal number of characters that can represent those bits.
Ideally there should be a parameterization - how much memory versus maximum bits before some waste is necessary.
Efficiency Goal - The same number of characters as the base-26 representation of the bits.
Non-solutions:
If sufficient storage was present, store the entire sequence and use a big-integer MOD 26 operation.
Convert every 9 bits to 2 characters - This seems suboptimal, wasting 25% of information capacity of the letters output.
If you assign a different number of bits per letter, you should be able to exactly encode the bits in the twenty-six letters allowed without wasting any bits. (This is a lot like a Huffman code, only with a pre-built balanced tree.)
To encode bits into letters: Accumulate bits until you match exactly one of the bit codes in the lookup table. Output that letter, clear the bit buffer, and keep going.
To decode letters into bits: For each letter, output the bit sequence in the table.
Implementing in code is left as an exercise to the reader. (Or to me, if I get bored later.)
a 0000
b 0001
c 0010
d 0011
e 0100
f 0101
g 01100
h 01101
i 01110
j 01111
k 10000
l 10001
m 10010
n 10011
o 10100
p 10101
q 10110
r 10111
s 11000
t 11001
u 11010
v 11011
w 11100
x 11101
y 11110
z 11111
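One way the table above could be turned into code is sketched below. The table does not say what to do when the input ends in the middle of a code, so this sketch pads the last code with zero bits and relies on the decoder being told the original bit count. That padding rule is an assumption, not part of the scheme as described.

```python
import string

# Rebuild the table: a-f use 4 bits (0000..0101), g-z use 5 bits (01100..11111).
CODES = {}
for i, letter in enumerate(string.ascii_lowercase):
    CODES[letter] = format(i, "04b") if i < 6 else format(i + 6, "05b")
DECODE = {bits: letter for letter, bits in CODES.items()}


def encode(bits):
    """'0'/'1' string -> letters; never buffers more than 5 bits."""
    letters = []
    buffer = ""
    for bit in bits:
        buffer += bit
        if buffer in DECODE:  # prefix-free, so the first match is the code
            letters.append(DECODE[buffer])
            buffer = ""
    while buffer:  # assumed rule: zero-pad a trailing partial code
        buffer += "0"
        if buffer in DECODE:
            letters.append(DECODE[buffer])
            buffer = ""
    return "".join(letters)


def decode(letters, n_bits):
    """Letters -> the first n_bits of the original bit string."""
    return "".join(CODES[ch] for ch in letters)[:n_bits]


if __name__ == "__main__":
    bits = "1011001110001"
    text = encode(bits)
    print(text, decode(text, len(bits)) == bits)
```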
Convert each block of 47 bits to a base 26 number of 10 digits. This gives you more than 99.99% efficiency.
This method, as well as others like Huffman, needs a padding mechanism to support variable-length input. This introduces some inefficiency which is less significant with longer inputs.
At the end of the bit stream, append an extra 1 bit. This must be done in all cases, even when the length of the bit stream is a multiple of 47. Any high-order letters of "zero" value can be skipped in the last block of encoded output.
When decoding the letters, a truncated final block can be filled out with "zero" letters and converted to a 47-bit base 2 representation. The final 1 bit is not data, but marks the end of the bit stream.
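A sketch of the 47-bit block scheme with the terminating 1 bit is shown below. It is a simplified variant: the final block is always written as a full 10 letters instead of skipping "zero" letters, and the decoder collects all bits in a list for brevity. All names are invented for the example.

```python
import string

LETTERS = string.ascii_lowercase
BLOCK_BITS, BLOCK_LETTERS = 47, 10  # 2**47 <= 26**10


def _block_to_letters(value):
    """One block value (< 2**47) -> exactly 10 base-26 letters."""
    out = []
    for _ in range(BLOCK_LETTERS):
        value, digit = divmod(value, 26)
        out.append(LETTERS[digit])
    return reversed(out)


def encode(bits):
    """Yield letters for an iterable of 0/1 ints, one block in memory at a time."""
    acc = count = 0
    for bit in bits:
        acc = (acc << 1) | bit
        count += 1
        if count == BLOCK_BITS:
            yield from _block_to_letters(acc)
            acc = count = 0
    # Always append the terminating 1 bit, then zero-fill the last block.
    acc = (acc << 1) | 1
    count += 1
    acc <<= BLOCK_BITS - count
    yield from _block_to_letters(acc)


def decode(letters):
    """Letters -> list of 0/1 ints, with the terminator and padding removed."""
    bits = []
    for i in range(0, len(letters), BLOCK_LETTERS):
        value = 0
        for ch in letters[i:i + BLOCK_LETTERS]:
            value = value * 26 + LETTERS.index(ch)
        bits.extend((value >> s) & 1 for s in range(BLOCK_BITS - 1, -1, -1))
    while bits[-1] == 0:  # strip zero padding after the terminator
        bits.pop()
    bits.pop()  # the terminating 1 bit is not data
    return bits


if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 0, 1]
    text = "".join(encode(data))
    print(text, decode(text) == data)
```

Because the terminator is always added, an input of exactly 47 bits produces two blocks, as the answer requires.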
Could Huffman coding be what you're looking for? It's a compression algorithm, which pretty much represents any information with a minimum of wasted bits.
Zero waste would be log_2(26) ≈ 4.70 bits per letter. As pointed out earlier, you can get to 4.7 by reading 47 bits and converting them to 10 letters. However, you can get to 4.67 by converting every 14 bits into 3 characters. This has the advantage that it fits into an integer. If you have storage space and run time is important, you can create a lookup table with 16,384 (2^14) entries mapping each possible 14-bit value to 3 letters; the 26^3 = 17,576 available letter triples are more than enough. Otherwise, you can do mod and div operations to compute the 3 letters.
letters  bits  bits/letter
1        4     4
2        9     4.5
3        14    4.67
4        18    4.5
5        23    4.6
6        28    4.67
7        32    4.57
8        37    4.63
9        42    4.67
10       47    4.7
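The table can be regenerated with integer arithmetic: the most bits that fit in k letters is the largest b with 2^b <= 26^k. A short sketch (`max_bits` is a made-up name):

```python
ALPHABET_SIZE = 26


def max_bits(letters):
    """Largest b such that 2**b <= ALPHABET_SIZE**letters."""
    return (ALPHABET_SIZE ** letters).bit_length() - 1


if __name__ == "__main__":
    for k in range(1, 11):
        b = max_bits(k)
        print(f"{k:2d} {b:3d} {b / k:.2f}")
```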
Any solution you use is going to be space-inefficient because 26 is not a power of 2. As far as an algorithm goes, I'd rather use a lookup table than an on-the-fly calculation for each series of 9 bits. Your lookup table would be 512 entries long.
If you want the binary footprint of each letter to have the same size, the optimal solution would be given by Arithmetic Encoding. However, it will not reach your goal of a mean representation of 4.5 bits/char. Given 26 different characters (not including space etc) 4.7 would be the best you can reach without using variable-length encoding (Huffman, for instance. See Jaegers's answer) or other compression algorithms.
A suboptimal, although simpler, solution could be to find a feasible number of characters to fit into a big integer. For instance, if you form a 32-bit integer out of every 6-character chunk (which is possible as 26^6 < 2^32), you use 5.33 bits/char. You can actually even fit 13 letters into a 64-bit integer (4.92 bits/char). This is quite close to the optimal solution, and still rather easy to implement. Using bigger ints than 64 bits can be tricky due to missing native support in many programming languages.
If you want even better compression rates for text, you should definitely also look into dictionary-based compression algorithms, such as LZW or Deflate.
