Sequence of binary input file for NIST - random

I've developed an RNG program, for now, I want to test whether my numbers are random. Thus, I decided to use NIST Test Suite.
I'm still confused about the input file format, they say: "The user may want to construct as many files of arbitrary length as desired. Files should contain binary sequences stored as either ASCII characters consisting of zeroes and ones, or
as binary data where each byte contains eight bits worth of 0’s and 1’s"
My python RNG program will return a sequence of numbers line by line as:
69
11
68
55
33
20
75
96
How can I convert them to the proper input file for NIST?

Your first random number is 69, which is 1000101 in binary. You can either put that in your test file as the ASCII string "1000101" or as seven bits in a binary file 1000101... The ASCII option is probably easier, but the file will be eight times the size. With either case you might have to be careful with leading zeros in binary, I am not sure what NIST wants without reading a lot more of SP 800-22 than I currently have time for.

The appropriate input for NIST would be a set of binary sequences. You can simply convert your integers into binary numbers, write them to a file and use the file as input for the NIST program. Converting to binary and feeding them to NIST does not necessarily mean your RNG will pass all the tests there. Let's try to answer the following question.
How many bits do you produce for each integer?
For example, let's say your RNG generates integers between 0 to 5 (uniform distribution, all values are equiprobable). As representing 5 would require at least 3 bits, we will use 3 bits for each integer.
0: 000
1: 001
2: 010
3: 011
4: 100
5: 101
Look at the first (most significant) bit for each of the numbers. Four of them are 0, and the remaining two are 1. So, whenever you pick a random integer from 0 to 5, the probability of the first bit being 0 is higher than it being 1. Remember that, for an RNG, we need p(0)=p(1)=0.5 for each of the bits.
Now, if an RNG produced values from 0 to 7 (uniformly), we could convert each of them to 3 bits and maintain p(0)=p(1)=0.5 at all the indices. Why is that? Because we have all 23 different values (i.e., 0 to 23-1), it does not face any bias at any index (equal number of zeros and ones).
The above discussion leads us to the conclusion that, if you have integer values coming out of an RNG and they range from 0 to 2n-1, and each of them are equiprobable, you can convert them to n bits and concatenate them for NIST evaluation. If those conditions do not hold (e.g., the number of outcomes is not a power of 2), one way is to settle for the maximum power of 2 that can be accommodated in the output range of your RNG and discard the rest of the values.

Related

Lossless compression of an ordered series of 29 digits (each 0 to 5 Likert scale)

I have a survey with 29 questions, each with a 5-point Likert scale (0=None of the time; 4=Most of the time). I'd like to compress the total set of responses to a small number of alpha or alphanumeric characters, adding a check digit to the end.
So, the set of responses 00101244231023110242231421211 would get turned into something like A2CR7HW4. This output would be part of a printout that a non-techie user would enter on a website as a shortcut to entering the entire string. I'd want to avoid ambiguous characters, such as 0,O,D,I,l,5,S, leaving me with 21 or 22 characters to use (uppercase only). Alternatively, I could just stick with capital alpha only and use all 26 characters.
I'm thinking to convert each pair of digits to a letter (5^2=25, so the whole alphabet is adequate). That would reduce the sequence to 15 characters, which is still longish to type without errors.
Any other suggestions on how to minimize the length of the output?
EDIT: BTW, for context, the survey asks 29 questions about mental health symptoms, generating a predictive risk for 4 psychiatric conditions. Need a code representing all responses.
If the five answers are all equally likely, then the best you can do is ceiling(29 * log(5) / log(n)) symbols, where n is the number of symbols in your alphabet. (The base of the logarithm doesn't matter, so long as they're both the same.)
So for your 22 symbols, the best you can do is 16. For 26 symbols, the best is 15, as you described for 25. If you use 49 characters (e.g. some subset of the upper and lower case characters and the digits), you can get down to 12. The best you'll be able to do with printable ASCII characters would be 11, using 70 of the 94 characters.
The only way to make it smaller would be if the responses are not all equally likely and are heavily skewed. Though if that's the case, then there's probably something wrong with the survey.
First, choose a set of permissible characters, i.e.
characters = "ABC..."
Then, prefix the input-digits with a 1 and interpret it as a quinary number:
100101244231023110242231421211
Now, convert this quinary number to a number in base-"strlen(characters)", i.e. base26 if 26 characters are to be used:
02 23 18 12 10 24 04 19 00 15 14 20 00 03 17
Then, use these numbers as index in "characters", and you have your encoding:
CVSMKWETAPOUADR
For decoding, just reverse the steps.
Are you doing this in a specific language?
If you want to be really thrifty about it you might want to consider encoding the data at bit level.
Since there are only 5 possible answers per question you could do this with only 3 bits:
000
001
010
011
100
Your end result would be a string of bits, at 3-bits per answer so a total of 87 bits or 10 and a bit bytes.
EDIT - misread the question slightly, there are 5 possible answers not 4, my mistake.
The only problem now is that for 4 of your 5 answers you're wasting a bit...you ain't gonna benefit much from going to this much trouble I wouldn't say but it's worth considering.
EDIT:
I've been playing about with it and it's difficult to work out a mechanism that allows you to use both 2 and 3 bit values.
Since your output would be a 97 bit binary value you'd need ot be able make the distinction between 2 and 3 bits values when converting back to the original values.
If you're working with a larger number of values there are some methods you could use, like having a reserved bit for each values that can be used to sort of type a value and give it some meaning. But working with so few bits as it is, it's hard to shave anything off.
Your output at 97 bits could be padded out to 128 bits, which would give you 4 32-bit values if you wanted to simplify it. this 128 bit value would be like a unique fingerprint representing a specific set of answers. There are many ways you can represnt 128 bits.
But in the end borking at bit-level is about as good as it gets when it comes to actual compression and encoding of data...if you can express 5 unique values in less than 3 bits I'd be suitably impressed.

Compress many numbers into a string

I was wondering if there's a way to compress 20 or so large numbers (~10^8) into a string of a reasonable length. For instance, if the numbers were stored as hex and concatenated, it'd be at least 160 characters long. I wonder if there's a smart way to compress the numbers in and get them back out. I was thinking about having a sequence 0-9 as reference and let one part of the input string be a number <1024. That number is to be converted to binary, which serves as a mask, i.e. indicating which digits exist in the number. It's still not clear where to go on from here.
Are there any better alternatives?
Thanks
If these large numbers are of the same size in bytes, and if you always know the count of those numbers, there is an easy way to do it. You simply Have an array of your bytes, and instead of reading them out as integers, you read them out as characters. Are you trying to obfuscate your values or just pack them to be easily transferred?
When I'm compacting a lot of values into one, reversible String, I usually go with base 64 conversion. This can really cut off quite a lot of the length from a String, but note that it may take up just as much memory in representing it.
Example
This number in decimal:
10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
is the following in Base 64:
Yki8xQRRVqd403ldXJUT8Ungkh/A3Th2TMtNlpwLPYVgct2eE8MAn0bs4o/fv1bmo4oUNQa/9WtZ8gRE7IG+UHX+LniaQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Why you can't do this too an extreme level
Think about it for a second. Let's say you've got a number of length 10. And you want to represent that number with 5 characters, so a 50% rate compression scheme. First, we work out how many possible numbers you can represent with 10 digits.. which is..
2^10 = 1024
Okay, that's fine. How many numbers can we express with 5 digits:
2^5 = 32
So, you can only display 32 different numbers with 5 bits, whereas you can display 1024 numbers with 10 bits. For compression to work, there needs to be some mapping between the compressed value and the extracted value. Let's try and make that mapping happen..
Normal - Compressed
0 0
1 1
2 2
.. ...
31 31
32 ??
33 ??
34 ??
... ...
1023 ??
There is no mapping for most of the numbers that can be represented by the expanded value.
This is known as the Pigeonhole Principle and in this example our value for n is greater than our value for m, hence we need to map values from our compressed values to more than one normal value, which makes things incredibly complex. (thankyou Oli for reminding me).
You need to be much more descriptive about what you mean by "string" and "~10^8". Can your "string" contain any sequence of bytes? Or is it restricted to a subset of possible bytes? If so, how exactly is it restricted? What are the limits on your "large numbers"? What do they represent?
Numbers up to 108 can be represented in 27 bits. 20 of them would be 540 bits, which could be stored in a string of 68 bytes, if any sequence of bytes is permitted. If the contents of a string are limited, it will take more bits. If your range of numbers is larger, it will take more bits.
store all numbers as strings to a marisa trie: https://code.google.com/p/marisa-trie/
Base64 the resulting trie dictionary
It depends of course a lot on your input. But it is a possibility to build a (very) compact representation this way.

Can I pack 2 digits(32bit) in the 12 strings [0-9a-z]{13}

I wanna pack 2 32bit digit in to 12strings
example
123 456 -> abcdef0000000
The choice [0-9a-z] gives 36 symbols. With 12 such symbols, you get 36^12 = 4.7e18 different strings.
3 32-bit digits is the same as one 96-bit digit, which has 2^96 = 7.9e28 different values.
There are more digits than you can represent with that alphabet in 12 positions, so no, you cannot achieve the packing you want.
If you use [0-9a-zA-Z] you have 62 symbols. 62^12 = 3.2e21 - still short.
Given a number of symbols S and string length L, you can figure out how many bits of info you can store by solving S^L >= 2^B. For your case, B = 96. Let's use the Base64 alphabet, so S = 64 (this also makes the math easier). Then:
64^L >= 2^96
take log2 of both sides
L * log2(64) >= 96 * log2(2)
L * 6 >= 96
L >= 16
So with a Base64 alphabet, you can store 96 bits in 16 symbols.
Assuming you mean 3 numbers:
Whether you can get them in the string depends on the language. You can't have ANY limitations on the character set, though. A 32 bit number is 4 bytes, three of them make 12 bytes. Zeros are a possibility--any language that won't let you store a zero byte in the string won't allow this. However, there are languages that don't care--Delphi comes to mind. It stores strings with length codes rather than terminators and thus you can put anything in a string.
Assuming you mean 2 numbers:
[0-9][A-Z] is 36 symbols. 36^6 < 2^32, it won't work.
[0-9][A-Z][a-z] is 62 symbols. 62^6 > 2^32, you can do it.
On the other hand, if you're talking strings that hold wide characters of some type even the three numbers can easily be packed in, although you'll have to use some non-ASCII characters to do the job. Pick any 256 symbols and you're good to go. You can even pick 2048 symbols (you'll find more than that if you look at the east Asian languages) and pack each number into three characters.
You are practically wanting to compact a 3 4-byte length numbers into one 12-byte length string. Yes, you can do this, but you will have to use all 256 ANSI symbols (some of which are unprintable, like #13, #7, #10), and it won't even be a "pack", just first 4 bytes will hold one number, second 4 bytes will hold second number, third 4 bytes will hold third number.
If you want to pack TWO 4-byte numbers into a READABLE (printable) string of length 12, you need to be able to stuff 4 bytes into 6 "digits" of a string, thankfully 12 is divisible by 2, and you don't have to play with uneven pack length. So, solving 256^4 <= X^6, we receive X as 41, meaning yes, if you use a Base64 alphabet (as in comments to previous andwer) you are capable of having such a string.

Algorithms: random unique string

I need to generate string that meets the following requirements:
it should be a unique string;
string length should be 8 characters;
it should contain 2 digits;
all symbols (non-digital characters) should be upper case.
I will store them in a data base after generation (they will be assigned to other entities).
My intention is to do something like this:
Generate 2 random values from 0 to 9—they will be used for digits in the string;
generate 6 random values from 0 to 25 and add them to 64—they will be used as 6 symbols;
concatenate everything into one string;
check if the string already exists in the data base; if not—repeat.
My concern with regard to that algorithm is that it doesn't guarantee a result in finite time (if there are already A LOT of values in the data base).
Question: could you please give advice on how to improve this algorithm to be more deterministic?
Thanks.
it should be unique string;
string length should be 8 characters;
it should contains 2 digits;
all symbols (non-digital characters) - should be upper case.
Assuming:
requirements #2 and #3 are exact (exactly 8 chars, exactly 2 digits) and not a minimum
the "symbols" in requirement #4 are the 26 capital letters A through Z
you would like an evenly-distributed random string
Then your proposed method has two issues. One is that the letters A - Z are ASCII 65 - 90, not 64 - 89. The other is that it doesn't distribute the numbers evenly within the possible string space. That can be remedied by doing the following:
Generate two different integers between 0 and 7, and sort them.
Generate 2 random numbers from 0 to 9.
Generate 6 random letters from A to Z.
Use the two different integers in step #1 as positions, and put the 2 numbers in those positions.
Put the 6 random letters in the remaining positions.
There are 28 possibilities for the two different integers ((8*8 - 8 duplicates) / 2 orderings), 266 possibilities for the letters, and 100 possibilities for the numbers, the total # of valid combinations being Ncomb = 864964172800 = 8.64 x 1011.
edit: If you want to avoid the database for storage, but still guarantee both uniqueness of strings and have them be cryptographically secure, your best bet is a cryptographically random bijection from a counter between 0 and Nmax <= Ncomb to a subset of the space of possible output strings. (Bijection meaning there is a one-to-one correspondence between the output string and the input counter.)
This is possible with Feistel networks, which are commonly used in hash functions and symmetric cryptography (including AES). You'd probably want to choose Nmax = 239 which is the largest power of 2 <= Ncomb, and use a 39-bit Feistel network, using a constant key you keep secret. You then plug in your counter to the Feistel network, and out comes another 39-bit number X, which you then transform into the corresponding string as follows:
Repeat the following step 6 times:
Take X mod 26, generate a capital letter, and set X = X / 26.
Take X mod 100 to generate your two digits, and set X = X / 100.
X will now be between 0 and 17 inclusive (239 / 266 / 100 = 17.796...). Map this number to two unique digit positions (probably easiest using a lookup table, since we're only talking 28 possibilities. If you had more, use Floyd's algorithm for generating a unique permutation, and use the variable-base technique of mod + integer divide instead of generating a random number).
Follow the random approach above, but use the numbers generated by this algorithm instead.
Alternatively, use 40-bit numbers, and if the output of your Feistel network is > Ncomb, then increment the counter and try again. This covers the entire string space at the cost of rejecting invalid numbers and having to re-execute the algorithm. (But you don't need a database to do this.)
But this isn't something to get into unless you know what you're doing.
Are these user passwords? If so, there are a couple of things you need to take into account:
You must avoid 0/O and I/1, which can easily be mistaken for each other.
You must avoid too many consecutive letters, which might spell out a rude word.
As far as 2 is concerned, you can avoid the problem by using LLNLLNLL as your pattern (L = letter, N = number).
If you need 1 million passwords out of a pool of 2.5 billion, you will certainly get clashes in your database, so you have to deal with them gracefully. But a simple retry is enough, if your random number generator is robust.
I don't see anything in your requirements that states that the string needs to be random. You could just do something like the following pseudocode:
for letters in ( 'AAAAAA' .. 'ZZZZZZ' ) {
for numbers in ( 00 .. 99 ) {
string = letters + numbers
}
}
This will create unique strings eight characters long, with two digits and six upper-case letters.
If you need randomly-generated strings, then you need to keep some kind of record of which strings have been previously generated, so you're going to have to hit a DB (or keep them all in memory, or write them to a textfile) and check against that list.
I think you're safe well into your tens of thousands of such ID's, and even after that you're most likely alright.
Now if you want some determinism, you can always force a password after a certain number of failures. Say after 50 failures, you select a password at random and increment a part of it by 1 until you get a free one.
I'm willing to bet money though that you'll never see the extra functionality kick in during your life time :)
Do it the other way around: generate one big random number that you will split up to obtain the individual characters:
long bigrandom = ...;
int firstDigit = bigRandom % 10;
int secondDigit = ( bigrandom / 10 ) % 10;
and so on.
Then you only store the random number in your database and not the string. Since there's a one-to-one relationship between the string and the number, this doesn't really make a difference.
However, when you try to insert a new value, and it's already in the databse, you can easily find the smallest unallocated number graeter than the originally generated number, and use that instead of the one you generated.
What you gain from this method is that you're guaranteed to find an available code relatively quickly, even when most codes are already allocated.
For one thing, your list of requirements doesn't state that string has to be necessary random, so you might consider something like database index.
If 'random' is a requirement, you can do a few improvements.
Store string as a number in database. Not sure how much this improves perfromance.
Do not store used strings at all. You can employ 'index' approach above, but convert integer number to a string in a seemingly random fashion (e.g., employing bit shift). Without much research, nobody will notice pattern.
E.g., if we have sequence 1, 2, 3, 4, ... and use cyclic binary shift right by 1 bit, it'll be turned into 4, 1, 5, 2, ... (assuming we have 3 bits only)
It doesn't have to be a shift too, it can be a permutation or any other 'randomization'.
The problem with your approach is clearly that while you have few records, you are very unlikely to get collisions but as your number of records grows the chance will increase until it becomes more likely than not that you'll get a collision. Eventually you will be hitting multiple collisions before you get a 'valid' result. Every time will require a table scan to determine if the code is valid, and the whole thing turns into a mess.
The simplest solution is to precalculate your codes.
Start with the first code 00AAAA, and increment to generate 00AAAB, 00AAAC ... 99ZZZZ. Insert them into a table in random order. When you need a new code, retrieve to top record unused record from the table (then mark it as used). It's not a huge table, as pointed out above - only a few million records.
You don't need to calculate any random numbers and generate strings for each user (already done)
You don't need to check whether anything has already been used, just get the next available
No chance of getting multiple collisions before finding something usable.
If you ever need more 'codes', just generate some more 'random' strings and append them to the table.

Decoding letters ('a' .. 'z') from a bit sequence without waste

I seek an algorithm that will let me represent an incoming sequence of bits as letters ('a' .. 'z' ), in a minimal matter such that the stream of bits can be regenerated from the letters, without ever holding the entire sequence in memory.
That is, given an external bit source (each read returns a practically random bit), and user input of a number of bits, I would like to print out the minimal number of characters that can represent those bits.
Ideally there should be a parameterization - how much memory versus maximum bits before some waste is necessary.
Efficiency Goal - The same number of characters as the base-26 representation of the bits.
Non-solutions:
If sufficient storage was present, store the entire sequence and use a big-integer MOD 26 operation.
Convert every 9 bits to 2 characters - This seems suboptimal, wasting 25% of information capacity of the letters output.
If you assign a different number of bits per letter, you should be able to exactly encode the bits in the twenty-six letters allowed without wasting any bits. (This is a lot like a Huffman code, only with a pre-built balanced tree.)
To encode bits into letters: Accumulate bits until you match exactly one of the bit codes in the lookup table. Output that letter, clear the bit buffer, and keep going.
To decode letters into bits: For each letter, output the bit sequence in the table.
Implementing in code is left as an exercise to the reader. (Or to me, if I get bored later.)
a 0000
b 0001
c 0010
d 0011
e 0100
f 0101
g 01100
h 01101
i 01110
j 01111
k 10000
l 10001
m 10010
n 10011
o 10100
p 10101
q 10110
r 10111
s 11000
t 11001
u 11010
v 11011
w 11100
x 11101
y 11110
z 11111
Convert each block of 47 bits to a base 26 number of 10 digits. This gives you more than 99.99% efficiency.
This method, as well as others like Huffman, needs a padding mechanism to support variable-length input. This introduces some inefficiency which is less significant with longer inputs.
At the end of the bit stream, append an extra 1 bit. This must be done in all cases, even when the length of the bit stream is a multiple of 47. Any high-order letters of "zero" value can be skipped in the last block of encoded output.
When decoding the letters, a truncated final block can be filled out with "zero" letters and converted to a 47-bit base 2 representation. The final 1 bit is not data, but marks the end of the bit stream.
Could Huffman coding be what you're looking for? It's a compression algorithm, which pretty much represents any information with a minimum of wasted bits.
Zero waste would be log_2(26) bits per letter. As pointed out earlier, you can get to 4.7 by reading 47 bits and converting them to 10 letters. However, you can get to 4.67 by converting every 14 bits into 3 characters. This has the advantage that it fits into an integer. If you have storage space and run time is important, you can create a lookup table with 17,576 entries mapping the possible 14 bits into 3 letters. Otherwise, you can do mod and div operations to compute the 3 letters.
number of letters number of bits bits/letter
1 4 4
2 9 4.5
3 14 4.67
4 18 4.5
5 23 4.6
6 28 4.67
7 32 4.57
8 37 4.63
9 42 4.67
10 47 4.7
Any solution you use is going to be space-inefficient because 26 is not a power of 2. As far as an algorithm goes, I'd rather use a lookup table than an on-the-fly calculation for each series of 9 bits. Your lookup table would 512 entries long.
If you want the binary footprint of each letter to have the same size, the optimal solution would be given by Arithmetic Encoding. However, it will not reach your goal of a mean representation of 4.5 bits/char. Given 26 different characters (not including space etc) 4.7 would be the best you can reach without using variable-length encoding (Huffman, for instance. See Jaegers's answer) or other compression algoritms.
A suboptimal, although simpler, solution could be to find a feasible number of characters to fit into a big integer. For instance, if you form a 32-bit integer out of every 6 charachter chunk (which is possible as 26^6 < 2^32), you use 5.33 bits/char. You can actually even fit 13 letters into a 64 bit integer (4.92 bits/char). This is quite close to the optimal solution, and still rather easy to implement. Using bigger ints than 64 bits can be tricky due to missing native support in many progamming languages.
If you want even better compression rates for text, you should definitely also look into dictionary-based compression algorithms, such as LZW or Deflate.

Resources