Sorting 100 unique numbers by using 40bytes of memory - algorithm

I've been asked a good programming problem:
In the input I've got 100 unique numbers from 0-255(1 byte). I can only read one number at a time and only once. I've got 40 bytes of memory which I can use. The goal is to sort all numbers and print them in the output. I know for sure that the uniqueness of the numbers is very important.
Any ideas?

32 bytes give you 256 bits, just enough to maintain a bit map of which of the 256 possible byte values are seen in the input. One additional byte is used to store the input value. Read each value, mark it in the bitmap, then discard. Once you've read all 100 input values, simply write out the value associated with the bits you set in the bit map.
Then ask what you are supposed to do with the other 7 bytes :)

Since your numbers are unique and they are only 1-byte long, they have to be within 0 to 255. Treat your 40 bytes of storage as a long bit vector. As you read each number, set the appropriate bit in this 320-bit bit-vector. When you're done reading the input, turn around and scan through this bit-vector, printing the number corresponding to each set bit.
In response to #JavaNewb, here is some more detail. First, since a byte contains 8 bits, it can assume only one of 256 possible values, namely, 0 through 255. Armed with this little factoid, you look at the 40-byte storage array you have. This array turns out to have 40 bytes * 8 bits/byte = 320 bits. Since the problem states that each of the 100 1-byte numbers are unique, you know that you will see a particular number (which can range from 0 through 255) at most once. Each time you see a number, you set the corresponding bit in the 40-byte array. For instance, if you encounter the number 50, you set bit number 2 in byte number 6. A number N corresponds to bit N%8 in byte N/8. You are guaranteed to never encounter a set bit in this array since that would imply the existence of duplicates in the 100 numbers. After you've read in all the numbers, you look at the 40-byte array. Each bit that is set in this array corresponds to one of the 100 numbers you read in. By traversing this 40-byte array from the 0th bit in the 0th byte all the way to the 7th bit in the 31st byte, you will by:
Capturing all the numbers that were read in
Observing them in a sorted order
All you have to do now is print the numbers corresponding to the set bits as you traverse the 40-byte array.

Related

How to detect that bit input ended?

I write LZW/Huffman encoder/decoder. LZW coder sends the number of bits according to its table: if it has less then 2^n elements, it sends n bits. Huffman coder recieve these bits byte by byte and encode them into specific number of bits according to its tree.
So the problem is the last byte can contain less then 8 bits. If i use EOF to detect end of input when decoding i can suddenly get EOF value before actual end of input. And if i send/recieve 4 bytes considering the first one is for sign, i lose 1 bit every 4 bytes.
Should i lose these first bits or there is another better solution i don't know?

how to interpret a two-byte value?

I am trying to read a 2-byte value that is stored lower order byte first. So if the 2 bytes are 10 and 85 in that order (decimal) what is the number they represent? 8510? 851? Or something different?
These values represent the length of an encoded sequence and I need to know the length to properly handle the information it contains. Most only use the first byte and as a decimal number it accurately represents the total number of characters (or bytes) in the sequence... but some use both bytes and I don't understand how to interpret them.
IF anyone can help be with this I would appreciate it.
Thanks
What you're referring to is the "endianness" of the two-byte value (also called a WORD). It's generally not considered beneficial to discuss them in terms of decimal values because no, if the bytes are 10 and 85 in decimal, the values would not be any combination of 10/85. Rather, showing them in the hexadecimal values is far more prudent.
+----+----+
|0x0a|0x55|
+----+----+
To interpret this as Little Endian, the value would be 0x550a (21770 in decimal). And in Big Endian (highest order byte first), the value is 0x0a55 (2645 decimal).
It's very important, for this reason, to know the endianness of your system and be able to properly handle the two. It is also noteworthy that "Network Byte Order" is Big Endian.

Base91, how is it calculated?

I've been looking online to find out how basE91 is calculated. I have found resources such as this one which specifies the characters used for a specific value but nowhere have I found how I get that value.
I have tried changing the input values into binary and taking chunks of both 6 and 7 bits but these do not work and I get the incorrect output. I do not want code that will do this for me as I which to write that myself, I only want to know the process needed to encode a string into basE91.
First, you need to see the input as a bit stream.
Then, read 13 bits from the stream, and form an integer value from it. If the value of this integer is lower than or equal to 88, then read one additional bit, and put it into the 14th bit (lowest bit being 1st) of the integer. This integer's (let's call it v) maximum value is: 8192+88 = 8280.
Then split v into two indices: i0 = v%91, i1 = v/91. Then use a 91-element character table, and output two characters: table[i0], table[i1].
(now you can see the reason of 88: for the maximal value (8280), both i0 and i1 become 90)
So this process is more complicated than base64, but more space efficient. Furthermore, unlike base64, the size of the output is a little bit dependent of the input bytes. A N-length sequence of 0x00 will be shorter than a N-length sequence of 0xff (where N is a sufficiently large number).

How to uniquely represent 99,999 bits as a byte, word, or double word

I have 99,999 bit flags that I need to represent uniquely with 32 bits or less. Any of the bits can be set and I need to know if the set bits differ from a comparable set of bits. I am considering using CRC to store a unique value hash but I am not sure if collisions will be a problem. Ideally, less than 500 of these bits will be set at any given time, but they will not be know ahead of time.
Is there suitable hash or other algorithm to uniquely represent these bits?
NO!
Without some other information about those bit flags to identify that certain combinations are impossible, this cannot be done. If all combinations are possible, then you will need to use 99,999 bits to store your 99,999 bit flags.
Edit:
Based on the background information that this is to reduce network usage and the expectation is that only about 500 of the bits are set, there are techniques that can be used, but none are a simple hash, and none are efficient enough to store in 32 bits. I would start by looking at Arithmetic Coding. This uses a probability distribution of the characters that you want to send (0.5% 1, 99.5% 0) to compress data. By my computations, you can "expect" a compression of about 22 times. But, for signals that are considered rare, you will pay the price by needing to transmit a signal larger than your starting 99,999 bits.

Compress many numbers into a string

I was wondering if there's a way to compress 20 or so large numbers (~10^8) into a string of a reasonable length. For instance, if the numbers were stored as hex and concatenated, it'd be at least 160 characters long. I wonder if there's a smart way to compress the numbers in and get them back out. I was thinking about having a sequence 0-9 as reference and let one part of the input string be a number <1024. That number is to be converted to binary, which serves as a mask, i.e. indicating which digits exist in the number. It's still not clear where to go on from here.
Are there any better alternatives?
Thanks
If these large numbers are of the same size in bytes, and if you always know the count of those numbers, there is an easy way to do it. You simply Have an array of your bytes, and instead of reading them out as integers, you read them out as characters. Are you trying to obfuscate your values or just pack them to be easily transferred?
When I'm compacting a lot of values into one, reversible String, I usually go with base 64 conversion. This can really cut off quite a lot of the length from a String, but note that it may take up just as much memory in representing it.
Example
This number in decimal:
10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
is the following in Base 64:
Yki8xQRRVqd403ldXJUT8Ungkh/A3Th2TMtNlpwLPYVgct2eE8MAn0bs4o/fv1bmo4oUNQa/9WtZ8gRE7IG+UHX+LniaQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Why you can't do this too an extreme level
Think about it for a second. Let's say you've got a number of length 10. And you want to represent that number with 5 characters, so a 50% rate compression scheme. First, we work out how many possible numbers you can represent with 10 digits.. which is..
2^10 = 1024
Okay, that's fine. How many numbers can we express with 5 digits:
2^5 = 32
So, you can only display 32 different numbers with 5 bits, whereas you can display 1024 numbers with 10 bits. For compression to work, there needs to be some mapping between the compressed value and the extracted value. Let's try and make that mapping happen..
Normal - Compressed
0 0
1 1
2 2
.. ...
31 31
32 ??
33 ??
34 ??
... ...
1023 ??
There is no mapping for most of the numbers that can be represented by the expanded value.
This is known as the Pigeonhole Principle and in this example our value for n is greater than our value for m, hence we need to map values from our compressed values to more than one normal value, which makes things incredibly complex. (thankyou Oli for reminding me).
You need to be much more descriptive about what you mean by "string" and "~10^8". Can your "string" contain any sequence of bytes? Or is it restricted to a subset of possible bytes? If so, how exactly is it restricted? What are the limits on your "large numbers"? What do they represent?
Numbers up to 108 can be represented in 27 bits. 20 of them would be 540 bits, which could be stored in a string of 68 bytes, if any sequence of bytes is permitted. If the contents of a string are limited, it will take more bits. If your range of numbers is larger, it will take more bits.
store all numbers as strings to a marisa trie: https://code.google.com/p/marisa-trie/
Base64 the resulting trie dictionary
It depends of course a lot on your input. But it is a possibility to build a (very) compact representation this way.

Resources