If I have a 256-bit array (a selector), how do I select 5 elements from an array of 54 elements using that selector? It's acceptable to use only the first K bits of the selector rather than all 256.
The requirements are:
The same selector must always lead to the same 5 elements being picked.
It must be statistically fair: if I run every possible bit pattern of the selector, each of the 54 elements should appear in the picked sets of 5 with an even spread of occurrences.
I know that there are 3,162,510 combinations (54 choose 5) of 5 elements that can be selected from an array of 54, ignoring the order of selection.
You need to pick one of 54, then one of 53, ... then one of 50. Take the random bits 6 at a time as numbers in 1..64, and simply discard any draw that is too big (more than 54, 53, or however many elements remain). On average you'll need six or seven draws in total to get your 5 random numbers. You have 42 six-bit groups available in 256 bits, so there's essentially no chance you'll run out, and your distribution will be perfectly uniform.
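A minimal sketch of this rejection-sampling scheme in Java; the 0-based indexing, the `byte[]` layout of the selector, and the method names are my own choices, not from the answer:

```java
import java.util.ArrayList;
import java.util.List;

public class Selector {
    // Picks 5 distinct indices from {0..53} using the bits of a 256-bit
    // selector, 6 bits at a time, discarding out-of-range draws.
    // `selector` is assumed to be 32 bytes (256 bits).
    public static List<Integer> pick5of54(byte[] selector) {
        List<Integer> remaining = new ArrayList<>();
        for (int i = 0; i < 54; i++) remaining.add(i);
        List<Integer> picked = new ArrayList<>();
        int bitPos = 0;
        while (picked.size() < 5 && bitPos + 6 <= selector.length * 8) {
            int v = 0;
            for (int b = 0; b < 6; b++) {          // read the next 6 bits as a number 0..63
                int p = bitPos + b;
                int bit = (selector[p / 8] >> (p % 8)) & 1;
                v = (v << 1) | bit;
            }
            bitPos += 6;
            if (v < remaining.size()) {            // otherwise discard this draw and retry
                picked.add(remaining.remove(v));
            }
        }
        return picked;
    }
}
```

Because the draws depend only on the selector bits, the same selector always yields the same 5 elements, satisfying the determinism requirement.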
Well, if 2^K is larger than the number of 5-element combinations, you can use K bits to select the 5 elements. You won't get an entirely uniform distribution, though, because no power of 2 is divisible by that combination count.
I am trying to understand the danger of using an unstable sorting algorithm (like quicksort) inside radix sort.
Also, is a stable algorithm a must in both cases (i.e., MSD radix sort and LSD radix sort)?
Thanks in advance.
MSD radix sort is usually not practical, as the virtual bins cannot be concatenated after each pass. If sorting by 8-bit bytes, after the first pass you have 256 separate bins; after two passes, 65,536 bins; after three passes, 16,777,216 bins; and so on.
Update - one exception to this is doing just one MSD pass to split a large array into 256 (or 512 or 1024 or ...) bins, with the goal that each bin will fit in cache. This assumes a somewhat uniform distribution, so that the bins are similar in size. After the initial pass, each bin is then sorted using LSD passes, which can be done with multiple threads (with 4 cores, LSD sort 4 bins at a time using 4 threads), since there are no collision issues between the bins.
LSD radix sort needs to be stable, since the virtual bins are concatenated in order and the following passes on the more significant "digits" need to retain the order established by the prior passes. Note that LSD radix sort is how the old card sorters dating back to the early 1900's operated.
http://en.wikipedia.org/wiki/IBM_card_sorter#Earlier_sorters
It's worth giving two minutes to the history first.
Radix sort is the algorithm used by the card-sorting machines you now find only in
computer museums. The cards have 80 columns, and in each column, a machine can
punch a hole in one of 12 places. The sorter can be mechanically “programmed”
to examine a given column of each card in a deck and distribute the card into one
of 12 bins depending on which place has been punched. An operator can then
gather the cards bin by bin, so that cards with the first place punched are on top of
cards with the second place punched, and so on.
For decimal digits, each column uses only 10 places. (The other two places
are reserved for encoding numeric characters.) A d-digit number would then
occupy a field of d columns. Since the card sorter can look at only one column
at a time, the problem of sorting n cards on a d-digit number requires a sorting
algorithm.
Intuitively, you might sort numbers on their most significant digit, sort each of
the resulting bins recursively, and then combine the decks in order. Unfortunately,
since the cards in 9 of the 10 bins must be put aside to sort each of the bins, this
procedure generates many intermediate piles of cards that you would have to keep
track of. (See Exercise 8.3-5.)
Radix sort solves the problem of card sorting—counterintuitively—by sorting on
the least significant digit first. The algorithm then combines the cards into a single
deck, with the cards in the 0 bin preceding the cards in the 1 bin preceding the
cards in the 2 bin, and so on. Then it sorts the entire deck again on the second-least
significant digit and recombines the deck in a like manner. The process continues
until the cards have been sorted on all d digits. Remarkably, at that point, the cards
are fully sorted on the d-digit number. Thus, only d passes through the deck are
required to sort. Figure 8.3 shows how radix sort operates on a “deck” of seven
3-digit numbers.
In order for radix sort to work correctly, the digit sorts must be stable. The sort
performed by a card sorter is stable, but the operator has to be wary about not
changing the order of the cards as they come out of a bin, even though all the cards
in a bin have the same digit in the chosen column.
-by CLRS
From the passage above, you can see why MSD radix sort is often not practical.
As for why a stable digit-sorting algorithm is needed, let's work through an example.
Assume a list to be sorted:
21, 52, 35, 76, 49, 55, 51, 34, 31, 39
Sort the numbers by the digit in the ones place:
(21, 51, 31) (52) (34) (35, 55) (76) (49, 39) <---- this is what we get when we use a stable sort on the ones digit.
But if we use an unstable sorting algorithm on the ones digit, the values within each parenthesis can be interchanged with each other, for example:
(31, 51, 21) (52) (34) (35, 55) (76) (49, 39) <----- this order will not affect the final result
Now let's sort with respect to the digit in the tens place:
(21) (31, 34, 35, 39) (49) (51, 52, 55) (76) <---- this is the (final) output if we use a stable sort for the digit sorting.
If the digit sort is not stable, the output may not be in sorted order, like this:
(21) (39, 35, 34, 31) (49) (52, 51, 55) (76)
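The two stable passes above can be sketched as an LSD radix sort in Java, using one counting sort per decimal digit; the backward scan in the placement loop is what makes each pass stable:

```java
import java.util.Arrays;

public class LsdRadixSort {
    // LSD radix sort on non-negative ints, one decimal digit per pass.
    // Each pass is a counting sort, which is stable: ties keep the
    // order established by the earlier (less significant) passes.
    public static void sort(int[] a) {
        int max = Arrays.stream(a).max().orElse(0);
        int[] out = new int[a.length];
        for (int exp = 1; max / exp > 0; exp *= 10) {
            int[] count = new int[10];
            for (int v : a) count[(v / exp) % 10]++;          // histogram of this digit
            for (int d = 1; d < 10; d++) count[d] += count[d - 1];  // prefix sums
            for (int i = a.length - 1; i >= 0; i--) {          // backward scan keeps stability
                out[--count[(a[i] / exp) % 10]] = a[i];
            }
            System.arraycopy(out, 0, a, 0, a.length);
        }
    }
}
```

Running it on the example list 21, 52, 35, 76, 49, 55, 51, 34, 31, 39 reproduces the stable-sort trace shown above.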
How can I form combinations of, say, 10 questions so that each student (10 students in total) gets a unique combination?
I don't want to use factorials.
You can use a circular queue data structure. Put the numbers 1 through 10 into it; cutting the queue at any point and iterating from there gives you a unique ordering.
For example, if you cut at the point between 2 and 3 and then iterate the queue, you get:
3, 4, 5, 6, 7, 8, 9, 10, 1, 2
So you need to implement a circular queue and cut it at 10 different points (after 1, after 2 as shown above, after 3, ...), giving each student a distinct ordering.
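A sketch of the idea, using plain lists rather than an actual queue class (the class and method names are my own): cutting the circular queue after position `cut` is the same as rotating the sequence by `cut` positions.

```java
import java.util.ArrayList;
import java.util.List;

public class Rotations {
    // Generates the n rotations of 1..n; each "cut point" of the
    // circular queue gives one student a distinct ordering.
    public static List<List<Integer>> allRotations(int n) {
        List<List<Integer>> result = new ArrayList<>();
        for (int cut = 0; cut < n; cut++) {
            List<Integer> order = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                order.add((cut + i) % n + 1);  // wrap around past n back to 1
            }
            result.add(order);
        }
        return result;
    }
}
```

For n = 10 this yields exactly 10 distinct orderings, one per cut point.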
There are 3,628,800 different permutations of 10 items taken 10 at a time.
If you only need 10 of them, you could start with an array containing the values 1-10. Shuffle the array; that becomes your first permutation. Shuffle again and check that you haven't already generated that permutation. Repeat the process (shuffle, check, save) until you have 10 unique permutations.
It's highly unlikely (although possible) that you'll generate a duplicate permutation in only 10 tries.
The likelihood that you generate a duplicate increases as you generate more permutations, increasing to 50% by the time you've generated about 2,000. But if you just want a few hundred or less, then this method will do it for you pretty quickly.
The circular queue technique proposed above works too, and has the benefit of simplicity, but the resulting sequences are simply rotations of the original order, and it can't produce more than 10 of them without a shuffle. The technique I suggest produces more random-looking orderings.
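The shuffle-and-check method might look like this in Java; `Collections.shuffle` performs a Fisher-Yates shuffle, and the `Set` rejects any duplicate permutation (the class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class UniquePermutations {
    // Shuffle-and-check: shuffle 1..n, keep the permutation only if it
    // hasn't been seen before; repeat until `count` unique ones exist.
    public static List<List<Integer>> generate(int n, int count, Random rng) {
        List<Integer> base = new ArrayList<>();
        for (int i = 1; i <= n; i++) base.add(i);
        Set<List<Integer>> seen = new LinkedHashSet<>();
        while (seen.size() < count) {
            Collections.shuffle(base, rng);
            seen.add(new ArrayList<>(base));  // duplicates are silently rejected by the set
        }
        return new ArrayList<>(seen);
    }
}
```

With 10 permutations wanted out of 3,628,800 possible, the duplicate-rejection loop almost never has to retry.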
I want to pick the top "range" of cards based upon a percentage. I have all my possible 2 card hands organized in an array in order of the strength of the hand, like so:
AA, KK, AKsuited, QQ, AKoff-suit ...
I had been picking the top 10% of hands by multiplying the length of the card array by the percentage which would give me the index of the last card in the array. Then I would just make a copy of the sub-array:
Arrays.copyOfRange(cardArray, 0, 16);
However, I realize now that this is incorrect because there are more possible combinations of, say, Ace King off-suit - 12 combinations (i.e. an ace of one suit and a king of another suit) than there are combinations of, say, a pair of aces - 6 combinations.
When I pick the top 10% of hands therefore I want it to be based on the top 10% of hands in proportion to the total number of 2 cards combinations - 52 choose 2 = 1326.
I thought I could have an array of integers where each index held the combined total of all the combinations up to that point (each index would correspond to a hand from the original array). So the first few indices of the array would be:
6, 12, 16, 22
because there are 6 combinations of AA, 6 combinations of KK, 4 combinations of AKsuited, 6 combinations of QQ.
Then I could do a binary search, which runs in O(log n) time. In other words, I could multiply the total number of combinations (1326) by the percentage, then search for the first index whose cumulative count is greater than or equal to that number; that index in the original array is the one I need.
I wonder if there is a way that I could do this in constant time instead?
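The cumulative-count plus binary-search idea can be sketched as follows; the per-hand combination counts are the illustrative ones from the question (6 for AA, 6 for KK, 4 for AKsuited, 6 for QQ), and the class and method names are my own:

```java
import java.util.Arrays;

public class HandRange {
    // cumulative[i] = total number of 2-card combos covered by hands 0..i,
    // where combosPerHand is in order of hand strength.
    public static int[] cumulative(int[] combosPerHand) {
        int[] cum = new int[combosPerHand.length];
        int total = 0;
        for (int i = 0; i < combosPerHand.length; i++) {
            total += combosPerHand[i];
            cum[i] = total;
        }
        return cum;
    }

    // Index of the last hand inside the top `percent` of all 1326 combos.
    // Assumes percent is small enough that the target stays within the table.
    public static int lastHandIndex(int[] cum, double percent) {
        int target = (int) (1326 * percent);
        int idx = Arrays.binarySearch(cum, target);
        if (idx < 0) idx = -idx - 1;   // insertion point = first cumulative count >= target
        return idx;
    }
}
```

`Arrays.binarySearch` returns `-(insertionPoint) - 1` when the target is absent, which is exactly the "first cumulative count at or above the target" lookup needed here.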
As Groo suggested, if precomputation and memory overhead permits, it would be more efficient to create 6 copies of AA, 6 copies of KK, etc and store them into a sorted array. Then you could run your original algorithm on this properly weighted list.
This is best if the number of queries is large.
Otherwise, I don't think you can achieve constant time for each query. This is because the queries depend on the entire frequency distribution: you can't look at only a constant number of elements and determine the correct percentile.
We had a similar discussion here: Algorithm for picking thumbed-up items. As a comment to my answer (which is basically what you want to do with your list of cards), someone suggested a particular data structure, the Fenwick tree: http://en.wikipedia.org/wiki/Fenwick_tree
Also, make sure your data structure will be able to provide efficient access to, say, the range between top 5% and 15% (not a coding-related tip though ;).
There is a question for which I have the solution, but I couldn't understand the solution. Kindly help with some examples and share some insight.
Question
Given a file containing roughly 300 million social security numbers (9-digit numbers), find a 9-digit number that is not in the file. You have unlimited drive space but only 2MB of RAM at your disposal.
Answer
In the first step, we build an array of 2^16 integers initialized to 0, and for every number in the file, we take its 16 most significant bits to index into this array and increment that counter.
Since there are fewer than 2^32 numbers in the file, there is bound to be (at least) one entry in the array that is less than 2^16 (if every entry were 2^16 or more, the file would hold at least 2^32 numbers). This tells us that at least one number is missing among the possible numbers with those upper bits.
In the second pass, we can focus only on the numbers matching this criterion and use a bit vector of size 2^16 to identify one of the missing numbers.
To make the explanation simpler, let's say you have a list of two-digit numbers where each digit is between 0 and 3, but you can't spare the 16 bits needed to remember, for each of the 16 possible numbers, whether you have already encountered it. What you do instead is create an array a of four 3-bit integers, where a[i] stores how many numbers with first digit i you have encountered. (Two-bit integers wouldn't be enough, because you need every value from 0 to 4.)
If you had the file
00, 12, 03, 31, 01, 32, 02
your array would look like this:
4, 1, 0, 2
Now you know that all numbers starting with 0 are in the file, but for each of the remaining first digits, at least one number is missing. Let's pick 1. We know there is at least one number starting with 1 that is not in the file. So, create an array of 4 bits, set the appropriate bit for each number starting with 1, and at the end pick one of the bits that wasn't set; in our example it could be 0. Now we have the solution: 10.
In this case, using this method is the difference between 12 bits (four 3-bit counters) and 16 bits (one bit per possible number). With your numbers, it's the difference between roughly 256 kB (2^16 four-byte counters) and 119 MB (one bit for each of the 10^9 possible SSNs).
In round terms, you have about 1/3 of the numbers that could exist in the file, assuming no duplicates.
The idea is to make two passes through the data. Treat each number as a 32-bit (unsigned) value. In the first pass, keep track of how many numbers share each value of the most significant 16 bits. In practice, many of these buckets will have a count of zero (all those above 999,999,999, for example, since SSNs have only 9 digits; quite likely all those with a zero first digit are missing too). Of the buckets with a non-zero count, most will not hold the 65,536 entries they would hold if there were no gaps in their range. So, with a bit of care, you can choose one of the ranges to concentrate on in the second pass.
If you're lucky, you can find a range in the 100,000,000..999,999,999 with zero entries - you can choose any number from that range as missing.
Assuming you aren't quite that lucky, choose the range with the lowest count (or any range with fewer than 65,536 entries); call it the target range. Reset the array to all zeroes and reread the data. If a number you read is not in your target range, ignore it. If it is in the range, record it by setting the array entry for the low-order 16 bits of the number to 1. When you've read the whole file, any index with a zero in the array represents a missing SSN in that range.
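A sketch of this two-pass approach in Java, assuming for illustration that the data fits in an `int[]` (a real solution would stream the file twice; the class and method names are my own):

```java
public class MissingNumber {
    // Two-pass search for a 32-bit value absent from `data`;
    // valid whenever data holds fewer than 2^32 distinct values.
    public static long findMissing(int[] data) {
        int[] counts = new int[1 << 16];
        for (int v : data) counts[v >>> 16]++;         // pass 1: bucket by high 16 bits

        int bucket = 0;
        while (counts[bucket] == (1 << 16)) bucket++;  // pigeonhole: some bucket is short

        boolean[] seen = new boolean[1 << 16];
        for (int v : data) {                            // pass 2: only the target range
            if ((v >>> 16) == bucket) seen[v & 0xFFFF] = true;
        }
        for (int low = 0; low < (1 << 16); low++) {
            if (!seen[low]) return ((long) bucket << 16) | low;
        }
        throw new IllegalStateException("unreachable: the chosen bucket had a gap");
    }
}
```

The `counts` array is the only sizable allocation (2^16 four-byte counters), which is what keeps the method inside a small RAM budget.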
This article says:
Every prime number can be expressed as 30k±1, 30k±7, 30k±11, or 30k±13 for some k. That means we can use eight bits per thirty numbers to store all the primes; a million primes can be compressed to 33,334 bytes.
"That means we can use eight bits per thirty numbers to store all the primes"
This "eight bits per thirty numbers" would be for k, correct? But each k value will not necessarily take up just one bit. Shouldn't it be eight k values instead?
"a million primes can be compressed to 33,334 bytes"
I am not sure how this is true.
We need to indicate two things:
VALUE of k (can be arbitrarily large)
STATE from one of the eight states (-13,-11,-7,-1,1,7,11,13)
I am not following how "33,334 bytes" was arrived at, but I can say one thing: as the prime numbers become larger and larger in value, we will need more space to store the value of k.
How, then can we fix it at "33,334 bytes"?
The article is a bit misleading: we can't store 1 million primes, but we can store all primes below 1 million.
The value of k comes from the position in the list. We only need 1 bit for each of those 8 residues (-13, -11, ..., 11, 13).
In other words, we'll use 8 bits to store for k=0, 8 to store for k=1, 8 to store for k=2, etc. By letting these follow sequentially, we don't need to specify the value of k for each 8 bits - it's simply the value for the previous 8 bits + 1.
Since 1,000,000 / 30 = 33,333 1/3, we can store 33,334 of these 8 bit sequences to represent which values below 1 million are prime, since we cover all of the values k can have without 30k-13 exceeding the limit of 1 million.
You don't need to store each value of k. If you want to store the prime numbers below 1 million, use 33,334 bytes - the first byte corresponds to k=0, the second to k=1 etc. Then, in each byte, use 1 bit to indicate "prime" or "composite" for 30k+1, 30k+7 etc.
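A sketch of this packing scheme in Java, offered as an illustration: a plain sieve builds the table, which indeed comes out to 33,334 bytes for the primes below one million (2, 3, and 5 don't fit the 30k±r form and are special-cased; the class and method names are my own):

```java
public class PrimePack {
    // The residues mod 30 that can be prime: 30k±1, ±7, ±11, ±13.
    static final int[] RESIDUES = {1, 7, 11, 13, 17, 19, 23, 29};

    // Simple Sieve of Eratosthenes up to `limit` (exclusive).
    static boolean[] sieve(int limit) {
        boolean[] composite = new boolean[limit];
        for (int p = 2; (long) p * p < limit; p++) {
            if (!composite[p]) {
                for (int m = p * p; m < limit; m += p) composite[m] = true;
            }
        }
        return composite;
    }

    // Pack primality of all numbers below `limit`: one byte per 30 numbers,
    // one bit per residue; byte k covers 30k+1 .. 30k+29.
    public static byte[] pack(int limit) {
        boolean[] composite = sieve(limit);
        byte[] table = new byte[(limit + 29) / 30];
        for (int k = 0; k < table.length; k++) {
            for (int bit = 0; bit < 8; bit++) {
                int n = 30 * k + RESIDUES[bit];
                if (n > 2 && n < limit && !composite[n]) {
                    table[k] |= (byte) (1 << bit);
                }
            }
        }
        return table;
    }

    // Look up n in the packed table; 2, 3, and 5 are handled separately.
    public static boolean isPrime(byte[] table, int n) {
        if (n == 2 || n == 3 || n == 5) return true;
        int r = n % 30;
        for (int bit = 0; bit < 8; bit++) {
            if (RESIDUES[bit] == r) {
                return (table[n / 30] >> bit & 1) != 0;
            }
        }
        return false;  // n shares a factor with 30, so it cannot be prime
    }
}
```

Calling `pack(1000000)` yields a table of exactly (1,000,000 + 29) / 30 = 33,334 bytes.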
It's a bitmask: one bit for each of the 8 values out of every 30 that might be prime, so 8 bits per 30 numbers. To tabulate all primes up to 10^6, you thus need 8 * 10^6 / 30 ≈ 266,667 bits ≈ 33,334 bytes.
To explain why this is a good way to go, you need to look at the obvious alternatives.
A more naive way to go would just be to use a bitmask. You need a million bits, 125000 bytes.
You could also store the values of the primes themselves. Up to 1000000, the values fit in 20 bits, and there are 78498 primes, so this gives a disappointing 1569960 bits (196245 bytes).
Another way to go (though less useful for looking up primes) is to store the differences between each prime and the next. Under a million, each gap fits in 6 bits (as long as you remember that the primes are all odd at that point, so you only need to store even differences and can thus throw away the lowest bit), for 78,498 * 6 = 470,988 bits = 58,874 bytes. (You could shave off another bit by counting how many mod-30 slots you had to jump.)
Now, there's nothing particularly special about 30 except that 30 = 2*3*5, so this lookup is actually walking you up through a bitmask representation of the Sieve of Eratosthenes pattern just after you've gotten started. You could instead use 2*3*5*7 = 210, and then you'd have to consider +- 1, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, for 48 values. If you were doing this with 7 blocks of 30, you'd need 7*8 = 56 bits, so this is a slight improvement, but ugh... hardly worth the hassle.
So this is one of the better tricks out there for compactly storing reasonably small prime numbers.
(P.S. It's interesting to note that if primes appeared randomly (but with the same count below 1,000,000 as actually appear), the amount of information carried by the primality of a number between 1 and 10^6 would be ~0.397 bits per number. Thus, under naive information-theoretic assumptions, you'd think the best you could possibly do to record which numbers below a million are prime is 1,000,000 * 0.397 bits, or about 49,609 bytes.)
As another perspective on this, the first 23,163,298 primes can be considered nicely compressible: that is the largest prefix of the primes in which every gap is <= 255, i.e. fits into a single byte.
I used this fact here to reduce the memory footprint of a primes cache by a factor of 8: instead of storing each prime as an 8-byte number, I cache only the gaps between consecutive primes, using just 1 byte per prime.