Sort N numbers in digit order - algorithm

Given a number range, e.g. [1 to 100], sort the numbers in digit order, i.e. for the numbers 1 to 100 the sorted output would be
1 10 100 11 12 13 ... 19 2 20 21 ... 99
This is just like Radix Sort, except that the digits are sorted in reverse order compared to a normal Radix Sort.
I tried to store all the digits of each number as a linked list for faster operation, but that results in a large space complexity.
I need a working algorithm for the question.
From all the answers, "converting to strings" is an option, but is there no other way this can be done?
An algorithm for sorting the strings, as mentioned above, could also be given.

Use any sorting algorithm you like, but compare the numbers as strings, not as numbers. This is basically lexicographic sorting of regular numbers. Here's an example gnome sort in C:
#include <stdlib.h>
#include <string.h>

void sort(int *array, int length) {
    int i = 1;
    char buf1[12], buf2[12]; /* enough for a 32-bit int, sign and NUL */
    while (i < length) {
        if (i == 0 || strcmp(itoa(array[i - 1], buf1, 10),
                             itoa(array[i], buf2, 10)) <= 0) {
            i++;
        } else {
            /* out of order: swap the pair and step back */
            int tmp = array[i];
            array[i] = array[i - 1];
            array[i - 1] = tmp;
            i--;
        }
    }
}
Of course, this requires the non-standard itoa function to be present in stdlib.h. A more standard alternative would be to use sprintf, but that makes the code a little more cluttered. You'd possibly be better off converting the whole array to strings first, then sort, then convert it back.
Edit: For reference, the relevant bit here is strcmp(itoa(array[i - 1], buf1, 10), itoa(array[i], buf2, 10)) <= 0, which replaces array[i - 1] <= array[i].

I have a solution, though not exactly an algorithm: all you need to do is convert the numbers to strings and sort them as strings.

Here is how you can do it with a recursive function (the code is in Java):
void doOperation(List<Integer> list, int prefix, int minimum, int maximum) {
    for (int i = 0; i <= 9; i++) {
        int newNumber = prefix * 10 + i;
        if (newNumber >= minimum && newNumber <= maximum) {
            list.add(newNumber);
        }
        if (newNumber > 0 && newNumber <= maximum) {
            doOperation(list, newNumber, minimum, maximum);
        }
    }
}
You call it like this:
List<Integer> numberList = new ArrayList<Integer>();
int min = 1, max = 100;
doOperation(numberList, 0, min, max);
System.out.println(numberList.toString());
EDIT:
I translated my code to C++ here:
#include <stdio.h>
void doOperation(int list[], int &index, int prefix, int minimum, int maximum) {
    for (int i = 0; i <= 9; i++) {
        int newNumber = prefix * 10 + i;
        if (newNumber >= minimum && newNumber <= maximum) {
            list[index++] = newNumber;
        }
        if (newNumber > 0 && newNumber <= maximum) {
            doOperation(list, index, newNumber, minimum, maximum);
        }
    }
}

int main(void) {
    int min = 1, max = 100;
    int *numberList = new int[max - min + 1];
    int index = 0;
    doOperation(numberList, index, 0, min, max);
    printf("[");
    for (int i = 0; i < max - min + 1; i++) {
        printf("%d ", numberList[i]);
    }
    printf("]");
    delete[] numberList;
    return 0;
}
Basically, the idea is: for each digit (0-9), I add it to the array if it is between minimum and maximum. Then, I call the same function with this digit as prefix. It does the same: for each digit, it adds it to the prefix (prefix * 10 + i) and if it is between the limits, it adds it to the array. It stops when newNumber is greater than maximum.

I think if you convert the numbers to strings, you can use string comparison to sort them; any sorting algorithm will do:
"1" < "10" < "100" < "11" ...

Optimize the way you are storing the numbers: use a binary-coded decimal (BCD) type that gives simple access to a specific digit. Then you can use your current algorithm, which Steve Jessop correctly identified as most significant digit radix sort.
"I tried to store all the digits in each number as a linked list for faster operation but it results in a large Space Complexity."
Storing each digit in a linked list wastes space in two different ways:
A digit (0-9) only requires 4 bits of memory to store, but you are probably using anywhere from 8 to 64 bits: a char takes 8 bits, a short 16, and an int commonly 32 (up to 64). That's using 2X to 16X more memory than the optimal solution!
Linked lists add additional unneeded memory overhead. For each digit, you need an additional 32 to 64 bits to store the memory address of the next link. Again, this increases the memory required per digit by 8X to 16X.
A more memory-efficient solution stores BCD digits contiguously in memory:
BCD only uses 4 bits per digit.
Store the digits in a contiguous memory block, like an array. This eliminates the need to store memory addresses. You don't need linked lists' ability to easily insert/delete from the middle. If you need the ability to grow the numbers to an unknown length, there are other abstract data types that allow that with much less overhead. For example, a vector.
One option, if other operations like addition/multiplication are not important, is to allocate enough memory to store each BCD digit plus one BCD terminator. The BCD terminator can be any combination of 4 bits that is not used to represent a BCD digit (like binary 1111). Storing this way will make other operations like addition and multiplication trickier, though.
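To make the layout concrete, here is a rough sketch in Python (hypothetical helper names, not a ready-made BCD type): two digits are packed per byte, and any digit can be pulled out with a shift and a mask.
def to_bcd(n):
    digits = [int(d) for d in str(n)]
    if len(digits) % 2:
        digits.append(0xF)  # terminator nibble, as suggested above
    return bytearray((hi << 4) | lo for hi, lo in zip(digits[::2], digits[1::2]))

def bcd_digit(packed, i):
    # i-th decimal digit, 0-based, most significant first
    byte = packed[i // 2]
    return (byte >> 4) if i % 2 == 0 else (byte & 0x0F)

print(list(to_bcd(1907)))          # [25, 7] == [0x19, 0x07]
print(bcd_digit(to_bcd(1907), 2))  # 0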
Note this is very similar to the idea of converting to strings and lexicographically sorting those strings. Integers are internally stored as binary (base 2) in the computer. Storing in BCD is more like base 10 (base 16, actually, but 6 combinations are ignored), and strings are like base 256. Strings will use about twice as much memory, but there are already efficient functions written to sort strings. BCD's will probably require developing a custom BCD type for your needs.

Edit: I missed that it's a contiguous range. That being the case, all the answers which talk about sorting an array are wrong (including your idea stated in the question that it's like a radix sort), and True Soft's answer is right.
"just like Radix Sort, except that the digits are sorted in reverse order"
Well spotted :-) If you actually do it that way, funnily enough, it's called an MSD radix sort.
http://en.wikipedia.org/wiki/Radix_sort#Most_significant_digit_radix_sorts
You can implement one very simply, or with a lot of high technology and fanfare. In most programming languages, your particular example faces a slight difficulty. Extracting decimal digits from the natural storage format of an integer, isn't an especially fast operation. You can ignore this and see how long it ends up taking (recommended), or you can add yet more fanfare by converting all the numbers to decimal strings before sorting.
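To make "very simply" concrete, here is a minimal MSD radix sort over decimal strings (my sketch, not part of the original answer): bucket by the d-th character, let strings that have run out of characters sort first, and recurse on each bucket.
def msd_radix_sort(strs, d=0):
    if len(strs) <= 1:
        return strs
    done = [s for s in strs if len(s) == d]  # exhausted strings are prefixes: they sort first
    buckets = [[] for _ in range(10)]
    for s in strs:
        if len(s) > d:
            buckets[ord(s[d]) - ord('0')].append(s)
    for b in buckets:
        done.extend(msd_radix_sort(b, d + 1))
    return done

print(msd_radix_sort([str(i) for i in range(1, 101)]))  # ['1', '10', '100', '11', ...]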
Of course you don't have to implement it as a radix sort: you could use a comparison sort algorithm with an appropriate comparator. For example in C, the following is suitable for use with qsort (unless I've messed it up):
int lex_compare(const void *a, const void *b) {
    char a_str[12]; // assuming 32-bit int
    char b_str[12];
    sprintf(a_str, "%d", *(const int *)a);
    sprintf(b_str, "%d", *(const int *)b);
    return strcmp(a_str, b_str);
}
Not terribly efficient, since it does a lot of repeated work, but straightforward.

If you do not want to convert them to strings, but have enough space to store an extra copy of the list, I would store, for each element, the largest power of ten less than or equal to it. This is probably easiest to do with a loop. Below, the original array is called x and the array of powers of ten is called y.
int findPower(int x) {
    int y = 1;
    while (y * 10 <= x) {
        y = y * 10;
    }
    return y;
}
You could also compute them directly
y = exp10(floor(log10(x)));
but I suspect that the iteration may be faster than the conversions to and from floating point.
In order to compare the ith and jth elements
bool compare(int i, int j) {
    if (y[i] < y[j]) {
        int ti = x[i] * (y[j] / y[i]);
        if (ti == x[j]) {
            return (y[i] < y[j]); // the compiler will optimize this
        } else {
            return (ti < x[j]);
        }
    } else if (y[i] > y[j]) {
        int tj = x[j] * (y[i] / y[j]);
        if (x[i] == tj) {
            return (y[i] < y[j]); // the compiler will optimize this
        } else {
            return (x[i] < tj);
        }
    } else {
        return (x[i] < x[j]);
    }
}
What is being done here: we multiply the smaller number by the appropriate power of ten to make the two numbers have an equal number of digits, then compare them. If the two adjusted numbers are equal, the shorter one is a prefix of the longer, so the comparison falls back to the digit lengths.
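To see the trick on a single pair, here is a standalone Python check of the same logic (my sketch, with the powers computed inline rather than stored in y):
def lex_less(a, b):
    ya = yb = 1
    while ya * 10 <= a: ya *= 10   # largest power of ten <= a
    while yb * 10 <= b: yb *= 10
    if ya < yb:
        t = a * (yb // ya)         # pad a with zeros to b's length
        return True if t == b else t < b   # a prefix sorts first
    if ya > yb:
        t = b * (ya // yb)
        return False if t == a else a < t
    return a < b

print(lex_less(12, 115), lex_less(12, 120))  # False True  ("115" < "12" < "120")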
If you do not have the space to store the y arrays you can compute them on each comparison.
In general, you are likely better off using the preoptimized digit conversion routines.


Integer division without using the / or * operator

I am going through an algorithms and data structures textbook and came across this question:
1-28. Write a function to perform integer division without using
either the / or * operators. Find a fast way to do it.
How can we come up with a fast way to do it?
I like this solution: https://stackoverflow.com/a/34506599/1008519, but I find it somewhat hard to reason about (especially the |-part). This solution makes a little more sense in my head:
var divide = function (dividend, divisor) {
    // Handle 0 divisor
    if (divisor === 0) {
        return NaN;
    }
    // Handle negative numbers
    var isNegative = false;
    if (dividend < 0) {
        // Change sign
        dividend = ~dividend + 1;
        isNegative = !isNegative;
    }
    if (divisor < 0) {
        // Change sign
        divisor = ~divisor + 1;
        isNegative = !isNegative;
    }
    /**
     * Main algorithm
     */
    var result = 1;
    var denominator = divisor;
    // Double denominator value with bitwise shift until bigger than dividend
    while (dividend > denominator) {
        denominator <<= 1;
        result <<= 1;
    }
    // Subtract divisor value until denominator is no longer bigger than dividend
    while (denominator > dividend) {
        denominator -= divisor;
        result -= 1;
    }
    // If exactly one of dividend or divisor was negative, change sign of result
    if (isNegative) {
        result = ~result + 1;
    }
    return result;
}
We initialize our result to 1 (since we are going to double our denominator until it is bigger than the dividend)
Double our denominator (with bitwise shifts) until it is bigger than the dividend
Since we know our denominator is bigger than our dividend, we can subtract the divisor until the denominator is no longer bigger than the dividend
Return the result, since the denominator is now as close to the dividend as possible using the divisor
Here are some test runs:
console.log(divide(-16, 3)); // -5
console.log(divide(16, 3)); // 5
console.log(divide(16, 33)); // 0
console.log(divide(16, 0)); // NaN
console.log(divide(384, 15)); // 25
Here is a gist of the solution: https://gist.github.com/mlunoe/e34f14cff4d5c57dd90a5626266c4130
Typically, when an algorithms textbook says fast it means in terms of computational complexity: the number of operations per bit of input. In general, textbooks don't care about constants, so if you have an input of n bits, whether it takes two operations per bit or a hundred operations per bit, we say the algorithm takes O(n) time.
To see why, suppose we have an algorithm that runs in O(n^2) time (polynomial; in this case, quadratic), and imagine an O(n) algorithm that does 100 operations per bit compared to our algorithm's 1 operation per bit. Once the input size passes 100 bits, the quadratic algorithm starts to run really slow really quickly compared to the other one. Essentially, picture the two curves y = 100x and y = x^2. Your teacher probably made you do an exercise in algebra (maybe it was calculus?) where you have to say which one is bigger as x approaches infinity; this is actually a key concept in divergence/convergence. With a little algebra you can see that the curves intersect at x = 100, with y = x^2 larger for all x greater than 100.
As far as most textbooks are concerned, O(nlgn) or better is considered "fast". One example of a really bad algorithm to solve this problem would be the following:
crappyMultiplicationAlg(int a, int b)
    int product = 0
    while (b > 0)
        product = product + a
        b = b - 1
    return product
This algorithm basically uses "b" as a counter and just keeps adding "a" to some variable as b counts down. To calculate how "fast" the algorithm is (in terms of algorithmic complexity) we count how many times its different components run. In this case, we only have a loop and some initialization (which is negligible here; ignore it). How many times does the loop run? You may be saying "Hey, guy! It only runs 'b' times! That may not even be half the input. That's way better than O(n) time!"
The trick here is that we are concerned with the size of the input in terms of storage, and we all (should) know that to store a number n, we only need about lg n bits. In other words, if we have x bits, we can store any (unsigned) number up to (2^x)-1. As a result, if we are using a standard 4-byte integer, that number could be up to 2^32 - 1, which is a number well into the billions, if my memory serves me right. If you don't trust me, run this algorithm with a number like 10,000,000 and see how long it takes. Still not convinced? Use a long to use a number like 1,000,000,000.
Since you didn't ask for help with the algorithm, I'll leave it for you as a homework exercise (not trying to be a jerk, I am a total geek and love algorithm problems). If you need help with it, feel free to ask! I already typed up some hints by accident since I didn't read your question properly at first.
EDIT: I accidentally did a crappy multiplication algorithm. An example of a really terrible division algorithm (I cheated) would be:
AbsolutelyTerribleDivisionAlg(int a, int b)
    int quotient = 0
    while crappyMultiplicationAlg(b, quotient) < a
        quotient = quotient + 1
    return quotient
This algorithm is bad for a whole bunch of reasons, not the least of which is the use of my crappy multiplication algorithm (which will be called more than once even on a relatively "tame" run). Even if we were allowed to use the * operator though, this is still a really bad algorithm, largely due to the same mechanism used in my awful mult alg.
PS There may be a fence-post error or two in my two algs... i posted them more for conceptual clarity than correctness. No matter how accurate they are at doing multiplication or division, though, never use them. They will give your laptop herpes and then cause it to burn up in a sulfur-y implosion of sadness.
I don't know what you mean by fast... this seems like a basic question to test your thought process.
A simple function can use a counter and keep subtracting the divisor from the dividend till the dividend goes negative. This is an O(n) process.
int divide(int n, int d){
    int c = 0;
    while(1){
        n -= d;
        if(n >= 0)
            c++;
        else
            break;
    }
    return c;
}
Another way can be using shift operator, which should do it in log(n) steps.
int divide(int n, int d){
    if(d <= 0)
        return -1;
    int k = d;
    int c = 0, index = 1;
    while(n > d){
        d <<= 1;
        index <<= 1;
    }
    while(1){
        if(k > n)
            return c;
        if(n >= d){
            c |= index;
            n -= d;
        }
        index >>= 1;
        d >>= 1;
    }
}
This is just like integer division as we do in High-School Mathematics.
PS: If you need a better explanation, I will. Just post that in comments.
EDIT: edited the code wrt Erobrere's comment.
The simplest way to perform a division is by successive subtractions: subtract b from a as long as a remains positive. The quotient is the number of subtractions performed.
This can be pretty slow, as you will perform q subtractions and tests.
With a=28 and b=3,
28-3-3-3-3-3-3-3-3-3=1
the quotient is 9 and the remainder 1.
The next idea that comes to mind is to subtract several copies of b in a single go. We can try with 2b or 4b or 8b... as these numbers are easy to compute with additions. We can go as far as possible, as long as the multiple of b does not exceed a.
In the example, 2³·3 is the largest multiple which is possible:
28 ≥ 2³·3
So we subtract 8 times 3 in a single go, getting
28 - 2³·3 = 4
Now we continue to reduce the remainder with the lower multiples, 2², 2 and 1, when possible:
4 - 2²·3 < 0
4 - 2·3 < 0
4 - 1·3 = 1
Then our quotient is 2³ + 1 = 9 and the remainder is 1.
As you can check, every multiple of b is tried once only, and the total number of attempts equals the number of doublings required to reach a. This number is just the number of bits required to write q, which is much smaller than q itself.
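The same idea in code (a Python sketch of the description above, not from the original answer):
def divmod_doubling(a, b):
    # returns (quotient, remainder) for a >= 0, b > 0, using only +, -, and shifts
    q = 0
    while a >= b:
        multiple, power = b, 1
        while (multiple << 1) <= a:   # double while it still fits
            multiple <<= 1
            power <<= 1
        a -= multiple                 # subtract 2^k * b in one go
        q += power
    return q, a

print(divmod_doubling(28, 3))  # (9, 1)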
This is not the fastest solution, but I think it's readable enough and works:
def weird_div(dividend, divisor):
    if divisor == 0:
        return None
    dend = abs(dividend)
    dsor = abs(divisor)
    result = 0
    # This is the core algorithm, the rest is just for ensuring it works with negatives and 0
    while dend >= dsor:
        dend -= dsor
        result += 1
    # Let's handle negative numbers too
    if (dividend < 0 and divisor > 0) or (dividend > 0 and divisor < 0):
        return -result
    else:
        return result

# Let's test it:
print("49 divided by 7 is {}".format(weird_div(49,7)))
print("100 divided by 7 is {} (Discards the remainder) ".format(weird_div(100,7)))
print("-49 divided by 7 is {}".format(weird_div(-49,7)))
print("49 divided by -7 is {}".format(weird_div(49,-7)))
print("-49 divided by -7 is {}".format(weird_div(-49,-7)))
print("0 divided by 7 is {}".format(weird_div(0,7)))
print("49 divided by 0 is {}".format(weird_div(49,0)))
It prints the following results:
49 divided by 7 is 7
100 divided by 7 is 14 (Discards the remainder)
-49 divided by 7 is -7
49 divided by -7 is -7
-49 divided by -7 is 7
0 divided by 7 is 0
49 divided by 0 is None
unsigned bitdiv(unsigned a, unsigned d)
{
    unsigned res, c;
    /* find the first power-of-two multiple of d that exceeds a */
    for (c = d; c <= a; c <<= 1) {;}
    /* walk back down, accepting each multiple that still fits */
    for (res = 0; (c >>= 1) >= d; ) {
        res <<= 1;
        if (a >= c) { res++; a -= c; }
    }
    return res;
}
The pseudo code:
count = 0
while (dividend >= divisor)
dividend -= divisor
count++
//Get count, your answer

Is this a good Primality Checking Solution?

I have written this code to check if a number is prime (for numbers up to 10^9+7).
Is this a good method?
What will be the time complexity?
What I have done: I build an unordered_set which stores the prime numbers up to sqrt(n).
When checking whether a number is prime, I first check if it is less than the largest number in the table.
If it is less, it is looked up in the table, so the complexity should be O(1) in this case.
If it is more, the number is put through a divisibility test against the numbers in the set of primes.
#include<iostream>
#include<set>
#include<math.h>
#include<unordered_set>
#define sqrt10e9 31623
using namespace std;

unordered_set<long long> primeSet = { 2, 3 }; // used for fast lookups

void generate_prime_set(long range) // generates the primes up to sqrt(10^9+7)
{
    bool flag;
    set<long long> tempPrimeSet = { 2, 3 }; // a temporary ordered set is used for generation
    set<long long>::iterator j;
    for (int i = 3; i <= range; i = i + 2)
    {
        flag = true;
        for (j = tempPrimeSet.begin(); *j * *j <= i; ++j)
        {
            if (i % (*j) == 0)
            {
                flag = false;
                break;
            }
        }
        if (flag)
        {
            primeSet.insert(i);
            tempPrimeSet.insert(i);
        }
    }
}

bool is_prime(long long i, unordered_set<long long> primeSet)
{
    bool flag = true;
    if (i <= sqrt10e9) // the number can exist in the lookup table
        return primeSet.count(i);
    // otherwise iterate through the table
    for (unordered_set<long long>::iterator j = primeSet.begin(); j != primeSet.end(); ++j)
    {
        if (*j * *j <= i && i % (*j) == 0)
        {
            flag = false;
            break;
        }
    }
    return flag;
}

int main()
{
    generate_prime_set(sqrt10e9);
    cout << primeSet.size() << "\n";
    cout << is_prime(9999991, primeSet);
    return 0;
}
This doesn't strike me as a particularly efficient way to do the job at hand.
Although it probably won't make a big difference in the end, the efficient way to generate all the primes up to some specific limit is clearly to use a sieve--the sieve of Eratosthenes is simple and fast. There are a couple of modifications that can be faster, but for the small size you're dealing with, they're probably not worthwhile.
These normally produce their output in a more effective format than you're currently using as well. In particular, you typically just dedicate one bit to each possible prime (i.e., each odd number) and end up with it zeroed if the number is composite, and one if it's prime (you can, of course, reverse the sense if you prefer).
Since you only need one bit for each odd number from 3 to 31623, this requires only about 16 K bits, or about 2K bytes--a truly minuscule amount of memory by modern standards (especially: little enough to fit in L1 cache quite easily).
Since the bits are stored in order, it's also trivial to compute and test by the factors up to the square root of the number you're testing instead of testing against all the numbers in the table (including those greater than the square root of the number you're testing, which is obviously a waste of time). This also optimizes access to the memory in case some of it's not in the cache (i.e., you can access all the data in order, making life as easy as possible for the hardware prefetcher).
If you wanted to optimize further, I'd consider just using the sieve to find all primes up to 10^9+7, and look up inputs. Whether this is a win will depend (heavily) upon the number of queries you can expect to receive. A quick check shows that a simple implementation of the Sieve of Eratosthenes can find all primes up to 10^9 in about 17 seconds. After that, each query is (of course) essentially instantaneous (i.e., the cost of a single memory read). This does require around 120 megabytes of memory for the result of the sieve, which would once have been a major consideration, but (except on fairly limited systems) normally wouldn't be any more.
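For reference, the basic sieve looks like this in Python (my sketch, using one byte per number; the bit-per-odd-number packing described above shrinks it roughly 16x further):
def sieve(limit):
    flags = bytearray([1]) * (limit + 1)   # flags[i] == 1 means i is prime
    flags[0] = flags[1] = 0
    for p in range(2, int(limit ** 0.5) + 1):
        if flags[p]:
            flags[p * p :: p] = bytearray(len(flags[p * p :: p]))  # mark multiples composite
    return flags

flags = sieve(31623)       # covers sqrt(10^9 + 7)
print(sum(flags))          # number of primes up to 31623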
The very short answer: do research on the subject, starting with the term "Miller-Rabin"
The short answer is no:
Looking for factors of a number is a poor way to check for primality
Exhaustively searching through primes is a poor way to look for factors
Especially if you search through every prime, rather than just the ones less than or equal to the square root of the number
Doing a primality test on each number of them is a poor way to generate a list of primes
Also, you should take in primeSet by reference rather than copy, if it really needs to be a parameter.
Note: testing small primes to see if they divide a number is a useful first step of a primality test, but should generally only be used for the smallest primes before switching to a better method
No, it's not a very good way to determine if a number is prime. Here is pseudocode for a simple primality test that is sufficient for numbers in your range; I'll leave it to you to translate to C++:
function isPrime(n)
    d := 2
    while d * d <= n
        if n % d == 0
            return False
        d := d + 1
    return True
This works by trying every potential divisor up to the square root of the input number n: if no divisor has been found, the input number cannot be composite, i.e. of the form n = p × q with 1 < p, q < n, because at least one of the two factors p or q must be less than or equal to the square root of n, while the other is greater than or equal to it.
There are better ways to determine primality; for instance, after initially checking if the number is even (and hence prime only if n = 2), it is only necessary to test odd potential divisors, halving the amount of work necessary. If you have a list of primes up to the square root of n, you can use that list as trial divisors and make the process even faster. And there are other techniques for larger n.
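A sketch of those improvements in Python (my code, not from the answer): test 2 once, then only odd candidates up to the square root.
def is_prime(n):
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2                 # odd divisors only: half the work
    return True

print(is_prime(10**9 + 7))     # True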
But that should be enough to get you started. When you are ready for more, come back here and ask more questions.
I can only suggest a way to use a library function in Java to check the primality of a number. As for the other questions, I do not have any answers.
The java.math.BigInteger.isProbablePrime(int certainty) method returns true if this BigInteger is probably prime, false if it's definitely composite. If certainty is ≤ 0, true is returned. You could try using it in your code, which would mean rewriting it in Java.
Parameters
certainty - a measure of the uncertainty that the caller is willing to tolerate: if the call returns true the probability that this BigInteger is prime exceeds (1 - 1/2^certainty). The execution time of this method is proportional to the value of this parameter.
Return Value
This method returns true if this BigInteger is probably prime, false if it's definitely composite.
Example
The following example shows the usage of math.BigInteger.isProbablePrime() method
import java.math.*;

public class BigIntegerDemo {
    public static void main(String[] args) {
        // create 3 BigInteger objects
        BigInteger bi1, bi2, bi3;
        // create 3 Boolean objects
        Boolean b1, b2, b3;
        // assign values to bi1, bi2
        bi1 = new BigInteger("7");
        bi2 = new BigInteger("9");
        // perform isProbablePrime on bi1, bi2
        b1 = bi1.isProbablePrime(1);
        b2 = bi2.isProbablePrime(1);
        b3 = bi2.isProbablePrime(-1);
        String str1 = bi1 + " is prime with certainty 1 is " + b1;
        String str2 = bi2 + " is prime with certainty 1 is " + b2;
        String str3 = bi2 + " is prime with certainty -1 is " + b3;
        // print b1, b2, b3 values
        System.out.println(str1);
        System.out.println(str2);
        System.out.println(str3);
    }
}
Output
7 is prime with certainty 1 is true
9 is prime with certainty 1 is false
9 is prime with certainty -1 is true

Find all number pairs in a given range

I have N numbers, let's say 20 30 15 30 30 40 15 20. Now I want to find how many number pairs are in a given index range (L and R given).
A number pair means two entries with the same value.
My approach:
Create a map from number to an ArrayList of the indexes at which that number appears. Then I traverse from L to R, and for each value in that range I traverse the corresponding ArrayList to see if there is a partner that also falls in the range, and increment the count.
But I think this approach is too slow. Is there some faster method to do the same?
Example: for the sequence given above, L=0 and R=6:
Answer = 5. Possible pairs are 1 for 20, 1 for 15 and 3 for 30.
I am developing a solution, assuming the numbers can be up to 10^8 (and non-negative).
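For what it's worth, the per-value counting idea can be checked against the example in a few lines of Python (my sketch, run on the whole sequence; restricting to a[L:R+1] handles a sub-range):
from collections import Counter

a = [20, 30, 15, 30, 30, 40, 15, 20]
pairs = sum(k * (k - 1) // 2 for k in Counter(a).values())
print(pairs)  # 5: one pair of 20s, one of 15s, three among the 30s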
If you are looking for speed and don't care about memory there's maybe a better way.
You can use a set as an auxiliary data structure to see if a number was found, and then simply walk the array. Pseudo code:
int numPairs = 0;
set setVisited;
for (int i = L; i < R; i++) {
    if (setVisited.contains(a[i])) {
        // found the second of a pair. count it up and reset.
        numPairs++;
        setVisited.remove(a[i]);
    } else {
        // remember that we saw this number, so we can spot the next pair.
        setVisited.add(a[i]);
    }
}
New solution... hopefully better this time. Pseudo C-ish code:
// Sort the sub-array a[L..R]. This can be done in O(n log n) using qsort.
// ... code omitted ...

// Walk through the sorted sub-array counting how many times each number occurs.
// When the number changes, add the number of ways the occurrences can be
// paired up: k occurrences give k * (k - 1) / 2 pairs (k choose 2).
int totalPairs = 0;
int count = 1;
int current = a[L];
for (i = L + 1; i <= R; i++) {
    if (a[i] == current) { // found another, keep counting
        count++;
    } else {               // found a different one
        if (count > 1) {   // need at least 2 to make a pair!
            totalPairs += count * (count - 1) / 2;
        }
        // start counting the new one
        current = a[i];
        count = 1;
    }
}
// count the final one
if (count > 1) {
    totalPairs += count * (count - 1) / 2;
}
The sort runs in O(n lg n), and the loop body runs in O(n). Since k occurrences of the same value contribute k(k-1)/2 = C(k,2) pairs, no expensive combinatorics are needed per group.
One thing to watch: for really long arrays with really high numbers of occurrences, the total grows quadratically in the group sizes, so totalPairs should be a 64-bit (or arbitrary-precision) integer.
Cool problem!

Algorithm to select a single, random combination of values?

Say I have y distinct values and I want to select x of them at random. What's an efficient algorithm for doing this? I could just call rand() x times, but the performance would be poor if x, y were large.
Note that combinations are needed here: each value should have the same probability to be selected but their order in the result is not important. Sure, any algorithm generating permutations would qualify, but I wonder if it's possible to do this more efficiently without the random order requirement.
How do you efficiently generate a list of K non-repeating integers between 0 and an upper bound N covers this case for permutations.
Robert Floyd invented a sampling algorithm for just such situations. It's generally superior to shuffling then grabbing the first x elements since it doesn't require O(y) storage. As originally written it assumes values from 1..N, but it's trivial to produce 0..N and/or use non-contiguous values by simply treating the values it produces as subscripts into a vector/array/whatever.
In pseudocode, the algorithm runs like this (stealing from Jon Bentley's Programming Pearls column "A sample of Brilliance").
initialize set S to empty
for J := N-M + 1 to N do
    T := RandInt(1, J)
    if T is not in S then
        insert T in S
    else
        insert J in S
That last bit (inserting J if T is already in S) is the tricky part. The bottom line is that it assures the correct mathematical probability of inserting J so that it produces unbiased results.
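Here is a direct transcription in Python (my sketch; random.randint is inclusive on both ends, matching RandInt):
import random

def floyd_sample(n, m):
    # returns a set of m distinct values drawn uniformly from 1..n
    s = set()
    for j in range(n - m + 1, n + 1):
        t = random.randint(1, j)
        s.add(t if t not in s else j)
    return s

print(sorted(floyd_sample(100, 5)))  # e.g. [13, 27, 42, 61, 90]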
It's O(x)¹ and O(1) with regard to y, with O(x) storage.
Note that, in accordance with the combinations tag in the question, the algorithm only guarantees equal probability of each element occurring in the result, not of their relative order in it.
¹ O(x²) in the worst case for the hash map involved, which can be neglected since it's a virtually nonexistent pathological case where all the values have the same hash.
Assuming that you want the order to be random too (or don't mind it being random), I would just use a truncated Fisher-Yates shuffle. Start the shuffle algorithm, but stop once you have selected the first x values, instead of "randomly selecting" all y of them.
Fisher-Yates works as follows:
select an element at random, and swap it with the element at the end of the array.
Recurse (or more likely iterate) on the remainder of the array, excluding the last element.
Steps after the first do not modify the last element of the array. Steps after the first two don't affect the last two elements. Steps after the first x don't affect the last x elements. So at that point you can stop - the top of the array contains uniformly randomly selected data. The bottom of the array contains somewhat randomized elements, but the permutation you get of them is not uniformly distributed.
Of course this means you've trashed the input array - if this means you'd need to take a copy of it before starting, and x is small compared with y, then copying the whole array is not very efficient. Do note though that if all you're going to use it for in future is further selections, then the fact that it's in somewhat-random order doesn't matter, you can just use it again. If you're doing the selection multiple times, therefore, you may be able to do only one copy at the start, and amortise the cost.
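A short Python sketch of the truncated shuffle (my code; it copies the input, per the caveat above):
import random

def sample_x(values, x):
    a = list(values)                 # copy so the input isn't trashed
    n = len(a)
    for i in range(n - 1, n - 1 - x, -1):
        j = random.randint(0, i)     # pick from the not-yet-settled prefix
        a[i], a[j] = a[j], a[i]      # move the pick into the settled tail
    return a[n - x:]                 # x uniformly selected elements

print(sample_x(range(1, 101), 5))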
If you really only need to generate combinations - where the order of elements does not matter - you may use combinadics as they are implemented e.g. here by James McCaffrey.
Contrast this with k-permutations, where the order of elements does matter.
In the first case (1,2,3), (1,3,2), (2,1,3), (2,3,1), (3,1,2), (3,2,1) are considered the same - in the latter, they are considered distinct, though they contain the same elements.
In case you need combinations, you may really only need to generate one random number (albeit it can be a bit large): that can be used directly to find the m-th combination.
Since this random number represents the index of a particular combination, it follows that your random number should be between 0 and C(n,k)-1.
Calculating combinadics might take some time as well.
It might just not be worth the trouble; besides, Jerry's and Federico's answers are certainly simpler than implementing combinadics.
However if you really only need a combination and you are bugged about generating the exact number of random bits that are needed and none more... ;-)
While it is not clear whether you want combinations or k-permutations, here is a C# code for the latter (yes, we could generate only a complement if x > y/2, but then we would have been left with a combination that must be shuffled to get a real k-permutation):
static class TakeHelper
{
    public static IEnumerable<T> TakeRandom<T>(
        this IEnumerable<T> source, Random rng, int count)
    {
        T[] items = source.ToArray();
        count = count < items.Length ? count : items.Length;
        for (int i = items.Length - 1; count-- > 0; i--)
        {
            int p = rng.Next(i + 1);
            yield return items[p];
            items[p] = items[i];
        }
    }
}

class Program
{
    static void Main(string[] args)
    {
        Random rnd = new Random(Environment.TickCount);
        int[] numbers = new int[] { 1, 2, 3, 4, 5, 6, 7 };
        foreach (int number in numbers.TakeRandom(rnd, 3))
        {
            Console.WriteLine(number);
        }
    }
}
Another, more elaborate implementation that generates k-permutations, which I had lying around and which I believe improves on existing algorithms in a way, if you only need to iterate over the results. While it also needs to generate x random numbers, it only uses O(min(y/2, x)) memory in the process:
/// <summary>
/// Generates unique random numbers
/// <remarks>
/// Worst case memory usage is O(min((emax-imin)/2, num))
/// </remarks>
/// </summary>
/// <param name="random">Random source</param>
/// <param name="imin">Inclusive lower bound</param>
/// <param name="emax">Exclusive upper bound</param>
/// <param name="num">Number of integers to generate</param>
/// <returns>Sequence of unique random numbers</returns>
public static IEnumerable<int> UniqueRandoms(
    Random random, int imin, int emax, int num)
{
    int dictsize = num;
    long half = (emax - (long)imin + 1) / 2;
    if (half < dictsize)
        dictsize = (int)half;
    Dictionary<int, int> trans = new Dictionary<int, int>(dictsize);
    for (int i = 0; i < num; i++)
    {
        int current = imin + i;
        int r = random.Next(current, emax);
        int right;
        if (!trans.TryGetValue(r, out right))
        {
            right = r;
        }
        int left;
        if (trans.TryGetValue(current, out left))
        {
            trans.Remove(current);
        }
        else
        {
            left = current;
        }
        if (r > current)
        {
            trans[r] = left;
        }
        yield return right;
    }
}
The general idea is to do a Fisher-Yates shuffle and memorize the transpositions in the permutation.
It was not published anywhere nor has it received any peer-review whatsoever. I believe it is a curiosity rather than something of practical value. Nonetheless I am very open to criticism and would generally like to know if you find anything wrong with it; please consider leaving a comment before downvoting.
A little suggestion: if x >> y/2, it's probably better to select at random y - x elements, then choose the complementary set.
The trick is to use a variation of shuffle or in other words a partial shuffle.
function random_pick( a, n )
{
    N = len(a);
    n = min(n, N);
    picked = array_fill(0, n, 0); backup = array_fill(0, n, 0);
    // partially shuffle the array, and generate unbiased selection simultaneously
    // this is a variation on fisher-yates-knuth shuffle
    for (i=0; i<n; i++) // O(n) times
    {
        selected = rand( 0, --N ); // unbiased sampling N * N-1 * N-2 * .. * N-n+1
        value = a[ selected ];
        a[ selected ] = a[ N ];
        a[ N ] = value;
        backup[ i ] = selected;
        picked[ i ] = value;
    }
    // restore partially shuffled input array from backup
    // optional step, if needed it can be ignored
    for (i=n-1; i>=0; i--) // O(n) times
    {
        selected = backup[ i ];
        value = a[ N ];
        a[ N ] = a[ selected ];
        a[ selected ] = value;
        N++;
    }
    return picked;
}
NOTE the algorithm is strictly O(n) in both time and space, produces unbiased selections (it is a partial unbiased shuffling) and non-destructive on the input array (as a partial shuffle would be) but this is optional
adapted from here
update
another approach using only a single call to PRNG (pseudo-random number generator) in [0,1] by IVAN STOJMENOVIC, "ON RANDOM AND ADAPTIVE PARALLEL GENERATION OF COMBINATORIAL OBJECTS" (section 3), of O(N) (worst-case) complexity
Here is a simple way to do it which is only inefficient if Y is much larger than X.
void randomly_select_subset(
    int X, int Y,
    const int * inputs, int * outputs
) {
    int i, r;
    /* reservoir sampling: inputs[i] replaces a random output slot with probability X/(i+1) */
    for( i = 0; i < X; ++i ) outputs[i] = inputs[i];
    for( i = X; i < Y; ++i ) {
        r = rand_inclusive( 0, i );   /* 0..i inclusive */
        if( r < X ) outputs[r] = inputs[i];
    }
}
Basically, copy the first X of your distinct values to your output array, and then for each remaining value, randomly decide whether or not to include that value.
The random number is further used to choose an element of our (mutable) output array to replace.
If, for example, you have 2^64 distinct values, you can use a symmetric key algorithm (with a 64 bits block) to quickly reshuffle all combinations. (for example Blowfish).
for(i=0; i<x; i++)
    e[i] = encrypt(key, i)
This is not random in the pure sense but can be useful for your purpose.
If you want to work with arbitrary # of distinct values following cryptographic techniques you can but it's more complex.

Generate all binary strings of length n with k bits set

What's the best algorithm to find all binary strings of length n that contain k bits set? For example, if n=4 and k=3, there are...
0111
1011
1101
1110
I need a good way to generate these given any n and any k so I'd prefer it to be done with strings.
This method will generate all integers with exactly N '1' bits.
From https://graphics.stanford.edu/~seander/bithacks.html#NextBitPermutation
Compute the lexicographically next bit permutation
Suppose we have a pattern of N bits set to 1 in an integer and we want
the next permutation of N 1 bits in a lexicographical sense. For
example, if N is 3 and the bit pattern is 00010011, the next patterns
would be 00010101, 00010110, 00011001, 00011010, 00011100, 00100011,
and so forth. The following is a fast way to compute the next
permutation.
unsigned int v; // current permutation of bits
unsigned int w; // next permutation of bits
unsigned int t = v | (v - 1); // t gets v's least significant 0 bits set to 1
// Next set to 1 the most significant bit to change,
// set to 0 the least significant ones, and add the necessary 1 bits.
w = (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
The __builtin_ctz(v) GNU C compiler intrinsic for x86 CPUs returns the number of trailing zeros. If you are using Microsoft compilers for
x86, the intrinsic is _BitScanForward. These both emit a bsf
instruction, but equivalents may be available for other architectures.
If not, then consider using one of the methods for counting the
consecutive zero bits mentioned earlier. Here is another version that
tends to be slower because of its division operator, but it does not
require counting the trailing zeros.
unsigned int t = (v | (v - 1)) + 1;
w = t | ((((t & -t) / (v & -v)) >> 1) - 1);
Thanks to Dario Sneidermanis of Argentina, who provided this on November 28, 2009.
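For illustration, the same trick translated to Python (my sketch; Python integers are arbitrary precision, and (v & -v).bit_length() plays the role of __builtin_ctz(v) + 1):
def kbit_numbers(n, k):
    if k == 0:
        yield 0
        return
    v = (1 << k) - 1                 # smallest value with k bits set
    while v < (1 << n):
        yield v
        t = v | (v - 1)              # set v's trailing 0s to 1
        v = (t + 1) | (((~t & -~t) - 1) >> (v & -v).bit_length())

for v in kbit_numbers(4, 3):
    print(format(v, '04b'))          # 0111, 1011, 1101, 1110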
Python
import itertools

def kbits(n, k):
    result = []
    for bits in itertools.combinations(range(n), k):
        s = ['0'] * n
        for bit in bits:
            s[bit] = '1'
        result.append(''.join(s))
    return result

print(kbits(4, 3))
Output: ['1110', '1101', '1011', '0111']
Explanation:
Essentially we need to choose the positions of the 1-bits. There are n choose k ways of choosing k bits among n total bits. itertools is a nice module that does this for us. itertools.combinations(range(n), k) will choose k bits from [0, 1, 2 ... n-1] and then it's just a matter of building the string given those bit indexes.
Since you aren't using Python, look at the pseudo-code for itertools.combinations here:
http://docs.python.org/library/itertools.html#itertools.combinations
Should be easy to implement in any language.
Forget about implementation ("be it done with strings" is obviously an implementation issue!) -- think about the algorithm, for Pete's sake... just as in, your very first TAG, man!
What you're looking for is all combinations of K items out of a set of N (the indices, 0 to N-1, of the set bits). That's obviously simplest to express recursively, e.g., pseudocode:
combinations(K, setN):
    if K > length(setN): return "no combinations possible"
    if K == 0: return "empty combination"
    # combinations including the first item:
    return ((first-item-of setN) combined combinations(K-1, all-but-first-of setN))
           union combinations(K, all-but-first-of setN)
i.e., the first item is either present or absent: if present, you have K-1 left to go (from the tail, aka all-but-first); if absent, you still have K left to go.
Pattern-matching functional languages like SML or Haskell may be best to express this pseudocode (procedural ones, like my big love Python, may actually mask the problem too deeply by including too-rich functionality, such as itertools.combinations, which does all the hard work for you and therefore HIDES it from you!).
What are you most familiar with, for this purpose -- Scheme, SML, Haskell, ...? I'll be happy to translate the above pseudocode for you. I can do it in languages such as Python too, of course -- but since the point is getting you to understand the mechanics for this homework assignment, I won't use too-rich functionality such as itertools.combinations, but rather recursion (and recursion-elimination, if needed) on more obvious primitives (such as head, tail, and concatenation). But please DO let us know what pseudocode-like language you're most familiar with! (You DO understand that the problem you state is identically equipotent to "get all combinations of K items out or range(N)", right?).
This C# method returns an enumerator that creates all combinations. As it creates the combinations as you enumerate them it only uses stack space, so it's not limited by memory space in the number of combinations that it can create.
This is the first version that I came up with. It's limited by the stack space to a length of about 2700:
static IEnumerable<string> BinStrings(int length, int bits) {
    if (length == 1) {
        yield return bits.ToString();
    } else {
        if (length > bits) {
            foreach (string s in BinStrings(length - 1, bits)) {
                yield return "0" + s;
            }
        }
        if (bits > 0) {
            foreach (string s in BinStrings(length - 1, bits - 1)) {
                yield return "1" + s;
            }
        }
    }
}
This is the second version, which uses a binary split rather than splitting off the first character, so it uses the stack much more efficiently. It's only limited by the memory space for the string that it creates in each iteration, and I have tested it up to a length of 10000000:
static IEnumerable<string> BinStrings(int length, int bits) {
    if (length == 1) {
        yield return bits.ToString();
    } else {
        int first = length / 2;
        int last = length - first;
        int low = Math.Max(0, bits - last);
        int high = Math.Min(bits, first);
        for (int i = low; i <= high; i++) {
            foreach (string f in BinStrings(first, i)) {
                foreach (string l in BinStrings(last, bits - i)) {
                    yield return f + l;
                }
            }
        }
    }
}
One problem with many of the standard solutions to this problem is that the entire set of strings is generated and then iterated through, which may exhaust memory. It quickly becomes unwieldy for any but the smallest sets. In addition, in many instances only a partial sampling is needed, but the standard (recursive) solutions generally chop the problem into pieces that are heavily biased to one direction (e.g. consider all the solutions with a zero starting bit, and then all the solutions with a one starting bit).
In many cases, it would be more desirable to be able to pass a bit string (specifying element selection) to a function and have it return the next bit string in such a way as to have a minimal change (this is known as a Gray Code) and to have a representation of all the elements.
Donald Knuth covers a whole host of algorithms for this in his Fascicle 3A, section 7.2.1.3: Generating all Combinations.
There is an approach for tackling the iterative Gray Code algorithm for all ways of choosing k elements from n at http://answers.yahoo.com/question/index?qid=20081208224633AA0gdMl
with a link to final PHP code listed in the comment (click to expand it) at the bottom of the page.
One possible 1.5-liner:
$ python3 -c 'import itertools; \
print(set(itertools.permutations("0111", 4)))'
{('1', '1', '1', '0'), ('0', '1', '1', '1'), ..., ('1', '0', '1', '1')}
.. where k is the number of 1s in "0111".
The itertools module explains equivalents for its methods; see the equivalent for the permutation method.
One algorithm that should work:
generate-strings(prefix, len, numBits) -> String:
if (len == 0):
print prefix
return
if (len == numBits):
print prefix + (len x "1")
generate-strings(prefix + "0", len-1, numBits)
generate-strings(prefix + "1", len-1, numBits)
Good luck!
In a more generic way, the below function will give you all possible index combinations for an N choose K problem which you can then apply to a string or whatever else:
def generate_index_combinations(n, k):
    possible_combinations = []

    def walk(current_index, indexes_so_far=None):
        indexes_so_far = indexes_so_far or []
        if len(indexes_so_far) == k:
            possible_combinations.append(tuple(indexes_so_far))
            return
        if current_index == n:
            return
        walk(current_index + 1, indexes_so_far + [current_index])
        walk(current_index + 1, indexes_so_far)

    if k == 0:
        return []
    walk(0)
    return possible_combinations
I would try recursion.
There are n digits, with k of them 1s. Another way to view this is as a sequence of k+1 slots with the n-k 0s distributed among them. That is, (a run of 0s followed by a 1) k times, then followed by another run of 0s. Any of these runs can be of length zero, but the total length needs to be n-k.
Represent this as an array of k+1 integers. Convert to a string at the bottom of the recursion.
Recursively call, to depth n-k, a method that increments one element of the array before a recursive call and then decrements it afterwards; to avoid generating the same distribution twice, each level may only increment the slot it was handed or a later one.
At the depth of n-k, output the string.
static int n = 5, k = 3;          // example sizes
static int[] run = new int[k + 1];

static void recur(int depth, int minSlot) {
    if (depth == 0) {
        output();
        return;
    }
    // only this slot or later ones, so each distribution is generated exactly once
    for (int i = minSlot; i < k + 1; ++i) {
        ++run[i];
        recur(depth - 1, i);
        --run[i];
    }
}

static void output() {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < k; ++i) {
        for (int j = 0; j < run[i]; ++j) sb.append('0');
        sb.append('1');
    }
    for (int j = 0; j < run[k]; ++j) sb.append('0');
    System.out.println(sb);
}

public static void main(String[] arrrgghhs) {
    recur(n - k, 0);
}
It's been a while since I have done Java, so there are probably some errors in this code, but the idea should work.
Are strings faster than an array of ints? All the solutions prepending to strings probably result in a copy of the string at each iteration.
So probably the most efficient approach would be an array of int or char that you append to. Java has efficient growable containers, right? Use one of those if it's faster than String. Or if BigInteger is efficient, it's certainly compact, since each bit only takes a bit, not a whole byte or int. But then to iterate over the bits you need to AND-mask a bit and shift the mask to the next bit position, so probably slower, unless JIT compilers are good at that these days.
I would post this as a comment on the original question, but my karma isn't high enough. Sorry.
Python (functional style)
Using python's itertools.combinations you can generate all choices of k out of n and map those choices to a binary array with reduce:
from itertools import combinations
from functools import reduce  # not necessary in python 2.x

def k_bits_on(k, n):
    one_at = lambda v, i: v[:i] + [1] + v[i+1:]
    return [tuple(reduce(one_at, c, [0]*n)) for c in combinations(range(n), k)]
Example usage:
In [4]: k_bits_on(2,5)
Out[4]:
[(0, 0, 0, 1, 1),
(0, 0, 1, 0, 1),
(0, 0, 1, 1, 0),
(0, 1, 0, 0, 1),
(0, 1, 0, 1, 0),
(0, 1, 1, 0, 0),
(1, 0, 0, 0, 1),
(1, 0, 0, 1, 0),
(1, 0, 1, 0, 0),
(1, 1, 0, 0, 0)]
Well, for this question (where you need to iterate over all the submasks in increasing order of their number of set bits), which has been marked as a duplicate of this one:
We can simply iterate over all the submasks, add them to a vector, and sort it according to the number of set bits.
typedef long long ll;
vector<ll> v;
for (ll i = mask; i > 0; i = (i - 1) & mask)
    v.push_back(i);
auto cmp = [](const auto &a, const auto &b){
    return __builtin_popcountll(a) < __builtin_popcountll(b);
};
sort(v.begin(), v.end(), cmp);
Another way would be to iterate over all the submasks N times and, in the i-th iteration, add a number to the vector if its number of set bits equals i.
Both ways have a complexity of O(n * 2^n).
Best and Easy Solution
This is an easy problem; we just need to use dynamic programming.
I can give my solution, which stores integers. After that you can convert the integers to bit strings.
List<Long> dp[] = new List[m + 1];
for (int i = 0; i <= m; i++) dp[i] = new ArrayList<>();
// dp[i] stores all possible bit masks of length n with i bits set
dp[0].add(0L);
for (int i = 1; i <= m; i++) {
    // transitions
    for (int j = 0; j < dp[i - 1].size(); j++) {
        long num = dp[i - 1].get(j);
        for (int p = 0; p < n; p++) {
            if ((num & (1L << p)) == 0) dp[i].add(num | (1L << p));
        }
    }
}
// dp[m] contains all possible numbers having m bits set of length n
But dp[m] contains duplicates, because adding 1 to 10 or to 01 gives 11 twice. To handle that we can use a HashSet:
Set<Long> set = new HashSet<>();
for (int i = 0; i < dp[m].size(); i++) set.add(dp[m].get(i));
If you want to solve this problem recursively, you can, with a simple recursive enumeration:
def binlist(n, k, s):
    if n == 0:
        if s.count('1') == k:
            print(s)
    else:
        binlist(n - 1, k, s + '1')
        binlist(n - 1, k, s + '0')

binlist(5, 3, '')
the output will be :
11100
11010
11001
10110
10101
10011
01110
01101
01011
00111
