I need to generate a list of numbers (about 120). The numbers range from 1 to X (max 10), both included. The algorithm should use every number an equal number of times, or at least try to; if some numbers are used once less, that's OK.
This is the first time I've had to make this kind of algorithm. I've created very simple ones, but I'm stumped on how to do this. I tried googling first, but I don't really know what to call this kind of algorithm, so I couldn't find anything.
Thanks a lot!
It sounds like what you want to do is first fill a list with the numbers you want and then shuffle that list. One way to do this would be to add each of your numbers to the list and then repeat that process until the list has as many items as you want. After that, randomly shuffle the list.
In pseudo-code, generating the initial list might look something like this:
list = []
while length(list) < N
    for i in 1, 2, ..., X
        if length(list) >= N
            break
        end if
        list.append(i)
    end for
end while
I leave the shuffling part as an exercise to the reader.
EDIT:
As pointed out in the comments, the above will always include the smaller numbers more often than the larger ones whenever N is not a multiple of X. If this isn't what's desired, you could iterate over the possible numbers in a random order. For example:
list = []
numbers = shuffle( [1, 2, ..., X] )
while length(list) < N
    for i in 1, 2, ..., X
        if length(list) >= N
            break
        end if
        list.append( numbers[i] )
    end for
end while
I think this should remove that bias.
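For concreteness, here is a minimal Java sketch of the fill-and-shuffle idea, including the shuffled visiting order from the edit. The class and method names are just for illustration.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BalancedRandomList {
    // Builds a list of length n using the values 1..x as evenly as possible.
    // The candidate values are shuffled first so that, when n is not a multiple
    // of x, the values that appear one extra time are chosen at random.
    static List<Integer> balancedRandomList(int n, int x) {
        List<Integer> values = new ArrayList<>();
        for (int v = 1; v <= x; v++) {
            values.add(v);
        }
        Collections.shuffle(values);       // random visiting order of the values

        List<Integer> result = new ArrayList<>(n);
        while (result.size() < n) {
            for (int v : values) {
                if (result.size() >= n) {
                    break;
                }
                result.add(v);
            }
        }
        Collections.shuffle(result);       // finally, randomize the order of the output
        return result;
    }

    public static void main(String[] args) {
        System.out.println(balancedRandomList(120, 10));
    }
}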
What you want is a uniformly distributed random number (wiki). It means that every number from 1 to 10 is equally likely on each draw, so over many draws each value appears roughly equally often.
The Random class in Java gives a fairly uniform distribution, so just go for it. To test it, try this:
import java.util.Random;

Random rand = new Random();
for (int i = 0; i < 10; i++) {
    int rNum = rand.nextInt(10) + 1;  // uniform in 1..10
    System.out.println(rNum);
}
and see whether the output covers the numbers from 1 to 10 fairly evenly.
One more similar discussion that might help: Uniform distribution with Random class
Given are an iterator it over data points, the number of data points we have n, and the maximum number of samples we want to use to do some calculations (maxSamples).
Imagine a function calculateStatistics(Iterator it, int n, int maxSamples). This function should use the iterator to retrieve the data and do some (heavy) calculations on the data element retrieved.
if n <= maxSamples we will of course use each element we get from the iterator
if n > maxSamples we will have to choose which elements to look at and which to skip
I've been spending quite some time on this. The problem is of course how to choose when to skip an element and when to keep it. My approaches so far:
I don't want to take the first maxSamples coming from the iterator, because the values might not be evenly distributed.
Another idea was to use a random number generator to create maxSamples distinct random numbers between 0 and n and take the elements at those positions. But if e.g. n = 101 and maxSamples = 100, it gets more and more difficult to find a new distinct number not yet in the list, losing a lot of time just on the random number generation.
My last idea was to do the opposite: to generate n - maxSamples random numbers and exclude the data elements at those positions. But this also doesn't seem to be a very good solution.
Do you have a good idea for this problem? Are there maybe standard known algorithms for this?
To provide some answer: a good way to collect a random subset, given a collection size larger than the number of elements needed, is the following (in C++-ish pseudo-code).
EDIT: you may need to iterate over and create the "someElements" vector first. If your elements are large they can be "pointers" to these elements to save space.
#include <cstdlib>
#include <vector>

// Repeatedly picks a random position, copies that element into the result,
// and removes it so it cannot be picked again.
std::vector<int> randomCollectionFromVector(std::vector<int> someElements, int numElementsToGrab) {
    std::vector<int> resultVector;
    while (numElementsToGrab--) {
        int randPosition = rand() % someElements.size();
        resultVector.push_back(someElements[randPosition]);
        someElements.erase(someElements.begin() + randPosition);
    }
    return resultVector;
}
If you don't care about changing your vector of elements, you could also remove random elements from someElements, as you mentioned. The algorithm would look very similar, and again, this is conceptually the same idea, you just pass someElements by reference, and manipulate it.
Something worth noting is that the apparent quality of pseudo-random distributions grows with the number of draws you take from them. So you may tend to get better results if you pick whichever method results in using more random numbers. For example: if you have 100 values and need 99, you should probably pick 99 values, as this uses 99 pseudo-random numbers instead of just 1. Conversely, if you have 1000 values and need 99, you should probably prefer the version where you remove 901 values, because you use more numbers from the pseudo-random distribution. If what you want is a solid random distribution, this is a very simple optimization that will greatly increase the quality of the "fake randomness" you see. Alternatively, if performance matters more than the distribution, take the other approach, or even just grab the first 99 values.
interval = n / (n - maxSamples)   // integer (Euclidean) division, of course
offset = random(0 .. n-1)         // a random number between 0 and n-1
totalSkip = 0
indexSample = 0
FOR it IN samples DO
    indexSample++                 // goes from 1 to n
    IF totalSkip < (n - maxSamples) AND (indexSample + offset) % interval == 0 THEN
        // do nothing with this sample
        totalSkip++
    ELSE
        // work with this sample
    ENDIF
ENDFOR
ASSERT(totalSkip == n - maxSamples)   // to be sure
interval represents the distance between two samples to skip.
offset is not mandatory, but it adds a little diversity.
Based on the discussion, and a greater understanding of your problem, I suggest the following. You can take advantage of a property of prime numbers that I think will net you a very good solution: it will appear to grab pseudo-random numbers. It is illustrated in the following code.
#include <iostream>
using namespace std;

int main() {
    const int SOME_LARGE_PRIME = 577; // This prime should be larger than the size of your data set.
    const int NUM_ELEMENTS = 100;
    int lastValue = 0;
    for (int i = 0; i < NUM_ELEMENTS; i++) {
        lastValue += SOME_LARGE_PRIME;
        cout << lastValue % NUM_ELEMENTS << endl;
    }
}
Using the logic presented here, you can create a table of all values from 0 to NUM_ELEMENTS - 1. Because of the properties of prime numbers, you will not get any duplicates until you rotate all the way around back to the size of your data set. If you then take the first NUM_SAMPLES of these and sort them, you can iterate through your data structure and grab a pseudo-random distribution of positions (not very good randomness, but more random than a pre-determined interval), without extra space and with only one pass over your data. Better yet, you can change the layout of the distribution by grabbing a random prime number each time; again, it must be larger than your data set, or the following example breaks.
PRIME = 3, data set size = 99. Won't work.
Of course, ultimately this is very similar to the pre-determined interval, but it inserts a level of randomness that you do not get by simply grabbing every "size/num_samples"th element.
This is called reservoir sampling.
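For completeness, here is a minimal Java sketch of the classic single-pass scheme (Algorithm R): keep the first maxSamples elements, then let each later element replace a random slot with probability maxSamples / seen, so every element ends up in the sample with equal probability without knowing n in advance. The class and method names are just for illustration.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

public class Reservoir {
    // Returns a uniform random sample of up to maxSamples elements from the iterator,
    // in a single pass and without knowing the total count in advance.
    static <T> List<T> sample(Iterator<T> it, int maxSamples, Random rng) {
        List<T> reservoir = new ArrayList<>(maxSamples);
        long seen = 0;
        while (it.hasNext()) {
            T element = it.next();
            seen++;
            if (reservoir.size() < maxSamples) {
                reservoir.add(element);              // fill the reservoir first
            } else {
                // replace a random slot with probability maxSamples / seen
                long j = (long) (rng.nextDouble() * seen);
                if (j < maxSamples) {
                    reservoir.set((int) j, element);
                }
            }
        }
        return reservoir;
    }
}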
How would you implement a function that returns a random number from the interval 1..1000, where a number N determines the chance of getting higher or lower numbers?
It should behave as follows:
e.g.
if N = 0 and we generate the random number many times, we get an equilibrium (every number from the interval 1..1000 has an equal chance).
if N = 2321 (I call it a positive factor), it will be very hard to get a small number (numbers > 900 will often be generated, sometimes numbers near 500, and rarely numbers < 100). The higher the positive factor, the higher the probability of high numbers.
if N = -2321 (a negative factor), this will be the opposite of the positive factor.
It's clear that for a given N the generated numbers will follow a certain characteristic curve. Could you advise me how to achieve this goal, and what curves I can create? What possibilities do I have here? How would you limit the positive and negative factors, etc.?
Thank you for your help!
If you generate a uniform random number, and then raise it to a power > 1, it will get smaller, but stay in the range [0, 1]. If you raise it to a power greater than 0 but less than 1, it will get larger, but stay in the range [0, 1].
So you can use the exponent to pick a power when generating your random numbers.
import random

def biased_random(scale, bias):
    return random.random() ** bias * scale
sum(biased_random(1000, 2.5) for x in range(100)) / 100
291.59652962214676 # average less than 500
max(biased_random(1000, 2.5) for x in range(100))
963.81166161355998 # but still occasionally generates large numbers
sum(biased_random(1000, .3) for x in range(100)) / 100
813.90199860117821 # average > 500
min(biased_random(1000, .3) for x in range(100))
265.25040459294883 # but still occasionally generates small numbers
This problem is severely underspecified. There are a million ways to solve it as stated.
Instead of arbitrary positive and negative values, try to think about what the meaning behind them is. IMHO, the beta distribution is the one you should consider. By selecting the parameters α and β you should be able to modulate the behavior of your distribution appropriately.
See what shapes you can get with certain \alpha and \beta http://en.wikipedia.org/wiki/Beta_distribution#Shapes
http://en.wikipedia.org/wiki/File:Beta_distribution_pdf.svg
Let's decide at the start that we will pick numbers from [0, 1], because it makes things simpler.
n is the number that represents the distribution (0, 2321, or -2321), as in the example.
We only need a solution for n > 0, because if n < 0 you can take the positive version of n and subtract the result from 1.
One simple idea for a PDF on the interval [0, 1] is x^n (or at least this kind of shape).
The CDF is then the integral of x^n, which is x^(n+1)/(n+1).
Because the CDF must equal 1 at the end of the interval (in our case at x = 1), the properly weighted CDF is x^(n+1).
In order to generate this kind of distribution, we must calculate the quantile function.
The quantile function is just the inverse of the CDF, which in our case is x^(1/(n+1)).
And that is it. Your QF is x^(1/(n+1)).
To generate numbers in [0, 1], pick a uniformly distributed random number x from [0, 1] (the most common random function in programming languages)
and then raise it to the power 1/(n+1).
The only problem I see is that it can be hard to compute 1 - x^(1/(-n+1)) accurately when n < 0, but I think you can use log1p,
so it becomes exp(log1p(-x^(1/(-n+1)))) if n < 0.
Conclusion, with normalization:
if n >= 0: (x^(1/(n/1000 + 1))) * 1000
if n < 0: exp(log1p(-(x^(1/(-(n/1000) + 1))))) * 1000
where x is a uniformly distributed random value in the interval [0, 1].
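As an illustration, here is a small Java sketch of the conclusion above. It uses the plain 1 - x^(1/(-n/1000 + 1)) form for negative n rather than the log1p variant, and the scaling by 1000 follows the normalized formulas; treat it as a sketch, not a tuned implementation.

import java.util.Random;

public class BiasedRandom {
    private static final Random RNG = new Random();

    // Returns a number in [0, 1000] biased towards high values for n > 0
    // and towards low values for n < 0, using the quantile function x^(1/(n/1000 + 1)).
    static double biased(double n) {
        double x = RNG.nextDouble();                       // uniform in [0, 1)
        if (n >= 0) {
            return Math.pow(x, 1.0 / (n / 1000.0 + 1.0)) * 1000.0;
        } else {
            // mirror the positive case: 1 - x^(1/(-n/1000 + 1))
            return (1.0 - Math.pow(x, 1.0 / (-n / 1000.0 + 1.0))) * 1000.0;
        }
    }

    public static void main(String[] args) {
        double sum = 0;
        for (int i = 0; i < 100000; i++) {
            sum += biased(2321);
        }
        System.out.println("average for n = 2321: " + sum / 100000);  // well above 500
    }
}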
In Mathematica, do I have to use an explicit loop to calculate the product of the elements in a given (potentially very long) list modulo another number?
Please teach me your elegant approach if you have one. Thanks!
Edit
Just to give an example
list=Range[2000];Mod[Product[list],32327]
The above is very inefficient, because while calculating the products, one could have taken the modulo to make the multipliers smaller.
Edit 2
I guess my question relates to how to replace a For loop such as
Module[{ret = initial_value}, For[i = 1, i <= Length[list], i++, ret = general_function[list[[i]], ret]]; ret]
given a general function general_function and a list list.
For long lists a divide-and-conquer approach is typically faster. The idea is to compute the product-mod for the first and second halves, multiply those, and take the mod.
Here is an example. We'll use a list of 10^6 integers, all between 0 and 10^10.
SeedRandom[1111111];
len = 6;
max = 10;
list = RandomInteger[10^max, 10^len];
Multiplying and taking the modulus, for a slightly larger mod (I wanted to decrease the likelihood that the result was zero):
In[119]:= Timing[Mod[Times @@ list, 32327541]]
Out[119]= {1.360000, 8826597}
Here is a variant of the sort I described. Trial and error tuning indicated that lists of length 2^9 or so were best done nonrecursively, at least for numbers in the size range indicated above.
tmod2[ll_List, m_] := With[{len = Floor[Length[ll]/2]},
  If[len <= 256,
    Mod[Times @@ ll, m],
    Mod[tmod2[Take[ll, len], m] * tmod2[Drop[ll, len], m], m]]]
In[120]:= Timing[tmod2[list, 32327541]]
Out[120]= {0.310000, 8826597}
When I increase the list length to 10^7 and allow ints from 0 to 10^20, the first method takes 50 seconds and the second one takes 5 seconds. So clearly the scaling is working to our advantage.
For situations where an iteration interleaving two operations might be preferred to divide-and-conquer, one might use Fold as below.
tmod3[ll_List, m_] := Fold[Mod[#1*#2,m]&, First[ll], Rest[ll]]
While not competitive with tmod2 on long lists, this is faster than multiplying out everything prior to invoking Mod. For length 10^7 and a max element of 10^20 it takes around 8 seconds to do what tmod2 did in 5.
Why not use Times? The following
list=Range[2000];
Mod[Times @@ list, 32327]
will probably be the most efficient. From a recent WRI blog post,
Times knows a clever binary splitting trick that can be used when you have a large number of integer arguments. It is faster to recursively split the arguments into two smaller products, (1*2*…32767)(32768*…*65536), rather than working through the arguments from first to last. It still has to do the same number of multiplications, but fewer of them involve very big integers, and so, on average, are quicker to do
I'm assuming that list in your question is just an example. If you really have to take the product of n consecutive integers starting with 1, then Factorial will be the fastest. i.e.,
Mod[2000!, 32327]
This appears to be as much as twice as fast as Daniel's code on my system:
SeedRandom[1];
list = RandomInteger[1*^20, 1*^7];
m = 32327501;
Mod[Times @@ Mod[Times @@@ Partition[list, 50, 50, 1, {}], m], m] // AbsoluteTiming
tmod2[list, m] // AbsoluteTiming
{1.5800904, 21590133}
{3.1081778, 21590133}
Different partition lengths could be used to tune this for your system and work set.
There is a file that contains 10G (1000000000) integers; please find the median of these integers. You are given 2G of memory to do this. Can anyone come up with a reasonable way? Thanks!
Create an array of 8-byte longs that has 2^16 entries. Take your input numbers, shift off the bottom sixteen bits, and create a histogram.
Now you count up in that histogram until you reach the bin that covers the midpoint of the values.
Pass through again, ignoring all numbers that don't have that same set of top bits, and make a histogram of the bottom bits.
Count up through that histogram until you reach the bin that covers the midpoint of the (entire list of) values.
Now you know the median, in O(n) time and O(1) space (in practice, under 1 MB).
Here's some sample Scala code that does this:
def medianFinder(numbers: Iterable[Int]) = {
  // Returns the cumulative count and index of the bin containing the mid-th value.
  def midArgMid(a: Array[Long], mid: Long) = {
    val cuml = a.scanLeft(0L)(_ + _).drop(1)
    cuml.zipWithIndex.dropWhile(_._1 < mid).head
  }
  // First pass: histogram of the top 16 bits.
  val topHistogram = new Array[Long](65536)
  var count = 0L
  numbers.foreach(number => {
    count += 1
    topHistogram(number>>>16) += 1
  })
  val (topCount, topIndex) = midArgMid(topHistogram, (count+1)/2)
  // Second pass: histogram of the bottom 16 bits, restricted to the median's top bin.
  val botHistogram = new Array[Long](65536)
  numbers.foreach(number => {
    if ((number>>>16) == topIndex) botHistogram(number & 0xFFFF) += 1
  })
  val (botCount, botIndex) =
    midArgMid(botHistogram, (count+1)/2 - (topCount - topHistogram(topIndex)))
  (topIndex<<16) + botIndex
}
and here it is working on a small set of input data:
scala> medianFinder(List(1,123,12345,1234567,123456789))
res18: Int = 12345
If you have 64 bit integers stored, you can use the same strategy in 4 passes instead.
You can use the Medians of Medians algorithm.
If the file is in text format, you may be able to fit it in memory just by converting things to integers as you read them in, since an integer stored as characters may take more space than an integer stored as an integer, depending on the size of the integers and the type of text file. EDIT: You edited your original question; I can see now that you can't read them into memory, see below.
If you can't read them into memory, this is what I came up with:
Figure out how many integers you have. You may know this from the start. If not, then it only takes one pass through the file. Let's say this is S.
Use your 2G of memory to find the x largest integers (however many you can fit). You can do one pass through the file, keeping the x largest in a sorted list of some sort, discarding the rest as you go. Now you know the x-th largest integer. You can discard all of these except for the x-th largest, which I'll call x1.
Do another pass through, finding the next x largest integers less than x1, the least of which is x2.
I think you can see where I'm going with this. After a few passes, you will have read in the (S/2)-th largest integer (you'll have to keep track of how many integers you've found), which is your median. If S is even then you'll average the two in the middle.
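If it helps, here is a rough Java sketch of the per-pass step described above, assuming distinct integers for simplicity. The helper name and the Iterable standing in for re-reading the file are just illustrative; for the first pass you would call it with bound = Long.MAX_VALUE.

import java.util.PriorityQueue;

public class ChunkedSelection {
    // One pass of the scheme above: among all values strictly less than `bound`,
    // keep the x largest in a min-heap and return them (smallest at the head).
    static PriorityQueue<Integer> largestBelow(Iterable<Integer> stream, int x, long bound) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(x);
        for (int value : stream) {
            if (value >= bound) {
                continue;                 // already counted in an earlier pass
            }
            if (heap.size() < x) {
                heap.add(value);
            } else if (value > heap.peek()) {
                heap.poll();              // drop the smallest of the current x
                heap.add(value);
            }
        }
        return heap;                      // heap.peek() is the new bound for the next pass
    }
}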
Make a pass through the file and find the count of integers and the minimum and maximum integer values.
Take the midpoint of min and max, and get the count, min, and max for the values on either side of that midpoint, by again reading through the file.
The partition whose count shows that it contains the middle rank is the one the median lies in.
Repeat for that partition, taking into account the size of the 'partitions to the left' (easy to maintain), and also watching for min = max.
I'm sure this would work for an arbitrary number of partitions as well.
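Here is a rough Java sketch of that value-range bisection, assuming 32-bit integers; the Iterable is a placeholder for whatever re-reads the integers from the file on each pass. For the median you would call kthSmallest with k = (count + 1) / 2, and for an even count average the results for k = count/2 and k = count/2 + 1.

public class MedianByBisection {
    // Counts, in one pass over the data, how many values are <= x.
    static long countAtMost(Iterable<Integer> values, long x) {
        long count = 0;
        for (int v : values) {
            if (v <= x) {
                count++;
            }
        }
        return count;
    }

    // Finds the k-th smallest value (1-based) by bisecting the value range [min, max],
    // re-reading the data once per step: about 32 passes for 32-bit integers.
    static long kthSmallest(Iterable<Integer> values, long k, long min, long max) {
        while (min < max) {
            long mid = min + (max - min) / 2;
            if (countAtMost(values, mid) >= k) {
                max = mid;        // the k-th smallest is <= mid
            } else {
                min = mid + 1;    // the k-th smallest is > mid
            }
        }
        return min;
    }
}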
Do an on-disk external mergesort on the file to sort the integers (counting them if that's not already known).
Once the file is sorted, seek to the middle number (odd case), or average the two middle numbers (even case) in the file to get the median.
The amount of memory used is adjustable and unaffected by the number of integers in the original file. One caveat of the external sort is that the intermediate sorting data needs to be written to disk.
Given n = number of integers in the original file:
Running time: O(n log n)
Memory: O(1), adjustable
Disk: O(n)
Check out Torben's method here: http://ndevilla.free.fr/median/median/index.html. It also has an implementation in C at the bottom of the document.
My best guess is that a probabilistic median of medians would be the fastest. Recipe:
Take the next set of N integers (N should be big enough, say 1000 or 10000 elements).
Then calculate the median of these integers and assign it to the variable X_new.
If this is not the first iteration, calculate the median of the two medians:
X_global = (X_global + X_new) / 2
When you see that X_global no longer fluctuates much, you have found an approximate median of the data.
But there are some notes:
the question arises whether the median error is acceptable or not;
the integers must be distributed randomly in a uniform way for this solution to work.
EDIT:
I've played a bit with this algorithm and changed the idea slightly: in each iteration we should blend in X_new with a decreasing weight, such as:
X_global = k*X_global + (1.-k)*X_new
where k is in [0.5 .. 1.] and increases in each iteration.
The point is to make the calculation of the median converge quickly to some number in a very small number of iterations. That way a very approximate median (with a big error) is found among 100000000 array elements in only 252 iterations! Check this C experiment:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define ARRAY_SIZE 100000000
#define RANGE_SIZE 1000

// probabilistic median of medians method
// should print 5000 as data average
// from ARRAY_SIZE of elements
int main (int argc, const char * argv[]) {
    int iter = 0;
    int X_global = 0;
    int X_new = 0;
    int i = 0;
    float dk = 0.002;
    float k = 0.5;
    srand(time(NULL));

    while (i < ARRAY_SIZE && k != 1.) {
        X_new = 0;
        for (int j = i; j < i + RANGE_SIZE; j++) {
            X_new += rand() % 10000 + 1;
        }
        X_new /= RANGE_SIZE;

        if (iter > 0) {
            k += dk;
            k = (k > 1.) ? 1. : k;
            X_global = k * X_global + (1. - k) * X_new;
        }
        else {
            X_global = X_new;
        }

        i += RANGE_SIZE + 1;
        iter++;
        printf("iter %d, median = %d \n", iter, X_global);
    }
    return 0;
}
Oops, it seems I'm talking about the mean, not the median. If that's so, and you need exactly the median, not the mean, ignore my post. In any case, mean and median are closely related concepts.
Good luck.
Here is the algorithm described by @Rex Kerr implemented in Java.
/**
 * Computes the median.
 * @param arr Array of strings, each element represents a distinct binary number and has the same number of bits (padded with leading zeroes if necessary)
 * @return the median (number of rank ceil((m+1)/2)) of the array as a string
 */
static String computeMedian(String[] arr) {
    // rank of the median element
    int m = (int) Math.ceil((arr.length + 1) / 2.0);
    String bitMask = "";
    int zeroBin = 0;

    while (bitMask.length() < arr[0].length()) {
        // puts elements which conform to the bitMask into one of two buckets
        for (String curr : arr) {
            if (curr.startsWith(bitMask))
                if (curr.charAt(bitMask.length()) == '0')
                    zeroBin++;
        }
        // decides in which bucket the median is located
        if (zeroBin >= m) {
            bitMask = bitMask.concat("0");
        } else {
            m -= zeroBin;
            bitMask = bitMask.concat("1");
        }
        zeroBin = 0;
    }
    return bitMask;
}
Some test cases and updates to the algorithm can be found here.
I was also asked the same question and couldn't give an exact answer, so after the interview I went through some interview books, and here is what I found in the Cracking the Coding Interview book (a Java sketch of the heap idea follows the excerpt).
Example: Numbers are randomly generated and stored into an (expanding) array. How would you keep track of the median?
Our data structure brainstorm might look like the following:
• Linked list? Probably not. Linked lists tend not to do very well with accessing and sorting numbers.
• Array? Maybe, but you already have an array. Could you somehow keep the elements sorted? That's probably expensive. Let's hold off on this and return to it if it's needed.
• Binary tree? This is possible, since binary trees do fairly well with ordering. In fact, if the binary search tree is perfectly balanced, the top might be the median. But be careful: if there's an even number of elements, the median is actually the average of the middle two elements, and the middle two elements can't both be at the top. This is probably a workable algorithm, but let's come back to it.
• Heap? A heap is really good at basic ordering and keeping track of maxes and mins. This is actually interesting: if you had two heaps, you could keep track of the bigger half and the smaller half of the elements. The bigger half is kept in a min heap, such that the smallest element in the bigger half is at the root. The smaller half is kept in a max heap, such that the biggest element of the smaller half is at the root. Now, with these data structures, you have the potential median elements at the roots. If the heaps are no longer the same size, you can quickly "rebalance" the heaps by popping an element off one heap and pushing it onto the other.
Note that the more problems you do, the more developed your instinct for which data structure to apply will be. You will also develop a more finely tuned instinct as to which of these approaches is the most useful.
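To make the two-heap idea from the excerpt concrete, here is a small Java sketch of a running median (class and method names are just for illustration): a max-heap for the smaller half, a min-heap for the bigger half, rebalanced so their sizes differ by at most one.

import java.util.Collections;
import java.util.PriorityQueue;

public class RunningMedian {
    // Max-heap for the smaller half, min-heap for the bigger half.
    private final PriorityQueue<Integer> lower = new PriorityQueue<>(Collections.reverseOrder());
    private final PriorityQueue<Integer> upper = new PriorityQueue<>();

    public void add(int value) {
        if (lower.isEmpty() || value <= lower.peek()) {
            lower.add(value);
        } else {
            upper.add(value);
        }
        // Rebalance so the sizes differ by at most one.
        if (lower.size() > upper.size() + 1) {
            upper.add(lower.poll());
        } else if (upper.size() > lower.size() + 1) {
            lower.add(upper.poll());
        }
    }

    // Assumes at least one value has been added.
    public double median() {
        if (lower.size() == upper.size()) {
            return (lower.peek() + upper.peek()) / 2.0;  // average of the two middle elements
        }
        return lower.size() > upper.size() ? lower.peek() : upper.peek();
    }
}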