Random logic engine implementation ideas - algorithm

I'm trying to find an efficient random-selection algorithm for this scenario; it doesn't matter which programming language:
Say I have a 20-element array filled with numbers
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
From this I need to construct, each time, a 15-element array, BUT
each time I set numbers that must be in this new array, and the remaining slots are filled with random numbers from the master array.
For example:
Say the numbers that must be in the new array are: 1,11,13,20,8,9
so the new array will be:
[1,N,N,11,N,20,8,N,9,N,N,N,13,N,N]
Where the Ns are random numbers from ALL 20 elements of the Master array.
Another example:
given 2,18,17,9,5,
create a new 10-element array:
[2,2,18,2,11,17,20,5,5,9]
Duplicate elements are not a problem.
I'm trying to find a good algorithm for this.

If you want to receive one random number at a time and don't want to create the full result array up front, an alternative to my other answer is this:
Get a random index in the range 0..requested_number-1 (where requested_number is the number of elements still to fetch).
If this index is below length(required), output that element from the array required, then remove it from that array;
.. else the index falls outside required, so pick any random number out of the optional array.
Decrease requested_number and repeat until it reaches 0.
You need two calls to random: the first to pick an index within the remaining requested range, which decides from which array to take a value, and the second, when picking from optional, to choose a random element out of the entire available range.
Here is a basic implementation in C (footnote: using mod on rand() does not yield A Good Random Number, but it'll do for this example).
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main()
{
    int optional[] = { 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20 };
    int required[] = { 21,22,23,24,25 };
    int requested_number = 15;
    int take_from_required, optional_size, next;

    srand(time(NULL));

    if (requested_number < (int)(sizeof(required)/sizeof(required[0])))
    {
        printf ("requested number of elements must be at least as large as required array\n");
        return EDOM;
    }
    /* Use this much from 'required': */
    take_from_required = sizeof(required)/sizeof(required[0]);
    /* Use this much from 'optional': */
    optional_size = sizeof(optional)/sizeof(optional[0]);

    while (requested_number > 0)
    {
        /* Please note this is a fairly bad 'random'!
           As discussed many times before on SO. */
        next = rand() % requested_number;
        /* Take from which array? */
        if (next >= take_from_required)
        {
            printf ("%d\n", optional[rand() % optional_size]);
        } else
        {
            printf ("%d (required)\n", required[next]);
            required[next] = required[take_from_required-1];
            take_from_required--;
        }
        requested_number--;
    }
    return 0;
}

If I understand correctly, this is the issue:
optional [ 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20 ]
required [ 2,18,17,9,5 ]
Now construct a new array containing at least all elements of required, and filled to its capacity with elements taken from optional.
The problem seems to be that you need to take out random numbers from either required or optional and at the same time make sure required is empty at the end. [*]
Create a new array result (which needs to be at least as long as required -- then again, that can be inferred from the question). Copy all elements of required into it; fill the rest with random elements from optional.
At this point, you fulfill the primary condition, but the elements of required always appear first. So, as a last step, shuffle the elements now stored in the result array (for example, with the well-known Fisher-Yates shuffle).
[*] 'Empty', because all numbers in required must be used at least once. Taking them "out" of the array is the easiest way to make sure this happens. Things start to get complicated when (a) you may have duplicates of any number (from both optional and required) and (b) required is not a subset of optional.
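A minimal Java sketch of this copy-fill-shuffle approach (the method name and signature are mine; it assumes size >= required.length):

import java.util.Random;

static int[] build(int[] required, int[] optional, int size) {
    Random rnd = new Random();
    int[] result = new int[size];
    // Copy all required elements first.
    System.arraycopy(required, 0, result, 0, required.length);
    // Fill the remaining slots with random picks from optional (duplicates are fine).
    for (int i = required.length; i < size; i++) {
        result[i] = optional[rnd.nextInt(optional.length)];
    }
    // Fisher-Yates shuffle so the required elements don't all sit at the front.
    for (int i = size - 1; i > 0; i--) {
        int j = rnd.nextInt(i + 1);
        int tmp = result[i]; result[i] = result[j]; result[j] = tmp;
    }
    return result;
}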

Converting Scratch to Algorithm

I am learning algorithms for the first time and trying to figure them out with Scratch. I am following tutorials on the Scratch wiki. How can I convert this to an algorithm (with a flow chart or normal steps)? Especially the loop (I uploaded it as a picture).
I started:
Step 1: Start
Step 2: Init: delete all of numbers, iterator, amount, sum
Step 3: How many numbers do you want?
Step 4: Initialize sum=0, amount=0, iterator=1
Step 5: Enter the element values
Step 6: Find the sum by using a loop over the array and updating the sum value; the loop must continue (no. of elements - 1) times
Step 7: avg = sum / no. of elements
Step 8: Print the average value
I don't think it's right. I mean, I feel there are errors? Thank you for your time.
Scratch
Here is the algorithm of variant 2 (see the Java algorithm below) in Scratch. The output should be identical.
Java
Here is the algorithm in Java, where I commented the steps; this should give you a step-by-step guide on how to do it in Scratch as well.
I have also implemented two variants of the algorithm to show you some considerations that a programmer often has to weigh when implementing an algorithm, which are mainly time (= time required for the algorithm to complete) and space (= memory used on your computer).
Please note: the following algorithms do not handle errors. E.g. if a user entered a instead of a number, the program would crash. It is easy to adjust the program to handle this, but for simplicity I did not do that.
Variant 1: Storing all elements in array numbers
This variant stores all numbers in an array numbers and calculates the sum at the end using those numbers, which is slower than variant 2 because the algorithm goes over all the numbers twice. The upside is that you preserve all the numbers the user entered and can use them later on if you need to, but you need storage to hold those values.
public static void yourAlgorithm() {
    // needed in Java to get input from user
    var sc = new Scanner(System.in);
    // print to screen (equivalent to "say"/ "ask")
    System.out.print("How many numbers do you want? ");
    // get amount of numbers as answer from user
    var amount = sc.nextInt();
    // create array to store all elements
    var numbers = new int[amount];
    // set iterator to 1
    int iterator = 1;
    // as long as the iterator is smaller or equal to the number of required numbers, keep asking for new numbers
    // equivalent to "repeat amount" except that retries are possible if no number was entered
    while (iterator <= amount) {
        // ask for a number
        System.out.printf("%d. number: ", iterator);
        // insert the number at position iterator - 1 in the array
        numbers[iterator - 1] = sc.nextInt();
        // increase iterator by one
        iterator++;
    }
    // calculate the sum after all the numbers have been entered by the user
    int sum = 0;
    // go over all numbers again! (this is why it is slower) and calculate the sum
    for (int i = 0; i < amount; i++) {
        sum += numbers[i];
    }
    // print average to screen
    System.out.printf("Average: %s / %s = %s", sum, amount, (double)sum / (double)amount);
}
Variant 2: Calculating sum when entering new number
This algorithm does not store the numbers the user enters but immediately uses the input to calculate the sum, hence it is faster as only one loop is required and it needs less memory as the numbers do not need to be stored.
This would be the best solution (fastest, least space/ memory needed) in case you do not need all the numbers the user entered later on.
// needed in Java to get input from user
var sc = new Scanner(System.in);
// print to screen (equivalent to "say"/ "ask")
System.out.print("How many numbers do you want? ");
// get amount of numbers as answer from user
var amount = sc.nextInt();
// set iterator to 1
int iterator = 1;
int sum = 0;
// as long as the iterator is smaller or equal to the number of required numbers, keep asking for new numbers
// equivalent to "repeat amount" except that retries are possible if no number was entered (e.g. character was entered instead)
while (iterator <= amount) {
    // ask for a number
    System.out.printf("%d. number: ", iterator);
    // get number from user
    var newNumber = sc.nextInt();
    // add the new number to the sum
    sum += newNumber;
    // increase iterator by one
    iterator++;
}
// print average to screen
System.out.printf("Average: %s / %s = %s", sum, amount, (double)sum / (double)amount);
Variant 3: Combining both approaches
You could also combine both approaches, i.e. calculating the sum within the first loop and additionally storing the values in a numbers array so you can use them later on if you need to; a sketch follows below.
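A minimal sketch of what this combined variant could look like (same Scanner setup as the variants above; this sketch is mine, not part of the original answer):

public static void yourAlgorithmCombined() {
    var sc = new Scanner(System.in);
    System.out.print("How many numbers do you want? ");
    var amount = sc.nextInt();
    // store the values AND accumulate the sum in the same loop
    var numbers = new int[amount];
    int sum = 0;
    for (int i = 0; i < amount; i++) {
        System.out.printf("%d. number: ", i + 1);
        numbers[i] = sc.nextInt();
        sum += numbers[i]; // accumulate while storing
    }
    // print average to screen; the numbers array is kept for later use
    System.out.printf("Average: %s / %s = %s", sum, amount, (double)sum / (double)amount);
}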
Expected output

How to construct the longest palindrome from a list of numbers?

I am trying to solve a question which says that we need to write a function that, given a list of numbers, finds the longest palindrome we can build using only the numbers in the list.
For example:
If the given list is : [3,47,6,6,5,6,15,22,1,6,15]
The longest palindrome that we can return is one of length 9, such as [6,15,6,3,47,3,6,15,6].
Additionally, we have the following constraints:
One can only use an array queue, an array stack, and a chaining hash map, plus the list we are supposed to return; the function must run in linear time, and we can use only constant additional space.
My approach was the following:
Since a palindrome can be formed if we have an even number of certain values, we can iterate over all the elements in the list and store in a chaining hash map the number of times each number appears in the list. This should take O(N) time, since each lookup in the chaining hash map takes constant time and iterating over the list takes linear time.
Then we can iterate over all the numbers in the chaining hash map to see which numbers appear an even number of times, and accordingly just make a palindrome. In the worst case, this will take O(n) linear time.
Now there are two things I am wondering:
How should I make the actual palindrome? Like how do I use the data structures that I am being allowed to use in order to make a palindrome? I am thinking that since the queue is a FIFO data structure and the stack is LIFO, for each number that occurs an even number of times, we add it once to the queue and once to the stack, and so on and so forth. And finally, we can just dequeue everything from the queue, pop everything from the stack, and add it all to the list!
It seems that with my approach, it is taking me two linear runs to solve the question. I am wondering if there is a faster way to do this.
Any and all help will be appreciated. Thanks!
It is not possible to get a better algorithm than one that is O(n), as every number in the input has to be inspected, as it might provide a possibility for a longer palindrome. If indeed the output must be a longest palindrome itself (and not only its length), then producing that output itself represents O(n).
You have also omitted one additional thing you have to do in your algorithm: there can be one value in the final palindrome that occurs only once (in the centre). So whenever you encounter a value that occurs an odd number of times, you may reserve one occurrence of that value for putting in the middle of an odd-length palindrome. The even remainder of the occurrences can be used as usual.
As to your questions:
How should I make the actual palindrome?
There are many ways to do it. But don't forget that if you have an even number of occurrences, you should use all those occurrences, not just two. So add half of them to the queue and half of them to the stack. When the frequency is odd, then still do the same (rounded down), and log the number also as a potential centre value.
When you have done this for all values, then dump the queue and stack together in the result list as you suggested, but don't forget to put the centre value in between the two, if you identified such a centre value (i.e. when not all occurrences were even).
It seems that with my approach, it is taking me two linear runs to solve the question.
You cannot do this better than with a linear time complexity. You can save a bit of time if you use the stack also for the result, and just dump the queue onto the stack (after potentially pushing the centre value).
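For illustration, here is a minimal Java sketch of the construction described above, using the built-in collections as stand-ins for the allowed array queue, array stack, and chaining hash map (the method name and structure are mine):

import java.util.*;

public static List<Integer> longestPalindrome(int[] numbers) {
    Map<Integer, Integer> counts = new HashMap<>(); // chaining hash map: value -> frequency
    for (int n : numbers) counts.merge(n, 1, Integer::sum);

    Queue<Integer> firstHalf = new ArrayDeque<>();  // FIFO: left half of the palindrome
    Deque<Integer> secondHalf = new ArrayDeque<>(); // LIFO: right half, mirrored
    Integer centre = null;
    for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
        int pairs = e.getValue() / 2;
        for (int i = 0; i < pairs; i++) {   // half of the occurrences go to each side
            firstHalf.add(e.getKey());
            secondHalf.push(e.getKey());
        }
        if (e.getValue() % 2 != 0) centre = e.getKey(); // odd leftover: centre candidate
    }

    List<Integer> result = new ArrayList<>(firstHalf);
    if (centre != null) result.add(centre); // middle value for an odd-length palindrome
    while (!secondHalf.isEmpty()) result.add(secondHalf.pop());
    return result;
}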
I've got a solution where it's a palindrome only for the numbers and not the digits.
For the input [51,15]
we will return [15] || [51] and not [51,15] => (5,1,1,5).
Furthermore, your example has a problem: 3 doesn't appear twice (and yet appears twice in the answer).
Or maybe I didn't understand the question.
public static int[] polidrom(int[] numbers) {
    HashMap<Integer/*the number*/, Integer/*how many times it appeared*/> hash = new HashMap<>();
    boolean middleFree = false;
    int middleNumber = 0;
    Stack<Integer> stack = new Stack<>();
    for (int num : numbers) { // count how many times each number appears
        if (hash.containsKey(num)) { hash.replace(num, 1 + hash.get(num)); }
        else { hash.put(num, 1); }
    }
    for (Integer num : hash.keySet()) { // how many pairs of each number can be used
        int pairs = hash.get(num) / 2;
        if (hash.get(num) % 2 != 0) { middleNumber = num; middleFree = true; } // odd one out may go in the middle
        for (int i = 0; i < pairs; i++) { // push one copy per pair; each pop fills two slots
            stack.push(num);
        }
    }
    int[] answer = new int[2 * stack.size() + (middleFree ? 1 : 0)]; // the middle doesn't need a pair
    int startPointer = 0, endPointer = answer.length - 1;
    while (!stack.isEmpty()) { // mirror each pair around the centre
        int num = stack.pop();
        answer[startPointer] = num;
        answer[endPointer] = num;
        startPointer++;
        endPointer--;
    }
    if (middleFree) { answer[answer.length / 2] = middleNumber; }
    return answer;
}
Space: O(n) => {stack, hashMap, answer array}
Complexity: O(n)
You can skip the part where I used the stack and build the answer array in the same loop, as sketched below.
And I can't think of a way where you will not iterate at least twice;
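For example, the stack-free variant might look like this (a sketch of mine; it gathers the pair count and middle value in a first pass over the map, then fills the answer from both ends):

public static int[] polidromNoStack(int[] numbers) {
    HashMap<Integer, Integer> hash = new HashMap<>();
    for (int num : numbers) { hash.merge(num, 1, Integer::sum); }
    int pairs = 0, middleNumber = 0;
    boolean middleFree = false;
    for (int num : hash.keySet()) { // first pass over the map: sizes only
        pairs += hash.get(num) / 2;
        if (hash.get(num) % 2 != 0) { middleNumber = num; middleFree = true; }
    }
    int[] answer = new int[2 * pairs + (middleFree ? 1 : 0)];
    int start = 0, end = answer.length - 1;
    for (int num : hash.keySet()) { // second pass: write each pair to mirrored positions
        for (int i = 0; i < hash.get(num) / 2; i++) {
            answer[start++] = num;
            answer[end--] = num;
        }
    }
    if (middleFree) { answer[answer.length / 2] = middleNumber; }
    return answer;
}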
Hope I've helped

Sorting and Counting Elements in OpenCL

I want to create an OpenCL kernel that sorts and counts millions of ulongs.
Is there a particular algorithm that fits my needs, or should I go for a hash table?
To be clear, given the following input:
[42, 13, 9, 42]
I would like to get an output like this:
[(9,1), (13,1), (42,2)]
My first idea was to modify Counting Sort - which already counts in order to sort - but because of the wide range of ulongs it requires too much memory. Bitonic or radix sort plus something to count elements could be a way, but I'm missing a fast way to count the elements. Any suggestions on this?
Extra notes:
I'm developing using an NVIDIA Tesla K40C GPU and a Terasic DE5-Net FPGA. So far the main goal is to make it work on the GPU but I'm also interested in solutions that might be a nice fit for FPGAs.
I know that some values inside the range of ulong aren't used so we can use them to mark invalid elements or duplicates.
I want to consume the output from the GPU using multiple threads on the CPU, so I would like to avoid any solution that requires post-processing (on the host side, I mean) with data dependencies spread around the output.
This solution requires two passes of the bitonic sort to both count the duplicates and remove them (well, move them to the end of the array). Bitonic sort has O(log(n)^2) parallel depth, so this will run with depth 2*log(n)^2, which shouldn't be a problem unless you are running this in a loop.
Create a simple struct for each of the elements, holding the number of duplicates and whether the element has already been counted as a duplicate, something like:
// Note: If you are worried about space, or know that there
// will only be a few duplicates for each element, then
// make the count element smaller
typedef struct {
    cl_ulong value;
    cl_ulong count : 63;
    cl_ulong seen  : 1;
} Element;
Algorithm:
You can start by creating a comparison function which will move duplicates to the end, and fold a duplicate's count into the element's total if it has not yet been added. This is the logic behind the comparison function:
If one element is a duplicate and another is not, return that the non-duplicate element is smaller (regardless of the values), which will move all duplicates to the end.
If the elements are duplicates and the right element has not been marked a duplicate (seen=0), then add the right element's count to the left element's count and set the right element as a duplicate (seen=1). This has the effect of moving the total count of an element with a specific value to the leftmost element in the array with that value, and all duplicates with that value to the end.
Otherwise, return that the element with the smaller value is smaller.
The comparison function would look like:
bool compare(Element* E1, Element* E2) {
    if (!E1->seen && E2->seen) return true;  // E1 smaller
    if (!E2->seen && E1->seen) return false; // E2 smaller
    // If the elements are duplicates and the right element has
    // not yet been "seen" by an element with the same value
    if (E1->value == E2->value && !E2->seen) {
        E1->count += E2->count;
        E2->seen = 1;
        return true;
    }
    // They aren't duplicates, and either
    // neither has been seen, or both have
    return E1->value < E2->value;
}
Bitonic sort has a specific structure, which can be nicely illustrated with a diagram. In the diagram, each element is referred to by a 3-tuple (a,b,c) where a = value, b = count, and c = seen.
Each diagram shows one run of bitonic sort on the array (vertical lines denote a comparison between elements, and horizontal lines move right to the next stage of the bitonic sort). Using the diagram and the above comparison function and logic, you should be able to convince yourself that this does what is required.
Run 1:
Run 2:
At the end of run 2, all elements are arranged by value. Duplicates with seen = 1 are at the end, and duplicates with seen = 0 are in their correct place and count is the number of other elements with the same value.
Implementation:
The diagrams are color coded to illustrate the sub-processes of bitonic sort. I'll call the blue blocks a phase (there are three phases in each run in the diagrams). In general, there will be ceil(log(N)) phases for each run. Each phase consists of a number of green blocks (I'll call these out-in blocks, because the shape of the comparisons goes out to in), and red blocks (I'll call these constant blocks, because the distance between compared elements remains constant).
From the diagram, the out-in block size (elements in each block) starts at 2 and doubles in each phase. The constant block size for each pass starts at half the out-in block size (in the second (blue block) phase, there are 2 elements in each of the four red blocks, because the green blocks have a size of 4) and halves for each successive vertical line of red blocks within the phase. Also, the number of successive vertical lines of the constant (red) blocks in a phase is always the same as the phase number with 0 indexing (0 vertical lines of red blocks for phase 0, 1 vertical line of red blocks for phase 1, and 2 vertical lines of red blocks for phase 2 -- each vertical line is an iteration of calling that kernel).
You can then make kernels for the out-in passes and the constant passes, and invoke the kernels from the host side (you need to synchronise between passes, which is a disadvantage, but you should still see large performance improvements over sequential implementations).
From the host side, the overall bitonic sort might look like:
cl_uint num_elements = 4; // Set number of elements
cl_uint phases = (cl_uint)ceil((float)log2(num_elements));
cl_uint out_in_block_size = 2;
cl_uint constant_block_size;
size_t global_work_size = num_elements / 2; // one work-item per comparison
size_t local_work_size = 1;                 // tune for your device
// Set the elements_buffer, which should have been created
// with clCreateBuffer, as the first kernel argument, and the
// number of elements as the second kernel argument
clSetKernelArg(out_in_kernel, 0, sizeof(cl_mem), (void*)(&elements_buffer));
clSetKernelArg(out_in_kernel, 1, sizeof(cl_uint), (void*)(&num_elements));
clSetKernelArg(constant_kernel, 0, sizeof(cl_mem), (void*)(&elements_buffer));
clSetKernelArg(constant_kernel, 1, sizeof(cl_uint), (void*)(&num_elements));
// For each phase
for (unsigned int phase = 0; phase < phases; ++phase) {
    // -------------------- Green Part ------------------------ //
    // Set the out_in_block size for the kernel
    clSetKernelArg(out_in_kernel, 2, sizeof(cl_uint), (void*)(&out_in_block_size));
    // Call the kernel - command_queue is the cl_command_queue
    // which should have been created during cl setup
    clEnqueueNDRangeKernel(command_queue    , // cl_command_queue
                           out_in_kernel    , // The kernel
                           1                , // Work dim = 1 since 1D array
                           NULL             , // No global offset
                           &global_work_size,
                           &local_work_size ,
                           0                ,
                           NULL             ,
                           NULL);
    clFinish(command_queue); // Synchronise on the host side
    // ---------------------- End Green Part -------------------- //
    // Set the block size for constant blocks based on the out_in_block_size
    constant_block_size = out_in_block_size / 2;
    // -------------------- Red Part ------------------------ //
    for (unsigned int i = 0; i < phase; ++i) {
        // Set the constant_block_size as a kernel argument
        clSetKernelArg(constant_kernel, 2, sizeof(cl_uint), (void*)(&constant_block_size));
        // Call the constant kernel
        clEnqueueNDRangeKernel(command_queue    , // cl_command_queue
                               constant_kernel  , // The kernel
                               1                , // Work dim = 1 since 1D array
                               NULL             , // No global offset
                               &global_work_size,
                               &local_work_size ,
                               0                ,
                               NULL             ,
                               NULL);
        clFinish(command_queue); // Synchronise on the host side
        // Update constant_block_size for next iteration
        constant_block_size /= 2;
    }
    // ------------------- End Red Part ---------------------- //
    // Double the out-in block size for the next phase
    out_in_block_size *= 2;
}
And then the kernels would be something like the following (you also need to put the struct typedef in the kernel file so that the OpenCL compiler knows what 'Element' is):
__kernel void out_in_kernel(__global Element* elements, unsigned int num_elements, unsigned int block_size) {
    const unsigned int idx_upper = /* index of upper element in diagram */;
    const unsigned int idx_lower = /* index of lower element in diagram */;
    // Check that both indices are in range (this depends on thread mapping)
    if (idx_upper is in range && idx_lower is in range) {
        // Do the comparison
        if (!compare(elements + idx_upper, elements + idx_lower)) {
            // Swap the elements
        }
    }
}
The constant_kernel will look the same, but the thread mapping (how you determine idx_upper and idx_lower) will be different. There are many ways you can map the threads to the elements generally to mimic the diagrams (note that the number of threads required is half the total number of elements, since each thread can do one comparison).
Another consideration is how to make the thread mapping general (so that if you have a number of elements which is not a power of two the algorithm doesn't break).
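For illustration, here is a sequential Java sketch of one possible thread mapping for the out-in (green) passes. This is my own sketch: it uses plain long values and a plain less-than comparison instead of the Element/compare pair above, and the loop index t plays the role of the work-item id on the GPU:

// One out-in pass with block size B over an array of length n.
static void outInPass(long[] values, int B, int n) {
    int half = B / 2;
    for (int t = 0; t < n / 2; t++) {                 // t = work-item id
        int block = t / half, offset = t % half;
        int idxUpper = block * B + offset;            // top of the block
        int idxLower = block * B + (B - 1 - offset);  // mirrored partner: "out to in"
        if (idxLower < n && values[idxUpper] > values[idxLower]) {
            long tmp = values[idxUpper];              // swap out-of-order pair
            values[idxUpper] = values[idxLower];
            values[idxLower] = tmp;
        }
    }
}

A constant (red) pass with block size C would instead pair index block*C + offset with the element C/2 positions further on, rather than the mirrored partner.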
How about boost.compute or VexCL? Both provide sorting algorithms.
Mergesort works quite well on GPUs, and you could modify it to sort key+count pairs instead of keys only. During merging you would then also check whether two keys are identical, and if so, fuse them into a single key. (If you merge [9/c:1, 42/c:1] and [13/c:1, 42/c:1] you would get [9/c:1, 13/c:1, 42/c:2].)
You might have to use parallel prefix sum to remove the gaps caused by fusing keys.
Or: Use a regular GPU sort first, mark all keys where the key to its right is different (this is only true at the last key of each unique key), use parallel prefix sum to get consecutive indexes for all unique keys and note their position in the sorted array. Then you only need to subtract the index of the previous unique key to get the count.
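For illustration, here is a sequential Java sketch of that last idea (mark the last occurrence of each unique key, prefix-sum the marks, and subtract neighbouring boundary indices to get the counts). On the GPU, each loop would become a data-parallel kernel and the prefix sum a parallel scan; the method name is mine:

// Turn a sorted array into (value, count) pairs via boundary marks + prefix sum.
static long[][] countSorted(long[] sorted) {
    int n = sorted.length;
    int[] isLast = new int[n]; // 1 at the last occurrence of each unique key
    for (int i = 0; i < n; i++)
        isLast[i] = (i == n - 1 || sorted[i] != sorted[i + 1]) ? 1 : 0;
    int[] prefix = new int[n]; // inclusive prefix sum of the marks
    int running = 0;
    for (int i = 0; i < n; i++) { running += isLast[i]; prefix[i] = running; }
    long[][] out = new long[running][2]; // one (value, count) pair per unique key
    int prevEnd = -1; // last index of the previous unique key
    for (int i = 0; i < n; i++) {
        if (isLast[i] == 1) {
            out[prefix[i] - 1][0] = sorted[i];
            out[prefix[i] - 1][1] = i - prevEnd; // count = distance between boundaries
            prevEnd = i;
        }
    }
    return out;
}

For the example input, countSorted(new long[]{9, 13, 42, 42}) yields [(9,1), (13,1), (42,2)].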

Data structure for set of (non-disjoint) sets

I'm looking for a data structure that roughly corresponds to (in Java terms) Map<Set<int>, double>. Essentially a set of sets of labeled marbles, where each set of marbles is associated with a scalar. I want it to be able to efficiently handle the following operations:
Add a given integer to every set.
Remove every set that contains (or does not contain) a given integer, or at least set the associated double to 0.
Union two of the maps, adding together the doubles for sets that appear in both.
Multiply all of the doubles by a given double.
Rarely, iterate over the entire map.
under the following conditions:
The integers will fall within a constrained range (between 1 and 10,000 or so); the exact range will be known at compile-time.
Most of the integers within the range (80-90%) will never be used, but which ones will not be easily determinable until the end of the calculation.
The number of integers used will almost always still be over 100.
Many of the sets will be very similar, differing only by a few elements.
It may be possible to identify certain groups of integers that frequently appear only in sequential order: for example, if a set contains the integers 27 and 29 then it (almost?) certainly contains 28 as well.
It may be possible to identify these groups prior to running the calculation.
These groups would typically have 100 or so integers.
I've considered tries, but I don't see a good way to handle the "remove every set that contains a given integer" operation.
The purpose of this data structure would be to represent discrete random variables and permit addition, multiplication, and scalar multiplication operations on them. Each of these discrete random variables would ultimately have been created by applying these operations to a fixed (at compile-time) set of independent Bernoulli random variables (i.e. each takes the value 1 or 0 with some probability).
The systems being modeled are close to being representable as time-inhomogeneous Markov chains (which would of course simplify this immensely) but, unfortunately, it is essential to track the duration since various transitions.
Here's a data structure that can do all of your operations pretty efficiently; I'm going to refer to it as a BitmapArray for this explanation.
Thinking about it, for just the operations you have described, a sorted array with bitmaps as keys and weights (your doubles) as values will be pretty efficient.
The bitmaps are what maintain membership in your set. Since you said the integers fall in the range 1-10,000, we can maintain information about any set with a bitmap of length 10,000.
It's gonna be tough sorting an array where the keys can be as big as 2^10000, but you can be smart about implementing the comparison function in the following way:
Iterate from left to right over the two bitmaps.
XOR the bits at each index.
Say you get a 1 at the i-th position:
whichever bitmap has a 1 at the i-th position is greater.
If you never get a 1, they're equal.
I know this is still a slow comparison.
But not too slow; here's a benchmark fiddle I did on bitmaps of length 10000.
This is in JavaScript; if you're going to write it in Java, it's going to perform even better.
function runTest() {
    var num = document.getElementById("txtValue").value;
    num = isNaN(num * 1) ? 0 : num * 1;
    /* For integers in the range 1-10,000 the worst case for comparison is two equal
       integers, which causes the comparison to iterate over the whole BitArray */
    var bitmap1 = convertToBitmap(10000, num);
    var bitmap2 = convertToBitmap(10000, num);
    var before = Date.now();
    var result = firstIsGreater(bitmap1, bitmap2, 10000);
    var after = Date.now();
    alert(result + " in time: " + (after - before) + " ms");
}

function convertToBitmap(size, number) {
    var bits = new Array();
    var q = number;
    do {
        bits.push(q % 2);
        q = Math.floor(q / 2);
    } while (q > 0);
    var xbitArray = new Array();
    for (var i = 0; i < size; i++) {
        xbitArray.push(0);
    }
    var j = xbitArray.length - 1;
    for (var i = bits.length - 1; i >= 0; i--) {
        xbitArray[j] = bits[i];
        j--;
    }
    return xbitArray;
}

function firstIsGreater(bitArray1, bitArray2, lengthOfArrays) {
    for (var i = 0; i < lengthOfArrays; i++) {
        if (bitArray1[i] ^ bitArray2[i]) {
            if (bitArray1[i]) return true;
            else return false;
        }
    }
    return false;
}

document.getElementById("btnTest").onclick = function (e) {
    runTest();
};
Also, remember that you only have to do this once, when building your BitmapArray (or while taking unions), and then it becomes pretty efficient for the operations you'd do most often:
Note: N is the length of the BitmapArray.
Add integer to every set: Worst/best case O(N) time. Flip a 0 to 1 in each bitmap.
Remove every set that contains a given integer: Worst case O(N) time.
For each bitmap, check the bit that represents the given integer; if it is 1, mark its index.
Compress the array by deleting all marked indices.
If you're okay with just setting the weights to 0 it'll be even more efficient. This also makes it very easy if you want to remove all sets that have any element in a given set.
Union of two maps: Worst case O(N1+N2) time. Just like merging two sorted arrays, except you have to be smart about comparisons once more.
Multiply all of the doubles by a given double: Worst/best case O(N) time. Iterate and multiply each value by the input double.
Iterate over the BitmapArray: Worst/best case O(1) time for next element.
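A minimal Java sketch of the BitmapArray idea under these assumptions (the class and method names are mine; java.util.BitSet stands in for the fixed-range bitmap):

import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class BitmapArray {
    static class Entry {
        BitSet set;    // membership bitmap over the 1..10,000 range
        double weight; // the associated double
        Entry(BitSet s, double w) { set = s; weight = w; }
    }

    final List<Entry> entries = new ArrayList<>(); // kept sorted by compare()

    // Add a given integer to every set: O(N) bit flips.
    void addToAll(int x) {
        for (Entry e : entries) e.set.set(x);
    }

    // Set the weight to 0 for every set containing x (cheaper than compacting).
    void zeroContaining(int x) {
        for (Entry e : entries) if (e.set.get(x)) e.weight = 0;
    }

    // Multiply all of the weights by a given double: O(N).
    void scale(double k) {
        for (Entry e : entries) e.weight *= k;
    }

    // Lexicographic bitmap comparison, as described above.
    static int compare(BitSet a, BitSet b) {
        BitSet diff = (BitSet) a.clone();
        diff.xor(b);                 // first differing bit decides
        int i = diff.nextSetBit(0);
        if (i < 0) return 0;         // identical bitmaps
        return a.get(i) ? 1 : -1;
    }
}

A union would then be a merge of two such sorted lists, adding the weights together whenever compare returns 0.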

Get N samples given iterator

Given are an iterator it over data points, the number of data points we have n, and the maximum number of samples we want to use to do some calculations (maxSamples).
Imagine a function calculateStatistics(Iterator it, int n, int maxSamples). This function should use the iterator to retrieve the data and do some (heavy) calculations on the data element retrieved.
if n <= maxSamples we will of course use each element we get from the iterator
if n > maxSamples we will have to choose which elements to look at and which to skip
I've been spending quite some time on this. The problem is of course how to choose when to skip an element and when to keep it. My approaches so far:
I don't want to take the first maxSamples coming from the iterator, because the values might not be evenly distributed.
Another idea was to use a random number generator to create maxSamples (distinct) random numbers between 0 and n and take the elements at these positions. But if e.g. n = 101 and maxSamples = 100, it gets more and more difficult to find a new distinct number not yet in the list, losing a lot of time just on random number generation.
My last idea was to do the contrary: to generate n - maxSamples random numbers and exclude the data elements at these positions. But this also doesn't seem to be a very good solution.
Do you have a good idea for this problem? Are there maybe standard known algorithms for this?
To provide some answer: a good way to collect a set of random numbers, given collection size > elements needed, is the following (in C++-ish pseudocode).
EDIT: you may need to iterate over and create the "someElements" vector first. If your elements are large they can be "pointers" to these elements to save space.
vector<int> randomCollectionFromVector(vector<int> someElements, int numElementsToGrab) {
    vector<int> resultVector;
    while (numElementsToGrab--) {
        int randPosition = rand() % someElements.size();
        resultVector.push_back(someElements[randPosition]);
        someElements.erase(someElements.begin() + randPosition); // don't pick it twice
    }
    return resultVector;
}
If you don't care about changing your vector of elements, you could also remove random elements from someElements, as you mentioned. The algorithm would look very similar, and again, this is conceptually the same idea, you just pass someElements by reference, and manipulate it.
Something worth noting: the quality of pseudo-random distributions (how random they actually look) grows with the number of random draws you make. So you may tend to get better results if you pick whichever method results in the use of more random numbers. Example: if you have 100 values and need 99, you should probably pick 99 values, as this results in using 99 pseudo-random numbers instead of just 1. Conversely, if you have 1000 values and need 99, you should probably prefer the version where you remove 901 values, because it uses more numbers from the pseudo-random distribution. If what you want is a solid random distribution, this is a very simple optimization that will greatly increase the quality of the "fake randomness" you see. Alternatively, if performance matters more than distribution quality, take the other version, or even just grab the first 99 values.
interval = n / (n - maxSamples)   // an integer (Euclidean) division, of course
offset = random(0..(n-1))         // a random number between 0 and n-1
totalSkip = 0
indexSample = 0
FOR it IN samples DO
    indexSample++                 // goes from 1 to n
    IF totalSkip < (n - maxSamples) AND (indexSample + offset) % interval == 0 THEN
        // do nothing with this sample
        totalSkip++
    ELSE
        // work with this sample
    ENDIF
ENDFOR
ASSERT(totalSkip == n - maxSamples) // to be sure
interval represents the distance between two samples to skip.
offset is not mandatory, but it adds a little variety.
Based on the discussion, and greater understanding of your problem, I suggest the following. You can take advantage of a property of prime numbers that I think will net you a very good solution, that will appear to grab pseudo random numbers. It is illustrated in the following code.
#include <iostream>
using namespace std;

int main() {
    const int SOME_LARGE_PRIME = 577; // This prime should be larger than the size of your data set.
    const int NUM_ELEMENTS = 100;
    int lastValue = 0;
    for (int i = 0; i < NUM_ELEMENTS; i++) {
        lastValue += SOME_LARGE_PRIME;
        cout << lastValue % NUM_ELEMENTS << endl;
    }
}
Using the logic presented here, you can generate all values from 1 to NUM_ELEMENTS. Because of the properties of prime numbers, you will not get any duplicates until you rotate all the way around back to the size of your data set. If you then take the first NUM_SAMPLES of these and sort them, you can iterate through your data structure and grab a pseudo-random distribution of numbers (not very good randomness, but better than a pre-determined interval) without extra space and with only one pass over your data. Better yet, you can change the layout of the distribution by grabbing a random prime number each time; again, it must be larger than your data set, or the following example breaks.
PRIME = 3, data set size = 99. Won't work.
Of course, ultimately this is very similar to the pre-determined interval, but it inserts a level of randomness that you do not get by simply grabbing every "size/num_samples"th element.
This is called reservoir sampling.
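A minimal Java sketch of reservoir sampling (Algorithm R): one pass over the iterator, O(maxSamples) extra space, and every element ends up in the sample with equal probability. The method name is mine:

import java.util.Iterator;
import java.util.Random;

static int[] sample(Iterator<Integer> it, int maxSamples) {
    Random rnd = new Random();
    int[] reservoir = new int[maxSamples];
    int seen = 0;
    while (it.hasNext()) {
        int value = it.next();
        if (seen < maxSamples) {
            reservoir[seen] = value;       // fill the reservoir first
        } else {
            int j = rnd.nextInt(seen + 1); // replace with decreasing probability
            if (j < maxSamples) reservoir[j] = value;
        }
        seen++;
    }
    return reservoir; // if seen < maxSamples, only the first 'seen' slots are valid
}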
