I have a piece of code that randomizes the order of words in a text file, but I'm not sure exactly what it's doing. Here is the code.
Randomize()
For count = 1 To 10
rand = (Int((10 - count + 1) * Rnd() + count))
temp = words(count)
words(count) = words(rand)
words(rand) = temp
Next
Could somebody please explain this to me? Thanks in advance.
First, check the MSDN description of Rnd and note:
The Rnd function returns a value less than 1 but greater than or equal to 0
and
To produce random integers in a given range, use this formula: Int((upperbound - lowerbound + 1) * Rnd + lowerbound)
With this in mind, we can see the following algorithm:
1. Set the current word index to 1 (the first word).
2. Pick a random index that is equal to or greater than the current one (i.e. select a word from the rest of the array).
3. Swap the current word with the randomly picked one.
4. Increase the current word index (i.e. reduce the size of the pool of unassigned words).
5. Go to #2.
You can also use a slightly different description:
Imagine that you have an unordered set of words: you pick a random one, remove it from the set, and append it to an ordered array. In the end you will have a randomly ordered array of the words from the original set.
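This is essentially a forward Fisher-Yates shuffle. For reference, here is the same idea as a small Python sketch (my own illustration, not the original VB; it uses 0-based indexing instead of VB's 1-based):
import random

def shuffle_words(words):
    """In-place forward Fisher-Yates shuffle."""
    n = len(words)
    for count in range(n):
        # pick a random index from the not-yet-fixed tail [count, n-1]
        rand = random.randint(count, n - 1)
        # swap the current word with the randomly picked one
        words[count], words[rand] = words[rand], words[count]
    return words

# e.g. shuffle_words(["the", "quick", "brown", "fox"])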
When drawing values at random from a set in succession, where a drawn value is allowed to be drawn again, a given value has (of course) a small chance of being drawn twice (or more) in immediate succession. That causes an issue for the purposes of a given application, and we would like to eliminate this chance. Any algorithmic ideas on how to do so (simple/efficient)?
Ideally we would like to set a threshold, say as a percentage of the size of the data set:
Say the size of the set of values is N=100 and the threshold is T=10%; then if a given value is drawn in the current draw, it is guaranteed not to show up again in the next N*T=10 draws.
Obviously this restriction introduces bias into the random selection. We don't mind that a proposed algorithm introduces further bias into the randomness of the selection; what really matters for this application is that the selection is just random enough to appear so to a human observer.
As an implementation detail, the values are stored as database records, so database table flags/values can be used, or maybe external memory structures. Answers about the abstract case are welcome too.
Edit:
I just hit this other SO question here, which has good overlap with my own. Going through the good points there.
Here's an implementation that does the whole process in O(1) (for a single element) without any bias:
The idea is to treat the last K elements of the array A (which contains all the values) like a queue. We draw a value from the first N-K values in A, which is the random value, and swap it with the element at position N-Pointer, where Pointer represents the head of the queue and wraps around after K elements.
To eliminate any bias in the first K draws, the random value is drawn between 1 and N-Pointer instead of N-K, so this virtual queue grows at each draw until it reaches size K (e.g. after 3 draws the candidate values sit in A between indexes 1 and N-3, and the suspended values occupy indexes N-2 to N).
All operations are O(1) for drawing a single element and there's no bias throughout the entire process.
void DrawNumbers(val[] A, int K)
{
N = A.size;
random Rnd = new random;
int Drawn_Index;
int Count_To_K = 1;
int Pointer = K;
while (stop_drawing_condition)
{
if (Count_To_K <= K)
{
Drawn_Index = Rnd.NextInteger(1, N-Pointer);
Count_To_K++;
}
else
{
Drawn_Index = Rnd.NextInteger(1, N-K);
}
Print("drawn value is: " + A[Drawn_Index]);
Swap(A[Drawn_Index], A[N-Pointer]);
Pointer--;
if (Pointer < 1) Pointer = K;
}
}
My previous suggestion, which uses a list and an actual queue, depends on the list's remove method, which I believe is at best O(log N) when the list is backed by an array implementing a self-balancing binary tree, since the list has to allow direct access by index.
void DrawNumbers(list N, int K)
{
queue Suspended_Values = new queue;
random Rnd = new random;
int Drawn_Index;
while (stop_drawing_condition)
{
if (Suspended_Values.count == K)
N.add(Suspended_Values.Dequeue());
Drawn_Index = Rnd.NextInteger(1, N.size); // random integer between 1 and the number of values in N
Print("drawn value is: " + N[Drawn_Index]);
Suspended_Values.Enqueue(N[Drawn_Index]);
N.Remove(Drawn_Index);
}
}
I assume you have an array, A, that contains the items you want to draw. At each time period you randomly select an item from A.
You want to prevent any given item, i, from being drawn again within some k iterations.
Let's say that your threshold is 10% of A.
So create a queue, call it drawn, that can hold threshold items. Also create a hash table that contains the drawn items. Call the hash table hash.
Then:
do
{
i = Get random item from A
if (i in hash)
{
// we have drawn this item recently. Don't draw it.
continue;
}
draw(i);
if (drawn.count == k)
{
// remove oldest item from queue
temp = drawn.dequeue();
// and from the hash table
hash.remove(temp);
}
// add new item to queue and hash table
drawn.enqueue(i);
hash.add(i);
} while (forever);
The hash table exists solely to increase lookup speed. You could do without the hash table if you're willing to do a sequential search of the queue to determine if an item has been drawn recently.
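Here is a minimal runnable sketch of the same queue-plus-hash idea in Python (my own illustration: a set plays the role of the hash table, and a num_draws parameter replaces the forever loop so the example terminates; it assumes k is smaller than the number of distinct items, otherwise the retry loop can never find a candidate):
import random
from collections import deque

def draw_without_recent_repeats(A, k, num_draws):
    """Draw from A so that none of the last k drawn items is repeated."""
    drawn = deque()   # the last k drawn items, oldest first
    recent = set()    # the same items, for O(1) membership tests
    results = []
    while len(results) < num_draws:
        i = random.choice(A)
        if i in recent:
            continue                          # drawn too recently, don't draw it
        results.append(i)                     # this is the "draw(i)" step
        if len(drawn) == k:
            recent.discard(drawn.popleft())   # forget the oldest drawn item
        drawn.append(i)
        recent.add(i)
    return results

# e.g. draw_without_recent_repeats(list(range(100)), 10, 20)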
Say you have n items in your list, and you don't want any of the k last items to be selected.
Select at random from an array of size n-k, and use a queue of size k to stick the items you don't want to draw (adding to the front and removing from the back).
All operations are O(1).
---- clarification ----
Given n items, and a goal of not redrawing any of the last k draws, create an array and a queue as follows.
Create an array A of size n-k, and put n-k of your items in it (chosen at random, or seeded however you like).
Create a queue (linked list) Q and populate it with the remaining k items, again in random order or whatever order you like.
Now, each time you want to select an item at random:
Choose a random index from your array, call this i.
Give A[i] to whoever is asking for it, and add it to the front of Q.
Remove the element from the back of Q, and store it in A[i].
Everything is O(1) after the array and linked list are created, which is a one-time O(n) operation.
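Here is a small Python sketch of this scheme, assuming 0 <= k < n, with a deque standing in for the linked list (the class and method names are illustrative):
import random
from collections import deque

class NoRecentRepeatSampler:
    """Draw items such that none of the last k results can repeat."""
    def __init__(self, items, k):
        items = list(items)
        random.shuffle(items)                       # one-time O(n) setup
        self.A = items[:-k] if k > 0 else items     # n-k candidate items
        self.Q = deque(items[len(items) - k:])      # k suspended items

    def draw(self):
        i = random.randrange(len(self.A))   # choose a random index into A
        value = self.A[i]
        self.Q.appendleft(value)            # suspend the value just drawn (front of Q)
        self.A[i] = self.Q.pop()            # recycle the oldest suspended value (back of Q)
        return value

# e.g. sampler = NoRecentRepeatSampler(range(100), 10); sampler.draw()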
Now, you might wonder what we do if we want to change n (i.e. add or remove an element).
Each time we add an element, we either want to grow the size of A or of Q, depending on our logic for deciding what k is (i.e. a fixed value, a fixed fraction of n, whatever...).
If Q grows, the change is trivial: we just append the new element to Q. In this case I'd probably append it to the end of Q so that it gets into play ASAP. You could also put it in A, kicking some element out of A and appending that element to the end of Q.
If A grows, you can use a standard technique for growing arrays in amortized constant time. E.g., each time A fills up, we double it in size and keep track of the number of cells of A that are live. (Look up 'Dynamic arrays' on Wikipedia if this is unfamiliar.)
Set-based approach:
If the threshold is low (say below 40%), the suggested approach is:
Have a set and a queue of the last N*T generated values.
When generating a value, keep regenerating it until it's not contained in the set.
When pushing to the queue, pop the oldest value and remove it from the set.
Pseudo-code:
generateNextValue:
// once we've generated more than N*T elements,
// we need to start removing old elements
if queue.size >= N*T
element = queue.pop
set.remove(element)
// keep trying to generate random values until it's not contained in the set
do
value = getRandomValue()
while set.contains(value)
set.add(value)
queue.push(value)
return value
If the threshold is high, you can just turn the above on its head:
Have the set represent all values not in the last N*T generated values.
Invert all set operations (replace all set adds with removes and vice versa and replace the contains with !contains).
Pseudo-code:
generateNextValue:
if queue.size >= N*T
element = queue.pop
set.add(element)
// we can now just get a random value from the set, as it contains all candidates,
// rather than generating random values until we find one that works
value = getRandomValueFromSet()
//do
// value = getRandomValue()
//while !set.contains(value)
set.remove(value)
queue.push(value)
return value
Shuffle-based approach: (somewhat more complicated than the above)
If the threshold is high, the set-based approach above may take a long time, as it could keep generating values that already exist.
In this case, some shuffle-based approach may be a better idea.
Shuffle the data.
Repeatedly process the first element.
When doing so, remove it and insert it back at a random position in the range [N*T, N].
Example:
Let's say N*T = 5 and all possible values are [1,2,3,4,5,6,7,8,9,10].
Then we first shuffle, giving us, let's say, [4,3,8,9,2,6,7,1,10,5].
Then we remove 4 and insert it back in some index in the range [5,10] (say at index 5).
Then we have [3,8,9,2,4,6,7,1,10,5].
And continue removing the next element and insert it back, as required.
Implementation:
An array is fine if we don't care about efficiency a whole lot - getting one element will cost O(n) time.
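For illustration, here is a minimal list-based sketch of that simple O(n)-per-draw variant in Python, assuming 0 < t < 1 (the function names and the fraction parameter t are illustrative):
import random

def make_sampler(values, t):
    """Return a draw() function; a drawn value stays out for at least N*T draws."""
    data = list(values)
    random.shuffle(data)                    # step 1: shuffle the data
    lo = int(len(data) * t)                 # N*T: earliest index a drawn value may return to
    def draw():
        value = data.pop(0)                 # step 2: take the first element
        # step 3: reinsert it at a random position in the range [N*T, N]
        data.insert(random.randint(lo, len(data)), value)
        return value
    return draw

# e.g. draw = make_sampler(range(1, 11), 0.5); [draw() for _ in range(5)]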
To make this efficient we need to use an ordered data structure that supports efficient random position inserts and first position removals. The first thing that comes to mind is a (self-balancing) binary search tree, ordered by index.
We won't be storing the actual index; the index will be implicitly defined by the structure of the tree.
At each node we will have a count of children (+ 1 for itself) (which needs to be updated on insert / remove).
An insert can be done as follows: (ignoring the self-balancing part for the moment)
// calling function
insert(node, value)
insert(node, N*T, value)
insert(node, offset, value)
// node.left / node.right can be defined as 0 if the child doesn't exist
leftCount = node.left.count - offset
rightCount = node.right.count
// Since we're here, it means we're inserting in this subtree,
// thus update the count
node.count++
// Nodes to the left are within N*T, so simply go right
// leftCount is the difference between N*T and the number of nodes on the left,
// so this needs to be the new offset (and +1 for the current node)
if leftCount < 0
insert(node.right, -leftCount+1, value)
else
// generate a random number,
// on [0, leftCount), insert to the left
// on [leftCount, leftCount], insert at the current node
// on (leftCount, leftCount + rightCount], insert to the right
sum = leftCount + rightCount + 1
random = getRandomNumberInRange(0, sum)
if random < leftCount
insert(node.left, offset, value)
else if random == leftCount
// we don't actually want to update the count here
node.count--
newNode = new Node(value)
newNode.count = node.count + 1
// TODO: swap node and newNode's data so that node's parent will now point to newNode
newNode.right = node
newNode.left = null
else
insert(node.right, -leftCount+1, value)
To visualize inserting at the current node:
If we have something like:
4
/
1
/ \
2 3
And we want to insert 5 where 1 is now, it will do this:
4
/
5
\
1
/ \
2 3
Note that when a red-black tree, for example, performs operations to keep itself balanced, none of these involve comparisons, so it doesn't need to know the order (i.e. index) of any already-inserted elements. But it will have to update the counts appropriately.
The overall efficiency will be O(log n) to get one element.
I'd put all "values" into a "list" of size N, then shuffle the list and retrieve values from the top of the list. Then you "insert" the retrieved value at a random position with any index >= N*T.
Unfortunately I'm not truly a math-guy :( So I simply tried it (in VB, so please take it as pseudocode ;) )
Public Class BiasedRandom
Private prng As New Random
Private offset As Integer
Private l As New List(Of Integer)
Public Sub New(ByVal size As Integer, ByVal threshold As Double)
If threshold <= 0 OrElse threshold >= 1 OrElse size < 1 Then Throw New System.ArgumentException("Check your params!")
offset = size * threshold
' initial fill
For i = 0 To size - 1
l.Add(i)
Next
' shuffle "Algorithm p"
For i = size - 1 To 1 Step -1
Dim j = prng.Next(0, i + 1)
Dim tmp = l(i)
l(i) = l(j)
l(j) = tmp
Next
End Sub
Public Function NextValue() As Integer
Dim tmp = l(0)
l.RemoveAt(0)
l.Insert(prng.Next(offset, l.Count + 1), tmp)
Return tmp
End Function
End Class
Then a simple check:
Public Class Form1
Dim z As Integer = 10
Dim k As BiasedRandom
Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
k = New BiasedRandom(z, 0.5)
End Sub
Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
Dim j(z - 1)
For i = 1 To 10 * 1000 * 1000
j(k.NextValue) += 1
Next
Stop
End Sub
End Class
And when I check out the distribution, it looks okay enough to the naked eye ;)
EDIT:
After thinking about RonTeller's argument, I have to admit that he is right. I don't think there is a performance-friendly way to achieve the desired behaviour and retain a good (not more biased than required) random order.
I came to the following idea:
Given a list (array whatever) like this:
0123456789 ' not shuffled to make clear what I mean
We return the first element, which is 0. This one must not come up again for 4 (as an example) more draws, but we also want to avoid a strong bias. Why not simply put it at the end of the list and then shuffle the "tail" of the list, i.e. the last 6 elements?
1234695807
We now return the 1 and repeat the above steps.
2340519786
And so on and so on. Since removing and inserting is somewhat unnecessary work, one could use a simple array and a "pointer" to the current element. I have changed the code from above to give a sample. It's slower than the first version, but should avoid the mentioned bias.
Public Function NextValue() As Integer
Static current As Integer = 0
' only shuffling a part of the list
For i = current + l.Count - 1 To current + 1 + offset Step -1
Dim j = prng.Next(current + offset, i + 1)
Dim tmp = l(i Mod l.Count)
l(i Mod l.Count) = l(j Mod l.Count)
l(j Mod l.Count) = tmp
Next
current += 1
Return l((current - 1) Mod l.Count)
End Function
EDIT 2:
Finally (hopefully), I think the solution is quite simple. The below code assumes that there is an array of N elements called TheArray which contains the elements in random order (could be rewritten to work with sorted array). The value DelaySize determines how long a value should be suspended after it has been drawn.
Public Function NextValue() As Integer
Static current As Integer = 0
Dim SelectIndex As Integer = prng.Next(0, TheArray.Count - DelaySize)
Dim ReturnValue = TheArray(SelectIndex)
TheArray(SelectIndex) = TheArray(TheArray.Count - 1 - current Mod DelaySize)
TheArray(TheArray.Count - 1 - current Mod DelaySize) = ReturnValue
current += 1
Return ReturnValue
End Function
I don't have a scenario, but here goes the problem. This one is just driving me crazy. There is an n x n boolean matrix, initially all elements are 0, with n <= 10^6 given as input.
Next there will be up to 10^5 queries. Each query either sets all elements of column c to 0 or 1, or sets all elements of row r to 0 or 1. There is another type of query: print the total number of 1's in column c or row r.
I have no idea how to solve this and any help would be appreciated. Obviously a O(n) solution per query is not feasible.
The idea of using a number to order the modifications is taken from Dukeling's post.
We will need 2 maps and 4 binary indexed trees (BIT, a.k.a. Fenwick tree): 1 map and 2 BITs for rows, and 1 map and 2 BITs for columns. Let us call them m_row, f_row[0], and f_row[1]; m_col, f_col[0] and f_col[1] respectively.
A map may be implemented with an array, a tree-like structure, or hashing. The 2 maps are used to store the last modification to a row/column. Since there can be at most 10^5 modifications, you may use that fact to save space over a simple array implementation.
BIT has 2 operations:
adjust(value, delta_freq), which adjusts the frequency of the value by delta_freq amount.
rsq(from_value, to_value) (rsq stands for range sum query), which finds the sum of all the frequencies from from_value to to_value inclusive.
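For reference, a minimal Fenwick tree supporting these two operations might look like this (a Python sketch of my own; values are 1-based, and rsq is computed as a difference of prefix sums):
class FenwickTree:
    def __init__(self, size):
        self.tree = [0] * (size + 1)              # 1-based indexing

    def adjust(self, value, delta_freq):
        # add delta_freq to the frequency of `value` (value >= 1)
        while value < len(self.tree):
            self.tree[value] += delta_freq
            value += value & (-value)

    def _prefix_sum(self, value):
        total = 0
        while value > 0:
            total += self.tree[value]
            value -= value & (-value)
        return total

    def rsq(self, from_value, to_value):
        # sum of frequencies from from_value to to_value inclusive
        return self._prefix_sum(to_value) - self._prefix_sum(from_value - 1)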
Let us declare a global variable: version
Let us define numRow to be the number of rows in the 2D boolean matrix, and numCol to be the number of columns in the 2D boolean matrix.
The BITs should have a size of at least MAX_QUERY + 1, since they are used to count the number of changes to the rows and columns, which can be as many as the number of queries.
Initialization:
version = 1
# Map should return <0, 0> for rows or cols not yet
# directly updated by query
m_row = m_col = empty map
f_row[0] = f_row[1] = f_col[0] = f_col[1] = empty BIT
Update algorithm:
update(isRow, value, idx):
if (isRow):
# Since setting a row/column to a new value will reset
# everything done to it, we need to erase earlier
# modification to it.
# For example, turn on/off on a row a few times, then
# query some column
<prevValue, prevVersion> = m_row.get(idx)
if ( prevVersion > 0 ):
f_row[prevValue].adjust( prevVersion, -1 )
m_row.map( idx, <value, version> )
f_row[value].adjust( version, 1 )
else:
<prevValue, prevVersion> = m_col.get(idx)
if ( prevVersion > 0 ):
f_col[prevValue].adjust( prevVersion, -1 )
m_col.map( idx, <value, version> )
f_col[value].adjust( version, 1 )
version = version + 1
Count algorithm:
count(isRow, idx):
if (isRow):
# If this is row, we want to find number of reverse modifications
# done by updating the columns
<value, row_version> = m_row.get(idx)
count = f_col[1 - value].rsq(row_version + 1, version)
else:
# If this is column, we want to find number of reverse modifications
# done by updating the rows
<value, col_version> = m_col.get(idx)
count = f_row[1 - value].rsq(col_version + 1, version)
if (isRow):
if (value == 1):
return numCol - count
else:
return count
else:
if (value == 1):
return numRow - count
else:
return count
The complexity is logarithmic in the worst case for both update and count.
Take version just to mean a value that gets auto-incremented for each update.
Store the last version and last update value at each row and column.
Store a list of (versions and counts of zeros and counts of ones) for the rows. The same for the columns. So that's only 2 lists for the entire grid.
When a row is updated, we set its version to the current version and insert into the row list the version and, if (oldRowValue == 0) zeroCount = oldZeroCount, else zeroCount = oldZeroCount + 1 (so it's not the number of zeros, but rather the number of times a value was updated with a zero). Same for oneCount. Same for columns.
If you do a print for a row, we get the row's version and last value, then do a binary search for that version in the column list (first value greater than it). Then:
if (rowValue == 1)
target = n*rowValue
- (latestColZeroCount - colZeroCount)
+ (latestColOneCount - colOneCount)
else
target = (latestColOneCount - colOneCount)
Not too sure whether the above will work.
That's O(1) for update, O(log k) for print, where k is the number of updates.
This is my first post here but reading other posts has helped me with countless issues. You guys are awesome. So thanks for all the help you guys have provided up to this point and thanks in advance for any advice you can give me here.
So, I'm working on this script that will end up randomly choosing numbers from a list in an array. What I'm trying to do in this section is read the number in each element of the third dimension of NumberHistoryArray and write the corresponding number from the second dimension into the numberDump array the number of times indicated in the third dimension. I also have it writing to a text file just to see that it works and to have something to reference if I run into other problems and need to know what the output is.
Edit
The second dimension of NumberHistoryArray contains the numbers 1 through 35. The third dimension contains the number of times each of the numbers in the second dimension should be repeated when written to the numberDump array.
If NumberHistoryArray(4, 0, 0) is 4, for example, it should write the number 1 four times, once each in numberDump(0, 0, 0) through (0, 0, 3) and dynamically resize numberDump() as needed to fit each 1. The result of echoing numberDump(0, 0, 0) through numberDump(0, 0, 3) should be 1 1 1 1. If NumberHistoryArray(4, 1, 1) is 7, it should write the number 2 seven times in numberDump(1, 1, 0).
Edit
When writing to the numberDump text file from within the Do loop, the text file is perfect. If I echo the contents of each element in the array from inside the Do loop, the contents are accurate. Once outside the Do loop, the contents are blank. Any takers?
Dim NumberHistoryArray(5, 34, 3)
Dim numberDump()
reDim numberDump(3, 34, 0)
' generate the number list and store it in the array
For n = LBound(numberDump, 1) to UBound(numberDump, 1)
i = 1
x = 0
y = 1
Do
Set objTextFile = objFSO.OpenTextFile("numberDump.txt", 8, True)
' determine how many elements are needed in the third dimension of the
' array to store the output
x = x + NumberHistoryArray(4, i - 1, n)
' resize the array
ReDim Preserve numberDump(3, 34, x)
' input the numbers
For z = y to UBound(NumberDump, 3)
numberDump(n, i - 1, z) = i
objTextFile.WriteLine(numberDump(n, i - 1, z))
Next
objTextFile.Close
y = x + 1
i = i + 1
Loop Until i = 36
Next
For i = 0 to UBound(numberDump, 2)
For j = 0 to UBound(numberDump, 3)
wscript.Echo numberDump(0, i, j)
Next
Next
The issue here is that you are resizing the third dimension of numberDump to x each iteration, and you are resetting the value of x with each iteration of n. For example, when n is 0, by the end of the Do loop, x may be equal to 10. However, on the next iteration, when n is 1, x is reset to 0 and you end up resizing the third dimension of numberDump to 0, erasing most of the existing elements.
There are two methods of working around this:
Keep track of the maximum value of x for all iterations of n. Use that value to resize numberDump as necessary.
Use a jagged array (aka, an array of arrays). In this case, numberDump would be a two-dimensional array whose values are arrays containing the repeated values. This would allow each value to have a different length, as opposed to a three-dimensional array, where the third dimension always has the same length (see the sketch below).
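To illustrate only the shape of that jagged structure, here is a small Python sketch of the idea (not a drop-in VBScript replacement; the names mirror the ones in the question, and indices are 0-based here, so i stands for the value i+1):
# numberDump becomes a 2D structure whose cells hold lists of repeated values,
# so each cell can have its own length.
def build_number_dump(number_history, num_columns, num_values):
    number_dump = [[None] * num_values for _ in range(num_columns)]
    for n in range(num_columns):
        for i in range(num_values):
            count = number_history[4][i][n]      # repeat count, as in NumberHistoryArray(4, i, n)
            number_dump[n][i] = [i + 1] * count  # e.g. count 4 for value 1 -> [1, 1, 1, 1]
    return number_dump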
Have you heard of the lifespan of objects? Where do you initialize the objects: within the loop, or outside the loop so that they can be referred to within the entire subroutine/function?
How do you refer to array elements outside the loop? Each element has an index, so an element is referred to via its index.
E.g. ArrayOne has 3 elements, from lower bound 0 to 2. Within your loops you can refer to ArrayOne(1) to get the 2nd element.
Other than using indices to refer to elements, you can also try viewing array elements in debug mode in the Immediate window or the Watch window.
As you check these tips, please show us complete code for further verification.
I have a set of numbers say : 1 1 2 8 5 6 6 7 8 8 4 2...
I want to detect duplicate elements in each sub-array (of a given size, say k) of the above numbers... For example, consider the sliding sub-arrays for k=3:
Sub array 1 :{1,1,2}
Sub array 2 :{1,2,8}
Sub array 3 :{2,8,5}
Sub array 4 :{8,5,6}
Sub array 5 :{5,6,6}
Sub array 6 :{6,6,7}
....
So my algorithm should detect that sub-arrays 1, 5, and 6 contain duplicates..
My approach:
1) Copy the 1st k elements to a temporary array (vector).
2) Using the appropriate #include from the C++ STL and its unique() function, determine whether there is any change in the size of the vector.
Any other clue on how to approach this problem? My method would consume a lot of time and space if the list of given numbers is large.
An O(n) average time and O(k) space solution could be to build a hash-based histogram and iterate the array while maintaining the number of occurrences of the elements in each sublist.
In each iteration, kick the oldest element out (by decreasing the relevant entry in the histogram) and add a new element.
Also maintain a numDupes variable, which counts how many dupes you currently have and is updated when adding/removing elements from the current candidate.
Pseudo-code (sorry if I have an off-by-one error or something, but this is the idea):
numDupes = 0
histogram = new map<int,int>;
//first set:
for each i from 0 to k:
if histogram.contains(arr[i]):
histogram.put(arr[i],histogram.get(arr[i]) + 1)
numDupes += 1
else:
histogram.put(arr[i],1)
//each iteration is for a new set
if (numDupes > 0) print 1 //first sub array has dupes
for each i from k to n:
if (histogram.get(arr[i-k]) > 1) numDupes -= 1 //we just removed a dupe
histogram.put(arr[i-k], histogram.get(arr[i-k]) - 1) //take off "oldest" element.
if (histogram.contains(arr[i]) && histogram.get(arr[i]) > 0):
histogram.put(arr[i], histogram.get(arr[i]) + 1)
numDupes += 1 //we just added a dupe
else:
histogram.put(arr[i],1)
if (numDupes > 0) print i-k+1 // the current sub array contains a dupe
The initial answer had a small mistake: it failed to catch cases where the last element added did not cause the dupe, but there is still one (like sub-array 6 in the example).
This can be solved by maintaining an extra integer that counts the number of dupes currently found, and printing the sub-array when the counter is greater than 0. (I updated the pseudo-code.)
Also note: To achieve O(k) space, your histogram needs to delete elements when their value is 0.
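For completeness, here is a runnable version of this sliding-window histogram as a Python sketch, using collections.Counter and reporting windows by their 1-based index as in the example above:
from collections import Counter

def windows_with_dupes(arr, k):
    """Return the 1-based indices of the length-k windows that contain a duplicate."""
    histogram = Counter(arr[:k])                       # histogram of the first window
    num_dupes = sum(c - 1 for c in histogram.values() if c > 1)
    result = [1] if num_dupes > 0 else []
    for i in range(k, len(arr)):
        old, new = arr[i - k], arr[i]
        if histogram[old] > 1:
            num_dupes -= 1                 # removing the oldest element breaks a dupe
        histogram[old] -= 1
        if histogram[old] == 0:
            del histogram[old]             # keep the histogram at O(k) entries
        if histogram[new] > 0:
            num_dupes += 1                 # the new element creates a dupe
        histogram[new] += 1
        if num_dupes > 0:
            result.append(i - k + 2)       # window arr[i-k+1 .. i] is the (i-k+2)-th
    return result

# e.g. windows_with_dupes([1, 1, 2, 8, 5, 6, 6, 7, 8, 8, 4, 2], 3) -> [1, 5, 6, 8, 9]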