Distinct number of changes in real time data - for-loop

Hi, I am taking in data in real time where the value goes from 1009, 1008 or 1007 to 0. I am trying to count the number of distinct times this occurs; for example, the snippet below should count 2 distinct periods of change.
1008
1009
1008
0
0
0
1008
1007
1008
1008
1009
9
0
0
1009
1008
I have written a for loop as below, but I can't figure out whether the logic is correct, as I get multiple increments instead of just one:
if(current != previous && current < 100)
x++;
else
x = x;

You tagged this with the LabVIEW tag. Is this actually supposed to be LabVIEW code?
Your logic has a bug related to the noise you say you have - if the value is less than 100 and it changes (for instance from 9 to 0), you log that as a change. You also have a line which doesn't do anything (x=x), although if this is supposed to be LV code, then this could make sense.

The code you posted here does not seem to make sense to me if I understand your goal. My understanding is that you want to identify this specific pattern:
1009
1008
1007
0
And that any deviation from this sequence of numbers would constitute data that should be ignored. To this end, you should be monitoring the history of the past 3 numbers. In C you might write this logic in the following way:
#include <stdio.h>
//Function to get the next value from our data stream.
int getNext(int *value) {
//Variable to hold our return code.
int code;
//Replace the following line to get the next number from the stream. Possibly read from a file?
*value = 0;
//Replace following logic to set 'code' appropriately.
if(*value == -1)
code = -1;
else
code = 0;
//Return 'code' to the caller.
return code;
}
//Example application for counting the occurrences of the sequence '1009','1008','1007','0'.
int main(int argc, char **argv) {
//Declare 4 items to store the past 4 items in the sequence (x0-x3)
//Declare a count and a flag to monitor the occurrence of our pattern
int x0 = 0, x1 = 0, x2 = 0, x3 = 0, count = 0, occurred = 0;
//Run forever (just as an example, you would provide your own looping structure or embed the algorithm in your app).
while(1) {
//Get the next element (implement getNext to provide numbers from your data source).
//If the function returns non-zero, exit the loop and print the count.
if( getNext(&x0) != 0 )
break;
//If the newest element is 0, we can trigger a check of the prior 3.
if(x0 == 0) {
//Set occurred to 0 if the prior elements don't match our pattern.
occurred = (x1 == 1007) && (x2 == 1008) && (x3 == 1009);
if(occurred) {
//Occurred was 1, meaning the pattern was matched. Increment our count.
count++;
//Reset occurred
occurred = 0;
}
}
//Shift the elements down our list on every sample, so that a run of zeros cannot match the same pattern twice.
x3 = x2; //Shift 3rd element to 4th position
x2 = x1; //Shift 2nd element to 3rd position
x1 = x0; //Shift 1st element to 2nd position
}
printf("The pattern count is %d\n", count);
//Exit application
return 0;
}
Note that the getNext function is just shown here as an example but obviously what I have implemented will not work. This function should be implemented based on how you are extracting data from the stream.
Writing the application in this way might not make sense within your larger application but the algorithm is what you should take away from this. Essentially you want to buffer 4 elements in a rolling window. You push the newest element into x0 and shift the others down. After this process you check the four elements to see if they match your desired pattern and increment the count accordingly.

If the requirement is to count falling edges and you don't care about the specific level, and want to reject noise band or ripple in the steady state then just make the conditional something like
if ((previous - current) > threshold)
No complex shifting, history, or filtering required. Depending on the application you can follow up with a debounce (persistency check) to ignore spurious samples (just keep track of falling/rising, or fell/rose as simple toggling state spanning a desired number of samples).
Code to the pattern, not the specific values; use constant or adjustable parameters to control the value sensitivity.
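A minimal sketch of that idea in Python, using the sample data from the question. The function name and the `threshold` and `persist` parameters are made up for illustration; `persist` is the debounce length, i.e. how many consecutive low samples are required before a fall is counted.

```python
def count_falling_edges(samples, threshold=500, persist=1):
    """Count drops of more than `threshold` that last at least `persist` samples."""
    count = 0
    reference = None  # last sample seen while the signal was not in a drop
    low_run = 0       # consecutive samples more than `threshold` below reference
    for current in samples:
        if reference is not None and reference - current > threshold:
            low_run += 1
            if low_run == persist:  # count each drop exactly once
                count += 1
        else:
            low_run = 0
            reference = current
    return count


data = [1008, 1009, 1008, 0, 0, 0, 1008, 1007, 1008, 1008,
        1009, 9, 0, 0, 1009, 1008]
print(count_falling_edges(data))             # two distinct periods of change
print(count_falling_edges(data, persist=2))  # still two, each drop lasts long enough
```

With `persist=2`, a single spurious low sample such as `[1008, 0, 1008]` is ignored, while the small step from 9 to 0 is never counted because it stays inside the same drop.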


Non-associative RDOM parallelization in Halide

I am trying to write a decoder for the GPU. My encoding scheme has data dependencies between lines, so when decoding columns of data, each column depends on the previous one. I want to parallelize the internal computation of each column, but execute the columns one by one, sequentially, and I am having trouble getting this right.
Below I have modeled a toy example to show the problem:
Func f;
Var x,y;
RDom r(1,3,1,3); // min 1, extent 3 in each dimension: goes from (1,1) to (3,3)
f(x,y) = 0;
f(0,y) = y;
Expr p_1 = f(r.x-1,r.y);
Expr p_2 = f(r.x-1,r.y-1);
f(r.x,r.y) = p_1 + p_2;
Buffer<int32_t> output_2D = f.realize({4,4});
A visualization of this program can be seen here: Serial Computation Visualisation
This reduction should give the following array():
int expected_output[4][4] = {{0,0,0,0},
{1,1,1,1},
{2,3,4,5},
{3,5,8,12}};
And checking using Catch2 I can see that it actually calculates it correctly
for(int j = 0; j < output_2D.height(); j++){
for(int i = 0; i < output_2D.width(); i++){
CAPTURE(i,j);
REQUIRE(expected_output[j][i]==output_2D(i,j));
}
}
My task is to speed this computation up. Since column one depends on column zero I have to calculate each column in series. I can however, calculate all the values in the column in parallel. Please see Computation Steps Parallel and Desired Pipeline to see how I want Halide to compute the pipeline.
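To make the dependency structure explicit, here is a plain-Python reference model of the recurrence. It is not Halide code and not a schedule, only a sketch of the order being asked for: the outer loop over x is serial, and the inner loop over y reads only the previous column, so its iterations are independent (they are deliberately run in reverse here to show that their order does not matter). The function name is made up for illustration.

```python
def reference_model(width, height):
    # f[y][x] mirrors f(x, y) from the Halide snippet.
    f = [[0] * width for _ in range(height)]
    for y in range(height):
        f[y][0] = y  # f(0, y) = y
    for x in range(1, width):  # serial: column x needs column x - 1
        prev = [f[y][x - 1] for y in range(height)]
        # Independent per y: each value reads only `prev`.
        for y in reversed(range(1, height)):
            f[y][x] = prev[y] + prev[y - 1]
    return f


for row in reference_model(4, 4):
    print(row)
```

This reproduces the `expected_output` table from the question.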
I tried doing this in Halide using f.update(1).allow_race_conditions().parallel(r.y), and this does almost what I want.
f(r.x,r.y) = p_1 + p_2;
f.update(1).allow_race_conditions().parallel(r.y);
f.trace_stores();
Buffer<int32_t> output_2D = f.realize({4,4});
For some reason however, it seems that parallel(y) executes the columns in seemingly random order.
It yields the following store_trace:
Init Image:
Store f29.0(0, 0) = 0
Store f29.0(1, 0) = 0
....
Store f29.0(3, 3) = 0
Init first row:
Store f29.0(0, 0) = 0
Store f29.0(1, 0) = 1
Store f29.0(2, 0) = 2
Store f29.0(3, 0) = 3
Start Parallel Computation:
Store f29.0(1, 1) = 1 // First parallel column
Store f29.0(2, 1) = 1
Store f29.0(3, 1) = 1
Store f29.0(1, 3) = 5 // Second parallel column: THIS IS MY PROBLEM
Store f29.0(2, 3) = 5 // This should be column 2 not column 3.
Store f29.0(3, 3) = 5
Store f29.0(1, 2) = 3
Store f29.0(2, 2) = 4
Store f29.0(3, 2) = 5
A visualization of this pattern can be seen here in this figure: Current Pipeline.
I know that I am explicitly allowing race conditions, so I must be doing something wrong, but I don't know the right way to do this, and this is the closest I have got. I could vectorize() with respect to y, and that gives the correct evaluation, but I want to use parallel() to gain a greater speedup for larger matrices/images. RFactor might be a solution, as my problem should be associative in the y direction, but it might not work because the problem is non-associative in the x direction (each column depends on the previous one). Does anyone know how to be serial in x and parallel in y when using RDoms?

Fuzzy string record search algorithm (supporting word transpose and character transpose)

I am trying to find the best algorithm for my particular application. I have searched around on SO, Google, read various articles about Levenshtein distances, etc. but honestly it's a bit out of my area of expertise. And most seem to find how similar two input strings are, like a Hamming distance between strings.
What I'm looking for is different, more of a fuzzy record search (and I'm sure there is a name for it, that I don't know to Google). I am sure someone has solved this problem before and I'm looking for a recommendation to point me in the right direction for my further research.
In my case I am needing a fuzzy search of a database of entries of music artists and their albums. As you can imagine, the database will have millions of entries so an algorithm that scales well is crucial. It's not important to my question that Artist and Album are in different columns, the database could just store all words in one column if that helped the search.
The database to search:
|-------------------|---------------------|
| Artist | Album |
|-------------------|---------------------|
| Alanis Morissette | Jagged Little Pill |
| Moby | Everything is Wrong |
| Air | Moon Safari |
| Pearl Jam | Ten |
| Nirvana | Nevermind |
| Radiohead | OK Computer |
| Beck | Odelay |
|-------------------|---------------------|
The query text will contain from just one word in the entire Artist_Album concatenation up to the entire thing. The query text is coming from OCR and is likely to have single character transpositions but the most likely thing is the words are not guaranteed to have the right order. Additionally, there could be extra words in the search that aren't a part of the album (like cover art text). For example, "OK Computer" might be at the top of the album and "Radiohead" below it, or some albums have text arranged in columns which intermixes the word orders.
Possible search strings:
C0mputer Rad1ohead
Pearl Ten Jan
Alanis Jagged Morisse11e Litt1e Pi11
Air Moon Virgin Records
Moby Everything
Note that with OCR, some letters will look like numbers, or the wrong letter completely (Jan instead of Jam). And in the case of Radiohead's OK Computer and Moby's Everything Is Wrong, the query text doesn't even have all of the words. In the case of Air's Moon Safari, the extra words Virgin Records are searched, but Safari is missing.
Is there a general algorithm that could return the single likeliest result from the database, and if none meet some "likeliness" score threshold, it returns nothing? I'm actually developing this in Python, but that's just a bonus, I'm looking more for where to get started researching.
Let's break the problem down in two parts.
First, you want to define some measure of likeness (this is called a metric). This metric should return a small number if the query text closely matches the album/artist cover, and return a larger number otherwise.
Second, you want a data structure that speeds up this process. Obviously, you don't want to calculate this metric against every record each time a query is run.
part 1: the metric
You already mentioned Levenshtein distance, which is a great place to start.
Think outside the box though.
LD makes certain assumptions (each character replacement is equally likely, deletion is equally likely as insertion, etc). You can obviously improve the performance of this metric by taking into account what faults OCR is likely to introduce.
E.g. turning a '1' into an 'i' should not be penalized as harshly as turning a '0' into an '_'.
I would implement the metric in two stages. For any given two strings:
1. Split both strings into tokens (assume space as the separator).
2. Look for the most similar words (using a modified version of LD).
3. Assign a final score based on 'matching words', 'missing words' and 'added words' (preferably weighted).
This is an example implementation (fiddle around with the constants):
static double m(String a, String b){
String[] aParts = a.split(" ");
String[] bParts = b.split(" ");
boolean[] bUsed = new boolean[bParts.length];
int matchedTokens = 0;
int tokensInANotInB = 0;
int tokensInBNotInA = 0;
for(int i=0;i<aParts.length;i++){
String a0 = aParts[i];
boolean wasMatched = false; // becomes true only once a0 is matched against some token of b
for(int j=0;j<bParts.length;j++){
String b0 = bParts[j];
double d = levenshtein(a0, b0);
/* If we match the token a0 with a token from b0
* update the number of matchedTokens
* escape the loop
*/
if(d < 2){
bUsed[j]=true;
wasMatched = true;
matchedTokens++;
break;
}
}
if(!wasMatched){
tokensInANotInB++;
}
}
for(boolean partUsed : bUsed){
if(!partUsed){
tokensInBNotInA++;
}
}
return (matchedTokens
+ tokensInANotInB * -0.3 // the query is allowed to contain extra words at minimal cost
+ tokensInBNotInA * -0.5 // the album title should not contain too many extra words
) / java.lang.Math.max(aParts.length, bParts.length);
}
This function uses a modified levenshtein function:
static double levenshtein(String x, String y) {
double[][] dp = new double[x.length() + 1][y.length() + 1];
for (int i = 0; i <= x.length(); i++) {
for (int j = 0; j <= y.length(); j++) {
if (i == 0) {
dp[i][j] = j;
}
else if (j == 0) {
dp[i][j] = i;
}
else {
dp[i][j] = Math.min(dp[i - 1][j - 1]
+ costOfSubstitution(x.charAt(i - 1), y.charAt(j - 1)),
Math.min(dp[i - 1][j] + 1,
dp[i][j - 1] + 1));
}
}
}
return dp[x.length()][y.length()];
}
Which uses the function 'cost of substitution' (which works as explained)
static double costOfSubstitution(char a, char b){
if(a == b)
return 0.0;
else{
// 1 and i
if(a == '1' && b == 'i')
return 0.5;
if(a == 'i' && b == '1')
return 0.5;
// 0 and O
if(a == '0' && b == 'o')
return 0.5;
if(a == 'o' && b == '0')
return 0.5;
if(a == '0' && b == 'O')
return 0.5;
if(a == 'O' && b == '0')
return 0.5;
// default
return 1.0;
}
}
I only included a couple of examples (turning '1' into 'i' or '0' into 'o').
But I'm sure you get the idea.
part 2: the datastructure
Look into BK-trees. They are a specific datastructure to hold metric information. Your metric needs to be a genuine metric (in the mathematical sense of the word). But that's easily arranged.
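A minimal BK-tree sketch in Python, to show the shape of the data structure. Assumptions to note: it uses plain integer Levenshtein distance rather than the weighted variant above (BK-trees key children by integer distance, so fractional costs such as 0.5 would have to be scaled to integers first), it indexes single lower-cased words rather than whole records, and the class and function names are made up for illustration.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


class BKTree:
    def __init__(self, distance):
        self.distance = distance
        self.root = None  # each node is (word, {distance: child node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word is already stored
            if d not in node[1]:
                node[1][d] = (word, {})
                return
            node = node[1][d]

    def search(self, query, max_dist):
        results = []
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = self.distance(query, word)
            if d <= max_dist:
                results.append((d, word))
            # Triangle inequality: only subtrees keyed within
            # [d - max_dist, d + max_dist] can contain a match.
            for k, child in children.items():
                if d - max_dist <= k <= d + max_dist:
                    stack.append(child)
        return sorted(results)


tree = BKTree(levenshtein)
for w in ["radiohead", "computer", "pearl", "jam", "nirvana", "nevermind",
          "moby", "beck", "odelay", "ten", "air"]:
    tree.add(w)

print(tree.search("rad1ohead", 1))
print(tree.search("jan", 1))
```

Each OCR token can be looked up this way, and the matched words then vote for the records that contain them.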

How to implement "i++ and i>=max ? 0: i" that only use atomic in Go

How can I implement the following code using only atomic operations?
const Max = 8
var index int
func add() int {
index++
if index >= Max {
index = 0
}
return index
}
For example, I tried:
func add() int {
atomic.AddUint32(&index, 1)
// error: race condition
atomic.CompareAndSwapUint32(&index, Max, 0)
return index
}
But this is wrong: there is a race condition.
Can this be implemented without using a lock?
Solving it without loops and locks
A simple implementation may look like this:
const Max = 8
var index int64
func Inc() int64 {
value := atomic.AddInt64(&index, 1)
if value < Max {
return value // We're done
}
// Must normalize, optionally reset:
value %= Max
if value == 0 {
atomic.AddInt64(&index, -Max) // Reset
}
return value
}
How does it work?
It simply adds 1 to the counter; atomic.AddInt64() returns the new value. If it's less than Max, "we're done", we can return it.
If it's greater than or equal to Max, then we have to normalize the value (make sure it's in the range [0..Max)) and reset the counter.
Reset may only be done by a single caller (a single goroutine), which will be selected by the counter's value. The winner will be the one that caused the counter to reach Max.
And the trick to avoid the need of locks is to reset it by adding -Max, not by setting it to 0. Since the counter's value is normalized, it won't cause any problems if other goroutines are calling it and incrementing it concurrently.
Of course, with many goroutines calling Inc() concurrently, the counter may be incremented more than Max times before the goroutine that ought to reset it can actually carry out the reset, which would cause the counter to reach or exceed 2 * Max or even 3 * Max (in general: n * Max). We handle this by using a value % Max == 0 condition to decide whether a reset should happen, which again will only be true for a single goroutine for each possible value of n.
Simplification
Note that the normalization does not change values already in the range [0..Max), so you may opt to always perform the normalization. If you want to, you may simplify it to this:
func Inc() int64 {
value := atomic.AddInt64(&index, 1) % Max
if value == 0 {
atomic.AddInt64(&index, -Max) // Reset
}
return value
}
Reading the counter without incrementing it
The index variable should not be accessed directly. If there's a need to read the counter's current value without incrementing it, the following function may be used:
func Get() int64 {
return atomic.LoadInt64(&index) % Max
}
Extreme scenario
Let's analyze an "extreme" scenario. In this, Inc() is called 7 times, returning the numbers 1..7. Now the next call to Inc() after the increment will see that the counter is at 8 = Max. It will then normalize the value to 0 and wants to reset the counter. Now let's say before the reset (which is to add -8) is actually executed, 8 other calls happen. They will increment the counter 8 times, and the last one will again see that the counter's value is 16 = 2 * Max. All the calls will normalize the values into the range 0..7, and the last call will again go on to perform a reset. Let's say this reset is again delayed (e.g. for scheduling reasons), and yet another 8 calls come in. For the last, the counter's value will be 24 = 3 * Max, the last call again will go on to perform a reset.
Note that all calls will only return values in the range [0..Max). Once all reset operations are executed, the counter's value will be 0, properly, because it had a value of 24 and there were 3 "pending" reset operations. In practice there's only a slight chance for this to happen, but this solution handles it nicely and efficiently.
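The arithmetic of this scenario can be checked with a small single-threaded model. This is a Python sketch, not Go and not concurrent: the hypothetical `inc_deferred` method performs the increment and the normalization but only records the reset as pending, which models a goroutine being descheduled between its two atomic adds.

```python
MAX = 8


class Counter:
    def __init__(self):
        self.index = 0
        self.pending_resets = 0

    def inc_deferred(self):
        # Models Inc() with its reset postponed.
        self.index += 1
        value = self.index % MAX
        if value == 0:
            self.pending_resets += 1  # this caller owes one "add -MAX"
        return value

    def run_pending_resets(self):
        while self.pending_resets:
            self.index -= MAX
            self.pending_resets -= 1


c = Counter()
returned = [c.inc_deferred() for _ in range(24)]
print(returned)                   # every value is in the range [0, MAX)
print(c.index, c.pending_resets)  # 24 and 3 before the delayed resets run
c.run_pending_resets()
print(c.index)                    # back to 0
```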
I assume your goal is to never let index have a value equal to or greater than Max. This can be solved using a CAS (Compare-And-Swap) loop:
const Max = 8
var index int32
func add() int32 {
var next int32;
for {
prev := atomic.LoadInt32(&index)
next = prev + 1;
if next >= Max {
next = 0
}
if (atomic.CompareAndSwapInt32(&index, prev, next)) {
break;
}
}
return next
}
CAS can be used to implement almost any operation atomically like this. The algorithm is:
1. Load the value.
2. Perform the desired operation.
3. Use CAS; go to 1 on failure.

Add water between in a bar chart

I recently came across an interview question on a Glassdoor-like site, and I can't find an optimized solution to it:
This is nothing like the trapping water problem. Please read through the examples.
You are given an input array in which each element represents the height of a tower, the amount of water to pour, and the index of the position where the water is poured. The width of every tower is 1. Print the graph after pouring the water.
Notes:
Use * to indicate the tower, w to represent 1 amount water.
The pouring position will never be at a peak position. There is no need to consider the case where the water divides.
(A bonus point if you give a solution for this case; you may assume that when pouring N water at a peak position, N/2 water goes to the left and N/2 water goes to the right.
The definition of a peak: the height at the peak position is greater than the heights at both the left and right neighbouring indices.)
Assume there are 2 extremely high walls standing right next to the histogram, one on each side.
So if the water amount exceeds the capacity of the histogram,
you should print the capacity and keep going. See Example 2.
Assume the water goes left first; see Example 1.
Example 1:
int[] heights = {4,2,1,2,3,2,1,0,4,3,1}
It looks like:
*       *
*   *   **
** ***  **
******* ***
+++++++++++ <- there'll always be a base layer
42123210431
Assume this heights array is given, with water amount 3 and position 2:
Print:
*       *
*ww *   **
**w***  **
******* ***
+++++++++++
Example 2:
int[] heights = {4,2,1,2,3,2,1,0,4,3,1}, water amount 32, position 2
Print:
capacity:21
wwwwwwwwwww
*wwwwwww*ww
*www*www**w
**w***ww**w
*******w***
+++++++++++
At first I thought it was like the trapping water problem, but I was wrong. Does anyone have an algorithm to solve this problem?
An explanation or comments in the code would be welcome.
Note:
The trapping water problem asks for the capacity, but this question introduces two variables: the water amount and the pouring index. Besides, the water has a flow preference. So it is not like the trapping water problem.
I found a Python solution to this question. However, I'm not familiar with Python, so I quote the code here. Hopefully someone who knows Python can help.
Code by #z026
def pour_water(terrains, location, water):
    print('location', location)
    print('len terrains', len(terrains))
    waters = [0] * len(terrains)
    while water > 0:
        # Walk left for as long as the surface does not rise.
        left = location - 1
        while left >= 0:
            if terrains[left] + waters[left] > terrains[left + 1] + waters[left + 1]:
                break
            left -= 1
        if terrains[left + 1] + waters[left + 1] < terrains[location] + waters[location]:
            location_to_pour = left + 1
            print('set by left', location_to_pour)
        else:
            # Nothing lower on the left, so walk right in the same way.
            right = location + 1
            while right < len(terrains):
                if terrains[right] + waters[right] > terrains[right - 1] + waters[right - 1]:
                    print('break, right: {}, right - 1:{}'.format(right, right - 1))
                    break
                right += 1
            if terrains[right - 1] + waters[right - 1] < terrains[location] + waters[location]:
                location_to_pour = right - 1
                print('set by right', location_to_pour)
            else:
                location_to_pour = location
                print('set to location', location_to_pour)
        waters[location_to_pour] += 1
        print(location_to_pour)
        water -= 1
    # Render the terrain row by row, from the top down.
    max_height = max(terrains)
    for height in range(max_height, -1, -1):
        for i in range(len(terrains)):
            if terrains[i] + waters[i] < height:
                print(' ', end=' ')
            elif terrains[i] < height <= terrains[i] + waters[i]:
                print('w', end=' ')
            else:
                print('+', end=' ')
        print('')
Since you have to generate and print out the array anyway, I'd probably opt for a recursive approach keeping to the O(rows*columns) complexity. Note each cell can be "visited" at most twice.
On a high level: first recurse down, then left, then right, then fill the current cell.
However, this runs into a little problem: (assuming this is a problem)
*w * * *
**ww* * instead of **ww*w*
This can be fixed by updating the algorithm to go left and right first to fill cells below the current row, then to go both left and right again to fill the current row. Let's say state = v means we came from above, state = h1 means it's the first horizontal pass, state = h2 means it's the second horizontal pass.
You might be able to avoid this repeated visiting of cells by using a stack, but it's more complex.
Pseudo-code:
array[][] // populated with towers, as shown in the question
visited[][] // starts with all false
// call at the position you're inserting water (at the very top)
define fill(x, y, state):
    if x or y out of bounds
          or array[x][y] == '*'
          or waterCount == 0
        return
    visited[x][y] = true
    // we came from above
    if state == v
        fill(x, y+1, v) // down
        fill(x-1, y, h1) // left , 1st pass
        fill(x+1, y, h1) // right, 1st pass
        fill(x-1, y, h2) // left , 2nd pass
        fill(x+1, y, h2) // right, 2nd pass
    // this is a 1st horizontal pass
    if state == h1
        fill(x, y+1, v) // down
        fill(x-1, y, h1) // left , 1st pass
        fill(x+1, y, h1) // right, 1st pass
        visited[x][y] = false // need to revisit cell later
        return // skip filling the current cell
    // this is a 2nd horizontal pass
    if state == h2
        fill(x-1, y, h2) // left , 2nd pass
        fill(x+1, y, h2) // right, 2nd pass
    // fill current cell
    if waterCount > 0
        array[x][y] = 'w'
        waterCount--
You have an array height with the height of the terrain in each column, so I would create a copy of this array (let's call it w for water) to indicate how high the water is in each column. Like this you also get rid of the problem not knowing how many rows to initialize when transforming into a grid and you can skip that step entirely.
The algorithm in Java code would look something like this:
public int[] getWaterHeight(int index, int drops, int[] heights) {
int[] w = Arrays.copyOf(heights, heights.length); // needs import java.util.Arrays
for (; drops > 0; drops--) {
int idx = index;
// go left first
while (idx > 0 && w[idx - 1] <= w[idx])
idx--;
// go right
for (;;) {
int t = idx + 1;
while (t < w.length && w[t] == w[idx])
t++;
if (t >= w.length || w[t] >= w[idx]) {
w[idx]++;
break;
} else { // we can go down to the right side here
idx = t;
}
}
}
return w;
}
Even though there are many loops, the complexity is only O(drops * columns). If you expect huge amount of drops then it could be wise to count the number of empty spaces in regard to the highest terrain point O(columns), then if the number of drops exceeds the free spaces, the calculation of the column heights becomes trivial O(1), however setting them all still takes O(columns).
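The shortcut for a very large number of drops can be sketched as follows, in Python rather than Java. The heights are the ones read off the label row of the question's figure (42123210431), and the function names are made up for illustration. It relies on the question's assumption that two very high walls enclose the histogram, so that water cannot escape sideways.

```python
def capacity(heights):
    # Free cells below the highest terrain point, counted in O(columns).
    top = max(heights)
    return sum(top - h for h in heights)


def fill_if_overflowing(heights, drops):
    """Return (water surface per column, leftover drops), or None when the
    drops do not fill the histogram and a per-drop simulation is needed."""
    cap = capacity(heights)
    if drops < cap:
        return None
    return [max(heights)] * len(heights), drops - cap


heights = [4, 2, 1, 2, 3, 2, 1, 0, 4, 3, 1]
print(capacity(heights))                 # matches "capacity:21" in Example 2
print(fill_if_overflowing(heights, 32))  # 11 drops are left to stack on top
```

In Example 2 the 11 leftover drops form exactly one extra full row of water above the 11 columns.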
You can iterate over the 2D grid from bottom to top, create a node for every horizontal run of connected cells, and then string these nodes together into a linked list that represents the order in which the cells are filled.
After row one, you have one horizontal run, with a volume of 1:
1(1)
In row two, you find three runs, one of which is connected to node 1:
1(1)->2(1) 3(1) 4(1)
In row three, you find three runs, one of which connects runs 2 and 3; run 3 is closest to the column where the water is added, so it comes first:
3(1)->1(1)->2(1)->5(3) 6(1) 4(1)->7(1)
In row four you find two runs, one of which connects runs 6 and 7; run 6 is closest to the column where the water is added, so it comes first:
3(1)->1(1)->2(1)->5(3)->8(4) 6(1)->4(1)->7(1)->9(3)
In row five you find a run which connects runs 8 and 9; they are on opposite sides of the column where the water is added, so the run on the left goes first:
3(1)->1(1)->2(1)->5(3)->8(4)->6(1)->4(1)->7(1)->9(3)->A(8)
Run A combines all the columns, so it becomes the last node and is given infinite volume; any excess drops will simply be stacked up:
3(1)->1(1)->2(1)->5(3)->8(4)->6(1)->4(1)->7(1)->9(3)->A(infinite)
then we fill the runs in the order in which they are listed, until we run out of drops.
That's my 20-minute solution. Each drop tells the client where it will stay, so the difficult part is done. (Copy and paste it into your IDE.) Only the printing remains to be done; the drops already take their positions. Take a look:
class Test2{
private static int[] heights = {3,4,4,4,3,2,1,0,4,2,1};
public static void main(String args[]){
int wAmount = 10;
int position = 2;
for(int i=0; i<wAmount; i++){
System.out.println(i+"#drop");
aDropLeft(position);
}
}
private static void aDropLeft(int position){
getHight(position);
int canFallTo = getFallPositionLeft(position);
if(canFallTo==-1){canFallTo = getFallPositionRight(position);}
if(canFallTo==-1){
stayThere(position);
return;
}
aDropLeft(canFallTo);
}
private static void stayThere(int position) {
System.out.print("Staying at: ");log(position);
heights[position]++;
}
//the position or -1 if it cant fall
private static int getFallPositionLeft(int position) {
int tempHeight = getHight(position);
int tempPosition = position;
//check left , if no, then check right
while(tempPosition>0){
if(tempHeight>getHight(tempPosition-1)){
return tempPosition-1;
}else tempPosition--;
}
return -1;
}
private static int getFallPositionRight(int position) {
int tempHeight = getHight(position);
int tempPosition = position;
while(tempPosition<heights.length-1){
if(tempHeight>getHight(tempPosition+1)){
return tempPosition+1;
}else if(tempHeight<getHight(tempPosition+1)){
return -1;
}else tempPosition++;
}
return -1;
}
private static int getHight(int position) {
return heights[position];
}
private static void log(int position) {
System.out.println("I am at position: " + position + " height: " + getHight(position));
}
}
Of course the code can be optimized, but that's my straightforward solution.
l=[0,1,0,2,1,0,1,3,2,1,2,1]
def findwater(l):
    w=0
    for i in range(0,len(l)-1):
        if i==0:
            pass
        else:
            num = min(max(l[:i]),max(l[i:]))-l[i]
            if num>0:
                w+=num
    return w
import pandas as pd
col_names=[1,2,3,4,5,6,7,8,9,10,11,12,13] #for visualization
bars=[4,0,2,0,1,0,4,0,5,0,3,0,1]
pd.DataFrame(dict(zip(col_names,bars)),index=range(1)).plot(kind='bar') # Plotting bars
def measure_water(l):
    water=0
    for i in range(len(l)-1): # iterate over bars (list)
        if i==0: # case to avoid max(:i) situation in case no item on left
            pass
        else:
            vol_at_curr_bar=min(max(l[:i]),max(l[i:]))-l[i] # min of the highest bar on each side, minus the current height
            if vol_at_curr_bar>0: # case to avoid any negative sum
                water+=vol_at_curr_bar
    return water
measure_water(bars)

Efficient way to generate a seemingly random permutation from a very large set without repeating?

I have a very large set (billions of elements or more; it is expected to grow exponentially up to some level), and I want to generate seemingly random elements from it without repeating. I know I can pick a random number, retry on repeats, and record the elements I have already generated, but that takes more and more memory as numbers are generated, and wouldn't be practical after a couple of million elements.
I mean, I could say 1, 2, 3 up to billions and each would be constant time without remembering all the previous ones, or I could say 1, 3, 5, 7, 9 and so on, then 2, 4, 6, 8, 10, but is there a more sophisticated way to do that and eventually get a seemingly random permutation of the set?
Update
1. The set does not change size during the generation process. I meant that when the user's input increases linearly, the size of the set increases exponentially.
2. In short, the set is like the set of every integer from 1 to 10 billion or more.
3. In more detail, it goes up to 10 billion because each element carries the information of many independent choices. For example, imagine an RPG character that has 10 attributes, each of which can go from 1 to 100 (for my problem, different choices can have different ranges). There are thus 10^20 possible characters; the number "10873456879326587345" would correspond to a character that has "11, 88, 35...", and I would like an algorithm that generates them one by one without repeating, but makes the order look random.
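The mapping from a number to a character described in point 3 is a mixed-radix decoding. A small Python sketch, assuming attribute values start at 1 and the first attribute is the most significant digit (the function name is made up for illustration):

```python
def decode(index, ranges):
    """Turn an index in [0, product of ranges) into one value per attribute."""
    values = []
    for size in reversed(ranges):  # peel off the least significant digit first
        index, digit = divmod(index, size)
        values.append(digit + 1)   # attributes run from 1 to size
    values.reverse()
    return values


print(decode(10873456879326587345, [100] * 10))
```

Because each attribute has its own entry in `ranges`, choices with different ranges work the same way, for example `decode(7, [2, 3, 4])`.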
Thanks for the interesting question. You can create a "pseudorandom"* (cyclic) permutation with a few bytes using modular exponentiation. Say we have n elements. Search for a prime p that's bigger than n+1. Then find a primitive root g modulo p. Basically by definition of a primitive root, the action x --> (g * x) % p is a cyclic permutation of {1, ..., p-1}. And so x --> ((g * (x+1))%p) - 1 is a cyclic permutation of {0, ..., p-2}. We can get a cyclic permutation of {0, ..., n-1} by applying the previous permutation again whenever it gives a value greater than or equal to n.
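A rough Python sketch of this construction, independent of the Go package mentioned next. Assumptions: it picks the smallest prime p with p - 1 >= n, which is enough for the construction to work, and it uses trial division and the smallest primitive root, which is fine for a demonstration but gives a fairly regular-looking sequence; a larger primitive root would look more random. All function names are made up for illustration.

```python
def is_prime(m):
    if m < 2:
        return False
    i = 2
    while i * i <= m:
        if m % i == 0:
            return False
        i += 1
    return True


def prime_factors(m):
    factors, d = set(), 2
    while d * d <= m:
        while m % d == 0:
            factors.add(d)
            m //= d
        d += 1
    if m > 1:
        factors.add(m)
    return factors


def primitive_root(p):
    if p == 2:
        return 1
    factors = prime_factors(p - 1)
    for g in range(2, p):
        # g is a primitive root iff g^((p-1)/q) != 1 for every prime q | p-1.
        if all(pow(g, (p - 1) // q, p) != 1 for q in factors):
            return g
    raise ValueError("no primitive root found")


def make_cycle(n):
    p = n + 1
    while not is_prime(p):
        p += 1
    g = primitive_root(p)

    def step(x):
        while True:
            x = (g * (x + 1)) % p - 1  # cyclic permutation of {0, ..., p-2}
            if x < n:                  # skip values outside {0, ..., n-1}
                return x
    return step


step = make_cycle(10)
x, seen = 0, []
for _ in range(10):
    seen.append(x)
    x = step(x)
print(seen)  # every element of range(10) exactly once
```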
I implemented this idea as a Go package. https://github.com/bwesterb/powercycle
package main
import (
"fmt"
"github.com/bwesterb/powercycle"
)
func main() {
var x uint64
cycle := powercycle.New(10)
for i := 0; i < 10; i++ {
fmt.Println(x)
x = cycle.Apply(x)
}
}
This outputs something like
0
6
4
1
2
9
3
5
8
7
but that might of course vary depending on the generator chosen.
It's fast, but not super-fast: on my five year old i7 it takes less than 210ns to compute one application of a cycle on 1000000000000000 elements. More details:
BenchmarkNew10-8 1000000 1328 ns/op
BenchmarkNew1000-8 500000 2566 ns/op
BenchmarkNew1000000-8 50000 25893 ns/op
BenchmarkNew1000000000-8 200000 7589 ns/op
BenchmarkNew1000000000000-8 2000 648785 ns/op
BenchmarkApply10-8 10000000 170 ns/op
BenchmarkApply1000-8 10000000 173 ns/op
BenchmarkApply1000000-8 10000000 172 ns/op
BenchmarkApply1000000000-8 10000000 169 ns/op
BenchmarkApply1000000000000-8 10000000 201 ns/op
BenchmarkApply1000000000000000-8 10000000 204 ns/op
Why did I say "pseudorandom"? Well, we are always creating a very specific kind of cycle: namely one that uses modular exponentiation. It looks pretty pseudorandom though.
I would use a random number and swap it with an element at the beginning of the set.
Here's some pseudo code
set = [1, 2, 3, 4, 5, 6]
picked = 0
Function PickNext(set, picked)
If picked > Len(set) - 1 Then
Return Nothing
End If
// random number between picked (inclusive) and length (exclusive)
r = RandomInt(picked, Len(set))
// swap the picked element to the beginning of the set
result = set[r]
set[r] = set[picked]
set[picked] = result
// update picked
picked++
// return your next random element
Return result
End Function
Every time you pick an element there is one swap and the only extra memory being used is the picked variable. The swap can happen if the elements are in a database or in memory.
EDIT Here's a jsfiddle of a working implementation http://jsfiddle.net/sun8rw4d/
JavaScript
var set = [];
set.picked = 0;
function pickNext(set) {
if(set.picked > set.length - 1) { return null; }
var r = set.picked + Math.floor(Math.random() * (set.length - set.picked));
var result = set[r];
set[r] = set[set.picked];
set[set.picked] = result;
set.picked++;
return result;
}
// testing
for(var i=0; i<100; i++) {
set.push(i);
}
while(pickNext(set) !== null) { }
document.body.innerHTML += set.toString();
EDIT 2 Finally, a random binary walk of the set. This can be accomplished with O(log2(N)) stack space (memory), which for 10 billion is only about 33 levels. There's no shuffling or swapping involved. Using ternary instead of binary might yield even better pseudo-random results.
// on the fly set generator
var count = 0;
var maxValue = 64;
function nextElement() {
// restart the generation
if(count == maxValue) {
count = 0;
}
return count++;
}
// code to pseudo randomly select elements
var current = 0;
var stack = [0, maxValue - 1];
function randomBinaryWalk() {
if(stack.length == 0) { return null; }
var high = stack.pop();
var low = stack.pop();
var mid = ((high + low) / 2) | 0;
// pseudo randomly choose the next path
if(Math.random() > 0.5) {
if(low <= mid - 1) {
stack.push(low);
stack.push(mid - 1);
}
if(mid + 1 <= high) {
stack.push(mid + 1);
stack.push(high);
}
} else {
if(mid + 1 <= high) {
stack.push(mid + 1);
stack.push(high);
}
if(low <= mid - 1) {
stack.push(low);
stack.push(mid - 1);
}
}
// how many elements to skip
var toMid = (current < mid ? mid - current : (maxValue - current) + mid);
// skip elements
for(var i = 0; i < toMid - 1; i++) {
nextElement();
}
current = mid;
// get result
return nextElement();
}
// test
var result;
var list = [];
do {
result = randomBinaryWalk();
list.push(result);
} while(result !== null);
document.body.innerHTML += '<br/>' + list.toString();
Here's the results from a couple of runs with a small set of 64 elements. JSFiddle http://jsfiddle.net/yooLjtgu/
30,46,38,34,36,35,37,32,33,31,42,40,41,39,44,45,43,54,50,52,53,51,48,47,49,58,60,59,61,62,56,57,55,14,22,18,20,19,21,16,15,17,26,28,29,27,24,25,23,6,2,4,5,3,0,1,63,10,8,7,9,12,11,13
30,14,22,18,16,15,17,20,19,21,26,28,29,27,24,23,25,6,10,8,7,9,12,13,11,2,0,63,1,4,5,3,46,38,42,44,45,43,40,41,39,34,36,35,37,32,31,33,54,58,56,55,57,60,59,61,62,50,48,49,47,52,51,53
As I mentioned in my comment, unless you have an efficient way to skip to a specific point in your "on the fly" generation of the set this will not be very efficient.
If the set is enumerable, use a pseudo-random integer generator whose period is adjusted to 0 .. 2^n - 1, where the upper bound is just greater than the size of your set, and generate pseudo-random integers, discarding those that are not smaller than the size of your set. Use those integers to index items from your set.
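One generator with that property is a linear congruential generator with a power-of-two modulus: by the Hull-Dobell theorem it has full period when the increment is odd and the multiplier is congruent to 1 modulo 4. The Python sketch below is only one possible choice, not necessarily what this answer had in mind; the constants are the widely used 1664525 and 1013904223, and the function name is made up for illustration. The low bits of such a generator are not very random, so treat it as "seemingly random" at best.

```python
def lcg_order(size, seed=0, a=1664525, c=1013904223):
    """Yield every integer in range(size) exactly once, in scrambled order."""
    m = 1
    while m < size:  # smallest power of two >= size
        m *= 2
    x = seed % m
    for _ in range(m):  # full period: each residue modulo m appears once
        if x < size:    # discard values outside the set
            yield x
        x = (a * x + c) % m


order = list(lcg_order(1000, seed=42))
print(order[:10])
```

Only the current state `x` has to be remembered between draws, and fewer than half of the generated values are discarded.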
Pre- compute yourself a series of indices (e.g. in a file), which has the properties you need and then randomly choose a start index for your enumeration and use the series in a round-robin manner.
The length of your pre-computed series should be > the maximum size of the set.
If you combine this (depending on your programming language etc.) with file mappings, your final nextIndex(INOUT state) function is (nearly) as simple as return mappedIndices[state++ % PERIOD];, if you have a fixed size of each entry (e.g. 8 bytes -> uint64_t).
Of course, the returned value could be > your current set size. Simply draw indices until you get one which is <= your sets current size.
Update (In response to question-update):
There is another option to achieve your goal if it is about creating 10 billion unique characters in your RPG: generate a GUID and write yourself a function which computes your number from the GUID. See man uuid if you are on a Unix system; otherwise, search for it. Some parts of the UUID are not random but contain meta-info, and some parts are either systematic (such as your network card's MAC address) or random, depending on the generator algorithm. But they are very likely to be unique. So, whenever you need a new unique number, generate a UUID and transform it to your number by means of some algorithm which maps the UUID bytes to your number in a non-trivial way (e.g. using hash functions).
