Random index distribution weirdness

I stumbled onto this while trying to do a biased random sample from some data. It seems a simple distribution fitted to x^2 is what I'm looking for, but there's an artefact here I can't quite wrap my head around.
Here's a snippet of a for loop selecting an index into an array, distributed by x^2, and then incrementing the counter at that index position.
package main

import (
    "fmt"
    "math"
    "math/rand"
    "time"
)

func main() {
    rand.Seed(time.Now().UTC().UnixNano())
    var arr [10]int
    for i := 0; i < 5000; i++ {
        rnd := rand.Float64()
        tmp := rnd * rnd * 9
        index := int(math.Floor(tmp + .5))
        arr[index]++
    }
    fmt.Printf("%v", arr)
}
No matter the bounds or the number of iterations, when I plot the values the graph always comes out looking like this, with a noticeable "drop" at the end.
This is what I have trouble understanding. Shouldn't the indexes fit the curve all the way?
I suspect something related to the rounding, but I'm grasping at straws at the moment.

The problem is that your distribution has the range [0,1], and then you multiply it by 9, making the range [0,9], then you add 0.5, which makes the range [0.5, 9.5].
Not only is there a noticeable drop in the last index value, there is also a less noticeable drop in the first index value, since each of those edge buckets is only half filled.
Have you considered simply multiplying by 10 rather than 9?
tmp := rnd * rnd * 10
And then leaving off the + 0.5 in your Floor?
index := int(math.Floor(tmp))
That produces a distribution like you would expect; here are a few results for a loop going to 500,000:
[157949 65411 50239 42599 37637 33706 31200 28789 26927 25543]
[158302 65533 49712 42480 37347 33882 30987 28696 27225 25836]
[157824 65627 50432 42328 37307 33900 30787 29006 26975 25814]
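For completeness, here is the asker's loop with that fix applied - a minimal sketch; the 10 buckets and 500,000 iterations match the results above:
package main

import (
    "fmt"
    "math"
    "math/rand"
    "time"
)

func main() {
    rand.Seed(time.Now().UTC().UnixNano())
    var arr [10]int
    for i := 0; i < 500000; i++ {
        rnd := rand.Float64()
        // rnd*rnd is in [0,1), so scaling by 10 and flooring
        // yields indexes 0..9, each backed by a full-width interval.
        index := int(math.Floor(rnd * rnd * 10))
        arr[index]++
    }
    fmt.Printf("%v\n", arr)
}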

First, your X-scale is misleading, as it starts from 1 and ends with 10. Should be 0...9.
Assuming that were fixed, your distribution is fully correct, though maybe not what you intended (what did you actually want?).
You first have a distribution between 0 and 9, both inclusive. If you add 0.5 and then round down, ask yourself how many hits each index can actually "get".
A: Most indexes get a "full set", with decimal values between 1 and 2 (or 6 and 7, or any other full-width interval) getting rounded down to 1 (or 6, or any other index)
EXCEPT
The edge indexes 0 and 9 only get a "half full set".
Because you offset index 0...1 to 0.5...1.5 and round down, only half of this range will then remain for index = 0, i.e. values between 0.5 and 1 (as there are no longer any between 0 and 0.5).
Same with the other end: you offset 8...9 to 8.5...9.5 and then round down. Index 9 only gets half, i.e. offset values between 9 and 9.5.
The left end of your chart is actually lower than you probably expected, though it's not as distinguishable as the right end.
The numbers are indeed sometimes surprising :-).

Related

Hacker Earth's Monk and Rotation: time limit exceeded

I am trying to solve the following question from Hacker Earth, Monk and Rotation.
Please refer to the link: https://www.hackerearth.com/practice/codemonk/ .
It's a beginner-level problem which I managed to solve, but unfortunately I am facing a Time Limit Exceeded issue for Input #5.
I have tried to optimize the solution, and I think I managed to do so to some extent compared with my crude initial solution, but sadly that's not enough.
I suspect the flaw lies in how I handle the input, and that it could be the bottleneck.
It may also have to do with how I am handling slices, but I don't see a better way than that. Before optimizing I was certain that slices were the culprit.
Edit:
I was not aware the question is not accessible without login; here is a brief explanation.
Input:
The first line consists of one integer T, denoting the number of test cases. For each test case:
The first line consists of two integers N and K, N being the number of elements in the array and K denoting the number of steps of rotation.
The next line consists of N space-separated integers, denoting the elements of the array A.
Output:
Print the required array.
Constraints:
1 ≤ T ≤ 20
1 ≤ N ≤ 10^5
0 ≤ K ≤ 10^6
0 ≤ A[i] ≤ 10^6
Sample Input:
1
5 2
1 2 3 4 5
Sample Output:
4 5 1 2 3
Explanation
Here T is 1, which means one test case.
N = 5 denoting the number of elements in the array and K = 2, denoting the number of steps of rotations.
The initial array is 1,2,3,4,5:
In first rotation, 5 will come in the first position and all other elements will move to one position ahead from their current position. Now, the resultant array will be 5,1,2,3,4.
In second rotation, 4 will come in the first position and all other elements will move to one position ahead from their current position. Now, the resultant array will be 4,5,1,2,3
Time Limit: 1.0 sec(s) for each input file (2x for Golang)
Memory Limit: 256 MB
Source Limit: 1024 KB.
package main

import (
    "fmt"
)

func main() {
    var testCases int
    var arrSize, rotation int
    fmt.Scanln(&testCases)
    for i := 0; i < testCases; i++ {
        fmt.Scanln(&arrSize, &rotation)
        arr := make([]int, arrSize)
        for i := range arr {
            fmt.Scanf("%d", &arr[i])
        }
        rotation = rotation % arrSize
        for i := arrSize - rotation; i < arrSize; i++ {
            fmt.Printf("%v ", arr[i])
        }
        for i := 0; i < arrSize-rotation; i++ {
            fmt.Printf("%v ", arr[i])
        }
        fmt.Print("\n")
    }
}
The time complexity of your solution is already O(n), so I would say that your solution is "correct". You could probably optimize by not constructing an array and not reading number after number with fmt.Scanf: just read the whole line, find the index where to split the string, and print the two substrings in reverse order, as sketched below.
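Here is a minimal sketch of that idea, assuming fmt.Scanf on os.Stdin is indeed the bottleneck; it reads each array line whole with bufio and splits it with strings.Fields instead of parsing N integers (the variable names are mine, not from the original):
package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

func main() {
    reader := bufio.NewReader(os.Stdin)
    writer := bufio.NewWriter(os.Stdout)
    defer writer.Flush()

    var testCases int
    fmt.Fscan(reader, &testCases)
    for ; testCases > 0; testCases-- {
        var n, k int
        fmt.Fscan(reader, &n, &k)
        reader.ReadString('\n') // consume the rest of the N/K line
        line, _ := reader.ReadString('\n')
        fields := strings.Fields(line) // the N numbers, kept as strings
        k %= n
        // print the last k fields, then the first n-k, with no int conversion
        writer.WriteString(strings.Join(fields[n-k:], " "))
        if k > 0 && k < n {
            writer.WriteByte(' ')
        }
        writer.WriteString(strings.Join(fields[:n-k], " "))
        writer.WriteByte('\n')
    }
}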

How to generate random number in range of truncated normal distribution

I need to generate a value in the range of a truncated normal distribution. For example, in Python you could use scipy.stats.truncnorm() to make
def get_truncated_normal(mean=.0, sd=1., low=.0, upp=10.):
    return truncnorm((low - mean) / sd, (upp - mean) / sd, loc=mean, scale=sd)
as described here.
Is there any package to do something like this in Go, or should I write the function myself?
I tried the following, going by what the docs say, but it produces numbers outside the needed range:
func GenerateTruncatedNormal(mean, sd uint64) float64 {
    return rand.NormFloat64() * (float64)(sd + mean)
}
GenerateTruncatedNormal(10, 5)
produces 16.61, -14.54, or even 32.8, but I expect only a small chance of getting 15, since mean = 10 and sd = 5 makes 10 + 5 = 15 the maximum value we should get. What is wrong here? 😅
One way of achieving this consists of
generating a number x from the Normal distribution, with the desired parameters for mean and standard deviation,
if it's outside the range [low..high], then throw it away and try again.
This respects the Probability Density Function of the Normal distribution, effectively cutting out the left and right tails.
func TruncatedNormal(mean, stdDev, low, high float64) float64 {
    if low >= high {
        panic("high must be greater than low")
    }
    for {
        x := rand.NormFloat64()*stdDev + mean
        if low <= x && x < high {
            return x
        }
        // fmt.Println("missed!", x)
    }
}
Playground
If the [low..high] interval is very narrow, it will take a bit more computation time, as more of the generated numbers get thrown away. However, it still converges very fast in practice.
I checked the code above by plotting its results and comparing them to the results of scipy's truncnorm, and they do produce equivalent charts.
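To tie it back to the question's numbers (mean = 10, sd = 5, maximum around 15): a usage sketch, with the function repeated so the snippet is self-contained, and the [0, 15) bounds being my guess at the intended range:
package main

import (
    "fmt"
    "math/rand"
    "time"
)

// TruncatedNormal is the rejection-sampling function from the answer above,
// repeated here so the snippet compiles on its own.
func TruncatedNormal(mean, stdDev, low, high float64) float64 {
    for {
        x := rand.NormFloat64()*stdDev + mean
        if low <= x && x < high {
            return x
        }
    }
}

func main() {
    rand.Seed(time.Now().UnixNano())
    for i := 0; i < 5; i++ {
        // every sample lands in [0, 15), however far the normal's tails reach
        fmt.Println(TruncatedNormal(10, 5, 0, 15))
    }
}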

How can I acquire a certain "random range" in a higher frequency?

My question is basically, "how can I obtain certain random values within a specific range more than random values outside the range?"
Allow me to demonstrate what I mean:
If I were to, over a good number of trials, start picking a variety of random numbers from 1-10, I should be seeing more numbers in the 7-10 range than in the 1-6 range.
I tried a couple of ways, but I am not getting desirable results.
First Function:
function getAverage(i)
    math.randomseed(os.time())
    local sum = 0;
    for j = 1,i do
        sum = sum + (1-math.random()^3)*10
    end
    print(sum/i)
end
getAverage(500)
I was constantly getting numbers only around 7.5, such as 7.48 and 7.52. Although this does indeed get me a number within my range, I don't want such strict consistency.
Second Function:
function getAverage(i)
    math.randomseed(os.time())
    local sum = 0;
    for j = 1,i do
        sum = sum + (math.random() > .3 and math.random(7,10) or math.random(1,6))
    end
    print(sum/i)
end
getAverage(500)
This function didn't work as I wanted it to either. I was primarily getting numbers such as 6.8 and 7.2, but nothing even close to 8.
Third Function:
function getAverage(i)
    math.randomseed(os.time())
    local sum = 0;
    for j = 1,i do
        sum = sum + (((math.random(10) * 2)/1.2)^1.05) - math.random(1,3)
    end
    print(sum/i)
end
getAverage(500)
This function was giving me slightly more favorable results, with the function consistently returning 8, but that is the issue - consistency.
What type of paradigms or practical solutions can I use to generate more random numbers within a specific range over another range?
I have labeled this as Lua, but a solution in any language that is understandable is acceptable.
I don't want such strict consistency.
What does that mean?
If you average a very large number of values in a given range from any RNG, you should expect that to produce the same number. That means each of the numbers in the range was equally likely to appear.
This function didn't work as I wanted it to either. I was primarily getting numbers such as 6.8 and 7.2 but nothing even close to 8.
You have to clarify what "didn't work" means. Why would you expect it to give you 8? You can see it won't just by looking at the formula you used.
For instance, if you'd used math.random(1,10), assuming all numbers in the range have an equal chance of appearing, you should expect the average to be 5.5, dead in the middle of 1 and 10 (because (1+2+3+4+5+6+7+8+9+10)/10 = 5.5).
You used math.random() > .3 and math.random(7,10) or math.random(1,6), which is saying 70% of the time to give 7, 8, 9, or 10 (average = 8.5) and 30% of the time to give you 1, 2, 3, 4, 5, or 6 (average = 3.5). That should give you an overall average of 7 (because 3.5 * .3 + 8.5 * .7 = 7). If you bump up your sample size, that's exactly what you'll see. You're seeing values on either side because your sample size is so small (try bumping it up to 100000).
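A quick simulation bears that out. This sketch is in Go (the language used elsewhere on this page) rather than Lua, and reproduces the question's 70/30 mixture:
package main

import (
    "fmt"
    "math/rand"
)

func main() {
    const trials = 1000000
    sum := 0.0
    for i := 0; i < trials; i++ {
        if rand.Float64() > 0.3 {
            sum += float64(rand.Intn(4) + 7) // 7..10, average 8.5
        } else {
            sum += float64(rand.Intn(6) + 1) // 1..6, average 3.5
        }
    }
    fmt.Println(sum / trials) // settles near 0.7*8.5 + 0.3*3.5 = 7.0
}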
I've made skewed random values before by simply generating two random numbers in the range, and then picking the largest (or smallest). This skews the probability towards the high (or low) endpoint.
Picking the smallest of two gives you a linear probability distribution.
Picking the smallest of three gives you a parabolic distribution (more selectivity, less probability at "the other end"). For my needs, a linear distribution was fine.
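A sketch of that trick, again in Go rather than Lua, using the question's 1-10 range and taking the max of two draws to favor the high end:
package main

import (
    "fmt"
    "math/rand"
)

// skewedHigh returns a number in [1,10] that favors the high end:
// the max of two uniform picks has a linearly increasing distribution.
func skewedHigh() int {
    a := rand.Intn(10) + 1
    b := rand.Intn(10) + 1
    if a > b {
        return a
    }
    return b
}

func main() {
    counts := make([]int, 11)
    for i := 0; i < 100000; i++ {
        counts[skewedHigh()]++
    }
    fmt.Println(counts[1:]) // counts rise roughly linearly from 1 to 10
}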
Not exactly what you wanted, but maybe it's good enough.
Have fun!

How to read all 1's in an array of 1's and 0's spread all over the array randomly

I have an array with 1's and 0's spread over it randomly.
int arr[N] = {1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,1....................N}
Now I want to retrieve all the 1's in the array as fast as possible, but the condition is that I should not lose the exact (index-based) position of each element, so sorting is not a valid option.
So the only option left is linear searching, i.e. O(n). Is there anything better than this?
The main problem with the linear scan is that I need to run the scan X times. So I feel I need some other data structure which maintains this list once the first linear scan happens, so that I need not run the linear scan again and again.
Let me be clear about the final expectations:
I just need to find the number of 1's in a certain range of the array; for example, the number of 1's within the range 40-100. The range can be arbitrary, and I need to find the count of 1's within it. I can't just precompute a single sum, as I would need to iterate over the array over and over again for the different range requirements.
I'm surprised you considered sorting as a faster alternative to linear search.
If you don't know where the ones occur, then there is no better way than linear searching. Perhaps if you used bits or char datatypes you could do some optimizations, but it depends on how you want to use this.
The best optimization that you could do on this is to avoid branch misprediction. Because each value is zero or one, you can use it directly to advance the index into the array that stores the one-indices.
Simple approach:
int end = 0;
int indices[N];
for( int i = 0; i < N; i++ )
{
    if( arr[i] ) indices[end++] = i; // Slow due to branch misprediction
}
Without branching:
int end = 0;
int indices[N];
for( int i = 0; i < N; i++ )
{
    indices[end] = i;   // always write the candidate index...
    end += arr[i];      // ...but only advance when arr[i] == 1 (no branch)
}
[edit] I tested the above, and found the version without branching was almost 3 times faster (4.36s versus 11.88s for 20 repeats on a randomly populated 100-million element array).
Coming back here to post results, I see you have updated your requirements. What you want is really easy with a dynamic programming approach...
All you do is create a new array that is one element larger, which stores the number of ones from the beginning of the array up to (but not including) the current index.
arr : 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 1
count : 0 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 4 5 6 6 6 6 7
(I've offset arr above so it lines up better)
Now you can compute the number of 1s in any range in O(1) time. To compute the number of 1s between index A and B, you just do:
int num = count[B+1] - count[A];
Obviously you can still use the non-branch-prediction version to generate the counts initially. All this should give you a pretty good speedup over the naive approach of summing for every query:
int *count = new int[N+1];
int total = 0;
count[0] = 0;
for( int i = 0; i < N; i++ )
{
    total += arr[i];
    count[i+1] = total;
}

// to compute the ranged sum:
int range_sum( int *count, int a, int b )
{
    if( b < a ) return range_sum( count, b, a ); // swap to keep a <= b
    return count[b+1] - count[a];
}
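For the question's concrete case, counting the 1s in positions 40 through 100 is then just range_sum(count, 40, 100).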
Well, a one-time linear scan is fine. Since you are looking for multiple scans across ranges of the array, I think each of those can then be done in constant time. Here you go:
Scan the array and create a bitmap where the key = the key of the array = the sequence (1,2,3,4,5,6....). The value stored in the bitmap would be a tuple <isOne, cumulativeSum>, where isOne is whether you have a one there and cumulativeSum is the running count of 1's as and when you encounter them.
Array = 1 1 0 0 1 0 1 1 1 0 1 0
Tuple: (1,1) (1,2) (0,2) (0,2) (1,3) (0,3) (1,4) (1,5) (1,6) (0,6) (1,7) (0,7)
CASE 1: When the element at the lower bound is a 0. Number of 1's in [6,11] =
cumulativeSum at 11th position - cumulativeSum at 6th position = 7 - 3 = 4
CASE 2: When the element at the lower bound is a 1. Number of 1's in [2,11] =
cumulativeSum at 11th position - cumulativeSum at 2nd position + 1 = 7-2+1 = 6
Step 1 is O(n)
Step 2 is O(1)
The total complexity is linear, no doubt, but for your task, where you have to work with the ranges several times, the above algorithm seems better, provided you have ample memory :)
Does it have to be a simple linear array data structure? Or can you create your own data structure which happens to have the desired properties, for which you're able to provide the required API, but whose implementation details can be hidden (encapsulated)?
If you can implement your own and if there is some guaranteed sparsity (to either 1s or 0s) then you might be able to offer better than linear performance. I see that you want to preserve (or be able to regenerate) the exact stream, so you'll have to store an array or bitmap or run-length encoding for that. (RLE will be useless if the stream is actually random rather than arbitrary but could be quite useful if there are significant sparsity or patterns with long strings of one or the other. For example a black&white raster of a bitmapped image is often a good candidate for RLE).
Let's say you're guaranteed that the stream will be sparse --- that no more than 10%, for example, of the bits will be 1s (or, conversely, that more than 90% will be). If that's the case then you might model your solution on an RLE and maintain a count of all 1s (simply incremented as you set bits and decremented as you clear them). If there might be a need to quickly get the number of set bits for arbitrary ranges of these elements then instead of a single counter you can have a conveniently sized array of counters for partitions of the stream. (Conveniently sized, in this case, means something which fits easily within memory, within your caches, or register sets, but which offers a reasonable trade-off between computing a sum over all the partitions fully within the range and the linear scan.) The result for any arbitrary range is the sum of all the partitions fully enclosed by the range plus the results of linear scans for any fragments that are not aligned on your partition boundaries.
For a very, very, large stream you could even have a multi-tier "index" of partition sums --- traversing from the largest (most coarse) granularity down toward the "fragments" to either end (using the next layer of partition sums) and finishing with the linear search of only the small fragments.
Obviously such a structure represents trade offs between the complexity of building and maintaining the structure (inserting requires additional operations and, for an RLE, might be very expensive for anything other than appending/prepending) vs the expense of performing arbitrarily long linear search/increment scans.
If:
the purpose is to be able to find the number of 1s in the array at any time,
given that relatively few of the values in the array might change between one moment when you want to know the number and another moment, and
if you have to find the number of 1s in a changing array of n values m times,
... you can certainly do better than examining every cell in the array m times by using a caching strategy.
The first time you need the number of 1s, you certainly have to examine every cell, as others have pointed out. However, if you then store the number of 1s in a variable (say sum) and track changes to the array (by, for instance, requiring that all array updates occur through a specific update() function), every time a 0 is replaced in the array with a 1, the update() function can add 1 to sum and every time a 1 is replaced in the array with a 0, the update() function can subtract 1 from sum.
Thus, sum is always up-to-date after the first time that the number of 1s in the array is counted and there is no need for further counting.
(EDIT to take the updated question into account)
If the need is to return the number of 1s in a given range of the array, that can be done with a slightly more sophisticated caching strategy than the one I've just described.
You can keep a count of the 1s in each subset of the array and update the relevant subset count whenever a 0 is changed to a 1 or vice versa within that subset. Finding the total number of 1s in a given range within the array would then be a matter of adding the number of 1s in each subset that is fully contained within the range and then counting the number of 1s that are in the range but not in the subsets that have already been counted.
Depending on circumstances, it might be worthwhile to have a hierarchical arrangement in which (say) the number of 1s in the whole array is at the top of the hierarchy, the number of 1s in each 1/q th of the array is in the second level of the hierarchy, the number of 1s in each 1/(q^2) th of the array is in the third level of the hierarchy, etc. e.g. for q = 4, you would have the total number of 1s at the top, the number of 1s in each quarter of the array at the second level, the number of 1s in each sixteenth of the array at the third level, etc.
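A minimal sketch of the single-level version of that idea, in Go for consistency with the rest of the page; the block size of 64 and the Set/CountOnes API are my own choices, not something from the question:
package main

import "fmt"

const blockSize = 64

type BitCounter struct {
    bits   []int // the underlying 0/1 array
    blocks []int // number of 1's in each blockSize-wide partition
}

func NewBitCounter(bits []int) *BitCounter {
    c := &BitCounter{bits: bits, blocks: make([]int, (len(bits)+blockSize-1)/blockSize)}
    for i, b := range bits {
        c.blocks[i/blockSize] += b
    }
    return c
}

// Set updates one cell and keeps the partition count in sync.
func (c *BitCounter) Set(i, v int) {
    c.blocks[i/blockSize] += v - c.bits[i]
    c.bits[i] = v
}

// CountOnes returns the number of 1's in [a, b).
func (c *BitCounter) CountOnes(a, b int) int {
    sum := 0
    for i := a; i < b; {
        if i%blockSize == 0 && i+blockSize <= b {
            sum += c.blocks[i/blockSize] // whole partition at once
            i += blockSize
        } else {
            sum += c.bits[i] // unaligned fragment: linear scan
            i++
        }
    }
    return sum
}

func main() {
    arr := make([]int, 200)
    arr[5], arr[42], arr[100] = 1, 1, 1
    c := NewBitCounter(arr)
    fmt.Println(c.CountOnes(40, 101)) // 2
    c.Set(42, 0)
    fmt.Println(c.CountOnes(40, 101)) // 1
}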
Are you using C (or a derived language)? If so, can you control the encoding of your array? If, for example, you could use a bitmap, you could count using a lookup table to sum the bits of each byte; if your subrange ends aren't divisible by 8, you'll have to deal with the partial end bytes specially, but the speedup will be significant.
If that's not the case, can you at least encode the values as single bytes? In that case, you may be able to exploit sparseness if it exists (more specifically, the hope that there are often multi-index swaths of zeros).
So for:
u8 input = {1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,1....................N};
You can write something like (untested):
uint countBytesBy1FromTo(u8 *input, uint start, uint stop)
{   // function for counting one byte at a time; use with ranges of less than 4,
    // use the function below for longer ranges
    // assume it's just ones and zeros, otherwise we have to test/branch
    uint sum = 0;
    u8 *end = input + stop;
    for (u8 *each = input + start; each < end; each++)
        sum += *each;
    return sum;
}

uint countBytesBy8FromTo(u8 *input, uint start, uint stop)
{
    u64 *chunks = (u64*)(input + start);
    u64 *end = chunks + ((stop - start) >> 3);
    // first count the tail bytes that don't fill a whole 8-byte chunk
    uint sum = countBytesBy1FromTo((u8*)end, 0, (uint)((input + stop) - (u8*)end));
    for (; chunks < end; chunks++)
    {
        if (*chunks) // only descend into chunks containing at least one 1
        {
            sum += countBytesBy1FromTo((u8*)chunks, 0, 8);
        }
    }
    return sum;
}
The basic trick is exploiting the ability to cast slices of your target array to single entities your language can look at in one swoop, and to test with one comparison whether ANY of its bytes are nonzero, skipping the whole block when they are all zero. The more zeros, the better it will work. In the case where your large cast integer always has at least one 1 in it, this approach just adds overhead. You might find that using a u32 is better for your data, or that adding a u32 test between the 1-byte and 8-byte levels helps. For datasets where zeros are much more common than ones, I've used this technique to great advantage.
Why is sorting invalid? You can clone the original array, sort the clone, and count and/or mark the locations of the 1s as needed.

How do I get an unbiased random sample from a really huge data set?

For an application I'm working on, I need to sample a small set of values from a very large data set, on the order of a few hundred taken from about 60 trillion (and growing).
Usually I use the technique of seeing if a uniform random number r (0..1) is less than S/T, where S is the number of sample items I still need, and T is the number of items in the set that I haven't considered yet.
However, with this new data, I don't have time to roll the die for each value; there are too many. Instead, I want to generate a random number of entries to "skip", pick the value at the next position, and repeat. That way I can just roll the die and access the list S times. (S is the size of the sample I want.)
I'm hoping there's a straightforward way to do that and create an unbiased sample, along the lines of the S/T test.
To be honest, approximately unbiased would be OK.
This is related (more or less a follow-on) to this person's question:
https://math.stackexchange.com/questions/350041/simple-random-sample-without-replacement
One more side question... the person who first showed this to me called it the "mailman's algorithm", but I'm not sure if he was pulling my leg. Is that right?
How about this:
precompute S random numbers from 0 to the size of your dataset.
order your numbers, low to high
store the difference between consecutive numbers as the skip size
iterate through the large dataset using the skip sizes above.
...The assumption being the order you collect the samples doesn't matter
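A sketch of those steps in Go (the 60 trillion figure is from the question; note this samples with replacement, so duplicate positions would need to be handled separately):
package main

import (
    "fmt"
    "math/rand"
    "sort"
)

// samplePositions draws s random positions in [0, t) and sorts them,
// so consecutive differences can be used as skip distances.
func samplePositions(s int, t int64) []int64 {
    positions := make([]int64, s)
    for i := range positions {
        positions[i] = rand.Int63n(t)
    }
    sort.Slice(positions, func(i, j int) bool { return positions[i] < positions[j] })
    return positions
}

func main() {
    positions := samplePositions(5, 60_000_000_000_000)
    prev := int64(0)
    for _, p := range positions {
        fmt.Printf("skip %d records, take one\n", p-prev)
        prev = p + 1
    }
}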
So I thought about it, and got some help from http://math.stackexchange.com
It boils down to this:
If I picked n items randomly all at once, where would the first one land? That is, min({r_1 ... r_n}). A helpful fellow at math.stackexchange boiled it down to this equation:
x = 1 - (1 - r) ** (1 / n)
that is, the CDF of the minimum is 1 minus (1 - x) to the nth power; set it equal to r and solve for x. Pretty easy.
If I generate a uniform random number and plug it in for r, this is distributed the same as min({r_1 ... r_n}) -- the same way that the lowest item would fall. Voila! I've just simulated picking the first item as if I had randomly selected all n.
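Spelling out the algebra behind that formula (the same inverse-CDF reasoning as in the math.stackexchange answer): for n independent uniforms, P(min > x) = (1 - x) ** n, so the CDF of the minimum is F(x) = 1 - (1 - x) ** n. Inverse-transform sampling sets r = F(x) and solves for x:
r = 1 - (1 - x) ** n
(1 - x) ** n = 1 - r
x = 1 - (1 - r) ** (1 / n)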
So I skip over that many items in the list, pick that one, and then....
Repeat until n is 0
That way, if I have a big database (like Mongo), I can skip, find_one, skip, find_one, etc. Until I have all the items I need.
The only problem I'm having is that my implementation favors the first and last element in the list. But I can live with that.
In Python 2.7, my implementation looks like:
import numpy
import pprint

def skip(n):
    """
    Produce a random number with the same distribution as
    min({r_0, ... r_n}) to see where the next smallest one is
    """
    r = numpy.random.uniform()
    return 1.0 - (1.0 - r) ** (1.0 / n)

def sample(T, n):
    """
    Take n items from a list of size T
    """
    t = T
    i = 0
    while t > 0 and n > 0:
        s = skip(n) * (t - n + 1)
        i += s
        yield int(i) % T
        i += 1
        t -= s + 1
        n -= 1

if __name__ == '__main__':
    t = [0] * 100
    for c in xrange(10000):
        for i in sample(len(t), 10):
            t[i] += 1  # this is where we would read value i
    pprint.pprint(t)
