PyMC3: How to create multiple random walks?

I want to create several random walk variables in PyMC3, or rather, stack several random walks into a single variable. I know I can create a single random walk of 100 steps like this:
with pm.Model() as model:
    z = pm.GaussianRandomWalk('z', mu=0, sd=1, shape=100)
But let's say I want to create 20 random walks, each of length 100. If I write this,
with pm.Model() as model:
    z = pm.GaussianRandomWalk('z', mu=0, sd=1, shape=(20, 100))
Does that make every row a random walk, so I have 20 instances of 100 steps each? Or does it make every column a random walk, so I have 100 instances of 20 steps each?

You will have 100 instances of 20 steps each.
The time-series distributions in PyMC3 are not very rigorous - they were designed with only 1D time series in mind, so do be careful when you are working with multiple time series.
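Following that logic, 20 walks of 100 steps each should come from putting time on the first axis. A minimal sketch, assuming (as the answer above implies) that the walk in GaussianRandomWalk runs along the first axis:

import pymc3 as pm

with pm.Model() as model:
    # time runs along the first axis: shape=(100, 20) gives
    # 20 independent walks of 100 steps each
    z = pm.GaussianRandomWalk('z', mu=0, sd=1, shape=(100, 20))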

Related

Random sampling in pyspark with replacement

I have a dataframe df with 9000 unique ids, like:

id
1
2

I want to draw a random sample, with replacement, of these 9000 ids, 100,000 times. How do I do it in pyspark?
I tried
df.sample(True, 0.5, 100)
but I do not know how to get to exactly 100,000 rows.
Okay, so first things first: you will probably not be able to get exactly 100,000 rows in your (over)sample. The reason is that in order to sample efficiently, Spark uses something called Bernoulli sampling. Basically this means it goes through your RDD and assigns each row a probability of being included. So if you want a 10% sample, each row individually has a 10% chance of being included; Spark doesn't check whether the results add up perfectly to the number you want, but the count tends to be pretty close for large datasets.
The code would look like this: df.sample(True, 11.11111, 100). This takes a sample equal to 11.11111 times the size of the original dataset. Since 11.11111 * 9,000 ≈ 100,000, you will get approximately 100,000 rows.
If you want an exact sample, you have to use df.rdd.takeSample(True, 100000) (takeSample lives on the RDD, not the DataFrame). However, the result is not a distributed dataset: this call returns a (very large) list on the driver, so only do this if it fits in main memory. Because you require exactly the right number of IDs, I don't know of a way to do that in a distributed fashion.
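Putting the two options side by side, a minimal sketch, assuming a SparkSession is already running and df holds the 9,000 ids as in the question:

# approximate count: Bernoulli sampling; a fraction above 1 oversamples
approx = df.sample(withReplacement=True, fraction=100000 / 9000, seed=100)

# exact count: returns a plain, non-distributed list of Rows on the driver
exact = df.rdd.takeSample(withReplacement=True, num=100000, seed=100)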

Splitting data in batches based on performance (load balancing)

I am looking to write a small load balancing algorithm to determine how much data to send to each server based on its performance metric (its weight). My question is similar to this one: Writing a weighted load balancing algorithm
However, that one is more for servers that get a constant stream of data. My case will be run once, with the work split between multiple systems (2 at the moment). The load needs to be split so that all data is processed and completed at the same time.
Example 1:
We get a zip with 1,000 images
System #1: 3 s/img
System #2: 6 s/img
Since sys1 is twice as fast, it should receive double the data.
Sys1 receives: 667 images
Sys2 receives: 333 images
I am doing this in Python. What would be an equation to take in a list of numbers and split up the batch based on the weights? E.g. given weights = [4.1, 7.3, 2.5] and img_count = 5000?
You just need to calculate how much work one unit of weight can perform and then multiply each weight by that unit. Here are a couple of lines in Python:
w = [4.1, 7.3, 2.5]
total = 5000
unit = total / sum(w)
res = [unit * i for i in w]
print(res)  # [1474.820143884892, 2625.8992805755397, 899.2805755395684]
At the end, do something about rounding and you have what you want; one way that keeps the total exact is sketched below.
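A sketch of that rounding step using the largest-remainder method, so the rounded shares still sum to img_count exactly (the helper name split_batch is illustrative):

def split_batch(weights, img_count):
    unit = img_count / sum(weights)
    raw = [unit * w for w in weights]
    base = [int(r) for r in raw]          # floor each share
    leftover = img_count - sum(base)      # images still unassigned
    # hand the leftovers to the shares with the largest fractional parts
    order = sorted(range(len(raw)), key=lambda i: raw[i] - base[i], reverse=True)
    for i in order[:leftover]:
        base[i] += 1
    return base

print(split_batch([4.1, 7.3, 2.5], 5000))  # [1475, 2626, 899]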

Generating seed values for pseudorandom number generators

I have 4 integers which I want to convert to a seed in order to generate a random number. I understand this is arbitrary for the most part; I do however want to make sure what I am currently doing is not overkill (or doesn't generate enough spread in seed values).
I have roughly 1000 objects which I want to have random properties based on some of their variables.
Two variables are constant, in the 0-1000 range, and random for each object; duplicates can occur but are not likely at all (constant1 and constant2). The other two variables start at 0 and change by deltas of 1 over long time periods as the program runs; they can be anywhere within the signed int32 range but will tend to be between -100 and 100 (variable1 and variable2).
How do you suitably generate a seed from these 4 values?
You should probably initialize the Random generator only once, when the class instance is initialized, so you could use only 2 of the properties to get a seed (the other 2 are still 0 by default at that point, aren't they?).
Because of that, and assuming constant1 and constant2 are random within 0-1000, you can use constant1 * 1000 + constant2 to get a number between 0 and roughly 1,000,000. I'm not sure about the randomness distribution, but it should be enough for a seed.
Update
If you really need the seed to depend on the other two variables as well, you can follow the same pattern and do it like this:
var seed = ((variable1 * 200 + variable2) * 1000 + constant1) * 1000 + constant2;
but because this exceeds the Int32 range, you have to do it in an unchecked context to prevent an OverflowException from being thrown.
And the last thing: I'm not 100% sure it will give you a uniform distribution of generated values.
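For illustration, the same combining formula in Python, masked to 32 bits so the overflow simply wraps around instead of throwing (a sketch only; no claim about its statistical quality):

def make_seed(constant1, constant2, variable1, variable2):
    # same formula as above, wrapped to an unsigned 32-bit value
    seed = ((variable1 * 200 + variable2) * 1000 + constant1) * 1000 + constant2
    return seed & 0xFFFFFFFF

print(make_seed(417, 882, -3, 12))  # deterministic for the same four inputs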

How to split a dataset into two random halves in Weka?

I want to split my dataset into two random halves in weka.
How can I do it?
I had the same question and the answer is quite simple. First, randomly shuffle the order of the instances with a Weka filter (Unsupervised -> Instance), then split the dataset into two parts. You can find a complete explanation at the link below:
http://cs-people.bu.edu/yingy/intro_to_weka.pdf
You can first use the Randomize filter to shuffle the dataset. Then use the RemovePercentage filter: run it at 30% for the testing set and save it, then go back to the original data, run the same filter with the INVERT box checked to get the other 70%, and save that as the training set.
That way you will have randomized, properly split testing and training sets.
I have an idea, though it doesn't use the native Weka API. How about using a random number generator? Math.random() generates numbers from 0 to 1.
Suppose that we want to split the dataset into set1 and set2.
for (Instance instance : dataset) {
    if (Math.random() < 0.5) {
        set1.add(instance);
    } else {
        set2.add(instance);
    }
}
I think this method will produce similar numbers of instances for the two subsets. If you want exactly equal quantities, you can add extra conditions to the if-else (for example, stop assigning to a set once it holds half the instances).
Hope this offers you some inspiration.
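If exact halves are required, a shuffle-and-cut sketch avoids the if-else bookkeeping entirely (plain Python, with dataset standing in for any list of instances):

import random

dataset = list(range(10))  # illustrative stand-in for the instances
shuffled = dataset[:]      # copy, so the original order is kept
random.shuffle(shuffled)
half = len(shuffled) // 2
set1, set2 = shuffled[:half], shuffled[half:]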

How to generate a longer random number from a short random number?

I have a short random number input, let's say an int in 0-999.
I don't know the distribution of the input. Now I want to generate a random number in the range 0-99999 based on the input, without changing the shape of the distribution.
I know there is a way to map the input to [0, 1] by dividing it by 999 and then multiplying by 99999 to get the result. However, this method doesn't cover all the possible values; most numbers, such as 99998, will never get hit.
Assuming your input is some kind of source of randomness...
You can take two consecutive inputs and combine them:
input() + 1000*(input()%100)
Be careful though. This relies on the source having plenty of entropy, so that a given input number isn't always followed by the same subsequent input number. If your source is a PRNG designed to cycle between the numbers 0–999 in some fashion, this technique won't work.
With most production entropy sources (e.g., /dev/urandom), this should work fine. OTOH, with a production entropy source, you could fetch a random number between 0–99999 fairly directly.
You can try something like the following:
(input * 100) + random
where random is a random number between 0 and 99.
The problem is that input only specifies which block of 100 to use. For instance, 50 just says you will get a number between 5000 and 5099 (to keep a similar distribution shape). Which number within that block to pick is up to you.
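A small Python sketch of both suggestions, where short_input is a hypothetical stand-in for the unknown 0-999 source:

import random

def short_input():
    # stand-in for the unknown 0-999 source
    return random.randint(0, 999)

# first answer: combine two consecutive inputs to cover 0-99999
a = short_input() + 1000 * (short_input() % 100)

# second answer: the input picks the block of 100; the low digits are uniform
b = short_input() * 100 + random.randint(0, 99)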
