Making a histogram out of values from a table - wolfram-mathematica

I have defined a mixed distribution made out of two normal distributions, like this:
MixDist[s_,n_]:=With[{Dist=MixtureDistribution[{.5,.5},{NormalDistribution[0,s],Normaldistribution[0.5s,s]}]},RandomVariate[Dist,n]]
For example, MixDist[1,1000] should generate 1000 numbers drawn from a mixture of NormalDistribution[0,1] and NormalDistribution[0.5,1].
Now, I want to run this generator 100 times, and this is where I am stuck.
I tried doing this
dist1=Table[MixDist[1,1000],100]
to generate a table with 100 sets of 1000 random numbers, but when trying to plot a histogram with
histogram=Histogram[dist1,20,"ProbabilityDensity"]
it shows a blank coordinate system.
Can data from a table be included in a histogram? Or is there another way to do this (make a histogram of 100 sets of 1000 randomly generated numbers from the mixed distribution mentioned above).
Thank you!

Short answer
mixDist[s_, n_] := With[{
    dist = MixtureDistribution[
      {.5, .5},
      {NormalDistribution[0, s], NormalDistribution[0.5 s, s]}
    ]
  },
  RandomVariate[dist, n]
]
dist1 = Table[mixDist[1, 1000], 100];
histogram = Histogram[dist1, 20, "ProbabilityDensity"]
Details
The only mistake was that the second NormalDistribution was written as Normaldistribution. Mathematica symbol names are case-sensitive, so the lowercase d makes it an undefined symbol; RandomVariate then returns unevaluated, which is why Histogram drew an empty coordinate system.
I also improved the code style. I am convinced that user-defined symbols should start with a lowercase letter to be distinguishable from the built-in ones.
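For readers outside Mathematica, here is a minimal numpy/matplotlib sketch of the same experiment (my own translation, not part of the original answer). Note that it pools all 100 x 1000 draws into a single histogram; if you want the pooled view in Mathematica, pass Flatten[dist1] to Histogram instead of dist1.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()

def mix_dist(s, n):
    # 50/50 mixture: pick a component mean per draw, then sample N(mean, s)
    means = rng.choice([0.0, 0.5 * s], size=n)
    return rng.normal(means, s)

# 100 runs of 1000 draws, pooled into one probability-density histogram
samples = np.concatenate([mix_dist(1, 1000) for _ in range(100)])
plt.hist(samples, bins=20, density=True)
plt.show()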

Related

Random value from two seeds

I have a two-dimensional grid and need a reproducible random value for every integer coordinate on this grid. The values should be as close to unique as possible: in a grid of, let's say, 1000 x 1000, a value shouldn't occur twice.
To put it more mathematically: I need a function f(x, y) which gives a unique number no matter what x and y are, as long as each is in the range [0, 1000].
f(x, y) has to be reproducible and must not have side effects.
Probably there is some trivial solution, but everything that comes to my mind, like multiplying x and y and adding some salt, does not lead anywhere, because the resulting number can easily occur multiple times.
One working solution I have is to use a randomizer and simply compute ALL values in the grid, but that is either too computationally heavy (if done every time a value is needed) or requires too much memory (if all values are pre-computed, which I want to avoid).
Any suggestions?
Huge thanks in advance.
I would use the zero-padded concatenation of your x and y as a seed for a built-in random generator. I'm actually using something like this in some of my current experiments.
I.e. x = 13, y = 42 would become int('0013' + '0042') = 130042 to use as the random seed. Then you can use the random generator of your choice to get the kind (float, int, etc.) and range of values you need:
Example in Python 3.6+:
import numpy as np
from itertools import product

X = np.zeros((1000, 1000))
for x, y in product(range(1000), range(1000)):
    # seed from the zero-padded concatenation of the coordinates
    np.random.seed(int(f'{x:04}{y:04}'))
    # the first draw after seeding is the cell's reproducible value
    X[x, y] = np.random.random()
Each value in the grid is randomly generated, but independently reproducible.
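As a quick sanity check of that claim (my own addition), re-deriving a single cell from its seed alone should reproduce the stored value:

# re-derive the value for (x, y) = (13, 42) without touching the rest of the grid
np.random.seed(int(f'{13:04}{42:04}'))
assert X[13, 42] == np.random.random()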

Julia - Random number in different intervals

Hello
I would like to create an array with numbers from different intervals.
For example, with the following code:
using Distributions
A = rand(Uniform(1,10),1,20)
"A" contains 20 numbers between 1 and 10.
I would like to create "B", where "B" contains 20 numbers between 1 and 4 or between 6 and 10, but not between 4 and 6.
Is it possible?
Thank you
I think for the general use case you want to make sure that the new distribution you're sampling from is still uniform, albeit spread across non-connecting ranges.
I hacked together a function that produces such a uniform distribution from multiple disconnected uniform distributions:
using Distributions
function general_uniform(distributions...)
    all_dists = [distributions...]
    sort!(all_dists, by = D -> minimum(D))
    # make sure the ranges are non-overlapping
    @assert all(map(maximum, all_dists)[1:end-1] .<= map(minimum, all_dists)[2:end])
    # weight each component by its length so the union is sampled uniformly
    dist_lengths = map(D -> maximum(D) - minimum(D), all_dists)
    ratios = dist_lengths ./ sum(dist_lengths)
    return MixtureModel(all_dists, Categorical(ratios))
end
Then you can sample from it like this:
B = rand(general_uniform(Uniform(1,4), Uniform(6,10)),1,20)
This will give you a uniform distribution even if your ranges don't have the same length. For example:
general_uniform(Uniform(0,1), Uniform(1,10))
Will sample from range 0-1 with probability of 0.1 and from range 1-10 with probability of 0.9.
For example, the following gives a number around 5:
mean(rand(general_uniform(Uniform(0,9), Uniform(9,10)),1000))
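For reference, here is the same length-weighted idea as a minimal numpy sketch (the function name and interface are mine, not from Distributions.jl):

import numpy as np

rng = np.random.default_rng()

def general_uniform(bounds, n):
    # weight each interval by its length so the union is sampled uniformly
    lo, hi = np.array(bounds, dtype=float).T
    lengths = hi - lo
    idx = rng.choice(len(bounds), size=n, p=lengths / lengths.sum())
    return rng.uniform(lo[idx], hi[idx])

B = general_uniform([(1, 4), (6, 10)], 20)  # component weights 3/7 and 4/7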
Sure:
using Distributions

numbers = []
for i in 1:20
    # flip a fair coin to choose which interval to draw from
    if rand() < 0.5
        push!(numbers, rand(Uniform(1,4)))
    else
        push!(numbers, rand(Uniform(6,10)))
    end
end
You can also do a mixture:
D = MixtureModel([Uniform(1,4), Uniform(6,10)], Categorical([0.5,0.5]))
rand(D, 1, 20)
Here you have to specify a probability distribution over which uniform distribution to select from, hence the Categorical. The code above samples from each uniform range with equal probability. You can adjust the weighting by changing the Categorical as you see fit.
Using a mixture model of two uniform distributions
rand(MixtureModel(Uniform[Uniform(1,4),Uniform(6,10)]),1,20)
Edit: this sampling is only correct if the sizes of the intervals are equal!
Hope this helps!

generate clustered spatstat marks?

I was wondering if anyone knows how to assign marks in spatstat so that they tend to cluster spatially? I have a set of lat long coordinates that I want to categorize into 4 groups. I have figured out how to randomly assign marks/groups to these points using the following code:
as.ppp(data, window, marks = factor(sample(1:4, replace = TRUE)))
But I can't figure out how to assign the marks so that groups tend to occupy points closer to one another. As a further complication, I would also like the number of points within each group to be the same specified number each time. Does anyone have any leads? Thanks in advance!
Typically in spatstat we define models which describe/generate points at random locations, possibly with random marks. If I understand you correctly, you have a fixed set of locations and simply want to assign random marks. How many points do you have? If you don't have too many, a simple suggestion is to generate a spatially correlated multivariate normal variable and then take the n_1 lowest values for the first mark, the n_2 next values for the second mark, and so on. A simple example with 4 equal-sized groups of points:
library(spatstat)
library(mvtnorm)
set.seed(42) # Make reproducible
X <- redwood # Example data
n <- npoints(redwood)
Xdist <- pairdist(X) # n x n matrix of distances in X
decay_rate <- 1 # Parameter for covariance structure
sigma <- exp(-decay_rate * Xdist)
m <- rmvnorm(1, rep(0, n), sigma)
breaks <- quantile(m, probs = c(0, .25, .5, .75, 1)) # breaks to cut marks in four equal sized groups
marks(X) <- cut(m, breaks = breaks, include.lowest=TRUE, labels = 1:4)
plot(X)
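The same idea translates straightforwardly outside R; here is a minimal numpy sketch with placeholder coordinates standing in for the redwood data (an assumption, since I don't have your points):

import numpy as np

rng = np.random.default_rng(42)
pts = rng.random((60, 2))  # placeholder coordinates; use your own lat/longs
# exponentially decaying covariance over pairwise distances, as in the R code
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
sigma = np.exp(-1.0 * d)  # decay_rate = 1
m = rng.multivariate_normal(np.zeros(len(pts)), sigma)
# cut at the quartiles so the four groups are (near-)equal in size
marks = np.searchsorted(np.quantile(m, [0.25, 0.5, 0.75]), m) + 1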

How to generate correlated Uniform[0,1] variables

(This question is related to how to generate a dataset of correlated variables with different distributions?)
In Stata, say that I create a random variable following a Uniform[0,1] distribution:
set seed 100
gen random1 = runiform()
I now want to create a second random variable that is correlated with the first (the correlation should be .75, say), but is bounded by 0 and 1. I would like this second variable to also be more-or-less Uniform[0,1]. How can I do this?
This won't be exact, but the NORTA/copula method should be pretty close and easy to implement.
The relevant citation is:
Cario, Marne C., and Barry L. Nelson. Modeling and generating random vectors with arbitrary marginal distributions and correlation matrix. Technical Report, Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, Illinois, 1997.
The paper can be found here.
The general recipe to generate correlated random variables from any distribution is:
Draw two (or more) correlated variables from a joint standard normal distribution using corr2data
Calculate the univariate normal CDF of each of these variables using normal()
Apply the inverse CDF of any distribution to simulate draws from that distribution.
The third step is pretty easy with the [0,1] uniform: the inverse CDF of a Uniform[0,1] is the identity, so you don't even need it. Typically, the magnitude of the correlations you get will be less than the magnitudes of the original (normal) correlations, so it might be useful to bump those up a bit.
Stata Code for 2 uniformish variables that have a correlation of 0.75:
clear
// Step 1
matrix C = (1, .75 \ .75, 1)
corr2data x y, n(10000) corr(C) double
corr x y, means
// Steps 2-3
replace x = normal(x)
replace y = normal(y)
// Make sure things worked
corr x y, means
stack x y, into(z) clear
lab define vars 1 "x" 2 "y"
lab val _stack vars
capture ssc install bihist
bihist z, by(_stack) density tw1(yline(-1 0 1))
If you want to improve the approximation for the uniform case, you can transform the correlations like this (see section 5 of the linked paper):
matrix C = (1,2*sin(.75*_pi/6)\2*sin(.75*_pi/6),1)
This is 0.76536686 instead of the 0.75.
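If it helps to cross-check the recipe outside Stata, here is a minimal Python sketch of the same three steps, including the sine adjustment (numpy and scipy assumed; my illustration, not part of the original answer):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(100)
target = 0.75
rho = 2 * np.sin(target * np.pi / 6)  # adjusted normal correlation (section 5)
cov = [[1.0, rho], [rho, 1.0]]
z = rng.multivariate_normal([0.0, 0.0], cov, size=10_000)  # step 1
u = norm.cdf(z)                                            # steps 2-3
print(np.corrcoef(u[:, 0], u[:, 1])[0, 1])  # should land close to 0.75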
Code for the question in the comments
Here the correlation matrix C is written more compactly (lower triangle only, hence cstorage(lower)), and I am applying the transformation:
clear
matrix C = ( 1, ///
2*sin(-.46*_pi/6), 1, ///
2*sin(.53*_pi/6), 2*sin(-.80*_pi/6), 1, ///
2*sin(0*_pi/6), 2*sin(-.41*_pi/6), 2*sin(.48*_pi/6), 1 )
corr2data v1 v2 v3 v4, n(10000) corr(C) cstorage(lower)
forvalues i=1/4 {
    replace v`i' = normal(v`i')
}

How to create a scoring system using two variables

I have an application (Node/Angular) where I'm trying to rank users based on overall performance across two metrics. The two metrics we use to track users are the following:
Units Produced (ranges between 0 and 6000)
Rate of Production = [ Units Produced ] / [ Labor Hours ] (ranges between 0 and 100)
However, ranking users explicitly by either of these variables doesn't make sense, because it creates some strange incentives/behaviors.
For instance, it is possible to have a really high Rate of Production but a very low total number of Units Produced, by working hard over a short period of time. Alternatively, a user can have a very high number of Units Produced with a low Rate of Production, simply because they worked overtime and so had longer to produce units than anyone else.
Does anyone have experience designing these types of scoring systems? How have you handled it?
First, I would recommend bringing them onto the same scale, e.g. dividing Units Produced by 60.
Then, if you are fine with equal weights, there are three common simple choices:
Add the scores
Multiply the scores (equivalent to adding the logs of each)
Take the minimum of the two scores
Which one is best depends on how strongly you want the score to measure combined good results. In your case, I would recommend multiplying, and putting a scale on the resulting product.
If you want to go a little more complex and play around with how much to reward separate vs. joint success, you can use the following formula:
V = alpha * log_b[Units Produced / 60] + (1-alpha) * log_b[Rate of Production],
where alpha determines the weighting of one vs the other and the base of the logarithmic function determines to what extent a joint success is rewarded.
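As a concrete (hypothetical) instance of that formula in Python, with alpha = 0.5 and base-10 logs; both inputs must be positive for the logs to be defined:

import math

def combined_score(units, rate, alpha=0.5, base=10):
    # alpha weights units vs. rate; the log base sets how strongly
    # joint success is rewarded; assumes units > 0 and rate > 0
    return alpha * math.log(units / 60, base) + (1 - alpha) * math.log(rate, base)

combined_score(3000, 50)  # both metrics roughly mid-range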
I did something very similar. I found it valuable to break users into leagues or tiers, for example using Units Produced as a base:
Novice = 100 Units Produced
Beginner = 500 Units Produced
Advanced = 2000 Units Produced
Expert = 4000 Units Produced
Putting this into a useable object
var levels = [
    { id: 1, name: "Novice",   minUnits: 100,  maxUnits: 499  },
    { id: 2, name: "Beginner", minUnits: 500,  maxUnits: 1999 },
    { id: 3, name: "Advanced", minUnits: 2000, maxUnits: 3999 },
    { id: 4, name: "Expert",   minUnits: 4000, maxUnits: 6000 }
];
You can then multiply the user's Rate of Production by a weight attached to each level; you determine what that weight is. You can tune the values to make progression as hard or as easy as you want.
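One possible reading of that, as a Python sketch with made-up per-level weights:

levels = [
    {"id": 1, "name": "Novice",   "min_units": 100,  "weight": 1.0},
    {"id": 2, "name": "Beginner", "min_units": 500,  "weight": 1.5},
    {"id": 3, "name": "Advanced", "min_units": 2000, "weight": 2.0},
    {"id": 4, "name": "Expert",   "min_units": 4000, "weight": 2.5},
]

def tier_score(units, rate):
    # find the highest tier the user qualifies for, then weight their rate by it
    qualifying = [l for l in levels if units >= l["min_units"]]
    tier = qualifying[-1] if qualifying else levels[0]
    return rate * tier["weight"]

tier_score(2500, 40)  # Advanced tier: 40 * 2.0 = 80.0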
You can do a combination with
SCORE = 200/( K_1/x_1 + K_2/x_2 )
// x_1 : Score 1
// x_2 : Score 2
// K_1 : Maximum of Score 1
// K_2 : Maximum of Score 2
Of course, be careful about dividing by zero: if either x_1 or x_2 is zero, then SCORE = 0. If x_1 = K_1 and x_2 = K_2, then SCORE = 100 (the maximum).
Otherwise the score is somewhere in between: if x_1/K_1 = x_2/K_2 = z, then SCORE = 100*z.
This weighs the lower score more heavily, so you are rewarded for raising one of the two scores (unlike with the minimum of the two) but not as much as for raising both.
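A minimal Python sketch of this score, guarding the zero case as noted, with the maxima taken from the ranges in the question:

def score(x1, x2, k1=6000, k2=100):
    # harmonic-style combination of the two ratios; 0 if either input is 0
    if x1 == 0 or x2 == 0:
        return 0.0
    return 200 / (k1 / x1 + k2 / x2)

score(6000, 100)  # 100.0, the maximum
score(3000, 50)   # 50.0, since both ratios equal 0.5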
