I'm trying to sample without replacement using StatsBase.sample() in Julia. Because my data is in the following form, I can use my counts as FrequencyWeights():
using StatsBase
data = ["red", "blue", "green"]
counts = [2000, 2000, 1]
balls = StatsBase.sample(data, FrequencyWeights(counts), 1000)
One problem with this is that StatsBase.sample() implicitly sets replace=true so this is possible:
countmap(balls)
Dict("blue" => 478,
"green" => 2, # <= two green balls?
"red" => 520)
Explicitly setting replace=false throws an error.
balls = StatsBase.sample(data, FrequencyWeights(counts), 1000, replace=false)
Cannot draw 1000 samples from 3 samples without replacement.
error(::String)#error.jl:33
var"#sample!#174"(::Bool, ::Bool, ::typeof(StatsBase.sample!), ::Random._GLOBAL_RNG, ::Vector{String}, ::StatsBase.FrequencyWeights{Int64, Int64, Vector{Int64}}, ::Vector{String})#sampling.jl:858
#sample#175#sampling.jl:871[inlined]
#sample#176#sampling.jl:874[inlined]
top-level scope#Local: 2[inlined]
Is my only solution here to reformat my data to a wide form like this? That seems very inefficient, as my actual data set has very large counts:
wide_data = [fill("red", 2000)..., fill("blue", 2000)..., "green"]
sample(wide_data, 1000, replace=false)
You could use something like this:
function mysample(data::AbstractVector, counts::AbstractVector, n::Integer)
    @assert n <= sum(counts)
    @assert firstindex(data) == 1
    @assert firstindex(counts) == 1
    res = similar(data, n)
    # Work on a copy so the caller's counts are not mutated
    fw = FrequencyWeights(copy(counts))
    for i in 1:n
        # Draw one index, weighted by the remaining counts
        j = sample(axes(data, 1), fw)
        res[i] = data[j]
        # Decrement the drawn item's weight so it cannot be drawn
        # more often than its count allows
        fw.sum -= 1
        fw.values[j] -= 1
    end
    return res
end
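For example, a quick check (hypothetical output; the exact counts vary with the RNG state):
balls = mysample(data, counts, 1000)
countmap(balls)
# e.g. Dict("blue" => 497, "red" => 502, "green" => 1)
Since the weight of "green" drops to zero once it is drawn, it can never appear more often than its count allows. Note that each draw rescans the weight vector, so sampling is roughly O(length(data)) per draw, which is cheap here because there are only a few distinct values.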
I need a function that finds a variable amount of numbers which together must add up to a certain value. In this case it is 8.
The numbers that can be added together are predefined in a table, to make things easier.
Current approach: shuffle the table using a small algorithm, add the first X values together, and if they don't add up to 8, start over (including shuffling again) until the first X values add up to 8.
My code does work, it just has 2 problems: it takes a long time to process (obviously), and it can cause a stack overflow error if I don't add a cooldown.
The code can be dirty, it's not for live production. Also, I'm only an intermediate Lua developer at best...
function sleep (a) -- random sleep function I found
local sec = tonumber(os.clock() + a);
while (os.clock() < sec) do
end
end
function shuffle(tbl) -- random shuffle function I found
for i = #tbl, 2, -1 do
math.randomseed( os.time() )
math.random();math.random();math.random();math.random();
local j = math.random(i)
tbl[i], tbl[j] = tbl[j], tbl[i]
end
return tbl
end
local times = {
0.5,
1.0,
1.5,
2.0,
2.5,
3.0,
3.5,
4.0
}
local timeunits = {} --refer to line 49, I did not want to do it like that...
function nnumbersto8(amount)
local sum = 0
local numbs = {}
times = shuffle(times) --reshuffle the set
for i = 1,amount,1 do --add first x values together
sum = sum + times[i]
numbs[i] = times[i]
end
if sum ~= 8 then sleep(0.1) nnumbersto8(amount) return end --if they are not 8, repeat process with cooldown to avoid stack overflow
--return numbs -- This doesn't work for some reason, nothing gets returned outside the function
timeunits = numbs
end
nnumbersto8(5) -- manual run it for now
print(unpack(timeunits))
There must be a simpler way, right?
Thanks in advance, any help is appreciated!
Here is a method that will work for large numbers of elements, and will pick a random solution with theoretically even likelihood for each.
function solution_node (value, count, remainder)
local node = {}
node.value = value
node.count = count
node.remainder = remainder
return node
end
function choose_solutions (node1, node2)
if node1 == nil then
return node2
elseif node2 == nil then
return node1
else
-- Make a random choice of which solution to pick.
if node1.count < math.random(node1.count + node2.count) then
node2.count = node1.count + node2.count
return node2
else
node1.count = node1.count + node2.count
return node1
end
end
end
function decode_solution (node)
if node == nil then
return nil
end
    local answer = {}
while node.value ~= nil do
table.insert(answer, node.value)
-- This causes the solution to be randomly shuffled.
local i = math.random(#answer)
answer[#answer], answer[i] = answer[i], answer[#answer]
node = node.remainder
end
return answer
end
function random_sum(tbl, count, target)
local choices = {}
-- Normally arrays are not 0-based in Lua but this is very convenient.
for j = 0,count do
choices[j] = {}
end
-- Make sure that the empty set is there.
choices[0][0.0] = solution_node(nil, 1, nil)
for i = 1,#tbl do
for j = count,1,-1 do
for this_sum, node in pairs(choices[j-1]) do
local next_sum = this_sum + tbl[i]
local next_node = solution_node(tbl[i], node.count, node)
-- Try adding this value in to a solution.
if next_sum <= target then
choices[j][next_sum] = choose_solutions(next_node, choices[j][next_sum])
end
end
end
end
return decode_solution(choices[count][target])
end
local times = {
0.2,
0.3,
0.5,
1.0,
1.2,
1.3,
1.5,
2.0,
2.5,
3.0,
3.5,
4.0
}
math.randomseed( os.time() )
local result = random_sum(times, 5, 8.0)
print("answer")
for k, v in pairs(result) do print(v) end
Sorry for my code. I haven't coded in Lua for a few years.
This is the subset sum problem with an extra restriction on the number of elements you are allowed to choose.
The solution is to use Dynamic Programming similar to regular Subset Sum, but add an extra variable that indicates how many items you have used.
This should go something along the lines of:
Failing stop clauses:
DP[-1][x][n] = false, for all x,n > 0  // out of elements
DP[i][-1][n] = false, for all i,n > 0  // exceeded X items
DP[i][x][n]  = false, for all n < 0    // passed the sum limit; this is an optimization only if all elements are non-negative
Successful stop clause:
DP[i][0][0] = true, for all i >= 0
Recursive formula:
DP[i][x][n] = DP[i-1][x][n]                // did not take item i
           OR DP[i-1][x-1][n-item[i]]      // used item i (watch for the n < item[i] case here)
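A minimal Lua sketch of this recursion (my own illustration, not part of the answer above; it assumes the times are scaled to integers, e.g. tenths, so sums compare exactly):
-- Plain recursive transcription of the formula above (no memoization).
-- items: array of integer-scaled values, i: index of the item being
-- considered, x: how many items may still be used, n: remaining sum.
local function existsSubset(items, i, x, n)
    if x == 0 and n == 0 then return true end          -- successful stop clause
    if i < 1 or x < 0 or n < 0 then return false end   -- failing stop clauses
    return existsSubset(items, i - 1, x, n)            -- did not take item i
        or existsSubset(items, i - 1, x - 1, n - items[i]) -- used item i
end

-- E.g. with the times scaled by 10: can 5 of them sum to 8.0 (= 80)?
local scaled = {5, 10, 15, 20, 25, 30, 35, 40}
print(existsSubset(scaled, #scaled, 5, 80)) --> true (e.g. 0.5+1+1.5+2+3)
Memoizing on (i, x, n) turns this into the dynamic program described above.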
There are no solutions for amounts of 1 or 2, nor for amounts greater than 5, so the function only accepts 3, 4 and 5.
Here we make a shallow copy of the times table, then pick random indices from the copy and build up the solution, removing values we use as we go.
local times = {
0.5,
1.0,
1.5,
2.0,
2.5,
3.0,
3.5,
4.0
}
function nNumbersTo8(amount)
if amount < 3 or amount > 5 then
return {}
end
local sum = 0
local numbers = {}
local set = {table.unpack(times)}
for i = 1, amount - 1, 1 do
local index = math.random(#set)
local value = set[index]
if not (8 < (sum + value)) then
sum = sum + value
table.insert(numbers, value)
table.remove(set, index)
else
break
end
end
    local remainder = 8 - sum
    for _, v in ipairs(set) do
        if v == remainder then
            sum = sum + v
            table.insert(numbers, v)
            break
        end
    end
if #numbers == amount then
return numbers
else
return nNumbersTo8(amount)
end
end
for i=1,100 do
print(table.unpack(nNumbersTo8(5)))
end
Example response:
1.5 0.5 3 2 1
3 0.5 1.5 1 2
2 3 1.5 0.5 1
3 2 1.5 1 0.5
0.5 1 2 3 1.5
I'm really struggling to design an algorithm to find d, which is the lowest value that can be added or subtracted (at most) to make a given sequence strictly increasing.
For example.. say seq[] = [2,4,8,3,1,12]
given that sequence, the algorithm should return 5 as d, because you can add or subtract at most 5 to each element such that the sequence becomes strictly increasing.
I've tried several approaches and can't seem to get a solid technique down.
I've tried looping through the sequence and checking if seq[i] < seq[i+1]. If not, it checks whether d > 0; if it is, it tries to add/subtract it from seq[i+1]. Otherwise it calculates d by taking the difference seq[i-1] - seq[i].
I can't get it to be stable, though, and it feels like I keep adding if statements that are just special cases for particular input sequences. People have suggested using a binary search approach, but I can't make sense of applying it to this problem.
Any tips and suggestions are greatly appreciated. Thanks!
Here's my code in progress - using Python - v4
def ComputeMaxDelta3(seq):
# Create a copy to speed up comparison on modified values
aItems = seq[1:] #copies sequence elements from 1 (ignores seq[0])
# Will store the fix values for every item
# this should allocate 'length' times the 0 value
fixes = [0] * len(aItems)
print("fixes>>",fixes)
# Loop until no more fixes get applied
bNeedFix = True
while(bNeedFix):
# Hope will have no fix this turn
bNeedFix = False
# loop all subsequent item pairs (i should run from 0 to length - 2)
for i in range(0,len(aItems)-1):
# Left item
item1 = aItems[i]
# right item
item2 = aItems[i+1]
# Compute delta between left and right item
            # We remember that right >= left + 1
nDelta = item2 - (item1 + 1)
if(nDelta < 0):
# Fix the right item
fixes[i+1] -= nDelta
aItems[i+1] -= nDelta
# Need another loop
bNeedFix = True
# Compute the fix size (rounded up)
# max(s) should be int and the division should produce an int
nFix = int((max(fixes)+1)/2)
print("current nFix:",nFix)
# Balance all fixes
for i in range(len(aItems)):
fixes[i] -= nFix
print("final Fixes:",fixes)
print("d:",nFix)
print("original sequence:",seq[1:])
print("result sequence:",aItems)
return
Here's whats displayed:
Working with: [6, 2, 4, 8, 3, 1, 12]
[0]= 6 So the following numbers are the sequence:
aItems = [2, 4, 8, 3, 1, 12]
fixes>> [0, 0, 0, 0, 0, 0]
current nFix: 6
final Fixes: [-6, -6, -6, 0, 3, -6]
d: 1
original sequence: [2, 4, 8, 3, 1, 12]
result sequence: [2, 4, 8, 9, 10, 12]
d SHOULD be: 5
done!
~Note~
I start at 1 rather than 0 due to the first element being a key
As anticipated, here is (or should be) the Python version of my initial solution:
def ComputeMaxDelta(aItems):
# Create a copy to speed up comparison on modified values
aItems = aItems[:]
# Will store the fix values for every item
# this should allocate 'length' times the 0 value
fixes = [0] * len(aItems)
# Loop until no more fixes get applied
bNeedFix = True
while(bNeedFix):
# Hope will have no fix this turn
bNeedFix = False
# loop all subsequent item pairs (i should run from 0 to length - 2)
for i in range(0,len(aItems)-1):
# Left item
item1 = aItems[i]
# right item
item2 = aItems[i+1]
# Compute delta between left and right item
            # We remember that right >= left + 1
nDelta = item2 - (item1 + 1)
if(nDelta < 0):
# Fix the right item
fixes[i+1] -= nDelta
aItems[i+1] -= nDelta
# Need another loop
bNeedFix = True
# Compute the fix size (rounded up)
# max(s) should be int and the division should produce an int
    nFix = (max(fixes)+1)//2 # corrected from (max(s)+1)/2; // keeps the result an int in Python 3
# Balance all fixes
    for i in range(len(fixes)): # corrected from range(len(s))
fixes[i] -= nFix
print("d:",nFix) # corrected from **print("d:",nDelta)**
print("s:",fixes)
return
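With those corrections, calling it on the sequence from the question should print the expected result (a quick check, assuming my fixes above):
ComputeMaxDelta([2, 4, 8, 3, 1, 12])
# d: 5
# fixes: [-5, -5, -5, 1, 4, -5]
Adding those fixes to the source values gives [-3, -1, 3, 4, 5, 7], which is strictly increasing with no element moved by more than 5.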
I took your Python and fixed it so that it operates exactly like my C# solution.
I don't know Python, but looking at some references on the web, I believe I found the points where your port was failing.
If you compare your Python version with mine you should find the following differences:
You saved a reference to aItems into s and used it as my fixes, but fixes was meant to start as all zeros.
You didn't clone aItems over itself, so every alteration to its items was reflected outside of the method.
Your for loop was starting at index 1, whereas mine started at 0 (the very first element).
After the check for nDelta you subtracted nDelta from both s and aItems, but as I stated in points 1 and 2, they were pointing to the same items.
The ceil instruction was unneeded because integer division produces an integer, as in C# (note that in Python 3 this is the // operator; a plain / between ints produces a float).
Please remember that I fixed the Python code based only on online documentation, because I don't code in that language, so I'm not 100% sure about some of the syntax (my main doubt is about the fixes declaration).
Regards,
Daniele.
Here is my solution:
public static int ComputeMaxDelta(int[] aItems, out int[] fixes)
{
// Create a copy to speed up comparison on modified values
aItems = (int[])aItems.Clone();
// Will store the fix values for every item
fixes = new int[aItems.Length];
// Loop until no more fixes get applied
var bNeedFix = true;
while (bNeedFix)
{
// Hope will have no fix this turn
bNeedFix = false;
// loop all subsequent item pairs
for (int ixItem = 0; ixItem < aItems.Length - 1; ixItem++)
{
// Left item
var item1 = aItems[ixItem];
// right item
var item2 = aItems[ixItem + 1];
// Compute delta between left and right item
// We remember that (right >= left + 1)
var nDelta = item2 - (item1 + 1);
if (nDelta < 0)
{
// Fix the right item
fixes[ixItem + 1] -= nDelta;
aItems[ixItem + 1] -= nDelta;
//Need another loop
bNeedFix = true;
}
}
}
// Compute the fix size (rounded up)
var nFix = (fixes.Max() + 1) / 2;
// Balance all fixes
for (int ixItem = 0; ixItem < aItems.Length; ixItem++)
fixes[ixItem] -= nFix;
return nFix;
}
The function returns the maximum computed fix gap.
As a bonus, the fixes parameter will receive the fix for every item. These are the deltas to apply to each source value to make sure they end up in ascending order; some fixes could be reduced further, but that optimization would require an extra analysis loop.
The following is some code to test the algorithm. If you set a breakpoint at the end of the loop, you'll be able to check the result for the sequence you provided in your example.
var random = new Random((int)Stopwatch.GetTimestamp());
for (int ixLoop = -1; ixLoop < 100; ixLoop++)
{
int nCount;
int[] aItems;
// special case as the provided sample sequence
if (ixLoop == -1)
{
aItems = new[] { 2, 4, 8, 3, 1, 12 };
nCount = aItems.Length;
}
else
{
// Generates a random amount of items based on my screen's width
nCount = 4 + random.Next(21);
aItems = new int[nCount];
for (int ixItem = 0; ixItem < nCount; ixItem++)
{
// Keep the generated numbers below 30 for easier human analysis
aItems[ixItem] = random.Next(30);
}
}
Console.WriteLine("***");
Console.WriteLine(" # " + GetText(Enumerable.Range(0, nCount).ToArray()));
Console.WriteLine(" " + GetText(aItems));
int[] aFixes;
var nFix = ComputeMaxDelta(aItems, out aFixes);
// Computes the new values, that will be always in ascending order
var aNew = new int[aItems.Length];
for (int ixItem = 0; ixItem < aItems.Length; ixItem++)
{
aNew[ixItem] = aItems[ixItem] + aFixes[ixItem];
}
Console.WriteLine(" = " + nFix.ToString());
Console.WriteLine(" ! " + GetText(aFixes));
Console.WriteLine(" > " + GetText(aNew));
}
Regards,
Daniele.
I have a set of pre-declared values to set specific rotations for an object.
local rotations = {900,-900}
And I want my spawn function for the blocks to randomly pick one or the other inside this function:
local blocks = {}
timerSrc = timer.performWithDelay(1200, createBlock, -1)
function createBlock(event)
b = display.newImageRect("images/block8.png", 20, 150)
b.x = 500
b.y = math.random(100,250)
b.name = 'block'
physics.addBody(b, "static")
transition.to( b, { rotation = math.random(rotations), time = math.random(2700,3700)} )
blocks:insert(b)
end
When I use:
rotation = math.random(-900,900)
it just chooses any value between the two numbers rather than one or the other. How can I do this correctly?
If m is an integer value, math.random(m) returns a random integer in the range [1, m]. So math.random(2) randomly returns either 1 or 2.
To randomly generate either 900 or -900, use:
rotation = math.random(2) == 1 and 900 or -900
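Alternatively, since the values are already in your rotations table, you can pick a random element by index; this generalizes to any number of pre-declared values:
rotation = rotations[math.random(#rotations)]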
There is a question on CrossValidated on how to use PyMC to fit two Normal distributions to data. The answer of Cam.Davidson.Pilon was to use a Bernoulli distribution to assign data to one of the two Normals:
size = 10
p = Uniform( "p", 0 , 1) #this is the fraction that come from mean1 vs mean2
ber = Bernoulli( "ber", p = p, size = size) # produces 1 with proportion p.
precision = Gamma('precision', alpha=0.1, beta=0.1)
mean1 = Normal( "mean1", 0, 0.001 )
mean2 = Normal( "mean2", 0, 0.001 )
@deterministic
def mean( ber = ber, mean1 = mean1, mean2 = mean2):
return ber*mean1 + (1-ber)*mean2
Now my question is: how to do it with three Normals?
Basically, the issue is that you can't use a Bernoulli distribution and 1-Bernoulli anymore. But how to do it then?
edit: Following CDP's suggestion, I wrote the following code:
import numpy as np
import pymc as mc
n = 3
ndata = 500
dd = mc.Dirichlet('dd', theta=(1,)*n)
category = mc.Categorical('category', p=dd, size=ndata)
precs = mc.Gamma('precs', alpha=0.1, beta=0.1, size=n)
means = mc.Normal('means', 0, 0.001, size=n)
@mc.deterministic
def mean(category=category, means=means):
return means[category]
@mc.deterministic
def prec(category=category, precs=precs):
return precs[category]
v = np.random.randint( 0, n, ndata)
data = (v==0)*(50+ np.random.randn(ndata)) \
+ (v==1)*(-50 + np.random.randn(ndata)) \
+ (v==2)*np.random.randn(ndata)
obs = mc.Normal('obs', mean, prec, value=data, observed = True)
model = mc.Model({'dd': dd,
'category': category,
'precs': precs,
'means': means,
'obs': obs})
The traces with the following sampling procedure look good as well. Solved!
mcmc = mc.MCMC( model )
mcmc.sample( 50000,0 )
mcmc.trace('means').gettrace()[-1,:]
There is an mc.Categorical object that does just this.
p = [0.2, 0.3, .5]
t = mc.Categorical('test', p )
t.random()
#array(2, dtype=int32)
It returns an int between 0 and len(p)-1. To model the 3 Normals, make p an mc.Dirichlet object (it accepts a length-k array as its hyperparameters; setting the values in the array equal makes the prior probabilities equal). The rest of the model is nearly identical.
This is a generalization of the model I suggested above.
Update:
Okay, so instead of having different means, we can collapse them all into 1:
means = Normal( "means", 0, 0.001, size=3 )
...
@mc.deterministic
def mean(categorical=categorical, means = means):
return means[categorical]
So using the regular MongoDB library in Ruby I have the following query to find average filesize across a set of 5001 documents:
avg = 0
total = collection.count()
Rails.logger.info "#{total} asset creation stats in the system"
collection.find().each {|row| avg += (row["filesize"] * (1/total.to_f)) if row["filesize"]}
It's pretty simple, so I'm trying to do the same using map/reduce as a learning exercise. This is what I came up with:
map = 'function(){emit("filesizes", {size: this.filesize, num: 1});}'
reduce = 'function(k, vals){
var result = {size: 0, num: 0};
for(var x in vals) {
var new_total = result.num + vals[x].num;
result.num = new_total
result.size = result.size + (vals[x].size * (vals[x].num / new_total));
}
return result;
}'
@results = collection.map_reduce(map, reduce)
However the two queries come back with two different results!
What am I doing wrong?
You're weighting the results by doing the division in every reduce function.
Say you had [{size : 5, num : 1}, {size : 5, num : 1}, {size : 5, num : 1}]. Your reduce would calculate:
result.size = 0 + (5*(1/1)) = 5
result.size = 5 + (5*(1/2)) = 7.5
result.size = 7.5 + (5*(1/3)) ≈ 9.17
As you can see, this weights the results towards the earliest elements.
Fortunately, there's a simple solution. Just add a finalize function, which will be run once after the reduce step is finished.
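A sketch of what that could look like (my own untested illustration in the same style as the question; it assumes your version of the Ruby driver's map_reduce accepts a :finalize option): reduce only accumulates the totals, and finalize divides once at the end.
map = 'function(){ emit("filesizes", {size: this.filesize, num: 1}); }'
reduce = 'function(k, vals){
  var result = {size: 0, num: 0};
  // just accumulate here; no division, so reduce stays associative
  for(var x in vals) {
    result.size += vals[x].size;
    result.num += vals[x].num;
  }
  return result;
}'
finalize = 'function(k, val){ return val.size / val.num; }'
@results = collection.map_reduce(map, reduce, :finalize => finalize)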