I have learned bootstrap and stratification. But what is stratified bootstrap? And how does it work?
Let's say we have a dataset of n instances (observations), and m is the number of classes. How should I divide the dataset, and what's the percentage for training and testing?
You split your dataset per class. Afterwards, you sample from each sub-population independently, and the number of instances you draw from a sub-population should be proportional to its share of the whole dataset.
data
d(i) <- { x in data | class(x) = i }
for each class i
    for j = 0 .. samplesize * (size(d(i)) / size(data))
        sample(i) <- draw element from d(i)
sample <- U sample(i)
If you sample four elements from a dataset with classes {'a', 'a', 'a', 'a', 'a', 'a', 'b', 'b'}, this procedure makes sure that at least one element of class b is contained in the stratified sample.
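The pseudocode above can be sketched in Python as follows (the class labels and sample size here are invented for illustration; note that the per-class counts are rounded, so in general they may not add up exactly to the requested size):

```python
import random

def stratified_bootstrap_sample(data, labels, sample_size):
    """Draw, per class, a with-replacement sample proportional to class size."""
    sample = []
    for cls in set(labels):
        # sub-population d(i) for this class
        stratum = [x for x, y in zip(data, labels) if y == cls]
        # this stratum's proportional share of the requested sample size
        k = round(sample_size * len(stratum) / len(data))
        # draw with replacement from the stratum
        sample.extend(random.choices(stratum, k=k))
    return sample

data = list(range(8))
labels = ['a'] * 6 + ['b'] * 2   # six instances of 'a', two of 'b'
sample = stratified_bootstrap_sample(data, labels, 4)
# 3 elements are drawn from class 'a' and 1 from class 'b'
```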
I just had to implement this in Python, so I will post my current approach here in case it is of interest to others.
A function to create an index into the original DataFrame for a stratified bootstrap sample:
I chose to iterate over all relevant strata in the original DataFrame, retrieve the index of the relevant rows in each stratum, and randomly (with replacement) draw as many samples from the stratum as the stratum contains.
The randomly drawn indices can then be combined into one list (which should in the end have the same length as the original DataFrame).
import pandas as pd
from random import choices

def provide_stratified_bootstap_sample_indices(bs_sample):
    # count rows per stratum; items() replaces the removed iteritems()
    strata = bs_sample.loc[:, "STRATIFICATION_VARIABLE"].value_counts()
    bs_index_list_stratified = []
    for idx_stratum_var, n_stratum_var in strata.items():
        # index of the rows belonging to this stratum
        data_index_stratum = list(bs_sample[bs_sample["STRATIFICATION_VARIABLE"] == idx_stratum_var].index)
        # draw, with replacement, as many rows as the stratum contains
        bs_index_list_stratified.extend(choices(data_index_stratum, k=len(data_index_stratum)))
    return bs_index_list_stratified
And then the actual bootstrapping loop (say 10,000 times):

k = 10000
for i in range(k):
    bs_sample = DATA_original.copy()
    bs_index_list_stratified = provide_stratified_bootstap_sample_indices(bs_sample)
    bs_sample = bs_sample.loc[bs_index_list_stratified, :]
    # process the data with some statistical operation as required
    # and save the results for each iteration
    RESULTS = FUNCTION_X(bs_sample)
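As an aside, recent pandas versions (1.1+) can express the same per-stratum resampling in one line with groupby().sample(); this sketch is my suggestion, not part of the original answer:

```python
import pandas as pd

def stratified_bootstrap(df, strat_col):
    # frac=1, replace=True draws from each stratum as many rows
    # as the stratum contains, i.e. a stratified bootstrap sample
    return df.groupby(strat_col).sample(frac=1, replace=True)

df = pd.DataFrame({"STRATIFICATION_VARIABLE": ["a"] * 6 + ["b"] * 2,
                   "x": range(8)})
bs = stratified_bootstrap(df, "STRATIFICATION_VARIABLE")
# bs has 8 rows: 6 from stratum 'a' and 2 from stratum 'b'
```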
Problem: Given lists of integers uu, ww of the same length, in Python 3 I wish to compute shorter lists u, w of the same length, where u = the unique elements of uu and w[i] = the sum of all ww[j] such that uu[j] = u[i], i.e. to accumulate the elements. For example, for uu=[1,2,1,3,2,2,1,3,1,4]; ww=[0,4,2,1,6,3,2,6,3,0] the output should be [1,2,3,4], [7,13,7,0].
Solution: I have found these ways to achieve my goal:
import time
import numpy as np
import pandas as pd
import itertools, collections

def ti(): return time.perf_counter()  # current time

def acml(method, uu, ww):
    if method == 1:  # result is not sorted
        d = collections.defaultdict(int)
        for i in range(len(uu)): d[uu[i]] += ww[i]
        return list(d.keys()), list(d.values())
    if method == 2:  # result is sorted
        uw = list(zip(uu, ww))                      # paired list
        uw.sort(key=lambda x: x[0])                 # sorted by zeroth coordinate
        uw = itertools.groupby(uw, lambda x: x[0])  # iterator
        uw = [[k, sum(i[1] for i in v)] for k, v in uw]
        return list(zip(*uw))
    if method == 3:  # slow, result is sorted
        x = pd.DataFrame(data={'u': uu, 'w': ww}).groupby('u')
        u = [i for i, xi in x]
        w = [xi['w'].sum() for i, xi in x]
        return u, w

n = 10**5; r = n//2
uu = list(np.random.randint(0, r, size=n))
ww = list(np.random.randint(-10, 10, size=n))
t0 = ti(); uw1 = acml(1, uu, ww); t1 = ti(); print(t1-t0)
t0 = ti(); uw2 = acml(2, uu, ww); t1 = ti(); print(t1-t0)
t0 = ti(); uw3 = acml(3, uu, ww); t1 = ti(); print(t1-t0)
Is there a faster function that achieves this? (the order of elements is irrelevant)
Background: The reason I need the function acml is that I am constructing a huge sparse matrix (of size up to 10^7 x 10^7 with a few entries per column). I do this by computing for each column v the entries (u,w). Typically, there appear several such entries with the same first coordinate, and therefore the second coordinates of those elements should be summed.
I realize that the construction scipy.sparse.csr_matrix(ww,(uu,vv)) already takes care of this (i.e. it sums values belonging to the same location), but I would still like to use acml, because my matrices can barely fit into memory, so accumulating the entries of each column (before the entries of other columns are even found) would save RAM.
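One vectorized candidate worth benchmarking against the three methods above (my suggestion, so measure before trusting it): np.unique maps every key to a compact position, and np.bincount then accumulates the weights at C speed.

```python
import numpy as np

def acml_np(uu, ww):
    # sorted unique keys, plus each element's position among them
    u, inv = np.unique(uu, return_inverse=True)
    # sum the weights per unique key
    w = np.bincount(inv, weights=ww)
    return u, w

u, w = acml_np([1, 2, 1, 3, 2, 2, 1, 3, 1, 4],
               [0, 4, 2, 1, 6, 3, 2, 6, 3, 0])
# u -> array([1, 2, 3, 4]), w -> array([ 7., 13.,  7.,  0.])
```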
Given a data set of a few millions of price ranges, we need to find the smallest range that contains a given price.
The following rules apply:
Ranges can be fully nested (ie, 1-10 and 5-10 is valid)
Ranges cannot be partially nested (ie, 1-10 and 5-15 is invalid)
Example:
Given the following price ranges:
1-100
50-100
100-120
5-10
5-20
The result for searching price 7 should be 5-10
The result for searching price 100 should be 100-120 (smallest range containing 100).
What's the most efficient algorithm/data structure to implement this?
Searching the web, I have only found solutions for searching ranges within ranges.
I've been looking at Morton codes and the Hilbert curve, but I can't wrap my head around how to use them for this case.
Thanks.
Because you did not mention this ad hoc algorithm, I'll propose it as a simple first answer to your question:
This is a Python function, but it's fairly easy to understand and convert to another language.
def min_range(ranges, value):
    # ranges = [(1, 100), (50, 100), (100, 120), (5, 10), (5, 20)]
    # value = 100
    # INIT
    import math
    best_range = None
    best_range_len = math.inf
    # LOOP THROUGH ALL RANGES
    for b, e in ranges:
        # PICK THE SMALLEST
        if b <= value <= e and e - b < best_range_len:
            best_range = (b, e)
            best_range_len = e - b
    print(f'Minimal range containing {value} = {best_range}')
I believe there are more efficient and complicated solutions (if you can do some precomputation for example) but this is the first step you must take.
EDIT: Here is a better solution, probably O(log(n)) per query, but it's not trivial. It is a tree where each node is an interval and has a list of children: all the strictly non-overlapping intervals that are contained inside it.
Preprocessing is done in O(n log(n)) time, and queries are O(n) in the worst case (when you can't find 2 ranges that don't overlap) and probably O(log(n)) on average.
Two classes: tree, which holds the structure and answers queries:
import bisect

class tree:
    def __init__(self, ranges):
        # sort the ranges by lowest start and then greatest end
        ranges = sorted(ranges, key=lambda i: (i[0], -i[1]))
        # recursive building -> might want to optimize that in python
        self.node = node((-float('inf'), float('inf')), ranges)

    def __str__(self):
        return str(self.node)

    def query(self, value):
        # bisect is used for binary search
        curr_sol = self.node.inter
        node_list = self.node.child_list
        while True:
            # which of the child ranges can include our value?
            i = bisect.bisect_left(node_list, (value, float('inf'))) - 1
            # does it include it?
            if i < 0 or i == len(node_list):
                return curr_sol
            if value > node_list[i].inter[1]:
                return curr_sol
            else:
                # if it does, go deeper
                curr_sol = node_list[i].inter
                node_list = node_list[i].child_list
And node, which holds the structure and the per-interval information:
import bisect

class node:
    def __init__(self, inter, ranges):
        # all elements in ranges will be descendants of this node!
        self.inter = inter
        self.child_list = []
        for i, r in enumerate(ranges):
            if len(self.child_list) == 0:
                # append a new child when the list is empty
                self.child_list.append(node(r, ranges[i + 1:bisect.bisect_left(ranges, (r[1], r[1] - 1))]))
            else:
                # the current range r is included in a previous range
                # r is not a child of self but a descendant!
                if r[0] < self.child_list[-1].inter[1]:
                    continue
                # else -> this is a new child
                self.child_list.append(node(r, ranges[i + 1:bisect.bisect_left(ranges, (r[1], r[1] - 1))]))

    def __str__(self):
        # fancy
        return f'{self.inter} : [{", ".join(str(n) for n in self.child_list)}]'

    def __lt__(self, other):
        # this is the '<' operator -> lets bisect compare our items
        return self.inter < other
and to test that:
ranges = [(1, 100), (50, 100), (100, 120), (5, 10), (5, 20), (50, 51)]
t = tree(ranges)
print(t)
print(t.query(10))
print(t.query(5))
print(t.query(40))
print(t.query(50))
Preprocessing that generates disjoint intervals
(I call the source segments "ranges" and the resulting segments "intervals")
For every range border (both start and end) make a tuple (value, start/end field, range length, id), and put them all in an array/list.
Sort these tuples by the first field. In case of a tie, put the longer range left for a start and right for an end.
Make a stack.
Make a StartValue variable.
Walk through the list:
    if current tuple contains a start:
        if an interval is open:                # we close it
            if current value > StartValue:     # interval is not empty
                make interval with             # note: id remains on the stack
                    (start = StartValue, end = current value, id = stack.peek)
                add interval to result list
        StartValue = current value             # we open a new interval
        push id from current tuple onto stack
    else:                                      # end of a range
        if current value > StartValue:         # interval is not empty
            make interval with                 # note: id is removed from the stack
                (start = StartValue, end = current value, id = stack.pop)
            add interval to result list
        if stack is not empty:
            StartValue = current value         # we open a new interval
After that we have a sorted list of disjoint intervals containing a start/end value and the id of the source range (note that many intervals might correspond to the same source range), so we can use binary search easily.
If we add the source ranges one by one in nested order (each nested range after its parent), we can see that every new range generates at most two new intervals, so the overall number of intervals is M <= 2*N and the overall complexity is O(N log N + Q log N), where Q is the number of queries.
Edit:
Added if stack is not empty section
Result for your example 1-100, 50-100, 100-120, 5-10, 5-20 is
1-5(0), 5-10(3), 10-20(4), 20-50(0), 50-100(1), 100-120(2)
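For concreteness, here is one way the sweep above might look in Python. This is my own rendering of the pseudocode, so treat the tie-breaking details as an interpretation rather than the answerer's exact code; the query assumes the price falls inside at least one range.

```python
import bisect

def build_intervals(ranges):
    # one event per border; at equal values ends sort before starts,
    # shorter ranges close first and longer ranges open first
    events = []
    for rid, (a, b) in enumerate(ranges):
        events.append((a, 1, -(b - a), rid))  # start border
        events.append((b, 0, b - a, rid))     # end border
    events.sort()

    result = []          # disjoint intervals: (start, end, source range id)
    stack = []           # ids of currently open source ranges
    start_value = None
    for value, is_start, _, rid in events:
        if is_start:
            if stack and value > start_value:  # close a non-empty interval
                result.append((start_value, value, stack[-1]))
            start_value = value                # we open a new interval
            stack.append(rid)
        else:
            top = stack.pop()
            if value > start_value:            # interval is not empty
                result.append((start_value, value, top))
            if stack:
                start_value = value            # we open a new interval
    return result

def query(intervals, ranges, price):
    # rightmost interval whose start is <= price
    i = bisect.bisect_right(intervals, (price, float('inf'))) - 1
    return ranges[intervals[i][2]]

ranges = [(1, 100), (50, 100), (100, 120), (5, 10), (5, 20)]
iv = build_intervals(ranges)
# iv -> [(1, 5, 0), (5, 10, 3), (10, 20, 4), (20, 50, 0), (50, 100, 1), (100, 120, 2)]
print(query(iv, ranges, 7))    # (5, 10)
print(query(iv, ranges, 100))  # (100, 120)
```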
Since pLOPeGG already covered the ad hoc case, I will answer the question under the premise that preprocessing is performed in order to support multiple queries efficiently.
General data structures for efficient queries on intervals are the Interval Tree and the Segment Tree.
What about an approach like this? Since we only allow full nesting and not partial nesting, this looks to be a doable approach:
Split segments into (left,val) and (right,val) pairs.
Order them with respect to their vals and left/right relation.
Search the list with binary search. We get two outcomes not found and found.
If found check if it is a left or right. If it is a left go right until you find a right without finding a left. If it is a right go left until you find a left without finding a right. Pick the smallest.
If not found stop when the high-low is 1 or 0. Then compare the queried value with the value of the node you are at and then according to that search right and left to it just like before.
As an example:
We would have (l,10) (l,20) (l,30) (r,45) (r,60) (r,100). When searching for, say, 65 you land on (r,100), so you go left and can't find a spot with an (l,x) such that x >= 65; you keep going left until the lefts and rights balance, and then the first right and the last left form your interval. The preprocessing part will be long, but you only do it once. Queries are still O(n) in the worst case, but that worst case requires everything to be nested inside everything else and a search for the outermost range.
I have a problem in which I have 4 objects (1s) on a 100x100 grid of zeros that is split up into 16 even squares of 25x25.
I need to create a (16^4 x 4) table whose entries list all the possible positions of each of these 4 objects across the 16 submatrices. The objects can be anywhere within the submatrices as long as they don't overlap one another. This is clearly a permutation problem, but there is added complexity because of the indexing and the fact that the positions need to be random but non-overlapping within a 16th square. Would love any pointers!
What I tried to do was create a function called "top_left_corner(position)" that returns the subscript of the top left corner of the sub-matrix you are in. E.g. top_left_corner(1) = (1,1), top_left_corner(2) = (26,1), etc. Then I have:
pos = randsample(24,2);
I = pos(1)+top_left_corner(position,1);
J = pos(2)+top_left_corner(position,2);
The problem is how to generate and store permutations of this in a table as linear indices.
First, a Cartesian product in the form of a [4, 16^4] matrix perm is generated using ndgrid. Then, in the while loop, random numbers are generated and added to perm. If any column of perm contains duplicated random numbers, random-number generation is repeated for those columns until no column has duplicated elements. Normally no more than 2-3 iterations are needed. Since the [100, 100] array is divided into 16 blocks, an index pattern matching the 16 blocks is generated with kron, and the indices of the sorted elements are extracted with the sort function. The generated random numbers then index into that pattern (the 16 blocks).
C = cell(1,4);
[C{:}]=ndgrid(0:15,0:15,0:15,0:15);
perm = reshape([C{:}],16^4,4).';
perm_rnd = zeros(size(perm));
c = 1:size(perm,2);
while true
    perm_rnd(:,c) = perm(:,c) * 625 + randi(625,4,numel(c));
    [~, c0] = find(diff(sort(perm_rnd(:,c),1),1,1)==0);
    if isempty(c0)
        break;
    end
    %c = c(unique(c0));
    c = c([true ; diff(c0)~=0]);
end
pattern = kron(reshape(1:16,4,4),ones(25));
[~,idx] = sort(pattern(:));
result = idx(perm_rnd).';
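The same rejection-sampling idea can be sketched in NumPy (a hypothetical port, not the answerer's MATLAB; here each block's 625 cells are simply numbered contiguously, and mapping back to row-major indices of the 100x100 grid would follow the kron/sort step above):

```python
import numpy as np

# all 16^4 assignments of 4 objects to 16 blocks, one column per assignment
blocks = np.stack(np.meshgrid(*[np.arange(16)] * 4, indexing='ij')).reshape(4, -1)

rng = np.random.default_rng(0)
# linear cell index per object: block id * 625 + random offset inside the block
cells = blocks * 625 + rng.integers(0, 625, size=blocks.shape)

# redraw any column where two objects landed on the same cell
bad = np.arange(cells.shape[1])
while True:
    dup = np.sort(cells[:, bad], axis=0)
    bad = bad[(np.diff(dup, axis=0) == 0).any(axis=0)]
    if bad.size == 0:
        break
    cells[:, bad] = blocks[:, bad] * 625 + rng.integers(0, 625, size=(4, bad.size))
```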
So I am trying to build one-factor models with stocks and indices in R. I have 30 stocks and 16 indices in total. They are all time series from "2013-1-1" to "2014-12-31". Well, at least all my stocks are: all of my indices are missing some entries here and there. For example, all of my stocks' data have a length of 522, but one index has a length of 250, another 300, another 400, etc. They all still start at "2013-1-1" and end at "2014-12-31". Because my index data has holes in it, I can't check correlations or build linear models with them. I can't do anything, basically. So I need to fill these holes. I am thinking about filling those holes with their mean, but I don't know how to do it. I am open to other ideas, of course. Can you help me? It is an important term project for me, so there is a lot on the line...
Edited based upon your comments (and to fix a mistake I made):
This is basic data management and I'm surprised that you're being required to work with timeseries data without knowing how to merge() and how to create dataframes.
Create some fake date and value data with holes in the dates:
dFA <- data.frame(seq.Date(as.Date("2014-01-01"), as.Date("2014-02-28"), 3))
names(dFA) <- "date"
dFA$vals <- rnorm(nrow(dFA), 25, 5)
Create a dataframe of dates from the min value in dFA to the max value in dFA
dFB <- as.data.frame(seq.Date(as.Date(min(dFA$date, na.rm = T), format = "%Y-%m-%d"),
as.Date(max(dFA$date, na.rm = T), format = "%Y-%m-%d"),
1))
names(dFB) <- "date"
Merge the two dataframes together
tmp <- merge(dFB, dFA, by = "date", all = T)
Change NA values in tmp$vals to whatever you want
tmp$vals[is.na(tmp$vals)] <- mean(dFA$vals)
head(tmp)
date vals
1 2014-01-01 18.48131
2 2014-01-02 24.16256
3 2014-01-03 24.16256
4 2014-01-04 28.78855
5 2014-01-05 24.16256
6 2014-01-06 24.16256
Original comment below
The easiest way to fill in the holes is with merge().
Create a new data frame with one vector as a sequence of dates that span the range of your original dataframe and the other vector with whatever you're going to fill the holes (zeroes, means, whatever). Then just merge() the two together:
merge(dFB, dFA, by = [the column with the date values], all = TRUE)
I need a particular form of 'set' partitioning that is escaping me, as it's not quite partitioning. Or rather, it's the subset of all partitions of a particular list that maintain the original order.
I have a list of n elements [a,b,c,...,n] in a particular order.
I need to get all discrete variations of partitioning that maintains the order.
So, for four elements, the result will be:
[{a,b,c,d}]
[{a,b,c},{d}]
[{a,b},{c,d}]
[{a,b},{c},{d}]
[{a},{b,c,d}]
[{a},{b,c},{d}]
[{a},{b},{c,d}]
[{a},{b},{c},{d}]
I need this for producing all possible groupings of tokens in a list that must maintain their order, for use in a broader pattern matching algorithm.
I've found only one other question that relates to this particular issue, but it's for Ruby. As I don't know the language (it looks like someone put code in a blender) and don't particularly feel like learning a language just for the sake of deciphering an algorithm, I feel I'm out of options.
I've tried to work it out mathematically so many times in so many ways it's getting painful. I thought I was getting closer by producing a list of partitions and iterating over it in different ways, but each number of elements required a different 'pattern' for iteration, and I had to tweak them in by hand.
I have no way of knowing just how many elements there could be, and I don't want to put an artificial cap on my processing to limit it just to the sizes I've tweaked together.
You can think of the problem as follows: each of the partitions you want is characterized by an integer between 0 and 2^(n-1) - 1. Each 1 in the binary representation of such a number corresponds to a "partition break" between two consecutive elements, e.g.
a b|c|d e|f
0 1 1 0 1
so the number 01101 corresponds to the partition {a,b},{c},{d,e},{f}. To generate the partition from a known partition number, loop through the list and slice off a new subset whenever the corresponding bit is set.
I can understand your pain reading the fashionable functional-programming-flavored Ruby example. Here's a complete example in Python if that helps.
array = ['a', 'b', 'c', 'd', 'e']
n = len(array)
for partition_index in range(2 ** (n - 1)):
    # current partition, e.g., [['a', 'b'], ['c', 'd', 'e']]
    partition = []
    # used to accumulate the subsets, e.g., ['a', 'b']
    subset = []
    for position in range(n):
        subset.append(array[position])
        # check whether to "break off" a new subset
        if 1 << position & partition_index or position == n - 1:
            partition.append(subset)
            subset = []
    print(partition)
Here's my recursive implementation of partitioning problem in Python. For me, recursive solutions are always easier to comprehend. You can find more explanation about it in here.
# Prints partitions of a set: [1,2] -> [[1],[2]], [[1,2]]
def part(lst, current=[], final=[]):
    if len(lst) == 0:
        if len(current) == 0:
            print(final)
        elif len(current) > 1:
            print([current] + final)
    else:
        part(lst[1:], current + [lst[0]], final[:])
        part(lst[1:], current[:], final + [[lst[0]]])
Since nobody has mentioned the backtracking technique for solving this, here is a Python solution that uses backtracking.
def partition(num):
    def backtrack(index, chosen):
        if index == len(num):
            print(chosen)
        else:
            for i in range(index, len(num)):
                # Choose
                cur = num[index:i + 1]
                chosen.append(cur)
                # Explore
                backtrack(i + 1, chosen)
                # Unchoose
                chosen.pop()
    backtrack(0, [])
>>> partition('123')
['1', '2', '3']
['1', '23']
['12', '3']
['123']