Slow numpy array indexing for keras time series generator - performance

I use the keras time series generator for training a neural network with LSTM cells, which unfortunately proved to be a bottleneck in training.
Below is a simplified example to run, which shows the high runtime of the batch generator. It is important to note that the rows from the dataset are chosen randomly and thus a sliding window is not possible. During the training the CPUs are running continuously at about 80%, whereas the GPU is running at a single-digit percentage rate.
import time
import numpy as np
def get_time_series(data, index, look_back, batch_size):
    # Output buffer: (batch, time steps, features)
    samples1 = np.empty((batch_size, look_back, np.size(data, axis=1)))
    # Random window end rows, drawn along the row axis (axis=0)
    rows = np.random.randint(look_back, np.size(data, axis=0), size=batch_size)
    for j, row in enumerate(rows):
        indices = range(row - look_back, row, 1)
        samples1[j] = data[indices]
    return samples1
data = np.random.rand(100000, 20)
start = time.time()
batch = get_time_series(data, index=50, look_back=1000, batch_size=2**12)
print("Batch generator needs", time.time()-start, "seconds")
Result:
Batch generator needs 0.6224319934844971 seconds
I already tried building the full 3-d array first, so that I only have to index its rows in the get_time_series function. This was about 60 times faster during training, but it leads to an "out of memory" error with large datasets.
Does anyone have ideas on how to improve the performance of this bottleneck? Working with pointers, faster indexing methods, ...
Thanks,
Max

EDIT 2:
Not sure if this is going to be any faster, but you can also just do something like this. It still relies on advanced indexing, although over contiguous data, so maybe it's a bit better?:
import numpy as np
def get_time_series(data, indices, look_back):
    # Make sure indices are big enough
    indices = indices[indices >= look_back]
    # Make indexing matrix
    idx = indices[:, np.newaxis] + np.arange(-look_back, 0)
    # Make batch
    return data[idx]
You would use it for example like this:
import numpy as np
def get_time_series(data, indices, look_back):
    indices = indices[indices >= look_back]
    idx = indices[:, np.newaxis] + np.arange(-look_back, 0)
    return data[idx]
def make_batches(data, look_back, batch_size):
    # Every valid window end position, visited in random order
    indices = np.random.permutation(np.arange(look_back, len(data) + 1))
    for i in range(0, len(indices), batch_size):
        yield get_time_series(data, indices[i:i + batch_size], look_back)
data = ...
look_back = ...
batch_size = ...
for batch in make_batches(data, look_back, batch_size):
    # Use batch
    pass
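To make the indexing matrix concrete, here is a small worked example. The array sizes and the window end positions 4 and 7 are made up purely for illustration:

```python
import numpy as np

data = np.arange(30).reshape(10, 3)   # 10 time steps, 3 features
indices = np.array([4, 7])            # hypothetical window end positions (exclusive)
look_back = 3

# A column of end positions broadcast against the offsets -3, -2, -1
idx = indices[:, np.newaxis] + np.arange(-look_back, 0)
batch = data[idx]                     # one (look_back, features) window per index

print(idx)          # rows 1..3 and rows 4..6
print(batch.shape)  # (2, 3, 3)
```

Each row of idx lists the look_back rows that precede one end position, so data[idx] gathers all windows in a single advanced-indexing call.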
EDIT:
If you want to shuffle the examples, you could first make the sliding window for the whole dataset (which should not take any memory or time) and then take batches from a shuffled index:
# Make sliding window with the previous function
data_sw = get_time_series(data, 0, look_back, len(data))
# Random index
batch_idx = np.random.permutation(len(data_sw))
# To get the first batch
batch = data_sw[batch_idx[:batch_size]]
I think this does what you want, and it should be quite a bit faster than using loops:
import numpy as np
from numpy.lib.stride_tricks import as_strided
def get_time_series(data, index, look_back, batch_size):
    # Index should be at least as big as look_back to have enough elements before it
    index = max(index, look_back)
    # Batch size should not go beyond the array
    batch_size = min(batch_size, len(data) - index + 1)
    # Relevant slice for the batch
    data_slice = data[index - look_back:index + batch_size]
    # Reshape with stride tricks as a "sliding window"
    data_strides = data_slice.strides
    batch_shape = (batch_size, look_back, data_slice.shape[-1])
    batch_strides = (data_strides[0], data_strides[0], data_strides[1])
    return as_strided(data_slice, batch_shape, batch_strides, writeable=False)
# Test
data = np.arange(300).reshape((100, 3))
batch = get_time_series(data, 20, 5, 4)
print(batch)
Output:
[[[45 46 47]
  [48 49 50]
  [51 52 53]
  [54 55 56]
  [57 58 59]]
 [[48 49 50]
  [51 52 53]
  [54 55 56]
  [57 58 59]
  [60 61 62]]
 [[51 52 53]
  [54 55 56]
  [57 58 59]
  [60 61 62]
  [63 64 65]]
 [[54 55 56]
  [57 58 59]
  [60 61 62]
  [63 64 65]
  [66 67 68]]]
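The "sliding window first, then a shuffled index" idea from the EDIT can also be sketched with NumPy's built-in sliding_window_view (available in NumPy 1.20 and later), which avoids hand-written strides. This is a sketch rather than the answer's own code: the helper name window_batch and the choice to pass window start rows explicitly are assumptions made for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def window_batch(data, look_back, starts):
    """Gather windows data[s:s + look_back] for every start row s in `starts`."""
    # Read-only view of shape (n - look_back + 1, features, look_back); no copy yet
    windows = sliding_window_view(data, look_back, axis=0)
    # Advanced indexing copies only the selected windows; move time to axis 1
    return windows[starts].transpose(0, 2, 1)

data = np.arange(300).reshape(100, 3)
look_back = 5
rng = np.random.default_rng(0)
# Valid start rows are 0 .. len(data) - look_back
starts = rng.integers(0, len(data) - look_back + 1, size=4)
batch = window_batch(data, look_back, starts)
print(batch.shape)  # (4, 5, 3)
```

Only the windows selected for a batch are materialised, so memory use scales with the batch size rather than with the whole dataset.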

Related

Problem by dictionaries to use numba njit parallelization to accelerate the code

I have written a code and try to use numba for accelerating the code. The main goal of the code is to group some values based on a condition. In this regard, iter_ is used for converging the code to satisfy the condition. I prepared a small case below to reproduce the sample code:
import numpy as np
import numba as nb
rng = np.random.default_rng(85)
# --------------------------------------- small data volume ---------------------------------------
# values_ = {'R0': np.array([0.01090976, 0.01069902, 0.00724112, 0.0068463 , 0.01135723, 0.00990762,
# 0.01090976, 0.01069902, 0.00724112, 0.0068463 , 0.01135723]),
# 'R1': np.array([0.01836379, 0.01900166, 0.01864162, 0.0182823 , 0.01840322, 0.01653088,
# 0.01900166, 0.01864162, 0.0182823 , 0.01840322, 0.01653088]),
# 'R2': np.array([0.02430913, 0.02239156, 0.02225379, 0.02093393, 0.02408692, 0.02110411,
# 0.02239156, 0.02225379, 0.02093393, 0.02408692, 0.02110411])}
#
# params = {'R0': [3, 0.9490579204466154, 1825, 7.070272000000002e-05],
# 'R1': [0, 0.9729203826820172, 167 , 7.070272000000002e-05],
# 'R2': [1, 0.6031363088057902, 1316, 8.007296000000003e-05]}
#
# Sno, dec_, upd_ = 2, 100, 200
# -------------------------------------------------------------------------------------------------
# ----------------------------- UPDATED (medium and large data volumes) ---------------------------
# values_ = np.load("values_med.npy", allow_pickle=True)[()]
# params = np.load("params_med.npy", allow_pickle=True)[()]
values_ = np.load("values_large.npy", allow_pickle=True)[()]
params = np.load("params_large.npy", allow_pickle=True)[()]
Sno, dec_, upd_ = 2000, 1000, 200
# -------------------------------------------------------------------------------------------------
# values_ = [*values_.values()]
# params = [*params.values()]
# @nb.jit(forceobj=True)
# def test(values_, params, Sno, dec_, upd_):
final_dict = {}
for i, j in enumerate(values_.keys()):
    Rand_vals = []
    goal_sum = params[j][1] * params[j][3]
    tel = goal_sum / dec_ * 10
    if params[j][0] != 0:
        for k in range(Sno):
            final_sum = 0.0
            iter_ = 0
            t = 1
            while not np.allclose(goal_sum, final_sum, atol=tel):
                iter_ += 1
                vals_group = rng.choice(values_[j], size=params[j][0], replace=False)
                # final_sum = 0.0016 * np.sum(vals_group) # -----> For small data volume
                final_sum = np.sum(vals_group ** 3) # -----> UPDATED For med or large data volume
                if iter_ == upd_:
                    t += 1
                    tel = t * tel
            values_[j] = np.delete(values_[j], np.where(np.in1d(values_[j], vals_group)))
            Rand_vals.append(vals_group)
    else:
        Rand_vals = [np.array([])] * Sno
    final_dict["R" + str(i)] = Rand_vals
# return final_dict
# test(values_, params, Sno, dec_, upd_)
At first, to apply numba to this code, @nb.jit was used (forceobj=True is used to avoid warnings and …), which has an adverse effect on performance. nopython mode was checked too, with @nb.njit, which gives the following error because the dictionary type of the inputs is not supported (as mentioned in 1, 2):
cannot determine Numba type of <class 'dict'>
I don't know if (or how) this could be handled with Dict from numba.typed (by converting the Python dictionaries to numba Dicts), or whether converting the dictionaries to lists of arrays would have any advantage. I think parallelization may be possible if some lines, e.g. Rand_vals.append(vals_group) or the else section or …, are moved out of the function or modified while keeping the same results as before, but I don't have any idea how to do so.
I would be grateful for help with using numba on this code. numba parallelization would be the most desirable solution (probably the best applicable method in terms of performance), if it is possible.
Data:
medium data volume: values_med, params_med
large data volume: values_large, params_large
This code can be converted to Numba but it is not straightforward.
First of all, the dictionary and list type must be defined since Numba njit functions cannot directly operate on reflected lists (aka. pure-python lists). This is a bit tedious to do in Numba and the resulting code is a bit verbose:
String = nb.types.unicode_type
ValueArray = nb.float64[::1]
ValueDict = nb.types.DictType(String, ValueArray)
ParamDictValue = nb.types.Tuple([nb.int_, nb.float64, nb.int_, nb.float64])
ParamDict = nb.types.DictType(String, ParamDictValue)
FinalDictValue = nb.types.ListType(ValueArray)
FinalDict = nb.types.DictType(String, FinalDictValue)
Then you need to convert the input dictionaries:
nbValues = nb.typed.typeddict.Dict.empty(String, ValueArray)
for key,value in values_.items():
    nbValues[key] = value.copy()
nbParams = nb.typed.typeddict.Dict.empty(String, ParamDictValue)
for key,value in params.items():
    nbParams[key] = (nb.int_(value[0]), nb.float64(value[1]), nb.int_(value[2]), nb.float64(value[3]))
Then, you need to write the core function. np.allclose and np.isin are not implemented in Numba, so they have to be reimplemented manually. But the main point is that Numba does not support the Numpy rng object, and I think it will certainly not support it any time soon. Note that Numba has a random number implementation that tries to mimic the behavior of Numpy, but the management of the seed is a bit different. Note also that results should be the same as with the np.random.xxx Numpy functions if the seed is set to the same value on both sides (Numpy and Numba have different seed variables that are not synchronized).
@nb.njit(FinalDict(ValueDict, ParamDict, nb.int_, nb.int_, nb.int_))
def nbTest(values_, params, Sno, dec_, upd_):
    final_dict = nb.typed.Dict.empty(String, FinalDictValue)
    for i, j in enumerate(values_.keys()):
        Rand_vals = nb.typed.List.empty_list(ValueArray)
        goal_sum = params[j][1] * params[j][3]
        tel = goal_sum / dec_ * 10
        if params[j][0] != 0:
            for k in range(Sno):
                final_sum = 0.0
                iter_ = 0
                t = 1
                vals_group = np.empty(0, dtype=nb.float64)
                # Manual replacement for np.allclose(goal_sum, final_sum, atol=tel)
                while np.abs(goal_sum - final_sum) > (1e-05 * np.abs(final_sum) + tel):
                    iter_ += 1
                    vals_group = np.random.choice(values_[j], size=params[j][0], replace=False)
                    # final_sum = 0.0016 * np.sum(vals_group) # (for small data volume)
                    final_sum = np.sum(vals_group ** 3) # (for med or large data volume)
                    if iter_ == upd_:
                        t += 1
                        tel = t * tel
                # Perform an in-place deletion
                vals, gr = values_[j], vals_group
                cur = 0
                for l in range(vals.size):
                    found = False
                    for m in range(gr.size):
                        found |= vals[l] == gr[m]
                    if not found:
                        # Keep the value (delete it otherwise)
                        vals[cur] = vals[l]
                        cur += 1
                values_[j] = vals[:cur]
                Rand_vals.append(vals_group)
        else:
            for k in range(Sno):
                Rand_vals.append(np.empty(0, dtype=nb.float64))
        final_dict["R" + str(i)] = Rand_vals
    return final_dict
Note that the replacement implementation of np.isin is quite naive but it works pretty well in practice on your input example.
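The two manual replacements mentioned above can be checked outside Numba. The following plain-NumPy sketch uses hypothetical helper names (is_close, remove_members) that are not part of the answer; it mirrors the scalar tolerance test that np.allclose applies and the naive in-place filter that stands in for np.isin plus np.delete.

```python
import numpy as np

def is_close(a, b, rtol=1e-05, atol=1e-08):
    # Scalar form of the np.allclose test: |a - b| <= atol + rtol * |b|
    return abs(a - b) <= atol + rtol * abs(b)

def remove_members(vals, group):
    """Compact `vals` in place, dropping every item equal to an item of `group`."""
    cur = 0
    for l in range(vals.size):
        found = False
        for m in range(group.size):
            if vals[l] == group[m]:
                found = True
                break
        if not found:
            vals[cur] = vals[l]   # keep the value by moving it forward
            cur += 1
    return vals[:cur]             # view on the kept prefix

vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 2.0])
group = np.array([2.0, 5.0])
expected = vals[~np.isin(vals, group)].copy()
kept = remove_members(vals, group)
print(kept)  # [1. 3. 4.]
```

The filter costs O(len(vals) * len(group)) comparisons, which is acceptable only because the group is tiny compared to the array.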
The function can be called as follows:
nbFinalDict = nbTest(nbValues, nbParams, Sno, dec_, upd_)
Finally, the dictionary should be converted back to basic Python objects:
finalDict = dict()
for key,value in nbFinalDict.items():
    finalDict[key] = list(value)
This implementation is fast for small inputs but not for large ones, since np.random.choice takes almost all the time (>96%). The thing is that this function is clearly not optimal when the number of requested items is small (which is your case). Indeed, it surprisingly runs in time linear in the size of the input array rather than in the number of requested items.
Further Optimizations
The algorithm can be completely rewritten to extract only 12 random items and discard them from the main current array in a much more efficient way. The idea is to swap n items (the small target sample) at the end of the array with other items at random locations, then check the sum, repeat this process until the condition is fulfilled, and finally extract the view of the last n items before shrinking the main view so as to discard those last items. All of this can be done in O(n) time rather than O(m) time, where m is the size of the main current array, with n << m (e.g. 12 vs. 20_000). It can also be computed without any expensive allocation. Here is the resulting code:
@nb.njit(nb.void(ValueArray, nb.int_, nb.int_))
def swap(arr, i, j):
    arr[i], arr[j] = arr[j], arr[i]
@nb.njit(FinalDict(ValueDict, ParamDict, nb.int_, nb.int_, nb.int_))
def nbTest(values_, params, Sno, dec_, upd_):
    final_dict = nb.typed.Dict.empty(String, FinalDictValue)
    for i, j in enumerate(values_.keys()):
        Rand_vals = nb.typed.List.empty_list(ValueArray)
        goal_sum = params[j][1] * params[j][3]
        tel = goal_sum / dec_ * 10
        values = values_[j]
        n = params[j][0]
        if n != 0:
            for k in range(Sno):
                final_sum = 0.0
                iter_ = 0
                t = 1
                m = values.size
                assert n <= m
                group = values[-n:]
                while np.abs(goal_sum - final_sum) > (1e-05 * np.abs(final_sum) + tel):
                    iter_ += 1
                    # Swap the group view with other random items
                    for pos in range(m - n, m):
                        swap(values, pos, np.random.randint(0, m))
                    # For small data volume:
                    # final_sum = 0.0016 * np.sum(group)
                    # For med/large data volume
                    final_sum = 0.0
                    for v in group:
                        final_sum += v ** 3
                    if iter_ == upd_:
                        t += 1
                        tel *= t
                assert iter_ > 0
                # Shrink the main view so the extracted group is discarded
                values = values[:m-n]
                Rand_vals.append(group)
        else:
            for k in range(Sno):
                Rand_vals.append(np.empty(0, dtype=nb.float64))
        final_dict["R" + str(i)] = Rand_vals
    return final_dict
In addition to being faster, this implementation has the benefit of also being simpler. Results look quite similar to those of the previous implementation, although the randomness makes checking the results tricky (especially since this function does not use the same method to choose the random sample). Note that, as opposed to the previous implementation, this one does not remove other items of values that are equal to items in group (that behaviour is probably not wanted anyway).
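The swap-to-the-tail idea can be tried outside Numba with a few lines of NumPy. Note that this sketch swaps in a classical partial Fisher-Yates shuffle (each tail position is swapped with a position at or before it), which is a slightly different swap rule from the one in the answer's code; the function name draw_tail_sample is hypothetical.

```python
import numpy as np

def draw_tail_sample(values, n, rng):
    """Move n randomly chosen items to the end of `values` in place.

    Returns a view on the last n items; cost is O(n), independent of values.size.
    """
    m = values.size
    for pos in range(m - 1, m - n - 1, -1):
        j = rng.integers(0, pos + 1)          # random position in [0, pos]
        values[pos], values[j] = values[j], values[pos]
    return values[m - n:]

rng = np.random.default_rng(0)
values = np.arange(20.0)
group = draw_tail_sample(values, 3, rng)
remaining = values[:values.size - 3]          # shrinking the view discards the group
print(group, remaining.size)
```

Because both group and remaining are views on the same buffer, no array is allocated when a sample is drawn or discarded.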
Benchmark
Here are the results of the last implementation on my machine (compilation and conversion timings excluded):
Provided small input (embedded in the question):
- Initial code: 42.71 ms
- Numba code: 0.11 ms
Medium input:
- Initial code: 3481 ms
- Numba code: 11 ms
Large input:
- Initial code: 6728 ms
- Numba code: 20 ms
Note that the conversion takes about the same time as the computation.
This last implementation is 316 to 388 times faster than the initial code on these inputs.
Notes
Note that the compilation takes a few seconds due to the dict and list types.
Note that while it may be possible to parallelise the implementation, only the outermost loop can be parallelised. The thing is that there are only a few items to compute and the time is already quite small (not the best case for multi-threading). Additionally, the list/dict cannot be written safely from multiple threads, so one needs to use Numpy arrays in the whole function to be able to do that (or add further conversions, which are already expensive). Moreover, Numba parallelism tends to significantly increase the compilation time, which is already significant. Finally, the result will be less deterministic, since each Numba thread has its own random number generator seed and the items computed by each thread cannot be predicted with prange (this depends on the parallel runtime chosen on the target platform). Note that in Numpy there is one global seed used by default by the usual random functions (the deprecated way), while RNG objects have their own seed (the new preferred way).

Tensorflow: Efficient multinomial sampling (Theano x50 faster?)

I want to be able to sample from a multinomial distribution very efficiently and apparently my TensorFlow code is very... very slow...
The idea is that, I have:
A vector: counts = [40, 50, 26, ..., 19] for example
A matrix of probabilities: probs = [[0.1, ..., 0.5], ... [0.3, ..., 0.02]] such that np.sum(probs, axis=1) = 1
Let's say len(counts) = N and len(probs) = (N, 50). What I want to do is (in our example):
sample 40 times from the first probability vector of the matrix probs
sample 50 times from the second probability vector of the matrix probs
...
sample 19 times from the Nth probability vector of the matrix probs
such that my final matrix looks like (for example):
A = [[22, ... 13], ..., [12, ..., 3]] where np.sum(A, axis=1) == counts
(i.e the sum over each row = the number in the corresponding row of counts vector)
Here is my TensorFlow code sample:
import numpy as np
import tensorflow as tf
import tensorflow.contrib.distributions as ds
import time
nb_distribution = 100 # number of probability distributions
counts = np.random.randint(2000, 3500, size=nb_distribution) # define number of counts (vector of size 100 with int in 2000, 3500)
# print(u[:40]) # should be the same as the output of print(np.sum(res, 1)[:40]) in the tf.Session()
# probsn is a matrix of probability:
# each row of probsn contains a vector of size 30 that sums to 1
probsn = np.random.uniform(size=(nb_distribution, 30))
probsn /= np.sum(probsn, axis=1)[:, None]
counts = tf.Variable(counts, dtype=tf.float32)
probs = tf.Variable(tf.convert_to_tensor(probsn.astype(np.float32)))
# sample from the multinomial
dist = ds.Multinomial(total_count=counts, probs=probs)
out = dist.sample()
start = time.time()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    res = sess.run(out)
    # print(np.sum(res, 1)[:40])
    print(time.time() - start)
elapsed time: 0.12 seconds
My equivalent code in Theano:
import time
import numpy as np
import theano
from theano.tensor import _shared
nb_distribution = 100 # number of probability distributions
counts = np.random.randint(2000, 3500, size=nb_distribution)
#print(u[:40]) # should be the same as the output of print(np.sum(v_sample(), 1)[:40])
counts = _shared(counts) # define number of counts (vector of size 100 with int in 2000, 3500)
# probsn is a matrix of probability:
# each row of probsn contains a vector that sums to 1
probsn = np.random.uniform(size=(nb_distribution, 30))
probsn /= np.sum(probsn, axis=1)[:, None]
probsn = _shared(probsn)
from theano.tensor.shared_randomstreams import RandomStreams
np_rng = np.random.RandomState(12345)
theano_rng = RandomStreams(np_rng.randint(2 ** 30))
v_sample = theano.function(inputs=[], outputs=theano_rng.multinomial(n=counts, pvals=probsn))
start_t = time.time()
out = np.sum(v_sample(), 1)[:40]
# print(out)
print(time.time() - start_t)
elapsed time: 0.0025 seconds
Theano is about 50x faster (0.0025 s vs. 0.12 s)... Is there something wrong with my TensorFlow code? How can I sample from a multinomial distribution efficiently in TensorFlow?
The problem is that the TensorFlow Multinomial sample() method actually calls the _sample_n() method. This method is defined here. As we can see in the code, to sample from the multinomial it produces a one-hot matrix for each row and then reduces that matrix to a vector by summing over the rows:
math_ops.reduce_sum(array_ops.one_hot(x, depth=k), axis=-2)
This is inefficient because it uses extra memory. To avoid this I have used the tf.scatter_nd function. Here is a fully runnable example:
import tensorflow as tf
import numpy as np
import tensorflow.contrib.distributions as ds
import time
tf.reset_default_graph()
nb_distribution = 100 # number of probabilities distribution
u = np.random.randint(2000, 3500, size=nb_distribution) # define number of counts (vector of size 100 with int in 2000, 3500)
# probsn is a matrix of probability:
# each row of probsn contains a vector of size 30 that sums to 1
probsn = np.random.uniform(size=(nb_distribution, 30))
probsn /= np.sum(probsn, axis=1)[:, None]
counts = tf.Variable(u, dtype=tf.float32)
probs = tf.Variable(tf.convert_to_tensor(probsn.astype(np.float32)))
# sample from the multinomial
dist = ds.Multinomial(total_count=counts, probs=probs)
out = dist.sample()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    res = sess.run(out) # if remove this line the code is slower...
    start = time.time()
    res = sess.run(out)
    print(time.time() - start)
    print(np.all(u == np.sum(res, axis=1)))
This code took 0.05 seconds to compute
def vmultinomial_sampling(counts, pvals, seed=None):
    k = tf.shape(pvals)[1]
    logits = tf.expand_dims(tf.log(pvals), 1)
    def sample_single(args):
        logits_, n_draw_ = args[0], args[1]
        x = tf.multinomial(logits_, n_draw_, seed)
        indices = tf.cast(tf.reshape(x, [-1,1]), tf.int32)
        updates = tf.ones(n_draw_) # tf.shape(indices)[0]
        return tf.scatter_nd(indices, updates, [k])
    x = tf.map_fn(sample_single, [logits, counts], dtype=tf.float32)
    return x
xx = vmultinomial_sampling(u, probsn)
# check = tf.expand_dims(counts, 1) * probs
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    res = sess.run(xx) # if remove this line the code is slower...
    start_t = time.time()
    res = sess.run(xx)
    print(time.time() -start_t)
    #print(np.sum(res, axis=1))
    print(np.all(u == np.sum(res, axis=1)))
This code took 0.016 seconds
The drawback is that my code doesn't actually parallelize the computation (even though the parallel_iterations parameter of map_fn is set to 10 by default, setting it to 1 doesn't change anything...)
Maybe someone will find something better, because it is still very slow compared to Theano's implementation (due to the fact that it doesn't take advantage of parallelization... and yet parallelization makes sense here, because sampling one row is independent of sampling another one...)
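The memory argument behind the scatter approach can be illustrated without TensorFlow. The following NumPy sketch is only an analogue of the two strategies, not the TensorFlow code itself: np.bincount plays the role of the scatter-add, and both function names are hypothetical.

```python
import numpy as np

def counts_via_one_hot(draws, k):
    # Builds an (n_draws, k) one-hot matrix, then sums it: O(n_draws * k) memory
    one_hot = np.eye(k, dtype=np.int64)[draws]
    return one_hot.sum(axis=0)

def counts_via_scatter(draws, k):
    # Accumulates directly into k bins: O(k) extra memory
    return np.bincount(draws, minlength=k)

rng = np.random.default_rng(0)
k = 30
p = rng.random(k)
p /= p.sum()
draws = rng.choice(k, size=2500, p=p)   # 2500 category draws for one row

a = counts_via_one_hot(draws, k)
b = counts_via_scatter(draws, k)
print(np.array_equal(a, b), b.sum())    # True 2500
```

Both functions return the same multinomial count vector; the difference is that the first one materialises an intermediate matrix whose size grows with the number of draws.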

Shuffling rows of large pandas DataFrame and correlation with a series

I need to independently shuffle each row of large pandas DataFrames several times (the typical shape is (10000,1000)) and then estimate the correlation of each row with a given series.
The most efficient (=quick) way I found to do it staying within pandas is the following:
for i in range(N):  # the larger N is, the better
    # df is my large dataframe, with 10K rows and 1K columns
    df_sh = df.apply(numpy.random.permutation, axis=1)
    # s is the provided series (shape of s = (1000,))
    corr = df_sh.corrwith(s, axis = 1)
The two tasks take approximately the same amount of time (namely 30 secs each). I tried converting my dataframe into a numpy.array and looping over the array: for each row, I first perform the permutation and then measure the correlation with scipy.stats.pearsonr. Unfortunately, this only sped up my two tasks by a factor of 2.
Are there other viable options to speed up the tasks even more? (NB: I am already parallelizing the execution of my code with Joblib, up to the maximum factor allowed by the machine I am using).
Correlation between a 2D matrix/array and 1D array/vector :
We can adapt corr2_coeff_rowwise for correlation between a 2D array/matrix and a 1D array/vector, like so -
def corr2_coeff_2d_1d(A, B):
    # Rowwise mean of input arrays & subtract from input arrays themselves
    A_mA = A - A.mean(1,keepdims=1)
    B_mB = B - B.mean()
    # Sum of squares across rows
    ssA = np.einsum('ij,ij->i',A_mA,A_mA)
    ssB = B_mB.dot(B_mB)
    # Finally get corr coeff
    return A_mA.dot(B_mB)/np.sqrt(ssA*ssB)
To shuffle each row independently, note that np.random.shuffle only works along the first axis and moves whole sub-arrays, so feeding it the transposed array would apply the same permutation to every row. Instead, we can generate an independent permutation of column indices for each row by argsorting an array of random numbers, and then apply those indices with advanced indexing. This does not modify the original data in place.
Hence, let's use this to solve our case -
# Extract underlying array data for faster NumPy processing in loop later on
a = df.values
s_ar = s.values
# Setup array for row-indexing with NumPy's advanced indexing later on
r = np.arange(a.shape[0])[:,None]
for i in range(N):
    # Get shuffled indices per row with `rand+argsort/argpartition` trick from -
    # https://stackoverflow.com/a/45438143/
    idx = np.random.rand(*a.shape).argsort(1)
    # Shuffle array data with NumPy's advanced indexing
    shuffled_a = a[r, idx]
    # Compute correlation
    corr = corr2_coeff_2d_1d(shuffled_a, s_ar)
Optimized version #1
Now, we could pre-compute for the parts involving the series that stays the same between iterations. Hence, a further optimized version would look like this -
a = df.values
s_ar = s.values
r = np.arange(a.shape[0])[:,None]
B = s_ar
B_mB = B - B.mean()
ssB = B_mB.dot(B_mB)
A = a
A_mean = A.mean(1,keepdims=1)
for i in range(N):
    # Get shuffled indices per row with `rand+argsort/argpartition` trick from -
    # https://stackoverflow.com/a/45438143/
    idx = np.random.rand(*a.shape).argsort(1)
    # Shuffle array data with NumPy's advanced indexing
    shuffled_a = a[r, idx]
    # Compute correlation
    A = shuffled_a
    A_mA = A - A_mean
    ssA = np.einsum('ij,ij->i',A_mA,A_mA)
    corr = A_mA.dot(B_mB)/np.sqrt(ssA*ssB)
Benchmarking
Setup inputs with actual use-case shapes/sizes
In [302]: df = pd.DataFrame(np.random.rand(10000,1000))
In [303]: s = pd.Series(df.iloc[0])
1. Original method
In [304]: %%timeit
...: df_sh = df.apply(np.random.permutation, axis=1)
...: corr = df_sh.corrwith(s, axis = 1)
1 loop, best of 3: 1.99 s per loop
2. Proposed method
The pre-processing part (only done once before starting loop, so not including in timings) -
In [305]: a = df.values
...: s_ar = s.values
...: r = np.arange(a.shape[0])[:,None]
...:
...: B = s_ar
...: B_mB = B - B.mean()
...: ssB = B_mB.dot(B_mB)
...:
...: A = a
...: A_mean = A.mean(1,keepdims=1)
Part of proposed solution that runs in loop -
In [306]: %%timeit
...: idx = np.random.rand(*a.shape).argsort(1)
...: shuffled_a = a[r, idx]
...:
...: A = shuffled_a
...: A_mA = A - A_mean
...: ssA = np.einsum('ij,ij->i',A_mA,A_mA)
...: corr = A_mA.dot(B_mB)/np.sqrt(ssA*ssB)
1 loop, best of 3: 675 ms per loop
Thus, we are seeing a speedup of around 3x here!
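As a sanity check of the row-wise correlation formula, here is a small self-contained sketch that compares it against np.corrcoef. It swaps in Generator.permuted (NumPy 1.20 and later) for the rand+argsort trick to shuffle each row independently; the function name corr_rows_with_vector and the tiny array sizes are assumptions for illustration, and no claim is made here about relative speed.

```python
import numpy as np

def corr_rows_with_vector(A, b):
    """Pearson correlation of every row of A with the 1D vector b."""
    A_c = A - A.mean(axis=1, keepdims=True)
    b_c = b - b.mean()
    ssA = np.einsum('ij,ij->i', A_c, A_c)   # row-wise sum of squares
    ssB = b_c.dot(b_c)
    return A_c.dot(b_c) / np.sqrt(ssA * ssB)

rng = np.random.default_rng(0)
a = rng.random((6, 50))
s_ar = rng.random(50)

# Independent permutation of each row; returns a shuffled copy
shuffled = rng.permuted(a, axis=1)
corr = corr_rows_with_vector(shuffled, s_ar)

# Reference values computed one row at a time
expected = np.array([np.corrcoef(row, s_ar)[0, 1] for row in shuffled])
print(np.allclose(corr, expected))  # True
```

Since a permutation leaves the row mean unchanged, the mean can be computed once outside the loop, which is exactly what the optimized version does with A_mean.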

How to sample without replacement and reweigh each time (conditional sampling)?

Consider a dataset of N rows with weights. This is the basic algorithm:
1. Normalize the weights so that they sum to 1.
2. Back up the weights into another column to record the sample probabilities.
3. Randomly choose 1 row (without replacement), given the sample probabilities, and add it to the sample dataset.
4. Remove the drawn weight from the original dataset, and recompute the sample probabilities by normalizing the weights of the remaining rows.
5. Repeat steps 3 and 4 until the sum of weights in the sample reaches or exceeds a threshold (assume 0.6).
Here is a toy example:
import pandas as pd
import numpy as np
def sampler(n):
    df = pd.DataFrame(np.random.rand(n), columns=['weight'])
    df['weight'] = df['weight']/df['weight'].sum()
    df['samp_prob'] = df['weight']
    samps = pd.DataFrame(columns=['weight'])
    while True:
        choice = np.random.choice(df.index, 1, replace=False, p=df['samp_prob'])[0]
        samps.loc[choice, 'weight'] = df.loc[choice, 'weight']
        df.drop(choice, axis=0, inplace=True)
        df['samp_prob'] = df['weight']/df['weight'].sum()
        if samps['weight'].sum() >= 0.6:
            break
    return samps
The problem with the toy example is the exponential growth in run times with increasing size of n.
Starting off approach
A few observations:
Dropping rows at each iteration, which creates new dataframes, isn't helping with the performance.
This doesn't look easy to vectorize, BUT it should be easy to work with the underlying array data for performance. The idea is to use masks and avoid re-creating dataframes or arrays. Starting off, we would use a two-column array, corresponding to the columns named 'weight' and 'samp_prob'.
So, with those in mind, the starting approach would be something like this -
def sampler2(n):
    a = np.random.rand(n,2)
    a[:,0] /= a[:,0].sum()
    a[:,1] = a[:,0]
    N = len(a)
    idx = np.arange(N)
    mask = np.ones(N,dtype=bool)
    while True:
        choice = np.random.choice(idx[mask], 1, replace=False, p=a[mask,1])[0]
        mask[choice] = 0
        a_masked = a[mask,0]
        a[mask,1] = a_masked/a_masked.sum()
        if a[~mask,0].sum() >= 0.6:
            break
    out = a[~mask,0]
    return out
Improvement #1
A later observation revealed that the first column of the array isn't changing across iterations. So, we could optimize the masked summations for the first column by pre-computing the total summation; then, at each iteration, a[~mask,0].sum() is simply the total summation minus a_masked.sum(). This leads us to the first improvement, listed below -
def sampler3(n):
    a = np.random.rand(n,2)
    a[:,0] /= a[:,0].sum()
    a[:,1] = a[:,0]
    N = len(a)
    idx = np.arange(N)
    mask = np.ones(N,dtype=bool)
    a0_sum = a[:,0].sum()
    while True:
        choice = np.random.choice(idx[mask], 1, replace=False, p=a[mask,1])[0]
        mask[choice] = 0
        a_masked = a[mask,0]
        a_masked_sum = a_masked.sum()
        a[mask,1] = a_masked/a_masked_sum
        if a0_sum - a_masked_sum >= 0.6:
            break
    out = a[~mask,0]
    return out
Improvement #2
Now, slicing and masking into the columns of a 2D array could be improved by using two separate arrays instead, given that the first column wasn't changing between iterations. That gives us a modified version, like so -
def sampler4(n):
    a = np.random.rand(n)
    a /= a.sum()
    b = a.copy()
    N = len(a)
    idx = np.arange(N)
    mask = np.ones(N,dtype=bool)
    a_sum = a.sum()
    while True:
        choice = np.random.choice(idx[mask], 1, replace=False, p=b[mask])[0]
        mask[choice] = 0
        a_masked = a[mask]
        a_masked_sum = a_masked.sum()
        b[mask] = a_masked/a_masked_sum
        if a_sum - a_masked_sum >= 0.6:
            break
    out = a[~mask]
    return out
Runtime test -
In [250]: n = 1000
In [251]: %timeit sampler(n) # original app
...: %timeit sampler2(n)
...: %timeit sampler3(n)
...: %timeit sampler4(n)
1 loop, best of 3: 655 ms per loop
10 loops, best of 3: 50 ms per loop
10 loops, best of 3: 44.9 ms per loop
10 loops, best of 3: 38.4 ms per loop
In [252]: n = 2000
In [253]: %timeit sampler(n) # original app
...: %timeit sampler2(n)
...: %timeit sampler3(n)
...: %timeit sampler4(n)
1 loop, best of 3: 1.32 s per loop
10 loops, best of 3: 134 ms per loop
10 loops, best of 3: 119 ms per loop
10 loops, best of 3: 100 ms per loop
Thus, we are getting 17x+ and 13x+ speedups with the final version over the original method for n=1000 and n=2000 sizes!
I think you can rewrite this while loop to do it in a single pass:
while True:
    choice = np.random.choice(df.index, 1, replace=False, p=df['samp_prob'])[0]
    samps.loc[choice, 'weight'] = df.loc[choice, 'weight']
    df.drop(choice, axis=0, inplace=True)
    df['samp_prob'] = df['weight']/df['weight'].sum()
    if samps['weight'].sum() >= 0.6:
        break
to something more like:
n = len(df.index)
ind = np.random.choice(n, n, replace=False, p=df["samp_prob"])
res = df.iloc[ind]
# first position where the running total of weights satisfies .sum() >= 0.6
i = np.argmax(res["weight"].cumsum().to_numpy() >= 0.6)
samps = res.iloc[:i+1]
The key parts are that choice can take multiple elements (indeed the entire array) whilst still respecting the probabilities. The cumsum allows you to cut off after passing the 0.6 threshold.
In this example you can see that the array is randomly chosen, but that 4 is most likely chosen nearer the top.
In [11]: np.random.choice(5, 5, replace=False, p=[0.05, 0.05, 0.1, 0.2, 0.6])
Out[11]: array([0, 4, 3, 2, 1])
In [12]: np.random.choice(5, 5, replace=False, p=[0.05, 0.05, 0.1, 0.2, 0.6])
Out[12]: array([3, 4, 1, 2, 0])
In [13]: np.random.choice(5, 5, replace=False, p=[0.05, 0.05, 0.1, 0.2, 0.6])
Out[13]: array([0, 4, 3, 1, 2])
In [14]: np.random.choice(5, 5, replace=False, p=[0.05, 0.05, 0.1, 0.2, 0.6])
Out[14]: array([4, 3, 0, 2, 1])
In [15]: np.random.choice(5, 5, replace=False, p=[0.05, 0.05, 0.1, 0.2, 0.6])
Out[15]: array([4, 2, 3, 0, 1])
In [16]: np.random.choice(5, 5, replace=False, p=[0.05, 0.05, 0.1, 0.2, 0.6])
Out[16]: array([3, 4, 2, 0, 1])
Note: replace=False ensures the probabilities are "reweighed", in the sense that a row that has been drawn can't be picked again.
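The single-pass idea can also be written against plain arrays. The sketch below is an illustration under assumptions, not the answer's exact code: the function name sample_until_threshold is hypothetical, it uses the newer Generator API instead of np.random.choice, and it locates the cut-off with np.searchsorted on the running total.

```python
import numpy as np

def sample_until_threshold(weights, threshold, rng):
    """Return row positions drawn without replacement until the summed
    normalized weight first reaches `threshold`."""
    p = weights / weights.sum()
    # One call draws a full weighted ordering without replacement
    order = rng.choice(p.size, size=p.size, replace=False, p=p)
    cum = np.cumsum(p[order])
    # First position whose running total is >= threshold
    stop = int(np.searchsorted(cum, threshold))
    return order[:stop + 1]

rng = np.random.default_rng(0)
weights = rng.random(100)
picked = sample_until_threshold(weights, 0.6, rng)

p = weights / weights.sum()
print(len(picked), p[picked].sum())
```

The running total is non-decreasing because all weights are non-negative, which is what makes the binary search in np.searchsorted valid here.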

Find the smallest sum of the squares of two measurements taken at least 5 min apart

I'm trying to solve this problem in Python3. I know how to find min1 and min2, but I cannot guess how to search 5 elements in a single pass.
Problem Statement
The program's input consists of measurements performed by a device at intervals of 1 minute. All values are natural numbers not exceeding 1000. The problem is to find the smallest sum of the squares of two measurements taken at least 5 minutes apart. The first line contains one natural number, the number of measurements N. It is guaranteed that 5 < N <= 10000. Each of the following N lines contains one natural number, the result of the next measurement.
Your program should output a single number, the lowest sum of the squares of two measurements performed at intervals not less than 5 minutes apart.
Sample input:
9
12
45
5
4
21
20
10
12
26
Expected output: 169
I like this question. Fun brain-teaser. :)
I noticed your sample input was all integers in range(1, 100) with some repetition, so I generated sample lists like so:
>>> import random
>>> sample_list = [random.choice(range(1, 100)) for i in range(10)]
>>> sample_list
[74, 68, 57, 18, 36, 8, 89, 73, 77, 80]
According to the problem statement, these numbers represent data measured at one-minute intervals, and one of our constraints is that our result must represent data gathered at least five minutes apart. Ultimately, that means the indices of the data in the original list must differ by at least five. In other words, for any two inputs v1 and v2:
abs(sample_list.index(v1) - sample_list.index(v2)) >= 5
must be true. We also know that we're searching for the smallest sum, so it will be helpful to look at the smallest numbers first.
Thus, I started by mapping the values in the sample_list to the indices where they occur, then sorting them:
>>> occurrences = {}
>>> for index, value in enumerate(sample_list):
...     try:
...         occurrences[value].append(index)
...     except KeyError:
...         occurrences[value] = [index]
...
>>> occurrences
{80: [9], 18: [3], 68: [1], 73: [7], 89: [6], 8: [5], 57: [2], 74: [0], 77: [8], 36: [4]}
>>> sorted_occurrences = sorted(occurrences)
>>> sorted_occurrences
[8, 18, 36, 57, 68, 73, 74, 77, 80, 89]
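As a side note, the same value-to-indices mapping can be built without the try/except by using collections.defaultdict. This is only a sketch of an equivalent alternative; the helper name index_occurrences is made up for illustration.

```python
from collections import defaultdict

def index_occurrences(sample):
    """Map each value to the list of positions where it occurs (hypothetical helper)."""
    occurrences = defaultdict(list)
    for index, value in enumerate(sample):
        # Missing keys start out as an empty list automatically
        occurrences[value].append(index)
    return occurrences

sample_list = [74, 68, 57, 18, 36, 8, 89, 73, 77, 80]
occurrences = index_occurrences(sample_list)
print(sorted(occurrences))  # values in ascending order
```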
After a whole lot of trial and error, here's what I finally came up with in function form (including some of the earlier-discussed pieces):
def smallest_sum_of_squares_five_apart(sample):
    occurrences = {}
    for index, value in enumerate(sample):
        try:
            occurrences[value].append(index)
        except KeyError:
            occurrences[value] = [index]
    sorted_occurrences = sorted(occurrences)
    least_sum = 0
    for index, v1 in enumerate(sorted_occurrences):
        if least_sum and v1**2 > least_sum:
            return least_sum
        for v2 in sorted_occurrences[:index+1]:
            if (abs(max(occurrences[v1]) - min(occurrences[v2])) >= 5 or
                    abs(max(occurrences[v2]) - min(occurrences[v1])) >= 5):
                print('Found candidates:', str((v1, v2)))
                sum_of_squares = v1**2 + v2**2
                if not least_sum or sum_of_squares < least_sum:
                    least_sum = sum_of_squares
    return least_sum
The idea here is to:
Start by looking at the smallest values first.
Compare them one by one with all the values smaller, up to themselves.
Check each against our criteria. Notice we do this by checking the extremes of each, where these two numbers occur the farthest away from one another in the original sample.
Break out when checking becomes pointless.
Unfortunately, it is not sufficient to stop at the first pair that qualifies. Depending on how the list is constructed, this approach will not always find the smallest pair first. In fact, it does not for your own sample input. However, once v1**2 (the square of the larger value) is larger than the best sum found so far, we know it is pointless to continue looking, since all the numbers are natural numbers.
I have included a full runnable implementation of this below. It takes a command line argument (default 10) indicating the number of items you want in the randomly generated sample. It will print the randomly generated sample as well as all candidate pairs it checked, and finally the sum itself. I have checked this on 10-sized inputs several times and it seems to be working in general. However, feedback is welcome if it is not correct. Note also you can uncomment your sample list from the question to see how it works (and that it gets the right answer) for it.
import random
import sys

def smallest_sum_of_squares_five_apart(sample):
    occurrences = {}
    for index, value in enumerate(sample):
        try:
            occurrences[value].append(index)
        except KeyError:
            occurrences[value] = [index]
    sorted_occurrences = sorted(occurrences)
    least_sum = 0
    for index, v1 in enumerate(sorted_occurrences):
        if least_sum and v1**2 > least_sum:
            return least_sum
        for v2 in sorted_occurrences[:index+1]:
            if (abs(max(occurrences[v1]) - min(occurrences[v2])) >= 5 or
                    abs(max(occurrences[v2]) - min(occurrences[v1])) >= 5):
                print('Found candidates:', str((v1, v2)))
                sum_of_squares = v1**2 + v2**2
                if not least_sum or sum_of_squares < least_sum:
                    least_sum = sum_of_squares
    return least_sum

if __name__ == '__main__':
    try:
        r = int(sys.argv[1])
    except IndexError:
        r = 10
    sample_list = [random.choice(range(1, 100)) for i in range(r)]
    #sample_list = [9, 12, 45, 5, 4, 21, 20, 10, 12, 26]
    print(sample_list)
    print(smallest_sum_of_squares_five_apart(sample_list))
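For checking an approach like this on small inputs, a brute-force baseline that tries every pair of positions at least five apart may be handy. This is just a sketch for cross-checking, not a replacement: it is O(N^2), and the name brute_force_smallest_sum and the min_gap parameter are made up for illustration. The sample passed in below is the nine measurements from the question, without the leading count.

```python
def brute_force_smallest_sum(sample, min_gap=5):
    """Try every pair of positions at least `min_gap` apart (hypothetical helper)."""
    best = None
    for i in range(len(sample)):
        # j starts min_gap positions after i, so the index constraint always holds
        for j in range(i + min_gap, len(sample)):
            candidate = sample[i] ** 2 + sample[j] ** 2
            if best is None or candidate < best:
                best = candidate
    return best

print(brute_force_smallest_sum([12, 45, 5, 4, 21, 20, 10, 12, 26]))  # prints 169
```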
Try this:
#!/usr/bin/env python3
import queue

inp = [9,12,45,5,4,21,20,10,12,26]

q = queue.Queue()  #Make a new queue
smallest = False   #No smallest number, yet
best = False       #No best sum of squares, yet

for x in inp:
    q.put(x)  #Place current element on queue
    #If there's an item from more than five minutes ago, consider it
    if q.qsize()>5:
        temp = q.get()  #Pop oldest item from queue into temporary variable
        if not smallest:  #If this is the first item more than 5 minutes old
            smallest = temp  #it is the smallest item by default
        else:  #otherwise...
            smallest = min(temp,smallest)  #only store it if it is the smallest yet
        #If we have no best sum of squares or the current item produces one, then
        #save it as the best
        if (not best) or (x*x+smallest*smallest<best):
            best = x*x+smallest*smallest

print(best)
The idea is to walk through the input while keeping track of the smallest element seen so far that is at least five minutes old. Each new element is compared against that element, and we keep the smallest sum of squares as we go.
I think you'll find the answer to be pretty intuitive if you think about it.
The algorithm operates in O(N) time.
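The same single-pass idea can also be sketched without a queue, assuming the measurements are already in a list so the element five positions back can be read by index. The name smallest_sum_single_pass and the min_gap parameter are made up for illustration. None is used as the "nothing yet" marker instead of False.

```python
def smallest_sum_single_pass(measurements, min_gap=5):
    """O(N) scan: pair each element with the smallest one at least
    `min_gap` positions before it (hypothetical helper)."""
    best = None
    smallest_old = None
    for i in range(min_gap, len(measurements)):
        # The element exactly min_gap positions back becomes eligible now
        old = measurements[i - min_gap]
        if smallest_old is None or old < smallest_old:
            smallest_old = old
        candidate = measurements[i] ** 2 + smallest_old ** 2
        if best is None or candidate < best:
            best = candidate
    return best

print(smallest_sum_single_pass([12, 45, 5, 4, 21, 20, 10, 12, 26]))  # prints 169
```

Only the running minimum needs to be remembered, so this uses O(1) extra memory on top of the input list.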

Resources