Creating a slice of a matrix from a vector in Numpy - performance

Suppose I have a matrix A of size 5 by 4, and a vector b of length 5 whose elements indicate how many values I need from the corresponding row of A. Each value in b is therefore bounded above by the size of the second dimension of A. My problem is how to build such a slice of a matrix given a vector; it is a vectorized version of taking the first n elements of a single vector with vector[:n].
For example, this can be implemented with a loop over A's rows:
import numpy
A = numpy.arange(20).reshape((5, 4))
b = numpy.array([0, 3, 3, 2, 3])
output = A[0, :b[0]]
for i in range(1, A.shape[0]):
    output = numpy.concatenate((output, A[i, :b[i]]), axis=0)
# output is array([ 4,  5,  6,  8,  9, 10, 12, 13, 16, 17, 18])
The computational efficiency of this loop can be fairly low when dealing with a very large array. Furthermore, my goal is to eventually apply this in Theano without a scan operation. I want to avoid using a loop to take slices given a vector.

Another good setup for using NumPy broadcasting!
A[b[:,None] > np.arange(A.shape[1])]
Sample run
1) Inputs :
In [16]: A
Out[16]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])
In [17]: b
Out[17]: array([0, 3, 3, 2, 3])
2) Use broadcasting to create mask for selection :
In [18]: b[:,None] > np.arange(A.shape[1])
Out[18]:
array([[False, False, False, False],
       [ True,  True,  True, False],
       [ True,  True,  True, False],
       [ True,  True, False, False],
       [ True,  True,  True, False]], dtype=bool)
3) Finally use boolean-indexing for selecting elems off A :
In [19]: A[b[:,None] > np.arange(A.shape[1])]
Out[19]: array([ 4, 5, 6, 8, 9, 10, 12, 13, 16, 17, 18])
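Since the question ultimately wants this in Theano without a scan, the same broadcasted mask should translate directly. Here is an untested sketch, assuming Theano's standard tensor API (dimshuffle, arange, and indexing via mask.nonzero()):
import theano
import theano.tensor as T

A_t = T.imatrix('A')
b_t = T.ivector('b')
# same comparison as b[:,None] > np.arange(A.shape[1]) in NumPy
mask = b_t.dimshuffle(0, 'x') > T.arange(A_t.shape[1])
out_t = A_t[mask.nonzero()]   # boolean masks are applied via nonzero(); no scan needed
f = theano.function([A_t, b_t], out_t)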

You could speed up the loop by collecting values in a list, and doing just one concatenate:
In [126]: [A[i,:j] for i,j in enumerate(b)]
Out[126]:
[array([], dtype=int32),
 array([4, 5, 6]),
 array([ 8,  9, 10]),
 array([12, 13]),
 array([16, 17, 18])]
In [127]: np.concatenate([A[i,:j] for i,j in enumerate(b)])
Out[127]: array([ 4, 5, 6, 8, 9, 10, 12, 13, 16, 17, 18])
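If you want to compare the two approaches on your own data, a minimal timing harness (assuming import numpy as np and a larger random test case) could look like this:
import timeit
import numpy as np

A = np.arange(4000).reshape(1000, 4)
b = np.random.randint(0, 5, size=1000)   # per-row counts in [0, 4]

mask_way = lambda: A[b[:, None] > np.arange(A.shape[1])]
concat_way = lambda: np.concatenate([A[i, :j] for i, j in enumerate(b)])

print(timeit.timeit(mask_way, number=1000))    # broadcasting + boolean indexing
print(timeit.timeit(concat_way, number=1000))  # per-row slices, one concatenate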

Related

Split an array that contains two interleaved sorted sequences into two sorted arrays in O(n) time

Assume I have two arrays, both of them sorted, for example:
A: [1, 4, 5, 8, 10, 24]
B: [3, 6, 9, 29, 50, 65]
I then merge these two arrays into one, keeping the original relative order of both:
C: [1, 4, 3, 5, 6, 9, 8, 29, 10, 24, 50, 65]
Is there any way to split C into two sorted arrays in O(n) time?
note: not necessarily into the original A and B
Greedily assign your integers to list 1 if they can go there (i.e., the value is at least the last element of list 1). If they can't, assign them to list 2. One can check that whenever the input really is a merge of two sorted lists, list 2 also comes out sorted: if the greedy ever placed a decreasing pair in list 2, tracing which of the two original streams each element came from yields a contradiction.
Here's some Ruby code to play around with this idea. It randomly splits the integers from 0 to n-1 into two sorted lists, then randomly merges them, then applies the greedy approach.
def f(n)
  split1 = []
  split2 = []
  0.upto(n - 1) do |i|
    if rand < 0.5
      split1.append(i)
    else
      split2.append(i)
    end
  end
  puts "input 1: #{split1.to_s}"
  puts "input 2: #{split2.to_s}"

  merged = []
  split1.reverse!
  split2.reverse!
  while split1.length > 0 && split2.length > 0
    if rand < 0.5
      merged.append(split1.pop)
    else
      merged.append(split2.pop)
    end
  end
  merged += split1.reverse
  merged += split2.reverse
  puts "merged: #{merged.to_s}"

  merged.reverse!
  greedy1 = [merged.pop]
  greedy2 = []
  while merged.length > 0
    if merged[-1] >= greedy1[-1]
      greedy1.append(merged.pop)
    else
      greedy2.append(merged.pop)
    end
  end
  puts "greedy1: #{greedy1.to_s}"
  puts "greedy2: #{greedy2.to_s}"
end
Here's sample output:
> f(20)
input 1: [2, 3, 4, 5, 8, 9, 10, 18, 19]
input 2: [0, 1, 6, 7, 11, 12, 13, 14, 15, 16, 17]
merged: [2, 0, 1, 6, 3, 4, 5, 8, 9, 7, 10, 11, 18, 12, 13, 19, 14, 15, 16, 17]
greedy1: [2, 6, 8, 9, 10, 11, 18, 19]
greedy2: [0, 1, 3, 4, 5, 7, 12, 13, 14, 15, 16, 17]
> f(20)
input 1: [1, 3, 5, 6, 8, 9, 10, 11, 13, 15]
input 2: [0, 2, 4, 7, 12, 14, 16, 17, 18, 19]
merged: [0, 2, 4, 7, 12, 14, 16, 1, 3, 5, 6, 8, 17, 9, 18, 10, 19, 11, 13, 15]
greedy1: [0, 2, 4, 7, 12, 14, 16, 17, 18, 19]
greedy2: [1, 3, 5, 6, 8, 9, 10, 11, 13, 15]
> f(20)
input 1: [0, 1, 2, 6, 7, 9, 11, 14, 15, 18]
input 2: [3, 4, 5, 8, 10, 12, 13, 16, 17, 19]
merged: [3, 4, 5, 8, 10, 12, 0, 13, 16, 17, 1, 19, 2, 6, 7, 9, 11, 14, 15, 18]
greedy1: [3, 4, 5, 8, 10, 12, 13, 16, 17, 19]
greedy2: [0, 1, 2, 6, 7, 9, 11, 14, 15, 18]
Let's take your example.
[1, 4, 3, 5, 6, 9, 8, 29, 10, 24, 50, 65]
In O(n) time, scanning right to left, you can work out the minimum of every tail:
[1, 3, 3, 5, 6, 8, 8, 10, 10, 24, 50, 65]
Now one stream is all positions where the element equals that tail minimum, and the other is all positions where it doesn't:
[1, 3, 5, 6, 8, 10, 24, 50, 65]
[ 4, 9, 29, ]
This is all doable in time O(n).
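A minimal Python sketch of this split; the tail minima are computed right to left, and both output streams come out sorted whenever the input really is a merge of two sorted arrays:
def split_merged(c):
    tail_min = c[:]                      # tail_min[i] = min(c[i:])
    for i in range(len(c) - 2, -1, -1):
        tail_min[i] = min(tail_min[i], tail_min[i + 1])
    s1 = [x for x, m in zip(c, tail_min) if x == m]   # equals its tail minimum
    s2 = [x for x, m in zip(c, tail_min) if x != m]   # everything else
    return s1, s2

print(split_merged([1, 4, 3, 5, 6, 9, 8, 29, 10, 24, 50, 65]))
# ([1, 3, 5, 6, 8, 10, 24, 50, 65], [4, 9, 29])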
We can go further and split into 3 streams, based on which values in the first stream could also have gone into the second without breaking its increasing order.
[ 3, 5, 6, 8, 10, 24, ]
[1, 5, 6, 8, 50, 65]
[ 4, 9, 29, ]
And now we can start enumerating the 2^6 = 64 different ways of splitting the original stream back into 2 increasing streams.

Get the first possible combination of numbers in a list that adds up to a certain sum

Given a list of numbers, find the first combination of numbers that adds up to a certain sum.
Example:
Given list: [1, 2, 3, 4, 5]
Given sum: 5
Response: [1, 4]
The response can also be [2, 3]; it doesn't matter. What matters is that we get a combination of numbers from the given list that adds up to the given sum, as fast as possible.
I tried doing this with itertools.combinations in Python, but it takes way too long:
from typing import List
import itertools

def test(target_sum, numbers):
    for i in range(len(numbers), 0, -1):
        for seq in itertools.combinations(numbers, i):
            if sum(seq) == target_sum:
                return seq

if __name__ == "__main__":
    target_sum: int = 616
    numbers: List[int] = [16, 96, 16, 32, 16, 4, 4, 32, 32, 10, 16, 8, 32, 8, 4, 16, 8, 8, 8, 16, 8, 8, 8, 16, 8, 16, 16, 4, 8, 8, 16, 12, 16, 16, 8, 16, 8, 8, 8, 8, 4, 32, 16, 8, 32, 16, 8, 8, 8, 8, 16, 32, 8, 32, 8, 8, 16, 24, 32, 8]
    print(test(target_sum, numbers))
def subsum(tsum, numbers):
    a = [0] * (tsum + 1)
    a[0] = -1                             # any truthy marker: sum 0 is always reachable
    for x in numbers:
        for i in range(tsum, x - 1, -1):  # reverse to avoid reusing the same item
            if a[i - x] and a[i] == 0:    # we can form sum i with item x and existing sum i-x
                a[i] = x                  # remember the last item for given sum
    if not a[tsum]:                       # no combination exists
        return None
    res = []
    idx = tsum
    while idx:
        res.append(a[idx])
        idx -= a[idx]
    return res

print(subsum(21, [2, 3, 5, 7, 11]))
# a valid combination summing to 21, e.g. [11, 7, 3]
When the last cell is nonzero, a combination exists and we can retrieve the items; the guard above returns None otherwise.
Complexity is O(target_sum * len(numbers)).
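For the instance in the question, the DP table has only 617 entries and the double loop does about 616 * 60 updates, so this runs essentially instantly. Reusing subsum and the numbers list defined above (the particular combination returned depends on the order of the items):
print(subsum(616, numbers))   # some sublist of numbers summing to 616, or None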

How to divide a list of negative and positive numbers into the largest number of subsets whose sum is 0?

I am trying to solve this problem but I can't manage to figure out how.
Let's suppose I have a list of positive and negative numbers whose sum is guaranteed to be 0.
[-10, 1, 2, 20, 5, -100, -80, 10, 15, 15, 60, 100, -20, -18]
I want to obtain a list with the largest number of subsets, using all the elements of the initial list only once, where each subset sums to 0.
So in the case of this simple input:
[-5, -4, 5, 2, 3, -1]
The best results that can be obtained are:
1. [[-5, 5], [-4, -1, 2, 3]] # 2 subsets
2. [[-5, 2, 3], [-4, -1, 5]] # 2 subsets
These, for example, would be totally wrong answers:
1. [[-5, -4, -1, 2, 3, 5]] # 1 subset, which is just the initial list, NO
2. [[-5, 5]] # 1 subset, and not all elements are used, NO
Even if it's NP-complete, how can I manage to solve it, even with a brute-force approach? I just need a solution for small lists of numbers.
def get_subsets(lst):
    N = len(lst)
    cand = []
    dp = [0 for x in range(1 << N)]  # maximum number of subsets using elements represented by bitset
    back = [0 for x in range(1 << N)]
    # Section 1
    for i in range(1, 1 << N):
        cur = 0
        for j in range(N):
            if i & (1 << j):
                cur += lst[j]
        if not cur:
            cand.append(i)  # if subset sums to 0, it's viable
    dp[0] = 1
    # Section 2
    for i in range(1 << N):
        while cand and cand[0] <= i:
            cand.pop(0)
        if not dp[i]:
            continue
        for j in cand:
            if i & j:  # if subsets intersect, it cannot be added
                continue
            if dp[i] + 1 > dp[i | j]:
                back[i | j] = j
                dp[i | j] = dp[i] + 1
    # Section 3
    ind = dp.index(max(dp))
    res = []
    while back[ind]:
        cur = []
        for i in range(N):
            if back[ind] & (1 << i):
                cur.append(lst[i])
        res.append(cur)
        ind = ind ^ back[ind]
    return res

print(get_subsets([-5, -4, 5, 2, 3, -1]))
Basically, this solution collects all subsets of the original list that can sum to zero, then attempts to merge as many of them together as possible without colliding. It runs in worst-case O(2^{2N}) time, where N is the length of the list, but it should hit an average case of around O(2^N), since there typically shouldn't be too many subsets summing to 0.
EDIT: I added sections to facilitate explanation of the algorithm
Section 1: I iterate through all possible 2^N-1 nonempty subsets of the original list, and check which of these subsets sum to 0; any viable zero-sum subsets are added to the list cand (represented as an integer in the range [1,2^N-1] with bits set at the indices making up the subset).
Section 2: dp is a dynamic programming table storing the maximum number of subsets summing to 0 that can be formed using the subset represented by the integer i at dp[i]. Initially, all entries of dp are set to 0 except dp[0] = 1, since the empty set has a sum of 0. Then I iterate through each subset from 0 to 2^N-1, and I run through the list of candidate subsets and attempt to merge the two subsets.
Section 3: This is just backtracking to find the answer: while filling in dp, I also kept an array back that stores the most recent subset added to achieve the subset i at back[i]. So I find the subset that maximizes the number of sub-subsets summing to 0 with ind = dp.index(max(dp)), and then I backtrack from there, shrinking the subset by removing the most recently added subset until I finally arrive back to the empty set.
This problem is NP-complete, since it is a combination of two NP-complete problems:
finding a single subset whose sum is 0 is known as the subset sum problem
when you find all the subsets whose sum is 0, you have to solve an exact cover problem with a special condition: you want to maximize the number of subsets.
The following steps will provide a solution:
use dynamic programming to find the subsets whose sum is 0 (https://en.wikipedia.org/wiki/Subset_sum_problem#Pseudo-polynomial_time_dynamic_programming_solution)
to maximize the number of subsets, one would use D. Knuth's Algorithm X to find the exact cover.
A few remarks:
First, we know that there is an exact cover because the list of numbers has a sum of 0.
Second, we can use only the subsets that are not supersets of any other subset. Because, if A is a superset of X (both sum to 0), A can't be in the cover that has the largest number of subsets. Let A, B, C, ... be the cover with the maximum number of subsets, then we can replace A by X and A\X (it is trivial to see that the sum of A\X elements is 0) and we get the cover X, A\X, B, C, ... that is better.
Third, when we use Algorithm X, all paths in the search tree will lead to a success. Let A, B, C, ... be a path composed of non-overlapping subsets, each having a sum of 0. Then the complement also has a sum of 0 (it may be a superset of another subset, in which case we use remark 2).
As you see, nothing new here, and I will use only well known techniques/algorithms.
Find the subsets having a sum of 0.
The algorithm is well known. Here's a Python implementation based on the Wikipedia explanation:
class Q:
    def __init__(self, values):
        self.len = len(values)
        self.min = sum(e for e in values if e <= 0)
        self.max = sum(e for e in values if e >= 0)
        self._arr = [False] * self.len * (self.max - self.min + 1)

    def __getitem__(self, item):
        index, v = item
        # negative sums v rely on Python's negative-index wraparound;
        # the negative and non-negative regions of _arr do not overlap
        return self._arr[v * self.len + index]

    def __setitem__(self, item, value):
        index, v = item
        self._arr[v * self.len + index] = value

class SubsetSum:
    def __init__(self, values):
        self._values = values
        self._q = Q(values)

    def prepare(self):
        for s in range(self._q.min, self._q.max + 1):
            self._q[0, s] = (self._values[0] == s)
        for i in range(self._q.len):
            self._q[i, 0] = True
        for i in range(1, self._q.len):
            v = self._values[i]
            for s in range(self._q.min, self._q.max + 1):
                self._q[i, s] = (v == s) or self._q[i - 1, s] or self._q[i - 1, s - v]

    def subsets(self, target=0):
        yield from self._subsets(self._q.len - 1, target, [])

    def _subsets(self, i, target, p):
        assert i >= 0
        v = self._values[i]
        if i == 0:
            if target == 0:
                if p:
                    yield p
            elif self._q[0, target]:
                yield p + [i]
        else:
            if self._q.min <= target - v <= self._q.max and self._q[i - 1, target - v]:
                yield from self._subsets(i - 1, target - v, p + [i])
            if self._q[i - 1, target]:
                yield from self._subsets(i - 1, target, p)
Here's how it works:
arr = [-10, 1, 2, 20, 5, -100, -80, 10, 15, 15, 60, 100, -20, -18]
arr = sorted(arr)
s = SubsetSum(arr)
s.prepare()
subsets0 = list(s.subsets())
print(subsets0)
Output:
[[13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0], [13, 12, 11, 10, 9, 7, 6, 5, 3, 2, 1, 0], [13, 12, 11, 10, 9, 4, 2, 1, 0], [13, 12, 11, 10, 8, 7, 4, 2, 1, 0], [13, 12, 11, 10, 8, 6, 5, 4, 3, 1, 0], [13, 12, 11, 10, 7, 2, 1, 0], [13, 12, 11, 10, 6, 5, 3, 1, 0], [13, 12, 11, 9, 8, 7, 4, 2, 1, 0], [13, 12, 11, 9, 8, 6, 5, 4, 3, 1, 0], [13, 12, 11, 9, 7, 2, 1, 0], [13, 12, 11, 9, 6, 5, 3, 1, 0], [13, 12, 11, 8, 7, 6, 5, 3, 1, 0], [13, 12, 11, 8, 4, 1, 0], [13, 12, 11, 1, 0], [13, 12, 10, 9, 8, 7, 6, 5, 4, 3, 1, 0], [13, 12, 10, 9, 8, 2, 1, 0], [13, 12, 10, 9, 7, 6, 5, 3, 1, 0], [13, 12, 10, 9, 4, 1, 0], [13, 12, 10, 8, 7, 4, 1, 0], [13, 12, 10, 7, 1, 0], [13, 12, 9, 8, 7, 4, 1, 0], [13, 12, 9, 7, 1, 0], [13, 11, 10, 8, 6, 5, 4, 3, 2, 0], [13, 11, 10, 6, 5, 3, 2, 0], [13, 11, 9, 8, 6, 5, 4, 3, 2, 0], [13, 11, 9, 6, 5, 3, 2, 0], [13, 11, 8, 7, 6, 5, 3, 2, 0], [13, 11, 8, 4, 2, 0], [13, 11, 7, 6, 5, 4, 3, 2, 1], [13, 11, 7, 6, 5, 4, 3, 0], [13, 11, 2, 0], [13, 10, 9, 8, 7, 6, 5, 4, 3, 2, 0], [13, 10, 9, 7, 6, 5, 3, 2, 0], [13, 10, 9, 4, 2, 0], [13, 10, 8, 7, 4, 2, 0], [13, 10, 8, 6, 5, 4, 3, 2, 1], [13, 10, 8, 6, 5, 4, 3, 0], [13, 10, 7, 2, 0], [13, 10, 6, 5, 3, 2, 1], [13, 10, 6, 5, 3, 0], [13, 9, 8, 7, 4, 2, 0], [13, 9, 8, 6, 5, 4, 3, 2, 1], [13, 9, 8, 6, 5, 4, 3, 0], [13, 9, 7, 2, 0], [13, 9, 6, 5, 3, 2, 1], [13, 9, 6, 5, 3, 0], [13, 8, 7, 6, 5, 3, 2, 1], [13, 8, 7, 6, 5, 3, 0], [13, 8, 4, 2, 1], [13, 8, 4, 0], [13, 7, 6, 5, 4, 3, 1], [13, 2, 1], [13, 0], [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1], [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 0], [12, 11, 10, 9, 8, 2, 0], [12, 11, 10, 9, 7, 6, 5, 3, 2, 1], [12, 11, 10, 9, 7, 6, 5, 3, 0], [12, 11, 10, 9, 4, 2, 1], [12, 11, 10, 9, 4, 0], [12, 11, 10, 8, 7, 4, 2, 1], [12, 11, 10, 8, 7, 4, 0], [12, 11, 10, 8, 6, 5, 4, 3, 1], [12, 11, 10, 7, 2, 1], [12, 11, 10, 7, 0], [12, 11, 10, 6, 5, 3, 1], [12, 11, 9, 8, 7, 4, 2, 1], [12, 11, 9, 8, 7, 4, 0], [12, 11, 9, 8, 6, 5, 4, 3, 1], [12, 11, 9, 7, 2, 1], [12, 11, 9, 7, 0], [12, 11, 9, 6, 5, 3, 1], [12, 11, 8, 7, 6, 5, 3, 1], [12, 11, 8, 4, 1], [12, 11, 1], [12, 10, 9, 8, 7, 6, 5, 4, 3, 1], [12, 10, 9, 8, 2, 1], [12, 10, 9, 8, 0], [12, 10, 9, 7, 6, 5, 3, 1], [12, 10, 9, 4, 1], [12, 10, 8, 7, 4, 1], [12, 10, 7, 1], [12, 9, 8, 7, 4, 1], [12, 9, 7, 1], [11, 10, 8, 6, 5, 4, 3, 2], [11, 10, 6, 5, 3, 2], [11, 9, 8, 6, 5, 4, 3, 2], [11, 9, 6, 5, 3, 2], [11, 8, 7, 6, 5, 3, 2], [11, 8, 4, 2], [11, 7, 6, 5, 4, 3], [11, 2], [10, 9, 8, 7, 6, 5, 4, 3, 2], [10, 9, 7, 6, 5, 3, 2], [10, 9, 4, 2], [10, 8, 7, 4, 2], [10, 8, 6, 5, 4, 3], [10, 7, 2], [10, 6, 5, 3], [9, 8, 7, 4, 2], [9, 8, 6, 5, 4, 3], [9, 7, 2], [9, 6, 5, 3], [8, 7, 6, 5, 3], [8, 4]]
Reduce the number of subsets
We have 105 subsets that sum to 0, but we can remove those that are supersets of other subsets. We need a function that tells whether a list contains all the elements of another list. In Python:
import collections

def contains(l1, l2):
    """
    Does l1 contain all elements of l2?
    """
    c = collections.Counter(l1)
    for e in l2:
        c[e] -= 1
    return all(n >= 0 for n in c.values())
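A quick sanity check showing that the Counter respects multiplicities:
print(contains([1, 2, 2, 3], [2, 3]))   # True
print(contains([1, 2, 3], [2, 2]))      # False: l1 has only one 2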
Now, we can remove the subsets that are supersets of another subset.
def remove_supersets(subsets):
    subsets = sorted(subsets, key=len)
    new_subsets = []
    for i, s1 in enumerate(subsets):
        for s2 in subsets[:i]:  # smaller subsets
            if contains(s1, s2):
                break
        else:  # not a superset
            new_subsets.append(s1)
    return new_subsets
In our situation:
subsets0 = remove_supersets(subsets0)
print(subsets0)
Output:
[[13, 0], [11, 2], [8, 4], [13, 2, 1], [12, 11, 1], [10, 7, 2], [9, 7, 2], [12, 10, 7, 1], [12, 9, 7, 1], [10, 9, 4, 2], [10, 6, 5, 3], [9, 6, 5, 3], [12, 11, 10, 7, 0], [12, 11, 9, 7, 0], [12, 10, 9, 8, 0], [12, 10, 9, 4, 1], [8, 7, 6, 5, 3], [12, 11, 10, 9, 4, 0], [12, 10, 9, 8, 2, 1], [11, 7, 6, 5, 4, 3], [13, 7, 6, 5, 4, 3, 1]]
We managed to reduce the number of subsets to 21, which is a good improvement since we need to explore all possibilities to find an exact cover.
Algorithm X
I do not use dancing links here (I think that technique is well suited to low-level languages like C, but you can implement it in Python if you want). We just need to keep track of the remaining subsets:
class Matrix:
    def __init__(self, subsets, ignore_indices=set()):
        self._subsets = subsets
        self._ignore_indices = ignore_indices

    def subset_values(self, i):
        assert i not in self._ignore_indices
        return self._subsets[i]

    def value_subsets_indices(self, j):
        return [i for i, s in self._subsets_generator() if j in s]

    def _subsets_generator(self):
        return ((i, s) for i, s in enumerate(self._subsets)
                if i not in self._ignore_indices)

    def rarest_value(self):
        c = collections.Counter(
            j for _, s in self._subsets_generator() for j in s)
        return c.most_common()[-1][0]

    def take_subset(self, i):
        s = self._subsets[i]
        to_ignore = {i2 for i2, s2 in self._subsets_generator()
                     if set(s2) & set(s)}
        return Matrix(self._subsets,
                      self._ignore_indices | to_ignore)

    def __bool__(self):
        return bool(list(self._subsets_generator()))
And finally the cover function:
def cover(m, t=[]):
    if m:  # m is not empty
        j = m.rarest_value()
        for i in m.value_subsets_indices(j):
            m2 = m.take_subset(i)
            yield from cover(m2, t + [i])
    else:
        yield t
Finally, we have:
m = Matrix(subsets0)
ts = list(cover(m))
t = max(ts, key=len)
print([[arr[j] for j in subsets0[i]] for i in t])
Output:
[[100, -100], [10, -10], [15, 2, 1, -18], [15, 5, -20], [60, 20, -80]]
Below is essentially the same idea as Michael Huang's answer, with 30 more lines...
A solution with cliques
We can prebuild all subsets whose sum is 0.
Build subsets of 1 element;
then subsets of size 2 by reusing the previous ones,
keeping those whose sum is zero along the way.
Now say each such subset is a node of a graph.
A node is related to another one iff their associated subsets have no number in common.
We thus want to build the maximum clique of the graph:
In a clique, all nodes are pairwise related, i.e., their subsets are disjoint
The maximum clique gives us the maximal number of subsets
function forall (v, reduce) {
  const nexts = v.map((el, i) => ({ v: [el], i, s: el })).reverse()
  while (nexts.length) {
    const next = nexts.pop()
    for (let i = next.i + 1; i < v.length; ++i) {
      const { s, skip } = reduce(next, v[i])
      if (!skip) {
        nexts.push({ v: next.v.concat(v[i]), s: s, i })
      }
    }
  }
}

function buildSubsets (numbers) {
  const sums = []
  forall(numbers, (next, el) => {
    const s = next.s + el
    if (s === 0) {
      sums.push({ s, v: next.v.concat(el) })
      return { s, skip: true }
    }
    return { s }
  })
  return sums
}

const bin2decs = bin => {
  const v = []
  const s = bin.toString(2)
  for (let i = 0; i < s.length; ++i) {
    if (intersects(dec2bin(i), bin)) {
      v.push(i)
    }
  }
  return v
}
const dec2bin = dec => Math.pow(2, dec)
const decs2bin = decs => decs.reduce((bin, dec) => union(dec2bin(dec), bin), 0)

// Set methods on int
const isIn = (a, b) => (a & b) === a
const intersects = (a, b) => a & b
const union = (a, b) => a | b

// if a subset contains another one, discard it
// e.g. [1,2,4] should be discarded if [1,2] is present
const cleanSubsets = bins => bins.filter(big => bins.every(b => big === b || !isIn(b, big)))

function bestClique (decs) {
  const cliques = []
  forall(decs, (next, el) => {
    if (intersects(next.s, el)) { return { skip: true } }
    const s = union(next.s, el)
    cliques.push({ s, v: next.v.concat(el) })
    return { s }
  })
  return cliques.sort((a, b) => b.v.length - a.v.length)[0]
}

// in case we have duplicated numbers in the list,
// they are still unique thanks to their id: i (i.e., their position in the list)
const buildNumbers = v => v.map((n, i) => {
  const u = new Number(n)
  u.i = i
  return u
})

function run (v) {
  const numbers = buildNumbers(v)
  const subs = buildSubsets(numbers)
  const bins = subs.map(s => decs2bin(s.v.map(n => n.i)))
  const clique = bestClique(cleanSubsets(bins))
  const indexedSubs = clique.v.map(bin2decs)
  const subsets = indexedSubs.map(sub => sub.map(i => numbers[i].valueOf()))
  console.log('subsets', JSON.stringify(subsets))
}

run([1, -1, 2, -2])
run([-10, 1, 2, 20, 5, -100, -80, 10, 15, 15, 60, 100, -20, -18, 10, -10])
run([-5, -4, 5, 2, 3, -1])

Find Top N Most Frequent Sequence of Numbers in List of a Billion Sequences

Let's say I have the following list of lists:
x = [[1, 2, 3, 4, 5, 6, 7],               # sequence 1
     [6, 5, 10, 11],                      # sequence 2
     [9, 8, 2, 3, 4, 5],                  # sequence 3
     [12, 12, 6, 5],                      # sequence 4
     [5, 8, 3, 4, 2],                     # sequence 5
     [1, 5],                              # sequence 6
     [2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6],  # sequence 7
     [7, 1, 7, 3, 4, 1, 2],               # sequence 8
     [9, 4, 12, 12, 6, 5, 1],             # sequence 9
    ]
Essentially, for any list that contains the target number 5 (i.e., target=5) anywhere within the list, what are the top N=2 most frequently observed subsequences with length M=4?
So, the conditions are:
if target doesn't exist in the list then we ignore that list completely
if the list length is less than M then we ignore the list completely
if the list is exactly length M but target is not in the Mth position then we ignore it (but we count it if target is in the Mth position)
if the list length, L, is longer than M and target is in position i = M (or i = M+1, or i = M+2, ..., or i = L), then we count the length-M subsequence where target is in the final position of the subsequence
So, using our list-of-lists example, we'd count the following subsequences:
subseqs = [[2, 3, 4, 5],    # taken from sequence 1
           [2, 3, 4, 5],    # taken from sequence 3
           [12, 12, 6, 5],  # taken from sequence 4
           [8, 8, 3, 5],    # taken from sequence 7
           [1, 4, 12, 5],   # taken from sequence 7
           [12, 12, 6, 5],  # taken from sequence 9
          ]
Of course, what we want are the top N=2 subsequences by frequency. So, [2, 3, 4, 5] and [12, 12, 6, 5] are the top two most frequent sequences by count. If N=3 then all of the subsequences (subseqs) would be returned since there is a tie for third.
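To pin down these semantics before worrying about scale, here is a direct brute-force pass (fine for the toy example, hopeless for billions of lists) that reproduces subseqs and the counts, reusing x from above:
from collections import Counter

def count_subseqs(seqs, target, M):
    counts = Counter()
    for seq in seqs:
        if len(seq) < M or target not in seq:
            continue
        for i, v in enumerate(seq):
            # count the length-M window ending at each occurrence of target
            if v == target and i + 1 >= M:
                counts[tuple(seq[i + 1 - M:i + 1])] += 1
    return counts

print(count_subseqs(x, target=5, M=4).most_common(2))
# [((2, 3, 4, 5), 2), ((12, 12, 6, 5), 2)]
# note: most_common(N) does not expand ties; the N=3 case above needs extra handling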
Important
This is super simplified but, in reality, my actual list-of-sequences:
consists of a few billion lists of positive integers (between 1 and 10,000)
each list can be as short as 1 element or as long as 500 elements
N and M can be as small as 1 or as big as 100
My questions are:
Is there an efficient data structure that would allow for fast queries assuming that N and M will always be less than 100?
Are there known algorithms for performing this kind of analysis for various combinations of N and M? I've looked at suffix trees but I'd have to roll my own custom version to even get close to what I need.
For the same dataset, I need to repeatedly query it for various values and combinations of target, N, and M (where target <= 10,000, N <= 100, and M <= 100). How can I do this efficiently?
Extending on my comment, here is a sketch of how you could approach this using an out-of-the-box suffix array:
1) reverse and concatenate your lists with a stop symbol (I used 0 here).
[7, 6, 5, 4, 3, 2, 1, 0, 11, 10, 5, 6, 0, 5, 4, 3, 2, 8, 9, 0, 5, 6, 12, 12, 0, 2, 4, 3, 8, 5, 0, 5, 1, 0, 6, 5, 12, 4, 1, 9, 5, 3, 8, 8, 2, 0, 2, 1, 4, 3, 7, 1, 7, 0, 1, 5, 6, 12, 12, 4, 9]
2) Build a suffix array
[53, 45, 24, 30, 12, 19, 33, 7, 32, 6, 47, 54, 51, 38, 44, 5, 46, 25, 16, 4, 15, 49, 27, 41, 37, 3, 14, 48, 26, 59, 29, 31, 40, 2, 13, 10, 20, 55, 35, 11, 1, 34, 21, 56, 52, 50, 0, 43, 28, 42, 17, 18, 39, 60, 9, 8, 23, 36, 58, 22, 57]
3) Build the LCP array. The LCP array will tell you how many numbers a suffix has in common with its neighbour in the suffix array. However, you need to stop counting when you encounter a stop symbol.
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 2, 1, 1, 0, 2, 1, 1, 2, 0, 1, 3, 2, 2, 1, 0, 1, 1, 1, 4, 1, 2, 4, 1, 0, 1, 2, 1, 3, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 2, 1, 2, 0]
4) When a query comes in (target = 5, M = 4), you search for the first occurrence of your target in the suffix array and scan the corresponding LCP array until the starting number of the suffixes changes. Below is the part of the LCP array that corresponds to all suffixes starting with 5.
[..., 1, 1, 1, 4, 1, 2, 4, 1, 0, ...]
This tells you that there are two sequences of length 4 that each occur twice. Brushing over some details, using the indexes you can find the sequences and reverse them back to get your final results.
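As a concrete (if naive) illustration of steps 1 to 3, here is a Python sketch that builds the suffix array by plain sorting; that is O(n^2 log n) and only suitable for toy inputs, whereas a production version would use a linear-time construction such as SA-IS. It reuses x from the question:
# 1) reverse each list and join them with the stop symbol 0
s = []
for seq in x:
    s.extend(reversed(seq))
    s.append(0)
s.pop()  # no stop symbol after the last list

# 2) suffix array: start indices of all suffixes, in sorted order
sa = sorted(range(len(s)), key=lambda i: s[i:])

# 3) LCP with the previous suffix, cutting the count at the stop symbol
def lcp(a, b):
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k] != 0:
        k += 1
    return k

lcp_arr = [0] + [lcp(s[sa[i - 1]:], s[sa[i]:]) for i in range(1, len(sa))]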
Complexity
Building the suffix array is O(n) time and O(n) space, where n is the total number of elements in all lists
Building the LCP array is also O(n) in both time and space
Searching for a target number in the suffix array is O(log n) on average
The cost of scanning through the relevant subsequences is linear in the number of times the target occurs, which should be about 1/10,000 of all elements on average given your parameters
The first two steps happen offline. Querying is technically O(n) (due to step 4) but with a small constant (0.0001).

Numpy select different number of first elements from each numpy array row

I want to select a different number of leading elements from each row of a matrix.
The counts are specified in an array, and the result is a one-dimensional array.
For example:
a = np.arange(25).reshape([5, 5])
numbers = np.array([3, 2, 0, 1, 2])
And I want this result:
[0, 1, 2, 5, 6, 15, 20, 21]
without a for loop.
Let's use some NumPy broadcasting magic!
a[numbers[:,None] > np.arange(a.shape[1])]
Sample run -
In [161]: a
Out[161]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])
In [162]: numbers
Out[162]: array([3, 2, 0, 1, 2])
In [163]: numbers[:,None] > np.arange(a.shape[1]) # Mask to select elems
Out[163]:
array([[ True,  True,  True, False, False],
       [ True,  True, False, False, False],
       [False, False, False, False, False],
       [ True, False, False, False, False],
       [ True,  True, False, False, False]], dtype=bool)
In [164]: a[numbers[:,None] > np.arange(a.shape[1])] # Select w/ boolean indexing
Out[164]: array([ 0, 1, 2, 5, 6, 15, 20, 21])
