Torch code producing CUDA Runtime Error - runtime

A friend of mine implemented a sparse version of torch.bmm that actually works, but when I run a test I get a runtime error (that has nothing to do with his implementation) that I don't understand. I have seen a few topics about it but couldn't find a solution. Here is the code, and the error:
if __name__ == "__main__":
    tmp = torch.zeros(1).cuda()
    batch_csr = BatchCSR()
    sparse_bmm = SparseBMM()

    i = torch.LongTensor([[0, 5, 8], [1, 5, 8], [2, 5, 8]])
    v = torch.FloatTensor([4, 3, 8])
    s = torch.Size([3, 500, 500])

    indices, values, size = i, v, s

    a_ = torch.sparse.FloatTensor(indices, values, size).cuda().transpose(2, 1)
    batch_size, num_nodes, num_faces = a_.size()

    a = a_.to_dense()

    for _ in range(10):
        b = torch.randn(batch_size, num_faces, 16).cuda()

        torch.cuda.synchronize()
        time1 = time.time()
        result = torch.bmm(a, b)
        torch.cuda.synchronize()
        time2 = time.time()
        print("{} CuBlas dense bmm".format(time2 - time1))

        torch.cuda.synchronize()
        time1 = time.time()
        col_ind, col_ptr = batch_csr(a_.indices(), a_.size())
        my_result = sparse_bmm(a_.values(), col_ind, col_ptr, a_.size(), b)
        torch.cuda.synchronize()
        time2 = time.time()
        print("{} My sparse bmm".format(time2 - time1))

        print("{} Diff".format((result - my_result).abs().max()))
And the error:
Traceback (most recent call last):
  File "sparse_bmm.py", line 72, in <module>
    b = torch.randn(3, 500, 16).cuda()
  File "/home/bizeul/virtual_env/lib/python2.7/site-packages/torch/_utils.py", line 65, in _cuda
    return new_type(self.size()).copy_(self, async)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /b/wheel/pytorch-src/torch/lib/THC/generic/THCTensorCopy.c:18
When running with the command CUDA_LAUNCH_BLOCKING=1, I get the error:
/b/wheel/pytorch-src/torch/lib/THC/THCTensorIndex.cu:121: void indexAddSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 1, SrcDim = 1, IdxDim = -2]: block: [0,0,0], thread: [0,0,0] Assertion `dstIndex < dstAddDimSize` failed.
THCudaCheck FAIL file=/b/wheel/pytorch-src/torch/lib/THCS/generic/THCSTensorMath.cu line=292 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "sparse_bmm.py", line 69, in <module>
    a = a_.to_dense()
RuntimeError: cuda runtime error (59) : device-side assert triggered at /b/wheel/pytorch-src/torch/lib/THCS/generic/THCSTensorMath.cu:292

The indices that you are passing to create the sparse tensor are incorrect. Here is how it should be:
i = torch.LongTensor([[0, 1, 2], [5, 5, 5], [8, 8, 8]])
How to create a sparse tensor:
Let's take a simpler example. Let's say we want the following tensor:
0 0 0 2 0
0 0 0 0 0
0 0 0 0 20
[torch.cuda.FloatTensor of size 3x5 (GPU 0)]
As you can see, the number (2) needs to be in the (0, 3) location of the sparse tensor. And the number (20) needs to be in the (2, 4) location.
In order to create this, our index tensor should look like this (the first row holds the row coordinates of the nonzero values, the second row holds their column coordinates):
[[0 , 2],
[3 , 4]]
And now for the code to create the above sparse tensor:
i = torch.LongTensor([[0, 2], [3, 4]])
v = torch.FloatTensor([2, 20])
s = torch.Size([3, 5])
a_ = torch.sparse.FloatTensor(i, v, s).cuda()
More comments regarding the assert error raised by CUDA:
Assertion 'dstIndex < dstAddDimSize' failed tells us that, very likely, you have an index out of bounds. So whenever you notice that, look for places where you might have supplied the wrong indices to any of the tensors.
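As a quick sanity check, here is my own untested sketch (not part of the original answer) of the corrected index layout dropped into the question's example; with every index inside the 3x500x500 bounds, to_dense() should no longer trip the device-side assert:

i = torch.LongTensor([[0, 1, 2], [5, 5, 5], [8, 8, 8]])  # one column per nonzero value
v = torch.FloatTensor([4, 3, 8])
s = torch.Size([3, 500, 500])

a_ = torch.sparse.FloatTensor(i, v, s).cuda().transpose(2, 1)
a = a_.to_dense()          # no device-side assert: every index is within bounds
print(a.sum())             # ==> 15 (= 4 + 3 + 8), so all three values landed in the dense tensor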

Related

Algorithm to find some rows from a matrix, whose sum is equal to a given row

For example, here is a matrix:
[1, 0, 0, 0],
[1, 1, 0, 0],
[1, 0, 1, 0],
[1, 1, 1, 0],
[1, 1, 1, 1],
I want to find some rows, whose sum is equal to [4, 3, 2, 1].
The expected answer is rows: {0,1,3,4}.
Because:
[1, 0, 0, 0] + [1, 1, 0, 0] + [1, 1, 1, 0] + [1, 1, 1, 1] = [4, 3, 2, 1]
Is there some well-known or related algorithm to solve this problem?
Thanks to @sascha and @N. Wouda for the comments.
To clarify, here are some more details.
In my problem the matrix will have about 50 rows and 25 columns, but each row will have fewer than 4 nonzero elements (the rest are zero), and every solution uses 8 rows.
If I try all combinations, C(50, 8) is about 0.55 billion attempts. Too expensive. So I want to find a more efficient algorithm.
If you want to make the jump to using a solver, I'd recommend it. This is a pretty straightforward Integer Program. The solutions below use Python, Python's pyomo math-programming package to formulate the problem, and COIN-OR's cbc solver for Integer Programs and Mixed Integer Programs, which needs to be installed separately (freeware), available here: https://www.coin-or.org/downloading/
Here is an example with your data, followed by an example with 100,000 rows. The first example solves instantly; the 100,000-row example takes about 2 seconds on my machine.
# row selection Integer Program
import pyomo.environ as pyo

data1 = [[1, 0, 0, 0],
         [1, 1, 0, 0],
         [1, 0, 1, 0],
         [1, 1, 1, 0],
         [1, 1, 1, 1],]

data_dict = {(i, j): data1[i][j] for i in range(len(data1)) for j in range(len(data1[0]))}

model = pyo.ConcreteModel()

# sets
model.I = pyo.Set(initialize=range(len(data1)))     # a simple row index
model.J = pyo.Set(initialize=range(len(data1[0])))  # a simple column index

# parameters
model.matrix = pyo.Param(model.I, model.J, initialize=data_dict)  # hold the sparse matrix of values

magic_sum = [4, 3, 2, 1]

# variables
model.row_select = pyo.Var(model.I, domain=pyo.Boolean)  # row selection variable

# constraints
# ensure the columnar sum is at least the magic sum for all j
def min_sum(model, j):
    return sum(model.row_select[i] * model.matrix[(i, j)] for i in model.I) >= magic_sum[j]
model.c1 = pyo.Constraint(model.J, rule=min_sum)

# objective function
# minimize the overage
def objective(model):
    delta = 0
    for j in model.J:
        delta += sum(model.row_select[i] * model.matrix[i, j] for i in model.I) - magic_sum[j]
    return delta
model.OBJ = pyo.Objective(rule=objective)

model.pprint()  # verify everything

solver = pyo.SolverFactory('cbc')  # need to have cbc solver installed
result = solver.solve(model)
result.write()              # solver details
model.row_select.display()  # output
Output:
# ----------------------------------------------------------
#   Solver Information
# ----------------------------------------------------------
Solver:
- Status: ok
  User time: -1.0
  System time: 0.0
  Wallclock time: 0.0
  Termination condition: optimal
  Termination message: Model was solved to optimality (subject to tolerances), and an optimal solution is available.
  Statistics:
    Branch and bound:
      Number of bounded subproblems: 0
      Number of created subproblems: 0
    Black box:
      Number of iterations: 0
  Error rc: 0
  Time: 0.01792597770690918
# ----------------------------------------------------------
#   Solution Information
# ----------------------------------------------------------
Solution:
- number of solutions: 0
  number of solutions displayed: 0

row_select : Size=5, Index=I
    Key : Lower : Value : Upper : Fixed : Stale : Domain
      0 :     0 :   1.0 :     1 : False : False : Boolean
      1 :     0 :   1.0 :     1 : False : False : Boolean
      2 :     0 :   0.0 :     1 : False : False : Boolean
      3 :     0 :   1.0 :     1 : False : False : Boolean
      4 :     0 :   1.0 :     1 : False : False : Boolean
A more stressful rendition with 100,000 rows:
# row selection Integer Program stress test
import pyomo.environ as pyo
import numpy as np

# make a large matrix 100,000 x 8
data1 = np.random.randint(0, 1000, size=(100_000, 8))
# inject "the right answer" into 3 rows
data1[42602] = [8, 0, 0, 0, 0, 0, 0, 0]
data1[3]     = [0, 0, 0, 0, 4, 3, 2, 1]
data1[10986] = [0, 7, 6, 5, 0, 0, 0, 0]

data_dict = {(i, j): data1[i][j] for i in range(len(data1)) for j in range(len(data1[0]))}

model = pyo.ConcreteModel()

# sets
model.I = pyo.Set(initialize=range(len(data1)))     # a simple row index
model.J = pyo.Set(initialize=range(len(data1[0])))  # a simple column index

# parameters
model.matrix = pyo.Param(model.I, model.J, initialize=data_dict)  # hold the sparse matrix of values

magic_sum = [8, 7, 6, 5, 4, 3, 2, 1]

# variables
model.row_select = pyo.Var(model.I, domain=pyo.Boolean)  # row selection variable

# constraints
# ensure the columnar sum is at least the magic sum for all j
def min_sum(model, j):
    return sum(model.row_select[i] * model.matrix[(i, j)] for i in model.I) >= magic_sum[j]
model.c1 = pyo.Constraint(model.J, rule=min_sum)

# objective function
# minimize the overage
def objective(model):
    delta = 0
    for j in model.J:
        delta += sum(model.row_select[i] * model.matrix[i, j] for i in model.I) - magic_sum[j]
    return delta
model.OBJ = pyo.Objective(rule=objective)

solver = pyo.SolverFactory('cbc')
result = solver.solve(model)
result.write()

print('\n\n======== row selections =======')
for i in model.I:
    if model.row_select[i].value > 0:
        print(f'row {i} selected')
Output:
# ----------------------------------------------------------
#   Solver Information
# ----------------------------------------------------------
Solver:
- Status: ok
  User time: -1.0
  System time: 2.18
  Wallclock time: 2.61
  Termination condition: optimal
  Termination message: Model was solved to optimality (subject to tolerances), and an optimal solution is available.
  Statistics:
    Branch and bound:
      Number of bounded subproblems: 0
      Number of created subproblems: 0
    Black box:
      Number of iterations: 0
  Error rc: 0
  Time: 2.800779104232788
# ----------------------------------------------------------
#   Solution Information
# ----------------------------------------------------------
Solution:
- number of solutions: 0
  number of solutions displayed: 0

======== row selections =======
row 3 selected
row 10986 selected
row 42602 selected
This one picks or skips an element (recursively). As soon as the subtree is impossible to solve (no elements left, or any target value negative) it returns false. Once the sum of the target reaches 0, a solution has been found and is returned in the form of the picked indices.
Feel free to add the time and memory complexity in the comments. Worst case should be 2^(n+1).
Please let me know how it performs on your 8/50 data.
const elements = [
  [1, 0, 0, 0],
  [1, 1, 0, 0],
  [1, 0, 1, 0],
  [1, 1, 1, 0],
  [1, 1, 1, 1]
];
const target = [4, 3, 2, 1];

let iterations = 0;
console.log(iter(elements, target, [], 0));
console.log(`Iterations: ${iterations}`);

function iter(elements, target, picked, index) {
  iterations++;
  const sum = target.reduce(function(sum, element) {
    return sum + element;
  });
  if (sum === 0) return picked;
  if (elements.length === 0) return false;

  const result = iter(
    removeElement(elements, 0),
    target,
    picked,
    index + 1
  );
  if (result !== false) return result;

  const newTarget = matrixSubtract(target, elements[0]);
  const hasNegatives = newTarget.some(function(element) {
    return element < 0;
  });
  if (hasNegatives) return false;

  return iter(
    removeElement(elements, 0),
    newTarget,
    picked.concat(index),
    index + 1
  );
}

function removeElement(target, i) {
  return target.slice(0, i).concat(target.slice(i + 1));
}

function matrixSubtract(minuend, subtrahend) {
  let i = 0;
  return minuend.map(function(element) {
    return minuend[i] - subtrahend[i++];
  });
}
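If it helps to compare with the Python answers elsewhere on this page, here is a rough Python translation of the same pick-or-skip recursion (my own sketch, untested on the 8/50 data):

def iter_pick(elements, target, picked, index):
    # Solution found: every column of the target has been reduced to zero.
    if sum(target) == 0:
        return picked
    # Dead end: no rows left to try.
    if not elements:
        return False
    # Branch 1: skip the first remaining row.
    result = iter_pick(elements[1:], target, picked, index + 1)
    if result is not False:
        return result
    # Branch 2: pick the first remaining row, pruning if any column goes negative.
    new_target = [t - e for t, e in zip(target, elements[0])]
    if any(t < 0 for t in new_target):
        return False
    return iter_pick(elements[1:], new_target, picked + [index], index + 1)

elements = [[1, 0, 0, 0],
            [1, 1, 0, 0],
            [1, 0, 1, 0],
            [1, 1, 1, 0],
            [1, 1, 1, 1]]
print(iter_pick(elements, [4, 3, 2, 1], [], 0))  # expected: [0, 1, 3, 4]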

Calculate the amount of water a tool described by an array can contain

There is a tool for collecting rainwater. The transect chart of the tool is described by an array of length n.
For example:
for this array {2,1,1,4,1,1,2,3} the transect chart is:
I am required to calculate the amount of water the tool can hold, in O(n) time and space complexity.
For the array above it is 7 (the grey area).
My thought:
Since it's a graphical problem, my initial thought was to first calculate the maximum of the array and multiply it by n. This is the starting volume I need to subtract from.
For example in the array above I need to subtract the green area and the heights themselves:
This is where I'm stuck and need help in order to do so in the required complexity.
Note: Maybe I'm overthinking and there are better ways to handle this problem. But as I said, since it's a graphical problem, my first thought was to go for a geometric solution.
Any tips or hints would be appreciated.
The water level at position i is the smaller of:
The maximum container height at positions <= i; and
The maximum container height at positions >= i
Calculate these two maximum values for every position using two passes through the array, and then sum up the differences between the water levels and the container heights.
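A minimal sketch of that two-pass idea (my own illustration, not the answerer's code):

def trapped_water(heights):
    n = len(heights)
    if n == 0:
        return 0
    left_max = [0] * n               # max container height at positions <= i
    left_max[0] = heights[0]
    for i in range(1, n):
        left_max[i] = max(left_max[i - 1], heights[i])
    right_max = [0] * n              # max container height at positions >= i
    right_max[n - 1] = heights[n - 1]
    for i in range(n - 2, -1, -1):
        right_max[i] = max(right_max[i + 1], heights[i])
    # water level at i is the smaller of the two maxima; sum the gap above each height
    return sum(min(l, r) - h for l, r, h in zip(left_max, right_max, heights))

print(trapped_water([2, 1, 1, 4, 1, 1, 2, 3]))  # ==> 7 for the sample array

Each pass is O(n), and so is the final summation, which meets the required time and space bounds.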
Here is a Python implementation of an algorithm similar to the one described by @MattTimmermans. The code reads like pseudocode, so I don't think extra explanations are needed:
def _find_water_capacity(container):
    """returns the max water capacity as calculated from the left bank
       of the given container
    """
    water_levels = [0]
    current_left_bank = 0
    idx = 0
    while idx < len(container) - 1:
        current_left_bank = max(current_left_bank, container[idx])
        current_location_height = container[idx + 1]
        possible_water_level = current_left_bank - current_location_height
        if possible_water_level <= 0:
            water_levels.append(0)
        else:
            water_levels.append(possible_water_level)
        idx += 1
    return water_levels


def find_water_capacity(container):
    """returns the actual water capacity as the sum of the minimum between the
       left and right capacity for each position
    """
    to_left = _find_water_capacity(container[::-1])[::-1]  # reverse the result from _find_water_capacity of the reversed container
    to_right = _find_water_capacity(container)
    return sum(min(left, right) for left, right in zip(to_left, to_right))


def test_find_water_capacity():
    container = []
    expected = 0
    assert find_water_capacity(container) == expected
    assert find_water_capacity(container[::-1]) == expected

    container = [1, 1, 1, 1, 1]
    expected = 0
    assert find_water_capacity(container) == expected
    assert find_water_capacity(container[::-1]) == expected

    container = [5, 4, 3, 2, 1]
    expected = 0
    assert find_water_capacity(container) == expected
    assert find_water_capacity(container[::-1]) == expected

    container = [2, 1, 1, 4, 1, 1, 2, 3]  # <--- the sample provided
    expected = 7
    assert find_water_capacity(container) == expected
    assert find_water_capacity(container[::-1]) == expected

    container = [4, 1, 1, 2, 1, 1, 3, 3, 3, 1, 2]
    expected = 10
    assert find_water_capacity(container) == expected
    assert find_water_capacity(container[::-1]) == expected

    container = [4, 5, 6, 7, 8, -10, 12, 11, 10, 9, 9]
    expected = 18
    assert find_water_capacity(container) == expected
    assert find_water_capacity(container[::-1]) == expected

    container = [2, 1, 5, 4, 3, 2, 1, 5, 1, 2]
    expected = 12
    assert find_water_capacity(container) == expected
    assert find_water_capacity(container[::-1]) == expected

    print("***all tests find_water_capacity passed***")


test_find_water_capacity()

Parallel Computing - Shuffle

I am looking to shuffle an array in parallel. I have found that an algorithm similar to bitonic sort, but with a random (50/50) re-order, produces an equal distribution, but only if the array length is a power of 2. I've considered the Fisher-Yates shuffle, but I can't see how I could parallelize it in order to avoid O(N) computations.
Any advice?
Thanks!
There's a good, clear recent paper on this here, and the references, especially Shun et al. 2015, are worth a read.
But basically you can do this using the same sort of approach that's used in sort -R: shuffle by giving each row a random key value and sorting on that key. And there are lots of ways to do good parallel distributed sort.
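To make the idea concrete, here is a minimal serial sketch of the random-key trick (my own illustration; the MPI version below does the same thing, just with a distributed sort):

import random

def shuffle_by_random_keys(data):
    # tag each item with a random key, sort on the keys, then drop them
    keyed = [(random.random(), item) for item in data]
    keyed.sort()
    return [item for _, item in keyed]

print(shuffle_by_random_keys(list(range(10))))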
Here's a basic version in python + MPI using an odd-even sort; it goes through P communication steps if P is the number of processors. You can do better than that, but this is pretty simple to understand; it's discussed in this question.
from __future__ import print_function
import sys
import random
from mpi4py import MPI

comm = MPI.COMM_WORLD

def exchange(localdata, sendrank, recvrank):
    """
    Perform a merge-exchange with a neighbour;
    sendrank sends local data to recvrank,
    which merge-sorts it, and then sends lower
    data back to the lower-ranked process and
    keeps upper data
    """
    rank = comm.Get_rank()
    assert rank == sendrank or rank == recvrank
    assert sendrank < recvrank

    if rank == sendrank:
        comm.send(localdata, dest=recvrank)
        newdata = comm.recv(source=recvrank)
    else:
        bothdata = list(localdata)
        otherdata = comm.recv(source=sendrank)
        bothdata = bothdata + otherdata
        bothdata.sort()
        comm.send(bothdata[:len(otherdata)], dest=sendrank)
        newdata = bothdata[len(otherdata):]
    return newdata

def print_by_rank(data, rank, nprocs):
    """ crudely attempt to print data coherently """
    for proc in range(nprocs):
        if proc == rank:
            print(str(rank) + ": " + str(data))
        comm.barrier()
    return

def odd_even_sort(data):
    rank = comm.Get_rank()
    nprocs = comm.Get_size()
    data.sort()
    for step in range(1, nprocs + 1):
        if ((rank + step) % 2) == 0:
            if rank < nprocs - 1:
                data = exchange(data, rank, rank + 1)
        elif rank > 0:
            data = exchange(data, rank - 1, rank)
    return data

def main():
    # everyone get their data
    rank = comm.Get_rank()
    nprocs = comm.Get_size()
    n_per_proc = 5
    data = list(range(n_per_proc * rank, n_per_proc * (rank + 1)))

    if rank == 0:
        print("Original:")
    print_by_rank(data, rank, nprocs)

    # tag your data with random values
    data = [(random.random(), item) for item in data]

    # now sort it by these random tags
    data = odd_even_sort(data)

    if rank == 0:
        print("Shuffled:")
    print_by_rank([x for _, x in data], rank, nprocs)

    return 0

if __name__ == "__main__":
    sys.exit(main())
Running gives:
$ mpirun -np 5 python mergesort_shuffle.py
Original:
0: [0, 1, 2, 3, 4]
1: [5, 6, 7, 8, 9]
2: [10, 11, 12, 13, 14]
3: [15, 16, 17, 18, 19]
4: [20, 21, 22, 23, 24]
Shuffled:
0: [19, 17, 4, 20, 9]
1: [23, 12, 3, 2, 8]
2: [14, 6, 13, 15, 1]
3: [11, 0, 22, 16, 18]
4: [5, 10, 21, 7, 24]

How to slice a rank 4 tensor in TensorFlow?

I am trying to slice a four-dimensional tensor using the tf.slice() operator, as follows:
x_image = tf.reshape(x, [-1,28,28,1], name='Images_2D')
slice_im = tf.slice(x_image,[0,2,2],[1, 24, 24])
However, when I try to run this code, I get the following exception:
raise ValueError("Shape %s must have rank %d" % (self, rank))
ValueError: Shape TensorShape([Dimension(None), Dimension(28), Dimension(28), Dimension(1)]) must have rank 3
How can I slice this tensor?
The tf.slice(input, begin, size) operator requires that the begin and size vectors—which define the subtensor to be sliced—have the same length as the number of dimensions in input. Therefore, to slice a 4-D tensor, you must pass a vector (or list) of four numbers as the second and third arguments of tf.slice().
For example:
x_image = tf.reshape(x, [-1, 28, 28, 1], name='Images_2D')
slice_im = tf.slice(x_image, [0, 2, 2, 0], [1, 24, 24, 1])
# Or, using the indexing operator:
slice_im = x_image[0:1, 2:26, 2:26, :]
The indexing operator is slightly more powerful, as it can also reduce the rank of the output, if you specify a single integer for a dimension rather than a range:
slice_im = x_image[0:1, 2:26, 2:26, :]
print slice_im.get_shape()     # ==> [1, 24, 24, 1]
slice_im_2d = x_image[0, 2:26, 2:26, 0]
print slice_im_2d.get_shape()  # ==> [24, 24]

Python Coin Change: Incrementing list in return statement?

Edit: Still working on this, making progress though.
def recursion_change(available_coins, tender):
    """
    Returns a tuple containing:
    :an array counting which coins are used to make change, mirroring the input array
    :the number of coins to make tender.
    :coins: List[int]
    :money: int
    :rtype: (List[int], int)
    """
    change_list = [0] * len(available_coins)

    def _helper_recursion_change(change_index, remaining_balance, change_list):
        if remaining_balance == 0:
            return (change_list, sum(change_list))
        elif change_index == -1 or remaining_balance < 0:
            return float('inf')
        else:
            test_a = _helper_recursion_change(change_index-1, remaining_balance, change_list)
            test_b = _helper_recursion_change(_helper_recursion_change(len(available_coins)-1, tender, change_list))
            test_min = min(test_a or test_b)
            if :
                _helper_recursion_change()
            else:
                _helper_recursion_change()
            return 1 + _helper_recursion_change(change_index, remaining_balance-available_coins[change_index], change_list))

print str(recursion_change([1, 5, 10, 25, 50, 100], 72)) # Current Output: 5
                                                         # Desired Output: ([2, 0, 2, 0, 1, 0], 5)
Quick overview: this coin-change algorithm is supposed to receive a list of possible coin denominations and a tender amount. It's supposed to recursively output a mirror array and the number of coins needed to make the tender, and I think the best way to do that is with a tuple.
For example:
> recursion_change([1, 2, 5, 10, 25], 49)
>> ([0, 2, 0, 2, 1], 5)
Working code sample:
http://ideone.com/mmtuMr
def recursion_change(coins, money):
    """
    Returns a tuple containing:
    :an array counting which coins are used to make change, mirroring the input array
    :the number of coins to make tender.
    :coins: List[int]
    :money: int
    :rtype: (List[int], int)
    """
    change_list = [0] * len(coins)

    def _helper_recursion_change(i, k, change_list):
        if k == 0:  # Base case: money in this (sub)problem matches change precisely
            return 0
        elif i == -1 or k < 0:  # Base case: change cannot be made for this subproblem
            return float('inf')
        else:  # Otherwise, simplify by recursing:
            # Take the minimum of:
            #   the number of coins to make k cents without coin i
            #   one plus the number of coins to make k - coins[i] cents, still allowing coin i
            return min(_helper_recursion_change(i-1, k, change_list),
                       1 + _helper_recursion_change(i, k-coins[i], change_list))

    return _helper_recursion_change(len(coins)-1, money, change_list)

print str(recursion_change([1, 5, 10, 25, 50, 100], 6)) # Current Output: 2
                                                        # Desired Output: ([1, 1, 0, 0, 0, 0], 2)
In particular, this line:
1 + _helper_recursion_change(i, k-coins[i], change_list)
It's easy enough to count the number of coins we need, as the program does now. Do I have to change the return value to include change_list, so I can increment it? What's the best way to do that without messing with the recursion, since it currently returns just a simple integer?
Replacing change_list in the line above with change_list[i] + 1 gives me a
TypeError: 'int' object is unsubscriptable, and change_list[i] += 1 fails to run because it's 'invalid syntax'.
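One possible approach (a sketch of my own, not the poster's final code): have the helper return a (count, change_list) pair, and on the "use coin i" branch copy the list before incrementing slot i so sibling calls are not affected:

def recursion_change(coins, money):
    def _helper(i, k):
        if k == 0:
            return (0, [0] * len(coins))  # zero coins used, empty tally
        if i == -1 or k < 0:
            return (float('inf'), None)   # no way to make change here
        # Option A: make k cents without ever using coin i.
        skip_count, skip_list = _helper(i - 1, k)
        # Option B: use one coin i, then make the remaining k - coins[i] cents.
        use_count, use_list = _helper(i, k - coins[i])
        if use_list is not None:
            use_list = use_list[:]        # copy so sibling branches keep their own tally
            use_list[i] += 1
            use_count += 1
        if skip_count <= use_count:
            return (skip_count, skip_list)
        return (use_count, use_list)

    count, change_list = _helper(len(coins) - 1, money)
    return (change_list, count)

print(recursion_change([1, 5, 10, 25, 50, 100], 72))  # hoped-for output: ([2, 0, 2, 0, 1, 0], 5)

Like the original, this sketch is exponential without memoization; caching results by (i, k) would fix that, but it is a separate concern from threading the tuple through the recursion.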

Resources