resize! for 2-dimensional arrays (matrices) in Julia - memory-management

The function resize!() in Base takes care of carefully allocating memory to grow (or shrink) a vector of a given element type to a given length:
v = Vector{Float64}(undef, 3)
resize!(v, 5) # makes room for two extra elements
Since Julia is column-major, I was wondering if it would be possible to define a resizecols! function for matrices that would allocate extra columns in an efficient way:
A = Matrix{Float64}(undef, 3, 3)
resizecols!(A, 5) # allocates two extra columns
This is useful in many statistical methods where the number of training examples is not known a priori in a loop. One can start by allocating a design matrix X with n columns and then expand it in the loop as necessary.

The package ElasticArrays.jl defines an array type that can be resized in its last dimension, which is exactly what is needed to grow the columns of a matrix efficiently.
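For instance, usage along these lines (a sketch based on the package's documented API; see the ElasticArrays.jl README):

using ElasticArrays

A = ElasticArray{Float64}(undef, 3, 0)   # 3 rows, 0 columns to start
for _ in 1:5
    append!(A, rand(3))                  # append one 3-element column, amortized O(1)
end
resize!(A, 3, 10)                        # grow to 10 columns; only the last dimension may change
size(A)                                  # (3, 10)

Because the storage is column-major and only the last dimension grows, existing elements never need to be moved, just like resize! on a Vector.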

Related

Efficient all-pairs set intersection on GPU

I have n sets, subsets of a finite universe. I want to calculate the n*n matrix in which the (i, j) entry contains the cardinality of the intersection of set i and set j. n is on the order of 50,000.
My idea is to split the matrix into blocks sufficiently small so to have one thread per entry. Every thread should calculate the intersection using bitwise and.
Are there more efficient approaches to solve this problem?
I'm going to assume you want to compute it as you described: actually computing the intersection of every pair of sets, using a bitwise AND of bitsets.
With the right mathematical setup, you are literally computing the outer product of two vectors, so I will think in terms of high performance linear algebra.
The key to performance is going to be reducing memory traffic, and that means holding things in registers when you can. The overwhelmingly most significant factor is that your elements are huge; it takes 6250 32-bit words (a 200,000-element universe) to store a single set! An entire multiprocessor of CUDA compute capability 3.0, for example, can only hold a mere 10 sets in registers.
What you probably want to do is to spread each element out across an entire thread block. With 896 threads in a block and 7 registers per thread, you can store one set of 896 x 7 x 32 = 200,704 elements. With CUDA compute capability 3.0, you will have 36 registers available per thread.
The simplest implementation would be to have each block own one row of the output matrix. It loads the corresponding element of the second vector and stores it in registers, and then iterates over all of the elements of the first vector, computing the intersection, computing and reducing the popcount, and then storing the result in the corresponding entry of the output row.
This optimization should reduce the overall number of memory reads by a factor of 2, and thus is likely to double performance.
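As a CPU-side Julia sketch of that scheme (illustrative names only, not kernel code):

# Each set is a bitset: a Vector{UInt32} of fixed length (6250 words above).
# |a ∩ b| is the sum of popcounts of the word-wise ANDs.
intersection_size(a, b) = sum(count_ones(a[k] & b[k]) for k in eachindex(a))

# One "block" owns row i of the output: its set stays put (in registers on
# the GPU) while the other vector of sets streams past.
function row_kernel!(out_row, held_set, streaming_sets)
    for j in eachindex(streaming_sets)
        out_row[j] = intersection_size(held_set, streaming_sets[j])
    end
    return out_row
end

On the GPU, held_set lives in the block's registers, and the j loop is what streams through device memory.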
Better would be to have each block own 3-4 rows of the output matrix at once, loading the corresponding 3-4 elements of the second vector into registers. The block then iterates over all of the elements of the first vector, and for each one computes the 3-4 intersections it can, storing the results in the output matrix.
This optimization reduces the memory traffic by an additional factor of 3-4.
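The same sketch, tiled over a few rows at a time (again illustrative):

# Hold several sets at once; each streamed set is read once and reused
# against all of them, cutting reads of the streaming vector by that factor.
function tiled_kernel!(out, held_sets, rows, streaming_sets)
    for j in eachindex(streaming_sets)
        s = streaming_sets[j]                    # one read, several uses
        for (t, i) in enumerate(rows)
            out[i, j] = intersection_size(held_sets[t], s)
        end
    end
    return out
end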
A completely different approach would be to work with each element of the universe individually: for each element of the universe, you compute which sets actually contain that element, and then (atomically) increment the corresponding entries of the output matrix.
Asymptotically, this should be much more efficient than computing the intersections of sets. Unfortunately, it sounds hard to implement efficiently.
An improvement is to work with, say, 4 elements of the universe at a time. You split all of your sets into 16 buckets, depending on which of those 4 elements each set contains. Then, for each of the 16*16 possible pairs of buckets, you iterate through all pairs of sets drawn from the two buckets and (atomically) update the corresponding entries of the matrix appropriately (a sketch follows below).
This should be even faster than the version described above, but it may still be difficult to implement.
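Here is a serial Julia sketch of the bucketing idea (on the GPU, the increments would be atomic adds):

# Process 4 universe elements at a time: bucket sets by their 4-bit
# membership pattern, then each bucket pair contributes count_ones(ka & kb)
# shared elements to every pair of sets drawn from those two buckets.
function bucket_update!(A, sets, elems)           # elems: the 4 chosen elements
    buckets = [Int[] for _ in 1:16]
    for (idx, s) in enumerate(sets)
        key = 0
        for (b, e) in enumerate(elems)
            key |= (e in s) << (b - 1)            # bit b set iff the set contains elems[b]
        end
        push!(buckets[key + 1], idx)
    end
    for ka in 0:15, kb in 0:15
        c = count_ones(ka & kb)                   # elements shared among the chosen 4
        c == 0 && continue
        for i in buckets[ka + 1], j in buckets[kb + 1]
            A[i, j] += c                          # atomic add on a GPU
        end
    end
    return A
end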
To reduce the difficulty of getting all of the synchronization worked out, you could partition all of the input sets into k groups of n/k sets each. Then, the (i,j)-th thread (or warp or block) only does the updates for the corresponding block of the output matrix.
A different approach to breaking up the problem is to split the universe into smaller partitions of 1024 elements each, and compute just the sizes of the intersections within each part of the universe.
I'm not sure if I've described that well; basically you're computing
A[i,j] = sum((k in v[i]) * (k in w[j]) for k in the_universe)
where v and w are the two lists of sets, and k in S is 1 if true and 0 otherwise. The point is to permute the indices so that k is in the outer loop rather than the inner loop, although for efficiency you will have to work with many consecutive k at once, rather than one at a time.
That is, you initialize the output matrix to all zeroes, and for each block of 1024 universe elements, you compute the sizes of the intersections and accumulate the results into the output matrix.
I choose 1024 because I imagine you'll have a data layout where that's probably the smallest size at which you can still get the full memory bandwidth when reading from device memory, with all of the threads in a warp working together. (Adjust this appropriately if you know better, or if you aren't using NVIDIA hardware and some other size works better on the GPUs you're using.)
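Written serially in Julia (a sketch; 32-word chunks so each chunk covers 32 * 32 = 1024 universe elements, and A must start zeroed):

# Outer loop over universe chunks, inner loops over set pairs; the output
# matrix accumulates each chunk's contribution to every intersection size.
function accumulate_chunks!(A, v, w; words_per_chunk = 32)
    nwords = length(first(v))
    for base in 1:words_per_chunk:nwords
        rng = base:min(base + words_per_chunk - 1, nwords)
        for j in eachindex(w), i in eachindex(v)
            A[i, j] += sum(count_ones(v[i][k] & w[j][k]) for k in rng)
        end
    end
    return A
end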
Now that your elements are a reasonable size, you can now appeal to traditional linear algebra optimizations to compute this product. I would probably do the following:
Each warp is assigned a large number of rows of the output matrix. It reads the corresponding elements out of the second vector, and then iterates through the first vector, computing products.
You could have all of the warps operate independently, but it may be better to do the following:
All of the warps in a block work together to load some number of elements from the first vector
Each warp computes the intersections it can and writes the results to the output matrix
You could store the loaded elements in shared memory, but you might get better results holding them in registers. Each warp can only compute intersections with the set elements it is holding onto, but after doing so the warps can all rotate which warp is holding which elements.
If you do enough optimizations along these lines, you will probably reach the point where you are no longer memory bound, which means you might not have to go so far as to do the most complicated optimizations (e.g. the shared memory approach described above might already be enough).

Is it beneficial to transpose an array in order to use column-wise operations?

Assume that we are working with a language which stores arrays in column-major order. Assume also that we have a function which takes a 2-D array as an argument and returns it.
I'm wondering whether you can claim that it is (or isn't) in general beneficial to transpose this array when calling the function, in order to work with column-wise operations instead of row-wise operations, or whether the transposing negates the benefits of column-wise operations.
As an example, in R I have an object of class ts named y which has dimension n x p, i.e. I have p time series of length n.
I need to make some computations with y in Fortran, where I have two loops with following kind of structure:
do i = 1, n
  do j = 1, p
    ! just an example, some row-wise operations on `y`
    x(i,j) = a*y(i,j)
    D = ddot(m, y(i,1:p), 1, b, 1)
    ! ...
  end do
end do
As Fortran (like R) uses column-major storage, it would be better to do the computations with a p x n array instead. So instead of
out<-.Fortran("something",y=array(y,dim(y)),x=array(0,dim(y)))
ynew<-out$out$y
x<-out$out$x
I could use
out<-.Fortran("something2",y=t(array(y,dim(y))),x=array(0,dim(y)[2:1]))
ynew<-t(out$out$y)
x<-t(out$out$x)
where Fortran subroutine something2 would be something like
do i = 1, n
  do j = 1, p
    ! just an example, some column-wise operations on `y`
    x(j,i) = a*y(j,i)
    D = ddot(m, y(1:p,i), 1, b, 1)
    ! ...
  end do
end do
Does the choice of approach always depend on the dimensions n and p or is it possible to say one approach is better in terms of computation speed and/or memory requirements? In my application n is usually much larger than p, which is 1 to 10 in most cases.
More of a comment, but I wanted to include a bit of code: under old-school F77 you would essentially be forced to use the second approach, as
y(1:p,i)
is simply a pointer to y(1,i), with the following p values contiguous in memory.
The first construct,
y(i,1:p)
is a list of values interspaced in memory, so it seems to require making a copy of the data to pass to the subroutine. I say it seems because I haven't the foggiest idea how a modern optimizing compiler deals with these things. I tend to think at best it's a wash; at worst this could really hurt. Imagine an array so large you need to page swap to access the whole vector.
In the end, the only way to answer this is to test it yourself.
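For instance, Julia is also column-major, and a quick test of the effect could look like this (a sketch assuming the BenchmarkTools.jl package; a Fortran experiment like the one below is analogous):

using BenchmarkTools

A = rand(5_000, 5_000)

function col_sum(A)              # inner loop walks down columns: contiguous access
    s = 0.0
    for j in axes(A, 2), i in axes(A, 1)
        s += A[i, j]
    end
    return s
end

function row_sum(A)              # inner loop walks along rows: large stride
    s = 0.0
    for i in axes(A, 1), j in axes(A, 2)
        s += A[i, j]
    end
    return s
end

@btime col_sum($A)               # cache-friendly
@btime row_sum($A)               # typically several times slower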
---------- edit
I did a little testing and it confirmed my hunch: passing rows y(i,1:p) does cost you versus passing columns y(1:p,i). I used a subroutine that does practically nothing, to see the difference. My guess is that with any real subroutine the hit is negligible.
By the way (and maybe this helps in understanding what goes on), passing every other value in a column,
y(1:p:2,i)
takes longer (orders of magnitude) than passing the whole column, while passing every other value in a row cuts the time in half versus passing a whole row.
(using gfortran 12)

Matrix transposition without using loops?

How do you transpose a matrix without using any kind of loop? If it's n x n we can use the diagonal as a base and swap elements across it, but for an n x m matrix I don't think that solution is feasible.
Anyway, to read or store the matrix we need to use loops, right? Is there any solution without loops?
If you know the dimensions of the matrix from the beginning, then you will not need any loop, because you can simply swap the matrix positions one by one to transpose the whole matrix. In that case you don't need a loop even if the dimensions are m x n.
But if you don't know the dimensions of the matrix beforehand, then you definitely need a loop to iterate over the matrix, reading each position and swapping it to its destination while transposing.
For storing the entire transposed matrix, you definitely need to use a loop. This is not really a big deal, since storing a matrix involves loops anyway: you need to loop through its members to store them.
If you are just reading it, you can use the definition of the matrix transpose and simply swap the indices. For example, in C:
int getTransposedElement(int i, int j, int** originalMatrix) {
    return originalMatrix[j][i]; /* note the swapped indices */
}
If you are using a language with classes and polymorphism, you can create a new matrix class that does this automatically. This has the additional benefit that it avoids copying the original matrix, which saves memory and allows changes to the transposed matrix to be reflected in the original matrix.
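Julia's built-in transpose wrapper works exactly this way; a minimal hand-rolled version of such a class might look like this (a sketch, not the Base implementation):

# A lazy transpose: no copy; reads and writes go through to the parent.
struct Transposed{T, M <: AbstractMatrix{T}} <: AbstractMatrix{T}
    parent::M
end

Base.size(t::Transposed) = reverse(size(t.parent))
Base.getindex(t::Transposed, i::Int, j::Int) = t.parent[j, i]
Base.setindex!(t::Transposed, v, i::Int, j::Int) = (t.parent[j, i] = v)

A  = [1 2 3; 4 5 6]
At = Transposed(A)
At[3, 1] == A[1, 3]     # true, and no element was ever moved

Writes through At update A directly, which is the reflected-changes property mentioned above.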

efficient way for finding min value on each given region

Given a matrix G, we define a value m(X) for each matrix X to be the minimum entry of X.
Now, we have many regions of G, denoted G_1, ..., G_r. Here, a region of G is formed by a submatrix of G that is randomly chosen from some columns and some rows of G. Our problem is to compute m(G_1), ..., m(G_r) in as few operations as possible. Are there any methods, like building a hash table or sorting, to get the results faster? Thanks!
========================
For example, if G={{1,2,3},{4,5,6},{7,8,9}}, then
G_1 could be {{1,2},{7,8}}
G_2 could be {{1,3},{4,6},{7,9}}
G_3 could be {{5,6},{8,9}}
=======================
Currently, for each G_i we need m x n comparisons to compute m(G_i). Thus, for m(G_1), ..., m(G_r) there would be r x m x n comparisons. However, I notice that G_i and G_j may overlap, so there might be some other approach that is more effective. Any suggestions would be highly appreciated!
Depending on how many times the min/max type data is needed, you could consider a matrix that holds min/max information in between the matrix values, i.e. in the interstices between values. Thus, for your example G={{1,2,3},{4,5,6},{7,8,9}}, we would define a relationship matrix R sized (m x n) by (m x n), with values from the set C = {-1 = less than, 0 = equals, 1 = greater than}.
For each entry n of G, R would hold the nine relationship pairs (n,1), (n,2), ..., (n,9), where each value is a member of C. Note that (n,n) is defined and will equal 0. Thus, R(4,:) = (1,1,1,0,-1,-1,-1,-1,-1). Now consider any of your subsets G_1, ...: knowing the positional relationships of a subset's members gives you offsets into R, which resolve to indexes into each R(n,:) and return the desired relationship information directly, without comparisons.
You, of course, will have to decide whether the overhead in space and computation to build R exceeds the cost of just computing what you need each time it's needed. Certain optimizations are available, including the realization that R is reflected along the major diagonal, and that you could fold "equals" into, say, "less than" (so that C has only two values). Depending on the original matrix G, other optimizations can be had if it is known that a row or column is sorted.
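For concreteness, here is a toy Julia sketch of that trade-off (illustrative names; R here stores only the sign of each pairwise comparison):

# R[a, b] = -1, 0, or 1 according to whether G[a] <, ==, or > G[b],
# over the flattened entries of G.
function relation_matrix(G::AbstractMatrix)
    g = vec(G)
    N = length(g)
    R = Matrix{Int8}(undef, N, N)
    for b in 1:N, a in 1:N
        R[a, b] = Int8(sign(g[a] - g[b]))
    end
    return R
end

# With R built, the minimum of a region is found by lookups, not comparisons:
# entry a is a minimum iff R[a, b] <= 0 for every b in the region.
function region_min(R, g, region)            # region: flattened indices into G
    for a in region
        all(R[a, b] <= 0 for b in region) && return g[a]
    end
end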
And since some computing environments (on mainframes, supercomputers, etc.) store arrays in RAM in column-major order, store your dataset with the rows and columns transposed, so that column-to-column operations (vector calculations) actually favor the columns. Check your architecture.

What is the main implementation idea behind sparse hash table?

Why does the Google sparsehash open-source library have two implementations: a dense hashtable and a sparse one?
The dense hashtable is your ordinary textbook hashtable implementation.
The sparse hashtable stores only the elements that have actually been set, divided over a number of arrays. To quote from the comments in the implementation of sparse tables:
// The idea is that a table with (logically) t buckets is divided
// into t/M *groups* of M buckets each. (M is a constant set in
// GROUP_SIZE for efficiency.) Each group is stored sparsely.
// Thus, inserting into the table causes some array to grow, which is
// slow but still constant time. Lookup involves doing a
// logical-position-to-sparse-position lookup, which is also slow but
// constant time. The larger M is, the slower these operations are
// but the less overhead (slightly).
To know which elements of the arrays are set, a sparse table includes a bitmap:
// To store the sparse array, we store a bitmap B, where B[i] = 1 iff
// bucket i is non-empty. Then to look up bucket i we really look up
// array[# of 1s before i in B]. This is constant time for fixed M.
so that each element incurs an overhead of only 1 bit (in the limit).
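To make the bitmap-and-rank trick concrete, here is a toy Julia sketch of a single group (not the actual sparsehash code; M = 64 here so one UInt64 holds the bitmap):

mutable struct SparseGroup{T}
    bitmap::UInt64        # bit i set iff logical bucket i (0-based, i < 64) is occupied
    values::Vector{T}     # only the occupied buckets, packed in logical order
end

SparseGroup{T}() where {T} = SparseGroup{T}(zero(UInt64), T[])

# Sparse position of logical bucket i = number of set bits strictly below i.
rank(g::SparseGroup, i) = count_ones(g.bitmap & ((UInt64(1) << i) - 1))

function Base.setindex!(g::SparseGroup{T}, v, i) where {T}
    pos = rank(g, i) + 1
    if (g.bitmap & (UInt64(1) << i)) == 0
        insert!(g.values, pos, v)          # grows the packed array: slow but constant time for fixed M
        g.bitmap |= UInt64(1) << i
    else
        g.values[pos] = v
    end
    return v
end

Base.getindex(g::SparseGroup, i) = g.values[rank(g, i) + 1]   # assumes bucket i is occupied

g = SparseGroup{Int}()
g[5] = 42             # stored at packed position rank(g, 5) + 1 = 1
g[5]                  # 42; only one value slot is actually allocated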
Sparse hash tables are a memory-efficient way of mapping keys to values (1-2 bits of overhead per key). Bloom filters can give you even fewer bits per key, but they don't attach values to keys other than outside/probably-inside, which is slightly less than a bit of information.
