Efficient weighted cross-product in Armadillo - performance

I'm trying to find an efficient way to compute X^T * W * X where X is a dense mat of size e.g. 10,000 x 10 and W is a diagonal matrix (I store only the diagonal in a vec).
For now, I use this function
arma::mat& getXtW(const arma::mat& covar,
const arma::vec& w,
arma::mat& tcovar,
size_t n, size_t K) {
size_t i, k;
for (i = 0; i < n; i++) {
for (k = 0; k < K; k++) {
tcovar(k, i) = covar(i, k) * w(i);
}
}
return tcovar;
}
and compute
tcovar = getXtW(covar, w, tcovar, n, K);
cprod = tcovar * covar;
Yet this does not seem optimal.
PS: You can see the whole code there.
Edit1: Seems I can use covar.t() * (covar.each_col() % w), but this doesn't seem to be much faster.
Edit2: If I implement it myself with loops in Rcpp:
arma::mat testProdW2(const arma::mat& x, const arma::vec& w) {
int n = x.n_rows;
int K = x.n_cols;
arma::mat res(K, K);
double tmp;
for (int k = 0; k < K; k++) {
for (int j = k; j < K; j++) {
tmp = 0;
for (int i = 0; i < n; i++) {
tmp += x(i, j) * w[i] * x(i, k);
}
res(j, k) = tmp;
}
}
for (int k = 0; k < K; k++) {
for (int j = 0; j < k; j++) {
res(j, k) = res(k, j);
}
}
return res;
}
This is slower than the first implementation.

According to "BLAS matrix by matrix transpose multiply", there is no BLAS routine that does this directly. Instead, the suggestion there is to loop over the rows of X and use dsyr. I found this an interesting question since I know how to link BLAS in Rcpp, but have not done so using RcppArmadillo. Stack Overflow has an answer for that as well: "Rcpparmadillo: can't call Fortran routine "dgebal"?". Note: I have not checked, but I expect that dsyr is not part of the BLAS subset that comes with R. So this will only work if your R is linked to a full BLAS implementation.
Combining this we get:
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <Rcpp/Benchmark/Timer.h>
#ifdef ARMA_USE_LAPACK
#if !defined(ARMA_BLAS_CAPITALS)
#define arma_dsyr dsyr
#else
#define arma_dsyr DSYR
#endif
extern "C"
void arma_fortran(arma_dsyr)(const char* uplo, const int* n, const double* alpha, const double* X, const int* incX, double* A, const int* ldA); // A is updated in place, so it is not const
#endif
// [[Rcpp::export]]
Rcpp::NumericVector getXtWX(const arma::mat& X, const arma::vec& w) {
Rcpp::Timer timer;
timer.step("start");
arma::mat result1 = X.t() * (X.each_col() % w);
timer.step("Armadillo result");
const int n = X.n_rows;
const int k = X.n_cols;
arma::mat result(k, k, arma::fill::zeros);
for (int i = 0; i < n; ++i) { // int index avoids a signed/unsigned comparison with n
F77_CALL(dsyr)("U", &k, &w(i), &X(i,0), &n, result.memptr(), &k);
}
result = arma::symmatu(result);
timer.step("BLAS result");
Rcpp::NumericVector res(timer);
return res;
}
/*** R
n <- 10000
k <- 10
X <- matrix(runif(n*k), n, k)
w <- runif(n)
Reduce(rbind, lapply(1:6, function(x) diff(getXtWX(X, w))/1e6))
*/
However, for me the BLAS solution is quite a bit slower:
> Reduce(rbind, lapply(1:6, function(x) diff(getXtWX(X, w))/1e6))
     Armadillo result BLAS result
init         1.291243    6.666026
             1.176143    6.623282
             1.102111    6.644165
             1.094917    6.612596
             1.098619    6.588431
             1.069286    6.615529
I tried to improve this by first transposing the matrix, in the hope that memory access would be faster when looping over column matrices, but this did not make a difference on my (low-powered) system.
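To make explicit what the dsyr loop above accumulates, here is a minimal dependency-free sketch of the same row-by-row rank-1 update. The function name weighted_crossprod and the flat column-major std::vector layout are assumptions made for illustration; they are not part of Armadillo or BLAS, and this sketch is not tuned for speed.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not an Armadillo or BLAS function): computes
// X^T * diag(w) * X for a column-major n x k matrix X, one row at a time.
std::vector<double> weighted_crossprod(const std::vector<double>& X,
                                       const std::vector<double>& w,
                                       std::size_t n, std::size_t k) {
    std::vector<double> A(k * k, 0.0);  // column-major k x k result
    for (std::size_t i = 0; i < n; ++i) {
        // Rank-1 update A += w[i] * x_i * x_i^T on the upper triangle,
        // which is what one dsyr call with uplo = "U" performs.
        for (std::size_t c = 0; c < k; ++c) {
            const double s = w[i] * X[i + c * n];
            for (std::size_t r = 0; r <= c; ++r) {
                A[r + c * k] += s * X[i + r * n];
            }
        }
    }
    // Mirror the upper triangle into the lower one (the role of arma::symmatu).
    for (std::size_t c = 0; c < k; ++c) {
        for (std::size_t r = c + 1; r < k; ++r) {
            A[r + c * k] = A[c + r * k];
        }
    }
    return A;
}
```

The inner loops read row i of X with stride n, which is the same strided access that incX = n requests from dsyr.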

Related

Runtime error for large inputs for sorting (quicksort)

This is a very simple program where the user inputs (x,y) coordinates and a distance 'd', and the program has to find the number of distinct (unrepeated) coordinates from (x,y) to (x+d,y).
For example, if the input for one test case is 4,9,2 (x=4, y=9, d=2), then the coordinates are (4,9), (5,9) and (6,9). I have used a sorting algorithm as mentioned in the question (to keep track of multiple occurrences); however, the program shows a runtime error for test cases beyond 30. Is there any mistake in the code or is it an issue with my compiler?
For a detailed explanation of question: https://www.hackerearth.com/practice/algorithms/sorting/merge-sort/practice-problems/algorithm/missing-soldiers-december-easy-easy/
#include <stdio.h>
#include <stdlib.h>
int partition(int *arr, int p, int r) {
int x;
x = arr[r];
int tmp;
int i = p - 1;
for (int j = p; j <= r - 1; ++j) {
if (arr[j] <= x) {
i = i + 1;
tmp = arr[i];
arr[i] = arr[j];
arr[j] = tmp;
}
}
tmp = arr[i + 1];
arr[i + 1] = arr[r];
arr[r] = tmp;
return (i + 1);
}
void quicksort(int *arr, int p, int r) {
int q;
if (p < r) {
q = partition(arr, p, r);
quicksort(arr, p, q - 1);
quicksort(arr, q + 1, r);
}
}
int count(int A[],int ct) {
int cnt = 0;
for (int i = 0; i < ct; ++i) {
if (A[i] != A[i + 1]) {
cnt++;
}
}
return cnt;
}
int main() {
int t;
scanf("%d", &t);
long int tmp, y, d;
int ct = 0;
int i = 0;
int x[1000];
int j = 0;
for (int l = 0; l < t; ++l) {
scanf("%d%d%d", &tmp, &y, &d);
ct = ct + d + 1; //this counts the total no of coordinates for each (x,y,d)
for (int i = 0; i <= d; ++i) {
x[j] = tmp + i; //storing all possible the x and x+d coordinates
j++;
}
}
int cnt;
int p = ct - 1;
quicksort(x, 0, p); //quicksort sorting
for (int l = 0; l < ct; ++l) {
printf("%d ", x[l]); //prints sorted array not necessary to question
}
cnt = count(x, ct); //counts the number of non-repeated vertices
printf("%d\n", cnt);
}
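The problem is that the array int x[1000] is not large enough for the input data: each test case stores d + 1 values, so j runs past the end of the array as soon as more than 1000 coordinates are generated.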
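One way to avoid a fixed bound is a container that grows with the input. The following is a minimal C++ sketch under stated assumptions: the Query struct and the count_distinct name are hypothetical, and, like the question's code, it ignores y and only looks at the x values. Memory still grows with the total number of generated coordinates, so it may not be suitable for very large d.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical input record: start coordinate x and distance d
// (y is ignored here, as in the question's code).
struct Query {
    long x;
    long d;
};

// Counts the distinct x values covered by all ranges [x, x + d].
std::size_t count_distinct(const std::vector<Query>& queries) {
    std::vector<long> xs;  // grows as needed, so there is no fixed bound to overrun
    for (const Query& q : queries) {
        for (long i = 0; i <= q.d; ++i) {
            xs.push_back(q.x + i);
        }
    }
    std::sort(xs.begin(), xs.end());
    // std::unique moves one copy of each value to the front; the distance
    // to the iterator it returns is the number of distinct values.
    return static_cast<std::size_t>(std::unique(xs.begin(), xs.end()) - xs.begin());
}
```

Calling count_distinct with the single query {4, 2} from the question covers 4, 5 and 6, so it returns 3.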

Find the number of intersections of n line segments with endpoints on two parallel lines

Finding the number of intersections of n line segments with endpoints on two parallel lines.
Let there be two sets of n points:
A={p1,p2,…,pn} on y=0
B={q1,q2,…,qn} on y=1
Each point pi is connected to its corresponding point qi to form a line segment.
I need to write code using a divide-and-conquer algorithm that returns the number of intersection points of all n line segments.
for example:
input:
3
1 101
-234 234
567 765
output:
1
I coded it as below, but I get wrong answers.
Can anyone help me with this code or give me another solution for the question?
#include<iostream>
#include <vector>
#include<algorithm>
using namespace std;
void merge1(vector< pair <int, int> > vect, int l, int m, int r)
{
int n1 = m - l + 1;
int n2 = r - m;
vector< pair <int, int> > vect_c_l(n1);
vector< pair <int, int> > vect_c_r(n2);
for (int i = 0; i < n1; i++)
vect_c_l[i] = vect[l + i];
for (int j = 0; j < n2; j++)
vect_c_r[j] = vect[m + 1 + j];
int i = 0;
int j = 0;
int k = l;
while (i < n1 && j < n2) {
if (vect_c_l[i].first <= vect_c_r[j].first) {
vect[k] = vect_c_l[i];
i++;
}
else {
vect[k] = vect_c_r[j];
j++;
}
k++;
}
while (i < n1) {
vect[k] = vect_c_l[i];
i++;
k++;
}
while (j < n2) {
vect[k] = vect_c_r[j];
j++;
k++;
}
}
int merge2(vector< pair <int, int> > vect, int l, int m, int r)
{
int n1 = m - l + 1;
int n2 = r - m;
int inv_count = 0;
vector< pair <int, int> > vect_c_l(n1);
vector< pair <int, int> > vect_c_r(n2);
for (int i = 0; i < n1; i++)
vect_c_l[i] = vect[l + i];
for (int j = 0; j < n2; j++)
vect_c_r[j] = vect[m + 1 + j];
int i = 0;
int j = 0;
int k = l;
while (i < n1 && j < n2) {
if (vect_c_l[i].second < vect_c_r[j].second) {
vect[k] = vect_c_l[i];
i++;
}
else {
vect[k] = vect_c_r[j];
j++;
inv_count = inv_count + (m - i);
}
k++;
}
while (i < n1) {
vect[k] = vect_c_l[i];
i++;
k++;
}
while (j < n2) {
vect[k] = vect_c_r[j];
j++;
k++;
}
return inv_count;
}
void mergeSort1(vector< pair <int, int> > vect, int l, int r) {
if (l >= r) {
return;
}
int m = l + (r - l) / 2;
mergeSort1(vect, l, m);
mergeSort1(vect, m + 1, r);
merge1(vect, l, m, r);
}
int mergeSort2(vector< pair <int, int> > vect, int l, int r) {
int inv_count = 0;
if (r > l) {
int m = l + (r - l) / 2;
inv_count += mergeSort2(vect, l, m);
inv_count += mergeSort2(vect, m+ 1, r);
/*Merge the two parts*/
inv_count += merge2(vect, l, m + 1, r);
}
return inv_count;
}
int main() {
int n,c=0;
cin >> n;
int a, b;
vector< pair <int, int> > vect;
for (int i = 0;i < n;i++) {
cin >> a >> b;
vect.push_back(make_pair(a, b));
}
mergeSort1(vect,0,n-1);
cout << mergeSort2(vect,0, n - 1);
}
I'd take advantage of the idea that computing whether the segments intersect is much simpler than computing where they intersect. Two segments intersect if their x values are ordered differently on y=0 and on y=1. (Conversely, if both x values of one segment are smaller than the other's, or both are larger, the segments do not intersect.)
Objects make this easy to state. Build a segment object whose main job is to determine whether it intersects another instance.
class Segment {
constructor(x) {
this.x0 = x[0];
this.x1 = x[1];
}
// answer whether the receiver intersects the passed segment
intersects(segment) {
// this is ambiguous in the problem, but assume touching endpoints
// count as intersections
if (this.x0 === segment.x0 || this.x1 === segment.x1) return true;
let sort0 = this.x0 < segment.x0
let sort1 = this.x1 < segment.x1
return sort0 !== sort1
}
}
let input = [
[1, 101],
[-234, 234],
[567, 765]
];
let segments = input.map(x => new Segment(x))
// check segments with one another in pairs
let pairs = segments.map((v, i) => segments.slice(i + 1).map(w => [v, w])).flat();
let intersections = pairs.reduce((acc, p) => p[0].intersects(p[1]) ? acc + 1 : acc, 0)
console.log(intersections)
You can also see the problem by abstracting from the lines themselves.
If there were no intersections, the order of the indexes would be the same on both parallel lines.
So the number of intersections is equal to the number of swaps of neighbouring points you need to perform to get the same order of indexes on both sides.
In your example you have the two sequences of indexes
1,3,4,2 on the upper line
2,1,4,3 on the lower line
To convert the lower sequence into the upper one by swapping neighbours, you need 4 swaps:
2,1,4,3 start
-> 1,2,4,3
-> 1,4,2,3
-> 1,4,3,2
-> 1,3,4,2 = upper sequence
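The number of neighbour swaps described above is the number of inversions, which a merge sort can count in O(n log n). Here is a minimal divide-and-conquer sketch; the names count_inversions and count_intersections are hypothetical, and it assumes all endpoints on each line are distinct (segments that merely share an endpoint are not counted).

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Counts inversions in v[lo, hi) with merge sort; tmp is scratch space
// of the same size as v.
static long long count_inversions(std::vector<int>& v, std::vector<int>& tmp,
                                  std::size_t lo, std::size_t hi) {
    if (hi - lo < 2) return 0;
    const std::size_t mid = lo + (hi - lo) / 2;
    long long inv = count_inversions(v, tmp, lo, mid) +
                    count_inversions(v, tmp, mid, hi);
    std::size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi) {
        if (v[i] <= v[j]) {
            tmp[k++] = v[i++];
        } else {
            // v[j] is smaller than every element still left in the left half
            inv += static_cast<long long>(mid - i);
            tmp[k++] = v[j++];
        }
    }
    while (i < mid) tmp[k++] = v[i++];
    while (j < hi) tmp[k++] = v[j++];
    std::copy(tmp.begin() + lo, tmp.begin() + hi, v.begin() + lo);
    return inv;
}

// Hypothetical entry point: each pair holds one segment's x coordinate
// on y=0 (first) and on y=1 (second).
long long count_intersections(std::vector<std::pair<int, int>> segs) {
    std::sort(segs.begin(), segs.end());  // order segments by their endpoint on y=0
    std::vector<int> upper;
    for (const auto& s : segs) upper.push_back(s.second);
    std::vector<int> tmp(upper.size());
    return count_inversions(upper, tmp, 0, upper.size());
}
```

For the question's sample input, sorting by the first coordinate leaves the second coordinates as 234, 101, 765, which contains one inversion and matches the expected output of 1.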

Merge sort: time limit exceed

Why am I getting a time limit exceeded error when sorting an array using the merge sort algorithm? What is wrong with my code? I have taken an input of 9 elements.
Input: 4 2 1 8 5 9 6 7 0
Output: Time limit exceeded
#include <bits/stdc++.h>
using namespace std;
int a[100];
void merge(int a[], int l, int r, int m) {
int t[r - l + 1];
int i = l, j = m + 1, k = 0;
while (i <= m && j <= r) {
if (a[i] < a[j])
t[k++] = a[i++];
else
t[k++] = a[j++];
}
while (i <= m)
t[k++] = a[i++];
while (j <= r)
t[k++] = a[j++];
for (int i = l; i <= r; i++)
a[i] = t[i - l];
}
void msort(int a[], int l, int r) {
if (l > r)
return;
int m = (r + l) / 2;
msort(a, l, m);
msort(a, m + 1, r);
merge(a, l, r, m);
}
int main() {
int n;
cin >> n;
for (int i = 0; i < n; i++)
cin >> a[i];
msort(a, 0, n - 1);
for (int i = 0; i < n; i++)
cout << a[i] << " ";
cout << endl;
return 0;
}
There are some problems in your code:
The test for termination in msort() is incorrect: you should stop when the slice has a single element or less. You currently recurse forever on slices of 1 element, because with l == r the midpoint m equals l and msort(a, l, m) is called again with the same arguments.
if (l >= r) return;
You should check in main() that the number n of elements read from the user is no greater than 100, the size of the global array a into which you read the elements to be sorted. Better still, use a local array with the proper size or allocate the array from the heap. The temporary array t in merge() might also be too large for automatic allocation. It is more efficient to allocate temporary space once and pass it recursively.
Note also that it is idiomatic in C and C++ to specify array slices with the index of the first element and the index of the element after the last one. This simplifies the code, allows for empty arrays and avoids special cases for unsigned index types.
Here is a modified version with this approach:
#include <bits/stdc++.h>
using namespace std;
void merge(int a[], int l, int r, int m, int t[]) {
int i = l, j = m, k = 0;
while (i < m && j < r) {
if (a[i] < a[j])
t[k++] = a[i++];
else
t[k++] = a[j++];
}
while (i < m)
t[k++] = a[i++];
while (j < r)
t[k++] = a[j++];
for (int i = l; i < r; i++)
a[i] = t[i - l];
}
void msort(int a[], int l, int r, int t[]) {
if (r - l > 1) {
int m = l + (r - l) / 2;
msort(a, l, m, t);
msort(a, m, r, t);
merge(a, l, r, m, t);
}
}
void msort(int a[], int n) {
if (n > 1) {
int *t = new int[n];
msort(a, 0, n, t);
delete[] t;
}
}
int main() {
int n;
cin >> n;
if (n <= 0)
return 1;
int *a = new int[n];
for (int i = 0; i < n; i++)
cin >> a[i];
msort(a, n);
for (int i = 0; i < n; i++)
cout << a[i] << " ";
cout << endl;
delete[] a;
return 0;
}

Generic fast Transpose of non-square matrix CUDA

The SDK provides an example and strategies for tackling a square matrix transpose but is there a good way of performing a transpose on a non square matrix? I have quite a naive implementation currently as follows which is probably terrible:
template<class S>
__global__ void transpose(S *Source, S *Destination, int SizeX, int SizeY) {
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid<SizeX*SizeY) {
int X = tid % SizeX;
int Y = tid / SizeX;
//(x,y) => (y,x)
int newId = (SizeY*X) + Y;
Destination[newId] = Source[tid];
}
}
Here my idea was to transpose the square part of the matrix with only the necessary threads/blocks (each thread swaps two entries of the square sub matrix), then traverse and transpose the remaining entries.
// a is an m x n row-major matrix, c receives its n x m transpose.
__global__ void kernelTranspuesta(float *a, float *c, int m, int n) {
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    int j = threadIdx.y + blockIdx.y*blockDim.y;
    int smallest = m < n ? m : n;
    // square part: each thread swaps pairs of entries across the diagonal
    while( j < smallest ){
        i = threadIdx.x + blockIdx.x*blockDim.x;
        while( i < j ){
            c[i*m+j] = a[j*n+i];
            c[j*m+i] = a[i*n+j];
            i+= blockDim.x*gridDim.x;
        }
        if(i == j)
            c[j*m+i] = a[i*n+j];
        j+= blockDim.y*gridDim.y;
    }
    // remaining entries outside the square part
    if( m > n ) {
        i = threadIdx.x + blockIdx.x*blockDim.x + n;
        j = threadIdx.y + blockIdx.y*blockDim.y;
        while( i < m ){
            j = threadIdx.y + blockIdx.y*blockDim.y;
            while( j < n ){
                c[j*m+i] = a[i*n+j];
                j+= blockDim.y*gridDim.y;
            }
            i+= blockDim.x*gridDim.x;
        }
    }else{
        i = threadIdx.x + blockIdx.x*blockDim.x;
        j = threadIdx.y + blockIdx.y*blockDim.y + m;
        while( i < m ){
            j = threadIdx.y + blockIdx.y*blockDim.y + m;
            while( j < n ){
                c[j*m+i] = a[i*n+j];
                j+= blockDim.y*gridDim.y;
            }
            i+= blockDim.x*gridDim.x;
        }
    }
}
The kernel call is
dim3 hilos(16,16); // hilos(blockDim.x, blockDim.y)
dim3 bloques(8,8); // bloques(gridDim.x, gridDim.y)
kernelTranspuesta<<<bloques, hilos>>>(aD, cD, m, n);
I tested it on 512x256 and 256x512 matrices, let me know what you think.
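For checking a kernel like this on other sizes, a host-side reference that uses the same index mapping can be compared against the data copied back from the device. This is only a sketch: transpose_reference is a hypothetical helper name, and it assumes the same row-major layout as the kernel (a is m x n, c is n x m).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: transposes a row-major m x n matrix
// into a row-major n x m matrix.
std::vector<float> transpose_reference(const std::vector<float>& a,
                                       std::size_t m, std::size_t n) {
    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            c[j * m + i] = a[i * n + j];  // same mapping as in the kernel
        }
    }
    return c;
}
```

Comparing this result element by element with the kernel output for non-square sizes that are not multiples of the block size exercises both remainder branches.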

Find longest non-decreasing sequence

Given the following question,
Given an array of integers A of length n, find the longest sequence {i_1, ..., i_k} such that i_j < i_(j+1) and A[i_j] <= A[i_(j+1)] for any j in [1, k-1].
Here is my solution, is this correct?
max_start = 0; // store the final result
max_end = 0;
try_start = 0; // store the initial result
try_end = 0;
FOR i=0; i<(A.length-1); i++ DO
if A[i] <= A[i+1]
try_end = i+1; // satisfy the condition so move the ending point
else // now the condition is broken
if (try_end - try_start) > (max_end - max_start) // keep it if it is the maximum
max_end = try_end;
max_start = try_start;
endif
try_start = i+1; // reset the search
try_end = i+1;
endif
ENDFOR
// Checking the boundary conditions based on comments by Jason
if (try_end - try_start) > (max_end - max_start)
max_end = try_end;
max_start = try_start;
endif
Somehow, I don't think this is a correct solution, but I cannot find a counter-example that disproves it.
Can anyone help?
Thank you
I don't see any backtracking in your algorithm, and it seems to be suited for contiguous blocks of non-decreasing numbers. If I understand correctly, for the following input:
1 2 3 4 10 5 6 7
your algorithm would return 1 2 3 4 10 instead of 1 2 3 4 5 6 7.
Try to find a solution using dynamic programming.
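A minimal sketch of such a dynamic programming solution, assuming the simple O(n^2) formulation in which len[i] is the length of the longest non-decreasing subsequence ending at index i. The function name longest_non_decreasing is hypothetical, and when several subsequences have the same maximal length it returns just one of them.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the indices of one longest non-decreasing subsequence of a.
std::vector<std::size_t> longest_non_decreasing(const std::vector<int>& a) {
    const std::size_t n = a.size();
    if (n == 0) return {};
    std::vector<std::size_t> len(n, 1);   // len[i]: best length ending at i
    std::vector<std::size_t> prev(n, n);  // prev[i]: predecessor index, n means none
    std::size_t best = 0;                 // index where the best subsequence ends
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < i; ++j) {
            if (a[j] <= a[i] && len[j] + 1 > len[i]) {
                len[i] = len[j] + 1;
                prev[i] = j;
            }
        }
        if (len[i] > len[best]) best = i;
    }
    // Walk the predecessor links backwards, then reverse into index order.
    std::vector<std::size_t> idx;
    for (std::size_t i = best; i != n; i = prev[i]) idx.push_back(i);
    std::reverse(idx.begin(), idx.end());
    return idx;
}
```

For the input 1 2 3 4 10 5 6 7 this skips the 10 and returns the seven indices of 1 2 3 4 5 6 7.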
You're missing the case where the condition is not broken at its last iteration:
1, 3, 5, 2, 4, 6, 8, 10
You'll never promote try_start and try_end to max_start and max_end unless your condition is broken. You need to perform the same check at the end of the loop.
Well, it looks like you're finding the start and the end of a contiguous run, which may be correct but isn't what was asked. I'd start by reading http://en.wikipedia.org/wiki/Longest_increasing_subsequence - I believe this is the question that was asked and it's a fairly well-known problem. In general it cannot be solved in linear time, and it will also require some form of dynamic programming. (There's an easier n^2 variant of the algorithm on Wikipedia as well - just do a linear sweep instead of the binary search.)
#include <algorithm>
#include <iterator> // std::iterator_traits
#include <utility> // std::pair
#include <vector>
#include <stdio.h>
#include <string.h>
#include <assert.h>
template<class RandIter>
class CompM {
const RandIter X;
typedef typename std::iterator_traits<RandIter>::value_type value_type;
struct elem {
value_type c; // char type
explicit elem(value_type c) : c(c) {}
};
public:
elem operator()(value_type c) const { return elem(c); }
bool operator()(int a, int b) const { return X[a] < X[b]; } // for is_sorted
bool operator()(int a, elem b) const { return X[a] < b.c; } // for find
bool operator()(elem a, int b) const { return a.c < X[b]; } // for find
explicit CompM(const RandIter X) : X(X) {}
};
template<class RandContainer, class Key, class Compare>
int upper(const RandContainer& a, int n, const Key& k, const Compare& comp) {
return std::upper_bound(a.begin(), a.begin() + n, k, comp) - a.begin();
}
template<class RandIter>
std::pair<int,int> lis2(RandIter X, std::vector<int>& P)
{
int n = P.size(); assert(n > 0);
std::vector<int> M(n);
CompM<RandIter> comp(X);
int L = 0;
for (int i = 0; i < n; ++i) {
int j = upper(M, L, comp(X[i]), comp);
P[i] = (j > 0) ? M[j-1] : -1;
if (j == L) L++;
M[j] = i;
}
return std::pair<int,int>(L, M[L-1]);
}
int main(int argc, char** argv)
{
if (argc < 2) {
fprintf(stderr, "usage: %s string\n", argv[0]);
return 3;
}
const char* X = argv[1];
int n = strlen(X);
if (n == 0) {
fprintf(stderr, "param string must not be empty\n");
return 3;
}
std::vector<int> P(n), S(n), F(n);
std::pair<int,int> lt = lis2(X, P); // L and tail
int L = lt.first;
printf("Longest_increasing_subsequence:L=%d\n", L);
for (int i = lt.second; i >= 0; --i) {
if (!F[i]) {
int j, k = 0;
for (j = i; j != -1; j = P[j], ++k) {
S[k] = j;
F[j] = 1;
}
std::reverse(S.begin(), S.begin()+k);
for (j = 0; j < k; ++j)
printf("%c", X[S[j]]);
printf("\n");
}
}
return 0;
}

Resources