How to pack Boolean operations using gcc or other compilers?

Intel CPUs are capable of performing 512 or 1024 bitwise operations using vectorized operations. Assume I have a snippet of code that looks like this:
#include <stdio.h>
int main()
{
_Bool i0, i1, i2, i3, w0, w1, w2, w3, w4;
i0 = 1;
i1 = 1;
i2 = 0;
i3 = 0;
w0 = i0 & i1;
w1 = i1 & i2;
w2 = i0 & i3;
w3 = w0 & w1;
w4 = w1 & w2;
printf("%d %d %d %d\n", i0, i1, i2, i3);
printf("%d %d %d %d %d\n", w0, w1, w2, w3, w4);
return 0;
}
Does GCC or the Intel compiler vectorize this code automatically, or do I need to rewrite it to benefit from vectorization? Ideally, I would like the first three operations to be performed in parallel, and then the next two to be computed in parallel.
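One common way to get many Boolean operations out of a single instruction is bit-slicing: each variable becomes a machine word, and bit k of every word belongs to an independent input set k. The sketch below assumes that layout and uses 64-bit words. The names circuit_out and eval_circuit64 are made up for this illustration, and the same idea would carry over to wider SIMD registers. It evaluates 64 independent input sets at once; it does not run the three first-level ANDs of a single input set in parallel.

```c
#include <stdint.h>

/* Outputs of the five-gate circuit from the question.
   Bit k of each field is the result for input set k. */
typedef struct {
    uint64_t w0, w1, w2, w3, w4;
} circuit_out;

/* Each argument packs 64 independent values of one Boolean input:
   bit k of i0 is the value of i0 in input set k, and so on. */
circuit_out eval_circuit64(uint64_t i0, uint64_t i1, uint64_t i2, uint64_t i3)
{
    circuit_out o;
    o.w0 = i0 & i1;       /* 64 AND gates in one instruction */
    o.w1 = i1 & i2;
    o.w2 = i0 & i3;
    o.w3 = o.w0 & o.w1;   /* second level depends on the first */
    o.w4 = o.w1 & o.w2;
    return o;
}
```

Calling eval_circuit64(~0ULL, ~0ULL, 0, 0) reproduces the inputs from the question in all 64 lanes. Compilers typically auto-vectorize loops over arrays rather than a handful of scalar _Bool variables, so a data layout like this usually has to be chosen by hand.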

Related

Can someone please explain what this assembly code does?

codeA:
.data
N: .word 10
V: .word 90,50,40,20,30,10,80,70,60,100
.text
main:
li t2, 0 //t2 <-0
la t3, N //t3 <- #N
lw t3, 0(t3) //t3<- N
la t4, V //t4 <- V
addi a7, x0, 1 //a7 <- i <- 1
p1:
beq t2, t3, end
lw a0, 0(t4) // a0 <- V[?]
ecall
addi t2, t2, 1
addi t4, t4, 4
j p1
end:
I'm new to this and I'm struggling to understand what this code does.
I'm trying to "translate" it into C or some kind of pseudocode to make it more understandable, but I got stuck in the middle of it.
This is doing:
int N = 10;
int V [] = { 90, ... };
int main (void) {
for ( int i = 0; i < N; i++ ) {
printf ( "%d", V[i] );
}
}
That is, it prints all the elements of the array from index 0 up to but not including N.
However, it uses a pointer instead of array indexing, so in C it looks more like this:
int *p = V; // V decays to a pointer to its first element
for ( int i = 0; i < N; i++ ) {
printf ( "%d", *p );
p++; // adds 4 to p so it refers to the next int in the array
}
Note that the program does not actually call printf ( "%d" ... ); it sets a7 to 1 and issues an ecall, which in MARS/RARS selects syscall/ecall #1, the service that prints the integer in a0 as text to the screen. printf is simply the closest C equivalent of that ecall.
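For reference, here is a compilable sketch of the same loop, with each statement annotated with the instruction it stands for. The function name print_ints is made up for this illustration, and printf stands in for the print-integer ecall.

```c
#include <stdio.h>

/* Walks n ints starting at p and prints each one, the way the loop at
   label p1 does. Returns how many values were printed. */
int print_ints(const int *p, int n)
{
    int printed = 0;          /* li t2, 0 */
    while (printed != n) {    /* beq t2, t3, end */
        printf("%d", *p);     /* lw a0, 0(t4) followed by ecall with a7 = 1 */
        printed++;            /* addi t2, t2, 1 */
        p++;                  /* addi t4, t4, 4 (one int further on) */
    }
    return printed;
}
```

Like the assembly, this prints the numbers back to back with no separator between them.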

Accurate floating-point computation of the sum and difference of two products

The difference of two products and the sum of two products are two primitives found in a variety of common computations. diff_of_products (a,b,c,d) := ab - cd and sum_of_products(a,b,c,d) := ab + cd are closely-related companion functions that differ only by the sign of some of their operands. Examples for the use of these primitives are:
Computation of a complex multiplication with x = (a + i b) and y = (c + i d):
x*y = diff_of_products (a, c, b, d) + i sum_of_products (a, d, b, c)
Computation of the determinant of a 2x2 matrix: diff_of_products (a, d, b, c):
| a b |
| c d |
In a right triangle, computation of the length of the opposite cathetus from the hypotenuse h and adjacent cathetus a: sqrt (diff_of_products (h, h, a, a))
Computation of the two real solutions of a quadratic equation with positive discriminant:
q = -(b + copysign (sqrt (diff_of_products (b, b, 4a, c)), b)) / 2
x0 = q / a
x1 = c / q
Computation of a 3D cross product a = b ⨯ c:
ax = diff_of_products (by, cz, bz, cy)
ay = diff_of_products (bz, cx, bx, cz)
az = diff_of_products (bx, cy, by, cx)
When computing with IEEE-754 binary floating-point formats, besides obvious issues with potential overflow and underflow, naive implementations of either function can suffer from catastrophic cancellation when the two products are similar in magnitude but of opposite signs for sum_of_products() or same sign for diff_of_products().
Focusing only on the accuracy aspect, how can one implement these functions robustly in the context of IEEE-754 binary arithmetic? The availability of fused multiply-add operations can be assumed, as this operation is supported by most modern processor architectures and exposed, via standard functions, in many programming languages. Without loss of generality, discussion can be restricted to single precision (IEEE-754 binary32) format for ease of exposition and testing.
The utility of the fused multiply-add (FMA) operation in providing protection against subtractive cancellation stems from the participation of the full double-width product in the final addition. To my knowledge, the first public record of its utility for accurately and robustly computing the solutions of quadratic equations is found in two sets of informal notes by renowned floating-point expert William Kahan:
William Kahan, "Matlab’s Loss is Nobody’s Gain". August 1998, revised July 2004 (online)
William Kahan, "On the Cost of Floating-Point Computation Without Extra-Precise Arithmetic". November 2004 (online)
The standard work on numerical computing by Higham was the first in which I encountered Kahan's algorithm applied to the computation of the determinant of a 2x2 matrix (p. 65):
Nicholas J. Higham, "Accuracy and Stability of Numerical Algorithms", SIAM 1996
A different algorithm for the computation of ab+cd, also based on FMA, was published by three Intel researchers in the context of Intel's first CPU with FMA support, the Itanium processor (p. 273):
Marius Cornea, John Harrison, and Ping Tak Peter Tang: "Scientific Computing on Itanium-based Systems." Intel Press 2002
In recent years, four papers by French researchers examined both algorithms in detail and provided error bounds with mathematical proofs. For binary floating-point arithmetic, provided there is no overflow or underflow in intermediate computation, the maximum relative error of both Kahan's algorithm and the Cornea-Harrison-Tang (CHT) algorithm was shown to be asymptotically twice the unit round-off, that is, 2u. For IEEE-754 binary32 (single precision) this error bound is 2^-23, and for IEEE-754 binary64 (double precision) it is 2^-52.
Furthermore it was shown that the error in Kahan's algorithm is at most 1.5 ulps for binary floating-point arithmetic. From the literature I am not aware of an equivalent result, that is, a proven ulp error bound, for the CHT algorithm. My own experiments using the code below suggest an error bound of 1.25 ulp.
Sylvie Boldo, "Kahan’s algorithm for a correct discriminant computation at last formally proven",
IEEE Transactions on Computers, Vol. 58, No. 2, February 2009, pp. 220-225 (online)
Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller, "Further Analysis of Kahan's Algorithm for the Accurate Computation of 2x2 Determinants", Mathematics of Computation, Vol. 82, No. 284, Oct. 2013, pp. 2245-2264 (online)
Jean-Michel Muller, "On the Error of Computing ab+cd using Cornea, Harrison and Tang's Method", ACM Transactions on Mathematical Software, Vol. 41, No.2, January 2015, Article 7 (online)
Claude-Pierre Jeannerod, "A Radix-Independent Error Analysis of the Cornea-Harrison-Tang Method", ACM Transactions on Mathematical Software Vol. 42, No. 3, May 2016, Article 19 (online)
While Kahan's algorithm requires four floating-point operations, two of which are FMAs, the CHT algorithm requires seven floating-point operations, two of which are FMAs. I constructed the test framework below to explore what other trade-offs may exist. I experimentally confirmed the bounds from the literature on the relative error of both algorithms and the ulp error of Kahan's algorithm. My experiments indicate that the CHT algorithm provides a smaller ulp error bound of 1.25 ulp, but that it also produces incorrectly-rounded results at roughly twice the rate of Kahan's algorithm.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <float.h>
#include <math.h>
#define TEST_SUM (0) // function under test. 0: a*b-c*d; 1: a*b+c*d
#define USE_CHT (0) // algorithm. 0: Kahan; 1: Cornea-Harrison-Tang
/*
Compute a*b-c*d with error <= 1.5 ulp. Maximum relative err = 2**-23
Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller,
"Further Analysis of Kahan's Algorithm for the Accurate Computation
of 2x2 Determinants", Mathematics of Computation, Vol. 82, No. 284,
Oct. 2013, pp. 2245-2264
*/
float diff_of_products_kahan (float a, float b, float c, float d)
{
float w = d * c;
float e = fmaf (c, -d, w);
float f = fmaf (a, b, -w);
return f + e;
}
/*
Compute a*b-c*d with error <= 1.25 ulp (?). Maximum relative err = 2**-23
Claude-Pierre Jeannerod, "A Radix-Independent Error Analysis of the
Cornea-Harrison-Tang Method", ACM Transactions on Mathematical Software
Vol. 42, No. 3, Article 19 (May 2016).
*/
float diff_of_products_cht (float a, float b, float c, float d)
{
float p1 = a * b;
float p2 = c * d;
float e1 = fmaf (a, b, -p1);
float e2 = fmaf (c, -d, p2);
float r = p1 - p2;
float e = e1 + e2;
return r + e;
}
/*
Compute a*b+c*d with error <= 1.5 ulp. Maximum relative err = 2**-23
Jean-Michel Muller, "On the Error of Computing ab+cd using Cornea,
Harrison and Tang's Method", ACM Transactions on Mathematical Software,
Vol. 41, No.2, Article 7, (January 2015)
*/
float sum_of_products_kahan (float a, float b, float c, float d)
{
float w = c * d;
float e = fmaf (c, -d, w);
float f = fmaf (a, b, w);
return f - e;
}
/*
Compute a*b+c*d with error <= 1.25 ulp (?). Maximum relative err = 2**-23
Claude-Pierre Jeannerod, "A Radix-Independent Error Analysis of the
Cornea-Harrison-Tang Method", ACM Transactions on Mathematical Software
Vol. 42, No. 3, Article 19 (May 2016).
*/
float sum_of_products_cht (float a, float b, float c, float d)
{
float p1 = a * b;
float p2 = c * d;
float e1 = fmaf (a, b, -p1);
float e2 = fmaf (c, d, -p2);
float r = p1 + p2;
float e = e1 + e2;
return r + e;
}
// Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579) /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)
typedef struct {
double y;
double x;
} dbldbl;
dbldbl make_dbldbl (double head, double tail)
{
dbldbl z;
z.x = tail;
z.y = head;
return z;
}
dbldbl add_dbldbl (dbldbl a, dbldbl b) {
dbldbl z;
double t1, t2, t3, t4, t5;
t1 = a.y + b.y;
t2 = t1 - a.y;
t3 = (a.y + (t2 - t1)) + (b.y - t2);
t4 = a.x + b.x;
t2 = t4 - a.x;
t5 = (a.x + (t2 - t4)) + (b.x - t2);
t3 = t3 + t4;
t4 = t1 + t3;
t3 = (t1 - t4) + t3;
t3 = t3 + t5;
z.y = t4 + t3;
z.x = (t4 - z.y) + t3;
return z;
}
dbldbl sub_dbldbl (dbldbl a, dbldbl b)
{
dbldbl z;
double t1, t2, t3, t4, t5;
t1 = a.y - b.y;
t2 = t1 - a.y;
t3 = (a.y + (t2 - t1)) - (b.y + t2);
t4 = a.x - b.x;
t2 = t4 - a.x;
t5 = (a.x + (t2 - t4)) - (b.x + t2);
t3 = t3 + t4;
t4 = t1 + t3;
t3 = (t1 - t4) + t3;
t3 = t3 + t5;
z.y = t4 + t3;
z.x = (t4 - z.y) + t3;
return z;
}
dbldbl mul_dbldbl (dbldbl a, dbldbl b)
{
dbldbl t, z;
t.y = a.y * b.y;
t.x = fma (a.y, b.y, -t.y);
t.x = fma (a.x, b.x, t.x);
t.x = fma (a.y, b.x, t.x);
t.x = fma (a.x, b.y, t.x);
z.y = t.y + t.x;
z.x = (t.y - z.y) + t.x;
return z;
}
double prod_diff_ref (float a, float b, float c, float d)
{
dbldbl t = sub_dbldbl (
mul_dbldbl (make_dbldbl ((double)a, 0), make_dbldbl ((double)b, 0)),
mul_dbldbl (make_dbldbl ((double)c, 0), make_dbldbl ((double)d, 0))
);
return t.x + t.y;
}
double prod_sum_ref (float a, float b, float c, float d)
{
dbldbl t = add_dbldbl (
mul_dbldbl (make_dbldbl ((double)a, 0), make_dbldbl ((double)b, 0)),
mul_dbldbl (make_dbldbl ((double)c, 0), make_dbldbl ((double)d, 0))
);
return t.x + t.y;
}
float __uint32_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof r);
return r;
}
uint32_t __float_as_uint32 (float a)
{
uint32_t r;
memcpy (&r, &a, sizeof r);
return r;
}
uint64_t __double_as_uint64 (double a)
{
uint64_t r;
memcpy (&r, &a, sizeof r);
return r;
}
static double floatUlpErr (float res, double ref)
{
uint64_t i, j, err;
int expoRef;
/* ulp error cannot be computed if either operand is NaN, infinity, zero */
if (isnan(res) || isnan (ref) || isinf(res) || isinf (ref) ||
(res == 0.0f) || (ref == 0.0f)) {
return 0.0;
}
/* Convert the float result to an "extended float". This is like a float
with 56 instead of 24 effective mantissa bits.
*/
i = ((uint64_t)__float_as_uint32(res)) << 32;
/* Convert the double reference to an "extended float". If the reference is
>= 2^129, we need to clamp to the maximum "extended float". If reference
is < 2^-126, we need to denormalize because of float's limited exponent
range.
*/
expoRef = (int)(((__double_as_uint64(ref) >> 52) & 0x7ff) - 1023);
if (expoRef >= 129) {
j = (__double_as_uint64(ref) & 0x8000000000000000ULL) |
0x7fffffffffffffffULL;
} else if (expoRef < -126) {
j = ((__double_as_uint64(ref) << 11) | 0x8000000000000000ULL) >> 8;
j = j >> (-(expoRef + 126));
j = j | (__double_as_uint64(ref) & 0x8000000000000000ULL);
} else {
j = ((__double_as_uint64(ref) << 11) & 0x7fffffffffffffffULL) >> 8;
j = j | ((uint64_t)(expoRef + 127) << 55);
j = j | (__double_as_uint64(ref) & 0x8000000000000000ULL);
}
err = (i < j) ? (j - i) : (i - j);
return err / 4294967296.0;
}
int main (void)
{
const float ULMT = sqrtf (FLT_MAX) / 2; // avoid overflow
const float LLMT = sqrtf (FLT_MIN) * 2; // avoid underflow
const uint64_t N = 1ULL << 38;
double ref, ulp, relerr, maxrelerr = 0, maxulp = 0;
uint64_t count = 0LL, incorrectly_rounded = 0LL;
uint32_t ai, bi, ci, di;
float af, bf, cf, df, resf;
#if TEST_SUM
printf ("testing a*b+c*d ");
#else
printf ("testing a*b-c*d ");
#endif // TEST_SUM
#if USE_CHT
printf ("using Cornea-Harrison-Tang algorithm\n");
#else
printf ("using Kahan algorithm\n");
#endif
do {
do {
ai = KISS;
af = __uint32_as_float (ai);
} while (!isfinite(af) || (fabsf (af) > ULMT) || (fabsf (af) < LLMT));
do {
bi = KISS;
bf = __uint32_as_float (bi);
} while (!isfinite(bf) || (fabsf (bf) > ULMT) || (fabsf (bf) < LLMT));
do {
ci = KISS;
cf = __uint32_as_float (ci);
} while (!isfinite(cf) || (fabsf (cf) > ULMT) || (fabsf (cf) < LLMT));
do {
di = KISS;
df = __uint32_as_float (di);
} while (!isfinite(df) || (fabsf (df) > ULMT) || (fabsf (df) < LLMT));
count++;
#if TEST_SUM
#if USE_CHT
resf = sum_of_products_cht (af, bf, cf, df);
#else // USE_CHT
resf = sum_of_products_kahan (af, bf, cf, df);
#endif // USE_CHT
ref = prod_sum_ref (af, bf, cf, df);
#else // TEST_SUM
#if USE_CHT
resf = diff_of_products_cht (af, bf, cf, df);
#else // USE_CHT
resf = diff_of_products_kahan (af, bf, cf, df);
#endif // USE_CHT
ref = prod_diff_ref (af, bf, cf, df);
#endif // TEST_SUM
ulp = floatUlpErr (resf, ref);
incorrectly_rounded += ulp > 0.5;
relerr = fabs ((resf - ref) / ref);
if ((ulp > maxulp) || ((ulp == maxulp) && (relerr > maxrelerr))) {
maxulp = ulp;
maxrelerr = relerr;
printf ("%13llu %12llu ulp=%.9f a=% 15.8e b=% 15.8e c=% 15.8e d=% 15.8e res=% 16.6a ref=% 23.13a relerr=%13.9e\n",
(unsigned long long)count, (unsigned long long)incorrectly_rounded, ulp, af, bf, cf, df, resf, ref, relerr);
}
} while (count <= N);
return EXIT_SUCCESS;
}

How to calculate log determinant of an Armadillo sparse matrix efficiently

I'm trying to write an MCMC procedure using RcppArmadillo which involves computing log determinants of sparse matrices of roughly 30,000 x 30,000. It seems that log_det() in Armadillo does not currently support sp_mat, so I'm doing something like this:
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::depends(RcppEigen)]]
#include <RcppArmadillo.h>
#include <RcppEigen.h>
using namespace arma;
double eigen_ldet(sp_mat arma_mat) {
Eigen::SparseMatrix<double> eigen_s = Rcpp::as<Eigen::SparseMatrix<double>>(Rcpp::wrap(arma_mat));
Eigen::SparseLU<Eigen::SparseMatrix<double>> solver;
solver.compute(eigen_s);
double det = solver.logAbsDeterminant();
return det;
}
I feel it is really crappy and slow. Any help would be much appreciated.
Edit:
Here is the mockup:
library(Matrix)
m_mat = function(i = 1688, j = 18, rho = 0.5, alp = 0.5){
w1 = matrix(runif(i^2),nrow = i, ncol = i)
w2 = matrix(runif(j^2),nrow = j, ncol = j)
w1 = w1/rowSums(w1)
w2 = w2/rowSums(w2)
diag(w1) = 0
diag(w2) = 0
w1 = diag(i) - rho*w1
w2 = diag(j) - alp*w2
w1 = kronecker(Matrix(diag(j)), w1)
w2 = kronecker(Matrix(diag(i)), w2)
ind = matrix(c(rep(seq(1,i), each = j), rep(seq(1,j),i)), ncol = 2)
w2 = cbind(ind, w2)
w2 = w2[order(w2[,2]),]
w2 = t(w2[, -c(1,2)])
w2 = cbind(as.matrix(ind), w2)
w2 = w2[order(w2[,2]),]
w2 = t(w2[, -c(1,2)])
return(w1 + w2)
}
Edit2: Here is the second mockup with a sparse w1:
m_mat2 = function(i = 1688, j = 18, nb = 4, range = 10, rho = 0.5, alp = 0.5){
w1 = Matrix(0, nrow = i, ncol = i)
for ( h in 1:i){
rnd = as.integer(rnorm(nb, h, range))
rnd = ifelse(rnd > 0 & rnd <= i, rnd, h)
rnd = unique(rnd)
w1[h, rnd] = 1
}
w1 = w1/rowSums(w1)
w2 = matrix(runif(j^2),nrow = j, ncol = j)
w2 = w2/rowSums(w2)
diag(w1) = 0
diag(w2) = 0
w1 = diag(i) - rho*w1
w2 = diag(j) - alp*w2
w1 = kronecker(Matrix(diag(j)), w1)
w2 = kronecker(Matrix(diag(i)), w2)
ind = matrix(c(rep(seq(1,i), each = j), rep(seq(1,j),i)), ncol = 2)
w2 = cbind(ind, w2)
w2 = w2[order(w2[,2]),]
w2 = t(w2[, -c(1,2)])
w2 = cbind(as.matrix(ind), w2)
w2 = w2[order(w2[,2]),]
w2 = t(w2[, -c(1,2)])
return(w1 + w2)
}
An actual sparse w1 case should be much more irregular, but it takes about the same time to calculate (by the above code) the determinant of this one as using an actual w1.
From my experiments I find that the conversion from Armadillo to Eigen matrix is quite fast. Most of the time is spent in solver.compute(). I do not know if there are any faster algorithms to determine the log determinant of a sparse matrix, but I have found an approximation that is at least applicable to your mock-up: Only use the (dense) block-diagonal (see e.g. here for ways to include other parts of the matrix). If an approximate solution is sufficient, this is quite good and fast:
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::depends(RcppEigen)]]
#include <RcppArmadillo.h>
#include <RcppEigen.h>
#include <Rcpp/Benchmark/Timer.h>
using namespace arma;
// [[Rcpp::export]]
double arma_sldet(sp_mat arma_mat, int blocks, int size) {
double ldet = 0.0;
double val = 0.0;
double sign = 0.0;
for (int i = 0; i < blocks; ++i) {
int begin = i * size;
int end = (i + 1) * size - 1;
sp_mat sblock = arma_mat.submat(begin, begin, end, end);
mat dblock(sblock);
log_det(val, sign, dblock);
ldet += val;
}
return ldet;
}
// [[Rcpp::export]]
Rcpp::List eigen_ldet(sp_mat arma_mat) {
Rcpp::Timer timer;
timer.step("start");
Eigen::SparseMatrix<double> eigen_s = Rcpp::as<Eigen::SparseMatrix<double>>(Rcpp::wrap(arma_mat));
timer.step("conversion");
Eigen::SparseLU<Eigen::SparseMatrix<double>> solver;
solver.compute(eigen_s);
timer.step("solver");
double det = solver.logAbsDeterminant();
timer.step("log_det");
Rcpp::NumericVector res(timer);
return Rcpp::List::create(Rcpp::Named("log_det") = det,
Rcpp::Named("timer") = res);
}
/*** R
library(Matrix)
m_mat = function(i = 1688, j = 18, rho = 0.5, alp = 0.5){
w1 = matrix(runif(i^2),nrow = i, ncol = i)
w2 = matrix(runif(j^2),nrow = j, ncol = j)
w1 = w1/rowSums(w1)
w2 = w2/rowSums(w2)
diag(w1) = 0
diag(w2) = 0
w1 = diag(i) - rho*w1
w2 = diag(j) - alp*w2
w1 = kronecker(Matrix(diag(j)), w1)
w2 = kronecker(Matrix(diag(i)), w2)
ind = matrix(c(rep(seq(1,i), each = j), rep(seq(1,j),i)), ncol = 2)
w2 = cbind(ind, w2)
w2 = w2[order(w2[,2]),]
w2 = t(w2[, -c(1,2)])
w2 = cbind(as.matrix(ind), w2)
w2 = w2[order(w2[,2]),]
w2 = t(w2[, -c(1,2)])
return(w1 + w2)
}
m <- m_mat(i = 200)
system.time(eigen <- eigen_ldet(m))
system.time(arma <- arma_sldet(m, 18, 200))
diff(eigen$timer)/1000000
all.equal(eigen$log_det, arma)
m <- m_mat()
#eigen_ldet(m) # takes too long ...
system.time(arma <- arma_sldet(m, 18, 1688))
*/
Results for a smaller mock-up:
> m <- m_mat(i = 200)
> system.time(eigen <- eigen_ldet(m))
user system elapsed
3.703 0.049 3.751
> system.time(arma <- arma_sldet(m, 18, 200))
user system elapsed
0.059 0.012 0.019
> diff(eigen$timer)/1000000
conversion solver log_det
5.208586 3738.131168 0.582578
> all.equal(eigen$log_det, arma)
[1] "Mean relative difference: 0.002874847"
The approximate solution is very close and much faster. We also see the timing distribution for the exact solution.
Results for the full mock-up:
> m <- m_mat()
> #eigen_ldet(m) # takes too long ...
> system.time(arma <- arma_sldet(m, 18, 1688))
user system elapsed
5.965 2.529 2.578
An even faster approximation can be achieved when only considering the diagonal:
// [[Rcpp::export]]
double arma_sldet_diag(sp_mat arma_mat) {
vec d(arma_mat.diag());
return sum(log(d));
}
If you have plenty of memory on your machine (say 32+ GB) and a fast implementation of LAPACK (for example OpenBLAS or Intel MKL), a quick and dirty way is to convert the sparse matrix into a dense matrix and compute the log determinant on the dense matrix.
Example:
sp_mat X = sprandu(30000,30000,0.01);
cx_double log_result = log_det( mat(X) );
While this obviously takes lots of memory (a dense 30,000 x 30,000 matrix of doubles occupies about 7.2 GB), the advantage is that it avoids time-consuming sparse solvers and factorizations. OpenBLAS or MKL will also take advantage of multiple cores.

Using 2.4" MCUFriend TFT LCD Display with Arduino

I am hoping that someone is familiar with the 2.4" TFT LCD Display board from MCUFriend. I am having trouble using this board with my Arduino Uno and I was hoping someone could help.
The problem is that colored lines are drawn all over the screen after a reset and initialization. Right now all I am trying to do is fill the screen and draw a box. Here is my code:
#include <Adafruit_GFX.h>
#include <TouchScreen.h>
#include <Adafruit_TFTLCD.h>
// LCD control pins (8-bit parallel interface, not SPI)
#define LCD_CS A3
#define LCD_CD A2
#define LCD_WR A1
#define LCD_RD A0
#define LCD_RESET A4
// Color definitions
#define BLACK 0x0000
#define WHITE 0xFFFF
#define BOXSIZE 40
Adafruit_TFTLCD tft(LCD_CS, LCD_CD, LCD_WR, LCD_RD, LCD_RESET);
void setup() {
Serial.begin(9600);
tft.reset();
tft.begin();
tft.fillScreen(BLACK);
tft.drawRect(100, 100, BOXSIZE, BOXSIZE, WHITE);
}
void loop() {
}
This is what my Screen is doing:
As you can see, the background is black, and a box is being drawn behind these colored bars.
Any help would be greatly appreciated!!
Thank you very very much!
I'm new to Arduino myself, but I have the same screen and it works perfectly. Your problem is probably that the TFT shield is shorting against the top of the Arduino's USB connector; put something non-conductive there and reset. If you're still having trouble, try removing the shield and watch each pin as you reinsert it to make sure they all go into the correct pins: LCD_02 should be in digital pin 2.
Here is the code I used for testing; it uses the same library. Hope that helps.
#include <Adafruit_GFX.h> // Core graphics library
#include <Adafruit_TFTLCD.h> // Hardware-specific library
// The control pins for the LCD can be assigned to any digital or
// analog pins...but we'll use the analog pins as this allows us to
// double up the pins with the touch screen (see the TFT paint example).
#define LCD_CS A3 // Chip Select goes to Analog 3
#define LCD_CD A2 // Command/Data goes to Analog 2
#define LCD_WR A1 // LCD Write goes to Analog 1
#define LCD_RD A0 // LCD Read goes to Analog 0
#define LCD_RESET A4 // Can alternately just connect to Arduino's reset pin
// When using the BREAKOUT BOARD only, use these 8 data lines to the LCD:
// For the Arduino Uno, Duemilanove, Diecimila, etc.:
// D0 connects to digital pin 8 (Notice these are
// D1 connects to digital pin 9 NOT in order!)
// D2 connects to digital pin 2
// D3 connects to digital pin 3
// D4 connects to digital pin 4
// D5 connects to digital pin 5
// D6 connects to digital pin 6
// D7 connects to digital pin 7
// For the Arduino Mega, use digital pins 22 through 29
// (on the 2-row header at the end of the board).
// Assign human-readable names to some common 16-bit color values:
#define BLACK 0x0000
#define BLUE 0x001F
#define RED 0xF800
#define GREEN 0x07E0
#define CYAN 0x07FF
#define MAGENTA 0xF81F
#define YELLOW 0xFFE0
#define WHITE 0xFFFF
Adafruit_TFTLCD tft(LCD_CS, LCD_CD, LCD_WR, LCD_RD, LCD_RESET);
void setup(void) {
Serial.begin(9600);
Serial.println(F("TFT LCD test"));
#ifdef USE_ADAFRUIT_SHIELD_PINOUT
Serial.println(F("Using Adafruit 2.8\" TFT Arduino Shield Pinout"));
#else
Serial.println(F("Using Adafruit 2.8\" TFT Breakout Board Pinout"));
#endif
Serial.print("TFT size is "); Serial.print(tft.width()); Serial.print("x"); Serial.println(tft.height());
tft.reset();
uint16_t identifier = tft.readID();
if(identifier == 0x9325) {
Serial.println(F("Found ILI9325 LCD driver"));
} else if(identifier == 0x9327) {
Serial.println(F("Found ILI9327 LCD driver"));
} else if(identifier == 0x9328) {
Serial.println(F("Found ILI9328 LCD driver"));
} else if(identifier == 0x7575) {
Serial.println(F("Found HX8347G LCD driver"));
} else if(identifier == 0x9341) {
Serial.println(F("Found ILI9341 LCD driver"));
} else if(identifier == 0x8357) {
Serial.println(F("Found HX8357D LCD driver"));
} else if(identifier == 0x0154) {
Serial.println(F("Found S6D0154 LCD driver"));
} else {
Serial.print(F("Unknown LCD driver chip: "));
Serial.println(identifier, HEX);
Serial.println(F("If using the Adafruit 2.8\" TFT Arduino shield, the line:"));
Serial.println(F(" #define USE_ADAFRUIT_SHIELD_PINOUT"));
Serial.println(F("should appear in the library header (Adafruit_TFT.h)."));
Serial.println(F("If using the breakout board, it should NOT be #defined!"));
Serial.println(F("Also if using the breakout, double-check that all wiring"));
Serial.println(F("matches the tutorial."));
return;
}
tft.begin(identifier);
Serial.println(F("Benchmark Time (microseconds)"));
Serial.print(F("Screen fill "));
Serial.println(testFillScreen());
delay(500);
Serial.print(F("Text "));
Serial.println(testText());
delay(3000);
Serial.print(F("Lines "));
Serial.println(testLines(CYAN));
delay(500);
Serial.print(F("Horiz/Vert Lines "));
Serial.println(testFastLines(RED, BLUE));
delay(500);
Serial.print(F("Rectangles (outline) "));
Serial.println(testRects(GREEN));
delay(500);
Serial.print(F("Rectangles (filled) "));
Serial.println(testFilledRects(YELLOW, MAGENTA));
delay(500);
Serial.print(F("Circles (filled) "));
Serial.println(testFilledCircles(10, MAGENTA));
Serial.print(F("Circles (outline) "));
Serial.println(testCircles(10, WHITE));
delay(500);
Serial.print(F("Triangles (outline) "));
Serial.println(testTriangles());
delay(500);
Serial.print(F("Triangles (filled) "));
Serial.println(testFilledTriangles());
delay(500);
Serial.print(F("Rounded rects (outline) "));
Serial.println(testRoundRects());
delay(500);
Serial.print(F("Rounded rects (filled) "));
Serial.println(testFilledRoundRects());
delay(500);
Serial.println(F("Done!"));
}
void loop(void) {
for(uint8_t rotation=0; rotation<4; rotation++) {
tft.setRotation(rotation);
testText();
delay(2000);
}
}
unsigned long testFillScreen() {
unsigned long start = micros();
tft.fillScreen(BLACK);
tft.fillScreen(RED);
tft.fillScreen(GREEN);
tft.fillScreen(BLUE);
tft.fillScreen(BLACK);
return micros() - start;
}
unsigned long testText() {
tft.fillScreen(BLACK);
unsigned long start = micros();
tft.setCursor(0, 0);
tft.setTextColor(WHITE); tft.setTextSize(1);
tft.println("Hello World!");
tft.setTextColor(YELLOW); tft.setTextSize(2);
tft.println(1234.56);
tft.setTextColor(RED); tft.setTextSize(3);
tft.println(0xDEADBEEF, HEX);
tft.println();
tft.setTextColor(GREEN);
tft.setTextSize(5);
tft.println("Groop");
tft.setTextSize(2);
tft.println("I implore thee,");
tft.setTextSize(1);
tft.println("my foonting turlingdromes.");
tft.println("And hooptiously drangle me");
tft.println("with crinkly bindlewurdles,");
tft.println("Or I will rend thee");
tft.println("in the gobberwarts");
tft.println("with my blurglecruncheon,");
tft.println("see if I don't!");
return micros() - start;
}
unsigned long testLines(uint16_t color) {
unsigned long start, t;
int x1, y1, x2, y2,
w = tft.width(),
h = tft.height();
tft.fillScreen(BLACK);
x1 = y1 = 0;
y2 = h - 1;
start = micros();
for(x2=0; x2<w; x2+=6) tft.drawLine(x1, y1, x2, y2, color);
x2 = w - 1;
for(y2=0; y2<h; y2+=6) tft.drawLine(x1, y1, x2, y2, color);
t = micros() - start; // fillScreen doesn't count against timing
tft.fillScreen(BLACK);
x1 = w - 1;
y1 = 0;
y2 = h - 1;
start = micros();
for(x2=0; x2<w; x2+=6) tft.drawLine(x1, y1, x2, y2, color);
x2 = 0;
for(y2=0; y2<h; y2+=6) tft.drawLine(x1, y1, x2, y2, color);
t += micros() - start;
tft.fillScreen(BLACK);
x1 = 0;
y1 = h - 1;
y2 = 0;
start = micros();
for(x2=0; x2<w; x2+=6) tft.drawLine(x1, y1, x2, y2, color);
x2 = w - 1;
for(y2=0; y2<h; y2+=6) tft.drawLine(x1, y1, x2, y2, color);
t += micros() - start;
tft.fillScreen(BLACK);
x1 = w - 1;
y1 = h - 1;
y2 = 0;
start = micros();
for(x2=0; x2<w; x2+=6) tft.drawLine(x1, y1, x2, y2, color);
x2 = 0;
for(y2=0; y2<h; y2+=6) tft.drawLine(x1, y1, x2, y2, color);
return micros() - start;
}
unsigned long testFastLines(uint16_t color1, uint16_t color2) {
unsigned long start;
int x, y, w = tft.width(), h = tft.height();
tft.fillScreen(BLACK);
start = micros();
for(y=0; y<h; y+=5) tft.drawFastHLine(0, y, w, color1);
for(x=0; x<w; x+=5) tft.drawFastVLine(x, 0, h, color2);
return micros() - start;
}
unsigned long testRects(uint16_t color) {
unsigned long start;
int n, i, i2,
cx = tft.width() / 2,
cy = tft.height() / 2;
tft.fillScreen(BLACK);
n = min(tft.width(), tft.height());
start = micros();
for(i=2; i<n; i+=6) {
i2 = i / 2;
tft.drawRect(cx-i2, cy-i2, i, i, color);
}
return micros() - start;
}
unsigned long testFilledRects(uint16_t color1, uint16_t color2) {
unsigned long start, t = 0;
int n, i, i2,
cx = tft.width() / 2 - 1,
cy = tft.height() / 2 - 1;
tft.fillScreen(BLACK);
n = min(tft.width(), tft.height());
for(i=n; i>0; i-=6) {
i2 = i / 2;
start = micros();
tft.fillRect(cx-i2, cy-i2, i, i, color1);
t += micros() - start;
// Outlines are not included in timing results
tft.drawRect(cx-i2, cy-i2, i, i, color2);
}
return t;
}
unsigned long testFilledCircles(uint8_t radius, uint16_t color) {
unsigned long start;
int x, y, w = tft.width(), h = tft.height(), r2 = radius * 2;
tft.fillScreen(BLACK);
start = micros();
for(x=radius; x<w; x+=r2) {
for(y=radius; y<h; y+=r2) {
tft.fillCircle(x, y, radius, color);
}
}
return micros() - start;
}
unsigned long testCircles(uint8_t radius, uint16_t color) {
unsigned long start;
int x, y, r2 = radius * 2,
w = tft.width() + radius,
h = tft.height() + radius;
// Screen is not cleared for this one -- this is
// intentional and does not affect the reported time.
start = micros();
for(x=0; x<w; x+=r2) {
for(y=0; y<h; y+=r2) {
tft.drawCircle(x, y, radius, color);
}
}
return micros() - start;
}
unsigned long testTriangles() {
unsigned long start;
int n, i, cx = tft.width() / 2 - 1,
cy = tft.height() / 2 - 1;
tft.fillScreen(BLACK);
n = min(cx, cy);
start = micros();
for(i=0; i<n; i+=5) {
tft.drawTriangle(
cx , cy - i, // peak
cx - i, cy + i, // bottom left
cx + i, cy + i, // bottom right
tft.color565(0, 0, i));
}
return micros() - start;
}
unsigned long testFilledTriangles() {
unsigned long start, t = 0;
int i, cx = tft.width() / 2 - 1,
cy = tft.height() / 2 - 1;
tft.fillScreen(BLACK);
start = micros();
for(i=min(cx,cy); i>10; i-=5) {
start = micros();
tft.fillTriangle(cx, cy - i, cx - i, cy + i, cx + i, cy + i,
tft.color565(0, i, i));
t += micros() - start;
tft.drawTriangle(cx, cy - i, cx - i, cy + i, cx + i, cy + i,
tft.color565(i, i, 0));
}
return t;
}
unsigned long testRoundRects() {
unsigned long start;
int w, i, i2,
cx = tft.width() / 2 - 1,
cy = tft.height() / 2 - 1;
tft.fillScreen(BLACK);
w = min(tft.width(), tft.height());
start = micros();
for(i=0; i<w; i+=6) {
i2 = i / 2;
tft.drawRoundRect(cx-i2, cy-i2, i, i, i/8, tft.color565(i, 0, 0));
}
return micros() - start;
}
unsigned long testFilledRoundRects() {
unsigned long start;
int i, i2,
cx = tft.width() / 2 - 1,
cy = tft.height() / 2 - 1;
tft.fillScreen(BLACK);
start = micros();
for(i=min(tft.width(), tft.height()); i>20; i-=6) {
i2 = i / 2;
tft.fillRoundRect(cx-i2, cy-i2, i, i, i/8, tft.color565(0, i, 0));
}
return micros() - start;
}
In the line:
uint16_t identifier = tft.readID();
you can hard-code your LCD driver ID like this:
uint16_t identifier = 0x7575; // Change to match your LCD driver ID; check the serial output of Serial.println(identifier, HEX)
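As a side note on the 16-bit color constants used in both sketches (0xF800 for RED, 0x07E0 for GREEN, 0x001F for BLUE): they are RGB565 values, with 5 bits of red, 6 bits of green and 5 bits of blue. The helper below is a hypothetical stand-alone C function written for illustration, not part of the library, that packs 8-bit channels into that layout.

```c
#include <stdint.h>

/* Pack 8-bit R, G, B into one 16-bit RGB565 value:
   bits 15..11 red, bits 10..5 green, bits 4..0 blue. */
uint16_t rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r & 0xF8) << 8) | ((g & 0xFC) << 3) | (b >> 3));
}
```

For example, rgb565(255, 255, 0) gives 0xFFE0, which matches the YELLOW define above.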

SSE floating point dot product for dummies

I have read many SO questions about SSE/SIMD (e.g., Getting started with SSE), but I'm still confused by all of it. All I want is a dot product between two double precision floating-point vectors, in C (C99 FWIW). I'm using GCC.
Can someone post a simple and complete example, including how to convert double vectors to the SSE types and back again?
[Edit 2012-10-08]
Here's some SSE2 code I managed to cobble together, critiques?
#include <emmintrin.h>
#include <stdlib.h> /* calloc, free */
double dotprod(double *restrict a, double *restrict b, int n)
{
__m128d aa, bb, cc, ss;
int i, n1 = n - 1;
double *s = calloc(2, sizeof(double));
double s2 = 0;
ss = _mm_set1_pd(0);
for(i = 0 ; i < n1 ; i += 2)
{
aa = _mm_load_pd(a + i);
bb = _mm_load_pd(b + i);
cc = _mm_mul_pd(aa, bb);
ss = _mm_add_pd(ss, cc);
}
_mm_store_pd(s, ss);
s2 = s[0] + s[1];
if(i < n)
s2 += a[i] * b[i];
free(s);
return s2;
}
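A few things in the code above are worth a second look. _mm_load_pd and _mm_store_pd require 16-byte-aligned addresses, which neither the callers' arrays nor calloc are guaranteed to provide, and the heap allocation for two doubles is not needed. The following is a sketch of one possible variant, assuming unaligned inputs are allowed. The name dotprod_sse2 is made up, and the scalar fallback exists only so that the code still builds where SSE2 is unavailable.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Dot product of two double arrays of length n. */
double dotprod_sse2(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    size_t i = 0;
#if defined(__SSE2__)
    __m128d acc = _mm_setzero_pd();            /* two running partial sums */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);      /* unaligned loads are allowed */
        __m128d vb = _mm_loadu_pd(b + i);
        acc = _mm_add_pd(acc, _mm_mul_pd(va, vb));
    }
    double lanes[2];
    _mm_storeu_pd(lanes, acc);                 /* no alignment requirement */
    sum = lanes[0] + lanes[1];
#endif
    for (; i < n; i++)                         /* leftover element, or scalar path */
        sum += a[i] * b[i];
    return sum;
}
```

Because the vector version keeps two partial sums, the result can differ in the last bits from a strictly left-to-right scalar sum.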
