Repeated calls to CwiseUnaryView - better to copy to a matrix?

Repeated calls to CwiseUnaryView - better to copy to a matrix? - eigen

As the title says, I've got a custom UnaryView function that is getting called multiple times in different operations (mostly multiplication with other matrices). For example:
MatrixXd out = mat1.cview() * mat2;
MatrixXd out2 = mat1.cview() * mat3.transpose();
Would it be faster to first copy the custom view into a separate matrix and use that instead? For example:
MatrixXd mat1_dbl = mat1.cview();
MatrixXd out = mat1_dbl * mat2;
MatrixXd out2 = mat1_dbl * mat3.transpose();
Basically, is repeatedly using the UnaryView slower than copying to a matrix and using that instead?

Should have done my own benchmarking. Google benchmark shows that it is markedly faster to copy first:
2019-04-09 20:55:55
Running ./CView
Run on (16 X 4053.06 MHz CPU s)
CPU Caches:
L1 Data 32K (x8)
L1 Instruction 64K (x8)
L2 Unified 512K (x8)
L3 Unified 8192K (x2)
--------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------
UnaryView_Repeat 147390919 ns 147385796 ns 5
UnaryView_Copy 139456051 ns 139451409 ns 5
Tested with:
#include <stan/math/prim/mat/fun/Eigen.hpp>
#include <stan/math/fwd/mat.hpp>
#include <stan/math/fwd/core.hpp>
#include <benchmark/benchmark.h>
static void UnaryView_Repeat(benchmark::State& state) {
using Eigen::MatrixXd;
using stan::math::matrix_fd;
matrix_fd m_fd1(1000, 1000);
m_fd1.val_() = MatrixXd::Random(1000, 1000);
m_fd1.d_() = MatrixXd::Random(1000, 1000);
MatrixXd m_d2 = MatrixXd::Random(1000, 1000);
for (auto _ : state) {
MatrixXd out(1000,1000);
out = m_fd1.val_() * m_d2
+ m_fd1.val_().transpose() * m_d2
+ m_fd1.val_().array().exp().matrix();
}
}
BENCHMARK(UnaryView_Repeat);
static void UnaryView_Copy(benchmark::State& state) {
using Eigen::MatrixXd;
using stan::math::matrix_fd;
matrix_fd m_fd1(1000, 1000);
m_fd1.val_() = MatrixXd::Random(1000, 1000);
m_fd1.d_() = MatrixXd::Random(1000, 1000);
MatrixXd m_d2 = MatrixXd::Random(1000, 1000);
for (auto _ : state) {
MatrixXd out(1000,1000);
MatrixXd m_fd1_val = m_fd1.val_();
out = m_fd1_val * m_d2 + m_fd1_val.transpose() * m_d2
+ m_fd1_val.array().exp().matrix();
}
}
BENCHMARK(UnaryView_Copy);
BENCHMARK_MAIN();

Related

Does clflush instruction flush block only from Level 1 Cache?

I have a multi-core system with 4 cores each of them having private L1 and L2 caches and shared LLC. Caches have inclusive property meaning that Higher level Caches are super-set of lower level Caches. Can I directly flush a block on the LLC or does it have to go through the lower level first?
I am trying to understand flush+ reload and flush+flush Cache side Channel attacks.

clflush is architecturally required / guaranteed to evict the line from all levels of cache, making it useful for committing data to non-volatile DIMMs. (e.g. Battery-backed DRAM or 3D XPoint).
The wording in the manual seems pretty clear:
Invalidates from every level of the cache hierarchy in the cache coherence domain ... If that cache line contains modified data at any level of the cache hierarchy, that data is written back to memory
I think if multiple cores have a line in Shared state, clflush / clflushopt on one core has to evict it from the private caches of all cores. (This would happen anyway as part of evicting from inclusive L3 cache, but Skylake-X changed to a NINE (not-inclusive not-exclusive) L3 cache.)
Can I directly flush a block on the LLC or does it have to go through the lower level first?
Not clear what you're asking. Are you asking if you can ask the CPU to flush a block from L3 only, without disturbing L1/L2? You already know L3 is inclusive on most Intel CPUs, so the net effect would be the same as clflush. For cores to talk to L3, they have to go through their own L1d and L2.
clflush still works if the data is only present in L3 but not the private L1d or L2 of the core executing it. It's not a "hint" like a prefetch, or a local-only thing.
In future Silvermont-family CPUs, there will be a cldemote instruction that lets you flush a block to the LLC, but not all the way to DRAM. (And it's only a hint, so it doesn't force the CPU to obey it if the write-back path is busy with evictions to make room for demand-loads.)

That couldn't be true that CLFLUSH always evicts from every cache-level. I just wrote a little program (C++17) where flushing cachlines is always below 5ns on my machine (3990X):
#include <iostream>
#include <chrono>
#include <cstring>
#include <vector>
#include <charconv>
#include <sstream>
#include <cmath>
#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__GNUC__)
#include <x86intrin.h>
#endif
using namespace std;
using namespace chrono;
size_t parseSize( char const *str );
string blockSizeStr( size_t blockSize );
int main( int argc, char **argv )
{
static size_t const DEFAULT_MAX_BLOCK_SIZE = (size_t)512 * 1024;
size_t blockSize = argc < 2 ? DEFAULT_MAX_BLOCK_SIZE : parseSize( argv[1] );
if( blockSize == -1 )
return EXIT_FAILURE;
blockSize = blockSize >= 4096 ? blockSize : 4096;
vector<char> block( blockSize );
size_t size = 4096;
static size_t const ITERATIONS_64K = 100;
do
{
uint64_t avg = 0;
size = size <= blockSize ? size : blockSize;
size_t iterations = (size_t)((double)0x10000 / size * ITERATIONS_64K + 0.5);
iterations += (size_t)!iterations;
for( size_t it = 0; it != iterations; ++it )
{
// make cachlines to get modified for sure by
// modifying to a differnt value each iteration
for( size_t i = 0; i != size; ++i )
block[i] = (i + it) % 0x100;
auto start = high_resolution_clock::now();
for( char *p = &*block.begin(), *end = p + size; p < end; p += 64 )
_mm_clflush( p );
avg += duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count();
}
double nsPerCl = ((double)(int64_t)avg / iterations) / (double)(ptrdiff_t)(size / 64);
cout << blockSizeStr( size ) << " " << nsPerCl << "ns" << endl;
} while( (size *= 2) <= blockSize );
}
size_t parseSize( char const *str )
{
double dSize;
from_chars_result fcr = from_chars( str, str + strlen( str ), dSize, chars_format::general );
if( fcr.ec != errc() )
return -1;
if( !*(str = fcr.ptr) || str[1] )
return -1;
static const
struct suffix_t
{
char suffix;
size_t mult;
} suffixes[]
{
{ 'k', 1024 },
{ 'm', (size_t)1024 * 1024 },
{ 'g', (size_t)1024 * 1024 * 1024 }
};
char cSuf = tolower( *str );
for( suffix_t const &suf : suffixes )
if( suf.suffix == cSuf )
{
dSize = trunc( dSize * (ptrdiff_t)suf.mult );
if( dSize < 1.0 || dSize >= (double)numeric_limits<ptrdiff_t>::max() )
return -1;
return (ptrdiff_t)dSize;
}
return -1;
}
string blockSizeStr( size_t blockSize )
{
ostringstream oss;
double dSize = (double)(ptrdiff_t)blockSize;
if( dSize < 1024.0 )
oss << blockSize;
else if( dSize < 1024.0 * 1024.0 )
oss << dSize / 1024.0 << "kB";
else if( blockSize < (size_t)1024 * 1024 * 1024 )
oss << dSize / (1024.0 * 1024.0) << "MB";
else
oss << (double)blockSize / (1024.0 * 1024.0 * 1024.0) << "GB";
return oss.str();
}
There's no DDR-whatever memory that can handle flushing a single cacheline below 5ns.

Rcpp Parallel or openmp for matrixvector product

I am trying to program the naive parallel version of Conjugate gradient, so I started with the simple Wikipedia algorithm, and I want to change the dot-products and MatrixVector products by their appropriate parallel version, The Rcppparallel documentation has the code for the dot-product using parallelReduce; I think I'm gonna use that version for my code, but I'm trying to make the MatrixVector multiplication, but I haven't achieved good results compared to R base (no parallel)
Some versions of parallel matrix multiplication: using OpenMP, Rcppparallel, serial version, a serial version with Armadillo, and the benchmark
// [[Rcpp::depends(RcppParallel)]]
#include <Rcpp.h>
#include <RcppParallel.h>
#include <numeric>
// #include <cstddef>
// #include <cstdio>
#include <iostream>
using namespace RcppParallel;
using namespace Rcpp;
struct InnerProduct : public Worker
{
// source vectors
const RVector<double> x;
const RVector<double> y;
// product that I have accumulated
double product;
// constructors
InnerProduct(const NumericVector x, const NumericVector y)
: x(x), y(y), product(0) {}
InnerProduct(const InnerProduct& innerProduct, Split)
: x(innerProduct.x), y(innerProduct.y), product(0) {}
// process just the elements of the range I've been asked to
void operator()(std::size_t begin, std::size_t end) {
product += std::inner_product(x.begin() + begin,
x.begin() + end,
y.begin() + begin,
0.0);
}
// join my value with that of another InnerProduct
void join(const InnerProduct& rhs) {
product += rhs.product;
}
};
struct MatrixMultiplication : public Worker
{
// source matrix
const RMatrix<double> A;
//source vector
const RVector<double> x;
// destination matrix
RMatrix<double> out;
// initialize with source and destination
MatrixMultiplication(const NumericMatrix A, const NumericVector x, NumericMatrix out)
: A(A), x(x), out(out) {}
// take the square root of the range of elements requested
void operator()(std::size_t begin, std::size_t end) {
for (std::size_t i = begin; i < end; i++) {
// rows we will operate on
//RMatrix<double>::Row rowi = A.row(i);
RMatrix<double>::Row rowi = A.row(i);
//double res = std::inner_product(rowi.begin(), rowi.end(), x.begin(), 0.0);
//Rcout << "res" << res << std::endl;
out(i,1) = std::inner_product(rowi.begin(), rowi.end(), x.begin(), 0.0);
//Rcout << "res" << out(i,1) << std::endl;
}
}
};
// [[Rcpp::export]]
double parallelInnerProduct(NumericVector x, NumericVector y) {
// declare the InnerProduct instance that takes a pointer to the vector data
InnerProduct innerProduct(x, y);
// call paralleReduce to start the work
parallelReduce(0, x.length(), innerProduct);
// return the computed product
return innerProduct.product;
}
//librar(Rbenchmark)
// [[Rcpp::export]]
NumericVector matrixXvectorRcppParallel(NumericMatrix A, NumericVector x) {
// // declare the InnerProduct instance that takes a pointer to the vector data
// InnerProduct innerProduct(x, y);
int nrows = A.nrow();
NumericVector out(nrows);
for(int i = 0; i< nrows;i++ )
{
out(i) = parallelInnerProduct(A(i,_),x);
}
// return the computed product
return out;
}
// [[Rcpp::export]]
arma::rowvec matrixXvectorParallel(arma::mat A, arma::colvec x){
arma::rowvec y = A.row(0)*0;
int filas = A.n_rows;
int columnas = A.n_cols;
#pragma omp parallel for
for(int j=0;j<columnas;j++)
{
//y(j) = A.row(j)*x(j))
y(j) = dotproduct(A.row(j),x);
}
return y;
}
arma::mat matrixXvector2(arma::mat A, arma::mat x){
//arma::rowvec y = A.row(0)*0;
//y=A*x;
return A*x;
}
arma::rowvec matrixXvectorParallel2(arma::mat A, arma::colvec x){
arma::rowvec y = A.row(0)*0;
int filas = A.n_rows;
int columnas = A.n_cols;
#pragma omp parallel for
for(int j = 0; j < columnas ; j++){
double result = 0;
for(int i = 0; i < filas; i++){
result += x(i)*A(j,i);
}
y(j) = result;
}
return y;
}
Benchmark
test replications elapsed relative user.self sys.self user.child sys.child
1 M %*% a 20 0.026 1.000 0.140 0.060 0 0
2 matrixXvector2(M, as.matrix(a)) 20 0.040 1.538 0.101 0.217 0 0
4 matrixXvectorParallel2(M, a) 20 0.063 2.423 0.481 0.000 0 0
3 matrixXvectorParallel(M, a) 20 0.146 5.615 0.745 0.398 0 0
5 matrixXvectorRcppParallel(M, a) 20 0.335 12.885 2.305 0.079 0 0
My last trial at the moment was using parallefor with Rcppparallel, but I'm getting memory errors and I dont have idea where the problem is
// [[Rcpp::export]]
NumericVector matrixXvectorRcppParallel2(NumericMatrix A, NumericVector x) {
// // declare the InnerProduct instance that takes a pointer to the vector data
int nrows = A.nrow();
NumericMatrix out(nrows,1); //allocar mempria de vector de salida
//crear worker
MatrixMultiplication matrixMultiplication(A, x, out);
parallelFor(0,A.nrow(),matrixMultiplication);
// return the computed product
return out;
}
What I notice is that when I check in my terminal using htop how the processors are working, I see in htop when I apply the conventional Matrix vector multiplication using R-base, that is using all the processors, so Does the matrix multiplication perform parallel by default? because in theory, only one processor should be working if is the serial version.
If someone knows which is the better path, OpenMP or Rcppparallel, or another way, that gives me better performance than the apparently serial version of R-base.
The serial code for conjugte gradient at the moment
// [[Rcpp::export]]
arma::colvec ConjugateGradient(arma::mat A, arma::colvec xini, arma::colvec b, int num_iteraciones){
//arma::colvec xnew = xini*0 //inicializar en 0's
arma::colvec x= xini; //inicializar en 0's
arma::colvec rkold = b - A*xini;
arma::colvec rknew = b*0;
arma::colvec pk = rkold;
int k=0;
double alpha_k=0;
double betak=0;
double normak = 0.0;
for(k=0; k<num_iteraciones;k++){
Rcout << "iteracion numero " << k << std::endl;
alpha_k = sum(rkold.t() * rkold) / sum(pk.t()*A*pk); //sum de un elemento para realizar casting
(pk.t()*A*pk);
x = x+ alpha_k * pk;
rknew = rkold - alpha_k*A*pk;
normak = sum(rknew.t()*rknew);
if( normak < 0.000001){
break;
}
betak = sum(rknew.t()*rknew) / sum( rkold.t() * rkold );
//actualizar valores para siguiente iteracion
pk = rknew + betak*pk;
rkold = rknew;
}
return x;
}
I wasn't aware of the use of BLAS in R, thanks Hong Ooi and tim18, so the new benchmark using option(matprod="internal") and option(matprod="blas")
options(matprod = "internal")
res<-benchmark(M%*%a,matrixXvector2(M,as.matrix(a)),matrixXvectorParallel(M,a),matrixXvectorParallel2(M,a),matrixXvectorRcppParallel(M,a),order="relative",replications = 20)
res
test replications elapsed relative user.self sys.self user.child sys.child
2 matrixXvector2(M, as.matrix(a)) 20 0.043 1.000 0.107 0.228 0 0
4 matrixXvectorParallel2(M, a) 20 0.069 1.605 0.530 0.000 0 0
1 M %*% a 20 0.072 1.674 0.071 0.000 0 0
3 matrixXvectorParallel(M, a) 20 0.140 3.256 0.746 0.346 0 0
5 matrixXvectorRcppParallel(M, a) 20 0.343 7.977 2.272 0.175 0 0
options(matprod="blas")
options(matprod = "blas")
res<-benchmark(M%*%a,matrixXvector2(M,as.matrix(a)),matrixXvectorParallel(M,a),matrixXvectorParallel2(M,a),matrixXvectorRcppParallel(M,a),order="relative",replications = 20)
res
test replications elapsed relative user.self sys.self user.child sys.child
1 M %*% a 20 0.021 1.000 0.093 0.054 0 0
2 matrixXvector2(M, as.matrix(a)) 20 0.092 4.381 0.177 0.464 0 0
5 matrixXvectorRcppParallel(M, a) 20 0.328 15.619 2.143 0.109 0 0
4 matrixXvectorParallel2(M, a) 20 0.438 20.857 3.036 0.000 0 0
3 matrixXvectorParallel(M, a) 20 0.546 26.000 3.667 0.127 0 0

As you already found out, the base R matrix multiplication can be multi-threaded, if a multi-threaded BLAS implementation is used. This is the case for the rocker/* docker images, which typically use OpenBLAS.
In addition, (Rcpp)Armadillo already uses the BLAS library used by R (in this case multi-threaded OpenBLAS) as well as OpenMP. So your "serial" version is actually multi-threaded. You can verify this in htop with a large enough matrix as input.
BTW, what you are trying to do looks like premature optimization to me.

Floating point min/max in CUDA slower than CPU version. Why?

I wrote a kernel for computing the min and max values of an array of about 100,000 floats using reduction (see code below). I use thread blocks to reduce chunks of 1024 values to a single value (in shared memory), and then do the final reduction among the blocks on the CPU.
I then compared this with a serial calculation just on the CPU. The CUDA version takes 2.2ms, and the CPU version takes 0.21ms. Why is the CUDA version much slower? Is the array size not large enough to take advantage of the parallelism, or is my code not optimized somehow?
This is part of an exercise in the Udacity Parallel Programming class. I am running this through their web site, so I don't know what the exact hardware is, but they claim the code runs on actual GPUs.
Here is the CUDA code:
__global__ void min_max_kernel(const float* const d_logLuminance,
const size_t length,
float* d_min_logLum,
float* d_max_logLum) {
// Shared working memory
extern __shared__ float sh_logLuminance[];
int blockWidth = blockDim.x;
int x = blockDim.x * blockIdx.x + threadIdx.x;
float* min_logLuminance = sh_logLuminance;
float* max_logLuminance = sh_logLuminance + blockWidth;
// Copy this block's chunk of the data to shared memory
// We copy twice so we compute min and max at the same time
if (x < length) {
min_logLuminance[threadIdx.x] = d_logLuminance[x];
max_logLuminance[threadIdx.x] = min_logLuminance[threadIdx.x];
}
else {
// Pad if we're out of range
min_logLuminance[threadIdx.x] = FLT_MAX;
max_logLuminance[threadIdx.x] = -FLT_MAX;
}
__syncthreads();
// Reduce
for (int s = blockWidth/2; s > 0; s /= 2) {
if (threadIdx.x < s) {
if (min_logLuminance[threadIdx.x + s] < min_logLuminance[threadIdx.x]) {
min_logLuminance[threadIdx.x] = min_logLuminance[threadIdx.x + s];
}
if (max_logLuminance[threadIdx.x + s] > max_logLuminance[threadIdx.x]) {
max_logLuminance[threadIdx.x] = max_logLuminance[threadIdx.x + s];
}
}
__syncthreads();
}
// Write to global memory
if (threadIdx.x == 0) {
d_min_logLum[blockIdx.x] = min_logLuminance[0];
d_max_logLum[blockIdx.x] = max_logLuminance[0];
}
}
size_t get_num_blocks(size_t inputLength, size_t threadsPerBlock) {
return inputLength / threadsPerBlock +
((inputLength % threadsPerBlock == 0) ? 0 : 1);
}
/*
* Compute min, max over the data by first reducing on the device, then
* doing the final reducation on the host.
*/
void compute_min_max(const float* const d_logLuminance,
float& min_logLum,
float& max_logLum,
const size_t numRows,
const size_t numCols) {
// Compute min, max
printf("\n=== computing min/max ===\n");
const size_t blockWidth = 1024;
const size_t numPixels = numRows * numCols;
size_t numBlocks = get_num_blocks(numPixels, blockWidth);
printf("Num min/max blocks = %d\n", numBlocks);
float* d_min_logLum;
float* d_max_logLum;
int alloc_size = sizeof(float) * numBlocks;
checkCudaErrors(cudaMalloc(&d_min_logLum, alloc_size));
checkCudaErrors(cudaMalloc(&d_max_logLum, alloc_size));
min_max_kernel<<<numBlocks, blockWidth, sizeof(float) * blockWidth * 2>>>
(d_logLuminance, numPixels, d_min_logLum, d_max_logLum);
float* h_min_logLum = (float*) malloc(alloc_size);
float* h_max_logLum = (float*) malloc(alloc_size);
checkCudaErrors(cudaMemcpy(h_min_logLum, d_min_logLum, alloc_size, cudaMemcpyDeviceToHost));
checkCudaErrors(cudaMemcpy(h_max_logLum, d_max_logLum, alloc_size, cudaMemcpyDeviceToHost));
min_logLum = FLT_MAX;
max_logLum = -FLT_MAX;
// Reduce over the block results
// (would be a bit faster to do it on the GPU, but it's just 96 numbers)
for (int i = 0; i < numBlocks; i++) {
if (h_min_logLum[i] < min_logLum) {
min_logLum = h_min_logLum[i];
}
if (h_max_logLum[i] > max_logLum) {
max_logLum = h_max_logLum[i];
}
}
printf("min_logLum = %.2f\nmax_logLum = %.2f\n", min_logLum, max_logLum);
checkCudaErrors(cudaFree(d_min_logLum));
checkCudaErrors(cudaFree(d_max_logLum));
free(h_min_logLum);
free(h_max_logLum);
}
And here is the host version:
void compute_min_max_on_host(const float* const d_logLuminance, size_t numPixels) {
int alloc_size = sizeof(float) * numPixels;
float* h_logLuminance = (float*) malloc(alloc_size);
checkCudaErrors(cudaMemcpy(h_logLuminance, d_logLuminance, alloc_size, cudaMemcpyDeviceToHost));
float host_min_logLum = FLT_MAX;
float host_max_logLum = -FLT_MAX;
printf("HOST ");
for (int i = 0; i < numPixels; i++) {
if (h_logLuminance[i] < host_min_logLum) {
host_min_logLum = h_logLuminance[i];
}
if (h_logLuminance[i] > host_max_logLum) {
host_max_logLum = h_logLuminance[i];
}
}
printf("host_min_logLum = %.2f\nhost_max_logLum = %.2f\n",
host_min_logLum, host_max_logLum);
free(h_logLuminance);
}

As #talonmies suggests, behavior may be different for larger sizes; 100,000 is really not that much: Much of it fits within the combined overall L1 cache of the cores on a modern CPU; half of it fits in a single core's L2 cache.
Transfer over PCI express takes time; and in your case, double the time it might have, since you don't use pinned memory.
You're not overlapping computation and PCI express I/O (not that it would make much sense for only 100,000 elements)
Your kernel is rather slow, for more than one reason; not the least of which is the extensive use of shared memory, most of which is unnecessary
More generally: Always profile your code using nvvp (or nvprof for getting textual information for further analysis).

Resize 8-bit image by 2 with ARM NEON

I have an 8-bit 640x480 image that I would like to shrink to a 320x240 image:
void reducebytwo(uint8_t *dst, uint8_t *src)
//src is 640x480, dst is 320x240
What would be the best way to do that using ARM SIMD NEON? Any sample code somewhere?
As a starting point, I simply would like to do the equivalent of:
for (int h = 0; h < 240; h++)
for (int w = 0; w < 320; w++)
dst[h * 320 + w] = (src[640 * h * 2 + w * 2] + src[640 * h * 2 + w * 2 + 1] + src[640 * h * 2 + 640 + w * 2] + src[640 * h * 2 + 640 + w * 2 + 1]) / 4;

This is a one to one translation of your code to arm NEON intrinsics:
#include <arm_neon.h>
#include <stdint.h>
static void resize_line (uint8_t * __restrict src1, uint8_t * __restrict src2, uint8_t * __restrict dest)
{
int i;
for (i=0; i<640; i+=16)
{
// load upper line and add neighbor pixels:
uint16x8_t a = vpaddlq_u8 (vld1q_u8 (src1));
// load lower line and add neighbor pixels:
uint16x8_t b = vpaddlq_u8 (vld1q_u8 (src2));
// sum of upper and lower line:
uint16x8_t c = vaddq_u16 (a,b);
// divide by 4, convert to char and store:
vst1_u8 (dest, vshrn_n_u16 (c, 2));
// move pointers to next chunk of data
src1+=16;
src2+=16;
dest+=8;
}
}
void resize_image (uint8_t * src, uint8_t * dest)
{
int h;
for (h = 0; h < 240 - 1; h++)
{
resize_line (src+640*(h*2+0),
src+640*(h*2+1),
dest+320*h);
}
}
It processes 32 source-pixels and generates 8 output pixels per iteration.
I did a quick look at the assembler output and it looks okay. You can get better performance if you write the resize_line function in assembler, unroll the loop and eliminate pipeline stalls. That would give you an estimated factor of three performance boost.
It should be a lot faster than your implementation without assembler changes though.
Note: I haven't tested the code...

If you're not too concerned with precision then this inner loop should give you twice the compute throughput compared to the more accurate algorithm:
for (i=0; i<640; i+= 32)
{
uint8x16x2_t a, b;
uint8x16_t c, d;
/* load upper row, splitting even and odd pixels into a.val[0]
* and a.val[1] respectively. */
a = vld2q_u8(src1);
/* as above, but for lower row */
b = vld2q_u8(src2);
/* compute average of even and odd pixel pairs for upper row */
c = vrhaddq_u8(a.val[0], a.val[1]);
/* compute average of even and odd pixel pairs for lower row */
d = vrhaddq_u8(b.val[0], b.val[1]);
/* compute average of upper and lower rows, and store result */
vst1q_u8(dest, vrhaddq_u8(c, d));
src1+=32;
src2+=32;
dest+=16;
}
It works by using the vhadd operation, which has a result the same size as the input. This way you don't have to shift the final sum back down to 8-bit, and all of the arithmetic throughout is eight-bit, which means you can perform twice as many operations per instruction.
However it is less accurate, because the intermediate sum is quantised, and GCC 4.7 does a terrible job of generating code. GCC 4.8 does just fine.
The whole operation has a good chance of being I/O bound, though. The loop should be unrolled to maximise separation between loads and arithmetic, and __builtin_prefetch() (or PLD) should be used to hoist the incoming data into caches before it's needed.

Here is the asm version on reduce_line that #Nils Pipenbrinck suggested
static void reduce2_neon_line(uint8_t* __restrict src1, uint8_t* __restrict src2, uint8_t* __restrict dest, int width) {
for(int i=0; i<width; i+=16) {
asm (
"pld [%[line1], #0xc00] \n"
"pld [%[line2], #0xc00] \n"
"vldm %[line1]!, {d0,d1} \n"
"vldm %[line2]!, {d2,d3} \n"
"vpaddl.u8 q0, q0 \n"
"vpaddl.u8 q1, q1 \n"
"vadd.u16 q0, q1 \n"
"vshrn.u16 d0, q0, #2 \n"
"vst1.8 {d0}, [%[dst]]! \n"
:
: [line1] "r"(src1), [line2] "r"(src2), [dst] "r"(dest)
: "q0", "q1", "memory"
);
}
}
It is about 4 times faster then C version (tested on iPhone 5).

Cuda Bayer/CFA demosaicing example

I've written a CUDA4 Bayer demosaicing routine, but it's slower than single threaded CPU code, running on a16core GTS250.
Blocksize is (16,16) and the image dims are a multiple of 16 - but changing this doesn't improve it.
Am I doing anything obviously stupid?
--------------- calling routine ------------------
uchar4 *d_output;
size_t num_bytes;
cudaGraphicsMapResources(1, &cuda_pbo_resource, 0);
cudaGraphicsResourceGetMappedPointer((void **)&d_output, &num_bytes, cuda_pbo_resource);
// Do the conversion, leave the result in the PBO fordisplay
kernel_wrapper( imageWidth, imageHeight, blockSize, gridSize, d_output );
cudaGraphicsUnmapResources(1, &cuda_pbo_resource, 0);
--------------- cuda -------------------------------
texture<uchar, 2, cudaReadModeElementType> tex;
cudaArray *d_imageArray = 0;
__global__ void convertGRBG(uchar4 *d_output, uint width, uint height)
{
uint x = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
uint y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
uint i = __umul24(y, width) + x;
// input is GR/BG output is BGRA
if ((x < width) && (y < height)) {
if ( y & 0x01 ) {
if ( x & 0x01 ) {
d_output[i].x = (tex2D(tex,x+1,y)+tex2D(tex,x-1,y))/2; // B
d_output[i].y = (tex2D(tex,x,y)); // G in B
d_output[i].z = (tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/2; // R
} else {
d_output[i].x = (tex2D(tex,x,y)); //B
d_output[i].y = (tex2D(tex,x+1,y) + tex2D(tex,x-1,y)+tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/4; // G
d_output[i].z = (tex2D(tex,x+1,y+1) + tex2D(tex,x+1,y-1)+tex2D(tex,x-1,y+1)+tex2D(tex,x-1,y-1))/4; // R
}
} else {
if ( x & 0x01 ) {
// odd col = R
d_output[i].y = (tex2D(tex,x+1,y+1) + tex2D(tex,x+1,y-1)+tex2D(tex,x-1,y+1)+tex2D(tex,x-1,y-1))/4; // B
d_output[i].z = (tex2D(tex,x,y)); //R
d_output[i].y = (tex2D(tex,x+1,y) + tex2D(tex,x-1,y)+tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/4; // G
} else {
d_output[i].x = (tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/2; // B
d_output[i].y = (tex2D(tex,x,y)); // G in R
d_output[i].z = (tex2D(tex,x+1,y)+tex2D(tex,x-1,y))/2; // R
}
}
}
}
void initTexture(int imageWidth, int imageHeight, uchar *imagedata)
{
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
cutilSafeCall( cudaMallocArray(&d_imageArray, &channelDesc, imageWidth, imageHeight) );
uint size = imageWidth * imageHeight * sizeof(uchar);
cutilSafeCall( cudaMemcpyToArray(d_imageArray, 0, 0, imagedata, size, cudaMemcpyHostToDevice) );
cutFree(imagedata);
// bind array to texture reference with point sampling
tex.addressMode[0] = cudaAddressModeClamp;
tex.addressMode[1] = cudaAddressModeClamp;
tex.filterMode = cudaFilterModePoint;
tex.normalized = false;
cutilSafeCall( cudaBindTextureToArray(tex, d_imageArray) );
}

There aren't any obvious bugs in your code, but there are several obvious performance opportunities:
1) for best performance, you should use texture to stage into shared memory - see the 'SobelFilter' SDK sample.
2) As written, the code is writing bytes to global memory, which is guaranteed to incur a large performance hit. You can use shared memory to stage results before committing them to global memory.
3) There is a surprisingly big performance advantage to sizing blocks in a way that match the hardware's texture cache attributes. On Tesla-class hardware, the optimal block size for kernels using the same addressing scheme as your kernel is 16x4. (64 threads per block)
For workloads like this, it may be hard to compete with optimized CPU code. SSE2 can do 16 byte-sized operations in a single instruction, and CPUs are clocked about 5 times as fast.

Based on answer on Nvidia forums, here (for the search engines) is a slightly more optomised version which writes a 2x2 block of pixels in each thread. Although the difference in speed isn't measurable on my setup.
Note it should be called with a gridsize half the size of the image;
dim3 blockSize(16, 16); // for example
dim3 gridSize((width/2) / blockSize.x, (height/2) / blockSize.y);
__global__ void d_convertGRBG(uchar4 *d_output, uint width, uint height)
{
uint x = 2 * (__umul24(blockIdx.x, blockDim.x) + threadIdx.x);
uint y = 2 * (__umul24(blockIdx.y, blockDim.y) + threadIdx.y);
uint i = __umul24(y, width) + x;
// input is GR/BG output is BGRA
if ((x < width-1) && (y < height-1)) {
// x+1, y+1:
d_output[i+width+1] = make_uchar4( (tex2D(tex,x+2,y+1)+tex2D(tex,x,y+1))/2, // B
(tex2D(tex,x+1,y+1)), // G in B
(tex2D(tex,x+1,y+2)+tex2D(tex,x+1,y))/2, // R
0xff);
// x, y+1:
d_output[i+width] = make_uchar4( (tex2D(tex,x,y+1)), //B
(tex2D(tex,x+1,y+1) + tex2D(tex,x-1,y+1)+tex2D(tex,x,y+2)+tex2D(tex,x,y))/4, // G
(tex2D(tex,x+1,y+2) + tex2D(tex,x+1,y)+tex2D(tex,x-1,y+2)+tex2D(tex,x-1,y))/4, // R
0xff);
// x+1, y:
d_output[i+1] = make_uchar4( (tex2D(tex,x,y-1) + tex2D(tex,x+2,y-1)+tex2D(tex,x,y+1)+tex2D(tex,x+2,y-1))/4, // B
(tex2D(tex,x+2,y) + tex2D(tex,x,y)+tex2D(tex,x+1,y+1)+tex2D(tex,x+1,y-1))/4, // G
(tex2D(tex,x+1,y)), //R
0xff);
// x, y:
d_output[i] = make_uchar4( (tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/2, // B
(tex2D(tex,x,y)), // G in R
(tex2D(tex,x+1,y)+tex2D(tex,x-1,y))/2, // R
0xff);
}
}

There are many if's and else's in the code. If you structure the code to eliminate all the conditional statements then you will get a huge performance boost as branching is a performance killer. It is indeed possible to remove the branches. There are exactly 30 cases which you will have to code explicitly. I have implemented it on CPU and it does not contain any conditional statements. I am thinking of making a blog explaining it. Will post it once its done.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Repeated calls to CwiseUnaryView - better to copy to a matrix? - eigen

Related

Does clflush instruction flush block only from Level 1 Cache?

Rcpp Parallel or openmp for matrixvector product

Floating point min/max in CUDA slower than CPU version. Why?

Resize 8-bit image by 2 with ARM NEON

Cuda Bayer/CFA demosaicing example

Categories

Resources