Related
As shown in the question, is there any sample code to calculate this matrix multiplication?
Here is a link to Dense matrix.
I think this example demonstrates what you need.
#include <Eigen/eigen>
int m = 8;
int n = 5;
//Eigen has no built-in random sparse function (that I know of!)
Eigen::MatrixXd A_dense = MatrixXd::Random(m, n);
//create a sparse copy to demonstrate functionality
Eigen::SparseMatrix<double> A = A_dense.sparseView();
//create a sparse matrix of compatible dimensions for A^T * A
Eigen::SparseMatrix<double> ATA(n, n);
//compute A^T * A
ATA.selfadjointView<Lower>().rankUpdate(A.transpose(), 1.0);
//print A in dense format so its readable
std::cout << A.toDense() << "\n\n";
//print ATA in dense format so its readable
std::cout << ATA.toDense() << "\n\n";
//check with intuitive / less optimized operation
std::cout << (A.transpose() * A).toDense() << "\n\n";
This will use a specialized routine to calculate A^T * A and will only compute your preferred triangle (in this case, the lower half) as the result is a symmetric matrix.
My output:
-0.997497 0.64568 -0.817194 -0.982177 -0.0984222
0.127171 0.49321 -0.271096 -0.24424 -0.295755
-0.613392 -0.651784 -0.705374 0.0633259 -0.885922
0.617481 0.717887 -0.668203 0.142369 0.215369
0.170019 0.421003 0.97705 0.203528 0.566637
-0.0402539 0.0270699 -0.108615 0.214331 0.605213
-0.299417 -0.39201 -0.761834 -0.667531 0.0397656
0.791925 -0.970031 -0.990661 0.32609 -0.3961
2.51603 0 0 0 0
-0.318591 2.87295 0 0 0
0.414807 0.986724 4.21357 0 0
1.48181 -0.656854 1.09012 1.6879 0
0.483357 1.1462 1.49161 0.232797 1.77424
2.51603 -0.318591 0.414807 1.48181 0.483357
-0.318591 2.87295 0.986724 -0.656854 1.1462
0.414807 0.986724 4.21357 1.09012 1.49161
1.48181 -0.656854 1.09012 1.6879 0.232797
0.483357 1.1462 1.49161 0.232797 1.77424
I have recently coded a parallel SVD decomposition routine, based on a "one sided Jacobi rotations" algorithm. The code works correctly but is tremendously slow.
In fact it should exploit the parallelism in the inner for loop for(int g=0;g<n;g++), but on commenting out the #pragma omp paralell for directive I can appreciate just a very slight decrease in performances. In other words there is no appreciable speed up on going parallel (the code does run parallel with 4 threads).
Note 1: almost all the work is concentrated in the three following loops involving the matrices A and V, which are relatively large.
for(h=0;h<N;h++)
{
p+=A[h+N*i]*A[h+N*j];//columns dot product:Ai * Aj
qi+=A[h+N*i]*A[h+N*i];// ||Ai||^2
qj+=A[h+N*j]*A[h+N*j];// ||Aj||^2
}
and
double Ahi,Vhi;
for(h=0;h<N;h++)//...rotate Ai & Aj (only columns i & j are changend)
{
Ahi=A[h+N*i];
A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];
A[h+N*j]=-sn*Ahi+cs*A[h+N*j];
}
//store & update rotation matrix V (only columns i & j are updated)
for(h=0;h<N;h++)
{
Vhi=V[h+N*i];
V[h+N*i]=cs*V[h+N*i]+sn*V[h+N*j];
V[h+N*j]=-sn*Vhi+cs*V[h+N*j];
}
All the parallelism should be exploited there but is not. And I can't understand why.
Note 2: The same happens both on Windows (cygWin compiler) and Linux (GCC) platforms.
Note 3: matrices are represented by column major arrays
So I'm looking for some help in finding out why the parallelism is not exploited. Did I miss something? There is some hidden overhead in the parallel for I cannot see?
Thank you very much for any suggestion
int sweep(double* A,double*V,int N,double tol)
{
static int*I=new int[(int)ceil(0.5*(N-1))];
static int*J=new int[(int)ceil(0.5*(N-1))];
int ntol=0;
for(int r=0;r<N;r++) //fill in i,j indexes of parallel rotations in vectors I & J
{
int k=r+1;
if (k==N)
{
for(int i=2;i<=(int)ceil(0.5*N);i++){
I[i-2]=i-1;
J[i-2]=N+2-i-1;
}
}
else
{
for(int i=1;i<=(int)ceil(0.5*(N-k));i++)I[i-1]=i-1;
for(int i=1;i<=(int)ceil(0.5*(N-k));i++)J[i-1]=N-k+2-i-1;
if(k>2)
{
int j=(int)ceil(0.5*(N-k));
for(int i=N-k+2;i<=N-(int)floor(0.5*k);i++){
I[j]=i-1;
J[j]=2*N-k+2-i-1;
j++;
}
}
}
int n=(k%2==0)?(int)floor(0.5*(N-1)):(int)floor(0.5*N);
#pragma omp parallel for schedule(dynamic,5) reduction(+:ntol) default(none) shared(std::cout,I,J,A,V,N,n,tol)
for(int g=0;g<n;g++)
{
int i=I[g];
int j=J[g];
double p=0;
double qi=0;
double qj=0;
double cs,sn,q,c;
int h;
for(h=0;h<N;h++)
{
p+=A[h+N*i]*A[h+N*j];//columns dot product:Ai * Aj
qi+=A[h+N*i]*A[h+N*i];// ||Ai||^2
qj+=A[h+N*j]*A[h+N*j];// ||Aj||^2
}
q=qi-qj;
if(p*p/(qi*qj)<tol) ntol++; //if Ai & Aj are orthogonal enough...
else //if Ai & Aj are not orthogonal enough then... rotate them
{
c=sqrt(4*p*p+q*q);
if(q>=0){
cs=sqrt((c+q)/(2*c));
sn=p/(c*cs);
}
else{
sn=(p>=0)?sqrt((c-q)/2/c):-sqrt((c-q)/2/c);
cs=p/(c*sn);
}
//...rotate Ai & Aj (only columns i & j are changend)
double Ahi,Vhi;
for(h=0;h<N;h++)
{
Ahi=A[h+N*i];
A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];
A[h+N*j]=-sn*Ahi+cs*A[h+N*j];
}
//store & update rotation matrix V (only columns i & j are updated)
for(h=0;h<N;h++)
{
Vhi=V[h+N*i];
V[h+N*i]=cs*V[h+N*i]+sn*V[h+N*j];
V[h+N*j]=-sn*Vhi+cs*V[h+N*j];
}
}
}
}
if(2*ntol==(N*(N-1)))return(1);//if each columns of A is orthogonal enough to each other stop sweep
return(0);
}
Thanks to Z boson remarks I managed to write a far better performing paralell SVD decomposition. It runs many times faster than the original one, and I guess it could be still improved by the use of SIMD instructions.
I post the code, in the case anyone should find it of use. All tests I performed gave me correct results, in any case there is no warranty for its use.
I'm really sorry not to have had the time to properly comment the code for a better comprehensibility.
int sweep(double* A,double*V,int N,double tol,int M, int n)
{
/********************************************************************************************
This routine performs a parallel "sweep" of the SVD factorization algorithm for the matrix A.
It implements a single sided Jacobi rotation algorithm, described by Michael W. Berry,
Dani Mezher, Bernard Philippe, and Ahmed Sameh in "Parallel Algorithms for the Singular Value
Decomposition".
At each sweep the A matrix becomes a little more orthogonal, until each column of A is orthogonal
to each other within a give tolerance. At this point the sweep() routine returns 1 and convergence
is attained.
Arguments:
A : on input the square matrix to be orthogonalized, on exit a more "orthogonal" matrix
V : on input the accumulated rotation matrix, on exit it is updated with the current rotations
N : dimension of the matrices.
tol :tolerance for convergence of orthogonalization. See the explainations for SVD() routine
M : number of blocks (it is calculated from the given block size n)
n : block size (number of columns on each block). It should be an optimal value according to the hardware available
returns : 1 if in the last sweep convergence is attained, 0 if not and one more sweep is necessary
Author : Renato Talucci 2015
*************************************************************************************************/
#include <math.h>
int ntol=0;
//STEP 1 : INTERNAL BLOCK ORTHOGONALISATION
#pragma omp paralell for reduction(+:ntol) shared(A,V,n,tol,N) default(none)
for(int a=0;a<M;a++)
{
for(int i=a*n;i<a*n+imin(n,N-a*n)-1;i++)
{
for(int j=i+1;j<a*n+imin(n,N-a*n);j++)
{
double p=0;
double qi=0;
double qj=0;
double cs,sn,q,c;
for(int h=0;h<N;h++)
{
p+=A[h+N*i]*A[h+N*j];//columns dot product:Ai * Aj
qi+=A[h+N*i]*A[h+N*i];// ||Ai||^2
qj+=A[h+N*j]*A[h+N*j];// ||Aj||^2
}
q=qi-qj;
if((p*p/(qi*qj)<tol)||(qi<tol)||(qj<tol))ntol++; //if Ai & Aj are orthogonal enough...
else //if Ai & Aj are not orthogonal enough then... rotate them
{
c=sqrt(4*p*p+q*q);
if(q>=0){
cs=sqrt((c+q)/(2*c));
sn=p/(c*cs);
}
else{
sn=(p>=0)?sqrt((c-q)/2/c):-sqrt((c-q)/2/c);
cs=p/(c*sn);
}
//...rotate Ai & Aj
double Ahi,Vhi;
for(int h=0;h<N;h++)
{
Ahi=A[h+N*i];
A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];
A[h+N*j]=-sn*Ahi+cs*A[h+N*j];
}
//store & update rotation matrix V (only columns i & j atre updated)
for(int h=0;h<N;h++)
{
Vhi=V[h+N*i];
V[h+N*i]=cs*V[h+N*i]+sn*V[h+N*j];
V[h+N*j]=-sn*Vhi+cs*V[h+N*j];
}
}
}
}
}
//STEP 2 : PARALLEL BLOCK MUTUAL ORTHOGONALISATION
static int*I=new int[(int)ceil(0.5*(M-1))];
static int*J=new int[(int)ceil(0.5*(M-1))];
for(int h=0;h<M;h++)
{
//fill in i,j indexes of blocks to be mutually orthogonalized at each turn
int k=h+1;
if (k==M)
{
for(int i=2;i<=(int)ceil(0.5*M);i++){
I[i-2]=i-1;
J[i-2]=M+2-i-1;
}
}
else
{
for(int i=1;i<=(int)ceil(0.5*(M-k));i++)I[i-1]=i-1;
for(int i=1;i<=(int)ceil(0.5*(M-k));i++)J[i-1]=M-k+2-i-1;
if(k>2)
{
int j=(int)ceil(0.5*(M-k));
for(int i=M-k+2;i<=M-(int)floor(0.5*k);i++){
I[j]=i-1;
J[j]=2*M-k+2-i-1;
j++;
}
}
}
int ng=(k%2==0)?(int)floor(0.5*(M-1)):(int)floor(0.5*M);
#pragma omp parallel for schedule(static,5) shared(A,V,I,J,n,tol,N,ng) reduction(+:ntol) default(none)
for(int g=0;g<ng;g++)
{
int block_i=I[g];
int block_j=J[g];
for(int i=block_i*n;i<block_i*n+imin(n,N-block_i*n);i++)
{
for(int j=block_j*n;j<block_j*n+imin(n,N-block_j*n);j++)
{
double p=0;
double qi=0;
double qj=0;
double cs,sn,q,c;
int h;
for(h=0;h<N;h++)
{
p+=A[h+N*i]*A[h+N*j];//columns dot product:Ai * Aj
qi+=A[h+N*i]*A[h+N*i];// ||Ai||^2
qj+=A[h+N*j]*A[h+N*j];// ||Aj||^2
}
q=qi-qj;
if((p*p/(qi*qj)<tol)||(qi<tol)||(qj<tol))ntol++; //if Ai & Aj are orthogonal enough...
else //if Ai & Aj are not orthogonal enough then... rotate them
{
c=sqrt(4*p*p+q*q);
if(q>=0){
cs=sqrt((c+q)/(2*c));
sn=p/(c*cs);
}
else{
sn=(p>=0)?sqrt((c-q)/2/c):-sqrt((c-q)/2/c);
cs=p/(c*sn);
}
//...rotate Ai & Aj
double Ahi,Vhi;
for(h=0;h<N;h++)
{
Ahi=A[h+N*i];
A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];
A[h+N*j]=-sn*Ahi+cs*A[h+N*j];
}
//store & update rotation matrix V (only columns i & j atre updated)
for(h=0;h<N;h++)
{
Vhi=V[h+N*i];
V[h+N*i]=cs*V[h+N*i]+sn*V[h+N*j];
V[h+N*j]=-sn*Vhi+cs*V[h+N*j];
}
}
}
}
}
}
if(2*ntol==(N*(N-1)))return(1);//if each columns of A is orthogonal enough to each other stop sweep
return(0);
}
int SVD(double* A,double* U,double*V,int N,double tol,double* sigma)
{
/********************************************************************************************
This routine calculates the SVD decomposition of the square matrix A [NxN]
A = U * S * V'
Arguments :
A : on input NxN square matrix to be factorized, on exit contains the
rotated matrix A*V=U*S.
V : on input an identity NxN matrix, on exit is the right orthogonal matrix
of the decomposition A = U*S*V'
U : NxN matrix, on exit is the left orthogonal matrix of the decomposition A = U*S*V'.
sigma : array of dimension N. On exit contains the singular values, i.e. the diagonal
elements of the matrix S.
N : Dimension of the A matrix.
tol : Tolerance for the convergence of the orthogonalisation of A. Said Ai and Aj any two
columns of A, the convergence is attained when Ai*Aj / ( |Ai|*|Aj| ) < tol for each i,j=0,..,N-1 (i!=j)
The software returns the number of sweeps needed for convergence.
NOTE : ALL MATRICES ARE ASSUMED TO HAVE COLUMN MAJOR ORDERING I.E. M(i,j)=M[i+N*j]
Author: Renato Talucci 2015
*************************************************************************************************/
int n=24;//this is the dimension of block submatrices, you shall enter an optimal value for your hardware
int M=N/n+int(((N%n)!=0)?1:0);
int swp=0;//sweeps counter
int converged=0;
while(converged==0) {
converged=sweep(A,V,N,tol,M,n);
swp++;
}
#pragma omp parallel for default(none) shared(sigma,A,U,N)
for(int i=0;i<N;i++)
{
double si=0;
for(int j=0;j<N;j++) si+=A[j+N*i]*A[j+N*i];
si=sqrt(si);
for(int k=0;k<N;k++) U[k+N*i]=A[k+N*i]/si;
sigma[i]=si;
}
return(swp);
}
Note : if some user prefers to left A unchanged upon exit, it is sufficient to calculate U=A*V before entering the while loop, and instead of passing the A matrix to the sweep() routine, passing the obtained U matrix. The U matrix shall be orthonormalized instead of A, after convergence of sweep() and the #pragma omp directive must include U in the shared variables instead of A.
Note 2:if you have (as I have) to factorize a sequence of A(k) matrices each of which A(k) can be considered a perturbation of the previous A(k-1), as a jacobian matrix can be considered in a multistep Newton solver, the driver can be easily modified to update from A(k-1) to A(k) instead of calculating A(k) from begin. Here is the code:
int updateSVD(double* DA,double* U,double*V,int N,double tol,double* sigma)
{
/********************************************************************************************
Given a previously factorization
A(k-1) = U(k-1) * S(k-1) * V(k-1)'
and given a perturbation DA(k) of A(k-1), i.e.
A(k) = A(k-1) + DA(k)
this routine calculates the SVD factorization of A(k), starting from the factorization of A(k-1)
Arguments:
DA : on input NxN perturbation matrix, unchanged on exit
U : on input NxN orthonormal left matrix of the previous (k-1) factorization, on exit
orthonormal right matrix of the current factorization
V : on input NxN orthonormal right matrix of the previous (k-1) factorization, on exit
orthonormal right matrix of the current factorization
N : dimension of the matrices
tol : Tolerance for the convergence of the orthogonalisation of A. Said Ai and Aj any two
columns of A, the convergence is attained when Ai*Aj / ( |Ai|*|Aj| ) < tol for each i,j=0,..,N-1 (i!=j)
sigma : on input, array with the N singular values of the previuos factorization, on exit
array with the N singular values of the current factorization
NOTE : ALL MATRICES ARE ASSUMED TO HAVE COLUMN MAJOR ORDERING I.E. M(i,j)=M[i+N*j]
Author: Renato Talucci 2015
*************************************************************************************************/
for(int i=0;i<N;i++) for(int j=0;j<N;j++) U[i+N*j]*=sigma[j];
int n=24; //this is the dimension of block submatrices, you shall enter an optimal value for your hardware
matmat_col_col(DA,V,U,N,n); //U =U(k-1)*S(k-1) + DA(k)*V(k-1) = A(k)*V(k-1)
int M=N/n+int(((N%n)!=0)?1:0); //number of blocks
int swp=0;//sweeps counter
int converged=0;
while(converged==0) {
converged=sweep(U,V,N,tol,M,n);
swp++;
}
#pragma omp parallel for default(none) shared(sigma,U,N)
for(int i=0;i<N;i++)
{
double si=0;
for(int j=0;j<N;j++) si+=U[j+N*i]*U[j+N*i];
si=sqrt(si);
for(int k=0;k<N;k++) U[k+N*i]=U[k+N*i]/si;
sigma[i]=si;
}
return(swp);
}
Finally, the routine matmat_col_col(DA,V,U,N,n) is a paralell block matrix product. Here is the code:
inline int imin(int a,int b) {return((a<=b)?a:b);}
void matmat_col_col(double* A,double* B,double*C,int N,int n)
/********************************************************************************************
square matrix block product NxN :
C = C + A * B
n is the optimal block dimension
N is the dimension of the matrices
NOTE : ALL MATRICES ARE ASSUMED TO HAVE COLUMN MAJOR ORDERING M(i,j) = M[i+N*j]
Author: Renato Talucci 2015
*************************************************************************************************/
{
int M=N/n+int(((N%n)!=0)?1:0);
#pragma omp parallel for shared(M,A,B,C)
for(int a=0;a<M;a++)
{
for(int b=0;b<M;b++)
{
for(int c=0;c<M;c++)
{
for(int i=a*n;i<imin((a+1)*n,N);i++)
{
for(int j=b*n;j<imin((b+1)*n,N);j++)
{
for(int k=c*n;k<imin((c+1)*n,N);k++)
{
C[i+N*j]+=A[i+N*k]*B[k+N*j];
}
}
}
}
}
}
return;
}
I hope no typos have been created from so much copy&paste.
It would be nice if anyone could improve the code.
The FLOPS for Singular value decomposition (SVD) should go as O(N^3) whereas the reads as O(N^2). This means it may be possible to parallelize the algorithm and have it scale well with the number of cores.
However, your implementation of SVD is memory bandwidth bound which means it can't scale well with the number of cores for a single socket system. Your three nested loops currently each go over the entire range of N which causes much of the data to be evicted from the cache the next time it need to be reused.
In order to be compute bound you're going to have to change your algorithm so that, instead of operating on long strips/dot products of size N, it uses loop tiling to maximize the n^3 (where n is the size of the block) calculations in the cache of size n^2.
Here is paper for parallel SVD computation from a google search which says in the abstract
the key point of our proposed block JRS algorithm is reusing the loaded data into cache memory by performing computations on matrix blocks (b rows) instead of on strips of vectors as in JRS iteration algorithms.
I used loop tiling/blocks to achieve good scaling with the number of cores for cholesky decomposition.
Note that using tiles/blocks will improve the performance of your single threaded code as well.
Situation is the following: I have a number (1000s) of elements which are given by small matrices of dimensions 4x2, 9x3 ... you get the idea. All matrices have the same dimension.
I want to multiply each of these matrices with a fixed vector of precalculated values. In short:
for(i = 1...n)
X[i] = M[i] . N;
What is the best approach to do this in parallel using Thrust? How do I lay out my data in memory?
NB: There might be specialized, more suitable libraries to do this on GPUs. I'm interested in Thrust because it allows me to deploy to different backends, not just CUDA.
One possible approach:
flatten the arrays (matrices) into a single data vector. This is an advantageous step for enabling general thrust processing anyway.
use a strided range mechanism to take your scaling vector and extend it to the overall length of your flattened data vector
use thrust::transform with thrust::multiplies to multiply the two vectors together.
If you need to access the matrices later out of your flattened data vector (or result vector), you can do so with pointer arithmetic, or a combination of fancy iterators.
If you need to re-use the extended scaling vector, you may want to use the method outlined in step 2 exactly (i.e. create an actual vector using that method, length = N matrices, repeated). If you are only doing this once, you can achieve the same effect with a counting iterator, followed by a transform iterator (modulo the length of your matrix in elements), followed by a permutation iterator, to index into your original scaling vector (length = 1 matrix).
The following example implements the above, without using the strided range iterator method:
#include <iostream>
#include <stdlib.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/transform.h>
#define N_MAT 1000
#define H_MAT 4
#define W_MAT 3
#define RANGE 1024
struct my_modulo_functor : public thrust::unary_function<int, int>
{
__host__ __device__
int operator() (int idx) {
return idx%(H_MAT*W_MAT);}
};
int main(){
thrust::host_vector<int> data(N_MAT*H_MAT*W_MAT);
thrust::host_vector<int> scale(H_MAT*W_MAT);
// synthetic; instead flatten/copy matrices into data vector
for (int i = 0; i < N_MAT*H_MAT*W_MAT; i++) data[i] = rand()%RANGE;
for (int i = 0; i < H_MAT*W_MAT; i++) scale[i] = rand()%RANGE;
thrust::device_vector<int> d_data = data;
thrust::device_vector<int> d_scale = scale;
thrust::device_vector<int> d_result(N_MAT*H_MAT*W_MAT);
thrust::transform(d_data.begin(), d_data.end(), thrust::make_permutation_iterator(d_scale.begin(), thrust::make_transform_iterator(thrust::counting_iterator<int>(0), my_modulo_functor())) ,d_result.begin(), thrust::multiplies<int>());
thrust::host_vector<int> result = d_result;
for (int i = 0; i < N_MAT*H_MAT*W_MAT; i++)
if (result[i] != data[i] * scale[i%(H_MAT*W_MAT)]) {std::cout << "Mismatch at: " << i << " cpu result: " << (data[i] * scale[i%(H_MAT*W_MAT)]) << " gpu result: " << result[i] << std::endl; return 1;}
std::cout << "Success!" << std::endl;
return 0;
}
EDIT: Responding to a question below:
The benefit of fancy iterators (i.e. transform(numbers, iterator)) is that they often allow for eliminaion of extra data copies/data movement, as compared to assembling other number (which requires extra steps and data movement) and then passing it to transform(numbers, other numbers). If you're only going to use other numbers once, then the fancy iterators will generally be better. If you're going to use other numbers again, then you may want to assemble it explicitly. This preso is instructive, in particular "Fusion".
For a one-time use of other numbers the overhead of assembling it on the fly using fancy iterators and the functor is generally lower than explicitly creating a new vector, and then passing that new vector to the transform routine.
When looking for a software library which is concisely made for multiplying small matrices, then one may have a look at https://github.com/hfp/libxsmm. Below, the code requests a specialized matrix kernel according to the typical GEMM parameters (please note that some limitations apply).
double alpha = 1, beta = 1;
const char transa = 'N', transb = 'N';
int flags = LIBXSMM_GEMM_FLAGS(transa, transb);
int prefetch = LIBXSMM_PREFETCH_AUTO;
libxsmm_blasint m = 23, n = 23, k = 23;
libxsmm_dmmfunction xmm = NULL;
xmm = libxsmm_dmmdispatch(m, n, k,
&m/*lda*/, &k/*ldb*/, &m/*ldc*/,
&alpha, &beta, &flags, &prefetch);
Given the above code, one can proceed and run "xmm" for an entire series of (small) matrices without a particular data structure (below code also uses "prefetch locations").
if (0 < n) { /* check that n is at least 1 */
# pragma parallel omp private(i)
for (i = 0; i < (n - 1); ++i) {
const double *const ai = a + i * asize;
const double *const bi = b + i * bsize;
double *const ci = c + i * csize;
xmm(ai, bi, ci, ai + asize, bi + bsize, ci + csize);
}
xmm(a + (n - 1) * asize, b + (n - 1) * bsize, c + (n - 1) * csize,
/* pseudo prefetch for last element of batch (avoids page fault) */
a + (n - 1) * asize, b + (n - 1) * bsize, c + (n - 1) * csize);
}
In addition to the manual loop control as shown above, libxsmm_gemm_batch (or libxsmm_gemm_batch_omp) can be used (see ReadTheDocs). The latter is useful if data structures exist that describe the series of operands (A, B, and C matrices).
There are two reasons why this library gives superior performance: (1) on-the-fly code specialization using an in-memory code generation technique, and (2) loading the next matrix operands while calculating the current product.
( Given one is looking for something that blends well with C/C++, this library supports it. However, it does not aim for CUDA/Thrust. )
I do not intend to use this for security purposes or statistical analysis. I need to create a simple random number generator for use in my computer graphics application. I don't want to use the term "random number generator", since people think in very strict terms about it, but I can't think of any other word to describe it.
it has to be fast.
it must be repeatable, given a particular seed.
Eg: If seed = x, then the series a,b,c,d,e,f..... should happen every time I use the seed x.
Most importantly, I need to be able to compute the nth term in the series in constant time.
It seems, that I cannot achieve this with rand_r or srand(), since these need are state dependent, and I may need to compute the nth in some unknown order.
I've looked at Linear Feedback Shift registers, but these are state dependent too.
So far I have this:
int rand = (n * prime1 + seed) % prime2
n = used to indicate the index of the term in the sequence. Eg: For
first term, n ==1
prime1 and prime2 are prime numbers where
prime1 > prime2
seed = some number which allows one to use the same function to
produce a different series depending on the seed, but the same series
for a given seed.
I can't tell how good or bad this is, since I haven't used it enough, but it would be great if people with more experience in this can point out the problems with this, or help me improve it..
EDIT - I don't care if it is predictable. I'm just trying to creating some randomness in my computer graphics.
Use a cryptographic block cipher in CTR mode. The Nth output is just encrypt(N). Not only does this give you the desired properties (O(1) computation of the Nth output); it also has strong non-predictability properties.
I stumbled on this a while back, looking for a solution for the same problem. Recently, I figured out how to do it in low-constant O(log(n)) time. While this doesn't quite match the O(1) requested by the author, It may be fast enough (a sample run, compiled with -O3, achieved performance of 1 billion arbitrary index random numbers, with n varying between 1 and 2^48, in 55.7s -- just shy of 18M numbers/s).
First, the theory behind the solution:
A common type of RNGs are Linear Congruential Generators, basically, they work as follows:
random(n) = (m*random(n-1) + b) mod p
Where m and b, and p are constants (see a reference on LCGs for how they are chosen). From this, we can devise the following using a bit of modular arithmetic:
random(0) = seed mod p
random(1) = m*seed + b mod p
random(2) = m^2*seed + m*b + b mod p
...
random(n) = m^n*seed + b*Sum_{i = 0 to n - 1} m^i mod p
= m^n*seed + b*(m^n - 1)/(m - 1) mod p
Computing the above can be a problem, since the numbers will quickly exceed numeric limits. The solution for the generic case is to compute m^n in modulo with p*(m - 1), however, if we take b = 0 (a sub-case of LCGs sometimes called Multiplicative congruential Generators), we have a much simpler solution, and can do our computations in modulo p only.
In the following, I use the constant parameters used by RANF (developed by CRAY), where p = 2^48 and g = 44485709377909. The fact that p is a power of 2 reduces the number of operations required (as expected):
#include <cassert>
#include <stdint.h>
#include <cstdlib>
class RANF{
// MCG constants and state data
static const uint64_t m = 44485709377909ULL;
static const uint64_t n = 0x0000010000000000ULL; // 2^48
static const uint64_t randMax = n - 1;
const uint64_t seed;
uint64_t state;
public:
// Constructors, which define the seed
RANF(uint64_t seed) : seed(seed), state(seed) {
assert(seed > 0 && "A seed of 0 breaks the LCG!");
}
// Gets the next random number in the sequence
inline uint64_t getNext(){
state *= m;
return state & randMax;
}
// Sets the MCG to a specific index
inline void setPosition(size_t index){
state = seed;
uint64_t mPower = m;
for (uint64_t b = 1; index; b <<= 1){
if (index & b){
state *= mPower;
index ^= b;
}
mPower *= mPower;
}
}
};
#include <cstdio>
void example(){
RANF R(1);
// Gets the number through random-access -- O(log(n))
R.setPosition(12345); // Goes to the nth random number
printf("fast nth number = %lu\n", R.getNext());
// Gets the number through standard, sequential access -- O(n)
R.setPosition(0);
for(size_t i = 0; i < 12345; i++) R.getNext();
printf("slow nth number = %lu\n", R.getNext());
}
While I presume the author has moved on by now, hopefully this will be of use to someone else.
If you're really concerned about runtime performance, the above can be made about 10x faster with lookup tables, at the cost of compilation time and binary size (it also is O(1) w.r.t the desired random index, as requested by OP)
In the version below, I used c++14 constexpr to generate the lookup tables at compile time, and got to 176M arbitrary index random numbers per second (doing this did however add about 12s of extra compilation time, and a 1.5MB increase in binary size -- the added time may be mitigated if partial recompilation is used).
class RANF{
// MCG constants and state data
static const uint64_t m = 44485709377909ULL;
static const uint64_t n = 0x0000010000000000ULL; // 2^48
static const uint64_t randMax = n - 1;
const uint64_t seed;
uint64_t state;
// Lookup table
struct lookup_t{
uint64_t v[3][65536];
constexpr lookup_t() : v() {
uint64_t mi = RANF::m;
for (size_t i = 0; i < 3; i++){
v[i][0] = 1;
uint64_t val = mi;
for (uint16_t j = 0x0001; j; j++){
v[i][j] = val;
val *= mi;
}
mi = val;
}
}
};
friend struct lookup_t;
public:
// Constructors, which define the seed
RANF(uint64_t seed) : seed(seed), state(seed) {
assert(seed > 0 && "A seed of 0 breaks the LCG!");
}
// Gets the next random number in the sequence
inline uint64_t getNext(){
state *= m;
return state & randMax;
}
// Sets the MCG to a specific index
// Note: idx.u16 indices need to be adapted for big-endian machines!
inline void setPosition(size_t index){
static constexpr auto lookup = lookup_t();
union { uint16_t u16[4]; uint64_t u64; } idx;
idx.u64 = index;
state = seed * lookup.v[0][idx.u16[0]] * lookup.v[1][idx.u16[1]] * lookup.v[2][idx.u16[2]];
}
};
Basically, what this does is splits the computation of, for example, m^0xAAAABBBBCCCC mod p, into (m^0xAAAA00000000 mod p)*(m^0xBBBB0000 mod p)*(m^CCCC mod p) mod p, and then precomputes tables for each of the values in the 0x0000 - 0xFFFF range that could fill AAAA, BBBB or CCCC.
RNG in a normal sense, have the sequence pattern like f(n) = S(f(n-1))
They also lost precision at some point (like % mod), due to computing convenience, therefore it is not possible to expand the sequence to a function like X(n) = f(n) = trivial function with n only.
This mean at best you have O(n) with that.
To target for O(1) you therefore need to abandon the idea of f(n) = S(f(n-1)), and designate a trivial formula directly so that the N'th number can be calculated directly without knowing (N-1)'th; this also render the seed meaningless.
So, you end up have a simple algebra function and not a sequence. For example:
int my_rand(int n) { return 42; } // Don't laugh!
int my_rand(int n) { 3*n*n + 2*n + 7; }
If you want to put more constraint to the generated pattern (like distribution), it become a complex maths problem.
However, for your original goal, if what you want is constant speed to get pseudo-random numbers, I suggest to pre-generate it with traditional RNG and access with lookup table.
EDIT: I noticed you have concern with a table size for a lot of numbers, however you may introduce some hybrid model, like a table of N entries, and do f(k) = g( tbl[k%n], k), which at least provide good distribution across N continue sequence.
This demonstrates an PRNG implemented as a hashed counter. This might appear to duplicate R.'s suggestion (using a block cipher in CTR mode as a stream cipher), but for this, I avoided using cryptographically secure primitives: for speed of execution and because security wasn't a desired feature.
If we were trying to create a secure stream cipher with your requirement that any emitted sequence be trivially repeatable, given knowledge of its index...
...then we could choose a secure hash algorithm (like SHA256) and a counter with a lot of bits (maybe 2048 -> sequence repeats every 2^2048 generated numbers without reseeding).
HOWEVER, the version I present here uses Bob Jenkins' famous hash function (simple and fast, but not secure) along with a 64-bit counter (which is as big as integers can get on my system, without needing custom incrementing code).
Code in main demonstrates that knowledge of the RNG's counter (seed) after initialization allows a PRNG sequence to be repeated, as long as we know how many values were generated leading up to the repetition point.
Actually, if you know the counter's value at any point in the output sequence, you will be able to retrieve all values generated previous to that point, AND all values which will be generated afterward. This only involves adding or subtracting ordinal differences to/from a reference counter value associated with a known point in the output sequence.
It should be pretty easy to adapt this class for use as a testing framework -- you could plug in other hash functions and change the counter's size to see what kind of impact there is on speed as well as the distribution of generated values (the only uniformity analysis I did was to look for patterns in the screenfuls of hexadecimal numbers printed by main()).
#include <iostream>
#include <iomanip>
#include <ctime>
using namespace std;
class CHashedCounterRng {
static unsigned JenkinsHash(const void *input, unsigned len) {
unsigned hash = 0;
for(unsigned i=0; i<len; ++i) {
hash += static_cast<const unsigned char*>(input)[i];
hash += hash << 10;
hash ^= hash >> 6;
}
hash += hash << 3;
hash ^= hash >> 11;
hash += hash << 15;
return hash;
}
unsigned long long m_counter;
void IncrementCounter() { ++m_counter; }
public:
unsigned long long GetSeed() const {
return m_counter;
}
void SetSeed(unsigned long long new_seed) {
m_counter = new_seed;
}
unsigned int operator ()() {
// the next random number is generated here
const auto r = JenkinsHash(&m_counter, sizeof(m_counter));
IncrementCounter();
return r;
}
// the default coontructor uses time()
// to seed the counter
CHashedCounterRng() : m_counter(time(0)) {}
// you can supply a predetermined seed here,
// or after construction with SetSeed(seed)
CHashedCounterRng(unsigned long long seed) : m_counter(seed) {}
};
int main() {
CHashedCounterRng rng;
// time()'s high bits change very slowly, so look at low digits
// if you want to verify that the seed is different between runs
const auto stored_counter = rng.GetSeed();
cout << "initial seed: " << stored_counter << endl;
for(int i=0; i<20; ++i) {
for(int j=0; j<8; ++j) {
const unsigned x = rng();
cout << setfill('0') << setw(8) << hex << x << ' ';
}
cout << endl;
}
cout << endl;
cout << "The last line again:" << endl;
rng.SetSeed(stored_counter + 19 * 8);
for(int j=0; j<8; ++j) {
const unsigned x = rng();
cout << setfill('0') << setw(8) << hex << x << ' ';
}
cout << endl << endl;
return 0;
}
I have a set of image files, and I want to reduce the number of colors of them to 64. How can I do this with OpenCV?
I need this so I can work with a 64-sized image histogram.
I'm implementing CBIR techniques
What I want is color quantization to a 4-bit palette.
This subject was well covered on OpenCV 2 Computer Vision Application Programming Cookbook:
Chapter 2 shows a few reduction operations, one of them demonstrated here in C++ and later in Python:
#include <iostream>
#include <vector>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
void colorReduce(cv::Mat& image, int div=64)
{
int nl = image.rows; // number of lines
int nc = image.cols * image.channels(); // number of elements per line
for (int j = 0; j < nl; j++)
{
// get the address of row j
uchar* data = image.ptr<uchar>(j);
for (int i = 0; i < nc; i++)
{
// process each pixel
data[i] = data[i] / div * div + div / 2;
}
}
}
int main(int argc, char* argv[])
{
// Load input image (colored, 3-channel, BGR)
cv::Mat input = cv::imread(argv[1]);
if (input.empty())
{
std::cout << "!!! Failed imread()" << std::endl;
return -1;
}
colorReduce(input);
cv::imshow("Color Reduction", input);
cv::imwrite("output.jpg", input);
cv::waitKey(0);
return 0;
}
Below you can find the input image (left) and the output of this operation (right):
The equivalent code in Python would be the following:
(credits to #eliezer-bernart)
import cv2
import numpy as np
input = cv2.imread('castle.jpg')
# colorReduce()
div = 64
quantized = input // div * div + div // 2
cv2.imwrite('output.jpg', quantized)
You might consider K-means, yet in this case it will most likely be extremely slow. A better approach might be doing this "manually" on your own. Let's say you have image of type CV_8UC3, i.e. an image where each pixel is represented by 3 RGB values from 0 to 255 (Vec3b). You might "map" these 256 values to only 4 specific values, which would yield 4 x 4 x 4 = 64 possible colors.
I've had a dataset, where I needed to make sure that dark = black, light = white and reduce the amount of colors of everything between. This is what I did (C++):
inline uchar reduceVal(const uchar val)
{
if (val < 64) return 0;
if (val < 128) return 64;
return 255;
}
void processColors(Mat& img)
{
uchar* pixelPtr = img.data;
for (int i = 0; i < img.rows; i++)
{
for (int j = 0; j < img.cols; j++)
{
const int pi = i*img.cols*3 + j*3;
pixelPtr[pi + 0] = reduceVal(pixelPtr[pi + 0]); // B
pixelPtr[pi + 1] = reduceVal(pixelPtr[pi + 1]); // G
pixelPtr[pi + 2] = reduceVal(pixelPtr[pi + 2]); // R
}
}
}
causing [0,64) to become 0, [64,128) -> 64 and [128,255) -> 255, yielding 27 colors:
To me this seems to be neat, perfectly clear and faster than anything else mentioned in other answers.
You might also consider reducing these values to one of the multiples of some number, let's say:
inline uchar reduceVal(const uchar val)
{
if (val < 192) return uchar(val / 64.0 + 0.5) * 64;
return 255;
}
which would yield a set of 5 possible values: {0, 64, 128, 192, 255}, i.e. 125 colors.
There are many ways to do it. The methods suggested by jeff7 are OK, but some drawbacks are:
method 1 have parameters N and M, that you must choose, and you must also convert it to another colorspace.
method 2 answered can be very slow, since you should compute a 16.7 Milion bins histogram and sort it by frequency (to obtain the 64 higher frequency values)
I like to use an algorithm based on the Most Significant Bits to use in a RGB color and convert it to a 64 color image. If you're using C/OpenCV, you can use something like the function below.
If you're working with gray-level images I recommed to use the LUT() function of the OpenCV 2.3, since it is faster. There is a tutorial on how to use LUT to reduce the number of colors. See: Tutorial: How to scan images, lookup tables... However I find it more complicated if you're working with RGB images.
void reduceTo64Colors(IplImage *img, IplImage *img_quant) {
int i,j;
int height = img->height;
int width = img->width;
int step = img->widthStep;
uchar *data = (uchar *)img->imageData;
int step2 = img_quant->widthStep;
uchar *data2 = (uchar *)img_quant->imageData;
for (i = 0; i < height ; i++) {
for (j = 0; j < width; j++) {
// operator XXXXXXXX & 11000000 equivalent to XXXXXXXX AND 11000000 (=192)
// operator 01000000 >> 2 is a 2-bit shift to the right = 00010000
uchar C1 = (data[i*step+j*3+0] & 192)>>2;
uchar C2 = (data[i*step+j*3+1] & 192)>>4;
uchar C3 = (data[i*step+j*3+2] & 192)>>6;
data2[i*step2+j] = C1 | C2 | C3; // merges the 2 MSB of each channel
}
}
}
Here's a Python implementation of color quantization using K-Means Clustering with cv2.kmeans. The idea is to reduce the number of distinct colors in an image while preserving the color appearance of the image as much as possible. Here's the result:
Input -> Output
Code
import cv2
import numpy as np
def kmeans_color_quantization(image, clusters=8, rounds=1):
h, w = image.shape[:2]
samples = np.zeros([h*w,3], dtype=np.float32)
count = 0
for x in range(h):
for y in range(w):
samples[count] = image[x][y]
count += 1
compactness, labels, centers = cv2.kmeans(samples,
clusters,
None,
(cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10000, 0.0001),
rounds,
cv2.KMEANS_RANDOM_CENTERS)
centers = np.uint8(centers)
res = centers[labels.flatten()]
return res.reshape((image.shape))
image = cv2.imread('1.jpg')
result = kmeans_color_quantization(image, clusters=8)
cv2.imshow('result', result)
cv2.waitKey()
The answers suggested here are really good. I thought I would add my idea as well. I follow the formulation of many comments here, in which it is said that 64 colors can be represented by 2 bits of each channel in an RGB image.
The function in code below takes as input an image and the number of bits required for quantization. It uses bit manipulation to 'drop' the LSB bits and keep only the required number of bits. The result is a flexible method that can quantize the image to any number of bits.
#include "include\opencv\cv.h"
#include "include\opencv\highgui.h"
// quantize the image to numBits
cv::Mat quantizeImage(const cv::Mat& inImage, int numBits)
{
cv::Mat retImage = inImage.clone();
uchar maskBit = 0xFF;
// keep numBits as 1 and (8 - numBits) would be all 0 towards the right
maskBit = maskBit << (8 - numBits);
for(int j = 0; j < retImage.rows; j++)
for(int i = 0; i < retImage.cols; i++)
{
cv::Vec3b valVec = retImage.at<cv::Vec3b>(j, i);
valVec[0] = valVec[0] & maskBit;
valVec[1] = valVec[1] & maskBit;
valVec[2] = valVec[2] & maskBit;
retImage.at<cv::Vec3b>(j, i) = valVec;
}
return retImage;
}
int main ()
{
cv::Mat inImage;
inImage = cv::imread("testImage.jpg");
char buffer[30];
for(int i = 1; i <= 8; i++)
{
cv::Mat quantizedImage = quantizeImage(inImage, i);
sprintf(buffer, "%d Bit Image", i);
cv::imshow(buffer, quantizedImage);
sprintf(buffer, "%d Bit Image.png", i);
cv::imwrite(buffer, quantizedImage);
}
cv::waitKey(0);
return 0;
}
Here is an image that is used in the above function call:
Image quantized to 2 bits for each RGB channel (Total 64 Colors):
3 bits for each channel:
4 bits ...
There is the K-means clustering algorithm which is already available in the OpenCV library. In short it determines which are the best centroids around which to cluster your data for a user-defined value of k ( = no of clusters). So in your case you could find the centroids around which to cluster your pixel values for a given value of k=64. The details are there if you google around. Here's a short intro to k-means.
Something similar to what you are probably trying was asked here on SO using k-means, hope it helps.
Another approach would be to use the pyramid mean shift filter function in OpenCV. It yields somewhat "flattened" images, i.e. the number of colors are less so it might be able to help you.
If you want a quick and dirty method in C++, in 1 line:
capImage &= cv::Scalar(0b11000000, 0b11000000, 0b11000000);
So, what it does is keep the upper 2 bits of each R, G, B component, and discards the lower 6 bits, hence the 0b11000000.
Because of the 3 channels in RGB, you get maximum 4 R x 4 B x 4 B = max 64 colors. The advantage of doing this is that you can run this on any number of images and the same colors will be mapped.
Note that this can make your image a bit darker since it discards some bits.
For a greyscale image, you can do:
capImage &= 0b11111100;
This will keep the upper 6 bits, which means you get 64 grays out of 256, and again the image can become a bit darker.
Here's an example, original image = 251424 unique colors.
And the resulting image has 46 colors:
Assuming that you want to use the same 64 colors for all images (ie palette not optimized per image), there are a at least a couple choices I can think of:
1) Convert to Lab or YCrCb colorspace and quantize using N bits for luminance and M bits for each color channel, N should be greater than M.
2) Compute a 3D histogram of color values over all your training images, then choose the 64 colors with the largest bin values. Quantize your images by assigning each pixel the color of the closest bin from the training set.
Method 1 is the most generic and easiest to implement, while method 2 can be better tailored to your specific dataset.
Update:
For example, 32 colors is 5 bits so assign 3 bits to the luminance channel and 1 bits to each color channel. To do this quantization, do integer division of the luminance channel by 2^8/2^3 = 32 and each color channel by 2^8/2^1 = 128. Now there are only 8 different luminance values and 2 different color channels each. Recombine these values into a single integer doing bit shifting or math (quantized color value = luminance*4+color1*2+color2);
A simple bitwise and with a proper bitmask would do the trick.
python, for 64 colors,
img = img & int("11000000", 2)
The number of colors for an RGB image should be a perfect cube (same across 3 channels).
For this method, the number of possible values for a channel should be a power of 2. (This check is ignored by the code and the next lower power of 2 is taken by it)
import numpy as np
import cv2 as cv
def is_cube(n):
cbrt = np.cbrt(n)
return cbrt ** 3 == n, int(cbrt)
def reduce_color_space(img, n_colors=64):
n_valid, cbrt = is_cube(n_colors)
if not n_valid:
print("n_colors should be a perfect cube")
return
n_bits = int(np.log2(cbrt))
if n_bits > 8:
print("Can't generate more colors")
return
bitmask = int(f"{'1' * n_bits}{'0' * (8 - n_bits)}", 2)
return img & bitmask
img = cv.imread("image.png")
cv.imshow("orig", img)
cv.imshow("reduced", reduce_color_space(img))
cv.waitKey(0)
img = numpy.multiply(img//32, 32)
Why don't you just do Matrix multiplication/division? Values will be automatically rounded.
Pseudocode:
convert your channels to unsigned characters (CV_8UC3),
Divide by
total colors / desired colors. Mat = Mat / (256/64). Decimal points
will be truncated.
Multiply by the same number. Mat = mat * 4
Done. Each channel now only contains 64 colors.