PETSc time stepping under mpirun mode - parallel-processing

Dear all,
I'm new to PETSc. I wrote a very simple 1D diffusion code based on the PETSc library (just a simple FDM code). I want to use PETSc's parallel solver in each time step. Here is the pseudocode:
Initial(C0);
for(timestep=0;timestep<MaxTimeStep;timestep++)
{
Assemble(A,C,F); // Assemble system's A*C=F matrix and vector
KSP.solve(A,C,F); // Use PETSC's ksp solver to solve A*C=F
// I just want to make the solve step paralleled
// other part don't need to be paralleled
C0=C; // Update
SaveResult(C0); // Output the result C0
}
This code works fine in sequential mode (the matrix and vector are both MPI types), but when I run it with mpirun -np 4 ./a.out, the result is totally wrong. I guess mpirun makes the time-step loop parallel as well, but I'm not sure about it.
So my questions are:
How do I use PETSc's parallel solver inside a loop, or in other words, how do I use the parallel solver (or mpirun) inside sequential time stepping?
Is it possible to use mpirun -np 4 ./myapp so that only the solver is parallelized, while the time stepping and result output stay sequential? In the simulation, the time stepping and result output must be executed one after another, so I only need to parallelize the solver, not the whole program.
Thank you!
#include "petscksp.h"
void PrintToVTU(PetscInt k,PetscReal dx,Vec u);
int main(int argc,char **args)
{
Vec u,u0;
Mat A;
KSP ksp;
PetscRandom rctx;
PetscReal norm;
PetscInt ne;
PetscReal Length,dx,D;
PetscReal Ca,Cb,alpha;
PetscReal T,dt,ti;
PetscInt i,j,Ii,J,Istart,Iend,m,n,its;
PetscBool flg=PETSC_FALSE;
PetscScalar v;
printf("Input the Length:");
scanf("%f",&Length);
printf("Input the element number:");
scanf("%d",&ne);
printf("Input the Ca,Cb and D:");
scanf("%f %f %f",&Ca,&Cb,&D);
printf("Input dt and T:");
scanf("%f %f",&dt,&T);
n=ne+1;m=n;
dx=Length/ne;
alpha=dt*D/(dx*dx);
#if defined(PETSC_USE_LOG)
PetscLogStage stage;
#endif
static char help[] = "Solves a linear system in parallel with KSP.\n\
Input parameters include:\n\
  -random_exact_sol : use a random exact solution vector\n\
  -view_exact_sol   : write exact solution vector to stdout\n\
  -m <mesh_x>       : number of mesh points in x-direction\n\
  -n <mesh_n>       : number of mesh points in y-direction\n\n";
PetscInitialize(&argc,&args,(char*)0,help);
PetscOptionsGetInt(NULL,NULL,"-m",&m,NULL);
PetscOptionsGetInt(NULL,NULL,"-n",&n,NULL);
printf("m=%d\tn=%d\n",m,n);
// Create Matrix
MatCreate(PETSC_COMM_WORLD,&A);
MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,m,n);
MatSetFromOptions(A);
MatMPIAIJSetPreallocation(A,5,NULL,5,NULL);
MatSeqAIJSetPreallocation(A,5,NULL);
MatSeqSBAIJSetPreallocation(A,1,5,NULL);
// Create Vector
VecCreate(PETSC_COMM_WORLD,&u0);
VecSetSizes(u0,PETSC_DECIDE,m);
VecSetFromOptions(u0);
VecDuplicate(u0,&u);
MatGetOwnershipRange(A,&Istart,&Iend);
PetscLogStageRegister("Assembly",&stage);
PetscLogStagePush(stage);
for(Ii=Istart;Ii<Iend;Ii++)
{
if(Ii==0)
{
v=1.0+2.0*alpha;J=Ii;
MatSetValues(A,1,&Ii,1,&J,&v,ADD_VALUES);
v=-2.0*alpha;J=Ii+1;
MatSetValues(A,1,&Ii,1,&J,&v,ADD_VALUES);
}
else if(Ii==n-1)
{
v=-2.0*alpha;J=Ii-1;
MatSetValues(A,1,&Ii,1,&J,&v,ADD_VALUES);
v=1.0+2.0*alpha;J=Ii;
MatSetValues(A,1,&Ii,1,&J,&v,ADD_VALUES);
}
else
{
v=-alpha;J=Ii-1;
MatSetValues(A,1,&Ii,1,&J,&v,ADD_VALUES);
v=1.0+2.0*alpha;J=Ii;
MatSetValues(A,1,&Ii,1,&J,&v,ADD_VALUES);
v=-alpha;J=Ii+1;
MatSetValues(A,1,&Ii,1,&J,&v,ADD_VALUES);
}
if(Ii<=m/2)
{
VecSetValues(u0,1,&Ii,&Ca,ADD_VALUES);
}
else
{
VecSetValues(u0,1,&Ii,&Cb,ADD_VALUES);
}
}
MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);
MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);
VecAssemblyBegin(u0);
VecAssemblyEnd(u0);
KSPCreate(PETSC_COMM_WORLD,&ksp);
KSPSetOperators(ksp,A,A);
KSPSetFromOptions(ksp);
dt=1.0;T=2.0;i=0; /* note: this overwrites the dt and T read from stdin above */
PrintToVTU(i,dx,u0);
printf("Hello\n");
for(ti=0.0;ti<=T;ti+=dt)
{
i+=1;
KSPSolve(ksp,u0,u);
VecCopy(u,u0);
PrintToVTU(i,dx,u0);
}
/* clean up and finalize PETSc (missing in the original post) */
KSPDestroy(&ksp);
VecDestroy(&u);
VecDestroy(&u0);
MatDestroy(&A);
PetscFinalize();
return 0;
}
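For context: under mpirun every rank executes the whole program, including the time-stepping loop, so the loop itself is not being split up; KSPSolve is already collective and runs in parallel. What breaks in parallel are the parts that implicitly assume a single process, such as the scanf input and the VTU output. Below is a minimal sketch, not the poster's code, of one common pattern: keep the loop on all ranks and gather the distributed solution to rank 0 only for output. It reuses u, u0, ksp, dt, T and ti from the listing above and assumes the standard PETSc calls VecScatterCreateToZero, VecScatterBegin and VecScatterEnd.

PetscMPIInt rank;
Vec         useq;   /* full-length sequential copy of u0, valid on rank 0 */
VecScatter  ctx;

MPI_Comm_rank(PETSC_COMM_WORLD,&rank);
VecScatterCreateToZero(u0,&ctx,&useq);

for (ti = 0.0; ti <= T; ti += dt) {
  KSPSolve(ksp,u0,u);    /* collective: all ranks participate, solve runs in parallel */
  VecCopy(u,u0);         /* also collective */
  VecScatterBegin(ctx,u0,useq,INSERT_VALUES,SCATTER_FORWARD);
  VecScatterEnd(ctx,u0,useq,INSERT_VALUES,SCATTER_FORWARD);
  if (rank == 0) {
    /* only rank 0 writes the result, e.g. a PrintToVTU-style routine reading useq */
  }
}

VecScatterDestroy(&ctx);
VecDestroy(&useq);

The same idea applies to the input: read Length, ne, Ca, Cb, D, dt and T on rank 0 only (or pass them as PETSc options) and broadcast them with MPI_Bcast, so that every rank assembles the same system.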

Related

How can I execute a TensorFlow graph from a protobuf in C++?

I took a simple piece of code from the tutorial and exported it to a .pb file as shown below:
mnist_softmax_train.py
x = tf.placeholder("float", shape=[None, 784], name='input_x')
y_ = tf.placeholder("float", shape=[None, 10], name='input_y')
W = tf.Variable(tf.zeros([784, 10]), name='W')
b = tf.Variable(tf.zeros([10]), name='b')
tf.initialize_all_variables().run()
y = tf.nn.softmax(tf.matmul(x,W)+b, name='softmax')
cross_entropy = -tf.reduce_sum(y_*tf.log(y))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy, name='train_step')
train_step.run(feed_dict={x:input_x, y_:input_y})
In C++, I load the same graph, and feed in fake data for testing:
Tensor input_x(DT_FLOAT, TensorShape({10,784}));
Tensor input_y(DT_FLOAT, TensorShape({10,10}));
Tensor W(DT_FLOAT, TensorShape({784,10}));
Tensor b(DT_FLOAT, TensorShape({10,10}));
Tensor input_test_x(DT_FLOAT, TensorShape({1,784}));
for(int i=0;i<10;i++){
for(int j=0;j<10;j++)
input_x.matrix<float>()(i,i+j) = 1.0;
input_y.matrix<float>()(i,i) = 1.0;
input_test_x.matrix<float>()(0,i) = 1.0;
}
std::vector<std::pair<string, tensorflow::Tensor>> inputs = {
{ "input_x", input_x },
{ "input_y", input_y },
{ "W", W },
{ "b", b },
{ "input_test_x", input_test_x },
};
std::vector<tensorflow::Tensor> outputs;
status = session->Run(inputs, {}, {"train_step"}, &outputs);
std::cout << outputs[0].DebugString() << "\n";
However, this fails with the error:
Invalid argument: Input 0 of node train_step/update_W/ApplyGradientDescent was passed float from _recv_W_0:0 incompatible with expected float_ref.
The graph runs correctly in Python. How can I run it correctly in C++?
The issue here is that you are running the "train_step" target, which performs much more work than just inference. In particular, it attempts to update the variables W and b with the result of the gradient descent step. The error message
Invalid argument: Input 0 of node train_step/update_W/ApplyGradientDescent was passed float from _recv_W_0:0 incompatible with expected float_ref.
...means that one of the nodes you attempted to run ("train_step/update_W/ApplyGradientDescent") expected a mutable input (with type float_ref) but it got an immutable input (with type float) because the value was fed in.
There are (at least) two possible solutions:
If you only want to see predictions for a given input and given weights, fetch "softmax:0" instead of "train_step" in the call to Session::Run().
If you want to perform training in C++, do not feed W and b, but instead assign values to those variables, then continue to execute "train_step". You may find it easier to create a tf.train.Saver when you build the graph in Python, and then invoke the operations that it produces to save and restore values from a checkpoint.
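For the first option, here is a minimal sketch (reusing session, W, b and input_test_x from the snippet above; not a complete program) that feeds the weights and fetches "softmax:0" instead of running "train_step":

// Inference only: feed the placeholder and the weights, fetch the prediction,
// and run no training op, so no mutable (float_ref) input is required.
std::vector<std::pair<std::string, tensorflow::Tensor>> inference_inputs = {
  {"input_x", input_test_x},
  {"W", W},
  {"b", b},
};
std::vector<tensorflow::Tensor> inference_outputs;
tensorflow::Status s = session->Run(inference_inputs,
                                    {"softmax:0"},  // tensors to fetch
                                    {},             // no target nodes to run
                                    &inference_outputs);
if (s.ok()) {
  std::cout << inference_outputs[0].DebugString() << "\n";  // predicted probabilities
}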

Understanding crc32

I am trying to understand the computation of CRC32. It's new to me, so this is a basic question. With the following code I have two different ways of computing the CRC32 sum. They should (in theory) be the same, but they differ. What am I doing wrong?
The Go stdlib implementation (what a surprise) seems to be correct, but I can't find the error in my implementation.
https://play.golang.org/p/QJH2K3IQEj
package main
import (
"fmt"
"hash/crc32"
)
func main() {
message := []byte("hello")
// Is this the correct polynomial table? This is the table from
// http://gnuradio.org/redmine/projects/gnuradio/repository/revisions/1cb52da49230c64c3719b4ab944ba1cf5a9abb92/entry/gr-digital/lib/digital_crc32.cc
tbl := [256]uint32{0x00000000, 0x04C11DB7, 0x09823B6E, 0x0D4326D9,
0x130476DC, 0x17C56B6B, 0x1A864DB2, 0x1E475005,
0x2608EDB8, 0x22C9F00F, 0x2F8AD6D6, 0x2B4BCB61,
0x350C9B64, 0x31CD86D3, 0x3C8EA00A, 0x384FBDBD,
0x4C11DB70, 0x48D0C6C7, 0x4593E01E, 0x4152FDA9,
0x5F15ADAC, 0x5BD4B01B, 0x569796C2, 0x52568B75,
0x6A1936C8, 0x6ED82B7F, 0x639B0DA6, 0x675A1011,
0x791D4014, 0x7DDC5DA3, 0x709F7B7A, 0x745E66CD,
0x9823B6E0, 0x9CE2AB57, 0x91A18D8E, 0x95609039,
0x8B27C03C, 0x8FE6DD8B, 0x82A5FB52, 0x8664E6E5,
0xBE2B5B58, 0xBAEA46EF, 0xB7A96036, 0xB3687D81,
0xAD2F2D84, 0xA9EE3033, 0xA4AD16EA, 0xA06C0B5D,
0xD4326D90, 0xD0F37027, 0xDDB056FE, 0xD9714B49,
0xC7361B4C, 0xC3F706FB, 0xCEB42022, 0xCA753D95,
0xF23A8028, 0xF6FB9D9F, 0xFBB8BB46, 0xFF79A6F1,
0xE13EF6F4, 0xE5FFEB43, 0xE8BCCD9A, 0xEC7DD02D,
0x34867077, 0x30476DC0, 0x3D044B19, 0x39C556AE,
0x278206AB, 0x23431B1C, 0x2E003DC5, 0x2AC12072,
0x128E9DCF, 0x164F8078, 0x1B0CA6A1, 0x1FCDBB16,
0x018AEB13, 0x054BF6A4, 0x0808D07D, 0x0CC9CDCA,
0x7897AB07, 0x7C56B6B0, 0x71159069, 0x75D48DDE,
0x6B93DDDB, 0x6F52C06C, 0x6211E6B5, 0x66D0FB02,
0x5E9F46BF, 0x5A5E5B08, 0x571D7DD1, 0x53DC6066,
0x4D9B3063, 0x495A2DD4, 0x44190B0D, 0x40D816BA,
0xACA5C697, 0xA864DB20, 0xA527FDF9, 0xA1E6E04E,
0xBFA1B04B, 0xBB60ADFC, 0xB6238B25, 0xB2E29692,
0x8AAD2B2F, 0x8E6C3698, 0x832F1041, 0x87EE0DF6,
0x99A95DF3, 0x9D684044, 0x902B669D, 0x94EA7B2A,
0xE0B41DE7, 0xE4750050, 0xE9362689, 0xEDF73B3E,
0xF3B06B3B, 0xF771768C, 0xFA325055, 0xFEF34DE2,
0xC6BCF05F, 0xC27DEDE8, 0xCF3ECB31, 0xCBFFD686,
0xD5B88683, 0xD1799B34, 0xDC3ABDED, 0xD8FBA05A,
0x690CE0EE, 0x6DCDFD59, 0x608EDB80, 0x644FC637,
0x7A089632, 0x7EC98B85, 0x738AAD5C, 0x774BB0EB,
0x4F040D56, 0x4BC510E1, 0x46863638, 0x42472B8F,
0x5C007B8A, 0x58C1663D, 0x558240E4, 0x51435D53,
0x251D3B9E, 0x21DC2629, 0x2C9F00F0, 0x285E1D47,
0x36194D42, 0x32D850F5, 0x3F9B762C, 0x3B5A6B9B,
0x0315D626, 0x07D4CB91, 0x0A97ED48, 0x0E56F0FF,
0x1011A0FA, 0x14D0BD4D, 0x19939B94, 0x1D528623,
0xF12F560E, 0xF5EE4BB9, 0xF8AD6D60, 0xFC6C70D7,
0xE22B20D2, 0xE6EA3D65, 0xEBA91BBC, 0xEF68060B,
0xD727BBB6, 0xD3E6A601, 0xDEA580D8, 0xDA649D6F,
0xC423CD6A, 0xC0E2D0DD, 0xCDA1F604, 0xC960EBB3,
0xBD3E8D7E, 0xB9FF90C9, 0xB4BCB610, 0xB07DABA7,
0xAE3AFBA2, 0xAAFBE615, 0xA7B8C0CC, 0xA379DD7B,
0x9B3660C6, 0x9FF77D71, 0x92B45BA8, 0x9675461F,
0x8832161A, 0x8CF30BAD, 0x81B02D74, 0x857130C3,
0x5D8A9099, 0x594B8D2E, 0x5408ABF7, 0x50C9B640,
0x4E8EE645, 0x4A4FFBF2, 0x470CDD2B, 0x43CDC09C,
0x7B827D21, 0x7F436096, 0x7200464F, 0x76C15BF8,
0x68860BFD, 0x6C47164A, 0x61043093, 0x65C52D24,
0x119B4BE9, 0x155A565E, 0x18197087, 0x1CD86D30,
0x029F3D35, 0x065E2082, 0x0B1D065B, 0x0FDC1BEC,
0x3793A651, 0x3352BBE6, 0x3E119D3F, 0x3AD08088,
0x2497D08D, 0x2056CD3A, 0x2D15EBE3, 0x29D4F654,
0xC5A92679, 0xC1683BCE, 0xCC2B1D17, 0xC8EA00A0,
0xD6AD50A5, 0xD26C4D12, 0xDF2F6BCB, 0xDBEE767C,
0xE3A1CBC1, 0xE760D676, 0xEA23F0AF, 0xEEE2ED18,
0xF0A5BD1D, 0xF464A0AA, 0xF9278673, 0xFDE69BC4,
0x89B8FD09, 0x8D79E0BE, 0x803AC667, 0x84FBDBD0,
0x9ABC8BD5, 0x9E7D9662, 0x933EB0BB, 0x97FFAD0C,
0xAFB010B1, 0xAB710D06, 0xA6322BDF, 0xA2F33668,
0xBCB4666D, 0xB8757BDA, 0xB5365D03, 0xB1F740B4}
crc := uint32(0xffffffff)
// different result
for _, v := range message {
crc = tbl[v^(byte(crc>>24)&0xff)] ^ (crc << 8)
}
crc = ^crc
fmt.Printf("%10d == 0x%x\n", crc, crc) // 422667581 == 0x1931653d
// same as http://zorc.breitbandkatze.de/crc.html
chk := crc32.ChecksumIEEE(message)
fmt.Printf("%10d == 0x%x\n", chk, chk) // 907060870 == 0x3610a686
}
Edit (from a comment below): I understand that the source code of Go's crc32 and mine are different. But why does my implementation give a different result? The table I am using is built from the same polynomial (from the Wikipedia page for CRC-32) as Go's implementation (0x04C11DB7), except that Go uses the reversed polynomial, so a different algorithm is not surprising. "My" algorithm comes from various C/C++ sources, such as the one linked in the source.
First, it seems that you are not using the right table (or at least not the same one as the Go standard library): you can check this simply by comparing your table to the one available in the crc32 package. Here is a playground example.
Then, your algorithm is also wrong. To be honest, I didn't even try to check whether you put the right values in the right places, but simply dug out the code used by crc32.Checksum. Here is a fixed version of your algorithm:
crc := uint32(0xffffffff)
for _, v := range message {
crc = tbl[byte(crc)^v] ^ (crc >> 8)
}
return ^crc
As you can see, the in-loop calculation is a bit different. You can see it in action with the crc32.IEEE polynomial table here.
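For completeness, here is a minimal sketch (in C++, not from the answer) of how the reflected table behind Go's crc32.IEEE is generated from the reversed polynomial 0xEDB88320; comparing its entries with the MSB-first table in the question shows why the two tables cannot agree.

#include <cstdint>
#include <cstdio>

int main() {
  uint32_t table[256];
  for (uint32_t i = 0; i < 256; ++i) {
    uint32_t crc = i;
    for (int k = 0; k < 8; ++k)
      crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;  // LSB-first (reflected) step
    table[i] = crc;
  }
  // The familiar reflected IEEE table starts 0x00000000, 0x77073096, ...
  std::printf("table[1] = 0x%08X\n", (unsigned)table[1]);
  return 0;
}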

Setting up RcppArmadillo in Windows with RStudio

I am trying to set up RcppArmadillo on my Windows system with RStudio. I have successfully installed RcppArmadillo with the command
install.packages("RcppArmadillo")
in R console.
But when I try to compile C++ code with an RcppArmadillo dependency I get an error like
g++ -m64 -I"C:/PROGRA~1/R/R-30~1.3/include" -DNDEBUG -I"C:/PROGRA~1/R/R-30~1.3/library/Rcpp/include" -I"d:/RCompile/CRANpkg/extralibs64/local/include" -O2 -Wall -mtune=core2 -c colrowStat.cpp -o colrowStat.o
colrowStat.cpp:5:26: fatal error: RcppArmadillo.h: No such file or directory
compilation terminated.
make: *** [colrowStat.o] Error 1
Warning message:
running command 'make -f "C:/PROGRA~1/R/R-30~1.3/etc/x64/Makeconf" -f "C:/PROGRA~1/R/R-30~1.3/share/make/winshlib.mk" SHLIB_LDFLAGS='$(SHLIB_CXXLDFLAGS)' SHLIB_LD='$(SHLIB_CXXLD)' SHLIB="sourceCpp_38187.dll" WIN=64 TCLBIN=64 OBJECTS="colrowStat.o"' had status 2
But the header files are available in path_to_my_documents/R/win-libraries/3.0/RcppArmadillo/Include
I think the include path for compilation does not contain this folder, and I don't know how to add it to the path. I greatly appreciate any help with this problem.
You are doing it wrong. There are many ways to do it, and we documented several of them. What you do here is not one of them.
Try this instead and go from there:
R> library(Rcpp)
R> cppFunction("arma::mat op(arma::vec x) { return(x*x.t()); }",
+ depends="RcppArmadillo")
R> op(1:2)
[,1] [,2]
[1,] 1 2
[2,] 2 4
R>
This is one of the basic examples: take a vector, multiply it by its transpose and return the result outer product matrix.
What you ultimately want is a package, and for that you could do much worse than starting with RcppArmadillo.package.skeleton().
Your question is short on details, but if you are on a Windows machine and are using RStudio, then here is a fully reproducible example of how to use RcppArmadillo without using the inline package, which is not ideal except for very short functions.
As Dirk has pointed out, this advice is available elsewhere -- the Rcpp* ecosystem is bizarrely well-documented, but this might help a novice.
0. Preliminaries:
You should have the following installed:
Rtools
R package devtools
R package Rcpp
R package RcppArmadillo
1. C++ code:
The example is a simple one: computing the OLS estimator for a linear regression model. Here is the C++ file, with one function (fnLinRegRcpp) that takes the response vector and design matrix as inputs and returns the OLS coefficient estimates and the model residuals as an Rcpp List:
// LinearRegression.cpp
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace arma; // use the Armadillo library for matrix computations
using namespace Rcpp;
// [[Rcpp::export]]
List fnLinRegRcpp(vec vY, mat mX) {
// compute the OLS estimator & model residuals
vec vBeta = solve(mX.t()*mX, mX.t()*vY);
vec vResid = vY - mX * vBeta;
// construct the return object
List ret;
ret["beta"] = vBeta;
ret["resid"] = vResid;
return ret;
}
// END
Note the use of Rcpp attributes:
// [[Rcpp::depends(RcppArmadillo)]]
to indicate library dependencies on the Armadillo library.
2. R code
Here is an example of the compilation of the C++ code using the sourceCpp function, together with an example of the use of the function, and a comparison of the output to the built-in lm.fit function.
# LinearRegression.R
library(devtools)
library(Rcpp)
library(RcppArmadillo)
Rcpp::sourceCpp("code/LinearRegression.cpp",
showOutput = TRUE,
rebuild = FALSE)
# generate some sample data
iK = 4
iN = 100
mX = cbind(1, matrix(rnorm(iK*iN), iN, iK))
vBeta0 = c(2, 3.5, 0.11, 6.33, 23)
vY = rnorm(iN, mean = mX %*% vBeta0)
# test the function
linReg1 = fnLinRegRcpp(vY, mX)
linReg1$beta # coefficient estimates
# compare the results to the built-in lm.fit function
lm.fit(y = vY, x = mX)$coef # coefficient estimates
# END

Aparapi add sample

I'm studying Aparapi (https://code.google.com/p/aparapi/) and am seeing strange behaviour with one of the included samples.
The sample is the first one, "add". Building and executing it is OK. I also added the following code to test whether the GPU is really used:
if(!kernel.getExecutionMode().equals(Kernel.EXECUTION_MODE.GPU)){
System.out.println("Kernel did not execute on the GPU!");
}
and it works fine.
But, if I try to change the size of the array from 512 to a number greater than 999 (for example 1000), I have the following output:
!!!!!!! clEnqueueNDRangeKernel() failed invalid work group size
after clEnqueueNDRangeKernel, globalSize[0] = 1000, localSize[0] = 128
Apr 18, 2013 1:31:01 PM com.amd.aparapi.KernelRunner executeOpenCL
WARNING: ### CL exec seems to have failed. Trying to revert to Java ###
JTP
Kernel did not execute on the GPU!
Here's my code:
final int size = 1000;
final float[] a = new float[size];
final float[] b = new float[size];
for (int i = 0; i < size; i++) {
a[i] = (float)(Math.random()*100);
b[i] = (float)(Math.random()*100);
}
final float[] sum = new float[size];
Kernel kernel = new Kernel(){
@Override public void run() {
int gid = getGlobalId();
sum[gid] = a[gid] + b[gid];
}
};
Range range = Range.create(size);
kernel.execute(range);
System.out.println(kernel.getExecutionMode());
if (!kernel.getExecutionMode().equals(Kernel.EXECUTION_MODE.GPU)){
System.out.println("Kernel did not execute on the GPU!");
}
kernel.dispose();
}
I tried specifying the size using
Range range = Range.create(size, 128);
as suggested in a Google group, but nothing changed.
I'm currently running on Mac OS X 10.8 with Java 1.6.0_43. Aparapi version is the latest (2012-01-23).
Am I missing something? Any ideas?
Thanks in advance
Aparapi inherits a 'Grid Style' of implementation from OpenCL. When you specify a range of execution (say 1024), OpenCL will break this 'range' into groups of equal size. Possibly 4 groups of 256, or 8 groups of 128.
The group size must be a factor of range (so assert(range%groupSize==0)).
By default Aparapi internally selects the group size.
But you are choosing to fully specify the range and group size by using
Range r = Range.create(n, 128)
You are responsible for ensuring that n%128==0.
From the error, it looks like you chose Range.create(1000, 128).
Sadly 1000 % 128 != 0, so this range will fail.
If you specify
Range r = Range.create(n)
Aparapi will choose a valid group size by finding the highest factor of n it can use.
Try dropping the 128 as the second arg.
Gary
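As an aside, a minimal sketch (in C++ for illustration, not Aparapi code) of the other common workaround: pad the global size up to the next multiple of the group size and let the extra work-items do nothing; in an Aparapi kernel that would be a guard such as if (gid < size).

#include <cstdio>

int main() {
  const int size = 1000;       // the real problem size
  const int groupSize = 128;   // the requested local size
  // Round up to the next multiple of groupSize so that range % groupSize == 0.
  const int padded = ((size + groupSize - 1) / groupSize) * groupSize;
  std::printf("padded range = %d (%d groups of %d)\n",
              padded, padded / groupSize, groupSize);  // 1024 (8 groups of 128)
  return 0;
}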

Robust and fast checksum algorithm?

Which checksum algorithm can you recommend in the following use case?
I want to generate checksums of small JPEG files (~8 kB each) to check if the content changed. Using the filesystem's date modified is unfortunately not an option.
The checksum need not be cryptographically strong but it should robustly indicate changes of any size.
The second criterion is speed since it should be possible to process at least hundreds of images per second (on a modern CPU).
The calculation will be done on a server with several clients. The clients send the images over Gigabit TCP to the server. So there's no disk I/O as bottleneck.
If you have many small files, your bottleneck is going to be file I/O and probably not a checksum algorithm.
A list of hash functions (which can be thought of as a checksum) can be found here.
Is there any reason you can't use the filesystem's date modified to determine if a file has changed? That would probably be faster.
There are lots of fast CRC algorithms that should do the trick:
http://www.google.com/search?hl=en&q=fast+crc&aq=f&oq=
Edit: Why the hate? CRC is totally appropriate, as evidenced by the other answers. A Google search was also appropriate, since no language was specified. This is an old, old problem which has been solved so many times that there isn't likely to be a definitive answer.
CRC-32 comes to mind mainly because it's cheap to calculate.
Any kind of I/O also comes to mind, mainly because it will be the limiting factor for such an undertaking ;)
The problem is not calculating the checksums, the problem is getting the images into memory to calculate the checksum.
I would suggest "staged" monitoring:
stage 1: check for changes of file timestamps, and if you detect a change there, hand over to... (not needed in your case as described in the edited version)
stage 2: get the image into memory and calculate the checksum
Also important: multi-threading, i.e. setting up a pipeline that processes several images in parallel if several CPU cores are available.
If you are receiving the files over the network, you can calculate the checksum as you receive the file. This ensures that you calculate the checksum while the data is in memory, so you won't have to load it from disk.
I believe if you apply this method, you'll see almost zero overhead on your system.
These are the routines I'm using on an embedded system which does checksum control on firmware and other stuff.
static const uint32_t crctab[] = {
0x0,
0x04c11db7, 0x09823b6e, 0x0d4326d9, 0x130476dc, 0x17c56b6b,
0x1a864db2, 0x1e475005, 0x2608edb8, 0x22c9f00f, 0x2f8ad6d6,
0x2b4bcb61, 0x350c9b64, 0x31cd86d3, 0x3c8ea00a, 0x384fbdbd,
0x4c11db70, 0x48d0c6c7, 0x4593e01e, 0x4152fda9, 0x5f15adac,
0x5bd4b01b, 0x569796c2, 0x52568b75, 0x6a1936c8, 0x6ed82b7f,
0x639b0da6, 0x675a1011, 0x791d4014, 0x7ddc5da3, 0x709f7b7a,
0x745e66cd, 0x9823b6e0, 0x9ce2ab57, 0x91a18d8e, 0x95609039,
0x8b27c03c, 0x8fe6dd8b, 0x82a5fb52, 0x8664e6e5, 0xbe2b5b58,
0xbaea46ef, 0xb7a96036, 0xb3687d81, 0xad2f2d84, 0xa9ee3033,
0xa4ad16ea, 0xa06c0b5d, 0xd4326d90, 0xd0f37027, 0xddb056fe,
0xd9714b49, 0xc7361b4c, 0xc3f706fb, 0xceb42022, 0xca753d95,
0xf23a8028, 0xf6fb9d9f, 0xfbb8bb46, 0xff79a6f1, 0xe13ef6f4,
0xe5ffeb43, 0xe8bccd9a, 0xec7dd02d, 0x34867077, 0x30476dc0,
0x3d044b19, 0x39c556ae, 0x278206ab, 0x23431b1c, 0x2e003dc5,
0x2ac12072, 0x128e9dcf, 0x164f8078, 0x1b0ca6a1, 0x1fcdbb16,
0x018aeb13, 0x054bf6a4, 0x0808d07d, 0x0cc9cdca, 0x7897ab07,
0x7c56b6b0, 0x71159069, 0x75d48dde, 0x6b93dddb, 0x6f52c06c,
0x6211e6b5, 0x66d0fb02, 0x5e9f46bf, 0x5a5e5b08, 0x571d7dd1,
0x53dc6066, 0x4d9b3063, 0x495a2dd4, 0x44190b0d, 0x40d816ba,
0xaca5c697, 0xa864db20, 0xa527fdf9, 0xa1e6e04e, 0xbfa1b04b,
0xbb60adfc, 0xb6238b25, 0xb2e29692, 0x8aad2b2f, 0x8e6c3698,
0x832f1041, 0x87ee0df6, 0x99a95df3, 0x9d684044, 0x902b669d,
0x94ea7b2a, 0xe0b41de7, 0xe4750050, 0xe9362689, 0xedf73b3e,
0xf3b06b3b, 0xf771768c, 0xfa325055, 0xfef34de2, 0xc6bcf05f,
0xc27dede8, 0xcf3ecb31, 0xcbffd686, 0xd5b88683, 0xd1799b34,
0xdc3abded, 0xd8fba05a, 0x690ce0ee, 0x6dcdfd59, 0x608edb80,
0x644fc637, 0x7a089632, 0x7ec98b85, 0x738aad5c, 0x774bb0eb,
0x4f040d56, 0x4bc510e1, 0x46863638, 0x42472b8f, 0x5c007b8a,
0x58c1663d, 0x558240e4, 0x51435d53, 0x251d3b9e, 0x21dc2629,
0x2c9f00f0, 0x285e1d47, 0x36194d42, 0x32d850f5, 0x3f9b762c,
0x3b5a6b9b, 0x0315d626, 0x07d4cb91, 0x0a97ed48, 0x0e56f0ff,
0x1011a0fa, 0x14d0bd4d, 0x19939b94, 0x1d528623, 0xf12f560e,
0xf5ee4bb9, 0xf8ad6d60, 0xfc6c70d7, 0xe22b20d2, 0xe6ea3d65,
0xeba91bbc, 0xef68060b, 0xd727bbb6, 0xd3e6a601, 0xdea580d8,
0xda649d6f, 0xc423cd6a, 0xc0e2d0dd, 0xcda1f604, 0xc960ebb3,
0xbd3e8d7e, 0xb9ff90c9, 0xb4bcb610, 0xb07daba7, 0xae3afba2,
0xaafbe615, 0xa7b8c0cc, 0xa379dd7b, 0x9b3660c6, 0x9ff77d71,
0x92b45ba8, 0x9675461f, 0x8832161a, 0x8cf30bad, 0x81b02d74,
0x857130c3, 0x5d8a9099, 0x594b8d2e, 0x5408abf7, 0x50c9b640,
0x4e8ee645, 0x4a4ffbf2, 0x470cdd2b, 0x43cdc09c, 0x7b827d21,
0x7f436096, 0x7200464f, 0x76c15bf8, 0x68860bfd, 0x6c47164a,
0x61043093, 0x65c52d24, 0x119b4be9, 0x155a565e, 0x18197087,
0x1cd86d30, 0x029f3d35, 0x065e2082, 0x0b1d065b, 0x0fdc1bec,
0x3793a651, 0x3352bbe6, 0x3e119d3f, 0x3ad08088, 0x2497d08d,
0x2056cd3a, 0x2d15ebe3, 0x29d4f654, 0xc5a92679, 0xc1683bce,
0xcc2b1d17, 0xc8ea00a0, 0xd6ad50a5, 0xd26c4d12, 0xdf2f6bcb,
0xdbee767c, 0xe3a1cbc1, 0xe760d676, 0xea23f0af, 0xeee2ed18,
0xf0a5bd1d, 0xf464a0aa, 0xf9278673, 0xfde69bc4, 0x89b8fd09,
0x8d79e0be, 0x803ac667, 0x84fbdbd0, 0x9abc8bd5, 0x9e7d9662,
0x933eb0bb, 0x97ffad0c, 0xafb010b1, 0xab710d06, 0xa6322bdf,
0xa2f33668, 0xbcb4666d, 0xb8757bda, 0xb5365d03, 0xb1f740b4
};
typedef struct crc32ctx
{
uint32_t crc;
uint32_t length;
} CRC32Ctx;
#define COMPUTE(var, ch) (var) = (var) << 8 ^ crctab[(var) >> 24 ^ (ch)]
void crc32_stream_init( CRC32Ctx* ctx )
{
ctx->crc = 0;
ctx->length = 0;
}
void crc32_stream_compute_uint32( CRC32Ctx* ctx, uint32_t data )
{
COMPUTE( ctx->crc, data & 0xFF );
COMPUTE( ctx->crc, ( data >> 8 ) & 0xFF );
COMPUTE( ctx->crc, ( data >> 16 ) & 0xFF );
COMPUTE( ctx->crc, ( data >> 24 ) & 0xFF );
ctx->length += 4;
}
void crc32_stream_compute_uint8( CRC32Ctx* ctx, uint8_t data )
{
COMPUTE( ctx->crc, data );
ctx->length++;
}
void crc32_stream_finilize( CRC32Ctx* ctx )
{
uint32_t len = ctx->length;
for( ; len != 0; len >>= 8 )
{
COMPUTE( ctx->crc, len & 0xFF );
}
ctx->crc = ~ctx->crc;
}
/*** pseudo code ***/
CRC32Ctx crc;
crc32_stream_init(&crc);
while((just_received_buffer_len = received_anything()))
{
for(int i = 0; i < just_received_buffer_len; i++)
{
crc32_stream_compute_uint8(&crc, buf[i]); // assuming buf is uint8_t*
}
}
crc32_stream_finilize(&crc);
printf("%x", crc.crc); // ta daaa
adler32, available in the zlib headers, is advertised as being significantly faster than crc32, while being only slightly less accurate.
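A minimal sketch (in C++, assuming zlib is installed and linked with -lz) of computing both zlib checksums on an in-memory buffer, such as an image just received over the network:

#include <zlib.h>
#include <cstdio>

int main() {
  const unsigned char buf[] = "example image bytes already in memory";
  const uInt len = sizeof(buf) - 1;

  uLong adler = adler32(0L, Z_NULL, 0);   // zlib's initial Adler-32 value
  adler = adler32(adler, buf, len);       // fold in the buffer

  uLong crc = crc32(0L, Z_NULL, 0);       // same pattern for CRC-32
  crc = crc32(crc, buf, len);

  std::printf("adler32=%08lx crc32=%08lx\n", adler, crc);
  return 0;
}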
CRC32 is probably good enough, although there's a small chance you might get a collision, such that a file that has been modified might look like it hasn't been, because the two versions generate the same checksum. To avoid this possibility I'd therefore suggest using MD5, which will easily be fast enough, and the chance of a collision occurring is reduced to the point of being almost infinitesimal.
As others have said, with lots of small files your real performance bottleneck is going to be I/O, so the issue is dealing with that. If you post a few more details, somebody will probably suggest a way of sorting that out as well.
Your most important requirement is "to check if the content changed".
If it is most important that ANY change in the file be detected, MD5, SHA-1 or even SHA-256 should be your choice.
Given that you indicated that the checksum need NOT be cryptographically strong, I would recommend CRC-32 for three reasons. CRC-32 gives good Hamming distances over an 8 kB file. CRC-32 will be at least an order of magnitude faster than MD5 to calculate (your second requirement). Sometimes just as important, CRC-32 only requires 32 bits to store the value to be compared; MD5 requires 4 times the storage and SHA-1 requires 5 times the storage.
BTW, any technique will be strengthened by prepending the length of the file when calculating the hash.
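A minimal sketch of that length-prepending idea (again in C++ with zlib as an assumption, not from the answer):

#include <zlib.h>
#include <cstdint>

// Fold the length into the checksum before the content, so the file length
// contributes to the value, as suggested above.
uLong checksumWithLength(const unsigned char* data, uint32_t len) {
  unsigned char lenBytes[4] = {
    static_cast<unsigned char>(len),
    static_cast<unsigned char>(len >> 8),
    static_cast<unsigned char>(len >> 16),
    static_cast<unsigned char>(len >> 24)};
  uLong crc = crc32(0L, Z_NULL, 0);
  crc = crc32(crc, lenBytes, 4);   // prepend the length
  crc = crc32(crc, data, len);     // then the content
  return crc;
}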
According to the Wiki page pointed to by Luke, MD5 is actually faster than CRC32!
I have tried this myself by using Python 2.6 on Windows Vista, and got the same result.
Here are some results:
crc32: 162.481544276 MBps
md5: 224.489791549 MBps
crc32: 168.332996575 MBps
md5: 226.089336532 MBps
crc32: 155.851515828 MBps
md5: 194.943289532 MBps
I am thinking about the same question as well, and I'm tempted to use rsync's variation of Adler-32 for detecting file differences.
Just a postscript to the above: JPEGs use lossy compression, and the extent of the compression may depend upon the program used to create the JPEG, the colour palette and/or bit depth on the system, display gamma, graphics card, and user-set compression levels/colour settings. Therefore, comparing JPEGs built on different computers/platforms or with different software will be very difficult at the byte level.
This is 5 times faster than CCITT and does exactly the same job:
Python:
def crc16_fast(data: bytearray, length):
    crc = 0xCACA
    for i in range(length):
        crc ^= data[i]
    return crc
C:
uint16_t crc16_fast(const uint16_t* data, size_t length)
{
uint16_t crc = 0xCACA;
for (size_t i = 0; i < length; i++)
crc ^= data[i];
return crc;
}
