Understanding crc32 - go

I am trying to understand how CRC-32 is computed. It's new to me, so this is a basic question. The following code computes the CRC-32 checksum in two different ways. In theory the results should be the same, but they differ. What am I doing wrong?
The Go stdlib implementation (no surprise) seems to be correct, but I can't find the error in my implementation.
https://play.golang.org/p/QJH2K3IQEj
package main
import (
	"fmt"
	"hash/crc32"
)
func main() {
	message := []byte("hello")
	// Is this the correct polynomial table? This is the table from
	// http://gnuradio.org/redmine/projects/gnuradio/repository/revisions/1cb52da49230c64c3719b4ab944ba1cf5a9abb92/entry/gr-digital/lib/digital_crc32.cc
	tbl := [256]uint32{0x00000000, 0x04C11DB7, 0x09823B6E, 0x0D4326D9,
0x130476DC, 0x17C56B6B, 0x1A864DB2, 0x1E475005,
0x2608EDB8, 0x22C9F00F, 0x2F8AD6D6, 0x2B4BCB61,
0x350C9B64, 0x31CD86D3, 0x3C8EA00A, 0x384FBDBD,
0x4C11DB70, 0x48D0C6C7, 0x4593E01E, 0x4152FDA9,
0x5F15ADAC, 0x5BD4B01B, 0x569796C2, 0x52568B75,
0x6A1936C8, 0x6ED82B7F, 0x639B0DA6, 0x675A1011,
0x791D4014, 0x7DDC5DA3, 0x709F7B7A, 0x745E66CD,
0x9823B6E0, 0x9CE2AB57, 0x91A18D8E, 0x95609039,
0x8B27C03C, 0x8FE6DD8B, 0x82A5FB52, 0x8664E6E5,
0xBE2B5B58, 0xBAEA46EF, 0xB7A96036, 0xB3687D81,
0xAD2F2D84, 0xA9EE3033, 0xA4AD16EA, 0xA06C0B5D,
0xD4326D90, 0xD0F37027, 0xDDB056FE, 0xD9714B49,
0xC7361B4C, 0xC3F706FB, 0xCEB42022, 0xCA753D95,
0xF23A8028, 0xF6FB9D9F, 0xFBB8BB46, 0xFF79A6F1,
0xE13EF6F4, 0xE5FFEB43, 0xE8BCCD9A, 0xEC7DD02D,
0x34867077, 0x30476DC0, 0x3D044B19, 0x39C556AE,
0x278206AB, 0x23431B1C, 0x2E003DC5, 0x2AC12072,
0x128E9DCF, 0x164F8078, 0x1B0CA6A1, 0x1FCDBB16,
0x018AEB13, 0x054BF6A4, 0x0808D07D, 0x0CC9CDCA,
0x7897AB07, 0x7C56B6B0, 0x71159069, 0x75D48DDE,
0x6B93DDDB, 0x6F52C06C, 0x6211E6B5, 0x66D0FB02,
0x5E9F46BF, 0x5A5E5B08, 0x571D7DD1, 0x53DC6066,
0x4D9B3063, 0x495A2DD4, 0x44190B0D, 0x40D816BA,
0xACA5C697, 0xA864DB20, 0xA527FDF9, 0xA1E6E04E,
0xBFA1B04B, 0xBB60ADFC, 0xB6238B25, 0xB2E29692,
0x8AAD2B2F, 0x8E6C3698, 0x832F1041, 0x87EE0DF6,
0x99A95DF3, 0x9D684044, 0x902B669D, 0x94EA7B2A,
0xE0B41DE7, 0xE4750050, 0xE9362689, 0xEDF73B3E,
0xF3B06B3B, 0xF771768C, 0xFA325055, 0xFEF34DE2,
0xC6BCF05F, 0xC27DEDE8, 0xCF3ECB31, 0xCBFFD686,
0xD5B88683, 0xD1799B34, 0xDC3ABDED, 0xD8FBA05A,
0x690CE0EE, 0x6DCDFD59, 0x608EDB80, 0x644FC637,
0x7A089632, 0x7EC98B85, 0x738AAD5C, 0x774BB0EB,
0x4F040D56, 0x4BC510E1, 0x46863638, 0x42472B8F,
0x5C007B8A, 0x58C1663D, 0x558240E4, 0x51435D53,
0x251D3B9E, 0x21DC2629, 0x2C9F00F0, 0x285E1D47,
0x36194D42, 0x32D850F5, 0x3F9B762C, 0x3B5A6B9B,
0x0315D626, 0x07D4CB91, 0x0A97ED48, 0x0E56F0FF,
0x1011A0FA, 0x14D0BD4D, 0x19939B94, 0x1D528623,
0xF12F560E, 0xF5EE4BB9, 0xF8AD6D60, 0xFC6C70D7,
0xE22B20D2, 0xE6EA3D65, 0xEBA91BBC, 0xEF68060B,
0xD727BBB6, 0xD3E6A601, 0xDEA580D8, 0xDA649D6F,
0xC423CD6A, 0xC0E2D0DD, 0xCDA1F604, 0xC960EBB3,
0xBD3E8D7E, 0xB9FF90C9, 0xB4BCB610, 0xB07DABA7,
0xAE3AFBA2, 0xAAFBE615, 0xA7B8C0CC, 0xA379DD7B,
0x9B3660C6, 0x9FF77D71, 0x92B45BA8, 0x9675461F,
0x8832161A, 0x8CF30BAD, 0x81B02D74, 0x857130C3,
0x5D8A9099, 0x594B8D2E, 0x5408ABF7, 0x50C9B640,
0x4E8EE645, 0x4A4FFBF2, 0x470CDD2B, 0x43CDC09C,
0x7B827D21, 0x7F436096, 0x7200464F, 0x76C15BF8,
0x68860BFD, 0x6C47164A, 0x61043093, 0x65C52D24,
0x119B4BE9, 0x155A565E, 0x18197087, 0x1CD86D30,
0x029F3D35, 0x065E2082, 0x0B1D065B, 0x0FDC1BEC,
0x3793A651, 0x3352BBE6, 0x3E119D3F, 0x3AD08088,
0x2497D08D, 0x2056CD3A, 0x2D15EBE3, 0x29D4F654,
0xC5A92679, 0xC1683BCE, 0xCC2B1D17, 0xC8EA00A0,
0xD6AD50A5, 0xD26C4D12, 0xDF2F6BCB, 0xDBEE767C,
0xE3A1CBC1, 0xE760D676, 0xEA23F0AF, 0xEEE2ED18,
0xF0A5BD1D, 0xF464A0AA, 0xF9278673, 0xFDE69BC4,
0x89B8FD09, 0x8D79E0BE, 0x803AC667, 0x84FBDBD0,
0x9ABC8BD5, 0x9E7D9662, 0x933EB0BB, 0x97FFAD0C,
0xAFB010B1, 0xAB710D06, 0xA6322BDF, 0xA2F33668,
0xBCB4666D, 0xB8757BDA, 0xB5365D03, 0xB1F740B4}
	crc := uint32(0xffffffff)
	// different result
	for _, v := range message {
		crc = tbl[v^(byte(crc>>24)&0xff)] ^ (crc << 8)
	}
	crc = ^crc
	fmt.Printf("%10d == 0x%x\n", crc, crc) // 422667581 == 0x1931653d
	// same as http://zorc.breitbandkatze.de/crc.html
	chk := crc32.ChecksumIEEE(message)
	fmt.Printf("%10d == 0x%x\n", chk, chk) // 907060870 == 0x3610a686
}
Edit (from a comment below): I understand that the source code of Go's crc32 and mine are different. But why is my implementation giving a different result? The table I am using is built from the same polynomial (0x04C11DB7, from the Wikipedia page for CRC-32) as Go's implementation, except that Go uses the reversed polynomial, so a different algorithm is not surprising. "My" algorithm comes from various C/C++ sources, such as the one linked in the code.

First, it seems that you are not using the right table (or at least not the same one as the Go standard library): you can check this simply by comparing your table to the one available in the crc32 package. Here is a playground example.
Second, your algorithm does not match what the standard library does either. To be honest, I didn't even try to check whether you put the right values in the right places; I simply dug out the code used by crc32.Checksum. Here is a fixed version of your algorithm:
crc := uint32(0xffffffff)
for _, v := range message {
	crc = tbl[byte(crc)^v] ^ (crc >> 8)
}
return ^crc
As you can see, the in-loop calculation is a bit different. You can see it in action with the crc32.IEEE polynomial table here.
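The difference between the two loops comes down to bit order. The table and loop in the question process bits most-significant first with the polynomial 0x04C11DB7, while crc32.IEEE uses the bit-reversed polynomial 0xEDB88320 and shifts to the right. The following is a minimal sketch of both variants, written in Python rather than Go; the helper names (make_table_msb, make_table_lsb, crc32_msb_first, crc32_lsb_first) are made up for this illustration, and zlib.crc32 serves as the reference for the IEEE result.

```python
import zlib

POLY_MSB = 0x04C11DB7  # polynomial as it appears in the question's table
POLY_LSB = 0xEDB88320  # the same polynomial with its 32 bits reversed


def make_table_msb():
    """Build the 256-entry table for the MSB-first (non-reflected) loop."""
    table = []
    for n in range(256):
        c = n << 24
        for _ in range(8):
            if c & 0x80000000:
                c = ((c << 1) ^ POLY_MSB) & 0xFFFFFFFF
            else:
                c = (c << 1) & 0xFFFFFFFF
        table.append(c)
    return table


def make_table_lsb():
    """Build the 256-entry table for the LSB-first (reflected) loop."""
    table = []
    for n in range(256):
        c = n
        for _ in range(8):
            c = ((c >> 1) ^ POLY_LSB) if (c & 1) else (c >> 1)
        table.append(c)
    return table


def crc32_msb_first(data, table):
    """Same loop shape as the question: index with the top byte, shift left."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = table[((crc >> 24) ^ b) & 0xFF] ^ ((crc << 8) & 0xFFFFFFFF)
    return crc ^ 0xFFFFFFFF


def crc32_lsb_first(data, table):
    """Same loop shape as the fixed version: index with the low byte, shift right."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = table[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    msg = b"hello"
    print(hex(crc32_msb_first(msg, make_table_msb())))  # MSB-first variant
    print(hex(crc32_lsb_first(msg, make_table_lsb())))  # reflected variant
    print(hex(zlib.crc32(msg)))                         # reference value
```

Each loop only works with the table built for its own bit order. The reflected variant reproduces the IEEE checksum; the MSB-first variant with the same initial value and final inversion is, as far as I know, the parameter set usually catalogued as CRC-32/BZIP2, which is a valid but different checksum.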

Related

Mathematica Series and Solve function

This is my first Mathematica code.
I defined the functions:
\[Beta] := v/c
\[Gamma] := 1/Sqrt[1 - \[Beta]^2]
TotalE[\[Gamma][\[Beta]]] := \[Gamma]mc^2
KE := TotalE[\[Gamma][\[Beta]]] - mc^2
Now I want to make a series expansion of KE at β → 0 up to order 2.
I tried:
Series[KE, {\[Beta], 1, 2}]
But I got the error message:
General::ivar: v/c is not a valid variable.
I also wanted to define Ekin as a function of β,
so I used the Solve function to get the inverse function, β[Ekin]:
Solve[KE, \[Beta]]
The same error arises again:
Solve::ivar: v/c is not a valid variable.
Try this
Clear[\[Gamma],\[Beta],mc,KE,s,v,c]
\[Gamma] = 1/Sqrt[1 - \[Beta]^2];
TotalE[\[Gamma]*\[Beta]] = \[Gamma]*mc^2;
KE = TotalE[\[Gamma]*\[Beta]] - mc^2;
s=Normal[Series[KE, {\[Beta], 1, 2}]]/.\[Beta]->v/c
Reduce[KE==0, \[Beta]]/.\[Beta]->v/c
which returns
O-mc^2 + mc^2/(Sqrt[2]*Sqrt[1 - v/c]) -
(mc^2*(-1 + v/c))/(4*Sqrt[2]*Sqrt[1 - v/c]) +
(3*mc^2*(-1 + v/c)^2)/(32*Sqrt[2]*Sqrt[1 - v/c])
and
(mc != 0 && v/c == 0)||(-1+v^2/c^2 !=0 && mc == 0)
The idea is to do your calculations with the plain variable β first, and only replace β with v/c after the calculations are done.
But there are still things about the way you have written that which worry me. You are kind of writing TotalE like it is a function, but that is not the way to define a Mathematica function and I am concerned this may be going to get you into trouble.
Please let me know if I have misunderstood some of what you are trying to do and explain what I've done wrong and I will try to find a way to fix that.
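Note that the Series call above expands around β = 1, exactly as written in the question. For comparison, here is a sketch of the expansion around β = 0 that the question describes, worked by hand with the binomial series and reading mc^2 as m c^2 (an assumption about what the symbol mc is meant to stand for):

```latex
\gamma = \left(1-\beta^{2}\right)^{-1/2}
       = 1 + \tfrac{1}{2}\beta^{2} + \tfrac{3}{8}\beta^{4} + O(\beta^{6})

KE = (\gamma - 1)\, m c^{2}
   = \tfrac{1}{2}\, m c^{2}\beta^{2} + \tfrac{3}{8}\, m c^{2}\beta^{4} + O(\beta^{6})

\text{to order 2, with } \beta = v/c:\qquad KE \approx \tfrac{1}{2}\, m v^{2}
```

In Mathematica terms this corresponds to using {\[Beta], 0, 2} as the expansion specification instead of {\[Beta], 1, 2}.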

How to get same cci values from Trading view in Golang?

I'm trying to replicate the values of the Pine Script cci() function in Go. I found this library: https://github.com/markcheno/go-talib/blob/master/talib.go#L1821
but it gives totally different values than the cci() function does.
Pseudocode showing how I use the library:
cci := talib.Cci(latest14CandlesHighArray, latest14CandlesLowArray, latest14CandlesCloseArray, 14)
The library gives me the following data:
Timestamp: 2021-05-22 18:59:27.675, Symbol: BTCUSDT, Interval: 5m, Open: 38193.78000000, Close: 38122.16000000, High: 38283.55000000, Low: 38067.92000000, StartTime: 2021-05-22 18:55:00.000, EndTime: 2021-05-22 18:59:59.999, Sma: 38091.41020000, Cci0: -16.63898084, Cci1: -53.92565811,
while the current CCI values on TradingView are: cci0: -136, cci1: -49
Could anyone point out what I am missing?
Thank you
P.S. cci0 - current candle cci, cci1 - previous candle cci
Pine Script has a really great reference for looking up functions, usually even supplying the Pine code to recreate them.
https://www.tradingview.com/pine-script-reference/v4/#fun_cci
The code wasn't provided for cci, but a step-by-step explanation was.
Here is how I managed to recreate the cci function using Pine, following the steps in the reference:
// This source code is subject to the terms of the Mozilla Public License 2.0 at https://mozilla.org/MPL/2.0/
// © bajaco
//@version=4
study("CCI Breakdown", overlay=false, precision=16)
cci_breakdown(src, p) =>
    // The CCI (commodity channel index) is calculated as the
    // 1. difference between the typical price of a commodity and its simple moving average,
    // divided by the
    // 2. mean absolute deviation of the typical price.
    // 3. The index is scaled by an inverse factor of 0.015
    // to provide more readable numbers
    // 1. diff
    ma = sma(src,p)
    diff = src - ma
    // 2. mad
    s = 0.0
    for i = 0 to p - 1
        s := s + abs(src[i] - ma)
    mad = s / p
    // 3. Scaling
    mcci = diff/mad / 0.015
    mcci
plot(cci(close, 100))
plot(cci_breakdown(close,100))
I didn't know what mean absolute deviation meant, but at least in their implementation it appears to be taking the difference from the mean for each value in the range, but NOT changing the mean value as you go back.
I don't know Go but that's the logic.
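The same three steps can be sketched outside Pine. The following is a minimal Python version, not a port of go-talib and not TradingView's own code; the function name cci_last is made up for this illustration, and returning NaN for a flat window is an assumption. Which series to pass as src (the close, or the typical price (high + low + close) / 3) is left to the caller.

```python
import math


def cci_last(src, period):
    """CCI of the most recent value in src, following the three steps above."""
    if len(src) < period:
        raise ValueError("need at least `period` values")
    window = src[-period:]
    # 1. difference between the latest value and the simple moving average
    ma = sum(window) / period
    diff = src[-1] - ma
    # 2. mean absolute deviation, measured against that single mean
    #    (the mean is NOT recomputed while stepping back through the window)
    mad = sum(abs(x - ma) for x in window) / period
    if mad == 0:
        return math.nan  # assumption: a flat window has no defined CCI
    # 3. scale by the inverse factor 0.015
    return diff / mad / 0.015


if __name__ == "__main__":
    print(cci_last([1.0, 2.0, 3.0, 4.0, 5.0], 5))
```

Computing the previous candle's value (cci1 in the question) would mean calling the function on the series with the last element dropped.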

Robust Standard Errors in lm() using stargazer()

I have read a lot about the pain of replicating Stata's easy robust option in R to get robust standard errors. I replicated the following approaches: StackExchange and Economic Theory Blog. They work, but the problem I face is printing my results with the stargazer function (which produces the .tex code for LaTeX files).
Here is an illustration of my problem:
reg1 <-lm(rev~id + source + listed + country , data=data2_rev)
stargazer(reg1)
This prints the R output as .tex code (non-robust SE). If I want to use robust SE, I can do it with the sandwich package as follows:
vcov <- vcovHC(reg1, "HC1")
If I now use stargazer(vcov), only the output of the vcovHC function is printed and not the regression output itself.
With the lmtest package it is possible to print at least the estimates, but not the observations, R2, adj. R2, residual standard error and the F-statistic.
lmtest::coeftest(reg1, vcov. = sandwich::vcovHC(reg1, type = 'HC1'))
This gives the following output:
t test of coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.54923    6.85521 -0.3719 0.710611
id           0.39634    0.12376  3.2026 0.001722 **
source       1.48164    4.20183  0.3526 0.724960
country     -4.00398    4.00256 -1.0004 0.319041
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
How can I add or get an output with the following parameters as well?
Residual standard error: 17.43 on 127 degrees of freedom
Multiple R-squared: 0.09676, Adjusted R-squared: 0.07543
F-statistic: 4.535 on 3 and 127 DF, p-value: 0.00469
Did anybody face the same problem and can help me out?
How can I use robust standard errors in the lm function and apply the stargazer function?
You already calculated robust standard errors, and there's an easy way to include them in the stargazer output:
library("sandwich")
library("plm")
library("stargazer")
data("Produc", package = "plm")
# Regression
model <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
             data = Produc,
             index = c("state", "year"),
             model = "pooling")  # plm selects the estimator via `model`
# Adjust standard errors
cov1 <- vcovHC(model, type = "HC1")
robust_se <- sqrt(diag(cov1))
# Stargazer output (with and without RSE)
stargazer(model, model, type = "text",
          se = list(NULL, robust_se))
Solution found here: https://www.jakeruss.com/cheatsheets/stargazer/#robust-standard-errors-replicating-statas-robust-option
Update: I'm not so much into F-tests. People are discussing those issues, e.g. https://stats.stackexchange.com/questions/93787/f-test-formula-under-robust-standard-error
Why not follow http://www3.grips.ac.jp/~yamanota/Lecture_Note_9_Heteroskedasticity
"A heteroskedasticity-robust t statistic can be obtained by dividing an OSL estimator by its robust standard error (for zero null hypotheses). The usual F-statistic, however, is invalid. Instead, we need to use the heteroskedasticity-robust Wald statistic."
and use a Wald statistic here?
This is a fairly simple solution using coeftest:
library(lmtest)
library(sandwich)
library(stargazer)
reg1 <- lm(rev ~ id + source + listed + country, data = data2_rev)
cl_robust <- coeftest(reg1, vcov = vcovCL, type = "HC1", cluster = ~country)
se_robust <- cl_robust[, 2]
stargazer(reg1, reg1, cl_robust, se = list(NULL, se_robust, NULL))
Note that I only included cl_robust in the output as a verification that the results are identical.

Setting up Rcpp Armadillo in windows with Rstudio

I am trying to set up RcppArmadillo on my Windows system with RStudio. I have successfully installed RcppArmadillo with the command
install.packages("RcppArmadillo")
in the R console.
But when I try to compile C++ code that depends on RcppArmadillo, I get an error like
g++ -m64 -I"C:/PROGRA~1/R/R-30~1.3/include" -DNDEBUG -I"C:/PROGRA~1/R/R-30~1.3/library/Rcpp/include" -I"d:/RCompile/CRANpkg/extralibs64/local/include" -O2 -Wall -mtune=core2 -c colrowStat.cpp -o colrowStat.o
colrowStat.cpp:5:26: fatal error: RcppArmadillo.h: No such file or directory
compilation terminated.
make: *** [colrowStat.o] Error 1
Warning message:
running command 'make -f "C:/PROGRA~1/R/R-30~1.3/etc/x64/Makeconf" -f "C:/PROGRA~1/R/R-30~1.3/share/make/winshlib.mk" SHLIB_LDFLAGS='$(SHLIB_CXXLDFLAGS)' SHLIB_LD='$(SHLIB_CXXLD)' SHLIB="sourceCpp_38187.dll" WIN=64 TCLBIN=64 OBJECTS="colrowStat.o"' had status 2
But the header files are available in path_to_my_documents/R/win-libraries/3.0/RcppArmadillo/Include
I think the include path used for compilation does not contain this directory, and I don't know how to add this folder to the path. I would greatly appreciate any help with this problem.
You are doing it wrong. There are many ways to do it, and we documented several of them. What you do here is not one of them.
Try this instead and go from there:
R> library(Rcpp)
R> cppFunction("arma::mat op(arma::vec x) { return(x*x.t()); }",
+ depends="RcppArmadillo")
R> op(1:2)
     [,1] [,2]
[1,]    1    2
[2,]    2    4
R>
This is one of the basic examples: take a vector, multiply it by its transpose and return the resulting outer-product matrix.
What you ultimately want is a package, and for that you could do much worse than starting with RcppArmadillo.package.skeleton().
Your question is short on details, but if you are on a Windows machine and are using RStudio, then here is a fully reproducible example on how to use RcppArmadillo without using the inline package, which is not ideal except for very short functions.
As Dirk has pointed out, this advice is available elsewhere -- the Rcpp* ecosystem is bizarrely well-documented, but this might help a novice.
0. Preliminaries:
You should have the following installed:
Rtools
R package devtools
R package Rcpp
R package RcppArmadillo
1. C++ code:
The example is a simple one: computing the OLS estimator for a linear regression model. The C++ file contains one function (fnLinRegRcpp) that takes the response vector and the design matrix as inputs and returns the OLS coefficient estimates and the model residuals as an Rcpp List. Here is what it looks like:
// LinearRegression.cpp
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace arma; // use the Armadillo library for matrix computations
using namespace Rcpp;
// [[Rcpp::export]]
List fnLinRegRcpp(vec vY, mat mX) {
    // compute the OLS estimator & model residuals
    vec vBeta = solve(mX.t()*mX, mX.t()*vY);
    vec vResid = vY - mX * vBeta;
    // construct the return object
    List ret;
    ret["beta"] = vBeta;
    ret["resid"] = vResid;
    return ret;
}
// END
Note the use of Rcpp attributes:
// [[Rcpp::depends(RcppArmadillo)]]
to indicate library dependencies on the Armadillo library.
2. R code
Here is an example of the compilation of the C++ code using the sourceCpp function, together with an example of the use of the function, and a comparison of the output to the built-in lm.fit function.
# LinearRegression.R
library(devtools)
library(Rcpp)
library(RcppArmadillo)
Rcpp::sourceCpp("code/LinearRegression.cpp",
                showOutput = TRUE,
                rebuild = FALSE)
# generate some sample data
iK = 4
iN = 100
mX = cbind(1, matrix(rnorm(iK*iN), iN, iK))
vBeta0 = c(2, 3.5, 0.11, 6.33, 23)
vY = rnorm(iN, mean = mX %*% vBeta0)
# test the function
linReg1 = fnLinRegRcpp(vY, mX)
linReg1$beta # coefficient estimates
# compare the results to the built-in lm.fit function
lm.fit(y = vY, x = mX)$coef # coefficient estimates
# END

Robust and fast checksum algorithm?

Which checksum algorithm can you recommend in the following use case?
I want to generate checksums of small JPEG files (~8 kB each) to check if the content changed. Using the filesystem's date modified is unfortunately not an option.
The checksum need not be cryptographically strong but it should robustly indicate changes of any size.
The second criterion is speed since it should be possible to process at least hundreds of images per second (on a modern CPU).
The calculation will be done on a server with several clients. The clients send the images over Gigabit TCP to the server. So there's no disk I/O as bottleneck.
If you have many small files, your bottleneck is going to be file I/O and probably not a checksum algorithm.
A list of hash functions (which can be thought of as a checksum) can be found here.
Is there any reason you can't use the filesystem's date modified to determine if a file has changed? That would probably be faster.
There are lots of fast CRC algorithms that should do the trick:
http://www.google.com/search?hl=en&q=fast+crc&aq=f&oq=
Edit: Why the hate? CRC is totally appropriate, as evidenced by the other answers. A Google search was also appropriate, since no language was specified. This is an old, old problem which has been solved so many times that there isn't likely to be a definitive answer.
CRC-32 comes to mind mainly because it's cheap to calculate.
Any kind of I/O comes to mind mainly because it will be the limiting factor for such an undertaking ;)
The problem is not calculating the checksums; the problem is getting the images into memory to calculate the checksum.
I would suggest "staged" monitoring:
stage 1: check for changes of file timestamps and if you detect a change there hand over to...(not needed in your case as described in the edited version)
stage 2: get the image into memory and calculate the checksum
Also important: multi-threading. Set up a pipeline that processes several images in parallel if several CPU cores are available.
If you are receiving the files over the network, you can calculate the checksum as you receive each file. That way the checksum is calculated while the data is already in memory, so you won't have to load it into memory from disk.
I believe that if you apply this method, you'll see almost zero overhead on your system.
These are the routines I'm using on an embedded system that does checksum verification of firmware and other data.
#include <stdint.h>
#include <stdio.h>
static const uint32_t crctab[] = {
0x0,
0x04c11db7, 0x09823b6e, 0x0d4326d9, 0x130476dc, 0x17c56b6b,
0x1a864db2, 0x1e475005, 0x2608edb8, 0x22c9f00f, 0x2f8ad6d6,
0x2b4bcb61, 0x350c9b64, 0x31cd86d3, 0x3c8ea00a, 0x384fbdbd,
0x4c11db70, 0x48d0c6c7, 0x4593e01e, 0x4152fda9, 0x5f15adac,
0x5bd4b01b, 0x569796c2, 0x52568b75, 0x6a1936c8, 0x6ed82b7f,
0x639b0da6, 0x675a1011, 0x791d4014, 0x7ddc5da3, 0x709f7b7a,
0x745e66cd, 0x9823b6e0, 0x9ce2ab57, 0x91a18d8e, 0x95609039,
0x8b27c03c, 0x8fe6dd8b, 0x82a5fb52, 0x8664e6e5, 0xbe2b5b58,
0xbaea46ef, 0xb7a96036, 0xb3687d81, 0xad2f2d84, 0xa9ee3033,
0xa4ad16ea, 0xa06c0b5d, 0xd4326d90, 0xd0f37027, 0xddb056fe,
0xd9714b49, 0xc7361b4c, 0xc3f706fb, 0xceb42022, 0xca753d95,
0xf23a8028, 0xf6fb9d9f, 0xfbb8bb46, 0xff79a6f1, 0xe13ef6f4,
0xe5ffeb43, 0xe8bccd9a, 0xec7dd02d, 0x34867077, 0x30476dc0,
0x3d044b19, 0x39c556ae, 0x278206ab, 0x23431b1c, 0x2e003dc5,
0x2ac12072, 0x128e9dcf, 0x164f8078, 0x1b0ca6a1, 0x1fcdbb16,
0x018aeb13, 0x054bf6a4, 0x0808d07d, 0x0cc9cdca, 0x7897ab07,
0x7c56b6b0, 0x71159069, 0x75d48dde, 0x6b93dddb, 0x6f52c06c,
0x6211e6b5, 0x66d0fb02, 0x5e9f46bf, 0x5a5e5b08, 0x571d7dd1,
0x53dc6066, 0x4d9b3063, 0x495a2dd4, 0x44190b0d, 0x40d816ba,
0xaca5c697, 0xa864db20, 0xa527fdf9, 0xa1e6e04e, 0xbfa1b04b,
0xbb60adfc, 0xb6238b25, 0xb2e29692, 0x8aad2b2f, 0x8e6c3698,
0x832f1041, 0x87ee0df6, 0x99a95df3, 0x9d684044, 0x902b669d,
0x94ea7b2a, 0xe0b41de7, 0xe4750050, 0xe9362689, 0xedf73b3e,
0xf3b06b3b, 0xf771768c, 0xfa325055, 0xfef34de2, 0xc6bcf05f,
0xc27dede8, 0xcf3ecb31, 0xcbffd686, 0xd5b88683, 0xd1799b34,
0xdc3abded, 0xd8fba05a, 0x690ce0ee, 0x6dcdfd59, 0x608edb80,
0x644fc637, 0x7a089632, 0x7ec98b85, 0x738aad5c, 0x774bb0eb,
0x4f040d56, 0x4bc510e1, 0x46863638, 0x42472b8f, 0x5c007b8a,
0x58c1663d, 0x558240e4, 0x51435d53, 0x251d3b9e, 0x21dc2629,
0x2c9f00f0, 0x285e1d47, 0x36194d42, 0x32d850f5, 0x3f9b762c,
0x3b5a6b9b, 0x0315d626, 0x07d4cb91, 0x0a97ed48, 0x0e56f0ff,
0x1011a0fa, 0x14d0bd4d, 0x19939b94, 0x1d528623, 0xf12f560e,
0xf5ee4bb9, 0xf8ad6d60, 0xfc6c70d7, 0xe22b20d2, 0xe6ea3d65,
0xeba91bbc, 0xef68060b, 0xd727bbb6, 0xd3e6a601, 0xdea580d8,
0xda649d6f, 0xc423cd6a, 0xc0e2d0dd, 0xcda1f604, 0xc960ebb3,
0xbd3e8d7e, 0xb9ff90c9, 0xb4bcb610, 0xb07daba7, 0xae3afba2,
0xaafbe615, 0xa7b8c0cc, 0xa379dd7b, 0x9b3660c6, 0x9ff77d71,
0x92b45ba8, 0x9675461f, 0x8832161a, 0x8cf30bad, 0x81b02d74,
0x857130c3, 0x5d8a9099, 0x594b8d2e, 0x5408abf7, 0x50c9b640,
0x4e8ee645, 0x4a4ffbf2, 0x470cdd2b, 0x43cdc09c, 0x7b827d21,
0x7f436096, 0x7200464f, 0x76c15bf8, 0x68860bfd, 0x6c47164a,
0x61043093, 0x65c52d24, 0x119b4be9, 0x155a565e, 0x18197087,
0x1cd86d30, 0x029f3d35, 0x065e2082, 0x0b1d065b, 0x0fdc1bec,
0x3793a651, 0x3352bbe6, 0x3e119d3f, 0x3ad08088, 0x2497d08d,
0x2056cd3a, 0x2d15ebe3, 0x29d4f654, 0xc5a92679, 0xc1683bce,
0xcc2b1d17, 0xc8ea00a0, 0xd6ad50a5, 0xd26c4d12, 0xdf2f6bcb,
0xdbee767c, 0xe3a1cbc1, 0xe760d676, 0xea23f0af, 0xeee2ed18,
0xf0a5bd1d, 0xf464a0aa, 0xf9278673, 0xfde69bc4, 0x89b8fd09,
0x8d79e0be, 0x803ac667, 0x84fbdbd0, 0x9abc8bd5, 0x9e7d9662,
0x933eb0bb, 0x97ffad0c, 0xafb010b1, 0xab710d06, 0xa6322bdf,
0xa2f33668, 0xbcb4666d, 0xb8757bda, 0xb5365d03, 0xb1f740b4
};
typedef struct crc32ctx
{
    uint32_t crc;
    uint32_t length;
} CRC32Ctx;
#define COMPUTE(var, ch) (var) = (var) << 8 ^ crctab[(var) >> 24 ^ (ch)]
void crc32_stream_init( CRC32Ctx* ctx )
{
    ctx->crc = 0;
    ctx->length = 0;
}
void crc32_stream_compute_uint32( CRC32Ctx* ctx, uint32_t data )
{
    COMPUTE( ctx->crc, data & 0xFF );
    COMPUTE( ctx->crc, ( data >> 8 ) & 0xFF );
    COMPUTE( ctx->crc, ( data >> 16 ) & 0xFF );
    COMPUTE( ctx->crc, ( data >> 24 ) & 0xFF );
    ctx->length += 4;
}
void crc32_stream_compute_uint8( CRC32Ctx* ctx, uint8_t data )
{
    COMPUTE( ctx->crc, data );
    ctx->length++;
}
void crc32_stream_finilize( CRC32Ctx* ctx )
{
    uint32_t len = ctx->length;
    for( ; len != 0; len >>= 8 )
    {
        COMPUTE( ctx->crc, len & 0xFF );
    }
    ctx->crc = ~ctx->crc;
}
/*** pseudo code ***/
CRC32Ctx crc;
crc32_stream_init(&crc);
while((just_received_buffer_len = received_anything()))
{
    for(int i = 0; i < just_received_buffer_len; i++)
    {
        crc32_stream_compute_uint8(&crc, buf[i]); // assuming buf is uint8_t*
    }
}
crc32_stream_finilize(&crc);
printf("%x", crc.crc); // ta daaa
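The idea of updating the checksum as chunks arrive can also be sketched in a few lines of Python. This is only an illustration under assumptions: the function name crc32_of_chunks is made up, the chunks stand in for whatever the socket delivers, and zlib.crc32 computes the standard CRC-32, not the variant above that starts from 0 and mixes the length in at the end.

```python
import zlib


def crc32_of_chunks(chunks):
    """Fold each received chunk into a running CRC-32."""
    crc = 0
    for chunk in chunks:
        # zlib.crc32 accepts the previous value as its second argument,
        # so the data never has to be held in memory as one piece.
        crc = zlib.crc32(chunk, crc)
    return crc


if __name__ == "__main__":
    # Hypothetical stand-in for buffers read from a TCP connection.
    received = [b"\xff\xd8\xff\xe0", b"...image bytes...", b"\xff\xd9"]
    print(f"{crc32_of_chunks(received):08x}")
```

The result is the same as checksumming the concatenated data in one call, so the chunk boundaries chosen by the network stack do not matter.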
CRC
adler32, available in the zlib headers, is advertised as being significantly faster than crc32, while being only slightly less accurate.
CRC32 is probably good enough, although there's a small chance of a collision, in which case a modified file might look unchanged because the two versions generate the same checksum. To avoid this possibility I'd suggest using MD5, which will easily be fast enough, and the chance of a collision is reduced to the point where it's almost infinitesimal.
As others have said, with lots of small files your real performance bottleneck is going to be I/O so the issue is dealing with that. If you post up a few more details somebody will probably suggest a way of sorting that out as well.
Your most important requirement is "to check if the content changed".
If it is most important that ANY change in the file be detected, MD-5, SHA-1 or even SHA-256 should be your choice.
Given that you indicated that the checksum need NOT be cryptographically strong, I would recommend CRC-32 for three reasons. CRC-32 gives good Hamming distances over an 8K file. CRC-32 will be at least an order of magnitude faster to calculate than MD-5 (your second requirement). Sometimes just as important, CRC-32 requires only 32 bits to store the value to be compared. MD-5 requires 4 times the storage and SHA-1 requires 5 times the storage.
BTW, any technique will be strengthened by prepending the length of the file when calculating the hash.
According to the Wiki page pointed to by Luke, MD5 is actually faster than CRC32!
I have tried this myself by using Python 2.6 on Windows Vista, and got the same result.
Here are some results:
crc32: 162.481544276 MBps
md5: 224.489791549 MBps
crc32: 168.332996575 MBps
md5: 226.089336532 MBps
crc32: 155.851515828 MBps
md5: 194.943289532 MBps
I am thinking about the same question as well, and I'm tempted to use rsync's variation of Adler-32 for detecting file differences.
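A measurement like the one above is easy to repeat on current hardware. Here is a rough sketch, assuming 8 kB random payloads as a stand-in for the small JPEG files; the helper names throughput_mbps and compare are made up for this illustration, and the numbers depend heavily on the Python and zlib builds, so no particular ranking is implied.

```python
import hashlib
import os
import time
import zlib


def throughput_mbps(func, payload, rounds=2000):
    """Return the measured throughput of func in megabytes per second."""
    start = time.perf_counter()
    for _ in range(rounds):
        func(payload)
    elapsed = time.perf_counter() - start
    return len(payload) * rounds / elapsed / 1e6


def compare(payload_size=8 * 1024):
    """Time crc32, adler32 and md5 on one random payload of payload_size bytes."""
    payload = os.urandom(payload_size)
    candidates = {
        "crc32": zlib.crc32,
        "adler32": zlib.adler32,
        "md5": lambda data: hashlib.md5(data).digest(),
    }
    return {name: throughput_mbps(func, payload) for name, func in candidates.items()}


if __name__ == "__main__":
    for name, mbps in compare().items():
        print(f"{name}: {mbps:.1f} MB/s")
```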
Just a postscript to the above: JPEGs use lossy compression, and the extent of the compression may depend on the program used to create the JPEG, the colour palette and/or bit depth on the system, display gamma, graphics card and user-set compression levels/colour settings. Therefore, comparing JPEGs built on different computers/platforms or with different software will be very difficult at the byte level.
This is 5 times faster than CCITT and does exactly the same job:
Python:
def crc16_fast(data: bytearray, length):
    crc = 0xCACA
    for i in range(length):
        crc ^= data[i]
    return crc
C:
uint16_t crc16_fast(const uint16_t* data, size_t length)
{
    uint16_t crc = 0xCACA;
    for (size_t i = 0; i < length; i++)
        crc ^= data[i];
    return crc;
}
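One caveat with a plain XOR checksum is worth checking before relying on it: XOR is order-independent, and any byte XORed in twice cancels out. Here is a small sketch using a byte-wise version of the routine above (the name xor_checksum is made up for this illustration), with zlib.crc32 for comparison:

```python
import zlib


def xor_checksum(data, seed=0xCACA):
    """Byte-wise XOR checksum with the same seed as crc16_fast above."""
    crc = seed
    for byte in data:
        crc ^= byte
    return crc


if __name__ == "__main__":
    original = b"ab"
    swapped = b"ba"  # same bytes, different order
    padded = b"\x07\x07ab"  # a repeated byte cancels itself out

    print(xor_checksum(original) == xor_checksum(swapped))  # reordering goes unnoticed
    print(xor_checksum(original) == xor_checksum(padded))   # so does the inserted pair
    print(zlib.crc32(original) == zlib.crc32(swapped))      # CRC-32 tells them apart
```

So the XOR loop is fast, but it misses whole classes of changes (reordered bytes, duplicated bytes) that a CRC detects.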

Resources