what does it mean files overflow_xxxx.bin while training glove - stanford-nlp

I'm training a word embedding model based on Glove method. While the algorith shows a logger like:
$ build/cooccur -memory 4.0 -vocab-file vocab.txt -verbose 2 -window-size 8 < /home/ignacio/data/GUsDany/corpus/GUs_regulon_pubMed.txt > cooccurrence.bin
COUNTING COOCCURRENCES
window size: 8
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "vocab.txt"...loaded 145223095 words.
Building lookup table...table contains 228170143 elements.
Processing token: 5478600000
The home directory of Glove is filled with files caled overflow_0534.bin. Can someone tell whether all is going well?
Thanks

Everything is OK.
You can view the source code of Glove cooccur program at Github.
At the line 57 of the file:
long long overflow_length; // Number of cooccurrence records whose product exceeds max_product to store in memory before writing to disk
If your corpus has too many co-occurrence records, then there will be some data to be written into some temp bin disk files.
while (1) {
if (ind >= overflow_length - window_size) { // If overflow buffer is (almost) full, sort it and write it to temporary file
qsort(cr, ind, sizeof(CREC), compare_crec);
write_chunk(cr,ind,foverflow);
fclose(foverflow);
fidcounter++;
sprintf(filename,"%s_%04d.bin",file_head,fidcounter);
foverflow = fopen(filename,"w");
ind = 0;
}
The variable overflow_length depends on your memory settings.
Line 463:
if ((i = find_arg((char *)"-memory", argc, argv)) > 0) memory_limit = atof(argv[i + 1]);
Line 467:
rlimit = 0.85 * (real)memory_limit * 1073741824/(sizeof(CREC));
Line 470:
overflow_length = (long long) rlimit/6; // 0.85 + 1/6 ~= 1

Related

Fast and furious random reading of huge files

I know that the question isn't new but I haven't found anything useful. In my case I have a 20 GB file and I need to read random lines from it. Now I have simple file index which contains line numbers and corresponding seek offsets. Also I disabled buffering when reading to read only the needed line.
And this is my code:
def create_random_file_gen(file_path, batch_size=0, dtype=np.float32, delimiter=','):
index = load_file_index(file_path)
if (batch_size > len(index)) or (batch_size == 0):
batch_size = len(index)
lines_indices = np.random.random_integers(0, len(index), batch_size)
with io.open(file_path, 'rb', buffering=0) as f:
for line_index in lines_indices:
f.seek(index[line_index])
line = f.readline(2048)
yield __get_features_from_line(line, delimiter, dtype)
The problem is that it's extremely slow: reading of 5000 lines takes 89 seconds on my Mac(here I point to ssd drive). There is code I used for testing:
features_gen = tedlium_random_speech_gen(5000) # just a wrapper for function given above
i = 0
for feature, cls in features_gen:
if i % 1000 == 0:
print("Got %d features" % i)
i += 1
print("Total %d features" % i)
I've read something about files memory mapping but I don't really understand how it works: how the mapping works in essence and will it speed up the process or no.
So the main question what are the possible ways to speed up the process? The only way I see now is to read randomly not every line but blocks of lines.

Splitting file into parts by bits

Ok, so this is a unique question.
We are getting files (daily) from a company. These files are downloaded from their servers to ours (SFTP). The company that we deal with deals with a third party provider that creates the files (and reduces their size) to make downloads faster and also reduce file-size on their servers.
We download 9 files daily from the server, 3 groups of 3 files
Each group of files consists of 2 XML files and one "image" file.
One of these XML files gives us information on the 'image' file.
Information in the XML file we need:
offset: Gives us where a section of data starts
length: Used with offset, gives us the end of that section
count: Gives us the number of elements held in the file
The 'image' file itself is unusable until we split the file into pieces based on the offset and length of each image in the file. The images are basically concatenated together. We need to extract these images to be able to view them.
An example of offset, length and count values are as follows:
offset: 0
length: 2670
offset: 2670
length: 2670
offset: 5340
length: 2670
offset: 8010
length: 2670
count: 4
This means that there are 4 (count) items. The first count item begins at offset[0] and is length[0] in length. The second item begins at offset[1] and is length[1] in length, etc.
I need to split the images at these points and these points PRECISELY without room for error. The third party provider will not provide us with the code and we are to figure this out ourselves. The image file is not readable without splitting the files and are essentially useless until then.
My question: Does anyone have a way of splitting files at a specific byte?
P.S. I do not have any code yet. I don't even know where to begin with this one. I am not new to coding, but I have never done file splitting by the byte.
I don't care which language this uses. I just need to make it work.
EDIT
The OS is Windows
You hooked me. Here's a rough Java method that can split a file based on offset and length. This requires at least Java 8.
A few of the classes used:
SeekableByteChannel
ByteBuffer
And an article I found helpful in producing this example.
/**
* Method that splits the data provided in fileToSplit into outputDirectory based on the
* collection of offsets and lengths provided in offsetAndLength.
*
* Example of input offsetAndLength:
* Long[][] data = new Long[][]{
* {0, 2670},
* {2670, 2670},
* {5340, 2670},
* {8010, 2670}
* };
*
* Output files will be placed in outputDirectory and named img0, img1... imgN
*
* #param fileToSplit
* #param outputDirectory
* #param offsetAndLength
* #throws IOException
*/
public static void split( Path fileToSplit, Path outputDirectory, Long[][] offsetAndLength ) throws IOException{
try (SeekableByteChannel sbc = Files.newByteChannel(fileToSplit, StandardOpenOption.READ )){
for(int x = 0; x < offsetAndLength.length; x++){
ByteBuffer buffer = ByteBuffer.allocate(offsetAndLength[x][4].intValue());
sbc.position(offsetAndLength[x][0]);
sbc.read(buffer);
buffer.flip();
File img = new File(outputDirectory.toFile(), "img"+x);
img.createNewFile();
try(FileChannel output = FileChannel.open(img.toPath(), StandardOpenOption.WRITE)){
output.write(buffer);
}
buffer.clear();
}
}
}
I leave parsing the XML file to you.

How does SparkContext.textFile work under the covers?

I am trying to understand the textFile method deeply, but I think my
lack of Hadoop knowledge is holding me back here. Let me lay out my
understanding and maybe you can correct anything that is incorrect
When sc.textFile(path) is called, then defaultMinPartitions is used,
which is really just math.min(taskScheduler.defaultParallelism, 2). Let's
assume we are using the SparkDeploySchedulerBackend and this is
conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(),
2))
So, now let's say the default is 2, going back to the textFile, this is
passed in to HadoopRDD. The true size is determined in getPartitions() using
inputFormat.getSplits(jobConf, minPartitions). But, from what I can find,
the partitions is merely a hint and is in fact mostly ignored, so you will
probably get the total number of blocks.
OK, this fits with expectations, however what if the default is not used and
you provide a partition size that is larger than the block size. If my
research is right and the getSplits call simply ignores this parameter, then
wouldn't the provided min end up being ignored and you would still just get
the block size?
Cross posted with the spark mailing list
Short Version:
Split size is determined by mapred.min.split.size or mapreduce.input.fileinputformat.split.minsize, if it's bigger than HDFS's blockSize, multiple blocks inside a same file would be combined into a single split.
Detailed Version:
I think you are right in understanding the procedure before inputFormat.getSplits.
Inside inputFormat.getSplits, more specifically, inside FileInputFormat's getSplits, it is mapred.min.split.size or mapreduce.input.fileinputformat.split.minsize that would at last determine split size. (I'm not sure which would be effective in Spark, I prefer to believe the former one).
Let's see the code: FileInputFormat from Hadoop 2.4.0
long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
long minSize = Math.max(job.getLong(org.apache.hadoop.mapreduce.lib.input.
FileInputFormat.SPLIT_MINSIZE, 1), minSplitSize);
// generate splits
ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
NetworkTopology clusterMap = new NetworkTopology();
for (FileStatus file: files) {
Path path = file.getPath();
long length = file.getLen();
if (length != 0) {
FileSystem fs = path.getFileSystem(job);
BlockLocation[] blkLocations;
if (file instanceof LocatedFileStatus) {
blkLocations = ((LocatedFileStatus) file).getBlockLocations();
} else {
blkLocations = fs.getFileBlockLocations(file, 0, length);
}
if (isSplitable(fs, path)) {
long blockSize = file.getBlockSize();
long splitSize = computeSplitSize(goalSize, minSize, blockSize);
long bytesRemaining = length;
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
String[] splitHosts = getSplitHosts(blkLocations,
length-bytesRemaining, splitSize, clusterMap);
splits.add(makeSplit(path, length-bytesRemaining, splitSize,
splitHosts));
bytesRemaining -= splitSize;
}
if (bytesRemaining != 0) {
String[] splitHosts = getSplitHosts(blkLocations, length
- bytesRemaining, bytesRemaining, clusterMap);
splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
splitHosts));
}
} else {
String[] splitHosts = getSplitHosts(blkLocations,0,length,clusterMap);
splits.add(makeSplit(path, 0, length, splitHosts));
}
} else {
//Create empty hosts array for zero length files
splits.add(makeSplit(path, 0, length, new String[0]));
}
}
Inside the for loop, makeSplit() is used to generate each split, and splitSize is the effective Split Size. The computeSplitSize Function to generate splitSize:
protected long computeSplitSize(long goalSize, long minSize,
long blockSize) {
return Math.max(minSize, Math.min(goalSize, blockSize));
}
Therefore, if minSplitSize > blockSize, the output splits are actually a combination of several blocks in the same HDFS file, on the other hand, if minSplitSize < blockSize, each split corresponds to a HDFS's block.
I will add more points with examples to Yijie Shen answer
Before we go into details,lets understand the following
Assume that we are working on Spark Standalone local system with 4 cores
In the application if master is configured as like below
new SparkConf().setMaster("**local[*]**") then
defaultParallelism : 4 (taskScheduler.defaultParallelism ie no.of cores)
/* Default level of parallelism to use when not given by user (e.g. parallelize and makeRDD). */
defaultMinPartitions : 2 //Default min number of partitions for Hadoop RDDs when not given by user
* Notice that we use math.min so the "defaultMinPartitions" cannot be higher than 2.
logic to find defaultMinPartitions as below
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
The actual partition size is defined by the following formula in the method FileInputFormat.computeSplitSize
package org.apache.hadoop.mapred;
public abstract class FileInputFormat<K, V> implements InputFormat<K, V> {
protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
return Math.max(minSize, Math.min(goalSize, blockSize));
}
}
where,
minSize is the hadoop parameter mapreduce.input.fileinputformat.split.minsize (default mapreduce.input.fileinputformat.split.minsize = 1 byte)
blockSize is the value of the dfs.block.size in cluster mode(**dfs.block.size - The default value in Hadoop 2.0 is 128 MB**) and fs.local.block.size in the local mode (**default fs.local.block.size = 32 MB ie blocksize = 33554432 bytes**)
goalSize = totalInputSize/numPartitions
where,
totalInputSize is the total size in bytes of all the files in the input path
numPartitions is the custom parameter provided to the method sc.textFile(inputPath, numPartitions) - if not provided it will be defaultMinPartitions ie 2 if master is set as local(*)
blocksize = file size in bytes = 33554432
33554432/1024 = 32768 KB
32768/1024 = 32 MB
Ex1:- If our file size is 91 bytes
minSize=1 (mapreduce.input.fileinputformat.split.minsize = 1 byte)
goalSize = totalInputSize/numPartitions
goalSize = 91(file size)/12(partitions provided as 2nd paramater in sc.textFile) = 7
splitSize = Math.max(minSize, Math.min(goalSize, blockSize)); => Math.max(1,Math.min(7,33554432)) = 7 // 33554432 is block size in local mode
Splits = 91(file size 91 bytes) / 7 (splitSize) => 13
FileInputFormat: Total # of splits generated by getSplits: 13
=> while calculating splitSize if file size is > 32 MB then the split size will be taken the default fs.local.block.size = 32 MB ie blocksize = 33554432 bytes

Aparapi add sample

I'm studing Aparapi (https://code.google.com/p/aparapi/) and have a strange behaviour of one of the sample included.
The sample is the first, "add". Building and executing it, is ok. I also put the following code for testing if the GPU is really used
if(!kernel.getExecutionMode().equals(Kernel.EXECUTION_MODE.GPU)){
System.out.println("Kernel did not execute on the GPU!");
}
and it works fine.
But, if I try to change the size of the array from 512 to a number greater than 999 (for example 1000), I have the following output:
!!!!!!! clEnqueueNDRangeKernel() failed invalid work group size
after clEnqueueNDRangeKernel, globalSize[0] = 1000, localSize[0] = 128
Apr 18, 2013 1:31:01 PM com.amd.aparapi.KernelRunner executeOpenCL
WARNING: ### CL exec seems to have failed. Trying to revert to Java ###
JTP
Kernel did not execute on the GPU!
Here's my code:
final int size = 1000;
final float[] a = new float[size];
final float[] b = new float[size];
for (int i = 0; i < size; i++) {
a[i] = (float)(Math.random()*100);
b[i] = (float)(Math.random()*100);
}
final float[] sum = new float[size];
Kernel kernel = new Kernel(){
#Override public void run() {
int gid = getGlobalId();
sum[gid] = a[gid] + b[gid];
}
};
Range range = Range.create(size);
kernel.execute(range);
System.out.println(kernel.getExecutionMode());
if (!kernel.getExecutionMode().equals(Kernel.EXECUTION_MODE.GPU)){
System.out.println("Kernel did not execute on the GPU!");
}
kernel.dispose();
}
I tried specifying the size using
Range range = Range.create(size, 128);
as suggested in a Google group, but nothing changed.
I'm currently running on Mac OS X 10.8 with Java 1.6.0_43. Aparapi version is the latest (2012-01-23).
Am I missing something? Any ideas?
Thanks in advance
Aparapi inherits a 'Grid Style' of implementation from OpenCL. When you specify a range of execution (say 1024), OpenCL will break this 'range' into groups of equal size. Possibly 4 groups of 256, or 8 groups of 128.
The group size must be a factor of range (so assert(range%groupSize==0)).
By default Aparapi internally selects the group size.
But you are choosing to fully specify the range and group size to using
Range r= Range.range(n,128)
You are responsible for ensuring that n%128==0.
From the error, it looks like you chose Range.range(1000,128).
Sadly 1000 % 128 != 0 so this range will fail.
If you specifiy
Range r = Range.range(n)
Aparapi will choose a valid group size, by finding the highest common factor of n.
Try dropping the 128 as the the second arg.
Gary

Robust and fast checksum algorithm?

Which checksum algorithm can you recommend in the following use case?
I want to generate checksums of small JPEG files (~8 kB each) to check if the content changed. Using the filesystem's date modified is unfortunately not an option.
The checksum need not be cryptographically strong but it should robustly indicate changes of any size.
The second criterion is speed since it should be possible to process at least hundreds of images per second (on a modern CPU).
The calculation will be done on a server with several clients. The clients send the images over Gigabit TCP to the server. So there's no disk I/O as bottleneck.
If you have many small files, your bottleneck is going to be file I/O and probably not a checksum algorithm.
A list of hash functions (which can be thought of as a checksum) can be found here.
Is there any reason you can't use the filesystem's date modified to determine if a file has changed? That would probably be faster.
There are lots of fast CRC algorithms that should do the trick:
http://www.google.com/search?hl=en&q=fast+crc&aq=f&oq=
Edit: Why the hate? CRC is totally appropriate, as evidenced by the other answers. A Google search was also appropriate, since no language was specified. This is an old, old problem which has been solved so many times that there isn't likely to be a definitive answer.
CRC-32 comes into mind mainly because it's cheap to calculate
Any kind of I/O comes into mind mainly because this will be the limiting factor for such an undertaking ;)
The problem is not calculating the checksums, the problem is to get the images into memory to calculate the checksum.
I would suggest "stagged" monitoring:
stage 1: check for changes of file timestamps and if you detect a change there hand over to...(not needed in your case as described in the edited version)
stage 2: get the image into memory and calculate the checksum
For sure important as well: multi-threading: setting up a pipeline which enables processing of several images in parallel if several CPU cores are available.
If you are receiving the files over network you can calculate the checksum as you receive the file. This will ensure that you will calculate the checksum while the data is in memory. Hence you won't have to load them into memory from disk.
I believe if you apply this method, you'll see almost-zero overhead on your system.
This is the routines I'm using on an embedded system which does checksum control on firmware and other stuff.
static const uint32_t crctab[] = {
0x0,
0x04c11db7, 0x09823b6e, 0x0d4326d9, 0x130476dc, 0x17c56b6b,
0x1a864db2, 0x1e475005, 0x2608edb8, 0x22c9f00f, 0x2f8ad6d6,
0x2b4bcb61, 0x350c9b64, 0x31cd86d3, 0x3c8ea00a, 0x384fbdbd,
0x4c11db70, 0x48d0c6c7, 0x4593e01e, 0x4152fda9, 0x5f15adac,
0x5bd4b01b, 0x569796c2, 0x52568b75, 0x6a1936c8, 0x6ed82b7f,
0x639b0da6, 0x675a1011, 0x791d4014, 0x7ddc5da3, 0x709f7b7a,
0x745e66cd, 0x9823b6e0, 0x9ce2ab57, 0x91a18d8e, 0x95609039,
0x8b27c03c, 0x8fe6dd8b, 0x82a5fb52, 0x8664e6e5, 0xbe2b5b58,
0xbaea46ef, 0xb7a96036, 0xb3687d81, 0xad2f2d84, 0xa9ee3033,
0xa4ad16ea, 0xa06c0b5d, 0xd4326d90, 0xd0f37027, 0xddb056fe,
0xd9714b49, 0xc7361b4c, 0xc3f706fb, 0xceb42022, 0xca753d95,
0xf23a8028, 0xf6fb9d9f, 0xfbb8bb46, 0xff79a6f1, 0xe13ef6f4,
0xe5ffeb43, 0xe8bccd9a, 0xec7dd02d, 0x34867077, 0x30476dc0,
0x3d044b19, 0x39c556ae, 0x278206ab, 0x23431b1c, 0x2e003dc5,
0x2ac12072, 0x128e9dcf, 0x164f8078, 0x1b0ca6a1, 0x1fcdbb16,
0x018aeb13, 0x054bf6a4, 0x0808d07d, 0x0cc9cdca, 0x7897ab07,
0x7c56b6b0, 0x71159069, 0x75d48dde, 0x6b93dddb, 0x6f52c06c,
0x6211e6b5, 0x66d0fb02, 0x5e9f46bf, 0x5a5e5b08, 0x571d7dd1,
0x53dc6066, 0x4d9b3063, 0x495a2dd4, 0x44190b0d, 0x40d816ba,
0xaca5c697, 0xa864db20, 0xa527fdf9, 0xa1e6e04e, 0xbfa1b04b,
0xbb60adfc, 0xb6238b25, 0xb2e29692, 0x8aad2b2f, 0x8e6c3698,
0x832f1041, 0x87ee0df6, 0x99a95df3, 0x9d684044, 0x902b669d,
0x94ea7b2a, 0xe0b41de7, 0xe4750050, 0xe9362689, 0xedf73b3e,
0xf3b06b3b, 0xf771768c, 0xfa325055, 0xfef34de2, 0xc6bcf05f,
0xc27dede8, 0xcf3ecb31, 0xcbffd686, 0xd5b88683, 0xd1799b34,
0xdc3abded, 0xd8fba05a, 0x690ce0ee, 0x6dcdfd59, 0x608edb80,
0x644fc637, 0x7a089632, 0x7ec98b85, 0x738aad5c, 0x774bb0eb,
0x4f040d56, 0x4bc510e1, 0x46863638, 0x42472b8f, 0x5c007b8a,
0x58c1663d, 0x558240e4, 0x51435d53, 0x251d3b9e, 0x21dc2629,
0x2c9f00f0, 0x285e1d47, 0x36194d42, 0x32d850f5, 0x3f9b762c,
0x3b5a6b9b, 0x0315d626, 0x07d4cb91, 0x0a97ed48, 0x0e56f0ff,
0x1011a0fa, 0x14d0bd4d, 0x19939b94, 0x1d528623, 0xf12f560e,
0xf5ee4bb9, 0xf8ad6d60, 0xfc6c70d7, 0xe22b20d2, 0xe6ea3d65,
0xeba91bbc, 0xef68060b, 0xd727bbb6, 0xd3e6a601, 0xdea580d8,
0xda649d6f, 0xc423cd6a, 0xc0e2d0dd, 0xcda1f604, 0xc960ebb3,
0xbd3e8d7e, 0xb9ff90c9, 0xb4bcb610, 0xb07daba7, 0xae3afba2,
0xaafbe615, 0xa7b8c0cc, 0xa379dd7b, 0x9b3660c6, 0x9ff77d71,
0x92b45ba8, 0x9675461f, 0x8832161a, 0x8cf30bad, 0x81b02d74,
0x857130c3, 0x5d8a9099, 0x594b8d2e, 0x5408abf7, 0x50c9b640,
0x4e8ee645, 0x4a4ffbf2, 0x470cdd2b, 0x43cdc09c, 0x7b827d21,
0x7f436096, 0x7200464f, 0x76c15bf8, 0x68860bfd, 0x6c47164a,
0x61043093, 0x65c52d24, 0x119b4be9, 0x155a565e, 0x18197087,
0x1cd86d30, 0x029f3d35, 0x065e2082, 0x0b1d065b, 0x0fdc1bec,
0x3793a651, 0x3352bbe6, 0x3e119d3f, 0x3ad08088, 0x2497d08d,
0x2056cd3a, 0x2d15ebe3, 0x29d4f654, 0xc5a92679, 0xc1683bce,
0xcc2b1d17, 0xc8ea00a0, 0xd6ad50a5, 0xd26c4d12, 0xdf2f6bcb,
0xdbee767c, 0xe3a1cbc1, 0xe760d676, 0xea23f0af, 0xeee2ed18,
0xf0a5bd1d, 0xf464a0aa, 0xf9278673, 0xfde69bc4, 0x89b8fd09,
0x8d79e0be, 0x803ac667, 0x84fbdbd0, 0x9abc8bd5, 0x9e7d9662,
0x933eb0bb, 0x97ffad0c, 0xafb010b1, 0xab710d06, 0xa6322bdf,
0xa2f33668, 0xbcb4666d, 0xb8757bda, 0xb5365d03, 0xb1f740b4
};
typedef struct crc32ctx
{
uint32_t crc;
uint32_t length;
} CRC32Ctx;
#define COMPUTE(var, ch) (var) = (var) << 8 ^ crctab[(var) >> 24 ^ (ch)]
void crc32_stream_init( CRC32Ctx* ctx )
{
ctx->crc = 0;
ctx->length = 0;
}
void crc32_stream_compute_uint32( CRC32Ctx* ctx, uint32_t data )
{
COMPUTE( ctx->crc, data & 0xFF );
COMPUTE( ctx->crc, ( data >> 8 ) & 0xFF );
COMPUTE( ctx->crc, ( data >> 16 ) & 0xFF );
COMPUTE( ctx->crc, ( data >> 24 ) & 0xFF );
ctx->length += 4;
}
void crc32_stream_compute_uint8( CRC32Ctx* ctx, uint8_t data )
{
COMPUTE( ctx->crc, data );
ctx->length++;
}
void crc32_stream_finilize( CRC32Ctx* ctx )
{
uint32_t len = ctx->length;
for( ; len != 0; len >>= 8 )
{
COMPUTE( ctx->crc, len & 0xFF );
}
ctx->crc = ~ctx->crc;
}
/*** pseudo code ***/
CRC32Ctx crc;
crc32_stream_init(&crc);
while((just_received_buffer_len = received_anything()))
{
for(int i = 0; i < just_received_buffer_len; i++)
{
crc32_stream_compute_uint8(&crc, buf[i]); // assuming buf is uint8_t*
}
}
crc32_stream_finilize(&crc);
printf("%x", crc.crc); // ta daaa
CRC
adler32, available in the zlib headers, is advertised as being significantly faster than crc32, while being only slightly less accurate.
CRC32 is probably good enough, although there's a small chance you might get a collision, such that a file that has been modified might look like it hasn't been because the two versions generate the same checksum. To avoid this possibility I'd therefore suggest using MD5, which will easily be fast enough, and the chances of a collision occurring is reduced to the point where it's almost infinitessimal.
As others have said, with lots of small files your real performance bottleneck is going to be I/O so the issue is dealing with that. If you post up a few more details somebody will probably suggest a way of sorting that out as well.
Your most important requirement is "to check if the content changed".
If it most important that ANY change in the file be detected, MD-5, SHA-1 or even SHA-256 should be your choice.
Given that you indicated that the checksum NOT be cryptographically good, I would recommend CRC-32 for three reasons. CRC-32 gives good hamming distances over an 8K file. CRC-32 will be at least an order of magnitude faster than MD-5 to calculate (your second requirement). Sometimes as important, CRC-32 only requires 32 bits to store the value to be compared. MD-5 requires 4 times the storage and SHA-1 requires 5 times the storage.
BTW, any technique will be strengthened by prepending the length of the file when calculating the hash.
According to the Wiki page pointed to by Luke, MD5 is actually faster than CRC32!
I have tried this myself by using Python 2.6 on Windows Vista, and got the same result.
Here are some results:
crc32: 162.481544276 MBps
md5: 224.489791549 MBps
crc32: 168.332996575 MBps
md5: 226.089336532 MBps
crc32: 155.851515828 MBps
md5: 194.943289532 MBps
I am thinking about the same question as well, and I'm tempted to use the Rsync's variation of Adler-32 for detecting file differences.
Just a postscript to the above; jpegs use lossy compression and the extent of the compression may depend upon the program used to create the jpeg, the colour pallette and/or bit-depth on the system, display gamma, graphics card and user-set compression levels/colour settings. Therefore, comparing jpegs built on different computers/platforms or using different software will be very difficult at the byte level.
This is 5 times faster than CCITT and makes exactly the same job:
Python:
def crc16_fast(data: bytearray, length):
crc = 0xCACA
for i in range(length):
crc ^= data[i]
return crc
C:
uint16_t crc16_fast(const uint16_t* data, size_t length)
{
uint16_t crc = 0xCACA;
for (size_t i = 0; i < length; i++)
crc ^= data[i];
return crc;
}

Resources