Splitting a file into parts by bytes - file splitting

OK, so this is a unique question.
We are getting files daily from a company. These files are downloaded from their servers to ours over SFTP. The company we deal with uses a third-party provider that creates the files (and reduces their size) to make downloads faster and to reduce the file size on their servers.
We download 9 files daily from the server: 3 groups of 3 files.
Each group of files consists of 2 XML files and one "image" file.
One of these XML files gives us information on the 'image' file.
Information in the XML file we need:
offset: Gives us where a section of data starts
length: Used with offset, gives us the end of that section
count: Gives us the number of elements held in the file
The 'image' file itself is unusable until we split it into pieces based on the offset and length of each image in the file. The images are basically concatenated together; we need to extract them to be able to view them.
An example of offset, length and count values are as follows:
offset: 0
length: 2670
offset: 2670
length: 2670
offset: 5340
length: 2670
offset: 8010
length: 2670
count: 4
This means that there are 4 (count) items. The first item begins at offset[0] and is length[0] bytes long. The second item begins at offset[1] and is length[1] bytes long, and so on.
I need to split the images at these points, and at these points PRECISELY, without room for error. The third-party provider will not provide us with the code, so we have to figure this out ourselves. The image file is not readable without splitting it and is essentially useless until then.
My question: Does anyone have a way of splitting files at specific byte offsets?
P.S. I do not have any code yet. I don't even know where to begin with this one. I am not new to coding, but I have never split a file by bytes.
I don't care which language this uses. I just need to make it work.
EDIT
The OS is Windows

You hooked me. Here's a rough Java method that can split a file based on offset and length. This requires at least Java 8.
A few of the classes used:
SeekableByteChannel
ByteBuffer
And an article I found helpful in producing this example.
/**
 * Method that splits the data provided in fileToSplit into outputDirectory based on the
 * collection of offsets and lengths provided in offsetAndLength.
 *
 * Example of input offsetAndLength:
 * Long[][] data = new Long[][]{
 *         {0L, 2670L},
 *         {2670L, 2670L},
 *         {5340L, 2670L},
 *         {8010L, 2670L}
 * };
 *
 * Output files will be placed in outputDirectory and named img0, img1... imgN
 *
 * @param fileToSplit     the concatenated image file to split
 * @param outputDirectory the directory the extracted images are written to
 * @param offsetAndLength pairs of {offset, length}, one per image
 * @throws IOException
 */
public static void split( Path fileToSplit, Path outputDirectory, Long[][] offsetAndLength ) throws IOException{
    try (SeekableByteChannel sbc = Files.newByteChannel(fileToSplit, StandardOpenOption.READ)){
        for(int x = 0; x < offsetAndLength.length; x++){
            // offsetAndLength[x][0] is the offset, offsetAndLength[x][1] is the length
            ByteBuffer buffer = ByteBuffer.allocate(offsetAndLength[x][1].intValue());
            sbc.position(offsetAndLength[x][0]);
            sbc.read(buffer);
            buffer.flip();
            File img = new File(outputDirectory.toFile(), "img" + x);
            img.createNewFile();
            try(FileChannel output = FileChannel.open(img.toPath(), StandardOpenOption.WRITE)){
                output.write(buffer);
            }
            buffer.clear();
        }
    }
}
I leave parsing the XML file to you.
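That said, here is a rough sketch of how the offset/length pairs could be pulled out of the XML with the standard DOM API. The element names offset, length and count are assumptions, since the actual schema isn't shown; adjust them to whatever the real XML uses.
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Minimal sketch, assuming the XML contains flat <offset>, <length> and <count> elements.
public static Long[][] readOffsetsAndLengths(Path xmlFile) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(xmlFile.toFile());
    NodeList offsets = doc.getElementsByTagName("offset"); // assumed element name
    NodeList lengths = doc.getElementsByTagName("length"); // assumed element name
    int count = Integer.parseInt(
            doc.getElementsByTagName("count").item(0).getTextContent().trim()); // assumed element name
    Long[][] result = new Long[count][2];
    for (int i = 0; i < count; i++) {
        result[i][0] = Long.valueOf(offsets.item(i).getTextContent().trim());
        result[i][1] = Long.valueOf(lengths.item(i).getTextContent().trim());
    }
    return result;
}
The returned array can be passed straight to the split method above.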

Related

What do the overflow_xxxx.bin files mean while training GloVe?

I'm training a word embedding model based on the GloVe method. While running, the algorithm shows log output like:
$ build/cooccur -memory 4.0 -vocab-file vocab.txt -verbose 2 -window-size 8 < /home/ignacio/data/GUsDany/corpus/GUs_regulon_pubMed.txt > cooccurrence.bin
COUNTING COOCCURRENCES
window size: 8
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "vocab.txt"...loaded 145223095 words.
Building lookup table...table contains 228170143 elements.
Processing token: 5478600000
The home directory of GloVe is filled with files called overflow_0534.bin. Can someone tell me whether everything is going well?
Thanks
Everything is OK.
You can view the source code of the GloVe cooccur program on GitHub.
At line 57 of the file:
long long overflow_length; // Number of cooccurrence records whose product exceeds max_product to store in memory before writing to disk
If your corpus has too many co-occurrence records, some of them will be written to temporary .bin files on disk.
while (1) {
    if (ind >= overflow_length - window_size) { // If overflow buffer is (almost) full, sort it and write it to temporary file
        qsort(cr, ind, sizeof(CREC), compare_crec);
        write_chunk(cr,ind,foverflow);
        fclose(foverflow);
        fidcounter++;
        sprintf(filename,"%s_%04d.bin",file_head,fidcounter);
        foverflow = fopen(filename,"w");
        ind = 0;
    }
The variable overflow_length depends on your memory settings.
Line 463:
if ((i = find_arg((char *)"-memory", argc, argv)) > 0) memory_limit = atof(argv[i + 1]);
Line 467:
rlimit = 0.85 * (real)memory_limit * 1073741824/(sizeof(CREC));
Line 470:
overflow_length = (long long) rlimit/6; // 0.85 + 1/6 ~= 1
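As a rough sanity check, assuming sizeof(CREC) is 16 bytes (two ints plus a double), the numbers in your log follow directly from these lines: with -memory 4.0, rlimit = 0.85 * 4.0 * 1073741824 / 16 ≈ 228170138, and overflow_length = rlimit / 6 ≈ 38028356, which matches the "overflow length: 38028356" that cooccur printed. The overflow_NNNN.bin files are simply this buffer being spilled to disk, and they are merged at the end of the run.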

Bio-Formats conversion of a LIF file with split timesteps and split channels (ImageJ macro)

I'm trying to open a LIF file (Leica format) with ImageJ in a macro. The LIF file contains 4 "LIF series" of 52 or 45 timesteps. Each timestep contains 3 channels.
filepath = "/mydirectory/myfile.lif";
out = "/mydirectory/myTIFF/";
run("Bio-Formats Importer", "open=["+ filepath + "] color_mode=Default split_timepoints split_channels view=Hyperstack stack_order=XYCZT");
for(myt ...) { // I don't know how to get all the unique timesteps of all the series
for(mych=0;mych<3;mych++) {
saveAs("Tiff", "save=[" + out + "/myID" + "_t" + myt + "_ch" + mych + ".tif" + "]");
}
}
I want to save as TIFF one file per timestep+channel.
But:
1: I cannot open the series with the "split_channels" and "split_timepoints" options because ImageJ crashes (not enough memory).
2: I cannot open all the series automatically because ImageJ crashes... And I don't know how to open the series one by one, or the timesteps one by one.
Do you know how I can open the LIF file timestep by timestep and save it automatically in Fiji? Or how I can split the channels and timesteps of a LIF file?
I found what I needed! I can work series by series now.
Ext.setId(path);// Initializes the given path (filename).
Ext.getSeriesCount(seriesCount); // Gets the number of image series in the active dataset.
I still have to write the code to generate the unique timesteps, but it should be fine.
Source

How does SparkContext.textFile work under the covers?

I am trying to understand the textFile method deeply, but I think my
lack of Hadoop knowledge is holding me back here. Let me lay out my
understanding and maybe you can correct anything that is incorrect
When sc.textFile(path) is called, then defaultMinPartitions is used,
which is really just math.min(taskScheduler.defaultParallelism, 2). Let's
assume we are using the SparkDeploySchedulerBackend and this is
conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(),
2))
So, now let's say the default is 2, going back to the textFile, this is
passed in to HadoopRDD. The true size is determined in getPartitions() using
inputFormat.getSplits(jobConf, minPartitions). But, from what I can find,
the partition count is merely a hint and is in fact mostly ignored, so you will
probably end up with the total number of blocks.
OK, this fits with expectations; however, what if the default is not used and
you provide a partition size that is larger than the block size? If my
research is right and the getSplits call simply ignores this parameter, then
wouldn't the provided min end up being ignored and you would still just get
the block size?
Cross posted with the spark mailing list
Short Version:
Split size is determined by mapred.min.split.size or mapreduce.input.fileinputformat.split.minsize; if it's bigger than HDFS's blockSize, multiple blocks inside the same file will be combined into a single split.
Detailed Version:
I think you are right in understanding the procedure before inputFormat.getSplits.
Inside inputFormat.getSplits, more specifically inside FileInputFormat's getSplits, it is mapred.min.split.size or mapreduce.input.fileinputformat.split.minsize that ultimately determines the split size. (I'm not sure which one takes effect in Spark; I tend to believe the former.)
Let's see the code: FileInputFormat from Hadoop 2.4.0
long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
long minSize = Math.max(job.getLong(org.apache.hadoop.mapreduce.lib.input.
        FileInputFormat.SPLIT_MINSIZE, 1), minSplitSize);

// generate splits
ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
NetworkTopology clusterMap = new NetworkTopology();
for (FileStatus file: files) {
    Path path = file.getPath();
    long length = file.getLen();
    if (length != 0) {
        FileSystem fs = path.getFileSystem(job);
        BlockLocation[] blkLocations;
        if (file instanceof LocatedFileStatus) {
            blkLocations = ((LocatedFileStatus) file).getBlockLocations();
        } else {
            blkLocations = fs.getFileBlockLocations(file, 0, length);
        }
        if (isSplitable(fs, path)) {
            long blockSize = file.getBlockSize();
            long splitSize = computeSplitSize(goalSize, minSize, blockSize);

            long bytesRemaining = length;
            while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
                String[] splitHosts = getSplitHosts(blkLocations,
                        length-bytesRemaining, splitSize, clusterMap);
                splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                        splitHosts));
                bytesRemaining -= splitSize;
            }

            if (bytesRemaining != 0) {
                String[] splitHosts = getSplitHosts(blkLocations, length
                        - bytesRemaining, bytesRemaining, clusterMap);
                splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
                        splitHosts));
            }
        } else {
            String[] splitHosts = getSplitHosts(blkLocations,0,length,clusterMap);
            splits.add(makeSplit(path, 0, length, splitHosts));
        }
    } else {
        //Create empty hosts array for zero length files
        splits.add(makeSplit(path, 0, length, new String[0]));
    }
}
Inside the for loop, makeSplit() is used to generate each split, and splitSize is the effective split size. The computeSplitSize function that generates splitSize:
protected long computeSplitSize(long goalSize, long minSize,
                                long blockSize) {
    return Math.max(minSize, Math.min(goalSize, blockSize));
}
Therefore, if minSplitSize > blockSize, each output split is actually a combination of several blocks of the same HDFS file; on the other hand, if minSplitSize < blockSize, each split corresponds to one HDFS block.
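To make this concrete, here is a small illustrative example with made-up numbers (not from the question): for a 512 MB file with a 128 MB blockSize and numSplits = 2, goalSize = 512 MB / 2 = 256 MB. With the default minSize of 1, splitSize = max(1, min(256 MB, 128 MB)) = 128 MB, so you get 4 splits, one per block. If instead mapreduce.input.fileinputformat.split.minsize is set to 256 MB, splitSize = max(256 MB, min(256 MB, 128 MB)) = 256 MB, and each split spans two blocks, giving only 2 splits.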
I will add some more points with examples to Yijie Shen's answer.
Before we go into details, let's understand the following.
Assume that we are working on a Spark standalone local system with 4 cores.
In the application, if the master is configured as below:
new SparkConf().setMaster("local[*]") then
defaultParallelism: 4 (taskScheduler.defaultParallelism, i.e. the number of cores)
/* Default level of parallelism to use when not given by user (e.g. parallelize and makeRDD). */
defaultMinPartitions: 2 // Default min number of partitions for Hadoop RDDs when not given by user
// Notice that we use math.min so the "defaultMinPartitions" cannot be higher than 2.
The logic to find defaultMinPartitions is as below:
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
The actual partition size is defined by the following formula in the method FileInputFormat.computeSplitSize
package org.apache.hadoop.mapred;

public abstract class FileInputFormat<K, V> implements InputFormat<K, V> {
    protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }
}
where,
minSize is the Hadoop parameter mapreduce.input.fileinputformat.split.minsize (default mapreduce.input.fileinputformat.split.minsize = 1 byte)
blockSize is the value of dfs.block.size in cluster mode (the default value in Hadoop 2.0 is 128 MB) and fs.local.block.size in local mode (default fs.local.block.size = 32 MB, i.e. blockSize = 33554432 bytes)
goalSize = totalInputSize/numPartitions
where,
totalInputSize is the total size in bytes of all the files in the input path
numPartitions is the custom parameter provided to the method sc.textFile(inputPath, numPartitions); if not provided it will be defaultMinPartitions, i.e. 2 if the master is set as local[*]
blockSize in local mode = 33554432 bytes; 33554432 / 1024 = 32768 KB; 32768 / 1024 = 32 MB
Ex1: If our file size is 91 bytes
minSize = 1 (mapreduce.input.fileinputformat.split.minsize = 1 byte)
goalSize = totalInputSize / numPartitions
goalSize = 91 (file size) / 12 (partitions provided as 2nd parameter in sc.textFile) = 7
splitSize = Math.max(minSize, Math.min(goalSize, blockSize)) => Math.max(1, Math.min(7, 33554432)) = 7 // 33554432 is the block size in local mode
Splits = 91 (file size 91 bytes) / 7 (splitSize) => 13
FileInputFormat: Total # of splits generated by getSplits: 13
=> While calculating splitSize, if the file size is over 32 MB then the split size will be capped at the default fs.local.block.size = 32 MB, i.e. blockSize = 33554432 bytes.
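As a further illustration (these numbers are mine, not the original answer's): for a 100 MB file (104857600 bytes) read with the default 2 partitions in local mode, goalSize = 104857600 / 2 = 52428800 and splitSize = Math.max(1, Math.min(52428800, 33554432)) = 33554432, so the 32 MB local block size caps the split size. That yields three full 32 MB splits plus one ~4 MB remainder split, i.e. 4 partitions in total.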

The size of a PE Header

Is there a way to find out the size of a PE Header without reading all of it or the entire file?
You can calculate the total size of the PE header like this:
sizeof(Signature) + sizeof(FileHeader) + sizeof(OptionalHeader) + sizeof(SectionTable)
The file header always has the same size but the OptionalHeader's size can differ, as can the section table size.
The OptionalHeader's size is stored in FileHeader.SizeOfOptionalHeader, and the section table size equals FileHeader.NumberOfSections * sizeof(IMAGE_SECTION_HEADER)
And some C code:
DWORD SizeOfPEHeader(const IMAGE_NT_HEADERS * pNTH)
{
return (offsetof(IMAGE_NT_HEADERS, OptionalHeader) + pNTH->FileHeader.SizeOfOptionalHeader + (pNTH->FileHeader.NumberOfSections * sizeof(IMAGE_SECTION_HEADER)));
}
All you have to do is read the DOS header, get the PE offset (e_lfanew) and read PE.Signature + PE.FileHeader into memory. That's two reading operations of fixed size and you have all the info you need.
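For illustration, here is a minimal Java sketch of the same calculation (Java only because it need not be C); it favours clarity over doing everything in exactly two reads, and it assumes a well-formed PE file with no validation:
import java.io.RandomAccessFile;

// Sketch: compute Signature + FileHeader + OptionalHeader + SectionTable size
// using only fixed offsets from the PE spec (e_lfanew at 0x3C, FileHeader
// right after the 4-byte signature, 40 bytes per section header).
public static long sizeOfPEHeader(String path) throws Exception {
    try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
        f.seek(0x3C);                                                    // IMAGE_DOS_HEADER.e_lfanew
        long peOffset = Integer.reverseBytes(f.readInt()) & 0xFFFFFFFFL; // file is little-endian
        f.seek(peOffset + 4 + 2);                                        // FileHeader.NumberOfSections
        int numberOfSections = Short.reverseBytes(f.readShort()) & 0xFFFF;
        f.seek(peOffset + 4 + 16);                                       // FileHeader.SizeOfOptionalHeader
        int sizeOfOptionalHeader = Short.reverseBytes(f.readShort()) & 0xFFFF;
        // Signature (4) + FileHeader (20) + OptionalHeader + section table
        return 4 + 20 + sizeOfOptionalHeader + numberOfSections * 40L;
    }
}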

Robust and fast checksum algorithm?

Which checksum algorithm can you recommend in the following use case?
I want to generate checksums of small JPEG files (~8 kB each) to check if the content changed. Using the filesystem's date modified is unfortunately not an option.
The checksum need not be cryptographically strong but it should robustly indicate changes of any size.
The second criterion is speed since it should be possible to process at least hundreds of images per second (on a modern CPU).
The calculation will be done on a server with several clients. The clients send the images over Gigabit TCP to the server. So there's no disk I/O as bottleneck.
If you have many small files, your bottleneck is going to be file I/O and probably not a checksum algorithm.
A list of hash functions (which can be thought of as a checksum) can be found here.
Is there any reason you can't use the filesystem's date modified to determine if a file has changed? That would probably be faster.
There are lots of fast CRC algorithms that should do the trick:
http://www.google.com/search?hl=en&q=fast+crc&aq=f&oq=
Edit: Why the hate? CRC is totally appropriate, as evidenced by the other answers. A Google search was also appropriate, since no language was specified. This is an old, old problem which has been solved so many times that there isn't likely to be a definitive answer.
CRC-32 comes to mind, mainly because it's cheap to calculate.
Any kind of I/O also comes to mind, mainly because it will be the limiting factor for such an undertaking ;)
The problem is not calculating the checksums, the problem is getting the images into memory to calculate the checksum.
I would suggest "stagged" monitoring:
stage 1: check for changes of file timestamps and if you detect a change there hand over to...(not needed in your case as described in the edited version)
stage 2: get the image into memory and calculate the checksum
Also important: multi-threading. Set up a pipeline that enables processing several images in parallel if several CPU cores are available.
If you are receiving the files over the network, you can calculate the checksum as you receive the file. This ensures that you calculate the checksum while the data is in memory, so you won't have to load it back in from disk.
I believe if you apply this method, you'll see almost-zero overhead on your system.
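As a rough illustration of this receive-and-hash approach (in Java with java.util.zip.CRC32, only because no language was specified; the InputStream is a stand-in for whatever the TCP receive path looks like):
import java.io.InputStream;
import java.util.zip.CRC32;

// Sketch: update the checksum chunk by chunk as the bytes arrive, so the data is
// hashed while it is already in memory.
static long checksumWhileReceiving(InputStream in) throws Exception {
    CRC32 crc = new CRC32();
    byte[] buf = new byte[8192];
    int n;
    while ((n = in.read(buf)) != -1) {
        crc.update(buf, 0, n);  // hash the chunk that was just received
        // ... also hand buf[0..n) to wherever the image is being stored ...
    }
    return crc.getValue();
}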
These are the routines I'm using on an embedded system which does checksum control on firmware and other things.
static const uint32_t crctab[] = {
0x0,
0x04c11db7, 0x09823b6e, 0x0d4326d9, 0x130476dc, 0x17c56b6b,
0x1a864db2, 0x1e475005, 0x2608edb8, 0x22c9f00f, 0x2f8ad6d6,
0x2b4bcb61, 0x350c9b64, 0x31cd86d3, 0x3c8ea00a, 0x384fbdbd,
0x4c11db70, 0x48d0c6c7, 0x4593e01e, 0x4152fda9, 0x5f15adac,
0x5bd4b01b, 0x569796c2, 0x52568b75, 0x6a1936c8, 0x6ed82b7f,
0x639b0da6, 0x675a1011, 0x791d4014, 0x7ddc5da3, 0x709f7b7a,
0x745e66cd, 0x9823b6e0, 0x9ce2ab57, 0x91a18d8e, 0x95609039,
0x8b27c03c, 0x8fe6dd8b, 0x82a5fb52, 0x8664e6e5, 0xbe2b5b58,
0xbaea46ef, 0xb7a96036, 0xb3687d81, 0xad2f2d84, 0xa9ee3033,
0xa4ad16ea, 0xa06c0b5d, 0xd4326d90, 0xd0f37027, 0xddb056fe,
0xd9714b49, 0xc7361b4c, 0xc3f706fb, 0xceb42022, 0xca753d95,
0xf23a8028, 0xf6fb9d9f, 0xfbb8bb46, 0xff79a6f1, 0xe13ef6f4,
0xe5ffeb43, 0xe8bccd9a, 0xec7dd02d, 0x34867077, 0x30476dc0,
0x3d044b19, 0x39c556ae, 0x278206ab, 0x23431b1c, 0x2e003dc5,
0x2ac12072, 0x128e9dcf, 0x164f8078, 0x1b0ca6a1, 0x1fcdbb16,
0x018aeb13, 0x054bf6a4, 0x0808d07d, 0x0cc9cdca, 0x7897ab07,
0x7c56b6b0, 0x71159069, 0x75d48dde, 0x6b93dddb, 0x6f52c06c,
0x6211e6b5, 0x66d0fb02, 0x5e9f46bf, 0x5a5e5b08, 0x571d7dd1,
0x53dc6066, 0x4d9b3063, 0x495a2dd4, 0x44190b0d, 0x40d816ba,
0xaca5c697, 0xa864db20, 0xa527fdf9, 0xa1e6e04e, 0xbfa1b04b,
0xbb60adfc, 0xb6238b25, 0xb2e29692, 0x8aad2b2f, 0x8e6c3698,
0x832f1041, 0x87ee0df6, 0x99a95df3, 0x9d684044, 0x902b669d,
0x94ea7b2a, 0xe0b41de7, 0xe4750050, 0xe9362689, 0xedf73b3e,
0xf3b06b3b, 0xf771768c, 0xfa325055, 0xfef34de2, 0xc6bcf05f,
0xc27dede8, 0xcf3ecb31, 0xcbffd686, 0xd5b88683, 0xd1799b34,
0xdc3abded, 0xd8fba05a, 0x690ce0ee, 0x6dcdfd59, 0x608edb80,
0x644fc637, 0x7a089632, 0x7ec98b85, 0x738aad5c, 0x774bb0eb,
0x4f040d56, 0x4bc510e1, 0x46863638, 0x42472b8f, 0x5c007b8a,
0x58c1663d, 0x558240e4, 0x51435d53, 0x251d3b9e, 0x21dc2629,
0x2c9f00f0, 0x285e1d47, 0x36194d42, 0x32d850f5, 0x3f9b762c,
0x3b5a6b9b, 0x0315d626, 0x07d4cb91, 0x0a97ed48, 0x0e56f0ff,
0x1011a0fa, 0x14d0bd4d, 0x19939b94, 0x1d528623, 0xf12f560e,
0xf5ee4bb9, 0xf8ad6d60, 0xfc6c70d7, 0xe22b20d2, 0xe6ea3d65,
0xeba91bbc, 0xef68060b, 0xd727bbb6, 0xd3e6a601, 0xdea580d8,
0xda649d6f, 0xc423cd6a, 0xc0e2d0dd, 0xcda1f604, 0xc960ebb3,
0xbd3e8d7e, 0xb9ff90c9, 0xb4bcb610, 0xb07daba7, 0xae3afba2,
0xaafbe615, 0xa7b8c0cc, 0xa379dd7b, 0x9b3660c6, 0x9ff77d71,
0x92b45ba8, 0x9675461f, 0x8832161a, 0x8cf30bad, 0x81b02d74,
0x857130c3, 0x5d8a9099, 0x594b8d2e, 0x5408abf7, 0x50c9b640,
0x4e8ee645, 0x4a4ffbf2, 0x470cdd2b, 0x43cdc09c, 0x7b827d21,
0x7f436096, 0x7200464f, 0x76c15bf8, 0x68860bfd, 0x6c47164a,
0x61043093, 0x65c52d24, 0x119b4be9, 0x155a565e, 0x18197087,
0x1cd86d30, 0x029f3d35, 0x065e2082, 0x0b1d065b, 0x0fdc1bec,
0x3793a651, 0x3352bbe6, 0x3e119d3f, 0x3ad08088, 0x2497d08d,
0x2056cd3a, 0x2d15ebe3, 0x29d4f654, 0xc5a92679, 0xc1683bce,
0xcc2b1d17, 0xc8ea00a0, 0xd6ad50a5, 0xd26c4d12, 0xdf2f6bcb,
0xdbee767c, 0xe3a1cbc1, 0xe760d676, 0xea23f0af, 0xeee2ed18,
0xf0a5bd1d, 0xf464a0aa, 0xf9278673, 0xfde69bc4, 0x89b8fd09,
0x8d79e0be, 0x803ac667, 0x84fbdbd0, 0x9abc8bd5, 0x9e7d9662,
0x933eb0bb, 0x97ffad0c, 0xafb010b1, 0xab710d06, 0xa6322bdf,
0xa2f33668, 0xbcb4666d, 0xb8757bda, 0xb5365d03, 0xb1f740b4
};
typedef struct crc32ctx
{
    uint32_t crc;
    uint32_t length;
} CRC32Ctx;

#define COMPUTE(var, ch) (var) = (var) << 8 ^ crctab[(var) >> 24 ^ (ch)]

void crc32_stream_init( CRC32Ctx* ctx )
{
    ctx->crc = 0;
    ctx->length = 0;
}

void crc32_stream_compute_uint32( CRC32Ctx* ctx, uint32_t data )
{
    COMPUTE( ctx->crc, data & 0xFF );
    COMPUTE( ctx->crc, ( data >> 8 ) & 0xFF );
    COMPUTE( ctx->crc, ( data >> 16 ) & 0xFF );
    COMPUTE( ctx->crc, ( data >> 24 ) & 0xFF );
    ctx->length += 4;
}

void crc32_stream_compute_uint8( CRC32Ctx* ctx, uint8_t data )
{
    COMPUTE( ctx->crc, data );
    ctx->length++;
}

void crc32_stream_finilize( CRC32Ctx* ctx )
{
    uint32_t len = ctx->length;
    for( ; len != 0; len >>= 8 )
    {
        COMPUTE( ctx->crc, len & 0xFF );
    }
    ctx->crc = ~ctx->crc;
}

/*** pseudo code ***/
CRC32Ctx crc;
crc32_stream_init(&crc);

while((just_received_buffer_len = received_anything()))
{
    for(int i = 0; i < just_received_buffer_len; i++)
    {
        crc32_stream_compute_uint8(&crc, buf[i]); // assuming buf is uint8_t*
    }
}

crc32_stream_finilize(&crc);
printf("%x", crc.crc); // ta daaa
CRC
adler32, available in the zlib headers, is advertised as being significantly faster than crc32, while being only slightly less accurate.
CRC32 is probably good enough, although there's a small chance you might get a collision, such that a file that has been modified might look like it hasn't been, because the two versions generate the same checksum. To avoid this possibility I'd therefore suggest using MD5, which will easily be fast enough, and the chances of a collision occurring are reduced to the point where they're almost infinitesimal.
As others have said, with lots of small files your real performance bottleneck is going to be I/O so the issue is dealing with that. If you post up a few more details somebody will probably suggest a way of sorting that out as well.
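For reference, hashing an image that is already in memory with MD5 via the standard java.security.MessageDigest API (Java chosen only because no language was specified) is a couple of lines per file:
import java.security.MessageDigest;

// Sketch: digest an image that has just been received over TCP and is held in memory.
static byte[] md5Of(byte[] imageBytes) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    return md.digest(imageBytes);  // 16-byte digest to store and compare later
}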
Your most important requirement is "to check if the content changed".
If it is most important that ANY change in the file be detected, MD5, SHA-1 or even SHA-256 should be your choice.
Given that you indicated that the checksum need NOT be cryptographically strong, I would recommend CRC-32 for three reasons. CRC-32 gives good Hamming distances over an 8 kB file. CRC-32 will be at least an order of magnitude faster than MD5 to calculate (your second requirement). Sometimes just as important, CRC-32 only requires 32 bits to store the value to be compared; MD5 requires 4 times the storage and SHA-1 requires 5 times the storage.
BTW, any technique will be strengthened by prepending the length of the file when calculating the hash.
According to the Wiki page pointed to by Luke, MD5 is actually faster than CRC32!
I have tried this myself by using Python 2.6 on Windows Vista, and got the same result.
Here are some results:
crc32: 162.481544276 MBps
md5: 224.489791549 MBps
crc32: 168.332996575 MBps
md5: 226.089336532 MBps
crc32: 155.851515828 MBps
md5: 194.943289532 MBps
I am thinking about the same question as well, and I'm tempted to use Rsync's variation of Adler-32 for detecting file differences.
Just a postscript to the above: JPEGs use lossy compression, and the extent of the compression may depend upon the program used to create the JPEG, the colour palette and/or bit depth on the system, display gamma, graphics card and user-set compression levels/colour settings. Therefore, comparing JPEGs built on different computers/platforms or with different software will be very difficult at the byte level.
This is 5 times faster than CCITT and does exactly the same job:
Python:
def crc16_fast(data: bytearray, length):
    crc = 0xCACA
    for i in range(length):
        crc ^= data[i]
    return crc
C:
uint16_t crc16_fast(const uint16_t* data, size_t length)
{
    uint16_t crc = 0xCACA;
    for (size_t i = 0; i < length; i++)
        crc ^= data[i];
    return crc;
}
