How to do large file integrity check - algorithm

I need to do an integrity check on a single large file. I have read the SHA code for Android, but it needs a second file to hold the resulting digest. Is there another method that uses only a single file?
I need a simple and quick method. Can I merge the two files into a single file?
The file is binary and the file name is fixed. I can get the file size using fstat. My problem is that I can only have one single file. Maybe I should use CRC, but it would be very slow because it is a large file.
My goal is to ensure the file on the SD card is not corrupt. I write it on a PC and read it on an embedded platform. The file is around 200 MB.

You have to store the hash somehow, no way around it.
You can try writing it to the file itself (at the beginning or end) and skip it when performing the integrity check. This can work for things like XML files, but not for images or binaries.
You can also put the hash in the filename, or just keep a database of all your hashes.
It really all depends on what your program does and how it's set up.
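As a rough sketch of the "write it to the file itself" option, the writer can append a digest as a fixed-size trailer, and the reader can hash everything except that trailer. This assumes you control both the writer and the reader, so the reader knows to ignore the last 32 bytes. SHA-256 and the function names below are illustrative choices, not something taken from the question.

```python
import hashlib
import os

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256
CHUNK = 1024 * 1024  # read 1 MiB at a time so the whole file is never in memory


def _hash_prefix(f, length):
    """Hash the first `length` bytes of the already-open binary file `f`."""
    h = hashlib.sha256()
    remaining = length
    while remaining > 0:
        block = f.read(min(CHUNK, remaining))
        if not block:
            break
        h.update(block)
        remaining -= len(block)
    return h.digest()


def append_digest(path):
    """Writer side: hash the payload and append the digest as a trailer."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        digest = _hash_prefix(f, size)
        f.seek(0, os.SEEK_END)
        f.write(digest)


def verify_digest(path):
    """Reader side: hash everything except the trailer and compare."""
    size = os.path.getsize(path)
    if size < DIGEST_SIZE:
        return False
    with open(path, "rb") as f:
        actual = _hash_prefix(f, size - DIGEST_SIZE)
        stored = f.read(DIGEST_SIZE)  # the file position is now at the trailer
    return actual == stored
```

The payload size is simply the file size (from fstat, for example) minus the trailer size, so no second file is needed.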

Related

How to create a partially modifiable binary file format?

I'm creating my custom binary file extension.
I use the RIFF standard for encoding data. And it seems to work pretty well.
But there are some additional requirements:
Binary files could be large up to 500 MB.
Real-time saving data into the binary file in intervals when data on the application has changed.
Application could run on the browser.
The problem I face is when I want to save data it needs to read everything from memory and rewrite the whole binary file.
This won't be a problem when data is small. But when it's getting larger, the Real-time saving feature seems to be unscalable.
So main requirement of this binary file could be:
Able to partially read the binary file (because the file is huge)
Able to partially write changed data into the file without rewriting the whole file.
A streaming protocol like .m3u8 is not an option; we can't split the file into chunks and point to them using separate URLs.
Any guidance on how to design a binary file system that scales in this scenario?
There is an answer from a random user that has been deleted here.
It seems great to me.
You can claim your answer back and I'll delete this one.
He said:
If we design the file to support appending, then we are able to add whatever data we want without needing to rewrite the whole file.
This idea gives me a very great starting point.
So I can append more and more changes at the end of the file.
Then I mark the old chunks of data in the middle of the file as obsolete.
I can then reuse these obsolete data slots later if I want to.
The downside is that I need to clean up the obsolete slot when I have a chance to rewrite the whole file.
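A minimal sketch of that append-only idea, written in Python only to illustrate the layout. The 8-byte header (record id plus payload length) and all function names are hypothetical and are not part of RIFF. An update appends a new version of a record, the latest version wins when the index is built, and compaction copies only the live records into a fresh file.

```python
import struct

# Hypothetical record header: 4-byte record id, 4-byte payload length.
HEADER = struct.Struct("<II")


def append_record(path, record_id, payload):
    """Append a new version of a record; earlier versions become obsolete."""
    with open(path, "ab") as f:
        f.write(HEADER.pack(record_id, len(payload)))
        f.write(payload)


def build_index(path):
    """Scan headers only; map record id -> (payload offset, length) of the latest version."""
    index = {}
    with open(path, "rb") as f:
        while True:
            header = f.read(HEADER.size)
            if len(header) < HEADER.size:
                break
            record_id, length = HEADER.unpack(header)
            index[record_id] = (f.tell(), length)
            f.seek(length, 1)  # skip the payload without reading it
    return index


def read_record(path, index, record_id):
    """Partial read: fetch only the bytes of one live record."""
    offset, length = index[record_id]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def compact(path, new_path):
    """Copy only the live records into a new file, dropping obsolete slots."""
    index = build_index(path)
    for record_id in index:
        append_record(new_path, record_id, read_record(path, index, record_id))
```

Reusing obsolete slots in place would also need a free list with slot sizes, which this sketch leaves out.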

How can you identify a file without a filename or filepath?

Suppose I give you a file. You can read the file, but you can't change it or copy it. Then I take the file, rename it, and move it to a new location. How could you identify that file (fairly reliably)?
My use case: if I have a database of media files for a program and the user changes the location or name of a file, could I find the file again by searching a directory and looking for something?
I have done exactly this, it's not hard.
I take a 256-bit hash (I forget which routine I used off the top of my head) of the file and the filesize and write it to a table. If they match the files match. (And I think tracking the size is more paranoia than necessity.) To speed things up I also fold that hash to a 32-bit value. If the 32-bit values match then I check all the data.
For the sake of performance I persist the last 10 million files I have examined. The 32-bit values go in one file which is read in its entirety; when a main record needs to be examined I pull in a "page" (I forget exactly how big) of them, which is padded to align it with the disk.
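A small sketch of the hash-plus-fold approach. The answer does not say which 256-bit hash or which folding scheme was used, so SHA-256 and XOR-ing the eight 32-bit words together are assumptions made for illustration, as are the function names.

```python
import hashlib
import struct


def file_fingerprint(path, chunk_size=1024 * 1024):
    """Return (256-bit digest, file size) computed in one streaming pass."""
    h = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
            size += len(block)
    return h.digest(), size


def fold32(digest):
    """Fold a digest to 32 bits by XOR-ing its big-endian 32-bit words."""
    folded = 0
    for (word,) in struct.iter_unpack(">I", digest):
        folded ^= word
    return folded
```

The 32-bit value works as a cheap pre-filter: only when two folded values match does the full digest (and size) need to be compared.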

Ruby PStore file too large

I am using PStore to store the results of some computer simulations. Unfortunately, when the file becomes too large (more than 2 GB from what I can see) I am no longer able to write the file to disk and I receive the following error:
Errno::EINVAL: Invalid argument - <filename>
I am aware that this is probably a limitation of IO but I was wondering whether there is a workaround. For example, to read large JSON files, I would first split the file and then read it in parts. Probably the definitive solution should be to switch to a proper database in the backend, but because of some limitations of the specific Ruby (Sketchup) I am using this is not always possible.
I am going to assume that your data has a field that could be used as a crude key.
Therefore I would suggest that instead of dumping data into one huge file, you could put your data into different files/buckets.
For example, if your data has a name field, you could take the first 1-4 chars of the name, create a file with those chars like rojj-datafile.pstore and add the entry there. Any records with a name starting 'rojj' go in that file.
A more structured version is to take the first char as a directory, then put the file inside that, like r/rojj-datafile.pstore.
Obviously your mechanism for reading/writing will have to take this new file structure into account, and it will undoubtedly end up slower to process the data into the pstores.
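The bucket naming can be sketched as a small path helper. It is shown in Python purely to illustrate the mapping from a key to a file (the stores themselves would still be Ruby PStore files). The function name and the assumption that the key contains only filename-safe characters are mine.

```python
import os


def bucket_path(name, root=".", prefix_len=4):
    """Map a record name to its bucket file, e.g. r/rojj-datafile.pstore."""
    if not name:
        raise ValueError("name must not be empty")
    prefix = name[:prefix_len]
    # First character becomes the directory, the prefix becomes the file name.
    return os.path.join(root, prefix[0], prefix + "-datafile.pstore")
```

Every read and write would call the same helper, so a given name always lands in the same store.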

Initializing large arrays efficiently in Xcode

I need to store a large number of different kinds of confidential data in my project.
The data can be represented as encoded NSStrings. I would rather initialize this in code than read it from a file, since that is more secure.
So I would need about 100k lines like this:
[myData addObject:@"String"];
or like this
myData[n++] = @"String";
Putting these lines in Xcode causes compile time to increase dramatically, up to several hours (by the way, in Eclipse it takes a fraction of a second to compile 100k lines like this).
What would be feasible secure alternatives?
(please do not suggest reading from a file since this makes the data much easier to crack)
Strings in your code can be readily dumped with tools like strings.
Anyway, if you want to incorporate a data file directly into the executable, you can do that with the -sectcreate linker option. Add something like -Wl,-sectcreate,MYSEG,MYSECT,path to the Other Linker Flags build setting. In your code, you can use getsectdata() to access that data section.
However, you must not consider any of the data that you actually deliver to the end user, whether in code or resource files, as "confidential". It isn't and can never be.
I would put the strings in a plist file and read it into an NSArray at run time. For security encrypt the file.

Generate File Names Automatically without collision

I'm writing a "file sharing hosting" and I want to rename all the files when uploading to a unique name and somehow keep track of the names on the database. Since I don't want two or more files having same name (which is surely impossible), I'm looking for an algorithm which based on key or something generates random names for me.
Moreover, I don't want to generate a name and search the database to see if the file already exists. I want to make sure 100% or 99% that the generated filename has never been created earlier by my application.
Any idea how I can write such application?
You could produce a hash based on the file contents itself. There are two good reasons to do this:
Allows you to never store the same file twice - for example, if you have two copies of a music file which are identical in content you could check to see if you have already stored that file, and just store it once.
You separate meta-data (file name is just meta data) from the blob. So you would have a storage system which is indexed by the hash of the file contents, and you then associate the file meta-data with that hash lookup code.
The risk of two files with different contents computing the same hash is low, depending on the size of the hash, and you can mitigate it further by hashing the file in chunks (which could then lead to some interesting storage optimisation scenarios :P).
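A minimal sketch of naming stored files by their content hash. SHA-256, the flat store directory, and the function name are assumptions for illustration. The original upload name would be kept separately as metadata keyed by the returned hash.

```python
import hashlib
import os
import shutil


def store_by_content(src_path, store_dir):
    """Copy a file into store_dir under its SHA-256 hex digest; return that name."""
    h = hashlib.sha256()
    with open(src_path, "rb") as f:
        for block in iter(lambda: f.read(1024 * 1024), b""):
            h.update(block)
    name = h.hexdigest()
    dest = os.path.join(store_dir, name)
    if not os.path.exists(dest):  # identical content is stored only once
        shutil.copyfile(src_path, dest)
    return name
```

Two uploads with identical bytes return the same name, which is what gives the deduplication described above.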
GUIDs are one way. You're basically guaranteed to not get any repeats (if you have a proper random generator).
You could also append with the time since epoch.
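For instance, a random (version 4) GUID can be generated with Python's standard uuid module. The helper name and the optional epoch-timestamp prefix are illustrative choices.

```python
import time
import uuid


def unique_name(extension="", with_timestamp=False):
    """Return a random UUID-based file name, optionally prefixed with epoch seconds."""
    name = uuid.uuid4().hex  # 32 hex characters from 122 random bits
    if with_timestamp:
        name = f"{int(time.time())}-{name}"
    return name + extension
```

No database lookup is needed before using the name, since a collision between random UUIDs is extremely unlikely given a sound random source.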
The best solutions have already been mentioned. I just want to add some thoughts.
The simplest solution is to have a counter and increment it for every new file. This works quite well as long as only one thread creates new files. If multiple threads, processes or even systems add new files, things get a bit more complicated. You must coordinate the creation of new ids with locking or a similar synchronisation method. You could also assign an id range to every process to reduce the synchronisation work, or extend the file id with a unique process id.
A better solution in this scenario might be to use GUIDs, so you do not have to care about synchronisation between processes.
Finally, you can add some random data to every identifier to make them harder to guess, if this is a requirement.
Also common is storing files in a directory structure where the location of a file depends on its name. File abcdef1234.xyz might be stored as /ab/cd/ef/1234.xyz. This avoids directories with a huge number of files. I am not really aware why this is done (maybe file system limitations or performance issues) but it is quite common. I do not know if similar things are common if the files are stored directly in the database.
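The name-to-directory mapping from that example can be sketched as follows. The three levels of two characters each mirror the abcdef1234.xyz case; the function name and parameters are hypothetical.

```python
import posixpath


def sharded_path(filename, levels=3, width=2, root="/"):
    """Split the leading characters of a name into nested directory names."""
    stem_len = levels * width
    if len(filename) <= stem_len:
        raise ValueError("filename is too short to shard")
    parts = [filename[i:i + width] for i in range(0, stem_len, width)]
    return posixpath.join(root, *parts, filename[stem_len:])
```

With the defaults, sharded_path("abcdef1234.xyz") gives /ab/cd/ef/1234.xyz.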
The best way is to simply use a counter. The first file is 1, the next is 2, another is 3, and so on...
But, it seems you want random. To quickly do this, you could make sure that your random number is greater than the last file created. You can cache the last file and then just offset your random number with its last name.
file = last_file + random(1 through 10)

Resources