I have files with different extensions, some are text files, others are zipped files or images. How can I programmatically add a checksum to the files?
For example, my idea was to add a checksum somewhere in the metadata of the files. I tried doing it with PowerShell, but the properties of the files are read-only. I don't want to create a separate file that contains the checksum of the files. I want the checksum itself to be included somewhere in the file itself or in its metadata.
On Windows with an NTFS filesystem, you can use Alternate Data Streams.
They behave like ordinary files, but they are hidden and attached to the main file. They are lost as soon as the file is copied to a non-NTFS partition.
Otherwise, you can't add a checksum to a file (even a short CRC32) without consequences. How would you be sure that the last N bytes are your checksum and not the file's data? You would need to add a header (so even more bytes), and that can break loading of the file: think of a plain text file with N bytes of binary data appended at the end.
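As a rough illustration of the Alternate Data Stream idea, here is a minimal Python sketch. It assumes Windows with an NTFS volume, where a stream is addressed as `file:streamname`. The stream name `sha256` and all function names are made up for this example.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file's main data stream."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stream_path(path, stream="sha256"):
    """Build the 'file:stream' name NTFS uses for an alternate data stream."""
    return f"{path}:{stream}"


def store_checksum(path):
    """Write the digest into an alternate data stream (Windows + NTFS only)."""
    with open(stream_path(path), "w", encoding="ascii") as f:
        f.write(sha256_of(path))


def verify_checksum(path):
    """Compare the stored digest with a freshly computed one."""
    with open(stream_path(path), "r", encoding="ascii") as f:
        return f.read().strip() == sha256_of(path)
```

Writing the stream leaves the main content untouched, so the text file, zip, or image still opens normally. On a filesystem without stream support, the same code would simply create a separate file with a colon in its name, which is exactly what the question wants to avoid.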
I've got an edge case where two files have the same name but different contents and are written to the same tarball. This causes there to be two entries in the tarball. I'm wondering if there's anything I can do to make the tar overwrite the file if it already exists in the tarball as opposed to creating another file with the same name.
No, there is no way: the first file has already been written when you ask to write the second one, and the stream position has advanced. Remember that tar files are accessed sequentially.
You should do deduplication before starting to write.
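Deduplicating before writing could look roughly like this in Python, using the standard `tarfile` module. The question does not say which language or library is in use, so treat this as a sketch. The function name and the "last one wins" policy are assumptions.

```python
import tarfile


def write_tar_deduplicated(tar_path, entries):
    """Write (archive_name, source_path) pairs, keeping the last source per name."""
    latest = {}
    for arcname, source in entries:
        latest[arcname] = source  # a later entry replaces an earlier one
    with tarfile.open(tar_path, "w") as tar:
        for arcname, source in latest.items():
            tar.add(source, arcname=arcname)
    return list(latest)
```

The whole list of entries has to be known before the first byte is written, which is the price of tar's sequential layout.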
In a Windows 8 Command Prompt, I had a backup drive plugged in and I navigated to my User directory. I executed the command:
copy Documents G:/Seagate_backup/Documents
What I assumed was that copy would create the Documents directory on my backup drive and then copy the contents of the C: Documents directory into it. That is not what happened!
I proceeded to wipe my hard-drive and re-install the operating system, thinking I had backed up the important files, only to find out that copy seemingly concatenated all the C: Documents files of different types (.doc, .pdf, .txt, etc) into one file called "Documents." This file is of course unreadable but opening it in Notepad reveals what happened. I can see some of my documents which were plain text throughout the massively long file.
How do I undo this!!? It's terrible because I was actually helping a friend and was so sure of myself but now this has happened. The only thing I can think of doing is searching for some common separator amongst the concatenated files and write some sort of script to split the file back apart. But then I would have to guess the extensions of each of the pieces...
Merging files together the way copy does discards important file system information such as file size and file name. The file name may matter less, but the size is essential. The OS uses both to tell files apart.
This problem might sound familiar if you have damaged your file allocation table before and all files disappeared. In both cases, you will end up with a binary blob (be it an actual disk or something like your file which might resemble a disk image) that lacks any size and filename information.
Fortunately, this is where a lot of file system recovery tools can help. They specialize in pattern matching: they look for telltale clues about what type a file is, where it starts, and what its size is.
This works because many file types begin with a set of magic numbers, which let a program check whether a file really is of the type its extension claims.
In principle it is possible to undo this process more or less well.
You will need to use data recovery tools or other analysis tools like binwalk to extract the files from the concatenated binary blob. Essentially the same tools that are used to recover deleted files should be able to extract your documents again, without any file names of course. I recommend renaming the file so that it looks like a disk image (.img). Then either mount it from within the operating system as a virtual hard disk (don't worry that it has no file system; it should show up as an unformatted drive), or open it directly in a data recovery or analysis tool that can read binary files. binwalk, for instance, can do that directly, but it may not find all file types, because it is mainly meant for unpacking firmware images, which may be assembled in the same or a similar way to how your files ended up.
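To give an idea of what such tools do, here is a minimal Python sketch that scans a blob for a handful of well-known magic numbers and reports where they occur. The signature table is a small hand-picked sample and the function name is invented for this example. Real recovery tools know far more formats and also work out where each file ends.

```python
# A few widely documented file signatures (far from complete).
SIGNATURES = {
    "pdf": b"%PDF-",
    "zip_or_docx": b"PK\x03\x04",
    "png": b"\x89PNG\r\n\x1a\n",
    "jpeg": b"\xff\xd8\xff",
    "ole2_doc": b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1",
}


def find_signatures(blob):
    """Return a sorted list of (offset, kind) for every signature hit in blob."""
    hits = []
    for kind, magic in SIGNATURES.items():
        start = blob.find(magic)
        while start != -1:
            hits.append((start, kind))
            start = blob.find(magic, start + 1)
    return sorted(hits)
```

Each offset is a candidate start of a file. Plain text files have no signature, so they only show up as the gaps between hits. Short signatures such as the JPEG one can also match by accident inside other data.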
I'm interested in validating medical image files of certain formats. When I say validate I mean make sure they are indeed files of that kind and not, say, some malware disguised as a file. So for example if someone has a file virus.exe and they changed it into virus.dcm I'd like to be able to tell it's not a legit .dcm file
I've seen an answer for validating dicom files that says I should look at offset 0x80 for a certain label. But I'm not sure if it's possible for someone to insert that label into virus.dcm.
The file types I want to validate are DICOM files (.dcm, .PAR/.REC), NIFTI files (.nii, .nii.gz), ANALYZE files (.img/.hdr), and .zip files
I'm not looking for code per se (though that would be nice), but I'd like to know what's the best way to distinguish legitimate files of these types from malware files that have been changed to look like these files.
Validating a DICOM file is quite difficult: the problem is that the DICOM standard allows the first 128 bytes of the file to contain absolutely anything (including executable code). The DICM signature follows those 128 bytes, at offset 0x80.
So, even if you manage to open the DICOM file and see a valid image and tags in a DICOM viewer, the file could still contain executable code in the first 128 bytes (it would probably contain pointers to some portions at the end of the DICOM data).
I suggest marking all the DICOM files as non-executable, using chmod on Linux or this suggestion on Windows
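A minimal Python sketch of a first-pass check: it confirms the DICM marker at offset 0x80 and also looks at what the 128-byte preamble starts with. Flagging preambles that begin with `MZ` (Windows executables) or `\x7fELF` (Linux executables) is my own assumption about what is worth checking, not something the DICOM standard prescribes, and the function name is hypothetical.

```python
def inspect_dicom_header(path):
    """Return (has_dicm_marker, preamble_looks_executable) for the file at path."""
    with open(path, "rb") as f:
        header = f.read(132)
    has_marker = len(header) == 132 and header[128:132] == b"DICM"
    preamble = header[:128]
    # Assumption: flag preambles that start like a Windows PE ("MZ") or ELF binary.
    looks_executable = preamble.startswith((b"MZ", b"\x7fELF"))
    return has_marker, looks_executable
```

A file that passes both checks is not proven safe. This only catches the crude case of an executable header sitting in front of a valid DICM marker.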
I have a central repository where I store some binary (zip) files. A client can download specific files from this repository, unzips them locally and then places the resulting files in a designated folder.
At some point, this changes the encoding of one of the files in the stored zip. I have no explanation for why this happens, but it does. My own files are in UTF-8 and contain a character whose code point is represented as C3B3 in a hex editor. The client changes the encoding of at least one of these files to Windows-1252, so that the character is represented as F3. This happens on their machine, but not on mine, for the same operation.
Any ideas?
This all checks out: 0xC3 0xB3 is the UTF-8 encoding for ó, and that is indeed 0xF3 in code page 1252. Zip archives do have code page awareness, because they store strings. But that only applies to the archive's directory, that is, the names of the files, and to a possible password. It never applies to the zipped files themselves, which are treated as binary blobs of bytes.
The much more likely scenario is that whatever program the customer uses to read the file is making this conversion. It could be Notepad, for example. A very long shot is that the customer's unzipper is somehow aware that the zipped file is a text file, which is pretty unlikely. To make progress, ask the customer exactly what they do with the .zip archive.
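The byte values, and the fact that zipping leaves file contents alone, can be checked with a few lines of Python. This is only an illustration using the standard `zipfile` module and an in-memory archive. The member name `sample.txt` is made up.

```python
import io
import zipfile

text = "ó"
utf8_bytes = text.encode("utf-8")      # the two bytes C3 B3
cp1252_bytes = text.encode("cp1252")   # the single byte F3

# Round-trip the UTF-8 bytes through an in-memory zip archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sample.txt", utf8_bytes)
with zipfile.ZipFile(buffer) as archive:
    extracted = archive.read("sample.txt")

# The member comes back byte for byte. Only a later decode/re-encode
# step, done by some other program, turns C3 B3 into F3.
reencoded = extracted.decode("utf-8").encode("cp1252")
```

Comparing a hash of the extracted file on the customer's machine, before anything else opens it, against the hash on your side would show whether the unzip step or a later tool is responsible.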
I need to do an integrity check for a single big file. I have read the SHA code for Android, but it needs a separate file to hold the resulting digest. Is there another method that uses a single file?
I need a simple and quick method. Can I merge the two files into a single file?
The file is binary and the file name is fixed. I can get the file size using fstat. My problem is that I can only have one single file. Maybe I should use CRC, but it would be very slow because it is a large file.
My goal is to ensure the file on the SD card is not corrupt. I write it on a PC and read it on an embedded platform. The file is around 200 MB.
You have to store the hash somewhere; there is no way around it.
You can try writing it to the file itself (at the beginning or end) and skip it when performing the integrity check. This can work for things like XML files, but not for images or binaries.
You can also put the hash in the filename, or just keep a database of all your hashes.
It really all depends on what your program does and how it's set up.
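For the "write it to the file itself and skip it when checking" option, a minimal Python sketch might look like this. It assumes you control both the writer and the reader, so the reader knows that the last 32 bytes are a SHA-256 trailer and not data. The function names and the choice of SHA-256 are mine. On the embedded side the same logic would have to be ported to whatever language is available there.

```python
import hashlib
import os

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes


def _hash_prefix(f, length, chunk_size=1 << 20):
    """Hash the first `length` bytes of an open binary file, in chunks."""
    digest = hashlib.sha256()
    remaining = length
    while remaining > 0:
        chunk = f.read(min(chunk_size, remaining))
        if not chunk:
            break
        digest.update(chunk)
        remaining -= len(chunk)
    return digest.digest()


def append_digest(path):
    """Append the SHA-256 of the current contents as a 32-byte trailer."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        payload_digest = _hash_prefix(f, size)
        f.seek(0, os.SEEK_END)
        f.write(payload_digest)


def verify_digest(path):
    """Recompute the hash over everything except the trailer and compare."""
    payload_size = os.path.getsize(path) - DIGEST_SIZE
    if payload_size < 0:
        return False
    with open(path, "rb") as f:
        expected = _hash_prefix(f, payload_size)
        stored = f.read(DIGEST_SIZE)
    return stored == expected
```

Reading in chunks keeps memory use flat for a 200 MB file. Any program that is not aware of the trailer will see 32 extra bytes at the end, which is why this does not suit images or other formats opened by third-party software.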