How to validate medical image files

I'm interested in validating medical image files of certain formats. By "validate" I mean making sure they are indeed files of that kind and not, say, some malware disguised as such a file. So, for example, if someone has a file virus.exe and renames it to virus.dcm, I'd like to be able to tell that it's not a legitimate .dcm file.
I've seen an answer about validating DICOM files that says I should look at offset 0x80 for a certain label, but I'm not sure whether someone could simply insert that label into virus.dcm as well.
The file types I want to validate are DICOM files (.dcm, .PAR/.REC), NIFTI files (.nii, .nii.gz), ANALYZE files (.img/.hdr), and .zip files.
I'm not looking for code per se (though that would be nice), but I'd like to know the best way to distinguish legitimate files of these types from malware files that have been renamed to look like them.

Validating a DICOM file is quite difficult: the problem is that the DICOM standard allows the first 128 bytes of the file (the preamble) to contain absolutely anything, including executable code. Only after those first 128 bytes does the DICM signature appear (at offset 0x80).
So, even if you manage to open the DICOM file and see a valid image and tags in a DICOM viewer, the file could still contain executable code in the first 128 bytes (which would probably point to data appended after the DICOM content).
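You said you're not looking for code, but for completeness, here is a minimal sketch of the signature check itself (the function name is illustrative). Per the above, it only verifies that the file claims to be DICOM and says nothing about what the preamble contains:

```python
def has_dicom_signature(path):
    """Check for the 'DICM' signature that follows the 128-byte preamble."""
    with open(path, "rb") as f:
        preamble = f.read(128)   # the standard allows these bytes to be anything
        magic = f.read(4)        # offset 0x80
    return magic == b"DICM"
```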
I suggest marking all the DICOM files as non-executable, using chmod on Linux or an equivalent mechanism on Windows.

Related

How to add checksum to a file in Windows?

I have files with different extensions: some are text files, others are zip archives or images. How can I programmatically add a checksum to these files?
For example, my idea was to add a checksum somewhere in the metadata of the files. I tried doing it with PowerShell, but the properties of the files are read-only. I don't want to create a separate file that contains the checksum; I want the checksum itself to be included somewhere in the file or in its metadata.
On Windows, on an NTFS filesystem, you can use Alternate Data Streams (ADS).
They act exactly like files, but are hidden and attached to the main file; note that they are lost as soon as the file is copied to a non-NTFS partition.
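As a rough sketch (assuming Windows and NTFS; the file name and the stream name "checksum" are just examples), an alternate data stream is addressed by appending ":streamname" to the path and used with ordinary file APIs:

```python
import hashlib

path = "report.docx"  # hypothetical example file on an NTFS volume

# Compute a checksum over the main stream's content.
with open(path, "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

# Store it in an alternate data stream named "checksum".
# The main file's content and size stay untouched.
with open(path + ":checksum", "w") as ads:
    ads.write(digest)

# Read it back later to verify.
with open(path + ":checksum") as ads:
    stored = ads.read()
print(stored == digest)
```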
Otherwise, you can't just add a checksum to a file (even a short CRC32) without consequences: how would you be SURE that the last N bytes are your checksum and not the file's own data? You'd need to add a header as well (so even more bytes), and all of that can break loading of the file. Just think of a simple, plain text file with N bytes of binary data appended at the end!

How to compress a file on HFS programmatically?

macOS HFS+ supports transparent filesystem-level compression. How can I enable this compression for certain files via a programmatic API (e.g. a Cocoa or C interface)?
I'd like to achieve the effect of ditto --hfsCompression src dst, but without shelling out.
To clarify: I'm asking how to make an uncompressed file compressed. I'm not interested in reading or preserving the existing HFS compression state.
There's afsctool, which is an open-source implementation of HFS+ compression. It was originally written by the hacker brkirch (who still visits the MacRumors forums), but has since been expanded and improved a great deal by @rjvb, who is doing amazing things with it.
The copyfile.c file discloses some of the implementation details.
There's also a compression tool based on that: afsctool.
I think you're asking two different questions (and might not know it).
If you're asking "How can I make arbitrary file 'A' an HFS compressed file?" the answer is: you can't. HFS compressed files are created by the installer, and (technically[1]) only Apple can create them.
If you are asking "How can I emulate the --hfsCompression logic in ditto such that I can copy an HFS compressed file from one HFS+ volume to another and preserve its compression?" the answer to that is pretty straightforward, albeit not well documented.
HFS Compressed files have a special UF_COMPRESSED file flag. If you see that, the data fork of the file is actually an uncompressed image of a hidden resource. The compressed version of the file is stored in a special extended attribute. It's special because it normally doesn't appear in the list of attributes when you request them (so if you just ls -l@ the file, for example, you won't see it). To list and read this special attribute you must pass the XATTR_SHOWCOMPRESSION flag to both the listxattr() and getxattr() functions.
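A minimal detection sketch, assuming macOS: the flag values below are copied from the macOS headers (<sys/stat.h> and <sys/xattr.h>), and ctypes is used because Python's os module does not expose the options argument of listxattr() on macOS:

```python
import ctypes
import ctypes.util
import os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.listxattr.restype = ctypes.c_ssize_t

UF_COMPRESSED = 0x20           # from <sys/stat.h>
XATTR_SHOWCOMPRESSION = 0x20   # from <sys/xattr.h>

def is_hfs_compressed(path):
    """True if the file carries the UF_COMPRESSED flag."""
    return bool(os.stat(path).st_flags & UF_COMPRESSED)

def xattr_names(path):
    """List extended attributes, including the normally hidden compression one."""
    raw_path = path.encode()
    size = libc.listxattr(raw_path, None, 0, XATTR_SHOWCOMPRESSION)
    if size <= 0:
        return []
    buf = ctypes.create_string_buffer(size)
    libc.listxattr(raw_path, buf, size, XATTR_SHOWCOMPRESSION)
    return [n.decode() for n in buf.raw.split(b"\x00") if n]
```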
To restore a compressed file, you reverse the process: Write an empty file, then restore all of its extended attributes, specifically the special one. When you're done, set the file's UF_COMPRESSED flag and the uncompressed data will magically appear in its data fork.
[1] Note: It's rumored that the compressed resource of a file is just a zipped version of the data, possibly with some wrapper around it. I've never taken the time to experiment, but if you're intent on creating your own compressed files you could take a stab at reverse-engineering the compressed extended attribute.

Steganography - Is there a smarter/more complex way of hiding files?

So far, every tutorial I've seen on this topic, whether for Linux or Windows, involves compressing files and putting them inside an image, thereby creating a new image that contains those files.
A zip file appended to a jpg file can be easily detected: with a little analysis you can see that the jpg file has some extra information at the end, and you can recognize the header of a zip file after the normal JPEG data (even if the zip is encrypted); see the sketch after this question.
The question is: is there a smarter/more complex method for hiding files inside an image?
Thanks for any help.
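For illustration, the analysis described in the question can be as simple as this crude sketch (assuming the file starts as a normal JPEG; note that the zip signature could in principle also occur by chance inside compressed image data):

```python
def find_appended_zip(path):
    """Crude check for a zip archive appended after JPEG data."""
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(b"\xff\xd8"):
        return "not a JPEG (no SOI marker)"
    # "PK\x03\x04" is the zip local file header signature.
    idx = data.find(b"PK\x03\x04")
    if idx == -1:
        return "no zip signature found"
    return "zip signature at offset %d (file is %d bytes)" % (idx, len(data))
```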
Steganography can be done in three domains: the spatial, frequency, and compressed domains. Each of these domains has its pros and cons.
For example, in the compressed domain, you can hide a large secret in a compressed image and it will be very difficult to detect, since only the binary stream is transmitted from sender to receiver. However, in the spatial domain, the existence of a secret message is easier to detect, as the natural image is transferred as a whole.
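As a concrete taste of the spatial domain, here is a minimal least-significant-bit (LSB) embedding sketch; it assumes the Pillow library and an RGB cover image, and real algorithms do far more (encryption, spreading, adaptive embedding), so treat it as illustration only:

```python
from PIL import Image

def lsb_embed(cover_path, payload, out_path):
    """Hide payload bytes in the least significant bit of each RGB channel."""
    img = Image.open(cover_path).convert("RGB")
    # Unpack the payload into individual bits, most significant bit first.
    bits = [(byte >> i) & 1 for byte in payload for i in range(7, -1, -1)]
    flat = [channel for pixel in img.getdata() for channel in pixel]
    if len(bits) > len(flat):
        raise ValueError("payload too large for this cover image")
    for i, bit in enumerate(bits):
        flat[i] = (flat[i] & ~1) | bit   # overwrite only the lowest bit
    out = Image.new("RGB", img.size)
    out.putdata(list(zip(flat[0::3], flat[1::3], flat[2::3])))
    out.save(out_path, "PNG")  # lossless output so the embedded bits survive
```

Extraction reverses the process by reading the same bits back. Naive LSB embedding like this is itself well studied by steganalysis, which is why published schemes keep getting more sophisticated.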
There are many new steganographic algorithms published in journals each month; you can look at the top journals in this field (such as those from IEEE, or Information Sciences) to find a suitable technique.

Does Windows ever change an encoding upon move or read?

I have a central repository where I store some binary (zip) files. A client can download specific files from this repository, unzip them locally, and then place the resulting files in a designated folder.
At some point, this process changes the encoding of one of the files in the stored zip. I have no explanation for why this happens, but it does. My own files are in UTF-8 and contain a character whose code point is represented as C3B3 in a hex editor. On the client's side, the encoding of at least one of these files ends up as Windows-1252, so that the character is represented as F3. This happens on their machine, but not on mine, for the same operation.
Any ideas?
This all checks out: 0xC3 0xB3 is the UTF-8 encoding for ó, and that's indeed 0xF3 in code page 1252. Zip archives do have code-page awareness, since they store strings. But that only applies to the dictionary of the archive, that is, the names of the files, and possibly a password. Never to the zipped files themselves; they are just treated as binary blobs of bytes.
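The two byte sequences are easy to confirm, e.g. from a Python prompt:

```python
>>> "ó".encode("utf-8").hex()
'c3b3'
>>> "ó".encode("cp1252").hex()
'f3'
```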
The much more likely scenario here is that whatever program the customer uses to read the file is making this conversion; it could be Notepad, for example. A very long-distance shot is that the unzipper the customer uses is somehow aware that the zipped file is a text file, which is pretty unlikely. You'll only get ahead by asking the customer what exactly they do with the .zip archive.

In-depth understanding of binary files

I am learning C++, specifically binary file structure/manipulation, and since I am totally new to the subject of binary files, bits, bytes & hexadecimal numbers, I decided to take one step back and establish a solid understanding of the subject.
In the picture I have included below, I wrote two words (blue thief) in a .txt file.
The reason for this is that when I decode the file using a hex editor, I want to understand how the information is really stored in hex format. Now, don't get me wrong, I am not trying to make a living out of reading hex formats all day, but only to have a minimum level of understanding of the basics of a binary file's composition. I also know all files have different structures, but just for the sake of understanding, I wanted to know how exactly the words "blue thief" and a single ' ' (space) were converted into those characters.
One more thing: I have heard that binary files contain three types of information: a header, fmt & the data! Is that only a concern for multimedia files like audio and video? Because I can't seem to see anything other than what looks like the data chunk in this file.
The characters in your text file are encoded in a Windows extension of ASCII--one byte for each character that you see in Notepad. What you see is what you get.
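You can confirm this byte-per-character mapping yourself; a quick sketch that prints each character of the two words next to its code:

```python
# Each character of "blue thief" occupies exactly one byte.
for ch in "blue thief":
    print(repr(ch), hex(ord(ch)))
# 'b' 0x62, 'l' 0x6c, 'u' 0x75, 'e' 0x65, ' ' 0x20,
# 't' 0x74, 'h' 0x68, 'i' 0x69, 'e' 0x65, 'f' 0x66
```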
Generally, a hard distinction is made between text and binary files on Windows systems. On Unix/Linux systems, the distinction is fuzzier... you could argue that there is no distinction, in fact.
On Windows systems, the distinction is enforced by file extensions. All files with the extension ".TXT" are assumed to be text files (i.e., to contain only hex codes that represent visible onscreen characters, where "visible" includes whitespace).
Binary files are a whole different kettle of fish. Most, as you mention, include some sort of header describing how the data that follows is encoded. These headers can vary tremendously in size depending on the type of data (again, assumed to be indicated by the extension on Windows systems as well as Unix). A simple example is the WAV format for uncompressed audio. If you open a WAV file in your hex editing program, you'll see that the first four bytes are "RIFF"--this is a marker, often called a "magic number" even though it is readable as text, indicating that the contents are an audio file. Newer versions of the WAV specification have complicated this somewhat, but originally the WAV header was just the "RIFF" tag plus a dozen or so bytes indicating the sample rate of the following data. (You can see this by comparing the raw data in a track on an audio CD to the WAV file created by ripping an uncompressed copy of that track at 44.1 KHz--the data should be the same, with just a header section added at the start of the WAV file.)
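A small sketch of inspecting such a header yourself, assuming a canonical PCM WAV file (field offsets per the classic RIFF/WAVE layout):

```python
import struct

def read_wav_header(path):
    """Parse the fixed part of a canonical PCM WAV header."""
    with open(path, "rb") as f:
        header = f.read(36)
    riff, chunk_size, wave = struct.unpack("<4sI4s", header[:12])
    if riff != b"RIFF" or wave != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    # "fmt " subchunk: format tag, channel count, sample rate, ...
    fmt_id, fmt_size, audio_fmt, channels, sample_rate = struct.unpack(
        "<4sIHHI", header[12:28])
    return {"pcm": audio_fmt == 1, "channels": channels,
            "sample_rate": sample_rate}
```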
Executable files (compiled programs) are a special type of binary file, but they follow roughly the same scheme of a header followed by data in a prescribed format. In this case, though, the "data" is executable machine code, and the header indicates, among other things, what operating system the file runs on. (For example, Linux executables begin with a 0x7F byte followed by the characters "ELF".)
