I have a central repository where I store some binary (zip) files. A client can download specific files from this repository, unzip them locally, and then place the resulting files in a designated folder.
At some point, the encoding of one of the files inside the stored zip changes. I have no explanation for why this happens, but it does. My own files are in UTF-8 and contain a character whose bytes show up as C3 B3 in a hex editor. The client's process changes the encoding of at least one of these files to Windows-1252, so that the same character is represented as F3. This happens on their machine, but not on mine, for the same operation.
Any ideas?
This all checks out: 0xC3 0xB3 is the UTF-8 encoding for ó, and that's indeed 0xF3 in code page 1252. Zip archives do have code-page awareness, since they store strings. But that only applies to the archive's directory, i.e. the names of the files, and possibly a password. Never to the zipped files themselves; they are just treated as opaque blobs of bytes.
The much more likely scenario here is that whatever program the customer uses to read the file is making this conversion. It could be Notepad, for example. A very long shot is that the unzipper the customer uses is somehow aware that the zipped file is a text file, which is pretty unlikely. You'll need to get ahead by asking the customer what exactly they do with the .zip archive.
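If you want to pin down where the conversion happens, a quick check on the customer's machine is to look at the raw bytes of the extracted file before any other program touches it. Here's a minimal C++ sketch of such a check; "extracted.txt" is just a placeholder for whatever file the customer ends up with:

    // Reports whether the raw bytes still contain the UTF-8 sequence C3 B3
    // or the single CP-1252 byte F3.
    #include <cstddef>
    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <vector>

    int main() {
        std::ifstream in("extracted.txt", std::ios::binary);   // placeholder name
        std::vector<unsigned char> data((std::istreambuf_iterator<char>(in)),
                                        std::istreambuf_iterator<char>());

        bool utf8 = false, cp1252 = false;
        for (std::size_t i = 0; i + 1 < data.size(); ++i)
            if (data[i] == 0xC3 && data[i + 1] == 0xB3) utf8 = true;
        for (unsigned char b : data)
            if (b == 0xF3) cp1252 = true;

        std::cout << std::boolalpha
                  << "UTF-8 sequence C3 B3 found: " << utf8 << "\n"
                  << "CP-1252 byte F3 found:      " << cp1252 << "\n";
    }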
I have files with different extensions, some are text files, others are zipped files or images. How can I programmatically add a checksum to the files?
For example, my idea was to add a checksum somewhere in the metadata of the files. I tried doing it with PowerShell, but the properties of the files are read-only. I don't want to create a separate file that contains the checksum of the files. I want the checksum itself to be included somewhere in the file itself or in its metadata.
On Windows, with NTFS filesystem, you can use Alternate Data Streams.
They act exactly like files, but are hidden and attached to the main file - until that file is copied to a non-NTFS partition.
Otherwise, you can't just append a checksum to a file (even a short CRC32) without consequences. How would you be SURE that the last N bytes are your checksum, and not the file's data? You'd need to add a header as well (so even more bytes), and it can mess up loading of the file - just think of a plain text file with N bytes of binary data appended to it!
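If ADS works for your case, here's a rough C++ sketch of the idea (Windows/NTFS only). The ":checksum" stream name is arbitrary, the "file.txt:checksum" path syntax is the ADS syntax understood by the underlying Win32 file APIs, and the byte-sum "checksum" is just a stand-in for a real hash:

    // Store a checksum in an Alternate Data Stream attached to the file;
    // the main file's contents are untouched.
    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <string>

    std::uint32_t byte_sum(const std::string& path) {
        std::ifstream in(path, std::ios::binary);
        std::uint32_t sum = 0;
        char c;
        while (in.get(c)) sum += static_cast<unsigned char>(c);
        return sum;                              // toy checksum; use CRC32/SHA-256 in practice
    }

    int main() {
        const std::string file = "file.txt";     // placeholder name

        std::ofstream ads(file + ":checksum");   // hidden ADS stream
        ads << byte_sum(file) << "\n";
        ads.close();

        std::ifstream check(file + ":checksum"); // read it back later to verify
        std::uint32_t stored;
        check >> stored;
        std::cout << "stored checksum: " << stored << "\n";
    }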
macOS HFS+ supports transparent filesystem-level compression. How can I enable this compression for certain files via a programmatic API? (e.g. Cocoa or C interface)
I'd like to achieve the effect of ditto --hfsCompression src dst, but without shelling out.
To clarify: I'm asking how to turn an uncompressed file into a compressed one. I'm not interested in reading or preserving existing HFS compression state.
There's afsctool, which is an open-source implementation of HFS+ compression. It was originally written by the hacker brkirch (macrumors forum link, as he still visits there) but has since been expanded and improved a great deal by #rjvb, who is doing amazing things with it.
The copyfile.c file discloses some of the implementation details.
There's also a compression tool based on that: afsctool.
I think you're asking two different questions (and might not know it).
If you're asking "How can I make arbitrary file 'A' an HFS compressed file?" the answer is, you can,'t. HFS compressed files are created by the installer and (technically[1]) only Apple can create them.
If you are asking "How can I emulate the --hfsCompression logic in ditto such that I can copy an HFS compressed file from one HFS+ volume to another HFS+ volume and preserve its compression?" the answer to that is pretty straight forward, albeit not well documented.
HFS Compressed files have a special UF_COMPRESSED file flag. If you see that, the data fork of the file is actually an uncompressed image of a hidden resource. The compressed version of the file is stored in a special extended attribute. It's special because it normally doesn't appear in the list of attributes when you request them (so if you just ls -l# the file, for example, you won't see it). To list and read this special attribute you must pass the XATTR_SHOWCOMPRESSION flag to both the listxattr() and getxattr() functions.
To restore a compressed file, you reverse the process: Write an empty file, then restore all of its extended attributes, specifically the special one. When you're done, set the file's UF_COMPRESSED flag and the uncompressed data will magically appear in its data fork.
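For the inspection side, something along these lines should work on macOS. This is a sketch only: it checks the flag and lists the attributes (including the otherwise hidden one); the restore side would use setxattr() and chflags() as described above:

    #include <sys/stat.h>
    #include <sys/xattr.h>
    #include <cstdio>
    #include <cstring>
    #include <iostream>
    #include <vector>

    int main(int argc, char** argv) {
        if (argc < 2) { std::cerr << "usage: inspect <path>\n"; return 1; }
        const char* path = argv[1];

        struct stat st{};
        if (stat(path, &st) != 0) { perror("stat"); return 1; }
        std::cout << std::boolalpha << "UF_COMPRESSED set: "
                  << ((st.st_flags & UF_COMPRESSED) != 0) << "\n";

        // First call with a null buffer to get the required size, then fetch.
        // XATTR_SHOWCOMPRESSION makes the hidden compression attribute visible.
        ssize_t size = listxattr(path, nullptr, 0, XATTR_SHOWCOMPRESSION);
        if (size <= 0) { std::cout << "no extended attributes\n"; return 0; }

        std::vector<char> names(size);
        size = listxattr(path, names.data(), names.size(), XATTR_SHOWCOMPRESSION);

        // Names are packed as NUL-terminated strings.
        for (ssize_t off = 0; off < size; off += std::strlen(names.data() + off) + 1)
            std::cout << "xattr: " << (names.data() + off) << "\n";
        return 0;
    }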
[1] Note: It's rumored that the compressed resource of a file is just a ZIPed version of the data, possibly with some wrapper around it. I've never taken the time to experiment, but if you're intent on creating your own compressed files you could take a stab at reverse-engineering the compressed extended attribute.
In a Windows 8 Command Prompt, I had a backup drive plugged in and I navigated to my User directory. I executed the command:
copy Documents G:/Seagate_backup/Documents
What I assumed was that copy would create the Documents directory on my backup drive and then copy the contents of the C: Documents directory into it. That is not what happened!
I proceeded to wipe my hard-drive and re-install the operating system, thinking I had backed up the important files, only to find out that copy seemingly concatenated all the C: Documents files of different types (.doc, .pdf, .txt, etc) into one file called "Documents." This file is of course unreadable but opening it in Notepad reveals what happened. I can see some of my documents which were plain text throughout the massively long file.
How do I undo this!!? It's terrible because I was actually helping a friend and was so sure of myself, but now this has happened. The only thing I can think of doing is searching for some common separator amongst the concatenated files and writing some sort of script to split the file back apart. But then I would have to guess the extension of each of the pieces...
Merging files together the way copy does discards important file system information such as file size and file name. While the file name may not be that important, the size is. Both parameters are used by the OS to keep files apart.
This problem might sound familiar if you have damaged your file allocation table before and all files disappeared. In both cases, you will end up with a binary blob (be it an actual disk or something like your file which might resemble a disk image) that lacks any size and filename information.
Fortunately, this is where a lot of file system recovery tools can help. They specialize in pattern matching: specifically, they look for giveaway clues about what type a file is, where it starts, and what its size is.
This is possible because many file types have a set of magic numbers that allow a program to check whether a file really is of the type its extension claims to be.
In principle it is possible to undo this process more or less well.
You will need to use data recovery tools, or other analysis tools like binwalk, to extract the files from the concatenated binary blob. Essentially the same tools that are used to recover deleted files should be able to extract your documents again - without their file names, of course. I recommend renaming the file to a disk image (.img) and either mounting it from within the operating system as a virtual hard disk (don't worry that it has no file system - it should simply show up as an unformatted drive), or feeding it directly to a data recovery or analysis tool that can read raw binary files. binwalk, for instance, can do that directly, but it may not find all file types, as it is mainly intended for unpacking firmware images - which happen to be assembled in much the same way your files ended up.
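To give an idea of what those tools do under the hood, here is a rough C++ sketch that scans the blob for a couple of well-known magic numbers and prints the offsets where files appear to start. Real carvers know hundreds of signatures and also try to determine where each file ends:

    #include <cstddef>
    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        std::ifstream in("Documents", std::ios::binary);   // the concatenated blob
        std::string blob((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());

        // A small sample of magic numbers and the file types they indicate.
        const std::vector<std::pair<std::string, std::string>> magics = {
            {"%PDF-",                              ".pdf"},
            {std::string("PK\x03\x04", 4),         ".zip/.docx (ZIP-based)"},
            {std::string("\xD0\xCF\x11\xE0", 4),   ".doc/.xls (old Office)"},
        };

        for (const auto& m : magics) {
            std::size_t pos = blob.find(m.first);
            while (pos != std::string::npos) {
                std::cout << m.second << " signature at offset " << pos << "\n";
                pos = blob.find(m.first, pos + 1);
            }
        }
    }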
I'm interested in validating medical image files of certain formats. When I say validate, I mean make sure they are indeed files of that kind and not, say, some malware disguised as a file. So, for example, if someone has a file virus.exe and renames it to virus.dcm, I'd like to be able to tell it's not a legit .dcm file.
I've seen an answer for validating dicom files that says I should look at offset 0x80 for a certain label. But I'm not sure if it's possible for someone to insert that label into virus.dcm.
The file types I want to validate are DICOM files (.dcm, .PAR/.REC), NIFTI files (.nii, .nii.gz), ANALYZE files (.img/.hdr), and .zip files
I'm not looking for code per se (though that would be nice), but I'd like to know what's the best way to distinguish legitimate files of these types from malware files that have been changed to look like these files.
Validating a DICOM file is quite difficult: the problem is that the DICOM standard allows the first 128 bytes of the file to contain absolutely anything (including executable code). After those first 128 bytes there is the DICM signature (at offset 0x80).
So, even if you manage to open the DICOM file and see a valid image and tags in a DICOM viewer, the file could still contain executable code in the first 128 bytes (it would probably contain pointers to some portions at the end of the DICOM data).
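The signature check itself is easy - something like the following C++ sketch - but keep in mind it only proves the file claims to be DICOM, not that it is harmless:

    // A file only passes as DICOM if bytes 0x80..0x83 are "DICM".
    #include <cstring>
    #include <fstream>
    #include <iostream>

    bool looks_like_dicom(const char* path) {
        std::ifstream in(path, std::ios::binary);
        char magic[4] = {};
        in.seekg(0x80);                 // skip the 128-byte preamble
        in.read(magic, 4);
        return in.gcount() == 4 && std::memcmp(magic, "DICM", 4) == 0;
    }

    int main(int argc, char** argv) {
        if (argc < 2) { std::cerr << "usage: check <file.dcm>\n"; return 1; }
        std::cout << (looks_like_dicom(argv[1]) ? "has DICM signature\n"
                                                : "no DICM signature\n");
    }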
I suggest marking all DICOM files as non-executable, using chmod on Linux or this suggestion on Windows.
I am learning C++, specifically binary file structure/manipulation, and since I am totally new to the subject of binary files, bits, bytes & hexadecimal numbers, I decided to take one step backward and establish a solid understanding of the subject.
In the picture I have included below, I wrote two words (blue thief) in a .txt file.
The reason for this is that when I open the file in a hex editor, I want to understand how the information is really stored in hex format. Now, don't get me wrong, I am not trying to make a living out of reading hex formats all day, but only to have a minimum level of understanding of the basics of a binary file's composition. I also know all files have different structures, but just for the sake of understanding, I wanted to know how exactly the words "blue thief" and a single ' ' (space) were converted into those characters.
One more thing: I have heard that binary files contain three types of information: header, fmt & data! Is that only relevant to multimedia files like audio and video? Because I can't seem to see anything other than what looks like the data chunk in this file.
The characters in your text file are encoded in a Windows extension of ASCII--one byte for each character that you see in Notepad. What you see is what you get.
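If you want to reproduce what your hex editor shows, a few lines of C++ are enough; for "blue thief" you should see the bytes 62 6C 75 65 20 74 68 69 65 66 (20 being the space). The file name below is just a placeholder:

    // Print each character of the .txt file alongside the byte it is stored as.
    #include <cstdio>
    #include <fstream>

    int main() {
        std::ifstream in("blue_thief.txt", std::ios::binary);   // placeholder name
        char c;
        while (in.get(c))
            std::printf("'%c' = %02X\n", c, static_cast<unsigned char>(c));
    }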
Generally, a hard distinction is made between text and binary files on Windows systems. On Unix/Linux systems, the distinction is fuzzier... you could argue that there is no distinction, in fact.
On Windows systems, the distinction is enforced by file extensions. All files with the extension ".TXT" are assumed to be text files (i.e., to contain only hex codes that represent visible onscreen characters, where "visible" includes whitespace).
Binary files are a whole different kettle of fish. Most, as you mention, include some sort of header describing how the data that follows is encoded. These headers can vary tremendously in size depending on the type of data (again, assumed to be indicated by the extension on Windows systems as well as Unix). A simple example is the WAV format for uncompressed audio. If you open a WAV file in your hex editing program, you'll see that the first four bytes are "RIFF"--this is a marker, often called a "magic number" even though it is readable as text, indicating that the contents are an audio file. Newer versions of the WAV specification have complicated this somewhat, but originally the WAV header was just the "RIFF" tag plus a dozen or so bytes indicating the sample rate of the following data. (You can see this by comparing the raw data in a track on an audio CD to the WAV file created by ripping an uncompressed copy of that track at 44.1 KHz--the data should be the same, with just a header section added at the start of the WAV file.)
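As a rough illustration (not a robust parser - it assumes the canonical layout where the "fmt " chunk directly follows the RIFF header, which holds for most simple WAV files but isn't guaranteed), reading those header fields looks like this in C++:

    #include <cstdint>
    #include <cstring>
    #include <fstream>
    #include <iostream>

    int main() {
        std::ifstream in("track.wav", std::ios::binary);    // placeholder name
        char hdr[44] = {};
        in.read(hdr, sizeof hdr);

        // The "magic numbers": "RIFF" at offset 0 and "WAVE" at offset 8.
        if (std::memcmp(hdr, "RIFF", 4) != 0 || std::memcmp(hdr + 8, "WAVE", 4) != 0) {
            std::cout << "not a WAV file\n";
            return 1;
        }

        std::uint32_t sample_rate;
        std::memcpy(&sample_rate, hdr + 24, 4);   // little-endian field; assumes a little-endian host
        std::cout << "sample rate: " << sample_rate << " Hz\n";
    }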
Executable files (compiled programs) are a special type of binary file, but they follow roughly the same scheme of a header followed by data in a prescribed format. In this case, though, the "data" is executable machine code, and the header indicates, among other things, what operating system the file runs on. (For example, most Linux executables begin with the characters "ELF".)