In a Windows 8 Command Prompt, I had a backup drive plugged in and I navigated to my User directory. I executed the command:
copy Documents G:/Seagate_backup/Documents
What I assumed was that copy would create the Documents directory on my backup drive and then copy the contents of the C: Documents directory into it. That is not what happened!
I proceeded to wipe my hard drive and reinstall the operating system, thinking I had backed up the important files, only to find out that copy had seemingly concatenated all the C: Documents files of different types (.doc, .pdf, .txt, etc.) into one file called "Documents". This file is of course unreadable, but opening it in Notepad reveals what happened: scattered throughout the massively long file I can see some of my documents that were plain text.
How do I undo this!!? It's terrible because I was actually helping a friend and was so sure of myself but now this has happened. The only thing I can think of doing is searching for some common separator amongst the concatenated files and write some sort of script to split the file back apart. But then I would have to guess the extensions of each of the pieces...
Merging files the way copy does discards important file system metadata such as file size and file name. The name may not matter that much, but the size does: both are what the operating system uses to tell where one file ends and the next begins.
This problem might sound familiar if you have ever damaged your file allocation table and watched all your files disappear. In both cases you end up with a binary blob (be it an actual disk, or something like your file, which resembles a disk image) that lacks any size and filename information.
Fortunately, this is where a lot of file system recovery tools can help. They specialize in pattern matching: specifically, they look for giveaway clues about what type a file is, where it starts, and what its size is.
This is made possible, for instance, by the fact that many file types carry a set of magic numbers, which exist so that a program can check whether a file really is of the type its extension claims.
In principle it is possible to undo this process more or less well.
You will need data recovery tools or other analysis tools such as binwalk to extract the documents from the concatenated binary blob. Essentially the same tools that are used to recover deleted files should be able to extract your documents again, although without any filenames, of course. I recommend renaming the file to a disk image (.img) and then either mounting it from within the operating system as a virtual hard disk (don't worry that it has no file system; it should simply show up as an unformatted drive), or pointing a data recovery or analysis tool that can read raw binary files at it. binwalk, for instance, can do that directly, but it may not find all file types, since it is mainly meant for unpacking firmware images, which happen to be assembled in much the same way your files ended up.
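To make the magic-number idea concrete, here is a minimal carving sketch in Python. It only knows two signatures (PDF and PNG) and naively assumes each file ends at the first matching trailer, which real recovery tools handle far more robustly, so treat it purely as an illustration of the approach rather than a replacement for a proper tool:

```python
# carve.py - minimal file-carving sketch (illustration only).
# Assumes the blob fits in memory and only recognizes two signatures;
# real recovery tools know hundreds and validate internal structure.
import os

SIGNATURES = {
    "pdf": (b"%PDF-", b"%%EOF"),                       # header, trailer
    "png": (b"\x89PNG\r\n\x1a\n", b"IEND\xaeB`\x82"),  # header, end chunk
}

def carve(blob_path, out_dir="carved"):
    os.makedirs(out_dir, exist_ok=True)
    data = open(blob_path, "rb").read()
    count = 0
    for ext, (header, trailer) in SIGNATURES.items():
        start = data.find(header)
        while start != -1:
            end = data.find(trailer, start)
            if end == -1:
                break
            end += len(trailer)
            out = os.path.join(out_dir, f"recovered_{count}.{ext}")
            with open(out, "wb") as f:
                f.write(data[start:end])
            count += 1
            start = data.find(header, end)
    print(f"carved {count} candidate files into {out_dir}")

if __name__ == "__main__":
    carve("Documents")  # the concatenated blob from the question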
Recently I came across the os library in Python and found out about the existence of symbolic links. I would like to know what a symbolic link is, why it exists, and what its various uses are.
I will answer this from the perspective of a *nix user (specifically Linux). If you're interested in how this relates to Windows, I suggest you look for tutorials like this one. This will be a bit of a roundabout, but I find that symbolic links (symlinks) are best explained together with hard links and the general properties of a filesystem on Linux.
Links and files on Linux
As a rule of thumb, in Linux everything is treated as a file. Directories are files that contain mappings from names (paths) to inodes, which are just unique identifiers of the different objects residing on your system. Basically, if I give you a name like /home/gst/mydog.png, the accessing process will first look into the / directory (the root directory), where it finds information on where to find home. Opening that file, it looks inside to see where gst is, and finally, in that file, it tries to find the location of mydog.png and, if successful, do whatever it set out to do with it. Going back to directory files: the mappings they contain are called links, which brings us to hard and symbolic links.
Hard vs Symbolic links
A hard link is just a mapping like the one we discussed previously. It points directly to a certain object. A symlink on the other hand does not point directly to an object. Rather it just saves a path to an object. For example, say that I created a symbolic link to /home/gst/mydog.png at /home/gst/Desktop/mycat.png with os.symlink("/home/gst/mydog.png", "/home/gst/Desktop/mycat.png"). When I try to open it, the name /home/gst/Desktop/mycat.png is usually resolved to /home/gst/mydog.png. By following the symlink located at /home/gst/Desktop/mycat.png I actually (try to) access an object pointed to by /home/gst/mydog.png.
If I create a hard link (for example by calling os.link) I just add an entry to the relevant directory file, so that the new name can be followed to the linked object. When I create a symbolic link I create a file that contains a path to another file (which might itself be another symbolic link).
More specific to your question: if I pass /home/gst/Desktop/mycat.png to os.readlink, it will return /home/gst/mydog.png. This name resolution also happens when calling functions in os with the (optional) parameter follow_symlinks set to True; if it is set to False, the name does not get resolved (you would set it to False, for instance, when you want to manipulate the symlink itself, not the object it points to). From the module documentation:
not following symlinks: If follow_symlinks is False, and the last element of the path to operate on is a symbolic link, the function will operate on the symbolic link itself instead of the file the link points to. (For POSIX systems, Python will call the l... version of the function.)
You can check whether or not follow_symlinks is supported on your platform using os.supports_follow_symlinks. If it is unavailable, using it will raise a NotImplementedError.
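As a small, self-contained sketch of the difference on a Linux system (the filenames just mirror the example above):

```python
import os, stat, tempfile

d = tempfile.mkdtemp()
target = os.path.join(d, "mydog.png")
with open(target, "wb") as f:
    f.write(b"not really a png")

link = os.path.join(d, "mydog_hardlink.png")
sym = os.path.join(d, "mycat.png")
os.link(target, link)     # hard link: another directory entry for the same inode
os.symlink(target, sym)   # symlink: a tiny file that stores the path to target

print(os.readlink(sym))                                   # the stored path (target)
print(os.stat(target).st_ino == os.stat(link).st_ino)     # True: same inode
print(os.stat(sym).st_ino == os.stat(target).st_ino)      # True: stat follows the symlink
print(os.lstat(sym).st_ino == os.stat(target).st_ino)     # False: lstat looks at the link itself
print(stat.S_ISLNK(os.lstat(sym).st_mode))                # True: it really is a symlink
# follow_symlinks gives the same choice on many os functions, e.g.:
print(os.stat(sym, follow_symlinks=False).st_ino == os.lstat(sym).st_ino)  # True
```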
Why use hard links?
This question has already been answered here, quoting from the accepted answer:
The main advantage of hard links is that, compared to soft links, there is no size or speed penalty. Soft links are an extra layer of indirection on top of normal file access; the kernel has to dereference the link when you open the file, and this takes a small amount of time. The link also takes a small amount of space on the disk, to hold the text of the link. These penalties do not exist with hard links because they are built into the very structure of the filesystem.
I'd like to add that hard links allow for an easy method of file backup. For every file, the system keeps a count of its hard links. Once this count reaches 0, the space the file occupies is marked as free, meaning that the system will eventually overwrite it with other data (effectively deleting the previous file - which doesn't happen for at least as long as a running process holds an open stream to the file, but that's another story). Why does that matter?
Let's say you have a huge directory full of files you'd like to manipulate somehow (rename some, delete others, etc.) and you write a script to do this for you. However, you're not completely sure the script will work as intended, and you fear it might delete the wrong files. You also don't want to copy all the files, as that would take too much space and time. One solution is to just create a hard link for each file at some other point in the filesystem. If the script then deletes a file in the directory it operates on, the associated object is still available because there is another hard link pointing to it. Creating that many hard links consumes far less time and space than copying all the files, yet it gives you a reasonable backup strategy.
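A minimal sketch of that kind of hard-link safety net (the directory names are invented, and both directories must live on the same filesystem):

```python
import os

def hardlink_mirror(src_dir, backup_dir):
    """Create a hard link in backup_dir for every regular file in src_dir.
    Cheap insurance before running a risky bulk rename/delete script."""
    os.makedirs(backup_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        src = os.path.join(src_dir, name)
        if os.path.isfile(src) and not os.path.islink(src):
            os.link(src, os.path.join(backup_dir, name))

# hardlink_mirror("/home/gst/huge_dir", "/home/gst/huge_dir_links")
```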
This is not the case with symbolic links. Remember, a symlink stores a path, which resolves through other links (possibly further symlinks), not the actual file object. Hence I might create a symlink to a file, but it will only store that path. If the (eventual) hard link that the symlink resolves to gets removed from the system, trying to resolve the symlink won't lead you to a file. Such symlinks are said to be "broken" or "dangling". Thus you cannot rely on symlinks to preserve access to a certain file. (Conversely, deleting a symlink does not affect the link count of the target file.) So what's their use?
Why use symbolic links?
You can operate on symlinks as if they were the actual files they point to somewhere down the line (except for deleting them). This allows you to have multiple "access points" to a file without keeping excess copies (which remain up to date, since they always access the same file). If you want to replace the file that is being accessed, you only need to change it once, and all of the symlinks will resolve to the new file (as long as the path they store does not change). However, if you have hard links to a certain file and you then replace that file with another one, you also need to replace the hard links, as otherwise they will still point to the old file.
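A quick illustration of that difference, with invented paths:

```python
import os, tempfile

d = tempfile.mkdtemp()
orig = os.path.join(d, "report.txt")
with open(orig, "w") as f:
    f.write("version 1")

hard = os.path.join(d, "report_hard.txt")
soft = os.path.join(d, "report_soft.txt")
os.link(orig, hard)
os.symlink(orig, soft)

# Replace report.txt with a brand-new file (a new inode).
new = os.path.join(d, "report_new.txt")
with open(new, "w") as f:
    f.write("version 2")
os.replace(new, orig)

print(open(soft).read())  # "version 2" - the symlink resolves the path again
print(open(hard).read())  # "version 1" - the hard link still references the old inode
```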
Lastly, it is not uncommon to have several filesystems mounted on the same Linux machine. That is to say, the way data is organized and interpreted at one point in the file hierarchy (say /home/gst/fs1) can differ from how it is organized and interpreted at another point (say /home/gst/Desktop/fs2). A hard link can only reside on the same filesystem as the file it points to, whereas a symlink can be created on one filesystem yet effectively point to a file on another filesystem (see the answers to this question).
Symbolic links, also known as soft links, are special files that point to other files, much like shortcuts in Windows or aliases on the Macintosh. Unlike a hard link, a symbolic link does not contain the data of the target file; it simply points to another entry somewhere in the file system.
Read more here:
https://kb.iu.edu/d/abbe#:~:text=A%20symbolic%20link%2C%20also%20termed,somewhere%20in%20the%20file%20system.
macOS HFS+ supports transparent filesystem-level compression. How can I enable this compression for certain files via a programmatic API? (e.g. Cocoa or C interface)
I'd like to achieve the effect of ditto --hfsCompression src dst, but without shelling out.
To clarify: I'm asking how to make uncompressed file compressed. I'm not interested in reading or preserving existing HFS compression state.
There's afsctool, which is an open-source implementation of HFS+ compression. It was originally written by brkirch (who still visits the MacRumors forums) but has since been expanded and improved a great deal by @rjvb, who is doing amazing things with it.
The copyfile.c file discloses some of the implementation details.
There's also a compression tool based on that: afsctool.
I think you're asking two different questions (and might not know it).
If you're asking "How can I make arbitrary file 'A' an HFS compressed file?" the answer is, you can,'t. HFS compressed files are created by the installer and (technically[1]) only Apple can create them.
If you are asking "How can I emulate the --hfsCompression logic in ditto such that I can copy an HFS compressed file from one HFS+ volume to another HFS+ volume and preserve its compression?" the answer to that is pretty straight forward, albeit not well documented.
HFS compressed files have a special UF_COMPRESSED file flag. If you see that, the data fork of the file is actually an uncompressed image of a hidden resource. The compressed version of the file is stored in a special extended attribute. It's special because it normally doesn't appear in the list of attributes when you request them (so if you just ls -l@ the file, for example, you won't see it). To list and read this special attribute you must pass the XATTR_SHOWCOMPRESSION flag to both the listxattr() and getxattr() functions.
To restore a compressed file, you reverse the process: Write an empty file, then restore all of its extended attributes, specifically the special one. When you're done, set the file's UF_COMPRESSED flag and the uncompressed data will magically appear in its data fork.
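A rough Python/ctypes sketch of the inspection side of this; it only reads state, it does not create compressed files. The XATTR_SHOWCOMPRESSION value (0x0020) is what sys/xattr.h defines on my machine, and com.apple.decmpfs is the attribute name you would typically expect to see for a compressed file; double-check both assumptions on your system before relying on them:

```python
# Illustration only: inspect (not create) HFS+ compression state on macOS.
# XATTR_SHOWCOMPRESSION = 0x0020 is taken from <sys/xattr.h>; verify locally.
import ctypes, ctypes.util, os, stat

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
XATTR_SHOWCOMPRESSION = 0x0020

def is_hfs_compressed(path):
    # UF_COMPRESSED is exposed in Python's stat module on BSD/macOS.
    return bool(os.stat(path).st_flags & stat.UF_COMPRESSED)

def xattrs_including_hidden(path):
    # macOS: ssize_t listxattr(const char *path, char *namebuf, size_t size, int options)
    size = libc.listxattr(path.encode(), None, 0, XATTR_SHOWCOMPRESSION)
    if size <= 0:
        return []
    buf = ctypes.create_string_buffer(size)
    libc.listxattr(path.encode(), buf, size, XATTR_SHOWCOMPRESSION)
    return [n.decode() for n in buf.raw.split(b"\x00") if n]

# Example (path is hypothetical):
# p = "/Applications/Safari.app/Contents/Info.plist"
# print(is_hfs_compressed(p), xattrs_including_hidden(p))
# A compressed file typically shows a "com.apple.decmpfs" attribute here.
```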
[1] Note: It's rumored that the compressed resource of a file is just a ZIPed version of the data, possibly with some wrapper around it. I've never taken the time to experiment, but if you're intent on creating your own compressed files you could take a stab at reverse-engineering the compressed extended attribute.
I am trying to defragment a single file through the Windows defragmentation API (http://msdn.microsoft.com/en-us/library/aa363911(VS.85).aspx), but if there is no free-space block large enough for my file I would like to move parts of other files to make room for it.
The linked article mentions moving parts of other files but I can't find any information about how to find out which files to move. From the free space bitmap I can find an almost large enough space and I know the logical cluster numbers surrounding it, but from this I can't find out which files are surrounding it and a handle to the files is required to do FSCTL_MOVE_FILE which moves parts of files.
Is there any way, through the API or by parsing the MFT, to find out what file a logical cluster number is part of, and what virtual cluster number in the file corresponds to the logical cluster number found through the bitmap?
The slow but compatible method is to recursively scan all directories for files and call FSCTL_GET_RETRIEVAL_POINTERS on each one, then scan the resulting VCN-LCN mappings for the cluster in question.
Another option would be to query the USN journal of the drive to get the file reference IDs, then use FSCTL_GET_NTFS_FILE_RECORD to get the $MFT file record.
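If you go the slow-but-compatible route, the per-file query looks roughly like the Python/ctypes sketch below. It has minimal error handling, the fixed 64 KB output buffer is simply assumed to be big enough (a heavily fragmented file would need the ERROR_MORE_DATA loop that is omitted here), and the example path at the bottom is hypothetical:

```python
# Sketch: list the VCN -> LCN extents of one file via FSCTL_GET_RETRIEVAL_POINTERS.
import ctypes
import struct
from ctypes import wintypes

GENERIC_READ = 0x80000000
FILE_SHARE_READ = 0x00000001
FILE_SHARE_WRITE = 0x00000002
OPEN_EXISTING = 3
FILE_FLAG_BACKUP_SEMANTICS = 0x02000000
FSCTL_GET_RETRIEVAL_POINTERS = 0x00090073
INVALID_HANDLE_VALUE = ctypes.c_void_p(-1).value

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
kernel32.CreateFileW.restype = wintypes.HANDLE  # avoid handle truncation on 64-bit

def extents(path):
    handle = kernel32.CreateFileW(path, GENERIC_READ,
                                  FILE_SHARE_READ | FILE_SHARE_WRITE, None,
                                  OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, None)
    if handle == INVALID_HANDLE_VALUE:
        raise ctypes.WinError(ctypes.get_last_error())
    try:
        start_vcn = ctypes.c_longlong(0)                  # STARTING_VCN_INPUT_BUFFER
        out_buf = ctypes.create_string_buffer(64 * 1024)  # RETRIEVAL_POINTERS_BUFFER
        returned = wintypes.DWORD(0)
        ok = kernel32.DeviceIoControl(
            wintypes.HANDLE(handle), FSCTL_GET_RETRIEVAL_POINTERS,
            ctypes.byref(start_vcn), ctypes.sizeof(start_vcn),
            out_buf, ctypes.sizeof(out_buf),
            ctypes.byref(returned), None)
        if not ok:
            raise ctypes.WinError(ctypes.get_last_error())
        # Header: DWORD ExtentCount, 4 bytes padding, LONGLONG StartingVcn
        count, vcn = struct.unpack_from("<I4xq", out_buf, 0)
        result, offset = [], 16
        for _ in range(count):                            # each extent: NextVcn, Lcn
            next_vcn, lcn = struct.unpack_from("<qq", out_buf, offset)
            result.append((vcn, next_vcn, lcn))           # covers VCNs [vcn, next_vcn)
            vcn, offset = next_vcn, offset + 16
        return result
    finally:
        kernel32.CloseHandle(wintypes.HANDLE(handle))

# for vcn, next_vcn, lcn in extents(r"C:\some\large\file.bin"):  # hypothetical path
#     print(f"VCN {vcn}..{next_vcn - 1} -> LCN {lcn}")
```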
I'm currently working on a simple defrag program (written in Java) with the aim of packing the files of a directory (e.g. all files of a large game) close together to reduce loading times and loading lags.
I use a faster method to retrieve the file mappings on the NTFS or FAT32 drive.
I parse the $MFT file directly (the format has some pitfalls), or the FAT32 file allocation table along with the directories.
The trick is to open the drive (e.g. "c:") with FileCreate for fully shared GENERIC read access. The resulting handle can then be read with FileRead and FileSeek at byte granularity. This works only in administrator mode (or elevated).
On NTFS, the $MFT might be fragmented and is a bit tricky to locate from the boot sector info. I use FSCTL_GET_RETRIEVAL_POINTERS on the C:\$MFT file to get its clusters.
On FAT32, one must parse the boot sector to locate the FAT table and the cluster containing the root directory. You then need to parse the directory entries and recursively locate the clusters of the sub-directories.
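As a taste of the raw-read part, the same kind of access is possible from Python with ordinary file I/O, provided the process is elevated and every read is a multiple of the sector size (the drive letter is just an example):

```python
# Read the boot sector of a volume via the raw device path (requires elevation).
# Reads on a raw volume handle must be multiples of the sector size.
SECTOR = 512

with open(r"\\.\C:", "rb", buffering=0) as volume:
    boot = volume.read(SECTOR)

print(boot[3:11])                                     # OEM ID, e.g. b'NTFS    '
bytes_per_sector = int.from_bytes(boot[11:13], "little")
sectors_per_cluster = boot[13]
print(bytes_per_sector, sectors_per_cluster)
```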
There is no O(1) way of mapping from block # to file. You need to walk the entire MFT looking for files that contain that block.
Of course, in a live system, once you've read that data it's out-of-date and you must be prepared for failures in the move data FSCTL.
I want to make some simple file recovery software, to try to recover files which happen to have been deleted by pressing Shift + Delete. I'm working in Windows. Can anyone show me any links or documents which can help me do this programmatically? I know C, C++, and .NET. Any pointers?
http://www.google.hu/search?q=file+recovery+theory&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a :)
As far as I know, file recovery tools mainly look for file headers and/or filenames on the disk, then try to reconstruct the whole file from the header information.
This could be a good start: http://geeksaresexy.blogspot.com/2006/02/theory-behind-deleted-files-recovery.html
The principle of all recovery tools is that deleting a file only removes a pointer in a folder, and (quick) formatting of a partition only rewrites the first sectors of the partition, which contain the headers of the filesystem. An in-depth analysis of the partition data (at sector level) can rebuild a big part of the filesystem data: cluster allocation tables, folders, and file cluster chains.
Of course, if you use a surface-test tool while formatting the partition, which rewrites all sectors to make sure they are correct, nothing will be recoverable - unless you use specialized hardware to look at remanent magnetism on the edges of the actual tracks.
In Windows, when a file is deleted (permanently deleted), it's not actually wiped from the disk; instead, the file name is marked with a special character (an underscore, I guess) in front of it, and Windows ignores such entries when listing files in Explorer. Recovery tools search the disk for file names marked this way, and how well your file recovers depends on how much data has since been written over the deleted file's location. I don't know whether Windows still uses this scheme, but I read about it somewhere a long time ago.
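For what it's worth, the scheme being half-remembered here is almost certainly the FAT family's deleted-entry marker: the first byte of the 32-byte directory entry is overwritten with 0xE5 (NTFS works differently, clearing the in-use flag of the MFT record instead). A sketch of spotting such entries in a raw dump of a FAT directory cluster follows; the offsets are from the classic FAT layout, so verify them against the specification before relying on them:

```python
# Sketch: scan raw FAT directory data for deleted (0xE5-marked) 32-byte entries.
import struct

def deleted_entries(raw_dir_bytes):
    found = []
    for off in range(0, len(raw_dir_bytes) - 31, 32):
        entry = raw_dir_bytes[off:off + 32]
        first, attr = entry[0], entry[11]
        if first == 0x00:          # 0x00: no further entries in this directory
            break
        if attr == 0x0F:           # long-file-name entry, skip
            continue
        if first == 0xE5:          # deleted entry: first name byte replaced by 0xE5
            name = entry[1:8].decode("ascii", "replace").rstrip()
            ext = entry[8:11].decode("ascii", "replace").rstrip()
            cluster = struct.unpack_from("<H", entry, 26)[0] \
                      | (struct.unpack_from("<H", entry, 20)[0] << 16)
            size = struct.unpack_from("<I", entry, 28)[0]
            found.append((f"?{name}.{ext}", cluster, size))
    return found

# raw = open("directory_cluster.bin", "rb").read()   # hypothetical directory dump
# for name, cluster, size in deleted_entries(raw):
#     print(name, "first cluster:", cluster, "size:", size)
```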
I'm rendering millions of tiles which will be displayed as an overlay on Google Maps. The files are created by GMapCreator from the Centre for Advanced Spatial Analysis at University College London. The application renders files into a single folder at a time; in some cases I need to create about 4.2 million tiles. I'm running it on Windows XP using an NTFS filesystem; the disk is 500 GB and was formatted using the default operating system options.
I'm finding the rendering of tiles gets slower and slower as the number of rendered tiles increases. I have also seen that if I try to look at the folders in Windows Explorer or using the Command line then the whole machine effectively locks up for a number of minutes before it recovers enough to do something again.
I've been splitting the input shapefiles into smaller pieces, running on different machines and so on, but the issue is still causing me considerable pain. I wondered if the cluster size on my disk might be hindering the thing or whether I should look at using another file system altogether. Does anyone have any ideas how I might be able to overcome this issue?
Thanks,
Barry.
Update:
Thanks to everyone for the suggestions. The eventual solution involved writing a piece of code which monitored the GMapCreator output folder, moving files into a directory hierarchy based upon their filenames; so a file named abcdefg.gif would be moved into \a\b\c\d\e\f\g.gif. Running this at the same time as GMapCreator overcame the filesystem performance problems. The hint about the generation of DOS 8.3 filenames was also very useful - as noted below, I was amazed how much of a difference this made. Cheers :-)
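For anyone wanting to do the same, a minimal sketch of that fan-out move is below. The folder paths are made up, and the real tool also had to cope with files GMapCreator was still writing, which this sketch ignores:

```python
# Sketch: fan a flat directory of tiles out into a per-character hierarchy,
# e.g. abcdefg.gif -> a/b/c/d/e/f/g.gif, as described in the update above.
import os, shutil

def fan_out(flat_dir, dest_root):
    for name in os.listdir(flat_dir):
        stem, ext = os.path.splitext(name)
        if not stem:
            continue
        # All characters except the last become nested directories.
        subdir = os.path.join(dest_root, *stem[:-1])
        os.makedirs(subdir, exist_ok=True)
        shutil.move(os.path.join(flat_dir, name),
                    os.path.join(subdir, stem[-1] + ext))

# fan_out(r"C:\gmapcreator_output", r"D:\tiles")   # paths are hypothetical
```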
There are several things you could/should do
Disable automatic NTFS short file name generation (google it)
Or restrict file names to use 8.3 pattern (e.g. i0000001.jpg, ...)
In any case try making the first six characters of the filename as unique/different as possible
If you use the same folder over and over (say adding files, removing files, re-adding files, ...)
Use contig to keep the directory's index file as unfragmented as possible (check this for an explanation)
Especially when removing many files, consider using the folder-remove trick to reduce the directory index file size
As already posted consider splitting up the files in multiple directories.
e.g. instead of
directory/abc.jpg
directory/acc.jpg
directory/acd.jpg
directory/adc.jpg
directory/aec.jpg
use
directory/b/c/abc.jpg
directory/c/c/acc.jpg
directory/c/d/acd.jpg
directory/d/c/adc.jpg
directory/e/c/aec.jpg
You could try an SSD....
http://www.crucial.com/promo/index.aspx?prog=ssd
Use more folders and limit the number of entries in any given folder. The time to enumerate the entries in a directory goes up (exponentially? I'm not sure about that) with the number of entries, and if you have millions of small files in the same directory, even doing something like dir folder_with_millions_of_files can take minutes. Switching to another FS or OS will not solve the problem; Linux has the same behavior, last time I checked.
Find a way to group the images into subfolders of no more than a few hundred files each. Make the directory tree as deep as it needs to be in order to support this.
The solution is most likely to restrict the number of files per directory.
I had a very similar problem with financial data held in ~200,000 flat files. We solved it by storing the files in directories based on their name. e.g.
gbp97m.xls
was stored in
g/b/p97m.xls
This works fine provided your files are named appropriately (we had a spread of characters to work with). The resulting tree of directories and files wasn't optimal in terms of distribution, but it worked well enough to reduce each directory to hundreds of files and relieve the disk bottleneck.
One solution is to implement haystacks. This is what Facebook does for photos, as the metadata and random reads required to fetch a file are quite costly and offer no value for a simple data store.
Haystack presents a generic HTTP-based object store containing needles that map to stored opaque objects. Storing photos as needles in the haystack eliminates the metadata overhead by aggregating hundreds of thousands of images in a single haystack store file. This keeps the metadata overhead very small and allows us to store each needle’s location in the store file in an in-memory index. This allows retrieval of an image’s data in a minimal number of I/O operations, eliminating all unnecessary metadata overhead.