How does file recovery software work? - windows

I wanted to make some simple file recovery software, where I want to try to recover files which happen to have been deleted by pressing Shift + Delete. I'm working in Windows, can anyone show me any links or documents which can help me to do so programatically? I know C, C++, .NET. Any pointers?

http://www.google.hu/search?q=file+recovery+theory&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a :)
Mainly file recoveries are looking for file headers and/or filenames in the disk as I know, then try to get the whole file by the header information.
This could be a good start: http://geeksaresexy.blogspot.com/2006/02/theory-behind-deleted-files-recovery.html

The principle of all recovery tools is that deleting a file only removes a pointer in a folder and (quick) formatting of a partition only rewrites the first sectors of the partition which contains the headers of the filesystem. An in depth analysis of the partition data (at sector level) can rebuild a big part of the filesystem data, cluster allocation tables, folders, and file cluster chains.
All course if you use a surface test tool while formatting the partition that will rewrite all sectors to make sure that they are correct, nothing will be recoverable - unless you use specialized hardware to look at remanent magnetism on the edges of the actual tracks

In windows when a file is deleted(permanent delete) it's not actually deleted from disk but the file name added with char( _ I guess) in front of it and windows ignores these when showing in explorer... and recovery tools will search these kind of file names in the disk. And your file recover integrity based on some data over written on location of deleted file. Don't know this pattern still used by windows.. but long time back I have read this some where

Related

Undo a botched command prompt copy which concatenated all of my files

In a Windows 8 Command Prompt, I had a backup drive plugged in and I navigated to my User directory. I executed the command:
copy Documents G:/Seagate_backup/Documents
What I assumed was that copy would create the Documents directory on my backup drive and then copy the contents of the C: Documents directory into it. That is not what happened!
I proceeded to wipe my hard-drive and re-install the operating system, thinking I had backed up the important files, only to find out that copy seemingly concatenated all the C: Documents files of different types (.doc, .pdf, .txt, etc) into one file called "Documents." This file is of course unreadable but opening it in Notepad reveals what happened. I can see some of my documents which were plain text throughout the massively long file.
How do I undo this!!? It's terrible because I was actually helping a friend and was so sure of myself but now this has happened. The only thing I can think of doing is searching for some common separator amongst the concatenated files and write some sort of script to split the file back apart. But then I would have to guess the extensions of each of the pieces...
Merging files together in the fashion that copy uses, discards important file system information such as file size and file name. While the file name may not be as important the size is. Both parameters are used by the OS to discriminate files.
This problem might sound familiar if you have damaged your file allocation table before and all files disappeared. In both cases, you will end up with a binary blob (be it an actual disk or something like your file which might resemble a disk image) that lacks any size and filename information.
Fortunately, this is where a lot of file system recovery tools can help. They are specialized in matching patterns. Specifically they are looking for giveaway clues to what type a file is of, where it starts and what it's size is.
This is for instance enabled by many file types having a set of magic numbers that are used to allow a program to check if a file really is of the type that the extension claims to be.
In principle it is possible to undo this process more or less well.
You will need to use data recovery tools or other analysis tools like binwalk to extract the concatenated binary blob. Essentially the same tools that are used to recover deleted files should be able to extract your documents again. Without any filename of course. I recommend renaming the file to a disk image (.img) and either mounting it from within the operating system as a virtual harddisk (don't worry that it has no file system - it should show up as an unformatted drive) or directly using a data recovery tool or analysis tool which can read binary files (binwalk, for instance, can do that directly, but may not find all types of files as it's mainly for unpacking firmware images that may be assembled in the same or a similar way to how your files ended up).

Deleted file recovery program using C C++ [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 2 years ago.
Improve this question
I want to write a program that can recover deleted files from hard drive ( FAT32/NTFS partition Windows). I don't know where to start from. What should be the starting point of this? What should i read to pursue this? Help is required. Which system level structs should i study?
It's entirely a matter of the filesystem layout, how a "file" actually looks on disk, and what remains when a file is deleted. As such, pretty much all you need to understand is the filesystem spec (for each and every filesystem you want to support), and how to get direct block-level access to the HD data. It might be possible to reuse some code from existing filesystem drivers, but it will need to be modified to process structures that, from the point of view of the filesystem, are gone.
NTFS technical reference
NTFS.com
FAT32 spec
You should know first how file deletion is done in FAT32/NTFS, and how other undelete softwares work.
Undelete software understands the internals of the system used to store files on a disk (the file system) and uses this knowledge to locate the disk space that was occupied by a deleted file. Because another file may have used some or all of this disk space there is no guarantee that a deleted file can be recovered or if it is, that it won't have suffered some corruption. But because the space isn't re-used straight away there is a very good chance that you will recover the deleted file 100% intact. People who use deleted file recovery software are often amazed to find that it finds files that were deleted months or even years ago. The best undelete programs give you an indication of the chances of recovering a file intact and even provide file viewers so you can check the contents before recovery.
Here's a good read (but not so technical): http://www.tech-pro.net/how-to-recover-deleted-files.html
This is not as difficult as you think. You need to understand how files are stored in fat32 and NTFS. I recommend you use winhex an application used for digital forensics to check your address calculations are correct.
Ie NTFS uses master file records to store data of the file in clusters. Unlink deletes file in c but if you look at the source code all it does is removes entry from table and updates the records. Use an app like winhex to read information of the master file record. Here are some useful info.
Master boot record - sector 0
Hex 0x55AA is the end of MBR. Next will be mft
File name is mft header.
There is a flag to denote folder or file (not sure where).
The file located flag tells if file is marked deleted. You will need to change this flag if you to recover deleted file.
You need cluster size and number of clusters as well as the cluster number of where your data starts to calculate the start address if you want to access data from the master file table.
Not sure of FAT32 but just use same approach. There is a useful 21 YouTube video which explains how to use winhex to access deleted file data on NTFS. Not sure the video but just type in winhex digital forensics recover deleted file. Once you watch this video it will become much clearer.
good luck
Just watched the 21 min YouTube video on how to recover files deleted in NTFS using winhex. Don't forget resident flag which denotes if the file is resident or not. This gives you some idea of how the file is stored either in clusters or just in the mft data section if small. This may be required if you want to access the deleted data. This video is perfect to start with as it contains all the offset byte position to access most of the required information relative to beginning of the file record. It even shows you how to do the address calculation for the start of the cluster. You will need to access the table in binary format using a pointer and adding offsets to the pointer to access the required information. The only way to do it is go through the whole table and do a binary comparison of the filename byte for byte. Some fields are little eindian so make sure you got winhex to check your address calculations.

How can I find information about a file from logical cluster number in NTFS/FAT32?

I am trying to defragment a single file through Windows defragmentation API ( http://msdn.microsoft.com/en-us/library/aa363911(VS.85).aspx ) but if there is no free space block large enough for my file I would like to move other parts of files to make room for it.
The linked article mentions moving parts of other files but I can't find any information about how to find out which files to move. From the free space bitmap I can find an almost large enough space and I know the logical cluster numbers surrounding it, but from this I can't find out which files are surrounding it and a handle to the files is required to do FSCTL_MOVE_FILE which moves parts of files.
Is there any way, through the API or by parsing the MFT, to find out what file a logical cluster number is part of, and what virtual cluster number in the file corresponds to the logical cluster number found through the bitmap?
The slow but compatible method is to recursively scan all directories for files, and use the FSCTL_GET_RETRIEVAL_POINTERS. Then scan the resulting VCN-LCN mapping for the cluster in question.
Another option would be to query the USN Journal of the drive to get the File Reference IDs, then use FSCT_GET_NTFS_FILE_RECORD to get the $MFT file record.
I'm currently working on a simple Defrag program (written in Java) with the aim to pack files of a directory (e.g. all files of a large game) close together to reduce loading times and loading lags.
I use a faster method to retrieve the file mappings on the NTFS or FAT32 drive.
I parse the $MFT file directly (the format has some pitfalls), or the FAT32 file allocation table along with the directories.
The trick is to open the drive (e.g. "c:") with FileCreate for fully shared GENERIC read. The resulting handle can then be read with FileRead and FileSeek on a byte granularity. This works only in administrator mode (or elevated).
On NTFS, the $MFT might be fragmented and is a bit tricky to locate it from the boot sector info. I use the FSCTL_GET_RETRIEVAL_POINTERS on the C:\$MFT file to get its clusters.
On FAT32, one must parse the boot sector to locate the FAT table and the cluster containing root directory file. You need to parse the directory entries and recursively locate the clusters of the sub-directories.
There is no O(1) way of mapping from block # to file. You need to walk the entire MFT looking for files that contain that block.
Of course, in a live system, once you've read that data it's out-of-date and you must be prepared for failures in the move data FSCTL.

Millions of small graphics files and how to overcome slow file system access on XP

I'm rendering millions of tiles which will be displayed as an overlay on Google Maps. The files are created by GMapCreator from the Centre for Advanced Spatial Analysis at University College London. The application renders files in to a single folder at a time, in some cases I need to create about 4.2 million tiles. Im running it on Windows XP using an NTFS filesystem, the disk is 500GB and was formatted using the default operating system options.
I'm finding the rendering of tiles gets slower and slower as the number of rendered tiles increases. I have also seen that if I try to look at the folders in Windows Explorer or using the Command line then the whole machine effectively locks up for a number of minutes before it recovers enough to do something again.
I've been splitting the input shapefiles into smaller pieces, running on different machines and so on, but the issue is still causing me considerable pain. I wondered if the cluster size on my disk might be hindering the thing or whether I should look at using another file system altogether. Does anyone have any ideas how I might be able to overcome this issue?
Thanks,
Barry.
Update:
Thanks to everyone for the suggestions. The eventual solution involved writing piece of code which monitored the GMapCreator output folder, moving files into a directory heirarchy based upon their filenames; so a file named abcdefg.gif would be moved into \a\b\c\d\e\f\g.gif. Running this at the same time as GMapCreator overcame the filesystem performance problems. The hint about the generation of DOS 8.3 filenames was also very useful - as noted below I was amazed how much of a difference this made. Cheers :-)
There are several things you could/should do
Disable automatic NTFS short file name generation (google it)
Or restrict file names to use 8.3 pattern (e.g. i0000001.jpg, ...)
In any case try making the first six characters of the filename as unique/different as possible
If you use the same folder over and (say adding file, removing file, readding files, ...)
Use contig to keep the index file of the directory as less fragmented as possible (check this for explanation)
Especially when removing many files consider using the folder remove trick to reduce the direcotry index file size
As already posted consider splitting up the files in multiple directories.
.e.g. instead of
directory/abc.jpg
directory/acc.jpg
directory/acd.jpg
directory/adc.jpg
directory/aec.jpg
use
directory/b/c/abc.jpg
directory/c/c/acc.jpg
directory/c/d/acd.jpg
directory/d/c/adc.jpg
directory/e/c/aec.jpg
You could try an SSD....
http://www.crucial.com/promo/index.aspx?prog=ssd
Use more folders and limit the number of entries in any given folder. The time to enumerate the number of entries in a directory goes up (exponentially? I'm not sure about that) with the number of entries, and if you have millions of small files in the same directory, even doing something like dir folder_with_millions_of_files can take minutes. Switching to another FS or OS will not solve the problem---Linux has the same behavior, last time I checked.
Find a way to group the images into subfolders of no more than a few hundred files each. Make the directory tree as deep as it needs to be in order to support this.
The solution is most likely to restrict the number of files per directory.
I had a very similar problem with financial data held in ~200,000 flat files. We solved it by storing the files in directories based on their name. e.g.
gbp97m.xls
was stored in
g/b/p97m.xls
This works fine provided your files are named appropriately (we had a spread of characters to work with). So the resulting tree of directories and files wasn't optimal in terms of distribution, but it worked well enough to reduced each directory to 100s of files and free the disk bottleneck.
One solution is to implement haystacks. This is what Facebook does for photos, as the meta-data and random-reads required to fetch a file is quite high, and offers no value for a data store.
Haystack presents a generic HTTP-based object store containing needles that map to stored opaque objects. Storing photos as needles in the haystack eliminates the metadata overhead by aggregating hundreds of thousands of images in a single haystack store file. This keeps the metadata overhead very small and allows us to store each needle’s location in the store file in an in-memory index. This allows retrieval of an image’s data in a minimal number of I/O operations, eliminating all unnecessary metadata overhead.

Files on Windows and Contiguous Sectors

Is there a way to guarantee that a file on Windows (using the NTFS file system) will use contiguous sectors on the hard disk? In other words, the first chunk of the file will be stored in a certain sector, the second chunk of the file will be stored in the next sector, and so on.
I should add that I want to be able to create this file programmatically, so I'd rather not just ask the user to defrag their harddrive after creating this file. If there is a way to programmatically defrag just the file that I create, then that would be OK too.
I would start here:
http://technet.microsoft.com/en-us/sysinternals/bb897428.aspx
and follow Mark's documentation of the defrag stuff:
http://technet.microsoft.com/en-us/sysinternals/bb897427.aspx
I know of no such guarantees.
But also keep in mind that NTFS "files" are comprised of multiple data streams. So you are actually looking for a way to guarantee that a stream is contiguous.
I believe there's no way to achieve that. You can only defragment the file after it's been written.

Resources