Recover deleted data from HDFS - Hadoop

We have a Hadoop v1.2.1 cluster. We deleted one of our HDFS folders by mistake, but we shut the cluster down immediately. Is there any way to get our data back?
Even getting back part of the data would be better than none! Since the data set was so large, most of it has probably not actually been removed yet.
Thanks for your help.

This could be an easy fix if you have fs.trash.interval set to a value greater than 0. If so, HDFS's trash feature is enabled and your files should still be sitting in the trash directory, which by default is /user/X/.Trash (where X is the user who issued the delete).
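For example, something along these lines (paths are illustrative; the exact layout under .Trash depends on your trash configuration and on which user issued the delete):

    hadoop fs -ls /user/X/.Trash/Current
    # copy the folder back to its original location
    hadoop fs -cp /user/X/.Trash/Current/path/to/folder /path/to/folder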
Otherwise, your best option is probably to find and use a data recovery tool. Some quick Googling turned up this cross-platform tool, available under a GNU license, that runs from the terminal: http://www.cgsecurity.org/wiki/PhotoRec. It works on many different types of file systems, so it may be able to recover the block files HDFS stores on the DataNodes' local disks.

Related

Could HDFS orphan files in datanode?

During a routine log pruning job, in which logs older than 60 days were being removed, a system administrator upgraded CDH from 4.3 to 4.6 (I know, I know)...
Normally, the log pruning job frees about 40% of HDFS's available storage. During the upgrade, however, datanodes went down, were rebooted, and all sorts of madness ensued.
What's known is that HDFS received the delete commands, since the HDFS files/folders no longer exist, but disk utilization is still unchanged.
My question is, could HDFS have removed the files from the NameNode's metadata without actually fulfilling the file block deletes among the DataNodes, effectively orphaning the file blocks?
I think the namenode tells the datanodes to delete orphaned blocks once it receives their reports on the blocks they hold and notices that some of those blocks don't belong to any file.
If you don't want these blocks to be deleted, you can put the system in safemode and try to manually look through the disks and copy the data. There is no automated way of doing this, but a tool to list orphaned blocks may be added in the future (as suggested in this JIRA).
Additionally, you can try to check the health of the namesystem using Hadoop's fsck.
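For example (run as the HDFS superuser; redirecting the fsck output to a file is just a convenience):

    hadoop dfsadmin -safemode enter                 # stop the namesystem from changing underneath you
    hadoop fsck / -files -blocks -locations > fsck-report.txt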

How to take a snapshot of the entire system on a MacBook Pro, OS X 10.8 Mountain Lion

I want to be able to take a snapshot of the current system and go back to it whenever I mess something up. I looked at the Time Machine solution, but realized that it's only a good solution when I know which file I am looking for. Sometimes, though, an installation process creates binary files in multiple system paths, which are very hard to locate and identify. Say I installed a package but then felt I shouldn't have done that; uninstallation might still leave files around. So what are some graceful ways to roll the machine back to a state where everything is nice and clean?
Use Disk Utility (Applications > Utilities).
Click on the HDD and then click on New Image. You can choose whether or not to compress the image. If you don't have much stuff on the drive it shouldn't be more than 30-40 GB. Once you have the .dmg file, stick it somewhere for backup purposes.
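If you prefer the command line, hdiutil can create the same kind of image (a sketch; the device identifier and output path will differ on your machine, and ideally you'd run it while booted from the Recovery partition or another disk so the source volume isn't in use):

    diskutil list
    sudo hdiutil create -srcdevice /dev/disk0s2 -format UDZO ~/Desktop/system-backup.dmg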
Also, create a recovery disk/stick with the recovery tool.
I dunno about "graceful", but Carbon Copy Cloner is definitely an easy solution for rolling back to a previous state. You can make an exact clone of your drive, then restore it if something goes horribly wrong. I use CCC to make periodic backups of my Macs, as a sort of secondary backup to Time Machine, which is easy to use but which I don't have total confidence in.
You can restore an entire system from a Time Machine snapshot, but it requires booting from the Recovery Partition or a Recovery disk. Basically, once you've rebooted in recovery mode, you can choose Restore From Time Machine Backup and then you'll be asked to locate the drive. Once you've done that, a list of Time Machine snapshots will be presented for restoring.
I haven't done this recently, but there are indications that the time of the backups may always be in PST, so be careful when looking at the times.
While OS X comes with Time Machine, it also has the well-known (in the Linux community!) command-line tool rsync.
With Google, I'm sure you can find many articles on how to use it; here's an interesting blog post on why its author uses rsync alongside Time Machine.
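A typical invocation looks something like this (a hedged sketch: Apple's bundled rsync is an older 2.6.x where -E copies extended attributes and resource forks, newer rsync builds use different flags, and the exclude list and destination here are only illustrative):

    sudo rsync -aE --delete --exclude /Volumes --exclude /dev / /Volumes/BackupDrive/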

Avoid updating last-accessed date/time when reading a file

We're building a Windows-based application that traverses a directory structure recursively, looking for files that meet certain criteria and then doing some processing on them. In order to decide whether or not to process a particular file, we have to open that file and read some of its contents.
This approach seems great in principle, but some customers testing an early version of the application have reported that it's changing the last-accessed time of large numbers of their files (not surprisingly, as it is in fact accessing the files). This is a problem for these customers because they have archive policies based on the last-accessed times of files (e.g. they archive files that have not been accessed in the past 12 months). Because our application is scheduled to run more frequently than the archive "window", we're effectively preventing any of these files from ever being archived.
We tried adding some code to save each file's last-accessed time before reading it, then write it back afterwards (hideous, I know) but that caused problems for another customer who was doing incremental backups based on a file system transaction log. Our explicit setting of the last-accessed time on files was causing those files to be included in every incremental backup, even though they hadn't actually changed.
So here's the question: is there any way whatsoever in a Windows environment that we can read a file without the last-accessed time being updated?
Thanks in advance!
EDIT: Despite the "ntfs" tag, we actually can't rely on the filesystem being NTFS. Many of our customers run our application over a network, so it could be just about anything on the other end.
The documentation indicates you can do this, though I've never tried it myself.
To preserve the existing last-access time for a file even after accessing it, call SetFileTime immediately after opening the file handle, with the lpLastAccessTime parameter's FILETIME structure members initialized to 0xFFFFFFFF.
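A minimal, untested sketch of that approach (the path is hypothetical; note that SetFileTime needs the handle to be opened with FILE_WRITE_ATTRIBUTES access):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* Open for reading, plus FILE_WRITE_ATTRIBUTES so SetFileTime is allowed. */
        HANDLE h = CreateFileW(L"C:\\data\\example.txt",
                               GENERIC_READ | FILE_WRITE_ATTRIBUTES,
                               FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE)
            return 1;

        /* 0xFFFFFFFF in both FILETIME members tells the file system to leave
           the last-access time alone for operations on this handle. */
        FILETIME keepAccessTime = { 0xFFFFFFFF, 0xFFFFFFFF };
        if (!SetFileTime(h, NULL, &keepAccessTime, NULL))
            fprintf(stderr, "SetFileTime failed: %lu\n", GetLastError());

        /* ... read the file as usual with ReadFile ... */

        CloseHandle(h);
        return 0;
    }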
From Vista onwards, NTFS does not update the last-access time by default; see http://technet.microsoft.com/en-us/library/cc959914.aspx for the NtfsDisableLastAccessUpdate registry setting that controls this.
Starting an NTFS transaction and rolling it back is very bad, and the performance will be terrible.
You can also disable the updates machine-wide with
fsutil behavior set disablelastaccess 1
(a value of 0 re-enables them).
I don't know what your clients' minimum requirements are, but have you tried NTFS transactions? On the desktop the first OS to support them was Vista, and on the server it was Windows Server 2008. They may be worth a look.
Start an NTFS transaction, read your file, roll back the transaction. Simple! :-) I actually don't know if it will roll back the last-access date, though. You will have to test that for yourself.
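If you want to experiment with it, here's a rough, untested sketch of the mechanics (Transactional NTFS lives in ktmw32.h/ktmw32.lib; the path is hypothetical, and Microsoft has since deprecated TxF, so treat this as a curiosity rather than a recommendation):

    #include <windows.h>
    #include <ktmw32.h>

    #pragma comment(lib, "ktmw32.lib")

    int main(void)
    {
        /* Create a kernel transaction. */
        HANDLE tx = CreateTransaction(NULL, NULL, 0, 0, 0, 0, L"read-only probe");
        if (tx == INVALID_HANDLE_VALUE)
            return 1;

        /* Open and read the file as part of the transaction. */
        HANDLE h = CreateFileTransactedW(L"C:\\data\\example.txt",
                                         GENERIC_READ, FILE_SHARE_READ, NULL,
                                         OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL,
                                         tx, NULL, NULL);
        if (h != INVALID_HANDLE_VALUE)
        {
            char buf[4096];
            DWORD read = 0;
            ReadFile(h, buf, sizeof(buf), &read, NULL);   /* inspect the contents */
            CloseHandle(h);
        }

        /* Throw the whole transaction away instead of committing it. */
        RollbackTransaction(tx);
        CloseHandle(tx);
        return 0;
    }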
Here is a link to an MSDN Magazine article on NTFS transactions, which includes other links: http://msdn.microsoft.com/en-us/magazine/cc163388.aspx
Hope it helps.

Change Journal for Blocks in Windows (NTFS)

I have written a backup tool that is able to backup files and images of volumes for Windows. To detect which files have changed I use the Windows Change Journal. I already use the shadow copy functionality to do a consistent copy of both the files and the volume images.
To detect which blocks have changed, I use hashes at the moment. This means the whole volume has to be read once (because to see which blocks have changed, hashes of all blocks have to be calculated).
The backup integrated into Windows 7 is able to create incremental volume images without checking all blocks. I wasn't able to find an API for any kind of block-level change journal.
Does anybody know how to access this information?
(I'm willing to dive deep into NTFS internals - even reading and parsing special files)
I don't think block-level change info is available anywhere. Most probably the Windows 7 integrated backup installs a file system filter driver, like some backup products and anti-virus software do. A filter driver can intercept all file system calls and in this way know which blocks changed. If you do this, you can basically build your own change journal that works at block level, but only for the files you are interested in.
I would really like to know a better answer myself here.
When you say Windows Change Journal, I take it you are referring to the NTFS USN journal? It looks very much like the Windows 7 backup uses a combination of VSC and the NTFS USN journal to detect changes and create incremental images, much like you are already doing.
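For what it's worth, querying the USN journal itself is straightforward (an untested sketch; the volume handle needs administrator rights, and the structures come from winioctl.h):

    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    int main(void)
    {
        /* Open the raw volume (requires administrator rights). */
        HANDLE vol = CreateFileW(L"\\\\.\\C:", GENERIC_READ,
                                 FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                                 OPEN_EXISTING, 0, NULL);
        if (vol == INVALID_HANDLE_VALUE)
            return 1;

        /* Ask NTFS for the journal's ID and current USN range. */
        USN_JOURNAL_DATA jd;
        DWORD bytes = 0;
        if (DeviceIoControl(vol, FSCTL_QUERY_USN_JOURNAL, NULL, 0,
                            &jd, sizeof(jd), &bytes, NULL))
        {
            printf("Journal ID: %llu\n", (unsigned long long)jd.UsnJournalID);
            printf("USN range : %lld .. %lld\n",
                   (long long)jd.FirstUsn, (long long)jd.NextUsn);
            /* FSCTL_READ_USN_JOURNAL then returns the per-file change records,
               but note these are file-level, not block-level, changes. */
        }
        CloseHandle(vol);
        return 0;
    }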

Unmovable Files on Windows XP

When I defragment my XP machine I notice that there is a block of "Unmovable Files". Is there a file attribute I can use to make my own files unmovable?
Just to clarify, I want a way to programmatically tell Windows that a file that I create should be unmovable. Is this possible, and if so, how can I do it?
Thanks,
Terry
A lot of system files cannot be moved after the system boots, such as the page file and registry database files.
This utility runs before Windows boots to defragment those files. I have it set to run at every boot, and it works well for me on several machines.
Note that the very first time you boot up with this utility set to run, it may take several minutes to defrag. After that first run though, it finishes in just 3 or 4 seconds.
Edit0: To respond to your clarification: that link says Windows has the page file and registry files open for exclusive access. So you should be able to do the same thing with the LockFile API call. However, that's not an attribute of the file itself; you'd have to actually run some background program that holds the file open for exclusive access.
There are no file attributes that you can place on your files to mark them as immovable. The only way (I think) to keep a file from being moved during defragmentation is to have some other process hold the file open (for read or write; I'm not even sure whether it needs to be opened in exclusive mode).
Quite frankly, I cannot think of a reason you'd want your files not to move, unless you have specific requirements about where on the disk platter your files reside. Defragmentation should generally lead to faster disk access, and that seems desirable in all cases :-)
This usually means that the file is in use by some process. If you're defragmenting, you'll likely see this with a lot of system files. If the file should legitimately be movable and is stuck (it's being held by a process that runs at startup but shouldn't be, for example), the most useful way of resolving the problem is to remove all permissions on the file, reboot, restore the permissions, and then get rid of the file/run the program that's trying to use it.
I suppose the ugly way is to have an application run at startup, check every few seconds whether defrag is running, and if so open the file in exclusive mode.
This is really ugly and I don't recommend it unless there is no cleaner solution.
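If you do go that route, the "exclusive mode" part is just a CreateFile call with a share mode of 0 (a rough, untested sketch; the file name is hypothetical, and whether this actually stops a given defragmenter from relocating the file is something you'd have to test):

    #include <windows.h>

    int main(void)
    {
        /* Share mode 0 = nobody else may open the file while we hold this handle. */
        HANDLE h = CreateFileW(L"C:\\data\\keep-in-place.dat",
                               GENERIC_READ,
                               0,               /* no sharing: exclusive access */
                               NULL, OPEN_EXISTING,
                               FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE)
            return 1;

        /* Keep the handle open for as long as the file should look "in use".
           A real tool would watch for the defragmenter instead of sleeping forever. */
        Sleep(INFINITE);

        CloseHandle(h);
        return 0;
    }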
Terry, the answers all discuss why files end up unmovable during defragmentation or how to defragment them anyway. From your question it appears that you actually want to make your personal files unmovable. Can you please clarify what is appealing about making your files unmovable?
I assume you're using the defragger that comes with Windows. Some commercial ones like DiskKeeper can move some of these files (usually system files). You can try their trial versions.
Contig might serve your purpose http://technet.microsoft.com/en-us/sysinternals/bb897428.aspx
I'm relatively certain I ran across some methods/attributes you could access programmatically to do exactly what you want. This was back in the NT4 days though, and my memory isn't that good.
For a little more complete solution, try Raxco's PerfectDisk. While it is a commercial product, it does a very good job and supports boot-time defrag of system files. The first defrag takes longer than, say, DiskKeeper, but it's a single-pass defragger and supports defragging with very little free space left on the drive. Overall it's a much smarter defrag program than any other I've seen and supports systems of any size.
http://www.raxco.com/
First try to move (or delete) the files in Safe Mode. If you can't, try to move (or delete) them from Linux.
But be careful: if those are Windows system files, you won't be able to boot Windows afterwards.
Some reasons why files are unmovable: the file is too big, the file is open or in use, insufficient security privileges, the file is being accessed by another computer, and many other things.
