Is it correct that the /tmp directory in HDFS is automatically cleared every 24 hours (by default)?
The HDFS /tmp directory is mainly used as temporary storage during MapReduce operations. MapReduce artifacts, intermediate data, etc. are kept under this directory. These files should be cleared out automatically when the MapReduce job finishes. If you delete these temporary files, it can affect currently running MapReduce jobs.
Temporary files are also created by Pig, and they are deleted at the end of the run. Pig does not handle temporary-file deletion if the script execution fails or is killed; you have to handle that situation yourself. It is best to handle this cleanup in the script itself.
I have a requirement where I have to merge the output of the mappers for a directory into a single file. Let's say I have a directory A which contains 3 files.
../A/1.txt
../A/2.txt
../A/3.txt
I need to run a mapper to process these files, which should generate one output file. I know a reducer will do that, but I don't want to use reducer logic.
OR
Can I have only one mapper process all the files under a directory?
If you set up FUSE to mount your HDFS to a local directory, then your output can be written to the mounted filesystem.
For example, I have our HDFS mounted to /mnt/hdfs locally. I run the following command and it works great:
hadoop fs -getmerge /reports/some_output /mnt/hdfs/reports/some_output.txt
Of course, there are other reasons to use FUSE to mount HDFS to a local directory, but this was a nice side effect for us.
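If mounting with FUSE is not an option, the same merge can also be done programmatically. Here is a rough sketch using FileUtil.copyMerge, which was available through Hadoop 2.x (it was dropped in later releases); the paths are placeholders, not taken from the answer above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeToSingleFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Concatenate every file under the source directory into one target
        // file, similar to `hadoop fs -getmerge` but keeping the result on HDFS.
        FileUtil.copyMerge(
                fs, new Path("/reports/some_output"),      // source directory (placeholder)
                fs, new Path("/reports/some_output.txt"),  // destination file (placeholder)
                false,                                     // do not delete the source files
                conf,
                null);                                     // no separator string between files
    }
}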
Can I have only one mapper process all the files under a directory?
Have you looked into CombineFileInputFormat? Felix Ren-Chyan Chern writes about setting it up in some detail.
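For plain text input, the stock CombineTextInputFormat subclass may already be enough. Below is a rough, hypothetical sketch (not from the article above): a map-only job where the maximum split size is set larger than the directory, so all files land in one split, one mapper runs, and a single part-m-00000 file comes out. Class names and paths are illustrative only.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SingleMapperMerge {

    // Pass-through mapper: writes every input line unchanged.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(line, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-into-one-mapper");
        job.setJarByClass(SingleMapperMerge.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0);                                // map-only job
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Pack all input files into splits of at most 1 GB; if the directory
        // (e.g. A with 1.txt, 2.txt, 3.txt) is smaller, you get a single mapper.
        CombineTextInputFormat.setMaxInputSplitSize(job, 1024L * 1024 * 1024);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /A
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}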
We have a full HDFS backup using distcp that takes a long time to run. Some of the data on HDFS is "moving", that is, it is created and deleted while the copy runs. This results in mappers failing with java.io.FileNotFoundException: No such file or directory. Such files are unimportant; we just want the backup to do the best it can.
Now it seems that -i ("ignore failures") is not quite what we want, because it ignores at the map level rather than the file level: if a map task fails, all files associated with that map task are skipped. What we want is for just the missing file to be ignored.
I am looking for a way to push log data from a read-only folder to HDFS using Flume. As far as I know, Flume's spoolDir needs write access to rename a file once it has been consumed, so I wanted to create a temp folder, use rsync to copy files into it, and then use that folder as the spoolDir.
But, as far as I know, once Flume renames a file in the destination folder (myfile.COMPLETED), the next rsync run will copy it again, right?
Any other solution?
An alternative source is the ExecSource. You can run a tail command on a single read-only file and start processing the data. Nevertheless, you must take into account that this is an unreliable source, since there is no way to recover from an error while putting the data into the agent channel.
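A rough sketch of what such an agent definition could look like in the Flume properties file; the agent name, log path, HDFS path, and channel sizing below are all made up for illustration:

# Hypothetical agent "a1": tail a single read-only log file into HDFS
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# ExecSource: run tail -F against the read-only file
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/readonly/app.log
a1.sources.r1.channels = c1

# In-memory channel (data is lost if the agent dies, matching the
# reliability caveat above)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# HDFS sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/app-logs
a1.sinks.k1.hdfs.fileType = DataStream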
I have a cluster of 4 datanodes, and the HDFS structure on each node is as below.
I am facing a disk space issue: as you can see, the /tmp folder in HDFS has taken up a lot of space (217 GB). So I tried to investigate the data in the /tmp folder and found the following temp files. I looked into these temp folders; each contains some part files of 10 GB to 20 GB in size.
I want to clear this /tmp directory. Can anyone please tell me the consequences of deleting these temp folders or part files? Will it affect my cluster?
The HDFS /tmp directory is mainly used as temporary storage during MapReduce operations. MapReduce artifacts, intermediate data, etc. are kept under this directory. These files will be cleared out automatically when the MapReduce job finishes. If you delete these temporary files, it can affect currently running MapReduce jobs.
Temporary files are also created by Pig, and they are deleted at the end of the run. Pig does not handle temporary-file deletion if the script execution fails or is killed; you have to handle that situation yourself. It is best to handle this cleanup in the script itself.
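One possible way to do that cleanup outside the Pig script itself is a small HDFS utility run at the end of the workflow. This is only a sketch: it assumes the leftover directories sit under /tmp and match a temp* pattern, so verify the naming your Pig version actually uses, and make sure no job is still running before deleting anything.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PigTempCleanup {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Glob for leftover temp directories; the pattern is an assumption.
        FileStatus[] leftovers = fs.globStatus(new Path("/tmp/temp*"));
        if (leftovers == null) {
            return;   // nothing matched
        }
        for (FileStatus status : leftovers) {
            System.out.println("Removing " + status.getPath());
            fs.delete(status.getPath(), true);   // recursive delete
        }
    }
}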
The following article gives you a good understanding:
http://www.lopakalogic.com/articles/hadoop-articles/pig-keeps-temp-files/
In one of my folders on HDFS, I have about 37 gigabytes of data:
hadoop fs -dus my-folder-name
When I execute a
hadoop fs -rmr my-folder-name
the command executes in a flash. However, on non-distributed file systems, an rm -rf would take much longer for a similarly sized directory.
Why is there so much of a difference? I have a 2-node cluster.
The fact is that when you issue hadoop fs -rmr, Hadoop moves the files to the .Trash folder under your home directory on HDFS. Under the hood, I believe it's just a record change in the NameNode to move the files' location on HDFS. That is why it's so fast.
Usually in an OS, a delete command removes the associated metadata, not the actual data, which is why it is fast. The same is the case with HDFS: the blocks might still be on the DataNodes, but all references to them are removed. Note that the delete does eventually free up the space, though.
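To make that concrete, here is a small sketch against the Java FileSystem API (the path is a placeholder): with trash enabled, the "delete" is really a rename into .Trash, and even a real delete is only a NameNode metadata change, with the DataNode blocks reclaimed asynchronously afterwards.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class DeleteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path folder = new Path("/user/me/my-folder-name");   // placeholder path

        // Roughly what `hadoop fs -rmr` does when trash is enabled:
        // a rename into .Trash under the home directory, i.e. a metadata change.
        boolean movedToTrash = Trash.moveToAppropriateTrash(fs, folder, conf);

        if (!movedToTrash) {
            // Equivalent of -skipTrash: unlink the metadata immediately;
            // the blocks on the DataNodes are cleaned up afterwards.
            fs.delete(folder, true);
        }
    }
}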