What is the /tmp directory in Hadoop HDFS?

I have a cluster of 4 datanodes, and the HDFS structure on each node is as below.
I am facing a disk space issue: as you can see, the /tmp folder in HDFS occupies a lot of space (217 GB). So I tried to investigate the data in the /tmp folder and found the following temp files. I accessed these temp folders; each contains part files of 10 GB to 20 GB in size.
I want to clear this /tmp directory. Can anyone please let me know the consequences of deleting these tmp folders or part files? Will it affect my cluster?

The HDFS /tmp directory is mainly used as temporary storage during MapReduce operations. MapReduce artifacts, intermediate data, etc. are kept under this directory. These files are automatically cleared out when the MapReduce job finishes. If you delete these temporary files, it can affect currently running MapReduce jobs.
Temporary files are also created by Pig. Pig deletes its temporary files at the end of a script, but it does not handle their deletion if the script fails or is killed. In that case you have to handle the situation yourself, and it is best to do this cleanup inside the script itself.
The following article gives you a good understanding:
http://www.lopakalogic.com/articles/hadoop-articles/pig-keeps-temp-files/
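A minimal cleanup sketch for that failure case, assuming Pig's default scratch-directory naming (`/tmp/temp-<number>`) and a `hadoop` client on the PATH; both the pattern and the path are assumptions to verify on your own cluster before deleting anything:

```shell
# Hedged sketch: remove Pig's leftover temp dirs after a failed or killed run.
# The /tmp/temp-* pattern is an assumption based on Pig's default naming;
# confirm it matches only Pig scratch dirs before running the delete.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -rm -r -f '/tmp/temp-*'
  CLEANED=yes
else
  CLEANED=skipped   # no Hadoop client on this machine; nothing was deleted
fi
echo "pig temp cleanup: $CLEANED"
```

Running this as the last step of a wrapper script (after the `pig` invocation, regardless of its exit status) keeps /tmp from accumulating orphaned scratch data.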

Related

How can I append multiple files in HDFS to a single file in HDFS without the help of local file system?

I am learning Hadoop and came across a problem. I ran a MapReduce job and the output was stored in multiple files rather than a single file. I want to append all of them into a single file in HDFS. I know about the appendToFile and getmerge commands, but they only work between the local file system and HDFS (in either direction), not from HDFS to HDFS. Is there any way to append the output files in HDFS into a single file in HDFS without touching the local file system?
The only way to do this would be to force your MapReduce code to use one reducer, for example by sorting all the results by a single key.
However, this defeats the purpose of having a distributed filesystem and multiple processors. All Hadoop jobs should be able to read a directory of files, not be limited to processing a single file.
If you need a single file to download from HDFS, you should use getmerge.
There is no easy way to do this directly in HDFS, but the trick below works. It is not ideal, but it should be fine if the output is not huge. Note the "-" after -put, which tells it to read from stdin:
hadoop fs -cat source_folder_path/* | hadoop fs -put - target_filename
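If briefly touching the local filesystem is acceptable after all, a getmerge-then-put round trip also works; the paths below are hypothetical, and the `hadoop` commands are guarded so the sketch runs anywhere:

```shell
# Hedged alternative: merge via a temporary local file (does touch local disk).
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -getmerge /user/me/output /tmp/merged.txt   # HDFS part files -> one local file
  hadoop fs -put /tmp/merged.txt /user/me/merged.txt    # local file -> single HDFS file
  rm -f /tmp/merged.txt                                 # clean up the local copy
  DEMO=ran
else
  DEMO=skipped   # no Hadoop client available; nothing was copied
fi
echo "getmerge roundtrip: $DEMO"
```

The cat/put pipe streams through the client without writing a local file, whereas this variant needs local disk space equal to the merged output, so pick based on output size.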

Combine map output for a directory into one file

I have a requirement where I have to merge the output of the mappers of a directory into a single file. Let's say I have a directory A which contains 3 files.
../A/1.txt
../A/2.txt
../A/3.txt
I need to run a mapper to process these files, which should generate one output file. I know a reducer will do that, but I don't want to use reducer logic.
OR
Can I have only one mapper to process all the files under a directory?
If you set up FUSE to mount your HDFS to a local directory, then your output can be written to the mounted filesystem.
For example, I have our HDFS mounted at /mnt/hdfs locally. I run the following command and it works great:
hadoop fs -getmerge /reports/some_output /mnt/hdfs/reports/some_output.txt
Of course, there are other reasons to use FUSE to mount HDFS to a local directory, but this was a nice side effect for us.
Can I have only one mapper to process all the files under a directory?
Have you looked into CombineFileInputFormat? Felix Ren-Chyan Chern writes about setting it up in some detail.
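As a hedged sketch, a Hadoop Streaming invocation can pack the small files under /A into a few splits via a combine-style input format; the jar path, class name, and split size below are assumptions to check against your Hadoop version:

```shell
# Hedged sketch: a streaming job over /A using CombineTextInputFormat so
# many small files are grouped into fewer splits (and hence fewer mappers).
if command -v hadoop >/dev/null 2>&1; then
  hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -D mapreduce.input.fileinputformat.split.maxsize=134217728 \
    -inputformat org.apache.hadoop.mapred.lib.CombineTextInputFormat \
    -input /A -output /A_out \
    -mapper cat -reducer NONE
  JOB=submitted
else
  JOB=skipped   # no Hadoop client on this machine; nothing was submitted
fi
echo "combine-input demo: $JOB"
```

With a large enough split.maxsize, all three files can land in a single split and therefore a single mapper, which answers the "one mapper for the whole directory" variant of the question.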

Is there a tool to continuously copy contents of a directory to HDFS as they are?

I tried using Flume's directory spooler source and HDFS sink. But this does not serve my purpose because the files are read by Flume and then written to HDFS as part files which can be rolled by size/time (please correct me if I've got this wrong). Is there a tool that continuously does something like an HDFS put on all files that are dumped into the spool directory?
If I understood your question correctly, you have a directory into which files keep arriving, and you want to move those files to HDFS as-is, without reading them. HDFS copyFromLocal will solve your issue; you just need some logic that returns the recent files in the directory and runs copyFromLocal to copy each of them into HDFS.
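A minimal sketch of that logic, using a marker file to remember the last sync point; the spool path and HDFS target are hypothetical, and the actual copyFromLocal call is left as a comment so the sketch runs anywhere:

```shell
# Sketch: pick up files that landed in a spool directory since the last run.
SPOOL_DIR=/tmp/spool_demo
MARKER="$SPOOL_DIR/.last_sync"
mkdir -p "$SPOOL_DIR"
echo "sample event" > "$SPOOL_DIR/event1.log"   # simulate a file arriving

if [ -f "$MARKER" ]; then
  # Only files newer than the marker, i.e. arrived since the previous sync.
  NEW_FILES=$(find "$SPOOL_DIR" -maxdepth 1 -type f ! -name '.last_sync' -newer "$MARKER")
else
  NEW_FILES=$(find "$SPOOL_DIR" -maxdepth 1 -type f ! -name '.last_sync')
fi

for f in $NEW_FILES; do
  # On a real cluster, replace the echo with (target path hypothetical):
  #   hadoop fs -copyFromLocal "$f" /landing/
  echo "would copy: $f"
done
touch "$MARKER"   # remember this sync point for the next run
```

Run it from cron (or a loop with sleep) to approximate a continuous put; unlike the Flume sink, each source file becomes exactly one HDFS file with no rolling.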

Clearing /tmp directory in hdfs

Is it correct that the /tmp directory in HDFS is automatically cleared every 24 hours (by default)?
The HDFS /tmp directory is mainly used as temporary storage during MapReduce operations. MapReduce artifacts, intermediate data, etc. are kept under this directory. These files should be automatically cleared out when the MapReduce job finishes. If you delete these temporary files, it can affect currently running MapReduce jobs.
Temporary files are also created by Pig. Pig deletes its temporary files at the end of a script, but it does not handle their deletion if the script fails or is killed. In that case you have to handle the situation yourself, and it is best to do this cleanup inside the script itself.

Why is the Hadoop HDFS -rmr command super fast?

In one of my folders on HDFS, I have about 37 gigabytes of data:
hadoop fs -dus my-folder-name
When I execute
hadoop fs -rmr my-folder-name
the command executes in a flash. However, on non-distributed file systems, an rm -rf would take much longer for a similarly sized directory.
Why is there so much of a difference? I have a 2 node cluster.
The fact is that when you issue hadoop fs -rmr, Hadoop moves the files to the .Trash folder under your home directory on HDFS. Under the hood, I believe it is just a metadata change in the namenode recording the files' new location on HDFS. This is why it is very fast.
Usually in an OS, a delete command removes the associated metadata, not the actual data, which is why it is fast. The same is the case with HDFS: the blocks might still be on the datanodes, but all references to them are removed. Note that the delete command does eventually free up the space, though.
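A hedged sketch of those trash mechanics (the folder names are hypothetical); -expunge and -skipTrash are the knobs that actually free the space sooner, and the commands are guarded so the sketch runs anywhere:

```shell
# Hedged sketch: rm moves data into trash; expunge/skipTrash free blocks sooner.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -rm -r my-folder-name          # fast: metadata move into .Trash
  hadoop fs -ls .Trash/Current             # deleted files wait here until checkpointed
  hadoop fs -expunge                       # empty the trash now, freeing the blocks
  # hadoop fs -rm -r -skipTrash other-dir  # bypass trash entirely (irreversible)
  TRASH_DEMO=ran
else
  TRASH_DEMO=skipped   # no Hadoop client on this machine; nothing was deleted
fi
echo "trash demo: $TRASH_DEMO"
```

So the -rmr itself is always a quick namenode metadata operation; the block deletions on the datanodes happen later, when the trash checkpoint expires or is expunged.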
