In one of my folders on HDFS, I have about 37 gigabytes of data:
hadoop fs -dus my-folder-name
When I execute a
hadoop fs -rmr my-folder-name
the command executes in a flash. However, on non-distributed file systems, an rm -rf would take much longer for a similarly sized directory.
Why is there so much of a difference? I have a 2-node cluster.
The fact is that when you issue hadoop fs -rmr, Hadoop moves the files to the .Trash folder under your home directory on HDFS. Under the hood, I believe it is just a record change in the NameNode that updates the files' location in HDFS, which is why it is so fast.
Usually in an OS, a delete command removes the associated metadata rather than the actual data, which is why it is fast. The same is the case with HDFS: the blocks might still be on the DataNodes, but all the references to them are removed. Note that the delete command does free up the space, though.
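If you also want the space back immediately rather than waiting for the trash checkpoint interval, a minimal sketch (reusing the folder name from the question) is to bypass the trash or empty it explicitly:
hadoop fs -rm -r -skipTrash my-folder-name
hadoop fs -expunge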
I am learning Hadoop and I have run into a problem. I ran a MapReduce job and the output was stored in multiple files rather than a single file. I want to append all of them into a single file in HDFS. I know about the appendToFile and getmerge commands, but they only work from the local file system to HDFS or from HDFS to the local file system, not from HDFS to HDFS. Is there any way to append the output files in HDFS into a single file in HDFS without touching the local file system?
The only way to do this would be to force your MapReduce code to use one reducer, for example by sorting all the results by a single key.
However, this defeats the purpose of having a distributed filesystem and multiple processors. All Hadoop jobs should be able to read a directory of files rather than be restricted to processing a single file.
If you need a single file to download from HDFS, then you should use getmerge.
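As a hedged example (the paths here are placeholders, not from the question), getmerge concatenates all the part files of a job into one file on the local filesystem:
hadoop fs -getmerge /user/me/job-output /tmp/job-output-merged.txt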
There is no easy way to do this directly in HDFS, but the trick below works. It is not an ideal solution, but it should work if the output is not huge: stream the part files through the client and write them back with -put, where - tells -put to read from standard input.
hadoop fs -cat source_folder_path/* | hadoop fs -put - target_filename
I have a directory which contains multiple folders with N files in each; a single file can be 15 GB in size. I don't know what the best way is to copy/move the files from local to HDFS.
There are many ways to do this (using traditional methods), for example:
hdfs dfs -put /path/to/localdir/ hdfs://path/to/hdfsdir
hdfs dfs -copyFromLocal /path/to/localdir/ hdfs://path/to/hdfsdir
hdfs dfs -moveFromLocal /path/to/localdir/ hdfs://path/to/hdfsdir
hadoop distcp file:///path/to/localdir/ hdfs://namenode:port/path/to/hdfsdir
Option 1 & 2 are same in your case. There will not be any difference in copy time.
Option 3 might take some more time as it copies the data to HDFS filesystem (same as -put) and then deletes the file from the local filesystem.
Option 4 is a tricky one. It is designed for large inter/intra-cluster copying. But you can use the same command for local files also by providing local file URL with "file://" prefix. It is not the optimal solution w.r.t distcp as the tool is designed to work in parallel (using MapReduce) and as the file is on local, it cannot make use of its strength. (You can try by creating a mount on the cluster nodes which might increase the performance of distcp)
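For completeness, here is a hedged sketch of option 4 (the NameNode host, port, and paths are placeholders) that also caps the number of map tasks distcp launches:
hadoop distcp -m 4 file:///path/to/localdir/ hdfs://namenode:8020/path/to/hdfsdir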
I have created a file in HDFS using the command below:
hdfs dfs -touchz /hadoop/dir1/file1.txt
I can see the created file by using the command below:
hdfs dfs -ls /hadoop/dir1/
But I could not find the file's actual location using Linux commands (find or locate). I searched on the internet and found the following link:
How to access files in Hadoop HDFS? It says HDFS is virtual storage. In that case, how does it decide which partition to use and how much of it, and where is the metadata stored?
Is it using the datanode location that I specified in hdfs-site.xml as the virtual storage for all the data?
I looked into the datanode location and there are files available, but I could not find anything related to the file or folder I created.
(I am using Hadoop 2.6.0.)
HDFS is a distributed storage system in which the storage location is virtual and built from the disk space of all the DataNodes. While installing Hadoop, you must have specified paths for dfs.namenode.name.dir and dfs.datanode.data.dir. These are the locations at which all the HDFS-related files are stored on the individual nodes.
When data is stored in HDFS, it is stored as blocks of a specified size (128 MB by default in Hadoop 2.x). When you use hdfs dfs commands you see the complete files, but internally HDFS stores these files as blocks. If you check the above-mentioned paths on your local file system, you will see a bunch of files which correspond to the files on your HDFS. But again, you will not see them as the actual files, as they are split into blocks.
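If you want to see which blocks a particular HDFS file maps to and on which DataNodes they live, you can ask the NameNode directly; for example, for the file created in the question above:
hdfs fsck /hadoop/dir1/file1.txt -files -blocks -locations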
Check the output of the commands below to get more details on how much space from each DataNode is used to create the virtual HDFS storage.
hdfs dfsadmin -report #Or
sudo -u hdfs hdfs dfsadmin -report
HTH
Just as you create a file in the local file system (LFS) by first creating a directory and making a file inside it, for example:
$ mkdir MITHUN90
$ cd MITHUN90
$ nano file1.log
you then create a directory in HDFS, for example: hdfs dfs -mkdir /mike90. Here "mike90" is the directory name. After creating the directory, send files from the LFS to HDFS by using this command:
$ hdfs dfs -copyFromLocal /home/gopalkrishna/file1.log /mike90
Here '/home/gopalkrishna/file1.log' is the path of the file in the present working directory and '/mike90' is the directory in HDFS. By running $ hdfs dfs -ls /mike90 you can see the list of files in it.
I have 300,000+ files in an HDFS data directory.
When I do a hadoop fs -ls, I get an out of memory error saying the GC overhead limit has been exceeded. The cluster nodes have 256 GB of RAM each. How do I fix it?
You can make more memory available to the hdfs command by setting HADOOP_CLIENT_OPTS:
HADOOP_CLIENT_OPTS="-Xmx4g" hdfs dfs -ls /
Found here: http://lecluster.delaurent.com/hdfs-ls-and-out-of-memory-gc-overhead-limit/
This fixed the problem for me; I had over 400k files in one directory and needed to delete most, but not all, of them.
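If you want the larger client heap to apply to every invocation rather than a single command, one hedged option (assuming a standard Hadoop installation layout) is to export the variable in etc/hadoop/hadoop-env.sh:
export HADOOP_CLIENT_OPTS="-Xmx4g"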
Write a Python script to split the files into multiple directories and run through them. First of all, what are you trying to achieve when you know you have 300,000+ files in a directory? If you want to concatenate them, it is better to arrange them into subdirectories first (one possible approach is sketched below).
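As a minimal sketch (the /data path and the part-* file naming are assumptions, not from the question), the files could be bucketed into subdirectories by the last digit of their name:
# run with the increased HADOOP_CLIENT_OPTS from above, since the globs are expanded client-side
for i in 0 1 2 3 4 5 6 7 8 9; do
  hdfs dfs -mkdir -p /data/bucket_$i
  hdfs dfs -mv "/data/part-*$i" "/data/bucket_$i/"
done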
I have a cluster of 4 datanodes, and the HDFS structure on each node is as below.
I am facing a disk space issue: as you can see, the /tmp folder on HDFS has occupied the most space (217 GB). So I tried to investigate the data in the /tmp folder and found the following temp files. I accessed these temp folders; each contains some part files of 10 GB to 20 GB in size.
I want to clear this /tmp directory. Can anyone please let me know the consequences of deleting these tmp folders or part files? Will it affect my cluster?
The HDFS /tmp directory is mainly used as temporary storage during MapReduce operations. MapReduce artifacts, intermediate data, etc. are kept under this directory. These files are cleared out automatically when the MapReduce job execution completes. If you delete these temporary files, it can affect currently running MapReduce jobs.
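Before deleting anything, it helps to see which subdirectories are actually holding the 217 GB and then remove only the ones you are sure no running job still needs; a hedged sketch (the subdirectory name is a placeholder):
hdfs dfs -du -h /tmp
hdfs dfs -rm -r -skipTrash /tmp/temp1234567890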
Temporary files are also created by Pig, and their deletion normally happens at the end of the script. However, Pig does not handle temporary file deletion if the script execution fails or is killed; then you have to handle the situation yourself. It is better to handle this temporary file clean-up activity in the script itself.
The following article gives you a good understanding:
http://www.lopakalogic.com/articles/hadoop-articles/pig-keeps-temp-files/