hadoop fs -ls out of memory error - hadoop

I have 300,000+ files in an HDFS data directory.
When I do a hadoop fs -ls on it, I get an out of memory error saying the GC overhead limit has been exceeded. The cluster nodes have 256 GB of RAM each. How do I fix it?

You can make more memory available to the hdfs command by setting the HADOOP_CLIENT_OPTS environment variable:
HADOOP_CLIENT_OPTS="-Xmx4g" hdfs dfs -ls /
Found here: http://lecluster.delaurent.com/hdfs-ls-and-out-of-memory-gc-overhead-limit/
This fixed the problem for me; I had over 400k files in one directory and needed to delete most, but not all, of them.
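If you need the larger client heap for more than a one-off command, a couple of variations (a sketch; the 4g value, the example path, and the hadoop-env.sh location are assumptions for a typical install, adjust to your distro):
# export it for the whole shell session
export HADOOP_CLIENT_OPTS="-Xmx4g"
hdfs dfs -ls /path/with/many/files
# or set it persistently in $HADOOP_CONF_DIR/hadoop-env.sh
# export HADOOP_CLIENT_OPTS="-Xmx4g"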

Write a script (Python, or plain shell as sketched below) to split the files into multiple directories and run through them. First of all, what are you trying to achieve when you know you have 300,000+ files in a directory? If you want to concatenate them, it would be better to arrange them into sub-directories first.
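For the rearranging itself, here is a rough shell sketch of the same idea (the directory names and the bucketing scheme are made up for illustration; the glob is quoted so it is expanded by hdfs dfs rather than the local shell — try it on a small test directory first):
SRC=/data/big-dir          # hypothetical directory holding the 300,000+ files
DEST=/data/big-dir-split   # hypothetical destination parent
for c in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
  hdfs dfs -mkdir -p "$DEST/$c"
  hdfs dfs -mv "$SRC/$c*" "$DEST/$c/"   # moves every file whose name starts with $c
done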

Related

What is the best way of loading huge size files from local to hdfs

I have a directory that contains multiple folders, with N files in each. A single file can be around 15 GB. I don't know what the best way is to copy/move the files from local to HDFS.
There are many (traditional) ways to do this, for example:
hdfs dfs -put /path/to/localdir/ hdfs://path/to/hdfsdir
hdfs dfs -copyFromLocal /path/to/localdir/ hdfs://path/to/hdfsdir
hdfs dfs -moveFromLocal /path/to/localdir/ hdfs://path/to/hdfsdir
hadoop distcp file:///path/to/localdir/ hdfs://namenode:port/path/to/hdfsdir
Options 1 and 2 are the same in your case; there will not be any difference in copy time.
Option 3 might take some more time, as it copies the data to the HDFS filesystem (same as -put) and then deletes the file from the local filesystem.
Option 4 is a tricky one. It is designed for large inter/intra-cluster copying, but you can use the same command for local files as well by giving the local path the "file://" prefix. It is not the optimal use of distcp, because the tool is designed to work in parallel (using MapReduce) and, with the file on a single local filesystem, it cannot make use of that strength. (You could try creating a mount on the cluster nodes, which might increase the performance of distcp.)
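A hedged example of that last option, assuming the local directory is mounted at the same path on the worker nodes (the mount point, NameNode port, and map count below are placeholders):
# -m sets the number of parallel map tasks doing the copy
hadoop distcp -m 20 file:///mnt/shared/localdir hdfs://namenode:8020/path/to/hdfsdir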

How files or directories are getting stored in hadoop hdfs

I have created a file in hdfs using below command
hdfs dfs -touchz /hadoop/dir1/file1.txt
I could see the created file by using below command
hdfs dfs -ls /hadoop/dir1/
But I could not find the location itself using Linux commands (find or locate). I searched on the internet and found the following link:
How to access files in Hadoop HDFS? It says HDFS is virtual storage. In that case, how does it decide which partition to use and how much of it, and where is the metadata stored?
Does it use the DataNode location that I configured in hdfs-site.xml as the virtual storage for all the data?
I looked into the DataNode location and there are files there, but I could not find anything related to the file or folder I created.
(I am using hadoop 2.6.0)
The HDFS file system is a distributed storage system in which the storage location is virtual and created using the disk space from all the DataNodes. While installing Hadoop, you must have specified paths for dfs.namenode.name.dir and dfs.datanode.data.dir. These are the locations at which all the HDFS-related files are stored on the individual nodes.
While storing data on HDFS, it is stored as blocks of a specified size (default 128 MB in Hadoop 2.x). When you use hdfs dfs commands you see complete files, but internally HDFS stores these files as blocks. If you check the above-mentioned paths on your local file system, you will see a bunch of files which correspond to the files on your HDFS. But again, you will not see them as actual files, since they are split into blocks.
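For example, you can peek at the raw blocks on a DataNode (the data directory below is a placeholder and the sub-directory layout is the usual Hadoop 2.x one; substitute your own dfs.datanode.data.dir value):
# blk_<id> files hold the block data, blk_<id>_<genstamp>.meta files hold the checksums
ls /hadoop/hdfs/data/current/BP-*/current/finalized/subdir0/subdir0/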
Check the output of the commands below for more details on how much space from each DataNode is used to build the virtual HDFS storage.
hdfs dfsadmin -report #Or
sudo -u hdfs hdfs dfsadmin -report
HTH
It is the same as creating a file in the local file system (LFS). For example, create a directory there, $ mkdir MITHUN90, enter it (in the LFS) with cd MITHUN90, and create a new file in it with $ nano file1.log.
Now create a directory in HDFS, for example: hdfs dfs -mkdir /mike90. Here "mike90" refers to the directory name. After creating the directory, send files from the LFS to HDFS using this command:
$ hdfs dfs -copyFromLocal /home/gopalkrishna/file1.log /mike90
Here '/home/gopalkrishna/file1.log' refers to the file in the present working directory (pwd) and '/mike90' refers to the directory in HDFS. Running $ hdfs dfs -ls /mike90 then lists the files in it.

HDFS Block Split

My Hadoop knowledge is 4 weeks old. I am using a sandbox with Hadoop.
According to the theory, when a file is copied into the HDFS file system, it is split into 128 MB blocks. Each block is then stored on a data node and replicated to other data nodes.
Question:
When I copy a data file (~500 MB) from the local file system into HDFS (put command), the entire file is still shown in HDFS (-ls command). I was expecting to see 128 MB blocks. What am I doing wrong here?
Supposing the file is split and distributed in HDFS, is there a way to combine the blocks and retrieve the original file back to the local file system?
You won't see the individual blocks from the -ls command; these are the logical equivalent of blocks on a hard drive not showing up in Linux's ls or Windows Explorer. You can inspect a file's blocks on the command line with hdfs fsck /user/me/someFile.avro -files -blocks -locations, or use the NameNode UI to see which hosts have the blocks for a file and on which hosts each block is replicated.
Sure. You'd just do something like hdfs dfs -get /user/me/someFile.avro or download the file using HUE or the NameNode UI. All these options will stream the appropriate blocks to you to assemble the logical file back together.
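Pulled out of the prose above as plain commands (the path is just the example used in the answer):
# list the file's blocks, their sizes, and the DataNodes holding each replica
hdfs fsck /user/me/someFile.avro -files -blocks -locations
# reassemble the blocks and download the logical file to the local working directory
hdfs dfs -get /user/me/someFile.avro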

How to find out which application consume the most space on hadoop?

My Hadoop cluster shows it has less than 20% disk space left. I am using this command to see the disk space:
hdfs dfsadmin -report
However, I don't know which directories/files take up the most space. Is there a way to find out?
Use the following command:
hdfs dfs -du /
It displays the sizes of the files and directories contained in the given directory, or the length of the file if the path is just a file.
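A couple of variations that can help with ranking (the -h flag is standard; the sort pipeline assumes the byte count is the first column of the -du output, which can vary slightly between Hadoop versions, and /some/dir is a placeholder):
# human-readable sizes, one line per child of /
hdfs dfs -du -h /
# rough top-10 of the largest children of a directory
hdfs dfs -du /some/dir | sort -nr | head -n 10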

Why is the Hadoop HDFS -rmr command super fast

In one of my folders on HDFS, I have about 37 gigabytes of data:
hadoop fs -dus my-folder-name
When I execute a
hadoop fs -rmr my-folder-name
the command executes in a flash. However, on non-distributed file systems, an rm -rf would take much longer for a similarly sized directory.
Why is there so much of a difference? I have a 2-node cluster.
The fact is that when you issue hadoop fs -rmr, Hadoop moves the files to the .Trash folder under your home directory on HDFS. Under the hood I believe it's just a record change in the NameNode to move the files' location on HDFS. This is the reason why it's very fast.
Usually in an OS, a delete command removes the associated metadata and not the actual data, which is why it is fast. The same is the case with HDFS: the blocks may still be on the DataNodes, but all the references to them are removed. Note that the space is only actually freed once the blocks are deleted, for example after the file is purged from the trash.
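If the goal is to reclaim the space right away rather than after the trash interval, a couple of related commands (my-folder-name is the directory from the question):
# bypass the trash entirely -- the blocks are scheduled for deletion immediately
hadoop fs -rm -r -skipTrash my-folder-name
# or empty your own .Trash checkpoints
hdfs dfs -expunge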