Is there a way to iterate through the HDFS filesystem in Unix to search for a huge number of small files? - shell

I need to iterate through an HDFS filesystem, where my service account owns the files, and find the directories/folders that contain more than a lakh (100,000) small files.
The problem I am facing is with the depth of these folders, for example:
Path 1 : hdfs dfs edx/home/krn/zxy/
Path 2 : hdfs dfs edx/home/nzy/
In Path 1 the small files are inside zxy, which means the depth is 3, whereas for Path 2 the small files are inside nzy, which is at depth 2.
I need help writing a bash/shell script where the depth is dynamic, so that I get all the locations/paths that contain more than 100k small files.
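One possible approach, sketched below: recursively list everything under a base path, keep only the small files, and count them per parent directory, which makes the depth irrelevant. The base path, the 1 MB "small file" cutoff, and the field positions of the hdfs dfs -ls -R output are assumptions to adjust for your cluster, and the count is of files sitting directly inside each directory:

#!/bin/bash
# 1) recursively list everything under BASE_PATH
# 2) keep only regular files (lines starting with "-") smaller than SIZE_LIMIT, print their paths
# 3) strip the file name to get each file's parent directory
# 4) count files per directory and print the directories holding more than THRESHOLD of them
BASE_PATH="/edx/home"   # hypothetical starting point
SIZE_LIMIT=1048576      # assumption: a "small file" is anything under 1 MB
THRESHOLD=100000

hdfs dfs -ls -R "$BASE_PATH" 2>/dev/null |
  awk -v limit="$SIZE_LIMIT" '$1 ~ /^-/ && $5 + 0 < limit {print $8}' |
  sed 's|/[^/]*$||' |
  sort | uniq -c |
  awk -v t="$THRESHOLD" '$1 > t {print $1, $2}'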

Related

Merging small files into single file in hdfs

In an HDFS cluster, I receive multiple files on a daily basis, which can be of 3 types:
1) product_info_timestamp
2) user_info_timestamp
3) user_activity_timestamp
The number of files received can be any number, but they will belong to one of these 3 categories only.
I want to merge all the files (after checking whether they are less than 100 MB) belonging to one category into a single file.
For example, 3 files named product_info_* should be merged into one file named product_info.
How do I achieve this?
You can use getmerge to achieve this, but the result will be stored on your local node (edge node), so you need to be sure you have enough space there.
hadoop fs -getmerge /hdfs_path/product_info_* /local_path/product_info
You can move the result back to HDFS with put:
hadoop fs -put /local_path/product_info /hdfs_path
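If you want to do this for all three categories in one pass, a small loop along these lines may help (a sketch reusing the /hdfs_path and /local_path placeholders above; the < 100 MB pre-check from the question is only hinted at in a comment, since how you apply it per file is up to you):

for category in product_info user_info user_activity; do
  # optional: inspect sizes first, e.g. hadoop fs -du -h /hdfs_path/${category}_*
  hadoop fs -getmerge /hdfs_path/${category}_* /local_path/${category}
  hadoop fs -put /local_path/${category} /hdfs_path/${category}
done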
You can use hadoop archive (.har file) or sequence file. It is very simple to use - just google "hadoop archive" or "sequence file".
Another set of commands along similar lines, as suggested by @SCouto:
hdfs dfs -cat /hdfs_path/product_info_* > /local_path/product_info_combined.txt
hdfs dfs -put /local_path/product_info_combined.txt /hdfs_path/

Concatenating multiple text files into one very large file in HDFS

I have multiple text files.
Their total size exceeds the largest disk size available to me (~1.5 TB).
A Spark program reads a single input text file from HDFS, so I need to combine those files into one. (I cannot rewrite the program code; I am given only the *.jar file for execution.)
Does HDFS have such a capability? How can I achieve this?
What I understood from your question is that you want to concatenate multiple files into one. Here is a solution which might not be the most efficient way of doing it, but it works. Suppose you have two files, file1 and file2, and you want to get a combined file called ConcatenatedFile. Here is the script for that.
hadoop fs -cat /hadoop/path/to/file/file1.txt /hadoop/path/to/file/file2.txt | hadoop fs -put - /hadoop/path/to/file/Concatenate_file_Folder/ConcatenateFile.txt
Hope this helps.
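Since the combined size exceeds your local disk (~1.5 TB), it may help that the pipe above streams through the client without landing on local disk, so the same pattern with a wildcard should also cover many files (hypothetical paths):

hadoop fs -cat /hadoop/path/to/file/*.txt | hadoop fs -put - /hadoop/path/to/file/Concatenate_file_Folder/ConcatenateFile.txt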
HDFS by itself does not provide such capabilities. All out-of-the-box features (like hdfs dfs -text * with pipes or FileUtil's copy methods) use your client server to transfer all data.
In my experience, we always used our own MapReduce jobs to merge many small files in HDFS in a distributed way.
So you have two solutions:
1) Write your own simple MapReduce/Spark job to combine text files with your format.
2) Find an already implemented solution for this kind of purpose.
About solution #2: there is a simple project, FileCrush, for combining text or sequence files in HDFS. It might be suitable for you; check it out.
Example of usage:
hadoop jar filecrush-2.0-SNAPSHOT.jar crush.Crush -Ddfs.block.size=134217728 \
--input-format=text \
--output-format=text \
--compress=none \
/input/dir /output/dir 20161228161647
I had problems running it without these options (especially -Ddfs.block.size and the output file date prefix 20161228161647), so make sure you run it properly.
You can do a pig job:
A = LOAD '/path/to/inputFiles' as (SCHEMA);
STORE A into '/path/to/outputFile';
Doing an hdfs cat and then putting the result back into HDFS means all of this data is processed on the client node, which will degrade your network.

Load a folder from LocalSystem to HDFS

I have a folder on my local system. It contains 1000 files, and I would like to move or copy it from my local system to HDFS.
I tried these two commands:
hadoop fs -copyFromLocal C:/Users/user/Downloads/ProjectSpark/ling-spam /tmp
And I also tried this command:
hdfs dfs -put /C:/Users/user/Downloads/ProjectSpark/ling-spam /tmp/ling-spam
It displays an error message saying that my directory was not found, yet I'm sure the path is correct.
I found the getmerge function to move a folder from HDFS to the local system, but I did not find the inverse.
Please, can you help me?
My VirtualBox runs on Windows, and I work with HDP 2.3.2 through a secure shell console.
You can't copy files from your Windows machine to HDFS. You have to first SCP the files into the VM (I recommend WinSCP or Filezilla) and only then can you use hadoop fs to put files onto HDFS.
The error was correct in that C:/Users/user/Downloads does not exist on the HDP sandbox because it's a Linux machine.
As noted, you can also try the Ambari HDFS file view, but I still stand by my note that SCP is the official way, because not all Hadoop systems have Ambari (or at least the HDFS file view for Ambari).
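A minimal sketch of that two-step route, assuming the sandbox is reachable as root@sandbox.hortonworks.com (host name, user and SSH port are assumptions that vary per setup):

# 1) from Windows, copy the folder onto the VM's local filesystem (WinSCP/FileZilla do the same thing graphically)
scp -r C:/Users/user/Downloads/ProjectSpark/ling-spam root@sandbox.hortonworks.com:/root/ling-spam
# 2) then, from a shell inside the VM, push the local folder into HDFS
hdfs dfs -put /root/ling-spam /tmp/ling-spam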
I would use Mutual Information to classify a word as spam or ham. I have this operation: MI(word) = Σ over (occurrence, class) of P(occurrence, class) * log2( P(occurrence, class) / (P(occurrence) * P(class)) ).
I understand the function: I must compute 4 terms, (true, ham), (false, ham), (true, spam) and (false, spam).
What I do not understand is what exactly I should write; so far I have computed the number of files in which each word occurs.
But I do not know exactly what I must write in my function.
Thank you very much!
This is the body of my function:
def computeMutualInformationFactor(
    probaWC: RDD[(String, Double)], // probability of occurrence of the word in a given class
    probaW: RDD[(String, Double)],  // probability of occurrence of the word in either class
    probaC: Double,                 // probability that an email appears in the class (spam or ham)
    probaDefault: Double            // default value when a probability is missing
): RDD[(String, Double)] = {
  // One possible body (sketch): P(w,c) * log2(P(w,c) / (P(w) * P(c))), using probaDefault when the word is missing in the class.
  probaW.leftOuterJoin(probaWC).mapValues { case (pW, pWC) =>
    val p = pWC.getOrElse(probaDefault)
    p * math.log(p / (pW * probaC)) / math.log(2)
  }
}

Does Hadoop copyFromLocal create 2 copies - 1 inside HDFS and another inside the datanode?

I have installed a pseudo-distributed standalone Hadoop version on Ubuntu, inside VMware on my Windows 10 machine.
I downloaded a file from the internet and copied it into the Ubuntu local directory /lab/data.
I have created namenode and datanode folders (not the hadoop folder) named namenodep and datan1 in Ubuntu. I have also created a folder inside HDFS called /input.
When I copied the file from the Ubuntu local filesystem to HDFS, why is the file present in both of the directories below?
$ hadoop fs -copyFromLocal /lab/data/Civil_List_2014.csv /input
$hadoop fs -ls /input/
input/Civil_List_2014.csv ?????
$cd lab/hdfs/datan1/current
blk_3621390486220058643 ?????
blk_3621390486220058643_1121.meta
Basically, I want to understand whether it created 2 copies: 1 inside the datan1 folder and the other inside HDFS.
Thanks
No. Only one copy is created.
When you create a file in HDFS, the contents of the file are stored on one of the disks of the Data Node. The disk location where the Data Node stores the data is determined by the configuration parameter: dfs.datanode.data.dir (present in hdfs-site.xml)
Check the description of this property:
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///e:/hdpdatadn/dn</value>
<description>Determines where on the local filesystem an DFS data node
should store its blocks. If this is a comma-delimited
list of directories, then data will be stored in all named
directories, typically on different devices.
Directories that do not exist are ignored.
</description>
<final>true</final>
</property>
So, as above, the contents of your HDFS file "/input/Civil_List_2014.csv" are stored in the physical location lab/hdfs/datan1/current/blk_3621390486220058643.
"blk_3621390486220058643_1121.meta" contains the checksum of the data stored in "blk_3621390486220058643".
Your file may be small enough to fit in a single block. But if a file is big (say > 256 MB, with a Hadoop block size of 256 MB), then Hadoop splits the contents of the file into 'n' blocks and stores them on the disk. In that case, you will see 'n' "blk_*" files in the data node's data directory.
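For example, a 600 MB file with a 256 MB block size would be split into ceil(600 / 256) = 3 blocks, so three "blk_*" files (each with its own ".meta" checksum file) would appear under the data directory.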
Also, since the replication factor is typically set to "3", 3 instances of the same block are created.
The output from the hadoop fs -ls /input/ command is actually showing you metadata information, not a physical file; it is a logical abstraction over the files hosted by the datanodes. This metadata is stored by the NameNode.
The actual physical files are split into blocks and hosted by the datanodes in the path specified in the configuration, in your case lab/hdfs/datan1/current.
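If you want to see that mapping for yourself, fsck can print the blocks and datanode locations behind the logical file:

hdfs fsck /input/Civil_List_2014.csv -files -blocks -locations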

How to make Hadoop Distcp copy custom list of folders?

I'm looking for an efficient way to sync a list of directories from one Hadoop filesystem to another with the same directory structure.
For example, let's say HDFS1 is the official source where data is created, and once a week we need to copy the newly created data under all data-2 directories to HDFS2:
**HDFS1**
hdfs://namenode1:port/repo/area-1/data-1
hdfs://namenode1:port/repo/area-1/data-2
hdfs://namenode1:port/repo/area-1/data-3
hdfs://namenode1:port/repo/area-2/data-1
hdfs://namenode1:port/repo/area-2/data-2
hdfs://namenode1:port/repo/area-3/data-1
**HDFS2** (subset of HDFS1 - only data-2)
hdfs://namenode2:port/repo/area-1/data-2
hdfs://namenode2:port/repo/area-2/data-2
In this case we have 2 directories to sync:
/repo/area-1/data-2
/repo/area-2/data-2
This can be done by:
hadoop distcp hdfs://namenode1:port/repo/area-1/data-2 hdfs://namenode2:port/repo/area-1
hadoop distcp hdfs://namenode1:port/repo/area-2/data-2 hdfs://namenode2:port/repo/area-2
This will run 2 Hadoop jobs, and if the number of directories is big - let's say 500 different non-overlapping directories under hdfs://namenode1:port/ - this will create 500 Hadoop jobs, which is obvious overkill.
Is there a way to inject a custom directory list into distcp?
How can I make distcp create one job that copies all paths in a custom list of directories?
Not sure if this answers the problem, but I noticed you haven't used the -update option. The -update option will only copy over the difference in the blocks between the two file systems...
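On the "custom list" part of the question: distcp also accepts -f <urilist>, a file (for example stored in HDFS) with one source URI per line, so a single job can cover all the data-2 directories. All sources land under the single target directory you give, so check how that interacts with your per-area layout before relying on it. A sketch, with a hypothetical list file /tmp/data2_dirs.txt containing the two source paths above:

hadoop distcp -f hdfs://namenode1:port/tmp/data2_dirs.txt hdfs://namenode2:port/repo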
