How does getMerge work in Hadoop? - hadoop

I would like to know, how does the getMerge command work in OS/HDFS level. Will it copy each and every byte/blocks from one file to another file,or just a simple file descriptor change? How costliest operation is it?

getmerge
Usage: hadoop fs -getmerge <src> <localdst> [addnl]
Takes a source directory and a destination file as input and concatenates files in src into the destination local file. Optionally addnl can be set to enable adding a newline character at the end of each file.
So, to answer your question,
Will it copy each and every byte/blocks from one file to another file
Yes, and no. It will find every HDFS block containing the files in the given source directory and concatenate them together into a single file on your local filesystem.
a simple file descriptor change
Not sure what you mean by that. getmerge doesn't change any file descriptors; it is just reading data from HDFS to your local filesystem.
How costliest operation is it?
Expect it to be as costly as manually cat-ing all the files in an HDFS directory. The same operation for
hadoop fs -getmerge /tmp/ /home/user/myfile
Could be achieved by doing
hadoop fs -cat /tmp/* > /home/user/myfile
The costly operation being the fetching of many file pointers and transferring those records over the network to your local disk.

Related

How can I append multiple files in HDFS to a single file in HDFS without the help of local file system?

I am learning hadoop. I came across a problem now. I ran the mapreduce job and output was stored in multiple files but not as single file. I want to append all of them into a single file in hdfs. I know about appendToFile and getmerge commands. But they work only for either local file system to hdfsor hdfs to local system but not from HDFS to HDFS. Is there any way to append the output files in HDFS to a single file in HDFS without touching local file system?
The only way to do this would be to force your mapreduce code to use one reducer, for example, by sorting all the results by a single key.
However, this defeats the purpose of having a distributed filesystem and multiple processors. All Hadoop jobs should be able to read a directory of files, not isolated to process a single file
If you need a single file to download from HDFS, then you should use getmerge
There is no easy way to do this directly in HDFS. But the below trick works. Although not a feasible solution, but should work if output is not huge.
hadoop fs -cat source_folder_path/* | hadoop fs -put target_filename

Combine Map output for directory to one file

I have a requirement, where i have to merge the output of mappers of a directory in to a single file. Lets say i have a directory A which contains 3 files.
../A/1.txt
../A/2.txt
../A/3.txt
I need to run a mapper to process these files which shud generate one output file. I KNOW REDUCER WILL DO THAT, BUT I DONT WANT TO USE REDUCER LOGIC.
OR
Can i have only one mapper to process all the files under a directory.
If you set up fuse to mount your HDFS to a local directory, then your output can be the mounted filesystem.
For example, I have our HDFS mounted to /mnt/hdfs locally. I run the following command and it works great:
hadoop fs -getmerge /reports/some_output /mnt/hdfs/reports/some_output.txt
Of course, there are other reasons to use fuse to mount HDFS to a local directory, but this was a nice side effect for us
Can i have only one mapper to process all the files under a directory.
Have you looked into CombinedFileInputFormat? Felix Ren-Chyan Chern writes about setting it up in some detail.

Transferring many small files into Hadoop file system

I want to transfer too many small files (e.g. 200k files) in a zip file into HDFS from the local machine. When I unzip the zip file and tranfer the files into HDFS, it takes a long time. Is there anyway I can transfer the original zip file into HDFS and unzip it there?
If your file is in GB's then this command would certainly help to avoid out of space errors as there is no need to unzip the file on local filesystem.
put command in hadoop supports reading input from stdin. For reading the input from stdin use '-' as source file.
Compressed filename: compressed.tar.gz
gunzip -c compressed.tar.gz | hadoop fs -put - /user/files/uncompressed_data
Only Disadvantage: The only drawback of this approach is that in HDFS the data will be merged into a single file even though the local compressed file contains more than one file.
http://bigdatanoob.blogspot.in/2011/07/copy-and-uncompress-file-to-hdfs.html

getmerge command in hadoop datacopy

My aim is to read all the files that starts with "trans" in a directory and convert them into a single file and load that single file into HDFS location
my source directory is /user/cloudera/inputfiles/
Assume that inside that above directory , there are lot of file , but i need all the files that start with "trans"
my destination directory is /user/cloudera/transfiles/
So i tried this command below
hadoop dfs - getmerge /user/cloudera/inputfiles/trans* /user/cloudera/transfiles/records.txt
but the above command is not working .
If i try the below command then it works
hadoop dfs - getmerge /user/cloudera/inputfiles /user/cloudera/transfiles/records.txt
Any suggestion on how do i merge some files from a hdfs location and store that merged single file in another hdfs location
Below is the usage of getmerge command:
Usage: hdfs dfs -getmerge <src> <localdst> [addnl]
Takes a source directory and a destination file as input and
concatenates files in src into the destination local file.
Optionally addnl can be set to enable adding a newline character at the
end of each file.
It expects directory as first parameter.
you can try cat command like this:
hadoop dfs -cat /user/cloudera/inputfiles/trans* > /<local_fs_dir>/records.txt
hadoop dfs -copyFromLocal /<local_fs_dir>/records.txt /user/cloudera/transfiles/records.txt

Reading files from hdfs vs local directory

I am a beginner in hadoop. I have two doubts
1) how to access files stored in the hdfs? Is it same as using a FileReader in java.io and giving the local path or is it something else?
2) i have created a folder where i have copied the file to be stored in hdfs and the jar file of the mapreduce program. When I run the command in any directory
${HADOOP_HOME}/bin/hadoop dfs -ls
it just shows me all the files in the current dir. So does that mean all the files got added without me explicitly adding it?
Yes, it's pretty much the same. Read this post to read files from HDFS.
You should keep in mind that HDFS is different than your local file system. With hadoop dfs you access the HDFS, not the local file system. So, hadoop dfs -ls /path/in/HDFS shows you the contents of the /path/in/HDFS directory, not the local one. That's why it's the same, no matter where you run it from.
If you want to "upload" / "download" files to/from HDFS you should use the commads:
hadoop dfs -copyFromLocal /local/path /path/in/HDFS and
hadoop dfs -copyToLocal /path/in/HDFS /local/path, respectively.

Resources