How does cp command work in Hadoop? - hadoop

I am reading "Hadoop: The Definitive Guide", and to explain my question let me quote from the book:
distcp is implemented as a MapReduce job where the work of copying is done by the
maps that run in parallel across the cluster. There are no reducers. Each file is copied
by a single map, and distcp tries to give each map approximately the same amount of
data by bucketing files into roughly equal allocations. By default, up to 20 maps are used, but this can be changed by specifying the -m argument to distcp.
and in a footnote
Even for a single file copy, the distcp variant is preferred for large files since hadoop fs -cp copies the file
via the client running the command.
I understand why distcp works better for a collection of files, since different mappers run in parallel, each handling a single file. But when only a single file is to be copied, why does distcp perform better when the file is large (according to the footnote)? I am only getting started, so it would be helpful if someone could explain how the cp command works in Hadoop and what is meant by "hadoop fs -cp copies the file via the client running the command." I understand the write process of Hadoop as explained in the book, where a pipeline of datanodes is formed and each datanode is responsible for writing data to the following datanode in the pipeline.

When a file is copied "via the client", the byte content is streamed from HDFS to the local node running the command and then uploaded back to the destination HDFS location. The blocks are not simply copied between datanodes directly, as you might expect.
Compare that to distcp, which spreads the copy work across parallel map tasks running on multiple hosts.
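To make the difference concrete, here is a minimal sketch with made-up paths (/data/big.dat and /backup/ are placeholders, not from the book): the first command pulls every byte of the file through the machine where you run it, while the second submits a map-only job so the copy happens on a cluster node.
# streams all bytes of the file through the client host
hadoop fs -cp /data/big.dat /backup/big.dat
# submits a map-only MapReduce job; the actual copying happens on a cluster node
hadoop distcp /data/big.dat /backup/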

Related

How can I append multiple files in HDFS to a single file in HDFS without the help of local file system?

I am learning Hadoop and I have come across a problem. I ran a MapReduce job and the output was stored in multiple files rather than a single file. I want to append all of them into a single file in HDFS. I know about the appendToFile and getmerge commands, but they work only from the local file system to HDFS or from HDFS to the local file system, not from HDFS to HDFS. Is there any way to append the output files in HDFS into a single file in HDFS without touching the local file system?
The only way to do this would be to force your MapReduce code to use one reducer, for example by sorting all the results by a single key.
However, this defeats the purpose of having a distributed filesystem and multiple processors. All Hadoop jobs should be able to read a directory of files rather than being restricted to a single file.
If you need a single file to download from HDFS, then you should use getmerge.
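As a rough sketch of both suggestions (the jar name, driver class, and paths below are hypothetical, and the -D option only takes effect if the driver goes through ToolRunner):
# force a single reducer so the job writes one part file
hadoop jar myjob.jar MyDriver -D mapreduce.job.reduces=1 /user/me/input /user/me/output
# or merge the existing part files while downloading them to the local filesystem
hadoop fs -getmerge /user/me/output /tmp/output-merged.txt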
There is no easy way to do this directly in HDFS, but the trick below works. It is not an ideal solution, but it should be fine if the output is not huge.
hadoop fs -cat source_folder_path/* | hadoop fs -put - target_filename
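An equivalent variant, assuming append is enabled on your cluster, is to pipe into appendToFile reading from stdin; like the trick above, the data still streams through the client:
# concatenate the parts and append them to the target file in HDFS via stdin ("-")
hadoop fs -cat source_folder_path/* | hadoop fs -appendToFile - target_filename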

Difference between hadoop fs -put and hadoop distcp

We are going to do the ingestion phase in our data lake project, and I have mostly used hadoop fs -put throughout my Hadoop developer experience. What is the difference between that and hadoop distcp, and how does their usage differ?
distcp is a special tool used for copying data from one cluster to another. You usually copy from one HDFS to another HDFS, not from the local file system. Another very important point is that the copy is done as a MapReduce job with 0 reduce tasks, which makes it faster because the work is distributed. It expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list.
hdfs put - copies data from the local system to HDFS. It uses the HDFS client behind the scenes and does all the work sequentially, talking to the NameNode and DataNodes. It does not create a MapReduce job to process the data.
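A quick illustration of the two, with placeholder paths and namenode addresses (nn-a, nn-b, and the directories are invented for the example):
# client-side ingest: the local file is read on this machine and written to HDFS through the client
hadoop fs -put /data/local/events.json /user/ingest/events/
# cluster-to-cluster copy: a map-only MapReduce job spreads the copying across the cluster
hadoop distcp hdfs://nn-a:8020/user/ingest/events hdfs://nn-b:8020/user/ingest/events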
hdfs or hadoop put is used for data ingestion from Local to HDFS file system
distcp cannot be used for data ingestion from Local to HDFS as it works only on HDFS filesystem
We extensively use distcp for archiving (back-up and restore) of HDFS files, something like this:
hadoop distcp $CURRENT_HDFS_PATH $BACKUP_HDFS_PATH
"distcp cannot be used for data ingestion from Local to HDFS as it works only on HDFS filesystem"
-> it can; use "file" (e.g. "file:///tmp/test.txt") as the scheme in the URL (https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html)
Hint: use "hadoop distcp -D dfs.replication=1" to decrease the distcp copy time, and raise the replication of the copied files afterwards.
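Putting those two comments together, a possible sketch (the host, port, and paths are placeholders, and the effect of -D dfs.replication=1 follows the hint above rather than anything guaranteed here):
# ingest a local file with distcp via the file:// scheme, writing with replication 1
hadoop distcp -D dfs.replication=1 file:///data/staging/bigfile.dat hdfs://namenode:8020/ingest/
# raise the replication factor afterwards; -w waits until replication completes
hadoop fs -setrep -w 3 /ingest/bigfile.dat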
distcp is the command used for copying data from one cluster's HDFS location to another cluster's HDFS location. It creates a MapReduce job with 0 reducers to process the data.
hadoop distcp webhdfs://source-ip/directory/filename webhdfs://target-ip/directory/
scp is the command used for copying data from one cluster's local file system to another cluster's local file system.
scp user@source-ip:/directory/filename user@target-ip:/directory/
hdfs put command - copies data from the local file system to HDFS. It does not create a MapReduce job to process the data.
hadoop fs -put -f /path/file /hdfspath/file
hdfs get command - copies data from HDFS to the local file system.
First, go to the directory into which you want to copy the file, then run the command below:
hadoop fs -get /hdfsloc/file

What is the best way of loading huge size files from local to hdfs

I have a directory that contains multiple folders with N files in each. A single file can be as large as 15 GB. I don't know what the best way is to copy/move the files from local to HDFS.
There are many ways to do this using traditional methods, for example:
hdfs dfs -put /path/to/localdir/ hdfs://namenode:port/path/to/hdfsdir
hdfs dfs -copyFromLocal /path/to/localdir/ hdfs://namenode:port/path/to/hdfsdir
hdfs dfs -moveFromLocal /path/to/localdir/ hdfs://namenode:port/path/to/hdfsdir
hadoop distcp file:///path/to/localdir/ hdfs://namenode:port/path/to/hdfsdir
Options 1 & 2 are the same in your case; there will be no difference in copy time.
Option 3 might take a bit more time, as it copies the data to HDFS (same as -put) and then deletes the file from the local filesystem.
Option 4 is a tricky one. It is designed for large inter- or intra-cluster copying, but you can use the same command for local files as well by giving the local file URL with the "file://" prefix. It is not the optimal use of distcp, because the tool is designed to work in parallel (using MapReduce) and, with the file sitting on a single local machine, it cannot make use of that strength. (You can try creating a mount on the cluster nodes, which might improve distcp's performance; see the sketch below.)
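Following that mount suggestion, a hedged sketch (it assumes /mnt/shared/localdir is exported and mounted at the same path on every worker node, which you would have to set up yourself; the map count is arbitrary):
# each map task can read the shared mount directly, so the copy is parallel again
hadoop distcp -m 8 file:///mnt/shared/localdir hdfs://namenode:port/path/to/hdfsdir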

Combine Map output for directory to one file

I have a requirement where I have to merge the output of the mappers for a directory into a single file. Let's say I have a directory A which contains 3 files.
../A/1.txt
../A/2.txt
../A/3.txt
I need to run a mapper to process these files, which should generate one output file. I know a reducer will do that, but I don't want to use reducer logic.
OR
Can I have only one mapper process all the files under a directory?
If you set up fuse to mount your HDFS to a local directory, then your output can be written to the mounted filesystem.
For example, I have our HDFS mounted to /mnt/hdfs locally. I run the following command and it works great:
hadoop fs -getmerge /reports/some_output /mnt/hdfs/reports/some_output.txt
Of course, there are other reasons to use fuse to mount HDFS to a local directory, but this was a nice side effect for us.
Can I have only one mapper process all the files under a directory?
Have you looked into CombineFileInputFormat? Felix Ren-Chyan Chern writes about setting it up in some detail.

HDFS Block Split

My Hadoop knowledge is 4 weeks old. I am using a sandbox with Hadoop.
According to the theory, when a file is copied into the HDFS file system it is split into 128 MB blocks. Each block is then stored on a data node and replicated to other data nodes.
Question:
When I copy a data file (~500 MB) from the local file system into HDFS (put command), the entire file still shows up in HDFS (-ls command). I was expecting to see 128 MB blocks. What am I doing wrong here?
Supposing I manage to split and distribute the data file in HDFS, is there a way to combine the blocks and retrieve the original file back to the local file system?
You won't see the individual blocks from the -ls command; they are the logical equivalent of blocks on a hard drive, which don't show up in Linux's ls or Windows Explorer. You can inspect them on the command line with hdfs fsck /user/me/someFile.avro -files -blocks -locations, or you can use the NameNode UI to see which hosts hold the blocks of a file and on which hosts each block is replicated.
Sure. You'd just do something like hdfs dfs -get /user/me/someFile.avro, or download the file using Hue or the NameNode UI. All of these options will stream the appropriate blocks to you to reassemble the logical file.
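A short walkthrough of both answers with made-up file names (data-500mb.csv is a placeholder; with the default 128 MB block size a ~500 MB file ends up as roughly four blocks):
# copy a ~500 MB local file into HDFS; it is stored as multiple 128 MB blocks
hdfs dfs -put ./data-500mb.csv /user/me/data-500mb.csv
# list the file's blocks and the datanodes holding each replica
hdfs fsck /user/me/data-500mb.csv -files -blocks -locations
# stream the blocks back and reassemble the original file locally
hdfs dfs -get /user/me/data-500mb.csv ./data-500mb-copy.csv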
