Difference between hadoop fs -put and hadoop distcp - hadoop

We are about to start the ingestion phase of our data lake project, and I have mostly used hadoop fs -put throughout my Hadoop developer experience. So what is the difference with hadoop distcp, and how does the usage differ?

Distcp is a special tool used for copying data from one cluster to another. Basically, you usually copy from one HDFS to another HDFS, not from the local file system. Another very important point is that the copy is done as a MapReduce job with 0 reduce tasks, which makes it faster due to the distribution of operations. It expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list.
hdfs put - copies data from the local system to HDFS. It uses the HDFS client behind the scenes and does all the work sequentially by talking to the NameNode and DataNodes. It does not create MapReduce jobs to process the data.
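As a quick side-by-side (a minimal sketch; the local path, HDFS paths and the nn1/nn2 NameNode addresses are placeholders):
hadoop fs -put /local/data /user/me/data                                  # single client, sequential copy from the local FS
hadoop distcp hdfs://nn1:8020/user/me/data hdfs://nn2:8020/backup/data    # map-only MapReduce job, parallel copy between clusters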

hdfs or hadoop put is used for data ingestion from the local file system to HDFS
distcp cannot be used for data ingestion from Local to HDFS as it works only on HDFS filesystem
We extensively use distcp for archiving (back-up and restore) of HDFS files, something like this:
hadoop distcp $CURRENT_HDFS_PATH $BACKUP_HDFS_PATH
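If you re-run the backup regularly, DistCp's -update and -delete options can make it incremental; a hedged variation of the command above (same path variables, exact semantics as described in the DistCp docs):
hadoop distcp -update -delete $CURRENT_HDFS_PATH $BACKUP_HDFS_PATH   # copy only changed files and drop files removed from the source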

"distcp cannot be used for data ingestion from Local to HDFS as it works only on HDFS filesystem"
-> It can: use "file" as the scheme (e.g. "file:///tmp/test.txt") in the URL (https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html)
Hint: use "hadoop distcp -D dfs.replication=1" to reduce the distcp copy time, and re-replicate the copied files afterwards.
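Putting those two hints together, a sketch of a local-to-HDFS distcp (the staging path and NameNode address are placeholders; note that a file:// source has to be readable from the nodes running the map tasks, e.g. a shared mount, otherwise run it on a single node):
hadoop distcp -D dfs.replication=1 file:///data/staging/ hdfs://namenode:8020/ingest/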

distcp is the command used for copying data from one cluster's HDFS location to another cluster's HDFS location. It creates a MapReduce job with 0 reducers to process the data.
hadoop distcp webhdfs://source-ip/directory/filename webhdfs://target-ip/directory/
scp is the command used for copying data from one cluster's local file system to another cluster's local file system.
scp user@source-ip:/directory/filename user@target-ip:/directory/
hdfs put command - copies data from the local file system to HDFS. It does not create MapReduce jobs to process the data.
hadoop fs -put -f /path/file /hdfspath/file
hdfs get command - copies data from HDFS to the local file system.
First, go to the directory where you want the file copied, then run the command below:
hadoop fs -get /hdfsloc/file

Related

How can I append multiple files in HDFS to a single file in HDFS without the help of local file system?

I am learning Hadoop and I came across a problem. I ran a MapReduce job and the output was stored in multiple files rather than a single file. I want to append all of them into a single file in HDFS. I know about the appendToFile and getmerge commands, but they work only from the local file system to HDFS or from HDFS to the local file system, not from HDFS to HDFS. Is there any way to append the output files in HDFS into a single file in HDFS without touching the local file system?
The only way to do this within the job itself would be to force your MapReduce code to use one reducer, for example by sending all the results to a single key (or by setting the number of reduce tasks to 1).
However, this defeats the purpose of having a distributed filesystem and multiple processors. All Hadoop jobs should be able to read a directory of files, not be restricted to processing a single file.
If you need a single file to download from HDFS, then you should use getmerge; see the sketch below.
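For example (a sketch, not tested here; my-job.jar and com.example.MyDriver are placeholders, and the -D option is only picked up if the driver uses ToolRunner/GenericOptionsParser):
hadoop jar my-job.jar com.example.MyDriver -D mapreduce.job.reduces=1 /input/dir /output/dir   # force a single reducer, so a single output file
hadoop fs -getmerge /output/dir merged.txt                                                     # or merge the output parts locally instead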
There is no easy way to do this directly in HDFS, but the trick below works. It is not an elegant solution, but it should be fine if the output is not huge.
hadoop fs -cat source_folder_path/* | hadoop fs -put - target_filename

Why doesn't MapReduce get launched when using the hadoop fs -put command?

Please excuse me for this basic question.
But I wonder why a MapReduce job doesn't get launched when we try to load a file whose size is larger than the block size.
Somewhere I learnt that MapReduce takes care of loading datasets from the LFS into HDFS. Then why am I not able to see MapReduce logs on the console when I run the hadoop fs -put command?
Thanks in advance.
You're thinking of hadoop distcp, which will spawn a MapReduce job.
https://hadoop.apache.org/docs/stable/hadoop-distcp/DistCp.html
DistCp Version 2 (distributed copy) is a tool used for large inter/intra cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.
hadoop fs -put or hdfs dfs -put are implemented entirely by HDFS and don't require MapReduce.
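One way to see the difference for yourself (a sketch; the cluster addresses and paths are placeholders):
hadoop distcp hdfs://nn1:8020/data hdfs://nn2:8020/data   # while this runs, it shows up in 'yarn application -list' as a MapReduce application
hadoop fs -put /local/file /user/me/file                  # no application appears; the copy goes through the HDFS client only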

Where can I see my data in Hadoop HDFS

I set dfs.name.dir and dfs.data.dir on the master and slave nodes to /home/hduser/hadoop/hdfs/name and /home/hduser/hadoop/hdfs/data.
I copied a file from the local disk to HDFS.
Where can I see that file's data in HDFS?
These configuration parameters determine where in the local filesystem Hadoop stores its image and raw data. When you import file data into HDFS, it doesn't involve these values. In general, data is written into HDFS at the path you specify (when it is absolute), or at a path qualified by your username (by default, I believe, this is /user/your_username) when you use a relative path.
So, if I have a file named example in my (local) home directory and say
local:~ matt> hadoop fs -put example relative/path
I should be able to find it in HDFS at /user/matt/relative/path/example. On the other hand, if I do this
local:~ matt> hadoop fs -put example /absolute/path/in/hdfs
it will be in HDFS at /absolute/path/in/hdfs/example.
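To confirm where the file landed, and to see which DataNodes (and hence which dfs.data.dir directories) hold its blocks, something like this should work, permissions permitting (use hadoop fsck instead of hdfs fsck on older releases):
hadoop fs -ls /user/matt/relative/path
hdfs fsck /user/matt/relative/path/example -files -blocks -locations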

Is it possible to run hadoop fs -getmerge in S3?

I have an Elastic MapReduce job which writes some files to S3, and I want to concatenate all the files to produce a single text file.
Currently I'm manually copying the folder with all the files to our HDFS (hadoop fs -copyFromLocal), then running hadoop fs -getmerge and hadoop fs -copyToLocal to obtain the file.
Is there any way to use hadoop fs directly on S3?
Actually, this response about getmerge is incorrect. getmerge expects a local destination and will not work with S3. It throws an IOException if you try and responds with -getmerge: Wrong FS:.
Usage:
hadoop fs [generic options] -getmerge [-nl] <src> <localdst>
An easy way (if you are generating a small file that fits on the master machine) is to do the following:
Merge the file parts into a single file onto the local machine (Documentation)
hadoop fs -getmerge hdfs://[FILE] [LOCAL FILE]
Copy the result file to S3, and then delete the local file (Documentation)
hadoop dfs -moveFromLocal [LOCAL FILE] s3n://bucket/key/of/file
I haven't tried the getmerge command myself, but hadoop fs commands on EMR cluster nodes support S3 paths just like HDFS paths. For example, you can SSH into the master node of your cluster and run:
hadoop fs -ls s3://<my_bucket>/<my_dir>/
The above command will list all the S3 objects under the specified directory path.
I would expect hadoop fs -getmerge to work the same way. So, just use full S3 paths (starting with s3://) instead of HDFS paths.
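For example, something along these lines should work from the master node (a sketch; the bucket and paths are placeholders, and the destination of -getmerge still has to be local, per the answer above):
hadoop fs -getmerge s3://<my_bucket>/<my_dir>/ /tmp/merged.txt
hadoop fs -put /tmp/merged.txt s3://<my_bucket>/merged.txt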

Using multiple local folders as source in hadoop mapreduce job

I have data in multiple local folders, i.e. /usr/bigboss/data1, /usr/bigboss/data2 and many more. I want to use all of these folders as the input source for my MapReduce job and store the result in HDFS. I cannot find a working command to do this with the Hadoop Grep example.
The data will need to reside in HDFS for you to process it with the grep example. You can upload the folders to HDFS using the -put FsShell command:
hadoop fs -mkdir bigboss
hadoop fs -put /usr/bigboss/data* bigboss
This will create a folder in the current user's HDFS home directory and upload each of the data directories to it.
Now you should be able to run the grep example over the data:
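For example, with the stock examples jar (the jar path under $HADOOP_HOME and the regex are placeholders; adjust to your Hadoop version):
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep bigboss bigboss_out 'dfs[a-z.]+'
hadoop fs -cat bigboss_out/part-r-00000   # view the merged result of the grep example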
