Why doesn't MapReduce get launched when using the hadoop fs -put command? - hadoop

Please excuse me for this basic question.
I wonder why a MapReduce job doesn't get launched when we load a file that is larger than the block size.
Somewhere I learnt that MapReduce takes care of loading datasets from the local file system into HDFS. So why don't I see any MapReduce logs on the console when I run the hadoop fs -put command?
Thanks in advance.

You're thinking of hadoop distcp, which does spawn a MapReduce job.
https://hadoop.apache.org/docs/stable/hadoop-distcp/DistCp.html
DistCp Version 2 (distributed copy) is a tool used for large inter/intra cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.
hadoop fs -put or hdfs dfs -put are implemented entirely by HDFS and don't require MapReduce.
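To see why no job shows up, it helps to look at what -put boils down to in the Java API. The sketch below is only an approximation of what the shell command does, and the paths are made up; it copies a local file into HDFS purely through the HDFS client, so there is nothing to submit to MapReduce:
// Roughly what "hadoop fs -put" does: the HDFS client streams the bytes itself.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // the HDFS client; no MapReduce involved
        // The client reads the local bytes and writes them through the
        // DataNode pipeline. No job is submitted anywhere.
        fs.copyFromLocalFile(new Path("file:///tmp/data.csv"),
                             new Path("/user/me/data.csv"));
        fs.close();
    }
}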

Related

How can I append multiple files in HDFS to a single file in HDFS without the help of local file system?

I am learning Hadoop and I have run into a problem. I ran a MapReduce job and the output was stored in multiple files rather than a single file. I want to append all of them into one file in HDFS. I know about the appendToFile and getmerge commands, but they only work between the local file system and HDFS (in either direction), not from HDFS to HDFS. Is there any way to append the output files in HDFS into a single file in HDFS without touching the local file system?
The only way to do this would be to force your MapReduce code to use one reducer, for example by sorting all the results by a single key.
However, this defeats the purpose of having a distributed filesystem and multiple processors. All Hadoop jobs should be able to read a directory of files rather than being restricted to a single file.
If you need a single file to download from HDFS, then you should use getmerge.
There is no easy way to do this directly in HDFS, but the trick below works. It is not a scalable solution, but it should be fine if the output is not huge.
hadoop fs -cat source_folder_path/* | hadoop fs -put - target_filename
(Note the - after -put, which tells put to read from stdin.)
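If you would rather not shell out, the same HDFS-to-HDFS merge can be done with the Java FileSystem API. This is only a sketch and the directory and file names below are invented; like the cat | put trick, every byte still flows through the client, so it is only sensible for modestly sized output:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsMerge {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path srcDir = new Path("/user/me/job-output");   // e.g. part-r-00000, part-r-00001, ...
        Path target = new Path("/user/me/merged.txt");

        try (FSDataOutputStream out = fs.create(target)) {
            for (FileStatus status : fs.listStatus(srcDir)) {
                // Skip subdirectories and marker files such as _SUCCESS.
                if (!status.isFile() || status.getPath().getName().startsWith("_")) {
                    continue;
                }
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.copyBytes(in, out, 4096, false); // stream HDFS -> client -> HDFS
                }
            }
        }
        fs.close();
    }
}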

How does cp command work in Hadoop?

I am reading "Hadoop: The Defnitive Guide" and to explain my question let me quote from the book
distcp is implemented as a MapReduce job where the work of copying is done by the
maps that run in parallel across the cluster. There are no reducers. Each file is copied
by a single map, and distcp tries to give each map approximately the same amount of
data by bucketing files into roughly equal allocations. By default, up to 20 maps are used, but this can be changed by specifying the -m argument to distcp.
and in a footnote
Even for a single file copy, the distcp variant is preferred for large files since hadoop fs -cp copies the file
via the client running the command.
I understand why distcp works better for a collection of files, since different mappers run in parallel, each on a single file. But when only a single file is to be copied, why does distcp perform better when the file is large (according to the footnote)? I am only getting started, so it would help if someone could explain how the cp command works in Hadoop and what is meant by "hadoop fs -cp copies the file via the client running the command." I understand the HDFS write process explained in the book, where a pipeline of datanodes is formed and each datanode is responsible for writing the data on to the next datanode in the pipeline.
When a file is copied "via the client", the byte content is streamed out of HDFS to the local node running the command and then uploaded back to the destination HDFS location. The data is not transferred between datanodes directly, and the metadata is not simply copied over to a new location, as you might expect.
Compare that to distcp, which creates smaller, parallel copy tasks spread out over multiple hosts.
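To make the footnote concrete, here is a rough sketch of a client-mediated copy using the public FileUtil.copy helper. The paths are invented and this is only an approximation of what hadoop fs -cp does, but it shows that every byte of the file passes through the one JVM running the command:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ClientSideCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // FileUtil.copy opens the source, creates the destination, and pumps
        // the bytes through this single JVM. For a very large file this one
        // process becomes the bottleneck, which is why the book recommends
        // distcp even for a single large file.
        FileUtil.copy(fs, new Path("/data/big-file"),
                      fs, new Path("/backup/big-file"),
                      false /* don't delete source */, conf);
        fs.close();
    }
}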

Difference between hadoop fs -put and hadoop distcp

We are starting the ingestion phase of our data lake project, and I have mostly used hadoop fs -put throughout my Hadoop developer experience. So what is the difference between it and hadoop distcp, and how does the usage differ?
distcp is a special tool used for copying data from one cluster to another. You typically copy from one HDFS to another HDFS, not from the local file system. Another very important point is that the copy is performed as a MapReduce job with zero reduce tasks, which makes it faster thanks to the distribution of operations. It expands a list of files and directories into input to map tasks, each of which copies a partition of the files specified in the source list.
hdfs put copies data from the local file system to HDFS. It uses the HDFS client behind the scenes and does all the work sequentially in a single client process, talking to the NameNode and DataNodes. It does not create any MapReduce jobs to process the data.
hdfs dfs -put (or hadoop fs -put) is used for data ingestion from the local file system into HDFS.
distcp cannot be used for data ingestion from Local to HDFS as it works only on HDFS filesystem.
We use distcp extensively for archiving (back-up and restore) of HDFS files, with something like this:
hadoop distcp $CURRENT_HDFS_PATH $BACKUP_HDFS_PATH
"distcp cannot be used for data ingestion from Local to HDFS as it works only on HDFS filesystem"
-> it can, use "file" (eg. "file:///tmp/test.txt") as schema in URL (https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html)
Hint: use "hadoop distcp -D dfs.replication=1" to decrease distcp process time during copy operation and later replicate the copied files.
distcp is a command used for copying data from one cluster's HDFS location to another cluster's HDFS location. It creates a MapReduce job with zero reducers to do the copying.
hadoop distcp webhdfs://source-ip/directory/filename webhdfs://target-ip/directory/
scp is the command used for copying data from one cluster's local file system to another cluster's local file system.
scp user@source-ip:/directory/filename user@target-ip:/directory/
hdfs put command - copies the data from local file system to hdfs. Does not create MapReduce jobs for processing the data.
hadoop fs -put -f /path/file /hdfspath/file
hdfs get command - copies the data from HDFS to the local file system.
First, go to the directory where you want the file to be copied, then run the command below:
hadoop fs -get /hdfsloc/file

Most efficient way to write data to hadoop

I am new to Hadoop HDFS. I am trying to learn how to write data read from a local file into HDFS. I want to know how to do this in an efficient way. Please help.
You can try it like this:
hadoop fs -put localpath hdfspath
Example
hadoop fs -put /user/sample.txt /sample.txt
You can find more HDFS commands in the FileSystem shell documentation.
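If you end up doing this from Java instead of the shell, the FileSystem.create() overload that takes a buffer size, replication factor, and block size is where the main tuning knobs live. A minimal sketch, where the paths and numbers are placeholders rather than recommendations:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WriteToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path dst = new Path("/user/me/ingest/sample.txt");

        try (InputStream in = new BufferedInputStream(new FileInputStream("/tmp/sample.txt"));
             // overwrite, I/O buffer size, replication factor, block size
             FSDataOutputStream out = fs.create(dst, true, 128 * 1024,
                                                (short) 3, 128L * 1024 * 1024)) {
            IOUtils.copyBytes(in, out, 128 * 1024, false); // stream local file -> HDFS
        }
        fs.close();
    }
}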

How do you retrieve the replication factor info for HDFS files?

I have set the replication factor for my file as follows:
hadoop fs -D dfs.replication=5 -copyFromLocal file.txt /user/xxxx
When a NameNode restarts, it makes sure under-replicated blocks are replicated.
Hence the replication info for the file must be stored somewhere (presumably in the NameNode). How can I get that information?
Try the command hadoop fs -stat %r /path/to/file; it should print the replication factor.
You can run the following command to get the replication factor:
hadoop fs -ls /user/xxxx
The second column of the output shows the replication factor for each file; for a folder it shows -.
Apart from Alexey Shestakov's answer, which works perfectly and does exactly what you ask, other ways include:
hadoop dfs -ls /parent/path
which shows the replication factors of all the /parent/path contents in the second column.
Through Java, you can get this information by using:
FileStatus.getReplication()
You can also see the replication factors of files by using:
hadoop fsck /filename -files -blocks -racks
Finally, I believe this information is also available from the NameNode web UI (I haven't checked that).
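Expanding on the FileStatus.getReplication() route, here is a small sketch (the path is just the example from the question) that reads the replication factor stored by the NameNode and also shows fs.setReplication(), which changes it after the file has been written:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/xxxx/file.txt");

        // Read the replication factor recorded for this file.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication = " + status.getReplication());

        // The factor can also be changed after the fact; the NameNode then
        // schedules extra replicas (or removals) in the background.
        fs.setReplication(file, (short) 5);
        fs.close();
    }
}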
We can use the following commands to check the replication factor of a file:
hdfs dfs -ls /user/cloudera/input.txt
or
hdfs dfs -stat %r /user/cloudera/input.txt
In case you need to check the replication factor of an HDFS directory,
hdfs fsck /tmp/data
shows the average replication factor of the /tmp/data HDFS folder.
