retrieve size of data copied with hadoop distcp - hadoop

I am running a hadoop distcp command as below:
hadoop distcp src-loc target-loc
I want to know the size of the data copied by running this command.
I am planning to run the command on Qubole.
Any help is appreciated.

Run the following command:
hadoop dfs -dus -h target-loc
225.2 G target-loc
It will print a human-readable summary of the total size under target-loc.
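Note that on recent Hadoop releases hadoop dfs and the -dus flag are deprecated; assuming your installation ships the newer shell, the equivalent summary is:
# human-readable total size of everything under target-loc
hdfs dfs -du -s -h target-loc
distcp itself also reports job counters when it finishes (typically including the number of bytes copied), so the size can often be read straight from the command's output as well.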

Related

Hadoop Mapreduce get job history in pseudo-distributed mode

I am running Hadoop MapReduce and YARN in pseudo-distributed mode and I want to get the job history log. To get that, I tried solution 2 in this question, so from the directory
hadoop-3.0.0/bin
I executed
$ ./hdfs dfs -ls /tmp/hadoop-uname/mapred
The following is what I get as a response:
ls: `/tmp/hadoop-uname/mapred': No such file or directory
I get the same response for
$ ./hdfs dfs -ls /tmp/hadoop-uname/mapred/staging
as well.
My questions are:
1) Are job history logs generated in pseudo-distributed mode?
2) Is logging turned on by default? Or do I need to change some other setting to turn it on?
3) Am I missing anything else?
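In Hadoop 3.x the job history is normally served by the MapReduce JobHistory server rather than read from /tmp/hadoop-<user>/mapred, and the history files live under the YARN staging directory by default. The commands and paths below are a sketch assuming default mapred-site.xml settings, so verify them against your configuration:
# start the JobHistory server (Hadoop 3.x daemon syntax), run from hadoop-3.0.0/bin
./mapred --daemon start historyserver
# default staging and done locations, controlled by yarn.app.mapreduce.am.staging-dir,
# mapreduce.jobhistory.intermediate-done-dir and mapreduce.jobhistory.done-dir
./hdfs dfs -ls /tmp/hadoop-yarn/staging
./hdfs dfs -ls /tmp/hadoop-yarn/staging/history/done
Once the server is running, the history is also browsable on its web UI (port 19888 by default).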

Pig command to copy to HDFS from local FS of master node

I have this pig command executed through oozie:
fs -put -f /home/test/finalreports/accountReport.csv /user/hue/intermediateBingReports
/home/test/finalreports/accountReport.csv is created on the local filesystem of only one of the HDFS nodes. I recently added a new HDFS node, and this command fails on that node since /home/test/finalreports/accountReport.csv doesn't exist there.
What is the way to go for this?
I came across this, but it doesn't seem to work for me. I tried the following command:
hadoop fs -fs masternode:8020 -put /home/test/finalreports/accountReport.csv hadoopFolderName/
I get:
put: `/home/test/finalreports/accountReport.csv': No such file or directory
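The -fs generic option only changes the default filesystem URI; -put still reads the local path on whichever node actually executes the command, which is why it fails on hosts that don't have the file. A hedged workaround, assuming the report can simply be staged through HDFS, is to run the upload once from the node that has the file and let every downstream step read the HDFS copy:
# run these on the node where /home/test/finalreports/accountReport.csv exists
hadoop fs -mkdir -p /user/hue/intermediateBingReports
hadoop fs -put -f /home/test/finalreports/accountReport.csv /user/hue/intermediateBingReports/
# downstream Pig/Oozie steps can then reference the HDFS path from any node
hadoop fs -ls /user/hue/intermediateBingReports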

How to change replication factor while running copyFromLocal command?

I'm not asking how to set the replication factor in hadoop for a folder/file. I know the following command works flawlessly for existing files & folders.
hadoop fs -setrep -R -w 3 <folder-path>
I'm asking how to set the replication factor to something other than the default (which is 4 in my scenario) while copying data from local. I'm running the following command,
hadoop fs -copyFromLocal <src> <dest>
When I run the above command, it copies the data from the src to the dest path with a replication factor of 4. But I want the replication factor to be 1 while the data is being copied, not set after the copy is complete. Basically, I want something like this,
hadoop fs -setrep -R 1 -copyFromLocal <src> <dest>
I tried it, but it didn't work. So, can it be done? Or do I have to first copy the data with replication factor 4 and then run the setrep command?
According to this post and this post (both asking different questions), this command seems to work:
hadoop fs -D dfs.replication=1 -copyFromLocal <src> <dest>
The -D option means "Use value for given property."
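A quick way to confirm the per-copy setting took effect (the file and destination paths below are placeholders) is to check the replication the shell reports for the stored file:
# copy with replication 1 for this upload only; the cluster default is untouched
hadoop fs -D dfs.replication=1 -copyFromLocal data.csv /user/me/data.csv
# %r prints the replication factor of the stored file, expected to be 1 here
hadoop fs -stat %r /user/me/data.csv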

How to make mahout interact with hadoop HDFS

I am using HDP Mahout version 0.8. I have set MAHOUT_LOCAL="". When I run mahout, I see the message HADOOP LOCAL NOT SET RUNNING ON HADOOP, but my program is not writing its output to the HDFS directory.
Can anyone tell me how to make my mahout program take input from HDFS and write output to HDFS?
Did you add $MAHOUT_HOME/bin and $HADOOP_HOME/bin to the PATH?
For example on Linux:
export PATH=$PATH:$MAHOUT_HOME/bin/:$HADOOP_HOME/bin/
export HADOOP_CONF_DIR=$HADOOP_HOME/conf/
Then, almost all of Mahout's commands use the options -i (input) and -o (output).
For example:
mahout seqdirectory -i <input_path> -o <output_path> -chunk 64
Assuming you have your Mahout jar built so that it takes input from and writes output to HDFS, do the following:
From hadoop bin directory:
./hadoop jar /home/kuntal/Kuntal/BIG_DATA/mahout-recommender.jar mia.recommender.RecommenderIntro --tempDir /home/kuntal/Kuntal/BIG_DATA --recommenderClassName org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender
# Input/output args, specify if required:
-Dmapred.input.dir=./ratingsLess.txt -Dmapred.output.dir=/input/output
Please check this:
http://chimpler.wordpress.com/2013/02/20/playing-with-the-mahout-recommendation-engine-on-a-hadoop-cluster/
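As a hedged sanity check (the input and output paths are placeholders, and unsetting MAHOUT_LOCAL is equivalent to leaving it empty so Mahout runs on Hadoop), you can confirm that output actually lands in HDFS like this:
export PATH=$PATH:$MAHOUT_HOME/bin:$HADOOP_HOME/bin
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
unset MAHOUT_LOCAL   # any non-empty value would force local-filesystem output
mahout seqdirectory -i /user/me/input_docs -o /user/me/seq_output -chunk 64
hadoop fs -ls /user/me/seq_output   # the results should appear here, in HDFS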

Getting data in and out of Elastic MapReduce HDFS

I've written a Hadoop program which requires a certain layout within HDFS, and afterwards I need to get the files back out of HDFS. It works on my single-node Hadoop setup, and I'm eager to get it working on tens of nodes within Elastic MapReduce.
What I've been doing is something like this:
./elastic-mapreduce --create --alive
JOBID="j-XXX" # output from creation
./elastic-mapreduce -j $JOBID --ssh "hadoop fs -cp s3://bucket-id/XXX /XXX"
./elastic-mapreduce -j $JOBID --jar s3://bucket-id/jars/hdeploy.jar --main-class com.ranjan.HadoopMain --arg /XXX
This is asynchronous, but when the job's completed, I can do this
./elastic-mapreduce -j $JOBID --ssh "hadoop fs -cp /XXX s3://bucket-id/XXX-output"
./elastic-mapreduce -j $JOBID --terminate
So while this sort of works, it's clunky and not what I'd like. Is there a cleaner way to do this?
Thanks!
You can use distcp, which will copy the files as a MapReduce job:
# download from s3
$ hadoop distcp s3://bucket/path/on/s3/ /target/path/on/hdfs/
# upload to s3
$ hadoop distcp /source/path/on/hdfs/ s3://bucket/path/on/s3/
This makes use of your entire cluster to copy in parallel from S3.
(Note: the trailing slashes on each path are important for copying from directory to directory.)
#mat-kelcey, does the distcp command expect the files in S3 to have a minimum permission level? For some reason I have to set the permission levels of the files to "Open/Download" and "View Permissions" for "Everyone" for the files to be accessible from within the bootstrap or the step scripts.
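Related to the permission issue above, a hedged alternative on newer Hadoop releases is to address the bucket through the s3a connector and pass credentials as job properties instead of opening the objects up to everyone; the bucket path and key values below are placeholders:
# copy from S3 with the s3a connector (Hadoop 2.7+); the -D credentials are only
# needed if the cluster does not already have AWS credentials configured
hadoop distcp \
  -Dfs.s3a.access.key=YOUR_ACCESS_KEY \
  -Dfs.s3a.secret.key=YOUR_SECRET_KEY \
  s3a://bucket/path/on/s3/ /target/path/on/hdfs/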
