I've written a Hadoop program that requires a certain layout within HDFS, and afterwards I need to get the files out of HDFS. It works on my single-node Hadoop setup, and I'm eager to get it working on tens of nodes within Elastic MapReduce.
What I've been doing is something like this:
./elastic-mapreduce --create --alive
JOBID="j-XXX" # output from creation
./elastic-mapreduce -j $JOBID --ssh "hadoop fs -cp s3://bucket-id/XXX /XXX"
./elastic-mapreduce -j $JOBID --jar s3://bucket-id/jars/hdeploy.jar --main-class com.ranjan.HadoopMain --arg /XXX
This is asynchronous, but once the job has completed, I can do this:
./elastic-mapreduce -j $JOBID --ssh "hadoop fs -cp /XXX s3://bucket-id/XXX-output"
./elastic-mapreduce -j $JOBID --terminate
So while this sort of works, it's clunky and not what I'd like. Is there a cleaner way to do this?
Thanks!
You can use distcp, which copies the files as a MapReduce job:
# download from s3
$ hadoop distcp s3://bucket/path/on/s3/ /target/path/on/hdfs/
# upload to s3
$ hadoop distcp /source/path/on/hdfs/ s3://bucket/path/on/s3/
This makes use of your entire cluster to copy in parallel from s3.
(note: the trailing slashes on each path are important to copy from directory to directory)
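As a rough sketch of how the two copies could be wrapped in a script: the helper names with_slash and copy_dir below are hypothetical, and the bucket and paths are placeholders. The sketch only normalizes the trailing slash mentioned in the note above and then calls hadoop distcp.

```shell
# Hypothetical helper: append a trailing slash unless the path already ends with one.
with_slash() {
  case "$1" in
    */) printf '%s\n' "$1" ;;
    *)  printf '%s/\n' "$1" ;;
  esac
}

# Hypothetical helper: directory-to-directory copy with distcp.
# Works in either direction (S3 -> HDFS or HDFS -> S3).
copy_dir() {
  hadoop distcp "$(with_slash "$1")" "$(with_slash "$2")"
}

# Example calls (placeholder paths):
#   copy_dir s3://bucket/path/on/s3 /target/path/on/hdfs
#   copy_dir /source/path/on/hdfs s3://bucket/path/on/s3
```

Keeping the slash handling in one place means the download and upload steps cannot drift apart.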
@mat-kelcey, does the distcp command expect the files in S3 to have a minimum permission level? For some reason I have to set the permissions of the files to "Open/Download" and "View Permissions" for "Everyone" for the files to be accessible from within the bootstrap or step scripts.
I am running a hadoop distcp command as below:
hadoop distcp src-loc target-loc
I want to know the size of the data copied by running this command.
I am planning to run the command on Qubole.
Any help is appreciated
Run the following command:
hadoop fs -du -s -h target-loc
225.2 G target-loc
It will print the human readable summary for the target-loc.
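If the size is needed inside a script rather than on screen, a sketch along these lines might help. The helper name hdfs_bytes is hypothetical, and it assumes that the first column printed by hadoop fs -du -s is the size in bytes.

```shell
# Hypothetical helper: print the total size in bytes of an HDFS path.
# Assumes the first column of `hadoop fs -du -s` output is the byte count.
hdfs_bytes() {
  hadoop fs -du -s "$1" | awk 'NR == 1 { print $1 }'
}

# Example: measure the target before and after the copy (placeholder names).
#   before=$(hdfs_bytes target-loc 2>/dev/null)
#   hadoop distcp src-loc target-loc
#   after=$(hdfs_bytes target-loc)
#   echo "copied $((after - ${before:-0})) bytes"
```

Taking the difference only matters when target-loc already held data before the copy; for an empty or new target, the size after the copy is the answer.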
I'm trying to schedule a simple task with crontab: copying some files from the local filesystem to HDFS. My script is this:
#!/bin/ksh
ANIO=$(date +"%Y")
MES=$(date +"%m")
DIA=$(date +"%d")
HORA=$(date +"%H")
# LOCAL AND HDFS DIRECTORIES
DIRECTORIO_LOCAL="/home/cloudera/bicing/data/$ANIO/$MES/$DIA/stations"$ANIO$MES$DIA$HORA"*"
DIRECTORIO_HDFS="/bicing/data/$ANIO/$MES/$DIA/"
# Test if the destination directory exists and create it if necessary
echo "hdfs dfs -test -d $DIRECTORIO_HDFS">>/home/cloudera/bicing/data/logFile
hdfs dfs -test -d $DIRECTORIO_HDFS
if [ $? != 0 ]
then
echo "hdfs dfs -mkdir -p $DIRECTORIO_HDFS">>/home/cloudera/bicing/data/logFile
hdfs dfs -mkdir -p $DIRECTORIO_HDFS
fi
# Upload the files to HDFS
echo "hdfs dfs -put $DIRECTORIO_LOCAL $DIRECTORIO_HDFS">>/home/cloudera/bicing/data/logFile
hdfs dfs -put $DIRECTORIO_LOCAL $DIRECTORIO_HDFS
As you can see, it is quite simple: it only defines the folder variables, creates the directory in HDFS (if it doesn't exist), and copies the files from local to HDFS.
The script works if I launch it directly in the terminal, but when I schedule it with crontab it doesn't "put" the files in HDFS.
Moreover, the script creates a "logFile" with the commands that should have been executed. When I copy them to the terminal, they work perfectly.
hdfs dfs -test -d /bicing/data/2015/12/10/
hdfs dfs -mkdir -p /bicing/data/2015/12/10/
hdfs dfs -put /home/cloudera/bicing/data/2015/12/10/stations2015121022* /bicing/data/2015/12/10/
I have checked the directories and files, but I can't find the key to solving it.
Thanks in advance!!!
When you execute these commands on the console, they run fine because "HADOOP_HOME" is set. When the cron job runs, however, the "HADOOP_HOME" environment variable is most likely not available.
You can resolve this problem in 2 ways:
1. In the script, add the following statements at the beginning. This adds the Hadoop bin directory (where the hdfs command lives) and the paths of all the Hadoop jars to your environment.
export HADOOP_HOME={Path to your HADOOP_HOME}
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/etc/hadoop/:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/tools/*:$HADOOP_HOME/share/hadoop/tools/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/yarn/lib/*
2. You can also update your .profile ($HOME/.profile) or .kshrc ($HOME/.kshrc) to include the Hadoop paths.
That should solve your problem.
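To confirm what actually goes wrong under cron, it can help to capture the error output of each command, since the script in the question logs only the command text. The following is a minimal sketch: the function name run_logged and the default log path are assumptions, not part of the original script.

```shell
# Assumed log location; the script in the question logs to
# /home/cloudera/bicing/data/logFile instead.
LOG_FILE="${LOG_FILE:-/tmp/hdfs_upload.log}"

# Hypothetical helper: log the command, run it, and record its
# stdout, stderr, and exit status in the same log file.
run_logged() {
  echo "$(date '+%Y-%m-%d %H:%M:%S') running: $*" >> "$LOG_FILE"
  "$@" >> "$LOG_FILE" 2>&1
  status=$?
  echo "exit status: $status" >> "$LOG_FILE"
  return "$status"
}

# Example (the glob is left unquoted on purpose so the shell expands it):
#   run_logged hdfs dfs -mkdir -p "$DIRECTORIO_HDFS"
#   run_logged hdfs dfs -put $DIRECTORIO_LOCAL "$DIRECTORIO_HDFS"
```

If the environment is the problem, the log should then contain a message such as a "not found" error for hdfs rather than just the command line.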
I'm not asking how to set the replication factor in Hadoop for a folder/file. I know the following command works flawlessly for existing files and folders.
hadoop fs -setrep -R -w 3 <folder-path>
I'm asking how to set a replication factor other than the default (which is 4 in my scenario) while copying data from local. I'm running the following command:
hadoop fs -copyFromLocal <src> <dest>
When I run the above command, it copies the data from the src to the dest path with a replication factor of 4. But I want the replication factor to be 1 while the data is being copied, not after copying is complete. Basically I want something like this:
hadoop fs -setrep -R 1 -copyFromLocal <src> <dest>
I tried it, but it didn't work. So, can it be done? Or do I have to first copy the data with replication factor 4 and then run the setrep command?
According to this post and this post (both asking different questions), this command seems to work:
hadoop fs -D dfs.replication=1 -copyFromLocal <src> <dest>
The -D option means "Use value for given property."
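A small sketch of how this could be wrapped and checked: the function name put_with_replication is hypothetical, and the check relies on hadoop fs -stat %r, which prints the replication factor of a file (for a directory it is not meaningful).

```shell
# Hypothetical helper: copy a local file into HDFS with a given replication
# factor, then print the replication factor HDFS reports for the destination.
put_with_replication() {
  rep=$1
  src=$2
  dest=$3
  hadoop fs -D dfs.replication="$rep" -copyFromLocal "$src" "$dest" &&
    hadoop fs -stat %r "$dest"
}

# Example (placeholder paths):
#   put_with_replication 1 ./data.csv /user/me/data.csv
```

Note that the -D generic option has to come before -copyFromLocal, as in the command above.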
I am running the below step in my cluster in EMR:
./elastic-mapreduce -j CLUSTERID --jar s3n://mybucket/somejar \
--main-class SomeClass \
--arg -conf --arg 's3n://mybucket/configuration.xml'
SomeClass is a Hadoop job and implements the Runnable interface. It reads configuration.xml for parameters, but with the above command SomeClass cannot access "s3n://mybucket/configuration.xml" (no error is reported). I tried "s3://mybucket/configuration.xml" and it does not work either. I am sure the file exists, since I can see it with "hadoop fs -ls s3n://mybucket/configuration.xml". Any suggestions?
Thanks,
Here are some options to try:
1. Use s3 instead of s3n.
2. Check the access permissions for the S3 bucket. You can also specify a log location and check the log after the job fails. You can create the job flow like this:
elastic-mapreduce --create --name "j_flow_name" --log-uri "s3://your_s3_bucket"
This gives you more debug information.
3. Pass the arguments like this:
./elastic-mapreduce -j JobFlowId --jar s3://your_bucket --arg "s3://your_conf_file_bucket_name" --arg "second parameter"
For more detailed information, see the EMR CLI documentation.
I get multiple small files in my input directory, which I want to merge into a single file without using the local file system or writing MapReduce jobs. Is there a way I could do it using hadoop fs commands or Pig?
Thanks!
In order to keep everything on the grid, use Hadoop streaming with a single reducer and cat as both the mapper and the reducer (basically a no-op). Compression can be added using MR flags.
hadoop jar \
$HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
-Dmapred.reduce.tasks=1 \
-Dmapred.job.queue.name=$QUEUE \
-input "$INPUT" \
-output "$OUTPUT" \
-mapper cat \
-reducer cat
If you want compression add
-Dmapred.output.compress=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
hadoop fs -getmerge <dir_of_input_files> <mergedsinglefile>
okay...I figured out a way using hadoop fs commands -
hadoop fs -cat [dir]/* | hadoop fs -put - [destination file]
It worked when I tested it...any pitfalls one can think of?
Thanks!
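One way to script this pipe, as a hedged sketch: the function name merge_in_hdfs is hypothetical. It refuses to overwrite an existing destination and quotes the glob so that HDFS, not the local shell, expands it.

```shell
# Hypothetical helper: concatenate every file in an HDFS directory into a
# single HDFS file by streaming the data through the local machine.
merge_in_hdfs() {
  src_dir=$1
  dest_file=$2
  # `hadoop fs -test -e` exits with 0 when the path already exists.
  if hadoop fs -test -e "$dest_file"; then
    echo "refusing to overwrite $dest_file" >&2
    return 1
  fi
  # The glob is quoted so that it reaches hadoop unexpanded.
  hadoop fs -cat "$src_dir/*" | hadoop fs -put - "$dest_file"
}

# Example (placeholder paths):
#   merge_in_hdfs /reports/some_output /reports/some_output.txt
```

Two pitfalls remain with this approach: all the data flows through the machine running the pipe, and in a plain POSIX shell the pipeline's exit status is that of the put, so a failing cat can go unnoticed.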
If you set up fuse to mount your HDFS to a local directory, then your output can be the mounted filesystem.
For example, I have our HDFS mounted to /mnt/hdfs locally. I run the following command and it works great:
hadoop fs -getmerge /reports/some_output /mnt/hdfs/reports/some_output.txt
Of course, there are other reasons to use fuse to mount HDFS to a local directory, but this was a nice side effect for us.
You can use the tool HDFSConcat, new in HDFS 0.21, to perform this operation without incurring the cost of a copy.
If you are working on a Hortonworks cluster and want to merge multiple files in an HDFS location into a single file, you can run the 'hadoop-streaming-2.7.1.2.3.2.0-2950.jar' jar, which runs a single reducer and writes the merged file to the HDFS output location.
$ hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar \
-Dmapred.reduce.tasks=1 \
-input "/hdfs/input/dir" \
-output "/hdfs/output/dir" \
-mapper cat \
-reducer cat
You can download this jar from: Get hadoop streaming jar
If you are writing Spark jobs and want a merged file, to avoid multiple RDD creations and performance bottlenecks, use this piece of code before transforming your RDD:
sc.textFile("hdfs://...../part*").coalesce(1).saveAsTextFile("hdfs://...../filename")
This will merge all part files into one and save it again into hdfs location
Addressing this from the Apache Pig perspective:
To merge two files with an identical schema via Pig, the UNION command can be used:
A = load 'tmp/file1' Using PigStorage('\t') as ....(schema1);
B = load 'tmp/file2' Using PigStorage('\t') as ....(schema1);
C = UNION A, B;
store C into 'tmp/fileoutput' Using PigStorage('\t');
All the solutions are equivalent to doing a
hadoop fs -cat [dir]/* > tmp_local_file
hadoop fs -copyFromLocal tmp_local_file
It only means that the local machine's I/O is on the critical path of the data transfer.