Does hadoop auto-copy input files not on HDFS? - hadoop

Using hadoop streaming:
hadoop jar hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar -file mapper.rb -mapper mapper.rb -file reducer.rb -reducer reducer.rb -input textfile.txt -output output
Assuming the directory I am in is "/home/user/sei/Documents" and that textfile.txt
1) is in the same folder as the directory I am currently in, and
2) was never put into HDFS with -copyFromLocal,
does Hadoop automatically copy the input files (in this case textfile.txt) to some location on HDFS (e.g. "/user/sei/textfile.txt") upon execution and use that copy for processing? Does this apply to all hadoop commands (e.g. hadoop jar jarfile myfilename)?

No, it does not copy the files into HDFS; you will have to do that yourself. If you are running a single node, or a pseudo-distributed cluster on one machine, you should be OK with a local file path. But if you are running a distributed cluster, the mappers and reducers will not be able to find that file.
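For example, you could stage the input into HDFS yourself before running the streaming job. A minimal sketch, where /user/sei/textfile.txt is just an illustrative destination path:
hadoop fs -copyFromLocal /home/user/sei/Documents/textfile.txt /user/sei/textfile.txt
hadoop jar hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar \
    -file mapper.rb -mapper mapper.rb \
    -file reducer.rb -reducer reducer.rb \
    -input /user/sei/textfile.txt -output output
Note that the files passed with -file (mapper.rb, reducer.rb) are shipped to the cluster as part of the job submission; on a real cluster it is only the -input path that already has to live in HDFS.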

Related

How to change replication factor while running copyFromLocal command?

I'm not asking how to set the replication factor in Hadoop for a folder/file. I know the following command works flawlessly for existing files and folders.
hadoop fs -setrep -R -w 3 <folder-path>
I'm asking how to set a replication factor other than the default (which is 4 in my scenario) while copying data from local. I'm running the following command,
hadoop fs -copyFromLocal <src> <dest>
When I run the above command, it copies the data from src to the dest path with a replication factor of 4. But I want the replication factor to be 1 while copying the data, not changed after the copy is complete. Basically I want something like this,
hadoop fs -setrep -R 1 -copyFromLocal <src> <dest>
I tried it, but it didn't work. So, can it be done? Or do I have to first copy the data with replication factor 4 and then run the setrep command?
According to this post and this post (both asking different questions), this command seems to work:
hadoop fs -D dfs.replication=1 -copyFromLocal <src> <dest>
The -D option means "Use value for given property."
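To verify the result after copying, you can list the destination; the second column of the hadoop fs -ls output is the replication factor of each file:
hadoop fs -ls <dest>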

hadoop cp vs streaming with /bin/cat as mapper and reducer

I am new to Hadoop and have a very basic question about hadoop copy (cp) vs. hadoop streaming when /bin/cat is used as the mapper and reducer.
hadoop jar /path/to/hadoop-streaming.jar -input <input> -output <output> \
    -mapper /bin/cat -reducer /bin/cat
I believe the above command would copy the files (how is it different from hadoop cp?); correct me if my understanding is wrong.
They do roughly the same thing, but in different ways:
hadoop fs -cp just invokes the Java HDFS API and performs a copy to the specified location, which is much faster than the streaming solution.
hadoop streaming, on the other hand (see the example command below), kicks off a MapReduce job. Like any other MapReduce job it has to go through the map -> sort & shuffle -> reduce phases, which can take a long time to complete depending on the size of your input dataset. Because of the default sort & shuffle phase, your input data also ends up sorted in the output directory.
hadoop jar /path/to/hadoop-streaming.jar \
    -input /input/path \
    -output /output/path \
    -mapper /bin/cat \
    -reducer /bin/cat
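For comparison, the plain copy the answer refers to is a single client-driven HDFS copy with no MapReduce job involved, and it leaves the files byte-for-byte unchanged (same placeholder paths as above):
hadoop fs -cp /input/path /output/path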

Getting data in and out of Elastic MapReduce HDFS

I've written a Hadoop program which requires a certain layout within HDFS, and afterwards I need to get the files back out of HDFS. It works on my single-node Hadoop setup, and I'm eager to get it working on tens of nodes within Elastic MapReduce.
What I've been doing is something like this:
./elastic-mapreduce --create --alive
JOBID="j-XXX" # output from creation
./elastic-mapreduce -j $JOBID --ssh "hadoop fs -cp s3://bucket-id/XXX /XXX"
./elastic-mapreduce -j $JOBID --jar s3://bucket-id/jars/hdeploy.jar --main-class com.ranjan.HadoopMain --arg /XXX
This is asynchronous, but when the job's completed, I can do this
./elastic-mapreduce -j $JOBID --ssh "hadoop fs -cp /XXX s3://bucket-id/XXX-output"
./elastic-mapreduce -j $JOBID --terminate
So while this sort of works, it's clunky and not what I'd like. Is there a cleaner way to do this?
Thanks!
You can use distcp, which will copy the files as a MapReduce job:
# download from s3
$ hadoop distcp s3://bucket/path/on/s3/ /target/path/on/hdfs/
# upload to s3
$ hadoop distcp /source/path/on/hdfs/ s3://bucket/path/on/s3/
This makes use of your entire cluster to copy in parallel from s3.
(note: the trailing slashes on each path are important to copy from directory to directory)
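If you'd rather keep the existing script-driven workflow, the same distcp calls can be pushed through the --ssh mechanism already used above. A sketch, reusing the JOBID and bucket placeholders from the question (note the trailing slashes):
./elastic-mapreduce -j $JOBID --ssh "hadoop distcp s3://bucket-id/XXX/ /XXX/"
./elastic-mapreduce -j $JOBID --ssh "hadoop distcp /XXX/ s3://bucket-id/XXX-output/"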
@mat-kelcey, does the distcp command expect the files in S3 to have a minimum permission level? For some reason I have to set the permission levels of the files to "Open/Download" and "View Permissions" for "Everyone" for the files to be accessible from within the bootstrap or step scripts.

Starting jobs with direct calls to Hadoop from within SSH

I've been able to kick off job flows using the elastic-mapreduce Ruby library just fine. Now I have an instance which is still 'alive' after its jobs have finished. I've logged in to it using SSH and would like to start another job, but each of my attempts has failed because Hadoop can't find the input file. I've tried storing the input file locally and on S3.
How can I create new hadoop jobs directly from within my SSH session?
The errors from my attempts:
(first attempt using local file storage, which I'd created by uploading files using SFTP)
hadoop jar hadoop-0.20-streaming.jar \
-input /home/hadoop/mystic/search_sets/test_sample.txt \
-output /home/hadoop/mystic/search_sets/test_sample_output.txt \
-mapper /home/hadoop/mystic/ctmp1_mapper.py \
-reducer /home/hadoop/mystic/ctmp1_reducer.py \
-file /home/hadoop/mystic/ctmp1_mapper.py \
-file /home/hadoop/mystic/ctmp1_reducer.py
11/10/04 22:33:57 ERROR streaming.StreamJob: Error Launching job :Input path does not exist: hdfs://ip-xx-xxx-xxx-xxx.us-west-1.compute.internal:9000/home/hadoop/mystic/search_sets/test_sample.txt
(second attempt using s3):
hadoop jar hadoop-0.20-streaming.jar \
-input s3n://xxxbucket1/test_sample.txt \
-output /home/hadoop/mystic/search_sets/test_sample_output.txt \
-mapper /home/hadoop/mystic/ctmp1_mapper.py \
-reducer /home/hadoop/mystic/ctmp1_reducer.py \
-file /home/hadoop/mystic/ctmp1_mapper.py \
-file /home/hadoop/mystic/ctmp1_reducer.py
11/10/04 22:26:45 ERROR streaming.StreamJob: Error Launching job : Input path does not exist: s3n://xxxbucket1/test_sample.txt
The first will not work. Hadoop will look for that location in HDFS, not local storage. It might work if you use the file:// prefix, like this:
-input file:///home/hadoop/mystic/search_sets/test_sample.txt
I've never tried this with streaming input, though, and it probably isn't the best idea even if it does work.
The second (S3) should work. We do this all the time. Make sure the file actually exists with:
hadoop dfs -ls s3n://xxxbucket1/test_sample.txt
Alternately, you could put the file in HDFS and use it normally. For jobs in EMR, though, I usually find S3 to be the most convenient.
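A minimal sketch of that last option, assuming /user/hadoop is a writable HDFS directory (the paths are just examples):
hadoop fs -put /home/hadoop/mystic/search_sets/test_sample.txt /user/hadoop/test_sample.txt
hadoop jar hadoop-0.20-streaming.jar \
    -input /user/hadoop/test_sample.txt \
    -output /user/hadoop/test_sample_output \
    -mapper /home/hadoop/mystic/ctmp1_mapper.py \
    -reducer /home/hadoop/mystic/ctmp1_reducer.py \
    -file /home/hadoop/mystic/ctmp1_mapper.py \
    -file /home/hadoop/mystic/ctmp1_reducer.py
The -output path is also interpreted as an HDFS path and must not exist before the job runs.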

Merging multiple files into one within Hadoop

I get multiple small files into my input directory which I want to merge into a single file, without using the local file system or writing MapReduce jobs. Is there a way I could do it using hadoop fs commands or Pig?
Thanks!
In order to keep everything on the grid, use hadoop streaming with a single reducer and cat as the mapper and reducer (basically a no-op) - add compression using MR flags.
hadoop jar \
$HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
-Dmapred.reduce.tasks=1 \
-Dmapred.job.queue.name=$QUEUE \
-input "$INPUT" \
-output "$OUTPUT" \
-mapper cat \
-reducer cat
If you want compression, add:
-Dmapred.output.compress=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
hadoop fs -getmerge <dir_of_input_files> <mergedsinglefile>
Okay... I figured out a way using hadoop fs commands:
hadoop fs -cat [dir]/* | hadoop fs -put - [destination file]
It worked when I tested it... any pitfalls anyone can think of?
Thanks!
If you set up fuse to mount your HDFS to a local directory, then your output can be the mounted filesystem.
For example, I have our HDFS mounted to /mnt/hdfs locally. I run the following command and it works great:
hadoop fs -getmerge /reports/some_output /mnt/hdfs/reports/some_output.txt
Of course, there are other reasons to use fuse to mount HDFS to a local directory, but this was a nice side effect for us.
You can use the tool HDFSConcat, new in HDFS 0.21, to perform this operation without incurring the cost of a copy.
If you are working on a Hortonworks cluster and want to merge multiple files present in an HDFS location into a single file, you can run the 'hadoop-streaming-2.7.1.2.3.2.0-2950.jar' jar, which runs a single reducer and writes the merged file to the HDFS output location.
$ hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar \
-Dmapred.reduce.tasks=1 \
-input "/hdfs/input/dir" \
-output "/hdfs/output/dir" \
-mapper cat \
-reducer cat
You can download this jar from: Get hadoop streaming jar
If you are writing Spark jobs and want to get a merged file, to avoid multiple RDD creations and performance bottlenecks, use this piece of code before transforming your RDD:
sc.textFile("hdfs://...../part*").coalesce(1).saveAsTextFile("hdfs://...../filename")
This will merge all the part files into one and save it back to the HDFS location.
Addressing this from an Apache Pig perspective:
To merge two files with an identical schema via Pig, the UNION command can be used:
A = load 'tmp/file1' Using PigStorage('\t') as ....(schema1);
B = load 'tmp/file2' Using PigStorage('\t') as ....(schema1);
C = UNION A, B;
store C into 'tmp/fileoutput' Using PigStorage('\t');
All these solutions are equivalent to doing
hadoop fs -cat [dir]/* > tmp_local_file
hadoop fs -copyFromLocal tmp_local_file [destination file]
It just means that the local machine's I/O is on the critical path of the data transfer.
