How to change replication factor while running copyFromLocal command? - hadoop

I'm not asking how to set replication factor in hadoop for a folder/file. I know following command works flawlessly for existing files & folders.
hadoop fs -setrep -R -w 3 <folder-path>
I'm asking, how do I set the replication factor, other than default (which is 4 in my scenario), while copying data from local. I'm running following command,
hadoop fs -copyFromLocal <src> <dest>
When I run above commands, it copies the data from src to dest path with replication factor as 4. But I want to make replication factor as 1 while copying data but not after copying is complete. Bascially I want something like this,
hadoop fs -setrep -R 1 -copyFromLocal <src> <dest>
I tried it, but it didn't work. So, can it be done? or I've first copy data with replication factor 4 and then run setrep command?

According to this post and this post (both asking different questions), this command seems to work:
hadoop fs -D dfs.replication=1 -copyFromLocal <src> <dest>
The -D option means "Use value for given property."

Related

how to send files to hdfs while keeping their basename

Someone suggest to me, what's the best solution to shipp files from different sources and store them in hdfs based on their names. My situation is :
I have a server that has large number of files and I need to send them to HDFS.
Actually I used flume, in its config I tried spooldir and ftp as sources, but both of them has disadvantages.
So any idea, how to do that ?
Use the hadoop put command:
put
Usage: hadoop fs -put [-f] [-p] [-l] [-d] [ - | .. ].
Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system if the source is set to “-”
Copying fails if the file already exists, unless the -f flag is given.
Options:
-p : Preserves access and modification times, ownership and the permissions. (assuming the permissions can be propagated across filesystems)
-f : Overwrites the destination if it already exists.
-l : Allow DataNode to lazily persist the file to disk, Forces a replication factor of 1. This flag will result in reduced durability. Use with care.
-d : Skip creation of temporary file with the suffix .COPYING.
Examples:
hadoop fs -put localfile /user/hadoop/hadoopfile
hadoop fs -put -f localfile1 localfile2 /user/hadoop/hadoopdir
hadoop fs -put -d localfile hdfs://nn.example.com/hadoop/hadoopfile
hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile Reads the input from stdin.
Exit Code:
Returns 0 on success and -1 on error.
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#put

Pig command to copy to HDFS from local FS of master node

I have this pig command executed through oozie:
fs -put -f /home/test/finalreports/accountReport.csv /user/hue/intermediateBingReports
/home/test/finalreports/accountReport.csv is created on local filesystem of only one of the hdfs nodes. I recently added a new HDFS node and this command fails on that hdfs node since /home/test/finalreports/accountReport.csv doesn't exist there.
What is the way to go for this?
I came across this but it doesn't seem to work for me:
Tried the following command:
hadoop fs -fs masternode:8020 -put /home/test/finalreports/accountReport.csv hadoopFolderName/
I get:
put: `/home/test/finalreports/accountReport.csv': No such file or directory

How to change HDFS replication factor for HIVE alone

Our current HDFS Cluster has replication factor 1.But to improve the performance and reliability(node failure) we want to increase Hive intermediate files (hive.exec.scratchdir) replication factor alone to 5. Is it possible to implement that ?
Regards,
Selva
See if -setrep helps you.
setrep
Usage:
hadoop fs -setrep [-R] [-w] <numReplicas> <path>
Changes the replication factor of a file. If path is a directory then the command recursively changes the replication factor of all files under the directory tree rooted at path.
Options:
The -w flag requests that the command wait for the replication to complete. This can potentially take a very long time.
The -R flag is accepted for backwards compatibility. It has no effect.
Example:
hadoop fs -setrep -w 3 /user/hadoop/dir1
hadoop fs -setrep -R -w 100 /path/to/hive/warehouse
Reference: -setrep

How to unzip file in hadoop?

I was trying to unzip a zip file, stored in Hadoop file system, & store it back in hadoop file system. I tried following commands, but none of them worked.
hadoop fs -cat /tmp/test.zip|gzip -d|hadoop fs -put - /tmp/
hadoop fs -cat /tmp/test.zip|gzip -d|hadoop fs -put - /tmp
hadoop fs -cat /tmp/test.zip|gzip -d|hadoop put - /tmp/
hadoop fs -cat /tmp/test.zip|gzip -d|hadoop put - /tmp
I get errors like gzip: stdin has more than one entry--rest ignored, cat: Unable to write to output stream., Error: Could not find or load main class put on terminal, when I run those commands. Any help?
Edit 1: I don't have access to UI. So, only command lines are allowed. Unzip/gzip utils are installed on my hadoop machine. I'm using Hadoop 2.4.0 version.
To unzip a gzipped (or bzipped) file, I use the following
hdfs dfs -cat /data/<data.gz> | gzip -d | hdfs dfs -put - /data/
If the file sits on your local drive, then
zcat <infile> | hdfs dfs -put - /data/
I use most of the times hdfs fuse mounts for this
So you could just do
$ cd /hdfs_mount/somewhere/
$ unzip file_in_hdfs.zip
http://www.cloudera.com/content/www/en-us/documentation/archive/cdh/4-x/4-7-1/CDH4-Installation-Guide/cdh4ig_topic_28.html
Edit 1/30/16: In case if you use hdfs ACLs: In some cases fuse mounts don't adhere to hdfs ACLs, so you'll be able to do file operations that are permitted by basic unix access privileges. See https://issues.apache.org/jira/browse/HDFS-6255, comments at the bottom that I recently asked to reopen.
To stream the data through a pipe to hadoop, you need to use the hdfs command.
cat mydatafile | hdfs dfs -put - /MY/HADOOP/FILE/PATH/FILENAME.EXTENSION
gzip use -c to read data from stdin
hadoop fs -put doesnt support read the data from stdin
I tried a lots of things and would help.I cant find the zip input support of hadoop.So it left me no choice but download the hadoop file to local fs ,unzip it and upload to hdfs again.

Merging multiple files into one within Hadoop

I get multiple small files into my input directory which I want to merge into a single file without using the local file system or writing mapreds. Is there a way I could do it using hadoof fs commands or Pig?
Thanks!
In order to keep everything on the grid use hadoop streaming with a single reducer and cat as the mapper and reducer (basically a noop) - add compression using MR flags.
hadoop jar \
$HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \<br>
-Dmapred.reduce.tasks=1 \
-Dmapred.job.queue.name=$QUEUE \
-input "$INPUT" \
-output "$OUTPUT" \
-mapper cat \
-reducer cat
If you want compression add
-Dmapred.output.compress=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
hadoop fs -getmerge <dir_of_input_files> <mergedsinglefile>
okay...I figured out a way using hadoop fs commands -
hadoop fs -cat [dir]/* | hadoop fs -put - [destination file]
It worked when I tested it...any pitfalls one can think of?
Thanks!
If you set up fuse to mount your HDFS to a local directory, then your output can be the mounted filesystem.
For example, I have our HDFS mounted to /mnt/hdfs locally. I run the following command and it works great:
hadoop fs -getmerge /reports/some_output /mnt/hdfs/reports/some_output.txt
Of course, there are other reasons to use fuse to mount HDFS to a local directory, but this was a nice side effect for us.
You can use the tool HDFSConcat, new in HDFS 0.21, to perform this operation without incurring the cost of a copy.
If you are working in Hortonworks cluster and want to merge multiple file present in HDFS location into a single file then you can run 'hadoop-streaming-2.7.1.2.3.2.0-2950.jar' jar which runs single reducer and get the merged file into HDFS output location.
$ hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar \
-Dmapred.reduce.tasks=1 \
-input "/hdfs/input/dir" \
-output "/hdfs/output/dir" \
-mapper cat \
-reducer cat
You can download this jar from
Get hadoop streaming jar
If you are writing spark jobs and want to get a merged file to avoid multiple RDD creations and performance bottlenecks use this piece of code before transforming your RDD
sc.textFile("hdfs://...../part*).coalesce(1).saveAsTextFile("hdfs://...../filename)
This will merge all part files into one and save it again into hdfs location
Addressing this from Apache Pig perspective,
To merge two files with identical schema via Pig, UNION command can be used
A = load 'tmp/file1' Using PigStorage('\t') as ....(schema1)
B = load 'tmp/file2' Using PigStorage('\t') as ....(schema1)
C = UNION A,B
store C into 'tmp/fileoutput' Using PigStorage('\t')
All the solutions are equivalent to doing a
hadoop fs -cat [dir]/* > tmp_local_file
hadoop fs -copyFromLocal tmp_local_file
it only means that the local m/c I/O is on the critical path of data transfer.

Resources