How to change the HDFS replication factor for Hive alone

Our current HDFS cluster has a replication factor of 1. To improve performance and reliability (in case of node failure), we want to increase the replication factor to 5 for Hive intermediate files (hive.exec.scratchdir) only. Is it possible to do that?
Regards,
Selva

See if -setrep helps you.
setrep
Usage:
hadoop fs -setrep [-R] [-w] <numReplicas> <path>
Changes the replication factor of a file. If path is a directory then the command recursively changes the replication factor of all files under the directory tree rooted at path.
Options:
The -w flag requests that the command wait for the replication to complete. This can potentially take a very long time.
The -R flag is accepted for backwards compatibility. It has no effect.
Example:
hadoop fs -setrep -w 3 /user/hadoop/dir1
hadoop fs -setrep -R -w 100 /path/to/hive/warehouse
Reference: -setrep
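
Note that -setrep only changes files that already exist under the path; files written later still get the default replication factor of the client that writes them. As a minimal sketch of applying the command to the Hive scratch directory, the helper below wraps the call. The function name, the DRY_RUN switch and the /tmp/hive default are assumptions made for illustration; use the actual value of hive.exec.scratchdir on your cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: raise the replication factor of everything that
# currently exists under a Hive scratch directory.
# SCRATCH_DIR is an assumption; use the value of hive.exec.scratchdir.
SCRATCH_DIR="${SCRATCH_DIR:-/tmp/hive}"

set_scratch_replication() {
  local replicas="$1" dir="${2:-$SCRATCH_DIR}"
  # Reject anything that is not a positive integer.
  case "$replicas" in
    ''|*[!0-9]*|0) echo "replicas must be a positive integer" >&2; return 1 ;;
  esac
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it.
    echo "hadoop fs -setrep -w $replicas $dir"
  else
    hadoop fs -setrep -w "$replicas" "$dir"
  fi
}
```

Running it with DRY_RUN=1 first, for example `DRY_RUN=1 set_scratch_replication 5 /tmp/hive`, shows the command without touching the cluster.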

Related

Know the disk space of data nodes in hadoop?

Is there a way or a command to find out the disk space of each datanode, or the total disk space of the cluster?
I tried the command
dfs -du -h /
but it seems that I do not have permission to execute it for many directories and hence cannot get the actual disk space.
From UI:
http://namenode:50070/dfshealth.html#tab-datanode
---> which will give you all the details about datanode.
From command line:
To get disk space of each datanode:
sudo -u hdfs hdfs dfsadmin -report
---> which will give you the details of entire HDFS and the individual datanodes OR
sudo -u hdfs hdfs dfs -du -h /
---> which will give you the total disk usage of each folder under root / directory
You can view the information about all datanodes and their disk usage in the namenode UI's Datanodes tab.
Total cluster disk space can be seen in the summary part of the main page.
http://namenode-ip:50070
If your Hadoop cluster is configured with simple security, you can run the commands below to get the usage of the datanodes.
export HADOOP_USER_NAME=hdfs
* The command above gives you admin privileges under simple security. If a different user is the HDFS admin, replace hdfs with that user.
hadoop dfsadmin -report
Another option is to log in to the respective datanode and run the Unix command below to get the disk utilization of that server.
df -h
Hadoop 3.2.0:
hduser@hadoop-node1:~$ hdfs dfs -df
Filesystem Size Used Available Use%
hdfs://hadoop-node1:54310 3000457228288 461352007680 821808787456 15%
For human-readable numbers, use:
hduser@hadoop-node1:~$ hdfs dfs -df -h
Filesystem Size Used Available Use%
hdfs://hadoop-node1:54310 2.7 T 429.7 G 765.4 G 15%
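
If the byte counts need to be processed in a script rather than read by eye, the plain -df output can be summarised with awk. This is only a sketch: the function name df_summary is made up, and it assumes the column layout shown above (filesystem URI, size, used, available, use%).

```shell
# Summarise `hdfs dfs -df` output (byte counts) read from stdin.
# Assumes the column layout shown above: URI, size, used, available, use%.
df_summary() {
  awk 'NR > 1 && NF >= 4 {
    gib = 1024 * 1024 * 1024
    printf "%s used=%.1fG available=%.1fG (%.1f%% used)\n",
           $1, $3 / gib, $4 / gib, 100 * $3 / $2
  }'
}
```

On a cluster it would be used as `hdfs dfs -df | df_summary`. For the sample output above it reports 429.7G used and 765.4G available, which agrees with the -h listing.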

how to send files to hdfs while keeping their basename

Can someone suggest the best way to ship files from different sources and store them in HDFS based on their names? My situation is:
I have a server with a large number of files, and I need to send them to HDFS.
I have tried Flume with spooldir and ftp as sources in its config, but both of them have disadvantages.
Any idea how to do that?
Use the hadoop put command:
put
Usage: hadoop fs -put [-f] [-p] [-l] [-d] [ - | <localsrc1> .. ] <dst>
Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system if the source is set to “-”.
Copying fails if the file already exists, unless the -f flag is given.
Options:
-p : Preserves access and modification times, ownership and the permissions. (assuming the permissions can be propagated across filesystems)
-f : Overwrites the destination if it already exists.
-l : Allow DataNode to lazily persist the file to disk. Forces a replication factor of 1. This flag will result in reduced durability. Use with care.
-d : Skip creation of temporary file with the suffix .COPYING.
Examples:
hadoop fs -put localfile /user/hadoop/hadoopfile
hadoop fs -put -f localfile1 localfile2 /user/hadoop/hadoopdir
hadoop fs -put -d localfile hdfs://nn.example.com/hadoop/hadoopfile
hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile
(Reads the input from stdin.)
Exit Code:
Returns 0 on success and -1 on error.
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#put
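
To keep each file's basename when uploading a whole local directory, the put command can be run once per file with an explicit destination path. The sketch below is one way to do that, not the only one; hdfs_dest_for and put_all are hypothetical names, and it only handles regular files directly inside the source directory.

```shell
# Destination path that keeps the local file's basename.
hdfs_dest_for() {
  local src="$1" dest_dir="${2%/}"
  printf '%s/%s\n' "$dest_dir" "$(basename "$src")"
}

# Upload every regular file in a local directory, one put per file.
# put_all and its arguments are hypothetical names for this sketch.
put_all() {
  local src_dir="$1" dest_dir="$2" f
  for f in "$src_dir"/*; do
    [ -f "$f" ] || continue   # skip subdirectories and unmatched globs
    hadoop fs -put "$f" "$(hdfs_dest_for "$f" "$dest_dir")" || return 1
  done
}
```

For example, `put_all /data/incoming /user/hadoop/incoming` would copy /data/incoming/app.log to /user/hadoop/incoming/app.log. Because -f is not passed, the copy stops at the first file that already exists at the destination.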

Remaining space on cloudera hadoop cluster in human readable format

I am looking for a command that shows the human readable form of the space left on hadoop cluster. I found a command on this forum and the output is in the image.
hdfs dfsadmin -report
[output of dfsadmin command][1]
I heard that there is another command in Hortonworks that gives more human-readable output, and that command is hdfs dfsadmin -report
That command doesn't seem to work on cloudera.
Is there any equivalent command in cloudera?
Thanks much
It shouldn't matter whether you're using Cloudera or Hortonworks. If you're using an older version of hadoop the command might be hadoop dfsadmin -report.
Other options you have are:
hadoop fs -df -h
$ hadoop fs -df -h
Filesystem Size Used Available Use%
hdfs://<IP>:8020 21.8 T 244.2 G 21.6 T 1%
Shows the capacity, free and used space of the filesystem. If the filesystem has
multiple partitions, and no path to a particular partition is specified, then
the status of the root partitions will be shown.
hadoop fs -du -h /
$ hadoop fs -du -h /
772 /home
437.3 M /mnt
0 /tmp
229.2 G /user
9.3 G /var
Shows the amount of space, in bytes, used by the files that match the specified file pattern.

Find out actual disk usage in HDFS

Is there a way to find out how much space is consumed in HDFS?
I used
hdfs dfs -df
but it does not seem to be reliable, because after deleting a huge amount of data with
hdfs dfs -rm -r -skipTrash
the previous command does not show the change immediately, only after several minutes (I need up-to-date disk usage info).
To see the space consumed by a particular folder try:
hadoop fs -du -s /folder/path
And if you want to see the usage, space consumed, space available, etc. of the whole HDFS:
hadoop dfsadmin -report
The hadoop dfs and hadoop dfsadmin commands are deprecated. Use hdfs dfs and hdfs dfsadmin instead.
Per folder:
sudo -u hdfs hdfs dfs -du -h /
Cluster-wide:
sudo -u hdfs hdfs dfsadmin -report
hadoop fs -count -q /path/to/directory

How to change replication factor while running copyFromLocal command?

I'm not asking how to set replication factor in hadoop for a folder/file. I know following command works flawlessly for existing files & folders.
hadoop fs -setrep -R -w 3 <folder-path>
I'm asking how to set a replication factor other than the default (which is 4 in my scenario) while copying data from the local file system. I'm running the following command,
hadoop fs -copyFromLocal <src> <dest>
When I run the above command, it copies the data from src to the dest path with a replication factor of 4. But I want the replication factor to be 1 while the data is being copied, not after the copy is complete. Basically I want something like this,
hadoop fs -setrep -R 1 -copyFromLocal <src> <dest>
I tried it, but it didn't work. So, can it be done? Or do I first have to copy the data with replication factor 4 and then run the setrep command?
According to this post and this post (both asking different questions), this command seems to work:
hadoop fs -D dfs.replication=1 -copyFromLocal <src> <dest>
The -D option means "Use value for given property."
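
A short sketch of how this could be wrapped and checked in a script follows. The generic -D option has to come before the -copyFromLocal subcommand. The function name copy_with_replication is made up, and the check with -stat %r assumes that dest is the full path of the resulting file rather than a directory.

```shell
# Copy a local file with a per-command replication factor, then print the
# replication factor HDFS reports for the new file.
# Assumes dest is the full path of the target file, not a directory.
copy_with_replication() {
  local replicas="$1" src="$2" dest="$3"
  # Generic options such as -D go before the subcommand.
  hadoop fs -D dfs.replication="$replicas" -copyFromLocal "$src" "$dest" || return 1
  # %r is the replication format specifier of `hadoop fs -stat`.
  hadoop fs -stat %r "$dest"
}
```

A call such as `copy_with_replication 1 local.txt /user/hadoop/remote.txt` should then end by printing the replication factor that was actually applied.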

Resources