HDFS space consumed: "hdfs dfs -du /" vs "hdfs dfsadmin -report" - hadoop

Which tool is the right one to measure HDFS space consumed?
When I sum up the output of "hdfs dfs -du /", I always get less space consumed than what "hdfs dfsadmin -report" shows (the "DFS Used" line). Is there data that du does not take into account?

The Hadoop file system provides reliable storage by putting a copy of the data on several nodes. The number of copies is the replication factor, and it is usually greater than one.
The command hdfs dfs -du / shows the space consumed by your data without replication.
The command hdfs dfsadmin -report (the "DFS Used" line) shows actual disk usage, taking replication into account. So it should be several times bigger than the number you get from the dfs -du command.
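For a quick side-by-side check (a small sketch; the grep pattern assumes the English report output, and dfsadmin may require superuser privileges):
# logical size of the data, replication not included
hdfs dfs -du -s -h /
# physical usage across all DataNodes, replication included
hdfs dfsadmin -report | grep "DFS Used"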

How HDFS storage works, in brief:
Let's say the replication factor = 3 (the default)
Data file size = 10 GB (e.g. xyz.log)
HDFS will take 10 x 3 = 30 GB to store that file
Depending on the command you use, you will get different values for the space occupied by HDFS (10 GB vs 30 GB)
If you are on a recent version of Hadoop, try the following command. In my case this works very well on Hortonworks Data Platform (HDP) 2.3.* and above, and it should also work on Cloudera's latest platform.
hadoop fs -count -q -h -v /path/to/directory
(-q = quota, -h = human readable values, -v = verbose)
This command will show the following fields in the output.
QUOTA REMAINING_QUOTA SPACE_QUOTA REMAINING_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE FILE_NAME
Where
CONTENT_SIZE = real file size without replication (10 GB), and
SPACE_QUOTA − REMAINING_SPACE_QUOTA = space occupied in HDFS to store the file, replication included (30 GB). Note that SPACE_QUOTA itself is the configured space quota on the directory; if no quota is set, these columns show none/inf.
Notes:
Control the replication factor by modifying the "dfs.replication" property found in the hdfs-site.xml file (under the conf/ dir of the default Hadoop installation directory). Changing this through Ambari/Cloudera Manager is recommended if you have a multinode cluster.
There are other commands to check storage space, e.g. hadoop fsck and hadoop dfs -dus. A quick way to check the replication factor actually applied to a file is sketched below.
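A small sketch (the path is a placeholder):
# print the replication factor recorded for a file
hdfs dfs -stat %r /path/to/xyz.log
# or look at the second column of a listing, which is the replication factor for files
hdfs dfs -ls /path/to/xyz.log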

Related

How to create new user in hadoop

I am new to Hadoop. I have done an Apache Hadoop multinode installation, and the user name is hadoop.
I am using 3 nodes in total: 1 namenode and 2 datanodes.
I have to create a new user for data isolation. I have found a few links on Google, but those are not working and I am unable to access HDFS.
[user1@datanode1 ~]# hdfs dfs -ls -R /
bash: hdfs: command not found...
Can someone help me with the steps to create a new user which can access HDFS for data isolation, and tell me on which node I should create the new user?
Thanks
Hadoop doesn't have users the way Linux does. Users are generally managed by external LDAP/Kerberos systems. By default, no security features are even enabled; usernames are taken from the HADOOP_USER_NAME environment variable and can be overridden with an export command. Also, by default, the current OS username is used, so your command user1@datanode1 # hdfs dfs -ls would actually run hdfs dfs -ls /user/user1, and return an error if that folder doesn't exist first.
However, your actual error says that your OS PATH variable does not include $HADOOP_HOME/bin. Edit your .bashrc to fix this, for example as sketched below.
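A minimal .bashrc addition, assuming Hadoop is installed under /opt/hadoop (adjust the path to your actual installation):
# make the hadoop/hdfs binaries discoverable (installation path is an assumption)
export HADOOP_HOME=/opt/hadoop
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
Then run source ~/.bashrc (or open a new shell) so the change takes effect.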
You'd create an HDFS home folder for the "username" user with:
hdfs dfs -mkdir /user/username
hdfs dfs -chown username /user/username
hdfs dfs -chmod -R 770 /user/username
You should also run the useradd command on the namenode machine to make sure it knows about a user named "username"; a combined sketch follows.
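A minimal sketch, assuming local OS accounts (no Kerberos/LDAP) and that an hdfs superuser account is available on the namenode (the sudo -u hdfs part is an assumption about your setup):
# on the namenode, as root: create the OS account
useradd username
# create and hand over the HDFS home directory, acting as the hdfs superuser
sudo -u hdfs hdfs dfs -mkdir -p /user/username
sudo -u hdfs hdfs dfs -chown username /user/username
sudo -u hdfs hdfs dfs -chmod -R 770 /user/username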

Hadoop: configure which cluster is queried based on a flag/env parameter

Apologies beforehand if this turns out to be a silly question; I am new to the Hadoop environment.
I have two Hadoop clusters, my-prod-cluster and my-bcp-cluster.
Both are accessible over the same network.
Is there any way to configure my clusters such that when I am in BCP mode, all my queries to my-prod-cluster get routed to my-bcp-cluster (on the basis of some config parameter or environment variable)?
So when flag=prod
hadoop fs -ls /my-prod-cluster/mydir translates to hadoop fs -ls /my-prod-cluster/mydir
and fetches the data in /my-prod-cluster/mydir
when the flag=bcp
hadoop fs -ls /my-prod-cluster/mydir translates to hadoop fs -ls /my-bcp-cluster/mydir
and fetches data from /my-bcp-cluster/mydir
I am using the MapR flavour of Hadoop (provided by HPE), version 6.1, in case that matters.
You could easily make a shell wrapper script that prepends the NameNode address to each query
For example, a fully-qualified command would look like this
hdfs dfs -ls hdfs://my-prod-cluster.domain.com/path/to/mydir
So, refactoring that, you could have a script like
#!/bin/sh
# pick the NameNode based on the first argument
if [ "$1" = "prod" ]; then
  NAMENODE=hdfs://my-prod-cluster.domain.com
fi
# TODO: error handling and more clusters

# use a dedicated variable; assigning to PATH would clobber the shell's executable search path
HDFS_PATH=$2
hdfs dfs -ls "${NAMENODE}${HDFS_PATH}"
Then execute something like my-hdfs-ls prod /mydir
If you need something more complex than that, like Kerberos tickets and such, then creating a separate HADOOP_CONF_DIR variable with unique core-site and hdfs-site XMLs for each cluster would be recommended, as sketched below.
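A minimal sketch of that approach, assuming you keep one configuration directory per cluster (the paths are placeholders):
# point the Hadoop CLI at the BCP cluster's core-site.xml / hdfs-site.xml
export HADOOP_CONF_DIR=/etc/hadoop/conf.bcp
hadoop fs -ls /mydir
# switch back to prod by pointing at the prod configuration directory
export HADOOP_CONF_DIR=/etc/hadoop/conf.prod
hadoop fs -ls /mydir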

HDFS copyFromLocal slow when using ssh

I am using ssh to issue a copyFromLocal command of HDFS like this (in a script):
ssh -t ubuntu@namenode_server "hdfs dfs -copyFromLocal data/file.csv /file.csv"
However, I am observing very peculiar behavior: this ssh command can take anywhere from 20 to 25 minutes for a 9 GB file, yet if I simply delete the file from HDFS and rerun the command, it always finishes within 4 minutes.
The transfer of the file also takes around 3-4 minutes when moving it from one HDFS cluster to another (even when I change the block size between the source and destination clusters).
I am using EC2 servers for the HDFS cluster. I am using Hadoop 2.7.6.
Not sure why it takes such a long time to copy the file from local file system to HDFS the first time.

How to determine Hive database size?

How do I determine a Hive database's size from Bash or from the Hive CLI?
The hdfs and hadoop commands are also available in Bash.
A database in Hive is a metadata store, meaning it holds information about tables and has a default location. Tables in a database can also be stored anywhere in HDFS if a location is specified when creating a table.
You can see all tables in a database using show tables command in Hive CLI.
Then, for each table, you can find its location in hdfs using describe formatted <table name> (again in Hive CLI).
Last, for each table you can find its size using hdfs dfs -du -s -h /table/location/
I don't think there's a single command to measure the sum of the sizes of all tables in a database. However, it should be fairly easy to write a script that automates the above steps; Hive can also be invoked from the Bash CLI using hive -e '<hive command>'. A sketch of such a script follows.
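A minimal sketch, assuming the hive and hdfs CLIs are on the PATH and that every table reports a Location: line in describe formatted (the parsing is intentionally simple):
#!/bin/sh
# hive-db-size.sh <database>: print the HDFS size of every table in the database
DB=$1
for TABLE in $(hive -e "use ${DB}; show tables;" 2>/dev/null); do
  # extract the table's HDFS location from 'describe formatted'
  LOCATION=$(hive -e "describe formatted ${DB}.${TABLE};" 2>/dev/null | grep 'Location:' | awk '{print $NF}')
  hdfs dfs -du -s -h "${LOCATION}"
done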
Show Hive databases on HDFS
sudo hadoop fs -ls /apps/hive/warehouse
Show Hive database size
sudo hadoop fs -du -s -h /apps/hive/warehouse/{db_name}
If you want the size of your complete database, run this on your "warehouse":
hdfs dfs -du -h /apps/hive/warehouse
This gives you the size of each DB in your warehouse.
If you want the size of the tables in a specific DB, run:
hdfs dfs -du -h /apps/hive/warehouse/<db_name>
Run a "grep warehouse" on hive-site.xml to find your warehouse path, for example:

Failed to copy file from FTP to HDFS

I have an FTP server (F [ftp]), a Linux box (S [standalone]) and a Hadoop cluster (C [cluster]). The current file flow is F->S->C. I am trying to improve performance by skipping S.
The current flow is:
wget ftp://user:password@ftpserver/absolute_path_to_file
hadoop fs -copyFromLocal path_to_file path_in_hdfs
I tried:
hadoop fs -cp ftp://user:password@ftpserver/absolute_path_to_file path_in_hdfs
and:
hadoop distcp ftp://user:password@ftpserver/absolute_path_to_file path_in_hdfs
Both hang. The distcp one, being a job, gets killed by a timeout; the logs (hadoop job -logs) only say it was killed by the timeout. I tried to wget from the FTP server on one of C's nodes and it worked. What could be the reason, and any hint on how to figure it out?
Pipe it through stdin:
wget -O - ftp://user:password@ftpserver/absolute_path_to_file | hadoop fs -put - path_in_hdfs
The -O - makes wget write the download to stdout, and the single - tells hadoop fs -put to read from stdin.
hadoop fs -cp ftp://user:password@ftpserver.com/absolute_path_to_file path_in_hdfs
This cannot be used, as it treats the source as a file in the local file system and does not take into account the scheme you are trying to pass. Refer to the javadoc: FileSystem.
distcp is only for large intra- or inter-cluster copies (to be read as Hadoop clusters, i.e. HDFS). Again, it cannot get data from FTP. The 2-step process is still your best bet, or write a program to read from FTP and write to HDFS; a small streaming alternative is sketched below.
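A small alternative to writing a full program, echoing the stdin approach above (assumes curl is available on the node issuing the command):
# stream straight from FTP into HDFS without landing on the local disk first
curl -s ftp://user:password@ftpserver/absolute_path_to_file | hdfs dfs -put - path_in_hdfs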
