How to determine Hive database size? - bash

How can I determine a Hive database's size from Bash or from the Hive CLI?
hdfs and hadoop commands are also available in Bash.

A database in Hive is metadata storage, meaning it holds information about tables and has a default location. Tables in a database can also be stored anywhere in HDFS if a location is specified when creating a table.
You can see all tables in a database using the show tables command in the Hive CLI.
Then, for each table, you can find its location in HDFS using describe formatted <table name> (again in the Hive CLI).
Last, for each table you can find its size using hdfs dfs -du -s -h /table/location/
I don't think there's a single command to measure the sum of the sizes of all tables in a database. However, it should be fairly easy to write a script that automates the steps above. Hive can also be invoked from the Bash CLI using: hive -e '<hive command>'
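For example, a minimal sketch of such a script (it assumes the database name is passed as the first argument and that describe formatted prints a Location: line; adjust the grep/awk for your Hive version):
#!/bin/bash
# Sum the HDFS sizes of all tables in a Hive database (sketch; database name is $1)
DB="$1"
TOTAL=0
for TABLE in $(hive -S -e "show tables in ${DB};"); do
  # Pull the table's HDFS location out of "describe formatted"
  LOCATION=$(hive -S -e "describe formatted ${DB}.${TABLE};" | grep -i 'Location:' | awk '{print $NF}')
  # First column of "hdfs dfs -du -s" is the size in bytes (without replication)
  SIZE=$(hdfs dfs -du -s "${LOCATION}" | awk '{print $1}')
  echo "${TABLE}: ${SIZE} bytes"
  TOTAL=$((TOTAL + SIZE))
done
echo "Total for ${DB}: ${TOTAL} bytes"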

Show Hive databases on HDFS
sudo hadoop fs -ls /apps/hive/warehouse
Show Hive database size
sudo hadoop fs -du -s -h /apps/hive/warehouse/{db_name}

If you want the size of your complete database, run this on your "warehouse" path:
hdfs dfs -du -h /apps/hive/warehouse
This gives you the size of each DB in your warehouse.
If you want the size of the tables in a specific DB, run:
hdfs dfs -du -h /apps/hive/warehouse/<db_name>
Run a "grep warehouse" on hive-site.xml to find your warehouse path.

Related

How to create a new user in Hadoop

I am new to Hadoop. I have done an Apache Hadoop multi-node installation and the user name is hadoop.
I am using 3 nodes in total: 1 namenode and 2 datanodes.
I have to create a new user for data isolation. I have found a few links on Google, but those are not working and I am unable to access HDFS.
[user1#datanode1~]# hdfs dfs -ls -R /
bash: hdfs: command not found...
Can someone help me with the steps to create the new user which can access HDFS for data isolation? And on which node should I create the new user?
Thanks
Hadoop doesn't have users like Linux does. Users are generally managed by external LDAP/Kerberos systems. By default there are no security features at all: user names are based on the HADOOP_USER_NAME environment variable and can be overridden with the export command. Also, by default, the user used is the current OS username; for example, your command user1#datanode1 # hdfs dfs -ls would actually run hdfs dfs -ls /user/user1, and return an error if that folder doesn't already exist.
However, your actual error is saying that your OS PATH variable does not include $HADOOP_HOME/bin, for example. Edit your .bashrc to fix this.
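For example, something like this in the new user's .bashrc (a sketch; /opt/hadoop is a placeholder for wherever Hadoop is actually installed):
# Make the hdfs and hadoop commands visible to this user (adjust HADOOP_HOME to your install)
export HADOOP_HOME=/opt/hadoop
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"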
You'd create an HDFS folder for "user" username with
hdfs dfs -mkdir /user/username
hdfs dfs -chown username /user/username
hdfs dfs -chmod -R 770 /user/username
And you should also run the useradd command on the namenode machine to make sure it knows about a user named "username"
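For example (a sketch, run as root on the namenode; "username" is a placeholder):
# Create the OS user, then verify it can reach its HDFS home directory
useradd -m username
su - username -c "hdfs dfs -ls /user/username"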

Hadoop: configure which cluster is queried based on a flag/env parameter

Apologies beforehand if this turns out to be a silly question; I am new to the Hadoop environment.
I have two hadoop clusters my-prod-cluster and my-bcp-cluster.
Both are accessible over the same network.
Is there any way to configure my clusters in such a way that when I am in BCP mode, all my queries to my-prod-cluster get routed to my-bcp-cluster (on the basis of some config parameter or environment variable)?
So when flag=prod
hadoop fs -ls /my-prod-cluster/mydir translates to hadoop fs -ls /my-prod-cluster/mydir
and fetches the data in /my-prod-cluster/mydir
when the flag=bcp
hadoop fs -ls /my-prod-cluster/mydir translates to hadoop fs -ls /my-bcp-cluster/mydir
and fetches data from /my-bcp-cluster/mydir
I am using the MapR flavour of Hadoop (provided by HP), version 6.1, in case that matters.
You could easily make a shell wrapper script that prepends the NameNode address to each query
For example, a fully-qualified command would look like this
hdfs dfs -ls hdfs://my-prod-cluster.domain.com/path/to/mydir
So, refactoring that, you could have a script like
#!/bin/sh
if [ "$1" = "prod" ]; then
  NAMENODE=hdfs://my-prod-cluster.domain.com
fi
# TODO: error handling and more clusters
# Use a name other than PATH so we don't clobber the shell's command search path
HDFS_PATH=$2
hdfs dfs -ls "${NAMENODE}${HDFS_PATH}"
Then execute something like my-hdfs-ls prod /mydir
If you need something more complex than that, like Kerberos tickets and such, then creating a separate HADOOP_CONF_DIR with unique core-site and hdfs-site XMLs for each cluster would be recommended.
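For example, a minimal sketch (the /etc/hadoop/conf.prod and /etc/hadoop/conf.bcp directories are assumptions; point them at wherever you keep each cluster's core-site.xml and hdfs-site.xml):
#!/bin/sh
# Select the cluster by switching the client configuration directory
if [ "$1" = "prod" ]; then
  export HADOOP_CONF_DIR=/etc/hadoop/conf.prod
else
  export HADOOP_CONF_DIR=/etc/hadoop/conf.bcp
fi
hadoop fs -ls "$2"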

Netezza NZLOAD utility: point the -df location to an HDFS location

Currently, we copy the files from HDFS to local disk and use the NZLOAD utility to load the data into Netezza, but I just wanted to know if it is possible to provide the HDFS location of the files, as below:
nzload -host ${NZ_HOST} -u ${NZ_USER} -pw ${NZ_PASS} -db ${NZ_DB} -t ${TAR_TABLE} -df "hdfs://${HDFS_Location}"
As HDFS is a different file system, nzload will not recognise the file if you provide an HDFS file path in the -df option of Netezza nzload.
You can use hdfs dfs -cat along with nzload to load a Netezza table from an HDFS directory.
$ hdfs dfs -cat /data/stud_dtls/stud_detls.csv | nzload -host 192.168.1.100 -u admin -pw password -db training -t stud_dtls -delim ','
Load session of table 'STUD_DTLS' completed successfully
Load HDFS file into Netezza Table Using nzload and External Tables

HDFS space consumed: "hdfs dfs -du /" vs "hdfs dfsadmin -report"

Which tool is the right one to measure HDFS space consumed?
When I sum up the output of "hdfs dfs -du /" I always get a smaller amount of space consumed than "hdfs dfsadmin -report" shows (the "DFS Used" line). Is there data that du does not take into account?
HDFS provides reliable storage by putting copies of the data on several nodes. The number of copies is the replication factor, and it is usually greater than one.
The command hdfs dfs -du / shows the space consumed by your data without replication.
The command hdfs dfsadmin -report (the DFS Used line) shows the actual disk usage, taking data replication into account, so it should be several times bigger than the number you get from the dfs -du command.
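For example, a quick way to compare the two numbers (a sketch; the grep pattern assumes the standard dfsadmin -report output):
hdfs dfs -du -s -h /                      # logical size, without replication
hdfs dfsadmin -report | grep 'DFS Used'   # raw disk usage across all replicas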
How HDFS storage works, in brief:
Say the replication factor = 3 (the default)
Data file size = 10GB (e.g. xyz.log)
HDFS will take 10x3 = 30GB to store that file
Depending on the type of command you use, you will get different values for the space occupied (10GB vs 30GB)
If you are on a recent version of Hadoop, try the following command. In my case this works very well on Hortonworks Data Platform (HDP) 2.3.* and above. This should also work on Cloudera's latest platform.
hadoop fs -count -q -h -v /path/to/directory
(-q = quota, -h = human readable values, -v = verbose)
This command will show the following fields in the output.
QUOTA REMAINING_QUOTA SPACE_QUOTA REMAINING_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE FILE_NAME
Where
CONTENT_SIZE = real file size without replication (10GB) and
SPACE_QUOTA = space occupied in HDFS to save the file (30GB)
Notes:
Control the replication factor by modifying the "dfs.replication" property found in the hdfs-site.xml file under the conf/ directory of the default Hadoop installation. Changing this via Ambari/Cloudera Manager is recommended if you have a multi-node cluster.
There are other commands to check storage space, e.g. hadoop fsck and hadoop dfs -dus.
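For example (a sketch; hadoop dfs -dus is deprecated in newer releases in favour of hdfs dfs -du -s):
hdfs fsck /path/to/directory -files -blocks   # per-file block and replication details
hdfs dfs -stat %r /path/to/file               # replication factor of a single file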

Can't import/load data to hive, why?

I'm trying to import data (a simple file with two columns, int and string); the table looks like this:
hive> describe test;
id int
name string
and when I try to import:
hive> load data inpath '/user/test.txt' overwrite into table test;
Loading data to table default.test
rmr: org.apache.hadoop.security.AccessControlException: Permission denied: user=hadoop, access=ALL, inode="/user/hive/warehouse/test":hive:hadoop:drwxrwxr-x
Failed with exception org.apache.hadoop.security.AccessControlException: Permission denied: user=hadoop, access=WRITE, inode="/user/hive/warehouse/test":hive:hadoop:drwxrwxr-x
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
It looks like the user hadoop has all the permissions but still can't load data; however, I was able to create the table.
What's wrong?
Hive uses the Metastore for its metadata. All table definitions are created in it, but the actual data is stored in HDFS. Currently Hive permissions and HDFS permissions are completely different things; they are unrelated. You have several workarounds:
Disable HDFS permissions entirely
Use Storage Based Authorization: https://cwiki.apache.org/confluence/display/Hive/HCatalog+Authorization (in this case you will not be able to create tables if you don't own the database directory on HDFS)
Submit all jobs under hive user ( sudo -u hive hive )
Create database:
create database hadoop;
and create the needed directory in HDFS with the correct permissions:
hdfs dfs -mkdir /user/hive/warehouse/hadoop.db;
hdfs dfs -chown hadoop:hive /user/hive/warehouse/hadoop.db
hdfs dfs -chmod g+w /user/hive/warehouse/hadoop.db
Of course, you should enable hive.metastore.client.setugi=true and hive.metastore.server.setugi=true. These parameters instruct Hive to execute jobs as the current shell user (it looks like these parameters are already enabled, because Hive can't create the directory).
This issue is because of syntax: the format given when creating the table should be similar to the input file format.
Yes, this is a permission error on the destination directory in HDFS. An approach that worked for me:
Identify the destination directory in HDFS if you don't know where it is: hive> describe extended [problem table name]; and look at the location parameter. Then
change the permissions on that directory:
hadoop fs -chmod [-R] nnn /problem/table/directory
You may have to run this as a superuser depending on your setup. Use the -R option to apply the new permissions to everything within the directory. Set nnn to whatever is appropriate for your system.
