I have multiple databases in Hive: A, B, C.
Database A has tables One, Two, Three, each with a different schema.
How can I write a Hive script to dump the data of all three tables into a single CSV file?
Try this.
#!/bin/bash
for db in A B C    # list of databases
do
  tbs=$(hive -S -e "use $db; show tables")
  for tb in $tbs
  do
    # The CLI prints tab-separated rows; swap the tabs for commas.
    hive -e "set hive.cli.print.header=true; use $db; SELECT * FROM $tb;" | sed 's/[\t]/,/g' >> sampleData.csv
  done
done
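One caveat with the script above: the CLI output is only tab-separated and the sed merely swaps tabs for commas, so a comma or tab inside a column value will corrupt the CSV, and the header row is repeated once per table. If Beeline is available, its csv2 output format quotes fields properly. A sketch, assuming $HIVE_JDBC_URL points at your HiveServer2 instance:

#!/bin/bash
# Dump every table of databases A, B and C as properly quoted CSV.
for db in A B C
do
  tbs=$(beeline -u "$HIVE_JDBC_URL" --silent=true --outputformat=tsv2 --showHeader=false -e "show tables in $db")
  for tb in $tbs
  do
    # csv2 quotes values that contain the delimiter, unlike the sed approach.
    beeline -u "$HIVE_JDBC_URL" --silent=true --outputformat=csv2 --showHeader=false -e "select * from $db.$tb" >> sampleData.csv
  done
done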
I am facing an issue where the number of records increases by 22,63,728 when data is exported from Hive into a CSV on a Hadoop edge node.
Sample command:
beeline -u $HIVE_JDBC_URL --silent=true --outputformat=tsv2 --showHeader=false -e "use db; select * from table" > sample.csv
Any clue why this would happen?
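One thing worth ruling out (just a guess, not confirmed here) is newline characters embedded in string columns: tsv2 output does not escape them, so each embedded newline adds a line to the file without adding a record. A quick sketch to compare the real row count with the exported line count:

# Hive's own row count for the table...
beeline -u $HIVE_JDBC_URL --silent=true --outputformat=tsv2 --showHeader=false -e "use db; select count(*) from table"
# ...versus the number of lines actually written to the export.
wc -l sample.csv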
Is there a way to catch all schema + table name info in a single command through Hive in a similar way to
SELECT * FROM information_schema.tables
from the PostgreSQL world?
show databases and show tables combined in a loop [here an example] is an answer, but I'm looking for a more compact way to get the same result in a single command.
It's been a while since I worked with Hive queries, but as far as I remember you can use
hive> desc formatted tableName;
or
hive> describe formatted tableName;
It will give you all the relevant information about the table, like the schema, partition info, table type (managed table or not), etc.
I am not sure if this is exactly what you are looking for.
There is another way to query Hive tables: writing Hive scripts which can be called from the Hadoop terminal rather than from the Hive terminal itself.
std]$ cat sample.hql    # or edit it with: vi sample.hql
use dbName;
select * from tableName;
desc formatted tableName;
# this hql script can be called from outside the hive terminal
std]$ hive -f sample.hql
Or, without even having to write a script file, you can query Hive directly:
std]$ hive -e "use dbName; select * from emp;" > text.txt    # use >> instead to append
At the database level, you can query as follows:
hive> use dbName;
hive> set hive.cli.print.current.db=true;
hive(dbName)> describe database dbName;
It will bring back metadata about the database from the metastore (MySQL in this case).
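If you also want the HDFS location, owner, and any database properties in one shot, the extended form should return them as well (the exact output layout varies by Hive version):
hive(dbName)> describe database extended dbName;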
In Hive, how do we search a table by name in all databases?
I am a Teradata user. Is there any counterpart in Hive of the system tables present in Teradata, like dbc.tables and dbc.columns?
You can use a SQL LIKE pattern to search for a table.
Example:
I want to search for a table whose name starts with "benchmark", but I don't know the rest of it.
Input in HIVE CLI:
show tables like 'ben*'
Output:
+-----------------------+--+
| tab_name |
+-----------------------+--+
| benchmark_core_month |
| benchmark_core_qtr |
| benchmark_core_year |
+-----------------------+--+
3 rows selected (0.224 seconds)
Or you can try the below command if you are using Beeline:
!tables
Note: it will work with Beeline only (the JDBC-based client).
More about beeline: http://blog.cloudera.com/blog/2014/02/migrating-from-hive-cli-to-beeline-a-primer/
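For example (the connection URL is a placeholder; substitute your own HiveServer2 host and port):
std]$ beeline -u jdbc:hive2://localhost:10000
0: jdbc:hive2://localhost:10000> !tables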
You can also use HDFS to find a table across all databases:
The path of the Hive warehouse is:
/apps/hive/warehouse/
So, using hdfs (the pattern is quoted so the shell does not expand it):
hdfs dfs -find /apps/hive/warehouse/ -name 't*'
You should query the metastore.
You can find the connection properties within hive-site.xml:
grep -A1 jdo $HIVE_HOME/conf/hive-site.xml
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true</value>
--
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
--
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
--
<name>javax.jdo.option.ConnectionPassword</name>
<value>cloudera</value>
Within the metastore you can use a query similar to the following:
select *
from metastore.DBS as d
join metastore.TBLS as t on t.DB_ID = d.DB_ID
where t.TBL_NAME like '% ... put something here ... %'
order by d.NAME, t.TBL_NAME;
Searching for tables with names containing infob across all Hive databases:
for i in `hive -e "show schemas"`; do echo "Hive DB: $i"; hive -e "use $i; show tables"|grep "infob"; done
Hive stores all its metadata in the metastore. The metastore schema can be found here: https://issues.apache.org/jira/secure/attachment/12471108/HiveMetaStore.pdf
It has tables like DBS for databases, TBLS for tables, and COLUMNS_V2 for columns. You may use an appropriate join to find out table names or column names, as sketched below.
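A sketch of such a join against a MySQL-backed metastore (the user and metastore database names are assumptions following the hive-site.xml shown in an earlier answer; adjust the LIKE pattern as needed):

mysql -u hive -p -e "
SELECT d.NAME AS db_name, t.TBL_NAME, c.COLUMN_NAME, c.TYPE_NAME
FROM metastore.DBS d
JOIN metastore.TBLS t ON t.DB_ID = d.DB_ID
JOIN metastore.SDS s ON s.SD_ID = t.SD_ID
JOIN metastore.COLUMNS_V2 c ON c.CD_ID = s.CD_ID
WHERE t.TBL_NAME LIKE '%your_table%';"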
@hisi's answer is elegant. However, on our cluster it triggered a GC out-of-memory error, so here is another, less elegant approach that works for me.
Let foo be the table name to search for. Then:
hadoop fs -ls -R -C /apps/hive/warehouse/ 2>/dev/null | grep '/apps/hive/warehouse/[^/]\{1,\}/foo$'
If one does not remember the exact table name but only a substring bar of it, then the command is:
hadoop fs -ls -R -C /apps/hive/warehouse/ 2>/dev/null | grep '/apps/hive/warehouse/[^/]\{1,\}/[^/]\{1,\}$' | grep bar
That's an extension of Mantej Singh's answer: you can use pyspark to find tables across all Hive databases (not just one):
from functools import reduce
from pyspark import SparkContext
from pyspark.sql import DataFrame, HiveContext

sc = SparkContext()
sqlContext = HiveContext(sc)

# Collect the names of all Hive databases.
dbnames = [row.databaseName for row in sqlContext.sql('SHOW DATABASES').collect()]

# Run SHOW TABLES in each database; note that Hive LIKE patterns use *, not %.
tnames = []
for dbname in dbnames:
    tnames.append(sqlContext.sql('SHOW TABLES IN {} LIKE "*your_pattern*"'.format(dbname)))

# Union the per-database results into a single DataFrame.
tables = reduce(DataFrame.union, tnames)
tables.show()
The way to do this is to iterate through the databases, searching for a table with the specified name.
I have a local directory that is used to store Hive table data.
I need to list all tables which are using this local directory.
These tables (managed tables) are stored in the Hive default DB; this DB allows storing data in other local directories.
My local directory: /abc/efg/data/
Table data is stored in subfolders like 123, 456, 789, etc.
For table xyz the location is /abc/efg/data/123, for PQR the location is /abc/efg/data/456, and so on.
I am trying to use
hive -e "show tables" > All_tables    # list all tables and redirect to a file
and then, for each line (each table) in All_tables:
hive -e "desc formatted $line" | grep '/abc/efg/data/' >> Tables_My_local_dir
but this will cause performance issues, as I have 6000 tables in the DB.
Please help me list all tables which are using the local directory with the best possible performance.
I assume that you want to list each table and its corresponding location, as would otherwise be extracted from the desc formatted command, for managed tables in the default database.
If my understanding is correct, I suggest querying the Hive metastore directly, provided it is an externally configured one and you have the necessary permissions to fetch this information.
Query on the metastore:
SELECT T.TBL_NAME AS TABLE_NAME, S.LOCATION AS LOCATION
FROM TBLS T
LEFT JOIN SDS S ON T.SD_ID = S.SD_ID
WHERE T.TBL_TYPE = 'MANAGED_TABLE' AND T.DB_ID = 1;
Note: in this query, the DB_ID for the default database is 1.
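If the default database's DB_ID differs on your installation, it can be looked up first (a sketch against a MySQL-backed metastore; the metastore database name is an assumption):
mysql -e "SELECT DB_ID FROM metastore.DBS WHERE NAME = 'default';"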
Output:
+------------+------------------------------------------------------------+
| TABLE_NAME | LOCATION                                                   |
+------------+------------------------------------------------------------+
| sample     | hdfs://********:8020/user/hive/warehouse/sample            |
...
Based on the rule
HADOOP TABLES ARE DIRECTORIES
I have created a shell script to do the steps below (a sketch follows the list).
Step 1: Find all the directories which have not been modified in the last 14 days.
Step 2: Separate real tables from plain folders:
2.1 Execute "desc $dir_name"
2.2 Based on the return status ($?), redirect $dir_name to one of two files (one for real tables, the other for directories).
Now I have the required tables in a file.
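A rough sketch of such a script, using the /abc/efg/data/ directory from the question and hypothetical output file names:

#!/bin/bash
# Step 1: keep only directories not modified in the last 14 days
# ($6 of "hdfs dfs -ls" is the modification date, compared as an ISO string).
cutoff=$(date -d '14 days ago' +%Y-%m-%d)
for dir in $(hdfs dfs -ls /abc/efg/data/ | awk -v c="$cutoff" 'NF >= 8 && $6 < c {print $NF}')
do
  tbl=$(basename "$dir")
  # Step 2: "desc" succeeds only for real tables, so split on its exit status.
  if hive -S -e "use default; desc $tbl" >/dev/null 2>&1
  then
    echo "$tbl" >> real_tables.txt
  else
    echo "$dir" >> plain_folders.txt
  fi
done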
select * from information_schema.columns;
in MySQL gives me the db name, table name, and column details of a MySQL database.
Can I get the same details in Hive from any table?
If you have configured your metastore in MySQL, then the metastore database has tables named DBS, TBLS, and COLUMNS_V2 which hold the metadata of all Hive databases, tables, and columns, as sketched below. (Newer Hive releases, 3.0 and later if I recall correctly, also ship a built-in information_schema database you can query directly.)
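For instance, a sketch that mimics the information_schema.columns output against a MySQL-backed metastore (the metastore database name is an assumption):

mysql -e "
SELECT d.NAME AS db_name, t.TBL_NAME AS table_name,
       c.COLUMN_NAME, c.TYPE_NAME, c.INTEGER_IDX AS ordinal_position
FROM metastore.DBS d
JOIN metastore.TBLS t ON t.DB_ID = d.DB_ID
JOIN metastore.SDS s ON s.SD_ID = t.SD_ID
JOIN metastore.COLUMNS_V2 c ON c.CD_ID = s.CD_ID
ORDER BY db_name, table_name, ordinal_position;"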
Describe will meet your requirement.
hive -e "desc formatted tablename"
On the above output you can use grep, like below:
hive -e "desc formatted tablename" |grep -i database
Column names alone can be obtained with the below command:
hive -e "show columns from tablename"