Querying HIVE Metadata - hadoop

I need to query the following table and view information from my Apache HIVE cluster:
Each row needs to contain the following:
TABLE SCHEMA
TABLE NAME
TABLE DESCRIPTION
COLUMN NAME
COLUMN DATA TYPE
COLUMN LENGTH
COLUMN PRECISION
COLUMN SCALE
NULL OR NOT NULL
PRIMARY KEY INDICATOR
This can be easily queried from most RDBMS (metadata tables/views), but I am struggling to find much information about the equivalent metadata tables/views in HIVE.
Please help :)

This information is available from the Hive metastore. The example query below is for a MySQL-backed metastore (Hive version 1.2).
SELECT
  DBS.NAME AS TABLE_SCHEMA,
  TBLS.TBL_NAME AS TABLE_NAME,
  COALESCE(TBL_COMMENTS.PARAM_VALUE, '') AS TABLE_DESCRIPTION,
  COLUMNS_V2.COLUMN_NAME AS COLUMN_NAME,
  COLUMNS_V2.TYPE_NAME AS COLUMN_DATA_TYPE_DETAILS
FROM DBS
JOIN TBLS ON DBS.DB_ID = TBLS.DB_ID
JOIN SDS ON TBLS.SD_ID = SDS.SD_ID
JOIN COLUMNS_V2 ON COLUMNS_V2.CD_ID = SDS.CD_ID
-- LEFT JOIN on the 'comment' key only: one row per column, and tables
-- without a comment are kept with an empty description.
LEFT JOIN TABLE_PARAMS TBL_COMMENTS
  ON TBLS.TBL_ID = TBL_COMMENTS.TBL_ID
  AND TBL_COMMENTS.PARAM_KEY = 'comment';
Sample output:
+--------------+----------------------+-----------------------+-------------------+------------------------------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_DESCRIPTION | COLUMN_NAME | COLUMN_DATA_TYPE_DETAILS |
+--------------+----------------------+-----------------------+-------------------+------------------------------+
| default | temp003 | This is temp003 table | col1 | string |
| default | temp003 | This is temp003 table | col2 | array<string> |
| default | temp003 | This is temp003 table | col3 | array<string> |
| default | temp003 | This is temp003 table | col4 | int |
| default | temp003 | This is temp003 table | col5 | decimal(10,2) |
| default | temp004 | | col11 | string |
| default | temp004 | | col21 | array<string> |
| default | temp004 | | col31 | array<string> |
| default | temp004 | | col41 | int |
| default | temp004 | | col51 | decimal(10,2) |
+--------------+----------------------+-----------------------+-------------------+------------------------------+
Metastore tables referenced in the query:
DBS: Details of databases/schemas.
TBLS: Details of tables.
COLUMNS_V2: Details about columns.
SDS: Details about storage.
TABLE_PARAMS: Details about table parameters (key-value pairs)
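The query returns the full type string (for example decimal(10,2)) rather than separate length, precision and scale columns. A minimal sketch of splitting that string on the client side follows. The function name split_hive_type is hypothetical, and the mapping (length for char/varchar, precision and scale for decimal, nothing for other types) is an assumption about what the report should show, not something the metastore provides.

```python
import re

# Matches a base type name with optional numeric arguments, e.g. "decimal(10,2)".
_TYPE_RE = re.compile(
    r"^\s*([a-z_]+)\s*(?:\(\s*(\d+)\s*(?:,\s*(\d+)\s*)?\))?\s*$",
    re.IGNORECASE,
)


def split_hive_type(type_name):
    """Return (base_type, length, precision, scale) for a TYPE_NAME string.

    Hypothetical helper: values that the type string does not state are
    returned as None instead of being filled in with defaults.
    """
    match = _TYPE_RE.match(type_name)
    if match is None:
        # Complex types such as array<string> carry no length/precision/scale.
        return (type_name.strip().lower(), None, None, None)
    base = match.group(1).lower()
    first = int(match.group(2)) if match.group(2) else None
    second = int(match.group(3)) if match.group(3) else None
    if base == "decimal":
        return (base, None, first, second)
    if base in ("varchar", "char"):
        return (base, first, None, None)
    return (base, None, None, None)


for name in ("string", "array<string>", "int", "decimal(10,2)"):
    print(name, split_hive_type(name))
```

Applied to the sample output above, only col5 and col51 would get a precision and scale; the other columns would have all three fields empty.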

Related

How to drop hive partitions with hivevar passed as partition variable?

I have been trying to run this piece of code to drop the current day's partition from a Hive table, and for some reason it does not drop the partition. Not sure what's wrong.
Table Name : prod_db.products
desc:
+----------------------------+-----------------------+-----------------------+--+
| col_name | data_type | comment |
+----------------------------+-----------------------+-----------------------+--+
| name | string | |
| cost | double | |
| load_date | string | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| load_date | string | |
+----------------------------+-----------------------+-----------------------+--+
## I am using the following code
SET hivevar:current_date=current_date();
ALTER TABLE prod_db.products DROP PARTITION(load_date='${current_date}');
Before and After picture of partitions:
+-----------------------+--+
| partition |
+-----------------------+--+
| load_date=2022-04-07 |
| load_date=2022-04-11 |
| load_date=2022-04-18 |
| load_date=2022-04-25 |
+-----------------------+--+
It runs without any error but does not drop the partition. The table is internal/managed.
I tried different approaches mentioned on Stack Overflow, but none of them works for me.
Help.
You don't need to set a variable. You can drop the partition directly in SQL:
ALTER TABLE prod_db.products
DROP PARTITION (load_date = current_date());
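One likely reason the original attempt did nothing: Hive variable substitution is textual, so ${current_date} expands to the text current_date() inside the quotes, and the partition spec then compares load_date against that literal string. If the statement is generated by a wrapper script, the date can instead be computed on the client and inserted as a plain literal. A minimal sketch, assuming such a wrapper exists (the function name drop_partition_statement is hypothetical):

```python
from datetime import date


def drop_partition_statement(table, partition_column, day):
    """Render an ALTER TABLE ... DROP PARTITION statement for one day.

    Hypothetical helper: table and partition_column are assumed to be
    trusted identifiers, since they are interpolated without escaping.
    """
    # isoformat() yields YYYY-MM-DD, matching names like load_date=2022-04-25.
    return (
        f"ALTER TABLE {table} DROP PARTITION "
        f"({partition_column}='{day.isoformat()}');"
    )


# For the current day, pass date.today() instead of a fixed date.
print(drop_partition_statement("prod_db.products", "load_date", date(2022, 4, 25)))
```

The printed statement can then be passed to the Hive client like any other script line.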

hive table shows 0 results while querying

My Hive table is a managed table, and I can see its files in HDFS.
Querying it through Hive, however, returns no results.
hive> describe formatted emp
Result -
| Table Type: | MANAGED_TABLE
| Table Parameters: | NULL
| 2 | bucketing_version
| 1376 | numFiles
| 43 | numPartitions
| 0 | numRows
| gzip | parquet.compression
| 0 | rawDataSize
| 4770821594 | totalSize
| true | transactional
| insert_only | transactional_properties
| 1612857428 | transient_lastDdlTime
When selecting data from the table:
select * from emp;
no results are returned.
Why is there a difference between the contents of HDFS and the SELECT output?
This command worked for me:
ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS;

monetdb full outer join resulting in varchar type_digits=0

I am using MonetDB v11.29.7 "Mar2018-SP1" on a Windows10 x64 bit operating system. When I perform a full outer join with two tables on respective varchar columns with lengths > 0 (type_digits > 0), the resultant column in the target table yields a varchar column with type_digits=0, although the column data seems to display the proper, non-null varchar records.
I am not sure how to interpret column information of type=varchar and type_digits=0. This state is causing issues in the subsequent handling/extraction of data via Python interfaces (UDFs), as the expected Python dtype for the data of this column is ambiguous for Python numpy conversion.
I have provided a simple example whereby I created two small tables (dummy4 and dummy5) with two columns each and then create a third table (dummy6) using a full outer join command.
For table dummy6 and column "key", I would have expected the type_digits=32 (as per the "key" columns in the two source tables dummy4 & dummy5). Additionally, how should I interpret type=varchar and type_digits=0 state? What would be the proper handling/expectation when accessing/allocating a Python/numpy array for extracting the "key" column of table "dummy6" (via Python UDFs) in this case?
create table dummy4(key varchar(32), val int);
insert into dummy4 values('AAAAAAAA',1);
insert into dummy4 values('BBBBBBBBB',2);
select * from dummy4;
+-----------+------+
| key | val |
+===========+======+
| AAAAAAAA | 1 |
| BBBBBBBBB | 2 |
+-----------+------+
create table dummy5(key varchar(32), val int);
insert into dummy5 values('CCCCCCCC',3);
insert into dummy5 values('DDDDDDDD',4);
select * from dummy5;
+----------+------+
| key | val |
+==========+======+
| CCCCCCCC | 3 |
| DDDDDDDD | 4 |
+----------+------+
create table dummy6 as select key, dummy4.val as "val4", dummy5.val as "val5" from dummy4 full outer join dummy5 using (key);
select * from dummy6;
+-----------+------+------+
| key | val4 | val5 |
+===========+======+======+
| AAAAAAAA | 1 | null |
| BBBBBBBBB | 2 | null |
| CCCCCCCC | null | 3 |
| DDDDDDDD | null | 4 |
+-----------+------+------+
select t.name as "table_name", t.id as "table_id", c.id as "column_id", c.name as "column_name", c.type, c.type_digits from sys.tables t JOIN sys.columns c ON c.table_id = t.id where t.name = 'dummy4';
+------------+----------+-----------+-------------+---------+-------------+
| table_name | table_id | column_id | column_name | type | type_digits |
+============+==========+===========+=============+=========+=============+
| dummy4 | 78445 | 78443 | key | varchar | 32 |
| dummy4 | 78445 | 78444 | val | int | 32 |
+------------+----------+-----------+-------------+---------+-------------+
select t.name as "table_name", t.id as "table_id", c.id as "column_id", c.name as "column_name", c.type, c.type_digits from sys.tables t JOIN sys.columns c ON c.table_id = t.id where t.name = 'dummy5';
+------------+----------+-----------+-------------+---------+-------------+
| table_name | table_id | column_id | column_name | type | type_digits |
+============+==========+===========+=============+=========+=============+
| dummy5 | 78449 | 78447 | key | varchar | 32 |
| dummy5 | 78449 | 78448 | val | int | 32 |
+------------+----------+-----------+-------------+---------+-------------+
select t.name as "table_name", t.id as "table_id", c.id as "column_id", c.name as "column_name", c.type, c.type_digits from sys.tables t JOIN sys.columns c ON c.table_id = t.id where t.name = 'dummy6';
+------------+----------+-----------+-------------+---------+-------------+
| table_name | table_id | column_id | column_name | type | type_digits |
+============+==========+===========+=============+=========+=============+
| dummy6 | 78457 | 78454 | key | varchar | 0 |
| dummy6 | 78457 | 78455 | val4 | int | 32 |
| dummy6 | 78457 | 78456 | val5 | int | 32 |
+------------+----------+-----------+-------------+---------+-------------+
In fact, this was a MonetDB bug, and it was fixed today. The fix will be included in the upcoming Nov2019 release.

Automatically generating documentation about the structure of the database

There is a database that contains several views and tables.
I need to create a report (documentation of the database) listing all the fields in these tables with their types and, if possible, the minimum/maximum values and the values from the first row. For example:
.------------.--------.--------.--------------.--------------.--------------.
| Table name | Column | Type | MinValue | MaxValue | FirstRow |
:------------+--------+--------+--------------+--------------+--------------:
| Table1 | day | date | ‘2010-09-17’ | ‘2016-12-10’ | ‘2016-12-10’ |
:------------+--------+--------+--------------+--------------+--------------:
| Table1 | price | double | 1030.8 | 29485.7 | 6023.8 |
:------------+--------+--------+--------------+--------------+--------------:
| … | | | | | |
:------------+--------+--------+--------------+--------------+--------------:
| TableN | day | date | ‘2014-06-20’ | ‘2016-11-28’ | ‘2016-11-16’ |
:------------+--------+--------+--------------+--------------+--------------:
| TableN | owner | string | NULL | NULL | ‘Joe’ |
'------------'--------'--------'--------------'--------------'--------------'
I think that running many queries like
SELECT MAX(column_name) as max_value, MIN(column_name) as min_value
FROM table_name
will be inefficient on the huge tables that are stored in Hadoop.
After reading the documentation, I found an article about "Statistics in Hive".
It seems I must use a statement like this:
ANALYZE TABLE tablename COMPUTE STATISTICS FOR COLUMNS;
But this command ended with an error:
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.ColumnStatsTask
Do I understand correctly that this statement adds information to the description of the table rather than displaying a result? Will it work with a view?
Please suggest how to effectively and automatically create documentation for the database in HIVE?
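On the concern about running many MIN/MAX queries: one way to reduce the cost is to ask for every column's minimum and maximum in a single statement, so each table is scanned once instead of once per column. A minimal sketch of generating such a statement follows. The function name build_profile_query and the _min/_max alias scheme are hypothetical, and whether a single scan is fast enough on very large tables is not something this sketch can promise.

```python
def build_profile_query(table, columns):
    """Build one SELECT returning MIN and MAX for every listed column.

    Hypothetical helper: table and column names are assumed to come from
    the metastore, so they are interpolated without further validation.
    """
    parts = []
    for col in columns:
        # Backticks quote identifiers in HiveQL.
        parts.append(f"MIN(`{col}`) AS `{col}_min`")
        parts.append(f"MAX(`{col}`) AS `{col}_max`")
    return f"SELECT {', '.join(parts)} FROM {table}"


print(build_profile_query("Table1", ["day", "price"]))
```

The column list for each table could come from the metastore or from DESCRIBE output, and the single result row can then be reshaped into one report line per column.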

How to use ResultSet to fetch the ID of the record

I have a table named table_listnames whose structure is given below:
mysql> desc table_listnames;
+-------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| name | varchar(255) | NO | | NULL | |
+-------+--------------+------+-----+---------+----------------+
2 rows in set (0.04 sec)
It contains the following sample data:
mysql> select * from table_listnames;
+----+------------+
| id | name |
+----+------------+
| 6 | WWW |
| 7 | WWWwww |
| 8 | WWWwwws |
| 9 | WWWwwwsSSS |
| 10 | asdsda |
+----+------------+
5 rows in set (0.00 sec)
My requirement is: if the name is not found in the table, insert it; otherwise do nothing.
I am achieving it this way:
String sql = "INSERT INTO table_listnames (name) SELECT name FROM (SELECT ?) AS tmp WHERE NOT EXISTS (SELECT name FROM table_listnames WHERE name = ?) LIMIT 1";
pst = dbConnection.prepareStatement(sql);
pst.setString(1, salesName);
pst.setString(2, salesName);
pst.executeUpdate();
Is it possible to get the id of the record with the given name in this case?
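One approach that works whether or not the insert actually added a row is to follow it with a lookup by name. The sketch below illustrates the pattern with Python's built-in sqlite3 module standing in for MySQL and JDBC, purely so that it is self-contained; that is an assumption of this example, not the setup in the question. The function name get_or_create_id is hypothetical, and the SQL is adapted to SQLite syntax.

```python
import sqlite3


def get_or_create_id(conn, name):
    """Insert name if it is absent, then return the id of its row."""
    conn.execute(
        "INSERT INTO table_listnames (name) "
        "SELECT ? WHERE NOT EXISTS "
        "(SELECT 1 FROM table_listnames WHERE name = ?)",
        (name, name),
    )
    # The lookup succeeds for both the just-inserted and the existing row.
    row = conn.execute(
        "SELECT id FROM table_listnames WHERE name = ?", (name,)
    ).fetchone()
    return row[0]


# In-memory stand-in for the table described in the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_listnames ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT NOT NULL)"
)
first = get_or_create_id(conn, "WWW")
again = get_or_create_id(conn, "WWW")
print(first, again)
```

Calling the function twice with the same name returns the same id and leaves a single row in the table.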
