Automatically generating documentation about the structure of the database - hadoop

There is a database that contains several views and tables.
I need to create a report (documentation of the database) listing all the fields in these tables, indicating each field's type and, if possible, its minimum/maximum values and the value from the first row. For example:
.------------.--------.--------.--------------.--------------.--------------.
| Table name | Column | Type   | MinValue     | MaxValue     | FirstRow     |
:------------+--------+--------+--------------+--------------+--------------:
| Table1     | day    | date   | '2010-09-17' | '2016-12-10' | '2016-12-10' |
:------------+--------+--------+--------------+--------------+--------------:
| Table1     | price  | double | 1030.8       | 29485.7      | 6023.8       |
:------------+--------+--------+--------------+--------------+--------------:
| ...        |        |        |              |              |              |
:------------+--------+--------+--------------+--------------+--------------:
| TableN     | day    | date   | '2014-06-20' | '2016-11-28' | '2016-11-16' |
:------------+--------+--------+--------------+--------------+--------------:
| TableN     | owner  | string | NULL         | NULL         | 'Joe'        |
'------------'--------'--------'--------------'--------------'--------------'
I suspect that executing many queries of the form
SELECT MAX(column_name) AS max_value, MIN(column_name) AS min_value
FROM table_name
will be inefficient on the huge tables stored in Hadoop.
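One obvious mitigation is to batch all of a table's MIN/MAX aggregates into a single scan, something like this (column names taken from the example table above):

SELECT MIN(`day`) AS min_day, MAX(`day`) AS max_day,
       MIN(price) AS min_price, MAX(price) AS max_price
FROM table1;

Still, that is one full scan per table, so a metadata-driven approach would be preferable.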
After reading the documentation, I found an article about "Statistics in Hive".
It seems I should use a statement like this:
ANALYZE TABLE tablename COMPUTE STATISTICS FOR COLUMNS;
But this command fails with an error:
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.ColumnStatsTask
Do I understand correctly that this statement adds information to the table's metadata rather than displaying a result? Will it work with views?
Please suggest how to efficiently and automatically create documentation for a database in Hive.
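For reference, the flow I am trying to get working looks like this (assuming Hive 0.14 or later; the table and column names are illustrative):

-- Compute column statistics once per table (the step that currently fails for me)...
ANALYZE TABLE table1 COMPUTE STATISTICS FOR COLUMNS;

-- ...then read min/max back from the metastore without rescanning the data.
DESCRIBE FORMATTED table1 price;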

Related

How to drop hive partitions with hivevar passed as partition variable?

I have been trying to run this piece of code to drop the current day's partition from a Hive table, and for some reason it does not drop the partition. Not sure what's wrong.
Table Name : prod_db.products
Output of desc:
+----------------------------+-----------------------+-----------------------+--+
| col_name | data_type | comment |
+----------------------------+-----------------------+-----------------------+--+
| name | string | |
| cost | double | |
| load_date | string | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| load_date | string | |
+----------------------------+-----------------------+-----------------------+--+
I am using the following code:
SET hivevar:current_date=current_date();
ALTER TABLE prod_db.products DROP PARTITION(load_date='${current_date}');
Partitions before and after running it (identical):
+-----------------------+--+
| partition |
+-----------------------+--+
| load_date=2022-04-07 |
| load_date=2022-04-11 |
| load_date=2022-04-18 |
| load_date=2022-04-25 |
+-----------------------+--+
It runs without any error but won't drop the partition. The table is internal/managed.
I have tried the different approaches mentioned on Stack Overflow, but none of them work for me.
Any help is appreciated.
You don't need to set a variable; hivevar substitution is purely textual, so SET hivevar:current_date=current_date(); stores the literal string current_date() rather than evaluating it, and the partition spec load_date='current_date()' matches nothing. You can drop the partition directly:
ALTER TABLE prod_db.products
DROP PARTITION (load_date = current_date());
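If you do want a variable (to drop a date other than today, say), the value has to be computed outside Hive, because hivevar substitution never evaluates expressions. A sketch, assuming the statement is run through beeline (the file name and JDBC URL are placeholders):

-- drop_partition.sql, invoked e.g. as:
--   beeline -u <jdbc-url> --hivevar run_date=2022-04-25 -f drop_partition.sql
ALTER TABLE prod_db.products DROP PARTITION (load_date = '${hivevar:run_date}');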

Unable to query/select data inserted through Spark SQL

I am trying to insert data into a Hive managed table that has a partition.
Here is the SHOW CREATE TABLE output for reference.
+--------------------------------------------------------------------------------------------------+--+
| createtab_stmt |
+--------------------------------------------------------------------------------------------------+--+
| CREATE TABLE `part_test08`( |
| `id` string, |
| `name` string, |
| `baseamount` double, |
| `billtoaccid` string, |
| `extendedamount` double, |
| `netamount` decimal(19,5), |
| `netunitamount` decimal(19,5), |
| `pricingdate` timestamp, |
| `quantity` int, |
| `invoiceid` string, |
| `shiptoaccid` string, |
| `soldtoaccid` string, |
| `ingested_on` timestamp, |
| `external_id` string) |
| PARTITIONED BY ( |
| `productid` string) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'wasb://blobrootpath/hive/warehouse/db_103.db/part_test08' |
| TBLPROPERTIES ( |
| 'bucketing_version'='2', |
| 'transactional'='true', |
| 'transactional_properties'='default', |
| 'transient_lastDdlTime'='1549962363') |
+--------------------------------------------------------------------------------------------------+--+
I am trying to execute the following SQL statement to insert records into the partitioned table:
sparkSession.sql("INSERT INTO TABLE db_103.part_test08 PARTITION(ProductId) SELECT reflect('java.util.UUID', 'randomUUID'),stg_name,stg_baseamount,stg_billtoaccid,stg_extendedamount,stg_netamount,stg_netunitamount,stg_pricingdate,stg_quantity,stg_invoiceid,stg_shiptoaccid,stg_soldtoaccid,'2019-02-12 09:06:07.566',stg_id,stg_ProductId FROM tmp_table WHERE part_id IS NULL");
Without the INSERT statement, running just the SELECT query returns the data below.
+-----------------------------------+--------+--------------+--------------------+------------------+-------------+-----------------+-------------------+------------+-------------+--------------------+--------------------+-----------------------+------+-------------+
|reflect(java.util.UUID, randomUUID)|stg_name|stg_baseamount| stg_billtoaccid|stg_extendedamount|stg_netamount|stg_netunitamount| stg_pricingdate|stg_quantity|stg_invoiceid| stg_shiptoaccid| stg_soldtoaccid|2019-02-12 09:06:07.566|stg_id|stg_ProductId|
+-----------------------------------+--------+--------------+--------------------+------------------+-------------+-----------------+-------------------+------------+-------------+--------------------+--------------------+-----------------------+------+-------------+
| 4e0b4331-b551-42d...| OLI6| 16.0|2DD4E682-6B4F-E81...| 34.567| 1166.74380| 916.78000|2018-10-18 05:06:22| 13| I1|2DD4E682-6B4F-E81...|2DD4E682-6B4F-E81...| 2019-02-12 09:06:...| 6| P3|
| 8b327a8e-dd3c-445...| OLI7| 16.0|2DD4E682-6B4F-E81...| 34.567| 766.74380| 1016.78000|2018-10-18 05:06:22| 13| I6|2DD4E682-6B4F-E81...|2DD4E682-6B4F-E81...| 2019-02-12 09:06:...| 7| P4|
| c0e14b9a-8d1a-426...| OLI5| 14.6555| null| 34.56| 500.87000| 814.65000|2018-10-11 05:06:22| 45| I4|29B73C4E-846B-E71...|29B73C4E-846B-E71...| 2019-02-12 09:06:...| 5| P1|
+-----------------------------------+--------+--------------+--------------------+------------------+-------------+-----------------+-------------------+------------+-------------+--------------------+--------------------+-----------------------+------+-------------+
Earlier I was getting an error while inserting into the managed table. After restarting the Hive and Thrift services, the job now runs without errors, but I am unable to see the inserted data when running a SELECT query through beeline or a program. I can see that the partition, with delta files, was written into hive/warehouse.
I can also see a warning, though I am not sure whether it is related:
Cannot get ACID state for db_103.part_test08 from null
One more note: if I use an external table instead, everything works fine and I can view the data.
We are using an Azure HDInsight Spark 2.3 (HDI 4.0 Preview) cluster with the service stacks below.
HDFS: 3.1.1
Hive: 3.1.0
Spark2: 2.3.1
Have you added the SET commands below while trying to insert the data?
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
I have faced a similar issue where I was not allowed to do any read/write operations; after adding the above properties, I was able to query the table.
Since you have not faced any issue with the external table, I am not sure whether this will solve your problem.
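If those settings alone do not help, one more thing worth checking (an assumption on my part, prompted by the delta files you mention) is whether the reader simply cannot see uncompacted ACID deltas. Forcing a compaction makes that easy to test:

-- Force a major compaction of one partition and watch its progress;
-- 'P3' is a sample partition value taken from the SELECT output above.
ALTER TABLE db_103.part_test08 PARTITION (productid = 'P3') COMPACT 'major';
SHOW COMPACTIONS;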

Materialized View having UNKNOWN staleness - Oracle 11G

I am working on Oracle 11g.
The staleness of one of my materialized views (MY_MAT_VW1) has become UNKNOWN. You can see the relevant output of ALL_MVIEWS below.
OWNER | MVIEW_NAME | CONTAINER_NAME | QUERY | QUERY_LEN | UPDATABLE | UPDATE_LOG | MASTER_ROLLBACK_SEG | MASTER_LINK | REWRITE_ENABLED | REWRITE_CAPABILITY | REFRESH_MODE | REFRESH_METHOD | BUILD_MODE | FAST_REFRESHABLE | LAST_REFRESH_TYPE | LAST_REFRESH_DATE | STALENESS | AFTER_FAST_REFRESH | UNKNOWN_PREBUILT | UNKNOWN_PLSQL_FUNC | UNKNOWN_EXTERNAL_TABLE | UNKNOWN_CONSIDER_FRESH | UNKNOWN_IMPORT | UNKNOWN_TRUSTED_FD | COMPILE_STATE | USE_NO_INDEX | STALE_SINCE | NUM_PCT_TABLES | NUM_FRESH_PCT_REGIONS | NUM_STALE_PCT_REGIONS
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
MY_DB | MY_MAT_VW1 | MY_MAT_VW1 | select.. | 6728 | N | | | | N | GENERAL | DEMAND | COMPLETE | IMMEDIATE | NO | COMPLETE | 14-Nov-16 | UNKNOWN | NA | N | Y | N | N | N | N | VALID | N | 0 | | |
MY_DB | MY_MAT_VW2 | MY_MAT_VW2 | select.. | 7074 | N | | | | N | TEXTMATCH | DEMAND | COMPLETE | IMMEDIATE | NO | COMPLETE | 13-Nov-16 | FRESH | NA | N | N | N | N | N | N | FRESH | N | 0 | 0 | |
The queries for the materialized views contain complex joins between multiple tables, inline views, and unions.
As per my understanding (the UNKNOWN_PLSQL_FUNC column), I guess a PL/SQL function is causing the staleness to become UNKNOWN, but I am not sure which one.
I tried recompiling and refreshing the view, but no luck.
Can anyone tell me how to detect the root cause and make sure the view does not become UNKNOWN again?
Also, does this have any implication for the data stored within it?
Below is just a sample I've created to replicate the scenario.
SELECT * FROM ENTITY_T;
ID | ENTITY_TYPE | FIRST_NAME | LAST_NAME | LEGAL_NAME
---+-------------+------------+-----------+-------------------
 1 | INDIVIDUAL  | JOHN       | LESSEN    |
 2 | INDIVIDUAL  | ROSAN      | MEL       |
 3 | CORP        | SIGMA      |           | SIGMA CORPORATION
--Function to get the name based upon entity type
CREATE OR REPLACE FUNCTION GET_NAME (P_ID IN NUMBER)
  RETURN VARCHAR2
  DETERMINISTIC
AS
  LV_NAME VARCHAR2(200);
BEGIN
  SELECT CASE ENTITY_TYPE
           WHEN 'INDIVIDUAL' THEN FIRST_NAME || ' ' || LAST_NAME
           WHEN 'CORP'       THEN LEGAL_NAME
           ELSE 'NONE'
         END
    INTO LV_NAME
    FROM ENTITY_T
   WHERE ID = P_ID;
  RETURN LV_NAME;
EXCEPTION
  WHEN NO_DATA_FOUND THEN
    RETURN 'NO ID FOUND';
  WHEN OTHERS THEN
    RETURN 'OTHER ERROR';
END;
--Materialized view creation
CREATE MATERIALIZED VIEW TEST_MV
AS
SELECT ID,ENTITY_TYPE,GET_NAME(ID) NAME
FROM ENTITY_T;
SELECT MVIEW_NAME,STALENESS,AFTER_FAST_REFRESH,UNKNOWN_PLSQL_FUNC,COMPILE_STATE,STALE_SINCE
FROM ALL_MVIEWS WHERE MVIEW_NAME='TEST_MV';
MVIEW_NAME | STALENESS | AFTER_FAST_REFRESH | UNKNOWN_PLSQL_FUNC | COMPILE_STATE | STALE_SINCE
----------------------------------------------------------------------------------------------
TEST_MV | UNKNOWN | NA | Y | VALID |
The Oracle issue/Doc ID 757537.1 mentioned by JSapkota states clearly that this is not a bug, but correct/expected behaviour:
STALENESS of the mview, referring to a PL/SQL function, is set to UNKNOWN,
as one cannot determine PL/SQL function changes. Current behaviour is
correct as per the design & code.
I guess declaring the functions DETERMINISTIC, rather than leaving the default, could prevent it.
As per My Oracle Support, this could be a bug (7582462).
As there is no fix for this bug, you have to live with the fact that staleness will show UNKNOWN, or avoid using functions in materialized view definitions.
Reference: DBA_MVIEWS Shows STALENESS Value of UNKNOWN After Refresh (Doc ID 757537.1)
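Even with STALENESS stuck at UNKNOWN, the materialized view's data can still be kept current with an on-demand complete refresh. A minimal sketch against the TEST_MV example above, using SQL*Plus EXEC syntax:

-- Rebuild the rows from the master tables; STALENESS may stay UNKNOWN,
-- but LAST_REFRESH_DATE confirms the data was actually refreshed.
EXEC DBMS_MVIEW.REFRESH('TEST_MV', method => 'C');

SELECT MVIEW_NAME, STALENESS, LAST_REFRESH_DATE
FROM ALL_MVIEWS
WHERE MVIEW_NAME = 'TEST_MV';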

Change table row grouping on interactive sort SSRS Local Report Webforms

Does SSRS local-mode processing in WebForms support this kind of feature?
I want the table columns to offer interactive sort so the user can easily sort the detail rows. However, the table initially has groupings, and I want those removed as soon as the user clicks an interactive sort button.
For example:
Originally
| Column1 | Column2 | Column3
-Group1
| A-Data | Data | Data
| C-Data | Data | Data
-Group2
| B-Data | Data | Data
| Z-Data | Data | Data
-Group3
| D-Data | Data | Data
If I click the interactive sort on Column1, for example, the table should look like this:
| Column1 | Column2 | Column3
| A-Data | Data | Data
| B-Data | Data | Data
| C-Data | Data | Data
| D-Data | Data | Data
| Z-Data | Data | Data

How to repeat query in Oracle Forms upon dynamically changing ORDER_BY clause?

I have an Oracle Forms 6i form with a data block that consists of several columns.
------------------------------------------------------------------------------
| FIRST_NAME | LAST_NAME | DEPARTMENT | BIRTH_DATE | JOIN_DATE | RETIRE_DATE |
------------------------------------------------------------------------------
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
------------------------------------------------------------------------------
The user can press F7 to enter query mode (for example, typing JOH% in the LAST_NAME field and H% in the DEPARTMENT field), then F8 to execute the query and see the results. In this example, all employees whose last name starts with JOH and who work in a department starting with H will be listed. Here is a sample output of that query:
------------------------------------------------------------------------------
| FIRST_NAME | LAST_NAME | DEPARTMENT | BIRTH_DATE | JOIN_DATE | RETIRE_DATE |
------------------------------------------------------------------------------
| MIKE | JOHN | HUMAN RES. | 05-MAY-82 | 02-FEB-95 | |
| BEN | JOHNATHAN | HOUSING | 23-APR-76 | 16-AUG-98 | |
| SMITH | JOHN | HOUSING | 11-DEC-78 | 30-JUL-91 | |
| | | | | | |
------------------------------------------------------------------------------
I then added a small button above each column to let the user sort the data by that column, via a WHEN-BUTTON-PRESSED trigger:
set_block_property('dept', order_by, 'first_name desc');
The good news is that the ORDER_BY does change. The bad news is that the user never notices the change, because it only takes effect on the next query he/she executes.
I tried to automatically execute the query upon changing the ORDER_BY clause like this:
set_block_property('dept', order_by, 'first_name desc');
go_block('EMPLOYEE');
do_key('EXECUTE_QUERY');
/* EXECUTE_QUERY -- same thing */
but what happens is that all the data in the table is selected, ignoring the criteria the user initially entered in query mode.
I also searched for a solution to this problem, and most answers deal with SYSTEM.LAST_QUERY and DEFAULT_WHERE. The problem is that LAST_QUERY can refer to a different block in a different form, one that is not valid for the currently displayed data block.
How can I do the following in just one button press:
1. Change the ORDER_BY clause of the currently active data block, and
2. Re-execute the last query the user executed, with the same criteria?
Any help will be highly appreciated.
You can get the last query of the block with the GET_BLOCK_PROPERTY built-in function:
GET_BLOCK_PROPERTY('EMPLOYEE', LAST_QUERY);
Another option is to provide separate search field(s) on the form, instead of using the QBE functionality.
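Building on that, here is a sketch of a WHEN-BUTTON-PRESSED trigger that reapplies the previous criteria before re-executing the query; the WHERE-clause extraction is deliberately naive, and the block and column names are the ones from the question:

DECLARE
  LV_LAST  VARCHAR2(4000) := GET_BLOCK_PROPERTY('EMPLOYEE', LAST_QUERY);
  LV_WHERE VARCHAR2(4000);
  LV_POS   NUMBER := INSTR(UPPER(LV_LAST), ' WHERE ');
BEGIN
  IF LV_POS > 0 THEN
    -- Everything after WHERE, minus any ORDER BY that Forms appended.
    LV_WHERE := SUBSTR(LV_LAST, LV_POS + 7);
    IF INSTR(UPPER(LV_WHERE), ' ORDER BY ') > 0 THEN
      LV_WHERE := SUBSTR(LV_WHERE, 1, INSTR(UPPER(LV_WHERE), ' ORDER BY ') - 1);
    END IF;
    -- Reapply the user's QBE criteria as the block's default WHERE clause.
    SET_BLOCK_PROPERTY('EMPLOYEE', DEFAULT_WHERE, LV_WHERE);
  END IF;
  SET_BLOCK_PROPERTY('EMPLOYEE', ORDER_BY, 'first_name desc');
  GO_BLOCK('EMPLOYEE');
  EXECUTE_QUERY;
END;

Note that this overwrites the block's DEFAULT_WHERE, so subsequent F7/F8 queries will inherit those criteria unless you reset the property afterwards.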
