Recently I have been exploring HPE Vertica a bit. Is it possible to compute summary statistics (mean, standard deviation, quartiles, max, min, counts, etc.) from a data table loaded in Vertica?
These two links:
https://my.vertica.com/docs/7.0.x/HTML/Content/Authoring/SQLReferenceManual/Functions/VerticaFunctions/ANALYZE_STATISTICS.htm
https://my.vertica.com/docs/7.0.x/HTML/Content/Authoring/SQLReferenceManual/Functions/VerticaFunctions/ANALYZE_HISTOGRAM.htm
say that we can compute statistics and histograms from the data, but the result makes no sense to me.
According to them, the ANALYZE_STATISTICS command returns 0 on successful execution:
NEWDB_aug17=> SELECT ANALYZE_STATISTICS ('MM_schema.capitalline');
ANALYZE_STATISTICS
--------------------
0
(1 row)
Here NEWDB_aug17 is the database and MM_schema is the schema into which the capitalline table was loaded. But where are the summary measures, I mean the numbers we are actually looking for? A bare 0 is not going to serve my purpose.
Can you please guide me in this context?
Vertica saves the statistics collected by ANALYZE_STATISTICS() in the catalog location.
These statistics are later used to calculate the best query execution plan.
You can find the statistics details in the system table v_internal.dc_analyze_statistics:
[dbadmin@vertica-1 ~]$ vsql
dbadmin=> \x
Expanded display is on.
dbadmin=> select * from v_internal.dc_analyze_statistics limit 1;
-[ RECORD 1 ]----+-----------------------------------
time | 2017-08-21 02:07:03.287895+00
node_name | v_test_node0001
session_id | v_test_node0001-502811:0x834a4
user_id | 45035996273704962
user_name | dbadmin
transaction_id | 45035996307673368
statement_id | 9
request_id | 1
table_name | test_table
proj_column_name | test_column
proj_name | test_table_sp_v11_b1
table_oid | 45036013037102108
proj_column_oid | 45036013037111264
proj_row_count | 119878353211
disk_percent | 10
disk_read_rows | 11987835321
sample_rows | 131072
sample_bytes | 7602176
start_time | 2017-08-21 02:07:03.657377+00
end_time | 2017-08-21 02:07:24.799398+00
Time: First fetch (1 row): 849.467 ms. All rows formatted: 849.594 ms
Or at this path:
{your_catalog_location}/{db_name}/{node_name}_catalog/DataCollector/AnalyzeStatistics_*.log
Vertica's PERCENTILE_CONT analytic function is helpful for retrieving quartiles.
create table test
(metric_value integer);
insert into test values(1);
insert into test values(2);
insert into test values(3);
insert into test values(4);
insert into test values(5);
insert into test values(6);
insert into test values(7);
insert into test values(8);
insert into test values(9);
insert into test values(10);
alter table test add column metric varchar(100) default 'abc';
select
metric_value,
percentile_cont(1) within group (order by metric_value) over (partition by metric) as max,
percentile_cont(.75) within group (order by metric_value ) over (partition by metric) as q3,
percentile_cont(.5) within group (order by metric_value ) over (partition by metric) as median,
percentile_cont(.25) within group (order by metric_value ) over (partition by metric) as q1,
percentile_cont(0) within group (order by metric_value ) over (partition by metric) as min
from test ;
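For the plain summary measures from the original question (count, min, max, mean, standard deviation), ordinary aggregate functions already do the job; a minimal sketch against the same test table:
select count(metric_value)  as n,
       min(metric_value)    as min_value,
       max(metric_value)    as max_value,
       avg(metric_value)    as mean,
       stddev(metric_value) as sd
from test;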
Related
I would like to analyze table usage on Vertica to check the following:
the tables that are hit most by queries
tables that are getting more write queries
tables that are getting more read queries.
So I am asking for help with an SQL query, or if anyone has any documents please point me in the right direction. Thank you.
Here, I create a function QTYPE() that classifies a request of type 'QUERY' as either a SELECT, an INSERT, or a MODIFY (meaning DELETE, UPDATE, MERGE). The differentiation comes from the fact that, in Vertica, UPDATE/MERGE are actually DELETEs followed by INSERTs.
I use two regular expressions of a certain complexity: the first finds [schema.]tablename after a JOIN or FROM keyword; the second finds [schema.]tablename after the UPDATE, INSERT INTO, MERGE INTO or DELETE FROM keywords. Then I join back to the tables system table to a) keep only the tables that really exist and b) add the schema name if it is missing.
The final report would be:
qtype | tbname | tx_count
--------+------------------------------------------------------+----------
INSERT | dbadmin.nrm_cpustats_rate | 74
INSERT | dbadmin.v_poll_item | 39
INSERT | dbadmin.child | 32
INSERT | dbadmin.tbid | 32
INSERT | dbadmin.etl_group_membership | 12
INSERT | dbadmin.sensor_oco | 11
INSERT | webanalytics.webtraffic_part | 10
INSERT | webanalytics.webtraffic_new_design_platform_datadate | 9
MODIFY | cp.foo | 2
MODIFY | public.foo | 2
MODIFY | taboola_tests.foo | 2
SELECT | dbadmin.flext | 112
SELECT | dbadmin.children | 112
SELECT | dbadmin.ffoo | 112
SELECT | dbadmin.demovals | 112
SELECT | dbadmin.allbut4 | 112
SELECT | dbadmin.allcols | 112
SELECT | dbadmin.allbut1 | 112
SELECT | dbadmin.flx | 112
Here's the function definition, the CREATE TABLE statement to collect the statistics you're looking for, and finally the query producing the 'hit parade' of the most-touched tables ...
Mind you, it might become a long runner if there is a lot of history in your query_requests table ...
CREATE OR REPLACE FUNCTION qtype(sql VARCHAR(64000))
RETURN VARCHAR(8) AS BEGIN
RETURN
CASE UPPER(REGEXP_SUBSTR(sql,'\w+')::VARCHAR(16))
WHEN 'SELECT' THEN 'SELECT'
WHEN 'WITH' THEN 'SELECT'
WHEN 'AT' THEN 'SELECT'
WHEN 'INSERT' THEN 'INSERT'
WHEN 'DELETE' THEN 'MODIFY'
WHEN 'UPDATE' THEN 'MODIFY'
WHEN 'MERGE' THEN 'MODIFY'
ELSE UPPER(REGEXP_SUBSTR(sql,'\w+')::VARCHAR(16))
END
;
END;
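A quick sanity check of QTYPE() with two hypothetical statements (Vertica allows a SELECT without a FROM clause):
SELECT QTYPE('SELECT a FROM b')    AS q1,
       QTYPE('UPDATE b SET a = 1') AS q2;
-- expected: q1 = 'SELECT', q2 = 'MODIFY'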
DROP TABLE IF EXISTS table_op_stats;
CREATE TABLE table_op_stats AS
WITH
-- need 1000 integers - up to ~400 source tables found in 1 select
i(i) AS (
SELECT MICROSECOND(tm)
FROM (
SELECT TIMESTAMPADD(MICROSECOND, 1,'2000-01-01'::TIMESTAMP)
UNION ALL SELECT TIMESTAMPADD(MICROSECOND,1000,'2000-01-01'::TIMESTAMP)
) l(ts)
TIMESERIES tm AS '1 MICROSECOND' OVER(ORDER BY ts)
)
,
tblist AS (
-- selects can affect several tables, found by the JOIN or FROM keyword before them
-- hence the look-behind regular expression
SELECT
QTYPE(request) AS qtype
, transaction_id
, statement_id
, i
, LTRIM(REGEXP_SUBSTR(request,'(?<=(from|join))\s+(\w+\.)?\w+\b',1,i,'i')) as tbname
FROM query_requests CROSS JOIN i
WHERE request_type='QUERY'
AND success
AND LTRIM(REGEXP_SUBSTR(request,'(?<=(from|join))\s+(\w+\.)?\w+\b',1,i,'i')) <> ''
UNION ALL
-- insert/delete/update/merge queries only affect one table each
SELECT
QTYPE(request) AS qtype
, transaction_id
, statement_id
, 1 AS i
, LTRIM(REGEXP_SUBSTR(request,'(insert\s+.*into\s+|update\s+.*|merge\s+.*into|delete\s+.*from)\s*((\w+\.)?\w+)\b',1,1,'i',2)) as tbname
FROM query_requests
WHERE request_type='QUERY'
AND success
AND QTYPE(request) <> 'SELECT'
)
,
-- join back to the "tables" system table - removes queries from correlation names, and adds schema name if needed
real_tables AS (
SELECT
qtype
, transaction_id
, statement_id
, i
, CASE WHEN SPLIT_PART(tbname,'.',2)=''
THEN table_schema||'.'||tbname
ELSE tbname
END AS tbname
FROM tblist
JOIN tables ON CASE WHEN SPLIT_PART(tbname,'.',2)=''
THEN tbname=table_name
ELSE SPLIT_PART(tbname,'.',1)=table_schema AND SPLIT_PART(tbname,'.',2)=table_name
END
)
SELECT
qtype
, transaction_id
, statement_id
, i
, tbname
FROM real_tables;
-- Time: First fetch (0 rows): 42483.769 ms. All rows formatted: 42484.324 ms
-- the query at the end:
WITH grp AS (
SELECT
qtype
, tbname
, COUNT(*) AS tx_count
FROM table_op_stats
GROUP BY 1,2
)
SELECT
*
FROM grp
LIMIT 8 OVER(
PARTITION BY qtype
ORDER BY tx_count DESC
);
Hi, I need to come up with a query that efficiently returns just one record per group (I might be thinking about it wrong) and stops searching for more records in that group as soon as it has found one.
This is my table:
|col1 | col2|
|-----|-----|
| A | 1 |
| A | 2 |
| B | 3 |
| B | 4 |
I want to return:
|col1 | col2|
|-----|-----|
| A | 1 |
| B | 3 |
Note that I don't actually care whether row one contains A,1 or A,2 (the same applies to the second row).
What I want is one record that has A in the first column (it could be any record matching that criterion), and similarly one record that has B in col1.
The closest I know of to getting this are these two queries:
SELECT col1, MIN(col2)
FROM tablename
GROUP BY col1
and the other:
SELECT *
FROM tablename
WHERE col1 = 'A'
AND ROWNUM = 1
The first query is not good enough because it will try to find all records that have A in col1 (in the actual table I'm looking at, this means searching through millions of rows, and my indexes won't be of much help here). The second query will return just one value of col1 at a time, so I'd have to run it thousands of times to get all the records I need.
NOTE:
I did see a similar question on here, but the answers were focused on just getting the right query results; in my case the issue is how long I have to wait for these results.
Sounds like this is the query you are looking for:
select col1
, min(col2) keep (dense_rank first order by rownum) col2
from tablename
group by col1;
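With the sample data from the question this returns one arbitrary row per group, for example:
COL1 COL2
---- ----
A       1
B       3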
I find that the query below returns the result quickly.
SELECT col1, col2 FROM (
    SELECT col1, col2,
           ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col2 ASC) AS minseq
    FROM tablename
    --where rownum < 1000000000
)
WHERE minseq = 1;
The normal query:
SELECT col1, MIN(col2)
FROM tablename
GROUP BY col1
took 1 minute to fetch 500 records out of 100,000,000 records, while this query took 0.03 seconds.
I would like to avoid repeating the same subquery over and over for all tables.
Example:
begin
-- where subquery is quite complex and returns thousands of records of a single ID column
delete from t1 where exists ( select 1 from subquery where t1.ID = subquery.ID );
delete from t2 where exists ( select 1 from subquery where t2.ID = subquery.ID );
delete from t3 where exists ( select 1 from subquery where t3.ID = subquery.ID );
end;
/
An alternative I've found is:
declare
type id_table_type is table of table.column%type index by PLS_INTEGER;
ids id_table_type;
begin
select ID
bulk collect into ids
from subquery;
forall indx in 1 .. ids.COUNT
delete from t1 where ID = ids(indx);
forall indx in 1 .. ids.COUNT
delete from t2 where ID = ids(indx);
forall indx in 1 .. ids.COUNT
delete from t3 where ID = ids(indx);
end;
/
What are your thoughts about this alternative? Is there a more efficient way of doing this?
Create a temporary table, once, to hold the results of the subquery.
For each run, insert the results of the subquery into the temporary table. The subquery only runs once and each delete is simple: delete from mytable t where t.id in (select id from tmptable);.
Truncate the table when finished.
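A sketch in Oracle (tmptable and its id column are illustrative names; substitute your actual complex subquery):
-- one-time setup
create global temporary table tmptable (id number)
    on commit preserve rows;
-- each run: populate once, reuse for every delete
insert into tmptable select id from subquery;
delete from t1 where t1.id in (select id from tmptable);
delete from t2 where t2.id in (select id from tmptable);
delete from t3 where t3.id in (select id from tmptable);
commit;
truncate table tmptable;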
If you can do it in pure SQL, then do it in SQL; there is no need for PL/SQL. Every SQL call from PL/SQL (or vice versa, though less so in this case) carries the overhead of a context switch between the two engines.
Now, having said that, if you must do it in PL/SQL, it is possible to reduce context switches by bulk-binding the whole collection to the DML statement in one operation.
Usually a cursor FOR loop does an implicit BULK COLLECT with a LIMIT of 100, which is much better than an explicit cursor.
But it's not just about bulk collecting; it is also about the operations we subsequently perform on the array we have fetched incrementally. We can further improve performance by using a FORALL statement along with BULK COLLECT, as sketched below.
IMO, the best would be to do it in pure SQL. If you really want to do it in PL/SQL, then do it as I mentioned above.
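A sketch of the incremental BULK COLLECT ... LIMIT pattern combined with FORALL (this assumes the ids are NUMBERs; substitute your actual subquery):
declare
    cursor c is
        select id from subquery;  -- your complex subquery
    type id_tab_t is table of number index by pls_integer;
    ids id_tab_t;
begin
    open c;
    loop
        fetch c bulk collect into ids limit 100;  -- fetch in batches
        exit when ids.count = 0;
        forall i in 1 .. ids.count
            delete from t1 where id = ids(i);
        forall i in 1 .. ids.count
            delete from t2 where id = ids(i);
        forall i in 1 .. ids.count
            delete from t3 where id = ids(i);
    end loop;
    close c;
end;
/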
I would go with the SQL approach, and since you have the same subquery repeated, I would use the query result cache, which was introduced in Oracle 11g.
In your subquery:
SELECT /*+ RESULT_CACHE */ <column_list> .. <your subquery>...
For example,
SQL> EXPLAIN PLAN FOR
2 SELECT /*+ RESULT_CACHE */
3 deptno,
4 AVG(sal)
5 FROM emp
6 GROUP BY deptno;
Explained.
Let's look at the plan table output:
SQL> SELECT * FROM TABLE(dbms_xplan.display);
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------
Plan hash value: 4067220884
--------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 3 | 21 | 3 (0)| 00:00:01 |
| 1 | RESULT CACHE | b9aa181887ufz5341w1zqpf1d1 | | | | |
| 2 | HASH GROUP BY | | 3 | 21 | 3 (0)| 00:00:01 |
| 3 | TABLE ACCESS FULL| EMP | 14 | 98 | 3 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------
Result Cache Information (identified by operation id):
------------------------------------------------------
1 - column-count=2; dependencies=(SCOTT.EMP); name="SELECT /*+ RESULT_CACHE */
deptno,
AVG(sal)
FROM emp
GROUP BY deptno"
15 rows selected.
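Applied to one of the original deletes, the hint would sit inside the repeated subquery, something like this (a sketch; whether the cache pays off depends on how often the subquery's base tables change):
delete from t1
 where exists ( select 1
                  from ( select /*+ RESULT_CACHE */ id from subquery ) s
                 where t1.id = s.id );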
An alternative would be:
begin
for s in (select distinct id from subquery) loop
delete from t1 where t1.id = s.id;
delete from t2 where t2.id = s.id;
delete from t3 where t3.id = s.id;
end loop;
end;
/
In general this is a questionable technique, as processing row by row is slower than multi-row operations. But if the point is that the subquery is very slow, this may be an improvement.
There seems to be a dependency between tables t1, t2 and t3 on their id fields.
You could probably solve this with a delete trigger on t1 that removes the corresponding entries in t2 and t3 as well, as sketched below.
I don't know exactly which tables your subquery touches; maybe you could set an update flag (e.g. a date field) on a further table t0 and delete the entries in t1, t2 and t3 from an update trigger on t0 whenever that date field changes.
The advantage of this solution is database consistency and pure SQL.
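A sketch of the delete-trigger variant (this assumes all three tables share the id column):
create or replace trigger t1_cascade_delete
    after delete on t1
    for each row
begin
    delete from t2 where t2.id = :old.id;
    delete from t3 where t3.id = :old.id;
end;
/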
I think it is not possible to delete from multiple tables at a time.
But one way could be to use subquery factoring (the WITH clause).
As of now I don't have an exact answer, but I think the WITH clause might help; I will get back with an exact query once I am able to devote some time.
Also, if you need to delete the subquery's ids from the subquery's own tables on a daily basis as well, it would be easier to create foreign keys with ON DELETE CASCADE, as sketched below.
Hope it helps.
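For the ON DELETE CASCADE idea, a sketch (this assumes t1.id is a primary or unique key that t2 and t3 can reference):
alter table t2 add constraint t2_t1_fk
    foreign key (id) references t1 (id) on delete cascade;
alter table t3 add constraint t3_t1_fk
    foreign key (id) references t1 (id) on delete cascade;
-- deleting the parent rows from t1 now removes the matching rows in t2 and t3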
I'm building a simple fact table in Oracle based on customer status, where a customer has a status of either 'Active' or 'Lost', a date they started with that status, and a date the status ended.
A sample 3 rows would be:
CustID | status | date_start | date_end
----------------------------------------
1      | active | 1/1/13     | 1/12/14
1      | lost   | 1/12/14    | 31/12/9999
2      | active | 1/12/14    | 31/12/9999
Here, cust 1 was active and then was lost. When an account status is current (as of today), the end date column is 31/12/9999. Cust 2 is active as of today.
My question is, how can I bring this into a fact table?
CREATE TABLE temp AS
SELECT CS.contract_status_id , to_char(ASH.Contract_Status_Start, 'DD/MM/YYYY') AS cust_status_start_date, to_char(ASH.CONTRACT_STATUS_END, 'DD/MM/YYYY') As cust_status_end_date
FROM account_status_history ASH,
customer_status_dim CS
WHERE ASH.contract_status = CS.contract_status
Fact Table:
CREATE TABLE customer_status_fact AS
SELECT T.cust_status_start_date, T.cust_status_end_date, T.contract_status_id,
count(T.contract_status_id) AS TOTAL_ACCOUNTS
FROM temp T
GROUP BY T.cust_status_start_date, T.cust_status_end_date, T.contract_status_id
And testing it:
select sum(F.TOTAL_ACCOUNTS), CS.contract_status_txt
from customer_status_fact F, customer_status_dim CS
where F.contract_status_id = CS.contract_status_id
and F.cust_status_start_date <= sysdate
and F.cust_status_end_date = '31/12/9999'
group by CS.contract_status_txt
I can't seem to get Oracle to recognise the year 9999. Any help is appreciated.
and F.cust_status_end_date = '31/12/9999'
'31/12/9999' is NOT a DATE; it is a string enclosed in single quotation marks. You must use TO_DATE to explicitly convert it into a DATE.
For example,
SQL> alter session set nls_date_format='DD/MM/YYYY HH24:MI:SS';
Session altered.
SQL> SELECT to_date('31/12/9999 23:59:59','DD/MM/YYYY HH24:MI:SS') FROM dual;
TO_DATE('31/12/9999
-------------------
31/12/9999 23:59:59
SQL>
OR,
SQL> SELECT to_date(5373484, 'J') + (1 - 1/24/60/60) FROM dual;
TO_DATE(5373484,'J'
-------------------
31/12/9999 23:59:59
SQL>
CREATE TABLE temp AS
SELECT CS.contract_status_id,
       to_char(ASH.Contract_Status_Start, 'DD/MM/YYYY') AS cust_status_start_date,
       to_char(ASH.CONTRACT_STATUS_END, 'DD/MM/YYYY') AS cust_status_end_date
Why would you create a table with DATEs converted to strings? You should leave the dates as they are. Use TO_CHAR only for display purposes; for any date calculation, let a date remain a date.
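Applied to the test query (assuming cust_status_end_date is kept as a DATE, as recommended), the comparison becomes:
and F.cust_status_end_date = to_date('31/12/9999', 'DD/MM/YYYY')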
In our Oracle installation we have a table with an index on two of its columns (X and Y). If I run a query with a WHERE clause touching only column X, will Oracle be able to use the index?
For example:
Table Y:
Col_A,
Col_B,
Col_C,
Index exists on (Col_A, Col_B)
SELECT * FROM Table_Y WHERE Col_A = 'STACKOVERFLOW';
Will the index be used, or will a table scan be done?
It depends.
You could check it by letting Oracle explain the execution plan:
EXPLAIN PLAN FOR
SELECT * FROM Table_Y WHERE Col_A = 'STACKOVERFLOW';
and then
select * from table(dbms_xplan.display);
So, for example with
create table table_y (
col_a varchar2(30),
col_b varchar2(30),
col_c varchar2(30)
);
create unique index table_y_ix on table_y (col_a, col_b);
and then a
explain plan for
select * from table_y
where col_a = 'STACKOVERFLOW';
select * from table(dbms_xplan.display);
The plan (on my installation) looks like:
------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 51 | 1 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| TABLE_Y | 1 | 51 | 1 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | TABLE_Y_IX | 1 | | 1 (0)| 00:00:01 |
------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("COL_A"='STACKOVERFLOW')
Id 2 shows you that the index TABLE_Y_IX is indeed used for an index range scan.
Whether Oracle chooses to use the index on another installation depends on many things; it's Oracle's query optimizer that makes this decision.
Update: If you feel you'd be better off (performance-wise, that is) with Oracle using the index, you might want to try the /*+ index_asc(...) */ hint (see index hints).
So in your case that would be something like
SELECT /*+ index_asc(TABLE_Y TABLE_Y_IX) */ *
FROM Table_Y
WHERE Col_A = 'STACKOVERFLOW';
Additionally, I would ensure that you have gathered statistics on the table and its columns. You can check the date of the last statistics gathering with
select last_analyzed from dba_tables where table_name = 'TABLE_Y';
and
select column_name, last_analyzed from dba_tab_columns where table_name = 'TABLE_Y';
If there are no statistics or if they're stale, make yourself familiar with the dbms_stats package to gather such statistics.
These statistics are the data that the query optimizer relies on heavily to make its decisions.
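For example, a minimal gather with dbms_stats might look like this (a sketch; adjust the parameters to your environment):
begin
    dbms_stats.gather_table_stats(
        ownname => user,       -- owner of TABLE_Y
        tabname => 'TABLE_Y',
        cascade => true        -- also gather statistics on its indexes
    );
end;
/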