Vertica Table Analysis

I would like to analyze table usage on Vertica to check the following:
the tables that are hit most by queries,
the tables that are getting the most write queries,
the tables that are getting the most read queries.
So I am asking for help with an SQL query, or if anyone has any documents, please point me in the right direction. Thank you.

Here, I create a function QTYPE() that classifies a request of type 'QUERY' as either a SELECT, an INSERT, or a MODIFY (meaning DELETE, UPDATE, or MERGE). The differentiation comes from the fact that, in Vertica, UPDATE and MERGE are actually DELETEs followed by INSERTs.
I use two regular expressions of a certain complexity: the first finds [schema.]tablename after a JOIN or FROM keyword; the second finds [schema.]tablename after an UPDATE, INSERT INTO, MERGE INTO, or DELETE FROM keyword. Then, I join back to the tables system table to a) keep only the tables that really exist and b) add the schema name if it is missing.
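To illustrate what the first expression extracts, here is a minimal probe on a made-up statement (the fourth argument of REGEXP_SUBSTR is the occurrence number, which the series of integers generated below will drive):
SELECT LTRIM(REGEXP_SUBSTR(
         'SELECT * FROM s1.t1 JOIN t2 ON t1.id = t2.id',
         '(?<=(from|join))\s+(\w+\.)?\w+\b', 1, 2, 'i')) AS second_table;
-- second_table
-- --------------
-- t2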
The final report would be:
qtype | tbname | tx_count
--------+------------------------------------------------------+----------
INSERT | dbadmin.nrm_cpustats_rate | 74
INSERT | dbadmin.v_poll_item | 39
INSERT | dbadmin.child | 32
INSERT | dbadmin.tbid | 32
INSERT | dbadmin.etl_group_membership | 12
INSERT | dbadmin.sensor_oco | 11
INSERT | webanalytics.webtraffic_part | 10
INSERT | webanalytics.webtraffic_new_design_platform_datadate | 9
MODIFY | cp.foo | 2
MODIFY | public.foo | 2
MODIFY | taboola_tests.foo | 2
SELECT | dbadmin.flext | 112
SELECT | dbadmin.children | 112
SELECT | dbadmin.ffoo | 112
SELECT | dbadmin.demovals | 112
SELECT | dbadmin.allbut4 | 112
SELECT | dbadmin.allcols | 112
SELECT | dbadmin.allbut1 | 112
SELECT | dbadmin.flx | 112
Here's the function definition, then the CREATE TABLE statement to collect the statistics you're looking for, and finally the query producing the 'hit parade' of the most touched tables.
Mind you, it might become a long runner if there is a lot of history in your query_requests table.
CREATE OR REPLACE FUNCTION qtype(sql VARCHAR(64000))
RETURN VARCHAR(8) AS BEGIN
RETURN
CASE UPPER(REGEXP_SUBSTR(sql,'\w+')::VARCHAR(16))
WHEN 'SELECT' THEN 'SELECT'
WHEN 'WITH' THEN 'SELECT'
WHEN 'AT' THEN 'SELECT'
WHEN 'INSERT' THEN 'INSERT'
WHEN 'DELETE' THEN 'MODIFY'
WHEN 'UPDATE' THEN 'MODIFY'
WHEN 'MERGE' THEN 'MODIFY'
ELSE UPPER(REGEXP_SUBSTR(sql,'\w+')::VARCHAR(16))
END
;
END;
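A quick sanity check of the function on a few literal statements (hypothetical inputs):
SELECT QTYPE('SELECT 1') AS q1
     , QTYPE('WITH x AS (SELECT 1) SELECT * FROM x') AS q2
     , QTYPE('MERGE INTO t USING s ON (t.id=s.id) WHEN MATCHED THEN UPDATE SET c=s.c') AS q3;
-- q1     | q2     | q3
-- SELECT | SELECT | MODIFY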
DROP TABLE IF EXISTS table_op_stats;
CREATE TABLE table_op_stats AS
WITH
-- need 1000 integers - up to ~400 source tables found in 1 select
i(i) AS (
SELECT MICROSECOND(tm)
FROM (
SELECT TIMESTAMPADD(MICROSECOND, 1,'2000-01-01'::TIMESTAMP)
UNION ALL SELECT TIMESTAMPADD(MICROSECOND,1000,'2000-01-01'::TIMESTAMP)
) l(ts)
TIMESERIES tm AS '1 MICROSECOND' OVER(ORDER BY ts)
)
,
tblist AS (
-- selects can affect several types, found by JOIN or FROM keyword before
-- hence look_behind regular expression
SELECT
QTYPE(request) AS qtype
, transaction_id
, statement_id
, i
, LTRIM(REGEXP_SUBSTR(request,'(?<=(from|join))\s+(\w+\.)?\w+\b',1,i,'i')) as tbname
FROM query_requests CROSS JOIN i
WHERE request_type='QUERY'
AND success
AND LTRIM(REGEXP_SUBSTR(request,'(?<=(from|join))\s+(\w+\.)?\w+\b',1,i,'i')) <> ''
UNION ALL
-- insert/delete/update/merge queries only affect one table each
SELECT
QTYPE(request) AS qtype
, transaction_id
, statement_id
, 1 AS i
, LTRIM(REGEXP_SUBSTR(request,'(insert\s+.*into\s+|update\s+.*|merge\s+.*into|delete\s+.*from)\s*((\w+\.)?\w+)\b',1,1,'i',2)) as tbname
FROM query_requests
WHERE request_type='QUERY'
AND success
AND QTYPE(request) <> 'SELECT'
)
,
-- join back to the "tables" system table - removes queries from correlation names, and adds schema name if needed
real_tables AS (
SELECT
qtype
, transaction_id
, statement_id
, i
, CASE WHEN SPLIT_PART(tbname,'.',2)=''
THEN table_schema||'.'||tbname
ELSE tbname
END AS tbname
FROM tblist
JOIN tables ON CASE WHEN SPLIT_PART(tbname,'.',2)=''
THEN tbname=table_name
ELSE SPLIT_PART(tbname,'.',1)=table_schema AND SPLIT_PART(tbname,'.',2)=table_name
END
)
SELECT
qtype
, transaction_id
, statement_id
, i
, tbname
FROM real_tables;
-- Time: First fetch (0 rows): 42483.769 ms. All rows formatted: 42484.324 ms
-- the query at the end:
WITH grp AS (
SELECT
qtype
, tbname
, COUNT(*) AS tx_count
FROM table_op_stats
GROUP BY 1,2
)
SELECT
*
FROM grp
LIMIT 8 OVER(
PARTITION BY qtype
ORDER BY tx_count DESC
);
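And because the original question distinguishes read traffic from write traffic, one more sketch over the same table_op_stats table pivots qtype into read and write counts per table:
SELECT tbname
     , SUM(CASE WHEN qtype = 'SELECT' THEN 1 ELSE 0 END)             AS read_count
     , SUM(CASE WHEN qtype IN ('INSERT','MODIFY') THEN 1 ELSE 0 END) AS write_count
     , COUNT(*)                                                      AS total_hits
FROM table_op_stats
GROUP BY tbname
ORDER BY total_hits DESC
LIMIT 20;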

Related

Generate duplicate rows with incremental values in Oracle

I have a requirement in Oracle SQL where specific values should be repeated 5 times. So I have written the below query and am getting the result as expected.
SELECT Item_Name || RANGES AS Item_Number
FROM
(
SELECT DISTINCT 'Price Line ' || column1 || '-' AS Item_Name, level+0 RANGES
FROM
(
SELECT DISTINCT column1 from tbl where column1 in
(
'ABC',
'BCD'
)
) connect by level <= 5
) order by Item_Number
OUTPUT:
Price Line ABC-1
Price Line ABC-2
Price Line ABC-3
Price Line ABC-4
Price Line ABC-5
Price Line BCD-1
Price Line BCD-2
Price Line BCD-3
Price Line BCD-4
Price Line BCD-5
But when I add more than 10 values, like 'DEF', 'EFG', ..., 'XYZ', the query keeps executing for hours without any result.
Any help or suggestion on this would be appreciated.
Make sure the CONNECT BY is not operating on your table rows. For example, use a common table expression (i.e., a WITH clause) to create a five-row source with CONNECT BY, and then CROSS JOIN that row source to your table. Here is an example.
CREATE TABLE ITEMS ( item_number VARCHAR2(30) );
INSERT INTO ITEMS VALUES ('ABC');
INSERT INTO ITEMS VALUES ('BCD');
INSERT INTO ITEMS VALUES ('CDE');
INSERT INTO ITEMS VALUES ('DEF');
INSERT INTO ITEMS VALUES ('EFG');
INSERT INTO ITEMS VALUES ('FGH');
INSERT INTO ITEMS VALUES ('GHI');
INSERT INTO ITEMS VALUES ('HIJ');
INSERT INTO ITEMS VALUES ('IJK');
INSERT INTO ITEMS VALUES ('JKL');
COMMIT;
WITH rowgen AS (
SELECT rownum rn FROM dual CONNECT BY rownum <= 5 )
SELECT item_number || '-' || rn
FROM items
CROSS JOIN rowgen
ORDER BY item_number, rn;
+----------------------+
| ITEM_NUMBER||'-'||RN |
+----------------------+
| ABC-1 |
| ABC-2 |
| ABC-3 |
| ABC-4 |
| ABC-5 |
| BCD-1 |
| BCD-2 |
| BCD-3 |
| BCD-4 |
| BCD-5 |
| CDE-1 |
| CDE-2 |
| CDE-3 |
| CDE-4 |
| CDE-5 |
| ... |
+----------------------+
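As to why the original query runs for hours: without a PRIOR condition, CONNECT BY connects every row to every row at each level, so an N-row source produces roughly N + N^2 + ... + N^5 rows at level 5. A quick sketch against the 10-row ITEMS table above illustrates the blow-up:
SELECT COUNT(*) AS generated_rows
FROM items
CONNECT BY level <= 5;
-- generated_rows: 111110 (10 + 100 + 1,000 + 10,000 + 100,000)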

Finding summary & basic statistics from data in Vertica

Recently I have been exploring HPE Vertica a bit. Is it possible to find summary statistics (mean, sd, quartiles, max, min, counts, etc.) for a data table loaded in Vertica?
These two links;
https://my.vertica.com/docs/7.0.x/HTML/Content/Authoring/SQLReferenceManual/Functions/VerticaFunctions/ANALYZE_STATISTICS.htm
https://my.vertica.com/docs/7.0.x/HTML/Content/Authoring/SQLReferenceManual/Functions/VerticaFunctions/ANALYZE_HISTOGRAM.htm
say that we can compute statistics and histograms from the data, but the result makes no sense to me.
According to them, the ANALYZE_STATISTICS command returns a 0 on successful execution, like:
NEWDB_aug17=> SELECT ANALYZE_STATISTICS ('MM_schema.capitalline');
ANALYZE_STATISTICS
--------------------
0
(1 row)
Here NEWDB_aug17 is the database, and MM_schema is the schema into which the capitalline table was inserted. But where are the summary measures, I mean the numbers we are actually looking for? A 0 alone is not going to serve my purpose.
Can you please guide me in this context?
Vertica saves the statistics collected by ANALYZE_STATISTICS() in the catalog location.
These statistics are later used to calculate the best query execution plan.
You can find the statistics details in the system table v_internal.dc_analyze_statistics:
[dbadmin@vertica-1 ~]$ vsql
dbadmin=> \x
Expanded display is on.
dbadmin=> select * from v_internal.dc_analyze_statistics limit 1;
-[ RECORD 1 ]----+-----------------------------------
time | 2017-08-21 02:07:03.287895+00
node_name | v_test_node0001
session_id | v_test_node0001-502811:0x834a4
user_id | 45035996273704962
user_name | dbadmin
transaction_id | 45035996307673368
statement_id | 9
request_id | 1
table_name | test_table
proj_column_name | test_column
proj_name | test_table_sp_v11_b1
table_oid | 45036013037102108
proj_column_oid | 45036013037111264
proj_row_count | 119878353211
disk_percent | 10
disk_read_rows | 11987835321
sample_rows | 131072
sample_bytes | 7602176
start_time | 2017-08-21 02:07:03.657377+00
end_time | 2017-08-21 02:07:24.799398+00
Time: First fetch (1 row): 849.467 ms. All rows formatted: 849.594 ms
Or at this path:
{your_catalog_location}/{db_name}/{node_name}_catalog/DataCollector/AnalyzeStatistics_*.log
Vertica's PERCENTILE_CONT function is helpful for retrieving quartiles:
create table test
(metric_value integer);
insert into test values(1);
insert into test values(2);
insert into test values(3);
insert into test values(4);
insert into test values(5);
insert into test values(6);
insert into test values(7);
insert into test values(8);
insert into test values(9);
insert into test values(10);
alter table test add column metric varchar(100) default 'abc';
select
metric_value,
percentile_cont(1) within group (order by metric_value) over (partition by metric) as max,
percentile_cont(.75) within group (order by metric_value ) over (partition by metric) as q3,
percentile_cont(.5) within group (order by metric_value ) over (partition by metric) as median,
percentile_cont(.25) within group (order by metric_value ) over (partition by metric) as q1,
percentile_cont(0) within group (order by metric_value ) over (partition by metric) as min
from test ;
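For the other summary measures the question mentions (counts, mean, standard deviation, min, max), plain aggregate functions work directly; a minimal sketch against the test table above:
SELECT COUNT(metric_value)  AS n,
       AVG(metric_value)    AS mean,
       STDDEV(metric_value) AS sd,
       MIN(metric_value)    AS min_value,
       MAX(metric_value)    AS max_value
FROM test;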

PL/SQL Switching two columns from two tables

Suppose I have two tables (tblA and tblB) and want to switch the second column of each table (tblA.Grade and tblB.Grade) as shown:
+-------------------------------------+
| table a table b |
+-------------------------------------+
| name grade name grade |
| a 60 f 50 |
| b 45 g 70 |
| c 30 h 90 |
+-------------------------------------+
Now, I would like to switch the grade column from table a to table b and the grade column from table b to table a. The result should look like this:
+-----------------------------------------+
| table a table b |
+-----------------------------------------+
| name grade name grade |
| a 50 f 60 |
| b 70 g 45 |
| c 90 h 30 |
+-----------------------------------------+
I have created the tables, loaded them into collections using BULK COLLECT, and used the following code to complete the transformation:
insert into tblA values('a',60);
insert into tblA values('b',45);
insert into tblA values('c',30);
insert into tblb values('f',70);
insert into tblb values('g',80);
insert into tblb values('h',90);
DECLARE
   TYPE tbla_type IS TABLE OF tbla%ROWTYPE;
   l_tbla tbla_type;
   TYPE tblb_type IS TABLE OF tblb%ROWTYPE;
   l_tblb tblb_type;
BEGIN
   -- Load all rows of both tables at once, before any update touches them
   SELECT * BULK COLLECT INTO l_tbla FROM tbla;
   SELECT * BULK COLLECT INTO l_tblb FROM tblb;
   DBMS_OUTPUT.put_line (l_tblb.COUNT);
   FOR indx IN 1 .. l_tbla.COUNT
   LOOP
      DBMS_OUTPUT.put_line (l_tbla(indx).name);
      -- Each table receives the grade captured from the other one
      UPDATE tbla SET grade = l_tblb(indx).grade
       WHERE tbla.name = l_tbla(indx).name;
      UPDATE tblb SET grade = l_tbla(indx).grade
       WHERE tblb.name = l_tblb(indx).name;
   END LOOP;
END;
So, although I completed the task, I am wondering whether there is a simpler solution that I have not thought of. Does anyone know of one?
Note that there is no such thing as a first or second record in a database; there is no guarantee that the first record entered will be the first one returned, so there should always be an ORDER BY to decide what is first, second, etc.
So let's assume you want the records ordered by name, and to swap the grade of the smallest name of the first table with the grade of the smallest name of the second table.
Assuming you fix that ordering issue in your existing code, and if it works, I believe it would be faster than the way I would do it below, which is:
Create a temp table holding the names and grades of both tables, paired by name order.
The reason for using a temp table is mostly that if I later want to correct or revert the data, I can use the same temp table to reverse the merge.
create table tmp1 as
with ta as
(select t.* ,
row_number() over (order by name) as rnk
from tblA t)
,tb as
(select t.* ,
row_number() over (order by name) as rnk
from tblb t)
select ta.name as ta_name,ta.grade as ta_grade,
tb.name as tb_name,tb.grade as tb_grade
from ta inner join tb
on ta.rnk=tb.rnk
Output of tmp1
+---------+----------+---------+----------+
| TA_NAME | TA_GRADE | TB_NAME | TB_GRADE |
+---------+----------+---------+----------+
| a | 60 | f | 70 |
| b | 45 | g | 80 |
| c | 30 | h | 90 |
+---------+----------+---------+----------+
Then use MERGE to swap the values from tmp1.
merge into tbla t1
using tmp1 t
on (t1.name=t.ta_name)
when matched then update
set t1.grade=t.tb_grade;
merge into tblb t1
using tmp1 t
on (t1.name=t.tb_name)
when matched then update
set t1.grade=t.ta_grade;
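Before dropping it, tmp1 can double as a quick sanity check that the swap landed as intended:
SELECT t.ta_name, a.grade AS tbla_grade,
       t.tb_name, b.grade AS tblb_grade
FROM tmp1 t
JOIN tbla a ON a.name = t.ta_name
JOIN tblb b ON b.name = t.tb_name
ORDER BY t.ta_name;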
If satisfied with the result, drop the temp table later:
drop table tmp1;

IN statement from CASE result inside Where clause Oracle

Here is a sample of the table and of the problem I am trying to solve in Oracle.
CREATE TABLE mytable (
id_field number
,status_code number
,desc1 varchar2(15)
);
INSERT INTO mytable VALUES (1,240,'desc1');
INSERT INTO mytable VALUES (2,242,'desc1');
INSERT INTO mytable VALUES (3,241,'desc1');
INSERT INTO mytable VALUES (4,244,'desc1');
INSERT INTO mytable VALUES (5,240,'desc2');
INSERT INTO mytable VALUES (6,242,'desc2');
INSERT INTO mytable VALUES (7,245,'desc2');
INSERT INTO mytable VALUES (8,246,'desc2');
INSERT INTO mytable VALUES (9,246,'desc1');
INSERT INTO mytable VALUES (10,242,'desc1');
commit;
SELECT *
FROM mytable
WHERE status_code IN CASE WHEN desc1 = 'desc1' THEN (240,242)
WHEN desc1 = 'desc2' THEN (240,245)
END
Basically, I need to select a subset of status codes for each condition.
I could solve this with separate statements, but the actual table I am doing this on has multiple descriptions, which would result in around 20 unioned queries.
Is there any way to do this in one statement like I have attempted?
I believe that a CASE expression can only return one value (corresponding to one column in the result set). However, you can achieve this in your WHERE clause without using a CASE expression:
WHERE (desc1 = 'desc1' AND status_code IN (240,242)) OR
(desc1 = 'desc2' AND status_code IN (240,245))
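If the real table has around 20 descriptions, as mentioned, a variation that scales better than a long OR chain is to keep the allowed (desc1, status_code) pairs in a single row source and join to it; a sketch (the pair list below is made up to match the sample data):
SELECT m.*
FROM mytable m
JOIN (SELECT 'desc1' AS desc1, 240 AS status_code FROM dual UNION ALL
      SELECT 'desc1', 242 FROM dual UNION ALL
      SELECT 'desc2', 240 FROM dual UNION ALL
      SELECT 'desc2', 245 FROM dual) allowed
  ON m.desc1 = allowed.desc1
 AND m.status_code = allowed.status_code;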
I like Tim's answer better, but at least in Postgres you can do the following. I couldn't try it on Oracle.
SQL Fiddle DEMO
SELECT *
FROM mytable
WHERE CASE WHEN desc1 = 'desc1' THEN status_code IN (240,242)
WHEN desc1 = 'desc2' THEN status_code IN (240,245)
END
ORDER BY desc1
OUTPUT
| id_field | status_code | desc1 |
|----------|-------------|-------|
| 1 | 240 | desc1 |
| 2 | 242 | desc1 |
| 10 | 242 | desc1 |
| 5 | 240 | desc2 |
| 7 | 245 | desc2 |
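Note that Oracle's CASE expression must return a scalar rather than a boolean, so the form above is Postgres-only. To keep the CASE shape in Oracle, one workaround (a sketch, untested there) is to have each branch return a flag and compare it:
SELECT *
FROM mytable
WHERE CASE
        WHEN desc1 = 'desc1' AND status_code IN (240,242) THEN 1
        WHEN desc1 = 'desc2' AND status_code IN (240,245) THEN 1
        ELSE 0
      END = 1
ORDER BY desc1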

Performance tuning of the "ORDER BY" and "LIKE" clauses

I have 2 tables which both have many records (say TableA and TableB each have about 3,000,000 records). vr2_input is a varchar input parameter entered by the users, and I want to get the 200 TableA records with the largest "dateField" whose stringField is like 'vr2_input'. The 2 tables are joined as follows:
select * from(
select * from
TableA join TableB on TableA.id = TableB.id
where TableA.stringField like 'vr2_input' || '%'
order by TableA.dateField desc
) where rownum < 201
The query is slow. I googled this and found out that it is because "like" and "order by" involve a full table scan. However, I cannot find a solution to the problem. How can I tune this type of SQL? I have already created an index on TableA.stringField and TableA.dateField, but how can I make the SELECT statement use the index? The database is Oracle 10g. Thanks so much!
Update: I used iddqd's suggestion and selected only the fields that I want, then ran the explain plan. The query takes about 4 minutes to finish. IX_TableA_stringField is the name of the index on the TableA.srv_ref field. I ran the explain plan again without the hint, and it gives the same result.
EXPLAIN PLAN FOR
select * from(
select
/*+ INDEX(TableB IX_TableA_stringField)*/
TableA.id,
TableA.stringField,
TableA.dateField,
TableA.someField2,
TableA.someField3,
TableB.someField1,
TableB.someField2,
TableB.someField3
from TableA
join TableB on TableA.id=TableB.id
WHERE TableA.stringField like '21'||'%'
order by TableA.dateField desc
) where rownum < 201
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 871807846
--------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 200 | 24000 | 3293 (1)| 00:00:18 |
|* 1 | COUNT STOPKEY | | | | | |
| 2 | VIEW | | 1397 | 163K| 3293 (1)| 00:00:18 |
|* 3 | SORT ORDER BY STOPKEY | | 1397 | 90805 | 3293 (1)| 00:00:18 |
| 4 | NESTED LOOPS | | 1397 | 90805 | 3292 (1)| 00:00:18 |
| 5 | TABLE ACCESS BY INDEX ROWID| TableA | 1397 | 41910 | 492 (1)| 00:00:03 |
|* 6 | INDEX RANGE SCAN | IX_TableA_stringField | 1397 | | 6 (0)| 00:00:01 |
| 7 | TABLE ACCESS BY INDEX ROWID| TableB | 1 | 35 | 2 (0)| 00:00:01 |
|* 8 | INDEX UNIQUE SCAN | PK_TableB | 1 | | 1 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<201)
3 - filter(ROWNUM<201)
6 - access("TableA"."stringField" LIKE '21%')
filter("TableA"."stringField" LIKE '21%')
8 - access("TableA"."id"="TableB"."id")
You say it's taking about 4 minutes to run the query. The EXPLAIN PLAN output shows an estimate of 18 seconds. So the optimizer is probably far off on some of its estimates in this case. (It could still be choosing the best possible plan, but maybe not.)
The first step in a case like this is to get the actual execution plan and statistics. Run your query with the hint /*+ gather_plan_statistics */, then immediately afterwards execute select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST')).
This will show the actual execution plan that was run, and for each step it will show the estimated rows, actual rows, and actual time taken. Post the output here and maybe we can say something more meaningful about your issue.
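For example, a hypothetical run wrapping the query from the question (column list shortened):
SELECT /*+ gather_plan_statistics */ *
FROM (SELECT TableA.id, TableA.stringField, TableA.dateField, TableB.someField1
      FROM TableA JOIN TableB ON TableA.id = TableB.id
      WHERE TableA.stringField LIKE '21' || '%'
      ORDER BY TableA.dateField DESC)
WHERE rownum < 201;

SELECT * FROM table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));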
Without that information, my suggestion is to try out the following rewrite of the query. I believe it is equivalent since it appears that ID is the primary key of TableB.
select TableA.id,
       TableA.stringField,
       TableA.dateField,
       TableA.someField2,
       TableA.someField3,
       TableB.someField1,
       TableB.someField2,
       TableB.someField3
from (select *
      from (select TableA.id,
                   TableA.stringField,
                   TableA.dateField,
                   TableA.someField2,
                   TableA.someField3
            from TableA
            where TableA.stringField like '21' || '%'
            order by TableA.dateField desc)
      where rownum < 201
     ) TableA
join TableB on TableA.id = TableB.id
Do you need to select all columns (*)? The optimizer is more likely to full scan if you select all columns. If you need all columns in the output, you may be better off selecting just the id in your inline view and then joining back to select the other columns, which can be done with an index lookup. Try running an explain plan for both cases to see what the optimizer is doing.
Create indexes on the stringField and dateField columns. The SQL engine uses them automatically.
select id from (
    select /*+ INDEX(TableB stringField_indx) */ TableB.id
    from TableA join TableB on TableA.id = TableB.id
    where TableA.stringField like 'vr2_input' || '%'
    order by TableA.dateField desc
) where rownum < 201
next:
SELECT * FROM TableB WHERE id IN (<the ids from the first query>)
Please send the stats and the DDL of these tables.
If you have enough memory, you can hint the query to use a hash join. Could you please attach the explain plan?
How many records does TableA have? If it is the smaller table, you could select on that table and then loop through the results retrieving the TableB records, as both the select and the sort are on TableA.
A good experiment would be to remove the join and test the speed of that. Also, if allowed, can you put the rownum < 201 as an AND clause on the main query? It is probable that at the moment the query returns all rows to the outer query and only then gets trimmed.
To optimize the LIKE predicate, you can create an Oracle Text (CONTEXT) index and use the CONTAINS clause.
Look: http://docs.oracle.com/cd/B28359_01/text.111/b28303/ind.htm
A rough sketch, assuming Oracle Text is installed (the index name is made up):
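-- hypothetical index name; requires the Oracle Text option
CREATE INDEX ix_tablea_stringfield_txt ON TableA (stringField)
  INDEXTYPE IS CTXSYS.CONTEXT;

SELECT *
FROM TableA
WHERE CONTAINS(stringField, 'vr2_input%') > 0;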
You can create a function-based index on TableA that returns 1 or 0 depending on whether the condition TableA.stringField like 'vr2_input' || '%' is satisfied. That index can make the query run faster. The logic of the function would be:
if substr(TableA.stringField, 1, 9) = 'vr2_input'
then
   return 1;
else
   return 0;
end if;
Using actual column names instead of "*" may help. At the very least, column names common to both tables should be removed from the select list.
