Q: How to configure ClickHouse to return NULL instead of 0? - clickhouse

Let's say I have a table created as such without any record:
create table metric (date Int32) Engine=MergeTree ORDER BY (date);
If I run this query
select max(date) from metric;
ClickHouse returns
+-----------+
| max(date) |
+-----------+
| 0 |
+-----------+
1 row in set (0.02 sec)
instead of
+-----------+
| max(date) |
+-----------+
| NULL |
+-----------+
1 row in set (0.02 sec)
Is it possible to configure ClickHouse to return NULL without having to write a query like this:
select max(toNullable(date)) from metric;

Use setting aggregate_functions_null_for_empty:
SELECT max(date)
FROM metric
SETTINGS aggregate_functions_null_for_empty = 1
/*
┌─maxOrNull(date)─┐
│ ᴺᵁᴸᴸ │
└─────────────────┘
*/
or consider using OrNull-combinator:
SELECT maxOrNull(date)
FROM metric
/*
┌─maxOrNull(date)─┐
│ ᴺᵁᴸᴸ │
└─────────────────┘
*/
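If the setting should apply to a whole session rather than to a single query, it can also be set once up front. A minimal sketch, assuming the same empty metric table from the question (the min(date) column is added only for illustration):

```sql
-- Enable the setting for the current session only.
SET aggregate_functions_null_for_empty = 1;

-- Aggregates in later queries are rewritten with the -OrNull suffix,
-- so an empty table yields NULL instead of the type's default value.
SELECT max(date), min(date)
FROM metric;
```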

Related

Exclude rows based on condition from two columns

My question is very similar to this one, except that I want to exclude all rows whose Name has only one distinct value in another column.
Assume this is the input:
Name | Location
-------------------
Bob | Shasta
Bob | Leaves
Sean | Leaves
Sean | Leaves
Dylan | Shasta
Dylan | Redwood
Dylan | Leaves
I want the output to be
Name | Location
-------------------
Bob | Shasta
Bob | Leaves
Dylan | Shasta
Dylan | Redwood
Dylan | Leaves
In this case, Sean is being excluded because he always has the same location.
In SQL this can be done with a WHERE EXISTS subquery. How do we do this in ClickHouse?
Try this query:
SELECT Name, Location
FROM (
/* emulate the origin dataset */
SELECT test_data.1 AS Name, test_data.2 AS Location
FROM (
SELECT arrayJoin([
('Bob', 'Shasta'),
('Bob', 'Leaves'),
('Sean', 'Leaves'),
('Sean', 'Leaves'),
('Dylan', 'Shasta'),
('Dylan', 'Redwood'),
('Dylan', 'Leaves')]) AS test_data))
WHERE Name IN (
SELECT Name
FROM (
/* emulate the origin dataset */
SELECT test_data.1 AS Name, test_data.2 AS Location
FROM (
SELECT arrayJoin([
('Bob', 'Shasta'),
('Bob', 'Leaves'),
('Sean', 'Leaves'),
('Sean', 'Leaves'),
('Dylan', 'Shasta'),
('Dylan', 'Redwood'),
('Dylan', 'Leaves')]) AS test_data))
GROUP BY Name
HAVING uniq(Location) > 1)
/* result
┌─Name──┬─Location─┐
│ Bob │ Shasta │
│ Bob │ Leaves │
│ Dylan │ Shasta │
│ Dylan │ Redwood │
│ Dylan │ Leaves │
└───────┴──────────┘
*/
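Note that uniq returns an approximate distinct count. A hedged variant of the same filter, assuming the data lives in a hypothetical table named visits(Name, Location) and that an exact count is preferred, could use uniqExact instead:

```sql
-- visits is a hypothetical table name standing in for the original dataset.
SELECT Name, Location
FROM visits
WHERE Name IN (
    SELECT Name
    FROM visits
    GROUP BY Name
    HAVING uniqExact(Location) > 1  -- exact distinct count per Name
);
```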

Table scan on internal tables

Assume I have 2 tables, TABLE-1 and TABLE-2, and each table has 1 million rows with 10 columns and an index on col1.
Now I build an internal table on these 2 tables (1 + 1 = 2 million rows):
select * from
(select col1, col2,....col10 from table-1
union all
select col1, col2,....col10 from table-2) x
Questions:
How will the internal table be treated in Oracle, given that it is an internal table?
1. Will the internal table be treated as a table with an index on col1?
2. Will this be captured in the explain plan?
Yes and yes.
Oracle will effectively treat this inline view as a table. It can use predicate pushing to apply a filter on the inline view to the base tables, and potentially use an index. The explain plan will show this.
Tables, indexes, sample data, and statistics
create table table1(col1 number, col2 number, col3 number, col4 number);
create table table2(col1 number, col2 number, col3 number, col4 number);
create index table1_idx on table1(col1);
create index table2_idx on table2(col1);
insert into table1 select level, level, level, level
from dual connect by level <= 100000;
insert into table2 select level, level, level, level
from dual connect by level <= 100000;
commit;
begin
dbms_stats.gather_table_stats(user, 'TABLE1');
dbms_stats.gather_table_stats(user, 'TABLE2');
end;
/
Explain plan showing predicate pushing and index access
explain plan for
select * from
(
select col1, col2, col3, col4 from table1
union all
select col1, col2, col3, col4 from table2
)
where col1 = 1;
select * from table(dbms_xplan.display);
Plan hash value: 400235428
----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 40 | 2 (0)| 00:00:01 |
| 1 | VIEW | | 2 | 40 | 2 (0)| 00:00:01 |
| 2 | UNION-ALL | | | | | |
| 3 | TABLE ACCESS BY INDEX ROWID BATCHED| TABLE1 | 1 | 20 | 2 (0)| 00:00:01 |
|* 4 | INDEX RANGE SCAN | TABLE1_IDX | 1 | | 1 (0)| 00:00:01 |
| 5 | TABLE ACCESS BY INDEX ROWID BATCHED| TABLE2 | 1 | 20 | 2 (0)| 00:00:01 |
|* 6 | INDEX RANGE SCAN | TABLE2_IDX | 1 | | 1 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("COL1"=1)
6 - access("COL1"=1)
Notice how the predicates happen before the VIEW, and both indexes are used. By default everything should work as well as can be expected.
Notes
This type of query structure is called an inline view. Although a physical table is not built, the phrase "internal tables" is a good way of thinking about how the query logically works. Ideally, an inline view would work exactly like a pre-built table with the same data. In reality there are some cases where things don't quite work that way. But in general you are definitely on the right path: build a large query by assembling small inline views, and assume that Oracle will optimize it correctly.
For your particular query as written, no index will be used. But I suppose you do some filtering, i.e. where x.col1 = ###. I'm not sure that Oracle will be able to use the table-1/table-2 indexes to filter, so I suggest putting the where clauses inside the "union query".
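The suggestion above can be sketched as follows, reusing the table1/table2 names from the earlier example; the literal 1 stands in for whatever value is actually being filtered on:

```sql
-- Filter each branch directly so each table's col1 index can be considered,
-- instead of relying on the optimizer to push the predicate into the view.
select * from
(
  select col1, col2, col3, col4 from table1 where col1 = 1
  union all
  select col1, col2, col3, col4 from table2 where col1 = 1
) x;
```

Comparing the explain plan of both forms shows whether the manual rewrite makes any difference on a given installation.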

Oracle primary key vs. index NOT IN performance

I have the following use case:
A table stores the changed as well as the original data from a person. My query is designed to get only one row for each person: The changed data if there is some, else the original data.
I populated the table with 100k rows of data and 2k of changed data. When using a primary key on my table the query runs in less than a half second. If I put an index on the table instead of a primary key the query runs really slow. So I'll use the primary key, no doubt about that.
My question is: Why is the PK approach so much faster than the one with an index?
Code here:
drop table up_data cascade constraints purge;
/
create table up_data(
pk integer,
hp_nr integer,
up_nr integer,
ps_flag varchar2(1),
ps_name varchar2(100)
-- comment this out and uncomment the index below.
, constraint pk_up_data primary key (pk,up_nr)
);
/
-- insert some data
insert into up_data
select rownum, 1, 0, 'A', 'tester_' || to_char(rownum)
from dual
connect by rownum < 100000;
/
-- insert some changed data
-- change ps_flag = 'B' and mark it with a change number in up_nr
insert into up_data
select rownum, 1, 1, 'B', 'tester_' || to_char(rownum)
from dual
connect by rownum < 2000;
/
-- alternative(?) to the primary key
-- CREATE INDEX idx_up_data ON up_data(pk, up_nr);
/
The select statement looks like this:
select count(*)
from
(
select *
from up_data u1
where up_nr = 1
or (up_nr = 0
and pk not in (select pk from up_data where up_nr = 1)
)
) u
The statement might be target of optimization but for the moment it will stay like this.
When you create a primary key constraint, Oracle also creates an index to support it at the same time. A primary key index has a couple of important differences from a basic index, namely:
All the values in it are guaranteed to be unique
There are no nulls in the table rows (of the columns forming the PK)
These reasons are the key to the performance differences you see. Using your setup, I get the following query plans:
--fast version with PK
explain plan for
select count(*)
from
(
select *
from up_data u1
where up_nr = 1
or (up_nr = 0
and pk not in (select pk from up_data where up_nr = 1)
)
) u
/
select * from table(dbms_xplan.display(NULL, NULL,'BASIC +ROWS'));
-----------------------------------------------------
| Id | Operation | Name | Rows |
-----------------------------------------------------
| 0 | SELECT STATEMENT | | 1 |
| 1 | SORT AGGREGATE | | 1 |
| 2 | FILTER | | |
| 3 | INDEX FAST FULL SCAN| PK_UP_DATA | 103K|
| 4 | INDEX UNIQUE SCAN | PK_UP_DATA | 1 |
-----------------------------------------------------
alter table up_data drop constraint pk_up_data;
CREATE INDEX idx_up_data ON up_data(pk, up_nr);
/
--slow version with normal index
explain plan for
select count(*)
from
(
select *
from up_data u1
where up_nr = 1
or (up_nr = 0
and pk not in (select pk from up_data where up_nr = 1)
)
) u
/
select * from table(dbms_xplan.display(NULL, NULL,'BASIC +ROWS'));
------------------------------------------------------
| Id | Operation | Name | Rows |
------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 |
| 1 | SORT AGGREGATE | | 1 |
| 2 | FILTER | | |
| 3 | INDEX FAST FULL SCAN| IDX_UP_DATA | 103K|
| 4 | INDEX FAST FULL SCAN| IDX_UP_DATA | 1870 |
------------------------------------------------------
The big difference is that the fast version employs an INDEX UNIQUE SCAN, rather than an INDEX FAST FULL SCAN, in the second access of the table data.
From the Oracle docs (emphasis mine):
In contrast to an index range scan, an index unique scan must have
either 0 or 1 rowid associated with an index key. The database
performs a unique scan when a predicate references all of the columns
in a UNIQUE index key using an equality operator. An index unique scan
stops processing as soon as it finds the first record because no
second record is possible.
This optimization to stop processing proves to be a significant factor in this example. The fast version of your query:
Full scans ~103,000 index entries
For each one of these, finds one matching row in the PK index and stops processing the second index
The slow version:
Full scans ~103,000 index entries
For each one of these, performs another scan of the 103,000 rows to find whether there are any matches.
So to compare the work done:
With the PK, we have one fast full scan, then 103,000 lookups of one index value
With normal index, we have one fast full scan then 103,000 scans of 103,000 index entries - several orders of magnitude more work!
In this example, both the uniqueness of the primary key and the not null-ness of the index values are necessary to get the performance benefit:
-- create index as unique - we still get two fast full scans
drop index idx_up_data;
create unique index idx_up_data ON up_data(pk, up_nr);
explain plan for
select count(*)
from
(
select *
from up_data u1
where up_nr = 1
or (up_nr = 0
and pk not in (select pk from up_data where up_nr = 1)
)
) u
/
select * from table(dbms_xplan.display(NULL, NULL,'BASIC +ROWS'));
------------------------------------------------------
| Id | Operation | Name | Rows |
------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 |
| 1 | SORT AGGREGATE | | 1 |
| 2 | FILTER | | |
| 3 | INDEX FAST FULL SCAN| IDX_UP_DATA | 103K|
| 4 | INDEX FAST FULL SCAN| IDX_UP_DATA | 1870 |
------------------------------------------------------
-- now the columns are not null, we see the index unique scan
alter table up_data modify (pk not null, up_nr not null);
explain plan for
select count(*)
from
(
select *
from up_data u1
where up_nr = 1
or (up_nr = 0
and pk not in (select pk from up_data where up_nr = 1)
)
) u
/
select * from table(dbms_xplan.display(NULL, NULL,'BASIC +ROWS'));
------------------------------------------------------
| Id | Operation | Name | Rows |
------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 |
| 1 | SORT AGGREGATE | | 1 |
| 2 | FILTER | | |
| 3 | INDEX FAST FULL SCAN| IDX_UP_DATA | 103K|
| 4 | INDEX UNIQUE SCAN | IDX_UP_DATA | 1 |
------------------------------------------------------
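As a side note, NOT IN has to account for NULLs in the subquery, which appears to be part of why the NOT NULL constraints matter here. A hedged alternative, which is only equivalent when pk is never NULL, is to write the condition with NOT EXISTS and compare its plan with the ones above:

```sql
-- Sketch only: gives the same count as the NOT IN version when pk has no NULLs.
select count(*)
from up_data u1
where u1.up_nr = 1
   or (u1.up_nr = 0
       and not exists (select 1
                       from up_data u2
                       where u2.pk = u1.pk
                         and u2.up_nr = 1));
```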

Oracle Index with multiple Columns querying on single column

In our Oracle installation we have a table with an index on two of the columns (X and Y). If I do a query on the table with a where clause only touching column X, will Oracle be able to use the index?
For example:
Table Y:
Col_A,
Col_B,
Col_C,
Index exists on (Col_A, Col_B)
SELECT * FROM Table_Y WHERE Col_A = 'STACKOVERFLOW';
Will the index be used, or will a table scan be done?
It depends.
You could check it by letting Oracle explain the execution plan:
EXPLAIN PLAN FOR
SELECT * FROM Table_Y WHERE Col_A = 'STACKOVERFLOW';
and then
select * from table(dbms_xplan.display);
So, for example with
create table table_y (
col_a varchar2(30),
col_b varchar2(30),
col_c varchar2(30)
);
create unique index table_y_ix on table_y (col_a, col_b);
and then a
explain plan for
select * from table_y
where col_a = 'STACKOVERFLOW';
select * from table(dbms_xplan.display);
The plan (on my installation) looks like:
------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 51 | 1 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| TABLE_Y | 1 | 51 | 1 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | TABLE_Y_IX | 1 | | 1 (0)| 00:00:01 |
------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("COL_A"='STACKOVERFLOW')
ID 2 shows that the index TABLE_Y_IX is indeed used for an index range scan.
Whether Oracle chooses to use the index on another installation depends on many things. It's Oracle's query optimizer that makes this decision.
Update: If you feel you'd be better off (performance-wise, that is) if Oracle used the index, you might want to try the /*+ index_asc(...) */ hint (see index hint).
So in your case that would be something like
SELECT /*+ index_asc(TABLE_Y TABLE_Y_IX) */ *
FROM Table_Y
WHERE Col_A = 'STACKOVERFLOW';
Additionally, I would ensure that you have gathered statistics on the table and its columns. You can check the date of the last gathering of statistics with a
select last_analyzed from dba_tables where table_name = 'TABLE_Y';
and
select column_name, last_analyzed from dba_tab_columns where table_name = 'TABLE_Y';
If there are no statistics or if they're stale, make yourself familiar with the dbms_stats package to gather such statistics.
These statistics are the data that the query optimizer relies on heavily to make its decisions.
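A minimal sketch of gathering those statistics for the example table, assuming it lives in the current schema; cascade => true asks Oracle to gather statistics for the table's indexes as well:

```sql
begin
  dbms_stats.gather_table_stats(
    ownname => user,
    tabname => 'TABLE_Y',
    cascade => true
  );
end;
/
```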

Performance tuning about the "ORDER BY" and "LIKE" clause

I have 2 tables which have many records (say both TableA and TableB have about 3,000,000 records). vr2_input is a varchar input parameter entered by the users, and I want to get the 200 TableA records with the largest "dateField" whose stringField is like 'vr2_input'. The 2 tables are joined as follows:
select * from(
select * from
TableA join TableB on TableA.id = TableB.id
where TableA.stringField like 'vr2_input' || '%'
order by TableA.dateField desc
) where rownum < 201
The query is slow. I googled that and found out that it is because "like" and "order by" involve a full table scan. However, I cannot find a solution to the problem. How can I tune this type of SQL? I have already created an index on TableA.stringField and TableA.dateField, but how can I use the index feature in the select statement? The database is Oracle 10g. Thanks so much!!
Update: I used iddqd's suggestion, selected only the fields that I want, and ran the explain plan. It takes about 4 mins to finish the query. IX_TableA_stringField is the index name of the TableA.srv_ref field. I ran the explain plan again without the hint, and it still gives the same result.
EXPLAIN PLAN FOR
select * from(
select
/*+ INDEX(TableB IX_TableA_stringField)*/
TableA.id,
TableA.stringField,
TableA.dateField,
TableA.someField2,
TableA.someField3,
TableB.someField1,
TableB.someField2,
TableB.someField3
from TableA
join TableB on TableA.id=TableB.id
WHERE TableA.stringField like '21'||'%'
order by TableA.dateField desc
) where rownum < 201
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 871807846
--------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 200 | 24000 | 3293 (1)| 00:00:18 |
|* 1 | COUNT STOPKEY | | | | | |
| 2 | VIEW | | 1397 | 163K| 3293 (1)| 00:00:18 |
|* 3 | SORT ORDER BY STOPKEY | | 1397 | 90805 | 3293 (1)| 00:00:18 |
| 4 | NESTED LOOPS | | 1397 | 90805 | 3292 (1)| 00:00:18 |
| 5 | TABLE ACCESS BY INDEX ROWID| TableA | 1397 | 41910 | 492 (1)| 00:00:03 |
|* 6 | INDEX RANGE SCAN | IX_TableA_stringField | 1397 | | 6 (0)| 00:00:01 |
| 7 | TABLE ACCESS BY INDEX ROWID| TableB | 1 | 35 | 2 (0)| 00:00:01 |
|* 8 | INDEX UNIQUE SCAN | PK_TableB | 1 | | 1 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<201)
3 - filter(ROWNUM<201)
6 - access("TableA"."stringField" LIKE '21%')
filter("TableA"."stringField" LIKE '21%')
8 - access("TableA"."id"="TableB"."id")
You say it's taking about 4 minutes to run the query. The EXPLAIN PLAN output shows an estimate of 18 seconds. So the optimizer is probably far off on some of its estimates in this case. (It could still be choosing the best possible plan, but maybe not.)
The first step in a case like this is to get the actual execution plan and statistics. Run your query with the hint /*+ gather_plan_statistics */, then immediately afterwards execute select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST')).
This will show the actual execution plan that was run, and for each step it will show the estimated rows, actual rows, and actual time taken. Post the output here and maybe we can say something more meaningful about your issue.
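The two steps described above could look like this, using a shortened version of the query from the question. This is a sketch, not output from a real run; in SQL*Plus it generally helps to run set serveroutput off first, so that the last cursor is the query itself:

```sql
-- Step 1: run the query with the hint so row-source statistics are collected.
select /*+ gather_plan_statistics */ *
from (
  select TableA.id, TableA.stringField, TableA.dateField
  from TableA
  join TableB on TableA.id = TableB.id
  where TableA.stringField like '21' || '%'
  order by TableA.dateField desc
)
where rownum < 201;

-- Step 2: immediately afterwards, show estimated vs. actual rows per plan step.
select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));
```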
Without that information, my suggestion is to try out the following rewrite of the query. I believe it is equivalent since it appears that ID is the primary key of TableB.
select TableA.id,
TableA.stringField,
TableA.dateField,
TableA.someField2,
TableA.someField3,
TableB.someField1,
TableB.someField2,
TableB.someField3
from (select * from(
select
TableA.id,
TableA.stringField,
TableA.dateField,
TableA.someField2,
TableA.someField3
from TableA
WHERE TableA.stringField like '21'||'%'
order by TableA.dateField desc
)
where rownum < 201
) TableA
join TableB on TableA.id=TableB.id
Do you need to select all columns (*)? The optimizer will be more likely to full scan if you select all columns. If you need all columns in output you may be better to select the id in your inline view and then join back to select other columns, which could be done with an index lookup. Try running an explain plan for both cases to see what the optimizer is doing.
Create indexes on the stringField and dateField columns. The SQL engine uses them automatically.
select id from(
select /*+ INDEX(TableB stringField_indx)*/ TableB.id from
TableA join TableB on TableA.id = TableB.id
where TableA.stringField like 'vr2_input' || '%'
order by TableA.dateField desc
) where rownum < 201
Next:
SELECT * FROM TableB WHERE id IN (/* ids from the first query */)
Please send stats and DDL of these tables.
If you have enough memory you can hint the query to use a hash join. Could you please attach the explain plan?
How many records does TableA have? If it's the smaller table, could you do the select on that table and then loop through the results retrieving the TableB records, as both the select and the sort are on TableA?
A good experiment would be to remove the join and test the speed of that. Also, if allowed, can you put the rownum < 201 as an AND clause on the main query? It's probable at the moment that the query is returning all rows to the outer query and only then getting trimmed.
To optimize the like predicate, you can create a context index and use the contains clause.
Look: http://docs.oracle.com/cd/B28359_01/text.111/b28303/ind.htm
Thanks
You can create a function-based index on TableA. The function returns 1 or 0 depending on whether the condition TableA.stringField like 'vr2_input' || '%' is satisfied. That index will make the query run faster. The logic of the function will be
if substr(TableA.stringField, 1, 9) = 'vr2_input'
then
return 1;
else
return 0;
end if;
Using actual column names instead of "*" may help. At least common column names should be removed.

Resources