How can i use groupBitmapAnd with AggregatingMergeTree engine in clickhouse? - clickhouse

I'd like to maintain a table in clickhouse which use bitmap "AND" logic to merge bitmaps accross the rows with the same tag_id.
Since bitmap is defined as AggregateFunction(groupBitmap, UInt*) in clickhouse and groupBitmapAnd takes bitmap as its argument, i created a table like:
CREATE TABLE test.bitmap_column_expr_test2
(
`tag_id` String,
`z` AggregateFunction(groupBitmapAnd, AggregateFunction(groupBitmap, UInt32))
)
ENGINE = AggregatingMergeTree()
ORDER BY tag_id
SETTINGS index_granularity = 8192
In my case, i want to insert data like:
INSERT INTO test.bitmap_column_expr_test2 VALUES ('tag1', groupBitmapAndState(bitmapBuild(cast([1,2,3,4] as Array(UInt32)))));
INSERT INTO test.bitmap_column_expr_test2 VALUES ('tag1', groupBitmapAndState(bitmapBuild(cast([1] as Array(UInt32)))));
INSERT INTO test.bitmap_column_expr_test2 VALUES ('tag1', groupBitmapAndState(bitmapBuild(cast([1,3,4] as Array(UInt32)))));
And i expect to get the bitmap AND result by:
SELECT bitmapToArray(groupBitmapAndMergeState(z)) FROM test.bitmap_column_expr_test2;
However, my ddl is rewrited by clickhouse as:
CREATE TABLE test.bitmap_column_expr_test2
(
`tag_id` String,
`z` AggregateFunction(groupBitmap, AggregateFunction(groupBitmap, UInt32))
)
ENGINE = AggregatingMergeTree()
ORDER BY tag_id
which losts its origin definition of column z
Besides, the insertion will end up with exception:
DB::Exception: Aggregate function groupBitmapAndState(bitmapBuild(CAST([1, 2, 3, 4], 'Array(UInt32)'))) is found in wrong place in query: While processing groupBitmapAndState(bitmapBuild(CAST([1, 2, 3, 4], 'Array(UInt32)'))) (version 20.11.4.13 (official build))
I am not sure if i am doing the right thing to get bitmap accross the rows merged by "AND" logic in AggregatingMergeTree.

groupBitmapAnd accepts as a parameter the bitmap represented as AggregateFunction(groupBitmap, UInt_) type.
ClickHouse doesn't support nested aggregation types (in this case it be AggregateFunction(groupBitmapAnd, AggregateFunction(groupBitmap, UInt_))).
So it needs to follow the pattern described in the official docs - https://clickhouse.tech/docs/en/sql-reference/aggregate-functions/reference/groupbitmapand/#groupbitmapand:
CREATE TABLE bitmap_column_expr_test2
(
tag_id String,
z AggregateFunction(groupBitmap, UInt32)
)
ENGINE = MergeTree
ORDER BY tag_id;
INSERT INTO bitmap_column_expr_test2 VALUES ('tag1', bitmapBuild(cast([1,2,3,4,5,6,7,8,9,10] as Array(UInt32))));
INSERT INTO bitmap_column_expr_test2 VALUES ('tag2', bitmapBuild(cast([6,7,8,9,10,11,12,13,14,15] as Array(UInt32))));
INSERT INTO bitmap_column_expr_test2 VALUES ('tag3', bitmapBuild(cast([2,4,6,8,10,12] as Array(UInt32))));
SELECT groupBitmapAnd(z) FROM bitmap_column_expr_test2 WHERE like(tag_id, 'tag%');
┌─groupBitmapAnd(z)─┐
│ 3 │
└───────────────────┘

Related

Create a generic DB table

I am having multiple products and each of them are having there own Product table and Value table. Now I have to create a generic screen to validate those product and I don't want to create validated table for each Product. I want to create a generic table which will have all the Products details and one extra column called ProductIdentifier. but the problem is that here in this generic table I may end up putting millions of records and while fetching the data it will take time.
Is there any other better solution???
"Millions of records" sounds like a VLDB problem. I'd put the data into a partitioned table:
CREATE TABLE myproducts (
productIdentifier NUMBER,
value1 VARCHAR2(30),
value2 DATE
) PARTITION BY LIST (productIdentifier)
( PARTITION p1 VALUES (1),
PARTITION p2 VALUES (2),
PARTITION p5to9 VALUES (5,6,7,8,9)
);
For queries that are dealing with only one product, specify the partition:
SELECT * FROM myproducts PARTITION FOR (9);
For your general report, just omit the partition and you get all numbers:
SELECT * FROM myproducts;
Documentation is here:
https://docs.oracle.com/en/database/oracle/oracle-database/12.2/vldbg/toc.htm

Define index for sparse column

I have a table with a columns 'A' and 'B'.
'A' is a column with 90% 'null' and 10% different values , and most of the time I query to have record with one or two of these different values.
and 'B' is a column with 90% value='1' and 10% different values and most of the time I query to have record with one or two of these different values.
In this table we have DML transaction most of the time.
now , I don't know define index on these columns is good? if yes which type of index?
In principle Bitmap Index would be the best in such situation. However, due to mulit-user environment they are not suitable - you would slow down your application significantly by table locks and perhaps get even dead-locks.
Maybe you can optimize your application by smart partitioning and usage of Partial Indexes (new feature in Oracle 12c)
CREATE TABLE statements below should be equivalent.
CREATE TABLE YOUR_TABLE (a INTEGER, b INTEGER, ... more COLUMNS)
PARTITION BY LIST (a) SUBPARTITION BY LIST (b) (
PARTITION part_a_NULL VALUES (NULL) (
SUBPARTITION part_a_NULL_b_1 VALUES (1) INDEXING OFF,
SUBPARTITION part_a_NULL_b_other VALUES (DEFAULT) INDEXING ON
),
PARTITION part_a_others VALUES (DEFAULT) (
SUBPARTITION part_a_others_b_1 VALUES (1) INDEXING OFF,
SUBPARTITION part_a_others_b_other VALUES (DEFAULT) INDEXING ON
)
);
CREATE TABLE YOUR_TABLE (a INTEGER, b INTEGER, ... more COLUMNS)
PARTITION BY LIST (a) SUBPARTITION BY LIST (b)
SUBPARTITION TEMPLATE (
SUBPARTITION b_1 VALUES (1) INDEXING OFF,
SUBPARTITION b_other VALUES (DEFAULT) INDEXING ON
)
(
PARTITION part_a_NULL VALUES (NULL),
PARTITION part_a_others VALUES (DEFAULT)
);
CREATE INDEX IND_A ON YOUR_TABLE (A) LOCAL INDEXING PARTIAL;
CREATE INDEX IND_B ON YOUR_TABLE (B) LOCAL INDEXING PARTIAL;
By this your index will consume only 10% of entire tablespace. If your WHERE condition is WHERE A IS NULL or WHERE B = 1 then Oracle optimizer would skip such indexes anyway.
Verify with this query
SELECT table_name, partition_name, subpartition_name, indexing
FROM USER_TAB_SUBPARTITIONS
WHERE table_name = 'YOUR_TABLE';
if INDEXING is used on desired subpartitions.
Update
I just see actually this is an overkill because NULL values on column A do not create any index entry anyway. So, it can be simplified to
CREATE TABLE YOUR_TABLE (a INTEGER, b INTEGER, ... more COLUMNS)
PARTITION BY LIST (b) (
PARTITION part_b_1 VALUES (1) INDEXING OFF,
PARTITION part_b_other VALUES (DEFAULT) INDEXING ON
);
For example, if you have index a_b_idx on A, B (in that order):
a) select ... from ... where A = ... will use index
b) select ... from ... where B = ... will not use index
On the other side, if you have index b_a_idx on B, A:
a) select ... from ... where A = ... will not use index
b) select ... from ... where B = ... will use index
Oracle can't use second column in index if it doesn't filter on first column, since in regular cases index is tree-like structure: column1->column2->column3->etc.
You need index on column A only or on columns A, B if you do queries like a).
You need index on column B only or on columns B, A if you do queries like b).
Oracle doesn't store all-null values in index, but it can store null value for A if B contains non-null value.
Sometimes it's more fruitful to read whole table into memory and ignore index. Optimizer can do it if possible result set is big and it goes for all records, since index-to-record transition costs more than simple records read.
Also sometimes it happens erroneously for tables without statistics, so you either need jobs with alter table ... compute statistics or oracle 11+ that can compute statistics like this without jobs.
Most of the times, another index is good thing for queries, but bad thing for updates/disk. Each index takes disk space and each update of record(s) makes updates to every index. So for heavily updated tables it's not good to have many indexes, but for frequently queried tables it's better to have indexes covering all common cases.
For most flat queries (without joins/subqueries/hierarchy) only 1 index is used, so having indexes for each column is generally just a waste of disk space. You need multicolumn index to optimize where A=... and B=...
As for index type, you probably need simple non-unique indexes.
Column A
Let assume that you create an index named _columnA_index_. In general, indexes in RDBMS would not include NULL values, which means there is no index entries in _columnA_index_ pointing to records having NULL values. Thus, the following query
Q1: select * from MyTable where A is null;
will result in a table scan instead ( or DBMS opts to use another index on another column if any)
However, since there is 10% of records having 'different values', the _columnA_index_ will of course help for queries, for example.
Q2: select * from MyTable where A = '123';
In the above example, if the query returns < 1% of the records, the _columnA_index_ is helpful. Depending on how selective the query is, the index greatly improves the performance. You can create an index that is suitable for datatype of column A.
Column B
Similarly, an index on B will not help
Q3: select * from MyTable where B = 1;
but it will help with different values
Q4: select * from MyTable where B = '456';
NULL values
So far, I answered that any index does not help with NULL values. Therefore, if you need to query Q1 most of the time, I suggest the following ideas
Make sure that your version of DBMS does support NULL values be included in indexes. For example Oracle 11g does but not versions before that.
Plan to create function-based index here, again with Oracle. But you can take the idea at least.
Redesign the logic of your application / your need to do querying on Null values. I prefer this approach.

Accelerate SQLite Query

I'm currently learning SQLite (called by Python).
According to my previous question (Reorganising Data in SQLLIte), I want to store multiple time series (Training data) in my database.
I have defined the following fields:
CREATE TABLE VARLIST
(
VarID INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT UNIQUE NOT NULL
)
CREATE TABLE DATAPOINTS
(
DataID INTEGER PRIMARY KEY,
timeID INTEGER,
VarID INTEGER,
value REAL
)
CREATE TABLE TIMESTAMPS
(
timeID INTEGER PRIMARY KEY AUTOINCREMENT,
TRAININGS_ID INT,
TRAINING_TIME_SECONDS FLOAT
)
VARLIST has 8 entries, TIMESTAMPS 1e5 entries and DATAPOINTS around 5e6.
When I now want to extract data for a given TrainingsID and VarID, I try it like:
SELECT
(SELECT TIMESTAMPS.TRAINING_TIME_SECONDS
FROM TIMESTAMPS
WHERE t.timeID = timeID) AS TRAINING_TIME_SECONDS,
(SELECT value
FROM DATAPOINTS
WHERE DATAPOINTS.timeID = t.timeID and DATAPOINTS.VarID = 2) as value
FROM
(SELECT timeID
FROM TIMESTAMPS
WHERE TRAININGS_ID = 96) as t;
The command EXPLAIN QUERY PLAN delivers:
0|0|0|SCAN TABLE TIMESTAMPS
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE TIMESTAMPS USING INTEGER PRIMARY KEY (rowid=?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 2
2|0|0|SCAN TABLE DATAPOINTS
This basically works.
But there are two problems:
Minor problem: If there is a timeID where no data for the requested VarID is availabe, I get an line with the valueNone`.
I would prefer this line to be skipped.
Big problem: the search is incredibly slow (approx 5 minutes using http://sqlitebrowser.org/).
How do I best improve the performance?
Are there better ways to formulate the SELECT command, or should I modify the database structure itself?
Ok, based on the hints I have got I could extremly accelerate the search by applieng INDEXES as:
CREATE INDEX IF NOT EXISTS DP_Index on DATAPOINTS (VarID,timeID,DataID);
CREATE INDEX IF NOT EXISTS TS_Index on TIMESTAMPS(TRAININGS_ID,timeID);
The EXPLAIN QUERY PLAN output now reads as:
0|0|0|SEARCH TABLE TIMESTAMPS USING COVERING INDEX TS_Index (TRAININGS_ID=?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE TIMESTAMPS USING INTEGER PRIMARY KEY (rowid=?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 2
2|0|0|SEARCH TABLE DATAPOINTS USING INDEX DP_Index (VarID=? AND timeID=?)
Thanks for your comments.

Hive - clustered by and sorted by returns unsorted result

I created a table with buckets, clustered by, and sorted by like this:
set hive.enforce.bucketing = true;
set mapred.reduce.tasks = 32;
create table my_table
(
a string
, b string
, c int
)
clustered by (a, b) sorted by (a, b) into 32 buckets
row FORMAT delimited fields TERMINATED BY ','
stored AS parquet
tblproperties ('parquet.compress'='SNAPPY');
The data was inserted like this:
insert into table my_table select * from old_table;
When I query:
select * from my_table limit 100;
I get unsorted results. Does it mean that the table is not sorted?
Will sorted merge join on this table work if I'll join this table with other table on a and b?
BDW:
When I query the old_table (for which the insert into included distribute by and sort by) like this: select * from old_table limit 100; I get sorted results
While you have set Hive to enforce bucketing, you need to also set Hive to enforce sorting before inserting the data into your new table, "my_table". Try this:
hive.enforce.sorting=true;
Then you should get sorted results in the new table.

T-SQL - wrong query execution plan behaviour

One of our queries degraded after generating load on the DB.
Our query is a join between 3 tables:
Base table which contain 10 M rows.
EventPerson table which contain 5000 rows.
EventPerson788 which is empty.
It seems that the optimizer scans the index on the EventPerson instead of seek, this the script for replicating the issue:
--Create Tables
CREATE TABLE [dbo].[BASE](
[ID] [bigint] NOT NULL,
[IsActive] BIT
PRIMARY KEY CLUSTERED ([ID] ASC)
)ON [PRIMARY]
GO
CREATE TABLE [dbo].[EventPerson](
[DUID] [bigint] NOT NULL,
[PersonInvolvedID] [bigint] NULL,
PRIMARY KEY CLUSTERED ([DUID] ASC)
) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [EventPerson_IDX] ON [dbo].[EventPerson]
(
[PersonInvolvedID] ASC
)
CREATE TABLE [dbo].[EventPerson788](
[EntryID] [bigint] NOT NULL,
[LinkedSuspectID] [bigint] NULL,
[sourceid] [bigint] NULL,
PRIMARY KEY CLUSTERED ([EntryID] ASC)
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[EventPerson788] WITH CHECK
ADD CONSTRAINT [FK7A34153D3720F84A]
FOREIGN KEY([sourceid]) REFERENCES [dbo].[EventPerson] ([DUID])
GO
ALTER TABLE [dbo].[EventPerson788] CHECK CONSTRAINT [FK7A34153D3720F84A]
GO
CREATE NONCLUSTERED INDEX [EventPerson788_IDX]
ON [dbo].[EventPerson788] ([LinkedSuspectID] ASC)
GO
--POPOLATE BASE TABLE
DECLARE #I BIGINT=1
WHILE (#I<10000000)
BEGIN
begin transaction
INSERT INTO BASE(ID) VALUES(#I)
SET #I+=1
if (#I%10000=0 )
begin
commit;
end;
END
go
--POPOLATE EventPerson TABLE
DECLARE #I BIGINT=1
WHILE (#I<5000)
BEGIN
BEGIN TRANSACTION
INSERT INTO EventPerson(DUID,PersonInvolvedID) VALUES(#I,(SELECT TOP 1 ID FROM BASE ORDER BY NEWID()))
SET #I+=1
IF(#I%10000=0 )
COMMIT TRANSACTION ;
END
GO
This the query :
select
count(EventPerson.DUID)
from
EventPerson
inner loop join
Base on EventPerson.DUID = base.ID
left outer join
EventPerson788 on EventPerson.DUID = EventPerson788.sourceid
where
(EventPerson.PersonInvolvedID = 37909 or
EventPerson788.LinkedSuspectID = 37909)
AND BASE.IsActive = 1
Do you have any idea why the optimizer decides to use index scan instead of index seek?
Workaround that already done :
Analyze tables and build statistics.
Rebuild Indices.
Try the FORCESEEK hint
None of the above persuaded the optimizer to run an index seek on EventPerson and seek on the base tables.
Thanks for your help .
The scan is there because of the or condition and the outer join against EventPerson788.
Either it will return rows from EventPerson when EventPerson.PersonInvolvedID = 37909 or when the there exists rows in EventPerson788 where EventPerson788.LinkedSuspectID = 37909. The last part means that every row in EventPerson has to be checked against the join.
The fact that EventPerson788 is empty can not be used by the query optimizer since the query plan is saved to be reused later when there might be matching rows in EventPerson788.
Update:
You can rewrite your query using a union all instead of or to get a seek in EventPerson.
select count(EventPerson.DUID)
from
(
select EventPerson.DUID
from EventPerson
where EventPerson.PersonInvolvedID = 1556 and
not exists (select *
from EventPerson788
where EventPerson788.LinkedSuspectID = 1556)
union all
select EventPerson788.sourceid
from EventPerson788
where EventPerson788.LinkedSuspectID = 1556
) as EventPerson
inner join BASE
on EventPerson.DUID=base.ID
where
BASE.IsActive=1
Well, you're asking SQL Server to count the rows of the EventPerson table - so why do you expect a seek to be better than a scan here?
For a COUNT, the SQL Server optimizer will almost always use a scan - it needs to count the rows, after all - all of them... it will do a clustered index scan, if no other non-nullable columns are indexed.
If you have an index on a small, non-nullable column (e.g. on a ID INT or something like that), it would probably do a scan on that index instead (less data to read to count all rows).
But in general: seek is great for selecting one or a few rows - but it sucks if you're dealing with all rows (like for a count)
You can easily observe this behavior if you're using the AdventureWorks sample database.
When doing a COUNT(*) on the Sales.SalesOrderDetail table which has over 120000 rows like this:
SELECT COUNT(*) FROM Sales.SalesOrderDetail
then you'll get an index scan on IX_SalesOrderDetail_ProductID - it just doesn't pay off to do seeks on over 120000 entries!
However, if you do the same operation on a smaller set of data, like this:
SELECT COUNT(*) FROM Sales.SalesOrderDetail
WHERE ProductID = 897
then you get back 2 rows out of all of them - and SQL Server will now use an index seek on that same index.

Resources