ClickHouse query takes a long time to execute with array joins and group by - performance

I have a table student which has over 90 million records. The create table query is as follows:
CREATE TABLE student
(
    id integer,
    student_id FixedString(15) NOT NULL,
    teacher_array Nested(
        teacher_id String,
        teacher_name String,
        teacher_role_id smallint
    ),
    subject_array Nested(
        subject_id String,
        subject_name String,
        subject_category_id smallint
    ),
    year integer NOT NULL
)
ENGINE = MergeTree()
PRIMARY KEY id
PARTITION BY year
ORDER BY id
SETTINGS index_granularity = 8192
The following query takes 5 seconds to execute:
SELECT
    count(distinct id) AS student_count,
    (
        SELECT count(distinct id)
        FROM student
        ARRAY JOIN teacher_array
        WHERE hasAny(subject_array.subject_category_id, [1, 2]) AND (teacher_array.teacher_role_id NOT IN (1))
    ) AS total_student_count,
    count(*) OVER () AS total_result_count,
    teacher_array.teacher_role_id AS teacher_id
FROM
(
    SELECT *
    FROM student
    ARRAY JOIN subject_array
)
ARRAY JOIN teacher_array
WHERE (subject_array.subject_category_id IN (1, 2)) AND (teacher_array.teacher_role_id NOT IN (1))
GROUP BY teacher_array.teacher_role_id
ORDER BY student_count DESC
LIMIT 0, 10
I expect the query to run within 500 milliseconds; is there any workaround for this? I tried using uniq and groupBitmap, but the execution time still comes out around 2 seconds.
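For reference, the uniq-based variant of the main aggregation (a sketch only; uniq gives an approximate distinct count, unlike count(distinct), and the query structure is kept the same as above) looks roughly like this:
SELECT uniq(id) AS student_count,
       teacher_array.teacher_role_id AS teacher_role_id
FROM
(
    SELECT *
    FROM student
    ARRAY JOIN subject_array
)
ARRAY JOIN teacher_array
WHERE (subject_array.subject_category_id IN (1, 2)) AND (teacher_array.teacher_role_id NOT IN (1))
GROUP BY teacher_array.teacher_role_id
ORDER BY student_count DESC
LIMIT 0, 10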

Related

ORDER BY BASED ON COLUMN

I have two tables, PRODUCTS and LOOKUP. Now I want to order the KEY column in the PRODUCTS table based on the key order stored in the LOOKUP table.
CREATE TABLE PRODUCTS
(
ID INT,
KEY VARCHAR(50)
)
INSERT INTO PRODUCTS
VALUES (1, 'EGHS'), (2, 'PFE'), (3, 'EGHS'),
(4, 'PFE'), (5, 'ABC')
CREATE TABLE LOOKUP (F_KEY VARCHAR(50))
INSERT INTO LOOKUP VALUES('PFE,EGHS,ABC')
Now I want to order the records in the PRODUCTS table based on the KEY values (PFE, EGHS, ABC) in the LOOKUP table.
Example output:
PRODUCTS
ID F_KEY
-----------
2 PFE
4 PFE
1 EGHS
3 EGHS
5 ABC
I tried this query, but it is not working:
SELECT *
FROM PRODUCTS
ORDER BY (SELECT F_KEY FROM LOOKUP)
You can split the string using XML: first convert the string to XML, replacing each comma with closing and opening XML tags. Once done, you can assign an incrementing number using ROW_NUMBER(), like the following.
;WITH cte
AS (SELECT dt,
           Row_number() OVER (ORDER BY (SELECT 1)) RN
    FROM (SELECT Cast('<X>' + Replace(F.f_key, ',', '</X><X>') + '</X>' AS XML) AS xmlfilter
          FROM [lookup] F) F1
    CROSS APPLY (SELECT fdata.d.value('.', 'varchar(500)') AS DT
                 FROM f1.xmlfilter.nodes('X') AS fdata(d)) O)
SELECT P.*
FROM products P
LEFT JOIN cte C
       ON C.dt = P.[key]
ORDER BY C.rn
Output:
ID F_KEY
-----------
2 PFE
4 PFE
1 EGHS
3 EGHS
5 ABC
You may do it like this:
SELECT ID, [KEY] FROM PRODUCTS
ORDER BY
CASE [KEY]
WHEN 'PFE' THEN 1
WHEN 'EGHS' THEN 2
WHEN 'ABC' THEN 3
END
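One caveat on the answer above: KEY values that are not listed in the CASE map to NULL and sort first; adding an ELSE pushes them to the end (a small variation, not part of the original answer):
SELECT ID, [KEY] FROM PRODUCTS
ORDER BY
CASE [KEY]
WHEN 'PFE' THEN 1
WHEN 'EGHS' THEN 2
WHEN 'ABC' THEN 3
ELSE 4
END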

SQL delete rows not in another table

I'm looking for a good SQL approach (Oracle database) to fulfill the following requirements:
Delete rows from Table A that are not present in Table B.
Both tables have an identical structure.
Some fields are nullable.
The number of columns and rows is huge (more than 100k rows and 20-30 columns to compare).
Every single field of every single row needs to be compared from Table A against Table B.
This requirement comes from a process that must run every day, as the changes will come from Table B.
In other words: Table A MINUS Table B => delete those records from Table A.
delete from Table A
where (field1, field2, field3) in
(select field1, field2, field3
from Table A
minus
select field1, field2, field3
from Table B);
It's very important to mention that a plain MINUS within the DELETE fails, as it does not take the NULLs in nullable fields into consideration (a NULL comparison is unknown for Oracle, so there is no match).
I also tried EXISTS with success, but I had to use the NVL function to replace the NULLs with dummy values, which I don't want, as I cannot guarantee that the dummy value used in NVL will never appear as a valid value in the field.
Does anybody know a way to accomplish this? Please remember that performance and nullable fields are "a must".
Thanks ever
decode finds sameness (even if both values are null):
decode( field1, field2, 1, 0 ) = 1
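For illustration, a quick sanity check of this NULL handling (a minimal sketch against Oracle's DUAL table):
select decode(null, null, 1, 0) as both_null,     -- 1: two NULLs are treated as the same
       decode(1, null, 1, 0)    as value_vs_null  -- 0: a value never matches NULL
from dual;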
To delete rows in table1 not found in table2:
delete table1 t
where t.rowid in (select t1.rowid
from table1 t1
left outer join table2 t2
on decode(t1.field1, t2.field1, 1, 0) = 1
and decode(t1.field2, t2.field2, 1, 0) = 1
and decode(t1.field3, t2.field3, 1, 0) = 1
/* ... */
where t2.rowid is null /* no matching row found */
)
To use existing indexes, join on the indexed columns directly and handle NULLs explicitly:
...
left outer join table2 t2
on (t1.index_field1=t2.index_field1 or
t1.index_field1 is null and t2.index_field1 is null)
and ...
Use a left outer join and test for null in your where clause
delete a
from a
left outer join b on a.x = b.x
where b.x is null
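Note that Oracle does not accept this DELETE ... JOIN syntax; in Oracle the same idea is usually written with a correlated subquery, for example (a sketch using NOT EXISTS rather than a join; like the join version, it still needs the NULL handling discussed above):
delete from a
where not exists (select 1 from b where b.x = a.x);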
Have you considered the Oracle SQL MERGE statement?
Use a bulk operation for a huge number of records; performance-wise it will be faster.
And use a join between the two tables to get the rows to be deleted. Nullable columns can be compared with some default value.
Also, if you want Table A to end up the same as Table B, why don't you truncate Table A and then insert the data from Table B?
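If a full daily reset is acceptable, that truncate-and-reload approach is as simple as the following (a sketch, assuming table_a and table_b have identical column lists and no enabled foreign keys block the truncate):
truncate table table_a;
insert into table_a select * from table_b;
commit;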
Assuming you have the same PK field available on each table... (Having a PK or some other unique key is critical for this.)
create table table_a (id number, name varchar2(25), dob date);
insert into table_a values (1, 'bob', to_date('01-01-1978','MM-DD-YYYY'));
insert into table_a values (2, 'steve', null);
insert into table_a values (3, 'joe', to_date('05-22-1989','MM-DD-YYYY'));
insert into table_a values (4, null, null);
insert into table_a values (5, 'susan', to_date('08-08-2005','MM-DD-YYYY'));
insert into table_a values (6, 'juan', to_date('11-17-2001', 'MM-DD-YYYY'));
create table table_b (id number, name varchar2(25), dob date);
insert into table_b values (1, 'bob', to_date('01-01-1978','MM-DD-YYYY'));
insert into table_b values (2, 'steve',to_date('10-14-1992','MM-DD-YYYY'));
insert into table_b values (3, null, to_date('05-22-1989','MM-DD-YYYY'));
insert into table_b values (4, 'mary', to_date('12-08-2012','MM-DD-YYYY'));
insert into table_b values (5, null, null);
commit;
-- confirm minus is working
select id, name, dob
from table_a
minus
select id, name, dob
from table_b;
-- from the minus, re-query to just get the key, then delete by key
delete table_a where id in (
select id from (
select id, name, dob
from table_a
minus
select id, name, dob
from table_b)
);
commit;
select * from table_a;
But if, at some point in time, table_a is to be reset to match table_b, why not, as another answer suggested, truncate table_a and insert everything from table_b?
100K rows is not huge. I can do a ~100K-row truncate and insert on my laptop instance in less than 1 second.
DELETE FROM purchase WHERE clientcode NOT IN (
    SELECT clientcode FROM client );
This deletes the rows from the purchase table whose clientcode is not in the client table. The clientcode of the purchase table references the clientcode of the client table.
DELETE FROM TABLE1 WHERE FIELD1 NOT IN (SELECT CLIENT1 FROM TABLE2);
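Be aware that if CLIENT1 can be NULL, NOT IN matches nothing at all, so a NULL guard is needed (a small addition to the statement above):
DELETE FROM TABLE1 WHERE FIELD1 NOT IN (SELECT CLIENT1 FROM TABLE2 WHERE CLIENT1 IS NOT NULL);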

How to select distinct keywords from database without specifying any keywords

I am saving multiple keywords, separated by commas, in a database column,
ex: aaaa, bbbb, cccc ....
There are many rows:
ex:
row 1 = aaaa, bbbb
row 2 = aaaa,cccc,ddddd..
etc.
I would like to obtain an array with all the different keywords (no duplicates).
Thank you in advance!
Create the tables:
CREATE TABLE tablename (
id INT,
name VARCHAR(20));
INSERT INTO tablename VALUES
(1, 'aaaa,bbbb'),
(2, 'aaaa,cccc,dddd');
CREATE TABLE numbers (
n INT PRIMARY KEY);
INSERT INTO numbers VALUES (1),(2),(3),(4),(5),(6);
Then run this query:
SELECT group_concat(name) as result
FROM
(
    SELECT distinct name
    FROM
    (
        SELECT
            tablename.id,
            SUBSTRING_INDEX(SUBSTRING_INDEX(tablename.name, ',', numbers.n), ',', -1) name
        FROM
            numbers INNER JOIN tablename
            ON CHAR_LENGTH(tablename.name)
               - CHAR_LENGTH(REPLACE(tablename.name, ',', '')) >= numbers.n - 1
        ORDER BY
            id, n
    ) tab1
) tab2;
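With the sample rows above, the distinct keywords are aaaa, bbbb, cccc and dddd, so the query should return a single comma-separated value (the order of the concatenated values is not guaranteed):
result
-------------------
aaaa,bbbb,cccc,dddd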

Fetch single record from duplicate rows from oracle table

I have a table user_audit_records_tbl which has multiple rows for a single user. Every time a user logs in, one entry is made into this table, so I want a select query which will fetch the latest single record for each user. I have a query which uses an IN clause.
Table Name : user_audit_records_tbl
Record_id Number Primary Key,
user_id varchar Primary Key ,
user_ip varchar,
.
.
etc
The current query I am using is:
select *
from user_audit_records_tbl
where record_id in (select max(record_id)
                    from user_audit_records_tbl
                    group by user_id);
but I was just wondering if anybody has a better solution for this, since this table has a huge volume of data.
You can use the FIRST/LAST aggregate function:
select max(Record_id) as Record_id,
user_id,
max(user_ip) keep (dense_rank last order by record_id) as user_ip,
...
from user_audit_records_tbl
group by user_id
Not sure if it will be more efficient.
EDIT: As the above query is less efficient, maybe you could try an EXISTS clause:
select *
from user_audit_records_tbl A
where exists ( select 1
from user_audit_records_tbl B
where A.user_id = B.user_id
group by B.user_id
having max(B.record_id) = A.record_id
)
But maybe you should look at the index side instead of the query side.
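Another option is an analytic query: number the rows per user with row_number(), newest record first, and keep only the first row of each partition: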
select *
from ( select row_number() over ( partition by user_id order by record_id desc) row_nr,
a.*
from user_audit_records_tbl a
)
where row_nr = 1
;

Compare two tables in Hive without apply JOINS

I have 2 tables, TableA and TableB, both having the same set of columns, C1 and C2. Now I need to check whether both tables have the same DATA or not. How do you do this without using a JOIN? I tried the MINUS operator, i.e.,
SELECT * FROM TableA
MINUS
SELECT * FROM TableB
But this is not supported in Hive. Maybe Impala has this set operator?
Please suggest how to do it without JOINs. Thanks.
You can try with
SELECT *
FROM T1
WHERE NOT EXISTS (SELECT * FROM T2 WHERE T1.X = T2.Y)
where T1.X = T2.Y is the "key" condition (the columns that identify a matching row).
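Applied to the tables in the question, that becomes something like the following (a sketch, assuming a Hive version that supports correlated NOT EXISTS subqueries; run it in both directions to confirm the tables match):
SELECT a.*
FROM TableA a
WHERE NOT EXISTS (SELECT 1 FROM TableB b WHERE a.C1 = b.C1 AND a.C2 = b.C2);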
create table student
(
id integer,
subject string,
total_score integer
);
insert into student
(id, subject, total_score)
values
(1, 'math', 90);
insert into student
(id, subject, total_score)
values
(1, 'science', 100);
insert into student
(id, subject, total_score)
values
(2, 'math', 90);
insert into student
(id, subject, total_score)
values
(2, 'science', 80);
---------- MINUS ---------
select id,subject,
total_score
from ( select max (id) id,
subject,
total_score,
count (*)
from (
select *
from student
where id = 1
union all
select *
from student
where id = 2
) merged_data
group by subject, total_score
having count (*) = 1
) minus_data
where id is not null;
id subject total_score
2 science 80
1 science 100
