ClickHouse: DB::Exception: Memory limit (for query) exceeded

What should I do when a ClickHouse query runs out of memory? I can't just keep raising the memory limit, since memory itself is limited. How do I configure ClickHouse to use the hard disk instead?
SELECT
UserID,
Title
FROM
(
SELECT
L.UserID,
L.Title
FROM tutorial.hits_v1 AS L
INNER JOIN tutorial.hits_v2 AS R ON L.UserID = R.UserID
) AS T
ORDER BY UserID ASC
LIMIT 10
#user.d/abc.xml
<?xml version="1.0"?>
<yandex>
<!-- Profiles of settings. -->
<profiles>
<!-- Default settings. -->
<default>
<!-- Maximum memory usage for processing single query, in bytes. -->
<max_memory_usage>350000000</max_memory_usage>
<max_memory_usage_for_user>350000000</max_memory_usage_for_user>
<max_bytes_before_external_group_by>100000000</max_bytes_before_external_group_by>
<max_bytes_before_external_sort>100000000</max_bytes_before_external_sort>
</default>
</profiles>
</yandex>

Avoid using a huge table as the right table of a JOIN: "ClickHouse takes the <right_table> and creates a hash table for it in RAM."
Apply query restrictions to the sub-queries, not to the outer query.
Try this one:
SELECT L.UserID, L.Title
FROM tutorial.hits_v1 AS L
INNER JOIN (
SELECT UserID
FROM tutorial.hits_v2
/* WHERE .. */
LIMIT 10) AS R ON L.UserID = R.UserID
ORDER BY UserID
or
SELECT UserID, Title
FROM tutorial.hits_v1
WHERE UserID IN (SELECT UserID FROM tutorial.hits_v2 /* WHERE .. */ LIMIT 10)
ORDER BY UserID
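The advice above follows from how a hash join uses memory: the right-hand input is loaded into an in-memory hash table, while the left-hand input is streamed past it. The following is a minimal Python sketch of that idea, not ClickHouse's actual implementation; the function name `hash_join` and the sample rows are hypothetical.

```python
from collections import defaultdict


def hash_join(left_rows, right_rows, key):
    """Toy inner hash join: builds a table from right_rows, streams left_rows."""
    # Build phase: every right-side row is held in memory at once,
    # so memory use grows with the size of the right-hand input.
    table = defaultdict(list)
    for row in right_rows:
        table[row[key]].append(row)
    # Probe phase: left-side rows are processed one at a time.
    for row in left_rows:
        for match in table.get(row[key], []):
            yield {**match, **row}


# Hypothetical sample data standing in for hits_v1 (left) and hits_v2 (right).
left = [
    {"UserID": 1, "Title": "a"},
    {"UserID": 2, "Title": "b"},
    {"UserID": 3, "Title": "c"},
]
right = [{"UserID": 2}, {"UserID": 3}, {"UserID": 3}]

joined = list(hash_join(left, right, "UserID"))
```

Shrinking `right` (for example with a WHERE or LIMIT inside the sub-query, as in the two queries above) is what shrinks the in-memory table; restrictions applied only to the outer query arrive after the table has already been built.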

If you have memory to spare, raise the per-query limit before executing the query:
SET max_memory_usage = 8000000000;
In my case, setting it to 8 GB solved the issue.

Related

Optimizing Left Join With Group By and Order By (MariaDB)

I am attempting to optimize a query in MariaDB that is really bogged down by its ORDER BY clause. I can run it in under a tenth of a second without the ORDER BY clause, but it takes over 25 seconds with it. Here is the gist of the query:
SELECT u.id, u.display_name, u.cell_phone, u.email,
uv.year, uv.make, uv.model, uv.id AS user_vehicle_id
FROM users u
LEFT JOIN user_vehicles uv ON uv.user_id = u.id AND uv.current_owner=1
WHERE u.is_deleted = 0
GROUP BY u.id
ORDER BY u.display_name
LIMIT 0, 10;
I need it to be a left join because I want to include users that aren't linked to a vehicle.
I need the group by because I want only 1 result per user (and display_name is not guaranteed to be unique).
users table has about 130K rows, while user_vehicles has about 230K rows.
Here is the EXPLAIN of the query:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | u | index | dms_cust_idx | PRIMARY | 4 | null | 124825 | Using where; Using temporary; Using filesort
1 | SIMPLE | uv | ref | user_idx | user_idx | 4 | awscheduler.u.id | 1 | Using where
I have tried these two indices to speed things up, but they don't seem to do much.
CREATE INDEX idx_display_speedy ON users(display_name);
CREATE INDEX idx_display_speedy2 ON users(id, display_name, is_deleted, dms_cust_id);
I am looking for ideas on how to speed this up. I attempted using nested queries, but since the ORDER BY is the bottleneck and ordering within a nested query is ignored, I believe that attempt was in vain.
How about:
WITH a AS (
SELECT u.id, u.display_name, u.cell_phone, u.email
FROM users u
WHERE u.is_deleted = 0
GROUP BY u.id
ORDER BY u.display_name -- sort before LIMIT so the first 10 users by name are kept
LIMIT 0, 10
)
SELECT a.id, a.display_name, a.cell_phone, a.email,
uv.year, uv.make, uv.model, uv.id AS user_vehicle_id
FROM a LEFT JOIN user_vehicles uv ON uv.user_id = a.id AND uv.current_owner=1
ORDER BY a.display_name;
The intention is to take a subset of users before joining it with user_vehicles.
Disclaimer: I haven't verified whether it's faster, but I have had similar experiences in the past where this helped.
with a as (
SELECT u.id, u.display_name, u.cell_phone, u.email,
uv.year, uv.make, uv.model, uv.id AS user_vehicle_id
FROM users u
LEFT JOIN user_vehicles uv ON uv.user_id = u.id AND uv.current_owner=1
WHERE u.is_deleted = 0
GROUP BY u.id
)
select * from a
ORDER BY a.display_name;
I suspect it's not actually the ordering that is causing the problem... If you remove the limit, I bet the ordered and un-ordered versions will end up performing pretty close to the same.
Depending on whether your actual query is as simple as the one you posted, you may be able to get good performance in a single query by using ROWNUM():
SELECT u.id, u.display_name, u.cell_phone, u.email,
uv.year, uv.make, uv.model, uv.id AS user_vehicle_id
FROM (
SELECT iu.id, iu.display_name, iu.cell_phone, iu.email
FROM users iu
WHERE iu.is_deleted = 0
ORDER BY iu.display_name) as u
LEFT JOIN user_vehicles uv ON uv.user_id = u.id AND uv.current_owner=1
WHERE ROWNUM() < 10
GROUP BY u.id
ORDER BY u.display_name
If that doesn't work, you probably need to select the users in one SELECT and then select their vehicles in a second SELECT.
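The two-select approach could look roughly like the sketch below. It is only an illustration under stated assumptions: it uses Python's built-in SQLite instead of MariaDB so that it runs standalone, the schema is cut down to a few columns, and the function name `page_of_users_with_vehicles` and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, display_name TEXT, is_deleted INTEGER);
    CREATE TABLE user_vehicles (id INTEGER PRIMARY KEY, user_id INTEGER,
                                make TEXT, current_owner INTEGER);
    INSERT INTO users VALUES (1, 'Zoe', 0), (2, 'Adam', 0), (3, 'Bob', 1), (4, 'Cara', 0);
    INSERT INTO user_vehicles VALUES (10, 1, 'Ford', 1), (11, 2, 'Audi', 1),
                                     (12, 2, 'Fiat', 0);
""")


def page_of_users_with_vehicles(conn, limit=10, offset=0):
    # Step 1: sort and page the users table on its own, without the join.
    users = conn.execute(
        "SELECT id, display_name FROM users WHERE is_deleted = 0 "
        "ORDER BY display_name, id LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()
    if not users:
        return []
    # Step 2: fetch current vehicles only for the users on this page.
    ids = [user_id for user_id, _ in users]
    marks = ",".join("?" * len(ids))
    vehicles = {}
    for user_id, make in conn.execute(
        "SELECT user_id, make FROM user_vehicles "
        f"WHERE current_owner = 1 AND user_id IN ({marks}) ORDER BY id",
        ids,
    ):
        vehicles.setdefault(user_id, make)  # keep one vehicle per user
    # Users without a vehicle get None, as they would with the LEFT JOIN.
    return [(user_id, name, vehicles.get(user_id)) for user_id, name in users]


page = page_of_users_with_vehicles(conn, limit=2)
```

With the sample data, `page` contains Adam with his current vehicle and Cara with no vehicle. The second query touches at most `limit` user ids, so the sort no longer has to process the joined result.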

Oracle performance issues when using a subquery in an "IN" operator

I have two queries that look nearly identical, but they perform very differently in Oracle.
Query A
Create Table T1 as Select * from FinalView1 where CustomerID in ('A0000001','A000002')
Query B
Create Table T1 as Select * from FinalView1 where CustomerID in (select distinct CustomerID from CriteriaTable)
The CriteriaTable has 800 rows, but they all belong to CustomerID 'A0000001' and 'A000002'.
This means the subquery "select distinct CustomerID from CriteriaTable" returns only the same two elements ('A0000001', 'A000002') that were entered manually in Query A.
The following is the query behind FinalView1:
create or replace view FinalView1_20200716 as
select
Customer_ID,
<Some columns>
from
Table1_20200716 T1
INNER join Table2_20200716 T2 on
T1.Invoice_number = T2.Invoice_number
and
T1.line_id = T2.line_id
left join Table3_20200716 T3 on
T3.id = T1.Customer_ID
left join Table4_20200716 T4 on
T4.Shipping_ID = T1.Shipping_ID
left join Table5_20200716 Table5 on
Table5.Invoice_ID = T1.Invoice_ID
left join Table6_20200716 T6 on
T6.Shipping_ID = T4.Shipping_ID
left join First_Order first on
first.Shipping_ID = T1.Shipping_ID
;
Table1_20200716, Table2_20200716, Table3_20200716, Table4_20200716, Table5_20200716, and Table6_20200716 are views over the corresponding tables, which use the temporal validity feature. For example, this is the query behind Table1_20200716:
Create or replace view Table1_20200716 as
select
*
from Table1 as for period of to_date('20200716','yyyymmdd')
However, the table "First_Order" is just a normal table.
The following is the performance of both queries (according to the explain plan):
Query A:
Cardinality: 102
Cost : 204
Total Runtime: 5 secs max
Query B:
Cardinality:27921981
Cost: 14846
Total Runtime:20 mins until user cancelled
All tables are indexed on the columns used to join against the other tables in FinalView1. According to the explain plan, all of those indexes were used except on the First_Order table.
Query A used the unique index on the First_Order table, while Query B performed a full scan.
For Query B, I was expecting Oracle to run the sub-query first and feed its result into the IN operator before executing the main query, so it should have had only a minor impact on performance.
Thanks in advance!
As mentioned in my comment two days ago, someone posted a solution and then removed it, even though the answer actually works. After waiting two days, I decided to post that solution myself.
That solution suggested that performance was being slowed down by the "IN" operator and recommended replacing it with an inner join:
Create Table T1 as
Select
FV.*
from
FinalView1 FV
inner join (
select distinct
CustomerID
from
CriteriaTable
) CT on CT.customerid = FV.customerID;
The result from the explain plan was worse than before:
Cardinality: 28364465 (from 27921981)
Cost: 15060 (from 14846)
However, it takes only 17 seconds, which is very good!

Commit to reduce temp tablespace usage

I have a large table insert as part of a job for reporting. For ease of development, I did a single insert with a single select, rather than splitting this up into multiple commits.
insert /*+ parallel(AUTO) */ into sc_session_activity_stage1(fiscal_year
,fiscal_quarter_id
,date_stamp
,time_stamp
,session_key
,activity_call_type_key
,user_key
,device_key
,url_key
,ref_url_key
,event_key
,page_type_key
,link_url_key
,component_key
,content_trace_key
,key) (
select /*+ parallel(AUTO) */
schfql.fiscal_year fiscal_year
,schfql.fiscal_quarter_id fiscal_quarter_id
,pkg_sc_portfolio_load.sc_datestamp_from_epoch(swa.time_stamp)
,swa.time_stamp time_stamp
,schuse.session_key session_key
,schact.activity_call_type_key activity_call_type_key
,schu.user_key user_key
,schde.device_key device_key
,schurl_url.url_key url_key
,schurl_ref.url_key ref_url_key
,schev.event_key event_key
,schapt.page_type_key page_type_key
,schurl_link_url.url_key link_url_key
,schwac.component_id component_id
,schti_content_unique_id.trace_id_key content_unique_id_trace_id_key
,schti_unique_id.trace_id_key unique_id_trace_id_key
from web_activity swa
inner join sc_fiscal_quarter_list schfql
on pkg_sc_portfolio_load.sc_datestamp_from_epoch(swa.time_stamp) between schfql.start_date and schfql.end_date
inner join sc_user_sessions schuse
on schuse.session_id = swa.session_id
inner join sc_activity_call_types schact
on schact.activity_call_type_name = swa.calltype
inner join sc_users schu
on schu.user_email = sc_normalize_email(swa.userid)
inner join sc_devices schde
on swa.device=schde.device and
swa.ip=schde.source_ip and
swa.operation_system = schde.operating_system and
swa.browser = schde.browser
left join sc_urls schurl_url
on schurl_url.full_url = trim(swa.url)
inner join sc_events schev
on schev.event=trim(swa.event)
inner join sc_activity_page_types schapt
on schapt.page_type_name=swa.pagetype
left join sc_urls schurl_link_url
on schurl_link_url.full_url = trim(swa.linkurl)
left join sc_urls schurl_ref
on schurl_ref.full_url = trim(swa.ref)
inner join sc_web_activity_components schwac
on schwac.component_name=trim(swa.component)
left join sc_trace_ids schti_content_unique_id
on schti_content_unique_id.alfresco_trace_id = swa.CONTENT_UNIQUE_ID
left join sc_trace_ids schti_unique_id
on schti_unique_id.alfresco_trace_id=swa.UNIQUE_ID
);
commit;
On production, this triggers alarms for TEMP tablespace. If I were to split the above into multiple commits, would that reduce the TEMP usage at any one point in time? This may be obvious to some, but I'm not sure how Oracle works. I'm not seeing any ORA type errors, rather some threshold is being triggered and someone from the DBA team sends an email.
Thank you from the Woodsman.
TEMP tablespace blowouts are common and can be addressed by increasing the TEMP tablespace and/or tuning the SQL to use less TEMP. For tuning, I usually start with the SQL Tuning Advisor recommendations (requires the Diagnostics and Tuning packs). Note that TEMP usage goes up with parallel queries and is mostly specific to the SELECT part. You can also reduce TEMP tablespace usage by doing more in memory (i.e. increasing the PGA).

Same Query, Different Performance

There is a query that takes more than two minutes. Its result is 960000 rows.
So I added a hint, and then it took just 30 seconds. I put the query into MyBatis and ran the application, but there it still takes more than two minutes. I then copied the query from the log, pasted it into my local developer tool, and executed it, and it took just 30 seconds.
I don't know why.
Environment: Spring, MyBatis, Oracle
SELECT *
FROM (
SELECT /*+ USE_HASH(a b c) */ ROW_NUMBER() OVER(ORDER BY A.TRANS_DT DESC, A.TRANS_TM DESC) AS RNUM
,A.PAY_METHOD
,B.BRAND_NM
,A.I_MID
,C.TRANS_DT
,C.TRANS_TM
,C.TRANS_DT || C.TRANS_TM AS TRANS_DTM
,C.VACCT_VALID_DT
,C.VACCT_VALID_TM
,C.VACCT_VALID_DT||C.VACCT_VALID_TM AS VACCT_VALID_DTM
,NVL(C.DEPOSIT_DT, ' ') DEPOSIT_DT
,NVL(C.DEPOSIT_TM, ' ') DEPOSIT_TM
,NVL(C.DEPOSIT_DT, ' ')||NVL(C.DEPOSIT_TM, ' ') AS DEPOSIT_DTM
,NVL(C.DEPOSIT_AMT, 0) AS AMT
,NVL(A.AMT, 0) AS INPUT_AMT
,A.BANK_CD
,IONPAY.UF_GET_BANK_NAME(A.BANK_CD) AS BANK_CD_NM
,C.VACCT_NO
,A.BILLING_NM
,A.REFERENCE_NO
,A.TXID
,A.STATUS
,NVL(A.STATUS, ' ') AS INPUT_STATUS
,IONPAY.UF_GET_TRANS_VACCT_STATUS_NAME(C.STATUS) AS INPUT_STATUS_NAME
,C.STATUS AS TRANS_STATUS
,IONPAY.UF_GET_TRANS_VACCT_STA_NAME_2(C.STATUS, C.MATCH_CL, C.VACCT_VALID_DT, C.VACCT_VALID_TM) AS STATUS_NAME
,A.ACQU_STATUS
,A.CANCEL_DT||A.CANCEL_TM AS REVERSAL_DATE
,NVL((SELECT DESC2
FROM TB_CODE
WHERE CODE_CL = 'CHNL'
AND CODE1 = C.BANK_CD
AND CODE2 = C.CHANNEL_TYPE), ' ') AS channel
FROM TB_TRANS_HISTORY A, TB_BO_MER_MGMT B, TB_VACCT_TRANS C
WHERE A.I_MID = B.I_MID
AND A.TXID = C.TXID
AND A.I_MID = C.I_MID
AND B.I_MID = C.I_MID
AND A.PAY_METHOD IN ('02')
AND A.TRANS_DT BETWEEN '20161016' AND '20161116'
AND A.TRANS_DT||A.TRANS_TM BETWEEN '2016101600%3A0000' AND '2016111624%3A0000'
AND B.TAX_NO != 'NICEPAY'
AND C.SIMULATION_FLG = '0'
AND C.STATUS IN ('0', '1', '2', '3', '4')
) TBL
WHERE RNUM BETWEEN 210000 AND 220000
I have run into this situation before: the time difference is not in the SQL query execution but in fetching the results.
The Oracle JDBC driver's default fetch size is 10, which means the ResultSet is fed 10 rows at a time. That makes a lot of round trips to the database when there are 1 million records, so the fetch size must be increased.
With PostgreSQL, the default fetch size is unlimited: the ResultSet is fed the whole result (or as much as fits before an OutOfMemory error). There, the fetch size may need to be decreased.
To specify the fetch size in MyBatis:
Use the fetchSize attribute on the XML select statement:
<select id="listItems" fetchSize="300">SELECT ...</select>
Or use the fetchSize option in annotations:
@Select("SELECT ...")
@Options(fetchSize=300)
A value in the range 200-500 is often a good compromise.
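To see why the fetch size matters so much here, it helps to count the round trips. The sketch below is back-of-the-envelope arithmetic only: the helper name `fetch_round_trips` is hypothetical, and it ignores per-trip latency, row width, and any driver-side buffering.

```python
import math


def fetch_round_trips(total_rows, fetch_size):
    """Number of driver-to-database fetch calls needed to read total_rows."""
    return math.ceil(total_rows / fetch_size)


rows = 960_000  # result size mentioned in the question

default_trips = fetch_round_trips(rows, 10)   # Oracle JDBC default fetch size
tuned_trips = fetch_round_trips(rows, 300)    # fetchSize used in the example above
```

For the 960000-row result, that is 96000 round trips at the default of 10 versus 3200 at a fetch size of 300, a 30-fold reduction in network exchanges for the same query.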

In Elasticsearch, how can I establish join query with conditions and later perform percentile and count functions?

I have a set of tables in my database: table A holds a set of categories and table B holds a set of repositories. A and B are related by categoryid. Table C holds a set of properties for a repoId. Table C and B are associated by repoId.
Table C can have multiple values for a repoId.
The data in table C is a property such as a number string like 12345XXXX (at most 10 characters). I have to find the first 6 matching characters of a particular value in table C, and the count of repoIds associated with that 6-character value, for a particular categoryid in table A.
Table A (set of categories) ---------> Table B (set of repositories, associated with A by categoryid) ---------> Table C (set of FMProperties against a repoId)
Currently, this is done with joins and substring queries on these tables, and it is very slow.
I have to implement this functionality using Elasticsearch, but I don't have a clear view of how to start.
Do I create separate documents/indexes for tables A, B, and C, or fetch the data with a SQL query and create a single document?
And how can the analytics part explained above be applied?
I am very new to this technology, but I am following the tutorials provided on the Elasticsearch site.
Below is the MySQL query for this logic:
(select 'fms' as fmstype, C.fmscode as fmsCode,
count(C.repoId) as countOffms from tableC C, tableB B
where B.repoId = C.repoId and B.categoryid = 175
group by C.fmscode
order by countOffms desc
limit 1)
UNION ALL
(select 'fms6' as fmstype, t1.fmscode, t2.countOffms from tableC t1
inner join
(
select substring(C.fmscode,1,6) as first6,
count(C.repoId) as countOffms from tableC C, tableB B
where B.repoId = C.repoId and B.categoryid = 175 and length(C.fmscode) = 6
group by substring(C.fmscode,1,6) order by countOffms desc
limit 1 ) t2
ON
substring(t1.fmscode,1,6) = t2.first6 and length(t1.fmscode) = 6
group by t1.fmscode
order by count(t1.fmscode) desc
limit 1)

Resources