Populating a Materialized View in ClickHouse exceeds Memory limit

I'm trying to create a materialized view using the ReplicatedAggregatingMergeTree engine on top of a table that uses the ReplicatedMergeTree engine.
After a few million rows I get DB::Exception: Memory limit (for query) exceeded. Is there a way to work around this?
CREATE MATERIALIZED VIEW IF NOT EXISTS shared.aggregated_calls_1h
ENGINE = ReplicatedAggregatingMergeTree('/clickhouse/tables/{shard}/shared/aggregated_calls_1h', '{replica}')
PARTITION BY toRelativeDayNum(retained_until_date)
ORDER BY (
client_id,
t,
is_synthetic,
source_application_ids,
source_service_id,
source_endpoint_id,
destination_application_ids,
destination_service_id,
destination_endpoint_id,
boundary_application_ids,
process_snapshot_id,
docker_snapshot_id,
host_snapshot_id,
cluster_snapshot_id,
http_status
)
SETTINGS index_granularity = 8192
POPULATE
AS
SELECT
client_id,
toUInt64(floor(t / (60000 * 60)) * (60000 * 60)) AS t,
date,
toDate(retained_until_timestamp / 1000) retained_until_date,
is_synthetic,
source_application_ids,
source_service_id,
source_endpoint_id,
destination_application_ids,
destination_service_id,
destination_endpoint_id,
boundary_application_ids,
http_status,
process_snapshot_id,
docker_snapshot_id,
host_snapshot_id,
cluster_snapshot_id,
any(destination_endpoint) AS destination_endpoint,
any(destination_endpoint_type) AS destination_endpoint_type,
groupUniqArrayArrayState(destination_technologies) AS destination_technologies_state,
minState(ingestion_time) AS min_ingestion_time_state,
sumState(batchCount) AS sum_call_count_state,
sumState(errorCount) AS sum_error_count_state,
sumState(duration) AS sum_duration_state,
minState(toUInt64(ceil(duration/batchCount))) AS min_duration_state,
maxState(toUInt64(ceil(duration/batchCount))) AS max_duration_state,
quantileTimingWeightedState(0.25)(toUInt64(ceil(duration/batchCount)), batchCount) AS latency_p25_state,
quantileTimingWeightedState(0.50)(toUInt64(ceil(duration/batchCount)), batchCount) AS latency_p50_state,
quantileTimingWeightedState(0.75)(toUInt64(ceil(duration/batchCount)), batchCount) AS latency_p75_state,
quantileTimingWeightedState(0.90)(toUInt64(ceil(duration/batchCount)), batchCount) AS latency_p90_state,
quantileTimingWeightedState(0.95)(toUInt64(ceil(duration/batchCount)), batchCount) AS latency_p95_state,
quantileTimingWeightedState(0.98)(toUInt64(ceil(duration/batchCount)), batchCount) AS latency_p98_state,
quantileTimingWeightedState(0.99)(toUInt64(ceil(duration/batchCount)), batchCount) AS latency_p99_state,
quantileTimingWeightedState(0.25)(toUInt64(ceil(duration/batchCount)/100), batchCount) AS latency_p25_large_state,
quantileTimingWeightedState(0.50)(toUInt64(ceil(duration/batchCount)/100), batchCount) AS latency_p50_large_state,
quantileTimingWeightedState(0.75)(toUInt64(ceil(duration/batchCount)/100), batchCount) AS latency_p75_large_state,
quantileTimingWeightedState(0.90)(toUInt64(ceil(duration/batchCount)/100), batchCount) AS latency_p90_large_state,
quantileTimingWeightedState(0.95)(toUInt64(ceil(duration/batchCount)/100), batchCount) AS latency_p95_large_state,
quantileTimingWeightedState(0.98)(toUInt64(ceil(duration/batchCount)/100), batchCount) AS latency_p98_large_state,
quantileTimingWeightedState(0.99)(toUInt64(ceil(duration/batchCount)/100), batchCount) AS latency_p99_large_state,
sumState(minSelfTime) AS sum_min_self_time_state
FROM shared.calls_v2
WHERE sample_type != 'user_selected'
GROUP BY
client_id,
t,
date,
retained_until_date,
is_synthetic,
source_application_ids,
source_service_id,
source_endpoint_id,
destination_application_ids,
destination_service_id,
destination_endpoint_id,
boundary_application_ids,
process_snapshot_id,
docker_snapshot_id,
host_snapshot_id,
cluster_snapshot_id,
http_status
HAVING destination_endpoint_type != 'INTERNAL'

You can try using the --max_memory_usage option of clickhouse-client to increase the limit:
--max_memory_usage arg "Maximum memory usage for processing of single query. Zero means unlimited."
https://clickhouse.yandex/docs/en/operations/settings/query_complexity/#settings_max_memory_usage
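For example, you could raise the limit for the session before running the CREATE ... POPULATE statement (a sketch; the 20 GB value is only an assumption, adjust it to the memory actually available on your node):
SET max_memory_usage = 20000000000;
The same value can be passed as the clickhouse-client flag quoted above, e.g. --max_memory_usage=20000000000.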
Or, instead of using POPULATE, copy the data into the view's inner table manually:
INSERT INTO shared.`.inner.aggregated_calls_1h`
SELECT
client_id,
toUInt64(floor(t / (60000 * 60)) * (60000 * 60)) AS t,
...
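If a single INSERT over the whole source table still exceeds the memory limit, one workaround (a sketch, not from the original answer; the date range is hypothetical) is to backfill the inner table in smaller chunks, repeating the same SELECT with an extra filter per chunk:
INSERT INTO shared.`.inner.aggregated_calls_1h`
SELECT
... -- the same column list and aggregate states as in the view definition
FROM shared.calls_v2
WHERE sample_type != 'user_selected'
AND date BETWEEN '2018-01-01' AND '2018-01-07' -- hypothetical chunk, iterate over the ranges you need
GROUP BY
... -- the same keys as in the view definition
HAVING destination_endpoint_type != 'INTERNAL'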

Related

GORM hangs with parameter

The query below hangs after exactly 5 calls, every time:
tx := db.Raw("select count(*) as hash from transaction_logs left join blocks on transaction_logs.block_number = blocks.number"+
" where (transaction_logs.address = ? and transaction_logs.topic0 = ?) and blocks.total_confirmations >= 7 "+
"group by transaction_hash", strings.ToLower("0xa11265a58d9f5a2991fe8476a9afea07031ac5bf"),
"0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef").Scan(&totalIds)
If we rewrite it without the arguments, it works.
db.Raw("select count(*) as hash from transaction_logs left join blocks on transaction_logs.block_number = blocks.number"+
" where (transaction_logs.address = #tokenAddress and transaction_logs.topic0 = '0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef') and blocks.total_confirmations >= 7 "+
"group by transaction_hash", sql.Named("tokenAddress", strings.ToLower("0xa11265a58d9f5a2991fe8476a9afea07031ac5bf"))
We even tried with a named parameter, as above; same result.
Can anyone help here?

Using IF statement in Oracle SQL

I'm trying to use an IF/ELSE statement in a SQL query because I want to sum the units per district based on their package size for each PLN #. Can I do this? Here is my attempt, which does not work:
SELECT district_nbr,PLN_NBR, SUM(A.LO_IOH_UNITS) AS TOTAL_UNITS, SUM (A.LO_IOH_EXT_COST_DLRS) AS TOTAL_DOLLARS
FROM FCT_LOIOH_DAY_STR_PLN A, DIM_PROD_PLN B, DIM_LOCATION C
WHERE B.PLN_NBR IN(40000683181, 40000418723, 40000335776)
AND A.PROD_ID = B.PROD_ID
AND A.LOC_ID = C.LOC_ID
if PLN_NBR = '40000683181' then SUM(A.LO_IOH_UNITS)/25
else SUM(A.LO_IOH_UNITS)/60
GROUP BY district_nbr,PLN_NBR
SELECT district_nbr, PLN_NBR,
SUM(A.LO_IOH_UNITS) / (case when PLN_NBR = 40000683181 then 25 else 60 end) AS TOTAL_UNITS,
SUM (A.LO_IOH_EXT_COST_DLRS) AS TOTAL_DOLLARS
FROM FCT_LOIOH_DAY_STR_PLN A,
DIM_PROD_PLN B,
DIM_LOCATION C
WHERE B.PLN_NBR IN (40000683181, 40000418723, 40000335776)
AND A.PROD_ID = B.PROD_ID
AND A.LOC_ID = C.LOC_ID
GROUP BY district_nbr, PLN_NBR;
In Oracle SQL, you can use CASE WHEN ... ELSE for conditionals:
CASE WHEN PLN_NBR = '40000683181' THEN SUM(A.LO_IOH_UNITS) / 25
ELSE SUM(A.LO_IOH_UNITS) / 60
END
Such CASE WHEN ... expressions go where the other selected columns go:
SELECT
foo,
bar,
CASE ....
FROM
...
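A minimal, self-contained illustration (the table and column names here are hypothetical, not taken from the question):
SELECT pln_nbr,
SUM(units) / CASE WHEN pln_nbr = 40000683181 THEN 25 ELSE 60 END AS total_units
FROM pkg_inventory
GROUP BY pln_nbr;
Because pln_nbr is a grouping column, it can be referenced inside the CASE expression in the select list.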

Performance: how to insert CLOB fast using cx_Oracle and executemany()?

The cx_Oracle API was very fast for me until I tried to work with CLOB values.
I do it as follows:
import time
import cx_Oracle
num_records = 100
con = cx_Oracle.connect('user/password@sid')
cur = con.cursor()
cur.prepare("insert into table_clob (msg_id, message) values (:msg_id, :msg)")
cur.bindarraysize = num_records
msg_arr = cur.var(cx_Oracle.CLOB, arraysize=num_records)
text = '$'*2**20 # 1 MB of text
rows = []
start_time = time.perf_counter()
for id in range(num_records):
msg_arr.setvalue(id, text)
rows.append( (id, msg_arr) ) # ???
print('{} records prepared, {:.3f} s'
.format(num_records, time.perf_counter() - start_time))
start_time = time.perf_counter()
cur.executemany(None, rows)
con.commit()
print('{} records inserted, {:.3f} s'
.format(num_records, time.perf_counter() - start_time))
cur.close()
con.close()
The main problem worrying me is performance:
100 records prepared, 25.090 s - that's a lot just to copy 100 MB in memory!
100 records inserted, 23.503 s - that seems too much for 100 MB over the network.
The problematic step is msg_arr.setvalue(id, text). If I comment it out, the script takes just milliseconds to complete (inserting NULL into the CLOB column, of course).
Secondly, it seems weird to append the same reference to the CLOB variable to the rows array. I found this example on the internet, and it works correctly, but am I doing it right?
Are there ways to improve performance in my case?
UPDATE: I tested network throughput: a 107 MB file copies in 11 s via SMB to the same host. But again, network transfer is not the main problem; data preparation takes an abnormally long time.
A weird workaround (thanks to Avinash Nandakumar from the cx_Oracle mailing list), but it's a real way to greatly improve performance when inserting CLOBs:
import time
import cx_Oracle
import sys
num_records = 100
con = cx_Oracle.connect('user/password@sid')
cur = con.cursor()
cur.bindarraysize = num_records
text = '$'*2**20 # 1 MB of text
rows = []
start_time = time.perf_counter()
cur.executemany(
"insert into table_clob (msg_id, message) values (:msg_id, empty_clob())",
[(i,) for i in range(1, 101)])
print('{} records prepared, {:.3f} s'
.format(num_records, time.perf_counter() - start_time))
start_time = time.perf_counter()
selstmt = "select message from table_clob " +
"where msg_id between 1 and :1 for update"
cur.execute(selstmt, [num_records])
for id in range(num_records):
results = cur.fetchone()
results[0].write(text)
con.commit()
print('{} records inserted, {:.3f} s'
.format(num_records, time.perf_counter() - start_time))
cur.close()
con.close()
Semantically this is not exactly the same as in my original post; I wanted to keep the example as simple as possible to show the principle. The point is that you should insert empty_clob(), then select it and write its contents.

Hive returns a non specific error : FAILED: SemanticException java.lang.reflect.UndeclaredThrowableException

I have the following query in Hive, but it doesn't work:
SELECT
newcust.dt , aspen.Probe , newcust.direction , aspen.VLan , sum(newcust.npacket), sum(newcust.nbyte), sum(newcust.nbytetcp), sum(newcust.nbyteudp), sum(newcust.byte_unknown), sum(newcust.pack_unknown), sum(newcust.byte_web), sum(newcust.pack_web), sum(newcust.byte_webapp), sum(newcust.pack_webapp), sum(newcust.byte_residential), sum(newcust.pack_residential), sum(newcust.byte_download), sum(newcust.pack_download), sum(newcust.byte_news), sum(newcust.pack_news), sum(newcust.byte_mail), sum(newcust.pack_mail), sum(newcust.byte_db), sum(newcust.pack_db), sum(newcust.byte_routing), sum(newcust.pack_routing), sum(newcust.byte_encrypted), sum(newcust.pack_encrypted), sum(newcust.byte_office), sum(newcust.pack_office), sum(newcust.byte_vpn), sum(newcust.pack_vpn), sum(newcust.byte_tunneling), sum(newcust.pack_tunneling), sum(newcust.byte_others), sum(newcust.pack_others), sum(newcust.byte_remoteaccess), sum(newcust.pack_remoteaccess), sum(newcust.byte_streaming), sum(newcust.pack_streaming) , sum(newcust.byte_chat), sum(newcust.pack_chat), sum(newcust.byte_voip), sum(newcust.pack_voip), aspen.CustomerName, aspen.General_NetworkPriority, aspen.SLA_CIR, aspen.SLA_EIR
FROM
newcust INNER JOIN aspen ON( aspen.Probe = newcust.numsonde AND aspen.VLan = substring(newcust.name1,1,instr(newcust.name1, '_')-1) )
WHERE
newcust.numsonde = '1'
AND newcust.direction = '0'
AND newcust.dt LIKE '2012-01-20-%%%%'
AND COALESCE(UNIX_TIMESTAMP(aspen.scd_end,'dd-MM-yyyy'),CAST(9999999999 AS BIGINT)) >= UNIX_TIMESTAMP(newcust.dt,'yyyy-MM-dd-HHmm')+cast((newcust.period * 360) as BIGINT)
AND UNIX_TIMESTAMP(aspen.scd_start,'dd-MM-yyyy') < UNIX_TIMESTAMP(newcust.dt,'yyyy-MM-dd-HHmm')+cast((newcust.period * 360) as BIGINT)
GROUP BY newcust.dt, aspen.Probe, newcust.direction, aspen.VLan, aspen.CustomerName, aspen.General_NetworkPriority, aspen.SLA_CIR, aspen.SLA_EIR, from_unixtime(UNIX_TIMESTAMP(newcust.dt,'yyyy-MM-dd-HHmm')+cast((newcust.period * 360) as BIGINT))
Hive returns the following error:
FAILED: SemanticException java.lang.reflect.UndeclaredThrowableException
But there is no other explanation of the root of the problem.
Do you think the query is invalid, or is it another, "deeper" issue?
Make sure that all the non-aggregated fields in the SELECT clause are also included in the GROUP BY clause, as in the sketch below.
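A minimal illustration (a hypothetical table t with columns a, b and c):
-- rejected: b appears in the select list but not in the GROUP BY
SELECT a, b, sum(c) FROM t GROUP BY a;
-- accepted: every non-aggregated column is grouped
SELECT a, b, sum(c) FROM t GROUP BY a, b;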

prepared statement in multithreading

I have used a MERGE command in my prepared statement. When I executed it in a single-threaded environment it worked fine, but in a multi-threaded environment it causes a problem: the data is duplicated. If I have 5 threads, each record is duplicated 5 times. I think there is no lock in the DB to help the threads.
My code:
//db:oracle
sb.append("MERGE INTO EMP_BONUS EB USING (SELECT 1 FROM DUAL) on (EB.EMP_id = ?) WHEN MATCHED THEN UPDATE SET TA =?,DA=?,TOTAL=?,MOTH=? WHEN NOT MATCHED THEN "+ "INSERT (EMP_ID, TA, DA, TOTAL, MOTH, NAME)VALUES(?,?,?,?,?,?) ");
//sql operation,calling from run() method
public void executeMerge(String threadName) throws Exception {
ConnectionPro cPro = new ConnectionPro();
Connection connE = cPro.getConection();
connE.setAutoCommit(false);
System.out.println(sb.toString());
System.out.println("Threadname="+threadName);
PreparedStatement pStmt= connE.prepareStatement(sb.toString());
try {
count = count + 1;
for (Employee employeeObj : employee) {//datalist of employee
pStmt.setInt(1, employeeObj.getEmp_id());
pStmt.setDouble(2, employeeObj.getSalary() * .10);
pStmt.setDouble(3, employeeObj.getSalary() * .05);
pStmt.setDouble(4, employeeObj.getSalary()
+ (employeeObj.getSalary() * .05)
+ (employeeObj.getSalary() * .10));
pStmt.setInt(5, count);
pStmt.setDouble(6, employeeObj.getEmp_id());
pStmt.setDouble(7, employeeObj.getSalary() * .10);
pStmt.setDouble(8, employeeObj.getSalary() * .05);
pStmt.setDouble(9, employeeObj.getSalary()
+ (employeeObj.getSalary() * .05)
+ (employeeObj.getSalary() * .10));
pStmt.setInt(10, count);
pStmt.setString(11, threadName);
// pStmt.executeUpdate();
pStmt.addBatch();
}
pStmt.executeBatch();
connE.commit();
} catch (Exception e) {
connE.rollback();
throw e;
} finally {
pStmt.close();
connE.close();
}
}
If employee.size() = 5 and the thread count = 5, after execution I get 25 records instead of 5.
If there is no constraint (i.e. a primary key or a unique key constraint on the emp_id column in emp_bonus), there would be nothing to prevent the database from allowing each thread to insert 5 rows. Since each database session cannot see uncommitted changes made by other sessions, each thread would see that there was no row in emp_bonus with the emp_id the thread is looking for (I'm assuming that employeeObj.getEmp_id() returns the same 5 emp_id values in each thread) so each thread would insert all 5 rows leaving you with a total of 25 rows if there are 5 threads. If you have a unique constraint that prevents the duplicate rows from being inserted, Oracle will allow the other 4 threads to block until the first thread commits allowing the subsequent threads to do updates rather than inserts. Of course, this will cause the threads to be serialized defeating any performance gains you would get from running multiple threads.
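The constraint the answer refers to could be added like this (a sketch; the constraint name is made up):
ALTER TABLE emp_bonus ADD CONSTRAINT emp_bonus_emp_id_uk UNIQUE (emp_id);
With that constraint in place the duplicate rows can no longer be inserted and, as described above, the competing threads serialize on the first session's commit.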
