Create materialized view in ClickHouse times out

I'm trying to create a large, populated materialized view in a ClickHouse database, but the creation times out. The error is:
Code: 159, e.displayText() = DB::Exception: Timeout exceeded: elapsed 122.162157893 seconds, maximum: 120 (version 20.4.3.16 (official build)) (version 20.4.3.16 (official build))
To fix this I'd like to increase the timeout, but the problem is that I don't know which setting it is. No timeout setting in my driver properties has this value of 120 s.
I have already set socket_timeout to 500 s.
How do I increase the timeout that triggers the above error after 120s?

It looks like the max_execution_time setting was triggered.
Try to either:
don't define the POPULATE clause and populate the materialized view manually in chunks:
-- define the MV without POPULATE
CREATE MATERIALIZED VIEW IF NOT EXISTS mv_name ..
ENGINE = engine
AS SELECT ..
FROM ..;

-- manually populate it with several INSERTs
INSERT INTO mv_name
SELECT ..
FROM ..
WHERE dt_column >= '..' AND dt_column < '..'; -- restrict the chunk size
or increase the max_execution_time value:
CREATE MATERIALIZED VIEW IF NOT EXISTS mv_name ..
ENGINE = engine
AS SELECT ..
FROM ..
SETTINGS max_execution_time = 600;
To find which settings are set to 120, use this query:
SELECT *
FROM system.settings
WHERE value = '120'
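For illustration, here is a minimal self-contained sketch of the chunked approach described above; the table and column names (src_table, dt_column, val) and the monthly chunking are assumptions, not part of the original question:
-- Hypothetical source table and MV (names made up for illustration)
CREATE TABLE src_table (dt_column Date, val UInt64)
ENGINE = MergeTree ORDER BY dt_column;
CREATE MATERIALIZED VIEW IF NOT EXISTS mv_name
ENGINE = SummingMergeTree() ORDER BY dt_column
AS SELECT dt_column, sum(val) AS total
FROM src_table GROUP BY dt_column;
-- Backfill the pre-existing data one month at a time;
-- each INSERT stays well under the 120 s limit
INSERT INTO mv_name
SELECT dt_column, sum(val) FROM src_table
WHERE dt_column >= '2020-01-01' AND dt_column < '2020-02-01'
GROUP BY dt_column;
INSERT INTO mv_name
SELECT dt_column, sum(val) FROM src_table
WHERE dt_column >= '2020-02-01' AND dt_column < '2020-03-01'
GROUP BY dt_column;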

Related

Optimize table does not run right after table mutation

I'm updating a table with table mutations like this:
ALTER TABLE T1
UPDATE column1 = replaceAll(column1, 'X', 'Y') WHERE 1
After that, I'm sending the optimize-final command with clickhouse-client like this:
OPTIMIZE TABLE T1 FINAL
Ok.
0 rows in set. Elapsed: 0.002 sec.
But it returns instantly (0.002 sec.) and I can see that the rows are not updated yet.
After a while (10-50 seconds) I run the optimize-final command again, but this time it hangs until the table is optimized.
Is this the expected behavior of optimize-final?
I can see the rows are not updated yet.
ALTER TABLE T1 UPDATE -- asynchronous
You should check that your mutation is done: select count() from system.mutations where not is_done;
In newer versions you can run mutations synchronously:
ALTER TABLE T1 UPDATE column1 = replaceAll(column1, 'X', 'Y') WHERE 1
SETTINGS mutations_sync = 2
mutations_sync (default 0): "Wait for synchronous execution of ALTER TABLE UPDATE/DELETE queries (mutations). 0 - execute asynchronously. 1 - wait for the current server. 2 - wait for all replicas if they exist."
OPTIMIZE TABLE T1 FINAL
OPTIMIZE starts a merge, which has no relation to mutations.
0 rows in set. Elapsed: 0.002 sec.
In some cases OPTIMIZE cannot start and returns immediately.
Use optimize_throw_if_noop to find out the reason:
set optimize_throw_if_noop = 1;
OPTIMIZE TABLE T1 FINAL;
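Putting the pieces together, a minimal sketch of the whole workflow (assuming column1 is a String column; note that replaceAll takes haystack, pattern and replacement, and ALTER ... UPDATE requires a WHERE clause):
-- Run the mutation synchronously on all replicas
ALTER TABLE T1 UPDATE column1 = replaceAll(column1, 'X', 'Y') WHERE 1
SETTINGS mutations_sync = 2;
-- Verify nothing is pending (expect 0)
SELECT count() FROM system.mutations WHERE NOT is_done;
-- Force the merge, surfacing the reason if it cannot start
SET optimize_throw_if_noop = 1;
OPTIMIZE TABLE T1 FINAL;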

Create a trigger to update values

I have these two tables
Project (projID,TotArticles)
Article (prodID,ArticleID)
How can I create a trigger that increments the total number of articles by 1 every time someone publishes an article on a project?
CREATE TRIGGER Art_Up
AFTER INSERT ON Article
FOR EACH ROW
UPDATE Project
SET TotArticle = TotArticle + 1
WHERE paperID = NEW.PaperID;
However, it gives me this error: PLS-00103: Encountered the symbol ";"
You mixed up some names: once you write projID, once prodID, and in your trigger it is paperID. The trigger also has no begin ... end; block. And you did not handle adding articles whose projID does not exist in the Project table; you could check for that first, or check the row count after the UPDATE and, if it is 0, INSERT instead. Simpler still is to use MERGE:
create or replace trigger art_up after insert on article for each row
begin
  merge into project
  using (select :new.projid projid from dual) src
  on (project.projid = src.projid)
  when matched then update set totarticles = totarticles + 1
  when not matched then insert (projID, TotArticles) values (:new.projid, 1);
end;
It works (I tested some basic inserts), but it is not recommended at all, because:
it's a bad idea to keep logic in triggers,
triggers can be dropped or disabled, and then this information may be misleading or false,
it slows down insert operations,
this trigger does not handle DELETE, where you would need to decrement the total number of articles.
Instead of a trigger, use a simple view:
create or replace view vw_project as select projID, count(1) total from article group by projid;
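A quick sanity check of the view with hypothetical rows (assuming the Article columns are really projID and ArticleID, as the trigger implies):
-- Two articles for project 1, one for project 2
insert into article (projid, articleid) values (1, 100);
insert into article (projid, articleid) values (1, 101);
insert into article (projid, articleid) values (2, 200);
-- The view reports total = 2 for project 1 and total = 1 for project 2,
-- always in sync with the data and with no trigger to maintain
select * from vw_project order by projid;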

Passing the table header in Hive transform

I am creating a query in Hive to execute an R script. I am using the TRANSFORM function to pass the table. However, when I receive the table in R it comes without the header. I know that I could create a variable and ask the user to enter the header manually, but I do not want to do that.
I want to do this automatically, and I am considering two options:
1) Figure out a way to pass the table with the header included when using the TRANSFORM function
2) Save the header in a variable and pass it in TRANSFORM (I have already tried this in different ways, but instead of passing the result of the query it passes the query string, as seen below)
Here is what I have:
--Name of the origin table
set source_table = categ_table_small;
--Number of clusters
set k = "5";
--Distance to be used in the model
set distance = "euclidean";
--Folder where the results of the model will be saved
set dir_tar = "/output_r";
--Name of the model used in the naming of the files
set model_name = "testeclara_small";
--Samples: integer, number of samples to be drawn from the dataset.
set n_samples = "10";
--sampsize: integer, number of observations in each sample. This formula is suggested by the package. sampsize<-min(nrow(x), 40 + 2 * k)
set sampsize = "50";
--Creating a matrix which will store the sample number and the group of each sample according to the algorithm
CREATE TABLE IF NOT EXISTS medoids_result AS SELECT * FROM categ_table_small;
--In the normal situation you don't have the output label, it means you just have 'x' and do not have 'y', so you need to add one extra column to receive
--the group of each observation
--ALTER TABLE medoids_result ADD COLUMNS (medoid INT);
set result_matrix = medoids_result;
set headerMatrix = show columns in categ_table_small;
--Training query
SET mapreduce.job.name = K medoids Clara- ${hiveconf:source_table};
SET mapreduce.job.reduces=1;
INSERT OVERWRITE TABLE ${hiveconf:result_matrix}
SELECT TRANSFORM ($begin(cols="${hiveconf:source_table}" delimiter= "," excludes="y")$column$end)
USING '/usr/bin/Rscript_10gb /programs_r/du8_dev_1.R ${hiveconf:k} ${hiveconf:distance} ${hiveconf:dir_tar} ${hiveconf:model_name} ${hiveconf:n_samples} ${hiveconf:sampsize} ${hiveconf:headerMatrix}'
AS
(
$begin(table='${hiveconf:result_matrix}') $column$end
)
FROM
(SELECT *
FROM ${hiveconf:source_table}
DISTRIBUTE BY '1'
)t1;
You can add this line:
hive -e 'set hive.cli.print.header=true; select * from tablename;'
where tablename refers to your table name.
If you want this to work by default for every table, put
set hive.cli.print.header=true;
as the first line of your $HOME/.hiverc file.
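For example, a quick check against the question's table (whether the header also reaches a TRANSFORM script this way depends on your setup, so treat this as a sketch):
-- With the flag set, the first output line is the column names,
-- so a consumer reading the output can take line 1 as the header
set hive.cli.print.header=true;
select * from categ_table_small limit 5;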

h2 index corruption? embedded database loaded with runscript has "invisible" rows

Using h2 in embedded mode, I am restoring an in memory database from a script backup that was previously generated by h2 using the SCRIPT command.
I use this URL:
jdbc:h2:mem:main
I am doing it like this:
FileReader script = new FileReader("db.sql");
RunScript.execute(conn,script);
which, according to the doc, should be similar to this SQL:
RUNSCRIPT FROM 'db.sql'
And inside my app they do perform the same. But if I run the load through the web console (started via h2.bat), I get a different result.
After loading this data in my app, there are rows that I know are loaded but are not accessible via a query. These queries demonstrate it:
select count(*) from MY_TABLE yields 96576
select count(*) from MY_TABLE where ID <> 3238396 yields 96575
select count(*) from MY_TABLE where ID = 3238396 yields 0
Loading the web console and using the same RUNSCRIPT command and file to load yields a database where I can find the row with that ID.
My first inclination was that I was dealing with some sort of locking issue. I have tried the following (with no change in results):
manually issuing a conn.commit() after the RunScript.execute()
appending ;LOCK_MODE=3 and ;LOCK_MODE=0 to my URL
Any pointers in the right direction on how I can identify what is going on? I ended up inserting:
Server.createWebServer("-trace","-webPort","9083").start()
so that I could run these queries through the web console to sanity-check what was coming back through JDBC. The problem happens consistently in my app and consistently doesn't happen via the web console. So there must be something else at work.
The table schema is not exotic. This is the SQL column from
select * from INFORMATION_SCHEMA.TABLES where TABLE_NAME='MY_TABLE'
CREATE MEMORY TABLE PUBLIC.MY_TABLE(
ID INTEGER SELECTIVITY 100,
P_ID INTEGER SELECTIVITY 4,
TYPE VARCHAR(10) SELECTIVITY 1,
P_ORDER DECIMAL(8, 0) SELECTIVITY 11,
E_GROUP INTEGER SELECTIVITY 1,
P_USAGE VARCHAR(16) SELECTIVITY 1
)
Any push in the right direction would be really appreciated.
EDIT
So it seems that the database is corrupted in some way just after running the RunScript command to load it. While trying to debug what is going on, I tried executing the following:
delete from MY_TABLE where ID <> 3238396
And I ended up with:
Row not found when trying to delete from index "PUBLIC.MY_TABLE_IX1: 95326", SQL statement:
delete from MY_TABLE where ID <> 3238396 [90112-178] 90112/90112 (Help)
I then tried dropping and recreating all my indexes from within the context, but it had no effect on the overall problem.
Help!
EDIT 2
More information: the problem occurs due to the creation of an index. (I believe I have found a bug in h2 and I am working on creating a minimal case that reproduces it.) The simple code below will reproduce the problem, if you have the right set of data.
public static void main(String[] args)
{
    try
    {
        final String DB_H2URL = "jdbc:h2:mem:main;LOCK_MODE=3";
        Class.forName("org.h2.Driver");
        Connection c = DriverManager.getConnection(DB_H2URL, "sa", "");
        FileReader script = new FileReader("db.sql");
        RunScript.execute(c, script);
        script.close();
        Statement st = c.createStatement();
        ResultSet rs = st.executeQuery("select count(*) from MY_TABLE where P_ID = 3238396");
        rs.next();
        if (rs.getLong(1) == 0)
            System.err.println("It happened");
        else
            System.err.println("It didn't happen");
    } catch (Throwable e) {
        e.printStackTrace();
    }
}
I have reduced the db.sql script to about 5000 rows and it still happens. When I went down to 2500 rows, it stopped happening. If I remove the last line of db.sql (which is the index creation), the problem also stops happening. The last line is this:
CREATE INDEX PUBLIC.MY_TABLE_IX1 ON PUBLIC.MY_TABLE(P_ID);
But the data is an important player in this. It still appears to affect only that one row, and the index somehow makes it inaccessible.
EDIT 3
I have identified a minimal data example that reproduces the problem. I stripped the table schema down to a single column, and I found that the values in that column don't seem to matter, just the number of rows. Here are the contents (snipped where obvious) of my db.sql generated via the SCRIPT command:
;
CREATE USER IF NOT EXISTS SA SALT '8eed806dbbd1ea59' HASH '6d55cf715c56f4ca392aca7389da216a97ae8c9785de5d071b49de5436b0c003' ADMIN;
CREATE MEMORY TABLE PUBLIC.MY_TABLE(
P_ID INTEGER SELECTIVITY 100
);
-- 5132 +/- SELECT COUNT(*) FROM PUBLIC.MY_TABLE;
INSERT INTO PUBLIC.MY_TABLE(P_ID) VALUES
(1),
(2),
(3),
... snipped you obviously have breaks in the bulk insert here ...
(5143),
(3238396);
CREATE INDEX PUBLIC.MY_TABLE_IX1 ON PUBLIC.MY_TABLE(P_ID);
But that will recreate the problem. [Note that my numbering skips a value every time there was a bulk-insert break, so there really are 5132 rows even though the values run up to 5143; select count(*) from MY_TABLE yields 5132.] Also, I seem to be able to recreate the problem in the web console directly now by doing:
drop table MY_TABLE
runscript from 'db.sql'
select count(*) from MY_TABLE where P_ID = 3238396
You have recreated the problem if you get 0 back from the select when you know you have a row in there.
Oddly enough, I seem to be able to do
select * from MY_TABLE order by P_ID desc
and I can see the row at this point. But going directly for the row:
select * from MY_TABLE where P_ID = 3238396
yields nothing.
I should note that I am using h2-1.4.178.jar.
The h2 folks have apparently already resolved this:
https://code.google.com/p/h2database/issues/detail?id=566
You either need to get the code from version control or wait for the next release build. Thanks, Thomas.
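Since the fix only exists in newer builds, it may be worth confirming which H2 version is actually on the classpath; a minimal check, assuming your build provides the built-in H2VERSION() function:
-- Returns the version string of the H2 library answering the query,
-- e.g. to confirm you are no longer on 1.4.178
SELECT H2VERSION();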

SQL insert slow on 1 million rows

With TOP 100000 (100k) this query finishes in about 3 seconds; with TOP 1000000 (1 mil) it takes about 2 minutes:
SELECT TOP 1000000
db_id = IDENTITY(int, 1, 1), *
INTO dbo.tablename
FROM dbname.dbo.tablename
The actual execution plan is always:
clustered index scan 4% cost
top
top
compute scalar
insert (96% cost)
select into
The table has 1.3 million rows and an int primary key on the first column.
Can I speed this up somehow? I'm using SQL Server 2008 R2.
The results showed that 100,000 records take 159 ms and 1,000,000 records take 1,435 ms, on Raid 1 OS, Raid 1 Data, Raid 1 Log, and Raid 1 TempDb, all separate drives, in our dev environment.
The results showed that 100,000 records take 113 ms and 1,000,000 records take 996 ms, on my laptop with a single SSD (Samsung 840 250GB). SSDs rock!!!
The results showed that 100,000 records take 188 ms and 1,000,000 records take 1,880 ms, on Raid 1 OS, Raid 10 Data, Raid 10 Log, and Raid 1 TempDb, all separate drives, under a production load.
Here is a complete script showing that 1,000,000 rows takes less than ten times as long as 100,000. Your situation is likely slightly different, but this shows that the fundamentals are not the issue.
The results show that 100,000 records take 146 ms and 1,000,000 records take 1,315 ms.
These results are from my desktop. If someone else could run the script and post their results, that would be very useful.
Rob
USE master;
GO
-- Drop database SourceDB
IF EXISTS (SELECT * FROM sys.databases WHERE name = 'SourceDB') ALTER DATABASE SourceDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
IF EXISTS (SELECT * FROM sys.databases WHERE name = 'SourceDB') DROP DATABASE SourceDB;
GO
-- Create database SourceDB
CREATE DATABASE SourceDB;
ALTER DATABASE SourceDB SET RECOVERY SIMPLE;
GO
USE SourceDB;
GO
-- Create table SourceDB.dbo.SourceTable
CREATE TABLE dbo.SourceTable (
ColID int PRIMARY KEY
);
GO
-- Populate table SourceDB.dbo.SourceTable
DECLARE @i int = 0;
WHILE @i < 1300000
BEGIN
    SET @i += 1;
    INSERT INTO dbo.SourceTable (ColID) VALUES (@i);
END;
GO
-- Drop database Test1
IF EXISTS (SELECT * FROM sys.databases WHERE name = 'Test1') ALTER DATABASE Test1 SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
IF EXISTS (SELECT * FROM sys.databases WHERE name = 'Test1') DROP DATABASE Test1;
GO
-- Create database Test1
CREATE DATABASE Test1;
ALTER DATABASE Test1 SET RECOVERY SIMPLE;
ALTER DATABASE Test1 MODIFY FILE (NAME = Test1, SIZE = 3000MB, MAXSIZE = 8TB);
ALTER DATABASE Test1 MODIFY FILE (NAME = Test1_log, SIZE = 3000MB, MAXSIZE = 2TB);
GO
USE Test1;
GO
IF EXISTS (SELECT * FROM sys.tables WHERE [OBJECT_ID] = OBJECT_ID('dbo.DestinationTable1')) DROP TABLE dbo.DestinationTable1;
IF EXISTS (SELECT * FROM sys.tables WHERE [OBJECT_ID] = OBJECT_ID('dbo.DestinationTable2')) DROP TABLE dbo.DestinationTable2;
GO
DECLARE @n int = 100000;
DECLARE @t1 datetime2 = SYSDATETIME();
SELECT TOP (@n) db_id = IDENTITY(int, 1, 1), *
INTO dbo.DestinationTable1
FROM SourceDB.dbo.SourceTable;
SELECT DATEDIFF(ms, @t1, SYSDATETIME()) AS ElapsedMs;
GO
DECLARE @n int = 1000000;
DECLARE @t1 datetime2 = SYSDATETIME();
SELECT TOP (@n) db_id = IDENTITY(int, 1, 1), *
INTO dbo.DestinationTable2
FROM SourceDB.dbo.SourceTable;
SELECT DATEDIFF(ms, @t1, SYSDATETIME()) AS ElapsedMs;
GO
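One incidental note on the setup step above: the population loop commits, and therefore flushes the log, once per INSERT. A tweak that is not part of the original script but usually speeds the setup up considerably is to wrap the loop in a single transaction:
-- Same population loop as above, but with one commit at the end
-- instead of 1.3 million individual commits
BEGIN TRANSACTION;
DECLARE @i int = 0;
WHILE @i < 1300000
BEGIN
    SET @i += 1;
    INSERT INTO dbo.SourceTable (ColID) VALUES (@i);
END;
COMMIT TRANSACTION;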
