Is there a way to fetch the table or column size for tables created with the Log engine in ClickHouse? I know that MergeTree column sizes can be queried from the system.columns table. But for the Log engine, it returns 0 for data_compressed_bytes and data_uncompressed_bytes:
┌─database─┬─table──────────────────────────────────┬─name───────────┬─type─────────────┬─default_kind─┬─default_expression─┬─data_compressed_bytes─┬─data_uncompressed_bytes─┬─marks_bytes─┬─comment─┬─is_in_partition_key─┬─is_in_sorting_key─┬─is_in_primary_key─┬─is_in_sampling_key─┬─compression_codec─┐
│ default │ table_321895094fb2431cad3cc27ca070ec86 │ Related Change │ Nullable(String) │ │ │ 0 │ 0 │ 0 │ │ 0 │ 0 │ 0 │ 0 │ │
│ default │ table_4a02605b096f47a381288279891a6aa9 │ Related Change │ Nullable(String) │ │ │ 0 │ 0 │ 0 │ │ 0 │ 0 │ 0 │ 0 │ │
│ default │ table_ef618f6114f646759225ffa6cd4d330b │ Related Change │ Nullable(String) │ │ │ 0 │ 0 │ 0 │ │ 0 │ 0 │ 0 │ 0 │ │
└──────────┴────────────────────────────────────────┴────────────────┴──────────────────┴──────────────┴────────────────────┴───────────────────────┴─────────────────────────┴─────────────┴─────────┴─────────────────────┴───────────────────┴───────────────────┴────────────────────┴───────────────────┘
I searched a lot but couldn't find a setting that enables this for the Log engine. If it's not possible, what would be a rough estimate given the row count and column types? Or would that be too far off, since ClickHouse compresses the data?
You need to look at the table data on the file system:
Find out the location of the table files:
SELECT *
FROM system.tables
WHERE name = '{table name}'
FORMAT Vertical
/* Result:
Row 1:
──────
..
data_paths: ['/var/lib/clickhouse/data/test/log_engine_001/']
..
*/
List all files in this directory:
sudo ls -lsh /var/lib/clickhouse/data/test/log_engine_001
# Result:
# total 88K
# 4.0K -rw-r----- 1 clickhouse clickhouse 64 Feb 3 20:28 __marks.mrk
# 40K -rw-r----- 1 clickhouse clickhouse 40K Feb 3 20:28 id.bin
# 40K -rw-r----- 1 clickhouse clickhouse 40K Feb 3 20:28 name.bin
# 4.0K -rw-r----- 1 clickhouse clickhouse 100 Feb 3 20:28 sizes.json
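Depending on your ClickHouse version, system.tables may also expose total_rows and total_bytes for Log-family tables (table-level only, not per column), which can save the trip to the shell; a quick check, assuming the same test.log_engine_001 table:
SELECT name, total_rows, total_bytes
FROM system.tables
WHERE (database = 'test') AND (name = 'log_engine_001')
FORMAT Vertical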
I executed the following query, but it processed ~1B rows and took 75 seconds in total for a simple count.
SELECT count(*)
FROM events_distributed
WHERE (orgId = '174a4727-1116-4c5c-8234-ab76f2406c4a') AND (timestamp >= '2022-12-05 00:00:00.000000000')
Query id: e4312ff5-6add-4757-8deb-d68e0f3e29d9
┌──count()─┐
│ 13071204 │
└──────────┘
1 row in set. Elapsed: 74.951 sec. Processed 979.00 million rows, 8.26 GB (13.06 million rows/s., 110.16 MB/s.)
I am wondering how I can speed this up. My events table has the following PARTITION BY and ORDER BY keys, plus a bloom filter index on orgid:
PARTITION BY toDate(timestamp)
ORDER BY (timestamp);
INDEX idx_orgid orgid TYPE bloom_filter(0.01) GRANULARITY 1,
Below is the execution plan
EXPLAIN indexes = 1
SELECT count(*)
FROM events_distributed
WHERE (orgid = '174a4727-1116-4c5c-8234-ab76f240fc4a') AND (timestamp >= '2022-12-05 00:00:00.000000000') AND (timestamp <= '2022-12-06 00:00:00.000000000')
Query id: 879c2ce5-c4c7-4efc-b0e2-25613848afad
┌─explain────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY)) │
│ MergingAggregated │
│ Union │
│ Aggregating │
│ Expression (Before GROUP BY) │
│ Filter (WHERE) │
│ ReadFromMergeTree (users.events) │
│ Indexes: │
│ MinMax │
│ Keys: │
│ timestamp │
│ Condition: and((timestamp in (-Inf, '1670284800']), (timestamp in ['1670198400', +Inf))) │
│ Parts: 12/342 │
│ Granules: 42122/407615 │
│ Partition │
│ Keys: │
│ toDate(timestamp) │
│ Condition: and((toDate(timestamp) in (-Inf, 19332]), (toDate(timestamp) in [19331, +Inf))) │
│ Parts: 12/12 │
│ Granules: 42122/42122 │
│ PrimaryKey │
│ Keys: │
│ timestamp │
│ Condition: and((timestamp in (-Inf, '1670284800']), (timestamp in ['1670198400', +Inf))) │
│ Parts: 12/12 │
│ Granules: 30696/42122 │
│ Skip │
│ Name: idx_orgid │
│ Description: bloom_filter GRANULARITY 1 │
│ Parts: 8/12 │
│ Granules: 20556/30696 │
│ ReadFromRemote (Read from remote replica) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
32 rows in set. Elapsed: 0.129 sec.
How can I speed up this query? Processing 1B rows to return a count of 13M sounds like something is totally off. Would creating a SET index on orgid be any better? I will have at most 10K orgs.
The queries I typically run are:
SELECT org_level, min(timestamp) as minTimeStamp,max(timestamp) as maxTimeStamp, toStartOfInterval(toDateTime(timestamp), INTERVAL <step> second) as roundedDownTs, count(*) as cnt, orgid
FROM events_distributed
WHERE orgid = 'foo' and timestamp BETWEEN <one week>
GROUP BY roundedDownTs, orgid, org_level
ORDER BY roundedDownTs DESC;
Please note that <step> here would be any of the following values: 0, 60, 240, 1440, 10080.
And another query, here for a one-week time slice (though it can be any time slice); I always want the results in descending order because this is time-series data:
SELECT org_text
FROM events_distributed
WHERE (orgid = '174a4727-1116-4c5c-8234-ab76f2406c4a') AND (timestamp >= '2022-12-01 00:00:00.000000000' and timestamp <= '2022-12-07 00:00:00.000000000') order by timestamp DESC LIMIT 51;
You don't use the primary index.
What I suggest is to use:
PARTITION BY toDate(timestamp)
ORDER BY (orgId, timestamp)
https://kb.altinity.com/engines/mergetree-table-engine-family/pick-keys/
And remove the bloom_filter index.
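For illustration, a minimal sketch of the reordered table (the column types below are assumed, since the full schema isn't shown in the question; only the key change matters):
CREATE TABLE events
(
    -- column types are assumed for this sketch
    orgid     String,
    org_level String,
    org_text  String,
    timestamp DateTime64(9)
)
ENGINE = MergeTree
PARTITION BY toDate(timestamp)
ORDER BY (orgid, timestamp);
With orgid leading the sorting key, the primary index can skip all granules that belong to other orgs, so a per-org count over a day or a week no longer has to scan ~1B rows.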
I want to embed a sub-module directory in the parent module, but Go reports: pattern tpl/api_new/*: cannot embed file tpl/api_new/README.md: in different module.
I know that I can delete go.mod & go.sum and then run "go mod init && go get -u" when the new project is generated.
Below are the file tree and the embed variable. Is there anything else I can do to embed go.mod & go.sum?
Thanks~
//go:embed tpl/api_new/*
var apiNew embed.FS
├─api_new
│ │ .editorconfig
│ │ .gitignore
│ │ generate.go
│ │ go.mod
│ │ go.sum
│ │ makefile
│ │ README.md
│ │
│ ├─cmd
│ │ └─app
│ │ main.go
│ │
│ ├─config
│ │ config-dev.toml
│ │ config-live.toml
│ │ config-local.toml
│ │ config-prod.toml
│ │ config-stress.toml
│ │ config-trunk.toml
│ │
│ └─internal
│ └─app
│ ├─http
│ │ │ server.go
│ │ │
│ │ └─example
│ │ hello.go
│ │
│ ├─lib
│ │ ├─err
│ │ │ codecommon.go
│ │ │ err.go
│ │ │
│ │ ├─pms
│ │ │ init.go
│ │ │
│ │ └─util
│ │ md5.go
│ │ url.go
│ │
│ ├─model
│ │ │ init.go
│ │ │
│ │ ├─grpc
│ │ │ ├─roomaggregation
│ │ │ │ aggregation.proto
│ │ │ │ base.go
│ │ │ │
│ │ │ ├─roombase
│ │ │ │ base.proto
│ │ │ │ roombase.go
│ │ │ │
│ │ │ └─roomlist
│ │ │ base.proto
│ │ │ icon.go
│ │ │
│ │ ├─hrpc
│ │ │ │ init.go
│ │ │ │
│ │ │ └─efs
│ │ │ efs.go
│ │ │ init.go
│ │ │ option.go
│ │ │
│ │ └─redis
│ │ ├─attachInfo
│ │ │ index.go
│ │ │
│ │ ├─outing
│ │ │ index.go
│ │ │
│ │ ├─roomcity
│ │ │ roomcity.go
│ │ │
│ │ └─roomjump
│ │ index.go
│ │
│ └─service
│ │ init.go
│ │
│ └─example
│ hello.go
Each module within a repository is stored separately within the module cache. By design, the presence of a go.mod file in a subdirectory causes that entire subtree to be completely pruned out of the outer module.
If you really need the individual files in tpl/api_new to be accessible to the module in the parent directory, then you can either:
remove the inner go.mod and go.sum files to put the source files all in the same module, or
export the embed.FS data from some package within the …/tpl/api_new module (perhaps an internal package), and import that package in the parent-directory module in order to access that data programmatically (a minimal sketch of this option follows below).
is there anything else I can do to embed go.mod & go.sum?
No.
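A minimal sketch of the second option (all package, path, and module names here are placeholders, not taken from the original project):
// tpl/api_new/embedded.go — this file lives inside the sub-module, so it may
// embed that module's own files, including go.mod and go.sum.
package apinew

import "embed"

// Files exposes the template tree to importers.
// A plain "*" pattern skips names starting with "." (e.g. .gitignore);
// on Go 1.18+ an "all:"-prefixed pattern can include those too.
//
//go:embed *
var Files embed.FS

// --- parent module, e.g. main.go (separate file) ---
// Import the sub-module's package instead of embedding across the module boundary.
// The parent go.mod would need something like:
//   require example.com/parent/tpl/api_new v0.0.0
//   replace example.com/parent/tpl/api_new => ./tpl/api_new
// (or a go.work workspace).
package main

import (
    "fmt"

    apinew "example.com/parent/tpl/api_new"
)

func main() {
    b, err := apinew.Files.ReadFile("go.mod") // read one of the embedded files
    if err != nil {
        panic(err)
    }
    fmt.Println(string(b))
}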
I have a problem when I want to load data from a CSV file into a table I have created. The database I created is called "test" and the table is created as follows:
CREATE TABLE testing
(
`year` Int16,
`amount` Int16,
`rate` Float32,
`number` Int16
)
ENGINE = Log
Ok.
0 rows in set. Elapsed: 0.033 sec.
I created all these columns to cover all the data in the CSV file. I've read through the ClickHouse documentation but just can't figure out how to get the data into my database.
I tried this:
$ cat test.csv | clickhouse-client \
> -- database =test \
> --query='INSERT test FORMAT CSV'
Code: 62. DB::Exception: Syntax error: failed at position 1 (line 1, col 1): 2010,646,1.00,13
2010,2486,1.00,19
2010,8178,1.00,10
2010,15707,1.00,4
2010,15708,1.00,10
2010,15718,1.00,4
2010,16951,1.00,8
2010,17615,1.00,13
2010. Unrecognized token
Link: https://yadi.sk/d/ijJlmnBjsjBVc
cat test.csv |clickhouse-client -d test -q 'INSERT into testing FORMAT CSV'
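Equivalently, with long options (the original attempt was missing INTO and had stray spaces around --database):
cat test.csv | clickhouse-client --database=test --query='INSERT INTO testing FORMAT CSV'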
SELECT *
FROM test.testing
┌─year─┬─amount─┬─rate─┬─number─┐
│ 2010 │ 646 │ 1 │ 13 │
│ 2010 │ 2486 │ 1 │ 19 │
│ 2010 │ 8178 │ 1 │ 10 │
│ 2010 │ 15707 │ 1 │ 4 │
│ 2010 │ 15708 │ 1 │ 10 │
│ 2010 │ 15718 │ 1 │ 4 │
│ 2010 │ 16951 │ 1 │ 8 │
│ 2010 │ 17615 │ 1 │ 13 │
│ 2010 │ 17616 │ 1 │ 4 │
│ 2010 │ 17617 │ 1 │ 8 │
│ 2010 │ 17618 │ 1 │ 9 │
I'm playing with data in csv format from https://dev.maxmind.com/geoip/geoip2/geolite2/.
Essentially, it's data that maps IP blocks to ASN and country.
I have 2 tables, both using the Memory engine; the first has 299727 records, the second 406685.
SELECT *
FROM __ip_block_to_country
LIMIT 5
┌─network────┬───────id─┬───min_ip─┬───max_ip─┬─geoname_id─┬─country_iso_code─┬─country_name─┐
│ 1.0.0.0/24 │ 16777216 │ 16777217 │ 16777472 │ 2077456 │ AU │ Australia │
│ 1.0.1.0/24 │ 16777472 │ 16777473 │ 16777728 │ 1814991 │ CN │ China │
│ 1.0.2.0/23 │ 16777728 │ 16777729 │ 16778240 │ 1814991 │ CN │ China │
│ 1.0.4.0/22 │ 16778240 │ 16778241 │ 16779264 │ 2077456 │ AU │ Australia │
│ 1.0.8.0/21 │ 16779264 │ 16779265 │ 16781312 │ 1814991 │ CN │ China │
└────────────┴──────────┴──────────┴──────────┴────────────┴──────────────────┴──────────────┘
SELECT *
FROM __ip_block_to_asn
LIMIT 5
┌─network──────┬─autonomous_system_number─┬─autonomous_system_organization─┬───────id─┬─subnet_count─┬───min_ip─┬───max_ip─┐
│ 1.0.0.0/24 │ 13335 │ Cloudflare Inc │ 16777216 │ 255 │ 16777217 │ 16777472 │
│ 1.0.4.0/22 │ 56203 │ Gtelecom-AUSTRALIA │ 16778240 │ 1023 │ 16778241 │ 16779264 │
│ 1.0.16.0/24 │ 2519 │ ARTERIA Networks Corporation │ 16781312 │ 255 │ 16781313 │ 16781568 │
│ 1.0.64.0/18 │ 18144 │ Energia Communications,Inc. │ 16793600 │ 16383 │ 16793601 │ 16809984 │
│ 1.0.128.0/17 │ 23969 │ TOT Public Company Limited │ 16809984 │ 32767 │ 16809985 │ 16842752 │
└──────────────┴──────────────────────────┴────────────────────────────────┴──────────┴──────────────┴──────────┴──────────┘
Now, I want to examine which countries cover the entire IP pool of an ASN. The query below just obtains the indexes of the satisfying countries.
SELECT idx from(
SELECT
(
SELECT groupArray(min_ip),groupArray(max_ip),groupArray(country_iso_code),groupArray(country_name)
FROM __ip_block_to_country
) t,
arrayFilter((i,mii, mai) -> min_ip >= mii and max_ip <= mai, arrayEnumerate(t.1), t.1, t.2) as idx
FROM __ip_block_to_asn
);
I got the following exception:
Received exception from server (version 1.1.54394):
Code: 241. DB::Exception: Received from localhost:9000, ::1. DB::Exception: Memory limit (for query) exceeded: would use 512.02 GiB (attempt to allocate chunk of 549755813888 bytes), maximum: 37.25 GiB.
My question is:
It seems like the statement SELECT groupArray(min_ip),groupArray(max_ip),groupArray(country_iso_code),groupArray(country_name) is executed for every record of __ip_block_to_asn, which is why the query needs so much memory. Is that what happens with my query?
The scalar subquery is executed only once.
But to execute arrayFilter, the arrays are replicated for every row of each processed block from the __ip_block_to_asn table. It is something like a cross join of the two tables.
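As a rough back-of-the-envelope check (assuming a few bytes per array element): ~300,000 elements per array, replicated for ~400,000 rows of __ip_block_to_asn, is on the order of 10^11 values, i.e. hundreds of GiB, which is the same order of magnitude as the 512 GiB in the error message.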
To overcome this, you can use a smaller block size for the SELECT from __ip_block_to_asn.
It is controlled by the max_block_size setting. But for Memory tables, blocks always have the same size as when they were inserted into the table, regardless of the max_block_size setting during SELECT. To allow a flexible block size, you can reload this table into a TinyLog table:
CREATE TABLE __ip_block_to_asn2 ENGINE = TinyLog AS SELECT * FROM __ip_block_to_asn
Then execute:
SET max_block_size = 10;
SELECT idx from(
SELECT
(
SELECT groupArray(min_ip),groupArray(max_ip),groupArray(country_iso_code),groupArray(country_name)
FROM __ip_block_to_country
) t,
arrayFilter((i,mii, mai) -> min_ip >= mii and max_ip <= mai, arrayEnumerate(t.1), t.1, t.2) as idx
FROM __ip_block_to_asn2
);
I have data like this in a text file:
CLASS col2 col3 ...
1 ... ... ...
1 ... ... ...
2 ... ... ...
2 ... ... ...
2 ... ... ...
I load them using the following code:
data = readdlm("file.txt")[2:end, :] # without header line
And now I would like to get an array with only the rows from class 1.
(Data could be loaded using some other function if it would help.)
Logical indexing is the straightforward way to do filtering on an array:
data[data[:,1] .== 1, :]
If, though, you read your file in as a data frame, you'll have more options available to you, and it'll keep track of your headers:
julia> using DataFrames
julia> df = readtable("file.txt", separator=' ')
5×4 DataFrames.DataFrame
│ Row │ CLASS │ col2 │ col3 │ _ │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ "..." │ "..." │ "..." │
│ 2 │ 1 │ "..." │ "..." │ "..." │
│ 3 │ 2 │ "..." │ "..." │ "..." │
│ 4 │ 2 │ "..." │ "..." │ "..." │
│ 5 │ 2 │ "..." │ "..." │ "..." │
julia> df[df[:CLASS] .== 1, :] # Refer to the column by its header name
2×4 DataFrames.DataFrame
│ Row │ CLASS │ col2 │ col3 │ _ │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ "..." │ "..." │ "..." │
│ 2 │ 1 │ "..." │ "..." │ "..." │
There are even more tools available with the DataFramesMeta package that aim to make this simpler (and other packages actively under development). You can use its @where macro to do SQL-style filtering:
julia> using DataFramesMeta
julia> @where(df, :CLASS .== 1)
2×4 DataFrames.DataFrame
│ Row │ CLASS │ col2 │ col3 │ _ │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ "..." │ "..." │ "..." │
│ 2 │ 1 │ "..." │ "..." │ "..." │
data[find(x -> data[x, 1] == 1, 1:size(data)[1]), :]  # same filter via explicit row indices (on Julia ≥ 0.7, use findall instead of find)