Understanding ClickHouse partitions

I see that ClickHouse created multiple directories for each partition key (on each node).
The documentation says the directory name format is: partition ID_minimum block number_maximum block number_level.
Any idea what "level" is here?
347 distinct partition keys on one node (for one table) created 1358 directories (custom partitioning).
The documentation recommends not having more than 1000 partitions. Should we keep in mind just the number of partition keys, or the number of directories as well?
Also, is there a configuration to control this number of directories?

Any idea what "level" is here?
Level is a concept from LSM-trees. MergeTree tables have mechanisms to merge data parts into bigger and deeper (w.r.t. level) ones: a freshly inserted part has level 0, and each merge produces a part with a higher level.
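As an illustration, a part directory named 201901_1_11_2 belongs to partition 201901, covers block numbers 1 through 11, and has been produced by merges up to level 2. You can inspect these components for yourself in system.parts (the table name is a placeholder):
SELECT name, partition_id, min_block_number, max_block_number, level
FROM system.parts
WHERE table = 'table_name' AND active
ORDER BY partition_id, min_block_number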
Should we keep in mind just the number of partition keys, or the number of directories as well?
Well, I don't think having that many partitions is a good idea, as this method doesn't scale well. You'd better choose a low-cardinality column or expression as the partition key.
Also, is there a configuration to control this number of directories?
No explicit settings for that. But you can easily use a modulo expression to limit the total number of partitions.
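For example, a minimal sketch of such a modulo-based partition key (table and column names are hypothetical):
CREATE TABLE events
(
    user_id UInt64,
    event_time DateTime,
    payload String
)
ENGINE = MergeTree
PARTITION BY user_id % 16   -- caps the table at 16 partitions
ORDER BY (user_id, event_time)
No matter how many distinct user_id values arrive, the table will never have more than 16 partitions.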

Adding to this discussion, you can check parts and partitions in the following ways.
For active partitions:
select count(distinct partition) from system.parts where table in ('table_name') and active
For active parts:
select count() from system.parts where table in ('table_name') and active
Inactive parts will be removed soon, typically in less than 10 minutes.
Furthermore, you can read more about parts, partitions, and how merging happens in the ClickHouse documentation.
To view table parts and partition together :
SELECT
partition,
name,
active
FROM system.parts
WHERE table = 'table_name'

Related

Move Range Interval partition data from one table to history table in other database

We have a primary table that is range partitioned by date with a 1-month interval. It's also list sub-partitioned with 4 distinct values, so essentially each one-month partition has 4 sub-partitions.
Database: Oracle 19c
I need advice on how to effectively move the partition/sub-partition data from active schema to historical schema in another database.
Also, there are about 30 tables that are reference partitioned on the primary table for which the data needs to be moved as well. Overall, I'm looking to move about 2500 subpartitions.
I'm not sure if an exchange partition would be the right approach in this scenario?
TIA
You could use exchange to get the data rapidly out of your active table, but you would still then need to send that table over the wire to the remote history database to load it in.
In which case, using "exchange" probably just adds more steps to the process for little gain. (There are still potential uses here depending on how you want to handle indexing etc.)
But simplest is perhaps just transferring the data over, assuming a common structure between the two tables, i.e.
insert /*+ APPEND */ into history_table@remote_db
select * from active_table partition (myparname)
I can't remember if partition naming syntax is supported over a db link, but if not, the appropriate date predicates will do the same trick. Then just follow up with:
alter table active_table truncate partition myparname;
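For completeness, the exchange approach mentioned above would look roughly like the sketch below (the staging table name is hypothetical). Note that for a composite-partitioned table like yours, the exchange happens per subpartition, or the staging table must itself have matching subpartitioning; the sketch assumes a simple partition:
-- empty staging table with the same structure
create table active_stage as select * from active_table where 1 = 0;
-- dictionary-only swap: the partition's segment becomes the staging table's
alter table active_table exchange partition myparname with table active_stage;
-- now ship the staging table to the remote history database
insert /*+ APPEND */ into history_table@remote_db select * from active_stage;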

Oracle {LIST} Partition by Program and {RANGE} subpartition by DATE with interval

I'm trying to figure the best possible way to determine the partition strategy in Oracle 12c (12.2.0.1.0)
This post is almost identical to my requirements. However, I want to know the best possible way to implement this in the Oracle 12c (12.2.0.1.0) version.
Here is my question:
We have four (4) distinct programs for which the bills are submitted in our system.
The approx volume of bills submitted per year is as follows:
Program_1 ~ 3M per year
Program_2 ~ 1M per year
Program_3 ~ 500K per year
Program_4 ~ 100K per year
My initial thought process is to create PARTITION BY LIST (PROGRAM) AND SUBPARTITION BY RANGE (BILL_SUBMISSION_DATE).
I would like to use the Oracle INTERVAL feature for the subpartitions, and would like to know if there are any limitations with this approach.
Your approach of partitioning by PROGRAM and sub-partitioning by BILL_SUBMISSION_DATE sounds good.
I have not tested performance differences (I imagine they would be negligible), but for coding the INTERVAL option makes querying and maintenance easier in my opinion.
For the following example the table partition clause I used was:
partition by range (INVOICE_MONTH) interval (numtoyminterval(1, 'MONTH'))
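For context, here is a minimal sketch of the full DDL around that clause (all columns other than INVOICE_MONTH are hypothetical); interval partitioning requires at least one initial partition to anchor the intervals:
create table MARK_INV_HDR
( invoice_id    number,
  invoice_month date,
  program       varchar2(30)
)
partition by range (INVOICE_MONTH) interval (numtoyminterval(1, 'MONTH'))
( partition p_initial values less than (date '2012-01-01') );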
Example query, using old-style partition names, querying a partition for invoices for April 2012, assuming I created a partition named INV201204 for that month:
select * from MARK_INV_HDR
partition (INV201204);
And the same query, using INTERVAL automatically generated partitions:
select * from MARK_INV_HDR
where invoice_month = to_date('2012-04', 'yyyy-mm');
The advantage of the latter query is that I don't have to know the naming convention for the partitions.
To drop the oldest partition, one query and one DDL:
select to_char(min(invoice_month), 'dd-Mon-yyyy') as min_inv_dt from MARK_INV_HDR;
MIN_INV_DT
-----------
01-Apr-2012
alter table mark_inv_hdr
drop partition for (TO_DATE('01-Apr-2012', 'dd-Mon-yyyy'))
update global indexes;
EDIT: I forgot that you cannot use the INTERVAL clause on a sub-partition; thanks to user12283435 for the reminder. Looking more closely at the question, there is probably no need to partition on PROGRAM at all, so a single partition by range on BILL_SUBMISSION_DATE with the INTERVAL clause should work fine.
When you have a small set of values like you do for PROGRAM, there is no obvious reason to partition on it. The typical example of partitioning by list given in the Oracle documentation is a list of regions for a global call center, so that you can do batch reports and maintenance on certain regions after business hours, etc. If you don't do many updates and your query criteria frequently include just one PROGRAM, you can put a bitmap index on PROGRAM instead (on a partitioned table, a bitmap index must be local rather than global). Keep in mind that updating a column with a bitmap index will briefly lock the table.
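A sketch of such an index (the index name is hypothetical):
create bitmap index mark_inv_hdr_prog_bix
  on MARK_INV_HDR (program) local;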

How Hive Partition works

I want to know how Hive partitioning works. I know the concept, but I am trying to understand how it actually works and how the data is stored in the exact partition.
Let's say I have a table with a dynamic partition on year, and I ingest data from 2013. How does Hive create the partition and store the data in the correct partition?
If the table is not partitioned, all the data is stored in one directory without order. If the table is partitioned (e.g. by year), data is stored separately in different directories, with each directory corresponding to one year.
For a non-partitioned table, when you want to fetch the data for year=2010, Hive has to scan the whole table to find the 2010 records. If the table is partitioned, Hive just reads the year=2010 directory. Much faster and more I/O efficient.
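As a concrete sketch (table, column, and path names are hypothetical), a table partitioned by year declares the partition column separately from the data columns, and a dynamic partition insert routes each row into the matching directory:
-- partition column lives in the directory name, not in the data files
CREATE TABLE orders (order_id BIGINT, amount DOUBLE)
PARTITIONED BY (year INT)
STORED AS ORC;

-- dynamic partition insert: Hive derives the target directory from the year value
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE orders PARTITION (year)
SELECT order_id, amount, year FROM staging_orders;

-- resulting warehouse layout:
--   /user/hive/warehouse/orders/year=2013/...
--   /user/hive/warehouse/orders/year=2014/...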
Hive organizes tables into partitions. It is a way of dividing a table into related parts based on the values of partitioned columns such as date.
Partitions - apart from being storage units - also allow the user to efficiently identify the rows that satisfy a certain criteria.
Using partition, it is easy to query a portion of the data.
Tables or partitions can be sub-divided into buckets, to provide extra structure to the data that may be used for more efficient querying. Bucketing works based on the value of a hash function of some column of the table.
Suppose you need to retrieve the details of all employees who joined in 2012. Without partitioning, a query searches the whole table for the required information. However, if you partition the employee data by year and store it in separate directories, you reduce the query processing time.
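To illustrate partitioning and bucketing together, a minimal sketch with hypothetical names; each join_year partition is split into 8 bucket files by hashing emp_id:
CREATE TABLE employees (emp_id BIGINT, name STRING)
PARTITIONED BY (join_year INT)
CLUSTERED BY (emp_id) INTO 8 BUCKETS
STORED AS ORC;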

Hive query optimisation

I have to perform an incremental load into an internal table from an external table in Hive, where the source data file is appended with new records on a daily basis. The new records can be filtered out based on the timestamp (column load_ts in the table) at which they were loaded. I am trying to achieve this by selecting the records from the source table whose load_ts is greater than the current max(load_ts) in the target table, as given below:
INSERT INTO TABLE target_temp PARTITION (DATA_DT)
SELECT ms.* FROM temp_db.source_temp ms
JOIN (select max(load_ts) max_load_ts from target_temp) mt
ON 1=1
WHERE
ms.load_ts > mt.max_load_ts;
But the above query does not give the desired output and takes a very long time to execute (which should not be the case with the Map-Reduce paradigm).
I have tried other scenarios as well, like passing the max(load_ts) as a variable instead of joining, but still see no improvement in performance. It would be very helpful if anyone could share their insights as to what is possibly incorrect in this approach, along with any alternate solutions.
First of all, the map/reduce model does not guarantee that your queries will run faster. The main idea is that performance will scale roughly linearly with the number of nodes, but you still have to think about how you're doing things, more so than in normal SQL.
The first thing to check is whether the source table is partitioned by time. If not, it should be, since otherwise you'd be reading the whole table every single time.
Second, you're also calculating the max every time, on the whole destination table. You could make this a lot faster by calculating the max only on the last partition, so change this
JOIN (select max(load_ts) max_load_ts from target_temp) mt
to this (you didn't name the partition column, so I am going to assume it's called 'dt'):
JOIN (select max(load_ts) max_load_ts from target_temp WHERE dt=PREVIOUS_DATA_DT) mt
since we know the max load_ts is going to be in the last partition.
Otherwise, it's hard to help without knowing the structure of the source table, and, like somebody else commented, the sizes of the two tables.
A JOIN is slower than a variable in the WHERE clause, but the main problem with performance here is that your query performs a full scan of both the target table and the source table. I would recommend:
Query only the latest partition for max(load_ts).
Enable statistics gathering and usage
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.stats.autogather=true;
Compute statistics on both tables for columns.
Statistics will make queries like selecting max(partition) or max(ts) execute faster.
Try to put source partition files into the target partition folder instead of using INSERT, if applicable (the partitioning and storage format of the target and source tables must allow this). This works fine, for example, with the textfile storage format when a source partition contains only rows greater than the target's max. You can combine the copy-files method (for source partitions that contain exactly the rows to be inserted, with no filtering needed) with INSERT (for partitions containing mixed data that needs filtering).
Hive may be merging your files during INSERT. This merge phase takes additional time and adds an additional stage to the job. Check the hive.merge.mapredfiles option and try to switch it off.
And of course, use a pre-calculated variable instead of the join, as sketched below.
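For instance, a minimal sketch of the variable approach; the value of max_load_ts is assumed to be pre-computed by an outer orchestration script and passed in as a hivevar:
-- hypothetical value computed beforehand by the orchestrating script
SET hivevar:max_load_ts=2019-01-01 00:00:00;

INSERT INTO TABLE target_temp PARTITION (DATA_DT)
SELECT ms.*
FROM temp_db.source_temp ms
WHERE ms.load_ts > '${hivevar:max_load_ts}';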
Use Cost-Based Optimisation Technique by enabling below properties
set hive.cbo.enable=true;
set hive.stats.autogather=true;
set hive.stats.fetch.column.stats=true;
set hive.compute.query.using.stats=true;
set hive.vectorized.execution.enabled=true;
set hive.exec.parallel=true;
Also analyze both tables:
ANALYZE TABLE temp_db.source_temp COMPUTE STATISTICS FOR COLUMNS [comma_separated_column_list];
ANALYZE TABLE target_temp PARTITION(DATA_DT) COMPUTE STATISTICS;

comparing data in two tables taking time

I need to query table 1 to find all orders and their created dates (the key is order number and date).
In table 2 (the key is also order number and date), I need to check whether the order exists for a date.
For this I am scanning table 1 and, for each record, checking if it exists in table 2. Is there a better way to do this?
In this situation, in which your key is identical for both tables, it makes sense to have a single table in which you store the data for both Table 1 and Table 2. That way you can do a single scan over your data and know straight away whether the data exists for both criteria.
Even more so, if you want to use this data in MapReduce, you would simply scan that single table. If you only want to get the relevant rows, you could define a filter on the Scan; for example, in the case where you will not be populating rows at all in Table 2, you would simply use a ColumnPrefixFilter.
If, however, you do need to keep this data in 2 separate tables, you could pre-split both tables with the same region boundaries. This will be helpful for the query you are aiming for (load all rows in Table 1 where the row exists in Table 2): essentially a map-side join. You could define multiple inputs in your MapReduce job, and since the region borders are the same, the splits will be such that each mapper receives the corresponding rows from both tables. You would probably need to implement your own MultipleInput format for that (the MultiTableInputFormat class recently introduced in 0.96 does not seem to do that map-side join).
