Clustering order does not work with compound partition key - sorting

With the following table definition:
CREATE TABLE device_by_create_date (
year int,
comm_nr text,
created_at timestamp,
PRIMARY KEY ((year, comm_nr), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC)
Comm_nr is a unique identifier.
I would expect to see the data ordered by the created_at column, but that is not the case when I add data.
How can I issue select * from table; queries that return data ordered by the created_at column?

TLDR: You need to create a new table.
Your partition key is (year, comm_nr). Your created_at key is ordered, but it is ordered WITHIN that partition key. A query such as SELECT * FROM table WHERE year=x AND comm_nr=y; will be ordered by created_at.
Additionally, if your key were year, comm_nr, created_at instead of (year, comm_nr), created_at, then even if your CREATE TABLE syntax only specified created_at as having a clustering order, it would be created as WITH CLUSTERING ORDER BY (comm_nr DESC, created_at DESC). Data is sorted within SSTables by key from left to right.
The way to do this in true NoSQL fashion is to create a separate table where your key is instead year, created_at, comm_nr. You would write to both tables on user creation, but if you needed to answer who created their account first, you would query the new table instead.
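A minimal sketch of such a companion table (the name device_by_create_date_sorted is only illustrative, not from the original post); comm_nr is appended as a tie-breaker for rows with the same timestamp:
CREATE TABLE device_by_create_date_sorted (
  year int,
  comm_nr text,
  created_at timestamp,
  PRIMARY KEY ((year), created_at, comm_nr)
) WITH CLUSTERING ORDER BY (created_at DESC, comm_nr ASC);

-- Writes go to both tables; reads that need rows ordered by created_at
-- within a year hit this one:
SELECT * FROM device_by_create_date_sorted WHERE year = 2020;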

Related

How to decide the partition key for ClickHouse

I want to know what the best practice is for choosing the partition key.
In my project, we have a table with event_date, app_id and other columns. The number of app_id values keeps growing and could reach thousands.
The select query is based on event_date and app_id.
The simple data schema is as below:
CREATE TABLE test.test_custom_partition (
  company_id UInt64,
  app_id String,
  event_date DateTime,
  event_name String
) ENGINE = MergeTree()
PARTITION BY (toYYYYMMDD(event_date), app_id)
ORDER BY (app_id, company_id, event_date)
SETTINGS index_granularity = 8192;
The select query looks like this:
select event_name from test_custom_partition
where event_date >= '2020-07-01 00:00:00' AND event_date <= '2020-07-15 00:00:00'
AND app_id = 'test';
I want to use (toYYYYMMDD(event_date), app_id) as the partition key, so the query reads the minimal set of data parts. But that could produce more than 1,000 partitions, and the documentation says:
A merge only works for data parts that have the same value for the
partitioning expression. This means you shouldn't make overly granular
partitions (more than about a thousand partitions). Otherwise, the
SELECT query performs poorly because of an unreasonably large number
of files in the file system and open file descriptors.
Or should I use only toYYYYMMDD(event_date) as the partition key?
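For reference, the date-only alternative being considered would look roughly like this (a sketch; the table name test.test_partition_by_date is made up for illustration). It relies on app_id being first in ORDER BY so each daily partition can still be searched by app_id efficiently:
CREATE TABLE test.test_partition_by_date (
  company_id UInt64,
  app_id String,
  event_date DateTime,
  event_name String
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(event_date)
ORDER BY (app_id, company_id, event_date)
SETTINGS index_granularity = 8192;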
Also, could anyone explain why there shouldn't be more than 1,000 partitions? Even if the query only reads a small set of data parts, could the partition count still cause a performance issue?
Thanks

Fast comparison of a list with itself

I have a giant list (100k entries) in my database. Each entry contains an id, a text and a date.
I created a function to compare two texts as well as possible. What it looks like is not important right now.
Is there a "good" way to remove "duplicates" (as far as possible) from the list by text?
Currently I'm looping through the list twice and comparing each entry with every other entry, skipping only itself by id.
If your question is about duplicates introduced when you insert a row into the table, you can include a unique constraint.
Postgresql
CREATE TABLE table1 (
  id serial PRIMARY KEY,
  txt VARCHAR(50),
  dt timestamp,
  UNIQUE(txt)
);
Oracle
CREATE TABLE table1
( id numeric(10) NOT NULL,
  txt varchar2(50) NOT NULL,
  dt timestamp,
  CONSTRAINT txt_unique UNIQUE (txt)
);
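With a unique constraint in place, inserting an exact duplicate text fails. In PostgreSQL you could also skip such duplicates silently on insert; a sketch against the table above:
INSERT INTO table1 (txt, dt)
VALUES ('some text', now())
ON CONFLICT (txt) DO NOTHING;  -- silently skipped if txt already exists
Note this only catches exact matches on txt, not the near-duplicates a similarity function would find.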

Create a generic DB table

I have multiple products, and each of them has its own Product table and Value table. Now I have to create a generic screen to validate those products, and I don't want to create a validation table for each product. I want to create a generic table that holds all the product details plus one extra column called ProductIdentifier. The problem is that this generic table may end up holding millions of records, and fetching the data will be slow.
Is there any better solution?
"Millions of records" sounds like a VLDB problem. I'd put the data into a partitioned table:
CREATE TABLE myproducts (
  productIdentifier NUMBER,
  value1 VARCHAR2(30),
  value2 DATE
) PARTITION BY LIST (productIdentifier)
( PARTITION p1 VALUES (1),
  PARTITION p2 VALUES (2),
  PARTITION p5to9 VALUES (5,6,7,8,9)
);
For queries that are dealing with only one product, specify the partition:
SELECT * FROM myproducts PARTITION FOR (9);
For your general report, just omit the partition and you get all numbers:
SELECT * FROM myproducts;
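If new product identifiers appear later, the list can be extended; a sketch, assuming the myproducts table above (the partition name and value are illustrative):
ALTER TABLE myproducts ADD PARTITION p10 VALUES (10);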
Documentation is here:
https://docs.oracle.com/en/database/oracle/oracle-database/12.2/vldbg/toc.htm

Oracle Composite Range-Hash Partitioning

I am trying to partition my table as below
CREATE TABLE ABC (
  id VARCHAR2(100) PRIMARY KEY,
  datecreated DATE
)
PARTITION BY RANGE (datecreated) INTERVAL (NUMTODSINTERVAL(1,'DAY'))
SUBPARTITION BY HASH (id) SUBPARTITIONS 4
(PARTITION lessthan2018 VALUES LESS THAN (TIMESTAMP '2018-01-01 00:00:00'));
If I use a WHERE clause on the ID column, will performance improve? Would performance be the same if I just partitioned by date? Will it perform the same since ID is already the primary key?
If id is the primary key then you will have a unique global index on that column, and partitioning will not make any difference because the index already takes you to the physical address of the specified row.
Also, dropping, truncating or exchanging a partition will invalidate the index unless you specify the update global indexes clause.
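A sketch of that clause against the example table (the date in PARTITION FOR is illustrative, and the daily interval partition must already exist):
ALTER TABLE ABC
  DROP PARTITION FOR (TO_DATE('2018-06-15','YYYY-MM-DD'))
  UPDATE GLOBAL INDEXES;  -- keeps the global PK index on id usable instead of invalidating it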
If id were not the PK but something non-unique, like a product type or an area code, then a query on just that column without any date would need to check every partition.

Secondary indexes on composite keys in Cassandra

I have this table in Cassandra:
CREATE TABLE global_product_highlights (
  deal_id text,
  product_id text,
  highlight_strength double,
  category_id text,
  creation_date timestamp,
  rank int,
  PRIMARY KEY (deal_id, product_id, highlight_strength)
)
When I fire the query below in Golang:
err = session.Query("select product_id from global_product_highlights where category_id=? order by highlight_strength DESC",default_category).Scan(&prodId_array)
I get: ERROR: ORDER BY with 2ndary indexes is not supported.
I have an index on category_id.
I don't completely understand how a secondary index is applied to composite keys in Cassandra.
I'd appreciate it if anyone could explain and rectify this one.
The ORDER BY clause in Cassandra only works on your first clustering column (2nd column in the primary key), which in this case is your product_id. This DataStax doc states that:
Querying compound primary keys and sorting results ORDER BY clauses
can select a single column only. That column has to be the second
column in a compound PRIMARY KEY.
So, if you want to have your table sorted by highlight_strength, then you'll need to make that field the first clustering column.
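A sketch of that reordering (illustrative only; the table name is made up and the same columns are kept), with highlight_strength as the first clustering column:
CREATE TABLE global_product_highlights_by_strength (
  deal_id text,
  product_id text,
  highlight_strength double,
  category_id text,
  creation_date timestamp,
  rank int,
  PRIMARY KEY (deal_id, highlight_strength, product_id)
) WITH CLUSTERING ORDER BY (highlight_strength DESC, product_id ASC);

-- Rows within a deal_id partition come back already ordered by highlight_strength:
SELECT product_id FROM global_product_highlights_by_strength WHERE deal_id = ?;
Note this still queries by the partition key (deal_id); filtering by category_id alone would call for its own table keyed on category_id rather than a secondary index combined with ORDER BY.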
