Secondary indexes on composite keys in Cassandra - Go

I have this table in Cassandra:
CREATE TABLE global_product_highlights (
deal_id text,
product_id text,
highlight_strength double,
category_id text,
creation_date timestamp,
rank int,
PRIMARY KEY (deal_id, product_id, highlight_strength)
)
When I run the query below in Go:
err = session.Query("select product_id from global_product_highlights where category_id=? order by highlight_strength DESC", default_category).Scan(&prodId_array)
I get the error: ORDER BY with 2ndary indexes is not supported.
I have a secondary index on category_id.
I don't completely understand how secondary indexes are applied to composite keys in Cassandra.
I'd appreciate it if anyone could explain and help fix this.

The ORDER BY clause in Cassandra only works on your first clustering column (the 2nd column in the primary key), which in this case is your product_id. The DataStax documentation on querying compound primary keys and sorting results states:
ORDER BY clauses can select a single column only. That column has to be the
second column in a compound PRIMARY KEY.
So, if you want to have your table sorted by highlight_strength, then you'll need to make that field the first clustering column.
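For illustration, here is a minimal sketch of that reworked table and query in Go with gocql. The contact point, keyspace, and deal_id value are assumptions, not from the question; note also that gocql's Scan reads a single row, so an Iter is used to collect several product ids.

package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1") // assumed contact point
	cluster.Keyspace = "catalog"             // hypothetical keyspace
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// highlight_strength is now the first clustering column, so rows within
	// each deal_id partition are stored in descending strength order.
	if err := session.Query(`CREATE TABLE IF NOT EXISTS global_product_highlights (
		deal_id text,
		highlight_strength double,
		product_id text,
		category_id text,
		creation_date timestamp,
		rank int,
		PRIMARY KEY (deal_id, highlight_strength, product_id)
	) WITH CLUSTERING ORDER BY (highlight_strength DESC, product_id ASC)`).Exec(); err != nil {
		log.Fatal(err)
	}

	// ORDER BY is only accepted when the partition key is restricted, hence
	// the filter on deal_id rather than on the indexed category_id.
	iter := session.Query(`SELECT product_id FROM global_product_highlights
		WHERE deal_id = ? ORDER BY highlight_strength DESC`, "deal-123").Iter()
	var productID string
	for iter.Scan(&productID) {
		fmt.Println(productID)
	}
	if err := iter.Close(); err != nil {
		log.Fatal(err)
	}
}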

Related

Oracle Composite Range-Hash Partitioning

I am trying to partition my table as below:
CREATE TABLE ABC(
id VARCHAR2(100) primary key,
datecreated DATE)
PARTITION BY RANGE (datecreated) INTERVAL (NUMTODSINTERVAL(1,'DAY'))
SUBPARTITION BY HASH (ID) SUBPARTITIONS 4
(PARTITION lessthan2018 VALUES LESS THAN (TIMESTAMP' 2018-01-01 00:00:00') );
If I use a WHERE clause on the "ID" column, will performance improve? Would performance be the same if I just partitioned by date? Will it perform the same since ID is already the primary key?
If id is the primary key, then you will have a unique global index on that column, and partitioning will not make any difference: the index already takes you to the physical address of the specified row.
Also, dropping, truncating or exchanging a partition will invalidate the index unless you specify the UPDATE GLOBAL INDEXES clause.
If id were not the PK but something non-unique, like a product type or area code, then a query on just that column without any date would need to check every partition.
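For completeness, here is what that index-preserving maintenance statement looks like from Go with database/sql; a sketch only, assuming db is an already-open Oracle connection (driver registration and connection setup omitted):

package maintenance

import "database/sql"

// dropOldPartition drops a partition while keeping the global primary-key
// index on ID usable, via the UPDATE GLOBAL INDEXES clause.
func dropOldPartition(db *sql.DB) error {
	_, err := db.Exec(`ALTER TABLE ABC DROP PARTITION lessthan2018 UPDATE GLOBAL INDEXES`)
	return err
}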

Clustering order does not work with compound partition key

With the following table definition:
CREATE TABLE device_by_create_date (
year int,
comm_nr text,
created_at timestamp,
PRIMARY KEY ((year, comm_nr), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC)
Comm_nr is a unique identifier.
I would expect to see data ordered by the created_at column, which is not the case when I add data.
How can I issue select * from table; queries that return data ordered by the created_at column?
TL;DR: You need to create a new table.
Your partition key is (year, comm_nr). Your created_at key is ordered, but it is ordered WITHIN that partition key: a query like SELECT * FROM table WHERE year=x AND comm_nr=y; will come back ordered by created_at.
Additionally, if your key were year, comm_nr, created_at instead of (year, comm_nr), created_at, then even though your CREATE TABLE syntax only specified a clustering order for created_at, the table would be created WITH CLUSTERING ORDER BY (comm_nr DESC, created_at DESC). Data is sorted within SSTables by key from left to right.
The way to do this in true NoSQL fashion is to create a separate table whose key is instead year, created_at, comm_nr, as sketched below. You would write to both tables on user creation, but when you need to know who created their account first, you would query the new table.
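A sketch of that write path in Go with gocql follows. The second table's name (device_by_year) and its exact schema are assumptions for illustration:

package devices

import (
	"time"

	"github.com/gocql/gocql"
)

// Assumed query-side table:
//   CREATE TABLE device_by_year (
//       year       int,
//       created_at timestamp,
//       comm_nr    text,
//       PRIMARY KEY ((year), created_at, comm_nr)
//   ) WITH CLUSTERING ORDER BY (created_at DESC, comm_nr ASC);

// saveDevice writes each device to both tables in one logged batch so the
// two copies cannot drift apart.
func saveDevice(session *gocql.Session, year int, commNr string, createdAt time.Time) error {
	batch := session.NewBatch(gocql.LoggedBatch)
	batch.Query(`INSERT INTO device_by_create_date (year, comm_nr, created_at)
	             VALUES (?, ?, ?)`, year, commNr, createdAt)
	batch.Query(`INSERT INTO device_by_year (year, created_at, comm_nr)
	             VALUES (?, ?, ?)`, year, createdAt, commNr)
	return session.ExecuteBatch(batch)
}

With that in place, SELECT * FROM device_by_year WHERE year = 2020; returns rows already ordered by created_at DESC, with no ORDER BY clause needed.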

Cassandra - filter rows based on a range

Using Cassandra and Spark with DataStax's spark-cassandra-connector.
The spark-cassandra-connector documentation gives a filter example like this:
sc.cassandraTable("test", "cars").select("id", "model").where("color = ?", "black").toArray.foreach(println)
Basically it filters the color column by "black". However, can I filter rows based on a range? Say I want to filter the range column, which is a long type, for values falling between 100000 and 200000. Does CQL support such a range filter?
CQL supports range queries only on clustering columns. Range queries can be expressed, as in SQL, by using two bounding conditions on the same field; for instance, in spark-cassandra-connector you would write:
.where("my_long >= ? and my_long < ?", 1L, 100L)
This will work as long as the my_long column is the first clustering column. Clustering columns are the columns that follow the declaration of the partition columns in the primary key.
For instance, you can run range queries on my_long column if the primary key is declared as follows:
PRIMARY KEY (pk1, my_long)
PRIMARY KEY (pk1, my_long, pk3)
PRIMARY KEY ((pk1, pk2), my_long)
PRIMARY KEY ((pk1, pk2), my_long, pk4)
...
As you see, in all the preceding cases, my_long follows the declaration of partition key in the primary key.
If the column belongs to the clustering columns but it's not the first one, you have to provide an equality condition for all preceding columns.
For example:
PRIMARY KEY (pk1, pk2, my_long) --> .where("pk2=? and my_long>? and my_long<?")
PRIMARY KEY (pk1, pk2, pk3, my_long) --> .where("pk2=? and pk3=? and my_long>? and my_long<?")
PRIMARY KEY ((pk1, pk2), pk3, my_long) --> .where("pk3=? and my_long>? and my_long<?")
PRIMARY KEY ((pk1, pk2), pk3, my_long, pk5) --> .where("pk3=? and my_long>? and my_long<?")
Note: spark-cassandra-connector adds the "ALLOW FILTERING" clause to all queries by default. If you try to run the examples above in cqlsh, you have to add that clause manually.
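Outside Spark, the same query works from plain CQL when the partition key is pinned. A gocql sketch, where the table layout (PRIMARY KEY (pk1, my_long)) and all names are illustrative assumptions:

package cars

import "github.com/gocql/gocql"

// idsInRange fetches ids whose my_long value lies in [lo, hi). Because the
// partition key pk1 is fixed, the range on the first clustering column runs
// without ALLOW FILTERING.
func idsInRange(session *gocql.Session, pk string, lo, hi int64) ([]string, error) {
	iter := session.Query(`SELECT id FROM test.cars WHERE pk1 = ? AND my_long >= ? AND my_long < ?`,
		pk, lo, hi).Iter()
	var (
		id  string
		ids []string
	)
	for iter.Scan(&id) {
		ids = append(ids, id)
	}
	return ids, iter.Close()
}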

How to add composite primary keys?

I have a table with three columns, [Id, QTY, Date]. Out of these three, two columns [Id and Date] should be set as the primary key, because I need to fetch the records one by one from this table into a reference.
The data to be inserted into this table is:
101,10,NULL
101,20,201220
101,7,201440
102,5,null
102,8,201352
The date is in yyyyww format.
How do I define these two columns as a composite primary key when they contain null values and duplicates?
alter table abc add constraint pk primary key (ID, DATE);
When I try to alter the table, this error appears:
Error report:
SQL Error: ORA-01449: column contains NULL values; cannot alter to NOT NULL
01449. 00000 - "column contains NULL values; cannot alter to NOT NULL"
*Cause:
*Action:
Using a table-level constraint, you can use this query:
alter table your_table add constraint pkc_Name primary key (column1, column2)
but first you need to declare the columns NOT NULL. All parts of a primary key need to be NOT NULL.
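A sketch of that order of operations in Go with database/sql. Assumptions throughout: db is an open Oracle connection, the placeholder week 190001 is hypothetical (and only safe if each id has at most one NULL date), and "DATE" is quoted because DATE is an Oracle reserved word.

package migrate

import "database/sql"

// addCompositePK backfills NULL dates, locks both columns down to NOT NULL,
// and only then adds the composite primary key.
func addCompositePK(db *sql.DB) error {
	stmts := []string{
		`UPDATE abc SET "DATE" = 190001 WHERE "DATE" IS NULL`, // hypothetical sentinel week
		`ALTER TABLE abc MODIFY (id NOT NULL)`,
		`ALTER TABLE abc MODIFY ("DATE" NOT NULL)`,
		`ALTER TABLE abc ADD CONSTRAINT pk PRIMARY KEY (id, "DATE")`,
	}
	for _, s := range stmts {
		if _, err := db.Exec(s); err != nil {
			return err
		}
	}
	return nil
}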
Your table's ID column still contains NULLs and is non-unique; how is that possible? If it is the primary key of another table, try adding a surrogate key column to this table and making it the primary key.
In the case of a composite primary key, each row should have at least one non-null value in the combination of columns, and the combination of columns must be unique in all cases.
For further details, check http://docs.oracle.com/cd/B10500_01/server.920/a96524/c22integ.htm
Correction - if a composite primary key is made up of 3 columns, then no column (among the 3) can hold a NULL value, and the combination of those 3 columns must be unique.
E.g.
(1,2,2)
(1,2,1)
(2,2,1)
(1,2,2) - not valid (duplicate of the first row)

Hive ORDER BY messes up data

In Hive 0.8 with Hadoop 1.03 consider this table:
CREATE TABLE table (
key int,
date timestamp,
name string,
surname string,
height int,
weight int,
age int)
CLUSTERED BY(key) INTO 128 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Then I tried:
select *
from table
where key=xxx
order by date;
The result is sorted, but everything after the name column is wrong. In fact, all the rows have exactly the same values in those fields, and the surname column is missing. I also have a bitmap index on name and surname, and an index on key.
Is there something wrong with my query, or should I be looking into bugs in ORDER BY? (I can't find anything specific.)
It seems like there has been an error in loading the data into Hive. Make sure you don't have any special characters in your CSV file that might interfere with the insertion.
Also, you have clustered by the key property. Where does this key come from, the CSV or some other source? Are you sure that it is unique?
