I want to move away from MySQL for storing visitors in a social media application, and thought Cassandra would be a good fit for that.
Suppose the following table:
CREATE TABLE visitors (
visiteduserid bigint,
visitinguserid bigint,
visitdate timestamp,
PRIMARY KEY (visiteduserid,visitinguserid)
);
I want to get the latest 20 unique visitinguserid values for the current visiteduserid, but the following query fails:
SELECT visitinguserid FROM visitors WHERE visiteduserid=1 ORDER BY visitdate DESC LIMIT 20
Bad Request: Order by is currently only supported on the clustered columns of the PRIMARY KEY, got visitdate
Am I right in assuming I can't just add visitdate to the primary key, since I only want to keep the latest visitdate for a single user/user combination? Ideally they would be sorted by visitdate descending, as mentioned in the "Twitter Clone" presentation...
Any help out there?
You were close. Try this and see how it works for you
CREATE TABLE visitors (
visiteduserid bigint,
visitdate timestamp,
visitinguserid bigint,
PRIMARY KEY (visiteduserid,visitdate)
) WITH CLUSTERING ORDER BY (visitdate DESC);
Then do this:
SELECT visitinguserid FROM visitors WHERE visiteduserid=1 LIMIT 20
That should do what you are looking for. Relying on the comparator (the clustering order) to store rows in the order you want at insert time is much more efficient than sorting at read time.
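For illustration, a quick hedged sketch of how that plays out (the user ids and timestamps below are made up): rows within a partition are stored in clustering order, so the SELECT comes back newest-first without any ORDER BY.
INSERT INTO visitors (visiteduserid, visitdate, visitinguserid)
VALUES (1, '2014-01-10 12:00:00', 42);
INSERT INTO visitors (visiteduserid, visitdate, visitinguserid)
VALUES (1, '2014-01-11 08:30:00', 99);

-- Returns 99 first, then 42, because visitdate clusters in DESC order.
SELECT visitinguserid FROM visitors WHERE visiteduserid = 1 LIMIT 20;
Note that with this key a repeat visit by the same visitinguserid shows up as a separate row, so if you need strictly unique visitors in the 20 results, that de-duplication would have to happen client-side.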
Related
I have a data modeling question for cases where data needs to be sorted by keys which can be modified.
So, say we have a user table:
{
dept_id text,
user_id text,
user_name text,
mod_date timestamp,
PRIMARY KEY (dept_id,user_id)
}
Now I can query cassandra to get all users by a dept_id.
What if I wanted to query all users in a dept, sorted by mod_date?
So, one way would be:
{
dept_id text,
mod_date timestamp,
user_id text,
user_name text,
PRIMARY KEY (dept_id, mod_date,user_id)
}
But mod_date changes every time the user name is updated, so it can't be part of the clustering key.
Attempt 1:
Don't update the row; instead create a new record for every update.
So, say the record for user foo is like below:
{'dept_id1','TimeStamp1','user_id1','foo'}
and then the name was changed to 'bar' and then to 'baz'.
In that case we add another row to the table, so it would look like:
{'dept_id1','TimeStamp3','user_id1','baz'}
{'dept_id1','TimeStamp2','user_id1','bar'}
{'dept_id1','TimeStamp1','user_id1','foo'}
Now we can get all users in a dept sorted by mod_date, but it presents a different problem: the data returned is duplicated.
Attempt 2 :
Add another column to identify the head record, much like a linked list:
{
dept_id text,
mod_date timestamp,
user_id text,
user_name text,
next_record text,
PRIMARY KEY (dept_id,mod_date,user_id)
}
Every time an update happens, a new row is added with next_record = 'HEAD', and the previous head row's next_record is set to the PK of the new record.
{'dept_id1','TimeStamp3','user_id1','baz','HEAD'}
{'dept_id1','TimeStamp2','user_id1','bar','dept_id1#TimeStamp3'}
{'dept_id1','TimeStamp1','user_id1','foo','dept_id1#TimeStamp2'}
We also add a secondary index on the 'next_record' column.
Now I can support getting all users in a dept, sorted by mod_date, with:
select * from USERS where dept_id=':dept' AND next_record='HEAD' order by mod_date
But it looks like a fairly involved solution, and perhaps I am missing something simpler.
The other option is delete-and-insert, but for high-frequency changes I think Cassandra has issues with tombstones.
Suggestions/Feedback are welcome.
Thanks !
As I see it, the simplest way is to sort users on the application (client code) side. You use dept as the partition key, which means all users in one dept are handled by one Cassandra node; there should not be that many users in one dept, so they can be sorted on the application side quickly enough.
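A minimal sketch of that approach, keeping the original (dept_id, user_id) key from the question (the dept value 'dept_id1' is just an example):
-- mod_date stays a regular (non-key) column, so it can be updated freely.
SELECT user_id, user_name, mod_date
FROM users
WHERE dept_id = 'dept_id1';
-- Then sort the returned rows by mod_date in the application before displaying them.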
Consider the following table
CREATE TABLE COMPANY(
ID BIGINT PRIMARY KEY NOT NULL,
NAME TEXT NOT NULL,
AGE INT NOT NULL,
ADDRESS CHAR(50),
SALARY REAL
);
If we have 100 million rows of random data in this table:
Select age from company where id=2855265
This executes in less than a millisecond.
Select age from company where id<353
This returns fewer than 50 rows and also executes in less than a millisecond.
Both queries use the index.
But the following query uses a full table scan and takes 3 seconds:
Select age from company where id<2855265
It returns fewer than 500 rows.
How can I speed up a query that selects on the primary key being less than a variable?
Performance
The predicate id < 2855265 potentially returns a large percentage of rows in the table. Unless Postgres has information in table statistics to expect only around 500 rows, it might switch from an index scan to a bitmap index scan or even a sequential scan. Explanation:
Postgres not using index when index scan is much better option
We would need to see the output from EXPLAIN (ANALYZE, BUFFERS) for your queries.
When you repeat the query, do you get the same performance? There may be caching effects.
Either way, 3 seconds is way too slow for 500 rows. Postgres might be working with outdated or inexact table statistics. Or there may be issues with your server configuration (not enough resources). Or there can be several other, less common reasons, including hardware issues ...
If VACUUM ANALYZE did not help, VACUUM FULL ANALYZE might. It effectively rewrites the whole table and all indexes in pristine condition. Takes an exclusive lock on the table and might conflict with concurrent access!
I would also consider increasing the statistics target for the id column. Instructions:
Keep PostgreSQL from sometimes choosing a bad query plan
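As a concrete sketch of those steps against the table from the question (the statistics target of 1000 is just an example value):
-- Show the actual plan and buffer usage for the slow query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT age FROM company WHERE id < 2855265;

-- Refresh statistics; the FULL variant also rewrites the table and indexes,
-- but takes an exclusive lock.
VACUUM ANALYZE company;
-- VACUUM FULL ANALYZE company;

-- Gather more detailed statistics for the id column, then re-analyze.
ALTER TABLE company ALTER COLUMN id SET STATISTICS 1000;
ANALYZE company;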
Table definition?
Whatever else you do, there seem to be various problems with your table definition:
CREATE TABLE COMPANY(
ID BIGINT PRIMARY KEY NOT NULL, -- int is probably enough. "id" is a terrible column name
NAME TEXT NOT NULL, -- "name" is a terrible column name
AGE INT NOT NULL, -- typically bad idea to store age, store birthday instead
ADDRESS CHAR(50), -- never use char(n)!
SALARY REAL -- why would a company have a salary? never store money as real
);
You probably want something like this instead:
CREATE TABLE employee(
employee_id serial PRIMARY KEY
, company_id int NOT NULL -- REFERENCES company(company_id)?
, birthday date NOT NULL
, employee_name text NOT NULL
, address varchar(50) -- or just text
, salary int -- store amount as *Cents*
);
Related:
How to implement a many-to-many relationship in PostgreSQL?
Any downsides of using data type "text" for storing strings?
You will need to perform a VACUUM ANALYZE company; to update the planner statistics.
I've tried to find my situation on Google, but nobody talks about it.
I have a table that is going to be partitioned on 2 columns.
Can anyone show an example of the interval clause for a 2-column partition key?
In this case I have only one.
For this example, how do I use an interval with 2 columns?
INTERVAL( NUMTODSINTERVAL(1,'DAY'))
My table:
create table TABLE_TEST
(
PROCESS_DATE DATE GENERATED ALWAYS AS (TO_DATE(SUBSTR("CHARGE_DATE_TIME",1,10),'yyyymmdd')),
PROCESS_HOUR VARCHAR(10) GENERATED ALWAYS AS (SUBSTR("CHARGE_DATE_TIME",12,2)),
ANUM varchar(100),
SWTICH_DATE_TIME varchar(100),
CHARGE_DATE_TIME varchar(100),
CHARGE varchar(100)
)
TABLESPACE TB_LARGE_TAB
PARTITION BY RANGE (PROCESS_DATE, PROCESS_HOUR)
INTERVAL( NUMTODSINTERVAL(1,'DAY'))
Many Thanks,
Macieira
You can't use an interval if your range has more than one column; you'd get: ORA-14750: Range partitioned table with INTERVAL clause has more than one column. From the documentation:
You can specify only one partitioning key column, and it must be of NUMBER, DATE, FLOAT, or TIMESTAMP data type.
I'm not sure why you're splitting the date and hour out into separate columns (since a date has a time component anyway), or why you're storing the 'real' date and number values as strings; it would be much simpler to just have columns with the correct data types in the first place. But assuming you are set on storing the data that way and need the separate process_date and process_hour columns as you have them, you can add a third virtual column that combines them:
create table TABLE_TEST
(
PROCESS_DATE DATE GENERATED ALWAYS AS (TO_DATE(SUBSTR(CHARGE_DATE_TIME,1,10),'YYYYMMDD')),
PROCESS_HOUR VARCHAR2(8) GENERATED ALWAYS AS (SUBSTR(CHARGE_DATE_TIME,12,2)),
PROCESS_DATE_HOUR DATE GENERATED ALWAYS AS (TO_DATE(CHARGE_DATE_TIME, 'YYYYMMDDHH24')),
ANUM VARCHAR2(100),
SWTICH_DATE_TIME VARCHAR2(100),
CHARGE_DATE_TIME VARCHAR2(100),
CHARGE VARCHAR2(100)
)
PARTITION BY RANGE (PROCESS_DATE_HOUR)
INTERVAL (NUMTODSINTERVAL(1,'DAY'))
(
PARTITION TEST_PART_0 VALUES LESS THAN (DATE '1970-01-01')
);
Table table_test created.
I've also changed your string data types to varchar2 and added a made-up initial partition. process_hour probably wants to be a number type, depending on how you'll use it. As I don't know why you're choosing your current data types it's hard to tell what would really be more appropriate.
I don't really understand why you'd want the partition range to be hourly and the interval to be one day though, unless you want the partitions to be from, say, midday to midday; in which case the initial partition (test_part_0) would have to specify that time, and your range specification is still wrong for that.
Interval partitioning can be built on only one column.
In your case you already have a proper partition key column - CHARGE_DATE_TIME. Why do you create the virtual columns as VARCHAR2, and why do you need to build the partition key on them? Interval partitioning can only be built on NUMBER or DATE columns.
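For example, if charge_date_time could be stored as a real DATE rather than a string, interval partitioning would work on it directly - a sketch only, not a drop-in replacement for your existing loader:
CREATE TABLE table_test
(
  anum             VARCHAR2(100),
  swtich_date_time VARCHAR2(100),
  charge_date_time DATE,            -- a real DATE instead of a 'yyyymmddhh24' string
  charge           VARCHAR2(100)
)
PARTITION BY RANGE (charge_date_time)
INTERVAL (NUMTODSINTERVAL(1,'DAY'))
(
  PARTITION test_part_0 VALUES LESS THAN (DATE '1970-01-01')
);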
I am a Cassandra newbie trying to see how I can model our current SQL data in Cassandra. The database stores document metadata including document_id, last_modified_time and size_in_bytes, among a host of other data. The number of documents can be arbitrarily large, hence we are looking for a scalable solution for storage and query.
There is a requirement for 2 range queries:
select all docs where last_modified_time >= x and last_modified_time <= y
select all docs where size >= x and size <= y
And also a set of queries where docs need to be grouped by specific metadata, e.g.
select all docs where user in (x,y,z)
What is the best practice of designing the data model based on these queries?
My initial thought is to have a table (in Cassandra 2.0, CQL 3.0) with the last_mod_time as the secondary index as follows
create table t_document (
document_id bigint,
last_mod_time bigint ,
size bigint,
user text,
....
primary key (document_id, last_mod_time)
)
This should take care of query 1.
Do I need to create another table with the primary key as (document_id, size) for query 2? Or can I just add size as the third item in the primary key of the same table, e.g. (document_id, last_mod_time, size)? But in that case, will the second query work without using last_mod_time in the WHERE clause?
For query 3, which is all docs for one or more users, is it best practice to create a t_user_doc table where the primary key is (user, doc_id)? Or is a better approach to create a secondary index on user on the same t_document table?
Thanks for any help.
When it comes to inequalities, you don't have many choices in Cassandra. They must be leading clustering columns (or secondary indexes). So a data model might look like this:
CREATE TABLE docs_by_time (
dummy int,
last_modified_time timestamp,
document_id bigint,
size_in_bytes bigint,
PRIMARY KEY ((dummy),last_modified_time,document_id));
The "dummy" column is always set to the same value, and is sued as a placeholder partition key, with all data stored in a single partition.
The drawback to such a data model is that, indeed, all data is stored in a single partition. There is the maximum of 2 billion cells per partition, but more importantly, a single partition never spans nodes. So this approach doesn't scale.
You could create secondary indexes on a table:
CREATE TABLE docs (
document_id bigint,
last_modified_time timestamp,
size_in_bytes bigint,
PRIMARY KEY (document_id));
CREATE INDEX docs_last_modified ON docs(last_modified_time);
However secondary indexes have important drawbacks (http://www.slideshare.net/edanuff/indexing-in-cassandra), and aren't recommended for data with high cardinality. You could mitigate the cardinality issue somewhat by reducing precision on last_modified_time by, say, only storing the day component.
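A sketch of that reduced-precision variant - the last_modified_day column name is made up here, and you would populate it from last_modified_time when writing each row:
ALTER TABLE docs ADD last_modified_day text;   -- e.g. '2014-03-21', far lower cardinality than the full timestamp
CREATE INDEX docs_last_modified_day ON docs(last_modified_day);

-- Equality lookup on the lower-cardinality indexed column:
SELECT document_id FROM docs WHERE last_modified_day = '2014-03-21';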
I am in the middle of designing a table which include two columns valid_from and valid_to to track historical changes. For example, my table structure is like below:
create table currency_data
(
currency_code varchar(16) not null,
currency_desc varchar(16) not null,
valid_from date not null,
valid_to date,
d_insert_date date,
d_last_update date,
constraint pk_currency_data primary key (currency_code, valid_from)
)
The idea is to leave valid_to blank to start with, and if the currency_desc changes in the future, set valid_to to the date on which the old description stops being valid and create a new row with a new valid_from. But how can I ensure that there will never be an overlap between these 2 rows? For example, the query below should only ever yield one row:
select currency_desc
from currency_data
where currency_code = 'USD'
and trunc(sysdate) between valid_from and nvl(valid_to, sysdate)
Is there a better way to achieve this, other than making sure all developers/end users are aware of this rule? Many thanks.
There is a set of implementation approaches known as slowly changing dimensions (SCD) for handling this kind of storage.
What you are currently implementing is SCD Type II; however, there are other types.
Regarding your possible interval overlap issue - there is no simple way to enforce table-level (instead of row-level) consistency with standard constraints, so I guess a robust approach would be to restrict direct DML on this table and wrap it in a standardized PL/SQL API which enforces your rules prior to insert/update and which every developer will use.
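A minimal sketch of what such an API could look like for the currency_data table above - the procedure name and the exact date arithmetic are assumptions, not a finished design:
CREATE OR REPLACE PROCEDURE set_currency_desc (
    p_currency_code IN currency_data.currency_code%TYPE,
    p_currency_desc IN currency_data.currency_desc%TYPE
) AS
BEGIN
    -- Close the currently open version, if any, as of yesterday.
    UPDATE currency_data
       SET valid_to      = TRUNC(SYSDATE) - 1,
           d_last_update = SYSDATE
     WHERE currency_code = p_currency_code
       AND valid_to IS NULL;

    -- Open the new version starting today; the primary key on
    -- (currency_code, valid_from) stops two versions starting the same day.
    INSERT INTO currency_data
        (currency_code, currency_desc, valid_from, valid_to, d_insert_date, d_last_update)
    VALUES
        (p_currency_code, p_currency_desc, TRUNC(SYSDATE), NULL, SYSDATE, SYSDATE);
END set_currency_desc;
/
With all writes funnelled through a procedure like this, your lookup query can only ever match the single open version for a given date. Repeated changes on the same day would still need extra handling (the primary key would reject a second version with the same valid_from), but the basic shape is there.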