How can I use the time series data model in Cassandra?

I'm new to Cassandra but I've seen Thrift examples earlier where I can model the columns as:
id | start_time | end_time | total_value | value + [timeStamp1] | value + [timeStamp2]...
Is it possible to do this with a single column family with CQL? I can see that I can make a composite key of (id, timestamp) and store the values against the timestamp, and repeat the event level metadata for each row as part of denormalization, but would that still be storing it in one big row?

Yes, you can do this in Cassandra with a single table. The idea is that you have a partition key (id) and a clustering key (timestamp). For the same partition key, all data is written into one big row ...
CREATE TABLE timeseries (id uuid, ts timestamp, info text, otherinfo text, PRIMARY KEY (id, ts));
In this example you can query all events for a specific id by time range.
SELECT * FROM timeseries where id=someid and ts > 0 and ts < 100;
For each id you will have a wide row containing the events. As for "repeating the event metadata as denormalization": if the other information does not change for a given id, you should declare those columns as static. Then, no matter how many events you have within a row, these columns will be present only once (it's a smart denormalization).
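To make the wide-row picture concrete, here is a minimal in-memory sketch in Python. It is a toy model, not the Cassandra driver or storage engine: the `Partition` class and the `insert` and `select_range` helpers are hypothetical names introduced only for illustration. It mimics PRIMARY KEY (id, ts) by grouping rows per id and keeping them sorted by ts, so a range query on ts touches a single partition.

```python
import bisect
from collections import defaultdict


class Partition:
    """Toy stand-in for one partition (all rows sharing one id)."""

    def __init__(self):
        self.ts_keys = []  # clustering keys, kept sorted
        self.rows = {}     # ts -> dict of regular columns

    def upsert(self, ts, **cols):
        # Writing the same (id, ts) twice updates the row instead of adding one.
        if ts not in self.rows:
            bisect.insort(self.ts_keys, ts)
            self.rows[ts] = {}
        self.rows[ts].update(cols)

    def range(self, lo, hi):
        """Rows with lo < ts < hi, in clustering order."""
        start = bisect.bisect_right(self.ts_keys, lo)
        end = bisect.bisect_left(self.ts_keys, hi)
        return [(ts, self.rows[ts]) for ts in self.ts_keys[start:end]]


table = defaultdict(Partition)  # partition key (id) -> Partition


def insert(pid, ts, **cols):
    table[pid].upsert(ts, **cols)


def select_range(pid, lo, hi):
    # Mirrors: SELECT * FROM timeseries WHERE id=? AND ts > lo AND ts < hi
    return table[pid].range(lo, hi)


# Made-up sample data, inserted out of order on purpose.
insert("a", 50, info="second")
insert("a", 10, info="first")
insert("a", 100, info="excluded")
insert("b", 20, info="other id")

result = select_range("a", 0, 100)
print(result)
```

The rows for id "a" come back ordered by ts even though they were inserted out of order, and the row at ts = 100 is excluded because both bounds are strict.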
HTH,
Carlo

Related

Create a generic DB table

I have multiple products, and each of them has its own Product table and Value table. Now I have to create a generic screen to validate those products, and I don't want to create a validated table for each product. I want to create a generic table that holds the details of all products, plus one extra column called ProductIdentifier. The problem is that this generic table may end up holding millions of records, and fetching the data will be slow.
Is there a better solution?
"Millions of records" sounds like a VLDB problem. I'd put the data into a partitioned table:
CREATE TABLE myproducts (
productIdentifier NUMBER,
value1 VARCHAR2(30),
value2 DATE
) PARTITION BY LIST (productIdentifier)
( PARTITION p1 VALUES (1),
PARTITION p2 VALUES (2),
PARTITION p5to9 VALUES (5,6,7,8,9)
);
For queries that are dealing with only one product, specify the partition:
SELECT * FROM myproducts PARTITION FOR (9);
For your general report, just omit the partition clause and you get all rows:
SELECT * FROM myproducts;
Documentation is here:
https://docs.oracle.com/en/database/oracle/oracle-database/12.2/vldbg/toc.htm
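As a rough illustration of what list partitioning buys you, here is a small Python sketch. It is an in-memory toy, not Oracle: `PARTITION_OF`, `insert`, `select_partition_for` and `select_all` are hypothetical names, and the sample rows are made up. Rows are routed to a bucket by productIdentifier, and a partition-specific query reads only that bucket.

```python
from collections import defaultdict

# Hypothetical mapping that mirrors the PARTITION BY LIST clause above.
PARTITION_OF = {1: "p1", 2: "p2", 5: "p5to9", 6: "p5to9",
                7: "p5to9", 8: "p5to9", 9: "p5to9"}

partitions = defaultdict(list)  # partition name -> rows


def insert(product_identifier, value1):
    # A value that no partition lists raises KeyError here, loosely
    # mimicking a list-partitioned table without a DEFAULT partition.
    name = PARTITION_OF[product_identifier]
    partitions[name].append((product_identifier, value1))


def select_partition_for(product_identifier):
    # Loosely mirrors SELECT * FROM myproducts PARTITION FOR (n):
    # only the one bucket that holds n is read.
    return list(partitions[PARTITION_OF[product_identifier]])


def select_all():
    # No partition given: every bucket is read.
    return [row for name in sorted(partitions) for row in partitions[name]]


insert(1, "a")
insert(9, "b")
insert(5, "c")

print(select_partition_for(9))
print(select_all())
```

Note that the bucket for 9 also holds products 5 to 8, so with a shared partition like p5to9 a single-product query still needs a WHERE productIdentifier = 9 filter on top of the partition clause.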

Clustering order does not work with compound partition key

With the following table definition:
CREATE TABLE device_by_create_date (
year int,
comm_nr text,
created_at timestamp,
PRIMARY KEY ((year, comm_nr), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);
Comm_nr is a unique identifier.
I would expect to see data ordered by created_at column, which is not the case when I add data.
How can I issue select * from table; queries that return data ordered by the created_at column?
TLDR: You need to create a new table.
Your partition key is (year, comm_nr). Your created_at key is ordered, but it is ordered WITHIN that partition key. A query such as SELECT * FROM table WHERE year=x AND comm_nr=y; will be ordered by created_at.
Additionally, if your key were year, comm_nr, created_at instead of (year, comm_nr), created_at, then comm_nr would be a clustering column as well, and rows within a partition would be sorted by comm_nr first and by created_at second. Data is sorted within SSTables by key from left to right.
The way to do this in true nosql fashion is to create a separate table where your key is instead year, created_at, comm_nr. You would write to both tables on user creation, but if you needed to know who created their account first, you would query the new table.
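The difference between the two key layouts can be sketched with a toy in-memory model in Python. This is not Cassandra code: the dictionaries stand in for tables, `device_by_year` is a hypothetical name for the new table, and the sample devices are made up.

```python
from collections import defaultdict
from datetime import datetime

# Made-up sample data: (year, comm_nr, created_at)
devices = [
    (2015, "dev-c", datetime(2015, 3, 1)),
    (2015, "dev-b", datetime(2015, 7, 1)),
    (2015, "dev-a", datetime(2015, 1, 1)),
]

# Original table: PRIMARY KEY ((year, comm_nr), created_at).
# comm_nr is unique, so every device lands in its own partition and
# the DESC clustering order only ever sorts a single row.
device_by_create_date = defaultdict(list)

# New table: PRIMARY KEY (year, created_at, comm_nr), created_at DESC.
# All devices of a year share one partition, sorted by created_at.
device_by_year = defaultdict(list)

# Write to both tables, as the answer suggests.
for year, comm_nr, created_at in devices:
    device_by_create_date[(year, comm_nr)].append(created_at)
    device_by_year[year].append((created_at, comm_nr))

for rows in device_by_create_date.values():
    rows.sort(reverse=True)
for rows in device_by_year.values():
    rows.sort(key=lambda r: r[0], reverse=True)  # newest first

# Mirrors: SELECT comm_nr FROM device_by_year WHERE year = 2015;
newest_first = [comm_nr for _, comm_nr in device_by_year[2015]]
print(newest_first)
```

Even with the new table, ordering is only guaranteed inside one partition, so the query should still restrict on year.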

cassandra order by non-primary key workaround

Cassandra database
It seems that there is no way to order on anything but primary keys.
I have two columns: ID and timestamp.
I only want ONE(!) row per ID, but I want to filter (basically same as sort?) my results based on timestamp.
I want to run this command:
SELECT id, timestamp FROM " + TableName + " WHERE timestamp < ? ALLOW FILTERING;
How can I do this while making sure I have only a single row per id (which is not possible if my primary key consists of both ID and timestamp)?
I have yet to test this method thoroughly, but I think I've found a workaround.
I will make timestamp part of the primary key:
CREATE TABLE IF NOT EXISTS " + TableName + " ( id blob, timestamp bigint, PRIMARY KEY ( id, timestamp ) );
But before calling UPDATE/PUT I will delete all rows
WHERE id = ?
so that the new row replaces the current timestamp for that ID. That way I should still be able to sort on timestamp while preserving one row per ID.

Accelerate SQLite Query

I'm currently learning SQLite (called by Python).
According to my previous question (Reorganising Data in SQLLIte), I want to store multiple time series (Training data) in my database.
I have defined the following tables:
CREATE TABLE VARLIST
(
    VarID INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE DATAPOINTS
(
    DataID INTEGER PRIMARY KEY,
    timeID INTEGER,
    VarID INTEGER,
    value REAL
);
CREATE TABLE TIMESTAMPS
(
    timeID INTEGER PRIMARY KEY AUTOINCREMENT,
    TRAININGS_ID INT,
    TRAINING_TIME_SECONDS FLOAT
);
VARLIST has 8 entries, TIMESTAMPS 1e5 entries and DATAPOINTS around 5e6.
When I now want to extract data for a given TrainingsID and VarID, I try it like:
SELECT
(SELECT TIMESTAMPS.TRAINING_TIME_SECONDS
FROM TIMESTAMPS
WHERE t.timeID = timeID) AS TRAINING_TIME_SECONDS,
(SELECT value
FROM DATAPOINTS
WHERE DATAPOINTS.timeID = t.timeID and DATAPOINTS.VarID = 2) as value
FROM
(SELECT timeID
FROM TIMESTAMPS
WHERE TRAININGS_ID = 96) as t;
The command EXPLAIN QUERY PLAN delivers:
0|0|0|SCAN TABLE TIMESTAMPS
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE TIMESTAMPS USING INTEGER PRIMARY KEY (rowid=?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 2
2|0|0|SCAN TABLE DATAPOINTS
This basically works.
But there are two problems:
Minor problem: if there is a timeID for which no data for the requested VarID is available, I get a line with the value None.
I would prefer this line to be skipped.
Big problem: the search is incredibly slow (approx 5 minutes using http://sqlitebrowser.org/).
How do I best improve the performance?
Are there better ways to formulate the SELECT command, or should I modify the database structure itself?
OK, based on the hints I got, I could accelerate the search dramatically by adding indexes:
CREATE INDEX IF NOT EXISTS DP_Index on DATAPOINTS (VarID,timeID,DataID);
CREATE INDEX IF NOT EXISTS TS_Index on TIMESTAMPS(TRAININGS_ID,timeID);
The EXPLAIN QUERY PLAN output now reads as:
0|0|0|SEARCH TABLE TIMESTAMPS USING COVERING INDEX TS_Index (TRAININGS_ID=?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE TIMESTAMPS USING INTEGER PRIMARY KEY (rowid=?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 2
2|0|0|SEARCH TABLE DATAPOINTS USING INDEX DP_Index (VarID=? AND timeID=?)
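For the minor problem (lines with None), one option is to replace the two correlated subqueries with an inner join, which drops timestamps that have no matching datapoint. Below is a minimal sketch with Python's sqlite3 module. It assumes the schema and indexes above; the sample rows are made up, and the speed of this form on the real data has not been measured here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TIMESTAMPS (
    timeID INTEGER PRIMARY KEY AUTOINCREMENT,
    TRAININGS_ID INT,
    TRAINING_TIME_SECONDS FLOAT
);
CREATE TABLE DATAPOINTS (
    DataID INTEGER PRIMARY KEY,
    timeID INTEGER,
    VarID INTEGER,
    value REAL
);
CREATE INDEX IF NOT EXISTS DP_Index ON DATAPOINTS (VarID, timeID, DataID);
CREATE INDEX IF NOT EXISTS TS_Index ON TIMESTAMPS (TRAININGS_ID, timeID);
""")

# Made-up sample rows: training 96 has three timestamps,
# but VarID 2 has a value for only two of them.
conn.executemany(
    "INSERT INTO TIMESTAMPS VALUES (?, ?, ?)",
    [(1, 96, 0.0), (2, 96, 1.0), (3, 96, 2.0), (4, 97, 0.0)],
)
conn.executemany(
    "INSERT INTO DATAPOINTS (timeID, VarID, value) VALUES (?, ?, ?)",
    [(1, 2, 10.5), (2, 1, 99.0), (3, 2, 11.5), (4, 2, 12.5)],
)

# Inner join: a timestamp without a datapoint for the requested VarID
# produces no output row at all.
query = """
SELECT t.TRAINING_TIME_SECONDS, d.value
FROM TIMESTAMPS AS t
JOIN DATAPOINTS AS d ON d.timeID = t.timeID
WHERE t.TRAININGS_ID = ? AND d.VarID = ?
ORDER BY t.timeID
"""
rows = conn.execute(query, (96, 2)).fetchall()
print(rows)
```

Passing TRAININGS_ID and VarID as parameters also avoids building the SQL string by hand in the calling Python code.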
Thanks for your comments.

How to store weather station Details along with monitoring data efficiently?

I was following the time series data modelling article on PlanetCassandra by Patrick McFadin, and I have one question about it:
If I need to store the weather station name also, should it be in the same table, say:
create table test (wea_id int, wea_name text, wea_add text, eventday timeuuid, eventtime timeuuid, temp int, PRIMARY KEY ((wea_id, eventday), eventtime) );
This forces me to enter wea_name and wea_add for every new row. How do I identify that a new row has been created? Or is there a better mechanism for modeling this data?
Regards,
Seenu.
I'm assuming you're referring to the article on getting started with time series data modeling at http://planetcassandra.org/getting-started-with-time-series-data-modeling/
The original CQL listed is:
CREATE TABLE temperature_by_day (
weatherstation_id text,
date text,
event_time timestamp,
temperature text,
PRIMARY KEY ((weatherstation_id,date),event_time)
);
If you need to add an attribute that's associated with the partition key, in this case (weatherstation_id,date), Cassandra has a feature that does just that: static columns.
http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/refStaticCol.html
So you could write the statement like this:
CREATE TABLE temperature_by_day (
weatherstation_id text,
weatherstation_name text STATIC,
date text,
event_time timestamp,
temperature text,
PRIMARY KEY ((weatherstation_id,date),event_time)
);
You would be storing the name once per (weatherstation_id,date) combination rather than for every observation.
Ideally you'd like to store a name once per weather station. Using this choice of partition key you can't do this; you could model with one device per partition as per Patrick's first example if you like:
CREATE TABLE temperature (
weatherstation_id text,
weatherstation_name text STATIC,
event_time timestamp,
temperature text,
PRIMARY KEY ((weatherstation_id),event_time)
);
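To see how the choice of partition key affects how often the static name is stored, here is a toy in-memory sketch in Python. It is not Cassandra code: the `Partition` class, the `write` helper, the station id and the name "Station One" are all made up for illustration.

```python
from collections import defaultdict


class Partition:
    """Toy partition: static columns live once per partition, rows once per event."""

    def __init__(self):
        self.static = {}
        self.rows = {}


def write(table, partition_key, event_time, temperature, name):
    part = table[partition_key]
    part.static["weatherstation_name"] = name  # overwritten, never duplicated
    part.rows[event_time] = {"temperature": temperature}


# Made-up observations: (station id, date, event_time, temperature)
observations = [
    ("1234ABCD", "2013-04-03", "07:01:00", "72F"),
    ("1234ABCD", "2013-04-03", "07:02:00", "73F"),
    ("1234ABCD", "2013-04-04", "07:01:00", "70F"),
]

by_day = defaultdict(Partition)      # PRIMARY KEY ((weatherstation_id,date),event_time)
by_station = defaultdict(Partition)  # PRIMARY KEY ((weatherstation_id),event_time)

for station, date, event_time, temp in observations:
    write(by_day, (station, date), event_time, temp, "Station One")
    write(by_station, station, f"{date} {event_time}", temp, "Station One")

name_copies_by_day = len(by_day)          # one copy per (station, date)
name_copies_by_station = len(by_station)  # one copy per station
print(name_copies_by_day, name_copies_by_station)
```

With three observations spread over two days, the name is kept twice under the per-day key and once under the per-station key, while the observations themselves stay at one row each.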
