cassandra order by non-primary key workaround - sorting

Cassandra database
It seems that there is no way to order on anything but primary keys.
I have two columns: ID and timestamp.
I want only ONE(!) row per ID, but I want to filter (basically the same as sorting?) my results by timestamp.
I want to run this command:
SELECT id, timestamp FROM " + TableName + " WHERE timestamp < ? ALLOW FILTERING;
How can I do this while making sure I have only a single row per ID (which is not possible if my primary key consists of both ID and timestamp)?

I have yet to test this method thoroughly, but I think I've found a workaround.
I will make timestamp part of the primary key:
CREATE TABLE IF NOT EXISTS " + TableName + " ( id blob, timestamp bigint, PRIMARY KEY ( id, timestamp ) );
But before each UPDATE/PUT I will delete all rows
WHERE id = ?
so that the new row replaces the current timestamp for that ID. That way I should still be able to sort on timestamp while keeping one row per ID.
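The delete-then-insert idea above can be sketched as follows. This is only an illustration of the logic: SQLite (via Python's sqlite3 module) stands in for Cassandra, and the table name latest and the helper put are hypothetical names, not part of the original post.

```python
import sqlite3

# SQLite is only a stand-in here; with Cassandra the same two statements
# would be sent through a driver instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE latest (id BLOB, timestamp INTEGER, PRIMARY KEY (id, timestamp))"
)

def put(conn, row_id, ts):
    """Replace whatever rows exist for row_id with a single (row_id, ts) row."""
    with conn:  # commit both statements together
        conn.execute("DELETE FROM latest WHERE id = ?", (row_id,))
        conn.execute(
            "INSERT INTO latest (id, timestamp) VALUES (?, ?)", (row_id, ts)
        )

put(conn, b"a", 100)
put(conn, b"b", 200)
put(conn, b"a", 300)  # replaces the earlier row for b"a"

# Filter on timestamp; each id appears at most once.
older_than_250 = conn.execute(
    "SELECT id, timestamp FROM latest WHERE timestamp < ? ORDER BY timestamp",
    (250,),
).fetchall()
all_rows = conn.execute("SELECT id, timestamp FROM latest ORDER BY id").fetchall()
```

After the three calls there is one row per id, and only the row for b"b" has a timestamp below 250.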

Related

fast comparison a list with itself

I have a giant list (100k entries) in my database. Each entry contains an id, a text and a date.
I wrote a function that compares two texts for similarity. How it works is not important right now.
Is there a "good" way to remove (near) "duplicates" from the list by text?
Currently I loop through the list twice and compare each entry with every other entry, skipping the entry itself by id.
If the goal is to prevent duplicates when a row is inserted into the table, you can add a unique constraint on the text column. Note that this only rejects exact duplicates.
Postgresql
CREATE TABLE table1 (
id serial PRIMARY KEY,
txt VARCHAR (50),
dt timestamp,
UNIQUE(txt)
);
Oracle
CREATE TABLE table1
( id numeric(10) NOT NULL,
txt varchar2(50) NOT NULL,
dt timestamp,
CONSTRAINT txt_unique UNIQUE (txt)
);
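As a rough sketch of how such a constraint behaves at insert time, the following uses SQLite through Python's sqlite3 module as a stand-in for PostgreSQL or Oracle. The helper insert_entry and the sample texts are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (id INTEGER PRIMARY KEY, txt TEXT UNIQUE, dt TEXT)"
)

def insert_entry(conn, txt, dt):
    """Return True if the row was stored, False if txt already exists."""
    try:
        with conn:
            conn.execute("INSERT INTO table1 (txt, dt) VALUES (?, ?)", (txt, dt))
        return True
    except sqlite3.IntegrityError:
        # Raised when the UNIQUE constraint on txt is violated.
        return False

results = [
    insert_entry(conn, "hello world", "2015-08-26"),
    insert_entry(conn, "hello world", "2015-08-27"),   # exact duplicate: rejected
    insert_entry(conn, "hello  world", "2015-08-27"),  # near duplicate: accepted
]
row_count = conn.execute("SELECT COUNT(*) FROM table1").fetchone()[0]
```

The third insert succeeds because the constraint compares texts byte for byte, so a fuzzy comparison function would still be needed for near duplicates.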

Clustering order does not work with compound partition key

With the following table definition:
CREATE TABLE device_by_create_date (
year int,
comm_nr text,
created_at timestamp,
PRIMARY KEY ((year, comm_nr), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC)
Comm_nr is a unique identifier.
I would expect to see data ordered by created_at column, which is not the case when I add data.
How can I issue select * from table; queries that return data ordered by the created_at column?
TLDR: You need to create a new table.
Your partition key is (year, comm_nr). Your created_at clustering key is ordered, but only WITHIN each partition. A query such as SELECT * FROM table WHERE year=x AND comm_nr=y; will return rows ordered by created_at.
Additionally, if instead of (year, comm_nr), created_at your key were year, comm_nr, created_at, then even if your CREATE TABLE statement only specified a clustering order for created_at, the table would be created WITH CLUSTERING ORDER BY (comm_nr DESC, created_at DESC). Data is sorted within SSTables by key from left to right.
The way to do this in true NoSQL fashion is to create a separate table whose key is year, created_at, comm_nr. You would write to both tables on user creation, and when you need to know who created their account first you would query the new table.
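To illustrate why ordering only holds within a partition, here is a small plain-Python model. It does not use Cassandra at all: build_table is a hypothetical helper that groups rows by partition key and sorts each group by clustering key in descending order, and the sample rows are made up.

```python
from collections import defaultdict

def build_table(rows, partition_key, clustering_key):
    """Group rows by partition key; sort each partition by clustering key, descending."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_key(row)].append(row)
    for part in partitions.values():
        part.sort(key=clustering_key, reverse=True)
    return dict(partitions)

# Hypothetical sample data: (year, comm_nr, created_at)
rows = [
    (2016, "A", "2016-03-01"),
    (2016, "B", "2016-01-15"),
    (2016, "C", "2016-02-10"),
]

# Original table: PRIMARY KEY ((year, comm_nr), created_at)
# Every device lands in its own partition, so there is nothing to order.
by_device = build_table(rows, lambda r: (r[0], r[1]), lambda r: r[2])

# Second table: PRIMARY KEY (year, created_at, comm_nr)
# All devices of a year share one partition, sorted by created_at.
by_year = build_table(rows, lambda r: r[0], lambda r: (r[2], r[1]))

newest_first = [r[1] for r in by_year[2016]]
```

In this model the first layout produces three single-row partitions, while the second yields the devices of 2016 newest first.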

Expensive subquery tuning with SQLite

I'm working on a small media/file management utility using SQLite for its persistent storage needs. I have a table of files:
CREATE TABLE file
( file_id INTEGER PRIMARY KEY AUTOINCREMENT
, file_sha1 BINARY(20)
, file_name TEXT NOT NULL UNIQUE
, file_size INTEGER NOT NULL
, file_mime TEXT NOT NULL
, file_add_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
);
And also a table of albums
CREATE TABLE album
( album_id INTEGER PRIMARY KEY AUTOINCREMENT
, album_name TEXT
, album_poster INTEGER
, album_created TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
, FOREIGN KEY (album_poster) REFERENCES file(file_id)
);
to which files can be assigned
CREATE TABLE album_file
( album_id INTEGER NOT NULL
, file_id INTEGER NOT NULL
, PRIMARY KEY (album_id, file_id)
, FOREIGN KEY (album_id) REFERENCES album(album_id)
, FOREIGN KEY (file_id) REFERENCES file(file_id)
);
CREATE INDEX file_to_album ON album_file(file_id, album_id);
Part of the functionality is to list albums, exposing
the album id,
the album's name,
a poster image for that album and
the number of files in the album
which currently uses this query:
SELECT a.album_id, a.album_name,
COALESCE(
a.album_poster,
(SELECT file_id FROM file
NATURAL JOIN album_file af
WHERE af.album_id = a.album_id
ORDER BY file.file_name LIMIT 1)),
(SELECT COUNT(file_id) AS file_count
FROM album_file WHERE album_id = a.album_id)
FROM album a
ORDER BY album_name ASC
The only "tricky" part of that query is that the album_poster column may be NULL, in which case the COALESCE expression falls back to the first file in the album (by file name) as the "default poster".
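The fallback behaviour can be checked on a tiny in-memory database. The sketch below uses Python's sqlite3 module with a trimmed-down version of the schema (only the columns the query touches) and made-up sample rows; it illustrates what the query returns, not how to make it faster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file (file_id INTEGER PRIMARY KEY, file_name TEXT NOT NULL UNIQUE);
CREATE TABLE album (album_id INTEGER PRIMARY KEY, album_name TEXT, album_poster INTEGER);
CREATE TABLE album_file (album_id INTEGER NOT NULL, file_id INTEGER NOT NULL,
                         PRIMARY KEY (album_id, file_id));
INSERT INTO file VALUES (1, 'c.jpg'), (2, 'a.jpg'), (3, 'b.jpg');
-- 'Holiday' has no explicit poster, 'Work' uses file 3.
INSERT INTO album VALUES (1, 'Holiday', NULL), (2, 'Work', 3);
INSERT INTO album_file VALUES (1, 1), (1, 2), (2, 1), (2, 3);
""")

albums = conn.execute("""
SELECT a.album_id, a.album_name,
       COALESCE(a.album_poster,
                (SELECT file_id FROM file
                 NATURAL JOIN album_file af
                 WHERE af.album_id = a.album_id
                 ORDER BY file.file_name LIMIT 1)),
       (SELECT COUNT(file_id) FROM album_file WHERE album_id = a.album_id)
FROM album a
ORDER BY album_name ASC
""").fetchall()
```

For 'Holiday' the poster falls back to file 2 (a.jpg sorts first), while 'Work' keeps its explicit poster, file 3.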
With currently ~260000 files, ~2600 albums and ~250000 entries in the album_file table, this query takes over 10 seconds which makes for a not-so-great user experience. Here's the query plan:
0|0|0|SCAN TABLE album AS a
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|1|SEARCH TABLE album_file AS af USING COVERING INDEX album_to_file (album_id=?)
1|1|0|SEARCH TABLE file USING INTEGER PRIMARY KEY (rowid=?)
1|0|0|USE TEMP B-TREE FOR ORDER BY
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 2
2|0|0|SEARCH TABLE album_file USING COVERING INDEX album_to_file (album_id=?)
Replacing the COALESCE statement with just a.album_poster, sacrificing the auto-poster functionality, brings the query time down to a few milliseconds:
0|0|0|SCAN TABLE album AS a
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE album_file USING COVERING INDEX album_to_file (album_id=?)
0|0|0|USE TEMP B-TREE FOR ORDER BY
What I don't understand is that limiting the album listing to 1 or 1000 rows makes no difference. It seems SQLite is doing the expensive sub-query for the "default" poster on all albums, only to throw away most of the results when finally cutting down the result set to the LIMITs specified with the query.
Is there something I can do to make the original query substantially faster, especially given that I'm usually only querying a small subset (using LIMIT) of all rows for display?

How to automatically get the current date and time in a column using HIVE

I have two columns in my Hive table.
For example:
c1 : name
c2 : age
Now while creating a table I want to declare two more columns which automatically give me the current date and time when the row is loaded.
eg: John 24 26/08/2015 11:15
How can this be done?
Hive currently does not support default values in column definitions when creating a table. Please refer to the link for the complete Hive create table syntax:
Hive Create Table specification
An alternative workaround is to load the data into a temporary table first and then use an insert overwrite table statement to add the current date and time while copying it into the main table.
Below example may help:
1. Create a temporary table
create table EmpInfoTmp(name string, age int);
2. Insert data using a file or existing table into the EmpInfoTmp table:
name|age
Alan|28
Sue|32
Martha|26
3. Create a table which will contain your final data:
create table EmpInfo(name string, age tinyint, createDate string, createTime string);
4. Insert data from the temporary table and with that also add the columns with default value as current date and time:
insert overwrite table EmpInfo select name, age, FROM_UNIXTIME( UNIX_TIMESTAMP(), 'dd/MM/yyyy' ), FROM_UNIXTIME( UNIX_TIMESTAMP(), 'HH:mm' ) from EmpInfoTmp;
5. End result would be like this:
name|age|createdate|createtime
Alan|28|26/08/2015|03:56
Martha|26|26/08/2015|03:56
Sue|32|26/08/2015|03:56
Please note that the creation date and time are only accurate if you move the data into your final table as soon as it arrives in the temporary table.
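The staging step can be modelled outside Hive as well. The following plain-Python sketch mirrors what the insert overwrite statement does: one load timestamp is formatted as a date and a time and appended to every staged row. The function stamp_rows is a hypothetical name, and the fixed datetime only reproduces the sample output; a real load would pass the current time.

```python
from datetime import datetime

def stamp_rows(rows, loaded_at):
    """Append date ('dd/MM/yyyy') and time ('HH:mm') columns to each staged row."""
    create_date = loaded_at.strftime("%d/%m/%Y")
    create_time = loaded_at.strftime("%H:%M")
    return [(name, age, create_date, create_time) for name, age in rows]

# Rows as they would sit in the temporary table.
staged = [("Alan", 28), ("Sue", 32), ("Martha", 26)]

# Fixed timestamp for illustration; use datetime.now() for a real load.
final = stamp_rows(staged, datetime(2015, 8, 26, 3, 56))
```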
Note: You can't set more than one column to default to CURRENT_TIMESTAMP.
This way, you can set CURRENT_TIMESTAMP as the default for one column.
SQL (MySQL syntax):
CREATE TABLE IF NOT EXISTS `hive` (
`id` int(11) NOT NULL,
`name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`age` int(11) DEFAULT '0',
`datecreated` timestamp NULL DEFAULT CURRENT_TIMESTAMP
);
I found a way to do it using a shell script.
Here's how:
echo "$(date +"%Y-%m-%d-%T") $(wc -l /home/hive/landing/$line ) $dir " >> /home/hive/recon/fileinfo.txt
This gives me the date without spaces. At the end I upload the text file to my Hive table.

How can I use the time series data model in Cassandra?

I'm new to Cassandra but I've seen Thrift examples earlier where I can model the columns as:
id | start_time | end_time | total_value | value + [timeStamp1]
| value + [timeStamp2]...
Is it possible to do this with a single column family with CQL? I can see that I can make a composite key of (id, timestamp) and store the values against the timestamp, and repeat the event level metadata for each row as part of denormalization, but would that still be storing it in one big row?
Yes, you can do this in Cassandra with only one table. The idea is that you have a partition key (id) and a clustering key (timestamp). For the same partition key, all data is written into one big row ...
CREATE TABLE timeseries (id uuid, ts timestamp, info text, otherinfo text, PRIMARY KEY (id, ts));
In this example you can query all events of a specific id by time range.
SELECT * FROM timeseries where id=someid and ts > 0 and ts < 100;
For each id you will have a wide row containing the events. As for "repeating the event metadata as denormalization": if the other information does not change for a given id, you should declare those columns as static. Then, no matter how many events you have within a row, these columns are stored only once (it's a smart denormalization).
HTH,
Carlo
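The wide-row layout described above can be pictured with a small plain-Python model. It is not Cassandra code: the class Partition, its method names and the sample values are hypothetical, and it only illustrates static columns stored once per id plus events kept sorted by timestamp for range queries.

```python
from bisect import bisect_left, bisect_right

class Partition:
    """One partition: static columns stored once, events kept sorted by ts."""

    def __init__(self, **static_columns):
        self.static = static_columns
        self.timestamps = []  # sorted clustering keys
        self.values = []      # values[i] belongs to timestamps[i]

    def insert(self, ts, value):
        i = bisect_left(self.timestamps, ts)
        if i < len(self.timestamps) and self.timestamps[i] == ts:
            self.values[i] = value  # same primary key: overwrite
        else:
            self.timestamps.insert(i, ts)
            self.values.insert(i, value)

    def between(self, low, high):
        """Events with low < ts < high, like `ts > low AND ts < high`."""
        start = bisect_right(self.timestamps, low)
        end = bisect_left(self.timestamps, high)
        return list(zip(self.timestamps[start:end], self.values[start:end]))

# Event-level metadata is kept once, however many events are added.
p = Partition(start_time=0, end_time=100)
for ts, value in [(50, "b"), (10, "a"), (100, "d"), (99, "c"), (0, "z")]:
    p.insert(ts, value)

in_range = p.between(0, 100)
```

Because the events are already sorted inside the partition, the range query is a slice rather than a scan.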
