Sorting order in Cassandra result

I created a table:
CREATE TABLE testtab (
    testtabmainid bigint,
    testtabid timeuuid,
    posteddate timestamp,
    description text,
    year bigint,
    month bigint,
    day bigint,
    PRIMARY KEY ((year, month, day), posteddate, testtabmainid)
) WITH CLUSTERING ORDER BY (posteddate DESC, testtabmainid DESC);
then ran
SELECT testtabmainid, posteddate, year, month, day FROM testtab;
and got a result like this:
 testtabmainid | posteddate               | year | month | day
---------------+--------------------------+------+-------+-----
            90 | 2016-12-01 11:19:11+0530 | 2016 |    11 |  30
            89 | 2016-11-30 16:21:58+0530 | 2016 |    11 |  30
            88 | 2016-11-30 16:13:33+0530 | 2016 |    11 |  30
            91 | 2016-12-01 11:20:42+0530 | 2016 |    12 |   1
The last row is not in sorted order. I need the last row (testtabmainid = 91) at the top;
I need to sort the table by testtabmainid in descending order.

You queried without specifying any WHERE clause. This produces results ordered by the TOKEN function applied to your partition key data.
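You can see that ordering by selecting the token value explicitly (a quick sketch against your original table):
SELECT token(year, month, day), testtabmainid, posteddate
FROM testtab;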
In order to satisfy your query, first of all you need to change the table definition to:
CREATE TABLE testtab (
    testtabmainid bigint,
    testtabid timeuuid,
    posteddate timestamp,
    description text,
    year bigint,
    month bigint,
    day bigint,
    PRIMARY KEY ((year, month, day), testtabmainid, posteddate)
) WITH CLUSTERING ORDER BY (testtabmainid DESC, posteddate DESC);
and then change your query to:
SELECT testtabmainid, posteddate, year, month, day
FROM testtab
WHERE year = 2016
  AND month = 12
  AND day = 1;
The key point is that data is ordered by your CLUSTERING KEY only inside a partition, and that's why you need to filter your queries with a WHERE clause to obtain your order.
If you want to keep the posteddate DESC order as well, you'll need to keep a second table (the one you already have is fine for that) and insert/update both tables.
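For example, if the new definition above is created under a different name, say testtab_by_mainid (a hypothetical name), both tables could be kept in sync with a logged batch; a minimal sketch:
BEGIN BATCH
    -- the same logical row written to both tables; in practice the timeuuid would be
    -- generated client-side so that both copies share the same value
    INSERT INTO testtab (year, month, day, posteddate, testtabmainid, description)
    VALUES (2016, 12, 1, '2016-12-01 11:20:42+0530', 91, 'example row');
    INSERT INTO testtab_by_mainid (year, month, day, testtabmainid, posteddate, description)
    VALUES (2016, 12, 1, 91, '2016-12-01 11:20:42+0530', 'example row');
APPLY BATCH;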

Related

The ambiguity w.r.t date field in Dim_time

I have come across a fact table fact_trips - composed of columns like
driver_id,
vehicle_id,
date ( in the int form - 'YYYYMMDD')
timestamp (in milliseconds - bigint )
miles,
time_of_trip
I have another dim_time - composed of columns like
date ( in the int form - 'YYYYMMDD'),
timestamp (in milliseconds - bigint ),
month,
year,
day_of_week,
day
Now when I want to see the trips grouped by year, I have to join the two tables on the timestamp (the bigint) and then group by year from dim_time.
Why the hell do we keep the date in int form then? Because ultimately, I have to join on the timestamp. What needs to be changed?
Also, the dim_time does not have a primary key, hence there are multiple entries for the same date. So, when I join the tables, I get more rows in return than expected.
You should have two dimension tables:
DIM_DATE: PK = YYYYMMDD
DIM_TIME: PK = a number. It will hold as many records as there are milliseconds in a day (assuming you are holding time at the millisecond grain rather than second, minute, etc.).
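A minimal sketch of that split, and of the trips-per-year join against it (any table and column names beyond those in the question are illustrative):
CREATE TABLE dim_date (
    date_key    INT PRIMARY KEY,   -- YYYYMMDD, e.g. 20240131
    year        SMALLINT NOT NULL,
    month       SMALLINT NOT NULL,
    day         SMALLINT NOT NULL,
    day_of_week SMALLINT NOT NULL
);

CREATE TABLE dim_time (
    time_key INT PRIMARY KEY,      -- milliseconds since midnight, 0 to 86399999
    hour     SMALLINT NOT NULL,
    minute   SMALLINT NOT NULL,
    second   SMALLINT NOT NULL
);

-- trips per year: join on the YYYYMMDD int from fact_trips, no timestamp join needed
SELECT d.year, COUNT(*) AS trips
FROM fact_trips f
JOIN dim_date d ON d.date_key = f.date
GROUP BY d.year;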

Oracle -- Datatype of column which can store value "13:45"

We need to store a value "13:45" in the column "Start_Time" of an Oracle table.
The value can be read as 45 minutes past 13:00 hours.
Which datatype should be used when creating the table? Also, once queried, we would like to see only the value "13:45".
I would keep it simple:
create table t_time_only (
    time_col         varchar2(5),
    time_as_interval INTERVAL DAY TO SECOND invisible
                     generated always as (to_dsinterval('0 '||time_col||':0')),
    constraint check_time
        check ( VALIDATE_CONVERSION(time_col as date, 'hh24:mi') = 1 )
);
The check constraint lets you validate input strings:
SQL> insert into t_time_only values('25:00');
insert into t_time_only values('25:00')
*
ERROR at line 1:
ORA-02290: check constraint (CHECK_TIME) violated
And the invisible virtual generated column lets you do simple arithmetic operations:
SQL> insert into t_time_only values('15:30');
1 row created.
SQL> select trunc(sysdate) + time_as_interval as res from t_time_only;
RES
-------------------
2020-09-21 15:30:00
Your best option is to store the data in a DATE type column. If you are going to be doing any comparisons against the times (querying, sorting, etc.), you will want to make sure that all of the times are using the same day. It doesn't matter which day as long as they are all the same.
CREATE TABLE test_time
(
time_col DATE
);
INSERT INTO test_time
VALUES (TO_DATE ('13:45', 'HH24:MI'));
INSERT INTO test_time
VALUES (TO_DATE ('8:45', 'HH24:MI'));
Test Query
SELECT time_col,
       TO_CHAR (time_col, 'HH24:MI') AS just_time,
       24 * (time_col - LAG (time_col) OVER (ORDER BY time_col)) AS difference_in_hours
FROM test_time
ORDER BY time_col;
Test Results
TIME_COL     JUST_TIME    DIFFERENCE_IN_HOURS
____________ ____________ ______________________
01-SEP-20    08:45
01-SEP-20    13:45                            5
Table Definition using INTERVAL
create table tab
(tm INTERVAL DAY(1) to SECOND(0));
Input value as literal
insert into tab (tm) values (INTERVAL '13:25' HOUR TO MINUTE );
Input value dynamically
insert into tab (tm) values ( (NUMTODSINTERVAL(13, 'hour') + NUMTODSINTERVAL(26, 'minute')) );
Output
You may either EXTRACT the hour and minute:
EXTRACT(HOUR FROM tm) int_hour,
EXTRACT(MINUTE FROM tm) int_minute
or use formatted output, with the trick of adding some fixed DATE:
to_char(DATE'2000-01-01'+tm,'hh24:mi') int_format
which gives
13:25
13:26
Please see this answer for other formatting options for HH24:MI.
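Put together, a complete query against the tab table above might look like this (a sketch):
SELECT EXTRACT(HOUR FROM tm)                       AS int_hour,
       EXTRACT(MINUTE FROM tm)                     AS int_minute,
       to_char(DATE '2000-01-01' + tm, 'hh24:mi')  AS int_format
FROM tab;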
The INTERVAL definition used above may store seconds as well; if this is not acceptable, add a CHECK constraint, e.g. as follows (adjust as required):
tm INTERVAL DAY(1) to SECOND(0)
constraint "wrong interval" check (tm <= INTERVAL '23:59' HOUR TO MINUTE and EXTRACT(SECOND FROM tm) = 0 )
This rejects the following as invalid input
insert into tab (tm) values (INTERVAL '13:25:30' HOUR TO SECOND );
-- ORA-02290: check constraint (X.wrong interval) violated

Sysdate+days as default value in table column - Oracle

I'm working on a table which is supposed to store data about rented cars.
There are 3 important columns:
RENT_DATE DATE DEFAULT TO_DATE (SYSDATE, 'DD-MM-YYYY'),
DAYS NUMBER NOT NULL,
RETURN_DATE DATE DEFAULT TO_DATE(SYSDATE+DAYS, 'DD-MM-YYYY')
My problem is that the RETURN_DATE column is giving me this error:
00984. 00000 - "column not allowed here"
What I want is for RENT_DATE to automatically be set to the current date when a record is added.
The DAYS column stores for how many days someone is renting the car.
And the last column should store the date the car should be returned.
Thank you for any kind of help.
This doesn't make sense:
DEFAULT TO_DATE (SYSDATE, 'DD-MM-YYYY')
SYSDATE is already a date. TO_DATE requires a char, so this takes a date, Oracle implicitly turns the date into a char, and then TO_DATE converts it back to a date. This is risky/unreliable because it uses a hardcoded date format to operate on a date that has been implicitly turned into a string using the system default format, which might one day not be DD-MM-YYYY (you're building a latent bug into your software).
If you want a date without a time on it, use TRUNC(SYSDATE).
The other problem doesn't make sense either: you're storing the number of days rented and also the return date, when one is a function of the other. Storing redundant data becomes a headache because you have to keep the two in sync. My person class stores my birthdate, and I calculate how old I am; I don't store my age as well and then update my table every day/year, etc.
Work out which will be more beneficial for you to store, store it, and then calculate the other whenever you want it. Personally I would store the return date, as it's absolute rather than open to interpretation: "is that working days or calendar days? What about public holidays? If the start date is Jan 1 and the rental is for 10 days, is the car brought back on the 10th or the 11th?"
If you're desperate to have both columns in your DB, consider using a view to calculate it (a sketch follows the example below) or a function-based column (again, to calculate one from the other) so they stay in sync.
All in all, you could look at this:
create table X (
    RENT_DATE   DATE DEFAULT TRUNC(SYSDATE) NOT NULL,
    RETURN_DATE DATE NOT NULL,
    DAYS        AS (TRUNC(RETURN_DATE - RENT_DATE) + 1)
);
I put the days as +1 because, to me, a car taken on the 1st and returned on the 2nd is 2 days, but you might want to be more precise: if it's taken on the 1st and returned before 10am on the 2nd then it's one day, otherwise it's 2, etc.
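As for the view option mentioned above, a minimal sketch might look like this (rentals and rentals_v are illustrative names):
create table rentals (
    RENT_DATE   DATE DEFAULT TRUNC(SYSDATE) NOT NULL,
    RETURN_DATE DATE NOT NULL
);

create or replace view rentals_v as
select rent_date,
       return_date,
       trunc(return_date - rent_date) + 1 as days   -- derived, never stored
from rentals;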
Use a virtual column:
CREATE TABLE table_name (
    RENT_DATE   DATE
                DEFAULT TRUNC( SYSDATE )
                CONSTRAINT table_name__rent_date__nn NOT NULL
                CONSTRAINT table_name__rent_date_chk CHECK ( rent_date = TRUNC( rent_date ) ),
    DAYS        NUMBER
                DEFAULT 7
                CONSTRAINT table_name__days__nn NOT NULL,
    RETURN_DATE DATE
                GENERATED ALWAYS AS ( RENT_DATE + DAYS ) VIRTUAL
);
Then you can insert values:
INSERT INTO table_name ( rent_date, days ) VALUES ( DEFAULT, DEFAULT );
INSERT INTO table_name ( rent_date, days ) VALUES ( DATE '2020-01-01', 1 );
And:
SELECT * FROM table_name;
Outputs:
RENT_DATE | DAYS | RETURN_DATE
:------------------ | ---: | :------------------
2020-09-12T00:00:00 | 7 | 2020-09-19T00:00:00
2020-01-01T00:00:00 | 1 | 2020-01-02T00:00:00
db<>fiddle here

Unix time in PARTITION BY for Vertica

I have a big table in Vertica which has time_stamp (int) as a Unix timestamp. I want to partition this table on a weekly basis (week starting on Monday).
Is there a better way to do this in one step, rather than converting time_stamp from Unix time to a Vertica TIMESTAMP and then partitioning?
Optimally, you should be using the date/time type. You won't be able to use non-deterministic functions such as TO_TIMESTAMP in the PARTITION BY expression. The alternative is to use math to logically create the partitions:
Using a Unix timestamp to partition by:
Unit      Divide by
-------   -----------------------
Minutes   60
Hours     60 * 60           (3600)
Days      60 * 60 * 24      (86400)
Weeks     60 * 60 * 24 * 7  (604800)
If we use 604800, this will give you the week number from January 1, 1970 00:00:00 UTC.
Let's set up a test table:
CREATE TABLE public.test (
time_stamp int NOT NULL
);
INSERT INTO public.test (time_stamp) VALUES (1404305559);
INSERT INTO public.test (time_stamp) VALUES (1404305633);
INSERT INTO public.test (time_stamp) VALUES (1404305705);
INSERT INTO public.test (time_stamp) VALUES (1404305740);
INSERT INTO public.test (time_stamp) VALUES (1404305778);
COMMIT;
Let's create the partition:
ALTER TABLE public.test PARTITION BY FLOOR(time_stamp/604800) REORGANIZE;
We then get:
NOTICE 4954: The new partitioning scheme will produce 1 partitions
WARNING 6100: Using PARTITION expression that returns a Numeric value
HINT: This PARTITION expression may cause too many data partitions. Use of an expression that returns a more accurate value, such as a regular VARCHAR or INT, is encouraged
NOTICE 4785: Started background repartition table task
ALTER TABLE
You'll also want to be mindful of how many partitions this creates. Vertica recommends keeping the number of partitions between 10 and 20.
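One caveat for the week-starts-on-Monday requirement: the Unix epoch (1970-01-01) fell on a Thursday, so FLOOR(time_stamp/604800) produces Thursday-to-Thursday buckets. A possible workaround (a sketch, not tested against your data) is to shift the timestamp by three days, i.e. 259200 seconds, so the bucket boundaries land on Mondays:
-- shift by 3 days (259200 s) so each bucket starts on a Monday 00:00 UTC
ALTER TABLE public.test PARTITION BY FLOOR((time_stamp + 259200) / 604800) REORGANIZE;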

Why is my date dimension table useless? (Confusion over PostgreSQL storage...)

I have looked over this about four times and am still perplexed by these results.
Take a look at the following (which I originally posted here)
Date dimension table --
-- Some output omitted
DROP TABLE IF EXISTS dim_calendar CASCADE;
CREATE TABLE dim_calendar (
    id SMALLSERIAL PRIMARY KEY,
    day_id DATE NOT NULL,
    year SMALLINT NOT NULL,          -- 2000 to 2024
    month SMALLINT NOT NULL,         -- 1 to 12
    day SMALLINT NOT NULL,           -- 1 to 31
    quarter SMALLINT NOT NULL,       -- 1 to 4
    day_of_week SMALLINT NOT NULL,   -- 0 (Sunday) to 6 (Saturday)
    day_of_year SMALLINT NOT NULL,   -- 1 to 366
    week_of_year SMALLINT NOT NULL,  -- 1 to 53
    CONSTRAINT con_month CHECK (month >= 1 AND month <= 12),
    CONSTRAINT con_day_of_year CHECK (day_of_year >= 1 AND day_of_year <= 366), -- 366 allows for leap years
    CONSTRAINT con_week_of_year CHECK (week_of_year >= 1 AND week_of_year <= 53),
    UNIQUE(day_id)
);
INSERT INTO dim_calendar (day_id, year, month, day, quarter, day_of_week, day_of_year, week_of_year) (
    SELECT ts,
           EXTRACT(YEAR FROM ts),
           EXTRACT(MONTH FROM ts),
           EXTRACT(DAY FROM ts),
           EXTRACT(QUARTER FROM ts),
           EXTRACT(DOW FROM ts),
           EXTRACT(DOY FROM ts),
           EXTRACT(WEEK FROM ts)
    FROM generate_series('2000-01-01'::timestamp, '2024-01-01', '1 day'::interval) AS t(ts)
);
/* ==> [ INSERT 0 8767 ] */
Tables for testing --
DROP TABLE IF EXISTS just_dates CASCADE;
DROP TABLE IF EXISTS just_date_ids CASCADE;
CREATE TABLE just_dates AS
SELECT a_date AS some_date
FROM some_table;
/* ==> [ SELECT 769411 ] */
CREATE TABLE just_date_ids AS
SELECT d.id
FROM just_dates jd
INNER JOIN dim_calendar d
ON d.day_id = jd.some_date;
/* ==> [ SELECT 769411 ] */
ALTER TABLE just_date_ids ADD CONSTRAINT jdfk FOREIGN KEY (id) REFERENCES dim_calendar (id);
Confusion --
pocket=# SELECT pg_size_pretty(pg_relation_size('dim_calendar'));
pg_size_pretty
----------------
448 kB
(1 row)
pocket=# SELECT pg_size_pretty(pg_relation_size('just_dates'));
pg_size_pretty
----------------
27 MB
(1 row)
pocket=# SELECT pg_size_pretty(pg_relation_size('just_date_ids'));
pg_size_pretty
----------------
27 MB
(1 row)
Why is a table consisting of a bunch of smallints the same size as a table consisting of a bunch of dates? And I should mention that before, when dim_calendar.id was a normal SERIAL, it gave the same 27 MB result.
Also, and more importantly -- WHY does a table with 769411 records with a single smallint field have a size of 27 MB, which is > 32 bytes/record???
P.S. Yes, I will have billions (or at a minimum hundreds of millions) of records, and am trying to add performance and space optimizations wherever possible.
EDIT
This might have something to do with it, so throwing it out there --
pocket=# select count(id) from just_date_ids group by id;
count
--------
409752
359659
(2 rows)
In tables with one or two columns, the biggest part of the size is always the Tuple Header.
Have a look at http://www.postgresql.org/docs/current/interactive/storage-page-layout.html; it explains how the data is stored. I'm quoting the part of that page that is most relevant to your question:
All table rows are structured in the same way. There is a fixed-size header (occupying 23 bytes on most machines), followed by an optional null bitmap, an optional object ID field, and the user data.
This mostly explains the question
WHY does a table with 769411 records with a single smallint field have a size of 27MB, which is > 32bytes/record???
The other part of your question has to do with the byte alignment of Postgres data. Smallints are aligned at 2-byte offsets, but ints (and dates of course; a date is an int4 after all) are aligned at 4-byte offsets. So the order in which the table columns are declared plays a significant role.
A table with smallint, date, smallint needs 12 bytes of user data per row (not counting the overhead), while declaring smallint, smallint, date needs only 8 bytes. See a great (and surprisingly not accepted) answer here: Calculating and saving space in PostgreSQL.
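You can check both effects yourself with pg_column_size (a quick sketch; the exact byte counts depend on platform and alignment):
-- per-row footprint of the single-smallint table, tuple header included
SELECT pg_column_size(t.*) AS row_bytes,
       pg_column_size(t.id) AS data_bytes
FROM just_date_ids AS t
LIMIT 1;

-- compare the two column orders mentioned above
SELECT pg_column_size(ROW(1::smallint, current_date, 1::smallint)) AS smallint_date_smallint,
       pg_column_size(ROW(1::smallint, 1::smallint, current_date)) AS smallint_smallint_date;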
