I have come across a fact table fact_trips - composed of columns like
driver_id,
vehicle_id,
date ( in the int form - 'YYYYMMDD')
timestamp (in milliseconds - bigint )
miles,
time_of_trip
I have another dim_time - composed of columns like
date ( in the int form - 'YYYYMMDD'),
timestamp (in milliseconds - bigint ),
month,
year,
day_of_week
day
Now, when I want to see the trips grouped by year, I have to join the two tables on timestamp (bigint) and then group by year from dim_time.
Why do we keep the date in int form at all, then? Ultimately I still have to join on timestamp. What needs to change?
Also, dim_time has no primary key, so there are multiple entries for the same date. When I join the tables, I get more rows back than expected.
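You should have two dimension tables:
DIM_DATE: PK = YYYYMMDD
DIM_TIME: PK = number. It holds one record per unit of time in a day at whatever grain you keep (86,400,000 records at millisecond grain, 86,400 at second grain, and so on).
With that split, fact_trips joins to DIM_DATE on the int date column, so grouping by year no longer needs the timestamp, and a primary key on each dimension prevents the duplicate rows.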
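To make the date-keyed join concrete, here is a minimal sketch that uses Python's built-in sqlite3 as a stand-in for the real warehouse. The column names date_key and ts_ms, and all sample rows, are assumptions made up for illustration; only the idea of a dim_date table keyed on YYYYMMDD comes from the answer above.

```python
import sqlite3

# In-memory SQLite database standing in for the warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,  -- YYYYMMDD, exactly one row per calendar day
    year     INTEGER,
    month    INTEGER,
    day      INTEGER
);
CREATE TABLE fact_trips (
    driver_id INTEGER,
    date_key  INTEGER REFERENCES dim_date(date_key),
    ts_ms     INTEGER,             -- epoch milliseconds, not used for the join
    miles     REAL
);
""")

# Hypothetical sample data.
conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?, ?)",
    [(20201231, 2020, 12, 31), (20210101, 2021, 1, 1)],
)
conn.executemany(
    "INSERT INTO fact_trips VALUES (?, ?, ?, ?)",
    [
        (1, 20201231, 1609400000000, 10.0),
        (1, 20210101, 1609460000000, 5.0),
        (2, 20210101, 1609470000000, 7.5),
    ],
)

# Join on the int date key; the primary key guarantees one match per trip.
trips_per_year = conn.execute("""
    SELECT d.year, COUNT(*) AS trips, SUM(f.miles) AS miles
    FROM fact_trips f
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY d.year
    ORDER BY d.year
""").fetchall()

print(trips_per_year)
```

Because date_key is unique in dim_date, each trip matches exactly one dimension row, so the row count after the join equals the row count of fact_trips.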
I created a table and two materialized views, the second one reading from the first.
Table:
CREATE TABLE `log_details` (
date String,
event_time DateTime,
username String,
city String)
ENGINE = MergeTree()
ORDER BY (date, event_time)
PARTITION BY date TTL event_time + INTERVAL 1 MONTH
Materialized views:
CREATE MATERIALIZED VIEW `log_u_c_day_mv`
ENGINE = SummingMergeTree()
PARTITION BY date
ORDER BY (date, username, city)
AS
SELECT date, username, city, count() as times
FROM `log_details`
GROUP BY date, username, city
CREATE MATERIALIZED VIEW `log_u_day_mv`
ENGINE = SummingMergeTree()
PARTITION BY date
ORDER BY (date, username)
AS
SELECT date, username, SUM(times) as total_times
FROM `.inner.log_u_c_day_mv`
GROUP BY date, username
Insert into log_details → Insert into log_u_c_day_mv → Insert into log_u_day_mv.
log_u_day_mv is not optimized within 15 minutes of an insert into log_u_c_day_mv, or even after more than a day.
I tried optimizing log_u_day_mv manually, and that works:
OPTIMIZE TABLE `.inner.log_u_day_mv` PARTITION 20210110
But ClickHouse does not optimize it in a timely manner on its own.
How can I solve this?
Data in a MergeTree table is never guaranteed to be fully aggregated/collapsed.
Even if you run OPTIMIZE ... FINAL, the next insert creates a new part.
ClickHouse does not merge parts on a time schedule. The merge scheduler selects parts using its own algorithm, based on the current node workload, the number of parts, and the size of the parts.
A SummingMergeTree MUST ALWAYS be queried with sum / group by.
select sum(total_times), username
from log_u_day_mv
group by username
DO NOT USE "from log_u_day_mv FINAL": it reads excessive columns!
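Why sum / group by is always safe can be sketched with a toy model in Python. This is not ClickHouse code: each inner list simply stands in for one data part, and merge() imitates what a background merge would eventually do. The sample rows are invented for illustration.

```python
from collections import defaultdict

# Each inner list stands in for one unmerged data part holding
# (date, username, total_times) rows. Sample data is hypothetical.
parts = [
    [("20210110", "alice", 3), ("20210110", "bob", 1)],
    [("20210110", "alice", 2)],  # a later insert created a second part
]

def sum_by_username(parts):
    """Equivalent of: select sum(total_times), username ... group by username."""
    totals = defaultdict(int)
    for part in parts:
        for _date, username, total_times in part:
            totals[username] += total_times
    return dict(totals)

def merge(parts):
    """Toy version of a merge: collapse rows that share the sorting key."""
    merged = defaultdict(int)
    for part in parts:
        for date, username, total_times in part:
            merged[(date, username)] += total_times
    return [[(d, u, t) for (d, u), t in sorted(merged.items())]]

before = sum_by_username(parts)         # queried while parts are unmerged
after = sum_by_username(merge(parts))   # queried after a merge happened

print(before, after)
```

The aggregate query returns the same totals whether or not the merge has run, which is why the timing of background merges does not need to be controlled.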
I'm working on my table which is supposed to store data about rented cars.
And there are 3 important columns:
RENT_DATE DATE DEFAULT TO_DATE (SYSDATE, 'DD-MM-YYYY'),
DAYS NUMBER NOT NULL,
RETURN_DATE DATE DEFAULT TO_DATE(SYSDATE+DAYS, 'DD-MM-YYYY')
My problem is that the RETURN_DATE column gives me this error:
00984. 00000 - "column not allowed here"
What I want is for RENT_DATE to be set automatically to the date when the record is added.
The DAYS column stores how many days someone is renting the car for.
And the last column should store the date when the car should be returned.
Thank you for any type of help.
This doesn't make sense:
DEFAULT TO_DATE (SYSDATE, 'DD-MM-YYYY')
SYSDATE is already a date. TO_DATE requires a char, so this takes a date, Oracle implicitly turns the date into a char, and then TO_DATE converts it back to a date. This is risky and unreliable because it applies a hardcoded date format to a date that has been implicitly turned into a string using the session default format, which might one day not be DD-MM-YYYY (you're building a latent bug into your software).
If you want a date without a time on it, use TRUNC(SYSDATE).
The other problem doesn't make sense either: you're storing a number of days rented for and also the return date, when one is a function of the other. Storing redundant data becomes a headache because you have to keep them in sync. My person class stores my birthdate, and I calculate how old I am. I don't store my age too and then update my table every day/year etc
Work out which will be more beneficial to you to store, and store it, then calculate the other whenever you want it. Personally I would store the return date as it's absolute, rather than open to interpretation of "is that working days, calendar days? what about public holidays? if the start date is jan 1 and the rental is for 10 days, is the car brought back on the 10th or the 11th?"
If you're desperate to have both columns in your DB consider using a view to calculate it or a function based column (again, to calculate one from the other) so they stay in sync
All in, you could look at this:
create table X(
RENT_DATE DATE DEFAULT TRUNC(SYSDATE) NOT NULL,
RETURN_DATE DATE NOT NULL,
DAYS AS (TRUNC(RETURN_DATE - RENT_DATE) + 1)
)
I put the days as +1 because to me, a car taken on the 1st and returned on the second is 2 days, but you might want to get more accurate - if it's taken on the first and returned before 10am on the second then it's one day otherwise it's 2 etc...
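The inclusive versus exclusive day count mentioned above can be illustrated with a small Python sketch. The function name rental_days and its inclusive flag are hypothetical and exist only for this example; which convention applies is a business decision.

```python
from datetime import date

def rental_days(rent_date: date, return_date: date, inclusive: bool = True) -> int:
    """Number of rental days between two calendar dates.

    inclusive=True counts both the pick-up day and the return day,
    matching the "+ 1" in the virtual column above.
    """
    days = (return_date - rent_date).days
    return days + 1 if inclusive else days

# A car taken on the 1st and returned on the 2nd:
two_days = rental_days(date(2020, 1, 1), date(2020, 1, 2))
one_day = rental_days(date(2020, 1, 1), date(2020, 1, 2), inclusive=False)

print(two_days, one_day)
```

Storing only the return date and deriving the day count this way keeps the two values from drifting apart.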
Use a virtual column:
CREATE TABLE table_name (
RENT_DATE DATE
DEFAULT TRUNC( SYSDATE )
CONSTRAINT table_name__rent_date__nn NOT NULL
CONSTRAINT table_name__rent_date_chk CHECK ( rent_date = TRUNC( rent_date ) ),
DAYS NUMBER
DEFAULT 7
CONSTRAINT table_name__days__nn NOT NULL,
RETURN_DATE DATE
GENERATED ALWAYS AS ( RENT_DATE + DAYS ) VIRTUAL
);
Then you can insert values:
INSERT INTO table_name ( rent_date, days ) VALUES ( DEFAULT, DEFAULT );
INSERT INTO table_name ( rent_date, days ) VALUES ( DATE '2020-01-01', 1 );
And:
SELECT * FROM table_name;
Outputs:
RENT_DATE | DAYS | RETURN_DATE
:------------------ | ---: | :------------------
2020-09-12T00:00:00 | 7 | 2020-09-19T00:00:00
2020-01-01T00:00:00 | 1 | 2020-01-02T00:00:00
I have a Hive table with the structure below:
ID string,
Value string,
year int,
month int,
day int,
hour int,
minute int
This table is refreshed every 15 minutes and is partitioned by the year/month/day/hour/minute columns. Here are some sample partitions:
year=2019/month=12/day=29/hour=19/minute=15
year=2019/month=12/day=30/hour=00/minute=45
year=2019/month=12/day=30/hour=08/minute=45
year=2019/month=12/day=30/hour=09/minute=30
year=2019/month=12/day=30/hour=09/minute=45
I want to select only the latest partition's data from the table. I tried using max() on those partition columns, but it's not very efficient because the data size is huge.
Please let me know how I can get the data conveniently using Hive SQL.
If the latest partition is always in current date, then you can filter current date partition and use rank() to find records with latest hour, minute:
select * --list columns here
from
(
select s.*, rank() over(order by hour desc, minute desc) rnk
from your_table s
where s.year=year(current_date) --filter current day (better pass variables calculated if possible)
and s.month=lpad(month(current_date),2,0)
and s.day=lpad(day(current_date),2,0)
-- and s.hour=lpad(hour(current_timestamp),2,0) --consider also adding this
) s
where rnk=1 --latest hour, minute
And if the latest partition is not necessarily in the current date, you can use rank() over (order by s.year desc, s.month desc, s.day desc, hour desc, minute desc). Without a filter on date, however, this scans the whole table and is not efficient.
It will perform best if you can calculate the partition filters in the shell and pass them as parameters. See the comments in the code.
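As a rough sketch of calculating the partition filter outside the query, the Python helper below rounds a timestamp down to the most recent 15-minute slot and builds a WHERE-clause fragment. The function name latest_partition_filter is hypothetical, and the sketch assumes partitions always land on 15-minute boundaries and that the partition for the current slot has already been written; if the load can lag, step back one slot before querying.

```python
from datetime import datetime

def latest_partition_filter(now: datetime, step_minutes: int = 15) -> str:
    """Build a partition filter for the most recent slot at or before `now`.

    Partition columns are ints in the table above, so the values are
    emitted as plain integers without zero padding.
    """
    minute = now.minute - now.minute % step_minutes  # round down to the slot
    return (
        f"year={now.year} AND month={now.month} AND day={now.day} "
        f"AND hour={now.hour} AND minute={minute}"
    )

partition_filter = latest_partition_filter(datetime(2019, 12, 30, 9, 52))
print(partition_filter)
```

The resulting string can be passed to the query as a parameter so that Hive prunes every other partition instead of scanning the table.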
For example, I have a student table with a DOJ (date of joining) column of type DATE, in which I have stored records in dd-mon-yy format.
I have an IN param at runtime with the date passed as a string in dd/mm/yyyy format. How do I compare and fetch results on date?
I want to fetch the count of students who have a DOJ of 25-AUG-92 in my student table, but I am getting the date as a varchar in dd/mm/yyyy format in an IN param. Please guide.
I have tried multiple options such as trunc, to_date, and to_char but, unfortunately, nothing seems to work.
I have a student table with a DOJ(date of joining) column with its type set as DATE now in that I have stored records in dd-mon-yy format.
Not quite: the DATE data type does not have a format. It is stored internally in tables as 7 bytes (year is 2 bytes and month, day, hour, minute and second are 1 byte each). The user interface you are using (i.e. SQL*Plus, SQL Developer, Toad, etc.) handles the formatting of a DATE from its binary format to a human-readable format. In SQL*Plus (or SQL Developer) this format is based on the NLS_DATE_FORMAT session parameter.
If the DATE is input using only the day, month and year then the time component is (probably) going to be set to 00:00:00 (midnight).
I have an IN param at runtime with date passed as string or say varchar and its in dd/mm/yyyy format. How do I compare and fetch results on date.?
Assuming the time component of your DOJ column is always midnight, then:
SELECT COUNT(*)
FROM students
WHERE doj = TO_DATE( your_param, 'dd/mm/yyyy' )
If it isn't always midnight then:
SELECT COUNT(*)
FROM students
WHERE TRUNC( doj ) = TO_DATE( your_param, 'dd/mm/yyyy' )
or:
SELECT COUNT(*)
FROM students
WHERE doj >= TO_DATE( your_param, 'dd/mm/yyyy' )
AND doj < TO_DATE( your_param, 'dd/mm/yyyy' ) + INTERVAL '1' DAY
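The half-open range in the last query can be mirrored in Python to show which rows it matches. This is only an illustration of the comparison logic: the helper name day_bounds and the sample DOJ values are made up, and in practice the filtering should stay in the database.

```python
from datetime import datetime, timedelta

def day_bounds(param: str):
    """Parse a dd/mm/yyyy string into [start of day, start of next day)."""
    start = datetime.strptime(param, "%d/%m/%Y")
    return start, start + timedelta(days=1)

start, end = day_bounds("25/08/1992")

# Hypothetical DOJ values, some with a non-midnight time component.
doj_values = [
    datetime(1992, 8, 25, 0, 0),    # midnight on the day: matches
    datetime(1992, 8, 25, 14, 30),  # afternoon on the day: matches
    datetime(1992, 8, 26, 0, 0),    # midnight the next day: excluded
]

count = sum(start <= doj < end for doj in doj_values)
print(count)
```

Because the comparison is applied to the bare column rather than to a function of it, the equivalent SQL predicate can use a plain index on DOJ if one exists.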
The below should do what you've described. If not, provide more information on how "nothing seems to work".
-- Get the count of students with DOJ = 25-AUG-1992
SELECT COUNT(1)
FROM STUDENT
WHERE TRUNC(DOJ) = TO_DATE('25/AUG/1992','dd/mon/yyyy');
The above was pulled from this answer. You may want to look at the answer, because if performance is critical to you, there is a different way to write this query which doesn't use trunc, which will allow Oracle to use index on DOJ, if one is present.
Though I am a bit late in posting this, I have been able to resolve it.
What I did was convert both dates with TO_CHAR into the same format, and it worked. Here is the query condition that worked:
TO_CHAR(TO_DATE(C.DOB, 'DD-MON-YY'),'DD-MON-YY')=TO_CHAR(TO_DATE(P_Dob,'DD/MM/YYYY'),'DD-MON-YY')
Thanks for the support all. :)
I am learning oracle 11g. I need to create columns to store Year and Month in the following sample format:
Year: 2015
Month: 6
I saw that the DATE data type only takes a whole date, and a NUMBER type may allow an invalid year or month. I want to store them in the given form while rejecting invalid months and years. Please tell me how to do this. Thanks.
Update: is this okay for such inputs?
CREATE TABLE FOOBAR (YYYY DATE, MM DATE);
The best solution is to store dates in DATE columns. Oracle has some pretty neat date functions, and you'll find it easy to work with the first of the month stored in a single DATE column. Otherwise you'll find yourself constantly extracting elements from other dates or cluttering your code with TO_CHAR() and TO_DATE() calls.
However, if you have a rigid requirement, you can use strong typing and check constraints to avoid invalid months:
CREATE TABLE FOOBAR (
YYYY number(4,0) not null
, MM number(2,0) not null
, constraint foobar_yyyy_ck check (yyyy != 0)
, constraint foobar_mm_ck check (mm between 1 and 12)
);
This won't do what you want because it will default the missing elements:
CREATE TABLE FOOBAR (YYYY DATE, MM DATE);
We can't store just a year or just a month in DATE columns.
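To show how the check constraints reject bad input, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for Oracle. The column types are SQLite's, not Oracle's number(4,0) and number(2,0), so treat it as an illustration of the constraint logic only.

```python
import sqlite3

# In-memory SQLite database standing in for Oracle (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE foobar (
        yyyy INTEGER NOT NULL CHECK (yyyy <> 0),
        mm   INTEGER NOT NULL CHECK (mm BETWEEN 1 AND 12)
    )
""")

# A valid year/month pair is accepted.
conn.execute("INSERT INTO foobar VALUES (2015, 6)")

# Month 13 violates the check constraint and is rejected.
try:
    conn.execute("INSERT INTO foobar VALUES (2015, 13)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

rows = conn.execute("SELECT yyyy, mm FROM foobar").fetchall()
print(rejected, rows)
```

Only the valid row is stored; the invalid month never reaches the table.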
Use the DATE data type, and when you insert into your database, use:
TO_DATE ('November 13, 1992', 'MONTH DD, YYYY')
For input and output of dates, the standard Oracle date format is DD-MON-YY, as follows:
'13-NOV-92'
Perform the insert like this:
INSERT INTO table_name (name, created_at) VALUES
('ANDY', TO_DATE ('November 13, 1992', 'MONTH DD, YYYY'));
Here is a link to the guide as well:
https://docs.oracle.com/cd/B28359_01/server.111/b28318/datatype.htm#i1847
If you want to store the month and year separately in the database, you can use NUMBER or NUMBER(n) columns:
https://docs.oracle.com/cd/B28359_01/server.111/b28318/datatype.htm#i22289
Hope this helps.