MonetDB: Group by different parts of a timestamp

I have a timestamp column in a MonetDB table that I sometimes want to group by hour and sometimes by day or month. What is the most efficient way of doing this in MonetDB?
In, say, Postgres you could do something like:
select date_trunc('day', order_time), count(*)
from orders
group by date_trunc('day', order_time);
I appreciate that this would not use an index, but is there any way of doing this in MonetDB without creating additional date columns holding day-, month-, and year-truncated values?
Thanks.

You could use EXTRACT(DAY FROM order_time), possibly as part of a subquery before grouping.
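For example, a minimal sketch using the orders table from the question:
SELECT d, count(*)
FROM (SELECT EXTRACT(DAY FROM order_time) AS d FROM orders) AS t
GROUP BY d;
Note that EXTRACT(DAY FROM ...) returns the day of the month (1-31), so rows from different months fall into the same bucket unless you also extract the year and month.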

It might be a little late for an answer, but the following should work for truncating to day precision:
SELECT CAST(order_time AS DATE) AS order_date, count(*)
FROM orders
GROUP BY order_date;
It works by casting the timestamp value to DATE, which is a MonetDB built-in type, and the cast is pretty fast.
It does not have the flexibility of date_trunc in Postgres, but if you need monthly or yearly precision, you could use the somewhat slower but usable EXTRACT to get the relevant parts of the timestamp and group by them. For monthly grouping, you could do:
SELECT EXTRACT(YEAR FROM order_time) AS y,
       EXTRACT(MONTH FROM order_time) AS m,
       count(*)
FROM orders
GROUP BY y, m;
The only disadvantage is that you will have the date split across two columns.
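If you would rather have a single grouping column, one workaround (a sketch built only from the EXTRACT calls above, nothing MonetDB-specific) is to fold year and month into one yyyymm integer:
SELECT EXTRACT(YEAR FROM order_time) * 100 + EXTRACT(MONTH FROM order_time) AS ym,
       count(*)
FROM orders
GROUP BY ym;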

Related

Query not filtering with date in Oracle

There are records in the table for a particular date, but when I query with that value, I am unable to filter the records.
select * from TBL_IPCOLO_BILLING_MST
where LAST_UPDATED_DATE = '03-09-21';
The dates are in dd-mm-yy format.
To the answer by Valeriia Sharak, I would just add a few things since your question is tagged Oracle. I was going to add this as a comment to her answer, but it's too long.
First, it is bad practice to compare dates to strings. Your query, for example, would not even execute for me -- it would end with ORA-01843: not a valid month. That is because Oracle must do an implicit type conversion to convert your string "03-09-21" to a date and it uses the current NLS_DATE_FORMAT setting to do that (which in my system happens to be DD-MON-YYYY).
Second, as was pointed out, your comparison is probably not matching rows due to LAST_UPDATED_DATE having hours, minutes, and seconds. But a more performant solution for that might be:
...
WHERE last_update_date >= TO_DATE('03-09-21', 'DD-MM-YY')
  AND last_update_date <  TO_DATE('04-09-21', 'DD-MM-YY')
This makes the comparison without wrapping last_update_date in a TRUNC() function. This could perform better in either of the following circumstances:
If there is an index on last_update_date that would be useful in your query
If the table with last_update_date is large and is being joined to other tables (because it makes it easier for Oracle to estimate the number of rows from your table that are inputs to the join).
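As an aside on the first point: if you would rather keep TRUNC() around the column, Oracle also supports function-based indexes, so the truncated value itself can be indexed. A sketch (the index name is hypothetical):
CREATE INDEX ix_billing_upd_day
    ON TBL_IPCOLO_BILLING_MST (TRUNC(LAST_UPDATED_DATE));
-- with that index in place, the TRUNC-based predicate can use it:
SELECT *
FROM TBL_IPCOLO_BILLING_MST
WHERE TRUNC(LAST_UPDATED_DATE) = TO_DATE('03-09-21', 'DD-MM-YY');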
Your column might contain hours, minutes, and seconds, even if they are not displayed.
So when you filter on a bare date, Oracle implicitly adds a midnight time component, and you are effectively filtering on '03-09-21 00:00:00'.
Try truncating your column:
select * from TBL_IPCOLO_BILLING_MST
where trunc(LAST_UPDATED_DATE) = TO_DATE('03-09-21', 'DD-MM-YY');
Hope I understood your question correctly.
Oracle docs

Hive SELECT DISTINCT and GROUP BY in a subquery

I am running a query but I'm a little stuck on the concept of subqueries in HiveQL. I am new to Hive and I've done a lot of reading but I still can't get it to work.
So I have a big table in which the fields I'm interested in are created_date and size. I basically want to run an aggregation that sums the sizes of files created in each year, grouping by distinct year.
My current query:
SELECT year(created_date), SUM(size) FROM <tablename> GROUP BY created_date
2001 2654567
2001 231818
2001 1978222
2002 7625332
2002 6272829
2003 2733792
This gives me a list of all the years in the table with partial sums, as above, but I get duplicates of each year; this is where I thought I needed a subquery to SELECT DISTINCT year and sum the total size too.
Any help would be superb.
You might want to try grouping by the year (since that is what you are selecting):
SELECT year(created_date), SUM(size) FROM <tablename> GROUP BY year(created_date)

Joining two tables in hive

I have a table where data is partitioned by year, month, and day:
'ABC' PARTITIONED BY
(year='2011', month='08', day='01')
I want to run a query something like
select * from ABC where dt>='2011-03-01' and dt<='2012-02-01';
How can I run this query with the above partitioning scheme in terms of year, month, and day?
You might consider creating an external table that is partitioned by 'yyyy-mm-dd', and uses the same locations as your existing table. You won't have to copy any data, and you'll have the flexibility of both partitioning formats.
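A sketch of that idea (the column list and HDFS paths are hypothetical stand-ins for your actual layout):
CREATE EXTERNAL TABLE abc_by_dt (col1 STRING, col2 INT)
PARTITIONED BY (dt STRING)
LOCATION '/warehouse/abc';
-- map each dt partition onto an existing year/month/day directory
ALTER TABLE abc_by_dt ADD PARTITION (dt='2011-08-01')
LOCATION '/warehouse/abc/year=2011/month=08/day=01';
-- range predicates on dt then work directly
SELECT * FROM abc_by_dt WHERE dt >= '2011-03-01' AND dt <= '2012-02-01';
You would need one ADD PARTITION per existing day, but that is easily scripted and no data is moved.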
select * from ABC where year='2011' and month >= '03'
UNION
select * from ABC where year='2012' and month = '01'
UNION
select * from ABC where year='2012' and month='02' and day='01';
The above query should serve the purpose, but it's neither flexible nor very readable. As Matt suggested, a better partitioning format would be a single string column in yyyy-MM-dd format. However, you might have to make a copy of the data if you change the partitioning scheme from year, month, day to dt; a sketch follows. In my opinion, though, it's totally worth it.
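If you do make the copy, dynamic partitioning keeps it to a single statement; a sketch (again with hypothetical columns, assuming dt can be assembled from the old partition columns):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
CREATE TABLE abc_dt (col1 STRING, col2 INT)
PARTITIONED BY (dt STRING);
-- rebuild the dt value from the old year/month/day partition columns
INSERT OVERWRITE TABLE abc_dt PARTITION (dt)
SELECT col1, col2, concat(year, '-', month, '-', day) AS dt
FROM ABC;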

Storing weekly and monthly aggregates in Oracle

I need to dynamically update weekly and monthly sales data per product and customer. These need to be updated and checked during the sale of a product, and for various reasons I'm not able to use stored procedures or materialized views for this (I'll read everything into the application, modify everything in memory and then update and commit the results).
What is the best table structure for holding the sales during a period?
Store the period type (M, W) with start and end dates, or just the type and start date?
Use date fields and a char, or code it into a string ('M201201' / 'W201248')
Normalize sales and periods into two tables, or keep both sales and the period in a single table?
I will be doing two kinds of queries: selecting the sales of the current weekly (or monthly, but never both at once) period/customer/article without updating them, and selecting for update the weekly and monthly periods for a customer/article.
If you store both the start and end dates of the applicable period in the row, then your retrieval queries will be much easier, at least the ones that are based on a single date (like today). This is a very typical mode of access since you are probably going to be looking at things from the perspective of a business transaction (like a specific sale) which happens on a given date.
It is very direct and simple to say where #date_of_interest >= start_date and #date_of_interest <= end_date. Any other combination requires you to do date arithmetic either in code before you go to your query or within your query itself.
Keeping a type code (M, W) as well as both start and end dates entails introducing some redundancy. However, you might choose to introduce this redundancy for the sake of easing data retrieval. This: where #date_of_interest >= start_date and #date_of_interest <= end_date and range_type='M' is also very direct and simple.
As with all denormalization, you need to ensure that you have controls that will manage this redundancy.
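One way to implement such a control is a check constraint tying the end date to the period type and start date; a sketch in Oracle syntax (table and column names are hypothetical, assuming the type/start/end design above):
ALTER TABLE period_sales ADD CONSTRAINT chk_period_bounds CHECK (
    (range_type = 'W' AND end_date = start_date + 6)
    OR
    (range_type = 'M' AND end_date = LAST_DAY(start_date))
);
-- LAST_DAY is a deterministic built-in function, so it should be
-- usable inside a check constraint.
A weekly row must then span exactly seven days, and a monthly row must end on the last day of its starting month.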
I would recommend using a normalized schema for this purpose, storing the weekly and monthly aggregations in two different tables with the same structure. I don't know what kinds of queries you're going to run, but I suppose this would make any sort of search easier (when it's done the right way!).
Probably something like this sample:
product_prices (
    prod_code,
    price,
    date_price_begin
);
sales (
    prod_code,
    customer_code,
    sale_date
);
-- aggregate by week
select trunc(sale_date, 'iw') as week,
       prod_code,
       customer_code,
       sum(price) keep (dense_rank first order by date_price_begin) as price
from sales
natural join product_prices
where sale_date > date_from
group by trunc(sale_date, 'iw'),
         prod_code,
         customer_code
/
-- aggregate by month
select trunc(sale_date, 'mm') as month,
       prod_code,
       customer_code,
       sum(price) keep (dense_rank first order by date_price_begin) as price
from sales
natural join product_prices
where sale_date > date_from
group by trunc(sale_date, 'mm'),
         prod_code,
         customer_code
/

Querying a data warehouse data involving time dimension

I have two tables for time dimension
date (unique row for each day)
time of the day (unique row for each minute in a day)
Given this schema what would a query look like if one wants to retrieve facts for last X hours where X can be any number greater than 0.
Things start to become tricky when the start time and end time happen to fall on two different days of the year.
EDIT: My fact table does not have a timestamp column.
Fact tables normally do have (and should have) an original timestamp, precisely to avoid awkward by-time queries that cross a day boundary. "Awkward" here means having some type of complicated date-time function in the WHERE clause.
In most DWs these type of queries are very rare, but you seem to be streaming data into your DW and using it for reporting at the same time.
So I would suggest:
Introduce the full timestamp in the fact table.
For the old records, re-create the timestamp from the Date and Time keys.
DW queries are all about not having any functions in the WHERE clause, or if a function has to be used, make sure it is SARGABLE.
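For example (an Oracle-flavored sketch, since Oracle comes up below; event_ts and the 4-hour window are placeholders):
-- sargable: the bare column is compared to a precomputed boundary,
-- so an index on event_ts can be range-scanned
WHERE event_ts >= SYSDATE - NUMTODSINTERVAL(4, 'HOUR')
-- not sargable: wrapping the indexed column in a function
-- prevents a straightforward index range scan
WHERE TRUNC(event_ts, 'HH24') >= TRUNC(SYSDATE - NUMTODSINTERVAL(4, 'HOUR'), 'HH24')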
You would probably be better served by converting the Start Date and End Date columns to TIMESTAMP and populating them.
Slicing the table would require taking the appropriate interval BETWEEN Start Date AND End Date. In Oracle the interval would be something along the lines of SYSDATE - (4/24) or SYSDATE - NUMTODSINTERVAL(4, 'HOUR')
This could also be rewritten as:
Start Date <= (SYSDATE - (4/24)) AND End Date >= (SYSDATE - (4/24))
It seems to me that, given your current schema, you will need to retrieve the appropriate time IDs from the time dimension table that meet your search criteria, and then search for matching rows in the fact table. Depending on the granularity of your time dimension, you might want to check the performance of doing either (SQL Server examples):
A subselect:
SELECT X
FROM FOO
WHERE TIMEID IN (SELECT ID FROM DIMTIME
                 WHERE HOUR >= DATEPART(HOUR, CURRENT_TIMESTAMP))
  AND DATEID IN (SELECT ID FROM DIMDATE
                 WHERE DATE = GETDATE())
An inner join:
SELECT X
FROM FOO
INNER JOIN DIMTIME ON FOO.TIMEID = DIMTIME.ID
INNER JOIN DIMDATE ON FOO.DATEID = DIMDATE.ID
WHERE DIMTIME.HOUR >= DATEPART(HOUR, CURRENT_TIMESTAMP)
  AND DIMDATE.DATE = GETDATE()
Neither of these are truly attractive options.
Have you considered that you may be querying against a cube that is intended for roll-up analysis and not necessarily for "last X" analysis?
If this is not a "roll-up" cube, I would agree with the other posters in that you should re-stamp your fact tables with better keys, and if you do in fact intend to search off of hour frequently, you should probably include that in the fact table as well, as any other attempt will probably make the query non-sargable (see What makes a SQL statement sargable?).
Microsoft recommends at http://msdn.microsoft.com/en-us/library/aa902672%28v=sql.80%29.aspx that:
In contrast to surrogate keys used in other dimension tables, date and time dimension keys should be "smart." A suggested key for a date dimension is of the form "yyyymmdd". This format is easy for users to remember and incorporate into queries. It is also a recommended surrogate key format for fact tables that are partitioned into multiple tables by date.
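For instance, in SQL Server (to match the examples above), such a smart key can be derived from a datetime; a minimal sketch:
-- CONVERT style 112 renders a datetime as 'yyyymmdd',
-- which casts cleanly to an integer surrogate key
SELECT CAST(CONVERT(char(8), GETDATE(), 112) AS int) AS date_key;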
Best of luck!
