Joining two tables in hive - hadoop

I have a table 'ABC' that is partitioned by year, month and day, for example:
'ABC' Partition by
(year='2011', month='08', day='01')
I want to run a query something like
select * from ABC where dt>='2011-03-01' and dt<='2012-02-01';
How can I run this query with the above partitioning scheme of year, month and day?

You might consider creating an external table that is partitioned by 'yyyy-mm-dd', and uses the same locations as your existing table. You won't have to copy any data, and you'll have the flexibility of both partitioning formats.

select * from ABC where year='2011' and month >= '03'
UNION ALL
select * from ABC where year='2012' and month = '01'
UNION ALL
select * from ABC where year='2012' and month='02' and day='01';
The above query should serve the purpose, but it is neither flexible nor very readable. UNION ALL is used because the three branches cannot overlap, so no de-duplication is needed, and Hive versions before 1.2.0 support only UNION ALL. As Matt suggested, a better partitioning scheme would use a single string column in yyyy-MM-dd format as the partitioning column. However, you might have to make a copy of the data if you change the partitioning scheme from year, month, day to dt. In my opinion, though, it's totally worth it.
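If the year/month/day scheme has to stay, the range condition can be generated rather than written by hand. The following is a minimal Python sketch, not something taken from the answers above: the function name partition_predicate is hypothetical, and it assumes the partition values are zero-padded strings as in the question, so that string comparison agrees with calendar order. Whether Hive prunes partitions efficiently for a predicate with this many ORs is worth checking with EXPLAIN on your own version.

```python
from datetime import date


def partition_predicate(start: date, end: date) -> str:
    """Build a WHERE predicate over string partitions year/month/day
    that matches every day in the inclusive range [start, end].

    Assumes zero-padded partition values such as year='2011', month='08', day='01'.
    """
    if start > end:
        raise ValueError("start must not be after end")
    y1, m1, d1 = f"{start.year:04d}", f"{start.month:02d}", f"{start.day:02d}"
    y2, m2, d2 = f"{end.year:04d}", f"{end.month:02d}", f"{end.day:02d}"
    # Lexicographic "(year, month, day) >= start", spelled out column by column.
    lower = (
        f"(year > '{y1}' OR (year = '{y1}' AND "
        f"(month > '{m1}' OR (month = '{m1}' AND day >= '{d1}'))))"
    )
    # Lexicographic "(year, month, day) <= end".
    upper = (
        f"(year < '{y2}' OR (year = '{y2}' AND "
        f"(month < '{m2}' OR (month = '{m2}' AND day <= '{d2}'))))"
    )
    return f"{lower} AND {upper}"


# Range from the question: 2011-03-01 through 2012-02-01, both inclusive.
print("select * from ABC where " + partition_predicate(date(2011, 3, 1), date(2012, 2, 1)))
```

The generated predicate stays on the partition columns only, which is the property the hand-written UNION query was relying on.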

Related

Query not filtering with date in Oracle

There are records in the table for a particular date, but when I query with that value, no rows are returned.
select * from TBL_IPCOLO_BILLING_MST
where LAST_UPDATED_DATE = '03-09-21';
The dates are in dd-mm-yy format.
To the answer by Valeriia Sharak, I would just add a few things since your question is tagged Oracle. I was going to add this as a comment to her answer, but it's too long.
First, it is bad practice to compare dates to strings. Your query, for example, would not even execute for me -- it would end with ORA-01843: not a valid month. That is because Oracle must do an implicit type conversion to convert your string "03-09-21" to a date and it uses the current NLS_DATE_FORMAT setting to do that (which in my system happens to be DD-MON-YYYY).
Second, as was pointed out, your comparison is probably not matching rows due to LAST_UPDATED_DATE having hours, minutes, and seconds. But a more performant solution for that might be:
...
WHERE last_updated_date >= TO_DATE('03-09-21','DD-MM-YY')
AND last_updated_date < TO_DATE('04-09-21','DD-MM-YY')
This makes the comparison without wrapping last_updated_date in a TRUNC() function. This could perform better in either of the following circumstances:
If there is an index on last_updated_date that would be useful in your query
If the table with last_updated_date is large and is being joined to other tables (because it makes it easier for Oracle to estimate the number of rows from your table that are inputs to the join).
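To see why the half-open range finds rows that an equality test misses, here is a small sketch that uses SQLite (through Python's sqlite3 module) as a stand-in for Oracle. The table name billing and its rows are made up for illustration, and the timestamps are stored as ISO-8601 text, which is an assumption of this demo rather than how Oracle stores a DATE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE billing (id INTEGER, last_updated_date TEXT)")
conn.executemany(
    "INSERT INTO billing VALUES (?, ?)",
    [
        (1, "2021-09-02 23:59:59"),  # the day before
        (2, "2021-09-03 00:00:00"),  # exactly midnight
        (3, "2021-09-03 14:30:00"),  # same day, with a time component
        (4, "2021-09-04 00:00:00"),  # the next day
    ],
)

# Equality against "the date" only matches the row stored at midnight.
equality = conn.execute(
    "SELECT id FROM billing WHERE last_updated_date = ? ORDER BY id",
    ("2021-09-03 00:00:00",),
).fetchall()

# Half-open range [start of day, start of next day) matches the whole day
# and leaves the column unwrapped, so an index on it could still be used.
half_open = conn.execute(
    "SELECT id FROM billing "
    "WHERE last_updated_date >= ? AND last_updated_date < ? ORDER BY id",
    ("2021-09-03 00:00:00", "2021-09-04 00:00:00"),
).fetchall()

print(equality)   # [(2,)]
print(half_open)  # [(2,), (3,)]
```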
Your column might contain hours, minutes and seconds, even though they may not be displayed.
When you filter on a date without a time, Oracle treats the time as midnight, so you are effectively filtering on '03-09-21 00:00:00'.
Try to trunc your column:
select * from TBL_IPCOLO_BILLING_MST
where trunc(LAST_UPDATED_DATE) = to_date('03-09-21', 'DD-MM-YY');
I hope I understood your question correctly.
Oracle docs

Varchar to Timestamp but varchar data is yyyy-mm-dd-hh:mi:ss:ff format

My source is from Oracle and the col1 is varchar2(26) but the value looks like YYYY-MM-DD-hh:mi:ss:ff (Sample rec: 2014-08-15-02.03.34.979946).
I have to extract only the last 6 months of records based on COL1. Since there is a hyphen between the date part and the time part, I could not treat it as a timestamp. Is there any way to convert this to a timestamp so that I can look up only 6 months of data?
If it is possible at all, fix the data first. Storing timestamps in string data type is terrible. How do you know you don't have a time like 25:30:00 in the strings? Or a date like February 30? Besides, you can't really use an index on that column (so queries will be very slow), you will have to write a lot of code whenever referencing that column, etc.
Anyway - to deal with the immediate problem, use TO_TIMESTAMP(), exactly with the format model you show in your post - including the dash between the date part and the time part. Something like this:
select case when to_timestamp('2014-08-15-02.03.34.979946', 'YYYY-MM-DD-HH24:MI:SS.FF')
>= systimestamp - interval '6' month
then 'TRUE' else 'FALSE' end
as result
from dual;
RESULT
------
FALSE
EDIT: As Alex Poole points out (correctly as always) in a comment below this answer, interval arithmetic won't work correctly in all cases: subtracting a six-month interval on a day such as August 31 raises ORA-01839, because February 31 does not exist. It is better, then, to use something like
cast ( to_timestamp (...., format-model) as date ) >= add_months (sysdate, -6).
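The difference between the two kinds of month arithmetic can be illustrated outside the database. The sketch below is plain Python with two hypothetical helper functions; it only mimics the relevant behaviour (strict arithmetic fails on a non-existent day, clamped arithmetic falls back to the last day of the month) and does not reproduce every rule of Oracle's ADD_MONTHS.

```python
import calendar
from datetime import date


def minus_months_strict(d: date, months: int) -> date:
    """Keep the day number as-is, the way interval arithmetic does.

    Raises ValueError when the target month has no such day (e.g. Feb 31).
    """
    year, month0 = divmod(d.year * 12 + (d.month - 1) - months, 12)
    return date(year, month0 + 1, d.day)


def minus_months_clamped(d: date, months: int) -> date:
    """Clamp the day to the last day of the target month when needed."""
    year, month0 = divmod(d.year * 12 + (d.month - 1) - months, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))


print(minus_months_clamped(date(2014, 8, 15), 6))  # 2014-02-15
print(minus_months_clamped(date(2014, 8, 31), 6))  # 2014-02-28

try:
    minus_months_strict(date(2014, 8, 31), 6)
except ValueError as exc:
    print("strict arithmetic failed:", exc)
```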
Maybe something like this will do:
select *
from your_table
where to_date(substr(col1,1,19),'yyyy-mm-dd-HH24.MI.SS') between add_months(sysdate,-6) and sysdate;
Assuming all the data format in col1 is always the same.
Also note that I used HH24 for the hour segment, which may not match your data.
You can include the dash in your format model, as @mathguy showed, to convert your string to a timestamp:
select to_timestamp('2014-08-15-02.03.34.979946', 'YYYY-MM-DD-HH24:MI:SS.FF') from dual;
TO_TIMESTAMP('2014-08-15-02.
----------------------------
15-AUG-14 02.03.34.979946000
Unless you explicitly forbid it with the FX modifier, though, Oracle is flexible enough to allow a dash even if the model has a space (see the text below this table in the documentation):
select to_timestamp('2014-08-15-02.03.34.979946', 'YYYY-MM-DD HH24:MI:SS.FF') from dual;
TO_TIMESTAMP('2014-08-15-02.
----------------------------
15-AUG-14 02.03.34.979946000
However, converting all of the values in your col1 column and then comparing them may be a lot of work, and will prevent any index on that string column being used. Because the format is fixed-width with the most significant fields first, you can convert your date range to strings instead and use string comparison, provided the generated strings use exactly the same separators as the stored values (periods in the time part of your sample); e.g. to find everything in the six months up to midnight this morning:
select col1 -- or whichever columns you need
from your_table
where col1 >= to_char(cast(add_months(trunc(sysdate), -6) as timestamp), 'YYYY-MM-DD-HH24.MI.SS.FF6')
and col1 < to_char(cast(trunc(sysdate) as timestamp), 'YYYY-MM-DD-HH24.MI.SS.FF6');
or since the time part can be fixed for that example, you can use character literals instead of casting:
select col1 -- or whichever columns you need
from your_table
where col1 >= to_char(add_months(sysdate, -6), 'YYYY-MM-DD"-00.00.00.000000"')
and col1 < to_char(sysdate, 'YYYY-MM-DD"-00.00.00.000000"');
Of course, storing data in the correct native data type would be a much better solution. Any other solution only works at all if your string data actually contains what you think, and the data is all sane (or as sane as it can be in the wrong data type).
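The string-comparison approach depends on the bounds being rendered in exactly the same layout as the stored values. The following plain-Python sketch shows both the working case and what goes wrong when the separators differ; the function in_range and the sample bounds are hypothetical, and the layout is assumed to match the sample value 2014-08-15-02.03.34.979946 from the question.

```python
from datetime import datetime

# Layout assumed from the sample value 2014-08-15-02.03.34.979946
FMT = "%Y-%m-%d-%H.%M.%S.%f"


def in_range(col1: str, start: datetime, end: datetime) -> bool:
    """Half-open range test done purely on strings, like the SQL predicate."""
    return start.strftime(FMT) <= col1 < end.strftime(FMT)


start = datetime(2014, 8, 15)  # midnight at the start of the range
end = datetime(2015, 2, 15)    # midnight at the end of the range (exclusive)

value = "2014-08-15-00.03.34.979946"  # three minutes after the lower bound
print(in_range(value, start, end))    # True

# The same lower bound rendered with ':' instead of '.' sorts differently:
# '.' comes before ':' in ASCII, so the value wrongly compares as smaller.
colon_bound = start.strftime("%Y-%m-%d-%H:%M:%S.%f")
print(value >= colon_bound)           # False
```

Only values in the same hour as a bound are affected by the separator mismatch, which makes this kind of bug easy to miss in testing.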

MonetDB: Group by different parts of a timestamp

I have a timestamp column in a MonetDB table which I want to occasionally group by hour and occasionally group by day or month. What is the best way of doing this in MonetDB?
In, say, Postgres you could do something like:
select date_trunc('day', order_time), count(*)
from orders
group by date_trunc('day', order_time);
Which I appreciate would not use an index, but is there any way of doing this in MonetDB without creating additional date columns holding day, month and year truncated values?
Thanks.
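You could use EXTRACT(DAY FROM order_time), possibly as part of a subquery before grouping. Note that this returns only the day of the month, so to get one group per calendar day you would also need to group by the year and month.
It might be a little late for an answer, but the following should work for truncating to day precision: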
SELECT CAST(order_time AS DATE) AS order_date, count(*)
FROM orders
GROUP BY order_date;
It works by casting the timestamp value to type DATE, which is a MonetDB built-in type, and the cast is pretty fast.
It does not have the flexibility of date_trunc in Postgres, but if you need monthly or yearly precision, you could use the somewhat slower but usable EXTRACT to get the relevant parts of the timestamp and group by them. For monthly grouping, you could do:
SELECT EXTRACT(YEAR FROM order_time) AS y,
EXTRACT(MONTH FROM order_time) AS m,
count(*)
FROM orders GROUP BY y, m;
The only disadvantage is that you will have the date split across two columns.
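The effect of each grouping key can be checked on a handful of values. This is a plain-Python sketch with made-up timestamps, not MonetDB code; it mirrors the cast-to-date grouping and the year-plus-month grouping, and shows why grouping on the month number alone would merge different years.

```python
from collections import Counter
from datetime import datetime

# Hypothetical order timestamps, two of them on the same day.
orders = [
    datetime(2023, 1, 15, 9, 0),
    datetime(2023, 1, 15, 17, 45),
    datetime(2023, 1, 20, 8, 5),
    datetime(2024, 1, 3, 12, 0),
]

# Equivalent of GROUP BY CAST(order_time AS DATE)
by_day = Counter(ts.date() for ts in orders)

# Equivalent of GROUP BY EXTRACT(YEAR ...), EXTRACT(MONTH ...)
by_year_month = Counter((ts.year, ts.month) for ts in orders)

# Grouping on the month alone merges January 2023 with January 2024.
by_month_only = Counter(ts.month for ts in orders)

print(len(by_day))            # 3
print(dict(by_year_month))    # {(2023, 1): 3, (2024, 1): 1}
print(dict(by_month_only))    # {1: 4}
```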

Hive SELECT DISTINCT and GROUP BY in a subquery

I am running a query but I'm a little stuck on the concept of subqueries in HiveQL. I am new to Hive and I've done a lot of reading but I still can't get it to work.
So I have a big table in which the fields I'm interested in are created_date and size. I basically want to run an aggregation that sums the sizes of files created in each year, grouped by distinct year.
My current query:
SELECT year(created_date), SUM(size) FROM <tablename> GROUP BY created_date
2001 2654567
2001 231818
2001 1978222
2002 7625332
2002 6272829
2003 2733792
This gives me a list of all the years in the table and a sum for each, as above, but the years are duplicated. This is where I think I need a subquery to SELECT DISTINCT year and then sum the total size too.
Any help will be superb please.
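You might want to try grouping by the year, since that is what you are selecting. Grouping by created_date produces one row per distinct date, which is why each year shows up several times.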
SELECT year(created_date), SUM(size) FROM <tablename> GROUP BY year(created_date)
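The difference between the two GROUP BY clauses can be reproduced with a small SQLite example run from Python. This is only an illustration: the table name files and the dates are invented, the sizes are the ones shown in the question, and SQLite's strftime('%Y', ...) stands in for Hive's year().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (created_date TEXT, size INTEGER)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [
        ("2001-01-10", 2654567),
        ("2001-05-02", 231818),
        ("2001-11-30", 1978222),
        ("2002-03-14", 7625332),
        ("2002-08-21", 6272829),
        ("2003-07-04", 2733792),
    ],
)

# Grouping by the full date: one row per date, so years repeat.
per_date = conn.execute(
    "SELECT strftime('%Y', created_date), SUM(size) "
    "FROM files GROUP BY created_date ORDER BY created_date"
).fetchall()

# Grouping by the year expression: one row per year.
per_year = conn.execute(
    "SELECT strftime('%Y', created_date), SUM(size) "
    "FROM files GROUP BY strftime('%Y', created_date) ORDER BY 1"
).fetchall()

print(len(per_date))  # 6
print(per_year)       # [('2001', 4864607), ('2002', 13898161), ('2003', 2733792)]
```

No subquery or DISTINCT is needed; the grouping expression just has to match what is being selected.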

Querying a data warehouse data involving time dimension

I have two tables for time dimension
date (unique row for each day)
time of the day (unique row for each minute in a day)
Given this schema what would a query look like if one wants to retrieve facts for last X hours where X can be any number greater than 0.
Things start to become tricky when the start time and end time fall on two different days of the year.
EDIT: My Fact table does not have a time stamp column
Fact tables do have (and should have) original timestamp in order to avoid weird by-time queries which happen over the boundary of a day. Weird means having some type of complicated date-time function in the WHERE clause.
In most DWs these types of queries are very rare, but you seem to be streaming data into your DW and using it for reporting at the same time.
So I would suggest:
Introduce the full timestamp in the fact table.
For the old records, re-create the timestamp from the Date and Time keys.
DW queries are all about not having any functions in the WHERE clause, or if a function has to be used, make sure it is SARGABLE.
You would probably be better served by converting the Start Date and End Date columns to TIMESTAMP and populating them.
Slicing the table would require taking the appropriate interval BETWEEN Start Date AND End Date. In Oracle the interval would be something along the lines of SYSDATE - (4/24) or SYSDATE - NUMTODSINTERVAL(4, 'HOUR')
This could also be rewritten as:
Start Date <= (SYSDATE - (4/24)) AND End Date >= (SYSDATE - (4/24))
It seems to me that given the current schema you have, that you will need to retrieve the appropriate time IDs from the time dimension table which meet your search criteria, and then search for matching rows in the fact table. Depending on the granularity of your time dimension, you might want to check the performance of doing either (SQL Server examples):
A subselect:
SELECT X FROM FOO WHERE TIMEID IN (SELECT ID FROM DIMTIME WHERE HOUR >= DATEPART(HOUR, CURRENT_TIMESTAMP)) AND DATEID IN (SELECT ID FROM DIMDATE WHERE DATE = CAST(GETDATE() AS DATE))
An inner join:
SELECT X FROM FOO INNER JOIN DIMTIME ON FOO.TIMEID = DIMTIME.ID INNER JOIN DIMDATE ON FOO.DATEID = DIMDATE.ID WHERE DIMTIME.HOUR >= DATEPART(HOUR, CURRENT_TIMESTAMP) AND DIMDATE.DATE = CAST(GETDATE() AS DATE)
Neither of these is a truly attractive option.
Have you considered that you may be querying against a cube that is intended for roll-up analysis and not necessarily for "last X" analysis?
If this is not a "roll-up" cube, I would agree with the other posters in that you should re-stamp your fact tables with better keys, and if you do in fact intend to search off of hour frequently, you should probably include that in the fact table as well, as any other attempt will probably make the query non-sargable (see What makes a SQL statement sargable?).
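If the fact table really has only date and time-of-day keys, a "last X hours" window that crosses midnight has to be split into one slice per calendar day before it can be matched against the two dimensions. The sketch below shows that splitting step in Python; the function name window_slices and the minute-of-day numbering (0 to 1439) are assumptions made for illustration, not part of the schema described in the question.

```python
from datetime import date, datetime, timedelta


def window_slices(now: datetime, hours: float):
    """Split the window [now - hours, now] into per-day slices.

    Returns a list of (day, first_minute, last_minute) tuples, where the
    minutes are minute-of-day values from 0 to 1439 (seconds are ignored).
    """
    if hours <= 0:
        raise ValueError("hours must be greater than 0")
    start = now - timedelta(hours=hours)
    slices = []
    day = start.date()
    while day <= now.date():
        # Only the first day starts mid-day; only the last day ends mid-day.
        first = start.hour * 60 + start.minute if day == start.date() else 0
        last = now.hour * 60 + now.minute if day == now.date() else 1439
        slices.append((day, first, last))
        day += timedelta(days=1)
    return slices


# Last 4 hours as of 01:30 on 2 January: the window crosses midnight.
for day, first, last in window_slices(datetime(2024, 1, 2, 1, 30), 4):
    print(day, first, last)
# 2024-01-01 1290 1439
# 2024-01-02 0 90
```

Each slice then becomes one "date key = ... AND time key BETWEEN ... AND ..." condition, combined with OR. A full timestamp on the fact table avoids this step entirely.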
Microsoft recommends at http://msdn.microsoft.com/en-us/library/aa902672%28v=sql.80%29.aspx that:
In contrast to surrogate keys used in other dimension tables, date and time dimension keys should be "smart." A suggested key for a date dimension is of the form "yyyymmdd". This format is easy for users to remember and incorporate into queries. It is also a recommended surrogate key format for fact tables that are partitioned into multiple tables by date.
Best luck!

Resources