Query partition with calculation and avoid full table scan - hadoop

I am an analyst trying to build a query that pulls the last 7 days of data from a table in Hadoop. The table itself is partitioned by date.
When I test my query with hard-coded dates, everything works as expected. However, when I write it to calculate the dates from today's timestamp, it does a full table scan and I had to kill the job.
Sample query:
SELECT * FROM target_table
WHERE date >= DATE_SUB(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'),7);
I'd appreciate some advice on how I can revise my query while avoiding a full table scan.
Thank you!

I'm not sure that I have an elegant solution, but since I use Oozie for workflow coordination, I pass in the start_date and end_date from Oozie. In the absence of Oozie I might use bash to calculate the appropriate dates and pass them in as parameters.
Partition filters have always had this problem, so I just found myself a workaround.
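For illustration, a minimal sketch of what the parameterized query might look like, assuming the dates arrive as Hive variables (supplied by an Oozie parameter or a shell wrapper; the variable names are just examples):
SELECT * FROM target_table
WHERE date >= '${hivevar:start_date}'
  AND date <= '${hivevar:end_date}';
Because the partition filter is then a plain string literal at compile time, normal partition pruning applies.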

I have a workaround of my own, and it works for me when the number of days is larger, e.g. 30/60/90/120.
The filter looks like:
(unix_timestamp(date, 'yyyy-MM-dd') >= unix_timestamp(date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(), 'yyyy-MM-dd'), ${sub_days}), 'yyyy-MM-dd')
 and unix_timestamp(date, 'yyyy-MM-dd') <= unix_timestamp(FROM_UNIXTIME(UNIX_TIMESTAMP(), 'yyyy-MM-dd'), 'yyyy-MM-dd'))
sub_days is a parameter passed in; here it could be 7.

Related

Oracle SQL Recursive Query

I have a query regarding this: I have a table update where I have to backfill over a year's worth of data, and due to the code I have to update by day (which takes 4-5 minutes per day). Does anyone know how I can do this more effectively by setting up a list of dates, so I can run it in the background?
So for example, if I set a variable called :reqdate for the date field and I have a list of dates from a query (e.g. 01/01/20, 02/01/20... 04/04/20), is there something I can do to get SQL to run this repeatedly, e.g. :reqdate = 01/01/20, then when that's done it automatically does 02/01/20 and so on?
Thanks
If I understood you correctly, the easiest way is to use a MERGE statement like:
merge into dest_table t
using (
select date'2020-01-01'+N as dt
from xmltable('0 to 10' columns N int path '.')
) dates
on (t.date_col = dates.dt)
when matched then update
set ...
Though I think you should redesign your update into a simple update like:
update (select ... from) t
set ...
where t.dt between date'2020-01-01' and date'2020-01-20'
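For illustration only, here is that single-statement idea with hypothetical names filled in (my_table, load_date and amount_col are placeholders, not from the question):
update my_table t
   set t.amount_col = 0
 where t.load_date between date '2020-01-01' and date '2020-01-20';
One statement over the whole date range lets Oracle do the work in a single pass instead of one 4-5 minute run per day.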

Hive - Using date_add in a where clause

I have a table that is partitioned by date. To query the last 10 days of data I usually write something that looks like:
SELECT * FROM table WHERE date = date_add(current_date, -10);
A coworker said this makes the query less efficient than using a simple date string. Is this the case? Can someone explain this to me? Is there a way to write a dynamic date into the where clause that is efficient?
The only problem here can be with partition pruning. Partition pruning may not work with functions in some Hive versions. You can easily check it yourself by executing an EXPLAIN EXTENDED <your select query> command. It will print all partition paths to be queried.
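For example (using the table and column names exactly as posted in the question), the check might look like:
EXPLAIN EXTENDED
SELECT * FROM table WHERE date = date_add(current_date, -10);
If the output lists only the expected partition directories, pruning works; if it lists every partition in the table, it does not.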
In that case, pre-calculate the value in the shell and pass it as a parameter:
date_var=$(date +'%Y_%m_%d' --date "-10 day")
#call your script
hive -hivevar date_var="$date_var" -f your_script.hql
And use variable in the script:
SELECT * FROM table WHERE date = '${hivevar:date_var}';
And if partition pruning works fine, you do not need to bother at all.

What's the best practice to filter out specific year in query in Netezza?

I am a SQL Server guy and just started working on Netezza. One thing that popped up is a daily query to find out the size of a table filtered by year: 2016, 2015, 2014, ...
What I am using now is something like below and it works for me, but I wonder if there is a better way to do it:
select count(1)
from table
where extract(year from datacolumn) = 2016
extract is a built-in function; applying a function to a table with 10 billion+ rows is unimaginable in SQL Server, to my knowledge.
Thank you for your advice.
The only problem I see with the query is the where clause, which applies a function to the 'variable' side. That effectively disables zone maps and thus forces Netezza to scan all data pages, not only those with data from that year.
Instead write something like:
select count(1)
from table
where datecolumn between '2016-01-01' and '2016-12-31'
A more generic alternative is to create a 'date dimension table' with one row per day covering the date range of your tables (and a couple of years into the future).
This is an example for Postgres: https://medium.com/#duffn/creating-a-date-dimension-table-in-postgresql-af3f8e2941ac
This enables you to write code like this:
select count(1)
from table t join d_date d on t.datecolumn = d.date_actual
where d.year_actual = 2016
You may not have the generate_series() function on your system, but a 'select row_number()...' can do the same trick. A download is available here: https://www.ibm.com/developerworks/community/wikis/basic/anonymous/api/wiki/76c5f285-8577-4848-b1f3-167b8225e847/page/44d502dd-5a70-4db8-b8ee-6bbffcb32f00/attachment/6cb02340-a342-42e6-8953-aa01cbb10275/media/generate_series.tgz
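As a rough sketch of that row_number() trick (any_big_table and some_col are placeholders for any table with at least as many rows as you need days; adjust the start date and row count to your range), the date column can be generated like this, and the remaining dimension columns such as year_actual derived from it:
select date '2014-01-01' + row_number() over (order by some_col) - 1 as date_actual
from any_big_table
limit 3653;  -- roughly ten years of days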
A couple of further notes on 'date interval' where clauses:
Those columns are the most likely candidates for zone map optimization. Add an 'organize on (datecolumn)' clause at the bottom of your table DDL and reorganize the table. That will cause Netezza to move records around so that pages hold similar dates, and query times will improve.
Furthermore, you should ensure that the 'distribute on' clause for the table results in an even distribution across data slices if the table is big. The execution of the query will never be faster than the slowest data slice.
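A rough DDL sketch tying both notes together (event_fact, cust_id and amount are hypothetical names; datecolumn is from the question; the groom step that reorders already-loaded rows may vary by Netezza version):
create table event_fact (
    cust_id    integer,
    datecolumn date,
    amount     numeric(18,2)
)
distribute on (cust_id)
organize on (datecolumn);

groom table event_fact;  -- moves existing rows onto date-ordered pages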
I hope this helps

"group by day" in Oracle doesn't appear to group by date

My query looks something like this:
select datesent, count(*) the_count from receivedmessaged where status=5000
and datesent>(to_date('20130101', 'YYYYMMDD')) group by datesent
What I'm looking for is a table that has the count of messages with a status of 5000 per day, newer than a certain date. What I'm getting is a table with the same dates over and over. What I think is happening is that there is a hidden time part in that datesent field, and it's grouping the entries by the exact time they were sent, rather than just looking at the date. Can anyone confirm this and tell me how I can fix it? Thanks!
What I think is happening is that there is a hidden time part in that datesent field, and it's grouping the entries by the exact time they were sent, rather than just looking at the date.
That's very probably what's happening. So try this:
select TRUNC(datesent), count(*) the_count from receivedmessaged where status=5000
and datesent>(to_date('20130101', 'YYYYMMDD')) group by TRUNC(datesent)
TRUNC will remove the "time part" and allow you to group by day.
Please note that the use of TRUNC will prevent an ordinary index on datesent from being used. Take a look at your execution plan, and if needed, add a function-based index on TRUNC(datesent).
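For instance, a function-based index might look like this (the index name is arbitrary):
create index receivedmessaged_trunc_idx on receivedmessaged (trunc(datesent));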
Of course, using TRUNC would solve your issue, and using a function-based index would make it efficient.
However, from 11g onwards, you could also use VIRTUAL columns. In your case, you can add a virtual column such as new_date, defined as GENERATED ALWAYS AS (TRUNC(datesent)). You just need to use this virtual column in your query. For performance improvement, if required, you could create an index on it.
NOTE: Indexes defined against virtual columns are equivalent to function-based indexes.
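A sketch of that approach (the column and index names are arbitrary):
alter table receivedmessaged
  add (datesent_day date generated always as (trunc(datesent)) virtual);

create index receivedmessaged_day_idx on receivedmessaged (datesent_day);

select datesent_day, count(*) the_count
  from receivedmessaged
 where status = 5000
   and datesent_day > date '2013-01-01'
 group by datesent_day;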

How to improve the performance of my query? / My query is running slow.

The workitem_routing_stats table has around 1,000,000 records. All records are accessed, which is why we are using a full scan hint. It takes around 25 seconds to execute; is there any way to tune this query?
SELECT /*+ full(wrs) */
       wrs.NODE_ID,
       wrs.bb_id,
       SUM(CASE WHEN wrs.START_TS >= (SYSTIMESTAMP - NUMTODSINTERVAL(7, 'day'))
                 AND wrs.END_TS <= SYSTIMESTAMP THEN wrs.WORKITEM_COUNT END) outliers_last_sevend,
       SUM(CASE WHEN wrs.START_TS >= (SYSTIMESTAMP - NUMTODSINTERVAL(30, 'day'))
                 AND wrs.END_TS <= SYSTIMESTAMP THEN wrs.WORKITEM_COUNT END) outliers_last_thirtyd,
       SUM(CASE WHEN wrs.START_TS >= (SYSTIMESTAMP - NUMTODSINTERVAL(90, 'day'))
                 AND wrs.END_TS <= SYSTIMESTAMP THEN wrs.WORKITEM_COUNT END) outliers_last_ninetyd,
       SUM(wrs.WORKITEM_COUNT) outliers_year
  FROM workitem_routing_stats wrs
 WHERE wrs.START_TS BETWEEN (SYSTIMESTAMP - NUMTODSINTERVAL(365, 'day')) AND SYSTIMESTAMP
   AND wrs.END_TS BETWEEN (SYSTIMESTAMP - NUMTODSINTERVAL(365, 'day')) AND SYSTIMESTAMP
 GROUP BY wrs.NODE_ID, wrs.bb_id;
You could range-partition the table monthly on the START_TS column (the query would then scan only the year you are interested in).
Secondly (not a very elegant solution), you could add a parallel(wrs 4) hint if your storage is powerful.
You can combine these two things.
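A partitioning sketch under those assumptions (column types are guessed from the query, and interval partitioning needs 11g or later):
create table workitem_routing_stats (
    node_id        number,
    bb_id          number,
    start_ts       timestamp,
    end_ts         timestamp,
    workitem_count number
)
partition by range (start_ts)
interval (numtoyminterval(1, 'MONTH'))
(partition p_old values less than (timestamp '2013-01-01 00:00:00'));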
A full scan is going to be painful in any case...
However, you may avoid some computation if you simply put in the proper numbers instead of calling the conversion functions:
(SYSTIMESTAMP-numtodsinterval(365,'day'))
should just be the same as
(SYSTIMESTAMP-365)
This should remove the overhead of calling the function and parsing the parameter string ('day').
One other possibility: it seems that this data keeps receiving new timestamps as of today, but the rest is just history...
If that is the case, you could add a summary table to hold the summarized historic information, query the current table only for the recent rows, and UNION the summary table for the older rows.
You will then need to think through the job or other scheduled process that keeps the summaries populated, but it would save a ton of time in this query.
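A rough sketch of that union pattern, assuming a pre-aggregated table named workitem_routing_stats_summary (hypothetical, maintained by a scheduled job); only the yearly total is shown to keep the example short:
select node_id, bb_id, sum(workitem_count) outliers_year
from (
  select node_id, bb_id, workitem_count
    from workitem_routing_stats
   where start_ts >= systimestamp - 30           -- recent, still-changing rows
  union all
  select node_id, bb_id, workitem_count
    from workitem_routing_stats_summary          -- pre-summarized history
   where start_ts >= systimestamp - 365
     and start_ts <  systimestamp - 30
)
group by node_id, bb_id;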
