I have a table that is partitioned by date. To query the last 10 days of data, I usually write something that looks like:
SELECT * FROM table WHERE date >= date_add(current_date, -10);
A coworker said this makes the query less efficient than using a simple date string. Is this the case? Can someone explain this to me? Is there a way to write a dynamic date into the where clause that is efficient?
The only problem here can be with partition pruning. Partition pruning may not work with functions in some Hive versions. You can easily check it yourself by executing the EXPLAIN EXTENDED <your select query> command: it will print all partition paths to be queried.
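For example, using the query from the question (the table and column names are the asker's):
EXPLAIN EXTENDED
SELECT * FROM table WHERE date >= date_add(current_date, -10);
If pruning works, the output lists only the ten matching partition paths; if every partition of the table shows up, the function defeated pruning.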
In this case, use a value pre-calculated in the shell and pass it as a parameter:
date_var=$(date --date="-10 day" +'%Y-%m-%d')
#call your script
hive -hivevar date_var="$date_var" -f your_script.hql
And use the variable in the script:
SELECT * FROM table WHERE date >= '${hivevar:date_var}';
And if partition pruning works fine, you do not need to bother at all.
Related
Have a query regarding this - I have a table update where I have to backfill over a year's worth of data, and due to the code I have to update by day (which takes 4-5 minutes per day). Does anyone know how I can do this more effectively by setting a list of dates, so I can run this in the background?
So, for example, if I set a variable called :reqdate, which is the date field, and I have a list of dates from a query (e.g. 01/01/20, 02/01/20... 04/04/20), is there something I can do to get SQL to run this repeatedly, e.g. :reqdate=01/01/20, then when that's done it automatically does 02/01/20, and so on?
Thanks
If I understood you correctly, the easiest way is to use a merge clause like:
merge into dest_table t
using (
    select date '2020-01-01' + N as dt
    from xmltable('0 to 10' columns N int path '.')
) dates
on (t.date_col = dates.dt)
when matched then update
set ...
Though I think you would do better to redesign your update as a simple update, like:
update (select ... from ...) t
set ...
where t.dt between date '2020-01-01' and date '2020-01-20'
I have a use case of selecting data from a large Hive table partitioned on date (format: yyyyMMdd); the query has to fetch a few fields from 6 months of data (180 date partitions in total). Currently the query looks like:
SELECT field_1, field_2
FROM table
WHERE `date` BETWEEN '20181125' and '20190525'
I wanted to know if changing the query to use >= & <= makes any difference in terms of performance:
SELECT field_1, field_2
FROM table
WHERE `date`>='20181125' AND `date`<='20190525'
I cannot think of any significant change in performance from using >= & <= instead of the BETWEEN keyword. However, using the IN keyword and listing all the dates in the range will have a slight advantage over the other two scenarios:
SELECT field_1, field_2 FROM table WHERE `date` IN ('20181125','20181126',...,'20190524','20190525');
>=, <= and BETWEEN should generate the same execution plans, though it may be different in your Hive version.
Use EXPLAIN; it shows the query execution plan, and only the plan can answer this question for sure. Check EXPLAIN DEPENDENCY: it prints the input_partitions to be scanned, so you will see whether partition pruning works in each case.
If the plans are the same for >=, <=, BETWEEN and IN, then they work the same and the performance should be the same.
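For example, against the query from the question (output shape varies across Hive versions, but the idea holds):
EXPLAIN DEPENDENCY
SELECT field_1, field_2 FROM table WHERE `date` BETWEEN '20181125' AND '20190525';
Run the same command for the >=/<= and IN variants and compare the input_partitions lists; identical lists mean identical scans.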
I am a SQL Server guy and just started working on Netezza. One thing that pops up for me is a daily query to find out the size of a table filtered by year: 2016, 2015, 2014, ...
What I am using now is something like the query below, and it works for me, but I wonder if there is a better way to do it:
select count(1)
from table
where extract(year from datecolumn) = 2016
extract is a built-in function, and applying a function to a table with 10 billion+ rows is unimaginable in SQL Server, to my knowledge.
Thank you for your advice.
The only problem I see with the query is the where clause, which applies a function on the 'variable' side. That effectively disables zone maps and thus forces Netezza to scan all data pages, not only those with data from that year.
Instead write something like:
select count(1)
from table
where datecolumn between '2016-01-01' and '2016-12-31'
A more generic alternative is to create a 'date dimension table' with one row per day covered by your tables (and a couple of years into the future).
This is an example for Postgres: https://medium.com/@duffn/creating-a-date-dimension-table-in-postgresql-af3f8e2941ac
This enables you to write code like this:
select count(1)
from table t join d_date d on t.datecolumn = d.date_actual
where d.year_actual = 2016
You may not have the generate_series() function on your system, but a 'select row_number()...' can do the same trick. A download is available here: https://www.ibm.com/developerworks/community/wikis/basic/anonymous/api/wiki/76c5f285-8577-4848-b1f3-167b8225e847/page/44d502dd-5a70-4db8-b8ee-6bbffcb32f00/attachment/6cb02340-a342-42e6-8953-aa01cbb10275/media/generate_series.tgz
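A minimal sketch of that row_number() trick, assuming you have any table with at least 366 rows to drive it (big_table below is just a placeholder):
select date '2016-01-01' - 1 + row_number() over (order by dummy) as date_actual
from (select 1 as dummy from big_table limit 366) t;
Insert the result into the date dimension table together with whatever derived columns (year_actual etc.) you need.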
A couple of further notes on 'date interval' where clauses:
Those columns are the most likely candidates for zone map optimization. Add 'organize on (datecolumn)' at the bottom of your table DDL and organize the table. That will cause Netezza to move records to pages with similar dates, and query times will improve.
Furthermore, you should ensure that the 'distribute on' clause for the table results in an even distribution across data slices if the table is big. The execution of the query will never be faster than the slowest data slice.
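A hedged sketch of such a DDL (the table and column names are placeholders; on an existing table, a GROOM pass is what makes the reorganization actually happen):
create table fact_table (
    id         bigint,
    datecolumn date
)
distribute on (id)
organize on (datecolumn);
groom table fact_table records all;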
I hope this helps
I am an analyst trying to build a query to pull the last 7 days of data from a table in Hadoop. The table itself is partitioned by date.
When I test my query with hard-coded dates, everything works as expected. However, when I write it to calculate based on today's timestamp, it does a full table scan and I had to kill the job.
Sample query:
SELECT * FROM target_table
WHERE date >= DATE_SUB(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'),7);
I'd appreciate some advice on how I can revise my query while avoiding a full table scan.
Thank you!
I'm not sure that I have an elegant solution, but since I use Oozie for workflow coordination, I pass in the start_date and end_date from Oozie. In the absence of Oozie, I might use bash to calculate the appropriate dates and pass them in as parameters.
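A sketch of that bash variant, in the same spirit as the hivevar example earlier in this thread (your_script.hql is a placeholder):
start_date=$(date --date="-7 day" +'%Y-%m-%d')
end_date=$(date +'%Y-%m-%d')
hive -hivevar start_date="$start_date" -hivevar end_date="$end_date" -f your_script.hql
and inside the script:
SELECT * FROM target_table
WHERE date BETWEEN '${hivevar:start_date}' AND '${hivevar:end_date}';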
Partition filters have always had this problem, so I just found myself a workaround.
I had a workaround of my own, and it works for me when the number of days is more than 30/60/90/120.
The query looks like:
unix_timestamp(date, 'yyyy-MM-dd') >= unix_timestamp(date_sub(from_unixtime(unix_timestamp(), 'yyyy-MM-dd'), ${sub_days}), 'yyyy-MM-dd')
and unix_timestamp(date, 'yyyy-MM-dd') <= unix_timestamp(from_unixtime(unix_timestamp(), 'yyyy-MM-dd'), 'yyyy-MM-dd')
sub_days is a parameter passed in; here it would be 7.
Hi, I have a partitioned table, and when I try to update a few selected partitions in a loop, passing the partition name dynamically, it is not working:
FOR i IN 1..partition_tbl.count LOOP
    UPDATE cdr_data PARTITION(partition_tbl(i)) cdt
    SET a = 'B'
    WHERE cdt.ab = 'c';
END LOOP;
The partition_tbl object holds all the partitions on which I want to perform this update.
Please suggest how to proceed here.
Thanks in advance
What is the problem that you are trying to solve? It doesn't make sense to run separate UPDATE statements against each partition in a loop. If you really want to update every row in the table where ab = 'c', just issue a single UPDATE statement:
UPDATE cdr_data cdt
SET a = 'B'
WHERE ab = 'c'
potentially with a PARALLEL hint that would allow Oracle to update multiple partitions in parallel.
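A minimal sketch of that (the degree of parallelism, 8, is arbitrary, and parallel DML has to be enabled in the session for the update itself to run in parallel):
ALTER SESSION ENABLE PARALLEL DML;
UPDATE /*+ PARALLEL(cdt, 8) */ cdr_data cdt
SET a = 'B'
WHERE ab = 'c';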
If you really, really want to update each partition independently, it would make much more sense to do so based on the partition keys. For example, if your table has daily partitions based on a date
FOR i IN 1 .. <<number of daily partitions>>
LOOP
UPDATE cdr_data cdt
SET a = 'B'
WHERE ab = 'c'
AND partition_key = <<minimum date>> + i;
END LOOP;
Using the partition( <<partition name>> ) syntax is an absolute last resort. If you're really determined to go down that path, you'd need to use dynamic SQL, constructing the SQL statement in the loop and using EXECUTE IMMEDIATE or dbms_sql to execute it.
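A minimal sketch of that last resort, reusing the loop from the question (a partition name cannot be a bind variable, so it has to be concatenated into the statement text):
FOR i IN 1 .. partition_tbl.COUNT LOOP
    EXECUTE IMMEDIATE
        'UPDATE cdr_data PARTITION (' || partition_tbl(i) || ') cdt'
        || ' SET a = ''B'''
        || ' WHERE cdt.ab = ''c''';
END LOOP;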
Preferably, let Oracle take care of partitions - pretend in your statement that they do not exist:
UPDATE cdr_data cdt SET A='B' WHERE cdt.ab='c'
From the where conditions and your partition definitions, it will choose the right partition(s) to apply the command to.
There may be rare occasions when you need partition-bounded DML, but this is certainly not such an example. In those situations you can't provide the partition name dynamically, just as you can't normally provide a table name dynamically; e.g. you can't do
select * from _variable_containing_table_name
If you really insist on a partition-bounded command, then it would be
select * from table_name partition (partition_Name)
e.g.
select * from bills partition (p201403)
To use a dynamic partition name, the whole statement would have to be executed dynamically via execute immediate or dbms_sql.
But once again, do not choose the partition; Oracle will.