Specifying a string-to-date datatype in a PySpark filter

I want to filter a parquet partitioned by date.
When I apply the filter
.filter(col('DATE') >= '2020-08-01')
It casts the value 2020-08-01 as a string when doing the filtering, as shown in the physical plan below. I have read that this is inefficient and results in a full file scan.
PartitionFilters: [isnotnull(DATE#5535), (cast(DATE#5535 as string) >= 2020-08-01)]
How do I cast the string as a date in the filter clause? All the examples I have found online use to_date, but that works only on columns.
Is this possible, or even worth it?
Please advise.
Thank You

Try this:
import pyspark.sql.functions as F
.filter(F.expr("`DATE` >= to_date('2020-08-01', 'yyyy-MM-dd')"))
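An equivalent way to keep the comparison typed as a date, without expr, is to build a date literal (a sketch; df and the DATE column name are assumed from the question):
import datetime
import pyspark.sql.functions as F
# Comparing against a typed date literal leaves DATE as a date in the plan,
# so the partition filter can prune files instead of casting the column to string.
df.filter(F.col('DATE') >= F.to_date(F.lit('2020-08-01'), 'yyyy-MM-dd'))
# A Python datetime.date is likewise passed to Spark as a date literal:
df.filter(F.col('DATE') >= datetime.date(2020, 8, 1))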

Related

Hive: filtering data between specified dates in string format. Which one is optimised and why?

Which one should be preferred, and why? My_date_column is of type string, in the format YYYY-MM-DD.
SELECT *
FROM my_table
WHERE my_date_column >= '2019-08-31' AND your_date_column <= '2019-09-02';
OR
SELECT *
FROM my_table
WHERE my_date_column in ('2019-08-31','2019-09-01','2019-09-02') ;
Lastly, in general, should I store my dates as a DATE data type or simply as strings? I chose string simply to handle any corrupt/badly formatted data.
Always store dates as a date type. If you store them as strings, it adds overhead to any kind of arithmetic with the values (including those inequalities).
If you decide to ignore this advice and store dates as strings anyway, then I believe the second form (the IN list) will be faster, as it compares text against text. The other option, using inequalities, may not give the results you expect unless you explicitly CAST(my_date_column AS DATE) and CAST(your_date_column AS DATE). Those casts are applied to every record, which adds to the total cost of the query and would be avoidable if the columns were stored as dates instead.
A third option you didn't mention is the BETWEEN operator (i.e. WHERE my_date BETWEEN start_date AND end_date). This is the same as using the two inequalities, but probably a bit cleaner and more idiomatic.
When in doubt, take a look at the EXPLAIN plans to understand how the query will be executed.
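Hive accepts EXPLAIN directly in front of a query; if the table is also reachable from a SparkSession, here is a quick sketch in PySpark (my_table and the column names are taken from the question, and spark is the usual session object):
# Print the plan for the BETWEEN form:
spark.sql("EXPLAIN SELECT * FROM my_table "
          "WHERE my_date_column BETWEEN '2019-08-31' AND '2019-09-02'").show(truncate=False)
# And for the IN-list form, via the DataFrame API:
spark.sql("SELECT * FROM my_table "
          "WHERE my_date_column IN ('2019-08-31', '2019-09-01', '2019-09-02')").explain()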

Oracle : Want to convert Substring to a useable, sortable date

First post, go easy on me.
I'm using this SUBSTR to pull part of a field. The date is, I assume, in a non-standard format (ddmmmyy). How can I enhance this command so that I can use it as a sortable date field? I'm guessing CAST, but I have no idea of the syntax.
SELECT SUBSTR(Host_Name,-9) as Decom_Date
Output
DECOM_DATE
31Oct2018
31May2018
31May2018
31Mar2017
31Jul2018
TIA
This is exactly what the TO_DATE function is designed for:
SELECT TO_DATE(SUBSTR(Host_Name,-9), 'DDMonYYYY') as Decom_Date
It doesn't affect you here, but bear in mind that Oracle dates can only store down to one-second precision. Also, if you have any rogue data in the table that can't be parsed as a date, you'll get "not a valid..." or "a non-numeric was found where a numeric was expected".
Be mindful that your strings here are in English, but parsing Mon (the three-letter month name) is locale-dependent, so this code might not work on a server with different NLS settings. For example, consider passing 'NLS_DATE_LANGUAGE = American' as the third argument to TO_DATE if you know your strings will always use English month names.
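A minimal sketch of that NLS-safe form from Python, using python-oracledb (the connection details and the hosts table name are placeholders, not from the question):
import oracledb
# Placeholder credentials/DSN; adjust for your environment.
conn = oracledb.connect(user="app", password="secret", dsn="dbhost/orclpdb1")
cur = conn.cursor()
# The third TO_DATE argument pins month-name parsing to English.
cur.execute(
    "SELECT TO_DATE(SUBSTR(host_name, -9), 'DDMonYYYY', "
    "'NLS_DATE_LANGUAGE = American') AS decom_date "
    "FROM hosts ORDER BY decom_date"
)
for (decom_date,) in cur:
    print(decom_date)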

Trying to return the maximum value of a filtered date column in Power BI

I have a table within Power BI that has a date field and a value field. I am filtering on this date field, using a slicer, to sum all of the value data before the specified date. I would like to get this date value to use in a LOOKUPVALUE() elsewhere (to get a conversion rate).
Is there a way to accomplish this?
I have tried the DAX functions that return the values of a particular table/column with filters preserved (e.g. VALUES(), FILTERS(), ALLEXCEPT()), but this never seems to work and just returns the entire dataset.
Any help would be greatly appreciated!
I found a solution using measures.
The DAX, for future reference:
Filter Date = CALCULATE(MAX('Table'[Date]),ALLSELECTED('Table'))

Need help in Hive on date functions

I am doing functionality testing in my project (writing Hive code while referring to Scala code), and I am having an issue with the date functions. In Scala we cast our date data type to a string and changed its structure to 'YYYYMM', so the values in my date column look like 201706 (YYYYMM), which is not accepted in Hive (I read that it accepts only YYYY-MM-DD).
My questions are:
1) How do I change YYYYMM to YYYY-MM-DD? I have tried casting to date and also UNIX_TIMESTAMP, but neither of them works; the query fails at the end.
2) Our Scala code also uses filter(to_date(colm1, "yyyyMM").between(add_months(to_date(colm2, "yyyyMM"), -27), add_months(to_date(colm2, "yyyyMM"), -2))). How can I translate that to Hive? I am unable to get any ideas.
Thanks in advance.
Regards,
M Sontosh Aditya
Use:
from_unixtime(unix_timestamp(colm1, 'yyyyMM'), 'yyyy-MM-dd')
For further understanding, please refer to the Hive date functions documentation.
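For the second question, a hedged PySpark equivalent of the Scala filter (column names colm1 and colm2 come from the question; in Hive itself the same shape works by wrapping each column in from_unixtime(unix_timestamp(..., 'yyyyMM'), 'yyyy-MM-dd') inside a BETWEEN with add_months):
import pyspark.sql.functions as F
# Parse the yyyyMM strings to dates, then keep rows where colm1 falls
# between 27 and 2 months before colm2, mirroring the Scala filter above.
df.filter(
    F.to_date(F.col('colm1'), 'yyyyMM').between(
        F.add_months(F.to_date(F.col('colm2'), 'yyyyMM'), -27),
        F.add_months(F.to_date(F.col('colm2'), 'yyyyMM'), -2),
    )
)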

How to get the week day name from a date in Apache pig?

Given "03/09/1982" how can we say it is which week day. In this case it will be "Tue".
Is it possible to get in a single query?
Thanks
You can convert this string into a date object using ToDate(), then back into a string with the desired format using ToString(). Don't forget that Pig uses the Java SimpleDateFormat class to deal with dates, so the pattern must match how the string is written; to get "Tue" for "03/09/1982" the string has to be read month-first (dd/MM/yyyy would make it 3 September 1982, a Friday):
ToString( ToDate('03/09/1982','MM/dd/yyyy'), 'EEE' )
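For reference, the same month-first reading in plain Python confirms the expected weekday (a throwaway check, not Pig):
import datetime
# '03/09/1982' read month-first is 9 March 1982, which was a Tuesday.
print(datetime.datetime.strptime('03/09/1982', '%m/%d/%Y').strftime('%a'))  # Tue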
