hive pyspark date comparison - hadoop

I am trying to translate a HiveQL query into PySpark. I am filtering on dates and getting different results, and I would like to know how to make the PySpark behaviour match Hive's. The Hive query is:
SELECT COUNT(zip_cd) FROM table WHERE dt >= '2012-01-01';
In PySpark I am entering this into the interpreter:
import pyspark.sql.functions as psf
import datetime as dt
from pyspark.sql import HiveContext
hc = HiveContext(sc)
table_df = hc.table('table')
DateFrom = dt.datetime.strptime('2012-01-01', '%Y-%m-%d')
table_df.filter(psf.trim(table_df.dt) >= DateFrom).count()
I am getting similar, but not the same, results in the two counts. Does anyone know what is going on here?

Your code first creates a datetime object from the date 2012-01-01. During filtering, that object is replaced with its string representation (2012-01-01 00:00:00), and the dates are compared lexicographically, which filters out 2012-01-01:
>>> '2012-01-01' >= '2012-01-01 00:00:00'
False
>>> '2012-01-02' >= '2012-01-01 00:00:00'
True
To achieve the same result as the Hive query, just remove the strptime conversion and compare the dates as strings.
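For example, a minimal sketch of the corrected filter, assuming the same table_df and dt column as above:
import pyspark.sql.functions as psf
# Compare the trimmed string column against a plain date string,
# exactly as the Hive predicate does; no datetime object is involved.
table_df.filter(psf.trim(table_df.dt) >= '2012-01-01').count()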

Related

Oracle: filter query on datetime

I need to restrict a query with a
SELECT ... FROM ...
WHERE my_date=(RESULT FROM A SELECT)
... ;
To achieve that, I am using a timestamp as the result of the inner select (if I use a datetime instead, I get nothing from my select, probably because the format I am using truncates the datetime at the second).
Sadly this is not working, because queries of this kind:
select DISTINCT TO_DATE(TO_TIMESTAMP(TO_DATE('25-10-2017 00:00', 'dd-MM-yyyy HH24:MI'))) from DUAL;
return an
ORA-01830: date format picture ends before converting entire input string
How do I deal with the timestamp-to-date conversion?
If you want to compare and check only the dates, use trunc on both the LHS and the RHS.
SELECT ... FROM ...
WHERE trunc(my_date)=(select trunc(RESULT) FROM A)
... ;
This will compare only the dates, by truncating the timestamp values.
You can combine the TRUNC and IN keywords in your query to achieve what you are expecting. Please check the query sample below as a reference.
SELECT * FROM customer WHERE TRUNC(last_update_dt) IN (select DISTINCT (TRUNC(last_update_dt)) from ... )
Cheers !!

Date format in Oracle - fetching dates in a certain range

I have a date table in my Oracle DB. When I run a query I get the date as '01-05-2015', but when I run a similar query in BIRT, I get it as '01-MAY-2015 12:00 AM'. How can I get the date in dd/mm/yyyy format while keeping the data type of the date field as date?
Here is a sample of my data:
EQ_DT
05-07-2015
06-06-2015
15-02-2015
19-09-2015
28-12-2015
My query is:
select to_date(to_char(to_date(enquiry_dt,'DD/MM/YYYY'),'DD/MM/YY'),'DD/MM/YY') as q from xxcus.XXACL_SALES_ENQ_DATAMART where to_date(to_char(to_date(enquiry_dt,'DD/MM/YY'),'DD/MM/YY'),'DD/MM/YY')>'21-06-2012' order by q
I am also getting a NOT A VALID MONTH error.
If enquiry_dt is already a date column, why are you trying to convert it to date (and then to char and to date again)?
SELECT to_char(enquiry_dt, 'DD/MM/YYYY') AS q
FROM xxcus.xxacl_sales_enq_datamart
WHERE enquiry_dt > to_date('21-06-2012', 'dd-mm-yyyy')
ORDER BY enquiry_dt
In BIRT, where you place the field on the report, set the field type to date. Then in the properties for that field, go to Format DateTime and specify the date format you want for that field.
I prefer to always pass date parameters to BIRT as strings, using a known date format. This applies to report parameters as well as to DataSet parameters.
Then, inside the query, I convert to date like this:
with params as
( select to_date(pi_start_date_str, 'DD.MM.YYYY') as start_date_incl,
to_date(pi_end_date_str, 'DD.MM.YYYY') + 1 as end_date_excl
from dual
)
select whatever
from my_table, params
where ( my_table.event_date >= params.start_date_incl
and
my_table.event_date < params.end_date_excl
)
This works independent of the time of day.
This way, e.g. to select all events for january 2016, I could pass the query parameters '01.01.2016' and '31.01.2016' (I'm using german date format here).

Parsing date format to join in hive

I have a date field which is of type String and in the format:
03/11/2001
And I want to join it with another column, which is in a different String format:
1855-05-25 12:00:00.0
How can I join both columns efficiently in hive, ignoring the time part of the second column?
My query looks like below:
LEFT JOIN tabel1 t1 ON table2.Date=t1.Date
Since the two date values are in different formats, you need to apply date functions to both and convert them to the same type in your join query. It would be something like this:
LEFT JOIN tabel1 t1 ON unix_timestamp(table2.Date, 'yyyy-MM-dd HH:mm:ss.S') = unix_timestamp(t1.Date, 'MM/dd/yyyy')
You can refer to the Hive documentation on built-in date functions.
Convert the dates into the same format:
to_date(table2.date) = to_date(t1.date)
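If you are doing the same join from PySpark rather than Hive, a rough sketch of the same idea with to_date and explicit patterns could look like this (table2_df and t1_df are assumed DataFrame names for the two tables; the format-to-column mapping follows the answer above):
import pyspark.sql.functions as F
# Normalise both string columns to dates before joining, which drops
# the time part of the 'yyyy-MM-dd HH:mm:ss.S' column.
joined = table2_df.join(
    t1_df,
    F.to_date(table2_df['Date'], 'yyyy-MM-dd HH:mm:ss.S') ==
    F.to_date(t1_df['Date'], 'MM/dd/yyyy'),
    'left')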

Get the sysdate -1 in Hive

Is there any way to get the current date - 1 in Hive, i.e. always yesterday's date?
And in this format: 20120805?
Today is Aug 6th, so I can run my query like this to get the data for yesterday's date:
select * from table1 where dt = '20120805';
But when I try it this way with the date_sub function to get yesterday's date (the table is partitioned on the date (dt) column):
select * from table1 where dt = date_sub(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(),
'yyyyMMdd')) , 1) limit 10;
it looks for the data in all the partitions. Why? Am I doing something wrong in my query?
How can I make the evaluation happen in a subquery so the whole table isn't scanned?
Try something like:
select * from table1
where dt >= from_unixtime(unix_timestamp()-1*60*60*24, 'yyyyMMdd');
This works if you don't mind that Hive scans the entire table. from_unixtime is not deterministic, so the query planner in Hive won't optimize it for you. In many cases (for example log files), not specifying a deterministic partition key can cause a very large Hadoop job to start, since it will scan the whole table rather than just the rows with the given partition key.
If this matters to you, you can launch hive with an additional option
$ hive -hiveconf date_yesterday=20150331
And in the script or hive terminal use
select * from table1
where dt >= ${hiveconf:date_yesterday};
The name of the variable doesn't matter, nor does the value; in this case you can set it to the prior date using unix commands. In the specific case of the OP:
$ hive -hiveconf date_yesterday=$(date --date yesterday "+%Y%m%d")
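If you launch Hive from a script rather than typing the command by hand, you can compute the same value in Python first; a small sketch (the query text here is just an example):
import datetime
# Yesterday's date rendered in the partition format used above (yyyyMMdd).
date_yesterday = (datetime.date.today() - datetime.timedelta(days=1)).strftime('%Y%m%d')
# Substitute it into the query so the partition predicate is a plain constant string.
query = "select * from table1 where dt >= '{0}'".format(date_yesterday)
print(query)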
In MySQL:
select DATE_FORMAT(curdate()-1,'%Y%m%d');
In SQL Server:
SELECT convert(varchar,getDate()-1,112)
Use this query:
SELECT FROM_UNIXTIME(UNIX_TIMESTAMP()-1*24*60*60,'yyyyMMdd');
It looks like DATE_SUB assumes the date is in the format yyyy-MM-dd, so you might have to do some more format manipulation to get to your format. Try this:
select * from table1
where dt = FROM_UNIXTIME(
             UNIX_TIMESTAMP(
               DATE_SUB(FROM_UNIXTIME(UNIX_TIMESTAMP(), 'yyyy-MM-dd'), 1),
               'yyyy-MM-dd'),
             'yyyyMMdd')
limit 10;
Use this:
select * from table1
where dt = date_format(
             concat(year(date_sub(current_timestamp, 1)), '-',
                    month(date_sub(current_timestamp, 1)), '-',
                    day(date_sub(current_timestamp, 1))),
             'yyyyMMdd')
limit 10;
This will give a deterministic result (a string) of your partition.
I know it's super verbose.

DATEDIFF (in months) in linq

select col1,col2,col3 from table1
where (DATEDIFF(mm, tblAccount.[State Change Date], GETDATE()) <= 4)
I want to convert this SQL query to LINQ, but I don't know of a DateDiff alternative in LINQ. Can you please suggest one?
You're looking for SqlMethods.DateDiffMonth.
EDIT: In EF4, use SqlFunctions.DateDiff.
Putting aside your original question for a moment, in your query you use:
where (DATEDIFF(mm, tblAccount.[State Change Date], GETDATE()) <= 4)
This query would always cause a full table scan, since you're comparing the result of a function call against a constant. It would be much better if you calculate your date first, then compare your column value against the calculated value, which would allow SQL to use an index to find the results instead of having to evaluate every record in your table.
It looks like you're trying to retrieve anything within the past 4 months, so in your application code, try calculating the date that you can compare against first, and pass that value into your Linq2Entities expression:
DateTime earliestDate = new DateTime(DateTime.Now.Year, DateTime.Now.Month, 1).AddMonths(-4);
var results = from t in context.table1
where t.col3 >= earliestDate
select t;
In EF6, the class to use is DbFunctions. See, for example, DbFunctions.DiffMonths.
I solved this problem in another manner: I calculated the date from code:
theDate = Date.Now.AddMonths(-4)
And in EF change the condition:
tblAccount.[State Change Date] < theDate
