Hive: filtering data between specified dates in string format. Which one is optimised and why?

Which one should be preferred, and why? my_date_column is of type string, in the format YYYY-MM-DD:
SELECT *
FROM my_table
WHERE my_date_column >= '2019-08-31' AND your_date_column <= '2019-09-02';
OR
SELECT *
FROM my_table
WHERE my_date_column in ('2019-08-31','2019-09-01','2019-09-02') ;
Lastly, in general, should I be storing my dates as a date data type or simply as strings? I chose the string type simply to handle any corrupt/badly formatted data.

Always store dates as a date type. If you store them as strings, it adds overhead to do any kind of arithmetic with the values (including those inequalities).
If you decide to ignore this advice and store dates as strings anyway, both forms work here: the IN list is a straightforward text-against-text comparison, and the inequalities also behave correctly for this particular format, because zero-padded YYYY-MM-DD strings sort lexicographically in the same order as the dates they represent. For any other string format you would need an explicit CAST(my_date_column as date) and CAST(your_date_column as date); those casts would be applied to every record, which adds to the total cost of the query and would be avoidable if those columns were stored as date instead.
A third option that you didn't mention is the BETWEEN operator (i.e. WHERE my_date_column BETWEEN start_date AND end_date). This is the same as using the two inequalities, but probably a bit cleaner and more idiomatic.
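For example, the first query above could be rewritten as follows (assuming both bounds are meant to apply to my_date_column):
SELECT *
FROM my_table
WHERE my_date_column BETWEEN '2019-08-31' AND '2019-09-02';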
When in doubt, take a look at the EXPLAIN plans to understand how the query will be executed.
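In Hive that just means prefixing the query with the EXPLAIN keyword, e.g.:
EXPLAIN
SELECT *
FROM my_table
WHERE my_date_column BETWEEN '2019-08-31' AND '2019-09-02';
The plan output shows the stages and operators Hive will execute, so you can compare the two forms directly.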

Related

Oracle: Want to convert Substring to a usable, sortable date

1st Post go easy on me.
I'm using this SUBSTR to pull part of a field. The date it returns is in a non-standard format (DDMonYYYY, as the output below shows). How can I enhance this command so that I can use it as a sortable date field? I'm guessing CAST, but I have no idea of the syntax.
SELECT SUBSTR(Host_Name,-9) as Decom_Date
Output
DECOM_DATE
31Oct2018
31May2018
31May2018
31Mar2017
31Jul2018
TIA
This is exactly what the TO_DATE function is designed for:
SELECT TO_DATE(SUBSTR(Host_Name,-9), 'DDMonYYYY') as Decom_Date
It doesn't affect you here, but bear in mind that Oracle DATE values can only store times down to one-second precision. Also, if you have any rogue data in the table that can't be parsed as a date, you'll get errors such as ORA-01843 ("not a valid month") or ORA-01858 ("a non-numeric character was found where a numeric was expected").
Be mindful that your strings here are in English, but parsing Mon (the three-letter month name) is locale-sensitive, so this code might not work on a session with a different NLS date language. For example, consider passing 'NLS_DATE_LANGUAGE = American' as the third argument to TO_DATE if you know your strings will always use English month names.
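Putting that together (the table name here is a placeholder for wherever Host_Name lives):
SELECT TO_DATE(SUBSTR(Host_Name, -9), 'DDMonYYYY',
       'NLS_DATE_LANGUAGE = American') AS Decom_Date
FROM your_table
ORDER BY Decom_Date;
Because the result is a real DATE, the ORDER BY sorts chronologically rather than alphabetically.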

How can you add FLOAT measures in Tableau formatted as a time stamp (hh:mm:ss)?

The fields look as described above. They are time fields from SQL imported as a varchar, and I had to format them as dates in Tableau. There can be NULL values, so I am having a tough time getting past that. The Tableau statement I have is only ([time spent])+([time waited])+([time solved]).
Thank you!
If you only want to use the result for a graphical visualization of what took the longest, you can split each value, convert everything to seconds, and add it all up for use in your view. E.g.
In this case the HH:MM:SS fields are strings as far as Tableau is concerned.
The formula used to sum the three fields is:
// transforms everything into seconds for each field
ZN(INT(SPLIT([Time Spent], ':', 1)) * 3600)
+ ZN(INT(SPLIT([Time Spent], ':', 2)) * 60)
+ ZN(INT(SPLIT([Time Spent], ':', 3)))
+ ZN(INT(SPLIT([Time Waited], ':', 1)) * 3600)
+ ZN(INT(SPLIT([Time Waited], ':', 2)) * 60)
+ ZN(INT(SPLIT([Time Waited], ':', 3)))
+ ZN(INT(SPLIT([Time Solved], ':', 1)) * 3600)
+ ZN(INT(SPLIT([Time Solved], ':', 2)) * 60)
+ ZN(INT(SPLIT([Time Solved], ':', 3)))
Quick explanation of the formula:
I SPLIT every field three times, once each for the hours, minutes and seconds, and add all the values together.
INT converts each string piece into an integer.
There is also a ZN around every piece - this makes NULL fields become zeros instead of nullifying the whole sum.
You can also use the value as an integer if you want; e.g. Case A has a Total Time of 5310 seconds.
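If you then want to display the total as hh:mm:ss again, one option (assuming the sum above has been saved as a calculated field named [Total Seconds], a name invented here) is a string-building calculation like:
STR(INT([Total Seconds] / 3600)) + ":" +
RIGHT("0" + STR(INT(([Total Seconds] % 3600) / 60)), 2) + ":" +
RIGHT("0" + STR([Total Seconds] % 60), 2)
For the 5310-second example this yields "1:28:30".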
The best approach is usually to store dates in the database in a date field instead of in a string. That might mean a data prep/cleanup step before you get to Tableau, but it will help with efficiency, simplicity and robustness ever after.
You can present dates in many formats, including hh:mm, when the underlying representation is a date datatype. See the custom date options on the format pane in Tableau for example. But storing dates as formatted strings and converting them to something else for calculations is really doing things the hard way.
If you have no choice but to read in strings and convert them to dates, then you should look at the DateParse function.
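A minimal sketch, assuming the incoming strings look like "01:28:30":
DATEPARSE("HH:mm:ss", [Time Spent])
The first argument is an ICU-style pattern describing how the string is laid out; the result is a real date/time value that Tableau can aggregate and format natively.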
Either way, decide what a null date means and make sure your calculations behave well in that case -- unless you can enforce that the date field not contain nulls in the database.
One example would be a field called Completed_Date in a table of Work_Orders. You could determine that a null Completed_Date meant the work order had not been fulfilled yet, and thus allow nulls for that field. But you could also have the database enforce that another field, say Submitted_Date, could never be null.

proper way to compare two timestamp fields in Oracle

Say I have two timestamp-type columns, timestamp_column1 and timestamp_column2.
I want to check whether timestamp_column1 is greater than timestamp_column2.
How do I compare these two timestamps? Do comparison operators work properly with timestamps in Oracle?
timestamp_column1 > timestamp_column2
Is this correct??
Or do I have to wrap them in some function to compare them with each other, like
to_timestamp(timestamp_column1) > to_timestamp(timestamp_column2)?
As long as the "timestamp" columns are truly using one of the date or timestamp data types, then yes, the usual relational operators will work.
The only time you need to wrap a timestamp in a function is if it's erroneously stored as a string, or if you want to manipulate it in some way, such as truncating it to the hour, day, week, month, year or some other coarser unit of time.
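For example (my_table is a placeholder):
SELECT *
FROM my_table
WHERE timestamp_column1 > timestamp_column2;
And with coarser granularity - note that TRUNC on a timestamp returns a DATE truncated to the requested unit:
SELECT *
FROM my_table
WHERE TRUNC(timestamp_column1, 'DD') > TRUNC(timestamp_column2, 'DD');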

For Hive partition based on date, why use string type? why not int?

If I'm defining a table in Hive, and will be partitioning based on date, and my dates are in the format YYYYMMDD, which should I choose for the type, int or string?
If it was just a field, and therefore in the files I'm supplying for the table, I could see using a string, even if only so that I can search for and identify malformed entries that might work their way into my data. But since I will be specifying the partition as part of the load process, I know I'll always have correctly formed values.
When used in a WHERE clause, the partition field will normally appear in equality or less-than/greater-than logic.
Dates are typically treated as strings in Hive. If you look at all the date-manipulation UDFs available, they work on string types, so if you were using integers you would have to cast them every time.
Conceptually it also makes more sense to use strings: your YYYYMMDD is just one literal representation of a date, implicitly equivalent to something like YYYY-MM-DD or DDMMYYYY, and with an integer such comparisons and conversions become painful.
Note that you can also compare strings in Hive with the equality/greater-than/less-than operators, so if you want to select a range of partitions you can easily do that.
The only case where I would see storing a "date" as an integer is a Unix-style timestamp, because it is a continuous value and represents a real, measurable quantity.
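For instance, a range of partitions can be selected with plain string comparison (the table and partition column names here are made up):
SELECT *
FROM my_partitioned_table
WHERE day_key >= '20190831'
  AND day_key <= '20190902';
Because YYYYMMDD is zero-padded with the most significant part first, string order matches date order, and Hive can still prune partitions from a predicate like this.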
Because YYYY-MM-DD is the standard for date representation and is the output of Hive's to_date() UDF.
It also allows you to do lazy things like SELECT * FROM foo WHERE day > '2013'.
http://xkcd.com/1179/

Oracle - Fetch date/time in milliseconds from DATE datatype field

I have a last_update_date column defined as a DATE field.
I want to get time in milliseconds.
Currently I have:
TO_CHAR(last_update_date,'YYYY-DD-MM hh:mi:ss am')
But I want to get milliseconds as well.
I googled a bit and think DATE fields will not have milliseconds; only TIMESTAMP fields will.
Is there any way to get milliseconds? I do not have option to change data type for the field.
DATE fields in Oracle only store the data down to the second, so there is no way to provide anything more precise than that. If you want more precision, you must use another type, such as TIMESTAMP.
Here is a link to another SO question regarding Oracle date and time precision.
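For comparison, once a value is a TIMESTAMP, the fractional seconds can be formatted with the FF mask, e.g.:
SELECT TO_CHAR(SYSTIMESTAMP, 'YYYY-MM-DD HH24:MI:SS.FF3') FROM dual;
FF3 limits the output to three fractional digits, i.e. milliseconds.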
As RC says, the DATE type only supports a granularity down to the second.
If converting to TIMESTAMP is truly not an option, then how about adding another numeric column that just holds the milliseconds?
This option would be more cumbersome to deal with than a TIMESTAMP column but it could be workable if converting the type is not possible.
In a similar situation where I couldn't change the fields in a table (couldn't afford to 'break' third-party software) but needed sub-second precision, I added a 1:1 supplemental table, and an AFTER INSERT trigger on the original table to post the timestamp into the supplemental table.
If you only need to know the ORDER of records being added within the same second, you could do the same thing, only using a sequence as a data source for the supplemental field.
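A rough sketch of that supplemental-table idea - every table, column and trigger name below is hypothetical:
-- 1:1 side table holding the precise insert time
CREATE TABLE my_table_ts (
  my_table_id NUMBER PRIMARY KEY, -- matches the original table's key
  inserted_at TIMESTAMP(3)        -- millisecond precision
);
CREATE OR REPLACE TRIGGER my_table_ai
AFTER INSERT ON my_table
FOR EACH ROW
BEGIN
  INSERT INTO my_table_ts (my_table_id, inserted_at)
  VALUES (:NEW.id, SYSTIMESTAMP);
END;
/
For the ordering-only variant, swap the TIMESTAMP column for a NUMBER populated from a sequence.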
