For a Hive partition based on date, why use string type? Why not int?

If I'm defining a table in Hive, and will be partitioning based on date, and my dates are in the format YYYYMMDD, which should I choose for the type, int or string?
If it was just a field, and therefore in the files I'm supplying for the table, I could see using a string, even if only so that I can search for and identify malformed entries that might work their way into my data. But since I will be specifying the partition as part of the load process, I know I'll always have correctly formed values.
When used in a WHERE clause, the partition field will normally be compared with equality or less-than/greater-than logic.

Dates are typically treated as strings in Hive. If you look at the date manipulation UDFs available, they operate on strings, so if you were using integers you would have to cast them every time.
Conceptually, I also think it makes more sense to use strings: your YYYYMMDD is just one literal representation of a date, implicitly equivalent to something like YYYY-MM-DD or DDMMYYYY. If you were using an integer here, such comparisons become painful.
Note that you can also compare strings in Hive with equality and greater-than/less-than operators, so if you want to select a range of partitions you can easily do that.
The only case where I would store a "date" as an integer is a Unix-style timestamp, because it is a continuous value and represents a real, measurable quantity.
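As a rough HiveQL sketch of that pattern (the table and column names below are made up for illustration, not from the question), the partition key is a YYYYMMDD string and a range of partitions can be selected with plain string comparison:
-- Hypothetical table with a string partition key in YYYYMMDD form
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (dt STRING);
-- Lexicographic string comparison selects the January 2019 partitions
SELECT COUNT(*)
FROM page_views
WHERE dt >= '20190101'
  AND dt < '20190201';
-- With an INT partition key you would be casting (and reformatting) the value
-- before feeding it to Hive's string-based date UDFs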

Because YYYY-MM-DD is the standard for date representation and is the output of Hive's to_date() UDF.
It also allows you to do lazy things like select * from foo where day > '2013'.
http://xkcd.com/1179/
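A small hedged illustration (in older Hive versions to_date() returns a string; newer versions return a DATE, but the comparisons behave the same):
SELECT to_date('2013-09-02 10:41:00');   -- 2013-09-02
-- Zero-padded YYYY-MM-DD strings sort in date order, so lazy range filters work:
SELECT * FROM foo WHERE day > '2013' AND day < '2014';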

Related

Hive: filtering data between specified dates in string format. Which one is optimised and why?

Which one should be preferred, and why? My_date_column is of type string and of format YYYY-MM-DD.
SELECT *
FROM my_table
WHERE my_date_column >= '2019-08-31' AND my_date_column <= '2019-09-02';
OR
SELECT *
FROM my_table
WHERE my_date_column in ('2019-08-31','2019-09-01','2019-09-02') ;
Lastly, in general, should I be storing my date as a DATE data type or simply as a string? I chose the string type simply to handle any corrupt/badly formatted data.
Always store dates as a date type. If you store them as strings, it adds overhead to any kind of arithmetic with the values (including those inequalities).
If you decide to ignore this advice and store dates as strings anyway, then I believe the second form (the IN list) will be faster, as it compares text against text. The other option, using inequalities, may not give the results you expect unless you explicitly CAST(my_date_column AS DATE). That cast will be applied to every record, which adds to the total cost of the query and would be avoidable if the column were stored as a date instead.
A third option that you didn't mention is the BETWEEN operator (i.e. WHERE my_date BETWEEN start_date AND end_date). This is the same as using the two inequalities, but probably a bit cleaner and more idiomatic.
When in doubt, take a look at the EXPLAIN plans to understand how the query will be executed.
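For example, a hedged sketch of the date-typed version (my_table and my_date_column come from the question; storing the column as DATE is the assumption here):
-- With my_date_column stored as DATE, the range filter needs no casts
SELECT *
FROM my_table
WHERE my_date_column BETWEEN DATE '2019-08-31' AND DATE '2019-09-02';
-- Inspect how the predicate will be evaluated
EXPLAIN
SELECT *
FROM my_table
WHERE my_date_column BETWEEN DATE '2019-08-31' AND DATE '2019-09-02';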

Convert azure.timestamp to NiFi date data type in NiFi expression language

I am using the NiFi ListAzureBlobStorage processor to get the available blob objects. The processor creates a flowfile for each object, with attributes containing the object metadata. I want to filter on the azure.timestamp attribute, but I do not know what the numeric value represents or how it relates to NiFi expression language's date data type. I want to compare it with a known date, so I need to convert it to a NiFi date-time value first. How do I do this?
Thanks
According to the code, it is already in "NiFi format", which means a Unix timestamp.
Since it represents the number of milliseconds elapsed since 1/1/1970, you can compare it and the other timestamp using regular number comparison operators.
Example: ${azure.timestamp:ge(${now()})} - this will return true if azure.timestamp is later than (or equal to) the current timestamp (now).
If you'd like to compare it to another attribute you can do this:
${azure.timestamp:ge(${attribute.name})}.
If you'd like to convert a different date into a Unix timestamp, you can use toDate and then toNumber; to go the other way around, just use format.
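For illustration, a hedged sketch of those conversions (the literal date string and the format pattern are just example values):
${literal('2021-01-15 10:30:00'):toDate('yyyy-MM-dd HH:mm:ss'):toNumber()} - parses an example date string and yields its Unix timestamp in milliseconds
${azure.timestamp:format('yyyy-MM-dd HH:mm:ss', 'GMT')} - formats the Unix timestamp back into a readable date string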

How can you add FLOAT measures in Tableau formatted as a time stamp (hh:mm:ss)?

The fields look as described above. They are time fields from SQL imported as a varchar. I had to format them as dates in Tableau. There can be NULL values, so I am having a tough time getting over that. The Tableau statement I have is only ([time spent]) + ([time waited]) + ([time solved]).
Thank you!
If you only want to use the result for a graphical visualization of what took the longest, you can split the values, convert everything into seconds, and add them up for use in your view. E.g.:
In this case the HH:MM:SS fields are strings for Tableau.
The formula used to sum the three fields is:
// transforms everything into seconds for each variable
ZN(INT(SPLIT([Time Spent], ":", 1)) * 3600)
+ ZN(INT(SPLIT([Time Spent], ":", 2)) * 60)
+ ZN(INT(SPLIT([Time Spent], ":", 3)))
+ ZN(INT(SPLIT([Time Waited], ":", 1)) * 3600)
+ ZN(INT(SPLIT([Time Waited], ":", 2)) * 60)
+ ZN(INT(SPLIT([Time Waited], ":", 3)))
+ ZN(INT(SPLIT([Time Solved], ":", 1)) * 3600)
+ ZN(INT(SPLIT([Time Solved], ":", 2)) * 60)
+ ZN(INT(SPLIT([Time Solved], ":", 3)))
Quick explanation of the formula:
I SPLIT every field three times, once each for the hours, minutes, and seconds, and add all the values.
The INT function converts the resulting strings into integers.
There is also a ZN around every term; this turns NULL fields into zeros.
You can also use the value as an integer if you want, e.g. Case A has a Total Time of 5310 seconds.
The best approach is usually to store dates in the database in a date field instead of in a string. That might mean a data prep/cleanup step before you get to Tableau, but it will help with efficiency, simplicity and robustness ever after.
You can present dates in many formats, including hh:mm, when the underlying representation is a date datatype. See the custom date options on the format pane in Tableau for example. But storing dates as formatted strings and converting them to something else for calculations is really doing things the hard way.
If you have no choice but to read in strings and convert them to dates, then you should look at the DateParse function.
Either way, decide what a null date means and make sure your calculations behave well in that case -- unless you can enforce that the date field not contain nulls in the database.
One example would be a field called Completed_Date in a table of Work_Orders. You could determine that a null Completed_Date meant the work order had not been fulfilled yet, and thus allow nulls for that field. But you could also have the database enforce that another field, say Submitted_Date, could never be null.
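For illustration, a hedged sketch of the DATEPARSE approach mentioned above, using the hypothetical Completed_Date field (the "yyyy-MM-dd" format string is an assumption about how the text looks):
// Hypothetical calculated field: turn the text column into a real date
DATEPARSE("yyyy-MM-dd", [Completed_Date])
// And decide explicitly what a NULL should mean, e.g. flag unfulfilled work orders
ISNULL([Completed_Date])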

proper way to compare two timestamp fields in Oracle

Say I have two timestamp type columns timestamp_column1 and timestamp_column2
I want to compare if timestamp_column1 is greater than timestamp_column2.
How do I compare these two timestamps? Do comparison operators work properly with timestamps in Oracle?
timestamp_column1 > timestamp_column2
Is this correct?
Or do I have to wrap them in some function to compare them with each other
like to_timestamp(timestamp_column1) > to_timestamp(timestamp_column2)
?
As long as the "timestamp" columns are truly using one of the DATE or TIMESTAMP data types, then yes, the usual relational operators will work.
The only time you need to wrap a timestamp in a function is if it is erroneously stored as a string, or if you want to manipulate it in some way, such as truncating it to the hour, day, week, month, year or some other coarser unit of time.
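As a minimal sketch (my_table is a hypothetical name; the two column names come from the question):
SELECT *
FROM my_table
WHERE timestamp_column1 > timestamp_column2;
-- Wrapping is only needed for manipulation, e.g. comparing by calendar day only
-- (TRUNC implicitly converts the TIMESTAMP to a DATE truncated to midnight):
SELECT *
FROM my_table
WHERE TRUNC(timestamp_column1) = TRUNC(timestamp_column2);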

Hibernate mapping of two properties to one column

I have an object that is generated from XSDs, so I can't change it. In it I have a String DATE and a String TIME (representing the time of day without the date).
DATE = yyyy-mm-dd
TIME = hh:MM:ss:mmmm
In the OracleDB, I don't want to represent these as VARCHAR. I'd like to use DATE or DATETIME. Therefore, I'd need to map both DATE + TIME to one single column, DATETIME.
This is not possible. You can map two columns to a single property (using composites or user types) but not the other way around.
Using the same column name twice in the mapping file usually results in strange exceptions (index out of bounds).
I would use two columns in the database and convert them to DATE-kind data types using a user type.
