Timezone when reading ORC in Hive

I have an external Hive table (stored as ORC). I write the ORC files using the PutORC processor in NiFi.
When I select from the table using the Hive CLI, the values in the timestamp columns are 3 hours less than in the ORC file.
hive> desc transactions;
OK
host string
id bigint
type int
time_ timestamp
hive> select id, time_ from transactions where id=9126893492;
OK
9126893492 2020-03-01 08:45:18
I checked the contents of the ORC file with the pyarrow library, and the result is: 2020-03-01 11:45:18
Are there any settings in Hive for configuring timezones?
I use Hive 3.1.2 on CentOS 7. The system's timezone is Europe/Moscow.

If there is no timezone in the timestamp itself, the host operating system's local time zone is used.
Either add the desired time zone to the timestamps or adjust the host OS's local tz to the desired tz.
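For completeness, a minimal sketch of the session-level knob, assuming a Hive 3.x build that exposes the hive.local.time.zone property (whether it affects your particular ORC read path is something to verify):
-- show the current session time zone (defaults to LOCAL, i.e. the host OS zone)
set hive.local.time.zone;
-- pin the session to an explicit zone instead of relying on the host setting
set hive.local.time.zone=Europe/Moscow;
select id, time_ from transactions where id=9126893492;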

Related

CurrentTime() generated from Pig showing as NULL in Hive Datetime column

In a Pig script I have generated a datetime column whose value is CurrentTime().
When reading the data from the Hive table for the output generated by the Pig script, it shows as NULL.
Is there any way I can load the current datetime column from Pig so that it shows up in the Hive table?
The data in the file looks like 2020-07-24T14:38:26.748-04:00, and in the Hive table the column is of the timestamp datatype.
A Hive timestamp should be in 'yyyy-MM-dd HH:mm:ss.SSS' format (without the T and the timezone -04:00).
1. Define the Hive column as STRING.
2. Transform the string into a format compatible with a Hive timestamp.
If you do not need milliseconds:
--use your string column instead of literal
from_unixtime(unix_timestamp('2020-07-24T14:38:26.748-04:00',"yyyy-MM-dd'T'HH:mm:ss.SSSX"))
Returns:
2020-07-24 18:38:26
If you need milliseconds, additionally extract the milliseconds and concatenate them with the transformed timestamp:
select concat(from_unixtime(unix_timestamp('2020-07-24T14:38:26.748-04:00',"yyyy-MM-dd'T'HH:mm:ss.SSSX")),
'.',regexp_extract('2020-07-24T14:38:26.748-04:00','\\.(\\d{3})',1))
Result:
2020-07-24 18:38:26.748
Both results are compatible with a Hive timestamp and, if necessary, can be cast explicitly to the timestamp type using CAST(str AS TIMESTAMP), though comparing these strings with timestamps or inserting them into a timestamp column works without an explicit cast.
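For illustration, the explicit cast applied to the same literal as above (a sketch only):
select cast(concat(from_unixtime(unix_timestamp('2020-07-24T14:38:26.748-04:00',"yyyy-MM-dd'T'HH:mm:ss.SSSX")),
'.',regexp_extract('2020-07-24T14:38:26.748-04:00','\\.(\\d{3})',1)) as timestamp);
This should return 2020-07-24 18:38:26.748 as a timestamp.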
Alternatively, you can format the timestamp in Pig as 'yyyy-MM-dd HH:mm:ss.SSS'; I do not have Pig and cannot check how ToString works.
Also, for LazySimpleSerDe, alternative timestamp formats can be supported by providing the format in the SerDe property "timestamp.formats" (as of release 1.2.0, with HIVE-9298). Try "yyyy-MM-dd'T'HH:mm:ss.SSSX".
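For example, a hedged DDL sketch that wires the property in (the table name, columns, and location are placeholders, and delimiter settings are omitted for brevity):
create external table pig_output (id bigint, event_time timestamp)
row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
with serdeproperties ("timestamp.formats"="yyyy-MM-dd'T'HH:mm:ss.SSSX")
stored as textfile
location '/path/to/pig/output';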

TIMESTAMP column not interpreting correct value for ORC file in HDP3.1

As part of a cluster migration we are copying ORC HDFS files from the old cluster (IBM IOP 4.2) to HDP 3.1. Post-migration we see that a TIMESTAMP column shows -1 hour in HDP 3.1. A similar question was posted in "TimeStamp issue in hive 1.1".
We cross-checked the time zone configuration for all the nodes in the cluster, both the Linux OS and Hive, and both are set to EDT (the local time zone).
We tried testing this scenario by reading the ORC file content using hive --orcfiledump -d, and the file actually has the correct timestamp value. The column value is getting changed when Hive reads it and displays the records.
ORC external table output on OLD cluster.
DATE_KEY DTDATE
20100701 7/1/2010 12:00:00 AM
20100702 7/2/2010 12:00:00 AM
20100703 7/3/2010 12:00:00 AM
ORC external table output on new HDP 3.1 cluster. The DTDATE column shows -1 hour
DATE_KEY DTDATE
20100701 6/30/2010 11:00:00 PM
20100702 7/1/2010 11:00:00 PM
20100703 7/2/2010 11:00:00 PM

How to resolve date differences between Hive text file format and Parquet file format

We created an external Parquet table in Hive and inserted the existing text file data into it using INSERT OVERWRITE.
But we observed that the dates from the existing text file do not match those in the Parquet files.
Data from the two files:
txt file date : 2003-09-06 00:00:00
parquet file date : 2003-09-06 04:00:00
Questions:
1) How can we resolve this issue?
2) Why are we getting this discrepancy in the data?
We faced a similar issue when sqooping tables from SQL Server; in our case it was because of a driver or jar issue.
When you do the INSERT OVERWRITE, try using CAST for the date fields.
This should work; let me know if you face any issues.
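Something along these lines, a sketch only with placeholder table and column names:
insert overwrite table parquet_table
select id, cast(txn_date as timestamp) as txn_date
from text_table;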
Thanks for your help. We are using both beeline and the Impala query editor in Hue to access the data stored in the Parquet table; the timestamp issue occurs when we use an Impala query via Hue.
This is most likely related to a known difference in the way Hive and Impala handle timestamp values:
- When Hive stores a timestamp value in Parquet format, it converts the local time to UTC, and when it reads the data back, it converts it back to local time.
- Impala, on the other hand, does no conversion when it reads the timestamp field, so the UTC time is returned instead of the local time.
If your servers are located in the EST time zone, this explains the +4h offset:
- The timestamp 2003-09-06 00:00 in the example should be understood as EDT time (September 6 falls in daylight saving time, hence the UTC-4 offset).
- +4h is added to the timestamp when it is stored by Hive.
- The same offset is subtracted when it is read back by Hive, giving the correct value.
- No correction is done when it is read back by Impala, which therefore shows 2003-09-06 04:00:00.
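If adjusting on the Impala side is acceptable, one hedged workaround is to convert back at query time with from_utc_timestamp(); the table/column names and the zone id below are assumptions for an EST/EDT cluster:
-- Impala returns the raw UTC value; shift it back to local wall-clock time
select from_utc_timestamp(ts_col, 'America/New_York') from parquet_table;
Alternatively, if your Impala version supports it, the convert_legacy_hive_parquet_utc_timestamps startup flag makes Impala apply the same UTC-to-local conversion that Hive does, at some performance cost.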

Hive date format not supported in Impala

I created a partition on a DATE column in a Hive table, but when I access the same table from hive_metadata in Impala it shows:
CAUSED BY: TableLoadingException: Failed to load metadata for table
'employee_part' because of unsupported partition-column type 'DATE' in
partition column 'hiredate'.
Please let me know which date format Hive and Impala commonly support.
I used the date format yyyy-mm-dd in Hive.
Impala doesn't support the Hive DATE type.
You have to use a TIMESTAMP (which means you will always carry a time part, but it will be 00:00:00). Then, depending on the tool you use afterwards, you unfortunately have to make a conversion again.
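A hedged sketch of that query-time conversion, assuming the hiredate column has been redefined as TIMESTAMP; to_date() returns just the date portion (as a string in Impala, and as a string or DATE depending on the Hive version):
select to_date(hiredate) from employee_part;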

Date field issues while using Sqoop with --as-avrodatafile option

Following is the gist of my problem.
Env:
Hadoop 2 (CDH5.1)
database: oracle 11g
Scenarios:
I'm sqooping fact and dimension tables from the database into HDFS. Initially, I had challenges in handling nulls (handled using --null-string and --null-non-string, which were set to \N as per the recommendation). Everything was fine when the Hive table that was built had string fields, even for dates and numerics.
Solution so far:
Based on a recommendation, I moved to importing using the Avro format. I've built the Hive table on the Avro data and I'm able to query the tables. Now I need to create Hive joins and convert all the fields to their required types, e.g. dates to dates/timestamps, numerics to int/bigint, etc. After the sqooping, the Avro schema that was created had converted all date fields to long, and the Hive table shows bigint for those columns.
I'm confused about how Sqoop handles nulls and how those are to be handled in Hive/HDFS, MapReduce, etc.
Could anybody suggest a practice that has been adopted that could be leveraged?
Thanks
Venkatesh
It was a problem for me too when I imported schemas from Parquet tables, since Parquet stores timestamps as bigint. So I guess the underlying problem is that Parquet does not have a separate datatype to store timestamps. I don't use Avro very often, but I think the same is true for Avro. So if you sqoop an Oracle date/timestamp into a set of Parquet/Avro files, then the storage type (bigint) is how it is stored, not how you want to access it (as timestamp/date).
That time is stored as the number of milliseconds since the UNIX epoch (Jan 1st 1970). The Hive/Spark/Impala function from_unixtime() takes a number of seconds, so the solution is to convert those millisecond values to second resolution:
SELECT ..
, from_unixtime(cast(bigint_column/1000 as bigint))
So you will see timestamps like:
1999-04-14 06:00:00
1999-04-15 06:00:00
Notice the 6-hour shift. In my case the original Oracle data type was DATE without any time part (00:00:00), but I got the time shifted by 6 hours because of my timezone (MST). So to get the exact dates:
SELECT ..
, from_unixtime(cast(bigint_column/1000 - 6*3600 as bigint))
which resulted in:
1999-04-14 00:00:00
1999-04-15 00:00:00
ps. "Data Type Considerations for Parquet Tables"
http://www.cloudera.com/documentation/archive/impala/2-x/2-1-x/topics/impala_parquet.html#parquet_data_types_unique_1 :
INT96 -> TIMESTAMP
Thanks Gergely. The approach we followed to overcome this issue was to sqoop the date fields as String type when importing into HDFS. This was achieved using:
sqoop --options-file $OPTION_FILE_NAME \
--table $TABLE_NAME \
--map-column-java DAY_END_DTE=String \
--target-dir $TARGET_DIR \
--as-avrodatafile
This causes the timestamp information to be sqooped as a string in 'yyyy-mm-dd hh:mm:ss.f' format, which can then be cast to a date field.
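For example (a sketch; the table name is a placeholder, the column comes from the sqoop command above):
select cast(day_end_dte as timestamp) as day_end_dte from my_avro_table;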
it is not a solution, it is a workaround:
You can convert the imported data to timestamp with this command:
select cast(long_column as TIMESTAMP) from imported_table;
BR,
Gergely
