How to resolve a date difference between Hive text file format and Parquet file format - hadoop

We created an external Parquet table in Hive and inserted the existing text-file data into it using INSERT OVERWRITE,
but we observed that dates from the existing text file do not match the Parquet files.
Data from the two files:
txt file date : 2003-09-06 00:00:00
parquet file date : 2003-09-06 04:00:00
Questions:
1) How can we resolve this issue?
2) Why are we getting this discrepancy in the data?

We faced a similar issue when Sqooping tables from SQL Server; in that case it turned out to be a driver/JAR issue.
When you do the INSERT OVERWRITE, try using CAST on the date fields.
This should work; let me know if you face any issues.
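A minimal sketch of that, assuming a text-backed source table txt_table with hypothetical columns id and event_ts, and a Parquet target parquet_table:
-- hypothetical names; cast the field explicitly while copying
INSERT OVERWRITE TABLE parquet_table
SELECT id, CAST(event_ts AS TIMESTAMP) FROM txt_table;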

Thanks for your help.
We are using both Beeline and the Impala query editor in Hue to access the data stored in the Parquet table; the timestamp issue occurs when we query via Impala in Hue.
This is most likely related to a known difference in the way Hive and Impala handle timestamp values:
- When Hive stores a timestamp value in Parquet format, it converts local time to UTC, and when it reads the data back out, it converts back to local time.
- Impala, on the other hand, does no conversion when it reads the timestamp field, so UTC time is returned instead of local time.
If your servers are located in the US Eastern time zone, this explains the +4h offset, as follows (a mitigation sketch follows this list):
- The timestamp 2003-09-06 00:00 in the example should be understood as Eastern local time; September 6 falls in daylight saving time, i.e. EDT, which is UTC-4.
- +4h is added to the timestamp when it is stored by Hive.
- The same offset is subtracted when it is read back by Hive, yielding the correct value.
- No correction is done when it is read back by Impala, which therefore shows 2003-09-06 04:00:00.
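Two common mitigations, depending on your setup: compensate in the query with from_utc_timestamp(), or, on CDH-era Impala, start impalad with the -convert_legacy_hive_parquet_utc_timestamps=true flag so Impala applies the same local-time conversion Hive does. A query-level sketch (the column name event_ts is hypothetical):
-- convert the UTC value Impala returns back to Eastern local time
SELECT from_utc_timestamp(event_ts, 'America/New_York') FROM parquet_table;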

Related

How can I set the timezone in Azure Data Factory for an Oracle connection?

We have an issue when we copy data from Oracle to ADLS using ADF (Azure Data Factory).
The Oracle DB has tables with timestamp values in a European time zone. We use Azure Data Factory to copy the data into ADLS. The Data Factory IR (Integration Runtime) runs on an on-prem VM that is in the US Eastern time zone.
The issue: when we copy an Oracle table that has a timestamp (but no time zone), the ADF copy activity automatically converts the timestamp value to US Eastern time. We don't want this to happen; we want to ingest the data exactly as it is in the source table.
Example:
Data in Oracle Table - 2020-03-04T00:00:00 ( this is in CET )
Data in ADLS - 2020-03-03T19:00:00.000+0000 (the above date got converted to US Eastern; since there is no time zone info in the Oracle table, it is interpreted as UTC by Spark (+0000))
Expected in ADLS - 2020-03-04T00:00:00.000+0000 (don't want timezone conversion)
Is there a way to enforce a time zone at the Oracle connection level in Azure Data Factory?
We tried setting the property in the Oracle linked service's connection parameters, but this had no effect on the time zone; we still got the values converted to EST.
TIME_ZONE='Europe/Madrid'
TIME_ZONE='CET'
Timestamp is internally converted to Datetime in ADF (per the Microsoft type-mapping documentation).
Thus, in the Mapping tab of the copy activity, change the data type of the source column and copy the data. The approach to change the type:
- Click the JSON representation of the pipeline.
- Edit the data type in the JSON for the timestamp column to String (in both source and sink), as in the sketch below.
- Once the pipeline is run, the data is copied into the sink in the source format.
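A minimal sketch of what the edited mapping JSON might look like (the column name TS_COL is hypothetical, and the exact shape can differ by ADF version):
"translator": {
    "type": "TabularTranslator",
    "mappings": [
        { "source": { "name": "TS_COL", "type": "String" },
          "sink":   { "name": "TS_COL", "type": "String" } }
    ]
}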

Timezone for COMMIT_TIMESTAMP in V$LOGMNR_CONTENTS

In the V$LOGMNR_CONTENTS dictionary view, the TIMESTAMP and COMMIT_TIMESTAMP columns are of the DATE data type, without any time zone information. So which time zone are they in: the database time zone, the host time zone, or UTC? Is there a database parameter to configure their time zone?
I guess it is the time zone of the database server's operating system, simply because SYSDATE, which might be used for the insert, is also returned in the time zone of the database server's operating system.
Perhaps Oracle uses the DATE data type instead of the TIMESTAMP data type for historical reasons. I don't know when TIMESTAMP was introduced, but DATE certainly came earlier.
When a SELECT statement is executed against the V$LOGMNR_CONTENTS view, the archive redo log files are read sequentially. These are the files present in the archive log destination. Translated records from the redo log files are returned as rows in this view. This continues until either the filter criteria specified at startup (EndTime or endScn) are met or the end of the archive log file is reached.
The TIMESTAMP field is the timestamp at which the database change was made. It corresponds to the SCN via the SCN_TO_TIMESTAMP transformation, so for a given SCN you have a corresponding timestamp.
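For example (the SCN value is illustrative, and the mapping only works for SCNs still within the retained range):
SELECT SCN_TO_TIMESTAMP(1234567) FROM dual;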
The COMMIT_TIMESTAMP field is the timestamp at which the transaction was committed; it is only meaningful if the COMMITTED_DATA_ONLY option was chosen in the DBMS_LOGMNR.START_LOGMNR() invocation. As you know, querying the redo and archive logs requires that you invoke this package in a LogMiner session.
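A minimal sketch of such an invocation, assuming the log files were added beforehand with DBMS_LOGMNR.ADD_LOGFILE and the dictionary is read from the online catalog:
-- start a LogMiner session that returns only committed transactions
BEGIN
  DBMS_LOGMNR.START_LOGMNR(
    OPTIONS => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG + DBMS_LOGMNR.COMMITTED_DATA_ONLY);
END;
/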
Actually, Oracle sometimes uses the DATE data type where it probably should use TIMESTAMP, in a lot of different dictionary fields. Why? I honestly don't know; it is the same inconsistency as using OWNER in some dictionary views, TABLE_OWNER in others, and OWNER_NAME in yet others.
The DBTIMEZONE is specified in the CREATE DATABASE statement, i.e. at the moment you create the database. You can change the DBTIMEZONE by using ALTER DATABASE:
alter database set time_zone = 'EST';
Keep in mind that altering the database time zone only takes effect after a shutdown/startup, and it is not recommended.
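You can check the current value with:
SELECT DBTIMEZONE FROM dual;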
TIMESTAMP WITH TIME ZONE is a variant of TIMESTAMP that includes a time zone region name or time zone offset in its value. The time zone offset is the difference (in hours and minutes) between local time and UTC (Coordinated Universal Time, formerly Greenwich Mean Time).
Oracle Database normalizes all new TIMESTAMP WITH LOCAL TIME ZONE data to the time zone of the database when the data is stored on disk. Oracle Database does not automatically update existing data in the database to the new time zone. Therefore, you cannot reset the database time zone if there is any TIMESTAMP WITH LOCAL TIME ZONE data in the database. You must first delete or export the TIMESTAMP WITH LOCAL TIME ZONE data and then reset the database time zone. For this reason, Oracle does not encourage you to change the time zone of a database that contains data.
An example from my own case: I have an Oracle database in Azure (where all the servers use UTC). I chose to keep UTC rather than set a different DBTIMEZONE, and then created a function to transform any timestamp stored in any table to my time zone.
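The function itself wasn't shown; a minimal sketch of such a helper, assuming the stored values are UTC timestamps (the name utc_to_local is hypothetical):
-- convert a stored UTC timestamp to a caller-supplied time zone
CREATE OR REPLACE FUNCTION utc_to_local(p_ts IN TIMESTAMP, p_tz IN VARCHAR2)
RETURN TIMESTAMP WITH TIME ZONE IS
BEGIN
  RETURN FROM_TZ(p_ts, 'UTC') AT TIME ZONE p_tz;
END;
/
-- usage: SELECT utc_to_local(created_at, 'Europe/Madrid') FROM some_table;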
I wonder why you need to read the redo/archive logs; do you have to recover some lost transactions? I hope the explanation is satisfactory; please don't hesitate to comment or ask about any other doubts you may have.

Timezone when reading ORC in Hive

I have an external Hive table (stored as ORC). I put the ORC file in place using the PutORC processor in NiFi.
When I select from the table using the Hive CLI, the values in the timestamp columns are 3 hours less than in the ORC file.
hive> desc transactions;
OK
host string
id bigint
type int
time_ timestamp
hive> select id, time_ from transactions where id=9126893492;
OK
9126893492 2020-03-01 08:45:18
I checked the contents of the ORC file via the pyarrow lib, and the result is: 2020-03-01 11:45:18
Are there any settings for Hive to configure time zones?
I use Hive 3.1.2 on CentOS 7. The system time zone is Europe/Moscow.
If there is no time zone in the timestamp itself, the host operating system's local time zone is used.
Either add the desired time zone to the timestamps themselves, or adjust the host OS's local time zone to the desired one.
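If adjusting the host time zone isn't an option, you can compensate in the query; Hive 3.x also has, if I recall correctly, a session-level hive.local.time.zone property worth checking. A query-level sketch for the example above (Moscow is UTC+3, hence the 3-hour gap):
-- shift the value Hive returns back to what the file contains
SELECT id, from_utc_timestamp(time_, 'Europe/Moscow') FROM transactions WHERE id = 9126893492;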

Sqoop changes Date to Long when ingested data is saved as Avro data

We are using Sqoop to ingest data from Oracle into HDFS as Avro data. The date/timestamp column fields are changed to long, so the values appear altered.
Example:
28-MAR-18 12.42.06.328000 PM changes to 1523401161454.
Any insights on the issue?

Date field issues while using Sqoop with --as-avrodatafile option

Following is the gist of my problem.
Env:
Hadoop 2 (CDH5.1)
Database: Oracle 11g
Scenarios:
I'm Sqooping fact and dimension tables from the database into HDFS. Initially, I had challenges handling nulls (which was handled using --null-string and --null-non-string, set to \N as per the recommendation). Everything was fine as long as the Hive table used string fields even for dates and numerics.
Solution so far
Based on a recommendation, I moved to importing using the Avro format. I've built the Hive table on the Avro data and I'm able to query the tables. Now I need to create Hive joins and convert all the fields to their required types, e.g. dates to date/timestamp, numerics to int/bigint, etc. After the Sqoop import, the generated Avro schema had converted all date fields to long, and the Hive table shows bigint for those columns.
I'm confused about how Sqoop handles nulls and how those are to be handled in Hive/HDFS MR etc.
Could anybody suggest a practice that has been adopted and could be leveraged?
Thanks
Venkatesh
It was a problem for me too, when I imported schemas from Parquet tables, as Parquet stores timestamps as bigint. So I guess the underlying problem is that Parquet does not have a separate data type to store timestamps. I don't use Avro very often, but I think the same is true for Avro. So if you Sqoop an Oracle date/timestamp into a set of Parquet/Avro files, then the storage type (bigint) is how it is stored, not how you want to access it (as timestamp/date).
The time is stored as the number of milliseconds since the UNIX epoch (Jan 1st 1970). The Hive/Spark/Impala function from_unixtime() takes a number of seconds, so the solution is to convert those millisecond values to second resolution:
SELECT ..
, from_unixtime(cast(bigint_column/1000 as bigint))
So you will see timestamps like:
1999-04-14 06:00:00
1999-04-15 06:00:00
Notice the 6-hour shift. In my case the original Oracle data type was DATE without any time part (00:00:00), but I got the times shifted by 6 hours because of my time zone (MST). So to get the exact dates:
SELECT ..
, from_unixtime(cast(bigint_column/1000 - 6*3600 as bigint))
which resulted in:
1999-04-14 00:00:00
1999-04-15 00:00:00
ps. "Data Type Considerations for Parquet Tables"
http://www.cloudera.com/documentation/archive/impala/2-x/2-1-x/topics/impala_parquet.html#parquet_data_types_unique_1 :
INT96 -> TIMESTAMP
Thanks Gergely. The approach we followed to overcome this issue was to import the date fields as String type when Sqooping into HDFS. This was achieved using:
sqoop --options-file $OPTION_FILE_NAME \
--table $TABLE_NAME \
--map-column-java DAY_END_DTE=String \
--target-dir $TARGET_DIR \
--as-avrodatafile
This causes the timestamp information to be imported as a string in 'yyyy-mm-dd hh:mm:ss.f' format, which can then be cast into a date field, as shown below.
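A sketch of that cast (the table name my_avro_table is hypothetical):
-- the 'yyyy-MM-dd HH:mm:ss.f' string form is directly castable in Hive
SELECT CAST(day_end_dte AS TIMESTAMP) FROM my_avro_table;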
That is not a solution, though, it is a workaround. You can convert the imported long data to a timestamp with this query:
select cast(long_column as TIMESTAMP) from imported_table;
BR,
Gergely
