Kafka Connect transformations - JDBC

I want to stream data from PostgreSQL to HDFS, but all of my tables have multiple timestamp columns. I tried the simple JDBC source and HDFS sink connectors offered by Confluent to read data from Postgres and write it to HDFS, but all of my date columns are converted to BIGINT instead of a date format. I don't know which format this BIGINT is in, but it is not a Unix timestamp. Is there any way to find out which format it is?
I am also considering transforming the date columns to Unix timestamps in my JDBC source and then transforming them back to dates in my sink using TimestampConverter, but I don't know how to implement this in my current setup.
The issue is reported here.
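For the TimestampConverter idea, a hedged sketch of what the sink-side SMT configuration might look like (the connector is assumed to be the Confluent HDFS sink, and the field name created_at is an assumption; TimestampConverter is a standard Kafka Connect SMT):

```properties
# Illustrative HDFS sink connector snippet -- field name is assumed.
# Converts a numeric epoch field back into a Connect Timestamp on the sink side.
transforms=convertTs
transforms.convertTs.type=org.apache.kafka.connect.transforms.TimestampConverter$Value
transforms.convertTs.field=created_at
transforms.convertTs.target.type=Timestamp
```

Note that TimestampConverter assumes the numeric input is epoch milliseconds; if the BIGINT from the source is in some other unit, this SMT alone will not produce correct dates.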

Related

How can I set the timezone in Azure Data Factory for Oracle connection?

We have an issue when we copy data from Oracle to ADLS using ADF (Azure Data Factory).
The Oracle DB has tables with timestamp values in a European time zone. We use Azure Data Factory to copy the data into ADLS. The Data Factory IR (Integration Runtime) runs on an on-prem VM that is in the US Eastern time zone.
The issue: when we copy an Oracle table that has a timestamp (but no time zone), the ADF copy activity automatically converts the timestamp value to US Eastern time. We don't want this to happen; we want to ingest the data exactly as it is in the source table.
Example:
Data in Oracle table - 2020-03-04T00:00:00 (this is in CET)
Data in ADLS - 2020-03-03T19:00:00.000+0000 (the above date got converted to US EST; since there is no time zone info in the Oracle table, it is interpreted as UTC by Spark (+0000))
Expected in ADLS - 2020-03-04T00:00:00.000+0000 (no time zone conversion)
Is there a way to enforce a time zone at the Oracle connection level in Azure Data Factory?
We tried setting a property in the Oracle linked service connection parameters, but this had no effect on the time zone; the value was still converted to EST.
TIME_ZONE='Europe\Madrid'
TIME_ZONE='CET'
Timestamp is internally converted to Datetime in ADF (per the Microsoft documentation).
So, in the Mapping tab of the copy activity, change the data type of the source column before copying the data. Below is the approach to change the type:
Click the JSON representation of the pipeline.
In the JSON, edit the data type of the timestamp column to String (in both source and sink).
Once the pipeline runs, the data is copied into the sink in the source format.
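As a hedged illustration, the edited mapping section of the copy activity JSON might look like this (the column name TS_COL and the surrounding structure are assumptions based on the TabularTranslator shape):

```json
"translator": {
    "type": "TabularTranslator",
    "mappings": [
        {
            "source": { "name": "TS_COL", "type": "String" },
            "sink":   { "name": "TS_COL", "type": "String" }
        }
    ]
}
```

Mapping the column as String on both sides keeps ADF from interpreting the value as a Datetime, so no time zone conversion is applied.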

Handling dates in Hadoop

I'm new to the Big Data/Hadoop ecosystem and have noticed that dates are not always handled in a standard way across technologies. I plan to ingest data from Oracle into Hive tables on HDFS using Sqoop with the Avro and Parquet file formats. Hive keeps importing my dates as BIGINT values; I'd prefer TIMESTAMP. I've tried the "--map-column-hive" overrides, but it still does not work.
Looking for suggestions on the best way to handle dates for this use case.
Parquet File Format
If you use Sqoop to convert RDBMS data to Parquet, be careful with interpreting any resulting values from DATE, DATETIME, or TIMESTAMP columns. The underlying values are represented as the Parquet INT64 type, which is represented as BIGINT in the Impala table. The Parquet values represent the time in milliseconds, while Impala interprets BIGINT as the time in seconds. Therefore, if you have a BIGINT column in a Parquet table that was imported this way from Sqoop, divide the values by 1000 when interpreting as the TIMESTAMP type.
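The milliseconds-vs-seconds mismatch described above can be sketched in a few lines (a minimal illustration; the epoch value is an arbitrary example, not from the original post):

```python
from datetime import datetime, timezone

# Value as Sqoop wrote it to Parquet: epoch *milliseconds*
# (this example value is 2020-03-04 00:00:00 UTC)
raw_ms = 1583280000000

# Impala interprets a BIGINT as epoch *seconds*, so divide by 1000 first
as_timestamp = datetime.fromtimestamp(raw_ms // 1000, tz=timezone.utc)

print(as_timestamp.strftime("%Y-%m-%d %H:%M:%S"))  # 2020-03-04 00:00:00
```

Feeding the raw millisecond value straight into a seconds-based function would instead land tens of thousands of years in the future, which is usually the first symptom of this mismatch.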
Avro File Format
Currently, Avro tables cannot contain TIMESTAMP columns. If you need to store date and time values in Avro tables, as a workaround you can use a STRING representation of the values, convert the values to BIGINT with the UNIX_TIMESTAMP() function, or create separate numeric columns for individual date and time fields using the EXTRACT() function.
You can also use a Hive query like this to get the result in your desired TIMESTAMP format:
FROM_UNIXTIME(CAST(SUBSTR(timestamp_column, 1,10) AS INT)) AS timestamp_column;
Another workaround is to import the data using --query in the Sqoop command, where you can cast your column to a timestamp format.
Example
--query 'SELECT CAST (INSERTION_DATE AS TIMESTAMP) FROM tablename WHERE $CONDITIONS'
If your SELECT query gets long, you can use an options file to shorten the command-line call. Here is the reference.
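A hedged sketch of such an options file (the file name, connection string and credentials are illustrative; Sqoop options files take one option or value per line):

```shell
# import_options.txt -- one option or value per line
import
--connect
jdbc:oracle:thin:@//dbhost:1521/ORCL
--username
etl_user
--query
SELECT CAST(INSERTION_DATE AS TIMESTAMP) FROM tablename WHERE $CONDITIONS
```

It would then be invoked with something like: sqoop --options-file import_options.txt --target-dir /data/tablename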

how to resolve date difference between Hive text file format and parquet file format

We created an external Parquet table in Hive and inserted the existing text-file data into the external Parquet table using INSERT OVERWRITE,
but we observed that dates from the existing text file do not match the Parquet files.
Data comparison:
txt file date : 2003-09-06 00:00:00
parquet file date : 2003-09-06 04:00:00
Questions:
1) How can we resolve this issue?
2) Why are we getting this discrepancy in the data?
We faced a similar issue when Sqooping tables from SQL Server; in our case it was caused by a driver/JAR issue.
When you do the INSERT OVERWRITE, try using CAST for the date fields.
This should work; let me know if you face any issues.
Thanks for your help.
We're using both Beeline and the Impala query editor in Hue to access the data stored in the Parquet table; the timestamp issue occurs when using an Impala query via Hue.
This is most likely related to a known difference in the way Hive and Impala handles timestamp values:
- when Hive stores a timestamp value into Parquet format, it converts local time into UTC time, and when it reads data out, it converts back to local time.
- Impala, on the other hand, does no conversion when it reads the timestamp field, so UTC time is returned instead of local time.
If your servers are located in the EST time zone, this explains the +4h offset as follows:
- the timestamp 2003-09-06 00:00 in the example should be understood as EDT time (Sept. 6 is during daylight saving time, therefore the UTC-4 time zone)
- +4h is added to the timestamp when stored by Hive
- the same offset is subtracted when it is read back by Hive, getting the correct value
- no correction is done when read back by Impala, thus showing 2003-09-06 04:00:00
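The offset arithmetic described above can be sketched like this (a minimal illustration, assuming an EDT, i.e. UTC-4, server as in the example):

```python
from datetime import datetime, timedelta

local_value = datetime(2003, 9, 6, 0, 0, 0)  # value in the text file (local EDT time)
utc_offset = timedelta(hours=-4)             # EDT is UTC-4

# Hive write: local -> UTC (adds 4h for a UTC-4 zone)
stored_utc = local_value - utc_offset

# Hive read: UTC -> local, round-tripping back to the original value
hive_read = stored_utc + utc_offset

# Impala read: no conversion, the raw stored UTC value is shown
impala_read = stored_utc

print(hive_read)    # 2003-09-06 00:00:00
print(impala_read)  # 2003-09-06 04:00:00
```

This reproduces exactly the 4-hour discrepancy between the text-file date and the Parquet date reported in the question.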

Date field issues while using Sqoop with --as-avrodatafile option

Following is the gist of my problem.
Env:
Hadoop 2 (CDH5.1)
database: oracle 11g
Scenarios:
I'm Sqooping fact and dimension tables from the database into HDFS. Initially, I had challenges handling nulls (handled using --null-string and --null-non-string, set to \N as per the recommendation). Everything was fine when the Hive table was built with string fields, even for dates and numerics.
Solution so far
Based on a recommendation, I moved to importing in the Avro format. I've built the Hive table on the Avro data and I'm able to query the tables. Now I need to create Hive joins and convert all the fields to their required types: dates to DATE/TIMESTAMP, numerics to INT/BIGINT, etc. After Sqooping, the generated Avro schema had converted all date fields to long, and the Hive table shows BIGINT for those columns.
I'm confused about how Sqoop handles nulls and how those should be handled in Hive/HDFS MapReduce, etc.
Could anybody suggest a practice that has been adopted and could be leveraged?
Thanks
Venkatesh
It was a problem for me too, when I imported a schema from Parquet tables, as Parquet stores timestamps as BIGINT. So I guess the underlying problem is that Parquet does not have a separate data type to store timestamps. I don't use Avro very often, but I think the same is true for Avro. So if you Sqoop an Oracle DATE/TIMESTAMP into a set of Parquet/Avro files, then the storage type (BIGINT) is how it is stored, not how you want to access it (TIMESTAMP/DATE).
The time is stored as the number of milliseconds from the UNIX epoch (Jan 1st, 1970). The Hive/Spark/Impala function from_unixtime() takes a number of seconds, so the solution is to convert those ms values to s resolution:
SELECT ..
, from_unixtime(cast(bigint_column/1000 as bigint))
So you will see timestamps like:
1999-04-14 06:00:00
1999-04-15 06:00:00
Notice 6 hours shift. In my case original Oracle's data type was DATE without any time part (00:00:00), but I got time shifted by 06 hours because of my timezone (MST). So to get exact dates:
SELECT ..
, from_unixtime(cast(bigint_column/1000 - 6*3600 as bigint))
which resulted in:
1999-04-14 00:00:00
1999-04-15 00:00:00
ps. "Data Type Considerations for Parquet Tables"
http://www.cloudera.com/documentation/archive/impala/2-x/2-1-x/topics/impala_parquet.html#parquet_data_types_unique_1 :
INT96 -> TIMESTAMP
Thanks Gergely. The approach we followed to overcome this issue was to Sqoop-import the date fields as String type into HDFS. This was achieved using:
sqoop --option-file $OPTION_FILE_NAME \
--table $TABLE_NAME \
--map-column-java DAY_END_DTE=String \
--target-dir $TARGET_DIR \
--as-avrodatafile
This causes the timestamp information to be Sqooped as a string in 'yyyy-mm-dd hh:mm:ss.f' format, which can be cast to a date field.
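That string layout maps onto a standard parse pattern; a minimal sketch (the column value is illustrative, not taken from the original data):

```python
from datetime import datetime

# Sqoop emits the date as a string in 'yyyy-mm-dd hh:mm:ss.f' form
sqooped_value = "1999-04-14 00:00:00.0"

# %f accepts 1-6 fractional-second digits, so the single '.0' digit parses fine
parsed = datetime.strptime(sqooped_value, "%Y-%m-%d %H:%M:%S.%f")

print(parsed.date())  # 1999-04-14
```

The same pattern is what a Hive CAST or a downstream ETL step would effectively apply when turning the string column back into a timestamp.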
Note that the following is not a solution, it is a workaround:
You can convert the imported data to timestamp with this command:
select cast(long_column as TIMESTAMP) from imported_table;
BR,
Gergely

Hadoop Hive - best use cases to create a custom Hive Input and Output formats?

I just wanted to understand the best use cases for creating a custom Hive InputFormat and OutputFormat.
If any of you have created one, could you please let me know when to decide to develop a custom input/output format?
Thanks,
To make Hive varchar behave like Oracle varchar2:
While working on an Oracle-to-Hadoop migration, we came across a behaviour in Oracle where, if the length of the data for a varchar2 column exceeds the length defined in the table DDL, Oracle rejects the record.
Ex: Let's say we have a column 'name' in Oracle and Hadoop with a max length of 10 bytes:
name varchar2(10 BYTE) - Oracle
name varchar(10) - Hive
If the value of the name field is "lengthgreaterthanten", Oracle rejects the record because Oracle applies the schema at write time, whereas Hive reads "lengthgrea", i.e. 10 characters, because Hive applies the schema only when reading the records from HDFS.
To get over this problem we came up with a custom InputFormat that checks the length of the varchar field by splitting the record on the delimiter. If the length is greater than the specified length, it continues to the next record; else, if the length is less than or equal to the specified length, the record is passed on.
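The filtering rule such an InputFormat applies can be sketched in plain Python (the delimiter, length limit, and record layout are assumptions; the real implementation would live in a Hadoop RecordReader written in Java):

```python
def filter_by_varchar_length(records, column_index, max_len, delimiter="|"):
    """Keep only records whose target field fits the declared varchar length,
    mimicking Oracle's reject-on-write behaviour instead of Hive's silent truncation."""
    kept = []
    for record in records:
        fields = record.split(delimiter)
        if len(fields[column_index]) <= max_len:
            kept.append(record)
    return kept

rows = ["1|shortname", "2|lengthgreaterthanten"]
print(filter_by_varchar_length(rows, column_index=1, max_len=10))
# ['1|shortname']
```

The second record is dropped rather than truncated, matching the Oracle semantics the migration wanted to preserve.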
Hope this helps.
Thanks
Some of the file formats used with Hive are RCFile, Parquet and ORC. These are columnar file formats, which gives the advantage that, when reading large tables, you don't have to read and process all of the data. Most aggregation queries refer to only a few columns rather than all of them, which speeds up processing hugely.
Another application could be storing, reading and processing data in your own custom input format, where the data might be stored differently than a CSV structure. These might be binary files or any other structure.
You will have to follow the documentation to create input formats. For details you can follow the link: Custom InputFormat with Hive
