Ingesting a timestamp field from Oracle to HDFS using NiFi

I am trying to ingest a table from Oracle into HDFS using NiFi. The source table in Oracle has a timestamp(6) field called sd_timestamp.
The NiFi flow has the following processors:
QueryDatabaseTable: queries the Oracle database.
ConvertAvroSchema: has input and output schemas; both declare sd_timestamp as String.
ConvertAvroToOrc
PutHDFS
The table created in Hive also declares sd_timestamp as string. When the ingestion is done and I run a select * on the destination Hive table, I get oracle.sql.timestamp#23aff4 as the value instead of the timestamp.
Please help.

Here are the details of what I did to get it working. The ConvertAvroSchema step was not required.
Oracle table
CREATE TABLE my_table
(
entry_name varchar(10),
sd_timestamp timestamp(6)
);
Populate some data
insert into my_table values('e-1',CURRENT_TIMESTAMP);
insert into my_table values('e-2',CURRENT_TIMESTAMP);
insert into my_table values('e-3',CURRENT_TIMESTAMP);
Verify data
SELECT * FROM my_table;
ENTRY_NAME SD_TIMESTAMP
e-1 09-MAY-18 06.45.24.963327000 PM
e-2 09-MAY-18 06.45.39.291241000 PM
e-3 09-MAY-18 06.45.44.748736000 PM
NiFi flow (shown as screenshots in the original post; the key settings are sketched after this list):
Flow design
QueryDatabaseTable configuration
ConvertAvroToOrc configuration
PutHDFS configuration
LogAttribute to see the hive.ddl attribute value
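As a rough sketch of those settings: the property names below come from the standard NiFi processors, while the controller-service name and file paths are assumptions for this example.
QueryDatabaseTable
Database Connection Pooling Service: OracleConnectionPool (assumed controller-service name)
Database Type: Oracle
Table Name: MY_TABLE
Maximum-value Columns: SD_TIMESTAMP (optional, enables incremental fetches)
PutHDFS
Hadoop Configuration Resources: /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml (assumed paths)
Directory: /oracle-ingest
ConvertAvroToOrc and LogAttribute are left at their defaults in this sketch.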
Verify results on HDFS
$ hadoop fs -ls /oracle-ingest
/oracle-ingest/50201861895275.orc
Create the Hive table to query the data, using the hive.ddl attribute value with a LOCATION clause added to it:
hive> CREATE EXTERNAL TABLE IF NOT EXISTS my_oracle_table
(
ENTRY_NAME STRING,
SD_TIMESTAMP STRING
)
STORED AS ORC
LOCATION '/oracle-ingest';
Query Hive table
hive> select * from my_oracle_table;
e-1 2018-05-09 18:45:24.963327
e-2 2018-05-09 18:45:39.291241
e-3 2018-05-09 18:45:44.748736

I was able to resolve the error by adding the following Java argument to the bootstrap.conf file in the NiFi conf directory:
-Doracle.jdbc.J2EE13Compliant=true
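In bootstrap.conf every JVM argument gets its own numbered java.arg entry, so a minimal sketch of the change looks like this (the index 20 is just an assumed unused slot; use any number not already taken):
# nifi/conf/bootstrap.conf
# Make the Oracle JDBC driver return java.sql.Timestamp instead of oracle.sql.TIMESTAMP
java.arg.20=-Doracle.jdbc.J2EE13Compliant=true
NiFi has to be restarted for the new argument to take effect.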

Related

How to create a partitioned Hive table on dynamic HDFS directories

I am having difficulty getting Hive to discover partitions that are created in HDFS.
Here's the directory structure in HDFS
warehouse/database/table_name/A
warehouse/database/table_name/B
warehouse/database/table_name/C
warehouse/database/table_name/D
A, B, C, D being values of a column named type.
When I create a Hive table using the following syntax:
CREATE EXTERNAL TABLE IF NOT EXISTS
table_name(`name` string, `description` string)
PARTITIONED BY (`type` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs:///tmp/warehouse/database/table_name'
I am unable to see any records when I query the table.
But when I create directories in HDFS as below
warehouse/database/table_name/type=A
warehouse/database/table_name/type=B
warehouse/database/table_name/type=C
warehouse/database/table_name/type=D
It works, and the partitions are discovered when I check using show partitions table_name.
Is there some configuration in Hive that enables it to detect such dynamic directories as partitions?
Creating an external table on top of a directory is not enough; the partitions also need to be registered in the metastore. The automatic partition discovery feature was added in Hive 4.0.0. For earlier versions, use MSCK REPAIR TABLE:
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
or its equivalent on Amazon EMR:
ALTER TABLE table_name RECOVER PARTITIONS;
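For example, once the folders are renamed to the key=value form, a single repair call registers them (MSCK only recognizes folders named that way):
MSCK REPAIR TABLE table_name;
SHOW PARTITIONS table_name;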
When you create partitions dynamically with INSERT OVERWRITE, the partition metadata is created automatically and the partition folders are written in the key=value form, as in the sketch below.
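A minimal sketch of such a dynamic-partition insert, assuming the unpartitioned rows currently sit in a hypothetical staging table named staging_table with the question's columns:
-- staging_table is assumed; name, description and type match the question's schema.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- The partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE table_name PARTITION (`type`)
SELECT `name`, `description`, `type` FROM staging_table;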

Unable to create Parquet file in Hive

Can anyone please tell me what the error is in the below query?
insert overwrite directory 'user/cloudera/batch' stored as parquet select * from emp;
I am trying to create a Parquet table. I am getting the below error when using the above command:
cannot recognize input near 'stored' 'as' 'parquet' in select clause
To create a Parquet table, first create the table and store it as Parquet:
CREATE TABLE emp (x INT, y STRING) STORED AS PARQUET;
Now load the data into this table; then you can execute your query:
insert overwrite directory '/user/cloudera/batch' select * from emp;
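The load step is only described in prose above; a minimal sketch, assuming the Parquet-backed table is given a distinct name such as emp_parquet so it does not clash with the existing emp table (x and y stand in for emp's real columns):
-- emp_parquet is an assumed name for the Parquet table; emp is the existing source table.
CREATE TABLE emp_parquet (x INT, y STRING) STORED AS PARQUET;
-- Copy the rows; the files written under emp_parquet's warehouse directory are Parquet.
INSERT OVERWRITE TABLE emp_parquet SELECT * FROM emp;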

Can't read data in Presto - can in Hive

I have a Hive DB. I created a table compatible with the Parquet file format:
CREATE EXTERNAL TABLE `default.table`(
`date` date,
`udid` string,
`message_token` string)
PARTITIONED BY (
`dt` date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://Bucket/Folder';
I added partitions to this table, but I can't query the data.
In Hive: I can see the partitions when using "show partitions default.table", and I get the row count when using "select count(*) from default.table".
In Presto: I can see the partitions when using "show partitions default.table", but when I try to query the data itself it looks like there is no data: an empty result for "select *" and 0 for "select count(*)".
Hive cluster is AWS EMR, version: emr-5.9.0, Applications: Hive 2.3.0, Presto 0.184, instance type: r3.2xlarge.
Does someone know why I get these differences between Hive and Presto?
Thanks!

Impala can perform a COUNT(*) but not a SELECT * from a table

I came across a bizarre Impala behaviour. I created a table in Hue from a .csv file I copied into the Hadoop cluster. I can navigate the table correctly in Hue via the Metastore Manager, but I can't run the following query in Impala, as it throws an IllegalStateException: null exception:
select *
from my_db.my_table
limit 100;
The strange thing is that the following command retrieves the correct number of rows:
select
count(*)
from my_db.my_table;
The error is caused by an invalid column type. Not all Hive data types are supported in Impala: Impala has a TIMESTAMP type but no DATE type. When a table has a DATE column, it shows up as invalid_type when the table is described in Impala, and Impala cannot select that column. To fix it, try changing the column to TIMESTAMP, as sketched after the describe output below:
Describe <table name>;
| invalid_type | |
| invalid_type | |
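A minimal sketch of that fix; the table and column names are assumptions, the ALTER runs in Hive, and INVALIDATE METADATA tells Impala to reload the changed schema:
-- In Hive: change the assumed DATE column my_date to TIMESTAMP.
ALTER TABLE my_db.my_table CHANGE my_date my_date TIMESTAMP;
-- In Impala: reload the table metadata, after which select * works again.
INVALIDATE METADATA my_db.my_table;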
I'm getting the exact same issue. I changed the query to select each column from the table individually (i.e. select col1, col2, col3...etc.) and found that Impala didn't like a date datatype column. Changing it to timestamp fixed the issue and I can now do a select * from the table.

Load from Hive table into HDFS as Avro file

I want to load a file into HDFS (as an .avro file) from a Hive table.
Currently I am able to move a table from Hive to HDFS as a file, but I am not able to specify a particular format for my target file. Can someone help me with this?
So your question is really:
How do I convert a Hive table to a different storage format?
Create a new table with the same fields and types, stored as Avro, then insert into the new table from the old table.
INSERT OVERWRITE TABLE newtable SELECT * FROM oldtable
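A minimal sketch of the whole conversion, assuming the existing table is named oldtable and a Hive version (0.14 or later) that understands STORED AS AVRO; older versions need the Avro SerDe and input/output formats spelled out instead:
-- Option 1: create-table-as-select straight into Avro.
CREATE TABLE newtable STORED AS AVRO AS SELECT * FROM oldtable;
-- Option 2: declare the schema explicitly (col1/col2 are placeholders), then copy the rows.
-- CREATE TABLE newtable (col1 STRING, col2 INT) STORED AS AVRO;
-- INSERT OVERWRITE TABLE newtable SELECT * FROM oldtable;
Either way, the Avro data files end up under newtable's location in HDFS.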
