Sqoop changes Date to Long when ingested data is saved as Avro data - Oracle

I am using Sqoop to ingest data from Oracle into HDFS as Avro data. The date/timestamp columns are converted to long, so the values appear altered.
Example:
28-MAR-18 12.42.06.328000 PM changes to 1523401161454.
Any insights on this issue?
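For reference, the long value is the UNIX epoch time in milliseconds (see the answers further down), so it can be converted back on the Hive side. A minimal sketch, where my_avro_table and bigint_date_col are placeholder names:
-- from_unixtime() expects seconds, so divide the millisecond value by 1000 first.
SELECT from_unixtime(CAST(bigint_date_col / 1000 AS BIGINT)) AS readable_ts
FROM my_avro_table;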

Related

How to delete an existing record that is already loaded using Hive

I load data daily into an external Hive table from the local file system, and the table now holds around one year of data. Today the client informed me that yesterday's data was incorrect. How do I delete yesterday's data from a table that already contains such a large amount of data?
You can only delete data from a Hive table by using Hive transaction management, but there are certain limitations:
1) The file format must be ORC.
2) The table must be bucketed.
3) Transactions cannot be enabled on an external table, because it is outside the metastore's control.
By default the transaction management feature is off. You can turn it on by updating the hive-site.xml file.
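A minimal sketch of what that involves, assuming Hive 1.x-style property names (they vary slightly between versions) and an illustrative table daily_sales_acid with a load_dt column:
-- These settings normally go into hive-site.xml; shown here as session settings for brevity.
SET hive.support.concurrency = true;
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on = true;
SET hive.compactor.worker.threads = 1;
-- A managed, bucketed ORC table with transactions enabled.
CREATE TABLE daily_sales_acid (id BIGINT, amount DOUBLE, load_dt STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
-- Yesterday's incorrect load can then be removed by its load date (the date literal is just an example).
DELETE FROM daily_sales_acid WHERE load_dt = '2018-03-27';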

How to resolve date differences between Hive text file format and Parquet file format

We created an external Parquet table in Hive and inserted the existing text file data into it using INSERT OVERWRITE,
but we observed that the dates from the existing text file do not match those in the Parquet files.
Data from the two files:
txt file date: 2003-09-06 00:00:00
parquet file date: 2003-09-06 04:00:00
Questions:
1) How can we resolve this issue?
2) Why are we getting this discrepancy in the data?
We faced a similar issue when Sqooping tables from SQL Server; it turned out to be a driver/JAR issue.
When you do the INSERT OVERWRITE, try using CAST on the date fields, as in the sketch below.
This should work; let me know if you face any issues.
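For illustration only, with text_table, parquet_table, id and event_date as placeholder names:
-- Cast the date field explicitly while copying from the text table into the Parquet table.
INSERT OVERWRITE TABLE parquet_table
SELECT id,
       CAST(event_date AS TIMESTAMP) AS event_date
FROM text_table;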
Thanks for your help. We are using both Beeline and the Impala query editor in Hue to access the data stored in the Parquet table, and the timestamp issue occurs when we query via Impala in Hue.
This is most likely related to a known difference in the way Hive and Impala handle timestamp values:
- When Hive stores a timestamp value in Parquet format, it converts local time to UTC, and when it reads the data back it converts it to local time again.
- Impala, on the other hand, does no conversion when it reads the timestamp field, so UTC time is returned instead of local time.
If your servers are located in the Eastern time zone, this explains the +4h offset:
- the timestamp 2003-09-06 00:00 in the example should be understood as EDT (September 6 falls within daylight saving time, therefore UTC-4)
- 4 hours are added to the timestamp when it is stored by Hive
- the same offset is subtracted when it is read back by Hive, giving the correct value
- no correction is done when it is read back by Impala, which therefore shows 2003-09-06 04:00:00
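If the files have to stay as Hive wrote them, one common workaround is to undo the conversion in the Impala query itself with from_utc_timestamp(). This is only a sketch: parquet_table and event_ts are placeholder names, and the accepted time zone identifiers can vary by Impala version.
-- Impala returns the raw UTC value; convert it back to Eastern time at query time.
-- Alternatively, Impala can be started with --convert_legacy_hive_parquet_utc_timestamps=true
-- to apply the conversion automatically when reading Hive-written Parquet files.
SELECT from_utc_timestamp(event_ts, 'America/New_York') AS local_ts
FROM parquet_table;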

Hive - HBase integration: transactional update with timestamp

I am new to Hadoop and big data, and I am currently trying to figure out the possibilities of moving my data store to HBase. I have come across a problem that some of you might be able to help me with.
I have an HBase table "hbase_testTable" with column family "ColFam1". I have set the number of versions for "ColFam1" to 10, as I have to maintain a history of up to 10 updates to this column family, and that works fine. When I add new rows through the HBase shell with an explicit timestamp value it also works fine. Basically, I want to use the timestamp as my version control, so I specify it in the put:
put 'hbase_testTable', '1001', 'ColFam1:q1', '1000$', 3
where '3' is my version. Everything works fine.
Now I am trying to integrate this with a Hive external table, with all the mappings set up to match the HBase table, as below:
CREATE EXTERNAL TABLE testtable (id string, q1 string, q2 string, q3 string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,colfam1:q1,colfam1:q2,colfam1:q3")
TBLPROPERTIES ("hbase.table.name" = "testtable", "transactional" = "true");
This works fine with normal insertion: it updates the HBase table, and vice versa.
However, even though the external table is marked "transactional", I am not able to update the data from Hive. It gives me the error:
FAILED: SemanticException [Error 10294]: Attempt to do update or delete
using transaction manager that does not support these operations
That said, any updates made to the HBase table are reflected immediately in the Hive table.
I can update the HBase table through the Hive external table by inserting a row with the same rowid and new data for the column.
Is it possible for me to control the timestamp being written to the referenced HBase table (like 4, 5, 6, 7, etc.)? Please help.
The timestamp is one of the important elements of HBase versioning. You are trying to create your own timestamps, which works fine at the HBase level.
One point: you should be very careful that the values are unique and non-negative. You can look at "Custom Versioning" in the HBase: The Definitive Guide book.
Now you have Hive on top of HBase. As per the documentation,
there is currently no way to access the HBase timestamp attribute, and queries always access data with the latest timestamp.
That covers the reading part. For putting data, you can look here.
It still says that you have to give a valid timestamp, not any other value.
Future versions are expected to expose the timestamp attribute.
I hope this gives you a better idea of how to deal with custom timestamps in Hive-HBase integration.
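As a sketch of the writing side: depending on the Hive version, the HBase storage handler may accept an hbase.put.timestamp table property, but the value must be a valid epoch timestamp in milliseconds rather than a small version number such as 3 or 4. Treat the property name and the literal value below as assumptions to verify against your Hive release.
-- Hypothetical variant of the DDL above: every put issued from Hive is written
-- with the fixed timestamp given below (an arbitrary epoch value in milliseconds).
CREATE EXTERNAL TABLE testtable_ts (id string, q1 string, q2 string, q3 string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,colfam1:q1,colfam1:q2,colfam1:q3")
TBLPROPERTIES ("hbase.table.name" = "testtable", "hbase.put.timestamp" = "1522233726328");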

Date field issues while using Sqoop with --as-avrodatafile option

Following is the gist of my problem.
Env:
Hadoop 2 (CDH5.1)
Database: Oracle 11g
Scenario:
I'm Sqooping fact and dimension tables from the database into HDFS. Initially, I had challenges handling nulls (handled using --null-string and --null-non-string, which were set to \N as per the recommendation). Everything was fine as long as the Hive table that was built had string fields, even for dates and numerics.
Solution so far:
Based on a recommendation, I moved to importing in Avro format. I've built the Hive table on the Avro data and I'm able to query the tables. Now I need to create Hive joins and convert all the fields to their required types, e.g. dates to date/timestamp, numerics to int/bigint, etc. After the Sqoop import, the generated Avro schema had converted all date fields to long, and the Hive table shows bigint for those columns.
I'm confused about how Sqoop handles nulls and how these fields are to be handled in Hive/HDFS, MapReduce, etc.
Could anybody suggest a practice that has been adopted and could be leveraged here?
Thanks
Venkatesh
It was a problem for me too when I imported schemas from Parquet tables, as Parquet stores timestamps as bigint. So I guess the underlying problem is that Parquet does not have a separate data type to store timestamps. I don't use Avro very often, but I think the same is true for Avro. So if you Sqoop Oracle dates/timestamps into a set of Parquet/Avro files, the storage type (bigint) is how the value is stored, not how you want to access it (as a timestamp/date).
The time is stored as the number of milliseconds since the UNIX epoch (Jan 1st, 1970). The Hive/Spark/Impala function from_unixtime() takes a number of seconds, so the solution is to convert those millisecond values to second resolution:
SELECT ..
, from_unixtime(cast(bigint_column/1000 as bigint))
So you will see timestamps like:
1999-04-14 06:00:00
1999-04-15 06:00:00
Notice the 6-hour shift. In my case the original Oracle data type was DATE without any time part (00:00:00), but I got times shifted by 6 hours because of my time zone (MST). So, to get the exact dates:
SELECT ..
, from_unixtime(cast(bigint_column/1000 - 6*3600 as bigint))
which resulted in:
1999-04-14 00:00:00
1999-04-15 00:00:00
P.S. See "Data Type Considerations for Parquet Tables":
http://www.cloudera.com/documentation/archive/impala/2-x/2-1-x/topics/impala_parquet.html#parquet_data_types_unique_1
INT96 -> TIMESTAMP
Thanks, Gergely. The approach we followed to overcome this issue was to import the date fields as String when Sqooping into HDFS. This was achieved using:
sqoop --options-file $OPTION_FILE_NAME \
--table $TABLE_NAME \
--map-column-java DAY_END_DTE=String \
--target-dir $TARGET_DIR \
--as-avrodatafile
This causes the timestamp information to be Sqooped as a string in 'yyyy-mm-dd hh:mm:ss.f' format, which can then be cast into a date field.
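For example, with my_avro_table as a placeholder for the table built on the Avro data:
-- DAY_END_DTE was imported as a 'yyyy-mm-dd hh:mm:ss.f' string, so a plain cast is enough.
SELECT CAST(day_end_dte AS TIMESTAMP) AS day_end_dte_ts
FROM my_avro_table;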
Note that this is not a solution, it is a workaround.
You can convert the imported data to a timestamp with this command:
select cast(long_column as TIMESTAMP) from imported_table;
BR,
Gergely

Hive timestamp import from Netezza

I am ETLing a Netezza DB into a Hive target DB, but I keep running into issues with timestamps. The source DB for the ETL into Netezza is Oracle, and the "dates" there are stored as varchar. When ETLed into Netezza they are transformed into the Netezza format and accepted correctly.
When extracting this data from Netezza into Hive, I get an exception from java.sql.Timestamp saying that the timestamp is not in the appropriate format.
Note: due to the nature and specificity of the error on this system I cannot show output or logs.
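Without the logs this is only a guess, but a workaround consistent with the other answers above is to bring the column over as a string and parse it explicitly on the Hive side. In the sketch below, stg_table, ts_string and the format pattern are assumptions:
-- Parse the string with an explicit pattern instead of relying on the default java.sql.Timestamp format.
-- unix_timestamp(string, pattern) returns seconds, which from_unixtime() renders as a readable timestamp.
SELECT from_unixtime(unix_timestamp(ts_string, 'yyyy-MM-dd HH:mm:ss.SSS')) AS parsed_ts
FROM stg_table;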
