Handling dates in Hadoop - Oracle

I'm new to the Big Data/Hadoop ecosystem and have noticed that dates are not always handled in a standard way across technologies. I plan to ingest data from Oracle into Hive tables on HDFS using Sqoop, with the Avro and Parquet file formats. Hive keeps importing my dates as BIGINT values, but I'd prefer TIMESTAMPs. I've tried using the "--map-column-hive" overrides... but it still does not work.
Looking for suggestions on the best way to handle dates for this use case.

Parquet File Format
If you use Sqoop to convert RDBMS data to Parquet, be careful with interpreting any resulting values from DATE, DATETIME, or TIMESTAMP columns. The underlying values are represented as the Parquet INT64 type, which is represented as BIGINT in the Impala table. The Parquet values represent the time in milliseconds, while Impala interprets BIGINT as the time in seconds. Therefore, if you have a BIGINT column in a Parquet table that was imported this way from Sqoop, divide the values by 1000 when interpreting as the TIMESTAMP type.
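For example, a query along these lines (hypothetical table and column names) recovers a readable timestamp from such a Sqoop-written column:
-- the stored value is epoch milliseconds; from_unixtime() expects epoch seconds
SELECT from_unixtime(CAST(created_date_ms / 1000 AS BIGINT)) AS created_date_ts
FROM orders_parquet;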
Avro File Format
Currently, Avro tables cannot contain TIMESTAMP columns. If you need to store date and time values in Avro tables, as a workaround you can use a STRING representation of the values, convert the values to BIGINT with the UNIX_TIMESTAMP() function, or create separate numeric columns for individual date and time fields using the EXTRACT() function.
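As a rough sketch of the STRING workaround (hypothetical table and column names; Hive's year() is shown here in place of Impala's EXTRACT()), you keep the value as text in the Avro table and derive what you need at query time:
-- event_time is stored as STRING in the Avro table, e.g. '2015-03-26 00:00:00'
SELECT event_time,
       unix_timestamp(event_time) AS event_epoch_seconds,  -- BIGINT seconds since epoch
       year(CAST(event_time AS TIMESTAMP)) AS event_year   -- separate numeric field
FROM avro_events;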
You can also write your Hive query like this to get the result in your desired TIMESTAMP format:
FROM_UNIXTIME(CAST(SUBSTR(timestamp_column, 1,10) AS INT)) AS timestamp_column;
Another workaround is to import the data using --query in the Sqoop command, where you can cast your column to the timestamp type.
Example
--query 'SELECT CAST (INSERTION_DATE AS TIMESTAMP) FROM tablename WHERE $CONDITIONS'
If your SELECT query gets a bit long, you can use an options file (Sqoop's --options-file) to shorten the command line call; see the Sqoop documentation for reference.

Related

CurrentTime() generated from Pig showing as NULL in Hive Datetime column

In a Pig script I have generated a datetime column whose value is CurrentTime().
When reading that output from the Hive table, the column shows as NULL.
Is there any way I can load the current datetime column from Pig so that it shows up in the Hive table?
The data in the file looks like 2020-07-24T14:38:26.748-04:00, and in the Hive table the column is of timestamp datatype.
A Hive timestamp should be in 'yyyy-MM-dd HH:mm:ss.SSS' format (without the T and without the timezone -04:00).
1. Define the Hive column as STRING
2. Transform the string to a format compatible with a Hive timestamp
If you do not need milliseconds:
--use your string column instead of literal
from_unixtime(unix_timestamp('2020-07-24T14:38:26.748-04:00',"yyyy-MM-dd'T'HH:mm:ss.SSSX"))
Returns:
2020-07-24 18:38:26
If you need milliseconds, additionally extract the milliseconds and concatenate them with the transformed timestamp:
select concat(from_unixtime(unix_timestamp('2020-07-24T14:38:26.748-04:00',"yyyy-MM-dd'T'HH:mm:ss.SSSX")),
'.',regexp_extract('2020-07-24T14:38:26.748-04:00','\\.(\\d{3})',1))
Result:
2020-07-24 18:38:26.748
Both results are compatible with the Hive timestamp type and, if necessary, can be cast explicitly using CAST(str AS TIMESTAMP), though comparing these strings with timestamps or inserting them into a timestamp column works without an explicit cast.
Alternatively, you can format the timestamp in Pig as 'yyyy-MM-dd HH:mm:ss.SSS'; I do not have Pig available and cannot check how ToString works.
Also, for LazySimpleSerDe, alternative timestamp formats can be supported by providing the format in the SerDe property "timestamp.formats" (as of release 1.2.0, with HIVE-9298). Try "yyyy-MM-dd'T'HH:mm:ss.SSSX".
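A minimal sketch of that SerDe property (assuming Hive 1.2.0 or later and a text-backed table; the table name is hypothetical):
CREATE TABLE pig_output (
  event_time TIMESTAMP
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('timestamp.formats' = "yyyy-MM-dd'T'HH:mm:ss.SSSX")
STORED AS TEXTFILE;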

Stop sqoop from converting datetime to bigint

Recently I noticed that whenever I ingest from a SQL database using Sqoop, all datetime fields are converted to a bigint (epoch * 1000) instead of to String.
Important to note: I'm storing as parquet.
I have been trying a bunch of sqoop flags like "--map-column-java" but I don't want to manually define this for hundreds of columns in thousands of tables.
What flag am I missing to prevent this sqoop behaviour?
It seems that sqoop didn't do this when storing in plain text.
Instead of letting sqoop do its arcane magic on my tables, I decided to do the following:
1. Ingest into a temporary table, stored as text.
2. Create a table (if not exists) like the temporary table, stored as Parquet.
3. Insert overwrite from the text-stored temporary table into the Parquet-stored table.
This allows for proper date formatting without the hassle of (possibly nonexistent) configuration and settings tweaking in Sqoop.
The only tradeoff is that it's slightly slower.
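A minimal HiveQL sketch of that flow (hypothetical table and column names):
-- 1. Temporary staging table, stored as text; the datetime arrives as a plain string.
CREATE TABLE staging_orders_txt (
  order_id   BIGINT,
  created_at STRING
)
STORED AS TEXTFILE;

-- 2. Final table, stored as Parquet, with a proper TIMESTAMP column.
CREATE TABLE IF NOT EXISTS orders_parquet (
  order_id   BIGINT,
  created_at TIMESTAMP
)
STORED AS PARQUET;

-- 3. Copy across, casting the 'yyyy-MM-dd HH:mm:ss[.f]' string to TIMESTAMP.
INSERT OVERWRITE TABLE orders_parquet
SELECT order_id,
       CAST(created_at AS TIMESTAMP)
FROM staging_orders_txt;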

HDFS String data to timestamp in hive table

Hi, I have data in HDFS as a string, '2015-03-26T00:00:00+00:00'. I want to load this data into a Hive table with the column declared as timestamp, but the load does not work and I get NULL values.
If I specify the column as string, I get the data into the Hive table,
but if I specify the column as timestamp, I am not able to load the data and I get all NULL values in that column.
Eg: HDFS - '2015-03-26T00:00:00+00:00'
Hive table - create table t1(my_date string)
I can get the output as '2015-03-26T00:00:00+00:00'
but if I specify create table t1(my_date timestamp), I see all NULL values.
Can anyone help me with this?
Timestamps in text files have to use the format yyyy-mm-dd hh:mm:ss[.f...]. If they are in another format declare them as the appropriate type (INT, FLOAT, STRING, etc.) and use a UDF to convert them to timestamps.
Go through the link below:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-Timestamps
You have to use a staging table. In the staging table, load the value as a String, and in the final table use a UDF as below to convert the string value to a Timestamp:
from_unixtime(unix_timestamp(column_name, 'dd-MM-yyyy HH:mm'))
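For the ISO-style string in this question, a pattern along these lines should work (a hedged sketch: the staging table and column names are hypothetical, and the 'XXX' timezone pattern assumes Hive's SimpleDateFormat-based parsing):
-- '2015-03-26T00:00:00+00:00' -> '2015-03-26 00:00:00'
SELECT from_unixtime(unix_timestamp(my_date, "yyyy-MM-dd'T'HH:mm:ssXXX")) AS my_date_ts
FROM staging_t1;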

Date field issues while using Sqoop with --as-avrodatafile option

Following is the gist of my problem.
Env:
Hadoop 2 (CDH5.1)
Database: Oracle 11g
Scenarios:
I'm sqooping fact and dimension tables from the database into HDFS. Initially, I had challenges handling nulls (which were handled using --null-string and --null-non-string, set to \N as per the recommendation). Everything was fine when the Hive table that was built had string fields, even for dates and numerics.
Solution so far
Based on a recommendation, I moved to importing using the Avro format. I've built the Hive table on the Avro data and I'm able to query the tables. Now I need to create Hive joins and convert all the fields to their required types, e.g. dates to dates/timestamps, numerics to int/bigint, etc. After the sqooping, the generated Avro schema had converted all date fields to long, and the Hive table shows bigint for those columns.
I'm also confused about how Sqoop handles nulls and how those are to be handled in Hive/HDFS/MR etc.
Could anybody suggest any practice that has been adopted that could be leveraged here?
Thanks
Venkatesh
It was a problem for me too when I imported the schema from Parquet tables, as Parquet stores the timestamp as a bigint. So I guess the underlying problem is that Parquet does not have a separate data type to store timestamps. I don't use Avro very often, but I think it is true for Avro too. So if you sqoop an Oracle date/timestamp into a set of Parquet/Avro files, then the storage type (bigint) is how it is stored, not how you want to access it (as timestamp/date).
That time is stored as the number of milliseconds since the UNIX epoch (Jan 1st 1970). Hive/Spark/Impala provide a from_unixtime() function that takes a number of seconds, so the solution is to convert those ms values to s resolution:
SELECT ..
, from_unixtime(cast(bigint_column/1000 as bigint))
So you will see timestamps like:
1999-04-14 06:00:00
1999-04-15 06:00:00
Notice the 6-hour shift. In my case the original Oracle data type was DATE without any time part (00:00:00), but I got the time shifted by 6 hours because of my timezone (MST). So to get the exact dates:
SELECT ..
, from_unixtime(cast(bigint_column/1000 - 6*3600 as bigint))
which resulted in:
1999-04-14 00:00:00
1999-04-15 00:00:00
ps. "Data Type Considerations for Parquet Tables"
http://www.cloudera.com/documentation/archive/impala/2-x/2-1-x/topics/impala_parquet.html#parquet_data_types_unique_1 :
INT96 -> TIMESTAMP
Thanks Gergely. The approach we followed to overcome this issue was to Sqoop-import the date fields as String type into HDFS. This was achieved using:
sqoop --options-file $OPTION_FILE_NAME \
--table $TABLE_NAME \
--map-column-java DAY_END_DTE=String \
--target-dir $TARGET_DIR \
--as-avrodatafile
This causes the timestamp information to be sqooped as a string in 'yyyy-mm-dd hh:mm:ss.f' format, which can then be cast to a date/timestamp field.
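For example (the column name comes from the command above; the table name is hypothetical):
SELECT CAST(DAY_END_DTE AS TIMESTAMP) AS day_end_ts
FROM avro_fact_table;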
It is not a solution, it is a workaround:
You can convert the imported data to a timestamp with this command:
select cast(long_column as TIMESTAMP) from imported_table;
BR,
Gergely

Hadoop Hive - best use cases to create a custom Hive Input and Output formats?

Just wanted to understand what the best use cases are for creating a custom Hive InputFormat and OutputFormat.
If any of you have created one, could you please let me know when you decided to develop a custom Input/Output format?
Thanks,
To make Hive varchar behave like Oracle varchar2:
While working on an Oracle to Hadoop migration, we came across a behavior in Oracle where, if the length of the data for a varchar2 column exceeds the value defined in the table DDL, Oracle rejects the record.
Ex: Let's say we have a column 'name' in Oracle and Hadoop with a max length of 10 bytes:
name varchar2(10 BYTE) - Oracle
name varchar(10) - Hive
If the value of the name field is "lengthgreaterthanten", Oracle rejects the record because Oracle applies the schema at write time, whereas Hive reads "lengthgrea", i.e. 10 characters, because Hive applies the schema only at the time of reading the records from HDFS.
To get around this problem, we came up with a custom input format that checks the length of the varchar field by splitting on the delimiter. If the length is greater than the specified length, it skips to the next record; otherwise, the record is written to HDFS.
Hope this helps.
Thanks
Some of the various file formats used with Hive are the RCFile, Parquet and ORC file formats. These are columnar file formats. This gives the advantage that when you are reading large tables you don't have to read and process all the data. Most aggregation queries refer to only a few columns rather than all of them. This speeds up your processing hugely.
Another application could be storing, reading and processing data in your own custom input format, where the data might be stored differently from a CSV structure. These might be binary files or any other structure.
You will have to follow the documentation to create input formats. For details you can follow the link: Custom InputFormat with Hive
