HDFS String data to timestamp in hive table - hadoop

Hi, I have data in HDFS as a string, '2015-03-26T00:00:00+00:00'. When I try to load it into a Hive table whose column is a timestamp, the load does not work and I get NULL values.
If I declare the column as string, the data loads into the Hive table,
but if I declare the column as timestamp, every value in that column comes back NULL.
Eg: HDFS - '2015-03-26T00:00:00+00:00'
Hive table - create table t1(my_date string)
I get the output '2015-03-26T00:00:00+00:00'
If I specify create table t1(my_date timestamp), I see all NULL values.
Can anyone help me with this?

Timestamps in text files have to use the format yyyy-mm-dd hh:mm:ss[.f...]. If they are in another format, declare them as the appropriate type (INT, FLOAT, STRING, etc.) and use a UDF to convert them to timestamps.
See the Hive documentation:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-Timestamps

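You have to use a staging table. Load the data into the staging table as a string, then, when inserting into the final table, use a UDF such as the one below to convert the string value to a timestamp. The pattern must match how the strings are actually stored; the one shown here is only an example.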
from_unixtime(unix_timestamp(column_name, 'dd-MM-yyyy HH:mm'))
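To make the staging idea concrete outside of Hive, here is a minimal Python sketch of the same conversion applied to the string from the question. The function name `staged_string_to_timestamp` is hypothetical, and the sketch assumes the result should be expressed in UTC. It is an illustration of the pattern-matching step, not Hive's own implementation.

```python
from datetime import datetime, timezone
from typing import Optional


def staged_string_to_timestamp(raw: str) -> Optional[str]:
    """Convert an ISO-8601 string with an offset into 'yyyy-MM-dd HH:mm:ss' text.

    Hypothetical helper: returns None when the string does not match the
    pattern, which is comparable to the NULL that Hive shows.
    """
    try:
        # %z accepts offsets written as +00:00 (Python 3.7 and later)
        parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S%z")
    except ValueError:
        return None
    # Assumption: normalise to UTC before dropping the offset
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


print(staged_string_to_timestamp("2015-03-26T00:00:00+00:00"))
print(staged_string_to_timestamp("not a timestamp"))
```

The second call shows why a mismatched pattern gives an empty result instead of an error.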

Related

CurrentTime() generated from Pig showing as NULL in Hive Datetime column

In a Pig script I generated a datetime column with CurrentTime() as its value.
When I read the output generated by the Pig script through a Hive table, the column shows as NULL.
Is there any way to load the current datetime column from Pig so that it shows in the Hive table?
The data in the file looks like 2020-07-24T14:38:26.748-04:00, and the column in the Hive table is of the timestamp datatype.
A Hive timestamp should be in 'yyyy-MM-dd HH:mm:ss.SSS' format (without the T and without the timezone offset -04:00).
1. Define the Hive column as STRING.
2. Transform the string into a format compatible with the Hive timestamp type.
If you do not need milliseconds:
--use your string column instead of the literal
from_unixtime(unix_timestamp('2020-07-24T14:38:26.748-04:00',"yyyy-MM-dd'T'HH:mm:ss.SSSX"))
Returns:
2020-07-24 18:38:26
If you need milliseconds, additionally extract them and concatenate them with the transformed timestamp:
select concat(from_unixtime(unix_timestamp('2020-07-24T14:38:26.748-04:00',"yyyy-MM-dd'T'HH:mm:ss.SSSX")),
'.',regexp_extract('2020-07-24T14:38:26.748-04:00','\\.(\\d{3})',1))
Result:
2020-07-24 18:38:26.748
Both results are compatible with the Hive timestamp type. If necessary, they can be cast explicitly with CAST(str as timestamp), though comparing these strings with timestamps or inserting them into a timestamp column works without an explicit cast.
Alternatively, you can format the timestamp in Pig as 'yyyy-MM-dd HH:mm:ss.SSS'. I do not have Pig and cannot check how ToString works.
Also for LazySimpleSerDe, alternative timestamp formats can be supported by providing the format to the SerDe property "timestamp.formats" (as of release 1.2.0 with HIVE-9298). Try "yyyy-MM-dd'T'HH:mm:ss.SSSX"
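For anyone who wants to check the expected values before running the Hive query, the two-step approach above (convert at seconds precision, then copy the milliseconds out of the text with a regex) can be sketched in Python. The function name `to_hive_timestamp_ms` is hypothetical, and the sketch assumes a UTC session, which is what the results shown above imply.

```python
import re
from datetime import datetime, timezone


def to_hive_timestamp_ms(raw: str) -> str:
    """Hypothetical helper mirroring concat(from_unixtime(...), '.', regexp_extract(...))."""
    # Seconds-precision part, shifted to UTC (assumption: UTC session)
    seconds_part = (
        datetime.fromisoformat(raw)
        .astimezone(timezone.utc)
        .strftime("%Y-%m-%d %H:%M:%S")
    )
    # Milliseconds are copied from the text, like regexp_extract(str, '\\.(\\d{3})', 1)
    match = re.search(r"\.(\d{3})", raw)
    millis = match.group(1) if match else "000"
    return f"{seconds_part}.{millis}"


print(to_hive_timestamp_ms("2020-07-24T14:38:26.748-04:00"))
```

Copying the milliseconds as text avoids any rounding, which is the same reason the Hive version uses regexp_extract instead of arithmetic.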

How do I convert a sequence file to parquet format

I have a HIVE Table (test) that I need to create in the PARQUET format. I will be using a bunch of SEQUENCE files in order to create and insert into a table.
Once the table is created, is there a way to convert into PARQUET? I mean I know we could have done, say
CREATE TABLE default.test( user_id STRING, location STRING)
PARTITIONED BY ( dt INT ) STORED AS PARQUET
initially while creating the table itself. However, in my case I am forced to use SEQUENCE files to create the table first because it is the format that I have to begin with and cannot directly convert to PARQUET.
Is there a way I could convert into parquet after the table is created and data inserted?
To convert from sequence file to Parquet, you need to load the data into a new table with CTAS (CREATE TABLE AS SELECT).
The question is tagged with presto, so I am giving you the Presto syntax for this. I am including partitioning because the example in the question contains it.
CREATE TABLE test_parquet WITH(format='PARQUET', partitioned_by=ARRAY['dt']) AS
SELECT * FROM test_sequencefile;

How to specify the timestamp format when creating a table using a hdfs directory

I have the following csv file located at the path/to/file in my hdfs store.
1842,10/1/2017 0:02
7424,10/1/2017 4:06
I'm trying to create a table using the below command:
create external table t
(
number string,
reported_time timestamp
)
ROW FORMAT delimited fields terminated BY ','
LOCATION 'path/to/file';
I can see in the Impala query editor that the reported_time column in table t is always null. I guess this is due to the fact that my timestamp isn't in an accepted timestamp format.
Question:
How can I specify that the timestamp column should be of the dd/mm/yyyy hh:min format so that it correctly parses the timestamp?
You can't customize the timestamp format (in my experience), but you can create the table with the column as a string and then convert the string to a timestamp, passing the pattern to unix_timestamp so it knows how to parse the value:
select number,
reported_time,
from_unixtime(unix_timestamp(reported_time, 'dd/MM/yyyy HH:mm')) as reported_ts
from t;
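As a quick sanity check of what the converted values should look like, here is a small Python sketch that parses the two sample rows from the question. It assumes the day comes first (dd/mm/yyyy), as the question states; the function name `parse_reported_time` is hypothetical.

```python
from datetime import datetime


def parse_reported_time(raw: str) -> str:
    """Hypothetical helper: parse 'd/m/yyyy H:mm' text and return 'yyyy-MM-dd HH:mm:ss'."""
    # Assumption from the question: day first, then month
    parsed = datetime.strptime(raw, "%d/%m/%Y %H:%M")
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


# Sample rows copied from the CSV in the question
rows = ["1842,10/1/2017 0:02", "7424,10/1/2017 4:06"]
for row in rows:
    number, reported_time = row.split(",")
    print(number, parse_reported_time(reported_time))
```

If the file is actually month-first, swapping the first two fields in the pattern is the only change needed, both here and in the query.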

Handling dates in Hadoop

I'm new to the Big Data/Hadoop ecosystem and have noticed that dates are not always handled in a standard way across technologies. I plan to ingest data from Oracle into Hive tables on HDFS using Sqoop with the Avro and Parquet file formats. Hive keeps importing my dates as BIGINT values; I'd prefer TIMESTAMPs. I've tried using the "--map-column-hive" overrides, but it still does not work.
Looking for suggestions on the best way to handle dates for this use case.
Parquet File Format
If you use Sqoop to convert RDBMS data to Parquet, be careful with interpreting any resulting values from DATE, DATETIME, or TIMESTAMP columns. The underlying values are represented as the Parquet INT64 type, which is represented as BIGINT in the Impala table. The Parquet values represent the time in milliseconds, while Impala interprets BIGINT as the time in seconds. Therefore, if you have a BIGINT column in a Parquet table that was imported this way from Sqoop, divide the values by 1000 when interpreting as the TIMESTAMP type.
Avro File Format
Currently, Avro tables cannot contain TIMESTAMP columns. If you need to store date and time values in Avro tables, as a workaround you can use a STRING representation of the values, convert the values to BIGINT with the UNIX_TIMESTAMP() function, or create separate numeric columns for individual date and time fields using the EXTRACT() function.
You can also use your Hive query like this to get the result in your desired TIMESTAMP format.
FROM_UNIXTIME(CAST(SUBSTR(timestamp_column, 1,10) AS INT)) AS timestamp_column;
Another workaround is to import the data using --query in the sqoop command, where you can cast your column to the timestamp type.
Example
--query 'SELECT CAST (INSERTION_DATE AS TIMESTAMP) FROM tablename WHERE $CONDITIONS'
If your SELECT query gets a bit long, you can use configuration files to shorten the length of the command line call. Here is the reference
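To illustrate the milliseconds-versus-seconds point from the Parquet notes above, here is a minimal Python sketch. The function names and the sample value are made up for illustration. The first function divides by 1000 in effect (by treating the value as milliseconds), and the second mimics the SUBSTR(timestamp_column, 1, 10) trick, which only works while the millisecond value has exactly 13 digits.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_to_timestamp(epoch_millis: int) -> str:
    """Interpret a BIGINT as milliseconds since the epoch (UTC assumed)."""
    moment = EPOCH + timedelta(milliseconds=epoch_millis)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def first_ten_digits_to_timestamp(epoch_millis: int) -> str:
    """Mimic FROM_UNIXTIME(CAST(SUBSTR(col, 1, 10) AS INT)) on a 13-digit value."""
    seconds = int(str(epoch_millis)[:10])
    moment = EPOCH + timedelta(seconds=seconds)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


sample = 1427328000000  # illustrative value, not taken from the question
print(millis_to_timestamp(sample))
print(first_ten_digits_to_timestamp(sample))
```

Both calls agree for 13-digit inputs. Treating the same number as seconds instead would land tens of thousands of years in the future, which is the usual symptom of skipping the division.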

Can we use TEXT FILE format for Hive Table with Snappy compression?

I have a Hive external table in HDFS and I am trying to create a Hive managed table on top of it. I am using the textfile format with Snappy compression, but I want to know how it helps the table.
CREATE TABLE standard_cd
(
last_update_dttm TIMESTAMP,
last_operation_type CHAR (1) ,
source_commit_dttm TIMESTAMP,
transaction_dttm TIMESTAMP ,
transaction_type CHAR (1)
)
PARTITIONED BY (process_dt DATE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
TBLPROPERTIES ("orc.compress" = "SNAPPY");
Let me know if there are any issues with creating the table in this format.
As such, there is no issue while creating the table,
but there is a difference in the properties:
Table created and stored as TEXTFILE:
Table created and stored as ORC:
However, the size of both tables was the same after loading some data.
Also check the documentation about the ORC file format.

Resources