How to specify the timestamp format when creating a table using a hdfs directory - hadoop

I have the following csv file located at the path/to/file in my hdfs store.
1842,10/1/2017 0:02
7424,10/1/2017 4:06
I'm trying to create a table using the below command:
create external table t
(
number string,
reported_time timestamp
)
ROW FORMAT delimited fields terminated BY ','
LOCATION 'path/to/file';
I can see in the impala query editor that the reported_time column in the table t is always null. I guess this is due the fact that my timestamp wasn't in an accepted timestamp format.
Question:
How can I specify that the timestamp column should be of the dd/mm/yyyy hh:min format so that it correctly parses the timestamp?

You can't customize the timestamp(as per my exp*) but you can create the table with string data type and then you can convert string to timestamp as below:
select number,
reported_time,
from_unixtime(unix_timestamp(reported_time),'dd/MM/yyyy HH:mm') as reported_time
from t;

Related

CurrentTime() generated from Pig showing as NULL in Hive Datetime column

In Pig script I have generated datetime column with its value as CurrentTime().
While reading the data from Hive Table for the output generated by PigScript, it shows as NULL.
Is there any way that I can load the current datetime column from PIG to show in Hive Table?
The data in the file looks like 2020-07-24T14:38:26.748-04:00 and in the hive table the column is of timestamp datatype
Hive timestamp should be in 'yyyy-MM-dd HH:mm:ss.SSS' format (without T and timezone -04:00)
1.Define Hive column as STRING
2.Transfom string to format compatible with Hive timestamp
If you do not need milliseconds:
--use your string column instead of literal
from_unixtime(unix_timestamp('2020-07-24T14:38:26.748-04:00',"yyyy-MM-dd'T'HH:mm:ss.SSSX"))
Returns:
2020-07-24 18:38:26
If you need milliseconds then additionally extract milliseconds and concatenate with transformed timestamp:
select concat(from_unixtime(unix_timestamp('2020-07-24T14:38:26.748-04:00',"yyyy-MM-dd'T'HH:mm:ss.SSSX")),
'.',regexp_extract('2020-07-24T14:38:26.748-04:00','\\.(\\d{3})',1))
Result:
2020-07-24 18:38:26.748
Both results are compatible with Hive timestamp and if necessary can be cast explicitly to Timestamp type using CAST(str as timestamp) function, though comparing these strings with timestamps or inserting into timestamp works without explicit cast.
Alternatively you can format timestamp in Pig to be 'yyyy-MM-dd HH:mm:ss.SSS' I do not have Pig and can not check how ToString works.
Also for LazySimpleSerDe, alternative timestamp formats can be supported by providing the format to the SerDe property "timestamp.formats" (as of release 1.2.0 with HIVE-9298). Try "yyyy-MM-dd'T'HH:mm:ss.SSSX"

Filename extract and insert into column values hive

I have a file called : akolp9app1a_170905_0000.txt
I need to split the values
hostname= akolp9app1a
date=170905 (convert into proper data format)
Now create a table in hive with 2 columns hostname and date and insert this values into the table.
any suggestion
Thanks.
You can use virtual columns, for example INPUT__FILE__NAME. It gives the input file's name.
Then you can use split (or) substring (or) regexp_extract string functions on the input__file__name field and create hostname,date values.
Example:-
the below select query gives date field value as 170905, like this way build your query using string functions to extract hostname
hive> select split(INPUT__FILE__NAME,'[\_]')[1] `date` from tablename;
Store them to separate table by using insert statement.

Partition column equal to current date in Hive

I am trying to load data into a Hive table using partition.
The code is as follow:
CREATE EXTERNAL TABLE URL(url STRING, clicks INT)
COMMENT 'Unique Clicks per URL'
PARTITIONED BY(dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/mypath/URL';
LOAD DATA INPATH '/inputpath/' INTO TABLE URL
PARTITION (dt=date_format(CURRENT_TIMESTAMP, "yyyy.MM.dd HH:mm:ss"));
I am gettin the following error:
FAILED: ParseException line 4:14 cannot recognize input near
'date_format' '(' 'CURRENT_TIMESTAMP' in constant
I tried using
SET hive.exec.dynamic.partition.mode=nonstrict;
but nothing changed.
Why is it not working?
How to set the current date as partition column?
Thank you in advance.
Lorenzo
Why move the files when you can create the external table on top of them?
LOAD DATA INPATH just moves the files (HDFS metadata operation) "as is", to the table's location.
Why define the partition column as a string when it is clearly a date?
CREATE EXTERNAL TABLE URL ... PARTITIONED BY(dt DATE) ...
Why are you trying to use non-ISO formats (yyyy.MM.dd)?
ISO date format is yyyy-MM-dd
Since it seems the partition information is not part of the data you have 3 options:
1.
Use a constant (no expression are allowed, including functions), e.g.
LOAD DATA INPATH '/inputpath/' INTO TABLE URL PARTITION (dt=date '2017-03-04');
2.
Create an additional table,URL_STG, similar to URL but without partition and use it to insert the partitions dynamically.
set hive.exec.dynamic.partition.mode=nonstrict;
insert into URL select *,current_date from URL_STG;
3.
Supply the date as a variable from the CLI
hive --hivevar dt=$(date +"%Y-%m-%d") -e \
'LOAD DATA INPATH '\''/inputpath/'\'' INTO TABLE URL PARTITION (dt=date '\''${hivevar:dt}'\'')'

Can we use TEXT FILE format for Hive Table with Snappy compression?

I have an hive external table in the HDFS and i am trying to create a hive managed table above it.i am using textfile format with snappy compression but i want to know how it helps the table.
CREATE TABLE standard_cd
(
last_update_dttm TIMESTAMP,
last_operation_type CHAR (1) ,
source_commit_dttm TIMESTAMP,
transaction_dttm TIMESTAMP ,
transaction_type CHAR (1)
)
PARTITIONED BY (process_dt DATE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
TBLPROPERTIES ("orc.compress" = "SNAPPY");
Let me know if any issues in creating in this format.
As such their is no issue while creating.
but difference in properties:
Table created and stored as TEXTFILE:
Table created and stored as ORC:
although the size of both tables were same after loading some data.
also check documentation about ORC file format

HDFS String data to timestamp in hive table

Hi I have a data in HDFS as a string '2015-03-26T00:00:00+00:00' ..if i want to load this data into Hive table (column as timestamp).i am not able to load and i am getting null values.
if i specify column as string i am getting the data into hive table
but if i specify column as timestamp i am not able to load the data and i am getting all NULL values in that column.
Eg: HDFS - '2015-03-26T00:00:00+00:00'
hive table- create table t1(my_date string)
i can get output as - '2015-03-26T00:00:00+00:00'
if i specify create table t1(my_date as timestamp)--i can see all null values
Can any one help me on this
Timestamps in text files have to use the format yyyy-mm-dd hh:mm:ss[.f...]. If they are in another format declare them as the appropriate type (INT, FLOAT, STRING, etc.) and use a UDF to convert them to timestamps.
Go through below link:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-Timestamps
You have to use a staging table. In the staging table load as String and in the final table use UDF as below to convert the string value to Timestamp
from_unixtime(unix_timestamp(column_name, 'dd-MM-yyyy HH:mm'))

Resources