Filename extract and insert into column values hive - shell

I have a file called : akolp9app1a_170905_0000.txt
I need to split the values
hostname= akolp9app1a
date=170905 (convert into proper data format)
Now create a table in hive with 2 columns hostname and date and insert this values into the table.
any suggestion
Thanks.

You can use virtual columns, for example INPUT__FILE__NAME. It gives the input file's name.
Then you can use split (or) substring (or) regexp_extract string functions on the input__file__name field and create hostname,date values.
Example:-
the below select query gives date field value as 170905, like this way build your query using string functions to extract hostname
hive> select split(INPUT__FILE__NAME,'[\_]')[1] `date` from tablename;
Store them to separate table by using insert statement.

Related

Concat_ws not working in insert statement in hive

Using hive, I'm trying to concatenate columns from one table and insert them in another table using the query
insert into table temp_error
select * from (Select 'temp_test','abcd','abcd','abcd',
from_unixtime(unix_timestamp()),concat_ws('|',sno,name,age)
from temp_test_string)c;
I get the required output till I use Select *. But as soon as I try to insert it into the table, it does not give concatenated output but gives the value of sno only instead of whole concatenated output.
Thanks guys.
I found why it was behaving that way. It's because while creating table I gave "separate fields by '|'". So what I was trying to insert as a string into the table, hive was interpreting it as different columns.

How to drop hive column?

I have two columns Id and Name in Hive table, and I want to delete the Name column. I have used following command:
ALTER TABLE TableName REPLACE COLUMNS(id string);
The result was that the Name column values were assigned to the Id column.
How can I drop a specific column of the table and is there any other command in Hive to achieve my goal?
In addition to the existing answers to the question : Alter hive table add or drop column
As per Hive documentation,
REPLACE COLUMNS removes all existing columns and adds the new set of columns.
REPLACE COLUMNS can also be used to drop columns. For example, ALTER TABLE test_change REPLACE COLUMNS (a int, b int); will remove column c from test_change's schema.
The query you are using is right. But this will modify only schema i.e, the metastore. This will not modify anything on data side.
So, before you are dropping the column you should make sure that you hav correct data file.
In your case the data file should not contain name values.
If you don't want to modify the file then create another table with only specific column that you need.
Create table tablename as select id from already_existing_table
let me know if this helps.

HDFS String data to timestamp in hive table

Hi I have a data in HDFS as a string '2015-03-26T00:00:00+00:00' ..if i want to load this data into Hive table (column as timestamp).i am not able to load and i am getting null values.
if i specify column as string i am getting the data into hive table
but if i specify column as timestamp i am not able to load the data and i am getting all NULL values in that column.
Eg: HDFS - '2015-03-26T00:00:00+00:00'
hive table- create table t1(my_date string)
i can get output as - '2015-03-26T00:00:00+00:00'
if i specify create table t1(my_date as timestamp)--i can see all null values
Can any one help me on this
Timestamps in text files have to use the format yyyy-mm-dd hh:mm:ss[.f...]. If they are in another format declare them as the appropriate type (INT, FLOAT, STRING, etc.) and use a UDF to convert them to timestamps.
Go through below link:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-Timestamps
You have to use a staging table. In the staging table load as String and in the final table use UDF as below to convert the string value to Timestamp
from_unixtime(unix_timestamp(column_name, 'dd-MM-yyyy HH:mm'))

Hive - How to load data from a file with filename as a column?

I am running the following commands to create my table ABC and insert data from all files that are in my designated file path. Now I want to add a column with filename, but I can't find any way to do that without looping through the files or something. Any suggestions on what the best way to do this would be?
CREATE TABLE ABC
(NAME string
,DATE string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
hive -e "LOAD DATA LOCAL INPATH '${DATA_FILE_PATH}' INTO TABLE ABC;"
Hive does have virtual columns, which include INPUT__FILE__NAME. The link shows how to use this in a statement.
To fill another table with the filename as a column. Assuming your location of your data is hdfs://hdfs.location:port/data/folder/filename1
DROP TABLE IF EXISTS ABC2;
CREATE TABLE ABC2 (
filename STRING COMMENT 'this is the file the row was in',
name STRING,
date STRING);
INSERT INTO TABLE ABC2 SELECT split(INPUT__FILE__NAME,'folder/')[1],* FROM ABC;
You can alter the split to change how much of the full path you actually want to store.

update table based on concatenated column value

I have a table with only 4 columns
First column - The concatenated column values for each row from another table.The columns are concatenated based on column id from the metadata table.The order of concatenation is the same order of column ids.
Second column -I have the comma separated primary key columns.
Now, based on the primary keys in the second column, I need to update the 3rd column which will retrieve the values for the primary key from each of the first concatenated field.
4 column _ it has the table name.
I am using cursor and string functions and it works perfectly fine but when I tested it fir huge millions of data , it failed and the performance is very poor.
Could anyone give please me a single update query for the same
There is a comparison tool which compares the data between 2 tables in different database but with same data structure and it dumps the mismatch rows into a table with all the columns concatenated(pipe seperaed).The columns are in the same order as that of column id and I know the primary keys for that table(concatenated but pipe seperated). So, based on this data I need to extract the primary key values for which there is a data mismatch.
I need to do something like
Update column4(primary key values pipe seperated extracted from column2)
Check this LINK, maybe will be useful. With that query you could concatenate a value with a character you need (this works for 11g2 version, for earlier versions use xmlagg
, xmlelement, extract method).
CREATE TABLE TEST(
FIELD INT);
INSERT INTO TEST VALUES(1);
INSERT INTO TEST VALUES(2);
INSERT INTO TEST VALUES(3);
INSERT INTO TEST VALUES(4);
SELECT listagg(FIELD,',' ) WITHIN GROUP (ORDER BY FIELD)
FROM TEST
Returns '1,2,3,4'

Resources