How to use current timestamp as filename for Hive output - hadoop

I'm using this code to write the results of a Hive query to the specified file:
INSERT OVERWRITE DIRECTORY '/user/test.user/test.csv'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ESCAPED BY '"' STORED AS TEXTFILE
SELECT
...
I don't want the filename to be test.csv, however, but the Unix timestamp, e.g. 1517213651.csv or something like that.
I understand I can't use the concat function to manipulate the filename, but that is as far as I got.
How do I get the timestamp of the moment of query execution to be the filename of my output?
EDIT: We're using Cloudera.

Another option is to put the Hive insert inside a shell script. Define a date variable in the script and then use it to build the output file name.
TIMESTAMP_VAR=$(date +"%Y-%m-%d-%H-%M-%S")
FILENAME_VAR=/user/test/${TIMESTAMP_VAR}.csv
You can change the timestamp layout through the date format string; for example, date +%s gives the Unix epoch seconds asked for in the question.
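As a minimal sketch of the shell-script approach, the script below takes the Unix epoch seconds with date +%s and substitutes them into the statement before handing it to Hive. The output path is modeled on the one in the question, the SELECT body stays elided as in the question, and the script only prints the statement; the hive -e call is left commented out because it needs a Hive client. Note that INSERT OVERWRITE DIRECTORY creates a directory with that name, holding one or more result files, rather than a single file.

```shell
#!/bin/bash
# Unix epoch seconds at the moment the script runs.
TIMESTAMP_VAR=$(date +%s)
# Output location modeled on the question; adjust to your own HDFS layout.
OUTPUT_DIR="/user/test.user/${TIMESTAMP_VAR}.csv"

# The shell substitutes the directory name before Hive ever sees the statement.
# The SELECT body is elided here, as in the question.
QUERY="INSERT OVERWRITE DIRECTORY '${OUTPUT_DIR}'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ESCAPED BY '\"' STORED AS TEXTFILE
SELECT ..."

echo "${QUERY}"
# To actually run it (requires a Hive client on the PATH and a real SELECT):
# hive -e "${QUERY}"
```

Building the statement in the shell sidesteps the limitation mentioned in the question, since the directory name is already a plain string literal by the time Hive parses it.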

You have to add TalendDate.getDate("CCYYMMDD") to the file path:
"/File1/Output_File_" + TalendDate.getDate("CCYYMMDD") + ".csv"

Related

How to Insert parameter from Concurrent Program(.prog file) into a table using sql*loader control file created dynamically

I have the .prog file from a host program created in oracle apps. I am sending a parameter from oracle apps with host program and I can access it in the .prog file like this e.g.
echo "5 Concurrent Program Parameter 1 : " ${5}
I need to use this parameter ($5) in the control file (.ctl), where I will insert some columns and this parameter into a new table, e.g.
LOAD DATA
INSERT INTO TABLE TABLE_NAME
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
COL1,
COL2,
DATA_FROM_PROG (5) => ** here i need to insert that data from the .prog file**
)
I am thinking it would have to be included in this command somehow so it creates this control file or another dynamically but I can't figure out how to send that parameter and make this work.
I am familiar with this line, which I have used in the past for simpler problems, e.g.
sqlldr userid=user/pass data=$5 control=control.ctl
Thanks.
I wouldn't know, as I don't know anything about Oracle Apps or ".prog" files.
A workaround, from my perspective, would be to:
load only the known data (from the source file);
specify data_from_prog as a filler field (populated with NULL values if trailing nullcols is specified);
after the loading session is over, update that column from Oracle Apps. You would then use a simple update statement; you're in the (PL/)SQL world, so it is easy to write such a query (at least, I hope so).
Using a Bash script in the .prog file to create the control file (.ctl) dynamically from scratch seems to work, and I can use the parameters as well.
So in the .prog file we would have:
# This echo is only a test that the parameter arrives.
echo "5 Concurrent Program Parameter 1 : " ${5}
# printf with > creates (or overwrites) the file;
# printf with >> would append to the file instead.
# INFILE names the CSV file that holds the data for COL1, COL2 etc.
printf "LOAD DATA\n
INFILE 'path_to_csv_file.csv'\n
INSERT INTO TABLE TABLE_NAME\n
FIELDS TERMINATED BY \',\' OPTIONALLY ENCLOSED BY \'\"\'\n
TRAILING NULLCOLS\n
(COL1,\n
COL2,\n
DATA_FROM_PROG CONSTANT ${5})" > [name and path to control file (e.g./folder/control.ctl)]
This way, when the .prog file is executed, it dynamically creates the .ctl file containing the parameter that we want (${5}).
We can also add something like this to run SQL*Loader with the .ctl file:
sqlldr userid=user/pass control=[path_to_control]control.ctl log=track.log
Also make sure to escape the single quotes ' and double quotes " with \, because you will get errors otherwise.
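As an alternative sketch of the same idea, a here-document writes the control file without any quote escaping, because the text between the markers is copied literally apart from variable expansion. The temporary file path and the DEMO fallback value are assumptions for illustration only; in the .prog file the value would come from ${5} and the path would be your real control file location.

```shell
#!/bin/bash
# Fifth positional argument, as in the .prog example above.
# "DEMO" is a hypothetical fallback so the sketch also runs without arguments.
PARAM="${5:-DEMO}"
# Hypothetical location for the generated control file.
CTL_FILE="$(mktemp)"

# An unquoted EOF marker lets ${PARAM} expand, while ' and " are written as-is.
cat > "${CTL_FILE}" <<EOF
LOAD DATA
INFILE 'path_to_csv_file.csv'
INSERT INTO TABLE TABLE_NAME
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(COL1,
COL2,
DATA_FROM_PROG CONSTANT ${PARAM})
EOF

# Show the generated control file.
cat "${CTL_FILE}"
```

The generated file can then be passed to sqlldr through its control= argument in the same way as before.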

Hadoop Hive: Generate Table Name and Attribute Name using Bash script

In our environment we do not have access to the Hive metastore to query it directly.
I have a requirement to generate table name, column name pairs for a set of tables dynamically.
I was trying to achieve this by running "describe extended $tablename" to a file for all tables and picking up the table name and column name pairs from the file.
Is there an easier way to do this?
The desired output looks like:
table1|col1
table1|col2
table1|col3
table2|col1
table2|col2
table3|col1
This script will print the columns in the desired format for a single table. AWK parses the lines from the describe command, takes only the column name, and concatenates it with "|" and the table_name variable; the resulting strings are printed with \n as a delimiter between them.
#!/bin/bash
#Set table name here
TABLE_NAME=your_schema.your_table
TABLE_COLUMNS=$(hive -S -e "set hive.cli.print.header=false; describe ${TABLE_NAME};" | awk -v table_name="${TABLE_NAME}" -F " " 'f&&!NF{exit}{f=1}f{printf c table_name "|" toupper($1)}{c="\n"}')
#Print the collected table|column lines
echo "${TABLE_COLUMNS}"
You can easily modify it to generate output for all tables, for example by using the show tables command.
The easier way is to access metadata database directly.
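A rough sketch of how the single-table script could be extended to several tables is shown below. The parsing step is wrapped in a function (format_columns, a name made up here) so it can be tried on sample text without a Hive client; the sample assumes that describe prints tab-separated column rows followed by a blank line before any partition section. Unlike the script above, this version keeps the column names in their original case to match the desired output in the question.

```shell
#!/bin/bash
# Turn `describe` output (read from stdin) into table|column lines.
# Stops at the first blank line, where the partition section is assumed to begin.
format_columns() {
  awk -v table_name="$1" 'NF == 0 { exit } $1 !~ /^#/ { print table_name "|" $1 }'
}

# Sample text standing in for the output of: hive -S -e "describe table1;"
SAMPLE_DESCRIBE=$'col1\tstring\t\ncol2\tint\t\n\n# Partition Information'
RESULT=$(printf '%s\n' "${SAMPLE_DESCRIBE}" | format_columns table1)
echo "${RESULT}"

# With a Hive client available, the loop over all tables could look like this:
# for t in $(hive -S -e "show tables;"); do
#   hive -S -e "describe ${t};" | format_columns "${t}"
# done
```

Each hive invocation starts a new client session, so for many tables this loop can be slow; batching the describe statements into one session would be a possible refinement.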

Error while exporting the results of a HiveQL query to CSV?

I am a beginner in Hadoop/Hive. I did some research to find out a way to export results of HiveQL query to CSV.
I am running the command line below in PuTTY:
Hive -e ‘use smartsourcing_analytics_prod; select * from solution_archive_data limit 10;’ > /home/temp.csv;
However, I am getting the error below:
ParseException line 1:0 cannot recognize input near 'Hive' '-' 'e'
I would appreciate inputs regarding this.
Run your command from outside the Hive shell, directly from the Linux shell.
Run it with 'hive' instead of 'Hive'.
Redirecting the output into a file on its own won't give you CSV, because the fields in the output are separated by tabs. You can do:
hive -e 'YOUR QUERY HERE' | sed 's/[\t]/,/g' > sample.csv
as suggested here: How to export a Hive table into a CSV file?
AkashNegi's answer will also work for you, though it is a bit longer.
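To see what the conversion step does without a Hive client, the sketch below feeds a few made-up tab-separated rows (standing in for hive -e output) through tr, which does the same job as the sed expression above. One caveat: values that themselves contain commas are not quoted, so this is only safe for simple data.

```shell
#!/bin/bash
# Made-up tab-separated rows standing in for the output of: hive -e '...'
SAMPLE_ROWS=$'1\talice\t2018\n2\tbob\t2019'

# tr replaces every tab with a comma.
CSV=$(printf '%s\n' "${SAMPLE_ROWS}" | tr '\t' ',')
echo "${CSV}"

# With a Hive client available, the same pipeline would be:
# hive -e 'YOUR QUERY HERE' | tr '\t' ',' > sample.csv
```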
One way I do such things is to create an external table with the schema you want. Then do INSERT INTO TABLE target_table ... Look at the example below:
CREATE EXTERNAL TABLE isvaliddomainoutput (email_domain STRING, `count` BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
STORED AS TEXTFILE
LOCATION "/user/cloudera/am/member_email/isvaliddomain";
INSERT INTO TABLE isvaliddomainoutput
SELECT * FROM member_email WHERE isvalid = 1;
Now go to "/user/cloudera/am/member_email/isvaliddomain" and find your data.
Hope this helps.

Remove spaces and UTF while writing hive table into HDFS files

I am trying to write a Hive table into an HDFS file using the following query:
insert overwrite directory '<HDFS Location>' select customerid,'\t' ,f1,',', f2,',', f3,',', f4,',', f5 from sd_cust_product_recomm_all_emailid_model2 WHERE EMAILID IS NOT NULL;
I am getting UTF characters and spaces in the file. The output is something like this:
customer1\t^Af1^A,^Af2^A,^Af3^A,^Af4^A,^Af5^A,
The desired output is in the following format:
customer1/tf1,f2,f3,f4,f5
customer2/tf1,f2,f3,f4,f5
with no spaces and no UTF characters.
Thanks for the help
The default delimiter is the issue. Data written to the filesystem is serialized as text with columns separated by ^A.
By explicitly specifying the field delimiter (comma) and the row delimiter (\n), you can overcome the issue.
insert overwrite directory '[HDFS Location]'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
select customerid,'\t',f1,f2,f3,f4,f5
from sd_cust_product_recomm_all_emailid_model2 WHERE EMAILID IS NOT NULL;

Loading data using Hive Sed command

I have my data in this format:
"123";"mybook1";"2002";"publisher1";
"456";"mybook2;the best seller";"2004";"publisher2";
"789";"mybook3";"2002";"publisher1";
The fields are enclosed in "" and delimited by ; and the book name may also contain ';'.
Can you tell me how to load this data from a file into a Hive table?
The query I am using now is obviously not working:
create table books (isbn string,title string,year string,publisher string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'
If possible, I want the userid and year fields to be stored as Int.
Also, I don't want to use the RegexSerDe.
How can I use the Unix sed command to clean the data and get my output?
I tried to learn about the sed command and found the replace option, so I can remove the " double quotation marks. But how can I handle the extra ; semicolon that appears in the middle of the data?
Please help.
I think you can preprocess with sed and then use the MetadataTypedColumnsetSerDe WITH SERDEPROPERTIES:
sed -r ':a; s/^([^"]*("[^"]*"[^"]*)*);/\1XXXXX/g; t a; s/;/ /g; s/XXXXX/;/g' file
This sed command matches the quote pairs to avoid touching what is between quotes, and puts a placeholder in place of each semicolon outside quoted text. Afterward it replaces the semicolons inside the book title text with a space and restores the semicolons that are outside quotes.
See here for more on how to load data using Hive, including an example of MetadataTypedColumnsetSerDe WITH SERDEPROPERTIES:
https://svn.apache.org/repos/asf/hive/trunk/serde/README.txt
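To check the sed command against the rows from the question, the sketch below runs it on the three sample lines and prints the cleaned result. It assumes GNU sed (for -r and for the ":a; ... t a" loop written on one line) and, like the original command, that the placeholder text XXXXX never occurs in the data.

```shell
#!/bin/bash
# The three sample rows from the question.
SAMPLE=$(cat <<'EOF'
"123";"mybook1";"2002";"publisher1";
"456";"mybook2;the best seller";"2004";"publisher2";
"789";"mybook3";"2002";"publisher1";
EOF
)

# 1. Loop: swap each semicolon that follows balanced quote pairs for XXXXX.
# 2. Replace the remaining semicolons (those inside quotes) with a space.
# 3. Turn the XXXXX placeholders back into semicolons.
CLEANED=$(printf '%s\n' "${SAMPLE}" \
  | sed -r ':a; s/^([^"]*("[^"]*"[^"]*)*);/\1XXXXX/g; t a; s/;/ /g; s/XXXXX/;/g')
echo "${CLEANED}"
```

Only the second row changes: the semicolon inside the title becomes a space, while the field delimiters stay in place.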
create external table books (isbn int,title string,year int,publisher string)
row format SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = '\;', 'quoteChar' = '\"')
location 'S3 path/HDFS path for the file';
