Is there a command in Hive that can be used to set the output file format to CSV?
Something similar to the below example?
set hive.resultset.use.unique.column.names=false;
EDIT - Added the following for further context 12/18.
A terminal window I'm using has predefined settings for the command line when it runs an 'export' through a script. These are its commands:
set hive.metastore.warehouse.dir=/idn/home/user;
set mapred.job.queue.name=root.gmis;
set hive.exec.scratchdir=/axp/hivescratch/user;
set hive.resultset.use.unique.column.names=false;
set hive.cli.print.header=true;
set hive.groupby.orderby.position.alias=true;
Is there another command I could add instead of the lengthy strings below? In the other Hive terminal I'm using the following, but its SQL is different(?):
cloak-hive -e "INSERT OVERWRITE LOCAL DIRECTORY '/adshome/user/VS_PMD' ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
You can set the output file format to CSV; refer to the following example command. Note that it's the same for beeline and hive.
beeline -u jdbc:hive2://localhost:10000/default --silent=true --outputformat=csv2 -e "select * from sample_07 limit 10" > out.txt
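If you want to keep the header row and write straight to a .csv file, a small variant of the same command (same placeholder connection string and query) would be:

beeline -u jdbc:hive2://localhost:10000/default --silent=true --outputformat=csv2 --showHeader=true -e "select * from sample_07 limit 10" > out.csv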
From the Apache documentation:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries
Standard syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format]
SELECT ... FROM ...
INSERT OVERWRITE LOCAL DIRECTORY directory1
ROW FORMAT DELIMITED
STORED AS TEXTFILE
SELECT ... FROM ...;
Some work may be needed on ROW FORMAT to achieve the expected result; a sketch is shown below.
Please note also that LOCAL means the directory is on the local filesystem.
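For example, a minimal sketch (directory path, column list, and table name are placeholders) that writes comma-separated text files to a local directory:

hive -e "INSERT OVERWRITE LOCAL DIRECTORY '/tmp/my_csv_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
SELECT col1, col2, col3 FROM my_table;"

Note that Hive writes one or more files named 000000_0 and so on into that directory, rather than a single .csv file.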
Related
In our environment we do not have access to Hive meta store to directly query.
I have a requirement to generate tablename, columnname pairs for a set of tables dynamically.
I was trying to achieve this by running "describe extended $tablename" into a file for all tables and picking up the tablename and column name pairs from that file.
Is there an easier way to do this?
The desired output is like
table1|col1
table1|col2
table1|col3
table2|col1
table2|col2
table3|col1
This script will print the columns in the desired format for a single table. AWK parses the output of the describe command, takes only the column name, concatenates it with "|" and the table_name variable, and prints each pair on its own line.
#!/bin/bash
#Set table name here
TABLE_NAME=your_schema.your_table
TABLE_COLUMNS=$(hive -S -e "set hive.cli.print.header=false; describe ${TABLE_NAME};" | awk -v table_name="${TABLE_NAME}" -F " " 'f&&!NF{exit}{f=1}f{printf c table_name "|" toupper($1)}{c="\n"}')
You can easily modify it to generate output for all tables, for example by looping over the output of the show tables command, as sketched below.
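A rough sketch of that modification (schema name is a placeholder, and it reuses the same describe/awk parsing idea as above):

#!/bin/bash
SCHEMA=your_schema
# list every table in the schema, then describe each one and emit TABLE|COLUMN pairs
for TABLE_NAME in $(hive -S -e "use ${SCHEMA}; show tables;"); do
  hive -S -e "set hive.cli.print.header=false; describe ${SCHEMA}.${TABLE_NAME};" \
    | awk -v table_name="${SCHEMA}.${TABLE_NAME}" 'f&&!NF{exit}{f=1}f{print table_name "|" toupper($1)}'
done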
The easier way is to access the metastore database directly.
I am new to Hadoop and I have a scenario where I have to export a dataset/file from HDFS to an Oracle table using sqoop export. The file has values of 'null' in it, so the same gets exported into the table as well. How can I replace 'null' with a blank in the database while exporting?
You can create a TSV file from hive/beeline, and in that process you can make nulls blank with --nullemptystring=true.
Example: beeline -u ${hiveConnectionString} --outputformat=csv2 --showHeader=false --silent=true --nullemptystring=true --incremental=true -e 'set hive.support.quoted.identifiers=none; select * from someSchema.someTable where whatever > something' > /your/local/location_or_edgenode/exportingfile.tsv
You can then use the created file in the sqoop export to export to the Oracle table.
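A rough sketch of that export step (connection string, credentials, table name, and paths are all placeholders; note that sqoop export reads from HDFS, so the generated file would need to be uploaded first):

hdfs dfs -mkdir -p /user/youruser/export_stage
hdfs dfs -put exportingfile.tsv /user/youruser/export_stage/
sqoop export \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username dbuser -P \
  --table TARGET_TABLE \
  --export-dir /user/youruser/export_stage \
  --input-fields-terminated-by ','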
You can also replace the nulls with blanks in the file with Unix sed.
Ex: sed -i 's/null//g' /your/local/location_or_edgenode/exportingfile.tsv
In Oracle, empty strings and nulls are treated the same for varchars; that is why Oracle internally converts empty strings into nulls for varchar columns. When '' is assigned to a char(1) it becomes ' ' (char types are blank-padded strings). See what Tom Kyte says about this: https://asktom.oracle.com/pls/asktom/f?p=100:11:0%3a%3a%3a%3aP11_QUESTION_ID:5984520277372
See this manual: https://www.techonthenet.com/oracle/questions/empty_null.php
I'm using this code to write the results of a Hive query to the specified file:
INSERT OVERWRITE DIRECTORY '/user/test.user/test.csv'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ESCAPED BY '"' STORED AS TEXTFILE
SELECT
...
However, I don't want the filename to be test.csv but rather the Unix timestamp, that is, 1517213651.csv or something like that.
I understand I can't use the concat function to manipulate the filename, but that is as far as I got.
How do I get the timestamp of the moment of query execution to be the filename of my output?
EDIT: We're using Cloudera.
Another option is to put the Hive insert inside a shell script. Define a date variable in the script and then use that variable to define the output file.
TIMESTAMP_VAR=$(date +"%Y-%m-%d-%H-%M-%S")
FILENAME_VAR=/user/test/${TIMESTAMP_VAR}.csv
You can manipulate the timestamp layout in numerous ways. A fuller sketch of such a script is below.
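Putting the pieces together, a minimal sketch (paths and the query are placeholders mirroring the question; this is an illustration, not the only way to do it):

#!/bin/bash
# take the timestamp at the moment the script runs and use it as the output name
TIMESTAMP_VAR=$(date +%s)                          # e.g. 1517213651
OUTPUT_DIR=/user/test.user/${TIMESTAMP_VAR}.csv    # note: Hive treats this as a directory

hive -e "INSERT OVERWRITE DIRECTORY '${OUTPUT_DIR}'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ESCAPED BY '\"' STORED AS TEXTFILE
SELECT col1, col2 FROM your_table;"                # placeholder query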
You have to add TalendDate.getDate("CCYYMMDD") to the file path (this is a Talend routine):
"/File1/Output_File_" + TalendDate.getDate("CCYYMMDD") + ".csv"
I am a beginner in Hadoop/Hive. I did some research to find a way to export the results of a HiveQL query to CSV.
I am running the below command line in PuTTY:
Hive -e ‘use smartsourcing_analytics_prod; select * from solution_archive_data limit 10;’ > /home/temp.csv;
However below is the error I am getting
ParseException line 1:0 cannot recognize input near 'Hive' '-' 'e'
I would appreciate inputs regarding this.
Run your command from outside the Hive shell, i.e. from the Linux shell.
Run it with 'hive' instead of 'Hive'.
Just redirecting your output into a csv file won't work. You can do:
hive -e 'YOUR QUERY HERE' | sed 's/[\t]/,/g' > sample.csv
as was offered here: How to export a Hive table into a CSV file?
AkashNegi's answer will also work for you... a bit longer though.
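If you also want column headers in the file, a small variant of the same redirect-and-sed idea (query is a placeholder) would be:

hive -e "set hive.cli.print.header=true; YOUR QUERY HERE" | sed 's/[\t]/,/g' > sample.csv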
One way I do such things is to create an external table with the schema you want. Then do INSERT INTO TABLE target_table ... Look at the example below:
CREATE EXTERNAL TABLE isvaliddomainoutput (email_domain STRING, `count` BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
STORED AS TEXTFILE
LOCATION "/user/cloudera/am/member_email/isvaliddomain";
INSERT INTO TABLE isvaliddomainoutput
SELECT * FROM member_email WHERE isvalid = 1;
Now go to "/user/cloudera/am/member_email/isvaliddomain" and find your data.
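If you then want the result as one local CSV file, a small follow-up step (same HDFS path as above, local filename is an assumption) could be hdfs dfs -getmerge, which concatenates the files in that directory into a single local file:

hdfs dfs -getmerge /user/cloudera/am/member_email/isvaliddomain ./isvaliddomain.csv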
Hope this helps.
Currently I am able to use the below command:
hive -f hive-job.hql -hiveconf city='CA' -hiveconf country='US'
Here I am passing only 2 variable values. But I have around 15 to 20 variable values which I need to pass through -hiveconf. These values are stored in a properties/text file.
Is there a possible way to read the file through -hiveconf ?
There is no direct way to read property values into Hive variables, but there are two approaches I know of that might be helpful:
1.) Keep all the variables in a hive-job-variables.hql file as:
set x=1;
set y=2;
...
Then pull this file into the main file (hive-job.hql, run with hive -f hive-job.hql), for example with the source command, before the query itself:
source hive-job-variables.hql;
select ... from ...;
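A related alternative (not from the original answer, but using Hive CLI's -i flag, which runs an initialization file before the main script) is to pass the variables file on the command line:

hive -i hive-job-variables.hql -f hive-job.hql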
2.) Use Java code to read the property file, convert the property values to Hive variable format, and use a Hive JDBC connection to connect to HiveServer and run your queries in the order you want.
As per your requirement, I would suggest using the second option.
Hope it helps...!!!
You can do this using shell tools pretty easily.
Assuming your properties file is in typical "key=val" format, e.g.
a=1
b=some_value
c=foo
Then you can do:
sed 's/^/-hiveconf\n/g' my_properties_file | xargs hive -f hive-job.hql
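The \n in the replacement relies on GNU sed; if that is not available, a rough equivalent (same file name, and assuming the values contain no spaces) is to build the argument list in a plain loop:

#!/bin/bash
HIVE_ARGS=""
# turn each key=val line into "-hiveconf key=val"
while IFS= read -r line; do
  HIVE_ARGS="$HIVE_ARGS -hiveconf $line"
done < my_properties_file
# word splitting of $HIVE_ARGS is intentional here
hive -f hive-job.hql $HIVE_ARGS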