Hive query output to file - hadoop

I run Hive queries from Java code.
Example:
"SELECT * FROM table WHERE id > 100"
How do I export the result to a file on HDFS?

The following query will insert the results directly into HDFS:
INSERT OVERWRITE DIRECTORY '/path/to/output/dir' SELECT * FROM table WHERE id > 100;
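Since the question runs queries from Java, here is a minimal sketch of issuing that statement through the Hive JDBC driver; the HiveServer2 URL, user, and class name are assumptions, not part of the question:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveExport {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 endpoint and credentials -- adjust for your cluster.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {
            // Hive writes the result as files under the given HDFS directory.
            stmt.execute("INSERT OVERWRITE DIRECTORY '/path/to/output/dir' "
                    + "SELECT * FROM table WHERE id > 100");
        }
    }
}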

This command will redirect the output to a text file of your choice:
$hive -e "select * from table where id > 10" > ~/sample_output.txt

This will put the results in tab-delimited file(s) under a directory:
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/YourTableDir'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
SELECT * FROM table WHERE id > 100;

I agree with tnguyen80's response. Note that when the query contains a string literal, it is better to wrap the entire query in double quotes.
For example:
$hive -e "select * from table where city = 'London' and id >=100" > /home/user/outputdirectory/city details.csv

The ideal way is to use "INSERT OVERWRITE DIRECTORY '/pathtofile' select * from temp where id > 100" rather than "hive -e 'select * from...' > /filepath.txt", since the latter streams all results through the client.

#sarath
How do I overwrite the file if I want to run another SELECT * from a different table and write to the same file?
INSERT OVERWRITE LOCAL DIRECTORY '/home/training/mydata/outputs'
SELECT expl , count(expl) as total
FROM (
SELECT explode(splits) as expl
FROM (
SELECT split(words,' ') as splits
FROM wordcount
) t2
) t3
GROUP BY expl ;
This is an example for sarath's question:
the above word-count job overwrites the outputs directory, which is on the local file system.
:)

There are two ways to store HQL query results:
Save to an HDFS location
INSERT OVERWRITE DIRECTORY '<HDFS path>' ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
SELECT * FROM XXXX LIMIT 10;
Save to a local file
$hive -e "select * from table_Name" > ~/sample_output.txt
$hive -e "select * from table where city = 'London' and id >=100" > /home/user/outputdirectory/city_details.csv

Create an external table
Insert data into the table
Optionally drop the table later, which won't delete the data file since it is an external table
Example:
Creating an external table to store the query results at '/user/myName/projectA_additionaData/':
CREATE EXTERNAL TABLE additionaData
(
ID INT,
latitude STRING,
longitude STRING
)
COMMENT 'Additional Data gathered by joining of the identified cities with latitude and longitude data'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/myName/projectA_additionaData/';
Feeding the query results into the temp table
insert into additionaData
select T.ID, C.latitude, C.longitude
from TWITER T
join CITY C on (T.location_name = C.location);
Dropping the temp table
drop table additionaData;

To directly save the file in HDFS, use the below command:
hive> insert overwrite directory '/user/cloudera/Sample' row format delimited fields terminated by '\t' stored as textfile select * from table where id >100;
This will put the contents in the folder /user/cloudera/Sample in HDFS.

Enter this line into the Hive command-line interface:
insert overwrite directory '/data/test' row format delimited fields terminated by '\t' stored as textfile select * from testViewQuery;
Here testViewQuery is some specific view.

To set output directory and output file format and more, try the following:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format]
SELECT ... FROM ...
Example:
INSERT OVERWRITE DIRECTORY '/path/to/output/dir'
ROW FORMAT DELIMITED
STORED AS PARQUET
SELECT * FROM table WHERE id > 100;

Related

Hive select * shows 0 rows but count(1) returns millions of rows

I create a Hive external table and add a partition with:
create external table tbA (
colA int,
colB string,
...)
PARTITIONED BY (
`day` string)
stored as parquet;
alter table tbA add partition(day= '2021-09-04');
Then I put a parquet file into the target HDFS directory via hdfs dfs -put ...
I can get the expected results using select * from tbA in Impala.
In Hive, I get the correct result when using
select count(1) from tbA
However, when I use
select * from tbA limit 10
it returns no results at all.
If anything were wrong with the parquet file or the directory, Impala should not get the correct result either, and Hive can evidently count the rows... Why does select * from ... show nothing? Any help is appreciated.
In addition:
running select distinct day from tbA returns 2021-09-04
running select * from tbA returns data with day = 2021-09-04
It seems the partition is not recognized correctly? I tried dropping the partition and running msck repair table, but it still doesn't work...

How to add a semi-colon ; after each create ddl statement using shell scripts

I am trying to add a semicolon (;) after each CREATE VIEW / CREATE TABLE Hive DDL statement. I have a file that contains the DDL statements below:
CREATE VIEW `db1.table1` AS SELECT * FROM db2.table1
CREATE VIEW `db1.table2` AS SELECT * FROM db2.table2
CREATE VIEW `db1.table3` AS SELECT * FROM db3.table3
CREATE EXTERNAL TABLE `db1.table4`(
`cus_id` int,
`ren_mt` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
TBLPROPERTIES (
'skip.header.line.count'='1',
'transient_lastDdlTime'='1558705259')
CREATE EXTERNAL TABLE `sndbx_cmcx.effective_month1`(
`customeridentifier` bigint,
`renewalmonth` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='false',
'transient_lastDdlTime'='1558713596')
I want it to look like the output below: after each CREATE VIEW statement there is a ;, and after each CREATE TABLE statement there's a ;.
CREATE VIEW `db1.table1` AS SELECT * FROM db2.table1;
CREATE VIEW `db1.table2` AS SELECT * FROM db2.table2;
CREATE VIEW `db1.table3` AS SELECT * FROM db3.table3;
CREATE EXTERNAL TABLE `db1.table4`(
`cus_id` int,
`ren_mt` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
TBLPROPERTIES (
'skip.header.line.count'='1',
'transient_lastDdlTime'='1558705259');
CREATE EXTERNAL TABLE `sndbx_cmcx.effective_month1`(
`customeridentifier` bigint,
`renewalmonth` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='false',
'transient_lastDdlTime'='1558713596');
Here is my shell script that I use:
#Change database before you run the script
hiveDBName=$1;
showcreate="show create table "
terminate=";"
tables=`hive -e "use $hiveDBName;show tables;"`
tab_list=`echo "${tables}"`
for list in $tab_list
do
echo "Generating table script for " #${hiveDBName}.${list}
showcreatetable=${showcreatetable}${showcreate}${hiveDBName}.${list}${terminate}
done
echo " ====== Create Tables ======= : "# $showcreatetable
#Creates a filter ddls
hive -e "use $hiveDBName; ${showcreatetable}"> a.sql
#Removes the Warn: from the file
grep -v "WARN" a.sql > /home/path/my_ddls/${hiveDBName}_extract_all_tables.sql
echo "Removing Filter File"
#Remove Filter file
rm -f a.sql
#Puts a ; after each create view statement in the document
sed -i '/transient/s/$/;/' "/home/path/my_ddls/${hiveDBName}_extract_all_tables.sql"
This generates the DDLs, but the sed at the end only puts a ; after each CREATE TABLE statement (the lines containing transient); it doesn't put one after each CREATE VIEW statement.
Any ideas or suggestions?
I'd take the easy way and make use of the facts that the ; doesn't have to be on the same line as the (end of the) statement and that an empty statement is allowed. This gives:
sed -i -e '/^CREATE/i;' -e '$a;' "/home/path/my_ddls/${hiveDBName}_extract_all_tables.sql"
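This inserts a line containing only ; before every line that starts with CREATE, and appends one after the last line. A quick demonstration on a hypothetical two-line input (note the harmless empty statement at the top):
printf 'CREATE VIEW `db1.table1` AS SELECT * FROM db2.table1\nCREATE VIEW `db1.table2` AS SELECT * FROM db2.table2\n' \
  | sed -e '/^CREATE/i;' -e '$a;'
which prints:
;
CREATE VIEW `db1.table1` AS SELECT * FROM db2.table1
;
CREATE VIEW `db1.table2` AS SELECT * FROM db2.table2
;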

Empty String is not treated as null in Hive

My understanding of the following table property is that if a blank or empty string is inserted into a Hive column, it will be treated as NULL:
TBLPROPERTIES('serialization.null.format'='')
To test this functionality I created a table and inserted '' into field3. When I query for NULLs on field3, no rows match.
Is my understanding correct that a blank string is stored as NULL?
CREATE TABLE CDR
(
field1 string,
field2 string,
field3 string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
TBLPROPERTIES('serialization.null.format'='');
insert overwrite table emmtest.cdr select field1,field2,'' from emmtest.cdr_non_orc;
select * from emmtest.cdr where field3 is null;
The last statement returns no rows, but I am expecting all rows to be returned, since field3 contains a blank string.
TBLPROPERTIES('serialization.null.format'='') means the following:
An empty field in the data files will be treated as NULL when you query the table
When inserting rows to the table, NULL values will be written to the data files as empty fields
You are doing something else -
you are inserting an empty string into the table from a query,
and it is treated "as is": as an empty string.
Demo
bash
hdfs dfs -mkdir /user/hive/warehouse/mytable
echo Hello,,World | hdfs dfs -put - /user/hive/warehouse/mytable/data.txt
hive
create table mytable (s1 string,s2 string,s3 string)
row format delimited
fields terminated by ','
;
hive> select * from mytable;
OK
s1 s2 s3
Hello World
hive> alter table mytable set tblproperties ('serialization.null.format'='');
OK
hive> select * from mytable;
OK
s1 s2 s3
Hello NULL World
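In other words, to make field3 NULL you have to insert a NULL, not an empty string; a minimal sketch of a fix to the asker's insert:
insert overwrite table emmtest.cdr select field1,field2,cast(null as string) from emmtest.cdr_non_orc;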
You can also use the following in your table's row format clause:
NULL DEFINED AS ''
or any character inside the quotes.
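For example, a minimal sketch (the table and column names are placeholders):
CREATE TABLE mytable (s1 STRING, s2 STRING, s3 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
NULL DEFINED AS '';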

Excluding the partition field from select queries in Hive

Suppose I have a table definition as follows in Hive (the actual table has around 65 columns):
CREATE EXTERNAL TABLE S.TEST (
COL1 STRING,
COL2 STRING
)
PARTITIONED BY (extract_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\007'
LOCATION 'xxx';
Once the table is created, when I run hive -e "describe s.test", I see extract_date as one of the columns on the table. Doing a select * from s.test also returns extract_date column values. Is it possible to exclude this virtual(?) column when running select queries in Hive?
Change this property:
set hive.support.quoted.identifiers=none;
and run the query as:
SELECT `(extract_date)?+.+` FROM <table_name>;
I tested it and it works fine.
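With hive.support.quoted.identifiers=none, the backticked string is treated as a Java regular expression over the column names, so several columns can be excluded at once; for example (col1 is a hypothetical second column):
SELECT `(extract_date|col1)?+.+` FROM <table_name>;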

Insert schema as first row to result of hive query

Is there a way to add schema/column headers as the first row of the output of a Hive query?
I'm doing a typical dump to a local directory using this Hive statement:
INSERT OVERWRITE LOCAL DIRECTORY '/some_path/'
SELECT
... AS column1
... AS column2
...
;
In the output I want:
column1 column2
data data
data data
Since set hive.cli.print.header=true; works only for CLI output, INSERT OVERWRITE will not print the headers into the file. I would do the following to achieve that:
bin/hive -e "set hive.cli.print.header=true; select * from demo;" > /path/to/file
