Insert schema as first row to result of hive query - hadoop

Is there a way to add schema/column headers as the first row of the output of a Hive query?
I'm doing a typical dump to a local directory using this Hive statement:
INSERT OVERWRITE LOCAL DIRECTORY '/some_path/'
SELECT
... AS column1
... AS column2
...
;
In the output I want:
column1 column2
data data
data data

Since set hive.cli.print.header=true; works only on terminal output, it will not print the headers into the file. I would do this to achieve that:
bin/hive -e "set hive.cli.print.header=true; select * from demo;" > /path/to/file
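Alternatively, the header row can be prepended from the shell; a minimal sketch, assuming the table demo from above and the tab-separated output that hive -e produces by default:
# echo supplies the header; hive -S suppresses progress logging so only data rows follow
( echo -e "column1\tcolumn2"; hive -S -e "select column1, column2 from demo;" ) > /some_path/output.tsv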

Related

Hive select * shows 0 rows but count(1) returns millions of rows

I created a Hive external table and added a partition with:
create external table tbA (
colA int,
colB string,
...)
PARTITIONED BY (
`day` string)
stored as parquet;
alter table tbA add partition(day= '2021-09-04');
Then I put a Parquet file into the target HDFS directory with hdfs dfs -put ...
I get the expected results using select * from tbA in Impala.
In Hive, I get the correct result when running
select count(1) from tbA
However, when I run
select * from tbA limit 10
it returns no results at all.
If anything were wrong with the Parquet file or the directory, Impala should not return the correct result either, and Hive can evidently count the rows... so why does select * from ... show nothing? Any help is appreciated.
In addition:
running select distinct day from tbA returns 2021-09-04
running select * from tbA returns data with day = 2021-09-04
It seems this partition is not recognized correctly? I retried dropping the partition and running msck repair table, but it is still not working...
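One diagnostic worth running here (a suggestion, not from the original thread) is comparing the partition location registered in the metastore with the directory the file was actually put into:
-- shows, among other details, the Location the metastore has recorded for this partition
DESCRIBE FORMATTED tbA PARTITION (day='2021-09-04');
If the location Hive reports does not match the directory the file was placed in, that mismatch would explain the empty result.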

Oracle DB: custom field delimiter for SQL query

Query:
select * from tableA;
result:
column1 column2
------- -------
row1-1 row1-2
row2-1 row2-2
row3-1 row3-2
How can I turn the result above into something like the output below?
column1|column2
------- -------
row1-1 | row1-2
row2-1 | row2-2
row3-1 | row3-2
The assumption here is that the user will not know the columns of the table in advance, which means that using || to do string concatenation in the query is not preferred.
select column1 || '|' || column2 from tableA <---- This is not what I am looking for.
Note: These tables are already created and I do not have the admin rights to modify the table schema.
If you are running your query in SQL*Plus, you can issue
set colsep '|'
before your query and you should get the output as listed, though the header separator line will also use the same column separator: -------|-------
This setting applies to all subsequent queries in the session.
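For capturing the delimited output in a file, SQL*Plus can also spool it; a minimal session sketch, where the spool path and the pagesize/linesize values are assumptions:
-- keep headers but avoid mid-result page breaks and line wrapping
set colsep '|'
set pagesize 50000
set linesize 200
spool /tmp/tableA_piped.txt
select * from tableA;
spool off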

Add partition in Hive table based on a subquery

I am trying to add a partition to a Hive table (partitioned by date).
My problem is that the date needs to be fetched from another table.
My query looks like:
ALTER TABLE my_table ADD IF NOT EXISTS PARTITION(server_date = (SELECT max(server_date) FROM processed_table));
When I run the query, Hive throws the following error:
Error: Error while compiling statement: FAILED: ParseException line 1:84 cannot recognize input near '(' 'SELECT' 'max' in constant (state=42000,code=40000)
Hive does not allow functions/UDFs to be used for the partition column value.
Approach 1:
Run the inner query first, store its result in a shell variable, and then execute the ALTER statement with that value:
server_date=$(hive -e "set hive.cli.print.header=false; select max(server_date) from processed_table;")
hive -hiveconf "server_date"="$server_date" -f your_hive_script.hql
Inside your script you can use the following statement:
ALTER TABLE my_table ADD IF NOT EXISTS PARTITION (server_date = '${hiveconf:server_date}');
For more information, see the Hive documentation on variable substitution.
Approach 2:
In this approach, you need to create a temporary table if the data for the new partition is not already loaded in some other table.
Assuming your data does not have the server_date column:
Load the data into the temporary table, then enable dynamic partitioning:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
Execute the below query:
INSERT OVERWRITE TABLE my_table PARTITION (server_date)
SELECT b.column1, b.column2, ..., a.server_date AS server_date
FROM (SELECT max(server_date) AS server_date FROM processed_table) a, temp_table b;
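The temporary table referenced above as temp_table is never defined in the thread; a sketch of what it could look like, with assumed columns and an assumed load path:
-- staging table without the server_date column, matching the raw data layout (assumed schema)
CREATE TABLE temp_table (
  column1 STRING,
  column2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
-- load the raw files (assumed path) before running the INSERT OVERWRITE above
LOAD DATA INPATH '/staging/raw_data' INTO TABLE temp_table;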

Excluding the partition field from select queries in Hive

Suppose I have the following table definition in Hive (the actual table has around 65 columns):
CREATE EXTERNAL TABLE S.TEST (
COL1 STRING,
COL2 STRING
)
PARTITIONED BY (extract_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\007'
LOCATION 'xxx';
Once the table is created, when I run hive -e "describe s.test", I see extract_date as one of the columns of the table. Doing a select * from s.test also returns extract_date column values. Is it possible to exclude this virtual(?) column when running select queries in Hive?
Set this property:
set hive.support.quoted.identifiers=none;
and run the query as:
SELECT `(extract_date)?+.+` FROM <table_name>;
I tested it and it works fine.
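Applied to the table from this question, the quoted regex selects every column except the one named in the group:
set hive.support.quoted.identifiers=none;
-- the regex `(extract_date)?+.+` matches all column names except extract_date
SELECT `(extract_date)?+.+` FROM s.test;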

Hive query output to file

I run a Hive query from Java code.
Example:
"SELECT * FROM table WHERE id > 100"
How can I export the result to a file on HDFS?
The following query will insert the results directly into HDFS:
INSERT OVERWRITE DIRECTORY '/path/to/output/dir' SELECT * FROM table WHERE id > 100;
This command will redirect the output to a text file of your choice:
$hive -e "select * from table where id > 10" > ~/sample_output.txt
This will put the results in tab-delimited file(s) under a directory:
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/YourTableDir'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
SELECT * FROM table WHERE id > 100;
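Note that without a ROW FORMAT clause, INSERT OVERWRITE [LOCAL] DIRECTORY writes fields separated by the \001 (Ctrl-A) character. A one-liner for converting such a file to tabs, assuming GNU sed and Hive's usual 000000_0 output file name:
# \x01 is the Ctrl-A delimiter Hive writes by default
sed 's/\x01/\t/g' /home/hadoop/YourTableDir/000000_0 > /home/hadoop/output.tsv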
I agree with tnguyen80's response. Note that when the query contains a literal string value, it is better to wrap the entire query in double quotes.
For example:
$hive -e "select * from table where city = 'London' and id >=100" > "/home/user/outputdirectory/city details.csv"
The ideal way to do it is to use "INSERT OVERWRITE DIRECTORY '/pathtofile' select * from temp where id > 100" rather than "hive -e 'select * from...' > /filepath.txt".
@sarath
How do I overwrite the file if I want to run another select * command from a different table and write to the same file?
INSERT OVERWRITE LOCAL DIRECTORY '/home/training/mydata/outputs'
SELECT expl, count(expl) AS total
FROM (
  SELECT explode(splits) AS expl
  FROM (
    SELECT split(words, ' ') AS splits
    FROM wordcount
  ) t2
) t3
GROUP BY expl;
This is an example for sarath's question: the above is a word-count job whose results are stored in the outputs directory on the local filesystem. Because INSERT OVERWRITE is used, running another query against the same directory simply replaces the previous contents. :)
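Concretely, writing a different table's rows to the same directory is just another INSERT OVERWRITE against the same path; other_table is an assumed name here:
-- replaces whatever the word-count job left in /home/training/mydata/outputs
INSERT OVERWRITE LOCAL DIRECTORY '/home/training/mydata/outputs'
SELECT * FROM other_table;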
Two ways to store HQL query results:
Save to an HDFS location:
INSERT OVERWRITE DIRECTORY '/hdfs/path' ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
SELECT * FROM XXXX LIMIT 10;
Save to a local file:
$hive -e "select * from table_Name" > ~/sample_output.txt
$hive -e "select * from table where city = 'London' and id >=100" > "/home/user/outputdirectory/city details.csv"
Another approach:
Create an external table
Insert data into the table
Optionally drop the table later, which won't delete the data file since it is an external table
Example:
Creating an external table to store the query results at '/user/myName/projectA_additionaData/':
CREATE EXTERNAL TABLE additionaData
(
ID INT,
latitude STRING,
longitude STRING
)
COMMENT 'Additional Data gathered by joining of the identified cities with latitude and longitude data'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/myName/projectA_additionaData/';
Feeding the query results into the temp table:
insert into table additionaData
select T.ID, C.latitude, C.longitude
from TWITER T
join CITY C on (T.location_name = C.location);
Dropping the temp table:
drop table additionaData;
To directly save the file in HDFS, use the below command:
hive> insert overwrite directory '/user/cloudera/Sample' row format delimited fields terminated by '\t' stored as textfile select * from table where id >100;
This will put the contents in the folder /user/cloudera/Sample in HDFS.
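To verify the result, the written files can be inspected directly from HDFS; 000000_0 is Hive's typical output file name, assumed here:
hdfs dfs -ls /user/cloudera/Sample
hdfs dfs -cat /user/cloudera/Sample/000000_0 | head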
Enter this line into the Hive command-line interface:
insert overwrite directory '/data/test' row format delimited fields terminated by '\t' stored as textfile select * from testViewQuery;
where testViewQuery is some specific view.
To set the output directory, the output file format, and more, use the following syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format]
SELECT ... FROM ...
Example:
INSERT OVERWRITE DIRECTORY '/path/to/output/dir'
ROW FORMAT DELIMITED
STORED AS PARQUET
SELECT * FROM table WHERE id > 100;
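Since Parquet files are not human-readable, one way to inspect the output later is to lay an external table over the directory; a minimal sketch, with an assumed table name and assumed columns matching whatever the SELECT produced:
-- external table over the Parquet output; dropping it later leaves the files in place
CREATE EXTERNAL TABLE exported_results (
  id INT,
  name STRING
)
STORED AS PARQUET
LOCATION '/path/to/output/dir';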
