Hive: writing column headers to local file? - syntax

Hive documentation lacking again:
I'd like to write the results of a query to a local file as well as the names of the columns.
Does Hive support this?
Insert overwrite local directory 'tmp/blah.blah' select * from table_name;
Also, separate question: is Stack Overflow the best place to get Hive help? @Nija has been very helpful, but I don't want to keep bothering them...

Try
set hive.cli.print.header=true;

Yes you can. Put set hive.cli.print.header=true; in a .hiverc file in your home directory or in any of the other Hive user properties files.
Vague warning: be careful, since this has crashed queries of mine in the past (but I can't remember the reason).
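For what it's worth, a minimal sketch of that idea (your_table and the output path are placeholders): with the header option in ~/.hiverc, the CLI prints the column names along with the data on stdout, so a plain shell redirect is the closest thing I know of to writing headers into a local file.
# one-time setup: put the header option into ~/.hiverc
echo 'set hive.cli.print.header=true;' >> ~/.hiverc
# the header row and the data both go to stdout, so a redirect captures them together
hive -e "select * from your_table" > /tmp/blah.blah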

Indeed, @nija's answer is correct - at least as far as I know. There isn't any way to write the column names when doing an insert overwrite [local] directory ... (whether you use local or not).
With regards to the crashes described by @user1735861, there is a known bug in Hive 0.7.1 (fixed in 0.8.0) whereby, after doing set hive.cli.print.header=true;, a NullPointerException is thrown for any HQL command/query that produces no output. For example:
$ hive -S
hive> use default;
hive> set hive.cli.print.header=true;
hive> use default;
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:222)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:287)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:517)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
Whereas this is fine:
$ hive -S
hive> set hive.cli.print.header=true;
hive> select * from dual;
c
c
hive>
Non-HQL commands are fine though (set, dfs, !, etc.).
More info here: https://issues.apache.org/jira/browse/HIVE-2334

Hive does support writing to the local directory. Your syntax looks right for it as well.
Check out the docs on SELECTS and FILTERS for additional information.
I don't think Hive has a way to write the names of the columns to a file for the query you're running... I can't say for sure it doesn't, but I do not know of a way.
I think the only place better than SO for Hive questions would be the mailing list.

I ran into this problem today and was able to get what I needed by doing a UNION ALL between the original query and a new dummy query that creates the header row. I added a sort column to each section, set it to 0 for the header and 1 for the data, and sorted by that field to ensure the header row came out on top.
create table new_table as
select
field1,
field2,
field3
from
(
select
0 as sort_col, --header row gets lowest number
'field1_name' as field1,
'field2_name' as field2,
'field3_name' as field3
from
some_small_table --table needs at least 1 row
limit 1 --only need 1 header row
union all
select
1 as sort_col, --original query goes here
field1,
field2,
field3
from
main_table
) a
order by
sort_col --make sure header row is first
It's a little bulky, but at least you can get what you need with a single query.
Hope this helps!

Not a great solution, but here is what I do:
create table test_dat
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" STORED AS
INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION '/tmp/test_dat' as select * from YOUR_TABLE;
hive -e 'set hive.cli.print.header=true;select * from YOUR_TABLE limit 0' > /tmp/test_dat/header.txt
cat header.txt 000* > all.dat

Here's my take on it. Note: I'm not very well versed in bash, so improvement suggestions are welcome :)
#!/usr/bin/env bash
# works like this:
# ./get_data.sh database.table > data.csv
INPUT=$1
TABLE=${INPUT##*.}
DB=${INPUT%.*}
HEADER=`hive -e "
set hive.cli.print.header=true;
use $DB;
INSERT OVERWRITE LOCAL DIRECTORY '$TABLE'
row format delimited
fields terminated by ','
SELECT * FROM $TABLE;"`
HEADER_WITHOUT_TABLE_NAME=${HEADER//$TABLE./}
echo ${HEADER_WITHOUT_TABLE_NAME//[[:space:]]/,}
cat $TABLE/*

Related

Hive one line command to catch SCHEMA + TABLE NAME info

Is there a way to catch all schema + table name info in a single command through Hive in a similar way to
SELECT * FROM information_schema.tables
from the PostgreSQL world?
Combining show databases and show tables in a loop (here an example) is one answer, but I'm looking for a more compact way to get the same result in a single command.
It's been a while since I last worked on Hive queries, but as far as I remember you can use
hive> desc formatted tableName;
or
hive> describe formatted tableName;
It will give you all the relevant information related to the table, like the schema, partition info, and table type (managed table, etc.).
I am not sure if this is exactly what you are looking for.
There is another way to query Hive tables: writing Hive scripts which can be called from the Hadoop terminal rather than from the Hive terminal itself.
std]$ cat sample.hql or vi sample.hql
use dbName;
select * from tableName;
desc formatted tableName;
# this hql script can be called from outside the hive terminal
std]$ hive -f sample.hql
Or, without even having to write a script file, you can query Hive as:
std]$ hive -e "use dbName; select * from emp;" > text.txt   (use >> to append)
At the database level, you can query as:
hive> use dbName;
hive> set hive.cli.print.current.db=true;
hive(dbName)> describe database dbName;
It will bring back metadata about the database from MySQL (the metastore).
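If you really want a single shell command rather than a single Hive statement, a small loop over show databases and show tables (the approach mentioned in the question) can be wrapped up as below; a rough sketch, assuming the hive CLI is on the PATH:
#!/usr/bin/env bash
# prints every database.table pair known to the metastore, one per line
for db in $(hive -S -e "show databases;"); do
  for tbl in $(hive -S -e "use $db; show tables;"); do
    echo "$db.$tbl"
  done
done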

Hive query in Shell Script

I have an external hive table on top of a parquet file.
CREATE EXTERNAL TABLE parquet_test LIKE avro_test STORED AS PARQUET LOCATION 'hdfs://myParquetFilesPath';
I want to get the count of the table using a shell script.
I tried the following command:
myVar=$(hive -S -e "select count(*) from parquet_test;")
echo $myVar
I added -S to run Hive in silent mode, but I still get the whole MapReduce log along with the count in the myVar variable. How do I get only the count?
I don't have access to any of the configuration files to enable or disable the level of logging. Is there any other way?
Finally found a workaround.
First flush the query result into a file, then read the answer from that file.
The file only contains the result of the query.
(hive -S -e " INSERT OVERWRITE LOCAL DIRECTORY '/home/test/result/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' select count(*) from parquet_test;")
Then read the file into a variable:
var=$(hdfs dfs -tail /home/test/result/)
echo $var
Thank you
myVar=$(eval "hive -S -e 'select count(*) from parquet_test;' ")
echo $myVar
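If the MapReduce progress messages in your setup go to stderr (they usually do), command substitution only captures stdout, so redirecting stderr away may be enough on its own; a minimal sketch, reusing the table from the question:
myVar=$(hive -S -e "select count(*) from parquet_test;" 2>/dev/null)
echo "$myVar"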

add date time from flat file name cloudera

I started an EC2 cluster on Amazon to install Cloudera... I got it installed and configured, and loaded some of the Wiki Page Views public snapshot into HDFS. The structure of the files is as follows:
projectcode, pagename, pageviews, bytes
The files are named as follows (the name contains the date followed by the time):
pagecounts-20090430-230000.gz
When loading the data from HDFS into Impala, I do it like this:
CREATE EXTERNAL TABLE wikiPgvws
(
project_code varchar(100),
page_name varchar(1000),
page_views int,
page_bytes int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/user/hdfs';
One thing I missed is the date and time of each file. The directory:
/user/hdfs
contains multiple pagecount files associated with different dates and times. How can one pull that information and store it in a column when loading into Impala?
I think the thing you are missing is the concept of partitions. If you define the table as partitioned, the data can be divided into different partitions based on the timestamp (in the name) of the file. I was able to work around it in Hive; I hope you can adapt it (if needed) for Impala, as the query syntax is the same.
For me, this problem could not be solved with Hive alone, so I mixed bash with Hive scripting and it works fine for me. This is how I wrapped it up:
1. Create table wikiPgvws with a partition
2. Create table wikiTmp with the same fields as wikiPgvws, except for the partition
3. For each file:
   i. Load the data into wikiTmp
   ii. grep the timestamp from the file name
   iii. Use sed to replace placeholders in a predefined hql script file that loads the data into the actual table, then run it
4. Drop table wikiTmp & remove tmp.hql
The script is as follows:
#!/bin/bash
hive -e "CREATE EXTERNAL TABLE wikiPgvws(
project_code varchar(100),
page_name varchar(1000),
page_views int,
page_bytes int
)
PARTITIONED BY(dts STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE";
hive -e "CREATE TABLE wikiTmp(
project_code varchar(100),
page_name varchar(1000),
page_views int,
page_bytes int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE"
for fileName in $(hadoop fs -ls /user/hdfs/bounty/pagecounts-*.txt | grep -Po '(?<=\s)(/user.*$)')
do
echo "currentFile :$fileName"
dst=$(echo $fileName | grep -oE '[0-9]{8}-[0-9]{6}')
echo "currentStamp $dst"
sed "s!sourceFile!'$fileName'!" t.hql > tmp.hql
sed -i "s!targetPartition!$dst!" tmp.hql
hive -f tmp.hql
done
hive -e "DROP TABLE wikiTmp"
rm -f tmp.hql
The hql script consists of just two lines:
LOAD DATA INPATH sourceFile OVERWRITE INTO TABLE wikiTmp;
INSERT OVERWRITE TABLE wikiPgvws PARTITION (dts = 'targetPartition') SELECT w.* FROM wikiTmp w;
Epilogue:
Check whether options equivalent to hive -e and hive -f are available in Impala; without them, this script is of no use to you. Also, the grep commands that fetch the file name and timestamp need to be modified to match your table location and stamp pattern. It's just one way to show how the job can be done; I couldn't find another one.
Enhancement:
If everything works well, consider merging the first two DDLs into another script to make it look cleaner. I'm not sure whether hql script arguments can be used to define partition values, but you could look into that as a replacement for sed.
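On that last point: Hive variable substitution via --hivevar is purely textual, so it should also work inside the PARTITION clause; a rough sketch of the same two-line script without sed, reusing the sourceFile/targetPartition names from the original placeholders:
-- t.hql, with ${hivevar:...} placeholders instead of sed-replaced tokens
LOAD DATA INPATH '${hivevar:sourceFile}' OVERWRITE INTO TABLE wikiTmp;
INSERT OVERWRITE TABLE wikiPgvws PARTITION (dts = '${hivevar:targetPartition}') SELECT w.* FROM wikiTmp w;
It would be called from the bash loop as hive --hivevar sourceFile="$fileName" --hivevar targetPartition="$dst" -f t.hql, replacing the sed/tmp.hql steps.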

Hive error: parseexception missing EOF

I am not sure what I am doing wrong here:
hive> CREATE TABLE default.testtbl(int1 INT,string1 STRING)
stored as orc
tblproperties ("orc.compress"="NONE")
LOCATION "/user/hive/test_table";
FAILED: ParseException line 1:107 missing EOF at 'LOCATION' near ')'
while the following query works perfectly fine:
hive> CREATE TABLE default.testtbl(int1 INT,string1 STRING)
stored as orc
tblproperties ("orc.compress"="NONE");
OK
Time taken: 0.106 seconds
Am I missing something here? Any pointers will help. Thanks!
Try putting the "LOCATION" in front of "tblproperties" like below; it worked for me.
CREATE TABLE default.testtbl(int1 INT,string1 STRING)
stored as orc
LOCATION "/user/hive/test_table"
tblproperties ("orc.compress"="NONE");
It seems even the sample SQL from the book "Programming Hive" got the order wrong. Please refer to the official definition of the create table command:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable
@Haiying Wang pointed out that LOCATION is to be put in front of tblproperties.
But I think the error also occurs when LOCATION is specified before STORED AS.
It's better to stick to the correct order:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name -- (Note: TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...) -- (Note: Available in Hive 0.10.0 and later)]
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[STORED AS DIRECTORIES]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] -- (Note: Available in Hive 0.6.0 and later)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)] -- (Note: Available in Hive 0.6.0 and later)
[AS select_statement]; -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
Refer: Hive Create Table
Check this post:
Loading Data from a .txt file to Table Stored as ORC in Hive
Also check the source files present at the specified directory /user/hive/test_table. In case the files are in .txt or some other non-ORC format, you can follow the steps in the above post to get past the error.
ParseException line lineNumber missing EOF at '.' near 'schemaName':
Got the above error while trying to execute the following command from a Linux script to truncate a Hive table:
dse -u username -p password hive -e "truncate table keyspace.tablename;"
Fix:
You need to separate the commands within the script line as follows:
dse -u username -p password hive -e "use keyspace; truncate table keyspace.tablename;"
Happy coding!
Got the same error while creating a table in Hive.
I used the drop command to drop the table and then ran the create table command that I had again.
Worked for me.
If you see this error when running HiveQL from a file with the command "hive -f file.hql", and it points to the first line of your query, it is most likely because of a forgotten semicolon (;) after a previous query.
The parser looks for a semicolon (;) as the terminator of each query.
for example:
DROP TABLE IF EXISTS default.emp
create table default.emp (
field1 type,
field2 type)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://gts-promocube/source-data/Lowes/POS/';
If you save the above in a file and execute it with hive -f, then you'll get the error:
FAILED: ParseException line 2:0 missing EOF at 'CREATE' near emp.
Solution: put a semicolon (;) after the DROP TABLE command above.
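For reference, this is the same file with the missing semicolon added, which should parse cleanly with hive -f once real column types are filled in:
DROP TABLE IF EXISTS default.emp;
create table default.emp (
field1 type,
field2 type)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://gts-promocube/source-data/Lowes/POS/';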

Multiple Line Variable into SQLPlus from Shell Script

What is the best way to pass multiple values from one variable into separate records in an Oracle DB?
I want to take the output from:
hddlist=`iostat -Dl|awk '{print ""$1"="$(NF)}'`
This returns output like this:
hdisk36=0.8
hdisk37=0.8
hdisk38=0.8
hdisk40=5.5
hdisk52=4.9
I want to insert them into a database like so:
sqlplus -s /nolog <<EOF1
connect / as sysdba
set verify off
insert into my_table ##Single Record Here
EOF1
How can I systematically separate out the values so I can create individual records that look like this:
Disk Value
--------- -------
hdisk36 0.8
hdisk37 0.8
hdisk38 0.8
hdisk40 5.5
hdisk52 4.9
I originally tried a while loop with a counter but could not seem to get it to work. An exact solution would be nice but some directional advice would be just as helpful.
Loop and generate insert statements.
sql=$(iostat -Dl | awk '{print ""$1"="$(NF)}' | while IFS== read -r k v ; do
printf "insert into mytable (k, v) values ('%s', %s);\n" "$k" "$v"
done)
This output can be passed in some manner to sqlplus, perhaps like this
sqlplus -s /nolog <<EOF1
connect / as sysdba
set verify off
$sql
EOF1
Although, depending on the line format of iostat, it might be simpler to just omit awk and parse with read directly.
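For what it's worth, a rough sketch of that awk-free variant, assuming the disk name is the first field and the value of interest is the last field on each iostat -Dl line (table and column names as in the snippet above):
sql=$(iostat -Dl | while read -r disk rest; do
  [ -n "$rest" ] || continue              # skip lines without at least two fields
  value=${rest##* }                       # keep only the last whitespace-separated field
  printf "insert into mytable (k, v) values ('%s', %s);\n" "$disk" "$value"
done)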
You can redirect the output to a file and then use an external table.
It should look something like this:
CREATE TABLE hddlist_ext_table (
disk CHAR(16),
value CHAR(3))
ORGANIZATION EXTERNAL (
TYPE ORACLE_LOADER DEFAULT DIRECTORY tab_dir
ACCESS PARAMETERS (RECORDS DELIMITED BY NEWLINE
FIELDS TERMINATED BY '=')
LOCATION ('your_file_name'));
Then you can either use this table for your data or insert-select from it into your table:
insert into my_table
select disk, value from hddlist_ext_table;
You can insert multiple rows in a single SQL statement in Oracle like this
INSERT ALL
INTO mytable (column1, column2, column3) VALUES ('val1.1', 'val1.2', 'val1.3')
INTO mytable (column1, column2, column3) VALUES ('val2.1', 'val2.2', 'val2.3')
INTO mytable (column1, column2, column3) VALUES ('val3.1', 'val3.2', 'val3.3')
SELECT * FROM dual;
If you intend to run this script automatically at intervals to then see the results of each disk, you will probably need additional columns to hold the date and time.
You might also look at sqlldr, since you can specify a control file telling it what your data contains, and it will load the data into a table. It is better suited than SQL*Plus if you are loading lots of data.
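A rough sketch of what an SQL*Loader control file could look like for the hdiskNN=value lines above (file, table and column names are illustrative):
-- hddlist.ctl
LOAD DATA
INFILE 'hddlist.txt'
APPEND
INTO TABLE my_table
FIELDS TERMINATED BY '='
(disk, value)
It would then be run with something like sqlldr username/password control=hddlist.ctl after redirecting the iostat output into hddlist.txt.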
