Impala - how to set a variable in a query?

How can I set a variable in an Impala query?
In MySQL, for example:
select * from users where id=(@id:=123)
In Impala:
impala-shell> ?
Impala version is v2.0.0. Any suggestions will be appreciated. Thanks!

impala-shell> set var:id=123;select * from users where id=${VAR:id};
This variable can also be passed from the command line using --var
impala-shell --var id=123
impala-shell> select * from users where id=${VAR:id};
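In versions of impala-shell that support --var, the same substitution also works with a query file via -f (a minimal sketch; users.sql and the id value are illustrative):
$ cat users.sql
select * from users where id=${VAR:id};
$ impala-shell --var id=123 -f users.sql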

There's an open feature request for adding variable substitution support to impala-shell: IMPALA-1067, to mimic Hive's similar feature (hive --hivevar param=60 substitutes ${hivevar:param} inside a query with 60).
Variables that you can use in other SQL contexts (e.g. from a JDBC client) are not supported either, and I couldn't find an open request for that. You might want to file one at https://issues.cloudera.org/browse/IMPALA

impala-shell -i node.domain:port -B --var="table=metadata" --var="db=transaction" -f "file.sql"
file.sql:
SELECT * FROM ${var:db}.${var:table};

Related

Oracle sqlplus execute sql script from command line while passing parameters to the sql script

I have a script, script.sql, which I want to execute from the command line using Oracle, passing it two parameter values, as shown below:
sqlplus user/pass @script.sql my_parameter1_value my_parameter2_value
What should script.sql contain in order to be able to run it with the parameter values?
The solution can be put together by looking at the Oracle blogs:
https://blogs.oracle.com/opal/sqlplus-101-substitution-variables#2_7
For the question above, the solution would be to create a script.sql like this:
DEFINE START_VALUE = &1;
DEFINE STOP_VALUE = &2;
SELECT * FROM my_table
WHERE
value BETWEEN &&START_VALUE AND &&STOP_VALUE;
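With that script saved as script.sql, the call from the question becomes (values illustrative):
sqlplus user/pass @script.sql 100 200
Here &1 and &2 pick up the positional arguments, and the DEFINEd names just make the query easier to read.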
I wanted to run a script that would return all orders raised during the last seven days. Here's how...
the script
SELECT * FROM orders_detail WHERE order_date BETWEEN '&1' AND '&2';
EXIT;
the command
sqlplus ot/Orcl1234@xepdb1 @"/opt/oracle/oradata/Custom Scripts/orders_between_dates.sql" $(date +%d-%b-%Y -d '-7 days') $(date +%d-%b-%Y)
Hope that helps someone. Good luck.

Hive one line command to catch SCHEMA + TABLE NAME info

Is there a way to catch all schema + table name info in a single command through Hive in a similar way to
SELECT * FROM information_schema.tables
from the PostgreSQL world?
show databases and show tables combined in a loop is one answer, but I'm looking for a more compact way to get the same result in a single command.
It's been a long time since I worked on Hive queries, but as far as I remember you can probably use
hive> desc formatted tableName;
or
hive> describe formatted tableName;
It will give you all the relevant information about the table: the schema, partition info, table type (managed table, etc.), and so on.
I am not sure if this is exactly what you are looking for?
There is another way to query Hive tables: writing Hive scripts which can be called from the Hadoop terminal rather than from the Hive terminal itself.
std]$ cat sample.hql   # or: vi sample.hql
use dbName;
select * from tableName;
desc formatted tableName;
# this hql script can be called from outside the hive terminal
std]$ hive -f sample.hql
or, without even having to write a script file, you can query Hive as
std]$ hive -e "use dbName; select * from emp;" > text.txt   # use >> instead to append
On the Database level, you can probably query as :
hive> use dbName;
hive> set hive.cli.print.current.db=true;
hive(dbName)> describe database dbName;
it will bring back metadata about the database from the metastore (typically MySQL).
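If you have direct access to the metastore database itself, you can get close to information_schema.tables in a single query (a sketch, assuming a MySQL-backed metastore with the standard schema; table and column names can vary between Hive versions):
-- run against the metastore database in MySQL, not inside Hive
SELECT d.NAME AS db_name, t.TBL_NAME AS table_name
FROM DBS d JOIN TBLS t ON t.DB_ID = d.DB_ID
ORDER BY d.NAME, t.TBL_NAME;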

How to pass multiple parameter in hive script

employee: [table data shown as an image in the original post]
I want to fetch the records with year=2016 by running the Hive script sample.hql.
use octdb;
select * from '${hiveconf:table}' where year = '${hiveconf:year}';
[cloudera@quickstart ~]$ hive -hiveconf table='employee', year=2016 -f sample.hql
But I am getting the error NoViableAltException(307@[])...
You need to use the --hiveconf option twice:
hive --hiveconf table=employee --hiveconf year=2016 -f sample.hql
You should use --hivevar instead with newer Hive versions. Earlier, developers set configuration with --hiveconf, and it was also used for variables; later, --hivevar was implemented to provide a separate namespace for variables, as mentioned in HIVE-2020.
Use following with beeline
beeline --hivevar table=employee --hivevar year=2016 -f sample.hql
With this, in the Hive script file you can access these variables directly or using the hivevar namespace, like below.
select * from ${table};
select * from ${hivevar:table};
Please, note that you may need to specify URL string using -u <db_URL> option.
After some research I found the correct answer: ${hiveconf:table} should be referenced in the script without quotes (' ').
sample.hql:
use ${hiveconf:database};
select * from ${hiveconf:table} where year = ${hiveconf:year};
Running sample.hql
[cloudera@quickstart shell]$ hive -hiveconf database=octdb -hiveconf table=employee -hiveconf year=2016 -f sample.hql
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
OK
Time taken: 1.484 seconds
OK
1 A 2016
2 B 2016
4 D 2016
Time taken: 4.423 seconds, Fetched: 3 row(s)
Passing variables can also be achieved through "hivevar" along with "hiveconf".
Here is the difference:
The hiveconf namespace was added and (--hiveconf) should be used to set Hive configuration values.
The hivevar namespace was added and (--hivevar) should be used to define user variables.
Using hiveconf will also work, but isn't recommended for variable substitution as hivevar is explicitly created for that purpose.
set hivevar:YEAR=2018;
SELECT * from table where year=${YEAR};
hive --hiveconf var='hello world' -e '!echo ${hiveconf:var};'
-- this will print: hello world

Is it possible to count the number of partitions?

I am working on a test in which I must find the number of partitions of a table and check that it is right. If I use show partitions TableName I get all the partitions by name, but I would like to get just the number of partitions, something along the lines of show count(partitions) TableName (which just returns OK, so it's no good), and get back 12 (for example).
Is there any way to achieve this?
Using Hive CLI
$ hive --silent -e "show partitions <dbName>.<tableName>;" | wc -l
--silent is to enable silent mode
-e tells hive to execute quoted query string
You could use:
select count(distinct <partition key>) from <TableName>;
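Note that this counts only the partitions that actually contain rows, unlike show partitions. For a table partitioned on more than one column, count the distinct key combinations instead (a sketch; year and month are hypothetical partition columns):
select count(*) from (select distinct year, month from <TableName>) p;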
By using the command below you will get all the partitions, and at the end it also shows the number of fetched rows. That row count is the number of partitions.
SHOW PARTITIONS [db_name.]table_name [PARTITION(partition_spec)];
You can use the WebHCat interface to get information like this. This has the benefit that you can run the command from anywhere that the server is accessible. The result is JSON - use a JSON parser of your choice to process the results.
In this example of piping the WebHCat results to Python, only the number 24 is returned representing the number of partitions for this table. (Server name is the name node).
curl -s 'http://<myservername>:50111/templeton/v1/ddl/database/<mydatabasename>/table/<mytablename>/partition?user.name=<myusername>' | python -c 'import sys, json; print len(json.load(sys.stdin)["partitions"])'
24
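If jq is available, the same count can be extracted without Python (same placeholders as above):
curl -s 'http://<myservername>:50111/templeton/v1/ddl/database/<mydatabasename>/table/<mytablename>/partition?user.name=<myusername>' | jq '.partitions | length'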
In Scala (via Spark SQL) you can do the following:
sql("show partitions <table_name>").count()
I used the following:
beeline -silent --showHeader=false --outputformat=csv2 -e 'show partitions <dbname>.<tablename>' | wc -l
Use the following syntax:
show create table <table name>;

Hive: writing column headers to local file?

Hive documentation lacking again:
I'd like to write the results of a query to a local file as well as the names of the columns.
Does Hive support this?
Insert overwrite local directory 'tmp/blah.blah' select * from table_name;
Also, a separate question: is Stack Overflow the best place to get Hive help? @Nija has been very helpful, but I don't want to keep bothering them...
Try
set hive.cli.print.header=true;
Yes, you can. Put set hive.cli.print.header=true; in a .hiverc file in your home directory or in one of the other Hive user properties files.
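For example (a sketch; ~/.hiverc is the default per-user location):
$ cat ~/.hiverc
set hive.cli.print.header=true;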
Vague Warning: be careful, since this has crashed queries of mine in the past (but I can't remember the reason).
Indeed, @nija's answer is correct - at least as far as I know. There isn't any way to write the column names when doing an insert overwrite into [local] directory ... (whether you use local or not).
With regards to the crashes described by @user1735861, there is a known bug in Hive 0.7.1 (fixed in 0.8.0) whereby, after doing set hive.cli.print.header=true;, a NullPointerException occurs for any HQL command/query that produces no output. For example:
$ hive -S
hive> use default;
hive> set hive.cli.print.header=true;
hive> use default;
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:222)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:287)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:517)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
Whereas this is fine:
$ hive -S
hive> set hive.cli.print.header=true;
hive> select * from dual;
c
c
hive>
Non-HQL commands are fine though (set, dfs, !, etc.)
More info here: https://issues.apache.org/jira/browse/HIVE-2334
Hive does support writing to the local directory. Your syntax looks right for it as well.
Check out the docs on SELECTS and FILTERS for additional information.
I don't think Hive has a way to write the names of the columns to a file for the query you're running. I can't say for sure it doesn't, but I do not know of one.
I think the only place better than SO for Hive questions would be the mailing list.
I ran into this problem today and was able to get what I needed by doing a UNION ALL between the original query and a new dummy query that creates the header row. I added a sort column on each section and set the header to 0 and the data to a 1 so I could sort by that field and ensure the header row came out on top.
create table new_table as
select
field1,
field2,
field3
from
(
select
0 as sort_col, --header row gets lowest number
'field1_name' as field1,
'field2_name' as field2,
'field3_name' as field3
from
some_small_table --table needs at least 1 row
limit 1 --only need 1 header row
union all
select
1 as sort_col, --original query goes here
field1,
field2,
field3
from
main_table
) a
order by
sort_col --make sure header row is first
It's a little bulky, but at least you can get what you need with a single query.
Hope this helps!
Not a great solution, but here is what I do:
create table test_dat
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" STORED AS
INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION '/tmp/test_dat' as select * from YOUR_TABLE;
hive -e 'set hive.cli.print.header=true;select * from YOUR_TABLE limit 0' > /tmp/test_dat/header.txt
cd /tmp/test_dat && cat header.txt 000* > all.dat
Here's my take on it. Note, I'm not very well versed in bash, so suggestions for improvement are welcome :)
#!/usr/bin/env bash
# works like this:
# ./get_data.sh database.table > data.csv
INPUT=$1
TABLE=${INPUT##*.}
DB=${INPUT%.*}
HEADER=`hive -e "
set hive.cli.print.header=true;
use $DB;
INSERT OVERWRITE LOCAL DIRECTORY '$TABLE'
row format delimited
fields terminated by ','
SELECT * FROM $TABLE;"`
HEADER_WITHOUT_TABLE_NAME=${HEADER//$TABLE./}
echo ${HEADER_WITHOUT_TABLE_NAME//[[:space:]]/,}
cat $TABLE/*
