How to use output of hive query in another hive query? - hadoop

Hello Friends,
I want to use the output of one query in another query.
set iCount = 12;
Setting a constant value like this is fine, but I don't know how to set the variable dynamically, as shown below.
set iCount = select count(distinct colName) from table;
This just stores the query text as a string, whatever query is passed. Instead of the query itself, I want the variable to hold its result.
Thanks in advance
Pankaj Sharma

You can't do it that way. You could try using Oozie to automate the Hive query and the Java process you want to execute, storing the output of the Hive query in a directory that the Java program reads from.
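Another common workaround, if you run Hive from a shell script, is to capture the first query's result in a shell variable and pass it back in with -hivevar (the same pattern as in the date_add answer further down). A minimal sketch; other_table and col are placeholder names:

# capture the scalar result of the first query (-S = silent mode, suppresses log output)
iCount=$(hive -S -e "SELECT count(DISTINCT colName) FROM table;")
# pass it into the second query as a Hive variable
hive -hivevar iCount="$iCount" -e 'SELECT * FROM other_table WHERE col <= ${hivevar:iCount};'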

Related

Create parameterized view in Impala

My goal is to create a parameterized view in Impala so users can easily change values in a query. If I run the query below in Hue, for example, it is possible to enter a value.
SELECT * FROM customers WHERE customer_id = ${id}
But I would like to create a view so that, when you run it, it asks you for the value you want to search for. This way is not working:
CREATE VIEW test AS SELECT * FROM customers WHERE customer_id = ${id}
Does anyone know if this is possible?
Many thanks
When you create a view, it takes the variable's value at creation time.
Two workarounds exist:
Create a real table where you will store/update the parameter.
CREATE VIEW test AS SELECT * FROM customers JOIN id_table ON customer_id = id_table.id
Pass a parameter into the view with the help of user-defined functions (UDFs). You will probably need two UDFs, set and get: the set UDF writes the variable to HDFS and the get UDF reads it back from HDFS.
The two workarounds above work, but they are not ideal. My suggestion is to use Hive for parameterized view creation: you can create a GenericUDF through which you can access the Hive configuration, read the variable, and perform the filtering. You can't use it with Impala.
SELECT Generic_UDF(array(customer_id)) FROM customers
GenericUDF has a configure method you can use to read the Hive variable:
@Override
public void configure(MapredContext mapredContext) {
    // read the variable from the job configuration
    String name = mapredContext.getJobConf().get("name");
}
You could also do the opposite, i.e. parameterize the query on the view instead, as sketched below.
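A minimal sketch of that approach: keep the view itself unparameterized and supply the value when querying it (in Hue, ${id} is prompted for at run time):

CREATE VIEW test AS SELECT * FROM customers;
SELECT * FROM test WHERE customer_id = ${id};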

Hive - Using date_add in a where clause

I have a table that is partitioned by date. To query the last 10 days of data I usually write something that looks like:
SELECT * FROM table WHERE date = date_add(current_date, -10);
A coworker said this makes the query less efficient than using a simple date string. Is this the case? Can someone explain this to me? Is there a way to write a dynamic date into the WHERE clause that is efficient?
The only potential problem here is partition pruning. Partition pruning may not work with a function in some Hive versions. You can easily check it yourself by executing an EXPLAIN EXTENDED <your select query> command: it will print all partition paths to be queried.
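For example, for the query above:

EXPLAIN EXTENDED SELECT * FROM table WHERE date = date_add(current_date, -10);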
If pruning does not work, pre-calculate the value in the shell and pass it in as a parameter:
date_var=$(date +'%Y-%m-%d' --date "-10 day")
#call your script
hive -hivevar date_var="$date_var" -f your_script.hql
And use the variable in the script:
SELECT * FROM table WHERE date = '${hivevar:date_var}';
And if partition pruning works correctly, you do not need to bother at all.

How to execute select query on oracle database using pyspark?

I have written a program using pyspark to connect to an Oracle database and fetch data. The command below works fine and returns the contents of the table:
sqlContext.read.format("jdbc")
.option("url","jdbc:oracle:thin:user/password#dbserver:port/dbname")
.option("dbtable","SCHEMA.TABLE")
.option("driver","oracle.jdbc.driver.OracleDriver")
.load().show()
Now I do not want to load the entire table; I want to load only selected records. Can I specify a select query as part of this command? If so, how?
Note: I could load the full DataFrame and run a select query on top of it, but I do not want to do that. Please help!!
You can use a subquery in the dbtable option:
.option("dbtable", "(SELECT * FROM tableName) AS tmp where x = 1")
There is a similar question about MySQL.
In general, the optimizer SHOULD be able to push down any relevant select and where elements, so if you do df.select("a","b","c").where("d<10"), this should generally be pushed down to Oracle. You can check by calling df.explain(True) on the final DataFrame.
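A minimal PySpark sketch of that; the column names a, b, c, d and the connection details are placeholders:

df = sqlContext.read.format("jdbc") \
    .option("url", "jdbc:oracle:thin:user/password@dbserver:port/dbname") \
    .option("dbtable", "SCHEMA.TABLE") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()
# the projection and filter should be pushed down into the generated Oracle query
df.select("a", "b", "c").where("d < 10").explain(True)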

How can I run Hive Explain command from java code?

I want to run Hive and Impala EXPLAIN and COMPUTE STATS commands from Java code, so that I can use the collected information for analysis. If anyone has any ideas, please help.
You can run them like any other JDBC query against Impala.
The COMPUTE STATS query for a table called temp would be "compute stats temp", and you can pass this as the argument to the JDBC statement.execute.
Similarly, to explain a query, say "select count(*) from temp", the query to pass to statement.execute is "explain select count(*) from temp".
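A minimal Java sketch of that; the Impala connection string, host, and port are assumptions (use a jdbc:hive2:// URL for HiveServer2), and the driver class depends on the JDBC driver you ship:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StatsAndExplain {
    public static void main(String[] args) throws Exception {
        // assumed Impala JDBC endpoint; adjust host/port/database for your cluster
        try (Connection conn = DriverManager.getConnection("jdbc:impala://host:21050/default");
             Statement stmt = conn.createStatement()) {
            // COMPUTE STATS does not return a result set
            stmt.execute("compute stats temp");
            // EXPLAIN returns the plan as rows of text
            try (ResultSet rs = stmt.executeQuery("explain select count(*) from temp")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}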

Query partition with calculation and avoid full table scan

I am an analyst trying to build a query to pull the last 7 days of data from a table in Hadoop. The table itself is partitioned by date.
When I test my query with hard-coded dates, everything works as expected. However, when I write it to calculate the dates from today's timestamp, it does a full table scan and I had to kill the job.
Sample query:
SELECT * FROM target_table
WHERE date >= DATE_SUB(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'),7);
I'd appreciate some advice on how I can revise my query while avoiding the full table scan.
Thank you!
I'm not sure that I have an elegant solution, but since I use Oozie for workflow coordination, I pass in the start_date and end_date from Oozie. In the absence of Oozie I might use bash to calculate the appropriate dates and pass them in as parameters.
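In bash that might look like the following (the script name is a placeholder; it is the same pattern as in the date_add answer above):

# compute the boundary date, 7 days back
start_date=$(date +'%Y-%m-%d' --date "-7 day")
# pass it into the Hive script as a variable
hive -hivevar start_date="$start_date" -f your_script.hql

And inside your_script.hql:

SELECT * FROM target_table
WHERE date >= '${hivevar:start_date}';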
Partition filters have always had this problem, so I found myself a workaround. It works for me when the number of days is more than 30/60/90/120. The query looks like:
(unix_timestamp(date, 'yyyy-MM-dd') >= unix_timestamp(date_sub(from_unixtime(unix_timestamp(), 'yyyy-MM-dd'), ${sub_days}), 'yyyy-MM-dd'))
AND (unix_timestamp(date, 'yyyy-MM-dd') <= unix_timestamp(from_unixtime(unix_timestamp(), 'yyyy-MM-dd'), 'yyyy-MM-dd'))
Here sub_days is a passed-in parameter; in this case it would be 7.
