How to use Hive TABLESAMPLE with subquery - hadoop

I'm using Hive version 1.1.0.
I'm trying to sample a table with TABLESAMPLE applied to a subquery, so that I can filter with a WHERE clause first.
SELECT *
FROM (SELECT * FROM table WHERE field='A') f
TABLESAMPLE(1 PERCENT);
But I get an error:
Error: Error while compiling statement: FAILED: ParseException line 1:45 missing EOF at 'TABLESAMPLE' near 'f' (state=42000,code=40000)
How to correctly use TABLESAMPLE with subqueries?

I am not sure about Hive 1.x, but in Hive 2.1 TABLESAMPLE can be used right after the table name.
So you need to put TABLESAMPLE after the table name and move the WHERE clause to the outer query. This works in my Hive; I am not sure about your version.
Logically it makes sense: we pick a sample first and then work on it.
SELECT *
FROM (SELECT * FROM mytable TABLESAMPLE(1 PERCENT)) f
WHERE field='A'
For more detail, see the Hive language manual on sampling:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling
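If the goal is to filter first and then sample, one possible alternative is to sample rows with rand() in the same WHERE clause instead of using TABLESAMPLE. This is a minimal sketch, not tested on Hive 1.1.0. It reuses the table and column names from the posts above and gives an approximate 1 percent row-level sample, which is not the same thing as the block sampling done by TABLESAMPLE(1 PERCENT).

```sql
-- Keep rows where field = 'A', then keep each of those rows with a
-- probability of roughly 0.01. rand() yields a value between 0 and 1
-- that changes from row to row.
SELECT *
FROM mytable
WHERE field = 'A'
  AND rand() < 0.01;
```

Because rand() is unseeded here, each run returns a different set of rows, and the whole table is still scanned.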

Related

How can I get this syntax to work in Oracle?

Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
Simplest repro: how can I get this to work in Oracle?
SELECT 'X' NewColumn, * FROM MyTable;
I get
ORA-00936: missing expression
00936. 00000 - "missing expression"
*Cause:
*Action: Error at Line: 1 Column: 23
My actual issue is:
I'm using an ETL tool that allows automapping if I use SELECT *
I want to use ORA_ROWSCN to implement incremental loads
So the real query I'm running is:
SELECT ORA_ROWSCN, * FROM MyTable;
I get the same error for this
The syntax is valid if you alias the table and qualify the * with that alias.
SELECT ORA_ROWSCN, t.* FROM MyTable t;
No idea whether your ETL tool knows how to do that. Of course, you could always create a view vw_table that runs this select and then use the view in the ETL tool.
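The view suggestion could look roughly like the sketch below. The view name vw_table comes from the answer; the column alias row_scn is a hypothetical name chosen for illustration, added on the assumption that the pseudocolumn needs an explicit name inside a view.

```sql
-- Wrap the aliased select in a view so the ETL tool can keep using SELECT *.
-- row_scn is an illustrative alias, not a name from the original post.
CREATE OR REPLACE VIEW vw_table AS
SELECT t.ORA_ROWSCN AS row_scn, t.*
FROM MyTable t;

-- The ETL tool can then automap from:
SELECT * FROM vw_table;
```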

SubQuery works in IMPALA but not HIVE

I'm trying to understand why the following subquery will work in Impala and not Hive.
select * from MySchema.MyTable where identifier not in
(select identifier from schema.table where status_code in (1,2,3));
EDIT:
Added the error
Error while compiling statement: FAILED: SemanticException [Error
10249]: line 1:55 Unsupported SubQuery Expression 'identifier':
Correlating expression cannot contain unqualified column references.
The issue could be that 'identifier' appears in both the main query and the inner subquery. Explicitly stating which 'identifier' you mean, for example 'mytable.identifier', should resolve it.
This is probably a Hive issue that has been fixed in recent versions; it does not reproduce in Hive 3.1.0.
If you are still facing the issue, let us know which Hive version you are using and the DDL statements used to create the tables.
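A sketch of the qualified form, assuming the table names from the question (schema.table is the question's placeholder, and the aliases t and s are introduced here for illustration). It has not been verified against the asker's Hive version.

```sql
-- Every column reference is qualified with a table alias, so Hive does not
-- have to guess which 'identifier' belongs to the outer or inner query.
SELECT t.*
FROM MySchema.MyTable t
WHERE t.identifier NOT IN
  (SELECT s.identifier
   FROM schema.table s
   WHERE s.status_code IN (1, 2, 3));
```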

getting error while using alias in hive

I am trying to subtract two columns and fetch the result if the difference is greater than 100 in Hive. I have written the following query:
select District.ID,Year,(volume_IN-volume_OUT) as d1 from petrol where d1>100;
but I am getting an error.
The table column names are:
District.ID, Distributer name, volume_IN ,volume_OUT, Year
Please help me. Is there an error in the query? I am new to Hive.
One of the limitations of Hive is that you cannot refer to a column alias in the WHERE clause of the same query.
Try writing a subquery, something like the one below:
select * from (select District_ID,year, (volume_IN-volume_OUT) as d1 from petrol) t1 where d1>100;
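Another option, sketched here under the same assumption as the answer that the column is actually named District_ID, is to repeat the expression in the WHERE clause instead of referring to the alias.

```sql
-- The alias d1 is only used for the output column; the filter repeats
-- the underlying expression.
SELECT District_ID, Year, (volume_IN - volume_OUT) AS d1
FROM petrol
WHERE (volume_IN - volume_OUT) > 100;
```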

Hadoop view created with CTE misbehaves

Here is the view definition (it runs fine and the view gets created):
CREATE OR REPLACE VIEW my_view
AS WITH Q1
AS (SELECT MAX(LOAD_DT) AS LOAD_DT FROM load_table WHERE UCASE(TBL_NM) = 'FACT_TABLE')
SELECT F.COLUMN1
, F.COLUMN2
FROM Q1, FACT_TABLE F
WHERE Q1.LOAD_DT = F.TRAN_DT
;
However, when I run
SELECT * from my_view;
I get the following error message:
FAILED: SemanticException Line N:M Table not found 'Q1' in definition of view my_view....etc..
It looks like Hive is trying to treat Q1 (which is a CTE) as a physical table. Any ideas how to work around this?
Thank You,
Natalia
We faced a similar issue in our environment. To answer your question, it's a bug in Hive. Fortunately, there is a workaround: if you use both Impala and Hive and they share the same metastore, create the view in Impala and it will work in both Hive and Impala.
Reason:
Hive qualifies the CTE reference with your database name, which is what causes the issue.
Thanks,
Neo
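If Impala is not available, another thing that might be worth trying is to avoid the CTE inside the view and inline it as a derived table. This is an untested sketch, not something confirmed in the thread; it reuses the names from the question and assumes the bug only affects CTE references.

```sql
-- Same logic as the original view, with Q1 written as an inline subquery
-- instead of a WITH clause.
CREATE OR REPLACE VIEW my_view AS
SELECT F.COLUMN1
     , F.COLUMN2
FROM FACT_TABLE F
JOIN (SELECT MAX(LOAD_DT) AS LOAD_DT
      FROM load_table
      WHERE UCASE(TBL_NM) = 'FACT_TABLE') Q1
  ON Q1.LOAD_DT = F.TRAN_DT;
```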

Write a nested select statement with a where clause in Hive

I need to do a nested select within a WHERE clause in a Hive query. A sample code snippet would be as follows:
select *
from TableA
where TA_timestamp > (select timestmp from TableB where id="hourDim")
Is this possible, or am I doing something wrong here? I am getting an error when running the above script.
To elaborate on what I am trying to do: there is a Cassandra keyspace to which I publish statistics with a timestamp. Periodically (hourly, for example) these stats are summarized using Hive, and once summarized the data is stored separately with the corresponding hour. So when the query runs for the second time (and on consecutive runs), it should only process the new data (i.e. timestamp > previous_execution_timestamp). I am trying to do that by storing the latest executed timestamp in a separate Hive table and then using that value to filter the raw stats.
Can this be achieved using Hive?
Subqueries inside a WHERE clause were not supported in Hive before version 0.13 (which added IN, NOT IN, EXISTS and NOT EXISTS subqueries):
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries
However, often you can use a JOIN statement instead to get to the same result:
https://karmasphere.com/hive-queries-on-table-data#join_syntax
For example, this query:
SELECT a.KEY, a.value
FROM a
WHERE a.KEY IN
(SELECT b.KEY FROM B);
can be rewritten to:
SELECT a.KEY, a.value
FROM a LEFT SEMI JOIN b ON (a.KEY = b.KEY)
Looking at the business requirements underlying your question, it occurs to me that you might get more efficient results by partitioning your Hive table by hour. If the data can be written with the hour as the partition key, then your query to update the summary will be much faster and require fewer resources.
Partitions can get out of hand when they reach the scale of millions, but this seems like a case that will not hit that limitation.
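As a rough illustration of the partitioning idea, the sketch below uses an entirely hypothetical table (raw_stats) with made-up column names and an hour-string partition key. It assumes the raw data can be loaded into a native Hive table, which may not hold when the data is read directly from Cassandra.

```sql
-- Hypothetical table: all names and column types are illustrative only.
CREATE TABLE raw_stats (
  stat_key   STRING,
  stat_value DOUBLE,
  ts         BIGINT
)
PARTITIONED BY (stat_hour STRING);  -- for example '2014-01-01-13'

-- Summarize a single hour. The filter on the partition column lets Hive
-- read only that partition instead of scanning the whole table.
SELECT stat_key, SUM(stat_value) AS total
FROM raw_stats
WHERE stat_hour = '2014-01-01-13'
GROUP BY stat_key;
```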
It will work if you use IN:
select *
from TableA
where TA_timestamp in (select timestmp from TableB where id="hourDim")
EXPLANATION: The operators >, < and = need a single value on the right-hand side, while here the subquery can return multiple values, which can only be handled with an IN clause.
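Note that IN tests for equality, so it does not express the original "greater than" comparison. One way to keep the > semantics with a join is sketched below. It assumes the names from the question, and it uses MAX so that the derived table returns exactly one row; the aliases a, b and last_run are introduced here for illustration.

```sql
-- The derived table b yields a single row holding the last execution
-- timestamp; the cross join makes that value available to every row of
-- TableA so it can be compared with >.
SELECT a.*
FROM TableA a
CROSS JOIN (SELECT MAX(timestmp) AS last_run
            FROM TableB
            WHERE id = 'hourDim') b
WHERE a.TA_timestamp > b.last_run;
```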
