I am trying to subtract two columns and fetch the result if the difference is greater than 100 in Hive. I have written the following query:
select District.ID,Year,(volume_IN-volume_OUT) as d1 from petrol where d1>100;
but I am getting an error.
The table column names are:
District.ID, Distributer name, volume_IN ,volume_OUT, Year
Please help me. Is there any error in the query? I am new to Hive.
One of the limitations of Hive is that you cannot refer to a column alias in the WHERE clause of the same query that defines it.
Try writing a subquery, something like the following:
select * from (select District_ID,year, (volume_IN-volume_OUT) as d1 from petrol) t1 where d1>100;
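The subquery pattern can be tried out without a Hive cluster. Below is a minimal sketch that runs the same statement through Python's built-in sqlite3 module; SQLite is only a stand-in for Hive here, and the three sample rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE petrol ("
    "District_ID TEXT, Year INTEGER, volume_IN INTEGER, volume_OUT INTEGER)"
)
# Hypothetical sample data, not taken from the question.
conn.executemany(
    "INSERT INTO petrol VALUES (?, ?, ?, ?)",
    [("D1", 2015, 500, 350), ("D2", 2015, 300, 250), ("D3", 2016, 900, 700)],
)

# The alias d1 is defined in the inner query, so the outer WHERE can see it.
query = """
    SELECT *
    FROM (
        SELECT District_ID, Year, (volume_IN - volume_OUT) AS d1
        FROM petrol
    ) t1
    WHERE d1 > 100
    ORDER BY District_ID
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Only the districts whose difference exceeds 100 come back; the row with a difference of 50 is filtered out by the outer WHERE.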
I'm getting a "column ambiguously defined" error. I'm trying to write a query to get data from three different tables. My query is:
select flight_no,
country_code,
destination,
depatue_time,
arrival_time
from flight,
country,
flight_availibility
where country_code='MCT'
and destination='IND'
order by flight_no;
and I'm getting an error. Can anybody tell me what is wrong?
The error is telling you that you're asking for a column whose name appears in more than one of those tables. You want 'flight_no', but there may be a column named 'flight_no' in both the flight and country tables.
To avoid this, use aliases in your FROM clause.
Example
select a.col1, b.col1
from table1 a, table2 b
where a.id = b.id;
This isn't ambiguous, because you're explicitly telling Oracle which columns named 'col1' you want - you're not making it guess.
This isn't part of your question, but your query as you write it will cause the database to join every record in each table to every other record in the other two tables. This is known as a Cartesian Join or Product.
Is it necessarily wrong? Maybe not. But 99% of the time, it's not what you want. You only want the rows that 'match' - so use a WHERE clause and filter out the rows.
Or go the ANSI JOIN way and do it in the FROM clause.
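To see both the ambiguity and the alias fix in action, here is a small sketch using Python's built-in sqlite3 module as a stand-in for Oracle (an assumption: the error wording differs from ORA-00918, but the cause is the same). The table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, col1 TEXT);
    CREATE TABLE table2 (id INTEGER, col1 TEXT);
    INSERT INTO table1 VALUES (1, 'a-one'), (2, 'a-two');
    INSERT INTO table2 VALUES (1, 'b-one'), (3, 'b-three');
""")

# Unqualified col1 exists in both tables, so the engine cannot pick one.
try:
    conn.execute("SELECT col1 FROM table1, table2 WHERE table1.id = table2.id")
    ambiguous_error = None
except sqlite3.OperationalError as exc:
    ambiguous_error = str(exc)
print(ambiguous_error)

# Aliases remove the ambiguity; the ANSI JOIN keeps the join condition
# in the FROM clause instead of the WHERE clause.
rows = conn.execute("""
    SELECT a.col1, b.col1
    FROM table1 a
    INNER JOIN table2 b ON a.id = b.id
""").fetchall()
print(rows)
```

The first statement fails with an "ambiguous column" error, while the aliased version returns only the single matching pair.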
The error occurs because the same column name exists in two of the tables, so running the query with only the bare column name is ambiguous.
Try the following to avoid the issue:
select a.flight_no,b.country_code,a.destination,c.depature_time,c.arrival_time
from
flight a ,
country b
I am having difficulty getting a dump (a text file delimited by ^) of a Hive query for my project, sentiment analysis of the stock market using Twitter.
The query, whose output I want in HDFS or the local file system, is given below:
hive> select t.cmpname,t.datecol,t.tweet,st.diff FROM tweet t LEFT OUTER JOIN stock st ON(t.datecol = st.datecol AND lower(t.cmpname) = lower(st.cmpname));
The query produces the correct output, but when I try dumping it to HDFS it gives me an error.
I went through various other solutions on Stack Overflow for dumping query output, but I was not able to find one that suits my case.
Thanks for your help.
INSERT OVERWRITE DIRECTORY '/path/to/dir'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '^'
SELECT t.cmpname,t.datecol,t.tweet,st.diff FROM tweet t LEFT OUTER JOIN stock st
ON(t.datecol = st.datecol AND lower(t.cmpname) = lower(st.cmpname));
I'm attempting to run the following query, where I use a window function to fetch the next log timestamp and then subtract the current timestamp from it.
SELECT
LEAD(timestamp) OVER (PARTITION BY id ORDER BY timestamp) AS lead_timestamp,
timestamp,
(lead_timestamp - timestamp) as delta
FROM logs;
However, when I do this, I get the following error:
FAILED: SemanticException [Error 10004]: Line 4:1 Invalid table alias or column reference 'lead_timestamp': (possible column names are: logs.timestamp, logs.latitude, logs.longitude, logs.principal_id)
If I drop this subtraction, the rest of the query works, so I'm stumped - am I using the AS syntax wrong above for lead_timestamp?
One of the limitations of Hive is that you can't refer to aliases you assigned in the same query (except in the HAVING clause). This is due to the way the code is structured around aliasing. You'll have to write this using a subquery.
SELECT lead_timestamp, timestamp, (lead_timestamp - timestamp) AS delta
FROM (
SELECT LEAD(timestamp) OVER (PARTITION BY id ORDER BY timestamp) AS lead_timestamp,
timestamp
FROM logs
) a;
It's ugly, but works.
I have a requirement to do a nested select within a WHERE clause in a Hive query. A sample code snippet is as follows:
select *
from TableA
where TA_timestamp > (select timestmp from TableB where id="hourDim")
Is this possible, or am I doing something wrong here? I am getting an error while running the above script.
To elaborate on what I am trying to do: there is a Cassandra keyspace to which I publish statistics with a timestamp. Periodically (hourly, for example) these stats are summarized using Hive, and the summarized data is stored separately with the corresponding hour. So when the query runs for the second time (and on every later run), it should only process the new data (i.e. timestamp > previous_execution_timestamp). I am trying to do that by storing the latest executed timestamp in a separate Hive table and then using that value to filter the raw stats.
Can this be achieved using Hive?
Subqueries inside a WHERE clause were not supported in Hive before version 0.13 (later versions support IN, NOT IN, EXISTS and NOT EXISTS subqueries there):
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries
However, often you can use a JOIN statement instead to get to the same result:
https://karmasphere.com/hive-queries-on-table-data#join_syntax
For example, this query:
SELECT a.KEY, a.value
FROM a
WHERE a.KEY IN
(SELECT b.KEY FROM b);
can be rewritten to:
SELECT a.KEY, a.value
FROM a LEFT SEMI JOIN b ON (a.KEY = b.KEY)
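The point of the semi join is that each row of a is returned at most once, even when b contains duplicate keys. Here is a rough sketch of that behaviour using Python's built-in sqlite3 module. Note the assumptions: SQLite has no LEFT SEMI JOIN syntax, so an EXISTS subquery stands in for it, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (key INTEGER, value TEXT);
    CREATE TABLE b (key INTEGER);
    INSERT INTO a VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO b VALUES (1), (1), (3);  -- key 1 appears twice on purpose
""")

# The original IN form.
in_rows = conn.execute("""
    SELECT a.key, a.value FROM a
    WHERE a.key IN (SELECT b.key FROM b)
    ORDER BY a.key
""").fetchall()

# Stand-in for Hive's LEFT SEMI JOIN: keep a row of a if any match exists in b.
semi_rows = conn.execute("""
    SELECT a.key, a.value FROM a
    WHERE EXISTS (SELECT 1 FROM b WHERE b.key = a.key)
    ORDER BY a.key
""").fetchall()

# A plain inner join is NOT equivalent: duplicates in b multiply rows of a.
join_rows = conn.execute("""
    SELECT a.key, a.value FROM a
    INNER JOIN b ON a.key = b.key
    ORDER BY a.key
""").fetchall()

print(in_rows, semi_rows, join_rows)
```

The IN and semi-join forms agree, while the ordinary inner join returns the row for key 1 twice, which is why the rewrite uses a semi join rather than a regular JOIN.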
Looking at the business requirements underlying your question, it occurs to me that you might get more efficient results by partitioning your Hive table by hour. If the data can be written with the hour as the partition key, then your query to update the summary will be much faster and require fewer resources.
Partitions can get out of hand when they number in the millions, but this seems like a case that will not come close to that limit.
It will work if you write:
select *
from TableA
where TA_timestamp in (select timestmp from TableB where id="hourDim")
EXPLANATION: >, < and = need exactly one value on the right-hand side, whereas here the subquery can return multiple values, which only an IN clause can accept.
I am new to Oracle and working with a fairly large database. I would like to perform a query that will select the desired columns, order by a certain column and also limit the results. According to everything I have read, the below query should be working but it is returning "ORA-00918: column ambiguously defined":
SELECT * FROM(SELECT * FROM EAI.EAI_EVENT_LOG e,
EAI.EAI_EVENT_LOG_MESSAGE e1 WHERE e.SOURCE_URL LIKE '%.XML'
ORDER BY e.REQUEST_DATE_TIME DESC) WHERE ROWNUM <= 20
Any suggestions would be greatly appreciated :D
The error message means your result set contains two columns with the same name. Each column in a query's projection needs to have a unique name. Presumably you have a column (or columns) with the same name in both EAI_EVENT_LOG and EAI_EVENT_LOG_MESSAGE.
You also want to join on that column. At the moment you are generating a cross join between the two tables. In other words, if you have a hundred records in EAI_EVENT_LOG and two hundred records in EAI_EVENT_LOG_MESSAGE, your result set will be twenty thousand records (without the rownum). This is probably not your intention.
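The row-count arithmetic is easy to check. The sketch below uses Python's built-in sqlite3 module with two hypothetical tables of 100 and 200 rows; the table names and the event_id join column are assumptions made for illustration, not the real EAI schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_log (event_id INTEGER)")
conn.execute("CREATE TABLE event_log_message (event_id INTEGER)")

# 100 events, and 200 messages that each point at exactly one event.
conn.executemany("INSERT INTO event_log VALUES (?)", [(i,) for i in range(100)])
conn.executemany(
    "INSERT INTO event_log_message VALUES (?)", [(i % 100,) for i in range(200)]
)

# No join condition: every event is paired with every message.
cross_count = conn.execute(
    "SELECT COUNT(*) FROM event_log e, event_log_message e1"
).fetchone()[0]

# With a join condition, only matching pairs survive.
joined_count = conn.execute(
    "SELECT COUNT(*) FROM event_log e "
    "INNER JOIN event_log_message e1 ON e.event_id = e1.event_id"
).fetchone()[0]

print(cross_count, joined_count)
```

Without the join condition the count is 100 x 200 = 20000; with it, each message matches a single event and the count drops to 200.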
"By switching to innerjoin, will that eliminate the error with the current code?"
No, you'll still need to handle having two columns with the same name. Basically this comes from using SELECT * on multiple tables. SELECT * is bad practice. It's convenient, but it is always better to specify the exact columns you want in the query's projection. That way you can include (say) e.TRANSACTION_ID and exclude e1.TRANSACTION_ID, and avoid the ORA-00918 exception.
Maybe you have some columns with identical names in both the EAI_EVENT_LOG and EAI_EVENT_LOG_MESSAGE tables? Instead of SELECT *, list all the columns you want to select.
The other problem I see is that you are selecting from two tables but not joining them in the WHERE clause, so the result set will be the cross product of those two tables.
You need to stop using SQL '89 implicit join syntax.
Not because it doesn't work, but because it is evil.
Right now you have a cross join, which in 99.9% of cases is not what you want.
Also, every sub-select should have its own alias.
SELECT * FROM
(SELECT e.*, e1.* FROM EAI.EAI_EVENT_LOG e
INNER JOIN EAI.EAI_EVENT_LOG_MESSAGE e1 on (......)
WHERE e.SOURCE_URL LIKE '%.XML'
ORDER BY e.REQUEST_DATE_TIME DESC) s WHERE ROWNUM <= 20
Please specify a join criterion on the dotted line.
Normally you join on a key field, e.g. ON (e.id = e1.event_id).
It's a bad idea to use SELECT *; it's better to specify exactly which fields you want:
SELECT e.field1 as customer_id
,e.field2 as customer_name
.....