How to use NOT IN in Hive - hadoop

Suppose I have 2 tables as shown below. Now, if I want to achieve result which sql will give using, insert into B where id not in(select id from A)
which will insert 3 George in Table B.
How to implement this in hive?
Table A
id name
1 Rahul
2 Keshav
3 George
Table B
id name
1 Rahul
2 Keshav
4 Yogesh

NOT IN in the WHERE clause with uncorrelated subqueries is supported since Hive 0.13 which was released more than 3 years ago, on 21 April, 2014.
select * from A where id not in (select id from B where id is not null);
+----+--------+
| id | name |
+----+--------+
| 3 | George |
+----+--------+
On earlier versions the column of the outer table should be qualified with the table name/alias.
hive> select * from A where id not in (select id from B where id is not null);
FAILED: SemanticException [Error 10249]: Line 1:22 Unsupported SubQuery Expression 'id': Correlating expression cannot contain unqualified column references.
hive> select * from A where A.id not in (select id from B where id is not null);
OK
3 George
P.s.
When using NOT IN you should add is not null to the inner query, unless you are 100% sure that the relevant column does not contain null values.
One null value is enough to cause your query to return no results.

Related

UPDATE with JOIN syntax for Oracle Database

First, I execute the following SQL statements.
drop table names;
drop table ages;
create table names (id number, name varchar2(20));
insert into names values (1, 'Harry');
insert into names values (2, 'Sally');
insert into names values (3, 'Barry');
create table ages (id number, age number);
insert into ages values (1, 25);
insert into ages values (2, 30);
insert into ages values (3, 35);
select * from names;
select * from ages;
As a result, the following tables are created.
ID NAME
---------- ----------
1 Harry
2 Sally
3 Barry
ID AGE
---------- ----------
1 25
2 30
3 35
Now, I want to update increment the age of Sally by 1, i.e. set it to 31. The following query works fine.
update ages set age = age + 1 where id = (select id from names where name = 'Sally');
select * from ages;
The table now looks like this.
ID AGE
---------- ----------
1 25
2 31
3 35
I want to know if there is a way it can be done by joins. For example, I tried the following queries but they fail.
SQL> update ages set age = age + 1 from ages, names where ages.id = names.id and names.name = 'Sally';
update ages set age = age + 1 from ages, names where ages.id = names.id and names.name = 'Sally'
*
ERROR at line 1:
ORA-00933: SQL command not properly ended
SQL> update ages set age = age + 1 from names join ages on ages.id = names.id where names.name = 'Sally';
update ages set age = age + 1 from names join ages on ages.id = names.id where names.name = 'Sally'
*
ERROR at line 1:
ORA-00933: SQL command not properly ended
The syntax of the UPDATE statement is:
http://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_10007.htm
where dml_table_expression_clause is:
Please pay attention on ( subquery ) part of the above syntax.
The subquery is a feature that allows to perform an update of joins.
In the most simplest form it can be:
UPDATE (
subquery-with-a-join
)
SET cola=colb
Before update a join, you must know restrictions listed here:
https://docs.oracle.com/cd/B28359_01/server.111/b28286/statements_8004.htm
The view must not contain any of the following constructs:
A set operator
A DISTINCT operator
An aggregate or analytic function
A GROUP BY, ORDER BY, MODEL, CONNECT BY, or START WITH clause
A collection expression in a SELECT list
A subquery in a SELECT list
A subquery designated WITH READ ONLY
Joins, with some exceptions, as documented in Oracle Database Administrator's Guide
and also common rules related to updatable views - here (section: Updating a Join View):
http://docs.oracle.com/cd/B19306_01/server.102/b14231/views.htm#sthref3055
All updatable columns of a join view must map to columns of a
key-preserved table. See "Key-Preserved Tables" for a discussion of
key-preserved tables. If the view is defined with the WITH CHECK
OPTION clause, then all join columns and all columns of repeated
tables are not updatable.
We can first create a subquery with a join:
SELECT age
FROM ages a
JOIN names m ON a.id = m.id
WHERE m.name = 'Sally'
This query simply returns the following result:
AGE
----------
30
and now we can try to update our query:
UPDATE (
SELECT age
FROM ages a
JOIN names m ON a.id = m.id
WHERE m.name = 'Sally'
)
SET age = age + 1;
but we get an error:
SQL Error: ORA-01779:cannot modify a column which maps to a non key-preserved table
This error means, that one of the above restriction is not meet (key-preserved table).
However if we add primary keys to our tables:
alter table names add primary key( id );
alter table ages add primary key( id );
then now the update works without any error and a final outcome is:
select * from ages;
ID AGE
---------- ----------
1 25
2 31
3 35

how to group by desc order in hql

I have the following table called questions in HQL Hibernate:
ID | Name
1 | Bread
2 | Bread
3 | Rise
4 | Rise
I want to select each PRODUT only once and if there are multiple PRODUCT with the same name, select the one of the highest id. So, the expected results:
ID | NAME
3 | Bread
4 | Rise
I use the following query:
from Product AS E group by E.producto
So it selects the first 'Product' it encounters instead of the last one.
Thanks
The syntax is almost identical to SQL:
select max(p.id), p.name from Product p group by p.name
Relevant documentation:
http://docs.jboss.org/hibernate/core/4.3/manual/en-US/html/ch16.html#queryhql-aggregation
http://docs.jboss.org/hibernate/core/4.3/manual/en-US/html/ch16.html#queryhql-grouping

Hive: SemanticException [Error 10002]: Line 3:21 Invalid column reference 'name'

I am using the following hive query script for the version 0.13.0
DROP TABLE IF EXISTS movies.movierating;
DROP TABLE IF EXISTS movies.list;
DROP TABLE IF EXISTS movies.rating;
DROP DATABASE IF EXISTS movies;
ADD JAR /usr/local/hadoop/hive/hive/lib/RegexLoader.jar;
CREATE DATABASE IF NOT EXISTS movies;
CREATE EXTERNAL TABLE IF NOT EXISTS movies.list (id STRING, name STRING, genre STRING)
ROW FORMAT SERDE 'com.cisco.hadoop.loaders.RegexSerDe'with SERDEPROPERTIES(
"input.regex"="^(.*)\\:\\:(.*)\\:\\:(.*)$",
"output.format.string"="%1$s %2$s %3$s");
CREATE EXTERNAL TABLE IF NOT EXISTS movies.rating (id STRING, userid STRING, rating STRING, timestamp STRING)
ROW FORMAT SERDE 'com.cisco.hadoop.loaders.RegexSerDe'
with SERDEPROPERTIES(
"input.regex"="^(.*)\\:\\:(.*)\\:\\:(.*)\\:\\:(.*)$",
"output.format.string"="%1$s %2$s %3$s %4$s");
LOAD DATA LOCAL INPATH 'ml-10M100K/movies.dat' into TABLE movies.list;
LOAD DATA LOCAL INPATH 'ml-10M100K/ratings.dat' into TABLE movies.rating;
CREATE TABLE movies.movierating(id STRING, name STRING, genre STRING, rating STRING);
INSERT OVERWRITE TABLE movies.movierating
SELECT list.id, list.name, list.genre, rating.rating from movies.list list LEFT JOIN movies.rating rating ON (list.id=rating.id) GROUP BY list.id;
The issue is when I execute the script without the "GROUP BY" clause it works fine.
But when I execute it with the "GROUP BY" clause, I get the following error
FAILED: SemanticException [Error 10002]: Line 4:21 Invalid column reference 'name'
Any ideas what is happening here?
Appreciate your help
Thanks!
If you group by a column, your select statement can only select a) that column, b) columns derived only from that column, or c) a UDAF applied to other columns.
In this case, you're only grouping by list.id, so when you try to select list.name, that's invalid. Think about it this way: what if your list table contained the following two entries:
id|name |genre
--+-----+------
01|name1|comedy
01|name2|horror
What would you expect this query to return:
select list.id, list.name, list.genre from list group by list.id;
In this case it's nonsensical. I'm guessing that id in reality is a primary key, but note that hive does not know this, so the above data set is perfectly valid.
With all that in mind, it's not clear to me how to fix it because I don't know the desired output. For example, let's say without the group by (just the join), you have as output:
id|name |genre |rating
--+-----+------+-------
01|name1|comedy|'pretty good'
01|name1|comedy|'bad'
02|name2|horror|'9/10'
03|name3|action|NULL
What would you want the output to be with the group by? What are you trying to accomplish by doing the group by?
OK let me see if I can ask this in a better way.
Here are my two tables
Movies list table - Consists of movies information
ID | Movie Name | Genre
1 | Movie 1 | comedy
2 | movie 2 | action
3 | movie 3 | thriller
And I have ratings table
MOVIE_ID | USER ID | RATING on 5 | TIMESTAMP
1 | xyz | 5 | 12345612
1 | abc | 4 | 23232312
2 | zvc | 1 | 12321123
2 | zyx | 2 | 12312312
What I would like to do is get the output in the following way:
Movie ID | Movie Name | Genre | Rating Average
1 | Movie 1 | comedy | 4.5
2 | Movie 2 | action | 1.5
I am not a db expert but I understand this, when you group the data together you need to convert the multiple values to the scalar values or all the values, if string should be same right?
For example in my previous case, I was grouping them together as a string. So which is okay for list.id, list.name and list.genre, but the list.rating, well that is always going to give some problem here (I just learnt PIG along with hive, so grouping works differently there)
So to tackle the problem, I casted the rating and averaged it out and stored it in the float table. Have a look at my code below:
CREATE TABLE movies.movierating(id STRING, name STRING, genre STRING, rating FLOAT);
INSERT OVERWRITE TABLE movies.movierating
SELECT list.id, list.name, list.genre, AVG(cast(rating.rating as FLOAT)) from movies.list list LEFT JOIN movies.rating rating ON (list.id=rating.id) GROUP BY list.id, list.name,list.genre order by list.id DESC;
Thank you for your explanation. I might save the following question for the next thread but here is my observation:
The performance of the Overall job is reduced when performing Grouping and Joining together than to do it in two separate queries. For the same job, I had changed the code a bit to perform the grouping first and then joining the data and the over all time was reduced by 40 seconds. Earlier it was taking 140 seconds and now it is taking 100 seconds. Any reasons to that?
Once again thank you for your explanation.
I came across same issue:
org.apache.hadoop.hive.ql.parse.SemanticException: Invalid column reference "charge_province"
After I put the "charge_province" in the group by, the issue is gone. I don't know why.

find string from delimited values from delimited values (Oracle PL-SQL)

How can write a query for Oracle database such that I can find a comma delimited list of values from a column that contains comma delimited list of values. The :parameter passed to sql statement is also a comma delimited values that user selected.
For e.g
We have a column in tables that contains
1 | 'A','B','C'
2 | 'C','A'
3 | 'A','B'
on the web application interface we have multi select box that shows
A
B
C
and allows user to to select one or more items.
I want rows 1 and 2 to show up if they select A and B, If they select A only the all three should show up b/c all rows 1 to 3 have 'A' value in it.
This example will hopefully help and it matches the values irrespective of which order they appear in the string in the DB record.
Create example table:
CREATE TABLE t
(val VARCHAR2(100));
Insert records:
INSERT INTO t VALUES
('1|''A'',''B'',''C''');
INSERT INTO t VALUES
('2|''C'',''A''');
INSERT INTO t VALUES
('3|''A'',''B''');
Check values:
SELECT * FROM t;
1|'A','B','C'
2|'C','A'
3|'A','B'
Check solution for 'A':
SELECT val
FROM t
WHERE REGEXP_LIKE(val, '(A)');
1|'A','B','C'
2|'C','A'
3|'A','B'
Check solution for A and B
SELECT val
FROM t
WHERE REGEXP_LIKE(val, '(A|B).*(A|B)');
1|'A','B','C'
3|'A','B'
If you want to make sure the 1| part of the result isn't matched by anything then you could query using:
SELECT val
FROM t
WHERE REGEXP_LIKE(val, '(.\|.*)(A)');
and
SELECT val
FROM t
WHERE REGEXP_LIKE(val, '(.\|.*)(A|B).*(A|B)');
Hope this helps...
You could use a where clause with varying number of bind values, depending on the number of selected options:
TEST#PRJ> create table t (c varchar2(100));
TEST#PRJ> insert into t values ('2 | ''C'',''A''');
TEST#PRJ> insert into t values ('3 | ''A'',''B''');
TEST#PRJ> select * from t where c like '%''A''%' and c like '%''B''%';
C
----------------------------------------------------------------------------------------------------------------
1 | 'A','B','C'
3 | 'A','B'
TEST#PRJ> select * from t where c like '%''A''%';
C
----------------------------------------------------------------------------------------------------------------
1 | 'A','B','C'
2 | 'C','A'
3 | 'A','B'
If the values are stored in order you could use a single bind value:
TEST#PRJ> select * from t where c like '%''A''%''B''%';
C
-----------------------------------------------------------------------------------------
1 | 'A','B','C'
3 | 'A','B'

Select all rows from SQL based upon existence of multiple rows (sequence numbers)

Let's say I have table data similar to the following:
123456 John Doe 1 Green 2001
234567 Jane Doe 1 Yellow 2001
234567 Jane Doe 2 Red 2001
345678 Jim Doe 1 Red 2001
What I am attempting to do is only isolate the records for Jane Doe based upon the fact that she has more than one row in this table. (More that one sequence number)
I cannot isolate based upon ID, names, colors, years, etc...
The number 1 in the sequence tells me that is the first record and I need to be able to display that record, as well as the number 2 record -- The change record.
If the table is called users, and the fields called ID, fname, lname, seq_no, color, date. How would I write the code to select only records that have more than one row in this table? For Example:
I want the query to display this only based upon the existence of the multiple rows:
234567 Jane Doe 1 Yellow 2001
234567 Jane Doe 2 Red 2001
In PL/SQL
First, to find the IDs for records with multiple rows you would use:
SELECT ID FROM table GROUP BY ID HAVING COUNT(*) > 1
So you could get all the records for all those people with
SELECT * FROM table WHERE ID IN (SELECT ID FROM table GROUP BY ID HAVING COUNT(*) > 1)
If you know that the second sequence ID will always be "2" and that the "2" record will never be deleted, you might find something like:
SELECT * FROM table WHERE ID IN (SELECT ID FROM table WHERE SequenceID = 2)
to be faster, but you better be sure the requirements are guaranteed to be met in your database (and you would want a compound index on (SequenceID, ID)).
Try something like the following. It's a single tablescan, as opposed to 2 like the others.
SELECT * FROM (
SELECT t1.*, COUNT(name) OVER (PARTITION BY name) mycount FROM TABLE t1
)
WHERE mycount >1;
INNER JOIN
JOIN:
SELECT u1.ID, u1.fname, u1.lname, u1.seq_no, u1.color, u1.date
FROM users u1 JOIN users u2 ON (u1.ID = u2.ID and u2.seq_no = 2)
WHERE:
SELECT u1.ID, u1.fname, u1.lname, u1.seq_no, u1.color, u1.date
FROM users u1, thetable u2
WHERE
u1.ID = u2.ID AND
u2.seq_no = 2
Check out the HAVING clause for a summary query. You can specify stuff like
HAVING COUNT(*) >= 2
and so forth.

Resources