I have a Hive table with the following fields:
id STRING , x STRING
where x can have values such as 'c'.
I need a query that display number of rows where column x contains a value 'c' and the number of rows where x has values are other than 'c'.
id | count(x='c') | count(x<>'c')
---|--------------|--------------
1 | 3 | 7
I don't know if it's possible.
You can try :
SELECT sum(if(x='c',1,0)), sum(if(x!='c',1,0)) FROM table_name;
This will print two columns. I didn't understand the id field in your sample output.
Related
Can you please advise me for the below?
I'm extracting XML From CLOB data table But returning non distinct multiple values in a cell. I required in a column or the first node value or the last node value.
Select d.b_id
xmltype('<rds><startRD>'||u.con_data||'</startRD></rds>').extract('//rds/startRD/text()').getStringVal() start_RD
From sp_d_a d
Left Outer Join c1_u u
On d.b_id = u.b_id
Where d.b_id In ('4017')
returning as below:
353.000000524.000000527.306000722.430000
I require as below like in a column:
b_id | startRD
==================
12 | 353.000000
12 | 524.000000
12 | 527.306000
12 | 722.430000
I have a table in Oracle where there are two columns. In the first column, sometimes there are duplicate values that corresspond to a different value in the second column. How can I write a query that shows only unique values of the first column and all possible values from the second column?
The table looks somewhat like below
COLUMN_1 | COLUMN_2
NUMBER_1 | 4
NUMBER_2 | 4
NUMBER_3 | 1
NUMBER_3 | 6
NUMBER_4 | 3
NUMBER_4 | 4
NUMBER_4 | 5
NUMBER_4 | 6
You can use listagg() if you are using Oracle 11G or higher like
SELECT
COLUMN_1,
LISTAGG(COLUMN_2, '|') WITHIN GROUP (ORDER BY COLUMN_2) "ListValues"
FROM table1
GROUP BY COLUMN_1
Else, see this link for an alternative for lower versions
Oracle equivalent of MySQL group_concat
I am using the following hive query script for the version 0.13.0
DROP TABLE IF EXISTS movies.movierating;
DROP TABLE IF EXISTS movies.list;
DROP TABLE IF EXISTS movies.rating;
DROP DATABASE IF EXISTS movies;
ADD JAR /usr/local/hadoop/hive/hive/lib/RegexLoader.jar;
CREATE DATABASE IF NOT EXISTS movies;
CREATE EXTERNAL TABLE IF NOT EXISTS movies.list (id STRING, name STRING, genre STRING)
ROW FORMAT SERDE 'com.cisco.hadoop.loaders.RegexSerDe'with SERDEPROPERTIES(
"input.regex"="^(.*)\\:\\:(.*)\\:\\:(.*)$",
"output.format.string"="%1$s %2$s %3$s");
CREATE EXTERNAL TABLE IF NOT EXISTS movies.rating (id STRING, userid STRING, rating STRING, timestamp STRING)
ROW FORMAT SERDE 'com.cisco.hadoop.loaders.RegexSerDe'
with SERDEPROPERTIES(
"input.regex"="^(.*)\\:\\:(.*)\\:\\:(.*)\\:\\:(.*)$",
"output.format.string"="%1$s %2$s %3$s %4$s");
LOAD DATA LOCAL INPATH 'ml-10M100K/movies.dat' into TABLE movies.list;
LOAD DATA LOCAL INPATH 'ml-10M100K/ratings.dat' into TABLE movies.rating;
CREATE TABLE movies.movierating(id STRING, name STRING, genre STRING, rating STRING);
INSERT OVERWRITE TABLE movies.movierating
SELECT list.id, list.name, list.genre, rating.rating from movies.list list LEFT JOIN movies.rating rating ON (list.id=rating.id) GROUP BY list.id;
The issue is when I execute the script without the "GROUP BY" clause it works fine.
But when I execute it with the "GROUP BY" clause, I get the following error
FAILED: SemanticException [Error 10002]: Line 4:21 Invalid column reference 'name'
Any ideas what is happening here?
Appreciate your help
Thanks!
If you group by a column, your select statement can only select a) that column, b) columns derived only from that column, or c) a UDAF applied to other columns.
In this case, you're only grouping by list.id, so when you try to select list.name, that's invalid. Think about it this way: what if your list table contained the following two entries:
id|name |genre
--+-----+------
01|name1|comedy
01|name2|horror
What would you expect this query to return:
select list.id, list.name, list.genre from list group by list.id;
In this case it's nonsensical. I'm guessing that id in reality is a primary key, but note that hive does not know this, so the above data set is perfectly valid.
With all that in mind, it's not clear to me how to fix it because I don't know the desired output. For example, let's say without the group by (just the join), you have as output:
id|name |genre |rating
--+-----+------+-------
01|name1|comedy|'pretty good'
01|name1|comedy|'bad'
02|name2|horror|'9/10'
03|name3|action|NULL
What would you want the output to be with the group by? What are you trying to accomplish by doing the group by?
OK let me see if I can ask this in a better way.
Here are my two tables
Movies list table - Consists of movies information
ID | Movie Name | Genre
1 | Movie 1 | comedy
2 | movie 2 | action
3 | movie 3 | thriller
And I have ratings table
MOVIE_ID | USER ID | RATING on 5 | TIMESTAMP
1 | xyz | 5 | 12345612
1 | abc | 4 | 23232312
2 | zvc | 1 | 12321123
2 | zyx | 2 | 12312312
What I would like to do is get the output in the following way:
Movie ID | Movie Name | Genre | Rating Average
1 | Movie 1 | comedy | 4.5
2 | Movie 2 | action | 1.5
I am not a db expert but I understand this, when you group the data together you need to convert the multiple values to the scalar values or all the values, if string should be same right?
For example in my previous case, I was grouping them together as a string. So which is okay for list.id, list.name and list.genre, but the list.rating, well that is always going to give some problem here (I just learnt PIG along with hive, so grouping works differently there)
So to tackle the problem, I casted the rating and averaged it out and stored it in the float table. Have a look at my code below:
CREATE TABLE movies.movierating(id STRING, name STRING, genre STRING, rating FLOAT);
INSERT OVERWRITE TABLE movies.movierating
SELECT list.id, list.name, list.genre, AVG(cast(rating.rating as FLOAT)) from movies.list list LEFT JOIN movies.rating rating ON (list.id=rating.id) GROUP BY list.id, list.name,list.genre order by list.id DESC;
Thank you for your explanation. I might save the following question for the next thread but here is my observation:
The performance of the Overall job is reduced when performing Grouping and Joining together than to do it in two separate queries. For the same job, I had changed the code a bit to perform the grouping first and then joining the data and the over all time was reduced by 40 seconds. Earlier it was taking 140 seconds and now it is taking 100 seconds. Any reasons to that?
Once again thank you for your explanation.
I came across same issue:
org.apache.hadoop.hive.ql.parse.SemanticException: Invalid column reference "charge_province"
After I put the "charge_province" in the group by, the issue is gone. I don't know why.
I have a situation where there are three tables with same table design (same amount of columns, same PK, same FK) but store different value on one date, only one data will be stored per day in each table. Each table has a flag column (Y or N) to determine which should be the value to be reported on the specific day. Only one table's value will be flagged as Y for one CODE in one day.
Example:
Table 1:
Date | Code | Value | Flag
01-DEC, ABC, 111, N
Table 2:
Date | Code | Value | Flag
01-DEC, ABC, 222, N
Table 3:
Date | Code | Value | Flag
01-DEC, ABC, 333, Y
Refer to the example above, the value of CODE ABC for date: 01-DEC should be 333 as it's flagged as Y.
How should the SQL look like?
First off, I would strongly question the data model that has three identical tables that seem to store essentially the same information and serve the same purpose.
You could, assuming that there will be only one Y row, do
SELECT *
FROM( SELECT *
FROM table1
UNION ALL
SELECT *
FROM table2
UNION ALL
SELECT *
FROM table3 )
WHERE flag = 'Y'
How can write a query for Oracle database such that I can find a comma delimited list of values from a column that contains comma delimited list of values. The :parameter passed to sql statement is also a comma delimited values that user selected.
For e.g
We have a column in tables that contains
1 | 'A','B','C'
2 | 'C','A'
3 | 'A','B'
on the web application interface we have multi select box that shows
A
B
C
and allows user to to select one or more items.
I want rows 1 and 2 to show up if they select A and B, If they select A only the all three should show up b/c all rows 1 to 3 have 'A' value in it.
This example will hopefully help and it matches the values irrespective of which order they appear in the string in the DB record.
Create example table:
CREATE TABLE t
(val VARCHAR2(100));
Insert records:
INSERT INTO t VALUES
('1|''A'',''B'',''C''');
INSERT INTO t VALUES
('2|''C'',''A''');
INSERT INTO t VALUES
('3|''A'',''B''');
Check values:
SELECT * FROM t;
1|'A','B','C'
2|'C','A'
3|'A','B'
Check solution for 'A':
SELECT val
FROM t
WHERE REGEXP_LIKE(val, '(A)');
1|'A','B','C'
2|'C','A'
3|'A','B'
Check solution for A and B
SELECT val
FROM t
WHERE REGEXP_LIKE(val, '(A|B).*(A|B)');
1|'A','B','C'
3|'A','B'
If you want to make sure the 1| part of the result isn't matched by anything then you could query using:
SELECT val
FROM t
WHERE REGEXP_LIKE(val, '(.\|.*)(A)');
and
SELECT val
FROM t
WHERE REGEXP_LIKE(val, '(.\|.*)(A|B).*(A|B)');
Hope this helps...
You could use a where clause with varying number of bind values, depending on the number of selected options:
TEST#PRJ> create table t (c varchar2(100));
TEST#PRJ> insert into t values ('2 | ''C'',''A''');
TEST#PRJ> insert into t values ('3 | ''A'',''B''');
TEST#PRJ> select * from t where c like '%''A''%' and c like '%''B''%';
C
----------------------------------------------------------------------------------------------------------------
1 | 'A','B','C'
3 | 'A','B'
TEST#PRJ> select * from t where c like '%''A''%';
C
----------------------------------------------------------------------------------------------------------------
1 | 'A','B','C'
2 | 'C','A'
3 | 'A','B'
If the values are stored in order you could use a single bind value:
TEST#PRJ> select * from t where c like '%''A''%''B''%';
C
-----------------------------------------------------------------------------------------
1 | 'A','B','C'
3 | 'A','B'