Impala - Getting error with Multiple count of distinct values - hadoop

I am using CDH 5.4.4 (Cloudera edition). I have a CSV file in an HDFS location, and my requirement is to run real-time SQL queries on a Hadoop environment (OLTP).
So I decided to go with Impala. I created a metastore table over the CSV file and am executing the query in the Impala editor (within the Hue application).
When I execute the query below, I get this error:
"AnalysisException: all DISTINCT aggregate functions need to have the
same set of parameters as count(DISTINCT City); deviating function:
count(DISTINCT Country)".
CSV File
OrderID,CustomerID,City,Country
Ord01,Cust01,Aachen,Germany
Ord02,Cust01,Albuquerque,USA
Ord03,Cust01,Aachen,Germany
Ord04,Cust02,Arhus,Denmark
Ord05,Cust02,Arhus,Denmark
Problematic Query
Select CustomerID,Count(Distinct City),Count(Distinct Country) From CustomerOrders Group by CustomerID
Problem:
I am unable to execute an Impala query with more than one COUNT(DISTINCT ...) in it. Searching the internet turned up the NDV() function as a workaround, but NDV() only returns an approximate count of distinct values. I need an exact unique count for more than one field.
Expectation:
What is the best way to get an exact unique count for more than one field? Kindly modify the above query to work with Impala.
Note: This is not my original table; I have replicated it for the forum question.

I've the same problem in Impala. Here is my workaround:
SELECT CustomerID
,sum(nr_of_cities)
,sum(nr_of_countries)
FROM (
SELECT CustomerID
,Count(DISTINCT City) AS nr_of_cities
,0 AS nr_of_countries
FROM CustomerOrders
GROUP BY CustomerID
UNION ALL
SELECT CustomerID
,0 AS nr_of_cities
,Count(DISTINCT Country) AS nr_of_countries
FROM CustomerOrders
GROUP BY CustomerID
) AS aa
GROUP BY CustomerID

I think this can be done cleaner (untested):
WITH
countries AS
(
SELECT CustomerID
,COUNT(DISTINCT Country) AS nr_of_countries
FROM CustomerOrders
GROUP BY 1
)
,
cities AS
(
SELECT CustomerID
,COUNT(DISTINCT City) AS nr_of_cities
FROM CustomerOrders
GROUP BY 1
)
SELECT CustomerID
,nr_of_cities
,nr_of_countries
FROM cities INNER JOIN countries USING (CustomerID)
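
To check that splitting the two DISTINCT aggregates into separate subqueries gives exact counts, here is a minimal sketch that runs the same idea on the sample rows from the question. It uses SQLite through Python's sqlite3 module purely as a stand-in for Impala (an assumption made for testability; SQLite itself accepts several COUNT(DISTINCT ...) in one query), and the join is written with ON rather than USING.

```python
import sqlite3

# SQLite stands in for Impala here; table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CustomerOrders (OrderID TEXT, CustomerID TEXT, City TEXT, Country TEXT)"
)
conn.executemany(
    "INSERT INTO CustomerOrders VALUES (?, ?, ?, ?)",
    [
        ("Ord01", "Cust01", "Aachen", "Germany"),
        ("Ord02", "Cust01", "Albuquerque", "USA"),
        ("Ord03", "Cust01", "Aachen", "Germany"),
        ("Ord04", "Cust02", "Arhus", "Denmark"),
        ("Ord05", "Cust02", "Arhus", "Denmark"),
    ],
)

# One DISTINCT aggregate per subquery, joined back on the grouping key.
query = """
WITH cities AS (
    SELECT CustomerID, COUNT(DISTINCT City) AS nr_of_cities
    FROM CustomerOrders
    GROUP BY CustomerID
),
countries AS (
    SELECT CustomerID, COUNT(DISTINCT Country) AS nr_of_countries
    FROM CustomerOrders
    GROUP BY CustomerID
)
SELECT cities.CustomerID, nr_of_cities, nr_of_countries
FROM cities
INNER JOIN countries ON cities.CustomerID = countries.CustomerID
ORDER BY cities.CustomerID
"""
distinct_counts = conn.execute(query).fetchall()
print(distinct_counts)  # [('Cust01', 2, 2), ('Cust02', 1, 1)]
```

One difference between the two workarounds: if CustomerID can be NULL, the inner join drops that group (NULL never equals NULL in a join condition), while the UNION ALL version followed by GROUP BY keeps it.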

Related

Oracle select rows from a query which are not exist in another query

Let me explain the question.
I have two tables, each of which has 3 columns with the same data types. The 3 columns form a key/ID if you like, but the column names differ between the tables.
Now I am creating queries with these 3 columns for both tables. I've managed to independently get these results
For example:
SELECT ID, FirstColumn, sum(SecondColumn)
FROM (SELECT ABC||DEF||GHI AS ID, FirstTable.*
FROM FirstTable
WHERE ThirdColumn = *1st condition*)
GROUP BY ID, FirstColumn
;
SELECT ID, SomeColumn, sum(AnotherColumn)
FROM (SELECT JKM||OPQ||RST AS ID, SecondTable.*
FROM SecondTable
WHERE AlsoSomeColumn = *2nd condition*)
GROUP BY ID, SomeColumn
;
So I run very similar queries against two different tables. I know the two results share a number of rows with the same ID attribute (the one I just created in the queries). I need to check which rows in one result are not in the other query's result, and vice versa.
Do I have to make temporary tables or views from the queries? Or maybe join the two tables in a specific way and run only one query on them?
As a beginner I don't have any experience using the results of one query as input for the next. I'm interested in the cleanest, most elegant way to do this.
No, you most probably don't need any "temporary" tables. A WITH factoring clause would help.
Here's an example:
with
first_query as
(select id, first_column, ...
from (select ABC||DEF||GHI as id, ...)
),
second_query as
(select id, some_column, ...
from (select JKM||OPQ||RST as id, ...)
)
select id from first_query
minus
select id from second_query;
For another result you'd just switch the tables, e.g.
with ... <the same as above>
select id from second_query
minus
select id from first_query
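
The two-directional difference can be tried out with a small sketch. It runs on SQLite through Python's sqlite3 module, where the set-difference operator is spelled EXCEPT instead of Oracle's MINUS. The column names come from the question; the rows are made up for illustration, and the WHERE conditions and aggregates are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: column names are taken from the question, the rows are invented.
conn.executescript("""
CREATE TABLE FirstTable (ABC TEXT, DEF TEXT, GHI TEXT);
CREATE TABLE SecondTable (JKM TEXT, OPQ TEXT, RST TEXT);
INSERT INTO FirstTable VALUES ('a', '1', 'x'), ('b', '2', 'y'), ('c', '3', 'z');
INSERT INTO SecondTable VALUES ('b', '2', 'y'), ('c', '3', 'z'), ('d', '4', 'w');
""")

# Both named subqueries are defined once and reused for either direction.
ctes = """
WITH first_query AS (SELECT ABC || DEF || GHI AS id FROM FirstTable),
     second_query AS (SELECT JKM || OPQ || RST AS id FROM SecondTable)
"""

# EXCEPT is SQLite's spelling of Oracle's MINUS.
only_in_first = conn.execute(
    ctes + "SELECT id FROM first_query EXCEPT SELECT id FROM second_query"
).fetchall()
only_in_second = conn.execute(
    ctes + "SELECT id FROM second_query EXCEPT SELECT id FROM first_query"
).fetchall()

print(only_in_first)   # [('a1x',)]
print(only_in_second)  # [('d4w',)]
```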

Joining 4 tables using nested queries Oracle

I am using nested queries to achieve this:
Basically, I have this:
employee table:
employee_id, locale
audience table
employee_id
country table
country_name,country_code
country_language
country_code, geo
I need this: employee_id,audience_id,country_name,locale from these tables that come under "APAC" geo:
I have this query:
SELECT employee_id
FROM audience
WHERE employee_id IN
(SELECT employee_id
FROM employee
WHERE LOCALE IN
(SELECT LOCALE
FROM COUNTRY_LANGUAGE
WHERE COUNTRY_CODE IN
(SELECT COUNTRY_CODE
FROM COUNTRY
WHERE GEO='apac')
)
)
ORDER BY employee_id);
This is throwing this error: "SQL command not properly ended"
Also, will this query produce the right results if it runs properly? If not, can you suggest something else?
I also tried it with joins, but it did not return anything:
select a.employee_id,
a.locale,
b.audience_id,
c.LOCALE_CODE,
d.COUNTRY_NAME
from employee a,
audience b,
country_language c,
country d
where
a.employee_id=b.employee_ID
and d.geo='apac'
and d.country_code=c.country_code
and a.locale=c.LOCALE_CODE;
You can try to use UNION SELECT
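
Two things stand out in the queries as posted. The nested query has four closing parentheses for three opening ones (the extra one sits on the ORDER BY line), which is enough to trigger "SQL command not properly ended". For the join version, one possible reason for the empty result is letter case: the question mentions the "APAC" geo while the filter compares against 'apac', and string comparison is case-sensitive. The sketch below illustrates only that second point. Everything in it is an assumption: the schema is guessed from the column names used in the join query, the rows are invented, and SQLite (via Python's sqlite3 module) stands in for Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema and made-up rows; the question does not show the real ones.
conn.executescript("""
CREATE TABLE employee (employee_id INTEGER, locale TEXT);
CREATE TABLE audience (audience_id INTEGER, employee_id INTEGER);
CREATE TABLE country (country_code TEXT, country_name TEXT, geo TEXT);
CREATE TABLE country_language (country_code TEXT, locale_code TEXT);
INSERT INTO employee VALUES (1, 'ja_JP'), (2, 'de_DE');
INSERT INTO audience VALUES (10, 1), (20, 2);
INSERT INTO country VALUES ('JP', 'Japan', 'APAC'), ('DE', 'Germany', 'EMEA');
INSERT INTO country_language VALUES ('JP', 'ja_JP'), ('DE', 'de_DE');
""")

join_sql = """
SELECT a.employee_id, b.audience_id, d.country_name, a.locale
FROM employee a
JOIN audience b ON a.employee_id = b.employee_id
JOIN country_language c ON a.locale = c.locale_code
JOIN country d ON d.country_code = c.country_code
WHERE {geo_filter}
ORDER BY a.employee_id
"""

# A case-sensitive comparison against 'apac' misses rows stored as 'APAC'.
exact_match = conn.execute(join_sql.format(geo_filter="d.geo = 'apac'")).fetchall()
# Normalising the case on the column side matches either spelling.
upper_match = conn.execute(join_sql.format(geo_filter="UPPER(d.geo) = 'APAC'")).fetchall()

print(exact_match)  # []
print(upper_match)  # [(1, 10, 'Japan', 'ja_JP')]
```

If the real data stores the geo in lower case, the cause lies elsewhere, for example in the join between locale and locale_code.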

How to use GROUP BY clause with COUNT(*)

I have two tables in an Oracle database: one is named departments_table and the other is locations_table. The departments_table has dep_id, dep_name, location_id, staff_id, employer_id. The locations_table consists of location_id, city_id, streetname_id and postcode_id. How do I calculate the number of departments that each location has?
The code below is what I have tried, without success. The error message below it is what appears once the code has been submitted.
SELECT dep_name, location_id,
COUNT(*)
FROM departments_table
WHERE location_id => 1
GROUP BY dep_name;
The result is an error, "not a single group function".
If you want to count how many departments are in each location, then you must group by location, not by department name, right? Let's start with that.
Then, you don't need ANYTHING about the individual departments in the output of the query, do you? You just need the location id and the count of departments.
select location_id, count(*) as cnt
from departments_table
group by location_id
;
This does most of the work. You may want to add the location details (city, address, etc.), which are stored elsewhere, in the locations_table. So you will need a join. There may also be locations in that table that are not, in fact, the location of any department (their id doesn't appear in the departments_table at all). To keep those in the output, you need an OUTER join. For those locations you probably want to show a count of 0 (rather than null), which you can "fix" with the nvl() function. So you will end up with something like
select l.*, nvl(g.cnt, 0) as department_count
from locations_table l
left outer join
( select location_id, count(*) as cnt
from departments_table
group by location_id
) g
on l.location_id = g.location_id
;
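
The effect of the outer join on a location without departments can be seen in a small sketch. It runs on SQLite through Python's sqlite3 module instead of Oracle, so COALESCE (the standard SQL counterpart of NVL) is used. The tables are trimmed-down, made-up versions of the ones in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed-down, invented versions of the two tables from the question.
conn.executescript("""
CREATE TABLE locations_table (location_id INTEGER, city_id TEXT);
CREATE TABLE departments_table (dep_id INTEGER, dep_name TEXT, location_id INTEGER);
INSERT INTO locations_table VALUES (1, 'A'), (2, 'B'), (3, 'C');
INSERT INTO departments_table VALUES (10, 'Sales', 1), (11, 'IT', 1), (12, 'HR', 2);
""")

# COALESCE plays the role of Oracle's NVL: it turns the NULL count of an
# unmatched location into 0.
department_counts = conn.execute("""
SELECT l.location_id, COALESCE(g.cnt, 0) AS department_count
FROM locations_table l
LEFT OUTER JOIN (
    SELECT location_id, COUNT(*) AS cnt
    FROM departments_table
    GROUP BY location_id
) g ON l.location_id = g.location_id
ORDER BY l.location_id
""").fetchall()

print(department_counts)  # [(1, 2), (2, 1), (3, 0)]
```

Location 3 has no departments and still appears, with a count of 0. An inner join would leave it out.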
SELECT l.location_id, l.city, COUNT(d.DEPARTMENT_ID)
FROM OEHR_LOCATIONS l, OEHR_DEPARTMENTS d WHERE l.location_id = d.location_id
GROUP BY l.location_id, l.city ORDER BY l.city;
This method works. I created aliases and made minor changes. OEHR is just a prefix on my table names, so ignore that.

Unable to get a expected output using hive aggregate function

I have created a table (movies) in Hive as below (id, name, year, rating, views):
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu: Original Version,1929,3.5,5651
10,Nick of Time,1995,3.4,5333
I want to write a hive query to get the name of the movie with highest views.
select name,max(views) from movies;
but it gives me an error
FAILED: Error in semantic analysis: Line 1:7 Expression not in GROUP BY key name
but doing a group by with name gives me the complete list (which is expected).
What changes should I make to my query?
It is very possible that there is a simpler way to do this.
select name
from(
select max(views) as views
, name
, row_number() over (order by max(views) desc) as row_num
from movies
group by name
) m
where row_num = 1
After a little bit of digging, I found that the answer is not as straightforward as it is in standard SQL. The query below gives the expected result.
select a.name,a.views from movies a left semi join(select max(views) views from movies)b on (a.views=b.views);
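
As a quick check of the "join against the maximum" idea on the rows from the question, here is a minimal sketch. It uses SQLite through Python's sqlite3 module as a stand-in for Hive, and because SQLite has no LEFT SEMI JOIN, a plain inner join on the one-row MAX subquery is swapped in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movies (id INTEGER, name TEXT, year INTEGER, rating REAL, views INTEGER)"
)
# Rows copied from the question.
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?, ?, ?)",
    [
        (1, "The Nightmare Before Christmas", 1993, 3.9, 4568),
        (2, "The Mummy", 1932, 3.5, 4388),
        (3, "Orphans of the Storm", 1921, 3.2, 9062),
        (4, "The Object of Beauty", 1991, 2.8, 6150),
        (5, "Night Tide", 1963, 2.8, 5126),
        (6, "One Magic Christmas", 1985, 3.8, 5333),
        (7, "Muriel's Wedding", 1994, 3.5, 6323),
        (8, "Mother's Boys", 1994, 3.4, 5733),
        (9, "Nosferatu: Original Version", 1929, 3.5, 5651),
        (10, "Nick of Time", 1995, 3.4, 5333),
    ],
)

# The subquery yields a single row holding the maximum; joining on it keeps
# only the movie(s) whose views equal that maximum.
most_viewed = conn.execute("""
SELECT a.name, a.views
FROM movies a
JOIN (SELECT MAX(views) AS views FROM movies) b ON a.views = b.views
""").fetchall()

print(most_viewed)  # [('Orphans of the Storm', 9062)]
```

If several movies tie for the highest view count, this form returns all of them, whereas a row_number() = 1 filter returns just one.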

Why is "group by" giving only one column as output?

I have a table something like this:
ID|Value
01|1
02|4
03|12
01|5
02|14
03|22
01|9
02|32
02|62
01|13
03|92
I want to know how much progress each id has made (from its initial or minimal value),
so in Sybase I can type:
select ID, (value-min(value)) from table group by id;
ID|Value
01|0
01|4
01|8
01|12
02|0
02|10
02|28
02|58
03|0
03|10
03|80
But MonetDB does not support this (I am not sure, maybe because it uses SQL'99).
GROUP BY only gives one row per id, or maybe the average of the other values, but not the desired result.
Is there any alternative to GROUP BY in MonetDB?
You can achieve this with a self join. The idea is that you build a subselect that gives you the minimum value for each id, and then join that to the original table by id.
SELECT a.id, a.value-b.min_value
FROM "table" a INNER JOIN
(SELECT id, MIN(value) AS min_value FROM "table" GROUP BY id) AS b
ON a.id = b.id;
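
To confirm that the self join reproduces the output shown in the question, here is a minimal sketch on the same sample rows. It runs on SQLite through Python's sqlite3 module as a stand-in for MonetDB; the ids are stored as text to keep the leading zeros, and an ORDER BY is added so the result order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "table" is a reserved word, so it is quoted just like in the answer.
conn.execute('CREATE TABLE "table" (id TEXT, value INTEGER)')
conn.executemany(
    'INSERT INTO "table" VALUES (?, ?)',
    [
        ("01", 1), ("02", 4), ("03", 12), ("01", 5), ("02", 14), ("03", 22),
        ("01", 9), ("02", 32), ("02", 62), ("01", 13), ("03", 92),
    ],
)

# The subselect computes the minimum per id; joining it back keeps every
# original row and subtracts that row's group minimum.
progress = conn.execute("""
SELECT a.id, a.value - b.min_value AS progress
FROM "table" a
INNER JOIN (SELECT id, MIN(value) AS min_value FROM "table" GROUP BY id) AS b
    ON a.id = b.id
ORDER BY a.id, progress
""").fetchall()

print(progress)
```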
