Filter out duplicate rows based on a subset of columns - hadoop

I have some data that looks like this:
ID,DateTime,Category,SubCategory
X01,2014-02-13T12:36:14,Clothes,Tshirts
X01,2014-02-13T12:37:16,Clothes,Tshirts
X01,2014-02-13T12:38:33,Shoes,Running
X02,2014-02-13T12:39:23,Shoes,Running
X02,2014-02-13T12:40:42,Books,Fiction
X02,2014-02-13T12:41:04,Books,Fiction
What I would like to do is keep only one instance of each data point, like this (I don't care which instance in time):
ID,DateTime,Category,SubCategory
X01,2014-02-13T12:36:14,Clothes,Tshirts
X02,2014-02-13T12:39:23,Shoes,Running
X02,2014-02-13T12:40:42,Books,Fiction
Unfortunately, according to the Hive Language Manual, Hive's DISTINCT expression works on entire tables so doing something like this is not an option:
SELECT DISTINCT(ID, SubCategory),
DateTime,
Category
FROM sometable
How do I go about getting the second table above? Thanks in advance!

The usual approach for this kind of thing in SQL is a group by:
select ID, category, subcategory, min(datetime) datetime
from sometable
group by ID, category, subcategory
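As a minimal sketch of the GROUP BY approach above, the following Python script loads the sample rows into an in-memory SQLite database and runs the same kind of query. SQLite is only a stand-in assumed here for illustration (the question targets Hive), and the alias first_seen is made up. Note that for the sample data the query returns four rows rather than three, because X01 also has a Shoes/Running row.

```python
import sqlite3

# In-memory SQLite stands in for Hive here (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sometable (ID TEXT, DateTime TEXT, Category TEXT, SubCategory TEXT)"
)
conn.executemany(
    "INSERT INTO sometable VALUES (?, ?, ?, ?)",
    [
        ("X01", "2014-02-13T12:36:14", "Clothes", "Tshirts"),
        ("X01", "2014-02-13T12:37:16", "Clothes", "Tshirts"),
        ("X01", "2014-02-13T12:38:33", "Shoes", "Running"),
        ("X02", "2014-02-13T12:39:23", "Shoes", "Running"),
        ("X02", "2014-02-13T12:40:42", "Books", "Fiction"),
        ("X02", "2014-02-13T12:41:04", "Books", "Fiction"),
    ],
)

# One row per (ID, Category, SubCategory), keeping the earliest timestamp.
# ISO-8601 strings sort chronologically, so MIN works on the text column.
deduped = conn.execute(
    """
    SELECT ID, MIN(DateTime) AS first_seen, Category, SubCategory
    FROM sometable
    GROUP BY ID, Category, SubCategory
    ORDER BY first_seen
    """
).fetchall()

for row in deduped:
    print(row)
```

Using MAX instead of MIN would keep the latest timestamp per group; either satisfies "I don't care which instance in time".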

Related

How to find the sum of each group which is grouped by another group?

I am trying to build a matrix table using an Oracle Analytics tool and PL/SQL.
Let's say I have a query whose select list contains Employee, Description, orderid and amount, and which is grouped by Employee and Description. Orderid and amount belong to the same group. From this query I want to extract the sum of amount for each Description across all employees. Do you have any idea how I can do this?
Thank you.
Edit:
Let's say we have the following query:
Select Employee, Description, orderid ,amount
From Employees
Group by Employee,Description
I want to extract the sum of amount for each Description group, but across all Employees. One way to do this would be:
Select Description,sum(amount)
From Employees
Group by Description
But the actual query is much more complex, and if I write a separate query for the sum of each description, I have to link it to the first query somehow to show the results in the report.
Do you have any idea how to do this in Oracle Analytics Publisher?
Thank you.
Select coalesce(Employee,'ALL_EMPLOYEES'), coalesce(Description,'ALL_DESCRIPTION'), sum(amount)
From Employees
Group by ROLLUP (Employee,Description)
or, to also get a subtotal per Description across all employees:
Select coalesce(Employee,'ALL_EMPLOYEES'), coalesce(Description,'ALL_DESCRIPTION'), sum(amount)
From Employees
Group by CUBE (Employee,Description)
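A different technique from ROLLUP/CUBE, sketched here only as an illustration, is to join the grouped query to a per-Description subquery so that each detail row carries the total for its Description. The Python script below uses an in-memory SQLite database as a stand-in for Oracle; the sample rows and the column aliases employee_total and description_total are made up.

```python
import sqlite3

# SQLite stands in for Oracle; the rows below are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (Employee TEXT, Description TEXT, orderid INTEGER, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?)",
    [
        ("Ann", "Repair", 1, 100),
        ("Ann", "Repair", 2, 50),
        ("Ann", "Install", 3, 200),
        ("Bob", "Repair", 4, 70),
        ("Bob", "Install", 5, 30),
    ],
)

# Per-employee totals, with the total of the same Description over all
# employees attached to every row through a join on Description.
totals = conn.execute(
    """
    SELECT e.Employee, e.Description,
           SUM(e.amount) AS employee_total,
           t.description_total
    FROM Employees e
    JOIN (SELECT Description, SUM(amount) AS description_total
          FROM Employees
          GROUP BY Description) t
      ON t.Description = e.Description
    GROUP BY e.Employee, e.Description, t.description_total
    ORDER BY e.Employee, e.Description
    """
).fetchall()

for row in totals:
    print(row)
```

Because the total arrives as an extra column of the same result set, the report needs only one query. In Oracle, an analytic SUM(...) OVER (PARTITION BY Description) is another way to attach such a total.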

How to Create a VIEW in oracle

So I'm supposed to create a view product_view that presents the information about how many products of a particular type are in each warehouse: product ID, product name, category_id, warehouse id, total quantity on hand for this warehouse.
So I used this query and tried to change it so many times but I keep getting errors
CREATE OR REPLACE VIEW PRODUCT_VIEW AS
SELECT p.product_id, p.product_name,
COUNT(p.product_id), SUM(i.quantity_on_hand)
FROM oe.product_information p JOIN oe.inventories i
ON p.product_id=i.product_id
ORDER BY i.warehouse_id;
ERROR at line 2:
ORA-00928: missing SELECT keyword
Please help... Thanks
Image showing the Tables in the OE schema
Image showing the error that occurs
When I get errors creating a view, I first drop the CREATE ... AS line and fix the query until it works on its own. Then you need to name all the columns: COUNT(p.product_id) won't work by itself, so write something like COUNT(p.product_id) AS product_count, or specify a list of column aliases after the view name.
I'm not sure what the output of your query should look like. You'll get better answers quicker on Stack Exchange if you post a minimal example including the CREATE statements, some input data and your desired output, leaving out columns that are not essential.
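As a rough sketch of that advice, the Python script below builds a view in an in-memory SQLite database (a stand-in for Oracle). The two tables are cut down to a few columns, the rows are invented, and the alias total_quantity is an assumption about the desired output. Every non-aggregated column appears in GROUP BY, the aggregate has an alias, and the ORDER BY is applied when querying the view rather than inside it.

```python
import sqlite3

# SQLite stands in for Oracle; tables are simplified and the data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product_information (
        product_id INTEGER, product_name TEXT, category_id INTEGER);
    CREATE TABLE inventories (
        product_id INTEGER, warehouse_id INTEGER, quantity_on_hand INTEGER);

    INSERT INTO product_information VALUES (1, 'Monitor', 11), (2, 'Keyboard', 12);
    INSERT INTO inventories VALUES (1, 100, 5), (1, 100, 7), (1, 200, 3), (2, 100, 4);

    -- Get the SELECT working first, then wrap it in CREATE VIEW ... AS.
    CREATE VIEW product_view AS
    SELECT p.product_id, p.product_name, p.category_id, i.warehouse_id,
           SUM(i.quantity_on_hand) AS total_quantity
    FROM product_information p
    JOIN inventories i ON p.product_id = i.product_id
    GROUP BY p.product_id, p.product_name, p.category_id, i.warehouse_id;
    """
)

view_rows = conn.execute(
    "SELECT * FROM product_view ORDER BY product_id, warehouse_id"
).fetchall()

for row in view_rows:
    print(row)
```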

SSRS: how to use a parameter directly in the query?

Here's my query. In the application, the query is a little more complex but the focus is more on how to use the parameter
SELECT EmpName, Department, Salary
From tblEmployees
WHERE Salary >= #baseSalary
If a user wants to select employees whose salaries start from a certain level, they can do so.
I've found some videos on Pluralsight on how to filter the result, but none on how to use the parameter directly in the query.
Thanks for helping
I create parameters in SSRS and then map them to the query on the Parameter tab of the Dataset Properties. If you use the same name for the report parameter as for the parameter in the query, they will map automatically.
Here's an example of how I use them together:
Parameters
Dataset query
Parameter Tab of Dataset Properties
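The mapping itself is configured in the SSRS designer, but the underlying idea (a named placeholder in the query text that is bound to a value supplied at run time) can be sketched outside SSRS. The Python snippet below is only an analogy using SQLite's named parameters: the rows are made up, and the :baseSalary placeholder syntax is SQLite's, whereas a dataset query against SQL Server would typically write the parameter as @baseSalary.

```python
import sqlite3

# Analogy only: SQLite named parameters, not SSRS itself. Sample data is invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblEmployees (EmpName TEXT, Department TEXT, Salary INTEGER)")
conn.executemany(
    "INSERT INTO tblEmployees VALUES (?, ?, ?)",
    [("Ann", "IT", 60000), ("Bob", "HR", 40000), ("Cid", "IT", 50000)],
)

query = """
SELECT EmpName, Department, Salary
FROM tblEmployees
WHERE Salary >= :baseSalary
ORDER BY Salary
"""

# The dictionary key matches the placeholder name, much like a report
# parameter mapped to a query parameter of the same name.
selected = conn.execute(query, {"baseSalary": 50000}).fetchall()

for row in selected:
    print(row)
```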

ORA-00979 not a Group By function error

I am trying to select two values, emp_name and emp_location, from a table Employee, grouping by emp_location. I am aware that the columns in the select clause need to be in the group by clause, but I would like to know whether there is any other way to get these values in a single query.
My intention is to select only one employee per location based on age.
sample query
select emp_name,emp_location
from Employee
where emp_age=25
group by emp_location
Please help in this regard.
Thanks a lot to everyone who responded to this question. I will try to learn these window functions, as they are very handy.
The reason why this works in MySQL and not in Oracle is that in Oracle, as in most other databases, every field (or expression) in the select list must either be specified in the group by clause or be an aggregation that combines all values in the group into a single one. For instance, this would work:
select max(emp_name),emp_location
from Employee
where emp_age=25
group by emp_location
However, it may not be the best solution. It will work if you want just the name, but you'll get into trouble when you want multiple fields for an employee. In that case max won't do the trick. In the query below, you might get a first name that doesn't match the last name.
select max(emp_firstname), max(emp_lastname), emp_location
from Employee
where emp_age=25
group by emp_location
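To make the mismatch concrete, here is a small sketch in Python with an in-memory SQLite database (a stand-in for Oracle; the two employees are invented). Each max is computed independently, so the result combines the first name of one person with the last name of another.

```python
import sqlite3

# SQLite stands in for Oracle; the two employees are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (emp_firstname TEXT, emp_lastname TEXT, "
    "emp_location TEXT, emp_age INTEGER)"
)
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?)",
    [("Alice", "Young", "Paris", 25), ("Zoe", "Brown", "Paris", 25)],
)

# Each MAX picks its value independently of the other column.
mixed = conn.execute(
    """
    SELECT MAX(emp_firstname), MAX(emp_lastname), emp_location
    FROM Employee
    WHERE emp_age = 25
    GROUP BY emp_location
    """
).fetchall()

print(mixed)  # "Zoe" comes from one row, "Young" from the other
```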
One solution for this is to use a window function (analytic function). With those, you can generate a value for each record, without immediately reducing the number of records. For instance, with a windowed max function, you could select the max age for people named John, and display that value next to every John in the result, even if they don't have that age.
Some functions, like rank, dense_rank and row_number can be used to generate a number for each employee, which you can then use to filter by. In the example below, I created such a counter per location (partition by), and ordered by, in this case name and id. You can specify other fields as well, for instance if you want one name per age per location, you specify both age and location in partition by. If you want the oldest employee of each location, you can remove where emp_age=25 and order by emp_age desc instead.
select
*
from
(select
emp_name, emp_location,
dense_rank() over (partition by emp_location order by emp_name, emp_id) as emp_rank
from Employee
where emp_age=25)
where
emp_rank = 1
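The query above can be tried out with a short Python script. This is a sketch under some assumptions: SQLite stands in for Oracle and must be version 3.25 or later for window function support, and the sample employees are invented. Two employees named Alice in Paris show why emp_id is included in the ordering as a tie-breaker.

```python
import sqlite3

# SQLite (3.25+ for window functions) stands in for Oracle; data is made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (emp_id INTEGER, emp_name TEXT, "
    "emp_location TEXT, emp_age INTEGER)"
)
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?)",
    [
        (1, "Zoe", "Paris", 25),
        (2, "Alice", "Paris", 25),
        (3, "Bob", "Rome", 25),
        (4, "Carl", "Rome", 30),   # filtered out by emp_age = 25
        (5, "Alice", "Paris", 25), # same name; emp_id breaks the tie
    ],
)

# Rank employees within each location, then keep only the first one.
one_per_location = conn.execute(
    """
    SELECT emp_id, emp_name, emp_location
    FROM (SELECT emp_id, emp_name, emp_location,
                 DENSE_RANK() OVER (PARTITION BY emp_location
                                    ORDER BY emp_name, emp_id) AS emp_rank
          FROM Employee
          WHERE emp_age = 25)
    WHERE emp_rank = 1
    ORDER BY emp_location
    """
).fetchall()

for row in one_per_location:
    print(row)
```

All columns of the chosen row stay together, which is what the max-based query cannot guarantee.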
ORA-00979 not a Group By function error
Only aggregate functions and columns specified in the GROUP BY clause are allowed in the SELECT clause.
In that regard, Oracle follows the SQL standard closely. But, as you noticed in your comment, some other RDBMS are less strict than Oracle regarding that point. For example, to quote MySQL's documentation (emphasis mine):
MySQL extends the use of GROUP BY so that the select list can refer to nonaggregated columns not named in the GROUP BY clause. [...]
However, this is useful primarily when all values in each nonaggregated column not named in the GROUP BY are the same for each group. The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate.
So, in the recommended use case, adding the extra columns to the GROUP BY clause will lead to the same result.
select emp_name,emp_location
-- ^^^^^^^^
-- this is *not* part of the `GROUP BY` clause
from Employee
where emp_age=25
group by emp_location
Maybe you are looking for:
...
group by emp_location, emp_name
select emp_name,emp_location
from Employee
where emp_age=25
group by emp_name,emp_location
or
select max(emp_name) emp_name,emp_location
from Employee
where emp_age=25
group by emp_location

Group by specified column in PostgreSQL

Maybe, this question is a little stupid, but I'm confused.
How to group records by specified column ? :)
Item.group(:category_id)
doesn't work...
It says:
ActiveRecord::StatementInvalid: PGError: ERROR: column "items.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT "items".* FROM "items" GROUP BY category_id
What kind of aggregate function should I use?
Could you please provide a simple example?
You will have to define how to combine values that share the same category_id. Concatenate them? Calculate a sum?
To create comma-separated lists of values your statement could look like this:
SELECT category_id
,string_agg(col1, ', ') AS col1_list
,string_agg(col2, ', ') AS col2_list
FROM items
GROUP BY category_id
You need Postgres 9.0 or later for string_agg(col1, ', ').
In older versions you can substitute with array_to_string(array_agg(col1), ', '). More aggregate functions here.
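The comma-separated-list idea can be sketched without a PostgreSQL server. The Python script below uses an in-memory SQLite database and its group_concat aggregate as a stand-in for string_agg (an assumption made so the example is self-contained); the items rows are invented and only col1 is shown.

```python
import sqlite3

# SQLite's group_concat stands in for PostgreSQL's string_agg here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (category_id INTEGER, col1 TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(1, "a"), (1, "b"), (2, "c")],
)

# One row per category_id, with that category's col1 values joined by ', '.
lists = conn.execute(
    """
    SELECT category_id, group_concat(col1, ', ') AS col1_list
    FROM items
    GROUP BY category_id
    ORDER BY category_id
    """
).fetchall()

for category_id, col1_list in lists:
    print(category_id, col1_list)
```

The order of values inside each list is not guaranteed unless the aggregate is given an explicit ordering.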
Aggregating values in PostgreSQL is clearly superior to aggregating them in the client. Postgres is very fast at this and it reduces (network) traffic.
You can use sum, avg, count or any other aggregate function. You can find more on this topic here.
But it seems that you don't really need to use SQL grouping.
Try fetching all records and then using Enumerable#group_by to group Items by category_id
Grouping in SQL means that the server groups one or more records from the database table into one resulting row. So, if you for example group by category_id, you might have several records matching the given category, so you can't expect the database to return all columns from the table (that's what SELECT * actually does).
Instead, when you use GROUP BY, you can SELECT only:
columns you have grouped by, and/or
aggregate functions which are performed on all the records belonging to a resulting group
Depending on what you exactly need, modify your .select accordingly.
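As a small illustration of the rule above, the following Python sketch selects only the grouping column plus aggregate functions. It runs against an in-memory SQLite database rather than PostgreSQL, and the price column and sample rows are hypothetical.

```python
import sqlite3

# SQLite stands in for PostgreSQL; the price column and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, category_id INTEGER, price INTEGER)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(1, 10, 5), (2, 10, 7), (3, 20, 1)],
)

# The select list holds the grouped column and aggregates only.
summary = conn.execute(
    """
    SELECT category_id, COUNT(*) AS item_count, SUM(price) AS total_price
    FROM items
    GROUP BY category_id
    ORDER BY category_id
    """
).fetchall()

for row in summary:
    print(row)
```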
