Select distinct on specific columns but select other columns also in hive

Select distinct on specific columns but select other columns also in hive - hadoop

I have multiple columns in a table in hive having around 80 columns. I need to apply the distinct clause on some of the columns and get the first values from the other columns also. Below is the representation of what I am trying to achieve.
select distinct(col1,col2,col3),col5,col6,col7
from abc where col1 = 'something';
All the columns mentioned above are text columns. So I cannot apply group by and aggregate functions.

You can use row_number function to solve the problem.
create table temp as
select *, row_number() over (partition by col1,col2,col3) as rn
from abc
where col1 = 'something';
select *
from temp
where rn=1
You can also sort the table while partitioning.
row_number() over (partition by col1,col2,col3 order by col4 asc) as rn

DISTINCT is the most overused and least understood function in SQL. It's the last thing that is executed over your entire result set and removes duplicates using ALL columns in your select. You can do a GROUP BY with a string, in fact that is the answer here:
SELECT col1,col2,col3,COLLECT_SET(col4),COLLECT_SET(col5),COLLECT_SET(col6)
FROM abc WHERE col1 = 'something'
GROUP BY col1,col2,col3;
Now that I re-read your question though, I'm not really sure what you are after. You might have to join the table to an aggregate of itself.

Related

Oracle Select unique on multiple column

How can I achieve this to Select to one row only dynamically since
the objective is to get the uniqueness even on multiple columns

select distinct
coalesce(least(ColA, ColB),cola,colb) A1, greatest(ColA, ColB) B1
from T

The best solution is to use UNION
select colA from your_table
union
select colB from your_table;
Update:
If you want to find the duplicate then use the EXISTS as follows:
SELECT COLA, COLB FROM YOUR_TABLE T1
WHERE EXISTS (SELECT 1 FROM YOUR_tABLE T2
WHERE T2.COLA = T1.COLB OR T2.COLB = T1.COLA)

If I correctly understand words: objective is to get the uniqueness even on multiple columns, number of columns may vary, table can contain 2, 3 or more columns.
In this case you have several options, for example you can unpivot values, sort, pivot and take unique values. The exact code depends on Oracle version.
Second option is listagg(), but it has limited length and you should use separators not appearing in values.
Another option is to compare data as collections. Here I used dbms_debug_vc2coll which is simple table of varchars. Multiset except does main job:
with t as (select rownum rn, col1, col2, col3,
sys.dbms_debug_vc2coll(col1, col2, col3) as coll
from test )
select col1, col2, col3 from t a
where not exists (
select 1 from t b where b.rn < a.rn and a.coll multiset except b.coll is empty )
dbfiddle with 3-column table, nulls and different test cases

Error on Partition over row number

I am creating a sub-query to select distinct entries on a certain column, DIS_COL, then return all other columns for those distinct entries, arbitrarily selecting the first row.
To do this I'm creating a sub-query that selects only first rows using over - partition by, then selecting from that sub-query.
There is an error with my code however; "ORA-00923: FROM keyword not found where expected".
My code is below:
select *
from (
select *,
row_number() over (partition by DIS_COL order by COL_2) as row_number --ORDER BY FIELD DETERMINES WHICH ROW IS THE FIRST ROW AND THUS WHICH ONE IS SELECTED.
from MY_TABLE
) as rows
where row_number = 1
AND CRITERIA_COL = 'CRIT_1'
OR CRITERIA_COL_2 = 'CRIT_2';
How can I correct my code to achieve the desired result?
I am working on an Oracle database.

Remove as rows. It is not proper syntax for the table/query alias. It is syntax for column alias.
select *
from (
select T.*,
row_number() over (partition by DIS_COL order by COL_2) as row_number --ORDER BY FIELD DETERMINES WHICH ROW IS THE FIRST ROW AND THUS WHICH ONE IS SELECTED.
from MY_TABLE t
)
where row_number = 1
AND (CRITERIA_COL = 'CRIT_1'
OR CRITERIA_COL_2 = 'CRIT_2');

It's not the ROW_NUMBER, it's the *, Add an alias to the subquery:
select *
from (
select T.*, -- here
row_number() over (partition by DIS_COL order by COL_2) as row_number --ORDER BY FIELD DETERMINES WHICH ROW IS THE FIRST ROW AND THUS WHICH ONE IS SELECTED.
from MY_TABLE
)T as rows -- and here
where row_number = 1
AND CRITERIA_COL = 'CRIT_1'
OR CRITERIA_COL_2 = 'CRIT_2';

Best practice for pagination in Oracle?

Problem: I need write stored procedure(s) that will return result set of a single page of rows and the number of total rows.
Solution A: I create two stored procedures, one that returns a results set of a single page and another that returns a scalar -- total rows. The Explain Plan says the first sproc has a cost of 9 and the second has a cost of 3.
SELECT *
FROM ( SELECT ROW_NUMBER() OVER ( ORDER BY D.ID DESC ) AS RowNum, ...
) AS PageResult
WHERE RowNum >= #from
AND RowNum < #to
ORDER BY RowNum
SELECT COUNT(*)
FROM ...
Solution B: I put everything in a single sproc, by adding the same TotalRows number to every row in the result set. This solution feel hackish, but has a cost of 9 and only one sproc, so I'm inclined to use this solution.
SELECT *
FROM ( SELECT ROW_NUMBER() OVER ( ORDER BY D.ID DESC ) RowNum, COUNT(*) OVER () TotalRows,
WHERE RowNum >= from
AND RowNum < to
ORDER BY RowNum;
Is there a best-practice for pagination in Oracle? Which of the aforementioned solutions is most used in practice? Is any of them considered just plain wrong? Note that my DB is and will stay relatively small (less than 10GB).
I'm using Oracle 11g and the latest ODP.NET with VS2010 SP1 and Entity Framework 4.4. I need the final solution to work within the EF 4.4. I'm sure there are probably better methods out there for pagination in general, but I need them working with EF.

If you're already using analytics (ROW_NUMBER() OVER ...) then adding another analytic function on the same partitioning will add a negligible cost to the query.
On the other hand, there are many other ways to do pagination, one of them using rownum:
SELECT *
FROM (SELECT A.*, rownum rn
FROM (SELECT *
FROM your_table
ORDER BY col) A
WHERE rownum <= :Y)
WHERE rn >= :X
This method will be superior if you have an appropriate index on the ordering column. In this case, it might be more efficient to use two queries (one for the total number of rows, one for the result).
Both methods are appropriate but in general if you want both the number of rows and a pagination set then using analytics is more efficient because you only query the rows once.

In Oracle 12C you can use limit LIMIT and OFFSET for the pagination.
Example -
Suppose you have Table tab from which data needs to be fetched on the basis of DATE datatype column dt in descending order using pagination.
page_size:=5
select * from tab
order by dt desc
OFFSET nvl(page_no-1,1)*page_size ROWS FETCH NEXT page_size ROWS ONLY;
Explanation:
page_no=1
page_size=5
OFFSET 0 ROWS FETCH NEXT 5 ROWS ONLY - Fetch 1st 5 rows only
page_no=2
page_size=5
OFFSET 5 ROWS FETCH NEXT 5 ROWS ONLY - Fetch next 5 rows
and so on.
Refrence Pages -
https://dba-presents.com/index.php/databases/oracle/31-new-pagination-method-in-oracle-12c-offset-fetch
https://oracle-base.com/articles/12c/row-limiting-clause-for-top-n-queries-12cr1#paging

This may help:
SELECT * FROM
( SELECT deptno, ename, sal, ROW_NUMBER() OVER (ORDER BY ename) Row_Num FROM emp)
WHERE Row_Num BETWEEN 5 and 10;

A clean way to organize your SQL code could be trough WITH statement.
The reduced version implements also total number of results and total pages count.
For example
WITH SELECTION AS (
SELECT FIELDA, FIELDB, FIELDC FROM TABLE),
NUMBERED AS (
SELECT
ROW_NUMBER() OVER (ORDER BY FIELDA) RN,
SELECTION.*
FROM SELECTION)
SELECT
(SELECT COUNT(*) FROM NUMBERED) TOTAL_ROWS,
NUMBERED.*
FROM NUMBERED
WHERE
RN BETWEEN ((:page_size*:page_number)-:page_size+1) AND (:page_size*:page_number)
This code gives you a paged resultset with two more fields:
TOTAL_ROWS with the total rows of your full SELECTION
RN the row number of the record
It requires 2 parameter: :page_size and :page_number to slice your SELECTION
Reduced Version
Selection implements already ROW_NUMBER() field
WITH SELECTION AS (
SELECT
ROW_NUMBER() OVER (ORDER BY FIELDA) RN,
FIELDA,
FIELDB,
FIELDC
FROM TABLE)
SELECT
:page_number PAGE_NUMBER,
CEIL((SELECT COUNT(*) FROM SELECTION ) / :page_size) TOTAL_PAGES,
:page_size PAGE_SIZE,
(SELECT COUNT(*) FROM SELECTION ) TOTAL_ROWS,
SELECTION.*
FROM SELECTION
WHERE
RN BETWEEN ((:page_size*:page_number)-:page_size+1) AND (:page_size*:page_number)

Try this:
select * from ( select * from "table" order by "column" desc ) where ROWNUM > 0 and ROWNUM <= 5;

I also faced a similar issue. I tried all the above solutions and none gave me a better performance. I have a table with millions of records and I need to display them on screen in pages of 20. I have done the below to solve the issue.
Add a new column ROW_NUMBER in the table.
Make the column as primary key or add a unique index on it.
Use the population program (in my case, Informatica), to populate the column with rownum.
Fetch Records from the table using between statement. (SELECT * FROM TABLE WHERE ROW_NUMBER BETWEEN LOWER_RANGE AND UPPER_RANGE).
This method is effective if we need to do an unconditional pagination fetch on a huge table.

Sorry, this one works with sorting:
SELECT * FROM (SELECT ROWNUM rnum,a.* FROM (SELECT * FROM "tabla" order by "column" asc) a) WHERE rnum BETWEEN "firstrange" AND "lastrange";

Oracle: Returning the earliest date

I have a stored procedure which I use to return the earliest available date in a column of dates. I need to only return the earliest, and currently am using date arithmetic to reduce the number of returned rows. However, doing it this way, my procedure gets stuck in a loop of the first two top returned values, meaning I have several rows which are never read. Could somebody please let me know where I need to use the MIN function in the following WHERE clause, please? Thanks:
SELECT **COLS**
INTO **VARS**
FROM **TABLE**
INNER JOIN **TABLE TO JOIN**
ON **JOIN TARGET**
WHERE ROWNUM = 1 AND LASTREADTIME < SYSDATE - (30/86400)
ORDER BY LASTREADTIME DESC;

if you only need the earliest date
SELECT MIN(LastReadTime)
INTO **VARS**
FROM table
if you need other datas
SELECT t2.col1, t1.col1, t1.col2, t1.LastTreadTime
INTO **VARS**
FROM table t1
JOIN table2 t2 on t1.col1 = t2.col1
WHERE t1.LastReadTime = (SELECT MIN(t2.LastReadTime) FROM table t2);

Oracle ROWNUM pseudocolumn

I have a complex query with group by and order by clause and I need a sorted row number (1...2...(n-1)...n) returned with every row. Using a ROWNUM (value is assigned to a row after it passes the predicate phase of the query but before the query does any sorting or aggregation) gives me a non-sorted list (4...567...123...45...). I cannot use application for counting and assigning numbers to each row.

Is there a reason that you can't just do
SELECT rownum, a.*
FROM (<<your complex query including GROUP BY and ORDER BY>>) a

You could do it as a subquery, so have:
select q.*, rownum from (select... group by etc..) q
That would probably work... don't know if there is anything better than that.

Can you use an in-line query? ie
SELECT cols, ROWNUM
FROM (your query)

Assuming that you're query is already ordered in the manner you desire and you just want a number to indicate what row in the order it is:
SELECT ROWNUM AS RowOrderNumber, Col1, Col2,Col3...
FROM (
[Your Original Query Here]
)
and replace "Colx" with the names of the columns in your query.

I also sometimes do something like:
SELECT * FROM
(SELECT X,Y FROM MY_TABLE WHERE Z=16 ORDER BY MY_DATE DESC)
WHERE ROWNUM=1

If you want to use ROWNUM to do anything more than limit the total number of rows returned in a query (e.g. AND ROWNUM < 10) you'll need to alias ROWNUM:
select *
(select rownum rn, a.* from
(<sorted query>) a))
where rn between 500 and 1000

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Select distinct on specific columns but select other columns also in hive - hadoop

Related

Oracle Select unique on multiple column

Error on Partition over row number

Best practice for pagination in Oracle?

Oracle: Returning the earliest date

Oracle ROWNUM pseudocolumn

Categories

Resources