Hive case-insensitive alphabetical sorting

When I have an "order by" clause inside of a hive query, for example:
SELECT *
FROM categories
ORDER BY category_name
The results are sorted with all the upper-case values first, followed by all the lower-case values. I need a table constraint or configuration that enforces the behavior below. Sorting with UPPER/LOWER per session won't help.
Current results:
AAA
KKK
ZZZ
aaa
bbb
yyy
Expected results:
aaa
AAA
bbb
KKK
yyy
ZZZ
Is there any configuration that makes Hive sort the data alphabetically first, regardless of case?
In SQL this is a collation; in Oracle it's an NLS setting (NLS_SORT).
What is the right configuration for this kind of expected sorting results, and where to set it?

How about just using lower()?
SELECT *
FROM categories
ORDER BY LOWER(category_name);
Note: this leaves the relative order of values that differ only by case arbitrary. Because lower-case letters sort after upper-case letters in byte-wise (ASCII/UTF-8) string comparison, you could add the column itself as a descending tie-breaker:
SELECT c.*
FROM categories c
ORDER BY LOWER(c.category_name), c.category_name DESC;
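To see the effect of the tie-breaker, here is a minimal sketch that runs both orderings against the sample values from the question. It uses Python's built-in sqlite3 module as a stand-in for Hive, on the assumption that both compare plain ASCII strings byte by byte; it is an illustration, not a Hive session.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (assumption: for plain
# ASCII strings both engines compare byte by byte, so 'A' sorts before 'a').
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE categories (category_name TEXT)")
conn.executemany(
    "INSERT INTO categories VALUES (?)",
    [("AAA",), ("KKK",), ("ZZZ",), ("aaa",), ("bbb",), ("yyy",)],
)

def names(order_by):
    """Return category_name values sorted by the given ORDER BY expression."""
    sql = "SELECT category_name FROM categories ORDER BY " + order_by
    return [row[0] for row in conn.execute(sql)]

# Plain ORDER BY: upper-case values first, then lower-case values.
plain = names("category_name")

# Case-folded ORDER BY with the original column as a descending tie-breaker.
folded = names("LOWER(category_name), category_name DESC")

print(plain)
print(folded)
```

The second ordering matches the "Expected results" list in the question: the lower-case spelling comes first whenever two values differ only by case.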

To get alphabetical sorting, or any other kind of sorting, you can use CLUSTER BY in your query. Note that CLUSTER BY sorts rows only within each reducer, so the overall output is totally ordered only when a single reducer is used.
SELECT *
FROM categories
CLUSTER BY LOWER(category_name);
Alternatively, you can use DISTRIBUTE BY together with SORT BY for a more customized solution.
SELECT *
FROM categories
DISTRIBUTE BY LOWER(category_name)
SORT BY LOWER(category_name) DESC

Related

Does the ORDERED hint in Oracle also decide the order in which rows are fetched?

I read that 'The ORDERED hint causes Oracle to join tables in the order in which they appear in the FROM clause.'
But does it also fetch the rows in a specific order?
For example, suppose I have the ORDERED hint on column emp_code, which has the values 'A', 'B' and 'C' (let's assume that more than two tables are joined to get emp_code).
Will the output always have a specific order of rows? For example, will 'A' always be the first row and 'C' the last? Does the hint decide the order of rows, and if so, how?
No. The only thing that controls the order of rows in the final result set is the use of the ORDER BY clause in the SELECT statement. Hints are to influence the access plan chosen by the optimizer, not ordering of the result set.
select emp_id,
emp_name
from emp
order by emp_id -- <this is the only thing that controls the order of rows in the result set
;

Confirmation on re-written query

Original query:
SELECT CAST(cust_mart.acct_identifier AS STRING) as f0
FROM cts_work.cust_xref cust_mart
GROUP BY cust_mart.f0;
Can I replace the above query with the query below?
SELECT DISTINCT CAST(cust_mart.acct_identifier AS STRING) as f0
FROM cts_work.cust_xref cust_mart;
Reason:
There is no aggregation, so GROUP BY does not make sense, but I would still like to confirm my approach. I am running this query on Hive using the Tez engine.
Use the EXPLAIN command and compare the two query plans to check the difference. These queries should generate identical plans. GROUP BY will work the same as DISTINCT in this case. DISTINCT is also an aggregation, just another way of writing the same GROUP BY.
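As a quick sanity check of the equivalence, here is a small sketch that runs both forms against a few made-up rows. It uses Python's built-in sqlite3 module rather than Hive, so the table contents are hypothetical, STRING is replaced by SQLite's TEXT, and the GROUP BY repeats the CAST expression instead of referring to the alias. Comparing EXPLAIN output on Hive itself remains the real test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cust_xref (acct_identifier INTEGER)")
# Hypothetical sample data with duplicates.
conn.executemany(
    "INSERT INTO cust_xref VALUES (?)",
    [(101,), (102,), (101,), (103,), (102,)],
)

# GROUP BY with no aggregate function: one row per distinct value.
grouped = conn.execute(
    "SELECT CAST(acct_identifier AS TEXT) AS f0 "
    "FROM cust_xref "
    "GROUP BY CAST(acct_identifier AS TEXT)"
).fetchall()

# DISTINCT over the same expression.
distinct = conn.execute(
    "SELECT DISTINCT CAST(acct_identifier AS TEXT) AS f0 FROM cust_xref"
).fetchall()

# Neither query guarantees an order, so compare as sorted lists.
same = sorted(grouped) == sorted(distinct)
print(same, sorted(distinct))
```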

Need a method to filter data for records having more than one record for an id in HIVE

Consider the table below in HIVE:
Here I need to find the unique combinations of household, vehicle and customer.
But there is a condition: if the same household and vehicle have two different customers with the roles DRIVER and OWNER, I have to keep the OWNER.
If a household and vehicle have only a single customer, I have to keep that record too, whether the customer is a DRIVER or an OWNER.
I need a Hive query for this.
The result should look like the table below:
Can anyone help me out here?
This may be useful; try this:
select Household, Vehicle, Customer, Cust_role
from (
  select *,
         row_number() over (partition by Household, Vehicle order by Cust_role desc) rn
  from test_table
) tableouter
where rn = 1;
Output:
I 1 A OWNER
II 2 C DRIVER
III 3 D OWNER
IV 4 E OWNER
Basically, what you are looking for is a top-N windowing query, with N being 1 in your case. You can write a Hive query with the RANK function and keep only the rows ranked 1 within each group to achieve what you want; a plain "LIMIT 1" would cut the whole result down to a single row. Refer to the RANK function in Hive to get started.
You can find a simple example here - Hive - top n records within a group
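The "rank within each group, keep rank 1" idea from the answers above can be sketched as follows. The input rows are hypothetical, since the question's table did not survive, and the query runs on Python's built-in sqlite3 module (window functions need SQLite 3.25 or later) rather than on Hive. It relies on 'OWNER' sorting after 'DRIVER' alphabetically, so a descending sort puts the OWNER first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_table "
    "(household TEXT, vehicle INTEGER, customer TEXT, cust_role TEXT)"
)
# Hypothetical input: households I and IV have both an OWNER and a DRIVER.
conn.executemany(
    "INSERT INTO test_table VALUES (?, ?, ?, ?)",
    [
        ("I", 1, "A", "OWNER"),
        ("I", 1, "B", "DRIVER"),
        ("II", 2, "C", "DRIVER"),
        ("III", 3, "D", "OWNER"),
        ("IV", 4, "E", "OWNER"),
        ("IV", 4, "F", "DRIVER"),
    ],
)

# Number the rows of each (household, vehicle) group with OWNER first,
# then keep only the first row of every group.
rows = conn.execute(
    """
    SELECT household, vehicle, customer, cust_role
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (
                   PARTITION BY household, vehicle
                   ORDER BY cust_role DESC
               ) AS rn
        FROM test_table t
    ) ranked
    WHERE rn = 1
    ORDER BY household
    """
).fetchall()

for row in rows:
    print(row)
```

If more roles are ever added, an explicit CASE expression in the ORDER BY would be safer than relying on alphabetical order.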

Oracle - select statement alias one column and wildcard to get all remaining columns

New to SQL. Pardon me if this question is a basic one. Is there a way to do the following?
SELECT COLUMN1 as CUSTOM_NAME, <wildcard for remaining columns as is> from TABLE;
I only want COLUMN1 to appear once in the final result.
There is no way to make that kind of dynamic SELECT list with regular SQL*.
This is a good thing. Programming gets more difficult the more dynamic it is. Even the simple * syntax, while useful in many contexts, causes problems in production code. The Oracle SQL grammar is already more complicated than most traditional programming languages; adding a little meta-language to describe what the queries return could be a nightmare.
*Well, you could create something using Oracle data cartridge, or DBMS_XMLGEN, or a trick with the PIVOT clause. But each of those solutions would be incredibly complicated and certainly not as simple as just typing the columns.
This is about as close as you will get.
It is very handy for putting the important columns up front,
while being able to scroll to the others if needed. COLUMN1 will end up being there twice.
SELECT COLUMN1 as CUSTOM_NAME,
aliasName.*
FROM TABLE aliasName;
If you have many columns, it might be worth generating the full column list automatically instead of relying on the * selector.
A two-step approach would be to generate the column list, with the first N columns in a custom order and the remaining columns in unspecified order, and then use this generated list in your actual SELECT statement.
-- select comma separated column names from table with the first columns being in specified order
select
LISTAGG(column_name, ', ') WITHIN GROUP (
ORDER BY decode(column_name,
'FIRST_COLUMN_NAME', 1,
'SECOND_COLUMN_NAME', 2) asc) "Columns"
from user_tab_columns
where table_name = 'TABLE_NAME';
Replace TABLE_NAME, FIRST_COLUMN_NAME and SECOND_COLUMN_NAME by your actual names, adjust the list of explicit columns as needed.
Then execute the query and use the result, which should look like
FIRST_COLUMN_NAME, SECOND_COLUMN_NAME, OTHER_COLUMN_NAMES
Of course, this is overhead for five or so columns, but if you ever run into a company database with a three-digit number of columns, it can be useful.
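The same two-step idea can also be scripted on the client side. Below is a rough sketch in Python using the built-in sqlite3 module, where PRAGMA table_info plays the role that user_tab_columns plays in Oracle. The table `demo`, its columns and the helper `column_list` are all hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with a handful of columns.
conn.execute("CREATE TABLE demo (id INTEGER, notes TEXT, name TEXT, created TEXT)")

def column_list(conn, table, first):
    """Build a comma-separated column list with the `first` columns up front.

    `table` is interpolated into the statement, so it must be a trusted name.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    head = [c for c in first if c in cols]
    rest = [c for c in cols if c not in head]
    return ", ".join(head + rest)

cols = column_list(conn, "demo", ["name", "id"])
print(cols)

# Step two: use the generated list in the actual SELECT statement.
query = f"SELECT {cols} FROM demo"
print(query)
```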

Oracle Select Query, Order By + Limit Results

I am new to Oracle and working with a fairly large database. I would like to perform a query that will select the desired columns, order by a certain column and also limit the results. According to everything I have read, the below query should be working but it is returning "ORA-00918: column ambiguously defined":
SELECT * FROM(SELECT * FROM EAI.EAI_EVENT_LOG e,
EAI.EAI_EVENT_LOG_MESSAGE e1 WHERE e.SOURCE_URL LIKE '%.XML'
ORDER BY e.REQUEST_DATE_TIME DESC) WHERE ROWNUM <= 20
Any suggestions would be greatly appreciated :D
The error message means your result set contains two columns with the same name. Each column in a query's projection needs to have a unique name. Presumably you have a column (or columns) with the same name in both EAI_EVENT_LOG and EAI_EVENT_LOG_MESSAGE.
You also want to join on that column. At the moment you are generating a cross join between the two tables. In other words, if you have a hundred records in EAI_EVENT_LOG and two hundred records in EAI_EVENT_LOG_MESSAGE, your result set will be twenty thousand records (without the ROWNUM filter). This is probably not your intention.
"By switching to innerjoin, will that eliminate the error with the
current code?"
No, you'll still need to handle having two columns with the same name. Basically this comes from using SELECT * on multiple tables. SELECT * is bad practice. It's convenient, but it is always better to specify the exact columns you want in the query's projection. That way you can include (say) e.TRANSACTION_ID and exclude e1.TRANSACTION_ID, and avoid the ORA-00918 exception.
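Both points, the row explosion from a missing join condition and the unique names you get from an explicit projection, can be illustrated with a small sketch. It uses Python's built-in sqlite3 module with two tiny hypothetical tables; the column names id, event_id, transaction_id and message are assumptions, not the real EAI schema. SQLite does not raise ORA-00918, so this only demonstrates the row counts and the explicit column list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE event_log (id INTEGER, transaction_id TEXT, source_url TEXT);
    CREATE TABLE event_log_message (event_id INTEGER, transaction_id TEXT, message TEXT);
    INSERT INTO event_log VALUES
        (1, 't1', 'a.XML'), (2, 't2', 'b.XML'), (3, 't3', 'c.XML');
    INSERT INTO event_log_message VALUES
        (1, 't1', 'm1'), (1, 't1', 'm2'), (2, 't2', 'm3'), (3, 't3', 'm4');
    """
)

# No join condition: every event is paired with every message (3 * 4 rows).
cross_rows = conn.execute(
    "SELECT COUNT(*) FROM event_log e, event_log_message e1"
).fetchone()[0]

# Explicit join condition and explicit projection: transaction_id is taken
# from one table only, so every result column has a unique name.
cur = conn.execute(
    """
    SELECT e.id AS event_id,
           e.transaction_id AS transaction_id,
           e1.message AS message
    FROM event_log e
    INNER JOIN event_log_message e1 ON e.id = e1.event_id
    WHERE e.source_url LIKE '%.XML'
    """
)
joined_columns = [d[0] for d in cur.description]
joined_rows = len(cur.fetchall())

print(cross_rows, joined_rows, joined_columns)
```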
Maybe you have some columns in both EAI_EVENT_LOG and EAI_EVENT_LOG_MESSAGE tables having identical names? Instead of SELECT * list all columns you want to select.
Another problem I see is that you are selecting from two tables but not joining them in the WHERE clause, hence the result set will be the cross product of those two tables.
You need to stop using SQL '89 implicit join syntax.
Not because it doesn't work, but because it is evil.
Right now you have a cross join, which in 99.9% of cases is not what you want.
Also, every sub-select needs to have its own alias.
SELECT * FROM
(SELECT e.*, e1.* FROM EAI.EAI_EVENT_LOG e
INNER JOIN EAI.EAI_EVENT_LOG_MESSAGE e1 on (......)
WHERE e.SOURCE_URL LIKE '%.XML'
ORDER BY e.REQUEST_DATE_TIME DESC) s WHERE ROWNUM <= 20
Please specify a join criterion on the dotted line.
Normally you join on a key field, e.g. ON (e.id = e1.event_id)
It's a bad idea to use SELECT *; it's better to specify exactly which fields you want:
SELECT e.field1 as customer_id
,e.field2 as customer_name
.....
