NOT IN subquery is slow and runs out of memory (ClickHouse)

I have a single table holding DNA variants for different people. I want to show the variants that are unique to a person:
Table DNA (engine ordered by variant):
person | variant
p1 | v1
p1 | v2
p1 | v3
p2 | v2
p2 | v3
p3 | v2
p3 | v3
p4 | v2
p4 | v3
So a simple query:
select variant from DNA where person = 'p1' and variant
not in (select variant from DNA where person in ('p2', 'p3'))
will return all variants unique to p1 vs. p2 and p3 (p4 not considered for this query). However - it is slow and runs out of memory.
Should I be doing this a different way?

I suspect that it runs out of memory because the sub-query select variant from DNA where person in ('p2', 'p3') returns v2, v3, v2, v3. At scale, that repetition seems exceedingly inefficient. Adding distinct to the sub-query may help, but in general this approach does not scale well if you have a lot of people, since you would have to type every one of them into where person in (...).
An alternative to this is to do a self join and basically limit the results to those where the only match is itself. Something like:
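SELECT person, variant
FROM (
SELECT a.person AS person, variant
FROM DNA AS a
ALL LEFT JOIN DNA AS b
USING variant
)
-- a (person, variant) pair that joins only to its own row is held by exactly one person
GROUP BY person, variant
HAVING COUNT(*) == 1;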
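Another way to avoid materializing the NOT IN list is a single GROUP BY with conditional sums. The sketch below runs that idea against the question's sample rows using Python's built-in sqlite3 module, not ClickHouse, so it only illustrates the shape of the query; the table name dna and the variable names are assumptions, and memory behavior on a real ClickHouse table is not verified here.

```python
import sqlite3

# Sample rows copied from the question's DNA table.
rows = [
    ("p1", "v1"), ("p1", "v2"), ("p1", "v3"),
    ("p2", "v2"), ("p2", "v3"),
    ("p3", "v2"), ("p3", "v3"),
    ("p4", "v2"), ("p4", "v3"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dna (person TEXT, variant TEXT)")
conn.executemany("INSERT INTO dna VALUES (?, ?)", rows)

# One pass over the people of interest: keep a variant only if p1 has it
# and neither p2 nor p3 does. p4 is filtered out up front, as in the question.
# In SQLite a comparison evaluates to 0 or 1, so SUM counts matching rows.
query = """
    SELECT variant
    FROM dna
    WHERE person IN ('p1', 'p2', 'p3')
    GROUP BY variant
    HAVING SUM(person = 'p1') > 0
       AND SUM(person IN ('p2', 'p3')) = 0
    ORDER BY variant
"""
unique_to_p1 = [variant for (variant,) in conn.execute(query)]
print(unique_to_p1)
```

Because the table is ordered by variant, an aggregation keyed on variant may suit the storage layout better than a large IN set, but that is a guess to be measured rather than a guarantee.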

Related

Mysql 8 CASE..WHEN..THEN..END throws syntax error

I have this query which works without problems in MySQL 5.*, but I recently upgraded to MySQL 8 and now the query throws a syntax error as follows:
You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'row_number, @breed'
The query is (lots of useless detail removed for simplicity):
SELECT `name`, `age`, breed
FROM (
SELECT
`dogs`.`name`,
`dogs`.`age`,
@row_number:=CASE WHEN @breed=breed
THEN @row_number+1
ELSE 1
END AS row_number
, @breed:=breed AS breed
FROM `dogs` /* other details with joins, subqueries and limits left out for simplicity*/;
row_number is supposed to maintain a per-breed row count so I can get n rows for each breed. That is, if n=2, for example, my result would be:
name   | age | breed
-------------------------------
fifi   | 2   | labrador
bingo  | 5   | labrador
rocket | 1   | german shepherd
sky    | 1   | german shepherd
My main question is why I get the syntax error. Google is usually my friend, but not in this case; I tried. I also tried removing "as" and adding brackets around the case/when/then/end, but no joy!
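I'm not certain that MySQL 8 supports user variables in the same way as MySQL 5.x does. The syntax error itself most likely comes from the alias: ROW_NUMBER is a reserved word in MySQL 8.0, so an unquoted row_number alias fails to parse (quoting it with backticks or renaming it avoids that). In any case, assigning user variables inside expressions is deprecated, and you should just be using the ROW_NUMBER analytic function. For example, assuming you wanted the two youngest animals per breed, you could try: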
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY breed ORDER BY age) rn
FROM dogs
-- maybe joins here
)
SELECT name, age, breed
FROM cte
WHERE rn <= 2;
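To make the numbering logic concrete, here is a small Python sketch of what ROW_NUMBER() OVER (PARTITION BY breed ORDER BY age) computes, which is also what the old user-variable trick was trying to emulate. The sample rows beyond the four in the question (rex, max) and the function name top_n_per_breed are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample data: (name, age, breed).
dogs = [
    ("fifi", 2, "labrador"),
    ("bingo", 5, "labrador"),
    ("rex", 7, "labrador"),
    ("rocket", 1, "german shepherd"),
    ("sky", 1, "german shepherd"),
    ("max", 4, "german shepherd"),
]


def top_n_per_breed(rows, n):
    """Mimic ROW_NUMBER() OVER (PARTITION BY breed ORDER BY age) with rn <= n."""
    ordered = sorted(rows, key=itemgetter(2, 1))  # sort by breed, then age
    result = []
    for _breed, group in groupby(ordered, key=itemgetter(2)):
        # enumerate restarts at 1 for every breed, like the window function
        for rn, row in enumerate(group, start=1):
            if rn <= n:
                result.append(row)
    return result


youngest_two = top_n_per_breed(dogs, 2)
for name, age, breed in youngest_two:
    print(name, age, breed)
```

One difference worth keeping in mind: Python's sort is stable, so dogs of equal age keep their input order here, whereas in SQL the order among ties is unspecified unless the ORDER BY includes a tie-breaker column.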

What's the purpose of using the over and rank keywords in Hive SQL?

What is the meaning/purpose of using the over and rank keywords in Hive SQL?
select rank() over (order by net_worth desc) as rank, name, net_worth from wealth order by rank, name;
+------+---------+---------------+
| rank | name | net_worth |
+------+---------+---------------+
| 1 | Solomon | 2000000000.00 |
| 2 | Croesus | 1000000000.00 |
| 2 | Midas | 1000000000.00 |
| 4 | Crassus | 500000000.00 |
| 5 | Scrooge | 80000000.00 |
+------+---------+---------------+
The OVER clause is powerful in that you can have aggregates over different ranges ("windowing"), whether you use a GROUP BY or not.
The OVER clause defines a window, that is, a user-specified set of rows within a query result set. A window function then computes a value for each row in the window. You can use the OVER clause with functions to compute aggregated values such as moving averages, cumulative aggregates, running totals, or top-N-per-group results.
The OVER clause can be used with aggregate functions and ranking functions. It determines the partitioning and ordering of the records before the aggregate or ranking function is applied.
Suppose you used only the rank() function: how would SQL know on what basis to calculate the rank? For example, say a table has 3 columns: name, net_worth and net_profit. If the name with the highest net_profit should be ranked first, you have to tell SQL to calculate the rank on the basis of net_profit, highest first.
over() works on a "window" of rows.
In your example, select rank() over (order by net_worth desc), you have asked for the rows to be ranked by the net_worth column in descending order, so the ranking follows descending net_worth.
over() is especially powerful when used along with partition by.
Have a look at this article, which provides good examples to understand the concepts.
If you have a sales table with Territory and Sales Amount, you can rank by Sales Amount over the whole table, or partition by Territory and rank the Sales Amount within each Territory.
Have a look at this article to get an understanding of WindowingAndAnalytics. It explains how to use aggregate functions in HiveQL.
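The ranking in the question's output (1, 2, 2, 4, 5) can be reproduced by hand, which may make the semantics of rank() over (order by net_worth desc) easier to see. This is a plain Python sketch using the values from the question's table; the function name rank_desc is made up, and net worth is stored as whole numbers for simplicity.

```python
# (name, net_worth) pairs taken from the output shown in the question.
wealth = [
    ("Scrooge", 80_000_000),
    ("Midas", 1_000_000_000),
    ("Solomon", 2_000_000_000),
    ("Crassus", 500_000_000),
    ("Croesus", 1_000_000_000),
]


def rank_desc(rows):
    """Emulate RANK() OVER (ORDER BY net_worth DESC), then sort by rank, name."""
    # Highest net worth first; name is the tie-breaker, as in ORDER BY rank, name.
    ordered = sorted(rows, key=lambda r: (-r[1], r[0]))
    ranked = []
    previous_worth = None
    rank = 0
    for position, (name, net_worth) in enumerate(ordered, start=1):
        # Tied rows share a rank; the next distinct value jumps to its position,
        # which is why rank 3 is skipped after two rows tie at rank 2.
        if net_worth != previous_worth:
            rank = position
            previous_worth = net_worth
        ranked.append((rank, name, net_worth))
    return ranked


ranking = rank_desc(wealth)
for rank, name, net_worth in ranking:
    print(rank, name, net_worth)
```

Adding partition by to the over clause would amount to running this same loop separately for each group of rows.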

Oracle display value replacement of flattened, delimited foreign key values

I'm working on a data export for a painfully denormalized COTS product and am hung up on how to plug display values into my selection for columns that contain a delimited string of foreign keys.
Assume the following sets of data for example.
DEPARTMENTS table:
Key Value
---------------------------------
1 Finance
2 Human Resources
3 Public Affairs
4 Information Technology
PERSONNEL table:
PK FName LName Departments
-------------------------------------------------
111 Marty Graw 1|~*~|3|~*~|
222 Rick Shaw 2|~*~|4|~*~|
333 Jean Poole 4|~*~|2|~*~|3|~*~|1|~*~|
Desired output from select:
FName LName Departments
-----------------------------------------------------------------------------------
Marty Graw Finance, Public Affairs
Rick Shaw Human Resources, Information Technology
Jean Poole Information Technology, Human Resources, Public Affairs, Finance
I've found examples of how to deal with delimited strings but nothing that really seems to fit this particular scenario. Ideally I'd like to figure out how I could do it without having to create functions etc. as my permissions are pretty limited.
This will not preserve the original order of the IDs, but if that's not important then this will work:
select DISTINCT
p.fname
,p.lname
,LISTAGG(d.value, ', ')
WITHIN GROUP (ORDER BY d.value)
OVER (PARTITION BY p.pk)
AS departments_list
from personnel p
left join departments d
on INSTR('|~*~|'||p.departments||'|~*~|'
,'|~*~|'||d.key||'|~*~|') > 0;
SQL Fiddle: http://sqlfiddle.com/#!4/d292e/3/0
EDIT
If you really need them listed in the same order as the IDs, you can use this variant:
select DISTINCT
p.fname
,p.lname
,LISTAGG(d.value, ', ')
WITHIN GROUP (
ORDER BY INSTR('|~*~|'||p.departments||'|~*~|'
,'|~*~|'||d.key||'|~*~|'))
OVER (PARTITION BY p.pk) AS departments_list
from personnel p
left join departments d
on INSTR('|~*~|'||p.departments||'|~*~|'
,'|~*~|'||d.key||'|~*~|') > 0;
http://sqlfiddle.com/#!4/d292e/4
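If the export ends up being post-processed outside the database, or you simply want to cross-check the SQL output, the same key-to-name replacement is easy to sketch in Python. The data below is copied from the question; the function name department_names and the choice to skip unknown keys silently are assumptions made for this illustration.

```python
DELIMITER = "|~*~|"

# DEPARTMENTS table from the question: key -> display value.
departments = {
    1: "Finance",
    2: "Human Resources",
    3: "Public Affairs",
    4: "Information Technology",
}

# PERSONNEL table from the question: (pk, fname, lname, delimited department keys).
personnel = [
    (111, "Marty", "Graw", "1|~*~|3|~*~|"),
    (222, "Rick", "Shaw", "2|~*~|4|~*~|"),
    (333, "Jean", "Poole", "4|~*~|2|~*~|3|~*~|1|~*~|"),
]


def department_names(delimited_keys):
    """Turn '4|~*~|2|~*~|' into 'Information Technology, Human Resources'."""
    # The trailing delimiter produces an empty last piece, so drop empty strings.
    keys = [int(piece) for piece in delimited_keys.split(DELIMITER) if piece]
    # Keys are looked up in their original order; unknown keys are skipped.
    return ", ".join(departments[key] for key in keys if key in departments)


export = [(fname, lname, department_names(keys)) for _pk, fname, lname, keys in personnel]
for fname, lname, names in export:
    print(fname, lname, names)
```

This keeps the original key order, so it corresponds to the second query rather than the alphabetically sorted one.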

How do I retrieve a specific row in Hive?

I have a dataset that looks like this:
---------------------------
cust | cost | cat | name
---------------------------
1 | 2.5 | apple | pkLady
---------------------------
1 | 3.5 | apple | greenGr
---------------------------
1 | 1.2 | pear | yelloPear
----------------------------
1 | 4.5 | pear | greenPear
-------------------------------
My Hive query should now compare the cheapest price of each item the customer bought, so I want to get the 2.5 and 1.2 into one row to compute their difference. Since I am new to Hive, I don't know how to ignore everything else until I reach the next category of item while still keeping the cheapest price from the previous category.
You can use something like the following:
select cat, min(cost) from table group by cat;
Given your options (brickhouse UDFs, hive windowing functions or a self-join) in Hive, a self-join is the worst way to do this.
select *
, (cost - min(cost) over (partition by cust)) cost_diff
from table
You could create a subquery containing the minimum cost for each customer, and then join it to the original table:
select
mytable.*,
minCost.minCost,
cost - minCost as costDifference
from mytable
inner join
(select
cust,
min(cost) as minCost
from mytable
group by cust) minCost
on mytable.cust = minCost.cust
I created an interactive SQLFiddle example using MySQL, but it should work just fine in Hive.
I think this is really a SQL question rather than a Hive question: If you just want the cheapest cost per customer you can do
select cust, min(cost)
from yourtable
group by cust
Otherwise, if you want the cheapest cost per customer per category, you can do:
select cust, cat, min(cost)
from yourtable
group by cust, cat
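To get the two per-category minimums into a single row, as the question asks, one option is conditional aggregation: a MIN over a CASE expression per category. The sketch below checks the idea on the question's data using Python's built-in sqlite3 module rather than Hive; the table name purchases and the column aliases are assumptions. Hive supports CASE and MIN as well, so the query should carry over, but it has not been run there.

```python
import sqlite3

# Rows copied from the question: (cust, cost, cat, name).
purchases = [
    (1, 2.5, "apple", "pkLady"),
    (1, 3.5, "apple", "greenGr"),
    (1, 1.2, "pear", "yelloPear"),
    (1, 4.5, "pear", "greenPear"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (cust INTEGER, cost REAL, cat TEXT, name TEXT)")
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?, ?)", purchases)

# CASE yields NULL for rows of the other category, and MIN ignores NULLs,
# so each MIN sees only the rows of its own category.
query = """
    SELECT cust,
           MIN(CASE WHEN cat = 'apple' THEN cost END) AS min_apple,
           MIN(CASE WHEN cat = 'pear'  THEN cost END) AS min_pear,
           MIN(CASE WHEN cat = 'apple' THEN cost END)
             - MIN(CASE WHEN cat = 'pear' THEN cost END) AS cost_diff
    FROM purchases
    GROUP BY cust
"""
row = conn.execute(query).fetchone()
print(row)
```

This only works when the categories to compare are known in advance; for an open-ended set of categories, the per-customer, per-category GROUP BY is the more general starting point.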

Can someone explain to me how the Cartesian product works in relational algebra?

Here it says:
Selection and cross product
Cross product is the costliest operator to evaluate. If the input relations have N and M rows, the result will contain NM rows. Therefore it is very important to do our best to decrease the size of both operands before applying the cross product operator.
Suppose that we have 2 relations.
The first relation is called Student and has 3 attributes, thus:
student
|a |b |c |
------------
|__|___|___|
|__|___|___|
|__|___|___|
The second relation is University, again with 3 attributes:
university
|e |f |g |
------------
|__|___|___|
|__|___|___|
|__|___|___|
We have 3 rows in each relation, so after applying the cross product operation we will get a relation which has 3*3 = 9 rows.
Now, I don't understand: why 9 and not 3?
Won't the final relation be:
final relation
|a |b |c |d |e |f |g |
--------------------------
|__|___|___|__|____|__|__|
|__|___|___|__|____|__|__|
|__|___|___|__|____|__|__|
Doesn't this have 3 rows again?
Thanks
If the rows in Student are row1, row2 and row3, and the rows in University are row4, row5 and row6, then the cartesian product will contain
row1row4, row1row5, row1row6, row2row4, row2row5, row2row6, row3row4, row3row5, row3row6
Each possible combination of rows. That's how it is defined. Nothing more to it.
Except for your remark "Therefore it is very important to do our best to decrease the size of both operands before applying the cross product operator.". It is important to realise that there do exist optimizers which are able to "rewrite" certain algebra operations. It is certainly not the case that the onus is always on the query writer to determine the "most appropriate way of combining restrictions with other operations". In fact, "moving restrictions to the inside as far as possible" is one of the things industrial optimizers are actually very good at.
Just imagine that you have two tables, one with the students and one with the universities. When you run a Cartesian query against a relational database, you get a row for every student joined in turn to every university.
Select *
From students,
universities;
OR
SELECT * FROM students CROSS JOIN universities
I know this has little to do with algebra, but since you're on Stack Overflow :D
There is no common attribute to link between student and university so each row in student is matched to each row in university, 3 * 3 = 9
|a|e|
|a|f|
|a|g|
|b|e|
|b|f|
|b|g|
|c|e|
|c|f|
|c|g|
Therefore 9
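The same counting can be checked with a few lines of Python, where itertools.product plays the role of the cross product. The row contents below are placeholders made up for this sketch; only the shapes (3 rows of 3 attributes each) come from the question.

```python
from itertools import product

# Three rows per relation, three attributes per row (placeholder values).
student = [("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a3", "b3", "c3")]
university = [("e1", "f1", "g1"), ("e2", "f2", "g2"), ("e3", "f3", "g3")]

# Cross product: every student row is concatenated with every university row.
cross = [s + u for s, u in product(student, university)]
print(len(cross))     # number of rows in the result
print(len(cross[0]))  # number of attributes per result row

# Applying a selection first shrinks the operand, and with it the product.
selected = [s for s in student if s[0] == "a1"]
smaller = [s + u for s, u in product(selected, university)]
print(len(smaller))
```

The attributes are placed side by side (3 + 3 = 6 columns), while the rows multiply (3 * 3 = 9), which is why filtering an operand before the product pays off.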
