Hive Count(DISTINCT column) versus SELECT COUNT(*) from (SELECT DISTINCT column) - performance

There have been discussions and claims that query 2 is faster than query 1.
Query 1
SELECT COUNT(DISTINCT A) FROM TAB_X;
Query 2
SELECT COUNT(*) FROM (SELECT DISTINCT A FROM TAB_X);
I fail to understand exactly why that is so.
This is my understanding of how these queries would be converted to MapReduce behind the scenes.
Query 1
- Only one stage
- The mappers emit column A as the key and 1 as the value. **Is this correct? How is distinct achieved?**
- There would be only one reducer, which would just have to increment a counter for every key and list of values that it gets. However, I am not sure how that single reducer would know when to emit the final count (**how does it know when to emit eventually?**).
Query 2
- Two stages
- Stage 1
- The mappers emit column A as the key and 1 as the value.
- There will be a lot of reducers, each of which can aggregate the results for each key and emit that key (which is column A).
- Stage 2
- The mappers get the details of each user (the keys emitted by stage 1) and emit the same key for all of them, with 1 as the value.
- The reducers would just sum these counts and emit the final result.
Can you please help me understand this by answering my inline questions for query 1 and confirming my understanding of query 2?
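The two plans can be modelled in a few lines of Python. This is a minimal sketch of the plans as the question describes them, not of what Hive actually generates; the `shuffle` helper, the function names and the sample rows are all hypothetical. In this model, grouping by key in the shuffle is what makes the values distinct, and the single reducer emits its count once, after its input is exhausted.

```python
from collections import defaultdict

def shuffle(pairs, num_reducers):
    """Group (key, value) pairs by key and assign each key to one reducer partition."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions

def count_distinct_one_stage(rows):
    # Map: emit (A, 1). A single reducer receives every key, already grouped.
    (partition,) = shuffle(((a, 1) for a in rows), num_reducers=1)
    distinct = 0
    for _key, _values in partition.items():
        distinct += 1      # one increment per distinct key
    return distinct        # emitted once, after all input has been consumed

def count_distinct_two_stage(rows, num_reducers=4):
    # Stage 1: many reducers, each emitting every key it owns exactly once.
    stage1_out = []
    for partition in shuffle(((a, 1) for a in rows), num_reducers):
        stage1_out.extend(partition.keys())
    # Stage 2: map every distinct key to one constant key, then sum the ones.
    (partition,) = shuffle((("all", 1) for _ in stage1_out), num_reducers=1)
    return sum(partition["all"])

rows = ["u1", "u2", "u1", "u3", "u2", "u1"]
print(count_distinct_one_stage(rows))
print(count_distinct_two_stage(rows))
```

Both functions return the same count. The difference in the model is where the work happens: the one-stage plan pushes every key through a single reducer, while the two-stage plan spreads the de-duplication across several reducers and leaves only a sum for the last one.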

Related

Oracle counting distinct rows using two columns merged as one

I have one table where each row has three columns. The first two columns are a prefix and a value. The third column is the one I'm trying to get a distinct count of, for each combination of columns one and two.
Basically I'm trying to get to this.
Account            Totals
prefix & value1    101
prefix & value2    102
prefix & value3    103
I've tried a lot of different versions but I'm basically noodling around this.
select prefix||value as Account, count(distinct thirdcolumn) as Totals from Transactions

It sounds like you want
SELECT
prefix||value Account,
count(distinct thirdcolumn) Totals
FROM Transactions
GROUP BY prefix, value
The count(distinct thirdcolumn) says you want a count of the distinct values in the third column. The GROUP BY prefix, value says you want a row returned for each unique prefix/value combination and that the count applies to that combination.
Note that "thirdcolumn" is a placeholder for the name of your third column, not a magic keyword, since I didn't see the actual name in the post.

If you want the number of rows for each prefix/value pair then you can use:
SELECT prefix || value AS account,
COUNT(*) AS totals
FROM Transactions
GROUP BY prefix, value
You do not want to count the DISTINCT values of prefix/value. If you GROUP BY those values, different pairs end up in different groups, so the COUNT of DISTINCT prefix/value pairs would always be one.
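The difference between the two answers can be checked on a toy table. The following is a small sketch using Python's built-in sqlite3 module rather than Oracle; the sample rows and the column aliases distinct_totals and row_totals are made up for illustration, and thirdcolumn is the same placeholder name used above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Transactions (prefix TEXT, value TEXT, thirdcolumn INTEGER)")
conn.executemany(
    "INSERT INTO Transactions VALUES (?, ?, ?)",
    [("AB", "1", 10), ("AB", "1", 10), ("AB", "1", 20), ("AB", "2", 30), ("CD", "1", 30)],
)

# One output row per prefix/value pair, with both kinds of count side by side.
result = conn.execute(
    """
    SELECT prefix || value AS Account,
           COUNT(DISTINCT thirdcolumn) AS distinct_totals,
           COUNT(*) AS row_totals
    FROM Transactions
    GROUP BY prefix, value
    ORDER BY Account
    """
).fetchall()
print(result)
```

For the pair AB/1 the two counts differ (2 distinct values in 3 rows), which is the case where choosing between COUNT(DISTINCT ...) and COUNT(*) matters.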

Clickhouse query with a LIMIT clause inefficiently reads too many rows

I'm querying ClickHouse with a query that has ORDER BY and LIMIT 1, where the ORDER BY matches the table's sort order. The query returns 1 row as expected; however, 50+ rows were scanned to return the result.
I would expect ClickHouse to scan only 1 row as the ORDER BY is in the table's sort order. What's happening here and what can I do to fix this?
SELECT * FROM comp_intel_scrapes
order by
client_slug,
client_hotel_id,
argset_id,
scrape_datetime,
preferred_country,
preferred_currency,
adults,
children,
nights,
min_checkin_date,
max_checkin_date
limit 1
----
Elapsed: 0.004s
Read: 54 rows (8.84KB)
By the way, Clickhouse.com's cloud is being used here.

It depends on the table engine.
The primary index is sparse: https://clickhouse.com/docs/en/guides/improving-query-performance/sparse-primary-indexes/sparse-primary-indexes-design/
Because of this, ClickHouse cannot read less than one granule (~8192 rows).
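A toy model of a sparse index shows why the number of rows read is tied to the granule and not to the LIMIT. This is a sketch under simplifying assumptions, not ClickHouse's implementation: keys are unique integers, one mark is kept per granule, and the function names are hypothetical. The 54-row table in the second call is also an assumption, suggested only by the "Read: 54 rows" line in the question.

```python
from bisect import bisect_right

GRANULE_SIZE = 8192  # rows per granule, the figure quoted in the answer

def build_sparse_index(sorted_keys, granule_size=GRANULE_SIZE):
    """Keep only the first key of every granule (one mark per granule)."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granule_size)]

def rows_read_for_lookup(sorted_keys, key, granule_size=GRANULE_SIZE):
    """Rows scanned to find `key` when a whole granule is the smallest readable unit."""
    marks = build_sparse_index(sorted_keys, granule_size)
    granule = max(bisect_right(marks, key) - 1, 0)  # last mark <= key
    start = granule * granule_size
    return len(sorted_keys[start:start + granule_size])

big_table_read = rows_read_for_lookup(list(range(100_000)), 12_345)
small_table_read = rows_read_for_lookup(list(range(54)), 0)
print(big_table_read, small_table_read)
```

In this model a lookup in a large table reads one full granule, and a table smaller than a granule is read in its entirety, even though only one row is returned.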

Efficient use of an index for a self join with a group by

I'm trying to speed up the following
create table tab2 parallel 24 nologging compress for query high as
select /*+ parallel(24) index(a ix_1) index(b ix_2)*/
a.usr
,a.dtnum
,a.company
,count(distinct b.usr) as num
,count(distinct case when b.checked_1 = 1 then b.usr end) as num_che_1
,count(distinct case when b.checked_2 = 1 then b.usr end) as num_che_2
from tab a
join tab b on a.company = b.company
and b.dtnum between a.dtnum-1 and a.dtnum-0.0000000001
group by a.usr, a.dtnum, a.company;
by using indexes
create index ix_1 on tab(usr, dtnum, company);
create index ix_2 on tab(usr, company, dtnum, checked_1, checked_2);
but the execution plan tells me that it's going to be an index full scan for both indexes, and the calculations are very long (1 day is not enough).
About the data: table tab has over 3 million records. No single column is unique. The unique values here are (usr, dtnum) pairs, where dtnum is a date with time written as a number in the format yyyy,mmddhh24miss. The columns checked_1 and checked_2 take values from the set (null, 0, 1, 2). The company column holds the id of a company.
Each pair has exactly one value of checked_1, checked_2 and company, since the pair is unique. Each user can appear in multiple pairs with different dtnum.
Edit
#Roberto Hernandez: I've attached the picture with the execution plan. As for parallel 24, in our company we are told to create tables with options 'parallel [num] nologging compress for query high'. I'm using 24 but I'm no expert in this field.
#Sayan Malakshinov: http://sqlfiddle.com/#!4/40b6b/2 Here I've simplified by giving data with checked_1 = checked_2, but in real life this may not be true.
#scaisEdge:
For
create index my_id1 on tab (company, dtnum);
create index my_id2 on tab (company, dtnum, usr);
I get

For table tab, your join condition is based on the columns
company, dtnum
so your index should be based primarily on these columns:
create index my_id1 on tab (company, dtnum);
The indexes you are using are useless because they don't have, in the leftmost positions, the columns used in the join/where condition.
Eventually, you can add usr and the checked columns in the rightmost positions to avoid the need for table access and let the database engine retrieve all the information from the index values:
create index my_id1 on tab (company, dtnum, usr, checked_1, checked_2);

Indexes (bitmap or otherwise) are not that useful for this execution. If you look at the execution plan, the optimizer thinks the group-by is going to reduce the output to 1 row. This results in serialization (PX SELECTOR), so I would question the quality of your statistics. What you may need is to create a column group on the three group-by columns, to improve the cardinality estimate of the group by.

HIVE equivalent of FIRST and LAST

I have a table with the following columns:
table1: ID, CODE, RESULT, RESULT2, RESULT3
I have this SAS code:
data table1;
set table1;
BY ID CODE;
IF FIRST.CODE and RESULT='A' THEN OUTPUT;
ELSE IF LAST.CODE and RESULT NE 'A' THEN OUTPUT;
RUN;
So we are grouping the data by ID and CODE, and then writing to the dataset if certain conditions are met. I want to write a hive query to replicate this. This is what I have:
proc sql;
create table temp as
select *, row_number() over (partition by ID, CODE) as rowNum
from table1;
create table temp2 as
select a.ID, a.CODE, a.RESULT, a.RESULT2, a.RESULT3
from temp a
inner join (select ID, CODE, max(rowNum) as maxRowNum
from temp
group by ID, CODE) b
on a.ID=b.ID and a.CODE=b.CODE
where (a.rowNum=1 and a.RESULT='A') or (a.rowNum=b.maxRowNum and a.RESULT NE 'A');
quit;
There are two issues I see with this.
1) The row that is first or last in each BY group is entirely dependent on the order of rows in table1 in SAS; we aren't ordering by anything. I don't think row order is preserved when translating to a Hive query.
2) The SAS code is taking the first row in each BY GROUP or the last, not both. I think that my HIVE query is taking both, resulting in more rows than I want.
Any suggestions or insight on how to improve my query is appreciated. Is it even possible to replicate this SAS code in HIVE?

The SAS code has a BY statement (BY ID CODE;), which tells SAS that the input dataset is sorted by those variables. So the selection for FIRST. and LAST. is not random.
That said, we can replicate this in Hive by using the first_value and last_value window functions.
FIRST.CODE should translate to
first_value(code) over (partition by id order by code) fcode
Similarly, LAST.CODE would be
last_value(code) over (partition by id order by code rows between unbounded preceding and unbounded following) lcode
The explicit frame is needed for last_value because, with an ORDER BY and no frame, the window ends at the current row and its peers, so last_value would just return the current row's code.
Once you have the fcode and lcode columns, use a CASE WHEN expression for the result column criteria, like
case when (code=fcode and result='A') or (code=lcode and result<>'A')
then 1 else 0 end as op_flag
Then fetch the rows where op_flag = 1.
Sample:
select id, code, result from (
select *,
first_value(code) over (partition by id order by code) fcode,
last_value(code) over (partition by id order by code
rows between unbounded preceding and unbounded following) lcode
from footab) f
where (code=fcode and result='A') or (code=lcode and result<>'A')

Regarding point 1), BY-group processing requires the input data to be sorted or indexed on the BY variables, so although the code contains no ordering, the source data is processed in order. If the input data were not indexed or sorted, SAS would throw an error.
Regarding this, possible differences are on rows with same values of BY variables, especially if the RESULT is different.
In SAS, I would pre-sort data by ID, CODE, RESULT, then use BY ID CODE in order to not be influenced by order of rows.
Regarding 2) FIRST and LAST can be both true in SAS. Since your condition for first and last on RESULT is different, I guess this is not a source of differences.
I guess you could add another field as
row_number() over (partition by ID, CODE desc) as rowNumDesc
to detect last row with rowNumDesc = 1 (so that you skip the join).
EDIT:
I think the two programs above both include random selection of rows for groups with same values of ID and CODE variables, especially with same values of RESULT. But you should get same number of rows from both. If not, just debug it.
However, the random aspect in the SAS code/storage is based on the physical order of rows, while the randomness of ROW_NUMBER within a group will be influenced by how the engine implements the function.
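To compare a Hive result against the SAS behaviour, it can help to have the DATA step logic written out explicitly. The following Python sketch mimics FIRST.CODE and LAST.CODE for input already sorted by ID and CODE; the function name and the sample rows are hypothetical, and the within-group order is simply the order of the input list.

```python
from itertools import groupby

def sas_first_last_filter(rows):
    """rows: dicts sorted by (ID, CODE); mimics the DATA step's IF / ELSE IF."""
    out = []
    for _key, group in groupby(rows, key=lambda r: (r["ID"], r["CODE"])):
        group = list(group)
        for i, row in enumerate(group):
            first = i == 0                 # FIRST.CODE
            last = i == len(group) - 1     # LAST.CODE
            if first and row["RESULT"] == "A":
                out.append(row)
            elif last and row["RESULT"] != "A":
                out.append(row)
    return out

rows = [
    {"ID": 1, "CODE": "x", "RESULT": "A"},
    {"ID": 1, "CODE": "x", "RESULT": "B"},
    {"ID": 1, "CODE": "y", "RESULT": "B"},
    {"ID": 2, "CODE": "x", "RESULT": "C"},
    {"ID": 2, "CODE": "x", "RESULT": "A"},
]
kept = [(r["ID"], r["CODE"], r["RESULT"]) for r in sas_first_last_filter(rows)]
print(kept)
```

In this sample, group (1, x) contributes both its first and its last row, the single-row group (1, y) contributes its only row through the LAST branch, and group (2, x) contributes nothing. Reordering the rows inside a group changes the result, which is the order dependence discussed above.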

Oracle aggregate function to return a random value for a group?

The standard SQL aggregate function max() will return the highest value in a group; min() will return the lowest.
Is there an aggregate function in Oracle to return a random value from a group? Or some technique to achieve this?
E.g., given the table foo:
group_id value
1 1
1 5
1 9
2 2
2 4
2 8
The SQL query
select group_id, max(value), min(value), some_aggregate_random_func(value)
from foo
group by group_id;
might produce:
group_id max(value), min(value), some_aggregate_random_func(value)
1 9 1 1
2 8 2 4
with, obviously, the last column being any random value in that group.

You can try something like the following
select deptno,max(sal),min(sal),max(rand_sal)
from(
select deptno,sal,first_value(sal)
over(partition by deptno order by dbms_random.value) rand_sal
from emp)
group by deptno
/
The idea is to sort the values within each group in random order and pick the first. I can think of other ways, but none as efficient.
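The same idea, sorting each group by a random key and taking the first element, can be sketched outside the database. This Python version is only an illustration of the technique on the foo table from the question; the function name and the rng parameter are hypothetical.

```python
import random
from collections import defaultdict

def random_value_per_group(rows, rng=random):
    """rows: (group_id, value) pairs. Returns {group_id: (max, min, random pick)}."""
    groups = defaultdict(list)
    for group_id, value in rows:
        groups[group_id].append(value)
    result = {}
    for group_id, values in groups.items():
        # Same idea as ORDER BY dbms_random.value plus first_value:
        # order the group by a random key, then take the first element.
        shuffled = sorted(values, key=lambda _v: rng.random())
        result[group_id] = (max(values), min(values), shuffled[0])
    return result

foo = [(1, 1), (1, 5), (1, 9), (2, 2), (2, 4), (2, 8)]
picked = random_value_per_group(foo, random.Random(0))
print(picked)
```

Passing a seeded random.Random instance makes a run repeatable; the third element of each tuple is always some value from that group, but which one depends on the seed.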

You might prepend a random string to the column you want to extract the random element from, then select the min() of that column and strip off the prepended string.
select group_id, max(value), min(value), substr(min(random_value),11)
from (select dbms_random.string('A', 10)||value random_value,foo.* from foo)
group by group_id
In this way you can avoid using the aggregate function and specifying the group by twice, which might be useful in a scenario where your query is very complicated, or where you are just exploring the data and manually entering queries with a lengthy and changing list of group by columns.
