Query 1:
SELECT COUNT(1) FROM STUDENTS
Query 2:
SELECT COUNT(*) FROM STUDENTS
Both queries return the same result, but is there any performance difference between the two?
What I had heard is that the first query would be faster than the second one, but can anyone give specific details about it?
You may use count(*) or count(1); one is not faster than the other. As stated, it is just an urban legend :)
One final note: count(*) and count(columnName) may be different! The first one counts all rows; the second one counts the number of rows where the specified column is not NULL.
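For example, assuming the STUDENTS table has a nullable EMAIL column (a hypothetical column, not mentioned in the question), the two forms can return different numbers:
SELECT COUNT(*) FROM STUDENTS      -- counts every row
SELECT COUNT(EMAIL) FROM STUDENTS  -- counts only rows where EMAIL is not NULL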
There is no difference whatsoever between the two statements.
The rumour that count(1) is faster is an urban legend that was never true.
In my case, I have some Hive tables, and the partition column (dt) is the only column that every table contains.
I execute the SQL below in Hive:
SELECT * FROM (
SELECT row_number() over(ORDER BY T.dt) as row_num, T.* FROM
(select * from ods.test_table where dt='2021-09-06') as T) TT
WHERE TT.row_num BETWEEN 1 AND 10
I get the same result every time.
But when I execute the SQL in Presto, the result is not the same from run to run. I think the root cause is that my table lacks a unique key.
Is it possible to do such a global query deterministically without a unique key in Presto?
You are calculating row_number
row_number() over(ORDER BY T.dt)
and the ORDER BY column always has the same value (dt='2021-09-06'). In this case row_number has non-deterministic behavior and can assign the same numbers to different rows from run to run.
The fact that you are always getting the same results in Hive is a coincidence: probably you are always running with exactly the same number of splits, or even on a single mapper, which runs single-threaded and produces results that look deterministic. Presto may have different parallelism, and that affects which rows are passed to row_number first.
You can try changing the split configuration to force more mappers, or increase the data size, and you will be able to reproduce the non-deterministic behavior: many mappers running in parallel on a heavily loaded cluster will execute at different speeds, and different rows will be passed to row_number.
To have deterministic results, you can add some columns to the ORDER BY which determine the order of rows. If you have no such columns, then it means that you can have any number of full duplicates.
Even if you do not have a unique key, row_number will produce deterministic results if ALL columns are in the ORDER BY.
Consider this dataset:
Col1 Col2 Col3
1 1 2
1 1 2
1 1 3
1 1 3
row_number() over(ORDER BY col1) as rn can produce all 4 rows ordered differently each run (suppose the dataset is a very big one and many mappers are running concurrently: some mappers can finish faster, some can fail and restart). Of course, if you have such a small dataset and always process it in a single process, single-threaded, the result will be the same, but in general this is not how databases work.
The same applies to row_number() over(ORDER BY col1, col2).
But in the case of row_number() over(ORDER BY col1, col2, col3) you will always get the same result, guaranteed.
So, the solution is to use as many ORDER BY columns as needed to determine the order of rows. In the worst case, if you have full duplicates, all columns should be added to the ORDER BY; duplicates will be ordered together and the result will be deterministic.
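Applied to the original query, that means extending the ORDER BY with tie-breaker columns. A sketch, assuming ods.test_table also has columns col1 and col2 (hypothetical names; use the table's real columns):
SELECT * FROM (
SELECT row_number() over(ORDER BY T.dt, T.col1, T.col2) as row_num, T.* FROM
(select * from ods.test_table where dt='2021-09-06') as T) TT
WHERE TT.row_num BETWEEN 1 AND 10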
I am running a query on a large table and I am expecting a large number of returned rows.
Unfortunately I need to order the result by 2 columns, which makes the query quite slow.
I added an index on those specific columns but was wondering if the order direction makes a difference.
One column is ordered desc and the other asc.
thanks and best wishes,
e.
Your query might benefit from an index ordered the same way as your ORDER BY clause, e.g.
create index index1 on table1 (col1 desc, col2 asc);
Whether it will benefit depends on the relative cost of the index scans and table lookups versus a simple full table scan. If the number of rows you want is low relative to the total number of rows in the table the query might benefit.
The only way to know for sure is try it.
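For example, a query shaped like the one below could use that index and avoid a separate sort step (a sketch; the selected columns are hypothetical):
select col1, col2
from table1
order by col1 desc, col2 asc;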
I have a query similar to this
select *
from small_table A
inner join huge_table B on A.DATE =B.DATE
The huge_table is partitioned by DATE, and the PK is DATE, some_id and some_other_id (so the join is not done via the PK index).
small_table just contains a few dates.
The total run time of the SQL is 48 minutes.
For some reason the explain plan gives me a "PARTITION RANGE (ALL)" with high cardinality numbers. It looks like it accesses the full table, not just the partitions indicated by small_table.DATE.
If I put the SQL inside a loop and do
for o in (select date from small_table)
loop
select *
from small_table A
inner join huge_table B on A.DATE =B.DATE
where B.DATE = o.DATE;
end loop;
Only takes 2 minutes 40 seconds (the full loop).
Is there any way to force partition pruning on Oracle 12c?
Additional info:
small_table has 37 records for 13 different dates. huge_table has 8,000 million records across 179 dates/partitions. The SQL needs one field from small_table, but I can tweak the SQL to not use it.
Update:
With the use_nl hint, the cardinality shown in the execution plan is now more accurate and the execution time drops from 48 minutes to 4 minutes.
select /*+ use_nl(B) */ *
from small_table A
inner join huge_table B on A.DATE =B.DATE
This seems like the problem:
"small_table have 37 registries for 13 different dates. huge_table has 8.000 millions of registries with 179 dates/partitions....
The SQL need one field from small_table, but I can tweak the SQL to not use it "
According to the SQL you posted you're joining the two tables on just their DATE columns with no additional conditions. If that's really the case you are generating a cross join in which each partition of huge_table is joined to small_table 2-3 times. So your result set may be much large than you're expecting, which means more database effort, which means more time.
The other thing to notice is that the cardinality of small_table to huge_table partitions is about 1:4; the optimizer doesn't know that there are really only thirteen distinct huge_table partitions in play.
Optimization ought to be a science and this is more guesswork than anything, but try this:
select B.*
from ( select /*+ cardinality(t 13) */
distinct t.date
from small_table t ) A
inner join huge_table B
on A.DATE =B.DATE
This should communicate to the optimizer that only a small percentage of the huge_table partitions are required, which may make it choose partition pruning. Also it removes that Cartesian product, which should improve performance too. Obviously you will need to apply that tweak you mentioned, to remove the need to query anything else from small_table.
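One way to check whether pruning actually kicks in (a sketch using the standard EXPLAIN PLAN and DBMS_XPLAN tools, not part of the original answer) is to look at the Pstart/Pstop columns of the plan:
explain plan for
select B.*
from ( select /*+ cardinality(t 13) */ distinct t.date
       from small_table t ) A
inner join huge_table B on A.DATE = B.DATE;

select * from table(dbms_xplan.display);
-- A pruned plan shows PARTITION RANGE (ITERATOR) or (KEY) with a narrow Pstart/Pstop range,
-- instead of PARTITION RANGE (ALL) covering all 179 partitions.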
The question is similar to this one, except I want to know if I can do it in one query. This is what I have working, but as we all know joins are expensive. Any better HQL to do this?
select a.tbl1,b.tbl2
from
(
select count(*) as tbl1 from tbl1
) a
join
(
select count(*) as tbl2 from tbl2
) b ON 1=1
Yes, joins are expensive
When it is said that joins are expensive, this typically refers to the situation where you have many records in multiple tables that need to be matched with each other.
According to that description your join is not expensive, as you only join 2 sets with 1 record each.
But you must be looking at overhead
Perhaps you notice that the individual counts take significantly less time than the command you use to count and combine the results. This would be because map and reduce operations have significant overhead (which can be 30 seconds per stage).
You can play around a bit to see whether you hit a plan that does not incur much overhead, but it could well be that you are out of luck, as Hive does not scale down that well.
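To see how many stages the query actually produces, one option (a sketch; EXPLAIN is standard Hive syntax) is to inspect the plan:
explain
select a.tbl1, b.tbl2
from (select count(*) as tbl1 from tbl1) a
join (select count(*) as tbl2 from tbl2) b on 1=1;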
If it is not critical for you to keep them as separate columns, you can use a UNION ALL operation and work with a row format instead:
select 'tbl1', count(*) from tbl1
UNION ALL
select 'tbl2', count(*) from tbl2;
This would allow you to avoid the extra MAPJOIN operator in your former query. Technically you can have one less mapper in your final execution plan.
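If you do still need the two counts as separate columns in a single row, one option (a sketch, not from the original answer) is to pivot the UNION ALL result with conditional aggregation:
select max(case when tbl = 'tbl1' then cnt end) as tbl1,
       max(case when tbl = 'tbl2' then cnt end) as tbl2
from (
  select 'tbl1' as tbl, count(*) as cnt from tbl1
  union all
  select 'tbl2' as tbl, count(*) as cnt from tbl2
) t;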
Update
In up-to-date Hadoop distributions you will not see much difference in performance between the UNION and MAPJOIN approaches, as these operations are optimized within the preceding jobs. But keep in mind that on older cluster versions, or depending on some configuration properties, the MAPJOIN could be converted into a separate job.
I was wondering how this query is executed:
SELECT TOP 10 * FROM aSybaseTable
WHERE aCondition
The fact is that this query is taking too much time to return results.
So I was wondering if the query is smart enough to stop when the results reach 10 rows, or if it computes all the possible results and then returns only the first 10 rows.
Thanks in advance for your replies!
When using select top N the query is still executed fully; just the data page reads stop after the specified number of rows is affected. All the index page reads and sorts still have to occur, so depending on the complexity of the where condition or subqueries, it can definitely still take time to execute. select top N is functionally similar to using set rowcount.
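For illustration, the set rowcount equivalent mentioned above would look roughly like this (a sketch reusing the table and condition from the question):
set rowcount 10
select * from aSybaseTable where aCondition
set rowcount 0   -- reset so later queries are not limited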
Michael is right, but there is one special case here that really needs to be mentioned.
The query WILL execute faster, and only partially, if no ORDER BY or GROUP BY clauses are used.
But this case is rarely useful, since you will then get an arbitrary N rows which fulfil the condition, in the order they are physically located in the table/index.
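To make the contrast concrete (a sketch; someColumn is a hypothetical column used only for ordering):
select top 10 * from aSybaseTable where aCondition
-- can stop early: no ordering requested, rows come back in physical order
select top 10 * from aSybaseTable where aCondition order by someColumn
-- must identify (and sort) the full qualifying set before the first 10 rows are known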