Original query:
SELECT CAST(cust_mart.acct_identifier AS STRING) as f0
FROM cts_work.cust_xref cust_mart
GROUP BY cust_mart.f0;
Can I replace the above query with the query below?
SELECT DISTINCT CAST(cust_mart.acct_identifier AS STRING) as f0
FROM cts_work.cust_xref cust_mart;
Reason:
There is no aggregation, so the GROUP BY does not make sense, but I am still confirming my approach. I am running this query on Hive using the Tez engine.
Use the EXPLAIN command and compare the two query plans to check the difference. These queries should generate identical plans. GROUP BY will work the same as DISTINCT in this case: DISTINCT is also an aggregation, just another way of writing the same GROUP BY.
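For example, one quick way to compare them side by side (a minimal sketch using the table from the question; I group by the full cast expression here, which Hive accepts regardless of whether alias references are allowed in GROUP BY):
EXPLAIN
SELECT CAST(cust_mart.acct_identifier AS STRING) AS f0
FROM cts_work.cust_xref cust_mart
GROUP BY CAST(cust_mart.acct_identifier AS STRING);

EXPLAIN
SELECT DISTINCT CAST(cust_mart.acct_identifier AS STRING) AS f0
FROM cts_work.cust_xref cust_mart;
If both plans end up with the same aggregation step on the cast expression, the rewrite is safe.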
Related
I am facing a big performance problem when trying to get a list of objects with pagination from an Oracle 11g database.
As far as I know, and as much as I have checked online, the only way to achieve pagination in Oracle 11g is the following:
Example: [page=1, size=100]
SELECT * FROM
(
    SELECT pagination.*, rownum r__ FROM
    (
        select * from "TABLE_NAME" t
        inner join X on X.id = t.id
        inner join .....
        where ......
        order by ......
    ) pagination
    WHERE rownum <= 200
)
WHERE r__ > 100
The problem with this query is that the innermost query, which fetches data from the table "TABLE_NAME", returns a huge amount of data and causes the overall query to take 8 seconds (around 2 million records are returned after applying the where clause, and it contains 9 or 10 join clauses).
The reason is that the innermost query fetches all the data that satisfies the where clause, then the second query takes the first 200 rows, and the third excludes the first 100 to get the second page's data we need.
Isn't there a way to do this in one query, fetching only the second page's data we need, without going through all these steps and causing performance issues?
Thank you!!
It depends on your sorting options (order by ...): the database needs to sort the whole dataset before applying the outer where rownum <= 200 filter because of your order by clause.
It will fetch only 200 rows if you remove the order by clause. In some cases Oracle can avoid the sort operation altogether (for example, if it can use an index to return the requested data in the required order). By the way, Oracle uses an optimized sort for rownum < N predicates: it doesn't sort the full dataset, it just keeps the top N records instead.
You can investigate sort operations more deeply using the sort trace event: alter session set events '10032 trace name context forever, level 10';
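You can also see whether the top-N optimization kicked in directly from the execution plan, without reading a trace file. A minimal sketch (the order by column is a placeholder, not from the question); for a rownum filter over a sorted subquery you would typically look for SORT ORDER BY STOPKEY / COUNT STOPKEY steps instead of a plain SORT ORDER BY:
explain plan for
select *
from (select * from "TABLE_NAME" t order by t.id)
where rownum <= 200;

select * from table(dbms_xplan.display);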
Furthermore, sometimes it's better to use analytic functions like:
select *
from (
    select
         t1.*
        ,t2.*
        ,row_number() over ([partition by ...] order by ...) rn
    from t1
        ,t2
    where ...
)
where rn <= 200
  and rn > 100
because in some specific cases Oracle can transform your query to push the sorting and the rank filter predicates down to the earliest possible steps.
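Whether that transformation happened is again visible in the plan: compare the two variants with EXPLAIN PLAN and look for a WINDOW SORT PUSHED RANK step (the rank filter pushed into the sort) rather than a plain WINDOW SORT followed by a late filter. A hedged sketch, with t1 and its id column as placeholders:
explain plan for
select *
from (
    select t1.*, row_number() over (order by t1.id) rn
    from t1
)
where rn <= 200 and rn > 100;

select * from table(dbms_xplan.display);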
I have two queries. The first retrieves an aggregate from another table as a column, using a subquery in the select (it returns a string concatenation of a column for all rows).
The second query does the same by putting a subselect in the from and then joining the results. This second query, however, does the aggregate on the complete table before joining, yet it is much faster (286 ms vs 7645 ms).
I don't understand why the subquery is so much slower, while the second query does the aggregate over a table with 175k rows (on PostgreSQL 9.5). Using a subselect is much easier to integrate into a query builder, so I would like to use that, and the second query will slow down as the number of records increases. Is there a way to increase the speed of a subselect?
Query 1:
select kp_No,
(select string_agg(description,E'\n') from (select nt_Text as description from fgeNote where nt_kp_No=fgeContact.kp_No order by nt_No DESC limit 3) as subquery) as description
from fgeContact
where kp_k_No=729;
Explain: https://explain.depesz.com/s/8sL
Query 2:
select kp_No, NoteSummary
from fgeContact
LEFT JOIN
(select nt_kp_No, string_agg(nt_Text,E'\n') as NoteSummary
from
(select nt_kp_No, nt_Text from fgeNote ORDER BY nt_No DESC) as sortquery
group by nt_kp_No) as joinquery
ON joinquery.nt_kp_No=kp_No
where kp_k_No=729;
Explain: https://explain.depesz.com/s/yk9W
This is because in the second query you retrieve all the rows in a single scan, while in the first one the subquery is executed once for each selected row of the master table, so the table has to be scanned again each time.
Even with index scans, that is usually more expensive than scanning the whole table, even sequentially, and picking out only the interesting rows (in fact, a sequential scan is much faster than an index scan when selecting more than a few rows, because using the index implies some overhead).
But that also depends on the actual data distribution. It's perfectly possible that, for a different value of kp_k_No, the first query would become the faster one, if the table contains only one or a few rows with that value for this parameter.
It's a matter of testing and anticipating the different situations that will happen...
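If the correlated form stays too slow for some parameter values but you still want the "latest 3 notes per contact" limit, one possible middle ground (not from the original answer; it assumes PostgreSQL 9.3 or later) is a LATERAL join, which reads like the join version but is still driven row by row from fgeContact:
select c.kp_No, n.NoteSummary
from fgeContact c
left join lateral (
    select string_agg(nt_Text, E'\n') as NoteSummary
    from (
        select nt_Text
        from fgeNote
        where nt_kp_No = c.kp_No
        order by nt_No desc
        limit 3
    ) as last_notes
) as n on true
where c.kp_k_No = 729;
An index on fgeNote (nt_kp_No, nt_No desc) usually helps both this form and the original correlated subquery.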
I'm looking for the highest key in a table with quite a lot of records, and I'm wondering if LINQ to SQL does this efficiently, or if I need to take another approach to make sure the code isn't pulling all of the records across from the database.
int tkey =
(from r in GMSCore.db.Receipts
orderby r.TransactionKey descending
select r.TransactionKey).FirstOrDefault();
Is this interpreted like:
select Top(1) TransactionKey
from Receipt
order by TransactionKey desc
Or does it pull all of the records and then filter in the C# code?
It retrieves just one record. The generated query will look like:
SELECT TOP(1) [t0].[TransactionKey]
FROM Receipt [t0]
ORDER BY [t0].[TransactionKey] DESC
The query is not executed until you call FirstOrDefault() or another operator which forces execution (see Classification of Standard Query Operators by Manner of Execution).
You can even save the query definition into a variable (note the difference: not the result of executing the query, but the query itself):
var query = from r in GMSCore.db.Receipts
orderby r.TransactionKey descending
select r.TransactionKey;
// Nothing is executed yet
int tkey = query.FirstOrDefault(); // SQL query is generated and executed
You can execute the original query later:
foreach(var key in query) // another SQL is generated and executed
// ...
But in this case the generated SQL query will look like:
SELECT [t0].[TransactionKey]
FROM Receipt [t0]
ORDER BY [t0].[TransactionKey] DESC
So you can modify the query definition until it is executed, e.g. you can add filtering or change the ordering. In fact, the operator which executes the query usually also affects query generation, e.g. selecting the first, single, or max item.
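As an aside (not part of the original answer): if all you need is the highest key, the same result can also be expressed as an aggregate, e.g. GMSCore.db.Receipts.Max(r => r.TransactionKey), which LINQ to SQL translates into roughly:
SELECT MAX([t0].[TransactionKey])
FROM Receipt [t0]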
Maybe this question is a little stupid, but I'm confused.
How do I group records by a specified column? :)
Item.group(:category_id)
doesn't work...
It says:
ActiveRecord::StatementInvalid: PGError: ERROR: column "items.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT "items".* FROM "items" GROUP BY category_id
What kind of aggregate function should I use?
Please, could you provide a simple example?
You will have to define how to combine the values that share the same category_id. Concatenate them? Calculate a sum?
To create comma-separated lists of values your statement could look like this:
SELECT category_id
,string_agg(col1, ', ') AS col1_list
,string_agg(col2, ', ') AS col2_list
FROM items
GROUP BY category_id
You need Postgres 9.0 or later for string_agg(col1, ', ').
In older versions you can substitute with array_to_string(array_agg(col1), ', '). More aggregate functions here.
Aggregating values in PostgreSQL is clearly the superior approach compared to aggregating them in the client. Postgres is very fast at this, and it reduces (network) traffic.
You can use sum, avg, count, or any other aggregate function. You can find more on this topic here.
But it seems that you don't really need SQL grouping.
Try fetching all the records and then using Ruby's Enumerable#group_by to group the Items by category_id.
Grouping in SQL means that the server collapses one or more records from the database table into one resulting row. So if you, for example, group by category_id, several records may match a given category, so you can't expect the database to return all columns from the table (which is what SELECT * actually does).
Instead, when you use GROUP BY, you can SELECT only:
columns you have grouped by, and/or
aggregate functions which are performed on all the records belonging to a resulting group
Depending on what exactly you need, modify your .select accordingly.
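For example (a minimal sketch, not from the original answers), counting the items per category selects only the group key plus an aggregate:
SELECT category_id
,count(*) AS items_count
FROM items
GROUP BY category_id;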
Well, this problem is general in SQL Server CE.
I have indexes on all the fields.
Also, the same query but with ID IN (list of int ids) is pretty fast.
I tried to change the query to an OUTER JOIN but this just made it worse.
So, any hints on why this happens and how to fix this problem?
That's because the index is not really helpful for that kind of query, so the database has to do a full table scan. If the query is (for some reason) slower than a simple "SELECT * FROM TABLE", do that instead and filter the unwanted IDs in the program.
EDIT: from your comment, I see that you use a subquery instead of a list. Because of that, there are three possible ways to do the same thing (hopefully one of them is faster):
Original statement:
select * from mytable where id not in (select id from othertable);
Alternative 1:
select * from mytable where not exists
(select 1 from othertable where mytable.id=othertable.id);
Alternative 2:
select * from mytable
minus
select mytable.* from mytable inner join othertable on mytable.id=othertable.id;
Alternative 3: (ugly and hard to understand, but if everything else fails...)
select * from mytable
left outer join othertable on (mytable.id=othertable.id)
where othertable.id is null;
This is not a problem specific to SQL Server CE, but to databases in general.
The IN operation is sargable and NOT IN is non-sargable.
What does this mean?
Sargable stands for Search ARGument ABLE; it means the DBMS engine can take advantage of an index. For a non-sargable predicate, the index can't be used.
The solution might be to use a filter statement to remove those IDs.
More in SQL Performance Tuning by Peter Gulutzan.
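To illustrate the difference with the table names used above (a sketch, not from the book): the first predicate can be answered by seeking the index on id, while the second generally forces the engine to check every row:
-- sargable: the engine can seek the index on id
select * from mytable where id in (1, 2, 3);

-- non-sargable in practice: every row is compared against the subquery result
select * from mytable where id not in (select id from othertable);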
ammoQ is right: an index does not help much with your query. Depending on the distribution of values in your ID column, you could optimise the query by specifying which IDs to select rather than which ones not to select. If you end up requesting, say, more than ~25% of the table, the index will not be used anyway, because for nonclustered indexes (which are the only type of index SQL CE supports, if memory serves) it would be cheaper to scan the table. Otherwise (if the query is actually selective) you could rewrite the query with ID ranges to select ('union all' may work better than 'or' to combine the ranges, if SQL CE supports 'union all'; I'm not sure).
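For instance, if the IDs you do want happen to fall into a few contiguous blocks, the rewrite could look like this (the ranges are purely hypothetical and depend on your data):
select * from mytable where id between 1 and 4999
union all
select * from mytable where id between 6001 and 10000;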