Most efficient way to select in bulk from a multi million records table - oracle

I'm interested in getting and doing some processing on all the entities A returned by a query of the form:
SELECT * FROM A a WHERE a.id not in (select b.id from B)
Where A is a "complex" entity in the sense that it inherits (InheritanceTyped.Joined) from other entities and that several of its attributes are other entities (#OneToOne and #ManyToOne).
The query itself takes a few minutes to yield results hence my desire to execute it as few as possible.
Here are the different approaches i tried to get those A elements as efficiently as possible :
Pagination using setFirstResult/ setMaxResults
Do the job, but pretty slowly as the query seems to be executed everytime.(around 50 elements processed/sec)
Getting IDs first, A objects next
Keeping all the IDs in memory is doable, so I execute once
SELECT a.id FROM A a WHERE a.id not in (select b.id from B)
and then select a from A a WHERE a.id= :id, which goes relatively fast as the id column is indexed. This is currently the solution that is the most efficient with (around 100 elements processed/sec)
Using ScollableResults I had high hope with this solution, but it ended up being slower than other alternatives, leaving me at around 20 elements processed/sec ...
As a neophyte, I don't know what other options to investigate, or if I did something wrong in any of my attempts.
Hence my questions:
Are there (factually) other approaches to efficiently tackle this kind of problem ?
Is it normal that ScrollableResults performed so poorly ? Is there something I should have paid attention to while implementing this solution?
EDIT:
Here's the execution plan

Related

How make view use index

The view has several join, but no WHEREcloses. It helped our developers to have all the data the needed in one appian object, that could be easily used in the "low code" later on. In most cases, Appian add conditions to query the data on the view, in a subsequent WHERE clause like below:
query: [Report on Record Type], order by: [[Sort[histoDateAction desc], Sort[id asc]]],
filters:[((histoDateAction >= TypedValue[it=9,v=2022-10-08 22:00:00.0])
AND (histoDateAction < TypedValue[it=9,v=2022-10-12 22:00:00.0])
AND (histoUtilisateur = TypedValue[it=3,v=miwem6]))
]) (APNX-1-4198-000) (APNX-1-4205-031)
Now we start to have data in the database, and performances get low. Reason seems to be, from execution plan view, query do not use indexes when data is queried.
Here is how the query for view VIEW_A looks:
SELECT
<columns> (not much transformation here)
FROM A
LEFT JOIN R on R.id=A.id_type1
LEFT JOIN R on R.id=A.id_type2
LEFT JOIN R on R.id=A.id_type3
LEFT JOIN U on U.id=A.id_user <500>
LEFT JOIN C on D.id=A.id_customer <50000>
LEFT JOIN P on P.id=A.id_prestati <100000>
and in the current, Appian added below clauses:
where A.DATE_ACTION < to_date('2022-10-12 22:00:00', 'YYYY-MM-DD HH24:MI:SS')
and A.DATE_ACTION >= to_date('2022-10-08 22:00:00', 'YYYY-MM-DD HH24:MI:SS')
and A.USER_ACTION = 'miwem6'
typically, when I show explain plan for the VIEW_A WHERE <conditions> , I have a cost around 6'000, and when I show explain plan for the <code of the view> where <clause>, the cost is 30.
Is it possible to use some Oracle hint to tell it: "Some day, someone will query this adding a WHERE clause on some columns, so don't be a stupid engine and use indexes when time comes"?
First, this isn't a great architecture. I can't tell you how many times folks have pulled me in to diagnose performance problems due to unpredictably dynamic queries where they are adding an assortment of unforseable WHERE predicates.
But if you have to do this, you can increase your likelihood of using indexes by lowering their cost. Like this:
SELECT /*+ opt_param('optimizer_index_cost_adj',1) */
<columns> (not much transformation here)
FROM A . . .
If you know for sure that nested loops + index use is the way you want to access everything, you can even disable the CBO entirely:
SELECT /*+ rule */
<columns> (not much transformation here)
FROM A . . .
But of course it's on you to ensure that there's an index on every high cardinality column that your system may use to significantly filter desired rows by. That's not every column, but it sounds like it may be quite a few.
Oh, and one more thing... please ignore COST. By definition Oracle always chooses what it computes as the lowest cost plan. When it makes wrong choices, it's because it's computation of cost is incorrect. Therefore by definition, if you are having problems, the COST numbers you see are wrong. Ignore them.

How to speed-up a spatial join in BigQuery?

I have a BigQuery table with point registers along a whole country, and I need to assign a "censal zone" to each one of them, which polygons are contained in another table. I've been trying to do so using a query like this one:
SELECT id_point, code_censal_zone
FROM `points_table`
JOIN `zones_table`
ON ST_CONTAINS(zone_polygon, point_geo)
The first table is quite large, so the query performes very inefficiently as it is comparing each possible pairs of (point, censal zone). However, both tables have a column identifier for the municipality in which they are in, so the question is, can rewrite my query in some way that ST_CONTAINS(*) is performed for each (point, censal zone) pair that belongs to the same municipality, hence not comparing all posible censal zones within the country for each point? Can I do this without having to read points_table multiple times?
SELECT id_point, code_censal_zone
FROM `points_table`
JOIN `zones_table`
ON 1.municipality = 2.municipality
AND ST_CONTAINS(zone_geo, point_geo)
I'm quite new to BigQuery so I don't really know if a query like this would actually do what I'am expecting, as I couldn't find anything in the documentation.
Thanks!
SELECT id_point, code_censal_zone
FROM `points_table`
JOIN `zones_table`
ON 1.municipality = 2.municipality
AND ST_CONTAINS(zone_geo, point_geo)

making inner join linq or using [table].[joiningTable].column is the same?

I have recently started working on linq and I was wondering suppose I have 2 related tables Project (<=with fkAccessLevelId) and AccessLevel and I want to just select values from both tables. Now there are 2 ways I can select values from these tables.
The one i commonly use is:
(from P in DataContext.Projects
join AL in DataContext.AccessLevel
on P.AccessLevelId equals AL.AccessLevelId
select new
{
ProjectName = P.Name,
Access = AL.AccessName
}
Another way of doing this would be:
(from P in DataContext.Projects
select new
{
ProjectName = P.Name,
Access = P.AccessLevel.AccessName
}
What i wanted to know is which of these way is efficient if we increase the number of table say 5-6 with 1-2 tables containing thousands of records...?
You should take a look at the SQL generated. You have to understand that there are several main performance bottle necks in a Linq query (in this case I assume a OMG...Linq to SQL?!?!) the usual main bottle neck is the SQL query on the server.
Typically SQL Server has a very good optimizer, so actually, given the same query, refactored, the perf is pretty uniform.
However in your case, there is a very real difference in the two queries. A project with no Access Level would not appear in the first query, whilst the second query would return with a null AccessName. In effect you would be comparing a LEFT JOIN to an INNER JOIN.
TL:DR For SQL Server/Linq to Entity Framework queries that do the same thing should give similar performance. However your queries are far from similar.

Should I apply string manipulation after or before joining tables in Oracle

I have two tables need to inner join, one table has relatively small number of records compared to the other one. I need to apply some string manipulation to the smaller table, and my question is can I apply the string function after the join, or should I apply them in a sub query and then join the sub select to the bigger table?
An example would be something like this:
Option 1:
SELECT SUBSTR("SMALL_TABLE"."COL_NAME",x,y) "NEW_COL" FROM "BIG_TABLE"
JOIN "SMALL_TABLE" ON ...
Option 2:
SELECT "NEW_COL"
FROM "BIG_TABLE"
JOIN
(
SELECT SUBSTR("SMALL_TABLE"."COL_NAME",x,y) "NEW_COL" FROM "SMALL_TABLE"
) "T"
ON ...
Which is better for performance option 1 or 2?
I am using oracle 11g.
Regardless of how you structure the query, Oracle's optimizer is free to evaluate the function before or after the join. Assuming that the string manipulation is only done as part of the projection step (i.e. it is done only in the SELECT clause and is not used as a predicate in the WHERE clause), I would expect that Oracle would apply the SUBSTR before joining the tables if you used either formulation because it would then have to apply the function to fewer rows (though it can probably treat the SUBSTR as a deterministic call and cache the results if it applies the function after the join).
As with any query optimization question, the first step is always to generate a query plan and see if the different queries actually produce different plans. I would expect the plans to be identical and, thus, the performance to be identical. But there are any number of reasons that one of the two options might produce different plans on your system given your optimizer statistics, initialization parameters, etc.
It is better to apply the operations before doing the join and then joining and querying for the final result. This is called query optimization.
By doing so for ur question you will perform lesser operations when "join"ing as u will be eliminating the useless rows beforehand.
Lots of examples here : http://beginner-sql-tutorial.com/sql-query-tuning.htm
and this is the best one I could find : http://www.cse.iitb.ac.in/~sudarsha/db-book/slide-dir/ch14.ppt‎

Linq-to-NHibernate OrderBy Not Working

I'm trying order a Linq to NHibernate query by the sum of it's
children.
session.Linq<Parent>().OrderBy( p => p.Children.Sum( c => c.SomeNumber ) ).ToList()
This does not seem to work. When looking at NHProf, I can see that it
is ordering by Parent.Id. I figured that maybe it was returning the
results and ordering them outside of SQL, but if I add a .Skip
( 1 ).Take( 1 ) to the Linq query, it still orders by Parent.Id.
I tried doing this with an in memory List and it works just
fine.
Am I doing something wrong, or is this an issue with Linq to
NHibernate?
I'm sure I could always return the list, then do the operations on
them, but that is not an ideal workaround, because I don't want to
return all the records.
To order a query by an aggregate value, you need to use a group by query. In your example you need to use a 'group by' with a join.
The equivelant SQL would be something like:
select id, sum(child.parentid) as childsum from dbo.Parent
inner join child on
parent.id= child.parentid
group by id
order by childsum desc
If only Linq2NH could do this for us... but it can't sadly. You can do it in HQL as long as you're ok with getting the id, and the sum back, but not the whole object.
I battled with Linq2NH for months before I abandoned it. Its slow, doesn't support 2nd level cache and only supports VERY basic queries. (at least when I abandoned it 6 months ago - it may have come along leaps and bounds!) Its fine if you are doing a simple home-made application. If your application is even remotely complex ditch it and spend a bit of time getting to know HQL: a little harder to understand but much more powerful.

Resources