Inserting data from and to the same table - IN or INNER JOIN? - performance

In an application I have 2000 Accounts. The first Account contains 10000 Clients, which is the maximum limit for an Account. Users can select Clients from the first Account and then select some of the other Accounts to copy those Clients into. So the possible maximums are 1999 target Accounts and 10000 Clients.
Currently I’m looping over the Account list and calling a Stored Procedure in each iteration from the client application. With each iteration, an Account Id and a table-valued parameter containing the list of Client ids are passed to the SP. While testing with 500 Accounts and 10000 Clients, it takes 25 minutes, 34 seconds and 543 milliseconds. At some point within the SP I’m using the following code –
INSERT INTO Client
SELECT AccountId, CId, Code, Name, Email FROM Client
WHERE Client.Id IN(SELECT Id FROM #clientIdList)
where #clientIdList is the table-type variable that holds the 10000 Client Ids.
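For context, the table type and the per-iteration call look roughly like this (the type and procedure names are placeholders, not the real ones):
-- placeholder table type holding the selected Client ids
CREATE TYPE dbo.ClientIdList AS TABLE (Id INT PRIMARY KEY);
GO
-- issued once per selected Account from the client application
DECLARE @clientIdList dbo.ClientIdList;
INSERT INTO @clientIdList (Id) VALUES (1), (2), (3);   -- up to 10000 ids in practice
EXEC dbo.CopyClientsToAccount @AccountId = 42, @clientIdList = @clientIdList;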
The thing is, after each iteration 10000 new Client rows are added to the Client table, so this INSERT operation takes longer with every subsequent iteration. Googling for SP performance tips I came to know that the IN clause is considered somewhat evil, and most people suggest using an INNER JOIN instead. So I changed the above code to –
INSERT INTO Client
SELECT c.AccountId, c.CId, c.Code, c.Name, c.Email FROM Client AS c
INNER JOIN #clientIdList AS cil
ON c.Id = cil.Id
Now it takes 25 minutes, 17 seconds and 407 milliseconds. Nothing exciting, really!
I’m new to the stored procedure arena. So, with this amount of data, is it supposed to take this long? And which one should I use for the given scenario, IN or INNER JOIN? Suggestions and performance tips are welcome. Thanks.

It's hard to tell exactly what is going on without knowing more about your stored procedure.
What I would recommend is checking the execution plan. To do this, open up SQL Server Management Studio. In a new query window, make a call to your stored procedure, passing in any relevant parameters.
Before you execute this, go up to the Query menu and choose the Include Client Statistics and Include Actual Execution Plan menu items.
Now run your query.
Come back in 25 minutes when it's all done and there should be three or four tabs at the bottom (depending on whether it returns data or not): one tab for results, one for messages, one for the client statistics and one for the execution plan.
The client stats tab is useful for seeing if the changes you make affect performance (it keeps several of your last runs to show you how it performed over those.)
The more interesting tab is the execution plan tab. Look at this one; it should look something like this:
Here it tells me that my query was able to use the primary key index on all my tables. You want to look out for whole table scans (because that means it's not using an index). Also, if my query hadn't been so simple, had taken a long time, and hadn't used an index, then below "Query 1" there would be green text saying "Missing Index" or something along those lines. It will tell you the index you need to create to improve performance.
Also notice it tells you how much each query took, in percentage. I had one query so obviously it took 100% of the time. But if you had 5 queries in your sproc and one took 80% of the time, you might want to check that one out first.
It also tells you how much time was spent on each part of the query, again in percentages. That can be helpful for trying to understand what it is that your query is doing.
Run through this and I'd guess you'll find other problems with your sproc, and you can ask follow-up questions from there.
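If you prefer staying in the query window, you can also get per-statement timings and I/O without the GUI options; something like:
SET STATISTICS IO ON;    -- logical reads per table, per statement
SET STATISTICS TIME ON;  -- CPU and elapsed time per statement

SELECT COUNT(*) FROM Client;  -- run the statement (or the procedure call) you want to measure here
The numbers show up on the Messages tab.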

Related

Holistic SQL queries (inside Oracle PLSQL) and UX

I have a question about how to handle errors with holistic SQL queries. We are using Oracle PL/SQL. Most of our codebase is row-by-row processing, which ends up in extremely poor performance. As far as I understand, the biggest problem with that is the context switching between the PL/SQL and SQL engines.
The problem with going holistic is that the user doesn't know what went wrong. The old style would be like:
Open a cursor over some data
Fire a SELECT (count) against another table; if data exists continue, if not show an error message
SELECT that data
Fire a second SELECT (count) against another table; if data exists continue, if not show an error message
SELECT that data
Modify some other table
And that could go on for 10-20 tables. It's basically pretty much like a C program. It's possible to remodel that to something like:
UPDATE (
SELECT TAB1.Status,
10 AS New_Status
FROM TAB1
INNER JOIN TAB2 ON TAB1.FieldX = TAB2.FieldX
INNER ..
INNER ..
INNER ..
INNER ..
LEFT ..
LEFT ..
WHERE TAB1.FieldY = 2
AND TAB3.FieldA = 'ABC'
AND ..
AND ..
AND ..
AND ..
) TAB
SET TAB.Status = New_Status
WHERE TAB.Status = 5;
A holistic statement like that speeds things up enormously. I changed some queries that way and the runtime went down from 5 hours to 3 minutes, but that was fairly easy because it was a service without human interaction.
The question is how you would handle stuff like that where someone fills in a form and waits for a response. If something went wrong they need an error message. The only solution that came to my mind was checking whether any rows were updated and, if not, jumping into another code section that still does all the single selects to determine the error. But after every change we would have to update both the holistic statement and all the single selects; I guess after some time they would diverge and lead to more problems.
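The kind of row-count check I mean would look roughly like this (a sketch only, with the joins simplified, reusing the table and column names from the UPDATE above):
BEGIN
  UPDATE tab1
     SET status = 10
   WHERE status = 5
     AND fieldy = 2
     AND EXISTS (SELECT 1
                   FROM tab3
                  WHERE tab3.fieldx = tab1.fieldx
                    AND tab3.fielda = 'ABC');

  IF SQL%ROWCOUNT = 0 THEN
    -- error path only: run the cheap single checks to build a precise message
    DECLARE
      v_cnt PLS_INTEGER;
    BEGIN
      SELECT COUNT(*) INTO v_cnt FROM tab3 WHERE fielda = 'ABC';
      IF v_cnt = 0 THEN
        RAISE_APPLICATION_ERROR(-20001, 'No matching row in TAB3 (FieldA = ''ABC'')');
      END IF;
      RAISE_APPLICATION_ERROR(-20002, 'No row in TAB1 matched the update criteria');
    END;
  END IF;
END;
/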
Another solution would be a generic error message, which would lead to a hundred support calls a day and us substituting 50 variables into the query and removing WHERE conditions/joins one by one to find out which condition filtered away the needed rows.
So what is the right approach here to get the performance and still stay reasonably user friendly? At the moment our system feels unusably slow: if you press a button you often have to wait a long time (typically 3-10 seconds, on some more complex tasks 5 minutes).
Set-based operations are faster than row-based operations for large amounts of data. But set-based operations mostly apply to batch tasks. UI tasks usually deal with small amounts of data in a row by row fashion.
So it seems your real aim should be understanding why your individual statements take so long.
" If you press a button you often have to wait a long time (typically 3-10 seconds on some complexer tasks 5 minutes"
That's clearly unacceptable. Equally clearly it's not possible for us to explain it: we don't have the access or the domain knowledge to diagnose systemic performance issues. Probably you need to persuade your boss to spring for a couple of days of on-site consultancy.
But here is one avenue to explore: locking.
"many other people working with the same data, so state is important"
Maybe your problems aren't due to slow queries, but to update statements waiting on shared resources? If so, a better (i.e. pessimistic) locking strategy could help.
"That's why I say people don't need to know more"
Data structures determine algorithms. The particular nature of your business domain and the way its data is stored is key to writing performant code. Why are there twenty tables involved in a search? Why does it take so long to run queries on these tables? Is STORAGE_BIN_ID not a primary key on all those tables?
Alternatively, why are users scanning barcodes on individual bins until they find one they want? It seems like it would be more efficient for them to specify criteria for a bin, then a set-based query could allocate the match nearest to their location.
Or perhaps you are trying to write one query to solve multiple use cases?
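On the bin-allocation point, a set-based "nearest matching bin" lookup might look roughly like this (the table and columns are hypothetical; only the idea of a bin id comes from the discussion above):
-- hypothetical schema: storage_bins(storage_bin_id, size_class, is_free, distance_from_station)
SELECT storage_bin_id
  FROM (SELECT storage_bin_id
          FROM storage_bins
         WHERE is_free = 'Y'
           AND size_class = 'PALLET'        -- the user's criteria
         ORDER BY distance_from_station)    -- nearest first
 WHERE ROWNUM = 1;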

The query time of a view increases after having fetched the last page from a view in Oracle PL/SQL

I'm using Oracle PL/SQL Developer on an Oracle Database 11g. I have recently written a view with some weird behaviour. When I run the simple query below without fetching the last page of the result, the query time is about 0.5 seconds (0.2 when cached).
select * from covenant.v_status_covenant_tuning where bankkode = '4210';
However, if I fetch the last page in PL/SQL Developer, or if I run the query from Java code (i.e. I run a query that retrieves all the rows), something happens to the view and the query time increases to about 20-30 seconds.
The view does not start working properly again until I recompile it. The explain plan is exactly the same before and after. All indexes and tables are analyzed. I don't know if it's relevant, but the view uses a few analytic expressions like rank() over (partition by .....), lag(), lead() and so on.
As I'm new here I can't post a picture of the explain plan (need a reputation of 10), but in general the optimizer uses indexes efficiently and it does a few sorts because of the analytic functions.
If the plan involves a full scan of some sort, the query will not complete until the very last block in the table has been read.
Imagine a table that has lots of matching rows in the very first few blocks in the table, and no matching rows in the rest of it. If there is a large volume of blocks to check, the query might return the first few pages of results very quickly, as it finds them all in the first few blocks of the table. But before it can return the final "no more results" to the client, it must check every last block of the table - it doesn't know if there might be one more result in the very last block of the table, so it has to wait until it has read that last block.
If you'd like more help, please post your query plan.
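If you can run the query directly and have access to DBMS_XPLAN, one way to capture the actual plan with runtime row counts on 11g is:
select /*+ gather_plan_statistics */ *
  from covenant.v_status_covenant_tuning
 where bankkode = '4210';

-- after fetching all rows, in the same session:
select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));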

ColdFusion's cfquery failing silently

I have a query that retrieves a large amount of data.
<cfsetting requesttimeout="9999999" >
<cfquery name="randomething" datasource="ds" timeout="9999999" >
SELECT
col1,
col2
FROM
table
</cfquery>
<cfdump var="#randomething.recordCount#" /> <!---should be about 5 million rows --->
I can successfully retrieve the data with python's cx_Oracle and using sys.getsizeof on the python list returns 22621060, so about 21 megabytes.
ColdFusion does not return an error on the page, and I can't find anything in any of the logs. Why is cfdump not showing the number of rows?
Additional Information
The reason for doing it this way is that I have about 8000 smaller queries to run against the randomething query. In other words, when I run those 8000 queries against the database it takes hours for that process to complete. I suspect this is because I am competing with several other database users, and the database is getting bogged down.
The 8000 smaller queries are getting counts of col1 over a period of col2.
SELECT
count(col1) as count
FROM
table
WHERE
col2 < 20121109
AND
col2 > 20121108
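For what it's worth, if col2 is a day key like 20121108, those 8000 per-day counts could in principle be collapsed into one grouped query so that only the aggregates come back to CF, roughly:
SELECT
col2,
count(col1) as day_count
FROM
table
GROUP BY
col2
ORDER BY
col2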
Following Adam Cameron's suggestions:
cflog is suggesting that the query isn't finishing.
I tried changing the query's timeout both in the code and in the CFIDE administrator; apparently CF9 no longer respects the timeout attribute, and regardless of what I tried I couldn't get the query to time out.
I also started playing around with the maxrows attribute to see if I could discern any information that way.
when maxrows is set to 1300000 everything works fine
when maxrows is 1400000 or greater I get this error
when maxrows is 2000000 I observe my original problem
Update
So this isn't a limit of cfquery. By using QueryNew and then looping over it to add data, I can get well past the 2 million mark without any problems.
I also created a ThinClient datasource using the information in this question, I didn't observe any change in behavior.
The messages on the database end are
SQL*Net message from client
and
SQL*Net more data to client
I just discovered that by using the thin client along with blockfactor="100" I can retrieve more rows (appx. 3000000).
Is there anything logged on the DB end of things?
I wonder if the timeout is not being respected, and JDBC is "hanging up" on the DB whilst it's working. That's a wild guess. What if you set a very low timeout - eg: 5sec - does it error after 5sec, or what?
The browser could be timing out too. What say you write something to a log before and after the <cfquery> block, with <cflog>, to see if the query is eventually finishing?
I have to wonder what it is you intend to do with these 22M records once you get them back to CF. Whatever it is, it sounds to me like CF is the wrong place to be doing whatever it is: CF ain't for heavy data processing, it's for making web pages. If you need to process 22M records, I suspect you should be doing it on the database. That said, I'm second-guessing what you're doing with no info to go on, so I presume there's probably a good reason to be doing it.
Have you tried wrapping your cfquery within cftry tags to see if that reports anything?
<cfsetting requesttimeout="600" >
<cftry>
<cfquery name="randomething" datasource="ds" timeout="590" >
SELECT
col1,
col2
FROM
table
</cfquery>
<cfdump var="#randomething.recordCount#" /> <!--- should be about 5 million rows --->
<cfcatch type="any">
<cfdump var="#cfcatch#">
</cfcatch>
</cftry>
This is just an idea, but you could give it a go:
You mention that using QueryNew you can successfully add the more-than-two-million records you need.
Also that when your maxRows is less than 1,300,000 things work as expected.
So why not first do a query to count(*) the total number of records in the table, divide by a million and round up, then cfloop over that number executing a query with maxRows=1000000 and startRow=(((i - 1) * 1000000) + 1) on each iteration...
ArrayAppend each query from within the loop to an array then when it's all done, loop over your array pushing the records into a new Query object. That way you end up with a query at the end containing all the records you were trying to retrieve.
You might hit memory issues, and it will not perform all that well, but hey - this is Coldfusion, those are par for the course, and sometimes crazy things happen / work.
(You could always append the results of each query to the one you're building up from QueryNew as you go rather than pushing each query onto an array, but it'll be easier to debug and see how far you get if it doesn't work if you build an array as you go.)
(Also, using multiple queries within the size that it CF can handle, you may then be able to execute the process you need to by looping over the array and then each query, rather than building up one massive query - would save processing time and memory, but depends on whether you need the full results set in a single Query object or not)
if your date ranges are consistent, i would suggest some aggregate functions in sql instead of having cf process it. something like:
select count(col1) as cnt, year(col2) as yr, month(col2) as mth
from table
group by year(col2), month(col2)
order by year(col2), month(col2)
add day() if you need that detail level, too. you can get really creative with date parts.
this should greatly speed up the entire run time and reduce the main query size.
Your problem here is that ColdFusion cannot time out SQL. This has always been an issue since CF6, I believe. So basically what is happening is that the cfquery runs past its timeout, but CF cannot time out JDBC, so it waits for the query; it then tries to run cfdump (which internally uses cfoutput), and that is reported as timing out because the request is now considered to have run too long.
As Adam pointed out, whatever you are trying to do is too large for CF to realistically handle and will either need to be chopped up into smaller jobs or entirely handled in the DB.
So as it turns out the server was running out of memory; apparently a cfquery result takes up quite a bit more memory than a Python list.
It was Barry's comment that got me going in the right direction; I didn't know much about the server monitor up until this point, other than the fact that it existed.
As it turns out I am also not very good at reading: the errors that were getting logged in application.log were
GC overhead limit exceeded The specific sequence of files included or processed is: \path\to\index.cfm, line: 10
and
Java heap space The specific sequence of files included or processed is: \path\to\index.cfm
I'll end up going with Adam's suggestion and let the database do the processing. At least now I'll be able to explain why things are slow instead of just saying, "I don't know".

Oracle performance via SQLDeveloper vs application

I am trying to understand the performance of a query that I've written in Oracle. At this time I only have access to SQLDeveloper and its execution timer. I can run SHOW PLAN but cannot use the auto trace function.
The query that I've written runs in about 1.8 seconds when I press "execute query" (F9) in SQLDeveloper. I know that this is only fetching the first fifty rows by default, but can I at least be certain that the 1.8 seconds encompasses the total execution time plus the time to deliver the first 50 rows to my client?
When I wrap this query in a stored procedure (returning the results via an OUT REF CURSOR) and try to use it from an external application (SQL Server Reporting Services), the query takes over one minute to run. I get similar performance when I press "run script" (F5) in SQLDeveloper. It seems that the difference here is that in these two scenarios Oracle has to transmit all of the rows back rather than just the first 50. This leads me to believe that there are some network connectivity issues between the client PC and the Oracle instance.
My query only returns about 8000 rows so this performance is surprising. To try to prove my theory above about the latency, I ran some code like this in SQLDeveloper:
declare
tmp sys_refcursor;
begin
my_proc(null, null, null, tmp);
end;
...And this runs in about two seconds. Again, does SQLDeveloper's execution clock accurately indicate the execution time of the query? Or am I missing something and is it possible that it is in fact my query which needs tuning?
Can anybody please offer me any insight on this based on the limited tools I have available? Or should I try to involve the DBA to do some further analysis?
"I know that this is only fetching the
first fifty rows by default, but can I
at least be certain that the 1.8
seconds encompasses the total
execution time plus the time to
deliver the first 50 rows to my
client?"
No, it is the time to return the first 50 rows. It doesn't necessarily require that the database has determined the entire result set.
Think about the table as an encyclopedia. If you want a list of animals with names beginning with 'A' or 'Z', you'll probably get Aardvarks and Alligators pretty quickly. It will take much longer to get Zebras as you'd have to read the entire book. If your query is doing a full table scan, it won't complete until it has read the entire table (or book), even if there is nothing to be picked up in anything after the first chapter (because it doesn't know there isn't anything important in there until it has read it).
declare
tmp sys_refcursor;
begin
my_proc(null, null, null, tmp);
end;
This piece of code does nothing. More specifically, it will parse the query to determine that the necessary tables, columns and privileges are in place. It will not actually execute the query or determine whether any rows meet the filter criteria.
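To actually time the full execution from the client side, you would have to fetch every row from the cursor, along these lines (the INTO list is hypothetical and has to match whatever the cursor selects):
declare
  tmp   sys_refcursor;
  l_id  number;          -- hypothetical columns: adjust to the cursor's select list
  l_val varchar2(4000);
  n     pls_integer := 0;
begin
  my_proc(null, null, null, tmp);
  loop
    fetch tmp into l_id, l_val;
    exit when tmp%notfound;
    n := n + 1;
  end loop;
  close tmp;
  dbms_output.put_line('rows fetched: ' || n);
end;
/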
If the query only returns 8000 rows it is unlikely that the network is a significant problem (unless they are very big rows).
Ask your DBA for a quick tutorial in performance tuning.

(ASP.NET) How would you go about creating a real-time counter which tracks database changes?

Here is the issue.
A site I've recently taken over tracks "miles" you ran in a day. A user can log into the site and add that they ran 5 miles; this is then added to the database.
At the end of the day, around 1am, a service runs which calculates all the miles all the users ran that day and outputs a text file to App_Data. That text file is then displayed in Flash on the home page.
I think this is kind of ridiculous. I was told they had to do this due to massive performance issues. They won't tell me exactly how they were doing it before or what the major performance issue was.
So what approach would you guys take? The first thing that popped into my mind was a web service which gets the data via an AJAX call. Perhaps every time a new "mile" entry is added, a trigger is fired and updates the "GlobalMiles" table.
I'd appreciate any info or tips on this.
Thanks so much!
Answering this question is a bit difficult since we don't know all of your requirements and something didn't work before. So here are some different ideas.
First, revisit your assumptions. Generating a static report once a day is a perfectly valid solution if all you need is daily reports. Why hit the database multiple times throughout the day if all that's needed is a snapshot (for instance, lots of blog software used to write html files when a blog was posted rather than serving up the entry from the database each time -- many still do as an optimization)? Is the "real-time" feature something you are adding?
I wouldn't jump to AJAX right away. Use the same input method, just move the report from static to dynamic. Doing too much at once is a good way to get yourself buried. When changing existing code I try to find areas that I can change in isolation with the least amount of impact to the rest of the application. Then once you have the dynamic report you can add AJAX (and please use progressive enhancement).
As for the dynamic report itself you have a few options.
Of course you can just SELECT SUM(), but it sounds like that would cause performance problems if each user has a large number of entries.
If your database supports it, I would look at using an indexed view (sometimes called a materialized view). It should allow fast updates to the real-time sum data:
CREATE VIEW vw_Miles WITH SCHEMABINDING AS
SELECT SUM([Count]) AS TotalMiles,
COUNT_BIG(*) AS [EntryCount],
UserId
FROM dbo.Miles
GROUP BY UserId
GO
CREATE UNIQUE CLUSTERED INDEX ix_Miles ON vw_Miles(UserId)
If the overhead of that is too much, @jn29098's solution is a good one. Roll it up using a scheduled task. If there are a lot of entries for each user, you could add only the delta since the last time the task ran.
UPDATE GlobalMiles SET [TotalMiles] = [TotalMiles] +
(SELECT SUM([Count])
FROM Miles
WHERE UserId = @id
AND EntryDate > @lastTaskRun
GROUP BY UserId)
WHERE UserId = @id
If you don't care about storing the individual entries but only the total you can update the count on the fly:
UPDATE Miles SET [Count] = [Count] + @newCount WHERE UserId = @id
You could use this method in conjunction with the SPROC that adds the entry and have both worlds.
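A sketch of that combined approach, assuming a GlobalMiles table with one total row per user (names follow the earlier examples, not the poster's actual schema):
CREATE PROCEDURE dbo.AddMiles
    @UserId   INT,
    @NewCount INT
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;
        -- keep the individual entry
        INSERT INTO dbo.Miles (UserId, [Count], EntryDate)
        VALUES (@UserId, @NewCount, GETDATE());

        -- and bump the running total in the same transaction
        UPDATE dbo.GlobalMiles
           SET TotalMiles = TotalMiles + @NewCount
         WHERE UserId = @UserId;
    COMMIT TRANSACTION;
END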
Finally, your trigger method would work as well. It's an alternative to the indexed view where you do the update yourself on a table instead of SQL Server doing it automatically. It's also similar to the previous option where you move the global update out of the sproc and into a trigger.
The last three options make it more difficult to handle the situation when an entry is removed, although if that's not a feature of your application then you may not need to worry about that.
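For illustration, the trigger variant might look roughly like this (again a sketch on the same assumed tables, handling only inserts):
CREATE TRIGGER trg_Miles_Insert ON dbo.Miles
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- roll the newly inserted rows into the per-user totals
    UPDATE g
       SET g.TotalMiles = g.TotalMiles + i.NewMiles
      FROM dbo.GlobalMiles AS g
      JOIN (SELECT UserId, SUM([Count]) AS NewMiles
              FROM inserted
             GROUP BY UserId) AS i
        ON g.UserId = i.UserId;
END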
Now that you've got materialized, real-time data in your database, you can dynamically generate your report. Then you can add the AJAX fanciness.
If they are truly having performance issues due to too many hits on the database, then I suggest that you take all the input and cram it into a message queue (MSMQ). Then you can have a service on the other end that picks up the messages and does a bulk insert of the data. This way you have fewer db hits. Then you can output to the text file on the update too.
I would create a summary table, rolled up once an hour or nightly, which calculates total miles run. For individual requests you could pull from the summary table plus any additional miles logged between the last rollup calculation and when the user views the page, to get the total for that user.
How many users are you talking about and how many log records per day?
