I have simple app which execute query on dp, since there are alot rows returned ~ 300-400k and its to much to be retrived and it cause out of memory error i have to use pagination. In groovy.sql.SQL we have rows(String sql,int offset, int maxRows) anyway its works very slow, for example with step 20k rows execution time of rows method starts with around 10 sec and increase with every next call, second way of achiving pagination is using some buile in mechanism for example
select *
from (
select /*+ first_rows(25) */
your_columns,
row_number()
over (order by something unique)rn
from your_tables )
where rn between :n and :m
order by rn;
And for my query second approach tooks 5 seconds with step 20k. My question is, which method is better for database? And what is the reason of slow execution Sql.rows ?
The first_rows hint is no more needed - since Oracle 11g. For Oracle it is best approach producer-consumer design pattern. As database generates data "on-the-fly".
So simple pure select would be suitable:
select your_columns,
row_number() over (order by something unique)rn
from your_tables;
But unfortunately Java frameworks usually can not keep db connection open. They simply fetch all data at once, and then hand over the whole result set to caller.
You do not have many options. Either:
you will need all lot of RAM to fetch everything. Plus you can also use lazy loading on JPA level.
or you have to find a way how keep db connection open in a web application. Which it practically impossible. Also such a approach is not suitable for applications having more than thousands of concurrent users.
PS: under usual circumstances, the usual way how pagination is implemented does not return consistent data, as they can change between executions. So it should not be used for anything else that displaying purposes.
Related
I have a question about how to handle errors with holistic SQL queries. We are using Oracle PL/SQL. Most of our codebase is row-by-row processing which ends up in extremely poor performance. As far as I understand the biggest problem with that is the context switches between the PL/SQL and SQL engine.
Problem with that is that the user don't know what went wrong. Old style would be like:
Cursor above some data
Fire SELECT (count) in another table if data exists, if not show errormsg
SELECT that data
Fire second SELECT (count) in another table if data exists, if not show errormsg
SELECT that data
modify some other table
And that could go on for 10-20 tables. It's basically pretty much like a C program. It's possible to remodel that to something like:
UPDATE (
SELECT TAB1.Status,
10 AS New_Status
FROM TAB1
INNER JOIN TAB2 ON TAB1.FieldX = TAB2.FieldX
INNER ..
INNER ..
INNER ..
INNER ..
LEFT ..
LEFT ..
WHERE TAB1.FieldY = 2
AND TAB3.FieldA = 'ABC'
AND ..
AND ..
AND ..
AND ..
) TAB
SET TAB.Status = New_Status
WHERE TAB.Status = 5;
A holistic SELECT like that would speed up a lot of things extremely. I changed some queries like that and that stuff went down from 5 hours to 3 minutes but that was kinda easy because it was a service without human interaction.
Question is how would you handle stuff like that were someone fills some form and waits for a response. So if something went wrong they need an errormsg. Only solution that came to my mind was checking if rows were updated and if not jump into another code section that still does all the single selects to determinate that error. But after every change we would have to update the holistic select and all single selects. Guess after some time they would differ and lead to more problems.
Another solution would be a generic errormsg which would lead to hundred calls a day and us replacing 50 variables into the query, kill some of the where conditions/joins to find out what condition filtered away the needed rows.
So what is the right approach here to get performance and still be kinda user friendly. At the moment our system feels unusable slow. If you press a button you often have to wait a long time (typically 3-10 seconds, on some more complex tasks 5 minutes).
Set-based operations are faster than row-based operations for large amounts of data. But set-based operations mostly apply to batch tasks. UI tasks usually deal with small amounts of data in a row by row fashion.
So it seems your real aim should be understanding why your individual statements take so long.
" If you press a button you often have to wait a long time (typically 3-10 seconds on some complexer tasks 5 minutes"
That's clearly unacceptable. Equally clearly it's not possible for us to explain it: we don't have the access or the domain knowledge to diagnose systemic performance issues. Probably you need to persuade your boss to spring for a couple of days of on-site consultancy.
But here is one avenue to explore: locking.
"many other people working with the same data, so state is important"
Maybe your problems aren't due to slow queries, but to update statements waiting on shared resources? If so, a better (i.e. pessimistic) locking strategy could help.
"That's why I say people don't need to know more"
Data structures determine algorithms. The particular nature of your business domain and the way its data is stored is key to writing performative code. Why are there twenty tables involved in a search? Why does it take so long to run queries on these tables? Is STORAGE_BIN_ID not a primary key on all those tables?
Alternatively, why are users scanning barcodes on individual bins until they find one they want? It seems like it would be more efficient for them to specify criteria for a bin, then a set-based query could allocate the match nearest to their location.
Or perhaps you are trying to write one query to solve multiple use cases?
Question: How can I process (read in) batches of records 1000 at a time and ensure that only the current batch of 1000 records is in memory? Assume my primary key is called 'ID' and my table is called Customer.
Background: This is not for user pagination, it is for compiling statistics about my table. I have limited memory available, therefore I want to read my records in batches of 1000 records at a time. I am only reading in records, they will not be modified. I have read that StatelessSession is good for this kind of thing and I've heard about people using ScrollableResults.
What I have tried: Currently I am working on a custom made solution where I implemented Iterable and basically did the pagination by using setFirstResult and setMaxResults. This seems to be very slow for me but it allows me to get 1000 records at a time. I would like to know how I can do this more efficiently, perhaps with something like ScrollableResults. I'm not yet sure why my current method is so slow; I'm ordering by ID but ID is the primary key so the table should already be indexed that way.
As you might be able to tell, I keep reading bits and pieces about how to do this. If anyone can provide me a complete way to do this it would be greatly appreciated. I do know that you have to set FORWARD_ONLY on ScrollableResults and that calling evict(entity) will take an entity out of memory (unless you're doing second level caching, which I do not yet know how to check if I am or not). However I don't see any methods in the JavaDoc to read in say, 1000 records at a time. I want a balance between my lack of available memory and my slow network performance, so sending records over the network one at a time really isn't an option here. I am using Criteria API where possible. Thanks for any detailed replies.
May useing of ROWNUM feature of oracle will hepl you.
Lets say we need to fetch 1000 rows(pagesize) of table CUSTOMERS and we need to fetch second page(pageNumber)
Creating and Calling some query like this may be the answer
select * from
(select rownum row_number,customers.* from Customer
where rownum <= pagesize*pageNumber order by ID)
where row_number >= (pagesize -1)*pageNumber
Load entities as read-only.
For HQL
Query.setReadOnly( true );
For Criteria
Criteria.setReadOnly( true );
http://docs.jboss.org/hibernate/orm/3.6/reference/en-US/html/readonly.html#readonly-api-querycriteria
Stateless session quite different with State-Session.
Operations performed using a stateless session never cascade to associated instances. Collections are ignored by a stateless session
http://docs.jboss.org/hibernate/orm/3.3/reference/en/html/batch.html#batch-statelesssession
Use flash() and clear() to clean up session cache.
session.flush();
session.clear();
Question about Hibernate session.flush()
ScrollableResults should works that you expect.
Do not forget that each item that you loaded takes memory space unless you evict or clear and need to check it really works well.
ScrollableResults in Mysql J/Connecotr works fake, it loads entire rows, but I think oracle connector works fine.
Using Hibernate's ScrollableResults to slowly read 90 million records
If you find alternatives, you may consider to use this way
1. Select PrimaryKey of every rows that you will process
2. Chopping them into PK chunk
3. iterate -
select rows by PK chunk (using in-query)
process them what you want
I have a query that retrieves a large amount of data.
<cfsetting requesttimeout="9999999" >
<cfquery name="randomething" datasource="ds" timeout="9999999" >
SELECT
col1,
col2
FROM
table
</cfquery>
<cfdump var="#randomething.recordCount#" /> <!---should be about 5 million rows --->
I can successfully retrieve the data with python's cx_Oracle and using sys.getsizeof on the python list returns 22621060, so about 21 megabytes.
ColdFusion does not return an error on the page, and I can't find anything in any of the logs. Why is cfdump not showing the number of rows?
Additional Information
The reason for doing it this way is because I have about 8000 smaller queries to run against the randomthing query. In other words when I run those 8000 queries against the database it takes hours for that process to complete. I suspect this is because I am competing with several other database users, and the database is getting bogged down.
The 8000 smaller queries are getting counts of col1 over a period of col2.
SELECT
count(col1) as count
WHERE
col2 < 20121109
AND
col2 > 20121108
According to Adam Cameron's suggestions.
cflog is suggesting that the query isn't finishing.
I tried changing the queries timeout both in the code and in the CFIDE/administrator, apparently CF9 no long respects the timeout attribute, regardless of what I tried I couldn't get the query to timeout.
I also started playing around with the maxrows attribute to see if I could discern any information that way.
when maxrows is set to 1300000 everything works fine
when maxrows is 1400000 or greater I get this error
when maxrows is 2000000 I observe my original problem
Update
So this isn't a limit of cfquery. By using QueryNew then looping over it to add data and I can get well past the 2 million mark without any problems.
I also created a ThinClient datasource using the information in this question, I didn't observe any change in behavior.
The messages on the database end are
SQL*Net message from client
and
SQL*Net more data to client
I just discovered that by using the thin client along with blockfactor1="100" I can retrieve more rows (appx. 3000000).
Is there anything logged on the DB end of things?
I wonder if the timeout is not being respected, and JDBC is "hanging up" on the DB whilst it's working. That's a wild guess. What if you set a very low timeout - eg: 5sec - does it error after 5sec, or what?
The browser could be timing out too. What say you write something to a log before and after the <cfquery> block, with <cflog>. To see if the query is eventually finishing.
I have to wonder what it is you intend to do with these 22M records once you get them back to CF. Whatever it is, it sounds to me like CF is the wrong place to be doing whatever it is: CF ain't for heavy data processing, it's for making web pages. If you need to process 22M records, I suspect you should be doing it on the database. That said, I'm second-guessing what you're doing with no info to go on, so I presume there's probably a good reason to be doing it.
Have you tried wrapping your cfquery within cftry tags to see if that reports anything?
<cfsetting requesttimeout="600" >
<cftry>
<cfquery name="randomething" datasource="ds" timeout="590" >
SELECT
col1,
col2
FROM
table
</cfquery>
<cfdump var="#randomething.recordCount#" /> <!--- should be about 5 million rows --->
<cfcatch type="any">
<cfdump var="#cfcatch#">
</cfcatch>
</cftry>
This is just an idea, but you could give it a go:
You mention that using QueryNew you can successfully add the more-than-two-million records you need.
Also that when your maxRows is less than 1,300,000 things work as expected.
So why not first do a query to count(*) the total number of records in the table, divide by a million and round up, then cfloop over that number executing a query with maxRows=1000000 and startRow=((i - 1 * 1000000) + 1) on each iteration...
ArrayAppend each query from within the loop to an array then when it's all done, loop over your array pushing the records into a new Query object. That way you end up with a query at the end containing all the records you were trying to retrieve.
You might hit memory issues, and it will not perform all that well, but hey - this is Coldfusion, those are par for the course, and sometimes crazy things happen / work.
(You could always append the results of each query to the one you're building up from QueryNew as you go rather than pushing each query onto an array, but it'll be easier to debug and see how far you get if it doesn't work if you build an array as you go.)
(Also, using multiple queries within the size that it CF can handle, you may then be able to execute the process you need to by looping over the array and then each query, rather than building up one massive query - would save processing time and memory, but depends on whether you need the full results set in a single Query object or not)
if your date ranges are consistent, i would suggest some aggregate functions in sql instead of having cf process it. something like:
select col1, count(col1), year(col2), month(col2)
from table
group by year(col2), month(col2)
order by year(col2), month(col2)
add day() if you need that detail level, too. you can get really creative with date parts.
this should greatly speed up the entire run time, reduce the main query size.
Your problem here is that ColdFusion cannot time out SQL. This has always been an issue since CF6 I believe. So basically what is happening is that the cfquery is taking longer than 9999999 seconds but CF cannot timeout JDBC so it waits until afterwards tries to run cfdump (which internally uses cfoutput) and this is reported as timing out because the request is now considered to have run too long.
As Adam pointed out, whatever you are trying to do is too large for CF to realistically handle and will either need to be chopped up into smaller jobs or entirely handled in the DB.
So as it turns out the server was running out of memory, apparently cfquery takes up quite a bit more memory than a python list.
It was Barry's comment that got me going in the right direction, I didn't know much about the server monitor up until this point other than the fact that it existed.
As it turns out I am also not very good at reading, the errors that were getting logged in the application.log file were
GC overhead limit exceeded The specific sequence of files included or processed is: \path\to\index.cfm, line: 10 "
and
Java heap space The specific sequence of files included or processed is: \path\to\index.cfm
I'll end up going with Adams suggestion and let the database do the processing. At least now I'll be able to explain why things are slow instead of just saying, "I don't know".
I have a product table which as many columns. Primary key is productid. There 50,000 rows but when I issue a select statement like select * from products then it is taking 10 minutes to get the full data. So advise me what to do as a result I can run my query faster.
Is your primary key also the clustering key on that table?
If you do a SELECT * .... you'll basically always get a full table scan. There's really nothing that can speed that query up - you want all rows, all columns - so you get it all and it takes the time it takes.
If you do more "focused" queries like
SELECT col1, col2 FROM dbo.Products WHERE SomeColumn = 42
then you have a chance of speeding this up by using the appropriate indices.
Buy a better computer.
Seriously.
SQL Server 2000 has been retired years ago, so this is an OLD install. 50.000 products is a joke - any table below 1 million is nothing.
But when i issue a select statement like select * from products then it is taking 10 minute
to get the full data.
Assuming this is over LAN, not over a slow internet connection, there can be 2 reasons for that:
System is TERRIBLY OVERLOADED. Like SERIOUS overloaded. Not like I have not seen that on old setups. Been there, seen that - hard discs so overloaded (hey, they are SCSI, they are fast) that they took more than 2 seconds to answer to a request.
System is programmed by incompetents. Could be bad transaction level handling leading to terrible locks for long duration which block you. This is possible, but then you are in for a LOT of rework to get the ridiculous code out of the programming.
A select * from table should not take more than a couple of seconds to transfer all the data over LAN. Point. Unless the bale has tons of binary data (i.e. HUGH amounts of data in some fields).
As your local database specialist to make an analysis. Start with hardware load then move to locking behavior. Consider upgrading to a technology that is more modern, By now you are a LOT of generations behind.
Because there's no criterium (WHERE), the time your query takes is not due to the selection (determining which rows to select) but most likely due to the sheer size of the data.
The only solution is:
Do not use SELECT *, but select only the columns you need.
Are there general ABAP-specific tips related to performance of big SELECT queries?
In particular, is it possible to close once and for all the question of FOR ALL ENTRIES IN vs JOIN?
A few (more or less) ABAP-specific hints:
Avoid SELECT * where it's not needed, try to select only the fields that are required. Reason: Every value might be mapped several times during the process (DB Disk --> DB Memory --> Network --> DB Driver --> ABAP internal). It's easy to save the CPU cycles if you don't need the fields anyway. Be very careful if you SELECT * a table that contains BLOB fields like STRING, this can totally kill your DB performance because the blob contents are usually stored on different pages.
Don't SELECT ... ENDSELECT for small to medium result sets, use SELECT ... INTO TABLE instead.
Reason: SELECT ... INTO TABLE performs a single fetch and doesn't keep the cursor open while SELECT ... ENDSELECT will typically fetch a single row for every loop iteration.
This was a kind of urban myth - there is no performance degradation for using SELECT as a loop statement. However, this will keep an open cursor during the loop which can lead to unwanted (but not strictly performance-related) effects.
For large result sets, use a cursor and an internal table.
Reason: Same as above, and you'll avoid eating up too much heap space.
Don't ORDER BY, use SORT instead.
Reason: Better scalability of the application server.
Be careful with nested SELECT statements.
While they can be very handy for small 'inner result sets', they are a huge performance hog if the nested query returns a large result set.
Measure, Measure, Measure
Never assume anything if you're worried about performance. Create a representative set of test data and run tests for different implementations. Learn how to use ST05 and SAT.
There won't be a way to close your second question "once and for all". First of all, FOR ALL ENTRIES IN 'joins' a database table and an internal (memory) table while JOIN only operates on database tables. Since the database knows nothing about the internal ABAP memory, the FOR ALL ENTRIES IN statement will be transformed to a set of WHERE statements - just try and use the ST05 to trace this. Second, you can't add values from the second table when using FOR ALL ENTRIES IN. Third, be aware that FOR ALL ENTRIES IN always implies DISTINCT. There are a few other pitfalls - be sure to consult the on-line ABAP reference, they are all listed there.
If the number of records in the second table is small, both statements should be more or less equal in performance - the database optimizer should just preselect all values from the second table and use a smart joining algorithm to filter through the first table. My recommendation: Use whatever feels good, don't try to tweak your code to illegibility.
If the number of records in the second table exceeds a certain value, Bad Things [TM] happen with FOR ALL ENTRIES IN - the contents of the table are split into multiple sets, then the query is transformed (see above) and re-run for each set.
Another note: The "Avoid SELECT *" statement is true in general, but I can tell you where it is false.
When you are going to take most of the fields anyway, and where you have several queries (in the same program, or different programs that are likely to be run around the same time) which take most of the fields, especially if they are different fields that are missing.
This is because the App Server Data buffers are based on the select query signature. If you make sure to use the same query, then you can ensure that the buffer can be used instead of hitting the database again. In this case, SELECT * is better than selecting 90% of the fields, because you make it much more likely that the buffer will be used.
Also note that as of the last version I tested, the ABAP DB layer wasn't smart enough to recognize SELECT A, B as being the same as SELECT B, A, which means you should always put the fields you take in the same order (preferable the table order) in order to make sure again that the data buffer on the application is being well used.
I usually follow the rules stated in this pdf from SAP: "Efficient Database Programming with ABAP"
It shows a lot of tips in optimizing queries.
This question will never be completely answered.
ABAP statement for accessing database is interpreted several times by different components of whole system (SAP and DB). Behavior of each component depends from component itself, its version and settings. Main part of interpretation is done in DB adapter on SAP side.
The only viable approach for reaching maximum performance is measurement on particular system (SAP version and DB vendor and version).
There are also quite extensive hints and tips in transaction SE30. It even allows you (depending on authorisations) to write code snippets of your own & measure it.
Unfortunately we can't close the "for all entries" vs join debate as it is very dependent on how your landscape is set up, wich database server you are using, the efficiency of your table indexes etc.
The simplistic answer is let the DB server do as much as possible. For the "for all entries" vs join question this means join. Except every experienced ABAP programmer knows that it's never that simple. You have to try different scenarios and measure like vwegert said. Also remember to measure in your live system as well, as sometimes the hardware configuration or dataset is significantly different to have entirely different results in your live system than test.
I usually follow the following conventions:
Never do a select *, Select only the required fields.
Never use 'into corresponding table of' instead create local structures which has all the required fields.
In the where clause, try to use as many primary keys as possible.
If select is made to fetch a single record and all primary keys are included in where clause use Select single, or else use SELECT UP TO TO 1 ROWS, ENDSELECT.
Try to use Join statements to connect tables instead of using FOR ALL ENTRIES.
If for all entries cannot be avoided ensure that the internal table is not empty and a delete the duplicate entries to increase performance.
Two more points in addition to the other answers:
usually you use JOIN for two or more tables in the database and you use FOR ALL ENTRIES IN to join database tables with a table you have in memory. If you can, JOIN.
usually the IN operator is more convinient than FOR ALL ENTRIES IN. But the kernel translates IN into a long select statement. The length of such a statement is limited and you get a dump when it gets too long. In this case you are forced to use FOR ALL ENTRIES IN despite the performance implications.
With in-memory database technologies, it's best if you can finish all data and calculations on the database side with JOINs and database aggregation functions like SUM.
But if you can't, at least try to avoid accessing database in LOOPs. Also avoid reading the database without using indexes, of course.