It seems that r2dbc-oracle doesn't have a proper back pressure implementation. If I select a larger number of rows (say 10k), it is much slower than a regular JDBC/JPA query. If I manually set the fetch size to 1000, the query is approximately 8 times(!) faster.
So:
Can you confirm whether back pressure is implemented? If not, is it planned?
Is there an easier way to set the fetch size (maybe even globally...) than using manual DatabaseClient.sql() queries?
Thanks for sharing these findings.
I can confirm that request signals from a Subscriber do not affect the fetch size of Oracle R2DBC's Row Publisher. Currently, the only supported way to configure the fetch size is by calling io.r2dbc.spi.Statement.fetchSize(int).
This behavior can be attributed to Oracle JDBC's implementation of oracle.jdbc.OracleResultSet.publisherOracle(Function). The Oracle R2DBC Driver is using Oracle JDBC's Publisher to fetch rows from the database.
I can also confirm that the Oracle JDBC Team is aware of this issue, and is working on a fix. The fix will have the publisher use larger fetch sizes when demand from a subscriber exceeds the value configured with Statement.fetchSize(int).
Source: I wrote the code for Oracle R2DBC and Oracle JDBC's row publisher.
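For anyone who wants a concrete reference, here is a minimal sketch of the currently supported approach via the plain R2DBC SPI (Project Reactor is used only for subscribing); the connection URL, table, and column names are placeholders, not taken from the original report:

import io.r2dbc.spi.Connection;
import io.r2dbc.spi.ConnectionFactories;
import io.r2dbc.spi.ConnectionFactory;
import reactor.core.publisher.Flux;

public class FetchSizeExample {
    public static void main(String[] args) {
        // Placeholder URL; any Oracle R2DBC ConnectionFactory behaves the same way.
        ConnectionFactory factory = ConnectionFactories.get("r2dbc:oracle://host:1521/service");

        Flux.usingWhen(
                factory.create(),
                connection -> Flux.from(
                                connection.createStatement("SELECT id FROM some_table")
                                        // Currently the only way to influence how many rows
                                        // are fetched per database round trip.
                                        .fetchSize(1000)
                                        .execute())
                        .flatMap(result -> result.map((row, metadata) -> row.get("id"))),
                Connection::close)
            .doOnNext(System.out::println)
            .blockLast();
    }
}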
Related
As a production support team member, I investigate issues with various Impala queries. While researching one issue, I noticed a team submits an Impala query with LIMIT 0, which obviously returns no rows, and then submits it again without LIMIT 0 to get the actual result. I believe they submit these queries from IBM DataStage. Before I ask them why they do this, I wanted to check: what could be a reason for someone to run a query with LIMIT 0? Is it just to check the syntax or the connection with Impala? I see a similar question discussed here in the context of SQL generally, but thought I'd ask anyway from an Impala perspective. Thanks, Neel
I think you are partially correct.
Please note that LIMIT still processes all the data and only then applies the limit clause.
LIMIT 0 is mostly used to:
- Check whether the SQL syntax is correct. Impala still fetches all the records before applying the limit, so the SQL is completely validated. Some systems use this to check automatically generated SQL before actually running it against the server (see the JDBC sketch after this list).
- Avoid fetching lots of rows from a huge table or data set every time you run a SQL statement.
- Create an empty table using the structure of another table, when you do not want to copy its storage format, configuration, etc.
- Avoid burdening Hue (or any other interface interacting with Impala). All data will be processed, but none will be returned.
- Performance testing: this gives you a rough idea of the SQL's run time. I say "rough" because it is not the actual time to complete, but an estimate of it.
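As an illustration of the syntax/metadata check use case, here is a minimal JDBC sketch; the JDBC URL and table name are placeholders and assume an Impala JDBC driver on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class LimitZeroCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details.
        try (Connection conn = DriverManager.getConnection("jdbc:impala://host:21050/default");
             Statement stmt = conn.createStatement();
             // LIMIT 0 returns no rows, but the statement is still fully parsed and analyzed,
             // so a syntax error or missing column surfaces here.
             ResultSet rs = stmt.executeQuery("SELECT * FROM some_table LIMIT 0")) {

            // The metadata describes the columns even though there are no rows, which is
            // often all a tool needs to validate a generated query.
            ResultSetMetaData meta = rs.getMetaData();
            for (int i = 1; i <= meta.getColumnCount(); i++) {
                System.out.println(meta.getColumnName(i) + " : " + meta.getColumnTypeName(i));
            }
        }
    }
}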
I am curious about ways to better tune bulk inserts via Apache NiFi for speed, and whether a different driver or other configuration could speed up the process. Any input or references to resources would be greatly appreciated!
This is my current flow, with configurations included in the pictures. The source DB is Oracle and the destination DB is IBM Db2 for z/OS:
I think you have a few things working against you:
You probably have low concurrency set on the PutDatabaseRecord processor.
You have a very large fetch size.
You have a very large record-per-flowfile count.
From what I've read in the past, the fetch size controls how many records will be pulled from the query's remote result in each iteration. So in your case, it has to pull 100k records before it will even register data being ready. Try dropping it down to 1k records for the fetch and experiment with 100-1000 records per flowfile.
If you're bulk inserting that flowfile, you're also sending over 100k inserts at once.
A number of blogs and sites mention increasing FetchSize of OracleDataReader to improve performance when fetching big volumes of data (e.g. thousands of rows). There are some documented experiments with exact numbers on this like: http://metekarar.blogspot.com/2013/04/performance-improvement-for-odpnet.html
Trying to replicate these results, I have created a very similar sample application that does such data fetching several times with varying fetch sizes. Strangely, unless the connection pooling is explicitly disabled (e.g. in the connection string), the increase/decrease of FetchSize stops having any effect. When the pooling is disabled though, it's clear that the FetchSize can improve the performance (the more records, the bigger the effect).
Might this be a bug in this particular version of ODP.NET (I am using 2.112.1.0), or is it universal behavior that in practice removes the possibility of optimizing FetchSize per query?
What is the logical link between connection pooling and FetchSize, given that FetchSize is set on the command or the reader (and not on the connection)? Am I missing something?
It turns out that this unexpected behavior is limited to the following conditions:
1) The SELECT statement is exactly the same
2) The pooling is ON
3) The self-tuning is ON
Only under those conditions does the first FetchSize value that is set get cached somewhere by ODP.NET; subsequent attempts to change it have no effect.
I have a background thread that is querying an Oracle database via a Select statement. The statement is populating a ResultSet Java object. If the query returns a lot of rows, the ResultSet object might get very large. If it's too large, I want to both cancel the background thread, but more importantly I want to cancel the thread that is creating the ResultSet object and eating up a lot of Java memory.
From what I have read so far online, java.sql.Statement.cancel() seems to be the best way to get this done. Can anyone confirm this? Is there a better way?
java.sql.Statement.close() also works; I could probably catch the resulting "exhausted resultset" exception, but maybe that is not safe.
To clarify, I do not want the ResultSet or the thread - I want to discard both completely from memory.
This depends on the JDBC implementation: Statement.cancel() is a request to the JDBC driver that may or may not do what you need or expect.
However, seeing as you are performing a SELECT (normally non-transactional) and the default row prefetch for Oracle JDBC is 10, this should probably do the trick. See this answer for similar/related information (and the sketch below the link):
When I call PreparedStatement.cancel() in a JDBC application, does it actually kill it in an Oracle database?
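A rough sketch of that approach (connection details, table name, and timings here are purely illustrative, not from the question): run the query on a worker thread and have the controlling thread call cancel() and then close() on the same Statement once the result is no longer wanted.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CancelQueryExample {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details.
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//host:1521/service", "user", "password");
        Statement stmt = conn.createStatement();
        stmt.setFetchSize(100); // keep each round trip's batch small so memory stays bounded

        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.submit(() -> {
            try (ResultSet rs = stmt.executeQuery("SELECT * FROM big_table")) {
                while (rs.next()) {
                    // process the row; next() fails once the statement is cancelled or closed
                }
            } catch (SQLException e) {
                // Expected when cancel()/close() interrupts the fetch.
                System.out.println("Query aborted: " + e.getMessage());
            }
        });

        Thread.sleep(2_000); // let the query run for a bit (illustrative only)

        stmt.cancel(); // asks the driver to abort the running statement (best effort)
        stmt.close();  // releases the statement and its ResultSet on the Java side

        executor.shutdown();
        executor.awaitTermination(10, TimeUnit.SECONDS);
        conn.close();
    }
}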
Canceling the thread doesn't solve your problem, if you really need the query results.
If you are concerned about using up too much memory, you can set the fetch size on the ResultSet, which limits the number of rows you get back per round trip. You then have to consume the ResultSet as you go: if the data piles up in whatever structure you copy the rows into, you're back to eating up memory (see the sketch below).
Oracle has a great document on memory management depending on your driver version.
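A small sketch of that consume-as-you-go pattern, assuming a hypothetical big_table with a numeric amount column: keep the fetch size modest and reduce each row immediately instead of accumulating rows in memory.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class StreamingFetch {

    // Processes rows incrementally: only fetchSize rows are buffered by the driver
    // at a time, and nothing row-sized is retained on the Java side.
    static long sumAmounts(Connection conn) throws SQLException {
        long total = 0;
        try (Statement stmt = conn.createStatement()) {
            stmt.setFetchSize(500); // rows fetched per round trip; tune to your row size
            try (ResultSet rs = stmt.executeQuery("SELECT amount FROM big_table")) {
                while (rs.next()) {
                    total += rs.getLong("amount"); // reduce each row immediately, don't collect rows
                }
            }
        }
        return total;
    }
}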
I am running queries against an Oracle 10g database with JDBC (using the latest drivers and UCP as the DataSource) in order to retrieve CLOBs (avg. 20k characters). However, the performance seems to be pretty bad: the batch retrieval of 100 LOBs takes 4 s on average. The operation is also neither I/O, CPU, nor network bound, judging from my observations.
My test setup looks like this:
PoolDataSource dataSource = PoolDataSourceFactory.getPoolDataSource();
dataSource.setConnectionFactoryClassName("...");
dataSource.setConnectionPoolName("...");
dataSource.setURL("...");
dataSource.setUser("...");
dataSource.setPassword("...");
// prefetch 1000 rows per round trip and up to 500 KB of each LOB's data
dataSource.setConnectionProperty("defaultRowPrefetch", "1000");
dataSource.setConnectionProperty("defaultLobPrefetchSize", "500000");

final LobHandler handler = new OracleLobHandler();
JdbcTemplate j = new JdbcTemplate(dataSource);

j.query("SELECT bigClob FROM ...",
        new RowCallbackHandler() {
            public void processRow(final ResultSet rs) throws SQLException {
                // materialize the CLOB for the current row as a String
                String result = handler.getClobAsString(rs, "bigClob");
            }
        });
I experimented with the fetch sizes but to no avail. Am I doing something wrong? Is there a way to speed up CLOB retrieval when using JDBC?
The total size of the result set is in the tens of thousands; measured over the span of the whole retrieval, the initial costs
Is there an Order By in the query? 10K rows is quite a lot if it has to be sorted.
Also, retrieving the PK is not a fair test versus retrieving the entire CLOB. Oracle stores the table rows with probably many in a block, but each of the CLOBs (if they are > 4K) will be stored out of line, each in a series of blocks. Scanning the list of PK's is therefore going to be fast. Also, there is probably an index on the PK, so Oracle can just quickly scan the index blocks and not even access the table.
4 seconds does seem a little high, but that is 2 MB that needs to be read (possibly from disk) and transported over the network to your Java program, so the network could be an issue. If you perform an SQL trace of the session, it will show you exactly where the time is being spent (disk reads or network).
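If it helps, here is a rough sketch of turning on an extended SQL trace from JDBC before running the slow query; the 10046 event at level 8 includes wait events, which is what separates disk reads from network waits. Verify the exact syntax and the required privileges against your Oracle version.

import java.sql.Connection;
import java.sql.Statement;

public class TraceSession {
    // Enables extended SQL tracing (with wait events) for the current session.
    // Run the slow CLOB query on the same connection afterwards, then inspect the
    // trace file (user_dump_dest / diagnostic trace directory) with tkprof.
    static void enableTrace(Connection conn) throws Exception {
        try (Statement stmt = conn.createStatement()) {
            stmt.execute("ALTER SESSION SET tracefile_identifier = 'clob_fetch'");
            stmt.execute("ALTER SESSION SET EVENTS '10046 trace name context forever, level 8'");
        }
    }
}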
My past experience of using Oracle LOB columns to store large data has not been good. They are fine under 4k, since the data is stored inline like a VARCHAR2, but once you go over 4k you start to see performance degrade. Things may have improved since I last tried it a couple of years ago, but here is what I found back then, for your information:
Since clients need to get LOBs via the Oracle server, consider the following:
- LOB data will compete for the limited SGA cache with other data types if Oracle decides to cache it. As CLOB data is generally big, it may push other data out of the cache.
- LOB data gets poor disk reads if Oracle decides not to cache it and instead streams the data to the client.
- Fragmentation is probably something you haven't encountered yet. You will see it if your applications delete LOBs and Oracle tries to reuse the space. I don't know whether Oracle supports online defragmentation of LOB storage (it does for indexes, but it took a long time when we tried it previously).
You mentioned 4 s for 100 LOBs of ~20k on average, so that is 40 ms per LOB. Remember that each LOB has to be retrieved via a separate LOB locator (it is not in the result set by default). That is an additional round trip per LOB, I assume (I am not 100% sure on this, since it was a while ago). If that is the case, I would expect at least 5 ms of extra time per round trip, in serial order, so your performance would already be limited by the sequential LOB fetches. You should be able to verify this by tracking the time spent in SQL execution vs. LOB content fetching, or by excluding the LOB column as suggested in the previous answer, which should tell you whether it is LOB-related.
Good luck
I had a similar issue and found that JDBC LOBs make a network call when accessing the LOB contents.
As of the Oracle 11.2 JDBC driver you can use LOB prefetching.
This sped up access by a factor of 10 in my case...
statement1.setFetchSize(1000); // rows fetched per round trip
if (statement1 instanceof OracleStatement) {
    // prefetch up to 250 KB of each LOB's content along with the row
    ((OracleStatement) statement1).setLobPrefetchSize(250000);
}
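Note that the question's own UCP setup already sets the defaultLobPrefetchSize connection property, which should apply the same prefetch connection-wide without casting each statement to OracleStatement.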
Thanks for all the helpful suggestions. Despite this being flagged as the answer to the problem, my conclusion is that there seems to be no good solution. I tried parallel statements, different storage characteristics, presorted temporary tables and other things. The operation does not seem to be bound by any characteristic visible in traces or explain plans. Even query parallelism seems to be sketchy when CLOBs are involved.
Undoubtedly there would be better options for dealing with large CLOBs (especially compression) in an 11g environment, but at the moment I am stuck with 10g.
I have now opted for an additional round trip to the database in which I preprocess the CLOBs into a size-optimized binary RAW. In previous deployments this has always been a very fast option and will likely be worth the trouble of maintaining an offline computed cache. The cache will be invalidated and updated using a persistent process and AQ, until someone comes up with a better idea.