How to optimize page size for IBM FileNet search - filenet-p8

It's a common situation, when building server-side reports, to use a simple Iterator instead of a PageIterator while iterating through FileNet collections, because you don't need to send document "portions" to clients.
SearchScope ss = new SearchScope(objectStore);
int pageSize = 500; // what integer to choose?
RepositoryRowSet rrc = ss.fetchRows(sql, pageSize, propertyFilter, true);
Iterator<?> it = rrc.iterator();
while (it.hasNext()) {
    RepositoryRow rr = (RepositoryRow) it.next();
    // ...
}
But the CE API still uses paging internally. So my question is: what page size should I choose in this case? On one hand, the larger the page size, the fewer round-trips to the server. On the other hand, we can't make it too large, because each individual request may become too big and slow and may cause performance degradation as well. Where is the golden mean?

There is no golden mean here since it is a trade-off among three factors: how fast CE returns results, how many round-trips are made, and the memory consumption both on the client and the server. As you probably understand, these depend heavily on your operating environment.
You can treat the default configuration parameters that affect query performance as reasonable values and a sort of baseline:
ServerCacheConfiguration.QueryPageMaxSize: 1000
ServerCacheConfiguration.QueryPageDefaultSize: 500
ServerCacheConfiguration.NonPagedQueryMaxSize: 5000
A good approach would be to populate a test object store with a commensurate number of objects and experiment with the query parameters.
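For example, a quick (and entirely environment-specific) way to compare candidate values is to run the same query with several page sizes and time a full iteration. The sketch below assumes objectStore and a SearchSQL query are already set up; the class and method names are illustrative only.

// Hypothetical micro-benchmark: time a full iteration of the same query at several page sizes.
import java.util.Iterator;

import com.filenet.api.collection.RepositoryRowSet;
import com.filenet.api.core.ObjectStore;
import com.filenet.api.query.RepositoryRow;
import com.filenet.api.query.SearchSQL;
import com.filenet.api.query.SearchScope;

public class PageSizeProbe {

    static void probe(ObjectStore objectStore, SearchSQL sql) {
        for (int pageSize : new int[] {100, 500, 1000, 2000}) {
            SearchScope scope = new SearchScope(objectStore);
            long start = System.nanoTime();
            RepositoryRowSet rows = scope.fetchRows(sql, pageSize, null, Boolean.TRUE);
            int count = 0;
            for (Iterator<?> it = rows.iterator(); it.hasNext(); ) {
                RepositoryRow row = (RepositoryRow) it.next();
                count++;
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("pageSize=%d rows=%d elapsed=%dms%n", pageSize, count, elapsedMs);
        }
    }
}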

Related

Java8 Stream or Reactive / Observer for Database Requests

I'm rethinking our Spring MVC application behavior: is it better to pull (Java 8 Stream) data from the database, or to let the database push (Reactive / Observable) its data and use backpressure to control the amount?
Current situation:
User requests the 30 most recent articles
Service does a database query and puts the 30 results into a List
Jackson iterates over the List and generates the JSON response
Why switch the implementation?
It's quite memory-consuming, because we keep those 30 objects in memory all the time. That's not needed, because the application processes one object at a time. Instead, the application should be able to retrieve one object, process it, throw it away, and get the next one.
Java8 Streams? (pull)
With java.util.Stream this is quite easy: The Service creates a Stream, which uses a database cursor behind the scenes. And each time Jackson has written the JSON String for one element of the Stream, it will ask for the next one, which then triggers the database cursor to return the next entry.
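A minimal sketch of that pull approach, assuming a JPA entity named Article and Spring Data JPA's support for streaming query results (the repository and method names are illustrative):

import java.util.stream.Stream;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;

public interface ArticleRepository extends JpaRepository<Article, Long> {

    // Returns a cursor-backed stream: rows are fetched lazily as the stream is consumed.
    @Query("select a from Article a order by a.published desc")
    Stream<Article> streamAllOrderByPublishedDesc();
}

The consumer (the service rendering JSON) would typically wrap the stream in try-with-resources inside a read-only transaction, so the underlying cursor is released once the response has been written.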
RxJava / Reactive / Observable? (push)
Here we have the opposite scenario: The database has to push entry by entry and Jackson has to create the JSON String for each element until the onComplete method has been called.
i.e. the Controller tells the Service: give me an Observable<Article>. Then Jackson can ask for as many database entries as it can process.
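A rough sketch of that push variant in RxJava 1.x (as in the question); the Article type and the JDBC helpers queryArticles() and mapArticle() are assumptions, and note that a naive Observable.create like this does not itself implement backpressure (that would need SyncOnSubscribe or an operator such as onBackpressureBuffer):

import java.sql.ResultSet;
import java.sql.SQLException;

import rx.Observable;

public class ArticleStreamService {

    public Observable<Article> streamArticles() {
        return Observable.create(subscriber -> {
            try (ResultSet rs = queryArticles()) {        // assumed helper returning an open cursor
                while (rs.next() && !subscriber.isUnsubscribed()) {
                    subscriber.onNext(mapArticle(rs));    // assumed row-to-Article mapping
                }
                subscriber.onCompleted();
            } catch (SQLException e) {
                subscriber.onError(e);
            }
        });
    }

    private ResultSet queryArticles() throws SQLException { return null; /* assumed JDBC query */ }

    private Article mapArticle(ResultSet rs) throws SQLException { return null; /* assumed mapping */ }
}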
Differences and concern:
With Streams there's always some delay between asking for the next database entry and retrieving / processing it. This could slow down the JSON response time if the network connection is slow or if a large number of database requests have to be made to fulfill the response.
Using RxJava there should be always data available to process. And if it's too much, we can use backpressure to slow down the data transfer from database to our application. In the worst case scenario the buffer/queue will contain all requested database entries. Then the memory consumption will be equal to our current solution using a List.
Why am I asking / What am I asking for?
What did I miss? Are there any other pros / cons?
Why did (especially) the Spring Data team extend their API to support Stream responses from the database, if there's always a (short) delay between each database request/response? This could add up to a noticeable delay for a huge number of requested entries.
Is it recommended to go for RxJava (or some other reactive implementation) for this scenario? Or did I miss any drawbacks?
You seem to be talking about the fetch size for an underlying database engine.
If you reduce it to one (fetching and processing one row at a time), yes you will save some space during the request time...
But it usually makes sense to have a reasonable chunk size.
If it is too small you will have a lot of expensive network round-trips. If the chunk size is too large, you risk running out of memory or introducing too much latency per fetch. So it is a compromise, and the right chunk/fetch size depends on your specific use case.
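With plain JDBC, for instance, that chunk size corresponds to the driver's fetch size; a hedged example with a hypothetical value of 100 rows per round-trip:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class FetchSizeExample {

    void streamRows(Connection conn) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "select * from article order by published desc")) {
            ps.setFetchSize(100); // rows per round-trip; tune for your workload
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // process one row at a time; the driver buffers roughly one chunk
                }
            }
        }
    }
}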
Regarding whether to take a reactive approach or not, I believe it is not the deciding factor. With RxJava and, say, Cassandra, one can create an Observable from an asynchronous result set, and it is up to the query (configuration) how many items are fetched and pushed at a time.

Memory taken by Drools knowledge session

I have a requirement to have the knowledge base partitioned on a per-user basis. All the sessions need to be in memory. In the first phase I tested with 2000 sessions, which take up almost 750 MB of heap memory, with 5 rules in each session. Can somebody tell me how to determine the size of each session and reduce memory consumption, as I need to scale the application to 10,000 users?
You just need to run your application with different numbers of concurrent sessions, plot a graph of sessions vs heap size and extrapolate. No special sauce I can think of relating to Drools specifically.
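A rough sketch of that measurement, assuming a Drools 5.x KnowledgeBase built elsewhere (the class name and session counts are illustrative; a profiler will give more reliable figures than the Runtime memory counters):

import java.util.ArrayList;
import java.util.List;

import org.drools.KnowledgeBase;
import org.drools.runtime.StatefulKnowledgeSession;

public class SessionHeapProbe {

    static void probe(KnowledgeBase kbase) {
        List<StatefulKnowledgeSession> sessions = new ArrayList<StatefulKnowledgeSession>();
        for (int target : new int[] {500, 1000, 2000, 4000}) {
            while (sessions.size() < target) {
                sessions.add(kbase.newStatefulKnowledgeSession());
            }
            System.gc(); // a hint only
            long usedMb = (Runtime.getRuntime().totalMemory()
                    - Runtime.getRuntime().freeMemory()) / (1024 * 1024);
            System.out.println(target + " sessions -> ~" + usedMb + " MB heap");
        }
        for (StatefulKnowledgeSession session : sessions) {
            session.dispose();
        }
    }
}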
Key to this will be the number of facts in each session and the number of joins in your rules. You should read this section of the manual on "Cross Products", which explains how to reduce joins:
http://docs.jboss.org/drools/release/5.5.0.Final/drools-expert-docs/html_single/#d0e941
Also, two questions you should consider:
Is there a way in which you could refactor to use stateless sessions?
Is there a way in which you can have a single session to cater for all users?
Unless you have huge volumes of facts to insert at the start of each user session, or your application is doing some kind of streaming event processing using Fusion, then you should be able to switch to stateless sessions without any serious performance impact.
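A minimal sketch of the stateless alternative (Drools 5.x API, with hypothetical fact objects): no long-lived session is kept per user; the facts go in, the rules fire, and the working memory is discarded immediately.

import java.util.Arrays;

import org.drools.KnowledgeBase;
import org.drools.runtime.StatelessKnowledgeSession;

public class StatelessEvaluation {

    static void evaluate(KnowledgeBase kbase, Object userFact, Object requestFact) {
        StatelessKnowledgeSession session = kbase.newStatelessKnowledgeSession();
        // execute() inserts the facts, fires all rules and releases the working memory
        session.execute(Arrays.asList(userFact, requestFact));
    }
}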
This doesn't show memory usage, but it can help you see all the rules and facts currently in the runtime.
To list the rules:
ksession2.getKnowledgeBase().getKnowledgePackages().each {
    it.rules.each { log.debug "- rule: " + it.name }
}
To list the facts:
for (Object fact : ksession2.getObjects()) {
    sb2z.append(" Fact: " + fact.class.name);
}
I use these scripted snippets to get information about the runtime objects and a JVM memory visualizer to estimate their size.

GWT RequestFactory Performance

I have a question regarding the performance of RequestFactory and GWT. I have a domain entity with 8 fields, and a query returns around 1000 EntityProxies. The time between when the request fires and when it responds is around 20 seconds. If I do the same but return only 10 EntityProxies, the time is 17 seconds, almost the same.
Is this because I'm working in development mode, or will the time be the same when I release the code to the web?
Is there any way to improve the performance? I'm only reading data, so perhaps something that only reads and doesn't write could be the solution?
I read this post with something similar to my problem:
GWT Requestfactory performance suggestions
Thanks a lot.
PS: I read somewhere that one solution could be to create an XML document on the server, send it to the client and recreate the objects there. I don't want to do this because it would really change the design of my app.
Thank you all for the help, I realize now that perhaps using Request Factory to retrieve thousands of records was a mistake.
I initially used a Locator and overrode the isLive() and find() methods according to this post:
gwt-requestfactory-performance-suggestions
The response time was reduced to about 13 seconds, but it is still too high.
But I solved it easily. Instead of returning 1000+ entities, I created a new database table in which each field holds all the values of the corresponding original field (1000+ of them) concatenated with a separator (each DB field has a length of about 10000), so the table holds a single record with around 8 fields.
Something like this:
Field1 | Field2 | Field3
Field1val;Field1val;Field1val;....... | Field2val;Field2val;Field2val;...... | Field3val;Field3val;Field3val;......
I return that one record through RequestFactory to my client, and it reduced the response time a lot, to around 1 second. Parsing this large String on the client takes about 500 ms. So instead of wasting around 20 seconds, it now takes around 1-2 seconds to accomplish the same.
By the way I am only displaying information, it is not necessary to Insert, Delete or Update records so this solution works for me.
Thought I could share this solution.
Performance profiling and fixing issues in GWT is tricky. Avoid all profiling in GWT hosted mode; the numbers do not mean anything useful.
You should profile only in WEB mode.
GWT RequestFactory is by design slower than GWT RPC, GWT JSON, etc. This is a trade-off for RF's ability to calculate deltas and send only a small amount of information to the server on save.
You should recheck your application design to avoid loading thousands of proxies. RF is meant for "form"-like applications. The only reason you might need thousands of proxies is a grid display; you can probably use a paginated async grid in that scenario.
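A rough sketch of that paginated approach: an AsyncDataProvider that only asks the server for the currently visible range. MyRequestFactory, reportRequest(), findRange() and ReportProxy are assumptions about your RequestFactory API, not real names from the question.

import java.util.List;

import com.google.gwt.view.client.AsyncDataProvider;
import com.google.gwt.view.client.HasData;
import com.google.gwt.view.client.Range;
import com.google.web.bindery.requestfactory.shared.Receiver;

public class ReportDataProvider extends AsyncDataProvider<ReportProxy> {

    private final MyRequestFactory requestFactory; // assumed RequestFactory interface

    public ReportDataProvider(MyRequestFactory requestFactory) {
        this.requestFactory = requestFactory;
    }

    @Override
    protected void onRangeChanged(HasData<ReportProxy> display) {
        final Range range = display.getVisibleRange();
        requestFactory.reportRequest()
                .findRange(range.getStart(), range.getLength()) // hypothetical server call
                .fire(new Receiver<List<ReportProxy>>() {
                    @Override
                    public void onSuccess(List<ReportProxy> page) {
                        updateRowData(range.getStart(), page);
                    }
                });
    }
}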
You should profile your app in order to find out how much time is spent on the following steps:
Entities retrieved from the database (server): this can be improved using a second-level cache and optimized queries.
Entities serialized to JSON (server): there is overhead here because RequestFactory and AutoBean rely on reflection. You can try to transmit only the entities that you are actually going to display on the client. Another optimization which greatly reduces latency is to override the isLive method of your EntityLocator and return true (a sketch follows this list).
HTTP request from server to client to transmit the data (wire): you can think about using gzip compression to reduce the amount of data that has to be transferred (important if you send a lot of objects over the wire).
De-serialization on the client (client): This should be quite fast. There was a benchmark that showed that AutoBean serialization was one of the fastest ways to serialize JSON. Again this will benefit from not sending the whole object graph over the wire.
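A hedged sketch of the isLive() optimization mentioned in the serialization step above, assuming a hypothetical domain class MyEntity with Long ids; returning true skips the extra find() that RequestFactory would otherwise issue per entity, which is acceptable for read-only data.

import com.google.web.bindery.requestfactory.shared.Locator;

public class MyEntityLocator extends Locator<MyEntity, Long> {

    @Override public MyEntity create(Class<? extends MyEntity> clazz) { return new MyEntity(); }
    @Override public MyEntity find(Class<? extends MyEntity> clazz, Long id) { return lookup(id); }
    @Override public Class<MyEntity> getDomainType() { return MyEntity.class; }
    @Override public Long getId(MyEntity entity) { return entity.getId(); }
    @Override public Class<Long> getIdType() { return Long.class; }
    @Override public Object getVersion(MyEntity entity) { return entity.getVersion(); }

    // The default implementation calls find(); returning true avoids that extra lookup.
    @Override public boolean isLive(MyEntity entity) { return true; }

    private MyEntity lookup(Long id) { return null; /* assumed DAO lookup */ }
}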
One way to improve performance is to use caching. You can use HTML5 localstorage to cache data on the client. This applies specifically to data that doesn't change often.
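A minimal sketch of that client-side cache using GWT's Storage wrapper; the cache key and the idea of storing the serialized JSON payload are assumptions.

import com.google.gwt.storage.client.Storage;

public class ReportCache {

    private static final String KEY = "report-json"; // hypothetical cache key

    public static String load() {
        Storage store = Storage.getLocalStorageIfSupported();
        return store == null ? null : store.getItem(KEY);
    }

    public static void save(String json) {
        Storage store = Storage.getLocalStorageIfSupported();
        if (store != null) {
            store.setItem(KEY, json);
        }
    }
}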

Should Parallel.ForEach be used in DB calls?

I've got a list of Foo IDs. I need to call a stored procedure for each ID.
e.g.
Guid[] siteIds = ...; // typically contains 100 to 300 elements
foreach (var id in siteIds)
{
db.MySproc(id); // Executes some stored procedure.
}
Each call is pretty independent of the others, so this shouldn't cause contention in the database.
My question: would it be beneficial to parallelize this using Parallel.ForEach? Or is database IO going to be a bottleneck, and more threads would just result in more contention?
I would measure it myself, however, it's difficult to measure this on my test environment where the data and load is much smaller than our real web server.
Out of curiosity, why do you want to optimize it with Parallel.ForEach and spawn threads / open connections / pass data / get a response for every item, instead of writing a simple sproc that works with a list of IDs rather than a single ID?
At first glance, that should give you a much more noticeable improvement.
I would think that the Parallel.ForEach would work, assuming that your DB server can handle the ~150-300 concurrent operations.
The only way to know for sure is to measure both.

What is a good pattern for using AsyncSockets in .net35 when initiating several client connections

I'm re-building an IM gateway and hope to take advantage of the new performance features in AsyncSockets for .net35.
My existing implementation simply creates packets and forwards IM requests from users to the various IM networks as required, handling request/response streams for each connected user's session (socket).
I presently have to cope with IAsyncResult and, as you know, it's not very pretty or scalable.
My confusion is this basically:
1) In using the new Begin/End and SocketAsyncEventArgs in 3.5, do we still need to create one SocketAsyncEventArgs per socket?
2) Do we gain anything by pre-initializing, say, 20000 client connections, since we know the expected max_connections per server is 20000?
3) Do we still need to use a LOH (large object heap) allocated byte[] to handle receive data, as shown in the SocketServers example on MSDN? We are not building a server per se, but are still handling a lot of independent receives for each connected socket.
4) Maybe there is a better pattern altogether for what I'm trying to achieve?
Thanks in advance.
Charles.
1) IAsyncResult/Begin/End is a completely different system from the "xAsync" methods that use SocketAsyncEventArgs. You're better off using SocketAsyncEventArgs and dropping Begin/End entirely.
2) Not really. Initialize a smaller number (50? 100?) and use an intermediate class (i.e. a "resource pool") to manage them. As more requests come in, grow the pool by another 50 or 100, for example. The tough part is efficiently "scaling down" the number of pooled items as resource requirements drop. A large number of sockets/buffers/etc. will consume a large amount of memory, so it's better to allocate only in batches as the server requires it.
3) You don't need to use it, but it's still a good idea. The buffer will still be "pinned" during each call.
