Spring Data Repository - Paging large data sets (EclipseLink)

I am using Spring Data with EclipseLink JPA to do server side pagination on a database result set. I have everything working and I get the expected paged results, but I noticed performance suffering on large data sets (several million rows). It is taking about 5 minutes to return a page of 20 results. Perhaps this is to be expected, but what concerned me was the query output.
My log output:
SELECT COUNT(filename) FROM document
SELECT filename, datecaptured, din, docdate, docid, doctype, drawer, foldernumber, format, pagenumber, tempfilename, userid FROM document ORDER BY din ASC
I would understand that in order to page, Spring would need to know the max row count, so the first query makes sense.
The second query is pulling the entire database when I specifically only asked for 20 results with a 0 offset (page).
Does Spring/EclipseLink/JPA in fact grab the entire data set and then only return the subset paged request?
If that is the case, how should I modify my repository class to be more efficient?
My test case:
@Test
public void getPagedDocumentsTest() throws IOException {
    Page<Document> requestedPage = documentRepository.findAll(new PageRequest(0, 20, Sort.Direction.ASC, "din"));
    Assert.assertNotNull("Page is null", requestedPage);
    Assert.assertNotNull("Page is empty", requestedPage.getContent());

    List<Document> documents = requestedPage.getContent();

    LOG.info("{}", documents);
    LOG.info("{}", documents.size());
}
My repository class:
import org.springframework.data.jpa.repository.JpaSpecificationExecutor;
import org.springframework.data.repository.PagingAndSortingRepository;
import org.springframework.stereotype.Repository;
import com.example.data.model.Document;
@Repository
public interface DocumentRepository extends PagingAndSortingRepository<Document, String> {
}
Edit - per @Chris's suggestion
Tried adding the platform to my properties, but it didn't make a difference:
eclipselink.weaving=static
eclipselink.allow-zero-id=true
eclipselink.target-database=SQLServer
eclipselink.logging.level=FINE
Also tried adding it to my configuration (I'm using Java Config):
@Bean
public LocalContainerEntityManagerFactoryBean entityManager() {
    LocalContainerEntityManagerFactoryBean factory = new LocalContainerEntityManagerFactoryBean();
    factory.setPersistenceUnitName("ExampleUnit");
    factory.setPackagesToScan("com.example.data.model");

    EclipseLinkJpaVendorAdapter eclipseLinkVendorAdapter = new EclipseLinkJpaVendorAdapter();
    eclipseLinkVendorAdapter.setDatabase(Database.SQL_SERVER);
    eclipseLinkVendorAdapter.setDatabasePlatform("SQLServer");
    factory.setJpaVendorAdapter(eclipseLinkVendorAdapter);

    factory.setDataSource(dataSource());
    factory.setJpaProperties(jpaProperties());
    factory.setLoadTimeWeaver(new InstrumentationLoadTimeWeaver());
    return factory;
}
Looks like the platform is set correctly.
[EL Config]: connection: 2015-08-06 12:04:05.691--ServerSession(686533955)--Connection(1896042043)--Thread(Thread[main,5,main])--connecting(DatabaseLogin(
platform=>SQLServerPlatform
user name=> ""
connector=>JNDIConnector datasource name=>null
))
But neither helped. The SQL query output remained the same as well.
Edit
Found a related question with a similar answer from @Chris:
EclipseLink generated SQL doesn't include pagination

The EclipseLink 2.5 source that I checked has, I believe, support for database-level filtering built into the following database platform classes:
DB2Platform
DerbyPlatform
FirebirdPlatform
H2Platform
HANAPlatform
HSQLPlatform
MySQLPlatform
OraclePlatform
PostgreSQLPlatform
SymfowarePlatform
Each of these overrides the printSQLSelectStatement method to take advantage of its respective database's features to allow filtering in the SQL itself. Other platforms will need to use JDBC filtering, which depends on the driver to restrict rows - drivers may be able to optimize the queries, but it is driver specific, and I believe it is why your query takes longer than you desire.
I don't know SQLServer well enough to say what equivalent functionality it has that can be used within the SQL, but if you find it, you would need to create a SQLServerPlatform subclass, override the printSQLSelectStatement method as is done in the above classes, and then specify that platform class be used instead. Please also file a bug/feature to have it included in EclipseLink.
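For illustration only, here is a rough, untested sketch of what such a subclass might look like on SQL Server 2012+, whose OFFSET ... FETCH syntax supports paging in SQL. It is modeled on the platforms listed above: the printSQLSelectStatement hook and the setIgnore...Setting calls come from the EclipseLink source, but the class name and the first/max row arithmetic are assumptions you would need to verify against your EclipseLink version.

import org.eclipse.persistence.internal.databaseaccess.DatabaseCall;
import org.eclipse.persistence.internal.expressions.ExpressionSQLPrinter;
import org.eclipse.persistence.internal.expressions.SQLSelectStatement;
import org.eclipse.persistence.platform.database.SQLServerPlatform;

// Hypothetical platform that pushes paging into the generated SQL (SQL Server 2012+).
public class PagingSQLServerPlatform extends SQLServerPlatform {

    @Override
    public void printSQLSelectStatement(DatabaseCall call, ExpressionSQLPrinter printer,
                                        SQLSelectStatement statement) {
        int max = 0;
        int firstRow = 0;
        if (statement.getQuery() != null) {
            max = statement.getQuery().getMaxRows();
            firstRow = statement.getQuery().getFirstResult();
        }
        // No paging requested: keep the default behavior.
        if (max <= 0 && firstRow <= 0) {
            super.printSQLSelectStatement(call, printer, statement);
            return;
        }
        super.printSQLSelectStatement(call, printer, statement);
        // OFFSET/FETCH is only legal after an ORDER BY clause in SQL Server.
        printer.printString(" OFFSET " + firstRow + " ROWS");
        if (max > 0) {
            // Caution: in some code paths EclipseLink treats maxRows as an
            // absolute end row (firstResult + pageSize); adjust as needed.
            printer.printString(" FETCH NEXT " + (max - firstRow) + " ROWS ONLY");
        }
        // Prevent EclipseLink from applying row limits a second time via JDBC.
        call.setIgnoreFirstRowSetting(true);
        call.setIgnoreMaxResultsSetting(true);
    }
}

You would then point EclipseLink at it by passing the fully qualified class name to the target-database property, e.g. eclipselink.target-database=com.example.PagingSQLServerPlatform.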
Other options are described here:
http://wiki.eclipse.org/EclipseLink/Examples/JPA/Pagination

One thing you should consider is whether you actually need to know the number of pages / total number of elements. If you are returning a page from a result set that has millions of elements, chances are your users will not be interested in looking through all those pages either way :). Maybe your front end shows the data in an infinite scroll that just needs to know if there are any more pages, instead of the number of pages.
If any of those cases apply to you, you should consider returning a Slice instead of a Page, as in:
public Slice<MyClass> findByMyField(..);
This way, instead of issuing the expensive count query, Spring Data will just ask for one more element than you originally wanted. If that element is present, the Slice will return true from its hasNext method.
Where I work we recently used Slices for several large data sets and with the right indexes (and after clearing the database cache :) we have seen some really significant gains.

Related

Is it safe to pass a Lucene Query String directly from a user into a QueryParser?

tldr: Can I securely pass a raw query string (retrieved as a URL parameter) into a Lucene QueryParser without any added input sanitization?
I'm not a security expert, but I need some advice. As the title states, is it safe to use this controller method:
@CrossOrigin(origins = "${allowed-origin}")
@GetMapping(value = "/search/{query_string}", produces = MediaType.APPLICATION_JSON_VALUE)
public List doSearch(@PathVariable("query_string") String queryString) {
    return searchQueryHandlerService.doSearch(queryString);
}
In tandem with this service method (the error handling is for testing only):
public List doSearch(String queryString) {
    LOGGER.debug("Parsing query string: " + queryString);
    try {
        Query q = new QueryParser(null, standardAnalyzer).parse(queryString);
        FullTextEntityManager manager = Search.getFullTextEntityManager(entityManager);
        FullTextQuery fullTextQuery = manager.createFullTextQuery(q, Poem.class, Book.class, Section.class);
        return fullTextQuery.getResultList();
    } catch (ParseException e) {
        LOGGER.error(e);
        return Collections.emptyList();
    }
}
With only basic input sanitization? If this isn't safe, are there measures I can take to make it safe?
Any help is greatly appreciated.
I've been looking into this on and off for the last few weeks and I cannot find any reason why it wouldn't be safe, but it's such an obscure question (in an area I'm unfamiliar with) that I may be missing some obvious, fundamental problem that anyone working in the area would see immediately.
A FullTextQuery is always read only, so you don't have to be concerned with people dropping tables or similar issues that you might have to consider when dealing with SQL injection.
But you might want to be careful if you have security restrictions on what data can be seen by your users.
The API also restricts the operation to a certain set of indexes - in your case those containing the Poem entities - so it's also not possible to break out of the chosen indexes.
But you need to consider:
is it OK if the user is able to somehow find a different Poem than the one you expected them to look for
if you share the same index with other entities, there might be some ways to infer data about those other entities
So to be security conscious you might want to:
make sure each entity type gets indexed into its own index (which is the default)
enable some FullTextFilter to restrict the user query based on your custom rules (a sketch follows below)
actually check the content of each result before rendering it, so as to remove content that your other filters didn't catch
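A minimal sketch of the filter idea, assuming Hibernate Search 5's full-text filter API; the filter name, the status field, and the rule itself are invented for illustration:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.hibernate.search.annotations.Factory;
import org.hibernate.search.annotations.FullTextFilterDef;
import org.hibernate.search.annotations.Indexed;

import javax.persistence.Entity;

// Hypothetical factory producing the restriction layered on top of user queries.
public class PublishedOnlyFilterFactory {
    @Factory
    public Query getFilter() {
        // Only match documents whose "status" field was indexed as "published".
        return new TermQuery(new Term("status", "published"));
    }
}

// Declared on the entity, then enabled per query.
@Entity
@Indexed
@FullTextFilterDef(name = "publishedOnly", impl = PublishedOnlyFilterFactory.class)
class PoemSketch { /* fields omitted */ }

The service method would then call fullTextQuery.enableFullTextFilter("publishedOnly") before getResultList(), so the restriction is combined with whatever the user typed. (Note that older Hibernate Search 5 releases expect the factory to return a Lucene Filter rather than a Query; check the version you are on.)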
If you are extremely paranoid, consider that any full-text index can actually reveal a bit about how frequent certain terms are in the whole index. People are normally not too concerned about this as it's extremely hard to take advantage of, and only minimal clues about the data distribution are revealed.
So back to your example: if this index just contains poems and you're OK with allowing any user to see any poem you have stored, giving away clues about which poems you are making available is normally not a security concern but rather the whole point of your service.

What is the difference between a nhibernate query cache and entity cache when using second level caching?

I am trying to set up NHibernate second-level caching, and I am trying to understand the difference between query caching and entity caching. The article I am reading says you need to add
Cache.ReadOnly(); or Cache.ReadWrite();
on every single entity mapping like this:
public class CountryMap : ClassMap<Country>
{
    public CountryMap()
    {
        Table("dropdowns");
        Id(x => x.Id, "pkey");
        Map(x => x.Name, "ddlong");
        Map(x => x.Code, "dddesc");
        Where("ddtype = 'COUNTRY'");
        //Informing NHibernate that the Country entity itself is cache-able.
        Cache.ReadOnly();
    }
}
But when using NHibernate Profiler, I see things hitting the second-level cache even though I don't have this Cache.ReadOnly() value set.
Is that really required? Should I be doing this for every single entity (no matter how often that entity changes)?
If the answer is yes, that I should be doing this for all entities: I saw a page that mentioned there is a risk in marking an entity with this line, as it might lead to the select N+1 query problem if you try to join that entity with other entities in a query. I am using NHibernate Profiler and it looks like some things are hitting the second-level cache just from the code below. In my session setup, I have the following code:
return configuration
    .Mappings(m => m.FluentMappings.AddFromAssemblyOf<ApplicationMap>().Conventions.Add(typeof(Conventions)))
    .ExposeConfiguration(
        c => {
            c.SetProperty("cache.provider_class", "NHibernate.Caches.SysCache.SysCacheProvider, NHibernate.Caches.SysCache");
            c.SetProperty("cache.use_second_level_cache", "true");
            c.SetProperty("cache.use_query_cache", "true");
            c.SetProperty("expiration", "86400");
        })
    .BuildSessionFactory();
and I have a generic "Query" method that does this:
ICriteria c = Session.CreateCriteria(typeof(T));
c.SetCacheable(true);
return c.Future<T>().AsQueryable();
So basically I am trying to confirm whether I set up caching correctly, as I see some second-level cache hits in NHibernate Profiler even though I have not set Cache in the entity mapping code. I am trying to determine whether there are other things I need to do to get caching working (or working better).
When I use NHibernate Profiler (without having Cache.ReadWrite() set at an entity level), it still seems to hit the second-level cache.
The query cache only stores the identifiers of entities returned as the result of a query. The actual entities are stored in the entity cache region. Therefore an entity must be configured as cacheable to be used with the query cache. If the query cache is used without setting entities cacheable, still only the identifiers of query results will be stored in the query cache. As stated in a blog:
Query cache does not cache the state of the actual entities in the result set; it caches only identifier values and results of value type. So the query cache should always be used in conjunction with the second-level cache.
When re-executing the same query, what NHibernate does is get the list of identifiers in the query result from the query cache and fetch each entity from the entity cache; if an entity is not found in the cache, it queries that entity from the database (ending up in multiple queries, one for each entity).
Therefore it is always recommended to use the second-level cache for entities along with the query cache, i.e. you need to specify Cache.ReadOnly(); or Cache.ReadWrite(); in the entity mapping, or else query caching will even further reduce your application's performance by making multiple database queries against one cached query result.
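Putting the two together in the fluent style used above (a sketch; Session and Country follow the names in the question):

// Entity cache: Country instances themselves are cacheable (see CountryMap above).
// Query cache: stores only the matching Country identifiers for this criteria.
ICriteria criteria = Session.CreateCriteria(typeof(Country));
criteria.SetCacheable(true);
IList<Country> countries = criteria.List<Country>();
// Re-running the same query now reads ids from the query cache and each Country
// from the entity cache, only hitting the database for entities missing from it.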
I would like to provide a summary of the (second-level) caching options we have in NHibernate. First of all, there are 4 kinds of settings. These, in fact, represent the real power of very granular caching settings.
<property name="cache.use_second_level_cache"> - Global switch
<property name="cache.use_query_cache"> - Global switch
<cache usage="read-write" region="xxx"/> - Class/Instance level
.SetCacheable(true).SetCacheMode(CacheMode.Normal).SetCacheRegion("yyy") - per Query
The first two are global enablers/disablers. They must be turned on before we can use the last two settings.
But in fact, that is already the answer. Global means: support caching at all. Local means: decide 1) how and 2) which class/query will be handled - or whether it will be cached at all.
For example, this is a snippet from SessionFactory.cs:
...
bool useSecondLevelCache = PropertiesHelper
    .GetBoolean(Environment.UseSecondLevelCache, properties, true);
bool useQueryCache = PropertiesHelper
    .GetBoolean(Environment.UseQueryCache, properties);

if (useSecondLevelCache || useQueryCache)
{
    // The cache provider is needed when we either have second-level cache enabled
    // or query cache enabled. Note that useSecondLevelCache is enabled by default
    settings.CacheProvider = CreateCacheProvider(properties);
}
else
...
Let me explicitly point out the comment:
The cache provider is needed when we either have second-level cache enabled or query cache enabled. Note that useSecondLevelCache is enabled by default
(NOTE: also see that the first PropertiesHelper.GetBoolean() call passes the last default value true)
That would seem to mean that if the third setting (<cache usage="read-write" region="xxx"/>) were not important, all the mapped instances would be cached in an unmanaged, default way...
Luckily, this is not true. The third setting, the class level, is important. It is a must. Without an explicit setting like this:
// xml
<class name="Country" ...>
    <cache usage="read-write" region="ShortTerm" include="non-lazy/all"/>

// fluent
public CountryMap()
{
    Cache.IncludeAll() // or .IncludeNonLazy
        .Region("regionName")
        .NonStrictReadWrite();
the second-level cache (class/instance level) won't be used. And this is great, because it is in our hands how to set it.
Fluent NHibernate - apply it as a convention
There is a Q & A discussing how to apply (or in fact not apply) these settings globally via a special Fluent NHibernate feature, Conventions (in my case, this Q & A was suggested alongside this question):
NHibernate second level cache caching entities with no caching configuration
A small code snippet cited from there:
public class ClassConvention : IClassConvention
{
    public void Apply(IClassInstance instance)
    {
        instance.Table(instance.EntityType.Name);
        instance.LazyLoad();
        instance.Cache.NonStrictReadWrite();
    }
}
Finally, we should mention here that the fourth option (query-level .SetCacheable(true)) should come together with the second-level cache:
19.4. The Query Cache
Query result sets may also be cached. This is only useful for queries that are run frequently with the same parameters. To use the query cache you must first enable it:
<add key="hibernate.cache.use_query_cache" value="true" />
... Note that the query cache does not cache the state of any entities in the result set; it caches only identifier values and results of value type. So the query cache should always be used in conjunction with the second-level cache...
Summary: Caching with NHibernate is really powerful. The reasons are 1) it is pluggable (see how many providers we can use out of the box, e.g. here) and 2) it is configurable. The fact that we can use different concurrency strategies, regions, lazy-loading handling... or even NOT use them... is essential. Some entities are "almost read-only" (e.g. a country code list), while some are highly changing...
The best we can do is to play with it, to experiment. In the end we can have a well-oiled machine with good performance.
Yes, you have to do this for all entities. However, this can be done via XML configuration rather than in the mappings:
Configuring NHibernate second level caching in an MVC app
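For instance, a minimal sketch of the XML route, using NHibernate's class-cache configuration element (the entity and assembly names are placeholders):

<hibernate-configuration xmlns="urn:nhibernate-configuration-2.2">
  <session-factory>
    <!-- ... connection and dialect properties ... -->
    <property name="cache.use_second_level_cache">true</property>
    <property name="cache.use_query_cache">true</property>
    <!-- Marks the entity cacheable without touching its ClassMap -->
    <class-cache class="MyApp.Model.Country, MyApp" usage="read-only" />
  </session-factory>
</hibernate-configuration>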

Is there a way to add a filter to the Entity Framework layer to exclude "IsArchived" records?

I have records marked up as "IsArchived". I am looking for an expedient way to exclude these records from a current MVC3 / EF3 web application.
Is there a way to add some kind of "IsArchived" filter to the EF layer? In my case I have a separate Model project with tables/views represented as POCO entities, and the mappings contained in the CSDL and SSDL files.
Huge thanks for any assistance.
EDIT:
I am using "ObjectContext" and not "DbContext", mainly due to the Data Modelling tool that I am using. This tool creates the context and POCO files.
I am wondering whether I can edit this context file like the following:
public ObjectSet<StdOrg> StdOrg
{
    get
    {
        if ((_StdOrg == null))
        {
            _StdOrg = base.CreateObjectSet<StdOrg>("StdOrg");
            // New line below. Got a cast error between both sides.
            _StdOrg = (ObjectSet<StdOrg>) _StdOrg.Where(r => r.IsArchived == false);
        }
        return _StdOrg;
    }
}
Take a look at this http://www.matthidinger.com/archive/2012/01/25/a-smarter-infrastructure-automatically-filtering-an-ef-4-1-dbset.aspx
Basically it's a filtering DbSet implementation that the example shows being used for soft deletes. We use it without issue in our app.
However, we are using DbContext, so I am not sure how this would work with ObjectContext or how it could be adapted.
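For the ObjectContext route in your edit, the cast fails because Where() returns a plain IQueryable<StdOrg>, not an ObjectSet<StdOrg>. One sketch that sidesteps this - assuming you can rename the generated property, since you are already editing the generated context file - keeps the raw set available and exposes a pre-filtered IQueryable:

using System.Data.Objects;
using System.Linq;

public partial class MyEntities // the generated ObjectContext; name is illustrative
{
    private ObjectSet<StdOrg> _StdOrg;

    // The raw set, still reachable when archived rows are genuinely needed.
    public ObjectSet<StdOrg> StdOrgIncludingArchived
    {
        get
        {
            if (_StdOrg == null)
            {
                _StdOrg = CreateObjectSet<StdOrg>("StdOrg");
            }
            return _StdOrg;
        }
    }

    // What the rest of the app queries. The Where() is composed into every
    // query expression, so the filter runs in SQL rather than in memory.
    public IQueryable<StdOrg> StdOrg
    {
        get { return StdOrgIncludingArchived.Where(r => !r.IsArchived); }
    }
}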

Handle multiple JDBC drivers from the SAME VENDOR

I came across a big problem yesterday. In my current project I use the ojdbc6 implementation of Oracle's JDBC driver for a connection, but I would also need to handle, for example, Oracle 8 databases, which is totally impossible with this JAR.
You would say that I should use ojdbc14 for example, which was true for some tests, but let's assume that later on I will need to handle two kinds of databases from the same vendor, and we know that there is no existing implementation for BOTH, so I need to have both loaded simultaneously. Same interface (and well, not just the same interface: the same class structure, the same classes, just a different implementation inside!), same URL connection prefix -> a JDBC connection will use one driver, but I cannot load multiple of them. So what now?
My first idea was to load the JARs with different classloaders; maybe I could load the same package structure with the same classes separated from each other? I don't really think so, maybe that was a silly idea of mine. This could also be a general problem later, not just with JDBC drivers, so even if you cannot answer my question but you know what is lacking here, please tell me.
Even if I could do a separate loading of class implementations with the same class names, how can I tell the DriverManager, when creating a connection, to use the EXACT driver instead of finding one based on the connection URL's prefix (where I mean jdbc:oracle:thin, for example)?
I feel totally dumb now, because I think this is not an entirely extraordinary idea to handle in the Java world, BUT I totally don't know how to handle it.
Thanks to y'all in advance
You actually have a couple of options:
You can try to load the drivers from different class loaders. That will work if you need only pure JDBC in your application. I doubt that you will get Hibernate to work with such a setup.
Eventually, you will have to run code where you will need to see instances from both classloaders, and there you will get ClassCastExceptions (two classes with the same fully qualified name are different when they were loaded from different classloaders).
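As a sketch of the pure-JDBC case: DriverManager ignores drivers whose class the calling code's classloader cannot see, so the usual trick is a thin delegating driver loaded by your own classloader. The jar path and connection details below are placeholders:

import java.net.URL;
import java.net.URLClassLoader;
import java.sql.Connection;
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.DriverPropertyInfo;
import java.sql.SQLException;
import java.sql.SQLFeatureNotSupportedException;
import java.util.Properties;

// Delegates every call to a Driver instance loaded from a foreign classloader.
public class DriverShim implements Driver {
    private final Driver delegate;

    public DriverShim(Driver delegate) {
        this.delegate = delegate;
    }

    public Connection connect(String url, Properties info) throws SQLException {
        return delegate.connect(url, info);
    }

    public boolean acceptsURL(String url) throws SQLException {
        return delegate.acceptsURL(url);
    }

    public DriverPropertyInfo[] getPropertyInfo(String url, Properties info) throws SQLException {
        return delegate.getPropertyInfo(url, info);
    }

    public int getMajorVersion() { return delegate.getMajorVersion(); }
    public int getMinorVersion() { return delegate.getMinorVersion(); }
    public boolean jdbcCompliant() { return delegate.jdbcCompliant(); }

    public java.util.logging.Logger getParentLogger() throws SQLFeatureNotSupportedException {
        return delegate.getParentLogger();
    }

    public static void main(String[] args) throws Exception {
        // Load the old driver in its own classloader (parent = null isolates it).
        URLClassLoader oldLoader = new URLClassLoader(
                new URL[] { new URL("file:/path/to/ojdbc14.jar") }, null);
        Driver oldDriver = (Driver) Class
                .forName("oracle.jdbc.OracleDriver", true, oldLoader)
                .newInstance();

        // Either register the shim with DriverManager...
        DriverManager.registerDriver(new DriverShim(oldDriver));
        // ...or, since both drivers claim jdbc:oracle:thin, bypass DriverManager
        // entirely and pick the exact driver yourself:
        Properties props = new Properties();
        props.setProperty("user", "scott");
        props.setProperty("password", "tiger");
        Connection con = oldDriver.connect("jdbc:oracle:thin:@old-host:1521:ORCL", props);
        con.close();
    }
}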
You can split your application into two. The second one would be a small server which takes commands from your original app and translates them into JDBC calls for the database. The small server talks to Oracle 8 while your app only talks to one database.
This approach would allow you to keep the two concerns completely separate but you won't be able to run joins on the two databases.
You can link the old Oracle 8 database into your new database using CREATE DATABASE LINK. That makes the old tables visible as if they were part of the new database. Your app only talks to one DB and Oracle handles the details internally.
Maybe Oracle 8 is too old for this to work but I'd definitely give it a try.
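A minimal sketch of that setup (the link name, credentials, and TNS alias are placeholders):

-- In the new database: create a link pointing at the old Oracle 8 instance.
CREATE DATABASE LINK old8
  CONNECT TO scott IDENTIFIED BY tiger
  USING 'OLD8_TNS';

-- The old tables then appear as if they were local.
SELECT pno, price, stock FROM products@old8;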
Oracle JDBC drivers are more compatible than you might expect. When you say "which is totally impossible with this JAR", did you try it? I used an Oracle 10 driver to connect to Oracle 7 in the past. Not every feature was supported, but I could run the standard queries and updates.
For example, you can keep the settings for each database in a properties file and pick the driver at runtime:
# jdbc.properties
oracle.driver=oracle.jdbc.OracleDriver
oracle.url=jdbc:oracle:thin:@//localhost/xe
oracle.user=scott
oracle.password=tiger
mysql.driver=com.mysql.jdbc.Driver
mysql.url=jdbc:mysql://localhost/sales
mysql.user=root
mssql.driver=com.microsoft.sqlserver.jdbc.SQLServerDriver
mssql.url=jdbc:sqlserver://192.168.1.175;databaseName=sales
mssql.user=dbviewer
mssql.password=dbviewer
And then read the properties file:
class QueryTest2 {
    public static void main(String[] args) throws Exception {
        Properties settings = new Properties();
        FileInputStream fin = new FileInputStream("jdbc.properties");
        settings.load(fin);
        fin.close();

        String dvr = settings.getProperty(args[0] + ".driver");
        String url = settings.getProperty(args[0] + ".url");
        String usr = settings.getProperty(args[0] + ".user");
        String pwd = settings.getProperty(args[0] + ".password");

        Class.forName(dvr);
        Connection con = DriverManager.getConnection(url, usr, pwd);
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery("select pno,price,stock from products");
        while (rs.next()) {
            System.out.printf("%d\t%.2f\t%d%n", rs.getInt(1), rs.getDouble(2), rs.getInt("stock"));
        }
        rs.close();
        stmt.close();
        con.close();
    }
}

'Maximum number of expressions in a list is 1000' error with Grails and Oracle

I'm using Grails with an Oracle database. Most of the data in my application is part of a hierarchy that goes something like this (each item containing the following one):
Direction
Group
Building site
Contract
Inspection
Non-conformity
Data visible to a user is filtered according to his accesses which can be at the Direction, Group or Building Site level depending on user role.
We easily accomplished this by creating a listWithSecurity method for the BuildingSite domain class, which we use instead of list across most of the system. We created another listWithSecurity method for Contract; it basically does a Contract.findAllByContractIn(BuildingSite.listWithSecurity). And so on with the other classes. This has the advantage of keeping all the actual access logic in BuildingSite.listWithSecurity.
The problem came when we started getting real data in the system. We quickly hit the "ORA-01795: maximum number of expressions in a list is 1000" error. Fair enough: passing a list of over 1000 literals is not the most efficient thing to do, so I tried other ways, even though it meant I would have to move the security logic into each controller.
The obvious way seemed to be to use a criteria such as this (I only put the Direction-level access here for simplicity):
def c = NonConformity.createCriteria()
def listToReturn = c.list(max: params.max, offset: params.offset?.toInteger() ?: 0) {
    inspection {
        contract {
            buildingSite {
                group {
                    'in'("direction", listOfOneOrTwoDirections)
                }
            }
        }
    }
}
I was expecting Grails to generate a single query with joins that would avoid the ORA-01795 error, but it seems to be calling a separate query for each level and passing the results back to Oracle as literals in an 'in' clause to query the next level. In other words, it does exactly what I was doing, so I get the same error.
Actually, it might be optimising a bit. It seems to be solving the problem but only for one level. In the previous example, I wouldn't get an error for 1001 inspections but I would get it for 1001 contracts or building sites.
I also tried to do basically the same thing with findAll and a single HQL where statement to which I passed a single direction to get the nonConformities in one query. Same thing. It solves the first levels but I get the same error for other levels.
I did manage to patch it by splitting my 'in' criteria into many 'in' clauses inside an 'or', so no single list of literals is more than 1000 long, but that's profoundly ugly code. A single findAllBy[…]In becomes over 10 lines of code. And in the long run, it will probably cause performance problems, since we're stuck doing queries with a very large number of parameters.
Has anyone encountered and solved this problem in a more elegant and efficient way?
This won't win any efficiency awards, but I thought I'd post it as an option for when you just plainly need to query a list of more than 1000 items and none of the more efficient options are available/appropriate. (This Stack Overflow question is at the top of Google search results for "grails oracle 1000".)
In a Grails criteria you can make use of Groovy's collate() method to break up your list...
Instead of this:
def result = MyDomain.createCriteria().list {
    'in'('id', idList)
}
...which throws this exception:
could not execute query
org.hibernate.exception.SQLGrammarException: could not execute query
at grails.orm.HibernateCriteriaBuilder.invokeMethod(HibernateCriteriaBuilder.java:1616)
at TempIntegrationSpec.oracle 1000 expression max in a list(TempIntegrationSpec.groovy:21)
Caused by: java.sql.SQLSyntaxErrorException: ORA-01795: maximum number of expressions in a list is 1000
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:440)
You'll end up with something like this:
def result = MyDomain.createCriteria().list {
    or { idList.collate(1000).each { 'in'('id', it) } }
}
It's unfortunate that Hibernate or Grails doesn't do this for you behind the scenes when you try to do an inList of > 1000 items and you're using an Oracle dialect.
I agree with the many discussions on this topic of refactoring your design to not end up with 1000+ item lists but regardless, the above code will do the job.
Along the same lines as Juergen's comment, I've approached a similar problem by creating a DB view that flattens out user/role access rules at their most granular level (Building Site, in your case?). At a minimum, this view might contain just two columns: a Building Site ID and a user/group name. So, in the case where a user has Direction-level access, he/she would have many rows in the security view - one row for each child Building Site of the Direction(s) that the user is permitted to access.
Then, it would be a matter of creating a read-only GORM class that maps to your security view, joining this to your other domain classes, and filtering using the view's user/role field. With any luck, you'll be able to do this entirely in GORM (a few tips here: http://grails.1312388.n4.nabble.com/Grails-Domain-Class-and-Database-View-td3681188.html)
You might, however, need to have some fun with Hibernate: http://grails.org/doc/latest/guide/hibernate.html
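A minimal sketch of such a read-only GORM class, assuming a view named building_site_access_v with the two columns described above (the view name, columns, and composite key are all illustrative):

// Hypothetical domain class mapped onto the flattened security view.
class BuildingSiteAccess implements Serializable {
    Long buildingSiteId
    String username

    static mapping = {
        table 'building_site_access_v'
        version false   // views have no version column
        id composite: ['buildingSiteId', 'username']
    }
}

Security-aware listings could then join through this class in a criteria or HQL query, keeping the whole restriction in SQL instead of a 1000+ literal 'in' list.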
