Sitecore - Load-balance Lucene queries - performance

Sitecore.NET 6.6.0 (rev. 130404)
Our production website is very search-heavy, and our Lucene indexes are queried heavily throughout the day. This amounts to a considerable amount of CPU power being spent on Lucene query processing. Are there industry practices for offloading Lucene indexes and queries to a different machine? Or are there any hardware mechanisms that can be used to boost Lucene query performance?
(Our most used Lucene index contains fewer than 10,000 entries.)
Update (more info):
Although our index contains fewer than 10,000 entries, can the CPU usage be caused by the high number of Lucene queries that get executed in parallel? We have a very complex faceted search. Initially, when users tried out various search criteria, we displayed result-count breakdowns alongside all the search options (resulting in 50-60 count queries with each search request). This caused the CPU usage to reach 90-95% during high traffic. When we removed the counts, the CPU stabilized around 20-30%.
Here are the two methods we use for querying:
public static Document[] GetLuceneDocuments(ACIndex acIndex, Query query, Sort sort = null, int maxResults = 999, bool trackScores = false, bool fillFields = true)
{
    Index index = SearchManager.GetIndex(GetIndexName(acIndex));
    if (sort == null)
    {
        sort = new Sort(new SortField(null, SortField.SCORE));
    }
    using (IndexSearchContext searchContext = index.CreateSearchContext())
    {
        Lucene.Net.Search.IndexSearcher searcher = searchContext.Searcher;
        TopFieldCollector collector = TopFieldCollector.create(sort, maxResults, fillFields, trackScores, false, false);
        searcher.Search(query, collector);
        TopDocs topdocs = collector.TopDocs();
        Document[] documents = new Document[topdocs.ScoreDocs.Length];
        for (int i = 0; i < topdocs.ScoreDocs.Length; i++)
        {
            documents[i] = searcher.Doc(topdocs.ScoreDocs[i].doc);
        }
        return documents;
    }
}
public static int GetSearchResultCount(ACIndex acIndex, Query query)
{
    Index index = SearchManager.GetIndex(GetIndexName(acIndex));
    using (IndexSearchContext searchContext = index.CreateSearchContext())
    {
        Lucene.Net.Search.IndexSearcher searcher = searchContext.Searcher;
        TopScoreDocCollector collector = TopScoreDocCollector.create(1, false);
        searcher.Search(query, collector);
        return collector.GetTotalHits();
    }
}

You should look into implementing Solr for your searches. I'm not an expert on the subject, but Solr is Lucene-based (which makes the transition easier) and runs on a central server or servers that handle all your search requirements.
Solr isn't officially supported in versions prior to Sitecore 7, but I have worked on a number of Sitecore 6 solutions that did use Solr.
This article should give you a head start: How to implement Solr into Sitecore
As far as industry practices go, with Sitecore, Solr is the solution to this particular problem. Depending on your implementation, however, it could take some doing to get up and running.

You might look at www.alpha-solutions.dk/sitecore-search-solution for a Solr on Sitecore 6 approach.
Note: I am affiliated with Alpha Solutions

Your index is small. I know there are recommendations that you rearchitect the whole solution; however, I recommend something I have done in the past that has worked well for me and does not require you to provision another server or install another indexing tool like Elasticsearch or Solr.
First, store the fields in the index that you facet on, like below (either in configuration or using a custom crawler):
_group
_path
_creator
Manufacturer
Size
Year
... [other fields]
Create a class that represents a result:
public class MyThing
{
    public string Manufacturer { get; set; }
    public string Size { get; set; }
    public int Year { get; set; }
    public MyThing(Document doc)
    {
        Manufacturer = doc.GetField("Manufacturer").Value;
        Size = doc.GetField("Size").Value;
        Year = int.Parse(doc.GetField("Year").Value);
    }
}
Then take your main search result hits, instantiate your lightweight POCOs, and do the counts off of those. Voila, 1 query!
int countForSomething = results.Count(result=>result.Size == "XL");
NOTE: I kind of wrote this code off the top of my head, but you get the idea. I have used this process on indexes in Lucene up to 700K+ results in Sitecore without much issue. Good luck sir!
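The single-query idea above can be sketched as follows, here in Java rather than C#: run the main search once, map each hit to a lightweight object, and derive every facet count in memory. `Thing`, `FacetCounter`, and `countBy` are hypothetical names used only for illustration; they are not Sitecore or Lucene APIs, and the sample data is made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical lightweight result object; the fields mirror the answer's example.
class Thing {
    final String manufacturer;
    final String size;
    final int year;

    Thing(String manufacturer, String size, int year) {
        this.manufacturer = manufacturer;
        this.size = size;
        this.year = year;
    }
}

class FacetCounter {
    // Groups the already-fetched results by one facet value and counts each group.
    static Map<String, Long> countBy(List<Thing> results, Function<Thing, String> facet) {
        return results.stream()
                .collect(Collectors.groupingBy(facet, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Stand-in for the hits of the single main search query.
        List<Thing> results = Arrays.asList(
                new Thing("Acme", "XL", 2012),
                new Thing("Acme", "M", 2013),
                new Thing("Globex", "XL", 2013));
        System.out.println(countBy(results, t -> t.size));         // counts per size
        System.out.println(countBy(results, t -> t.manufacturer)); // counts per manufacturer
    }
}
```

All facet breakdowns come from one pass over one result list, so the 50-60 separate count queries are no longer needed. The trade-off is that every matching hit has to be loaded into memory first.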

Ah! Just tackled the issue of faceted search and CPU usage myself. This is some border-line black-magic coding and some really creative caching.
We found a way to implement Solr's faceted querying into Lucene, and boy oh boy are the results stunningly fast.
Short version:
Build a static class that holds onto a dictionary. Key: unique representation of an individual filter, Value: the BitArray produced by a Lucene QueryFilter object.
var queryFilter = new QueryFilter(filterBooleanQuery);
var bits = queryFilter.Bits(indexReader);
result[filter.ID.ToString()] = bits;
Build this dictionary periodically and asynchronously in the background. My index of about 80k documents only took about 15 seconds to build, but that's enough to make a lot of users angry, so doing it in a non-blocking manner is crucial.
Query this dictionary using bitwise logic to find the resulting BitArray representing the hits you're looking for.
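// BitArray.And modifies the instance it is called on, so copy the cached
// BitArray first to keep the dictionary entries intact.
var combo =
    new BitArray(facetDictionary[thisFilter.ID.ToString()])
        .And(facetDictionary[selectedFilter.ID.ToString()]);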
Long Version:
http://www.devatwork.nl/articles/lucenenet/faceted-search-and-drill-down-lucenenet/
Now, our implementation was only to get the cardinality of these result sets, but theoretically you could use these bit arrays to get actual documents out of the index as well.
Good luck!
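A minimal sketch of the cached-bitset idea, written in Java with `java.util.BitSet` standing in for .NET's `BitArray`. The class `FacetBitsets`, its methods, and the filter ids are hypothetical; in the real setup each bit set would come from a Lucene filter rather than from hand-picked document ids.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// One BitSet per filter, one bit per document id.
class FacetBitsets {
    private final Map<String, BitSet> bitsByFilter = new HashMap<>();

    // Registers the documents matching a filter (set by hand here for illustration).
    void put(String filterId, int... matchingDocIds) {
        BitSet bits = new BitSet();
        for (int docId : matchingDocIds) {
            bits.set(docId);
        }
        bitsByFilter.put(filterId, bits);
    }

    // Counts documents matching both filters. BitSet.and mutates its receiver,
    // so the cached entry is cloned before combining.
    int countBoth(String filterA, String filterB) {
        BitSet combined = (BitSet) bitsByFilter.get(filterA).clone();
        combined.and(bitsByFilter.get(filterB));
        return combined.cardinality();
    }

    public static void main(String[] args) {
        FacetBitsets cache = new FacetBitsets();
        cache.put("size:XL", 0, 2, 5, 7);
        cache.put("year:2013", 2, 3, 7, 9);
        // Documents 2 and 7 are the only ones present in both sets.
        System.out.println(cache.countBoth("size:XL", "year:2013"));
    }
}
```

Each facet count then costs one bitwise AND plus a population count, with no query against the index at request time.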

Upgrading to Sitecore 7 would give you facets out of the box, abstracted in a nice LINQ API that lets you switch between Lucene and Solr (others, like Elasticsearch, are coming)...

Related

Slow query over large collection

I'm working on an audit log which saves sessions in RavenDB. Initially, the website for querying the audit logs was responsive enough but as the amount of logged data has increased, the search page became unusable (it times out before returning using default settings - regardless of the query used). Right now we have about 45mil sessions in the table that gets queried but steady state is expected to be around 150mil documents.
The problem is that with this much live data, playing around to test things has become impractical. I hope someone can give me some ideas about which areas would be most productive to investigate.
The index looks like this:
public AuditSessions_WithSearchParameters()
{
    Map = sessions => from session in sessions
                      select new Result
                      {
                          ApplicationName = session.ApplicationName,
                          SessionId = session.SessionId,
                          StartedUtc = session.StartedUtc,
                          User_Cpr = session.User.Cpr,
                          User_CprPersonId = session.User.CprPersonId,
                          User_ApplicationUserId = session.User.ApplicationUserId
                      };
    Store(r => r.ApplicationName, FieldStorage.Yes);
    Store(r => r.StartedUtc, FieldStorage.Yes);
    Store(r => r.User_Cpr, FieldStorage.Yes);
    Store(r => r.User_CprPersonId, FieldStorage.Yes);
    Store(r => r.User_ApplicationUserId, FieldStorage.Yes);
}
The essence of the query is this bit:
// Query input parameters
var fromDateUtc = fromDate.ToUniversalTime();
var toDateUtc = toDate.ToUniversalTime();
sessionQuery = sessionQuery
    .Where(s =>
        s.ApplicationName == applicationName &&
        s.StartedUtc >= fromDateUtc &&
        s.StartedUtc <= toDateUtc
    );
var totalItems = Count(sessionQuery);
var sessionData =
    sessionQuery
        .OrderByDescending(s => s.StartedUtc)
        .Skip((page - 1) * PageSize)
        .Take(PageSize)
        .ProjectFromIndexFieldsInto<AuditSessions_WithSearchParameters.ResultWithAuditSession>()
        .Select(s => new
        {
            s.SessionId,
            s.SessionGroupId,
            s.ApplicationName,
            s.StartedUtc,
            s.Type,
            s.ResourceUri,
            s.User,
            s.ImpersonatingUser
        })
        .ToList();
First, to determine the number of pages of results, I count the number of results in my query using this method:
private static int Count<T>(IRavenQueryable<T> results)
{
    RavenQueryStatistics stats;
    results.Statistics(out stats).Take(0).ToArray();
    return stats.TotalResults;
}
This turns out to be very expensive in itself, so optimizations are relevant both here and in the rest of the query.
The query time is not related to the amount of result items in any relevant way. If I use a different value for the applicationName parameter than any of the results, it is just as slow.
One area of improvement could be to use sequential IDs for the sessions. For reasons not relevant to this post, I found it most practical to use guid based ids. I'm not sure if I can easily change IDs of the existing values (with this much data) and I would prefer not to drop the data (but might if the expected impact is large enough). I understand that sequential ids result in better behaving b-trees for the indexes, but I have no idea how significant the impact is.
Another approach could be to include a timestamp in the id and query for documents with ids starting with the string matching enough of the time to filter the result. An example id could be AuditSessions/2017-12-31-24-31-42/bc835d6c-2fba-4591-af92-7aab96339d84. This also requires me to update or drop all the existing data. This of course also has the benefits of mostly sequential ids.
A third approach could be to move old data into a different collection over time, in recognition of the fact that you would most often look at the most recent data. This requires a background job and support for querying across collection time boundaries. It also has the issue that the collection with the old sessions is still slow if you need to access it.
I'm hoping there is something simpler than these solutions, such as modifying the query or the indexed fields in a way that avoids a lot of work.
At a glance, it is probably related to the range query on the StartedUtc.
I'm assuming that you are using exact numbers, so you have a LOT of distinct values there.
If you can, you can dramatically reduce the cost by changing the index to store the value at second or minute granularity (which is usually what you are querying on), and then use Ticks, which allows us to use a numeric range query.
StartedUtcTicks = new DateTime(session.StartedUtc.Year, session.StartedUtc.Month, session.StartedUtc.Day, session.StartedUtc.Hour, session.StartedUtc.Minute, session.StartedUtc.Second).Ticks,
And then query by the date ticks.
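To illustrate why the coarser granularity helps, here is a small Java sketch that truncates timestamps to whole seconds and counts how many distinct indexed values remain. It uses epoch seconds as a stand-in for .NET Ticks; `TimestampGranularity` and its methods are hypothetical names and the sample timestamps are made up.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Arrays;
import java.util.List;

class TimestampGranularity {
    // Truncates a timestamp to whole seconds and returns epoch seconds as the indexed value.
    static long toIndexedSeconds(Instant startedUtc) {
        return startedUtc.truncatedTo(ChronoUnit.SECONDS).getEpochSecond();
    }

    // Number of distinct indexed values, i.e. distinct terms a range query has to consider.
    static long distinctTerms(List<Instant> timestamps) {
        return timestamps.stream()
                .map(TimestampGranularity::toIndexedSeconds)
                .distinct()
                .count();
    }

    public static void main(String[] args) {
        List<Instant> sessions = Arrays.asList(
                Instant.parse("2017-12-31T23:31:42.001Z"),
                Instant.parse("2017-12-31T23:31:42.500Z"),
                Instant.parse("2017-12-31T23:31:43.250Z"));
        long raw = sessions.stream().distinct().count();
        System.out.println(raw + " distinct raw values, "
                + distinctTerms(sessions) + " after truncation");
    }
}
```

With tens of millions of sessions, sub-second timestamps are almost all unique, while second or minute granularity caps the number of distinct values at the number of seconds or minutes in the covered period.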

Can I get all of the terms and docId lists from ElasticSearch

How can I get all of the terms and doc lists in ES? For example, the inverted index data looks like the following:
word1: doc1,doc5,doc6...
word2: doc3,doc9,doc12...
word3: doc5,doc100...
I just want to get all of the terms and their corresponding doc lists. Is there any API I can use to do this? Thanks!
In order to retrieve this, you should understand a little bit about how Lucene operates. In Lucene, the index is structured (as you seem to know) as Fields->Terms->PostingLists (represented as PostingsEnums).
To retrieve these values, you can use this as a template Lucene tool (assuming you have access to the base reader, an AtomicReader):
// get every one of the fields in the reader
Fields fields = MultiFields.getFields(reader);
for (String field : fields) {
    // get the Terms for the field
    TermsEnum terms = fields.terms(field).iterator(null);
    // a term is represented by a BytesRef in Lucene
    // and we will iterate across all of them using
    // the TermsEnum syntax (read the Lucene docs for this)
    BytesRef t;
    while ((t = terms.next()) != null) {
        // get the PostingsEnum (note that this is called
        // DocsEnum in Lucene 4.X) which represents an enumeration
        // of all of the documents for the Term t
        PostingsEnum docs = terms.postings(null, null);
        String line = String.format("%s: ", t);
        while (docs.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
            line += String.valueOf(docs.docID());
            line += ", ";
        }
        System.out.println(line);
    }
}
I haven't actually had a chance to run this code exactly as is (I have a similar tool I've written for my specific fork of Lucene to compare Indexes), but hopefully this gives you the general knowledge of the structure of Lucene so that you can write your own tool.
The tricky part will be getting the explicit AtomicReader from your index - but I'm sure there are other StackOverflow answers to help you with that! (As a little hint you might want to look at opening your index with DirectoryReader#open(File f)#leaves())
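For readers unfamiliar with the term -> posting list shape described above, here is a toy in-memory inverted index in plain Java. It only mirrors the data structure; it does not use or imitate Lucene's or Elasticsearch's API, and `ToyInvertedIndex` is a hypothetical name.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;

class ToyInvertedIndex {
    // term -> sorted set of document ids containing that term
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    // Splits the text on whitespace and records the document id under each lowercased term.
    void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns the sorted document ids for a term, or an empty set if the term is unknown.
    TreeSet<Integer> docsFor(String term) {
        return postings.getOrDefault(term, new TreeSet<>());
    }

    // Renders every term with its posting list, one "term: id, id" line per term.
    String dump() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, TreeSet<Integer>> entry : postings.entrySet()) {
            String ids = entry.getValue().stream()
                    .map(String::valueOf)
                    .collect(Collectors.joining(", "));
            out.append(entry.getKey()).append(": ").append(ids).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, "red fox");
        index.add(5, "red dog");
        System.out.print(index.dump());
    }
}
```

The Lucene template walks the same structure, with one extra level on top: each field has its own term dictionary.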

Spring + Hibernate: Query Plan Cache Memory usage

I'm programming an application with the latest version of Spring Boot. I recently ran into problems with a growing heap that cannot be garbage collected. The analysis of the heap with Eclipse MAT showed that, within one hour of running the application, the heap grew to 630MB, with Hibernate's SessionFactoryImpl using more than 75% of the whole heap.
I was looking for possible sources around the Query Plan Cache, but the only thing I found was this, and it did not help. The properties were set like this:
spring.jpa.properties.hibernate.query.plan_cache_max_soft_references=1024
spring.jpa.properties.hibernate.query.plan_cache_max_strong_references=64
The database queries are all generated by the Spring's Query magic, using repository interfaces like in this documentation. There are about 20 different queries generated with this technique. No other native SQL or HQL are used.
Sample:
@Transactional
public interface TrendingTopicRepository extends JpaRepository<TrendingTopic, Integer> {
    List<TrendingTopic> findByNameAndSource(String name, String source);
    List<TrendingTopic> findByDateBetween(Date dateStart, Date dateEnd);
    Long countByDateBetweenAndName(Date dateStart, Date dateEnd, String name);
}
or
List<SomeObject> findByNameAndUrlIn(String name, Collection<String> urls);
as example for IN usage.
Question is: Why does the query plan cache keep growing (it does not stop, it ends in a full heap) and how to prevent this? Did anyone encounter a similar problem?
Versions:
Spring Boot 1.2.5
Hibernate 4.3.10
I've hit this issue as well. It basically boils down to having variable number of values in your IN clause and Hibernate trying to cache those query plans.
There are two great blog posts on this topic.
The first:
Using Hibernate 4.2 and MySQL in a project with an in-clause query
such as: select t from Thing t where t.id in (?)
Hibernate caches these parsed HQL queries. Specifically the Hibernate
SessionFactoryImpl has QueryPlanCache with queryPlanCache and
parameterMetadataCache. But this proved to be a problem when the
number of parameters for the in-clause is large and varies.
These caches grow for every distinct query. So this query with 6000
parameters is not the same as 6001.
The in-clause query is expanded to the number of parameters in the
collection. Metadata is included in the query plan for each parameter
in the query, including a generated name like x10_, x11_ , etc.
Imagine 4000 different variations in the number of in-clause parameter
counts, each of these with an average of 4000 parameters. The query
metadata for each parameter quickly adds up in memory, filling up the
heap, since it can't be garbage collected.
This continues until all different variations in the query parameter
count is cached or the JVM runs out of heap memory and starts throwing
java.lang.OutOfMemoryError: Java heap space.
Avoiding in-clauses is an option, as well as using a fixed collection
size for the parameter (or at least a smaller size).
For configuring the query plan cache max size, see the property
hibernate.query.plan_cache_max_size, defaulting to 2048 (easily too
large for queries with many parameters).
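The mechanism described in this quote can be sketched in a few lines of Java: each distinct parameter count yields a distinct query string, and each distinct string becomes its own cache key. `InClauseExpansion` is a hypothetical helper, and the placeholder naming only imitates the style mentioned above; it is not Hibernate's exact output.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.StringJoiner;

class InClauseExpansion {
    // Builds the query text after expanding an IN clause to one placeholder per value.
    static String expand(int parameterCount) {
        StringJoiner placeholders = new StringJoiner(", ", "(", ")");
        for (int i = 0; i < parameterCount; i++) {
            placeholders.add(":id" + i + "_");
        }
        return "select t from Thing t where t.id in " + placeholders;
    }

    // Counts how many distinct cache keys a sequence of IN-list sizes would create.
    static int distinctPlans(int... listSizes) {
        Set<String> cacheKeys = new HashSet<>();
        for (int size : listSizes) {
            cacheKeys.add(expand(size));
        }
        return cacheKeys.size();
    }

    public static void main(String[] args) {
        System.out.println(expand(2));
        // Sizes 3, 4 and 5 each produce their own key; repeated sizes reuse one.
        System.out.println(distinctPlans(3, 3, 4, 5, 5));
    }
}
```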
And second (also referenced from the first):
Hibernate internally uses a cache that maps HQL statements (as
strings) to query plans. The cache consists of a bounded map limited
by default to 2048 elements (configurable). All HQL queries are loaded
through this cache. In case of a miss, the entry is automatically
added to the cache. This makes it very susceptible to thrashing - a
scenario in which we constantly put new entries into the cache without
ever reusing them and thus preventing the cache from bringing any
performance gains (it even adds some cache management overhead). To
make things worse, it is hard to detect this situation by chance - you
have to explicitly profile the cache in order to notice that you have
a problem there. I will say a few words on how this could be done
later on.
So the cache thrashing results from new queries being generated at
high rates. This can be caused by a multitude of issues. The two most
common that I have seen are - bugs in hibernate which cause parameters
to be rendered in the JPQL statement instead of being passed as
parameters and the use of an "in" - clause.
Due to some obscure bugs in hibernate, there are situations when
parameters are not handled correctly and are rendered into the JPQL
query (as an example check out HHH-6280). If you have a query that is
affected by such defects and it is executed at high rates, it will
thrash your query plan cache because each JPQL query generated is
almost unique (containing IDs of your entities for example).
The second issue lays in the way that hibernate processes queries with
an "in" clause (e.g. give me all person entities whose company id
field is one of 1, 2, 10, 18). For each distinct number of parameters
in the "in"-clause, hibernate will produce a different query - e.g.
select x from Person x where x.company.id in (:id0_) for 1 parameter,
select x from Person x where x.company.id in (:id0_, :id1_) for 2
parameters and so on. All these queries are considered different, as
far as the query plan cache is concerned, resulting again in cache
thrashing. You could probably work around this issue by writing a
utility class to produce only certain number of parameters - e.g. 1,
10, 100, 200, 500, 1000. If you, for example, pass 22 parameters, it
will return a list of 100 elements with the 22 parameters included in
it and the remaining 78 parameters set to an impossible value (e.g. -1
for IDs used for foreign keys). I agree that this is an ugly hack but
could get the job done. As a result you will only have at most 6
unique queries in your cache and thus reduce thrashing.
So how do you find out that you have the issue? You could write some
additional code and expose metrics with the number of entries in the
cache e.g. over JMX, tune logging and analyze the logs, etc. If you do
not want to (or can not) modify the application, you could just dump
the heap and run this OQL query against it (e.g. using mat): SELECT l.query.toString() FROM INSTANCEOF org.hibernate.engine.query.spi.QueryPlanCache$HQLQueryPlanKey l. It
will output all queries currently located in any query plan cache on
your heap. It should be pretty easy to spot whether you are affected
by any of the aforementioned problems.
As far as the performance impact goes, it is hard to say as it depends
on too many factors. I have seen a very trivial query causing 10-20 ms
of overhead spent in creating a new HQL query plan. In general, if
there is a cache somewhere, there must be a good reason for that - a
miss is probably expensive so you should try to avoid misses as much
as possible. Last but not least, your database will have to handle
large amounts of unique SQL statements too - causing it to parse them
and maybe create different execution plans for every one of them.
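The padding workaround from the quoted post could look roughly like this in Java. The bucket sizes are the ones listed in the quote; `InClauseBuckets` and `pad` are hypothetical names, and the caller has to supply a value that can never match a real row (the quote suggests -1 for ids).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class InClauseBuckets {
    // Bucket sizes taken from the quoted post.
    private static final int[] BUCKETS = {1, 10, 100, 200, 500, 1000};

    // Pads ids up to the next bucket size using a value that can never match a real row.
    static List<Long> pad(List<Long> ids, long impossibleValue) {
        for (int bucket : BUCKETS) {
            if (ids.size() <= bucket) {
                List<Long> padded = new ArrayList<>(ids);
                while (padded.size() < bucket) {
                    padded.add(impossibleValue);
                }
                return padded;
            }
        }
        throw new IllegalArgumentException("More than 1000 ids; split the query into batches");
    }

    public static void main(String[] args) {
        List<Long> padded = pad(Arrays.asList(4L, 8L, 15L), -1L);
        // Three ids fall into the 10-element bucket.
        System.out.println(padded.size() + " parameters: " + padded);
    }
}
```

Because every call lands in one of six sizes, at most six variants of the query can reach the plan cache.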
I have the same problem with many (>10000) parameters in IN queries. The number of parameters is always different and I cannot predict it, so my QueryPlanCache grows too fast.
For database systems supporting execution plan caching, there's a better chance of hitting the cache if the number of possible IN clause parameters lowers.
Fortunately, Hibernate 5.2.18 and higher has a solution: padding the parameters in the IN clause.
Hibernate can expand the bind parameters to power-of-two: 4, 8, 16, 32, 64.
This way, an IN clause with 5, 6, or 7 bind parameters will use the 8 IN clause, therefore reusing its execution plan.
To activate this feature, set the property hibernate.query.in_clause_parameter_padding=true.
For more information see this article, atlassian.
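The rounding rule described above (5, 6, or 7 parameters all use the 8-parameter form) amounts to taking the next power of two. A minimal Java sketch of that arithmetic follows; it is an illustration of the rule, not Hibernate's own code, and `PowerOfTwoPadding` is a hypothetical name.

```java
class PowerOfTwoPadding {
    // Smallest power of two that is >= n, for 1 <= n <= 2^30.
    static int paddedSize(int n) {
        if (n < 1 || n > (1 << 30)) {
            throw new IllegalArgumentException("n must be between 1 and 2^30");
        }
        int highest = Integer.highestOneBit(n);
        return highest == n ? n : highest << 1;
    }

    public static void main(String[] args) {
        for (int n = 5; n <= 9; n++) {
            System.out.println(n + " parameters -> " + paddedSize(n) + " placeholders");
        }
    }
}
```

The number of distinct IN-clause shapes then grows with the logarithm of the largest list size instead of linearly.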
I had the exact same problem using Spring Boot 1.5.7 with Spring Data (Hibernate) and the following config solved the problem (memory leak):
spring:
  jpa:
    properties:
      hibernate:
        query:
          plan_cache_max_size: 64
          plan_parameter_metadata_max_size: 32
Starting with Hibernate 5.2.12, you can specify a hibernate configuration property to change how literals are to be bound to the underlying JDBC prepared statements by using the following:
hibernate.criteria.literal_handling_mode=BIND
From the Java documentation, this configuration property has 3 settings
AUTO (default)
BIND - Increases the likelihood of jdbc statement caching using bind parameters.
INLINE - Inlines the values rather than using parameters (be careful of SQL injection).
I had a similar issue. The problem is that you are building the query yourself rather than using a PreparedStatement, so for each query with different parameters an execution plan is created and cached.
If you use a prepared statement then you should see a major improvement in the memory being used.
TL;DR: Try to replace the IN() queries with ANY() or eliminate them
Explanation:
If a query contains IN(...) then a plan is created for each amount of values inside IN(...), since the query is different each time.
So if you have IN('a','b','c') and IN ('a','b','c','d','e') - those are two different query strings/plans to cache. This answer tells more about it.
In case of ANY(...) a single (array) parameter can be passed, so the query string will remain the same and the prepared statement plan will be cached once (example given below).
Cause:
This line might cause the issue:
List<SomeObject> findByNameAndUrlIn(String name, Collection<String> urls);
as under the hood it generates different IN() queries for every amount of values in "urls" collection.
Warning:
You may have IN() query without writing it and even without knowing about it.
ORM's such as Hibernate may generate them in the background - sometimes in unexpected places and sometimes in a non-optimal ways.
So consider enabling query logs to see the actual queries you have.
Fix:
Here is a (pseudo)code that may fix issue:
String query = "SELECT * FROM trending_topic t WHERE t.name = ? AND t.url = ANY(?)";
PreparedStatement preparedStatement = connection.prepareStatement(query);
preparedStatement.setString(1, name); // safely bind the first query parameter to name
preparedStatement.setArray(2, connection.createArrayOf("text", urls.toArray())); // bind the 2nd parameter to an array of texts, like "= ANY(ARRAY['aaa','bbb'])"
But:
Don't take any solution as a ready-to-use answer. Make sure to test the final performance on actual/big data before going to production - no matter which answer you choose.
Why? Because IN and ANY both have pros and cons, and they can bring serious performance issues if used improperly (see examples in references below). Also make sure to use parameter binding to avoid security issues as well.
References:
100x faster Postgres performance by changing 1 line - performance of Any(ARRAY[]) vs ANY(VALUES())
Index not used with =any() but used with in - different performance of IN and ANY
Understanding SQL Server query plan cache
Hope this helps. Please leave feedback on whether it worked or not, to help others with the same problem. Thanks!
I had a big issue with this queryPlanCache, so I wrote a Hibernate cache monitor to see the queries in the queryPlanCache.
I run it in the QA environment as a Spring task every 5 minutes.
With it I found which IN queries I had to change to solve my cache problem.
One detail: I am using Hibernate 4.2.18 and I don't know whether it will work with other versions.
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import org.hibernate.ejb.HibernateEntityManagerFactory;
import org.hibernate.internal.SessionFactoryImpl;
import org.hibernate.internal.util.collections.BoundedConcurrentHashMap;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CacheMonitor {
    private final Logger logger = LoggerFactory.getLogger(getClass());

    @PersistenceContext(unitName = "MyPU")
    private void setEntityManager(EntityManager entityManager) {
        HibernateEntityManagerFactory hemf = (HibernateEntityManagerFactory) entityManager.getEntityManagerFactory();
        sessionFactory = (SessionFactoryImpl) hemf.getSessionFactory();
        fillQueryMaps();
    }

    private SessionFactoryImpl sessionFactory;
    private BoundedConcurrentHashMap queryPlanCache;
    private BoundedConcurrentHashMap parameterMetadataCache;

    /*
     * I tried to use a Map and compare with compareToIgnoreCase.
     * But remember this is causing a memory leak. Doing this,
     * you will exhaust the memory faster than it already was.
     */
    public void log() {
        if (!logger.isDebugEnabled()) {
            return;
        }
        if (queryPlanCache != null) {
            long cacheSize = queryPlanCache.size();
            logger.debug(String.format("QueryPlanCache size is :%s ", Long.toString(cacheSize)));
            for (Object key : queryPlanCache.keySet()) {
                int filterKeysSize = 0;
                // QueryPlanCache.HQLQueryPlanKey (Inner Class)
                Object queryValue = getValueByField(key, "query", false);
                if (queryValue == null) {
                    // NativeSQLQuerySpecification
                    queryValue = getValueByField(key, "queryString");
                    filterKeysSize = ((Set) getValueByField(key, "querySpaces")).size();
                    if (queryValue != null) {
                        writeLog(queryValue, filterKeysSize, false);
                    }
                } else {
                    filterKeysSize = ((Set) getValueByField(key, "filterKeys")).size();
                    writeLog(queryValue, filterKeysSize, true);
                }
            }
        }
        if (parameterMetadataCache != null) {
            long cacheSize = parameterMetadataCache.size();
            logger.debug(String.format("ParameterMetadataCache size is :%s ", Long.toString(cacheSize)));
            for (Object key : parameterMetadataCache.keySet()) {
                logger.debug("Query:{}", key);
            }
        }
    }

    private void writeLog(Object query, Integer size, boolean b) {
        if (query == null || query.toString().trim().isEmpty()) {
            return;
        }
        StringBuilder builder = new StringBuilder();
        builder.append(b ? "JPQL " : "NATIVE ");
        builder.append("filterKeysSize").append(":").append(size);
        builder.append("\n").append(query).append("\n");
        logger.debug(builder.toString());
    }

    private void fillQueryMaps() {
        Field queryPlanCacheSessionField = null;
        Field queryPlanCacheField = null;
        Field parameterMetadataCacheField = null;
        try {
            queryPlanCacheSessionField = searchField(sessionFactory.getClass(), "queryPlanCache");
            queryPlanCacheSessionField.setAccessible(true);
            queryPlanCacheField = searchField(queryPlanCacheSessionField.get(sessionFactory).getClass(), "queryPlanCache");
            queryPlanCacheField.setAccessible(true);
            parameterMetadataCacheField = searchField(queryPlanCacheSessionField.get(sessionFactory).getClass(), "parameterMetadataCache");
            parameterMetadataCacheField.setAccessible(true);
            queryPlanCache = (BoundedConcurrentHashMap) queryPlanCacheField.get(queryPlanCacheSessionField.get(sessionFactory));
            parameterMetadataCache = (BoundedConcurrentHashMap) parameterMetadataCacheField.get(queryPlanCacheSessionField.get(sessionFactory));
        } catch (Exception e) {
            logger.error("Failed fillQueryMaps", e);
        } finally {
            // searchField may have returned null, so guard before restoring accessibility
            if (queryPlanCacheSessionField != null) {
                queryPlanCacheSessionField.setAccessible(false);
            }
            if (queryPlanCacheField != null) {
                queryPlanCacheField.setAccessible(false);
            }
            if (parameterMetadataCacheField != null) {
                parameterMetadataCacheField.setAccessible(false);
            }
        }
    }

    private <T> T getValueByField(Object toBeSearched, String fieldName) {
        return getValueByField(toBeSearched, fieldName, true);
    }

    @SuppressWarnings("unchecked")
    private <T> T getValueByField(Object toBeSearched, String fieldName, boolean logErro) {
        Boolean accessible = null;
        Field f = null;
        try {
            f = searchField(toBeSearched.getClass(), fieldName, logErro);
            accessible = f.isAccessible();
            f.setAccessible(true);
            return (T) f.get(toBeSearched);
        } catch (Exception e) {
            if (logErro) {
                logger.error("Field: {} error trying to get for: {}", fieldName, toBeSearched.getClass().getName());
            }
            return null;
        } finally {
            if (accessible != null) {
                f.setAccessible(accessible);
            }
        }
    }

    private Field searchField(Class<?> type, String fieldName) {
        return searchField(type, fieldName, true);
    }

    private Field searchField(Class<?> type, String fieldName, boolean log) {
        List<Field> fields = new ArrayList<Field>();
        for (Class<?> c = type; c != null; c = c.getSuperclass()) {
            fields.addAll(Arrays.asList(c.getDeclaredFields()));
            for (Field f : c.getDeclaredFields()) {
                if (fieldName.equals(f.getName())) {
                    return f;
                }
            }
        }
        if (log) {
            logger.warn("Field: {} not found for type: {}", fieldName, type.getName());
        }
        return null;
    }
}
We also had a QueryPlanCache with growing heap usage. We had IN-queries which we rewrote, and additionally we have queries which use custom types. Turned out that the Hibernate class CustomType didn't properly implement equals and hashCode thereby creating a new key for every query instance. This is now solved in Hibernate 5.3.
See https://hibernate.atlassian.net/browse/HHH-12463.
You still need to properly implement equals/hashCode in your userTypes to make it work properly.
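Why a missing equals/hashCode makes a cache grow can be shown with a plain `HashMap`. In this Java sketch, `IdentityKey` and `ValueKey` are hypothetical stand-ins for a user type that takes part in a cache key; neither is a Hibernate class.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class CacheKeyDemo {
    // Key that relies on Object identity: every new instance becomes a new map entry.
    static class IdentityKey {
        final String typeName;
        IdentityKey(String typeName) { this.typeName = typeName; }
    }

    // Key with value-based equals/hashCode: equal instances share one map entry.
    static class ValueKey {
        final String typeName;
        ValueKey(String typeName) { this.typeName = typeName; }

        @Override
        public boolean equals(Object o) {
            return o instanceof ValueKey && typeName.equals(((ValueKey) o).typeName);
        }

        @Override
        public int hashCode() {
            return Objects.hash(typeName);
        }
    }

    // Simulates repeated lookups that each construct a fresh key of the same logical type.
    static int entriesWithIdentityKeys(int lookups) {
        Map<IdentityKey, String> cache = new HashMap<>();
        for (int i = 0; i < lookups; i++) {
            cache.put(new IdentityKey("MoneyType"), "plan");
        }
        return cache.size();
    }

    static int entriesWithValueKeys(int lookups) {
        Map<ValueKey, String> cache = new HashMap<>();
        for (int i = 0; i < lookups; i++) {
            cache.put(new ValueKey("MoneyType"), "plan");
        }
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println("identity keys: " + entriesWithIdentityKeys(1000));
        System.out.println("value keys:    " + entriesWithValueKeys(1000));
    }
}
```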
We faced this issue too: the query plan cache was growing too fast, and the old-gen heap grew along with it because the GC was unable to collect it. The culprit was a JPA query taking more than 200000 ids in the IN clause. To optimise the query, we used joins instead of fetching ids from one table and passing them into the select query on the other table.

Querying RavenDb with max 30 requests error

I just want to get some ideas from anyone who has encountered similar problems, and to hear how you came up with a solution.
Basically, we have around 10K documents stored in RavenDB, and we need to allow users to filter and search those documents. I am aware that RavenDB has a maximum page size of 1024, so in order for the filter and search to work, I need to do my own paging. But my solution gives me the following error:
The maximum number of requests (30) allowed for this session has been reached.
I have tried many different ways of disposing the session, both by wrapping it in a using block and by explicitly calling Dispose after every call to RavenDB, with no success.
Does anyone know how to get around this issue? What's the best practice for this kind of scenario?
var pageSize = 1024;
var skipSize = 0;
var maxSize = 0;
using (_documentSession)
{
    maxSize = _documentSession.Query<LogEvent>().Count();
}
while (skipSize < maxSize)
{
    using (_documentSession)
    {
        var events = _documentSession.Query<LogEvent>().Skip(skipSize).Take(pageSize).ToList();
        _documentSession.Dispose();
        //building finalPredicate codes..... which i am not providing here....
        results.AddRange(events.Where(finalPredicate.Compile()).ToList());
        skipSize += pageSize;
    }
}
Raven limits the number of requests (Load, Query, ...) to 30 per session. This behavior is documented.
I can see that you dispose the session in your code, but I don't see where you are recreating the session. Anyway, loading data the way you intend to is not a good idea.
We're using indexes and paging and never load more than 1024.
If you're expecting thousands of documents, or your predicate logic doesn't work as an index, and you don't care how long your query will take, use the unbounded results API.
var results = new List<LogEvent>();
var query = session.Query<LogEvent>();
using (var enumerator = session.Advanced.Stream(query))
{
    while (enumerator.MoveNext())
    {
        if (predicate(enumerator.Current.Document))
        {
            results.Add(enumerator.Current.Document);
        }
    }
}
Depending on the number of documents, this will use a lot of RAM.
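For comparison with the paging approach mentioned above, here is a language-neutral sketch in Java of reading a collection page by page and filtering on the client. The in-memory list stands in for the remote store and each `subList` call models one request; `PagedFilter` and its methods are hypothetical and have nothing to do with the RavenDB client API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class PagedFilter {
    // Reads the source one page at a time and keeps only the items matching the predicate.
    static <T> List<T> filterInPages(List<T> source, int pageSize, Predicate<T> predicate) {
        List<T> results = new ArrayList<>();
        for (int skip = 0; skip < source.size(); skip += pageSize) {
            int end = Math.min(skip + pageSize, source.size());
            for (T item : source.subList(skip, end)) {
                if (predicate.test(item)) {
                    results.add(item);
                }
            }
        }
        return results;
    }

    // Number of page requests needed to read totalItems items (ceiling division).
    static int requestsNeeded(int totalItems, int pageSize) {
        return (totalItems + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        List<Integer> source = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            source.add(i);
        }
        System.out.println(filterInPages(source, 1024, n -> n % 2 == 0).size() + " matches");
        System.out.println(requestsNeeded(10000, 1024) + " requests for 10,000 documents");
    }
}
```

The request count grows with the size of the collection, which is why pushing the filter into an index, or streaming, scales better than client-side paging.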

Neo4j formulating Cypherquery, performance issue, multiple startpoint

I have to execute the following Query:
@Query("START whps=node:__types__(className = 'de.adma.domain.WHProcessStep'),
        csd=node:__types__(className = 'de.adma.domain.CSDocument'),
        whm=node:__types__(className = 'de.adma.domain.WHMachine')
        MATCH whps<-[r1:RELATES_TO]-csd<-[r2:OCCURS_IN]-whm
        WHERE (whps.id IN {0}) AND (csd.id IN {1})
        RETURN DISTINCT whm ")
Each of these classes (CSDocument, WHMachine, ..) has the same scaffold:
@NodeEntity
public class CSDocument {
    @GraphId
    Long nodeId;
    @Indexed(unique = true)
    String id;
    @Indexed(indexType = IndexType.FULLTEXT, indexName = "accessUri")
    String accessUri;
    // .. definition of some RelatedToVia-Relationships and getter/setters
}
Is the query as formulated the correct way to query neo4j?
Currently this works fine for small amounts of data, up to ~100k nodes/relationships (the query needs <5 seconds).
I need this for ~10 million nodes/relationships, but then the query runs for several minutes.
My test environment is a VM, Xeon 2,18Ghz (hexacore), 32GB Ram, SSD.
JVM config:
-Xmx14000m
-XX:MaxPermSize=4048m
-Xss3068m
-XX:+UseConcMarkSweepGC
I am using Neo4j embedded 1.8.1 inside a Java Spring application.
Any ideas how I could improve the performance?
Is there another way to handle the multiple start points when using the IN statement? It seems that these multiple starting points slow down the queries.
Do I have to define an index?
Thanks!
Use index lookup on your id-index:
@Query("START whps=node:WHProcessStep(id = {0}),
        csd=node:CSDocument(id = {1})
        MATCH whps<-[:RELATES_TO]-csd<-[:OCCURS_IN]-whm
        RETURN DISTINCT whm ")
If you want to pass multiple ids to the index, you unfortunately have to pass the whole index query as parameters to your method:
@Query("START whps=node:WHProcessStep({0}),
        csd=node:CSDocument({1})
        MATCH whps<-[:RELATES_TO]-csd<-[:OCCURS_IN]-whm
        RETURN DISTINCT whm ")
Collection<WHMachine> find(String whps, String csd);
where the two strings are: String whps = "id:(id1 id2 id3)";
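Building that parameter string from a collection of ids is a one-liner. The helper below is a hypothetical sketch (`IndexQueryBuilder` is not part of Spring Data Neo4j) and assumes the ids contain no whitespace or Lucene special characters; otherwise they would need escaping first.

```java
import java.util.Arrays;
import java.util.Collection;

class IndexQueryBuilder {
    // Builds a Lucene-style query such as "id:(id1 id2 id3)" from a collection of ids.
    static String idQuery(Collection<String> ids) {
        if (ids.isEmpty()) {
            throw new IllegalArgumentException("at least one id is required");
        }
        return "id:(" + String.join(" ", ids) + ")";
    }

    public static void main(String[] args) {
        System.out.println(idQuery(Arrays.asList("id1", "id2", "id3")));
    }
}
```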
Is it better if you simply drop these parts of your START clause?
csd=node:__types__(className = 'de.adma.domain.CSDocument'),
whm=node:__types__(className = 'de.adma.domain.WHMachine')
You're making a cartesian product of all of your start variables and then reducing it in the match. It will be quicker to do type checks after the match.
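To get a feel for the cost, assuming the start points really are combined as a cartesian product as described above, the number of candidate tuples is simply the product of the start set sizes. The Java sketch below only does that arithmetic; `StartPointCost` is a hypothetical name and the sizes in `main` are made-up examples.

```java
class StartPointCost {
    // Number of candidate tuples when every start variable is bound independently.
    static long cartesianSize(long... startSetSizes) {
        long total = 1;
        for (long size : startSetSizes) {
            // multiplyExact throws instead of silently overflowing
            total = Math.multiplyExact(total, size);
        }
        return total;
    }

    public static void main(String[] args) {
        // Three start variables with 1,000 candidate nodes each.
        System.out.println(cartesianSize(1000, 1000, 1000));
        // Only two bound start points, each narrowed to 10 ids via an index lookup.
        System.out.println(cartesianSize(10, 10));
    }
}
```

Dropping a start variable removes one factor from the product, which is why binding only the nodes selected by id and letting MATCH find the rest tends to be much cheaper.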
