Hibernate Search Automatic Indexing - performance

I am developing an application which caters to about 100,000 searches every day. We can safely assume that there are about the same number of updates / insertions / deletions in the database daily. The current application uses native SQL and we intend to migrate it to Hibernate and use Hibernate Search.
As there are continuous changes in the database records, we need to enable automatic indexing. The management has concerns about the performance impact automatic indexing can cause.
It is not possible to have a scheduled batch indexing as the changes in the records have to be available for search as soon as they are changed.
I have searched for performance statistics but have found none.
Can anybody who has already worked on Hibernate Search and faced a similar situation share their thoughts?
Thanks for the help.
Regards,
Shardul.

It might work fine, but it's hard to guess without a baseline. I have experience with even more searches / day and after some fine tuning it works well, but it's impossible to know whether that will apply to your scenario without trying it out.
If normal tuning fails and NRT doesn't prove fast enough, you can always shard the indexes, use a multi-master configuration and plug in a distributed second level cache such as Infinispan. Combined, this architecture can achieve linear scalability, provided you have the time to set it up and reasonable hardware.
It's hard to say what kind of hardware you will need, but it's a safe bet that it will be more efficient than native SQL solutions. I would suggest building a POC and seeing how far you can get on a single node; if the kind of queries you have are a good fit for Lucene you might not need more than a single server. Beware that Lucene is much faster at queries than at updates, so since you estimate you'll have the same number of writes and searches, the problem is unlikely to be in the number of searches/second, but in the writes (updates)/second and the total data (index) size. Recent versions of Hibernate Search introduced a near-real-time (NRT) index manager, which suits such use cases well.
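For the POC, the NRT index manager and index sharding mentioned above are both switched on through configuration. Here is a minimal sketch, assuming the property names used by Hibernate Search 4.x/5.x (`hibernate.search.default.indexmanager` and `hibernate.search.default.sharding_strategy.nbr_of_shards`); check the reference guide for your version before relying on them. The class and method names are made up for illustration.

```java
import java.util.Properties;

class SearchConfigSketch {
    // Builds the settings you would merge into your Hibernate configuration.
    // Property names are assumed from Hibernate Search 4.x/5.x and may differ
    // in other versions.
    static Properties nearRealTimeSettings(int shards) {
        Properties props = new Properties();
        // Use the near-real-time index manager for the default index.
        props.setProperty("hibernate.search.default.indexmanager", "near-real-time");
        if (shards > 1) {
            // Optionally split the default index into several shards.
            props.setProperty(
                "hibernate.search.default.sharding_strategy.nbr_of_shards",
                Integer.toString(shards));
        }
        return props;
    }

    public static void main(String[] args) {
        nearRealTimeSettings(1).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Starting with a single shard on one node gives you the baseline; sharding only makes sense once you have measured that the write rate is the bottleneck.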

Related

Is there any issue with using ElasticSearch instead of a relational database?

As the title says: if I do CRUD directly through Elasticsearch without a relational database (MySQL/PostgreSQL), are there any issues?
I know Elasticsearch is good at searching, but if data is updated frequently, might performance suffer?
And if every update request calls setRefreshPolicy(IMMEDIATE), might that also hurt performance?
ElasticSearch will likely outperform a relational db on similar hardware, though workloads can vary. However, ElasticSearch can do this because it has made certain design decisions that are different than the design decisions of a relational database.
ElasticSearch is eventually consistent. This means that queries immediately after your insert might still get old results. There are things that can be done to mitigate this but nothing will eliminate the possibility.
Prior to version 5.x, ElasticSearch was prone to losing data when bad things happened. The 5.x release was all about making Elastic more robust in that regard, and data loss is no longer the problem it was previously, though the potential for data loss still exists, particularly if you make configuration mistakes.
If you frequently modify documents in ElasticSearch you will generate large numbers of deleted documents as every update generates a new document and marks an old document as deleted. Over time those old documents fall off, or you can force the system to clean them out, but if you are doing rapid modifications this could present a problem for you.
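To make the "every update generates a new document" point concrete, here is a toy model of an append-only index. It is not how ElasticSearch is implemented internally, just a sketch of the bookkeeping: an update appends a new copy and flags the old one as deleted, and the flagged copies only go away when a merge is run. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AppendOnlyIndexSketch {
    private static final class Doc {
        final String body;
        boolean deleted;
        Doc(String body) { this.body = body; }
    }

    // Every copy ever written, including copies flagged as deleted.
    private final List<Doc> segment = new ArrayList<>();
    // The current (live) copy for each document id.
    private final Map<String, Doc> liveById = new HashMap<>();

    // An update never modifies in place: flag the old copy, append a new one.
    void upsert(String id, String body) {
        Doc old = liveById.get(id);
        if (old != null) {
            old.deleted = true;
        }
        Doc fresh = new Doc(body);
        segment.add(fresh);
        liveById.put(id, fresh);
    }

    String get(String id) {
        Doc doc = liveById.get(id);
        return doc == null ? null : doc.body;
    }

    int liveCount() { return liveById.size(); }

    int deletedCount() {
        int n = 0;
        for (Doc d : segment) {
            if (d.deleted) n++;
        }
        return n;
    }

    // Stand-in for a merge: physically drop the copies flagged as deleted.
    void forceMerge() { segment.removeIf(d -> d.deleted); }

    public static void main(String[] args) {
        AppendOnlyIndexSketch index = new AppendOnlyIndexSketch();
        for (int i = 0; i < 1000; i++) {
            index.upsert("doc-1", "version " + i);
        }
        System.out.println("live=" + index.liveCount() + " deleted=" + index.deletedCount());
    }
}
```

With rapid modifications of the same few documents, the deleted copies can far outnumber the live ones until a merge catches up, which is the problem described above.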
The application I am working on uses Elasticsearch as the backend. There are 9 microservices connecting to this backend. Writes are fewer than reads. Our write APIs have a performance requirement of at most 3 seconds.
We have configured a refresh interval of 1 second and use WAIT_FOR rather than IMMEDIATE, and less often NONE for asynchronous updates.

Is it appropriate to use a search engine as a caching layer?

We're talking about a normalized dataset, with several different entities that must often be accessed along with related records. We want to be able to search across all of this data. We also want to use a caching layer to store view-ready denormalized data.
Since search engines like Elasticsearch and Solr are fast, and since it seems appropriate in many cases to put the same data into both a search engine and a caching layer, I've read at least anecdotal accounts of people combining the two roles. This makes sense on a surface level, at least, but I haven't found much written about the pros and cons of this architecture. So: is it appropriate to use a search engine as a cache, or is using one layer for two roles a case of being penny wise but pound foolish?
These guys have done this...
http://www.artirix.com/elasticsearch-as-a-smart-cache/
The problem I see is not in the read speed, but in the write speed. You are incurring a pretty hefty cost for adding things to the cache (forcing spool to disk and index merge).
Things like memcached, or ElastiCache if you are on AWS, are much more efficient at both inserts and reads.
"Elasticsearch and Solr are fast" is relative. Caching infrastructure is often measured in the single-digit millisecond range for reads, and the same for inserts. These search engines are measured in tens of milliseconds at least for reads, and much higher for writes.
I've heard of setups where ES was used for what it is really good at, full-text search, in parallel with a secondary storage. In these setups the data was not stored in ES (though it can be) - "store": "no" - and after searching the ES indices, the actual records were retrieved from the second storage level, usually an RDBMS, since ES held a reference to the actual record in the RDBMS (an ID of some sort). If you're not happy with what your secondary storage gives you in terms of speed and "search" in general, I don't see why you couldn't set up an ES cluster to give you the missing piece.
The disadvantage here is the time spent architecting the ES data structure, because ES is not as good as an RDBMS at representing relationships. It doesn't really need to be: its main job and purpose are different, and it is actually happier with a denormalized set of data to search over.
Another disadvantage is the complexity of keeping the two storage systems in sync, which will require some thinking ahead. But once the initial setup and architecture are in place, it should be easy afterwards.
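The setup described above, where the search index only holds a reference and the record itself lives elsewhere, can be sketched with two in-memory maps. This is only an illustration: the `HashMap`s stand in for ES and the RDBMS, and every name here is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class IdLookupSketch {
    // Stand-in for the search engine: term -> ids only, no record bodies.
    private final Map<String, Set<Long>> termToIds = new HashMap<>();
    // Stand-in for the secondary storage (e.g. an RDBMS table keyed by id).
    private final Map<Long, String> recordStore = new HashMap<>();

    // Insert only; keeping the two structures in sync on update or delete
    // is exactly the extra work mentioned above and is not handled here.
    void save(long id, String text) {
        recordStore.put(id, text);
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                termToIds.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
    }

    // Step 1: ask the index for matching ids. Step 2: fetch records by id.
    List<String> search(String term) {
        Set<Long> ids = termToIds.getOrDefault(term.toLowerCase(), Collections.emptySet());
        List<String> results = new ArrayList<>();
        for (Long id : ids) {
            results.add(recordStore.get(id));
        }
        return results;
    }

    public static void main(String[] args) {
        IdLookupSketch sketch = new IdLookupSketch();
        sketch.save(1L, "Grand Hotel Paris");
        sketch.save(2L, "Hotel Berlin");
        System.out.println(sketch.search("hotel"));
    }
}
```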
The only recommended way of using a search engine is to create indices that match your most frequently accessed denormalised data access patterns. You can call it a cache if you want. For searching it's perfect, as it's fast enough.
The thing I would recommend adding a cache for there is statistics for "aggregated" queries; "Top 100 hotels in Europe" is a good example.
Maybe you could consider in-memory Lucene indexes instead of Solr or Elasticsearch. Here is an example

Can I disable caching of Fuseki server?

I want to disable SPARQL query caching in the Fuseki server. Can I disable it, and if so, how? I'm considering the following ways:
Using a command line argument - there doesn't seem to be one
Using settings file (*.ttl) - I couldn't find notation to disable caching
Edit server code - Basically I won't do it :(
Please tell me how I can disable caching.
What caching are you talking about?
As discussed in JENA-388 the current default behaviour is actually to add headers that disable caching so there is not any HTTP level caching.
If you are using the TDB backend then there are caches used to improve query performance, and those are not configurable AFAIK. Even if you could turn them off, doing so would likely worsen performance drastically, so it would not be a good idea.
Edit
The --mem option uses a pure in-memory dataset so there is no caching. Be aware that this will actually be much slower than using TDB as you scale up your data and is only faster at small dataset sizes.
If you are looking to benchmark then there are much better ways to eliminate the effect of caches than turning them off since disabling caches (even when you can) won't give you realistic performance numbers. There are several real world ways to eliminate cache effects:
Run warmups - either some fixed number or until you see the system reach a steady state.
Eliminate outliers in your statistics: discard the best and worst N results and compute your statistics over the remainder.
Use query parameterisation: take a query template and substitute different constants into it each time, so that you aren't issuing an identical query every time. Query plan caching could still come into effect, but as Jena doesn't do this anyway it won't matter for your tests.
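The three techniques above fit together roughly like this. It is a minimal sketch, not the benchmarker tool mentioned below: the `%VALUE%` placeholder, the class name and the `executeQuery` callback are all assumptions, and you would plug in whatever actually sends the query to your endpoint.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

class BenchmarkSketch {
    // Query parameterisation: substitute a different constant each run.
    static String parameterise(String template, String value) {
        return template.replace("%VALUE%", value);
    }

    // Outlier elimination: drop the n best and n worst timings, average the rest.
    static double trimmedMean(List<Long> timings, int n) {
        if (timings.size() <= 2 * n) {
            throw new IllegalArgumentException("not enough samples to trim");
        }
        List<Long> sorted = new ArrayList<>(timings);
        Collections.sort(sorted);
        List<Long> kept = sorted.subList(n, sorted.size() - n);
        long sum = 0;
        for (long t : kept) {
            sum += t;
        }
        return (double) sum / kept.size();
    }

    // Returns the trimmed mean run time in nanoseconds.
    static double run(Consumer<String> executeQuery, String template,
                      List<String> values, int warmups, int trim) {
        // Warmups: executed but never recorded.
        for (int i = 0; i < warmups; i++) {
            executeQuery.accept(parameterise(template, values.get(i % values.size())));
        }
        List<Long> timings = new ArrayList<>();
        for (String value : values) {
            String query = parameterise(template, value);
            long start = System.nanoTime();
            executeQuery.accept(query);
            timings.add(System.nanoTime() - start);
        }
        return trimmedMean(timings, trim);
    }

    public static void main(String[] args) {
        String template = "SELECT ?s WHERE { ?s ?p %VALUE% }";
        List<String> values = Arrays.asList("\"a\"", "\"b\"", "\"c\"", "\"d\"", "\"e\"");
        // The callback is a no-op here; replace it with a real query call.
        double meanNanos = run(q -> { }, template, values, 3, 1);
        System.out.println("trimmed mean (ns): " + meanNanos);
    }
}
```

A fixed warmup count is the simplest option; watching for a steady state instead needs a stopping rule and is left out here.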
You may want to take a look at my 2012 SemTech talk Practical SPARQL Benchmarking and the associated SPARQL Query Benchmarker tool. We've been working on a heavily revised version of the tool lately which has a lot of new features such as support for query parameterisation.

Insert performance with and without Index

Was doing a couple of tests.
Based on some great suggestions by Wes etc., I have tuned some of the neo4j properties with no cache to do insert on a large scale in a multithreaded environment and the performance is not bad.
However, when I introduce an index (on the nodes), the performance degrades a lot. The difference is easily five-fold. Are there configuration settings to make it better?
Thanks in advance,
Sachin
Neo4j version - 1.8.1; JVM - 1.6
Inserting nodes (or relationships) into a Lucene index is costly. Lucene is a powerful but complex tool, designed for fulltext/keyword search. Compared with the bare database, it is rather slow.
This is why most bulk insert tools do the indexing asynchronously, like Michael's batch inserter:
http://jexp.de/blog/2012/10/parallel-batch-inserter-with-neo4j/
Some even circumvent transactions, or write the store files directly:
http://blog.xebia.com/2012/11/13/combining-neo4j-and-hadoop-part-i/
To improve performance, using an SSD could help. But as Neo4j is a fully ACID transactional database, and the Lucene index is tightly coupled with the transactions (which is a good thing), there's not much else you can do besides optimizing your infrastructure for best write performance.
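The asynchronous indexing that bulk insert tools use can be pictured roughly as follows. This is a generic sketch with plain Java collections, not Neo4j's or the batch inserter's API; the maps stand in for the store and the index, and all names are hypothetical. Note that it gives up exactly the coupling between index and transaction described above: until the background thread catches up, a lookup can miss a node that has already been stored.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class AsyncIndexingSketch {
    // Stand-in for the primary store: written on the caller's thread.
    private final Map<Long, String> store = new ConcurrentHashMap<>();
    // Stand-in for the (slow) index: written on a background thread.
    private final Map<String, Long> index = new ConcurrentHashMap<>();
    private final ExecutorService indexer = Executors.newSingleThreadExecutor();

    void insert(long id, String name) {
        store.put(id, name);
        // Queue the index update instead of paying for it inline.
        indexer.execute(() -> index.put(name, id));
    }

    int storedCount() { return store.size(); }

    // May return null if the index has not caught up yet.
    Long lookup(String name) { return index.get(name); }

    // Stop accepting work and wait for the queued index updates to finish.
    boolean finish(long timeoutSeconds) {
        indexer.shutdown();
        try {
            return indexer.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        AsyncIndexingSketch sketch = new AsyncIndexingSketch();
        for (long i = 0; i < 10_000; i++) {
            sketch.insert(i, "node-" + i);
        }
        System.out.println("indexing finished: " + sketch.finish(30));
    }
}
```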
Just in case this additional answer is still of use for anyone running Neo4j on an ext4 filesystem under Linux:
By trading some transaction safety (negligible on UPS/battery-buffered systems or laptops), the write performance can be increased by a factor of 10-15!
Read more in this recent blog post: http://structr.org/blog/neo4j-performance-on-ext4

Is excessive use of lucene good?

In my project, all searching and listing of content depends on Lucene. I am not facing any performance issues so far, but the project is still in the development phase and has a long way to go before production.
I want to find any performance issues before the project grows into a large structure.
Is excessive use of Lucene feasible or not?
As an example, I have about 3 GB of text in a Lucene index, and it functions very quickly (milliseconds response times on searches, filters, and sorts). This index contains about 300,000 documents.
Hope that gave some context to your concerns. This is in a production environment.
Lucene is very mature and has very good performance for what it was designed to do. However, it is not an RDBMS. The amount of fine-tuning you can do to improve performance is more limited than a database engine.
You shouldn't rely only on Lucene if:
You need frequent updates
You need to do joined queries
You need sophisticated backup solutions
I would say that if your project is large enough to hire a DBA, you should use one...
Performance-wise, I am seeing acceptable performance on a 400GB index across 10 servers (a single (4GB, 2CPU) server can handle 40GB of Lucene index, but no more. YMMV).
By excessive, do you mean extensive/exclusive?
Lucene's performance is generally very good. I recently ran some performance tests for Lucene on my desktop (quad-core @ 2.4 GHz).
I ran various search queries against a disk index composed of 10 million documents, and the slowest query (MatchAllDocs) returned results within 1500 ms. Search queries with two or more search terms would return in around 100 ms.
There are tons of performance tweaks you can do for Lucene, and they can significantly increase your search speed.
What would you define as excessive?
If your application has a solid design, and the performance is good, I wouldn't worry too much about it.
Perhaps you could get a data dump to test the performance in a live scenario.
We use Lucene to enable type-ahead searching. This means that for every letter typed, it hits the Lucene index to get the results. Multiply that by tens of textboxes on multiple interfaces, and again by tens of employees typing, and we still have no complaints and extremely fast response times. (Actually it works faster than any other type-ahead solution we tried.)

Resources