Is excessive use of Lucene good? - performance

In my project, all searching and listing of content depends on Lucene. I am not facing any performance issues so far, but the project is still in the development phase and has a long way to go before production.
I have to identify potential performance issues before the project is completed at a larger scale.
Is such extensive use of Lucene feasible or not?

As an example, I have about 3 GB of text in a Lucene index, and it functions very quickly (millisecond response times on searches, filters, and sorts). This index contains about 300,000 documents.
Hope that gives some context for your concerns. This is in a production environment.

Lucene is very mature and has very good performance for what it was designed to do. However, it is not an RDBMS, and the amount of fine-tuning you can do to improve performance is more limited than with a database engine.
You shouldn't rely only on Lucene if:
You need frequent updates
You need to do joined queries
You need sophisticated backup solutions
I would say that if your project is large enough to hire a DBA, you should use one...
Performance-wise, I am seeing acceptable performance on a 400 GB index across 10 servers (a single 4 GB, 2-CPU server can handle 40 GB of Lucene index, but no more; YMMV).

By excessive, do you mean extensive/exclusive?
Lucene's performance is generally very good. I recently ran some performance tests for Lucene on my desktop (quad-core @ 2.4 GHz).
I ran various search queries against a disk index composed of 10 million documents, and the slowest query (MatchAllDocs) returned results within 1500 ms. Search queries with two or more search terms would return in around 100 ms.
There are tons of performance tweaks you can do for Lucene, and they can significantly increase your search speed.
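As a rough illustration, here is a minimal sketch of two common knobs, assuming a recent Lucene version; the index path and buffer size are made-up values, not recommendations:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class TunedIndexing {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        // Buffer more documents in RAM before flushing a segment:
        // fewer, larger segments generally mean faster indexing.
        config.setRAMBufferSizeMB(256);

        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/index")), config)) {
            // ... add documents here ...

            // On a mostly static index, merging down to a single segment
            // can noticeably speed up queries (expensive, do it offline).
            writer.forceMerge(1);
        }
    }
}
```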

What would you define as excessive?
If your application has a solid design, and the performance is good, I wouldn't worry too much about it.
Perhaps you could get a data dump to test the performance in a live scenario.

We use Lucene to enable type-ahead searching. This means that for every letter typed, it hits the Lucene index to get results. Multiply that by tens of textboxes on multiple interfaces, and again by tens of employees typing, with no complaints and extremely fast response times. (It actually works faster than any other type-ahead solution we tried.)
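For context, type-ahead against Lucene is often just a prefix query per keystroke. A minimal sketch, assuming a recent Lucene version; the field name, index path, and prefix are made-up:

```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class TypeAhead {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/tmp/index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);

            // Fired on every keystroke: match all terms starting with "luc".
            PrefixQuery query = new PrefixQuery(new Term("name", "luc"));
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("name"));
            }
        }
    }
}
```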

Related

How do websites do fulltext search and sort?

How do websites implement search and sort? (example: ecommerce search for a product and sort by price)
I've been wrestling with this for a while. I'm using MySQL, and after long discussions here it seems that MySQL can't handle this. I've also asked here whether Postgres can do this, and again it seems like the answer is no.
So how do websites do it?
EDIT: To be clear, I'm asking how websites do it in a way that uses both full-text search and some sort of B-tree index for the sorting. Doing full-text search and sort without using one of the indexes would be easy (albeit slow).
I worked for a large ecommerce site that used SQL Server full-text search to accomplish this. Conceptually, the full-text search engine would produce a list of IDs, which would be joined against the B-tree indexes to return sorted results. Performance was acceptable, but we pushed it as far as we could go with the largest hardware available at the time (80 CPUs, 512 GB RAM, etc.). With 20-25 million documents, a simple full-text query (2-3 terms) would have response times in the 3-5 second range. That was for the historical data. The live data set (around 1 million documents) would average 200 ms with a wide distribution. We were able to handle 150-200 queries per second.
We eventually ended up moving away from SQL Server for search because we wanted additional full-text functionality that SQL Server didn't offer, specifically highly tunable relevance sorting for results. We researched various options and settled on Elasticsearch hosted on AWS.
Elasticsearch offered substantial improvements in features. Performance was great. We went live with 4 xlarge instances on AWS. Query response times were right around 150-175 ms, and very, very consistent. We could easily scale the number of nodes up or down to keep performance consistent with varying amounts of load.
SQL Server was still the system of record. We had to develop several services to push changes from SQL Server to ES (incremental loading, bulk loading, etc.). Translating the SQL search logic to ES was straightforward.
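To make the "full-text search plus cheap sort" combination concrete, here is a hedged sketch using the Elasticsearch high-level REST client (7.x era); the host, index, and field names are assumptions, not this site's actual schema:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.sort.SortOrder;

public class ProductSearch {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Full-text relevance query plus a sort on an indexed field:
            // the combination that was hard to get from a single RDBMS index.
            SearchSourceBuilder source = new SearchSourceBuilder()
                    .query(QueryBuilders.matchQuery("title", "running shoes"))
                    .sort("price", SortOrder.ASC);

            SearchResponse response = client.search(
                    new SearchRequest("products").source(source),
                    RequestOptions.DEFAULT);
            System.out.println(response.getHits().getTotalHits());
        }
    }
}
```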
In conclusion, if your database can't meet your search needs, then use a tool (Elasticsearch) that does.

Is it appropriate to use a search engine as a caching layer?

We're talking about a normalized dataset, with several different entities that must often be accessed along with related records. We want to be able to search across all of this data. We also want to use a caching layer to store view-ready denormalized data.
Since search engines like Elasticsearch and Solr are fast, and since it seems appropriate in many cases to put the same data into both a search engine and a caching layer, I've read at least anecdotal accounts of people combining the two roles. This makes sense on a surface level, at least, but I haven't found much written about the pros and cons of this architecture. So: is it appropriate to use a search engine as a cache, or is using one layer for two roles a case of being penny wise but pound foolish?
These guys have done this...
http://www.artirix.com/elasticsearch-as-a-smart-cache/
The problem I see is not in the read speed, but in the write speed. You are incurring a pretty hefty cost for adding things to the cache (forcing a spool to disk and an index merge).
Things like Memcached, or ElastiCache if you are on AWS, are much more efficient at both inserts and reads.
"Elasticsearch and Solr are fast" is relative: caching infrastructure is often measured in the single-digit millisecond range, the same for inserts. These search engines are measured in tens of milliseconds at best for reads, and much higher for writes.
I've heard of setups where ES was used for what it is really good for, full-text search, in parallel with a secondary storage. In these setups the data was not stored in ES (though it can be) - "store": "no" - and after searching the ES indices, the actual records were retrieved from the second storage level, usually an RDBMS, given that ES was holding a reference to the actual record in the RDBMS (an ID of some sort). If you're not happy with what the secondary storage gives you in terms of speed and "search" in general, I don't see why you couldn't set up an ES cluster to provide the missing piece.
The disadvantage here is the time spent architecting the ES data structure, because ES is not as good as an RDBMS at representing relationships. And it really doesn't need to be; its main job and purpose are different. It is actually happier with a denormalized set of data to search over.
Another disadvantage is the complexity of keeping the two storage systems in sync, which will require some thinking ahead. But once the initial setup and architecture are in place, it should be easy afterwards.
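A minimal sketch of that two-level pattern, again assuming the 7.x high-level REST client plus plain JDBC; the index, field, and table names are made-up, and the one-query-per-ID loop is kept naive for clarity (a single IN (...) query would be better):

```java
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class SearchThenFetch {
    static List<String> findArticles(RestHighLevelClient es, Connection db,
                                     String text) throws Exception {
        // Step 1: ES returns only matching document IDs ("store": "no" style);
        // fetchSource(false) keeps the response down to hits and IDs.
        SearchResponse resp = es.search(
                new SearchRequest("articles").source(
                        new SearchSourceBuilder()
                                .query(QueryBuilders.matchQuery("body", text))
                                .fetchSource(false)),
                RequestOptions.DEFAULT);

        // Step 2: the actual records come from the system of record.
        List<String> titles = new ArrayList<>();
        try (PreparedStatement ps = db.prepareStatement(
                "SELECT title FROM articles WHERE id = ?")) {
            for (SearchHit hit : resp.getHits()) {
                ps.setLong(1, Long.parseLong(hit.getId()));
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) titles.add(rs.getString("title"));
                }
            }
        }
        return titles;
    }
}
```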
The only recommended way of using a search engine is to create indices that match your most frequently accessed denormalized data access patterns. You can call it a cache if you want. For searching it's perfect, as it's fast enough.
A good thing to add a cache for there: statistics for "aggregated" queries, with "Top 100 hotels in Europe" being a good example.
Maybe you can consider in-memory Lucene indexes, instead of Solr or Elasticsearch. Here is an example.
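A minimal sketch, assuming a recent Lucene version (older releases used RAMDirectory where this uses ByteBuffersDirectory); the field and document contents are made-up:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class InMemoryLucene {
    public static void main(String[] args) throws Exception {
        // The index lives entirely on the heap -- no disk I/O involved.
        Directory dir = new ByteBuffersDirectory();
        StandardAnalyzer analyzer = new StandardAnalyzer();

        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("title", "Hotel Europa, Berlin", Field.Store.YES));
            writer.addDocument(doc);
        }

        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser("title", analyzer);
            for (ScoreDoc hit : searcher.search(parser.parse("hotel"), 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("title"));
            }
        }
    }
}
```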

Insert performance with and without Index

I was doing a couple of tests.
Based on some great suggestions by Wes and others, I have tuned some of the Neo4j properties (with no cache) to do inserts on a large scale in a multithreaded environment, and the performance is not bad.
However, when I introduce an index (on the nodes), the performance degrades a lot. The difference is easily 5-fold. Are there configuration settings to make it better?
Thanks in advance,
Sachin
Neo4j version - 1.8.1; JVM - 1.6
Inserting nodes (or relationships) into a Lucene index is costly. Lucene is a powerful but complex tool, designed for fulltext/keyword search. Compared with the bare database, it is rather slow.
This is why most bulk insert tools do the indexing asynchronously, like Michael's batch inserter:
http://jexp.de/blog/2012/10/parallel-batch-inserter-with-neo4j/
Some even circumvent transactions, or write the store files directly:
http://blog.xebia.com/2012/11/13/combining-neo4j-and-hadoop-part-i/
To improve performance, using an SSD could help. But as Neo4j is a fully ACID transactional database, and the Lucene index is tightly coupled with the transactions (which is a good thing), there's not much else you can do besides optimizing your infrastructure for the best write performance.
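For reference, a minimal sketch of the 1.x/2.x-era BatchInserter API that the linked post builds on; the store path, property values, and relationship type are made-up:

```java
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

import java.util.HashMap;
import java.util.Map;

public class BulkLoad {
    public static void main(String[] args) {
        // The batch inserter bypasses transactions entirely: single-threaded,
        // no ACID guarantees, intended for offline initial imports only.
        BatchInserter inserter = BatchInserters.inserter("target/batch.db");
        try {
            Map<String, Object> props = new HashMap<String, Object>();
            props.put("name", "node-a");
            long a = inserter.createNode(props);

            props.put("name", "node-b");
            long b = inserter.createNode(props);

            // Creating the relationship is just a store append, no index
            // involved; Lucene indexing is done separately (often
            // asynchronously) afterwards, which is why it is so much faster.
            inserter.createRelationship(a, b,
                    DynamicRelationshipType.withName("CONNECTED_TO"), null);
        } finally {
            inserter.shutdown(); // flushes and closes the store files
        }
    }
}
```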
Just in case this additional answer is still of use for anyone running Neo4j on an ext4 filesystem under Linux:
By trading away some transaction safety (negligible on UPS/battery-backed systems or laptops), write performance can be increased by a factor of 10-15!
Read more in this recent blog post: http://structr.org/blog/neo4j-performance-on-ext4

Hibernate Search Automatic Indexing

I am working on developing an application which caters to about 100,000 searches every day. We can safely assume there are about the same number of updates/insertions/deletions in the database daily. The current application uses native SQL, and we intend to migrate it to Hibernate and use Hibernate Search.
As there are continuous changes in the database records, we need to enable automatic indexing. The management has concerns about the performance impact automatic indexing can cause.
It is not possible to have a scheduled batch indexing as the changes in the records have to be available for search as soon as they are changed.
I have searched for some kind of performance statistics but have found none.
Can anybody who has already worked on Hibernate Search and faced a similar situation share their thoughts?
Thanks for the help.
Regards,
Shardul.
It might work fine, but it's hard to guess without a baseline. I have experience with even more searches / day and after some fine tuning it works well, but it's impossible to know if that will apply for your scenario without trying it out.
If normal tuning fails and NRT doesn't prove fast enough, you can always shard the indexes, use a multi-master configuration, and plug in a distributed second-level cache such as Infinispan: all combined, the architecture can achieve linear scalability, provided you have the time to set it up and reasonable hardware.
It's hard to say what kind of hardware you will need, but it's a safe bet that it will be more efficient than native SQL solutions. I would suggest making a POC and seeing how far you can get on a single node; if the kind of queries you have are a good fit for Lucene, you might not need more than a single server. Beware that Lucene is much faster at queries than at updates, so since you estimate you'll have the same amount of writes and searches, the problem is unlikely to be in the number of searches/second, but in the writes (updates)/second and the total data (index) size. The latest Hibernate Search introduced an NRT index manager, which suits such use cases well.
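For what it's worth, automatic indexing needs no scheduled job: annotating the mapped entity is enough, since Hibernate Search hooks into the ORM's event listeners. A minimal sketch with 4.x-era annotations; the entity and its fields are made-up:

```java
import javax.persistence.Entity;
import javax.persistence.Id;

import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;

// Every insert/update/delete that goes through Hibernate triggers the
// corresponding Lucene index update as part of the same unit of work,
// so changes become searchable without any batch re-indexing job.
@Entity
@Indexed
public class Product {

    @Id
    private Long id;

    @Field // analyzed and indexed for full-text search
    private String name;

    @Field
    private String description;

    // getters/setters omitted
}
```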

Fastest NoSQL option for number crunching?

I had always thought that Mongo had excellent performance with its map/reduce functionality, but am now reading that it is a slow implementation of it. So if I had to pick an alternative to benchmark against, what should it be?
My software will be such that users will often have millions of records, and often be sorting and crunching through unpredictable subsets of tens or hundreds of thousands. Most of the analysis that uses the full millions of records can be done in summary tables and the like. I'd originally thought Hypertable was a viable alternative, but in doing research I saw in their documents a mention that Mongo would be a more performant option, while Hypertable had other benefits. But for my application, speed is my number one initial priority.
First of all, it's important to decide on what is "fast enough". Undoubtedly there are faster solutions than MongoDB's map/reduce but in most cases you may be looking at significantly higher development cost.
That said, MongoDB's map/reduce runs, at the time of writing, on a single thread, which means it will not utilize all the CPU available to it. Also, MongoDB has very little in the way of native aggregation functionality. This is being fixed from version 2.1 onwards, which should improve performance (see https://jira.mongodb.org/browse/SERVER-447 and http://www.slideshare.net/cwestin63/mongodb-aggregation-mongosf-may-2011).
Now, what MongoDB is good at is scaling out easily, especially when it comes to reads. And this is important, because the best solution for number crunching on large datasets is definitely a map/reduce cloud like Augusto suggested. Let such an m/r setup do the number crunching while MongoDB serves the required data at high speed. If database query throughput is too low, that is easily solved by adding more Mongo shards; if number-crunching/aggregation performance is too slow, that is solved by adding more m/r boxes. Basically, performance becomes a function of the number of instances you reserve for the problem, and thus of cost.
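To illustrate the aggregation framework that later replaced single-threaded map/reduce for simple number crunching, a sketch with the modern MongoDB Java driver; the database, collection, and field names are assumptions:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.Arrays;

import static com.mongodb.client.model.Accumulators.sum;
import static com.mongodb.client.model.Aggregates.group;
import static com.mongodb.client.model.Aggregates.match;
import static com.mongodb.client.model.Filters.gte;

public class CrunchNumbers {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> records = client
                    .getDatabase("analytics").getCollection("records");

            // Equivalent of a small map/reduce: filter a subset, then
            // aggregate it server-side instead of shipping rows to the app.
            for (Document doc : records.aggregate(Arrays.asList(
                    match(gte("year", 2020)),
                    group("$category", sum("total", "$amount"))))) {
                System.out.println(doc.toJson());
            }
        }
    }
}
```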
