Solr Range Query Slow (SolrBridge in Magento CE) - magento

I'm developing a Magento Community Edition site using Solr via the SolrBridge extension. The system is fast when there are only a few thousand SKUs, but after importing ~100k products the searches have slowed significantly. The page load went from under a second to over two seconds, and New Relic monitoring identified this time as waiting for a response from Solr.
Noticing that search suggestions were still lightning fast, I decided to investigate what the difference between the autocomplete search and the full search listings were. After experimenting with altering different aspects of the search to bring it in line with the autocomplete search.
The system sped up immensely when I disabled the range fields part of the query which looks like the following:
facet.range=GBP_0_price_decimal&f.GBP_0_price_decimal.facet.range.start=0&f.GBP_0_price_decimal.facet.range.end=1000000&f.GBP_0_price_decimal.facet.range.gap=100&f.GBP_0_price_decimal.facet.mincount=1
With this code included, the search takes in the region of 1.7-1.8 seconds. Without it, the search takes only a few miliseconds.
This is, I believe, schema definition for the field. It does seem to be indexed:
<dynamicField name="*_decimal" type="float" indexed="true" stored="true" />
Any idea what the slowing factor is? Solr is running a single core. It's on the same physical box as Magento and the database. The box's specs are relatively high - 64GB ram and dual Xeon E5620s.
Thanks for any assistance. If you need any more information to provide assistance, let me know.

The facet gap (100) is perhaps too small.
You are generating (potentially) 10000 buckets for faceting in ranges (facet.end divided by facet.gap), which is kinda heavy.
I could increase the gap size or use a smaller range end.

Related

How do websites do fulltext search and sort?

How do websites implement search and sort? (example: ecommerce search for a product and sort by price)
I've been wrestling with this for a while. I'm using MySQL and after long discussions here it seems that MySQL can't handle this. I've also asked here here whether posgres can do this and again it seems like the answer is no.
So how do websites do it?
EDIT: To be clear, I'm asking how websites do it in a way that uses both fulltext search and some sort of BTREE index for the sorting. To do fulltext search and sort without using one of the indexes would be easy (albeit slow).
I worked for a large ecommerce site that used SQL Server full-text search to accomplish this. Conceptually, the full-text search engine would produce a list of ids, which would be joined against the b-tree indexes to return sorted results. Performance was acceptable, but we pushed it as far as we could go with the largest hardware available at the time (80 cpu, 512 GB RAM, etc). With 20-25 million documents a simple full-text query (2-3 terms) would have response times in the 3-5 second range. That was for the historical data. The live data set (around 1 million documents) would average 200ms with a wide distribution. We were able to handle 150-200 queries per second.
We eventually ended up moving away from SQL Server for search because we wanted additional full-text functionality that SQL Server didn't offer, specifically highly tunable relevance sorting for results. We researched various options and settled on elastic search hosted on aws.
Elastic search offered substantial improvements in features. Performance was great. We went live with 4 xlarge instances on aws. Query response times were right around 150-175 ms, and very, very consistent. We could easily scale up/down the number of nodes to keep performance consistent with varying amounts of load.
SQL Server was still the system of record. We had to develop several services to push changes from SQL Server to ES (incremental loading, bulk loading, etc). Translating the SQL search logic to ES was straight forward.
In conclusion, if your database can't meet your search needs, then use a tool (elasticsearch) that does.

Increase Solr performance when querying a subset of documents

The Usecase
I have an index of potentially millions of documents. I want to make around 20'0000 searches on a subset of these documents (around 25'000 documents). These 25'000 documents could take up around 100 MB stored in Solr (consisting of stored and indexes text fields).
The Problem
As the number of indexed documents increases, the performance of the queries decreases a lot. For example running 20'000 searches that hit 25'000 documents on 100'000 document index takes around 4 minutes. Running the same searches on 200'000 document index takes around 20 minutes.
So is there any way to cache these 25'000 documents in RAM before hitting them with searches?
UPDATE
Some things that really helped:
reducing returned row count (In almost all cases I had to iterate through returned results and in almost all cases where were no more than 100 matching results, but I had set rows to a very large value. Reducing the row count improved the performance around 2x. This seemed counter intuitive. If there are only 79 matches and I set returned row count to 100 it performs better than in a case when where are 79 matches and I set the row count to 1000. In the first case Solr already returns found item count and does it fast. Why should there be a performance difference?)
reducing multithreading (I had added multiple threads for querying because on the development box there were more resources available. On the resource constrained production box it was slowing things down. Using only one or two threads got me around 2x speed improvement.)
Some things that did not really help:
splitting up field queries (I was already using field queries everywhere it was possible, but I was combining them in one fq for each query fq=name:a AND type:b. Splitting them up with fq=name:a&fq=type:b caches them separately (see Apache Solr documentation) and could improve performance. But it did not make a huge difference in this case.
changing caching settings in this case filterCache seemed to have the most potential. However, increasing it or changing its settings did not make a huge difference.
A few things that are recommended for performance:
Have enough spare RAM on the box so index files can be in OS cache
Try to play around with solr caching settings in SolrConfig
Play around with autowarming after commits
Try to develop your queries to limit the result set. Large result sets, specifically if using grouping and faceting will kill performance. Now 200,000 document index is really quite small, so you should not have any problems, but I thought I'd mention this for when you scale.
Try to use Filter query (FQ) whenever possible. They are much faster than doing field:val in q, plus they are cached.

Performance issues using Elasticsearch as a time window storage

We are using elastic search almost as a cache, storing documents found in a time window. We continuously insert a lot of documents of different sizes and then we search in the ES using text queries combined with a date filter so the current thread does not get documents it has already seen. Something like this:
"((word1 AND word 2) OR (word3 AND word4)) AND insertedDate > 1389000"
We maintain the data in the elastic search for 30 minutes, using the TTL feature. Today we have at least 3 machines inserting new documents in bulk requests every minute for each machine and searching using queries like the one above pratically continuously.
We are having a lot of trouble indexing and retrieving these documents, we are not getting a good throughput volume of documents being indexed and returned by ES. We can't get even 200 documents indexed per second.
We believe the problem lies in the simultaneous queries, inserts and TTL deletes. We don't need to keep old data in elastic, we just need a small time window of documents indexed in elastic at a given time.
What should we do to improve our performance?
Thanks in advance
Machine type:
An Amazon EC2 medium instance (3.7 GB of RAM)
Additional information:
The code used to build the index is something like this:
https://gist.github.com/dggc/6523411
Our elasticsearch.json configuration file:
https://gist.github.com/dggc/6523421
EDIT
Sorry about the long delay to give you guys some feedback. Things were kind of hectic here at our company, and I chose to wait for calmer times to give a more detailed account of how we solved our issue. We still have to do some benchmarks to measure the actual improvements, but the point is that we solved the issue :)
First of all, I believe the indexing performance issues were caused by a usage error on out part. As I told before, we used Elasticsearch as a sort of a cache, to look for documents inside a 30 minutes time window. We looked for documents in elasticsearch whose content matched some query, and whose insert date was within some range. Elastic would then return us the full document json (which had a whole lot of data, besides the indexed content). Our configuration had elastic indexing the document json field by mistake (besides the content and insertDate fields), which we believe was the main cause of the indexing performance issues.
However, we also did a number of modifications, as suggested by the answers here, which we believe also improved the performance:
We now do not use the TTL feature, and instead use two "rolling indexes" under a common alias. When an index gets old, we create a new one, assign the alias to it, and delete the old one.
Our application does a huge number of queries per second. We believe this hits elastic hard, and degrades the indexing performance (since we only use one node for elastic search). We were using 10 shards for the node, which caused each query we fired to elastic to be translated into 10 queries, one for each shard. Since we can discard the data in elastic at any moment (thus making changes in the number of shards not a problem to us), we just changed the number of shards to 1, greatly reducing the number of queries in our elastic node.
We had 9 mappings in our index, and each query would be fired to a specific mapping. Of those 9 mappings, about 90% of the documents inserted went to two of those mappings. We created a separate rolling index for each of those mappings, and left the other 7 in the same index.
Not really a modification, but we installed SPM (Scalable Performance Monitoring) from Sematext, which allowed us to closely monitor elastic search and learn important metrics, such as the number of queries fired -> sematext.com/spm/index.html
Our usage numbers are relatively small. We have about 100 documents/second arriving which have to be indexed, with peaks of 400 documents/second. As for searches, we have about 1500 searches per minute (15000 before changing the number of shards). Before those modifications, we were hitting those performance issues, but not anymore.
TTL to time-series based indexes
You should consider using time-series-based indexes rather than the TTL feature. Given that you only care about the most recent 30 minute window of documents, create a new index for every 30 minutes using a date/time based naming convention: ie. docs-201309120000, docs-201309120030, docs-201309120100, docs-201309120130, etc. (Note the 30 minute increments in the naming convention.)
Using Elasticsearch's index aliasing feature (http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases/), you can alias docs to the most recently created index so that when you are bulk indexing, you always use the alias docs, but they'll get written to docs-201309120130, for example.
When querying, you would filter on a datetime field to ensure only the most recent 30 mins of documents are returned, and you'd need to query against the 2 most recently created indexes to ensure you get your full 30 minutes of documents - you could create another alias here to point to the two indexes, or just query against the two index names directly.
With this model, you don't have the overhead of TTL usage, and you can just delete the old, unused indexes from over an hour in the past.
There are other ways to improve bulk indexing and querying speed as well, but I think removal of TTL is going to be the biggest win - plus, your indexes only have a limited amount of data to filter/query against, which should provide a nice speed boost.
Elasticsearch settings (eg. memory, etc.)
Here are some setting that I commonly adjust for servers running ES - http://pastebin.com/mNUGQCLY, note that it's only for a 1GB VPS, so you'll need to adjust.
Node roles
Looking into master vs data vs 'client' ES node types might help you as well - http://www.elasticsearch.org/guide/reference/modules/node/
Indexing settings
When doing bulk inserts, consider modifying the values of both index.refresh_interval index.merge.policy.merge_factor - I see that you've modified refresh_interval to 5s, but consider setting it to -1 before the bulk indexing operation, and then back to your desired interval. Or, consider just doing a manual _refresh API hit after your bulk operation is done, particularly if you're only doing bulk inserts every minute - it's a controlled environment in that case.
With index.merge.policy.merge_factor, setting it to a higher value reduces the amount of segment merging ES does in the background, then back to its default after the bulk operation restores normal behaviour. A setting of 30 is commonly recommended for bulk inserts and the default value is 10.
Some other ways to improve Elasticsearch performance:
increase index refresh interval. Going from 1 second to 10 or 30 seconds can make a big difference in performance.
throttle merging if it's being overly aggressive. You can also reduce the number of concurrent merges by lowering index.merge.policy.max_merge_at_once and index.merge.policy.max_merge_at_once_explicit. Lowering the index.merge.scheduler.max_thread_count can help as well
It's good to see you are using SPM. Its URL in your EDIT was not hyperlink - it's at http://sematext.com/spm . "Indexing" graphs will show how changing of the merge-related settings affects performance.
I would fire up an additional ES instance and have it form a cluster with your current node. Then I would split the work between the two machines, use one for indexing and the other for querying. See how that works out for you. You might need to scale out even more for your specific usage patterns.

What are the performance considerations when adding a large number of documents to a large Solr core?

If I have a Solr core with a half-dozen small fields that's loaded with 100 million documents, will adding a batch of 1 million documents run in a reasonable amount of time? How about 10 million? By reasonable, I'm thinking hours, rather than days. I've been told that this will take a long time to run. Is this really an issue? What are known strategies to improve performance? The fields are typically small, that is, 5-50 characters.
two suggestions on top of already mentioned in other answers for improving the performance (first tried, second to be tried):
1) decrease logging while updating: on INFO level SOLR appends one entry per document. See here on how we did it: http://dmitrykan.blogspot.fi/2011/01/solr-speed-up-batch-posting.html Some people reported "x3 speed increase".
2) set the amount of segments in solrconfig.xml to something very large for indexing, like 10000. Once the batch indexing is complete, change the parameter value back to something reasonably low, like 10.
This is a very "tricky" question whose answer differs from schema to schema.
Your solr installation has a half-dozen fields. But, how many are actually indexed? If only one field is indexed, then adding 1 million documents will be faster than adding 1 million docs when 6 fields are indexed.
I think the type of fields that are indexed also matters. A field that is of the type "text_general" is broken down into tokens while indexing whereas a field that is of the type "string" is not. "String" type is not analyzed and is stored as one complete token.
I have got some very long fields which are indexed and adding 2 million docs take a few minutes (although my installation does not contain 100 million documents). So, I do not think that it will take days to add 10 million records to your installation.
I am not sure about this but maybe the configuration of your cpu which is running the solr instance also matters. So, you might need to see if you cpu and memory can handle this much load.
It's upto you to decide if a long running data post is an issue or not. If your application is user intensive, then I suggest that you follow some kind of master-slave configuration so that the user is not impacted by the high cpu usage when you post the data. Some strategies which I know about improving performance is "sharding".
http://carsabi.com/car-news/2012/03/23/step-by-step-solr-sharding/
or if it is possible to demarcate the records by some field and put those different documents onto different servers.
100 million records is a fairly large index for Solr. But adding 10 million records on a good machine should be hours not days. You may find the following email thread interesting as it includes both in-depth questions and some final advice on tuning for 10M records index process.
Also, you did not say if you 'store' the fields as well as index them. If you do, you may also look forward to Solr 4.1 field compression.
An important parameter which impacts the indexing performance(in terms of Time) is the way in which you have defined your data-config.xml file.
If your fields come from multiple tables in a Database, you can configure it in 2 ways:
Entities within entities
A single entity with a join query
The second method is comparatively faster than the first one by a large degree because the number of queries fired against the database is decreased.

Is excessive use of lucene good?

In my project, entire searching and listing of content is depend on Lucene. I am not facing any performance issues. Still, the project is in development phase and long way to go in production.
I have to find out the performance issues before the project completed in large structure.
Whether the excessive use of lucene is feasible or not?
As an example, I have about 3 GB of text in a Lucene index, and it functions very quickly (milliseconds response times on searches, filters, and sorts). This index contains about 300,000 documents.
Hope that gave some context to your concerns. This is in a production environment.
Lucene is very mature and has very good performance for what it was designed to do. However, it is not an RDBMS. The amount of fine-tuning you can do to improve performance is more limited than a database engine.
You shouldn't rely only on lucene if:
You need frequent updates
You need to do joined queries
You need sophisticated backup solutions
I would say that if your project is large enough to hire a DBA, you should use one...
Performance wise, I am seeing acceptable performance on a 400GB index across 10 servers (a single (4GB, 2CPU) server can handle 40GB of lucene index, but no more. YMMV).
By excessive, do you mean extensive/exclusive?
Lucene's performance is generally very good. I recently ran some performance tests for Lucene on my Desktop with QuadCore # 2.4 GHz 2.39 GHz
I ran various search queries against a disk index composed of 10MM documents, and the slowest query (MatchAllDocs) returned results within 1500 ms. Search queries with two or more search terms would return around 100 ms.
There are tons of performance tweaks you can do for Lucene, and they can significantly increase your search speed.
What would you define as excessive?
If your application has a solid design, and the performance is good, I wouldn't worry too much about it.
Perhaps you could get a data dump to test the performance in a live scenario.
We use lucence to enable type-ahead searching. This means for every letter typed, it hits the lucence index to get the results. Multiple that to tens of textboxes on multiple interfaces and again tens of employees typing, with no complaints and extremely fast response times. (Actually it works faster than any other type-ahead solution we tried).

Resources