What's the difference between fielddata enabled vs eager global ordinals for optimising initial search query latency in Elasticsearch - elasticsearch

I have an elasticsearch (7.10) cluster running that is primarily meant for powering search on text documents. The index that I'm working with does not need to be updated often, and there is no great necessity for speed during index time. Performance in this system is really needed for search time. The number of documents will likely always be in the range of 50-70 million and the store size is ~300GB once it's all built.
The mapping for the index and field I'm concerned with looks something like this:
"mappings": {
  "properties": {
    "document_text": {
      "type": "text"
    }
  }
}
The document_text is a string of text anywhere in the region of 50-500 words. The typical queries being sent to this index are match queries chained together inside a boolean should query. Usually, the number of clauses is in the range of 5-15.
The issue I've been running into is that the initial latency for search queries to the index is very high, usually in the range of 4-6s, but after the first search the document is cached, so the latency drops below 1s. The cluster has 3 data nodes, 3 master nodes and 2 ingest/client nodes, and is backed by fast SSDs. I noticed that the heap on the data nodes is never really under much pressure, and neither is the RAM; this led me to realize that the documents weren't being cached in advance the way I wanted them to be.

From what I've researched, I've landed on either enabling fielddata=true to build the field data structure in memory at index time rather than constructing it at search time. I understand this will increase pressure on the JVM heap, so I may do some frequency filtering to only place certain terms in memory. The other option I've come across is setting eager_global_ordinals=true, which in some ways seems similar to enabling fielddata, as it also builds in-memory structures at index time. I'm a bit new to ES, and the terminology between the two is somewhat confusing to me. What I'd love to know is: what is the difference between the two, does enabling one or both of them seem reasonable to solve the latency issues I'm having, or have I completely misunderstood the docs? Thanks!

Enabling eager_global_ordinals won't have any effect on your queries. Global ordinals only help with aggregations: they would be loaded at index refresh time instead of being built at query time.
Enabling fielddata would also have no real effect on your queries. Its primary purpose is sorting and aggregations, which you don't really want to do on a text field anyway.
There's probably not much you can do about the first ES queries being slower. Better to focus on optimal index mappings, settings, shards, and document sizes.
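For reference, this is roughly what the two settings would look like in a mapping (a sketch based on the question's mapping; note that on a text field, eager_global_ordinals also requires fielddata to be enabled, and both can put significant pressure on the heap for a field this large):

```json
"mappings": {
  "properties": {
    "document_text": {
      "type": "text",
      "fielddata": true,
      "eager_global_ordinals": true
    }
  }
}
```

With this mapping, the fielddata structure and global ordinals are built eagerly at refresh time rather than lazily on the first query that needs them - but as noted above, match queries on a text field use neither.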

Related

Would ordering of documents when indexing improve Elasticsearch search performance?

I'm indexing about 40M documents into Elasticsearch. It's usually a one-off data load, and then we run queries on top; there are no further updates to the index itself. However, the default settings of Elasticsearch aren't getting me the throughput I expected.
So, in a long list of things to tune and verify, I was wondering whether ordering by a business key would help improve the search throughput. All our analysis queries use this key; it is already indexed as a keyword, and we filter on it like below:
{
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "type": "cross_fields",
          "query": "store related query",
          "minimum_should_match": "30%",
          "fields": [ "field1^5", "field2^5", "field3^3", "field4^3", "firstLine", "field5", "field6", "field7" ]
        }
      },
      "filter": {
        "term": {
          "businessKey": "storename"
        }
      }
    }
  }
}
This query is run in a bulk fashion, about 20M times in a matter of a few hours. Currently I cannot go past 21k/min, but that could be because of various factors. Any tips to improve performance for this sort of workflow (load once and search a lot) would be appreciated.
However, I'm particularly interested to know whether I could order the data by business key when indexing, so that the data for each businessKey lives within a single Lucene segment and the lookup would therefore be quicker. Is that line of thought correct? Is this something ES already does, given that it's a keyword term?
It's a very good performance-optimization use case, and as you already mentioned, there is a list of optimizations you will need to work through.
I can see you are already building the query correctly: filtering the records on businessKey first and then searching the remaining docs. This way you are already utilizing the filter cache of Elasticsearch.
As you have a huge number of documents (~40M), it doesn't make sense to put all of them in a single segment. The default maximum segment size is 5 GB, and segments beyond that are excluded from merging, so it's almost impossible for you to have just one segment for your data.
I think a couple of things you can do are:
Disable the refresh interval once you are done ingesting your data and are preparing the index for search.
As you are using filters, your request cache should be used; monitor the cache usage while querying to see how often results are served from the cache:
GET your-index/_stats/request_cache?human
Read throughput is higher when you have more replicas; if you have spare nodes in your Elasticsearch cluster, make sure those nodes hold replicas of your index.
Monitor the search queues on each node and make sure they are not getting exhausted, otherwise you will not be able to increase the throughput; refer to thread pools in ES for more info.
Your main issue is throughput - you want to go beyond the current limit of 21k/min - so this requires a lot of index- and cluster-configuration optimization as well. I have written short tips to improve search performance; please refer to them and let me know how it goes.
Indexing is faster with more shards (and segments), and querying is faster with fewer shards (and segments).
If you are doing a one-time load, you can force-merge segments after indexing, and also think about adding more replicas after indexing to distribute search and increase throughput.
If each query is specific to a business key, segregating the data while indexing and creating one index per business key might help; similarly, while querying, direct each query to the specific index for the business key being queried.
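As a sketch, the one-time-load steps above (disable refresh during the load; force-merge and add replicas afterwards) might look like this, with a hypothetical index name:

```json
PUT /my-index/_settings
{ "index": { "refresh_interval": -1 } }

# ...run the bulk load, then:

POST /my-index/_forcemerge?max_num_segments=1

PUT /my-index/_settings
{ "index": { "refresh_interval": "1s", "number_of_replicas": 2 } }
```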
Going through the links below might help:
Tune for Search
Tune for Indexing
Optimise Search

Advice on efficient ElasticSearch document design

I'm working on a project that deals with listings (think: Craiglist, Ebay, Trulia, etc).
The basic unit of information is a "Listing", something like this:
{
  "id": 1,
  "title": "Awesome apartment!",
  "price": 1000000,
  // other stuff
}
Some fields can be searched on (e.g. price, location), others are just for display purposes in the application (e.g. title, and description, which contains lots of HTML).
My question is: should I store all the data in one document, or split it into two (one for searching, e.g. 'ListingSearchIndex', and one for display, e.g. 'ListingIndex')?
I also have to do some pretty hefty aggregations across the documents too.
I guess the question is: would searching across smaller documents and then making another call to fetch the results by ID be faster than just searching across the full documents?
The main factor is obviously speed, but if I split the documents, maintenance would be a factor too.
Any suggestions on best practices?
Thanks :)
In my experience with Elasticsearch, shard configuration has been significant for cluster performance/speed when querying, aggregating, etc. Since every shard consumes cluster resources (memory/CPU) and adds to cluster overhead, it is ideal to get the shard count right so the cluster is not overloaded. Our cluster was over-sharded, and it impacted loading search results, visualizations, heavy aggregations, etc. Once we fixed our shard count, it worked flawlessly!
https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
Aim to keep the average shard size between a few GB and a few tens of GB. For use-cases with time-based data, it is common to see shards between 20GB and 40GB in size.
The number of shards you can hold on a node will be proportional to the amount of heap you have available, but there is no fixed limit enforced by Elasticsearch. A good rule-of-thumb is to ensure you keep the number of shards per node below 20 to 25 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600-750 shards, but the further below this limit you can keep it the better. This will generally help the cluster stay in good health.
Besides performance, I think there are other aspects to consider here.
ElasticSearch offers weaker guarantees in terms of correctness and robustness than other databases (on this topic see their blog post ElasticSearch as a NoSQL database). Its focus is on search, and search performance.
For those reasons, as they mention in the blog post above:
Elasticsearch is commonly used in addition to another database
One way to go about following that pattern:
Store your data in a primary database (e.g. a relational DB)
Index only what you need for your search and aggregations, and to link search results back to items in your primary DB
Get what you need from the primary DB before displaying - i.e. the data for display should mostly come from the primary DB.
The gist of this approach is to not treat ElasticSearch as a source of truth; and instead have another source of truth that you index data from.
Another advantage of doing things that way is that you can easily reindex from your primary DB when you change your index mapping for a new search use case (or on changing index-time processing like analyzers etc...).
I think you can't answer this question without knowing all your queries in advance. For example, consider that you split the documents and later decide you need to filter on a field stored in one index and sort by a field stored in the other. That would be a big problem!
So my advice: if you are not sure where you are heading, just put everything in one index. You can always reindex and remodel later.
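Remodelling later is straightforward with the Reindex API; a minimal sketch with hypothetical index names:

```json
POST /_reindex
{
  "source": { "index": "listings_v1" },
  "dest": { "index": "listings_v2" }
}
```

Create listings_v2 with the new mapping first, run the reindex, then switch your application (or an alias) over to the new index.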

consequences of increasing max_result_window on elastic search

We have an index whose max_result_window was set to the default of 10,000, but our data is increasing and we expect to have more than 1 million docs there. One of our requirements is to scroll through all the data from start to end, 1000 docs in each batch. Our documents are not very big; I'll write down one example here:
{
  "serp_query": "c=44444&ct=333333",
  "uid": "5815697",
  "notify_status": 0,
  "created_at": "2018-02-04 10:00:00"
}
I've set max_result_window to 10,000,000, but at this time we have almost 50K docs in our index. I've read some texts about the consequences of this increase:
Values higher than that can consume significant chunks of heap memory per
search and per shard executing the search. It's safest to leave this
value as it is and use the scroll api for any deep scrolling
https://www.elastic.co/guide/en/elasticsearch/reference/2.x/breaking_21_search_changes.html#_from_size_limits
But our documents are not too big, and our Elastic server has 16GB of dedicated RAM, so I guess there is no problem.
I'm writing to ask two questions:
Given the sample doc (all our docs have the same fields), how big could this get for one million docs? I mean, how much heap memory will be needed to handle this?
Is this a very bad solution that will cause big problems for us in the future? Should we use scrolling instead of offset and size?
Our query is not very complicated: loop over all the data ordered by "created_at" descending and get 1000 docs in each batch.
FYI: our Elasticsearch version is 2.7.
Just to share the result with others:
If the documents are not very big and your queries are not very complicated, increasing max_result_window does not have a big effect on performance.
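For completeness, the setting is changed per index like this (value from the question; index name hypothetical). The heap concern exists because a from/size search builds a priority queue of up to from + size hits on each shard involved, so the cost scales with the window size and the shard count, not with how many docs are actually stored:

```json
PUT /my-index/_settings
{ "index": { "max_result_window": 10000000 } }
```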

Performance issues using Elasticsearch as a time window storage

We are using elastic search almost as a cache, storing documents found in a time window. We continuously insert a lot of documents of different sizes and then we search in the ES using text queries combined with a date filter so the current thread does not get documents it has already seen. Something like this:
"((word1 AND word 2) OR (word3 AND word4)) AND insertedDate > 1389000"
We maintain the data in Elasticsearch for 30 minutes, using the TTL feature. Today we have at least 3 machines inserting new documents in bulk requests every minute per machine, and searching with queries like the one above practically continuously.
We are having a lot of trouble indexing and retrieving these documents; we are not getting a good throughput of documents being indexed and returned by ES. We can't get even 200 documents indexed per second.
We believe the problem lies in the simultaneous queries, inserts and TTL deletes. We don't need to keep old data in elastic, we just need a small time window of documents indexed in elastic at a given time.
What should we do to improve our performance?
Thanks in advance
Machine type:
An Amazon EC2 medium instance (3.7 GB of RAM)
Additional information:
The code used to build the index is something like this:
https://gist.github.com/dggc/6523411
Our elasticsearch.json configuration file:
https://gist.github.com/dggc/6523421
EDIT
Sorry about the long delay to give you guys some feedback. Things were kind of hectic here at our company, and I chose to wait for calmer times to give a more detailed account of how we solved our issue. We still have to do some benchmarks to measure the actual improvements, but the point is that we solved the issue :)
First of all, I believe the indexing performance issues were caused by a usage error on our part. As I said before, we used Elasticsearch as a sort of cache, to look for documents inside a 30-minute time window. We looked for documents in Elasticsearch whose content matched some query, and whose insert date was within some range. Elastic would then return the full document JSON (which had a whole lot of data besides the indexed content). By mistake, our configuration had Elastic indexing the document JSON field (besides the content and insertDate fields), which we believe was the main cause of the indexing performance issues.
However, we also did a number of modifications, as suggested by the answers here, which we believe also improved the performance:
We now do not use the TTL feature, and instead use two "rolling indexes" under a common alias. When an index gets old, we create a new one, assign the alias to it, and delete the old one.
Our application does a huge number of queries per second. We believe this hits elastic hard, and degrades the indexing performance (since we only use one node for elastic search). We were using 10 shards for the node, which caused each query we fired to elastic to be translated into 10 queries, one for each shard. Since we can discard the data in elastic at any moment (thus making changes in the number of shards not a problem to us), we just changed the number of shards to 1, greatly reducing the number of queries in our elastic node.
We had 9 mappings in our index, and each query would be fired to a specific mapping. Of those 9 mappings, about 90% of the documents inserted went to two of those mappings. We created a separate rolling index for each of those mappings, and left the other 7 in the same index.
Not really a modification, but we installed SPM (Scalable Performance Monitoring) from Sematext, which allowed us to closely monitor elastic search and learn important metrics, such as the number of queries fired -> sematext.com/spm/index.html
Our usage numbers are relatively small. We have about 100 documents/second arriving which have to be indexed, with peaks of 400 documents/second. As for searches, we have about 1500 searches per minute (15000 before changing the number of shards). Before those modifications, we were hitting those performance issues, but not anymore.
TTL to time-series based indexes
You should consider using time-series-based indexes rather than the TTL feature. Given that you only care about the most recent 30 minute window of documents, create a new index for every 30 minutes using a date/time based naming convention: ie. docs-201309120000, docs-201309120030, docs-201309120100, docs-201309120130, etc. (Note the 30 minute increments in the naming convention.)
Using Elasticsearch's index aliasing feature (http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases/), you can alias docs to the most recently created index so that when you are bulk indexing, you always use the alias docs, but they'll get written to docs-201309120130, for example.
When querying, you would filter on a datetime field to ensure only the most recent 30 mins of documents are returned, and you'd need to query against the 2 most recently created indexes to ensure you get your full 30 minutes of documents - you could create another alias here to point to the two indexes, or just query against the two index names directly.
With this model, you don't have the overhead of TTL usage, and you can just delete the old, unused indexes from over an hour in the past.
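The rotation described above can be sketched with the aliases API, using the index names from the example; the remove/add pair is applied atomically:

```json
POST /_aliases
{
  "actions": [
    { "remove": { "index": "docs-201309120100", "alias": "docs" } },
    { "add": { "index": "docs-201309120130", "alias": "docs" } }
  ]
}

# drop indexes older than an hour:
DELETE /docs-201309120000
```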
There are other ways to improve bulk indexing and querying speed as well, but I think removal of TTL is going to be the biggest win - plus, your indexes only have a limited amount of data to filter/query against, which should provide a nice speed boost.
Elasticsearch settings (eg. memory, etc.)
Here are some settings that I commonly adjust for servers running ES - http://pastebin.com/mNUGQCLY. Note that it's only for a 1GB VPS, so you'll need to adjust.
Node roles
Looking into master vs data vs 'client' ES node types might help you as well - http://www.elasticsearch.org/guide/reference/modules/node/
Indexing settings
When doing bulk inserts, consider modifying the values of both index.refresh_interval and index.merge.policy.merge_factor. I see that you've modified refresh_interval to 5s, but consider setting it to -1 before the bulk indexing operation and then back to your desired interval afterwards. Or consider just doing a manual _refresh API call after your bulk operation is done, particularly if you're only doing bulk inserts every minute - it's a controlled environment in that case.
With index.merge.policy.merge_factor, setting it to a higher value reduces the amount of segment merging ES does in the background during the bulk load; setting it back to its default afterwards restores normal behaviour. A setting of 30 is commonly recommended for bulk inserts; the default value is 10.
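A sketch of toggling those two settings around a bulk load (index name hypothetical; merge_factor applies to the older log merge policies and was removed in later ES versions):

```json
PUT /my-index/_settings
{ "index": { "refresh_interval": -1, "merge.policy.merge_factor": 30 } }

# ...run the bulk inserts, then restore:

PUT /my-index/_settings
{ "index": { "refresh_interval": "5s", "merge.policy.merge_factor": 10 } }
```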
Some other ways to improve Elasticsearch performance:
increase index refresh interval. Going from 1 second to 10 or 30 seconds can make a big difference in performance.
throttle merging if it's being overly aggressive. You can also reduce the number of concurrent merges by lowering index.merge.policy.max_merge_at_once and index.merge.policy.max_merge_at_once_explicit. Lowering the index.merge.scheduler.max_thread_count can help as well
It's good to see you are using SPM. The URL in your EDIT was not a hyperlink - it's at http://sematext.com/spm . The "Indexing" graphs will show how changing the merge-related settings affects performance.
I would fire up an additional ES instance and have it form a cluster with your current node. Then I would split the work between the two machines, use one for indexing and the other for querying. See how that works out for you. You might need to scale out even more for your specific usage patterns.

Should I keep the size of stored fields in Solr to a minimum?

I am looking to introduce Solr to power the search for a business listing website. The site has around 2 million records.
There is a search results page which will display some key data for each result. I believe the data needed for this summary information is around 1KB per result.
I could simply index only the fields needed for search within Solr - but this means a separate database call for each result to populate the summary information. If Solr could return all of this data, I would expect it to yield greater performance than ~40 database round-trips.
The concern is that Solr's memory usage would be too large (how might I calculate this?) and that indexing might take too long with the extra data.
You would benefit greatly from storing those fields in Solr compared to the 40 DB round-trips. Just make sure that you mark the fields as not indexed (indexed="false") in your schema config, and maybe also compressed (compressed="true") (though this will of course use some CPU when indexing and retrieving).
When a field is marked as not indexed, no analyzers process it at index time, making it much faster to store than an indexed field.
It's a trade off, and you will have to analyze this yourself.
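A sketch of such a stored-but-not-indexed field in schema.xml (field name hypothetical; the compressed attribute existed only in older Solr releases and was later removed):

```xml
<field name="summary_html" type="string" indexed="false" stored="true"/>
```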
Solr's performance greatly depends on caching, not only of queries, but also of the documents themselves. Those caches depend on memory, and the bigger your documents are, the less you can fit in a fixed amount of memory.
Document size also affects index size and replication times. For large indices with master slave configurations, this can impact the rate at which you can update the index.
Ideally you should measure cache hit rates at different cache sizes, with and without the fields. If you can spend the memory to get a high enough cache hit rate with the fields, then by all means go for it. If you cannot, you may have to fetch the document content from another system.
There is a third alternative you didn't mention, which is to store the documents outside of the DB, but not in Solr. They should be stored in a format which is as close as possible to what you deliver with search results. The code which creates/updates the indices could create/update these documents as well. This is a lot of work, but like everything it comes down to how much performance you need and what you are willing to do to get it.
EDIT: For measuring cache hit rates and throughput, I've found the best test source is your current query logs. Take a day or two worth of live queries and run them against different indexes and configurations to see how well they work.
