Facet Queries Slow During Replication in Solr 6.4.2 - performance

We have a Solr core into which we replicate data every 30 minutes.
I am facing an issue regarding the searcher.
Every time replication happens, simple queries execute as expected, but facet queries take a very long time to execute.
I have enabled the cold searcher setting.
Sample facet queries:
2017-12-22 04:40:25.801 INFO (qtp834133664-539) [ x:cdaapp] o.a.s.c.S.Request [cdaapp] webapp=/solr path=/select params={q=sJID:8664459&facet.field=sS&facet.field=sHQLID&facet.field=sFCLID&facet.field=sASIDN&facet.field=sNEx&qt=edismax&facet.mincount=1&rows=0&facet=on&wt=json} hits=15 status=0 QTime=14651
2017-12-22 04:40:25.823 INFO (qtp834133664-569) [ x:cdaapp] o.a.s.c.S.Request [cdaapp] webapp=/solr path=/select params={q=sJID:8641232&facet.field=sS&facet.field=sHQLID&facet.field=sFCLID&facet.field=sASIDN&facet.field=sNEx&qt=edismax&facet.mincount=1&rows=0&facet=on&wt=json} hits=13 status=0 QTime=11226

I'd suggest setting the facet method to enum:
facet.method=enum
This parameter controls which algorithm/method is used when faceting a field.
enum - Enumerates all terms in a field, calculating the set intersection of documents that match the term with documents that match the query. This was the default (and only) method for faceting multi-valued fields prior to Solr 1.4.
fc (stands for Field Cache) - The facet counts are calculated by iterating over documents that match the query and summing the terms that appear in each document. This was the default method for single-valued fields prior to Solr 1.4.
fcs (stands for Field Cache per Segment) - Works the same as fc, except the underlying cache data structure is built for each segment of the index individually.
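For illustration, a minimal sketch in Python using the requests library, reproducing the facet query from the log lines above with facet.method=enum applied. The host and port are assumptions; the core name "cdaapp" and the field names come from the logs.

import requests

# Same facet query as in the logs above, with facet.method=enum set
# globally for all facet fields. Host/port are assumptions; the core
# name "cdaapp" and the field names come from the log lines.
SOLR_URL = "http://localhost:8983/solr/cdaapp/select"

params = {
    "q": "sJID:8664459",
    "rows": 0,                  # facet counts only, no documents
    "facet": "on",
    "facet.mincount": 1,
    "facet.field": ["sS", "sHQLID", "sFCLID", "sASIDN", "sNEx"],
    "facet.method": "enum",     # enumerate terms instead of the field cache
    "wt": "json",
}

resp = requests.get(SOLR_URL, params=params)  # list values become repeated facet.field params
resp.raise_for_status()
print(resp.json()["facet_counts"]["facet_fields"])

With enum, Solr walks the term dictionary and uses the filterCache for the set intersections, rather than rebuilding the uninverted field-cache structure that fc needs after each new searcher opens, which is likely why it behaves better right after replication.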

Related

Does Elasticsearch query aggregation have a limit on what it can process or aggregate?

Hello Community, I am fairly new to Elasticsearch and have stumbled upon this issue.
My Elasticsearch application requires aggregating data to get top hits. The query works perfectly in almost all use cases and we get the values in buckets. But in some cases, where the field being aggregated contains very long text, we do not get the desired results: the aggregation does not happen and we get back buckets as an empty array. So my question is: does Elasticsearch aggregation have a size limit such that it cannot aggregate very long text fields?
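To make the setup concrete, here is a hedged sketch of the kind of terms aggregation described; the index and field names are hypothetical. Buckets come back as an empty array whenever no indexed terms exist for the field on the matching documents (for example, keyword values longer than an ignore_above cutoff are skipped at index time).

import requests

# Hypothetical terms aggregation over a text-heavy field. If the
# keyword sub-field skipped the long values at index time (e.g. via
# ignore_above), the buckets array is empty even though hits match.
body = {
    "size": 0,
    "aggs": {
        "top_values": {
            "terms": {"field": "description.keyword", "size": 10}
        }
    },
}

resp = requests.post("http://localhost:9200/my-index/_search", json=body)
resp.raise_for_status()
print(resp.json()["aggregations"]["top_values"]["buckets"])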

Using stored_fields for retrieving a subset of the fields in Elasticsearch

The documentation and recommendations for using the stored_fields feature in Elasticsearch have been changing. In the latest version (7.9), stored_fields is not recommended - https://www.elastic.co/guide/en/elasticsearch/reference/7.9/search-fields.html
Is there a reason for this?
Whereas in version 7.4.0, there is no such negative comment - https://www.elastic.co/guide/en/elasticsearch/reference/7.4/mapping-store.html
What is the guidance on using this feature? Is using _source filtering a better option? I ask because in some other doc, _source filtering is supposed to kill performance - https://www.elastic.co/blog/found-optimizing-elasticsearch-searches
"If you use _source or _fields you will quickly kill performance. They access the stored fields data structure, which is intended to be used when accessing the resulting hits, not when processing millions of documents."
What is the best way to filter fields and not kill performance with Elasticsearch?
_source filtering is the recommended way to fetch fields; the blog post is confusing you because you are missing the very important concept and use case to which its warning applies. Please read the statement below carefully.
_source is intended to be used when accessing the resulting hits, not when processing millions of documents.
By default, Elasticsearch returns only 10 hits/search results (this can be changed with the size parameter). If you want to fetch a few field values from your search results, using a source filter makes perfect sense, as it is applied to the final result set (not to all the documents matching the query).
If, on the other hand, you use a script that reads field values out of _source to filter the search results, that causes the query to scan the whole index, which is the second part of the statement above ("not when processing millions of documents").
Apart from that, since all field values are already stored in the _source field (which is enabled by default), you need not allocate extra space by explicitly marking a few fields as stored (storing is disabled by default to save index size) in order to retrieve field values.
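Here is a minimal sketch of _source filtering on an ordinary search request; the cluster URL, index, and field names are assumptions.

import requests

# _source filtering: only the listed fields are extracted from the
# stored _source of each returned hit. Index and field names are
# hypothetical; the cluster URL is assumed to be local.
body = {
    "query": {"match": {"title": "elasticsearch"}},
    "_source": ["title", "author"],  # trim each hit to these fields
    "size": 10,
}

resp = requests.post("http://localhost:9200/articles/_search", json=body)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"])

Because the filter runs only over the (by default 10) returned hits, it stays cheap no matter how many documents matched the query.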

Elasticsearch _doc with sort

I am paginating Elasticsearch data using search_after, sorting by _uid. Sorting over a huge amount of data leads to a circuit breaker exception, as the field data size limit is exceeded. A possible solution provided by Elasticsearch for sorting a large dataset is to sort by _doc
(https://www.elastic.co/guide/en/elasticsearch/reference/6.3/search-request-sort.html)
This helps in getting the response quickly, without failing with a circuit breaker exception. However, I am concerned about the uniqueness of the value being used in search_after to get the next set of records, as this value will be used as a cursor for the next page.
https://www.elastic.co/guide/en/elasticsearch/reference/6.3/search-request-scroll.html
To understand what the _doc value represents, I went through the documentation, but there is no description of it.
In my data I see more than a few records with the same _doc value. These documents with the same _doc value have different _ids, meaning they are different records. Can anyone help me understand what this value represents? Can I use it in search_after? I am using Elasticsearch version 5.4.
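For reference, a hedged sketch of the pattern in question, against a hypothetical local index. Note that the _doc sort value is the shard-local Lucene document number, which is why different documents on different shards can share the same _doc value; on its own it is not a safe unique cursor.

import requests

# search_after pagination sorted by _doc (a sketch of the scenario in
# the question, not a recommendation). _doc is the shard-local Lucene
# doc number, so duplicate values across shards are expected.
URL = "http://localhost:9200/my-index/_search"  # hypothetical index
body = {
    "size": 100,
    "query": {"match_all": {}},
    "sort": [{"_doc": "asc"}],
}

resp = requests.post(URL, json=body)
resp.raise_for_status()
hits = resp.json()["hits"]["hits"]

while hits:
    body["search_after"] = hits[-1]["sort"]  # sort values of the last hit act as the cursor
    resp = requests.post(URL, json=body)
    resp.raise_for_status()
    hits = resp.json()["hits"]["hits"]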

Elasticsearch aggregations for all pages

I use the size and from keywords for pagination across Elasticsearch results, and each page change requires another search query to be executed.
I would like to compute facets with the aggregations feature; however, the aggregations seem to be computed only on the results constrained by the size and from keywords, e.g. when I ask for records 20-30 of the list, the aggregations appear to be computed only on those 10 returned records. And I would of course like to have global facets, computed on all the matching records, that do not change while I switch pages.
Any ideas how to do this, apart from performing an additional global (unconstrained by size and from) search?
Aggregations are computed on all documents that match "query". The scope of aggregations has nothing to do with "size" and "from" values.
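A small sketch illustrating this (the index, field, and query are hypothetical): the same aggregation requested alongside two different pages returns identical buckets, because the aggregation scope is the full set of matching documents.

import requests

# Same query and aggregation, two different pages. The "by_category"
# buckets are identical for both, since from/size only paginate hits.
def fetch_page(offset, size=10):
    body = {
        "from": offset,
        "size": size,
        "query": {"match": {"title": "elasticsearch"}},
        "aggs": {"by_category": {"terms": {"field": "category.keyword"}}},
    }
    resp = requests.post("http://localhost:9200/articles/_search", json=body)
    resp.raise_for_status()
    return resp.json()

page1 = fetch_page(0)
page3 = fetch_page(20)
assert page1["aggregations"] == page3["aggregations"]  # same global facets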

Elasticsearch questions: search, performance and caching

I'm new to Elasticsearch; I have been reading their API docs and some things are not clear to me.
1) It is said that filters are cached. What does that mean? If I send a query with a filter on it, what gets cached? The results of that query? If I send a different query with the same filter, will the cache help me somehow?
I know the question is kinda vague, but so is Elasticsearch's documentation on this.
2) Is there a real performance difference between a query matching a term X against the "_all" field and against a specific field? As far as I understand, both queries will be compared against all documents that contain X in one of their fields, and the only difference is how many fields will be matched against X in these documents. Is that correct?
1) For your first question, take a look at this link.
To quote from the post:
"Filters don’t score documents – they simply include or exclude. If a document matches a filter, it is represented with a one in the BitSet; otherwise a zero. This means that Elasticsearch can store an entire segment’s filter state (“who matches this particular filter?”) in a single, compact BitSet.
The first time Elasticsearch executes a filter, it parses Lucene segment data structures to determine what matches your filter. Instead of throwing away this information, it caches it inside a BitSet.
The next time the same filter is executed, Elasticsearch can reference the compact BitSet instead of the Lucene segments. This has huge performance benefits."
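As an illustration, here is a hedged sketch of a query carrying a filter clause (using bool/filter syntax; the index and field names are hypothetical). The filter's per-segment yes/no result is what gets cached as a BitSet and reused by any later query carrying the same filter.

import requests

# The filter clause does not score documents; its matching bitset is
# cacheable and reusable across queries that share the same filter.
body = {
    "query": {
        "bool": {
            "must": {"match": {"title": "elasticsearch"}},  # scored part
            "filter": {"term": {"status": "published"}},    # cacheable yes/no part
        }
    }
}

resp = requests.post("http://localhost:9200/articles/_search", json=body)
resp.raise_for_status()
print(resp.json()["hits"]["total"])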
2) "The idea of the _all field is that it includes the text of one or more other fields within the document indexed. It can come very handy especially for search requests, where we want to execute a search query against the content of a document, without knowing which fields to search on. This comes at the expense of CPU cycles and index size."link
So if you know what fields you are going to query use specifics fields to search on.
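To make the contrast concrete, a small sketch (assuming a pre-6.x cluster where the _all field is still available; the index and field names are hypothetical):

import requests

# Broad search over the catch-all _all field versus a targeted search
# on one known field. Restricting to a specific field avoids matching
# against text copied from every field of the document.
broad = {"query": {"match": {"_all": "error timeout"}}}
targeted = {"query": {"match": {"message": "error timeout"}}}

for body in (broad, targeted):
    resp = requests.post("http://localhost:9200/logs/_search", json=body)
    resp.raise_for_status()
    print(resp.json()["hits"]["total"])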
