Solr facet results differ on each query - caching

I'm running Solr 4.1 and have many facet counts defined. I noticed that when I send the same query to Solr multiple times, the resulting facet counts differ each time. For some reason, my filterCache is causing the first query result's facet counts to be different from those of all subsequent queries. When I remove my filterCache from the config, I do not see this issue at all.
Why does filterCache cause this inconsistency? Here's all the caching I have in my configs:
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="32"/>
<queryResultCache class="solr.LRUCache"
                  size="1024"
                  initialSize="1024"
                  autowarmCount="512"/>
<documentCache class="solr.LRUCache"
               size="1024"
               initialSize="1024"
               autowarmCount="0"/>
<queryResultWindowSize>10</queryResultWindowSize>
<queryResultMaxDocsCached>100</queryResultMaxDocsCached>

Related

Different scores for identical documents after upgrading from spring-data-elasticsearch 4.2.1 to 4.3.0

I'm currently in the process of upgrading the Spring Boot version of my project. After upgrading from 2.5 to 2.6, a few tests that deal with the retrieval of Elasticsearch documents started failing. I'm trying to fetch only the highest scoring documents, but when expecting 2 identical documents, only 1 is retrieved.
After reading up on the issue, I figured out that the problem comes down to the Elasticsearch index using multiple shards, each having its own scoring logic, and (probably?) the identical documents being fetched from different shards, thus resulting in different scores despite being virtually the same.
Now, can anyone tell me why this happens in the newer spring-data-elasticsearch version and if there is a setting to return it to the old functionality?
I've set up a little test project to play around with this. If anyone is interested in trying this for themselves, feel free to check it out: https://github.com/Moldavis/elasticsearch-scoring-poc
Actually found my own answer in the spring data breaking changes documentation (duh).
https://docs.spring.io/spring-data/elasticsearch/docs/current/reference/html/#elasticsearch-migration-guide-4.2-4.3.breaking-changes
search_type default value
The default value for the search_type in Elasticsearch is query_then_fetch. This now is also set as default value in the Query implementations, it was previously set to dfs_query_then_fetch.
The dfs_query_then_fetch option queries all shards for document and term frequency to equal out the score between different shards. This is no longer used by default, therefore the mentioned problem occurs.
It can be fixed by setting the search type for the query like so:
queryBuilder.withSearchType(SearchType.DFS_QUERY_THEN_FETCH);
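To illustrate why shard-local statistics give identical documents different scores, here is a minimal sketch in plain Python. It does not call Elasticsearch; the shard sizes and document frequencies are made-up numbers, and the idf function is only a BM25-style formula used for illustration.

```python
import math

def idf(doc_count, doc_freq):
    # BM25-style inverse document frequency (illustrative formula).
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical statistics for one query term:
# shard name -> (documents in shard, documents containing the term)
shards = {"shard_0": (1000, 10), "shard_1": (1000, 200)}

# query_then_fetch: every shard scores with its own local statistics,
# so the same document text gets a different idf depending on its shard.
local_idf = {name: idf(n, df) for name, (n, df) in shards.items()}

# dfs_query_then_fetch: statistics are collected from all shards first,
# so every shard scores with the same global idf.
total_docs = sum(n for n, _ in shards.values())
total_df = sum(df for _, df in shards.values())
global_idf = idf(total_docs, total_df)

print(local_idf, global_idf)
```

With these assumed numbers the two shards disagree on the term weight, while the global value sits between them and is the same everywhere.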

Does Elasticsearch have a Default Sort Order for Filter Queries?

Does Elasticsearch have a defined default sort order for filter queries if none is specified? Or is it more like an RDBMS without an order by - i.e. nothing is guaranteed?
From my experiments I appear to be getting my documents back in order of their id - which is exactly what I want - I am just wondering if this can be relied on?
When you only have filters (i.e. no scoring) and no explicit sort clause, then the documents are returned in index order, i.e. implicitly sorted by the special field named _doc.
Index order simply means the sequential order in which the documents have been indexed.
If your id is sequential and you've indexed your documents in the same order as your id, then what you observe is correct, but it might not always be the case.
No, the order cannot be relied on (in ES 7.12.1 at least)!
I've tested in a production environment, where we have a cluster with multiple shards and replicas, and even the simplest query, like this one, returns results in a different order every few requests:
POST /my_index/_search
One way to ensure the same order is to add a sort by _id, which seems to bring a small performance hit with it.
Also, I know it's not related to this question, but keep in mind that if you do have scoring in your query and you still get inconsistent results, even after adding a sort by _id, the cause is that in a cluster environment each shard computes scores from its own local statistics, so the scores can differ between requests. This problem can be solved by adding a parameter to your query:
POST /my_index/_search?search_type=dfs_query_then_fetch
More info and possible solutions can be found here:
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/consistent-scoring.html
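As a rough sketch of the explicit sort described above, the request body could be built like this. The helper name build_search_body and the status field in the usage line are hypothetical; only the query/bool/filter and sort keys come from the Elasticsearch query DSL.

```python
import json

def build_search_body(filters, sort_field="_id"):
    # Filter-only query (no scoring) with an explicit sort, so the order
    # no longer depends on which shard copy answers the request.
    return {
        "query": {"bool": {"filter": filters}},
        "sort": [{sort_field: "asc"}],
    }

# Hypothetical filter on a field named "status".
body = build_search_body([{"term": {"status": "active"}}])
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent as the body of POST /my_index/_search.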

Why does Elasticsearch return docs in a different order for the same query?

In Elasticsearch 7.9, I have an index with 1 shard and 1 replica. I use a simple datetime filter to get docs between a start time and an end time, but I often get the same result set in a different order. I do not want to use a sort clause or compute scores. I just want to get the results in the same order.
So is there any way to do this without using sort?
This may be happening because you have 1 replica for your index, and the replica might differ slightly from the primary or hold different values for your timestamp field. You can use the preference param to make sure your search results are always returned from the same shard.
Refer to the ES blog post on the bouncing results issue for more info.
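A minimal sketch of using the preference parameter: the same custom string (for example a session or user id) is sent with every search so that requests are routed to the same shard copies. The helper name, base URL, and the value user-42 are assumptions for illustration.

```python
from urllib.parse import urlencode

def search_url(base_url, index, preference):
    # Custom preference strings must not start with "_", since that
    # prefix is reserved for built-in values such as _local.
    if preference.startswith("_"):
        raise ValueError("custom preference must not start with '_'")
    query = urlencode({"preference": preference})
    return f"{base_url}/{index}/_search?{query}"

# Hypothetical host, index name, and preference value.
url = search_url("http://localhost:9200", "my_index", "user-42")
print(url)
```

Reusing one preference value per user or session keeps the ordering stable for that user while still spreading different users across copies.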

Stormcrawler - how does the es.status.filterQuery work?

I am using StormCrawler to put data into some Elasticsearch indexes, and I have a bunch of URLs in the status index, with a variety of statuses - DISCOVERED, FETCHED, ERROR, etc.
I was wondering if I could tell StormCrawler to just crawl the urls that are https and with the status: DISCOVERED and if that would actually work. I have the es-conf.yaml set as follows:
es.status.filterQuery: "-(url:https* AND status:DISCOVERED)"
Is that correct? How does SC make use of the es.status.filterQuery? Does it run a search and apply the value as a filter to retrieve only the applicable documents to fetch?
See the code of the AggregationSpout.
"How does SC make use of the es.status.filterQuery? Does it run a search and apply the value as a filter to retrieve only the applicable documents to fetch?"
Yes, it filters the queries sent to the ES shards. This is useful, for instance, to process a subset of a crawl.
It is a positive filter i.e. the documents must match the query in order to be retrieved; you'd need to remove the - for it to do what you described.
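The difference between the positive filter and the negated one can be modelled in plain Python. This is only a toy model of the matching logic, not StormCrawler or Elasticsearch code, and the sample URLs are invented.

```python
# Hypothetical status index entries.
docs = [
    {"url": "https://example.com/a", "status": "DISCOVERED"},
    {"url": "http://example.com/b", "status": "DISCOVERED"},
    {"url": "https://example.com/c", "status": "FETCHED"},
]

def matches(doc):
    # Stand-in for: url:https* AND status:DISCOVERED
    return doc["url"].startswith("https") and doc["status"] == "DISCOVERED"

# Positive filter: only documents matching the query are retrieved.
positive = [d["url"] for d in docs if matches(d)]

# Leading "-": everything EXCEPT the matching documents is retrieved.
negated = [d["url"] for d in docs if not matches(d)]

print(positive)
print(negated)
```

In this model the negated form returns exactly the documents the question wanted to skip, which is why the leading - has to go.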

Facet Queries Slow During Replication in Solr 6.4.2

We have a Solr core in which we replicate data every 30 minutes.
I am facing an issue regarding the searcher.
Every time replication happens, the simple queries are executed as expected, but the facet queries take a very long time to execute.
I have enabled the cold searcher setting.
Sample facet queries:
2017-12-22 04:40:25.801 INFO (qtp834133664-539) [ x:cdaapp] o.a.s.c.S.Request [cdaapp] webapp=/solr path=/select params={q=sJID:8664459&facet.field=sS&facet.field=sHQLID&facet.field=sFCLID&facet.field=sASIDN&facet.field=sNEx&qt=edismax&facet.mincount=1&rows=0&facet=on&wt=json} hits=15 status=0 QTime=14651
2017-12-22 04:40:25.823 INFO (qtp834133664-569) [ x:cdaapp] o.a.s.c.S.Request [cdaapp] webapp=/solr path=/select params={q=sJID:8641232&facet.field=sS&facet.field=sHQLID&facet.field=sFCLID&facet.field=sASIDN&facet.field=sNEx&qt=edismax&facet.mincount=1&rows=0&facet=on&wt=json} hits=13 status=0 QTime=11226
I'd suggest setting facet method to enum.
facet.method=enum
This parameter indicates what type of algorithm/method to use when faceting a field.
enum - Enumerates all terms in a field, calculating the set intersection of documents that match the term with documents that match the query. This was the default (and only) method for faceting multi-valued fields prior to Solr 1.4.
fc (stands for Field Cache) - The facet counts are calculated by iterating over documents that match the query and summing the terms that appear in each document. This was the default method for single valued fields prior to Solr 1.4.
fcs (stands for Field Cache per Segment) - Works the same as fc except the underlying cache data structure is built for each segment of the index individually.
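The two counting strategies described above can be sketched in plain Python. This is a conceptual model only, not Solr's implementation; the tiny index and the matching set are invented, and the enum variant drops zero counts to mimic facet.mincount=1.

```python
from collections import Counter

# Hypothetical index: doc id -> terms of the faceted field.
docs = {1: ["red"], 2: ["red", "blue"], 3: ["blue"], 4: ["green"]}
matching = {1, 2, 3}  # ids of documents that match the query

def facet_enum(docs, matching):
    # enum: walk every term and intersect its document set
    # with the set of documents matching the query.
    postings = {}
    for doc_id, terms in docs.items():
        for term in terms:
            postings.setdefault(term, set()).add(doc_id)
    counts = {}
    for term, ids in postings.items():
        hits = len(ids & matching)
        if hits >= 1:  # mimic facet.mincount=1
            counts[term] = hits
    return counts

def facet_fc(docs, matching):
    # fc: walk the matching documents and sum the terms they contain.
    counts = Counter()
    for doc_id in matching:
        counts.update(docs[doc_id])
    return dict(counts)

print(facet_enum(docs, matching))
print(facet_fc(docs, matching))
```

Both functions return the same counts; they differ in what they iterate over, which is why the cost of enum grows with the number of distinct terms while fc grows with the number of matching documents.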
