In a multi-node Solr installation (without SolrCloud), in a paging scenario (e.g., start=1000, rows=200), the primary node asks each shard for 1200 rows. If highlighting is on, the primary node is asking each shard to highlight all 1200 of its results, which doesn't scale well. Is there a way to break the shard query into two steps, i.e., first ask for the 1200 rows, and after sorting the responses from each shard and finding the final rows to return (1001 to 1200), issue a second query to the shards asking for highlighted responses only for the relevant docs?
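To illustrate the fan-out cost described above, here is a small sketch; the shard count is hypothetical and the arithmetic just mirrors how a distributed coordinator over-fetches:

```python
# Sketch: why deep paging with highlighting is expensive in a
# distributed (non-SolrCloud) Solr setup. Shard count is hypothetical.
start, rows, num_shards = 1000, 200, 4

# Each shard must return its top (start + rows) docs so the
# coordinator can merge-sort them globally.
per_shard_rows = start + rows          # 1200 docs per shard

# With hl=true on the first pass, every one of those docs gets
# highlighted, even though only `rows` of them are finally returned.
highlighted_docs = per_shard_rows * num_shards
needed_highlights = rows

print(highlighted_docs, "highlights computed for", needed_highlights, "needed")
```

With these numbers, 4800 highlights are computed to serve 200, which is the waste the two-step approach would avoid.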
So, it turns out that Solr's behavior has changed between my old version and 6.6, although after some initial investigation this looks like a bug to me.
I found that if I specify fl=* in the query, it does the right thing (a two-pass process, as it did in Solr 4.5). However, my queries use fl=id+score, in which case the shards are asked to highlight all the results on the first request (and there is no second request).
In my sample case, the fl=* query finishes in 100 ms, while the same query with fl=id+score takes 1200 ms.
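For reference, the two query variants only differ in the fl parameter; a sketch of how they might look (host, collection, query, and highlight field are placeholders):

```python
from urllib.parse import urlencode

# Hypothetical host/collection; only fl differs between the two runs.
base = "http://localhost:8983/solr/mycollection/select"
common = {"q": "body:example", "start": 1000, "rows": 200,
          "hl": "true", "hl.fl": "body"}

fast_params = dict(common, fl="*")          # two-pass behavior observed
slow_params = dict(common, fl="id,score")   # one-pass, highlights everything

fast_url = base + "?" + urlencode(fast_params)
slow_url = base + "?" + urlencode(slow_params)
print(fast_url)
print(slow_url)
```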
I am using Coherence for storing cached data. I am also using Apache Lucene indexing for faster retrieval of records searched by attribute values. I am facing a problem of search delays, or sometimes a timeout, when searching for records by one specific attribute.
If the same record is searched by another attribute value, it is retrieved instantly. E.g., for a record {key=1234, value={a=abc, b=def, c=pqr}}: when searched with the Lucene query b=def, the search is as fast as expected. But when the same record is searched with the Lucene query c=pqr, it either times out in Coherence or takes significant time (more than 100 ms, as against the expected 2 to 5 ms). I verified that the Lucene index and sort settings are exactly the same for both the b and c fields. I am not able to figure out the reason for this delay or how to resolve it.
I tried to debug the code to identify any different execution paths for the different field searches, but did not find any differences.
Curious if there is some way to check whether a document ID is part of a large (million+ results) Elasticsearch query/filter.
Essentially, I'll have a group of related document IDs and only want to return them if they are part of a larger query. Hoping to do this on the database side. It seemed theoretically possible, since ES has to cache state related to large scrolls.
It's an interesting use case, but you need to understand that Elasticsearch (ES) doesn't return all the matching document IDs in the search result; by default it returns only 10 documents in the response, which can be changed with the size parameter.
If you increase the size param while having millions of matching docs, query performance will be very bad, and frequent queries like that might even bring the entire cluster down (in the absence of a circuit breaker), so be cautious about it.
You are right that ES caches things, but if you cache a huge amount of data that is invalidated very frequently, you will not get the expected performance benefit, so benchmark it first.
You are already on the right path with the scroll API for iterating over millions of search results; see the points below to improve further.
First, get the total hit count; it is included in the default search response (as an exact or lower-bound value) and tells you how many results you have, based on which you can choose the size param for subsequent calls to check whether your ID is present.
Check that you effectively utilize the filter context in your query, which is cached by ES by default.
Benchmark some of your heavy scroll API calls against your data.
Refer to this thread to fine-tune your cluster and index configuration to optimize ES response times further.
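Building on the filter-context point above, one server-side way to check membership is to AND the large query with an ids clause in a bool filter: any hit that comes back is, by construction, part of the larger result set. A minimal sketch of such a request body (index, query, and IDs are hypothetical):

```python
import json

# IDs whose membership in the larger query we want to check (hypothetical).
doc_ids = ["doc-1", "doc-2", "doc-3"]

body = {
    "size": len(doc_ids),    # at most this many can match
    "_source": False,        # we only need the IDs back
    "query": {
        "bool": {
            "must": {"match": {"status": "active"}},   # the large query
            "filter": {"ids": {"values": doc_ids}},    # membership check
        }
    },
}

# This body would be POSTed to /myindex/_search; here we just print it.
print(json.dumps(body, indent=2))
```

The filter clause does not contribute to scoring and is eligible for the filter cache, which keeps repeated membership checks cheap.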
I'm using elasticsearch.js to move a document from one index to another.
1a) Query index_new for all docs and display on the page.
1b) Use query of index_old to obtain a document by id.
2) Use an insert to index_new, inserting result from index_old.
3) Delete document from index_old (by id).
4) Re-query index_new to see all docs (including the new one). However, at this point it returns the same list of results as in 1a, not including the new document.
Is this because of caching? When I refresh the whole page and 1a is triggered again, the new document is there, but not without a refresh.
Thanks,
Daniel
This is due to the segment merging and refreshing that happens inside Elasticsearch indexes, per shard and replica.
Whenever you write to an index, you never write to the original index file; instead you write to newer, smaller files called segments, which are then merged into bigger files by background jobs.
The next question you might have is: how often does this happen, and how can one control it?
There is an index-level setting called refresh_interval. It can take different values depending on the strategy you want to use:
refresh_interval -
-1 : disables automatic refresh; you control refreshes yourself with the _refresh API.
X : a duration (e.g., 30s); Elasticsearch will refresh the index every X.
If you have replication enabled on your indexes, you might also see result values toggling. This happens because an index has multiple shards, a shard has multiple replicas, and different replicas refresh on different schedules. In the meantime, queries are routed to different shard replicas, which can be in different states within that time window.
Hence, if you set a periodic refresh interval of X, assume you will reach a consistent state within X to 2X seconds at most.
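The two options above can be exercised like this; the sketch only builds the request methods, paths, and bodies (the index name is hypothetical) rather than talking to a cluster:

```python
import json

index = "index_new"   # hypothetical index name

# Option 1: let Elasticsearch refresh periodically (here every 1s).
periodic = ("PUT", f"/{index}/_settings",
            {"index": {"refresh_interval": "1s"}})

# Option 2: disable automatic refresh and trigger it manually
# right after the write, so the next query sees the new document.
manual_off = ("PUT", f"/{index}/_settings",
              {"index": {"refresh_interval": "-1"}})
manual_refresh = ("POST", f"/{index}/_refresh", None)

for method, path, body in (periodic, manual_off, manual_refresh):
    print(method, path, json.dumps(body) if body else "")
```

For the insert-then-query flow in the question, calling _refresh on index_new between steps 2 and 4 would make the new document visible without reloading the page.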
Segment Merge Background details
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html
https://www.elastic.co/guide/en/elasticsearch/reference/5.4/indices-update-settings.html
I have a 3-node Solr data center. I am trying to redirect all queries to node1 using the Solr HTTP API, because I think I have problems with node2 and node3 and will replace them. I enabled the DataStax Solr slow query metric. I see two main problems.
Even though I set shard.shuffling.strategy=host,
the documentation says:
host
Shards are selected based on the host that received the query.
and I expect that when I request http://node1:8983/solr/..., the coordinator_ip and node_ip columns in the solr_slow_sub_query_log table will be the same. But when I fetch the records, only 80% are node1. Isn't that wrong? I expect 100% of requests to use node1.
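For reference, the strategy is passed as a query parameter on each request; a sketch of the URL I would expect to pin shard selection to the receiving host (host and keyspace/table names are hypothetical):

```python
from urllib.parse import urlencode

# Hypothetical DSE Search core named keyspace.table.
params = {"q": "*:*", "shard.shuffling.strategy": "host"}
url = ("http://node1:8983/solr/mykeyspace.mytable/select?"
       + urlencode(params))
print(url)
```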
When I fetch records from solr_slow_sub_query_log, I see that rows with coordinator_id=node1 and node_ip=node2 or node3 have very high elapsed_millis values, e.g. 1300 seconds, even though the documentation says netty_client_request_timeout is 60 seconds.
I'm having an issue with scoring: when I run the same query multiple times, the documents are not scored the same way each time. I found out that the problem is well known: it's the bouncing results issue.
A bit of context: I have multiple shards across multiple nodes (60 shards, 10 data nodes), all nodes are on ES 2.3, and we heavily use nested documents; the example query doesn't use them, for simplicity.
I tried to resolve it by using the preference search parameter with a custom value. The documentation states:
A custom value will be used to guarantee that the same shards will be used for the same custom value. This can help with "jumping values" when hitting different shards in different refresh states. A sample value can be something like the web session id, or the user name.
However, when I run this query multiple times:
GET myindex/_search?preference=asfd
{
  "query": {
    "term": {
      "has_account": {
        "value": "twitter"
      }
    }
  }
}
I end up with the same documents, but with different scoring/sorting. If I enable explain, I can see that those documents come from different shards.
If I use preference=_primary or preference=_replica, we have the expected behavior (always the same shard, always the same scoring/sorting) but I can't query only one or the other...
I also experimented with search_type=dfs_query_then_fetch, which should compute scores based on the whole index, across all shards, but I still get different scoring on each run of the query.
So in short, how do I ensure the score and the sorting of the results of a query stay the same during a user's session?
Looks like my replicas went out of sync with the primaries.
No idea why, but deleting the replicas and recreating them has "fixed" the problem... I'll need to investigate why they went out of sync.
Edit 21/10/2016
Regarding the "preference" option not being taken into account: it's linked to AWS zone awareness. If the preferred replica is in another zone than the client node, the preference is ignored.
The differences between the replicas are "normal" if you delete (or update) documents; from my understanding, the deleted-document count can vary between replicas, since they don't necessarily merge segments at the same time.