Does the _id of a document affect scoring? - elasticsearch

I add two identical documents; the only difference between them is the _id. (I restart the scenario for each of them rather than adding them sequentially, to be sure my test is correct.)
One of them changes the order of the results of this query and the other does not:
GET index_for_test/business/_search
{
"query": {
"multi_match": {
"query": "italian",
"type": "most_fields",
"fields": [ "name^2", "categories" ]
}
}
}
My original question was:
https://github.com/elastic/elasticsearch/issues/10341

As mentioned here: https://groups.google.com/forum/?fromgroups=&hl=en-GB#!topic/elasticsearch/VWqA_P4zzH8
the answer is in this blog post:
https://www.elastic.co/blog/understanding-query-then-fetch-vs-dfs-query-then-fetch
Documents are spread across 5 shards by default, and the default search type (query_then_fetch) scores documents within each shard using that shard's local term statistics and then fetches them. The _id determines which shard a document is routed to, so with a small dataset the per-shard statistics differ enough to make the scores inaccurate. If the index is small, it is better to run your queries with search_type=dfs_query_then_fetch, but that has scalability problems and should be changed once the data grows.
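To see why per-shard statistics matter on a small index, here is a minimal Python sketch. It assumes the classic Lucene-style formula idf = 1 + ln(N / (df + 1)) and made-up shard contents. Real scoring differs in detail between versions, so treat the numbers as illustrative only.

```python
import math


def idf(doc_count: int, doc_freq: int) -> float:
    """Classic Lucene-style inverse document frequency (assumed formula)."""
    return 1.0 + math.log(doc_count / (doc_freq + 1))


# Hypothetical tiny index: (total docs, docs containing "italian") per shard.
shards = {"shard_0": (3, 1), "shard_1": (4, 3)}

# query_then_fetch: each shard scores with its own local statistics.
local_idf = {name: idf(n, df) for name, (n, df) in shards.items()}

# dfs_query_then_fetch: statistics are summed across shards before scoring.
total_docs = sum(n for n, _ in shards.values())
total_df = sum(df for _, df in shards.values())
global_idf = idf(total_docs, total_df)

print(local_idf, global_idf)
```

With local statistics the same term gets a different weight on each shard, so which shard a document lands on (decided by its _id) can change its rank. With global statistics every shard uses the same weight.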

Related

Is query context evaluated before filter context in elasticsearch? How to determine the order of evaluation?

I am using the query below:
GET customer/doc/_search?routing=123
{
"query": {
"bool": {
"filter": [
{
"term": {
"location": "Delhi"
}
}
],
"should": [
{
"match_phrase_prefix": {
"phone": {
"query": "650",
"max_expansions": 100
}
}
}
]
}
}
}
The problem is my search on phone isn't working anymore. It used to work fine when I had less data, now every shard has data for multiple locations. Search on phone now requires me to type in 6 or 7 characters at times. (There may be matching phone numbers that have different location but are on this shard)
This is due to max_expansions I am guessing. When I increase it to 500 it does return me search results (not all), but the query becomes slow.
Isn't there a way to force es to apply filter first (and restrict the dataset) and then apply the should clause, so that I get the matching results even with small value of max_expansions?
Any help is appreciated.
It is due to max_expansions. Restricting the dataset first is not exactly what you want to do. (It is also not very straightforward: you may have to use a script, which will in turn slow down the query.)
When you query with a prefix or wildcard expression, Lucene expands it into the set of actual terms in the term dictionary of your inverted index. When you restrict the term expansion to 500, it might miss a few.
I would consider indexing prefixes during the indexing phase. Indexed prefixes avoid the costly expansion at query time.
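The effect of the expansion cap, and the index-time alternative, can be sketched in Python. This is a simplified model, not Lucene's implementation: `expand_prefix` and `edge_ngrams` are hypothetical helpers, and the term dictionary is made up.

```python
from bisect import bisect_left


def expand_prefix(sorted_terms, prefix, max_expansions):
    """Return at most max_expansions dictionary terms that start with prefix."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix) or len(matches) >= max_expansions:
            break
        matches.append(term)
    return matches


def edge_ngrams(value, min_len=1, max_len=10):
    """Prefixes of value that would be indexed as separate terms."""
    return [value[:n] for n in range(min_len, min(max_len, len(value)) + 1)]


# Hypothetical term dictionary for the phone field on one shard.
terms = sorted(["6500000001", "6500000002", "6501234567", "6509999999", "7001112222"])

# Query-time expansion: the cap cuts off terms that do match the prefix.
capped = expand_prefix(terms, "650", max_expansions=2)

# Index-time prefixes: "650" becomes an ordinary term, so no expansion is needed.
grams = edge_ngrams("6501234567", min_len=3, max_len=5)

print(capped, grams)
```

The term dictionary covers the whole shard regardless of the location filter, which is why terms from other locations can use up the expansion budget.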

Application-side Joins Elasticsearch

I have two indexes in Elasticsearch, a system index, and a telemetry index. I'd like to perform queries and aggregations on the telemetry index using filters from the systems index. The systems index is relatively small and only receives new documents occasionally, but the telemetry index is much larger and is constantly receiving new documents. This seems like an ideal situation for using an application-side join.
I tried emulating the example query at the previous link, but it turns out the filtered query is deprecated as of ES 5.0. (Why is this example in the current documentation?!)
Here are my queries:
GET /system/_search
{
"query": {
"match": {
"name": "George's system"
}
}
}
GET /telemetry/_search
{
"query": {
"bool":{
"must": {
"multi_match": {
"operator": "and",
"fields": ["systemId"]
, [1] }
}
}
}
}
}
The second one fails with a json_parse_exception because for some reason it doesn't like the [ ] characters after "fields".
Can anyone provide a simple example of using application-side joins?
Once such a query is defined (perhaps in Kibana's Dev Tools console) is there a way to visualize it in Kibana?
With Elasticsearch there is no way to execute two nested queries as in a relational database, where one query uses the response of the other. The application-side join example means that you actually make two queries (two separate requests to Elasticsearch) from the application side.
With the first query you get the list of ids you need to filter on.
With the second query you pass the list of ids that you got to the terms filter.
This works when you have no more than 1024 values for systemId, because the terms query has a limit on the number of terms.
Because this cannot be expressed as a single query, you can't visualize it in Kibana.
In that case you have to sacrifice a little space and add the systemId to your mapping.
Good Luck!
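The two steps described above can be sketched in Python without any client library. The helper names are hypothetical, the first response is a made-up stand-in for what GET /system/_search returns, and the sketch assumes that systemId in the telemetry index holds the _id of the system document.

```python
def extract_ids(search_response):
    """Collect document ids from a response shaped like an Elasticsearch search result."""
    return [hit["_id"] for hit in search_response["hits"]["hits"]]


def build_telemetry_query(system_ids):
    """Build the body of the second request: filter telemetry by the collected ids."""
    return {"query": {"bool": {"filter": {"terms": {"systemId": system_ids}}}}}


# Hypothetical response from the first request (GET /system/_search).
system_response = {"hits": {"hits": [{"_id": "1"}, {"_id": "7"}]}}

# Step 1: read the ids. Step 2: pass them to the terms filter of the second request.
telemetry_body = build_telemetry_query(extract_ids(system_response))

print(telemetry_body)
```

The application then sends telemetry_body to /telemetry/_search as a separate request.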

Elasticsearch filter vs term query for many ids

I have an index of documents connected with some product_id. And I would like to find all documents for specific ids (around 100 000 product_ids to be found and 100 million are in total in index).
Would the filter query be the fastest and best option in that case?
"query": {
"bool": {
"filter": {"terms": {"product_id": product_ids}}
}
}
Or is it better to split the ids into chunks and use just a terms query, or something else?
The question is probably kind of a duplicate, but I would be very grateful for the best practice advice (and a bit of reasoning).
After some testing and more reading I found an answer:
A filter query works much, much faster than chunks with a plain terms query.
But a really big filter can slow down getting the results a lot.
In my case, using a filter query with chunks of 10 000 ids is 10 times faster than using a filter query with all 100 000 ids at once (btw, this number is already restricted in Elasticsearch 6).
Also from official elasticsearch documentation:
Potentially the amount of ids specified in the terms filter can be a lot. In this scenario it makes sense to use the terms filter’s terms lookup mechanism.
The only disadvantage to be taken into account is that filter query is stored in cache. (The cache implements an LRU eviction policy: when a cache becomes full, the least recently used data is evicted to make way for new data.)
P.S. In all cases I always used scroll.
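A minimal Python sketch of the chunking approach, building one filter body per chunk of 10 000 ids. The helper names are hypothetical and sending the requests (with scroll, as above) is left out.

```python
def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def filter_bodies(product_ids, chunk_size=10_000):
    """Build one bool/filter/terms request body per chunk of ids."""
    return [
        {"query": {"bool": {"filter": {"terms": {"product_id": chunk}}}}}
        for chunk in chunked(product_ids, chunk_size)
    ]


# Hypothetical ids standing in for the 100 000 product_ids to look up.
product_ids = list(range(100_000))
bodies = filter_bodies(product_ids)

print(len(bodies))
```

The chunk size is worth tuning against your own cluster; 10 000 is only the value that worked best in the test described above.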
You can use the "paging" or "scrolling" feature of Elasticsearch queries for very large result sets.
Use a "from / size" query: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html
or a "scroll" query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
I think that "from / size" is the more efficient way to go unless you want to return thousands of results each time (which could be many, many MB of data, so you probably don't want that).
Edit:
You can run a query like this in batches:
GET my_index/_search
{
"query": {
"terms": {
"_id": [ "1", "2", "3", .... "10000" ] // tune for the best array length
}
}
}
If your document id is sequential, or some other numeric form that you can easily order by, and it is available as a field, you can use a range query:
GET _search
{
"query": {
"range" : {
"document_id_that_is_a_number" : {
"gte" : 0, // bump this on each query by "lte" step factor
"lte" : 10000 // find a good number here
}
}
}
}
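Stepping the range from one request to the next can be sketched in Python as follows. The helper names are hypothetical. Note one deliberate change from the query above: the sketch uses gte with lt instead of lte, so a document sitting exactly on a boundary is not returned by two consecutive windows.

```python
def range_windows(start, stop, step):
    """Yield (gte, lt) pairs that cover [start, stop) without overlapping."""
    for low in range(start, stop, step):
        yield low, min(low + step, stop)


def range_body(field, gte, lt):
    """Build a range query body for one window."""
    return {"query": {"range": {field: {"gte": gte, "lt": lt}}}}


# Hypothetical id space of 25 000 documents, read 10 000 at a time.
windows = list(range_windows(0, 25_000, 10_000))
bodies = [range_body("document_id_that_is_a_number", lo, hi) for lo, hi in windows]

print(windows)
```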

Querying large amounts of terms without expanding maxClauseCount

In a data flow of mine, I am trying to retrieve a subset of documents from a previous terms aggregation, but hitting the maxClauseCount limit within my ES cluster. The follow up query is along these lines:
GET dataset/_search
{
"size": 2000,
"query": {
"bool": {
"must": [
(a filter or two)...,
{
"terms":{
"otherid":[
"789e18f2-bacb-4e38-9800-bf8e4c65c206",
"8e6967aa-5b98-483e-b50f-c681c7396a6a",
...
]
}
}
]}
}
}
In my research I've come across a lookup - which sadly we can't use - as well as the ids query.
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-query.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-ids-query.html
From experimentation, it appears that the ids query doesn't share the limit the terms query has (potentially it's not converted into terms clauses). Do any of you know if there's a good way to achieve similar functionality to the ids query without using the ids field?
My version of ES is 5.0.
Thanks!
Instead of using a terms query, use the terms filter; it will solve the issue.
OR
Increase index.query.bool.max_clause_count to a higher value (not recommended):
http://george-stathis.com/2013/10/18/setting-the-booleanquery-maxclausecount-in-elasticsearch/

How does elasticsearch fetch AND operator query from its indexes

Suppose I have an AND/MUST operator query in Elasticsearch on two different indexed fields,
as follows:
"bool": {
"must": [
{
"match": {
"first": {
"query": "Will",
"minimum_should_match": "100%" // assuming this is q1
}
}
},
{
"match": {
"last": {
"query": "Smith",
"minimum_should_match": "100%" // assuming this is q2
}
}
}
]
}
Now I would like to know how Elasticsearch fetches the documents in the background.
Does it get all the ids of documents that match q1 and then iterate over them to find those that also match q2,
or
does it intersect the two sets, and if so, how?
How can I index my data to optimize AND queries on two separate fields?
First some basics: Elasticsearch uses Lucene behind the scenes. In Lucene a query returns a scorer, and that scorer is responsible for returning the list of documents matching the query.
Your boolean query will internally be translated to a Lucene BooleanQuery, which in this case will return a ConjunctionScorer, as it has only must clauses.
Each of the clauses is a TermQuery that returns a TermScorer which, when advanced, gives the next matching document in increasing order of document id.
The ConjunctionScorer computes the intersection of the matching documents returned by the scorers for each clause by simply advancing each scorer in turn.
So you can think of a TermScorer as returning an ordered list of documents, and of a ConjunctionScorer as simply intersecting two ordered lists.
There's not much you can do to optimize it. Maybe, since you're not really interested in scores, you could use a filter query instead and let Elasticsearch cache it.
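The intersection of two ordered lists can be sketched in Python as a plain two-pointer merge. This is a simplification for illustration, not Lucene's code: the real scorers can skip ahead to a target document id instead of stepping one entry at a time, and the posting lists below are made up.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of document ids in one pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Both clauses match this document: keep it and advance both lists.
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # The first list is behind: advance it.
            i += 1
        else:
            # The second list is behind: advance it.
            j += 1
    return out


# Hypothetical posting lists for q1 (first: "Will") and q2 (last: "Smith").
q1_docs = [1, 4, 7, 9, 12]
q2_docs = [2, 4, 9, 10, 12, 15]

both = intersect_sorted(q1_docs, q2_docs)
print(both)
```

Because both lists are already sorted by document id, neither has to be fully materialized or looked up by key; each is only walked forward once.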
