What does the total hits value mean without specifying track_total_hits=true? - elasticsearch

When I send a query to Elasticsearch, I get a response with a hits->total section like below:
"hits" : {
"total" : {
"value" : 10000,
"relation" : "eq"
},
...
I think the value inside total indicates the count of documents matching the query. However, the value is not accurate unless I specify track_total_hits=true in the query. The question is: what does the value mean if I don't specify track_total_hits?

This threshold comes from track_total_hits itself, which defaults to 10,000. (It is often attributed to index.max_result_window, which shares the same 10,000 default but is a separate setting that limits from/size pagination.)
For performance reasons, Elasticsearch stops counting once more than 10,000 documents match: it then reports "value" : 10000 with "relation" : "gte", meaning the true total is greater than or equal to that number, whereas "relation" : "eq" means the count is exact.
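A quick sketch of the difference, assuming a local cluster and a hypothetical index my-index with well over 10,000 matching documents:
curl -X GET 'http://localhost:9200/my-index/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query" : { "match_all" : {} }
}
'
# ==> "total" : { "value" : 10000, "relation" : "gte" }   (lower bound only)
curl -X GET 'http://localhost:9200/my-index/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "track_total_hits" : true,
  "query" : { "match_all" : {} }
}
'
# ==> "total" : { "value" : <exact count>, "relation" : "eq" }
track_total_hits also accepts an integer if you want to raise or lower the counting threshold rather than disable it.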

Related

Improving performance of search queries by using index names when working with an alias

I am using an alias name when writing data using the Bulk API.
I have 2 questions:
Can I get the index name after writing data using the alias name, maybe as part of the response?
Can I improve performance if I send search queries to specific indexes instead of searching all indexes of the same alias?
If you're using an alias name for writes, that alias can only resolve to a single write index, and that index's name is what you'll receive back in the bulk response.
For instance, if test_alias is an alias to the test index, then when sending this bulk command:
POST test_alias/_doc/_bulk
{"index":{}}
{"foo": "bar"}
You will receive this response:
{
  "index" : {
    "_index" : "test",    <---- here is the real index name
    "_type" : "_doc",
    "_id" : "WtcviYABdf6lG9Jldg0d",
    "_version" : 1,
    "result" : "created",
    "_shards" : {
      "total" : 2,
      "successful" : 2,
      "failed" : 0
    },
    "_seq_no" : 0,
    "_primary_term" : 1,
    "status" : 201
  }
}
Common sense has it that searching on a single index is always faster than searching on an alias spanning several indexes, but if the alias only spans a single index, then there's no difference.
You can provide multiple index names when searching. If an alias points to multiple indices, a search on the alias will by default query all of them, but you can restrict the search to just a few of the indices behind the alias, or filter based on fields in the underlying indices.
You can read the "Filter-based aliases to limit access to data" section in this blog on how to achieve that; since such a query touches fewer indices and less data, search performance will be better.
Also, an alias can have only a single write index, and you can get its name from the _cat/aliases?v API response, which shows which index is the write_index for the alias; you can see the sample output here.
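A small illustration, assuming a hypothetical alias test_alias backed by the indices test-000001 and test-000002:
# searches every index behind the alias
curl -X GET 'http://localhost:9200/test_alias/_search?pretty'
# searches only the one backing index you name (comma-separate to name several)
curl -X GET 'http://localhost:9200/test-000001/_search?pretty'
# lists the backing indices and flags which one is the write index
curl -X GET 'http://localhost:9200/_cat/aliases/test_alias?v'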

Elasticsearch sorted index not working as expected with multiple shards

I have an Elasticsearch index with a default index sort on price:
shop_prices_sort_index
"sort" : {
  "field" : "enrich.price",
  "order" : "desc"
},
If I insert 10 documents:
100, 98, 10230, 34, 1, 23, 777, 2323, 3, 109
And fetch the results using /_search. By default it returns the documents in descending order of price:
10230, 2323...
But if I distribute my documents across 3 shards, then the same query returns some other sequence of products:
100, 98, 34...
I am really stuck here. I am not sure if I am missing something basic or whether I need extra settings to make a sorted index behave correctly.
PS: I also tried routing and preference, but no luck.
Any help much appreciated.
When configuring index sorting, you're only making sure that each segment inside each shard is properly sorted; the goal of index sorting is to provide some extra optimization during searches.
Due to the distributed nature of ES, when your index has many shards, each shard will be properly sorted internally, but your search query still needs to specify the sort explicitly.
So if your index settings contain the following to apply sorting at indexing time:
"sort" : {
  "field" : "enrich.price",
  "order" : "desc"
}
then your search queries also need to carry the equivalent sort specification at query time (note that the query DSL syntax differs slightly from the index setting):
"sort" : [
  { "enrich.price" : { "order" : "desc" } }
]
By using index sorting you'll incur a little overhead at indexing time, but your search queries will be a bit faster in the end.
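Putting it together, a minimal sketch of such a query against the shop_prices_sort_index index from the question:
curl -X GET 'http://localhost:9200/shop_prices_sort_index/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query" : { "match_all" : {} },
  "sort" : [
    { "enrich.price" : { "order" : "desc" } }
  ]
}
'
With the explicit sort, the coordinating node merges the per-shard results in price order, so documents come back globally sorted no matter how many shards they are spread across.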

Getting child documents

I have an Elasticsearch index. Each document in that index has a number (e.g. 1, 2, 3, etc.) and an array named ChildDocumentIds, along with additional properties. Each item in this array is the _id of a document that is related to this document.
I have a saved search named "Child Documents". I would like to use the number (e.g. 1, 2, 3, etc.) to get the child documents associated with it.
Is there a way to do this in Elasticsearch? I can't seem to find a way to do a relational-type query in Elasticsearch for this purpose. I know it will be slow, but I'm OK with that.
The terms query allows you to do this. If document #1000 had child documents 3, 12, and 15, then the following two queries would return identical results:
"terms" : { "_id" : [3, 12, 15] }
and:
"terms" : {
"_id" : {
"index" : <parent_index>,
"type" : <parent_type>,
"id" : 1000,
"path" : "ChildDocumentIds"
}
}
The reason that it requires you to specify the index and type a second time is that the terms query supports cross-index lookups.
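Wrapped in a full search request, a sketch of the lookup might look like this (parent-documents and child-documents are hypothetical index names, and the type parameter is only needed on Elasticsearch versions that still use mapping types):
curl -X GET 'http://localhost:9200/child-documents/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query" : {
    "terms" : {
      "_id" : {
        "index" : "parent-documents",
        "id" : "1000",
        "path" : "ChildDocumentIds"
      }
    }
  }
}
'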

How to retrieve all the document ids from an Elasticsearch index

How can I retrieve all the document ids (the internal document '_id') from an Elasticsearch index? If I have 20 million documents in that index, what is the best way to do that?
I would just export the entire index and read it off the file system. My experience with size/from and scan/scroll has been a disaster when dealing with result sets in the millions; it just takes too long.
If you can use a tool like knapsack, you can export the index to the file system and iterate through the directories. Each document is stored under its own directory named after its _id, so there is no need to actually open files; just iterate through the directories.
link to knapsack:
https://github.com/jprante/elasticsearch-knapsack
edit: hopefully you are not doing this often... or this may not be a viable solution
For that number of documents, you probably want to use the scan and scroll API.
Many client libraries have ready-made helpers for the interface. For example, with elasticsearch-py you can do:
import elasticsearch
import elasticsearch.helpers

es = elasticsearch.Elasticsearch(eshost)
# skip the _source so each hit carries only its metadata, including _id
scroll = elasticsearch.helpers.scan(es, query={"_source": False}, index=idxname, scroll='10s')
for res in scroll:
    print(res['_id'])
First, you can issue a request to get the full count of records in the index:
curl -X GET 'http://localhost:9200/documents/document/_count?pretty=true'
{
  "count" : 1408,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  }
}
Then you'll want to loop through the set using a combination of the size and from parameters until you reach the total count. Passing an empty fields parameter will return only the index metadata and the _id that you're interested in.
Find a good page size that you can consume without running out of memory, and increment from on each iteration.
curl -X GET 'http://localhost:9200/documents/document/_search?fields=&size=1000&from=5000'
Example item response:
{
  "_index" : "documents",
  "_type" : "document",
  "_id" : "1341",
  "_score" : 1.0
},
...
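A rough sketch of that loop in shell, assuming the documents index above, a page size of 1000, and jq available to pull the ids out of each page. Note that from/size pagination is capped by index.max_result_window (10,000 by default), which is why scan/scroll is the safer choice at the 20-million scale:
total=$(curl -s 'http://localhost:9200/documents/document/_count' | jq '.count')
size=1000
for ((from=0; from<total; from+=size)); do
  # empty fields parameter: each hit carries only its metadata, including _id
  curl -s "http://localhost:9200/documents/document/_search?fields=&size=${size}&from=${from}" \
    | jq -r '.hits.hits[]._id'
done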

elasticsearch custom_score multiplication is inaccurate

I've inserted some documents which are all identical except for one floating-point field, called a.
When the script of a custom_score query is set to just _score, the resulting score for a particular query matching some fields is 0.40464813. When the script is changed to _score * a (mvel) for the same query, where a is 9.908349251612433, the final score becomes 4.0619955.
Now, if I run this calculation via Chrome's JS console, I get 4.009394996051871.
4.0619955 (elasticsearch)
4.009394996051871 (Chrome)
This is quite a difference and produces an incorrect ordering of results. Why could this be, and is there a way to correct it?
If I run a simple calculation using the numbers you provided, then I get the result that you expect.
curl -XPOST 'http://127.0.0.1:9200/test/test?pretty=1' -d '
{
"a" : 9.90834925161243
}
'
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1' -d '
{
"query" : {
"custom_score" : {
"script" : "0.40464813 *doc[\u0027a\u0027].value",
"query" : {
"match_all" : {}
}
}
}
}
'
# {
# "hits" : {
# "hits" : [
# {
# "_source" : {
# "a" : 9.90834925161243
# },
# "_score" : 4.009395,
# "_index" : "test",
# "_id" : "lPesz0j6RT-Xt76aATcFOw",
# "_type" : "test"
# }
# ],
# "max_score" : 4.009395,
# "total" : 1
# },
# "timed_out" : false,
# "_shards" : {
# "failed" : 0,
# "successful" : 5,
# "total" : 5
# },
# "took" : 1
# }
I think what you are running into here is testing with too little data across multiple shards.
Doc frequencies are calculated per shard by default. So if you have two identical docs on shard_1 and one doc on shard_2, then the docs on shard_1 will score lower than the doc on shard_2.
With more data, the document frequencies tend to even out across shards. But when testing small amounts of data, you either want to create an index with only one shard, or add search_type=dfs_query_then_fetch to the query string params.
This calculates global doc frequencies across all involved shards before calculating the scores.
If you set explain to true in your query, you can see exactly how your scores are being calculated.
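For example, reusing the test index from the snippet above, both suggestions can be combined into one request (a sketch, using the same legacy custom_score syntax as the question):
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch&explain=true' -d '
{
  "query" : {
    "custom_score" : {
      "script" : "_score * doc[\u0027a\u0027].value",
      "query" : {
        "match_all" : {}
      }
    }
  }
}
'
The explain output breaks each score into its term-frequency and norm components, which makes a per-shard discrepancy easy to spot.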
