Inconsistent result set - elasticsearch

I am trying to run a query in Elasticsearch, but time and again I get a different result set for the same query. My cluster has 3 shards and 2 replicas. My first guess was that this might be happening because of the shards involved, so I tried querying with dfs_query_then_fetch, but I still had the same issue. After a lot of searching I found out that the shard being searched changes between requests, so I used a preference in the query. Still, I observe the same issue. I am out of options now and unable to figure out what the problem is.
Pasting my query here
POST _search?search_type=dfs_query_then_fetch&preference=metiswayfinder
{
  "query": {
    "multi_match": {
      "query": "this is a test",
      "fields": [
        "subject^3",
        "message"
      ]
    }
  },
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    },
    {
      "incidentcount": {
        "order": "desc"
      }
    }
  ]
}
The relevance score is vital for me, and I also observe that the score keeps changing every time I run the search. Elasticsearch is giving an inconsistent result set and inconsistent scores. Is there any way I can neutralize these two problems without much change?
I am adding the segment query result as well.
Thanks
Ashit

Take a look at https://www.elastic.co/guide/en/elasticsearch/reference/master/consistent-scoring.html
According to the comments in your post, it looks like you already identified that deleting documents with replication can cause this issue - though this post helps explain why. In terms of what to do next, the article suggests the following:
The recommended way to work around this issue is to use a string that identifies the user that is logged in (a user id or session id for instance) as a preference. This ensures that all queries of a given user are always going to hit the same shards, so scores remain more consistent across queries.
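For example, here is a minimal sketch of that per-user preference (the user id in the value is hypothetical):
GET /_search?preference=user_12345
{
  "query": {
    "multi_match": {
      "query": "this is a test",
      "fields": ["subject^3", "message"]
    }
  }
}
Because the preference string stays the same for that user, every request is routed to the same shard copies, so the scores remain stable between searches.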
I hope that helps!

Related

Elasticsearch Join

I have two indices. One index, "indications", has some set of values.
The other is "projects". In this index, I add an indication value like "indication = oncology".
Now I want to show all indications, which I can do using a terms aggregation. But my issue is that I also want to show the count of projects in which each indication is used.
So for that, I would need to write a join query.
Can anyone help me resolve this issue?
Expected result example:
[{name:"onclogogy",projectCount:"12"}]
You cannot have joins in Elasticsearch. What you can do is store the indication name in the project index and then apply a terms aggregation on the project index. That will basically get you the different indications from all the project documents and the count of each indication.
Something of the sort:
GET /project/_search
{
  "size": 0,
  "aggs": {
    "indication": {
      "terms": {
        "field": "indication_name"
      }
    }
  }
}
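For this aggregation to work, each project document has to carry its indication name directly. A sketch of how such a document could be indexed (the names and values here are made up):
PUT /project/_doc/1
{
  "project_name": "Project Alpha",
  "indication_name": "oncology"
}
Each bucket in the aggregation result then carries an indication name plus its doc_count, which is exactly the projectCount you are after. Note that indication_name needs to be a keyword field for the terms aggregation to work.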
Elasticsearch does not support joins. That's the whole point of NoSQL: you keep the data as denormalised as you can and make the documents more and more self-sufficient.
There are some ways to add some sort of relationship between your data. This is a nice blog on it.

How to find what index a field belongs to in elasticsearch?

I am new to Elasticsearch. I have to write a query using a given field, but I don't know how to find the appropriate index. How would I find this information?
Edit:
Here's an easier/better way using the mapping API:
GET _mapping/field/<fieldname>
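The response is keyed by index name, so every index containing the field appears at the top level. Roughly what it returns on recent versions (index and field names here are placeholders):
{
  "some-index": {
    "mappings": {
      "<fieldname>": {
        "full_name": "<fieldname>",
        "mapping": { "<fieldname>": { "type": "keyword" } }
      }
    }
  }
}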
One of the ways to find it is to get the documents where the field exists.
Replace <fieldName> with your field's name. /_search will search across all indices and return any document that matches, i.e. has the field. Set _source to false, since you don't care about the document contents, only the index name.
GET /_search
{
  "_source": false,
  "query": {
    "exists": {
      "field": "<fieldName>"
    }
  }
}
Another, more visual way to do this is through the Kibana Index Management UI (assuming you have the privileges to access it).
There you can click on an index and open the mappings tab to get all fields of that particular index. Then just search for the desired field.
Summary:
@Polynomial Proton's answer is the way of choice 90% of the time. I just wanted to show you another way to solve your issue. It requires more manual steps than @Polynomial Proton's answer, and if you have a large number of indices this way is not appropriate.

Excluding users in a search, the most optimal way

I have two indexes: one for a collection of profiles, and another containing each user's excludes, e.g. blocked profiles.
The per-user exclude lists will be updated very often, while the profiles in comparison are seldom updated. In this situation it is recommended to separate the data into two indexes, as I understand it.
EDIT [2017-01-25]
This is the mappings for the two indexes:
PROFILES MAPPING
PUT xyz_profiles
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "profile": {
      "_all": { "enabled": false },
      "dynamic": "strict",
      "properties": {
        "profile_id": { "type": "integer" },
        "firstname": { "type": "text" },
        "age": { "type": "integer" },
        "gender": { "type": "keyword" },
        "height": { "type": "integer" },
        "area_id": { "type": "integer" },
        "created": {
          "type": "date",
          "format": "date_time_no_millis"
        },
        "location": { "type": "geo_point" }
      }
    }
  }
}
EXCLUDE LISTS MAPPING
PUT xyz_exclude_from_search
{
  "settings": {
    "auto_expand_replicas": "0-all"
  },
  "mappings": {
    "exclude_profile": {
      "_all": { "enabled": false },
      "dynamic": "strict",
      "properties": {
        "profile_id": { "type": "integer" },
        "exclude_ids": { "type": "integer" }
      }
    }
  }
}
number_of_shards is 1 since this is on a single node (my test server).
auto_expand_replicas set to 0-all makes sure that the exclude list is copied to all nodes. I am aware that this is superfluous on a single node, but I don't want to forget it when this is implemented on the production cluster.
exclude_ids will be an array of integers (profile ids) to exclude from the search.
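For reference, a sketch of what the exclude document for user 3076 could look like (the ids in the array are made up):
PUT /xyz_exclude_from_search/exclude_profile/3076
{
  "profile_id": 3076,
  "exclude_ids": [17, 42, 108]
}
The terms lookup in the search below reads the exclude_ids array straight from this document.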
This is the part of a search where certain profiles are excluded using the current user's (id 3076) exclude list:
GET /xyz_profiles/profile/_search
{
  "query": {
    "bool": {
      "must_not": {
        "terms": {
          "profile_id": {
            "index": "xyz_exclude_from_search",
            "type": "exclude_profile",
            "id": "3076",
            "path": "exclude_ids"
          }
        }
      }
    }
  }
}
Being very new to Elasticsearch, I would very much like to know if this is the optimal way of doing it. I imagine there are some very experienced people out there who can pinpoint whether my mappings or my search are missing something obvious that would improve performance.
For example, I haven't fully understood the analyzed/not_analyzed part of mappings, nor how to use routing in searches.
This is an interesting question. I think it's a quite common pattern, but at the moment there is not much information about it on the Internet.
I was in a similar situation some time ago and solved it in a similar way to the one you propose, except that I did not separate the data into two indexes: I just added an exclude_ids field to our user index. For example, when the user with id 1 is searching, we use a term query to check that id 1 is not inside the exclude_ids of the target users, a query like:
{ "term": { "exclude_ids": 1 } }
After using it with around two million documents I found out that:
Search is fast.
Taking into account how the inverted index works, I think this usage is correct.
Search is done inside the same index (having to search other indexes means checking more shards).
Updates are slow.
Each time an id is added to exclude_ids, the whole document is reindexed, since partial updates to a document are not possible. If the exclude_ids array gets very long, updates can become especially slow.
For the same reason, indexed data that is not usually updated, like name or age, is reindexed as well.
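To make the update cost concrete: appending a single id goes through the update API, but Elasticsearch still reindexes the whole document internally. A sketch, assuming a scripted update (the index, type and id here are placeholders, and the exact endpoint and script syntax vary by version):
POST /user_index/user/1/_update
{
  "script": {
    "source": "ctx._source.exclude_ids.add(params.new_id)",
    "params": { "new_id": 4242 }
  }
}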
In your case, since you keep the exclude list in another index, the data that is not usually updated does not have to be reindexed each time, as you said. But the problem of arrays that grow indefinitely is still there.
Plus, taking into account the way you would do the query (with a terms query using a lookup, I guess), filtering a big amount of data that way may add some overhead. But I'm not sure about this; it is discussed here.
It's difficult to decide which one would scale better with a huge amount of data; doing load tests could be a good idea.
A way to solve the expensive-updates problem could be to not store exclude_ids in Elasticsearch at all, but to keep only the active users' exclude lists in memory (using Redis or similar), setting a TTL on them. I suppose the original data is still stored in MySQL, so it can be read from there and put in memory whenever necessary (for example, when a user becomes active). But I think this is not recommended, since it seems that a terms query with many terms degrades performance a lot (explained in this issue).
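For completeness, if the exclude list were pulled from the in-memory store, the lookup would collapse to a plain terms query with the ids sent inline (the ids here are hypothetical):
GET /xyz_profiles/profile/_search
{
  "query": {
    "bool": {
      "must_not": {
        "terms": { "profile_id": [17, 42, 108] }
      }
    }
  }
}
This is exactly the query shape whose performance degrades with very long term lists, as mentioned above.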
There is already a similar question, but in my opinion there are many things that should be taken into account that are not discussed there. I would be happy to read more opinions about search and update performance with big amounts of data.

elasticsearch scoring on multiple indexes

I have an index for each quarter of a year ("index-2015.1", "index-2015.2"...).
I have around 30 million documents in each index.
A document has a text field ('title').
My document sorting method is (1) _score, (2) created date.
The problem is: when searching for some text in the 'title' field across all indexes ("index-201*"), the first results always come from one index.
Let's say I search for 'title=home' and I have 10k documents in "index-2015.1" with title=home and 10k documents in "index-2015.2" with title=home. The first results are then all documents from "index-2015.1" (and not from "index-2015.2", or mixed), even though "index-2015.2" contains documents with a higher "created date" than those in "index-2015.1".
Is there a reason for this?
The reason is probably that the scores are specific to the index. So if you really have multiple indices, the resulting scores of the documents will be calculated (slightly) differently for each index.
Simply put, among other things, the score of a matching document depends on the query terms and their occurrences in the index. The score is calculated in regard to the index (actually, by default, even to each separate shard). There are some normalizations Elasticsearch does, but I don't know the details of those.
I'm not really able to explain it well, but here's the article about scoring. I think you want to read at least the part about TF/IDF, which should explain why you get different scores.
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
EDIT:
So, after testing it a bit on my machine, it seems possible to use another search_type to achieve a score suitable for your case.
POST /index1,index2/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "match": {
      "title": "home"
    }
  }
}
The important part is search_type=dfs_query_then_fetch. If you are programming in Java or something similar, there should be a way to specify it in the request. For details about the search types, refer to the documentation.
Basically, it first collects the term frequencies from all affected shards (and indexes), so the scores should be comparable across all of them.
According to Andrei Stefan and Slomo, index boosting solved my problem:
body = {
  "indices_boost": { "index-2015.4": 1.4, "index-2015.3": 1.3, "index-2015.2": 1.2, "index-2015.1": 1.1 }
}
EDIT:
Using search_type=dfs_query_then_fetch (as Slomo described) will solve the problem in a better way (depending on your business model...).

How do I delete all X days old documents from the Java API?

EDIT:
To clarify, the question is "how do I write a query for documents that are X days old so I can delete them".
END OF EDIT
Our code indexes results from a query using Elasticsearch. We want to run a clean-up job once a day to delete all the old documents. We currently do so by calling an external script, but to cut down on dependencies we would love to do it from Java.
I can't figure out how to query for the old documents using the API... Clues, help?
If you delete documents which have been stored for a certain amount of time, you can set a TTL (time-to-live) parameter, which flags the documents as deleted once this time has elapsed. See here. I hope this is an alternative you could consider.
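As a sketch of what that looked like on the old Elasticsearch versions that supported TTL (it was deprecated in 2.0 and removed later), you enabled it per type in the mapping; the index and type names here are placeholders:
PUT /myindex/mytype/_mapping
{
  "mytype": {
    "_ttl": { "enabled": true, "default": "30d" }
  }
}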
UPDATE
"query":{
"match_all": {}
},
"filter":{
range:{
"field":{
lte: 20140225,
gte: 20140201
}
}
}
I assume you want to delete by query. This is the Java API:
http://www.elasticsearch.org/guide/en/elasticsearch/client/java-api/current/delete-by-query.html
for the corresponding REST endpoint:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-delete-by-query.html
For deleting date ranges, use a range query:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-range-query.html
e.g. with a range of lt: '2014-01-01'
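Putting it together, a sketch of the delete-by-query body with a date-math range on current Elasticsearch versions (the index and field names are assumptions; adjust them to your mapping):
POST /myindex/_delete_by_query
{
  "query": {
    "range": {
      "created": {
        "lt": "now-30d"
      }
    }
  }
}
"now-30d" is Elasticsearch date math for "30 days before now", so this removes every document whose created date is older than 30 days. The Java client exposes the same request as a DeleteByQueryRequest.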
