Elasticsearch performance difference between ‘must’ and ‘must_not’ - performance

I want to understand the performance difference between the must and must_not clauses.
I get different timings from the two.
Suppose I have 10 groups, and I want to make 5 groups accessible to a user while 5 are excluded.
So, I have two ways of using my query:
I can use a must clause inside a bool query: must: ['1', '2', '3', '4', '5'].
Or I can use a must_not clause inside a bool query: must_not: ['6', '7', '8', '9', '10'].
I have not provided many details here because I just want to know more about the performance difference between these two clauses.
I read about the bool query in the ES documentation, and it said that scoring is ignored in the must_not clause, although I have not yet understood how scoring is performed in the Lucene index.
But I am seeing timing differences: must_not takes longer than must, which made me curious enough to post about it.
Note: I am currently using Elasticsearch 2.4.4, and upgrading is not possible at the moment.
Can anyone please explain the difference or explain both of the clauses in detail?
Open to any kind of suggestions and answers.
Thanks in advance.

A must clause is potentially more efficient because it can use the inverted index directly.
Conceptually, the lookup works like this:
if searched_keyword in inverted_hash:
    return inverted_hash[searched_keyword]
must_not is more costly because the inverted index cannot directly list the documents that do not contain a term: the matching documents have to be found first and then excluded from the rest.
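The difference can be sketched with a toy inverted index in plain Python. This is only an illustration of the idea, not how Lucene is implemented: the index layout, the document ids, and the helper names `must_any` and `must_not_any` are all made up for the example, and it assumes every document belongs to exactly one group.

```python
# Toy inverted index: group id -> set of document ids (illustrative data only).
inverted_index = {str(g): {g * 10, g * 10 + 1} for g in range(1, 11)}
all_docs = set().union(*inverted_index.values())


def must_any(index, groups):
    """Documents in at least one allowed group: only the listed postings are read."""
    result = set()
    for group in groups:
        result |= index.get(group, set())
    return result


def must_not_any(index, all_doc_ids, groups):
    """Documents in none of the denied groups: needs the full document set as a base."""
    excluded = set()
    for group in groups:
        excluded |= index.get(group, set())
    return all_doc_ids - excluded


allowed = must_any(inverted_index, ["1", "2", "3", "4", "5"])
remaining = must_not_any(inverted_index, all_docs, ["6", "7", "8", "9", "10"])
```

Both calls select the same documents here, but the second one has to start from every document in the index before it can subtract anything, which is the extra work the answer refers to.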

Related

How to overcome maxClauseCount error when using multi_match query

I have 10+ indexes on my Elasticsearch server.
Each index has one or more fields with different kinds of analyzers: keyword, standard, ngram, etc.
For global search I am using multi_match without specifying any explicit fields.
For querying I am using the elasticsearch-dsl library; the code is below:
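from elasticsearch_dsl import Search

def search_for_index(indice, term, num_of_result=10):
    # Sort by relevance and keep only the top num_of_result hits.
    s = Search(index=indice).sort({"_score": "desc"})
    s = s[:num_of_result]
    # No explicit fields are passed to multi_match here.
    s = s.query('multi_match', query=term, operator='and')
    response = s.execute()
    return response.to_dict()['hits']['hits']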
I get very good results and search works just fine, but when someone enters a slightly longer text, I get a maxClauseCount error.
For example, the search raises an error when term is equal to:
term=We are working on your request and will keep you posted at the earliest.
Any other slightly longer text raises the same error.
Can you help me figure out a better approach for my global search so that I can avoid this kind of error?
First of all, this limitation is there for a reason. The more boolean clauses you have, the heavier the search is. Think of it as intersecting (AND) or uniting (OR) the sets of document ids matched by each clause. This is a very heavy operation, which is why the default limit is 1024 clauses.
The general recommendation is to reduce the number of fields you're searching. Maybe you have fields that contain no text data or only hold internal ids. You can exclude them from the multi_match query by specifying the fields section explicitly.
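As a minimal sketch of that recommendation, the request body below lists the searched fields explicitly. The helper name `build_multi_match` and the field names `title` and `body` are hypothetical; substitute the text fields that actually exist in your mappings.

```python
def build_multi_match(term, fields, operator="and"):
    """Build a multi_match request body restricted to an explicit field list."""
    if not fields:
        raise ValueError("pass at least one field to keep the clause count bounded")
    return {
        "query": {
            "multi_match": {
                "query": term,
                "fields": list(fields),  # only these fields are searched
                "operator": operator,
            }
        }
    }


# "title" and "body" are placeholder field names for this example.
body = build_multi_match(
    "We are working on your request and will keep you posted at the earliest.",
    ["title", "body"],
)
```

With elasticsearch-dsl, the same restriction should be expressible by passing `fields=[...]` alongside `query=term` in the existing `s.query('multi_match', ...)` call.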
If you still decide to go with the current approach and you're using Elasticsearch 5.5 or higher, you can raise the limit by adding the following line to elasticsearch.yml and restarting your instance:
indices.query.bool.max_clause_count: 250000
If you're using a pre-5 version of Elasticsearch, the setting is called index.query.bool.max_clause_count.

Kibana 4 - Why does my simple query return correct results when using .raw but not without?

I'm trying out Elasticsearch/Kibana 4. My simple query:
program.raw:"MYAPPLICATION" AND entityId.raw:"12345-67N"
returns exactly the results I want, i.e. entries whose program and entityId fields contain the queried terms.
However, I guess the right way to query this would be:
program:"MYAPPLICATION" AND entityId:"12345-67N"
but that gives me correct results only for the program part, plus a lot of hits on terms containing N or n. The entityId part seems to query only on N. I'm confused; please explain this. I've read up on the Lucene query syntax and can't find anything explaining it.
The .raw fields are set up by Logstash as "not_analyzed" fields in Elasticsearch. As such, they are not split into tokens and can be matched intact.
To Elasticsearch, the analyzed entityId field really looks like ['12345', '67n'], which is why your query doesn't match.
Note that, in your example, program:myapplication should work (since there are no special characters). Lowercasing is automatic, IIRC.
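To see why the two kinds of field behave differently, here is a rough approximation in Python. It is not the real Elasticsearch analyzer: `rough_analyze` is a made-up helper that simply lowercases and splits on anything that is not a letter or digit, which is enough to reproduce the tokens mentioned above.

```python
import re


def rough_analyze(value):
    """Approximate an analyzed field: lowercase, then split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", value.lower())


def raw(value):
    """Approximate a not_analyzed field: the whole value is kept as a single term."""
    return [value]


analyzed_entity = rough_analyze("12345-67N")
analyzed_program = rough_analyze("MYAPPLICATION")
raw_entity = raw("12345-67N")
```

Under this approximation the analyzed entityId holds two separate terms, while the .raw variant holds the single original string, so only the .raw field can be matched against "12345-67N" as one exact value.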

Per user behavior based scoring in Elasticsearch

We understand a user's behavior by analyzing the tags he usually searches for.
Now we need to give higher precedence to those tags for that user. I would like to know how we can achieve this using Elasticsearch in an elegant manner.
Well, the best approach for this would be to:
Analyse the behavior of the user.
See which keywords interest him.
Maintain one document per user, in another index, that holds all these keywords.
On searches for that user, boost the occurrence of these keywords using a function_score query.
You can use a terms filter inside a boost function to achieve this. Add the boost function under functions in the function_score query.
In the terms filter, you can point to this user's document and get the values dynamically.
Use a custom filter cache key so that the constructed cache key won't eat too much memory.
With this approach, you can avoid lots of code paths in client code.
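The steps above could be sketched as a request body like the one below. Everything named here is an assumption for illustration: the `user_profiles` index, the `keywords` path inside the user document, the `tags` field on the searched documents, and the weight of 2.0. Older Elasticsearch versions also expect a `type` in the terms lookup, so the helper accepts it optionally; the custom cache key mentioned above is left out because its spelling depends on the version.

```python
def build_personalized_query(user_query, user_id, profile_index="user_profiles",
                             path="keywords", field="tags", weight=2.0,
                             doc_type=None):
    """Wrap user_query in a function_score that boosts the user's preferred tags."""
    # Terms lookup: the values are read from the user's profile document.
    lookup = {"index": profile_index, "id": user_id, "path": path}
    if doc_type is not None:
        lookup["type"] = doc_type  # needed on older versions that still use types
    return {
        "query": {
            "function_score": {
                "query": user_query,
                "functions": [
                    # Documents whose tags appear in the profile get the extra weight.
                    {"filter": {"terms": {field: lookup}}, "weight": weight}
                ],
            }
        }
    }


personalized = build_personalized_query({"match": {"title": "laptop"}}, "42")
```

Because the keywords are resolved from the profile document at query time, the client only has to pass the user id and never needs to ship the keyword list itself.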

Sort by multiple fields in specific order in Solr

So I want to sort my Solr response by the following fields:
published_year (desc)
series_number (asc)
status_color
The problem is that status_color must be sorted in the following order (i.e. not alphabetically):
"Green"
"Yellow"
"Red"
This field may only contain one of these values.
I'm hoping there's a way of doing this in the Solr query instead of massaging the result in code. With a result of hundreds of thousands of documents, that's not really an option.
Any help is appreciated.
I think the answer to this question will be valid for you too:
Is it possible in solr to specify an ordering of documents
I believe Solr has Enum types, though I have not seen them used in a while. They would be a perfect match, so they are worth a try.
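To pin down the ordering that an enum-style field would have to reproduce, here is a plain-Python sketch. It is not Solr code and not a suggestion to sort client-side; the sample documents are invented, and `STATUS_RANK` simply encodes the declared order Green, Yellow, Red as numbers.

```python
# Declared order of the status values; lower rank sorts first.
STATUS_RANK = {"Green": 0, "Yellow": 1, "Red": 2}


def sort_key(doc):
    """published_year descending, series_number ascending, then status rank."""
    return (-doc["published_year"], doc["series_number"], STATUS_RANK[doc["status_color"]])


# Invented sample documents, only for illustrating the ordering.
docs = [
    {"id": "a", "published_year": 2019, "series_number": 1, "status_color": "Red"},
    {"id": "b", "published_year": 2020, "series_number": 2, "status_color": "Green"},
    {"id": "c", "published_year": 2020, "series_number": 1, "status_color": "Yellow"},
    {"id": "d", "published_year": 2020, "series_number": 1, "status_color": "Green"},
]
ordered_ids = [doc["id"] for doc in sorted(docs, key=sort_key)]
```

An enum field type that sorts by its declared value order rather than alphabetically would give the same result for the third sort criterion on the server side.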

ElasticSearch / Tire & Keywords. Right way to match "or" for a keyword list?

I've got an Entity model (in Mongoid) that I'm trying to search on its keywords field which is an array. I want to do a query where I pass in an array of potential search terms, and any entity that matches any of the terms will pass.
I don't have this working well yet.
But the reason I'm asking is that it's more complex: I also don't want to return any entities that have been marked as "do not return", which I do via an "ignore_project_ids" parameter.
So, when I query, I get 0 results. I was using Bonsai.io, but I've moved this to my own EC2 instance to reduce the complexity/variables in solving the problem.
So, what am I doing wrong? Here are the relevant bits of code.
https://gist.github.com/3405763
You want a terms query rather than a term query: a term query only matches a single exact value, whereas a terms query matches when the field contains any of the specified values.
Given that you don't seem to care about the query score (you're sorting by another attribute), you'll get faster queries by using a filtered query and expressing your conditions as filters.
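A minimal sketch of such a request body, written as a plain Python dict rather than Tire code, might look like this. The field name `project_id` is a guess at where the ignored ids live, and `build_keyword_search` is a made-up helper. Note that the filtered query is the older syntax this answer refers to; it was removed in Elasticsearch 5, where a bool query with a filter clause takes its place.

```python
def build_keyword_search(terms, ignore_project_ids,
                         keywords_field="keywords", project_field="project_id"):
    """Match any of the keywords, excluding documents from ignored projects."""
    bool_filter = {
        # terms filter: the field must contain at least one of the given values
        "must": [{"terms": {keywords_field: list(terms)}}],
    }
    if ignore_project_ids:
        # exclusion is also a filter, so no scoring work is done for it
        bool_filter["must_not"] = [{"terms": {project_field: list(ignore_project_ids)}}]
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"bool": bool_filter},
            }
        }
    }


search_body = build_keyword_search(["ruby", "rails"], ["p1", "p2"])
```

Leaving the exclusion list empty drops the must_not part entirely, so the same helper covers searches with and without ignored projects.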
