Elasticsearch hits.total different with OR - elasticsearch

When I use the following search (/posts/_search) my hits.total is 1400:
{"query": {"query_string": {"query": "Bitcoin"}}}
When I use the following search (/posts/_search) my hits.total is 500:
{"query": {"query_string": {"query": "Ethereum"}}}
When I use an OR in my search, the hits.total is 1400, where I expected it to be 1900.
{"query": {"query_string": {"query": "(Ethereum) OR (Bitcoin)"}}}
Why is my hits.total number different when I am using an "OR"? I am using the hits.total as a counter to display and the number should be the same, right?
I am pretty new with ElasticSearch and hopefully, someone could point me in the right direction. Thanks!

Most likely there are some documents where **_all contains both terms**, i.e. both Bitcoin and Ethereum. Those documents are returned by each query when you run them independently, but when you run the OR query each of them is counted only once. With your numbers, 1400 + 500 - overlap = 1400, so the overlap is 500: every document that matches Ethereum also matches Bitcoin.
Maybe this Venn diagram example explains it better:
A U B = (7+2+5) + (8+1+2+5) - (2+5) = 23
A + B = (7+2+5) + (8+1+2+5) = 30
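The arithmetic above is the inclusion-exclusion rule. A minimal Python sketch of it follows; the function names and the document ids are made up for illustration and are not part of any Elasticsearch API.

```python
def union_count(count_a: int, count_b: int, count_both: int) -> int:
    """Inclusion-exclusion: number of documents matching A OR B."""
    return count_a + count_b - count_both


def overlap_count(count_a: int, count_b: int, count_either: int) -> int:
    """The same rule rearranged: number of documents matching A AND B."""
    return count_a + count_b - count_either


# Numbers from the Venn example above.
venn_union = union_count(7 + 2 + 5, 8 + 1 + 2 + 5, 2 + 5)

# Numbers from the question: Bitcoin = 1400, Ethereum = 500, OR = 1400.
question_overlap = overlap_count(1400, 500, 1400)

# The same idea with hypothetical document ids: a set union keeps each id once.
bitcoin_docs = {1, 2, 3, 4}
ethereum_docs = {3, 4}
either_docs = bitcoin_docs | ethereum_docs

print(venn_union)
print(question_overlap)
print(len(either_docs))
```

If the overlap computed this way equals the smaller of the two counts, as it does for the numbers in the question, every document matching the rarer term also matches the other one.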
If you are sure that the field you care about can never hold multiple values, try adding "default_field" to the query and compare the results. When you don't pass "default_field", it defaults to the index.query.default_field index setting, which in turn defaults to _all.
{
  "query": {
    "query_string": {
      "default_field": "CRYPTOCURRENCY_TYPE",
      "query": "as"
    }
  }
}
More details can be found here: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/query-dsl-query-string-query.html

Related

Match multi tokens with proximity search between them

I have a large corpus of texts (100k) and n-gram queries, for example:
query - get all texts with the tokens ['united', 'airlines']
I would like to retrieve only texts that fully match both tokens ('united', 'airlines'),
but I also want the distance between the tokens (united -> airlines, or airlines -> united) to be at most K positions, let's say K=2.
My query now is:
query = {
  "size": limit,
  "query": {
    "query_string": {
      "query": query,
      "phrase_slop": 2,
      "default_operator": "AND"
    }
  }
}
But it seems that it is not the right method because I am getting results with more than 2 positions (tokens) between them.
Any idea?
I have found the answer to my question:
When using query_string queries in Elasticsearch, we can do a proximity search by appending ~k to a quoted phrase, where k is the maximum edit distance (in word positions) between the words in the phrase.
For the query in the main question, adding proximity search (the phrase quotes are escaped so that ~2 stays inside the query string):
query = {
  "size": limit,
  "query": {
    "query_string": {
      "query": "\"united airlines\"~2",
      "phrase_slop": 2,
      "default_operator": "AND"
    }
  }
}
More information can be found in the documentation
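One way to avoid quoting mistakes is to build the request body in code and let the JSON encoder do the escaping. This is only a sketch: proximity_query is a hypothetical helper, and it assumes the phrase itself contains no double quotes.

```python
import json


def proximity_query(phrase: str, slop: int, limit: int = 10) -> dict:
    """Build a query_string body matching `phrase` with at most `slop` positions of slack."""
    return {
        "size": limit,
        "query": {
            "query_string": {
                # The phrase is wrapped in double quotes and ~slop is appended,
                # all inside one string value.
                "query": f'"{phrase}"~{slop}',
                "default_operator": "AND",
            }
        },
    }


body = proximity_query("united airlines", 2)

# json.dumps escapes the inner quotes, producing valid JSON for the request body.
print(json.dumps(body, indent=2))
```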

fastest way to tell if a term exists in the index or not

What is the fastest query that can tell whether a term exists in the index or not? I am not looking for scoring or anything, just a quick true/false response from Elasticsearch telling me whether it has a document that contains this term.
You can use the _count API.
Example:
GET /twitter/_count?q=user:kimchy
More information:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-count.html
You can also run a search with the size set to 0:
GET /twitter/user/_search
{
  "size": 0,
  "query": {
    "match": {
      "username": "xyz"
    }
  }
}
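To turn the count into the true/false answer the question asks for, the client only needs to check whether the returned count is greater than zero. The sketch below is illustrative: build_count_request and term_exists are hypothetical helpers, no request is actually sent, and the sample responses are hand-written rather than captured from a cluster.

```python
def build_count_request(index: str, field: str, value: str) -> dict:
    """Describe a _count request; nothing is sent over the network here."""
    return {
        "method": "GET",
        "path": f"/{index}/_count",
        "body": {"query": {"match": {field: value}}},
    }


def term_exists(count_response: dict) -> bool:
    """True if the _count response reports at least one matching document."""
    return count_response.get("count", 0) > 0


request = build_count_request("twitter", "user", "kimchy")

# Hand-written sample responses, not output from a real cluster.
found = term_exists({"count": 3})
missing = term_exists({"count": 0})

print(request["path"], found, missing)
```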

Fuzzy Matching Fails But Exact Match Passes

I've been constructing an ElasticSearch query using Fuzzy Matching to match a user in the system. When running it against a specific group of users (ones with my name), the query appears to work perfectly, but when running it against a random selection of users, it appears to fail.
For the purposes of my testing, I'm passing in the exact values of a specific user, so I would expect at least 1 match.
In narrowing this down, I found that an exact match against a name returns the data as expected, but putting the same value into a fuzzy block causes it to return 0 results.
For instance, this query returns a user record as expected:
{
  "from": 0,
  "size": 1,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "firstName": {
              "query": "sVxGBCkPYZ",
              "boost": 30
            }
          }
        }
      ],
      "should": []
    }
  },
  "fields": [
    "id",
    "firstName"
  ]
}
However replacing the match element with the below fails to return any records:
{
  "fuzzy": {
    "firstName": {
      "value": "sVxGBCkPYZ",
      "fuzziness": 2,
      "boost": 30,
      "min_similarity": 0.3
    }
  }
}
Why would this be happening, and is there anything I can do to remedy the situation?
For reference, this is the ES version I'm currently using:
"version": {
  "number": "1.7.1",
  "build_hash": "b88f43fc40b0bcd7f173a1f9ee2e97816de80b19",
  "build_timestamp": "2015-07-29T09:54:16Z",
  "build_snapshot": false,
  "lucene_version": "4.10.4"
}
The fuzzy query fails because fuzzy searches are term-level queries, meaning the query string is not analysed, while the indexed data (which I assume is of type text with the standard analyzer) is converted to svxgbckpyz in the inverted index.
Instead, you can implement fuzziness with a match query as below:
POST testindex/_search
{
  "query": {
    "match": {
      "firstName": {
        "query": "sVxGBCkPYZ",
        "fuzziness": "AUTO"
      }
    }
  }
}
You can change the value from AUTO to 0, 1 or 2 depending on your use case; 2 is the maximum edit distance that fuzziness accepts.
The exact match you mentioned works because the match query analyses the query string and converts the input to lower case, which is the form stored in the inverted index.
As for how the fuzzy query (that you've mentioned) works behind the scenes, as per this LINK, it is as follows:
The fuzzy query works by taking the original term and building a
Levenshtein automaton—like a big graph representing all the strings
that are within the specified edit distance of the original string.
The fuzzy query then uses the automaton to step efficiently through
all of the terms in the term dictionary to see if they match. Once it
has collected all of the matching terms that exist in the term
dictionary, it can compute the list of matching documents.
Of course, depending on the type of data stored in the index, a fuzzy
query with an edit distance of 2 can match a very large number of
terms and perform very badly.
Note this statement in particular: representing all the strings that are within the specified edit distance of the original string.
For example, some of the words within a distance of 1 of life would be aife, bife, cife, dife ... lifz.
So in your case, the fuzzy query's automaton cannot produce the term svxgbckpyz from the input string sVxGBCkPYZ, because the distance between them is 7 (remember that the distance between A and a is 1). That is beyond what the AUTO option allows, and fuzziness cannot be configured as high as 7 in any case, since the maximum edit distance is 2.
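The distance of 7 can be checked with a small edit-distance function. This is a plain Levenshtein implementation written for illustration, not the automaton Lucene builds internally, and it ignores transpositions.

```python
def levenshtein(a: str, b: str) -> int:
    """Case-sensitive edit distance using insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


raw_input = "sVxGBCkPYZ"      # what the fuzzy query receives, unanalysed
indexed_term = "svxgbckpyz"   # what the standard analyzer would have indexed

print(levenshtein(raw_input, indexed_term))          # seven upper-case letters differ
print(levenshtein(raw_input.lower(), indexed_term))  # identical once lower-cased
print(levenshtein("life", "lifz"))
```

Lower-casing the input before sending a term-level query, or using a match query that analyses it, brings the distance back to 0.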
Adding one more LINK for more info. Hope it helps!

How is Elastic Search sorting when no sort option specified and no search query specified

I wonder how Elasticsearch sorts (on what field) when no search query is specified (I just filter on documents) and no sort option is specified. It looks like the sorting is then random ... The default sort order is _score, but the score is always 1 when you do not specify a search query ...
You got it right. It's then more or less random, with the score being 1. You still get consistent results as far as I remember. It is the "same" as when you get results in SQL but don't specify ORDER BY.
Just in case someone sees this post even though it was posted over 6 years ago:
When you want to know how Elasticsearch calculates its score, known as _score, you can use the explain option.
I suppose that your query (with a filter and without a search) looks more or less like this (the point is setting the explain option to true):
POST /goods/_search
{
  "explain": true,
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": {
        "term": {
          "maker_name": "nike"
        }
      }
    }
  }
}
When you run this, you will notice that the _explanation of each hit reads as below:
"_explanation" : {
  "value" : 1.0,
  "description" : "ConstantScore(maker_name:nike)",
  "details" : [ ]
}
which means ES gave a constant score to all of the hits.
So to answer the question: yes.
The results are sorted more or less randomly because, without a search query, all the filtered results have the same (constant) score.
By the way, enabling the explain option is more helpful when you use search queries. You will see how ES calculates the score and understand why it returns results in that order.
The score is mainly used for sorting. It is calculated by Lucene's scoring, which takes several factors into account. For more info refer here.

Constant Score Query elasticsearch boosting

My understanding of Constant Score Query in elasticsearch is that boost factor would be assigned as score for every matching query. The documentation says:
A query that wraps a filter or another query and simply returns a constant score equal to the query boost for every document in the filter.
However when I send this query:
"query": {
  "constant_score": {
    "filter": {
      "term": {
        "source": "BBC"
      }
    },
    "boost": 3
  }
},
"fields": ["title", "source"]
all the matching documents are given a score of 1?! I cannot figure out what I am doing wrong, and had also tried with query instead of filter in constant_score.
Scores are only meant to be relative to all other scores in a given result set, so a result set where everything has the score of 3 is the same as a result set where everything has the score of 1.
Really, the only purpose of the relevance _score is to sort the results of the current query in the correct order. You should not try to compare the relevance scores from different queries. - Elasticsearch Guide
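The point that only relative scores matter can be illustrated outside Elasticsearch. In this sketch the scores are invented and ties are broken by document id, which is an assumption made only to keep the example deterministic; it is not a claim about how Elasticsearch breaks ties.

```python
def ranking(scores: dict) -> list:
    """Order document ids by descending score, then by id to break ties."""
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Invented relevance scores for three hypothetical documents.
scores = {"doc-a": 0.4, "doc-b": 1.7, "doc-c": 0.9}

# Multiplying every score by the same boost leaves the order unchanged.
boosted = {doc_id: score * 3 for doc_id, score in scores.items()}

# A result set where everything scores 1 ranks the same as one where everything scores 3.
all_ones = {doc_id: 1.0 for doc_id in scores}
all_threes = {doc_id: 3.0 for doc_id in scores}

print(ranking(scores) == ranking(boosted))
print(ranking(all_ones) == ranking(all_threes))
```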
Either the constant score is being ignored because it's not being combined with another query, or it's being normalized. As @keety said, check the output of explain to see exactly what's going on.
A constant score query gives an equal score to every matching document, irrespective of scoring factors like TF, IDF etc. It can be used when you don't care how well a doc matched, only whether it matched, and still want to assign a score, unlike a filter.
If you literally want a score of 3 for all the documents matching a particular query, then you should use a function score query, something like
"query": {
  "function_score": {
    "functions": [
      {
        "filter": { "term": { "source": "BBC" } },
        "weight": 3
      }
    ]
  }
  ...
}
