elasticsearch more_like_this query is taking long time to run - elasticsearch

I have the below more_like_this query to elasticsearch.
I run this in a loop for 15 times with different art_title and art_tags each time. For some articles the time it takes is very less but for some articles in the loop it takes too long to execute. Is there anything which I can do to optimize this query. Any help is appreciated.
bodyquery={
"query":
{"bool":
{"should":
[
{"more_like_this":
{
"like_text": art_title,
"fields": ["title"],
"max_query_terms": 30,
"boost": 5,
"min_term_freq": 1
}
},
{"more_like_this":
{
"like_text": art_tags,
"fields": ["tags"],
"max_query_terms": 30,
"boost": 5,
"min_term_freq": 1
}
}
]
}
}
}

I believe you might have solved this already by now but depending on the content of your indexed docs and the analyzers applied to the fields you are looking at, this can take a wide range of time to complete. Think how similarity works and how it will be calculated for your documents and you probably will find the answer. Also, you can use the explain param to get a Lucene detailed step-by-step response to the question
, but just in case I want to add: it is virtually impossible to determine anything without more details:
What your mappings look like
How are those fields analyzed
What version of ES are you using
Your ES setup
Also, describe in english what are you trying to retrieve: "I want documents in the catalog index that have a title similar to art_title and/or a tag similar to art_tag".
There is reference to the syntax in HERE if you are using the latest version of ES
Cheers

Related

Elasticsearch date based function scoring boosting the wrong way

I would like to boost scores of documents based on how "recent" a document is. I am trying to do this using a function_score. Here is an example of me doing this on a field called updated_at:
{
"function_score": {
"boost_mode": "sum",
"functions": [
{
"exp": {
"updated_at": {
"origin": "now",
"scale": "1h",
"decay": 0.01,
},
},
"weight": 1,
}
],
"query": query
},
}
I would expect documents close to the datetime now will have a score closer to 1, and documents closer to scale will have a score closer to decay (as described in the docs). Therefore, I'm using the boost_mode sum, to keep the original document scores, and increase depending on how close to now the updated_at value is. (Also, the query score is useful so I would rather add than multiply, which is the default).
To test this scenario, I create a document (A) that returns a query score of about 2. I then duplicate it (B) and modify the new document's updated_at timestamp to be an hour in the past.
In this scenario, I would expect (A) to have a higher score and (B) to have a lower score. However, when I run this scenario, I get the exact opposite. (B) ends up with a score of 3 and (A) ends up with a score of 2.
What am I misunderstanding here to cause this to happen? And how would I modify my function score to do what I would like?
This turned out to be a a timezone issue.
I ended up using the explain API to look at what was contributing to the score. When doing that, I noticed that the origin set to now was actually in a different timezone to the one I was setting in the documents.
I fixed this by manually providing a UTC timestamp in the elasticsearch query rather than using now as the value.
(If there is a better way to do this, please let me know)

Search After (pagination) in Elasticsearch when sorting by score

Search after in elasticsearch must match its sorting parameters in count and order. So I was wondering how to get the score from previous result (example page 1) to use it as a search after for next page.
I faced an issue when using the score of the last document in previous search. The score was 1.0, and since all documents has 1.0 score, the result for next page turned out to be null (empty).
That's actually make sense, since I am asking elasticsearch for results that has lower rank (score) than 1.0 which are zero, so which score do I use to get the next page.
Note:
I am sorting by score then by TieBreakerID, so one possible solution is using high value (say 1000) for score.
What you're doing sounds like it should work, as explained by an Elastic team member. It works for me (in ES 7.7) even with tied scores when using the document ID (copied into another indexed field) as a tiebreaker. It's true that indexing additional documents while paginating will make your scores slightly unstable, but not likely enough to cause a significant problem for an end user. If you need it to be reliable for a batch job, the Scroll API is the better choice.
{
"query": {
...
},
"search_after": [
12.276552,
14173
],
"sort": [
{ "_score": "desc" },
{ "id": "asc" }
]
}

Fuzzy Matching Fails But Exact Match Passes

I've been constructing an ElasticSearch query using Fuzzy Matching to match a user in the system. When running it against a specific group of users (ones with my name), the query appears to work perfectly, but when running it against a random selection of users, it appears to fail.
For the purposes of my testing, I'm passing in the exact values of a specific user, so I would expect at least 1 match.
In narrowing this down, I found that an exact match against a name returns the data as expected, but putting the same value into a fuzzy block causes it to return 0 results.
For Instance, this query returns a user record as expected:
{
"from": 0,
"size": 1,
"query": {
"bool": {
"must": [
{
"match": {
"firstName": {
"query": "sVxGBCkPYZ",
"boost": 30
}
}
}
],
"should": [
]
}
},
"fields": [
"id",
"firstName"
]
}
However replacing the match element with the below fails to return any records:
{
"fuzzy": {
"firstName": {
"value": "sVxGBCkPYZ",
"fuzziness": 2,
"boost": 30,
"min_similarity": 0.3
}
}
}
Why would this be happening, and is there anything I can do to remedy the situation?
For reference. This is the ES version i'm currently using:
"version": {
"number": "1.7.1",
"build_hash": "b88f43fc40b0bcd7f173a1f9ee2e97816de80b19",
"build_timestamp": "2015-07-29T09:54:16Z",
"build_snapshot": false,
"lucene_version": "4.10.4"
}
The match fails because fuzzy searches are term level queries meaning the query string would not be analysed while the data that got indexed, I assume, if of type text with standard analyzer, would be converted to svxgbckpyz in the inverted index.
You can instead, implement fuzziness with match query as below:
POST testindex/_search
{
"query":{
"match":{
"firstname":{
"query":"sVxGBCkPYZ",
"fuzziness":"AUTO"
}
}
}
}
You can change the value from AUTO to 2 or 3 depending on your use case.
The exact match you mentioned also works because query string would get analysed and converts the input string into lower case, which is available in inverted index.
As for how fuzzy query (that you've mentioned) works behind the scene, as per this LINK, is as follows:
The fuzzy query works by taking the original term and building a
Levenshtein automaton—like a big graph representing all the strings
that are within the specified edit distance of the original string.
The fuzzy query then uses the automaton to step efficiently through
all of the terms in the term dictionary to see if they match. Once it
has collected all of the matching terms that exist in the term
dictionary, it can compute the list of matching documents.
Of course, depending on the type of data stored in the index, a fuzzy
query with an edit distance of 2 can match a very large number of
terms and perform very badly.
Note this statement in particular, representing all the strings that are within the specified edit distance of the original string
For e.g. some of the words with distance of 1 for life would be aife, bife, cife, dife....lifz.
So in your case, fuzzy search's automaton would not be able to create term svxgbckpyz from input string sVxGBCkPYZ firstly because the distance between them is 7 (Remember distance is 1 between A and a) which I don't think AUTO option can create and even if you configure it to 7, it may not create the string as there would be huge list of words with distance 7
Adding one more LINK for more info. Hope it helps!

Matches on different words should score higher then multiple matches on one word in elasticsearch

In our elasticsearch we have indexed some persons where each person can have multiple taggings.
Take for example 2 persons (fullname - (taggings)):
Bart Newman - (bart,engineer,ceo)
Bart Holland - (developer,employer)
Our searchquery
{
"multi_match": {
"type": "most_fields",
"query": "bart developer",
"operator": "or",
"boost": 5,
"fields": [
"fullname^5",
"taggings.tag.name^5"
],
"fuzziness": 0
}
}
Let's say we are searching on "bart developer". Then we should expect that Bart Holland should come before Bart Newman, but because Bart Newman has bart in his fullname and bart as tag, he scores higher then Bart Holland does.
Is there a way where I can configure that matches on different words (bart, developer) can score higher then multiple matches on one word (bart).
I already tried the and-operator without success.
Thanks!
This is kind of expected with most fields query, it is field-centric rather than term-centric, From the Docs
most_fields being field-centric rather than term-centric: it looks for
the most matching fields, when really what we’re interested is the
most matching terms.
Another problem is Inverse Document Frequency which is also likely in your case. I guess only few documents have tag named bart which is why its IDF is very high and hence gets higher score.
As given in the above links, you should see how documents are scored with validate and explain.
There are couple of ways to solve this issue
1) You can use custom _all field, i.e copy both full name and tag information to new field with copy_to parameter and then query on it but you have to reindex your data for that
2) I think better solution would be to use cross fields, it takes term-centric approach. From the Docs
The cross_fields type first analyzes the query string to produce a
list of terms, and then it searches for each term in any field.
It also solves IDF issue by blending it across all fields.
This should solve your issue.
{
"query": {
"multi_match": {
"type": "cross_fields",
"query": "bart developer",
"operator": "or",
"fields": [
"fullname",
"tagging.tag.name"
],
"fuzziness": 0
}
}
}
Hope this helps!

is _id of document affects on scoring?

I add two same documents the only different thing is _id of documents (I restart scenario for each of them and I do not add them sequentially. to be sure my test is correct)
one of them changes order of result of this query and one of them does not:
GET index_for_test/business/_search
{
"query": {
"multi_match": {
"query": "italian",
"type": "most_fields",
"fields": [ "name^2", "categories" ]
}
}
}
my original question was:
https://github.com/elastic/elasticsearch/issues/10341
as mentioned here: https://groups.google.com/forum/?fromgroups=&hl=en-GB#!topic/elasticsearch/VWqA_P4zzH8
my answer is in this documentation:
https://www.elastic.co/blog/understanding-query-then-fetch-vs-dfs-query-then-fetch
documents are spread in 5 shards by default and queries run with an algorithm that scores documents in each shard and then fetch them, in small data this ends to inaccurate result so if the database is small it is better to run you queries with search_type=dfs_query_then_fetch but it has scalability problems and should be changed when it grows

Resources