Speed up Elasticsearch more like this query - performance

What's wrong with this more like this query? It was written from scratch. It returns relevant results, but it is too slow (this example took 187.9 ms):
{
"query": {
"bool": {
"must": [{
"more_like_this": {
"fields": ["similarity.analyzed"],
"like": [{
"_id": 4
}, {
"_id": 550
}, {
"_id": 757
}],
"min_term_freq": 1,
"min_doc_freq": 1,
"analyzer": "searchkick_search2",
"minimum_should_match": "10%"
}
}, {
"range": {
"count_posts": {
"gt": 0
}
}
}],
"must_not": [{
"terms": {
"_id": [4, 550, 757]
}
}]
}
},
"size": 10
}
This query finds tags similar to a given set of tags.
similarity - text field containing all post titles, joined with spaces.
count_posts - numeric field containing the number of posts of each tag.
Running Elasticsearch 7.8.0 on Ubuntu 18.04 as a single node. Rails 5 app with the Searchkick gem.

What's wrong with this more like this query?
"like": [{
"_id": 4
}, {
"_id": 550
}, {
"_id": 757
}]
With document IDs in like, the query acts like the multi get API. It does the following:
1. Fetches all the documents listed by _id in like.
2. Analyzes the field using the analyzer option provided.
3. Analyzes the same fields from the documents fetched in step 1. The analyzer's tokenizers and filters also add some milliseconds.
4. Calculates document and term frequencies and applies the minimum match.
On top of that, you have two more conditions (the range and must_not clauses). The documentation says:
A more complicated use case consists of mixing texts with documents already existing in the index.
Unfortunately, I don't think this can be optimised much further. But you can pass text instead of IDs in like to make it much faster. Hopefully, thanks to caching, the query does not always take more than 100 ms.
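The suggestion to pass text instead of IDs in like could be sketched as follows. This is a minimal sketch, not a tested optimisation: it assumes the application already holds the similarity text of the three source tags, and build_mlt_query is a hypothetical helper name. The field names, analyzer and thresholds are copied from the question.

```python
import json


def build_mlt_query(like_text, exclude_ids, size=10):
    """Build a more_like_this request body that uses free text in `like`.

    Passing text avoids the step where Elasticsearch first fetches the
    documents referenced by _id. `like_text` is assumed to be the
    `similarity` text of the source tags, already known to the caller.
    """
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "more_like_this": {
                            "fields": ["similarity.analyzed"],
                            "like": like_text,
                            "min_term_freq": 1,
                            "min_doc_freq": 1,
                            "analyzer": "searchkick_search2",
                            "minimum_should_match": "10%",
                        }
                    },
                    {"range": {"count_posts": {"gt": 0}}},
                ],
                # Still exclude the tags the text was taken from.
                "must_not": [{"terms": {"_id": exclude_ids}}],
            }
        },
        "size": size,
    }


if __name__ == "__main__":
    # Placeholder text; in practice this would be the joined post titles.
    body = build_mlt_query("first post title second post title", [4, 550, 757])
    print(json.dumps(body, indent=2))
```

Whether this is actually faster depends on the data and should be measured on the real index.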

Related

How can I influence Elasticsearch scoring by using higher score results informations?

I am upgrading my Elasticsearch server from version 1.6.0 to 7.12.1, which made me rewrite every query I had.
Those queries retrieve materials identified by 3 fields: nature.idCat, nature.idNat and marque.idMrq (category ID, nature ID and brand ID).
My application has a search field for finding specific materials, so if the user enters "photoc", the query sent to my Elasticsearch server looks like this:
{
"sort": [
"_score"
],
"query": {
"bool": {
"must": [
{
"query_string": {
"default_field": "search",
"query": "*photoc*",
"boost": 10
}
},
[...] // Some more irrelevant conditions for this question like
// if nature.idCat = 26 then idNat must be in some range and idMrq in some other range
]
}
}
}
And here are 2 example "hits" from the results of this query:
"hits": [
{
"_index": "ref_biens",
"_type": "_doc",
"_id": "T3RrpXsBz_TibRxz0akC",
"_score": 13.0,
"_source": {
"search": "Photocopieur GENERIQUE",
"nature": {
"idCat": 26,
"idNat": 665,
"libelle": "Photocopieur",
"ekip": "U03C",
"codeINSEE": 300121,
"noteMaterielArrondi": 5
},
"marque": {
"idMrq": 16,
"libelle": "GENERIQUE",
"ekip": "Z999",
"idVRDuree": 808
}
}
},
{
"_index": "ref_biens",
"_type": "_doc",
"_id": "UHRrpXsBz_TibRxz0akC",
"_score": 13.0,
"_source": {
"search": "Photocopieur INFOTEC",
"nature": {
"idCat": 26,
"idNat": 665,
"libelle": "Photocopieur",
"ekip": "U03C",
"codeINSEE": 300121,
"noteMaterielArrondi": 5
},
"marque": {
"idMrq": 1244,
"libelle": "INFOTEC",
"ekip": "I091",
"idVRDuree": 808
}
}
}
]
This works perfectly!
My problem appears when the user types more than one word. For example, if they search specifically for "Photocopieur PANASONIC", the right material comes first with a _score of 23, but every other match has the same _score of 13. This can bring totally different materials as the next results (matching only on the brand name, for example), even though I wish for other "Photocopieur" items to be displayed first.
The way I'm thinking of doing it is by adding "score points" to results that have the most similarities to the best match. For instance, I would add a 6 point boost for the same nature.idCat, 4 points for the same nature.idNat and finally 2 points for the same marque.idMrq.
Any idea how I can achieve that? Is this the correct approach to my problem?
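One way the "score points" idea could be sketched is a two-pass search: run the original query, take the top hit, then re-run the query with should clauses that add a fixed score for each shared ID. This is only an illustration of the approach described in the question, not a confirmed solution. boost_by_best_match is a hypothetical helper, and it assumes the three ID fields are indexed as plain (non-nested) numeric fields.

```python
def boosted_term(field, value, boost):
    """A clause that adds exactly `boost` to the score when field == value."""
    return {
        "constant_score": {
            "filter": {"term": {field: value}},
            "boost": boost,
        }
    }


def boost_by_best_match(base_query, best_hit_source):
    """Wrap `base_query` so hits resembling the best match score higher.

    `base_query` is the query clause used in the first pass and
    `best_hit_source` is the _source of that pass's top hit.
    """
    nature = best_hit_source["nature"]
    marque = best_hit_source["marque"]
    return {
        "sort": ["_score"],
        "query": {
            "bool": {
                "must": [base_query],
                # Optional clauses: each one that matches adds its boost.
                "should": [
                    boosted_term("nature.idCat", nature["idCat"], 6),
                    boosted_term("nature.idNat", nature["idNat"], 4),
                    boosted_term("marque.idMrq", marque["idMrq"], 2),
                ],
            }
        },
    }


if __name__ == "__main__":
    first_pass = {"query_string": {"default_field": "search", "query": "*photoc*", "boost": 10}}
    top_hit = {"nature": {"idCat": 26, "idNat": 665}, "marque": {"idMrq": 16}}
    print(boost_by_best_match(first_pass, top_hit))
```

The trade-off is a second round trip to Elasticsearch for every search.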

Is there any way to match similar match in Elastic Search

I have a large Elasticsearch index.
I am searching with the query below:
{"size": 1000, "query": {"query_string": {"query": "( string1 )"}}}
Let's say string1 = Product. What if someone accidentally types prduct, forgetting the o?
Is there any way to match that as well?
{"size": 1000, "query": {"query_string": {"query": "( prdct )"}}} should also return the results for both prdct and product.
You can use a fuzzy query, which returns documents that contain terms similar to the search term. Refer to this blog for a detailed explanation of fuzzy queries.
Matching prdct to product takes two edits, so you need a larger edit distance. The fuzziness parameter can be set to 0, 1, 2 or AUTO. With AUTO, the number of edits allowed depends on the length of the term:
0..2 = Must match exactly
3..5 = One edit allowed
More than 5 = Two edits allowed
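The length thresholds listed above are the ones Elasticsearch documents for fuzziness AUTO. As a minimal sketch, they can be written as a small function (auto_fuzziness is a hypothetical name used only for illustration):

```python
def auto_fuzziness(term):
    """Return the number of edits AUTO fuzziness allows for `term`."""
    length = len(term)
    if length <= 2:
        return 0  # 0..2 characters: must match exactly
    if length <= 5:
        return 1  # 3..5 characters: one edit allowed
    return 2      # more than 5 characters: two edits allowed


if __name__ == "__main__":
    for term in ("ab", "prdct", "product"):
        print(term, auto_fuzziness(term))
```

Note that prdct has 5 characters, so AUTO would allow only one edit for it, which is why an explicit edit distance is needed to reach product.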
Index Data:
{
"title":"product"
}
{
"title":"prdct"
}
Search Query:
{
"query": {
"fuzzy": {
"title": {
"value": "prdct",
"fuzziness":2,
"transpositions":true,
"boost": 5
}
}
}
}
Search Result:
"hits": [
{
"_index": "my-index1",
"_type": "_doc",
"_id": "2",
"_score": 3.465736,
"_source": {
"title": "prdct"
}
},
{
"_index": "my-index1",
"_type": "_doc",
"_id": "1",
"_score": 2.0794415,
"_source": {
"title": "product"
}
}
]
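To see why product is within reach of prdct, the edit distance can be computed by hand. The sketch below uses the optimal string alignment variant of Damerau-Levenshtein distance, where swapping two adjacent characters counts as one edit when transpositions is enabled. It is an illustration of the metric, not Elasticsearch's own implementation.

```python
def edit_distance(a, b, transpositions=True):
    """Edit distance between a and b (insert, delete, substitute, swap)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (
                transpositions
                and i > 1
                and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            ):
                # Adjacent characters swapped: counts as a single edit.
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


if __name__ == "__main__":
    print(edit_distance("prdct", "product"))  # two insertions: o and u
```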
There are many solutions to this problem:
Suggestions (did you mean X instead).
Fuzziness (edits from your original search term).
Partial matching with autocomplete (if someone types "pr" and you provide the available search terms, they can click on the correct results right away) or n-grams (matching groups of letters).
All of those have tradeoffs in index / search overhead as well as the classic precision / recall problem.
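The n-gram option can be illustrated with plain character bigrams. This is a toy sketch of the idea of matching groups of letters, not how Elasticsearch analyzes or scores n-gram fields; char_ngrams and ngram_overlap are hypothetical helper names.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams of `text`, lowercased."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_overlap(a, b, n=2):
    """Jaccard overlap between the n-gram sets of a and b (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    print(sorted(char_ngrams("prdct")))
    print(ngram_overlap("prdct", "product"))
```

Even a misspelled term shares some n-grams with the intended one, which improves recall at the cost of a larger index and less precise matches.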

Find most similar documents in Elasticsearch

How do I find the top 100 most similar documents between two indices in Elasticsearch?
Document #1 is in index1, type11, field111.
Document #2 is in index2, type21, field211
Edit: Both fields are strings.
I looked at the documentation for More Like This query. But it doesn't tell me how I can quickly compare the results for different kinds of similarity metrics and look at the top results.
Try this query, but substitute the id values for your documents:
GET index1,index2/_search
{
"query": {
"more_like_this": {
"fields": [
"field111",
"field211"
],
"like": [
{
"_index": "index1",
"_id": "DOC_1_ID"
},
{
"_index": "index2",
"_id": "DOC_2_ID"
}
],
"min_term_freq": 1,
"max_query_terms": 12
}
}
}

How are the documents ordered in Elasticsearch if the sort value for two documents is same?

I was working with a products data set.
The search query, which sorts by the keyword field tags using max mode, is as follows.
GET product/_doc/_search
{
"size":100,"from":20,"_source":["tags", "name"],
"query": {
"match_all": {}
},
"sort": [
{"tags":{
"order":"desc",
"mode":"max"
}}
]
}
Some documents have the same sort value. I had read somewhere that if the sort value is the same, results are ordered by internal doc id (_id). However, that does not seem to be the case in my results.
First came _id: 961, followed by _id: 972 (fine). However, then came _id: 114. I don't understand why the order looks random.
Any help will be appreciated.
As you have already seen, the order of ties is effectively random: ties are broken by Lucene's internal document ID, which is not the same thing as _id. To overcome this, you can add a second sort field to be used when the value of the first sort field is the same. As you want to use _id, the query becomes:
{
"size": 100,
"from": 20,
"_source": [
"tags",
"name"
],
"query": {
"match_all": {}
},
"sort": [
{
"tags": {
"order": "desc",
"mode": "max"
}
},
{
"_id": "asc"
}
]
}
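The effect of the second sort key can be simulated outside Elasticsearch. The sketch below uses made-up hits (the tags values are hypothetical) and sorts by tags descending, then by _id ascending. Note that _id is a string, so it compares lexicographically: "114" sorts before "961", and "10" would sort before "9".

```python
def sort_hits(hits):
    """Sort by tags descending, breaking ties by _id ascending."""
    # Python's sort is stable, so sorting by the secondary key first and
    # the primary key second yields a deterministic order for ties.
    by_id = sorted(hits, key=lambda hit: hit["_id"])
    return sorted(by_id, key=lambda hit: hit["tags"], reverse=True)


if __name__ == "__main__":
    hits = [
        {"_id": "961", "tags": "z"},
        {"_id": "972", "tags": "z"},
        {"_id": "114", "tags": "z"},
        {"_id": "200", "tags": "a"},
    ]
    print([hit["_id"] for hit in sort_hits(hits)])
```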

Elastic search aggregation sum

I'm using Elasticsearch 1.0.2 and I want to run a search on it using a query with aggregation functions like sum().
Suppose a single record looks like this:
{
"_index": "outboxpro",
"_type": "message",
"_id": "PAyEom_mRgytIxRUCdN0-w",
"_score": 4.5409594,
"_source": {
"team_id": "1bf5f3f968e36336c9164290171211f3",
"created_user": "1a9d05586a8dc3f29b4c8147997391f9",
"created_ip": "192.168.2.245",
"folder": 1,
"report": [
{
"networks": "ec466c09fd62993ade48c6c4bb8d2da7facebook",
"status": 2,
"info": "OK"
},
{
"networks": "bdc33d8ca941b8f00c2a4e046ba44761twitter",
"status": 2,
"info": "OK"
},
{
"networks": "ad2672a2361d10eacf8a05bd1b10d4d8linkedin",
"status": 5,
"info": "[unauthorized] Invalid or expired token."
}
]
}
}
Let's say I need to fetch the count of all successful messages, i.e. those posted with status = 2 in the report field. There will be many records in the collection, and I want a report of all successfully posted messages.
I have tried the following code:
Edit:
{
"size": 2000,
"query": {
"filtered": {
"query": {
"match": {
"team_id": {
"query": "1bf5f3f968e36336c9164290171211f3"
}
}
}
}
},
"aggs": {
"genders": {
"terms": {
"field": "report.status"
}
}
}
}
Please help me find a solution. I am a newbie in Elasticsearch. Is there any other aggregation method for this? Your help is much appreciated.
Your script filter is slow on big data and doesn't benefit from indexing. Did you think about parent/child instead of nested? With parent/child you could use aggregations natively and calculate the sum.
You will have to make use of nested mappings here. Have a look at https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-mapping.html.
Then you will have to run the aggregation on the nested fields, as in https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-nested-aggregation.html.
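A nested aggregation for this case could look like the sketch below. It assumes the report field has been remapped as type nested (which requires reindexing), and the aggregation names reports and by_status as well as both helper functions are hypothetical. The match clause on team_id is taken from the question.

```python
def build_status_count_request(team_id):
    """Request body that counts report entries per status for one team."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "query": {"match": {"team_id": team_id}},
        "aggs": {
            "reports": {
                # Step into the nested report objects.
                "nested": {"path": "report"},
                "aggs": {
                    "by_status": {"terms": {"field": "report.status"}},
                },
            }
        },
    }


def count_for_status(response, status):
    """Read the count for one status out of the aggregation response."""
    buckets = response["aggregations"]["reports"]["by_status"]["buckets"]
    for bucket in buckets:
        if bucket["key"] == status:
            return bucket["doc_count"]
    return 0


if __name__ == "__main__":
    print(build_status_count_request("1bf5f3f968e36336c9164290171211f3"))
```

Because each report entry is counted as its own nested document, a message posted successfully to two networks contributes 2 to the status 2 bucket.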

Resources