How do I find the top 100 most similar documents between two indices in Elasticsearch?
Document #1 is in index1, type11, field111.
Document #2 is in index2, type21, field211
Edit: Both fields are strings.
I looked at the documentation for the More Like This query, but it doesn't tell me how to quickly compare the results for different similarity metrics and look at the top results.
Try this query, but substitute the id values for your documents:
GET index1,index2/_search
{
"query": {
"more_like_this": {
"fields": [
"field111",
"field211"
],
"like": [
{
"_index": "index1",
"_id": "DOC_1_ID"
},
{
"_index": "index2",
"_id": "DOC_2_ID"
}
],
"min_term_freq": 1,
"max_query_terms": 12
}
}
}
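The question asks for the top 100 results, and a search returns only 10 hits by default, so the request also needs a size. A minimal Python sketch that builds the same request body with a size added is shown here. The helper name build_mlt_body and its parameters are illustrative assumptions, and the code only constructs the JSON body, it does not send it.

```python
import json


def build_mlt_body(fields, like_docs, size=100):
    """Build a more_like_this search body (hypothetical helper, not a client API).

    like_docs is a list of (index, document id) pairs.
    """
    return {
        "size": size,  # number of hits to return; 100 as asked in the question
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": [
                    {"_index": index, "_id": doc_id}
                    for index, doc_id in like_docs
                ],
                "min_term_freq": 1,
                "max_query_terms": 12,
            }
        },
    }


body = build_mlt_body(
    ["field111", "field211"],
    [("index1", "DOC_1_ID"), ("index2", "DOC_2_ID")],
)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted under GET index1,index2/_search in place of the body above.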
I have a problem with Elasticsearch and I need your help :)
I have an index that holds my documents. These documents represent either Products or Categories.
The structure is this:
{
"_index": "documents-XXXX",
"_type": "_doc",
"_id": "cat-31",
"_score": 1.0,
"_source": {
"title": "Category A",
"type": "category",
"uniqId": "cat-31",
[...]
}
},
{
"_index": "documents-XXXX",
"_type": "_doc",
"_id": "prod-1",
"_score": 1.0,
"_source": {
"title": "Product 1",
"type": "product",
"uniqId": "prod-1",
[...]
}
},
What I'd like to do, in one call, is:
Get 5 documents whose type is "product" and 2 documents whose type is "category". Do you think it's possible?
That is, two queries in a single call, each with its own limit.
Also, wouldn't it be better to use two different indices, one for products and one for categories?
If so, I have the same question: how do I run both queries in a single call?
Thanks in advance
If product and category are different contexts, I would try to separate them into different indices. Is this type field used as a filter in all your queries? For example, do you search for the term xpto only in docs with type product, or do you search without applying any filter?
About your other question: you can run two queries in one request. The Multi search API can help with this.
You get two responses, one for each query.
GET my-index-000001/_msearch
{ }
{"query": { "term": { "type": { "value": "product" } }}}
{"index": "my-index-000001"}
{"query": { "term": { "type": { "value": "category" } }}}
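The request above returns each query's default number of hits. To get 5 products and 2 categories as asked, each query body can carry its own size. Here is a minimal Python sketch that builds such a newline-delimited _msearch body. The helper name build_msearch_body and the limits mapping are illustrative assumptions, and the code only builds the text, it does not send it.

```python
import json


def build_msearch_body(index, limits):
    """Build an NDJSON _msearch body with one query per document type.

    limits maps a value of the "type" field to the number of hits wanted.
    Hypothetical helper: it builds the request text only.
    """
    lines = []
    for doc_type, size in limits.items():
        # Header line: which index this query runs against.
        lines.append(json.dumps({"index": index}))
        # Body line: the query plus its own hit limit.
        lines.append(json.dumps({
            "size": size,
            "query": {"term": {"type": {"value": doc_type}}},
        }))
    # The multi search body must end with a newline.
    return "\n".join(lines) + "\n"


ndjson = build_msearch_body("my-index-000001", {"product": 5, "category": 2})
print(ndjson)
```

The response contains one entry per query, in the same order as the queries in the body.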
I am upgrading my Elasticsearch server from version 1.6.0 to 7.12.1, which forced me to rewrite every query I had.
Those queries retrieve materials identified by 3 fields: nature.idCat, nature.idNat and marque.idMrq (category ID, nature ID and brand ID).
My application has a search field for finding specific materials, so if the user enters "photoc", the query sent to my Elasticsearch server looks like this:
{
"sort": [
"_score"
],
"query": {
"bool": {
"must": [
{
"query_string": {
"default_field": "search",
"query": "*photoc*",
"boost": 10
}
},
[...] // Some more irrelevant conditions for this question like
// if nature.idCat = 26 then idNat must be in some range and idMrq in some other range
]
}
}
}
And here are 2 example hits from the results of this query:
"hits": [
{
"_index": "ref_biens",
"_type": "_doc",
"_id": "T3RrpXsBz_TibRxz0akC",
"_score": 13.0,
"_source": {
"search": "Photocopieur GENERIQUE",
"nature": {
"idCat": 26,
"idNat": 665,
"libelle": "Photocopieur",
"ekip": "U03C",
"codeINSEE": 300121,
"noteMaterielArrondi": 5
},
"marque": {
"idMrq": 16,
"libelle": "GENERIQUE",
"ekip": "Z999",
"idVRDuree": 808
}
}
},
{
"_index": "ref_biens",
"_type": "_doc",
"_id": "UHRrpXsBz_TibRxz0akC",
"_score": 13.0,
"_source": {
"search": "Photocopieur INFOTEC",
"nature": {
"idCat": 26,
"idNat": 665,
"libelle": "Photocopieur",
"ekip": "U03C",
"codeINSEE": 300121,
"noteMaterielArrondi": 5
},
"marque": {
"idMrq": 1244,
"libelle": "INFOTEC",
"ekip": "I091",
"idVRDuree": 808
}
}
}
]
This works perfectly!
My problem appears when the user types more than one word. For example, if they search specifically for "Photocopieur PANASONIC", the query returns the right material as the first result with a _score of 23, but every other match has the same _score of 13. As a result, totally different materials (matching only on the brand name, for example) can come next, even though I want the other "Photocopieur" entries to be displayed first.
The way I'm thinking of doing it is by adding "score points" to the results that are most similar to the best match. For instance, I would add a 6 point boost for the same nature.idCat, 4 points for the same nature.idNat and finally 2 points for the same marque.idMrq.
Any idea how I can achieve that? Is this the correct approach to my problem?
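One possible way to express the scoring idea described above is to add should clauses to the bool query, each a constant_score wrapper around a term filter whose boost is the number of points. The Python sketch below is only an illustration under assumptions: it presumes the IDs of the best match were read from a first request, the helper name and the best_match keys are hypothetical, and the elided extra conditions from the original query are left out.

```python
import json


def build_similarity_boost_query(user_input, best_match):
    """Build a bool query that adds fixed points for IDs shared with the best match.

    best_match is assumed to come from a previous request, e.g.
    {"idCat": 26, "idNat": 665, "idMrq": 16}. Hypothetical helper.
    """
    def bonus(field, value, points):
        # constant_score gives every document matching the filter a score equal to boost.
        return {
            "constant_score": {
                "filter": {"term": {field: value}},
                "boost": points,
            }
        }

    return {
        "sort": ["_score"],
        "query": {
            "bool": {
                "must": [
                    {
                        "query_string": {
                            "default_field": "search",
                            "query": user_input,
                            "boost": 10,
                        }
                    }
                ],
                # Optional clauses: each one that matches adds its points to the score.
                "should": [
                    bonus("nature.idCat", best_match["idCat"], 6),
                    bonus("nature.idNat", best_match["idNat"], 4),
                    bonus("marque.idMrq", best_match["idMrq"], 2),
                ],
            }
        },
    }


boosted = build_similarity_boost_query(
    "*photoc*", {"idCat": 26, "idNat": 665, "idMrq": 16}
)
print(json.dumps(boosted, indent=2))
```

Because the should clauses sit next to a must clause, they do not change which documents match; they only raise the score of documents that share IDs with the best match.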
I have a large Elasticsearch index.
I am searching with the query below:
{"size": 1000, "query": {"query_string": {"query": "( string1 )"}}}
Let's say string1 = Product. If someone accidentally types prduct (forgetting the o),
is there any way to match that as well?
{"size": 1000, "query": {"query_string": {"query": "( prdct )"}}} also has to return the results for both prdct and product
You can use the fuzzy query, which returns documents that contain terms similar to the search term. Refer to this blog for a detailed explanation of fuzzy queries.
Since prdct needs more than one edit to match product, set the fuzziness explicitly. The fuzziness parameter can be defined as:
0, 1, 2 = The maximum allowed edit distance
AUTO = The edit distance is derived from the length of the term:
0..2 = Must match exactly
3..5 = One edit allowed
More than 5 = Two edits allowed
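The length thresholds listed above are the ones Elasticsearch documents for its AUTO fuzziness setting. The Python sketch below mirrors them and compares the result with a plain Levenshtein distance, to show why prdct needs an explicit value. Both functions are illustrative re-implementations written for this example, not Elasticsearch code.

```python
def auto_fuzziness(term):
    """Edits allowed for a term under the length thresholds listed above."""
    length = len(term)
    if length <= 2:
        return 0  # must match exactly
    if length <= 5:
        return 1  # one edit allowed
    return 2      # two edits allowed


def levenshtein(a, b):
    """Plain Levenshtein distance (insert, delete, substitute; no transpositions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


allowed = auto_fuzziness("prdct")         # 5 characters, so one edit
needed = levenshtein("prdct", "product")  # the letters o and u are missing
print(allowed, needed)
```

Since prdct has 5 characters, the length-based rule allows only one edit, while reaching product takes two insertions. That is why the fuzziness has to be set to 2 explicitly in this case.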
Index Data:
{
"title":"product"
}
{
"title":"prdct"
}
Search Query:
{
"query": {
"fuzzy": {
"title": {
"value": "prdct",
"fuzziness": 2,
"transpositions":true,
"boost": 5
}
}
}
}
Search Result:
"hits": [
{
"_index": "my-index1",
"_type": "_doc",
"_id": "2",
"_score": 3.465736,
"_source": {
"title": "prdct"
}
},
{
"_index": "my-index1",
"_type": "_doc",
"_id": "1",
"_score": 2.0794415,
"_source": {
"title": "product"
}
}
]
There are many solutions to this problem:
Suggestions (did you mean X instead).
Fuzziness (edits from your original search term).
Partial matching with autocomplete (if someone types "pr" and you provide the available search terms, they can click on the correct results right away) or n-grams (matching groups of letters).
All of those have tradeoffs in index / search overhead as well as the classic precision / recall problem.
What's wrong with this More Like This query? It was written from scratch. It returns relevant results, but it is too slow (this example took 187.9 ms).
{
"query": {
"bool": {
"must": [{
"more_like_this": {
"fields": ["similarity.analyzed"],
"like": [{
"_id": 4
}, {
"_id": 550
}, {
"_id": 757
}],
"min_term_freq": 1,
"min_doc_freq": 1,
"analyzer": "searchkick_search2",
"minimum_should_match": "10%"
}
}, {
"range": {
"count_posts": {
"gt": 0
}
}
}],
"must_not": [{
"terms": {
"_id": [4, 550, 757]
}
}]
}
},
"size": 10
}
This query finds tags similar to a given set of tags.
similarity - text field with all post titles, joined with spaces.
count_posts - numeric field containing the number of posts for each tag.
Running Elasticsearch 7.8.0 on Ubuntu 18.04 as a single node. Rails 5 app with the Searchkick gem.
Whats wrong with this more like this query?
"like": [{
"_id": 4
}, {
"_id": 550
}, {
"_id": 757
}]
It acts like the Multi Get API. It does the following:
1. Gets all the documents referenced by _id in like.
2. Analyzes the field using the analyzer option provided.
3. Analyzes the same fields from the matching docs of step 1. The list of tokenizers and filters also adds some milliseconds.
4. Calculates document and term frequencies, along with the minimum match.
And you have two more conditions. The documentation says:
A more complicated use case consists of mixing texts with documents already existing in the index.
Unfortunately, I don't think this can be optimized further. However, you can pass text instead of document ids in like, which should make it noticeably faster. Hopefully, thanks to caching, the query will not always take more than 100 ms.
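To illustrate the suggestion of passing text instead of ids, here is a minimal Python sketch that builds the same query with free text in like. It assumes the application already has the text of the source tags at hand (for example from its own database); the helper name and the sample texts are hypothetical.

```python
import json


def build_mlt_text_query(like_texts, excluded_ids, size=10):
    """Build the more_like_this query with free text in "like" instead of document ids.

    like_texts: strings to find similar documents for (assumed to be already available).
    excluded_ids: ids of the source documents, kept out of the results.
    Hypothetical helper: it only builds the request body.
    """
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "more_like_this": {
                            "fields": ["similarity.analyzed"],
                            "like": list(like_texts),  # free text, no document lookup
                            "min_term_freq": 1,
                            "min_doc_freq": 1,
                            "analyzer": "searchkick_search2",
                            "minimum_should_match": "10%",
                        }
                    },
                    {"range": {"count_posts": {"gt": 0}}},
                ],
                "must_not": [{"terms": {"_id": list(excluded_ids)}}],
            }
        },
        "size": size,
    }


# Sample texts are placeholders, not data from the question.
text_query = build_mlt_text_query(
    ["first tag post titles", "second tag post titles"],
    [4, 550, 757],
)
print(json.dumps(text_query, indent=2))
```

Whether this is actually faster depends on the data and the analyzer, so it is worth measuring both variants on the real index.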
I am currently using BoolQueryBuilder to build a text search. I am having an issue with misspellings. When someone searches for "chiar" instead of "chair", I have to show them some suggestions.
I have gone through the documentation and observed that the SuggestionBuilder is useful to get the suggestions.
Can I send all the requests in a single query, so that I can show the suggestions if the result is zero?
There is no need to send different search terms (i.e. chair, chiar) to get suggestions. It is neither efficient nor performant, and you cannot know all the ways a user might misspell a term.
Instead, use the fuzzy query or the fuzziness param in the match query itself, which can be used inside the bool query.
Let me show you an example using the match query with the fuzziness parameter.
Index definition
{
"mappings": {
"properties": {
"product": {
"type": "text"
}
}
}
}
Sample document to index
{
"product" : "chair"
}
Search query with the misspelled term chiar (tune fuzziness for your application; valid values are 0, 1, 2 or AUTO)
{
"query": {
"match" : {
"product" : {
"query" : "chiar",
"fuzziness" : "2"
}
}
}
}
Search result
"hits": [
{
"_index": "so_fuzzy",
"_type": "_doc",
"_id": "1",
"_score": 0.23014566,
"_source": {
"product": "chair"
}
}