Elastic Search comparing sentences with synonyms? - elasticsearch

Is there an API within Elasticsearch to compare the following two sentences?
The weather is great
The climate is good
The search described here https://www.elastic.co/guide/en/elasticsearch/guide/2.x/practical-scoring-function.html doesn't work, since the two sentences consist of largely different words.

The following request, sent to the Painless execute API (POST /_scripts/painless/_execute), will give you the score that Elasticsearch would compute. Replace test with the name of your index and field with the name of a field that uses the correct analyzer.
{
"script": {
"source": "_score"
},
"context": "score",
"context_setup": {
"index": "test",
"query": {
"match": {
"field": "The weather is great"
}
},
"document": {
"field": "The climate is good"
}
}
}
You will not get a score between 0.5 and 1 though. Elasticsearch is not built to perform pairwise string comparison; it is built to search within a collection of documents.
If you really want to get a score between 0.5 and 1, you will have to write a scripted similarity function.
But again, I don't think Elasticsearch fits your use case.
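To illustrate what a pairwise comparison outside Elasticsearch could look like, here is a minimal sketch in Python. It is not an Elasticsearch feature: it maps each word to a canonical form through a hand-written synonym table and then computes the Jaccard overlap of the two token sets, which always lies between 0 and 1. The `SYNONYMS` and `STOPWORDS` tables and both function names are assumptions made up for this example.

```python
# Hypothetical synonym table: each word maps to a canonical form.
SYNONYMS = {"climate": "weather", "good": "great"}
# Hypothetical stopword list: words ignored in the comparison.
STOPWORDS = {"the", "is", "a", "an"}


def normalize(sentence):
    """Lowercase, split on whitespace, drop stopwords, canonicalize synonyms."""
    tokens = sentence.lower().split()
    return {SYNONYMS.get(token, token) for token in tokens if token not in STOPWORDS}


def jaccard_similarity(first, second):
    """Return |A & B| / |A | B| for the normalized token sets (0.0 to 1.0)."""
    a, b = normalize(first), normalize(second)
    if not a and not b:
        return 1.0  # two empty sentences are treated as identical
    return len(a & b) / len(a | b)


score = jaccard_similarity("The weather is great", "The climate is good")
print(score)
```

With a real synonym source the table would be much larger, and the quality of the score depends entirely on it.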

Related

how to perform cosine similarity based semantic search in elastic search query on a text field?

I am performing a match on a text field (Skills). I don't want an exact match; instead I want a cosine-similarity-based search on the field.
GET 2/_search
{
"_source": ["Skills"],
"query": {
"function_score": {
"query": {
"match": {
"Job_Group": "sales"
}
},
"functions": [
{
"filter": {
"match":{
"Skills":"Designation"
}
},
"weight": 15
}
]
}
}
}
The above query is for an exact match. How do I include some sort of semantic search (cosine-similarity based) in the query on the Skills field? The Skills field is a free-text field, so I want matching to happen based on semantic meaning as well. Example: the skills "Communication" and "talking" should show some similarity and boost the score.
Very simple: Elasticsearch wraps Lucene, and Lucene has More-Like-This, which implements TF-IDF (BM25, to be more precise) plus some additional heuristics. Try it; it should give you good similarity results. An explanation can be found in this link and various others.
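As a rough sketch of how that suggestion could be applied to the query in the question, the Python snippet below builds a request body that keeps the match on Job_Group as a required clause and adds a more_like_this clause on Skills as an optional, scoring clause. The function name and the chosen values for `min_term_freq` and `min_doc_freq` are assumptions for illustration. Note that more_like_this is term-based, so on its own it will not relate "Communication" to "talking".

```python
import json


def skills_similarity_query(job_group, skills_text):
    """Build a search body: required Job_Group match plus an optional
    more_like_this clause on Skills that adds to the score when it matches."""
    return {
        "_source": ["Skills"],
        "query": {
            "bool": {
                "must": [{"match": {"Job_Group": job_group}}],
                "should": [
                    {
                        "more_like_this": {
                            "fields": ["Skills"],
                            "like": skills_text,
                            # Lowered from the defaults so that short
                            # free-text fields can still produce terms.
                            "min_term_freq": 1,
                            "min_doc_freq": 1,
                        }
                    }
                ],
            }
        },
    }


body = skills_similarity_query("sales", "Communication")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent to the same _search endpoint as the original query.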

Elasticsearch: obtaining each field score in the same document

Assuming I have a document with three fields: name, company, email each one mapped with edge-ngram
{
"name": "John",
"company": "John's company",
"email": "johndoe#gmail.com"
}
When searching for "john" I want to be able to get each field score individually
{
"query": {
"bool": {
"should": [
{ "match": { "name": "john" }},
{ "match": { "company": "john" }},
{ "match": { "email": "john" }}
]
}
}
}
In this example the score from each match clause is added together, then divided by the number of match clauses. So is there any way to obtain the score from each match clause individually, not just the final score for the whole document?
I think setting "explain": true is also not ideal since it provides very low-level details of scoring (inefficient and difficult to parse).
I cannot think of a way that you could do this without modifying the search results.
However, if you were to use a different boost on each field, you might be able to work backwards to the value of each. For instance, boosting one field by 1, the next by 10, and the last by 100, and then examining the final number, might give you what you are looking for. Be aware, though, that the field boosted by 100 will then be the only one that really matters for ranking.
I'm curious about the application of this, as it seems boosting in general might solve what you are looking for.
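The caveat about the 100x boost can be seen with a little arithmetic. The Python sketch below uses a deliberately simplified model in which the document score is just the boost-weighted sum of the per-field scores (real scoring involves more factors), and the per-field scores are made-up numbers.

```python
def combined_score(name, company, email, boosts=(1, 10, 100)):
    """Boost-weighted sum of three per-field scores (simplified model)."""
    b_name, b_company, b_email = boosts
    return b_name * name + b_company * company + b_email * email


# Made-up per-field scores for two documents.
# doc_a matches name and company strongly, email weakly.
doc_a = combined_score(name=0.9, company=0.9, email=0.1)
# doc_b matches everything weakly, but email slightly better than doc_a.
doc_b = combined_score(name=0.1, company=0.1, email=0.2)

print(doc_a, doc_b)
print("doc_b ranks first:", doc_b > doc_a)
```

In this model doc_b outranks doc_a only because of its slightly higher email score, which is why the spread-out boosts distort the ranking even if they help separate the fields.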

How to boost individual documents

I have a pretty complex query and now I want to boost some documents that fulfill some criteria. I have the following simplified document structure and I try to give some documents a boost based on the id, genre, tag.
{
"id": 123,
"genres": ["ACTION", "DRAMA"],
"tags": ["For kids", "Romantic", "Nature"]
}
What I want to do is for example
id: 123 boost: 5
genres: ACTION boost: 3
tags: Romantic boost: 0.2
and boost all documents that are contained in my query and fit the criteria but I don't want to filter them out. So query clause boosting is not of any help I guess.
Edit: To make it easier to understand what I want to achieve (not sure if it is possible with Elasticsearch; "no" is also a valid answer).
I want to search with a query and get a result set. In this set I want to boost some documents. But I don't want to enlarge the result set or filter it. The boost should be independent from the query.
For example I search for a specific tag and want to boost all documents with category 'ACTION' in the result set. But I don't want all documents with category 'ACTION' in the result set and also I don't want only documents with the specific tag AND category 'ACTION'.
I think you need dynamic boosting at query time.
In the query below, the first clause matches on the title with a boost, and the second matches the genre ACTION.
{
"query": {
"bool": {
"should": [
{
"match": {
"title": {
"query": "id",
"boost": 5
}
}
},
{
"match": {
"content": "Action"
}
}
]
}
}
}
If you want a multi_match query based on your search terms instead:
{
"multi_match" : {
"query": "some query terms here",
"fields": [ "id^5", "genres^3", "tags^0.2" ]
}
}
Note: the ^5 suffix means that matches on that field (here id) are boosted by a factor of 5.
Edit:
Maybe you are asking for different types of multi_match queries (at least for ES 5.x) from the ES reference guide:
best_fields: (default) Finds documents which match any field, but uses the _score from the best field.
most_fields: Finds documents which match any field and combines the _score from each field.
cross_fields: Treats fields with the same analyzer as though they were one big field. Looks for each word in any field.
phrase: Runs a match_phrase query on each field and combines the _score from each field.
phrase_prefix: Runs a match_phrase_prefix query on each field and combines the _score from each field.
More at: ES 5.4 Elasticsearch reference
I found a solution and it was pretty simple. I use a boosting query: I nest the different boosting criteria, and my original query becomes the base query.
https://www.elastic.co/guide/en/elasticsearch/reference/2.3/query-dsl-boosting-query.html
For example:
{
"query": {
"boosting": {
"positive": {
"boosting": {
"positive": {
"match": {
"director": "Spielberg"
}
},
"negative": {
"term": {
"genres": "DRAMA"
}
},
"negative_boost": 1.3
}
},
"negative": {
"term": {
"tags": "Romantic"
}
},
"negative_boost": 1.2
}
}
}
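The nesting can be generated rather than written by hand. The following Python sketch wraps a base query in one boosting query per criterion; the function name and the shape of the `criteria` list are assumptions for illustration, and the values are the ones from the example above.

```python
import json


def nest_boosting(base_query, criteria):
    """Wrap base_query in one boosting query per (clause, factor) pair.

    The first criterion ends up innermost, the last one outermost.
    """
    query = base_query
    for clause, factor in criteria:
        query = {
            "boosting": {
                "positive": query,
                "negative": clause,
                "negative_boost": factor,
            }
        }
    return query


base = {"match": {"director": "Spielberg"}}
criteria = [
    ({"term": {"genres": "DRAMA"}}, 1.3),
    ({"term": {"tags": "Romantic"}}, 1.2),
]
body = {"query": nest_boosting(base, criteria)}
print(json.dumps(body, indent=2))
```

Because the base query stays in the innermost positive clause, the set of matching documents is unchanged and only the scores are adjusted.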

Elasticsearch: document size and query performance

I have an ES index with medium-sized documents (15-30 MB, more or less).
Each document has a boolean field and most of the times users just want to know if a specific document ID has that field set to true.
Will document size affect the performance of this query?
{
"size": 1,
"query": {
"term": {
"my_field": true
}
},
"_source": [
"my_field"
]
}
And will a "size":0 query result in better time performance?
By adding "size":0 to your query you avoid some network transfer, which will improve your response time.
But as I understand your use case, you can use the Count API instead.
An example query:
curl -XPOST 'http://localhost:9200/test/_count' -H 'Content-Type: application/json' -d '{
"query": {
"bool": {
"must": [
{
"term": {
"id": xxxxx
}
},
{
"term": {
"bool_field": true
}
}
]
}
}
}'
With this query you only need to check whether the returned count is greater than zero: that tells you whether a document with the given id has the boolean field set to the value (true or false) that you specified for bool_field in the query. This will be quite fast.
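A small Python sketch of that check follows. It only builds the request body and interprets a response; sending the request is left out. The function names are made up for this example, and the sample responses are hand-written stand-ins that contain only the count field.

```python
import json


def count_body(doc_id, flag_field, flag_value=True):
    """Build the _count request body: both term clauses must match."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"term": {"id": doc_id}},
                    {"term": {flag_field: flag_value}},
                ]
            }
        }
    }


def flag_matches(count_response):
    """True if at least one document matched both term clauses."""
    return count_response["count"] > 0


# json.dumps turns Python's True into the JSON literal true.
print(json.dumps(count_body(42, "bool_field")))

# Hand-written stand-ins for responses from the Count API.
print(flag_matches({"count": 1}))
print(flag_matches({"count": 0}))
```

The doc_id value 42 is a placeholder, just like xxxxx in the curl example.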
Because Elasticsearch indexes your fields, the document size will not be a big problem for query performance. Using size 0 doesn't affect how the query executes inside Elasticsearch, but it does improve the time needed to return the result, because less data is transferred over the network.
If you just want to check one boolean field for a specific document, you can simply use the Get API and retrieve only the field you want to check, like this:
curl -XGET 'http://localhost:9200/my_index/my_type/1000?fields=my_field'
In this case Elasticsearch will just retrieve the document with _id = 1000 and the field my_field. So you can check the boolean value.
{
"_index": "my_index",
"_type": "my_type",
"_id": "1000",
"_version": 9,
"found": true,
"fields": {
"my_field": [
true
]
}
}
Looking at your question, I see that you haven't mentioned the Elasticsearch version you are using. There are a lot of factors that affect the performance of an Elasticsearch cluster.
However, assuming it is the latest Elasticsearch and considering that you are after a single value, the best approach is to change your query into a non-scoring, filtering query. Filters are quite fast in Elasticsearch and very easily cached. Making a query non-scoring avoids the scoring phase entirely (calculating relevance, etc.).
To do this:
GET localhost:9200/test_index/test_partition/_search
{
"query" : {
"constant_score" : {
"filter" : {
"term" : {
"my_field" : true
}
}
}
}
}
Note that we are using the search API. The constant_score query is used to convert the term query into a filter, which should be inherently fast.
For more information, please refer to Finding exact values.

Elasticsearch: Having document score equal number of hits in field

Using elasticsearch, I'm searching through an index on a field that typically has a large amount of text and I simply want to know the number of times the query was matched per document. Anyone know of a good way to do this? I'd like to do it through the score value if possible. So for example, if I searched "fox" on "the quick brown fox jumped over the lazy fox", I'd get something that includes:
"_score" : 2.0
The default scoring model also takes this into account, but it is not the only factor.
What you are looking for is called term frequency.
The default scoring model is based on TF-IDF (term frequency and inverse document frequency) and also on field length.
You can read more about it here.
Now, coming back to your requirement: you can use the scripting module together with a function_score query.
{
"query": {
"function_score": {
"query": {
"match": {
"field": "fox"
}
},
"boost_mode": "replace",
"functions": [
{
"script_score": {
"script": "_index['field']['fox'].tf()"
}
}
]
}
}
}
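For reference, the value this script is meant to return is the plain term frequency. The Python sketch below computes it outside Elasticsearch with a naive tokenizer (lowercasing and splitting on non-word characters), which is an assumption and will not match every Elasticsearch analyzer.

```python
import re


def term_frequency(term, text):
    """Count how often term occurs as a whole token in text."""
    tokens = re.findall(r"\w+", text.lower())
    return tokens.count(term.lower())


tf = term_frequency("fox", "the quick brown fox jumped over the lazy fox")
print(tf)
```

For the sentence from the question this counts two occurrences of "fox", which is the number the asker wants to see as the score.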
