Elasticsearch changing similarity does not work - elasticsearch

Changing the similarity algorithm of my index does not work. I want to compare BM25 vs. TF-IDF, but I always get the same results. I'm using Elasticsearch 5.x.
I have tried literally everything: setting the similarity of a property to classic or BM25, or not setting it at all.
"properties": {
"content": {
"type": "text",
"similarity": "classic"
},
I also tried setting the default similarity of my index in the settings and using it in the properties:
"settings": {
"index": {
"number_of_shards": "5",
"provided_name": "test",
"similarity": {
"default": {
"type": "classic"
}
},
"creation_date": "1493748517301",
"number_of_replicas": "1",
"uuid": "sNuWcT4AT82MKsfAB9JcXQ",
"version": {
"created": "5020299"
}
}
The query I'm testing looks something like this:
{
  "query": {
    "match": {
      "content": "some search query"
    }
  }
}

I have created a sample below:
DELETE test
PUT test
{
  "mappings": {
    "book": {
      "properties": {
        "content": {
          "type": "text",
          "similarity": "BM25"
        },
        "subject": {
          "type": "text",
          "similarity": "classic"
        }
      }
    }
  }
}
POST test/book/1
{
  "subject": "A neutron star is the collapsed core of a large (10–29 solar masses) star. Neutron stars are the smallest and densest stars known to exist.[1] Though neutron stars typically have a radius on the order of 10 km, they can have masses of about twice that of the Sun.",
  "content": "A neutron star is the collapsed core of a large (10–29 solar masses) star. Neutron stars are the smallest and densest stars known to exist.[1] Though neutron stars typically have a radius on the order of 10 km, they can have masses of about twice that of the Sun."
}
POST test/book/2
{
  "subject": "A quark star is a hypothetical type of compact exotic star composed of quark matter, where extremely high temperature and pressure forces nuclear particles to dissolve into a continuous phase consisting of free quarks. These are ultra-dense phases of degenerate matter theorized to form inside neutron stars exceeding a predicted internal pressure needed for quark degeneracy.",
  "content": "A quark star is a hypothetical type of compact exotic star composed of quark matter, where extremely high temperature and pressure forces nuclear particles to dissolve into a continuous phase consisting of free quarks. These are ultra-dense phases of degenerate matter theorized to form inside neutron stars exceeding a predicted internal pressure needed for quark degeneracy."
}
GET test/_search?explain
{
  "query": {
    "match": {
      "subject": "neutron"
    }
  }
}
GET test/_search?explain
{
  "query": {
    "match": {
      "content": "neutron"
    }
  }
}
The subject and content fields have different similarity definitions, but in the two documents I provided (taken from Wikipedia) they contain the same text. Running the two queries, you will get different scores in the results and see explanations like this:
from the first query: "description": "idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:"
from the second one: "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
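The two idf formulas quoted above are enough to explain the score difference. A quick sketch in plain Python, plugging in docCount = 2 and docFreq = 2 (since "neutron" occurs in both sample documents):

```python
import math

doc_count = 2  # both sample documents are in the index
doc_freq = 2   # "neutron" appears in both documents

# classic (TF-IDF) idf, as quoted in the first explanation
classic_idf = math.log((doc_count + 1) / (doc_freq + 1)) + 1

# BM25 idf, as quoted in the second explanation
bm25_idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

print(classic_idf)  # 1.0
print(bm25_idf)     # ~0.182
```

So even with identical text and identical term statistics, the two similarities produce different idf components, which is why the scores (and explanations) differ between the two queries.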

Related

ElasticSearch-How to combine results of different queries to improve Mean Average Precision

I am making a query A on Elasticsearch and get the first 50 results. I also make a query B which contains 30% of the terms of query A. Each result of query A has a similarity score scoreA and each result of B has scoreB.
What I am trying to achieve is to combine the results of A and B to improve the Mean Average Precision of each individual query. One way that I found is to reorder the results based on this formula:
SIMnew = λ*scoreA + (1-λ)*scoreB
where λ is a hyperparameter which I should tune. I noticed that this formula is very similar to Jelinek-Mercer smoothing, which is implemented in Elasticsearch (https://www.elastic.co/blog/language-models-in-elasticsearch).
Is there any default way to do this reordering with Elasticsearch, or is the only way a custom implementation?
(Given that I searched a lot about this formula and didn't find anything useful, it would be great if someone gave me an intuition of how and why this works.)
Combining the results of different queries in Elasticsearch is commonly achieved with a bool query. The way they are combined can be adjusted with a function_score query.
If you need to combine different per-field scoring functions (also known as similarities), for instance to run the same query with BM25 and DFR and combine their results, indexing the same field several times with multi-fields can help.
Now let me explain how this thing works.
Find official website of David Gilmour
Let's imagine we have an index with the following mapping and example documents:
PUT mysim
{
  "mappings": {
    "_doc": {
      "properties": {
        "url": {
          "type": "keyword"
        },
        "title": {
          "type": "text"
        },
        "abstract": {
          "type": "text"
        }
      }
    }
  }
}
PUT mysim/_doc/1
{
  "url": "https://en.wikipedia.org/wiki/David_Bowie",
  "title": "David Bowie - Wikipedia",
  "abstract": "David Robert Jones (8 January 1947 – 10 January 2016), known professionally as David Bowie was an English singer-songwriter and actor. He was a leading ..."
}
PUT mysim/_doc/2
{
  "url": "https://www.davidbowie.com/",
  "title": "David Bowie | The official website of David Bowie | Out Now ...",
  "abstract": "David Bowie | The official website of David Bowie | Out Now Glastonbury 2000."
}
PUT mysim/_doc/3
{
  "url": "https://www.youtube.com/channel/UC8YgWcDKi1rLbQ1OtrOHeDw",
  "title": "David Bowie - YouTube",
  "abstract": "This is the official David Bowie channel. Features official music videos and live videos from throughout David's career, including Space Oddity, Changes, Ash..."
}
PUT mysim/_doc/4
{
  "url": "www.davidgilmour.com/",
  "title": "David Gilmour | The Voice and Guitar of Pink Floyd | Official Website",
  "abstract": "David Gilmour is a guitarist and vocalist with British rock band Pink Floyd, and was voted No. 1 in Fender's Greatest Players poll in the February 2006 Guitarist ..."
}
Practically speaking, we have David Gilmour's official website, David Bowie's official website, and two other pages about David Bowie.
Let's try to search for David Gilmour's official website:
POST mysim/_search
{
  "query": {
    "match": {
      "abstract": "david gilmour official"
    }
  }
}
On my machine this returns the following results:
"hits": [
...
"_score": 1.111233,
"_source": {
"title": "David Bowie | The official website of David Bowie | Out Now ...",
...
"_score": 0.752356,
"_source": {
"title": "David Gilmour | The Voice and Guitar of Pink Floyd | Official Website",
...
"_score": 0.68324494,
"_source": {
"title": "David Bowie - YouTube",
...
For some reason, David Gilmour's page is not the first one.
If we take 30% of terms from the first query, like the original post is asking (let's cunningly select gilmour to make our example shine), we should see an improvement:
POST mysim/_search
{
  "query": {
    "match": {
      "abstract": "gilmour"
    }
  }
}
Now Elasticsearch only returns one hit:
"hits": [
...
"_score": 0.5956734,
"_source": {
"title": "David Gilmour | The Voice and Guitar of Pink Floyd | Official Website",
Let's say we don't want to discard all the other results; we just want to reorder them so that David Gilmour's website is higher in the results. What can we do?
Use simple bool query
The purpose of the bool query is to combine the results of several queries in OR, AND, or NOT fashion. In our case we could go with OR:
POST mysim/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "abstract": "david gilmour official"
          }
        },
        {
          "match": {
            "abstract": "gilmour"
          }
        }
      ]
    }
  }
}
This seems to do the job (on my machine):
"hits": [
...
"_score": 1.3480294,
"_source": {
"title": "David Gilmour | The Voice and Guitar of Pink Floyd | Official Website",
...
"_score": 1.111233,
"_source": {
"title": "David Bowie | The official website of David Bowie | Out Now ...",
...
"_score": 0.68324494,
"_source": {
"title": "David Bowie - YouTube",
...
What the bool query does under the hood is simply sum the scores of the subqueries. In this case the top hit's score, 1.3480294, is the sum of the document's scores against the two stand-alone queries we ran above:
>>> 0.752356 + 0.5956734
1.3480294000000002
But this might not be good enough. What if we want to combine these scores with different coefficients?
Combine queries with different coefficients
To achieve this we can use function_score query.
POST mysim/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "function_score": {
            "query": {
              "match": {
                "abstract": "david gilmour official"
              }
            },
            "boost": 0.8
          }
        },
        {
          "function_score": {
            "query": {
              "match": {
                "abstract": "gilmour"
              }
            },
            "boost": 0.2
          }
        }
      ]
    }
  }
}
Here we implement the formula from the original post with λ = 0.8.
"hits": [
...
"_score": 0.8889864,
"_source": {
"title": "David Bowie | The official website of David Bowie | Out Now ...",
...
"_score": 0.7210195,
"_source": {
"title": "David Gilmour | The Voice and Guitar of Pink Floyd | Official Website",
...
On my machine this still produces "wrong" ordering.
But changing λ to 0.4 seems to do the job! Hooray!
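You can verify the combined scores by hand from the per-query scores shown earlier (a quick check in plain Python; the individual scores are the ones from my machine and will vary):

```python
# Per-query scores observed above (machine-dependent)
gilmour_a, gilmour_b = 0.752356, 0.5956734  # full query, "gilmour"-only query
bowie_a, bowie_b = 1.111233, 0.0            # the Bowie page does not match "gilmour"

def combined(score_a, score_b, lam):
    # SIMnew = λ*scoreA + (1-λ)*scoreB, as in the original post
    return lam * score_a + (1 - lam) * score_b

# λ = 0.8 reproduces the scores above: Bowie still wins
print(combined(bowie_a, bowie_b, 0.8))      # 0.8889864
print(combined(gilmour_a, gilmour_b, 0.8))  # ~0.7210195

# λ = 0.4 flips the order in Gilmour's favour
print(combined(gilmour_a, gilmour_b, 0.4))  # ~0.658
print(combined(bowie_a, bowie_b, 0.4))      # ~0.444
```

Note that the boost values act as the λ and (1-λ) coefficients only because function_score multiplies the inner query's score by the boost and bool sums the clauses.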
What if I want to combine different similarities?
In case you need to go deeper and modify how Elasticsearch computes relevance per field (which is called similarity), it can be done by defining a custom scoring model.
In a case which I can hardly imagine, you may want to combine, say, BM25 and DFR scoring. Elasticsearch only permits one scoring model per field, but it also lets you analyze the same field several times via multi fields.
The mapping might look like this:
PUT mysim
{
  "mappings": {
    "_doc": {
      "properties": {
        "url": {
          "type": "keyword"
        },
        "title": {
          "type": "text"
        },
        "abstract": {
          "type": "text",
          "similarity": "BM25",
          "fields": {
            "dfr": {
              "type": "text",
              "similarity": "my_similarity"
            }
          }
        }
      }
    }
  },
  "settings": {
    "index": {
      "similarity": {
        "my_similarity": {
          "type": "DFR",
          "basic_model": "g",
          "after_effect": "l",
          "normalization": "h2",
          "normalization.h2.c": "3.0"
        }
      }
    }
  }
}
Notice that here we defined a new similarity called my_similarity which effectively computes DFR (example taken from the documentation).
Now we will be able to do a bool query with a combination of similarities in the following way:
POST mysim/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "abstract": "david gilmour official"
          }
        },
        {
          "match": {
            "abstract.dfr": "david gilmour official"
          }
        }
      ]
    }
  }
}
Notice that we run the same query against two different fields. Here abstract.dfr is a "virtual" field whose scoring model is set to DFR.
What else should I consider?
In Elasticsearch scores are computed per-shard, which can lead to unexpected results. For example, IDF is computed not on the whole index, but only on the subset of documents that are in the same shard.
Here you can read how Lucene, Elasticsearch's backbone, computes relevance scores.
Hope that helps!

ElasticSearch (5.5) query or algorithm required to extract values against timestamp with an interference pattern

I have a very large volume of documents in ElasticSearch (5.5) which hold recorded data at regular time intervals, let's say every 3 seconds.
{
  "#timestamp": "2015-10-14T12:45:00Z",
  "channel1": 24.4
},
{
  "#timestamp": "2015-10-14T12:48:00Z",
  "channel1": 25.5
},
{
  "#timestamp": "2015-10-14T12:51:00Z",
  "channel1": 26.6
}
Let's say that I need to get results back for a query that asks for the point value every 5 seconds. An interference pattern arises where sometimes there will be an exact match (for simplicity's sake, let's say in the example above that 12:45 is the only sample to land on a multiple of five).
At those times, I want Elasticsearch to give me the exact value recorded at that time, if there is one. So at 12:45 there is a match and it returns the value 24.4.
In the other cases, I require the last (previously recorded) value. So at 12:50, having no data at that precise time, it would return the value at 12:48 (25.5), being the last known value.
Previously I have used aggregations, but they don't help here: I don't want some average made from a bucket of data. I need either the exact value for an exact time match, or the previous value if there is no match.
I could do this programmatically, but performance is a real issue, so I need the most performant method possible to retrieve the data in the way stated. Returning ALL the data from Elasticsearch and iterating over the results, checking for a match at each time interval and otherwise keeping the item at index i-1, sounds slow, and I wonder whether it is really the best way.
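For reference, the client-side fallback described above (walk the sorted samples once, carrying the last known value forward onto each requested tick) can be sketched as follows. The 5-unit interval and field values mirror the question; the function name and integer tick representation are assumptions for illustration:

```python
def sample_with_carry_forward(samples, start, end, step):
    """Return (tick, value) pairs every `step` units from start to end.

    `samples` must be (timestamp, value) pairs sorted by timestamp.
    At each tick, the exact value is used if one exists; otherwise the
    last previously recorded value (None before the first sample).
    """
    out = []
    i = 0
    last = None
    for tick in range(start, end + 1, step):
        # Consume every sample recorded at or before this tick.
        while i < len(samples) and samples[i][0] <= tick:
            last = samples[i][1]
            i += 1
        out.append((tick, last))
    return out

# Minutes past 12:00, mirroring the documents above:
samples = [(45, 24.4), (48, 25.5), (51, 26.6)]
print(sample_with_carry_forward(samples, 45, 55, 5))
# [(45, 24.4), (50, 25.5), (55, 26.6)]
```

This is a single linear pass over the data, so the cost is dominated by transferring the documents out of Elasticsearch, which is exactly the overhead the question hopes to avoid.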
Perhaps I am missing a trick with Elastic. Perhaps somebody knows a method to do exactly what I am after?! It would be much appreciated...
The mapping is like so:
"mappings": {
"sampleData": {
"dynamic": "true",
"dynamic_templates": [{
"pv_values_template": {
"match": "GroupId", "mapping": { "doc_values": true, "store": false, "type": "keyword" }
}
}],
"properties": {
"#timestamp": { "type": "date" },
"channel1": { "type": "float" },
"channel2": { "type": "float" },
"item": { "type": "object" },
"keys": { "properties": { "count": { "type": "integer" }}},
"values": { "properties": { "count": { "type": "integer" }}}
}
}
}
and the (NEST) method being called looks like so:
channelAggregation => channelAggregation.DateHistogram("HistogramFilter", histogram => histogram
.Field(dataRecord => dataRecord["#timestamp"])
.Interval(interval)
.MinimumDocumentCount(0)
.ExtendedBounds(start, end)
.Aggregations(aggregation => DataFieldAggregation(channelNames, aggregation)));
@Nikolay there may be up to around 1400 buckets (a maximum of one value to be returned per pixel available on the chart)

Elasticsearch: better to have more values or more fields?

Suppose you have an index with documents describing vehicles.
Your index needs to deal with two different types of vehicle: motorcycles and cars.
Which of the following mappings is better from a performance point of view?
(nested is required for my purposes)
"vehicle": {
"type": "nested",
"properties": {
"car": {
"properties": {
"model": {
"type": "string"
},
"cost": {
"type": "integer"
}
}
},
"motorcycle": {
"properties": {
"model": {
"type": "string"
},
"cost": {
"type": "integer"
}
}
}
}
}
or this one:
"vehicle": {
"type": "nested",
"properties": {
"model": {
"type": "string"
},
"cost": {
"type": "integer"
},
"vehicle_type": {
"type": "string" ### "car", "motorcycle"
}
}
}
The second one is more readable and leaner.
But the drawback is that when I build my queries, if I want to focus only on "car", I need to put this condition into the query.
If I use the first mapping, I just access the stored field directly, without adding overhead to the query.
The first mapping, where cars and motorcycles are isolated in different fields, is more likely to be faster. The reason is that you have one filter fewer to apply, as you already noted, and because of the increased selectivity of the queries (e.g. fewer documents match a given value of vehicle.car.model than vehicle.model).
Another option would be to create two distinct indexes, car and motorcycle, possibly sharing the same index template.
In Elasticsearch, a query is processed by a single thread per shard. That means that if you split your index in two and query both in a single request, the two will be searched in parallel.
So, when you need to query only cars or only motorcycles, it's faster simply because the indexes are smaller. And when it comes to querying both cars and motorcycles, it could also be faster by using more threads.
EDIT: one drawback of the latter option you should know about: the inner Lucene term dictionary will be duplicated, and if the values in car and motorcycle are largely identical, this doubles the list of indexed terms.

Buckets of documents grouped by term frequency

I want to segment Elasticsearch results into buckets, such that similar documents (those with the most matching terms) are grouped together (on an analyzed field) in the results. I'm not sure how to go about building aggregated buckets of individual documents this way.
Here's the basic mapping:
PUT movies
{
  "mappings": {
    "movie": {
      "properties": {
        "id": { "type": "long" },
        "title": { "type": "text" }
      }
    }
  }
}
Now, for example, if a query is done for hunger then the results should be grouped as buckets of matching documents with most number of similar terms:
{
  "buckets": {
    "1": [
      { "title": "The Hunger Games" },
      { "title": "The Hunger Games: Mockingjay" },
      { "title": "The Hunger Games: Catching Fire" }
    ],
    "2": [
      { "title": "Aqua Teen Hunger Force" },
      { "title": "Force of Hunger" }
    ],
    "3": [
      { "title": "Hunger Pain" }
    ],
    ...
  }
}
In the above example, similar documents are grouped into separate buckets based on at least two matching terms. All matching titles without similar terms are still included in the results as separate buckets (e.g. bucket #3).
Any suggestions are appreciated.
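There is no built-in aggregation that buckets hits by shared terms, so one option is to post-process the hits client-side. A minimal sketch of the grouping described above (plain Python, union-find over titles, merging any two titles that share at least two terms; the naive whitespace tokenization and the threshold of two are assumptions, not anything Elasticsearch provides):

```python
from collections import defaultdict

def bucket_titles(titles, min_shared=2):
    # Naive "analyzer": lowercase, strip colons, split on whitespace.
    term_sets = [set(t.lower().replace(":", "").split()) for t in titles]
    # Union-find: merge any two titles sharing >= min_shared terms.
    parent = list(range(len(titles)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            if len(term_sets[i] & term_sets[j]) >= min_shared:
                parent[find(i)] = find(j)
    # Collect each connected component into a bucket.
    buckets = defaultdict(list)
    for i, title in enumerate(titles):
        buckets[find(i)].append(title)
    return list(buckets.values())

titles = [
    "The Hunger Games",
    "The Hunger Games: Mockingjay",
    "The Hunger Games: Catching Fire",
    "Aqua Teen Hunger Force",
    "Force of Hunger",
    "Hunger Pain",
]
for bucket in bucket_titles(titles):
    print(bucket)
```

On the example data this yields exactly the three buckets shown above. Note that stop words like "the" and "of" count as terms in this sketch; a real implementation would reuse the index's analyzer (e.g. via the _analyze API) instead.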

ElasticSearch: Highlights every word in phrase query

How can I get Elasticsearch to only highlight the words that caused the document to be returned?
I have the following index
{
  "mappings": {
    "document": {
      "properties": {
        "content": {
          "type": "string",
          "fields": {
            "english": {
              "type": "string",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}
Let's say I have indexed:
Nuclear power is the use of nuclear reactions that release nuclear
energy[5] to generate heat, which most frequently is then used in
steam turbines to produce electricity in a nuclear power station. The
term includes nuclear fission, nuclear decay and nuclear fusion.
Presently, the nuclear fission of elements in the actinide series of
the periodic table produce the vast majority of nuclear energy in the
direct service of humankind, with nuclear decay processes, primarily
in the form of geothermal energy, and radioisotope thermoelectric
generators, in niche uses making up the rest.
And search for "nuclear elements"~2
I only want "nuclear fission of elements" (or parts of it) to be highlighted, but every single occurrence of nuclear is highlighted.
This is my query if it helps:
{
  "fields": [],
  "query": {
    "query_string": {
      "query": "\"nuclear elements\"~2",
      "fields": [
        "content.english"
      ]
    }
  },
  "highlight": {
    "pre_tags": [
      "<em class='h'>"
    ],
    "post_tags": [
      "</em>"
    ],
    "fragment_size": 500,
    "number_of_fragments": 20,
    "fields": {
      "content.english": {}
    }
  }
}
There is a highlighting bug in ES 2.1, which was caused by this change. It has been fixed by this Pull Request.
According to ES developer
This is a bug that I introduced in #13239 while thinking that the
differences were due to changes in Lucene: extractUnknownQuery is also
called when span extraction already succeeded, so we should only fall
back to Weight.extractTerms if no spans have been extracted yet.
Highlighting works as expected in versions up to 2.0, and the fix will ship in future releases.