ElasticSearch: Highlights every word in phrase query

How can I get Elastic Search to only highlight words that caused the document to be returned?
I have the following index:
{
  "mappings": {
    "document": {
      "properties": {
        "content": {
          "type": "string",
          "fields": {
            "english": {
              "type": "string",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}
Let's say I have indexed:
Nuclear power is the use of nuclear reactions that release nuclear
energy[5] to generate heat, which most frequently is then used in
steam turbines to produce electricity in a nuclear power station. The
term includes nuclear fission, nuclear decay and nuclear fusion.
Presently, the nuclear fission of elements in the actinide series of
the periodic table produce the vast majority of nuclear energy in the
direct service of humankind, with nuclear decay processes, primarily
in the form of geothermal energy, and radioisotope thermoelectric
generators, in niche uses making up the rest.
And search for "nuclear elements"~2
I only want "nuclear fission of elements", or parts of it, to be highlighted, but every single occurrence of "nuclear" is highlighted instead.
This is my query if it helps:
{
  "fields": [],
  "query": {
    "query_string": {
      "query": "\"nuclear elements\"~2",
      "fields": [
        "content.english"
      ]
    }
  },
  "highlight": {
    "pre_tags": [
      "<em class='h'>"
    ],
    "post_tags": [
      "</em>"
    ],
    "fragment_size": 500,
    "number_of_fragments": 20,
    "fields": {
      "content.english": {}
    }
  }
}

There is a highlighting bug in ES 2.1, introduced by this change. It has been fixed by this Pull Request.
According to the ES developer:
This is a bug that I introduced in #13239 while thinking that the
differences were due to changes in Lucene: extractUnknownQuery is also
called when span extraction already succeeded, so we should only fall
back to Weight.extractTerms if no spans have been extracted yet.
Highlighting works correctly in versions up to 2.0 and will work as expected again in future versions.

Related

Elasticsearch Rank based on rarity of a field value

I'd like to know how I can rank items lower when their field values appear frequently among the results.
Say, we have a similar result set:
"name": "Red T-Shirt"
"store": "Zara"
"name": "Yellow T-Shirt"
"store": "Zara"
"name": "Red T-Shirt"
"store": "Bershka"
"name": "Green T-Shirt"
"store": "Benetton"
I'd like to rank the documents in such a manner that the documents containing frequently found fields,
"store" in this case, are deboosted to appear lower in the results.
This is to achieve a bit of variety, so that the search doesn't yield top results from the same store.
In the example above, if I search for "T-Shirt", I want to see one Zara T-shirt at the top, with the rest of the Zara T-shirts appearing lower, after all the other unique stores.
So far I have tried using aggregation buckets for sorting and script sorting, but without success.
Is it possible to achieve this inside of the search engine?
Many thanks in advance!
This is possible with a combination of the diversified sampler aggregation and the top hits aggregation, as learned from the Elastic forum. I don't know what the performance implications are if it is used on a high-load production system. Here is a code example; use at your own risk:
{
  "query": {}, // whatever query
  "size": 0, // since we don't use hits
  "aggs": {
    "my_unbiased_sample": {
      "diversified_sampler": {
        "shard_size": 100,
        "field": "store"
      },
      "aggs": {
        "keywords": {
          "top_hits": {
            "_source": {
              "includes": [ "name", "store" ]
            },
            "size": 100
          }
        }
      }
    }
  }
}
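If the aggregation approach turns out to be too costly, the same effect can be approximated on the client side by bucketing the relevance-sorted hits per store and interleaving the buckets. A minimal sketch (plain Python, with the field names from the question; this is not an Elasticsearch feature):

```python
from collections import OrderedDict, deque

def diversify(hits, key="store"):
    """Interleave hits so each store's first hit ranks before any store's second hit.

    `hits` is a list of dicts already sorted by relevance; `key` is the field
    to diversify on. Relative order within a store is preserved.
    """
    buckets = OrderedDict()  # insertion order == relevance order of first hit
    for hit in hits:
        buckets.setdefault(hit[key], deque()).append(hit)
    result = []
    queues = deque(buckets.values())
    while queues:
        q = queues.popleft()
        result.append(q.popleft())
        if q:  # store still has hits left: re-queue it for the next round
            queues.append(q)
    return result

hits = [
    {"name": "Red T-Shirt", "store": "Zara"},
    {"name": "Yellow T-Shirt", "store": "Zara"},
    {"name": "Red T-Shirt", "store": "Bershka"},
    {"name": "Green T-Shirt", "store": "Benetton"},
]
print([h["store"] for h in diversify(hits)])
# → ['Zara', 'Bershka', 'Benetton', 'Zara']
```

Unlike diversified_sampler, this runs after the search, so it cannot rescue stores that never made it into the returned page.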

Partial update into large document

I'm facing a performance problem. My application is a chat application.
I designed the index mapping with nested objects, like below:
{
  "conversation_id-v1": {
    "mappings": {
      "stream": {
        "properties": {
          "id": {
            "type": "keyword"
          },
          "message": {
            "type": "text",
            "fields": {
              "analyzerName": {
                "type": "text",
                "term_vector": "with_positions_offsets",
                "analyzer": "analyzerName"
              },
              "language": {
                "type": "langdetect",
                "analyzer": "_keyword",
                "languages": ["en", "ko", "ja"]
              }
            }
          },
          "comments": {
            "type": "nested",
            "properties": {
              "id": {
                "type": "keyword"
              },
              "message": {
                "type": "text",
                "fields": {
                  "analyzerName": {
                    "type": "text",
                    "term_vector": "with_positions_offsets",
                    "analyzer": "analyzerName"
                  },
                  "language": {
                    "type": "langdetect",
                    "analyzer": "_keyword",
                    "languages": ["en", "ko", "ja"]
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
(The mapping actually has many more fields.)
A document has around 4,000 nested objects. When I upsert data into a document, CPU peaks at 100%, and so does disk I/O during writes. The input rate is around 1000/s.
How can I tune this to improve performance?
Hardware
3 × 2 vCPUs, 13 GB RAM on GCP
4,000 nested objects sounds like a lot - if I were you, I would look long and hard at your mapping design to be very certain you actually need that many.
Quoting from the docs:
Internally, nested objects index each object in the array as a separate hidden document.
Since a document has to be fully reindexed on update, you're indexing 4000 documents with a single update.
Why so many fields?
The reason you gave in the comments for needing so many fields
I'd like to search comments in nested and come with their parent stream for display.
makes me think that you may be mixing two concerns here.
ElasticSearch is meant for search, and your mapping should be optimized for search. If your mapping shape is dictated by the way you want to display information, then something is wrong.
Design your index around search
Note that by "search" I mean both indexing and querying.
For the use case you have, it seems like you could:
Index only the comments, with a reference (some id) to the parent stream in the indexed comment document.
After you get the search results (a list of comments) back from the search index, you can retrieve each comment along with its parent stream from some other data source (e.g. a relational database).
The point is, it may be much more efficient to re-retrieve the comment, along with whatever else you want, from some other source that is better than ElasticSearch at joining data.
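The two-step flow described above can be sketched as follows. Everything here is hypothetical: the flat comments index, the `stream_id` reference field, and the stand-ins for the ES client and the database lookup.

```python
# Sketch of the two-step retrieval, assuming a flat comments index whose
# documents carry a "stream_id" reference; all names are hypothetical.

def search_comments(es_search, query_text):
    """Step 1: search only the flat comment documents in ES."""
    return es_search({
        "query": {"match": {"message": query_text}},
        "_source": ["id", "stream_id", "message"],
    })

def attach_streams(comment_hits, fetch_stream):
    """Step 2: join each comment with its parent stream outside ES,
    e.g. from a relational database keyed by stream id."""
    return [
        {"comment": hit, "stream": fetch_stream(hit["stream_id"])}
        for hit in comment_hits
    ]

# Stand-ins so the sketch runs without a cluster or a database:
fake_es = lambda body: [{"id": "c1", "stream_id": "s9", "message": "hello"}]
hits = search_comments(fake_es, "hello")
results = attach_streams(hits, lambda sid: {"id": sid, "title": "general"})
print(results[0]["stream"]["id"])  # → s9
```

With this shape, an update touches one small comment document instead of reindexing a stream with thousands of nested objects.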

Elasticsearch changing similarity does not work

Changing the similarity algorithm of my index does not work. I want to compare BM25 vs. TF-IDF, but I always get the same results. I'm using Elasticsearch 5.x.
I have tried literally everything: setting the similarity of a property to classic or BM25, or not setting anything at all:
"properties": {
"content": {
"type": "text",
"similarity": "classic"
},
I also tried setting the default similarity of my index in the settings and using it in the properties:
"settings": {
"index": {
"number_of_shards": "5",
"provided_name": "test",
"similarity": {
"default": {
"type": "classic"
}
},
"creation_date": "1493748517301",
"number_of_replicas": "1",
"uuid": "sNuWcT4AT82MKsfAB9JcXQ",
"version": {
"created": "5020299"
}
}
The query I'm testing looks something like this:
{
  "query": {
    "match": {
      "content": "some search query"
    }
  }
}
I have created a sample below:
DELETE test
PUT test
{
  "mappings": {
    "book": {
      "properties": {
        "content": {
          "type": "text",
          "similarity": "BM25"
        },
        "subject": {
          "type": "text",
          "similarity": "classic"
        }
      }
    }
  }
}
POST test/book/1
{
"subject": "A neutron star is the collapsed core of a large (10–29 solar masses) star. Neutron stars are the smallest and densest stars known to exist.[1] Though neutron stars typically have a radius on the order of 10 km, they can have masses of about twice that of the Sun.",
"content": "A neutron star is the collapsed core of a large (10–29 solar masses) star. Neutron stars are the smallest and densest stars known to exist.[1] Though neutron stars typically have a radius on the order of 10 km, they can have masses of about twice that of the Sun."
}
POST test/book/2
{
"subject": "A quark star is a hypothetical type of compact exotic star composed of quark matter, where extremely high temperature and pressure forces nuclear particles to dissolve into a continuous phase consisting of free quarks. These are ultra-dense phases of degenerate matter theorized to form inside neutron stars exceeding a predicted internal pressure needed for quark degeneracy.",
"content": "A quark star is a hypothetical type of compact exotic star composed of quark matter, where extremely high temperature and pressure forces nuclear particles to dissolve into a continuous phase consisting of free quarks. These are ultra-dense phases of degenerate matter theorized to form inside neutron stars exceeding a predicted internal pressure needed for quark degeneracy."
}
GET test/_search?explain
{
  "query": {
    "match": {
      "subject": "neutron"
    }
  }
}
GET test/_search?explain
{
  "query": {
    "match": {
      "content": "neutron"
    }
  }
}
The subject and content fields have different similarity definitions, but the two documents I provided (from Wikipedia) contain the same text. Running the two queries, you will see something like the following in the explanations, and you will also get different scores in the results:
from the first query: "description": "idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:"
from the second one: "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
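To see that the two formulas really do produce different scores, the idf expressions quoted above can be evaluated directly. For the sample data, both documents contain "neutron", so docCount = 2 and docFreq = 2:

```python
import math

def classic_idf(doc_count, doc_freq):
    # "classic" (TF-IDF) idf, as quoted in the first explanation:
    # log((docCount + 1) / (docFreq + 1)) + 1
    return math.log((doc_count + 1) / (doc_freq + 1)) + 1

def bm25_idf(doc_count, doc_freq):
    # BM25 idf, as quoted in the second explanation:
    # log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Both sample documents contain "neutron": docCount = 2, docFreq = 2
print(classic_idf(2, 2))           # → 1.0
print(round(bm25_idf(2, 2), 4))    # → 0.1823
```

So even for a term present in every document, the two similarities assign different idf values, which is why correctly applied settings must change the scores.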

What differs between post-filter and global aggregation for faceted search?

A common problem in search interfaces is that you want to return a selection of results, but also want to return information about all documents (e.g. I want to see all red shirts, but also want to know which other colors are available).
This is sometimes referred to as "faceted results" or "faceted navigation". The example from the Elasticsearch reference is quite clear in explaining why and how, so I have used it as a base for this question.
Summary / Question: It looks like I can use either a post-filter or a global aggregation for this. Both seem to provide the exact same functionality in different ways. Are there advantages or disadvantages to either that I don't see? If so, which should I use?
I have included a complete example below, with some documents and a query for each method, based on the example in the reference guide.
Option 1: post-filter
see the example from the Elasticsearch reference
What we can do is return more results from our original query, so we can aggregate on those results and afterwards filter down to our actual results.
The example is quite clear in explaining it:
But perhaps you would also like to tell the user how many Gucci shirts are available in other colors. If you just add a terms aggregation on the color field, you will only get back the color red, because your query returns only red shirts by Gucci.
Instead, you want to include shirts of all colors during aggregation, then apply the colors filter only to the search results.
See for how this would look below in the example code.
An issue with this is that we cannot use caching. The Elasticsearch guide (not yet available for 5.1) warns about this:
Performance consideration
Use a post_filter only if you need to differentially filter search results and aggregations. Sometimes people will use post_filter for regular searches.
Don’t do this! The nature of the post_filter means it runs after the query, so any performance benefit of filtering (such as caches) is lost completely.
The post_filter should be used only in combination with aggregations, and only when you need differential filtering.
There is however a different option:
Option 2: global aggregations
There is a way to do an aggregation that is not influenced by the search query.
So instead of fetching a broad result set, aggregating on it, and then filtering, we fetch only our filtered results but run the aggregations on everything. Take a look at the reference.
We can get the exact same results. I did not read any warnings about caching for this, but it seems that in the end we need to do about the same amount of work, so that may be the only omission.
It is a tiny bit more complicated because of the sub-aggregation we need (you can't have global and a filter on the
same 'level').
The only complaint I have read about queries using this is that you might have to repeat yourself if you need it for several items. In the end we generate most of our queries, so repeating oneself isn't much of an issue for my use case, and I don't consider it an issue on par with "cannot use the cache".
Question
The two features overlap at the very least, and possibly provide exactly the same functionality. This baffles me.
Apart from that, I'd like to know if one or the other has an advantage I haven't seen, and if there is any best practice here?
Example
This is largely from the post-filter reference page, but I added the global filter query.
mapping and documents
PUT /shirts
{
  "mappings": {
    "item": {
      "properties": {
        "brand": { "type": "keyword" },
        "color": { "type": "keyword" },
        "model": { "type": "keyword" }
      }
    }
  }
}
PUT /shirts/item/1?refresh
{
  "brand": "gucci",
  "color": "red",
  "model": "slim"
}
PUT /shirts/item/2?refresh
{
  "brand": "gucci",
  "color": "blue",
  "model": "slim"
}
PUT /shirts/item/3?refresh
{
  "brand": "gucci",
  "color": "red",
  "model": "normal"
}
PUT /shirts/item/4?refresh
{
  "brand": "gucci",
  "color": "blue",
  "model": "wide"
}
PUT /shirts/item/5?refresh
{
  "brand": "nike",
  "color": "blue",
  "model": "wide"
}
PUT /shirts/item/6?refresh
{
  "brand": "nike",
  "color": "red",
  "model": "wide"
}
We now request all red Gucci shirts (items 1 and 3), the models available for those two shirts (slim and normal), and the colors Gucci offers (red and blue).
First, a post-filter: get all shirts, aggregate the models for red Gucci shirts and the colors for Gucci shirts (all colors), and post-filter for red Gucci shirts to show only those as results. (This is a bit different from the reference example, as we try to keep it as close to a clear application of post-filters as possible.)
GET /shirts/_search
{
  "aggs": {
    "colors_query": {
      "filter": {
        "term": {
          "brand": "gucci"
        }
      },
      "aggs": {
        "colors": {
          "terms": {
            "field": "color"
          }
        }
      }
    },
    "color_red": {
      "filter": {
        "bool": {
          "filter": [
            { "term": { "color": "red" } },
            { "term": { "brand": "gucci" } }
          ]
        }
      },
      "aggs": {
        "models": {
          "terms": {
            "field": "model"
          }
        }
      }
    }
  },
  "post_filter": {
    "bool": {
      "filter": [
        { "term": { "color": "red" } },
        { "term": { "brand": "gucci" } }
      ]
    }
  }
}
We could also get all red Gucci shirts (our original query), and then do a global aggregation for the model (over all red Gucci shirts) and for color (over all Gucci shirts).
GET /shirts/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "color": "red" } },
        { "term": { "brand": "gucci" } }
      ]
    }
  },
  "aggregations": {
    "color_red": {
      "global": {},
      "aggs": {
        "sub_color_red": {
          "filter": {
            "bool": {
              "filter": [
                { "term": { "color": "red" } },
                { "term": { "brand": "gucci" } }
              ]
            }
          },
          "aggs": {
            "keywords": {
              "terms": {
                "field": "model"
              }
            }
          }
        }
      }
    },
    "colors": {
      "global": {},
      "aggs": {
        "sub_colors": {
          "filter": {
            "bool": {
              "filter": [
                { "term": { "brand": "gucci" } }
              ]
            }
          },
          "aggs": {
            "keywords": {
              "terms": {
                "field": "color"
              }
            }
          }
        }
      }
    }
  }
}
Both return the same information; the second differs only in the extra level introduced by the sub-aggregations. The second query looks a bit more complex, but a real-world query is generated by code and probably far more complex anyway. It should above all be a correct query, and if that means complicated, so be it.
The actual solution we used, while not a direct answer to the question, is basically "neither".
From this elastic blogpost we got the initial hint:
Occasionally, I see an over-complicated search where the goal is to do as much as possible in as few search requests as possible. These tend to have filters as late as possible, completely in contrary to the advise in Filter First. Do not be afraid to use multiple search requests to satisfy your information need. The multi-search API lets you send a batch of search requests.
Do not shoehorn everything into a single search request.
And that is basically what we were doing in the query above: a big bunch of aggregations plus some filtering.
Running them in parallel proved to be much quicker. Have a look at the multi-search API.
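To illustrate splitting the single request apart, the NDJSON body for a multi-search could be assembled like this. This is a sketch using the field names from the shirts example; the exact split into three requests is an assumption:

```python
import json

# Three independent requests against the shirts index: the filtered hits,
# the color facet for Gucci shirts, and the model facet for red Gucci shirts.
requests = [
    {"query": {"bool": {"filter": [
        {"term": {"color": "red"}}, {"term": {"brand": "gucci"}}]}}},
    {"size": 0, "query": {"term": {"brand": "gucci"}},
     "aggs": {"colors": {"terms": {"field": "color"}}}},
    {"size": 0, "query": {"bool": {"filter": [
        {"term": {"color": "red"}}, {"term": {"brand": "gucci"}}]}},
     "aggs": {"models": {"terms": {"field": "model"}}}},
]

# _msearch expects alternating header/body lines, newline-terminated NDJSON.
lines = []
for body in requests:
    lines.append(json.dumps({"index": "shirts"}))  # header line
    lines.append(json.dumps(body))                 # body line
ndjson = "\n".join(lines) + "\n"
print(ndjson.count("\n"))  # → 6 (three header/body pairs)
# POST this to /_msearch with Content-Type: application/x-ndjson
```

The three responses come back in order, so the client can pair the hits and the two facets without any shared state.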
In both cases Elasticsearch ends up doing mostly the same thing. If I had to choose, I think I'd use the global aggregation, which might save you some overhead from having to feed two Lucene collectors at once.

ElasticSearch and highlighting performance - plain vs. fast vector highlighter

I am running into performance issues when running a query that uses both slop and the fast vector highlighter. Interestingly, the performance issue goes away when running the same query with the plain highlighter, and I am not sure why.
Here's the metadata for the field being searched:
"contents": {
  "store": true,
  "search_analyzer": "mySearchAnalyzer",
  "term_vector": "with_positions_offsets",
  "type": "string"
}
The following query, which uses the fast vector highlighter, takes over 60 seconds:
{
  "size": 500,
  "query": {
    "query_string": {
      "query": "\"CATERPILLAR FINANCIAL SERVICES ASIA PTE LTD\"~5",
      "fields": [
        "contents"
      ],
      "default_operator": "and"
    }
  },
  "highlight": {
    "fields": {
      "contents": {}
    }
  }
}
However, if I change the query to use the plain highlighter, it takes only a few milliseconds:
{
  "size": 500,
  "query": {
    "query_string": {
      "query": "\"CATERPILLAR FINANCIAL SERVICES ASIA PTE LTD\"~5",
      "fields": [
        "contents"
      ],
      "default_operator": "and"
    }
  },
  "highlight": {
    "fields": {
      "contents": { "type": "plain" }
    }
  }
}
I have looked at the various highlighter options (such as fragment_size, fragment_offset, phrase_limit), but nothing stands out as an obvious setting to improve performance.
Any ideas on what is going on here, or what kind of settings I could try to improve performance?
Note: One reason we switched from the plain to the fast vector highlighter was that some queries failed with the plain highlighter.
Edit: I've added the reproduction steps which demonstrate the issue in the following link:
https://drive.google.com/file/d/0B-IfDOojIDnIQmpkY2RNN2pMREE/edit?usp=sharing
I think the key is that there is a field which contains lots of similar values (e.g. in this case, Caterpillar is referenced many times).
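Of the options the question lists, phrase_limit is the one that directly bounds how many phrase matches the fast vector highlighter considers per document, which is exactly what blows up when a term like "Caterpillar" occurs many times. A hedged sketch of the request with that cap applied (the specific values are illustrative, not recommendations):

```python
# Sketch: same query as above, but with the fast vector highlighter
# requested explicitly and its phrase_limit capped. Values are illustrative.
query_body = {
    "size": 500,
    "query": {
        "query_string": {
            "query": "\"CATERPILLAR FINANCIAL SERVICES ASIA PTE LTD\"~5",
            "fields": ["contents"],
            "default_operator": "and",
        }
    },
    "highlight": {
        "fields": {
            "contents": {
                "type": "fvh",        # request the fast vector highlighter explicitly
                "phrase_limit": 64,   # cap on matched phrases considered per document
                "fragment_size": 150,
                "number_of_fragments": 3,
            }
        }
    },
}
print(sorted(query_body["highlight"]["fields"]["contents"]))
```

Capping phrase_limit trades completeness of highlighting for bounded work, so snippets may miss some occurrences in pathological documents.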
While not strictly an answer: based on comments from Duc.Duong, who was not able to reproduce the issue, I tried reproducing it with the version we are using (0.90.3) and the latest version (1.3.2). It turns out the issue no longer reproduces on the latest version - the search returns right away.
So, bottom line, this issue does not reproduce with the latest version. I am not sure where it was fixed, but the problem occurs in 0.90.3.
