ElasticSearch: Highlighting with Stemming - elasticsearch

I have read this question and attempted to understand the documentation here, but this is complicated.
The problem (I think):
[update 1]
I am using Scala for my code and interface with ES High Level Java API.
I have a stemming analyzer configured. If I search for responsibilities i get results for responsibilities and responsibility. That's great.
BUT
Only the documents with the term responsibilities return highlights.
This is because the search is on the stemmed content , i.e., responsib. However, the highlight is against the unstemmed content. Hence, it finds responsibilities which was a search criteria, but not responsibility, which wasn't.
If I set the highlighter to highlight on the stemmed content, it returns nothing at all. I guess because it is comparing resonsib with responsibilities
Search
I an using the Java high level API. The problem is not the code itself.
Currently, I am highlighting only the content field, returning only responsibilities. Highlighting content.english seems to return nothing
private def buildHighlighter(): HighlightBuilder = {
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder
val highlightBuilder = new HighlightBuilder
val highlightContent = new HighlightBuilder.Field("content")
highlightContent.highlighterType("unified")
highlightBuilder.field(highlightContent)
highlightBuilder
}
Mapping (adumbrated)
{
"settings": {
"number_of_shards": 3,
"analysis": {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_keywords": {
"type": "keyword_marker",
"keywords": []
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
},
"analyzer": {
"english": {
"tokenizer": "standard",
"filter": [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_keywords",
"english_stemmer"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"title": {
"type": "text",
"fields": {
"english": {
"type": "text",
"analyzer": "english"
}
}
},
"content": {
"type": "text",
"fields": {
"english": {
"type": "text",
"analyzer": "english"
}
}
}
}
}
}
[update 2]
Scala code to implement search:
def searchByField(indices: Seq[ESIndexName], terms: Seq[(String, String)], size: Int = 20): SearchResponse = {
val searchRequest = new SearchRequest
searchRequest.indices(indices.map(idx => idx.completeIndexName()): _*)
searchRequest.source(buildTargetFieldsMatchQuery(terms, size))
searchRequest.indicesOptions(IndicesOptions.strictSingleIndexNoExpandForbidClosed())
client.search(searchRequest, RequestOptions.DEFAULT)
}
and query is built as follows:
private def buildTargetFieldsMatchQuery(termsByField: Seq[(String, String)], size: Int): SearchSourceBuilder = {
val query = new BoolQueryBuilder
termsByField.foreach {
case (field, term) =>
if (field == "content") {
logger.debug(field + " should have " + term)
query.should(new MatchQueryBuilder(field+standardAnalyzer, term.toLowerCase))
query.should(new MatchQueryBuilder(field, term.toLowerCase))
}
else if (field == "title"){
logger.debug(field + " should have " + term)
query.should(new MatchQueryBuilder(field+standardAnalyzer, term.toLowerCase())).boost
}
else {
logger.debug(field + " should have " + term)
query.should(new MatchQueryBuilder(field, term.toLowerCase))
}
}
val sourceBuilder: SearchSourceBuilder = new SearchSourceBuilder()
sourceBuilder.query(query)
sourceBuilder.from(0)
sourceBuilder.size(size)
sourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS))
sourceBuilder.highlighter(buildHighlighter())
}

With plain REST the following is working fine for me:
PUT test
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_keywords": {
"type": "keyword_marker",
"keywords": []
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
},
"analyzer": {
"english": {
"tokenizer": "standard",
"filter": [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_keywords",
"english_stemmer"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"content": {
"type": "text",
"fields": {
"english": {
"type": "text",
"analyzer": "english"
}
}
}
}
}
}
}
POST test/_doc/
{
"content": "This is my responsibility"
}
POST test/_doc/
{
"content": "These are my responsibilities"
}
GET test/_search
{
"query": {
"match": {
"content.english": "responsibilities"
}
},
"highlight": {
"fields": {
"content.english": {
"type": "unified"
}
}
}
}
The result is then:
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "5D5PPGoBqgTTLzdtM-_Y",
"_score" : 0.18232156,
"_source" : {
"content" : "This is my responsibility"
},
"highlight" : {
"content.english" : [
"This is my <em>responsibility</em>"
]
}
},
{
"_index" : "test",
"_type" : "_doc",
"_id" : "5T5PPGoBqgTTLzdtZe8U",
"_score" : 0.18232156,
"_source" : {
"content" : "These are my responsibilities"
},
"highlight" : {
"content.english" : [
"These are my <em>responsibilities</em>"
]
}
}
]
Looking at your Java / Groovy (?) code it looks close enough to the example in the docs. Could you log the actual query you are running, so we can figure out what is going wrong? Generally it should work like this.

Related

Elasticsearch `Procter&Gamble` and `Procter & Gamble` issues

My task is:
* Make procter&gamble and procter & gamble produce the same results including score
* Make it universal, not via synonyms, as it can be any other Somehow&Somewhat
* Highlight procter&gamble or procter & gamble, not separate tokens if the phrase matches
* I want to use simple_query_stringas I allow search operators
* Make AT&T searchable as well
Here is my snippet. The problems that procter&gamble or procter & gamble searches produce different scores and this different documents as the result.
But the user expects the same result for procter&gamble or procter & gamble
DELETE /english_example
PUT /english_example
{
"settings": {
"analysis": {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_keywords": {
"type": "keyword_marker",
"keywords": ["example"]
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
},
"acronymns": {
"type": "word_delimiter_graph",
"catenate_all" : true,
"preserve_original":true
},
"acronymns_": {
"type": "word_delimiter_graph",
"catenate_all" : true,
"preserve_original":true
},
"custom_stop_words_filter": {
"type": "stop",
"ignore_case": true,
"stopwords": [ "t" ]
}
},
"analyzer": {
"default": {
"tokenizer": "whitespace",
"char_filter": [
"ampersand_filter"
],
"filter": [
"english_possessive_stemmer",
"lowercase",
"acronymns",
"flatten_graph",
"english_stop",
"custom_stop_words_filter",
"english_keywords",
"english_stemmer"
]
}
},
"char_filter": {
"ampersand_filter": {
"type": "pattern_replace",
"pattern": "(?=[^&]*)( {0,}& {0,})(?=[^&]*)",
"replacement": "_and_"
},
"ampersand_filter2": {
"type": "mapping",
"mappings": [
"& => _and_"
]
}
}
}
}
}
PUT /english_example/_bulk
{ "index" : { "_id" : "1" } }
{ "description" : "wi-fi AT&T BB&T Procter & Gamble, some\nOther $500 games with Peter's", "contents" : "Much text with somewhere I meet Procter or Gamble" }
{ "index" : { "_id" : "2" } }
{ "description" : "Procter & Gamble", "contents" : "Much text with somewhere I meet Procter and Gamble" }
{ "index" : { "_id" : "3" } }
{ "description" : "Procter&Gamble", "contents" : "Much text with somewhere I meet Procter & Gamble" }
{ "index" : { "_id" : "4" } }
{ "description" : "Come Procter&Gamble", "contents" : "Much text with somewhere I meet Procter&Gamble" }
{ "index" : { "_id" : "5" } }
{ "description" : "Tome Procter & Gamble", "contents" : "Much text with somewhere I don't meet AT&T" }
# "query": "procter & gamble",
GET english_example/_search
{
"query": {
"simple_query_string": {
"query": "procter & gamble",
"default_operator": "or",
"fields": [
"description^2",
"contents^80"
]
}
},
"highlight": {
"fields": {
"description": {},
"contents": {}
}
}
}
# "query": "procter&gamble",
GET english_example/_search
{
"query": {
"simple_query_string": {
"query": "procter&gamble",
"default_operator": "or",
"fields": [
"description^2",
"contents^80"
]
}
},
"highlight": {
"fields": {
"description": {},
"contents": {}
}
}
}
# "query": "at&t",
GET english_example/_search
{
"query": {
"simple_query_string": {
"query": "at&t",
"default_operator": "or",
"fields": [
"description^2",
"contents^80"
]
}
},
"highlight": {
"fields": {
"description": {},
"contents": {}
}
}
}
In my snippet I redefine the default analyzer using word_delimiter_graph and whitespace tokenizer to search AT&T matches as well.
One option I can think of is to use a should query with a "standard analyzer" and your custom analyzer.
For "proctor & gamble" tokens generated using custom and standard analyzer will be "proctor","gamble"
For "proctor&gamble" tokens generated using custom analyzer will be "proctor","gamble","proctor&gamble" and using standard analyzer will "proctor" and "gamble"
So in should clause we can use a standard analyzer to look for "proctor" or "gamble" and a custom analyzer to look for "proctor&gamble"
GET english_example/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"description": {
"query": "Procter&Gamble",
"analyzer": "standard"
}
}
},
{
"match": {
"description": {
"query": "Procter&Gamble"
}
}
}
],
"minimum_should_match": 1
}
}
}
Second option will be to use synonymns where you define all variations in which proctor and gamble can appear to mean a single thing
I just realized that you are searching a description field and not a company field. So keyword analyzer wont work. I have updated my answer accordingly.
You can potentially try adding a custom field with lowercase and whitespace analyzer and use the same custom analyzer for search as well. When you perform search, search in both standard field and this custom field as a multimatch search. That should allow you to support both. You can boost the score for custom field so that exact matches comes in the top of the search results.
Trick is to convert user input to lower case before performing the search. You shouldn't use user input as is. Else this approach wont work.
You can use below scripts to try it out.
DELETE /test1
PUT /test1
{
"settings": {
"analysis": {
"analyzer": {
"lowercase_analyzer" : {
"filter" : ["lowercase"],
"type" : "custom",
"tokenizer" : "whitespace"
}
}
}
},
"mappings": {
"properties": {
"description" : {
"type": "text",
"analyzer": "standard",
"fields": {
"custom" : {
"type" : "text",
"analyzer" : "lowercase_analyzer",
"search_analyzer" : "lowercase_analyzer"
}
}
}
}
}
}
PUT /test1/_bulk
{ "index" : { "_id" : "1" } }
{ "description" : "wi-fi AT&T BB&T Procter & Gamble, some\nOther $500 games with Peter's" }
{ "index" : { "_id" : "2" } }
{ "description" : "Procter & Gamble" }
{ "index" : { "_id" : "3" } }
{ "description" : "Procter&Gamble" }
GET test1/_search
{
"query": {
"multi_match": {
"query": "procter&gamble",
"fields": ["description", "description.custom"]
}
},
"highlight": {
"fields": {
"description": {},
"description.custom": {}
}
}
}
GET test1/_search
{
"query": {
"multi_match": {
"query": "procter",
"fields": ["description", "description.custom"]
}
},
"highlight": {
"fields": {
"description": {},
"description.custom": {}
}
}
}
GET test1/_search
{
"query": {
"multi_match": {
"query": "at&t",
"fields": ["description", "description.custom"]
}
},
"highlight": {
"fields": {
"description": {},
"description.custom": {}
}
}
}
GET test1/_search
{
"query": {
"multi_match": {
"query": "procter & gamble",
"fields": ["description", "description.custom"]
}
},
"highlight": {
"fields": {
"description": {},
"description.custom": {}
}
}
}
You can add highlighting and try it out.

ElasticSearch "more like this" returning empty result

I made a very simple test to figure out my mistake, but did not find it. I created two indexes and I'm trying to search documents in the ppa index that are similar to a given document in the ods index (like the second example here https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html).
These are my settings, mappings and documents for the ppa index:
PUT /ppa
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"filter": {
"brazilian_stop": {
"type": "stop",
"stopwords": "_brazilian_"
},
"brazilian_stemmer": {
"type": "stemmer",
"language": "brazilian"
}
},
"analyzer": {
"brazilian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"brazilian_stop",
"brazilian_stemmer"
]
}
}
}
}
}
PUT /ppa/_mapping/ppa
{"properties": {"descricao": {"type": "text", "analyzer": "brazilian"}}}
POST /_bulk
{"index":{"_index":"ppa","_type":"ppa"}}
{"descricao": "erradicar a pobreza"}
{"index":{"_index":"ppa","_type":"ppa"}}
{"descricao": "erradicar a pobreza"}
These are my settings, mappings and documents for the ods index:
PUT /ods
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"filter": {
"brazilian_stop": {
"type": "stop",
"stopwords": "_brazilian_"
},
"brazilian_stemmer": {
"type": "stemmer",
"language": "brazilian"
}
},
"analyzer": {
"brazilian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"brazilian_stop",
"brazilian_stemmer"
]
}
}
}
}
}
PUT /ods/_mapping/ods
{"properties": {"metaodsdescricao": {"type": "text", "analyzer": "brazilian"},"metaodsid": {"type": "integer"}}}
POST /_bulk
{"index":{"_index":"ods","_type":"ods", "_id" : "1" }}
{ "metaodsdescricao": "erradicar a pobreza","metaodsid": 1}
{"index":{"_index":"ods","_type":"ods", "_id" : "2" }}
{"metaodsdescricao": "crianças que vivem na pobreza", "metaodsid": 2}
Now, this search doesn't work:
GET /ppa/ppa/_search
{
"query": {
"more_like_this" : {
"fields" : ["descricao"],
"like" : [
{
"_index" : "ods",
"_type" : "ods",
"_id" : "1"
}
],
"min_term_freq" : 1,
"min_doc_freq" : 1,
"max_query_terms" : 20
}
}
}
But this one does work:
GET /ppa/ppa/_search
{
"query": {
"more_like_this" : {
"fields" : ["descricao"],
"like" : ["erradicar a pobreza"],
"min_term_freq" : 1,
"min_doc_freq" : 1,
"max_query_terms" : 20
}
}
}
What is happening?
Please, help me make this return something other than empty.
The "more like this" query work well when you have indexed a lot of data. The empty result can be symptom of very few documents present in the elastic index.

Elasticsearch phrase suggester prefix phonetic differences

I was wondering if there is any way for the phrase suggester to correct prefix spelling mistakes on phonetic differences.
Elasticsearch 5.1.2
Testing in Kibana 5.1.2
For Example:
Instead of "circus" someone wrote "sircus", or instead of "coding" someone wrote "koding".
Funny thing is, that instead of "phrase" you can write "frase" and get a suggestion.
Here is my setup.
Settings:
PUT text_index
{
"settings": {
"analysis": {
"analyzer": {
"suggests_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
"shingle_filter"
],
"type": "custom"
},
"reverse": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "reverse"]
}
},
"filter": {
"shingle_filter": {
"min_shingle_size": 2,
"max_shingle_size": 5,
"type": "shingle"
}
}
}
},
"mappings": {
"testtype": {
"properties": {
"suggest_field": {
"type": "text",
"analyzer": "suggests_analyzer",
"fields": {
"reverse": {
"type": "text",
"analyzer": "reverse"
}
}
}
}
}
}
}
Some documents:
POST test_index/test_type/_bulk
{"index":{}}
{ "suggest_field": "phrase"}
{"index":{}}
{ "suggest_field": "Circus"}
{"index":{}}
{ "suggest_field": "Coding"}
Querying:
POST /so-index/_search
{
"suggest" : {
"text" : "sircus",
"simple_phrase" : {
"phrase" : {
"field" : "suggest_field",
"max_errors": 0.9,
"highlight": {
"pre_tag": "<em>",
"post_tag": "</em>"
},
"direct_generator" : [ {
"field" : "suggest_field",
"suggest_mode" : "always"
}, {
"field" : "suggest_field.reverse",
"suggest_mode" : "always",
"pre_filter" : "reverse",
"post_filter" : "reverse"
}]
}
}
}
}
Also, I repeat following steps a few times (between 5 and 10) without changing anything:
delete index
put index, settings & mappings
add documents
query (codign)
Sometimes I get suggestions and sometimes I don't. Is there any explanation for it?
Try setting "prefix_length": 0 in the direct_generator.

Wrong indexation elasticsearch using the analyser

I did a pretty simple test. I build a student index and a type, then I define a mapping:
POST student
{
"mappings" : {
"ing3" : {
"properties" : {
"quote": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
After that I add 3 students to this index:
POST /student/ing3/1
{
"name": "Smith",
"first_name" : "John",
"quote" : "Learning is so cool!!"
}
POST /student/ing3/2
{
"name": "Roosevelt",
"first_name" : "Franklin",
"quote" : "I learn everyday"
}
POST /student/ing3/3
{
"name": "Black",
"first_name" : "Mike",
"quote" : "I learned a lot at school"
}
At this point I thought that the english tokeniser will tokenise all the word in my quotes so if I'm making a search like:
GET /etudiant/ing3/_search
{
"query" : {
"term" : { "quote" : "learn" }
}
}
I will have all the document as a result since my tokeniser will make equal "learn, learning, learned" and I was right. But when I try this request:
GET /student/ing3/_search
{
"query" : {
"term" : { "quote" : "learned" }
}
}
I got zero hit and in my opinion I should have the 3rd document (at least?). But for me Elasticsearch is also supposed to index learned and learning not only learn. Am I wrong? Is my request wrong?
If you check:
GET 'index/_analyze?field=quote' -d "I learned a lot at school"
you will see that your sentence is analyzed as:
{
"tokens":[
{
"token":"i",
"start_offset":0,
"end_offset":1,
"type":"<ALPHANUM>",
"position":0
},
{
"token":"learn",
"start_offset":2,
"end_offset":9,
"type":"<ALPHANUM>",
"position":1
},
{
"token":"lot",
"start_offset":12,
"end_offset":15,
"type":"<ALPHANUM>",
"position":3
},
{
"token":"school",
"start_offset":19,
"end_offset":25,
"type":"<ALPHANUM>",
"position":5
}
]
}
So english analyzer removes punctions and stop words and tokenize words in their root form.
https://www.elastic.co/guide/en/elasticsearch/guide/current/using-language-analyzers.html
You can use match query which will also analyze your search text so will match:
GET /etudiant/ing3/_search
{
"query" : {
"match" : { "quote" : "learned" }
}
}
There is another way. You can both stem the terms (the english analyzer does have a stemmer), but also keep the original terms, by using a keyword_repeat token filter and then using a unique token filter with "only_on_same_position": true to remove unnecessary duplicates after the stemming:
PUT student
{
"settings": {
"analysis": {
"analyzer": {
"myAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"keyword_repeat",
"english_stemmer",
"unique_stem"
]
}
},
"filter": {
"unique_stem": {
"type": "unique",
"only_on_same_position": true
},
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
}
}
},
"mappings": {
"ing3": {
"properties": {
"quote": {
"type": "string",
"analyzer": "myAnalyzer"
}
}
}
}
}
In this case the term query will work, as well. If you look at what terms are actually being indexed:
GET /student/_search
{
"fielddata_fields": ["quote"]
}
it will be clear why now it matches:
"hits": [
{
"_index": "student",
"_type": "ing3",
"_id": "2",
"_score": 1,
"_source": {
"name": "Roosevelt",
"first_name": "Franklin",
"quote": "I learn everyday"
},
"fields": {
"quote": [
"everydai",
"everyday",
"i",
"learn"
]
}
},
{
"_index": "student",
"_type": "ing3",
"_id": "1",
"_score": 1,
"_source": {
"name": "Smith",
"first_name": "John",
"quote": "Learning is so cool!!"
},
"fields": {
"quote": [
"cool",
"learn",
"learning",
"so"
]
}
},
{
"_index": "student",
"_type": "ing3",
"_id": "3",
"_score": 1,
"_source": {
"name": "Black",
"first_name": "Mike",
"quote": "I learned a lot at school"
},
"fields": {
"quote": [
"i",
"learn",
"learned",
"lot",
"school"
]
}
}
]

Search results ordered by search-text-length/match length

I have this simple mapping:
PUT testindex
{
"settings": {
"analysis": {
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "edgeNGram"]
}
},
"filter" : {
"ngram" : {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 15
}
}
}
},
"mappings": {
"test": {
"properties": {
"name": {
"type": "string",
"analyzer" : "ngram_analyzer"
}
}
}
}
}
With these values:
PUT testindex/test/1
{"name" : "Power"}
PUT testindex/test/2
{"name" : "Pow"}
PUT testindex/test/3
{"name" : "PowerMax"}
PUT testindex/test/4
{"name" : "PowerRangers"}
And searched this:
GET testindex/test/_search
{
"query": {
"match": {
"name": "Po"
}
}
}
And got:
PowerRangers
Power
Pow
PowerMax
All with the same score of 0.2876821
Clearly, the closest result to "Po" is "Pow", and that I expect to receive first; but I don't.
How Should I modify my mapping to behave by this logic?
I think scripted sorting is the solution, but it comes with a performance decrease drawback. See here more about this. And the query you can use is this:
GET testindex/test/_search
{
"query": {
"match": {
"name": "Po"
}
},
"sort": {
"_script": {
"script": "_source['name'].value.length",
"type": "number",
"order": "asc"
}
}
}

Resources