Elasticsearch completion suggester skip_duplicates non duplicates

Elasticsearch completion suggester skip_duplicates non duplicates - elasticsearch

I am confused with the skip_duplicates Elasticsearch completion suggester. The settings and mappings is like this:
{
"settings":{
"index" :{
"analysis": {
"analyzer": {
"autocomplete_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
]
}
}
}
}
},
"mappings":
{
"properties": {
"name": {
"type": "text"
},
"category": {
"type": "text"
},
"category_suggest": {
"type": "completion",
"analyzer": "autocomplete_analyzer" }
}
}
}
I have docs like this.
"category_suggest": {
"input": [
"Automotive",
"Auto",
]
}
and also have docs like this
"category_suggest": {
"input": [
"Automotive",
"Car",
]
}
If I use _search endpoint, submit this query, with skip_duplicates=true.
"suggest":
{
"suggestions":{
"prefix": "Aut",
"completion":{
"field": "category_suggest",
"size": 5,
"skip_duplicates": true
}
}
}
It only returns
"suggest": {
"suggestions": [
{
"text": "Aut",
"offset": 0,
"length": 3,
"options": [
{
"text": "Auto",
"_index": "my_index",
"_type": "_doc",
"_id": "oCNRa4IAiXc1hBrN9UnM",
"_score": 1.0
}
]
}
]
}
If I set skip_duplicates=False, both 'Auto' and 'Automotive' will be returned. Is that because in one of document, category_suggest field has both 'Auto' and 'Automotive'. I am using ES 7.10 python.
Thanks

Related

Elasticsearch Highlight the result of script fields

In the last question that I asked I want to remove the HTML tags in my search results, After that I thought I could highlite the results with a common query, But in the highlighting field I got other html contents that you removed with script. Would you please help me to highlight the results without html tags that I saved in my db?
My mapping and settings:
{
"settings": {
"analysis": {
"filter": {
"my_pattern_replace_filter": {
"type": "pattern_replace",
"pattern": "\n",
"replacement": ""
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase"
],
"char_filter": [
"html_strip"
]
},
"parsed_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"char_filter": [
"html_strip"
],
"filter": [
"my_pattern_replace_filter"
]
}
}
}
},
"mappings": {
"properties": {
"html": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"raw": {
"type": "text",
"fielddata": true,
"analyzer": "parsed_analyzer"
}
}
}
}
}
}
Search Query:
POST idx_test/_search
{
"script_fields": {
"raw": {
"script": "doc['html.raw']"
}
},
"query": {
"match": {
"html": "more"
}
},"highlight": {
"fields": {
"*":{ "pre_tags" : ["<strong>"], "post_tags" : ["</strong>"] }
}
}
}
Result:
"hits": [
{
"_index": "idx_test2",
"_type": "_doc",
"_id": "GijDsYMBjgX3UBaguGxc",
"_score": 0.2876821,
"fields": {
"raw": [
"Test More test"
]
},
"highlight": {
"html": [
"<html><body><h1 style=\"font-family: Arial\">Test</h1> <span><strong>More</strong> test</span></body></html>"
]
}
}
]
Result that I want to get:
"hits": [
{
"_index": "idx_test2",
"_type": "_doc",
"_id": "GijDsYMBjgX3UBaguGxc",
"_score": 0.2876821,
"fields": {
"raw": [
"Test <strong>More</strong> test"
]
}
]

I thought of another solution. You could index two fields, the original html and the html_extract which has only the text.
You would have to use a processor to just index the text coming from the message and highligths would work.
Mapping
PUT idx_html_strip
{
"mappings": {
"properties": {
"html": {
"type": "text"
},
"html_extract": {
"type": "text"
}
}
}
}
Processor Pipeline
PUT /_ingest/pipeline/pipe_html_strip
{
"description": "_description",
"processors": [
{
"html_strip": {
"field": "html",
"target_field": "html_extract"
}
},
{
"script": {
"lang": "painless",
"source": "ctx['html_raw'] = ctx['html_raw'].replace('\n',' ').trim()"
}
}
]
}
Index Data
Note the use ?pipeline=pipe_html_strip
POST idx_html_strip/_doc?pipeline=pipe_html_strip
{
"html": """<html><body><h1 style=\"font-family: Arial\">Test</h1> <span><strong>More</strong> test</span></body></html>"""
}
Query
GET idx_html_strip/_search?filter_path=hits.hits._source,hits.hits.highlight
{
"query": {
"multi_match": {
"query": "More",
"fields": ["html", "html_extract"]
}
},"highlight": {
"fields": {
"*":{ "pre_tags" : ["<strong>"], "post_tags" : ["</strong>"] }
}
}
}
Results
{
"hits": {
"hits": [
{
"_source": {
"html": """<html><body><h1 style=\"font-family: Arial\">Test</h1> <span><strong>More</strong> test</span></body></html>""",
"html_extract": "Test More test"
},
"highlight": {
"html": [
"""<html><body><h1 style=\"font-family: Arial\">Test</h1> <span><strong><strong>More</strong></strong> test</span></body>"""
],
"html_extract": [
"Test <strong>More</strong> test"
]
}
}
]
}
}

why is shingle token filter with analyser isn't yielding expected results?

Hi here are my index details:
PUT shingle_test
{
"settings": {
"analysis": {
"analyzer": {
"evolutionAnalyzer": {
"tokenizer": "standard",
"filter": [
"standard",
"custom_shingle"
]
}
},
"filter": {
"custom_stop": {
"type": "stop",
"stopwords": "_english_"
},
"custom_shingle": {
"type": "shingle",
"min_shingle_size": "2",
"max_shingle_size": "10",
"output_unigrams": false
}
}
}
},
"mappings": {
"legacy" : {
"properties": {
"name": {
"type": "text",
"fields": {
"shingles": {
"type": "text",
"analyzer": "standard",
"search_analyzer": "evolutionAnalyzer"
},
"as_is": {
"type": "keyword"
}
},
"analyzer": "standard"
}
}
}
}
}
Added 2 docs
PUT shingle_test/legacy/1
{
"name": "Chandni Chowk 2 Banglore"
}
PUT shingle_test/legacy/2
{
"name": "Chandni Chowk"
}
Nothing is being returned if I do this,
GET shingle_test/_search
{
"query": {
"match": {
"name": {
"query": "Chandni Chowk",
"analyzer": "evolutionAnalyzer"
}
}
}
}
Looked at all possible solutions online, didn't get any.
Also, if I do "output_unigrams": true, then it just works like match query and gives results.
The thing I'm trying to achieve:
Having these documents:
Chandni Chowk 2 Bangalore
Chandni Chowk
CCD Bangalore
Istah shawarma and biryani
Istah
So,
searching for "Chandni Chowk 2 Bangalore" should return 1, 2
searching for "Chandni Chowk" should return 1, 2
searching for "Istah shawarma and biryani" should return 4, 5
searching for "Istah" should return 4, 5
searching for "CCD Bangalore" should return 3
note: search keyword will always be exactly equal to value of the name field in the document ex: In this particular index, we can query "Chandni Chowk 2 Bangalore", "Chandni Chowk", "CCD Bangalore", "Istah shawarma and biryani", "Istah". "CCD" won't be queried on this index.

The analyzer parameter specifies the analyzer used for text analysis when indexing or searching a text field.
Modify your index mapping as
{
"settings": {
"analysis": {
"analyzer": {
"evolutionAnalyzer": {
"tokenizer": "standard",
"filter": [
"standard",
"custom_shingle"
]
}
},
"filter": {
"custom_stop": {
"type": "stop",
"stopwords": "_english_"
},
"custom_shingle": {
"type": "shingle",
"min_shingle_size": "2",
"max_shingle_size": "10",
"output_unigrams": true // note this
}
}
}
},
"mappings": {
"legacy" : {
"properties": {
"name": {
"type": "text",
"fields": {
"shingles": {
"type": "text",
"analyzer": "evolutionAnalyzer", // note this
"search_analyzer": "evolutionAnalyzer"
},
"as_is": {
"type": "keyword"
}
},
"analyzer": "standard"
}
}
}
}
}
And, the modified search query will be
{
"query": {
"match": {
"name.shingles": {
"query": "Chandni Chowk"
}
}
}
}
Search Results:
"hits": [
{
"_index": "66127416",
"_type": "_doc",
"_id": "2",
"_score": 0.25759193,
"_source": {
"name": "Chandni Chowk"
}
},
{
"_index": "66127416",
"_type": "_doc",
"_id": "1",
"_score": 0.19363807,
"_source": {
"name": "Chandni Chowk 2 Banglore"
}
}
]

elasticsearch ignore accents on search

I have an elasticsearch index with customer informations
I have some issues looking for some results with accents
for example, I have {name: 'anais'} and {name: anaïs}
Running
GET /my-index/_search
{
"size": 25,
"query": {
"match": {"name": "anaïs"}
}
}
I would like to get both same for this query, in this case I only have anaïs
GET /my-index/_search
{
"size": 25,
"query": {
"match": {"name": "anais"}
}
}
I would like to get anais and anaïs, in this case I only have anais
I tried adding an analyser
PUT /my-new-celebrity/_settings
{
"analysis": {
"analyzer": {
"default": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
}
But in this case for both search I only get anais

Looks like you forgot to apply your custom default analyzer on your name field, below is working example:
Index def with mapping and setting
{
"settings": {
"analysis": {
"analyzer": {
"default": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings" : {
"properties" :{
"name" : {
"type" : "text",
"analyzer" : "default" // note this
}
}
}
}
Index sample docs
{
"name" : "anais"
}
{
"name" : "anaïs"
}
Search query same as yours
{
"size": 25,
"query": {
"match": {
"name": "anaïs"
}
}
}
And expected both search results
"hits": [
{
"_index": "myindexascii",
"_type": "_doc",
"_id": "1",
"_score": 0.18232156,
"_source": {
"name": "anaïs"
}
},
{
"_index": "myindexascii",
"_type": "_doc",
"_id": "2",
"_score": 0.18232156,
"_source": {
"name": "anais"
}
}
]

Synonym analyzer not working

Here are my settings:
{
"countries": {
"aliases": {},
"mappings": {
"country": {
"properties": {
"countryName": {
"type": "string"
}
}
}
},
"settings": {
"index": {
"creation_date": "1472140045116",
"analysis": {
"filter": {
"synonym": {
"ignore_case": "true",
"type": "synonym",
"synonyms_path": "synonym.txt"
}
},
"analyzer": {
"synonym": {
"filter": [
"synonym"
],
"tokenizer": "whitespace"
}
}
},
"number_of_shards": "5",
"number_of_replicas": "1",
"uuid": "7-fKyD9aR2eG3BwUNdadXA",
"version": {
"created": "2030599"
}
}
},
"warmers": {}
}
}
My synonym.txt file is in the config folder inside the main elasticsearch folder.
Here is my query:
query: {
query_string: {
fields: ["countryName"],
default_operator: "AND",
query: searchInput,
analyzer: "synonym"
}
}
The words in synonym.txt are: us, u.s., united states.
So this doesn't work. What's interesting is that search works as normal, except for when I enter any of the words in the synonym.txt file. So for example, when I usually type in us into the search, I would get results. With this analyzer, us doesn't give me anything.
I've done close and open to my ES server, and still it doesn't work.
EDIT
An example of a document:
{
"_index": "countries",
"_type": "country",
"_id": "57aabeb80057405968de152b",
"_score": 1,
"_source": {
"countryName": "United States"
}
Example of searchInput (this is coming from the front-end):
united states
EDIT #2:
Here is my updated index config file:
{
"countries": {
"aliases": {},
"mappings": {},
"settings": {
"index": {
"number_of_shards": "5",
"creation_date": "1472219634083",
"analysis": {
"filter": {
"synonym": {
"ignore_case": "true",
"type": "synonym",
"synonyms_path": "synonym.txt"
}
},
"analyzer": {
"synonym": {
"filter": [
"synonym"
],
"tokenizer": "whitespace"
}
}
},
"country": {
"properties": {
"countryName": {
"type": "string",
"analyzer": "synonym"
},
"number_of_replicas": "1",
"uuid": "50ZwpIVFTqeD_rJxlmd59Q",
"version": {
"created": "2030599"
}
}
},
"warmers": {}
}
}
}
}
When I try adding documents, and doing a search on said documents, the synonym analyzer does not work for me.
EDIT #3
Here are 2 documents in the index:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [{
"_index": "stocks",
"_type": "stock",
"_id": "2",
"_score": 1,
"_source": {
"countryName": "United States"
}
}, {
"_index": "stocks",
"_type": "stock",
"_id": "1",
"_score": 1,
"_source": {
"countryName": "Canada"
}
}]
}
}

You are close, but I suggest reading thoroughly this section from the documentation to understand better this functionality.
As a solution:
PUT /countries
{
"mappings": {
"country": {
"properties": {
"countryName": {
"type": "string",
"analyzer": "synonym"
}
}
}
},
"settings": {
"analysis": {
"filter": {
"synonym": {
"ignore_case": "true",
"type": "synonym",
"synonyms_path": "synonym.txt"
}
},
"analyzer": {
"synonym": {
"filter": [
"lowercase",
"synonym"
],
"tokenizer": "whitespace"
}
}
}
}
}
You need to delete the index and create it again with the mapping above.
Then use this query:
"query": {
"query_string": {
"fields": [
"countryName"
],
"default_operator": "AND",
"query": "united states"
}
}

Have you deleted/created the index after pushing the txt ?
I think you should remove the "synonyms": "" if you are using "synonyms_path"

elasticsearch not returning text when entered partial word

I have my analyzers set like this:
"analyzer": {
"edgeNgram_autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "autocomplete"]
},
"full_name": {
"filter":["standard","lowercase","asciifolding"],
"type":"custom",
"tokenizer":"standard"
}
My filter:
"filter": {
"autocomplete": {
"type": "edgeNGram",
"side":"front",
"min_gram": 1,
"max_gram": 50
}
Name field analyzer:
"textbox": {
"_parent": {
"type": "document"
},
"properties": {
"text": {
"fields": {
"text": {
"type":"string",
"analyzer":"full_name"
},
"autocomplete": {
"type": "string",
"index_analyzer": "edgeNgram_autocomplete",
"search_analyzer": "full_name",
"analyzer": "full_name"
}
},
"type":"multi_field"
}
}
}
Put all together, makes up my mapping for docstore index:
PUT http://localhost:9200/docstore
{
"settings": {
"analysis": {
"analyzer": {
"edgeNgram_autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "autocomplete"]
},
"full_name": {
"filter":["standard","lowercase","asciifolding"],
"type":"custom",
"tokenizer":"standard"
}
},
"filter": {
"autocomplete": {
"type": "edgeNGram",
"side":"front",
"min_gram": 1,
"max_gram": 50
} }
}
},
"mappings": {
"space": {
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
}
}
},
"document": {
"_parent": {
"type": "space"
},
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
}
}
},
"textbox": {
"_parent": {
"type": "document"
},
"properties": {
"bbox": {
"type": "long"
},
"text": {
"fields": {
"text": {
"type":"string",
"analyzer":"full_name"
},
"autocomplete": {
"type": "string",
"index_analyzer": "edgeNgram_autocomplete",
"search_analyzer": "full_name",
"analyzer":"full_name"
}
},
"type":"multi_field"
}
}
},
"entity": {
"_parent": {
"type": "document"
},
"properties": {
"bbox": {
"type": "long"
},
"name": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
Add a space to hold all docs:
POST http://localhost:9200/docstore/space
{
"name": "Space 1"
}
When user enters word: proj
this should return, all text:
SampleProject
Sample Project
Project Name
myProjectname
firstProjectName
my ProjectName
But it returns nothing.
My query:
POST http://localhost:9200/docstore/textbox/_search
{
"query": {
"match": {
"text": "proj"
}
},
"filter": {
"has_parent": {
"type": "document",
"query": {
"term": {
"name": "1-a1-1001.pdf"
}
}
}
}
}
If I search by project, I get:
{ "took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 3.0133555,
"hits": [
{
"_index": "docstore",
"_type": "textbox",
"_id": "AVRuV2d_f4y6IKuxK35g",
"_score": 3.0133555,
"_routing": "AVRuVvtLf4y6IKuxK33f",
"_parent": "AVRuV2cMf4y6IKuxK33g",
"_source": {
"bbox": [
8750,
5362,
9291,
5445
],
"text": [
"Sample Project"
]
}
},
{
"_index": "docstore",
"_type": "textbox",
"_id": "AVRuV2d_f4y6IKuxK35Y",
"_score": 2.4106843,
"_routing": "AVRuVvtLf4y6IKuxK33f",
"_parent": "AVRuV2cMf4y6IKuxK33g",
"_source": {
"bbox": [
8645,
5170,
9070,
5220
],
"text": [
"Project Name and Address"
]
}
}
]
}
}
Maybe my edgengram is not suited for this?
I am saying:
side":"front"
Should I do it differently?
Does anyone know what I am doing wrong?

The problem is with the autocomplete indexing analyzer field name.
Change:
"index_analyzer": "edgeNgram_autocomplete"
To:
"analyzer": "edgeNgram_autocomplete"
And also search like (#Andrei Stefan) showed in his answer:
POST http://localhost:9200/docstore/textbox/_search
{
"query": {
"match": {
"text.autocomplete": "proj"
}
}
}
And it will work as expected!
I have tested your configuration on Elasticsearch 2.3
By the way, type multi_field is deprecated.
Hope I have managed to help :)

Your query should actually try to match on text.autocomplete and not text:
"query": {
"match": {
"text.autocomplete": "proj"
}
}

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Elasticsearch completion suggester skip_duplicates non duplicates - elasticsearch

Related

Elasticsearch Highlight the result of script fields

why is shingle token filter with analyser isn't yielding expected results?

elasticsearch ignore accents on search

Synonym analyzer not working

elasticsearch not returning text when entered partial word

Categories

Resources