Right way to search users by partial username or name using the ngram tokenizer in Elasticsearch

I want to build a search feature for a social networking application so that users can search for other users by username or name, even when they only enter part of the username or name, using Elasticsearch.
For example:
input: okma
result: {"username": "alokmahor", "name": "Alok Singh Mahor"} // partial match in username
input: m90
result: {"username": "ram9012", "name": "Ram Singh"} // partial match in username
input: shn
result: {"username": "r2020", "name": "Krishna Kumar"} // partial match with name
After reading and experimenting with the links below, I came up with a partial solution, but I am not sure it is the correct approach.
I followed:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html
How to search for a part of a word with ElasticSearch
My solution is
DELETE my_index
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"properties": {
"username": { "type": "text", "analyzer": "my_analyzer" },
"name": { "type": "text", "analyzer": "my_analyzer" }
}
}
}
PUT /my_index/_doc/1
{
"username": "alokmahor",
"name": "Alok Singh Mahor"
}
PUT /my_index/_doc/2
{
"username": "ram9012",
"name": "Ram Singh"
}
PUT /my_index/_doc/3
{
"username": "r2020",
"name": "Krishna Kumar"
}
GET my_index/_search
{
"query": {
"multi_match": {
"query": "shn",
"analyzer": "my_analyzer",
"fields": ["username", "name"]
}
}
}
Somehow this solution is only partially working, and I am not sure it is really the correct way, since I arrived at it by playing around with Elasticsearch features and copy-pasting example code. So please suggest the correct way or improvements to it.
Things which are not working:
// "sin" does not match "Singh", but "Sin" matches and works.
GET my_index/_search
{
"query": {
"multi_match": {
"query": "sin",
"analyzer": "my_analyzer",
"fields": ["username", "name"]
}
}
}

So please suggest correct way
The degree of correctness can only be defined by your requirements. You can keep refining by checking all the possible use cases one by one.
improvement on this
For the problem you mention, where "Sin" matches while "sin" does not: this is because the analyzer you defined does not make the search case-insensitive. To do so, add the lowercase filter to your analyzer definition as below:
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer",
"filter": [
"lowercase"
]
}
}
This answer can help you understand more about case-insensitive search.
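For completeness, here is a minimal sketch of the full index definition from the question with the lowercase filter added (same index, analyzer, and field names as above); after reindexing, "sin", "Sin", and "SIN" should all match "Singh":
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "username": { "type": "text", "analyzer": "my_analyzer" },
      "name": { "type": "text", "analyzer": "my_analyzer" }
    }
  }
}
Because the fields declare my_analyzer in the mapping, the multi_match query no longer needs to pass "analyzer": "my_analyzer" explicitly; the field's analyzer is used at search time by default.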

Related

Elasticsearch sort by text field keyword

I have an index with these settings:
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter",
"asciifolding",
"elision",
"standard"
]
},
"autocomplete_search": {
"tokenizer": "lowercase"
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": "3",
"max_gram": "32"
}
}
}
and this mapping for the name field:
"name": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
Now I have several examples of names in documents. Name is a single field containing both first name and last name:
--макс---
-макс -
{something} макс
макс {something}
I am using this query to find documents with that name, with alphabetical sorting:
{
"query": {
"match": {
"name": {
"query": "макс",
"operator" : "and"
}
}
},
"sort": [
{"name.keyword" : "asc"}
]
}
It is bringing results as I wrote, but I expect макс {something} to come in the first position ahead of the others, because it starts with the query I typed. Can somebody help me here?
By default the query scores documents based on how well they matched, and this score is used to rank the best matches first. But as soon as you define a sort, you are saying: ignore the query score and use only this field to rank the results. The results are still restricted to documents matching the query, but the idea of best match is lost unless you keep the special value _score somewhere in your sort statement.
Like this:
"sort": [
{
"productLine.keyword": {
"order": "desc"
}
},
{
"_score": {
"order": "desc"
}
}
]
Maybe you can just remove the sort and get the results you want based on the default score-based ordering. Include a few example documents to make this fully reproducible if you want more support from the SO community.
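Applied to the question's name field, a minimal sketch of keeping relevance first and using the keyword sub-field only as a tie-breaker would be (whether макс {something} then ends up first still depends on how each document scores):
{
  "query": {
    "match": {
      "name": {
        "query": "макс",
        "operator": "and"
      }
    }
  },
  "sort": [
    { "_score": { "order": "desc" } },
    { "name.keyword": { "order": "asc" } }
  ]
}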

Configure the highlighted part in Elasticsearch

Main question
The user is looking for a name and enters part of it, let's say au, and the document with the text paul is found.
I would like to have the doc highlighted like p<em>au</em>l.
How can I achieve it if I have a complex search query (combination of match, prefix, wildcard to rule relevance)?
Sub question
When do the highlight settings from the documentation for type, boundary_scanner and boundary_chars come into play? As per my tests described below, these settings don't change the highlighted part.
Try 1: Wildcard query with default analyzer
PUT myindex
{
"mappings": {
"properties": {
"name": {
"type": "text",
"term_vector": "with_positions_offsets"
}
}
}
}
POST myindex/_doc/1
{
"name": "paul"
}
GET myindex/_search
{
"query": {
"wildcard": {"name": "*au*"}
},
"highlight": {
"fields": {
"name": {}
},
"type": "fvh",
"boundary_scanner": "chars",
"boundary_chars": "abcdefghijklmnopqrstuvwxyz.,!? \t\n"
}
}
This kind of search returns the highlight <em>paul</em>, but I need to get p<em>au</em>l.
Try 2: Match query with NGRAM analyzer
This one works as described in SO question: Highlighting part of word in elasticsearch
PUT myindexngram
{
"settings": {
"analysis": {
"tokenizer": {
"ngram_tokenizer": {
"type": "nGram",
"min_gram": "2",
"max_gram": "3",
"token_chars": [
"letter",
"digit"
]
}
},
"analyzer": {
"index_ngram_analyzer": {
"type": "custom",
"tokenizer": "ngram_tokenizer",
"filter": [
"lowercase"
]
},
"search_term_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": "lowercase"
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "index_ngram_analyzer",
"term_vector": "with_positions_offsets"
}
}
}
}
POST myindexngram/_doc/1
{
"name": "paul"
}
GET myindexngram/_search
{
"query": {
"match": {"name": "au"}
},
"highlight": {
"fields": {
"name": {}
}
}
}
This highlights p<em>au</em>l as desired but:
Highlighting depends on the query type, so combining match and wildcard will again result in <em>paul</em>.
Highlighting is not affected at all by the type, boundary_scanner and boundary_chars settings.
Elastic version 7.13.4
Response from Elasticsearch team:
A highlighter works on terms, so only full terms can be highlighted, whatever the terms in your index are. In your second example, au could be highlighted because it is a term in the index, which is not the case in your first example.
There is also an option to define your own highlight_query that could be different from the main query, but this could lead to unpredictable highlights.
https://discuss.elastic.co/t/configure-highlighted-part/295164
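As a sketch of that highlight_query option (index and field from the second attempt above; assuming the combined query mixes match and wildcard), highlighting can be driven by the ngram match alone:
GET myindexngram/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "name": "au" } },
        { "wildcard": { "name": "*au*" } }
      ]
    }
  },
  "highlight": {
    "fields": {
      "name": {
        "highlight_query": {
          "match": { "name": "au" }
        }
      }
    }
  }
}
As the Elasticsearch team notes, highlights produced this way can differ from what the main query actually matched.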

How to search by words written together among data where these words are written apart in Elasticsearch?

I have documents which have, let's say, one field: the name of the document. The name may consist of several words written apart, for example:
{
"name": "first document"
},
{
"name": "second document"
}
My goal is to be able to search for these documents by strings:
firstdocument, seconddocumen
As you can see, the search strings are misspelled, but they still match those documents if we delete the whitespace from the documents' names. This could be handled by creating another field with the same string but without whitespace, but that seems like storing extra data unless there is no other way to do it.
I need something similar to this:
GET /_analyze
{
"tokenizer": "whitespace",
"filter": [
{
"type":"shingle",
"max_shingle_size":3,
"min_shingle_size":2,
"output_unigrams":"true",
"token_separator": ""
}
],
"text": "first document"
}
But the other way around: I need to apply this not to the search text but to the indexed objects (the document names), so that I can find documents even with a small misspelling in the search text. How should it be done?
I suggest using multi-fields with an analyzer for removing whitespaces.
Analyzer
"no_spaces": {
"filter": [
"lowercase"
],
"char_filter": [
"remove_spaces"
],
"tokenizer": "standard"
}
Char Filter
"remove_spaces": {
"type": "pattern_replace",
"pattern": "[ ]",
"replacement": ""
}
Field Mapping
"name": {
"type": "text",
"fields": {
"without_spaces": {
"type": "text",
"analyzer": "no_spaces"
}
}
}
Query
GET /_search
{
"query": {
"match": {
"name.without_spaces": {
"query": "seconddocumen",
"fuzziness": "AUTO"
}
}
}
}
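Assembled into one index definition, a minimal sketch could look like this (the index name my_docs is hypothetical; the mapping is shown in the typeless ES 7 style):
PUT my_docs
{
  "settings": {
    "analysis": {
      "char_filter": {
        "remove_spaces": {
          "type": "pattern_replace",
          "pattern": "[ ]",
          "replacement": ""
        }
      },
      "analyzer": {
        "no_spaces": {
          "tokenizer": "standard",
          "char_filter": ["remove_spaces"],
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "without_spaces": {
            "type": "text",
            "analyzer": "no_spaces"
          }
        }
      }
    }
  }
}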
EDIT:
For completeness: an alternative to the remove_spaces approach could be the shingle filter:
"analysis": {
"filter": {
"shingle_filter": {
"type": "shingle",
"output_unigrams": "false",
"token_separator": ""
}
},
"analyzer": {
"shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"shingle_filter"
]
}
}
}
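To use it, the shingled variant could be added as another sub-field; a sketch (the sub-field name shingled and the standard search_analyzer are assumptions, the latter so that single-word query strings are not swallowed by output_unigrams: false):
"name": {
  "type": "text",
  "fields": {
    "shingled": {
      "type": "text",
      "analyzer": "shingle_analyzer",
      "search_analyzer": "standard"
    }
  }
}

GET /_search
{
  "query": {
    "match": {
      "name.shingled": {
        "query": "seconddocumen",
        "fuzziness": "AUTO"
      }
    }
  }
}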

Elastic search partial substring search

I am trying to implement partial substring search in Elasticsearch 7.1 using the following analyzer:
PUT my_index-001
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"autocomplete"
]
},
"autocomplete_search": {
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
}
},
"filter": {
"autocomplete": {
"type": "nGram",
"min_gram": 2,
"max_gram": 40
}
}
}
},
"mappings": {
"doc": {
"properties": {
"title": {
"type": "string",
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search"
}
}
}
}
}
After that I tried adding some sample data to my_index-001 with type doc:
PUT my_index-001/doc/1
{
"title": "ABBOT Series LTD 2014"
}
PUT my_index-001/doc/2
{
"title": "ABBOT PLO LTD 2014A"
}
PUT my_index-001/doc/3
{
"title": "ABBOT TXT"
}
PUT my_index-001/doc/4
{
"title": "ABBOT DMO LTD. 2016-II"
}
Query used to perform partial search:
GET my_index-001/_search
{
"query": {
"match": {
"title": {
"query": "ABB",
"operator": "or"
}
}
}
}
I was expecting the following behaviour from the analyzer:
If I type in ABB I should get doc IDs 1, 2, 3, 4.
If I type in ABB 2014 I should get doc IDs 1, 2.
If I type in ABBO PLO I should get doc 2.
If I type in TXT I should get doc 3.
With the above analyzer settings I am not getting the expected results.
Please let me know if I am missing anything in my analyzer settings for Elasticsearch.
You were almost there but there are a couple of issues.
When creating index mappings through Kibana Dev Tools, there mustn't be any whitespace between the URI and the request body. You have whitespace in the first code snippet which caused ES to ignore the request body entirely! So remove that whitespace.
The maximum ngram difference is set to 1 by default. In order to use your high ngram intervals, you'll need to explicitly increase the index-level setting max_ngram_diff:
PUT my_index-001
{
"settings": {
"index": {
"max_ngram_diff": 40 <--
},
...
}
}
Type names are deprecated in v7. So is the nGram token filter in favor of ngram (lowercase g). And so is the string field type too! Here's the corrected PUT request body:
PUT my_index-001 <--- no whitespace after the URI!
{
"settings": {
"index": {
"max_ngram_diff": 40 <--- explicit setting
},
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"autocomplete"
]
},
"autocomplete_search": {
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
}
},
"filter": {
"autocomplete": {
"type": "ngram", <--- ngram, not nGram
"min_gram": 2,
"max_gram": 40
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text", <--- text, not string
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search"
}
}
}
}
Since the different mapping types have been deprecated in favor of the generic _doc type, you'll need to adjust the way you insert documents. The only difference, luckily, is changing doc to _doc in the URI:
PUT my_index-001/_doc/1
{ "title": "ABBOT Series LTD 2014" }
PUT my_index-001/_doc/2
{ "title": "ABBOT PLO LTD 2014A" }
PUT my_index-001/_doc/3
{ "title": "ABBOT TXT" }
PUT my_index-001/_doc/4
{ "title": "ABBOT DMO LTD. 2016-II" }
Finally, your query is perfectly fine and should behave the way you expect it to. The only thing to change is the operator to and when querying for two or more substrings, i.e.:
GET my_index-001/_search
{
"query": {
"match": {
"title": {
"query": "ABB 2014",
"operator": "and"
}
}
}
}
Other than that, all four of your test scenarios should return what you expect.
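For instance, the third scenario can be verified like this (a quick sketch; with operator set to and, only doc 2 contains both the abbo and plo substrings):
GET my_index-001/_search
{
  "query": {
    "match": {
      "title": {
        "query": "ABBO PLO",
        "operator": "and"
      }
    }
  }
}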

Why is my elastic search prefix query case-sensitive despite using lowercase filters on both index and search?

The Problem
I am working on an autocompleter using ElasticSearch 6.2.3. I would like my query results (a list of pages with a Name field) to be ordered using the following priority:
Prefix match at start of "Name" (Prefix query)
Any other exact (whole word) match within "Name" (Term query)
Fuzzy match (this is currently done on a different field from Name using an ngram tokenizer ... so I assume it cannot be relevant to my problem, but I would like to apply this on the Name field as well)
My Attempted Solution
I will be using a Bool/Should query consisting of three queries (corresponding to the three priorities above), using boost to define relative importance.
The issue I am having is with the Prefix query - it appears to not be lowercasing the search query despite my search analyzer having the lowercase filter. For example, the below query returns "Harry Potter" for 'harry' but returns zero results for 'Harry':
{ "query": { "prefix": { "Name.raw" : "Harry" } } }
I have verified using the _analyze API that both my analyzers do indeed lowercase the text "Harry" to "harry". Where am I going wrong?
From the ES documentation I understand I need to analyze the Name field in two different ways to enable use of both Prefix and Term queries:
using the "keyword" tokenizer to enable the Prefix query (I have applied this on a .raw field)
using a standard analyzer to enable the Term (I have applied this on the Name field)
I have checked duplicate questions such as this one, but the answers have not helped.
My mapping and settings are below
ES Index Mapping
{
"myIndex": {
"mappings": {
"pages": {
"properties": {
"Id": {},
"Name": {
"type": "text",
"fields": {
"raw": {
"type": "text",
"analyzer": "keywordAnalyzer",
"search_analyzer": "pageSearchAnalyzer"
}
},
"analyzer": "pageSearchAnalyzer"
},
"Tokens": {}, // Other fields not important for this question
}
}
}
}
}
ES Index Settings
{
"myIndex": {
"settings": {
"index": {
"analysis": {
"filter": {
"ngram": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "15"
}
},
"analyzer": {
"keywordAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "keyword"
},
"pageSearchAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "standard"
},
"pageIndexAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding",
"ngram"
],
"type": "custom",
"tokenizer": "standard"
}
}
},
"number_of_replicas": "1",
"uuid": "l2AXoENGRqafm42OSWWTAg",
"version": {}
}
}
}
}
Prefix queries don't analyze the search terms, so the text you pass into them bypasses whatever would be used as the search analyzer (in your case, the configured search_analyzer: pageSearchAnalyzer) and evaluates Harry as-is directly against the keyword-tokenized, custom-filtered harry potter that was the result of the keywordAnalyzer applied at index time.
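You can confirm this with the _analyze API (index and analyzer names as in the question); the keyword tokenizer plus lowercase filter produces a single lowercased token, so the unanalyzed prefix Harry has nothing to match:
GET myIndex/_analyze
{
  "analyzer": "keywordAnalyzer",
  "text": "Harry Potter"
}
// returns one token: "harry potter"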
In your case here, you'll need to do one of a few different things:
Since you're using a lowercase filter on the field, you could just always use lowercase terms in your prefix query, using application-side lowercasing if necessary (see the sketch after this list)
Run a match query against an edge_ngram-analyzed field instead of a prefix query like described in the ES search_analyzer docs
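For the first option, the prefix query from the question simply needs a lowercased term (a minimal sketch, with the lowercasing done by the calling application):
{ "query": { "prefix": { "Name.raw" : "harry" } } }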
Here's an example of the latter:
1) Create the index w/ ngram analyzer and (recommended) standard search analyzer
PUT my_index
{
"settings": {
"index": {
"analysis": {
"filter": {
"ngram": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "15"
}
},
"analyzer": {
"pageIndexAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding",
"ngram"
],
"type": "custom",
"tokenizer": "keyword"
}
}
}
}
},
"mappings": {
"pages": {
"properties": {
"name": {
"type": "text",
"fields": {
"ngram": {
"type": "text",
"analyzer": "pageIndexAnalyzer",
"search_analyzer": "standard"
}
}
}
}
}
}
}
2) Index some sample docs
POST my_index/pages/_bulk
{"index":{}}
{"name":"Harry Potter"}
{"index":{}}
{"name":"Hermione Granger"}
3) Run a match query against the ngram field
POST my_index/pages/_search
{
"query": {
"match": {
"query": "Har",
"operator": "and"
}
}
}
I think it is better to use the match_phrase_prefix query without the .keyword suffix. Check the docs here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase-prefix.html
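A minimal sketch of that against the Name field from the question (match_phrase_prefix analyzes its input, so casing is handled, and the last term is treated as a prefix):
GET myIndex/_search
{
  "query": {
    "match_phrase_prefix": {
      "Name": "Harry Pot"
    }
  }
}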
