Multiple tokenizers inside one Custom Analyser in Elasticsearch - elasticsearch

I am using a custom NGram analyzer with an ngram tokenizer, plus a lowercase filter. The query works fine for searches without special characters, but when I search for certain symbols it fails: since I have restricted token_chars to letters and digits, Elasticsearch doesn't index the symbols. I know the whitespace tokenizer could help me solve the issue. How can I use two tokenizers in a single analyzer? Below is the mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer":"my_tokenizer",
"filter":"lowercase"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
Is there a way I could solve this issue?

As per the Elasticsearch documentation,
An analyzer must have exactly one tokenizer.
However, you can have multiple analyzers defined in the settings, and you can configure a separate analyzer for each field.
If you want a single field to be indexed with different analyzers, one option is to make that field a multi-field, as per this link:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "whitespace"
"fields": {
"ngram": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
}
}
If you configure it as above, your query needs to make use of both the title and title.ngram fields:
GET my_index/_search
{
"query": {
"multi_match": {
"query": "search ##$ whatever",
"fields": [
"title",
"title.ngram"
],
"type": "most_fields"
}
}
}
As another option, here is what you can do:
Create two indexes.
The first index has the field title with the analyzer my_analyzer.
The second index has the field title with the whitespace analyzer.
Create the same alias for both of them as below.
Execute the below:
POST _aliases
{
"actions":[
{
"add":{
"index":"index A",
"alias":"index"
}
},
{
"add":{
"index":"index B",
"alias":"index"
}
}
]
}
So when you eventually write a query, point it at this alias, which in turn queries both indexes.
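For example, a search against the alias (a minimal sketch; the alias name index and the title field are taken from the examples above) hits both indexes at once:
GET index/_search
{
  "query": {
    "match": {
      "title": "search ##$ whatever"
    }
  }
}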
Hope this helps!

If you want to use two tokenizers, you need two analyzers,
something like this:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer":"my_tokenizer",
"filter":"lowercase"
},
"my_analyzer_2": {
"tokenizer":"whitespace",
"filter":"lowercase"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
In general, you should also pay attention to where the analyzer is specified in the mapping.
Sometimes it is necessary to set an analyzer for both search time and index time:
"mappings":{
"_doc":{
"properties":{
"title":{
"type":"text",
"analyzer":"my_analyzer",
"search_analyzer":"my_analyzer"
}
}
}
}

1) You can try updating your token_chars like below:
"token_chars":[
"letter",
"digit",
"symbol",
"punctuation"
]
2) If that does not work, then try the below analyzer:
{
"settings":{
"analysis":{
"filter":{
"my_filter":{
"type":"ngram",
"min_gram":3,
"max_gram":3,
"token_chars":[
"letter",
"digit",
"symbol",
"punctuation"
]
}
},
"analyzer":{
"my_analyzer":{
"type":"custom",
"tokenizer":"keyword",
"filter":[
"lowercase",
"like_filter"
]
}
}
}
},
"mappings":{
"_doc":{
"properties":{
"title":{
"type":"text",
"analyzer":"my_analyzer"
}
}
}
}
}
You need to use the keyword tokenizer and then an ngram token filter in your analyzer.
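As a quick sanity check, you can inspect which tokens such an analyzer emits with the _analyze API (a sketch; the index name and sample text are only placeholders):
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Foo #Bar"
}
The response lists the generated tokens, so you can confirm the symbols are preserved before running the real search.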

I found a solution for using a synonym filter and edge_ngram at the same time.
The main logic works like this: one analyzer is used at index time and the other at search time.
PUT test_synonym_and_autocomplete
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 3,
"max_gram": 10
},
"synonym": {
"type": "synonym",
"synonyms_path": "synonyms.txt"
}
},
"analyzer": {
"autocomplete_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
},
"synonym_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"kstem",
"synonym"
]
}
}
}
},
"mappings": {
"properties": {
"text": {
"type": "text",
"analyzer": "autocomplete_analyzer",
"search_analyzer": "synonym_analyzer"
}
}
}
}
POST test_synonym_and_autocomplete/_doc
{
"text": "Quick Brown Fox"
}
POST test_synonym_and_autocomplete/_doc
{
"text": "B3Round"
}
GET test_synonym_and_autocomplete/_search
{
"query": {
"match": {
"text": {
"query": "test"
}
}
}
}
GET test_synonym_and_autocomplete/_search
{
"query": {
"match": {
"text": {
"query": "qui"
}
}
}
}
Results:

Related

How to match partial words in elastic search text search

I have a field name in my Elasticsearch index with a value of Single V.
Now if I search it with a value of S or Sing, I get no results, but if I enter the full value Single, then I get the result Single V. The query I am using is as follows:
{
"query": {
"match": {
"name": "singl"
}
},
"sort": []
}
This gives me no results. Do I need to change the mapping/settings for name, or the analyzer?
EDIT:
I am trying to create the following index with the following mapping/settings:
PUT my_cars
{
"settings": {
"analysis": {
"normalizer": {
"sortable": {
"filter": ["lowercase"]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 1,
"max_gram": 36,
"token_chars": [
"letter"
]
}
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword",
"normalizer": "sortable"
}
}
}
}
}
}
But I get the following error:
{
"error" : {
"root_cause" : [
{
"type" : "illegal_argument_exception",
"reason" : "analyzer [tokenizer] must specify either an analyzer type, or a tokenizer"
}
],
"type" : "illegal_argument_exception",
"reason" : "analyzer [tokenizer] must specify either an analyzer type, or a tokenizer"
},
"status" : 400
}
Elasticsearch by default uses a standard analyzer for the text field if no analyzer is specified. This will tokenize "Single V" into "single" and "v". Due to this, you are getting the result for "Single" and not for the other terms.
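You can confirm this with the _analyze API:
POST _analyze
{
  "analyzer": "standard",
  "text": "Single V"
}
This returns the tokens single and v, which is why only the full word Single matches.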
If you want to do a partial search, you can use an edge n-gram tokenizer or a wildcard query.
The mapping for the edge n-gram tokenizer would be:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 6,
"token_chars": [
"letter",
"digit"
]
}
}
},
"max_ngram_diff": 10
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
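With this mapping in place, the partial query from the question should start matching, since both index-time and search-time analysis now produce edge n-grams (a minimal sketch; the index name my_cars is taken from the question):
GET my_cars/_search
{
  "query": {
    "match": {
      "name": "singl"
    }
  }
}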
Update 1:
In the index mapping given in the question, one closing bracket } is missing after the my_analyzer definition, so the tokenizer block ends up nested inside analyzer (hence the error about analyzer [tokenizer]). Modify your index mapping as shown below:
{
"settings": {
"analysis": {
"normalizer": {
"sortable": {
"filter": [
"lowercase"
]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
}, // note this
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 1,
"max_gram": 36,
"token_chars": [
"letter"
]
}
}
},
"max_ngram_diff": 50
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword",
"normalizer": "sortable"
}
}
}
}
}
}
This is because of the default analyzer. The field is broken into the tokens [single, v].
A match query tries to find an exact match for any of the query tokens. Since you are only passing singl, that will be the only query token, and it does not match either of the two tokens stored in the index.
You can use a wildcard query instead:
{
"query": {
"wildcard": {
"name": {
"value": "*singl*"
}
}
}
}

elasticsearch autocomplete for title text

I am trying to implement an autocomplete suggester for movies (titles), somewhat similar to IMDB. Below is the mapping that I used. This mapping gives decent results. I am using edge ngrams; are there any better alternatives?
But it has some flaws:
"war civil" and "civil war" give the same results, i.e. it doesn't give priority to movies with the words in the same order as the query.
It doesn't give any results when the space between words is omitted, e.g. "smoking barrels" gives good results, but "smokingbarrels" gives zero results.
What is wrong with the query and mapping below?
curl -XPUT "http://localhost:9200/movieindex" -H 'Content-Type: application/json' -d'
{
"settings": {
"index": {
"analysis": {
"filter": {},
"analyzer": {
"edge_ngram_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "edge_ngram_tokenizer"
},
"edge_ngram_search_analyzer": {
"tokenizer": "lowercase"
}
},
"tokenizer": {
"edge_ngram_tokenizer": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": [
"letter",
"digit",
"symbol"
]
}
}
}
}
},
"mappings": {
"movies": {
"properties": {
"title": {
"type": "text",
"fields": {
"edgengram": {
"type": "text",
"analyzer": "edge_ngram_analyzer",
"search_analyzer": "edge_ngram_search_analyzer"
}
},
"analyzer": "standard"
}
}
}
}
}
GET /movieindex/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"title.edgengram": {
"query": "smokingbarrels",
"fuzziness": 1
}
}
}
]
}
}
}

Why does my Elasticsearch multi-match query look only for prefixes?

I am trying to write an Elasticsearch multi-match query (with the Java API) to create a "search-as-you-type" program. The query is applied to two fields, title and description, which are analyzed as ngrams.
My problem is, it seems that Elasticsearch tries to find only words beginning like my query. For instance, if I search for "nut", then it matches with documents featuring "nut", "nuts", "Nutella", etc, but it does not match documents featuring "walnut", which should be matched.
Here are my settings:
{
"index": {
"analysis": {
"analyzer": {
"edgeNGramAnalyzer": {
"tokenizer": "edgeTokenizer",
"filter": [
"word_delimiter",
"lowercase",
"unique"
]
}
},
"tokenizer": {
"edgeTokenizer": {
"type": "edgeNGram",
"min_gram": "3",
"max_gram": "8",
"token_chars": [
"letter",
"digit"
]
}
}
}
}
}
Here is the relevant part of my mapping:
{
"content": {
"properties": {
"title": {
"type": "text",
"analyzer": "edgeNGramAnalyzer",
"fields": {
"sort": {
"type": "keyword"
}
}
},
"description": {
"type": "text",
"analyzer": "edgeNGramAnalyzer",
"fields": {
"sort": {
"type": "keyword"
}
}
}
}
}
}
And here is my query:
new MultiMatchQueryBuilder(query).field("title", 3).field("description", 1).fuzziness(0).tieBreaker(1).minimumShouldMatch("100%")
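For reference, this builder produces roughly the following query DSL (my reconstruction, not the exact serialized output; "nut" is just the example term from above):
{
  "query": {
    "multi_match": {
      "query": "nut",
      "fields": ["title^3", "description"],
      "fuzziness": 0,
      "tie_breaker": 1.0,
      "minimum_should_match": "100%"
    }
  }
}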
Do you have any idea what I could be doing wrong ?
That's because you're using an edgeNGram tokenizer instead of an nGram one. The former only indexes prefixes, while the latter indexes prefixes, suffixes and also sub-parts of your data.
Change your analyzer definition to this instead and it should work as expected:
{
"index": {
"analysis": {
"analyzer": {
"edgeNGramAnalyzer": {
"tokenizer": "edgeTokenizer",
"filter": [
"word_delimiter",
"lowercase",
"unique"
]
}
},
"tokenizer": {
"edgeTokenizer": {
"type": "nGram", <---- change this
"min_gram": "3",
"max_gram": "8",
"token_chars": [
"letter",
"digit"
]
}
}
}
}
}
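To see why this fixes the walnut case, you can inspect the emitted tokens (a sketch; the index name my_index is a placeholder):
GET my_index/_analyze
{
  "analyzer": "edgeNGramAnalyzer",
  "text": "walnut"
}
With the nGram tokenizer the output now contains inner sub-parts such as nut, alnut and lnut in addition to the prefixes, so a query for nut matches documents containing walnut.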

Elasticsearch returning unexpected results

I used the following mapping:
I have modified the English analyzer to use an edge_ngram filter as follows, so that I am able to search under the following scenarios:
1] partial search and special-character search
2] to get the advantage of language analyzers
{
"settings": {
"analysis": {
"analyzer": {
"english_ngram": {
"type": "custom",
"filter": [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_stemmer",
"ngram_filter"
],
"tokenizer": "whitespace"
}
},
"filter": {
"english_stop": {
"type": "stop"
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
},
"ngram_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 25
}
}
}
},
"mappings": {
"movie": {
"properties": {
"title": {
"type": "string",
"fields": {
"en": {
"type": "string",
"analyzer": "english_ngram"
}
}
}
}
}
}
}
Indexed my data as follows:
PUT http://localhost:9200/movies/movie/1
{
"title" : "$peci#l movie"
}
Query as follows:
{
"query": {
"multi_match": {
"query": "$peci#44 m11ov",
"fields": ["title.en"],
"operator":"and",
"type": "most_fields",
"minimum_should_match": "75%"
}
}
}
In the query I am searching for the string "$peci#44 m11ov"; ideally I should not get any results for this.
Is anything wrong in here?
This is a result of ngram tokenization. When you tokenize the string $peci#l movie, your analyzer produces tokens like $, $p, $pe, etc. Your query also produces most of these tokens, though these matches will have a lower score than a complete match. If it's critical for you to exclude these false-positive matches, you can try to set a threshold using the min_score option: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-min-score.html
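A minimal sketch of how min_score fits into the search request (the threshold value is arbitrary here and needs tuning against your actual scores):
GET movies/_search
{
  "min_score": 5,
  "query": {
    "multi_match": {
      "query": "$peci#44 m11ov",
      "fields": ["title.en"],
      "operator": "and",
      "type": "most_fields",
      "minimum_should_match": "75%"
    }
  }
}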

Elasticsearch is not sorting the results

I'm having a problem with an Elasticsearch query.
I want to be able to sort the results, but Elasticsearch is ignoring the sort tag. Here is my query:
{
"sort": [{
"title": {"order": "desc"}
}],
"query":{
"term": { "title": "pagos" }
}
}
However, when I remove the query part and send only the sort tag, it works.
Can anyone point me to the correct way?
I also tried with the following query, which is the complete query that I have:
{
"sort": [{
"title": {"order": "asc"}
}],
"query":{
"bool":{
"should":[
{
"match":{
"title":{
"query":"Pagos",
"boost":9
}
}
},
{
"match":{
"description":{
"query":"Pagos",
"boost":5
}
}
},
{
"match":{
"keywords":{
"query":"Pagos",
"boost":3
}
}
},
{
"match":{
"owner":{
"query":"Pagos",
"boost":2
}
}
}
]
}
}
}
Settings
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "ngram",
"min_gram": 3,
"max_gram": 15,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"default" : {
"tokenizer" : "standard",
"filter" : ["standard", "lowercase", "asciifolding"]
},
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
"autocomplete_filter"
]
}
}
}
}
}
Mappings
{
"objects": {
"properties": {
"id": { "type": "string", "index": "not_analyzed" },
"type": { "type": "string" },
"title": { "type": "string", "boost": 9, "analyzer": "autocomplete", "search_analyzer": "standard" },
"owner": { "type": "string", "boost": 2 },
"description": { "type": "string", "boost": 4 },
"keywords": { "type": "string", "boost": 1 }
}
}
}
Thanks in advance!
The field "title" in your document is an analyzed string field, which is also a multivalued field, which means elasticsearch will split the contents of the field into tokens and stores it separately in the index.
You probably want to sort the "title" field alphabetically on the first term, then on the second term, and so forth, but elasticsearch doesn’t have this information at its disposal at sort time.
Hence you can change your mapping of the "title" field from:
{
"title": {
"type": "string", "boost": 9, "analyzer": "autocomplete", "search_analyzer": "standard"
}
}
into a multifield mapping like this:
{
"title": {
"type": "string", "boost": 9, "analyzer": "autocomplete", "search_analyzer":"standard",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
Now execute your search on the analyzed "title" field and sort on the not_analyzed "title.raw" field:
{
"sort": [{
"title.raw": {"order": "desc"}
}],
"query":{
"term": { "title": "pagos" }
}
}
It is beautifully explained here: String Sorting and Multifields
