match query has no results with custom analyzer - elasticsearch

I have two indices:
First:
curl -XPUT 'http://localhost:9200/first/' -d '
{
  "mappings": {
    "product": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "spanish"
        }
      }
    }
  }
}
'
Second:
curl -XPUT 'http://localhost:9200/second/' -d '
{
  "mappings": {
    "product": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "spanish_custom"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords_path": "spanish_stop_custom.txt"
        },
        "spanish_stemmer": {
          "type": "stemmer",
          "language": "spanish"
        }
      },
      "analyzer": {
        "spanish_custom": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "spanish_stop",
            "spanish_stemmer"
          ]
        }
      }
    }
  }
}
'
I inserted a document into each index:
curl -XPOST 'http://localhost:9200/first/product' -d '
{
  "name": "Hidratante"
}'
curl -XPOST 'http://localhost:9200/second/product' -d '
{
  "name": "Hidratante"
}'
I checked the tokens for the name field:
curl -XGET 'http://localhost:9200/first/_analyze?field=name' -d 'hidratante'
{"tokens":[{"token":"hidratant","start_offset":0,"end_offset":10,"type":"<ALPHANUM>","position":1}]}
curl -XGET 'http://localhost:9200/second/_analyze?field=name' -d 'hidratante'
{"tokens":[{"token":"hidrat","start_offset":0,"end_offset":10,"type":"<ALPHANUM>","position":1}]}
I want to search for 'hidratant' and get results from both indices, but I only get results from the first index.
My query:
curl -XGET 'http://127.0.0.1:9200/first/_search' -d '
{
  "query": {
    "multi_match": {
      "query": "hidratant",
      "fields": ["name"],
      "type": "phrase_prefix",
      "operator": "AND",
      "prefix_length": 3,
      "tie_breaker": 1
    }
  }
}
'
First index result:
{"took":6,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":0.5945348,"hits":[{"_index":"test","_type":"product","_id":"AVPxjvpRDl8qAEgsMFMu","_score":0.5945348,"_source":
{
"name": "Hidratante"
}},{"_index":"test","_type":"product","_id":"AVPxkYbKDl8qAEgsMFMv","_score":0.5945348,"_source":
{
"name": "Hidratante"
}}]}}
Second index result:
{"took":1,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
Why does the second index return no results?

As you mentioned in your question above, for the second index the token generated for the term Hidratante is:
{"tokens":[{"token":"hidrat","start_offset":0,"end_offset":10,"type":"<ALPHANUM>","position":1}]}
There is a concept of a search analyzer, which comes into play when you perform a search. According to the documentation:
By default, queries will use the analyzer defined in the field mapping while searching.
So when you run the phrase_prefix query, the same custom analyzer you created acts on the name field of the second index.
Since you are searching for the keyword hidratant, it gets analyzed as follows.
For the first index:
curl -XGET 'http://localhost:9200/first/_analyze?field=name' -d 'hidratant'
{
  "tokens": [
    {
      "token": "hidratant",
      "start_offset": 3,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
That is why you get results from the first index.
For the second index:
curl -XGET 'http://localhost:9200/second/_analyze?field=name' -d 'hidratant'
{
  "tokens": [
    {
      "token": "hidratant",
      "start_offset": 3,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
The token generated while searching is hidratant, but it was hidrat at index time. That is why you don't get any results in the second case.
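To confirm the diagnosis, a minimal sketch against the second index as created above: a match query for the full word hidratante should hit, because at search time it is analyzed to hidrat (see the _analyze output above), the same token that was stored at index time:
curl -XGET 'http://localhost:9200/second/_search' -d '
{
  "query": {
    "match": {
      "name": "hidratante"
    }
  }
}
'
If index-time and search-time analysis really need to differ, the mapping can instead declare a separate search_analyzer for the field rather than relying on the same analyzer for both.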

Related

Elasticsearch: index a field with keyword tokenizer but without stopwords

I am looking for a way to search company names with keyword tokenization but without stopwords.
For example, the indexed company name is "Hansel und Gretel Gmbh."
Here "und" and "Gmbh" are stop words for the company name.
If the search term is "Hansel Gretel", that document should be found.
If the search term is "Hansel", no document should be found. And if the search term is "hansel gmbh", no document should be found either.
I have tried to combine the keyword tokenizer with stopwords in a custom analyzer, but it didn't work (as expected, I guess).
I have also tried to use a common terms query, but "Hansel" started to hit (again, as expected).
Thanks in advance.
There are two ways: the bad and the ugly. The first one uses regular expressions in order to remove stop words and trim spaces. There are a lot of drawbacks:
you have to handle whitespace tokenization (regexp \s+) and special-symbol (.,;) removal on your own
no highlighting is supported, since the keyword tokenizer does not support it
case sensitivity is also a problem
normalizers (analyzers for keyword fields) are an experimental feature: poor support, few features
Here is a step-by-step example:
curl -XPUT "http://localhost:9200/test" -H 'Content-Type: application/json' -d'
{
"settings": {
"analysis": {
"normalizer": {
"custom_normalizer": {
"type": "custom",
"char_filter": ["stopword_char_filter", "trim_char_filter"],
"filter": ["lowercase"]
}
},
"char_filter": {
"stopword_char_filter": {
"type": "pattern_replace",
"pattern": "( ?und ?| ?gmbh ?)",
"replacement": " "
},
"trim_char_filter": {
"type": "pattern_replace",
"pattern": "(\\s+)$",
"replacement": ""
}
}
}
},
"mappings": {
"file": {
"properties": {
"name": {
"type": "keyword",
"normalizer": "custom_normalizer"
}
}
}
}
}'
Now we can check how our normalizer works (note that _analyze requests with a normalizer are supported only in ES 6.x and later):
curl -XPOST "http://localhost:9200/test/_analyze" -H 'Content-Type: application/json' -d'
{
"normalizer": "custom_normalizer",
"text": "hansel und gretel gmbh"
}'
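Given the char filter chain above, the request should yield a single token along these lines (reconstructed from the configuration, not captured from a live cluster; exact offsets may differ):
{
  "tokens": [
    {
      "token": "hansel gretel",
      "start_offset": 0,
      "end_offset": 22,
      "type": "word",
      "position": 0
    }
  ]
}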
Now we are ready to index our document:
curl -XPUT "http://localhost:9200/test/file/1" -H 'Content-Type: application/json' -d'
{
"name": "hansel und gretel gmbh"
}'
And the last step is search:
curl -XGET "http://localhost:9200/test/_search" -H 'Content-Type: application/json' -d'
{
"query": {
"match" : {
"name" : {
"query" : "hansel gretel"
}
}
}
}'
Another approach is:
create a standard text analyzer with a stop words filter
use the _analyze API to filter out all stop words and special symbols
concatenate the tokens manually
send the term to ES as a keyword
Here is a step-by-step example:
curl -XPUT "http://localhost:9200/test" -H 'Content-Type: application/json' -d'
{
"settings": {
"analysis": {
"analyzer": {
"custom_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "custom_stopwords"]
}
}, "filter": {
"custom_stopwords": {
"type": "stop",
"stopwords": ["und", "gmbh"]
}
}
}
},
"mappings": {
"file": {
"properties": {
"name": {
"type": "text",
"analyzer": "custom_analyzer"
}
}
}
}
}'
Now we are ready to analyze our text:
POST test/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "Hansel und Gretel Gmbh."
}
with the following result:
{
  "tokens": [
    {
      "token": "hansel",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "gretel",
      "start_offset": 11,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
The last step is token concatenation: hansel + gretel. The only drawback is that this manual analysis step requires custom client-side code.
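A minimal sketch of that client-side step, assuming jq is installed and that the concatenated string is queried against an exact-match field (name_keyword below is hypothetical; it is not part of the mapping above):
# Analyze the raw input, then join the surviving tokens into one string
TERM=$(curl -s -XPOST "http://localhost:9200/test/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "custom_analyzer",
  "text": "Hansel und Gretel Gmbh."
}' | jq -r '[.tokens[].token] | join(" ")')

# TERM is now "hansel gretel"; send it to ES as an exact term
# (name_keyword is a hypothetical keyword field for exact matching)
curl -XGET "http://localhost:9200/test/_search" -H 'Content-Type: application/json' -d"
{
  \"query\": { \"term\": { \"name_keyword\": \"$TERM\" } }
}"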

Elasticsearch keyword tokenizer doesn't work with phonetic analyzer

I want to add a custom phonetic analyzer, and I don't want my given string analyzed. Suppose I have two strings:
KAMRUL ISLAM
KAMRAL ISLAM
I don't want to get any result for the query string KAMRUL, but I want both as results for the query string KAMRUL ISLAM.
For this, I have created a custom phonetic analyzer with a keyword tokenizer.
Index settings:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "dbl_metaphone": {
          "type": "phonetic",
          "encoder": "double_metaphone"
        }
      },
      "analyzer": {
        "dbl_metaphone": {
          "tokenizer": "keyword",
          "filter": "dbl_metaphone"
        }
      }
    }
  }
}
Type Mappings:
PUT /my_index/_mapping/my_type
{
  "properties": {
    "name": {
      "type": "string",
      "analyzer": "dbl_metaphone"
    }
  }
}
I have inserted data with:
PUT /my_index/my_type/5
{
  "name": "KAMRUL ISLAM"
}
And my query string:
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": {
        "query": "KAMRAL"
      }
    }
  }
}
Unfortunately I get both strings back. I am using ES 1.7.1. Is there any way to solve this?
Additionally, when I run:
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=dbl_metaphone' -d 'KAMRUL ISLAM'
I got the result:
{
  "tokens": [
    {
      "token": "KMRL",
      "start_offset": 0,
      "end_offset": 12,
      "type": "word",
      "position": 1
    }
  ]
}
And when running:
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=dbl_metaphone' -d 'KAMRAL'
I got:
{
  "tokens": [
    {
      "token": "KMRL",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 1
    }
  ]
}

Elasticsearch analyzer that allows querying both with and without hyphens

How do you construct an analyzer that allows you to query fields both with and without hyphens?
The following two queries must return the same person:
{
  "query": {
    "term": {
      "name": {
        "value": "Jay-Z"
      }
    }
  }
}
{
  "query": {
    "term": {
      "name": {
        "value": "jay z"
      }
    }
  }
}
What you could do is use a mapping character filter in order to replace the hyphen with a space. Basically, like this:
curl -XPUT localhost:9200/tests -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ],
          "char_filter": [
            "hyphens"
          ]
        }
      },
      "char_filter": {
        "hyphens": {
          "type": "mapping",
          "mappings": [
            "-=>\\u0020"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}'
Then we can check what the analysis pipeline would yield using the _analyze endpoint:
For Jay-Z:
curl -XGET 'localhost:9200/tests/_analyze?pretty&analyzer=my_analyzer' -d 'Jay-Z'
{
  "tokens" : [ {
    "token" : "jay z",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "word",
    "position" : 0
  } ]
}
For jay z:
curl -XGET 'localhost:9200/tests/_analyze?pretty&analyzer=my_analyzer' -d 'jay z'
{
  "tokens" : [ {
    "token" : "jay z",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "word",
    "position" : 0
  } ]
}
As you can see, the same token is indexed for both forms. A match query, which analyzes its input with the same analyzer, will therefore find the document for both Jay-Z and jay z. Note that a term query is not analyzed, so it will only match the exact indexed token jay z.
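As a quick sketch of that distinction against the tests index created above (response omitted):
curl -XGET 'localhost:9200/tests/_search?pretty' -d '{
  "query": {
    "match": {
      "name": "Jay-Z"
    }
  }
}'
The match query runs Jay-Z through my_analyzer, producing the token jay z, which equals the indexed term.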

Elasticsearch not using "default_search" analyzer unless explicitly stated in query

From reading the Elasticsearch documentation, I would expect that naming an analyzer default_search would cause that analyzer to be used for all searches unless another analyzer is specified. However, if I define my index like so:
curl -XPUT 'http://localhost:9200/test/' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ngram_analyzer": {
          "tokenizer": "my_ngram_tokenizer",
          "filter": [
            "lowercase"
          ],
          "type": "custom"
        },
        "default_search": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "3",
          "max_gram": "100",
          "token_chars": []
        }
      }
    }
  },
  "mappings": {
    "TestDocument": {
      "dynamic_templates": [
        {
          "metadata_template": {
            "match_mapping_type": "string",
            "path_match": "*",
            "mapping": {
              "type": "multi_field",
              "fields": {
                "ngram": {
                  "type": "{dynamic_type}",
                  "index": "analyzed",
                  "analyzer": "my_ngram_analyzer"
                },
                "{name}": {
                  "type": "{dynamic_type}",
                  "index": "analyzed",
                  "analyzer": "standard"
                }
              }
            }
          }
        }
      ]
    }
  }
}'
And then add a 'TestDocument':
curl -XPUT 'http://localhost:9200/test/TestDocument/1' -d '{
  "name": "TestDocument.pdf"
}'
My queries are still running through the default analyzer. I can tell because this query gives me a hit:
curl -XGET 'localhost:9200/test/TestDocument/_search?pretty=true' -d '{
  "query": {
    "match": {
      "name.ngram": {
        "query": "abc.pdf"
      }
    }
  }
}'
But it does not give a hit if I specify the correct analyzer (the one using the keyword tokenizer):
curl -XGET 'localhost:9200/test/TestDocument/_search?pretty=true' -d '{
  "query": {
    "match": {
      "name.ngram": {
        "query": "abc.pdf",
        "analyzer": "default_search"
      }
    }
  }
}'
What am I missing to use "default_search" for searches unless stated otherwise in my query? Am I just misinterpreting expected behavior here?
In your dynamic template, you are setting both the search and index analyzer by using "analyzer". The default_search analyzer is only used as a last resort.
"index_analyzer": "analyzer_name"  // sets the index analyzer
"analyzer": "analyzer_name"        // sets both search and index analyzer
"search_analyzer": "analyzer_name" // sets the search analyzer

Indexing a website/URL in Elasticsearch

I have a website field in a document indexed in Elasticsearch, with an example value of http://example.com. The problem is that when I search for example, the document is not included. How do I map the website/URL field correctly?
I created the index below:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_html": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": "standard",
            "char_filter": "html_strip"
          }
        }
      }
    }
  },
  "mappings": {
    "blogshops": {
      "properties": {
        "category": {
          "properties": {
            "name": {
              "type": "string"
            }
          }
        },
        "reviews": {
          "properties": {
            "user": {
              "properties": {
                "_id": {
                  "type": "string"
                }
              }
            }
          }
        }
      }
    }
  }
}
I guess you are using the standard analyzer, which splits http://example.com into two tokens: http and example.com. You can take a look at http://localhost:9200/_analyze?text=http://example.com&analyzer=standard.
If you want to split the URL, you need to use a different analyzer or specify your own custom analyzer.
You can take a look at how the URL would be indexed with the simple analyzer: http://localhost:9200/_analyze?text=http://example.com&analyzer=simple. As you can see, the URL is now indexed as three tokens: ['http', 'example', 'com']. If you don't want to index tokens like 'http' or 'www', you can specify your own analyzer with the lowercase tokenizer (the one used by the simple analyzer) and a stop filter. For example, something like this:
# Delete index
#
curl -s -XDELETE 'http://localhost:9200/url-test/' ; echo
# Create index with mapping and custom analyzer
#
curl -s -XPUT 'http://localhost:9200/url-test/' -d '{
  "mappings": {
    "document": {
      "properties": {
        "content": {
          "type": "string",
          "analyzer": "lowercase_with_stopwords"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "filter": {
        "stopwords_filter": {
          "type": "stop",
          "stopwords": ["http", "https", "ftp", "www"]
        }
      },
      "analyzer": {
        "lowercase_with_stopwords": {
          "type": "custom",
          "tokenizer": "lowercase",
          "filter": ["stopwords_filter"]
        }
      }
    }
  }
}' ; echo
curl -s -XGET 'http://localhost:9200/url-test/_analyze?text=http://example.com&analyzer=lowercase_with_stopwords&pretty'
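# Expected outcome (reconstructed from the config, not captured from a live
# run): the lowercase tokenizer splits http://example.com into http, example,
# and com, and the stop filter removes http, leaving only example and com.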
# Index document
#
curl -s -XPUT 'http://localhost:9200/url-test/document/1?pretty=true' -d '{
  "content": "Small content with URL http://example.com."
}'
# Refresh index
#
curl -s -XPOST 'http://localhost:9200/url-test/_refresh'
# Try to search document
#
curl -s -XGET 'http://localhost:9200/url-test/_search?pretty' -d '{
  "query": {
    "query_string": {
      "query": "content:example"
    }
  }
}'
NOTE: If you prefer not to use stopwords, here is an interesting article: Stop stopping stop words: a look at common terms query.
