I am looking for a way to search company names with keyword tokenizing but without stopwords.
For example: the indexed company name is "Hansel und Gretel Gmbh."
Here "und" and "Gmbh" are stopwords for the company name.
If the search term is "Hansel Gretel", that document should be found.
If the search term is "Hansel", no document should be found. And if the search term is "hansel gmbh", no document should be found either.
I have tried combining the keyword tokenizer with a stopword filter in a custom analyzer, but it didn't work (as expected, I guess).
I have also tried the common terms query, but "Hansel" started to hit (again, as expected).
Thanks in advance.
There are two ways: bad and ugly. The first one uses regular expressions to remove stop words and trim spaces. It has a number of drawbacks:
you have to handle whitespace tokenization (regexp \s+) and special-symbol (.,;) removal on your own
no highlighting is supported - the keyword tokenizer does not support it
case sensitivity is also a problem
normalizers (analyzers for keyword fields) are an experimental feature - poor support, few features
Here is a step-by-step example:
curl -XPUT "http://localhost:9200/test" -H 'Content-Type: application/json' -d'
{
"settings": {
"analysis": {
"normalizer": {
"custom_normalizer": {
"type": "custom",
"char_filter": ["stopword_char_filter", "trim_char_filter"],
"filter": ["lowercase"]
}
},
"char_filter": {
"stopword_char_filter": {
"type": "pattern_replace",
"pattern": "( ?und ?| ?gmbh ?)",
"replacement": " "
},
"trim_char_filter": {
"type": "pattern_replace",
"pattern": "(\\s+)$",
"replacement": ""
}
}
}
},
"mappings": {
"file": {
"properties": {
"name": {
"type": "keyword",
"normalizer": "custom_normalizer"
}
}
}
}
}'
Now we can check how our analyzer works (please note that _analyze requests with a normalizer are supported only in ES 6.x):
curl -XPOST "http://localhost:9200/test/_analyze" -H 'Content-Type: application/json' -d'
{
"normalizer": "custom_normalizer",
"text": "hansel und gretel gmbh"
}'
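If the normalizer works, both stopwords are stripped and the value is kept as a single term; you should get something like this (exact offsets and token type may differ):
{
  "tokens": [
    {
      "token": "hansel gretel",
      "start_offset": 0,
      "end_offset": 22,
      "type": "word",
      "position": 0
    }
  ]
}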
Now we are ready to index our document:
curl -XPUT "http://localhost:9200/test/file/1" -H 'Content-Type: application/json' -d'
{
"name": "hansel und gretel gmbh"
}'
And the last step is search:
curl -XGET "http://localhost:9200/test/_search" -H 'Content-Type: application/json' -d'
{
"query": {
"match" : {
"name" : {
"query" : "hansel gretel"
}
}
}
}'
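Note that the match query runs the same normalizer over the query string, so the requirements from the question hold: "hansel gmbh" is normalized to just "hansel", which does not equal the stored value "hansel gretel", and therefore returns no hits. You can verify the negative case like this:
curl -XGET "http://localhost:9200/test/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "name": {
        "query": "hansel gmbh"
      }
    }
  }
}'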
Another approach is:
create a standard text analyzer with a stop words filter
use the _analyze API to filter out all stop words and special symbols
concatenate the tokens manually (a sketch follows the example below)
send the term to ES as a keyword
Here is a step-by-step example:
curl -XPUT "http://localhost:9200/test" -H 'Content-Type: application/json' -d'
{
"settings": {
"analysis": {
"analyzer": {
"custom_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "custom_stopwords"]
}
}, "filter": {
"custom_stopwords": {
"type": "stop",
"stopwords": ["und", "gmbh"]
}
}
}
},
"mappings": {
"file": {
"properties": {
"name": {
"type": "text",
"analyzer": "custom_analyzer"
}
}
}
}
}'
Now we are ready to analyze our text:
curl -XPOST "http://localhost:9200/test/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "custom_analyzer",
  "text": "Hansel und Gretel Gmbh."
}'
with the following result:
{
  "tokens": [
    {
      "token": "hansel",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "gretel",
      "start_offset": 11,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
The last step is token concatenation: hansel + gretel. The only drawback is that this analysis step needs custom code on the client side.
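As a minimal sketch of that client-side step using the shell and jq (assuming jq is available; any scripting language works just as well):
curl -s -XPOST "http://localhost:9200/test/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "custom_analyzer",
  "text": "Hansel und Gretel Gmbh."
}' | jq -r '[.tokens[].token] | join(" ")'
This prints hansel gretel, which you can then index or query as a single keyword term.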
Related
I'm using Elasticsearch 5.1 and I want to enable the stop token filter for the standard analyzer, which is disabled by default.
The documentation describes how to use it in a custom analyzer, but I would like to know how to enable it for the standard analyzer, since the filter is already included.
You have to configure the standard analyzer; see the example below for how to do it with curl (taken from the docs):
curl -XPUT 'localhost:9200/my_index?pretty' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_english": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "my_text": {
          "type": "text",
          "analyzer": "standard",
          "fields": {
            "english": {
              "type": "text",
              "analyzer": "std_english"
            }
          }
        }
      }
    }
  }
}'
curl -XPOST 'localhost:9200/my_index/_analyze?pretty' -d'
{
  "field": "my_text",
  "text": "The old brown cow"
}'
curl -XPOST 'localhost:9200/my_index/_analyze?pretty' -d'
{
  "field": "my_text.english",
  "text": "The old brown cow"
}'
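The first call goes through the plain standard analyzer and should return the tokens the, old, brown, cow; the second goes through std_english, whose English stopword list drops the, leaving old, brown, cow. Roughly (offsets omitted):
["the", "old", "brown", "cow"]   <- my_text
["old", "brown", "cow"]          <- my_text.english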
I have two indices:
First:
curl -XPUT 'http://localhost:9200/first/' -d '
{
  "mappings": {
    "product": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "spanish"
        }
      }
    }
  }
}
'
Second:
curl -XPUT 'http://localhost:9200/second/' -d '
{
  "mappings": {
    "product": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "spanish_custom"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords_path": "spanish_stop_custom.txt"
        },
        "spanish_stemmer": {
          "type": "stemmer",
          "language": "spanish"
        }
      },
      "analyzer": {
        "spanish_custom": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "spanish_stop",
            "spanish_stemmer"
          ]
        }
      }
    }
  }
}
'
I inserted a document into both indices:
curl -XPOST 'http://localhost:9200/first/product' -d '
{
  "name": "Hidratante"
}'
curl -XPOST 'http://localhost:9200/second/product' -d '
{
  "name": "Hidratante"
}'
I checked the tokens for the name field:
curl -XGET 'http://localhost:9200/first/_analyze?field=name' -d 'hidratante'
{"tokens":[{"token":"hidratant","start_offset":0,"end_offset":10,"type":"<ALPHANUM>","position":1}]}
curl -XGET 'http://localhost:9200/second/_analyze?field=name' -d 'hidratante'
{"tokens":[{"token":"hidrat","start_offset":0,"end_offset":10,"type":"<ALPHANUM>","position":1}]}
I want to search for 'hidratant' and get results from both indices, but I got results only from the first index.
My query:
curl -XGET 'http://127.0.0.1:9200/first/_search' -d '
{
  "query": {
    "multi_match": {
      "query": "hidratant",
      "fields": ["name"],
      "type": "phrase_prefix",
      "operator": "AND",
      "prefix_length": 3,
      "tie_breaker": 1
    }
  }
}
'
First index result:
{"took":6,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":0.5945348,"hits":[{"_index":"test","_type":"product","_id":"AVPxjvpRDl8qAEgsMFMu","_score":0.5945348,"_source":
{
"name": "Hidratante"
}},{"_index":"test","_type":"product","_id":"AVPxkYbKDl8qAEgsMFMv","_score":0.5945348,"_source":
{
"name": "Hidratante"
}}]}}
Second index result:
{"took":1,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
Why does the second index return no results?
As you mentioned in your question above, for the second index the tokens generated for the term Hidratante are:
{"tokens":[{"token":"hidrat","start_offset":0,"end_offset":10,"type":"<ALPHANUM>","position":1}]}
There is a concept of a search analyzer, which comes into play when you perform a search operation. According to the documentation:
By default, queries will use the analyzer defined in the field mapping while searching.
So when you run the phrase_prefix query, the same custom analyzer you created acts on the name field in the second index.
Since you are searching for the keyword hidratant, it gets analyzed as follows.
For the first index:
curl -XGET 'http://localhost:9200/first/_analyze?field=name' -d 'hidratant'
{
  "tokens": [
    {
      "token": "hidratant",
      "start_offset": 3,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
That is why you get a result from the first index.
For the second index:
curl -XGET 'http://localhost:9200/second/_analyze?field=name' -d 'hidratant'
{
  "tokens": [
    {
      "token": "hidratant",
      "start_offset": 3,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
The token generated at search time is hidratant, but it was hidrat at index time. That's why you don't get any results in the second case.
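As a quick sanity check based on the _analyze output quoted above: searching the second index for the full word hidratante gets stemmed to hidrat at query time, which does match the indexed token, so a query like this should return the document:
curl -XGET 'http://127.0.0.1:9200/second/_search' -d '
{
  "query": {
    "match": {
      "name": "hidratante"
    }
  }
}
'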
I have Elasticsearch 1.5 running on my server.
Specifically, I want to create three fields:
1. name
2. description
3. nickname
I want to set up stopwords for the description and nickname fields, so that when I insert data into Elasticsearch the stop filter automatically removes unwanted stopwords. I have tried many times, but it is not working.
curl -X POST http://127.0.0.1:9200/tryoindex/ -d'
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_english_stemmer": {
          "type": "stemmer",
          "name": "english"
        },
        "snowball": {
          "type": "snowball",
          "language": "English"
        }
      },
      "analyzer": {
        "custom_lowercase_stemmed": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "custom_english_stemmer",
            "snowball"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "_all": { "enabled": true },
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "custom_lowercase_stemmed"
        }
      }
    }
  }
}'
curl -X POST "http://localhost:9200/tryoindex/nama/1" -d '{
"text" : "Tryolabs running monkeys KANGAROOS and jumping elephants jum is your"
}'
curl "http://localhost:9200/tryoindex/nama/_search?pretty=1" -d '{
"query": {
"query_string": {
"query": "Tryolabs running monkeys KANGAROOS and jumping elephants jum is your",
"fields": ["text"]
}
}
}'
Change your analyzer part to the following (putting stop after lowercase, so that capitalized stopwords such as "And" are caught as well):
"analyzer": {
  "custom_lowercase_stemmed": {
    "tokenizer": "standard",
    "filter": [
      "lowercase",
      "stop",
      "custom_english_stemmer",
      "snowball"
    ]
  }
}
To verify the changes, use
curl -XGET 'localhost:9200/tryoindex/_analyze?analyzer=custom_lowercase_stemmed' -d 'testing this is stopword testing'
and observe the tokens:
{"tokens":[{"token":"test","start_offset":0,"end_offset":7,"type":"<ALPHANUM>","position":1},{"token":"stopword","start_offset":16,"end_offset":24,"type":"<ALPHANUM>","position":4},{"token":"test","start_offset":25,"end_offset":32,"type":"<ALPHANUM>","position":5}]}
PS: If you don't want the stemmed version of testing, remove the stemming filters.
You need to use the stop token filter in your analyzer's filter chain.
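If the built-in English list is not what you need, here is a minimal sketch of a configurable stop filter with a custom word list (the filter name and words are only examples); reference it from the analyzer's filter array in place of the built-in stop:
"filter": {
  "my_stop": {
    "type": "stop",
    "stopwords": ["and", "is", "your"]
  }
}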
I am having a problem indexing and searching for words that may or may not contain whitespace. Below is an example.
Here is how the mappings are set up:
curl -s -XPUT 'localhost:9200/test' -d '{
  "mappings": {
    "name": {
      "properties": {
        "street": {
          "type": "string",
          "index_analyzer": "index_ngram",
          "search_analyzer": "search_ngram"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "desc_ngram": {
          "type": "edgeNGram",
          "min_gram": 3,
          "max_gram": 20
        }
      },
      "analyzer": {
        "index_ngram": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "desc_ngram", "lowercase" ]
        },
        "search_ngram": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": "lowercase"
        }
      }
    }
  }
}'
This is how I built the index:
curl -s -XPUT 'localhost:9200/test/name/1' -d '{ "street": "Lakeshore Dr" }'
curl -s -XPUT 'localhost:9200/test/name/2' -d '{ "street": "Sunnyshore Dr" }'
curl -s -XPUT 'localhost:9200/test/name/3' -d '{ "street": "Lake View Dr" }'
curl -s -XPUT 'localhost:9200/test/name/4' -d '{ "street": "Shore Dr" }'
Here is an example of the query that is not working correctly:
curl -s -XGET 'localhost:9200/test/_search?pretty=true' -d '{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "street": {
              "query": "lake shore dr",
              "type": "boolean"
            }
          }
        }
      ]
    }
  }
}'
If a user searches for "Lake Shore Dr", I want to match only document 1 ("Lakeshore Dr").
If a user searches for "Lakeview Dr", I want to match only document 3 ("Lake View Dr").
So is the issue with how I am setting up the mappings (the tokenizer? edgeNGram vs. nGram? the size of the grams?) or with the query (I have tried things like setting minimum_should_match and the analyzer to use)? I have not been able to get the desired results.
Thanks all.
From reading the Elasticsearch documentation, I would expect that naming an analyzer 'default_search' would cause it to be used for all searches unless another analyzer is specified. However, if I define my index like so:
curl -XPUT 'http://localhost:9200/test/' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ngram_analyzer": {
          "tokenizer": "my_ngram_tokenizer",
          "filter": [
            "lowercase"
          ],
          "type": "custom"
        },
        "default_search": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "3",
          "max_gram": "100",
          "token_chars": []
        }
      }
    }
  },
  "mappings": {
    "TestDocument": {
      "dynamic_templates": [
        {
          "metadata_template": {
            "match_mapping_type": "string",
            "path_match": "*",
            "mapping": {
              "type": "multi_field",
              "fields": {
                "ngram": {
                  "type": "{dynamic_type}",
                  "index": "analyzed",
                  "analyzer": "my_ngram_analyzer"
                },
                "{name}": {
                  "type": "{dynamic_type}",
                  "index": "analyzed",
                  "analyzer": "standard"
                }
              }
            }
          }
        }
      ]
    }
  }
}'
And then add a 'TestDocument':
curl -XPUT 'http://localhost:9200/test/TestDocument/1' -d '{
  "name": "TestDocument.pdf"
}'
My queries are still running through the default analyzer. I can tell because this query gives me a hit:
curl -XGET 'localhost:9200/test/TestDocument/_search?pretty=true' -d '{
  "query": {
    "match": {
      "name.ngram": {
        "query": "abc.pdf"
      }
    }
  }
}'
But it does not give a hit if I explicitly specify the intended analyzer (the one using the 'keyword' tokenizer):
curl -XGET 'localhost:9200/test/TestDocument/_search?pretty=true' -d '{
  "query": {
    "match": {
      "name.ngram": {
        "query": "abc.pdf",
        "analyzer": "default_search"
      }
    }
  }
}'
What am I missing to use "default_search" for searches unless stated otherwise in my query? Am I just misinterpreting expected behavior here?
In your dynamic template, you are setting both the search and index analyzer by using "analyzer". The default_search analyzer is only used as a last resort, when no analyzer is bound to the field:
"index_analyzer": "analyzer_name"   // sets the index analyzer
"analyzer": "analyzer_name"         // sets both the search and index analyzers
"search_analyzer": "analyzer_name"  // sets the search analyzer