Best way to search/index the data - with and without whitespace - elasticsearch

I am having a problem indexing and searching for words that may or may not contain whitespace. Below is an example.
Here is how the mappings are set up:
curl -s -XPUT 'localhost:9200/test' -d '{
"mappings": {
"properties": {
"name": {
"street": {
"type": "string",
"index_analyzer": "index_ngram",
"search_analyzer": "search_ngram"
}
}
}
},
"settings": {
"analysis": {
"filter": {
"desc_ngram": {
"type": "edgeNGram",
"min_gram": 3,
"max_gram": 20
}
},
"analyzer": {
"index_ngram": {
"type": "custom",
"tokenizer": "keyword",
"filter": [ "desc_ngram", "lowercase" ]
},
"search_ngram": {
"type": "custom",
"tokenizer": "keyword",
"filter": "lowercase"
}
}
}
}
}'
This is how I built the index:
curl -s -XPUT 'localhost:9200/test/name/1' -d '{ "street": "Lakeshore Dr" }'
curl -s -XPUT 'localhost:9200/test/name/2' -d '{ "street": "Sunnyshore Dr" }'
curl -s -XPUT 'localhost:9200/test/name/3' -d '{ "street": "Lake View Dr" }'
curl -s -XPUT 'localhost:9200/test/name/4' -d '{ "street": "Shore Dr" }'
Here is an example of the query that is not working correctly:
curl -s -XGET 'localhost:9200/test/_search?pretty=true' -d '{
"query":{
"bool":{
"must":[
{
"match":{
"street":{
"query":"lake shore dr",
"type":"boolean"
}
}
}
]
}
}
}';
If a user searches for "Lake Shore Dr", I want to match only document 1 ("Lakeshore Dr").
If a user searches for "Lakeview Dr", I want to match only document 3 ("Lake View Dr").
So is the issue with how I am setting up the mappings (the tokenizer? edge n-grams vs. n-grams? the size of the n-grams?) or with the query? I have tried things like setting minimum_should_match and changing the analyzer to use, but I have not been able to get the desired results.
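For reference, the output of the two analyzers can be checked with the plain-text _analyze API (a quick sketch against the index created above):
curl -s -XGET 'localhost:9200/test/_analyze?analyzer=index_ngram&pretty' -d 'Lakeshore Dr'
curl -s -XGET 'localhost:9200/test/_analyze?analyzer=search_ngram&pretty' -d 'lake shore dr'
The first call should show the lowercased edge n-grams produced at index time, and the second a single lowercased keyword token for the whole query string.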
Thanks all.

Related

Elasticsearch: index a field with keyword tokenizer but without stopwords

I am looking for a way to search company names with keyword tokenization but without stopwords.
For example: the indexed company name is "Hansel und Gretel Gmbh."
Here "und" and "Gmbh" are stop words for the company name.
If the search term is "Hansel Gretel", that document should be found.
If the search term is "Hansel", no document should be found. And if the search term is "hansel gmbh", no document should be found either.
I have tried to combine the keyword tokenizer with stopwords in a custom analyzer, but it didn't work (as expected, I guess).
I have also tried to use a common terms query, but then "Hansel" started to hit (again, as expected).
Thanks in advance.
There are two ways to do this, a bad one and an ugly one. The first one uses regular expressions to remove stop words and trim spaces. There are a lot of drawbacks:
you have to handle whitespace tokenization (regexp \s+) and special-symbol (., ;) removal on your own
no highlighting is supported - the keyword tokenizer does not support it
case sensitivity is also a problem
normalizers (analyzers for keyword fields) are an experimental feature - poor support, few features
Here is a step-by-step example:
curl -XPUT "http://localhost:9200/test" -H 'Content-Type: application/json' -d'
{
"settings": {
"analysis": {
"normalizer": {
"custom_normalizer": {
"type": "custom",
"char_filter": ["stopword_char_filter", "trim_char_filter"],
"filter": ["lowercase"]
}
},
"char_filter": {
"stopword_char_filter": {
"type": "pattern_replace",
"pattern": "( ?und ?| ?gmbh ?)",
"replacement": " "
},
"trim_char_filter": {
"type": "pattern_replace",
"pattern": "(\\s+)$",
"replacement": ""
}
}
}
},
"mappings": {
"file": {
"properties": {
"name": {
"type": "keyword",
"normalizer": "custom_normalizer"
}
}
}
}
}'
Now we can check how our analyzer works (please note that passing a normalizer to the _analyze API is supported only in ES 6.x):
curl -XPOST "http://localhost:9200/test/_analyze" -H 'Content-Type: application/json' -d'
{
"normalizer": "custom_normalizer",
"text": "hansel und gretel gmbh"
}'
Now we are ready to index our document:
curl -XPUT "http://localhost:9200/test/file/1" -H 'Content-Type: application/json' -d'
{
"name": "hansel und gretel gmbh"
}'
And the last step is search:
curl -XGET "http://localhost:9200/test/_search" -H 'Content-Type: application/json' -d'
{
"query": {
"match" : {
"name" : {
"query" : "hansel gretel"
}
}
}
}'
Another approach is:
create a standard text analyzer with a stop words filter
use the _analyze API to filter out all stop words and special symbols
concatenate the tokens manually
send the term to ES as a keyword
Here is a step-by-step example:
curl -XPUT "http://localhost:9200/test" -H 'Content-Type: application/json' -d'
{
"settings": {
"analysis": {
"analyzer": {
"custom_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "custom_stopwords"]
}
}, "filter": {
"custom_stopwords": {
"type": "stop",
"stopwords": ["und", "gmbh"]
}
}
}
},
"mappings": {
"file": {
"properties": {
"name": {
"type": "text",
"analyzer": "custom_analyzer"
}
}
}
}
}'
Now we are ready to analyze our text:
POST test/_analyze
{
"analyzer": "custom_analyzer",
"text": "Hansel und Gretel Gmbh."
}
with the following result:
{
"tokens": [
{
"token": "hansel",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "gretel",
"start_offset": 11,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 2
}
]
}
The last step is token concatenation: hansel + gretel. The only drawback is that the analysis has to be done manually, with custom code.
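For that concatenation step, here is a minimal sketch of what the custom code could look like (it assumes a shell with jq available and the index from the example above; joining with a single space is an assumption - adapt it to however the stored keyword values were built):
# Analyze the input with the custom analyzer, then join the surviving tokens (jq is assumed to be installed)
TERM=$(curl -s -XPOST "http://localhost:9200/test/_analyze" -H 'Content-Type: application/json' -d'
{
"analyzer": "custom_analyzer",
"text": "Hansel und Gretel Gmbh."
}' | jq -r '[.tokens[].token] | join(" ")')
echo "$TERM"
The resulting string ("hansel gretel") is what would then be indexed into, and queried against, a keyword-mapped field - the "send the term to ES as a keyword" step above.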

Elasticsearch not using synonyms from synonym file

I am new to elasticsearch so before downvoting or marking as duplicate, please read the question first.
I am testing synonyms in Elasticsearch (v 2.4.6), which I have installed on Ubuntu 16.04. I am providing the synonyms through a file named synonym.txt, which I have placed in the config directory. I have created an index synonym_test as follows:
curl -XPOST localhost:9200/synonym_test/ -d '{
"settings": {
"analysis": {
"analyzer": {
"my_synonyms": {
"tokenizer": "whitespace",
"filter": ["lowercase","my_synonym_filter"]
}
},
"filter": {
"my_synonym_filter": {
"type": "synonym",
"ignore_case": true,
"synonyms_path" : "synonym.txt"
}
}
}
}
}'
The index contains two fields: id and some_text. I configured the field some_text with the custom analyzer as follows:
curl -XPUT localhost:9200/synonym_test/rulers/_mapping -d '{
"properties": {
"id": {
"type": "double"
},
"some_text": {
"type": "string",
"search_analyzer": "my_synonyms"
}
}
}'
Then I inserted some data:
curl -XPUT localhost:9200/synonym_test/external/5 -d '{
"id" : "5",
"some_text":"apple is a fruit"
}'
curl -XPUT localhost:9200/synonym_test/external/7 -d '{
"id" : "7",
"some_text":"english is spoken in england"
}'
curl -XPUT localhost:9200/synonym_test/external/8 -d '{
"id" : "8",
"some_text":"Scotland Yard is a popular game."
}'
curl -XPUT localhost:9200/synonym_test/external/9 -d '{
"id" : "9",
"some_text":"bananas contain potassium"
}'
The synonym.txt file contains the following:
"britain,england,scotland"
"fruit,bananas"
After doing all this, when I run a query for the term fruit (which should also return the text containing bananas, as they are synonyms in the file), I get only the text containing fruit.
{
"took":117,
"timed_out":false,
"_shards":{
"total":5,
"successful":5,
"failed":0
},
"hits":{
"total":1,
"max_score":0.8465736,
"hits":[
{
"_index":"synonym_test",
"_type":"external",
"_id":"5",
"_score":0.8465736,
"_source":{
"id":"5",
"some_text":"apple is a fruit"
}
}
]
}
}
I have also tried the following links, but none seem to have helped me: Synonym analyzer not working, Elasticsearch synonym analyzer not working, How to apply synonyms at query time instead of index time in Elasticsearch, how to configure the synonyms_path in elasticsearch, and many other links.
So, can anyone please tell me if I am doing anything wrong? Is there anything wrong with the settings or synonym file? I want the synonyms to work (query time) so that when I search for a term, I get all documents related to that term.
Please refer to the following URL: Custom Analyzer for how you should configure custom analyzers.
If we follow the guides from the above documentation, our schema will be as follows:
curl -XPOST localhost:9200/synonym_test/ -d '{
"settings": {
"analysis": {
"analyzer": {
"type": "custom"
"my_synonyms": {
"tokenizer": "whitespace",
"filter": ["lowercase","my_synonym_filter"]
}
},
"filter": {
"my_synonym_filter": {
"type": "synonym",
"ignore_case": true,
"synonyms_path" : "synonym.txt"
}
}
}
}
}'
This currently works on my Elasticsearch instance.
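A quick way to verify that the synonyms are actually being picked up after recreating the index with these settings (assuming synonym.txt is readable from the node's config directory) is to run the analyzer directly:
curl -XGET 'localhost:9200/synonym_test/_analyze?analyzer=my_synonyms&text=fruit&pretty'
If the synonym file is being read, the token list should contain both fruit and bananas; if only fruit comes back, the filter is not seeing the expected synonym entries.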

Enable stop token filter for standard analyzer

I'm using Elasticsearch 5.1 and I want to enable the stop token filter for the standard analyzer, which is disabled by default.
The documentation describes how to use it in a custom analyzer, but I would like to know how to enable it, since it's already included.
You have to configure the standard analyzer; see the example below for how to do it with a curl command (taken from the docs):
curl -XPUT 'localhost:9200/my_index?pretty' -d'
{
"settings": {
"analysis": {
"analyzer": {
"std_english": {
"type": "standard",
"stopwords": "_english_"
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"my_text": {
"type": "text",
"analyzer": "standard",
"fields": {
"english": {
"type": "text",
"analyzer": "std_english"
}
}
}
}
}
}
}'
curl -XPOST 'localhost:9200/my_index/_analyze?pretty' -d'
{
"field": "my_text",
"text": "The old brown cow"
}'
curl -XPOST 'localhost:9200/my_index/_analyze?pretty' -d'
{
"field": "my_text.english",
"text": "The old brown cow"
}'
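To see the effect end to end, one could also index a document and query the stopword-aware sub-field (a sketch against the mapping above; the _english_ list removes words such as "the" only in the .english sub-field):
curl -XPOST 'localhost:9200/my_index/my_type?pretty' -d'
{
"my_text": "The old brown cow"
}'
curl -XPOST 'localhost:9200/my_index/_refresh?pretty'
curl -XGET 'localhost:9200/my_index/_search?pretty' -d'
{
"query": { "match": { "my_text.english": "the" } }
}'
The last search should return no hits, because "the" is analyzed away on the query side as well, while the same match query against my_text would still find the document.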

elasticsearch mapping analyzer - GET not getting result

I am trying to create an analyzer which replaces special characters with whitespace and converts the text to uppercase. Afterwards, if I search in lowercase, it should still work.
Mapping Analyzer:
soundarya#soundarya-VirtualBox:~/Downloads/elasticsearch-2.4.0/bin$ curl -XPUT 'http://localhost:9200/aida' -d '{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_char_filter"
],
"filter": [
"uppercase"
]
}
},
"char_filter": {
"my_char_filter": {
"type": "pattern_replace",
"pattern": "(\\d+)-(?=\\d)",
"replacement": "$1 "
}
}
}
}
}
'
{"acknowledged":true}
soundarya#soundarya-VirtualBox:~/Downloads/elasticsearch-2.4.0/bin$ curl -XPOST 'http://localhost:9200/aida/_analyze?pretty' -d '{
"analyzer":"my_analyzer",
"text":"My name is Soun*arya?jwnne&yuuk"
}'
It tokenizes the words properly, replacing the special characters with whitespace. But when I search for a word from the text, it does not return any results.
soundarya#soundarya-VirtualBox:~/Downloads/elasticsearch-2.4.0/bin$ curl -XGET 'http://localhost:9200/aida/_search' -d '{
"query":{
"match":{
"text":"My"
}
}
}'
I am not getting any results from the above GET query. Searching in lowercase gives the same empty result:
soundarya#soundarya-VirtualBox:~/Downloads/elasticsearch-2.4.0/bin$ curl -XGET 'http://localhost:9200/aida/_search' -d '{
"query":{
"match":{
"text":"my"
}
}
}'
{"took":5,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
Can anyone help me with this! Thank you!
You don't seem to have indexed any data after creating your index. The call to _analyze will not index anything but simply show you how the content you send to ES would be analyzed.
First, you need to create your index by specifying a mapping in which you use the analyzer you've defined:
curl -XPUT 'http://localhost:9200/aida' -d '{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_char_filter"
],
"filter": [
"uppercase"
]
}
},
"char_filter": {
"my_char_filter": {
"type": "pattern_replace",
"pattern": "(\\d+)-(?=\\d)",
"replacement": "$1 "
}
}
}
},
"mappings": { <--- add a mapping type...
"doc": {
"properties": {
"text": { <--- ...with a field...
"type": "string",
"analyzer": "my_analyzer" <--- ...using your analyzer
}
}
}
}
}'
Then you can index a real document:
curl -XPOST 'http://localhost:9200/aida/doc' -d '{
"text": "My name is Soun*arya?jwnne&yuuk"
}'
Finally, you can search:
curl -XGET 'http://localhost:9200/aida/_search' -d '{
"query":{
"match":{
"text":"My"
}
}
}'
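If the search still returns nothing, it can help to confirm how the field analyzes the text and to force a refresh so the new document is visible to search (a quick sketch against the index above):
curl -XGET 'http://localhost:9200/aida/_analyze?field=text&pretty' -d 'My name is Soun*arya?jwnne&yuuk'
curl -XPOST 'http://localhost:9200/aida/_refresh?pretty'
The first call shows the uppercased tokens that actually get indexed for the text field, and the second makes freshly indexed documents searchable right away.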

How to use stopwords in Elasticsearch

I have an Elasticsearch 1.5 instance running on my server.
Specifically, I want to create three fields, which are:
1. name
2. description
3. nickname
I want to set up stopwords for the description and nickname fields, so that when I insert data into Elasticsearch the unwanted stop words are automatically removed. I have tried many times, but it is not working.
curl -X POST http://127.0.0.1:9200/tryoindex/ -d'
{
"settings": {
"analysis": {
"filter": {
"custom_english_stemmer": {
"type": "stemmer",
"name": "english"
},
"snowball": {
"type" : "snowball",
"language" : "English"
}
},
"analyzer": {
"custom_lowercase_stemmed": {
"tokenizer": "standard",
"filter": [
"lowercase",
"custom_english_stemmer",
"snowball"
]
}
}
}
},
"mappings": {
"test": {
"_all" : {"enabled" : true},
"properties": {
"text": {
"type": "string",
"analyzer": "custom_lowercase_stemmed"
}
}
}
}
}'
curl -X POST "http://localhost:9200/tryoindex/nama/1" -d '{
"text" : "Tryolabs running monkeys KANGAROOS and jumping elephants jum is your"
}'
curl "http://localhost:9200/tryoindex/nama/_search?pretty=1" -d '{
"query": {
"query_string": {
"query": "Tryolabs running monkeys KANGAROOS and jumping elephants jum is your",
"fields": ["text"]
}
}
}'
Change your analyzer part to
"analyzer": {
"custom_lowercase_stemmed": {
"tokenizer": "standard",
"filter": [
"stop",
"lowercase",
"custom_english_stemmer",
"snowball"
]
}
}
To verify the changes, use
curl -XGET 'localhost:9200/tryoindex/_analyze?analyzer=custom_lowercase_stemmed' -d 'testing this is stopword testing'
and observe the tokens
{"tokens":[{"token":"test","start_offset":0,"end_offset":7,"type":"<ALPHANUM>","position":1},{"token":"stopword","start_offset":16,"end_offset":24,"type":"<ALPHANUM>","position":4},{"token":"test","start_offset":25,"end_offset":32,"type":"<ALPHANUM>","position":5}]}%
PS: If you don't want to get the stemmed version of testing, then remove the stemming filters.
You need to use the stop token filter in your analyzer filter chain.
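A minimal sketch of such a chain, reusing the filters from the question (the my_stop name and its word list are just illustrative; "stopwords": "_english_" would use the built-in English list instead):
"settings": {
"analysis": {
"filter": {
"my_stop": {
"type": "stop",
"stopwords": ["is", "your", "and"]
}
},
"analyzer": {
"custom_lowercase_stemmed": {
"tokenizer": "standard",
"filter": [ "lowercase", "my_stop", "custom_english_stemmer", "snowball" ]
}
}
}
}
This is the same idea as the change shown above, just with a configurable word list instead of the default stop filter.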
