How to use stopword elasticsearch - elasticsearch

I have an Elasticsearch 1.5 running on my server,
specifically, I want/create three fields with is
1.name
2.description
3.nickname
I want setup stopword for description and nickname field when I insert the data on the Elasticsearch then stop word automatically remove unwanted stopword. I'm trying so many time but not working.
curl -X POST http://127.0.0.1:9200/tryoindex/ -d'
{
"settings": {
"analysis": {
"filter": {
"custom_english_stemmer": {
"type": "stemmer",
"name": "english"
},
"snowball": {
"type" : "snowball",
"language" : "English"
}
},
"analyzer": {
"custom_lowercase_stemmed": {
"tokenizer": "standard",
"filter": [
"lowercase",
"custom_english_stemmer",
"snowball"
]
}
}
}
},
"mappings": {
"test": {
"_all" : {"enabled" : true},
"properties": {
"text": {
"type": "string",
"analyzer": "custom_lowercase_stemmed"
}
}
}
}
}'
curl -X POST "http://localhost:9200/tryoindex/nama/1" -d '{
"text" : "Tryolabs running monkeys KANGAROOS and jumping elephants jum is your"
}'
curl "http://localhost:9200/tryoindex/nama/_search?pretty=1" -d '{
"query": {
"query_string": {
"query": "Tryolabs running monkeys KANGAROOS and jumping elephants jum is your",
"fields": ["text"]
}
}
}'

Change your analyzer part to
"analyzer": {
"custom_lowercase_stemmed": {
"tokenizer": "standard",
"filter": [
"stop",
"lowercase",
"custom_english_stemmer",
"snowball"
]
}
}
To verify the changes use
curl -XGET 'localhost:9200/tryoindex/_analyze?analyzer=custom_lowercase_stemmed' -d 'testing this is stopword testing'
and observe the tokens
{"tokens":[{"token":"test","start_offset":0,"end_offset":7,"type":"<ALPHANUM>","position":1},{"token":"stopword","start_offset":16,"end_offset":24,"type":"<ALPHANUM>","position":4},{"token":"test","start_offset":25,"end_offset":32,"type":"<ALPHANUM>","position":5}]}%
PS: If you don't want to get the stemmed version of testing, then remove the stemming filters.

You need to use the stop token filter in your analyzer filter chain.

Related

Search with asciifolding and UTF-8 characters in Elasticsearch

I am indexing all the names on a web page with characters with accents like "José". I want to be able to search the this name with "Jose" and "José".
How should I set up my index mapping and analyzer(s) for a simple index with one field "name"?
I set up an analyzer for the name field like this:
"analyzer": {
"folding": {
"tokenizer": "standard",
"filter": ["lowercase", "asciifolding"]
}
}
But it folds all accents into ASCII equivalents and ignores the accent when indexing the "é". I want the "é" char to be in the index and I want to be able to search "José" with either "José" or "Jose".
You need to preserve the original token with the accent. To achieve that you need to redefine your own asciifolding token filter, like this:
PUT /my_index
{
"settings" : {
"analysis" : {
"analyzer" : {
"folding" : {
"tokenizer" : "standard",
"filter" : ["lowercase", "my_ascii_folding"]
}
},
"filter" : {
"my_ascii_folding" : {
"type" : "asciifolding",
"preserve_original" : true
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"name": {
"type": "text",
"analyzer": "folding"
}
}
}
}
}
After that, both tokens jose and josé will be indexed and searchable
This is what I can think of to resolve the folding problem with diacritical marks:
Analyzer used:
{
"settings": {
"analysis": {
"analyzer": {
"folding": {
"tokenizer": "standard",
"filter": [ "lowercase", "asciifolding" ]
}
}
}
}
}
Below is the mapping to be used:
mappings used:
{
"properties": {
"title": {
"type": "string",
"analyzer": "standard",
"fields": {
"folded": {
"type": "string",
"analyzer": "folding"
}
}
}
}
}
The title field uses the standard analyzer and will contain the original word with diacritics in place.
The title.folded field uses the folding analyzer, which strips the diacritical marks.
Below is the search query I will use:
{
"query": {
"multi_match": {
"type": "most_fields",
"query": "esta loca",
"fields": [ "title", "title.folded" ]
}
}
}

Elastic Search - how to use language analyzer with UTF-8 filter?

I have a problem with ElasticSearch language analyzer. I am working on Lithuanian language, so I am using Lithuanian language analyzer. Analyzer works fine and I got all word cases I need. For example, I index Lithuania city "Klaipėda":
PUT /cities/city/1
{
"name": "Klaipėda"
}
Problem is that I also need to get a result, when I am searching "Klaipėda" only in Latin alphabet ("Klaipeda") and in all Lithuanian cases:
Nomanitive case: "Klaipeda"
Genitive case: "Klaipedos"
...
Locative case: "Klaipedoje"
"Klaipėda", "Klaipėdos", "Klaipėdoje" - works, but "Klaipeda", "Klaipedos", "Klaipedoje" - not.
My index:
PUT /cities
{
"mappings": {
"city": {
"properties": {
"name": {
"type": "string",
"analyzer": "lithuanian",
"fields": {
"folded": {
"type": "string",
"analyzer": "md_folded_analyzer"
}
}
}
}
}
},
"settings": {
"analysis": {
"analyzer": {
"md_folded_analyzer": {
"type": "lithuanian",
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
"lithuanian_stop",
"lithuanian_keywords",
"lithuanian_stemmer"
]
}
}
}
}
}
and search query:
GET /cities/_search
{
"query": {
"multi_match" : {
"type": "most_fields",
"query": "klaipeda",
"fields": [ "name", "name.folded" ]
}
}
}
What I am doing wrong? Thanks for help.
The technique you are using here is so-called multi-fields. The limitation of the underlying name.folded field is that you can't perform search against it - you can perform only sorting by name.folded and aggregation.
To make a way round this I've come up with the following set-up:
Separate fields set-up (to eliminate duplicates - just specify copy_to):
curl -XPUT http://localhost:9200/cities -d '
{
"mappings": {
"city": {
"properties": {
"name": {
"type": "string",
"analyzer": "lithuanian",
"copy_to": "folded",
},
"folded": {
"type": "string",
"analyzer": "md_folded_analyzer"
}
}
}
}
}'
Change the type of your analyzer to custom as it described here, because otherwise the asciifolding is not got into the config. And more important - asciifolding should go after all stemming / stop-words in Lithuanian language, because after folding the word can miss desired sense.
curl -XPUT http://localhost:9200/my_cities -d '
{
"settings": {
"analysis": {
"filter": {
"lithuanian_stop": {
"type": "stop",
"stopwords": "_lithuanian_"
},
"lithuanian_stemmer": {
"type": "stemmer",
"language": "lithuanian"
}
},
"analyzer": {
"md_folded_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"lithuanian_stop",
"lithuanian_stemmer",
"asciifolding"
]
}
}
}
}
}
Sorry I've eliminated lithuanian_keywords - it requires additional set-up, which I missed here. But I hope you've got the idea.

Enable stop token filter for standard analyzer

I'm using ElasticSearch 5.1 and I want to enable the Stop Token Filter for the standard analyzer which is disabled by default
The document describes how to use it in a custom analyzer, but I would like to know how to enable it, since it's already included.
you have to configure the standard analyzer, see the example below how to do it with curl command(taken from docs here):
curl -XPUT 'localhost:9200/my_index?pretty' -d'
{
"settings": {
"analysis": {
"analyzer": {
"std_english": {
"type": "standard",
"stopwords": "_english_"
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"my_text": {
"type": "text",
"analyzer": "standard",
"fields": {
"english": {
"type": "text",
"analyzer": "std_english"
}
}
}
}
}
}
}'
curl -XPOST 'localhost:9200/my_index/_analyze?pretty' -d'
{
"field": "my_text",
"text": "The old brown cow"
}'
curl -XPOST 'localhost:9200/my_index/_analyze?pretty' -d'
{
"field": "my_text.english",
"text": "The old brown cow"
}'

Elasticsearch not using "default_search" analyzer unless explicitly stated in query

From reading the Elasticsearch documents, I would expect that naming an analyzer 'default_search' would cause that analyzer to get used for all searches unless another analyzer is specified. However, if I define my index like so:
curl -XPUT 'http://localhost:9200/test/' -d '{
"settings": {
"analysis": {
"analyzer": {
"my_ngram_analyzer": {
"tokenizer": "my_ngram_tokenizer",
"filter": [
"lowercase"
],
"type" : "custom"
},
"default_search": {
"tokenizer" : "keyword",
"filter" : [
"lowercase"
]
}
},
"tokenizer": {
"my_ngram_tokenizer": {
"type": "nGram",
"min_gram": "3",
"max_gram": "100",
"token_chars": []
}
}
}
},
"mappings": {
"TestDocument": {
"dynamic_templates": [
{
"metadata_template": {
"match_mapping_type": "string",
"path_match": "*",
"mapping": {
"type": "multi_field",
"fields": {
"ngram": {
"type": "{dynamic_type}",
"index": "analyzed",
"analyzer": "my_ngram_analyzer"
},
"{name}": {
"type": "{dynamic_type}",
"index": "analyzed",
"analyzer": "standard"
}
}
}
}
}
]
}
}
}'
And then add a 'TestDocument':
curl -XPUT 'http://localhost:9200/test/TestDocument/1' -d '{
"name" : "TestDocument.pdf" }'
My queries are still running through the default analyzer. I can tell because this query gives me a hit:
curl -XGET 'localhost:9200/test/TestDocument/_search?pretty=true' -d '{
"query": {
"match": {
"name.ngram": {
"query": "abc.pdf"
}
}
}
}'
But does not if I specify the correct analyzer (using the 'keyword' tokenizer)
curl -XGET 'localhost:9200/test/TestDocument/_search?pretty=true' -d '{
"query": {
"match": {
"name.ngram": {
"query": "abc.pdf",
"analyzer" : "default_search"
}
}
}
}'
What am I missing to use "default_search" for searches unless stated otherwise in my query? Am I just misinterpreting expected behavior here?
In your dynamic template, you are setting the search and index analyzer by using "analyzer." It will only use the default as a last resort.
"index_analyzer":"analyzer_name" //sets the index analyzer
"analyzer":"analyzer_name" // sets both search and index
"search_analyzer":"...." // sets the search analyzer.

Indexing website/url in Elastic Search

I have a website field of a document indexed in elastic search. Example value: http://example.com . The problem is that when I search for example, the document is not included. How to map correctly the website/url field?
I created the index below:
{
"settings":{
"index":{
"analysis":{
"analyzer":{
"analyzer_html":{
"type":"custom",
"tokenizer": "standard",
"filter":"standard",
"char_filter": "html_strip"
}
}
}
}
},
"mapping":{
"blogshops": {
"properties": {
"category": {
"properties": {
"name": {
"type": "string"
}
}
},
"reviews": {
"properties": {
"user": {
"properties": {
"_id": {
"type": "string"
}
}
}
}
}
}
}
}
}
I guess you are using standard analyzer, which splits http://example.dom into two tokens - http and example.com. You can take a look http://localhost:9200/_analyze?text=http://example.com&analyzer=standard.
If you want to split url, you need to use different analyzer or specify our own custom analyzer.
You can take a look how would be url indexed with simple analyzer - http://localhost:9200/_analyze?text=http://example.com&analyzer=simple. As you can see, now is url indexed as three tokens ['http', 'example', 'com']. If you don't want to index tokens like ['http', 'www'] etc, you can specify your analyzer with lowercase tokenizer (this is the one used in simple analyzer) and stop filter. For example something like this:
# Delete index
#
curl -s -XDELETE 'http://localhost:9200/url-test/' ; echo
# Create index with mapping and custom index
#
curl -s -XPUT 'http://localhost:9200/url-test/' -d '{
"mappings": {
"document": {
"properties": {
"content": {
"type": "string",
"analyzer" : "lowercase_with_stopwords"
}
}
}
},
"settings" : {
"index" : {
"number_of_shards" : 1,
"number_of_replicas" : 0
},
"analysis": {
"filter" : {
"stopwords_filter" : {
"type" : "stop",
"stopwords" : ["http", "https", "ftp", "www"]
}
},
"analyzer": {
"lowercase_with_stopwords": {
"type": "custom",
"tokenizer": "lowercase",
"filter": [ "stopwords_filter" ]
}
}
}
}
}' ; echo
curl -s -XGET 'http://localhost:9200/url-test/_analyze?text=http://example.com&analyzer=lowercase_with_stopwords&pretty'
# Index document
#
curl -s -XPUT 'http://localhost:9200/url-test/document/1?pretty=true' -d '{
"content" : "Small content with URL http://example.com."
}'
# Refresh index
#
curl -s -XPOST 'http://localhost:9200/url-test/_refresh'
# Try to search document
#
curl -s -XGET 'http://localhost:9200/url-test/_search?pretty' -d '{
"query" : {
"query_string" : {
"query" : "content:example"
}
}
}'
NOTE: If you don't like to use stopwords here is interesting article stop stopping stop words: a look at common terms query

Resources