How to switch on Elasticsearch stemming - elasticsearch

I don't know how to turn on Elasticsearch English word stemming, and I couldn't find a clear example of how to do it.
Here is what I did
Creating the index
PUT /staff
{
"settings" : {
"analysis": {
"analyzer": {
"standard": {
"type": "standard"
}
}
}
}
}
Adding document
PUT /staff/list/jason
{
"Title" : "searches"
}
When I search for search
GET /staff/list/_search?q=search
the result doesn't appear.
What index settings should I use to make stemming work?
Many thanks in advance.

Please note that the default Elasticsearch analyzer does not support stemming.
In order to support stemming you may need to create a custom analyzer.
Here is how you do it:
Create the index and define an analyzer called my_analyzer
PUT /staff
{
"settings" : {
"analysis": {
"filter": {
"filter_snowball_en": {
"type": "snowball",
"language": "English"
}
},
"analyzer": {
"my_analyzer": {
"filter": [
"lowercase",
"filter_snowball_en"
],
"type": "custom",
"tokenizer": "whitespace"
}
}
}
}
}
Configure a mapping that assigns my_analyzer to the title field of the list type
PUT /staff/_mapping/list
{
"list": {
"properties": {
"title": {
"type": "string",
"analyzer": "my_analyzer"
}
}
}
}
Index documents
PUT /staff/list/jason
{
"title": "searches"
}
PUT /staff/list/debby
{
"title": "searched open"
}
Search and see stemmed results
GET staff/list/_search
{
"query": {
"query_string": {
"query": "title:opened"
}
}
}
Result
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "staff",
"_type": "list",
"_id": "debby",
"_score": 1,
"_source": {
"title": "open"
}
}]
}
}
As you can see in the search results, the debby document, which contains the term open, was returned even though we were searching for opened.
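To verify what my_analyzer emits, you can run it directly with the _analyze API (a quick check against the index defined above; on older Elasticsearch versions the analyzer and text are passed as query-string parameters rather than in a body):
GET /staff/_analyze
{
  "analyzer": "my_analyzer",
  "text": "searched opened"
}
The returned tokens are the stems search and open, which is why documents and queries meet on the same terms.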
Hope that helps.

When you create the index, you are doing nothing (just re-declaring the standard analyzer).
The standard analyzer is the default that Elasticsearch uses, and it does not stem any words.
You need to map the fields to their respective analyzers at your index creation (mapping documentation):
PUT /staff
{
"mappings": {
"list": {
"properties": {
"Title": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
I guess the english analyzer fits your case (it uses the standard tokenizer).
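You can confirm the stemming behaviour of the built-in english analyzer without creating an index at all (a minimal check; on older versions pass analyzer and text as query-string parameters):
GET /_analyze
{
  "analyzer": "english",
  "text": "searches"
}
The single token returned is search, so a query for search will now match a document containing searches.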

Related

How do I search documents with their synonyms in Elasticsearch?

I have an index with some documents. These documents have the field name. But now my documents can have several names, and the number of names a document can have is uncertain. A document can have only one name, or there can be 10 names for one document.
The question is: how do I organize my index, documents, and query so that I can search for one document by its different names?
For example, there's a document with the names "automobile", "automobil", and "自動車". Whenever I query one of these names, I should get this document. Can I create some kind of array of these names and build a query to search for each one? Or is there a more appropriate way to do this?
Tldr;
It feels like you are looking for something like synonyms?
Solution
In the following example I am creating an index with a specific text analyser.
This analyser handles automobile, automobil, and 自動車 as the same token.
PUT /74472994
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "standard",
"filter": ["synonym" ]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms": [ "automobile, automobil, 自動車" ]
}
}
}
}
},
"mappings": {
"properties": {
"name":{
"type": "text",
"analyzer": "synonym"
}
}
}
}
POST /74472994/_doc
{
"name": "automobile"
}
which allows me to perform the following requests:
GET /74472994/_search
{
"query": {
"match": {
"name": "automobil"
}
}
}
GET /74472994/_search
{
"query": {
"match": {
"name": "自動車"
}
}
}
And always get:
{
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.7198386,
"hits": [
{
"_index": "74472994",
"_id": "ROfyhoQBcn6Q8d0DlI_z",
"_score": 1.7198386,
"_source": {
"name": "automobile"
}
}
]
}
}
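You can see the synonym expansion directly with the _analyze API (a quick check against the index created above):
GET /74472994/_analyze
{
  "analyzer": "synonym",
  "text": "automobil"
}
All three synonyms come back as tokens at the same position, which is why any one of them matches a document indexed with another.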

How to exclude asterisks while searching with analyzer

I need to search by an array of values, and each value can be either simple text or text with asterisks (*).
For example:
["MYULTRATEXT"]
And I have the following index (I have a really big index, so I will simplify it):
................
{
"settings": {
"analysis": {
"char_filter": {
"asterisk_remove": {
"type": "pattern_replace",
"pattern": "(\\d+)*(?=\\d)",
"replacement": "1$"
}
},
"analyzer": {
"custom_search_analyzer": {
"char_filter": [
"asterisk_remove"
],
"type": "custom",
"tokenizer": "keyword"
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"name": {
"type": "text",
"analyzer":"keyword",
"search_analyzer": "custom_search_analyzer"
},
......................
And all the data in the index is stored with asterisks (*), e.g.:
curl -X PUT "localhost:9200/locations/_doc/2?pretty" -H 'Content-Type: application/json' -d'
{
"name" : "MY*ULTRA*TEXT"
}
I need to get back exactly the same name value when I search by the string MYULTRATEXT:
curl -XPOST 'localhost:9200/locations/_search?pretty' -d '
{
"query": { terms: { "name": ["MYULTRATEXT"] } }
}'
It should return MY*ULTRA*TEXT, but it does not work, and I can't find a workaround. Any thoughts?
I tried pattern_replace, but it seems like I am doing something wrong or missing something here.
So I need to replace every * with an empty string while searching.
There appears to be a problem with the regex you provided and the replacement pattern.
I think what you want is:
"char_filter": {
"asterisk_remove": {
"type": "pattern_replace",
"pattern": "(\\w+)\\*(?=\\w)",
"replacement": "$1"
}
}
Note the following changes:
\d => \w (match word characters instead of only digits)
escape * since asterisks have a special meaning for regexes
1$ => $1 ($<GROUPNUM> is how you reference captured groups)
To see how Elasticsearch will analyze the text against an analyzer, or to check that you defined an analyzer correctly, Elasticsearch has the ANALYZE API endpoint that you can use: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
If you try this API with your current definition of custom_search_analyzer, you will find that "MY*ULTRA*TEXT" is analyzed to "MY*ULTRA*TEXT" and not "MYULTRATEXT" as you intend.
I have a personal app that I use to more easily interact with and visualize the results of the ANALYZE API. I tried your example and you can find it here: Elasticsearch Analysis Inspector.
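To experiment without touching your index, you can also define the corrected char filter inline in an _analyze request (a sketch; the inline-definition form of _analyze is available in recent Elasticsearch versions):
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "(\\w+)\\*(?=\\w)",
      "replacement": "$1"
    }
  ],
  "text": "MY*ULTRA*TEXT"
}
The single keyword token should come back as MYULTRATEXT.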
This might help you - your regex pattern is the issue.
You want to replace all * occurrences with an empty string; the pattern below will do the trick:
PUT my_index
{
"mappings": {
"doc": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer":"my_analyzer"
}
}
}
},
"settings": {
"analysis": {
"filter": {
"asterisk_remove": {
"type": "pattern_replace",
"pattern": "(?<=\\w)(\\*)(?=\\w)",
"replacement": ""
}
},
"analyzer": {
"my_analyzer": {
"filter": [
"lowercase",
"asterisk_remove"
],
"type": "custom",
"tokenizer": "keyword"
}
}
}
}
}
Analyze query
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": ["MY*ULTRA*TEXT"]
}
Results of analyze query
{
"tokens": [
{
"token": "myultratext",
"start_offset": 0,
"end_offset": 13,
"type": "word",
"position": 0
}
]
}
Post a document
POST my_index/doc/1
{
"name" : "MY*ULTRA*TEXT"
}
Search query
GET my_index/_search
{
"query": {
"match": {
"name": "MYULTRATEXT"
}
}
}
Or
GET my_index/_search
{
"query": {
"match": {
"name": "myultratext"
}
}
}
Results search query
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.2876821,
"hits": [
{
"_index": "my_index",
"_type": "doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"name": "MY*ULTRA*TEXT"
}
}
]
}
}
Hope it helps

Latinise tokens at query time

I need to latinise the query tokens that I use when querying (or filtering). I can do this at the application level, but I was wondering if Elasticsearch provides an out-of-the-box solution.
I'm using ES 1.7.5 (as a service)
By default Elasticsearch will use the same analyzer at index time and at query time, but it is possible to specify a search_analyzer which will only be used at query time.
Let's take a look at the following example:
# First we define an analyzer called `latinize` which folds non-ASCII characters.
PUT books
{
"settings": {
"analysis": {
"analyzer": {
"latinize": {
"tokenizer": "standard",
"filter": ["asciifolding"]
}
}
}
},
"mappings": {
"book": {
"properties": {
"name": {
"type": "string",
"analyzer": "standard", # We use the standard analyzer at index time.
"search_analyzer": "latinize" # But we use the latinize analyzer at query time.
}
}
}
}
}
# Now let's create a document and search for it with a non latinized string.
POST books/book
{
"name": "aaoaao"
}
POST books/_search
{
"query": {
"match": {
"name": "ääöääö"
}
}
}
And bam! There is our document.
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.30685282,
"hits": [
{
"_index": "books",
"_type": "book",
"_id": "AVkIXdNyDpmDHTvI6Cp1",
"_score": 0.30685282,
"_source": {
"name": "aaoaao"
}
}
]
}
}
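You can also check the folding in isolation with the _analyze API (a minimal check; on ES 1.x the analyzer and text are passed as query-string parameters instead of a JSON body):
GET books/_analyze
{
  "analyzer": "latinize",
  "text": "ääöääö"
}
The output token is aaoaao, exactly what the standard analyzer indexed for the document above.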

Find concatenated words in Elasticsearch

Say I have indexed this data
song:{
title:"laser game"
}
but the user is searching for
lasergame
How would you go about mapping/indexing/querying for this?
This is kind of a tricky problem.
1) I guess the most effective way might be to use the compound token filter, with a word list made up of words you think users might concatenate.
"settings": {
"analysis": {
"analyzer": {
"concatenate_split": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"myFilter"
]
}
},
"filter": {
"myFilter": {
"type": "dictionary_decompounder",
"word_list": [
"laser",
"game",
"lean",
"on",
"die",
"hard"
]
}
}
}
}
After applying the analyzer, lasergame will be split into laser and game while also being kept as lasergame, so a search for any of those words will return the document.
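Assuming the settings above are applied to an index called songs (a hypothetical name for illustration), you can watch the decompounding happen with the _analyze API:
GET songs/_analyze
{
  "analyzer": "concatenate_split",
  "text": "lasergame"
}
The response contains the tokens lasergame, laser, and game at the same position.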
2) Another approach could be concatenating the whole title, using a pattern_replace char filter to strip out the spaces.
{
"index" : {
"analysis" : {
"char_filter" : {
"my_pattern":{
"type":"pattern_replace",
"pattern":"\\s+",
"replacement":""
}
},
"analyzer" : {
"custom_with_char_filter" : {
"tokenizer" : "standard",
"char_filter" : ["my_pattern"]
}
}
}
}
}
You need to use multi-fields with this approach (a mapping sketch follows below); with this pattern, laser game will be indexed as lasergame and your query will work.
The problem here is that laser game play will be indexed as lasergameplay, and a search for lasergame won't return anything, so you might want to consider using a prefix query or wildcard query for this.
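A minimal multi-field mapping along those lines (the index, type, and field names are hypothetical; it pairs the standard analyzer with the custom_with_char_filter analyzer defined above):
PUT songs
{
  "mappings": {
    "song": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "standard",
          "fields": {
            "concatenated": {
              "type": "string",
              "analyzer": "custom_with_char_filter"
            }
          }
        }
      }
    }
  }
}
A query for lasergame would then target title.concatenated, while ordinary queries keep using title.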
3) This might not make sense, but you could also use the synonym filter, if you think users often concatenate certain words.
Hope this helps!
The easiest solution would be using nGrams. That would be a base to start working from and could be tweaked to meet your needs. But here you go:
Mappings
PUT /test
{
"settings": {
"analysis": {
"analyzer": {
"myAnalyzer": {
"type": "custom",
"tokenizer": "nGram",
"filter": [
"asciifolding",
"lowercase"
]
}
}
}
},
"mappings": {
"sample": {
"properties": {
"myField": {
"type": "string",
"analyzer": "myAnalyzer"
}
}
}
}
}
Test document
PUT /test/sample/1
{
"myField": "laser game"
}
Query
GET /test/_search
{
"query": {
"match": {
"myField": "lasergame"
}
}
}
Results
{
"took": 47,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.2161999,
"hits": [
{
"_index": "test",
"_type": "sample",
"_id": "1",
"_score": 0.2161999,
"_source": {
"myField": "laser game"
}
}
]
}
}
This analyzer will create lots of ngrams in your index (with the tokenizer's default settings, grams of one and two characters such as l, la, a, as, and so on). Both lasergame and laser game produce largely overlapping tokens, so the query finds your document as you'd expect.
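You can inspect exactly which grams are produced (a quick check against the test index above):
GET /test/_analyze
{
  "analyzer": "myAnalyzer",
  "text": "lasergame"
}
If you want longer grams, give the tokenizer explicit min_gram and max_gram settings instead of relying on the defaults.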

elasticsearch: How to rank first appearing words or phrases higher

For example, if I have the following documents:
1. Casa Road
2. Jalan Casa
Say my query term is "cas"... on searching, both documents get the same score. I want the one where casa appears earlier (i.e. document 1 here) to rank first in my query output.
I am using an edgeNGram analyzer. Also, I am using aggregations, so I cannot use the normal sorting that happens after querying.
You can use the Bool Query to boost the items that start with the search query:
{
  "bool": {
    "must": {
      "match": { "name": "cas" }
    },
    "should": {
      "prefix": { "name": "cas" }
    }
  }
}
I'm assuming the values you gave are in the name field, and that that field is not analyzed. If it is analyzed, maybe look at this answer for more ideas.
The way it works is:
Both documents will match the query in the must clause, and will receive the same score for that. A document won't be included if it doesn't match the must query.
Only the document with the term starting with cas will match the query in the should clause, causing it to receive a higher score. A document won't be excluded if it doesn't match the should query.
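Wrapped in a complete request, that looks like this (my_index is a hypothetical index name):
GET /my_index/_search
{
  "query": {
    "bool": {
      "must": { "match": { "name": "cas" } },
      "should": { "prefix": { "name": "cas" } }
    }
  }
}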
This might be a bit more involved, but it should work.
Basically, you need the position of the term within the text itself and also the number of terms in the text. The actual scoring is computed using scripts, so you need to enable dynamic scripting in the elasticsearch.yml config file:
script.engine.groovy.inline.search: on
This is what you need:
a mapping that uses term_vector set to with_positions, an edgeNGram analyzer, and a sub-field of type token_count:
PUT /test
{
"mappings": {
"test": {
"properties": {
"text": {
"type": "string",
"term_vector": "with_positions",
"index_analyzer": "edgengram_analyzer",
"search_analyzer": "keyword",
"fields": {
"word_count": {
"type": "token_count",
"store": "yes",
"analyzer": "standard"
}
}
}
}
}
},
"settings": {
"analysis": {
"filter": {
"name_ngrams": {
"min_gram": "2",
"type": "edgeNGram",
"max_gram": "30"
}
},
"analyzer": {
"edgengram_analyzer": {
"type": "custom",
"filter": [
"standard",
"lowercase",
"name_ngrams"
],
"tokenizer": "standard"
}
}
}
}
}
test documents:
POST /test/test/1
{"text":"Casa Road"}
POST /test/test/2
{"text":"Jalan Casa"}
the query itself:
GET /test/test/_search
{
"query": {
"bool": {
"must": [
{
"function_score": {
"query": {
"term": {
"text": {
"value": "cas"
}
}
},
"script_score": {
"script": "termInfo=_index['text'].get('cas',_POSITIONS);wordCount=doc['text.word_count'].value;if (termInfo) {for(pos in termInfo){return (wordCount-pos.position)/wordCount}};"
},
"boost_mode": "sum"
}
}
]
}
}
}
and the results:
"hits": {
"total": 2,
"max_score": 1.3715843,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 1.3715843,
"_source": {
"text": "Casa Road"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.8715843,
"_source": {
"text": "Jalan Casa"
}
}
]
}
