Keep non-stemmed tokens on Elasticsearch

I'm using a stemmer (for Brazilian Portuguese) when I index documents in Elasticsearch. This is what my default analyzer looks like (never mind minor mistakes here, because I've copied this by hand from my code on the server):
{
  "analysis": {
    "filter": {
      "my_asciifolding": {
        "type": "asciifolding",
        "preserve_original": true
      },
      "stop_pt": {
        "type": "stop",
        "ignore_case": true,
        "stopwords": "_brazilian_"
      },
      "stemmer_pt": {
        "type": "stemmer",
        "language": "brazilian"
      }
    },
    "analyzer": {
      "default": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "my_asciifolding",
          "stop_pt",
          "stemmer_pt"
        ]
      }
    }
  }
}
I haven't really touched my type mappings (apart from a few numeric fields I've declared as "type": "long"), so I expect most fields to use the default analyzer I've specified above.
This works as expected, but some users are frustrated because, since tokens are stemmed, the query "vulnerabilities" and the query "vulnerable" return the same results; this is misleading because they expect results with an exact match to be ranked first.
What is the standard way (if any) to do this in Elasticsearch? (Maybe keep the unstemmed tokens in the index as well as the stemmed tokens?) I'm using version 1.5.1.

I ended up using the "fields" property (multi-fields) to index my attributes in different ways. I'm not sure whether this is optimal, but this is how I'm handling it right now:
Add another analyzer (I called it "no_stem_analyzer") with all the filters that the "default" analyzer has, minus the stemmer.
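Roughly, that analyzer definition looks like this (a sketch reusing the filter names from my settings above, with "stemmer_pt" dropped):
"no_stem_analyzer": {
  "type": "custom",
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    "my_asciifolding",
    "stop_pt"
  ]
}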
For each attribute where I want to keep both the non-stemmed and the stemmed variants, I did this (example for the field "DESCRIPTION"):
"mappings":{
"_default_":{
"properties":{
"DESCRIPTION":{
"type"=>"string",
"fields":{
"no_stem":{
"type":"string",
"index":"analyzed",
"analyzer":"no_stem_analyzer"
},
"stemmed":{
"type":"string",
"index":"analyzed",
"analyzer":"default"
}
}
}
},//.. other attributes here
}
}
At search time (using a query_string query) I must also indicate, via the "fields" parameter, that I want to search all sub-fields (e.g. "DESCRIPTION.*").
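A sketch of such a query (the index name and the query text here are placeholders):
GET my_index/_search
{
  "query": {
    "query_string": {
      "query": "vulnerable",
      "fields": ["DESCRIPTION.*"]
    }
  }
}
A document that matches on the unstemmed DESCRIPTION.no_stem sub-field as well as the stemmed one accumulates a higher score, which pushes exact matches up the ranking.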
I also based my approach on this answer: "elasticsearch customize score for synonyms/stemming".

Related

Elasticsearch: search with wildcard and custom analyzer

Requirement: search for special characters in a text field.
My solution so far: use a wildcard query with a custom analyzer. I want to use wildcards because it seems the easiest way to do partial searches in a long string with multiple search keys. See the ES query below.
I have an index called "invoices", and it has documents with one of the fields as
"searchString" : "I000010-1 000010 3901 North Saginaw Road add 2 Midland MI 48640 US MS Dhoni MSD-Company MSD (777) 777-7777 (333) 333-3333 sandeep#xyz.io msd-company msdhoni Dhoni, MS (3241480)"
Note: This field acts as the deprecated _all field in ES.
Index Mapping for this field:
"searchString": {"type": "text","analyzer": "multi_level_analyzer"},
Analyzer settings:
PUT invoices
{
  "settings": {
    "analysis": {
      "analyzer": {
        "multi_level_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
My query looks something like this:
GET invoices/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "wildcard": {
            "searchString": {
              "value": "msd-company*",
              "boost": 1.0
            }
          }
        },
        {
          "wildcard": {
            "searchString": {
              "value": "Saginaw*",
              "boost": 1.0
            }
          }
        }
      ]
    }
  }
}
My question:
Earlier, when I was not using a custom analyzer, the above query worked, but I was not able to search for words with special characters like "msd-company".
After attaching the custom analyzer (multi_level_analyzer), the above query fails to return any results. I changed the wildcard query and prepended an asterisk to the search key, and for some reason it works now (I referred to this answer).
I want to know the impact of using "*msd-company*" instead of "msd-company*" in the wildcard query on the text field.
How can I still use the wildcard query "msd-company*" with a custom analyzer?
I'm open to suggestions for any other approach to my problem statement.
I have solved my problem by changing the mapping of that field to this:
"searchString": {"type": "text","analyzer": "multi_level_analyzer", "search_analyzer": "standard"},
But since wildcard queries are expensive, I would still like to know if there exists a better solution to satisfy my search use case.
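For context, the index creation with that change applied would look roughly like this (a sketch reusing the analyzer settings from above):
PUT invoices
{
  "settings": {
    "analysis": {
      "analyzer": {
        "multi_level_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "char_filter": ["html_strip"],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "searchString": {
        "type": "text",
        "analyzer": "multi_level_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}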

Cannot retrieve data which includes specific symbols in Kibana

I'm trying to use Kibana to retrieve comment data which includes some specific symbols like ? and 。. They are not common symbols.
I tried to use the escape character \ for them, with KQL like comment:\? or comment:\\?, but it doesn't work. Can anyone help?
When you create a sample doc and let ES auto-generate the mapping for you,
POST comments/_doc
{
  "comment": "?"
}
running
GET comments/_mapping
will get you
"comment":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
Now, the text type's analyzer is usually standard by default.
When we attempt to see how our non-standard chars got analyzed
GET comments/_analyze
{
  "text": "?",
  "analyzer": "standard"
}
the result is
{
  "tokens": []
}
meaning we cannot search for its contents using the standard-analyzed text field but need to
either define a different default analyzer
or define this analyzer in one of the comment's fields
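For the 1st option, a minimal sketch (the index name is a placeholder): the index-level default analyzer can be overridden at index creation time via analysis.analyzer.default:
PUT comments_default
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "whitespace"
        }
      }
    }
  }
}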
Going with the 2nd approach (since it's good practice to keep differently-analyzed fields separate),
PUT comments2
{
  "mappings": {
    "properties": {
      "comment": {
        "type": "text",
        "fields": {
          "whitespace_analyzed": {
            "type": "text",
            "analyzer": "whitespace"
          }
        }
      }
    }
  }
}
POST comments2/_doc
{
  "comment": "?"
}
After verifying
GET comments2/_analyze
{
  "text": "?",
  "analyzer": "whitespace"
}
we can do the following in KQL
comment.whitespace_analyzed:"?"
Note that there are a bunch of built-in analyzers to choose from but you're more than welcome to create your own.
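For example, a rough sketch of a custom analyzer (the index and analyzer names here are made up) that keeps symbols intact like whitespace but also lowercases tokens:
PUT comments3
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_whitespace": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}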

Full Text Search as well as Terms Search on same field of Elasticsearch

I'm from a MySQL background, so I don't know much about Elasticsearch and how it works.
Here are my requirements:
There will be a table of result records with a sorting option on every column. There will be a filter option where the user selects multiple values for multiple columns (e.g., City should be from City1, City2, City3 and Category should be from Cat2, Cat22, Cat6). There will also be a search bar where the user enters some text, and full-text search will be applied to some fields (e.g., City, Area, etc.).
Where I'm facing a problem is full-text search. I have tried some mappings, but every time I have to compromise on either full-text search or terms search, so I think there is no way to apply both kinds of search to the same field. But as I said, I don't know much about Elasticsearch, so if anyone has a solution, it will be appreciated.
Here is what I have applied currently, which enables sorting and terms search, but full-text search is not working:
{
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "string",
          "index": "not_analyzed"
        },
        "category": {
          "type": "string",
          "index": "not_analyzed"
        },
        "area": {
          "type": "string",
          "index": "not_analyzed"
        },
        "zip": {
          "type": "string",
          "index": "not_analyzed"
        },
        "state": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
You can update the mapping with multi-fields: two mappings, one for full-text search and another for terms search. Here's a sample mapping for city:
{
  "city": {
    "type": "string",
    "index": "not_analyzed",
    "fields": {
      "fulltext": {
        "type": "string"
      }
    }
  }
}
The default mapping is for terms search, so when a terms search is required, you can simply query the "city" field. When you need full-text search, the query must be performed on "city.fulltext". Hope this helps.
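For instance (rough sketches using the example values from the question), a terms query against the exact field, and a match query against the full-text sub-field:
{
  "query": {
    "terms": {
      "city": ["City1", "City2", "City3"]
    }
  }
}
{
  "query": {
    "match": {
      "city.fulltext": "some search text"
    }
  }
}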
Full-text search won't work on not_analyzed fields and sorting won't work on analyzed fields.
You need to use multi-fields.
It is often useful to index the same field in different ways for different purposes. This is the purpose of multi-fields. For instance, a string field could be mapped as a text field for full-text search, and as a keyword field for sorting or aggregations:
For example:
{
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            }
          }
        }
        ...
      }
    }
  }
}
Use dot notation to sort by city.raw:
{
  "query": {
    "match": {
      "city": "york"
    }
  },
  "sort": {
    "city.raw": "asc"
  }
}
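The raw sub-field also works for aggregations, e.g. a sketch of a terms aggregation over exact city values:
{
  "size": 0,
  "aggs": {
    "cities": {
      "terms": {
        "field": "city.raw"
      }
    }
  }
}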

Can I have multiple filters in an Elasticsearch index's settings?

I want an Elasticsearch index that simply stores "names" of features. I want to be able to issue phonetic queries and also type-ahead style queries separately. I would think I would be able to create one index with two analyzers and two filters; each analyzer could use one of the filters. But I do not seem to be able to do this.
Here is the index settings json I'm trying to use:
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "ngram"]
        }
      },
      "analyzer": {
        "phonetic_analyzer": {
          "tokenizer": "standard",
          "filter": "double_metaphone_filter"
        }
      },
      "filter": {
        "double_metaphone_filter": {
          "type": "phonetic",
          "encoder": "double_metaphone"
        }
      },
      "filter": {
        "ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      }
    }
  }
}
When I attempt to create an index with these settings:
http://hostname:9200/index/type
I get an HTTP 400, saying
Custom Analyzer [phonetic_analyzer] failed to find filter under name [double_metaphone_filter]
Don't get me wrong, I fully realize what that sentence means. I looked and looked for an erroneous comma or quote but I don't see any. Otherwise, everything is there and formatted correctly.
If I delete the phonetic analyzer, the index is created but ONLY with the autocomplete analyzer and ngram filter.
If I delete the ngram filter, the index is created but ONLY with the phonetic analyzer and phonetic filter.
I have a feeling I'm missing a fundamental concept of ES, like only one analyzer per index, or one filter per index, or I must have some other logical dependencies set up correctly, etc. It sure would be nice to have a logical diagram or complete API spec of the Elasticsearch infrastructure, i.e. any index can have 1..n analyzers, only 1 filter, query must need any one of bool, match, etc. But that unicorn does not seem to exist.
I see tons of documentation, blog posts, etc on how to do each of these functionalities, but with only one analyzer and one filter on the index. I'd really like to do this dual functionality on one index (for reasons out of scope).
Can someone offer some help, advice here?
You are just missing the proper formatting for your settings object. You cannot have two analyzer or filter keys, as there can only be one value per key in this settings map object. Providing a list of your filters seems to work just fine. When you were creating your index object, the second key was overriding the first.
Look here:
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"double_metaphone_filter": {
"type": "phonetic",
"encoder": "double_metaphone"
},
"ngram": {
"type": "ngram",
"min_gram": 2,
"max_gram": 15
}
},
"analyzer": {
"autocomplete_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "lowercase", "ngram"]
},
"phonetic_analyzer": {
"tokenizer": "standard",
"filter": "double_metaphone_filter"
}
}
}
}
I downloaded the plugin to confirm this works.
You can now test this out at the _analyze endpoint with a payload:
{
  "analyzer": "autocomplete_analyzer",
  "text": "Jonnie Smythe"
}

Overriding default keyword analysis for elasticsearch

I am trying to configure an Elasticsearch index to have a default indexing policy of analysis with the keyword analyzer, and then override it on some fields to allow them to be free-text analyzed. So, effectively, opt-in free-text analysis, where I explicitly specify in the mapping which fields are analyzed for free-text matching. My mapping definition looks like this:
PUT test_index
{
  "mappings": {
    "test_type": {
      "index_analyzer": "keyword",
      "search_analyzer": "standard",
      "properties": {
        "standard": {
          "type": "string",
          "index_analyzer": "standard"
        },
        "keyword": {
          "type": "string"
        }
      }
    }
  }
}
So standard should be an analyzed field, and keyword should be exact-match only. However, when I insert some sample data with the following command:
POST test_index/test_type
{
  "standard": "a dog in a rug",
  "keyword": "sheepdog"
}
I am not getting any matches against the following query:
GET test_index/test_type/_search?q=dog
However I do get matches against:
GET test_index/test_type/_search?q=*dog*
This makes me think that the standard field is not being analyzed. Does anyone know what I am doing wrong?
Nothing's wrong with the index created. Change your query to GET test_index/test_type/_search?q=standard:dog and it should return the expected results.
If you do not want to specify the field name in the query, update your mapping so that you provide the index_analyzer and search_analyzer values explicitly for each field, with no defaults. See below:
PUT test_index
{
  "mappings": {
    "test_type": {
      "properties": {
        "standard": {
          "type": "string",
          "index_analyzer": "standard",
          "search_analyzer": "standard"
        },
        "keyword": {
          "type": "string",
          "index_analyzer": "keyword",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
Now if you try GET test_index/test_type/_search?q=dog, you'll get the desired results.
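For the exact-match side, you can query the keyword-analyzed field directly, e.g. GET test_index/test_type/_search?q=keyword:sheepdog (a sketch based on the sample document above), which matches the term exactly as indexed.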
