How to search with the keyword analyzer? - elasticsearch

I have the keyword analyzer set as my default analyzer, like so:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
But now I can't search for anything, e.g.:
{
  "query": {
    "query_string": {
      "query": "cast"
    }
  }
}
Gives me 0 results, although "cast" is a common value in the indexed documents (http://gist.github.com/baelter/b0720a52ee5a27e27d3a).
Search for "*" works fine btw.
I only have explicit defaults in my mapping:
{
  "oceanography_point": {
    "_all": {
      "enabled": true
    },
    "properties": {}
  }
}
The index behaves as if no fields are included in _all, because field:value queries work fine.
Am I misusing the keyword analyzer?

With the keyword analyzer, you can only do an exact string match.
Let's assume that you have used the keyword analyzer with no filters.
In that case, for a string indexed as "Cast away in forest", neither a search for "cast" nor "away" will work. You need to search for the exact string "Cast away in forest" to match it. (Assuming no lowercase filter is used, you also need to get the case right.)
A better approach would be to use multi-fields to declare one copy as keyword-analyzed and the other as normally analyzed.
You can search on one of these fields and aggregate on the other.
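For example, a minimal sketch of such a multi-field mapping for the question's oceanography_point type (the station_name field and raw sub-field names here are hypothetical, using the ES 1.x string syntax the question uses):
{
  "oceanography_point": {
    "properties": {
      "station_name": {
        "type": "string",
        "fields": {
          "raw": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}
You would then run full-text queries against station_name and aggregate on station_name.raw.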

Okay, after some 15 hours of trial and error I can conclude that this works for search:
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "default": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
However, this breaks faceting, so I ended up using a dynamic template instead:
"dynamic_templates" : [
{
"strings_not_analyzed" : {
"match" : "*",
"match_mapping_type" : "string",
"mapping" : {
"type" : "string",
"index" : "not_analyzed"
}
}
}
],
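For context, since the snippet above is a fragment, here is a sketch of where it sits inside the type mapping from the question:
{
  "oceanography_point": {
    "dynamic_templates": [
      {
        "strings_not_analyzed": {
          "match": "*",
          "match_mapping_type": "string",
          "mapping": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    ],
    "properties": {}
  }
}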

Related

How to preserve original term during transliteration in Elasticsearch with ICU plugin?

I'm using the following ICU transform filter to perform transliteration:
"transliterate": {
"type": "icu_transform",
"id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC"
}
The current problem is that this filter replaces the original term in the index, so searching in the native language is not possible with a terms query like this:
{
  "terms": {
    "field": [
      "term"
    ],
    "boost": 1.0
  }
}
Is there any way to make the icu_transform filter produce two terms, the original one and the transliterated one?
If not, I think the optimal solution would be a mapping that copies the value to another field, with an analyzer for that field that omits the transliterate filter. Can you suggest something more efficient?
I'm using Elasticsearch 5.6.4
Multi-fields allow you to index the same source value to different fields in different ways. You can index to a field with the standard analyzer and to another field with an analyzer that applies the ICU transform filter. For example (here the latin analyzer is defined as a custom analyzer using the transliterate filter from the question, so the mapping is self-contained):
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "transliterate": {
          "type": "icu_transform",
          "id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC"
        }
      },
      "analyzer": {
        "latin": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["transliterate"]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "my_field": {
          "type": "text",
          "fields": {
            "latin": {
              "type": "text",
              "analyzer": "latin"
            }
          }
        }
      }
    }
  }
}
Then you can query the my_field or my_field.latin field.
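For instance, a minimal match query sketch against the transliterated sub-field (the search term here is purely illustrative):
{
  "query": {
    "match": {
      "my_field.latin": "moskva"
    }
  }
}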

How to handle wildcards in Elasticsearch structured queries

My use case requires querying our Elasticsearch domain with trailing wildcards. I wanted to get your opinion on the best practices for handling such wildcards in queries.
Do you think adding the following clauses is a good practice for the queries:
"query" : {
"query_string" : {
"query" : "attribute:postfix*",
"analyze_wildcard" : true,
"allow_leading_wildcard" : false,
"use_dis_max" : false
}
}
I've disallowed leading wildcards since they are a heavy operation. However, I wanted to know how costly analyzing wildcards on every query request is in the long run. My understanding is that analyze_wildcard has no impact if the query doesn't actually contain any wildcards. Is that correct?
If you have the possibility of changing your mapping type and index settings, the right way to go is to create a custom analyzer with an edge-n-gram token filter that would index all prefixes of the attribute field.
curl -XPUT http://localhost:9200/your_index -d '{
  "settings": {
    "analysis": {
      "filter": {
        "edge_filter": {
          "type": "edgeNGram",
          "min_gram": 1,
          "max_gram": 15
        }
      },
      "analyzer": {
        "attr_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "edge_filter"]
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "attribute": {
          "type": "string",
          "analyzer": "attr_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}'
Then, when you index a document, the attribute field value (e.g. postfixing) will be indexed as the following tokens: p, po, pos, post, postf, postfi, postfix, postfixi, postfixin, postfixing.
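If you want to verify which tokens come out, you can run the analyzer through the _analyze API (a sketch, assuming the index and analyzer names from the example above):
curl -XGET 'http://localhost:9200/your_index/_analyze?analyzer=attr_analyzer&text=postfixing'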
Finally, you can then easily query the attribute field for the postfix value using a simple match query like this. No need to use an under-performing wildcard in a query string query.
{
  "query": {
    "match": {
      "attribute": "postfix"
    }
  }
}

Wildcard query over _all field on Elasticsearch

I'm trying to perform wildcard queries over the _all field. An example query could be:
GET index/type/_search
{
  "from": 0,
  "size": 1000,
  "query": {
    "bool": {
      "must": {
        "wildcard": {
          "_all": "*tito*"
        }
      }
    }
  }
}
The thing is that to use a wildcard query the _all field needs to be not_analyzed, otherwise the query won't work. See ES documentation for more info.
I tried to set the mappings over the _all field using this request:
PUT index
{
  "mappings": {
    "type": {
      "_all": {
        "enabled": true,
        "index_analyzer": "not_analyzed",
        "search_analyzer": "not_analyzed"
      },
      "_timestamp": {
        "enabled": "true"
      },
      "properties": {
        "someProp": {
          "type": "date"
        }
      }
    }
  }
}
But I'm getting the error analyzer [not_analyzed] not found for field [_all].
I want to know what I'm doing wrong and if there is another (better) way to perform this kind of queries.
Thanks.
Have you tried removing:
"search_analyzer": "not_analyzed"
Also, I wonder how well a wildcard across all properties will scale. Have you looked into NGrams? See the docs here.
Most probably you wanted to use the option
"index": "not_analyzed"
The index attribute of a string field (and _all is a string field) determines whether that field is analyzed or not.
"search_analyzer" determines which analyzer should be used for the user-entered query; it only applies if the index attribute is set to analyzed.
"index_analyzer" determines which analyzer should be used for documents at index time; again, it only applies if the index attribute is set to analyzed.

How to index both a string and its reverse?

I'm looking for a way to analyze the string "abc123" as ["abc123", "321cba"]. I've looked at the reverse token filter, but that only gets me ["321cba"]. Documentation on this filter is pretty sparse, only stating that
"A token filter of type reverse ... simply reverses each token."
(see http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-reverse-tokenfilter.html).
I've also tinkered with using the keyword_repeat filter, which gets me two instances. I don't know if that's useful, but for now all it does is reverse both instances.
How can I use the reverse token filter but keep the original token as well?
My analyzer:
{ "settings" : { "analysis" : {
"analyzer" : {
"phone" : {
"type" : "custom"
,"char_filter" : ["strip_non_numeric"]
,"tokenizer" : "keyword"
,"filter" : ["standard", "keyword_repeat", "reverse"]
}
}
,"char_filter" : {
"strip_non_numeric" : {
"type" : "pattern_replace"
,"pattern" : "[^0-9]"
,"replacement" : ""
}
}
}}}
Create and put an analyzer to reverse a string (say, reverse_analyzer):
PUT index_name
{
  "settings": {
    "analysis": {
      "analyzer": {
        "reverse_analyzer": {
          "type": "custom",
          "char_filter": ["strip_non_numeric"],
          "tokenizer": "keyword",
          "filter": ["standard", "keyword_repeat", "reverse"]
        }
      },
      "char_filter": {
        "strip_non_numeric": {
          "type": "pattern_replace",
          "pattern": "[^0-9]",
          "replacement": ""
        }
      }
    }
  }
}
Then, for a field (say phone_no), use a mapping like this (create a type and add the mapping for the phone field):
PUT index_name/type_name/_mapping
{
  "type_name": {
    "properties": {
      "phone_no": {
        "type": "string",
        "fields": {
          "reverse": {
            "type": "string",
            "analyzer": "reverse_analyzer"
          }
        }
      }
    }
  }
}
So phone_no is a multi-field, which will store a string and its reverse. If you index
phone_no: 911220
then in Elasticsearch there will be the fields
phone_no: 911220 and phone_no.reverse: 022119
so you can search and filter on either the reversed or the non-reversed field.
Hope this helps.
I don't believe you can do this directly, as I am unaware of any way to get the reverse token filter to also output the original.
However, you could use the fields parameter to index both the original and the reversed at the same time with no additional coding. You would then search both fields.
So let's say your field was called phone_number:
"phone_number": {
"type": "string",
"fields": {
"reverse": { "type": "string", "index": "phone" }
}
}
In this case we're indexing using the default analyzer (assume standard), plus also indexing into reverse with your custom analyzer phone, which reverses. You then issue your queries against both fields.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html
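For example, one way to hit both fields in a single query is a multi_match (a sketch; the phone number value is illustrative):
{
  "query": {
    "multi_match": {
      "query": "911220",
      "fields": ["phone_number", "phone_number.reverse"]
    }
  }
}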
I'm not sure it's possible to do this using the built-in set of token filters. I would recommend creating your own plugin. There is an ICU Analysis plugin supported by the Elasticsearch team that you can use as an example.
I wound up using the following two char_filters in my analyzer. It's an ugly abuse of regex, but it seems to work. It is limited to the first 20 numeric characters, but in my use case that is acceptable.
First it groups all numeric characters, then explicitly rebuilds the string with its own (numeric-only!) reverse. The space in the center of the replacement pattern then causes the tokenizer to split it into two tokens - the original and the reverse.
,"char_filter" : {
"strip_non_numeric" : {
"type" : "pattern_replace"
,"pattern" : "[^0-9]"
,"replacement" : ""
}
,"dupe_and_reverse" : {
"type" : "pattern_replace"
,"pattern" : "([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)"
,"replacement" : "$1$2$3$4$5$6$7$8$9$10$11$12$13$14$15$16$17$18$19$20 $20$19$18$17$16$15$14$13$12$11$10$9$8$7$6$5$4$3$2$1"
}
}

Elasticsearch: filter for a substring in the value of a document field?

I am new to Elasticsearch. I have the following mapping for a string field:
"ipAddress": {
"type": "string",
"store": "no",
"index": "not_analyzed",
"omit_norms": "true",
"include_in_all": false
}
A document with value in the ipAddress field looks like:
"ipAddress": "123.3.4.12 134.4.5.6"
Notice that in the above there are two IP addresses, separated by a blank.
Now I need to filter documents based on this field. This is an example filter value
123.3.4.12
And the filter value is always a single IP address as shown above.
I looked at the filters at
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-filters.html
and I cannot seem to find the right filter for this. I tried the term filter,
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "term": { "ipAddress": "123.3.4.12" }
      }
    }
  }
}
but it seems that it returns a document only when the filter value 100% matches the value of a document's field.
Can anyone help me out on this?
Update:
Based on John Petrone's suggestion, I got it working by defining a whitespace-tokenizer-based analyzer as follows:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "blank_sep_analyzer": {
            "tokenizer": "whitespace"
          }
        }
      }
    }
  },
  "mappings": {
    "ipAddress": {
      "type": "string",
      "store": "no",
      "index": "analyzed",
      "analyzer": "blank_sep_analyzer",
      "omit_norms": "true",
      "include_in_all": false
    }
  }
}
The problem is that the field is not analyzed, so if you have 2 IP addresses in it the term is actually the full field, e.g. "123.3.4.12 134.4.5.6".
I'd suggest a different approach - if you are always going to have lists of IP addresses separated by spaces, consider using the whitespace tokenizer, which creates a token at each whitespace boundary - that should produce several tokens that the IP address will then match:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-whitespace-tokenizer.html
Another approach could be storing the IP addresses as an array; then the current mapping would work. You would just have to separate the IP addresses when indexing the document.
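For example, a sketch of the array approach with the original not_analyzed mapping (the document and filter values are illustrative): index the field as
"ipAddress": ["123.3.4.12", "134.4.5.6"]
and the original term filter then matches, because each array element is indexed as its own not_analyzed token:
{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "term": { "ipAddress": "123.3.4.12" }
      }
    }
  }
}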
