Elasticsearch strange filter behaviour

I'm trying to replace a particular string inside a field, so I used a custom analyzer and a character filter just as described in the docs, but it didn't work.
Here are my index settings:
{
  "settings": {
    "analysis": {
      "char_filter": {
        "doule_colon_to_space": {
          "type": "mapping",
          "mappings": [ "::=> " ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "doule_colon_to_space" ],
          "tokenizer": "standard"
        }
      }
    }
  }
}
which should replace all double colons (::) in a field with spaces. I then update my mapping to use the analyzer:
{
  "posts": {
    "properties": {
      "id": {
        "type": "long"
      },
      "title": {
        "type": "string",
        "analyzer": "my_analyzer",
        "fields": {
          "simple": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}
Then I put a document in the index:
{
  "id": 1,
  "title": "Person::Bruce Wayne"
}
I then test whether the analyzer works, but it appears it doesn't: when I send https://localhost:/first_test/_analyze?analyzer=my_analyzer&text=Person::Someone+Close, I get two tokens back, 'PersonSomeone' (joined together) and 'Close'. Am I doing this right? Maybe I should escape the space somehow? I'm using Elasticsearch 1.3.4.

I think the whitespace in your char_filter mapping is being ignored. Try using the Unicode escape sequence for a single space instead:
"mappings": [ "::=>\\u0020"]
Update:
In response to your comment, the short answer is yes, the example is wrong. The docs do suggest that you can use a mapping character filter to replace a token with another one which is padded by whitespace, but the code disagrees.
The source code for the MappingCharFilterFactory uses this regex to parse the settings:
// source => target
private static Pattern rulePattern = Pattern.compile("(.*)\\s*=>\\s*(.*)\\s*$");
This regex matches (and effectively discards) any whitespace (\\s*) surrounding the second replacement token ((.*)), so it seems that you cannot use leading or trailing whitespace as part of your replacement mapping (though it could include interstitial whitespace). Even if the regex were different, the matched token is trim()ed, which would remove any leading or trailing whitespace anyway.
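Assuming the \u0020 workaround above is in place, re-running the same _analyze call as in the question should now return three separate tokens, Person, Someone and Close:
GET first_test/_analyze?analyzer=my_analyzer&text=Person::Someone+Close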

Related

Defining a custom tokenizer in Elasticsearch

This is how I am trying to define a custom tokenizer in ES:
"pattern": "[\-s+]",
but when I run this I get the response shown below:
"pattern" : """[-s+]""",
Notice that in the output I get additional quotes, """[-s+]""", at the beginning and the end. If we don't use any escape characters this works fine, but when using an escape character it results in these extra double quotes being appended. Any help?
\ is a reserved character in Lucene's regular expression syntax; you have to escape it.
https://www.elastic.co/guide/en/elasticsearch/reference/current/regexp-syntax.html
Please try it this way:
PUT test_varun
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[\\-s+]"
        }
      }
    }
  }
}
If that doesn't work, please attach an example input/output so I can reproduce it on my end.
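Once the index above exists, a quick way to verify the tokenizer is the _analyze API; this is only a sketch and the sample text is made up:
GET test_varun/_analyze
{
  "analyzer": "my_analyzer",
  "text": "foo-bar+baz"
}
Note that, as written, the character class [\\-s+] splits on the literal characters -, s and +; if splitting on whitespace was intended, the class would need \s instead (i.e. "[\\-\\s+]" in the JSON).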

Is a field named "language" somehow special?

In my query I have the following filter:
"term": {
"language": "en-us"
}
And it's not returning any results, even though there are a lot of docs with "language" = "en-us" and this field is correctly defined in the mapping. When I change the filter to, for example:
"term": {
"isPublic": true
}
then it correctly filters by the "isPublic" field.
My suspicion is that a field named "language" is somehow treated as special. Maybe it's a reserved keyword in the ES query DSL? I can't find it in the docs.
ES v2.4.0
Mapping of document:
"mappings": {
"contributor": {
"_timestamp": {},
"properties": {
"createdAt": {
"type": "date",
"format": "epoch_millis||dateOptionalTime"
},
"displayName": {
"type": "string"
},
"followersCount_en_us": {
"type": "long"
},
"followersCount_zh_cn": {
"type": "long"
},
"id": {
"type": "long"
},
"isPublic": {
"type": "boolean"
},
"language": {
"type": "string"
},
"photoUrl": {
"type": "string",
"index": "not_analyzed"
},
"role": {
"type": "string",
"store": true
},
"slug": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
The field language is nothing special; the behaviour should all be explained by the mapping. Several possible causes come to mind:
query analyzer != index analyzer
the analyzer first splits the value into two tokens, en and us, and then throws away short tokens, which would leave both the query and the index empty :-)
the field is not indexed, just stored.
the - is not a normal ASCII dash in the index or the query. I have seen crazy things happen when people paste queries from a word processor: quotes are no longer straight quotes, dashes become en dashes or em dashes, ü is not one character but two combined characters. A quick way to check the tokenization is sketched below.
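To ask Elasticsearch how it actually tokenizes the value for that field, a sketch like this could be used (the index name contributors is only a placeholder):
GET contributors/_analyze
{
  "field": "language",
  "text": "en-us"
}
If this comes back as two tokens, en and us, rather than a single en-us token, a term filter for "en-us" can never match.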
EDIT after mapping was added to the question:
The type string is analyzed with the standard analyzer, which splits text into tokens, in particular at dashes, so the field contains two tokens, "en" and "us". Your search is a term query, which should probably be called a token query, because it queries exactly that: the token as you write it, "en-us". But this token does not exist in the field.
There are two ways to remedy this:
set the field to not_analyzed and keep the query as is
change the query to a match query.
I would rather use (1), since the language field content is something like an ID and should not be analyzed; a mapping sketch for that option follows below.
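A minimal sketch of option (1) for ES 2.4, assuming the data is reindexed into a new index (named contributors_v2 here purely for illustration), since an existing field cannot be switched to not_analyzed in place:
PUT contributors_v2
{
  "mappings": {
    "contributor": {
      "properties": {
        "language": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
With that mapping, the original term filter matches the verbatim value en-us. Option (2) would instead keep the mapping and simply swap "term" for "match" in the existing query.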
More about the topic: "Why doesn’t the term query match my document?" on https://www.elastic.co/guide/en/elasticsearch/reference/2.4/query-dsl-term-query.html

Analyze all uppercase tokens in a field

I would like to analyze the value of a text field in two ways: using standard analysis, and using a custom analysis that indexes only the all-uppercase tokens in the text.
For example, if the value is "This WHITE cat is very CUTE.", the only tokens that should be indexed by the custom analysis are "WHITE" and "CUTE". For this, I am using the pattern capture token filter with the pattern "(\b[A-Z]+\b)+?", but it is indexing all tokens and not just the uppercase ones.
Is the pattern capture token filter the right one to use for this task? If yes, what am I doing wrong? If not, how do I get this done? Please help.
You should instead use a pattern_replace char_filter:
PUT test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "filter_lowercase": {
          "type": "pattern_replace",
          "pattern": "[A-Z][a-z]+|[a-z]+",
          "replacement": ""
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "filter_lowercase"
          ]
        }
      }
    }
  }
}
GET test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "This WHITE cat is very CUTE"
}
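Since the question asks for the field to be analyzed in two ways (standard plus uppercase-only), this could be wired up as a multi-field. The following is only a sketch for ES 7+ typeless mappings (on older versions the mapping would sit under a type name), and the index and field names test_multifield, body and uppercase_only are made up:
PUT test_multifield
{
  "settings": {
    "analysis": {
      "char_filter": {
        "filter_lowercase": {
          "type": "pattern_replace",
          "pattern": "[A-Z][a-z]+|[a-z]+",
          "replacement": ""
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [ "filter_lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "fields": {
          "uppercase_only": {
            "type": "text",
            "analyzer": "my_analyzer"
          }
        }
      }
    }
  }
}
Here body gets the default standard analysis, while body.uppercase_only keeps only the all-uppercase tokens.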

Semi-exact (complete) match in ElasticSearch

Is there a way to require a complete (though not necessarily exact) match in ElasticSearch?
For instance, if a field has the term "I am a little teapot short and stout", I would like to match on " i am a LITTLE TeaPot short and stout! " but not just "teapot short and stout". I've tried the term filter, but that requires an actual exact match.
If your "not necessarily exact" definition refers to uppercase/lowercase letters combination and the punctuation marks (like ! you have in your example), this would be a solution, not too simple and obvious tough:
The mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_keyword_lowercase": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "trim",
            "my_pattern_replace"
          ]
        }
      },
      "filter": {
        "my_pattern_replace": {
          "type": "pattern_replace",
          "pattern": "!",
          "replacement": ""
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "my_keyword_lowercase"
        }
      }
    }
  }
}
The idea here is the following:
use a keyword tokenizer to keep the text as is, not split into tokens
use the lowercase filter to get rid of the mixed uppercase/lowercase characters
use the trim filter to get rid of leading and trailing whitespace
use a pattern_replace filter to get rid of the punctuation. This is needed because the keyword tokenizer won't do anything to the characters inside the text. A standard analyzer would handle the punctuation, but it would also split the text into tokens, whereas you need it as is.
And this is the query you would use for the mapping above:
{
  "query": {
    "match": {
      "text": " i am a LITTLE TeaPot short and stout! "
    }
  }
}
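To see why this matches, you can run the analyzer over the query string (and, likewise, over the stored value). This is a sketch using the ES 2.x+ body form of _analyze, with my_index standing in for whatever index holds the mapping above:
GET my_index/_analyze
{
  "analyzer": "my_keyword_lowercase",
  "text": " i am a LITTLE TeaPot short and stout! "
}
Both strings should normalize to the same single token, i am a little teapot short and stout, which is why the match query finds the document.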

How to not-analyze in ElasticSearch?

I've got a field in an ElasticSearch index which I do not want to have analyzed, i.e. it should be stored and compared verbatim. The values will contain letters, numbers, whitespace, dashes, slashes and maybe other characters.
If I do not give an analyzer in my mapping for this field, the default still uses a tokenizer which hacks my verbatim string into chunks of words. I don't want that.
Is there a super simple analyzer which, basically, does not analyze? Or is there a different way of denoting that this field shall not be analyzed?
I only create the index, I don't do anything else. I can use analyzers like "english" for other fields, which seem to be built-in names for pre-configured analyzers. Is there a list of other names? Maybe there's one fitting my needs (namely, doing nothing with the input).
This is my mapping currently:
{
  "my_type": {
    "properties": {
      "my_field1": { "type": "string", "analyzer": "english" },
      "my_field2": { "type": "string" }
    }
  }
}
my_field1 is language-dependent; this seems to work. my_field2 shall be verbatim. I'd like to give an analyzer there which simply does not do anything.
A sample value for my_field2 would be "B45c 14/04".
"my_field2": {
"properties": {
"title": {
"type": "string",
"index": "not_analyzed"
}
}
}
Check here, https://www.elastic.co/guide/en/elasticsearch/reference/1.4/mapping-core-types.html, for further info.
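With a not_analyzed field, the verbatim value can then be matched with a term query; here is a minimal sketch, where my_index is a placeholder and the sample value is taken from the question:
GET my_index/_search
{
  "query": {
    "term": {
      "my_field2": "B45c 14/04"
    }
  }
}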
This is no longer true due to the removal of the string type (replaced by keyword and text) as described here. Instead you should use the keyword type with "index": true | false.
For example, OLD:
{
  "foo": {
    "type": "string",
    "index": "not_analyzed"
  }
}
becomes NEW:
{
  "foo": {
    "type": "keyword",
    "index": true
  }
}
This means the field is indexed, but since it is typed as keyword it is implicitly not analyzed. If you would like to have the field analyzed, you need to use the text type.
The keyword analyzer can also be used.
// don't actually use this, use "index": "not_analyzed" instead
{
  "my_type": {
    "properties": {
      "my_field1": { "type": "string", "analyzer": "english" },
      "my_field2": { "type": "string", "analyzer": "keyword" }
    }
  }
}
As noted here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keyword-analyzer.html, it makes more sense to mark those fields as not_analyzed.
But the keyword analyzer can be useful when it is set as the default for a whole index.
UPDATE: As mentioned in the comments, the string type is no longer supported in 5.x.
