Mapping analyser for splitting a string in Elasticsearch - elasticsearch

Is it possible to create a mapping analyser that splits a string into smaller parts based on character counts?
For example, let's say I have the string "ABCD1E2F34". It is a token constructed from multiple smaller codes, and I want to break it down into those codes again.
If I know for sure that:
- The first code is always 4 characters ("ABCD")
- The second is 3 characters ("1E2")
- The third is 1 character ("F")
- The fourth is 2 characters ("34")
Can I create a mapping analyser for a field that will map the string like this? If I set the field "bigCode" to the value "ABCD1E2F34", I would be able to access it like this:
bigCode.full ("ABCD1E2F34")
bigCode.first ("ABCD")
bigCode.second ("1E2")
...
Thanks a lot!

What do you think about the pattern tokenizer? I created a regex to split the string into tokens: (?<=(^\\w{4}))|(?<=^\\w{4}(\\w{3}))|(?<=^\\w{4}\\w{3}(\\w{1}))|(?<=^\\w{4}\\w{3}\\w{1}(\\w{2})). Then I created an analyzer like this:
PUT /myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "codeanalyzer": {
          "type": "pattern",
          "pattern": "(?<=(^\\w{4}))|(?<=^\\w{4}(\\w{3}))|(?<=^\\w{4}\\w{3}(\\w{1}))|(?<=^\\w{4}\\w{3}\\w{1}(\\w{2}))"
        }
      }
    }
  }
}
POST /myindex/_analyze?analyzer=codeanalyzer&text=ABCD1E2F34
And the result is tokenized data:
{
"tokens": [
{
"token": "abcd",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
},
{
"token": "1e2",
"start_offset": 4,
"end_offset": 7,
"type": "word",
"position": 1
},
{
"token": "f",
"start_offset": 7,
"end_offset": 8,
"type": "word",
"position": 2
},
{
"token": "34",
"start_offset": 8,
"end_offset": 10,
"type": "word",
"position": 3
}
]
}
You can also check the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-tokenizer.html
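For reference, the analyzer can be wired into the field mapping when the index is created. The sketch below uses the typeless 7.x-style syntax and adds a keyword sub-field so the untouched value stays searchable; the sub-field name is an assumption:
PUT /myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "codeanalyzer": {
          "type": "pattern",
          "pattern": "(?<=(^\\w{4}))|(?<=^\\w{4}(\\w{3}))|(?<=^\\w{4}\\w{3}(\\w{1}))|(?<=^\\w{4}\\w{3}\\w{1}(\\w{2}))"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "bigCode": {
        "type": "text",
        "analyzer": "codeanalyzer",
        "fields": {
          "full": { "type": "keyword" }
        }
      }
    }
  }
}
Note that this indexes the four pieces as tokens of bigCode (plus the verbatim value under bigCode.full), not as individually addressable sub-fields like bigCode.first; if you need named sub-fields, the splitting would have to happen before indexing, for example in the application or an ingest pipeline.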

Related

Elastic length filter

I am trying to take advantage of the length token filter in Elasticsearch:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-length-tokenfilter.html
In the example provided, it simply removes the tokens outside the length range.
But when I use it, it replaces those tokens with _.
Has anyone run into this problem?
I suppose I can add a character filter, but maybe there is some undocumented feature?
Example:
String: "i eat icecream"
If I apply the length filter with min = 3, max = 10, the tokens I get are:
_ eat icecream instead of eat icecream
Using your string "i eat icecream" and analyzing it with a length token filter, I get the result below (tested on Elasticsearch 7.x):
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "length",
      "min": 3,
      "max": 10
    }
  ],
  "text": "i eat icecream"
}
The tokens generated are
{
"tokens": [
{
"token": "eat",
"start_offset": 2,
"end_offset": 5,
"type": "word",
"position": 1
},
{
"token": "icecream",
"start_offset": 6,
"end_offset": 14,
"type": "word",
"position": 2
}
]
}
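So the filter removes the short token rather than replacing it. If you want the same behaviour at index time, here is a minimal settings sketch (the index, filter, and analyzer names are assumptions):
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_length": {
          "type": "length",
          "min": 3,
          "max": 10
        }
      },
      "analyzer": {
        "length_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "my_length"]
        }
      }
    }
  }
}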

Inconsistent results returned with whitespace in query

Using NEST.
I have the following code.
QueryContainerDescriptor<ProductIndex> q
var queryContainer = new QueryContainer();
queryContainer &= q.Match(m => m.Field(f => f.Code).Query(parameters.Code));
I would like both of these criteria:
code=FRUIT 12 //with space
code=FRUIT12 //no space
to return products 1 and 2.
Currently
I get products 1 and 2 if I set code=FRUIT 12 //with space
and I only get product 2 if I set code=FRUIT12 //no space
Sample data
Products
[
{
"id": 1,
"name": "APPLE",
"code": "FRUIT 12"
},
{
"id": 2,
"name": "ORANGE",
"code": "FRUIT12"
}
]
By default, a string field uses the standard tokenizer, which will emit a single token "FRUIT12" for the "FRUIT12" input.
You need to use a word_delimiter token filter in your field's analyzer to get the behavior you are expecting:
GET _analyze
{
  "text": "FRUIT12",
  "tokenizer": "standard"
}
gives
{
"tokens": [
{
"token": "FRUIT12",
"start_offset": 0,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 0
}
]
}
and
GET _analyze
{
  "text": "FRUIT12",
  "tokenizer": "standard",
  "filter": ["word_delimiter"]
}
gives
{
"tokens": [
{
"token": "FRUIT",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "12",
"start_offset": 5,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 1
}
]
}
If you add the word_delimiter token filter to your field, any search query on this field will also have the word_delimiter token filter enabled (unless you override it with the search_analyzer option in the mapping),
so the single-term query "FRUIT12" will be "translated" into the multi-term query ["FRUIT", "12"].

NEST Fluent DSL querying some URL field

I have a question related to NEST. Below are some documents in my ES.
As you can see, I have already inserted some entries in my ES. I tried a query like this:
var response = elastic.Search<ESIntegrationLog>(s => s
.Index("20160806")
.Type("esintegrationlog")
.Query(q =>
q.Term(p => p.CalledBy, "lazada")
)
.Sort(ss => ss.Descending(p => p.CalledOn))
.Take(300)
);
The outcome is just as I expected: I found the entry. But when I try to query by 'callPoint', I somehow cannot find any result. Below is the code:
var response = elastic.Search<ESIntegrationLog>(s => s
.Index("20160806")
.Type("esintegrationlog")
.Query(q =>
q.Term(p => p.CallPoint, "/cloudconnect/api/xxxxxxx/v1")
)
.Sort(ss => ss.Descending(p => p.CalledOn))
.Take(300)
);
I already tried to encode the URL, but it still does not find anything. Any ideas?
Update: I solved the case using 'match'.
.Query(q =>
//q.Term(p => p.CallPoint, "abcdefg")
q.MatchPhrasePrefix(c=> c.Field(d=> d.CallPoint).Query("/cloudconnect/api/xxxxxxx/v1"))
)
I suspect that callPoint is an analyzed string field, which has been analyzed by the standard analyzer. You'll be able to see how callPoint is mapped by looking at the mappings of the 20160806 index. Using Sense:
GET 20160806
If the mapping for callPoint is { "type" : "string" } then the input will be analyzed at index time. You can see how the standard analyzer will analyze the input using the _analyze API
POST _analyze
{
  "text": "/cloudconnect/api/xxxxxxx/v1",
  "analyzer": "standard"
}
produces the following tokens
{
"tokens": [
{
"token": "cloudconnect",
"start_offset": 1,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "api",
"start_offset": 14,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "xxxxxxx",
"start_offset": 18,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "v1",
"start_offset": 26,
"end_offset": 28,
"type": "<ALPHANUM>",
"position": 3
}
]
}
A term query does not analyze the query input, so it will attempt to match the query input as-is against what is in the inverted index, which for the callPoint field has been analyzed at index time. A match query does analyze the query input, so you get a match for the document as expected. Alternatively, you could map callPoint as a not_analyzed string field so that the input is not analyzed at index time and is indexed verbatim.
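For the not_analyzed route, a sketch of the mapping using the 2.x-era string syntax that matches the index shown (on 5.x and later you would use a keyword field instead); this has to be set when the index is created:
PUT 20160806
{
  "mappings": {
    "esintegrationlog": {
      "properties": {
        "callPoint": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
With that mapping, the original term query for the full URL "/cloudconnect/api/xxxxxxx/v1" would match verbatim.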

Exclude from CamelCase tokenizer in Elasticsearch

Struggling to make iPhone match when searching for iphone in Elasticsearch.
Since I have some source code at stake, I surely need a CamelCase tokenizer, but it appears to break iPhone into two terms, so iphone can't be found.
Does anyone know of a way to add exceptions to breaking camelCase words into tokens (camel + case)?
UPDATE: to make it clear, I want NullPointerException to be tokenized as [null, pointer, exception], but I don't want iPhone to become [i, phone].
Any other solution?
UPDATE 2: @ChintanShah's answer suggests a different approach that gives us even more: NullPointerException will be tokenized as [null, pointer, exception, nullpointer, pointerexception, nullpointerexception], which is definitely much more useful from the point of view of the one who searches. And indexing is also faster! The price to pay is index size, but it is a superior solution.
You can achieve your requirements with the word_delimiter token filter.
This is my setup:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "camel_filter",
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "filter": {
        "camel_filter": {
          "type": "word_delimiter",
          "generate_number_parts": false,
          "stem_english_possessive": false,
          "split_on_numerics": false,
          "protected_words": [
            "iPhone",
            "WiFi"
          ]
        }
      }
    }
  },
  "mappings": {
  }
}
This will split words on case changes, so NullPointerException will be tokenized as null, pointer, and exception, but iPhone and WiFi will remain as they are because they are protected. word_delimiter has a lot of options for flexibility. You can also use preserve_original, which will help you a lot (a sketch follows the two examples below).
GET logs_index/_analyze?text=iPhone&analyzer=camel_analyzer
Result
{
"tokens": [
{
"token": "iphone",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 1
}
]
}
Now with
GET logs_index/_analyze?text=NullPointerException&analyzer=camel_analyzer
Result
{
"tokens": [
{
"token": "null",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 1
},
{
"token": "pointer",
"start_offset": 4,
"end_offset": 11,
"type": "word",
"position": 2
},
{
"token": "exception",
"start_offset": 11,
"end_offset": 20,
"type": "word",
"position": 3
}
]
}
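If you also want the original unsplit token kept alongside the parts, preserve_original can be switched on in the same filter definition; a minimal sketch of just the adjusted filter (the rest of the settings stay as above):
"camel_filter": {
  "type": "word_delimiter",
  "generate_number_parts": false,
  "stem_english_possessive": false,
  "split_on_numerics": false,
  "preserve_original": true,
  "protected_words": [
    "iPhone",
    "WiFi"
  ]
}
With this, NullPointerException would be indexed as nullpointerexception as well as null, pointer, and exception (the lowercase filter still runs afterwards).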
Another approach is to analyze your field twice with different analyzers but I feel word_delimiter will do the trick.
Does this help?

elasticsearch: is it possible to emit overlapping tokens with the pattern tokenizer?

Working with Elasticsearch, I want to set up an analyzer that emits overlapping tokens for an input string, a little bit like the edge n-gram tokenizer.
Given the input
a/b/c
I would like the analyzer to produce tokens
a a/b a/b/c
I tried the pattern tokenizer with the following setup:
settings: {
  analysis: {
    tokenizer: {
      "my_tokenizer": {
        "type": "pattern",
        "pattern": "^(.*)(/|$)",
        "group": 1
      }
      ...
However, it doesn't output all the matching sequences; because the pattern is greedy, it will only output
a/b/c
Is there a way I could do this with another combination of built-in tokenizers/filters/analyzers?
Depending on the format of your values, you could use the path_hierarchy tokenizer.
Tried with the _analyze API:
GET _analyze?tokenizer=path_hierarchy&text=a/b/c
The output was quite close to what you want:
{
"tokens": [
{
"token": "a",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 1
},
{
"token": "a/b",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 1
},
{
"token": "a/b/c",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 1
}
]
}
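To use it beyond the _analyze API, the tokenizer can be declared in the index settings and attached to a field; a minimal sketch in 7.x-style typeless syntax (the index, analyzer, and field names are assumptions):
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "path_analyzer": {
          "type": "custom",
          "tokenizer": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_path": {
        "type": "text",
        "analyzer": "path_analyzer"
      }
    }
  }
}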
Give it a try, and let us know :)
