I'm struggling to make iPhone match when searching for iphone in Elasticsearch.
Since I have some source code at stake, I definitely need a CamelCase tokenizer, but it appears to break iPhone into two terms, so iphone can't be found.
Does anyone know of a way to add exceptions to breaking camelCase words into tokens (camel + case)?
UPDATE: to make it clear, I want NullPointerException to be tokenized as [null, pointer, exception], but I don't want iPhone to become [i, phone].
Any other solution?
UPDATE 2: @ChintanShah's answer suggests a different approach that gives us even more - NullPointerException will be tokenized as [null, pointer, exception, nullpointer, pointerexception, nullpointerexception], which is definitely much more useful from the searcher's point of view. Indexing is also faster! The price to pay is index size, but it is a superior solution.
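For reference, one way to produce that token set (a sketch of my understanding, not necessarily the exact setup from that answer) is to add a shingle filter with an empty token separator after the case split:
"filter": {
  "camel_filter": {
    "type": "word_delimiter",
    "split_on_numerics": false,
    "protected_words": ["iPhone", "WiFi"]
  },
  "subword_shingles": {
    "type": "shingle",
    "min_shingle_size": 2,
    "max_shingle_size": 3,
    "output_unigrams": true,
    "token_separator": ""
  }
}
Used in an analyzer as ["camel_filter", "lowercase", "subword_shingles"], NullPointerException comes out as the six tokens listed above.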
You can achieve your requirements with the word_delimiter token filter.
This is my setup:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "camel_filter",
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "filter": {
        "camel_filter": {
          "type": "word_delimiter",
          "generate_number_parts": false,
          "stem_english_possessive": false,
          "split_on_numerics": false,
          "protected_words": [
            "iPhone",
            "WiFi"
          ]
        }
      }
    }
  },
  "mappings": {}
}
This will split the words on case changes, so NullPointerException will be tokenized as null, pointer and exception, but iPhone and WiFi will remain as they are because they are protected. word_delimiter has a lot of options for flexibility. You can also use preserve_original, which will help you a lot.
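For instance, a sketch of the same filter with preserve_original switched on (illustrative only, the rest of the setup stays the same):
"camel_filter": {
  "type": "word_delimiter",
  "generate_number_parts": false,
  "stem_english_possessive": false,
  "split_on_numerics": false,
  "preserve_original": true,
  "protected_words": [
    "iPhone",
    "WiFi"
  ]
}
With preserve_original enabled, NullPointerException would also be kept as the single token nullpointerexception (after lowercasing), alongside the split parts.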
GET logs_index/_analyze?text=iPhone&analyzer=camel_analyzer
Result
{
  "tokens": [
    {
      "token": "iphone",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 1
    }
  ]
}
Now with
GET logs_index/_analyze?text=NullPointerException&analyzer=camel_analyzer
Result
{
  "tokens": [
    {
      "token": "null",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 1
    },
    {
      "token": "pointer",
      "start_offset": 4,
      "end_offset": 11,
      "type": "word",
      "position": 2
    },
    {
      "token": "exception",
      "start_offset": 11,
      "end_offset": 20,
      "type": "word",
      "position": 3
    }
  ]
}
Another approach is to analyze your field twice with different analyzers, but I feel word_delimiter will do the trick.
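If you do want to go the two-analyzer route, a minimal multi-field sketch could look like the following (the field name source is made up, and the exact mapping syntax depends on your Elasticsearch version):
"mappings": {
  "properties": {
    "source": {
      "type": "text",
      "analyzer": "camel_analyzer",
      "fields": {
        "standard": {
          "type": "text",
          "analyzer": "standard"
        }
      }
    }
  }
}
You would then query source for the camel-split terms and source.standard for the untouched ones.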
Does this help?
I am trying to take advantage of the length filter in Elasticsearch:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-length-tokenfilter.html
In the example provided, it simply removes the tokens that fall outside the length range.
But when I use it, it replaces those tokens with _
Has anyone run into this problem?
I suppose I can add a character filter. But maybe there is some undocumented feature?
Example:
String: i eat icecream
If I apply the length filter with min=3, max=10, the tokens I get are:
_ eat icecream instead of eat icecream
Using your string "i eat icecream" and analyzing it with a length token filter, I get the result below (tested on Elasticsearch 7.x):
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "length",
      "min": 3,
      "max": 10
    }
  ],
  "text": "i eat icecream"
}
The tokens generated are
{
  "tokens": [
    {
      "token": "eat",
      "start_offset": 2,
      "end_offset": 5,
      "type": "word",
      "position": 1
    },
    {
      "token": "icecream",
      "start_offset": 6,
      "end_offset": 14,
      "type": "word",
      "position": 2
    }
  ]
}
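If you want the same behaviour at index time, a minimal settings sketch could look like this (the index, filter and analyzer names are made up):
PUT /my_length_index
{
  "settings": {
    "analysis": {
      "filter": {
        "min_3_max_10": {
          "type": "length",
          "min": 3,
          "max": 10
        }
      },
      "analyzer": {
        "length_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "min_3_max_10"
          ]
        }
      }
    }
  }
}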
Using NEST.
I have the following code:
QueryContainerDescriptor<ProductIndex> q; // the descriptor used to build the query
var queryContainer = new QueryContainer();
queryContainer &= q.Match(m => m.Field(f => f.Code).Query(parameters.Code));
I would like both of these criteria:
code=FRUIT 12 //with space
code=FRUIT12 //no space
to return products 1 and 2.
Currently
I get products 1 and 2 if I set code=FRUIT 12 //with space
and I only get product 2 if I set code=FRUIT12 //no space
Sample data
Products
[
  {
    "id": 1,
    "name": "APPLE",
    "code": "FRUIT 12"
  },
  {
    "id": 2,
    "name": "ORANGE",
    "code": "FRUIT12"
  }
]
By default, a string field will have a standard tokenizer, which will emit a single token "FRUIT12" for the "FRUIT12" input.
You need to use a word_delimiter token filter in your field's analyzer to get the behavior you are expecting:
GET _analyze
{
  "text": "FRUIT12",
  "tokenizer": "standard"
}
gives
{
  "tokens": [
    {
      "token": "FRUIT12",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
and
GET _analyze
{
  "text": "FRUIT12",
  "tokenizer": "standard",
  "filter": ["word_delimiter"]
}
gives
{
  "tokens": [
    {
      "token": "FRUIT",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "12",
      "start_offset": 5,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
If you add the word_delimiter token filter to your field, any search query on this field will also have the word_delimiter token filter enabled (unless you override it with the search_analyzer option in the mapping),
so a "FRUIT12" single-term query will be "translated" into a ["FRUIT", "12"] multi-term query.
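A minimal sketch of index settings along those lines (the index and analyzer names are made up, adapt them to your NEST mapping):
PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "code_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "word_delimiter",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "code": {
        "type": "text",
        "analyzer": "code_analyzer"
      }
    }
  }
}
With this in place, both "FRUIT 12" and "FRUIT12" are indexed as the tokens fruit and 12, so the match query in the NEST code should return products 1 and 2 in both cases.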
Is it possible to create a mapping analyser for splitting a string into smaller parts based on a count of characters?
For example, let's say I have a string: "ABCD1E2F34". This is a token constructed from multiple smaller codes, and I want to break it down into those codes again.
If I know for sure that:
- First code is always 4 letters ("ABCD")
- Second is 3 letters ("1E2")
- Third is 1 letter ("F")
- Fourth is 2 letters ("34")
Can I create a mapping analyser for a field that will map the string like this? If I set the field "bigCode" to the value "ABCD1E2F34", I will be able to access it like this:
bigCode.full ("ABCD1E2F34")
bigCode.first ("ABCD")
bigCode.second ("1E2")
...
Thanks a lot!
What do you think about the pattern tokenizer? I created a regex to split the string into tokens: (?<=(^\\w{4}))|(?<=^\\w{4}(\\w{3}))|(?<=^\\w{4}\\w{3}(\\w{1}))|(?<=^\\w{4}\\w{3}\\w{1}(\\w{2})). After that, I created an analyzer like this:
PUT /myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "codeanalyzer": {
          "type": "pattern",
          "pattern": "(?<=(^\\w{4}))|(?<=^\\w{4}(\\w{3}))|(?<=^\\w{4}\\w{3}(\\w{1}))|(?<=^\\w{4}\\w{3}\\w{1}(\\w{2}))"
        }
      }
    }
  }
}
POST /myindex/_analyze?analyzer=codeanalyzer&text=ABCD1E2F34
And the result is tokenized data:
{
  "tokens": [
    {
      "token": "abcd",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "1e2",
      "start_offset": 4,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "f",
      "start_offset": 7,
      "end_offset": 8,
      "type": "word",
      "position": 2
    },
    {
      "token": "34",
      "start_offset": 8,
      "end_offset": 10,
      "type": "word",
      "position": 3
    }
  ]
}
You can also check the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-tokenizer.html
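If you also need to address the parts as separate sub-fields (bigCode.first, bigCode.second, ...), one possible approach is a multi-field where each sub-field has its own pattern tokenizer that extracts a single capture group. This is only a sketch with made-up names and recent mapping syntax, not tested against your data:
PUT /myindex
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "first_code_tokenizer": {
          "type": "pattern",
          "pattern": "^(\\w{4})",
          "group": 1
        }
      },
      "analyzer": {
        "first_code_analyzer": {
          "type": "custom",
          "tokenizer": "first_code_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "bigCode": {
        "type": "text",
        "fields": {
          "first": {
            "type": "text",
            "analyzer": "first_code_analyzer"
          }
        }
      }
    }
  }
}
The second, third and fourth parts would follow the same idea with their own patterns, for example ^\\w{4}(\\w{3}) with group 1 for the second code.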
I am building an address matching engine for UK addresses in Elasticsearch and have found shingles to be very useful; however, I am seeing some issues when it comes to punctuation. A query for "4 Walmley Close" is returning the following matches:
Units 3 And 4, Walmley Chambers, 3 Walmley Close
Flat 4, Walmley Court, 10 Walmley Close
Co-Operative Retail Services Ltd, 4 Walmley Close
The true match is number 3, however both 1 and 2 match (falsely) as they both become '4 walmley' when turned into shingles. I would like to tell the shingle analyzer not to generate shingles that straddle commas. So, for example, for 1) I currently get:
units 3
3 and
and 4
4 walmley
walmley chambers
chambers 3
3 walmley
walmley close
...when in actual fact all I want is....
units 3
3 and
and 4
walmley chambers
3 walmley
walmley close
My current settings are below. I have experimented with swapping the tokenizer from standard to whitespace; this helps in that it retains the commas and would potentially avoid the situation above (i.e. I end up with '4, walmley' as my shingle in addresses 1 and 2), however I end up with lots of unusable shingles in my index, and with 70 million documents I need to keep the index size down.
As you can see in the index settings, I also have a street_sym filter which I would love to be able to use in my shingles, e.g. for this example, in addition to generating 'walmley close' I would like to have 'walmley cl'. However, when I attempted to include this I got shingles of 'close cl', which were not terribly helpful!
Any advice from more experienced Elasticsearch users would be hugely appreciated. I have read through Gormley and Tong's excellent book but cannot get my head around this particular issue.
Thanks in advance for any help offered.
"analysis": {
  "filter": {
    "shingle": {
      "type": "shingle",
      "output_unigrams": false
    },
    "street_sym": {
      "type": "synonym",
      "synonyms": [
        "st => street",
        "rd => road",
        "ave => avenue",
        "ct => court",
        "ln => lane",
        "terr => terrace",
        "cir => circle",
        "hwy => highway",
        "pkwy => parkway",
        "cl => close",
        "blvd => boulevard",
        "dr => drive",
        "ste => suite",
        "wy => way",
        "tr => trail"
      ]
    }
  },
  "analyzer": {
    "shingle": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "shingle"
      ]
    }
  }
}
See my comment on your question for why the solution still won't stop "4 Walmley Close" from matching all three of the matches you provided. However, it is possible to at least get the tokens you want. I'm not sure if it's the most elegant/performant solution, but using the Pattern Replace, Pattern Capture, and Length filters on your shingles seems to do the trick:
"analysis": {
  "filter": {
    "shingle": {
      "type": "shingle",
      "output_unigrams": false
    },
    "street_sym": {
      "type": "synonym",
      "synonyms": [
        "st => street",
        "rd => road",
        "ave => avenue",
        "ct => court",
        "ln => lane",
        "terr => terrace",
        "cir => circle",
        "hwy => highway",
        "pkwy => parkway",
        "cl => close",
        "blvd => boulevard",
        "dr => drive",
        "ste => suite",
        "wy => way",
        "tr => trail"
      ]
    },
    "no_middle_comma": {
      "type": "pattern_replace",
      "pattern": ".+,.+",
      "replacement": ""
    },
    "no_trailing_comma": {
      "type": "pattern_capture",
      "preserve_original": false,
      "patterns": [
        "(.*),"
      ]
    },
    "not_empty": {
      "type": "length",
      "min": 1
    }
  },
  "analyzer": {
    "test": {
      "type": "custom",
      "tokenizer": "whitespace",
      "filter": [
        "lowercase",
        "street_sym",
        "shingle",
        "no_middle_comma",
        "no_trailing_comma",
        "not_empty"
      ]
    }
  }
}
no_middle_comma: replace any tokens with a comma in the middle with an empty token
no_trailing_comma: replace any tokens ending with a comma with the part before the comma
not_empty: remove any empty tokens resulting from the above
For example, "Units 3 And 4, Walmley Chambers, 3 Walmley Cl" becomes:
{
  "tokens": [
    {
      "token": "units 3",
      "start_offset": 0,
      "end_offset": 7,
      "type": "shingle",
      "position": 0
    },
    {
      "token": "3 and",
      "start_offset": 6,
      "end_offset": 11,
      "type": "shingle",
      "position": 1
    },
    {
      "token": "and 4",
      "start_offset": 8,
      "end_offset": 14,
      "type": "shingle",
      "position": 2
    },
    {
      "token": "walmley chambers",
      "start_offset": 15,
      "end_offset": 32,
      "type": "shingle",
      "position": 4
    },
    {
      "token": "3 walmley",
      "start_offset": 33,
      "end_offset": 42,
      "type": "shingle",
      "position": 6
    },
    {
      "token": "walmley close",
      "start_offset": 35,
      "end_offset": 45,
      "type": "shingle",
      "position": 7
    }
  ]
}
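To reproduce this yourself, the settings above can be tested with the _analyze API (the index name here is just a placeholder for wherever the settings live):
GET my_address_index/_analyze
{
  "analyzer": "test",
  "text": "Units 3 And 4, Walmley Chambers, 3 Walmley Cl"
}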
Note that your synonym filter works: "Walmley Cl" became "walmley close".
Working with Elasticsearch, I want to set up an analyzer that emits overlapping tokens for an input string, a little bit like the edge n-gram tokenizer.
Given the input
a/b/c
I would like the analyzer to produce tokens
a a/b a/b/c
I tried the pattern tokenizer with the following setup:
settings: {
  analysis: {
    tokenizer: {
      "my_tokenizer": {
        "type": "pattern",
        "pattern": "^(.*)(/|$)",
        "group": 1
      }
      ...
However, it doesn't output all the matching sequences and, because it is greedy, will only output:
a/b/c
Is there a way I could do this with another combination of builtin tokenizers/filters/analyzers?
Depending on the format of your values, you could use a path_hierarchy tokenizer.
I tried it with the _analyze API:
GET _analyze?tokenizer=path_hierarchy&text=a/b/c
The output was quite close to what you want:
{
  "tokens": [
    {
      "token": "a",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 1
    },
    {
      "token": "a/b",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 1
    },
    {
      "token": "a/b/c",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 1
    }
  ]
}
Give it a try, and let us know :)
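If the output works for you, a minimal sketch of wiring the tokenizer into your index settings might look like this (the index and analyzer names are made up; the delimiter defaults to /, shown explicitly here):
PUT /my_paths
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "slash_hierarchy": {
          "type": "path_hierarchy",
          "delimiter": "/"
        }
      },
      "analyzer": {
        "path_analyzer": {
          "type": "custom",
          "tokenizer": "slash_hierarchy",
          "filter": ["lowercase"]
        }
      }
    }
  }
}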