Elasticsearch custom tokenizer: don't split time on ":"

For example, I have a log line like this:
11:22:33 user:abc&game:cde
If I use the standard tokenizer, this log is split into:
11 22 33 user abc game cde
But 11:22:33 is a time, and I don't want it split. I want a custom tokenizer that splits it into:
11:22:33 user abc game cde
So how should I set up the tokenizer?

You can use the pattern tokenizer to achieve that.
A tokenizer of type pattern can flexibly separate text into terms via a regular expression.
Read more here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-tokenizer.html
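For instance, here is a rough sketch of such a tokenizer (the index and analyzer names are made up; the regex splits on whitespace, "&", and any ":" that is not between two digits, so times like 11:22:33 survive intact; verify it against your real log format with _analyze):

# made-up index/analyzer names; the pattern defines the separators, not the tokens
PUT /logs-demo
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "keep_time_tokenizer": {
          "type": "pattern",
          "pattern": "\\s+|&|(?<=\\D):|:(?=\\D)"
        }
      },
      "analyzer": {
        "log_analyzer": {
          "type": "custom",
          "tokenizer": "keep_time_tokenizer"
        }
      }
    }
  }
}

POST /logs-demo/_analyze
{
  "analyzer": "log_analyzer",
  "text": "11:22:33 user:abc&game:cde"
}

which should return 11:22:33, user, abc, game, cde.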

Related

elasticsearch tokenizer to generate tokens for special characters

I have a use case where special characters should also be searchable. I have tried some tokenizers like char_group, standard, and n-gram.
If I use an n-gram tokenizer I am able to make special characters searchable (since it generates a token for each character).
But n-gram generates too many tokens, so I am not interested in using it.
For example, if the text is hey john.s #100 is a test name, then the tokenizer should create tokens for [hey, john, s, #, 100, is, a, test, name].
Please refer to this question for a detailed explanation.
Thank you.
Based on your use case, the best option would be to use the whitespace tokenizer in combination with the word delimiter graph token filter.
For more information, check the official Elasticsearch documentation on the whitespace tokenizer and the word delimiter graph token filter:
https://www.elastic.co/guide/en/elasticsearch/reference/8.4/analysis-whitespace-tokenizer.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-graph-tokenfilter.html
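For example, a minimal sketch of that combination (index and filter names are invented; the type_table entry is one way to keep "#" as a token of its own, since the filter would otherwise strip it, and split_on_numerics, true by default, then separates "#" from "100"):

# invented names; "#" is mapped to ALPHA so the delimiter filter keeps it
PUT /special-chars-demo
{
  "settings": {
    "analysis": {
      "filter": {
        "special_delimiter": {
          "type": "word_delimiter_graph",
          "type_table": ["# => ALPHA"]
        }
      },
      "analyzer": {
        "special_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["special_delimiter"]
        }
      }
    }
  }
}

POST /special-chars-demo/_analyze
{
  "analyzer": "special_analyzer",
  "text": "hey john.s #100 is a test name"
}

With the defaults this should yield hey, john, s, #, 100, is, a, test, name, which is the list from the question.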

How to tokenize a sentence based on a maximum number of words in Elasticsearch?

I have a string like "This is a beautiful day".
What tokenizer, or what combination of tokenizer and token filter, should I use to produce output whose terms contain a maximum of 2 words? Ideally, the output should be:
"This, This is, is, is a, a, a beautiful, beautiful, beautiful day, day"
So far I have tried all the built-in tokenizers; the 'pattern' tokenizer seems like the one I could use, but I don't know how to write a regex pattern for my case. Any help?
It seems you're looking for the shingle token filter; it does exactly what you want.
As #Oleksii said,
in your case max_shingle_size = 2 (which is the default), and the single words are included because output_unigrams defaults to true.
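For instance, a minimal sketch (all names invented; output_unigrams is left at its default of true, which is what puts the single words in the output next to the two-word shingles):

# invented names; standard tokenizer plus 2-word shingles, unigrams kept
PUT /shingle-demo
{
  "settings": {
    "analysis": {
      "filter": {
        "two_word_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": true
        }
      },
      "analyzer": {
        "shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["two_word_shingles"]
        }
      }
    }
  }
}

POST /shingle-demo/_analyze
{
  "analyzer": "shingle_analyzer",
  "text": "This is a beautiful day"
}

This should return This, This is, is, is a, a, a beautiful, beautiful, beautiful day, day.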

Elasticsearch - tokenize terms by capitalized character, for example "TheStarTech" => [The, Star, Tech]

Does Elasticsearch have a tokenizer that splits terms on capitalized characters, for example tokenizing TheStarTech into the terms [The, Star, Tech]? The pattern tokenizer seems helpful. Any suggestions?
See this: Word Delimiter Token Filter
It does what you want and more. You can pass in parameters to fit your needs. Check the split_on_case_change parameter, which is true by default.
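A small sketch of that (index and filter names are made up; the keyword tokenizer keeps the whole input as one token, and the word delimiter filter then splits it wherever the case changes):

# made-up names; split_on_case_change is already the default, set explicitly for clarity
PUT /camelcase-demo
{
  "settings": {
    "analysis": {
      "filter": {
        "case_split": {
          "type": "word_delimiter",
          "split_on_case_change": true
        }
      },
      "analyzer": {
        "camelcase_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["case_split"]
        }
      }
    }
  }
}

POST /camelcase-demo/_analyze
{
  "analyzer": "camelcase_analyzer",
  "text": "TheStarTech"
}

For "TheStarTech" this returns The, Star and Tech.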

In Elasticsearch, how can I tokenize words separated by spaces and be able to match by typing without spaces?

Here is what I want to achieve:
My field value: "one two three"
I want to be able to match this field by typing one, onetwo, onetwothree, onethree, twothree, two, or three.
For that, the tokenizer needs to produce these tokens:
one
onetwo
onetwothree
onethree
two
twothree
three
Do you know how I can implement this analyzer?
There is the same problem in German, where different words are joined into one. For this purpose Elasticsearch uses a technique called "compound words", and there is a dedicated token filter called the "compound word token filter". It tries to find sub-words from a given dictionary inside a string; you only have to define a dictionary for your language. The full specification is at the link below.
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-compound-word-tokenfilter.html
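A rough sketch of how that could look for your example (index, field, and filter names are invented, and the three-entry word_list is only a toy stand-in for a real dictionary; in practice the hyphenation_decompounder with a proper dictionary file is usually preferred):

# toy example: dictionary-based decompounding applied at search time
PUT /compound-demo
{
  "settings": {
    "analysis": {
      "filter": {
        "split_joined_words": {
          "type": "dictionary_decompounder",
          "word_list": ["one", "two", "three"]
        }
      },
      "analyzer": {
        "decompound_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "split_joined_words"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "decompound_analyzer"
      }
    }
  }
}

POST /compound-demo/_analyze
{
  "analyzer": "decompound_analyzer",
  "text": "onetwothree"
}

The last call returns onetwothree, one, two and three, so a match query for "onetwothree" (or "onetwo", "onethree", and so on) against my_field will find a document whose value is "one two three".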

Tokenizer that splits up words with underscores in them but also retains a full version

I'm implementing an Elasticsearch search for content that has filenames in it, such as
"golf_master_2009.xls". I'd like a tokenizer that splits this up into at least the following tokens: "golf", "master", "golf_master_2009.xls". Right now I have to use wildcards (for example "*master*") if I want to search for it without specifying the full filename.
You can apply different analyzers to the same content using a multi-field.
See http://www.elasticsearch.org/guide/reference/mapping/multi-field-type.html
HTH
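In current Elasticsearch versions the old multi_field type from that link is expressed with the fields parameter. A minimal sketch (field names are invented; the built-in simple analyzer is only a stand-in here and drops the digits, so in practice you would point the sub-field at a custom filename analyzer like the one sketched under the next answer):

# "filename" keeps the exact value, "filename.words" is analyzed into separate words
PUT /files-demo
{
  "mappings": {
    "properties": {
      "filename": {
        "type": "keyword",
        "fields": {
          "words": {
            "type": "text",
            "analyzer": "simple"
          }
        }
      }
    }
  }
}

You can then query filename for the exact name and filename.words for the individual parts.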
You can use your own analyzer with the keyword tokenizer and the word delimiter token filter (with the options generate_word_parts and preserve_original set to true).
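A minimal sketch of that analyzer (index and filter names are made up; generate_word_parts and generate_number_parts actually default to true and are only spelled out here for clarity):

# keyword tokenizer keeps the whole filename; word_delimiter adds the parts
PUT /filenames-demo
{
  "settings": {
    "analysis": {
      "filter": {
        "filename_parts": {
          "type": "word_delimiter",
          "generate_word_parts": true,
          "generate_number_parts": true,
          "preserve_original": true
        }
      },
      "analyzer": {
        "filename_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase", "filename_parts"]
        }
      }
    }
  }
}

POST /filenames-demo/_analyze
{
  "analyzer": "filename_analyzer",
  "text": "golf_master_2009.xls"
}

This should produce golf_master_2009.xls, golf, master, 2009 and xls from a single input.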
