I googled my question but couldn't find an answer. I'm fairly new to Elasticsearch, and I don't think I've fully grasped the idea of tokens yet.
I've built a mapping with a custom name_analyzer that uses the filters lowercase, unique and asciifolding with preserve_original=true.
I have the field search_combo_name, and its content, for example, is this:
André, André Mustermann, andre.mustermann#gmail.com, Mustermann
When I use kibana to analyze the string above against my name_analyzer, I get the following result:
{
"tokens" : [
{
"token" : "andre",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "andré",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "mustermann",
"start_offset" : 13,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "andre.mustermann",
"start_offset" : 25,
"end_offset" : 41,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "gmail.com",
"start_offset" : 42,
"end_offset" : 51,
"type" : "<ALPHANUM>",
"position" : 3
}
]
}
That's the result I expect, but what are these tokens used for?
When I search with bool must/should or match, Elasticsearch searches the content of the fields and not the tokens, right?
These tokens are the ones that are going to be indexed and that you can then search on.
All queries will run on those tokens (i.e. not on the raw content directly), which is why it is important to set proper field types and analyzers (in the case of text fields) when indexing data into Elasticsearch.
Failing to do so can result in bad relevance (and also bad performance), i.e. queries with bad and/or imprecise results, or queries that take too long to execute. It's a very wide topic, but if you present your use case in more detail, maybe we can help better.
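For illustration, here is a minimal sketch of an index using an analyzer like the one you describe, together with a query that runs against the resulting tokens. The index name my_index, the standard tokenizer and the filter name my_asciifolding are assumptions (your full mapping isn't shown), and the request syntax assumes Elasticsearch 7+:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_asciifolding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      },
      "analyzer": {
        "name_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_asciifolding", "unique"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "search_combo_name": {
        "type": "text",
        "analyzer": "name_analyzer"
      }
    }
  }
}

GET /my_index/_search
{
  "query": {
    "match": {
      "search_combo_name": "Andre"
    }
  }
}
The match query analyzes the query string "Andre" with the same analyzer, producing the token andre, which matches the indexed token andre even though the stored value contains André.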
Related
I use medcl/elasticsearch-analysis-pinyin to analyze a text field, and I want to keep only the longest token in the analysis result.
For example, in the result below, I only want to keep the longest token, english123djcjdj.
Is there a token filter for this?
I've checked the token filter docs of Elasticsearch; there is a limit token count filter, but it only keeps the first tokens, which doesn't match my case.
# curl -H 'Content-type:application/json' -XPOST localhost:8200/pinyin/_analyze?pretty -d '{"analyzer":"pinyin_analyzer","text":"english 123DJ曾经DJ"}'
{
"tokens" : [
{
"token" : "english123dj",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 0
},
{
"token" : "english123djcjdj",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 0
},
{
"token" : "dj",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 1
}
]
}
I have a case where I want to use Elasticsearch as a text search engine for pretty long Arabic HTML text.
The search works pretty well, except for words with diacritics; it doesn't seem to be able to recognize them.
For example:
This sentence: ' وَهَكَذَا في كُلّ عَقْدٍ' (this is the one stored in the db)
is exactly the same as this: 'وهكذا في كل عقد' (this is what the user enters for search).
It's exactly the same except for the added diacritics, which are handled as separate characters in computers (but are just rendered on top of other characters).
I want to know if there's a way to make the search ignore all diacritics.
The first method I'm thinking of is telling Elasticsearch to completely ignore diacritics when indexing (kinda like stop words?).
If not, would it be suitable to have another field in the document (text_normalized) where I manually remove the diacritics before adding it to Elasticsearch? Would that be efficient?
To solve your problem you can use the arabic_normalization token filter; it will remove diacritics from the text before indexing. You need to define a custom analyzer, which should look something like this:
"analyzer": {
"rebuilt_arabic": {
"tokenizer": "standard",
"filter": [
"lowercase",
"decimal_digit",
"arabic_stop",
"arabic_normalization",
"arabic_keywords",
"arabic_stemmer"
]
}
}
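Note that arabic_stop, arabic_keywords and arabic_stemmer are themselves custom filters that have to be defined in the index settings. A fuller sketch, following the rebuilt arabic analyzer example from the Elasticsearch docs (the index name is arbitrary and the keyword_marker word list is just a placeholder for words you want to protect from stemming), could look like this:
PUT /arabic_example
{
  "settings": {
    "analysis": {
      "filter": {
        "arabic_stop": {
          "type": "stop",
          "stopwords": "_arabic_"
        },
        "arabic_keywords": {
          "type": "keyword_marker",
          "keywords": ["مثال"]
        },
        "arabic_stemmer": {
          "type": "stemmer",
          "language": "arabic"
        }
      },
      "analyzer": {
        "rebuilt_arabic": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "decimal_digit",
            "arabic_stop",
            "arabic_normalization",
            "arabic_keywords",
            "arabic_stemmer"
          ]
        }
      }
    }
  }
}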
Checking with the _analyze API:
GET /_analyze
{
"tokenizer" : "standard",
"filter" : ["arabic_normalization"],
"text" : "وَهَكَذَا في كُلّ عَقْدٍ"
}
Result from Analyzer:
{
"tokens" : [
{
"token" : "وهكذا",
"start_offset" : 0,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "في",
"start_offset" : 10,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "كل",
"start_offset" : 13,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "عقد",
"start_offset" : 18,
"end_offset" : 24,
"type" : "<ALPHANUM>",
"position" : 3
}
]
}
As you can see, the diacritics are removed. For more information you can check the Elasticsearch documentation on the arabic analyzer.
I want to do a search that implements the following items.
Right now I have implemented all of this through regex; it is far from covering everything, and I would like to know how much of it I could do with Elasticsearch instead:
Synonyms
My understanding is that this is implemented when the index is created.
indexSettings.Analysis.TokenFilters.Add("synonym", new SynonymTokenFilter { Synonyms = new[] { "tire => tyre", "aluminum => aluminium" }, IgnoreCase = true, Tokenizer = "whitespace" });
but do I need to include the plurals as well? Or,
Singular words (shoes and shoe should be an identical match)
does that mean that I need to put 'shoes' in the synonym list, or is there another way?
Small misspellings, substitutions and omissions should be allowed
so that 'automobile', 'automoble' or 'automoblie' would match. I don't know if this is even possible.
Ignore all stop words
right now I'm removing all the 'the', 'this', 'my', etc. through regex
All my search terms are plain English words and numbers; nothing else is allowed.
All of this is possible through configuring/writing a custom analyzer within Elasticsearch. To answer each question in turn:
Synonyms
Synonyms can be applied at index time, at search time, or both. There are tradeoffs to consider with whichever approach you choose.
Applying synonyms at index time will result in faster searches compared to applying them at search time, at the cost of more disk space, lower indexing throughput, and less ease and flexibility in adding/removing synonyms.
Applying synonyms at search time allows for greater flexibility at the expense of search speed.
You also need to consider the size of the synonym list and how frequently, if ever, it changes. I would consider trying both and deciding which works best for your scenario and requirements.
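As an illustration of the search-time option, here is a rough sketch of a mapping that applies synonyms only at query time via a search_analyzer (the index and field names are illustrative, and the request syntax assumes Elasticsearch 7+):
PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms": ["tire => tyre", "aluminum => aluminium"]
        }
      },
      "analyzer": {
        "my_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonym"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "my_search_analyzer"
      }
    }
  }
}
With this setup the synonym list can be updated (by closing the index, changing the filter settings and reopening it) without reindexing the documents.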
Singular words (shoes and shoe should be an identical match)
You may consider using stemming to reduce plural and singular words to their root form, using an algorithmic or dictionary-based stemmer. Perhaps start with the English Snowball stemmer and see how it works for you.
You should also consider whether you need to index the original word form as well, e.g. should exact word matches rank higher than matches on the stemmed root form?
Small misspellings, substitutions and omissions should be allowed
Consider using queries that can utilize fuzziness to handle typos and misspellings. If there are spelling errors in the indexed data, consider some form of sanitization before indexing. As with all data stores: Garbage In, Garbage Out :)
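For example, a rough sketch of a match query with fuzziness enabled (the index and field names are illustrative):
GET /default-index/_search
{
  "query": {
    "match": {
      "content": {
        "query": "automoble",
        "fuzziness": "AUTO"
      }
    }
  }
}
With fuzziness set to AUTO, a term of this length may differ from an indexed term by up to two character edits, so small typos like automoble can still match documents containing automobile.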
Ignore all stop words
Use an English Stop token filter to remove stop words.
Putting all of this together, an example analyzer might look like this:
void Main()
{
var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
var defaultIndex = "default-index";
var connectionSettings = new ConnectionSettings(pool)
.DefaultIndex(defaultIndex);
var client = new ElasticClient(connectionSettings);
if (client.IndexExists(defaultIndex).Exists)
client.DeleteIndex(defaultIndex);
client.CreateIndex(defaultIndex, c => c
.Settings(s => s
.Analysis(a => a
.TokenFilters(t => t
.Stop("my_stop", st => st
.StopWords("_english_", "i've")
.RemoveTrailing()
)
.Synonym("my_synonym", st => st
.Synonyms(
"dap, sneaker, pump, trainer",
"soccer => football"
)
)
.Snowball("my_snowball", st => st
.Language(SnowballLanguage.English)
)
)
.Analyzers(an => an
.Custom("my_analyzer", ca => ca
.Tokenizer("standard")
.Filters(
"lowercase",
"my_stop",
"my_snowball",
"my_synonym"
)
)
)
)
)
.Mappings(m => m
.Map<Message>(mm => mm
.Properties(p => p
.Text(t => t
.Name(n => n.Content)
.Analyzer("my_analyzer")
)
)
)
)
);
client.Analyze(a => a
.Index(defaultIndex)
.Field<Message>(f => f.Content)
.Text("Loving those Billy! Them is the maddest soccer trainers I've ever seen!")
);
}
public class Message
{
public string Content { get; set; }
}
my_analyzer produces the following tokens for the text above:
{
"tokens" : [
{
"token" : "love",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "those",
"start_offset" : 7,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "billi",
"start_offset" : 13,
"end_offset" : 18,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "them",
"start_offset" : 20,
"end_offset" : 24,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "maddest",
"start_offset" : 32,
"end_offset" : 39,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "football",
"start_offset" : 40,
"end_offset" : 46,
"type" : "SYNONYM",
"position" : 7
},
{
"token" : "trainer",
"start_offset" : 47,
"end_offset" : 55,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "dap",
"start_offset" : 47,
"end_offset" : 55,
"type" : "SYNONYM",
"position" : 8
},
{
"token" : "sneaker",
"start_offset" : 47,
"end_offset" : 55,
"type" : "SYNONYM",
"position" : 8
},
{
"token" : "pump",
"start_offset" : 47,
"end_offset" : 55,
"type" : "SYNONYM",
"position" : 8
},
{
"token" : "ever",
"start_offset" : 61,
"end_offset" : 65,
"type" : "<ALPHANUM>",
"position" : 10
},
{
"token" : "seen",
"start_offset" : 66,
"end_offset" : 70,
"type" : "<ALPHANUM>",
"position" : 11
}
]
}
I have been using different types of tokenizers for testing and demonstration purposes. I need to check how a particular text field is tokenized using different tokenizers and also see the tokens generated.
How can I achieve that?
You can use the _analyze endpoint for this purpose.
For instance, using the standard analyzer, you can analyze this is a test like this
curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'this is a test'
And this produces the following tokens:
{
"tokens" : [ {
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "a",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 3
}, {
"token" : "test",
"start_offset" : 10,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 4
} ]
}
Of course, you can use any of the existing analyzers, and you can also specify tokenizers using the tokenizer parameter, token filters using the token_filters parameter, and character filters using the char_filters parameter. For instance, analyzing the HTML snippet THIS is a <b>TEST</b> with the keyword tokenizer, the lowercase token filter, and the html_strip character filter yields a single lowercase token without the HTML markup:
curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&token_filters=lowercase&char_filters=html_strip' -d 'THIS is a <b>TEST</b>'
{
"tokens" : [ {
"token" : "this is a test",
"start_offset" : 0,
"end_offset" : 21,
"type" : "word",
"position" : 1
} ]
}
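Note that in more recent Elasticsearch versions the query-string form of _analyze has been removed; the equivalent request takes a JSON body instead, roughly like this:
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": ["lowercase"],
  "char_filter": ["html_strip"],
  "text": "THIS is a <b>TEST</b>"
}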
Apart from what @Val has mentioned, you can try out the term vectors API if you intend to study how tokenizers work. You can try something like this just to examine the tokenization happening in a field:
GET /index-name/type-name/doc-id/_termvector?fields=field-to-be-examined
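On recent Elasticsearch versions, where mapping types have been removed, the equivalent request should roughly be:
GET /index-name/_termvectors/doc-id?fields=field-to-be-examined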
To learn more about tokenizers and their operations, you can refer to this blog.
Is it possible to use the path_hierarchy tokenizer with paths that have whitespace in them, and have it create tokens based only on the delimiter, not on the whitespace? For example,
"/airport/hangar 1"
would be tokenized as
"airport", "hangar 1",
not
"airport", "hangar", "1"?
The path_hierarchy tokenizer works perfectly fine with paths that contain whitespace:
curl "localhost:9200/_analyze?tokenizer=path_hierarchy&pretty=true" -d "/airport/hangar 1"
{
"tokens" : [ {
"token" : "/airport",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 1
}, {
"token" : "/airport/hangar 1",
"start_offset" : 0,
"end_offset" : 17,
"type" : "word",
"position" : 1
} ]
}
However, based on your example, you might need to use the pattern tokenizer instead.
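For example, here is a sketch of a pattern tokenizer that splits only on the / delimiter (this assumes a version where _analyze accepts an inline tokenizer definition; otherwise the tokenizer would need to be defined in the index settings):
GET /_analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": "/"
  },
  "text": "/airport/hangar 1"
}
This should produce the tokens airport and hangar 1, since the pattern tokenizer splits on the regex match and keeps the whitespace inside the remaining chunks.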