Is it possible to use the path_hierarchy tokenizer with paths that have whitespace in them and have it create tokens based only on the delimiter not the whitespace? For example,
"/airport/hangar 1"
would be tokenized as
"airport", "hangar 1",
not
"airport", "hangar", "1"?
The path_hierarchy tokenizer works perfectly fine with paths that contain whitespace:
curl "localhost:9200/_analyze?tokenizer=path_hierarchy&pretty=true" -d "/airport/hangar 1"
{
"tokens" : [ {
"token" : "/airport",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 1
}, {
"token" : "/airport/hangar 1",
"start_offset" : 0,
"end_offset" : 17,
"type" : "word",
"position" : 1
} ]
}
However, based on your example (you expect airport and hangar 1, without the leading slash and without the hierarchical prefixes), you might need to use the pattern tokenizer instead.
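A minimal sketch of such a pattern tokenizer, splitting only on the slash delimiter (the index and analyzer names here are made up):
PUT paths
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "slash_tokenizer": {
          "type": "pattern",
          "pattern": "/"
        }
      },
      "analyzer": {
        "slash_analyzer": {
          "type": "custom",
          "tokenizer": "slash_tokenizer"
        }
      }
    }
  }
}
POST paths/_analyze
{
  "analyzer": "slash_analyzer",
  "text": "/airport/hangar 1"
}
With this, "/airport/hangar 1" should come out as the two tokens airport and hangar 1, since the pattern tokenizer splits on every match of the pattern and drops empty tokens.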
Related
I use medcl/elasticsearch-analysis-pinyin to analyze a text field, and I want to keep only the longest token in the analysis result.
For example, in the result below, I would keep only the longest token english123djcjdj.
Is there a token filter for this?
I've checked the token filter docs of Elasticsearch; there is a limit token count filter, but it only keeps the first token(s), which doesn't match my case.
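(For reference, the limit filter keeps only the first max_token_count tokens, regardless of their length; it can be tried as a transient filter directly in the _analyze API, roughly like this sketch.)
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "limit", "max_token_count": 1 }
  ],
  "text": "english 123DJ"
}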
# curl -H 'Content-type:application/json' -XPOST localhost:8200/pinyin/_analyze?pretty -d '{"analyzer":"pinyin_analyzer","text":"english 123DJ曾经DJ"}'
{
"tokens" : [
{
"token" : "english123dj",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 0
},
{
"token" : "english123djcjdj",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 0
},
{
"token" : "dj",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 1
}
]
}
Our documents have a field of keyword type called "version_numbers", which essentially stores an array of version numbers. For example: ["1.00.100.01", "2.00.470.00"].
This versioning follows a specific pattern, where each group should be associated with search keywords. Here's a breakdown of the versioning pattern:
1.00.240.15
"1.00" → major version
"240"  → minor version
"15"   → maintenance version
I want to build an analyzer such that:
major version is associated with keyword APP_VERSION and can be searched with queries like APP_VERSION1.0, APP_VERSION1.00, APP_VERSION1 etc.
minor version is associated with keyword XP and can be searched with queries like XP24, XP 24, XP240 etc.
maintenance version is associated with keyword Rev and can be searched with queries like Rev15, Rev 15 etc.
documents can also be queried with a combination of all three, like APP_VERSION1 XP240 Rev15 etc.
How do I associate each group of the version pattern with the keywords specified above?
Here's what I've tried so far to tokenize versions:
{
"analysis": {
"analyzer": {
"version_analyzer": {
"tokenizer": "version_tokenizer"
}
},
"tokenizer": {
"version_tokenizer": {
"type": "simple_pattern_split",
"pattern": "\\d"
}
}
}
}
But this seems to be splitting by dots only.
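(For reference, the token output below can be inspected with the _analyze API against this custom analyzer; the index name my_index is made up.)
GET my_index/_analyze
{
  "analyzer": "version_analyzer",
  "text": "2.00.470.00"
}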
{
"tokens" : [
{
"token" : "2",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "00",
"start_offset" : 2,
"end_offset" : 4,
"type" : "word",
"position" : 1
},
{
"token" : "470",
"start_offset" : 5,
"end_offset" : 8,
"type" : "word",
"position" : 2
},
{
"token" : "00",
"start_offset" : 9,
"end_offset" : 11,
"type" : "word",
"position" : 3
}
]
}
I'm new to Elasticsearch, so I highly appreciate any hints/guidance on this.
I googled my question but couldn't find an answer. I'm fairly new to Elasticsearch and I don't think I've fully gotten the idea of tokens yet.
I've built a mapping with a custom name_analyzer that uses the filters lowercase, unique and asciifolding with preserve_original=true.
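(The mapping itself isn't shown here; a sketch of such an analyzer, with a custom asciifolding filter carrying preserve_original, might look roughly like this. The index name names and the filter name ascii_original are made up.)
PUT names
{
  "settings": {
    "analysis": {
      "filter": {
        "ascii_original": {
          "type": "asciifolding",
          "preserve_original": true
        }
      },
      "analyzer": {
        "name_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "unique", "ascii_original" ]
        }
      }
    }
  }
}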
I have the field search_combo_name and the content for example is this:
André, André Mustermann, andre.mustermann#gmail.com, Mustermann
When I use Kibana to analyze the string above against my name_analyzer, I get the following result:
{
"tokens" : [
{
"token" : "andre",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "andré",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "mustermann",
"start_offset" : 13,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "andre.mustermann",
"start_offset" : 25,
"end_offset" : 41,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "gmail.com",
"start_offset" : 42,
"end_offset" : 51,
"type" : "<ALPHANUM>",
"position" : 3
}
]
}
That's the result I expect, but what are these tokens used for?
When I search with bool must/should or match, Elasticsearch searches the content of the fields and not the tokens, right?
These tokens are the ones that are going to be indexed and that you can then search on.
All queries will run on those tokens (i.e. not on the raw content directly), which is why it is important to set proper field types and analyzers (in case of text fields) when indexing data into Elasticsearch.
Failing to do so can result in bad relevance (and also bad performance), i.e. queries with bad and/or imprecise results, or queries that take too long to execute. It's a very wide topic, but if you present your use case in more detail, we can help better.
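For example, a match query on that field is analyzed with the same analyzer, so a search for André would be turned into tokens like andre / andré and matched against the indexed tokens shown above (the index name my_index is made up):
GET my_index/_search
{
  "query": {
    "match": {
      "search_combo_name": "André"
    }
  }
}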
I have a case where I want to use Elasticsearch as a text search engine for pretty long Arabic HTML text.
The search works pretty well, except for words with diacritics: it doesn't seem to be able to recognize them.
For example:
This sentence: ' وَهَكَذَا في كُلّ عَقْدٍ' (this is the one stored in the db)
is the exact same as this: 'وهكذا في كل عقد' (this is what the user enters for search)
It's exactly the same except for the added diacritics, which are handled as separate characters by computers (but are just rendered on top of the other characters).
I want to know if there's a way to make the search ignore all diacritics.
The first method I'm thinking about is whether there's a way to tell Elasticsearch to completely ignore diacritics when indexing (kind of like stopwords?).
If not, is it suitable to have another field in the document (text_normalized) where I manually remove the diacritics before adding it to Elasticsearch? Would that be efficient?
To solve your problem you can use the arabic_normalization token filter; it removes diacritics from the text before indexing. You need to define a custom analyzer, which should look something like this:
"analyzer": {
"rebuilt_arabic": {
"tokenizer": "standard",
"filter": [
"lowercase",
"decimal_digit",
"arabic_stop",
"arabic_normalization",
"arabic_keywords",
"arabic_stemmer"
]
}
}
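Note that arabic_stop, arabic_keywords and arabic_stemmer above are custom filters that need to be defined next to the analyzer; following the rebuilt Arabic analyzer example from the Elasticsearch docs, they would look roughly like this (the keyword list is just a placeholder):
"filter": {
  "arabic_stop": {
    "type": "stop",
    "stopwords": "_arabic_"
  },
  "arabic_keywords": {
    "type": "keyword_marker",
    "keywords": [ "مثال" ]
  },
  "arabic_stemmer": {
    "type": "stemmer",
    "language": "arabic"
  }
}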
Analyzer API check:
GET /_analyze
{
"tokenizer" : "standard",
"filter" : ["arabic_normalization"],
"text" : "وَهَكَذَا في كُلّ عَقْدٍ"
}
Result from Analyzer:
{
"tokens" : [
{
"token" : "وهكذا",
"start_offset" : 0,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "في",
"start_offset" : 10,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "كل",
"start_offset" : 13,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "عقد",
"start_offset" : 18,
"end_offset" : 24,
"type" : "<ALPHANUM>",
"position" : 3
}
]
}
As you can see, the diacritics have been removed. For more information you can check here.
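To use the analyzer at index time, reference it in the field mapping; a minimal, trimmed-down sketch (index and field names are made up, and only built-in filters are kept for brevity):
PUT arabic_texts
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_arabic": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "decimal_digit",
            "arabic_normalization"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "analyzer": "rebuilt_arabic"
      }
    }
  }
}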
I have been using different types of tokenizers for testing and demonstration purposes. I need to check how a particular text field is tokenized using different tokenizers and also see the tokens generated.
How can I achieve that?
You can use the _analyze endpoint for this purpose.
For instance, using the standard analyzer, you can analyze this is a test like this
curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'this is a test'
And this produces the following tokens:
{
"tokens" : [ {
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "a",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 3
}, {
"token" : "test",
"start_offset" : 10,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 4
} ]
}
Of course, you can use any of the existing analyzers, and you can also specify tokenizers using the tokenizer parameter, token filters using the token_filters parameter and character filters using the char_filters parameter. For instance, analyzing the HTML snippet THIS is a <b>TEST</b> with the keyword tokenizer, the lowercase token filter and the html_strip character filter yields a single lowercase token without the HTML markup:
curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&token_filters=lowercase&char_filters=html_strip' -d 'THIS is a <b>TEST</b>'
{
"tokens" : [ {
"token" : "this is a test",
"start_offset" : 0,
"end_offset" : 21,
"type" : "word",
"position" : 1
} ]
}
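On more recent Elasticsearch versions, the same check is done with a JSON body instead of query-string parameters (there the parameters are simply called filter and char_filter):
GET _analyze
{
  "tokenizer": "keyword",
  "filter": [ "lowercase" ],
  "char_filter": [ "html_strip" ],
  "text": "THIS is a <b>TEST</b>"
}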
Apart from what @Val has mentioned, you can try out the term vectors API if you intend to study how tokenizers work. You can try something like this just for examining the tokenization happening in a field:
GET /index-name/type-name/doc-id/_termvector?fields=field-to-be-examined
To know more about tokenizers and their operations, you can refer to this blog.