How to keep the longest token of a text analysis result in Elasticsearch?

I use medcl/elasticsearch-analysis-pinyin to analyze a text field, and I want to keep only the longest token from the analysis result.
For example, in the result below I want to keep only the longest token, english123djcjdj.
Is there a token filter for this?
I've checked the Elasticsearch token filter documentation; there is a limit token count filter, but it only keeps the first token(s), which does not match my case (see the sketch after the output below).
# curl -H 'Content-type:application/json' -XPOST localhost:8200/pinyin/_analyze?pretty -d '{"analyzer":"pinyin_analyzer","text":"english 123DJ曾经DJ"}'
{
"tokens" : [
{
"token" : "english123dj",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 0
},
{
"token" : "english123djcjdj",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 0
},
{
"token" : "dj",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 1
}
]
}
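For reference, the limit token count filter only caps how many tokens pass through (the first N in token order), so a definition like the following sketch (the filter name first_token_only is made up) would keep just the first token emitted, not necessarily the longest one:
{
  "settings": {
    "analysis": {
      "filter": {
        "first_token_only": {
          "type": "limit",
          "max_token_count": 1
        }
      }
    }
  }
}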

Related

How do I index non-standard version numbers in Elasticsearch?

Our documents have a field of keyword type called "version_numbers", which essentially stores an array of version numbers. For example: ["1.00.100.01", "2.00.470.00"].
This versioning follows a specific pattern, where each group should be associated with search keywords. Here's a breakdown of the versioning pattern:
1.00.240.15, where "1.00" is the major version, "240" is the minor version, and "15" is the maintenance version.
I want to build an analyzer such that:
the major version is associated with the keyword APP_VERSION and can be searched with queries like APP_VERSION1.0, APP_VERSION1.00, APP_VERSION1, etc.
the minor version is associated with the keyword XP and can be searched with queries like XP24, XP 24, XP240, etc.
the maintenance version is associated with the keyword Rev and can be searched with queries like Rev15, Rev 15, etc.
documents can also be queried with a combination of all three, like APP_VERSION1 XP240 Rev15.
How do I associate each group of version pattern with keywords specified above?
Here's what I've tried so far to tokenize versions:
{
"analysis": {
"analyzer": {
"version_analyzer": {
"tokenizer": "version_tokenizer"
}
},
"tokenizer": {
"version_tokenizer": {
"type": "simple_pattern_split",
"pattern": "\\d"
}
}
}
}
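For reference, the output below comes from an _analyze request along these lines (the index name versions is an assumption):
# curl -H 'Content-type:application/json' -XPOST localhost:9200/versions/_analyze?pretty -d '{"analyzer":"version_analyzer","text":"2.00.470.00"}'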
But this seems to be splitting by dots only.
{
"tokens" : [
{
"token" : "2",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "00",
"start_offset" : 2,
"end_offset" : 4,
"type" : "word",
"position" : 1
},
{
"token" : "470",
"start_offset" : 5,
"end_offset" : 8,
"type" : "word",
"position" : 2
},
{
"token" : "00",
"start_offset" : 9,
"end_offset" : 11,
"type" : "word",
"position" : 3
}
]
}
I'm new to Elasticsearch, so I highly appreciate any hints/guidance on this.

What are tokens in Elasticsearch for, exactly?

I googled my question but couldn't find an answer. I'm fairly new to Elasticsearch and I don't think I've fully grasped the idea of tokens yet.
I've built a mapping with a custom name_analyzer that uses the filters lowercase, unique and asciifolding with preserve_original=true.
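For illustration, such an analyzer definition might look roughly like the following sketch (the standard tokenizer and the filter name folding_preserve are assumptions):
{
  "analysis": {
    "filter": {
      "folding_preserve": {
        "type": "asciifolding",
        "preserve_original": true
      }
    },
    "analyzer": {
      "name_analyzer": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "folding_preserve", "unique"]
      }
    }
  }
}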
I have a field called search_combo_name, and its content is, for example:
André, André Mustermann, andre.mustermann#gmail.com, Mustermann
When I use Kibana to analyze the string above with my name_analyzer, I get the following result:
{
"tokens" : [
{
"token" : "andre",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "andré",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "mustermann",
"start_offset" : 13,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "andre.mustermann",
"start_offset" : 25,
"end_offset" : 41,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "gmail.com",
"start_offset" : 42,
"end_offset" : 51,
"type" : "<ALPHANUM>",
"position" : 3
}
]
}
That's the result I expect, but what are these tokens used for?
When I search with bool must/should or match, Elasticsearch searches the content of the fields and not the tokens, right?
These tokens are the ones that are going to be indexed and that you can then search on.
All queries will run on those tokens (i.e. not on the raw content directly), which is why it is important to set proper field types and analyzers (in the case of text fields) when indexing data into Elasticsearch.
Failing to do so can result in bad relevance (and also bad performance), i.e. queries with bad and/or imprecise results, or queries that take too long to execute. It's a very wide topic, but if you present your use case in more detail, we can probably help better.
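For instance, a match query along the lines of this sketch (the index name people is assumed) would itself be analyzed, produce the token andre, and match your document, because andre is among the indexed tokens:
curl -H 'Content-type:application/json' -XGET 'localhost:9200/people/_search?pretty' -d '{"query":{"match":{"search_combo_name":"Andre"}}}'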

How to check the tokens generated for different tokenizers in Elasticsearch

I have been using different types of tokenizers for testing and demonstration purposes. I need to check how a particular text field is tokenized by different tokenizers and also see the tokens that are generated.
How can I achieve that?
You can use the _analyze endpoint for this purpose.
For instance, using the standard analyzer, you can analyze this is a test like this
curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'this is a test'
And this produces the following tokens:
{
"tokens" : [ {
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "a",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 3
}, {
"token" : "test",
"start_offset" : 10,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 4
} ]
}
Of course, you can use any of the existing analyzers, and you can also specify tokenizers using the tokenizer parameter, token filters using the token_filters parameter, and character filters using the char_filters parameter. For instance, analyzing the HTML snippet THIS is a <b>TEST</b> with the keyword tokenizer, the lowercase token filter, and the html_strip character filter yields a single lowercase token without the HTML markup:
curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&token_filters=lowercase&char_filters=html_strip' -d 'THIS is a <b>TEST</b>'
{
"tokens" : [ {
"token" : "this is a test",
"start_offset" : 0,
"end_offset" : 21,
"type" : "word",
"position" : 1
} ]
}
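On more recent Elasticsearch versions, the same call is expressed with a JSON request body instead of URL parameters; a rough equivalent would be:
curl -H 'Content-type:application/json' -XGET 'localhost:9200/_analyze?pretty' -d '{"tokenizer":"keyword","filter":["lowercase"],"char_filter":["html_strip"],"text":"THIS is a <b>TEST</b>"}'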
Apart from what @Val has mentioned, you can try out the term vectors API if you intend to study how tokenizers work. You can try something like this just to examine the tokenization happening in a field:
GET /index-name/type-name/doc-id/_termvector?fields=field-to-be-examined
To learn more about tokenizers and their operation, you can refer to this blog.

How can I search for values that have a "-" (dash) in them with Elasticsearch?

My entry in Elasticsearch is like this:
input:
curl -XPUT "http://localhost:9200/movies/movie/4" -d'
{
"uid": "a-b"
}'
query :
curl -XGET "http://localhost:9200/movies/movie/_search" -d '
{
"query" : {
"term" : { "uid": "a-b" }
}
}'
output:
{"took":2,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
Thanks
It entirely depends on the analyser you are using.
With the standard analyser, you can see that two tokens have been generated and the "-" is ignored:
curl -XGET 'localhost:9200/myindex/_analyze?analyzer=standard&pretty' -d 'a-b'
{
"tokens" : [ {
"token" : "a",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "b",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 2
} ]
}
With the whitespace analyser the "-" is treated as data and you only get one token out for your data:
curl -XGET 'localhost:9200/myindex/_analyze?analyzer=whitespace&pretty' -d 'a-b'
{
"tokens" : [ {
"token" : "a-b",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
} ]
}
When you use a term query, no analysis is done on the search query.
So you are probably trying to match "a-b" against the tokens "a" and "b" (assuming you've used the standard analyser in your mapping), i.e. it won't match and won't return results.
If you had used match or query_string in your query, your search would probably have worked, as the search string would have been analysed.
i.e. ES would try to match "a" and "b" against a field containing "a" and "b", and this would be a successful match.
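For example, a match query along these lines would likely return the document:
curl -XGET "http://localhost:9200/movies/movie/_search" -d '
{
  "query" : {
    "match" : { "uid": "a-b" }
  }
}'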

path_hierarchy in elasticsearch

Is it possible to use the path_hierarchy tokenizer with paths that have whitespace in them and have it create tokens based only on the delimiter not the whitespace? For example,
"/airport/hangar 1"
would be tokenized as
"airport", "hangar 1",
not
"airport", "hangar", "1"?
The path_hierarchy tokenizer works perfectly fine with paths that contain whitespace:
curl "localhost:9200/_analyze?tokenizer=path_hierarchy&pretty=true" -d "/airport/hangar 1"
{
"tokens" : [ {
"token" : "/airport",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 1
}, {
"token" : "/airport/hangar 1",
"start_offset" : 0,
"end_offset" : 17,
"type" : "word",
"position" : 1
} ]
}
However, based on your example, you might need to use the pattern tokenizer instead.
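A minimal sketch of that approach, with a tokenizer that splits only on the / delimiter (the names slash_analyzer and slash_tokenizer are made up):
{
  "analysis": {
    "tokenizer": {
      "slash_tokenizer": {
        "type": "pattern",
        "pattern": "/"
      }
    },
    "analyzer": {
      "slash_analyzer": {
        "type": "custom",
        "tokenizer": "slash_tokenizer"
      }
    }
  }
}
With such a tokenizer, "/airport/hangar 1" should come out as the tokens airport and hangar 1, without splitting on the whitespace.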
