Elasticsearch Analysis token filter doesn't capture pattern

I made a custom analyzer in my test index:
PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "myFilter": {
          "type": "pattern_capture",
          "patterns": ["\\d+(,\\d+)*(\\.\\d+)?[%$€£¥]?"],
          "preserve_original": true
        }
      },
      "analyzer": {
        "myAnalyzer": {
          "type": "custom",
          "tokenizer": "myTokenizer",
          "filter": ["myFilter"]
        }
      },
      "tokenizer": {
        "myTokenizer": {
          "type": "pattern",
          "pattern": "([^\\p{N}\\p{L}%$€£¥##'\\-&]+)|((?<=[^\\p{L}])['\\-&]|^['\\-&]|['\\-&](?=[^\\p{L}])|['\\-&]$)|((?<=[^\\p{N}])[$€£¥%]|^[$€£¥%]|(?<=[$€£¥%])(?=\\d))"
        }
      }
    }
  }
}
It is supposed to emit numbers like 123,234.56$ as a single token, but when such a number is provided it emits three tokens: 123, 234, and 56$.
A sample of the failing test query:
GET test/Stam/_termvector?pretty=true
{
  "doc": {
    "Stam": {
      "fld": "John Doe",
      "txt": "100,234.54%"
    }
  },
  "per_field_analyzer": {
    "Stam.txt": "myAnalyzer"
  },
  "fields": ["Stam.txt"],
  "offsets": true,
  "positions": false,
  "payloads": false,
  "term_statistics": false,
  "field_statistics": false
}
Can anyone figure out what the reason is?
In every other case ',' and '.' are delimiters, which is why I added the filter, but unfortunately it doesn't work.
Thanks in advance.

The answer is quite simple: a token filter cannot combine tokens by design. This has to be done with char filters, which are applied to the character stream before the tokenizer starts splitting it into tokens.
I only had to make sure that the custom tokenizer would not split my tokens.
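A quick way to inspect what the analyzer actually emits is the _analyze API; for example, run against the index defined above:
GET test/_analyze
{
  "analyzer": "myAnalyzer",
  "text": "123,234.56$"
}
If the tokenizer alone already splits 123,234.56$ apart, no token filter placed after it will glue the pieces back together.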

Related

Elasticsearch Text with Path Hierarchy vs KeyWord using Prefix query performance

I'm trying to find the best way to filter results based on folder hierarchies. We will use this to simulate a situation where we want to get all assets/documents in a provided folder and all of its subfolders (recursive search).
So for example for such a structure
/someFolder/someSubfolder/1
/someFolder/someSubfolder/1/subFolder
/someFolder/someSubfolder/2
/someFolder/someSubfolder/2/subFolder
If we search for /someFolder/someSubfolder/1
we want to get as results:
/someFolder/someSubfolder/1
/someFolder/someSubfolder/1/subFolder
I've found two ways to do this, but I'm not sure which one is better from a performance perspective.
Use a Text property with the path_hierarchy tokenizer
Use a Keyword property and a prefix query to get results
Both of the above seem to work as I want them to (unless I missed something). I'm not sure which one is better. On one hand, I've read that filtering should be done on Keyword fields. On the other hand, the path_hierarchy tokenizer seems to be created exactly for these scenarios, but it can only be used with a Text field.
Below I prepared some sample code.
Create the index and push some test data into it.
PUT test-index-2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "folderPath": {
        "type": "text",
        "analyzer": "my_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
POST test-index-2/_doc/
{
  "folderPath": "8bf5ad7949a1_104d753b-0fdf-4b07-9213-534dec89112a/Folder with Spaces"
}
POST test-index-2/_doc/
{
  "folderPath": "8bf5ad7949a1_104d753b-0fdf-4b07-9213-534dec89112a/Folder with Spaces/SomeTestValue/11"
}
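To see which tokens the path_hierarchy tokenizer emits for a stored path (and therefore what the term filter below can match), the _analyze API can be used; a quick check against the index above:
GET test-index-2/_analyze
{
  "analyzer": "my_analyzer",
  "text": "8bf5ad7949a1_104d753b-0fdf-4b07-9213-534dec89112a/Folder with Spaces/SomeTestValue/11"
}
Every path prefix comes back as its own token, which is why a term filter on a prefix string also matches the longer paths.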
Now both of the queries below will return two results when matching a partial path hierarchy.
1.
GET test-index-2/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "folderPath": "8bf5ad7949a1_104d753b-0fdf-4b07-9213-534dec89112a/Folder with Spaces" } }
      ]
    }
  }
}
2.
GET test-index-2/_search
{
  "query": {
    "prefix": { "folderPath.keyword": "8bf5ad7949a1_104d753b-0fdf-4b07-9213-534dec89112a/Folder with Spaces" }
  }
}
Now the question is: which solution is better if we want to get a subset of results?

Elasticsearch - Do searches for alternative country codes

I have a document with a field called 'countryCode'. I have a term query that searches on its keyword value, but I'm having some issues:
Some records say UK and some others say GB
Some records say US and some others say USA
And the list goes on..
Can I instruct my index to handle all those variations somehow, instead of me having to expand the terms on my query filter?
What you are looking for is a way to have your tokens match equivalent tokens that may not share the same characters. This can be done using synonyms.
Elasticsearch lets you configure synonyms and have your queries use them to return results accordingly.
I have configured a field with a custom analyzer that uses the synonym token filter. Below are a sample mapping and query so that you can play with them and see if they fit your needs.
Mapping
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "usa, us",
            "uk, gb"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "mydocs": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "my_synonyms"
        }
      }
    }
  }
}
Sample Document
POST my_index/mydocs/1
{
  "name": "uk is pretty cool country"
}
And when you use the query below, it returns the above document as well.
Query
GET my_index/mydocs/_search
{
  "query": {
    "match": {
      "name": "gb"
    }
  }
}
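To see the expansion in action, the analyzer can be run directly; with the equivalent synonyms above, analyzing gb should also emit uk at the same position (a quick check against the index above):
GET my_index/_analyze
{
  "analyzer": "my_synonyms",
  "text": "gb"
}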
Refer to the official documentation to understand more about this. Hope this helps!
To handle this within ES itself without using Logstash, I'd suggest a simple ingest pipeline with a gsub processor to update the field in place:
{
  "gsub": {
    "field": "countryCode",
    "pattern": "GB",
    "replacement": "UK"
  }
}
https://www.elastic.co/guide/en/elasticsearch/reference/master/gsub-processor.html
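For completeness, the gsub processor has to live inside an ingest pipeline; a minimal sketch, assuming a hypothetical pipeline id normalize-country-code (documents would then be indexed with ?pipeline=normalize-country-code):
PUT _ingest/pipeline/normalize-country-code
{
  "description": "Normalize alternative country codes",
  "processors": [
    {
      "gsub": {
        "field": "countryCode",
        "pattern": "GB",
        "replacement": "UK"
      }
    }
  ]
}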

Analyze all uppercase tokens in a field

I would like to analyze the value of a text field in two ways: using standard analysis, and using a custom analysis that indexes only the all-uppercase tokens in the text.
For example, if the value is "This WHITE cat is very CUTE.", the only tokens that should be indexed by the custom analysis are "WHITE" and "CUTE". For this, I am using the Pattern Capture token filter with the pattern "(\b[A-Z]+\b)+?", but it indexes all tokens, not just the uppercase ones.
Is Pattern Capture Token Filter the right one to use for this task? If yes, what am I doing wrong? If not, how do I get this done? Please help.
You should instead use a pattern_replace char_filter:
PUT test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "filter_lowercase": {
          "type": "pattern_replace",
          "pattern": "[A-Z][a-z]+|[a-z]+",
          "replacement": ""
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "filter_lowercase"
          ]
        }
      }
    }
  }
}
GET test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "This WHITE cat is very CUTE"
}
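To index the same text both ways, as the question asks, the custom analyzer can be attached to a sub-field while the main field keeps standard analysis; a sketch, assuming a hypothetical field name body and typeless (7.x+) mapping syntax:
PUT test/_mapping
{
  "properties": {
    "body": {
      "type": "text",
      "fields": {
        "uppercase_only": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}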

Elasticsearch completion - generating input list with analyzers

I've had a look at this article: https://www.elastic.co/blog/you-complete-me
However, it requires writing some logic in the client to create multiple "input" values. Is there a way to define an analyzer (maybe using shingle or ngram/edge-ngram) that will generate the multiple terms for the input?
Here's what I tried (and it obviously doesn't work):
DELETE /products/
PUT /products/
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "shingle",
          "max_shingle_size": 5,
          "min_shingle_size": 2
        }
      },
      "analyzer": {
        "autocomplete": {
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ],
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "name": {
          "type": "string",
          "copy_to": ["name_suggest"]
        },
        "name_suggest": {
          "type": "completion",
          "payloads": false,
          "analyzer": "autocomplete"
        }
      }
    }
  }
}
PUT /products/product/1
{
  "name": "Apple iPhone 5"
}
PUT /products/product/2
{
  "name": "iPhone 4 16GB"
}
PUT /products/product/3
{
  "name": "iPhone 3 GS 16GB black"
}
PUT /products/product/4
{
  "name": "Apple iPhone 4 S 16 GB white"
}
PUT /products/product/5
{
  "name": "Apple iPhone case"
}
POST /products/_suggest
{
  "suggestions": {
    "text": "i",
    "completion": {
      "field": "name_suggest"
    }
  }
}
I don't think there's a direct way to achieve this.
I'm not sure why you would need to store ngrammed tokens, considering Elasticsearch already stores the 'input' text in an FST structure. New releases also allow for fuzziness in the suggest query.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html#fuzzy
I can understand the need for something like a shingle analyser to generate the inputs for you, but there doesn't seem to be a way yet. Having said that, the _analyze endpoint can be used to generate tokens from the analyzer of your choice, and those tokens can be passed to the 'input' field (with or without any other added logic). This way you won't have to replicate your analyzer logic in your application code. That's the only way I can think of to achieve the desired input field.
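A sketch of that manual approach against the index above: generate the tokens with _analyze, then pass them (plus any extra logic) as the completion input. The input values shown are illustrative rather than the exact shingle output, and the request-body form of _analyze shown here is the one accepted by newer Elasticsearch versions:
GET /products/_analyze
{
  "analyzer": "autocomplete",
  "text": "Apple iPhone 5"
}
PUT /products/product/1
{
  "name": "Apple iPhone 5",
  "name_suggest": {
    "input": ["apple iphone", "iphone 5", "apple iphone 5"]
  }
}
If you supply the input explicitly like this, you may want to drop the copy_to from the mapping so the two mechanisms don't overlap.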
Hope it helps.

How to index both a string and its reverse?

I'm looking for a way to analyze the string "abc123" as ["abc123", "321cba"]. I've looked at the reverse token filter, but that only gets me ["321cba"]. Documentation on this filter is pretty sparse, only stating that
"A token filter of type reverse ... simply reverses each token."
(see http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-reverse-tokenfilter.html).
I've also tinkered with using the keyword_repeat filter, which gets me two instances. I don't know if that's useful, but for now all it does is reverse both instances.
How can I use the reverse token filter but keep the original token as well?
My analyzer:
{ "settings" : { "analysis" : {
"analyzer" : {
"phone" : {
"type" : "custom"
,"char_filter" : ["strip_non_numeric"]
,"tokenizer" : "keyword"
,"filter" : ["standard", "keyword_repeat", "reverse"]
}
}
,"char_filter" : {
"strip_non_numeric" : {
"type" : "pattern_replace"
,"pattern" : "[^0-9]"
,"replacement" : ""
}
}
}}}
Create and put an analyzer that reverses a string (say, reverse_analyzer):
PUT index_name
{
  "settings": {
    "analysis": {
      "analyzer": {
        "reverse_analyzer": {
          "type": "custom",
          "char_filter": [
            "strip_non_numeric"
          ],
          "tokenizer": "keyword",
          "filter": [
            "standard",
            "keyword_repeat",
            "reverse"
          ]
        }
      },
      "char_filter": {
        "strip_non_numeric": {
          "type": "pattern_replace",
          "pattern": "[^0-9]",
          "replacement": ""
        }
      }
    }
  }
}
Then, for a field (say phone_no), use a mapping like the following (create a type and append the mapping for phone_no):
PUT index_name/type_name/_mapping
{
  "type_name": {
    "properties": {
      "phone_no": {
        "type": "string",
        "fields": {
          "reverse": {
            "type": "string",
            "analyzer": "reverse_analyzer"
          }
        }
      }
    }
  }
}
So phone_no is a multi-field, which will store the string and its reverse. If you index
phone_no: 911220
then in Elasticsearch there will be two fields: phone_no: 911220 and phone_no.reverse: 022119, so you can search or filter on either the reversed or the non-reversed field.
Hope this helps.
I don't believe you can do this directly, as I am unaware of any way to get the reverse token filter to also output the original.
However, you could use the fields parameter to index both the original and the reversed at the same time with no additional coding. You would then search both fields.
So let's say your field was called phone_number:
"phone_number": {
"type": "string",
"fields": {
"reverse": { "type": "string", "index": "phone" }
}
}
In this case we're indexing using the default analyzer (assume standard), plus also indexing into reverse with your custom analyzer phone, which reverses the token. You then issue your queries against both fields.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html
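A sketch of querying both fields at once; the index name is a placeholder, and multi_match is just one convenient way to hit both:
GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "911220",
      "fields": ["phone_number", "phone_number.reverse"]
    }
  }
}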
I'm not sure it's possible to do this using the built-in set of token filters. I would recommend creating your own plugin. There is the ICU Analysis plugin supported by the Elasticsearch team that you can use as an example.
I wound up using the following two char_filters in my analyzer. It's an ugly abuse of regex, but it seems to work. It is limited to the first 20 numeric characters, but in my use case that is acceptable.
First it captures the numeric characters in groups, then explicitly rebuilds the string followed by its own (numeric-only!) reverse. The space in the middle of the replacement pattern then causes the tokenizer to split it into two tokens: the original and the reverse.
,"char_filter" : {
"strip_non_numeric" : {
"type" : "pattern_replace"
,"pattern" : "[^0-9]"
,"replacement" : ""
}
,"dupe_and_reverse" : {
"type" : "pattern_replace"
,"pattern" : "([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)"
,"replacement" : "$1$2$3$4$5$6$7$8$9$10$11$12$13$14$15$16$17$18$19$20 $20$19$18$17$16$15$14$13$12$11$10$9$8$7$6$5$4$3$2$1"
}
}
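For context, a sketch of how these char filters might be wired into the analyzer; the index name is a placeholder, and a whitespace tokenizer is assumed so that the space introduced by dupe_and_reverse splits the original and reversed forms into separate tokens:
PUT phones
{
  "settings": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "custom",
          "char_filter": ["strip_non_numeric", "dupe_and_reverse"],
          "tokenizer": "whitespace"
        }
      },
      "char_filter": {
        "strip_non_numeric": {
          "type": "pattern_replace",
          "pattern": "[^0-9]",
          "replacement": ""
        },
        "dupe_and_reverse": {
          "type": "pattern_replace",
          "pattern": "([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)",
          "replacement": "$1$2$3$4$5$6$7$8$9$10$11$12$13$14$15$16$17$18$19$20 $20$19$18$17$16$15$14$13$12$11$10$9$8$7$6$5$4$3$2$1"
        }
      }
    }
  }
}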
