ElasticSearch catenate_words -- only keep concatenated value - elasticsearch

Following examples here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-graph-tokenfilter.html
Specifically the catenate_words option.
I would like to use this to concatenate words that I can then use in a phrase query before and after the concatenated word, but the word parts prevent this.
For example, their example is this:
super-duper-xl → [ superduperxl, super, duper, xl ]
Now if my actual phrase was "what a great super-duper-xl" that would turn into a sequence:
[what,a,great,superduperxl,super,duper,xl]
That matches the phrase "great superduperxl" which is fine.
However, if the phrase was "the super-duper-xl emerged" the sequence would be:
[the,superduperxl,super,duper,xl,emerged]
This does not phrase match "superduperxl emerged", however it would if the part tokens (super,duper,xl) were not emitted.
Is there any way I can concatenate words keeping only the concatenated word and filtering out the word parts?

Pattern replace character filter can be used here.
"-" is replaced with "" to generate tokens
Query
PUT my-index1
{
"settings": {
"analysis": {
"analyzer": {
"remove_hyphen_analyzer": {
"tokenizer": "standard",
"char_filter": [
"remove_hyphen_filter"
]
}
},
"char_filter": {
"remove_hyphen_filter": {
"type": "pattern_replace",
"pattern": "-",
"replacement": ""
}
}
}
},
"mappings": {
"properties": {
"title":{
"type": "text",
"analyzer": "remove_hyphen_analyzer"
}
}
}
}
POST my-index1/_analyze
{
"analyzer": "remove_hyphen_analyzer",
"text": "the super-duper-xl emerged"
}
Result
{
"tokens" : [
{
"token" : "the",
"start_offset" : 0,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "superduperxl",
"start_offset" : 4,
"end_offset" : 18,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "emerged",
"start_offset" : 19,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 2
}
]
}

Related

Tokens in Index Time vs Query Time are not the same when using common_gram filter ElasticSearch

I want to use common_gram token filter based on this link.
My elasticsearch version is: 7.17.8
Here is the setting of my index in ElasticSearch.
I have defined a filter named "common_grams" that uses "common_grams" as type.
I have defined a custom analyzer named "index_grams" that use "whitespace" as tokenizer and the above filter as a token filter.
I have just one field named as "title_fa" and I have used my custom analyzer for this field.
PUT /my-index-000007
{
"settings": {
"analysis": {
"analyzer": {
"index_grams": {
"tokenizer": "whitespace",
"filter": [ "common_grams" ]
}
},
"filter": {
"common_grams": {
"type": "common_grams",
"common_words": [ "the","is" ]
}
}
}
}
,
"mappings": {
"properties": {
"title_fa": {
"type": "text",
"analyzer": "index_grams",
"boost": 40
}
}
}
}
It works fine in Index Time and the tokens are what I expect to be. Here I get the tokens via kibana dev tool.
GET /my-index-000007/_analyze
{
"analyzer": "index_grams",
"text" : "brown is the"
}
Here is the result of the tokens for the text.
{
"tokens" : [
{
"token" : "brown",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "brown_is",
"start_offset" : 0,
"end_offset" : 8,
"type" : "gram",
"position" : 0,
"positionLength" : 2
},
{
"token" : "is",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 1
},
{
"token" : "is_the",
"start_offset" : 6,
"end_offset" : 12,
"type" : "gram",
"position" : 1,
"positionLength" : 2
},
{
"token" : "the",
"start_offset" : 9,
"end_offset" : 12,
"type" : "word",
"position" : 2
}
]
}
When I search the query "brown is the", I expect these tokens to be searched:
["brown", "brown_is", "is", "is_the", "the" ]
But these are the tokens that will actually be searched:
["brown is the", "brown is_the", "brown_is the"]
Here you can see the details
Query Time Tokens
UPDATE:
I have added a sample document like this:
POST /my-index-000007/_doc/1
{ "title_fa" : "brown" }
When I search "brown coat"
GET /my-index-000007/_search
{
"query": {
"query_string": {
"query": "brown is coat",
"default_field": "title_fa"
}
}
}
it returns the document because it searches:
["brown", "coat"]
When I search "brown is coat", it can't find the document because it is searching for
["brown is coat", "brown_is coat", "brown is_coat"]
Clearly when it gets a query that contains a common word, it acts differently and I guess it's because of the index time tokens and query time tokens.
Do you know where I am getting this wrong? Why is it acting differently?

ElasticSearch - Search without apostrophe

I'm trying to allow users to search without entering an apostrophe.
E.G type Johns and still bring up results for John's
I've tried multiple things including adding the stemmer filter but with no luck.
I thought I could potentially do something manual such as
GET /_analyze
{
"char_filter": [{
"type": "pattern_replace",
"pattern": "\\s*([a-zA-Z0-9]+)\\'s",
"replacement": "$1 $1s $1's "
}],
"tokenizer": "standard",
"text": "john's dog jumped"
}
And i get the following response
{
"tokens" : [
{
"token" : "john",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "johns",
"start_offset" : 5,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "john's",
"start_offset" : 5,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "dog",
"start_offset" : 7,
"end_offset" : 10,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "jumped",
"start_offset" : 11,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 4
}
]
}
However I still don't get a match when I search for "johns" with out the '
My settings look like:
"analyzer" : {
"my_custom_search" : {
"char_filter" : [ "flexible_plurals" ],
"tokenizer" : "standard"
}
},
"char_filter" : {
"flexible_plurals" : {
"pattern" : """\s*([a-zA-Z0-9]+)\'s""",
"type" : "pattern_replace",
"replacement" : " $1 $1s $1's "
}
}
My mappings like
"search-terms" : {
"type" : "text",
"analyzer" : "my_custom_search"
}
I am using the match query to query the data
You are almost correct, Hope you are using the match query and you have defined your field as text with the custom analyzer, if you use the text field without your custom analyzer which uses your char_filter it will simply use the standard analyzer and won't generate the johns token hence no match.
Complete Working example
Index setting and mapping
{
"settings": {
"index": {
"analysis": {
"char_filter": {
"apostrophe_filter": {
"type": "pattern_replace",
"pattern": "\\s*([a-zA-Z0-9]+)\\'s",
"replacement": "$1 $1s $1's "
}
},
"analyzer": {
"custom_analyzer": {
"filter": [
"lowercase"
],
"char_filter": [
"apostrophe_filter"
],
"tokenizer": "standard"
}
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "custom_analyzer"
}
}
}
}
Index sample document
{
"title" : "john's"
}
And search for johns
{
"query": {
"match": {
"title": "johns"
}
}
}
Search results
"hits": [
{
"_index": "72937076",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"title": "john's" --> note `john's`
}
}
]

Unexpected result from Synonym Filter if Char Filter is also used

I have a field which uses both a char filter (it converts Chinese words to some separator text) and a synonym filter.
PUT test
{
"mappings": {
"properties": {
"description": {
"analyzer": "standard",
"search_analyzer": "foobar",
"type": "text"
}
}
},
"settings": {
"analysis": {
"analyzer": {
"foobar": {
"char_filter": [
"chinese_to_sep"
],
"filter": [
"custom_synonyms"
],
"tokenizer": "standard"
}
},
"char_filter": {
"chinese_to_sep": {
"pattern": "([\\p{IsHan}]+)",
"replacement": " sepsep ",
"type": "pattern_replace"
}
},
"filter": {
"custom_synonyms" : {
"type" : "synonym",
"updateable" : "true",
"synonyms" : [
"苹果 => apple",
"香蕉 => banana",
"柠檬 => lemon",
"xyz => universe"
]
}
}
}
}
}
(You may ask why my synonym filter still has Chinese words if my char filter will remove all Chinese characters? It's because that synonym filter is used by other fields as well in the real setup, and not all fields will remove Chinese words.)
Anyway, when I run this query, I expect I will get just two tokens ("Sun", and "sepsep"). But to my surprise, I got ("Sun", "apple", "banana", "lemon")!
GET test/_analyze
{
"analyzer": "foobar",
"text": "Sun 太陽",
"explain": false
}
I got this instead:
{
"tokens" : [
{
"token" : "Sun",
"start_offset" : 0,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "apple",
"start_offset" : 5,
"end_offset" : 5,
"type" : "SYNONYM",
"position" : 1
},
{
"token" : "banana",
"start_offset" : 5,
"end_offset" : 5,
"type" : "SYNONYM",
"position" : 1
},
{
"token" : "lemon",
"start_offset" : 5,
"end_offset" : 5,
"type" : "SYNONYM",
"position" : 1
}
]
}
It seems like the synonym filter is somehow "pre-processed" by the char-filter and thus the first 3 entries will map "sepsep" to a synonym. Is this expected? Or is this a bug?

CamelCase Search with Elasticsearch

I want to configure Elasticsearch, so that searching for "JaFNam" will create a good score for "JavaFileName".
I'm tried to build an analyzer, that combines a CamelCase pattern analyzer with an edge_ngram tokenizer. I thought this would create terms like these:
J F N Ja Fi Na Jav Fil Nam Java File Name
But the tokenizer seems not to have any effect: I keep getting these terms:
Java File Name
What would the correct Elasticsearch configuration look like?
Example code:
curl -XPUT 'http://127.0.0.1:9010/hello?pretty=1' -d'
{
"settings":{
"analysis":{
"analyzer":{
"camel":{
"type":"pattern",
"pattern":"([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])",
"filters": ["edge_ngram"]
}
}
}
}
}
'
curl -XGET 'http://127.0.0.1:9010/hello/_analyze?pretty=1' -d'
{
"analyzer":"camel",
"text":"JavaFileName"
}'
results in:
{
"tokens" : [ {
"token" : "java",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
}, {
"token" : "file",
"start_offset" : 4,
"end_offset" : 8,
"type" : "word",
"position" : 1
}, {
"token" : "name",
"start_offset" : 8,
"end_offset" : 12,
"type" : "word",
"position" : 2
} ]
}
You analyzer definition is not correct. you need a tokenizer and an array of filter, as it is your analyzer doesn't work. Try like this instead:
{
"settings": {
"analysis": {
"analyzer": {
"camel": {
"tokenizer": "my_pattern",
"filter": [
"my_gram"
]
}
},
"filter": {
"my_gram": {
"type": "edge_ngram",
"max_gram": 10
}
},
"tokenizer": {
"my_pattern": {
"type": "pattern",
"pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
}
}
}
}
}

Searching for hyphened text in Elasticsearch

I am storing a 'Payment Reference Number' in elasticsearch.
The layout of it is e.g.: 2-4-3-635844569819109531 or 2-4-2-635844533758635433 etc
I want to be able to search for documents by their payment ref number either by
Searching using the 'whole' reference number, e.g. putting in 2-4-2-635844533758635433
Any 'part' of the reference number from the 'start'. E.g. 2-4-2-63 (.. so only return the second one in the example)
Note: i do not want to search 'in the middle' or 'at the end' etc. From the beginning only.
Anyways, the hyphens are confusing me.
Questions
1) I am not sure if I should remove them in the mapping like
"char_filter" : {
"removeHyphen" : {
"type" : "mapping",
"mappings" : ["-=>"]
}
},
or not. I have never use the mappings in that way so not sure if this is necessary.
2) I think I need a 'ngrams' filter because I want to be able to search a part of the reference number from the being. I think something like
"partial_word":{
"filter":[
"standard",
"lowercase",
"name_ngrams"
],
"type":"custom",
"tokenizer":"whitespace"
},
and the filter
"name_ngrams":{
"side":"front",
"max_gram":50,
"min_gram":2,
"type":"edgeNGram"
},
I am not sure how to put it all together but
"paymentReference":{
"type":"string",
"analyzer": "??",
"fields":{
"partial":{
"search_analyzer":"???",
"index_analyzer":"partial_word",
"type":"string"
}
}
}
Everything that I have tried seems to always 'break' in the second search case.
If I do 'localhost:9200/orders/_analyze?field=paymentReference&pretty=1' -d "2-4-2-635844533758635433" it always breaks the hyphen as it's own token and returns e.g. all documents with 2- which is 'alot'! and not what I want when searching for 2-4-2-6
Can someone tell me how to map this field for the two types of searches I am trying to achieve?
Update - Answer
Effectively what Val said below. I just changed the mapping slightly to be more specific re the analyzers and also I don't need the main string indexed because I just query the partial.
Mapping
"paymentReference":{
"type": "string",
"index":"not_analyzed",
"fields": {
"partial": {
"search_analyzer":"payment_ref",
"index_analyzer":"payment_ref",
"type":"string"
}
}
}
Analyzer
"payment_ref": {
"type": "custom",
"filter": [
"lowercase",
"name_ngrams"
],
"tokenizer": "keyword"
}
Filter
"name_ngrams":{
"side":"front",
"max_gram":50,
"min_gram":2,
"type":"edgeNGram"
},
You don't need to use the mapping char filter for this.
You're on the right track using the Edge NGram token filter since you need to be able to search for prefixes only. I would use a keyword tokenizer instead to make sure the term is taken as a whole. So the way to set this up is like this:
curl -XPUT localhost:9200/orders -d '{
"settings": {
"analysis": {
"analyzer": {
"partial_word": {
"type": "custom",
"filter": [
"lowercase",
"ngram_filter"
],
"tokenizer": "keyword"
}
},
"filter": {
"ngram_filter": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 50
}
}
}
},
"mappings": {
"order": {
"properties": {
"paymentReference": {
"type": "string",
"fields": {
"partial": {
"analyzer": "partial_word",
"type": "string"
}
}
}
}
}
}
}'
Then you can analyze what is going to be indexed into your paymentReference.partial field:
curl -XGET 'localhost:9205/payments/_analyze?field=paymentReference.partial&pretty=1' -d "2-4-2-635844533758635433"
And you get exactly what you want, i.e. all the prefixes:
{
"tokens" : [ {
"token" : "2-",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4-",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4-2",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4-2-",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4-2-6",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4-2-63",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4-2-635",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4-2-6358",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4-2-63584",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
...
Finally you can search for any prefix:
curl -XGET localhost:9200/orders/order/_search?q=paymentReference.partial:2-4-3
Not sure whether wildcard search match your needs. I define custom filter and set preserve_original and generate number parts false. Here is the sample code:
PUT test1
{
"settings" : {
"analysis" : {
"analyzer" : {
"myAnalyzer" : {
"type" : "custom",
"tokenizer" : "whitespace",
"filter" : [ "dont_split_on_numerics" ]
}
},
"filter" : {
"dont_split_on_numerics" : {
"type" : "word_delimiter",
"preserve_original": true,
"generate_number_parts" : false
}
}
}
},
"mappings": {
"type_one": {
"properties": {
"title": {
"type": "text",
"analyzer": "standard"
}
}
},
"type_two": {
"properties": {
"raw": {
"type": "text",
"analyzer": "myAnalyzer"
}
}
}
}
}
POST test1/type_two/1
{
"raw": "2-345-6789"
}
GET test1/type_two/_search
{
"query": {
"wildcard": {
"raw": "2-345-67*"
}
}
}

Resources