Searching for hyphenated text in Elasticsearch

I am storing a 'Payment Reference Number' in Elasticsearch.
Its layout is e.g. 2-4-3-635844569819109531 or 2-4-2-635844533758635433 etc.
I want to be able to search for documents by their payment reference number either by:
1) Searching using the 'whole' reference number, e.g. putting in 2-4-2-635844533758635433
2) Any 'part' of the reference number from the 'start', e.g. 2-4-2-63 (so only the second one in the example above is returned)
Note: I do not want to search 'in the middle' or 'at the end' etc. From the beginning only.
Anyway, the hyphens are confusing me.
Questions
1) I am not sure whether I should remove them in the mapping, like this:
"char_filter" : {
  "removeHyphen" : {
    "type" : "mapping",
    "mappings" : ["-=>"]
  }
},
or not. I have never used mappings in that way, so I am not sure if this is necessary.
2) I think I need an 'ngrams' filter because I want to be able to search a part of the reference number from the beginning. I think something like:
"partial_word":{
  "filter":[
    "standard",
    "lowercase",
    "name_ngrams"
  ],
  "type":"custom",
  "tokenizer":"whitespace"
},
and the filter
"name_ngrams":{
  "side":"front",
  "max_gram":50,
  "min_gram":2,
  "type":"edgeNGram"
},
I am not sure how to put it all together, but:
"paymentReference":{
  "type":"string",
  "analyzer": "??",
  "fields":{
    "partial":{
      "search_analyzer":"???",
      "index_analyzer":"partial_word",
      "type":"string"
    }
  }
}
Everything that I have tried seems to always 'break' in the second search case.
If I do 'localhost:9200/orders/_analyze?field=paymentReference&pretty=1' -d "2-4-2-635844533758635433", it always breaks the hyphen out as its own token and returns e.g. all documents containing '2-', which is a lot, and not what I want when searching for 2-4-2-6.
Can someone tell me how to map this field for the two types of searches I am trying to achieve?
Update - Answer
Effectively what Val said below. I just changed the mapping slightly to be more specific regarding the analyzers, and I don't need the main string analyzed (it stays not_analyzed) because I only ever query the partial field.
Mapping
"paymentReference":{
  "type": "string",
  "index": "not_analyzed",
  "fields": {
    "partial": {
      "search_analyzer": "payment_ref",
      "index_analyzer": "payment_ref",
      "type": "string"
    }
  }
}
Analyzer
"payment_ref": {
  "type": "custom",
  "filter": [
    "lowercase",
    "name_ngrams"
  ],
  "tokenizer": "keyword"
}
Filter
"name_ngrams": {
  "side": "front",
  "max_gram": 50,
  "min_gram": 2,
  "type": "edgeNGram"
}

You don't need to use the mapping char filter for this.
You're on the right track using the Edge NGram token filter since you need to be able to search for prefixes only. I would use a keyword tokenizer instead to make sure the term is taken as a whole. So the way to set this up is like this:
curl -XPUT localhost:9200/orders -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "partial_word": {
          "type": "custom",
          "filter": [
            "lowercase",
            "ngram_filter"
          ],
          "tokenizer": "keyword"
        }
      },
      "filter": {
        "ngram_filter": {
          "type": "edgeNGram",
          "min_gram": 2,
          "max_gram": 50
        }
      }
    }
  },
  "mappings": {
    "order": {
      "properties": {
        "paymentReference": {
          "type": "string",
          "fields": {
            "partial": {
              "analyzer": "partial_word",
              "type": "string"
            }
          }
        }
      }
    }
  }
}'
Then you can analyze what is going to be indexed into your paymentReference.partial field:
curl -XGET 'localhost:9200/orders/_analyze?field=paymentReference.partial&pretty=1' -d "2-4-2-635844533758635433"
And you get exactly what you want, i.e. all the prefixes:
{
  "tokens" : [ {
    "token" : "2-",
    "start_offset" : 0,
    "end_offset" : 24,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "2-4",
    "start_offset" : 0,
    "end_offset" : 24,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "2-4-",
    "start_offset" : 0,
    "end_offset" : 24,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "2-4-2",
    "start_offset" : 0,
    "end_offset" : 24,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "2-4-2-",
    "start_offset" : 0,
    "end_offset" : 24,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "2-4-2-6",
    "start_offset" : 0,
    "end_offset" : 24,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "2-4-2-63",
    "start_offset" : 0,
    "end_offset" : 24,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "2-4-2-635",
    "start_offset" : 0,
    "end_offset" : 24,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "2-4-2-6358",
    "start_offset" : 0,
    "end_offset" : 24,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "2-4-2-63584",
    "start_offset" : 0,
    "end_offset" : 24,
    "type" : "word",
    "position" : 1
  }, {
  ...
Finally you can search for any prefix:
curl -XGET localhost:9200/orders/order/_search?q=paymentReference.partial:2-4-3
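If you prefer the request-body form, the same search can be expressed roughly like this (a sketch, assuming the orders index and order type created above):
curl -XPOST 'localhost:9200/orders/order/_search' -d '{
  "query": {
    "match": {
      "paymentReference.partial": "2-4-3"
    }
  }
}'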

Not sure whether a wildcard search matches your needs. I define a custom filter and set preserve_original to true and generate_number_parts to false. Here is the sample code:
PUT test1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "myAnalyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [ "dont_split_on_numerics" ]
        }
      },
      "filter": {
        "dont_split_on_numerics": {
          "type": "word_delimiter",
          "preserve_original": true,
          "generate_number_parts": false
        }
      }
    }
  },
  "mappings": {
    "type_one": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "standard"
        }
      }
    },
    "type_two": {
      "properties": {
        "raw": {
          "type": "text",
          "analyzer": "myAnalyzer"
        }
      }
    }
  }
}
POST test1/type_two/1
{
  "raw": "2-345-6789"
}
GET test1/type_two/_search
{
  "query": {
    "wildcard": {
      "raw": "2-345-67*"
    }
  }
}
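To double-check what myAnalyzer actually emits, you could run _analyze against it (a sketch; with preserve_original set to true and generate_number_parts set to false, the whole reference should come back as a single 2-345-6789 token rather than being split on the hyphens):
GET test1/_analyze
{
  "analyzer": "myAnalyzer",
  "text": "2-345-6789"
}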

Related

ElasticSearch - Search without apostrophe

I'm trying to allow users to search without entering an apostrophe.
E.g. typing Johns should still bring up results for John's.
I've tried multiple things including adding the stemmer filter but with no luck.
I thought I could potentially do something manual such as
GET /_analyze
{
  "char_filter": [{
    "type": "pattern_replace",
    "pattern": "\\s*([a-zA-Z0-9]+)\\'s",
    "replacement": "$1 $1s $1's "
  }],
  "tokenizer": "standard",
  "text": "john's dog jumped"
}
And I get the following response:
{
  "tokens" : [
    {
      "token" : "john",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "johns",
      "start_offset" : 5,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "john's",
      "start_offset" : 5,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "dog",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "jumped",
      "start_offset" : 11,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}
However, I still don't get a match when I search for "johns" without the apostrophe.
My settings look like:
"analyzer" : {
"my_custom_search" : {
"char_filter" : [ "flexible_plurals" ],
"tokenizer" : "standard"
}
},
"char_filter" : {
"flexible_plurals" : {
"pattern" : """\s*([a-zA-Z0-9]+)\'s""",
"type" : "pattern_replace",
"replacement" : " $1 $1s $1's "
}
}
My mappings look like:
"search-terms" : {
  "type" : "text",
  "analyzer" : "my_custom_search"
}
I am using the match query to query the data.
You are almost correct. I hope you are using the match query and have defined your field as text with the custom analyzer: if you use a text field without your custom analyzer (which applies your char_filter), it will simply use the standard analyzer, won't generate the johns token, and hence there is no match.
Complete Working example
Index setting and mapping
{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "apostrophe_filter": {
            "type": "pattern_replace",
            "pattern": "\\s*([a-zA-Z0-9]+)\\'s",
            "replacement": "$1 $1s $1's "
          }
        },
        "analyzer": {
          "custom_analyzer": {
            "filter": [
              "lowercase"
            ],
            "char_filter": [
              "apostrophe_filter"
            ],
            "tokenizer": "standard"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "custom_analyzer"
      }
    }
  }
}
Index sample document
{
  "title" : "john's"
}
And search for johns
{
  "query": {
    "match": {
      "title": "johns"
    }
  }
}
Search results
"hits": [
{
"_index": "72937076",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"title": "john's" --> note `john's`
}
}
]
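To see why this matches, you can run _analyze with the custom analyzer (using the index name shown in the hits above). The char filter expands the apostrophe form, so the output should contain roughly the tokens john, johns and john's, which is why a match query for johns finds the document:
GET 72937076/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "john's"
}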

How to get an index item that has "name": "McLaren" by searching with "mclaren" in Elasticsearch 1.7?

Here is the tokenizer -
"tokenizer": {
"filename" : {
"pattern" : "[^\\p{L}\\d]+",
"type" : "pattern"
}
},
Mapping -
"name": {
"type": "string",
"analyzer": "filename_index",
"include_in_all": true,
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
},
"lower_case_sort": {
"type": "string",
"analyzer": "naturalsort"
}
}
},
Analyzer -
"filename_index" : {
"tokenizer" : "filename",
"filter" : [
"word_delimiter",
"lowercase",
"russian_stop",
"russian_keywords",
"russian_stemmer",
"czech_stop",
"czech_keywords",
"czech_stemmer"
]
},
I would like to get the index item by searching for mclaren, but the name indexed is McLaren.
I would like to stick to query_string because a lot of other functionality is based on it. Here is the query with which I can't get the expected result:
{
  "query": {
    "filtered": {
      "query": {
        "query_string" : {
          "query" : "mclaren",
          "default_operator" : "AND",
          "analyze_wildcard" : true
        }
      }
    }
  },
  "size": 50,
  "from": 0,
  "sort": {}
}
How could I accomplish this? Thank you!
I got it! The problem is certainly around the word_delimiter token filter.
By default it:
Splits tokens at letter case transitions. For example: PowerShot → Power, Shot
Cf. the documentation.
So macLaren generates two tokens -> [mac, Laren], while maclaren generates only one token ['maclaren'].
Analyze example:
POST _analyze
{
  "tokenizer": {
    "pattern": """[^\p{L}\d]+""",
    "type": "pattern"
  },
  "filter": [
    "word_delimiter"
  ],
  "text": ["macLaren", "maclaren"]
}
Response:
{
  "tokens" : [
    {
      "token" : "mac",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Laren",
      "start_offset" : 3,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "maclaren",
      "start_offset" : 9,
      "end_offset" : 17,
      "type" : "word",
      "position" : 102
    }
  ]
}
So I think one option is to configure your word_delimiter with the option split_on_case_change set to false (see the parameters doc).
P.S.: remember to remove the settings you previously added (cf. comments), since with those settings your query_string query will only target the name field, which does not exist.
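A minimal sketch of that change (the custom filter name here is made up); you would then reference it in place of word_delimiter in the filename_index filter chain:
"filter": {
  "my_word_delimiter": {
    "type": "word_delimiter",
    "split_on_case_change": false
  }
}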

Unexpected result from Synonym Filter if Char Filter is also used

I have a field which uses both a char filter (it converts Chinese words to some separator text) and a synonym filter.
PUT test
{
  "mappings": {
    "properties": {
      "description": {
        "analyzer": "standard",
        "search_analyzer": "foobar",
        "type": "text"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "foobar": {
          "char_filter": [
            "chinese_to_sep"
          ],
          "filter": [
            "custom_synonyms"
          ],
          "tokenizer": "standard"
        }
      },
      "char_filter": {
        "chinese_to_sep": {
          "pattern": "([\\p{IsHan}]+)",
          "replacement": " sepsep ",
          "type": "pattern_replace"
        }
      },
      "filter": {
        "custom_synonyms": {
          "type": "synonym",
          "updateable": "true",
          "synonyms": [
            "苹果 => apple",
            "香蕉 => banana",
            "柠檬 => lemon",
            "xyz => universe"
          ]
        }
      }
    }
  }
}
(You may ask why my synonym filter still has Chinese words if my char filter will remove all Chinese characters? It's because that synonym filter is used by other fields as well in the real setup, and not all fields will remove Chinese words.)
Anyway, when I run this query, I expect to get just two tokens ("Sun" and "sepsep"). But to my surprise, I got ("Sun", "apple", "banana", "lemon")!
GET test/_analyze
{
  "analyzer": "foobar",
  "text": "Sun 太陽",
  "explain": false
}
I got this instead:
{
  "tokens" : [
    {
      "token" : "Sun",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "apple",
      "start_offset" : 5,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 1
    },
    {
      "token" : "banana",
      "start_offset" : 5,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 1
    },
    {
      "token" : "lemon",
      "start_offset" : 5,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 1
    }
  ]
}
It seems like the synonym filter is somehow "pre-processed" by the char-filter and thus the first 3 entries will map "sepsep" to a synonym. Is this expected? Or is this a bug?

ElasticSearch catenate_words -- only keep concatenated value

Following examples here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-graph-tokenfilter.html
Specifically the catenate_words option.
I would like to use this to concatenate words that I can then use in a phrase query before and after the concatenated word, but the word parts prevent this.
For example, their example is this:
super-duper-xl → [ superduperxl, super, duper, xl ]
Now if my actual phrase was "what a great super-duper-xl" that would turn into a sequence:
[what,a,great,superduperxl,super,duper,xl]
That matches the phrase "great superduperxl" which is fine.
However, if the phrase was "the super-duper-xl emerged" the sequence would be:
[the,superduperxl,super,duper,xl,emerged]
This does not phrase match "superduperxl emerged", however it would if the part tokens (super,duper,xl) were not emitted.
Is there any way I can concatenate words keeping only the concatenated word and filtering out the word parts?
A pattern_replace character filter can be used here.
"-" is replaced with "" so that the hyphenated word is indexed as a single token.
Query
PUT my-index1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "remove_hyphen_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "remove_hyphen_filter"
          ]
        }
      },
      "char_filter": {
        "remove_hyphen_filter": {
          "type": "pattern_replace",
          "pattern": "-",
          "replacement": ""
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "remove_hyphen_analyzer"
      }
    }
  }
}
POST my-index1/_analyze
{
  "analyzer": "remove_hyphen_analyzer",
  "text": "the super-duper-xl emerged"
}
Result
{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "superduperxl",
      "start_offset" : 4,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "emerged",
      "start_offset" : 19,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
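With those tokens at consecutive positions, the phrase from the question should now match. A quick check could look like this (a sketch, indexing one sample document first):
POST my-index1/_doc/1
{
  "title": "the super-duper-xl emerged"
}
GET my-index1/_search
{
  "query": {
    "match_phrase": {
      "title": "superduperxl emerged"
    }
  }
}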

CamelCase Search with Elasticsearch

I want to configure Elasticsearch, so that searching for "JaFNam" will create a good score for "JavaFileName".
I tried to build an analyzer that combines a CamelCase pattern analyzer with an edge_ngram tokenizer. I thought this would create terms like these:
J F N Ja Fi Na Jav Fil Nam Java File Name
But the tokenizer seems not to have any effect: I keep getting these terms:
Java File Name
What would the correct Elasticsearch configuration look like?
Example code:
curl -XPUT 'http://127.0.0.1:9010/hello?pretty=1' -d'
{
  "settings":{
    "analysis":{
      "analyzer":{
        "camel":{
          "type":"pattern",
          "pattern":"([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])",
          "filters": ["edge_ngram"]
        }
      }
    }
  }
}
'
curl -XGET 'http://127.0.0.1:9010/hello/_analyze?pretty=1' -d'
{
  "analyzer":"camel",
  "text":"JavaFileName"
}'
results in:
{
  "tokens" : [ {
    "token" : "java",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "file",
    "start_offset" : 4,
    "end_offset" : 8,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "name",
    "start_offset" : 8,
    "end_offset" : 12,
    "type" : "word",
    "position" : 2
  } ]
}
Your analyzer definition is not correct: you need a tokenizer and an array of filters; as it is, your analyzer doesn't work. Try it like this instead:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "tokenizer": "my_pattern",
          "filter": [
            "my_gram"
          ]
        }
      },
      "filter": {
        "my_gram": {
          "type": "edge_ngram",
          "max_gram": 10
        }
      },
      "tokenizer": {
        "my_pattern": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}
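After recreating the hello index with these settings, the _analyze call from the question can be used to verify the output. With the edge_ngram filter's default min_gram of 1, it should now emit prefixes such as J, Ja, Jav, Java, F, Fi, ... rather than only the three full words (a sketch, reusing the host and index from the question):
curl -XGET 'http://127.0.0.1:9010/hello/_analyze?pretty=1' -d'
{
  "analyzer":"camel",
  "text":"JavaFileName"
}'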
