CamelCase Search with Elasticsearch

I want to configure Elasticsearch so that searching for "JaFNam" produces a good score for "JavaFileName".
I tried to build an analyzer that combines a CamelCase pattern analyzer with edge_ngram. I expected it to create terms like these:
J F N Ja Fi Na Jav Fil Nam Java File Name
But edge_ngram does not seem to have any effect. I keep getting these terms:
Java File Name
What would the correct Elasticsearch configuration look like?
Example code:
curl -XPUT 'http://127.0.0.1:9010/hello?pretty=1' -d'
{
"settings":{
"analysis":{
"analyzer":{
"camel":{
"type":"pattern",
"pattern":"([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])",
"filters": ["edge_ngram"]
}
}
}
}
}
'
curl -XGET 'http://127.0.0.1:9010/hello/_analyze?pretty=1' -d'
{
"analyzer":"camel",
"text":"JavaFileName"
}'
results in:
{
"tokens" : [ {
"token" : "java",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
}, {
"token" : "file",
"start_offset" : 4,
"end_offset" : 8,
"type" : "word",
"position" : 1
}, {
"token" : "name",
"start_offset" : 8,
"end_offset" : 12,
"type" : "word",
"position" : 2
} ]
}

Your analyzer definition is not correct. The pattern analyzer type does not apply token filters, so the edge_ngram entry has no effect. You need a custom analyzer with a tokenizer and an array of filters. Try this instead:
{
"settings": {
"analysis": {
"analyzer": {
"camel": {
"tokenizer": "my_pattern",
"filter": [
"my_gram"
]
}
},
"filter": {
"my_gram": {
"type": "edge_ngram",
"max_gram": 10
}
},
"tokenizer": {
"my_pattern": {
"type": "pattern",
"pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
}
}
}
}
}
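To make the effect of this configuration concrete, here is a minimal Python sketch that imitates its two stages: splitting on CamelCase boundaries, then expanding each token into edge n-grams. It is only an approximation. The function names (split_camel_case, edge_ngrams, analyze) are hypothetical, the regular expression is a simplified ASCII-only stand-in for the Java pattern above (Python's re module supports neither \p{L} nor && class intersection), and min_gram is assumed to be 1.

```python
import re

# Simplified, ASCII-only stand-in for the CamelCase pattern tokenizer above.
_CAMEL = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_camel_case(text):
    """Split text at CamelCase and letter/digit boundaries."""
    return _CAMEL.findall(text)


def edge_ngrams(token, min_gram=1, max_gram=10):
    """Return the prefixes of token, the way an edge n-gram filter would."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def analyze(text):
    """Tokenize first, then expand every token into its edge n-grams."""
    terms = []
    for token in split_camel_case(text):
        terms.extend(edge_ngrams(token))
    return terms


print(analyze("JavaFileName"))
print(split_camel_case("JaFNam"))
```

Under these assumptions, the pieces of the query JaFNam (Ja, F, Nam) all appear among the terms generated for JavaFileName. Whether the query is split the same way in practice depends on the analyzer used at search time.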

Related

ElasticSearch - Search without apostrophe

I'm trying to allow users to search without entering an apostrophe.
For example, typing Johns should still bring up results for John's.
I've tried multiple things, including adding the stemmer filter, but with no luck.
I thought I could potentially do something manual, such as:
GET /_analyze
{
"char_filter": [{
"type": "pattern_replace",
"pattern": "\\s*([a-zA-Z0-9]+)\\'s",
"replacement": "$1 $1s $1's "
}],
"tokenizer": "standard",
"text": "john's dog jumped"
}
And I get the following response:
{
"tokens" : [
{
"token" : "john",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "johns",
"start_offset" : 5,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "john's",
"start_offset" : 5,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "dog",
"start_offset" : 7,
"end_offset" : 10,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "jumped",
"start_offset" : 11,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 4
}
]
}
However, I still don't get a match when I search for "johns" without the apostrophe.
My settings look like:
"analyzer" : {
"my_custom_search" : {
"char_filter" : [ "flexible_plurals" ],
"tokenizer" : "standard"
}
},
"char_filter" : {
"flexible_plurals" : {
"pattern" : """\s*([a-zA-Z0-9]+)\'s""",
"type" : "pattern_replace",
"replacement" : " $1 $1s $1's "
}
}
My mappings look like:
"search-terms" : {
"type" : "text",
"analyzer" : "my_custom_search"
}
I am using the match query to query the data.
You are almost there. Make sure you are using the match query and that the field is mapped as text with your custom analyzer. If the text field is mapped without the custom analyzer that contains your char_filter, it falls back to the standard analyzer, which does not generate the johns token, hence no match.
Complete working example
Index settings and mapping
{
"settings": {
"index": {
"analysis": {
"char_filter": {
"apostrophe_filter": {
"type": "pattern_replace",
"pattern": "\\s*([a-zA-Z0-9]+)\\'s",
"replacement": "$1 $1s $1's "
}
},
"analyzer": {
"custom_analyzer": {
"filter": [
"lowercase"
],
"char_filter": [
"apostrophe_filter"
],
"tokenizer": "standard"
}
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "custom_analyzer"
}
}
}
}
Index sample document
{
"title" : "john's"
}
And search for johns
{
"query": {
"match": {
"title": "johns"
}
}
}
Search results
"hits": [
{
"_index": "72937076",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"title": "john's" --> note `john's`
}
}
]
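As a rough illustration of why the match succeeds, the Python sketch below replays the same pattern and replacement before tokenizing. The helper names (apostrophe_filter, tokenize, analyze, matches) are hypothetical, and the regex-based tokenizer is only a crude stand-in for the standard tokenizer plus the lowercase filter, not the actual Lucene implementation.

```python
import re

# Same pattern and replacement as the char_filter above, in Python syntax.
APOSTROPHE_S = re.compile(r"\s*([a-zA-Z0-9]+)'s")


def apostrophe_filter(text):
    """Expand "john's" into "john johns john's " before tokenization."""
    return APOSTROPHE_S.sub(r"\1 \1s \1's ", text)


def tokenize(text):
    """Crude stand-in for the standard tokenizer followed by lowercase."""
    words = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", text)
    return [word.lower() for word in words]


def analyze(text):
    """Character filter first, then tokenization."""
    return tokenize(apostrophe_filter(text))


def matches(query, document):
    """A match query hits when query and document share at least one term."""
    return bool(set(analyze(query)) & set(analyze(document)))


print(analyze("john's dog jumped"))
print(matches("johns", "john's"))
```

Without the character filter, the document only contains the term john's, so a query for johns has nothing to match against in this simplified model.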

ElasticSearch catenate_words -- only keep concatenated value

Following examples here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-graph-tokenfilter.html
Specifically the catenate_words option.
I would like to use this to concatenate words that I can then use in a phrase query before and after the concatenated word, but the word parts prevent this.
For example, their example is this:
super-duper-xl → [ superduperxl, super, duper, xl ]
Now if my actual phrase was "what a great super-duper-xl" that would turn into a sequence:
[what,a,great,superduperxl,super,duper,xl]
That matches the phrase "great superduperxl" which is fine.
However, if the phrase was "the super-duper-xl emerged" the sequence would be:
[the,superduperxl,super,duper,xl,emerged]
This does not phrase match "superduperxl emerged", however it would if the part tokens (super,duper,xl) were not emitted.
Is there any way I can concatenate words keeping only the concatenated word and filtering out the word parts?
A pattern_replace character filter can be used here.
It replaces "-" with "" before tokenization, so the tokenizer only ever sees the concatenated word and the word parts are never emitted.
Query
PUT my-index1
{
"settings": {
"analysis": {
"analyzer": {
"remove_hyphen_analyzer": {
"tokenizer": "standard",
"char_filter": [
"remove_hyphen_filter"
]
}
},
"char_filter": {
"remove_hyphen_filter": {
"type": "pattern_replace",
"pattern": "-",
"replacement": ""
}
}
}
},
"mappings": {
"properties": {
"title":{
"type": "text",
"analyzer": "remove_hyphen_analyzer"
}
}
}
}
POST my-index1/_analyze
{
"analyzer": "remove_hyphen_analyzer",
"text": "the super-duper-xl emerged"
}
Result
{
"tokens" : [
{
"token" : "the",
"start_offset" : 0,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "superduperxl",
"start_offset" : 4,
"end_offset" : 18,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "emerged",
"start_offset" : 19,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 2
}
]
}
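The difference for phrase matching can be sketched in a few lines of Python. This is a simplified model, not how Lucene evaluates phrase queries: it treats the token stream as the flat sequence described in the question and checks whether the phrase tokens appear next to each other. The function names are hypothetical, and a whitespace split stands in for the standard tokenizer.

```python
def contains_phrase(tokens, phrase):
    """True if the phrase tokens occur consecutively inside tokens."""
    size = len(phrase)
    return any(tokens[i:i + size] == phrase
               for i in range(len(tokens) - size + 1))


def remove_hyphen_analyze(text):
    """Drop hyphens (the char filter), then split on whitespace."""
    return text.replace("-", "").split()


# Token sequence from the question, with the word parts still emitted.
with_parts = ["the", "superduperxl", "super", "duper", "xl", "emerged"]
# Token sequence produced when hyphens are removed before tokenization.
without_parts = remove_hyphen_analyze("the super-duper-xl emerged")

phrase = ["superduperxl", "emerged"]
print(contains_phrase(with_parts, phrase))
print(contains_phrase(without_parts, phrase))
```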

Emails not being searched properly in elasticsearch

I have indexed a few documents in Elasticsearch that have an email ID as a field. But when I query for a specific email ID, the search results show all the documents without filtering.
This is the query I have used:
{
"query": {
"match": {
"mail-id": "abc@gmail.com"
}
}
}
By default, your mail-id field is analyzed by the standard analyzer, which tokenizes the email abc@gmail.com into the following two tokens:
{
"tokens" : [ {
"token" : "abc",
"start_offset" : 0,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "gmail.com",
"start_offset" : 4,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 2
} ]
}
What you need instead is a custom analyzer that uses the UAX URL email tokenizer (uax_url_email), which tokenizes an email address as a single token.
So you need to define your index as follows:
curl -XPUT localhost:9200/people -d '{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "uax_url_email"
}
}
}
},
"mappings": {
"person": {
"properties": {
"mail-id": {
"type": "string",
"analyzer": "my_analyzer"
}
}
}
}
}'
After creating that index, you can see that the email abc@gmail.com is tokenized as a single token, and your search will work as expected.
curl -XGET 'localhost:9200/people/_analyze?analyzer=my_analyzer&pretty' -d 'abc@gmail.com'
{
"tokens" : [ {
"token" : "abc@gmail.com",
"start_offset" : 0,
"end_offset" : 13,
"type" : "<EMAIL>",
"position" : 1
} ]
}
This happens when you use the default mappings. Elasticsearch has a uax_url_email tokenizer, which identifies URLs and emails as a single entity/token.
You can read more about this in the Elasticsearch documentation on the uax_url_email tokenizer.
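To see why the original query returned unrelated documents, here is a small Python sketch that contrasts the two tokenization strategies. Both tokenizers are crude regex approximations written for this illustration (standard_like and email_aware are hypothetical names, not Elasticsearch APIs), and the second address is an invented sample value.

```python
import re

# Crude approximation: runs of letters/digits, optionally joined by dots.
_WORD = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*")
# Crude approximation of an email address.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def standard_like(text):
    """Split the way the standard tokenizer roughly would: '@' separates."""
    return _WORD.findall(text)


def email_aware(text):
    """Keep each email address as one token, split the rest as usual."""
    tokens, pos = [], 0
    for found in _EMAIL.finditer(text):
        tokens.extend(standard_like(text[pos:found.start()]))
        tokens.append(found.group())
        pos = found.end()
    tokens.extend(standard_like(text[pos:]))
    return tokens


def matches(query, document, tokenizer):
    """A match query hits when query and document share at least one token."""
    return bool(set(tokenizer(query)) & set(tokenizer(document)))


query, other_doc = "abc@gmail.com", "xyz@gmail.com"
print(standard_like(query))
print(matches(query, other_doc, standard_like))
print(matches(query, other_doc, email_aware))
```

In this model, the two addresses share the token gmail.com under the standard-like tokenizer, so a query for one address also hits the other. With the email-aware tokenizer, each address is a single token and only an exact address matches.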

Searching for hyphened text in Elasticsearch

I am storing a 'Payment Reference Number' in Elasticsearch.
Its layout is, for example, 2-4-3-635844569819109531 or 2-4-2-635844533758635433.
I want to be able to search for documents by their payment reference number in either of two ways:
Searching using the whole reference number, e.g. 2-4-2-635844533758635433
Searching using any part of the reference number from the start, e.g. 2-4-2-63 (so only the second one in the example is returned)
Note: I do not want to search in the middle or at the end, only from the beginning.
Anyway, the hyphens are confusing me.
Questions
1) I am not sure if I should remove them in the mapping like
"char_filter" : {
"removeHyphen" : {
"type" : "mapping",
"mappings" : ["-=>"]
}
},
or not. I have never used mappings in that way, so I am not sure if this is necessary.
2) I think I need an 'ngrams' filter because I want to be able to search for a part of the reference number from the beginning. I think something like
"partial_word":{
"filter":[
"standard",
"lowercase",
"name_ngrams"
],
"type":"custom",
"tokenizer":"whitespace"
},
and the filter
"name_ngrams":{
"side":"front",
"max_gram":50,
"min_gram":2,
"type":"edgeNGram"
},
I am not sure how to put it all together, but this is what I have so far:
"paymentReference":{
"type":"string",
"analyzer": "??",
"fields":{
"partial":{
"search_analyzer":"???",
"index_analyzer":"partial_word",
"type":"string"
}
}
}
Everything that I have tried seems to always break in the second search case.
If I run 'localhost:9200/orders/_analyze?field=paymentReference&pretty=1' -d "2-4-2-635844533758635433", it always splits off the hyphen as its own token and returns, for example, all documents with 2-, which is a lot and not what I want when searching for 2-4-2-6.
Can someone tell me how to map this field for the two types of searches I am trying to achieve?
Update - Answer
Effectively what Val said below. I just changed the mapping slightly to be more specific about the analyzers, and I also don't need the main string indexed because I only query the partial field.
Mapping
"paymentReference":{
"type": "string",
"index":"not_analyzed",
"fields": {
"partial": {
"search_analyzer":"payment_ref",
"index_analyzer":"payment_ref",
"type":"string"
}
}
}
Analyzer
"payment_ref": {
"type": "custom",
"filter": [
"lowercase",
"name_ngrams"
],
"tokenizer": "keyword"
}
Filter
"name_ngrams":{
"side":"front",
"max_gram":50,
"min_gram":2,
"type":"edgeNGram"
},
You don't need to use the mapping char filter for this.
You're on the right track using the Edge NGram token filter since you need to be able to search for prefixes only. I would use a keyword tokenizer instead to make sure the term is taken as a whole. So the way to set this up is like this:
curl -XPUT localhost:9200/orders -d '{
"settings": {
"analysis": {
"analyzer": {
"partial_word": {
"type": "custom",
"filter": [
"lowercase",
"ngram_filter"
],
"tokenizer": "keyword"
}
},
"filter": {
"ngram_filter": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 50
}
}
}
},
"mappings": {
"order": {
"properties": {
"paymentReference": {
"type": "string",
"fields": {
"partial": {
"analyzer": "partial_word",
"type": "string"
}
}
}
}
}
}
}'
Then you can analyze what is going to be indexed into your paymentReference.partial field:
curl -XGET 'localhost:9200/orders/_analyze?field=paymentReference.partial&pretty=1' -d "2-4-2-635844533758635433"
And you get exactly what you want, i.e. all the prefixes:
{
"tokens" : [ {
"token" : "2-",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4-",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4-2",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4-2-",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4-2-6",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4-2-63",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4-2-635",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4-2-6358",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4-2-63584",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
...
Finally you can search for any prefix:
curl -XGET localhost:9200/orders/order/_search?q=paymentReference.partial:2-4-3
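The prefix behaviour of the keyword tokenizer combined with the edge n-gram filter can be sketched in Python as follows. The function names (edge_ngrams, index_terms, prefix_search) are hypothetical, and the sketch assumes the query is compared against the index as one whole term. If the same edge n-gram analyzer also runs at search time, the query itself is expanded into prefixes such as 2-, which can match more documents than intended, so a search analyzer without the n-gram filter may be worth considering.

```python
def edge_ngrams(term, min_gram=2, max_gram=50):
    """Return the prefixes of term between min_gram and max_gram characters."""
    longest = min(len(term), max_gram)
    return [term[:n] for n in range(min_gram, longest + 1)]


def index_terms(reference):
    """Keyword tokenizer (whole value as one token), lowercase, edge n-grams."""
    return set(edge_ngrams(reference.lower()))


# The two sample reference numbers from the question.
DOCS = ["2-4-3-635844569819109531", "2-4-2-635844533758635433"]


def prefix_search(query):
    """Return the documents whose indexed prefixes contain the whole query."""
    return [doc for doc in DOCS if query.lower() in index_terms(doc)]


print(prefix_search("2-4-2-63"))
print(prefix_search("2-4-2-635844533758635433"))
print(prefix_search("4-2"))
```

Only prefixes are indexed, so a fragment from the middle of a reference number, such as 4-2, finds nothing in this model.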
I'm not sure whether a wildcard search matches your needs. I defined a custom word_delimiter filter with preserve_original set to true and generate_number_parts set to false. Here is the sample code:
PUT test1
{
"settings" : {
"analysis" : {
"analyzer" : {
"myAnalyzer" : {
"type" : "custom",
"tokenizer" : "whitespace",
"filter" : [ "dont_split_on_numerics" ]
}
},
"filter" : {
"dont_split_on_numerics" : {
"type" : "word_delimiter",
"preserve_original": true,
"generate_number_parts" : false
}
}
}
},
"mappings": {
"type_one": {
"properties": {
"title": {
"type": "text",
"analyzer": "standard"
}
}
},
"type_two": {
"properties": {
"raw": {
"type": "text",
"analyzer": "myAnalyzer"
}
}
}
}
}
POST test1/type_two/1
{
"raw": "2-345-6789"
}
GET test1/type_two/_search
{
"query": {
"wildcard": {
"raw": "2-345-67*"
}
}
}
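A rough Python model of this wildcard approach is shown below. It assumes the field ends up with the unchanged original value 2-345-6789 as an indexed term, as preserve_original suggests, and it uses fnmatchcase from the standard library as a stand-in for the wildcard query. The name wildcard_search is hypothetical.

```python
from fnmatch import fnmatchcase

# Assumed indexed term for the sample document above.
INDEXED_TERMS = ["2-345-6789"]


def wildcard_search(pattern, terms=INDEXED_TERMS):
    """Return the terms that match the wildcard pattern from start to end."""
    return [term for term in terms if fnmatchcase(term, pattern)]


print(wildcard_search("2-345-67*"))
print(wildcard_search("345*"))
```

The pattern has to match the whole term, so a pattern without a leading wildcard only matches from the beginning of the value.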

Elasticsearch custom analyzer not working

I am using Elasticsearch as my search engine, and I am trying to create a custom analyzer that just lowercases the field value. The following is my code:
Create index and mapping
Create an index with a custom analyzer named test_lowercase:
curl -XPUT 'localhost:9200/test/' -d '{
"settings": {
"analysis": {
"analyzer": {
"test_lowercase": {
"type": "pattern",
"pattern": "^.*$"
}
}
}
}
}'
Create a mapping using the test_lowercase analyzer for the address field:
curl -XPUT 'localhost:9200/test/_mapping/Users' -d '{
"Users": {
"properties": {
"name": {
"type": "string"
},
"address": {
"type": "string",
"analyzer": "test_lowercase"
}
}
}
}'
To verify that the test_lowercase analyzer works:
curl -XGET 'localhost:9200/test/_analyze?analyzer=test_lowercase&pretty' -d '
Beijing China
'
{
"tokens" : [ {
"token" : "\nbeijing china\n",
"start_offset" : 0,
"end_offset" : 15,
"type" : "word",
"position" : 0
} ]
}
As we can see, the string 'Beijing China' is indexed as a single lowercased term 'beijing china', so the test_lowercase analyzer seems to work fine.
To verify that the field 'address' is using the lowercase analyzer:
curl -XGET 'http://localhost:9200/test/_analyze?field=address&pretty' -d '
Beijing China
'
{
"tokens" : [ {
"token" : "\nbeijing china\n",
"start_offset" : 0,
"end_offset" : 15,
"type" : "word",
"position" : 0
} ]
}
curl -XGET 'http://localhost:9200/test/_analyze?field=name&pretty' -d '
Beijing China
'
{
"tokens" : [ {
"token" : "beijing",
"start_offset" : 1,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 0
}, {
"token" : "china",
"start_offset" : 9,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}
As we can see, for the same string 'Beijing China', analyzing with field=address creates a single item 'beijing china', while field=name gives two items, 'beijing' and 'china', so it seems the address field is using my custom analyzer 'test_lowercase'.
Insert a document into the test index to see if the analyzer works for documents:
curl -XPUT 'localhost:9200/test/Users/12345?pretty' -d '{"name": "Jinshui Tang", "address": "Beijing China"}'
Unfortunately, the document was inserted successfully, but the address field has not been analyzed correctly. I can't find it using the following wildcard query:
curl -XGET 'http://localhost:9200/test/Users/_search?pretty' -d '
{
"query": {
"wildcard": {
"address": "*beijing ch*"
}
}
}'
{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
List all terms analyzed for the document:
So I ran the following command to see all terms of the document, and I found that 'Beijing China' is not in the term vector at all.
curl -XGET 'http://localhost:9200/test/Users/12345/_termvector?fields=*&pretty'
{
"_index" : "test",
"_type" : "Users",
"_id" : "12345",
"_version" : 3,
"found" : true,
"took" : 2,
"term_vectors" : {
"name" : {
"field_statistics" : {
"sum_doc_freq" : 2,
"doc_count" : 1,
"sum_ttf" : 2
},
"terms" : {
"jinshui" : {
"term_freq" : 1,
"tokens" : [ {
"position" : 0,
"start_offset" : 0,
"end_offset" : 7
} ]
},
"tang" : {
"term_freq" : 1,
"tokens" : [ {
"position" : 1,
"start_offset" : 8,
"end_offset" : 12
} ]
}
}
}
}
}
We can see that the name is correctly analyzed and it became two terms 'jinshui' and 'tang', but the address is lost.
Can anyone please help? Is there anything missing?
Thanks a lot!
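To lowercase the text, you don't need a pattern. The pattern analyzer treats its pattern as a token separator, so ^.*$ matches the whole value 'Beijing China' and leaves no tokens to index. The _analyze tests above most likely produced a token only because the leading and trailing newlines in the request body kept the pattern from matching. Use something like this: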
PUT /test
{
"settings": {
"analysis": {
"analyzer": {
"test_lowercase": {
"type": "custom",
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
}
}
}
}
PUT /test/_mapping/Users
{
"Users": {
"properties": {
"name": {
"type": "string"
},
"address": {
"type": "string",
"analyzer": "test_lowercase"
}
}
}
}
PUT /test/Users/12345
{"name": "Jinshui Tang", "address": "Beijing China"}
And to verify you did the right thing, use this:
GET /test/Users/_search
{
"fielddata_fields": ["name", "address"]
}
And you will see exactly how Elasticsearch is indexing your data:
"fields": {
"name": [
"jinshui",
"tang"
],
"address": [
"beijing china"
]
}
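The behaviour of the two analyzers can be approximated in Python as below. This is a sketch under assumptions: Python's re module stands in for Java regular expressions (the two agree for this particular pattern and these inputs), fnmatchcase stands in for the wildcard query, and the function names are hypothetical.

```python
import re
from fnmatch import fnmatchcase


def pattern_analyze(text, pattern):
    """Pattern-analyzer model: split on matches, drop empties, lowercase."""
    return [piece.lower() for piece in re.split(pattern, text) if piece]


def keyword_lowercase_analyze(text):
    """Keyword tokenizer plus lowercase filter: one lowercased term."""
    return [text.lower()]


# The document value: the separator pattern swallows everything.
print(pattern_analyze("Beijing China", r"^.*$"))
# The _analyze request body: the newlines stop the pattern from matching.
print(pattern_analyze("\nBeijing China\n", r"^.*$"))

terms = keyword_lowercase_analyze("Beijing China")
print(terms)
print(any(fnmatchcase(term, "*beijing ch*") for term in terms))
```

In this model, the pattern analyzer produces no terms at all for the stored document, which is consistent with the address field missing from the term vector, while the keyword tokenizer with the lowercase filter yields one term that the wildcard query can match.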

Resources