Elasticsearch custom analyzer not working

I am using Elasticsearch as my search engine, and I am now trying to create a custom analyzer that simply lowercases the field value. The following is my code:
Create index and mapping
Create an index with a custom analyzer named test_lowercase:
curl -XPUT 'localhost:9200/test/' -d '{
"settings": {
"analysis": {
"analyzer": {
"test_lowercase": {
"type": "pattern",
"pattern": "^.*$"
}
}
}
}
}'
Create a mapping that uses the test_lowercase analyzer for the address field:
curl -XPUT 'localhost:9200/test/_mapping/Users' -d '{
"Users": {
"properties": {
"name": {
"type": "string"
},
"address": {
"type": "string",
"analyzer": "test_lowercase"
}
}
}
}'
To verify that the test_lowercase analyzer works:
curl -XGET 'localhost:9200/test/_analyze?analyzer=test_lowercase&pretty' -d '
Beijing China
'
{
"tokens" : [ {
"token" : "\nbeijing china\n",
"start_offset" : 0,
"end_offset" : 15,
"type" : "word",
"position" : 0
} ]
}
As we can see, the string 'Beijing China' is analyzed into a single lowercased term 'beijing china', so the test_lowercase analyzer works fine.
To verify that the field 'address' is using the lowercase analyzer:
curl -XGET 'http://localhost:9200/test/_analyze?field=address&pretty' -d '
Beijing China
'
{
"tokens" : [ {
"token" : "\nbeijing china\n",
"start_offset" : 0,
"end_offset" : 15,
"type" : "word",
"position" : 0
} ]
}
curl -XGET 'http://localhost:9200/test/_analyze?field=name&pretty' -d '
Beijing China
'
{
"tokens" : [ {
"token" : "beijing",
"start_offset" : 1,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 0
}, {
"token" : "china",
"start_offset" : 9,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}
As we can see, for the same string 'Beijing China', analyzing with field=address produces a single token 'beijing china', while field=name produces the two tokens 'beijing' and 'china', so the address field does seem to be using my custom analyzer 'test_lowercase'.
Insert a document into the test index to see if the analyzer works for documents
curl -XPUT 'localhost:9200/test/Users/12345?pretty' -d '{"name": "Jinshui Tang", "address": "Beijing China"}'
Unfortunately, the document is inserted successfully, but the address field is not analyzed as expected. I can't find it with the following wildcard query:
curl -XGET 'http://localhost:9200/test/Users/_search?pretty' -d '
{
"query": {
"wildcard": {
"address": "*beijing ch*"
}
}
}'
{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
List all terms analyzed for the document:
So I ran the following command to see all terms of the document, and I found that 'Beijing China' is not in the term vectors at all.
curl -XGET 'http://localhost:9200/test/Users/12345/_termvector?fields=*&pretty'
{
"_index" : "test",
"_type" : "Users",
"_id" : "12345",
"_version" : 3,
"found" : true,
"took" : 2,
"term_vectors" : {
"name" : {
"field_statistics" : {
"sum_doc_freq" : 2,
"doc_count" : 1,
"sum_ttf" : 2
},
"terms" : {
"jinshui" : {
"term_freq" : 1,
"tokens" : [ {
"position" : 0,
"start_offset" : 0,
"end_offset" : 7
} ]
},
"tang" : {
"term_freq" : 1,
"tokens" : [ {
"position" : 1,
"start_offset" : 8,
"end_offset" : 12
} ]
}
}
}
}
}
We can see that the name was analyzed correctly into the two terms 'jinshui' and 'tang', but the address terms are missing entirely.
Can anyone please help? Is there anything missing?
Thanks a lot!

To lowercase the text you don't need a pattern analyzer. The pattern of a pattern analyzer defines the token separators, so ^.*$ matches the entire field value and leaves no tokens behind when the document is indexed, which is why the address terms are missing from the term vectors (your _analyze test only looked right because the extra newlines in the curl body kept the pattern from matching). Use something like this instead:
PUT /test
{
"settings": {
"analysis": {
"analyzer": {
"test_lowercase": {
"type": "custom",
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
}
}
}
}
PUT /test/_mapping/Users
{
"Users": {
"properties": {
"name": {
"type": "string"
},
"address": {
"type": "string",
"analyzer": "test_lowercase"
}
}
}
}
PUT /test/Users/12345
{"name": "Jinshui Tang", "address": "Beijing China"}
And to verify you did the right thing, use this:
GET /test/Users/_search
{
"fielddata_fields": ["name", "address"]
}
And you will see exactly how Elasticsearch is indexing your data:
"fields": {
"name": [
"jinshui",
"tang"
],
"address": [
"beijing",
"china"
]
}
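As a final check (a sketch, assuming the index, mapping, and document above were created exactly as shown), the analyzer should now emit a single lowercased token and the original wildcard query should match:
curl -XGET 'localhost:9200/test/_analyze?analyzer=test_lowercase&pretty' -d 'Beijing China'
returns the single token "beijing china", and
curl -XGET 'http://localhost:9200/test/Users/_search?pretty' -d '
{
  "query": {
    "wildcard": {
      "address": "*beijing ch*"
    }
  }
}'
now finds document 12345, because the whole address is indexed as one lowercased term.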

Related

elasticsearch match_phrase query for exact sub-string search

I used a match_phrase query for full-text matching, but it did not work as I expected.
Query:
POST /_search
{
"query": {
"bool": {
"should": [
{
"match_phrase": {
"browsing_url": "/critical-illness"
}
}
],
"minimum_should_match": 1
}
}
}
Results:
"hits" : [
{
"_source" : {
"browsing_url" : "https://www.google.com/url?q=https://industrytoday.co.uk/market-research-industry-today/global-critical-illness-commercial-insurance-market-to-witness-a-pronounce-growth-during-2020-2025&usg=afqjcneelu0qvjfusnfjjte1wx0gorqv5q"
}
},
{
"_source" : {
"browsing_url" : "https://www.google.com/search?q=critical+illness"
}
},
{
"_source" : {
"browsing_url" : "https://www.google.com/search?q=critical+illness&tbm=nws"
}
},
{
"_source" : {
"browsing_url" : "https://www.google.com/search?q=do+i+have+a+critical+illness+-insurance%3f"
}
},
{
"_source" : {
"browsing_url" : "https://www.google.com/search?q=do+i+have+a+critical+illness%3f"
}
}
]
Expectation:
To only get results where the given string appears as an exact sub-string of the field value. For example:
https://www.example.com/critical-illness OR
https://www.example.com/critical-illness-insurance
Mapping:
"browsing_url": {
"type": "text",
"norms": false,
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
The results are not what I expected; I expected to get only documents where /critical-illness appears as a sub-string of the stored text.
The reason you're seeing unexpected results is because both your search query, and the field itself, are being run through an analyzer. Analyzers will break down text into a list of individual terms that can be searched on. Here's an example using the _analyze endpoint:
GET _analyze
{
"analyzer": "standard",
"text": "example.com/critical-illness"
}
{
"tokens" : [
{
"token" : "example.com",
"start_offset" : 0,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "critical",
"start_offset" : 12,
"end_offset" : 20,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "illness",
"start_offset" : 21,
"end_offset" : 28,
"type" : "<ALPHANUM>",
"position" : 2
}
]
}
So while your document's true value is example.com/critical-illness, behind the scenes Elasticsearch will only use this list of tokens for matches. The same thing goes for your search query, since you're using match_phrase, which tokenizes the phrase passed in. The end result is Elasticsearch trying to match the token list ["critical", "illness"] against your documents' token lists.
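For example, analyzing the search phrase itself with the standard analyzer (a quick sketch) shows the same tokenization, with the leading slash and the hyphen stripped out:
GET _analyze
{
  "analyzer": "standard",
  "text": "/critical-illness"
}
This returns only the tokens "critical" and "illness".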
Most of the time the standard analyzer does a good job of removing unnecessary tokens; however, in your case you care about characters like /, since you want to match against them. One way to solve this is to use a different analyzer, such as a reversed path hierarchy analyzer. Below is an example of how to configure this analyzer and use it for your browsing_url field:
PUT /browse_history
{
"settings": {
"analysis": {
"analyzer": {
"url_analyzer": {
"tokenizer": "url_tokenizer"
}
},
"tokenizer": {
"url_tokenizer": {
"type": "path_hierarchy",
"delimiter": "/",
"reverse": true
}
}
}
},
"mappings": {
"properties": {
"browsing_url": {
"type": "text",
"norms": false,
"analyzer": "url_analyzer",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
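With the index created, assume a few documents like the ones from the question are indexed; the URL and document ID below are illustrative and match the hit shown in the search response further down:
PUT /browse_history/_doc/3
{
  "browsing_url": "https://www.example.com/critical-illness"
}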
If you analyze a URL now, you'll see the URL paths kept whole:
GET browse_history/_analyze
{
"analyzer": "url_analyzer",
"text": "example.com/critical-illness?src=blah"
}
{
"tokens" : [
{
"token" : "example.com/critical-illness?src=blah",
"start_offset" : 0,
"end_offset" : 37,
"type" : "word",
"position" : 0
},
{
"token" : "critical-illness?src=blah",
"start_offset" : 12,
"end_offset" : 37,
"type" : "word",
"position" : 0
}
]
}
This lets you do a match_phrase_prefix to find all documents with URLs that contain a critical-illness path:
POST /browse_history/_search
{
"query": {
"match_phrase_prefix": {
"browsing_url": "critical-illness"
}
}
}
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.7896894,
"hits" : [
{
"_index" : "browse_history",
"_type" : "_doc",
"_id" : "3",
"_score" : 1.7896894,
"_source" : {
"browsing_url" : "https://www.example.com/critical-illness"
}
}
]
}
}
EDIT:
The previous revision of this answer used the keyword field and a regexp query; however, this is a pretty costly query to run.
POST /browse_history/_search
{
"query": {
"regexp": {
"browsing_url.keyword": ".*/critical-illness"
}
}
}

A strange result of english analyzer which is not the same as elasticsearch definitive guide said

The example in this link says using GET /my_index/_analyze to analyze the word Foxes will return the term fox.
But in my case, I found the result is still foxes.
curl -XGET http://localhost:9200/my_index/_mapping/my_type?pretty
Mapping:
{
"my_index" : {
"mappings" : {
"my_type" : {
"properties" : {
"english_title" : {
"type" : "text",
"analyzer" : "english"
},
"title" : {
"type" : "text"
}
}
}
}
}
}
Test:
curl -XGET http://localhost:9200/my_index/_analyze?pretty -d '
{
"field": "my_type.english_title",
"text": "Foxes"
}
'
Response:
{
"tokens" : [
{
"token" : "foxes",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
}
]
}
When I used GET /_analyze, the result was the term fox.
curl -XGET http://localhost:9200/_analyze?pretty -d '
{
"analyzer": "english",
"text": "Foxes"
}'
Response:
{
"tokens" : [
{
"token" : "fox",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
}
]
}
Is it a mistake in the tutorial? The GET /my_index/_analyze method doesn't return the expected result.
It all makes perfect sense.
curl -XGET http://localhost:9200/my_index/_analyze?pretty -d '
{
"field": "my_type.english_title",
"text": "Foxes"
}'
You are asking to analyze the field my_type.english_title, which does not exist in the mapping, so the default standard analyzer is used instead. The my_type. prefix should be removed.
Try:
curl -XGET http://localhost:9200/my_index/_analyze?pretty -d '
{
"field": "english_title",
"text": "Foxes"
}'
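With the corrected field name, the english analyzer configured for english_title is picked up, and the response should match the tutorial (expected output, shown for illustration):
{
  "tokens" : [
    {
      "token" : "fox",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}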

CamelCase Search with Elasticsearch

I want to configure Elasticsearch, so that searching for "JaFNam" will create a good score for "JavaFileName".
I tried to build an analyzer that combines a CamelCase pattern analyzer with an edge_ngram tokenizer. I thought this would create terms like these:
J F N Ja Fi Na Jav Fil Nam Java File Name
But the tokenizer seems not to have any effect: I keep getting these terms:
Java File Name
What would the correct Elasticsearch configuration look like?
Example code:
curl -XPUT 'http://127.0.0.1:9010/hello?pretty=1' -d'
{
"settings":{
"analysis":{
"analyzer":{
"camel":{
"type":"pattern",
"pattern":"([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])",
"filters": ["edge_ngram"]
}
}
}
}
}
'
curl -XGET 'http://127.0.0.1:9010/hello/_analyze?pretty=1' -d'
{
"analyzer":"camel",
"text":"JavaFileName"
}'
results in:
{
"tokens" : [ {
"token" : "java",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
}, {
"token" : "file",
"start_offset" : 4,
"end_offset" : 8,
"type" : "word",
"position" : 1
}, {
"token" : "name",
"start_offset" : 8,
"end_offset" : 12,
"type" : "word",
"position" : 2
} ]
}
Your analyzer definition is not correct. You need a tokenizer and an array of filters; as it stands, your analyzer doesn't work. Try this instead:
{
"settings": {
"analysis": {
"analyzer": {
"camel": {
"tokenizer": "my_pattern",
"filter": [
"my_gram"
]
}
},
"filter": {
"my_gram": {
"type": "edge_ngram",
"max_gram": 10
}
},
"tokenizer": {
"my_pattern": {
"type": "pattern",
"pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
}
}
}
}
}
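To check the result, run the same _analyze request as before (a sketch; the listed terms are what the edge_ngram filter is expected to produce, not a verbatim response):
curl -XGET 'http://127.0.0.1:9010/hello/_analyze?pretty=1' -d'
{
  "analyzer": "camel",
  "text": "JavaFileName"
}'
The expected terms are J, Ja, Jav, Java, F, Fi, Fil, File, N, Na, Nam, Name. If you also want them lowercased, add a lowercase filter to the filter array.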

Elasticsearch snowball in French not stemming correctly

I've seen a problem getting the same stem for related words in French.
Here is an example: snowball in French
or:
curl -XDELETE http://localhost:9200/stacko36088193
curl -XPOST http://localhost:9200/stacko36088193 -d '
{
"index": {
"number_of_shards": 1,
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "snowball",
"language" : "French"
}
}
}
}
}'
curl 'localhost:9200/stacko36088193/_analyze?pretty=1&analyzer=my_analyzer' -d 'développeur développeuse'
And look at the resulting tokens:
{
"tokens" : [ {
"token" : "développeur",
"start_offset" : 0,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "développ",
"start_offset" : 12,
"end_offset" : 24,
"type" : "<ALPHANUM>",
"position" : 2
} ]
}
What can I do to get the same stem for all of these words?

Emails not being searched properly in elasticsearch

I have indexed a few documents in Elasticsearch that have email IDs as a field. But when I query for a specific email ID, the search results show all the documents without filtering.
This is the query I have used
{
"query": {
"match": {
"mail-id": "abc#gmail.com"
}
}
}
By default, your mail-id field is analyzed by the standard analyzer, which will tokenize the email abc@gmail.com into the following two tokens:
{
"tokens" : [ {
"token" : "abc",
"start_offset" : 0,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "gmail.com",
"start_offset" : 4,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 2
} ]
}
What you need instead is to create a custom analyzer using the UAX URL email tokenizer, which will tokenize email addresses as a single token.
So you need to define your index as follows:
curl -XPUT localhost:9200/people -d '{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "uax_url_email"
}
}
}
},
"mappings": {
"person": {
"properties": {
"mail-id": {
"type": "string",
"analyzer": "my_analyzer"
}
}
}
}
}'
After creating that index, you can see that the email abc@gmail.com is tokenized as a single token, and your search will work as expected.
curl -XGET 'localhost:9200/people/_analyze?analyzer=my_analyzer&pretty' -d 'abc@gmail.com'
{
"tokens" : [ {
"token" : "abc#gmail.com",
"start_offset" : 0,
"end_offset" : 13,
"type" : "<EMAIL>",
"position" : 1
} ]
}
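With that mapping in place, the match query from the question should return only the matching document (a sketch, assuming a person document with that mail-id was indexed):
curl -XGET 'localhost:9200/people/person/_search?pretty' -d '
{
  "query": {
    "match": {
      "mail-id": "abc@gmail.com"
    }
  }
}'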
This happens when you use the default mappings. Elasticsearch has the uax_url_email tokenizer, which identifies URLs and emails as a single entity/token.
You can read more about this here and here
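As a quick illustration (a sketch; the sample text is made up), you can also exercise the tokenizer directly through the _analyze API without creating an index:
curl -XGET 'localhost:9200/_analyze?tokenizer=uax_url_email&pretty' -d 'contact abc@gmail.com or visit https://example.com'
This should return abc@gmail.com as a single <EMAIL> token and https://example.com as a single <URL> token, with the remaining words as ordinary terms.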
