How to create and add a custom analyzer in Elasticsearch? - elasticsearch

I have a batch of "smartphones" products in my ES and I need to query them using the text "smart phone". So I'm looking into the compound word token filter. Specifically, I'm planning to use a custom filter like this:
curl -XPUT 'localhost:9200/_all/_settings' -d '
{
  "analysis" : {
    "analyzer" : {
      "second" : {
        "type" : "custom",
        "tokenizer" : "standard",
        "filter" : ["myFilter"]
      }
    },
    "filter" : {
      "myFilter" : {
        "type" : "dictionary_decompounder",
        "word_list" : ["smart", "phone"]
      }
    }
  }
}
'
Is this the correct approach? Also, I'd like to ask how I can create and add the custom analyzer to ES. I looked into several links but couldn't figure out how to do it. I guess I'm looking for the correct syntax.
Thank you
EDIT
I'm running version 1.4.5, and I verified that the custom analyzer was added successfully:
{
  "test_index" : {
    "settings" : {
      "index" : {
        "creation_date" : "1453761455612",
        "analysis" : {
          "filter" : {
            "myFilter" : {
              "type" : "dictionary_decompounder",
              "word_list" : [ "smart", "phone" ]
            }
          },
          "analyzer" : {
            "second" : {
              "type" : "custom",
              "filter" : [ "lowercase", "myFilter" ],
              "tokenizer" : "standard"
            }
          }
        },
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "version" : {
          "created" : "1040599"
        },
        "uuid" : "xooKEdMBR260dnWYGN_ZQA"
      }
    }
  }
}

Your approach looks good. I would also consider adding the lowercase token filter, so that even "Smartphone" (note the uppercase 'S') gets split into smart and phone.
You could then create the index with the analyzer like this:
curl -XPUT 'localhost:9200/your_index' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "second": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "myFilter"
          ]
        }
      },
      "filter": {
        "myFilter": {
          "type": "dictionary_decompounder",
          "word_list": [
            "smart",
            "phone"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "second"
        }
      }
    }
  }
}
'
Here you are creating an index named your_index with a custom analyzer named second, and applying that analyzer to the name field.
You can check that the analyzer works as expected with the _analyze API, like this:
curl -XGET 'localhost:9200/your_index/_analyze?analyzer=second&pretty' -d 'LG Android smartphone'
(On 1.x the analyzer is passed as a query parameter; the JSON request body for _analyze was only introduced in later versions.)
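To sanity-check expectations without a running cluster, here is a rough Python sketch of what a dictionary_decompounder token filter does. This is an illustration of the idea only, not the actual Lucene implementation:

```python
# Toy sketch of a dictionary_decompounder: for each token, keep the original
# token and additionally emit every dictionary word found inside it.

def decompound(tokens, word_list):
    out = []
    for token in tokens:
        out.append(token)  # the original token is preserved
        lower = token.lower()
        for word in word_list:
            if word in lower and word != lower:
                out.append(word)  # emit each dictionary sub-word too
    return out

# "smartphone" yields the extra terms smart and phone, so a query
# for "smart phone" can now match the document
print(decompound(["smartphone"], ["smart", "phone"]))
# -> ['smartphone', 'smart', 'phone']
```

Because the sub-words end up in the index alongside the original token, a standard match query for "smart phone" hits the "smartphones" products.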
Hope this helps!!

Related

ELK bool query with match and prefix

I'm new to ELK. I have a problem with the following search query:
curl --insecure -H "Authorization: ApiKey $ESAPIKEY" -X GET "https://localhost:9200/commsrch/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "should" : [
        {"match" : {"cn" : "franc"}},
        {"prefix" : {"srt" : "99889300200"}}
      ]
    }
  }
}
'
I need to find all documents that satisfy either condition: field "cn" contains "franc" OR field "srt" starts with "99889300200".
Index mapping:
{
  "commsrch" : {
    "mappings" : {
      "properties" : {
        "addr" : {
          "type" : "text",
          "index" : false
        },
        "cn" : {
          "type" : "text",
          "analyzer" : "compname"
        },
        "srn" : {
          "type" : "text",
          "analyzer" : "srnsrt"
        },
        "srt" : {
          "type" : "text",
          "analyzer" : "srnsrt"
        }
      }
    }
  }
}
Index settings:
{
  "commsrch" : {
    "settings" : {
      "index" : {
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "1",
        "provided_name" : "commsrch",
        "creation_date" : "1675079141160",
        "analysis" : {
          "filter" : {
            "ngram_filter" : {
              "type" : "ngram",
              "min_gram" : "3",
              "max_gram" : "4"
            }
          },
          "analyzer" : {
            "compname" : {
              "filter" : [
                "lowercase",
                "stop",
                "ngram_filter"
              ],
              "type" : "custom",
              "tokenizer" : "whitespace"
            },
            "srnsrt" : {
              "type" : "custom",
              "tokenizer" : "standard"
            }
          }
        },
        "number_of_replicas" : "1",
        "uuid" : "C15EXHnaTIq88JSYNt7GvA",
        "version" : {
          "created" : "8060099"
        }
      }
    }
  }
}
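For context, the compname analyzer above ends with an ngram filter (min_gram 3, max_gram 4). A toy Python sketch of the grams such a filter emits for one token (illustration only, not Lucene):

```python
# Toy ngram token filter: emit every substring of length min_gram..max_gram.
# With min_gram=3 and max_gram=4, a short query term explodes into many
# grams, which is why a match query on an ngram-analyzed field is so loose.

def ngrams(token, min_gram=3, max_gram=4):
    out = []
    for n in range(min_gram, max_gram + 1):
        for i in range(len(token) - n + 1):
            out.append(token[i:i + n])
    return out

print(ngrams("franc"))
# -> ['fra', 'ran', 'anc', 'fran', 'ranc']
```

A match query for "franc" is analyzed with the same filter, so any company name sharing one of those grams can match, which is worth keeping in mind when combining it with an unanalyzed prefix clause.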
The query works properly with just one condition: with only the "match" condition, the result has the proper document count, and likewise with only the "prefix" condition.
With both "match" and "prefix" conditions, I see in the results only documents that correspond to the "prefix" condition.
I can't find any limitation in the Elasticsearch docs about mixing "prefix" and "match", but as I see it, some problem exists. Please help me find where the problem is.
Continuing the experiment, I have one more problem.
Example:
Source data:
1st document cn field: "put stone is done"
2nd document cn field: "job one or two"
Mapping and index settings the same as described in my first post
Request:
{
  "query": {
    "bool": {
      "should" : [
        {"match" : {"cn" : "one"}},
        {"prefix" : {"cn" : "one"}}
      ]
    }
  }
}
As I understand it, the first document got the higher score because it has more repeats of "one". But I need the higher score to go to documents that have at least one word in field "cn" starting with the string "one". I have experimented with this query:
{
  "query": {
    "bool": {
      "should": [
        {"match": {"cn": "one"}},
        {
          "constant_score": {
            "filter": {
              "prefix": {
                "cn": "one"
              }
            },
            "boost": 100
          }
        }
      ]
    }
  }
}
But it doesn't work properly. What's wrong with my query?
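The ranking behavior described above can be sketched with a toy model of how a bool/should query combines clause scores. The numbers below are invented for illustration; real scores come from BM25 and are not these values:

```python
# Toy model: a bool/should query sums the scores of the clauses that matched.
# None means the clause did not match the document. A constant_score clause
# contributes exactly its boost, letting the prefix condition dominate the
# term-frequency-driven match score.

def bool_should(clause_scores):
    return sum(s for s in clause_scores if s is not None)

# invented scores: the match clause rewards repeats of "one", so a document
# with more repeats can outrank the one whose word actually starts with "one"
repeats_doc = bool_should([3, None])        # matches "one" repeatedly, no prefix hit
prefix_doc = bool_should([1, None])         # matches "one" once
prefix_doc_boosted = bool_should([1, 100])  # constant_score prefix clause, boost 100
print(repeats_doc, prefix_doc, prefix_doc_boosted)
# -> 3 1 101
```

With the boost applied, any document satisfying the prefix clause jumps above documents that only matched the match clause, which is the intent of the constant_score variant of the query.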

exact match in elasticSearch after incorporating hunspell filter

We have added the hunspell filter to our elastic search instance. Nothing fancy...
{
  "index" : {
    "analysis" : {
      "tokenizer" : {
        "comma" : {
          "type" : "pattern",
          "pattern" : ","
        }
      },
      "filter": {
        "en_GB": {
          "type": "hunspell",
          "language": "en_GB"
        }
      },
      "analyzer" : {
        "comma" : {
          "type" : "custom",
          "tokenizer" : "comma"
        },
        "en_GB": {
          "filter": [
            "lowercase",
            "en_GB"
          ],
          "tokenizer": "standard"
        }
      }
    }
  }
}
Now, though, we seem to have lost the built-in facility for exact-match queries using quotation marks. Searching for "lace" will, for example, also do an equal-score search for "lacy". I understand this is kind of the point of including hunspell, but I would like to be able to force exact matches by using quotes.
I am doing boolean queries for this, by the way, along the lines of (in Java):
"bool" : {
"must" : {
"query_string" : {
"query" : "\"lace\"",
"fields" :
...
or (Postman, direct to port 9200):
{
  "query" : {
    "query_string" : {
      "query" : "\"lace\"",
      "fields" :
      ....
Is this possible? I'm guessing this might be something we would do in the tokenizer, but I'm not quite sure where to start.
You will not be able to handle this at the tokenizer level, but you can tweak the configuration at the mapping level to use multi-fields: keep a copy of the same field that is not analyzed, and use it later in queries to support your use case.
You can update your mappings like the following:
"mappings": {
"desc": {
"properties": {
"labels": {
"type": "string",
"analyzer": "en_GB",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
}
}
Further, modify your query to search on the raw field instead of the analyzed field:
{
  "query": {
    "bool": {
      "must": [{
        "query_string": {
          "default_field": "labels.raw",
          "query": "lace"
        }
      }]
    }
  }
}
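The multi-field idea can be illustrated with a toy contrast between querying an analyzed field and its raw sub-field. The stem() function below is a fake stand-in for hunspell, just to show why stemmed terms collide while the raw value does not:

```python
# Toy contrast: an analyzed field compares stems, a raw keyword field
# compares the stored value verbatim. stem() here is fake, NOT hunspell.

def stem(word):
    return word[:3]  # pretend "lace" and "lacy" both reduce to "lac"

def match_analyzed(doc_value, query):
    # both sides go through the analyzer, so stems are compared
    return stem(doc_value) == stem(query)

def match_raw(doc_value, query):
    # the raw sub-field is untouched, so only exact values match
    return doc_value == query

print(match_analyzed("lacy", "lace"))  # True  -- stems collide
print(match_raw("lacy", "lace"))       # False -- exact match on labels.raw
```

Routing quoted queries to labels.raw while leaving unquoted ones on the analyzed field gives back the exact-match behavior the question asks for.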
Hope this helps
Thanks

Using Email tokenizer in elasticsearch

I did try some examples from the Elasticsearch documentation and from Google, but nothing helped me figure it out.
The sample data I have is just a few blog posts. I am trying to see all posts with a given email address. When I use "email":"someone" I see all the posts matching someone, but when I change it to someone@gmail.com nothing shows up!
"hits": [
{
"_index": "blog",
"_type": "post",
"_id": "2",
"_score": 1,
"_source": {
"user": "sreenath",
"email": "someone#gmail.com",
"postDate": "2011-12-12",
"body": "Trying to figure out this",
"title": "Elastic search testing"
}
}
]
When I use the GET query shown below, I see all posts matching someone@anything.com. But I want to change this:
{ "term" : { "email" : "someone" }} to { "term" : { "email" : "someone@gmail.com" }}
GET blog/post/_search
{
  "query" : {
    "filtered" : {
      "filter" : {
        "and" : [
          { "term" :
            { "email" : "someone" }
          }
        ]
      }
    }
  }
}
I did the curl -XPUT for the following, but it did not help:
curl -XPUT localhost:9200/test/ -d '
{
  "settings" : {
    "analysis" : {
      "filter" : {
        "email" : {
          "type" : "pattern_capture",
          "preserve_original" : 1,
          "patterns" : [
            "([^@]+)",
            "(\\p{L}+)",
            "(\\d+)",
            "@(.+)"
          ]
        }
      },
      "analyzer" : {
        "email" : {
          "tokenizer" : "uax_url_email",
          "filter" : [ "email", "lowercase", "unique" ]
        }
      }
    }
  }
}
'
You have created a custom analyzer for email addresses, but you are not using it. You need to declare the email field in your mapping type to actually use that analyzer, as shown below. Also make sure to create the right index with that analyzer, i.e. blog and not test:
curl -XPUT localhost:9200/blog/ -d '{
  "settings" : {
    "analysis" : {
      "filter" : {
        "email" : {
          "type" : "pattern_capture",
          "preserve_original" : 1,
          "patterns" : [
            "([^@]+)",
            "(\\p{L}+)",
            "(\\d+)",
            "@(.+)"
          ]
        }
      },
      "analyzer" : {
        "email" : {
          "tokenizer" : "uax_url_email",
          "filter" : [ "email", "lowercase", "unique" ]
        }
      }
    }
  },
  "mappings": {        <--- add this
    "post": {
      "properties": {
        "email": {
          "type": "string",
          "analyzer": "email"
        }
      }
    }
  }
}
'
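To see roughly what tokens that email analyzer should index, here is a Python sketch of the pattern_capture filter applied to one email token, using a conventional @-style address. Lucene's \p{L} class is approximated with [A-Za-z], and this is an illustration, not the actual Lucene implementation:

```python
import re

# Toy pattern_capture: run each pattern over the token and emit every match
# (or capture group). preserve_original keeps the full email; "lowercase"
# and "unique" are simulated at the end.

def email_tokens(token):
    patterns = [r"[^@]+", r"[A-Za-z]+", r"\d+", r"@(.+)"]
    out = [token]  # preserve_original : 1 keeps the whole address
    for pat in patterns:
        for m in re.finditer(pat, token):
            out.append(m.group(1) if m.groups() else m.group(0))
    seen, result = set(), []
    for t in (t.lower() for t in out):  # lowercase + unique filters
        if t not in seen:
            seen.add(t)
            result.append(t)
    return result

print(email_tokens("someone@gmail.com"))
# -> ['someone@gmail.com', 'someone', 'gmail.com', 'gmail', 'com']
```

Because both the full address and its parts are indexed, a term query for someone@gmail.com and a term query for someone can each match the same document.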

Elasticsearch "failed to find analyzer"

I have created a synonym analyser on an index:
curl http://localhost:9200/test_index/_settings?pretty
{
  "test_index" : {
    "settings" : {
      "index" : {
        "creation_date" : "1429175067557",
        "analyzer" : {
          "search_synonyms" : {
            "filter" : [ "lowercase", "search_synonym_filter" ],
            "tokenizer" : "standard"
          }
        },
        "uuid" : "Zq6Id8xsRWGofJrNCb7M8w",
        "number_of_replicas" : "1",
        "analysis" : {
          "filter" : {
            "search_synonym_filter" : {
              "type" : "synonym",
              "synonyms" : [ "sneakers,pumps" ]
            }
          }
        },
        "number_of_shards" : "5",
        "version" : {
          "created" : "1050099"
        }
      }
    }
  }
}
But when I try to use it with the mapping:
curl -XPUT 'http://localhost:9200/test_index/_mapping/product_catalog?pretty' -H "Content-Type: application/json" \
-d '{"product_catalog": {"properties" : {"name": {"type": "string", "include_in_all": true, "analyzer":"search_synonyms"} }}}'
I get the error:
{
  "error" : "MapperParsingException[Analyzer [search_synonyms] not found for field [name]]",
  "status" : 400
}
I have also tried to just check the analyser with:
curl 'http://localhost:9200/test_index/_analyze?analyzer=search_synonyms&pretty=1&text=pumps'
but still get an error:
ElasticsearchIllegalArgumentException[failed to find analyzer [search_synonyms]]
Any ideas? I may be missing something, but I can't think what.
The analyzer element has to be inside your analysis component. Change your index creator as follows:
{
  "settings": {
    "index": {
      "creation_date": "1429175067557",
      "uuid": "Zq6Id8xsRWGofJrNCb7M8w",
      "number_of_replicas": "0",
      "analysis": {
        "filter": {
          "search_synonym_filter": {
            "type": "synonym",
            "synonyms": [
              "sneakers,pumps"
            ]
          }
        },
        "analyzer": {
          "search_synonyms": {
            "filter": [
              "lowercase",
              "search_synonym_filter"
            ],
            "tokenizer": "standard"
          }
        }
      },
      "number_of_shards": "5",
      "version": {
        "created": "1050099"
      }
    }
  }
}
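Once the analyzer lives under analysis, the synonym filter behaves roughly like this toy sketch: every term in a synonym group is expanded so either query term matches either indexed term. Illustration only; the real work is done by Lucene's synonym filter:

```python
# Toy synonym token filter for the rule "sneakers,pumps":
# a token belonging to a synonym group is expanded to the whole group.

SYNONYMS = [{"sneakers", "pumps"}]

def expand(token):
    for group in SYNONYMS:
        if token in group:
            return sorted(group)  # emit every synonym in the group
    return [token]

print(expand("pumps"))  # -> ['pumps', 'sneakers']
print(expand("boots"))  # -> ['boots']
```

So a search for "pumps" against the name field also retrieves documents that only contain "sneakers", which is the point of the search_synonyms analyzer.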

Why ElasticSearch is not finding my term

I just installed Elasticsearch and am testing it. It looks great, and I need to know something. I have a configuration file
elasticsearch.json in the config directory:
{
  "network" : {
    "host" : "127.0.0.1"
  },
  "index" : {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "refresh_interval" : "2s",
    "analysis" : {
      "analyzer" : {
        "index_analyzer" : {
          "tokenizer" : "nGram",
          "filter" : ["lowercase"]
        },
        "search_analyzer" : {
          "tokenizer" : "nGram",
          "filter" : ["lowercase"]
        }
      },
      "// you'll need lucene dep for this: filter" : {
        "snowball": {
          "type" : "snowball",
          "language" : "English"
        }
      }
    }
  }
}
I have inserted a doc that contains the word searching. If I search for the keyword
search, it says nothing is found...
Won't it stem before indexing, or did I miss something in the config?
What does your query look like?
Your config does not look good. Try:
...
"index_analyzer" : {
  "tokenizer" : "nGram",
  "filter" : ["lowercase", "snowball"]
},
"search_analyzer" : {
  "tokenizer" : "nGram",
  "filter" : ["lowercase", "snowball"]
}
},
"filter" : {
  "snowball": {
    "type" : "snowball",
    "language" : "English"
  }
}
I've had trouble overriding the "default_search" and "default_index" analyzer as well.
This works though.
You can add "index_analyzer" to default all string fields with unspecified analyzers within a type, if need be.
curl -XDELETE localhost:9200/twitter
curl -XPOST localhost:9200/twitter -d '
{
  "index": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "snowball": {
          "type" : "snowball",
          "language" : "English"
        }
      },
      "analyzer": {
        "a2" : {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "snowball"]
        }
      }
    }
  }
}'
curl -XPUT localhost:9200/twitter/tweet/_mapping -d '{
  "tweet" : {
    "date_formats" : ["yyyy-MM-dd", "dd-MM-yyyy"],
    "properties" : {
      "user": {"type":"string"},
      "message" : {"type" : "string", "analyzer":"a2"}
    }
  }
}'
curl -XPUT http://localhost:9200/twitter/tweet/1 -d '{ "user": "kimchy", "post_date": "2009-11-15T13:12:00", "message": "Trying out searching teaching, so far so good?" }'
curl -XGET localhost:9200/twitter/tweet/_search?q=message:search
curl -XGET localhost:9200/twitter/tweet/_search?q=message:try
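The reason both of those searches can hit the tweet is that the a2 analyzer stems at index time and at query time, so "searching" and "search" normalize to the same term. The crude suffix-stripper below illustrates the idea; it is not the real snowball stemmer:

```python
# Crude stand-in for a stemmer: strip a few common suffixes so that
# inflected forms and their base form collapse to one index term.
# NOT the snowball algorithm -- illustration only.

def crude_stem(word):
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# "searching" in the document and "search" in the query meet at "search"
print(crude_stem("searching"), crude_stem("search"))
# -> search search
print(crude_stem("trying"), crude_stem("try"))
# -> try try
```

Because both sides of the comparison go through the same analyzer, the query terms search and try find the stored message even though it contains "searching" and "Trying".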
