Elasticsearch analyzer that allows querying both with and without hyphens - elasticsearch

How do you construct an analyzer that allows you to query fields both with and without hyphens?
The following two queries must return the same person:
{
  "query": {
    "term": {
      "name": {
        "value": "Jay-Z"
      }
    }
  }
}
{
  "query": {
    "term": {
      "name": {
        "value": "jay z"
      }
    }
  }
}

What you could do is use a mapping character filter to replace the hyphen with a space, like this:
curl -XPUT localhost:9200/tests -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ],
          "char_filter": [
            "hyphens"
          ]
        }
      },
      "char_filter": {
        "hyphens": {
          "type": "mapping",
          "mappings": [
            "-=>\\u0020"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}'
Then we can check what the analysis pipeline would yield using the _analyze endpoint:
For Jay-Z:
curl -XGET 'localhost:9200/tests/_analyze?pretty&analyzer=my_analyzer' -d 'Jay-Z'
{
  "tokens" : [ {
    "token" : "jay z",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "word",
    "position" : 0
  } ]
}
For jay z:
curl -XGET 'localhost:9200/tests/_analyze?pretty&analyzer=my_analyzer' -d 'jay z'
{
  "tokens" : [ {
    "token" : "jay z",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "word",
    "position" : 0
  } ]
}
As you can see, the same token is indexed for both forms, so your term query will work with both forms as well.
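To verify end to end, you could index a sample document and then search for the normalized form. This is a minimal sketch; the document ID and the refresh flag are my own additions:
curl -XPUT 'localhost:9200/tests/test/1?refresh' -d '{"name": "Jay-Z"}'
curl -XPOST 'localhost:9200/tests/_search?pretty' -d '{
  "query": {
    "term": {
      "name": {
        "value": "jay z"
      }
    }
  }
}'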

Related

Does Elasticsearch support nested or object fields in MultiMatch?

I have an object field named "FullTitleFts". It has a field "text" inside. This query works fine (and returns some entries):
GET index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "fullTitleFts.text": "Ivan"
          }
        }
      ]
    }
  }
}
But this query returns nothing:
GET index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "Ivan",
            "fields": [
              "fullTitleFts.text"
            ]
          }
        }
      ]
    }
  }
}
Mapping of the field:
"fullTitleFts": {
"copy_to": [
"text"
],
"type": "keyword",
"fields": {
"text": {
"analyzer": "analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "text"
}
}
}
"analyzer": {
"filter": [
"lowercase",
"hypocorisms",
"protect_kw"
],
"char_filter": [
"replace_char_filter",
"e_char_filter"
],
"expand": "true",
"type": "custom",
"tokenizer": "standard"
}
e_char_filter replaces the Cyrillic char "ё" with "е", and replace_char_filter removes "�" from the text. protect_kw is a keyword_marker for some Russian conjunctions. hypocorisms is a synonym_graph for producing other forms of names.
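For reference, those two char filters could be defined along these lines (a sketch; only the filter names come from the question, the definitions are assumptions):
"char_filter": {
  "e_char_filter": {
    "type": "mapping",
    "mappings": [
      "ё => е"
    ]
  },
  "replace_char_filter": {
    "type": "pattern_replace",
    "pattern": "�",
    "replacement": ""
  }
}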
Example of analyzer output:
GET index/_analyze
{
  "analyzer": "analyzer",
  "text": "Алёна�"
}
{
  "tokens" : [
    {
      "token" : "аленка",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "аленушка",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "алена",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
I've also found this question, and it seems that the answer didn't really work - the author had to add the "include_in_root" option to the mapping. So I'm wondering whether multi_match supports nested or object fields at all. I also can't find anything about it in the docs.
Based on the index mapping you have provided, your field is defined as a multi-field and not as a nested or object field. So both match and multi_match should work without providing a path: just use the field name fullTitleFts.text when you need to search the text type and fullTitleFts when you need to search the keyword type.
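For example, a multi_match across both the keyword field and its text sub-field could look like this (a sketch using the field names from the question):
GET index/_search
{
  "query": {
    "multi_match": {
      "query": "Ivan",
      "fields": [
        "fullTitleFts",
        "fullTitleFts.text"
      ]
    }
  }
}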

Search by slug in Elasticsearch

I have an index named homes. Here is the simplified mapping of it:
{
  "template": "homes",
  "index_patterns": "homes",
  "settings": {
    "index.refresh_interval": "60s"
  },
  "mappings": {
    "properties": {
      "status": {
        "type": "keyword"
      },
      "address": {
        "type": "keyword",
        "fields": {
          "suggest": {
            "type": "search_as_you_type"
          },
          "search": {
            "type": "text"
          }
        }
      }
    }
  }
}
As you can see, there is an address field which I query this way:
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "status": "sale"
          }
        },
        {
          "term": {
            "address": "406 - 533 Richmond St W"
          }
        }
      ]
    }
  }
}
Now my problem is that I need to be able to query with the slugified version of the address field as well. For example, I need to query like this:
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "status": "sale"
          }
        },
        {
          "term": {
            "address": "406-533-richmond-st-w"
          }
        }
      ]
    }
  }
}
So, instead of 406 - 533 Richmond St W I need to query 406-533-richmond-st-w. How can I do that? I was thinking of adding a new field address_slug, the slugified version of address, but it needs to be auto-populated so that I don't have to fill it manually every time I insert or update a document in the index.
If you create a custom analyzer with the token filters below, and another field for search that uses that analyzer, you can achieve this. Here is an example _analyze request and its output:
GET {index}/_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "lowercase"
    },
    {
      "type": "pattern_replace",
      "pattern": """[^A-Za-z0-9]+""",
      "replacement": "-"
    }
  ],
  "text": "406 - 533 Richmond St W"
}
Output:
{
  "tokens" : [
    {
      "token" : "406-533-richmond-st-w",
      "start_offset" : 0,
      "end_offset" : 23,
      "type" : "word",
      "position" : 0
    }
  ]
}
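To wire this into the index, you could declare the custom analyzer in the settings and expose it on a sub-field. This is a sketch; the names slug_filter, slug_analyzer and address.slug are assumptions:
PUT homes
{
  "settings": {
    "analysis": {
      "filter": {
        "slug_filter": {
          "type": "pattern_replace",
          "pattern": "[^a-z0-9]+",
          "replacement": "-"
        }
      },
      "analyzer": {
        "slug_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "slug_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "address": {
        "type": "keyword",
        "fields": {
          "slug": {
            "type": "text",
            "analyzer": "slug_analyzer"
          }
        }
      }
    }
  }
}
A term query on address.slug for 406-533-richmond-st-w should then match, while address itself keeps the original value.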

Distinct values of a field without case sensitivity in Elasticsearch

We have a requirement to get the distinct values of a field without case sensitivity (for example, "CIty" and "ciTy" must end up in the same group).
Can I implement a query with this behavior in Elasticsearch?
This problem needs to be resolved at indexing time. Tokens can be converted to lowercase while indexing using a normalizer:
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "city": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "normalizer": "my_normalizer"
          }
        }
      }
    }
  }
}
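A terms aggregation on the normalized keyword sub-field (a sketch; the index name is assumed):
GET {index}/_search
{
  "size": 0,
  "aggs": {
    "cities": {
      "terms": {
        "field": "city.keyword"
      }
    }
  }
}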
will give
"aggregations" : {
"cities" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "a",
"doc_count" : 2
}
]
}
}
both "A" and "a" are treated as single value

Elasticsearch: Find substring match, Exact match and only if present match

I want to perform a query such that it returns output if and only if all the words in the query are present in the given string.
For example -
let text = "garbage can"
so if I query
"garb"
it should return "garbage can"
if I query
"garbage ca"
it should return "garbage can"
but if I query
"garbage b"
it should not return anything
I tried using substring and also match, but neither quite did the job for me.
You may use an Edge N-gram tokenizer to index your data.
You can also have custom token_chars in the latest 7.8 version!
Have a look at the documentation for more details: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html
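For instance, you can experiment with the tokenizer directly via the _analyze API (a sketch; the gram sizes are assumptions):
POST _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 2,
    "max_gram": 10,
    "token_chars": [
      "letter",
      "digit"
    ]
  },
  "text": "garbage can"
}
This emits prefixes of each word, e.g. ga, gar, ..., garbage and ca, can.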
I guess you want to do a prefix query. Please try the following prefix query:
GET /test_index/_search
{
  "query": {
    "prefix": {
      "my_keyword": {
        "value": "garbage b"
      }
    }
  }
}
However, the performance of this kind of prefix query is not good. You could instead try the following queries, which use a customized prefix analyzer.
First, create a new index:
PUT /test_index
{
  "settings": {
    "index": {
      "number_of_shards": "1",
      "analysis": {
        "filter": {
          "autocomplete_filter": {
            "type": "edge_ngram",
            "min_gram": "1",
            "max_gram": "20"
          }
        },
        "analyzer": {
          "autocomplete": {
            "filter": [
              "lowercase",
              "autocomplete_filter"
            ],
            "type": "custom",
            "tokenizer": "keyword"
          }
        }
      },
      "number_of_replicas": "1"
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "analyzer": "autocomplete",
        "type": "text"
      },
      "my_keyword": {
        "type": "keyword"
      }
    }
  }
}
Second, insert data into this index:
PUT /test_index/_doc/1
{
  "my_text": "garbage can",
  "my_keyword": "garbage can"
}
Query with "garbage c"
GET /test_index/_search
{
  "query": {
    "term": {
      "my_text": "garbage c"
    }
  }
}
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.45802015,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.45802015,
        "_source" : {
          "my_text" : "garbage can",
          "my_keyword" : "garbage can"
        }
      }
    ]
  }
}
Query with "garbage b"
GET /test_index/_search
{
  "query": {
    "term": {
      "my_text": "garbage b"
    }
  }
}
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
If you don't want to do a prefix query, you could try the following wildcard query. Please remember that its performance is bad, and you could also try to use a customized analyzer to optimize it.
GET /test_index/_search
{
  "query": {
    "wildcard": {
      "my_keyword": {
        "value": "*garbage c*"
      }
    }
  }
}
New Edit Part
I'm not sure if I got what you really want this time...
Anyway, please try the following mapping and queries:
1. Create Index
PUT /test_index
{
  "settings": {
    "index": {
      "max_ngram_diff": 50,
      "number_of_shards": "1",
      "analysis": {
        "filter": {
          "autocomplete_filter": {
            "type": "ngram",
            "min_gram": 1,
            "max_gram": 51,
            "token_chars": [
              "letter",
              "digit"
            ]
          }
        },
        "analyzer": {
          "autocomplete": {
            "filter": [
              "lowercase",
              "autocomplete_filter"
            ],
            "type": "custom",
            "tokenizer": "keyword"
          }
        }
      },
      "number_of_replicas": "1"
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "analyzer": "autocomplete",
        "type": "text"
      },
      "my_keyword": {
        "type": "keyword"
      }
    }
  }
}
2. Insert some sample data
PUT /test_index/_doc/1
{
  "my_text": "test garbage can",
  "my_keyword": "test garbage can"
}
PUT /test_index/_doc/2
{
  "my_text": "garbage",
  "my_keyword": "garbage"
}
3. Query
GET /test_index/_search
{
  "query": {
    "term": {
      "my_text": "bage c"
    }
  }
}
Please note:
This index only supports strings with a maximum length of 50; otherwise you need to modify max_ngram_diff, min_gram and max_gram.
It needs a lot of memory to build the inverted index.
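To see why "bage c" matches, you can inspect the tokens the autocomplete analyzer emits (a sketch against the index created above):
GET /test_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "test garbage can"
}
Because the keyword tokenizer keeps the whole string as one token and the ngram filter then generates every substring of length 1 to 51, the output includes "bage c" for document 1.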

How to query an Elasticsearch index with nested and non-nested fields

I have an elastic search index with the following mapping:
PUT /student_detail
{
  "mappings": {
    "properties": {
      "id": { "type": "long" },
      "name": { "type": "text" },
      "email": { "type": "text" },
      "age": { "type": "text" },
      "status": { "type": "text" },
      "tests": { "type": "nested" }
    }
  }
}
The data stored is in the form below:
{
  "id": 123,
  "name": "Schwarb",
  "email": "abc#gmail.com",
  "status": "current",
  "age": 14,
  "tests": [
    {
      "test_id": 587,
      "test_score": 10
    },
    {
      "test_id": 588,
      "test_score": 6
    }
  ]
}
I want to be able to query the students where name is like '%warb%' AND email is like '%gmail.com%' AND the test with id 587 has a score > 5, etc. The high level of what is needed is something like the query below; I don't know what the actual query would be, so apologies for this messy query:
GET developer_search/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "abc"
          }
        },
        {
          "nested": {
            "path": "tests",
            "query": {
              "bool": {
                "must": [
                  {
                    "term": {
                      "tests.test_id": IN [587]
                    }
                  },
                  {
                    "term": {
                      "tests.test_score": >= some value
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}
The query must be flexible so that we can pass in dynamic test ids with their respective score filters, along with the non-nested fields like age, name and status.
Something like this?
GET student_detail/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "wildcard": {
            "name": {
              "value": "*warb*"
            }
          }
        },
        {
          "wildcard": {
            "email": {
              "value": "*gmail.com*"
            }
          }
        },
        {
          "nested": {
            "path": "tests",
            "query": {
              "bool": {
                "must": [
                  {
                    "term": {
                      "tests.test_id": 587
                    }
                  },
                  {
                    "range": {
                      "tests.test_score": {
                        "gte": 5
                      }
                    }
                  }
                ]
              }
            },
            "inner_hits": {}
          }
        }
      ]
    }
  }
}
Inner hits is what you are looking for.
You should make use of the Ngram Tokenizer instead, as wildcard search should be avoided for performance reasons; I wouldn't recommend using it.
Change your mapping to the one below, in which I've created a custom analyzer.
How Elasticsearch (or rather Lucene) indexes a statement is: first it breaks the statement or paragraph into words or tokens, then it indexes these words in the inverted index for that particular field. This process is called analysis, and it applies only to the text datatype.
So you only get documents back if the searched tokens are available in the inverted index.
By default, the standard analyzer would be applied. What I've done is create my own analyzer using the Ngram Tokenizer, which creates many more tokens than just simple words.
The default analyzer on Life is beautiful would produce life, is, beautiful.
Using Ngrams, however, the tokens for Life would be lif, ife & life.
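You can check this with the _analyze API (a sketch mirroring the tokenizer defined in the mapping below):
POST _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 3,
    "max_gram": 4
  },
  "filter": [
    "lowercase"
  ],
  "text": "Life"
}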
Mapping:
PUT student_detail
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 4,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings" : {
"properties" : {
"id" : {
"type" : "long"
},
"name" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"email" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"age" : {
"type" : "text" <--- I am not sure why this is text. Change it to long or int. Would leave this to you
},
"status" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"tests":{
"type" : "nested"
}
}
}
}
Note that in the above mapping I've created a sibling keyword field for name, email and status, as below:
"name":{
"type":"text",
"analyzer":"my_analyzer",
"fields":{
"keyword":{
"type":"keyword"
}
}
}
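With the keyword sibling in place, an exact match can use a term query (a sketch; the value comes from the sample document):
POST student_detail/_search
{
  "query": {
    "term": {
      "name.keyword": "Schwarb"
    }
  }
}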
Now your query could be as simple as below.
Query:
POST student_detail/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "war" <---- Note this. This would even return documents having "Schwarb"
          }
        },
        {
          "match": {
            "email": "gmail" <---- Note this
          }
        },
        {
          "nested": {
            "path": "tests",
            "query": {
              "bool": {
                "must": [
                  {
                    "term": {
                      "tests.test_id": 587
                    }
                  },
                  {
                    "range": {
                      "tests.test_score": {
                        "gte": 5
                      }
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}
Note that for exact matches I would make use of term queries on the keyword fields, while for normal searches (LIKE in SQL) I would make use of simple match queries on the text fields, provided they make use of the Ngram tokenizer.
Also note that for >= and <= you would need to make use of Range Query.
Response:
{
  "took" : 233,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 3.7260926,
    "hits" : [
      {
        "_index" : "student_detail",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 3.7260926,
        "_source" : {
          "id" : 123,
          "name" : "Schwarb",
          "email" : "abc#gmail.com",
          "status" : "current",
          "age" : 14,
          "tests" : [
            {
              "test_id" : 587,
              "test_score" : 10
            },
            {
              "test_id" : 588,
              "test_score" : 6
            }
          ]
        }
      }
    ]
  }
}
Note that the document you mentioned in your question appears in my response when I run the query.
Please do read the links I've shared. It is vital that you understand the concepts. Hope this helps!
