Elasticsearch template to support case-insensitive searches

I've set up a normalizer on an index field to support case-insensitive searches, but I can't seem to get it to work.
GET users/
Returns the following mapping and settings:
{
"users": {
"aliases": {},
"mappings": {
"user": {
"properties": {
"active": {
"type": "boolean"
},
"first_name": {
"type": "keyword",
"fields": {
"normalize": {
"type": "keyword",
"normalizer": "search_normalizer"
}
}
},
}
}
},
"settings": {
"index": {
"number_of_shards": "5",
"provided_name": "users",
"creation_date": "1567936315432",
"analysis": {
"normalizer": {
"search_normalizer": {
"filter": [
"lowercase"
],
"type": "custom"
}
}
},
"number_of_replicas": "1",
"uuid": "5SknFdwJTpmF",
"version": {
"created": "6040299"
}
}
}
}
}
Although first_name is normalized to lowercase, queries on the first_name field are case-sensitive.
Using the following query for a user with the first name Dave:
GET users/_search
{
"query": {
"bool": {
"should": [
{
"regexp": {
"first_name": {
"value": ".*dave.*"
}
}
}
]
}
}
}
GET users/_analyze
{
"analyzer" : "standard",
"text": "Dave"
}
returns
{
"tokens": [
{
"token": "dave",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
}
]
}
Although "Dave" is tokenized to "dave" the following query
GET users/_search
{
"query": {
"match": {
"first_name": "dave"
}
}
}
returns no hits.
Is there an issue with my current mapping, or with the query?

I think you have missed first_name.normalize in the query.
Indexing records:
POST test3/test3_type/_bulk
{"index": {}}
{"first_name": "Daveraj"}
{"index": {}}
{"first_name": "RajdaveN"}
{"index": {}}
{"first_name": "Dave"}
Query:
GET test3/test3_type/_search
{
"query": {
"bool": {
"should": [
{
"regexp": {
"first_name.normalize": {
"value": ".*dave.*"
}
}
}
]
}
}
}
Result:
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1.0,
"hits": [
{
"_index": "test3",
"_type": "test3_type",
"_id": "M8-lEG0BLCpzI1hbBWYC",
"_score": 1.0,
"_source": {
"first_name": "Dave"
}
},
{
"_index": "test3",
"_type": "test3_type",
"_id": "Mc-lEG0BLCpzI1hbBWYC",
"_score": 1.0,
"_source": {
"first_name": "Daveraj"
}
},
{
"_index": "test3",
"_type": "test3_type",
"_id": "Ms-lEG0BLCpzI1hbBWYC",
"_score": 1.0,
"_source": {
"first_name": "RajdaveN"
}
}
]
}
}

You have created a normalized multi-field, first_name.normalize, but you are searching on the original field first_name, which has no normalizer attached and therefore indexes its terms verbatim (case-sensitively).
The examples given here might help:
https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html
You need to explicitly specify the multi-field you want to search on. Note that even though a multi-field cannot have its own content, it may index different terms than its parent (although not always), because it can be processed by different analyzers/char/token filters.
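For example, here is a minimal sketch against the users index above. Term-level queries apply the field's normalizer to the query input as well, so even a mixed-case value matches:
GET users/_search
{
"query": {
"term": {
"first_name.normalize": "DAVE"
}
}
}
Since both the indexed terms and the query term are lowercased by search_normalizer, this returns the Dave document, just like the .*dave.* regexp above.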

Related

Keyword normalizer not applied on document

I'm using Elasticsearch 6.8
Here is my mapping (defined in an index template):
{
"index_patterns": [
"my_index_*"
],
"settings": {
"index": {
"number_of_shards": 3,
"number_of_replicas": 1
},
"analysis": {
"analyzer": {
"lower_ascii_analyzer": {
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
]
}
},
"normalizer": {
"my_normalizer": {
"type": "custom",
"char_filter": [],
"filter": ["lowercase"]
}
}
}
},
"mappings": {
"audit_conformity": {
"dynamic": "false",
"properties": {
"country": {
"type": "keyword",
"normalizer": "my_normalizer"
},
[…]
Then I post a document with this body
{
"_source": {
"company_id": "a813bec1-f9f3-44c7-96ac-11157f64b79b",
"country": "MX",
"user_entity_id": "1"
}
}
When I search for the document, the country is still capitalized
GET /my_index_country/_search
I get
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "my_index_country",
"_type": "my_index",
"_id": "LOT0fYIBCNP9gFG_7cet",
"_score": 1,
"_source": {
"_source": {
"company_id": "a813bec1-f9f3-44c7-96ac-11157f64b79b",
"country": "MX",
"user_entity_id": "1",
}
}
}
]
}
}
What am I doing wrong?
You are doing nothing wrong, but normalizers (and analyzers alike) never modify your source document, only what is indexed from it.
This means that the source document still holds MX, but underneath, mx is what gets indexed for the country field.
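You can check what actually gets indexed with the _analyze API; a quick sketch against an index created from your template:
GET my_index_country/_analyze
{
"normalizer": "my_normalizer",
"text": "MX"
}
This returns the single token mx, which is the term that keyword queries on country will match against.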
If you want to lowercase the country field itself, you should use an ingest pipeline with a lowercase processor instead, which will modify your source document before indexing it:
PUT _ingest/pipeline/lowercase-pipeline
{
"processors": [
{
"lowercase": {
"field": "country"
}
}
]
}
Then use it when indexing your documents:
PUT my_index_country/my_index/LOT0fYIBCNP9gFG_7cet?pipeline=lowercase-pipeline
{
"company_id": "a813bec1-f9f3-44c7-96ac-11157f64b79b",
"country": "MX",
"user_entity_id": "1",
}
GET my_index_country/my_index/LOT0fYIBCNP9gFG_7cet
Result =>
{
"company_id": "a813bec1-f9f3-44c7-96ac-11157f64b79b",
"country": "mx",
"user_entity_id": "1",
}
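If you have documents that were indexed before the pipeline existed, one sketch for fixing them in place (assuming the pipeline above has been created) is to run them back through it with _update_by_query:
POST my_index_country/_update_by_query?pipeline=lowercase-pipeline
This rewrites the stored documents through the lowercase processor and reindexes them.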

Elasticsearch returns NullPointerException during inner_hits query

I have an index which stores nested documents. I want to see these nested documents, so I used 'inner_hits' in the request, but Elasticsearch returns a NullPointerException. Has anyone run into this problem?
Request to elasticsearch using Postman:
GET http://localhost/my-index/_search
{
"query": {
"nested": {
"path": "address_object",
"query": {
"bool": {
"must": {
"term": {"address_object.city": "Paris"}
}
}
},
"inner_hits" : {}
}
}
}
Response with status code 200:
{
"took": 161,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 1,
"skipped": 0,
"failed": 1,
"failures": [
{
"shard": 0,
"index": "my-index",
"node": "DWdD83KaTmUiodENQkGDww",
"reason": {
"type": "null_pointer_exception",
"reason": null
}
}
]
},
"hits": {
"total": 6500039,
"max_score": 2.1761138,
"hits": []
}
}
Elasticsearch version: 6.2.4
Lucene version: 7.2.1
Update:
Mapping:
{
"my-index": {
"mappings": {
"mytype": {
"dynamic": "false",
"_source": {
"enabled": false
},
"properties": {
"adverts_count": {
"type": "integer",
"store": true
},
...
"address_object": {
"type": "nested",
"properties": {
"adverts_count": {
"type": "integer",
"store": true
},
"city": {
"type": "keyword",
"store": true
}
}
},
...
Sample document:
{
"_index": "my-index",
"_type": "mytype",
"_id": "XDWrGncBdwNBWGEagAM2",
"_score": 2.1587489,
"fields": {
"is_target_page_shown": [
0
],
"updated_at": [
1612264276
],
"is_shown": [
0
],
"nb_queries": [
1
],
"search_query": [
"phone"
],
"target_category": [
15
],
"adverts_count": [
1
]
}
}
Extra information:
If I remove "inner_hits": {} from the search request, Elasticsearch returns the matching documents (_index, _type, _id, _score), but none of the other fields (e.g. city).
Also, as suggested in the comments, I tried setting ignore_unmapped to true, but it didn't help: the same NullPointerException.
I tried reproducing your issue, but as you have not provided proper sample documents (the one you provided doesn't have the address_object properties), I used your mapping and the sample documents below.
PUT 71907588/_doc/1
{
"address_object" :{
"adverts_count" : 1,
"city": "paris"
}
}
PUT 71907588/_doc/2
{
"address_object" :{
"adverts_count" : 1,
"city": "blr"
}
}
And when I run the same search you provided:
POST 71907588/_search
{
"query": {
"nested": {
"path": "address_object",
"query": {
"bool": {
"must": {
"term": {
"address_object.city": "paris"
}
}
}
},
"inner_hits": {}
}
}
}
I get a proper response, matching paris as the city, as shown in the search response below.
"hits": [
{
"_index": "71907588",
"_id": "1",
"_score": 0.6931471,
"_source": {
"address_object": {
"adverts_count": 1,
"city": "paris"
}
},
"inner_hits": {
"address_object": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.6931471,
"hits": [
{
"_index": "71907588",
"_id": "1",
"_nested": {
"field": "address_object",
"offset": 0
},
"_score": 0.6931471,
"_source": {
"city": "paris",
"adverts_count": 1
}
}
]
}
}
}
}
]
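Note one difference between this reproduction and the original setup: the mapping in the question disables _source ("enabled": false), while the reproduction returns _source in its hits. With _source disabled, inner_hits has no source to render. A hedged sketch that instead asks inner_hits for the stored city field (stored_fields is a standard inner_hits option; whether this avoids the exception on 6.2.4 is untested):
GET my-index/_search
{
"query": {
"nested": {
"path": "address_object",
"query": {
"term": { "address_object.city": "Paris" }
},
"inner_hits": {
"stored_fields": ["address_object.city"]
}
}
}
}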

Search all unique terms from a given query in Elasticsearch

I am trying to search for all the unique names in the index test_nested.
GET test_nested/_mappings
{
"test_nested": {
"mappings": {
"my_type": {
"properties": {
"group": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"user": {
"type": "nested",
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
}
}
GET test_nested/_search
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 1,
"hits": [
{
"_index": "test_nested",
"_type": "my_type",
"_id": "AWG5iVBz4bQsVnslc9gL",
"_score": 1,
"_source": {
"group": "fans",
"user": [
{
"name": "Linux"
},
{
"name": "Android (operating system)"
},
{
"name": "Widows 10"
}
]
}
},
{
"_index": "test_nested",
"_type": "my_type",
"_id": "AWG5ieKW4bQsVnslc9gM",
"_score": 1,
"_source": {
"group": "fans",
"user": [
{
"name": "Bitcoin"
},
{
"name": "PHP"
},
{
"name": "Microsoft Windows"
}
]
}
},
{
"_index": "test_nested",
"_type": "my_type",
"_id": "AWG5irrV4bQsVnslc9gN",
"_score": 1,
"_source": {
"group": "fans",
"user": [
{
"name": "Windows XP"
}
]
}
},
{
"_index": "test_nested",
"_type": "my_type",
"_id": "1",
"_score": 1,
"_source": {
"group": "fans",
"user": [
{
"name": "iOS"
},
{
"name": "Android (operating system)"
},
{
"name": "Widows 10"
},
{
"name": "Widows XP"
}
]
}
}
]
}
}
I want all the unique names matching a term, i.e. if I search for "wi*" then I should get [Microsoft Windows, Widows 10, Windows XP].
I don't know exactly what you mean, but I use this query to list all statuses:
GET order/default/_search
{
"size": 0,
"aggs": {
"status_terms": {
"terms": {
"field": "status.keyword",
"missing": "N/A",
"min_doc_count": 0,
"order": {
"_key": "asc"
}
}
}
}
}
My model has a status field, and that query lists all its statuses.
This is a bucket aggregation.
One of the fields in the result is sum_other_doc_count: Elastic returns only the top unique terms, so if you have many different terms, some of them will not appear in the results. This field is the sum of the document counts that were left out of the response.
For nested objects, try reading and using the Nested Query docs.
I found the solution. Hope it helps someone.
GET record_new/_search
{
"size": 0,
"query": {
"term": {
"software_tags": {
"value": "windows"
}
}
},
"aggs": {
"software_tags": {
"terms": {
"field": "software_tags.keyword",
"include" : ".*Windows.*",
"size": 10000,
"order": {
"_count": "desc"
}
}
}
}
}
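Note that this accepted query runs against a non-nested field (software_tags in a different index). For the original test_nested mapping, where user is a nested type, the terms aggregation has to be wrapped in a nested aggregation; a sketch (the include pattern is matched against the raw keyword terms, so it is case-sensitive):
GET test_nested/_search
{
"size": 0,
"aggs": {
"users": {
"nested": {
"path": "user"
},
"aggs": {
"unique_names": {
"terms": {
"field": "user.name.keyword",
"include": ".*[Ww]i.*",
"size": 10000
}
}
}
}
}
}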

Facing problems with terms filter

My mapping looks like this:
"BID": {
"type": "string"
},
"REGION": {
"type": "string"
},
Now I am trying to search for the records whose BID values are B100 or B302. I've written the query below. Though I have records with those ID values, I am not getting any results. Any clue where I am going wrong?
{"query": {"filtered": {"filter": {"terms": {"BID": ["B100","B302"]}}}}}
Try using lower-case values, like:
{"query": {"filtered": {"filter": {"terms": {"BID": ["b100","b302"]}}}}}
You need to do this because you did not specify an analyzer for "BID" in your mapping, so the default standard analyzer is used, which converts letters to lower-case.
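You can see this with the _analyze API (shown in the request-body form; on 1.x clusters the same parameters can also be passed as query-string parameters):
GET _analyze
{
"analyzer": "standard",
"text": "B100"
}
This returns the single lower-cased term b100, which is why the terms filter only matches lower-case values.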
Alternatively, if you want to maintain the case in your index terms, you can add "index": "not_analyzed" to your mapping definition for "BID".
To test I set up an index like this:
PUT /test_index
{
"mappings": {
"doc": {
"properties": {
"BID": {
"type": "string",
"index": "not_analyzed"
},
"REGION": {
"type": "string"
}
}
}
}
}
added a few docs:
POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"REGION":"NA","BID":"B100"}
{"index":{"_id":2}}
{"REGION":"NA","BID":"B200"}
{"index":{"_id":3}}
{"REGION":"NA","BID":"B302"}
and now your query works as written:
POST /test_index/_search
{
"query": {
"filtered": {
"filter": {
"terms": {
"BID": [
"B100",
"B302"
]
}
}
}
}
}
...
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1,
"_source": {
"REGION": "NA",
"BID": "B100"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "3",
"_score": 1,
"_source": {
"REGION": "NA",
"BID": "B302"
}
}
]
}
}
Here is some code I used for testing:
http://sense.qbox.io/gist/b4b4767501df7ad8b6459c4d96809d737a8811ec

Querying Elasticsearch with OR and wildcards

I'm trying to run a simple query against my Elasticsearch _type and match multiple fields with wildcards; my first attempt was like this:
POST my_index/my_type/_search
{
"sort" : { "date_field" : {"order" : "desc"}},
"query" : {
"filtered" : {
"filter" : {
"or" : [
{
"term" : { "field1" : "4848" }
},
{
"term" : { "field2" : "6867" }
}
]
}
}
}
}
This example successfully matches every record where field1 OR field2 is exactly equal to 4848 or 6867, respectively.
What I'm trying to do is match any text on field1 that contains 4848 and any text on field2 that contains 6867, but I'm not really sure how to do it.
I appreciate any help I can get :)
It sounds like your problem has mostly to do with analysis. The appropriate solution depends on the structure of your data and what you want to match. I'll provide a couple of examples.
First, let's assume that your data is such that we can get what we want just using the standard analyzer. This analyzer will tokenize text fields on whitespace, punctuation and symbols. So the text "1234-5678-90" will be broken into the terms "1234", "5678", and "90", so a "term" query or filter for any of those terms will match that document. More concretely:
DELETE /test_index
PUT /test_index
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"doc": {
"properties": {
"field1":{
"type": "string",
"analyzer": "standard"
},
"field2":{
"type": "string",
"analyzer": "standard"
}
}
}
}
}
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"field1": "1212-2323-4848","field2": "1234-5678-90"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"field1": "0000-0000-0000","field2": "0987-6543-21"}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"field1": "1111-2222-3333","field2": "6867-4545-90"}
POST test_index/_search
{
"query": {
"filtered": {
"filter": {
"or": [
{
"term": { "field1": "4848" }
},
{
"term": { "field2": "6867" }
}
]
}
}
}
}
...
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1,
"_source": {
"field1": "1212-2323-4848",
"field2": "1234-5678-90"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "3",
"_score": 1,
"_source": {
"field1": "1111-2222-3333",
"field2": "6867-4545-90"
}
}
]
}
}
(Explicitly writing "analyzer": "standard" is redundant since that is the default analyzer used if you do not specify one; I just wanted to make it obvious.)
On the other hand, if the text is embedded in such a way that the standard analysis doesn't provide what you want, say something like "121223234848" where you want to match on "4848", you will have to do something a little more sophisticated, using ngrams. Here is an example of that (notice the difference in the data):
DELETE /test_index
PUT /test_index
{
"settings": {
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"field1":{
"type": "string",
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
},
"field2":{
"type": "string",
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
}
}
}
}
}
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"field1": "121223234848","field2": "1234567890"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"field1": "000000000000","field2": "0987654321"}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"field1": "111122223333","field2": "6867454590"}
POST test_index/_search
{
"query": {
"filtered": {
"filter": {
"or": [
{
"term": { "field1": "4848" }
},
{
"term": { "field2": "6867" }
}
]
}
}
}
}
...
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1,
"_source": {
"field1": "121223234848",
"field2": "1234567890"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "3",
"_score": 1,
"_source": {
"field1": "111122223333",
"field2": "6867454590"
}
}
]
}
}
There is a lot going on here, so I won't attempt to explain it in this post. If you want more explanation I would encourage you to read this blog post: http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams. Hope you'll forgive the shameless plug. ;)
Hope that helps.
