Distinct values of a field without case sensitivity in Elasticsearch - elasticsearch

We have a requirement to get the distinct values of a field without case sensitivity (for example, "CIty" and "ciTy" must end up in the same group).
Can I implement a query with this behaviour in Elasticsearch?

This problem needs to be resolved at index time. Tokens can be converted to lowercase while indexing using a normalizer:
{
"settings": {
"analysis": {
"normalizer": {
"my_normalizer": {
"type": "custom",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"city": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"normalizer": "my_normalizer"
}
}
}
}
}
}
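With this mapping in place, run a terms aggregation on the city.keyword sub-field (a sketch; my_index stands in for your index name):
GET my_index/_search
{
"size": 0,
"aggs": {
"cities": {
"terms": {
"field": "city.keyword"
}
}
}
}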
This will give
"aggregations" : {
"cities" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "a",
"doc_count" : 2
}
]
}
}
both "A" and "a" are treated as single value

Related

How to query an Elasticsearch index with nested and non-nested fields

I have an elastic search index with the following mapping:
PUT /student_detail
{
"mappings" : {
"properties" : {
"id" : { "type" : "long" },
"name" : { "type" : "text" },
"email" : { "type" : "text" },
"age" : { "type" : "text" },
"status" : { "type" : "text" },
"tests":{ "type" : "nested" }
}
}
}
The data stored is of the form below:
{
"id": 123,
"name": "Schwarb",
"email": "abc#gmail.com",
"status": "current",
"age": 14,
"tests": [
{
"test_id": 587,
"test_score": 10
},
{
"test_id": 588,
"test_score": 6
}
]
}
I want to be able to query the students where name is like '%warb%' AND email is like '%gmail.com%' AND the test with id 587 has a score > 5, and so on. A high-level sketch of what is needed is below; I don't know what the actual query would be, apologies for the messy query:
GET developer_search/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "abc"
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": IN [587]
}
},
{
"term": {
"tests.test_score": >= some value
}
}
]
}
}
}
}
]
}
}
}
The query must be flexible so that we can pass in dynamic test ids and their respective score filters, along with the fields outside the nested field, like age, name and status.
Something like this?
GET student_detail/_search
{
"query": {
"bool": {
"must": [
{
"wildcard": {
"name": {
"value": "*warb*"
}
}
},
{
"wildcard": {
"email": {
"value": "*gmail.com*"
}
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": 587
}
},
{
"range": {
"tests.test_score": {
"gte": 5
}
}
}
]
}
},
"inner_hits": {}
}
}
]
}
}
}
Inner hits is what you are looking for.
You should make use of the Ngram tokenizer, as wildcard searches should be avoided for performance reasons; I wouldn't recommend using them.
Change your mapping to the one below, where I've created a custom analyzer.
How Elasticsearch (or rather Lucene) indexes a statement is: first it breaks the statement or paragraph into words or tokens, then it indexes these words in the inverted index for that particular field. This process is called analysis, and it applies only to the text datatype.
So you only get documents back if the query's tokens are available in the inverted index.
By default, the standard analyzer would be applied. What I've done is create my own analyzer using the Ngram tokenizer, which creates many more tokens than just plain words.
The default analyzer on "Life is beautiful" would produce life, is, beautiful.
Using ngrams, however, the tokens for "Life" would be lif, ife and life.
Mapping:
PUT student_detail
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 4,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings" : {
"properties" : {
"id" : {
"type" : "long"
},
"name" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"email" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"age" : {
"type" : "text" <--- I am not sure why this is text. Change it to long or int. Would leave this to you
},
"status" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"tests":{
"type" : "nested"
}
}
}
}
Note that in the above mapping I've created a sibling field in the form of keyword for name, email and status as below:
"name":{
"type":"text",
"analyzer":"my_analyzer",
"fields":{
"keyword":{
"type":"keyword"
}
}
}
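You can verify which tokens the custom analyzer emits with the _analyze API; a sketch, assuming the student_detail index above has been created:
POST student_detail/_analyze
{
"analyzer": "my_analyzer",
"text": "Schwarb"
}
Among the returned tokens you should see war, which is why the match query below is able to find "Schwarb".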
Now your query could be as simple as below.
Query:
POST student_detail/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "war" <---- Note this. This would even return documents having "Schwarb"
}
},
{
"match": {
"email": "gmail" <---- Note this
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": 587
}
},
{
"range": {
"tests.test_score": {
"gte": 5
}
}
}
]
}
}
}
}
]
}
}
}
Note that for exact matches I would make use of term queries on the keyword fields, while for normal searches (LIKE in SQL) I would make use of simple match queries on the text fields, provided they make use of the Ngram tokenizer.
Also note that for >= and <= you would need to make use of a range query.
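For instance, the IN [587, 588] part of your pseudo query becomes a terms clause inside the nested query, since terms accepts an array of values, and an exact match on a field such as status would be a term query on its keyword sub-field (a sketch, not part of the query above):
"terms": {
"tests.test_id": [587, 588] <---- IN-style matching on several test ids
}
and, outside the nested clause:
"term": {
"status.keyword": "current" <---- exact match on the keyword sub-field
}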
Response:
{
"took" : 233,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 3.7260926,
"hits" : [
{
"_index" : "student_detail",
"_type" : "_doc",
"_id" : "1",
"_score" : 3.7260926,
"_source" : {
"id" : 123,
"name" : "Schwarb",
"email" : "abc#gmail.com",
"status" : "current",
"age" : 14,
"tests" : [
{
"test_id" : 587,
"test_score" : 10
},
{
"test_id" : 588,
"test_score" : 6
}
]
}
}
]
}
}
Note that the document you mentioned in your question appears in my response when I run the query.
Please do read the links I've shared. It is vital that you understand the concepts. Hope this helps!

ElasticSearch "more like this" returning empty result

I made a very simple test to figure out my mistake, but did not find it. I created two indexes and I'm trying to search documents in the ppa index that are similar to a given document in the ods index (like the second example here https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html).
These are my settings, mappings and documents for the ppa index:
PUT /ppa
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"filter": {
"brazilian_stop": {
"type": "stop",
"stopwords": "_brazilian_"
},
"brazilian_stemmer": {
"type": "stemmer",
"language": "brazilian"
}
},
"analyzer": {
"brazilian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"brazilian_stop",
"brazilian_stemmer"
]
}
}
}
}
}
PUT /ppa/_mapping/ppa
{"properties": {"descricao": {"type": "text", "analyzer": "brazilian"}}}
POST /_bulk
{"index":{"_index":"ppa","_type":"ppa"}}
{"descricao": "erradicar a pobreza"}
{"index":{"_index":"ppa","_type":"ppa"}}
{"descricao": "erradicar a pobreza"}
These are my settings, mappings and documents for the ods index:
PUT /ods
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"filter": {
"brazilian_stop": {
"type": "stop",
"stopwords": "_brazilian_"
},
"brazilian_stemmer": {
"type": "stemmer",
"language": "brazilian"
}
},
"analyzer": {
"brazilian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"brazilian_stop",
"brazilian_stemmer"
]
}
}
}
}
}
PUT /ods/_mapping/ods
{"properties": {"metaodsdescricao": {"type": "text", "analyzer": "brazilian"},"metaodsid": {"type": "integer"}}}
POST /_bulk
{"index":{"_index":"ods","_type":"ods", "_id" : "1" }}
{ "metaodsdescricao": "erradicar a pobreza","metaodsid": 1}
{"index":{"_index":"ods","_type":"ods", "_id" : "2" }}
{"metaodsdescricao": "crianças que vivem na pobreza", "metaodsid": 2}
Now, this search doesn't work:
GET /ppa/ppa/_search
{
"query": {
"more_like_this" : {
"fields" : ["descricao"],
"like" : [
{
"_index" : "ods",
"_type" : "ods",
"_id" : "1"
}
],
"min_term_freq" : 1,
"min_doc_freq" : 1,
"max_query_terms" : 20
}
}
}
But this one does work:
GET /ppa/ppa/_search
{
"query": {
"more_like_this" : {
"fields" : ["descricao"],
"like" : ["erradicar a pobreza"],
"min_term_freq" : 1,
"min_doc_freq" : 1,
"max_query_terms" : 20
}
}
}
What is happening?
Please, help me make this return something other than empty.
The "more like this" query works well when you have indexed a lot of data. An empty result can be a symptom of very few documents being present in the Elasticsearch index.
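One quick sanity check (not part of the original answer): more_like_this builds its term list from the liked document, so it is worth confirming that the document really exists in the ods index with the expected text:
GET /ods/ods/1
If this returns "found": false, the more_like_this query has nothing to extract terms from.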

Elasticsearch: index first char of string

I'm using version 5.3.
I have a text field a. I'd like to aggregate on the first char of a. I also need the entire original value.
I'm assuming the most efficient way is to have a keyword field a.firstLetter with a custom normalizer. I've tried to achieve this with a pattern replace char filter but am struggling with the regexp.
Am I going about this entirely wrong? Can you help me?
EDIT
This is what I've tried.
settings.json
{
"settings": {
"index": {
"analysis": {
"char_filter": {
"first_char": {
"type": "pattern_replace",
"pattern": "(?<=^.)(.*)",
"replacement": ""
}
},
"normalizer": {
"first_letter": {
"type": "custom",
"char_filter": ["first_char"],
"filter": ["lowercase"]
}
}
}
}
}
}
mappings.json
{
"properties": {
"a": {
"type": "text",
"index_options": "positions",
"fields": {
"firstLetter": {
"type": "keyword",
"normalizer": "first_letter"
}
}
}
}
}
I get no buckets when I try to aggregate like so:
"aggregations": {
"grouping": {
"terms": {
"field": "a.firstLetter"
}
}
}
So basically my approach was "replace all but the first char with an empty string." The regexp is something I was able to gather by googling.
EDIT 2
I had misconfigured the normalizer (I've fixed the examples above). The correct configuration reveals that normalizers do not support pattern replace char filters because of issue 23142. Apparently support for this will be implemented in version 5.4 at the earliest.
So are there any other options? I'd hate to do this in code, by adding a field in the doc for the first letter, since I'm using Elasticsearch features for every other aggregation.
You can use the truncate filter with a length of one:
PUT foo
{
"mappings": {
"bar" : {
"properties": {
"name" : {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
},
"settings": {
"index": {
"analysis": {
"analyzer" : {
"my_analyzer" : {
"type" : "custom",
"tokenizer" : "keyword",
"filter" : [ "my_filter", "lowercase" ]
}
},
"filter": {
"my_filter": {
"type": "truncate",
"length": 1
}
}
}
}
}
}
GET foo/_analyze
{
"field" : "name",
"text" : "New York"
}
# response
{
"tokens": [
{
"token": "n",
"start_offset": 0,
"end_offset": 8,
"type": "word",
"position": 0
}
]
}
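To then aggregate on the first letter, as in the original question, a terms aggregation would target the name field; a sketch, assuming fielddata has been enabled on name (a terms aggregation on an analyzed text field requires it):
GET foo/_search
{
"size": 0,
"aggs": {
"grouping": {
"terms": {
"field": "name"
}
}
}
}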

Elasticsearch fielddata - should I use it?

Given an index with documents that have a brand property, we need to create a term aggregation that is case insensitive.
Index definition
Please note the use of fielddata below.
PUT demo_products
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"product": {
"properties": {
"brand": {
"type": "text",
"analyzer": "my_custom_analyzer",
"fielddata": true
}
}
}
}
}
Data
POST demo_products/product
{
"brand": "New York Jets"
}
POST demo_products/product
{
"brand": "new york jets"
}
POST demo_products/product
{
"brand": "Washington Redskins"
}
Query
GET demo_products/product/_search
{
"size": 0,
"aggs": {
"brand_facet": {
"terms": {
"field": "brand"
}
}
}
}
Result
"aggregations": {
"brand_facet": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "new york jets",
"doc_count": 2
},
{
"key": "washington redskins",
"doc_count": 1
}
]
}
}
If we use keyword instead of text we end up with 2 buckets for New York Jets because of the differences in casing.
We're concerned about the performance implications of using fielddata. However, if fielddata is disabled we get the dreaded "Fielddata is disabled on text fields by default." error.
Any other tips to resolve this, or should we not be so concerned about fielddata?
Starting with ES 5.2 (out today), you can use normalizers with keyword fields in order to (e.g.) lowercase the value.
The role of normalizers is a bit like that of analyzers for text fields, though what you can do with them is more restricted; that should help with the issue you're facing.
You'd create the index like this:
PUT demo_products
{
"settings": {
"analysis": {
"normalizer": {
"my_normalizer": {
"type": "custom",
"filter": [ "lowercase" ]
}
}
}
},
"mappings": {
"product": {
"properties": {
"brand": {
"type": "keyword",
"normalizer": "my_normalizer"
}
}
}
}
}
And your query would return this:
"aggregations" : {
"brand_facet" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "new york jets",
"doc_count" : 2
},
{
"key" : "washington redskins",
"doc_count" : 1
}
]
}
}
Best of both worlds!
You can lowercase the terms at query time if you use a script in the aggregation. It won't perform as well as a normalized keyword field, but it is still quite fast in my experience. For example, your query would be:
GET demo_products/product/_search
{
"size": 0,
"aggs": {
"brand_facet": {
"terms": {
"script": "doc['brand'].value.toLowerCase()"
}
}
}
}
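If brand also had a keyword sub-field (e.g. brand.keyword, which is not in the mapping above), the same idea would work against doc values without enabling fielddata; a sketch:
GET demo_products/product/_search
{
"size": 0,
"aggs": {
"brand_facet": {
"terms": {
"script": "doc['brand.keyword'].value.toLowerCase()"
}
}
}
}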

How do I get Elasticsearch to ignore terms emptied by a char_filter?

I have a set of US street addresses that I've indexed. The source data is imperfect and sometimes fields contain junk. Specifically, I have zip5 and zip4 fields and a pattern_replace char_filter that strips any non-numeric characters. When that char_filter ends up replacing everything (yielding an empty string), matching still seems to look at that field. The same happens if the original field is just an empty string (as opposed to null). How could I set this up such that it'll just disregard fields that are empty strings (either by source or by the result of a char_filter)?
Example
First, let's create an index with a digits_only pattern replacer and an analyzer that uses it:
curl -XPUT "http://localhost:9200/address_bug" -d'
{
"settings": {
"index": {
"number_of_shards": "4",
"number_of_replicas": "1"
},
"analysis": {
"char_filter" : {
"digits_only" : {
"type" : "pattern_replace",
"pattern" : "([^0-9])",
"replacement" : ""
}
},
"analyzer" : {
"zip" : {
"type" : "custom",
"tokenizer" : "keyword",
"char_filter" : [
"digits_only"
]
}
}
}
}
}'
Now, let's create a mapping that uses the analyzer (NB: I'm using with_positions_offsets for highlighting):
curl -XPUT "http://localhost:9200/address_bug/_mapping/address" -d'
{
"address": {
"properties": {
"zip5": {
"type" : "string",
"analyzer" : "zip",
"term_vector" : "with_positions_offsets"
},
"zip4": {
"type" : "string",
"analyzer" : "zip",
"term_vector" : "with_positions_offsets"
}
}
}
}'
Now that our index and type are set up, let's index some imperfect data:
curl -XPUT "http://localhost:9200/address_bug/address/1234" -d'
{
"zip5" : "02144",
"zip4" : "ABCD"
}'
Alright, let's search for it and ask it to explain itself. In this case the search term is Street because in my actual application I have a single field for full address searching.
curl -XGET "http://localhost:9200/address_bug/address/_search?explain" -d'
{
"query": {
"match": {
"zip4": "Street"
}
}
}'
And, here is the interesting part of the results:
"_explanation": {
"value": 0.30685282,
"description": "weight(zip4: in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.30685282,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 0.30685282,
"description": "idf(docFreq=1, maxDocs=1)"
},
{
"value": 1,
"description": "fieldNorm(doc=0)"
}
]
}
]
}
(Full response is in this gist.)
Expected Result
I wouldn't have expected any hits. If I instead index a document with "zip4" : null, it yields the expected result: no hits.
Help? Am I even taking the right approach here? In my full application, I'm using the same technique for a phone field and suspect I'd have the same issues with the results.
As @plmaheu mentioned, you can use the stop token filter to completely remove empty strings. For instance, this is a configuration that I tested and that works:
POST /myindex
{
"settings": {
"analysis": {
"char_filter" : {
"digits_only" : {
"type" : "pattern_replace",
"pattern" : "[^0-9]+",
"replacement" : ""
}
},
"filter": {
"remove_empty": {
"type": "stop",
"stopwords": [""]
}
},
"analyzer" : {
"zip" : {
"type" : "custom",
"tokenizer" : "keyword",
"char_filter" : [
"digits_only"
],
"filter": ["remove_empty"]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"zip": {
"type": "string",
"analyzer": "zip"
}
}
}
}
}
Here the remove_empty filter removes the "" stopword. If you use the analyze API on the string "abcd", you get back the response {"tokens":[]}, so no tokens will be indexed if the zip code is entirely invalid.
I also tested that this works when searching for "foo": no results are found.
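For reference, that analyze call can be reproduced with something like the following (a sketch in the same curl style as the question; adjust the _analyze syntax to your Elasticsearch version if needed):
curl -XGET "http://localhost:9200/myindex/_analyze?analyzer=zip" -d 'ABCD'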
You can use a length token filter like this:
"filter": {
"remove_empty": {
"type": "length",
"min": 1
}
}
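This is a drop-in replacement for the stop-based remove_empty filter above; a sketch of the relevant analysis settings, reusing the digits_only char filter and the zip analyzer from the configuration earlier in this answer:
"filter": {
"remove_empty": {
"type": "length",
"min": 1
}
},
"analyzer": {
"zip": {
"type": "custom",
"tokenizer": "keyword",
"char_filter": ["digits_only"],
"filter": ["remove_empty"]
}
}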
