Synonyms aggregation in elasticsearch 7 - term based

Synonyms aggregation in elasticsearch 7 - term based - elasticsearch

I am trying to aggregate fields, but fields are similar like Med and Medium. I don't want both to come in my aggregation results, only either of it should come. I tried with synonyms but it doesn't seem to work.
Question is: How can I concatenate or unify similar aggregation results when it is term based?
Below is my work.
Mapping and Setting
{
"settings": {
"index" : {
"analysis" : {
"filter" : {
"synonym_filter" : {
"type" : "synonym",
"synonyms" : [
"medium, m, med",
"large, l",
"extra small, xs, x small"
]
}
},
"analyzer" : {
"synonym_analyzer" : {
"tokenizer" : "standard",
"filter" : ["lowercase", "synonym_filter"]
}
}
}
}
},
"mappings": {
"properties": {
"skus": {
"type": "nested",
"properties": {
"labels": {
"dynamic": "true",
"properties": {
"Color": {
"type": "text",
"fields": {
"synonym": {
"analyzer": "synonym_analyzer",
"type": "text",
"fielddata":true
}
}
},
"Size": {
"type": "text",
"fields": {
"synonym": {
"analyzer": "synonym_analyzer",
"type": "text",
"fielddata":true
}
}
}
}
}
}
}
}
}}
Aggregation
{
"aggs":{
"sizesFilter": {
"aggs": {
"sizes": {
"terms": {
"field": "skus.labels.Size.synonym"
}
}
},
"nested": {
"path": "skus"
}
}
}}
With only one doc my aggregation result is
"aggregations": {
"sizesFilter": {
"doc_count": 1,
"sizes": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "m",
"doc_count": 1
},
{
"key": "med",
"doc_count": 1
},
{
"key": "medium",
"doc_count": 1
}
]
}
}
}

I got it by setting tokenizer in analyzer to "keyword"
{
"analyzer" : {
"synonym_analyzer" : {
"tokenizer" : "keyword",
"filter" : ["lowercase", "synonym_filter"]
}
}
}

Related

Filter aggregation keys with non nested mapping in elasticsearch

I have following mapping:
{
"Country": {
"properties": {
"State": {
"properties": {
"Name": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"Code": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"Lang": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
}
}
}
}
This is sample document:
{
"Country": {
"State": [
{
"Name": "California",
"Code": "CA",
"Lang": "EN"
},
{
"Name": "Alaska",
"Code": "AK",
"Lang": "EN"
},
{
"Name": "Texas",
"Code": "TX",
"Lang": "EN"
}
]
}
}
I am querying on this index to get aggregates of count of states by name. I am using following query:
{
"from": 0,
"size": 0,
"query": {
"query_string": {
"query": "Country.State.Name: *Ala*"
}
},
"aggs": {
"counts": {
"terms": {
"field": "Country.State.Name.raw",
"include": ".*Ala.*"
}
}
}
}
I am able to get only keys matching with query_string using include regex in terms aggregation but seems there is no way to make it case insensitive regex in include.
The result I want is:
{
"aggregations": {
"counts": {
"buckets": [
{
"key": "Alaska",
"doc_count": 1
}
]
}
}
}
Is there other solution available to get me only keys matching query_string without using nested mapping?

Use Normalizer for keyword datatype. Below is the sample mapping:
Mapping:
PUT country
{
"settings": {
"analysis": {
"normalizer": {
"my_normalizer": { <---- Note this
"type": "custom",
"filter": ["lowercase"]
}
}
}
},
"mappings": {
"properties": {
"Country": {
"properties": {
"State": {
"properties": {
"Name": {
"type": "text",
"fields": {
"raw": {
"type": "keyword",
"normalizer": "my_normalizer" <---- Note this
}
}
},
"Code": {
"type": "text",
"fields": {
"raw": {
"type": "keyword",
"normalizer": "my_normalizer"
}
}
},
"Lang": {
"type": "text",
"fields": {
"raw": {
"type": "keyword",
"normalizer": "my_normalizer"
}
}
}
}
}
}
}
}
}
}
Document:
POST country/_doc/1
{
"Country": {
"State": [
{
"Name": "California",
"Code": "CA",
"Lang": "EN"
},
{
"Name": "Alaska",
"Code": "AK",
"Lang": "EN"
},
{
"Name": "Texas",
"Code": "TX",
"Lang": "EN"
}
]
}
}
Aggregation Query:
POST country/_search
{
"from": 0,
"size": 0,
"query": {
"query_string": {
"query": "Country.State.Name: *Ala*"
}
},
"aggs": {
"counts": {
"terms": {
"field": "Country.State.Name.raw",
"include": "ala.*"
}
}
}
}
Notice the query pattern in include. Basically all the values of the *.raw fields that you have, would be stored in lowercase letters due to the normalizer that I've applied.
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"counts" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "alaska",
"doc_count" : 1
}
]
}
}
}
Hope this helps!

I was able to fix the problem by using inline script to filter the keys. (Still a dirty fix but it solves my use case for now and I can avoid mapping changes)
Here is how I am executing query.
{
"from": 0,
"size": 0,
"query": {
"query_string": {
"query": "Country.State.Name: *Ala*"
}
},
"aggs": {
"counts": {
"terms": {
"script": {
"source": "doc['Country.State.Name.raw'].value.toLowerCase().contains('ala') ? doc['Country.State.Name.raw'].value : null",
"lang": "painless"
}
}
}
}
}

Counting search results in ElasticSearch by a nested property

Here is a schema with a nested property.
{
"dynamic": "strict",
"properties" : {
"Id" : {
"type": "integer"
},
"Name_en" : {
"type": "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"normalizer": "cloudbuy_normalizer_alphanumeric"
},
"text" : {
"type" : "text",
"analyzer": "english"
}
}
},
"Menus" : {
"type" : "nested",
"properties" : {
"Id" : {
"type" : "integer"
},
"Name" : {
"type" : "keyword",
"normalizer": "normalizer_alphanumeric"
},
"AncestorsIds" : {
"type" : "integer"
}
}
}
}
}
And here is a document.
{
"Id": 12781279
"Name": "Thing of purpose made to fit",
"Menus": [
{
"Id": -571057,
"Name": "Top level menu",
"AncestorsIds": [
-571057
]
}
,
{
"Id": 1022313,
"Name": "Other",
"AncestorsIds": [
-571057
,
1022313
]
}
]
}
For any given query I need a list with two columns: the Menu.Id and the number of documents in the result set that have that Menu.Id in their Menus array.
How?
(Is there any documentation for aggs that isn't impenetrable?)

#Richard, does this query suits your need ?
POST yourindex/_search
{
"_source": "false",
"aggs":{
"menus": {
"nested": {
"path": "Menus"
},
"aggs":{
"menu_aggregation": {
"terms": {
"field": "Menus.Id",
"size": 10
}
}
}
}
}
Output :
"aggregations": {
"menus": {
"doc_count": 2,
"menu_aggregation": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": -571057,
"doc_count": 1
},
{
"key": 1022313,
"doc_count": 1
}
]
}
}
Here we specify a nested path and then aggregate on the menu Ids.
You can take a look at this documentation page : https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-nested-aggregation.html

Terms query(ElasticSearch) not working for some field values

This is my Mapping where i have already declared my required field as keywordbut yet my terms query is not working for category_name and storeName but it is working fine for price.
"mappings": {
"properties" : {
"firebaseId":{
"type":"text"
},
"name" : {
"type" : "text",
"analyzer" : "synonym"
},
"name_auto" : {
"type": "text",
"fields": {
"edgengram": {
"type": "text",
"analyzer": "edge_ngram_analyzer",
"search_analyzer": "edge_ngram_search_analyzer"
},
"completion": {
"type": "completion"
}
}
},
"category_name" : {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"storeName" : {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"sku" : {
"type" : "text"
},
"price" : {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"magento_id" : {
"type" : "text"
},
"seller_id" : {
"type" : "text"
},
"square_item_id" : {
"type" : "text"
},
"square_variation_id" : {
"type" : "text"
},
"typeId" : {
"type" : "text"
}
}
}
}
}
This is my query below :
{
"size": 0,
"aggs": {
"Category Filter": {
"terms": {
"field": "category_name",
"size": 10
}
},
"Store Filter": {
"terms": {
"field": "storeName",
"size": 10
}
},
"Price Filter": {
"range": {
"field": "price",
"ranges": [
{
"from": 0,
"to": 50
},
{
"from": 50,
"to": 100
},
{
"from": 100,
"to": 200
}
]
}
}
}
}
which returns as follows :
"reason": {
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [category_name] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."

You need to use the .keyword sub-field, like this:
{
"size": 0,
"aggs": {
"Category Filter": {
"terms": {
"field": "category_name.keyword", <-- change this
"size": 10
}
},
"Store Filter": {
"terms": {
"field": "storeName.keyword", <-- change this
"size": 10
}
},
"Price Filter": {
"range": {
"field": "price",
"ranges": [
{
"from": 0,
"to": 50
},
{
"from": 50,
"to": 100
},
{
"from": 100,
"to": 200
}
]
}
}
}
}

Nested sort causing CircuitBreakingException using icu_collation for Japanese text

I am using Elasticsearch 2.4, with the icu_analysis plugin added to provide sorting on Japanese text. It works well enough on my local environment, which has a limited number of documents, but when I attempt it on a more realistic dataset, queries fail with the following CircuitBreakingException:
"CircuitBreakingException[[fielddata] Data too large, data for [translations.name.jp_sort] would be larger than limit of [10239895142/9.5gb]]"
I understand that this occurs when attempting to sort on fielddata with large document counts and that docvalues should be used instead - but I am not sure if that can be done in this situation or why it isn't happening already.
There are about 470 million documents in the index, which are storing translations as a nested document - only about 35 million out of the full set contain a Japanese translation. Here is the mapping for the documents:
{
"settings" : {
"number_of_shards" : 6,
"number_of_replicas": 0,
"analysis": {
"filter": {
"trigrams_filter": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3
},
"japanese_ordering": {
"type": "icu_collation",
"language": "ja",
"country": "JP"
}
},
"analyzer": {
"trigrams": {
"tokenizer": "my_ngram_tokenizer",
"filter": "lowercase"
},
"japanese_ordering": {
"tokenizer": "keyword",
"filter": [ "japanese_ordering" ]
}
},
"tokenizer": {
"my_ngram_tokenizer": {
"type": "nGram",
"min_gram": "3",
"max_gram": "3",
"token_chars": [
"letter",
"digit",
"symbol",
"punctuation"
]
}
}
}
},
"mappings" : {
"product" : {
"_all" : {
"enabled" : false
},
"properties" : {
"name" : {
"type" : "string",
"analyzer": "trigrams",
"fields": {
"value" : {
"type": "string",
"index": "not_analyzed"
}
}
},
"record_status" : {
"type" : "integer"
},
"categories" : {
"type" : "integer"
},
"variant_status" : {
"type" : "integer"
},
"visit_count" : {
"type" : "integer"
},
"translations": {
"type": "nested",
"properties": {
"name": {
"type": "string",
"fields": {
"jp_sort": {
"type": "string",
"analyzer": "japanese_ordering"
}
}
},
"language_id": {
"type": "short"
}
}
}
}
}
}
}
and this is the query that is CircuitBreaking:
{
"from": 0,
"size": 20,
"query": {
"bool": {
"should": [],
"must_not": [],
"must": [{
"nested": {
"path": "translations",
"score_mode": "max",
"query": {
"bool": {
"must": [{
"match": {
"translations.name": {
"query": "\u30C6\u30B9\u30C8",
"boost": 5
}
}
}]
}
}
}
}]
}
},
"filter": {
"bool": {
"must": [{
"terms": {
"variant_status": ["1"],
"_cache": true
}
}, {
"nested": {
"path": "translations",
"query": {
"bool": {
"must": [{
"term": {
"translations.language_id": 9,
"_cache": true
}
}]
}
}
}
}, {
"term": {
"record_status": 1,
"_cache": true
}
}],
"must_not": [{
"term": {
"product_collections": 0
}
}]
}
},
"sort": [{
"translations.name.jp_sort": {
"order": "asc",
"nested_path": "translations"
}
}]
}

The ES 5.5 release has introduced a new field type called icu_collation_keyword which solves the issue you're facing.
You can read more here: https://www.elastic.co/blog/elasticsearch-5-5-0-released

Elasticsearch query multiple types with different bool

I have an index with 3 different types of content: ['media','group',user'] and I need to do a search at the three at the same type, but requesting some extra parameters that one of them must accomplish before adding to the results list.
Here is my current index data:
{
"settings": {
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"media": {
"_all": {
"analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
},
"properties": {
"UID": {
"type": "integer",
"include_in_all": false
},
"addtime": {
"type": "integer",
"include_in_all": false
},
"title": {
"type": "string",
"index": "not_analyzed"
}
}
},
"group": {
"_all": {
"analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
},
"properties": {
"UID": {
"type": "integer",
"include_in_all": false
},
"name": {
"type": "string",
"index": "not_analyzed"
},
"desc": {
"type": "string",
"include_in_all": false
}
}
},
"user": {
"_all": {
"analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
},
"properties": {
"addtime": {
"type": "integer",
"include_in_all": false
},
"username": {
"type": "string"
}
}
}
}
}
So currently I can make a search on all the index with
{
query: {
match: {
_all: {
"query": "foo",
"operator": "and"
}
}
}
}
and get the results for media, groups or users with the word "foo" on it, which is great, but I need to make it remove all the media on which the user is not the owner of the results. So I guess I need to do a bool query where I set the "must" clause and add the 'UID' variable to whatever the current user ID is.
My problem is how to do this and how to specify that the filter will work just on one type while leaving the others untouched.
I haven't been able to find an answer on the Elastic Search documentation

At the end I was able to accomplish this by following Andrei's comments. I know it is not perfect since I had to add a should with the types "group" and "user", but it fit perfectly with my design since I need to put more filters on those too. Be advice that the search will end up being slower.
curl -X GET 'http://localhost:9200/foo/_search' -d '
{
"query": {
"bool" :
{
"must" :
{
"query" : {
"match" :
{
"_all":
{
"query" : "test"
}
}
}
},
"filter":
{
"bool":
{
"should":
[{
"bool" : {
'must':
[{
"type":
{
"value": "media"
}
},
{
'bool':
{
"should" : [
{ "term" : {"UID" : 2}},
{ "term" : {"type" : "public"}}
]
}
}]
}
},
{
"bool" : {
"should" : [
{ "type" : {"value" : "group"}},
{ "type" : {"value" : "user"}}
]
}
}]
}
}
}
}
}'

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Synonyms aggregation in elasticsearch 7 - term based - elasticsearch

I got it by setting tokenizer in analyzer to "keyword" { "analyzer" : { "synonym_analyzer" : { "tokenizer" : "keyword", "filter" : ["lowercase", "synonym_filter"] } } }

Related

Filter aggregation keys with non nested mapping in elasticsearch

Counting search results in ElasticSearch by a nested property

Terms query(ElasticSearch) not working for some field values

Nested sort causing CircuitBreakingException using icu_collation for Japanese text

Elasticsearch query multiple types with different bool

Categories

Resources