Elasticsearch terms aggregate duplicates - elasticsearch

I have a field using a ngram analyzer and trying to use a terms aggregate on the field to return unique documents by the field. The returned keys in the aggregates don't match the documents fields being returned and I'm getting duplicate fields.
"analysis" : {
"filter" : {
"autocomplete_filter" : {
"type" : "edge_ngram",
"min_gram" : "1",
"max_gram" : "20"
}
},
"analyzer" : {
"autocomplete" : {
"type" : "custom",
"filter" : [ "lowercase", "autocomplete_filter" ],
"tokenizer" : "standard"
}
}
}
}
"name" : {
"type" : "string",
"analyzer" : "autocomplete",
"fields" : {
"raw" : {
"type" : "string",
"index" : "not_analyzed"
}
}
}
{
"query": {
"query_string": {
"query":"bra",
"fields":["name"],
"use_dis_max":true
}
},
"aggs": {
"group_by_name": {
"terms": { "field":"name.raw" }
}
}
}
I'm getting back the following names and keys.
Braingeyser, Brainstorm, Braingeyser, Brainstorm, Brainstorm, Brainstorm, Bramblecrush, Brainwash, Brainwash, Braingeyser
{"key":"Bog Wraith","doc_count":18}
{"key":"Birds of Paradise","doc_count":15}
{"key":"Circle of Protection: Black","doc_count":15}
{"key":"Lightning Bolt","doc_count":15}
{"key":"Grizzly Bears","doc_count":14}
{"key":"Black Knight","doc_count":13}
{"key":"Bad Moon","doc_count":12}
{"key":"Boomerang","doc_count":12}
{"key":"Wall of Bone","doc_count":12}
{"key":"Balance","doc_count":11}
How can I get elasticsearch to only return unique fields from the aggregate?

To remove duplicates being returned in your aggregate you can try:
"aggs": {
"group_by_name": {
"terms": { "field":"name.raw" },
"aggs": {
"remove_dups": {
"top_hits": {
"size": 1,
"_source": false
}
}
}
}
}

Related

How can I use query_string to match both nested and non-nested fields at the same time?

I have an index with a mapping something like this:
"email" : {
"type" : "nested",
"properties" : {
"from" : {
"type" : "text",
"analyzer" : "lowercase_keyword",
"fielddata" : true
},
"subject" : {
"type" : "text",
"analyzer" : "lowercase_keyword",
"fielddata" : true
},
"to" : {
"type" : "text",
"analyzer" : "lowercase_keyword",
"fielddata" : true
}
}
},
"textExact" : {
"type" : "text",
"analyzer" : "lowercase_standard",
"fielddata" : true
}
I want to use query_string to search for matches in both the nested and the non-nested field at the same time, e.g.
email.to:foo#example.com AND textExact:bar
But I can't figure out how to write a query that will search both fields at once. The following doesn't work, because query_string searches do not return nested documents:
"query": {
"query_string": {
"fields": [
"textExact",
"email.to"
],
"query": "email.to:foo#example.com AND textExact:bar"
}
}
I can write a separate nested query, but that will only search against nested fields. Is there any way I can use query_string to match both nested and non-nested fields at the same time?
I am using Elasticsearch 6.8. Cross-posted on the Elasticsearch forums.
Nested documents can only be queried with the nested query.
You can follow below two approaches.
1. You can combine nested and normal query in must clause, which works like "and" for different queries.
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "email",
"query": {
"term": {
"email.to": "foo#example.com"
}
}
}
},
{
"match": {
"textExact": "bar"
}
}
]
}
}
}
2. copy-to
The copy_to parameter allows you to copy the values of multiple fields into a group field, which can then be queried as a single field.
{
"mappings": {
"properties": {
"textExact":{
"type": "text"
},
"to_email":{
"type": "keyword"
},
"email":{
"type": "nested",
"properties": {
"to":{
"type":"keyword",
"copy_to": "to_email" --> copies to non-nested field
},
"from":{
"type":"keyword"
}
}
}
}
}
}
Query
{
"query": {
"query_string": {
"fields": [
"textExact",
"to_email"
],
"query": "to_email:foo#example.com AND textExact:bar"
}
}
}
Result
"_source" : {
"textExact" : "bar",
"email" : [
{
"to" : "sdfsd#example.com",
"from" : "a#example.com"
},
{
"to" : "foo#example.com",
"from" : "sdfds#example.com"
}
]
}

Elasticsearch - How to Generate Facets for Doubly Nested Objects

Using elasticsearch 7, I am trying to build facets for doubly nested objects.
So in the example below I would like to pull out the artist id codes from the artistMakerPerson field. I can pull out the association which is nested at a single depth but I can't get the syntax for the nested nested objects.
You could use the following code in Kibana to recreate an example.
My mapping looks like this:
PUT test_artist
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"properties": {
"object" : {
"type" : "text",
"fields" : {
"raw" : {
"type" : "keyword"
}
},
"copy_to" : [
"global_search"
]
},
"uniqueID" : {
"type" : "keyword",
"copy_to" : [
"global_search"
]
},
"artistMakerPerson" : {
"type" : "nested",
"properties" : {
"association" : {
"type" : "keyword"
},
"name" : {
"type" : "nested",
"properties" : {
"id" : {
"type" : "keyword"
},
"text" : {
"type" : "text",
"fields" : {
"raw" : {
"type" : "keyword"
}
},
"copy_to" : [
"gs_authority"
]
}
}
},
"note" : {
"type" : "text"
}
}
}
}
}
}
Index a document with:
PUT /test_artist/_doc/123
{
"object": "cup",
"uniquedID": "123",
"artistMakerPerson" : [
{
"name" : {
"text" : "Johann Kandler",
"id" : "A6734"
},
"association" : "modeller",
"note" : "probably"
},
{
"name" : {
"text" : "Peter Reinicke",
"id" : "A27702"
},
"association" : "designer",
"note" : "probably"
}
]
}
I am using this query to pull out facets or aggregations for artistMakerPerson.association
GET test_artist/_search
{
"size": 0,
"aggs": {
"artists": {
"nested": {
"path": "artistMakerPerson"
},
"aggs": {
"kinds": {
"terms": {
"field": "artistMakerPerson.association",
"size": 10
}
}
}
}
}
}
and I am rewarded with buckets for designer and modeller but I get nothing when I try to pull out the deeper artist id:
GET test_artist/_search
{
"size": 0,
"aggs": {
"artists": {
"nested": {
"path": "artistMakerPerson"
},
"aggs": {
"kinds": {
"terms": {
"field": "artistMakerPerson.name.id",
"size": 10
}
}
}
}
}
}
What am I doing wrong?
Change the path from artistMakerPerson to artistMakerPerson.name.
GET test_artist/_search
{
"size": 0,
"aggs": {
"artists": {
"nested": {
"path": "artistMakerPerson.name"
},
"aggs": {
"kinds": {
"terms": {
"field": "artistMakerPerson.name.id",
"size": 10
}
}
}
}
}
}

How to Query elasticsearch index with nested and non nested fields

I have an elastic search index with the following mapping:
PUT /student_detail
{
"mappings" : {
"properties" : {
"id" : { "type" : "long" },
"name" : { "type" : "text" },
"email" : { "type" : "text" },
"age" : { "type" : "text" },
"status" : { "type" : "text" },
"tests":{ "type" : "nested" }
}
}
}
Data stored is in form below:
{
"id": 123,
"name": "Schwarb",
"email": "abc#gmail.com",
"status": "current",
"age": 14,
"tests": [
{
"test_id": 587,
"test_score": 10
},
{
"test_id": 588,
"test_score": 6
}
]
}
I want to be able to query the students where name like '%warb%' AND email like '%gmail.com%' AND test with id 587 have score > 5 etc. The high level of what is needed can be put something like below, dont know what would be the actual query, apologize for this messy query below
GET developer_search/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "abc"
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": IN [587]
}
},
{
"term": {
"tests.test_score": >= some value
}
}
]
}
}
}
}
]
}
}
}
The query must be flexible so that we can enter dynamic test Ids and their respective score filters along with the fields out of nested fields like age, name, status
Something like that?
GET student_detail/_search
{
"query": {
"bool": {
"must": [
{
"wildcard": {
"name": {
"value": "*warb*"
}
}
},
{
"wildcard": {
"email": {
"value": "*gmail.com*"
}
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": 587
}
},
{
"range": {
"tests.test_score": {
"gte": 5
}
}
}
]
}
},
"inner_hits": {}
}
}
]
}
}
}
Inner hits is what you are looking for.
You must make use of Ngram Tokenizer as wildcard search must not be used for performance reasons and I wouldn't recommend using it.
Change your mapping to the below where you can create your own Analyzer which I've done in the below mapping.
How elasticsearch (albiet lucene) indexes a statement is, first it breaks the statement or paragraph into words or tokens, then indexes these words in the inverted index for that particular field. This process is called Analysis and that this would only be applicable on text datatype.
So now you only get the documents if these tokens are available in inverted index.
By default, standard analyzer would be applied. What I've done is I've created my own analyzer and used Ngram Tokenizer which would be creating many more tokens than just simply words.
Default Analyzer on Life is beautiful would be life, is, beautiful.
However using Ngrams, the tokens for Life would be lif, ife & life
Mapping:
PUT student_detail
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 4,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings" : {
"properties" : {
"id" : {
"type" : "long"
},
"name" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"email" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"age" : {
"type" : "text" <--- I am not sure why this is text. Change it to long or int. Would leave this to you
},
"status" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"tests":{
"type" : "nested"
}
}
}
}
Note that in the above mapping I've created a sibling field in the form of keyword for name, email and status as below:
"name":{
"type":"text",
"analyzer":"my_analyzer",
"fields":{
"keyword":{
"type":"keyword"
}
}
}
Now your query could be as simple as below.
Query:
POST student_detail/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "war" <---- Note this. This would even return documents having "Schwarb"
}
},
{
"match": {
"email": "gmail" <---- Note this
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": 587
}
},
{
"range": {
"tests.test_score": {
"gte": 5
}
}
}
]
}
}
}
}
]
}
}
}
Note that for exact matches I would make use of Term Queries on keyword fields while for normal searches or LIKE in SQL I would make use of simple Match Queries on text Fields provided they make use of Ngram Tokenizer.
Also note that for >= and <= you would need to make use of Range Query.
Response:
{
"took" : 233,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 3.7260926,
"hits" : [
{
"_index" : "student_detail",
"_type" : "_doc",
"_id" : "1",
"_score" : 3.7260926,
"_source" : {
"id" : 123,
"name" : "Schwarb",
"email" : "abc#gmail.com",
"status" : "current",
"age" : 14,
"tests" : [
{
"test_id" : 587,
"test_score" : 10
},
{
"test_id" : 588,
"test_score" : 6
}
]
}
}
]
}
}
Note that I observe the document you've mentioned in your question, in my response when I run the query.
Please do read the links I've shared. It is vital that you understand the concepts. Hope this helps!

Display field value of data type token_count

I have the following mapping:
"fullName" : {
"type" : "text",
"norms" : false,
"similarity" : "boolean",
"fields" : {
"raw" : {
"type" : "keyword"
},
"terms" : {
"type" : "token_count",
"analyzer" : "standard"
}
}
}
I want to display the value of terms field. When I do the following, I get the fullName but not the terms value
GET /_search
{"_source": ["fullName","fullName.terms"],
"query": {
"bool" : {
"must" : {
"script" : {
"script" : {
"source": "doc['fullName.terms'].value != 3,
"lang": "painless"
}
}
}
}
}
}
How can I get it?
You need to configure that your token count is stored - Here documentation
You should modify your mapping :
"terms" : {
"type" : "token_count",
"analyzer" : "standard",
"store": true
}
Then to retrive the value you need to explicitly ask for stored value in your query : ( here documentation )
GET /_search
{
"_source": [
"fullName"
],
"stored_fields": [
"fullName.terms"
],
"query": {
"bool": {
"must": {
"script": {
"script": {
"source": "doc['fullName.terms'].value != 3",
"lang": "painless"
}
}
}
}
}
}

Aggregating a Key/Value list in ElasticSearch

General problem is that I've created a Name/Value mapping in elastic search to deal with a potentially huge user input of tags - as opposed to allowing an open schema where people can just create documents with new properties.
I've got an elastic search mapping that looks like this:
"Tags" : {
"properties" : {
"Value" : {
"analyzer" : "keyword",
"type" : "string"
},
"Name" : {
"analyzer" : "keyword",
"type" : "string"
}
}
},
With records that look like this
"Tags" : [
{
"Name" : "group",
"Value" : "foobar"
},
{
"Name" : "season",
"Value" : "winter"
}
],
What I'm trying to do with an elastic search query is to write a script that will aggregate only the season entries.
...
"script" : "for (int i = 0; i < doc['Tags.Value'].values.length; i++) {
if (doc['Tags.Value'].values[i] == 'season') {
return doc['Tags.Names'].values[i]
} }"
...
I've gone through about 200 permutations of the above script and it's not quite returning the results that I would like to see.
Your Tags field should be nested so that you can write a nested query to only select the season tags and then you can aggregate on those values only. That would allow you to ditch that script which is going to perform very badly if you have a huge amount of tags.
So your mapping needs to look like this:
"Tags" : {
"type": "nested", <---- add this
"properties" : {
"Value" : {
"analyzer" : "keyword",
"type" : "string"
},
"Name" : {
"analyzer" : "keyword",
"type" : "string"
}
}
},
Then your query should include a nested clause on the season tag names, so that your terms aggregation can simply work on those values.
{
"query": {
"filtered": {
"filter": {
"nested": {
"path": "Tags",
"filter": {
"term": {
"Tags.Name": "season"
}
}
}
}
}
},
"aggs": {
"season_tags": {
"nested": {
"path": "Tags"
},
"aggs": {
"season_values": {
"terms": {
"field": "Tags.Value"
}
}
}
}
}
}

Resources