ElasticSearch: How do I aggregate whole string sentence by term of an analyzed field? - elasticsearch

I have a analyzed field, for instance, let's name it "motto". I want to full-text saerch "life" and aggregate them by count.
...
"query":{
"term":{
"motto":"life"
}
},
"aggs": {
"match_count": {
"terms": "motto"
}
}
...
The result I want it to be:
...
{
...
"buckets": [
{
"key":"life is good",
"doc_count":3
}
]
...
}
...
The result actually it is:
{
...
"buckets": [
{
"key": "life",
"doc_count": 3
},
{
"key": "good",
"doc_count": 3
},
{
"key": "is",
"doc_count": 3
}
]
...
}
How do I aggregate them as the way I want it?

What you can do is to create a not_analyzed sub-field to the motto field, like this:
curl -XPUT localhost:9200/your_index/your_type/_mapping -d '{
"your_type": {
"properties": {
"motto": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}'
When done, you need to re-index your data in order to populate the motto.raw sub-field.
And finally, you'll be able to run a query like this, i.e. search on motto but aggregate on motto.raw:
...
"query":{
"term":{
"motto":"life"
}
},
"aggs": {
"match_count": {
"terms": { "field": "motto.raw" }
}
}
...

Related

Elasticsearch: Retrieving filtered and unfiltered count in one request

I am using the following mapping in one of my ElasticSearch indices:
"mappings": {
"my-mapping": {
"properties": {
"id": {
"type": "keyword"
},
"groupId": {
"type" : "keyword"
}
"title": {
"type": "text"
}
}
}
}
I now want to count elements matching to a search string which may be present inside of "title", grouped by my groupId. I can achieve that using aggregations and buckets:
/indexname/_search
{
"query" : {
"term" : {
"title" : "sky"
}
},
"aggs": {
"filtered_buckets": {
"terms": {
"field": "groupId"
}
}
}
}
Additionally, I want to know the count of all elements not respecting the filter. I could simply achieve that using a non-queried search:
/indexname/_search
{
"aggs": {
"filtered_buckets": {
"terms": {
"field": "groupId"
}
}
}
}
Current problem is: Is there any possibility to generate aggregation data containing the filtered count and the unfiltered count of only those groups which had a hit before - in one request?
For example:
"buckets": [
{
"key": "257786",
"doc_count": 3024,
"filtered_doc_count" : 202
},
{
"key": "254640",
"doc_count": 3010
"filtered_doc_count" : 1
},
{
"key": "252256",
"doc_count": 2367
"filtered_doc_count" : 5
},
...
]
One way I see is splitting the requests in two while first requesting all filtered buckets (their IDs) and then requesting the counts of these specific buckets using "terms" : { "id" : ["4", "65", "404"] }. This is not very nice and I don't want to request twice (_msearch does not help here).
Second bad solution would be to persist the all-counts somewhere in all of my entities.
Is there any way to achieve what I described in a single request?
PS: Please correct me, if the question is unclear.
Based on these:
How to filter terms aggregation
http://nocf-www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
I made this:
PUT test
{
"mappings": {
"my-mapping": {
"properties": {
"id": {
"type": "keyword"
},
"groupId": {
"type" : "keyword"
},
"title": {
"type": "text"
}
}
}
}
}
PUT test/type1/1
{
"id":1,
"groupId": 1,
"title": "asd"
}
PUT test/type1/2
{
"id":2,
"groupId": 1,
"title": "sky"
}
PUT test/type1/3
{
"id":3,
"groupId": 2,
"title": "sky"
}
PUT test/type1/4
{
"id":4,
"groupId": 2,
"title": "sky"
}
PUT test/type1/5
{
"id":5,
"groupId": 2,
"title": "sky"
}
POST test/type1/_search
{
"aggs": {
"categories-filtered": {
"filter": {"term": {"title": "sky"}},
"aggs": {
"names": {
"terms": {"field": "groupId"}
}
}
},
"categories": {
"terms": {"field": "groupId"}
}
}
}

Elastic synonym usage in aggregations

Situation :
Elastic version used: 2.3.1
I have an elastic index configured like so
PUT /my_index
{
"settings": {
"analysis": {
"filter": {
"my_synonym_filter": {
"type": "synonym",
"synonyms": [
"british,english",
"queen,monarch"
]
}
},
"analyzer": {
"my_synonyms": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_synonym_filter"
]
}
}
}
}
}
Which is great, when I query the document and use a query term "english" or "queen" I get all documents matching british and monarch. When I use a synonym term in filter aggregation it doesnt work. For example
In my index I have 5 documents, 3 of them have monarch, 2 of them have queen
POST /my_index/_search
{
"size": 0,
"query" : {
"match" : {
"status.synonym":{
"query": "queen",
"operator": "and"
}
}
},
"aggs" : {
"status_terms" : {
"terms" : { "field" : "status.synonym" }
},
"monarch_filter" : {
"filter" : { "term": { "status.synonym": "monarch" } }
}
},
"explain" : 0
}
The result produces:
Total hits:
5 doc count (as expected, great!)
Status terms: 5 doc count for queen (as expected, great!)
Monarch filter: 0 doc count
I have tried different synonym filter configuration:
queen,monarch
queen,monarch => queen
queen,monarch => queen,monarch
But the above hasn't changed the results. I was wanting to conclude that maybe you can use filters at query time only but then if terms aggregation is working why shouldn't filter, hence I think its my synonym filter configuration that is wrong. A more extensive synonym filter example can be found here.
QUESTION:
How to use/configure synonyms in filter aggregation?
Example to replicate the case above:
1. Create and configure index:
PUT /my_index
{
"settings": {
"analysis": {
"filter": {
"my_synonym_filter": {
"type": "synonym",
"synonyms": [
"wlh,wellhead=>wellwell"
]
}
},
"analyzer": {
"my_synonyms": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_synonym_filter"
]
}
}
}
}
}
PUT my_index/_mapping/job
{
"properties": {
"title":{
"type": "string",
"analyzer": "my_synonyms"
}
}
}
2.Put two documents:
PUT my_index/job/1
{
"title":"wellhead smth else"
}
PUT my_index/job/2
{
"title":"wlh other stuff"
}
3.Execute a search on wlh which should return 2 documents; have a terms aggregation which should have 2 documents for wellwell and a filter which shouldn't have 0 count:
POST my_index/_search
{
"size": 0,
"query" : {
"match" : {
"title":{
"query": "wlh",
"operator": "and"
}
}
},
"aggs" : {
"wlhAggs" : {
"terms" : { "field" : "title" }
},
"wlhFilter" : {
"filter" : { "term": { "title": "wlh" } }
}
},
"explain" : 0
}
The results of this query is:
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"wlhAggs": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "wellwell",
"doc_count": 2
},
{
"key": "else",
"doc_count": 1
},
{
"key": "other",
"doc_count": 1
},
{
"key": "smth",
"doc_count": 1
},
{
"key": "stuff",
"doc_count": 1
}
]
},
"wlhFilter": {
"doc_count": 0
}
}
}
And thats my problem, the wlhFilter should have at least 1 doc count in it.
I'm short in time, so if needed I can elaborate a bit more at a later time today/tomorrow. But the following should work:
DELETE /my_index
PUT /my_index
{
"settings": {
"analysis": {
"filter": {
"my_synonym_filter": {
"type": "synonym",
"synonyms": [
"british,english",
"queen,monarch"
]
}
},
"analyzer": {
"my_synonyms": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_synonym_filter"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_synonyms",
"fielddata": true
}
}
}
}
}
POST my_index/test/1
{
"title" : "the british monarch"
}
GET my_index/_search
{
"query": {
"match": {
"title": "queen"
}
}
}
GET my_index/_search
{
"query": {
"match": {
"title": "queen"
}
},
"aggs": {
"queen_filter": {
"filter": {
"term": {
"title": "queen"
}
}
},
"monarch_filter": {
"filter": {
"term": {
"title": "monarch"
}
}
}
}
}
Could you share the mapping you have defined for your status.synonym field?
EDIT: V2
The reason why your filter's output is 0, is because a filter in Elasticsearch never goes through an analysis phase. It's meant for exact matches.
The token 'wlh' in your aggregation will not be translated to 'wellwell', meaning that it doesn't occur in the inverted index. This is because, during index time, your 'wlh' is translated into 'wellwell'.
In order to achieve what you want, you will have to index the data into a separate field and adjust your filter accordingly.
You could try something like:
DELETE my_index
PUT /my_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"filter": {
"my_synonym_filter": {
"type": "synonym",
"synonyms": [
"wlh,wellhead=>wellwell"
]
}
},
"analyzer": {
"my_synonyms": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_synonym_filter"
]
}
}
}
},
"mappings": {
"job": {
"properties": {
"title": {
"type": "string",
"fields": {
"synonym": {
"type": "string",
"analyzer": "my_synonyms"
}
}
}
}
}
}
}
PUT my_index/job/1
{
"title":"wellhead smth else"
}
PUT my_index/job/2
{
"title":"wlh other stuff"
}
POST my_index/_search
{
"size": 0,
"query": {
"match": {
"title.synonym": {
"query": "wlh",
"operator": "and"
}
}
},
"aggs": {
"wlhAggs": {
"terms": {
"field": "title.synonym"
}
},
"wlhFilter": {
"filter": {
"term": {
"title": "wlh"
}
}
}
}
}
Output:
{
"aggregations": {
"wlhAggs": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "wellwell",
"doc_count": 2
},
{
"key": "else",
"doc_count": 1
},
{
"key": "other",
"doc_count": 1
},
{
"key": "smth",
"doc_count": 1
},
{
"key": "stuff",
"doc_count": 1
}
]
},
"wlhFilter": {
"doc_count": 1
}
}
}
Hope this helps!!
So with the help of #Byron Voorbach below and his comments this is my solution:
I have created a separate field which I use synonym analyser on, as
opposed to having a property field (mainfield.property).
And most importantly the problem was my synonyms were contracted! I
had, for example, british,english => uk. Changing that to
british,english,uk solved my issue and the filter aggregation is
returning the right number of documents.
Hope this helps someone, or at least point to the right direction.
Edit:
Oh lord praise the documentation! I completely fixed my issue with Filters (S!) aggregation (link here). In filters configuration I specified Match type of query and it worked! Ended up with something like this:
"aggs" : {
"messages" : {
"filters" : {
"filters" : {
"status" : { "match" : { "cats.saurus" : "monarch" }},
"country" : { "match" : { "cats.saurus" : "british" }}
}
}
}
}

Elasticsearch : Is it possible to not analysed aggregation query on analysed field?

I have certain document which stores the brand names in analysed form for ex: {"name":"Sam-sung"} {"name":"Motion:Systems"}. There are cases where i would want to aggregation these brands under timestamp.
my query as follow ,
{
"size": 0,
"aggs": {
"filtered_aggs": {
"filter": {
"range": {
"#timestamp":{
"gte":"2016-07-18T14:23:41.459Z",
"lte":"2016-07-18T14:53:10.017Z"
}
}
},
"aggs": {
"execute_time": {
"terms": {
"field": "brands",
"size": 0
}
}
}
}
}
}
but the return results will be
{
...
"aggregations": {
"states": {
"buckets": [
{
"key": "Sam",
"doc_count": 5
},
{
"key": "sung",
"doc_count": 5
},
{
"key": "Motion",
"doc_count": 1
},
{
"key": "Systems",
"doc_count": 1
}
]
}
}
}
but i want to the results is
{
...
"aggregations": {
"states": {
"buckets": [
{
"key": "Sam-sung",
"doc_count": 5
},
{
"key": "Motion:Systems",
"doc_count": 1
}
]
}
}
}
Is there any way in which i can make not analysed query on analysed field in elastic search?
You need to add a not_analyzed sub-field to your brands fields and then aggregate on that field.
PUT /index/_mapping/type
{
"properties": {
"brands": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
Then you need to fully reindex your data in order to populate the new sub-fields brands.raw.
Finally, you can change your query to this:
POST index/_search
{
"size": 0,
"aggs": {
"filtered_aggs": {
"filter": {
"range": {
"#timestamp":{
"gte":"2016-07-18T14:23:41.459Z",
"lte":"2016-07-18T14:53:10.017Z"
}
}
},
"aggs": {
"execute_time": {
"terms": {
"field": "brands.raw",
"size": 0
}
}
}
}
}
}

Broken aggregation in elasticsearch

I'm getting erroneous results on performing terms aggregation in the field names in the index.
The following is the mappings I have used to the names field:
{
"dbnames": {
"properties": {
"names": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
Here is the results I'm getting for a simple terms aggregation on the field:
"aggregations": {
"names": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "John Martin",
"doc_count": 1
},
{
"key": "John martin",
"doc_count": 1
},
{
"key": " Victor Moses",
"doc_count": 1
}
]
}
}
As you can see, I have the same names with different casings being shown as different buckets in the aggregation. What I want here is irrespective of the case, the names should be clubbed together.
The easiest way would be to make sure you properly case the value of your names field at indexing time.
If that is not an option, the other way to go about it is to define an analyzer that will do it for you and set that analyzer as index_analyzer for the names field. Such a custom analyzer would need to use the keyword tokenizer (i.e. take the whole value of the field as a single token) and the lowercase token filter (i.e. lowercase the value)
curl -XPUT localhost:9200/your_index -d '{
"settings": {
"index": {
"analysis": {
"analyzer": {
"casing": { <--- custom casing analyzer
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
}
}
}
},
"mappings": {
"your_type": {
"properties": {
"names": {
"type": "string",
"index_analyzer": "casing" <--- use your custom analyzer
}
}
}
}
}'
Then we can index some data:
curl -XPOST localhost:9200/your_index/your_type/_bulk -d '
{"index":{}}
{"names": "John Martin"}
{"index":{}}
{"names": "John martin"}
{"index":{}}
{"names": "Victor Moses"}
'
And finally the terms aggregation on the names field would return your the expected results:
curl -XPOST localhost:9200/your_index/your_type/_search-d '{
"size": 0,
"aggs": {
"dbnames": {
"terms": {
"field": "names"
}
}
}
}'
Results:
{
"dbnames": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "john martin",
"doc_count": 2
},
{
"key": "victor moses",
"doc_count": 1
}
]
}
}
There are 2 options here
Use not_analyzed option - This one has a disadvantage that same
string with different cases wont be seen as on
keyword tokenizer + lowercase filter - This one does not have the
above issue
I have neatly outlined these two approaches and how to use them here - https://qbox.io/blog/elasticsearch-aggregation-custom-analyzer

Elasticsearch Query aggregated by unique substrings (email domain)

I have an elasticsearch query that queries over an index and then aggregates based on a specific field sender_not_analyzed. I then use a term aggregation on that same field sender_not_analyzed which returns buckets for the top "senders". My query is currently:
{
"size": 0,
"query": {
"regexp": {
"sender_not_analyzed": ".*[#].*"
}
},
"aggs": {
"sender-stats": {
"terms": {
"field": "sender_not_analyzed"
}
}
}
}
which returns buckets that look like:
"aggregations": {
"sender-stats": {
"buckets": [
{
"key": "<Mike <mike#fizzbuzz.com>#MISSING_DOMAIN>",
"doc_count": 5017
},
{
"key": "jon.doe#foo.com",
"doc_count": 3963
},
{
"key": "jane.doe#foo.com",
"doc_count": 2857
},
{
"key": "jon.doe#bar.com",
"doc_count":1544
}
How can I write an aggregation such that I get single bucket for each unique email domain, eg foo.com would have a doc_count of (3963 + 2857) 6820? Can I accomplish this with a regex aggregation or do I need to write some kind of custom analyzer to split the string at the # to the end of string?
This is pretty late, but I think this can be done by using pattern_replace char filter, you capture the domain name with regex, This is my setup
POST email_index
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"char_filter": [
"domain"
],
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
]
}
},
"char_filter": {
"domain": {
"type": "pattern_replace",
"pattern": ".*#(.*)",
"replacement": "$1"
}
}
}
},
"mappings": {
"your_type": {
"properties": {
"domain": {
"type": "string",
"analyzer": "my_custom_analyzer"
},
"sender_not_analyzed": {
"type": "string",
"index": "not_analyzed",
"copy_to": "domain"
}
}
}
}
}
Here domain char filter will capture the domain name, we need to use keyword tokenizer to get the domain as it is, I am using lowercase filter but it is up to you if you want to use it or not. Using copy_to parameter to copy the value of the sender_not_analyzed to domain field, although _source field won't be modified to include this value but we can query it.
GET email_index/_search
{
"size": 0,
"query": {
"regexp": {
"sender_not_analyzed": ".*[#].*"
}
},
"aggs": {
"sender-stats": {
"terms": {
"field": "domain"
}
}
}
}
This will give you desired result.

Resources