How to get the count of most frequent pattern in elasticsearch?

I want to get the ten most frequent patterns in a search with Elasticsearch.
Example :
"cgn:4189, dfsdkfldslfs"
"cgn:4210, aezfvdsvgds"
"cgn:4189, fdsmpfjdjs"
"cgn:4195, cvsf"
"cgn:4189, mkpjd"
"cgn:4210, mfsfgkpjd"
I want to get :
4189 : 3
4210 : 2
4195 : 1
I know how to do that in mysql or via awk/sort/head ... but with elasticsearch I'm lost.

Exactly how it will work depends on your analyzer, but if you are just using the default, standard analyzer, you can probably get what you want pretty easily with a terms aggregation.
As a simple example, I set up a trivial index:
PUT /test_index
{
"settings": {
"number_of_shards": 1
}
}
Then indexed the data you posted, using the bulk api:
POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"msg":"cgn:4189, dfsdkfldslfs"}
{"index":{"_id":2}}
{"msg":"cgn:4210, aezfvdsvgds"}
{"index":{"_id":3}}
{"msg":"cgn:4189, fdsmpfjdjs"}
{"index":{"_id":4}}
{"msg":"cgn:4195, cvsf"}
{"index":{"_id":5}}
{"msg":"cgn:4189, mkpjd"}
{"index":{"_id":6}}
{"msg":"cgn:4210, mfsfgkpjd"}
Then I can run a simple terms aggregation to get back all the terms and how often they occur (ordered descending by term frequency by default):
POST /test_index/_search?search_type=count
{
"aggs": {
"msg_terms": {
"terms": {
"field": "msg"
}
}
}
}
which returns:
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 6,
"max_score": 0,
"hits": []
},
"aggregations": {
"msg_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "cgn",
"doc_count": 6
},
{
"key": "4189",
"doc_count": 3
},
{
"key": "4210",
"doc_count": 2
},
{
"key": "4195",
"doc_count": 1
},
{
"key": "aezfvdsvgds",
"doc_count": 1
},
{
"key": "cvsf",
"doc_count": 1
},
{
"key": "dfsdkfldslfs",
"doc_count": 1
},
{
"key": "fdsmpfjdjs",
"doc_count": 1
},
{
"key": "mfsfgkpjd",
"doc_count": 1
},
{
"key": "mkpjd",
"doc_count": 1
}
]
}
}
}
Here is the code I used:
http://sense.qbox.io/gist/a827095b675596c4e3d545ce963cde3fae932156
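Note that the aggregation above returns every token in the field (including "cgn" and the random suffixes), not only the numbers the question asks for. As a rough client-side sketch of what is going on, the snippet below imitates the tokenization with a simple regex (only an approximation of the standard analyzer, not its real implementation), counts how many messages contain each token, and then keeps the all-digit tokens. The helper names doc_counts and top_numeric are made up for illustration.

```python
import re
from collections import Counter

# Messages copied from the question.
messages = [
    "cgn:4189, dfsdkfldslfs",
    "cgn:4210, aezfvdsvgds",
    "cgn:4189, fdsmpfjdjs",
    "cgn:4195, cvsf",
    "cgn:4189, mkpjd",
    "cgn:4210, mfsfgkpjd",
]


def doc_counts(texts):
    """Count, for each token, the number of texts that contain it."""
    counts = Counter()
    for text in texts:
        # Rough stand-in for the standard analyzer (an assumption):
        # lowercase, then split on non-word characters.
        tokens = set(re.findall(r"\w+", text.lower()))
        counts.update(tokens)
    return counts


def top_numeric(counts, n=10):
    """Keep only all-digit tokens and return the n most frequent."""
    numeric = Counter({k: v for k, v in counts.items() if k.isdigit()})
    return numeric.most_common(n)


top = top_numeric(doc_counts(messages))
for key, count in top:
    print(f"{key} : {count}")
```

The same filtering could be applied to the "buckets" list of the real response by keeping only the buckets whose key consists of digits.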

Related

Get count of distinct values for a field across all documents in elastic search

I have a field slices.code in my Elasticsearch mapping. Slices is an array element and slices.code has various values like "ATF", "ORW", "HKL". Slices is not a nested type field, and I want to avoid making it nested. In each document there can be multiple occurrences of slices.code = ATF/ORW. I want to get all possible values of slices.code along with the total number of occurrences of each value across all documents. Something like this, where HKL appears in 2 documents but 3 times in total:
{
"key": "HKL",
"doc_count": 2,
"total": {
"value": 3
}
},
{
"key": "ATF",
"doc_count": 3,
"total": {
"value": 7
}
},
{
"key": "ORW",
"doc_count": 2,
"total": {
"value": 5
}
}
I tried using a terms aggregation, but with that I only get doc_count; I don't get the total number of occurrences of each field value. Below is the query that I tried:
{
"size": 0,
"aggs": {
"distinct_colors": {
"terms": {
"field": "slices.code.keyword",
"size": 65535
}
}
}
}
Output that I received:
"buckets": [
{
"key": "HKG",
"doc_count": 1
},
{
"key": "MNL",
"doc_count": 1
},
{
"key": "PVG",
"doc_count": 1
},
{
"key": "TPE",
"doc_count": 1
}
]
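To make the difference between the two numbers concrete: doc_count is the number of documents that contain a value, while the "total" the question asks for is the number of times the value occurs. The sketch below computes both client-side over a few hypothetical documents (the sample data and the helper name count_codes are invented for illustration). It is not an Elasticsearch query, only an illustration of the two counts.

```python
from collections import Counter

# Hypothetical documents: each has a non-nested "slices" array,
# shaped like the mapping described in the question.
docs = [
    {"slices": [{"code": "HKL"}, {"code": "HKL"}, {"code": "ATF"}]},
    {"slices": [{"code": "HKL"}, {"code": "ORW"}]},
    {"slices": [{"code": "ATF"}, {"code": "ATF"}]},
]


def count_codes(documents):
    """Return (doc_count, total) counters for slices.code."""
    doc_count = Counter()  # number of documents containing each code
    total = Counter()      # number of occurrences of each code
    for doc in documents:
        codes = [s["code"] for s in doc.get("slices", [])]
        total.update(codes)
        doc_count.update(set(codes))  # count each code once per document
    return doc_count, total


doc_count, total = count_codes(docs)
for code in sorted(total):
    print(code, "doc_count:", doc_count[code], "total:", total[code])
```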

Group results returned by elasticsearch query based on query terms

I am very new to Elasticsearch. I am having trouble building a query. My document structure is like:
{
latlng: {
lat: '<some-latitude>',
lon: '<some-longitude>'
},
gmap_result: {<Some object>}
}
I am searching on a list of lat-long coordinates. For each coordinate, I fetch the results that are within 100m. I have been able to do this part. The tricky part is that I do not know which results in the output correspond to which query coordinate. I think this requires aggregations at some level, but I am currently clueless about how to proceed.
An aggregate query is the correct approach. You can learn about them here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
An example is below. In this example, I am using a match query to find all instances of the word test in the field title and then aggregating the field status to count the number of results with the word test that are in each status.
GET /my_index/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"title": "test"
}
}
]
}
},
"aggs": {
"count_by_status": {
"terms": {
"field": "status"
}
}
},
"size": 0
}
The results look like this:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 346,
"max_score": 0,
"hits": []
},
"aggregations": {
"count_by_status": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Open",
"doc_count": 283
},
{
"key": "Completed",
"doc_count": 36
},
{
"key": "On Hold",
"doc_count": 12
},
{
"key": "Withdrawn",
"doc_count": 10
},
{
"key": "Declined",
"doc_count": 5
}
]
}
}
}
If you provide your query, it would help us give a more specific aggregate query for you to use.
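Without seeing the original query, one possible shape for the lat-long case is a filters aggregation with one named geo_distance filter per coordinate, so that each bucket name tells you which coordinate its results belong to. The sketch below only builds the request body as a Python dict. It assumes latlng is mapped as a geo_point; the function name, bucket names, and sample coordinates are made up for illustration.

```python
import json


def build_grouped_geo_query(points, distance="100m", field="latlng"):
    """Build a search body with one named geo_distance filter per point.

    points: dict mapping a bucket name to a (lat, lon) tuple.
    """
    named_filters = {}
    for name, (lat, lon) in points.items():
        named_filters[name] = {
            "geo_distance": {
                "distance": distance,
                field: {"lat": lat, "lon": lon},
            }
        }
    return {
        "size": 0,  # only the aggregation buckets are needed
        "aggs": {
            "by_point": {
                "filters": {"filters": named_filters},
                # top_hits returns a few matching documents per bucket
                "aggs": {"nearby": {"top_hits": {"size": 5}}},
            }
        },
    }


# Hypothetical coordinates, for illustration only.
points = {"point_a": (40.7128, -74.0060), "point_b": (51.5074, -0.1278)}
body = build_grouped_geo_query(points)
print(json.dumps(body, indent=2))
```

Each bucket under by_point in the response would then carry the name of the coordinate it matched, along with its own doc_count and hits.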

Boosting elastic aggregation result

I have an Elasticsearch index for products. Each product has a Brand attribute, and I "have to" create an aggregation that returns the Brands of the products.
My Sample Query:
GET /products/product/_search
{
"size": 0,
"aggs": {
"myFancyFilter": {
"filter": {
"match_all": {}
},
"aggs": {
"inner": {
"terms": {
"field": "Brand",
"size": 3
}
}
}
}
},
"query": {
"match_all": {}
}
}
And the result:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 236952,
"max_score": 0,
"hits": []
},
"aggregations": {
"myFancyFilter": {
"doc_count": 236952,
"inner": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 139267,
"buckets": [
{
"key": "Brand1",
"doc_count": 3144
},
{
"key": "Brand2",
"doc_count": 1759
},
{
"key": "Brand3",
"doc_count": 1737
}
]
}
}
}
}
It works perfectly for me. Elasticsearch sorts buckets by doc_count, but I would like to manipulate the bucket order in the result. For example, assume that I have Brand5 and I want to move it up to position #2, so that the result comes back in the order Brand1, Brand5, Brand3.
If it was not in an aggregation, but in a query, I could use function_score, but now, I don't have an idea. Any clues?
What you are looking for is a custom sort order applied to the buckets of the aggregation. I was able to come up with a solution by renaming the aggregation terms as follows:
Brand1 to a_Brand1
Brand5 to b_Brand5
Brand3 to c_Brand3
The terms are then sorted, so the buckets come back in lexicographic order.
This may not be the exact or best solution, but I felt it could help.
Below is the query that I used. Note that my field name is brand, it is a multi-field, and I'm using the field brand.keyword.
POST testdataindex/_search
{
"size":0,
"query":{
"match_all":{
}
},
"aggs":{
"myFancyFilter":{
"filter":{
"match_all":{
}
},
"aggs":{
"inner":{
"terms":{
"script":{
"lang":"painless",
"inline":"if(params.newNames.containsKey(doc['brand.keyword'].value)) { return params.newNames[doc['brand.keyword'].value];} return null;",
"params":{
"newNames":{
"Brand1":"a_Brand1",
"Brand5":"b_Brand5",
"Brand3":"c_Brand3"
}
}
},
"order":{
"_term":"asc"
}
}
}
}
}
}
}
I created sample data with the brand names Brand1, Brand3 and Brand5; the results are shown below. Note the change in the term names.
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 8,
"max_score": 0,
"hits": []
},
"aggregations": {
"myFancyFilter": {
"doc_count": 8,
"inner": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "a_Brand1",
"doc_count": 2
},
{
"key": "b_Brand5",
"doc_count": 4
},
{
"key": "c_Brand3",
"doc_count": 2
}
]
}
}
}
}
Hope it helps!
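A different option from the renaming script is to leave the aggregation alone and reorder the buckets in the client after the response comes back. The sketch below is one way to do that; reorder_buckets and the pinned mapping are hypothetical names, and the doc_count for Brand5 is invented for the example. It assumes the aggregation size is large enough for the pinned brand to appear in the buckets at all.

```python
def reorder_buckets(buckets, pinned):
    """Move selected keys to fixed zero-based positions.

    buckets: list of {"key": ..., "doc_count": ...} dicts, as returned
             by a terms aggregation.
    pinned:  dict mapping a bucket key to its desired zero-based position.
    Buckets that are not pinned keep their original relative order.
    """
    by_key = {b["key"]: b for b in buckets}
    result = [b for b in buckets if b["key"] not in pinned]
    # Insert pinned buckets from the lowest target position upwards.
    for key, position in sorted(pinned.items(), key=lambda kv: kv[1]):
        if key in by_key:
            result.insert(min(position, len(result)), by_key[key])
    return result


buckets = [
    {"key": "Brand1", "doc_count": 3144},
    {"key": "Brand2", "doc_count": 1759},
    {"key": "Brand3", "doc_count": 1737},
    {"key": "Brand5", "doc_count": 900},  # hypothetical count
]
reordered = reorder_buckets(buckets, {"Brand5": 1})
print([b["key"] for b in reordered])
```

This inserts Brand5 at position #2 and shifts the remaining buckets down rather than dropping any of them; trimming the list to three entries afterwards is a separate step.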

Get Percentage of Values in Elasticsearch

I have some test documents that look like
"hits": {
...
"_source": {
"student": "DTWjkg",
"name": "My Name",
"grade": "A"
...
"student": "ggddee",
"name": "My Name2",
"grade": "B"
...
"student": "ggddee",
"name": "My Name3",
"grade": "A"
I want to get the percentage of students that have a grade of B. The result would be "33%", assuming there were only 3 students.
How would I do this in Elasticsearch?
So far I have this aggregation, which I feel like is close:
"aggs": {
"gradeBPercent": {
"terms": {
"field" : "grade",
"script" : "_value == 'B'"
}
}
}
This returns:
"aggregations": {
"gradeBPercent": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "false",
"doc_count": 2
},
{
"key": "true",
"doc_count": 1
}
]
}
}
I'm not necessarily looking for an exact answer, perhaps just the terms and keywords I could google. I've read over the Elasticsearch docs and not found anything that could help.
First off, you shouldn't need a script for this aggregation. If you want to limit your results to documents where `value == 'B'`, you should do that using a filter, not a script.
Elasticsearch won't return a percentage directly, but you can easily calculate one from the result of a terms aggregation.
Example:
GET devdev/audittrail/_search
{
"size": 0,
"aggs": {
"a1": {
"terms": {
"field": "uIDRequestID"
}
}
}
}
That returns:
{
"took": 12,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 25083,
"max_score": 0,
"hits": []
},
"aggregations": {
"a1": {
"doc_count_error_upper_bound": 9,
"sum_other_doc_count": 1300,
"buckets": [
{
"key": 556,
"doc_count": 34
},
{
"key": 393,
"doc_count": 28
},
{
"key": 528,
"doc_count": 15
}
]
}
}
}
So what does that response mean?
The hits.total field is the total number of records matching your query.
The doc_count tells you how many documents are in each bucket.
So for my example here: the key 556 shows up in 34 of 25083 documents, so its percentage is (34 / 25083) * 100, or about 0.14%.
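The same arithmetic can be done for every bucket at once. The snippet below is a minimal sketch that works on a trimmed copy of the response above; bucket_percentages is a made-up helper name, and it assumes hits.total is a plain integer as in this response.

```python
# Trimmed copy of the response shown above.
response = {
    "hits": {"total": 25083},
    "aggregations": {
        "a1": {
            "buckets": [
                {"key": 556, "doc_count": 34},
                {"key": 393, "doc_count": 28},
                {"key": 528, "doc_count": 15},
            ]
        }
    },
}


def bucket_percentages(resp, agg_name):
    """Map each bucket key to its share of hits.total, in percent."""
    total = resp["hits"]["total"]  # assumed to be an integer
    if total == 0:
        return {}
    buckets = resp["aggregations"][agg_name]["buckets"]
    return {b["key"]: 100.0 * b["doc_count"] / total for b in buckets}


percentages = bucket_percentages(response, "a1")
for key, pct in percentages.items():
    print(f"{key}: {pct:.2f}%")
```

For the grade question, the same function would divide the doc_count of the "B" bucket by the total number of students.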

Elasticsearch Ignoring Filter in Aggregations

The request looks something like:
{
"aggs": {
"contentType": {
"terms": {
"field": "contentType",
"size": 0
}
}
},
"query": {...},
"filter": {...}
}
The response looks something like:
{
"took": 300,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 68,
"max_score": 0,
"hits": []
},
"aggregations": {
"contentType": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 9,
"doc_count": 7054
},
{
"key": 9,
"doc_count": 7054
},
{
"key": 5,
"doc_count": 6236
},
{
"key": 4,
"doc_count": 1124
}
]
}
}
}
The "doc_count" in the aggregation is what the results would be without the "filter" and just the "query". The "filter" seems to be ignored.
This was working at some point, but all of a sudden doesn't seem to be working. Anyone have any clue?
Elasticsearch 1.5.2, NEST 1.4.3.
Thanks.
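The filter element used at the top level of a search request has been renamed to post_filter (see https://github.com/elastic/elasticsearch/issues/4119). A post_filter is applied to the search hits only after the aggregations have been calculated, which is why the aggregation doc_count values ignore it. Documentation for post_filter is here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-post-filter.html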
I'm not sure whether it applies or not to your particular query, but you might want to use the filtered query type: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-filtered-query.html
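As a sketch of the filtered-query suggestion, the helper below rewrites a 1.x-style request body so that the top-level filter is folded into a filtered query, which means the aggregations see the filtered document set too. The function name and the sample query and filter are hypothetical (the original post elides both). The filtered query belongs to the 1.x line used in the question; it was deprecated in Elasticsearch 2.0 in favor of a bool query with a filter clause.

```python
import copy


def move_filter_into_query(body):
    """Return a copy of the body with its top-level filter folded
    into a filtered query, so aggregations are filtered as well."""
    new_body = copy.deepcopy(body)
    top_filter = new_body.pop("filter", None)
    if top_filter is None:
        return new_body  # nothing to move
    query = new_body.pop("query", {"match_all": {}})
    new_body["query"] = {"filtered": {"query": query, "filter": top_filter}}
    return new_body


# Hypothetical query and filter, for illustration only.
body = {
    "aggs": {"contentType": {"terms": {"field": "contentType", "size": 0}}},
    "query": {"match": {"title": "report"}},
    "filter": {"term": {"status": "published"}},
}
rewritten = move_filter_into_query(body)
print(rewritten["query"])
```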
