Elasticsearch - aggregation of unique counts

I have an Elasticsearch database of books:
{
"id": 1,
"name": "Animal Farm"
},
{
"id": 2,
"name": "Brave New World"
},
{
"id": 3,
"name": "Nineteen Eighty-Four"
},
{
"id": 4,
"name": "Animal Farm"
},
{
"id": 5,
"name": "We"
}
As you can see, the books with ids 1 and 4 have the conflicting name "Animal Farm". However, they are different books: one is by George Orwell, and the other is literally about farm animals.
I want to know how often book names conflict. For the example above, the expected result is:
{
"conflicts": [
{
"num_of_books": 2,
"count": "1"
},
{
"num_of_books": 1,
"count": "3"
}
]
}
The entry with num_of_books of 2 is the "Animal Farm" conflict, and it happened once (therefore the count is 1). The other 3 books all have different names, so they appear in the entry with num_of_books of 1 and a count of 3. I don't need the names of the books; only the counts matter.
I know SQL can do this with a subquery:
SELECT num_of_books, COUNT(*) AS _count
FROM (
    SELECT COUNT(*) AS num_of_books
    FROM books
    GROUP BY name
) AS t
GROUP BY num_of_books;
I have read the articles on Nested Aggregation and Sub-Aggregations, but I fail to see how they could achieve my goal.
Any comment will help, thanks!

Running aggregations on the results of other aggregations is not yet possible in ES, as far as I know. There are a few outstanding issues about allowing additional logic to be applied to the results of bucket aggregations, but they are still being discussed and debated.
In your case, you can reproduce the inner SQL query with a terms aggregation, using min_doc_count: 2 to get the names of all conflicting books.
{
  "size": 0,
  "aggs": {
    "books": {
      "terms": {
        "field": "name",
        "min_doc_count": 2
      }
    }
  }
}
Then you can parse the buckets on the client side and re-bucket them into new num_of_books buckets depending on their count. For instance, using the head plugin you can add the following code in the Transform section
// Re-bucket on the client side: count how many names share the same doc_count
var num_of_books = {};
root.aggregations.books.buckets.forEach(function(b) {
    num_of_books[b.doc_count] = (num_of_books[b.doc_count] || 0) + 1;
});
return num_of_books;
num_of_books would then contain something like this (note that with min_doc_count: 2 only the conflicting names come back; to also count the non-conflicting names, as in your expected output, leave min_doc_count at its default of 1):
{
"2": 1,
"1": 3
}

Related

Sorting a set of results with pre-ordered items

I have a list of pre-ordered items (order by score ASC) like:
[{
"id": "id2",
"score": 1
}, {
"id": "id12",
"score": 1
}, {
"id": "id8",
"score": 1.4
}, {
"id": "id9",
"score": 1.4
}, {
"id": "id14",
"score": 1.75
}, {
...
}]
Let's say I have an Elasticsearch index with a massive number of items. Note that there's no "score" field in the indexed documents.
Now I want Elasticsearch to return only those items with ids in the said list. OK, this one is easy. I'm now stuck on sorting the results: I need them sorted exactly like my pre-ordered list above.
Any suggestion for me to achieve that?
I'm not a native English speaker, so sorry for my grammar and wording.
As of version 7.4, Elasticsearch has introduced the pinned query, which promotes selected documents to rank higher than those matching a given query. In your case, this search query should return what you want:
GET /_search
{
  "query": {
    "pinned": {
      "ids": ["id2", "id12", "id8"],
      "organic": {
        other queries
      }
    }
  }
}
For more information, you can check the official Elasticsearch documentation here.
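If all you need back are the documents from your pre-ordered list, in exactly that order, a minimal sketch (my assumption, not from the original answer) is to pin the whole id list and use match_none as the organic part, so that only the pinned documents are returned and their ranking follows the list:
# Sketch only: pin the whole pre-ordered id list; match_none keeps everything else out
GET /_search
{
  "query": {
    "pinned": {
      "ids": ["id2", "id12", "id8", "id9", "id14"],
      "organic": {
        "match_none": {}
      }
    }
  }
}
Leave the sort at the default _score so the pinned ranking is preserved.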

How to apply exact match on single field and distinct on multiple fields together in ElasticSearch?

I recently started working with Elasticsearch, and I am trying to search with the following criteria:
I want an exact match on ENAME and a distinct on both EID and ENAME over the data above.
Let's say that for matching I have the string ABC.
So the result should look like this:
[
{"EID" :111, "ENAME" : "ABC"},
{"EID" : 444, "ENAME" : "ABC"}
]
You can achieve this with a combination of a term query and a terms aggregation.
Assuming that you have the following mapping:
PUT my_index
{
"mappings": {
"doc": {
"properties": {
"EID": {
"type": "keyword"
},
"ENAME": {
"type": "keyword"
}
}
}
}
}
And that you have indexed documents like these:
POST my_index/doc/3
{
"EID": "111",
"ENAME": "ABC"
}
POST my_index/doc/4
{
"EID": "222",
"ENAME": "XYZ"
}
POST my_index/doc/12
{
"EID": "444",
"ENAME": "ABC"
}
The query that will do the job might look like this:
POST my_index/doc/_search
{
  "query": {
    "term": {            1️⃣
      "ENAME": "ABC"
    }
  },
  "size": 0,             3️⃣
  "aggregations": {
    "by EID": {
      "terms": {         2️⃣
        "field": "EID"
      }
    }
  }
}
Let me explain how it works:
1️⃣ - the term query asks Elasticsearch to filter on the exact value of the keyword field "ENAME";
2️⃣ - the terms aggregation collects all distinct values of another keyword field, "EID", and returns the N most frequent ones;
3️⃣ - "size": 0 tells Elasticsearch not to return any search hits (we are only interested in the aggregations).
The output of the query will look like this:
{
  "hits": {
    "total": 2,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "by EID": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "111",      <== Here is the first "distinct" value that we wanted
          "doc_count": 3
        },
        {
          "key": "444",      <== Here is another "distinct" value
          "doc_count": 2
        }
      ]
    }
  }
}
The output does not look exactly like what you posted in the question, but I believe it is the closest you can get with Elasticsearch.
However, this output is equivalent:
"ENAME" is implicitly present (since its value was used for filtering);
"EID" is present under the "buckets" of the aggregations section.
Note that under "doc_count" you will find the number of documents having that "EID".
What if I want to do a DISTINCT on several fields?
For a more complex scenario (e.g. when you need to do a distinct on many fields) see this answer.
More information about aggregations is available here.
Hope that helps!

Significant Terms Aggregation of "flat" structures

I am currently trying to prototype a product recommendation system using the Elasticsearch Significant Terms aggregation. So far I haven't found a good example that deals with "flat" JSON structures of sales (here: the itemId) coming from a relational database, such as mine:
Document 1
{
"lineItemId": 1,
"lineNo": 1,
"itemId": 1,
"productId": 1234,
"userId": 4711,
"salesQuantity": 2,
"productPrice": 0.99,
"salesGross": 1.98,
"salesTimestamp": 1234567890
}
Document 2
{
"lineItemId": 1,
"lineNo": 2,
"itemId": 1,
"productId": 1235,
"userId": 4711,
"salesQuantity": 1,
"productPrice": 5.99,
"salesGross": 5.99,
"salesTimestamp": 1234567890
}
I have around 1.5 million of these documents in my Elasticsearch index. A lineItem is part of a sale (identified by itemId), which can consist of 1 or more lineItems. What I would like to receive are, say, the 5 most uncommonly common products which were bought in conjunction with the sale of one specific productId.
The MovieLens example (https://www.elastic.co/guide/en/elasticsearch/guide/current/_significant_terms_demo.html) deals with data in the structure of
{
"movie": [122,185,231,292,
316,329,355,356,362,364,370,377,420,
466,480,520,539,586,588,589,594,616
],
"user": 1
}
so it's unfortunately not really useful to me. I'd be very glad for an example or a suggestion using my "flat" structures. Thanks a lot in advance.
It sounds like you're trying to build an item-based recommender. Apache Mahout has tools to help with collaborative filtering (formerly the Taste project).
There is also a Taste plugin for Elasticsearch 1.5.x which I believe can work with data like yours to produce item-based recommendations.
(Note: This plugin uses Rivers which were deprecated in Elasticsearch 1.5, so I'd check with the authors about plans to support more recent versions of Elasticsearch before adopting this suggestion.)
Since I don't have the amount of data that you do, try this:
First, get the list of itemIds for the sales that contain the productId you want to find "stuff" for:
{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "productId": 1234
        }
      }
    }
  },
  "fields": [
    "itemId"
  ]
}
Then, using this list, create this query:
GET /sales/sales/_search?search_type=count
{
  "query": {
    "filtered": {
      "filter": {
        "terms": {
          "itemId": [1, 2, 3, 4, 5, 6, 7, 11]
        }
      }
    }
  },
  "aggs": {
    "most_sig": {
      "significant_terms": {
        "field": "productId",
        "size": 0
      }
    }
  }
}
If I understand correctly, you have a doc per order line item. What you want is a single doc per order. The order doc should have an array of productIds (or an array of line item objects that each include a productId field); see the sketch below.
That way, when you query for orders containing product X, the sig_terms aggregation should find that product Y is uncommonly common in those orders.
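A rough sketch of that reshaping and of the query against it, assuming a hypothetical orders index with a productIds array (the index name, field names, and the current-style request syntax are my assumptions, not from the question):
# Sketch only: "orders" index and "productIds" array field are assumed names
PUT orders/_doc/1
{
  "itemId": 1,
  "userId": 4711,
  "productIds": [1234, 1235],
  "salesTimestamp": 1234567890
}
# Orders containing product 1234: which other products are uncommonly common in them?
GET orders/_search
{
  "size": 0,
  "query": {
    "term": { "productIds": 1234 }
  },
  "aggs": {
    "also_bought": {
      "significant_terms": {
        "field": "productIds",
        "size": 5
      }
    }
  }
}
In practice you would probably also exclude product 1234 itself from the resulting buckets, since it is trivially "significant" in its own orders.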

Elasticsearch - how to do field collapsing and get Distinct results? (actual records, not just counters)

In a relational DB, our data looks like this:
Company -> Department -> Office
Elasticsearch version of the same data (flattened):
{
"officeID": 123,
"officeName": "office 1",
"state": "CA",
"department": {
"departmentID": 456,
"departmentName": "Department 1",
"company": {
"companyID": 789,
"companyName": "Company 1",
}
}
},{
"officeID": 124,
"officeName": "office 2",
"state": "CA",
"department": {
"departmentID": 456,
"departmentName": "Department 1",
"company": {
"companyID": 789,
"companyName": "Company 1",
}
}}
We need to find a department (or company) given office information (such as state).
For example, since all I need is the department info, I can specify it like this (we are using Nest):
searchDescriptor = searchDescriptor.Source(x => x.Include("department"));
and get all departments with qualifying offices.
The problem is that I am getting multiple "department" records with the same id (one for each office).
We are using paging and sorting.
Would it be possible to get paged and sorted Distinct results?
I have spent a few days trying to find an answer (exploring options like facets, aggregations, top_hits, etc.), but so far the only working option I see is a manual one: get results from Elasticsearch, group the data manually, and pass it to the client. The problem with this approach is obvious: every time I grab the next portion, I'll have to fetch X extra records just in case some of them are duplicates; since I don't know X in advance (and the number of such records could be huge), I will be forced either to fetch lots of data unnecessarily (every time I search) or to hit our search engine several times until I get the required number of records.
So far I have been unable to achieve my goal using aggregations: all I get is a document count, but I want the actual data; when I try to use top_hits, I do get data, but it really is the top hits (sorted by the number of offices per department, ignoring the sorting I specified in the query). Here is an example of the code I tried:
searchDescriptor = searchDescriptor.Aggregations(a => a
.Terms("myunique",
t =>
t.Field("department.departmentID")
.Size(10)
.Aggregations(
x=>x.TopHits("mytophits",
y=>y.Source(true)
.Size(1)
.Sort(k => k.OnField("department.departmentName").Ascending())
)
)
)
);
Does anyone know if Elasticsearch can perform operations like Distinct and get unique records?
Update:
I can get results using top_hits (see below), but in this case I won't be able to use paging (it looks like the Elasticsearch aggregations feature doesn't support paging), so I am back to square one...
{
  "from": 0,
  "size": 33,
  "explain": false,
  "sort": [
    {
      "departmentID": {
        "order": "asc"
      }
    }
  ],
  "_source": {
    "include": [
      "department"
    ]
  },
  "aggs": {
    "myunique": {
      "terms": {
        "field": "department.departmentID",
        "order": {
          "mytopscore": "desc"
        }
      },
      "aggs": {
        "mytophits": {
          "top_hits": {
            "size": 5,
            "_source": {
              "include": [
                "department.departmentID"
              ]
            }
          }
        },
        "mytopscore": {
          "max": {
            "script": "_score"
          }
        }
      }
    }
  },
  "query": {
    "wildcard": { "officeName": "some office*" }
  }
}

Term aggregation consider only the prefix to aggregate

In my Elasticsearch documents I have users and some sort of representation of their place in the organization, for instance:
The CEO is position 1
The ones directly under the CEO will be 1/1, 1/2, 1/3, and so on
The ones under 1/1 will be 1/1/1, 1/1/2, 1/1/3, etc.
I have an aggregation in which I want to aggregate by VP, so I want everybody under 1/1, 1/2, 1/3.
To do that I created a query like this one:
"aggs": {
"information": {
"terms":{
"field": "position",
"script": "_value.replaceAll('(1/1/[0/]*[1-9]).+', '$1')"
}
This takes the position value and replaces it with the group captured by the regex, so everyone under the same subtree ends up with the same prefix, and then I can aggregate on it. However, this has poor performance.
I was thinking about using something like this
"aggs": {
"information": {
"terms":{
"field": "position",
"prefix": "1/1/.*'
}
So I would group everyone that starts with 1/1 (1/1/1/1, 1/1/1/2, 1/1/1/3 would be one group; 1/1/2/1, 1/1/2/2, 1/1/2/3 would be a second group; and so on).
Is it possible?
If you know beforehand how deep a level you want to run this aggregation on, you could simply store the levels in different fields:
{
"name": "Jack",
"own_level": 4,
"level_1": "1",
"level_2": "3",
"level_3": "2",
"level_4": null
}
But this would require many nested terms aggregations to reproduce the hierarchy. This version would make one such aggregation sufficient:
{
"name": "Jack",
"own_level": 4,
"level_1": "1",
"level_2": "1/3",
"level_3": "1/3/2",
"level_4": null
}
It also allows a simpler query if you want to focus on people under, for example, 1/1: filter on the field level_2 and run a terms aggregation on the field level_3, as in the sketch below.
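A minimal sketch of that query, assuming a hypothetical index named org and the level_* fields mapped as keyword:
# Sketch only: "org" index name assumed; level_* fields mapped as keyword
GET org/_search
{
  "size": 0,
  "query": {
    "term": { "level_2": "1/1" }
  },
  "aggs": {
    "units_under_1_1": {
      "terms": { "field": "level_3" }
    }
  }
}
Each bucket key is then one unit directly under 1/1, and its doc_count is the number of people in that subtree.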
If you don't know the maximum level of the hierarchy, you can use nested documents like this, but then queries and aggregations get a bit more complex (a sketch follows the example):
{
"name": "Jack",
"own_level": 4,
"bosses": [
{
"level": 1,
"id": "1"
},
{
"level": 2,
"id": "1/3"
},
{
"level": 3,
"id": "1/3/2"
}
]
}
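With the nested layout, the same "everyone under 1/1, grouped by unit" aggregation might look roughly like this, again assuming a hypothetical org index with bosses mapped as nested (keyword id, integer level):
# Sketch only: assumes "bosses" is a nested field with keyword "id" and integer "level"
GET org/_search
{
  "size": 0,
  "aggs": {
    "bosses": {
      "nested": { "path": "bosses" },
      "aggs": {
        "level_2_units": {
          "filter": { "term": { "bosses.level": 2 } },
          "aggs": {
            "people_per_unit": {
              "terms": { "field": "bosses.id" }
            }
          }
        }
      }
    }
  }
}
Because each person has at most one bosses entry per level, the doc_count of each bosses.id bucket still corresponds to the number of people under that unit.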
