I have a collection of documents that belongs to a few authors:
[
{ id: 1, author_id: 'mark', content: [...] },
{ id: 2, author_id: 'pierre', content: [...] },
{ id: 3, author_id: 'pierre', content: [...] },
{ id: 4, author_id: 'mark', content: [...] },
{ id: 5, author_id: 'william', content: [...] },
...
]
I'd like to retrieve and paginate a distinct selection of the best-matching documents, keyed on the author's id:
[
{ id: 1, author_id: 'mark', content: [...], _score: 100 },
{ id: 3, author_id: 'pierre', content: [...], _score: 90 },
{ id: 5, author_id: 'william', content: [...], _score: 80 },
...
]
Here's what I'm currently doing (pseudo-code):
unique_docs = res.results.to_a.uniq{ |doc| doc.author_id }
The problem is pagination: how do I select 20 "distinct" documents?
Some people point to terms facets, but I'm not actually building a tag cloud:
Distinct selection with CouchDB and elasticsearch
http://elasticsearch-users.115913.n3.nabble.com/Getting-Distinct-Values-td3830953.html
Thanks,
Adit
Since Elasticsearch does not at present provide a group_by equivalent, here's my attempt to do it manually. While the ES community works on a direct solution to this problem (probably a plugin), here's a basic approach that covered my needs.
Assumptions:
- I'm looking for relevant content.
- I've assumed the first 300 docs are relevant, so I restrict my search to this selection, regardless of how many of them come from the same few authors.
- For my needs I didn't really need full pagination; a "show more" button updated through ajax was enough.
Drawbacks:
- Results are not precise. Since we take 300 docs at a time, we don't know how many unique docs will come out (they could even be 300 docs from the same author!). You should check whether this fits your average number of docs per author, and probably set a limit.
- You need to do 2 queries (paying the remote-call cost twice): the first query asks for the 300 most relevant docs with just two fields, id & author_id; the second query retrieves the full docs for the paginated ids.
Here's some ruby pseudo-code: https://gist.github.com/saxxi/6495116
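As a rough sketch of that two-query flow in plain JavaScript (the gist above is the Ruby original; the hit shapes here are illustrative):

```javascript
// Step 1's response: the ~300 most relevant hits, already sorted by score,
// with only the id & author_id fields requested.
// This helper keeps the best (first) doc per author, then slices one page;
// step 2 would fetch the full documents for the returned ids.
function pageOfUniqueIds(hits, page, perPage) {
  const seen = new Set();
  const uniqueIds = [];
  for (const hit of hits) {
    if (!seen.has(hit.author_id)) {
      seen.add(hit.author_id);
      uniqueIds.push(hit.id);
    }
  }
  return uniqueIds.slice(page * perPage, (page + 1) * perPage);
}
```

Because the dedupe happens client-side, the page size you get back per author-unique slice is only bounded by the 300-doc window, which is exactly the imprecision noted above.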
Update: the 'group_by' issue has been addressed; you can use this feature as of Elasticsearch 1.3.0 (#6124).
If you run the following query,
{
"aggs": {
"user_count": {
"terms": {
"field": "author_id",
"size": 0
}
}
}
}
you will get a result like this:
{
"took" : 123,
"timed_out" : false,
"_shards" : { ... },
"hits" : { ... },
"aggregations" : {
"user_count" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "mark",
"doc_count" : 87350
}, {
"key" : "pierre",
"doc_count" : 41809
}, {
"key" : "william",
"doc_count" : 24476
} ]
}
}
}
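Note that a bare terms aggregation only gives you per-author counts. To pull back the single best-matching document per author, you can nest a top_hits aggregation (also introduced in 1.3.0) under it; a sketch, with arbitrary aggregation names:

```json
{
  "aggs": {
    "authors": {
      "terms": { "field": "author_id" },
      "aggs": {
        "best_doc": {
          "top_hits": { "size": 1 }
        }
      }
    }
  }
}
```

Paginating over buckets remains awkward with this approach; much later releases (6.1+) added the composite aggregation for paging through buckets.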
Related
Say I am creating a search engine for a photo sharing social network and the documents of the site have the following schema
{
"id": 123456
"name": "Foo",
"num_followers": 123456,
"num_photos": 123456
}
I would like my search results to satisfy the following requirements:
Only have results where the search query string matches the "name" field in the document
Rank the search results by number of followers descending
In the case where multiple customers have the same number of followers, rank by number of photos descending
For example, say I have the following documents in my index:
{
"id": 1,
"name": "Customer",
"num_followers": 3,
"num_photos": 27
}
{
"id": 2,
"name": "Customer",
"num_followers": 25,
"num_photos": 1
}
{
"id": 3,
"name": "Customer",
"num_followers": 8,
"num_photos": 2
}
{
"id": 4,
"name": "Customer",
"num_followers": 8,
"num_photos": 5
}
{
"id": 5,
"name": "FooBar",
"num_followers": 10000,
"num_photos": 20000
}
If I search "Customer" in the search bar of the site, the ES hits should be in the following order:
{
"id": 2,
"name": "Customer",
"num_followers": 25,
"num_photos": 1
}
{
"id": 4,
"name": "Customer",
"num_followers": 8,
"num_photos": 5
}
{
"id": 3,
"name": "Customer",
"num_followers": 8,
"num_photos": 2
}
{
"id": 1,
"name": "Customer",
"num_followers": 3,
"num_photos": 27
}
I'm assuming I will need to perform some sort of compound query to create this "tiebreaker" logic. What clauses should I be using? If anyone has an example of something similar, that would be amazing. Thanks in advance.
This sounds like a pretty standard sorting use case. Elasticsearch can sort on multiple fields in a predefined priority order; see the documentation on sorting.
GET /my_index/_search
{
"sort" : [
{ "num_followers" : {"order" : "desc"}},
{ "num_photos" : "desc" }
],
"query" : {
"term" : { "name" : "Customer" }
}
}
Obviously this is just a simple term query -- you may want that to be a keyword search instead based on the wording of your question.
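If it helps to see the tiebreaker outside ES, the same two-level ordering can be sketched as a plain comparator over the sample documents from the question:

```javascript
// Comparator mirroring the sort clause: followers desc, then photos desc.
function byFollowersThenPhotos(a, b) {
  if (b.num_followers !== a.num_followers) {
    return b.num_followers - a.num_followers;
  }
  return b.num_photos - a.num_photos;
}

const customers = [
  { id: 1, num_followers: 3,  num_photos: 27 },
  { id: 2, num_followers: 25, num_photos: 1 },
  { id: 3, num_followers: 8,  num_photos: 2 },
  { id: 4, num_followers: 8,  num_photos: 5 },
];
const ordered = customers.slice().sort(byFollowersThenPhotos).map(c => c.id);
// ordered is [2, 4, 3, 1], matching the expected hit order
```

The second sort key only ever fires when the first compares equal, which is exactly the semantics of listing two fields in the ES sort array.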
I have an index, invoices, that I need to aggregate into yearly buckets then sort.
I have succeeded in using Bucket Sort to sort my buckets by simple sum values (revenue and tax). However, I am struggling to sort by more deeply nested doc_count values (status).
I want to order my buckets not only by revenue, but also by the number of docs with a status field equal to 1, 2, 3 etc...
The documents in my index looks like this:
"_source": {
"created_at": "2018-07-07T03:11:34.327Z",
"status": 3,
"revenue": 68.474,
"tax": 6.85,
}
I request my aggregations like this:
const params = {
index: 'invoices',
size: 0,
body: {
aggs: {
sales: {
date_histogram: {
field: 'created_at',
interval: 'year',
},
aggs: {
total_revenue: { sum: { field: 'revenue' } },
total_tax: { sum: { field: 'tax' } },
statuses: {
terms: {
field: 'status',
},
},
sales_bucket_sort: {
bucket_sort: {
sort: [{ total_revenue: { order: 'desc' } }],
},
},
},
},
},
},
}
The response (truncated) looks like this:
"aggregations": {
"sales": {
"buckets": [
{
"key_as_string": "2016-01-01T00:00:00.000Z",
"key": 1451606400000,
"doc_count": 254,
"total_tax": {
"value": 735.53
},
"statuses": {
"sum_other_doc_count": 0,
"buckets": [
{
"key": 2,
"doc_count": 59
},
{
"key": 1,
"doc_count": 58
},
{
"key": 5,
"doc_count": 57
},
{
"key": 3,
"doc_count": 40
},
{
"key": 4,
"doc_count": 40
}
]
},
"total_revenue": {
"value": 7355.376005351543
}
},
]
}
}
I want to sort by key: 1, for example. Order the buckets according to which one has the greatest number of docs with a status value of 1. I tried to order my terms aggregation, then specify the desired key like this:
statuses: {
terms: {
field: 'status',
order: { _key: 'asc' },
},
},
sales_bucket_sort: {
bucket_sort: {
sort: [{ 'statuses.buckets[0]._doc_count': { order: 'desc' } }],
},
},
However this did not work. It didn't error, it just doesn't seem to have any effect.
I noticed someone else on SO had a similar question many years ago, but I was hoping a better answer had emerged since then: Elasticsearch aggregation. Order by nested bucket doc_count
Thanks!
Never mind, I figured it out. I added a separate filter aggregation like this:
aggs: {
total_revamnt: { sum: { field: 'revamnt' } },
total_purchamnt: { sum: { field: 'purchamnt' } },
approved_invoices: {
filter: {
term: {
status: 1,
},
},
  },
},
Then I was able to bucket sort that value like this:
sales_bucket_sort: {
bucket_sort: {
sort: [{ 'approved_invoices>_count': { order: 'asc' } }],
},
},
In case anyone comes across this issue again: as of Elasticsearch 7.10, it can also work this way:
sales_bucket_sort: {
bucket_sort: {
sort: [{ '_count': { order: 'asc' } }],
},
}
With only _count specified, it will automatically take the doc_count and sort accordingly.
I believe this answer will just sort by the doc_count of the date_histogram aggregation, not the nested sort.
JP's answer works: create a filter aggregation on the target field and value, then sort by it.
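Putting those pieces together, the whole aggs body with the filter sub-aggregation and a bucket_sort over its count might look like this (a sketch using the field names from the original question):

```javascript
aggs: {
  sales: {
    date_histogram: { field: 'created_at', interval: 'year' },
    aggs: {
      total_revenue: { sum: { field: 'revenue' } },
      total_tax: { sum: { field: 'tax' } },
      // count only the docs in each yearly bucket whose status is 1
      approved_invoices: { filter: { term: { status: 1 } } },
      sales_bucket_sort: {
        bucket_sort: {
          // '>' steps into the sub-aggregation; _count is its doc count
          sort: [{ 'approved_invoices>_count': { order: 'desc' } }],
        },
      },
    },
  },
},
```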
I've started the process of learning Elasticsearch and I was wondering if somebody could help me shortcut the process by providing some examples of how I would build a couple of queries.
Here's my example schema...
PUT /sales/_mapping
{
"sale": {
"properties": {
"productCode: {"type":"string"},
"productTitle": {"type": "string"},
"quantity" : {"type": "integer"},
"unitPrice" : {"type": double}
}
}
}
POST /sales/1
{"productCode": "A", "productTitle": "Widget", "quantity": 5, "unitPrice": 5.50}
POST /sales/2
{"productCode": "B", "productTitle": "Gizmo", "quantity": 10, "unitPrice": 1.10}
POST /sales/3
{"productCode": "C", "productTitle": "Spanner", "quantity": 5, "unitPrice": 9.00}
POST /sales/4
{"productCode": "A", "productTitle": "Widget", "quantity": 15, "unitPrice": 5.40}
POST /sales/5
{"productCode": "B", "productTitle": "Gizmo", "quantity": 20, "unitPrice": 1.00}
POST /sales/6
{"productCode": "B", "productTitle": "Gizmo", "quantity": 30, "unitPrice": 0.90}
POST /sales/7
{"productCode": "B", "productTitle": "Gizmo", "quantity": 40, "unitPrice": 0.80}
POST /sales/8
{"productCode": "C", "productTitle": "Spanner", "quantity": 100, "unitPrice": 7.50}
POST /sales/9
{"productCode": "C", "productTitle": "Spanner", "quantity": 200, "unitPrice": 5.50}
What query would I need to generate the following results?
a) Show the number of documents grouped by product code:
Product code Title Count
A Widget 2
B Gizmo 4
C Spanner 3
b) Show the total units sold by product code, i.e.
Product code Title Total units sold
A Widget 20
B Gizmo 100
C Spanner 305
TIA
You can accomplish that using aggregations, in particular terms aggregations, and it can be done in just one request by including them within your query structure. To instruct ES to generate analytic data based on aggregations, include the aggregations object (or aggs) and specify within it the type of aggregation you would like ES to run over your data.
{
"query": {
"match_all": {}
},
"aggs": {
"group_by_product": {
"terms": {
"field": "productCode"
},
"aggs": {
"units_sold": {
"sum": {
"field": "quantity"
}
}
}
}
}
}
By running that query, besides the resulting hits from your search (in this case we are doing a match_all), an additional object will be included within the response object, holding the corresponding resulting aggregations. For example:
{
...
"hits": {
"total": 6,
"max_score": 1,
"hits": [ ... ]
},
"aggregations": {
"group_by_product": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "b",
"doc_count": 3,
"units_sold": {
"value": 60
}
},
{
"key": "a",
"doc_count": 2,
"units_sold": {
"value": 20
}
},
{
"key": "c",
"doc_count": 1,
"units_sold": {
"value": 5
}
}
]
}
}
}
I omitted some details from the response object for brevity, to highlight the important part, which is the aggregations object. You can see how the aggregated data consists of different buckets, each representing a distinct product type (identified by the key field) found in your documents; doc_count holds the number of occurrences per product type, and the units_sold object holds the total sum of units sold for each product type.
One important thing to keep in mind is that in order to perform aggregations on string or text fields, you need to enable the fielddata setting in your field mapping, as that setting is disabled by default on all text-based fields. To update the mapping, e.g. for the productCode field, you just need to send a PUT request to the corresponding mapping type within the index, for example
PUT http://localhost:9200/sales/sale/_mapping
{
"properties": {
"productCode": {
"type": "string",
"fielddata": true
}
}
}
(more info about the fielddata setting)
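As a sanity check on the expected numbers, the grouping that the terms + sum aggregation performs can be reproduced in a few lines of JavaScript over the sample documents:

```javascript
// The nine sample sale docs from the question (trimmed to the two
// fields the aggregations use).
const sales = [
  { productCode: 'A', quantity: 5 },  { productCode: 'B', quantity: 10 },
  { productCode: 'C', quantity: 5 },  { productCode: 'A', quantity: 15 },
  { productCode: 'B', quantity: 20 }, { productCode: 'B', quantity: 30 },
  { productCode: 'B', quantity: 40 }, { productCode: 'C', quantity: 100 },
  { productCode: 'C', quantity: 200 },
];

// Equivalent of the terms aggregation with a nested sum: doc counts and
// total units sold per product code.
const buckets = {};
for (const s of sales) {
  const b = buckets[s.productCode] ||
    (buckets[s.productCode] = { doc_count: 0, units_sold: 0 });
  b.doc_count += 1;
  b.units_sold += s.quantity;
}
// buckets.A -> { doc_count: 2, units_sold: 20 }
// buckets.B -> { doc_count: 4, units_sold: 100 }
// buckets.C -> { doc_count: 3, units_sold: 305 }
```

These match the two tables asked for in the question: counts of 2/4/3 and unit totals of 20/100/305 for A/B/C.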
I have a collection of user documents, where each user can have an arbitrary set of properties. Each user is associated to an app document. Here is an example user:
{
"appId": "XXXXXXX",
"properties": [
{ "name": "age", "value": 30 },
{ "name": "gender", "value": "female" },
{ "name": "alive", "value": true }
]
}
I would like to be able to find/count users based on the values of their properties. For example, find me all users for app X that have property Y > 10 and Z equals true.
I have a compound, multikey index on this collection db.users.ensureIndex({ "appId": 1, "properties.name": 1, "properties.value": 1}). This index is working well for single condition queries, ex:
db.users.find({
appId: 'XXXXXX',
properties: {
$elemMatch: {
name: 'age',
value: {
$gt: 10
}
}
}
})
The above query completes in < 300ms with a collection of 1M users. However, when I try and add a second condition, the performance degrades considerably (7-8s), and the explain() output indicates that the whole index is being scanned to fulfill the query ("nscanned" : 2752228).
Query
db.users.find({
appId: 'XXXXXX',
properties: {
$all: [
{
$elemMatch: {
name: 'age',
value: {
$gt: 10
}
}
},
{
$elemMatch: {
name: 'alive',
value: true
}
}
]
}
})
Explain
{
"cursor" : "BtreeCursor appId_1_properties.name_1_properties.value_1",
"isMultiKey" : true,
"n" : 256,
"nscannedObjects" : 1000000,
"nscanned" : 2752228,
"nscannedObjectsAllPlans" : 1018802,
"nscannedAllPlans" : 2771030,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 21648,
"nChunkSkips" : 0,
"millis" : 7425,
"indexBounds" : {
"appId" : [
[
"XXXXX",
"XXXXX"
]
],
"properties.name" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"properties.value" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"filterSet" : false
}
I assume this is because Mongo is unable to create suitable bounds since I am looking for both boolean and integer values.
My question is this: Is there a better way to structure my data, or modify my query to improve performance and take better advantage of my index? Is it possible to instruct mongo to treat each condition separately, generate appropriate bounds, and then perform the intersection of the results, instead of scanning all documents? Or is mongo just not suited for this type of use case?
I know this is an old question, but I think it would be much better to structure your data without the "name" and "value" tags:
{
"appId": "XXXXXXX",
"properties": [
{ "age": 30 },
{ "gender: "female" },
{ "alive": true }
]
}
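With that layout, the two-condition query from the question becomes plain dot-notation conditions, each of which can use proper index bounds (a sketch; the index definition is illustrative):

```javascript
// Hypothetical index over the flattened property layout
db.users.ensureIndex({ appId: 1, 'properties.age': 1 });

// Both conditions now name concrete paths instead of name/value pairs
db.users.find({
  appId: 'XXXXXX',
  'properties.age': { $gt: 10 },
  'properties.alive': true
});
```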
I had a collection like this, but with much more data.
{
_id: ObjectId("db759d014f70743495ef1000"),
tracked_item_origin: "winword",
tracked_item_type: "Software",
machine_user: "mmm.mmm",
organization_id: ObjectId("a91864df4f7074b33b020000"),
group_id: ObjectId("20ea74df4f7074b33b520000"),
tracked_item_id: ObjectId("1a050df94f70748419140000"),
tracked_item_name: "Word",
duration: 9540,
}
{
_id: ObjectId("2b769d014f70743495fa1000"),
tracked_item_origin: "http://www.facebook.com",
tracked_item_type: "Site",
machine_user: "gabriel.mello",
organization_id: ObjectId("a91864df4f7074b33b020000"),
group_id: ObjectId("3f6a64df4f7074b33b040000"),
tracked_item_id: ObjectId("6f3466df4f7074b33b080000"),
tracked_item_name: "Facebook",
duration: 7920,
}
I do an aggregation, which returns grouped data like this:
{"_id"=>{"tracked_item_type"=>"Site", "tracked_item_name"=>"Twitter"}, "duration"=>288540},
{"_id"=>{"tracked_item_type"=>"Site", "tracked_item_name"=>"ANoticia"}, "duration"=>237300},
{"_id"=>{"tracked_item_type"=>"Site", "tracked_item_name"=>"Facebook"}, "duration"=>203460},
{"_id"=>{"tracked_item_type"=>"Software", "tracked_item_name"=>"Word"}, "duration"=>269760},
{"_id"=>{"tracked_item_type"=>"Software", "tracked_item_name"=>"Excel"}, "duration"=>204240}
Simple aggregation code:
AgentCollector.collection.aggregate(
{'$match' => {group_id: '20ea74df4f7074b33b520000'}},
{'$group' => {
_id: {tracked_item_type: '$tracked_item_type', tracked_item_name: '$tracked_item_name'},
duration: {'$sum' => '$duration'}
}},
{'$sort' => {
'_id.tracked_item_type' => 1,
duration: -1
}}
)
Is there a way to limit to only 2 items per tracked_item_type key? E.g. 2 Sites and 2 Softwares.
As your question currently stands, it's unclear. I really hope you mean that you want to specify two Site keys and two Software keys, because that's a nice and simple answer that you can just add to your $match phase, as in:
{$match: {
group_id: "20ea74df4f7074b33b520000",
tracked_item_name: {$in: ['Twitter', 'Facebook', 'Word', 'Excel' ] }
}},
And we can all cheer and be happy ;)
If however your question is something more diabolical, such as getting the top 2 Site and Software entries from the result by duration, then we thank you very much for spawning this abomination.
Warning:
Your mileage may vary on what you actually want to do, or on whether this is going to blow up from the sheer size of your results. But here is an example of what you are in for:
db.collection.aggregate([
// Match items first to reduce the set
{$match: {group_id: "20ea74df4f7074b33b520000" }},
// Group on the types and "sum" of duration
{$group: {
_id: {
tracked_item_type: "$tracked_item_type",
tracked_item_name: "$tracked_item_name"
},
duration: {$sum: "$duration"}
}},
// Sort by type and duration descending
{$sort: { "_id.tracked_item_type": 1, duration: -1 }},
/* The fun part */
// Re-shape results to "sites" and "software" arrays
{$group: {
_id: null,
sites: {$push:
{$cond: [
{$eq: ["$_id.tracked_item_type", "Site" ]},
{ _id: "$_id", duration: "$duration" },
null
]}
},
software: {$push:
{$cond: [
{$eq: ["$_id.tracked_item_type", "Software" ]},
{ _id: "$_id", duration: "$duration" },
null
]}
}
}},
// Remove the null values for "software"
{$unwind: "$software"},
{$match: { software: {$ne: null} }},
{$group: {
_id: "$_id",
software: {$push: "$software"},
sites: {$first: "$sites"}
}},
// Remove the null values for "sites"
{$unwind: "$sites"},
{$match: { sites: {$ne: null} }},
{$group: {
_id: "$_id",
software: {$first: "$software"},
sites: {$push: "$sites"}
}},
// Project out software and limit to the *top* 2 results
{$unwind: "$software"},
{$project: {
_id: { _id: "$software._id", duration: "$software.duration" },
sites: "$sites"
}},
{$limit : 2},
// Project sites, grouping multiple software per key, requires a sort
// then limit the *top* 2 results
{$unwind: "$sites"},
{$group: {
_id: { _id: "$sites._id", duration: "$sites.duration" },
software: {$push: "$_id" }
}},
{$sort: { "_id.duration": -1 }},
{$limit: 2}
])
Now what that results in is not exactly the clean set of results that would be ideal, but it is something that can be programmatically worked with, and better than filtering the previous results in a loop. (This uses my data from testing.)
{
"result" : [
{
"_id" : {
"_id" : {
"tracked_item_type" : "Site",
"tracked_item_name" : "Digital Blasphemy"
},
"duration" : 8000
},
"software" : [
{
"_id" : {
"tracked_item_type" : "Software",
"tracked_item_name" : "Word"
},
"duration" : 9540
},
{
"_id" : {
"tracked_item_type" : "Software",
"tracked_item_name" : "Notepad"
},
"duration" : 4000
}
]
},
{
"_id" : {
"_id" : {
"tracked_item_type" : "Site",
"tracked_item_name" : "Facebook"
},
"duration" : 7920
},
"software" : [
{
"_id" : {
"tracked_item_type" : "Software",
"tracked_item_name" : "Word"
},
"duration" : 9540
},
{
"_id" : {
"tracked_item_type" : "Software",
"tracked_item_name" : "Notepad"
},
"duration" : 4000
}
]
}
],
"ok" : 1
}
So you see you get the top 2 Sites in the array, with the top 2 Software items embedded in each. Aggregation itself cannot clean this up any further, because we would need to re-merge the items we split apart in order to do so, and as yet there is no operator we could use to perform that action.
But that was fun. It's not all the way done, but most of the way, and making that into a 4 document response would be relatively trivial code. But my head hurts already.
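For what it's worth, the "relatively trivial code" hinted at above could be a small reshaping function like this sketch, which splits the response back into flat top-2 lists (shapes taken from the test output above):

```javascript
// Each element of `result` carries one site in its _id plus the same
// top-2 software array, so we pull the two lists back apart.
function flatten(result) {
  const sites = result.map(r => ({
    tracked_item_name: r._id._id.tracked_item_name,
    duration: r._id.duration,
  }));
  const software = result[0].software.map(s => ({
    tracked_item_name: s._id.tracked_item_name,
    duration: s.duration,
  }));
  return { sites, software };
}
```

Applied to the test result shown above, this yields two sites and two software entries, each with a name and summed duration.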