Group by field, sort and get the first (or last, whatever) items of the group in MongoDB (with Spring Data)

I have the following entity (getters, setters and constructor omitted)
public class Event {
    @Id
    private String id;
    private String internalUuid;
    private EventType eventType;
}
EventType is an enum containing arbitrary event types:
public enum EventType {
    ACCEPTED,
    PROCESSED,
    DELIVERED;
}
My problem is that I have a collection with a lot of events, some sharing the same internalUuid but having different statuses. I need a list of Events in which each Event represents the 'newest' status for its internalUuid (ordering by EventType would suffice). Currently I fetch everything, group it into separate lists in code, sort each list by EventType, and then build a new list from the first element of each list.
Example would be as follows.
Data in table:
{ "id": "1", "internalUuid": "1", "eventType": "ACCEPTED" },
{ "id": "2", "internalUuid": "1", "eventType": "PROCESSED" },
{ "id": "3", "internalUuid": "1", "eventType": "DELIVERED" },
{ "id": "4", "internalUuid": "2", "eventType": "ACCEPTED" },
{ "id": "5", "internalUuid": "2", "eventType": "PROCESSED" },
{ "id": "6", "internalUuid": "3", "eventType": "ACCEPTED" }
Output of the query (any order would be ok):
[
{ "id": "3", "internalUuid": "1", "eventType": "DELIVERED" },
{ "id": "5", "internalUuid": "2", "eventType": "PROCESSED" },
{ "id": "6", "internalUuid": "3", "eventType": "ACCEPTED" }
]
It is not guaranteed that a "higher" status also has a "higher" ID.
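For reference, the "by hand" approach described above can be sketched with plain Java streams (simplified types; this sketch assumes the enum's declaration order matches the status order):

```java
import java.util.*;
import java.util.stream.*;

public class ManualGrouping {
    enum EventType { ACCEPTED, PROCESSED, DELIVERED }

    record Event(String id, String internalUuid, EventType eventType) {}

    // Group by internalUuid, then keep the event with the "highest" status
    // per group (relies on the enum's declaration order).
    static List<Event> newestPerUuid(List<Event> events) {
        return events.stream()
                .collect(Collectors.groupingBy(Event::internalUuid))
                .values().stream()
                .map(group -> group.stream()
                        .max(Comparator.comparing(Event::eventType))
                        .orElseThrow())
                .collect(Collectors.toList());
    }
}
```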
How do I do that without doing the whole process by hand? I literally have no idea how to start as I'm very new to MongoDB but haven't found anything that helped me on Google. I'm using Spring Boot and Spring Data.
Thanks!

Okay, I think I have figured it out (thanks to Joe's comment). I'm not 100% sure the code is correct, but it seems to do what I want. I'm open to improvements.
(I had to add a priority field to Event and EventType, because sorting by eventType would otherwise do String-based (alphabetical) sorting on the enum's name):
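A sketch of what that priority field could look like on the enum (the numeric values here are assumptions, not from the original code):

```java
// Hypothetical sketch: each EventType carries a numeric priority so the
// aggregation can sort numerically rather than alphabetically by enum name.
enum EventType {
    ACCEPTED(1),
    PROCESSED(2),
    DELIVERED(3); // highest priority = "newest" status

    private final int priority;

    EventType(int priority) {
        this.priority = priority;
    }

    public int getPriority() {
        return priority;
    }
}
```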
private List<Event> findCandidates() {
    // First, 'match' so that all documents are found
    final MatchOperation getAll = Aggregation.match(new Criteria("_id").ne(null));
    // Then sort by priority
    final SortOperation sort = Aggregation.sort(Sort.by(Sort.Direction.DESC, "priority"));
    // After that, group by internalUuid and push the full document ($$ROOT) so it isn't lost in the next step
    final GroupOperation groupByUuid = Aggregation.group("internalUuid").push("$$ROOT").as("events");
    // Take the first element of each sorted, grouped list (I'm not fully sure what the 'internalUuid' parameter does here and whether I could change it)
    final ProjectionOperation getFirst = Aggregation.project("internalUuid").and("events").arrayElementAt(0).as("event");
    // Finally, map back to Event fields so that .getMappedResults() yields a usable List<Event>
    final ProjectionOperation map = Aggregation.project("internalUuid")
            .and("event._id").as("_id")
            .and("event.internalUuid").as("internalUuid")
            .and("event.eventType").as("eventType")
            .and("event.priority").as("priority");
    final Aggregation aggregation = Aggregation.newAggregation(getAll, sort, groupByUuid, getFirst, map);
    final AggregationResults<Event> aggregationResults =
            mongoTemplate.aggregateAndReturn(Event.class).by(aggregation).all();
    return aggregationResults.getMappedResults();
}

Related

Is it possible to round a number with Spring data MongoDB aggregations?

I have the following aggregation pipeline to calculate the top rated brands from a collection of phones with their reviews embedded.
public Document findTopRatedBrands(int minReviews, int results) {
    UnwindOperation unwindOperation = unwind("reviews");
    GroupOperation groupOperation = group("$brand")
            .avg("$reviews.rating").as("avgRating")
            .count().as("numReviews");
    MatchOperation matchOperation = match(new Criteria("numReviews").gte(minReviews));
    SortOperation sortOperation = sort(Sort.by(Sort.Direction.DESC, "avgRating", "numReviews"));
    LimitOperation limitOperation = limit(results);
    ProjectionOperation projectionOperation = project()
            .andExpression("_id").as("brand")
            .andExpression("avgRating").as("rating")
            .andExpression("numReviews").as("reviews")
            .andExclude("_id");
    Aggregation aggregation = newAggregation(unwindOperation, groupOperation, matchOperation,
            sortOperation, limitOperation, projectionOperation);
    AggregationResults<Phone> result = mongoOperations.aggregate(aggregation, "phones", Phone.class);
    return result.getRawResults();
}
An example of document in the phones collection is this:
{
  "_id": { "$oid": "61e1cc8f452d0aef89d9125f" },
  "brand": "Samsung",
  "name": "Samsung Galaxy S7",
  "releaseYear": 2016,
  "reviews": [{
    "_id": { "$oid": "61d4403b86913bee0245c171" },
    "rating": 2,
    "dateOfReview": { "$date": "2019-12-24T00:00:00.000Z" },
    "title": "Won't do that again.",
    "body": "I could not use with my carrier. Sent it back.",
    "username": "bigrabbit324"
  }]
}
I would like to sort by avgRating (rounded to the first decimal place) and secondly by the number of reviews. Right now the average rating is not rounded, so it almost always yields distinct values and the secondary sort by number of reviews never kicks in. I have seen the ArithmeticOperators.Round class, but I don't understand how to include it here, if that's possible.
An example of result is the following:
[Document{{brand=Nokia, rating=3.25, reviews=4}}]
I would like to have 3.2 as rating.
This works in Mongo Compass:
$project: {
  _id: 0,
  brand: '$_id',
  rating: { $round: ['$avgRating', 1] }
}
Try:
ProjectionOperation roundAverageRating = Aggregation.project("avgRating", "numReviews")
        .and(ArithmeticOperators.Round.roundValueOf("avgRating").place(1))
        .as("avgRatingRounded");
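To see why the rounding matters for the sort, here is the same ordering logic in plain Java (an illustration only; the names are made up, and note that Math.round rounds half up while MongoDB's $round rounds half to even, so they can differ on exact ties):

```java
import java.util.*;

public class BrandSort {
    record Brand(String name, double avgRating, int numReviews) {}

    // Round to one decimal place. Note: Math.round rounds half up, whereas
    // MongoDB's $round rounds half to even; they agree except on exact ties.
    static double round1(double v) {
        return Math.round(v * 10.0) / 10.0;
    }

    // Sort by rounded rating (descending), then by review count (descending),
    // mirroring the intended $sort after the $round projection.
    static List<Brand> topRated(List<Brand> brands) {
        List<Brand> sorted = new ArrayList<>(brands);
        sorted.sort(Comparator.comparingDouble((Brand b) -> round1(b.avgRating()))
                .reversed()
                .thenComparing(Comparator.comparingInt(Brand::numReviews).reversed()));
        return sorted;
    }
}
```

With unrounded ratings, 3.24 and 3.21 compare as different values; after rounding both become 3.2 and the review count breaks the tie.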

How to create a HashMap with custom object as a key?

In Elasticsearch, I have an object that contains an array of objects. Each object in the array have type, id, updateTime, value fields.
My input parameter is an array that contains objects of the same type but with different values and update times. I'd like to update the objects with the new value when they already exist and create new ones when they don't.
I'd like to use a Painless script to update those but keep them distinct, since some of them may overlap. The issue is that I need to use both type and id to keep them unique. So far I've done it with a brute-force approach, a nested for loop comparing elements of both arrays, but I'm not too happy with that.
One of the ideas is to take array from source, build temporary HashMap for fast lookup, process input and later store all objects back into source.
Can I create a HashMap with a custom object (a class with type and id) as a key? If so, how do I do it? I can't add a class definition to the script.
Here's the mapping. All fields are disabled ("enabled": false) as I use them only as intermediate state and query using other fields.
{
  "properties": {
    "arrayOfObjects": {
      "properties": {
        "typ": { "enabled": false },
        "id": { "enabled": false },
        "value": { "enabled": false },
        "updated": { "enabled": false }
      }
    }
  }
}
Example doc.
{
  "arrayOfObjects": [
    { "typ": "a", "id": "1", "updated": "2020-01-02T10:10:10Z", "value": "yes" },
    { "typ": "a", "id": "2", "updated": "2020-01-02T11:11:11Z", "value": "no" },
    { "typ": "b", "id": "1", "updated": "2020-01-02T11:11:11Z" }
  ]
}
And finally, part of the script in its current form. The script does some other things too, so I've stripped those out for brevity.
if (ctx._source.arrayOfObjects == null) {
    ctx._source.arrayOfObjects = new ArrayList();
}
for (obj in params.inputObjects) {
    def found = false;
    for (existingObj in ctx._source.arrayOfObjects) {
        if (obj.typ == existingObj.typ && obj.id == existingObj.id && isAfter(obj.updated, existingObj.updated)) {
            existingObj.updated = obj.updated;
            existingObj.value = obj.value;
            found = true;
            break;
        }
    }
    if (!found) {
        ctx._source.arrayOfObjects.add([
            "typ": obj.typ,
            "id": obj.id,
            "value": params.inputValue,
            "updated": obj.updated
        ]);
    }
}
There's nothing technically wrong with your approach.
A HashMap could potentially save some lookup time, but since you're scripting, you're already bound to the engine's innate inefficiencies. By the way, here's how you initialize and work with HashMaps.
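Since you can't add a class definition to the script, a common workaround is a composite string key such as "typ|id". Sketched below in Java for clarity; Painless supports the same Map/HashMap calls (the helper names here are made up):

```java
import java.util.*;

public class CompositeKeyLookup {
    // Build a lookup keyed by "typ|id" so membership checks are O(1)
    // instead of the nested-loop scan. The "|" separator is an assumption;
    // any character that cannot appear in typ works.
    static Map<String, Map<String, Object>> index(List<Map<String, Object>> existing) {
        Map<String, Map<String, Object>> byKey = new HashMap<>();
        for (Map<String, Object> obj : existing) {
            byKey.put(obj.get("typ") + "|" + obj.get("id"), obj);
        }
        return byKey;
    }

    // Upsert: replace the entry when the key exists, add it when it doesn't.
    // (The isAfter timestamp check from the original script is omitted here.)
    static void upsert(Map<String, Map<String, Object>> byKey, Map<String, Object> incoming) {
        byKey.put(incoming.get("typ") + "|" + incoming.get("id"), incoming);
    }
}
```

After processing the input this way, the map's values can be written back to the source array.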
Another approach would be to rethink your data structure: instead of an array of objects, use a keyed object or similar, since arrays of objects aren't great for frequent updates.
Finally, a tip: you said these fields are only used to store intermediate state. If that weren't the case (or won't be in the future), I'd recommend the nested field type so that each object in the array can be queried independently of the others.

is there any way where i can apply group and pagination using createQuery?

Query like this,
http://localhost:3030/dflowzdata?$skip=0&$group=uuid&$limit=2
and dflowzdata service contains data like,
[
  { "uuid": 123456, "id": 1 },
  { "uuid": 123456, "id": 2 },
  { "uuid": 7890, "id": 3 },
  { "uuid": 123456, "id": 4 },
  { "uuid": 4567, "id": 5 }
]
My before find hook looks like this:
if (hook.params.query.$group !== undefined) {
  let value = hook.params.query.$group
  delete hook.params.query.$group
  const query = hook.service.createQuery(hook.params.query)
  hook.params.rethinkdb = query.group(value)
}
It gives the correct result but without pagination; I need only two records, but it gives me all of them.
result is,
{
  "total": [
    { "group": "123456", "reduction": 3 },
    { "group": "7890", "reduction": 1 },
    { "group": "4567", "reduction": 3 }
  ],
  "data": [
    { "group": "123456", "reduction": [ { "uuid": "123456", "id": 1 }, { "uuid": "123456", "id": 2 }, { "uuid": "123456", "id": 4 } ] },
    { "group": "7890", "reduction": [ { "uuid": "7890", "id": 3 } ] },
    { "group": "4567", "reduction": [ { "uuid": "4567", "id": 5 } ] }
  ],
  "limit": 2,
  "skip": 0
}
can anyone help me how should get correct records using $limit?
According to the documentation on data types, ReQL commands called on GROUPED_DATA operate on each group individually. For more details, read the group documentation. So limit won't apply to the result of group.
The documentation for group says: to operate on all the groups rather than operating on each group [...], you can use ungroup to turn a grouped stream or grouped data into an array of objects representing the groups.
Hence, use ungroup before applying functions to group's result:
r.db('db').table('table')
  .group('uuid')
  .ungroup()
  .limit(2)

How to get all maxes from couchbase using map/reduce?

I've got a lot of records like:
{
"id": "1000",
"lastSeen": "2018-02-26T18:49:21.863Z"
}
{
"id": "1000",
"lastSeen": "2017-02-26T18:49:21.863Z"
}
{
"id": "2000",
"lastSeen": "2018-02-26T18:49:21.863Z"
}
{
"id": "2000",
"lastSeen": "2017-02-26T18:49:21.863Z"
}
I'd like to get the most recent record for each id. So in this case the output would be the following (the most recent record for ids 1000 and 2000):
{
"id": "1000",
"lastSeen": "2018-02-26T18:49:21.863Z"
}
{
"id": "2000",
"lastSeen": "2018-02-26T18:49:21.863Z"
}
With N1QL, this would be
SELECT id, MAX(lastSeen) FROM mybucket GROUP BY id
How would I do this using a couchbase view and map/reduce?
Thanks!
I am far from a regular user of map/reduce, and there may be more efficient JavaScript, but try this:
Map
function (doc, meta) {
  emit(doc.id, doc.lastSeen);
}
Reduce
function reduce(key, values, rereduce) {
  var max = values.sort().reverse()[0];
  return max;
}
Filter: ?limit=6&stale=false&connection_timeout=60000&inclusive_end=true&skip=0&full_set=true&group_level=1
The idea is to sort all the values being emitted (lastSeen). Since they are ISO 8601 strings, they can be sorted lexicographically, so sort() works just fine. You want the latest, so that's what the reverse() is for (otherwise you'd get the oldest).
The filter has a group_level of 1, so it will group by the doc.id field.
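The lexicographic-sort trick works because ISO 8601 timestamps in a fixed format and zone compare the same way as the instants they represent. A quick Java illustration of that property (the helper name is made up):

```java
import java.util.*;

public class IsoSortCheck {
    // For ISO 8601 strings in the same zone and with the same precision,
    // lexicographic order equals chronological order, so a plain sort
    // followed by taking the last element (or reversing and taking the
    // first) yields the most recent timestamp.
    static String latest(List<String> isoTimestamps) {
        List<String> sorted = new ArrayList<>(isoTimestamps);
        Collections.sort(sorted);
        return sorted.get(sorted.size() - 1);
    }
}
```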
You can query by descending and reduce to first one on list as below:
Map:
function (doc, meta) {
  emit(doc.id, doc.lastSeen);
}
Reduce:
function reduce(key, values, rereduce) {
  return values[0];
}
Filter:
?inclusive_end=true&skip=0&full_set=&group_level=1&descending=true
This eliminates the overhead of sorting the grouped values inside the reduce function.

Which is the better design for this API response

I'm trying to decide upon the best format of response for my API. I need to return a reports response which provides information on the report itself and the fields contained on it. Fields can be of differing types, such as SelectList, TextArea, Location, etc.
They each use different properties, so "SelectList" might use "Value" to store its string value and "Location" might use "ChildItems" to hold "Longitude" "Latitude" etc.
Here's what I mean:
"ReportList": [
  {
    "Fields": [
      {
        "Id": {},
        "Label": "",
        "Value": "",
        "FieldType": "",
        "FieldBankFieldId": {},
        "ChildItems": [
          {
            "Item": "",
            "Value": ""
          }
        ]
      }
    ]
  }
]
The problem with this is that I'm expecting consumers to know when a value is supposed to be null. So I'd expect someone looking for the value of a "Location" to extract it from "ChildItems" and not "Value". The benefit, however, is that it's much easier to query than the alternative, which is the following:
"ReportList": [
  {
    "Fields": [
      {
        "SelectList": [
          {
            "Id": {},
            "Label": "",
            "Value": ""
          }
        ],
        "Location": [
          {
            "Id": {},
            "Label": "",
            "Latitude": "",
            "Longitude": "",
            "etc": ""
          }
        ]
      }
    ]
  }
]
So this one is a report list that contains a list of fields, which in turn contains a list for each field type I have (15 or so of them). This is opposed to just having a list of reports with a list of fields carrying a "fieldType" enum, which I think is fairly easy to manipulate.
So the Question: Which format is best for a response? Any alternatives and comments appreciated.
EDIT:
To query all fields by field type in a report and get their values, the first format would go something like this:
foreach (field in fields)
{
    switch (field.fieldType)
    {
        case FieldType.Location:
            var locationValue = field.ChildItems;
            break;
        case FieldType.SelectList:
            var valueSelectList = field.Value;
            break;
    }
}
The second one would be like:
foreach (field in fields)
{
    foreach (location in field.Locations)
    {
        var latitude = location.Latitude;
    }
    foreach (selectList in field.SelectLists)
    {
        var value = selectList.Value;
    }
}
I think the right answer is the first one, with the switch statement. It makes it easier to query for things like: get me the value of the field with this GUID id. It just means putting it through a big switch statement.
I went with the first one because it's easier to query for the most common use case. I'll expect the client code to map it into their own schema if they want to change it.
