I have a collection that contains id and domain. The same domain is saved multiple times in the collection. I want to aggregate and get a result like:
google.com 4
times.com 5
My code
public List<DomainDTO> domainAggregation() {
    Aggregation pipeline = newAggregation(
            group(fields("id", "domain")),
            group("domain").count().as("count"),
            sort(Sort.Direction.DESC, previousOperation(), "domain")
    );
    AggregationResults<DomainDTO> groupResults = mongoTemplate.aggregate(
            pipeline, Domains.class, DomainDTO.class);
    List<DomainDTO> domainReport = groupResults.getMappedResults();
    return domainReport;
}
DomainDTO consists of:
private String domain;
private Integer count;
The Domain entity consists of:
private String id;
private String searchId;
private String domain;
private Date searchDate;
private String searchName;
private Integer count;
The resulting JSON is:
{
"domain": null,
"count": 2
},
{
"domain": null,
"count": 1
},
{
"domain": null,
"count": 2
},
{
"domain": null,
"count": 48
},
The domain name is not populated, and the results are not sorted. I could not find the bug. Any suggestions?
Your current query outputs something like
{ "$group" : {
"_id" : { "id" : "$id" , "domain" : "$domain"}
} } ,
{ "$group" : { "_id" : "$_id.domain" , "count" : { "$sum" : 1}}} ,
{ "$sort" : { "_id" : -1 , "_id.domain" : -1}}
I believe you're intending something like
{ "$group" : { "_id" : "$domain" , "count" : { "$sum" : 1}}} ,
{ "$sort" : { "_id" : -1}}
Aggregation Java Code:
Aggregation pipeline = newAggregation(
group("domain").count().as("count"),
sort(Sort.Direction.DESC, previousOperation())
);
You will need a $project stage to map _id back to domain in your DomainDTO class.
Aggregation pipeline = newAggregation(
group("domain").count().as("count"),
sort(Sort.Direction.DESC, previousOperation()),
project(bind("domain", "_id")).andExclude("_id").andInclude("count")
);
Mongo Shell
{ "$group" : { "_id" : "$domain" , "count" : { "$sum" : 1}}} ,
{ "$sort" : { "_id" : -1}},
{ "$project" : { "domain" : "$_id" , "_id" : 0 , "count" : 1}
Related
We're creating some packages, but that process is currently rather slow because of the sheer amount of data being sent between microservices. Therefore, I have pruned the information being sent between those microservices and instead want to enrich the documents with the necessary information directly from within Elasticsearch. This gives documents of the following shape:
{
"_index" : "packages-2022.02.28",
"_type" : "_doc",
"_id" : "SG_DH-8019-ao-74783-20220315-12",
"_score" : 1.0,
"_source" : {
"id" : "SG_DH-8019-ao-74783-20220315-12",
"updatedOn" : "2022-02-28T14:45:57.7511562+01:00",
"code" : "SG",
"createdDate" : "2022-02-28T15:17:48.2571391+01:00",
"content" : {
"contentId" : "74783",
"units" : [
{
"id" : "HB_DBL.ST_RO_NFP",
"globalId" : "74783_HB_DBL.ST_RO_NFP",
"globalIntId" : -592692223,
"forPackaging" : false
},
{
"id" : "HB_DBL.ST_BB_NFP",
"globalId" : "74783_HB_DBL.ST_BB_NFP",
"globalIntId" : 446952442,
"forPackaging" : false
},
{
"id" : "HB_DBL.ST_AI_NFP",
"globalId" : "74783_HB_DBL.ST_AI_NFP",
"globalIntId" : -1174348304,
"forPackaging" : false
},
{
"id" : "HB_DBL.SU_RO_NFP",
"globalId" : "74783_HB_DBL.SU_RO_NFP",
"globalIntId" : -2111509049,
"forPackaging" : false
},
{
"id" : "HB_DBL.SU_BB_NFP",
"globalId" : "74783_HB_DBL.SU_BB_NFP",
"globalIntId" : 307969427,
"forPackaging" : false
},
{
"id" : "HB_DBL.SU_AI_NFP",
"globalId" : "74783_HB_DBL.SU_AI_NFP",
"globalIntId" : 1418623211,
"forPackaging" : false
},
{
"id" : "HB_DBL.PO-1_RO_NFP",
"globalId" : "74783_HB_DBL.PO-1_RO_NFP",
"globalIntId" : 1328251159,
"forPackaging" : false
},
{
"id" : "HB_DBL.PO-1_BB_NFP",
"globalId" : "74783_HB_DBL.PO-1_BB_NFP",
"globalIntId" : -1228155826,
"forPackaging" : false
},
{
"id" : "HB_DBL.PO-1_AI_NFP",
"globalId" : "74783_HB_DBL.PO-1_AI_NFP",
"globalIntId" : 749215308,
"forPackaging" : false
},
{
"id" : "HB_DBL.OF_RO_NFP",
"globalId" : "74783_HB_DBL.OF_RO_NFP",
"globalIntId" : 1981865239,
"forPackaging" : false
},
{
"id" : "HB_DBL.OF_BB_NFP",
"globalId" : "74783_HB_DBL.OF_BB_NFP",
"globalIntId" : 545563435,
"forPackaging" : false
},
{
"id" : "HB_DBL.OF_AI_NFP",
"globalId" : "74783_HB_DBL.OF_AI_NFP",
"globalIntId" : -481310774,
"forPackaging" : false
}
],
"duration" : {
"value" : 12,
"durationType" : "Day"
}
},
"generatedInfo" : {
"productGroupName" : null,
"subProductGroupName" : "Foo",
"version" : 0
}
}
}
with information from an enrich policy's index of the shape (when queried):
{
"_index" : ".enrich-package-enrich-1646044129711",
"_type" : "_doc",
"_id" : "zt_gP38BZeMUiw0-LxLa",
"_score" : 1.0,
"_source" : {
"contentId" : "365114",
"name" : "PackageName",
"board" : [
"B1",
"B2"
],
"units" : [
{
"price" : [
{
"margin" : 0,
"combination" : 10000,
"value" : 189030,
"currency" : "EUR"
}
],
"id" : "W2M_AX2_SC_NFP",
"globalId" : "365114_W2M_AX2_SC_NFP",
"globalIntId" : -988330164,
"name" : "UnitName",
"prop1": "Foo",
"prop2": "Bar"
}
]
}
}
I originally could get this working. However, when enriching, I only want to keep the units whose global IDs match those in the document being saved. To this end, I have also tried enriching each unit with a simple Enrich processor and a ForEach processor referencing the enrich policy, matching on globalId. I have even attempted matching on its hash code globalIntId (although even in the latter case I would often get the error that it 'is not an integer', even though it clearly is one). This separate enrich-policy index has a shape similar to the following:
{
"_index" : ".enrich-package-unit-enrich-1646044158417",
"_type" : "_doc",
"_id" : "dN_gP38BZeMUiw0-t2Io",
"_score" : 1.0,
"_source" : {
"units" : [
{
"price" : [
{
"margin" : 0,
"combination" : 10000,
"value" : 189030,
"currency" : "EUR"
}
],
"globalId" : "365114_W2M_AX2_SC_NFP",
"globalIntId" : -988330164,
"name" : "UnitName",
"prop1": "Foo",
"prop2": "Bar",
"id" : "W2M_AX2_SC_NFP"
}
]
}
}
I have also tried to use a Painless script, but so far my experience hasn't been exactly painless (pun intended). Every time I tried to access any data (I've tried the various ways I encountered), I got nothing but compilation errors. Also, given that I'm working on making this process faster, I'm a bit worried about performance if I were to get it to work. I've read that Painless is fast, yet I've also heard it's actually fairly slow (compared to using processors, I think, not necessarily compared to other scripts).
Now, I'm at a loss about how to get this to work. I would prefer to do this without scripting if possible. However, if it is only possible using scripting, that's okay as long as the performance is acceptable. I'm using Elastic 7.12.
Update 1:
I'm creating the enrich policy from C# using Nest like so:
var enrichPolicyRequest = new PutEnrichPolicyRequest(enrichPolicyName)
{
Match = new MyPackageBedEnrichPolicy(index)
};
var putEnrichPolicyResponse = await elasticClient.Enrich.PutPolicyAsync(enrichPolicyRequest);
var executeEnrichPolicyResponse = await elasticClient.Enrich.ExecutePolicyAsync(enrichPolicyName);
...
public class MyPackageBedEnrichPolicy : IEnrichPolicy
{
public MyPackageBedEnrichPolicy(string index)
{
Indices = index;
MatchField = "contentId";
EnrichFields = new[] { "name", "board", "units" };
}
public Indices Indices { get; set; }
public Field MatchField { get; set; }
public Fields EnrichFields { get; set; }
public string Query { get; set; }
}
and the index for the units very similarly, but with
public class MyPackageUnitEnrichPolicy : IEnrichPolicy
{
public MyPackageUnitEnrichPolicy(string index)
{
Indices = index;
MatchField = "units.globalId";
EnrichFields = new[] { "units" };
}
...
For now, I have created the ingest processors in Kibana for easier prototyping, though I will have to take care of that using Nest later as well. I have defined them basically as follows:
This is the definition of the ingest pipeline in JSON:
[
{
"enrich": {
"field": "content.contentId",
"policy_name": "enrichPolicyName",
"target_field": "enrichTest"
}
},
{
"foreach": {
"field": "content.units.globalId",
"processor": {
"enrich": {
"field": "content.units.globalId",
"policy_name": "unitEnrichPolicyName",
"target_field": "enrichTest.units",
"tag": "enrich-units-on-globalId-processor"
}
}
}
}
]
Is this possible to do in QuickSight or Elasticsearch? I have tried calculated fields in QuickSight and runtime scripts in Elasticsearch, but I'm not sure how to do it. I'm also not sure whether what I'm expecting is even possible in these tools.
I'm trying out a simple date difference between columns based on their action, here: the time taken for "creating a post" after a user registered.
(The sample data input and expected output were shown as images.)
It is possible using a scripted metric aggregation.
Data
"hits" : [
{
"_index" : "index121",
"_type" : "_doc",
"_id" : "aqJ3HnoBF6_U07qsNY-s",
"_score" : 1.0,
"_source" : {
"user" : "Jen",
"activity" : "Logged In",
"activity_Time" : "2020-01-08"
}
},
{
"_index" : "index121",
"_type" : "_doc",
"_id" : "a6J3HnoBF6_U07qsXY_8",
"_score" : 1.0,
"_source" : {
"user" : "Jen",
"activity" : "Created a post",
"activity_Time" : "2020-05-08"
}
},
{
"_index" : "index121",
"_type" : "_doc",
"_id" : "bKJ3HnoBF6_U07qsk4-0",
"_score" : 1.0,
"_source" : {
"user" : "Mark",
"activity" : "Logged In",
"activity_Time" : "2020-01-03"
}
},
{
"_index" : "index121",
"_type" : "_doc",
"_id" : "baJ3HnoBF6_U07qsu48g",
"_score" : 1.0,
"_source" : {
"user" : "Mark",
"activity" : "Created a post",
"activity_Time" : "2020-01-08"
}
}
]
Query
{
"size": 0,
"aggs": {
"user": {
"terms": {
"field": "user.keyword",
"size": 10000
},
"aggs": {
"distinct_sum_feedback": {
"scripted_metric": {
"init_script": "state.docs = []",
"map_script": """ Map span = [
'timestamp':doc['activity_Time'],
'activity':doc['activity.keyword'].value
];
state.docs.add(span)
""",
"combine_script": "return state.docs;",
"reduce_script": """
def all_docs = [];
for (s in states)
{
for (span in s) {
all_docs.add(span);
}
}
all_docs.sort((HashMap o1, HashMap o2)->o1['timestamp'].getValue().toInstant().toEpochMilli().compareTo(o2['timestamp'].getValue().toInstant().toEpochMilli()));
Hashtable result= new Hashtable();
boolean found = false;
JodaCompatibleZonedDateTime loggedIn;
for (s in all_docs)
{
if(s.activity =='Logged In')
{
loggedIn=s.timestamp.getValue();
found= true;
}
if(s.activity =='Created a post' && found==true)
{
found=false;
def dt=loggedIn.getYear()+ '-' + loggedIn.getMonth() + '-' + loggedIn.getDayOfMonth();
def diff= s.timestamp.getValue().toInstant().toEpochMilli() - loggedIn.toInstant().toEpochMilli();
if(result.get(dt) == null)
{
result.put(dt, diff / 1000 / 60 / 60 / 24 )
}
}
}
return result;
"""
}
}
}
}
}
}
Result
"user" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Jen",
"doc_count" : 2,
"distinct_sum_feedback" : {
"value" : {
"2020-JANUARY-8" : 121
}
}
},
{
"key" : "Mark",
"doc_count" : 2,
"distinct_sum_feedback" : {
"value" : {
"2020-JANUARY-3" : 5
}
}
}
]
}
Explanation
init_script:
Executed prior to any collection of documents. Allows the aggregation to set up any initial state.
Here it declares an empty list (state.docs) to collect one entry per document.
map_script:
Executed once per document collected.
Adds each document's activity and timestamp to the list as a map.
combine_script:
Executed once on each shard after document collection is complete.
Returns the shard's collection of maps.
reduce_script:
Executed once on the coordinating node after all shards have returned their results.
Merges the maps from all shards into a single collection and sorts it by timestamp. It then walks the sorted list and, for each "Logged In" entry followed by a "Created a post" entry, stores the difference between the two timestamps (converted to days) under the login date.
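As a rough illustration of that pairing logic outside Painless, here is the same walk over a sorted event list in plain Java (the Event record and the sample timestamps are hypothetical stand-ins; the real computation runs as the reduce_script above):
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LoginToPostDiff {

    // Hypothetical stand-in for one map in the sorted all_docs list.
    record Event(String activity, long timestampMillis) {}

    public static void main(String[] args) {
        // Events must already be sorted by timestamp, as in the reduce_script.
        List<Event> sorted = List.of(
                new Event("Logged In", 1578441600000L),      // 2020-01-08
                new Event("Created a post", 1588896000000L)  // 2020-05-08
        );

        Map<String, Long> result = new LinkedHashMap<>();
        long loggedInMillis = 0L;
        boolean found = false;
        for (Event e : sorted) {
            if (e.activity().equals("Logged In")) {
                loggedInMillis = e.timestampMillis();
                found = true;
            } else if (e.activity().equals("Created a post") && found) {
                found = false;
                // same arithmetic as the script: milliseconds -> days
                long diffDays = (e.timestampMillis() - loggedInMillis) / 1000 / 60 / 60 / 24;
                result.putIfAbsent(String.valueOf(loggedInMillis), diffDays);
            }
        }
        System.out.println(result); // prints {1578441600000=121}, i.e. 121 days
    }
}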
I have a MongoDB collection:
{
    "_id" : 123,
    "index" : "111",
    "students" : [
        {
            "firstname" : "Mark",
            "lastname" : "Smith"
        }
    ]
}
{
    "_id" : 456,
    "index" : "222",
    "students" : [
        {
            "firstname" : "Mark",
            "lastname" : "Smith"
        }
    ]
}
{
    "_id" : 789,
    "index" : "333",
    "students" : [
        {
            "firstname" : "Neil",
            "lastname" : "Smith"
        },
        {
            "firstname" : "Sofia",
            "lastname" : "Smith"
        }
    ]
}
I want to get the document whose index is in a given set, for example givenSet = ["111","333"], and that has the minimum length of the students array.
The result should be the first document, with _id: 123, because its index is in the givenSet and its students array has length 1, which is smaller than that of the third document.
I need to write a custom JSON @Query for a Spring Mongo repository. I am new to Mongo and am a bit stuck with this problem.
I wrote something like this:
@Query("{'index':{$in : ?0}, length:{$size:$students}, $sort:{length:-1}, $limit:1}")
Department getByMinStudentsSize(Set<String> indexes);
and got the error message '$size needs a number'.
Should I just use .count() or something like that?
You should use the aggregation framework for this type of query:
filter the results based on your condition,
add a new field and assign the array size to it,
sort on the new field,
limit the result.
The solution should look something like this:
db.collection.aggregate([
  {
    "$match": {
      "index": {
        "$in": [
          "111",
          "333"
        ]
      }
    }
  },
  {
    "$addFields": {
      "students_size": {
        "$size": "$students"
      }
    }
  },
  {
    "$sort": {
      "students_size": 1
    }
  },
  {
    "$limit": 1
  }
])
working example: https://mongoplayground.net/p/ih4KqGg25i6
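Since the question asks for Spring Data MongoDB, the same pipeline can also be expressed through MongoTemplate rather than a repository @Query. A minimal sketch, assuming a Department entity, a collection named "department", and an injected MongoTemplate (adjust all three to your mapping):
import static org.springframework.data.mongodb.core.aggregation.Aggregation.*;

import java.util.Set;
import org.springframework.data.domain.Sort;
import org.springframework.data.mongodb.core.aggregation.Aggregation;
import org.springframework.data.mongodb.core.aggregation.AggregationResults;
import org.springframework.data.mongodb.core.query.Criteria;

public Department getByMinStudentsSize(Set<String> indexes) {
    Aggregation agg = newAggregation(
            // $match: keep documents whose index is in the given set
            match(Criteria.where("index").in(indexes)),
            // $project: keep the fields and compute the array size
            project("index", "students")
                    .and("students").size().as("studentsSize"),
            // $sort ascending, so the smallest students array comes first
            sort(Sort.Direction.ASC, "studentsSize"),
            // $limit: return only that first document
            limit(1)
    );
    AggregationResults<Department> results =
            mongoTemplate.aggregate(agg, "department", Department.class);
    return results.getUniqueMappedResult();
}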
You are getting the issue because the second parameter should be enclosed in curly braces, and the second parameter is the projection:
@Query("{{'index':{$in : ?0}}, {length:{$size:'$students'}}, $sort:{length:1}, $limit:1}")
Department getByMinStudentsSize(Set<String> indexes);
Below is the MongoDB query:
db.collection.aggregate(
[
{
"$match" : {
"index" : {
"$in" : [
"111",
"333"
]
}
}
},
{
"$project" : {
"studentsSize" : {
"$size" : "$students"
},
"students" : 1.0
}
},
{
"$sort" : {
"studentsSize" : 1.0
}
},
{
"$limit" : 1.0
}
],
{
"allowDiskUse" : false
}
);
I have an index of flights, for example:
public class FlightIndex
{
    public int Id { get; set; }

    [Keyword]
    public string Destination { get; set; }
}
The Destination field stores data like "London Airport", "London Airport (XYZ)", and "London Airport (ABC)".
I would like to search and return the exact match on Destination.
In the query below, I want a list of flights whose destination matches the destination list provided:
q.Terms(m => m.Field(f => f.Destination).Terms(parameters.Destinations
.Select(_ => _.ToLower()).ToList()));
For example, if parameters.Destinations contains "London Airport (ABC)", then nothing is returned, but if it contains "London Airport", it returns the ones with "London Airport".
It does not seem to work with the brackets.
I'm not sure whether they need to be (or can be) escaped.
It sounds very much like Destination is not indexed as a keyword datatype; if it were, a terms query would return matches for verbatim input values. Additionally, the parentheses would not make a difference; the indexed value would either match exactly or not.
I would check the mapping in the target index with the Get Mapping API.
Here's an example to demonstrate it working:
var client = new ElasticClient(settings);
if (client.IndexExists("example").Exists)
{
client.DeleteIndex("example");
}
client.CreateIndex("example", c => c
.Mappings(m => m
.Map<FlightIndex>(mm => mm
.AutoMap()
)
)
);
client.Index(new FlightIndex { Id = 1, Destination = "London Airport (XYZ)" }, i => i
.Index("example")
.Refresh(Refresh.WaitFor)
);
client.Search<FlightIndex>(s => s
.Index("example")
.Query(q => q
.Terms(t => t
.Field(f => f.Destination)
.Terms("London Airport (XYZ)")
)
)
);
This sends the following requests and receives the following responses:
HEAD http://localhost:9200/example?pretty=true
Status: 200
------------------------------
PUT http://localhost:9200/example?pretty=true
{
"mappings": {
"flightindex": {
"properties": {
"id": {
"type": "integer"
},
"destination": {
"type": "keyword"
}
}
}
}
}
Status: 200
{
"acknowledged" : true,
"shards_acknowledged" : true,
"index" : "example"
}
------------------------------
PUT http://localhost:9200/example/flightindex/1?pretty=true&refresh=wait_for
{
"id": 1,
"destination": "London Airport (XYZ)"
}
Status: 201
{
"_index" : "example",
"_type" : "flightindex",
"_id" : "1",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1
}
------------------------------
POST http://localhost:9200/example/flightindex/_search?pretty=true&typed_keys=true
{
"query": {
"terms": {
"destination": [
"London Airport (XYZ)"
]
}
}
}
Status: 200
{
"took" : 13,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [
{
"_index" : "example",
"_type" : "flightindex",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"id" : 1,
"destination" : "London Airport (XYZ)"
}
}
]
}
}
------------------------------
Can I require the difference between two dates to be greater than 0 with a Criteria operator in spring-data-mongodb? I wrote the query below:
Criteria c = Criteria.where("myDate").gte(startDate)
        .andOperator(Criteria.where("myDate").lte(endDate)
                .andOperator(Criteria.where("studentId").is(studentId)
                        .andOperator(Criteria.where("currDate - myDate").gt(0))));
This query is not working.
If possible, please help me get this query working with spring-data-mongodb.
Edit:
The MongoDB pipeline query is as follows:
{ "aggregate" : "__collection__" , "pipeline" : [ { "$match" : { "myDate" : { "$gte" : { "$date" : "2000-01-01T07:57:33.231Z"}} , "$and" : [ { "myDate" : { "$lte" : { "$date" : "2015-11-05T07:57:33.231Z"}} , "$and" : [ { "studentId" : "100" , "$and" : [ { "currDate - myDate" : { "$gt" : 0}}]}]}]}} , { "$project" : { "status" : 1}} , { "$group" : { "_id" : { "status" : "$status"} , "activeCount" : { "$sum" : 1}}}]}
For it to work, you'd essentially want to convert the current aggregation pipeline to this:
var pipeline = [
{
"$project" : {
"status" : 1,
"studentId" : 1,
"myDate" : 1,
"dateDifference": { "$subtract": [ new Date(), "$myDate" ] }
}
},
{
"$match" : {
"studentId": "100" ,
"myDate": {
"$gte": ISODate("2000-01-01T07:57:33.231Z"),
"$lte": ISODate("2015-11-05T07:57:33.231Z")
},
"dateDifference": { "$gt" : 0 }
}
},
{
"$group": {
"_id": "$status",
"activeCount": { "$sum" : 1 }
}
}
];
db.collection.aggregate(pipeline);
The Spring Data MongoDB equivalent follows:
Criteria dateCriteria = new Criteria().andOperator(
        Criteria.where("myDate").gte(startDate).lte(endDate),
        Criteria.where("dateDifference").gt(0));

Aggregation agg = Aggregation.newAggregation(
        project("id", "status", "studentId", "myDate")
                .andExpression("currDate - myDate").as("dateDifference"),
        //.and(currDate).minus("myDate").as("dateDifference"), <-- or use expressions
        match(Criteria.where("studentId").is("100").andOperator(dateCriteria)),
        group("status").count().as("activeCount")
);
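To execute it and read the grouped counts back, something like the following should work (a sketch; StatusCount is a hypothetical result class, and the collection name is assumed):
// Hypothetical holder for one $group result document.
public class StatusCount {
    private String id;          // the grouped status (mapped from _id)
    private int activeCount;    // the computed $sum
    // getters and setters omitted
}

AggregationResults<StatusCount> results =
        mongoTemplate.aggregate(agg, "student", StatusCount.class);
List<StatusCount> counts = results.getMappedResults();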