mgo with aggregation and grouping - go

I am trying to perform a query using golang mgo
to effectively get distinct values from a join, I understand that this might not be the best paradigm to work with in Mongo.
Something like this:
pipe := []bson.M{
{
"$group": bson.M{
"_id": bson.M{"user": "$user"},
},
},
{
"$match": bson.M{
"_id": bson.M{"$exists": 1},
"user": bson.M{"$exists": 1},
"date_updated": bson.M{
"$gt": durationDays,
},
},
},
{
"$lookup": bson.M{
"from": "users",
"localField": "user",
"foreignField": "_id",
"as": "user_details",
},
},
{
"$lookup": bson.M{
"from": "organizations",
"localField": "organization",
"foreignField": "_id",
"as": "organization_details",
},
},
}
err := d.Pipe(pipe).All(&result)
If I comment out the $group section, the query returns the join as expected.
If I run as is, I get NULL
If I move the $group to the bottom of the pipe I get an array response with Null values
Is it possible to do do an aggregation with a $group (with the goal of simulating DISTINCT) ?

The reason you're getting NULL is because your $match filter is filtering out all of documents after the $group phase.
After your first stage of $group the documents are only as below example:
{"_id": { "user": "foo"}},
{"_id": { "user": "bar"}},
{"_id": { "user": "baz"}}
They no longer contains the other fields i.e. user, date_updated and organization. If you would like to keep their values, you can utilise Group Accumulator Operator. Depending on your use case you may also benefit from using Aggregation Expression Variables
As an example using mongo shell, let's use $first operator which basically pick the first occurrence. This may make sense for organization but not for date_updated. Please choose a more appropriate accumulator operator.
{"$group": {
"_id":"$user",
"date_updated": {"$first":"$date_updated"},
"organization": {"$first":"$organization"}
}
}
Note that the above also replaces {"_id":{"user":"$user"}} with simpler {"_id":"$user"}.
Next we'll add $project stage to rename our result of _id field from the group operation back to user. Also carry along the other fields without modifications.
{"$project": {
"user": "$_id",
"date_updated": 1,
"organization": 1
}
}
Your $match stage can be simplified, by just listing the date_updated filter. First we can remove _id as it's no longer relevant up to this point in the pipeline, and also if you would like to make sure that you only process documents with user value you should placed $match before the $group. See Aggregation Pipeline Optimization for more.
So, all of those combined will look something as below:
[
{"$group":{
"_id": "$user",
"date_updated": { "$first": "$date_updated"},
"organization": { $first: "$organization"}
}
},
{"$project":{
"user": "$_id",
"date_updated": 1,
"organization": 1
}
},
{"$match":{
"date_updated": {"$gt": durationDays } }
},
{"$lookup":{
"from": "users",
"localField": "user",
"foreignField": "_id",
"as": "user_details"
}
},
{"$lookup":{
"from": "organizations",
"localField": "organization",
"foreignField": "_id",
"as": "organization_details"
}
}
]
(I know you're aware of it) Lastly, based on the database schema above with users and organizations collections, depending on your application use case you may re-consider embedding some values. You may find 6 Rules of Thumb for MongoDB Schema Design useful.

Related

How to cleanly batch queries together in Gremlin

I am writing a GraphQL resolver that retrieves all vertices by a particular edge using the following query (created returns label person):
software {
created {
name
}
}
Which would resolve to the following Gremlin Query for each software node found:
g.V().hasLabel('software').has('name', 'ripple').in('created')
This returns a result that includes all properties of the object:
{
"result": [
{
"#type": "d",
"#rid": "#24:0",
"#version": 6,
"#class": "person",
"in_knows": [
"#35:0"
],
"name": "josh",
"out_created": [
"#32:0",
"#33:0"
],
"age": 32,
"#fieldTypes": "in_knows=g,out_created=g"
}
],
"dbStats": {
...
}
}
I realize that this will fall foul on GraphQL's N+1 query so i'm trying to batch queries together using a Dataloader pattern. (i'm also hoping to do property selections, so i'm not asking the database to return too much info)
So i'm trying to craft a query like so:
g.V().union(
__.hasLabel('software').has('name', 'ripple').
project('parent', 'child').by('id').
by(__.in('created').fold()),
__.hasLabel('software').has('name', 'lop').
project('parent', 'child').by('id').
by(__.in('created').fold())
)
But this results in the following where the props are missing and it just includes the id of the vertices I want:
{
"result": [
{
"parent": "ripple",
"child": [
"#24:0"
]
},
{
"parent": "lop",
"child": [
"#22:0",
"#23:0",
"#24:0"
]
}
],
"dbStats": {
...
}
}
My Question is, how can I have the Gremlin query return all of the props for the found vertices and none of the other props? Should I even been doing batching this way?
For anyone else reading, the query I was trying to write wouldn't work because the TraversalSet created in the .by(_.in('created') can't be cast from a List to an ElementMap as the stream cardinality wouldn't be enforced. (You can only have one record per row, I think?)
My working query would be to duplicate the keys for each row and specify the props needed (the query below is ok for gremlin 3.3 as used in ODB, otherwise if you've got < gremlin 3.4 replace the last by step with be(elementMap('name', 'age')):
g.V().union(
__.hasLabel('software').has('name', 'ripple').
as('parent').
in('created').as('child').
select('parent', 'child').
by(values('name')).
by(properties('id', 'name', 'age').
group().by(__.key()).
by(__.value())),
__.hasLabel('software').has('name', 'lop').
as('parent').
in('created').as('child').
select('parent', 'child').
by(values('name')).
by(properties('id', 'name', 'age').
group().by(__.key()).
by(__.value()))
)
So that you get a result like this:
{"data": [
{
"parent": "ripple",
"child": {
"id": 5717,
"name": "josh",
"age": 32
}
},
{
"parent": "lop",
"child": {
"id": 5709,
"name": "peter",
"age": 35
}
},
{
"parent": "lop",
"child": {
"id": 5713,
"name": "marko",
"age": 29
}
},
{
"parent": "lop",
"child": {
"id": 5717,
"name": "josh",
"age": 32
}
}
]
}
Which would allow you to create a lookup where you concat all results for "lop" and "ripple" into arrays.

What is proper way to use graphql query variables to search for values matching two different searches?

I have a "users" table that is connected to "interestTags" table. I would like to be able to search users interestTags and return all users that match one or more tags, In this example I would like to be able to return all users that has interestTags of either "dog" or "apple.
The code below is only showing matches for "apple" and leaving out the "dog" interestTag users. I would like to get both "dog" users and "apple" users returned instead of one or the other. How would I go about doing this? Here is my code:
users(offset: $offset, limit: 30, order_by: {lastRequest: asc}, where: {dob: {_gte: $fromDate, _lte: $toDate}, interestTagsFromSenderId: {_or: [{tag: $tagList}]}}) {
id
displayName
profilePhotoUrl
dob
bio
location
interestTags: interestTagsFromSenderId {
tag
}
created_at
}
}
graphql query variables:
{
"offset": 0,
"fromDate": "1999-07-01",
"toDate": "2024-01-01",
"tagList":
{
"_eq": "dog", "_eq": "apple"
}
}
This is what graphql is returning:
{
"data": {
"users": [
{
"id": 31,
"displayName": "n00b account",
"profilePhotoUrl": "default.jpg",
"dob": "2021-07-15",
"bio": null,
"location": null,
"interestTags": [
{
"tag": "apple"
}
],
"created_at": "2021-07-15T06:57:23.068243+00:00"
}
]
}
}
to fix the issue:
I added $tagList: [interestTags_bool_exp!] to the query function
I changed the query to interestTagsFromSenderId: {_or: $tagList}}
And changed the variable query to { "tagList": [{"tag": {"_eq": "dog"}}, {"tag": {"_eq": "apple"}}]}

Parse Query by subfield/dot notation

tl;dr
Can ParseCloud/MongoDB filter by Pointer<class>.filed ? By
Pointer<class>.Pointer<class> ? By existence of data in that filed?
Long question:
Round is object which will be played automatically when time will come.
Payment object which indicates that user made payment. When payment being spent we set field round to it.
Player which links online User with Payment
I need to query player for few conditions:
Player
online
has valid(no round and valid equal to 'valid') payment
Player
user equal to specific user
has no payment
Player
user equal to specific user
has valid(no round and valid equal to 'valid') payment
And I made everything to work except validating Payment inside Player query.
Here is condition 1 from the list.
var query = new Parse.Query(keys.Player);
query.skip(0);
query.limit(oneRoundMaxPlayers);
query.greaterThanOrEqualTo(keys.last_online_date, lastAllowedOnline);
// looks like no filter applied here
query.doesNotExist("payment.round");
query.exists(keys.payment);
// This line will make query return 0 elements
// query.equalTo("payment.valid", "valid");
query.include(keys.user);
query.include(keys.payment);
Here is 2 OR 3
var queryPaymentExists = new Parse.Query(keys.Player);
queryPaymentExists.skip(0);
queryPaymentExists.limit(1);
queryPaymentExists.exists(keys.payment);
//This line not filtering
queryPaymentExists.doesNotExist(keys.payment + "." + keys.round);
queryPaymentExists.equalTo(keys.user, user);
// This line makes query always return 0 elements
// queryPaymentExists.equalTo(keys.payment + "." + keys.valid, keys.payment_valid);
var queryPaymentDoesNotExist = new Parse.Query(keys.Player);
queryPaymentDoesNotExist.skip(0);
queryPaymentDoesNotExist.limit(1);
queryPaymentDoesNotExist.doesNotExist(keys.payment);
queryPaymentDoesNotExist.equalTo(keys.user, user);
var compoundQuery = Parse.Query.or(queryPaymentExists, queryPaymentDoesNotExist);
compoundQuery.include(keys.user);
compoundQuery.include(keys.payment);
compoundQuery.include(keys.payment + "." + keys.round);
I've checked logs from Mongo and they looks following
verbose: REQUEST for [GET] /classes/Player: {
"include": "user,payment,payment.round",
"where": {
"$or": [
{
"payment": {
"$exists": true
},
"payment.round": {
"$exists": false
},
"user": {
"__type": "Pointer",
"className": "_User",
"objectId": "ASPKs6UVwb"
}
},
{
"payment": {
"$exists": false
},
"user": {
"__type": "Pointer",
"className": "_User",
"objectId": "ASPKs6UVwb"
}
}
]
}
}
Here is response:
verbose: RESPONSE from [GET] /classes/Player: {
"response": {
"results": [
{
"objectId": "VHU9uwmLA7",
"last_online_date": {
"__type": "Date",
"iso": "2017-10-28T15:15:23.547Z"
},
"user": {
"objectId": "ASPKs6UVwb",
"username": "cn92Ekv5WPJcuHjkmTajmZMDW",
},
"createdAt": "2017-10-22T11:43:16.804Z",
"updatedAt": "2017-10-25T09:23:20.035Z",
"ACL": {
"*": {
"read": true
},
"ASPKs6UVwb": {
"read": true,
"write": true
}
},
"__type": "Object",
"className": "_User"
},
"createdAt": "2017-10-27T21:03:35.442Z",
"updatedAt": "2017-10-28T15:15:23.556Z",
"payment": {
"objectId": "nr7ln7U3eJ",
"payment_date": {
"__type": "Date",
"iso": "2017-10-27T23:42:50.614Z"
},
"user": {
"__type": "Pointer",
"className": "_User",
"objectId": "ASPKs6UVwb"
},
"createdAt": "2017-10-27T23:42:50.624Z",
"updatedAt": "2017-10-28T15:12:30.131Z",
"valid": "valid",
"round": {
"objectId": "jF9gqG4ndh",
"round_date": {
"__type": "Date",
"iso": "2017-10-28T15:12:00.027Z"
},
"createdAt": "2017-10-28T15:11:00.036Z",
"updatedAt": "2017-10-28T15:12:30.108Z",
,
"ACL": {
"*": {
"read": true
}
},
"__type": "Object",
"className": "Round"
},
"ACL": {
"ASPKs6UVwb": {
"read": true
}
},
"__type": "Object",
"className": "Payment"
},
"ACL": {
"ASPKs6UVwb": {
"read": true
}
}
}
]
}
}
You can see that response contains payment.round.
My question is following:
Can ParseCloud/MongoDB filter by Pointer<class>.filed ? By Pointer<class>.Pointer<class> ? By existence of data in that filed?
How can I workaround in situation when I need to check field presence if User can have may Players, User can have many Payments.
UPD
As far as I found mongo should support filtering by "dot notation"
mongodb query by sub-field
So what am I doing wrong?
Short answer:
No
Simplify your data structure
Long answer:
Dot notation can be used to
include documents of pointers, as you already did in your code, e.g. include(keys.user)
filter for properties of fields, e.g. {properyA: 1, propertyB: 2}. All the data is in the field, not in another document in another collection that is referenced by a Parse pointer.
Dot notation cannot be used as filter parameter for referenced pointers in a Parse query. MongoDB also does not support such a filtering, the concept of pointer is one by Parse and not by MongoDB. In a NoSQL environment like MongoDB there are no relations between tables to be used in the query language, as it is not a "relational database" like an SQL database. However Parse provides some comfort of an SQL for simple queries with its concepts of pointer, compoundQuery and matchesKeyInQuery.
If that is not sufficient in your case, simply add the fields to the collection. To the expense that you may have the same fields and data in multiple collections but with the advantage of faster query execution time.
Finding the right data structure is one of the big topics for NoSQL as there is no general right structure. The collections and document structures are basically designed as a trade off between:
execution performance
query necessity / frequency
security (access level)
and data storage size
And they are liquid and can change over time. As your app and its queries mutate you'd also change the data structure if the long term gain is greater than the one time effort.

What are good ways to solve a strange data retrieval issue in elastic search?

I've got a strange issue with an elastic search server.
The elastic search version is 1.6. 'records' is the name of the type. The url for the search is http://some.domain:9200/user/records/_search. The field mapping for 'un' is string.
The following query which been working for years is sometimes failing depending on the value of {someId} newer ids fail, old ones work. The data is there it's just not being found ...
{
"from": 0,
"size": 1,
"sort": {
"un": "desc",
"_score": "desc"
},
"query": {
"query_string": {
"query": "un:\"{someId}\"",
"fields": [
"id",
"un",
"e",
"fn",
"ln",
"bn",
"jt",
"sy",
"c",
"st",
"p",
"fbid",
"lnid"
]
}
}
}
After doing some diagnostics I discovered the following query always works whether or not {someId} is old or new ...
{
"query": {
"bool": {
"must": [
{
"query_string": {
"default_field": "records.un",
"query": "{someId}"
}
}
],
"must_not": [],
"should": []
}
},
"from": 0,
"size": 10,
"sort": [],
"aggs": {}
}
This is a sample document that matches with the second query and fails with the first.
{
"un": "xxxxxxx.xxxxxxx",
"e": "xxxxxxx",
"pswd": "xxxxxxx",
"fn": "xxxxxxx",
"ln": "xxxxxxx",
"bn": "xxxxxxx",
"jt": "",
"sy": "xxxxxxx",
"urole": "User",
"id": "xxxxxxx",
"status": "1",
"lld": "201704280016",
"cd": "201702100132",
"md": "201704280549",
"cc": "0",
"p": "",
"logo": "",
"mlogo": "",
"ad": "201702100132",
"com": "xxxxxxx",
"rr": "true",
"sid": "00000000-0000-0000-0000-000000000000",
"fbidp": "",
"lnidp": "",
"role": "Lots of data is in this one",
"dim": "",
"drm": "",
"drcm": "xxxxxxx",
"drcfbm": "xxxxxxx",
"drclnm": "xxxxxxx",
"as": "false",
"apr": "true",
"iuid": "xxxxxxx",
"vcount": "9",
"pplatform": "",
"pname": "",
"pid": "00000000-0000-0000-0000-000000000000",
"preciept": "",
"ms": "Free"
}
I'm thinking that reindexing the server might solve the issue. What are good ways to solve strange data retrieval issues in elastic search?
There is significant difference between your first ("query": "un:\"{someId}\"") query and second ("query": "{someId}") query. In former query as you are wrapping someId in quotes as a result it will search for exact phrase i.e if you have xxx.yyy then it will look for whole id including dot(.) so id will be matched only when id doesn't contains dot where as in latter query your someId will be analyzed i.e xxx.yyy will be tokenized into two strings (xxx and yyy) and it will be matched if you have dot.
You need to change mappings of un field. If you are not doing any full-text search queries on un then I'd suggest you to make it not_analyzed. Otherwise you need to use different analyzer like whitespace instead of default standard analyzer. I'd really suggest to go with former solution as it(structured exact fields) is more efficient than latter.

mongo slow query

My mongodb is currently loaded with 105,000 documents, and I still have to insert 500,000 more, and it is taking more than 4hours just to insert 1000 documents, due to querying for references:
Insert DocA, and DocA have many citations (about 30)
Find documents in the database which are cited by DocA. [ie: findBy-Doi-Or-Pmid-Or-Pmc(...)]
-so for each of the query for DocA's citation, it is taking about 400ms to complete.
following is one of the profile:
Query { $or [ {$or [ {doi: ""}, {pmid: "10508155"} ] }, {pmc: "" } ]}
{
"ts": ISODate("2012-12-22T11: 55: 39.796Z"),
"op": "query",
"ns": "fyparticles.mArticle",
"query": {
"$or": {
"0": {
"$or": {
"0": {
"doi": ""
},
"1": {
"pmid": "10508155"
}
}
},
"1": {
"pmc": ""
}
}
},
"ntoreturn": NumberInt(1),
"nscanned": NumberInt(105707),
"responseLength": NumberInt(20),
"millis": NumberInt(477),
"client": "192.168.0.15",
"user": ""
}
And the index I have created:
{
"v": NumberInt(1),
"key": {
"doi": NumberInt(1),
"pmid": NumberInt(1),
"pmc": NumberInt(1)
},
"ns": "fyparticles.system.indexes",
"background": NumberInt(1),
"name": "params"
}
Please help me out here! Am I missing something or doing something wrong?
First off you are using an $or which in itself is not the fastest operator in the world due to its need to run multiple queries and then merge duplicates to return a result.
Second you are using an $or with one index. Since an $or is basically one or more queries you may need one or more indexes to cover the unique fields you have in each clause.
Third you are using nested $ors it is good to note that nested $ors do not use indexes: https://jira.mongodb.org/browse/SERVER-3327
So already you have like 3 or more performance problems with your query.
first off, take out that nested $or:
{ $or: [ {doi: ""}, {pmid: "10508155"}, {pmc: ""} ] }
And then you will probably need to create three indexes on this (you might be able to get one to fit all I haven't tested):
db.col.ensureIndex({ doi: 1 });
db.col.ensureIndex({ pmdi: 1 });
db.col.ensureIndex({ pmc: 1 });
That should be the first place to start to make your query faster.

Resources