I'm maintaining a small program that goes through documents in a Neo4j database and dumps a JSON-encoded object to a document database. In Neo4j (for performance reasons, I imagine) there's no real data, just IDs.
Imagine something like this:
posts:
  post:
    id: 1
    tags: 1, 2
    author: 2
    similar: 1, 2, 3
I have no idea why it was done like this, but this is what I have to deal with. The program then uses the IDs to fetch information for each data structure, resulting in a proper structure. Instead of author being just an int, it's an Author object, with name, email, and so on.
This worked well until the similar feature was added. Similar consists of IDs referencing other posts. Since in my loop I'm building the actual post objects, how can I reference them in an efficient manner? The only thing I could imagine was creating a cache with the posts I already "converted" and, if the referenced ID is not in the cache, putting the current post at the bottom of the list. Eventually, they would all be processed.
The approach you're proposing won't work if there are cycles of similar relationships, which there probably are.
For example, you've shown a post 1 that is similar to a post 2. Let's say you come across post 1 first. It refers to post 2, which isn't in the cache yet, so you push post 1 back onto the end of the queue. Now you get to post 2. It refers to post 1, which isn't in the cache yet, so you push post 2 back onto the end of the queue. This goes on forever.
You can solve this problem by building the post objects in two passes. During the first pass, you make Post objects and fill them with all the information except for the similar references, and you build up a map[int]*Post that maps ID numbers to posts. On the second pass, for each post, you iterate over the similar ID numbers, look up each one in the map, and use the resulting *Post values to fill a []*Post slice of similar posts.
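Here's a minimal sketch of that two-pass construction in Go; rawPost, Author, and lookupAuthor are hypothetical stand-ins for however your program currently fetches the data:

type Author struct {
    Name  string
    Email string
}

// rawPost mirrors the ID-only shape stored in Neo4j.
type rawPost struct {
    ID         int
    AuthorID   int
    SimilarIDs []int
}

// Post is the fully resolved structure written to the document database.
type Post struct {
    ID      int
    Author  Author
    Similar []*Post
}

// lookupAuthor stands in for the existing per-ID fetch logic.
func lookupAuthor(id int) Author { return Author{} }

func buildPosts(raw []rawPost) []*Post {
    byID := make(map[int]*Post, len(raw))
    posts := make([]*Post, 0, len(raw))

    // Pass 1: create every Post and index it by ID; leave Similar empty.
    for _, r := range raw {
        p := &Post{ID: r.ID, Author: lookupAuthor(r.AuthorID)}
        byID[r.ID] = p
        posts = append(posts, p)
    }

    // Pass 2: every Post now exists in the map, so cycles are harmless.
    for i, r := range raw {
        for _, sid := range r.SimilarIDs {
            if sp, ok := byID[sid]; ok {
                posts[i].Similar = append(posts[i].Similar, sp)
            }
        }
    }
    return posts
}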
I am working on a service that provides information about a few related entities, somewhat like a database. Suppose there are calls to retrieve information about a school:
service MySchool {
  rpc GetClassRoom (ClassRoomRequest) returns (ClassRoom);
  rpc GetStudent (StudentRequest) returns (Student);
}
Now, suppose I want to find out a classroom's information; I'd receive a proto that looks like so:
message ClassRoom {
  string id = 1;
  string address = 2;
  string teacher = 3;
}
Sometimes I also want to know all of the students of the classroom. I am struggling to decide which is the better design pattern.
Option A) Add an extra rpc like so: rpc GetClassRoomStudents (ClassRoomRequest) returns (ClassRoomStudents), where ClassRoomStudents has a single field, repeated Student students. This technique requires more than one call to get all the information we want (and many calls if we want information for more than one classroom).
Option B) Add an extra repeated Student students field to the ClassRoom proto, and either B') fill it up only when necessary, or B") fill it up whenever the server receives a GetClassRoom call. This may sometimes fetch extra information, or lead to ambiguity about which fields are filled in.
I am not sure what the best / most conventional way of dealing with this is. How have some of you dealt with this?
There is no simple answer. It's a tradeoff between simplicity (option A) and performance (option B), and it depends on the situation which solution is best.
In general, I'd recommend going with the simple solution first, unless your measurements demonstrate that it leads to performance issues. At that point, it's easy to add repeated Student students to ClassRoom and a field bool fetch_students [default=false] to ClassRoomRequest. Then clients are free to continue using the simple API, or choose to upgrade to the more performant API if they need to.
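A sketch of those message changes, assuming proto3 like the snippets above (where a bool already defaults to false, so proto2's [default=false] annotation is unnecessary); the ClassRoomRequest fields shown here are assumptions:

message ClassRoomRequest {
  string id = 1;
  // Opt-in: clients that also want the roster set this to true.
  bool fetch_students = 2;
}

message ClassRoom {
  string id = 1;
  string address = 2;
  string teacher = 3;
  // Populated only when fetch_students was set on the request.
  repeated Student students = 4;
}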
Note that this isn't specific to gRPC; the same issue is seen in REST APIs, and basically almost any request/response model.
In v2 it was possible to make a call to /files with the query 'fileId' in children to get a list of DriveFile objects that were parents of the supplied file.
Now, it seems to be required to make a call to /files/:fileId?fields=parents, then make a separate call to /files/:parentId for each returned parent, possibly turning one call into a dozen.
Is this correct, and if so why? This is a huge performance hit to our app, so hopefully there's an undocumented method.
The query "'fileId' in children" doesn't publicly exist (not documented/supported) in v2 either, and I don't recall it ever existing. What does exist in v2 is the Parents collection, which effectively answers the same question. In v3, to get the parents of a file you just get the child and ask for the parents field.
As for whether or not that is a performance hit, I don't think it is in practice. The Parents resource in v2 was very light to begin with; other than the ID, the only useful field was the isRoot property. That you can calculate yourself by calling files/root up front to get the ID of the root folder for that user (just once; save it, as it won't change for that user).
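For reference, that one-time root-ID lookup is a single fields-limited call in v3:

GET https://www.googleapis.com/drive/v3/files/root?fields=id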
If you need to get more information about the parents than just the IDs and are worried about the number of calls you have to make, use batching to fetch them. If you just have one parent, there's no need to batch (it's just overhead). If you find that a file has multiple parents, create a batch request, as sketched below. That'll be sent as a single HTTP request/response and is handled very efficiently on the back end.
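A rough sketch of that batching with the .NET client; here service is assumed to be an authenticated DriveService and parentIds the IDs already read from the child's parents field:

using Google.Apis.Drive.v3;
using Google.Apis.Requests;
using System.Collections.Generic;

// Inside an async method:
var parents = new List<Google.Apis.Drive.v3.Data.File>();
var batch = new BatchRequest(service);
foreach (var parentId in parentIds)
{
    var get = service.Files.Get(parentId);
    get.Fields = "id, name"; // request only the fields you need
    batch.Queue<Google.Apis.Drive.v3.Data.File>(get,
        (file, error, index, message) =>
        {
            if (error == null) parents.Add(file);
        });
}
await batch.ExecuteAsync(); // one HTTP round trip for all the gets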
Point is, if you just need IDs, it's no worse than before. It's one call to get the parents of a file.
If you need more than IDs, it's at most 2 HTTP requests (outside really bizarre edge cases like 1000+ parents which would exceed the batch size :)
In v3 it is possible to list all children of a parent, as explained here: https://developers.google.com/drive/v3/web/search-parameters
Example call:
https://www.googleapis.com/drive/v3/files?q='0Byho0qAdzabmVl8xcDR1S0pNY3c' in parents
Of course, replace the spaces with %20. This will list all the files in the folder that has id='0Byho0qAdzabmVl8xcDR1S0pNY3c'.
You just need to set the query like below:
// List only the files whose parent is the user's root folder.
var request = service.Files.List();
request.Q = "('root' in parents)";
var FileListOfParentOnly = request.Execute();
I never quite understood the "if needed" part of the description.
.fetchAll()
Fetches the given list of Parse.Object.
.fetchAllIfNeeded()
Fetches the given list of Parse.Object if needed.
What is the situation where I might use this and what exactly determines the need? I feel like it's something super elementary but I haven't been able to find a satisfactory and clear definition.
In the example in the API docs, I notice that fetchAllIfNeeded() has:
// Objects were fetched and updated.
That's in the success callback, while fetchAll only has:
// All the objects were fetched.
So does fetchAllIfNeeded() also save stuff? Very confused here.
UPDATES
TEST 1
Going on some of the hints #danh left in the comments, I tried the following things.
var todos = [];
var x = new Todo({content:'Test A'}); // Parse.Object
todos.push(x);
x.save();
// So now we have a todo saved to parse and x has an id. Async assumed.
x.set({content:'Test B'});
Parse.Object.fetchAllIfNeeded(todos);
So in this scenario, my client x is different from the server. But x.hasChanged() is false, since we used the set function and the change event was triggered. fetchAllIfNeeded returns no results. So it isn't that it's trying to compare this outright to what is on the server to sync and fetch.
I notice that, in the request payload, running fetchAllIfNeeded sends the following interesting thing:
{where: {objectId: {$in: []}}, _method: "GET",…}
So it seems that on the client side, something determines whether an object is "needed".
TEST 2
So now, based on the comments, I tried manipulating the changed state of the object by setting with {silent: true}.
x.set({content:'Test C'}, {silent:true});
x.hasChanged(); // true
Parse.Object.fetchAllIfNeeded(todos);
Still nothing interesting. Clearly the server state ("Test A") is different from the client-side state ("Test C"), yet I still get results of [] and the request payload is:
{where: {objectId: {$in: []}}, _method: "GET",…}
UPDATE 2
Figured it out by looking at the Parse source. See answer.
After many manipulations, then taking a look at the source, I figured this out. Basically, fetchAllIfNeeded will fetch models in an array that have no data, meaning they have no attribute properties and values.
So the use case would be: you have, let's say, a parent object with an array of nested Parse.Objects. When you fetch the parent object, the nested child objects in the array will not be included (unless you have the include query constraint set). Instead, the pointers are sent back to the client side, and in your client those pointers are translated into "empty" models with no data, basically just blank Parse.Objects with ids.
Specifically, Parse.Object has an internal Boolean property called _hasData, which seems to be toggled to true any time something like set or fetch gives that model attributes.
So, let's say you need to fetch those child objects. You can just do something like:
var childObjects = parent.get('children'); // Array
Parse.Object.fetchAllIfNeeded(childObjects);
And it will fetch those children that are currently only represented as empty objects with an id.
It's useful as opposed to fetchAll in that you might go through the children array and lazily load one at a time as needed, then at a later time need to "get the rest". fetchAllIfNeeded essentially just filters for "the rest" and sends a whereIn query that limits fetching to those child objects that have no data.
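To make that lazy pattern concrete, a small sketch in the old callback-style SDK used in the question; parent is assumed to be an already-fetched Parse.Object:

var children = parent.get('children'); // pointer-only objects, no data yet

// Lazily fetch just the first child when it's needed...
children[0].fetch({
  success: function(first) {
    // ...then later "get the rest" in one call; children that already
    // have data are filtered out by fetchAllIfNeeded.
    Parse.Object.fetchAllIfNeeded(children, {
      success: function(all) { /* every child now has data */ }
    });
  }
});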
In the Parse documentation, they have a comment in the callback response to fetchAllIfNeeded as:
// Objects were fetched and UPDATED.
I think they mean the client-side objects were updated. fetchAllIfNeeded is definitely sending GET calls, so I doubt anything updates on the server side. So this isn't some sync function. This really confused me, as I instantly thought of server-side updating when they really mean:
// Client objects were fetched and updated.
I'm still a bit lost when it comes to Sorted Sets and how to best construct them. Currently I have a simple set of activity on my site. Normally it will display things like user followed, user liked, user posted, etc. The JSON looks something like...
{
  "id": 2808697,
  "activity_type": "created_follower",
  "description": "Bob followed this profile",
  "body": null,
  "user": "Bob",
  "user_id": 99384,
  "user_profile_id": 233007,
  "user_channel_id": 2165811,
  "user_cube_url": "bob-anerson",
  "user_action": "followed this profile",
  "buddy": "http://s3.amazonaws.com/stuff/ju-logo.jpg",
  "affected": "Bill Anerson is following Jon Denver.",
  "created_at": "2014-06-24T20:34:11-05:00",
  "created_ms": 1403660051902,
  "profile_id": 232811,
  "channel_id": 2165604,
  "cube_url": "jondenver",
  "type": "profiles"
}
So if the activity type can be multiple things (i.e. created follower, liked event, posted news, etc.), how would I go about putting this all in a sorted set? I'm already sure I want the score to be created_ms, but the question is: can I store multiple values in a sorted set that all have keys as fields? Should most of this be in a hash? I realize this is a fairly open question, but after trying to wrap my head around all the tutorials, I'm just concerned about setting up the data structure beforehand so I don't get caught too deep in the weeds.
A sorted set is useful if you want to... keep stuff sorted! ;)
So, I assume you're interested in keeping the activities sorted by their creation time (ms). As for storing the actual data, you have two options:
Use the sorted set itself to store the data, even in native JSON format. Note that with this approach you'll only be able to fetch the entire JSON and you'll have to parse it at the client.
Alternatively, use the sorted set to store "pointers" to hashes, i.e. the values will be the names of the keys in which you store the data. From your description, this appears to be the preferable approach.
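A sketch of that second option in raw Redis commands; key names are illustrative:

# Keep each activity's fields in a hash, keyed by its id...
HMSET activity:2808697 activity_type created_follower user Bob created_ms 1403660051902

# ...and index that key in a sorted set, scored by created_ms.
ZADD activities 1403660051902 activity:2808697

# The ten most recent activities, newest first:
ZREVRANGE activities 0 9
# ...then HGETALL each returned key (pipeline the calls for efficiency).
HGETALL activity:2808697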
I am building an API using Codeigniter and MongoDB.
I have some questions about how to "model" the MongoDB data.
A user should have basic data like name, and a user should also be able to follow other users. As it is now, each user document keeps track of all the people that are following him and all that he is following. This is done using arrays of user _ids.
Like this:
"following": [323424,2323123,2312312],
"followers": [355656,5656565,5656234234,23424243,234246456],
"fullname": "James Bond"
Is this a good way? Perhaps the user document should only contain the ids of people the user is following, and not those who are following him? I can imagine that keeping potentially thousands of ids (for followers) in an array will make the document too big?
All input is welcome!
The max document size is currently limited to 16 MB (v1.8.x and up), which is pretty big. But I still think it would be OK in this case to move the follower relations to their own collection; you never know how big your project gets.
However, I would recommend using database references for storing the follower relations: it's way easier to resolve the user from a database reference. Have a look at:
http://www.mongodb.org/display/DOCS/Database+References
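For illustration, a sketch of that separate collection in the mongo shell; collection and field names are illustrative, and plain _ids are used here instead of full DBRefs for brevity:

// One document per follow relation, in its own collection:
db.follows.insert({ follower: 355656, following: 99384 });

// Who follows user 99384?
db.follows.find({ following: 99384 });

// Whom does user 355656 follow?
db.follows.find({ follower: 355656 });

// Index both directions so those queries stay fast:
db.follows.ensureIndex({ following: 1 });
db.follows.ensureIndex({ follower: 1 });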