Design of reputation engine - data-structures

Let's imagine a social network where each user can gain reputation from others by, say, delegation. So, given that A and B each start with a reputation of 1, when A delegates to B, A has 0 and B has 2.
Then B can delegate to C and so on.
Also, a delegation has a scope, and scopes can be nested. So A can delegate reputation on all topics, or only on programming, or only on C#. A can also delegate on programming to B but on C# to C. That means the final reputation varies depending on the given scope.
So we get a kind of directed graph structure (probably a tree, but it's not yet clear how to handle cycles) which we need to traverse to calculate the reputation.
I'm trying to model that with DDD principles and I'm not sure what is the aggregate here.
I suppose the delegation tree/graph is a candidate, as the aggregate is a unit of consistency. However, that means the aggregates would be very large. The scope complicates it even more because it makes the aggregate boundary unclear. Is a delegation on C# part of the same aggregate as the delegations on programming?
What about the user? As an aggregate it would have to store references (delegations) to/from other users. Again, which aggregate does a given user belong to?
A separate question is how to calculate the reputation efficiently. I guess a graph database will be more appropriate than a relational one in this case, but is that the only good answer?

An aggregate root is meant to enforce invariants. The rules of delegation you've described are one set of invariants. Without knowing what other invariants you may require, it is hard to tell what a suitable aggregate root would be, but going only by what you've presented, "user" seems to me a perfect aggregate root to enforce all your delegation rules as invariants. A user may have one or more delegation scopes, which themselves may be aggregate roots. A user can, under the rules of delegation, delegate to another user, who may in turn delegate under those same rules. This allows you to enforce all your invariants, and there is no problem storing references to (other) users under the rules of DDD.
Keep asking how you can enforce your domain-specific rules consistently and you will find your aggregate roots.
On your separate question: a graph database seems like a better idea than a relational database, but it's hard to tell with limited information. I suggest that you post this question separately and include your considerations about relational versus graph databases.
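To make the delegation rules from the question concrete, here is a minimal in-memory sketch of a scoped reputation calculation. Several things in it are assumptions rather than anything stated in the question: the scope hierarchy in SCOPE_PARENT, the rule that the most specific matching delegation wins, and the policy that a point stays with its owner when a cycle is found (the question leaves cycles open). It illustrates the traversal only, not a recommended persistence model.

```python
# Hypothetical scope hierarchy: child -> parent (None marks the root scope).
SCOPE_PARENT = {"all": None, "programming": "all", "c#": "programming"}


def scope_chain(scope):
    """Yield the scope, then each enclosing scope, most specific first."""
    while scope is not None:
        yield scope
        scope = SCOPE_PARENT[scope]


def effective_delegate(user, scope, delegations):
    """Return whom `user` delegates to for `scope`, or None.

    Assumption: a delegation on a nested scope overrides one on a wider scope.
    """
    for s in scope_chain(scope):
        target = delegations.get((user, s))
        if target is not None:
            return target
    return None


def reputation(users, delegations, scope):
    """Each user owns one point; follow delegations until someone keeps it."""
    totals = {u: 0 for u in users}
    for owner in users:
        holder, seen = owner, {owner}
        while True:
            nxt = effective_delegate(holder, scope, delegations)
            if nxt is None:
                break
            if nxt in seen:
                holder = owner  # assumed cycle policy: the point stays home
                break
            seen.add(nxt)
            holder = nxt
        totals[holder] += 1
    return totals


users = ["A", "B", "C"]
# A delegates programming to B, but C# specifically to C.
delegations = {("A", "programming"): "B", ("A", "c#"): "C"}
for scope in ("all", "programming", "c#"):
    print(scope, reputation(users, delegations, scope))
```

Each query walks every user's delegation chain, so the cost grows with the number of users times the chain length. That is fine for a sketch, but it is the part a graph database or a precomputed read model would have to take over.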

Related

Documenting the aggregates & relations between microservices

In my organization we're trying to design our microservices based on the Bounded Context (BC) pattern (part of Domain-driven design). While doing this, we also try to use another DDD pattern called Context Mapping to better identify the various contexts in the application, their boundaries and the relations between them.
All of this can be done on a whiteboard or in some online drawing tool. However, I'm looking for a way to generate a complete picture of the various services, what aggregates they contain and potentially the relations between such aggregates (as the same User in one BC might be a Customer in another). A good example is figure 4-10 in here. The generation should ideally be based on some DSL or script which we would maintain, as this kind of work is fairly high-level and context boundaries don't change very often. For example, a team adds a new aggregate or starts keeping a copy of an aggregate from another service, they update the script/DSL and regenerate the diagram.
Solutions I've looked at so far:
Context mapper - it doesn't visualise the aggregates in each BC/service, nor does it show relations
C4 model, Level 2 - we already use it, so it could be fairly easy to add a textual list of aggregates per container, but it's not what it's intended for (and the visualisation is not optimal)
ddd bounded context/microservice canvas - it's too detailed and can't really be used to look at the big picture
I'm wondering whether and how this is done in other organizations, and I'm looking for suggestions for tooling that would help.
I think the format used for event storming sessions might be worth a look in your case. Once done, it covers all domain events, commands, actors, read models, policies and external systems. It also illustrates the bounded contexts in which the aggregates, events, etc. live. An example can be found here:
https://medium.com/capital-one-tech/event-storming-decomposing-the-monolith-to-kick-start-your-microservice-architecture-acb8695a6e61
I know the format is mainly used for domain exploration but from my experience, if done nicely (e.g. using some tool like Miro, Lucid, or the like) it also provides good documentation and overview of what's going on in your system.

The impact of relation's direction on performance and how to decide on it

The Neo4j documentation says:
Even though all relationships have a direction they are equally well
traversed in both directions so there's no need to create duplicate
relationships in the opposite direction (with regard to traversal or
performance).
I'm not sure how relationships are implemented in Neo4j, but if incoming and outgoing relationships are kept in separate sets, then even though both are traversed equally well, how you design your relationships could still affect performance.
So I guess my question is: does the direction of a relationship affect the overall performance of a graph database, and if it does, how should I decide on it? E.g. does keeping the number of incoming and outgoing relationships balanced help?
Relationship directionality does not affect performance.
On disk, a node record just keeps a reference to the record for its "first" relationship (either incoming or outgoing). Traversal of relationship paths is done mainly through the relationship records. The full details are too complex to merit discussion here, but relationship data is stored symmetrically with respect to directionality. So, there is no need to worry about balancing relationship directions.
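The description above can be pictured with a toy in-memory model. This is not Neo4j's actual storage format, and every name in it (Node, Rel, first_rel, next_for) is made up for illustration. It only shows the idea that one chain of relationship records per node can serve incoming and outgoing traversal alike.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.first_rel = None  # head of this node's relationship chain


class Rel:
    def __init__(self, start, end):
        self.start, self.end = start, end
        # For each endpoint, remember the next record in that node's chain.
        self.next_for = {start: start.first_rel, end: end.first_rel}
        start.first_rel = self
        end.first_rel = self


def relationships(node, direction="both"):
    """Walk one chain; direction is only a filter applied to each record."""
    rel = node.first_rel
    while rel is not None:
        outgoing = rel.start is node
        if direction == "both" or (direction == "out") == outgoing:
            yield (rel.start.name, rel.end.name)
        rel = rel.next_for[node]


a, b, c = Node("a"), Node("b"), Node("c")
Rel(a, b)  # a -> b
Rel(c, a)  # c -> a
print(list(relationships(a, "out")))   # relationships leaving a
print(list(relationships(a, "in")))    # relationships arriving at a
print(list(relationships(a, "both")))
```

In this toy model the same walk is used whatever the direction, which is why balancing incoming against outgoing relationships would gain nothing here.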

Path finding in real world maps with custom valuation function

Description:
Our customer, who works in logistics, wants automated path finding for his business. The issue is that each city, harbor, state, and country has a different legal system with different policies for various goods. For example, some goods are forbidden from being transported through a country, fees apply to others, etc.
Intention:
For a given transport of particular goods I need to find the best possible route with respect to the known policies. If no such route exists, I have to find several less problematic alternatives.
Questions:
I believe that custom weights on nodes and edges in the graph disqualify the public map APIs. A quick search showed that no well-known API accepts a custom valuation function. Am I right?
I thought that path finding over a custom graph is quite simple, even with a custom valuation function. On the other hand, real-world maps are such a giant graph that even thinking about applying conventional algorithms seems silly. Should I pursue this solution, or is it too complicated and should I look at other options?
The only feasible approach I can think of is something like the OSPF algorithm: divide the world into regions with their policies and then find routes only through the permitted regions. Then, for each region on these routes, find routes through its states, etc. That means dynamically changing the granularity based on the supported geographical objects (countries, states, cities, ...) and slowly converging to the finest granularity, i.e., streets. The downside of this approach is that it requires a lot of programming as well as a lot of computation. Furthermore, I am not sure whether maps with such granularity exist and are publicly available. Is this the wrong way of thinking, or is it too complicated?
What are other options? I could not figure out anything else.
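For the custom-graph option discussed above, the following is a minimal sketch of Dijkstra's algorithm with a pluggable valuation function. The graph, the country codes, the goods names and the fee values are all invented for illustration. The function `cost` is assumed to return a non-negative number, or infinity for an edge the goods may not use. It finds one best route only and does not address the "less problematic alternatives" requirement or the scale of real-world maps.

```python
import heapq
import math


def best_route(graph, source, target, cost):
    """Dijkstra's algorithm; `cost(u, v, attrs)` prices or forbids each edge."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale queue entry
        if u == target:
            break
        for v, attrs in graph.get(u, {}).items():
            c = cost(u, v, attrs)
            if math.isinf(c):
                continue  # edge forbidden for these goods
            nd = d + c
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None, math.inf
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]


# Hypothetical data: each edge records its length and the country it crosses.
graph = {
    "A": {"B": {"km": 100, "country": "X"}, "C": {"km": 150, "country": "Y"}},
    "B": {"D": {"km": 100, "country": "X"}},
    "C": {"D": {"km": 150, "country": "Y"}},
}
# Hypothetical policies: (goods, country) -> extra cost; inf means forbidden.
RULES = {("chemicals", "X"): math.inf, ("chemicals", "Y"): 20}


def make_cost(goods):
    return lambda u, v, attrs: attrs["km"] + RULES.get((goods, attrs["country"]), 0)


print(best_route(graph, "A", "D", make_cost("grain")))
print(best_route(graph, "A", "D", make_cost("chemicals")))
```

Keeping the policies in the cost function rather than in the graph means the same graph can be reused for every kind of goods.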

How to use BDD to code complex data structures / data layers

I'm new to behavior-driven development and I can't find any examples or guidelines that parallel my current problem.
My current project involves a massive 3D grid with an arbitrary number of slots in each of the discrete cells. The entities stored in these slots have their own slots and, thus, an arbitrary nesting of entities can exist. The final implementation of the object(s) used will need to be backed by some kind of persistent data store, which complicates the API a bit (i.e. using words like load/store instead of get/set and making sure modifying returned items doesn't modify the corresponding items in the data store itself). Don't worry, my first implementation will simply exist in-memory, but the API is what I'm supposed to be defining behavior against, so the actual implementation doesn't matter right now.
The thing I'm stuck on is the fact that BDD literature focuses on the interactions between objects and how mock objects can help with that. That doesn't seem to apply at all here. My abstract data store's only real "behavior" involves loading and storing data from entities outside those represented by the programming language itself; I can't define or test those behaviors since they're implementation-dependent.
So what can I define/test? The natural alternative is state. Store something. Make sure it loads. Modify the thing I loaded and make sure after I reload it's unmodified. Etc. But I'm under the impression that this is a common pitfall for new BDD developers, so I'm wondering if there's a better way that avoids it.
If I do take the state-testing route, a couple other questions arise. Obviously I can test an empty grid first, then an empty entity at one location, but what next? Two entities in different locations? Two entities in the same location? Nested entities? How deep should I test the nesting? Do I test the Cartesian product of these non-exclusive cases, i.e. two entities in the same location AND with nested entities each? The list goes on forever and I wouldn't know where to stop.
The difference between TDD and BDD is about language. Specifically, BDD focuses on function/object/system behavior to improve design and test readability.
Often when we think about behavior we think in terms of object interaction and collaboration and therefore need mocks to unit test. However, there is nothing wrong with an object whose behavior is to modify the state of a grid, if that is appropriate. State or mock based testing can be used in TDD/BDD alike.
However, for testing complex data structures, you should use matchers (e.g. Hamcrest in Java) to test only the part of the state you are interested in. You should also consider whether you can decompose the complex data into objects that collaborate (but only if that makes sense from an algorithmic/design standpoint).
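As a rough illustration of state-based specifications for the load/store API described in the question, here is a sketch with a throwaway in-memory store. The class InMemoryGridStore, its method signatures and the dictionary-shaped entities are assumptions made for the example. The asker's real API is not shown in the question.

```python
import copy


class InMemoryGridStore:
    """Hypothetical stand-in for the persistent store; it keeps deep copies."""

    def __init__(self):
        self._cells = {}

    def store(self, position, slot, entity):
        self._cells[(position, slot)] = copy.deepcopy(entity)

    def load(self, position, slot):
        return copy.deepcopy(self._cells.get((position, slot)))


def crate_with_gem():
    return {"name": "crate", "slots": [{"name": "gem", "slots": []}]}


def test_loading_returns_what_was_stored():
    grid = InMemoryGridStore()
    grid.store((0, 0, 0), 0, crate_with_gem())
    assert grid.load((0, 0, 0), 0) == crate_with_gem()


def test_modifying_a_loaded_entity_does_not_change_the_store():
    grid = InMemoryGridStore()
    grid.store((0, 0, 0), 0, crate_with_gem())
    loaded = grid.load((0, 0, 0), 0)
    loaded["slots"][0]["name"] = "rock"  # mutate a nested entity
    assert grid.load((0, 0, 0), 0)["slots"][0]["name"] == "gem"


def test_empty_slot_loads_nothing():
    assert InMemoryGridStore().load((1, 2, 3), 0) is None


if __name__ == "__main__":
    test_loading_returns_what_was_stored()
    test_modifying_a_loaded_entity_does_not_change_the_store()
    test_empty_slot_loads_nothing()
```

Each test name states one behavior of the API, so the same specifications could later be run against a store backed by a real database.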

Intelligent web features, algorithms (people you may follow, similar to you ...)

I have 3 main questions about the algorithms in intelligent web (web 2.0)
Here is the book I'm reading: http://www.amazon.com/Algorithms-Intelligent-Web-Haralambos-Marmanis/dp/1933988665. I want to learn the algorithms in more depth.
1. People You may follow (Twitter)
How can one determine the results nearest to my requests? Data mining? Which algorithms?
2. How you’re connected feature (Linkedin)
The algorithm simply works like this: it draws the path between two nodes, say between me and another person, C. Me -> A, B -> A connections -> C. It is not a brute-force algorithm or any other graph algorithm :)
3. Similar to you (Twitter, Facebook)
This algorithm is similar to 1. Does it simply take the max(count) of friends in common (Facebook) or the max(count) of followers (Twitter)? Or do they implement some other algorithm? I think the second part is true, because running the loop
def most_similar(me, contacts):
    # common(a, b) is assumed to return the friends shared by a and b
    counts = {person: len(common(me, person)) for person in contacts}
    return max(counts, key=counts.get)
on every page refresh would be silly.
4. Did you mean (Google)
I know that they may implement it with a phonetic algorithm (http://en.wikipedia.org/wiki/Phonetic_algorithm), for example Soundex (http://en.wikipedia.org/wiki/Soundex), and here is Google VP of Engineering and CIO Douglas Merrill speaking about it: http://www.youtube.com/watch?v=syKY8CrHkck#t=22m03s
What about the first 3 questions? Any ideas are welcome!
Thanks
People who you may follow
You can use a factor-based calculation:
factorA = getFactorA(); // say double(0.3)
factorB = getFactorB(); // say double(0.6)
factorC = getFactorC(); // say double(0.8)
result = (factorA+factorB+factorC) / 3 // double(0.5666666666666667)
// if result is more than 0.5, you show this person
So, in the case of Twitter, "People who you may follow" can be based on the following factors (User A is the user viewing the feature, User B is a candidate, and there may be more or fewer factors):
Similarity between frequent keywords found in User A's and User B's tweets
Similarity between the profile descriptions of both users
Proximity of the locations of User A and User B
Do the people User A is following follow User B?
So where does the list of candidates come from? It probably comes from a combination of people with a high number of followers (they are probably celebrities, alpha geeks, famous products/services, etc.) and the people followed by those whom User A is following.
Basically there's a certain level of data mining to be done here, reading the tweets and bios, calculations. This can be done on a daily or weekly cron job when the server load is least for the day (or maybe done 24/7 on a separate server).
How are you connected
This is probably a clever trick to make you feel that loads of brute force has been done to determine the path. However, after some surface research, I find that it is simple:
Say you are User A; User B is your connection; and User C is a connection of User B.
In order for you to visit User C, you need to visit User B's profile first. By the time you visit User B's profile, the website has already saved the information indicating that User A is at User B's profile. So when you visit User C from User B, the website immediately tells you 'User A -> User B -> User C', ignoring all other possible paths.
This is the maximum depth, because at User C, User A cannot go on to look at User C's connections until User C is User A's connection.
Source: observing LinkedIn
Similar to you
It's the exact same thing as #1 (People you may follow), except that the algorithm reads in a different list of people. The list of people that the algorithm reads in is the people whom you follow.
Did you mean
Well, you got it right there, except that Google probably uses more than just Soundex. There's language translation, word replacement, and many other algorithms in use at Google. I can't comment much on this because it will probably get very complex and I am not an expert in language processing.
If we research a little more into Google's infrastructure, we can find that Google has servers dedicated to Spelling and Translation services. You can get more information on Google platform at http://en.wikipedia.org/wiki/Google_platform.
Conclusion
The key to computationally intensive algorithms is caching. Once you cache the result, you don't have to recompute it on every page load. Google does it, Stack Overflow does it (on most of the pages with lists of questions) and, not surprisingly, Twitter does too!
Basically, algorithms are defined by developers. You may use others' algorithms, but ultimately, you can also create your own.
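The caching point in the conclusion above can be sketched as a small time-to-live cache. The class name TTLCache, the 60-second lifetime and the injectable clock are assumptions chosen for the example and say nothing about how Google, Stack Overflow or Twitter actually cache.

```python
import time


class TTLCache:
    """Keep computed results for `ttl_seconds`, then recompute on demand."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so that expiry can be tested
        self._entries = {}   # key -> (expiry time, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip the expensive call
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value


def expensive_suggestions():
    # Placeholder for the data mining that should not run on every page load.
    return ["user_b", "user_c"]


cache = TTLCache(ttl_seconds=60)
print(cache.get_or_compute("user_a", expensive_suggestions))  # computed
print(cache.get_or_compute("user_a", expensive_suggestions))  # served from cache
```

A cron job that precomputes the results, as suggested earlier in this answer, is the same idea with the recomputation moved off the request path.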
People you may follow
Could be one of many types of recommendation algorithms, maybe collaborative filtering?
How you are connected
This is just a shortest path algorithm on the social graph. Assuming there is no weight to the connections, it will simply use breadth-first.
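A minimal sketch of that breadth-first search on an unweighted social graph follows. The adjacency-list format and the sample names are assumptions. A real network would not hold the whole graph in one dictionary.

```python
from collections import deque


def connection_path(graph, start, goal):
    """Return a shortest chain of connections from start to goal, or None."""
    if start == goal:
        return [start]
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for friend in graph.get(person, ()):
            if friend in parent:
                continue
            parent[friend] = person
            if friend == goal:
                # Walk the parent links back to the start.
                path = [goal]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(friend)
    return None


# Hypothetical connections, listed in both directions.
graph = {
    "Me": ["A", "B"],
    "A": ["Me", "C"],
    "B": ["Me"],
    "C": ["A"],
}
print(connection_path(graph, "Me", "C"))
```

Because breadth-first search explores people in order of distance, the first time it reaches the goal it has found a path with the fewest hops.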
Similar to you
Simply a re-arrangement of the data set using the same algorithm as People you may follow.
Check out the book Programming Collective Intelligence for a good introduction to the types of algorithms used for People you may follow and Similar to you. It has great Python code available too.
People You may follow
From Twitter blog - "suggestions are based on several factors, including people you follow and the people they follow" http://blog.twitter.com/2010/07/discovering-who-to-follow.html
So if you follow A and B and they both follow C, then Twitter will suggest C to you...
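The "you follow A and B, both follow C" rule can be sketched by counting how many of the accounts you follow also follow each candidate. The data layout and the tie-breaking by name are assumptions. The Twitter blog post only says that several factors are involved.

```python
from collections import Counter


def suggest(follows, user, limit=3):
    """Rank accounts followed by the people `user` follows."""
    mine = follows.get(user, set())
    counts = Counter()
    for followed in mine:
        for candidate in follows.get(followed, set()):
            if candidate != user and candidate not in mine:
                counts[candidate] += 1
    # Highest count first; ties broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


# Hypothetical follow graph: account -> set of accounts it follows.
follows = {
    "me": {"A", "B"},
    "A": {"C", "D", "me"},
    "B": {"C", "A"},
}
print(suggest(follows, "me"))
```

Here C is suggested first because both A and B follow it, while D is backed by A alone.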
How you’re connected feature
I think you have answered this one.
Similar to you
As above and as you say, although the results are probably cached, so it's only done once per session or maybe even less frequently...
Hope that helps,
Chris
I don't use twitter; but with that in mind:
1). On the surface, this isn't that difficult: For each person I follow, see who they follow. Then for each of the people they follow, see who they follow, etc. The deeper you go, of course, the more number crunching it takes.
You can take this a bit further, if you can also efficiently extract the reverse: For those I follow, who also follows them?
For both ways, what's unsaid is a way to weight the tweeters to see if they're someone I'd really want to follow: a liberal follower may also follow a conservative tweeter, but that doesn't mean I'd want to follow the conservative (see #3).
2). Not sure, thinking about it...
3). Assuming the bio and tweets are the only thing to go on, the hard parts are:
Deciding what attributes should exist (political affiliation, topic types, etc.)
Cleaning each 140 characters to data-mine.
Once you have the right set of attributes, then two different algorithms come to mind:
K-means clustering, to decide which attributes I tend to discriminate on.
N-nearest neighbors, to find the N tweeters most similar to me given the attributes I tend to give weight to.
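The nearest-neighbor idea can be sketched as follows, assuming each tweeter has already been reduced to a numeric attribute vector. The attributes, the values and the use of plain Euclidean distance are invented for illustration. Extracting such attributes from bios and tweets is the hard part mentioned above and is not shown.

```python
import math


def nearest_users(profiles, user, n=2):
    """Return the n users whose attribute vectors lie closest to `user`'s."""
    target = profiles[user]

    def distance(other):
        return math.dist(target, profiles[other])  # Euclidean distance

    others = [name for name in profiles if name != user]
    # Sort by distance, then by name so that ties are deterministic.
    return sorted(others, key=lambda name: (distance(name), name))[:n]


# Hypothetical attribute vectors: (political leaning, interest in tech).
profiles = {
    "me": (0.2, 0.9),
    "ann": (0.25, 0.8),
    "bob": (0.9, 0.1),
    "cy": (0.1, 0.7),
}
print(nearest_users(profiles, "me"))
```

Weighting the attributes, as the answer suggests, would amount to scaling each coordinate before computing the distance.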
EDIT: Actually, a decision tree is probably a FAR better way to do all of this...
This is all speculative, but it sounds fun if one were getting paid to do this.
