Is OrientDB suitable for social network analytics? - social-networking

I'm trying to use OrientDB for social network analytics, but is it suitable for that?
In my application, there might be billions of nodes and relationships. Since OrientDB supports distributed servers, I think scalability wouldn't be a problem.
I also need to get the friends of one person, the mutual friends of two people, friend recommendations, and to merge two nodes when they turn out to be the same person. It seems that OrientDB's traverse is not well suited to these queries.
Is OrientDB suitable for my application?
Thanks!

Of course it is! The OrientDB SQL Traverse command is powerful, but you can always use Gremlin for such things.
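
To make the queries from the question concrete, here is a minimal in-memory sketch in Python. It is not OrientDB SQL or Gremlin; the `SocialGraph` class and its method names are hypothetical and only illustrate what "friends", "mutual friends" and "merge two nodes" mean on an undirected graph.

```python
from collections import defaultdict


class SocialGraph:
    """Undirected friendship graph stored as adjacency sets (illustrative only)."""

    def __init__(self):
        self.adj = defaultdict(set)

    def add_friendship(self, a, b):
        if a == b:
            return  # ignore self-loops
        self.adj[a].add(b)
        self.adj[b].add(a)

    def friends(self, person):
        # Return a copy so callers cannot mutate the graph by accident.
        return set(self.adj.get(person, set()))

    def mutual_friends(self, a, b):
        return self.friends(a) & self.friends(b)

    def merge(self, keep, duplicate):
        """Fold `duplicate` into `keep` when both represent the same person."""
        for friend in self.adj.pop(duplicate, set()):
            self.adj[friend].discard(duplicate)
            if friend != keep:
                self.adj[friend].add(keep)
                self.adj[keep].add(friend)


g = SocialGraph()
g.add_friendship("alice", "bob")
g.add_friendship("alice", "carol")
g.add_friendship("dave", "bob")
g.add_friendship("dave", "carol")
g.add_friendship("dave2", "erin")  # "dave2" is a duplicate record for "dave"
g.merge("dave", "dave2")

print(g.mutual_friends("alice", "dave"))
print(g.friends("dave"))
```

In a graph database the same operations would be expressed as short traversals from the starting vertex; the sketch only shows the logic.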

Related

Do graph databases have problems with aggregation operations?

I came across multiple opinions that graph databases tend to have problems with aggregation operations. For example, if you have a set of users and want to get the maximum age, an RDBMS will outperform a graph database. Is this true, and if so, what is the reason behind it? As far as I understand, the key difference between a relational and a graph database is that each graph database node somehow includes references to the nodes it is connected to. How does that impact a "get max age"-like query?
Disclaimer: most of what I have read was about Neo4j, but I suppose if these limitations exist, they should apply to any graph db.
The use of graph databases like Neo4j is recommended when dealing with connected data and complex queries.
The book Learning Neo4j by Rik Van Bruggen states that you should not use graph databases when dealing with simple, aggregate-oriented queries:
From the book:
(...) simple queries, where write patterns and read patterns align to
the aggregates that we are trying to store, are typically served quite
inefficiently in a graph, and would be more efficiently handled by an
aggregate-oriented Key-Value or Document store. If complexity is low,
the advantage of using a graph database system will be lower too.
The reason is closely tied to the nature of the persistence model: it is easier to compute a sum, max or avg over tabular data than over data stored as a graph.
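
As a rough illustration of that point (not a benchmark, and not how any particular engine stores data), compare a column of ages with a graph-like layout where each node is its own record holding properties and neighbour references. Both layouts below are simplified assumptions.

```python
# Tabular layout: the ages form one column that can be scanned directly.
ages_column = [34, 27, 45, 19]
max_age_table = max(ages_column)

# Graph-like layout: each node is a separate record with its properties and
# references to neighbours. A global aggregate has to visit every node record
# and pull the property out of it; the relationship data is of no help here.
nodes = {
    "u1": {"props": {"age": 34}, "friends": ["u2", "u3"]},
    "u2": {"props": {"age": 27}, "friends": ["u1"]},
    "u3": {"props": {"age": 45}, "friends": ["u1", "u4"]},
    "u4": {"props": {"age": 19}, "friends": ["u3"]},
}
max_age_graph = max(node["props"]["age"] for node in nodes.values())

print(max_age_table, max_age_graph)
```

Both give the same answer; the difference is only in how much unrelated structure has to be touched to get there.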

What are the pitfalls for using ElasticSearch as a nosql db for a social application vs a graph database?

Our company has several products and several teams. One team is in charge of searching, and is standardizing on Elasticsearch as a nosql db to store all their data, with plans to use Neo4j later to complement their searches with relationship data.
My team is responsible for the product side of a social app (people have friends, and work for companies, and will be colleagues with everyone working at their companies, etc). We're looking at graph dbs as a solution (after abandoning the burning ship that is n^2 relationships in rdbms), specifically neo4j (the Cypher query language is a beautiful thing).
A subset of our data is similar to the data used by the search team, and we will need to make sure search can search over their data and our data simultaneously. The search team is pushing us to standardize on ElasticSearch for our db instead of Neo4j or any graph db. I believe this is for the sake of standardization and consistency.
We're obviously coming from very different places here, search concerns vs product concerns. He asserts that ElasticSearch can cover all our use cases, including graph-like queries to find suggestions. While that's probably true, I'm really looking to stick with Neo4j, and use an ElasticSearch plugin to integrate with their search.
In this situation, are there any major gotchas to choosing ElasticSearch over Neo4j for a product db (or vice versa)? Any guidelines or anecdotes from those who have been in similar situations?
We are heavy users of both technologies, and in our experience you are better off using each for what it is good at.
Elasticsearch is a very good piece of software when it comes to search functionality, log management and facets.
Despite its graph plugin, if you want to model a lot of social-network-like relationships in Elasticsearch indices, you will have two problems:
You will have to update documents every time a relationship changes, which can amount to a lot of updates when a single entity changes. For example, say you have organizations whose users make contributions on GitHub, and you want to search for organizations with the top contributors in a certain language. Every time a user makes a contribution on GitHub, you will have to reindex the whole organization, recompute the percentage of contributions per language for all users, and so on. And this is a simple example.
If you intend to use nested fields and parent/child mappings, you will lose performance at search time. For reference, here is the quote from the "tune for search speed" documentation: https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-search-speed.html#_document_modeling
Documents should be modeled so that search-time operations are as cheap as possible.
In particular, joins should be avoided. nested can make queries
several times slower and parent-child relations can make queries
hundreds of times slower. So if the same questions can be answered
without joins by denormalizing documents, significant speedups can be
expected.
Relationships are handled very well in a graph database like Neo4j. Neo4j, on the other hand, lacks the search features Elasticsearch provides; full-text search is possible, but it is not as performant and adds some burden to your application.
As an aside: when you talk about a "store", Elasticsearch is a search engine, not a database (although it is often used as one), while Neo4j is a fully transactional database.
However, combining both is the winning approach. We have written an article describing this process, which we call Graph-Aided Search, with a set of open source plugins for both Elasticsearch and Neo4j that provide a powerful two-way integration out of the box.
You can read more about it here: http://graphaware.com/neo4j/2016/04/20/graph-aided-search-the-rise-of-personalised-content.html

Neo4j and Cluster Analysis

I'm developing a web application that will heavily depend on its ability to make suggestions on items based on users with similar preferences. A friend of mine told me that what I'm looking for, mathematically, is some Cluster Analysis algorithm. On the other hand, here on SO, I was told that Neo4j (or some other graph DB) was the kind of DB I should look at for this task (the preferences one).
I started studying both of these tools, and I have some doubts.
For Cluster Analysis purposes it looks to me that a standard SQL DB would still be the perfect choice, while Neo4j would be better suited for a Neural Network kind of approach (although still perfectly fit for the task).
Am I missing something? Am I trying to use the wrong tools combination?
I would love to hear some ideas on the subject.
Thanks for sharing
This depends on your data. Neo4j can provide even complex recommendations in real time for one particular node. Say you want to recommend a product to a user: this can be handled within a graph DB in real time,
whereas a clustering system is the best way to compute recommendations for all users at once (and then maybe save them somewhere so you don't need to calculate them again).
The computational difference:
Neo4j has no initialization cost and can give you one recommendation in an acceptable time.
Clustering needs more time for initialization (not seconds, but most likely minutes or hours) and is better for calculating recommendations for the whole dataset. In fact, considering strictly the time for one calculation for a specific user, clustering can be faster than Neo4j, but the big restriction is the initialization cost, so it is not a good fit for real-time applications.
The practical difference:
If you have mostly static data and it is OK for you to compute recommendations once in a while, then do clustering with SQL.
If you have dynamic data that is updated with each interaction, and you must always provide the newest recommendation, then use Neo4j.
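
The trade-off between a stored batch result and an on-demand computation can be sketched in a few lines of Python. This is a toy illustration only: `recommend_for` is a hypothetical helper with a deliberately naive rule, and it is neither a clustering algorithm nor Neo4j code.

```python
def recommend_for(likes, user):
    """Items liked by users who share at least one liked item with `user`."""
    mine = likes.get(user, set())
    recs = set()
    for other, theirs in likes.items():
        if other != user and mine & theirs:
            recs |= theirs - mine
    return recs


likes = {"u1": {"x", "y"}, "u2": {"y", "z"}, "u3": {"q"}}

# Batch style: compute for everyone once and store the result.
precomputed = {user: recommend_for(likes, user) for user in likes}

# A new interaction arrives after the batch run.
likes["u3"].add("y")

stale = precomputed["u3"]           # still the answer from before the update
fresh = recommend_for(likes, "u3")  # reflects the new interaction

print(stale, fresh)
```

The stored result is cheap to read but goes stale until the next batch run, while the on-demand call pays its cost on every request and always sees current data.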
I am currently working on various topics related to recommendation and clustering with neo4j.
I'm not exactly sure what you're looking for, but depending on how you model your data in the graph, you can easily work out clustering algorithms based on counting links to various types of nodes.
If you plan your nodes and relationships correctly, you can then identify groups of nodes that share the most links to a set of categories.
Let me introduce Reco4J (http://www.reco4j.org). It is an open source framework that provides recommendations based on a graph database source. It uses Neo4j as its graph database management system.
Have a look at it and contact us if you are interested in support.
It is in a really early release but we are working hard to provide extended documentation and new interesting features.
Cheers,
Alessandro

Social network functionality finding connections you might know

I want to create a functionality for suggesting connections in a social network.
In the network you can have connections and connect to other users.
I want to implement a connection suggestion functionality on the network.
I think the most basic approach is to go through all my connections, find the user who occurs most often among their connections and to whom my user is not yet connected, and suggest that user to my user.
My questions are:
Is this a good basic approach for an easy connection finder?
Is there a good algorithm I can use to find the user that my connections are most often connected to?
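
The basic approach described in the question can be sketched as a friends-of-friends count. This is a minimal in-memory version in Python, assuming the network fits in a dictionary of adjacency sets; `suggest_connections` and the sample data are hypothetical names made up for illustration.

```python
from collections import Counter


def suggest_connections(graph, user, limit=3):
    """Rank non-connected users by how many of `user`'s connections know them."""
    direct = graph.get(user, set())
    counts = Counter()
    for friend in direct:
        for candidate in graph.get(friend, set()):
            if candidate != user and candidate not in direct:
                counts[candidate] += 1
    # Sort by descending count, then by name for a deterministic order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]


graph = {
    "me": {"ann", "ben", "cat"},
    "ann": {"me", "dan", "eve"},
    "ben": {"me", "dan"},
    "cat": {"me", "dan", "eve"},
    "dan": {"ann", "ben", "cat"},
    "eve": {"ann", "cat"},
}
suggestions = suggest_connections(graph, "me")
print(suggestions)
```

The cost is proportional to the number of friend-of-friend edges visited, so for users with very many connections it may be worth capping or sampling the inner loop.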
I'd try a machine learning approach for this problem.
I'll suggest two common machine learning concepts in order to solve this problem. In order for both of them to work - you need to extract features from the data (for example look at a subgraph, and friendship with each member in the subgraph is a binary feature).
The two approaches are:
Classification. Here, you are trying to find a classifier C: User x User -> Boolean (a classifier that, given two users, gives a boolean answer: should they be friends?). The classification approach will require you to first manually label, or otherwise extract, some classified information (a big enough set of pairs, each with a classification). The algorithm will learn this pattern and use it to predict future inputs.
Clustering (AKA Unsupervised learning). You can try and find clusters in your graph, and suggest users to be friends with all members in their cluster.
I have to admit I never used any of these methods for friendship suggestion - so I have no idea how accurate it will be. You can use cross-validation in order to estimate the accuracy of the algorithm.
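
The feature extraction step mentioned above could look like the following sketch. The specific features (common-neighbour count and Jaccard similarity of the two friend sets) are an illustrative choice, not something this answer prescribes, and `pair_features` is a hypothetical helper; its output could be fed to whatever classifier or clustering method you pick.

```python
def pair_features(graph, a, b):
    """Compute simple features for the user pair (a, b) from adjacency sets."""
    fa, fb = graph.get(a, set()), graph.get(b, set())
    common = fa & fb
    # Exclude a and b themselves so an existing a-b link does not skew the ratio.
    union = (fa | fb) - {a, b}
    jaccard = len(common) / len(union) if union else 0.0
    return {
        "common": len(common),
        "jaccard": jaccard,
        "already_connected": b in fa,
    }


network = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": {"b"},
}
feats = pair_features(network, "a", "b")
print(feats)
```

For the classification approach, pairs that are already connected can serve as positive examples and a sample of unconnected pairs as negatives.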
If you are interested in learning more about it, a free online course on machine learning started at Stanford two weeks ago: https://class.coursera.org/ml-2012-002

Efficient traversal/search algorithm to fetch data from RDF?

I have my data as an RDF graph in a DB, and I retrieve it using SPARQL. The number of nodes (objects) in the graph has grown huge, and traversal/search has become much slower.
a. Can anyone suggest an efficient traversal/search algorithm to fetch the data?
As a next step, I have federated data, i.e. data from external applications like SAP. In this case, the search becomes even slower.
b. What efficient search algorithm should I use in this case?
This seems like a common issue in large enterprise systems, and any input on how these problems have been solved in such systems would also be helpful.
I had a similar problem. I was doing a lot of graph traversal using SPARQL property paths, and it was too slow using an RDF-based repository. I was using Jena TDB, which is supposed to be fast, but it was still too slow!
As @Mikos suggested, I tried Neo4j, and it got much faster. As Mark Watson says in this blog entry:
RDF data stores support SPARQL queries: good for matching patterns in data.
Neo4j supports arbitrary graph structures and seems best for exploring
a neighborhood of a graph: start at a node and explore the connected
nodes. (graph traversal)
I used Neo4j, but you can try any tool that is built for graph traversal. I read that AllegroGraph 4 is RDF based and has good graph traversal speed.
I'm now using Neo4j, but I didn't give up on RDF. I still use URIs as identifiers and try to reuse the popular RDF vocabularies and relations. Later I'll add a feature to render my graphs as RDF. I know that with Neo4j you can also use Tinkerpop to render RDF, but I haven't tried it myself.
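
The "start at a node and explore the connected nodes" style of traversal quoted above can be sketched as a depth-limited breadth-first search. This is plain Python over an adjacency dictionary, not Neo4j or SPARQL; `neighborhood` and the `ex:` identifiers are hypothetical names for illustration.

```python
from collections import deque


def neighborhood(graph, start, max_depth):
    """Return {node: distance} for nodes within `max_depth` hops of `start`."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_depth:
            continue  # do not expand past the depth limit
        for neighbour in graph.get(node, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


rdf_like = {
    "ex:a": ["ex:b", "ex:c"],
    "ex:b": ["ex:d"],
    "ex:c": ["ex:d"],
    "ex:d": ["ex:e"],
}
within_two = neighborhood(rdf_like, "ex:a", 2)
print(within_two)
```

The work done depends on the size of the neighbourhood visited rather than on the size of the whole graph, which is the property that makes this access pattern attractive for local exploration.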
Graph traversal and efficient querying is a wide-ranging problem and the approach to use is dependent on your situation. I would suggest looking at a data-store like Neo4j and complementing it with a tool like Lucene.
