Can Neo4j work with Hadoop? - hadoop

Can Neo4j work with Hadoop, for social network analysis of big data? If yes, is it hard to make them work together, and what is the bottleneck in such a system?
Basically, I am looking for a solution for social network analysis of big data, where the network could have hundreds of millions of vertices. I would also like a user-friendly GUI for interactive exploration and analysis of graphs. Would Hadoop+Neo4j be good for this purpose? Or is Hadoop+Giraph or Spark+GraphX a better solution?
Any comments or suggestion will be appreciated. Thanks.

Spark + GraphX gives you faster performance. Its design is derived from Pregel and GraphLab. However, it doesn't have any UI for viewing graph output directly. Users have to build their own UI or extend one of the graph examples from the D3 library.
Check this link to learn more about Spark + GraphX:
https://spark.apache.org/docs/latest/graphx-programming-guide.html

Related

What are the different graph processing alternatives for Hadoop/Spark

Giraph, GraphX, and Neo4j are some of the solutions I am aware of today. As this is an area all the tech giants are working on, an updated list would be much appreciated. I have also not seen a good comparison of the tools listed above anywhere.
Firstly, I should mention that Giraph and GraphX are for graph processing, whereas Neo4j is a graph database. If you are going to store your graph and query it with something like "give me some nodes that have content 'X' and a neighbor at distance two with content 'Y'", a graph database such as Neo4j is the right fit. Otherwise, Giraph and GraphX can fill the graph processing role.
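To make the example query above concrete, here is a minimal plain-Python sketch of what such a lookup computes. It is only an illustration on an in-memory adjacency list, not Neo4j's query language or API; the names `two_hop_neighbors`, `match_nodes`, `adjacency`, and `content` are hypothetical.

```python
def two_hop_neighbors(adjacency, node):
    """Return nodes whose shortest distance from `node` is exactly two hops."""
    direct = set(adjacency.get(node, ()))
    second = set()
    for neighbor in direct:
        second.update(adjacency.get(neighbor, ()))
    # Drop the start node and its direct neighbors so only distance-2 nodes remain.
    return second - direct - {node}


def match_nodes(adjacency, content, x, y):
    """Nodes with content x that have a distance-two neighbor with content y."""
    return sorted(
        node
        for node, value in content.items()
        if value == x
        and any(content.get(m) == y for m in two_hop_neighbors(adjacency, node))
    )


# Hypothetical undirected path graph a - b - c - d.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
content = {"a": "X", "b": "Z", "c": "Y", "d": "X"}

print(match_nodes(adjacency, content, "X", "Y"))
```

A graph database answers this kind of pattern query against stored data, whereas Giraph or GraphX would typically be used to run an algorithm over the whole graph.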
Unfortunately, although GraphX offers very nice APIs, it fails on large graphs when the available distributed memory is not enough. This mostly happens when the intermediate data does not fit in the available memory.
In addition, as reported in the literature, Giraph often ranks worst in performance, but it is more stable than GraphX.
There are other solutions for "Distributed Graph Processing", such as GraphLab and Titan, which are worth investigating.

Neo4j and Cluster Analysis

I'm developing a web application that will depend heavily on its ability to suggest items based on users with similar preferences. A friend of mine told me that what I'm looking for, mathematically, is some Cluster Analysis algorithm. On the other hand, here on SO, I was told that Neo4j (or some other Graph DB) is the kind of DB I should use for this task (the preferences one).
I started studying both of these tools, and I'm having some doubts.
For Cluster Analysis purposes it looks to me that a standard SQL DB would still be the perfect choice, while Neo4j would be better suited for a Neural Network kind of approach (although still perfectly fit for the task).
Am I missing something? Am I trying to use the wrong tools combination?
I would love to hear some ideas on the subject.
Thanks for sharing
This depends on your data. Neo4j can provide even complex recommendations in real time for one particular node. Say you want to recommend a product to a user: this can be handled within a graph DB in real time,
whereas a clustering system is the best way to compute recommendations for all users at once (and then perhaps save them somewhere so you don't need to calculate them again).
The computational difference:
Neo4j has no initialization cost and can give you a single recommendation in an acceptable time.
Clustering needs more time for initialization (not seconds, but most likely minutes or hours) and is better suited to calculating recommendations for the whole dataset. In fact, looking strictly at the time for one calculation for a specific user, clustering can be faster than Neo4j, but the big restriction is the initialization cost, which makes it a poor fit for real-time applications.
The practical difference:
If you have mostly static data and it is OK for you to compute recommendations once in a while, then do clustering with SQL.
If you have dynamic data that is updated with each interaction, and you always need to provide the newest recommendation, then use Neo4j.
I am currently working on various topics related to recommendation and clustering with neo4j.
I'm not exactly sure what you're looking for, but depending on how you model your data in the graph, you can easily work out clustering algorithms based on counting links to various types of nodes.
If you plan your nodes and relationships correctly, you can then identify groups of nodes that share the most links to a set of categories.
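The link-counting idea can be sketched in plain Python as follows. This is only an illustration under assumed data: the `likes` mapping and the `recommend` function are hypothetical, and a real setup would run an equivalent traversal inside the graph database rather than over in-memory sets.

```python
from collections import Counter


def recommend(likes, user, top_n=3):
    """Score items the user lacks by how many links the user shares with each owner."""
    mine = likes[user]
    scores = Counter()
    for other, items in likes.items():
        if other == user:
            continue
        shared = len(mine & items)  # number of links both users have in common
        if shared == 0:
            continue
        for item in items - mine:
            scores[item] += shared
    # Highest score first; ties are broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical user -> liked-items links.
likes = {
    "ann": {"book", "film", "game"},
    "bob": {"book", "film", "album"},
    "cat": {"game", "comic"},
    "dan": {"opera"},
}

print(recommend(likes, "ann"))
```

Because the score is computed for one user on demand, this matches the per-node, real-time style of recommendation rather than a batch clustering run over all users.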
Let me introduce Reco4J (http://www.reco4j.org). It is an open source framework that provides recommendations based on a graph database source. It uses Neo4j as its graph database management system.
Have a look at it and contact us if you are interested in support.
It is in a really early release but we are working hard to provide extended documentation and new interesting features.
Cheers,
Alessandro

MapReduce project with data mining

I am planning to do a MapReduce project involving Hadoop libraries and to test it on big data uploaded to AWS. I have not settled on an idea yet, but I am sure it will involve some kind of data processing, MapReduce design patterns, and possibly graph algorithms, Hive, and Pig Latin. I would really appreciate it if someone could give me some ideas. I have a few of my own in mind.
In the end I have to work on some large data set, extract some information, and derive some conclusions. For this I have previously used Weka for data mining (using trees).
But I am not sure whether Weka is the only thing I can work with right now. Are there any other ways I can work on a large data set and derive conclusions from it?
Also, how can I involve graphs in this?
Basically I want to do a research project, but I am not sure what exactly I should be working on or what it should look like. Any thoughts, links, or ideas?
I suggest you check out Apache Mahout. It is a scalable machine learning and data mining framework that should integrate nicely with Hadoop.
Hive gives you a SQL-like language for querying big data. Essentially, it translates your high-level query into MapReduce jobs and runs them on the cluster.
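As a rough illustration of the MapReduce pattern that a query such as a GROUP BY count maps onto, here is a single-process Python sketch. It is not Hive's actual execution plan or API; the `country` field and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (key, 1) pair for every record."""
    for record in records:
        yield record["country"], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each group of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input rows, roughly "SELECT country, COUNT(*) ... GROUP BY country".
records = [{"country": "DE"}, {"country": "US"}, {"country": "DE"}]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)
```

On a real cluster the map and reduce steps run in parallel on different machines, and the shuffle moves data between them over the network.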
Another suggestion is to consider implementing your data processing algorithm in R, a statistical software package (similar to MATLAB). Instead of the standard R environment, I would recommend R Revolution, an environment for developing in R that comes with much more powerful tools for big data and clustering.
Edit: If you are a student, R Revolution has a free academic edition.
Edit: A third suggestion is to look at GridGain, another Map/Reduce implementation in Java that is relatively easy to run on a cluster.
As you are already working with MapReduce and Hadoop, you can extract some knowledge from your data using Mahout, or you can get some ideas from this very good book:
http://infolab.stanford.edu/~ullman/mmds.html
This book provides ideas for mining social-network graphs, and it works with graphs in a couple of other ways too.
Hope it helps!

Efficient traversal/search algorithm to fetch data from RDF?

I have my data stored as an RDF graph in a DB, and I am retrieving it using SPARQL. The number of nodes (objects) in the graph has grown huge, and traversal/search has become much slower.
a. Can anyone suggest an efficient traversal/search algorithm to fetch the data?
As a next step, I have federated data, i.e. data from external applications like SAP. In this case, the search becomes even slower.
b. What efficient search algorithm should I use in this case?
This seems like a common issue in large enterprise systems, so any input on how these problems have been solved in such systems would also be helpful.
I had a similar problem. I was doing a lot of graph traversal using SPARQL property paths, and it was too slow on an RDF-based repository. I was using Jena TDB, which is supposed to be fast, but it was still too slow!
As @Mikos suggested, I tried Neo4j, and it got much faster. As Mark Watson says in this blog entry,
RDF data stores support SPARQL queries: good for matching patterns in data.
Neo4j supports arbitrary graph structures and seems best for exploring
a neighborhood of a graph: start at a node and explore the connected
nodes. (graph traversal)
I used Neo4j, but you can try any tool that is built for graph traversal. I read that AllegroGraph 4 is RDF-based and has good graph traversal speed.
Now I'm using Neo4j, but I didn't give up on RDF. I still use URIs as identifiers and try to reuse the popular RDF vocabularies and relations. Later I'll add a feature to render my graphs as RDF. I know that with Neo4j you can also use TinkerPop to render RDF, but I haven't tried it myself.
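The "start at a node and explore the connected nodes" style of access described in the quote can be sketched as a depth-limited breadth-first search. This is plain Python over a hypothetical adjacency list, not Neo4j's traversal API; the function name `neighborhood` and the sample graph are assumptions for illustration.

```python
from collections import deque


def neighborhood(adjacency, start, max_depth):
    """Breadth-first exploration from `start`, returning node -> hop distance."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not expand beyond the requested radius
        for neighbor in adjacency.get(node, ()):
            if neighbor not in depths:
                depths[neighbor] = depths[node] + 1
                queue.append(neighbor)
    return depths


# Hypothetical directed graph.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
print(neighborhood(graph, "a", 2))
```

The work done depends on the size of the neighborhood visited rather than on the size of the whole graph, which is why this access pattern suits stores that keep adjacency directly with each node.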
Graph traversal and efficient querying is a wide-ranging problem and the approach to use is dependent on your situation. I would suggest looking at a data-store like Neo4j and complementing it with a tool like Lucene.

Are there other algorithms like LSM tree?

There are many strategies for disk space (and memory) management in databases.
I try to keep track of the best ones, such as the log-structured merge tree as used in BigTable (and HBase, Hypertable, Cassandra) or the fractal tree used in TokuDB. As you can guess from these examples, I mean algorithms that use resources wisely (for example, by avoiding I/O) and scale well.
Are there other algorithms like LSM tree? Just direct me.
Google recently released LevelDB (you can search for it on Google).
People say it is the memtable/SSTable implementation from Google's BigTable.
After reading some of the source code, I think it is a simplified version.
Hope this helps.
There is also nessDB.
It's using a simple LSM-Tree, https://github.com/shuttler/nessDB
H2Database's MVStore uses log-structured storage, which is somewhat similar to an LSM-tree
Fragmented LSM-Tree, implemented in PebblesDB
WiscKey, implemented in this contest project
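For readers comparing these projects, the core memtable/SSTable idea behind an LSM-tree can be sketched in a few lines of Python. This is a toy, in-memory illustration under simplifying assumptions (no disk files, no write-ahead log, no deletes, string keys); the class `TinyLSM` and its methods are hypothetical and do not reflect the API of LevelDB, nessDB, or any other project listed above.

```python
import bisect


class TinyLSM:
    """Toy LSM-tree: a mutable memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.runs = []  # newest run first; each run is a sorted list of (key, value)

    def put(self, key, value):
        # Writes only touch the in-memory table, so no random I/O is needed.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Freeze the memtable into a sorted run (an SSTable stand-in).
        self.runs.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:  # newest first, so the latest value wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # Merge all runs into one, letting newer values overwrite older ones.
        merged = {}
        for run in reversed(self.runs):
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    db.put(key, value)
print(db.get("a"), db.get("b"), db.get("z"))
```

The trade-off shown here is the usual one: writes are cheap and sequential, while reads may have to check several runs until compaction merges them.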

Resources