How to use neo4j as input to hadoop?

I have a large neo4j database. I need to check for multiple patterns existing across the graph, which I was thinking would be easily done in hadoop. However, I'm not sure of the best way to feed tuples from neo4j into hadoop. Any suggestions?

In my opinion, while it can be done, I don't think MapReduce (which I believe is what you mean when you say "Hadoop") is a good (or at least performant) choice for graph analytics. You want a Bulk Synchronous Parallel approach instead. If you want to perform cloud-scale graph analytics, you want Apache Giraph, which "understands" the Hadoop ecosystem.
Then again, I would ask why you need to use anything outside of Neo4J at all. I don't know your use case obviously, but first make sure you can't do what you need to do within Neo4J.
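If it turns out you do need Hadoop, one pragmatic way to feed it tuples is to run your pattern query in Cypher and dump the rows to a TSV file that you then copy into HDFS. Here is a minimal sketch using the official Neo4j Python driver; the connection details, the Cypher pattern, and the file names are hypothetical placeholders, not something taken from your setup:

import csv
from neo4j import GraphDatabase

# Connection URI and credentials are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Hypothetical pattern: pairs of people who share an employer.
QUERY = """
MATCH (a:Person)-[:WORKS_AT]->(c:Company)<-[:WORKS_AT]-(b:Person)
WHERE a.id < b.id
RETURN a.id AS a_id, b.id AS b_id, c.id AS company_id
"""

with driver.session() as session, open("neo4j_tuples.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    for record in session.run(QUERY):
        writer.writerow([record["a_id"], record["b_id"], record["company_id"]])

driver.close()
# The TSV can then be copied into HDFS (hdfs dfs -put neo4j_tuples.tsv ...)
# and used as input to a MapReduce, Pig, or Giraph job.

If the pattern check itself is all you need, the MATCH alone, run directly against Neo4j, may make the Hadoop step unnecessary, which is the point made above.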

Related

How does CouchDB 1.6 inherently take advantage of MapReduce when it is a single-server database?

I am new to CouchDB. While going through the documentation of CouchDB 1.6, I learned that it is a single-server DB, so I was wondering how MapReduce inherently takes advantage of it.
If I need to scale this DB, do I need to add more RAID hardware, or will it work on commodity hardware like HDFS?
I came to know that CouchDB 2.0 is planning to bring a clustering feature, but could not find proper documentation on this.
Can you please help me understand how exactly files get stored and accessed internally?
Really appreciate your help.
I think your question is something like this:
"MapReduce is … a parallel, distributed algorithm on a cluster." [shortened from MapReduce article on Wikipedia]
But CouchDB 1.x is not a clustered database.
So what does CouchDB mean by using the term "map reduce"?
This is a reasonable question.
The historical use of "MapReduce", as described by Google in this paper using that stylized term and implemented in Hadoop with the same styling, implies parallel processing over a dataset that may be too large for a single machine to handle.
But that's not how CouchDB 1.x works. View index "map" and "reduce" processing happens not just on a single machine, but even on a single thread! As dch (a longtime contributor to the core CouchDB project) explains in his answer at https://stackoverflow.com/a/12725497/179583:
The issue is that eventually, something has to operate in serial to build the B-tree in such a way that range queries across the indexed view are efficient. … It does seem totally wacko the first time you realise that the highly parallelisable map-reduce algorithm is being operated sequentially, wat!
So: what benefit does map/reduce bring to single-server CouchDB? Why were CouchDB 1.x view indexes built around it?
The benefit is that the two functions a developer can provide for each index ("map", and optionally "reduce") form very simple building blocks that are easy to reason about, at least once your indexes are designed.
What I mean is this:
With, e.g., the SQL query language, you focus on what data you need, not on how much work it takes to find it. So you might run into unexpected performance problems that may or may not be solved by figuring out the right columns to add indexes on, etc.
With CouchDB, the so-called NoSQL approach is taken to an extreme. You have to think explicitly about how each document or set of documents "should be" found. You say: I want to be able to find all the "employee" documents whose "supervisor" field matches a certain identifier. So now you have to write a map function:
function (doc) {
  // one index row per employee document, keyed by supervisor identifier
  if (doc.isEmployeeRecord) emit(doc.supervisor.identifier, null);
}
And then you have to query it like:
GET http://couchdb.local:5984/personnel/_design/my_indexes/_view/by_supervisor?key="SOME_UUID"
In SQL you might simply say something like:
SELECT * FROM personnel WHERE supervisor = ?
So what's the advantage to the CouchDB way? Well, in the SQL case this query could be slow if you don't have an index on the supervisor column. In the CouchDB case, you can't really make an unoptimized query by accident — you always have to figure out a custom view first!
(The "reduce" function that you provide to a CouchDB view is usually used for aggregate functions purposes, like counting or averaging across multiple documents.)
If you think this is a dubious advantage, you are not alone. Personally I found designing my own indexes via a custom "map function" and sometimes a "reduce function" to be an interesting challenge, and it did pay off in knowing the scaling costs at least of queries (not so much for replications…).
So don't think of CouchDB views so much as being "MapReduce" (in the stylized sense) but just as providing efficiently-accessible storage for the results of running [].map(…).reduce(…) across a set of data. Because the "map" function is applied to only one document at a time, the total set of data can be bigger than fits in memory at once. Because the "reduce" function's output is expected to stay small, it further encourages efficient processing of a large set of data into an efficiently-accessed index.
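As a rough analogy only (this is not how CouchDB actually stores anything), the view above behaves like a precomputed, disk-backed version of this in-memory pipeline; the sample documents are made up:

from collections import Counter

docs = [
    {"isEmployeeRecord": True, "supervisor": {"identifier": "SOME_UUID"}},
    {"isEmployeeRecord": True, "supervisor": {"identifier": "OTHER_UUID"}},
    {"isEmployeeRecord": False},
]

# "map": emit one (key, value) row per matching document
rows = [(d["supervisor"]["identifier"], None)
        for d in docs if d.get("isEmployeeRecord")]

# querying key="SOME_UUID" is just a lookup over the stored rows
matching = [r for r in rows if r[0] == "SOME_UUID"]

# "reduce" (here the equivalent of _count) aggregates rows sharing a key
counts = Counter(key for key, _ in rows)
print(matching, counts["SOME_UUID"])

The difference is that CouchDB computes the rows once, updates them incrementally, and keeps them sorted by key in a B-tree, so both the lookup and the aggregate stay cheap however many documents there are.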
If you want to learn a bit more about how the indexes generated in CouchDB are stored, you might find these articles interesting:
The Power of B-trees
CouchDB's File Format is brilliantly simple and speed-efficient (at the cost of disk space).
Technical Details, View Indexes
You may have noticed, and I am sorry, that I do not actually have a clear/solid answer as to what the actual advantages and reasons were! I did not design or implement CouchDB; I was only an avid user for many years.
Maybe the bigger advantage is that, in systems like Couchbase and CouchDB 2.x, the "parallel friendliness" of the map/reduce idea may come into play more. So if you have designed an app to work with CouchDB 1.x, it may then scale in the newer versions without further intervention on your part.

Pure Spark vs Spark SQL for querying data on HDFS

I have (tabular) data on an HDFS cluster and need to do some slightly complex querying on it. I expect to face the same situation many times in the future, with other data. And so, the question:
What are the factors to take into account to choose where to use (pure) Spark and where to use Spark-SQL when implementing such task?
Here are the selection factors I could think of:
Familiarity with language:
In my case, I am more of a data analyst than a DB guy, so this would lead me to use Spark: I am more comfortable thinking about how to (efficiently) implement data selection in Java/Scala than in SQL. This, however, depends mostly on the query.
Serialization:
I think that one can run a Spark SQL query without sending a home-made jar plus its dependencies to the Spark workers (?). But then the returned data are raw and have to be converted locally.
Efficiency:
I have no idea what differences there are between the two.
I know this question might be too general for SO, but maybe not. So, could anyone with more knowledge provide some insight?
About point 3: depending on your input format, the way the data is scanned can differ between pure Spark and Spark SQL. For example, if your input format has multiple columns but you need only a few of them, Spark SQL can skip retrieving the unused columns, whereas this is a bit trickier to achieve in pure Spark.
On top of that, Spark SQL has a query optimizer: when you use a DataFrame or a query statement, the resulting query goes through the optimizer so that it is executed more efficiently.
Spark SQL does not exclude Spark; combining the two probably gives the best results.
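As an illustration of points 1 and 3, here is a minimal PySpark sketch contrasting the two approaches; the Parquet path and the column names ("country", "amount") are hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pure-spark-vs-spark-sql").getOrCreate()

# Spark SQL / DataFrame version: only the "country" and "amount" columns are
# read from the columnar files, and the plan goes through the Catalyst optimizer.
df = spark.read.parquet("hdfs:///data/events")
per_country = (df.where(F.col("amount") > 0)
                 .groupBy("country")
                 .agg(F.sum("amount").alias("total")))

# Pure-Spark version: the same aggregation on the underlying RDD. Here whole
# rows are materialized as objects before the unused columns are dropped.
totals_rdd = (df.rdd
                .filter(lambda row: row["amount"] > 0)
                .map(lambda row: (row["country"], row["amount"]))
                .reduceByKey(lambda a, b: a + b))

per_country.show()
print(totals_rdd.take(5))

This also touches on point 2: the RDD version ships your lambdas to the workers, while the DataFrame version is expressed entirely with built-in expressions that run inside the engine.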

MapReduce project with data mining

I am planning to do a MapReduce project involving Hadoop libraries and testing it on big data uploaded to AWS. I have not finalized an idea yet, but I am sure it will involve some kind of data processing, MapReduce design patterns, and possibly graph algorithms, Hive, and Pig Latin. I would really appreciate it if someone could give me some ideas about it. I have a few of my own in mind.
In the end I have to work on some large data set, extract some information, and derive some conclusions. For this I have used Weka before for data mining (using trees).
But I am not sure if that is the only thing I can work with right now (using Weka). Are there any other ways by which I can work on large data and derive conclusions from the large data set?
Also, how can I involve graphs in this?
Basically I want to make a research project, but I am not sure what exactly I should be working on and what it should be like. Any thoughts? Suggested links/ideas? Knowledge sharing?
I suggest you check Apache Mahout; it is a scalable machine learning and data mining framework that should integrate nicely with Hadoop.
Hive gives you an SQL-like language to query big data; essentially, it translates your high-level query into MapReduce jobs and runs them on the cluster.
Another suggestion is to consider implementing your data processing algorithm in R; it is statistical software (similar to MATLAB), and instead of the standard R environment I would recommend Revolution R, which is an environment for developing in R with much more powerful tools for big data and clustering.
Edit: If you are a student, Revolution R has a free academic edition.
Edit: A third suggestion is to look at GridGain, which is another Map/Reduce implementation in Java that is relatively easy to run on a cluster.
As you are already working with MapReduce and Hadoop, you can extract some knowledge from your data using Mahout, or you can get some ideas from this very good book:
http://infolab.stanford.edu/~ullman/mmds.html
This book provides ideas for mining social-network graphs, and works with graphs in a couple of other ways too.
Hope it helps!

Hadoop Hypercube

Hey,
I am starting a Hadoop-based hypercube with a flexible number of dimensions.
Does anybody know of any existing approaches for this?
I just found PigOLAPSketch, but there is no code to use it.
Another approach is Zohmg from Last.fm, which uses HBase, but it seems to be very dead.
I think I will start on a Pig solution; maybe you have some advice?
This would be very cool/useful. OpenTSDB is an HBase time-series database that might be interesting to look at; they have a clever approach to secondary indexing.
You can also look at a GPU-based database, https://www.kinetica.com/, but it is not open source and requires separate appliances and moving data from Hadoop into the Kinetica infrastructure.

Ad Hoc Reports Hadoop

I want to allow people to put in simple text search terms, run a Pig job (if that's best? it's what I know best), and output the results (the TSV file results?) so I can show them in a web interface.
Is there anything that addresses this problem?
Anything known to link together the few disjointed pieces of the flow I am going for?
Thanks
Why don't you index the docs into Lucene or Solr? Then you can do text search in real time. Hadoop is designed for batch-oriented processes, which doesn't seem like what you want in this case.
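For example, a minimal sketch of that flow in Python; the Solr core name "reports", the field names, and the Pig output file name are hypothetical, and it assumes a Solr instance on localhost:8983 whose schema accepts those fields:

import csv
import requests

SOLR = "http://localhost:8983/solr/reports"

# Index the TSV output of the Pig job (an id column and a text column assumed).
with open("part-r-00000", newline="") as f:
    docs = [{"id": row[0], "text_t": row[1]}
            for row in csv.reader(f, delimiter="\t")]

requests.post(f"{SOLR}/update?commit=true", json=docs).raise_for_status()

# The simple text search the web interface would call.
hits = requests.get(f"{SOLR}/select",
                    params={"q": "text_t:hadoop", "rows": 10, "wt": "json"}).json()
for doc in hits["response"]["docs"]:
    print(doc["id"])

Pig (or any other batch job) writes the results, Solr serves the interactive queries, and the web interface only ever talks to Solr.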
Well, it depends on your project's requirements: does it need low latency, and how complex is the ad hoc search? I think HBase + Pig might be a compromise solution. HBase can be used for real-time search (although its search capability is not as powerful as an RDBMS's), and Pig for batch processing of large amounts of data.
