how to distribute neo4j to several machines(is it possible at all)? - hadoop

how should I distribute neo4j in order to traverse n numbers of graphs on different machines,concurrently?and each machine returns its result,so the results can be compared with each other(reminds me map reduce,am I right?),and the best be selected?Can that be done?
which tools should I use?hadoop?
I will be really thankful if you give me tutorials too.

Neo4J distribution is supported through replication of the data, storing the data on a single machine and read it from many machines.
Neo4J doesn't automatically shard the data across multiple machines, this has to be handled at the application layer.
Not sure why this is difficult to implement in a graph database. Noe4J might be offering this feature in their future releases.
Check the HA documentation from Neo4J for more details.

Related

When to create or reuse an Elasticsearch cluster?

My team has been using a minimal Elasticsearch implementation for a year now, and we'd now like to additionally use ES for a new and totally different use case, using different and essentially unrelated data. While I have been reading about Clusters, Nodes, and ES in general, I do not intuitively understand whether or not we should create a new cluster for this, or add the data into our existing cluster. Where is a good place to look to better understand the factors involved in this decision? We're using ES hosted by Elastic Cloud, v5.2.x for the record.
If you have the resources available, it does not hurt to use the same cluster for multiple types of data/use cases.
If I were you I would just take a look at the Monitoring page to check out your usage statistics like storage, search rates, indexing rates, etc. to see if you have the resources available. If so, you don't really need to have separate clusters.

Best technology stack for aggregation across various properties

We are working on developing a platform which models flow of entities across a graph. The system has to answer questions of the kind how many entities having these properties are sitting at a given node on the graph , what is the inflow on a node, outflow on a node etc. Flow data is fed to the system in a stream. We are thinking of breaking the flow data in time buckets(say 5 mins) and pre-compute various aggregates against different properties and storing the aggregates in DynamoDB to serve queries.
With regards to this we are evaluating the following options:
EMR: Put flow data in AWS -S3/DynamoDB run a Map Reduce/hive job
Putting recent data into AWS- RDS, computing the aggregates via sql
Akka: It is a framework to build distributed applications via Actors
and Message passing.
If anyone has worked on similar usecase or has used any of the above technologies, please let me know what approach would be best fit for our use case.
I have used EMR to process data in S3... works pretty well. And the best part is you can spin up hadoop clusters of various sizes that fit the work load.
you may want to look into Storm for stream processing
I am also collecting a list of big-data tools here: http://hadoopilluminated.com/hadoop_book/Bigdata_Ecosystem.html
The final solution employed AWS Redshift, the driving reason was the requirement of high speed data ingestion, which Redshift provides via the COPY command.
Hadoop is built to store the data efficiently, however it does not gurantees a sub-second sla for ingestion, neither does it provide an SLA for when the data will be available for MR jobs, this was the main reason we did not go with EMR or Hadoop in general.

Analytics and Mining of data sitting on Cassandra

We have a lot of user interaction data from various websites stored in Cassandra such as cookies, page-visits, ads-viewed, ads-clicked, etc.. that we would like to do reporting on. Our current Cassandra schema supports basic reporting and querying. However we also would like to build large queries that would typically involve Joins on large Column Families (containing millions of rows).
What approach is best suited for this? One possibility is to extract data out to a relational database such as mySQL and do data mining there. Alternate could be to attempt at use hadoop with hive or pig to run map reduce queries for this purpose? I must admit I have zero experience with the latter.
Anyone have experience of performance differences in one one vs the other? Would you run map reduce queries on a live Cassandra production instance or on a backup copy to prevent query load from affecting write performance?
In my experience Cassandra is better suited to processes where you need real-time access to your data, fast random reads and just generally handle large traffic loads. However, if you start doing complex analytics, the availability of your Cassandra cluster will probably suffer noticeably. In general from what I've seen it's in your best interest to leave the Cassandra cluster alone, otherwise the availability starts suffering.
Sounds like you need an analytics platform, and I would definitely advise exporting your reporting data out of Cassandra to use in an offline data-warehouse system.
If you can afford it, having a real data-warehouse would allow you to do complex queries with complex joins on multiples tables. These data-warehouse systems are widely used for reporting, here is a list of what are in my opinion the key players:
Netezza
Aster/TeraData
Vertica
A recent one which is gaining a lot of momentum is Amazon Redshift, but it is currently in beta, but if you can get your hands on it you could give this a try since it looks like a solid analytics platform with a pricing much more attractive than the above solutions.
Alternatives like using Hadoop MapReduce/Hive/Pig are also interesting to look at, but probably not a replacement for Hadoop technologies. I would recommend Hive if you have a SQL background because it will be very easy to understand what you're doing and you can scale easily. There are actually already libraries integrated with Hadoop, like Apache Mahout, which allow you to do data-mining on a Hadoop cluster, you should definitely give this a try and see if it fits your needs.
To give you an idea, an approach that I've used that has been working well so far is pre-aggregating the results in Hive and then have the reports themselves generated in a data-warehouse like Netezza to compute complex joins .
Disclosure: I'm an engineer at DataStax.
In addition to Charles' suggestions, you might want to look into DataStax Enterprise (DSE), which offers a nice integration of Cassandra with Hadoop, Hive, Pig, and Mahout.
As Charles mentioned, you don't want to run your analytics directly against Cassandra nodes that are handling your real-time application needs because they can have a substantial impact on performance. To avoid this, DSE allows you to devote a portion of your cluster strictly to analytics by using multiple virtual "datacenters" (in the NetworkToplogyStrategy sense of the term). Queries performed as part of a Hadoop job will only impact those nodes, essentially leaving your normal Cassandra nodes unaffected. Additionally, you can scale each portion of the cluster up or down separately based on your performance needs.
There are a couple of upsides to the DSE approach. The first is that you don't need to perform any ETL prior to processing your data; Cassandra's normal replication mechanisms keep the nodes devoted to analytics up to date. Second, you don't need an external Hadoop cluster. DSE includes a drop-in replacement for HDFS called CFS (CassandraFS), so all source data, intermediate results, and final results from a Hadoop job can be stored in the Cassandra cluster.

Distributed and replicated data storage for small amounts of data under Windows

We're looking for a good solution to a caching problem. We'd like to distribute a relatively small amount of data (perhaps 10's of GBs) among a cluster of web servers such that:
The data is replicated to all nodes
The data is persistent
The data can be accessed locally
Our motivation for a caching solution is that we currently have a single point of failure: a SQL Server database. We're unable to set up a fail-over cluster for this database, unfortunately. We're already using Memcached to a large extent, but we want to avoid the problem where if a Memcached node goes down, we'd suddenly have a large amount of cache misses and therefore experience a massive amount of requests to one endpoint.
We'd prefer instead to have local persistent caches on each web server node so that the resulting load would be distributed. When a retrieval is made, it would pass through the following:
Check for data in Memcached. If it's not there...
Check for data in local persistent storage. If it's not there...
Retrieve data from the database.
When data changes, the cache key is invalidated at both caching layers.
We've been looking at a bunch of potential solutions, but none of them seem to match exactly what we need:
CouchDB
This is pretty close; the data model we'd like to cache is very document-oriented. However, its replication model isn't exactly what we're looking for. It seems to me as though replication is an action you have to perform rather than a permanent relationship among nodes. You can set up continuous replication, but this doesn't persist between restarts.
Cassandra
This solution seems to be mostly geared toward those with large storage requirements. We have a large amount of users, but small amounts of data. Cassandra looks to be able to support n number of fail-over nodes, but 100% replication among nodes doesn't seem to be what it's intended for; instead, it seems more geared toward distribution only.
SAN
One attractive idea is that we can store a bunch of files on a SAN or similar type of appliance. I haven't worked with these before, but it seems like this would still be a single point of failure; if the SAN goes down, we'd suddenly be going to the database for all cache misses.
DFS Replication
A simple Google search revealed this. It seems to do what we want; it synchronizes files across all nodes in a replication cluster. But the marketing text makes it look like it's more of a system for ensuring documents are copied to different office locations. Also, it has limits, like a file count maximum, that wouldn't work well for us.
Have any of you had similar requirements to ours and found a good solution that meets your needs?
We've been using Riak successfully in production for several months now for a problem that's somewhat similar to what you describe. We too have evaluated CouchDB and Cassandra before.
The advantage of Riak in this sort of problems imo is that distribution and data replication are at the core of the system. You define how many replicas of the data across the cluster you want and it takes care of the rest (it's a bit more complicated than that of course, but that's the essence). We went through adding nodes, removing nodes, had nodes crush, and it's proven surprisingly resilient.
It's a lot like Couch in other matters - document oriented, REST interface, Erlang.
You can check the hazelcast.
It does not persist the data but provides a fail-over system. Each node can have a number of nodes to backup it's data in case a node fails.

How to build a distributed robust linked list on several computers on the net?

I was thinking about building a program that use a raid(disk) like algorithms. If one computer dies. The next will step in. In it's place. And it need to scale from 1 - 1000 computers.
I need some advice.
What the name of the algorithms I'm need to learn?
At one point I thought it was possible to build it on top of git.
You may want to read this paper on the Google File System. From the abstract:
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
Try Hazelcast. It has distributed implementation of Set, List and more. Hazelcast is an open source transactional, distributed/partitioned implementation of queue, topic, map, set, list, lock and executor service. It is super easy to work with; just add hazelcast.jar into your classpath and start coding. Almost no configuration is required.
Hazelcast is released under Apache license and enterprise grade support is also available. Code is hosted at Google Code.
Distributed hash tables pop into my mind...
I've seen both Hadoop and the Google File System mentioned, but nobody has specifically mentioned HDFS - the distributed filesystem that comes with Hadoop. You can set the desired level of redundancy, and lose the occasional node without losing your data.
One caveat: You need to make sure the one machine that holds the "namenode" (the master machine and single point of failure in an HDFS cluster) is solid - RAID mirroring, backups, the works. You lose the namenode, you lose the cluster.
Also check out the MapReduce algorithm. It's a relatively simple way of getting high scalability, that doesn't force the algorithm designer to think about locking, communication, etc. There are several implementations available, for example the open-source Hadoop by the Apache foundation.
1) You can use distributed locks/mutexes as in:
A sqrt(N) Algorithm for Mutual Exclusion in Decentralized Systems, by Maekawa: http://portal.acm.org/citation.cfm?id=214445
On the performance of distributed lock-based synchronization, by Lubowich and Taubenfeld: http://portal.acm.org/citation.cfm?id=1946155
2) Or you can use lock-free linked lists as in:
Lock-Free Linked Lists and Skip Lists, by Fomitchev
and Rupert: http://www.cse.yorku.ca/~ruppert/papers/lfll.pdf
Lock-free linked lists using compare-and-swap, by Valois: http://portal.acm.org/citation.cfm?id=224988
You could build something like memcached. Each hash entry could be a file block (e.g. SHA hash of block to contents).
BitTorrent? :)
You might want to check out Appistry EAF. Its a distributed execution platform. It handles all the failover of tasks for you, so you don't have to build that into your code. If one node fails, another node automatically takes over. And unlike Grid, there is no centralized controller, to you remove the single point of failure/bottleneck of those types of solutions.
There is a free download available up to 5 machines.

Resources