implementing a Distributed System/database - database-installation

We are implementing a system at the company I work for whereby we will need to install the system at various sites belonging to the same client (warehouses). The users at all sites should see the same information, and the system should still be able to work at each site when the network is down. What design/architecture solution would be most suitable?

I suggest you consider CouchDB. Its robust replication feature is designed specifically for this sort of use case. It supports both continuous replication, which can keep the data in the various warehouses in sync in near real time during normal operation, and one-off replication, which can be used to catch up after a network outage.
There's a really good free O'Reilly book: CouchDB: The Definitive Guide, which has a chapter on replication.
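For illustration, here is a minimal sketch of kicking off both kinds of replication through CouchDB's _replicate endpoint, using Python's requests library (the hosts, credentials and the "inventory" database are made-up examples; a production setup would more likely use the _replicator database so replications survive restarts):

    import requests

    # Hypothetical hosts and database names, for illustration only.
    LOCAL = "http://admin:password@localhost:5984"
    REMOTE = "http://admin:password@warehouse-b.example.com:5984"

    # Continuous replication: keeps the local "inventory" database in
    # near-real-time sync with the remote site while the network is up.
    requests.post(
        f"{LOCAL}/_replicate",
        json={"source": f"{LOCAL}/inventory",
              "target": f"{REMOTE}/inventory",
              "continuous": True},
    ).raise_for_status()

    # One-off replication: run it after an outage to pull in the changes
    # that were missed while the WAN was down.
    requests.post(
        f"{LOCAL}/_replicate",
        json={"source": f"{REMOTE}/inventory",
              "target": f"{LOCAL}/inventory"},
    ).raise_for_status()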

Related

NoSQL for multi-site archival logging with full-text search

I'm looking at building a somewhat complex log handling system to replace an old ad-hoc setup and could use a bit of advice. I'm pretty familiar with SQL databases and networking, but am very new to NoSQL stores, which seem to be the key to solving this mess. Note that we have a very good team, but a limited licensing budget, so free/open-source options are vastly preferred. (That said, availability of support if something goes pear-shaped would be nice.)
Requirements:
Archive (test) logs generated in the several GB/day range at multiple sites around the world.
Provide full-text search of those logs at each site, with fairly instantaneous results, for debugging purposes.
Push that archived data back to a central location (though a replica at each site would be absolutely okay).
Provide for analytics of that data back at the central location.
Constraints:
The sites have fairly crap Internet connections for the moment (high latency and fairly low bandwidth). Much of the data is generated during the day and a good portion of the sync would have to lag behind and finish overnight each day.
Sites MUST be able to function if the WAN goes completely off-line.
Extras
The log data is (as usual) highly compressible. Any solution that compresses data in transit from node to node across the WAN is preferred.
Many log files are related to each other in multi-level hierarchies, and that relationship is very important and must be maintained!
Sites will generally not modify the same data or modify it again once stored. This is all archival for the most part.
We can either stream as the logs are generated or push blocks of logs. Streaming is preferred, as it would simplify things considerably.
Options I'm aware of:
Local MySQL and folder structure for logging and local configuration management.
This is what we have now and it's running, but not a long-term solution by any means.
Elasticsearch
I've read that Elasticsearch would probably be really good for this, though from what I understand it doesn't support multi-site.
Cassandra
This seems to have built-in multi-site support, but I'm not exactly familiar with the data-model. Is this a good choice for something like this, or will I hate myself if I give it a try?
CouchDB
This is a document store that seems(?) like a good match for log data, but again doesn't appear to have multi-site support.
Apache Kafka
I read up on this, but I haven't quite wrapped my head around it yet...
Questions:
Do any of these actually let you stream-append logs or are they best suited to dumping completed files in?
Is there a solution I'm missing that might be better?
Any recommendations on multi-site with some of the options that don't support multi-site by themselves?
Interesting links:
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
http://blog.cloudera.com/blog/2015/07/deploying-apache-kafka-a-practical-faq/
https://www.elastic.co/blog/scaling_elasticsearch_across_data_centers_with_kafka
https://kafka.apache.org/08/ops.html
https://github.com/Stratio/cassandra-lucene-index
I may be a bit biased, since Couchbase is my employer, but this sounds like the kind of problem that XDCR (Cross Datacenter Replication) was made to solve.
You could stand up a cluster at each geographical site (Couchbase calls these "datacenters"), and XDCR would then automatically replicate the data (bidirectionally) between sites. If I understand your requirements correctly, this sounds like just what you need.
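As a rough sketch of what that setup looks like over Couchbase's REST API (the endpoint paths are the documented XDCR ones as far as I recall, but double-check them against the docs for your version; host names, bucket names and credentials below are placeholders):

    import requests

    # Placeholder admin endpoint and credentials for the local cluster.
    LOCAL_ADMIN = "http://site-a.example.com:8091"
    AUTH = ("Administrator", "password")

    # 1. Register the remote cluster ("site-b") with the local one.
    requests.post(
        f"{LOCAL_ADMIN}/pools/default/remoteClusters",
        auth=AUTH,
        data={"name": "site-b",
              "hostname": "site-b.example.com:8091",
              "username": "Administrator",
              "password": "password"},
    ).raise_for_status()

    # 2. Start a continuous replication of the "logs" bucket to site-b.
    #    Run the same two calls on site-b, pointing back at site-a, to make
    #    the replication bidirectional.
    requests.post(
        f"{LOCAL_ADMIN}/controller/createReplication",
        auth=AUTH,
        data={"fromBucket": "logs",
              "toCluster": "site-b",
              "toBucket": "logs",
              "replicationType": "continuous"},
    ).raise_for_status()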

Is rethinkdb suitable for a chat application

I saw that RethinkDB has real-time capabilities, which made me think it would be great for a chat application. However, I saw a caveat on the RethinkDB website that says apps requiring high write throughput should consider Riak instead.
What is this write limit it is referring to, and is RethinkDB still suitable for a standard chat application that would support many thousands of concurrent users?
RethinkDB is a good choice for a chat application. In fact, its realtime changefeeds are specifically designed to make it easy to build these kinds of realtime applications.
The FAQ actually states:
In some cases RethinkDB trades off write availability in favor of data consistency. If high write availability is critical and you don’t mind dealing with conflicts you may be better off with a Dynamo-style system like Riak.
Write availability is not the same as write throughput. RethinkDB's write throughput is more than capable of handling thousands of concurrent users (most databases will do fine in this respect).
Regarding write availability: RethinkDB favors consistency, whereas Riak favors availability. This set of tradeoffs is commonly described by the CAP theorem, which states that a distributed system cannot simultaneously guarantee all three properties: consistency, availability, and partition tolerance.
You can read more about what this means in the RethinkDB architecture FAQ.
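To make the changefeed point concrete, here is a minimal sketch with the official Python driver (the "chat" database, "messages" table and field names are invented for the example, and a real app would push the changes out to browsers over websockets):

    from rethinkdb import RethinkDB

    r = RethinkDB()
    conn = r.connect(host="localhost", port=28015, db="chat")

    # Sending a message is an ordinary insert (the table is assumed to exist).
    r.table("messages").insert(
        {"room": "general", "user": "alice", "text": "hello"}
    ).run(conn)

    # Every connected client can subscribe to a changefeed, which blocks
    # and yields each new or updated message as it happens.
    feed = r.table("messages").filter({"room": "general"}).changes().run(conn)
    for change in feed:
        print(change["new_val"])  # push this to the client over a websocket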

understanding about stackoverflow underlying software infrastructure

I wonder which database, or combination of databases, Stack Overflow uses underneath to manage extensive user profile information across various verticals.
In the case of social networking sites like Twitter and Facebook, big data management is done on Hadoop. Does Stack Overflow also handle such high volumes of data?
How about indexing the information; is Redis part of Stack Overflow's solution?
It would be really interesting to understand the solution deployed at the world's most popular technical forum.
This article provides a glimpse of what Stack Overflow's architecture looked like circa March 2011: http://highscalability.com/blog/2011/3/3/stack-overflow-architecture-update-now-at-95-million-page-vi.html
At a high level, it's a .NET application that uses MS SQL Server for the database, Redis for caching, HAProxy for load balancing, and a whole host of other tools, hosted on both Windows and Linux servers (Ubuntu + CentOS).
It doesn't look like they had any Hadoop usage at the time of that article, but that could have changed. They might also be doing something different/custom for map/reduce-type jobs, or might not need anything like that at all yet. With care, SQL servers can be scaled pretty far without needing to lean on "big data" toys. This is especially true if you can serve most reads out of your caching layer.
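As a hedged illustration of that last point, the usual cache-aside pattern looks something like the following sketch (the key scheme and the query_sql_for_profile helper are invented for the example):

    import json
    import redis

    cache = redis.Redis(host="localhost", port=6379)

    def query_sql_for_profile(user_id):
        # Stand-in for the real SQL query against the primary database.
        return {"id": user_id, "name": "example user"}

    def get_user_profile(user_id, ttl_seconds=300):
        # Cache-aside: serve from Redis when possible, fall back to SQL.
        key = f"user:{user_id}:profile"
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)

        profile = query_sql_for_profile(user_id)
        cache.setex(key, ttl_seconds, json.dumps(profile))
        return profile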

Oracle active/active architecture

I'm looking for solutions that can synchronize data between two different geographical sites.
The solution must synchronize data online, or with minimal delay.
What's the best architecture for this?
Which solutions already exist on the market?
Have you looked at Oracle's offerings?
Updated according to your comment: It seems that you are looking for replication capabilities. In general this is a hard problem to solve. If you can limit the problem in some way, such as having central reference data replicated out, but independent satellite data capture, you may find things much easier. Oracle have offerings and there are third party vendors such as this.
I've got no personal experience of these solutions. I suggest that you define in detail exactly what you need to do. Differentiate between infrequently changing data that can be maintained centrally and transactional data captured remotely. Specify how relaxed your consistency rules can be.

Has anyone tried using ZooKeeper?

I'm currently looking into memcached as a way to coordinate a group of servers, but I came across Apache ZooKeeper along the way. It looks interesting, and Yahoo uses it, so it shouldn't be bad, but I'd never heard of it before, so I'm kind of skeptical. Has anyone else given it a try? Any comments or ideas?
ZooKeeper and Memcached have different purposes. You can use memcached to do server coordination, but you'll have to do most of this work yourself. Memcached only allows coordination in that it caches common data lookups to be used by multiple clients. From reading ZooKeeper's documentation, it has a much broader focus than this. ZooKeeper seems to provide support for server clustering, which isn't the same as the cache clustering memcached provides.
Have a look at Brad Fitzpatrick's Linux Journal article on memcached to get a better idea what I mean.
To get an overview of what ZooKeeper is capable of, watch the following presentation by its creators. It's capable of so much more (creating queues, electing master processes amongst a group of peers, distributed high-performance runtime configuration, rendezvous points for disjoint processes, determining whether processes are still running, etc.).
http://zookeeper.sourceforge.net/index.sf.shtml
To answer your question: if "coordination" is what you are looking for, ZooKeeper is much better targeted at that than memcached.
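For a taste of what that coordination looks like in practice, here is a minimal sketch using the kazoo Python client (the znode paths and the "host-1" identifier are placeholders):

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # Liveness: an ephemeral, sequential node disappears automatically if
    # this process dies or loses its session, so peers can see who is up.
    zk.ensure_path("/app/workers")
    zk.create("/app/workers/worker-", value=b"host-1",
              ephemeral=True, sequence=True)

    # Leader election: blocks until this process wins, then runs the callback.
    def lead():
        print("I am now the master process")

    election = zk.Election("/app/election", identifier="host-1")
    election.run(lead)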
ZooKeeper is great for coordinating data across servers. It does a good job of ordering every transaction and guaranteeing that transactions happen in order. However, when you're first breaking into it, the documentation sucks; it's very 'high-level', without enough concrete examples or explanations of how to properly handle certain events. One of the included examples (as of version 3.3.3) had its own bugs in it.
Your code will also need to be cognizant of event-driven interactions versus polling interactions. With a massively distributed architecture, acting upon 'events' can inadvertently create a stampede that may not be desirable for your environment (the herding effect).
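A small sketch of the watch pattern being described, again with kazoo (the /app/config path is made up): ZooKeeper watches are one-shot and fire on every client that registered one, which is exactly how a stampede starts if thousands of clients all re-read and re-register at the same moment.

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()
    zk.ensure_path("/app/config")

    def on_config_change(event):
        # Fired once per change on every client that set a watch. If many
        # clients all call zk.get() here at once, the ensemble takes the
        # full herd simultaneously.
        data, _stat = zk.get("/app/config", watch=on_config_change)
        print("config is now", data)

    # The initial read registers the watch; the callback re-registers it
    # after each event because watches are one-shot.
    data, _stat = zk.get("/app/config", watch=on_config_change)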
