Google Cloud Bigtable Durability/Availability Guarantees

I would like someone from Google to provide some guidelines on the durability and availability guarantees provided by the Cloud Bigtable service.
Here is my understanding so far:
The fact that the minimum cluster requires 3 nodes suggests that, at least within a zone, the data is highly durable and replicated to 3 nodes.
However, this answer by a Googler states that "Cloud Bigtable doesn’t replicate data" — directly contradicting the quote on the Cloud Bigtable homepage which claims it "is built with a replicated storage strategy". So which is it? Is it replicated or not? And if so, how many copies are kept?
The fact that clusters can only be set up within a particular zone suggests that the availability of a cluster is tied directly to the availability of that zone. So if I want to have a highly available Bigtable-based data storage, would it be best practice to set up independent clusters across multiple zones and handle the synchronisation of writes across the clusters myself?
There is no information on whether Bigtable clusters across zones are independent or not. If I were to set up clusters across multiple zones, and one zone goes down, could we expect the clusters in other zones to carry on working? Or is there some underlying single point of failure which could impact clusters even across zones?
Compared to the App Engine datastore which is very specific about these details, the Cloud Bigtable documentation is rather lacking — or, at least, I've not managed to find a page which goes into detail on these aspects.
The Cloud Bigtable docs are similarly vague on other aspects, e.g. on the matter of size limits for values, the documentation states that individual values should stay below "~10 MB per cell". What on earth does "~10 MB" mean?! Can I hardcode a limit of exactly 10 MB and expect it to always work, or will that change from day to day depending on unknown factors?
Anyway, apologies if I sound agitated. I genuinely would like to use the Bigtable service. But I, like presumably many others, need to understand the durability/availability aspects of it before being able to invest in it. Thank you.

On replication:
The answer you referenced is referring to replication of data across Bigtable Clusters, which is not supported at this time. (For example, a Bigtable Cluster in the United States replicating its writes to a second Cluster in Europe)
This concept is separate from replication of data within a Bigtable cluster, which is analogous to replication in HDFS, which is something that the product absolutely does today.
On availability:
Yes, the availability of a Bigtable Cluster is tied to the availability of a Google Cloud Zone.
On Independence:
Yes, Cloud Bigtable clusters are independent across zones. An outage in one zone should not impact the availability of other zones.
On data per cell:
We do not reject writes larger than 10 MB per cell; that figure is a guideline for getting optimal performance.
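Since oversized writes are accepted but discouraged, a client-side guard can make the guideline explicit. A minimal sketch using the HBase-compatible Java client for Bigtable; the table and column names are hypothetical:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CellSizeGuard {
    // ~10 MB guideline from the Bigtable docs; treat it as a soft limit.
    private static final int MAX_CELL_BYTES = 10 * 1024 * 1024;

    static void writeValue(Connection conn, byte[] rowKey, byte[] value) throws Exception {
        if (value.length > MAX_CELL_BYTES) {
            // Writes above the guideline are accepted but may hurt performance;
            // consider chunking the value across multiple cells or rows instead.
            throw new IllegalArgumentException("value exceeds ~10 MB guideline");
        }
        try (Table table = conn.getTable(TableName.valueOf("my-table"))) { // hypothetical table
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("data"), value);
            table.put(put);
        }
    }
}
```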

Related

HBase Replication - Replicate data in 3 data centers

In our application we have data from 3 different countries, and we persist the data in HBase.
In each country, we want to keep the data of all 3 countries.
To achieve this, is it possible to create our Hadoop cluster using data centers in all 3 countries and keep the data replication factor at 3, so that, thanks to the rack-awareness feature, our data gets automatically replicated across all 3 countries?
Any pointers will be of great help.
Thanks
You can't have an HBase cluster spanning countries. This won't work because of latency, failover problems, network issues, etc.
A good option would be to have 3 clusters, one HBase table per country, and sync the tables between clusters as proposed above.
As far as I know, only Google has successfully implemented a multi-country database providing both consistency and availability: Spanner. The key elements of the solution are a private physical network between the data centers and Google's own implementation of NTP, which guarantees that all servers across the world share the same clock to within a few milliseconds.
This solution looks theoretically feasible, but writes may become pretty slow, as data needs to be replicated to 3 nodes located in different geographies. It needs to be tried out to check whether the latency is within a tolerable limit.
Another option could be to have three different HBase clusters at the three locations and design tables in such a way that tables from one HBase cluster can be copied to the others during night hours to keep the data in sync daily. In this case, each HBase cluster will have current data from its own location, but the data from the other two locations will lag by a day.
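For the keep-clusters-in-sync options discussed above, HBase's built-in inter-cluster replication can also ship edits to a remote cluster as they happen. A sketch using the HBase 2.x Admin API, assuming a reachable remote ZooKeeper quorum; the peer ID and quorum string are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.replication.ReplicationPeerConfig;

public class AddPeer {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Point the peer at the remote cluster's ZooKeeper ensemble.
            ReplicationPeerConfig peer = ReplicationPeerConfig.newBuilder()
                    .setClusterKey("zk-remote1,zk-remote2,zk-remote3:2181:/hbase") // placeholder
                    .build();
            admin.addReplicationPeer("country-b", peer); // hypothetical peer ID
            // Replication must also be enabled per column family (REPLICATION_SCOPE => 1).
        }
    }
}
```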

Distributed Spark and HDFS Cluster with 6 to 7 Nodes hardware configuration

I am planning to spin up my development cluster for trend analysis for an infrastructure monitoring application which I am planning to build, using Spark for analysing failure trends and Cassandra for storing incoming data and analysed data.
Consider collecting performance metrics from around 25,000 machines/servers (probably sets of the same application on different servers). I am expecting performance metrics of around 2 MB/sec from each machine, which I am planning to push into a Cassandra table with timestamp and server as the primary key, and application along with some important metrics as clustering keys. I will be running Spark jobs on top of this stored information to analyse failure trends in the performance metrics.
Coming to the question: how many nodes (machines), and of what configuration in terms of CPU and memory, do I need to kick-start my cluster considering the above scenario?
Cassandra needs a well planned out data model for things to run well. It is very much worth spending time planning things out at this stage before you have a large data set and find out you probably would have done better re-arranging the data model!
The "general" rule of thumb is that you shape your model to the queries, while paying attention to avoiding things like really large rows, large deletes, batches and the like, which can carry big performance penalties.
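To make "shape your model to the queries" concrete for the metrics workload in the question, here is one possible starting schema, executed via the DataStax Java driver; the keyspace, table, and column names are illustrative, not a recommendation:

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class CreateMetricsTable {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) { // defaults to localhost:9042
            session.execute("CREATE KEYSPACE IF NOT EXISTS monitoring "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}");
            // Partition by server plus a time bucket so partitions stay bounded;
            // cluster by timestamp so range scans over a time window are cheap.
            session.execute("CREATE TABLE IF NOT EXISTS monitoring.metrics ("
                    + " server text,"
                    + " day date,"
                    + " ts timestamp,"
                    + " application text,"
                    + " cpu double,"
                    + " mem double,"
                    + " PRIMARY KEY ((server, day), ts, application)"
                    + ") WITH CLUSTERING ORDER BY (ts DESC, application ASC)");
        }
    }
}
```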
The docs give a good start on planning and testing that you would probably find useful. I would also recommend the cassandra-stress tool: you can use it to run performance tests against your Cassandra cluster to check latencies and flush out any performance problems. You can use your own schema too, which I personally think is super useful!
If you are using cloud-based hardware like AWS, then it's relatively easy to scale up/down and see what works best for you. You don't need to throw big hardware at Cassandra; it's easier to scale horizontally than vertically.
I'm assuming you are pulling the data back into a separate Spark cluster for the analytics side, so those nodes would be running plain Cassandra (lower hardware specs). If, however, you are using the DataStax Enterprise version (where you can run nodes in Spark "mode"), then you will need beefier hardware for the additional load of Spark driver programs, executors and the like. Another good docs link is the DSE hardware recommendations.

Autoscaling EMR- is it required? Should I just use EC2? Should I just use Qubole?

In order to reduce the time for provisioning, we've decided to keep up a dedicated EMR cluster with 5 instances (we expect to need about 5). In case we need more, we think we'll need to implement some sort of autoscaling.
I'm not familiar at all with EMR- does it support autoscaling? I found this in the docs: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-manage-resize.html
Is that the correct place to look for autoscaling, or am I misunderstanding what they mean by "resize"? I've read that one benefit of EMR is "on-demand processing", and I think that it splits the load between EC2 instances without you specifying how many instances, which gives me the impression that it does the scaling of EC2 instances on its own, meaning we don't need to autoscale ourselves. Am I misunderstanding what "on-demand processing" means?
If the resizing link I provided is appropriate for what I'm trying to do, does anyone have experience with determining when to resize? The doc only describes how but not, for example, how to have an alarm for when to resize. I've used their regular autoscaling service and it allows you to resize based on certain conditions but I'm not seeing this here.
I'm still unsure if autoscaling EMR is a bad idea- is it too involved (since there are entire companies like Qubole that provide this) or maybe not very useful since EMR already uses whatever computing power it needs? I don't know very much about what EMR actually provides so maybe that's why I'm confused.
The page you linked shows ways of either manually or programmatically increasing the number of nodes in your cluster. I couldn't find anything else about autoscaling for EMR.
Unless we're missing some facts, you’d still have to come up with your own scaling algorithm and process. If you’re taking factors into account such as your job backlog, the units of time you’re paying for, the use of less-expensive “spot” instances, multiple clusters, etc, this is probably not a trivial exercise.
In addition to increasing the size of your cluster, there is also downsizing. EMR allows this (manually or programmatically) for task nodes, but states it does not for core nodes: you'd have to terminate a core node through general AWS functionality and risk losing data. If your workloads increase and decrease over time, core node downsizing would be valuable for keeping your costs lower.
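For the programmatic resizing mentioned above, a minimal sketch with the AWS SDK for Java; the instance group ID and target count are placeholders:

```java
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.InstanceGroupModifyConfig;
import com.amazonaws.services.elasticmapreduce.model.ModifyInstanceGroupsRequest;

public class ResizeTaskGroup {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();
        // Grow (or shrink) the TASK instance group to the desired size.
        // Only task groups can be safely shrunk; core groups hold HDFS data.
        emr.modifyInstanceGroups(new ModifyInstanceGroupsRequest()
                .withInstanceGroups(new InstanceGroupModifyConfig()
                        .withInstanceGroupId("ig-EXAMPLE") // placeholder group ID
                        .withInstanceCount(8)));           // desired node count
    }
}
```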
Qubole automatically takes care of all of these things out of the box. You run your jobs from the UI or API and it starts, sizes or resizes the cluster. When you're finished, it downsizes or terminates the cluster. It also allows you to have a minimum number of nodes constantly running at one time. I've also heard that the startup time for Qubole nodes is significantly faster than EMR.
Hope this helps you.
As of late 2016, AWS does not support autoscaling out of the box as part of EMR. However, the EMR API provides all the necessary ingredients to 1) collect monitoring data, and 2) programmatically scale the cluster up and down.
Basically, there are two main options to implement autoscaling for EMR clusters:
Autoscaling Loop: A process that runs on a server and continuously monitors the cluster for its current load. Performance metrics (memory, CPU, I/O, etc.) can be collected at regular intervals and stored in a database. Autoscaling rules are evaluated against the performance metrics, and the cluster's task nodes are scaled up or down if required.
Event-Based Autoscaling: Using CloudWatch metrics (e.g., metrics for EMR, or metrics for EC2), you can programmatically define triggers that are fired under certain conditions (for instance, add nodes if average CPUUtilization of all nodes exceeds 80%).
Both options have their pros and cons. The advantage of option 2 is that it is a serverless approach (it does not require running your own server). The downside is that CloudWatch metrics are collected in batches (typically five-minute intervals), so the data may be slightly delayed or less precise. Also, the event-based approach may not provide the tools required to inspect the current and historical state of your cluster scaling. Option 1, on the other hand, does require a server, but in return gives you more control to customize the logic of your scaling rules, and it lets you keep searchable records of the history of scaling decisions.
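Here is a sketch of the monitoring half of option 1, polling an EMR CloudWatch metric and deciding whether to scale; the cluster ID, metric, thresholds, and intervals are illustrative:

```java
import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.cloudwatch.model.Datapoint;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.GetMetricStatisticsRequest;
import java.util.Date;
import java.util.List;

public class ScalingLoop {
    public static void main(String[] args) throws InterruptedException {
        AmazonCloudWatch cw = AmazonCloudWatchClientBuilder.defaultClient();
        while (true) {
            // EMR publishes cluster metrics under AWS/ElasticMapReduce,
            // keyed by the JobFlowId dimension (i.e. the cluster ID).
            List<Datapoint> points = cw.getMetricStatistics(new GetMetricStatisticsRequest()
                    .withNamespace("AWS/ElasticMapReduce")
                    .withMetricName("YARNMemoryAvailablePercentage")
                    .withDimensions(new Dimension().withName("JobFlowId").withValue("j-EXAMPLE"))
                    .withStartTime(new Date(System.currentTimeMillis() - 15 * 60 * 1000))
                    .withEndTime(new Date())
                    .withPeriod(300)
                    .withStatistics("Average"))
                    .getDatapoints();
            for (Datapoint p : points) {
                if (p.getAverage() < 20.0) {
                    // Low available memory: grow the task group, e.g. with the
                    // ModifyInstanceGroups call sketched earlier.
                }
            }
            Thread.sleep(5 * 60 * 1000); // CloudWatch data arrives in ~5-minute batches
        }
    }
}
```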
You could take a look at Themis, an EMR autoscaling framework developed at Atlassian. Themis implements the autoscaling loop as discussed in option 1 above. Current features include proactive as well as reactive autoscaling, it comes with a Web UI, and the tool is very easy to configure.

How to use HBase and Hadoop to serve live traffic AND perform analytics? (Single cluster vs separate clusters?)

Our primary purpose is to use Hadoop for doing analytics. In this use case, we do batch processing, so throughput is more important than latency, meaning that HBase is not necessarily a good fit (although getting closer to real-time analytics does sound appealing). We are playing around with Hive and we like it so far.
Although analytics is the main thing we want to do in the immediate future with Hadoop, we are also looking to potentially migrate parts of our operations to HBase and to serve live traffic out of it. The data that would be stored there is the same data that we use in our analytics, and I wonder if we could just have one system for both live traffic and analytics.
I have read a lot of reports and it seems that most organizations choose to have separate clusters for serving traffic and for analytics. This seems like a reasonable choice for stability purposes, since we plan to have many people writing Hive queries, and badly written queries could potentially compromise the live operations.
Now my question is: how are those two different use cases reconciled (serving live traffic and doing batch analytics)? Do organizations use systems to write all data in two otherwise independent clusters? Or is it possible to do this out of the box with a single cluster in which some of the nodes serve live traffic and others do only analytics?
What I'm thinking is that we could perhaps have all data coming into the nodes that are used for serving live traffic, and let the HDFS replication mechanisms manage the copying of data into the nodes that are used for analytics (increasing the replication factor beyond the default 3 probably makes sense in such a scenario). Hadoop can be made aware of special network topologies, and it has functionality to always replicate at least one copy to a different rack, so this seems to mesh well with what I'm describing.
The nodes dedicated to live traffic could be set to have zero (or few) map and reduce slots, so that all Hive queries end up being processed by the nodes dedicated to analytics.
The nodes dedicated to analytics would always be a little behind those dedicated to serving live traffic, but that does not seem to be a problem.
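As a rough illustration of the replication idea above, the replication factor can be raised through the HDFS client API; the path and factor here are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RaiseReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 5); // default factor for files created by this client
        FileSystem fs = FileSystem.get(conf);
        // setReplication applies per file, so existing data must be walked explicitly;
        // "/live-data" is a hypothetical path.
        fs.setReplication(new Path("/live-data/events.log"), (short) 5);
    }
}
```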
Does that kind of solution make sense? I am thinking it could be simpler to have one cluster than two, but would this be significantly riskier? Are there known cases of companies using an HBase cluster to serve live traffic while also running batch analytics jobs on it?
I'd love to get your opinions on this :) !
Thanks.
EDIT: What about Brisk? It's based on Cassandra instead of HBase, but it seems to be made exactly for what I'm describing (hybrid clusters). Has anyone worked with it before? Is it mature?
--
Felix
Your approach has a few problems. Even in rack-aware mode, if you have more than a few racks, I don't see how you can be guaranteed that your data will be replicated onto the nodes you intend. If you lose one of your "live" nodes, you will be under-replicated for a while and won't have access to that data.
HBase is greedy in terms of resources and I've found it doesn't play well with others (in terms of memory and CPU) in high load situations. You mention, too, that heavy analytics can impact live performance, which is also true.
In my cluster, we use Hadoop quite a bit to preprocess data for ingest into HBase. We do things like enrichment, filtering out records we don't want, transforming, summarization, etc. If you are thinking you want to do something like this, I suggest sending your data to HDFS on your Hadoop cluster first, then offloading it to your HBase cluster.
There is nothing stopping you from having your HBase cluster and Hadoop cluster on the same network backplane. I suggest that instead of having hybrid nodes, you just dedicate some nodes to your Hadoop cluster and some nodes to your HBase cluster. The network transfer between the two will be quite snappy.
Just my personal experience so I'm not sure how much of it is relevant. I hope you find it useful and best of luck!
I think this kind of solution might make sense, since MR is mostly CPU-intensive and HBase is a memory-hungry beast. What we need is to arrange resource management properly. I think it is possible in the following way (sketched in code after this list):
a) CPU. We can define the maximum number of MR mapper/reducer slots per node and, assuming that each mapper is single-threaded, limit the CPU consumption of MR. The rest will go to HBase.
b) Memory. We can limit memory for mappers and reducers and give the rest to HBase.
c) I think we cannot properly manage HDFS bandwidth sharing, but I do not think it should be a problem for HBase, since disk operations are not on its critical path.
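A sketch of the MR1-era knobs behind (a) and (b); the values are illustrative, and these are cluster-side settings (mapred-site.xml on the TaskTrackers), shown here through the Hadoop Configuration API for concreteness:

```java
import org.apache.hadoop.conf.Configuration;

public class SharedClusterLimits {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // (a) CPU: cap concurrent map/reduce slots per TaskTracker so cores
        // remain free for the HBase region servers.
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2);
        // (b) Memory: bound each task JVM; what's left of the node's RAM goes to HBase.
        conf.set("mapred.child.java.opts", "-Xmx512m");
    }
}
```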

Distributed and replicated data storage for small amounts of data under Windows

We're looking for a good solution to a caching problem. We'd like to distribute a relatively small amount of data (perhaps tens of GBs) among a cluster of web servers such that:
The data is replicated to all nodes
The data is persistent
The data can be accessed locally
Our motivation for a caching solution is that we currently have a single point of failure: a SQL Server database. We're unable to set up a failover cluster for this database, unfortunately. We're already using Memcached to a large extent, but we want to avoid the problem where, if a Memcached node goes down, we'd suddenly have a large number of cache misses and therefore a massive number of requests hitting one endpoint.
We'd prefer instead to have local persistent caches on each web server node so that the resulting load would be distributed. When a retrieval is made, it would pass through the following (see the sketch after this list):
Check for data in Memcached. If it's not there...
Check for data in local persistent storage. If it's not there...
Retrieve data from the database.
When data changes, the cache key is invalidated at both caching layers.
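A sketch of that lookup cascade; the Store interface and its implementations are hypothetical, shown in Java for concreteness:

```java
import java.util.Optional;

public class TieredCache {
    interface Store { Optional<String> get(String key); void put(String key, String value); }

    private final Store memcached;  // shared, in-memory tier
    private final Store localStore; // per-node persistent tier
    private final Store database;   // source of truth (the SQL Server database)

    TieredCache(Store memcached, Store localStore, Store database) {
        this.memcached = memcached;
        this.localStore = localStore;
        this.database = database;
    }

    String get(String key) {
        // 1. Check for data in Memcached. If it's not there...
        Optional<String> v = memcached.get(key);
        if (v.isPresent()) return v.get();
        // 2. Check for data in local persistent storage. If it's not there...
        v = localStore.get(key);
        if (v.isPresent()) {
            memcached.put(key, v.get()); // repopulate the faster tier
            return v.get();
        }
        // 3. Retrieve data from the database, then backfill both cache tiers.
        String fromDb = database.get(key).orElseThrow();
        localStore.put(key, fromDb);
        memcached.put(key, fromDb);
        return fromDb;
    }
}
```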
We've been looking at a bunch of potential solutions, but none of them seem to match exactly what we need:
CouchDB
This is pretty close; the data model we'd like to cache is very document-oriented. However, its replication model isn't exactly what we're looking for. It seems to me as though replication is an action you have to perform rather than a permanent relationship among nodes. You can set up continuous replication, but this doesn't persist between restarts.
Cassandra
This solution seems to be mostly geared toward those with large storage requirements. We have a large number of users, but small amounts of data. Cassandra looks to be able to support n failover nodes, but 100% replication among nodes doesn't seem to be what it's intended for; it seems geared more toward distribution than full replication.
SAN
One attractive idea is that we can store a bunch of files on a SAN or similar type of appliance. I haven't worked with these before, but it seems like this would still be a single point of failure; if the SAN goes down, we'd suddenly be going to the database for all cache misses.
DFS Replication
A simple Google search revealed this. It seems to do what we want; it synchronizes files across all nodes in a replication cluster. But the marketing text makes it look like it's more of a system for ensuring documents are copied to different office locations. Also, it has limits, like a file count maximum, that wouldn't work well for us.
Have any of you had similar requirements to ours and found a good solution that meets your needs?
We've been using Riak successfully in production for several months now for a problem that's somewhat similar to what you describe. We too have evaluated CouchDB and Cassandra before.
The advantage of Riak for this sort of problem, IMO, is that distribution and data replication are at the core of the system. You define how many replicas of the data you want across the cluster, and it takes care of the rest (it's a bit more complicated than that, of course, but that's the essence). We went through adding nodes and removing nodes, had nodes crash, and it's proven surprisingly resilient.
It's a lot like Couch in other respects: document-oriented, REST interface, Erlang.
You could check out Hazelcast.
It does not persist the data, but it provides failover: each node can have a configured number of backup nodes that hold copies of its data in case the node fails.
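A minimal sketch of what that looks like in Hazelcast's Java API; the map name and backup count are illustrative:

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class HazelcastBackups {
    public static void main(String[] args) {
        Config config = new Config();
        // Keep 2 synchronous backup copies of each entry on other nodes,
        // so a single node failure does not lose cached data.
        config.addMapConfig(new MapConfig("web-cache").setBackupCount(2));
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        IMap<String, String> cache = hz.getMap("web-cache"); // hypothetical map name
        cache.put("key", "value");
    }
}
```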
