I am reading the Cassandra 2.0 manual. A quotation (page 19, section NetworkTopologyStrategy, second paragraph):
NetworkTopologyStrategy places replicas in the same data center by walking the ring clockwise until reaching the first node in another rack.
What is meant by rack?
It's a physical enclosure that houses several servers:
See e.g. this Wikipedia article.
One common architecture is for each rack to house an Ethernet switch that interconnects all servers within that rack, meaning there can be a lot of available bandwidth between servers in the same rack. This is topology knowledge that distributed systems can exploit to make the best use of available resources.
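To tie this back to the quoted NetworkTopologyStrategy paragraph, here is a minimal sketch (using the DataStax Python driver; the keyspace name, contact point, and data center name are illustrative assumptions) of how a keyspace declares per-data-center replication, which the configured snitch and NetworkTopologyStrategy then spread across distinct racks:

    # Sketch: create a keyspace whose replicas NetworkTopologyStrategy will
    # place within the named data center. Names below are assumptions.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])   # assumed contact point
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3}
    """)

With three replicas in DC1, the strategy described in the quote tries to put them on three different racks when the data center has that many.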
Read the Understanding Hadoop Clusters and the Network article for more insight.
Related
My ES cluster has 20 machines running 50 nodes (ES instances), and I'm not sure how many racks I should set. Are two racks enough, or are 3 or 4 racks better?
As far as I know, if I set rack_id in the ES configuration, it provides the following functions:
1. Deciding data location and relocation (to make sure replicas end up in different racks)
2. Using rack_id for doc routing
Are there any reasons I should set more racks? I think even just one rack (the default) might be fine.
The chance of an outage of two machines is highest if they share hardware (because you use VMs), smaller if they share a rack but not hardware, and smaller again if they share a building but not a rack. So it makes sense to use more than a single rack.
Whether you need more than 2 racks depends on your replicas. The default number of replicas is 1. If you require a higher value, then strictly speaking you will degrade the availability of your cluster a bit by using only 2 racks, because a replica count >= 3 cannot be fully effective at the rack level.
I think that in your case, it's simpler and easier to just set cluster.routing.allocation.same_shard.host to true. (See https://www.elastic.co/guide/en/elasticsearch/reference/current/shards-allocation.html) This will prevent copies of the same shard from being placed on the same host (a host is identified by address and host name). Please test this before going into production with this approach.
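As an illustration only, here is a minimal sketch (Python with the requests library; the host and port are assumptions) of applying that setting through the cluster settings API:

    # Sketch: persistently enable same_shard.host so two copies of one shard
    # never land on the same physical machine. Endpoint details are assumptions.
    import requests

    resp = requests.put(
        "http://localhost:9200/_cluster/settings",
        json={
            "persistent": {
                "cluster.routing.allocation.same_shard.host": True
            }
        },
    )
    resp.raise_for_status()
    print(resp.json())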
Also, keep in mind that you need to specify the processors setting (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-threadpool.html#processors) accordingly. Each ES node detects the number of cores available on the machine and is not aware of the other nodes present. With multiple nodes on the same machine, each node can think that it has dedicated access to all cores on the machine, which is problematic because the default thread pool sizes are derived from this. So you will want to explicitly specify the number of cores available via the processors setting so that the nodes do not end up over-allocating their thread pools.
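For example, a rough sketch of that arithmetic (the node count per machine is an assumption based on the 50-nodes-on-20-machines figure above):

    # Sketch: divide a machine's cores among the ES nodes running on it, and
    # use the result for each node's `processors` setting. Numbers are
    # illustrative.
    import multiprocessing

    nodes_per_machine = 3                      # assumed: ~50 nodes on 20 machines
    total_cores = multiprocessing.cpu_count()

    processors_per_node = max(1, total_cores // nodes_per_machine)
    print(f"processors: {processors_per_node}  # set this on each node of this machine")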
I recommend using dedicated master nodes, and to ensure cluster stability, each dedicated master node instance should be on its own machine (it can certainly be a much smaller machine, e.g. 4 GB of RAM to start with).
For example, if I have a GetFile processor that I have designated to be isolated, how do the flow files coming from that processor get distributed across the cluster nodes?
Is there any additional work / processors that need to be added?
In Apache NiFi today the question of load balancing across the cluster has two main answers. First, you must consider how data gets to the cluster in the first place. Second, once it is in the cluster, do you need to rebalance it?
For getting data into the cluster it is important that you select protocols which are themselves scalable in nature. Protocols which offer queuing semantics are good for this, whereas protocols which do not are problematic. As examples of protocols with queuing semantics, think JMS queues, Kafka, or some HTTP APIs. Those are great because one or more clients can pull from them in a queue fashion and thus spread the load. Examples of protocols which do not offer such behavior would be GetFile or GetSFTP and so on. These are problematic because the clients have to share state about which data they have already seen and pulled. To address even these protocols we've moved to a model of ListSFTP and FetchSFTP, where ListSFTP runs on one node in the cluster (the primary node) and then uses NiFi's Site-to-Site feature to load-balance the listings to the rest of the cluster; each node then gets its share of the work and runs FetchSFTP to actually pull the data. The same pattern is now offered for HDFS as well.
In describing that pattern I also mentioned Site-to-Site. This is how two NiFi clusters can share data, which is great for inter-site and intra-site distribution needs. It also works well for spreading load within the same cluster: for this you simply send the data to the same cluster, and NiFi takes care of load balancing, failover, and detection of new and removed nodes.
So there are great options already. That said, we can do more, and in the future we plan to offer a way for you to indicate on a connection that it should be auto-load-balanced; it will then do behind the scenes what I've described.
Thanks
Joe
Here is an updated answer that is even simpler with newer versions of NiFi. I am running Apache NiFi 1.8.0 here.
The approach I found here is to use a processor on the primary node that will emit flow files to be consumed via a load-balanced connection.
For example, use one of the List* processors, in "Scheduling" set its "Execution" to run on the primary node.
This should feed into the next processor. Select the connection and set its "Load Balance Strategy".
You can read more about the feature in its design document.
I am using client-side partitioning on a 4-node Redis setup. The writes and reads are distributed among the nodes. Redis is used as a persistence layer for volatile data as well as a cache by different parts of the application. We also have a Cassandra deployment for persisting non-volatile data.
On Redis we peak at nearly 1k ops/sec (instantaneous_ops_per_sec). The load is expected to increase with time. There are many operations where we query for a non-existent key to check whether data is present for that key.
I want to achieve following things:
Writes should fail over to something when a Redis node goes down.
There should be a backup for reading the data lost when the redis node went down.
If we add more redis nodes in the future (or a dead node comes back up), reads and writes should be re-distributed consistently.
I am trying to figure out a suitable design to handle the above scenario. I have thought of the following options:
Create hot slaves for the existing nodes and swap them in as and when a master goes down. This will not address the third point.
Write an application layer to persist data in both Redis and Cassandra, allowing a lazy-load path for reads when a Redis node goes down (a rough sketch of this is below). This approach has the overhead of writing to two stores.
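Here is roughly what I have in mind for option 2, as a minimal sketch (the table name, key layout, and hosts are placeholders, and error handling is reduced to the bare minimum):

    # Sketch of the dual-write / lazy-load idea: every write goes to Cassandra
    # and (best effort) to Redis; reads try Redis first and repopulate it from
    # Cassandra on a miss. All names and hosts below are placeholders.
    import json

    import redis
    from cassandra.cluster import Cluster

    r = redis.Redis(host="redis-host", port=6379)
    session = Cluster(["cassandra-host"]).connect("my_keyspace")

    def write(key, value):
        payload = json.dumps(value)
        session.execute(
            "INSERT INTO kv_store (key, value) VALUES (%s, %s)", (key, payload)
        )
        try:
            r.set(key, payload)
        except redis.ConnectionError:
            pass  # Redis node is down; Cassandra still holds the data

    def read(key):
        try:
            cached = r.get(key)
            if cached is not None:
                return json.loads(cached)
        except redis.ConnectionError:
            pass  # fall through to Cassandra
        row = session.execute("SELECT value FROM kv_store WHERE key = %s", (key,)).one()
        if row is None:
            return None
        try:
            r.set(key, row.value)  # lazily repopulate the cache
        except redis.ConnectionError:
            pass
        return json.loads(row.value)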
Which is a better approach? Is there a suitable alternative to the above approaches?
A load of 1k ops/s is far below the capabilities of Redis. You would need to increase it by two or more orders of magnitude before you come close to overloading it. If you aren't expecting to exceed 50-70,000 ops/second and are not exceeding your available single-node memory, I really wouldn't bother with sharding your data, as it is more effort than it is worth.
That said, I wouldn't do sharding client-side. I'd look at something like Twemproxy/Nutcracker to do it for you. This provides a path to a Redis Cluster as well as the ability to scale out connections, and it provides transparent client-side support for failover scenarios.
To handle failover in the client you would want to set up two instances per slot (in your description, a write node) with one slaved to the other. Then you would run a Sentinel constellation to manage the failover.
Then you would need to have your client code connect to Sentinel to get the current master connectivity for each slot. This also means client code which can reconnect to the newly promoted master when a failover occurs. If you have load balancers available you can place your Redis nodes behind one or more (preferably two with failover) and eliminate the client reconnection requirements, but you would then need to implement a Sentinel script or monitor to update the load balancer configuration on failover.
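As a minimal sketch of that client-side discovery (redis-py; the Sentinel hosts and the service name "slot1" are assumptions):

    # Sketch: ask Sentinel for the current master of a slot, send writes there,
    # and optionally send reads to a replica. Hosts and names are assumptions.
    from redis.sentinel import Sentinel

    sentinel = Sentinel(
        [("sentinel-1", 26379), ("sentinel-2", 26379), ("sentinel-3", 26379)],
        socket_timeout=0.5,
    )

    master = sentinel.master_for("slot1", socket_timeout=0.5)    # writes
    replica = sentinel.slave_for("slot1", socket_timeout=0.5)    # reads

    master.set("some-key", "some-value")
    print(replica.get("some-key"))

The master_for/slave_for connections re-resolve through Sentinel, so after a failover the client ends up talking to the newly promoted master without a configuration change.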
For the Sentinel constellation a standard 3-node setup will work fine. If you do your load balancing with software on nodes you control, it would be best to have at least two Sentinel nodes on the load balancers to provide natural connectivity tests.
Given your description I would test out running a single master with multiple read slaves and, instead of hashing in client code, distribute reads to the slaves and writes to the master. This will provide a much simpler setup and likely less complex code on the client side. Scaling read slaves is easier and simpler, and as you describe it the vast majority of ops will be read requests, so it fits your described usage pattern precisely.
You would still need to use Sentinel to manage failover, but that complexity will still exist either way, resulting in a net decrease in code and code complexity. For a single master, Sentinel is almost trivial to set up; the caveats are code to either manage a load balancer or virtual IP, or to handle Sentinel discovery in the client code.
You are opening the distributed database Pandora's box here.
My best suggestion is: don't do it; don't implement your own Redis Cluster unless you can afford losing data and/or you can take some downtime.
If you can afford running on not-yet-production-ready software, my suggestion is to have a look at the official Redis Cluster implementation; if your requirements are low enough for you to roll your own cluster implementation, chances are that you can afford to use Redis Cluster directly, which has a community behind it.
Have you considered looking at software other than Redis? Cassandra, Riak, DynamoDB, and Hadoop are great examples of mature distributed databases that would do what you ask out of the box.
When each file is written into the cluster, HDFS creates a replication pipeline. Assume there are two racks, 1 and 5. According to rack awareness, the first block replica will be saved to rack 1, and the other two replicas will be pushed to rack 5.
I am confused about why the Hadoop pipeline stores the second and third replicas in the same rack (5) rather than storing the first and second replicas in the same rack (1). Both scenarios produce the same network traffic. What is the advantage?
Thanks.
Please refer to this figure http://bradhedlund.s3.amazonaws.com/2011/hadoop-network-intro/HDFS-Pipleline-Write-s.png
This applies to HDFS 1.2. The reason that both replicas are put on the remote rack is to minimize inter-rack traffic. Since the second replica is already on the remote rack, copying the block between two servers on that rack will not consume bandwidth on the link between the racks. If the top-of-rack switch is of efficient design, this keeps the utilization of the inter-rack link to a minimum. On a busy cluster the same process will be happening on both racks: rack A and rack B will generate blocks at roughly the same rate, so following this policy helps to even out the bandwidth utilization of all the links.
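Purely as an illustration of that placement policy (not Hadoop code), a sketch of how the default rack choices fall out:

    # Sketch: default HDFS placement for replication factor 3 -- first replica
    # on the writer's rack, second and third together on one remote rack, so
    # only one copy has to cross the inter-rack link.
    import random

    def place_replicas(local_rack, racks, replication=3):
        remote_rack = random.choice([r for r in racks if r != local_rack])
        # the pipeline sends one copy across the inter-rack link; the third
        # copy is a cheap intra-rack transfer on the remote side
        return [local_rack, remote_rack, remote_rack][:replication]

    print(place_replicas(local_rack=1, racks=[1, 5]))   # -> [1, 5, 5]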
Has anyone tried to test the performance of data nodes across multiple data centers? Especially over networks with small pipes. I can't seem to find too much information on it and the information I have found is either old (circa 2010) or proprietary (seems like DataStax has something). I know Hadoop supports rack awareness but like I said I haven't seen any documentation for tuning a system for multiple data centers.
I've tried it with a 12 x DataNode cluster arranged in a 2:1 ratio split between two data centers roughly 120 miles apart. Latency between data centres was ~4ms across 2 x 1GbE pipes.
2 racks were configured in site A, 1 rack configured in site B. Each "rack" had 4 machines in it. We were basically testing Site B as a 'DR' site. Replication factor was set to 3.
Long story short, it works, but the performance was really, really bad. You definitely have to use compression on your source data and on your map and reduce outputs in order to shrink your write I/O, and if the links between the sites are used for anything else, you will get timeouts while transferring data. TCP windowing effectively limited our transfer to around 4 MBps, instead of a potential 100 MBps+ on a 1 GbE line.
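As a rough back-of-the-envelope check (the effective window size here is an assumption chosen to match what we observed, not a measured value):

    # Sketch: with no window scaling in effect, TCP can keep at most one
    # window of data in flight per round trip, so throughput <= window / RTT.
    rtt_seconds = 0.004          # ~4 ms between the two sites
    window_bytes = 16 * 1024     # assumed effective window

    max_throughput = window_bytes / rtt_seconds      # bytes per second
    print(f"{max_throughput / 1e6:.1f} MB/s")        # -> ~4.1 MB/s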
Save yourself the headache and just use distcp jobs to replicate data!