Cassandra N copies in N cluster?

I'm trying to configure a Cassandra cluster on EC2.
The thing is that (for my purposes) I want to have N replicas in an N-machine cluster (every machine should hold all the data).
I did the following:
- built an N-machine cluster, with every node listed as a seed, and deployed the schema with replication factor N
- populated the data with write consistency ALL
- now I'm accessing the data with write consistency ANY and read consistency ONE
- I load-balance my clients, so in theory I should get N times the read throughput; however, that is not the case.
nodetool shows the Owns column summing to 100%, instead of N*100% (each node should hold all the data).
Any suggestions?

If you increase the replica count to N, you will not see any write throughput benefit, since Cassandra now has to write N copies. You will also not see any read throughput benefit unless you disable read repair.
Best practice is to keep the replica count constant as you add nodes.
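For reference, here is a minimal sketch of the setup described above using the DataStax Python driver; the contact points, keyspace, and table are hypothetical, and replication factor 3 stands in for N:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Connect through any node; the driver discovers the rest of the ring.
cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])  # hypothetical EC2 IPs
session = cluster.connect()

# Replication factor N (3 here) means every node stores a full copy.
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS demo "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}"
)
session.set_keyspace("demo")
session.execute("CREATE TABLE IF NOT EXISTS kv (k text PRIMARY KEY, v text)")

# Populate with consistency ALL: every replica must acknowledge each write.
insert = SimpleStatement(
    "INSERT INTO kv (k, v) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ALL,
)
session.execute(insert, ("key1", "value1"))

# Read with consistency ONE: any single replica may answer.
select = SimpleStatement(
    "SELECT v FROM kv WHERE k = %s",
    consistency_level=ConsistencyLevel.ONE,
)
print(session.execute(select, ("key1",)).one())
```

Note that reads at ONE can still trigger read repair in the background, which is the behavior the answer above calls out.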

Related

Hadoop Optimization Suggestion

Consider a scenario:
Suppose that in a 10-node cluster I increase the replication factor of my data in HDFS from the default 3 to 5: will it increase the performance of my data processing tasks?
Will the map phase complete sooner compared to the default replication setting?
Will there be any effect on the reduce phase?
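For context, the replication factor of existing HDFS data can be changed per path from the command line; a minimal sketch wrapping the standard CLI (the path is hypothetical):

```python
import subprocess

# Re-replicate an existing HDFS path to 5 copies; -w waits until the
# re-replication completes. New files follow dfs.replication instead.
subprocess.run(
    ["hdfs", "dfs", "-setrep", "-w", "5", "/data/input"],
    check=True,
)
```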
Impact of Replication on Storage:
Replication factor has a huge impact on the storage of the cluster. It is obvious that the larger the replication factor, the less data you can store in the cluster.
If the replication factor is 5, then for every 1 GB of data ingested into the cluster you will need 5 GB of storage space, and you will quickly run out of space in the cluster.
Since the NameNode stores all the metadata in memory, it will also run out of space to store that metadata more quickly. Hence your NameNode will have to be allocated more memory (check HADOOP_NAMENODE_OPTS).
The data copy operation will take more time, since writes are daisy-chained across DataNodes. Instead of 3 DataNodes, now 5 DataNodes will have to confirm storage of the data before a write/append is committed.
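As a quick sanity check of the storage overhead described above, a back-of-the-envelope sketch (a simplified model that ignores non-HDFS disk usage):

```python
def raw_storage_gb(logical_gb: float, replication_factor: int) -> float:
    """Raw HDFS space consumed: every block is stored RF times."""
    return logical_gb * replication_factor

print(raw_storage_gb(1, 5))     # 5.0    -- 1 GB ingested at RF = 5
print(raw_storage_gb(1024, 3))  # 3072.0 -- 1 TB at the default RF = 3
```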
Impact of Replication on Computation:
Mapper:
With a higher replication factor, there are more options for scheduling a mapper. With a replication factor of 3, a mapper can be scheduled on any of 3 different nodes; with a factor of 5, there are 5 choices.
You may be able to achieve better data locality as the replication factor increases. Each mapper is more likely to be scheduled on a node where its data is present (since there are now 5 choices instead of the default 3), thus improving performance.
With better data locality, fewer mappers have to copy off-node or off-rack data.
For these reasons, it is possible that with a higher replication factor the mappers will complete earlier than with a lower one.
Since the number of mappers is typically much higher than the number of reducers, you may see an overall improvement in job performance.
Reducer:
Since the output of each reducer is written directly into HDFS, and that output is itself replicated, it is possible that your reducers will take more time to execute with a higher replication factor.
Overall, your mappers may execute faster with a higher replication factor, but the actual performance improvement depends on various factors such as the size of your cluster, bandwidth, and NameNode memory.
After answering this question, I came across a similar question on SO: Map Job Performance on cluster (reproduced below). It contains some more information, with links to various research papers.
Setting the replication factor to 5 will cause the HDFS NameNode to maintain 5 copies of each file's blocks on the available DataNodes in the cluster. This copying will result in higher network bandwidth usage, depending on the size of the files to be replicated and the speed of your network.
The replication factor has no direct effect on either the map or the reduce phase. You may see a performance hit initially while blocks are still being replicated during a map-reduce job; this can cause significant network latency depending on the size of the files and your network bandwidth.
A replication factor of 5 across your cluster means that 4 of your DataNodes can disappear from the cluster and you will still have access to all files in HDFS with no corruption or missing blocks. With a replication factor of 4, you can lose 3 servers and still have access to all files in HDFS.
Setting a higher replication factor increases your overall HDFS usage: if your total data size is 1 TB, a replication factor of 3 means your HDFS usage will be 3 TB, since each block is duplicated n-1 (3-1 = 2) additional times across the cluster.
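A minimal sketch of the fault-tolerance rule of thumb above (simple arithmetic, no Hadoop API involved):

```python
def survivable_node_failures(replication_factor: int) -> int:
    """DataNodes that can be lost while every block stays readable."""
    return replication_factor - 1

print(survivable_node_failures(5))  # 4
print(survivable_node_failures(4))  # 3
```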

Using multiple node clients in elasticsearch

I'm trying to think of ways to scale our Elasticsearch setup. Do people use multiple node clients on an Elasticsearch cluster and put them behind a load balancer/reverse proxy like Nginx? Other ideas would be great.
So I'll start by recapping the three different kinds of nodes you can configure in Elasticsearch:
- Data node: node.data set to true and node.master set to false. These are the core nodes of an Elasticsearch cluster, where the data is stored.
- Dedicated master node: node.data set to false and node.master set to true. These are responsible for managing the cluster state.
- Client node: node.data set to false and node.master set to false. These respond to client data requests, querying the data nodes for results and gathering the data to return to the client.
By splitting the functions into three base node types, you have a great degree of granularity and control in managing the scale of your cluster. As each node type handles a more isolated set of responsibilities, you are better able to tune each one and scale it appropriately.
For data nodes, it's a function of handling indexing and query responses, along with making certain you have enough storage allocated to each node. You'll want to monitor storage usage and disk throughput for each node, along with CPU and memory usage. You want to avoid configurations where you run out of disk or saturate disk throughput while still having substantial excess CPU and memory, or the reverse, where memory and CPU max out but you have lots of disk available. The best way to determine this is through some benchmarking of typical indexing and querying activities.
For master nodes, you should always have at least 3, and always an odd number. The quorum should be set to N/2 + 1, where N is the number of master nodes; this way you don't run into split-brain issues with your cluster. Dedicated master nodes tend not to be heavily loaded, so they can be quite small.
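A minimal sketch of that quorum rule (in Elasticsearch versions before 7.x this value was set via discovery.zen.minimum_master_nodes; newer versions handle quorum automatically):

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum that avoids split brain: N/2 + 1, with integer division."""
    return master_eligible // 2 + 1

print(minimum_master_nodes(3))  # 2
print(minimum_master_nodes(5))  # 3
```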
For client nodes, you can indeed put them behind a load balancer, or use DNS entries to point to them. They are easily scaled up and down by just adding more to the cluster, and should be added both for redundancy and as you see CPU and memory usage climb. There is not much need for a lot of disk.
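Alternatively, the official clients can spread requests across client nodes themselves; a minimal sketch using the Python client (the host names are hypothetical):

```python
from elasticsearch import Elasticsearch

# The client round-robins requests across the listed nodes by default,
# so a dedicated load balancer in front of them is optional.
es = Elasticsearch(["http://client-node-1:9200", "http://client-node-2:9200"])
print(es.cluster.health())
```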
No matter what your configuration, in addition to benchmarking likely loads ahead of time, I'd strongly advise close monitoring of CPU, memory, and disk. ES is easy to start rolling out, but it needs watching as you scale into larger numbers of transactions and more nodes. Dealing with a yellow or red cluster status due to node failures from memory or disk exhaustion is not a lot of fun.
I'd take a close read of this article for some background:
http://elastic.co/guide/en/elasticsearch/reference/current/modules-node.html
Plus this series of articles:
http://elastic.co/guide/en/elasticsearch/guide/current/distributed-cluster.html

Map Job Performance on cluster

Suppose I have 15 blocks of data and two clusters. The first cluster has 5 nodes and a replication factor of 1, while the second one has a replication factor of 3. If I run my map job, should I expect any change in the performance or execution time of the map job?
In other words, how does replication affect the performance of the mapper on a cluster?
When the JobTracker assigns a job to a TaskTracker, tasks are assigned to particular nodes based on the locality of the data (the preference is the same node first, then the same network switch/frame). With a lower replication factor, you limit the JobTracker's ability to assign tasks to nodes local to the data (the JobTracker will still assign the tasks, just without the benefits of locality). The effect is to restrict the number of TaskTracker nodes that are local to the data (with the data either on the task node or on the same switch/frame), thus hurting the performance of your tasks (reducing parallelization).
Your smaller cluster likely has a single switch, so data is local to the network/frame, and the only bottleneck you might experience would be data transfer from one TaskTracker to another, as the JobTracker is likely to assign jobs to all available TaskTrackers.
But with a larger Hadoop cluster, a replication factor of 1 would limit the number of TaskTracker nodes local to the data and thus able to operate efficiently on it.
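To make the locality argument concrete, here is a rough back-of-the-envelope sketch (an assumed uniform-placement model, not taken from the papers below): with R replicas spread across N nodes, a randomly chosen node with a free slot holds a local copy of a given block with probability roughly R/N.

```python
def p_local(replicas: int, nodes: int) -> float:
    """Chance that a random free node holds a replica of a given block,
    assuming replicas land on distinct, uniformly chosen nodes."""
    return min(replicas / nodes, 1.0)

print(p_local(1, 5))    # 0.2  -- RF = 1 on a 5-node cluster
print(p_local(3, 5))    # 0.6  -- RF = 3 on the same cluster
print(p_local(3, 100))  # 0.03 -- RF = 3 is much weaker on a large cluster
```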
Several papers support the benefits of data locality: http://web.eecs.umich.edu/~michjc/papers/tandon_hpdic_minimizeRemoteAccess.pdf; the paper you cited, http://assured-cloud-computing.illinois.edu/sites/default/files/PID1974767.pdf, also supports data locality; and so does this one, http://www.eng.auburn.edu/~xqin/pubs/hcw10.pdf (which tested a 5-node cluster, the same size as the OP's).
This paper quotes significant benefits to data locality, http://grids.ucs.indiana.edu/ptliupages/publications/InvestigationDataLocalityInMapReduce_CCGrid12_Submitted.pdf, and observes that an increase in replication factor gives better locality.
Note that this paper claims little difference (8%) between network throughput and local disk access, http://www.cs.berkeley.edu/~ganesha/disk-irrelevant_hotos2011.pdf, but reports orders-of-magnitude differences in performance between local memory access and either disk or network access. Furthermore, the paper notes that a large fraction of jobs (64%) find their data cached in memory, "in large part due to the heavy-tailed nature of the workload", as most jobs "access only a small fraction of the blocks".
EDIT: This part of my answer is obsolete now that the other answer has been edited: "The other answer is not entirely correct." It was meant to address the incorrect implication that fewer replicas = less parallelism. The rest of my answer (below) still applies.
Any node can execute your tasks, regardless of whether the data is located on that node or not. Hadoop will try to achieve data locality (the preference order is: node-local, then rack-local, then any node), but if it can't, it will choose any node that has available compute capacity to run your task.
Performance-wise, in a typical multi-rack installation, rack-local performs almost as well as node-local, since the bottleneck occurs when transmitting data across racks. However, with high-end networking equipment (i.e., full-bisection bandwidth), it wouldn't matter whether your computation is rack-local or not. For more details on this, read this paper.
How much performance improvement can you expect from having more replicas (and thus higher data locality)? Not much: a 5-20% improvement at most. And that is an upper limit, reached only when you implement additional popularity-based replication as in these projects. NOTE: I did not just make up those performance-improvement numbers; they come from the papers I linked.
Since vanilla Hadoop does not have these mechanisms in place, I would expect your performance to improve by at most 1-5%. This is just a ballpark guess, but you can easily run some tests yourself. The reason is that more replicas can improve the performance of some of your map tasks (the ones that can now run with a data-local copy of the block), but they will not improve your shuffle and reduce phases. Furthermore, even if just one mapper takes longer than the rest, that one determines the length of your whole map phase; so for many jobs, increasing locality will not improve their running times at all. Finally, I/O-bound jobs can be map-input IO bound, shuffle IO bound (map-output heavy), or reduce-output IO bound; only the first type (map-input IO bound) benefits from locality. More details on MapReduce workload characterization are in this paper.
If you are further interested in this, you can also read this paper, in which the running times of mappers are improved by keeping the input data of ALL the mappers in memory.

Replication factor

I am new to Hadoop and I want to understand how we determine the highest replication factor we can have for any given cluster. I know that the default setting is 3 replicas, but if I have a cluster with 5 nodes, what is the highest replication factor that I can use in that case? Is there a formula we have to follow to determine the replication factor?
Thank you
The highest replication factor that you can use is a function of the number of nodes in your cluster (as #Tarik said, you cannot have more replicas than nodes in your cluster), your expected usage (how much data you plan to store), AND your cluster's storage capacity.
This other SO question has some calculations on capacity and storage use.
Obviously you cannot have more replicas than nodes, as storing two copies on the same node is useless. That seems to me to be the upper limit.
In the Hadoop environment, the default replication factor is 3, which assumes 3 or more slave machines. The simple formula is: a cluster with N slave nodes supports a replication factor of at most N. Here is more info about replication: http://commandstech.com/replication-factor-in-hadoop/
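Combining the constraints above into one hedged sketch (a rough model that ignores non-HDFS disk usage and uneven per-node capacity):

```python
def max_replication_factor(nodes: int, capacity_tb: float, data_tb: float) -> int:
    """Upper bound on RF: limited by node count and by raw capacity."""
    by_nodes = nodes
    by_capacity = int(capacity_tb // data_tb) if data_tb > 0 else nodes
    return max(1, min(by_nodes, by_capacity))

# A 5-node cluster with 20 TB raw capacity storing 4 TB of data:
print(max_replication_factor(5, 20, 4))  # 5
# The same cluster storing 8 TB can only afford 2 copies of everything:
print(max_replication_factor(5, 20, 8))  # 2
```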

Replication Factor in Hadoop

I have 5 TB of data, and the total storage capacity of the whole cluster is 7 TB; I have set the replication factor to 2.
In this case, how will it replicate the data?
Due to the replication factor, the minimum storage on the cluster (across its nodes) must always be double the size of the data. Do you think this is a drawback of Hadoop?
If the minimum storage on your cluster is not double the size of your data, then you will end up having under-replicated blocks. Under-replicated blocks are those replicated fewer times than the replication factor; so if your replication factor is 2, you will have blocks that effectively have a replication factor of 1.
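A rough sketch with the numbers from the question (a simplified model that ignores non-HDFS disk usage):

```python
data_tb, capacity_tb, rf = 5, 7, 2

needed_tb = data_tb * rf                      # 10 TB wanted for full replication
spare_after_one_copy = capacity_tb - data_tb  # 2 TB left for second copies

fully_replicated = min(data_tb, spare_after_one_copy)  # 2 TB gets 2 copies
under_replicated = data_tb - fully_replicated          # 3 TB stays at 1 copy

print(needed_tb, fully_replicated, under_replicated)   # 10 2 3
```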
And replicating data is not a drawback of Hadoop at all; in fact, it is an integral part of what makes Hadoop effective. Not only does it provide a good degree of fault tolerance, it also helps run your map tasks close to the data, avoiding extra load on the network (read about data locality).
Consider that one of the nodes in your cluster goes down. That node would have some data stored on it, and if you do not replicate your data, part of your data will become unavailable due to the node failure. However, if your data is replicated, the data that was on the failed node will still be accessible from other nodes.
If you do not feel the need to replicate your data, you can always set your replication factor to 1.
Replication of data is not a drawback of Hadoop; it is the factor that makes Hadoop (HDFS) effective. Replicating data to a larger number of slave nodes provides high availability and good fault tolerance for the cluster. If we consider the losses incurred by the client due to downtime of nodes in the cluster (typically millions of dollars), the cost of the extra storage required for replication is much less. So replication of data is justified.
This is a case of under-replication. Assume you have 5 blocks. HDFS was able to create replicas for only the first 3 blocks because of the space constraint; the other two blocks are under-replicated. When HDFS finds sufficient space, it will try to replicate the remaining 2 blocks as well.
