According to DataStax, each node communicates with the others through the Gossip protocol, which exchanges information across the cluster...
I just wanted to know: is it really possible to replicate 100 GB of data in one second across the cluster? If it is, how is that possible, and with what kind of technique? Can you elaborate?
The gossip protocol is just to share state information around the cluster. This is how Cassandra nodes discover new ones and detect if nodes are unavailable.
Data, however, is not transferred using gossip. Messages are sent directly to replicas during inserts, and bulk streaming is done during bootstrap/decommission/repair.
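To make the distinction concrete, here is a minimal Python sketch of gossip-style state exchange; the node addresses, the heartbeat counter, and the merge rule are illustrative assumptions, not Cassandra's actual implementation:

```python
# Hypothetical sketch: gossip exchanges small state maps, not user data.
# Node addresses and the heartbeat/merge rule are illustrative assumptions.

def gossip_merge(local_state, remote_state):
    """Keep the freshest entry (highest heartbeat) per node."""
    merged = dict(local_state)
    for node, info in remote_state.items():
        if node not in merged or info["heartbeat"] > merged[node]["heartbeat"]:
            merged[node] = info
    return merged

node_a = {"10.0.0.1": {"heartbeat": 42, "status": "UP"}}
node_b = {"10.0.0.1": {"heartbeat": 40, "status": "UP"},
          "10.0.0.2": {"heartbeat": 7,  "status": "UP"}}

# After one round, node A learns about 10.0.0.2 -- a few bytes of metadata,
# never 100 GB of table data.
print(gossip_merge(node_a, node_b))
```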
We are installing an HA-enabled 10-node Hadoop cluster using the Cloudera distribution.
Is it a good idea to have the NameNode and DataNodes on two different subnets separated by a hardware firewall?
As long as network requests work in both directions between the active NameNode (assuming you set up HA) and every DataNode, it should work fine, although the extra network hop would add some latency.
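As a quick sanity check across the firewall, a sketch like the following can verify that the relevant ports are reachable from both subnets; the host names are made up, and the ports (8020 for the NameNode RPC port, 9866 for the DataNode transfer port in Hadoop 3.x) are assumptions you should replace with the values from your own cluster configuration:

```python
# Hedged sketch: check TCP reachability of assumed Hadoop ports across the firewall.
# Hosts and ports (8020 = NameNode RPC, 9866 = DataNode transfer in Hadoop 3.x)
# are assumptions -- substitute the values from your own cluster config.
import socket

def can_connect(host, port, timeout=3):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

checks = [
    ("namenode.subnet-a.example.com", 8020),    # DataNodes must reach this
    ("datanode01.subnet-b.example.com", 9866),  # NameNode/clients must reach this
]
for host, port in checks:
    print(f"{host}:{port} reachable: {can_connect(host, port)}")
```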
In big data networks, a single client interaction can generate a large number of node-to-node interactions to complete the expected operation (for example, a client reading more than a single block of data). Such networks suffer a performance impact from any additional hop count, which increases latency between the client, the NameNode/JobTracker, and the DataNodes/TaskTrackers as data traverses rack switches.
Hadoop provides distributed processing of large data sets across clusters of computers, which means networking plays a key role in the deployment architecture and is directly tied to performance and scalability. HDFS and MapReduce have a high east-west traffic pattern.
In HDFS, if rack awareness is configured for HA, replication is a continuous activity that happens across the network according to the replication factor. The shuffle phase, which transfers data from mappers to reducers, is one of the most bandwidth-consuming activities in Hadoop because all of the involved servers transfer data to each other simultaneously; this puts direct pressure on the network topology.
Also, RPC mechanisms are used by platform services such as HDFS, HBase, and Hive when a client asks the remote service to execute a function. Every RPC requires the response to be sent back to the client as soon as possible; if the response is delayed, the overall command takes longer to execute.
For optimum Hadoop performance, the network must have high bandwidth, low latency, and reliable connectivity between nodes, which boils down to keeping the number of hops as low as possible.
In a typical network deployment, firewalls can impact cluster performance if placed between cluster nodes because they have to inspect every packet. Hence, it is better to avoid firewalls between nodes in the cluster.
I need to provide many Elasticsearch instances for different clients, hosted in my own infrastructure.
For the moment they are only small instances.
I am wondering whether it would be better to build one big Elasticsearch cluster with 3-5 servers to handle all of them, where each client gets a different index and each index is distributed across the servers.
Or is there another approach?
Another question is about quorum: what is the quorum for Elasticsearch?
thanks,
You don't have to assign each client to a different index; an Elasticsearch cluster will automatically spread the load among all of the nodes that hold the shards.
If you are not sure how many nodes are needed, start with a small cluster and keep monitoring its health. Add nodes if the servers are heavily loaded; remove nodes if they are lightly loaded.
As the cluster keeps growing, you may need to assign a dedicated role to each node. This gives you more control over the cluster and makes it easier to diagnose problems and plan resources: for example, add more master nodes to stabilize the cluster, more data nodes to increase search and indexing performance, and more coordinating nodes to handle client requests.
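As a rough illustration of the "start small and monitor" advice, here is a sketch using the official Python client; the endpoint `http://localhost:9200` and the decision thresholds are placeholders, not recommendations:

```python
# Sketch: poll cluster health before deciding to add or remove nodes.
# Assumes the `elasticsearch` Python client and a cluster at localhost:9200.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

health = es.cluster.health()  # overall status: green / yellow / red

print("status:", health["status"])
print("nodes:", health["number_of_nodes"])
print("active shards:", health["active_shards"])

# Example decision rule (placeholder thresholds, not an official recommendation):
if health["status"] != "green" or health["active_shards"] > 1000:
    print("Consider adding data nodes or reviewing shard allocation.")
```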
A quorum is defined as a majority of the master-eligible nodes in the cluster:
(master_eligible_nodes / 2) + 1
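A quick worked example of that formula (a sketch; the integer division is intentional):

```python
def quorum(master_eligible_nodes: int) -> int:
    """Minimum number of master-eligible nodes that must agree."""
    return master_eligible_nodes // 2 + 1

for n in (1, 2, 3, 4, 5):
    print(n, "master-eligible nodes -> quorum of", quorum(n))
# 3 -> 2 and 5 -> 3: this is why an odd number of master-eligible nodes is preferred.
```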
The problem is this: I had 3 DataNodes when I created the cluster, and a few days ago I added another two DataNodes.
After I did this, I ran the balancer; it finished quickly and reported that the cluster was balanced.
But I found that once I put data (about 30 MB) into the cluster, the DataNodes used a lot of bandwidth (about 400 Mbps) to send and receive data between the old DataNodes and the new ones.
Could someone tell me what the possible reason is?
Maybe I didn't describe the problem very clearly, so I'll show you two pictures (from Zabbix): hadoop-02 is one of the "old" DataNodes, and hadoop-07 is one of the "new" DataNodes.
If you mean total network traffic: HDFS uses a write pipeline. Assuming the replication factor is 3, the data flow is
client --> Datanode_1 --> Datanode_2 --> Datanode_3
If the data size is 30 MB, the overall traffic is 90 MB plus a little overhead (connection creation, packet headers, and data checksums inside the packets).
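A back-of-the-envelope sketch of that calculation (the 2% overhead factor is an illustrative assumption, not a measured value):

```python
# Rough sketch of write-pipeline traffic for one upload.
# The 2% overhead factor is an illustrative assumption, not a measured value.
data_mb = 30
replication_factor = 3

pipeline_hops = replication_factor          # client->DN1, DN1->DN2, DN2->DN3
payload_traffic_mb = data_mb * pipeline_hops
overhead_mb = payload_traffic_mb * 0.02     # headers, checksums, connection setup

print(f"payload traffic: {payload_traffic_mb} MB")            # 90 MB
print(f"estimated total: {payload_traffic_mb + overhead_mb:.1f} MB")
```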
If you mean the traffic rate: I believe HDFS currently does not throttle bandwidth between client <--> DN or DN <--> DN; it will use as much bandwidth as it can get.
If you noticed more data flowing between the old DataNodes and the new ones, it might be because some blocks were under-replicated before. After you add new nodes, the NameNode periodically schedules replication tasks from the old DNs to other DNs (not necessarily the new ones).
Hold on! Are you saying that the bandwidth is over-utilized during the data transfer, or that the DNs were not balanced after the data was written? The balancer is only used to balance the amount of data present on the nodes in the cluster.
For example, if I have a GetFile processor that I have designated to be isolated, how do the flow files coming from that processor get distributed across the cluster nodes?
Is there any additional work needed, or processors that need to be added?
In Apache NiFi today, the question of load balancing across the cluster has two main parts. First, you must consider how data gets to the cluster in the first place. Second, once it is in the cluster, do you need to rebalance it?
For getting data into the cluster, it is important to select protocols which are themselves scalable in nature. Protocols which offer queuing semantics are good for this, whereas protocols which do not are problematic. Examples with queuing semantics are JMS queues, Kafka, or some HTTP APIs. Those are great because one or more clients can pull from them in a queue fashion and thus spread the load. Examples of protocols which do not offer such behavior are GetFile, GetSFTP, and so on. These are problematic because the clients have to share state about which data they have already pulled. To address even these protocols we've moved to a model of ListSFTP and FetchSFTP, where ListSFTP runs on one node in the cluster (the primary node) and then uses NiFi's Site-to-Site feature to load balance to the rest of the cluster; each node then gets its share of the work and does FetchSFTP to actually pull the data. The same pattern is now offered for HDFS as well.
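As a conceptual (non-NiFi) illustration of why the list/fetch split scales, the following Python toy separates one "lister" from several "fetchers" pulling from a shared queue; everything here is a simplified stand-in for the ListSFTP/Site-to-Site/FetchSFTP pattern, not NiFi code:

```python
# Toy illustration of the List/Fetch split: one lister enumerates work,
# many fetchers pull items from a shared queue. All names here are made up.
import queue
import threading

work_queue = queue.Queue()

def lister(filenames):
    """Runs on one node only (like ListSFTP on the primary node)."""
    for name in filenames:
        work_queue.put(name)

def fetcher(node_id):
    """Runs on every node; each pulls its own share (like FetchSFTP)."""
    while True:
        try:
            name = work_queue.get_nowait()
        except queue.Empty:
            return
        print(f"node {node_id} fetching {name}")

lister([f"file_{i}.csv" for i in range(10)])
threads = [threading.Thread(target=fetcher, args=(n,)) for n in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```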
In describing that pattern I also mentioned Site-to-Site. This is how two NiFi clusters can share data, which is great for inter-site and intra-site distribution needs. It also works well for spreading load within the same cluster. For this you simply send the data to the same cluster, and NiFi takes care of load balancing, fail-over, and detection of new and removed nodes.
So there are good options already. That said, we can do more: in the future we plan to offer a way to mark a connection as auto-load-balanced, and it will then do what I've described behind the scenes.
Thanks
Joe
Here is an updated answer that is even simpler with newer versions of NiFi. I am running Apache NiFi 1.8.0 here.
The approach is to run a processor on the primary node that emits flow files, which are then consumed via a load-balanced connection.
For example, use one of the List* processors and, under "Scheduling", set its "Execution" to run on the primary node only.
This should feed into the next processor. Select the connection and set its "Load Balance Strategy".
You can read more about the feature in its design document.
How are connection pooling and distribution handled across a Vertica cluster?
I am trying to understand how connections are handled in Vertica, similar to the way Oracle handles its connections through its listener, and how connections are balanced inside the cluster (for better distribution).
Vertica's process of handling a connection is basically as follows:
A node receives the connection, making it the Initiator Node.
The initiator node generates the query execution plan and distributes it to the other nodes.
The nodes fill in any node-specific details of the execution plan.
The nodes execute the query.
(ignoring some stuff here)*
The nodes send the result set back to the initiator node.
The initiator node collects the data and does the final aggregations.
The initiator node sends the data back to the client.
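A toy Python sketch of that initiator/executor flow (purely conceptual; the node names, plan format, and aggregation step are illustrative assumptions, not Vertica internals):

```python
# Conceptual sketch of the initiator/executor flow described above.
# Node names, the plan format, and the aggregation are illustrative assumptions.

def execute_on_node(node, plan):
    """Each node runs its node-specific portion of the plan on its local data."""
    return [row for row in node["data"] if row % plan["modulus"] == 0]  # toy predicate

def initiator_query(nodes, plan):
    """The initiator distributes the plan, collects the partial results, and aggregates."""
    partial_results = [execute_on_node(n, plan) for n in nodes]
    merged = [row for part in partial_results for row in part]
    return {"rows": sorted(merged), "count": len(merged)}  # final aggregation

cluster = [
    {"name": "node01", "data": [1, 2, 3, 4]},
    {"name": "node02", "data": [5, 6, 7, 8]},
    {"name": "node03", "data": [9, 10, 11, 12]},
]
print(initiator_query(cluster, {"modulus": 2}))
```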
The recommended way to connect to Vertica is through a load balancer, so that no single node becomes a point of failure. Vertica itself does not distribute connections between nodes; it distributes the query to the other nodes.
I'm not well versed in Oracle or the details of how other systems handle their connection process, so hopefully I'm not too far off the mark of what you're looking for.
From my experience, each node can handle only a certain number of connections. Once you try to open more than that against a node, it will reject the connection. I ran into this with a map-reduce job that opened a connection inside the map function.
*Depending on the query, data, and partitioning, the nodes may need to do some data transfer behind the scenes to complete the query. This slows the query down when it happens.