How do I set up a WebSphere JMS topic across two clusters?

In WAS 9.0.5, I have a single cell of 2 clusters with 2 servers in each. I'd like to have a JMS topic that all 4 servers can subscribe to. Currently, I have a separate bus, topic connection factory, and activation spec in each cluster, but a message published in one cluster is not consumed in the other cluster.
Should I have a cell level bus with each of the 4 servers as bus members? Or add the 2 clusters as bus members? What's the correct way to set up a topic across multiple clusters?

You should configure a single bus for the cell; that will give you the behaviour you are after. Topics are scoped to the bus, not the cell, so to make your current two-bus topology work you would need to set up an inter-bus link, but since both clusters are in the same cell that would be needlessly complex for your scenario.
There is really little reason to have more than one bus in a cell.
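As a rough sketch of what a single cell-scoped bus gives you (the JNDI names jms/MyTopicCF and jms/MyTopic are assumptions for whatever you bind the cell-level connection factory and topic to), a server in either cluster can then publish and subscribe through that one bus:

```java
import javax.jms.ConnectionFactory;
import javax.jms.JMSConsumer;
import javax.jms.JMSContext;
import javax.jms.Topic;
import javax.naming.InitialContext;

public class TopicSmokeTest {
    public static void main(String[] args) throws Exception {
        // Assumes this runs inside the application server, so the default
        // InitialContext can resolve the cell-scoped JMS resources.
        InitialContext ctx = new InitialContext();
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("jms/MyTopicCF"); // assumed JNDI name
        Topic topic = (Topic) ctx.lookup("jms/MyTopic");                        // assumed JNDI name

        try (JMSContext jms = cf.createContext()) {
            // The consumer could just as well run on a server in the other
            // cluster; both clusters are members of the same bus.
            JMSConsumer consumer = jms.createConsumer(topic);
            jms.createProducer().send(topic, "published in cluster A");
            System.out.println(consumer.receiveBody(String.class, 5000));
        }
    }
}
```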

Related

What is subtopology in Kafka Streams?

I've been exploring the sources of Kafka Streams and am having a hard time understanding what operators make for one or many sub-topologies (and then node groups). Can someone explain what to use so that Topology.describe shows sub-topologies?
What are sub-topologies?
It seems the Apache Kafka docs don't describe them in detail... There is a section in the Confluent docs about it: https://docs.confluent.io/platform/current/streams/architecture.html#stream-partitions-and-tasks
Sub-topologies (also called sub-graphs): [...] A sub-topology is a set of processors, that are all transitively connected as parent/child or via state stores in the topology. Hence, different sub-topologies exchange data via topics and don’t share any state stores. [...]
A node group is the internal name of a sub-topology.
Sub-topologies are scaled independently, i.e., each sub-topology may have a different number of instantiations (so-called tasks) depending on the maximum number of partitions over all its input topics.
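A minimal sketch (the topic name and the key-changing step are made up) of how a sub-topology boundary appears: a key-changing operation followed by an aggregation forces an internal repartition topic, and Topology#describe() then reports two sub-topologies connected only through that topic.

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;

public class SubTopologyDemo {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // selectKey() marks the stream as needing a repartition; the
        // groupByKey().count() that follows therefore reads from an internal
        // "...-repartition" topic instead of directly from "input-topic".
        builder.<String, String>stream("input-topic")
               .selectKey((key, value) -> value)
               .groupByKey()
               .count();

        Topology topology = builder.build();
        // Prints two "Sub-topology" sections: one ending in a sink to the
        // repartition topic, and one starting from a source on that topic.
        System.out.println(topology.describe());
    }
}
```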

kafka broker server uneven CPU utilization

We have a 3-node Kafka cluster with around 32 topics and 400+ partitions spread across these servers. The load is evenly distributed amongst the partitions, however we are observing that 2 broker servers are running at >60% CPU whereas the third one is running at just about 10%. How do we ensure that all servers are running smoothly? Do I need to reassign the partitions (kafka-reassign-partitions cmd)?
PS: The partitions are evenly distributed across all the broker servers.
In some cases, this is a result of the way that individual consumer groups determine which partition to use within the __consumer_offsets topic.
On a high level, each consumer group updates only one partition within this topic. This often results in a __consumer_offsets topic with a highly uneven distribution of message rates.
It may be the case that:
You have a couple very heavy consumer groups, meaning they need to update the __consumer_offsets topic frequently. One of these groups uses a partition that has the 2nd broker as its leader. The other uses a partition that has the 3rd broker as its leader.
This would result in a significant amount of the CPU being utilized for updating this topic, and would only occur on the 2nd and 3rd brokers (as seen in your screenshot).
A detailed blog post is found here
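To check whether that is what is happening, you can estimate which __consumer_offsets partition a given group writes to. This is a small sketch of the broker's group-to-partition hashing (assuming the default offsets.topic.num.partitions of 50; the group id is made up); combine the result with kafka-topics --describe on __consumer_offsets to see which broker leads that partition.

```java
public class ConsumerOffsetsPartition {
    // Default value of offsets.topic.num.partitions (assumed unchanged).
    private static final int OFFSETS_TOPIC_PARTITIONS = 50;

    // A consumer group's offset commits all go to a single partition of
    // __consumer_offsets, chosen by hashing the group id. A few very busy
    // groups can therefore concentrate CPU on whichever brokers lead
    // those particular partitions.
    static int partitionFor(String groupId) {
        return (groupId.hashCode() & 0x7fffffff) % OFFSETS_TOPIC_PARTITIONS;
    }

    public static void main(String[] args) {
        System.out.println("my-heavy-group -> partition " + partitionFor("my-heavy-group"));
    }
}
```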

How to make RabbitMQ scalable?

I tried to test RabbitMQ, but I found that RabbitMQ has some problems:
if I create a cluster of 3 nodes, I can't publish/deliver more than 6000 messages/s;
on the other hand, if I work with one single node, I can publish/deliver up to 25000 messages/s.
Which means that the more nodes I add, the more the performance deteriorates.
But according to this article: https://blog.pivotal.io/pivotal/products/rabbitmq-hits-one-million-messages-per-second-on-google-compute-engine
they can publish more than 1 million messages per second, so how do they do that?
I want to make RabbitMQ process more than 1 million messages per second.
I resolved the problem by adding a load balancer.
The producers send data to the load balancer. On the other side, the load balancer is connected to many RabbitMQ nodes, but those nodes are not clustered with each other (to avoid the synchronization that hurts performance).
This way, the throughput multiplies with the node count (e.g. 3 nodes = 3x throughput).
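A sketch of that setup from the producer side (the load balancer host name is made up; assumes the amqp-client 5.x Java library): producers only ever see the balancer's address, and each connection lands on one of the independent, un-clustered nodes behind it.

```java
import java.nio.charset.StandardCharsets;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class PublishViaLoadBalancer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbitmq-lb.example.internal"); // load balancer address (assumed)
        factory.setPort(5672);

        // Each producer connection terminates on whichever node the balancer
        // picks; the nodes are independent, so there is no mirroring overhead.
        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {
            channel.queueDeclare("events", true, false, false, null);
            channel.basicPublish("", "events", null,
                    "hello".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```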
It might depend on other factors such as your network or your hardware performance.
When reading benchmarks, always consider the environment surrounding the tests.
As for how to improve performance, you can upgrade your hardware or network if one of those is the limiting factor.
Switching to an SSD or using link aggregation on your network would be a good start.
In this test of RabbitMQ performance, the authors concluded that a small cluster will underperform a single-node cluster; more nodes need to be added to increase performance. This makes sense when you think about the overhead induced by the replication required in a distributed system, especially given that RabbitMQ's focus is reliability.
The following is mentioned in a blog post by RabbitMQ:
If you use quorum queues or mirrored queues, then each message will be delivered to multiple brokers. If you have a cluster of three brokers and quorum queues with a replication factor of 3, then every broker will receive every message. In that case, we’ve created a cluster for redundancy only. But we can also create larger clusters for scalability. We could have a cluster of 9 brokers, with quorum queues with a rep factor of 3 and now we’ve spread that load out and can handle a much larger total throughput.
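To illustrate the quorum-queue part of that quote (queue name and host are assumptions), a queue is made a quorum queue via the x-queue-type argument at declaration time, and x-quorum-initial-group-size controls how many of the cluster's brokers hold a replica:

```java
import java.util.HashMap;
import java.util.Map;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class DeclareQuorumQueue {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumed broker address

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {
            Map<String, Object> queueArgs = new HashMap<>();
            queueArgs.put("x-queue-type", "quorum");          // quorum queue instead of classic
            queueArgs.put("x-quorum-initial-group-size", 3);  // replicate on 3 of the cluster's brokers

            // Quorum queues must be durable, non-exclusive, and non-auto-delete.
            channel.queueDeclare("orders", true, false, false, queueArgs);
        }
    }
}
```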

How do the Flowfiles get distributed across the cluster nodes?

For example, if I have a GetFile processor that I have designated to be isolated, how do the flow files coming from that processor get distributed across the cluster nodes?
Is there any additional work / processors that need to be added?
In Apache NiFi today, the question of load balancing across the cluster has two main answers. First, you must consider how data gets to the cluster in the first place. Second, once it is in the cluster, do you need to rebalance?
For getting data into the cluster, it is important that you select protocols which are themselves scalable in nature. Protocols which offer queuing semantics are good for this, whereas protocols which do not offer queuing semantics are problematic. As examples of protocols with queuing semantics, think JMS queues, Kafka, or some HTTP APIs. Those are great because one or more clients can pull from them in a queue fashion and thus spread the load. Examples of protocols which do not offer such behaviour would be GetFile, GetSFTP, and so on. These are problematic because the clients have to share state about which data they have already pulled. To address even these protocols we've moved to a model of ListSFTP and FetchSFTP, where ListSFTP occurs on one node in the cluster (the primary node) and then uses the Site-to-Site feature of NiFi to load-balance to the rest of the cluster; each node then gets its share of work and does FetchSFTP to actually pull the data. The same pattern is offered for HDFS now as well.
In describing that pattern I also mentioned Site-to-Site. This is how two NiFi clusters can share data, which is great for inter-site and intra-site distribution needs. It also works well for spreading load within the same cluster. For this you simply send the data to the same cluster, and NiFi then takes care of load balancing, fail-over, and detection of new and removed nodes.
So there are great options already. That said, we can do more, and in the future we plan to offer a way for you to indicate on a connection that it should be auto-load-balanced; it will then do what I've described behind the scenes.
Thanks
Joe
Here is an updated answer that is even simpler with newer versions of NiFi. I am running Apache NiFi 1.8.0 here.
The approach is to use a processor on the primary node that emits flow files to be consumed via a load-balanced connection.
For example, use one of the List* processors and, under "Scheduling", set its "Execution" to run on the primary node.
This should feed into the next processor. Select the connection and set its "Load Balance Strategy".
You can read more about the feature in its design document.

What is the difference between a partition and a replica of a topic in a Kafka cluster?

What is the difference between a partition and a replica of a topic in a Kafka cluster?
I mean, both store copies of the messages in a topic. Then what is the real difference?
When you add a message to the topic, you call the send(KeyedMessage message) method of the producer API. This means that your message contains a key and a value. When you create a topic, you specify the number of partitions you want it to have. When you call the send method for this topic, the data is sent to only ONE specific partition, based on the hash value of your key (by default). Each partition may have replicas, which means that a partition and its replicas store the same data. The limitation is that both your producer and consumer work only with the main (leader) replica; its copies are used only for redundancy.
Refer to the documentation: http://kafka.apache.org/documentation.html#producerapi
And a basic training: http://www.slideshare.net/miguno/apache-kafka-08-basic-training-verisign
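The send(KeyedMessage) call above is from the older 0.8 producer API; with the current Java client the same idea looks roughly like this (broker address, topic, and keys are made up). Records with the same key hash to the same partition, which is what preserves per-key ordering.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Both records carry the key "user-42", so the default partitioner
            // hashes them to the same partition of the "events" topic.
            producer.send(new ProducerRecord<>("events", "user-42", "login"));
            producer.send(new ProducerRecord<>("events", "user-42", "logout"));
        }
    }
}
```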
Topics are partitioned across multiple nodes so a topic can grow beyond the limits of a single node. Partitions are replicated for fault tolerance. Replication and leader takeover are among the biggest differences between Kafka and other brokers/Flume. From the Apache Kafka site:
Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.
partition: each topic can be split up into partitions for load balancing (you can write to different partitions at the same time) and scalability (the topic can grow without being limited by a single instance); within the same partition the records are ordered;
replica: mainly for fault tolerance and durability;
Quotes:
The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.
There is a quite intuitive tutorial to explain some fundamental concepts in Kafka: https://www.tutorialspoint.com/apache_kafka/apache_kafka_fundamentals.htm
Furthermore, there is a workflow to guide you through the confusing jungle: https://www.tutorialspoint.com/apache_kafka/apache_kafka_workflow.htm
Partitions
A topic consists of a bunch of buckets. Each such bucket is called a partition.
When you want to publish an item, Kafka (by default) takes the hash of its key and appends the item to the appropriate bucket.
Replication Factor
This is the number of copies of the topic's data that you want kept across the cluster.
In simple terms, partition is used for scalability and replication is for availability.
Kafka topics are divided into a number of partitions. Any record written to a particular topic goes to a particular partition. Each record is assigned and identified by a unique offset. Replication is implemented at the partition level; the redundant unit of a topic partition is called a replica. The logic that decides the partition for a message is configurable. Partitioning helps in reading/writing data in parallel by splitting a topic into different partitions spread over multiple brokers. Each partition has one replica acting as leader and the others as followers. The leader handles the reads/writes while the followers replicate the data. In case the leader fails, one of the followers is elected as the new leader.
Hope this explains!
Partitions store different data of the same type. Yes, you can store the same message in different topic partitions, but then your consumers need to handle the duplicated messages.
Replicas are copies of these partitions on other servers.
The maximum number of replicas you can have is limited by the number of Kafka brokers (servers) in your cluster.
Example:
Let's suppose you have a Kafka cluster of 3 brokers, and inside it a topic named AIRPORT_ARRIVALS that receives flight information messages and has 3 partitions: partition 1 for flight arrivals from airline A, partition 2 from airline B, and partition 3 from airline C. Each of these messages will initially be written to the leader broker of its partition, and a copy of each message will be stored/replicated to the other 2 Kafka brokers (followers). Disclaimer: this example is only for ease of explanation and not an ideal way to define a message key, because you could end up with unbalanced load over specific partitions.
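For a topology like that example, a sketch of creating the topic with the Java AdminClient (broker address is assumed): 3 partitions for parallelism and a replication factor of 3, so each partition gets one leader and two follower replicas.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateAirportArrivalsTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions for parallelism, replication factor 3 so every
            // partition has one leader replica and two follower replicas.
            NewTopic topic = new NewTopic("AIRPORT_ARRIVALS", 3, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```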
Partitions are how Kafka spreads a topic's data across brokers; replication is how Kafka provides redundancy.
Kafka keeps more than one copy of the same partition across multiple brokers.
Each such redundant copy is called a replica. If a broker fails, Kafka can still serve consumers from the replicas of the partitions that the failed broker owned.
