I need to scale up my ActiveMQ solution so I have defined a network of brokers.
I'm trying to figure out how to connect my producers and consumers to the cluster.
Does each producer have to be connected to a single broker (with the failover URI for availability)? In that case, how can I guarantee the distribution of traffic across the brokers? Do I need to configure each producer to connect to a different broker?
Should I apply the same scheme to the consumers?
This makes the application aware of the cluster topology, which I hope can be avoided with a decent cluster.
Thanks,
Tomer
I strongly suggest you carefully read through the documentation from activemq.apache.org on clustering ActiveMQ. There are a lot of very helpful tips.
From what you have written, I suggest you pay special attention to this. At the bottom of the page it details how you can control the failover/failback behaviour of your producers from the server side.
For example:
updateClusterClients - if true pass information to connected clients about changes in the topology of the broker cluster
rebalanceClusterClients - if true, connected clients will be asked to rebalance across a cluster of brokers when a new broker joins the network of brokers
updateURIsURL - A URL (or path to a local file) to a text file containing a comma separated list of URIs to use for reconnect in the case of failure
In an active production system, I would think that making use of updateURIsURL would make scaling out a lot less painful.
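For illustration, a minimal sketch of both sides (broker names, ports and the file path are placeholders): the first two options are attributes of the broker's transportConnector in conf/activemq.xml, while updateURIsURL is an option on the client's failover transport URI.

<!-- conf/activemq.xml, on each broker in the network -->
<transportConnector name="openwire" uri="tcp://0.0.0.0:61616"
                    updateClusterClients="true"
                    rebalanceClusterClients="true"/>

# client connection URI for producers and consumers
failover:(tcp://broker1:61616,tcp://broker2:61616)?updateURIsURL=file:///etc/activemq/broker-uris.txt

With this in place, producers and consumers connect through the failover transport to whichever broker answers first, and the brokers push topology updates (and optionally rebalance connections) as brokers join or leave, so the application does not have to be told about every broker up front.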
I would like to figure out the best way to route messages from Kafka to WebSocket clients connected to a load-balanced application server cluster. I understand that spring-kafka facilitates consuming messages from and publishing messages to a Kafka topic, but how does this work in a load-balanced application server scenario when connecting to a distributed Kafka topic? Here are the requirements that I would like to satisfy, with the overall goal of facilitating peer-to-peer messaging in an application with a very, very large volume of users:
Web clients can connect to a Tomcat application server via a WebSocket connection through a load balancer.
A web client can send a message/notification to another client that's connected to a different Tomcat application server.
Messages are saved in the database and published to a Kafka topic/partition that can be consumed by the appropriate web clients/users.
Kafka can be scaled to many brokers with many consumers.
I can see how this can be implemented quite easily in a single-application-server scenario where the consumer consumes all messages from a Kafka topic and redistributes them via Spring messaging/WebSockets. But I can't figure out how this would work in a load-balanced application server scenario where there are consumers on each application server forming an overall consumer group for the Kafka topic. Assuming that each of the application servers is consuming a subset of the partitions of the Kafka topic, how do they know which server their intended recipients are connected to? And even if they knew which server their recipients were connected to, how would they route the message to them via WebSockets?
I considered that the application server load balancing could work by logging users with a particular routing key (users whose names start with 'A', etc.) on to a specific application server, and then only consuming messages for users whose names start with 'A' on that application server. But this seems like it would be difficult to maintain and would make autoscaling very difficult. This seems like it should be a common scenario to implement, but I can't find any tools or approaches that fit it.
Sounds like every single consumer should live in its own consumer group. This way all the available consumers are going to consume all the messages sent to the topic, and therefore all the connected WebSocket clients are going to be notified of those messages.
If you need more complex logic with those messages after consuming, e.g. filtering, routing, transforming, aggregating, etc., you should consider involving Spring Integration in your project: https://spring.io/projects/spring-integration
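A minimal Spring Kafka sketch of that idea, assuming spring-kafka and STOMP WebSocket messaging are already configured; the topic name, the destination, and the convention that the record key carries the recipient's user name are illustrative assumptions:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.messaging.simp.SimpMessagingTemplate;
import org.springframework.stereotype.Component;

@Component
public class NotificationRelay {

    private final SimpMessagingTemplate template;

    public NotificationRelay(SimpMessagingTemplate template) {
        this.template = template;
    }

    // A unique groupId per application-server instance means every instance
    // receives every message; each instance only succeeds in delivering to
    // the users that hold a WebSocket session on that node.
    @KafkaListener(topics = "user-notifications",
                   groupId = "#{'ws-relay-' + T(java.util.UUID).randomUUID().toString()}")
    public void relay(ConsumerRecord<String, String> record) {
        // Assumes the record key is the recipient's user name.
        template.convertAndSendToUser(record.key(), "/queue/notifications", record.value());
    }
}

The random group id is just one way to make the group unique per instance; a per-instance property or the host name would work equally well.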
Broadcasting to all the consumers may work, but the most efficient solution should route each message precisely to the node that holds the WebSocket connection for the target user. As far as I know, routing in a distributed system can be done as follows:
Put the routing information in a middleware such as Redis, or implement a service yourself to keep track of all the sessions. That is, solve it in a centralized way (see the sketch after this list).
Let the WebSocket servers find the route by themselves. In this case, a gossip-style protocol should be taken into consideration.
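As an illustration of the centralized option, here is a minimal sketch of a Redis-backed session registry using Spring Data Redis; the key naming scheme and the NODE_ID environment variable are assumptions:

import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Component;

@Component
public class WebSocketSessionRegistry {

    private final StringRedisTemplate redis;
    private final String nodeId; // identifies this application server instance

    public WebSocketSessionRegistry(StringRedisTemplate redis) {
        this.redis = redis;
        this.nodeId = System.getenv().getOrDefault("NODE_ID", "node-1");
    }

    // Call when a user's WebSocket session is opened on this node.
    public void register(String userId) {
        redis.opsForValue().set("ws:session:" + userId, nodeId);
    }

    // Call when the session is closed.
    public void unregister(String userId) {
        redis.delete("ws:session:" + userId);
    }

    // A consumer can look up which node owns the user's session and forward
    // the message there (or drop it if the user is not connected anywhere).
    public String nodeFor(String userId) {
        return redis.opsForValue().get("ws:session:" + userId);
    }
}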
To get Kafka running, you need to set some properties in the config/server.properties file. There are two settings I don't understand.
Can somebody explain the difference between listeners and advertised.listeners property?
The documentation says:
listeners: The address the socket server listens on.
and
advertised.listeners:
Hostname and port the broker will advertise to producers and consumers.
When do I have to use which setting?
listeners is what the broker will use to create server sockets.
advertised.listeners is what clients will use to connect to the brokers.
The two settings can be different if you have a "complex" network setup (with things like public and private subnets and routing in between).
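For example, a broker on a private subnet that clients reach through a public DNS name might be configured roughly like this (host names are illustrative):

# server.properties (sketch)
# the socket the broker binds to (all local interfaces here)
listeners=PLAINTEXT://0.0.0.0:9092
# the address handed back to clients in metadata responses
advertised.listeners=PLAINTEXT://broker1.example.com:9092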
Since I cannot comment yet I will post this as an "answer", adding on to M.Situations answer.
Within the same document he links, there is this blurb about which listener is used by a Kafka client (https://cwiki.apache.org/confluence/display/KAFKA/KIP-103%3A+Separation+of+Internal+and+External+traffic):
As stated previously, clients never see listener names and will make metadata requests exactly as before. The difference is that the list of endpoints they get back is restricted to the listener name of the endpoint where they made the request.
This is important: whichever URL you use in your bootstrap.servers config determines the URL* the client will get back, provided it is mapped in advertised.listeners (I do not know what the behavior is if the listener does not exist).
Also note this:
The exception is ZooKeeper-based consumers. These consumers retrieve the broker registration information directly from ZooKeeper and will choose the first listener with PLAINTEXT as the security protocol (the only security protocol they support).
As an example broker config (for all brokers in the cluster):
advertised.listeners=EXTERNAL://XXXXX.compute-1.amazonaws.com:9990,INTERNAL://ip-XXXXX.ec2.internal:9993
inter.broker.listener.name=INTERNAL
listener.security.protocol.map=EXTERNAL:SSL,INTERNAL:PLAINTEXT
If the client uses XXXXX.compute-1.amazonaws.com:9990 to connect, the metadata fetch will go to that broker. However, the URL returned for use with the Group Coordinator or Leader could be 123.compute-1.amazonaws.com:9990* (a different machine!). This means the match is done on the listener name, as described in KIP-103, irrespective of the actual URL (node).
Since the protocol map for EXTERNAL is SSL this would force you to use an SSL keystore to connect.
If, on the other hand, you are within AWS, say, you can use ip-XXXXX.ec2.internal:9993, and the corresponding connection would be plaintext as per the protocol map.
This is especially needed in IaaS setups: in my case, brokers and consumers live on AWS, whereas my producer lives on a client site, thus needing different security protocols and listeners.
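As a sketch, an external client would then need SSL settings alongside its bootstrap address, while an internal client would not (the truststore path and password are placeholders):

# external client: connects via the EXTERNAL listener, so SSL is required
bootstrap.servers=XXXXX.compute-1.amazonaws.com:9990
security.protocol=SSL
ssl.truststore.location=/path/to/client.truststore.jks
ssl.truststore.password=changeit

# internal client: EC2-internal address, PLAINTEXT listener
bootstrap.servers=ip-XXXXX.ec2.internal:9993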
EDIT:
Also adding Inbound Rules is much easier now that you have different ports for different clients (brokers, producers, consumers).
EDIT2:
This article is a great in depth guide if the above is still not clear: https://rmoff.net/2018/08/02/kafka-listeners-explained/
There's a lot of confusion, and little information, in the answers provided here, so I'm posting a more detailed answer for clarity.
listeners - the address(es) the Kafka broker's socket server binds to. The hostname in this setting can be left empty if you want Kafka to bind to the default interface; the broker then falls back to the InetAddress.getLocalHost().getCanonicalHostName() Java API when it needs a host name.
advertised.listeners: this address is published to ZooKeeper by every Kafka broker. If this setting is not set, the value of listeners is used and published instead. Its only purpose is to tell others how to reach the broker: Kafka clients use the advertised.listeners value published to ZooKeeper (under /brokers/ids/<id> as endpoints) to talk to the Kafka broker.
Now, why have two settings instead of one? Say your Kafka broker is sitting behind a proxy and all the Kafka clients have to talk to the proxy to reach the broker. In this case we want the broker to bind to a local host and port, but we can't publish that address to ZooKeeper because clients can't use it. So the Kafka admin can set advertised.listeners to the proxy host and port.
Also, on some of our production hosts, InetAddress.getLocalHost().getCanonicalHostName() returns an empty string, so the hostname in listeners was empty, which was fine for binding. But advertised.listeners was published to ZooKeeper as NULL:9092, since it took the same value as listeners by default. All the brokers then tried to publish themselves to ZooKeeper this way, and they got the error java.lang.IllegalArgumentException: requirement failed: Configured end points null:14092 as advertised.listeners as NULL:9092 is already registered by broker 101. The fix was to change the advertised.listeners setting to have a real hostname in it.
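A sketch of that fix (the host name is illustrative):

# server.properties
# bind to the default interface, but advertise a real, resolvable host name
listeners=PLAINTEXT://:9092
advertised.listeners=PLAINTEXT://broker-101.example.com:9092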
Listeners are all the addresses the Kafka broker listens on (it can be more than 1 address) whereas advertised listeners are the addresses other agents (producers, consumers, or brokers) need to connect to if they want to talk to the current broker.
The two lists can be the same if everything runs on the same machine (everyone can connect using localhost:9092 or 127.0.0.1:9092), but if consumers, producers, or other brokers are not on the same machine or in the same Docker container, they must use different addresses (that's why we have advertised listeners). Two examples:
Say we use Docker to run two Kafka instances named kafka and kafka2. kafka2 certainly cannot connect to kafka using localhost:29092; it must use kafka:9092 instead. So kafka advertises kafka:9092 for other containers and localhost:29092 for clients on the host.
A producer on the host machine cannot connect to kafka using kafka:9092; it must use localhost:29092 instead.
Let's use the following docker-compose config to understand more about the startup process of a Kafka broker:
# config/docker-compose.yml
kafka:
  image: docker.io/bitnami/kafka:3
  ports:
    - "29092:29092"
    - "9092:9092"
  environment:
    - KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper:2181
    - ALLOW_PLAINTEXT_LISTENER=yes
    - KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=CLIENT:PLAINTEXT,EXTERNAL:PLAINTEXT
    - KAFKA_CFG_LISTENERS=CLIENT://:9092,EXTERNAL://:29092
    - KAFKA_CFG_ADVERTISED_LISTENERS=CLIENT://kafka:9092,EXTERNAL://localhost:29092
    - KAFKA_CFG_INTER_BROKER_LISTENER_NAME=CLIENT
  depends_on:
    - zookeeper
With this config, Docker will start 1 Kafka broker instance which listens on 2 ports:
9092 with name CLIENT
29092 with name EXTERNAL
The broker then connects to ZooKeeper at zookeeper:2181 and registers its two addresses: kafka:9092 and localhost:29092. Also, with KAFKA_CFG_INTER_BROKER_LISTENER_NAME=CLIENT, other brokers are told to connect to kafka:9092 if they want to talk to it.
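For example, a console producer on the host machine has to use the EXTERNAL listener, while one run from another container on the same Docker network uses the CLIENT listener (the topic name is illustrative):

# from the host machine (EXTERNAL listener)
kafka-console-producer.sh --bootstrap-server localhost:29092 --topic demo

# from another container on the same Docker network (CLIENT listener)
kafka-console-producer.sh --bootstrap-server kafka:9092 --topic demo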
But why do we need two ports? Read more here
References:
My notes while learning Kafka
Kafka Listeners – Explained
From this link: https://cwiki.apache.org/confluence/display/KAFKA/KIP-103%3A+Separation+of+Internal+and+External+traffic
During the 0.9.0.0 release cycle, support for multiple listeners per broker was introduced. Each listener is associated with a security protocol, ip/host and port. When combined with the advertised listeners mechanism, there is a fair amount of flexibility with one limitation: at most one listener per security protocol in each of the two configs (listeners and advertised.listeners).

In some environments, one may want to differentiate between external clients, internal clients and replication traffic independently of the security protocol for cost, performance and security reasons. A few examples that illustrate this:

Replication traffic is assigned to a separate network interface so that it does not interfere with client traffic.

External traffic goes through a proxy/load-balancer (security, flexibility) while internal traffic hits the brokers directly (performance, cost).

Different security settings for external versus internal traffic even though the security protocol is the same (e.g. different set of enabled SASL mechanisms, authentication servers, different keystores, etc.)

As such, we propose that Kafka brokers should be able to define multiple listeners for the same security protocol for binding (i.e. listeners) and sharing (i.e. advertised.listeners) so that internal, external and replication traffic can be separated if required.
So,
listeners - Comma-separated list of URIs we will listen on and their protocols.
Specify hostname as 0.0.0.0 to bind to all interfaces.
Leave hostname empty to bind to default interface.
Examples of legal listener lists:
PLAINTEXT://myhost:9092,TRACE://:9091
PLAINTEXT://0.0.0.0:9092, TRACE://localhost:9093
advertised.listeners - Listeners to publish to ZooKeeper for clients to use, if different than the listeners above.
In IaaS environments, this may need to be different from the interface to which the broker binds. If this is not set, the value for listeners will be used.
I want to create a UDP-based message broker service.
I have a few dozen sources, each transmitting at a different rate; some of them stream the data and some of them forward it in batches.
I want all the data to go to one destination - a Cloudera Hadoop cluster (running Red Hat 6.6) that will use Kafka/Flume as its message broker.
I need to create the in-between message broker service. It has to be robust and fault tolerant. It can receive the data from the sources using any protocol, but it has to forward the messages using UDP (or any one-way protocol; no ACK/SYN or any other response allowed).
For that reason it has to use a PUSH mechanism, and the data cannot be pulled by the Hadoop cluster.
As far as I know, Kafka and Flume use TCP to forward messages. I found "udp-kafka-bridge" and "flume-udp-source", but I do not have any experience with them.
The message broker has to be robust and fault tolerant. It has to be able to deal with changing rates of incoming data, and it should preferably be a near-real-time broker.
Do you have any recommendation for tools/architecture I should use?
Thank you!
I'm trying to configure a clustered websphere application server that connects to a clustered MQ.
However, the information I have is details for two instances of MQ with different host names, server channels and queue managers, which belong to the same MQ cluster.
On the WebSphere console, I can see input fields for hostname, queue manager and server channel, but I cannot find anywhere to specify multiple MQ details.
If I pick one of the MQ details, will MQ clustering still work? If not, how can I enable MQ clustering given the details I have?
WebSphere MQ clustering affects the behavior of how queue managers talk amongst themselves. It does not change how an application connects or talks to a queue manager so the question as asked seems to be assuming some sort of clustering behavior that is not present in WMQ.
To set up the app server with two addresses, please see Configuring multi-instance queue manager connections with WebSphere MQ messaging provider custom properties in the WAS v7 Knowledge Center for instructions on how to configure a connection factory with a multi-instance CONNAME value.
If you specify a valid QMgr name in the Connection Factory and the QMgr to which the app connects doesn't have that specific name then the connection is rejected. Normally a multi-instance CONNAME is used to connect to a multi-instance QMgr. This is a single highly available queue manager that can be at one of two different IP addresses so using a real QMgr name works in that case. But if the QMgrs to which your app is connecting are two distinct and different-named queue managers, which is what you described, you should specify an asterisk (a * character) as the queue manager name in your connection factory as described here. This way the app will not check the name of the QMgr when it gets a connection.
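As a sketch, using the custom properties described in that Knowledge Center article, the multi-instance connection name is just a comma-separated host(port) list, and the queue manager name is set to an asterisk so the name check is skipped (the host names come from this question; the port and the exact property name should be verified against the linked article):

# WMQ messaging provider connection factory custom property
XMSC_WMQ_CONNECTION_NAME_LIST = hostname01(1414),hostname02(1414)
# Queue manager field on the connection factory
queueManager = *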
If I pick one of the MQ details, will MQ clustering still work? If not,
how can I enable MQ clustering given the details I have?
Depends on what you mean by "clustering". If you believe that the app will see one logical queue which is hosted by two queue managers, then no. That's not how WMQ clustering works. Each queue manager hosting a clustered queue gets a subset of messages sent to that queue. Any apps getting from that queue will therefore only ever see the local subset.
But if by "clustering" you intend to connect alternately to one or the other of the two queue managers and transmit messages to a queue that is in the same cluster but not hosted on either of the two QMgrs to which you connect, then yes it will work fine. If your Connection Factory knows of only one of the two QMgrs you will only connect to that QMgr, and sending messages to the cluster will still work. But set it up as described in the links I've provided and your app will be able to connect to either of the two QMgrs and you can easily test that by stopping the channel on the one it connects to and watching it connect to the other one.
Good luck!
UPDATE:
To be clear, the details provided are similar to hostname01, qmgr01,
queueA, serverchannel01. And the other is hostname02, qmgr02, queueA,
serverchannel02.
WMQ Clients will connect to two different QMgrs using a multi-instance CONNAME only when...
The channel name used on both QMgrs is exactly the same.
The application uses an asterisk (a * character) or a space for the QMgr name when the connection request is made (i.e. in the Connection Factory).
It is possible to have WMQ connect to one of several different queue managers where the channel name differs on each by using a Client Connection Definition Table, also known as a CCDT. The CCDT is a compiled artifact that you create using MQSC commands to define CLNTCONN channels. It contains entries for each of the QMgrs the client is eligible to connect to. Each can have a different QMgr name, host, port and channel. However, when defining the CCDT the administrator defines all the entries such that the QMgr name is replaced with the application High Level Qualifier. For example, the Payroll app wants to connect to any 1 of 3 different QMgrs. The WMQ Admin defines a CCDT with three entries but uses PAY01, PAY02, and PAY03 for the QMgr names. Note this does not need to match the actual QMgr names. The application then specifies the QMgr name as PAY* which selects all three QMgrs in the CCDT.
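A rough MQSC sketch of such a CCDT, reusing the channel and host names from this question and the PAY* naming idea above (the port and the PAY01/PAY02 names are illustrative):

DEFINE CHANNEL(SERVERCHANNEL01) CHLTYPE(CLNTCONN) +
       CONNAME('hostname01(1414)') QMNAME('PAY01') REPLACE
DEFINE CHANNEL(SERVERCHANNEL02) CHLTYPE(CLNTCONN) +
       CONNAME('hostname02(1414)') QMNAME('PAY02') REPLACE

The application then asks for queue manager 'PAY*' and the client picks an eligible entry from the table.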
Please see Using a client channel definition table with WebSphere MQ classes for JMS for more details on the CCDT.
Is MQ cluster not similar to application server clusters?
No, not at all.
Wherein two child nodes are connected to a cluster, and an F5 URL will
be used to distribute the load to each node. Doesn't WMQ come with a
cluster URL / F5 that we can just send messages to, with the partitioning of
messages being transparent?
No. The WMQ cluster provides a namespace within which applications and QMgrs can resolve non-local objects such as queues and topics. The only thing that ever connects to a WebSphere MQ cluster is a queue manager. Applications and human users always connect to specific queue managers. There may be a set of interchangeable queue managers such as with the CCDT, but each is independent.
With WAS the messaging engine may run on several nodes, but it provides a single logical queue from which applications can get messages. With WMQ each node hosting that queue gets a subset of the messages and any application consuming those messages sees only that subset.
HTTP is stateless and so an F5 URL works great. When it does maintain a session, that session exists mainly to optimize away connection overhead and tends to be short lived. WMQ client channels are stateful and coordinate both single-phase and two-phase units of work. If an application fails over to another QMgr during a UOW, it has no way to reconcile that UOW.
Because of the nature of WMQ connections, F5 is never used between QMgrs. It is only used between client and QMgr for connection balancing, not message traffic balancing. Furthermore, the absence or presence of an MQ cluster is entirely transparent to the application, which, in either case, simply connects to a QMgr to get and/or put messages. Use of a multi-instance CONNAME or a CCDT file makes that connection more robust by providing multiple equivalent QMgrs to which the client can connect, but that has nothing whatever to do with WMQ clustering.
Does that help?
Please see:
Clustering
How Clusters Work
Queue manager groups in the CCDT
Connecting WebSphere MQ MQI client applications to queue managers
I know there's HornetQ HA with Master/Backup setups. But I would like to run HornetQ in a non-master setup and handle duplicate messages myself.
The cluster setup looks perfect for this, but nowhere do I see a hint about its ability to meet these requirements. What happens to clients of a failed node? Do they connect to other servers?
Will a rebooted/repaired node be able to rejoin the cluster and continue distribution of its persistent messages?
Failover on clients requires a backup node at the moment. You would have to reconnect manually in case of a failure to get onto another node.
For example: get the connection factory for another node and connect there, along the lines of the sketch below.
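A minimal JMS-level sketch of that manual reconnect, assuming you already hold one ConnectionFactory per cluster node (for example looked up from JNDI); the HornetQ-specific lookup code is omitted:

import java.util.List;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;

public class ManualFailover {

    private final List<ConnectionFactory> factories; // one factory per cluster node

    public ManualFailover(List<ConnectionFactory> factories) {
        this.factories = factories;
    }

    // Try each node in turn and return the first connection that succeeds.
    public Connection connectToAnyNode() throws JMSException {
        JMSException last = null;
        for (ConnectionFactory factory : factories) {
            try {
                Connection connection = factory.createConnection();
                connection.start();
                return connection;
            } catch (JMSException e) {
                last = e; // node unreachable, try the next one
            }
        }
        throw last != null ? last : new JMSException("No cluster nodes reachable");
    }
}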