Using flume to read IBM MQ data - hadoop

I want to read data from IBM MQ and put it into HDFs.
Looked into JMS source of flume, seems it can connect to IBM MQ, but I’m not understanding what does “destinationType” and “destinationName” mean in the list of required properties. Can someone please explain?
Also, how I should be configuring my flume agents
flumeAgent1(runs on the machine same as MQ) reads MQ data ---- flumeAgent2(Runs on Hadoop cluster) writes into Hdfs
OR only one agent is enough on Hadoop cluster
Can someone help me in understanding how MQs can be integrated with flume
Reference
https://flume.apache.org/FlumeUserGuide.html
Thanks,
Chhaya

Regarding the Flume agent architecture, it is composed in its minimalist form by a source in charge of receiving or polling for events, and converting the events into Flume events that are put in a channel. Then, a sink takes those events in order to persist the data somewhere, or send the data to another agent. All these components (source, channel, sink, i.e. an agent) run in the same machine. Different agents may be distributed, instead.
Being said that, your scenario seems to require a single agent based on a JMS source, a channel, typically Memory Channel, and a HDFS sink.
The JMS source, as stated in the documentation, has only been tested for ActiveMQ, but shoukd work for any other queue systemm. The documentation also provides an example:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = jms
a1.sources.r1.channels = c1
a1.sources.r1.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory
a1.sources.r1.connectionFactory = GenericConnectionFactory
a1.sources.r1.providerURL = tcp://mqserver:61616
a1.sources.r1.destinationName = BUSINESS_DATA
a1.sources.r1.destinationType = QUEUE
a1 is the name of the single agent. c1 is the name for the channel and its configuration must be still completed; and a sink configuration is totally missing. It can be easily completed by adding:
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = ...
a1.sinks.k1...
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1...
r1 is the JMS source, and as can be seen, destinationName simply ask for a string name. destinationType can only take two values: queue or topic. I think the important parameters are providerURL and initialContextFactory and connectionFactory, which must be adapted for IBM MQ.

Related

Using dead letter queue with Kafka MirrorMaker2

Kafka Connect converters provide the feature of dead letter queue (DLQ) that can be configured (errors.deadletterqueue.topic.name) to store failing records. I tried configuring it on a MirrorMaker2 setup but it doesn't seem to be working as expected. My expectation is that messages that failed to replicate to target cluster are stored in the dead letter queue topic.
To test this, I simulated failures by bringing down the target cluster and expected MirrorMaker2 to create a DLQ on source cluster with failed message but didn't see the dead letter queue topic created. The Kafka documentation is not very clear on whether this configuration option works for MirrorMaker2.
Below is the configuration I used:
clusters = sourceKafkaCluster,targetKafkaCluster
sourceKafkaCluster.bootstrap.servers = xxx
targetKafkaCluster.bootstrap.servers = yyy
sourceKafkaCluster->targetKafkaCluster.enabled = true
targetKafkaCluster->sourceKafkaCluster.enabled = false
#Not sure which one of the below ones are correct.
sourceKafkaCluster->targetKafkaCluster.errors.deadletterqueue.topic.name=dlq_topic_1
sourceKafkaCluster->targetKafkaCluster.errors.deadletterqueue.topic.replication.factor=1
errors.deadletterqueue.topic.name=dlq_topic_1
errors.deadletterqueue.topic.replication.factor=1
Does the deadletterqueue configuration option work with MirrorMaker2?

Amazon MQ (ActiveMQ) bad performance on large messages

We are migrating from IBM MQ to Amazon MQ, at least we would like to do so. The problem is Amazon MQ is having bad performance when using JMS producer to put a large message on a queue compared to IBM MQ.
All messages are persistent and the system is High Available regarding IBM MQ, and Amazon MQ is multi AZ.
If we put this size of XML files to IBM MQ (2 cpu and 8GB RAM HA instance) we have this performance:
256 KB = 15ms
4,6 MB = 125ms
9,3 MB = 141ms
18,7 MB = 218ms
37,4 MB = 628ms
74,8 MB = 1463ms
If we put the same files on Amazon MQ (mq.m5.2xlarge = 8 CPU and 32 GB RAM) or ActiveMQ we have this performance:
256 KB = 967ms
4,6 MB = 1024ms
9,3 MB = 1828ms
18,7 MB = 3550ms
37,4 MB = 8900ms
74,8 MB = 14405ms
What we also see is that IBM MQ has equal response times for sending a message to a queue and getting a message from a queue, while Amazon MQ is real fast in getting a message (e.g. just takes 1 ms), but very slow on sending.
On Amazon MQ we use the OpenWire protocol. We use this config in Terraform style:
resource "aws_mq_broker" "default" {
broker_name = "bernardamazonmqtest"
deployment_mode = "ACTIVE_STANDBY_MULTI_AZ"
engine_type = "ActiveMQ
engine_version = "5.15.10"
host_instance_type = "mq.m5.2xlarge"
auto_minor_version_upgrade = "false"
apply_immediately = "false"
publicly_accessible = "false"
security_groups = [aws_security_group.pittensbSG-allow-mq-external.id]
subnet_ids = [aws_subnet.pittensbSN-public-1.id, aws_subnet.pittensbSN-public-3.id]
logs {
general = "true"
audit = "true"
}
We use Java 8 with JMS ActiveMQ library via POM (Maven):
<dependency>
<groupId>org.apache.activemq</groupId>
<artifactId>activemq-client</artifactId>
<version>5.15.8</version>
</dependency>
<dependency>
<groupId>org.apache.activemq</groupId>
<artifactId>activemq-pool</artifactId>
<version>5.15.8</version>
</dependency>
In JMS we have this Java code:
private ActiveMQConnectionFactory mqConnectionFactory;
private PooledConnectionFactory mqPooledConnectionFactory;
private Connection connection;
private Session session;
private MessageProducer producer;
private TextMessage textMessage;
private Queue queue;
this.mqConnectionFactory = new ActiveMQConnectionFactory();
this.mqPooledConnectionFactory = new PooledConnectionFactory();
this.mqPooledConnectionFactory.setConnectionFactory(this.mqConnectionFactory);
this.mqConnectionFactory.setBrokerURL("ssl://tag-1.mq.eu-west-1.amazonaws.com:61617");
this.mqPooledConnectionFactory.setMaxConnections(10);
this.connection = mqPooledConnectionFactory.createConnection());
this.connection.start();
this.session = this.connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
this.session.createQueue("ExampleQueue");
this.producer = this.session.createProducer(this.queue);
long startTimeSchrijf = 0;
startTimeWrite= System.currentTimeMillis();
producer.send("XMLFile.xml"); // here we send the files
logger.debug("EXPORTTIJD_PUT - Put to queue takes: " + (System.currentTimeMillis() - startTimeWrite));
// close session, producer and connection after 10 cycles
We also have run the performance test as a Single Instance AmazonMQ. But same results.
We have also run the performance test with a mq.m5.4xlarge (16 cpu, 96 GB RAM) engine but still no improvement of the bad performance.
Performance test configuration:
We first push the messages(XML files) according above one by one to a queue. We do that 5 times. After 5 times we read those messages(XML files) from the queue. We call this 1 cycle.
We run 10 cycles one after another, so in total we have pushed 300 files to the queue and we have getted 300 files from the queue.
We run 3 tests in parallel: One from AWS Region Londen, one from AWS Region Frankfurt in a different VPC and 1 from Frankfurt in the same VPC as the Amazon MQ broker and in the same subnet. Alle clients run on an EC2 instance: m4.xlarge.
If we run a test with only one VPC for example only the local VPC which is in the same subnet as the AmazonMQ broker the performance improves and we have these results:
256 KB = 72ms
4,6 MB = 381ms
9,3 MB = 980ms
18,7 MB = 2117ms
37,4 MB = 3985ms
74,8 MB = 7781ms
The client and server are in the same subnet, so we have nothing to do with firewalling etc.
Maybe somebody can tell me what is wrong, and why we have such a terrible performance with Amazon MQ or ActiveMQ?
extra info:
Response times are measured in the JMS Java app with Java starttime just before the producer.send('XML') and just endtime just after the producer.send('XML'). Difference is the recorded time. Times are average times over 300 calls.
IBM MQ server is located in our datacenter, and client app is running at a server in the same datacenter.
extra info test:
The jms app starts create connectionFactory queues sessions. Then it uploads the files to MQ 1 by 1. This is a cycle, then it run this cycle 10 times in a for lus without opening or closing sessions queues or connectionfactorys. Then all 60 messages are read from queue and written to files on the local drive. Then it closes the connection factory and session and producer/consumer. This is one batch.
Then we run 5 batches. So between the batches connectionFactory, queue, session are recreated.
In response to Sam:
When I also execute the test with the same size of files like you did Sam I approach the same response times, I set the persistence mode also to false value between () :
500 KB = 30ms (6ms)
1 MB = 50ms (13ms)
2 MB = 100ms (24ms)
I removed the connection pooling and I set
concurrentStoreAndDispatchQueues="false"
The system I have used broker: mq.m5.2xlarge and client: m4.xlarge.
But if I test with bigger files, this are the response times:
256 KB = 72ms
4,6 MB = 381ms
9,3 MB = 980ms
18,7 MB = 2117ms
37,4 MB = 3985ms
74,8 MB = 7781ms
I am having a very simple requirement. I have a system what puts messages on a queue and the messages are get from the queue by another system, sometimes at the same time sometimes not, sometimes there are 20 or 30 messages on the system before they get unloaded. Thats why I need a queue and messages must be persistent and it must be a Java JMS implementation.
I think Amazon MQ might be a solution for small files but for big files it is not. I think we have to use IBM MQ for this case which has better performance. But one important thing: I tested IBM MQ only on premis in our LAN. We tried to test IBM MQ on Amazon but we didn't succeed yet.
I tried to reproduce the scenario you were testing. When I ran a JMS client in the same VPC as the AmazonMQ broker for mq.m5.4xlarge broker with an Active and Standby instance, I see the following roundtrip latencies - measuring the moment from which a producer sends a message to the moment when consumer receives the message.
2MB - 50ms
1MB - 31ms
500KB - 15ms
My code just created a connection and a session. I did not use a PooledConnectionFactory (stating this as a matter of fact, not saying/suspecting that's the cause). Also it is better to strip down the code to bare minimum in order to establish a baseline and remove noise when doing performance testing. That way, when you introduce additional code, you can easily see if the new code introduced a performance issue. I used the default broker configuration.
In ActiveMQ, there is a concept of Fast Producer and Fast Consumer, this means, if consumer can process the messages at the same rate as the Producer, the broker transfers the message from producer to consumer via memory and then it writes the message to disk. This is the default behavior and is controlled by a broker configuration setting named concurrentStoreAndDispatch which is true (default)
If consumer is unable to keep up with producer, and thus becomes a "slow" consumer and with the concurrentStoreAndDispatch flag set to true, you take a performance hit.
ActiveMQ provides advisory topics which you can subscribe to detect slow consumers. If in fact, you detected that the consumer is slower than the producer, it is better to set concurrentStoreAndDispatch flag to false to get better performance.
I don't get any response.
I think its because there is no solution for this performance problem. Amazon MQ is a cloud service and mabye thats the reason why performance is this bad.
IBM MQ is a different architecture, and it is on premise.
I have to investigate the performance of ActiveMQ some more before I can tell what exactly the reason is for this problem.

spark.kafka.poll.time in MapR stream consumers

I am working with MapR streams and setting the parameter "spark.kafka.poll.time" in my direct kafka API consumer; However, I don't know exactly what is the meaning of this parameter?
According to the MapR documention is the query interval time for a consumer on the MapR Streams (http://maprdocs.mapr.com/home/Spark/Spark_IntegrateMapRStreams_Consume.html). Mostly you have to specify it only when using Spark Streaming to connect to Kafka. In a standard Java Kafka Consumer, on the poll method, there is a interval in millis that you have to specify it, so there could be an analogy between the two of them.
For Java:
ConsumerRecords<String, String> records = kafkaConsumer.poll(consumerPoolTime);
For Spark Streaming as Map params:
"spark.kafka.poll.time" -> "300",
// other params
KafkaUtils.createDirectStream[String, String](ssc, kafkaParams, topics)

In Storm Spout, Naming the Consumer Group

I am currently using:
https://github.com/wurstmeister/storm-kafka-0.8-plus/commits/master
which has been moved to:
https://github.com/apache/storm/tree/master/external/storm-kafka
I want to specify the Kafka Consumer Group Name. By looking at the storm-kafka code, I followed the setting, id, to find that is is never used when dealing with a consumer configuration, but is used in creating the zookeeper path at which offset information is stored. Here in this link is an example of why I would want to do this: https://labs.spotify.com/2015/01/05/how-spotify-scales-apache-storm/
Am I correct in saying that the Consumer Group Name cannot be set using the https://github.com/apache/storm/tree/master/external/storm-kafka code?
So far, storm-kafka integration is implemented using SimpleConsumer API of kafka and the format it stores consumer offset in zookeeper is implemented in their own way(JSON format).
If you write spout config like below,
SpoutConfig spoutConfig = new SpoutConfig(zkBrokerHosts,
"topic name",
"/kafka/consumers(just an example, path to store consumer offset)",
"yourTopic");
It will write consumer offset in subdirectories of /kafka/consumers/yourTopic.
Note that by default storm-kafka uses same zookeeper that your Storm uses.

Using Kafka to import data to Hadoop

Firstly I was thinking what to use to get events into Hadoop, where they will be stored and periodically analysis would be performed on them (possibly using Ooozie to schedule periodic analysis) Kafka or Flume, and decided that Kafka is probably a better solution, since we also have a component that does event processing, so in this way, both batch and event processing components get data in the same way.
But know I'm looking for suggestions concretely how to get data out of broker to Hadoop.
I found here that Flume can be used in combination with Kafka
Flume - Contains Kafka Source (consumer) and Sink (producer)
And also found on the same page and in Kafka documentation that there is something called Camus
Camus - LinkedIn's Kafka=>HDFS pipeline. This one is used for all data at LinkedIn, and works great.
I'm interested in what would be a better (and easier, better documented solution) to do that? Also, are there any examples or tutorials how to do it?
When should I use this variants over simpler, High level consumer?
I'm opened for suggestions if there is another/better solution than this two.
Thanks
You can use flume to dump data from Kafka to HDFS. Flume has kafka source and sink. Its a matter of property file change. An example is given below.
Steps:
Create a kafka topic
kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 -- partitions 1 --topic testkafka
Write to the above created topic using kafka console producer
kafka-console-producer --broker-list localhost:9092 --topic testkafka
Configure a flume agent with the following properties
flume1.sources = kafka-source-1
flume1.channels = hdfs-channel-1
flume1.sinks = hdfs-sink-1
flume1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource
flume1.sources.kafka-source-1.zookeeperConnect = localhost:2181
flume1.sources.kafka-source-1.topic =testkafka
flume1.sources.kafka-source-1.batchSize = 100
flume1.sources.kafka-source-1.channels = hdfs-channel-1
flume1.channels.hdfs-channel-1.type = memory
flume1.sinks.hdfs-sink-1.channel = hdfs-channel-1
flume1.sinks.hdfs-sink-1.type = hdfs
flume1.sinks.hdfs-sink-1.hdfs.writeFormat = Text
flume1.sinks.hdfs-sink-1.hdfs.fileType = DataStream
flume1.sinks.hdfs-sink-1.hdfs.filePrefix = test-events
flume1.sinks.hdfs-sink-1.hdfs.useLocalTimeStamp = true
flume1.sinks.hdfs-sink-1.hdfs.path = /tmp/kafka/%{topic}/%y-%m-%d
flume1.sinks.hdfs-sink-1.hdfs.rollCount=100
flume1.sinks.hdfs-sink-1.hdfs.rollSize=0
flume1.channels.hdfs-channel-1.capacity = 10000
flume1.channels.hdfs-channel-1.transactionCapacity = 1000
Save the above config file as example.conf
Run the flume agent
flume-ng agent -n flume1 -c conf -f example.conf - Dflume.root.logger=INFO,console
Data will be now dumped to HDFS location under the following path
/tmp/kafka/%{topic}/%y-%m-%d
Most of the time, I see people using Camus with azkaban
You can you at the github repo of Mate1 for their implementation of Camus. It's not a tutorial but I think it could help you
https://github.com/mate1/camus

Resources