My requirement is to send the data to a different ES sink based on the data itself. For example, if the data contains a particular piece of info, send it to sink1, otherwise send it to sink2, and so on (basically route each record dynamically to one of the sinks based on its content). I also want to set the parallelism separately for ES sink1, ES sink2, ES sink3, etc.
                              -> ES sink1 (parallelism 4)
Kafka -> Map(Transformations) -> ES sink2 (parallelism 2)
                              -> ES sink3 (parallelism 2)
Is there any simple way to achieve the above in Flink?
My solution (but I'm not satisfied with it):
I could come up with a solution, but it involves intermediate Kafka topics that I write to (topic1, topic2, topic3) and then separate pipelines from those topics to ES sink1, ES sink2 and ES sink3. I want to avoid writing to these intermediate Kafka topics.
Kafka -> Map(Transformations) -> Kafka topics (insert into topic1, topic2, topic3 based on the data)
Kafka topic1 -> ES sink1 (parallelism 4)
Kafka topic2 -> ES sink2 (parallelism 2)
Kafka topic3 -> ES sink3 (parallelism 2)
You can use a ProcessFunction [1] with side outputs [2] to split the stream n ways, connect each side output stream to the appropriate sink, and then call setParallelism() [3] on each sink; see the sketch after the links below.
[1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/process_function.html#the-processfunction
[2] https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/side_output.html
[3] https://ci.apache.org/projects/flink/flink-docs-stable/dev/parallel.html#operator-level
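A minimal sketch of that approach, assuming the routing goes three ways (MyEvent, the goesToSink2()/goesToSink3() routing helpers, and buildEsSink() are placeholders, not actual Flink APIs; the real Elasticsearch sink construction depends on your connector version):

import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

// Side output tags for the records that should go to ES sink2 and ES sink3.
final OutputTag<MyEvent> sink2Tag = new OutputTag<MyEvent>("es-sink2") {};
final OutputTag<MyEvent> sink3Tag = new OutputTag<MyEvent>("es-sink3") {};

SingleOutputStreamOperator<MyEvent> routed = transformedStream
        .process(new ProcessFunction<MyEvent, MyEvent>() {
            @Override
            public void processElement(MyEvent value, Context ctx, Collector<MyEvent> out) {
                if (goesToSink2(value)) {            // hypothetical routing predicate
                    ctx.output(sink2Tag, value);     // -> side output for ES sink2
                } else if (goesToSink3(value)) {     // hypothetical routing predicate
                    ctx.output(sink3Tag, value);     // -> side output for ES sink3
                } else {
                    out.collect(value);              // main output -> ES sink1
                }
            }
        });

// Each sink gets its own parallelism.
routed.addSink(buildEsSink("index1")).setParallelism(4);                          // ES sink1
routed.getSideOutput(sink2Tag).addSink(buildEsSink("index2")).setParallelism(2);  // ES sink2
routed.getSideOutput(sink3Tag).addSink(buildEsSink("index3")).setParallelism(2);  // ES sink3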
Apologies if this has already been covered here; I couldn't find anything closely related. I have a Kafka Streams app which reads from multiple topics, persists the records in a DB and then publishes an event to an output topic. Pretty straightforward, and it's stateless in terms of Kafka local stores. (Topology below.)
Topic1 (T1) has 5 partitions, Topic2 (T2) has a single partition. The issue is that, while consuming from two topics, if I want to go "full speed" with T1 (5 consumers), there is no guarantee that I will have a dedicated consumer for each partition of T1. The work is distributed across the partitions of the two topics, and I might end up with unbalanced (and idle) consumers, something like below:
[c1: t1p1, t1p3], [c2: t1p2, t1p5], [c3: t1p4, t2p1], [c4: (idle consumer)], [c5: (idle consumer)]
[c1: t1p1, t1p2], [c2: t1p5], [c3: t1p4, t2p1], [c4: (idle consumer)], [c5: t1p3]
With that said:
Is it a good practice having a topology that reads from multiple topics within the same KafkaStreams instance?
Is there any way to achieve a partition assignment like the following if I want to go "full speed" for T1? [c1: t1p1, t2p1], [c2: t1p2], [c3: t1p3], [c4: t1p4], [c5: t1p5]
Which of the topologies below is best suited to what I want to achieve? Or is it completely unrelated?
Option A (Current topology)
Topologies:
   Sub-topology: 0
    Source: topic1-source (topics: [TOPIC1])
      --> topic1-processor
    Processor: topic1-processor (stores: [])
      --> topic1-sink
      <-- topic1-source
    Sink: topic1-sink (topic: OUTPUT-TOPIC)
      <-- topic1-processor
   Sub-topology: 1
    Source: topic2-source (topics: [TOPIC2])
      --> topic2-processor
    Processor: topic2-processor (stores: [])
      --> topic2-sink
      <-- topic2-source
    Sink: topic2-sink (topic: OUTPUT-TOPIC)
      <-- topic2-processor
Option B:
Topologies:
   Sub-topology: 0
    Source: topic1-source (topics: [TOPIC1])
      --> topic1-processor
    Source: topic2-source (topics: [TOPIC2])
      --> topic2-processor
    Processor: topic1-processor (stores: [])
      --> response-sink
      <-- topic1-source
    Processor: topic2-processor (stores: [])
      --> response-sink
      <-- topic2-source
    Sink: response-sink (topic: OUTPUT-TOPIC)
      <-- topic2-processor, topic1-processor
If I use two streams, one for each topic, instead of a single streams instance with multiple topics, would that work for what I am trying to achieve?
config1.put("application.id", "app1");
KafkaStreams stream1 = new KafkaStreams(topologyTopic1, config1);
stream1.start();
config2.put("application.id", "app2");
KafkaStreams stream2 = new KafkaStreams(topologyTopic2, config2);
stream2.start();
The initial assignments you describe would never happen with Kafka Streams (and also not with any default consumer config). If there are 5 partitions and you have 5 consumers, each consumer would get 1 partition assigned. (For a plain consumer with a custom PartitionAssignor you could do the assignment differently, but all default implementations ensure proper load balancing.)
Is it a good practice having a topology that reads from multiple topics within the same KafkaStreams instance?
There is no issue with that.
Is there any way to achieve a partition assignment like the following if I want to go "full speed" for T1? [c1: t1p1, t2p1], [c2: t1p2], [c3: t1p3], [c4: t1p4], [c5: t1p5]
Depending on how you write your topology, this would be the assignment Kafka Streams uses out of the box. Of your two options, Option B would result in this assignment.
Which of the topologies below is best suited to what I want to achieve? Or is it completely unrelated?
As mentioned above, Option B would result in the assignment above. For Option A, you could actually even use a 6th instance and each instance would process exactly one partition (because there are two sub-topologies, you get 6 tasks: 5 for sub-topology 0 and 1 for sub-topology 1; sub-topologies are scaled out independently of each other). For Option B, you only get 5 tasks, because there is only one sub-topology and thus the maximum number of partitions over both input topics (that is, 5) determines the number of tasks.
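For illustration, a rough sketch of how an Option B style topology can be wired with the Processor API so that both sources end up in a single sub-topology (Topic1Processor and Topic2Processor are placeholders for your own Processor implementations, not your actual code):

import org.apache.kafka.streams.Topology;

Topology topology = new Topology();
topology.addSource("topic1-source", "TOPIC1");
topology.addSource("topic2-source", "TOPIC2");
// Placeholder processor classes; each implements the Kafka Streams Processor interface.
topology.addProcessor("topic1-processor", Topic1Processor::new, "topic1-source");
topology.addProcessor("topic2-processor", Topic2Processor::new, "topic2-source");
// Sharing one sink node connects both branches into a single sub-topology,
// so you get 5 tasks and t2p1 is co-assigned with t1p1 (task 0_0).
topology.addSink("response-sink", "OUTPUT-TOPIC", "topic1-processor", "topic2-processor");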
If I use two streams, one for each topic, instead of a single streams instance with multiple topics, would that work for what I am trying to achieve?
Yes, it would be basically the same as Option A -- however, you get two consumer groups and thus "two applications" instead of one.
I currently have a working flow:
Fiware Orion -> Fiware Cygnus -> Kafka -> Logstash -> Elasticsearch -> Kibana
I would like to push data directly from Cygnus to Elasticsearch; is there a sink already available?
An Apache Flume/Elasticsearch sink already exists: https://flume.apache.org/releases/content/1.3.0/apidocs/org/apache/flume/sink/elasticsearch/ElasticSearchSink.html
I was wondering if it would be easy to use it with Cygnus?
Up to and including Cygnus 1.5.0, such a sink could perfectly be used (as any other Flume sink) in a Cygnus agent configuration.
From 1.6.0 on (inclusive; this is the latest version) you will not be able to, since we internally replaced the usage of native Event objects with custom NGSIEvent ones. Why?
An Event is a set of headers and an array of raw bytes for the body.
NGSIEvent inherits from Event and is a set of headers, an already-parsed version of the body (as an object), and an array of raw bytes for the body that points to null (this last part is what breaks compatibility with native Flume sinks).
Anyway, this is "easy" to fix: in a new version, NGSIEvent will contain both the parsed version of the body and the body itself as raw bytes.
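Roughly, the shapes being described look like this (an illustrative sketch only, not the actual Flume/Cygnus source; the field names are made up):

import java.util.Map;

// Illustrative sketch only, not the real Flume/Cygnus classes.
class Event {
    Map<String, String> headers;
    byte[] body;                 // raw bytes; this is what a native Flume sink reads
}

class NGSIEvent extends Event {
    Object parsedBody;           // already-parsed NGSI notification (hypothetical field)
    // From Cygnus 1.6.0 on, the inherited raw 'body' is left null,
    // which is what breaks native Flume sinks such as ElasticSearchSink.
}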
I am working with MapR Streams and setting the parameter "spark.kafka.poll.time" in my direct Kafka API consumer; however, I don't know exactly what this parameter means.
According to the MapR documentation, it is the query interval time for a consumer on MapR Streams (http://maprdocs.mapr.com/home/Spark/Spark_IntegrateMapRStreams_Consume.html). Mostly you have to specify it only when using Spark Streaming to connect to Kafka. In a standard Java Kafka consumer, the poll method takes a timeout in milliseconds that you have to specify, so there is an analogy between the two.
For Java:
ConsumerRecords<String, String> records = kafkaConsumer.poll(consumerPollTime);
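For example, a minimal full consumer loop would look roughly like this (broker address, group id, topic name and the 300 ms timeout are just example values):

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");   // example value
props.put("group.id", "example-group");             // example value
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(props);
kafkaConsumer.subscribe(Arrays.asList("example-topic"));   // example topic
long consumerPollTime = 300;                               // analogous to spark.kafka.poll.time
while (true) {
    ConsumerRecords<String, String> records = kafkaConsumer.poll(consumerPollTime);
    for (ConsumerRecord<String, String> record : records) {
        System.out.println(record.value());
    }
}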
For Spark Streaming, pass it as part of the Kafka parameters map:
val kafkaParams = Map[String, String](
  "spark.kafka.poll.time" -> "300"
  // other params
)
KafkaUtils.createDirectStream[String, String](ssc, kafkaParams, topics)
I'm using Spring XD 1.2.1 with Kafka as the transport layer. I have the following setup:
xd:
  transport: kafka
  messagebus:
    kafka:
      default:
        concurrency: 10
        minPartitionCount: 10
I have the following streams as an example:
Streams
stream create f --definition "queue:foo > transform --expression=payload+'-foo' | log"
stream create b --definition "queue:bar > transform --expression=payload+'-bar' | log"
stream deploy --name f --properties "module.transform.count=2"
stream deploy --name b --properties "module.transform.count=2"
stream create r --definition "time | router --expression=payload.contains('10')?'queue:foo':'queue:bar'" --deploy
Question
How can I scale up the first processor in a stream whose "source" is a named channel? I was expecting something like 20 partitions in the transformers of streams f and b, given that the count is 2 and the concurrency is 10, but the number of partitions is 10.
This works as expected when you deploy modules other than the first one.
Should I configure the named channels in a specific way to achieve this?
Thanks.
The Kafka partitions are controlled by the producing side.
For a mid-stream channel we can look at the "next" module and calculate the partition count from its needs (concurrency * count, e.g. 10 * 2 = 20 in your case).
With a named channel, we have no way of knowing the number of consumers (or their concurrency), so a count and concurrency of 1 are used, and minPartitionCount (10 here) becomes the partition count.
You would need to deploy the producing stream with an appropriate setting to increase the partitions:
stream deploy foo --properties module.last.producer.minPartitionCount=20
EDIT
Actually, it looks like we have a bug: you can't specify minPartitionCount on a named channel.
I see you have opened a JIRA issue.
First I was deciding what to use to get events into Hadoop, where they will be stored and periodically analyzed (possibly using Oozie to schedule the periodic analysis): Kafka or Flume. I decided that Kafka is probably the better solution, since we also have a component that does event processing, so this way both the batch and event processing components get data in the same way.
But now I'm looking for concrete suggestions on how to get data out of the broker and into Hadoop.
I found here that Flume can be used in combination with Kafka:
Flume - Contains Kafka Source (consumer) and Sink (producer)
I also found, on the same page and in the Kafka documentation, that there is something called Camus:
Camus - LinkedIn's Kafka=>HDFS pipeline. This one is used for all data at LinkedIn, and works great.
I'm interested in which would be the better (and easier, better documented) solution. Also, are there any examples or tutorials on how to do it?
When should I use these variants over the simpler, high-level consumer?
I'm open to suggestions if there is another/better solution than these two.
Thanks
You can use Flume to dump data from Kafka to HDFS. Flume has a Kafka source and sink; it's a matter of changing a properties file. An example is given below.
Steps:
Create a Kafka topic
kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic testkafka
Write to the above topic using the Kafka console producer
kafka-console-producer --broker-list localhost:9092 --topic testkafka
Configure a Flume agent with the following properties
flume1.sources = kafka-source-1
flume1.channels = hdfs-channel-1
flume1.sinks = hdfs-sink-1
flume1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource
flume1.sources.kafka-source-1.zookeeperConnect = localhost:2181
flume1.sources.kafka-source-1.topic =testkafka
flume1.sources.kafka-source-1.batchSize = 100
flume1.sources.kafka-source-1.channels = hdfs-channel-1
flume1.channels.hdfs-channel-1.type = memory
flume1.sinks.hdfs-sink-1.channel = hdfs-channel-1
flume1.sinks.hdfs-sink-1.type = hdfs
flume1.sinks.hdfs-sink-1.hdfs.writeFormat = Text
flume1.sinks.hdfs-sink-1.hdfs.fileType = DataStream
flume1.sinks.hdfs-sink-1.hdfs.filePrefix = test-events
flume1.sinks.hdfs-sink-1.hdfs.useLocalTimeStamp = true
flume1.sinks.hdfs-sink-1.hdfs.path = /tmp/kafka/%{topic}/%y-%m-%d
flume1.sinks.hdfs-sink-1.hdfs.rollCount=100
flume1.sinks.hdfs-sink-1.hdfs.rollSize=0
flume1.channels.hdfs-channel-1.capacity = 10000
flume1.channels.hdfs-channel-1.transactionCapacity = 1000
Save the above config file as example.conf
Run the Flume agent
flume-ng agent -n flume1 -c conf -f example.conf -Dflume.root.logger=INFO,console
Data will now be dumped to HDFS under the following path:
/tmp/kafka/%{topic}/%y-%m-%d
Most of the time, I see people using Camus with Azkaban.
You can look at the GitHub repo of Mate1 for their implementation of Camus. It's not a tutorial, but I think it could help you:
https://github.com/mate1/camus