As part of a POC, I have a Kafka JDBC Source Connector task which basically reads all the contents of a MySQL table and dumps the records into a Kafka topic.
The table contains only 2 records, and my batch.max.rows value is 2. When the task runs in "bulk" mode, I see 2 individual JSON records in the Kafka topic. How would I configure the connector to produce 1 JSON record containing a JSON array of those 2 records, so that the number of messages published to the Kafka topic is 1 instead of 2?
Each database row will become a unique Kafka record.
If you want to join or window records, then you would use a stream processor.
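If the goal is literally one array-valued message, you could run a small Kafka Streams job downstream of the source topic. A minimal sketch, assuming String-serialized JSON values and hypothetical topic names "mysql-rows" and "combined-rows" (the window length is arbitrary):

import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class RowCombiner {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "row-combiner");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> rows = builder.stream("mysql-rows");   // topic written by the JDBC source
        rows.selectKey((k, v) -> "all")                                // same key for every row so they group together
            .groupByKey()
            .windowedBy(TimeWindows.of(Duration.ofSeconds(10)))        // collect rows that arrive close together
            .reduce((agg, next) -> agg + "," + next)                   // naive concatenation of the JSON objects
            .toStream()
            .map((windowedKey, joined) -> KeyValue.pair(windowedKey.key(), "[" + joined + "]"))
            .to("combined-rows");                                      // one array-valued message per window

        new KafkaStreams(builder.build(), props).start();
    }
}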
I cannot find documentation of the commit strategy, or a parameter that controls it, for the Kafka Connect JDBC Sink with respect to the JDBC target.
Does it commit every N rows, or only once batch.size is reached? Whatever that N is, committing per batch or once the batch is complete would make sense.
When a Kafka Connect worker is running a sink task, it will consume messages from the topic partition(s) assigned to the task: once partitions have been opened for writing, Connect will begin forwarding records from Kafka using the put(Collection) API.
The JDBC sink connector writes each batch of messages passed through the put(Collection) method using a transaction (the size of which can be controlled via the connector's consumer settings).
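For reference, both knobs can be set in the connector configuration itself. A minimal sketch in standalone .properties form; the connector name, topic, and connection details are placeholders, and the consumer.override.* prefix requires connector.client.config.override.policy=All on the worker (Kafka 2.3+):

# Hypothetical JDBC sink configuration (standalone mode)
name=jdbc-sink-demo
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
topics=orders
connection.url=jdbc:mysql://localhost:3306/demo
connection.user=demo
connection.password=demo
auto.create=true
# upper bound on the number of records handed to put() per poll,
# i.e. the batch written in one transaction
consumer.override.max.poll.records=500
# how many records the connector attempts to batch into a single insert
batch.size=500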
In Kafka Streams, my KV store is linked to a sink that sends records to the output topic.
- What exceptions would we get if, for some reason, the sink can't commit records to the topic?
If the sink cannot write the record, it will retry internally, and after all retries are exhausted the whole application goes down with an exception. If the store was updated successfully, it will (by default) still contain the data, and you cannot delete it. This is the guarantee that "at-least-once" processing gives you.
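If you want to observe (or override) exactly which exception the sink hit before the application dies, Kafka Streams 1.1+ lets you register a ProductionExceptionHandler. A minimal sketch; the class name and logging are illustrative:

import java.util.Map;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.streams.errors.ProductionExceptionHandler;

public class LoggingProductionExceptionHandler implements ProductionExceptionHandler {
    @Override
    public ProductionExceptionHandlerResponse handle(ProducerRecord<byte[], byte[]> record,
                                                     Exception exception) {
        // e.g. RecordTooLargeException, or a TimeoutException once the producer's retries are exhausted
        System.err.println("Failed to write to " + record.topic() + ": " + exception);
        return ProductionExceptionHandlerResponse.FAIL;  // FAIL matches the default: stop the application
    }

    @Override
    public void configure(Map<String, ?> configs) { }
}

// register it via:
// props.put(StreamsConfig.DEFAULT_PRODUCTION_EXCEPTION_HANDLER_CLASS_CONFIG,
//           LoggingProductionExceptionHandler.class);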
As of Kafka 0.11 you can enable "exactly-once" processing:
properties.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
In this case, on application restart the store will be deleted and recreated before any processing is repeated. This ensures that the data written to the store before the error is "removed" before processing continues.
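Putting that together, a minimal sketch of the relevant setup (the application id and broker address are placeholders, and brokers must be on 0.11 or later):

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;

public final class ExactlyOnceLauncher {
    public static KafkaStreams start(Topology topology) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");     // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);

        // With exactly-once enabled, output writes and offset commits happen in one
        // transaction, and after a failure local state stores are wiped and rebuilt
        // from their changelog topics before processing resumes.
        KafkaStreams streams = new KafkaStreams(topology, props);
        streams.start();
        return streams;
    }
}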
I use logstash to transfer data from Kafka to Elasticsearch and I'm getting the following error:
WARN org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Auto offset commit failed for group kafka-es-sink: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
I tried to adjust the session timeout (to 30000) and max poll records (to 250).
The topic produces 1000 events per second in Avro format. There are 10 partitions (2 servers) and two Logstash instances with 5 consumer threads each.
I have no problems with other topics with ~100-300 events per second.
I think it must be a configuration issue, because I also have a second connector between Kafka and Elasticsearch on the same topic which works fine (Confluent's kafka-connect-elasticsearch).
The main aim is to compare Kafka Connect and Logstash as connectors. Does anyone have general experience with this?
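For reference, the two settings I adjusted map to the session_timeout_ms and max_poll_records options of the Logstash kafka input. A sketch of my pipeline; broker addresses, topic, and index name are placeholders:

input {
  kafka {
    bootstrap_servers  => "broker1:9092,broker2:9092"   # placeholders
    topics             => ["events-avro"]                # placeholder topic name
    group_id           => "kafka-es-sink"
    consumer_threads   => 5
    session_timeout_ms => "30000"
    max_poll_records   => "250"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]                          # placeholder
    index => "events-%{+YYYY.MM.dd}"
  }
}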
Stream Analytics job takes several minutes to output results to an Event Hub.
A Stateless Web API Azure Service Fabric application is distributed across 8 nodes. The application is very simple, consisting of a single Controller, which:
Receives a series of JSON objects
Initialises a series of EventData instances that wrap each JSON object
Sets the PartitionKey property of each EventData instance to the machine-name value
Publishes the JSON objects as a single batch of EventData instances to Azure Event Hub
The JSON payload is a simple series of IP addresses and time-stamps, as follows:
[{
"IPAddress": "10.0.0.2",
"Time": "2016-08-17T12:00:01",
"MachineName": "MACHINE01"
}, {
"IPAddress": "10.0.0.3",
"Time": "2016-08-17T12:00:02",
"MachineName": "MACHINE01"
}]
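For illustration only, publishing such a payload as a single batch keyed by the machine name looks roughly like the following sketch against the Java Event Hubs SDK (azure-messaging-eventhubs); the actual controller is .NET, and the connection string, hub name, and payload here are placeholders:

import com.azure.messaging.eventhubs.EventData;
import com.azure.messaging.eventhubs.EventDataBatch;
import com.azure.messaging.eventhubs.EventHubClientBuilder;
import com.azure.messaging.eventhubs.EventHubProducerClient;
import com.azure.messaging.eventhubs.models.CreateBatchOptions;
import java.util.List;

public class IpEventPublisher {
    public static void main(String[] args) {
        EventHubProducerClient producer = new EventHubClientBuilder()
                .connectionString("<event-hub-connection-string>", "input-hub")  // placeholders
                .buildProducerClient();

        // PartitionKey = machine name, so all events from one node land in the same partition
        String machineName = System.getenv().getOrDefault("COMPUTERNAME", "MACHINE01");
        EventDataBatch batch = producer.createBatch(new CreateBatchOptions().setPartitionKey(machineName));

        List<String> payloads = List.of(
                "{\"IPAddress\":\"10.0.0.2\",\"Time\":\"2016-08-17T12:00:01\",\"MachineName\":\"MACHINE01\"}",
                "{\"IPAddress\":\"10.0.0.3\",\"Time\":\"2016-08-17T12:00:02\",\"MachineName\":\"MACHINE01\"}");
        for (String json : payloads) {
            batch.tryAdd(new EventData(json));  // one EventData per JSON object (capacity check omitted for brevity)
        }

        producer.send(batch);   // single batch publish
        producer.close();
    }
}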
Once received, the Event Hub acts as an input to a Stream Analytics job, which executes the following Query:
SELECT
IPAddress, COUNT(*) AS Total, MachineName
INTO
Output
FROM
Input TIMESTAMP BY TIME
PARTITION BY PartitionId
GROUP BY
TUMBLINGWINDOW(MINUTE, 1), IPAddress, MachineName, PartitionId
HAVING Total >= 2
Note that the Query is partitioned by PartitionId, where PartitionId is set to the machine-name of the originating Service Fabric application. There will therefore be a maximum of 8 PartitionKeys.
There are 8 individual Service Fabric instances, and 8 corresponding Partitions assigned to the input Event Hub.
Finally, the Stream Analytics job outputs the result to a second Event Hub. Again, this Event Hub has 8 Partitions. The Stream Analytics Query retains the machine-name, which is used as the PartitionKey of the output Event Hub. The output format is JSON.
At best, the process takes 30-60 seconds, and sometimes several minutes, for a single HTTP request to reach the output Event Hub. The bottleneck seems to be the Stream Analytics job, given that the ASP.NET application publishes the EventData batches in sub-second timescales.
Edit:
Applying a custom field to TIMESTAMP BY adds a great deal of latency when coupled with a GROUP BY clause. I achieved acceptable latency (1-2 seconds) after removing the TIMESTAMP BY clause.
The optimal Query is as follows:
SELECT
COUNT(*) AS Total, IPAddress
FROM
Input
PARTITION BY PartitionId
GROUP BY TUMBLINGWINDOW(MINUTE, 1), IPAddress, PartitionId
However, adding a HAVING clause results in latency increasing to 10-20 seconds:
SELECT
COUNT(*) AS Total, IPAddress
FROM
Input
PARTITION BY PartitionId
GROUP BY TUMBLINGWINDOW(MINUTE, 1), IPAddress, PartitionId
HAVING Total >= 10
If I cannot aggregate data within the Query using a HAVING clause in a timely fashion, that seems to defeat the purpose.
Incidentally, Streaming Units, Partitions, Input and Output are configured optimally, as per this guide to achieving parallelism with Stream Analytics.