Kafka Connect: separate logging per connector for debugging

Currently we are using a couple of custom connector plugins in our Confluent Kafka Connect distributed worker cluster. One thing that has bothered me for a long time is that Kafka Connect writes all logs from all deployed connectors to one file/stream. This makes debugging an absolute nightmare. Is there a way to make Kafka Connect log each connector to a different file/stream?
Via connect-log4j.properties I am able to route a specific class to a different file/stream, but this means that with every additional connector I have to adjust connect-log4j.properties.
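For example, the kind of entry I currently add for each connector class looks roughly like this (the class name and file path are just placeholders):

    # route this connector's classes to their own rolling file
    log4j.logger.com.example.connect.MySourceConnector=INFO, connectorA
    log4j.additivity.com.example.connect.MySourceConnector=false
    log4j.appender.connectorA=org.apache.log4j.RollingFileAppender
    log4j.appender.connectorA.File=/var/log/kafka-connect/my-source-connector.log
    log4j.appender.connectorA.MaxFileSize=10MB
    log4j.appender.connectorA.MaxBackupIndex=5
    log4j.appender.connectorA.layout=org.apache.log4j.PatternLayout
    log4j.appender.connectorA.layout.ConversionPattern=[%d] %p %m (%c:%L)%n

I have to repeat a block like this for every new connector we deploy.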
Thanks

Kafka Connect does not currently support this. I agree that it is not ideal.
One option would be to split out your connectors and have a dedicated worker cluster for each, and thus separate log files.
Kafka Connect is part of Apache Kafka so you could raise a JIRA to discuss this further and maybe contribute it back via a PR?
Edit April 12, 2019: See https://cwiki.apache.org/confluence/display/KAFKA/KIP-449%3A+Add+connector+contexts+to+Connect+worker+logs
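For anyone finding this later: once you are on a Kafka version that includes KIP-449 (2.3 onwards), the Connect worker puts the connector/task name into an MDC value, so a conversion pattern along these lines prefixes every log line with its connector context (a sketch of the relevant lines from the stock connect-log4j.properties, not a complete file):

    log4j.appender.stdout=org.apache.log4j.ConsoleAppender
    log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
    # %X{connector.context} expands to something like "[my-connector|task-0] "
    log4j.appender.stdout.layout.ConversionPattern=[%d] %p %X{connector.context}%m (%c:%L)%n

That does not split connectors into separate files by itself, but it makes a single combined log far easier to grep per connector.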

Related

Use Kafka Connect with Azure Event Hubs and/or AWS Kinesis/MSK to send data to ElasticSearch

Has anyone used Kafka Connect with one or more of the following cloud streaming services?
AWS Kinesis
AWS MSK
Azure Event Hubs
FWIW we're looking to send data from Kafka to ElasticSearch without needing to use an additional component such as Logstash or Filebeat.
At first I thought we could only do this using the Confluent Platform, but then read that Kafka Connect is just an open-source Apache project. The only need for Confluent would be if we want/need to use one of the proprietary connectors; given that the ElasticSearch sink connector is the only one we need (at least for now) and it is a community connector - see here (and here for licensing info) - we might be able to do this using one of the AWS/Azure streaming services, assuming this is supported. (Note: AWS or Azure represents a path of less resistance, as the company I work for already has vendor relationships with both AWS & Microsoft. I'm not saying we won't use Confluent or migrate to it at some stage, but for now Azure/AWS is going to be easier to get across the line.)
I found a Microsoft document that implies we can use Azure Event Hubs with Kafka Connect, even though Event Hubs is a bit different to open-source Kafka. I'm not sure about AWS Kinesis or MSK - I assume MSK would be fine, but I'm not certain. Any guidance/blogs/articles would be much appreciated.
Cheers,

Kafka Connect behaviour in distributed mode

I'm running Kafka Connect in distributed mode with two different connectors, each having one task. Each connector's task is running on a different instance, which is exactly what I want.
Is this behaviour always guaranteed, i.e. will the Kafka Connect cluster always share the load properly?
Connectors in Kafka Connect run one or more tasks. The number of tasks depends on how you have configured the connector, and on whether the connector itself can run multiple tasks. An example is the JDBC Source connector which, if ingesting more than one table from a database, will run (if configured to do so) one task per table.
When you run Kafka Connect in distributed mode, tasks from all the connectors are executed across the available workers. Each task will only be executing on one worker at one time.
If a worker fails (or is shut down) then Kafka Connect will rebalance the tasks across the remaining worker(s).
Therefore, you may see one connector running across different workers (instances), but only if it has more than one task.
If you think you are seeing the same connector's task executing more than once then it suggests a misconfiguration of the Kafka Connect cluster, and I would suggest reviewing https://rmoff.net/2019/11/22/common-mistakes-made-when-configuring-multiple-kafka-connect-workers/.
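If you want to verify which worker each task is actually running on, the Connect REST API's status endpoint reports a worker_id per task; a quick check could look like this (host, port, and connector name are placeholders):

    curl -s http://connect-host:8083/connectors/my-connector/status

    {
      "name": "my-connector",
      "connector": { "state": "RUNNING", "worker_id": "10.0.0.1:8083" },
      "tasks": [ { "id": 0, "state": "RUNNING", "worker_id": "10.0.0.2:8083" } ]
    }

A single task will only ever show one worker_id at a time; after a rebalance you may see it move to a different worker.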

Centralized logging with Kafka and ELK stack

There are more than 50 Java applications (they are not microservices, so we don't have to worry about multiple instances of a service). My architect has designed a solution to take the log files, feed them into a Kafka topic, feed that from Kafka into Logstash, and push it into Elasticsearch so we can view the logs in Kibana. I am new to Kafka and the ELK stack. Will someone point me in the right direction on how to do this task? I have learnt that Log4j and SLF4J can be configured to push logs to a Kafka topic.
1. How do I consume from Kafka and load it into Logstash? Do I have to write a Kafka consumer, or can that be done just by configuration?
2. How will Logstash feed the logs to Elasticsearch?
3. How can I differentiate the logs of all 50 applications? Do I have to create a topic for each and every application?
That's the business problem; now I need step-by-step expert advice. Thanks in advance.
Essentially, what your architect has laid out for you can be divided into two major components based upon their function (at the architecture level):
Log Buffer (Kafka)
Log Ingester (ELK)
[Java Applications] =====> [Kafka] ------> [ELK]
If you study ELK you may feel it is sufficient for your solution and that Kafka is surplus. However, Kafka has an important role to play when it comes to scale: when many of your Java applications send logs to ELK, ELK may become overloaded and break.
To keep ELK from being overloaded, your architect has set up a buffer (Kafka). Kafka receives logs from the applications and queues them up when ELK is under load. This way you do not break ELK, and you also do not lose logs while ELK is struggling.
Answers to your questions, in the same order:
(1) Logstash has 'input' plugins that can be used to set up a link between Kafka and Logstash (see the pipeline sketch after point 3). Read up on Logstash and its plugins:
i- Logstash Guide or Reference
ii- Input Plugins (scroll down to find Kafka plugin)
(2) Logstash feeds the received logs to Elasticsearch via its Elasticsearch output plugin. See the Logstash output plugin for Elasticsearch.
(3) I may not be spot-on on this, but I think you would be able to filter and distinguish the logs at the Logstash level once you receive them from Kafka. You could apply tags or fields to each log message on reception; this additional information can then be used in Elasticsearch to distinguish the applications from one another.
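To make (1)-(3) concrete, a minimal Logstash pipeline along these lines reads from Kafka and writes to Elasticsearch; the broker address, topic, and the "app" field used in the index name are assumptions you would replace with your own setup:

    input {
      kafka {
        bootstrap_servers => "kafka1:9092"
        topics            => ["app-logs"]
        group_id          => "logstash-app-logs"
        codec             => "json"    # assumes the applications ship JSON log events
      }
    }
    filter {
      # tag every event on reception so it can be traced back to this pipeline
      mutate { add_field => { "pipeline" => "app-logs" } }
    }
    output {
      elasticsearch {
        hosts => ["http://es1:9200"]
        # assumes each event carries an "app" field identifying the application
        index => "logs-%{app}-%{+YYYY.MM.dd}"
      }
    }

With an index (or simply a field) per application, Kibana can then tell the 50 applications apart without needing 50 topics.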
Implementation Steps
As somebody who is new to Kafka and ELK, follow these steps towards your solution:
Step 1: Set up ELK first. Once you do that, you will be able to see how logs are visualized, and it will become clearer what the end solution may look like.
Guide to ELK Stack
Step 2: Set up Kafka to link your application logs to ELK.
Caveats:
You may find that ELK has a fairly steep learning curve. It takes time to understand how each element in the ELK stack works and what its individual configuration and query languages are.
To gain a deep understanding of ELK, take the local deployment path and set up ELK on your own system; avoid the hosted ELK services for that purpose.
Logstash has a Kafka input and an Elasticsearch output, so this is configuration on the Logstash side. You can differentiate the applications using configuration on the Log4j side (although using many topics is another possibility).
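On the Log4j side, one way to do that (assuming the applications use Log4j 2, which ships a KafkaAppender, and version 2.7+ for the extra JSON fields) is to send every application's log events to a shared topic and stamp each event with an application name that Logstash/Elasticsearch can filter on later; the topic, broker, and application name below are placeholders:

    <?xml version="1.0" encoding="UTF-8"?>
    <Configuration>
      <Appenders>
        <!-- Log4j 2 KafkaAppender: publishes every log event to the "app-logs" topic -->
        <Kafka name="kafka" topic="app-logs">
          <JsonLayout compact="true" eventEol="true">
            <!-- constant field identifying this application in every event -->
            <KeyValuePair key="app" value="order-service"/>
          </JsonLayout>
          <Property name="bootstrap.servers">kafka1:9092</Property>
        </Kafka>
      </Appenders>
      <Loggers>
        <!-- keep the Kafka client itself at INFO; DEBUG here would recurse through the appender -->
        <Logger name="org.apache.kafka" level="INFO"/>
        <Root level="info">
          <AppenderRef ref="kafka"/>
        </Root>
      </Loggers>
    </Configuration>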

Oracle data replication using Apache Kafka

I would like to expose the data tables from my Oracle database into Apache Kafka. Is it technically possible?
I also need to stream data changes from my Oracle tables and notify Kafka about them.
Do you know of good documentation for this use case?
Thanks
You need the Kafka Connect JDBC source connector to load data from your Oracle database. There is an open-source bundled connector from Confluent. It has been packaged and tested with the rest of the Confluent Platform, including the Schema Registry. Using this connector is as easy as writing a simple connector configuration and starting a standalone Kafka Connect process, or making a REST request to a Kafka Connect cluster. Documentation for this connector can be found here
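For a sense of what "a simple connector configuration" means, a JDBC source configuration for one Oracle table is on the order of the snippet below (connection details, table, and column names are placeholders; see the connector documentation for the full option list):

    name=oracle-jdbc-source
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    tasks.max=1
    # Oracle connection details (thin driver URL shown as an example)
    connection.url=jdbc:oracle:thin:@db-host:1521/ORCLPDB1
    connection.user=kafka_connect
    connection.password=********
    # which table(s) to pull, and how to detect new rows
    table.whitelist=CUSTOMERS
    mode=incrementing
    incrementing.column.name=ID
    # resulting topic will be oracle-CUSTOMERS
    topic.prefix=oracle-

Load it with a standalone worker, or POST the same settings as JSON to a distributed cluster's REST API.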
To move change data in real time from Oracle transactional databases to Kafka, you first need to use a proprietary Change Data Capture (CDC) tool, which requires purchasing a commercial license, such as Oracle GoldenGate, Attunity Replicate, Dbvisit Replicate or Striim. Then you can leverage the Kafka Connect connectors that they all provide. They are all listed here
Debezium, an open-source CDC tool from Red Hat, is planning to work on a connector that does not rely on an Oracle GoldenGate license. The related JIRA is here.
You can use Kafka Connect for data import/export to Kafka. Using Kafka Connect is quite simple, because there is no need to write code. You just need to configure your connector.
You would only need to write code if no connector is available and you want to provide your own connector. There are already 50+ connectors available.
There is a connector ("Golden Gate") for Oracle from Confluent Inc: https://www.confluent.io/product/connectors/
On the surface this is technically feasible. However, understand that the question has implications for downstream applications.
So to comprehensively address the original question regarding technical feasibility, bear in mind the following:
Are ordering/commit semantics important? Particularly across tables.
Is continued capture of table changes across instance crashes (of the Kafka/CDC components) important?
When the table definition changes - do you expect the application to continue working, or will you resort to planned change control?
Will you want to move partial subsets of data?
What datatypes need to be supported? e.g. Nested table support etc.
Will you need to handle compressed logical changes - e.g. on update/delete operations? How will you address this on the consumer side?
You can also consider using OpenLogReplicator. This is a new open-source tool which reads Oracle database redo logs and sends messages to Kafka. Since it is written in C++, it has very low latency (around 10 ms) while still offering relatively high throughput.
It is at an early stage of development, but there is already a working version. You can try building a POC and check for yourself how it works.

Hadoop logging facility?

If I am to use ZooKeeper as a work queue and connect individual consumers/workers to it, what would you recommend as a good distributed setup for logging these workers' activities?
Assume the following:
1) At any time we could be down to a single computer housing the Hadoop cluster. The system will autoscale up and down as needed, but has a lot of downtime where only a single computer is needed.
2) I just need the ability to access all of the workers' logs without accessing the individual machine a worker is located on. Bear in mind that by the time I get to read one of these logs, that machine might very well be terminated and long gone.
3) We'll need easy access to the logs, i.e. being able to cat/grep and tail them, or alternatively query them in a more SQL-ish manner; we'll need the ability both to query and to monitor output for short periods in real time (e.g. tail -f /var/log/mylog.1).
I appreciate your expert ideas here!
Thanks.
Have you looked at using Flume, Chukwa or Scribe? Ensure that your Flume (etc.) process has access to the log files that you are trying to aggregate onto a centralized server.
flume reference:
http://archive.cloudera.com/cdh/3/flume/Cookbook/
chukwa:
http://incubator.apache.org/chukwa/docs/r0.4.0/admin.html
scribe:
https://github.com/facebook/scribe/wiki/_pages
hope it helps.
The Fluentd log collector just released its WebHDFS plugin, which allows users to stream data into HDFS instantly. It's really easy to install and easy to manage.
Fluentd + Hadoop: Instant Big Data Collection
Of course, you can import data directly from your applications. Here's a Java example of posting logs to Fluentd. Fluentd's Java library is clever enough to buffer locally when the Fluentd daemon is down, which lessens the possibility of data loss.
Fluentd: Data Import from Java Applications
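As a rough illustration of that approach, a sketch using the fluent-logger-java library might look like the code below (tag names, host, and port are assumptions; see the linked article for the full walkthrough):

    import org.fluentd.logger.FluentLogger;

    import java.util.HashMap;
    import java.util.Map;

    public class WorkerActivityLogger {
        // fluent-logger-java buffers locally if the fluentd daemon at 127.0.0.1:24224 is down
        private static final FluentLogger LOG =
                FluentLogger.getLogger("worker", "127.0.0.1", 24224);

        public static void main(String[] args) {
            Map<String, Object> event = new HashMap<>();
            event.put("level", "INFO");
            event.put("message", "task started");
            // emitted with the tag "worker.activity", which fluentd can route to HDFS etc.
            LOG.log("activity", event);
            FluentLogger.closeAll();
        }
    }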
A high-availability configuration is also available, which basically enables you to have a centralized log aggregation system.
Fluentd: High Availability Configuration
