Configuring connectors for multiple topics on Kafka Connect Distributed Mode - elasticsearch

We have producers that are sending the following to Kafka:
topic=syslog, ~25,000 events per day
topic=nginx, ~5,000 events per day
topic=zeek.xxx.log, ~100,000 events per day (total). In this last case there are 20 distinct zeek topics, such as zeek.conn.log and zeek.http.log
kafka-connect-elasticsearch instances function as consumers to ship data from Kafka to Elasticsearch. The hello-world Sink configuration for kafka-connect-elasticsearch might look like this:
# elasticsearch.properties
name=elasticsearch-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=24
topics=syslog,nginx,zeek.broker.log,zeek.capture_loss.log,zeek.conn.log,zeek.dhcp.log,zeek.dns.log,zeek.files.log,zeek.http.log,zeek.known_services.log,zeek.loaded_scripts.log,zeek.notice.log,zeek.ntp.log,zeek.packet_filtering.log,zeek.software.log,zeek.ssh.log,zeek.ssl.log,zeek.status.log,zeek.stderr.log,zeek.stdout.log,zeek.weird.log,zeek.x509.log
topic.creation.enable=true
key.ignore=true
schema.ignore=true
...
And can be invoked with bin/connect-standalone.sh. I realized that running or attempting to run tasks.max=24 when work is performed in a single process is not ideal. I know that using distributed mode would be a better alternative, but am unclear on the performance-optimal way to submit connectors to distributed mode. Namely,
In distributed mode, would I still want to submit just a single elasticsearch.properties through a single API call? Or would it be best to break up multiple .properties configs + connectors (e.g. one for syslog, one for nginx, one for zeek.**) and submit them separately?
I understand that tasks be equal to the number of topics x number of partitions, but what dictates the number of workers?
Is there anywhere in the documentation that walks through best practices for a situation such as this where there is a noticeable imbalance of throughput for different topics?

In distributed mode, would I still want to submit just a single elasticsearch.properties through a single API call?
It'd be a JSON file, but yes.
what dictates the number of workers?
Up to you. JVM usage is one factor that you can monitor and scale on
Not really any documentation that I am aware of

Related

Flink web UI: Monitor Metrics doesn't work

run with flink-1.9.0 on yarn(2.6.0-cdh5.11.1), but the flink web ui metrics does'nt work, as shown below:
I guess you are looking at the wrong metrics. Due no data flows from one task to another (you can see only one box at the UI) there is nothing to show. The metrics you are looking at only show the data which flows from one flink task to another. At your example everything happens within this task.
Look at this example:
You can see two tasks sending data to the map-task which emits this data to another task. Therefore you see incoming and outgoing data.
But on the other hand a source task never has incoming data(I must admit that this is confusing at the first look):
The number of records recieved is 0 but it send a couple of records to the downstream task.
Back to your problem: What you can do is have a look at the operator metrics. If you look at the metrics tab (the one at the very right) you can select beside the task metrics also some operator metrics. These metrics have a name like 0.Map.numRecordsIn.
The name assembles like this <slot>.<operatorName>.<metricsname>. But be aware that this metrics are not recorded, you don't have any historic data and once you leave this tab or remove a metric the data collected until that point are gone. I would recommend to use a proper metrics backend like influx, prometheus or graphite. You can find a description at the flink docs.
Hope that helped.

Real-time monitoring of SQS queue in AWS

What's the best way to provide real-time monitoring of the total count of messages sent to an SQS queue?
I currently have a Grafana dashboard set up to monitor an SQS queue, but it seems to refresh about every two minutes. I'm looking to get something set up to update almost in real-time, e.g. refresh every second.
The queue I'm using consumes around 6,000 messages per minute.
Colleagues of mine have built something for real-time monitoring of uploads to an S3 bucket, using a lambda to populate a PostgreSQL DB and using Grafana to query this.
Is this the best way of achieving this? Is there a more efficient way?
SQS is not event driven - it must be polled. Therefore, there isn't an event each time a message is put into the queue or removed from it. With S3 to Lambda there is an event sent in pretty much real time every time an object has been created or removed.
You can change the polling interval for SQS and poll as fast as you'd like. But be aware that polling does have a cost. The first 1 million requests a month are free.
I'm not sure what you're trying to accomplish (I'll address after my idea), but there's certainly a couple ways you could accomplish this. Each has positive and negative.
In every place you produce or consume messages, increment or decrement a cloudwatch metric (or datadog, librato, etc). It's still polling-based, but you could get the granularity down (even by using Cloudwatch) to 15-60 seconds. The biggest problem here is that it's error prone (what happens if the SQS message times out and gets reprocessed?).
Create a secondary queue. Each message that goes into this queue is either a "add" or "delete" message. Attach a lambda, container, autoscale group to process the queue and update metrics in an RDS or DynamoDB table. Query the table as needed.
Use a different queue processing system instead of SQS. I've seen RabbitMQ and Sensu used in very large environments, they will easily handle 6,000 messages per minute.
Keep in mind, there are a lot more metrics than just number of messages in the queue. I've recently become really fond of ApproximateAgeOfOldestMessage, because it indicates whether messages are being processed without error. Here's a blog post about the most helpful SQS metrics. It's called How to Monitor Amazon SQS with CloudWatch

TCP replication of topics

According to the documentation here: https://github.com/OpenHFT/Chronicle-Engine one is able to do pub/sub using maps. This allows one to create a construct similar to topics that are available in middleware such as Tibco, 29W, Kafka and use that as a way of sending events across processes. Is this a recommended usage of chronicle map? What kind of latency can I expect if both publisher and subscriber stay in the same machine?
My second question is, how can this be extended to send messages across machines? How does this work with enterprise TCP replication?
My requirement is to create thousands of topics and use them to communicate across processes running in different machines (in a LAN). Each of these topics would be written by a single source and read by multiple readers running in same or different machines. If the source of a particular topic dies, that source's replica would start writing to the topic and listeners will continue to receive messages. These messages need not be stored for replay.
Is this a recommended usage of chronicle map?
Yes, you can use engine to support event notification across a machine. However, if you want lowest latencies you might need to send a notification via Queue and keep the latest value in a map.
What kind of latency can I expect if both publisher and subscriber stay in the same machine?
It depends on your use case esp the size of the data (in maps case the number of entries as well) The Latency for Map in Engine is around 30 - 100 us, however the latency for Queue is around 2 - 5 us.
My second question is, how can this be extended to send messages across machines?
For this you need our licensed product but the code is the same.
Each of these topics would be written by a single source and read by multiple readers running in same or different machines. If the source of a particular topic dies, that source's replica would start writing to the topic and listeners will continue to receive messages.
Most likely, the simplest solution is to have a Map where each topic is a different key. This will send the latest value for that topic to the consumers.
If you need to recorded every event, a Queue is likely to be a better choice. If you don't need to retain the data for long, you can use a very sort file rotation.

Queueing mechanism and Elasticsearch 1.4.0

I have a RabbitMQ broker, on which I post different messages that will end up as documents in Elasticsearch. There are multiple consumers from the broker, which are actually different threads in a task executor assigned to an amqp inbound gateway (using spring integration and spring amqp here).
Think at the following scenario: I have created a doc in ES with the structure
{
"field1" : "value1",
"field2" : "value2"
}
Afterwards I send two update requests, both updating the same field, let's say field1. If I send this messages one right after another(common use case in production), my consumer threads will fetch the messages in the right order(amqp allows this), but the processing could happen in the wrong order and the later updated value could be overwritten by the first one. I will end up having wring data.
How can I make sure my data won't get corrupted? =>Having 1 single consumer thread is not enough, because if I want to scale out by adding more machines with my consuming app, I will still end up having multiple consumers. I might need ordering of messages, but having multiple machines I will probably need to create some sort of a cluster aware component, I am using SI, so this seems really hard to do in my opinion.
In pre 1.2 versions of ES, we used an external version, like a timestamp, and ES would have thrown VersionConflictException in my scenario:first update would have had version 10000 let's say, the second 10001 and if the first would have been processed first, ES would reject the request with version 10000 as it's lower than the existing one. But from the latest versions, ES guys have removed this functionality for update operations.
One solution might be to use multiple queues and have a single consumer on each queue; use a hash function to always route updates to the same document to the same queue see the RabbitMQ Tutorials for the various options.
You can scale out by adding more queues (and changing your hash function).
For resiliency, consider running your consumers in Spring XD. You can have a single instance of each rabbit source (for each queue) and XD will take care of failing it over to another container node if it goes down.
Otherwise you could roll your own by having a warm standby - inbound adapters configured with auto-startup="false" and have something monitor and use a <control-bus/> to start a new instance if the active one goes down.
EDIT:
In response to the fourth comment below.
As I said above, to scale out, you would have to change the hash function. So adding consumers automatically while running would be tricky.
You don't have to hard-code the queue names in the jar, you can use a property placeholder and fill it from properties, system properties, or an environment variable.
This solution is the simplest but does have these limitations.
You could, however, build a management app that could scale it out - stop the producer, wait for all queues to quiesce, reconfigure the consumers and restart the producer - Spring Integration provides a <control-bus/> to start/stop adapters; you can also do it via JMX.
Alternative solutions are possible but will generally require maintaining some shared state across a cluster (perhaps using zookeeper etc), so are much more complex; and you still have to deal with race conditions (where the second update might arrive at some consumer before the first).
You can use the default mechanism for consistency checks. Basically you want to verify that you have the latest version of whatever you are updating.
So for that you need to fetch the _version with the object. In queries you can do this by setting version=true on the toplevel. That will cause the _version to be returned along with your query results. Then when doing an update, you simply set the version parameter in the url to the value you have and it will generate a version conflict if it doesn't match.
Nicer is to handle updates using closures. Basically this works as follows: have an update method that fetches the object by id, applies a closure (parameter to the update function) that encapsulate the modifications you want to make, and then stores modified object. If you trap the still possible version conflict, you can simply get the object again and re-apply the closure to the object. We do this and added a random sleep before the retry as well, this vastly reduces the chance of multiple updates failing and is a nice design pattern. Keeping the read and write together minimizes the chance of a conflict and then retrying with a sleep before that minimizes it further. You could add multiple retries to further reduce the risk.

Usecases for mapred.job.queue.name

What are the real world use cases on using map reduce job queues i.e. the value of mapred.job.queue.name property. I see default being used as the value always.
mapred.job.queue.name is what you use to assign a job to a particular queue. By default all jobs go to "default" queue. However, it is possible to create hierarchical queue. Like root, root.q1, root.q1.q1a and so on.
Each of these queues can have its own set of attributes to ensure certain priority.
A real world scenario will be when you have multiple stakeholder asking for reports on same set of infrastructure. For example, at my work place , we have the data scientist trying to run various kind of research job, the customer support team looking for various daily and weekly figures and then there are real jobs that supports the day to day business. It is at the heart that the infrastructure should be able to support the best it can.
Having various queues with different priorities just makes it easy for Hadoop to decide what to do next when a processor is available, or how much it can use.
So data scientist will assign to "Data Analyst" queue, marketing team will assign to "Marketing" queue. It is possible to change priority of a queue depending on time of the day.
The Map/Reduce system always supports atleast one queue with the name as default. Hence, this parameter's value should always contain the string default. Some job schedulers supported in Hadoop, like the Capacity Scheduler, support multiple queues. If such a scheduler is being used, the list of configured queue names must be specified here. Once queues are defined, users can submit jobs to a queue using the property name mapred.job.queue.name in the job configuration. There could be a separate configuration file for configuring properties of these queues that is managed by the scheduler. Refer to the documentation of the scheduler for information on the same.
Ref: http://hadoop.apache.org/docs/r0.19.1/cluster_setup.html

Resources