Logstash or not Logstash for Kafka-Elasticsearch integration?

I read that Elasticsearch rivers/river plugins are deprecated, so we cannot have direct Elasticsearch-Kafka integration. If we want this, we need some Java (or other language) layer in between that takes the data from Kafka and puts it into Elasticsearch using its APIs.
On the other hand, if we have Kafka-Logstash-Elasticsearch, then we get rid of the above middle layer and achieve the same thing through Logstash with just configuration. But I am not sure whether having Logstash in between adds overhead.
And is my understanding right?
Thanks in advance for the inputs.
Regards,
Priya

Your question is quite general. It would be good to understand your architecture, its purpose, and the assumptions you have made.
Kafka, as stated in its documentation, is a massively scalable publish-subscribe messaging system. My assumption would be that you use it as a data broker in your architecture.
Elasticsearch, on the other hand, is a search engine, hence I assume that you use it as a data access/searching/aggregation layer.
These two separate systems require connectors to create a proper data pipeline. That's where Logstash comes in. It allows you to create a streaming data connection between, in your case, Kafka and Elasticsearch, and it also lets you mutate the data on the fly, depending on your needs.
Typically, Kafka carries raw data events, while Elasticsearch stores documents that are useful to your data consumers (web or mobile applications, other systems, etc.) and can therefore look quite different from the raw data format. If you need to transform the data between its raw form and the ES document, that is where Logstash comes in handy (see the filter stage).
Another approach could be to use Kafka Connect, or to build custom tools based on, for example, Kafka Streams or plain consumers, but it really depends on your architecture: its purpose, stack, data requirements and more.
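If you went the custom route instead, the "middle layer" mentioned in the question does not have to be large. Here is a minimal, illustrative sketch in Java (not a production design), assuming the standard Kafka consumer client and the Elasticsearch low-level REST client; the broker address, topic name and index name are placeholders:

```java
import org.apache.http.HttpHost;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class KafkaToElasticsearch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "es-indexer");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             RestClient es = RestClient.builder(
                     new HttpHost("localhost", 9200, "http")).build()) {

            consumer.subscribe(Collections.singletonList("events"));   // placeholder topic

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Assumes the Kafka message value is already a JSON document.
                    Request index = new Request("POST", "/events/_doc");   // placeholder index
                    index.setJsonEntity(record.value());
                    es.performRequest(index);
                }
            }
        }
    }
}
```

Everything this loop does (plus batching, retries and transformation) is what the Kafka input, filters and Elasticsearch output of a Logstash pipeline give you through configuration alone, which is usually the deciding factor.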

Related

Apache Flink relating/caching data options

This is a very broad question. I'm new to Flink and am looking into the possibility of using it as a replacement for our current analytics engine.
The scenario: data is collected from various pieces of equipment and received as a JSON-encoded string with the format {"location.attribute": value, "TimeStamp": value}.
For example, a unitary traceability code is received for a location, after which various process parameters arrive in a real-time stream. The analysis is to be run over the process parameters; however, the output needs to include a relation to the traceability code, for example {"location.alarm": value, "location.traceability": value, "TimeStamp": value}.
What method does Flink use for caching values (in this case the current traceability code) whilst running analysis over other parameters received at a later time?
I'm mainly just looking for the area to research, as so far I've been unable to find any examples of this kind of scenario. Perhaps it's not the kind of process that Flink can handle.
A natural way to do this sort of thing with Flink would be to key the stream by the location, and then use keyed state in a ProcessFunction (or RichFlatMapFunction) to store the partial results until ready to emit the output.
With a keyed stream, you are guaranteed that every event with the same key will be processed by the same instance. You can then use keyed state, which is effectively a sharded key/value store, to store per-key information.
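As a rough sketch (not a drop-in solution), a KeyedProcessFunction could keep the latest traceability code per location in ValueState and attach it to subsequent parameter events. The Event and EnrichedEvent types and their field names below are assumptions about your data model:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class TraceabilityEnricher extends
        KeyedProcessFunction<String, TraceabilityEnricher.Event, TraceabilityEnricher.EnrichedEvent> {

    // Hypothetical event types, assumed from the JSON format in the question.
    public static class Event {
        public String location;    // used as the key
        public String attribute;   // e.g. "alarm" or "traceability"
        public String value;
        public long timestamp;
    }

    public static class EnrichedEvent {
        public Event event;
        public String traceabilityCode;
        public EnrichedEvent(Event event, String traceabilityCode) {
            this.event = event;
            this.traceabilityCode = traceabilityCode;
        }
    }

    private transient ValueState<String> traceabilityCode;

    @Override
    public void open(Configuration parameters) {
        // Keyed state: one value per location key, managed and checkpointed by Flink.
        traceabilityCode = getRuntimeContext().getState(
                new ValueStateDescriptor<>("traceability", String.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<EnrichedEvent> out)
            throws Exception {
        if ("traceability".equals(event.attribute)) {
            // Remember the latest traceability code seen for this location.
            traceabilityCode.update(event.value);
        } else {
            // Emit the process parameter together with the most recent code.
            out.collect(new EnrichedEvent(event, traceabilityCode.value()));
        }
    }
}
```

The stream would be keyed by location before this function is applied, e.g. events.keyBy(e -> e.location).process(new TraceabilityEnricher()).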
The Apache Flink training includes some explanatory material on keyed streams and working with keyed state, as well as an exercise or two that explore how to use these mechanisms to do roughly what you need.
Alternatively, you could do this with the Table or SQL API, and implement this as a join of the stream with itself.

How to route based on content with high performance?

In NiFi, I am listening to Kafka on a single topic, and based on the routing logic it should call the respective process group.
However, if I give the RouteOnContent processor a regular expression to check for the occurrence of a string, will that hurt performance? How can I achieve good performance when routing based on a condition?
Would it be more efficient to do the split at the KSQL/stream-processing level into different topics, and have NiFi read from those different topics?
Running a regex on the content of each message is an inefficient approach; consider whether you can change your approach to one of the following:
Have your producers write the necessary metadata into a Kafka header, which lets you use the much more efficient RouteOnAttribute processor in NiFi. This is still message-at-a-time, which has throughput limitations.
If your messages conform to a schema, use the more efficient record-based Kafka processors in NiFi (e.g. ConsumeKafkaRecord) with a QueryRecord approach, which will significantly boost throughput.
If you cannot modify the source data and the regex logic is involved, it may be more efficient to use a small Kafka Streams app to split the topic before processing the data further downstream, as sketched below.
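If that last option fits your constraints, the Kafka Streams application can be quite small. A minimal sketch, assuming String values, a hypothetical "ALARM" content check, and placeholder topic names:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;
import java.util.regex.Pattern;

public class TopicSplitter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topic-splitter");     // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // The content check is done once here, instead of per message inside NiFi.
        Pattern alarmPattern = Pattern.compile("ALARM");   // hypothetical routing condition

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("raw-events");        // placeholder topic

        // Route matching messages to one topic and everything else to another.
        source.filter((key, value) -> alarmPattern.matcher(value).find())
              .to("raw-events-alarms");
        source.filterNot((key, value) -> alarmPattern.matcher(value).find())
              .to("raw-events-other");

        new KafkaStreams(builder.build(), props).start();
    }
}
```

NiFi would then consume each derived topic with a plain ConsumeKafka processor and no content inspection at all.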

Is there added value in logging JSON data to Elasticsearch via Logstash?

I am reviewing my logging strategy (logs from the OS and applications which log to syslog, and from my own applications where I can freely decide what and where to log). Not having a lot of experience with Logstash, I was wondering whether there is added value in sending JSON data through it (as opposed to sending it directly to Elasticsearch).
The only advantage I could think of is that logging could consistently go to stdout (and then be picked up by syslog) and be consistently sent on to Logstash (as syslog) to be analyzed there (Logstash would know, for instance, that data from the application myapp.py is raw JSON).
Are there other advantages to using Logstash as an intermediary? (Security aspects are not important in this context.)
There are a couple of advantages to using Logstash even when your data is already a JSON object.
For example:
HTTP compression: when Logstash outputs to Elasticsearch, you have the option to use HTTP compression, which greatly reduces the size of the requests and the bandwidth used.
Persistent queue: Logstash allows you to queue events, in memory or on disk, so they are kept when it cannot connect to Elasticsearch for some reason.
Data manipulation: you can use filters to change and enrich your data; for example, you can remove and add fields, rename fields, apply a geoip filter to an IP field, and so on.

What is the purpose of data provenance in Apache NiFi processors?

For every processor there is a way to configure it, and there is a context menu option to view data provenance.
Is there a good explanation of what data provenance is?
Data provenance is all about understanding the origin and attribution of data. In a typical system you get 'logs'. When you consider data flowing through a series of processes and queues, you end up with a lot of logs, of course. If you want to follow the path a given piece of data took, how long it took to travel that path, or what happened to an object that got split up into different objects, and so on, all of that is really time consuming and tough. The provenance that NiFi supports is like logging on steroids: it is all about keeping and tracking the relationships between data and the events that shaped and impacted what happened to it. NiFi keeps track of where each piece of data comes from, what it learned about the data, maintains the trail across splits, joins and transformations, records where it sends the data, and ultimately when it drops the data. Think of it like a chain of custody for data.
This is really valuable for a few reasons. First, understanding and debugging. Having this provenance capture means that from a given event you can go forwards or backwards in the flow to see where data came from and where it went. Given that NiFi also has an immutable, versioned content store under the covers, you can use this to click directly through to the content at each stage of the flow, and you can replay the content and context of a given event against the latest flow. This in turn means much faster iteration toward the configuration and results you want. The provenance model is also valuable for compliance reasons: you can prove whether or not you sent data to the correct systems. If you learn that you didn't, you then have data with which you can address the issue, or you can create a powerful audit trail for follow-up.
The provenance model in Apache NiFi is really powerful, and it is being extended to Apache MiNiFi, a subproject of Apache NiFi, as well. More systems producing more provenance means a far stronger ability to track data from end to end. Of course this becomes even more powerful when it can be combined with other lineage systems or centralized lineage stores. Apache Atlas may be a great system to integrate with to bring a centralized view; NiFi is able not only to do what I described above but also to send these events to such a central store. So, exciting times ahead for this.
Hope that helps.

ETL, Esper, or Drools?

The question environment relates to Java EE and Spring.
I am developing a system which can start and stop arbitrary TCP (or other) listeners for incoming messages. There could be a need to authenticate these messages. These messages need to be parsed and stored in some other entities. These entities model which fields they store.
So, for example, if I have property1 with two text fields, FillLevel1 and FillLevel2, I could receive messages over TCP which have both fill levels specified in text as F1=100;F2=90.
Later I could add another field, say FillLevel3, when I start receiving messages like F1=xx;F2=xx;F3=xx. But this is a conscious decision on the part of the system modeler.
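To make the message format concrete, here is a minimal parsing sketch in Java (the separators are taken from the example above; in the real system these rules would need to be user-configurable, which is the crux of the question):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MessageParser {

    // Parses a raw message such as "F1=100;F2=90" into field/value pairs.
    public static Map<String, Integer> parse(String raw) {
        Map<String, Integer> fields = new LinkedHashMap<>();
        for (String pair : raw.split(";")) {
            String[] kv = pair.split("=", 2);
            fields.put(kv[0].trim(), Integer.valueOf(kv[1].trim()));
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("F1=100;F2=90"));   // prints {F1=100, F2=90}
    }
}
```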
My question is: what do you think is better to use for parsing and storing the messages? One option is ETL (using Pentaho, which is used in another system), where you store the raw messages and use a task executor to consume them one by one and store the transformed messages as per your rules.
One could use Esper or Drools to do the same thing, storing rules and executing them with a timer, but I am not sure how dynamic you can get when making rules (they have to be made by the end user in a running system, preferably in the most user-friendly way, i.e. no scripts or code, only a GUI).
The end user should be capable of changing the parse rules. It is also possible that the end user might want to change the archived data as well (for example, if a new FillLevel field is added, one might want to put FillLevel=-99 in the previous records to make the data consistent).
Please ask for explanations; I have the feeling that I need to revise this question a bit.
Thanks
Well, Esper is a great CEP engine, but Drools has its own CEP implementation, Drools Fusion, which integrates really well with jBPM. That would be a good choice.
