Within a cloud application I'm using NiFi (I'm a newbie) to work with data streams published by an MQTT broker. So far so good.
In the end I want to stream into InfluxDB. That's the point I'm struggling with.
Does anybody have experience with a processor for such a setup? Is there a suitable processor for writing data into InfluxDB?
Thanks a lot.
Kind regards,
T_F
There is a PutInfluxDB processor which accepts the incoming flowfile and writes its content to InfluxDB. The content is expected to be in InfluxDB 'line protocol'.
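Since the flowfile content has to be in line protocol, it may help to see what that text looks like. The following is a minimal sketch, not NiFi code: it assumes the MQTT payload has already been parsed into a measurement name, tags and fields, and the function names (`to_line_protocol` and its helpers) are made up for illustration.

```python
def _escape(text, chars):
    """Backslash-escape each character in `chars` that occurs in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value):
    """Render one field value using line protocol type conventions."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'  # strings are double-quoted


def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Build one line: measurement,tag=value field=value timestamp."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    line = _escape(measurement, ", ")
    for key in sorted(tags):
        line += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    line += " " + ",".join(
        _escape(key, ",= ") + "=" + _format_field(value)
        for key, value in fields.items()
    )
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


if __name__ == "__main__":
    print(to_line_protocol("temperature", {"sensor": "s1"}, {"value": 21.5}))
```

Whatever step sits between the MQTT source and PutInfluxDB in the flow would need to emit text of this shape, one point per line.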
I'm starting with Kafka and I need to monitor the inserts into a specific Oracle table and send the new records through Kafka as they arrive. I have no control over the database, so, in principle, Debezium is excluded. How can I do this without using triggers?
I've written a producer that reads data from Oracle with a Java program in Eclipse, but that would make constant requests to the database. I use Java to simulate an ETL with a consumer.
PS: I work with Windows but that's secondary.
If I understand your problem correctly, you are trying to route inserts from Kafka to an Oracle database. There are a few possibilities:
You implement a Kafka consumer, and as soon as your Kafka cluster gets a message the consumer makes an insert. You could reuse your Java code here; just remove the polling part. Please visit here
If you have Kafka deployed in a cloud environment and are using it as a service (AWS MSK), you have the option of handling the events there. Again, you can use a Java program or write a Python script to make the inserts. Please visit here
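As a rough illustration of the "consume, then insert" idea in Python, here is a sketch that takes any iterable of raw message payloads and a DB-API connection. Everything specific in it is an assumption: the JSON payload shape (`id`, `name`), the table name `events`, and the use of the standard-library `sqlite3` as a stand-in for Oracle so the example runs without a database server. With a real Kafka client the `messages` argument would be fed from the consumer's records, and the placeholder style in the SQL would have to match the Oracle driver.

```python
import json
import sqlite3


def insert_messages(messages, conn, sql):
    """Insert each decoded message as one row and commit once at the end.

    `messages` is any iterable of JSON payloads (bytes or str); with a Kafka
    client library it would be the values of the consumed records.
    """
    cursor = conn.cursor()
    count = 0
    for raw in messages:
        record = json.loads(raw)
        cursor.execute(sql, (record["id"], record["name"]))
        count += 1
    conn.commit()
    return count


if __name__ == "__main__":
    # sqlite3 stands in for Oracle here; table and columns are hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER, name TEXT)")
    payloads = [b'{"id": 1, "name": "first"}', b'{"id": 2, "name": "second"}']
    inserted = insert_messages(payloads, conn, "INSERT INTO events VALUES (?, ?)")
    print(inserted, "rows inserted")
```

Keeping the consumer loop separate from the insert logic makes the insert path easy to test without a running broker.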
I would like to understand your throughput requirements: whether you really need Kafka as a distributed messaging system, or whether a simple AWS SQS queue would work just fine. If you can use SQS, things would be straightforward for you. You create a queue and you write a listener in Python or Java. boto3 is an excellent Python library for working with SQS.
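A listener along those lines could look like the sketch below. It uses the boto3 SQS client calls `receive_message` and `delete_message`; the queue URL is a placeholder, and `handle` is a hypothetical callback where the database insert would go. boto3 is imported inside `main()` so that the polling logic can be exercised with a fake client.

```python
def poll_once(sqs, queue_url, handle):
    """Receive up to 10 messages, process each one, then delete it."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling: wait for messages instead of spinning
    )
    messages = response.get("Messages", [])  # key is absent when the queue is empty
    for message in messages:
        handle(message["Body"])
        # Delete only after successful handling, so failures are redelivered.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
    return len(messages)


def main():
    import boto3  # third-party dependency: pip install boto3

    sqs = boto3.client("sqs")
    queue_url = "https://sqs.example.invalid/123456789012/my-queue"  # placeholder
    while True:
        poll_once(sqs, queue_url, print)  # replace print with the real insert
```

Calling `main()` starts the loop. Long polling keeps the number of empty responses low, which addresses the "constant requests" concern on the queue side.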
We are receiving data as HTTP POST messages from a number of servers. We want to receive the messages, do some pre-processing and then write them to HDFS. What are the best options for operating on real-time data streams?
Some options I have read about: Flume, Kafka, Spark Streaming. How do I connect the pieces?
It's hard to say because the question is too general. I can briefly describe our pipeline because we do the exact same thing. We have a few NodeJS HTTP servers, which send all incoming requests to Kafka. Then we use Samza to preprocess the data. Samza reads the messages from Kafka and writes them back to Kafka (to another topic). Finally we use Camus to transfer data from Kafka to HDFS (Camus is deprecated by now). You can also use Kafka Connect to transfer data from Kafka to HDFS.
Both Samza and Kafka are (or were) LinkedIn projects, so it's easy to set up this architecture, and Samza takes advantage of some Kafka features.
The use case is this:
I have several Java applications running, each of which has to interact with a different, application-specific set of Elasticsearch indices. For instance, application A uses indices A, B and C to query and update. Application B uses indices A, C and D (say).
Some common interface is required which can manage all these data streams. Currently I'm evaluating Kafka and fluentd for this purpose.
Can someone explain which would be better suited to this situation? I've looked at the features of both Kafka and fluentd and I don't really understand what difference it would make here.
Thanks a lot.
kafka provides publish/subscribe messaging as a distributed commit log. Usually you install kafka on each host where you need to produce some data to be forwarded somewhere else and all those hosts will together form a cluster. The good thing here is that if for some reason network connectivity becomes unstable or goes down, your application can continue to produce data/logs and they won't be lost. Whereas if your application directly sends logs to some remote centralized logging host, you might lose some logs during the time the network goes down.
fluentd is a centralized log collector which is commonly installed on one host (or more if you need horizontal scaling). It connects to remote data sources, applies filtering and sends unified log data to remote data sinks.
From the fluentd docs, you can see that fluentd can consume data from kafka and produce data towards kafka as well. This alone should hint that fluentd and kafka are on different layers since the former uses the latter.
It would be more logical to compare fluentd and logstash actually. As far as fluentd is concerned, kafka is just another data source and/or data sink, but they are different beasts altogether.
If you want the best of both worlds, use kafka as input/output data pipes from/to your apps and fluentd (or logstash) as your centralized logging system reading from those kafka topics.
If you want to read more on the topic, you can read how fluentd and kafka complement each other very well; they are not competing against each other.
From: The Life Blood Of Your Data Pipeline
Kafka is primarily related to holding log data rather than moving log
data. Thus, Kafka producers need to write the code to put data in
Kafka, and Kafka consumers need to write the code to pull data out of
Kafka.
Fluentd has both input and output plugins for Kafka so that data
engineers can write less code to get data in and out of Kafka. We have
many users that use Fluentd as a Kafka producer and/or consumer.
I am trying to monitor the performance of Kafka spout for my project. I have used the KafkaSpout that is included in apache-storm-0.9.2-incubating release.
Is it possible to monitor the throughput of kafka spout using the kafka offset monitoring tool? Is there another, better way to monitor the spout?
Thanks,
Palak Shah
The latest Yahoo Kafka Manager has added metrics information, so you can see TPS, bytes in/out, etc.
https://github.com/yahoo/kafka-manager
We could not find any tool that provides the offset for all the consumers including the kafka-spout consumer. So, we ended up building one ourselves. You can get the tool from here:
https://github.com/Symantec/kafka-monitoring-tool
It might be of use to you.
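Whichever tool reports the offsets, throughput and lag can be derived from two offset samples taken some time apart. The sketch below only shows the arithmetic; the `OffsetSample` type and its field names are invented for illustration, and it covers a single partition (for a whole topic, sum the per-partition results).

```python
from dataclasses import dataclass


@dataclass
class OffsetSample:
    timestamp: float  # seconds, e.g. from time.time()
    committed: int    # offset the consumer (the spout) has committed
    log_end: int      # latest offset written to the partition


def throughput(first, second):
    """Messages consumed per second between two samples."""
    elapsed = second.timestamp - first.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    return (second.committed - first.committed) / elapsed


def lag(sample):
    """Messages written to the partition but not yet consumed."""
    return sample.log_end - sample.committed


if __name__ == "__main__":
    earlier = OffsetSample(timestamp=0.0, committed=100, log_end=150)
    later = OffsetSample(timestamp=10.0, committed=600, log_end=620)
    print(throughput(earlier, later), lag(later))
```

A lag that keeps growing while throughput stays flat suggests the spout cannot keep up with the producers.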
Apache Kafka: Distributed messaging system
Apache Storm: Real Time Message Processing
How can we use both technologies in a real-time data pipeline for processing event data?
In terms of a real-time data pipeline, both seem to me to do an identical job. How can we use both technologies in one data pipeline?
You use Apache Kafka as a distributed and robust queue that can handle high-volume data and enables you to pass messages from one endpoint to another.
Storm is not a queue. It is a system with distributed real-time processing abilities, meaning you can execute all kinds of manipulations on real-time data in parallel.
The common flow of these tools (as I know it) goes as follows:
real-time-system --> Kafka --> Storm --> NoSql --> BI(optional)
So you have your real-time app handling high-volume data and sending it to a Kafka queue. Storm pulls the data from Kafka and applies the required manipulation. At this point you usually want to get some benefit from this data, so you either send it to some NoSQL database for additional BI calculations, or you simply query this NoSQL database from any other system.
I know that this is an older thread, and the comparisons of Apache Kafka and Storm were valid and correct when they were written. It is worth noting, though, that Apache Kafka has evolved a lot over the years. Since version 0.10 (May 2016) Kafka has included the Kafka Streams API, which provides stream processing capabilities without the need for any additional software such as Storm. Kafka also includes the Connect API for connecting to various sources and sinks (destinations) of data.
Announcement blog - https://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/
Current Apache documentation - https://kafka.apache.org/documentation/streams/
In Kafka 0.11 the stream processing functionality was further expanded to provide exactly-once semantics and transactions.
https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/
Kafka and Storm have slightly different purposes:
Kafka is a distributed message broker which can handle a large number of messages per second. It uses the publish-subscribe paradigm and relies on topics and partitions. Kafka uses Zookeeper to share and save state between brokers. So Kafka is basically responsible for transferring messages from one machine to another.
Storm is a scalable, fault-tolerant, real-time analytics system (think of Hadoop in real time). It consumes data from sources (Spouts) and passes it to processing steps (Bolts). You combine them in a topology. So Storm is basically a computation unit (aggregation, machine learning).
But you can use them together: for example, your application uses Kafka to send data to other servers, which use Storm to run some computation on it.
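To make the spout and bolt roles concrete, here is a toy word-count pipeline written with plain Python generators. It is only an illustration of the concept and does not use the Storm API: all function names are made up, and a real topology would run these steps distributed and in parallel, typically with a Kafka-backed spout as the source.

```python
from collections import Counter


def sentence_spout(sentences):
    """Source of the stream; in practice this would read from Kafka."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """First processing step: split each sentence into lowercase words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream):
    """Second processing step: aggregate a count per word."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts


def run_topology(sentences):
    """Wire spout -> split bolt -> count bolt, like a (tiny) topology."""
    return count_bolt(split_bolt(sentence_spout(sentences)))


if __name__ == "__main__":
    print(run_topology(["Kafka moves messages", "Storm processes messages"]))
```

The split between the source and the processing steps mirrors the division of labour described above: the broker delivers the stream, the topology computes on it.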
This is how it works
Kafka - To provide a realtime stream
Storm - To perform some operations on that stream
You might take a look at the GitHub project https://github.com/abhishekgoel137/kafka-nodejs-d3js.
(D3js is a graph-representation library)
Ideal case:
Realtime application -> Kafka -> Storm -> NoSQL -> d3js
This repository is based on:
Realtime application -> Kafka -> <plain Node.js> -> NoSQL -> d3js
As everyone has explained:
Apache Kafka: a continuous messaging queue
Apache Storm: a continuous processing tool
In this setup Kafka gets the data from any website, such as FB or Twitter, through their APIs; that data is processed with Apache Storm, and you can store the processed data in any database you like.
https://github.com/miguno/kafka-storm-starter
Just follow it and you will get some idea.
When I have a use case that requires me to visualize or alert on patterns (think of Twitter trends) while continuing to process the events, I have several patterns to choose from.
NiFi would allow me to process an event and update a persistent data store with low(er) batch aggregation with very, very little custom coding.
Storm (lots of custom coding) allows me nearly real time access to the trending events.
If I can wait for many seconds, then I can batch out of kafka, into hdfs (Parquet) and process.
If I need to know in seconds, I need NiFi, and probably even Storm. (Think of monitoring thousands of earth stations, where I need to see small region weather conditions for tornado warnings).
Simply put, Kafka sends the messages from one node to another, and Storm processes the messages. Check this example of how you can integrate Apache Kafka with Storm.