kafka-connect-elasticsearch: storing messages in the format of a predefined index - elasticsearch

Example:
{"id":"1","firstName":"abc","lastName":"xyz","dob":"12/09/1995","age":"23"}
This is the message structure in the Kafka topic, but I want to index it in Elasticsearch as below:
{"id":"1","name":{"firstName":"abc","lastName":"xyz"},"dob":"12/09/1995","age":"23"}
How can I achieve this?

Two options:
Stream processing against the data in the Kafka topic. Using Kafka Streams you could wrangle the data model as required. KSQL can do the inverse (flattening nested fields) but doesn't support creating STRUCTs yet. Other stream processing options would be Flink, Spark Streaming, etc.
Modify the data as it passes through Kafka Connect, using a Single Message Transform (SMT). There's no pre-built transform that does this, but you could write one using the API.
Disclaimer: I work for Confluent, the company behind the open-source KSQL project, which also contributes to Kafka Streams, Kafka Connect, etc.
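Whichever option you pick, the reshaping logic itself is small. Here is a minimal sketch of it in plain Python, just to illustrate what a stream processor or a custom transform would have to do per message. The function name nest_fields and its parameters are hypothetical, not part of any Kafka API, and the sketch assumes the message value is a flat JSON object.

```python
import json


def nest_fields(record, target, fields):
    """Return a copy of `record` with `fields` grouped under a new `target` key.

    The nested object is placed where the first of the grouped fields
    appeared, so the remaining keys keep their original order.
    """
    nested = {name: record[name] for name in fields if name in record}
    result = {}
    placed = False
    for key, value in record.items():
        if key in fields:
            if not placed:
                result[target] = nested
                placed = True
        else:
            result[key] = value
    return result


# The message from the question, as it sits in the Kafka topic.
source = '{"id":"1","firstName":"abc","lastName":"xyz","dob":"12/09/1995","age":"23"}'
reshaped = nest_fields(json.loads(source), "name", ("firstName", "lastName"))
reshaped_json = json.dumps(reshaped, separators=(",", ":"))
```

In a real pipeline the same per-record logic would live inside a Kafka Streams mapValues step or a custom SMT written in Java; the Python here only shows the shape of the transformation.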

Related

Choosing between DynamoDB Streams vs. Kinesis Streams for IoT Sensor data

I have a fleet of 250 Wifi-enabled IoT sensors streaming weight data. Each device samples once per second. I am requesting help choosing between AWS DynamoDB Streams and AWS Kinesis Streams to store and process this data in real time. Here are some additional requirements:
I need to keep all raw data in a SQL-accessible table.
I also need to clean the raw stream data with Python's Pandas library to recognize device-level events based on weight changes (e.g. if the weight of sensor #1 increases, record it as "sensor #1 increased by x lbs # XX:XX PM"; if there is no change, do nothing).
I need that change-event data (interpreted with the library from the raw data streams) to be accessible in a real-time dashboard (e.g. device #1's weight just went to zero, prompting an employee to refill container #1).
Either DDB Streams or Kinesis Streams can support Lambda functions, which is what I'll use for the data cleaning, but I've read the documentation and comparison articles and can't distinguish which is best for my use case. Cost is not a key consideration. Thanks in advance!!
Unfortunately, I think you will need a few pieces of infrastructure for a full solution.
I think you could use Kinesis and Firehose to write to a database, storing the raw data in a way that can be queried with SQL.
For the data cleaning step, I think you will need a stateful stream processor like Flink or Bytewax. The transformed data can then be written to a real-time database or back to Kinesis so that it can be consumed in a dashboard.
DynamoDB Streams works only with DynamoDB: it streams row changes to be picked up by downstream services like Lambda. You mentioned that you want the data stored in a SQL database, and DynamoDB is a NoSQL database, so you can exclude that service.
I'm not sure why you want the data in a SQL database. If it is time-series data, you would probably store it in a time-series database like Timestream.
If you are using AWS IoT Core to send data over MQTT to AWS, you can forward those messages to Kinesis Data Streams (or SQS). You can then have a Lambda function triggered by the messages received in Kinesis. This function can process the data and store it in the database you want.
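To make the "stateful" part concrete, here is a rough sketch of the change-detection step from the question. It uses only the Python standard library rather than Pandas to stay self-contained, and the function name detect_changes, the (device_id, weight, timestamp) tuple layout, and the threshold parameter are all assumptions for illustration. The key point is that the last weight seen per device has to be remembered between readings.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical reading layout: (device_id, weight_lbs, timestamp)
Reading = Tuple[str, float, str]


def detect_changes(
    readings: Iterable[Reading],
    last_seen: Dict[str, float],
    threshold: float = 0.0,
) -> List[str]:
    """Emit one event string per reading whose weight differs from the
    previous reading of the same device by more than `threshold`.

    `last_seen` holds the per-device state and is updated in place.
    """
    events = []
    for device_id, weight, timestamp in readings:
        previous = last_seen.get(device_id)
        last_seen[device_id] = weight
        if previous is None:
            continue  # first reading for this device: nothing to compare
        delta = weight - previous
        if abs(delta) > threshold:
            direction = "increased" if delta > 0 else "decreased"
            events.append(
                f"sensor #{device_id} {direction} by {abs(delta):g} lbs at {timestamp}"
            )
    return events


state: Dict[str, float] = {}
batch = [
    ("1", 10.0, "12:00:00"),
    ("1", 10.0, "12:00:01"),  # no change, so no event
    ("1", 12.5, "12:00:02"),
    ("2", 5.0, "12:00:02"),
    ("1", 0.0, "12:00:03"),
]
events = detect_changes(batch, state)
```

In a Lambda function the `state` dictionary would not survive between invocations, so it would have to live in an external store; a stateful stream processor manages that for you, which is why it was suggested above.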

How to ingest CDC events produced by Oracle CDC Source Connector into Snowflake

Our current pipeline is following a structure similar to the one outlined here, except we are pulling events from Oracle and pushing them to Snowflake. The flow goes something like this:
Confluent Oracle CDC Source Connector mining the Oracle transaction log
Pushing these change events to a Kafka topic
Snowflake Sink Connector reading off the Kafka topic and loading raw messages into a Snowflake table.
In the end I have a table with record_metadata and record_content fields that contain the raw Kafka messages.
I'm having to build a set of procedures that handle the merge/upsert logic operating on a stream on top of the raw table. The tables I'm trying to replicate in Snowflake are very wide and there are around 100 of them, so writing the SQL merge statements by hand is unfeasible.
Is there a better way to ingest the Kafka topic containing all of the CDC events generated from the Oracle connector straight into Snowflake, handling auto-creating nonexistent tables, auto-updating/deleting/etc as events come across the stream?

Writing multiple entries from a single message in Kafka Connect

If on one topic I receive messages in some format which represent a list of identical structs (e.g. a JSON list or a repeated field in protobuf) could I configure Kafka Connect to write each entry in the list as a separate row (say in a parquet file in HDFS, or in a SQL database)? Is this possible using only the bundled converters/connectors?
I.e. can I use each Kafka message to represent thousands of records, rather than sending thousands of individual messages?
What would be a straightforward way to achieve this with Kafka Connect?
The bundled message transforms are only capable of making one-to-one message manipulations. Therefore, you would have to explicitly produce those flattened lists in some way (directly, or via a stream processing application) if you wanted Connect to write it out as separate records.
Or, if applicable, you can use Hive or Spark to expand that list as well for later processing.
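As a rough illustration of the "explicitly produce those flattened lists" step, here is a minimal sketch in plain Python of the one-to-many expansion. It assumes the message value is a JSON array; the function name explode is hypothetical, and a real implementation would sit in a producer or stream processing application that writes each yielded value to an output topic for Connect to read.

```python
import json
from typing import Iterator


def explode(message_value: str) -> Iterator[str]:
    """Yield one JSON-encoded record per entry of a JSON list message.

    A message that is not a list is passed through as a single record.
    """
    entries = json.loads(message_value)
    if not isinstance(entries, list):
        entries = [entries]
    for entry in entries:
        yield json.dumps(entry)


# One incoming message carrying several identical structs.
incoming = '[{"id": 1}, {"id": 2}]'
records = list(explode(incoming))
```

Each element of `records` would then be sent as its own Kafka message, so the sink connector writes one row per entry.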

Kafka Connect - Modifying records before writing into sink

I have installed Kafka Connect using confluent-4.0.0.
Using the HDFS connector I am able to save Avro records received from a Kafka topic to Hive.
I would like to know if there is any way to modify the records before writing into the HDFS sink.
My requirement is to make small modifications to the values of the record, for example performing arithmetic operations on integers or manipulating strings.
Please suggest if there is any way to achieve this.
You have several options.
Single Message Transforms, which you can see in action here. Great for light-weight changes as messages pass through Connect. Configuration-file based, and extensible using the provided API if there's not an existing transform that does what you want.
See the discussion here on when SMT are suitable for a given requirement.
KSQL is a streaming SQL engine for Kafka. You can use it to modify your streams of data before sending them to HDFS. See this example here.
KSQL is built on the Kafka Streams API, which is a Java library and gives you the power to transform your data as much as you'd like. Here's an example.
Take a look at Kafka Connect transforms [1] & [2]. You can build a custom transform library and use it in a connector.
[1] http://kafka.apache.org/documentation.html#connect_transforms
[2] https://cwiki.apache.org/confluence/display/KAFKA/KIP-66%3A+Single+Message+Transforms+for+Kafka+Connect

Big Data ingestion - Flafka use cases

I have seen that the Big Data community is very keen on using Flafka in many ways for data ingestion, but I haven't really understood why yet.
A simple example I have developed to better understand this is to ingest Twitter data and move it to multiple sinks (HDFS, Storm, HBase).
I have done the implementation for the ingestion part in the following two ways:
(1) Plain Kafka Java Producer with multiple consumers (2) Flume agent #1 (Twitter source + Kafka sink) | (potential) Flume agent #2 (Kafka source + multiple sinks). I haven't really seen any difference in the complexity of developing either of these solutions (this is not a production system, so I can't comment on performance). The only thing I found online is that a good use case for Flafka would be data from multiple sources that needs aggregating in one place before being consumed in different places.
Can someone explain why I would use Flume+Kafka over plain Kafka or plain Flume?
People usually combine Flume and Kafka because Flume has a great (and battle-tested) set of connectors (HDFS, Twitter, HBase, etc.) and Kafka brings resilience. Kafka also helps distribute Flume events between nodes.
EDIT:
Kafka replicates the log for each topic's partitions across a
configurable number of servers (you can set this replication factor on
a topic-by-topic basis). This allows automatic failover to these
replicas when a server in the cluster fails so messages remain
available in the presence of failures. -- https://kafka.apache.org/documentation#replication
Thus, as soon as Flume gets the message to Kafka, you have a guarantee that your data won't be lost. NB: you can integrate Kafka with Flume at every stage of your ingestion (i.e. Kafka can be used as a source, channel and sink, too).
