How to use AWS Kinesis streams for multiple different data sources - spark-streaming

We have a traditional batch application where we ingest data from multiple sources (Oracle, Salesforce, FTP files, web logs, etc.). We store the incoming data in an S3 bucket and run Spark on EMR to process the data and load it into S3 and Redshift.
Now we are thinking of making this application near real time by bringing in AWS Kinesis and then using Spark Structured Streaming on EMR to process the streaming data and load it into S3 and Redshift. Given that we have a wide variety of data, e.g. 100+ tables from Oracle, 100+ Salesforce objects, 20+ files coming from an FTP location, web logs, etc., what is the best way to use AWS Kinesis here?
1) Use a separate stream for each source (Salesforce, Oracle, FTP) and a separate shard (within a stream) for each table/object - each consumer reads from its own shard, which holds a particular table/file.
2) Use a separate stream for each table/object - we would end up with 500+ streams in this scenario.
3) Use a single stream for everything - not sure how the consumer app would read the data in this scenario.

Kinesis does not care what data you put into a stream; data is just a blob to Kinesis. It is up to you to write (code) the producers and consumers for a stream. You could intermix different types of data in one stream, but the consumer would then need to figure out what each blob is and what to do with it.
I would break this into multiple streams based on data type and priority of the data. This will make implementation and debugging a lot easier.
I think you are misunderstanding what shards are: they are units of throughput and parallelism (records are routed to them by hashing the partition key), not a mechanism for separating data by type.
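If you do intermix sources in a stream (or simply want consumers to know what they are reading), a common pattern is to wrap each record in a small envelope that names the source and table. A minimal sketch with boto3, where the stream name, field names, and routing branches are assumptions for illustration:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def put_record(source, table, payload, stream_name="ingest-stream"):  # stream name is hypothetical
    """Wrap the payload in an envelope so consumers can route by source/table."""
    envelope = {"source": source, "table": table, "data": payload}
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(envelope).encode("utf-8"),
        # The partition key only influences shard routing (throughput), not data separation.
        PartitionKey=f"{source}.{table}",
    )

def route(record_blob):
    """Consumer side: decode the envelope and dispatch on its type."""
    envelope = json.loads(record_blob)
    if envelope["source"] == "oracle":
        ...  # handle Oracle tables
    elif envelope["source"] == "salesforce":
        ...  # handle Salesforce objects
    else:
        ...  # FTP files, web logs, etc.
```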

Related

Choosing between DynamoDB Streams vs. Kinesis Streams for IoT Sensor data

I have a fleet of 250 Wi-Fi-enabled IoT sensors streaming weight data. Each device samples once per second. I am requesting help choosing between AWS DynamoDB Streams and AWS Kinesis Streams to store and process this data in real time. Here are some additional requirements:
I need to keep all raw data in a SQL-accessible table.
I also need to clean the raw stream data with Python's Pandas library to recognize device-level events based on weight changes (e.g. if the weight on sensor #1 increases, record "sensor #1 increased by x lbs at XX:XX PM"; if there is no change, do nothing).
I need that change-event data (derived from the raw data streams with that library) to be accessible in a real-time dashboard (e.g. device #1's weight just went to zero, prompting an employee to refill container #1).
Either DDB Streams or Kinesis Streams can support Lambda functions, which is what I'll use for the data cleaning, but I've read the documentation and comparison articles and can't distinguish which is best for my use case. Cost is not a key consideration. Thanks in advance!!
Unfortunately, I think you will need a few pieces of infrastructure for a full solution.
I think you could use Kinesis and Firehose to write to a database, to store the raw data in a way that can be queried with SQL.
For the data cleaning step, I think you will need a stateful stream processor like Flink or Bytewax; the transformed data can then be written to a real-time database, or back to Kinesis, so that it can be consumed by a dashboard.
DynamoDB Streams works with DynamoDB: it streams row changes to be picked up by downstream services like Lambda. You mentioned that you want the data stored in a SQL database, and DynamoDB is a NoSQL database, so you can exclude that service.
I am not sure why you want the data in a SQL database. If it is time-series data, you would probably store it in a time-series database like Timestream.
If you are using AWS IoT Core to send data over MQTT to AWS, you can forward those messages to a Kinesis Data Stream (or SQS). Then you can have a Lambda triggered on messages received from Kinesis; this Lambda can process the data and store it in the database you want.
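As a rough illustration of that Lambda step, here is a sketch of a Kinesis-triggered handler that detects weight-change events. The record field names and the 0.5 lb threshold are assumptions, and the in-memory state is only a stand-in for the stateful processing discussed in the first answer:

```python
import base64
import json

# Previous weight per device. This only survives within a warm Lambda container,
# so a real deployment would keep this state in an external store
# (or use a stateful stream processor, as suggested above).
last_seen = {}

def handler(event, context):
    """Triggered by a Kinesis event source mapping on the raw sensor stream."""
    change_events = []
    for rec in event["Records"]:
        # Kinesis delivers the record payload base64-encoded.
        payload = json.loads(base64.b64decode(rec["kinesis"]["data"]))
        device = payload["device_id"]           # assumed field names
        weight = float(payload["weight_lbs"])
        ts = payload["timestamp"]
        previous = last_seen.get(device)
        if previous is not None and abs(weight - previous) > 0.5:  # assumed threshold
            change_events.append({
                "device_id": device,
                "timestamp": ts,
                "delta_lbs": round(weight - previous, 2),
            })
        last_seen[device] = weight
    # In practice, write change_events to whichever store backs the dashboard
    # (see the discussion above); they are returned here just for illustration.
    return change_events
```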

Writing multiple entries from a single message in Kafka Connect

If, on one topic, I receive messages in some format which represents a list of identical structs (e.g. a JSON list or a repeated field in protobuf), could I configure Kafka Connect to write each entry in the list as a separate row (say, in a Parquet file in HDFS, or in a SQL database)? Is this possible using only the bundled converters/connectors?
I.e. can I use each Kafka message to represent thousands of records, rather than sending thousands of individual messages?
What would be a straightforward way to achieve this with Kafka Connect?
The bundled message transforms are only capable of one-to-one message manipulations. Therefore, you would have to explicitly produce those flattened lists in some way (directly, or via a stream-processing application) if you want Connect to write them out as separate records.
Or, if applicable, you can use Hive or Spark to expand that list for later processing.
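For example, a small stream-processing step (sketched here with the kafka-python client; the topic names are made up) could fan each list-valued message out into individual messages, which a sink connector then writes one row at a time:

```python
import json
from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

consumer = KafkaConsumer(
    "batched-records",                 # hypothetical input topic carrying JSON lists
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Fan out: one inbound message containing a list becomes N outbound messages,
# which a sink connector (HDFS, JDBC, ...) can then write as N separate rows.
for message in consumer:
    for entry in message.value:
        producer.send("flattened-records", entry)  # hypothetical output topic
```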

Read data from Amazon sqs and write to s3 in Parquet format

We have a use case where data for messages from different users is sent to SQS, and we, as a data team, want to subscribe to that queue and put the data into S3, partitioned by time, so that we can do analysis on top of it.
What is the best way to consume those messages and write them to S3?
Something that I have in mind is using an AWS Lambda to put those messages into Firehose, using Firehose as a buffer, and once data is available for a specific time period (let's say an hour), have Firehose write it to S3 in Parquet format.
Is there any other solution? Maybe using AWS Glue or Data Pipeline?
AWS Kinesis Firehose now supports JSON to Parquet (or ORC) conversion in a serverless manner - see the details here: https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html
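A minimal sketch of the Lambda-to-Firehose leg described above; the delivery stream name and the assumption that message bodies are JSON are illustrative, and the Parquet conversion plus the hourly buffering are configured on the Firehose delivery stream itself:

```python
import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM = "sqs-to-s3-parquet"  # hypothetical delivery stream name

def handler(event, context):
    """Triggered by an SQS event source mapping; forwards each message body to Firehose."""
    records = [
        {"Data": (msg["body"] + "\n").encode("utf-8")}  # assumes the body is a JSON document
        for msg in event["Records"]
    ]
    if records:
        # Firehose buffers these and, with record format conversion enabled,
        # writes them to S3 as Parquet on its buffering interval.
        firehose.put_record_batch(
            DeliveryStreamName=DELIVERY_STREAM,
            Records=records,  # up to 500 records per call
        )
```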

Import small stream in Impala

We are currently on a Big Data project.
The Big Data platform is Cloudera Hadoop.
As input to our system we have a small stream of data, which we collect via Kafka (approximately 80 MB/h, continuously).
Then the messages are stored in HDFS to be queried via Impala.
Our client does not want to separate the hot data from the cold data: after 5 minutes, the data must be accessible alongside the historical (cold) data, so we chose to have a single database.
To insert the data, we use the JDBC connector provided by the Impala API (e.g. INSERT INTO ...).
We are aware that this is not the recommended solution: each Impala insert creates a small file (<10 KB) in HDFS.
We are looking for a way to insert a small stream into an Impala database while avoiding the creation of many small files.
What solution would you recommend?
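For concreteness, the row-at-a-time insert pattern described above looks roughly like the sketch below (using the impyla client; the host, table, and columns are made up). Each such INSERT lands as its own tiny file in the table's HDFS directory, which is the small-files problem being described:

```python
from impala.dbapi import connect  # pip install impyla

# Hypothetical connection details and table.
conn = connect(host="impala-host", port=21050)
cursor = conn.cursor()

def insert_message(event_time: str, payload: str) -> None:
    # Each execute() runs a separate Impala INSERT and produces one small
    # (<10 KB) file in HDFS. Values are inlined only to keep the sketch
    # short; real code should bind parameters properly.
    cursor.execute(
        "INSERT INTO events (event_time, payload) "
        f"VALUES ('{event_time}', '{payload}')"
    )
```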

Big Data ingestion - Flafka use cases

I have seen that the Big Data community is very keen on using Flafka in many ways for data ingestion, but I haven't really understood why yet.
A simple example I have developed to better understand this is to ingest Twitter data and move it to multiple sinks (HDFS, Storm, HBase).
I have done the implementation for the ingestion part in the following two ways:
(1) A plain Kafka Java producer with multiple consumers, or (2) Flume agent #1 (Twitter source + Kafka sink) | (potentially) Flume agent #2 (Kafka source + multiple sinks). I haven't really seen any difference in the complexity of developing either of these solutions (this is not a production system, so I can't comment on performance) - the only thing I found online is that a good use case for Flafka is data from multiple sources that needs aggregating in one place before being consumed in different places.
Can someone explain why would I use Flume+Kafka over plain Kafka or plain Flume?
People usually combine Flume and Kafka because Flume has a great (and battle-tested) set of connectors (HDFS, Twitter, HBase, etc.) and Kafka brings resilience. Also, Kafka helps distribute Flume events between nodes.
EDIT:
Kafka replicates the log for each topic's partitions across a configurable number of servers (you can set this replication factor on a topic-by-topic basis). This allows automatic failover to these replicas when a server in the cluster fails so messages remain available in the presence of failures. -- https://kafka.apache.org/documentation#replication
Thus, as soon as Flume gets the message to Kafka, you have a guarantee that your data won't be lost. NB: you can integrate Kafka with Flume at every stage of your ingestion (i.e. Kafka can be used as a source, channel, and sink, too).
