Stream Analytics job takes several minutes to output results to an Event Hub.
A Stateless Web API Azure Service Fabric application is distributed across 8 nodes. The application is very simple, consisting of a single Controller, which:
Receives a series of JSON objects
Initialises a series of EventData instances that wrap each JSON object
Sets the PartitionKey property of each EventData instance to the machine-name value
Publishes the JSON objects as a single batch of EventData instances to Azure Event Hub (a sketch of this publish path follows the payload below)
The JSON payload is a simple series of IP addresses and time-stamps, as follows:
[{
"IPAddress": "10.0.0.2",
"Time": "2016-08-17T12:00:01",
"MachineName": "MACHINE01"
}, {
"IPAddress": "10.0.0.3",
"Time": "2016-08-17T12:00:02",
"MachineName": "MACHINE01"
}]
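For illustration only, here is a minimal sketch of that publish path using the azure-messaging-eventhubs Java client, with the partition key set on the batch; the original application is an ASP.NET Web API, so the client library, hub name, and class names below are assumptions rather than the actual implementation:

import com.azure.messaging.eventhubs.EventData;
import com.azure.messaging.eventhubs.EventDataBatch;
import com.azure.messaging.eventhubs.EventHubClientBuilder;
import com.azure.messaging.eventhubs.EventHubProducerClient;
import com.azure.messaging.eventhubs.models.CreateBatchOptions;
import java.util.List;

public class TelemetryPublisher {

    public void publish(List<String> jsonObjects, String machineName) {
        // connection string and hub name are placeholders
        EventHubProducerClient producer = new EventHubClientBuilder()
                .connectionString("<event-hub-connection-string>", "input-hub")
                .buildProducerClient();

        // partition key = machine name, so all events from one node land in one partition
        EventDataBatch batch = producer.createBatch(
                new CreateBatchOptions().setPartitionKey(machineName));

        for (String json : jsonObjects) {
            batch.tryAdd(new EventData(json)); // wrap each JSON object in an EventData instance
        }

        producer.send(batch);                  // publish as a single batch
        producer.close();
    }
}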
Once received, the Event Hub acts as an input to a Stream Analytics job, which executes the following Query:
SELECT
IPAddress, COUNT(*) AS Total, MachineName
INTO
Output
FROM
Input TIMESTAMP BY TIME
PARTITION BY PartitionId
GROUP BY
TUMBLINGWINDOW(MINUTE, 1), IPAddress, MachineName, PartitionId
HAVING Total >= 2
Note that the Query is partitioned by PartitionId, where PartitionId is set to the machine-name of the originating Service Fabric application. There will therefore be a maximum of 8 PartitionKeys.
There are 8 individual Service Fabric instances, and 8 corresponding Partitions assigned to the input Event Hub.
Finally, the Stream Analytics job outputs the result to a second Event Hub. Again, this Event Hub has 8 partitions. The Stream Analytics Query retains the machine-name, which is used as the PartitionKey of the output Event Hub. The output format is JSON.
At best, the process takes 30-60 seconds, and sometimes several minutes, for a single HTTP request to reach the output Event Hub. The bottleneck seems to be the Stream Analytics job, given that the ASP.NET application publishes the EventData batches in sub-second timescales.
Edit:
Applying a custom field in the TIMESTAMP BY clause adds a great deal of latency when coupled with a GROUP BY clause. I achieved acceptable latency (1-2 seconds) after removing the TIMESTAMP BY clause.
The optimal Query is as follows:
SELECT
COUNT(*) AS Total, IPAddress
FROM
Input
PARTITION BY PartitionId
GROUP BY TUMBLINGWINDOW(MINUTE, 1), IPAddress, PartitionId
However, adding a HAVING clause results in latency increasing to 10-20 seconds:
SELECT
COUNT(*) AS Total, IPAddress
FROM
Input
PARTITION BY PartitionId
GROUP BY TUMBLINGWINDOW(MINUTE, 1), IPAddress, PartitionId
HAVING Total >= 10
Without the ability to aggregate data within the Query using a HAVING clause in a timely fashion, this seems to defeat the purpose.
Incidentally, Streaming Units, Partitions, Input and Output are configured optimally, as per this guide to achieving parallelism with Stream Analytics.
Related
I have a fleet of 250 Wi-Fi-enabled IoT sensors streaming weight data. Each device samples once per second. I am requesting help in choosing between AWS DynamoDB Streams and AWS Kinesis Streams to store and process this data in real time. Here are some additional requirements:
I need to keep all raw data in a SQL-accessible table.
I also need to clean the raw stream data with Python's Pandas library to recognize device-level events based on weight changes (e.g. if the weight of sensor #1 increases, record it as "sensor #1 increased by x lbs @ XX:XX PM"; if there is no change, do nothing).
I need that change-event data (interpreted from the raw data streams with the library) to be accessible in a real-time dashboard (e.g. device #1's weight just went to zero, prompting an employee to refill container #1).
Either DDB Streams or Kinesis Streams can support Lambda functions, which is what I'll use for the data cleaning, but I've read the documentation and comparison articles and can't distinguish which is best for my use case. Cost is not a key consideration. Thanks in advance!!
Unfortunately, I think you will need a few pieces of infrastructure for a full solution.
I think you could use Kinesis and Firehose to write to a database to store the raw data in a way that can be queried with SQL.
For the data cleaning step, I think you will need to use a stateful stream processor like Flink or Bytewax; the transformed data can then be written to a real-time database or back to Kinesis so that it can be consumed by a dashboard.
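To make the stateful part concrete, here is a rough sketch of the change-detection step using Flink keyed state; the Reading class, its fields, the output format, and an existing DataStream<Reading> named readings are all assumptions, not part of the question:

// one ValueState<Double> per sensorId holds the previously seen weight
DataStream<String> changeEvents = readings
        .keyBy(r -> r.sensorId)
        .process(new KeyedProcessFunction<String, Reading, String>() {
            private transient ValueState<Double> lastWeight;

            @Override
            public void open(Configuration parameters) {
                lastWeight = getRuntimeContext().getState(
                        new ValueStateDescriptor<>("lastWeight", Double.class));
            }

            @Override
            public void processElement(Reading r, Context ctx, Collector<String> out) throws Exception {
                Double previous = lastWeight.value();
                if (previous != null && r.weight != previous) {
                    // emit a device-level change event for the dashboard
                    out.collect("sensor " + r.sensorId + " changed by "
                            + (r.weight - previous) + " lbs @ " + r.timestamp);
                }
                lastWeight.update(r.weight);
            }
        });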
DynamoDB Streams works with DynamoDB. It streams row changes to be picked up by downstream services like Lambda. You mentioned that you want the data to be stored in a SQL database. DynamoDB is a NoSQL database, so you can exclude that service.
I am not sure why you want to have the data in a SQL database. If it is time-series data, you would probably store it in a time-series database like Timestream.
If you are using AWS IoT Core to send data over MQTT to AWS, you can forward those messages to Kinesis Data Streams (or SQS). Then you can have a Lambda triggered by messages received in Kinesis. This Lambda can process the data and store it in the database you want.
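As a rough sketch of that last step, a Lambda handler for Kinesis records might look like the following (using the aws-lambda-java-events types; the class name and the persistence logic are placeholders):

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.KinesisEvent;
import java.nio.charset.StandardCharsets;

public class WeightStreamHandler implements RequestHandler<KinesisEvent, Void> {

    @Override
    public Void handleRequest(KinesisEvent event, Context context) {
        for (KinesisEvent.KinesisEventRecord record : event.getRecords()) {
            // each record carries one JSON reading from a sensor
            String payload = new String(
                    record.getKinesis().getData().array(), StandardCharsets.UTF_8);

            // parse the reading, compare against the previously stored weight for the
            // device, and write the raw row plus any change event to the chosen database
        }
        return null;
    }
}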
I want to apply a 5-minute window operation on a timestamp column in a Mapping Data Flow. First, I am using Stream Analytics to get the telemetry data from an Event Hub and store it in CSV files on Blob Storage. After that, I want to perform windowing at a 5-minute interval on the data stored in the CSV files through a Mapping Data Flow, and then perform some aggregation. How do I apply a 5-minute window on the timestamp column?
It's hard for me to understand the streaming table in Flink. I can understand Hive, which maps a fixed, static data file to a "table", but how do you embody a table built on streaming data?
For example, every 1 second, 5 events with same structure are sent to a Kafka stream:
{"num":1, "value": "a"}
{"num":2, "value": "b"}
....
What does the dynamic table built on them look like? Does Flink consume them all, store them somewhere (memory, local file, HDFS, etc.) and then map them to a table? Once the "transformer" finishes processing these 5 events, does it clear the data and refill the "table" with 5 new events?
Any help is appreciated...
These dynamic tables don't necessarily exist anywhere -- they're simply an abstraction that may, or may not, be materialized, depending on the needs of the query being performed. For example, a query that is doing a simple projection
SELECT a, b FROM events
can be executed by simply streaming each record through a stateless Flink pipeline.
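As a sketch of what that means in the DataStream API (assuming a POJO Event with fields a and b, and an existing DataStream<Event> named events):

// no state is needed: each record is transformed and forwarded as it arrives
DataStream<String> projected = events.map(new MapFunction<Event, String>() {
    @Override
    public String map(Event e) {
        return e.a + "," + e.b;
    }
});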
Also, Flink doesn't operate on mini-batches -- it processes each event one at a time. So there's no physical "table", or partial table, anywhere.
But some queries do require some state, perhaps very little, such as
SELECT count(*) FROM events
which needs nothing more than a single counter, while something like
SELECT key, count(*) FROM events GROUP BY key
will use Flink's key-partitioned state (a sharded key-value store) to persist the current counter for each key. Different nodes in the cluster will be responsible for handling events for different keys.
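As a rough sketch of how such a query runs on a stream, here is the keyed count expressed with Flink's Table API; the Kafka connector options and the column name are placeholders, not part of the original question:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

// declare the Kafka stream as a dynamic table
tEnv.executeSql(
    "CREATE TABLE events (`key` STRING) WITH (" +
    "  'connector' = 'kafka'," +
    "  'topic' = 'events'," +
    "  'properties.bootstrap.servers' = 'localhost:9092'," +
    "  'scan.startup.mode' = 'latest-offset'," +
    "  'format' = 'json')");

// the running count per key lives in Flink's key-partitioned state;
// the result is a changelog stream of updated counts, not a stored table
Table counts = tEnv.sqlQuery("SELECT `key`, COUNT(*) AS cnt FROM events GROUP BY `key`");
tEnv.toChangelogStream(counts).print();

env.execute();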
Just as "normal" SQL takes one or more tables as input, and produces a table as output, stream SQL takes one or streams as input, and produces a stream as output. For example, the SELECT count(*) FROM events will produce the stream 1 2 3 4 5 ... as its result.
There are some good introductions to Flink SQL on YouTube: https://www.google.com/search?q=flink+sql+hueske+walther, and there are training materials on github with slides and exercises: https://github.com/ververica/sql-training.
I am trying to create a Kafka Streams service where
I am trying to initialize a cache in a processor, which will then be updated by consuming messages from a topic, say "nodeStateChanged", keyed by a partition key, let's say locationId.
I need to check the node state when I consume another topic, let's say "Report", again keyed by the same locationId. Effectively I am joining with the table created from nodeStateChanged.
How do I ensure that all the updates for nodeStateChanged fall on the same instance as the Report topic, so that the lookup for a location is possible when a new report is received? Do the two need to be created by the same topology, or is it okay to create two separate topologies that share the same APPLICATION_ID_CONFIG?
You don't need to do anything. Kafka Streams will always co-partition topics. That is, if you have a sub-topology that reads from multiple topics with N partitions each, you get N tasks and each task processes the corresponding partitions: task 0 processes partition zero of both input topics, task 1 processes partition one of both input topics, etc.
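A minimal sketch of that setup, assuming both topics are keyed by locationId as a String and use the default serdes configured for the application (the output topic name and the join result format are made up):

StreamsBuilder builder = new StreamsBuilder();

// changelog of the latest node state per locationId (the "cache" from the question)
KTable<String, String> nodeState = builder.table("nodeStateChanged");

// reports keyed by the same locationId
KStream<String, String> reports = builder.stream("Report");

// because the topics are co-partitioned, each task owns the matching partitions of both,
// so the node-state lookup is a local state-store read
KStream<String, String> enriched = reports.join(nodeState,
        (report, state) -> report + " | nodeState=" + state);

enriched.to("EnrichedReport");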
My understanding, as per the Kafka Streams documentation, is:
The maximum number of possible parallel tasks is equal to the maximum number of partitions of a topic among all topics in a cluster.
I have around 60 topics in the Kafka cluster. Each topic has a single partition only.
Is it possible to achieve scalability/parallelism with Kafka stream for my Kafka cluster?
Do you want to do the same computation over all topics? For this, I would recommend introducing an extra topic with many partitions that you use to scale out:
// using new 1.0 API
StreamsBuilder builder = new StreamsBuilder();
KStream parallelizedStream = builder
.stream(/* subscribe to all topics at once*/)
.through("topic-with-many-partitions");
// apply computation
parallelizedStream...
Note: You need to create the topic "topic-with-many-partitions" manually before starting your Streams application.
Pro Tip:
The topic "topic-with-many-partitions" can have a very short retention time as it's only used for scaling and must not hold data long term.
Update
If you have 10 topics T1 to T10 with a single partition each, the program from above will execute as follows (with TN being the dummy topic with 10 partitions):
T1-0  --+               +--> TN-0 --> T1_0
...      +--> T0_0 --+--> ...  --> ...
T10-0 --+               +--> TN-9 --> T1_9
The first part of your program will only read all 10 input topics and write the records back into the 10 partitions of TN. Afterwards, you get up to 10 parallel tasks, each processing one partition of TN. If you start 10 KafkaStreams instances, only one will execute T0_0, and each will also have one T1_x running.
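Putting this together, a hypothetical concrete version of the earlier snippet for the T1 ... T10 scenario (assuming String keys and values and the default serdes configured for the application) might look like:

StreamsBuilder builder = new StreamsBuilder();

List<String> inputTopics = Arrays.asList(
        "T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8", "T9", "T10");

KStream<String, String> parallelizedStream = builder
        .stream(inputTopics)                      // sub-topology 0: one task (T0_0) reads all 10 topics
        .through("topic-with-many-partitions");   // redistributes records over the 10 partitions of TN

// sub-topology 1: up to 10 tasks (T1_0 ... T1_9), one per partition of the dummy topic
parallelizedStream.foreach((key, value) -> {
    // apply computation here
});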