Spark streaming batch interval with Kinesis - spark-streaming

What is the effect of setting the batch interval when creating the streaming context?
new StreamingContext(spark.sparkContext, batchInterval)
According to this Amazon blog, the Kinesis batch interval is hard-coded to 1 second.

The Kinesis batch interval mentioned in the blog is the interval at which the receiver reads data from the stream, which is set to 1 second by default. This interval only determines the input rate of the receiver.
The batchInterval provided while creating the StreamingContext divides the received records into batches of the given interval, to be processed by Spark Streaming.
For example, if you have a single Kinesis receiver and your batchInterval is 10 seconds, the receiver can read up to 10,000 records in those 10 seconds (reading 1,000 records per second from the Kinesis stream), so that streaming batch will include up to 10,000 records.
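A minimal sketch of the setup being discussed, using the spark-streaming-kinesis-asl KinesisUtils receiver API (deprecated in newer Spark versions); the app, stream, region and endpoint names are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

object KinesisBatchIntervalSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("kinesis-batch-interval").getOrCreate()

    // batchInterval only controls how received records are grouped into micro-batches;
    // the Kinesis receiver itself still reads from the stream roughly once per second.
    val batchInterval = Seconds(10)
    val ssc = new StreamingContext(spark.sparkContext, batchInterval)

    val stream = KinesisUtils.createStream(
      ssc, "my-app", "my-stream",                             // placeholder names
      "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
      InitialPositionInStream.LATEST,
      Seconds(10),                                            // KCL checkpoint interval
      StorageLevel.MEMORY_AND_DISK_2)

    // At ~1,000 records/sec read by the receiver, each 10-second batch holds ~10,000 records.
    stream.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}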

Related

Low latency (subsecond) kafka + spark structured streaming tuning

I'm trying to build a kafka + spark structured streaming stateful application with low latency. By low latency I mean a couple of hundred milliseconds per job.
The spark app reads data from a kafka topic whose partition count is 2 times the number of executor cores, then processes the data and writes it to another kafka topic. The rate at which data is produced into the input topic is 100 records/s with an approximate record size of 2 kb. The DAG of the job indicates that the stage that includes reading from the kafka source takes 0.5 s. This stage basically transforms the data from kafka into a dataset of a custom case class, followed by groupByKey and flatMapGroupsWithState in the second stage. The shuffle write time in the web UI is 0 ms (which should be small because the shuffled data size is around 10~20 kb). So AFAIK the only time-consuming operation should be reading from kafka.
I've read that kafka can perform much better than this; end-to-end latency can be smaller than 100 ms.
The kafka broker is not heavily loaded. I don't know if it's related to the question, but the whole application runs on a kubernetes cluster. There's a picture of this stage and a picture of the whole query attached, in case they help.
Sorry I cannot post the code. Is there anything I can try doing?
Best Regards
Today I found that some tasks of the first stage take 0.5 s while others do not, which looked very suspicious to me, so I dug into the kafka settings. There's one configuration called fetch.max.wait.ms, which by default makes the consumer wait up to 500 ms for new messages before returning. After reducing this config everything works fine. More info here: fetch.max.wait.ms
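For reference, a hedged sketch of passing that setting through to the Structured Streaming kafka source (Kafka consumer properties are forwarded when prefixed with "kafka."); the broker, topics and checkpoint path are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("low-latency-kafka").getOrCreate()

// fetch.max.wait.ms defaults to 500 ms: the broker may hold a fetch request that long
// while waiting for fetch.min.bytes of data. Lowering it makes the consumer return sooner.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")      // placeholder broker
  .option("subscribe", "input-topic")                    // placeholder topic
  .option("kafka.fetch.max.wait.ms", "50")
  .load()

input.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "output-topic")                       // placeholder topic
  .option("checkpointLocation", "/tmp/low-latency-ckpt") // placeholder path
  .start()
  .awaitTermination()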

Apache Storm: throttling spout based on configuration

My topology reads from Kafka and makes an HTTP call to an external system. The ingestion rate in Kafka is about 200 messages per second. The external system only supports 20 HTTP calls per second. How can I introduce throttling so that the bolt that makes HTTP calls processes only 20 messages per second?
You can use the topology.max.spout.pending setting to throttle the spout based on how many tuples are in flight in the topology. The setting is per spout instance, so if you have e.g. 10 spout executors and you set max 100 tuples, you will get a max of 1000 tuples in the topology.
You can use the resetTimeout method on the OutputCollector to keep tuples you want to postpone from failing due to timeout.
This being said, you probably need to batch up your messages into larger bundles. If you can only process 20 messages per second, and you have an input of 200 per second, you will start falling behind and never catch up.
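A rough sketch of wiring that up through Storm's Java API (called from Scala); KafkaMessageSpout and HttpCallBolt are hypothetical stand-ins for your own spout and bolt:

import org.apache.storm.{Config, StormSubmitter}
import org.apache.storm.topology.TopologyBuilder

// KafkaMessageSpout and HttpCallBolt are placeholders for your own components.
val builder = new TopologyBuilder()
builder.setSpout("kafka-spout", new KafkaMessageSpout())
builder.setBolt("http-bolt", new HttpCallBolt()).shuffleGrouping("kafka-spout")

val conf = new Config()
// At most 20 unacked tuples in flight per spout executor; the bolt acking each tuple
// only after the HTTP call completes is what makes this throttle the call rate.
conf.setMaxSpoutPending(20)
// Extra headroom for postponed tuples before they fail on timeout (see resetTimeout above).
conf.setMessageTimeoutSecs(60)

StormSubmitter.submitTopology("throttled-topology", conf, builder.createTopology())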

Kinesis triggers lambda with small batch size

I have a Lambda which is configured as a consumer of a Kinesis data stream, with a batch size of 10,000 (maximal).
The Lambda parses the given records and inserts them into Aurora PostgreSQL (using an INSERT command).
Somehow, I see that the Lambda is invoked most of the time with a relatively small number of records (fewer than 200), although the 'IteratorAge' is constantly high (about 60 seconds). The records are put into the stream with a random partition key (generated as uuid4), and of size
How can that be explained? As I understand it, if the shard is not empty, all the current records, up to the configured batch size, should be polled.
I assume that if the Lambda were invoked with bigger batches, this delay could be prevented.
Note: There is also a Kinesis Firehose configured as a consumer (doesn't seem to have any issue).
Turns out that the iterator age of the Kinesis stream was 0 ms, so this behavior makes sense.
The iterator age of the Lambda is a bit different:
Measures the age of the last record for each batch of records processed. Age is the difference between the time Lambda received the batch, and the time the last record in the batch was written to the stream.
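If the goal is bigger batches, a hedged sketch (AWS SDK for Java v2, called from Scala) of raising the event source mapping's batch size and adding a batching window; the mapping UUID and values are placeholders:

import software.amazon.awssdk.services.lambda.LambdaClient
import software.amazon.awssdk.services.lambda.model.UpdateEventSourceMappingRequest

val lambda = LambdaClient.create()

// BatchSize is only an upper bound: Lambda invokes with whatever records are buffered
// when it polls the shard. A batching window asks it to keep gathering records
// (up to the window or the batch size) before invoking the function.
lambda.updateEventSourceMapping(
  UpdateEventSourceMappingRequest.builder()
    .uuid("mapping-uuid")                  // placeholder event source mapping UUID
    .batchSize(10000)
    .maximumBatchingWindowInSeconds(5)
    .build())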

Kinesis stream / shard - multiple consumers

I have already read some questions about kinesis shard and multiple consumers but I still don't understand how it works.
My use case: I have a kinesis stream with just one shard. I would like to consume this shard using different Lambda functions, each of them independently. It's as if each Lambda function would have its own shard iterator.
Is it possible to set multiple Lambda consumers (stream based) reading from the same stream/shard?
Hey Mr Magalhaes, I believe the following picture should answer some of your questions.
So to clarify: you can set multiple Lambdas as consumers on a kinesis stream, but the Lambdas will block each other on processing. If your stream has only one shard, it will only have one concurrent Lambda.
If you have one kinesis stream, you can connect as many lambda functions as you want through an event source mapping.
All functions will run simultaneously and fully independent of each other and will constantly be invoked if new records arrive in the stream.
The number of shards does not matter.
For a single lambda function:
"For Lambda functions that process Kinesis or DynamoDB streams the number of shards is the unit of concurrency. If your stream has 100 active shards, there will be at most 100 Lambda function invocations running concurrently. This is because Lambda processes each shard’s events in sequence." [https://docs.aws.amazon.com/lambda/latest/dg/scaling.html]
But there is no limit on how many different lambda consumers you want to attach with kinesis.
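A minimal sketch of attaching two functions to the same stream via separate event source mappings (AWS SDK for Java v2, called from Scala); the stream ARN and function names are placeholders:

import software.amazon.awssdk.services.lambda.LambdaClient
import software.amazon.awssdk.services.lambda.model.{CreateEventSourceMappingRequest, EventSourcePosition}

val lambda = LambdaClient.create()
val streamArn = "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream" // placeholder

// Each mapping polls the stream independently, so both functions see every record.
for (function <- Seq("consumer-a", "consumer-b")) {   // placeholder function names
  lambda.createEventSourceMapping(
    CreateEventSourceMappingRequest.builder()
      .eventSourceArn(streamArn)
      .functionName(function)
      .startingPosition(EventSourcePosition.LATEST)
      .build())
}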
Yes, no problem with this!
The number of shards doesn't limit the number of consumers a stream can have.
In your case, it will just limit the number of concurrent invocations of each Lambda. This means that for each consumer, you can have at most as many concurrent executions as there are shards.
See this doc for more details.
Short answer:
Yes it will work, and will work concurrently.
Long answer:
Each shard in a Kinesis stream has 2 MiB/sec read throughput:
https://docs.aws.amazon.com/streams/latest/dev/building-consumers.html
If you have multiple applications (in your case Lambdas), they will share this throughput.
A description taken from the link above:
Fixed at a total of 2 MiB/sec per shard. If there are multiple consumers reading from the same shard, they all share this throughput. The sum of the throughput they receive from the shard doesn't exceed 2 MiB/sec.
If you write less than 1 MiB/sec of data, you should be able to support two "applications" with a single shard.
In general, if you have Y shards and X applications, it should work properly assuming your total write throughput (MiB/sec) is less than 2 MiB/sec * Y / X and that data is spread equally between the shards.
If you require each "application" to use a full 2 MiB/sec, you can enable "Consumers with Enhanced Fan-Out", which fans out the stream, giving each application a dedicated 2 MiB/sec per shard (instead of sharing the throughput).
This is described in the following link:
https://docs.aws.amazon.com/streams/latest/dev/introduction-to-enhanced-consumers.html
In Amazon Kinesis Data Streams, you can build consumers that use a feature called enhanced fan-out. This feature enables consumers to receive records from a stream with throughput of up to 2 MiB of data per second per shard. This throughput is dedicated, which means that consumers that use enhanced fan-out don't have to contend with other consumers that are receiving data from the stream.
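A small sketch of registering an enhanced fan-out consumer (AWS SDK for Java v2, called from Scala); the stream ARN and consumer name are placeholders:

import software.amazon.awssdk.services.kinesis.KinesisClient
import software.amazon.awssdk.services.kinesis.model.RegisterStreamConsumerRequest

val kinesis = KinesisClient.create()

// Each registered consumer gets its own dedicated 2 MiB/sec per shard instead of
// sharing the default read throughput with the other consumers of the stream.
val response = kinesis.registerStreamConsumer(
  RegisterStreamConsumerRequest.builder()
    .streamARN("arn:aws:kinesis:us-east-1:123456789012:stream/my-stream") // placeholder
    .consumerName("fan-out-consumer-a")                                   // placeholder
    .build())

// The returned consumer ARN can then be used as a Lambda event source (or read via
// SubscribeToShard) so this consumer no longer contends with the others.
println(response.consumer().consumerARN())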

bigquery-api-go-client, bigquery streaming inserts latency

Is there an optimal way to improve the speed of requests to an insertAll request via the golang client?
My current structure is such that:
Rows get submitted to a queue
Upon hitting job size, a new job is queued up (approx. 250 rows)
A worker picks up a job from the queue
Worker formats + sends insertAll request to BQ
The above works fairly well; however, I seem to be seeing an average of 8-12 seconds per 250-row request.
I should mention that the number of workers being used is generally 200; however, I have tried higher / lower values and it does not seem to make much of a difference (higher usually seems to take longer per job).
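For illustration only, the same ~250-row insertAll batching pattern sketched with the BigQuery Java client called from Scala (the question uses the Go client); the dataset, table and row contents are placeholders:

import com.google.cloud.bigquery.{BigQueryOptions, InsertAllRequest, TableId}
import scala.jdk.CollectionConverters._

val bigquery = BigQueryOptions.getDefaultInstance.getService

// Placeholder dataset/table; one "job" sends ~250 rows in a single insertAll call.
val tableId = TableId.of("my_dataset", "my_table")
val rows = (1 to 250).map { i =>
  Map[String, AnyRef]("id" -> Int.box(i), "payload" -> s"row-$i").asJava
}

val builder = InsertAllRequest.newBuilder(tableId)
rows.foreach(r => builder.addRow(r))

val response = bigquery.insertAll(builder.build())
if (response.hasErrors) println(s"insert errors: ${response.getInsertErrors}")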
