Parent Shard Exists but not the Child Shard - spark-streaming

I am setting up a Spark Streaming project with Kinesis. When I try to connect to my Kinesis stream, I get the following error from Spark:
ERROR ShardSyncTask: Caught exception while sync'ing Kinesis shards and leases
com.amazonaws.services.kinesis.clientlibrary.exceptions.internal.KinesisClientLibIOException: Parent shard shardId-000000000000 exists but not the child shard shardId-000000000002
When I post test data to this stream or read data from it using the base Amazon libraries, I get no errors. The error occurs only when I try to connect with Spark.
Below is the code that I am using for my tests:
val conf = new SparkConf().setMaster("local[2]").setAppName("KinesisCounter")
val ssc = new StreamingContext(conf, Seconds(1))
val rawStream = KinesisUtils.createStream(ssc, "dev-test", "kinesis.us-east-1.amazonaws.com", Duration(1000), InitialPositionInStream.TRIM_HORIZON, StorageLevel.MEMORY_ONLY)
rawStream.map(msg => new String(msg)).count.print

How many shards do you have on the Kinesis stream?
Here is what I would do:
Check the Kinesis region and make sure your application settings and the stream are in the same region.
Delete the DynamoDB table that stores the Kinesis shard leases and start over. The official documentation says:
Changing the application name or stream name can lead to Kinesis errors in some cases. If you see errors, you may need to manually delete the DynamoDB table.
Check your application code to see whether any settings are being set at runtime.
Hope it helps.

Related

Why Kafka streams creates topics for aggregation and joins

I recently created my first Kafka Streams application for learning, using spring-cloud-stream-kafka-binding. It is a simple eCommerce system that reads a topic called products, which receives an entry whenever new stock of a product comes in. I aggregate the quantity to get the total quantity of each product.
I had two choices:
Send the aggregate details (KTable) to another Kafka topic called aggregated-products
Materialize the aggregated data
I chose the second option and found that the application created a Kafka topic by itself. When I consumed messages from that topic, I got the aggregated messages.
.peek((k,v) -> LOGGER.info("Received product with key [{}] and value [{}]",k, v))
.groupByKey()
.aggregate(Product::new,
(key, value, aggregate) -> aggregate.process(value),
Materialized.<String, Product, KeyValueStore<Bytes, byte[]>>as(PRODUCT_AGGREGATE_STATE_STORE).withValueSerde(productEventSerde)//.withKeySerde(keySerde)
// because keySerde is configured in application.properties
);
Using InteractiveQueryService, I am able to access this state store in my application to find the total quantity available for a product.
Now I have a few questions:
Why did the application create a new Kafka topic?
If the answer is 'to store aggregated data', how is this different from option 1, in which I would have sent the aggregated data myself?
Where does RocksDB come into the picture?
The code of my application (which does more than what I explained here) is available at this link:
https://github.com/prashantbhardwaj/kafka-stream-example/blob/master/src/main/java/com/appcloid/kafka/stream/example/config/SpringStreamBinderTopologyBuilderConfig.java
The internal topics are called changelog topics and are used for fault tolerance. The state of the aggregation is stored both locally on disk using RocksDB and on the Kafka broker in the form of a changelog topic, which is essentially a "backup". If a task is moved to a new machine, or the local state is lost for some other reason, Kafka Streams can restore the local state by reading all changes to the original state from the changelog topic and applying them to a new RocksDB instance. After restoration has finished (the whole changelog topic has been processed), the same state should be on the new machine, and the new machine can continue processing where the old one stopped. There are a lot of intricate details to this (e.g. in the default setting, the state can be updated twice for the same input record when failures happen).
See also https://developer.confluent.io/learn-kafka/kafka-streams/stateful-fault-tolerance/
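The restore-by-replay idea described above can be sketched in plain Java. This is a toy model, not Kafka Streams code: `Changelog` stands in for the changelog topic, `LocalStore` stands in for the RocksDB store, and all class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a state store backed by a changelog; not Kafka Streams code. */
class ChangelogRestoreSketch {

    /** One changelog entry: a key and its latest aggregated value. */
    static final class Change {
        final String key;
        final long value;
        Change(String key, long value) { this.key = key; this.value = value; }
    }

    /** Hypothetical stand-in for the changelog topic (append-only). */
    static final class Changelog {
        private final List<Change> entries = new ArrayList<>();
        void append(Change change) { entries.add(change); }
        List<Change> readAll() { return new ArrayList<>(entries); }
    }

    /** Hypothetical stand-in for the local store; every update is also logged. */
    static final class LocalStore {
        private final Map<String, Long> state = new HashMap<>();
        private final Changelog changelog;

        LocalStore(Changelog changelog) { this.changelog = changelog; }

        /** Adds a quantity to the running total and logs the new total. */
        void add(String productId, long quantity) {
            long total = state.merge(productId, quantity, Long::sum);
            changelog.append(new Change(productId, total));
        }

        Long get(String productId) { return state.get(productId); }

        /** Rebuilds a fresh store by replaying the changelog from the start. */
        static LocalStore restore(Changelog changelog) {
            LocalStore restored = new LocalStore(changelog);
            for (Change change : changelog.readAll()) {
                // Apply the logged value directly; replay must not log again.
                restored.state.put(change.key, change.value);
            }
            return restored;
        }
    }

    public static void main(String[] args) {
        Changelog changelog = new Changelog();
        LocalStore original = new LocalStore(changelog);
        original.add("apple", 5);
        original.add("pear", 2);
        original.add("apple", 3);

        // Simulate losing the local state and restoring it on a new machine.
        LocalStore restored = LocalStore.restore(changelog);
        System.out.println(restored.get("apple"));
        System.out.println(restored.get("pear"));
    }
}
```

Because each changelog entry carries the full new value for its key, replaying the entries in order and keeping the last value per key is enough to rebuild the state. No input records need to be reprocessed.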

Flink, Kafka and JDBC sink

I have a Flink 1.11 job that consumes messages from a Kafka topic, keys them, filters them (keyBy followed by a custom ProcessFunction), and saves them into the db via JDBC sink (as described here: https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/connectors/jdbc.html)
The Kafka consumer is initialized with these options:
properties.setProperty("auto.offset.reset", "earliest")
kafkaConsumer = new FlinkKafkaConsumer(topic, deserializer, properties)
kafkaConsumer.setStartFromGroupOffsets()
kafkaConsumer.setCommitOffsetsOnCheckpoints(true)
Checkpoints are enabled on the cluster.
What I want to achieve is a guarantee for saving all filtered data into the db, even if the db is down for, let's say, 6 hours, or there are programming errors while saving to the db and the job needs to be updated, redeployed and restarted.
For this to happen, any checkpointing of the Kafka offsets should mean that either
Data that was read from Kafka is in Flink operator state, waiting to be filtered / passed into the sink, and will be checkpointed as part of Flink operator checkpointing, OR
Data that was read from Kafka has already been committed into the db.
While looking at the implementation of the JdbcSink, I see that it does not really keep any internal state that will be checkpointed/restored. Rather, its checkpointing is a write out to the database. Now, if this write fails during checkpointing and the Kafka offsets do get saved, I'll be in a situation where I've "lost" data: subsequent reads from Kafka will resume from the committed offsets, and whatever data was in flight when the db write failed is neither re-read from Kafka nor present in the db.
So is there a way to stop advancing the Kafka offsets whenever a full pipeline (Kafka -> Flink -> DB) fails to execute - or potentially the solution here (in pre-1.13 world) is to create my own implementation of GenericJdbcSinkFunction that will maintain some ValueState until the db write succeeds?
There are 3 options that I can see:
Try out the JDBC 1.13 connector with your Flink version. There is a good chance it might just work.
If that doesn't work immediately, check if you can backport it to 1.11. There shouldn't be too many changes.
Write your own 2-phase-commit sink, either by extending TwoPhaseCommitSinkFunction or by implementing your own SinkFunction with CheckpointedFunction and CheckpointListener. Basically, you create a new transaction after a successful checkpoint and commit it in notifyCheckpointComplete.
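To illustrate the third option, here is a minimal plain-Java sketch of the checkpoint-aligned commit protocol. It is not Flink code: `BufferingSink` is a hypothetical class, the method names only mirror Flink's `snapshotState` and `notifyCheckpointComplete` with simplified signatures, and a `List<String>` stands in for the database. In a real sink, the pending transactions would also have to be stored in checkpointed state so they can be committed after a restore.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a checkpoint-aligned two-phase-commit sink; not Flink code. */
class TwoPhaseCommitSketch {

    /** Hypothetical sink that only writes to the db once a checkpoint completes. */
    static final class BufferingSink {
        private List<String> current = new ArrayList<>();   // open transaction
        private final TreeMap<Long, List<String>> pending = new TreeMap<>(); // sealed, uncommitted
        private final List<String> database;                 // stands in for the db

        BufferingSink(List<String> database) { this.database = database; }

        /** Called per record: buffer it in the open transaction. */
        void invoke(String row) { current.add(row); }

        /** Phase 1: seal the open transaction under this checkpoint id. */
        void snapshotState(long checkpointId) {
            pending.put(checkpointId, current);
            current = new ArrayList<>();
        }

        /** Phase 2: commit every transaction sealed at or before this checkpoint. */
        void notifyCheckpointComplete(long checkpointId) {
            Map<Long, List<String>> ready = pending.headMap(checkpointId, true);
            for (List<String> rows : ready.values()) {
                database.addAll(rows);
            }
            ready.clear(); // the view is backed by 'pending', so this removes them there
        }
    }

    public static void main(String[] args) {
        List<String> database = new ArrayList<>();
        BufferingSink sink = new BufferingSink(database);

        sink.invoke("a");
        sink.invoke("b");
        sink.snapshotState(1L);            // offsets and buffered rows are checkpointed together
        sink.invoke("c");                  // belongs to the next transaction
        sink.notifyCheckpointComplete(1L); // only now do "a" and "b" reach the db

        System.out.println(database);
    }
}
```

The point of the design is that rows are never dropped between the Kafka offset commit and the db write: until a transaction is committed, its rows live in state that is part of the same checkpoint as the offsets.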

Kafka streams: Is there a way to delete an entry from kv store only if successfully committed to topics?

In Kafka Streams, my kv store is linked to a sink that sends records to an output topic.
- What exceptions would we get if for some reason sink can't commit records to topics?
If the sink cannot write the record, it retries internally, and once all retries are exhausted the whole application goes down with an exception. If the store was already updated successfully, it will (with the default configuration) still contain the data, and you cannot delete it. This is what the "at-least-once" processing guarantee gives you.
As of Kafka 0.11 you can enable "exactly-once" processing:
properties.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
In this case, the store is deleted and recreated on application restart before any processing is repeated. This ensures that data written to the store before the error is "removed" before processing continues.

Reading AWS Dynamodb Stream

I want to do an incremental DynamoDB backup on S3 using DynamoDB Streams. I have a Lambda that reads the DynamoDB stream and writes files into S3. To mark shards that have already been read, I log ExclusiveStartShardId in a configuration file.
What I do is:
Describe the stream (using the logged ExclusiveStartShardId)
Get stream's shards
For all shards that are CLOSED (have an EndingSequenceNumber) I do the following:
Get a shard iterator for that shard (shardIteratorType: 'TRIM_HORIZON')
Iterate through the shard and fetch records until NextShardIterator becomes null
The problem here is that I read only closed shards, so to get new records I must wait an undetermined amount of time for the open shard to be closed.
It seems that the last shard is usually in the OPEN state (it has no EndingSequenceNumber). If I remove the check for EndingSequenceNumber from the pseudocode above, I end up in an infinite loop, because NextShardIterator is always present when I reach the last shard. I also cannot stop when zero items are fetched, because there can be "gaps" in the shard.
This tutorial uses numChanges to stop the infinite loop: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.LowLevel.Walkthrough.html#Streams.LowLevel.Walkthrough.Step5
What is the best approach in this situation?
I also found a similar question: Reading data from dynamodb streams. Unfortunately, I could not find an answer to my question there.
Why not attach the DynamoDB stream as an event source for your Lambda function? Then Lambda will take care of polling the stream and calling your function when necessary. See this for details.

Sticking stream data to specific worker

We are trying to replace Apache Storm with Apache Spark Streaming.
In Storm, we partitioned the stream by "Customer ID" so that messages within a range of customer IDs are routed to the same bolt (worker).
We do this because each worker caches customer details (from the DB).
So we split the stream into 4 partitions, and each bolt (worker) handles 1/4 of the entire range.
I have seen comparisons of Spark and Storm that describe this as a limitation of Spark.
I am hoping there is a solution to this in Spark Streaming.
When using Kafka, one way to address this problem is to partition your data on the producer side. As you have probably seen, Kafka messages have a key, and you can use that key to distribute the data among partitions.
Using the Kafka receiver, you create one receiver per partition. When the streaming job starts, the receivers are distributed over several executors.
This means that every executor (JVM) receives data only for the partitions it has been assigned. As a result, the same ID goes to the same executor for the lifetime of the receiver, which enables effective local caching as intended in the question.
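The key-based routing can be illustrated with a small plain-Java sketch. It is a simplification: Kafka's default partitioner hashes the serialized key bytes with murmur2, whereas the hypothetical `partitionFor` helper below uses `String.hashCode`. The property that matters is the same in both cases: a given key always maps to the same partition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of key-based partitioning; not the Kafka partitioner itself. */
class KeyPartitioningSketch {

    /** Hypothetical helper: maps a customer id to one of numPartitions partitions. */
    static int partitionFor(String customerId, int numPartitions) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(customerId.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 4; // matches the 4-way split described in the question
        String[] customerIds = {"customer-1", "customer-2", "customer-3", "customer-1"};

        // Group the ids by the partition (and therefore the receiver) they land on.
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String id : customerIds) {
            int partition = partitionFor(id, numPartitions);
            byPartition.computeIfAbsent(partition, p -> new ArrayList<>()).add(id);
        }

        // Both occurrences of "customer-1" appear under the same partition.
        System.out.println(byPartition);
    }
}
```

Since the mapping depends only on the key and the partition count, a per-executor cache of customer details stays valid as long as the partition-to-receiver assignment does not change.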
