I have a Flink app with high parallelism (400) running in AWS EMR. It sources from Kafka and sinks to S3 using BucketingSink (with the RocksDB backend for checkpointing). The destination is defined using the "s3a://" prefix. The Flink job is a streaming app which runs continuously. At any given time, all workers combined may be generating/writing to 400 files (due to the parallelism of 400). After a few days, one of the workers will fail with the exception:
org.apache.hadoop.fs.s3a.AWSS3IOException: copyFile(bucket/2018-09-01/05/_file-10-1.gz.in-progress, bucket/2018-09-01/05/_file-10-1.gz.pending): com.amazonaws.services.s3.model.AmazonS3Exception: We encountered an internal error. Please try again. (Service: Amazon S3; Status Code: 200 InternalError; Request ID: xxxxxxxxxx; S3 Extended Request ID: yyyyyyyyyyyyyyy)
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:178)
at org.apache.hadoop.fs.s3a.S3AFileSystem.copyFile(S3AFileSystem.java:1803)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerRename(S3AFileSystem.java:776)
at org.apache.hadoop.fs.s3a.S3AFileSystem.rename(S3AFileSystem.java:662)
at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.closeCurrentPartFile(BucketingSink.java:575)
at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.openNewPartFile(BucketingSink.java:514)
at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.invoke(BucketingSink.java:446)
This seems to occur randomly when a new part file is created by the BucketingSink. The odd thing is that when it occurs, it happens to one of the parallel Flink workers, not all of them. Also, when this occurs, the Flink job transitions into a FAILING state, but it does not restart and resume/recover from the last successful checkpoint. What is the cause of this and how should it be resolved? Additionally, how can the job be configured to restart/recover from the last successful checkpoint instead of remaining in the FAILING state?
I think this is known behavior with the bucketing sink and S3, and the suggested solution is to use the shiny new StreamingFileSink in Flink 1.7.0.
Basically, the bucketing sink expects writes and renames to happen immediately, like they would in a real file system, but that isn't a good assumption for object stores like S3, so the bucketing sink ends up with race conditions that cause intermittent problems. Here's a JIRA ticket that sort of describes the problem, and the related tickets flesh it out a bit more: FLINK-9752
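For the second part of the question, the job staying in FAILING usually means no restart strategy is in effect. Below is a minimal sketch, assuming Flink 1.7+ and the universal Kafka connector; the topic, broker address, bucket path, and intervals are placeholders. It enables checkpointing, sets an explicit fixed-delay restart strategy, and uses the StreamingFileSink, which on S3 relies on multipart uploads rather than the copy/rename the BucketingSink does on s3a:

import java.util.Properties;
import java.util.concurrent.TimeUnit;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaToS3Job {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 s so there is always a recent checkpoint to restore from.
        env.enableCheckpointing(60_000);

        // Restart up to 10 times with a 30 s delay instead of staying in FAILING.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(10, Time.of(30, TimeUnit.SECONDS)));

        Properties kafkaProps = new Properties();
        kafkaProps.setProperty("bootstrap.servers", "broker:9092"); // placeholder
        kafkaProps.setProperty("group.id", "my-group");             // placeholder

        DataStream<String> stream = env.addSource(
                new FlinkKafkaConsumer<>("my-topic", new SimpleStringSchema(), kafkaProps));

        // StreamingFileSink integrates with checkpointing and, on S3, uses multipart
        // uploads instead of the rename-on-close that BucketingSink relies on.
        StreamingFileSink<String> sink = StreamingFileSink
                .forRowFormat(new Path("s3a://my-bucket/output"), new SimpleStringEncoder<String>("UTF-8"))
                .build();

        stream.addSink(sink);
        env.execute("kafka-to-s3");
    }
}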
The scenario:
Kafka version 2.4.1.
Kafka partitions are processing messages actively.
CPU usage is low, memory usage is moderate, and no throttling is observed.
Golang applications deployed on k8s using Confluent's Go client version 1.7.0.
When k8s deletes some of the pods, the Kafka consumer group goes into rebalancing.
A message that was being processed during this rebalancing gets stuck mid-way and takes around 17 minutes to be processed; the usual processing time is 3-4 seconds max.
No DB throttling, load is actually not even 10% of our peak.
k8s pods have 1 core and 1gb of memory.
Messages are consumed and processed in the same thread.
Earlier we found that one of the brokers in the 6-node cluster was unhealthy and we replaced it; after that, we started facing this issue.
Question - Why did the message get stuck? Is it because rebalancing made the processing thread hang? OR something else?
Thanks in advance for your answers!
Messages are stuck due to the rebalancing happening in your consumer group (CG). Rebalancing is a normal Kafka procedure and is always triggered when a member joins or leaves the CG. During a rebalance, consumers stop processing messages for some period of time, so events from the topic are processed with some delay. But if the CG gets stuck in PreparingRebalance, you will not process any data at all.
You can check the CG state by running a Kafka command, for example:
kafka-consumer-groups.sh --bootstrap-server $BROKERS:$PORT --group $CG --describe --state
and it should show you the status of the CG, for example:
GROUP COORDINATOR (ID) ASSIGNMENT-STRATEGY STATE #MEMBERS
name-of-consumer-group brokerX.com:9092 (1) Empty 0
In the above example the STATE is Empty.
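If you prefer to check this from code rather than the CLI, here is a minimal sketch using Kafka's Java AdminClient; the bootstrap server address and group name are placeholders:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ConsumerGroupDescription;

public class CheckGroupState {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            ConsumerGroupDescription description = admin
                    .describeConsumerGroups(Collections.singletonList("name-of-consumer-group"))
                    .all().get()
                    .get("name-of-consumer-group");

            // Prints the same state the CLI shows: Stable, Empty, PreparingRebalance, ...
            System.out.println("state=" + description.state()
                    + ", members=" + description.members().size()
                    + ", coordinator=" + description.coordinator());
        }
    }
}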
A consumer group can be in one of 5 states:
Stable - the CG is stable and all members are connected successfully
Empty - there are no members in the group (usually means the client application is down or crashed)
PreparingRebalance - members are (re)joining the CG (it may indicate a client issue if members keep crashing, but it is also the state a CG normally passes through before becoming Stable)
CompletingRebalance - the rebalance started in PreparingRebalance is being completed
Dead - the consumer group has no members and its metadata has been removed
To tell whether a PreparingRebalance issue is on the cluster side or the client side, stop the clients and run the command again to verify the CG state. If the CG still shows members, restart the broker listed in the command output as the Coordinator of that CG (brokerX.com:9092 in the example above). If the CG becomes Empty once you stop all clients connected to it, then something is off with the client code or data that causes members to keep leaving and rejoining the CG; as a result you see the CG permanently in PreparingRebalance, and you will need to investigate why that is happening.
Also, from what I recall there was a bug in Kafka version 2.4.1 that was fixed in 2.4.1.1; you can read about it here:
https://issues.apache.org/jira/browse/KAFKA-9752
https://aws.amazon.com/about-aws/whats-new/2020/09/amazon-msk-now-offers-version-2-4-1-1-fixing-a-perpetual-rebalance-bug-in-apache-kafka-2-4-1/
The troubleshooting steps above should let you verify whether you are hitting that bug or whether it is just a problem in the client code.
The flow files are stuck in a queue (Load Balance by Attribute) and are not read by the next downstream processor (MergeRecord with CSVReader and CSVRecordSetWriter). In the NiFi UI it appears that the flow files are in the queue, but when I try to list the queue it says "Queue has no flow files". Attempting to empty the queue gives the same message. The NiFi logs don't have any exceptions related to the processor. There are around 80 flow files in the queue.
I have tried the action items below, all in vain:
Restarted the downstream and upstream (ConvertRecord) processors.
Disabled and re-enabled the CSVReader and CSVRecordSetWriter services.
Disabled load balancing.
Set the flow file expiration to 3 seconds.
Screenshots attached: the flow file, the MergeRecord properties, the CSVReader service, and the CSVRecordSetWriter service.
Your MergeRecord processor is running only on the primary node, and likely all the files are on other nodes (since you are load balancing). NiFi is not aware enough to notice that the downstream processor is only running on the primary, so it does not automatically rebalance everything to the primary node. Simply changing MergeRecord to run on all nodes will allow the files to pass through.
Alas, I have not found a way to get all flow files back onto the primary node. You can use the "Single node" load-balance strategy to get all the files onto the same node, but it will not necessarily be the primary.
This is probably because the content of the flow file was deleted while its entry is still present in the FlowFile repository.
If you have a dockerized NiFi setup and you don't have a heavy production flow, you can stop your NiFi flow and delete everything in the *_repository folders (flowfile_repository, content_repository, etc.)
(provided you have all your directories mounted and no other data is at risk of loss).
Let me know if you need further assistance
You have a misconfiguration in the way you load balance your FlowFiles. To check that, stop your MergeRecord processor so that you can list and view what is inside the queue.
In the modal window displayed you can check where your FlowFiles are waiting. It is highly probable that your FlowFiles are in fact on one of the other node(s), but since MergeRecord is running only on the primary node, it has nothing in its queue.
I'm running a streaming Beam pipeline where I stream files/records from GCS using AvroIO and then create minutely/hourly buckets to aggregate events and write them to BQ. If the pipeline fails, how can I recover correctly and process only the unprocessed events? I do not want to double count events.
One approach I was considering was writing to Spanner or Bigtable, but it may be the case that the write to BQ succeeds while the write to the DB fails, or vice versa.
How can I maintain state in a reliable, consistent way in a streaming pipeline so that only unprocessed events are processed?
I want to make sure the final aggregated data in BQ is the exact count for the different events, neither under- nor over-counted.
How does a Spark streaming pipeline solve this (I know it has a checkpointing directory for managing the state of queries and DataFrames)?
Are there any recommended techniques to solve this kind of problem accurately in streaming pipelines?
Based on clarification from the comments, this question boils down to: 'Can we achieve exactly-once semantics across two successive runs of a streaming job, assuming both runs start from scratch?'. The short answer is no. Even if the user is willing to store some state in external storage, it needs to be committed atomically/consistently with the streaming engine's internal state. Streaming engines like Dataflow and Flink store the state required to 'resume' a job internally. With Flink you can resume from the latest savepoint, and with Dataflow you can 'update' a running pipeline (note that Dataflow does not actually kill your job even when there are errors; you need to cancel a job explicitly). Dataflow does provide an exactly-once processing guarantee with update.
Somewhat relaxed guarantees would be feasible with careful use of external storage. The details really depend on the specific goals (often it is not worth the extra complexity).
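To illustrate the 'update' path: instead of starting a second job from scratch, you resubmit the pipeline against the same job name with the update option set, and Dataflow carries the internal state over. A rough sketch with the Beam Java SDK and the Dataflow runner, assuming the DataflowPipelineOptions update/jobName options; the job name is a placeholder:

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class UpdateJob {
    public static void main(String[] args) {
        DataflowPipelineOptions options = PipelineOptionsFactory
                .fromArgs(args).withValidation().as(DataflowPipelineOptions.class);

        options.setJobName("events-to-bq"); // must match the running job you want to replace
        options.setUpdate(true);            // same effect as passing --update on the command line

        Pipeline pipeline = Pipeline.create(options);
        // Rebuild the same (or compatibly changed) pipeline graph here before running.
        pipeline.run();
    }
}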
I'm currently evaluating whether Apache Storm is usable as a stream processing framework for me. It looks really nice, but what worries me is the scaling.
As far as I understand it, scaling is done by rebalancing.
For example, if I want to add a new server to the cluster, I have to increase the number of workers. But when I do so with
storm rebalance storm_example -n 4
all the bolts and spouts stop working while it is rebalancing. But what I want is more like:
Add the server, add a new worker on it, and when new data arrives, also have this new worker help process the data.
Am I just missing the idea of Storm, or is that not possible with it?
I had a similar requirement, and as per my research it is not possible. In my case we ended up creating a new Storm cluster without disturbing the existing one. We were (and are) trying to assign servers/workers to Storm based on load, to reduce AWS cost.
It would be interesting to know if we can do so.
I'm trying to figure out how to restore the state of a Storm bolt instance during failover. I can persist the state externally (DB or file system); however, once the bolt instance is restarted I need to point to the specific state of that bolt instance to recover it.
The prepare method of a bolt receives a context, documented here http://nathanmarz.github.io/storm/doc/backtype/storm/task/TopologyContext.html
What is not clear to me is: is there any piece of this context that uniquely identifies the specific bolt instance, so I can figure out which persisted state to point to? Is that ID preserved during failover? Alternatively, is there any variable/object I can set for the specific bolt instance that is preserved during failover? Any help appreciated!
br
Sib
P.S.
New to stackoverflow so pls bear with me...
You can probably look at Trident. It's basically an abstraction built on top of Storm. The documentation says:
Trident has first-class abstractions for reading from and writing to stateful sources. The state can either be internal to the topology – e.g., kept in-memory and backed by HDFS – or externally stored in a database like Memcached or Cassandra
In case of any failover, it says:
Trident manages state in a fault-tolerant way so that state updates are idempotent in the face of retries and failures.
You can go through the documentation for any further clarification.
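To make this concrete, here is a sketch along the lines of the word-count example from the Trident docs, where a count per word is kept as managed state and replayed batches do not double-count. Package names assume a recent Apache Storm release, and MemoryMapState would be swapped for a Memcached/Cassandra-backed state factory in a real fault-tolerant setup:

import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.FixedBatchSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class TridentStateExample {
    public static TridentTopology buildTopology() {
        // A test spout emitting single words; in a real topology this would be e.g. a Kafka spout.
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("word"), 3,
                new Values("storm"), new Values("trident"), new Values("storm"));
        spout.setCycle(true);

        TridentTopology topology = new TridentTopology();
        topology.newStream("words", spout)
                .groupBy(new Fields("word"))
                // Trident tracks a txid per batch, so a replayed batch updates the state only once.
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
        return topology;
    }
}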
Thanks (and credit) to the Storm user group!
http://mail-archives.apache.org/mod_mbox/storm-user/201312.mbox/%3C74083558E4509844944FF5CF2BA7B20F1060FD0E#ESESSMB305.ericsson.se%3E
In original Storm, both spouts and bolts are stateless. Storm manages restarting nodes, but it takes some effort to restore their state. There are two solutions I can think of:
If a message fails to process, Storm will replay it from the ROOT of the topology, and the replay logic has to be implemented by the user. So in this case I would put more state information (e.g. the ID of some external state storage and the ID of this task) into the messages.
Or you can use Trident. It provides a txid for each transaction and simplifies the storage process.
I'm OK with the first solution because my app doesn't require transactional operations and I have a better understanding of original Storm (Storm generates simpler logs than Trident does).
You can use the task ID.
Task ids are assigned at topology creation and are static. If a task dies/restarts or gets reassigned somewhere else, it will still have the same id.
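To make that concrete, here is a sketch of using the component id plus task id from the TopologyContext in prepare() as the key for externally persisted state. The MyStateStore helper is hypothetical (it stands in for whatever DB or file-system access you already have), and the package names assume a recent Apache Storm release:

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class StatefulBolt extends BaseRichBolt {
    private OutputCollector collector;
    private String stateKey;
    private long counter; // example of per-instance state

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // Component id + task id stays the same when the task is restarted or reassigned,
        // so it can be used to locate this instance's persisted state after failover.
        this.stateKey = context.getThisComponentId() + "-" + context.getThisTaskId();
        this.counter = MyStateStore.loadState(stateKey);
    }

    @Override
    public void execute(Tuple input) {
        counter++;
        MyStateStore.saveState(stateKey, counter);
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // no output fields in this sketch
    }

    // Stand-in for your real DB / file-system persistence.
    static class MyStateStore {
        static long loadState(String key) { return 0L; }
        static void saveState(String key, long value) { /* persist externally */ }
    }
}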