How to handle duplicate messages using Kafka streaming DSL functions - apache-kafka-streams

My requirement is to skip duplicate messages (messages with the same key) received from the INPUT topic using the Kafka Streams DSL API.
The source system may send duplicate messages to the INPUT topic in case of failures.
Flow:
Source System --> INPUT Topic --> Kafka Streams --> OUTPUT Topic
Currently I am using flatMap to generate multiple keys out of the payload, but flatMap is stateless, so it cannot skip duplicate messages received from the INPUT topic.
I am looking for a DSL API that can skip duplicate records received from the INPUT topic and also generate multiple key/value pairs before sending to the OUTPUT topic.
I thought the exactly-once configuration would deduplicate messages received from the INPUT topic based on their keys, but it does not seem to work; I have probably misunderstood what exactly-once is for.
Could you please shed some light on this?

My requirement is to skip duplicate messages (messages with the same key) received from the INPUT topic using the Kafka Streams DSL API.
Take a look at the EventDeduplication example at https://github.com/confluentinc/kafka-streams-examples, which does that. You can then adapt the example with the required flatMap functionality that is specific to your use case.
Here's the gist of the example:
final KStream<byte[], String> input = builder.stream(inputTopic);
final KStream<byte[], String> deduplicated = input.transform(
// In this example, we assume that the record value as-is represents a unique event ID by
// which we can perform de-duplication. If your records are different, adapt the extractor
// function as needed.
() -> new DeduplicationTransformer<>(windowSize.toMillis(), (key, value) -> value),
storeName);
deduplicated.to(outputTopic);
and
/**
 * @param maintainDurationPerEventInMs how long to "remember" a known event (or rather, an event
 *                                     ID), during the time of which any incoming duplicates of
 *                                     the event will be dropped, thereby de-duplicating the
 *                                     input.
 * @param idExtractor extracts a unique identifier from a record by which we de-duplicate input
 *                    records; if it returns null, the record will not be considered for
 *                    de-duping but forwarded as-is.
 */
DeduplicationTransformer(final long maintainDurationPerEventInMs, final KeyValueMapper<K, V, E> idExtractor) {
if (maintainDurationPerEventInMs < 1) {
throw new IllegalArgumentException("maintain duration per event must be >= 1");
}
leftDurationMs = maintainDurationPerEventInMs / 2;
rightDurationMs = maintainDurationPerEventInMs - leftDurationMs;
this.idExtractor = idExtractor;
}
@Override
@SuppressWarnings("unchecked")
public void init(final ProcessorContext context) {
this.context = context;
eventIdStore = (WindowStore<E, Long>) context.getStateStore(storeName);
}
public KeyValue<K, V> transform(final K key, final V value) {
final E eventId = idExtractor.apply(key, value);
if (eventId == null) {
return KeyValue.pair(key, value);
} else {
final KeyValue<K, V> output;
if (isDuplicate(eventId)) {
output = null;
updateTimestampOfExistingEventToPreventExpiry(eventId, context.timestamp());
} else {
output = KeyValue.pair(key, value);
rememberNewEvent(eventId, context.timestamp());
}
return output;
}
}
private boolean isDuplicate(final E eventId) {
final long eventTime = context.timestamp();
final WindowStoreIterator<Long> timeIterator = eventIdStore.fetch(
eventId,
eventTime - leftDurationMs,
eventTime + rightDurationMs);
final boolean isDuplicate = timeIterator.hasNext();
timeIterator.close();
return isDuplicate;
}
private void updateTimestampOfExistingEventToPreventExpiry(final E eventId, final long newTimestamp) {
eventIdStore.put(eventId, newTimestamp, newTimestamp);
}
private void rememberNewEvent(final E eventId, final long timestamp) {
eventIdStore.put(eventId, timestamp, timestamp);
}
@Override
public void close() {
// Note: The store should NOT be closed manually here via `eventIdStore.close()`!
// The Kafka Streams API will automatically close stores when necessary.
}
}
I am looking for a DSL API that can skip duplicate records received from the INPUT topic and also generate multiple key/value pairs before sending to the OUTPUT topic.
The DSL doesn't include such functionality out of the box, but the example above shows how you can easily build your own de-duplication logic by combining the DSL with the Processor API of Kafka Streams, with the use of Transformers.
I thought the exactly-once configuration would deduplicate messages received from the INPUT topic based on their keys, but it does not seem to work; I have probably misunderstood what exactly-once is for.
As Matthias J. Sax mentioned in his answer, these "duplicates" are not duplicates from the point of view of Kafka's exactly-once processing semantics. Kafka ensures that it will not introduce any such duplicates itself, but it cannot make such decisions out of the box for upstream data sources, which are a black box to Kafka.

Exactly-once can be used to ensure that consuming and processing an input topic does not result in duplicates in the output topic. However, from an exactly-once point of view, the duplicates in the input topic that you describe are not really duplicates but two regular input messages.
To remove input topic duplicates, you can use a transform() step with an attached state store (there is no built-in operator in the DSL that does what you want). For each input record, you first check whether the corresponding key is in the store. If not, you add it to the store and forward the message. If you find it in the store, you drop the input as a duplicate. Note that this only works with a 100% correctness guarantee if you enable exactly-once processing in your Kafka Streams application. Otherwise, even if you try to deduplicate, Kafka Streams could re-introduce duplicates in case of a failure.
Additionally, you need to decide how long you want to keep entries in the store. You could use a punctuation to remove old data from the store once you are sure that no further duplicates can appear in the input topic. One way to do this is to store the record timestamp (or maybe the offset) in the store, too. This way, you can compare the current time with the stored record time within punctuate() and delete old records (i.e., you would iterate over all entries in the store via store#all()).
After the transform() you apply your flatMap() (or you could merge your flatMap() code into transform() directly).
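The check-then-remember logic described above can be sketched without any Kafka dependency. This is only an illustration under stated assumptions: `DedupSketch`, `accept`, and `purge` are hypothetical names, the `HashMap` stands in for the state store, and `purge` stands in for what a punctuation callback would do.

```java
import java.util.HashMap;
import java.util.Map;

/** Library-free sketch of "check the store, then forward or drop"; not Kafka Streams code. */
class DedupSketch<K> {
    // Stand-in for the state store: key -> timestamp at which the key was first seen.
    private final Map<K, Long> seen = new HashMap<>();
    private final long retentionMs;

    DedupSketch(long retentionMs) {
        this.retentionMs = retentionMs;
    }

    /** Returns true if the record should be forwarded, false if it is a duplicate. */
    boolean accept(K key, long timestampMs) {
        if (seen.containsKey(key)) {
            return false;             // key already in the store: drop as duplicate
        }
        seen.put(key, timestampMs);   // first occurrence: remember it and forward
        return true;
    }

    /** What a punctuation would do: evict entries older than the retention time. */
    void purge(long nowMs) {
        seen.values().removeIf(ts -> nowMs - ts > retentionMs);
    }

    public static void main(String[] args) {
        DedupSketch<String> dedup = new DedupSketch<>(1_000L);
        System.out.println(dedup.accept("k1", 0L));      // true
        System.out.println(dedup.accept("k1", 10L));     // false
        dedup.purge(5_000L);                             // "k1" is older than 1 s, so it is evicted
        System.out.println(dedup.accept("k1", 5_000L));  // true
    }
}
```

In a real topology the map would be replaced by a state store attached to transform(), so that the remembered keys survive restarts and rebalances.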

It's also achievable with the DSL alone, using the SessionWindows changelog with caching disabled:
1. Wrap the value with a duplicate flag.
2. Set the flag to true in reduce() within the time window.
3. Filter out values whose flag is true.
4. Unwrap the original key and value.
Topology:
Serde<K> keySerde = ...;
Serde<V> valueSerde = ...;
Duration dedupWindowSize = ...;
Duration gracePeriod = ...;
DedupValueSerde<V> dedupValueSerde = new DedupValueSerde<>(valueSerde);
new StreamsBuilder()
.stream("input-topic", Consumed.with(keySerde, valueSerde))
.mapValues(v -> new DedupValue<>(v, false))
.groupByKey()
.windowedBy(SessionWindows.ofInactivityGapAndGrace(dedupWindowSize, gracePeriod))
.reduce(
(value1, value2) -> new DedupValue<>(value1.value(), true),
Materialized
.<K, DedupValue<V>, SessionStore<Bytes, byte[]>>with(keySerde, dedupValueSerde)
.withCachingDisabled()
)
.toStream()
.filterNot((wk, dv) -> dv == null || dv.duplicate())
.selectKey((wk, dv) -> wk.key())
.mapValues(DedupValue::value)
.to("output-topic", Produced.with(keySerde, valueSerde));
Value wrapper:
record DedupValue<V>(V value, boolean duplicate) { }
Value wrapper SerDe (example):
public class DedupValueSerde<V> extends WrapperSerde<DedupValue<V>> {
public DedupValueSerde(Serde<V> vSerde) {
super(new DvSerializer<>(vSerde.serializer()), new DvDeserializer<>(vSerde.deserializer()));
}
private record DvSerializer<V>(Serializer<V> vSerializer) implements Serializer<DedupValue<V>> {
@Override
public byte[] serialize(String topic, DedupValue<V> data) {
byte[] vBytes = vSerializer.serialize(topic, data.value());
return ByteBuffer
.allocate(vBytes.length + 1)
.put(data.duplicate() ? (byte) 1 : (byte) 0)
.put(vBytes)
.array();
}
}
private record DvDeserializer<V>(Deserializer<V> vDeserializer) implements Deserializer<DedupValue<V>> {
@Override
public DedupValue<V> deserialize(String topic, byte[] data) {
ByteBuffer buffer = ByteBuffer.wrap(data);
boolean duplicate = buffer.get() == (byte) 1;
int remainingSize = buffer.remaining();
byte[] vBytes = new byte[remainingSize];
buffer.get(vBytes);
V value = vDeserializer.deserialize(topic, vBytes);
return new DedupValue<>(value, duplicate);
}
}
}

Related

How to check the content of 2 or more values that can be contained in the resulting stream of elements - Flux<T> (spring WebFlux)

I have a method that validates whether a stream contains certain elements.
Task
For example, there is a sequence of numbers (the numbers are not repeated, and each number is larger than the previous one):
1, 20, 35, 39, 45, 43
I need to check whether a specified range is present in this stream, for example 35 ... 49. If there is no such range, an exception must be thrown.
But since the processing is asynchronous, the usual methods do not work here: we process elements as they arrive and do not know what the next element will be.
The elements of this sequence need to be summed, and the summation should be done incrementally (as soon as the start of the range is found).
While processing, I need to check whether the end point occurs in the generated sequence; if the entire stream of elements has been received and the end point was not among them, an exception must be thrown, since the upper limit of the specified range was never received.
Also, the calculation must not start until the starting point of the specified range is found,
and we cannot block the stream, otherwise we will get the same exception.
How can such a check be organized?
With a regular (non-reactive) collection, it looks like this:
private boolean isRangeValues() {
List<BigInteger> sequence = Arrays
.asList(
new BigInteger("11"),
new BigInteger("15"),
new BigInteger("23"),
new BigInteger("27"),
new BigInteger("30"));
BigInteger startRange = new BigInteger("15");
BigInteger finishRange = new BigInteger("27");
boolean isStartRangeMember = sequence.contains(startRange);
boolean isFinishRangeMember = sequence.contains(finishRange);
return isStartRangeMember && isFinishRangeMember;
}
But my task is to process a stream of elements that are generated at some interval.
A reactive stack is used in Spring to get the result, and I receive it as a Flux.
Simply converting it to a list and processing that will not work; it results in an exception.
After these elements are filtered, the stream will continue to be processed.
But if data validation fails during filtering (in this case, the required elements are missing), I need to throw an exception, which will be handled globally and returned to the client.
@GetMapping("v1/sequence/{startRange}/{endRange}")
Mono<BigInteger> getSumSequence(
        @PathVariable BigInteger startRange,
        @PathVariable BigInteger endRange) {
    Flux<BigInteger> sequenceFlux = sequenceGenerated();
    validateSequence(sequenceFlux);
    return sum(sequenceFlux);
}
private Mono<BigInteger> sum (Flux<BigInteger> sequenceFlux ){
.....
}
private void validateSequence(Flux<BigInteger> sequenceFlux){
... is wrong
throw new RuntimeException();
}
}
I came up with some solution (I published it in this topic).
public void validateRangeSequence(sequenceDto dto) {
Flux<BigInteger> sequenceFlux = dto.getSequenceFlux();
BigInteger startRange = dto.getStartRange();
BigInteger endRange = dto.getEndRange();
Mono<Boolean> isStartRangeMember = sequenceFlux.hasElement(startRange);
Mono<Boolean> isEndRangeMember = sequenceFlux.hasElement(endRange);
if ( !isStartRangeMember.equals(isEndRangeMember) ){
throw new RuntimeException("error");
}
But it doesn't work as expected, even the correct results cause an exception.
Update
public void validateRangeSeq(RangeSequenceDto dto) {
Flux<BigInteger> sequenceFlux = dto.getSequenceFlux();
BigInteger startRange = dto.getStartRange();
BigInteger endRange = dto.getEndRange();
Mono<Boolean> isStartRangeMember = sequenceFlux.hasElement(startRange);
Mono<Boolean> isEndRangeMember = sequenceFlux.hasElement(endRange);
sequenceFlux
.handle((number, sink) -> {
if (!isStartRangeMember.equals(isEndRangeMember) ){
sink.error(new RangeWrongSequenceExc("It is wrong given range!."));
} else {
sink.next(number);
}
});
}
Unfortunately, that solution doesn't work either.
sequenceFlux
.handle(((bigInteger, synchronousSink) -> {
if(!bigInteger.equals(startRange)){
synchronousSink.error(new RuntimeException("!!!!!!!!!!! ---- Wrong range!"));
} else {
synchronousSink.next(bigInteger);
}
}));
This piece of code doesn't work either (it does not react in any way).
What do you think about this? Should it be done this way, or are there other approaches?
I am not familiar with the reactive stack in Spring and do not know how to handle such a situation.
Maybe someone has ideas on how to organize such filtering without blocking the processing of elements in the stream.
You can try it like this:
Flux<Integer> yourStream = Flux.just(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).share();
Flux.zip(
        yourStream.filter(integer -> integer.equals(4)),
        yourStream.filter(integer -> integer.equals(6)),
        // Tuples is reactor.util.function.Tuples
        (integer, integer2) -> Tuples.of(integer, integer2))
    .subscribe(System.out::println);
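Independently of Reactor, the check the question describes is a single pass with a small amount of state: start summing when the lower bound arrives, stop when the upper bound arrives, and fail if the sequence ends first. Below is a minimal plain-Java sketch of that state machine. `RangeSumSketch` and `sumInRange` are hypothetical names, and an `Iterator` stands in for the stream of elements; in a reactive pipeline the same state would be updated per element and the error raised on completion.

```java
import java.math.BigInteger;
import java.util.Iterator;
import java.util.List;

/** Library-free sketch of a one-pass "sum between two marker elements" check. */
class RangeSumSketch {

    /**
     * Sums the elements from startRange to endRange (both inclusive), reading each element once.
     * Throws if the sequence ends before both bounds have been seen in that order.
     */
    static BigInteger sumInRange(Iterator<BigInteger> sequence,
                                 BigInteger startRange,
                                 BigInteger endRange) {
        boolean started = false;
        BigInteger sum = BigInteger.ZERO;
        while (sequence.hasNext()) {
            BigInteger n = sequence.next();
            if (!started && n.equals(startRange)) {
                started = true;            // lower bound found: begin accumulating
            }
            if (started) {
                sum = sum.add(n);
                if (n.equals(endRange)) {
                    return sum;            // upper bound found: the range is complete
                }
            }
        }
        throw new IllegalArgumentException(
                "Range " + startRange + ".." + endRange + " not found in sequence");
    }

    public static void main(String[] args) {
        List<BigInteger> sequence = List.of(
                new BigInteger("11"), new BigInteger("15"), new BigInteger("23"),
                new BigInteger("27"), new BigInteger("30"));
        // 15 + 23 + 27
        System.out.println(sumInRange(sequence.iterator(),
                new BigInteger("15"), new BigInteger("27"))); // prints 65
    }
}
```

The sample values are the ones from the question's own list-based example.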

Query KTable in the same Application where it is created

I have a Kafka Streams application in which I read from a topic, do an aggregation, and materialize the result in a KTable. I then create a stream and run some logic on it. Now, in the stream processing, I want to use some data from the aforementioned KTable. Once I start the stream app, how do I get access to the KTable again? I don't want to push the KTable to a new topic.
KStream<String, MyClass> source = builder.stream("my-topic");
KTable<Windowed<String>, Long> kTable =
    source.groupBy((key, value) -> value.getKey(),
                   Grouped.<String, MyClass>as("repartition-1")
                       .withKeySerde(Serdes.String())
                       .withValueSerde(new MyClassSerDes()))
          .windowedBy(TimeWindows.of(Duration.ofSeconds(5)))
          .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("test-store")
                     .withKeySerde(Serdes.String())
                     .withValueSerde(Serdes.Long()));
Here I want to use data from the kTable.
inputstream.groupByKey()
.windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
.count(Materialized.<myKey, Long, WindowStore<Bytes, byte[]>>as("str")
.withRetention(Duration.ofMinutes(30)))
.toStream()
.filter((k, v) -> {
// Here get the count for the previous Window.
// Use that count for some computation here.
}
You can add the KTable store to a processor/transformer. For your case, you can replace the filter with flatTransform (or a sibling such as transform, depending on whether you need access to the key) and connect the store to the operator:
inputstream.groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
    .count(Materialized.<myKey, Long, WindowStore<Bytes, byte[]>>as("str")
        .withRetention(Duration.ofMinutes(30))
    )
    .toStream()
    // requires v2.2; otherwise use `transform()`
    // if you don't need access to the key, consider using `flatTransformValues` (v2.3)
    .flatTransform(
        () -> new Transformer<Windowed<myKey>,
                              Long,
                              List<KeyValue<Windowed<myKey>, Long>>>() {
            private ReadOnlyWindowStore<myKey, Long> store;
            public void init(final ProcessorContext context) {
                // get a handle on the store by its name
                // as specified via `Materialized` above;
                // should be read-only
                store = (ReadOnlyWindowStore<myKey, Long>) context.getStateStore("str");
            }
            public List<KeyValue<Windowed<myKey>, Long>> transform(Windowed<myKey> key,
                                                                   Long value) {
                // access `store` as you wish to make a filtering decision
                if ( ... ) {
                    // record passes
                    return Collections.singletonList(KeyValue.pair(key, value));
                } else {
                    // drop record
                    return Collections.emptyList();
                }
            }
            public void close() {} // nothing to do
        },
        "str" // connect the KTable store to the transformer using its name
              // as specified via `Materialized` above
    );
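For the "count of the previous window" part of the question, the lookup itself is simple arithmetic when the windows are tumbling windows of a fixed size: the previous window starts exactly one window size earlier. The sketch below shows only that arithmetic. `PreviousWindowSketch` is a hypothetical name and a nested map stands in for the window store; inside the transformer, the current window start would presumably come from `key.window().start()` and the lookup would go against the store rather than a map.

```java
import java.util.HashMap;
import java.util.Map;

/** Library-free sketch of looking up the previous tumbling window's count. */
class PreviousWindowSketch {
    // Stand-in for the window store "str": key -> (window start in ms -> count).
    private final Map<String, Map<Long, Long>> counts = new HashMap<>();
    private final long windowSizeMs;

    PreviousWindowSketch(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    void put(String key, long windowStartMs, long count) {
        counts.computeIfAbsent(key, k -> new HashMap<>()).put(windowStartMs, count);
    }

    /** Count of the window immediately before the one starting at windowStartMs, or null if none. */
    Long previousCount(String key, long windowStartMs) {
        Map<Long, Long> perWindow = counts.get(key);
        return perWindow == null ? null : perWindow.get(windowStartMs - windowSizeMs);
    }

    public static void main(String[] args) {
        PreviousWindowSketch store = new PreviousWindowSketch(60_000L); // 1-minute windows
        store.put("a", 0L, 3L);
        store.put("a", 60_000L, 5L);
        System.out.println(store.previousCount("a", 60_000L)); // prints 3
        System.out.println(store.previousCount("a", 0L));      // prints null
    }
}
```

A missing previous window (the null case) has to be handled explicitly, for example for the first window of a key or once the store's retention has expired.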

Filtering data bolt Storm

I have a simple Storm topology which reads the data from Kafka, parses and extracts message fields. I would like to filter the stream of tuples by one of the fields values and perform counting aggregation on another one. How can I do this in Storm?
I haven't found respective methods for tuples (filter, aggregate) so should I perform these functions directly on the field values?
Here is a topology:
topologyBuilder.setSpout("kafka_spout", new KafkaSpout(spoutConfig), 1)
topologyBuilder.setBolt("parser_bolt", new ParserBolt()).shuffleGrouping("kafka_spout")
topologyBuilder.setBolt("transformer_bolt", new KafkaTwitterBolt()).shuffleGrouping("parser_bolt")
val config = new Config()
cluster.submitTopology("kafkaTest", config, topologyBuilder.createTopology())
I have set up KafkaTwitterBolt for counting and filtering with the parsed fields. So far I have only managed to filter over the whole list of values, not by a specific field:
class KafkaTwitterBolt() extends BaseBasicBolt{
override def execute(input: Tuple, collector: BasicOutputCollector): Unit = {
val tweetValues = input.getValues.asScala.toList
val filterTweets = tweetValues
.map(_.toString)
.filter(_ contains "big data")
val resultAllValues = new Values(filterTweets)
collector.emit(resultAllValues)
}
override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = {
declarer.declare(new Fields("created_at", "id", "text", "source", "timestamp_ms",
"user.id", "user.name", "user.location", "user.url", "user.description", "user.followers_count",
"user.friends_count", "user.lang", "user.favorite_count", "entities.hashtags"))
}
}
It turns out the Storm core API does not allow that; in order to filter on any field, Trident should be used (it has a built-in filter function).
The code would look like this:
val tridentTopology = new TridentTopology()
val stream = tridentTopology.newStream("kafka_spout",
new KafkaTridentSpoutOpaque(spoutConfig))
.map(new ParserMapFunction, new Fields("created_at", "id", "text", "source", "timestamp_ms",
"user.id", "user.name", "user.location", "user.url", "user.description", "user.followers_count",
"user.friends_count", "user.favorite_count", "user.lang", "entities.hashtags"))
.filter(new LanguageFilter)
Filtering function itself:
class LanguageFilter extends BaseFilter{
override def isKeep(tuple: TridentTuple): Boolean = {
val language = tuple.getStringByField("user.lang")
println(s"TWEET: $language")
language.contains("en")
}
}
Your answer at https://stackoverflow.com/a/59805582/8845188 is a little wrong. The Storm core API does allow filtering and aggregation, you just have to write the logic yourself.
A filtering bolt is just a bolt that discards some tuples, and passes others on. For instance, the following bolt will filter out tuples based on a string field:
class FilteringBolt() extends BaseBasicBolt{
override def execute(input: Tuple, collector: BasicOutputCollector): Unit = {
val values = input.getValues // java.util.List, so get(0) and emit(values) below work as written
if ("Pass me".equals(values.get(0))) {
collector.emit(values)
}
//Emitting nothing means discarding the tuple
}
override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = {
declarer.declare(new Fields("some-field"))
}
}
An aggregating bolt is just a bolt that collects multiple tuples, and then emits a new aggregate tuple anchored in the original tuples:
class AggregatingBolt extends BaseRichBolt {
List<Tuple> tuplesToAggregate = ...;
int counter = 0;
@Override
public void execute(Tuple input) {
tuplesToAggregate.add(input);
counter++;
if (counter == 10) {
Values aggregateTuple = ... //create a new set of values based on tuplesToAggregate
collector.emit(tuplesToAggregate, aggregateTuple) //This anchors the new aggregate tuple to all the original tuples, so if the aggregate fails, the original tuples are replayed.
for (Tuple t : tuplesToAggregate) {
collector.ack(t); //Ack the original tuples now that this bolt is done with them
//Note that you MUST emit before you ack, or the at-least-once guarantee will be broken.
}
tuplesToAggregate.clear();
counter = 0;
}
//Note that we don't ack the input tuples until the aggregate gets emitted. This lets us replay all the aggregated tuples in case the aggregate fails
}
}
Note that for aggregation, you will need to extend BaseRichBolt and do acking manually, since you want to delay acking a tuple until it has been included in an aggregate tuple.
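To connect this back to the question (filter on one field, count on another), the body of such a bolt boils down to the following logic. This is a plain-Java sketch, not Storm code: `FilterAndCountSketch` is a hypothetical name, a `Map<String, String>` stands in for a tuple, and the choice of "user.lang" as the filter field and "user.location" as the counted field is an assumption for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Library-free sketch of "filter by one field, count by another". */
class FilterAndCountSketch {
    // Running counts per value of the counted field.
    private final Map<String, Integer> counts = new HashMap<>();

    /** Mirrors what execute() would do for one tuple. */
    void execute(Map<String, String> tuple) {
        if (!"en".equals(tuple.get("user.lang"))) {
            return; // emitting nothing means discarding the tuple
        }
        counts.merge(tuple.get("user.location"), 1, Integer::sum);
    }

    int count(String location) {
        return counts.getOrDefault(location, 0);
    }

    public static void main(String[] args) {
        FilterAndCountSketch bolt = new FilterAndCountSketch();
        bolt.execute(Map.of("user.lang", "en", "user.location", "Berlin"));
        bolt.execute(Map.of("user.lang", "de", "user.location", "Berlin")); // filtered out
        bolt.execute(Map.of("user.lang", "en", "user.location", "Berlin"));
        System.out.println(bolt.count("Berlin")); // prints 2
    }
}
```

In a real topology the counts are only correct if all tuples with the same counted value reach the same bolt instance, which is what a fields grouping on that field is for.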

kafka stream make a local aggregation

I am trying to make a local aggregation.
The input topic has records containing multiple elements and I am using flatmap to split the record into multiple records with another key (here element_id). This triggers a re-partition as I am applying a grouping for aggregation later in the stream process.
Problem: there are way too many records in this repartition topic and the app cannot handle them (lag is increasing).
Here is a example of the incoming data
key: another ID
value:
{
"cat_1": {
"element_1" : 0,
"element_2" : 1,
"element_3" : 0
},
"cat_2": {
"element_1" : 0,
"element_2" : 1,
"element_3" : 1
}
}
And an example of the wanted aggregation result:
key : element_2
value:
{
"cat_1": 1,
"cat_2": 1
}
So I would like to make a first "local aggregation" and stop splitting incoming records, meaning that I want to aggregate all elements locally (no re-partition) for example in a 30 seconds window, then produce result per element in a topic. A stream consuming this topic later aggregates at a higher level.
I am using the Streams DSL, but I am not sure it is enough. I tried to use the process() and transform() methods, which let me use the Processor API, but I don't know how to properly produce records in a punctuation or put records into a stream.
How could I achieve that ? Thank you
transform() returns a KStream on which you can call to() to write the results into a topic.
stream.transform(...).to("output_topic");
In a punctuation you can call context.forward() to send a record downstream. You still need to call to() to write the forwarded record into a topic.
To implement a custom aggregation consider the following pseudo-ish code:
builder = new StreamsBuilder();
final StoreBuilder<KeyValueStore<Integer, Integer>> keyValueStoreBuilder =
Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore(stateStoreName),
Serdes.Integer(),
Serdes.Integer());
builder.addStateStore(keyValueStoreBuilder);
stream = builder.stream(topic, Consumed.with(Serdes.Integer(), Serdes.Integer()));
stream.transform(() ->
new Transformer<Integer, Integer, KeyValue<Integer, Integer>>() {
private KeyValueStore<Integer, Integer> state;
@Override
public void init(final ProcessorContext context) {
state = (KeyValueStore<Integer, Integer>) context.getStateStore(stateStoreName);
context.schedule(
Duration.ofMinutes(1),
PunctuationType.STREAM_TIME,
timestamp -> {
// You can get aggregates from the state store here
// Then you can send the aggregates downstream
// with context.forward();
// Alternatively, you can output the aggregate in the
// transform() method as shown below
}
);
}
@Override
public KeyValue<Integer, Integer> transform(final Integer key, final Integer value) {
// Get existing aggregates from the state store with state.get().
// Update aggregates and write them into the state store with state.put().
// Depending on some condition, e.g., 10 seen records,
// output an aggregate downstream by returning the output.
// You can output multiple aggregates by using KStream#flatTransform().
// Alternatively, you can output the aggregate in a
// punctuation as shown above
}
@Override
public void close() {
}
}, stateStoreName)
With this manual aggregation you could implement the higher level aggregation in the same streams app and leverage re-partitioning.
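process() is a terminal operation in the API this answer was written against, i.e., it does not return anything. (Newer Kafka Streams releases, 3.3 and later, add a process() variant that returns a KStream.)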
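The state handling inside such a transformer can be sketched in plain Java, using the record layout from the question (category -> element -> value). `LocalAggregationSketch`, `add`, and `flush` are hypothetical names: `add` corresponds to what transform() would do per record, `flush` to what the punctuation would forward, and the in-memory map stands in for the state store.

```java
import java.util.Map;
import java.util.TreeMap;

/** Library-free sketch of a local pre-aggregation that is flushed periodically. */
class LocalAggregationSketch {
    // element id -> (category -> running sum); stand-in for a key-value state store.
    private final Map<String, Map<String, Integer>> state = new TreeMap<>();

    /** Per input record: fold the nested value into local state, without re-keying the stream. */
    void add(Map<String, Map<String, Integer>> record) {
        record.forEach((category, elements) ->
                elements.forEach((element, value) ->
                        state.computeIfAbsent(element, e -> new TreeMap<>())
                             .merge(category, value, Integer::sum)));
    }

    /** From a punctuation: return one aggregate per element and reset the local state. */
    Map<String, Map<String, Integer>> flush() {
        Map<String, Map<String, Integer>> out = new TreeMap<>(state);
        state.clear();
        return out;
    }

    public static void main(String[] args) {
        LocalAggregationSketch agg = new LocalAggregationSketch();
        agg.add(Map.of(
                "cat_1", Map.of("element_1", 0, "element_2", 1, "element_3", 0),
                "cat_2", Map.of("element_1", 0, "element_2", 1, "element_3", 1)));
        System.out.println(agg.flush().get("element_2")); // prints {cat_1=1, cat_2=1}
    }
}
```

Only the flushed per-element aggregates would then be written to the intermediate topic, so its volume depends on the number of distinct elements per flush interval rather than on the number of input records.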

How to repeat Job with Partitioner when data is dynamic with Spring Batch?

I am trying to develop a batch process using Spring Batch + Spring Boot (Java config), but I have a problem doing so. I have software that has a database and a Java API, and I read records from there. The batch process should retrieve all the documents whose expiration date is earlier than a certain date, update the date, and save them again in the same database.
My first approach was reading the records 100 at a time: the ItemReader retrieves 100 records, I process them one by one, and finally I write them back. In the reader, I put this code:
public class DocumentItemReader implements ItemReader<Document> {
public List<Document> documents = new ArrayList<>();
@Override
public Document read() throws Exception, UnexpectedInputException, ParseException, NonTransientResourceException {
if(documents.isEmpty()) {
getDocuments(); // This method retrieve 100 documents and store them in "documents" list.
if(documents.isEmpty()) return null;
}
Document doc = documents.get(0);
documents.remove(0);
return doc;
}
}
So, with this code, the reader reads from the database until no records are found. When the "getDocuments()" method doesn't retrieve any documents, the list is empty and the reader returns null (so the job finishes). Everything worked fine here.
However, the problem appears if I want to use several threads. In this case, I started using the Partitioner approach instead of multi-threading. The reason is that I read from the same database, so if I repeat the full step with several threads, all of them will find the same records, and I cannot use pagination (see below).
Another problem is that database records are updated dynamically, so I cannot use pagination. For example, suppose I have 200 records, all of which are about to expire, so the process is going to retrieve them. Now imagine I retrieve 10 with one thread, and before anything else happens, that thread processes one and updates it in the same database. The next thread cannot retrieve records 11 to 20, because the first record no longer appears in the search (it has been processed and its date updated, so it no longer matches the query).
It is a little difficult to understand, and some things may sound strange, but in my project:
I am forced to use the same database to read and write.
I can have millions of documents, so I cannot read all the records at the same time. I need to read them 100 by 100, or 500 by 500.
I need to use several threads.
I cannot use pagination, as the query to the database will retrieve different documents each time it is executed.
So, after hours of thinking, I believe the only possible solution is to repeat the job until the query retrieves no documents. Is this possible? I want to do something like what the step does (do something until null is returned): repeat the job until the query returns zero records.
If this is not a good approach, I will appreciate other possible solutions.
Thank you.
Maybe you can add a partitioner to your step that will:
Select all the ids of the data that needs to be updated (and other columns if needed)
Split them into x partitions (x = the gridSize parameter) and write them to temporary files (one per partition).
Register the file name to read in the executionContext
Then your reader no longer reads from the database but from the partition file.
It seems complicated but it is not that bad. Here is an example that handles millions of records using a JDBC query; it can easily be adapted to your use case:
public class JdbcToFilePartitioner implements Partitioner {
/** number of records by database fetch */
private int fetchSize = 100;
/** working directory */
private File tmpDir;
/** limit the number of item to select */
private Long nbItemMax;
@Override
public Map<String, ExecutionContext> partition(final int gridSize) {
// Create contexts for each partition
Map<String, ExecutionContext> executionsContexte = createExecutionsContext(gridSize);
// Fill partition with ids to handle
getIdsAndFillPartitionFiles(executionsContexte);
return executionsContexte;
}
/**
 * @param gridSize number of partitions
 * @return map of execution context, one for each partition
 */
private Map<String, ExecutionContext> createExecutionsContext(final int gridSize) {
final Map<String, ExecutionContext> map = new HashMap<>();
for (int partitionId = 0; partitionId < gridSize; partitionId++) {
map.put(String.valueOf(partitionId), createContext(partitionId));
}
return map;
}
/**
 * @param partitionId id of the partition to create context
 * @return created executionContext
 */
private ExecutionContext createContext(final int partitionId) {
final ExecutionContext context = new ExecutionContext();
String fileName = tmpDir + File.separator + "partition_" + partitionId + ".txt";
context.put(PartitionerConstantes.ID_GRID.getCode(), partitionId);
context.put(PartitionerConstantes.FILE_NAME.getCode(), fileName);
if (contextParameters != null) {
for (Entry<String, Object> entry : contextParameters.entrySet()) {
context.put(entry.getKey(), entry.getValue());
}
}
return context;
}
private void getIdsAndFillPartitionFiles(final Map<String, ExecutionContext> executionsContexte) {
List<BufferedWriter> fileWriters = new ArrayList<>();
try {
// BufferedWriter for each partition
for (int i = 0; i < executionsContexte.size(); i++) {
BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter(executionsContexte.get(String.valueOf(i)).getString(
PartitionerConstantes.FILE_NAME.getCode())));
fileWriters.add(bufferedWriter);
}
// Fetching the datas
ScrollableResults results = runQuery();
// Get the result and fill the files
int currentPartition = 0;
int nbWriting = 0;
while (results.next()) {
fileWriters.get(currentPartition).write(results.get(0).toString());
fileWriters.get(currentPartition).newLine();
currentPartition++;
nbWriting++;
// If we already write on all partitions, we start again
if (currentPartition >= executionsContexte.size()) {
currentPartition = 0;
}
// If we reach the max item to read we stop
if (nbItemMax != null && nbItemMax != 0 && nbWriting >= nbItemMax) {
break;
}
}
// closing
results.close();
session.close();
for (BufferedWriter bufferedWriter : fileWriters) {
bufferedWriter.close();
}
} catch (IOException | SQLException e) {
throw new UnexpectedJobExecutionException("Error writing partition file", e);
}
}
private ScrollableResults runQuery() {
...
}
}
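The core of the partitioner above is the round-robin distribution of ids over the partitions. Stripped of file handling and the database query, it can be sketched like this; `RoundRobinSketch` and `split` are hypothetical names, and in-memory lists stand in for the per-partition temporary files.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Library-free sketch of distributing ids round-robin over gridSize partitions. */
class RoundRobinSketch {

    static <T> List<List<T>> split(Iterator<T> ids, int gridSize) {
        if (gridSize < 1) {
            throw new IllegalArgumentException("gridSize must be >= 1");
        }
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 0; i < gridSize; i++) {
            partitions.add(new ArrayList<>());
        }
        int current = 0;
        while (ids.hasNext()) {
            partitions.get(current).add(ids.next());
            current = (current + 1) % gridSize; // wrap around after the last partition
        }
        return partitions;
    }

    public static void main(String[] args) {
        System.out.println(split(List.of(1, 2, 3, 4, 5).iterator(), 2)); // prints [[1, 3, 5], [2, 4]]
    }
}
```

Because the ids are selected once, before any partition starts processing, later updates to the records no longer change which records each partition sees, which is the property the question could not get from pagination.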

Resources