I have a Kafka Streams application in which I read from a topic, do an aggregation, and materialize the result in a KTable. I then create a stream and run some logic on it. In the stream processing I want to use some data from the aforementioned KTable. Once I start the streams app, how do I get access to that KTable again? I don't want to push the KTable to a new topic.
KStream<String, MyClass> source = builder.stream("my-topic");

KTable<Windowed<String>, Long> kTable = source
    .groupBy((key, value) -> value.getKey(),
        Grouped.<String, MyClass>as("repartition-1")
            .withKeySerde(Serdes.String())
            .withValueSerde(new MyClassSerDes()))
    .windowedBy(TimeWindows.of(Duration.ofSeconds(5)))
    .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("test-store")
        .withKeySerde(Serdes.String())
        .withValueSerde(Serdes.Long()));
Here I want to use data from the kTable.
inputstream.groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
    .count(Materialized.<myKey, Long, WindowStore<Bytes, byte[]>>as("str")
        .withRetention(Duration.ofMinutes(30)))
    .toStream()
    .filter((k, v) -> {
        // Here get the count for the previous window.
        // Use that count for some computation here.
        return true; // placeholder
    });
You can add the KTable store to a processor/transformer. For your case, you can replace the filter with flatTransform (or any sibling like transform, depending on whether you need access to the key) and connect the store to the operator:
inputstream.groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
    .count(Materialized.<myKey, Long, WindowStore<Bytes, byte[]>>as("str")
        .withRetention(Duration.ofMinutes(30)))
    .toStream()
    // requires v2.2; otherwise use `transform()`
    // if you don't need access to the key, consider using `flatTransformValues()` (v2.3)
    .flatTransform(
        () -> new Transformer<Windowed<myKey>,
                              Long,
                              List<KeyValue<Windowed<myKey>, Long>>>() {

            private ReadOnlyWindowStore<myKey, Long> store;

            @Override
            public void init(final ProcessorContext context) {
                // get a handle on the store by its name
                // as specified via `Materialized` above;
                // should be read-only
                store = (ReadOnlyWindowStore<myKey, Long>) context.getStateStore("str");
            }

            @Override
            public List<KeyValue<Windowed<myKey>, Long>> transform(final Windowed<myKey> key,
                                                                   final Long value) {
                // access `store` as you wish to make a filtering decision
                if ( ... ) {
                    // record passes
                    return Collections.singletonList(KeyValue.pair(key, value));
                } else {
                    // drop record
                    return Collections.emptyList();
                }
            }

            @Override
            public void close() {} // nothing to do
        },
        "str" // connect the KTable store to the transformer using its name
              // as specified via `Materialized` above
    );
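For illustration, the filtering decision inside transform() could read the count of the window immediately preceding the record's window, roughly like this (a sketch only; the window-size arithmetic and the comparison rule are assumptions, not part of the original answer):
// sketch: look up the count of the preceding window in the connected store
final long windowSizeMs = Duration.ofMinutes(1).toMillis(); // size of the windows held in the store
final long previousWindowStart = key.window().start() - windowSizeMs;
final Long previousCount = store.fetch(key.key(), previousWindowStart); // single-point lookup
if (previousCount != null && value > previousCount) {
    return Collections.singletonList(KeyValue.pair(key, value)); // count increased: record passes
}
return Collections.emptyList(); // otherwise: drop record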
Running 3 Kafka Streams instances with exactly-once, but experiencing loss of data when restarting one of the streams instances (the other 2 doing re-balance).
If I restart the instance quickly (within session.timeout.ms), without the other 2 re-balancing, everything works as expected.
Input and output topics are created with 6 partitions.
Running 3 Kafka brokers.
Producing data with a single Python producer in a loop (acks='all').
Outputting data to SQL with a single Kafka Connect instance configured with consumer.override.isolation.level=read_committed
I am expecting the aggregated data to have the same count as the output of my python loop. And this works just fine as long as Kafka Streams is not going into re-balance state.
In short, the streams instance does the following:
Collect session data and update a session state.
Re-partition the delta updates on the session state and sum them using a windowed aggregation.
Grepping through my own debug output I'm inclined to believe the problem is related to transferring the aggregation state:
Record A which is an update to session X is adding 0 to the aggregation.
Output from the aggregation is now 6
Record B which is an update to session X is adding 1 to the aggregation.
Output from the aggregation is now 7
Rebalance
Update to session X (which may or may not be a replay of Record A) is adding 0 to the aggregation.
Output from the aggregation is now 6
Simplified and stripped-down version of the code (not really a Java developer, so sorry for non-optimal syntax):
public static void main(String[] args) throws Exception {
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 2);
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
final StoreBuilder<KeyValueStore<MediaKey, SessionState>> storeBuilder = Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore(SESSION_STATE_STORE),
mediaKeySerde,
sessionStateSerde
);
builder.addStateStore(storeBuilder);
KStream<String, IncomingData> incomingData = builder.stream(
SESSION_TOPIC, Consumed.with(Serdes.String(), mediaDataSerde));
KGroupedStream<MediaKey, AggregatedData> mediaData = incomingData
.transform(new SessionProcessingSupplier(SESSION_STATE_STORE), SESSION_STATE_STORE)
.selectKey(...)
.groupByKey(...);
KTable<Windowed<MediaKey>, AggregatedData> aggregatedMedia = mediaData
.windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
.aggregate(
new Initializer<AggregatedData>() {...},
new Aggregator<MediaKey, AggregatedData, AggregatedData>() {
@Override
public AggregatedData apply(MediaKey key, AggregatedData input, AggregatedData aggregated) {
// ... Add stuff to "aggregated"
return aggregated;
}
},
Materialized.<MediaKey, AggregatedData, WindowStore<Bytes, byte[]>>as("aggregated-media")
.withValueSerde(aggregatedDataSerde)
);
aggregatedMedia.toStream()
.map(new KeyValueMapper<Windowed<MediaKey>, AggregatedData, KeyValue<MediaKey, PostgresOutput>>() {
@Override
public KeyValue<MediaKey, PostgresOutput> apply(Windowed<MediaKey> mediaidKey, AggregatedData data) {
// ... Some re-formatting and then
return new KeyValue<>(mediaidKey.key(), output);
}
})
.to(POSTGRES_TOPIC, Produced.with(mediaKeySerde, postgresSerde));
final Topology topology = builder.build();
final KafkaStreams streams = new KafkaStreams(topology, props);
// Shutdown hook
}
and:
public class SessionProcessingSupplier implements TransformerSupplier<String, Processing.IncomingData, KeyValue<String, Processing.AggregatedData>> {
@Override
public Transformer<String, Processing.IncomingData, KeyValue<String, Processing.AggregatedData>> get() {
return new Transformer<String, Processing.IncomingData, KeyValue<String, Processing.AggregatedData>>() {
@Override
public void init(ProcessorContext processorContext) {
this.context = processorContext;
this.stateStore = (KeyValueStore<String, Processing.SessionState>) context.getStateStore(sessionStateStoreName);
}
@Override
public KeyValue<String, Processing.AggregatedData> transform(String sessionid, Processing.IncomingData data) {
Processing.SessionState state = this.stateStore.get(sessionid);
// ... Update or create session state
return new KeyValue<String, Processing.AggregatedData>(sessionid, output);
}
};
}
}
I am trying to make a local aggregation.
The input topic has records containing multiple elements, and I am using flatMap to split each record into multiple records with another key (here element_id). This triggers a re-partition because I apply a grouping for aggregation later in the stream process.
Problem: there are way too many records in this repartition topic and the app cannot handle them (lag is increasing).
Here is an example of the incoming data:
key: another ID
value:
{
"cat_1": {
"element_1" : 0,
"element_2" : 1,
"element_3" : 0
},
"cat_2": {
"element_1" : 0,
"element_2" : 1,
"element_3" : 1
}
}
And an example of the desired aggregation result:
key : element_2
value:
{
"cat_1": 1,
"cat_2": 1
}
So I would like to perform a first "local aggregation" and stop splitting incoming records, meaning that I want to aggregate all elements locally (no re-partition), for example in a 30-second window, and then produce one result per element to a topic. A stream consuming this topic would later aggregate at a higher level.
I am using the Streams DSL, but I am not sure it is sufficient. I tried to use the process() and transform() methods that allow me to benefit from the Processor API, but I don't know how to properly produce records in a punctuation, or how to put records into a stream.
How could I achieve that? Thank you.
transform() returns a KStream on which you can call to() to write the results into a topic.
stream.transform(...).to("output_topic");
In a punctuation you can call context.forward() to send a record downstream. You still need to call to() to write the forwarded record into a topic.
To implement a custom aggregation consider the following pseudo-ish code:
builder = new StreamsBuilder();
final StoreBuilder<KeyValueStore<Integer, Integer>> keyValueStoreBuilder =
Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore(stateStoreName),
Serdes.Integer(),
Serdes.Integer());
builder.addStateStore(keyValueStoreBuilder);
stream = builder.stream(topic, Consumed.with(Serdes.Integer(), Serdes.Integer()));
stream.transform(() ->
new Transformer<Integer, Integer, KeyValue<Integer, Integer>>() {
private KeyValueStore<Integer, Integer> state;
@Override
public void init(final ProcessorContext context) {
state = (KeyValueStore<Integer, Integer>) context.getStateStore(stateStoreName);
context.schedule(
Duration.ofMinutes(1),
PunctuationType.STREAM_TIME,
timestamp -> {
// You can get aggregates from the state store here
// Then you can send the aggregates downstream
// with context.forward();
// Alternatively, you can output the aggregate in the
// transform() method as shown below
}
);
}
@Override
public KeyValue<Integer, Integer> transform(final Integer key, final Integer value) {
// Get existing aggregates from the state store with state.get().
// Update aggregates and write them into the state store with state.put().
// Depending on some condition, e.g., 10 seen records,
// output an aggregate downstream by returning the output.
// You can output multiple aggregates by using KStream#flatTransform().
// Alternatively, you can output the aggregate in a
// punctuation as shown above
return null; // placeholder: nothing emitted for this record
}

@Override
public void close() {
}
}, stateStoreName);
With this manual pre-aggregation in place, you could implement the higher-level aggregation in the same streams app and still leverage re-partitioning for it (see the sketch below).
process() is a terminal operation, i.e., it does not return anything, so you cannot chain to() after it.
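Concretely, the higher-level aggregation could be chained onto the pre-aggregated stream along these lines (a rough sketch; localAggregatorSupplier stands for the TransformerSupplier sketched above, and the topic name, window size, and serdes are assumptions):
stream
    .transform(localAggregatorSupplier, stateStoreName)            // local pre-aggregation, far fewer records
    .groupByKey(Grouped.with(Serdes.Integer(), Serdes.Integer()))  // the repartition now carries only the local aggregates
    .windowedBy(TimeWindows.of(Duration.ofSeconds(30)))
    .reduce(Integer::sum)                                          // sum the local aggregates into the global result
    .toStream()
    .map((windowedKey, total) -> KeyValue.pair(windowedKey.key(), total))
    .to("element-aggregates", Produced.with(Serdes.Integer(), Serdes.Integer()));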
I am trying to join a stream to 2 different GlobalKTables, treating them as lookups, more specifically devices (user agent) and geocoding (IP address).
The issue is with serialization, but I don't get why. It fails on DEFAULT_VALUE_SERDE_CLASS_CONFIG, yet the topic I want to write to is serialized correctly.
//
// Set up serialization / de-serialization
private static Serde<String> stringSerde = Serdes.String();
private static Serde<PodcastData> podcastSerde = StreamsSerdes.PodCastSerde();
private static Serde<GeoCodedData> geocodedSerde = StreamsSerdes.GeoIPSerde();
private static Serde<DeviceData> deviceSerde = StreamsSerdes.DeviceSerde();
private static Serde<JoinedPodcastGeoDeviceData> podcastGeoDeviceSerde = StreamsSerdes.PodcastGeoDeviceSerde();
private static Serde<JoinedPodCastDeviceData> podcastDeviceSerde = StreamsSerdes.PodcastDeviceDataSerde();
...
GlobalKTable<String, DeviceData> deviceIDTable = builder.globalTable(kafkaProperties.getProperty("deviceid-topic"));
GlobalKTable<String, GeoCodedData> geoIPTable = builder.globalTable(kafkaProperties.getProperty("geoip-topic"));
//
// Stream from source topic
KStream<String, PodcastData> podcastStream = builder.stream(
kafkaProperties.getProperty("source-topic"),
Consumed.with(stringSerde, podcastSerde));
//
podcastStream
// left join the podcast stream to the device table, looking up the device
.leftJoin(deviceIDTable,
// get a DeviceData object from the user agent
(podcastID, podcastData) -> podcastData.getUser_agent(),
// join podcast and device and return a JoinedPodCastDeviceData object
(podcastData, deviceData) -> {
JoinedPodCastDeviceData data =
JoinedPodCastDeviceData.builder().build();
data.setPodcastObject(podcastData);
data.setDeviceData(deviceData);
return data;
})
// left join the podcast stream to the geo table, looking up the geo data
.leftJoin(geoIPTable,
// get a Geo object from the ip address
(podcastID, podcastDeviceData) -> podcastDeviceData.getPodcastObject().getIp_address(),
// join podcast and geo
(podcastDeviceData, geoCodedData) -> {
JoinedPodcastGeoDeviceData data=
JoinedPodcastGeoDeviceData.builder().build();
data.setGeoData(geoCodedData);
data.setDeviceData(podcastDeviceData.getDeviceData());
data.setPodcastData(podcastDeviceData.getPodcastObject());
return data;
})
//
.to(kafkaProperties.getProperty("sink-topic"),
Produced.with(stringSerde, podcastGeoDeviceSerde));
...
...
streamsConfiguration.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, stringSerde.getClass().getName());
streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, stringSerde.getClass().getName());
The error
ERROR java.lang.String cannot be cast to DeviceData
streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, stringSerde.getClass().getName());
Because of the above setting, the application will use the String serde as the default value serde unless you specify one explicitly when creating a KTable/KStream/GlobalKTable.
Since the expected value type for deviceIDTable is DeviceData, specify it as shown below.
You need to define the value serde for the GlobalKTable:
GlobalKTable<String, DeviceData> deviceIDTable = builder.globalTable(
        kafkaProperties.getProperty("deviceid-topic"),
        Materialized.<String, DeviceData, KeyValueStore<Bytes, byte[]>>as(DEVICE_STORE)
                .withKeySerde(stringSerde)
                .withValueSerde(deviceSerde));
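Alternatively, if you don't need a named store, the serdes can be passed via Consumed when creating the GlobalKTables (a sketch using the serdes already declared in the question):
GlobalKTable<String, DeviceData> deviceIDTable = builder.globalTable(
        kafkaProperties.getProperty("deviceid-topic"),
        Consumed.with(stringSerde, deviceSerde));

GlobalKTable<String, GeoCodedData> geoIPTable = builder.globalTable(
        kafkaProperties.getProperty("geoip-topic"),
        Consumed.with(stringSerde, geocodedSerde));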
My requirement is to skip or avoid duplicate messages (having the same key) received from the INPUT topic using the Kafka Streams DSL API.
There is a possibility of the source system sending duplicate messages to the INPUT topic in case of any failures.
FLOW -
Source System --> INPUT Topic --> Kafka Streaming --> OUTPUT Topic
Currently I am using flatMap to generate multiple keys out of the payload, but flatMap is stateless, so I am not able to avoid duplicate message processing upon receiving from the INPUT topic.
I am looking for a DSL API which can skip duplicate records received from the INPUT topic and also generate multiple key/values before sending them to the OUTPUT topic.
I thought the Exactly Once configuration would be useful here to deduplicate messages received from the INPUT topic based on keys, but it looks like it's not working; probably I did not understand the usage of Exactly Once.
Could you please shed some light on it?
My requirement is to skip or avoid duplicate messages (having the same key) received from the INPUT topic using the Kafka Streams DSL API.
Take a look at the EventDeduplication example at https://github.com/confluentinc/kafka-streams-examples, which does that. You can then adapt the example with the required flatMap functionality that is specific to your use case.
Here's the gist of the example:
final KStream<byte[], String> input = builder.stream(inputTopic);
final KStream<byte[], String> deduplicated = input.transform(
// In this example, we assume that the record value as-is represents a unique event ID by
// which we can perform de-duplication. If your records are different, adapt the extractor
// function as needed.
() -> new DeduplicationTransformer<>(windowSize.toMillis(), (key, value) -> value),
storeName);
deduplicated.to(outputTopic);
and
/**
* @param maintainDurationPerEventInMs how long to "remember" a known event (or rather, an event
* ID), during the time of which any incoming duplicates of
* the event will be dropped, thereby de-duplicating the
* input.
* @param idExtractor extracts a unique identifier from a record by which we de-duplicate input
* records; if it returns null, the record will not be considered for
* de-duping but forwarded as-is.
*/
DeduplicationTransformer(final long maintainDurationPerEventInMs, final KeyValueMapper<K, V, E> idExtractor) {
if (maintainDurationPerEventInMs < 1) {
throw new IllegalArgumentException("maintain duration per event must be >= 1");
}
leftDurationMs = maintainDurationPerEventInMs / 2;
rightDurationMs = maintainDurationPerEventInMs - leftDurationMs;
this.idExtractor = idExtractor;
}
@Override
@SuppressWarnings("unchecked")
public void init(final ProcessorContext context) {
this.context = context;
eventIdStore = (WindowStore<E, Long>) context.getStateStore(storeName);
}
public KeyValue<K, V> transform(final K key, final V value) {
final E eventId = idExtractor.apply(key, value);
if (eventId == null) {
return KeyValue.pair(key, value);
} else {
final KeyValue<K, V> output;
if (isDuplicate(eventId)) {
output = null;
updateTimestampOfExistingEventToPreventExpiry(eventId, context.timestamp());
} else {
output = KeyValue.pair(key, value);
rememberNewEvent(eventId, context.timestamp());
}
return output;
}
}
private boolean isDuplicate(final E eventId) {
final long eventTime = context.timestamp();
final WindowStoreIterator<Long> timeIterator = eventIdStore.fetch(
eventId,
eventTime - leftDurationMs,
eventTime + rightDurationMs);
final boolean isDuplicate = timeIterator.hasNext();
timeIterator.close();
return isDuplicate;
}
private void updateTimestampOfExistingEventToPreventExpiry(final E eventId, final long newTimestamp) {
eventIdStore.put(eventId, newTimestamp, newTimestamp);
}
private void rememberNewEvent(final E eventId, final long timestamp) {
eventIdStore.put(eventId, timestamp, timestamp);
}
@Override
public void close() {
// Note: The store should NOT be closed manually here via `eventIdStore.close()`!
// The Kafka Streams API will automatically close stores when necessary.
}
}
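For completeness, the transformer's window store has to be registered on the builder under storeName before transform() can connect to it. A sketch along the lines of the Confluent example (the retention/window-size values here are illustrative, not the example's exact parameters):
final Duration maintainDuration = Duration.ofMinutes(10); // how long to "remember" event IDs (illustrative)
final StoreBuilder<WindowStore<String, Long>> dedupStoreBuilder = Stores.windowStoreBuilder(
        Stores.persistentWindowStore(storeName,
                maintainDuration,   // retention period
                maintainDuration,   // window size
                false),             // do not retain duplicates
        Serdes.String(),            // event ID (the record value in this example)
        Serdes.Long());             // timestamp of the last occurrence
builder.addStateStore(dedupStoreBuilder);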
I am looking for a DSL API which can skip duplicate records received from the INPUT topic and also generate multiple key/values before sending them to the OUTPUT topic.
The DSL doesn't include such functionality out of the box, but the example above shows how you can easily build your own de-duplication logic by combining the DSL with the Processor API of Kafka Streams, with the use of Transformers.
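To cover the "generate multiple key/values" part, the use-case-specific flatMap can simply be chained between the de-duplication step and the output topic, for example (a sketch; the mapper body and output types are placeholders for whatever your payload requires):
deduplicated
    .flatMap((key, value) -> {
        // derive one or more key/value pairs from the de-duplicated record
        final List<KeyValue<byte[], String>> expanded = new ArrayList<>();
        // ... populate `expanded` from the payload ...
        return expanded;
    })
    .to(outputTopic);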
I thought the Exactly Once configuration would be useful here to deduplicate messages received from the INPUT topic based on keys, but it looks like it's not working; probably I did not understand the usage of Exactly Once.
As Matthias J. Sax mentioned in his answer, from Kafka's perspective these "duplicates" are not duplicates from the point of view of its exactly-once processing semantics. Kafka ensures that it will not introduce any such duplicates itself, but it cannot make such decisions out of the box for upstream data sources, which are a black box for Kafka.
Exactly-once can be used to ensure that consuming and processing an input topic does not result in duplicates in the output topic. However, from an exactly-once point of view, the duplicates in the input topic that you describe are not really duplicates but two regular input messages.
To remove input topic duplicates, you can use a transform() step with an attached state store (there is no built-in operator in the DSL that does what you want). For each input record, you first check if you find the corresponding key in the store. If not, you add it to the store and forward the message. If you do find it in the store, you drop the input as a duplicate. Note that this only works with a 100% correctness guarantee if you enable exactly-once processing in your Kafka Streams application. Otherwise, even if you try to deduplicate, Kafka Streams could re-introduce duplicates in case of a failure.
Additionally, you need to decide how long you want to keep entries in the store. You could use a punctuation to remove old data from the store if you are sure that no further duplicates can arrive in the input topic. One way to do this would be to also store the record timestamp (or maybe offset) in the store. This way, you can compare the current time with the stored record time within punctuate() and delete old records (i.e., you would iterate over all entries in the store via store#all()).
After the transform() you apply your flatMap() (or you could also merge your flatMap() code into transform() directly).
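A rough sketch of the described approach (the store name, key/value types, retention period, and punctuation interval are all assumptions; the "dedup-store" KeyValueStore must first be registered via builder.addStateStore()):
stream
    .transform(() -> new Transformer<String, String, KeyValue<String, String>>() {
        private ProcessorContext context;
        private KeyValueStore<String, Long> seen; // key -> timestamp of last occurrence

        @Override
        public void init(final ProcessorContext context) {
            this.context = context;
            seen = (KeyValueStore<String, Long>) context.getStateStore("dedup-store");
            // periodically expire old entries so the store does not grow forever
            context.schedule(Duration.ofMinutes(10), PunctuationType.STREAM_TIME, timestamp -> {
                try (final KeyValueIterator<String, Long> it = seen.all()) {
                    while (it.hasNext()) {
                        final KeyValue<String, Long> entry = it.next();
                        if (entry.value < timestamp - Duration.ofHours(1).toMillis()) {
                            seen.delete(entry.key); // older than the assumed retention: forget it
                        }
                    }
                }
            });
        }

        @Override
        public KeyValue<String, String> transform(final String key, final String value) {
            if (seen.get(key) != null) {
                return null; // duplicate: drop
            }
            seen.put(key, context.timestamp()); // remember the key with its record timestamp
            return KeyValue.pair(key, value);   // first occurrence: forward
        }

        @Override
        public void close() {}
    }, "dedup-store")
    // then apply the use-case specific flatMap() and write to the output topic
    .flatMap((key, value) -> generateKeyValuePairs(key, value)) // generateKeyValuePairs is a hypothetical helper
    .to("OUTPUT");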
It's achievable with the DSL only as well, using a SessionWindows aggregation with caching disabled.
Wrap the value with a duplicate flag
Turn the flag to true in reduce() within the time window
Filter out values with a true flag
Unwrap the original key and value
Topology:
Serde<K> keySerde = ...;
Serde<V> valueSerde = ...;
Duration dedupWindowSize = ...;
Duration gracePeriod = ...;
DedupValueSerde<V> dedupValueSerde = new DedupValueSerde<>(valueSerde);
new StreamsBuilder()
.stream("input-topic", Consumed.with(keySerde, valueSerde))
.mapValues(v -> new DedupValue<>(v, false))
.groupByKey()
.windowedBy(SessionWindows.ofInactivityGapAndGrace(dedupWindowSize, gracePeriod))
.reduce(
(value1, value2) -> new DedupValue<>(value1.value(), true),
Materialized
.<K, DedupValue<V>, SessionStore<Bytes, byte[]>>with(keySerde, dedupValueSerde)
.withCachingDisabled()
)
.toStream()
.filterNot((wk, dv) -> dv == null || dv.duplicate())
.selectKey((wk, dv) -> wk.key())
.mapValues(DedupValue::value)
.to("output-topic", Produced.with(keySerde, valueSerde));
Value wrapper:
record DedupValue<V>(V value, boolean duplicate) { }
Value wrapper SerDe (example):
public class DedupValueSerde<V> extends WrapperSerde<DedupValue<V>> {
public DedupValueSerde(Serde<V> vSerde) {
super(new DvSerializer<>(vSerde.serializer()), new DvDeserializer<>(vSerde.deserializer()));
}
private record DvSerializer<V>(Serializer<V> vSerializer) implements Serializer<DedupValue<V>> {
@Override
public byte[] serialize(String topic, DedupValue<V> data) {
byte[] vBytes = vSerializer.serialize(topic, data.value());
return ByteBuffer
.allocate(vBytes.length + 1)
.put(data.duplicate() ? (byte) 1 : (byte) 0)
.put(vBytes)
.array();
}
}
private record DvDeserializer<V>(Deserializer<V> vDeserializer) implements Deserializer<DedupValue<V>> {
@Override
public DedupValue<V> deserialize(String topic, byte[] data) {
ByteBuffer buffer = ByteBuffer.wrap(data);
boolean duplicate = buffer.get() == (byte) 1;
int remainingSize = buffer.remaining();
byte[] vBytes = new byte[remainingSize];
buffer.get(vBytes);
V value = vDeserializer.deserialize(topic, vBytes);
return new DedupValue<>(value, duplicate);
}
}
}
I am retrieving big chunks of data from DB and using this data to write it somewhere else. In order to avoid a long processing time, I'm trying to use parallel streams to write it. When I run this as sequential streams, it works perfectly. However, if I change it to parallel, the behavior is odd: it prints the same object multiple times (more than 10).
@PostConstruct
public void retrieveAllTypeRecords() throws SQLException {
logger.info("Retrieve batch of Type records.");
try {
Stream<TypeRecord> typeQueryAsStream = jdbcStream.getTypeQueryAsStream();
typeQueryAsStream.forEach((type) -> {
logger.info("Printing Type with field1: {} and field2: {}.", type.getField1(), type.getField2()); //the same object gets printed here multiple times
//write this object somewhere else
});
logger.info("Completed full retrieval of Type data.");
} catch (Exception e) {
logger.error("error: " + e);
}
}
public Stream<TypeRecord> getTypeQueryAsStream() throws SQLException {
String sql = typeRepository.getQueryAllTypesRecords(); //retrieves SQL query in String format
TypeMapper typeMapper = new TypeMapper();
JdbcStream.StreamableQuery query = jdbcStream.streamableQuery(sql);
Stream<TypeRecord> stream = query.stream()
.map(row -> {
return typeMapper.mapRow(row); //maps columns values to object values
});
return stream;
}
public class StreamableQuery implements Closeable {
(...)
public Stream<SqlRow> stream() throws SQLException {
final SqlRowSet rowSet = new ResultSetWrappingSqlRowSet(preparedStatement.executeQuery());
final SqlRow sqlRow = new SqlRowAdapter(rowSet);
Supplier<Spliterator<SqlRow>> supplier = () -> Spliterators.spliteratorUnknownSize(new Iterator<SqlRow>() {
@Override
public boolean hasNext() {
return !rowSet.isLast();
}
@Override
public SqlRow next() {
if (!rowSet.next()) {
throw new NoSuchElementException();
}
return sqlRow;
}
}, Spliterator.CONCURRENT);
return StreamSupport.stream(supplier, Spliterator.CONCURRENT, true); //this boolean sets the stream as parallel
}
}
I've also tried using typeQueryAsStream.parallel().forEach((type) -> ...) but the result is the same.
Example of output:
[ForkJoinPool.commonPool-worker-1] INFO TypeService - Saving Type with field1: L6797 and field2: P1433.
[ForkJoinPool.commonPool-worker-1] INFO TypeService - Saving Type with field1: L6797 and field2: P1433.
[main] INFO TypeService - Saving Type with field1: L6797 and field2: P1433.
[ForkJoinPool.commonPool-worker-1] INFO TypeService - Saving Type with field1: L6797 and field2: P1433.
Well, look at your code:
final SqlRow sqlRow = new SqlRowAdapter(rowSet);
Supplier<Spliterator<SqlRow>> supplier = () -> Spliterators.spliteratorUnknownSize(new Iterator<SqlRow>() {
…
@Override
public SqlRow next() {
if (!rowSet.next()) {
throw new NoSuchElementException();
}
return sqlRow;
}
}, Spliterator.CONCURRENT);
You are returning the same object every time. You achieve your desired effects by implicitly modifying the state of this object when calling rowSet.next().
This obviously can’t work when multiple threads try to access that single object concurrently. Even buffering some items to hand them over to another thread will cause trouble. Therefore, such interference can cause problems with sequential streams as well, as soon as stateful intermediate operations are involved, like sorted or distinct.
Assuming that typeMapper.mapRow(row) will produce an actual data item which has no interference to other data items, you should integrate this step into the stream source, to create a valid stream.
public Stream<TypeRecord> stream(TypeMapper typeMapper) throws SQLException {
SqlRowSet rowSet = new ResultSetWrappingSqlRowSet(preparedStatement.executeQuery());
SqlRow sqlRow = new SqlRowAdapter(rowSet);
Spliterator<TypeRecord> sp = new Spliterators.AbstractSpliterator<TypeRecord>(
Long.MAX_VALUE, Spliterator.CONCURRENT|Spliterator.ORDERED) {
@Override
public boolean tryAdvance(Consumer<? super TypeRecord> action) {
if(!rowSet.next()) return false;
action.accept(typeMapper.mapRow(sqlRow));
return true;
}
};
return StreamSupport.stream(sp, true); //this boolean sets the stream as parallel
}
Note that for a lot of use cases, like this one, implementing a Spliterator is simpler than implementing an Iterator (which needs to be wrapped via spliteratorUnknownSize anyway). Also, there is no need to encapsulate this instantiation into a Supplier.
As a final note, the current implementation does not perform well for streams with an unknown size, because it treats Long.MAX_VALUE as a very large number and ignores the “unknown” semantics the specification assigns to it. Providing an estimated size will be very beneficial to parallel performance; it doesn’t need to be precise. In fact, with the current implementation, even a completely made-up number, say 1000, may perform better than correctly using Long.MAX_VALUE to denote an entirely unknown size.
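For example (a sketch reusing the names from the method above; estimatedRowCount is a hypothetical value you might obtain from a COUNT(*) query or table statistics):
Spliterator<TypeRecord> sp = new Spliterators.AbstractSpliterator<TypeRecord>(
        estimatedRowCount,                              // rough estimate instead of Long.MAX_VALUE
        Spliterator.CONCURRENT | Spliterator.ORDERED) {
    @Override
    public boolean tryAdvance(Consumer<? super TypeRecord> action) {
        if (!rowSet.next()) return false;
        action.accept(typeMapper.mapRow(sqlRow));
        return true;
    }
};
return StreamSupport.stream(sp, true);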