How to transform with state? - apache-kafka-streams

I'm trying to build a Kafka streams application using the new version of the DSL (v1.0) but I don't see how to configure a stateful stream transformation. A basic but complete example of how to achieve this would be very helpful.
I didn't find any (stateful) transform examples in the source code. According to the documentation the following strategy should be followed:
StateStoreSupplier myStore = Stores.create("myTransformState")
.withKeys(...)
.withValues(...)
.persistent() // optional
.build();
builder.addStore(myStore);
KStream outputStream = inputStream.transform(new TransformerSupplier() { ... }, "myTransformState");
However, it's not clear what the type of builder should be in the example: neither Topology nor StreamsBuilder has an addStore method. If I try addStateStore instead, it only accepts an argument of type StoreBuilder, which is not the type of the myStore defined above.

As the JavaDocs explain, Stores#create is deprecated in 1.0.0:
@deprecated use persistentKeyValueStore(String), persistentWindowStore(String, long, int, long, boolean), persistentSessionStore(String, long), lruMap(String, int), or inMemoryKeyValueStore(String)
Thus, in your case you would create a persistent key-value store supplier via Stores.persistentKeyValueStore("myTransformState")
In a second step, you need to create a StoreBuilder via Stores.keyValueStoreBuilder(...) that takes your previously created store supplier as an argument.
Afterwards, you can add the StoreBuilder to your builder
StreamsBuilder#addStateStore(final StoreBuilder builder)
To connect the store to your transformer you just provide the store name as additional argument as before.
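Putting the three steps together, a minimal sketch against the Kafka 1.0 API might look like the following (the topic names and the per-key counting logic are illustrative assumptions, not part of the question):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.kstream.TransformerSupplier;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

public class StatefulTransformExample {

    public static Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        // Steps 1 + 2: store supplier wrapped in a StoreBuilder, registered with the builder
        StoreBuilder<KeyValueStore<String, Long>> storeBuilder =
                Stores.keyValueStoreBuilder(
                        Stores.persistentKeyValueStore("myTransformState"),
                        Serdes.String(),
                        Serdes.Long());
        builder.addStateStore(storeBuilder);

        // Step 3: connect the transformer to the store by passing the store name
        KStream<String, String> input = builder.stream("input-topic");
        KStream<String, Long> output = input.transform(
                new TransformerSupplier<String, String, KeyValue<String, Long>>() {
                    @Override
                    public Transformer<String, String, KeyValue<String, Long>> get() {
                        return new Transformer<String, String, KeyValue<String, Long>>() {
                            private KeyValueStore<String, Long> state;

                            @SuppressWarnings("unchecked")
                            @Override
                            public void init(ProcessorContext context) {
                                state = (KeyValueStore<String, Long>) context.getStateStore("myTransformState");
                            }

                            @Override
                            public KeyValue<String, Long> transform(String key, String value) {
                                // Example of stateful logic: count records per key
                                Long previous = state.get(key);
                                long next = (previous == null) ? 1L : previous + 1L;
                                state.put(key, next);
                                return KeyValue.pair(key, next);
                            }

                            @Override
                            public KeyValue<String, Long> punctuate(long timestamp) {
                                return null; // deprecated in 1.0 but still part of the interface
                            }

                            @Override
                            public void close() { }
                        };
                    }
                },
                "myTransformState");
        output.to("output-topic", Produced.with(Serdes.String(), Serdes.Long()));

        return builder.build();
    }
}
```

Note that the store name appears twice: once when registering the StoreBuilder, and once as the extra argument to transform(), which is what actually connects the store to the transformer.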

Related

Can I store sensitive data in a Vert.x context in a Quarkus application?

I am looking for a place to store some request scoped attributes such as user id using a Quarkus request filter. I later want to retrieve these attributes in a Log handler and put them in the MDC logging context.
Is Vertx.currentContext() the right place to put such request attributes? Or can the properties I set on this context be read by other requests?
If this is not the right place to store such data, where would be the right place?
Yes ... and no :-D
Vertx.currentContext() can provide two types of objects:
a root context, shared between all the concurrent processing executed on this event loop (so do NOT share data there)
duplicated contexts, which are local to a processing and its continuation (you can share data in these)
In Quarkus 2.7.2, we have done a lot of work to improve our support of duplicated contexts. While before they were only used for HTTP, they are now used for gRPC and @ConsumeEvent. Support for Kafka and AMQP is coming in Quarkus 2.8.
Also, in Quarkus 2.7.2, we introduced two new features that could be useful:
you cannot store data in a root context. We detect that for you and throw an UnsupportedOperationException. The reason is safety.
we introduced a new utility class (io.smallrye.common.vertx.ContextLocals) to access the context locals.
Here is a simple example:
AtomicInteger counter = new AtomicInteger();

public Uni<String> invoke() {
    Context context = Vertx.currentContext();
    ContextLocals.put("message", "hello");
    ContextLocals.put("id", counter.incrementAndGet());

    return invokeRemoteService()
        // Switch back to our duplicated context:
        .emitOn(runnable -> context.runOnContext(runnable))
        .map(res -> {
            // Can still access the context-local data
            String msg = ContextLocals.<String>get("message").orElseThrow();
            Integer id = ContextLocals.<Integer>get("id").orElseThrow();
            return "%s - %s - %d".formatted(res, msg, id);
        });
}

AWS integration Spring: Extend Visibility Timeout

Is it possible to extend the visibility timeout of a message that is in flight?
See:
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/AboutVT.html.
Section: Changing a Message's Visibility Timeout.
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/sqs/AmazonSQSClient.html#changeMessageVisibility-com.amazonaws.services.sqs.model.ChangeMessageVisibilityRequest-
In summary, I want to be able to extend the initially set visibility timeout for a given message that is in flight.
For example, if 15 seconds have passed, I then want to extend the timeout by another 20 seconds. There is a better example in the Java docs linked above.
From my understanding in the links above you can do this on the amazon side.
Below are my current settings;
SqsMessageDrivenChannelAdapter adapter = new SqsMessageDrivenChannelAdapter(queue);
adapter.setMessageDeletionPolicy(SqsMessageDeletionPolicy.ON_SUCCESS);
adapter.setMaxNumberOfMessages(1);
adapter.setSendTimeout(2000);
adapter.setVisibilityTimeout(200);
adapter.setWaitTimeOut(20);
Is it possible to extend this timeout?
Spring Cloud AWS supports this starting with version 2.0. Injecting a Visibility parameter into your SQS listener method does the trick:
@SqsListener(value = "my-sqs-queue")
void onMessageReceived(@Payload String payload, Visibility visibility) {
    ...
    var extension = visibility.extend(20);
    ...
}
Note that extend works asynchronously and returns a Future. So if, further down the processing, you want to be sure that the visibility of the message really has been extended on the AWS side, either block on the Future using extension.get() or query it with extension.isDone().
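For example, a listener that blocks until the extension is confirmed might look like this (a sketch; the queue name is a placeholder, and the import locations assume the Spring Cloud AWS 2.x package layout):

```java
import java.util.concurrent.Future;

import org.springframework.cloud.aws.messaging.listener.Visibility;
import org.springframework.cloud.aws.messaging.listener.annotation.SqsListener;
import org.springframework.messaging.handler.annotation.Payload;

public class OrderListener {

    @SqsListener("my-sqs-queue")
    void onMessageReceived(@Payload String payload, Visibility visibility) throws Exception {
        Future<?> extension = visibility.extend(20);
        // Block until AWS has confirmed the new visibility timeout
        extension.get();
        // ... long-running processing of the payload ...
    }
}
```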
OK. Looks like I see your point.
We can change visibility for particular message using API:
AmazonSQS.changeMessageVisibility(String queueUrl, String receiptHandle, Integer visibilityTimeout)
For this purpose in downstream flow you have to get access to (inject) AmazonSQS bean and extract special headers from the Message:
@Autowired
AmazonSQS amazonSqs;
@Autowired
ResourceIdResolver resourceIdResolver;
...
MessageHeaders headers = message.getHeaders();
DestinationResolver<String> destinationResolver = new DynamicQueueUrlDestinationResolver(this.amazonSqs, this.resourceIdResolver);
String queueUrl = destinationResolver.resolveDestination((String) headers.get(AwsHeaders.QUEUE));
String receiptHandle = (String) headers.get(AwsHeaders.RECEIPT_HANDLE);
amazonSqs.changeMessageVisibility(queueUrl, receiptHandle, YOUR_DESIRED_VISIBILITY_TIMEOUT);
But eh, I agree that we should provide something on the matter as an out-of-the-box feature. That might even be something similar to QueueMessageAcknowledgment, as a new header. Or just one more changeMessageVisibility overload on that class.
Please raise a GH issue for the Spring Cloud AWS project on the matter, with a link to this SO topic.

Parquet-MR AvroParquetWriter - how to convert data to Parquet (with Specific Mapping)

I'm working on a tool for converting data from a homegrown format to Parquet and JSON (for use in different settings with Spark, Drill and MongoDB), using Avro with Specific Mapping as the stepping stone. I have to support conversion of new data on a regular basis and on client machines which is why I try to write my own standalone conversion tool with a (Avro|Parquet|JSON) switch instead of using Drill or Spark or other tools as converters as I probably would if this was a one time job. I'm basing the whole thing on Avro because this seems like the easiest way to get conversion to Parquet and JSON under one hood.
I used Specific Mapping to profit from static type checking, wrote an IDL, converted that to a schema.avsc, generated classes and set up a sample conversion with specific constructor, but now I'm stuck configuring the writers. All Avro-Parquet conversion examples I could find [0] use AvroParquetWriter with deprecated signatures (mostly: Path file, Schema schema) and Generic Mapping.
AvroParquetWriter has only one non-deprecated constructor, with this signature:
AvroParquetWriter(
    Path file,
    WriteSupport<T> writeSupport,
    CompressionCodecName compressionCodecName,
    int blockSize,
    int pageSize,
    boolean enableDictionary,
    boolean enableValidation,
    WriterVersion writerVersion,
    Configuration conf
)
Most of the parameters are not hard to figure out but WriteSupport<T> writeSupport throws me off. I can't find any further documentation or an example.
Staring at the source of AvroParquetWriter I see GenericData model pop up a few times, but only one line mentioning SpecificData: GenericData model = SpecificData.get();
So I have a few questions:
1) Does AvroParquetWriter not support Avro Specific Mapping? Or does it, by means of that SpecificData.get() method? The comment "Utilities for generated Java classes and interfaces." over SpecificData.class seems to suggest that, but how exactly should I proceed?
2) What's going on in the AvroParquetWriter constructor, is there an example or some documentation to be found somewhere?
3) More specifically: the signature of the WriteSupport method asks for 'Schema avroSchema' and 'GenericData model'. What does GenericData model refer to? Maybe I'm not seeing the forest because of all the trees here...
To give an example of what I'm aiming for, my central piece of Avro conversion code currently looks like this:
DatumWriter<MyData> avroDatumWriter = new SpecificDatumWriter<>(MyData.class);
DataFileWriter<MyData> dataFileWriter = new DataFileWriter<>(avroDatumWriter);
dataFileWriter.create(schema, avroOutput);
The Parquet equivalent currently looks like this:
AvroParquetWriter<SpecificRecord> parquetWriter = new AvroParquetWriter<>(parquetOutput, schema);
but this is not more than a beginning and is modeled after the examples I found, using the deprecated constructor, so will have to change anyway.
Thanks,
Thomas
[0] Hadoop - The definitive Guide, O'Reilly, https://gist.github.com/hammer/76996fb8426a0ada233e, http://www.programcreek.com/java-api-example/index.php?api=parquet.avro.AvroParquetWriter
Try AvroParquetWriter.builder:
MyData obj = ... // should be an Avro object
ParquetWriter<Object> pw = AvroParquetWriter.builder(file)
    .withSchema(obj.getSchema())
    .build();
pw.write(obj);
pw.close();
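For the Specific Mapping part of the question: in later parquet-avro releases (the builder API post-dates the version the question was written against), the builder also accepts the Avro data model, which is how you select specific rather than generic records. A sketch, assuming the generated class MyData from the question and a hypothetical output path:

```java
import org.apache.avro.specific.SpecificData;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class MyDataParquetConverter {

    // Writes generated-class (Specific Mapping) records to a Parquet file.
    public static void writeParquet(Iterable<MyData> records, Path out) throws Exception {
        try (ParquetWriter<MyData> writer = AvroParquetWriter.<MyData>builder(out)
                .withSchema(MyData.getClassSchema())
                .withDataModel(SpecificData.get()) // specific data model instead of the generic one
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            for (MyData record : records) {
                writer.write(record);
            }
        }
    }
}
```

The withDataModel(SpecificData.get()) call is the answer to question 3): the GenericData model parameter tells the writer which Avro data model (generic, specific, or reflect) to use when traversing your records.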
Thanks.

Save image (via ImageWriter / FileImageOutputStream) to the filesystem without use of a File object

As a learning task I am converting my software I use every day to NIO, with the somewhat arbitrary objective of having zero remaining instances of java.io.File.
I have been successful in every case except one. It seems an ImageWriter can only write to a FileImageOutputStream which requires a java.io.File.
Path path = Paths.get(inputFileName);
InputStream is = Files.newInputStream(path, StandardOpenOption.READ);
BufferedImage bi = ImageIO.read(is);
...
Iterator<ImageWriter> iter = ImageIO.getImageWritersBySuffix("jpg");
ImageWriter writer = iter.next();
ImageWriteParam param = writer.getDefaultWriteParam();
File outputFile = new File(outputFileName);
ImageOutputStream ios = new FileImageOutputStream(outputFile);
IIOImage iioi = new IIOImage(bi, null, null);
writer.setOutput(ios);
writer.write(null, iioi, param);
...
Is there a way to do this with a java.nio.file.Path? The Java 8 API doc for ImageWriter only mentions FileImageOutputStream.
I understand there might only be a symbolic value to doing this, but I was under the impression that NIO is intended to provide a complete alternative to java.io.File.
A RandomAccessFile, constructed with just a String for a filename, can be supplied to the FileImageOutputStream constructor.
This doesn't "use NIO" any more than just using the File in the first place, but it doesn't require File to be used directly.
For direct support of Path (or to "use NIO"), the FileImageOutputStream (or RandomAccessFile) could be extended, or a type deriving from the ImageOutputStream interface created, but how much work is it worth?
The intended way to instantiate an ImageInputStream or ImageOutputStream in the javax.imageio API, is through the ImageIO.createImageInputStream() and ImageIO.createImageOutputStream() methods.
You will see that both these methods take Object as their parameter. Internally, ImageIO will use a service lookup mechanism and delegate the creation to a provider able to create a stream based on the parameter. By default, there are providers for File, RandomAccessFile and InputStream.
But the mechanism is extendable. See the API doc for the javax.imageio.spi package for a starting point. If you like, you can create a provider that takes a java.nio.file.Path and creates a FileImageOutputStream based on it, or alternatively create your own implementation using some more fancy NIO backing (i.e. SeekableByteChannel).
Here's source code for a sample provider and stream I created to read images from a byte array, that you could use as a starting point.
(Of course, I have to agree with @user2864740's thoughts on the cost/benefit of doing this, but as you are doing this for the sake of learning, it might make sense.)
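Following that pointer, here is a JDK-only sketch that writes a JPEG through a Path without ever constructing a java.io.File, by opening an OutputStream via Files.newOutputStream() and handing it to ImageIO.createImageOutputStream() (the file names are placeholders):

```java
import java.awt.image.BufferedImage;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

public class PathImageWrite {

    // Writes the image as JPEG to the given Path, with no java.io.File involved.
    public static void write(BufferedImage image, Path target) throws Exception {
        ImageWriter writer = ImageIO.getImageWritersBySuffix("jpg").next();
        try (OutputStream os = Files.newOutputStream(target);
             ImageOutputStream ios = ImageIO.createImageOutputStream(os)) {
            writer.setOutput(ios);
            writer.write(null, new IIOImage(image, null, null), writer.getDefaultWriteParam());
        } finally {
            writer.dispose();
        }
    }

    public static void main(String[] args) throws Exception {
        BufferedImage img = new BufferedImage(16, 16, BufferedImage.TYPE_INT_RGB);
        Path out = Files.createTempFile("demo", ".jpg");
        write(img, out);
        System.out.println("wrote " + Files.size(out) + " bytes to " + out);
    }
}
```

Since createImageOutputStream() delegates to the registered InputStream/OutputStream provider, this works with any Files.newOutputStream() target, not just regular files.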

Hibernate search, convert byte[] to List<LuceneWork>

As of Hibernate Search 3.1.1, when one wanted to send an indexed entity to a JMS queue for further processing, it was enough, in the onMessage() method of the processing MDB, to apply a cast to obtain the list of LuceneWork, e.g.
List<LuceneWork> queue = (List<LuceneWork>) objectMessage.getObject();
But in version 4.2.0 this is no longer an option as objectMessage.getObject() returns a byte[].
How could I deserialize this byte[] into List<LuceneWork>?
I've inspected the message and saw that I have the value for JMSBackendQueueTask.INDEX_NAME_JMS_PROPERTY.
You could extend AbstractJMSHibernateSearchController and have it deal with these details, or have a look at its source which contains:
indexName = objectMessage.getStringProperty(JmsBackendQueueTask.INDEX_NAME_JMS_PROPERTY);
indexManager = factory.getAllIndexesManager().getIndexManager(indexName);
if (indexManager == null) {
    log.messageReceivedForUndefinedIndex(indexName);
    return;
}
queue = indexManager.getSerializer().toLuceneWorks((byte[]) objectMessage.getObject());
indexManager.performOperations(queue, null);
Compared to older versions 3.x there are two main design differences to keep in mind:
The Serializer service is pluggable so it needs to be looked up
Each index (identified by name) can have an independent backend
The serialization is now performed (by default) using Apache Avro as newer Lucene classes are not Serializable.
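Wired into an MDB, the lookup-and-deserialize flow above might look like the following sketch. The package locations are from memory of the 4.2 API layout and should be double-checked, and how you obtain the SearchFactoryImplementor (typically from your SessionFactory) is an assumption:

```java
import java.util.List;

import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.ObjectMessage;

import org.hibernate.search.backend.LuceneWork;
import org.hibernate.search.backend.impl.jms.JmsBackendQueueTask;
import org.hibernate.search.engine.spi.SearchFactoryImplementor;
import org.hibernate.search.indexes.spi.IndexManager;

public class IndexingMDB implements MessageListener {

    private SearchFactoryImplementor factory; // obtained from your SessionFactory (assumption)

    @Override
    public void onMessage(Message message) {
        try {
            ObjectMessage objectMessage = (ObjectMessage) message;
            // The index name travels as a JMS property, not inside the payload
            String indexName = objectMessage.getStringProperty(JmsBackendQueueTask.INDEX_NAME_JMS_PROPERTY);
            IndexManager indexManager = factory.getAllIndexesManager().getIndexManager(indexName);
            if (indexManager == null) {
                return; // unknown index; nothing to do
            }
            // The per-index serializer turns the Avro byte[] back into LuceneWork instances
            List<LuceneWork> queue =
                indexManager.getSerializer().toLuceneWorks((byte[]) objectMessage.getObject());
            indexManager.performOperations(queue, null);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```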
