Can field grouping be done on tuples emitted by a Kafka spout? If yes, how does Storm get to know the fields in a Kafka record?
Field grouping (and grouping in general) in Storm is for bolts, not for spouts. Grouping is declared via the InputDeclarer class.
When you call setBolt() on TopologyBuilder, an InputDeclarer is returned.
The Kafka spout declares its output fields like any other component. My explanation is based on the current implementation of KafkaSpout.
In the KafkaSpout.java class, the declareOutputFields method calls getOutputFields() on the scheme from the KafkaConfig:
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(_spoutConfig.scheme.getOutputFields());
}
By default, KafkaConfig uses RawMultiScheme, which implements this method as follows:
@Override
public Fields getOutputFields() {
    return new Fields("bytes");
}
So what does this mean? If you declare a bolt that reads tuples from KafkaSpout with fields grouping, you know that every tuple with an equal "bytes" field will be processed by the same task. If you want to emit other fields, you should implement a new scheme for your needs, as in the sketch below.
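For example, here is a minimal sketch of a custom scheme, assuming the record bytes carry a UTF-8 "key,value" payload (the class name and field layout are illustrative, not from the actual KafkaSpout code). It would be wired in with spoutConfig.scheme = new SchemeAsMultiScheme(new KeyValueScheme());. Note that older Storm versions pass a byte[] to deserialize instead of a ByteBuffer.
public class KeyValueScheme implements Scheme {

    @Override
    public List<Object> deserialize(ByteBuffer ser) {
        // Assumed layout: "key,value" encoded as UTF-8
        String record = StandardCharsets.UTF_8.decode(ser).toString();
        String[] parts = record.split(",", 2);
        return new Values(parts[0], parts.length > 1 ? parts[1] : "");
    }

    @Override
    public Fields getOutputFields() {
        return new Fields("key", "value");
    }
}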
TL;DR
The default implementation of KafkaSpout declares the following output fields in declareOutputFields:
new Fields("topic", "partition", "offset", "key", "value");
So in the topology-building code you can directly do:
topologyBuilder.setSpout(spoutName, mySpout, parallelismHintSpout);
topologyBuilder.setBolt(boltName, myBolt, parallelismHintBolt).fieldsGrouping(spoutName, new Fields("key"));
Details: a little digging into the code shows the following.
In KafkaSpout, declareOutputFields is implemented in the following way:
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    RecordTranslator<K, V> translator = kafkaSpoutConfig.getTranslator();
    for (String stream : translator.streams()) {
        declarer.declareStream(stream, translator.getFieldsFor(stream));
    }
}
It gets the fields from the RecordTranslator interface, and its instance is fetched from kafkaSpoutConfig, i.e. KafkaSpoutConfig<K, V>. KafkaSpoutConfig<K, V> extends CommonKafkaSpoutConfig (this is slightly different in version 1.1.1, though). Its builder returns a DefaultRecordTranslator. If you check the Fields in that class's implementation, you will find:
public static final Fields FIELDS = new Fields("topic", "partition", "offset", "key", "value");
So we can use new Fields("key") directly in the fields grouping in the topology code:
topologyBuilder.setBolt(boltName, myBolt, parallelismHintBolt).fieldsGrouping(spoutName, new Fields("key"));
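If the default fields don't suit your grouping needs, the builder also lets you plug in your own translator. A minimal sketch, assuming a recent storm-kafka-client where the builder exposes a setRecordTranslator overload taking a function and the output fields (broker address and topic name are placeholders):
KafkaSpoutConfig<String, String> spoutConfig = KafkaSpoutConfig
        .builder("localhost:9092", "my-topic")
        // emit only the record key and value as tuple fields
        .setRecordTranslator(r -> new Values(r.key(), r.value()), new Fields("key", "value"))
        .build();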
Related
I need to publish multiple messages from the same project, which represent employee journey events, and I need to use only one topic to publish these messages since they all belong to the same project. In some cases a message may contain extra fields, for example:
All messages share (id, name, type, date), and
some events may have more fields, like (course id, course name). So I intend to use one parent object called "Journey" that contains an "Event" object, and I will create multiple child objects like 'LMSEvent' that extend this Event, etc., if needed. I am also using Jackson + Spring Boot over REST APIs to do the needed cast based on the type attribute. Finally, this message then goes to Kafka directly, so each object contains its own properties.
For the consumer, I will use some strategy patterns and do the required logic per type if needed.
The message size will not be very big and I don't expect to have many more different attributes per event.
I would like to know whether this approach is good or not, and if not, what the alternative is.
I think that in general it is a good approach. Whether to have a single message schema per topic or multiple schemas is always a good question, and both options have benefits and drawbacks; you can read more about it in Martin Kleppmann's article.
Once you have decided to have multiple events on a single topic, you can use the same approach of serializing and deserializing events from the REST API down through the Kafka producer and consumer; @JsonTypeInfo and @JsonSubTypes do the job:
@JsonTypeInfo(
        use = JsonTypeInfo.Id.NAME,
        include = JsonTypeInfo.As.EXISTING_PROPERTY,
        property = "type")
@JsonSubTypes({
        @JsonSubTypes.Type(value = LMSEvent.class, name = "LMSEvent"),
        @JsonSubTypes.Type(value = YetAnotherEvent.class, name = "YetAnotherEvent")
})
public interface Event {

    String getType();

    default boolean hasType(String type) {
        return getType().equalsIgnoreCase(type);
    }

    default <T> T getConcreteEvent(Class<T> clazz) {
        return clazz.cast(this);
    }
}
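A concrete subtype then just adds its own attributes and reports its type. A minimal sketch, assuming the shared and course-related fields from the question (field types are illustrative):
public class LMSEvent implements Event {

    private Long id;
    private String name;
    private String date;
    private String courseId;
    private String courseName;

    @Override
    public String getType() {
        return "LMSEvent";
    }

    // getters and setters omitted for brevity
}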
When you consume this type of message using spring-kafka you can define some very neat code, where each method consumes a concrete event type, so you don't need to write any dirty casting of your own:
@KafkaListener(topics = "someEvents", containerFactory = "myKafkaContainerFactory")
public class MyKafkaHandler {

    @KafkaHandler
    void handleLMSEvent(LMSEvent event) {
        ....
    }

    @KafkaHandler
    void handleYetAnotherEvent(YetAnotherEvent yetAnotherEvent) {
        ...
    }

    @KafkaHandler(isDefault = true)
    void handleDefault(@Payload Object unknown,
                       @Header(KafkaHeaders.OFFSET) long offset,
                       @Header(KafkaHeaders.RECEIVED_PARTITION) int partitionId,
                       @Header(KafkaHeaders.RECEIVED_TOPIC) String topic) {
        logger.info("Server received unknown message {},{},{}", offset, partitionId, topic);
    }
}
Full code
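The myKafkaContainerFactory referenced in the listener is not shown above; a minimal sketch of what it could look like, assuming a JSON-based consumer whose deserializer targets the Event interface so that Jackson resolves the concrete subtype from the type property (broker address and group id are placeholders):
@Bean
public ConcurrentKafkaListenerContainerFactory<String, Event> myKafkaContainerFactory() {
    Map<String, Object> props = new HashMap<>();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "journey-consumer");          // placeholder

    // deserialize JSON payloads into the Event hierarchy
    ConsumerFactory<String, Event> consumerFactory = new DefaultKafkaConsumerFactory<>(
            props, new StringDeserializer(), new JsonDeserializer<>(Event.class));

    ConcurrentKafkaListenerContainerFactory<String, Event> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    return factory;
}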
This seems to be a missing part in the documentation of Vaadin...
I call an API to get data in my UI like this:
@Override
public URI getUri(String url, PageRequest page) {
    return UriComponentsBuilder.fromUriString(url)
            .queryParam("page", page.getPageNumber())
            .queryParam("size", page.getPageSize())
            .queryParam("sort", (page.getSort().isSorted() ? page.getSort() : ""))
            .build()
            .toUri();
}

@Override
public Mono<Page<SomeDto>> getDataByPage(PageRequest pageRequest) {
    return webClient.get()
            .uri(getUri(URL_API + "/page", pageRequest))
            .retrieve()
            .bodyToMono(new ParameterizedTypeReference<>() {
            });
}
In the Vaadin documentation (https://vaadin.com/docs/v10/flow/binding-data/tutorial-flow-data-provider), I found an example with DataProvider.fromCallbacks, but this expects streams, and that doesn't feel like the correct approach, as I need to block on the requests to get the streams...
DataProvider<SomeDto, Void> lazyProvider = DataProvider.fromCallbacks(
        q -> service.getData(PageRequest.of(q.getOffset(), q.getLimit())).block().stream(),
        q -> service.getDataCount().block().intValue()
);
When trying this implementation, I get the following error:
org.springframework.core.codec.CodecException: Type definition error: [simple type, class org.springframework.data.domain.Page]; nested exception is com.fasterxml.jackson.databind.exc.InvalidDefinitionException: Cannot construct instance of `org.springframework.data.domain.Page` (no Creators, like default constructor, exist): abstract types either need to be mapped to concrete types, have custom deserializer, or contain additional type information
at [Source: (io.netty.buffer.ByteBufInputStream); line: 1, column: 1]
grid.setItems(lazyProvider);
I don't have experience with Vaadin, so I'll talk about the deserialization problem.
Jackson needs a creator when deserializing. That's either:
the default no-arg constructor,
another constructor annotated with @JsonCreator, or
a static factory method annotated with @JsonCreator.
If we take a look at Spring's implementations of Page, PageImpl and GeoPage, they have none of those. So you have two options:
Option 1: Write your own deserializer and register it with the ObjectMapper instance.
The deserializer:
public class PageDeserializer<T> extends StdDeserializer<Page<T>> {

    public PageDeserializer() {
        super(Page.class);
    }

    @Override
    public Page<T> deserialize(JsonParser p, DeserializationContext ctxt) throws IOException, JacksonException {
        //TODO implement for your case
        return null;
    }
}
And registration:
SimpleModule module = new SimpleModule();
module.addDeserializer(Page.class, new PageDeserializer<>());
objectMapper.registerModule(module);
Option 2: Make your own classes extending PageImpl, PageRequest, etc., and annotate their constructors with @JsonCreator and the arguments with @JsonProperty.
Your page:
public class MyPage<T> extends PageImpl<T> {

    @JsonCreator
    public MyPage(@JsonProperty("content_prop_from_json") List<T> content,
                  @JsonProperty("pageable_obj_from_json") MyPageable pageable,
                  @JsonProperty("total_from_json") long total) {
        super(content, pageable, total);
    }
}
Your pageable:
public class MyPageable extends PageRequest {

    @JsonCreator
    public MyPageable(@JsonProperty("page_from_json") int page,
                      @JsonProperty("size_from_json") int size,
                      @JsonProperty("sort_object_from_json") Sort sort) {
        super(page, size, sort);
    }
}
Depending on your needs for the Sort object, you might need to create MySort as well, or you can remove it from the constructor and supply, for example, an unsorted Sort to the super constructor. If you are deserializing the input manually, you need to provide the type parameters like this:
JavaType javaType = TypeFactory.defaultInstance().constructParametricType(MyPage.class, MyModel.class);
Page<MyModel> deserialized = objectMapper.readValue(pageString, javaType);
If the input is from a request body, for example, just declaring the generic type in the variable is enough for the object mapper to pick it up.
@PostMapping("/deserialize")
public ResponseEntity<String> deserialize(@RequestBody MyPage<MyModel> page) {
    return ResponseEntity.ok("OK");
}
Personally, I would go for the second option: even though you have to create more classes, it spares the tediousness of extracting properties and creating instances manually when writing deserializers.
There are two parts to this question.
The first one is about asynchronously loading data for a DataProvider in Vaadin. This isn't supported, since Vaadin has prioritized the typical case of fetching data straight through JDBC. This means that you end up blocking a thread while the data is loading. Vaadin 23 will add support for doing that blocking on a separate thread instead of keeping the UI thread blocked, but it will still be blocking.
The other half of your problem doesn't seem to be directly related to Vaadin. The exception message says that the Jackson instance used by the REST client isn't configured to support creating instances of org.springframework.data.domain.Page. I don't have direct experience with this part of the problem, so I cannot give any advice on exactly how to fix it.
I am working in Spring Cloud Data Flow, and I have a scenario where I read from the database and send the data to a Kafka topic using @InboundChannelAdapter.
Below is the strategy I followed:
-> I created a common list to store the objects and fill it if it is empty.
-> If the list already has data, I don't poll again.
-> I send the values to Kafka one by one by index, and afterwards I remove that index.
If I keep the @Bean, only the first object in the list is inserted into the Kafka topic:
{"id":101443442,"name":"Mobile1","price":8000}
If I remove the @Bean, it inserts only empty data into Kafka:
{}
public static List<Product> products;

@Bean
public void initList() {
    products = new ArrayList<>();
}

@Bean
@InboundChannelAdapter(channel = TbeSource.PR1)
public MessageSource<Product> addProducts() {
    if (products.size() == 0) {
        products.add(new Product(101443442, "Mobile1", 8000));
        products.add(new Product(102235434, "book111", 6000));
    }
    MessageBuilder<Product> message = MessageBuilder.withPayload(products.get(0));
    products.remove(0);
    return message::build;
}
What am I doing wrong? I need to send the data frequently by reading from the DB.
It's really not clear what you are asking.
If you are talking about JDBC, then you may consider using the JDBC Source from the out-of-the-box applications for Data Flow.
If you are implementing the logic yourself to fetch data from the database, you may consider using a JdbcPollingChannelAdapter from Spring Integration for the same @InboundChannelAdapter purpose.
The rest of your logic with that list is not clear. It is strange to see a @Bean on a void method. If you need to initialize that products list and access it from the MessageSource implementation, you just need to do private List<Product> products = new ArrayList<>();. Having the property public is really a bad practice.
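Putting that together, a minimal sketch of what the adapter could look like, assuming the Product class and the TbeSource.PR1 channel from the question; the point is that the list is a private field and the per-poll work happens inside the MessageSource rather than at bean creation time:
private final List<Product> products = new ArrayList<>();

@Bean
@InboundChannelAdapter(channel = TbeSource.PR1)
public MessageSource<Product> addProducts() {
    return () -> {
        // refill the list when it has been drained
        if (products.isEmpty()) {
            products.add(new Product(101443442, "Mobile1", 8000));
            products.add(new Product(102235434, "book111", 6000));
        }
        // emit one product per poll
        return MessageBuilder.withPayload(products.remove(0)).build();
    };
}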
I have to create a batch job where I need to fetch data from one DB and, after processing, dump that data to another DB, where an auto-generated ID is assigned to the persisted data. I then need to send that data, along with the generated ID, to a Solace queue.
Reader(DB1) --data1--> Processor --data2--> Writer (DB2) --data3--> Writer (Solace Publisher)
I am using Spring Boot 2.2.5.RELEASE and spring-boot-starter-batch.
I have created a job with one step that reads data from DB1 and writes it to DB2 via RepositoryItemReader and RepositoryItemWriter respectively. This is working fine.
Now the next task is to send the persisted data, with its generated IDs, to the Solace stream (using spring-cloud-starter-stream-solace).
I have the questions below. Please assist, as I am totally new to Spring Batch.
How can I get the complete record after it's saved to DB2, based on some parameter? Do I have to write my own RepositoryItemWriter with a StepExecution context, or can I somehow use the existing RepositoryItemWriter?
Once I have the record, I need to use the Solace stream, which has a publish method expecting the record to be published as an argument. I think I again need to write my own ItemWriter, and either use the record passed from the above RepositoryItemWriter via the StepExecutionContext, or query DB2 directly from there based on some parameter?
In either case I need to use the step execution context, but can I use the available RepositoryItemWriter or do I have to write my own?
Is there any other concept that is handy here instead of the above approaches?
Passing data to future steps is a common pattern in Spring Batch. According to the documentation (https://docs.spring.io/spring-batch/docs/current/reference/html/common-patterns.html#passingDataToFutureSteps), you can use the step execution context to store and retrieve your generated IDs. In your case the writers are also listeners, which have before-step methods annotated with @BeforeStep. For example:
public class DB2ItemWriter implements ItemWriter<Object> {

    private StepExecution stepExecution;

    public void write(List<? extends Object> items) throws Exception {
        // ... persist the items and collect their generated IDs into `ids`
        ExecutionContext stepContext = this.stepExecution.getExecutionContext();
        stepContext.put("generatedIds", ids);
    }

    @BeforeStep
    public void saveStepExecution(StepExecution stepExecution) {
        this.stepExecution = stepExecution;
    }
}
and then you retrieve the IDs in the next writer:
public class SolacePublisherItemWriter implements ItemWriter<Object> {

    private Object generatedIds;

    public void write(List<? extends Object> items) throws Exception {
        // ... publish to Solace using the IDs retrieved below
    }

    @BeforeStep
    public void retrieveGeneratedIds(StepExecution stepExecution) {
        ExecutionContext stepExecutionContext = stepExecution.getExecutionContext();
        this.generatedIds = stepExecutionContext.get("generatedIds");
    }
}
I have created a job with one step that reads data from DB1 and writes it to DB2 via RepositoryItemReader and RepositoryItemWriter respectively. This is working fine.
I would add a second step that reads the data from that table (in which the records have been persisted by step 1 and have their IDs generated) and pushes it to Solace using a custom writer, as sketched below.
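A minimal sketch of that two-step job, assuming Spring Batch 4.x (JobBuilderFactory/StepBuilderFactory) and placeholder item types and reader/writer/processor beans:
@Bean
public Job exportJob(JobBuilderFactory jobs, StepBuilderFactory steps,
                     ItemReader<SourceRecord> db1Reader, ItemWriter<TargetRecord> db2Writer,
                     ItemReader<TargetRecord> db2Reader, ItemWriter<TargetRecord> solaceWriter,
                     ItemProcessor<SourceRecord, TargetRecord> processor) {
    // step 1: read from DB1, process, persist to DB2 (IDs are generated here)
    Step persistStep = steps.get("persistToDb2")
            .<SourceRecord, TargetRecord>chunk(100)
            .reader(db1Reader)
            .processor(processor)
            .writer(db2Writer)
            .build();
    // step 2: re-read the persisted rows (now carrying their generated IDs) and publish to Solace
    Step publishStep = steps.get("publishToSolace")
            .<TargetRecord, TargetRecord>chunk(100)
            .reader(db2Reader)
            .writer(solaceWriter)
            .build();
    return jobs.get("db1ToDb2ToSolaceJob")
            .start(persistStep)
            .next(publishStep)
            .build();
}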
When you create a window using Kafka Streams, I assume it uses the timestamp at which the record was published? Is there any way to window by something else?
My use case is that our record value object contains a timestamp, and that is what we want to window on.
If I do something like this, it will window on the publish timestamp. I want to window by myObject.getCallTimestamp():
KTable<Windowed<String>, Long> windowedPageViewCounts = pageViews
        .groupByKey(Serialized.with(Serdes.String(), myObjectSerde))
        .windowedBy(TimeWindows.of(TimeUnit.MINUTES.toMillis(5)))
        .count();
EDIT:
Based on the recommendation below, I believe this is what I need to be doing?
public class RecordTimeStampExtractor implements TimestampExtractor {

    // default timestamp extractor used as a fallback
    private final FailOnInvalidTimestamp failOnInvalidTimestamp = new FailOnInvalidTimestamp();

    @Override
    public long extract(ConsumerRecord<Object, Object> consumerRecord, long previousTimestamp) {
        // could also check consumerRecord.topic().equals("mytopic")
        if (consumerRecord.value() instanceof MyClass) {
            MyClass myClass = (MyClass) consumerRecord.value();
            return myClass.getRecordTimestamp().toEpochMilli();
        }
        return failOnInvalidTimestamp.extract(consumerRecord, previousTimestamp);
    }
}
You can implement and configure (via default.timestamp.extractor) a custom TimestampExtractor that returns myObject.getCallTimestamp().
See the docs for details: https://docs.confluent.io/current/streams/developer-guide/config-streams.html#streams-developer-guide-timestamp-extractor
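For example, a minimal sketch of the configuration, assuming the RecordTimeStampExtractor from the question (application id and bootstrap servers are placeholders, and builder refers to the StreamsBuilder used to define the topology):
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-windowing-app");          // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");         // placeholder
// use the record's embedded timestamp for windowing instead of the publish timestamp
props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, RecordTimeStampExtractor.class);

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();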