Can't access the data in Kafka Spark Streaming globally - hadoop

I am trying to stream data from Kafka to Spark:
JavaPairInputDStream<String, String> directKafkaStream = KafkaUtils.createDirectStream(ssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams, topics);
Here I am iterating over the JavaPairInputDStream to process the RDDs:
directKafkaStream.foreachRDD(rdd ->{
rdd.foreachPartition(items ->{
while (items.hasNext()) {
String[] State = items.next()._2.split("\\,");
System.out.println(State[2]+","+State[3]+","+State[4]+"--");
};
});
});
I am able to fetch the data inside foreachRDD, but my requirement is to access the State array globally. When I try to access the State array globally I get the exception
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
Any suggestions? Thanks.

This is more a matter of joining your lookup table with the streaming RDD to get all the items that have matching 'code' and 'violationCode' fields.
The flow should be like this:
Create an RDD from the Hive lookup table => lookupRdd
Create a DStream from the Kafka stream
For each RDD in the DStream, join lookupRdd with the stream RDD, process the joined items (calculate the sum of amount, ...) and save the processed result.
Note: the code below is incomplete; please complete all the TODO comments.
JavaPairDStream<String, String> streamPair = directKafkaStream.mapToPair(new PairFunction<Tuple2<String, String>, String, String>() {
@Override
public Tuple2<String, String> call(Tuple2<String, String> tuple2) throws Exception {
System.out.println("Tuple2 Message is----------" + tuple2._2());
String[] state = tuple2._2.split("\\,");
return new Tuple2<>(state[4], tuple2._2()); //pair <ViolationCode, data>
}
});
streamPair.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
JavaPairRDD<String, String> hivePairRdd = null;
@Override
public Void call(JavaPairRDD<String, String> stringStringJavaPairRDD) throws Exception {
if (hivePairRdd == null) {
hivePairRdd = initHiveRdd();
}
JavaPairRDD<String, Tuple2<String, String>> joinedRdd = stringStringJavaPairRDD.join(hivePairRdd);
System.out.println(joinedRdd.take(10));
//todo process joinedRdd here and save the results.
joinedRdd.count(); //to trigger an action
return null;
}
});
}
public static JavaPairRDD<String, String> initHiveRdd() {
JavaRDD<String> hiveTableRDD = null; //todo code to create RDD from hive table
JavaPairRDD<String, String> hivePairRdd = hiveTableRDD.mapToPair(new PairFunction<String, String, String>() {
@Override
public Tuple2<String, String> call(String row) throws Exception {
String code = null; //TODO process 'row' and get 'code' field
return new Tuple2<>(code, row);
}
});
return hivePairRdd;
}
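For the first TODO in initHiveRdd(), a minimal sketch (assuming Spark 2.x with Hive support enabled; the table name violation_lookup and its columns are placeholders, not from the original post) could be:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public static JavaRDD<String> createHiveTableRdd(SparkSession spark) {
    // Table and column names are hypothetical; adjust to your Hive schema.
    Dataset<Row> lookup = spark.sql("SELECT code, description FROM violation_lookup");
    // Flatten each row back into a comma-separated string so the mapToPair(...)
    // in initHiveRdd() can split it the same way as the streamed records.
    return lookup.javaRDD().map(row -> row.getString(0) + "," + row.getString(1));
}
With Spark 1.x (which the KafkaUtils.createDirectStream/StringDecoder API above suggests), HiveContext.sql(...) provides the equivalent entry point.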

Related

AggregatingReplyingKafkaTemplate releaseStrategy Question

There seems to be an issue when I use AggregatingReplyingKafkaTemplate with template.setReturnPartialOnTimeout(true): it returns a timeout exception even when partial results are available from consumers.
In the example below, I have 3 consumers replying to the request topic and I've set the reply timeout to 10 seconds. I've explicitly delayed the response of Consumer 3 to 11 seconds, so I expect responses back from Consumers 1 and 2 and to be able to return partial results. However, I am getting a KafkaReplyTimeoutException. Appreciate your inputs. Thanks.
My code is based on the unit test below.
[ReplyingKafkaTemplateTests][1]
I've provided the actual code below:
@RestController
public class SumController {
@Value("${kafka.bootstrap-servers}")
private String bootstrapServers;
public static final String D_REPLY = "dReply";
public static final String D_REQUEST = "dRequest";
@ResponseBody
@PostMapping(value="/sum")
public String sum(@RequestParam("message") String message) throws InterruptedException, ExecutionException {
AggregatingReplyingKafkaTemplate<Integer, String, String> template = aggregatingTemplate(
new TopicPartitionOffset(D_REPLY, 0), 3, new AtomicInteger());
String resultValue ="";
String currentValue ="";
try {
template.setDefaultReplyTimeout(Duration.ofSeconds(10));
template.setReturnPartialOnTimeout(true);
ProducerRecord<Integer, String> record = new ProducerRecord<>(D_REQUEST, null, null, null, message);
RequestReplyFuture<Integer, String, Collection<ConsumerRecord<Integer, String>>> future =
template.sendAndReceive(record);
future.getSendFuture().get(5, TimeUnit.SECONDS); // send ok
System.out.println("Send Completed Successfully");
ConsumerRecord<Integer, Collection<ConsumerRecord<Integer, String>>> consumerRecord = future.get(10, TimeUnit.SECONDS);
System.out.println("Consumer record size "+consumerRecord.value().size());
Iterator<ConsumerRecord<Integer, String>> iterator = consumerRecord.value().iterator();
while (iterator.hasNext()) {
currentValue = iterator.next().value();
System.out.println("response " + currentValue);
System.out.println("Record header " + consumerRecord.headers().toString());
resultValue = resultValue + currentValue + "\r\n";
}
} catch (Exception e) {
System.out.println("Error Message is "+e.getMessage());
}
return resultValue;
}
public AggregatingReplyingKafkaTemplate<Integer, String, String> aggregatingTemplate(
TopicPartitionOffset topic, int releaseSize, AtomicInteger releaseCount) {
//Create Container Properties
ContainerProperties containerProperties = new ContainerProperties(topic);
containerProperties.setAckMode(ContainerProperties.AckMode.MANUAL_IMMEDIATE);
//Set the consumer Config
//Create Consumer Factory with Consumer Config
DefaultKafkaConsumerFactory<Integer, Collection<ConsumerRecord<Integer, String>>> cf =
new DefaultKafkaConsumerFactory<>(consumerConfigs());
//Create Listener Container with Consumer Factory and Container Property
KafkaMessageListenerContainer<Integer, Collection<ConsumerRecord<Integer, String>>> container =
new KafkaMessageListenerContainer<>(cf, containerProperties);
// container.setBeanName(this.testName);
AggregatingReplyingKafkaTemplate<Integer, String, String> template =
new AggregatingReplyingKafkaTemplate<>(new DefaultKafkaProducerFactory<>(producerConfigs()), container,
(list, timeout) -> {
releaseCount.incrementAndGet();
return list.size() == releaseSize;
});
template.setSharedReplyTopic(true);
template.start();
return template;
}
public Map<String, Object> consumerConfigs() {
Map<String, Object> props = new HashMap<>();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,bootstrapServers);
props.put(ConsumerConfig.GROUP_ID_CONFIG, "test_id");
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, org.apache.kafka.common.serialization.StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, org.apache.kafka.common.serialization.StringDeserializer.class);
return props;
}
public Map<String, Object> producerConfigs() {
Map<String, Object> props = new HashMap<>();
// list of host:port pairs used for establishing the initial connections to the Kafka cluster
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
bootstrapServers);
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
org.apache.kafka.common.serialization.StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, org.apache.kafka.common.serialization.StringSerializer.class);
return props;
}
public ProducerFactory<Integer,String> producerFactory() {
return new DefaultKafkaProducerFactory<>(producerConfigs());
}
#KafkaListener(id = "def1", topics = { D_REQUEST}, groupId = "D_REQUEST1")
#SendTo // default REPLY_TOPIC header
public String dListener1(String in) throws InterruptedException {
return "First Consumer : "+ in.toUpperCase();
}
#KafkaListener(id = "def2", topics = { D_REQUEST}, groupId = "D_REQUEST2")
#SendTo // default REPLY_TOPIC header
public String dListener2(String in) throws InterruptedException {
return "Second Consumer : "+ in.toLowerCase();
}
#KafkaListener(id = "def3", topics = { D_REQUEST}, groupId = "D_REQUEST3")
#SendTo // default REPLY_TOPIC header
public String dListener3(String in) throws InterruptedException {
Thread.sleep(11000);
return "Third Consumer : "+ in;
}
}
[1]: https://github.com/spring-projects/spring-kafka/blob/master/spring-kafka/src/test/java/org/springframework/kafka/requestreply/ReplyingKafkaTemplateTests.java
template.setReturnPartialOnTimeout(true) simply means the template will consult the release strategy on timeout (with the timeout argument = true, to tell the strategy it's a timeout rather than a delivery call).
It must return true to release the partial result.
This is to allow you to look at (and possibly modify) the list to decide whether you want to release or discard.
Your strategy ignores the timeout parameter:
(list, timeout) -> {
releaseCount.incrementAndGet();
return list.size() == releaseSize;
});
You need return timeout ? true : { ... }.
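In other words, a release strategy along these lines (a sketch built from the lambda in the question; releaseCount and releaseSize are the variables already shown above) releases whatever replies have arrived when the timeout fires:
(list, timeout) -> {
    releaseCount.incrementAndGet();
    // equivalent to "return timeout ? true : list.size() == releaseSize":
    // release the partial list on timeout, otherwise wait for all releaseSize replies
    return timeout || list.size() == releaseSize;
}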

Kafka Streams: batch keys for a time window and do some processing on the batch of keys together

I have a stream of incoming primary keys (PKs) that I am reading in my Kafka Streams app. I would like to batch them over, say, the last minute and query my transactional DB to get more data for the (deduplicated) batch of PKs from that minute. Then, for each PK, I would like to post a message on the output topic.
I was able to code this using the Processor API as below:
Topology topology = new Topology();
topology.addSource("test-source", inputKeySerde.deserializer(), inputValueSerde.deserializer(), "input.kafka.topic")
.addProcessor("test-processor", processorSupplier, "test-source")
.addSink("test-sink", "output.kafka.topic", outputKeySerde.serializer(), outputValueSerde.serializer, "test-processor");
Here the processor supplier has a process method that adds the PK to a queue, and a punctuator scheduled to run every minute that drains the queue, queries the transactional DB, and forwards a message for every PK.
ProcessorSupplier<Integer, ValueType> processorSupplier = new ProcessorSupplier<Integer, ValueType>() {
public Processor<Integer, ValueType> get() {
return new Processor<Integer, ValueType>() {
private ProcessorContext context;
private BlockingQueue<Integer> ids;
public void init(ProcessorContext context) {
this.context = context;
this.context.schedule(Duration.ofMillis(1000), PunctuationType.WALL_CLOCK_TIME, this::punctuate);
ids = new LinkedBlockingQueue<>();
}
@Override
public void process(Integer key, ValueType value) {
ids.add(key);
}
public void punctuate(long timestamp) {
Set<Long> idSet = new HashSet<>();
ids.drainTo(idSet, 1000);
List<Document> documentList = createDocuments(ids);
documentList.stream().forEach(document -> context.forward(document.getId(), document));
context.commit();
}
@Override
public void close() {
}
};
}
};
Wondering if there is a simpler way to accomplish this using DSL windowedBy and reduce/aggregate route?
***** Updated code to use state store ******
ProcessorSupplier<Integer, ValueType> processorSupplier = new ProcessorSupplier<Integer, ValueType>() {
public Processor<Integer, ValueType> get() {
return new Processor<Integer, ValueType>() {
private ProcessorContext context;
private KeyValueStore<Integer, Integer> stateStore;
public void init(ProcessorContext context) {
this.context = context;
stateStore = (KeyValueStore) context.getStateStore("MyStore");
this.context.schedule(Duration.ofMillis(5000), PunctuationType.WALL_CLOCK_TIME, (timestamp) -> {
Set<Integer> ids = new HashSet<>();
try (KeyValueIterator<Integer, Integer> iter = this.stateStore.all()) {
while (iter.hasNext()) {
KeyValue<Integer, Integer> entry = iter.next();
ids.add(entry.key);
}
}
List<Document> documentList = createDocuments(dataRetriever, ids);
documentList.stream().forEach(document -> context.forward(document.getId(), document));
ids.stream().forEach(id -> stateStore.delete(id));
this.context.commit();
});
}
@Override
public void process(Integer key, ValueType value) {
stateStore.put(key, key);
}
@Override
public void close() {
}
};
}
};
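Note that context.getStateStore("MyStore") only resolves if the store is registered on the topology and attached to the processor. A minimal sketch of that wiring, reusing the topology names from the question (the in-memory store choice and Integer serdes are assumptions):
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

StoreBuilder<KeyValueStore<Integer, Integer>> storeBuilder =
    Stores.keyValueStoreBuilder(Stores.inMemoryKeyValueStore("MyStore"), Serdes.Integer(), Serdes.Integer());
topology.addSource("test-source", inputKeySerde.deserializer(), inputValueSerde.deserializer(), "input.kafka.topic")
    .addProcessor("test-processor", processorSupplier, "test-source")
    .addStateStore(storeBuilder, "test-processor") // attach "MyStore" to the processor
    .addSink("test-sink", "output.kafka.topic", outputKeySerde.serializer(), outputValueSerde.serializer(), "test-processor");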

Java8 generate Map containing another Map

How do I achieve this using Java 8?
I have a CSV in the format below, and from it I want to populate a Map<String, Map<String, String>>,
where the outer map is keyed by scriptId and transactionType (the distinct Type values), and the inner map for the scriptId key should contain, for each row, the value at position 2 as the key and the value at position 3 as the value.
<scriptId<
<TATA,TATA Moters>
<REL,Reliance Industries Ltd>
<LNT, L&T>
<SBI, State Bank of India>>
<transactionType,<
<P,B>
<S,S>>
Content of CSV File
Type,ArcesiumValue,GICValue
scriptId,TATA,TATA Moters
scriptId,REL,Reliance Industries Ltd
scriptId,LNT,L&T
scriptId,SBI,State Bank of India
transactionType,P,B
transactionType,S,S
How do I generate this using Java 8? This is what I have tried:
public void loadReferenceData() throws IOException {
List<Map<String, Map<String, String>>> cache = Files.lines(Paths.get("data/referenceDataMapping.csv")).skip(1)
.map(mapRefereneData).collect(Collectors.toList());
System.out.println(cache);
}
public static Function<String, Map<String, Map<String, String>>> mapRefereneData = (line) -> {
String[] sp = line.split(",");
Map<String, Map<String, String>> cache = new HashMap<String, Map<String, String>>();
try {
if (cache.containsKey(sp[0])) {
cache.get(sp[0]).put(sp[1], sp[2]);
} else {
Map<String, String> map = new HashMap<String, String>();
map.put(sp[1], sp[2]);
cache.put(sp[0], map);
}
} catch (NumberFormatException e) {
e.printStackTrace();
}
return cache;
};
Well it is much simpler to use two Collectors:
Map<String, Map<String, String>> groupCSV = Files.lines(Paths.get("..."))
.skip(1L).map(l -> l.split(","))
.collect(Collectors.groupingBy(a -> a[0], Collectors.toMap(a -> a[1], a -> a[2])));
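For the sample CSV above, that one-liner would roughly produce (a quick illustrative check):
// groupCSV => {scriptId={TATA=TATA Moters, REL=Reliance Industries Ltd, LNT=L&T, SBI=State Bank of India},
//              transactionType={P=B, S=S}}
System.out.println(groupCSV.get("scriptId").get("TATA"));     // TATA Moters
System.out.println(groupCSV.get("transactionType").get("P")); // B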

Failed to flush state store

I'm trying to create a leftJoin in Kafka Streams. It works fine for about 10 records and then crashes with an exception caused by a NullPointerException, with this code:
private static KafkaStreams getKafkaStreams() {
StreamsConfig streamsConfig = new StreamsConfig(getProperties());
KStreamBuilder builder = new KStreamBuilder();
KTable<String, Verkaeufer> umsatzTable = builder.table(Serdes.String(), EventstreamSerde.Verkaeufer(), CommonUtilsConstants.TOPIC_VERKAEUFER_STAMMDATEN);
KStream<String, String> verkaeuferStream = builder.stream(CommonUtilsConstants.TOPIC_ANZAHL_UMSATZ_PER_VERKAEUFER);
KStream<String, String> tuttiStream = verkaeuferStream.leftJoin(umsatzTable,
(tutti, verkaeufer) -> ("Vorname=" + verkaeufer.getVorname().toString() +",Nachname=" +verkaeufer.getNachname().toString() +"," +tutti.toString()), Serdes.String(), Serdes.String());
tuttiStream.to(Serdes.String(), Serdes.String(), CommonUtilsConstants.TOPIC_TUTTI);
return new KafkaStreams(builder, streamsConfig);
}
StreamsConfig looks like this:
private static Properties getProperties() {
Properties props = new Properties();
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, CommonUtilsConstants.BOOTSTRAP_SERVER_CONFIGURATION);
props.put(StreamsConfig.APPLICATION_ID_CONFIG, CommonUtilsConstants.GID_TUTTI);
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG,Serdes.String().getClass());
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, "1000");
return props;
}
Full Stack Trace:
22:19:36.550 [gid-tutti-8fe6be58-d5c5-41ce-982d-88081b98004e-StreamThread-1] ERROR o.a.k.s.p.internals.StreamThread - stream-thread [gid-tutti-8fe6be58-d5c5-41ce-982d-88081b98004e-StreamThread-1] Failed to commit StreamTask 0_0 state: org.apache.kafka.streams.errors.ProcessorStateException: task [0_0] Failed to flush state store KTABLE-SOURCE-STATE-STORE-0000000000
at org.apache.kafka.streams.processor.internals.ProcessorStateManager.flush(ProcessorStateManager.java:262)
at org.apache.kafka.streams.processor.internals.AbstractTask.flushState(AbstractTask.java:190)
at org.apache.kafka.streams.processor.internals.StreamTask.flushState(StreamTask.java:282)
at org.apache.kafka.streams.processor.internals.StreamTask$1.run(StreamTask.java:264)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:187)
at org.apache.kafka.streams.processor.internals.StreamTask.commitImpl(StreamTask.java:259)
at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:253)
at org.apache.kafka.streams.processor.internals.StreamThread.commitOne(StreamThread.java:815)
at org.apache.kafka.streams.processor.internals.StreamThread.access$2800(StreamThread.java:73)
at org.apache.kafka.streams.processor.internals.StreamThread$2.apply(StreamThread.java:797)
at org.apache.kafka.streams.processor.internals.StreamThread.performOnStreamTasks(StreamThread.java:1448)
at org.apache.kafka.streams.processor.internals.StreamThread.commitAll(StreamThread.java:789)
at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:778)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:567)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:527) Caused by: java.lang.NullPointerException: null
at java.lang.String.<init>(String.java:143)
at ch.wesr.eventstream.commonutils.serde.GsonDeserializer.deserialize(GsonDeserializer.java:38)
at org.apache.kafka.streams.state.StateSerdes.valueFrom(StateSerdes.java:163)
at org.apache.kafka.streams.state.internals.CachingKeyValueStore.putAndMaybeForward(CachingKeyValueStore.java:90)
at org.apache.kafka.streams.state.internals.CachingKeyValueStore.access$000(CachingKeyValueStore.java:34)
at org.apache.kafka.streams.state.internals.CachingKeyValueStore$1.apply(CachingKeyValueStore.java:78)
at org.apache.kafka.streams.state.internals.NamedCache.flush(NamedCache.java:145)
at org.apache.kafka.streams.state.internals.NamedCache.flush(NamedCache.java:103)
at org.apache.kafka.streams.state.internals.ThreadCache.flush(ThreadCache.java:97)
at org.apache.kafka.streams.state.internals.CachingKeyValueStore.flush(CachingKeyValueStore.java:107)
at org.apache.kafka.streams.processor.internals.ProcessorStateManager.flush(ProcessorStateManager.java:260)
... 14 common frames omitted
Update:
This is what the GsonDeserializer looks like:
public class GsonDeserializer<T> implements Deserializer<T>{
public static final String CONFIG_VALUE_CLASS = "default.value.deserializer.class";
public static final String CONFIG_KEY_CLASS = "default.key.deserializer.class";
private Class<T> deserializedClass;
private Gson gson = new GsonBuilder().create();
public GsonDeserializer() {}
@Override
public void configure(Map<String, ?> config, boolean isKey) {
String configKey = isKey ? CONFIG_KEY_CLASS : CONFIG_VALUE_CLASS;
String clsName = String.valueOf(config.get(configKey));
try {
if (deserializedClass == null) {
deserializedClass = (Class<T>) Class.forName(clsName);
}
} catch (ClassNotFoundException e) {
System.err.printf("Failed to configure GsonDeserializer. " +
"Did you forget to specify the '%s' property ?%n",
configKey);
System.out.println(e.getMessage());
}
}
@Override
public T deserialize(String s, byte[] bytes) {
return gson.fromJson(new String(bytes), deserializedClass);
}
@Override
public void close() {}
}
As long as the cache is not flushed, your deserializer is never called. That's why it doesn't fail in the beginning, and why you can increase the time until it fails via the cache size parameter and the commit interval (we flush on commit).
Looking at your code for GsonDeserializer, it seems that new String(bytes) fails with the NPE -- the String constructor cannot take null as a parameter -- your deserializer must guard against bytes == null and should return null for this case directly.
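A minimal guarded version of the deserialize method, following that suggestion (otherwise unchanged from the code above), could look like this:
@Override
public T deserialize(String s, byte[] bytes) {
    if (bytes == null) {
        return null; // missing/deleted values arrive as null; pass them through
    }
    return gson.fromJson(new String(bytes), deserializedClass);
}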

send multiple jms messages in one transaction

I have to send a message to 2 different queues (queue1 and queue2). However, I want to roll back if the send fails for either queue (queue1 or queue2).
My source code looks as follows. Can anyone offer some input on this?
public void sendMessage(final Map<String, String> mapMessage) {
jmsTemplate.send(queue1, session -> {
MapMessage message = session.createMapMessage();
Iterator<Entry<String, String>> it = mapMessage.entrySet().iterator();
while (it.hasNext()) {
Map.Entry<String, String> pair = it.next();
message.setStringProperty(pair.getKey(), pair.getValue());
}
message.setJMSRedelivered(true);
message.setJMSCorrelationID(UUID.randomUUID().toString().replaceAll("-", ""));
return message;
});
jmsTemplate.send(queue2, session -> {
MapMessage message = session.createMapMessage();
Iterator<Entry<String, String>> it = mapMessage.entrySet().iterator();
while (it.hasNext()) {
Map.Entry<String, String> pair = it.next();
message.setStringProperty(pair.getKey(), pair.getValue());
}
message.setJMSRedelivered(true);
message.setJMSCorrelationID(UUID.randomUUID().toString().replaceAll("-", ""));
return message;
});
}
Start a transaction before entering the sendMessage method, e.g. with @Transactional - see the Spring Framework Reference Manual.
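A minimal sketch of that wiring, assuming Spring's JmsTransactionManager and a session-transacted JmsTemplate (the configuration class and bean names are illustrative, not from the original post):
import javax.jms.ConnectionFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jms.connection.JmsTransactionManager;
import org.springframework.jms.core.JmsTemplate;
import org.springframework.transaction.PlatformTransactionManager;
import org.springframework.transaction.annotation.EnableTransactionManagement;

@Configuration
@EnableTransactionManagement
public class JmsTxConfig {
    @Bean
    public PlatformTransactionManager transactionManager(ConnectionFactory connectionFactory) {
        // Local JMS transactions: both sends commit together or roll back together.
        return new JmsTransactionManager(connectionFactory);
    }
    @Bean
    public JmsTemplate jmsTemplate(ConnectionFactory connectionFactory) {
        JmsTemplate template = new JmsTemplate(connectionFactory);
        template.setSessionTransacted(true); // let sends join the surrounding transaction
        return template;
    }
}
Then annotate the caller (or sendMessage itself) with @Transactional so both jmsTemplate.send(...) calls run in one transaction.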
