Best strategy to handle large data in Apache Camel - java-8

I am using Apache Camel to generate monthly reports. I have a MySQL query which, when run against my DB, returns around 5 million records (20 columns each). The query itself takes approximately 70 minutes to execute.
To speed up the process, I created 5 seda (worker) routes and used multicast().parallelProcessing(), which query the DB in parallel for different time ranges; the results are then merged by an aggregator.
Now I can see all 5 million records in my exchange body (as a List<HashMap<String, Object>>). When I try to format this with Camel Bindy to generate a CSV file from the data, I get a "GC overhead limit exceeded" error. I tried increasing the Java heap size, but then the transformation takes forever.
Is there any other way to convert this raw data into a well-formatted CSV file? Can Java 8 streams be useful?
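A sketch of one alternative (assumptions: Camel 2.18+, a hypothetical per-row mapping bean, and an arbitrary output path): let the jdbc endpoint stream the result set with outputType=StreamList and use a streaming split to marshal and append rows piecewise, instead of materialising all 5 million maps in a single exchange.

from("direct://logs/testLogsStreaming")
    .routeId("Test_Logs_Streaming_Route")
    .bean(Logs.class, "buildLogsQuery")
    // outputType=StreamList hands the route an iterator over the result set instead of a full List
    .to("jdbc:oss-ro-ds?outputType=StreamList")
    // the streaming split keeps only the current row in memory
    .split(body()).streaming()
        .bean(Test.class, "mapRowToLogEntry") // hypothetical per-row version of buildLogReport
        .marshal(myLogBindy)                  // write the CSV header once separately if the Bindy model generates one
        .to("file:/tmp/reports?fileName=TestLogs-${date:now:yyyyMMddHHmm}.csv&fileExist=Append")
    .end();

In practice the rows would be aggregated into batches (or the existing per-time-range workers kept) before marshalling, to avoid 5 million tiny file writes. The current routes and code are below.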
Code
from("direct://logs/testLogs")
.routeId("Test_Logs_Route")
.setProperty("Report", simple("TestLogs-${date:now:yyyyMMddHHmm}"))
.bean(Logs.class, "buildLogsQuery") // bean that generates the logs query
.multicast()
.parallelProcessing()
.to("seda:worker1?waitForTaskToComplete=Always&timeout=0", // worker routes
"seda:worker2?waitForTaskToComplete=Always&timeout=0",
"seda:worker3?waitForTaskToComplete=Always&timeout=0",
"seda:worker4?waitForTaskToComplete=Always&timeout=0",
"seda:worker5?waitForTaskToComplete=Always&timeout=0");
All my worker routes look like this
from("seda:worker4?waitForTaskToComplete=Always")
.routeId("ParallelProcessingWorker4")
.log(LoggingLevel.INFO, "Parallel Processing Worker 4 Flow Started")
.setHeader("WorkerId", constant(4))
.bean(Logs.class, "testBean") // appends time-clause to the query based in WorkerID
.to("jdbc:oss-ro-ds")
.to("seda:resultAggregator?waitForTaskToComplete=Always&timeout=0");
Aggregator
from("seda:resultAggregator?waitForTaskToComplete=Always&timeout=0")
.routeId("Aggregator_ParallelProcessing")
.log(LoggingLevel.INFO, "Aggregation triggered for processor ${header.WorkerId}")
.aggregate(header("Report"), new ParallelProcessingAggregationStrategy())
.completionSize(5)
.to("direct://logs/processResultSet")
from("direct://logs/processResultSet")
.routeId("Process_Result_Set")
.bean(Test.class, "buildLogReport");
.marshal(myLogBindy)
.to("direct://deliver/ooma");
Method buildLogReport
public void buildLogReport(List<HashMap<String, Object>> resultEntries, Exchange exchange) throws Exception {
    Map<String, Object> headerMap = exchange.getIn().getHeaders();
    ArrayList<MyLogEntry> reportList = new ArrayList<>();
    while (!resultEntries.isEmpty()) { // loop until the source list is drained
        HashMap<String, Object> resultEntry = resultEntries.get(0);
        MyLogEntry logEntry = new MyLogEntry();
        logEntry.setA((String) resultEntry.get("A"));
        logEntry.setB((String) resultEntry.get("B"));
        logEntry.setC(((BigDecimal) resultEntry.get("C")).toString());
        if (null != resultEntry.get("D"))
            logEntry.setD(((BigInteger) resultEntry.get("D")).toString());
        logEntry.setE((String) resultEntry.get("E"));
        logEntry.setF((String) resultEntry.get("F"));
        logEntry.setG(((BigDecimal) resultEntry.get("G")).toString());
        logEntry.setH((String) resultEntry.get("H"));
        logEntry.setI(((Long) resultEntry.get("I")).toString());
        logEntry.setJ((String) resultEntry.get("J"));
        logEntry.setK(TimeUtils.convertDBToTZ((Date) resultEntry.get("K"), (String) headerMap.get("TZ")));
        logEntry.setL(((BigDecimal) resultEntry.get("L")).toString());
        logEntry.setM((String) resultEntry.get("M"));
        logEntry.setN((String) resultEntry.get("State"));
        logEntry.setO((String) resultEntry.get("Zip"));
        logEntry.setP("\"" + (String) resultEntry.get("Type") + "\"");
        logEntry.setQ((String) resultEntry.get("Gate"));
        reportList.add(logEntry);
        resultEntries.remove(resultEntry); // removing from the head of an ArrayList is O(n) per call
    }
    // Transform the exchange message
    exchange.getIn().setBody(reportList);
}
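On the Java 8 streams part of the question: a stream pipeline makes the mapping code shorter, but by itself it does not change the memory profile, because the full source list and the full target list still coexist on the heap. A sketch of the same transformation (same setters as above, abbreviated):

List<MyLogEntry> reportList = resultEntries.stream()
        .map(resultEntry -> {
            MyLogEntry logEntry = new MyLogEntry();
            logEntry.setA((String) resultEntry.get("A"));
            logEntry.setB((String) resultEntry.get("B"));
            // ... remaining setters exactly as in the loop above ...
            return logEntry;
        })
        .collect(Collectors.toList()); // java.util.stream.Collectors
exchange.getIn().setBody(reportList);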

Related

Spring batch job with multi level chunk

I need to write a batch job which reads a list of users from the DB, then for each user fetches data from an external system (which returns the data in batches of 50 to 500), and then inserts this data into the DB.
I have written the steps as below:
Tasklet loads all users and passes them to the Reader (step 1).
Reader takes all users and passes them one by one to the Processor (step 2).
Processor calls the external system and gets the full data for one user.
Writer gets this data for each user and inserts it into the DB.
Using SimpleAsyncTaskExecutor I am running this in parallel, but I want to have the steps as below:
Tasklet loads all users and passes them to the Reader (step 1, stepTasklet).
Reader reads the list of all users (step 2, parallelStep).
Another reader/processor gets one chunk of data (50 records) for that user.
Writer gets the 50-record chunk for each user and inserts it into the DB.
These steps need to run in parallel for multiple users, and the chunking should happen at the data level, not at the user level as it does currently.
Can someone help?
@Bean
public Job getParallelJob() {
    JobParametersIncrementer jobParametersIncrementer = new JobParametersIncrementer() {
        long count = new Date().getTime();

        @Override
        public JobParameters getNext(JobParameters parameters) {
            return new JobParameters(Collections.singletonMap("count", new JobParameter(count)));
        }
    };
    return this.jobBuilderFactory.get("paralleljob").incrementer(jobParametersIncrementer)
            .start(stepTasklet())
            .next(parallelStep(taskExecutor())).build();
}
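For reference, a sketch of the shape being asked for, using a partitioned worker step: one partition per user, with each worker step reading that user's data from the external system in chunks of 50. The partitioner, reader and writer bean names and the ExternalRecord type are placeholders, not beans from the question.

@Bean
public Step masterStep() {
    return stepBuilderFactory.get("masterStep")
            .partitioner("userWorkerStep", userPartitioner()) // one partition (execution context) per user
            .step(userWorkerStep())
            .taskExecutor(taskExecutor())                     // run user partitions in parallel
            .gridSize(10)
            .build();
}

@Bean
public Step userWorkerStep() {
    return stepBuilderFactory.get("userWorkerStep")
            .<ExternalRecord, ExternalRecord>chunk(50)        // commit every 50 records, i.e. chunking at the data level
            .reader(externalSystemReader(null))               // step-scoped reader pulling one user's data from the external system
            .writer(jdbcBatchWriter())
            .build();
}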

Spring Boot service to consume Kafka messages on demand

I have a requirement where I need a Spring Boot REST service that a client application will call every 30 minutes, and the service should return:
a number of latest messages based on the number specified in the query param, e.g. http://messages.com/getNewMessages?number=10 should return 10 messages;
a number of messages based on the number and offset specified in the query params, e.g. http://messages.com/getSpecificMessages?number=5&start=123 should return 5 messages starting at offset 123.
I have a simple standalone application and it works fine. Here is what I tested; I would like some direction on incorporating it into the service.
public static void main(String[] args) {
    // create the Kafka consumer
    Properties properties = new Properties();
    properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    properties.put(ConsumerConfig.GROUP_ID_CONFIG, "my-first-consumer-group");
    properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    properties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
    properties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, args[0]);
    Consumer<String, String> consumer = new KafkaConsumer<>(properties);

    // subscribe to the topic and trigger partition assignment
    consumer.subscribe(Collections.singleton("test"));
    consumer.poll(0);

    // seek to the specified offset and fetch the specified number of messages
    for (TopicPartition partition : consumer.assignment())
        consumer.seek(partition, Long.parseLong(args[1]));

    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(5000));
    System.out.println("Total Record Count ******* : " + records.count());
    for (ConsumerRecord<String, String> record : records) {
        System.out.println("Message: " + record.value());
        System.out.println("Message offset: " + record.offset());
        System.out.println("Message timestamp: " + record.timestamp());
        Date date = new Date(record.timestamp());
        Format format = new SimpleDateFormat("yyyy MM dd HH:mm:ss.SSS");
        System.out.println("Message date: " + format.format(date));
    }
    consumer.commitSync();
}
As my consumer will be on-demand, I am wondering how I can achieve this in a Spring Boot service. Where do I specify the properties? If I put them in application.properties they get injected at startup, but how do I control MAX_POLL_RECORDS_CONFIG at runtime? Any help appreciated.
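For the runtime part specifically, one option (a sketch only, assuming a short-lived consumer per request is acceptable; the topic and partition are placeholders) is to build the consumer inside the service method, so that max.poll.records comes from the request rather than from application.properties:

public List<String> fetchMessages(int number, long startOffset) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "on-demand-consumer");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, number); // taken from the query param, not application.properties
    List<String> messages = new ArrayList<>();
    try (Consumer<String, String> consumer = new KafkaConsumer<>(props)) {
        TopicPartition partition = new TopicPartition("test", 0); // single partition assumed for brevity
        consumer.assign(Collections.singleton(partition));
        consumer.seek(partition, startOffset);
        for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(5))) {
            messages.add(record.value());
        }
    }
    return messages;
}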
MAX_POLL_RECORDS_CONFIG only affects how many records the kafka-client hands back to your Spring service per poll; it never reduces the bytes that the consumer fetches from the Kafka server.
For example, no matter whether your start offset is 150 or 190, the Kafka server returns the whole fetched range (offset 110 to offset 190). The server does not even know how many records it is returning to the consumer; it only knows the byte size (220 - 110).
So I think you have to control the record count yourself; currently it is controlled by the kafka-client jar, and either way the fetched records occupy your JVM's local memory.
The answer to your question is here, and the answer with a code example is this answer. Both were written by the excellent Gary Russell, the main person (or one of the main people) behind Spring Kafka.
TL;DR:
If you want to arbitrarily rewind the partitions at runtime, have your listener implement ConsumerSeekAware and grab a reference to the ConsumerSeekCallback.
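A minimal sketch of that pattern (assuming Spring Kafka 2.3+, where the remaining ConsumerSeekAware methods have default implementations; the topic name is a placeholder):

public class SeekingListener implements ConsumerSeekAware {

    // one seek callback per consumer thread
    private final ThreadLocal<ConsumerSeekCallback> seekCallback = new ThreadLocal<>();

    @Override
    public void registerSeekCallback(ConsumerSeekCallback callback) {
        this.seekCallback.set(callback);
    }

    @KafkaListener(topics = "test")
    public void listen(String message) {
        // normal processing here
    }

    // invoke from the consumer thread (e.g. inside the listener); see the linked answers for other threading options
    public void rewind(String topic, int partition, long offset) {
        this.seekCallback.get().seek(topic, partition, offset);
    }
}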

Parsing a multi-format, multi-line data file in a Spring Batch job

I am writing a Spring Batch job to process the data file described below and write it into a DB.
The sample data file is of this format: I have multiple headers, and each header has a bunch of rows associated with it.
I can have millions of records for each header, and I can have any number of headers in the flat file I am processing. My requirement is to pick only the few headers I am concerned with.
For each of the picked headers I need to pick up all of its data rows. Each header and its data format is also different. I can receive either of these formats in my processor and need to write them into my DB.
HDR01
A|41|57|Data1|S|62|Data2|9|N|2017-02-01 18:01:05|2017-02-01 00:00:00
A|41|57|Data1|S|62|Data2|9|N|2017-02-01 18:01:05|2017-02-01 00:00:00
HDR02
A|41|57|Data1|S|62|Data2|9|N|
A|41|57|Data1|S|62|Data2|9|N|
I tried exploring PatternMatchingCompositeLineMapper, where I can map each header pattern to a tokenizer and a corresponding FieldSetMapper, but here I need to read the body rows, not just the header.
I don't have any footer, so I can't create an end-of-record policy of my own either.
I also tried AggregateItemReader, but I don't want to club all the records of a header together before processing them.
Each row corresponding to a header should be processed in parallel.
@Bean
public LineMapper<Domain> myLineMapper() {
    PatternMatchingCompositeLineMapper<Domain> mapper = new PatternMatchingCompositeLineMapper<>();

    final Map<String, LineTokenizer> tokenizers = new HashMap<String, LineTokenizer>();
    tokenizers.put("* HDR01*", new DelimitedLineTokenizer());
    tokenizers.put("*HDR02*", new DelimitedLineTokenizer());
    tokenizers.put("*", new DelimitedLineTokenizer("|"));
    mapper.setTokenizers(tokenizers);

    Map<String, FieldSetMapper<Domain>> mappers = new HashMap<String, FieldSetMapper<Domain>>();
    try {
        mappers.put("* HDR01*", customMapper());
        mappers.put("*HDR02*", customMapper());
        mappers.put("*", customMapper());
    } catch (Exception e) {
        e.printStackTrace();
    }
    mapper.setFieldSetMappers(mappers);
    return mapper;
}
Can somebody provide some inputs on how I should achieve this?
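One pattern sometimes used for this kind of file (a sketch only; Domain, its isHeader/getHeaderId/setHeaderId accessors and the header detection are assumptions) is a delegating ItemStreamReader that remembers the most recent header line and stamps it onto every body record, so each row reaches the processor knowing which header it belongs to:

public class HeaderTrackingReader implements ItemStreamReader<Domain> {

    private final FlatFileItemReader<Domain> delegate; // configured with the composite line mapper above
    private String currentHeader;

    public HeaderTrackingReader(FlatFileItemReader<Domain> delegate) {
        this.delegate = delegate;
    }

    @Override
    public Domain read() throws Exception {
        Domain item;
        while ((item = delegate.read()) != null) {
            if (item.isHeader()) {            // hypothetical flag set by the header FieldSetMapper
                currentHeader = item.getHeaderId();
                continue;                     // header lines themselves are not emitted
            }
            item.setHeaderId(currentHeader);  // tag the body row with its header
            return item;
        }
        return null;                          // end of file
    }

    @Override
    public void open(ExecutionContext ctx) { delegate.open(ctx); }

    @Override
    public void update(ExecutionContext ctx) { delegate.update(ctx); }

    @Override
    public void close() { delegate.close(); }
}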

JmsTemplate's browseSelected not retrieving all messages

I have some Java code that reads messages from an ActiveMQ queue. The code uses a JmsTemplate from Spring and I use the "browseSelected" method to retrieve any messages from the queue that have a timestamp in their header older than 7 days (by creating the appropriate criteria as part of the messageSelector parameter).
myJmsTemplate.browseSelected(myQueue, myCriteria, new BrowserCallback<Integer>() {

    @Override
    public Integer doInJms(Session s, QueueBrowser qb) throws JMSException {
        @SuppressWarnings("unchecked")
        final Enumeration<Message> e = qb.getEnumeration();
        int count = 0;
        while (e.hasMoreElements()) {
            final Message m = e.nextElement();
            final TextMessage tm = (TextMessage) MyClass.this.jmsQueueTemplate.receiveSelected(
                    MyClass.this.myQueue, "JMSMessageID = '" + m.getJMSMessageID() + "'");
            myMessages.add(tm);
            count++;
        }
        return count;
    }
});
The BrowserCallback's "doInJms" method adds the messages which match the criteria to a list ("myMessages") which subsequently get processed further.
The issue is that I'm finding the code will only process 400 messages each time it runs, even though there are several thousand messages which match the criteria specified.
When I previously used another queueing technology with this code (IBM MQ), it would process all records which met the criteria.
I'm wondering whether I'm experiencing an issue with ActiveMQ's prefetch limit: http://activemq.apache.org/what-is-the-prefetch-limit-for.html
Versions: ActiveMQ 5.10.1 and Spring 3.2.2.
Thanks in advance for any assistance.
The broker will only return up to 400 messages by default, as configured by the maxBrowsePageSize option in the destination policies. You can increase that value, but use caution: the browsed messages are paged into memory, and a large value can lead you into an OOM situation.
You must always remember that a message broker is not a database; using it as one will generally end in tears.
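For completeness, a sketch of raising that limit from Java on an embedded broker (for a standalone broker the same policy entry goes in activemq.xml; the 10,000 value is only an example):

BrokerService broker = new BrokerService();

PolicyEntry policy = new PolicyEntry();
policy.setQueue(">");                // apply to all queues
policy.setMaxBrowsePageSize(10000);  // default is 400; browsed messages are paged into broker memory

PolicyMap policyMap = new PolicyMap();
policyMap.setDefaultEntry(policy);
broker.setDestinationPolicy(policyMap);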

Setting Time To Live (TTL) from Java - sample requested

EDIT:
This is basically what I want to do, only in Java
Using Elasticsearch, we add documents to an index by passing IndexRequest items to a BulkRequestBuilder.
I would like the documents to be dropped from the index after some time has passed (time to live / TTL).
This can be done either by setting a default for the index or on a per-document basis. Either approach is fine by me.
The code below is an attempt to do it per document. It does not work, and I think that is because TTL is not enabled for the index. Either show me what Java code I need to add to enable TTL so the code below works, or show me different code that enables TTL and sets a default TTL value for the index from Java. I know how to do it from the REST API, but I need to do it from Java code, if at all possible.
    logger.debug("Indexing record ({}): {}", id, map);
    final IndexRequest indexRequest = new IndexRequest(_indexName, _documentType, id);
    final long debug = indexRequest.ttl();
    if (_ttl > 0) {
        indexRequest.ttl(_ttl);
        System.out.println("Setting TTL to " + _ttl);
        System.out.println("IndexRequest now has ttl of " + indexRequest.ttl());
    }
    indexRequest.source(map);
    indexRequest.operationThreaded(false);
    bulkRequestBuilder.add(indexRequest);
}

// execute and block until done.
BulkResponse response;
try {
    response = bulkRequestBuilder.execute().actionGet();
Later I check in my unit test by polling this method, but the document count never goes down.
public long getDocumentCount() throws Exception {
    Client client = getClient();
    try {
        client.admin().indices().refresh(new RefreshRequest(INDEX_NAME)).actionGet();
        ActionFuture<CountResponse> response = client.count(new CountRequest(INDEX_NAME).types(DOCUMENT_TYPE));
        CountResponse countResponse = response.get();
        return countResponse.getCount();
    } finally {
        client.close();
    }
}
After a LONG day of googling and writing test programs, I came up with a working example of how to use ttl and basic index/object creation from the Java API. Frankly most of the examples in the docs are trivial, and some JavaDoc and end-to-end examples would go a LONG way to help those of us who are using the non-REST interfaces.
Ah well.
Code here: Adding mapping to a type from Java - how do I do it?
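For reference, a sketch of the missing piece: enabling _ttl in the type mapping with the old (pre-5.x) Java client so that indexRequest.ttl(...) actually takes effect. _indexName and _documentType are the fields from the question, and note that _ttl has since been removed from Elasticsearch:

client.admin().indices()
        .preparePutMapping(_indexName)
        .setType(_documentType)
        .setSource(XContentFactory.jsonBuilder()
                .startObject()
                    .startObject(_documentType)
                        .startObject("_ttl")
                            .field("enabled", true)
                            .field("default", "7d") // optional index-wide default TTL
                        .endObject()
                    .endObject()
                .endObject())
        .execute().actionGet();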
