Storm-JMS spout collecting Avro messages and sending them downstream? - hadoop

I am new to the Avro format. I am trying to collect Avro messages from a JMS queue using the Storm-JMS spout and send them to HDFS using the HDFS bolt.
The queue is sending Avro, but I am not able to write the messages out in Avro format using the HDFS bolt.
How can I properly collect the Avro messages and send them downstream without encoding errors in HDFS?

The existing HDFS bolt does not support writing Avro files, so we need to work around this with the following changes. In this sample code I take the JMS bytes messages coming from my spout, convert them to Avro, and write them to HDFS.
This code can serve as a sample for modifying the methods in AbstractHdfsBolt.
public void execute(Tuple tuple) {
    try {
        // bytesMessage is assumed to be the JMS BytesMessage carried by the tuple from the JMS spout.
        long length = bytesMessage.getBodyLength();
        byte[] bytes = new byte[(int) length];
        bytesMessage.readBytes(bytes);
        String replyMessage = new String(bytes, "UTF-8"); // not used for the Avro decoding below

        // Decode the raw JMS bytes into an Avro record using the schema loaded in doPrepare().
        datumReader = new SpecificDatumReader<IndexedRecord>(schema);
        decoder = DecoderFactory.get().binaryDecoder(bytes, null);
        result = datumReader.read(null, decoder);

        synchronized (this.writeLock) {
            dataFileWriter.append(result);
            dataFileWriter.sync();
            this.offset += bytes.length;
            if (this.syncPolicy.mark(tuple, this.offset)) {
                if (this.out instanceof HdfsDataOutputStream) {
                    ((HdfsDataOutputStream) this.out).hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH));
                } else {
                    this.out.hsync();
                    this.out.flush();
                }
                this.syncPolicy.reset();
            }
            dataFileWriter.flush();
        }
        if (this.rotationPolicy.mark(tuple, this.offset)) {
            rotateOutputFile(); // synchronized
            this.offset = 0;
            this.rotationPolicy.reset();
        }
    } catch (IOException | JMSException e) {
        LOG.warn("write/sync failed.", e);
        this.collector.fail(tuple);
    }
}
@Override
void closeOutputFile() throws IOException {
    this.out.close();
}

@Override
Path createOutputFile() throws IOException {
    Path path = new Path(this.fileNameFormat.getPath(),
            this.fileNameFormat.getName(this.rotation, System.currentTimeMillis()));
    this.out = this.fs.create(path);
    // Start a new Avro container file on top of the freshly created HDFS stream.
    dataFileWriter.create(schema, out);
    return path;
}

@Override
void doPrepare(Map conf, TopologyContext topologyContext, OutputCollector collector) throws IOException {
    LOG.info("Preparing HDFS Bolt...");
    try {
        // Load the Avro schema used to decode the incoming JMS bytes and to write the container file.
        schema = new Schema.Parser().parse(new File("/home/*******/********SchemafileName.avsc"));
    } catch (IOException e1) {
        e1.printStackTrace();
    }
    this.fs = FileSystem.get(URI.create(this.fsUrl), hdfsConfig);
    datumWriter = new SpecificDatumWriter<IndexedRecord>(schema);
    dataFileWriter = new DataFileWriter<IndexedRecord>(datumWriter);
    JMSAvroUtils JASV = new JMSAvroUtils(); // not used below
}
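For completeness, here is a minimal sketch of how such a bolt could be wired into a topology, assuming Storm 1.x package names; AvroHdfsBolt is a hypothetical name for the AbstractHdfsBolt subclass modified as above, and the JmsProvider/JmsTupleProducer implementations are application-specific:
import javax.jms.Session;

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.jms.JmsProvider;
import org.apache.storm.jms.JmsTupleProducer;
import org.apache.storm.jms.spout.JmsSpout;
import org.apache.storm.topology.TopologyBuilder;

public class JmsAvroToHdfsTopology {

    public static void submit(JmsProvider jmsProvider, JmsTupleProducer tupleProducer) throws Exception {
        // Spout that pulls BytesMessages off the JMS queue.
        JmsSpout jmsSpout = new JmsSpout();
        jmsSpout.setJmsProvider(jmsProvider);
        jmsSpout.setJmsTupleProducer(tupleProducer);
        jmsSpout.setJmsAcknowledgeMode(Session.CLIENT_ACKNOWLEDGE);

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("jms-spout", jmsSpout, 1);
        // AvroHdfsBolt = the modified AbstractHdfsBolt shown above (hypothetical class name).
        builder.setBolt("avro-hdfs-bolt", new AvroHdfsBolt(), 1).shuffleGrouping("jms-spout");

        StormSubmitter.submitTopology("jms-avro-to-hdfs", new Config(), builder.createTopology());
    }
}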

Related

Spring Boot using Apache POI streaming workbook - How do I release the memory taken up

I am writing a simple Spring Boot app that reads from the DB and writes it out to an Excel file using Apache POI. The generated file can contain up to 100K rows and is around 8-10 MB in size.
My controller class:
public ResponseEntity<Resource> getExcelData(
        @RequestBody ExcelRequest request) {
    InputStreamResource file = new InputStreamResource(downloadService.startExcelDownload(request));
    return ResponseEntity.ok()
            .header(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename=myFile.xlsx")
            .contentType(MediaType.parseMediaType("application/vnd.ms-excel"))
            .body(file);
}
Service class:
public ByteArrayInputStream startExcelDownload(ExcelRequest request) {
    /** Apache POI code using SXSSFWorkbook **/
    SXSSFWorkbook workbook = new SXSSFWorkbook(1000);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try {
        // Excel generation logic here
        ...
        workbook.write(out);
        return new ByteArrayInputStream(out.toByteArray());
    } catch (IOException | ParseException e) {
        throw new RuntimeException("fail to import data to Excel file: " + e.getMessage());
    } finally {
        try {
            out.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Here is what I see in VisualVM (screenshot omitted), and in the heap dump:
byte[] 151,521,152 B (41.2%) 3,100,020 (30.7%)
Is there something I have missed? Should the byte[] continue to take up memory after the response has been returned? The memory goes down once I manually run the GC.
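For what it's worth, the retained byte[] memory here comes from buffering the entire workbook in a ByteArrayOutputStream and then copying it again with toByteArray(); those arrays are only reclaimed when a GC runs, which matches the observation that memory drops after a manual GC. Below is a sketch of an alternative (not the original code) that streams the workbook straight to the HTTP response via Spring's StreamingResponseBody and cleans up the SXSSF temp files; the /excel mapping is hypothetical:
@PostMapping("/excel") // hypothetical mapping, for illustration only
public ResponseEntity<StreamingResponseBody> getExcelData(@RequestBody ExcelRequest request) {
    StreamingResponseBody body = outputStream -> {
        SXSSFWorkbook workbook = new SXSSFWorkbook(1000);
        try {
            // Excel generation logic here (rows beyond the window are flushed to temp files, not the heap)
            workbook.write(outputStream); // write directly to the servlet response, no ByteArrayOutputStream
        } finally {
            workbook.dispose(); // delete the SXSSF temp files
            workbook.close();
        }
    };
    return ResponseEntity.ok()
            .header(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename=myFile.xlsx")
            .contentType(MediaType.parseMediaType("application/vnd.ms-excel"))
            .body(body);
}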

Pipeline with multiple sink transformations to the same Elastic cluster is publishing events to a single Elastic sink

We currently have a pipeline with 2 sinks:
S3 sink
Elastic sink
We are consuming messages from Kafka and, after modifying a field on the events based on some rules, the messages are published to both S3 and Elastic.
We have come across a use case where we had to send events to two different indices within the same Elastic cluster.
While trying this, we observed that the messages are being sent to only one of the indices.
ElasticSink:
ElasticsearchSink.Builder<Output> esSinkBuilder = new ElasticsearchSink.Builder<Output>(
        elasticSearchHosts,
        new ElasticsearchSinkFunction<Output>() {
            public IndexRequest createIndexRequest(Output output) {
                IndexRequest request = Requests.indexRequest()
                        .index(...)
                        .source(GsonUtils.toJsonFromObject(output), XContentType.JSON);
                return request;
            }
            @Override
            public void process(Output output, RuntimeContext ctx, RequestIndexer indexer) {
                indexer.add(createIndexRequest(output));
            }
        }
);
esSinkBuilder.setRestClientFactory(restClientBuilder -> {
    restClientBuilder.setHttpClientConfigCallback(httpClientBuilder -> {
        TrustStrategy acceptingTrustStrategy = (certificate, authType) -> true;
        SSLContext sslContext = null;
        try {
            sslContext = SSLContexts.custom().loadTrustMaterial(null, acceptingTrustStrategy).build();
        } catch (Exception e) {
            // Handle error
        }
        httpClientBuilder.setSSLContext(sslContext);
        CredentialsProvider credentialsProvider = new BasicCredentialsProvider();
        credentialsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials("any", "any"));
        return httpClientBuilder.setDefaultCredentialsProvider(credentialsProvider);
    });
});
esSinkBuilder.setFailureHandler(new ActionRequestFailureHandler() {
    @Override
    public void onFailure(ActionRequest action, Throwable failure, int restStatusCode, RequestIndexer indexer) {
        ....
    }
});
return esSinkBuilder.build();
}
I am unable to see a reason why this is happening.
While debugging I can see both sink transformations being added to the env.
I am using Flink 1.13.x.
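If the requirement is simply to get events into two different indices in the same cluster, one option (a sketch, not a confirmed fix for the behaviour described above) is to attach a single Elasticsearch sink and choose the target index per record inside the ElasticsearchSinkFunction; the isAudit() check and index names below are purely illustrative:
new ElasticsearchSinkFunction<Output>() {

    // Decide the target index per record; this routing rule is illustrative only.
    private String indexFor(Output output) {
        return output.isAudit() ? "audit-events" : "business-events";
    }

    private IndexRequest createIndexRequest(Output output) {
        return Requests.indexRequest()
                .index(indexFor(output))
                .source(GsonUtils.toJsonFromObject(output), XContentType.JSON);
    }

    @Override
    public void process(Output output, RuntimeContext ctx, RequestIndexer indexer) {
        // A single sink can add requests targeting different indices to the same indexer.
        indexer.add(createIndexRequest(output));
    }
}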

Spring Boot Kafka @KafkaListener consumer pause/resume not working

I have a Spring Boot Kafka consumer & producer. The consumer is expected to read data from the topic one by one, process it (time consuming) & write the result to another topic, and then manually commit the offset.
In order to avoid rebalancing, I have tried to call pause() and resume() on the KafkaContainer, but the consumer is always running & never responds to the pause() call; I tried it even with a while loop and had no success (unable to pause the consumer). KafkaListenerEndpointRegistry is autowired.
Spring Boot version = 2.6.9, spring-kafka version = 2.8.7
@KafkaListener(id = "c1", topics = "${app.topics.topic1}", containerFactory = "listenerContainerFactory1")
public void poll(ConsumerRecord<String, String> record, Acknowledgment ack) {
    log.info("Received Message by consumer of topic1: " + record.value());
    String result = process(record.value());
    producer.sendMessage(result + " topic2");
    log.info("Message sent from " + topicIn + " to " + topicOut);
    ack.acknowledge();
    log.info("Offset committed by consumer 1");
}

private String process(String value) {
    try {
        pauseConsumer();
        // Perform time intensive network IO operations
        resumeConsumer();
    } catch (InterruptedException e) {
        log.error(e.getMessage());
    }
    return value;
}

private void pauseConsumer() throws InterruptedException {
    if (registry.getListenerContainer("c1").isRunning()) {
        log.info("Attempting to pause consumer");
        Objects.requireNonNull(registry.getListenerContainer("c1")).pause();
        Thread.sleep(5000);
        log.info("kafkalistener container state - " + registry.getListenerContainer("c1").isRunning());
    }
}

private void resumeConsumer() throws InterruptedException {
    if (registry.getListenerContainer("c1").isContainerPaused() || registry.getListenerContainer("c1").isPauseRequested()) {
        log.info("Attempting to resume consumer");
        Objects.requireNonNull(registry.getListenerContainer("c1")).resume();
        Thread.sleep(5000);
        log.info("kafkalistener container state - " + registry.getListenerContainer("c1").isRunning());
    }
}
Am I missing something? Could someone please guide me with the right way of achieving the required behaviour?
You are running the process() method on the listener thread so pause/resume will not have any effect; the pause only takes place when the listener thread exits the listener method (and after it has processed all the records received by the previous poll).
The next version (2.9), due later this month, has a new property pauseImmediate, which causes the pause to take effect after the current record is processed.
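Once 2.9 is out, that property can presumably be enabled on the container factory. A minimal sketch, assuming it is exposed on ContainerProperties as setPauseImmediate (check the 2.9 API when it lands):
@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> listenerContainerFactory1(
        ConsumerFactory<String, String> consumerFactory) {
    ConcurrentKafkaListenerContainerFactory<String, String> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    // Assumed 2.9 property: a requested pause takes effect after the record currently
    // being processed, instead of only after the rest of the poll batch.
    factory.getContainerProperties().setPauseImmediate(true);
    return factory;
}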
You can try it like this. It works for me:
public class kafkaConsumer {

    public void run(String topicName) {
        try {
            Consumer<String, String> consumer = new KafkaConsumer<>(config); // config: your consumer properties
            consumer.subscribe(Collections.singleton(topicName));
            while (true) {
                try {
                    ConsumerRecords<String, String> consumerRecords = consumer.poll(Duration.ofMillis(80000));
                    for (TopicPartition partition : consumerRecords.partitions()) {
                        List<ConsumerRecord<String, String>> partitionRecords = consumerRecords.records(partition);
                        for (ConsumerRecord<String, String> record : partitionRecords) {
                            String kafkaEvent = record.value();
                            // Pause all assigned partitions while the record is processed.
                            consumer.pause(consumer.assignment());
                            /** Implement your business logic here **/
                            // Once your processing is done, resume consumption.
                            consumer.resume(consumer.assignment());
                            try {
                                consumer.commitSync();
                            } catch (CommitFailedException e) {
                            }
                        }
                    }
                } catch (Exception e) {
                    continue;
                }
            }
        } catch (Exception e) {
        }
    }
}

JMS Message Persistence on ActiveMQ

I need to ensure redelivery of JMS messages when the consumer fails.
The way the producer is set up now: DefaultJmsListenerContainerFactory and Session.AUTO_ACKNOWLEDGE.
I'm trying to build a jar that saves the message on the server; once the app is able to consume again, the producer in the jar will produce the message to the app.
Is that a good approach? Is there any other way/recommendation to improve this?
public void handleMessagePersistence(final Object bean) throws JsonProcessingException {
    ObjectMapper mapper = new ObjectMapper();
    final String beanJson = mapper.writeValueAsString(bean); // I might need to convert to XML instead
    // parameterize location of persistence folder
    writeToDriver(beanJson);
    try {
        Producer.produceMessage(beanJson, beanJson, null, null, null);
    } catch (final Exception e) {
        LOG.error("Error producing message ", e);
    }
}
Here is what I have to write out the message:
private void writeToDriver(String beanJson) {
    File filename = new File(JMS_LOCATION +
            LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")) + ".xml");
    try (final FileWriter fileOut = new FileWriter(filename)) {
        try (final BufferedWriter out = new BufferedWriter(fileOut)) {
            out.write(beanJson);
            out.flush();
        }
    } catch (Exception e) {
        LOG.error("Unable to write out : " + beanJson, e);
    }
}
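A possible alternative to hand-rolled file persistence (a sketch, not the poster's setup): rely on the broker for redelivery by consuming in a transacted session, so that an exception thrown by the listener rolls the message back onto the queue and ActiveMQ's redelivery policy governs the retries. The broker URL and limits below are illustrative:
import javax.jms.ConnectionFactory;

import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.RedeliveryPolicy;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jms.config.DefaultJmsListenerContainerFactory;

@Configuration
public class JmsRedeliveryConfig {

    @Bean
    public ConnectionFactory connectionFactory() {
        ActiveMQConnectionFactory cf = new ActiveMQConnectionFactory("tcp://localhost:61616"); // illustrative broker URL
        RedeliveryPolicy redelivery = new RedeliveryPolicy();
        redelivery.setMaximumRedeliveries(5);        // after 5 failed attempts the message is dead-lettered
        redelivery.setInitialRedeliveryDelay(1000L); // wait 1s before the first redelivery
        cf.setRedeliveryPolicy(redelivery);
        return cf;
    }

    @Bean
    public DefaultJmsListenerContainerFactory jmsListenerContainerFactory(ConnectionFactory connectionFactory) {
        DefaultJmsListenerContainerFactory factory = new DefaultJmsListenerContainerFactory();
        factory.setConnectionFactory(connectionFactory);
        // Transacted session: an exception in the listener rolls the message back to the
        // broker, which redelivers it according to the policy configured above.
        factory.setSessionTransacted(true);
        return factory;
    }
}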

How to transfer *.pgp files using SFTP Spring Integration

We are developing a generic automated application which will download *.pgp files from an SFTP server.
The application works fine with *.txt files, but when we try to pull *.pgp files we get the exception below.
2016-03-18 17:45:45 INFO jsch:52 - SSH_MSG_SERVICE_REQUEST sent
2016-03-18 17:45:46 INFO jsch:52 - SSH_MSG_SERVICE_ACCEPT received
2016-03-18 17:45:46 INFO jsch:52 - Next authentication method: publickey
2016-03-18 17:45:48 INFO jsch:52 - Authentication succeeded (publickey).
sftpSession org.springframework.integration.sftp.session.SftpSession#37831f
files size158
java.io.IOException: inputstream is closed
at com.jcraft.jsch.ChannelSftp.fill(ChannelSftp.java:2884)
at com.jcraft.jsch.ChannelSftp.header(ChannelSftp.java:2908)
at com.jcraft.jsch.ChannelSftp.access$500(ChannelSftp.java:36)
at com.jcraft.jsch.ChannelSftp$2.read(ChannelSftp.java:1390)
at com.jcraft.jsch.ChannelSftp$2.read(ChannelSftp.java:1340)
at org.springframework.util.StreamUtils.copy(StreamUtils.java:126)
at org.springframework.util.FileCopyUtils.copy(FileCopyUtils.java:109)
at org.springframework.integration.sftp.session.SftpSession.read(SftpSession.java:129)
at com.sftp.test.SFTPTest.main(SFTPTest.java:49)
Java code:
public class SFTPTest {

    public static void main(String[] args) {
        ApplicationContext applicationContext = new ClassPathXmlApplicationContext("beans.xml");
        DefaultSftpSessionFactory defaultSftpSessionFactory = applicationContext.getBean("defaultSftpSessionFactory", DefaultSftpSessionFactory.class);
        System.out.println(defaultSftpSessionFactory);
        SftpSession sftpSession = defaultSftpSessionFactory.getSession();
        System.out.println("sftpSession " + sftpSession);
        String remoteDirectory = "/";
        String localDirectory = "C:/312421/temp/";
        OutputStream outputStream = null;
        List<String> fileAtSFTPList = new ArrayList<String>();
        try {
            String[] fileNames = sftpSession.listNames(remoteDirectory);
            for (String fileName : fileNames) {
                boolean isMatch = fileCheckingAtSFTPWithPattern(fileName);
                if (isMatch) {
                    fileAtSFTPList.add(fileName);
                }
            }
            System.out.println("files size" + fileAtSFTPList.size());
            for (String fileName : fileAtSFTPList) {
                File file = new File(localDirectory + fileName);
                /*InputStream ipstream = sftpSession.readRaw(fileName);
                FileUtils.writeByteArrayToFile(file, IOUtils.toByteArray(ipstream));
                ipstream.close();*/
                outputStream = new FileOutputStream(file);
                sftpSession.read(remoteDirectory + fileName, outputStream);
                outputStream.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (outputStream != null)
                    outputStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public static boolean fileCheckingAtSFTPWithPattern(String fileName) {
        Pattern pattern = Pattern.compile(".*\\.pgp$");
        Matcher matcher = pattern.matcher(fileName);
        if (matcher.find()) {
            return true;
        }
        return false;
    }
}
Please suggest how to sort out this issue.
Thanks
The file type is irrelevant to Spring Integration - it looks like the server is closing the connection while reading the preamble - before the data is being fetched...
at com.jcraft.jsch.ChannelSftp.header(ChannelSftp.java:2908)
at com.jcraft.jsch.ChannelSftp.access$500(ChannelSftp.java:36)
at com.jcraft.jsch.ChannelSftp$2.read(ChannelSftp.java:1390)
at com.jcraft.jsch.ChannelSftp$2.read(ChannelSftp.java:1340)
The data itself is not read until later (line 1442 in ChannelSftp).
So it looks like a server-side problem.
