What is the best way to fetch millions of rows at a time in Spring Boot?

I have a Spring Boot application, and for a particular feature I have to prepare a CSV every day for another service to use. The job runs every day at 6 AM and dumps the CSV on the server. The issue is that the data set is big: around 7.8 million rows. I am using Spring Data JPA to fetch all the records. Is there any better way to make this more efficient? Here's my code:
@Scheduled(cron = "0 1 6 * * ?")
public void saveMasterChildList() {
    log.debug("running write job");
    DateFormat dateFormatter = new SimpleDateFormat("dd_MM_yy");
    String currentDateTime = dateFormatter.format(new Date());
    String fileName = currentDateTime + "_Master_Child.csv";
    ICsvBeanWriter beanWriter = null;
    List<MasterChild> masterChildren = masterChildRepository.findByMsisdnIsNotNull();
    try {
        beanWriter = new CsvBeanWriter(new FileWriter(new File("/u01/edw_bill/", fileName)),
                CsvPreference.STANDARD_PREFERENCE);
        String[] header = {"msisdn"};
        String[] nameMapping = {"msisdn"};
        beanWriter.writeHeader(header);
        for (MasterChild masterChild : masterChildren) {
            beanWriter.write(masterChild, nameMapping);
        }
    } catch (IOException e) {
        log.debug("Error writing the CSV file {}", e.toString());
    } finally {
        if (beanWriter != null) {
            try {
                beanWriter.close();
            } catch (IOException e) {
                log.debug("Error closing the writer {}", e.toString());
            }
        }
    }
}

You could use pagination to split the data and load it chunk by chunk. See this.
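As an illustration only, here is a minimal sketch of that idea using Spring Data's Pageable. The paged repository method and the page size are assumptions, not part of the original code; it would replace the findByMsisdnIsNotNull() call and the write loop inside the scheduled method:

// Hypothetical paged variant of the repository method:
// Page<MasterChild> findByMsisdnIsNotNull(Pageable pageable);

int pageSize = 50_000; // tune to the available heap
Pageable pageable = PageRequest.of(0, pageSize, Sort.by("id"));
Page<MasterChild> page;
do {
    page = masterChildRepository.findByMsisdnIsNotNull(pageable);
    for (MasterChild masterChild : page.getContent()) {
        beanWriter.write(masterChild, nameMapping);
    }
    pageable = pageable.next();
} while (page.hasNext());

Another option along the same lines is to have the repository return a Stream<MasterChild> from a read-only @Transactional method (with a JDBC fetch-size hint), so rows are pulled from the database incrementally instead of being materialized as one 7.8-million-element list.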

Related

Java stream is doing weird things when generating a CSV file in Spring Boot

I'm generating a CSV file in my Spring Boot app; the file is meant to be downloaded, and I use streams. The problem, which I can't figure out, is that some rows are written with all of their columns, while other rows get only some columns and the leftover columns are written below as if they were a new row. I hope you understand what I mean and can give me a hand; thank you in advance.
The code below is the controller:
.....
@RequestMapping(value = "/stream/csv/{grupo}/{iduser}", method = RequestMethod.GET)
public void generateCSVUsingStream(@PathVariable("grupo") String grupo,
        @PathVariable("iduser") String userId, HttpServletResponse response) {
    response.addHeader("Content-Type", "application/csv");
    response.addHeader("Content-Disposition", "attachment; filename=\"" + userId + "_Reporte_PayCash" + grupo.replaceAll("\\s", "") + ".csv");
    response.setCharacterEncoding("UTF-8");
    try (Stream<ReportePayCashDTO> streamPaycashdatos = capaDatosDao.ReportePayCashStream(userId, grupo);
         PrintWriter out = response.getWriter()) {
        //PrintWriter out = response.getWriter();
        out.write(String.join(",", "Cuenta", "Referencia", "Referencia_paycash", "Distrito", "Plaza", "Cartera"));
        out.write("\n");
        streamPaycashdatos.forEach(streamdato -> {
            out.write(streamdato.getAccount() + "," + streamdato.getReferencia() + "," + streamdato.getReferenciapaycash()
                    + "," + streamdato.getCartera() + "," + streamdato.getState() + "," + streamdato.getCity());
            out.append("\r\n");
        });
        out.flush();
        out.close();
        streamPaycashdatos.close();
    } catch (IOException ix) {
        throw new RuntimeException("There is an error while downloading file", ix);
    }
}
The DAO method is this:
...
@Override
public Stream<ReportePayCashDTO> ReportePayCashStream(String userId, String grupo) {
    // TODO Auto-generated method stub
    Stream<ReportePayCashDTO> stream = null;
    String query = "";
    //more code
    try {
        stream = getJdbcTemplate().queryForStream(query, (rs, rowNum) -> {
            return new ReportePayCashDTO(Utils.valnull(rs.getString("account")),
                    Utils.valnull(rs.getString("reference")),
                    Utils.valnull(rs.getString("referencepaycash")),
                    Utils.valnull(rs.getString("state")),
                    Utils.valnull(rs.getString("city")),
                    Utils.valnull(rs.getString("cartera"))
            );
        });
    } catch (Exception e) {
        e.printStackTrace();
        logger.error(e.getMessage());
    }
    return stream;
}
Example: this is what I hoped would be written into the CSV file:
55xxxxx02,88xxxx153,1170050202662,TAMAULIPAS,TAMPICO,AmericanExpre
58xxxxx25,88xxx899,1170050202662,TAMAULIPAS,TAMPICO,AmericanClasic
but some rows were written like this:
55xxxxx02,88xxxx153,1170050202662
,TAMAULIPAS,TAMPICO,AmericanExpre
58xxxxx25,88xxx899,1170050202662
,TAMAULIPAS,TAMPICO,AmericanClasic
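A common cause of rows breaking apart like this is a column value that itself contains a carriage return or line feed (the split in the sample happens right after one particular field). As a hedged sketch only, and assuming that is the cause here, the values could be sanitized before writing; the sanitize helper below is hypothetical and not part of the original code:

// Hypothetical helper: strip embedded CR/LF so a single value cannot break a CSV row.
private static String sanitize(String value) {
    return value == null ? "" : value.replaceAll("[\r\n]+", " ").trim();
}

// Inside the forEach in the controller:
out.write(String.join(",",
        sanitize(streamdato.getAccount()),
        sanitize(streamdato.getReferencia()),
        sanitize(streamdato.getReferenciapaycash()),
        sanitize(streamdato.getCartera()),
        sanitize(streamdato.getState()),
        sanitize(streamdato.getCity())));
out.write("\r\n");

A CSV library that quotes fields (such as the Super CSV writer used in the question above) avoids this class of problem altogether.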

Spring Boot - Handle CSV as well as Excel Multipart file

I have a REST API in a Spring Boot application that takes a param of type MultipartFile.
There is a possibility that the user may import either a CSV file or an Excel (.xlsx / .xls) file of huge size, which needs to be handled.
I am using Apache POI to read the Excel type and it works fine. Given my existing code, how do I efficiently handle CSV file reading as well?
Below is the Excel file reading code:
@RequestMapping(value = "/read", method = RequestMethod.POST)
@Transactional
public Map<String, String> read(@RequestParam("file") MultipartFile file) {
    Map<String, String> response = new HashMap<>();
    if (!file.isEmpty()) {
        ByteArrayInputStream stream;
        Workbook wb;
        StringBuilder contentSb = new StringBuilder();
        try {
            stream = new ByteArrayInputStream(file.getBytes());
            wb = WorkbookFactory.create(stream);
            org.apache.poi.ss.usermodel.Sheet sheet = wb.getSheetAt(wb.getActiveSheetIndex());
            Iterator<Row> rowIterator = sheet.rowIterator();
            System.out.println("Processing Excel file");
            for (int rowIndex = 0; rowIndex <= sheet.getLastRowNum(); rowIndex++) {
                Row row = sheet.getRow(rowIndex);
                if (row != null) {
                    Cell cell = row.getCell(0);
                    if (cell != null) {
                        contentSb.append(cell.getStringCellValue() + ",");
                    }
                }
            }
            System.out.println("Processed Excel file");
        } catch (Exception e) {
            e.printStackTrace();
        }
        return response;
    } else {
        return response;
    }
}
Thank you in advance!
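For the CSV side, one possibility is to branch on the uploaded file name and stream the file line by line instead of buffering it, which keeps memory flat for large uploads. This is a minimal sketch under that assumption; the first-column handling simply mirrors the Excel branch and is not from the original post:

// Hedged sketch: handle a .csv upload by streaming it line by line.
if (file.getOriginalFilename() != null && file.getOriginalFilename().toLowerCase().endsWith(".csv")) {
    StringBuilder contentSb = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(file.getInputStream(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // take the first column of each row, mirroring the Excel branch above
            contentSb.append(line.split(",", -1)[0]).append(",");
        }
        System.out.println("Processed CSV file");
    } catch (IOException e) {
        e.printStackTrace();
    }
    return response;
}

For CSV values that may contain quoted commas or embedded newlines, a real parser (for example Apache Commons CSV or the Super CSV reader) is safer than a plain split.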

Nifi Processor gets triggered twice for single Input flow file

I am new to Apache NiFi and still exploring it.
I made a custom processor that fetches data from a server with pagination.
I pass in an input flow file that contains the attribute "url".
Finally I transfer the response to an output flow file; since I fetch data with pagination, I create a new output flow file for each page and transfer it to the SUCCESSFUL relationship.
Below is the code part:
@Override
public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
    FlowFile incomingFlowFile = session.get();
    String api = null;
    if (incomingFlowFile == null) {
        logger.info("empty input flow file");
        session.commit();
        return;
    } else {
        api = incomingFlowFile.getAttribute("url");
    }
    session.remove(incomingFlowFile);
    if (api == null) {
        logger.warn("API url is null");
        session.commit();
        return;
    }
    int page = Integer.parseInt(context.getProperty(PAGE).getValue());
    while (page < 3) {
        try {
            String url = api + "&curpg=" + page;
            logger.info("input url is: {}", url);
            HttpResponse response = httpGetApiCall(url, 10000);
            if (response == null || response.getEntity() == null) {
                logger.warn("response null");
                session.commit();
                return;
            }
            String resp = EntityUtils.toString(response.getEntity());
            InputStream is = new ByteArrayInputStream(StandardCharsets.UTF_16.encode(resp).array());
            FlowFile outFlowFile = session.create();
            outFlowFile = session.importFrom(is, outFlowFile);
            session.transfer(outFlowFile, SUCCESSFUL);
        } catch (IOException e) {
            logger.warn("IOException :{}", e.getMessage());
            return;
        }
        ++page;
    }
    session.commit();
}
I am facing an issue where, for a single input flow file, this processor gets triggered twice and so generates 4 flow files for that one input flow file.
I am not able to figure out where I have gone wrong.
Please help with this issue.
Thanks in advance.
[Screenshot: processor group 1 (Nifi_Parvin)]
[Screenshot: processor group 2 (News_Point_custom)]

Confluent HDFS Connector is losing messages

Community, could you please help me understand why ~3% of my messages don't end up in HDFS? I wrote a simple producer in Java to generate 10 million messages.
public static final String TEST_SCHEMA = "{"
        + "\"type\":\"record\","
        + "\"name\":\"myrecord\","
        + "\"fields\":["
        + "  { \"name\":\"str1\", \"type\":\"string\" },"
        + "  { \"name\":\"str2\", \"type\":\"string\" },"
        + "  { \"name\":\"int1\", \"type\":\"int\" }"
        + "]}";

public KafkaProducerWrapper(String topic) throws UnknownHostException {
    // store topic name
    this.topic = topic;
    // initialize kafka producer
    Properties config = new Properties();
    config.put("client.id", InetAddress.getLocalHost().getHostName());
    config.put("bootstrap.servers", "myserver-1:9092");
    config.put("key.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
    config.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
    config.put("schema.registry.url", "http://myserver-1:8089");
    config.put("acks", "all");
    producer = new KafkaProducer(config);
    // parse schema
    Schema.Parser parser = new Schema.Parser();
    schema = parser.parse(TEST_SCHEMA);
}

public void send() {
    // generate key
    int key = (int) (Math.random() * 20);
    // generate record
    GenericData.Record r = new GenericData.Record(schema);
    r.put("str1", "text" + key);
    r.put("str2", "text2" + key);
    r.put("int1", key);
    final ProducerRecord<String, GenericRecord> record = new ProducerRecord<>(topic, "K" + key, (GenericRecord) r);
    producer.send(record, new Callback() {
        public void onCompletion(RecordMetadata metadata, Exception e) {
            if (e != null) {
                logger.error("Send failed for record {}", record, e);
                messageErrorCounter++;
                return;
            }
            logger.debug("Send succeeded for record {}", record);
            messageCounter++;
        }
    });
}

public String getStats() { return "Messages sent: " + messageCounter + "/" + messageErrorCounter; }

public long getMessageCounter() {
    return messageCounter + messageErrorCounter;
}

public void close() {
    producer.close();
}

public static void main(String[] args) throws InterruptedException, UnknownHostException {
    // initialize kafka producer
    KafkaProducerWrapper kafkaProducerWrapper = new KafkaProducerWrapper("my-test-topic");
    long max = 10000000L;
    for (long i = 0; i < max; i++) {
        kafkaProducerWrapper.send();
    }
    logger.info("producer-demo sent all messages");
    while (kafkaProducerWrapper.getMessageCounter() < max) {
        logger.info(kafkaProducerWrapper.getStats());
        Thread.sleep(2000);
    }
    logger.info(kafkaProducerWrapper.getStats());
    kafkaProducerWrapper.close();
}
And I use the Confluent HDFS Connector in standalone mode to write data to HDFS. The configuration is as follows:
name=hdfs-consumer-test
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=my-test-topic
hdfs.url=hdfs://my-cluster/kafka-test
hadoop.conf.dir=/etc/hadoop/conf/
flush.size=100000
rotate.interval.ms=20000
# increase timeouts to avoid CommitFailedException
consumer.session.timeout.ms=300000
consumer.request.timeout.ms=310000
heartbeat.interval.ms=60000
session.timeout.ms=100000
The connector writes the data into HDFS, but after waiting for 20000 ms (due to rotate.interval.ms) not all messages are received.
scala> spark.read.avro("/kafka-test/topics/my-test-topic/partition=*/my-test-topic*")
.count()
res0: Long = 9749015
Any idea what the reason for this behavior is? Where is my mistake? I'm using Confluent 3.0.1 / Kafka 0.10.0.1.
Are you seeing that the last few messages are not moved to HDFS? If so, it's likely you are running into the issue described here: https://github.com/confluentinc/kafka-connect-hdfs/pull/100
Try sending one more message to the topic after rotate.interval.ms has expired to validate that this is what you are running into. If you need to rotate based on time, it's probably a good idea to upgrade to pick up the fix.
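To make that check concrete, here is a sketch using the KafkaProducerWrapper from the question; the sleep duration is an assumption, simply anything longer than rotate.interval.ms:

// Sketch: after the main run finishes, wait past rotate.interval.ms (20000 ms here)
// and send one extra record; it should force the connector to rotate the open file,
// so the previously "missing" tail of messages appears in HDFS.
Thread.sleep(25000);
kafkaProducerWrapper.send();
kafkaProducerWrapper.close();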

What is a GDATA extension profile?

I want to get the XML in Atom format of a Google Docs spreadsheet using the [generateAtom(..,..)][1] method of the class BaseEntry, which a SpreadsheetEntry inherits. But I don't understand the second parameter of the method, ExtensionProfile. What is it, and will this method call suffice if I just want to get the XML in Atom format?
XmlWriter x = new XmlWriter();
spreadSheetEntry.generateAtom(x,new ExtensionProfile());
[1]: http://code.google.com/apis/gdata/javadoc/com/google/gdata/data/BaseEntry.html#generateAtom(com.google.gdata.util.common.xml.XmlWriter, com.google.gdata.data.ExtensionProfile)
From the JavaDoc for ExtensionProfile:
A profile is a set of allowed extensions for each type together with additional properties.
Usually, if you've got a service, you can ask it for its extension profile using Service.getExtensionProfile().
Elaborating on Jon Skeet's answer, you need to instantiate a service like this:
String developer_key = "mySecretDeveloperKey";
String client_id = "myApplicationsClientId";
YouTubeService service = new YouTubeService(client_id, developer_key);
Then you can write to a file using the extension profile of your service:
static void write_video_entry(VideoEntry video_entry) {
    try {
        String cache_file_path = Layout.get_cache_file_path(video_entry);
        File cache_file = new File(cache_file_path);
        Writer writer = new FileWriter(cache_file);
        XmlWriter xml_writer = new XmlWriter(writer);
        ExtensionProfile extension_profile = service.getExtensionProfile();
        video_entry.generateAtom(xml_writer, extension_profile);
        xml_writer.close();
        writer.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Analogously, you can read a file using the extension profile of your service:
static VideoFeed read_video_feed(File cache_file_file) {
    VideoFeed video_feed = new VideoFeed();
    try {
        InputStream input_stream = new FileInputStream(cache_file_file);
        ExtensionProfile extension_profile = service.getExtensionProfile();
        try {
            video_feed.parseAtom(extension_profile, input_stream);
        } catch (ParseException e) {
            e.printStackTrace();
        }
        input_stream.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return video_feed;
}
