What I have?
Spring Integration that watch recursively a folder for new CSV's files; and send them back to Spring batch.
The job: read the CSV file; in the processor, I modify some data in the items; then I use a custom writer to save my data on the DB.
Problem?
In fact that I have dynamic number of CSV beeing send to the batch. I want that my job commit interval will be based on the number of items (lines) present in the CSV's file. In other way, I don't want to commit my data in every fixed number of item, but every end of file. Exemple: CSV 1 have 200 Lines, I want to process all the lines, writes them, commit, close the transaction then read the next CSV.
I have two idea, but I didn't know whoch is the perfect and how to implement it:
Get from the reader the number of lines in my CSV and send it to my commit interval using a job parameter argument like so #{jobParameters['commit.interval.value']}
Implement a Custom Completion Policy to replace my commit inteval, how to implement isComplete() Do you have any exemples? Github project?
But before all that, how can I get the number of items?
Could any one helps me? a code sample maybe?
Thank you in advance.
No answer, but I found a solution
I'm using a Dynamic commit interval instead of a completion policy.
With Spring batch integration, I can use a transformer to send my file to the batch, for that I have a custom class FileMessageToJobRequest in that one I added this function that helps me to get the count lines
public static int countLines(String filename) throws IOException {
InputStream is = new BufferedInputStream(new FileInputStream(filename));
try {
byte[] c = new byte[1024];
int count = 0;
int readChars = 0;
boolean empty = true;
while ((readChars = is.read(c)) != -1) {
empty = false;
for (int i = 0; i < readChars; ++i) {
if (c[i] == '\n') {
++count;
}
}
}
return (count == 0 && !empty) ? 1 : count;
} finally {
is.close();
}
}
and this one to send parameters
#Transformer
public JobLaunchRequest toRequest(Message<File> message) throws IOException{
JobParametersBuilder jobParametersBuilder = new JobParametersBuilder();
jobParametersBuilder.addString("commit.interval", Integer.toString(countLines(message.getPayload().getAbsolutePath())));
return new JobLaunchRequest(job, jobParametersBuilder.toJobParameters());
}
and in my job context, I just added this commit-interval="#{jobParameters['commit.interval']}"
Hope it help someone in need ;)
Related
I am trying to read an Excel file in manipulate it or add new data to it and write it back out. I am also trying to do this a complete reactive process using Flux and Mono. The Idea is to return the resulting file or bytearray via a webservice.
My question is how do I get a InputStream and OutputStream in a non blocking way?
I am using the Apache Poi library to read and generate the Excel File.
I currently have a solution based around a mix of Mono.fromCallable() and Blocking code getting the Input Stream.
For example the webservice part is as follows.
#GetMapping(value = API_BASE_PATH + "/download", produces = "application/vnd.ms-excel")
public Mono<ByteArrayResource> download() {
Flux<TimeKeepingEntry> createExcel = excelExport.createDocument(false);
return createExcel.then(Mono.fromCallable(() -> {
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
excelExport.getWb().write(outputStream);
return new ByteArrayResource(outputStream.toByteArray());
}).subscribeOn(Schedulers.elastic()));
}
And the Processing of the file:
public Flux<TimeKeepingEntry> createDocument(boolean all) {
Flux<TimeKeepingEntry> entries = null;
try {
InputStream inputStream = new ClassPathResource("Timesheet Template.xlsx").getInputStream();
wb = WorkbookFactory.create(inputStream);
Sheet sheet = wb.getSheetAt(0);
log.info("Created document");
if (all) {
//all entries
} else {
entries = service.findByMonth(currentMonthName).log("Excel Export - retrievedMonths").sort(Comparator.comparing(TimeKeepingEntry::getDateOfMonth)).doOnNext(timeKeepingEntry-> {
this.populateEntry(sheet, timeKeepingEntry);
});
}
} catch (IOException e) {
log.error("Error Importing File", e);
}
return entries;
}
This works well enough but not very in line with Flux and Mono. Some guidance here would be good. I would prefer to have the whole sequence non-blocking.
Unfortunately the WorkbookFactory.create() operation is blocking, so you have to perform that operation using imperative code. However fetching each timeKeepingEntry can be done reactively. Your code would looks something like this:
public Flux<TimeKeepingEntry> createDocument() {
return Flux.generate(
this::getWorkbookSheet,
(sheet, sink) -> {
sink.next(getNextTimeKeepingEntryFrom(sheet));
},
this::closeWorkbook);
}
This will keep the workbook in memory, but will fetch each entry on demand when the elements of the Flux are requested.
I'm trying to write a custom Nifi processor which will take in the contents of the incoming flow file, perform some math operations on it, then write the results into an outgoing flow file. Is there a way to dump the contents of the incoming flow file into a string or something? I've been searching for a while now and it doesn't seem that simple. If anyone could point me toward a good tutorial that deals with doing something like that it would be greatly appreciated.
The Apache NiFi Developer Guide documents the process of creating a custom processor very well. In your specific case, I would start with the Component Lifecycle section and the Enrich/Modify Content pattern. Any other processor which does similar work (like ReplaceText or Base64EncodeContent) would be good examples to learn from; all of the source code is available on GitHub.
Essentially you need to implement the #onTrigger() method in your processor class, read the flowfile content and parse it into your expected format, perform your operations, and then re-populate the resulting flowfile content. Your source code will look something like this:
#Override
public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
FlowFile flowFile = session.get();
if (flowFile == null) {
return;
}
final ComponentLog logger = getLogger();
AtomicBoolean error = new AtomicBoolean();
AtomicReference<String> result = new AtomicReference<>(null);
// This uses a lambda function in place of a callback for InputStreamCallback#process()
processSession.read(flowFile, in -> {
long start = System.nanoTime();
// Read the flowfile content into a String
// TODO: May need to buffer this if the content is large
try {
final String contents = IOUtils.toString(in, StandardCharsets.UTF_8);
result.set(new MyMathOperationService().performSomeOperation(contents));
long stop = System.nanoTime();
if (getLogger().isDebugEnabled()) {
final long durationNanos = stop - start;
DecimalFormat df = new DecimalFormat("#.###");
getLogger().debug("Performed operation in " + durationNanos + " nanoseconds (" + df.format(durationNanos / 1_000_000_000.0) + " seconds).");
}
} catch (Exception e) {
error.set(true);
getLogger().error(e.getMessage() + " Routing to failure.", e);
}
});
if (error.get()) {
processSession.transfer(flowFile, REL_FAILURE);
} else {
// Again, a lambda takes the place of the OutputStreamCallback#process()
FlowFile updatedFlowFile = session.write(flowFile, (in, out) -> {
final String resultString = result.get();
final byte[] resultBytes = resultString.getBytes(StandardCharsets.UTF_8);
// TODO: This can use a while loop for performance
out.write(resultBytes, 0, resultBytes.length);
out.flush();
});
processSession.transfer(updatedFlowFile, REL_SUCCESS);
}
}
Daggett is right that the ExecuteScript processor is a good place to start because it will shorten the development lifecycle (no building NARs, deploying, and restarting NiFi to use it) and when you have the correct behavior, you can easily copy/paste into the generated skeleton and deploy it once.
i'm doing a simple batch job with Spring Batch and Spring Boot.
I need to read a flat file, separate the header data (first line) from the body data (rest of lines) for individual business logic processing and then write everything into a single file.
As you can see, the header has 5 params that have to be mapped to one class, and the body has 12 which have to be mapped to a different one.
I first thought of using FlatFileItemReader and skip the header. Then use the skippedLinesCallback to handle that line, but i couldn't figure out how to do it.
I'm new to Spring Batch and Java Config. If someone can help me writing a solution for my problem i would really aprecciate it!
I leave here the input file:
01.01.2017|SUBDCOBR|12:21:23|01/12/2016|31/12/2016
01.01.2017|12345678231234|0002342434|BORGIA RUBEN|27-32548987-9|FA|A|2062-
00010443/444/445|142,12|30/08/2017|142,01
01.01.2017|12345673201234|2342434|ALVAREZ ESTHER|27-32533987-9|FA|A|2062-
00010443/444/445|142,12|30/08/2017|142,02
01.01.2017|12345673201234|0002342434|LOPEZ LUCRECIA|27-32553387-9|FA|A|2062-
00010443/444/445|142,12|30/08/2017|142,12
01.01.2017|12345672301234|0002342434|SILVA JESUS|27-32558657-9|NC|A|2062-
00010443|142,12|30/08/2017|142,12
Cheers!
EDIT 1:
This would be my first attepmt . My "body" POJO is called DetalleFacturacion and my "header" POJO is CabeceraFacturacion. The reader I thought to do it with DetalleFacturacion pojo, so i can skip the header and treat it later... however i'm not sure how to assign header's data into CabeceraFacturacion.
public FlatFileItemReader<DetalleFacturacion> readerDetalleFacturacion(){
FlatFileItemReader<DetalleFacturacion> reader = new FlatFileItemReader<>();
reader.setLinesToSkip(1);
reader.setResource(new ClassPathResource("/inputFiles/GLEO-MN170100-PROCESO01-SUBDFACT-000001.txt"));
DefaultLineMapper<DetalleFacturacion> detalleLineMapper = new DefaultLineMapper<>();
DelimitedLineTokenizer tokenizerDet = new DelimitedLineTokenizer("|");
tokenizerDet.setNames(new String[] {"fechaEmision", "tipoDocumento", "letra", "nroComprobante",
"nroCliente", "razonSocial", "cuit", "montoNetoGP", "montoNetoG3",
"montoExento", "impuestos", "montoTotal"});
LineCallbackHandler skippedLineCallback = new LineCallbackHandler() {
#Override
public void handleLine(String line) {
String[] headerSeparado = line.split("|");
String printDate = headerSeparado[0];
String reportIdentifier = headerSeparado[1];
String tituloReporte = headerSeparado[2];
String fechaDesde = headerSeparado[3];
String fechaHasta = headerSeparado[4];
CabeceraFacturacion cabeceraFacturacion = new CabeceraFacturacion();
cabeceraFacturacion.setPrintDate(printDate);
cabeceraFacturacion.setReportIdentifier(reportIdentifier);
cabeceraFacturacion.setTituloReporte(tituloReporte);
cabeceraFacturacion.setFechaDesde(fechaDesde);
cabeceraFacturacion.setFechaHasta(fechaHasta);
}
};
reader.setSkippedLinesCallback(skippedLineCallback);
detalleLineMapper.setLineTokenizer(tokenizerDet);
detalleLineMapper.setFieldSetMapper(new DetalleFieldSetMapper());
detalleLineMapper.afterPropertiesSet();
reader.setLineMapper(detalleLineMapper);
// Test to check if it is saving correctly data in CabeceraFacturacion
CabeceraFacturacion cabeceraFacturacion = new CabeceraFacturacion();
System.out.println("Print Date:"+cabeceraFacturacion.getPrintDate());
System.out.println("Report Identif:
"+cabeceraFacturacion.getReportIdentifier());
return reader;
}
You are correct . You need to use skippedLinesCallback to handle skip lines.
You need to implement LineCallbackHandler interface and add you processing in handleLine method.
LineCallbackHandler Interface passes the raw line content of the lines in the file to be skipped. If linesToSkip is set to 2, then this interface is called twice.
This is how you can define Reader for the same.
Java Config - Spring Batch 4
#Bean
public FlatFileItemReader<POJO> myReader() {
return FlatFileItemReader<pojo>().
.setResource(new FileSystemResource("resources/players.csv"));
.name("myReader")
.delimited()
.delimiter(",")
.names("pro1,pro2,pro3")
.targetType(POJO.class)
.skippedLinesCallback(skippedLinesCallback)
.build();
}
I have written a controller which takes as a input the domain name , crawls the whole site and gives back the result in JSON format
http://crawlmysite-tgugnani.rhcloud.com/getUrlCrawlData/www.google.com
This gives the data google
http://crawlmysite-tgugnani.rhcloud.com/getUrlCrawlData/www.yahoo.com
This gives data for yahoo
If I try to run these two URL's simultaneously, I see that I am getting the mixed data, and the results of one is affecting the another, even though I try to hit them from different machines.
Here is my controller
#RequestMapping("/getUrlCrawlData/{domain:.+}")
#ResponseBody
public String registerContact(#PathVariable("domain") String domain) throws HttpStatusException, SQLException, IOException {
List<URLdata> urldata = null;
Gson gson = new Gson();
String json;
urldata = crawlService.crawlURL("http://"+domain);
json = gson.toJson(urldata);
return json;
}
What do I need to do modify to allow many multiple independent connections.
Update
Following is my crawl Service
public List<URLdata> crawlURL(String domain) throws HttpStatusException, SQLException, IOException{
testDomain = domain;
urlList.clear();
urlMap.clear();
urldata.clear();
urlList.add(testDomain);
processPage(testDomain);
//Get all pages
for(int i = 1; i < urlList.size(); i++){
if(urlList.size()>=500){
break;
}
processPage(urlList.get(i));
//System.out.println(urlList.get(i));
}
//Calculate Time
for(int i = 0; i < urlList.size(); i++){
getTitleAndMeta(urlList.get(i));
}
return urldata;
}
public static void processPage(String URL) throws SQLException, IOException, HttpStatusException{
//get useful information
try{
Connection.Response response = Jsoup.connect(URL)
.userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
.timeout(10000)
.execute();
Document doc = response.parse();
//get all links and recursively call the processPage method
Elements questions = doc.select("a[href]");
for(Element link: questions){
String linkName = link.attr("abs:href");
if(linkName.contains(testDomain.replaceAll("http://www.", ""))){
if(linkName.contains("#")){
linkName = linkName.substring(0, linkName.indexOf("#"));
}
if(linkName.contains("?")){
linkName = linkName.substring(0, linkName.indexOf("?"));
}
if(!urlList.contains(linkName) && urlList.size() <= 500){
urlList.add(linkName);
}
}
}
}
catch(HttpStatusException e){
System.out.println(e);
}
catch(SocketTimeoutException e){
System.out.println(e);
}
catch(UnsupportedMimeTypeException e){
System.out.println(e);
}
catch(UnknownHostException e){
System.out.println(e);
}
catch(MalformedURLException e){
System.out.println(e);
}
}
Each of your requests (http://crawlmysite-tgugnani.rhcloud.com/getUrlCrawlData/www.google.com and http://crawlmysite-tgugnani.rhcloud.com/getUrlCrawlData/www.yahoo.com) is processed in a separate thread. You have two instances of the crawlURL() method working simultaneously, but both methods use the same variables (testDomain, urlList, urlMap and urldata). So they mess up each other's data in these variables.
One way to fix the problem is to declare these variables locally (inside the method). This way, new instances of these variables will be created for each invocation of crawlURL(). Alternatively, you can create a new instance of your CrawlService class for each invocation of the crawlURL() method.
Synchronizing threads would be a bad idea here because one requests will wait for another to complete before it can be processed by crawlURL().
As far as SpringMVC is concerned every request running in separate thread. So I think problem is in crawlService which, I suppose, is not stateless (singleton-like). Try to create new crawl service for every request and check if your data is not mixed. If creating crawl service is expensive operation you should rewrite it to work in stateless way.
#RequestMapping("/getUrlCrawlData/{domain:.+}")
#ResponseBody
public String registerContact(#PathVariable("domain") String domain) throws HttpStatusException, SQLException, IOException {
Gson gson = new Gson();
List<URLdata> = new CrawlService().crawlURL("http://"+domain);
return gson.toJson(urldata);
}
I think
urldata = crawlService.crawlURL("http://"+domain);
This call to crawl Service is the one which is affected by Multiple requests coming simultaneously.
check whether crawlService is safe from multithreading.
ie check whether crawlURL() method is synchronized , if not make it synchronized.
or else synchronize the block of calling crawlservice inside controller.
My script fetches xml via httpConnection and saves to persistent store. No problems there.
Then I loop through the saved data to compose a list of image url's to fetch via queue.
Each of these requests calls the httpConnection thread as so
...
public synchronized void run()
{
HttpConnection connection = (HttpConnection)Connector.open("http://www.somedomain.com/image1.jpg");
connection.setRequestMethod("GET");
String contentType = connection.getHeaderField("Content-type");
InputStream responseData = connection.openInputStream();
connection.close();
outputFinal(responseData, contentType);
}
public synchronized void outputFinal(InputStream result, String contentType) throws SAXException, ParserConfigurationException, IOException
{
if(contentType.startsWith("text/"))
{
// bunch of xml save code that works fine
}
else if(contentType.equals("image/png") || contentType.equals("image/jpeg") || contentType.equals("image/gif"))
{
// how to save images here?
}
else
{
//default
}
}
What I can't find any good documentation on is how one would take the response data and save it to an image stored on the device.
Maybe I just overlooked something very obvious. Any help is very appreciated.
Thanks
I tried following this advise and found the same thing I always find when looking up BB specific issues: nothing.
The problem is that every example or post assumes you know everything about the platform.
Here's a simple question: What line of code writes the read output stream to the blackberry device? What path? How do I retrieve it later?
I have this code, which I do not know if it does anything because I don't know where it is supposedly writing to or if that's even what it is doing at all:
** filename is determined on a loop based on the url called.
FileOutputStream fos = null;
try
{
fos = new FileOutputStream( File.FILESYSTEM_PATRIOT, filename );
byte [] buffer = new byte [262144];
int byteRead;
while ((byteRead = result.read (buffer ))!=- 1)
{
fos.write (buffer, 0, byteRead);
}
fos.flush();
fos.close();
}
catch(IOException ieo)
{
}
finally
{
if(fos != null)
{
fos.close();
}
}
The idea is that I have some 600 images pulled from a server. I need to loop the xml and save each image to the device so that when an entity is called, I can pull the associated image - entity_id.png - from the internal storage.
The documentation from RIM does not specify this, nor does it make it easy to begin figuring it out.
This issue does not seem to be addressed on this forum, or others I have searched.
Thanks
You'll need to use the Java FileOutputStream to do the writing. You'll also want to close the connection after reading the data from the InputStream (move outputFinal above your call to close). You can find all kinds of examples regarding FileOutputStream easily.
See here for more. Note that in order to use the FileOutputStream your application must be signed.