Spring Batch - How to write my own values under a header in the output file - spring-boot

I am developing a Spring Boot application.
I have an input.csv file whose values I read and write to a DB; to do so I am using the Spring Batch framework.
I am able to read the values and insert them into the DB. I have also created an output.csv file with a header of several columns, and I set a few of those column values from the input.csv file.
But for one column in the output.csv file, I want to record the response code, i.e. 00 -> success, 11 -> fail.
Ex: My header looks like {"S_NO;NAME;Dept;Salary;Rsp_code"}
The values for the first 4 columns are taken from the input.csv file.
For the Rsp_code column, I want to set the hardcoded values 00 and 11:
if a row is successfully inserted into the DB, set Rsp_code to 00, otherwise to 11.
Is there a way to do this? My writer code is as follows:
ItemWriter<Valueholder> databaseCsvItemWriter() {
    FlatFileItemWriter<Valueholder> csvFileWriter = new FlatFileItemWriter<>();

    String exportFileHeader = "S_NO;NAME;Dept;Salary;Rsp_code";
    StringHeaderWriter headerWriter = new StringHeaderWriter(exportFileHeader);
    csvFileWriter.setHeaderCallback(headerWriter);

    String exportFilePath = "C:\\temp\\useroutput.csv";
    csvFileWriter.setResource(new FileSystemResource(exportFilePath));

    LineAggregator<Valueholder> lineAggregator = createStudentLineAggregator();
    csvFileWriter.setLineAggregator(lineAggregator);

    return csvFileWriter;
}

private LineAggregator<Valueholder> createStudentLineAggregator() {
    DelimitedLineAggregator<Valueholder> lineAggregator = new DelimitedLineAggregator<>();
    lineAggregator.setDelimiter(";");

    FieldExtractor<Valueholder> fieldExtractor = createStudentFieldExtractor();
    lineAggregator.setFieldExtractor(fieldExtractor);
    return lineAggregator;
}

private FieldExtractor<Valueholder> createStudentFieldExtractor() {
    BeanWrapperFieldExtractor<Valueholder> extractor = new BeanWrapperFieldExtractor<>();
    extractor.setNames(new String[] {"id", "name", "dept", "salary"});
    return extractor;
}

Data transformation is a typical use case for an item processor, so you can use an item processor to do the mapping. The advantage of this approach is that you can:
- unit test the mapping in isolation
- re-use the item processor in another step/job if needed
A sketch of such a processor is shown below.
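A minimal sketch, not the poster's actual code: the processor performs the DB insert itself and records the outcome on the item, so the file writer only writes the already-populated response code. UserDao is a hypothetical DAO, and Valueholder is assumed to expose a rspCode property:

@Bean
public ItemProcessor<Valueholder, Valueholder> responseCodeProcessor(UserDao userDao) {
    return item -> {
        try {
            userDao.insert(item);     // hypothetical DAO call that writes the row to the DB
            item.setRspCode("00");    // insert succeeded
        } catch (DataAccessException e) {
            item.setRspCode("11");    // insert failed
        }
        return item;                  // item then goes on to the FlatFileItemWriter
    };
}

The rspCode field then only needs to be added to the field extractor so it ends up in the last column:

extractor.setNames(new String[] {"id", "name", "dept", "salary", "rspCode"});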

Related

How to disable/avoid linesToSkip(1) from the next file onwards in Spring Batch while processing a large CSV file

We have a large CSV file with 100 million records, and we use Spring Batch to load, read and write to the database by splitting the file into chunks of 1 million records using SystemCommandTasklet. Below is a snippet:
@Bean
@StepScope
public SystemCommandTasklet splitFileTasklet(@Value("#{jobParameters[filePath]}") final String inputFilePath) {
    SystemCommandTasklet tasklet = new SystemCommandTasklet();

    final File file = BatchUtilities.prefixFile(inputFilePath, AppConstants.PROCESSING_PREFIX);
    final String command = configProperties.getBatch().getDataLoadPrep().getSplitCommand()
            + " " + file.getAbsolutePath()
            + " " + configProperties.getBatch().getDataLoad().getInputLocation()
            + System.currentTimeMillis() / 1000;

    tasklet.setCommand(command);
    tasklet.setTimeout(configProperties.getBatch().getDataLoadPrep().getSplitCommandTimeout());
    executionContext.put(AppConstants.FILE_PATH_PARAM, file.getPath());

    return tasklet;
}
and batch-config:
batch:
  data-load-prep:
    input-location: /mnt/mlr/prep/
    split-command: split -l 1000000 --additional-suffix=.csv
    split-command-timeout: 900000 # 15 min
    schedule: "*/60 * * * * *"
    lock-at-most: 5m
With the above config, I am able to read, load and write to the database successfully. However, I found a bug with the snippet below: after splitting the file, only the first file has a header, and the subsequent split files do not have headers in the first line. So I have to either disable or avoid the linesToSkip(1) config for the FlatFileItemReader (CSVReader).
@Configuration
public class DataLoadReader {

    @Bean
    @StepScope
    public FlatFileItemReader<DemographicData> demographicDataCSVReader(@Value("#{jobExecutionContext[filePath]}") final String filePath) {
        return new FlatFileItemReaderBuilder<DemographicData>()
                .name("data-load-csv-reader")
                .resource(new FileSystemResource(filePath))
                .linesToSkip(1) // Need to avoid this from the 2nd split file onwards, as split files do not have headers
                .lineMapper(lineMapper())
                .build();
    }

    public LineMapper<DemographicData> lineMapper() {
        DefaultLineMapper<DemographicData> defaultLineMapper = new DefaultLineMapper<>();
        DelimitedLineTokenizer lineTokenizer = new DelimitedLineTokenizer();
        lineTokenizer.setNames("id", "mdl65DecileNum", "mdl66DecileNum", "hhId", "dob", "firstName", "middleName",
                "lastName", "addressLine1", "addressLine2", "cityName", "stdCode", "zipCode", "zipp4Code", "fipsCntyCd",
                "fipsStCd", "langName", "regionName", "fipsCntyName", "estimatedIncome");
        defaultLineMapper.setLineTokenizer(lineTokenizer);
        defaultLineMapper.setFieldSetMapper(new DemographicDataFieldSetMapper());
        return defaultLineMapper;
    }
}
Note: The loader should not skip the first row of the second file onwards while loading.
Thank you in advance. I appreciate any suggestions.
I would do it in the SystemCommandTasklet with the following command:
tail -n +2 data.csv | split -l 1000000 --additional-suffix=.csv
If you really want to do it with Java in your Spring Batch job, you can use a custom reader or an item processor that filters the header (a sketch of the item-processor variant follows). But I would not recommend this approach, as it introduces an additional test for each item (given the large number of lines in your input file, this could impact the performance of your job).
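A minimal sketch of that item-processor variant, only as an illustration: it assumes linesToSkip(1) has been removed, that the custom field set mapper tolerates the header line (for example by mapping every field as a string), and that the id field of a header row therefore carries the literal text "id". Returning null from the processor filters the item out:

@Bean
public ItemProcessor<DemographicData, DemographicData> headerFilteringProcessor() {
    // Drop any record that is really a header row from one of the split files
    return item -> "id".equalsIgnoreCase(item.getId()) ? null : item;
}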

Spring Batch FlatFileItemReader Handle Additional Fields Added to CSV

So I've got a CSV file that's being ingested on a scheduled basis. The CSV file has a set of columns whose names are specified in the header row; the catch is that new columns are constantly being added to this CSV. Currently, when a new field is added, the ingest flow breaks and I get a FlatFileParseException. I have to go in and update the code with the new column names in order to make it work again.
What I'm looking to accomplish instead is to have the code correctly pick out the columns it needs when new columns are added, without throwing an exception.
@Bean
@StepScope
FlatFileItemReader<Foo> fooReader(
        ...
) {
    final DelimitedLineTokenizer fooLineTokenizer = new DelimitedLineTokenizer(",") {{
        final String[] fooColumnNames = { "foo", "bar" };
        setNames(fooColumnNames);
        // setStrict(false);
    }};

    return new FlatFileItemReader<>() {{
        setLineMapper(new DefaultLineMapper<>() {{
            setLineTokenizer(fooLineTokenizer);
            setFieldSetMapper(new BeanWrapperFieldSetMapper<>() {{
                setTargetType(Foo.class);
            }});
        }});
        ...
    }};
}
I've tried using setStrict(false) in the line tokenizer, and this gets rid of the exception, but then fields end up being set to the wrong values, taken from the newly added columns instead of the original columns the data was being pulled from.
Any ideas on how to add a bit more fault tolerance to this flow, so I don't have to constantly update fooColumnNames whenever columns are added to the CSV?
I tried modifying the code using the setStrict(false) parameter and toying with custom implementations of the line tokenizer, but I am still struggling to get fault tolerance when new columns are added to the CSV.
I don't know about fault tolerance, but it is possible to retrieve the columns dynamically.
Add a listener to your step to retrieve the columns in beforeStep and pass them to the step execution context:
public class ColumnRetrieverListener implements StepExecutionListener {

    private Resource resource;
    // Getter and setter

    @Override
    public void beforeStep(StepExecution stepExecution) {
        String[] columns = getColumns();
        stepExecution.getExecutionContext().put("columns", columns);
    }

    private String[] getColumns() {
        // Parse the first line of the resource to get the columns
        // (minimal sketch; assumes a comma-delimited header row)
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(resource.getInputStream()))) {
            return reader.readLine().split(",");
        } catch (IOException e) {
            throw new IllegalStateException("Could not read header line from " + resource, e);
        }
    }
}
Use the data passed to the execution context to set the line tokenizer (see the step-scoped wiring sketch after this snippet):
final DelimitedLineTokenizer fooLineTokenizer = new DelimitedLineTokenizer(",") {{
    final String[] fooColumnNames = (String[]) stepExecution.getExecutionContext().get("columns");
    setNames(fooColumnNames);
}};
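For the tokenizer to actually see those values, the reader (or the bean that builds the tokenizer) has to be step-scoped so the columns can be injected from the step execution context via late binding. A minimal sketch under that assumption; the bean name, resource path and Foo type are only illustrative:

@Bean
@StepScope
FlatFileItemReader<Foo> fooReader(@Value("#{stepExecutionContext['columns']}") String[] columns) {
    DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer(",");
    tokenizer.setNames(columns); // names discovered by the beforeStep listener

    BeanWrapperFieldSetMapper<Foo> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
    fieldSetMapper.setTargetType(Foo.class);

    DefaultLineMapper<Foo> lineMapper = new DefaultLineMapper<>();
    lineMapper.setLineTokenizer(tokenizer);
    lineMapper.setFieldSetMapper(fieldSetMapper);

    return new FlatFileItemReaderBuilder<Foo>()
            .name("foo-reader")
            .resource(new FileSystemResource("foo.csv")) // illustrative path
            .linesToSkip(1)
            .lineMapper(lineMapper)
            .build();
}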

Spring Batch Writer to write Map<Key,Values> to file

I am using Spring Batch to develop CSV feed files. I used a writer similar to the one given below to generate my output file.
@Bean
public FlatFileItemWriter<CustomObj> writer()
{
    BeanWrapperFieldExtractor<CustomObj> extractor = new BeanWrapperFieldExtractor<>();
    extractor.setNames(new String[] {"name", "emailAddress", "dob"});

    DelimitedLineAggregator<CustomObj> lineAggregator = new DelimitedLineAggregator<>();
    lineAggregator.setDelimiter(";");
    lineAggregator.setFieldExtractor(extractor);

    FlatFileItemWriter<CustomObj> writer = new FlatFileItemWriter<>();
    writer.setResource(outputResource);
    writer.setAppendAllowed(true);
    //writer.setHeaderCallback(headerCallback);
    writer.setLineAggregator(lineAggregator);
    return writer;
}
Output:
name;emailAddress;dob
abc;abc@xyz.com;10-10-20
But now we have a requirement to make this writer generic, so that we no longer pass the object; instead we pass a Map<String, String> in which the object values are stored as key/value pairs.
Eg: name -> abc, emailAddress -> abc@xyz.com, dob -> 10-10-20
We tried to use a writer similar to the one below.
But the problem here is that since there is no FieldExtractor set, the header and the values may become out of sync.
The PassThroughFieldExtractor just passes all the values in the collection (Map) in any order; even if the Map contains more fields, it prints all of them.
The header and the values are not bound together in this case.
Is there any way to implement a custom field extractor that will make sure that even if we change the ordering of the header, the ordering of the values remains consistent with the header?
@Bean
public FlatFileItemWriter<Map<String, String>> writer()
{
    DelimitedLineAggregator<Map<String, String>> lineAggregator = new DelimitedLineAggregator<>();
    lineAggregator.setDelimiter(";");
    lineAggregator.setFieldExtractor(new PassThroughFieldExtractor<>());

    FlatFileItemWriter<Map<String, String>> writer = new FlatFileItemWriter<>();
    writer.setResource(outputResource);
    writer.setAppendAllowed(true);
    writer.setLineAggregator(lineAggregator);
    return writer;
}
Output:
name;emailAddress;dob
abc@xyz.com;abc;10-10-20
Expected output:
Case 1:
name;emailAddress;dob
abc;abc@xyz.com;10-10-20
Case 2:
emailAddress;dob
abc@xyz.com;10-10-20
You need a custom field extractor that extracts the values from the map in the same order as the headers. Spring Batch does not provide such an extractor, so you need to implement it yourself. For example, you can pass the headers to the extractor at construction time and extract the values from the map according to the header order, as in the sketch below.
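A minimal sketch of such an extractor; the class name is only illustrative, and missing keys are rendered as empty fields:

public class OrderedMapFieldExtractor implements FieldExtractor<Map<String, String>> {

    private final String[] names;

    public OrderedMapFieldExtractor(String... names) {
        this.names = names; // header names, in the order they are written
    }

    @Override
    public Object[] extract(Map<String, String> item) {
        // Pull the values out of the map in exactly the header order
        return Arrays.stream(names)
                .map(name -> item.getOrDefault(name, ""))
                .toArray();
    }
}

It can then be wired in place of the PassThroughFieldExtractor, e.g. lineAggregator.setFieldExtractor(new OrderedMapFieldExtractor("name", "emailAddress", "dob")), using the same names as the header callback.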

How to dynamically change the index name in "saveJsonToEs"?

I am trying to insert logs that I extract from a Kafka server into Elasticsearch 5 with Spark Streaming 2.0.0.
Here is my code. My big problem is with the "saveJsonToEs" line: this function takes a string argument specifying the index name, but my index name is a JavaDStream. I did it like this in order to generate dynamic index names in another class.
JavaDStream<List<String>> newLines = lines.map(arg0 -> {
    String lineToInsertInES = "";
    String indexName = "";
    List<String> list = new ArrayList<String>();
    //some code to determine strings to add in my list
    list.add(lineToInsertInES);
    list.add(indexName);
    return list;
});

JavaDStream<String> lineToInsertInES = newLines.map(list -> list.get(0));
JavaDStream<String> indexName = newLines.map(list -> list.get(1));

lineToInsertInES.foreachRDD(line -> {
    if (!line.isEmpty())
        JavaEsSpark.saveJsonToEs(line, indexName); //problem at this line
});
Can you tell me how I can solve this?
Thank you in advance.
J

Loading Files in UDF

I have a requirement to populate a field based on the evaluation of a UDF. The input to the UDF would be some other fields in the input as well as a CSV sheet. Presently, the approach I have taken is to load the CSV file, GROUP it ALL, and then pass it as a bag to the UDF along with the other required parameters. However, it is taking a very long time (roughly about 3 hours) to complete the process for source data of 170k records and CSV records of about 150k.
I'm sure there must be a much more efficient way to handle this, hence I need your inputs.
source_alias = LOAD 'src.csv' USING
    PigStorage(',') AS (f1:chararray,f2:chararray,f3:chararray);

csv_alias = LOAD 'csv_file.csv' USING
    PigStorage(',') AS (c1:chararray,c2:chararray,c3:chararray);

grpd_csv_alias = GROUP csv_alias ALL;

final_alias = FOREACH source_alias GENERATE f1 AS f1,
    myUDF(grpd_csv_alias, f2) AS derived_f2;
Here is my UDF at a high level.
public class myUDF extends EvalFunc<String> {

    public String exec(Tuple input) throws IOException {
        String f2Response = "N";
        DataBag csvAliasBag = (DataBag) input.get(0);
        String f2 = (String) input.get(1);
        try {
            Iterator<Tuple> bagIterator = csvAliasBag.iterator();
            while (bagIterator.hasNext()) {
                Tuple localTuple = (Tuple) bagIterator.next();
                String col1 = ((String) localTuple.get(1)).trim().toLowerCase();
                String col2 = ((String) localTuple.get(2)).trim().toLowerCase();
                String col3 = ((String) localTuple.get(3)).trim().toLowerCase();
                String col4 = ((String) localTuple.get(4)).trim().toLowerCase();
                // <Custom logic to populate f2Response based on the value in f2 as well as col1, col2, col3 and col4>
            }
            return f2Response;
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
I believe the process is taking so long because csv_alias is built and passed to the UDF for each row in the source file.
Is there a better way to handle this?
Thanks
For small files, you can put them on the distributed cache. This copies the file to each task node as a local file, and then you load it yourself. Here's an example from the UDF section of the Pig docs. I would not recommend parsing the file each time, however: store your results in a class variable and check whether it has been initialized (see the sketch after the example). If the CSV is on the local file system, use getShipFiles. If the CSV you're using is on HDFS, use the getCacheFiles method. Notice that for HDFS there is a file path followed by a # and some text: to the left of the # is the HDFS path, and to the right is the name you want the file to have when it is copied to the local file system.
public class Udfcachetest extends EvalFunc<String> {

    public String exec(Tuple input) throws IOException {
        String concatResult = "";

        FileReader fr = new FileReader("./smallfile1");
        BufferedReader d = new BufferedReader(fr);
        concatResult += d.readLine();

        fr = new FileReader("./smallfile2");
        d = new BufferedReader(fr);
        concatResult += d.readLine();

        return concatResult;
    }

    public List<String> getCacheFiles() {
        List<String> list = new ArrayList<String>(1);
        list.add("/user/pig/tests/data/small#smallfile1"); // This is an HDFS file
        return list;
    }

    public List<String> getShipFiles() {
        List<String> list = new ArrayList<String>(1);
        list.add("/home/hadoop/pig/smallfile2"); // This is a local file
        return list;
    }
}
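A minimal sketch of the caching pattern recommended above, applied to the question's lookup CSV: the cached file is parsed once into a class variable the first time exec() runs and reused for every subsequent row. The class name, the cache path and the single-column lookup logic are only illustrative:

public class CachedCsvLookupUdf extends EvalFunc<String> {

    private Map<String, String> lookup; // built once per task, reused across rows

    public String exec(Tuple input) throws IOException {
        if (lookup == null) {
            lookup = new HashMap<String, String>();
            BufferedReader reader = new BufferedReader(new FileReader("./csv_lookup"));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split(",");
                lookup.put(cols[0].trim().toLowerCase(), cols[1].trim().toLowerCase());
            }
            reader.close();
        }
        String f2 = (String) input.get(0);
        return lookup.containsKey(f2.trim().toLowerCase()) ? "Y" : "N";
    }

    public List<String> getCacheFiles() {
        // HDFS path on the left of '#', local alias on the right
        return Arrays.asList("/user/pig/tests/data/csv_file.csv#csv_lookup");
    }
}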
