Spring FlatFileReader Jagged CSV file - spring

I'm reading data via spring batch and I'm going to dump it into a database table.
My csv file of musical facts is formatted like this:
question; valid answer; potentially another valid answer; unlikely, but another;
Where all rows have a question and at least one valid answer, but there can be more. The simple way to hold this data is to in the data in a pojo is with one field for a String and another for a List<String>.
Below is a simple line mapper to read a CSV file, but I don't know how to make the necessary changes to accommodate a jagged CSV file in this manner.
#Bean
public LineMapper<MusicalFactoid> musicalFactoidLineMapper() {
DefaultLineMapper<MusicalFactoid> musicalFactoidDefaultLineMapper = new DefaultLineMapper<>();
musicalFactoidDefaultLineMapper.setLineTokenizer(new DelimitedLineTokenizer() {{
setDelimiter(";");
setNames(new String[]{"question", "answer"}); // <- this will not work!
}});
musicalFactoidDefaultLineMapper.setFieldSetMapper(new BeanWrapperFieldSetMapper<MusicalFactoid>() {{
setTargetType(MusicalFactoid.class);
}});
return musicalFactoidDefaultLineMapper;
}
What do I need to do?

Write your own Line Mapper. As far as I see, you don't have any complex logic.
Something like this:
public MyLineMapper implements LineMapper<MusicalFactoid> {
public MusicalFactoid mapLine(String line, int lineNumber) {
MusicalFactoid dto = new MusicalFactoid();
String[] splitted = line.split(";");
dto.setQuestion(splitted[0]);
for (int idx = 1; idx < splitted.length; idx++) {
dto.addAnswer(splitted[idx]);
}
return dto;
}
}

Related

Spring Batch FlatFileItemReader Handle Additional Fields Added to CSV

So I've got a csv file that's being ingested on a scheduled basis. The csv file has a set of columns with their names specified in the header row, the catch is, new columns are constantly being added to this csv. Currently, when a new field is added, the ingest flow breaks and I get a FlatFileParseException. I have to go in and update the code with the new column names in order to have it work again.
What I'm looking to accomplish, is instead, when new columns are added, have the code correctly pick out the columns it needs, and not throw an exception.
#Bean
#StepScope
FlatFileItemReader<Foo> fooReader(
...
) {
final DelimitedLineTokenizer fooLineTokenizer = new DelimitedLineTokenizer(",") {{
final String[] fooColumnNames = { "foo", "bar" };
setNames(fooColumnNames);
// setStrict(false);
}};
return new FlatFileItemReader<>() {{
setLineMapper(new DefaultLineMapper<>() {{
setLineTokenizer(fooLineTokenizer);
setFieldSetMapper(new BeanWrapperFieldSetMapper<>() {{
setTargetType(Foo.class);
}});
}});
...
}};
}
I've tried using setStrict(false) in the lineTokenizer, and this gets rid of the exception, however the problem then becomes fields being set to the wrong values from the new columns that were added, as opposed to the original columns the data was being pulled from.
Any ideas on how to add a bit more fault-tolerance to this flow, so I don't have to constantly update the fooColumnNames whenever columns are added to the csv?
I tried modifying the code using the setStrict(false) parameter and toying with custom implementations of lineTokenizer, but still struggling to get fault-tolerance when new columns are added to the csv
I don't know about fault tolerance, but it could possible to retrieve columns dynamically
Add a listener to your step to retrieve columns in beforeStep and pass it to your stepExecutionContext
public class ColumnRetrieverListener implements StepExecutionListener {
private Resource resource;
//Getter and setter
#Override
public void beforeStep(StepExecution stepExecution) {
String[] columns = getColumns();
stepExecution.getExecutionContext().put("columns", columns);
}
private String[] getColumns() {
//Parse first line of resource to get columns
}
}
Use the data passed to execution context to set lineTokenizer
final DelimitedLineTokenizer fooLineTokenizer = new DelimitedLineTokenizer(",") {{
final String[] fooColumnNames = (String[]) stepExecution.getExecutionContext().get("columns");
setNames(fooColumnNames);
}};

How do I test my Springboot XML Filtering function

I am fairly new to unit testing and I am trying to filter a XML file in java spring-boot the filtering Function looks like this:
public Document filterRecordsByReleaseDate(Document document, String dateString, RDSymbol symbol) throws ParseException {
Document newDocument = builder.newDocument();
Node root = newDocument.createElement("records");
newDocument.appendChild(root);
Date comparisonDate = new SimpleDateFormat("yyyy-MM-dd").parse(dateString);
NodeList nodeList = document.getElementsByTagName("record");
for (int i = 0; i < nodeList.getLength(); i++) {
Node node = newDocument.adoptNode(nodeList.item(i));
Element element = (Element) node;
String releaseDateString = element.getElementsByTagName("releasedate").item(0).getTextContent();
Date releaseDate = new SimpleDateFormat("yyyy.MM.dd").parse(releaseDateString);
if (releaseDate.after(comparisonDate) && symbol.toString().equals("GT")) {
root.appendChild(node);
} else if (releaseDate.before(comparisonDate) && symbol.toString().equals("LT")) {
root.appendChild(node);
}
}
return newDocument;
}
The function itself is working fine, but I was thinking about how I might unit test this code. Currently, I only have one File which is supplied from the src/main/resourcesfolder. The Data will at some point come from some External Service/DB and it will follow the same format.
In my head I have several Questions:
how do I mock the Input Document for the file
to what should I compare the output of the function?
Concerning question 1: would It be ok to just use the RecordRepository.getXML function as a dependency, as only thing I could do otherwise would be to replicate its code anyways?
Concerning question 2: would It be ok to create a mocks folder in the src/test/resources directory to which I save the outputs of previous succesfull filters, to compare to. I feel like this would make the Test kind of redundant, but I also dont see any other way. Is there something I am not seing?

Reading line as String

My question is the somewhat related to this question.
In my batch configuration I am using a flatfile reader like below with the intent to read the entire line in the flatfile as a string:
#Bean
#StepScope
#Qualifier("employeeItemReader")
#DependsOn("partitioner")
public FlatFileItemReader<Employee> EmployeeItemReader(#Value("#{stepExecutionContext['fileName']}") String filename)
throws MalformedURLException {
return new FlatFileItemReaderBuilder<Employee>().name("employeeItemReader")
.delimited()
.delimiter("<#|FooBar|#>")
//.names(new String[] { "id", "firstName", "lastName" })
.names(new String[] { "id" })
.fieldSetMapper(new BeanWrapperFieldSetMapper<Employee>() {
{
setTargetType(Employee.class);
}
})
.linesToSkip(0)
.resource(new UrlResource(filename)).build();
}
As you can see, I am using (literally) the delimiter as .delimiter("<#|FooBar|#>"). And it solves my purpose (in Dev environment) as I am reading a multiple files where each line contains a UUID string value. Given that my delimiter will never be present in the expected UUID.
But there are chances that there might be more than one UUID per line as I am getting those files from different sources. So, I want to tackle this situation where each line is of (similar to) this format - afcf8f03-7d83-4c24-9b7b-d03303e70c00.
Question: How do I make use of FixedLengthTokenizer to make sure I always read a line as as UUID? As I am dealing with: 8AlphNum-4AlphNum-4AlphNum-12AlphNum. How do I tackle these alpha numerics and hyphens?

Building a multi-record flat file

I'm struggling to find a proper solution for generating a flat file.
Here are some criteria I need to take care of:
The file has a header with summary of its following records
there could be multiple Collection Header Records with multiple Batch Header Records which contain multiple records of different types.
All records within a Batch have a checksum which has to be added to a batch checksum. This one has to be added to the collection Header checksum and that again to the file checksum. Also each entry in the file has a counter value.
So my plan was to create a class for each record. but what now? I have the records and the "summary records", the next step would be to bring them all in order, count the sums and then set the counters.
How should I proceed from here, should I put everything in a big SortedList? If so, how do I know where to add the latest record (It has to be added to its representing batch summary)?
My first idea was to do something like this:
SortedList<HeaderSummary, SortedList<BatchSummary, SortedList<string, object>>>();
But it is hard to navigate through the HeaderSummaries and BatchSummaries to add a object in the inner Sorted list, bearing in mind that I may have to create and add a HeaderSummary / BachtSummary.
Having several different ArrayLists like one for Header, one for Batch and one for the rest gives me problems when combining them to a flat file because of the order and the - yet to set - counters, while keeping the order etc.
Do you have any clever solution for such a flat file?
Consider using classes to represent levels of your tree structure.
interface iBatch {
public int checksum { get; set; }
}
class BatchSummary {
int batchChecksum;
List<iBatch> records;
public void WriteBatch() {
WriteBatchHeader();
foreach (var record in records)
batch.WriteRecord();
}
public void Add(iBatch rec) {
records.Add(rec); // or however you find the appropriate batch
}
}
class CollectionSummary {
int collectionChecksum;
List<BatchSummary> batches;
public void WriteCollection() {
WriteCollectionHeader();
foreach (var batch in batches)
batch.WriteBatch();
}
public void Add(int WhichBatch, iBatch rec) {
batches[whichBatch].Add(rec); // or however you find the appropriate batch
}
}
class FileSummary {
// ... file summary info
int fileChecksum;
List<CollectionSummary> collections;
public void WriteFile() {
WriteFileHeader();
foreach (var collection in collections)
collection.WriteCollection();
}
public void Add(int whichCollection, int WhichBatch, iBatch rec) {
collections[whichCollection].Add(whichBatch, rec); // or however you find the appropriate collection
}
}
Of course, you could use a common Summary class to be more DRY, if not necessarily more clear.

Read in big csv file, validate and write out using uniVocity parser

I need to parse a big csv file (2gb). The values have to be validated, the rows containing "bad" fields must be dropped and a new file containing only valid rows ought to be output.
I've selected uniVocity parser library to do that. Please help me to understand whether this library is well-suited for the task and what approach should be used.
Given the file size, what is the best way to organize read->validate->write in uniVocity ? Read in all rows at once or use iterator style ? Where parsed and validated rows should be stored before they are written to file ?
Is there a way in Univocity to access row's values by index ? Something like row.getValue(3) ?
I'm the author of this library, let me try to help you out:
First, do not try to read all rows at once as you will fill your memory with LOTS of data.
You can get the row values by index.
The faster approach to read/validate/write would be by using a RowProcessor that has a CsvWriter and decides when to write or skip a row. I think the following code will help you a bit:
Define the output:
private CsvWriter createCsvWriter(File output, String encoding){
CsvWriterSettings settings = new CsvWriterSettings();
//configure the writer ...
try {
return new CsvWriter(new OutputStreamWriter(new FileOutputStream(output), encoding), settings);
} catch (IOException e) {
throw new IllegalArgumentException("Error writing to " + output.getAbsolutePath(), e);
}
}
Redirect the input
//this creates a row processor for our parser. It validates each row and sends them to the csv writer.
private RowProcessor createRowProcessor(File output, String encoding){
final CsvWriter writer = createCsvWriter(output, encoding);
return new AbstractRowProcessor() {
#Override
public void rowProcessed(String[] row, ParsingContext context) {
if (shouldWriteRow(row)) {
writer.writeRow(row);
} else {
//skip row
}
}
private boolean shouldWriteRow(String[] row) {
//your validation here
return true;
}
#Override
public void processEnded(ParsingContext context) {
writer.close();
}
};
}
Configure the parser:
public void readAndWrite(File input, File output, String encoding) {
CsvParserSettings settings = new CsvParserSettings();
//configure the parser here
//tells the parser to send each row to them custom processor, which will validate and redirect all rows to the CsvWriter
settings.setRowProcessor(createRowProcessor(output, encoding));
CsvParser parser = new CsvParser(settings);
try {
parser.parse(new InputStreamReader(new FileInputStream(input), encoding));
} catch (IOException e) {
throw new IllegalStateException("Unable to open input file " + input.getAbsolutePath(), e);
}
}
For better performance you can also wrap the row processor in a ConcurrentRowProcessor.
settings.setRowProcessor(new ConcurrentRowProcessor(createRowProcessor(output, encoding)));
With this, the writing of rows will be performed in a separate thread.

Resources