I have a CSV file that is ingested on a schedule. Its columns are named in the header row; the catch is that new columns are constantly being added. Currently, when a new column is added, the ingest flow breaks with a FlatFileParseException, and I have to update the code with the new column names to get it working again.
What I'd like instead is for the code to pick out the columns it needs when new columns are added, without throwing an exception.
@Bean
@StepScope
FlatFileItemReader<Foo> fooReader(
...
) {
final DelimitedLineTokenizer fooLineTokenizer = new DelimitedLineTokenizer(",") {{
final String[] fooColumnNames = { "foo", "bar" };
setNames(fooColumnNames);
// setStrict(false);
}};
return new FlatFileItemReader<>() {{
setLineMapper(new DefaultLineMapper<>() {{
setLineTokenizer(fooLineTokenizer);
setFieldSetMapper(new BeanWrapperFieldSetMapper<>() {{
setTargetType(Foo.class);
}});
}});
...
}};
}
I've tried calling setStrict(false) on the line tokenizer. This gets rid of the exception, but then fields are populated with values from the newly added columns instead of the original columns the data was being pulled from.
Any ideas on how to make this flow more fault-tolerant, so I don't have to update fooColumnNames every time columns are added to the CSV?
Besides setStrict(false), I have also experimented with custom LineTokenizer implementations, but I'm still struggling to make the flow tolerate new columns.
I don't know about fault tolerance, but it should be possible to retrieve the columns dynamically.
Add a listener to your step that reads the columns in beforeStep and puts them in the step's execution context:
public class ColumnRetrieverListener implements StepExecutionListener {
private Resource resource;
//Getter and setter
@Override
public void beforeStep(StepExecution stepExecution) {
String[] columns = getColumns();
stepExecution.getExecutionContext().put("columns", columns);
}
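private String[] getColumns() {
// Parse the first line of the resource to get the column names.
// Needs java.io.* and java.nio.charset.StandardCharsets; UTF-8 and "," are assumptions.
try (BufferedReader reader = new BufferedReader(
new InputStreamReader(resource.getInputStream(), StandardCharsets.UTF_8))) {
String header = reader.readLine();
return header == null ? new String[0] : header.split(",");
} catch (IOException e) {
throw new UncheckedIOException(e);
}
}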
}
Then use the data stored in the execution context to configure the line tokenizer:
final DelimitedLineTokenizer fooLineTokenizer = new DelimitedLineTokenizer(",") {{
final String[] fooColumnNames = (String[]) stepExecution.getExecutionContext().get("columns");
setNames(fooColumnNames);
}};
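The question also asks how to keep reading only the needed columns when extra ones appear. One way to think about it is to resolve each required column name to its position in the header once, then pick values by position. Below is a minimal, Spring-free sketch of that idea; HeaderColumnPicker and its methods are hypothetical names, and the naive split assumes no quoted fields containing commas.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Spring Batch): selects columns by header name. */
final class HeaderColumnPicker {
    private final String[] required;
    private final int[] indexes; // position of each required column in the file

    HeaderColumnPicker(String headerLine, String... required) {
        // Naive split: assumes no quoted fields containing commas.
        List<String> header = new ArrayList<>();
        for (String name : headerLine.split(",", -1)) {
            header.add(name.trim());
        }
        this.required = required.clone();
        this.indexes = new int[required.length];
        for (int i = 0; i < required.length; i++) {
            int position = header.indexOf(required[i]);
            if (position < 0) {
                throw new IllegalArgumentException("Missing column: " + required[i]);
            }
            indexes[i] = position;
        }
    }

    /** Zero-based positions of the required columns, in the order requested. */
    int[] indexes() {
        return indexes.clone();
    }

    /** Returns only the required columns of one data line, keyed by column name. */
    Map<String, String> pick(String line) {
        String[] tokens = line.split(",", -1);
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < required.length; i++) {
            // Lines shorter than expected yield null rather than an exception.
            row.put(required[i], indexes[i] < tokens.length ? tokens[indexes[i]] : null);
        }
        return row;
    }
}
```

In a Spring Batch reader, the positions returned by indexes() could probably be passed to DelimitedLineTokenizer.setIncludedFields together with the fixed names { "foo", "bar" }, so that newly added columns are ignored; treat that wiring as an untested assumption.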
I am using Spring Batch to generate CSV feed files. I used a writer similar to the one below to generate my output file.
@Bean
public FlatFileItemWriter<CustomObj> writer()
{
BeanWrapperFieldExtractor<CustomObj> extractor = new BeanWrapperFieldExtractor<>();
extractor.setNames(new String[] {"name", "emailAddress", "dob"});
DelimitedLineAggregator<CustomObj> lineAggregator = new DelimitedLineAggregator<>();
lineAggregator.setDelimiter(";");
FieldExtractor<CustomObj> fieldExtractor = createStudentFieldExtractor();
lineAggregator.setFieldExtractor(fieldExtractor);
FlatFileItemWriter<CustomObj> writer = new FlatFileItemWriter<>();
writer.setResource(outputResource);
writer.setAppendAllowed(true);
//writer.setHeaderCallback(headerCallback);
writer.setLineAggregator(lineAggregator);
return writer;
}
Output:
name;emailAddress;dob
abc;abc@xyz.com;10-10-20
Now we have a requirement to make this writer generic: instead of passing the object, we pass a Map<String, String>, and the object's values are stored in the map as key-value pairs.
E.g.: name -> abc, emailAddress -> abc@xyz.com, dob -> 10-10-20
We tried a writer similar to the one below.
The problem is that no header-aware FieldExtractor is set, so the header and the values can get out of sync.
The PassThroughFieldExtractor simply passes through all the values in the Map, in whatever order the Map yields them. If the Map contains extra fields, those are printed as well.
The header and the values are not bound together in this case.
Is there a way to implement a custom field extractor that guarantees the values stay in the same order as the header, even if we change the header's ordering?
@Bean
public FlatFileItemWriter<Map<String,String>> writer()
{
DelimitedLineAggregator<Map<String,String>> lineAggregator = new DelimitedLineAggregator<>();
lineAggregator.setDelimiter(";");
FieldExtractor<Map<String,String>> fieldExtractor = createStudentFieldExtractor();
lineAggregator.setFieldExtractor(new PassThroughFieldExtractor<>());
FlatFileItemWriter<Map<String,String>> writer = new FlatFileItemWriter<>();
writer.setResource(outputResource);
writer.setAppendAllowed(true);
writer.setLineAggregator(lineAggregator);
return writer;
}
Output:
name;emailAddress;dob
abc@xyz.com;abc;10-10-20
Expected output
Case 1:
name;emailAddress;dob
abc;abc@xyz.com;10-10-20
Case 2:
emailAddress;dob
abc@xyz.com;10-10-20
You need a custom field extractor that extracts values from the map in the same order as the headers. Spring Batch does not provide such an extractor, so you need to implement it yourself. For example, you can pass the headers to the extractor at construction time and extract values from the map in header order.
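A minimal sketch of such an extractor follows. To keep it compilable on its own, it declares a local FieldExtractor interface that mirrors the single extract method of Spring Batch's FieldExtractor; in a real job you would implement the Spring Batch interface instead. The class name HeaderOrderedMapFieldExtractor, the headerLine helper, and the choice of an empty string for missing keys are all assumptions made for illustration.

```java
import java.util.Map;

/** Local stand-in mirroring Spring Batch's FieldExtractor, so the sketch compiles alone. */
interface FieldExtractor<T> {
    Object[] extract(T item);
}

/** Hypothetical extractor: emits map values in the order of the configured headers. */
final class HeaderOrderedMapFieldExtractor implements FieldExtractor<Map<String, String>> {
    private final String[] headers;

    HeaderOrderedMapFieldExtractor(String... headers) {
        this.headers = headers.clone();
    }

    @Override
    public Object[] extract(Map<String, String> item) {
        Object[] values = new Object[headers.length];
        for (int i = 0; i < headers.length; i++) {
            // Keys absent from the map become empty fields; keys not in headers are ignored.
            values[i] = item.getOrDefault(headers[i], "");
        }
        return values;
    }

    /** Header line built from the same array, so header and values cannot drift apart. */
    String headerLine(String delimiter) {
        return String.join(delimiter, headers);
    }
}
```

Because the header line and the values come from the same array, reordering or shortening the headers (case 2 above) changes both consistently. The header line could then be written from a FlatFileHeaderCallback set on the writer.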
My documents have a docType property that separates them by purpose, in this case template or audit. However, when I do the following:
document.getProperty("docType").equals("template");
document.getProperty("docType").equals("audit");
the results are always the same: every stored document is returned, with no filtering by docType.
Below is the query function.
public static Query getData(Database database, final String type) {
View view = database.getView("data");
if (view.getMap() == null) {
view.setMap(new Mapper() {
@Override
public void map(Map<String, Object> document, Emitter emitter) {
if(String.valueOf(document.get("docType")).equals(type)){
emitter.emit(document.get("_id"), null);
}
}
}, "4");
}
return view.createQuery();
}
Any hint?
This is not a valid way to do it. Your view's map function must be pure (it cannot reference external state such as "type"). Instead, emit the docType as the key. Once the view is created, you can query it for what you want by setting start and end keys, or a set of keys to filter on.
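To illustrate the difference, here is a small plain-Java model of a view. It is not the Couchbase Lite API; DocTypeView and its methods are hypothetical. The map function depends only on the document and emits docType as the key, and the filtering happens at query time.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

/** Hypothetical in-memory model of a view index keyed by docType. */
final class DocTypeView {
    // Emitted key -> document ids, kept sorted by key like a view index.
    private final TreeMap<String, List<String>> rows = new TreeMap<>();

    /** Pure map function: its output depends only on the document passed in. */
    static void map(Map<String, Object> document, BiConsumer<String, String> emitter) {
        Object docType = document.get("docType");
        if (docType != null) {
            emitter.accept(docType.toString(), String.valueOf(document.get("_id")));
        }
    }

    /** Builds the index once, independently of any later query. */
    void indexAll(Collection<Map<String, Object>> documents) {
        for (Map<String, Object> document : documents) {
            map(document, (key, id) -> rows.computeIfAbsent(key, k -> new ArrayList<>()).add(id));
        }
    }

    /** Query-time filtering: returns the ids emitted under the given key. */
    List<String> queryByKey(String key) {
        return rows.getOrDefault(key, Collections.emptyList());
    }
}
```

In Couchbase Lite itself, the equivalent would presumably be emitter.emit(document.get("docType"), null) in the mapper, followed by setting the keys (or start and end keys) on the query returned by view.createQuery(); check the exact method names against the version you use.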
I'm experimenting with Stanford NLP's TokensRegex, trying to find dimensions (e.g. 100x120) in a text. My plan is to first retokenize the input to split these tokens further (using the example provided in retokenize.rules.txt) and then search for the new pattern.
After the retokenization, however, only null values are left in place of the original strings:
The top level annotation
[Text=100x120 Tokens=[null-1, null-2, null-3] Sentences=[100x120]]
The retokenization seems to work fine (3 tokens in result), but the values are lost. What can I do to maintain the original values in the tokens list?
My retokenize.rules.txt file is (as in the demo):
tokens = { type: "CLASS", value:"edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }
options.matchedExpressionsAnnotationKey = tokens;
options.extractWithTokens = TRUE;
options.flatten = TRUE;
ENV.defaults["ruleType"] = "tokens"
ENV.defaultStringPatternFlags = 2
ENV.defaultResultAnnotationKey = tokens
{ pattern: ( /\d+(x|X)\d+/ ), result: Split($0[0], /x|X/, TRUE) }
The main method:
public static void main(String[] args) throws IOException {
//...
text = "100x120";
Properties properties = new Properties();
properties.setProperty("tokenize.language", "de");
properties.setProperty("annotators", "tokenize,retokenize,ssplit,pos,lemma,ner");
properties.setProperty("customAnnotatorClass.retokenize", "edu.stanford.nlp.pipeline.TokensRegexAnnotator");
properties.setProperty("retokenize.rules", "retokenize.rules.txt");
StanfordCoreNLP stanfordPipeline = new StanfordCoreNLP(properties);
runPipeline(stanfordPipeline, text);
}
And the pipeline:
public static void runPipeline(StanfordCoreNLP pipeline, String text) {
Annotation annotation = new Annotation(text);
pipeline.annotate(annotation);
out.println();
out.println("The top level annotation");
out.println(annotation.toShorterString());
//...
}
Thanks for letting us know. The CoreAnnotations.ValueAnnotation is not being populated, and we'll update TokensRegex to populate the field.
Regardless, you should be able to use TokensRegex to retokenize as you have planned. Most of the pipeline does not depend on the ValueAnnotation and uses the CoreAnnotations.TextAnnotation instead. You can use the CoreAnnotations.TextAnnotation to get the text of the new tokens (each token is a CoreLabel, so you can also access it using token.word()).
See TokensRegexRetokenizeDemo for example code showing how to get the different annotations out.
I need to parse a big CSV file (2 GB). The values have to be validated, rows containing "bad" fields must be dropped, and a new file containing only the valid rows should be written.
I've selected the uniVocity parsers library for this. Please help me understand whether the library is well suited to the task and what approach should be used.
Given the file size, what is the best way to organize read -> validate -> write in uniVocity? Read all rows at once, or use an iterator style? Where should parsed and validated rows be stored before they are written to the file?
Is there a way in uniVocity to access a row's values by index? Something like row.getValue(3)?
I'm the author of this library, let me try to help you out:
First, do not try to read all rows at once as you will fill your memory with LOTS of data.
You can get the row values by index.
The fastest approach to read/validate/write is to use a RowProcessor that holds a CsvWriter and decides whether to write or skip each row. I think the following code will help you a bit:
Define the output:
private CsvWriter createCsvWriter(File output, String encoding){
CsvWriterSettings settings = new CsvWriterSettings();
//configure the writer ...
try {
return new CsvWriter(new OutputStreamWriter(new FileOutputStream(output), encoding), settings);
} catch (IOException e) {
throw new IllegalArgumentException("Error writing to " + output.getAbsolutePath(), e);
}
}
Redirect the input:
//this creates a row processor for our parser. It validates each row and sends them to the csv writer.
private RowProcessor createRowProcessor(File output, String encoding){
final CsvWriter writer = createCsvWriter(output, encoding);
return new AbstractRowProcessor() {
@Override
public void rowProcessed(String[] row, ParsingContext context) {
if (shouldWriteRow(row)) {
writer.writeRow(row);
} else {
//skip row
}
}
private boolean shouldWriteRow(String[] row) {
//your validation here
return true;
}
@Override
public void processEnded(ParsingContext context) {
writer.close();
}
};
}
Configure the parser:
public void readAndWrite(File input, File output, String encoding) {
CsvParserSettings settings = new CsvParserSettings();
//configure the parser here
//tells the parser to send each row to the custom processor, which validates the rows and redirects the valid ones to the CsvWriter
settings.setRowProcessor(createRowProcessor(output, encoding));
CsvParser parser = new CsvParser(settings);
try {
parser.parse(new InputStreamReader(new FileInputStream(input), encoding));
} catch (IOException e) {
throw new IllegalStateException("Unable to open input file " + input.getAbsolutePath(), e);
}
}
For better performance you can also wrap the row processor in a ConcurrentRowProcessor.
settings.setRowProcessor(new ConcurrentRowProcessor(createRowProcessor(output, encoding)));
With this, the writing of rows will be performed in a separate thread.
I'm reading data via Spring Batch and I'm going to dump it into a database table.
My CSV file of musical facts is formatted like this:
question; valid answer; potentially another valid answer; unlikely, but another;
Every row has a question and at least one valid answer, but there can be more. The simplest way to hold this data in a POJO is with one String field for the question and a List<String> field for the answers.
Below is a simple line mapper for reading a CSV file, but I don't know what changes are needed to accommodate a jagged CSV file like this one.
@Bean
public LineMapper<MusicalFactoid> musicalFactoidLineMapper() {
DefaultLineMapper<MusicalFactoid> musicalFactoidDefaultLineMapper = new DefaultLineMapper<>();
musicalFactoidDefaultLineMapper.setLineTokenizer(new DelimitedLineTokenizer() {{
setDelimiter(";");
setNames(new String[]{"question", "answer"}); // <- this will not work!
}});
musicalFactoidDefaultLineMapper.setFieldSetMapper(new BeanWrapperFieldSetMapper<MusicalFactoid>() {{
setTargetType(MusicalFactoid.class);
}});
return musicalFactoidDefaultLineMapper;
}
What do I need to do?
Write your own LineMapper. As far as I can see, you don't have any complex logic.
Something like this:
public class MyLineMapper implements LineMapper<MusicalFactoid> {
@Override
public MusicalFactoid mapLine(String line, int lineNumber) {
MusicalFactoid dto = new MusicalFactoid();
// split() drops trailing empty strings, so the final ";" adds no empty answer
String[] parts = line.split(";");
// the first field is the question
dto.setQuestion(parts[0].trim());
// every remaining field is an answer, however many there are
for (int idx = 1; idx < parts.length; idx++) {
dto.addAnswer(parts[idx].trim());
}
return dto;
}
}