Reading line as String - spring

My question is somewhat related to this question.
In my batch configuration I am using a flat file reader, as shown below, with the intent of reading each entire line in the flat file as a string:
@Bean
@StepScope
@Qualifier("employeeItemReader")
@DependsOn("partitioner")
public FlatFileItemReader<Employee> employeeItemReader(@Value("#{stepExecutionContext['fileName']}") String filename)
        throws MalformedURLException {
    return new FlatFileItemReaderBuilder<Employee>().name("employeeItemReader")
            .delimited()
            .delimiter("<#|FooBar|#>")
            //.names(new String[] { "id", "firstName", "lastName" })
            .names(new String[] { "id" })
            .fieldSetMapper(new BeanWrapperFieldSetMapper<Employee>() {
                {
                    setTargetType(Employee.class);
                }
            })
            .linesToSkip(0)
            .resource(new UrlResource(filename))
            .build();
}
As you can see, I am (literally) using .delimiter("<#|FooBar|#>") as the delimiter. This serves my purpose (in the dev environment), since I am reading multiple files where each line contains a single UUID string value, and the delimiter will never appear inside a UUID.
But there is a chance that there might be more than one UUID per line, as I am getting those files from different sources. So I want to handle the situation where each line is of (or similar to) this format: afcf8f03-7d83-4c24-9b7b-d03303e70c00.
Question: How do I make use of FixedLengthTokenizer to make sure I always read a line as a UUID? I am dealing with the pattern 8AlphaNum-4AlphaNum-4AlphaNum-4AlphaNum-12AlphaNum. How do I handle these alphanumerics and hyphens?
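For reference, FixedLengthTokenizer works purely on column ranges, so a 36-character UUID line (32 hex characters in 8-4-4-4-12 groups plus four hyphens) can be read as a single field; validating the actual characters and hyphen positions would still have to happen elsewhere, for example in the FieldSetMapper. A minimal sketch, assuming standard Spring Batch classes (Range from org.springframework.batch.item.file.transform) and a single field named "id" - this is illustrative, not a verified solution:
FixedLengthTokenizer tokenizer = new FixedLengthTokenizer();
// One field spanning the full UUID: columns 1-36 (32 hex characters + 4 hyphens)
tokenizer.setNames("id");
tokenizer.setColumns(new Range(1, 36));
// Strict mode rejects lines whose length does not match the declared range
tokenizer.setStrict(true);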

Related

Spring Batch FlatFileItemReader Handle Additional Fields Added to CSV

So I've got a CSV file that's being ingested on a scheduled basis. The CSV file has a set of columns whose names are specified in the header row; the catch is that new columns are constantly being added to it. Currently, when a new field is added, the ingest flow breaks and I get a FlatFileParseException, and I have to go in and update the code with the new column names to make it work again.
What I'm looking to accomplish instead is that, when new columns are added, the code correctly picks out the columns it needs and does not throw an exception.
@Bean
@StepScope
FlatFileItemReader<Foo> fooReader(
    ...
) {
    final DelimitedLineTokenizer fooLineTokenizer = new DelimitedLineTokenizer(",") {{
        final String[] fooColumnNames = { "foo", "bar" };
        setNames(fooColumnNames);
        // setStrict(false);
    }};
    return new FlatFileItemReader<>() {{
        setLineMapper(new DefaultLineMapper<>() {{
            setLineTokenizer(fooLineTokenizer);
            setFieldSetMapper(new BeanWrapperFieldSetMapper<>() {{
                setTargetType(Foo.class);
            }});
        }});
        ...
    }};
}
I've tried using setStrict(false) in the lineTokenizer, and this gets rid of the exception; however, the problem then becomes that fields are set to the wrong values from the newly added columns, as opposed to the original columns the data was being pulled from.
Any ideas on how to add a bit more fault-tolerance to this flow, so I don't have to constantly update the fooColumnNames whenever columns are added to the csv?
I tried modifying the code using the setStrict(false) parameter and toying with custom implementations of lineTokenizer, but I'm still struggling to get fault tolerance when new columns are added to the CSV.
I don't know about fault tolerance, but it could be possible to retrieve the columns dynamically.
Add a listener to your step to retrieve the columns in beforeStep and pass them to your step execution context:
public class ColumnRetrieverListener implements StepExecutionListener {
    private Resource resource;
    // Getter and setter

    @Override
    public void beforeStep(StepExecution stepExecution) {
        String[] columns = getColumns();
        stepExecution.getExecutionContext().put("columns", columns);
    }

    private String[] getColumns() {
        // Parse the first line of the resource to get the column names
    }
}
Use the data passed to the execution context to set the names on the lineTokenizer:
final DelimitedLineTokenizer fooLineTokenizer = new DelimitedLineTokenizer(",") {{
    final String[] fooColumnNames = (String[]) stepExecution.getExecutionContext().get("columns");
    setNames(fooColumnNames);
}};
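One way to consume those column names is to late-bind them into a step-scoped reader, so the tokenizer is built from whatever the listener stored in the step execution context. The following is only a sketch of that wiring (Foo and fooReader are taken from the question; everything else is illustrative), assuming the step-scoped reader is first touched after beforeStep has run, which is the normal listener ordering:
@Bean
@StepScope
FlatFileItemReader<Foo> fooReader(
        @Value("#{stepExecutionContext['columns']}") String[] columns) {
    // Column names discovered by ColumnRetrieverListener in beforeStep
    DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer(",");
    tokenizer.setNames(columns);
    tokenizer.setStrict(false); // tolerate rows shorter than the header

    DefaultLineMapper<Foo> lineMapper = new DefaultLineMapper<>();
    lineMapper.setLineTokenizer(tokenizer);
    BeanWrapperFieldSetMapper<Foo> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
    fieldSetMapper.setTargetType(Foo.class);
    lineMapper.setFieldSetMapper(fieldSetMapper);

    FlatFileItemReader<Foo> reader = new FlatFileItemReader<>();
    reader.setLineMapper(lineMapper);
    reader.setLinesToSkip(1); // skip the header row the listener already parsed
    // setResource(...) and any other settings as in the original reader
    return reader;
}
With the header names used as field names, columns that do not correspond to Foo properties should simply be left unmapped by the BeanWrapperFieldSetMapper, which is what provides the extra tolerance when new columns appear.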

Change the scalar style used for all multi-line strings when serialising a dynamic model using YamlDotNet

I am using the following code snippet to serialise a dynamic model of a project to a string (which is eventually exported to a YAML file).
dynamic exportModel = exportModelConvertor.ToDynamicModel(project);
var serializerBuilder = new SerializerBuilder();
var serializer = serializerBuilder.EmitDefaults().DisableAliases().Build();
using (var sw = new StringWriter())
{
    serializer.Serialize(sw, exportModel);
    string result = sw.ToString();
}
Any multi-line strings such as the following:
propertyName = "One line of text
followed by another line
and another line"
are exported in the following format:
propertyName: >
  One line of text

  followed by another line

  and another line
Note the extra (unwanted) line breaks.
According to this YAML Multiline guide, the format used here is the folded block scalar style. Is there a way using YamlDotNet to change the style of this output for all multi-line string properties to literal block scalar style or one of the flow scalar styles?
The YamlDotNet documentation shows how to apply ScalarStyle.DoubleQuoted to a particular property using WithAttributeOverride but this requires a class name and the model to be serialised is dynamic. This also requires listing every property to change (of which there are many). I would like to change the style for all multi-line string properties at once.
To answer my own question, I've now worked out how to do this by deriving from the ChainedEventEmitter class and overriding void Emit(ScalarEventInfo eventInfo, IEmitter emitter). See code sample below.
public class MultilineScalarFlowStyleEmitter : ChainedEventEmitter
{
    public MultilineScalarFlowStyleEmitter(IEventEmitter nextEmitter)
        : base(nextEmitter) { }

    public override void Emit(ScalarEventInfo eventInfo, IEmitter emitter)
    {
        if (typeof(string).IsAssignableFrom(eventInfo.Source.Type))
        {
            string value = eventInfo.Source.Value as string;
            if (!string.IsNullOrEmpty(value))
            {
                bool isMultiLine = value.IndexOfAny(new char[] { '\r', '\n', '\x85', '\x2028', '\x2029' }) >= 0;
                if (isMultiLine)
                {
                    eventInfo = new ScalarEventInfo(eventInfo.Source)
                    {
                        Style = ScalarStyle.Literal
                    };
                }
            }
        }
        nextEmitter.Emit(eventInfo, emitter);
    }
}
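Note that the emitter still has to be plugged into the serializer when it is built; in YamlDotNet this is typically done through the builder, for example with serializerBuilder.WithEventEmitter(next => new MultilineScalarFlowStyleEmitter(next)) before calling Build(), though the exact registration call may vary between YamlDotNet versions.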

TokensRegex: Tokens are null after retokenization

I'm experimenting with Stanford NLP's TokensRegex and am trying to find dimensions (e.g. 100x120) in a text. My plan is to first retokenize the input to split such tokens further (using the example provided in retokenize.rules.txt) and then to search for the new pattern.
After the retokenization, however, only null values are left in place of the original string:
The top level annotation
[Text=100x120 Tokens=[null-1, null-2, null-3] Sentences=[100x120]]
The retokenization itself seems to work fine (3 tokens in the result), but the values are lost. What can I do to keep the original values in the token list?
My retokenize.rules.txt file is (as in the demo):
tokens = { type: "CLASS", value:"edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }
options.matchedExpressionsAnnotationKey = tokens;
options.extractWithTokens = TRUE;
options.flatten = TRUE;
ENV.defaults["ruleType"] = "tokens"
ENV.defaultStringPatternFlags = 2
ENV.defaultResultAnnotationKey = tokens
{ pattern: ( /\d+(x|X)\d+/ ), result: Split($0[0], /x|X/, TRUE) }
The main method:
public static void main(String[] args) throws IOException {
    //...
    text = "100x120";
    Properties properties = new Properties();
    properties.setProperty("tokenize.language", "de");
    properties.setProperty("annotators", "tokenize,retokenize,ssplit,pos,lemma,ner");
    properties.setProperty("customAnnotatorClass.retokenize", "edu.stanford.nlp.pipeline.TokensRegexAnnotator");
    properties.setProperty("retokenize.rules", "retokenize.rules.txt");
    StanfordCoreNLP stanfordPipeline = new StanfordCoreNLP(properties);
    runPipeline(stanfordPipeline, text);
}
And the pipeline:
public static void runPipeline(StanfordCoreNLP pipeline, String text) {
Annotation annotation = new Annotation(text);
pipeline.annotate(annotation);
out.println();
out.println("The top level annotation");
out.println(annotation.toShorterString());
//...
}
Thanks for letting us know. The CoreAnnotations.ValueAnnotation is not being populated, and we'll update TokensRegex to populate that field.
Regardless, you should be able to use TokensRegex to retokenize as you planned. Most of the pipeline does not depend on the ValueAnnotation and uses the CoreAnnotations.TextAnnotation instead. You can use the CoreAnnotations.TextAnnotation to get the text of the new tokens (each token is a CoreLabel, so you can also access it with token.word()).
See TokensRegexRetokenizeDemo for example code on how to get the different annotations out.
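As an illustration of reading the retokenized tokens through their text rather than the ValueAnnotation, a small sketch (placed inside runPipeline after pipeline.annotate(annotation); it assumes imports of edu.stanford.nlp.ling.CoreLabel and edu.stanford.nlp.ling.CoreAnnotations) could look like this:
// token.word() is backed by CoreAnnotations.TextAnnotation, which is populated
// even though the ValueAnnotation currently is not
for (CoreLabel token : annotation.get(CoreAnnotations.TokensAnnotation.class)) {
    out.println(token.word() + " [" + token.beginPosition() + ", " + token.endPosition() + "]");
}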

Spring FlatFileReader Jagged CSV file

I'm reading data via spring batch and I'm going to dump it into a database table.
My csv file of musical facts is formatted like this:
question; valid answer; potentially another valid answer; unlikely, but another;
All rows have a question and at least one valid answer, but there can be more. The simple way to hold this data in a POJO is with one String field for the question and a List<String> field for the answers.
Below is a simple line mapper to read a CSV file, but I don't know how to make the necessary changes to accommodate a jagged CSV file in this manner.
@Bean
public LineMapper<MusicalFactoid> musicalFactoidLineMapper() {
    DefaultLineMapper<MusicalFactoid> musicalFactoidDefaultLineMapper = new DefaultLineMapper<>();
    musicalFactoidDefaultLineMapper.setLineTokenizer(new DelimitedLineTokenizer() {{
        setDelimiter(";");
        setNames(new String[]{"question", "answer"}); // <- this will not work!
    }});
    musicalFactoidDefaultLineMapper.setFieldSetMapper(new BeanWrapperFieldSetMapper<MusicalFactoid>() {{
        setTargetType(MusicalFactoid.class);
    }});
    return musicalFactoidDefaultLineMapper;
}
What do I need to do?
Write your own LineMapper. As far as I can see, you don't have any complex logic.
Something like this:
public class MyLineMapper implements LineMapper<MusicalFactoid> {

    @Override
    public MusicalFactoid mapLine(String line, int lineNumber) {
        MusicalFactoid dto = new MusicalFactoid();
        String[] splitted = line.split(";");
        dto.setQuestion(splitted[0]);
        for (int idx = 1; idx < splitted.length; idx++) {
            dto.addAnswer(splitted[idx]);
        }
        return dto;
    }
}
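For completeness, the MusicalFactoid used by this mapper only needs a question field and a list of answers; a minimal sketch (field and accessor names assumed from the mapper above):
import java.util.ArrayList;
import java.util.List;

public class MusicalFactoid {

    private String question;
    private List<String> answers = new ArrayList<>();

    public void setQuestion(String question) {
        this.question = question;
    }

    public void addAnswer(String answer) {
        this.answers.add(answer);
    }

    // getters (and any persistence mapping) omitted
}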

Problem in using Solr WordDelimiterFilter

I am doing some tests using the WordDelimiterFilter in Solr, but it doesn't preserve the protected list of words which I pass to it. Would you please inspect the code and the output example and suggest which part is missing or used incorrectly?
When running this code:
private static Analyzer getWordDelimiterAnalyzer() {
    return new Analyzer() {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream stream = new StandardTokenizer(Version.LUCENE_32, reader);
            WordDelimiterFilterFactory wordDelimiterFilterFactory = new WordDelimiterFilterFactory();
            HashMap<String, String> args = new HashMap<String, String>();
            args.put("generateWordParts", "1");
            args.put("generateNumberParts", "1");
            args.put("catenateWords", "1");
            args.put("catenateNumbers", "1");
            args.put("catenateAll", "0");
            args.put("luceneMatchVersion", Version.LUCENE_32.name());
            args.put("language", "English");
            args.put("protected", "protected.txt");
            wordDelimiterFilterFactory.init(args);
            ResourceLoader loader = new SolrResourceLoader(null, null);
            wordDelimiterFilterFactory.inform(loader);
            /*
            List<String> protectedWords = new ArrayList<String>();
            protectedWords.add("good bye");
            protectedWords.add("hello world");
            wordDelimiterFilterFactory.inform(new LinesMockSolrResourceLoader(protectedWords));
            */
            return wordDelimiterFilterFactory.create(stream);
        }
    };
}
input text:
hello world
good bye
what is your plan for future?
protected strings:
good bye
hello world
output:
(hello,startOffset=0,endOffset=5,positionIncrement=1,type=)
(world,startOffset=6,endOffset=11,positionIncrement=1,type=)
(good,startOffset=12,endOffset=16,positionIncrement=1,type=)
(bye,startOffset=17,endOffset=20,positionIncrement=1,type=)
(what,startOffset=21,endOffset=25,positionIncrement=1,type=)
(is,startOffset=26,endOffset=28,positionIncrement=1,type=)
(your,startOffset=29,endOffset=33,positionIncrement=1,type=)
(plan,startOffset=34,endOffset=38,positionIncrement=1,type=)
(for,startOffset=39,endOffset=42,positionIncrement=1,type=)
(future,startOffset=43,endOffset=49,positionIncrement=1,type=)
You are using a StandardTokenizer, which tokenizes at least on whitespace, so "hello world" will always be split into "hello" and "world".
TokenStream stream = new StandardTokenizer(Version.LUCENE_32, reader);
See Lucene Documentation:
public final class StandardTokenizer extends Tokenizer
A grammar-based tokenizer constructed with JFlex. This should be a good tokenizer for most European-language documents:
- Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
- Recognizes email addresses and internet hostnames as one token.
The word delimiter protected word list is meant for cases like:
ISBN2345677 being split into ISBN 2345677
text2html not being split into text 2 html (because text2html was added to the protected words)
If you really want to do something like what you mentioned, you may use the KeywordTokenizer, but then you have to do the complete splitting yourself.
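As a sketch of that alternative (untested; it simply swaps the tokenizer inside getWordDelimiterAnalyzer() above), KeywordTokenizer emits the whole input as a single token, so multi-word protected entries such as "hello world" are no longer pre-split; any whitespace splitting you still need would then have to be done by your own code:
// org.apache.lucene.analysis.KeywordTokenizer: the entire reader content becomes one token
TokenStream stream = new KeywordTokenizer(reader);
return wordDelimiterFilterFactory.create(stream);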

Resources