Spring Batch: read files one by one when file content is not constant

MultiResourceItemReader reads all files sequentially.
I want the processor and writer to be called once one file has been read completely; the reader should not move on to the next file until then.
Since the file content is not constant, I can't use a fixed chunk size.
Any idea on a chunk policy that can detect the end of a file's content?

I think you should write a step which reads/processes/writes only one file with a "single file item reader" (like FlatFileItemReader), and repeat the step while there are files remaining.
Spring Batch gives you a feature to do so: conditional flows, and in particular the programmatic flow decision, which gives you a smart way to decide when to stop a loop between steps (when there are no files left).
And since you will not be able to give a constant input file name to your reader, you should also have a look at the late binding section.
Hope this is enough to help you. Please comment if you need more details.
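A minimal sketch of such a decider, assuming each pass moves or deletes the file it has processed so the loop eventually ends (the input directory and file pattern are illustrative):

public class MoreFilesDecider implements JobExecutionDecider {

    @Override
    public FlowExecutionStatus decide(JobExecution jobExecution, StepExecution stepExecution) {
        // Hypothetical check: are there unprocessed files left in the input directory?
        File[] remaining = new File("/in/dir").listFiles((dir, name) -> name.endsWith(".txt"));
        return (remaining != null && remaining.length > 0)
                ? new FlowExecutionStatus("CONTINUE")
                : FlowExecutionStatus.COMPLETED;
    }
}

In the job flow you would then route the "CONTINUE" status back to the file-processing step, and end the job on COMPLETED.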

An alternative: use MultiResourceItemReader with multiple file resources, and a custom reader as its delegate that reads each file completely. The logic for reading a file completely is shown below.
@Bean
public MultiResourceItemReader<SimpleFileBean> simpleReader() {
    Resource[] resourceList = getFileResources();
    if (resourceList == null) {
        System.out.println("No input files available");
    }
    MultiResourceItemReader<SimpleFileBean> resourceItemReader = new MultiResourceItemReader<>();
    resourceItemReader.setResources(resourceList);
    resourceItemReader.setDelegate(simpleFileReader());
    return resourceItemReader;
}

@Bean
SimpleInboundReader simpleFileReader() {
    return new SimpleInboundReader(customSimpleFileReader());
}

@Bean
public FlatFileItemReader<String> customSimpleFileReader() {
    return new FlatFileItemReaderBuilder<String>()
            .name("customFileItemReader")
            .lineMapper(new PassThroughLineMapper())
            .build();
}
public class SimpleInboundReader implements ResourceAwareItemReaderItemStream<SimpleFileBean> {

    private String fileName = null;
    private ResourceAwareItemReaderItemStream<String> delegate;

    public SimpleInboundReader(ResourceAwareItemReaderItemStream<String> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void open(ExecutionContext executionContext) throws ItemStreamException {
        delegate.open(executionContext);
    }

    @Override
    public void update(ExecutionContext executionContext) throws ItemStreamException {
        delegate.update(executionContext);
    }

    @Override
    public void close() throws ItemStreamException {
        delegate.close();
    }

    @Override
    public void setResource(Resource resource) {
        fileName = resource.getFilename();
        this.delegate.setResource(resource);
    }

    String getNextLine() throws Exception {
        return delegate.read();
    }

    @Override
    public SimpleFileBean read() throws Exception, UnexpectedInputException, ParseException, NonTransientResourceException {
        // Read the whole current file as a single item: the first line starts a new
        // SimpleFileBean, then lines are appended until the delegate signals end of file.
        String currentLine = delegate.read();
        if (currentLine != null) {
            SimpleFileBean simpleFileBean = new SimpleFileBean();
            simpleFileBean.getLines().add(currentLine);
            while ((currentLine = getNextLine()) != null) {
                simpleFileBean.getLines().add(currentLine);
            }
            return simpleFileBean;
        }
        return null;
    }
}
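SimpleFileBean is not shown in the answer; presumably it is just a holder for one file's lines, along the lines of:

public class SimpleFileBean {

    private final List<String> lines = new ArrayList<>();

    // Filled by SimpleInboundReader.read() with every line of one file
    public List<String> getLines() {
        return lines;
    }
}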

Related

How to use Spring Batch to read CSV files which contain multiple lines in one cell?

The raw CSV is like this (the first line is a header: Name, StudentId, Comment):
Name, StudentId, Comment
Jake, 12312, poor
Emma, 12324, good
Mary, 13214, need more work on programming
and math.
The comment cell of the last entry contains two lines; I want to treat it as one line of data.
When I read the file using a FlatFileItemReader, it throws an error about "expected token 3 but actual 1". I guess it treats the second physical line as a new record.
Is there a way to treat them as one line?
Have your reader just return the raw string for each line without trying to split on the delimiter, and make a processor (it has to be stateful) handle the parsing. The only tricky part is that you'll have to signal the EOF to the processor somehow, so it isn't waiting to see if it should aggregate the next line. Something like this:
public class AggregatingItemProcessor<T> implements ItemProcessor<T, T>, InitializingBean {

    private BiPredicate<T, T> aggregatePredicate;
    private BiFunction<T, T, T> aggregator;

    public void setAggregatePredicate(BiPredicate<T, T> aggregatePredicate) {
        this.aggregatePredicate = aggregatePredicate;
    }

    public void setAggregator(BiFunction<T, T, T> aggregator) {
        this.aggregator = aggregator;
    }

    private T cur;

    @Override
    public T process(T item) throws Exception {
        if (cur == null) {
            cur = item;
            return null;
        }
        if (aggregatePredicate.test(cur, item)) {
            cur = aggregator.apply(cur, item);
            return null;
        } else {
            T toRet = cur;
            cur = item;
            return toRet;
        }
    }

    @Override
    public void afterPropertiesSet() throws Exception {
        Assert.notNull(aggregatePredicate, "Predicate to determine if records should be aggregated must not be null.");
        Assert.notNull(aggregator, "Function for aggregating items must not be null.");
    }
}
Then the config...
static final String EOF_MARKER = "\0";

@Bean
public FlatFileItemReader<String> reader() {
    final FlatFileItemReader<String> reader = new FlatFileItemReader<String>() {
        private boolean finished = false;

        @Override
        public String read() throws Exception, UnexpectedInputException, ParseException {
            if (finished) return null;
            String next = super.read();
            if (next == null) {
                finished = true;
                return EOF_MARKER;
            }
            return next;
        }
    };
    reader.setLineMapper((s, i) -> s);
    return reader;
}

@Bean
public AggregatingItemProcessor<String> processor() {
    final AggregatingItemProcessor<String> processor = new AggregatingItemProcessor<>();
    processor.setAggregatePredicate((s1, s2) -> !EOF_MARKER.equals(s2) && StringUtils.countOccurrencesOf(s2, ",") < 2);
    processor.setAggregator(String::concat);
    return processor;
}
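For completeness, a minimal sketch of wiring these into a chunk-oriented step; the step name, chunk size, and the trivial writer are illustrative, and the reader bean above still needs its input Resource set elsewhere:

@Bean
public Step csvStep(StepBuilderFactory stepBuilderFactory) {
    return stepBuilderFactory.get("csvStep")
            .<String, String>chunk(10)
            .reader(reader())
            .processor(processor())
            // Illustrative writer: each item arriving here is one fully aggregated logical row
            .writer(items -> items.forEach(System.out::println))
            .build();
}

The predicate treats any line that is not the EOF marker and has fewer than two commas as a continuation of the previous record, which is how "and math." ends up appended to Mary's row.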

Reading an Okio stream twice

I am using OkHttp for networking and currently get a character stream from response.charStream(), which I then pass to Gson for parsing. Once parsed and inflated, I deflate the model again to save it to disk using a stream. It seems like extra work to have to go from networkReader to Model to DiskWriter. Is it possible with Okio to instead go from networkReader to JSONParser(reader) as well as networkReader to DiskWriter(reader)? Basically, I want to be able to read from the network stream twice.
You can use a MirroredSource (taken from this gist).
public class MirroredSource {

    private final Buffer buffer = new Buffer();
    private final Source source;
    private final AtomicBoolean sourceExhausted = new AtomicBoolean();

    public MirroredSource(final Source source) {
        this.source = source;
    }

    public Source original() {
        return new okio.Source() {
            @Override
            public long read(final Buffer sink, final long byteCount) throws IOException {
                final long bytesRead = source.read(sink, byteCount);
                if (bytesRead > 0) {
                    synchronized (buffer) {
                        sink.copyTo(buffer, sink.size() - bytesRead, bytesRead);
                        // Notify the mirror to continue
                        buffer.notify();
                    }
                } else {
                    sourceExhausted.set(true);
                }
                return bytesRead;
            }

            @Override
            public Timeout timeout() {
                return source.timeout();
            }

            @Override
            public void close() throws IOException {
                source.close();
                sourceExhausted.set(true);
                synchronized (buffer) {
                    buffer.notify();
                }
            }
        };
    }

    public Source mirror() {
        return new okio.Source() {
            @Override
            public long read(final Buffer sink, final long byteCount) throws IOException {
                synchronized (buffer) {
                    while (!sourceExhausted.get()) {
                        // Only need to synchronise on reads while the source is not exhausted.
                        if (buffer.request(byteCount)) {
                            return buffer.read(sink, byteCount);
                        } else {
                            try {
                                buffer.wait();
                            } catch (final InterruptedException e) {
                                // No op
                            }
                        }
                    }
                }
                return buffer.read(sink, byteCount);
            }

            @Override
            public Timeout timeout() {
                return new Timeout();
            }

            @Override
            public void close() throws IOException { /* not used */ }
        };
    }
}
Usage would look like:
MirroredSource mirroredSource = new MirroredSource(response.body().source()); // or however you get your original source
Source originalSource = mirroredSource.original();
Source secondSource = mirroredSource.mirror();

doParsing(originalSource);
writeToDisk(secondSource);

originalSource.close();
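For reference, a minimal sketch of what writeToDisk could look like using Okio's buffering helpers; the target file name is an illustrative assumption:

static void writeToDisk(Source source) throws IOException {
    // Okio.sink(File) opens a file sink; writeAll(...) drains the mirrored source into it
    try (BufferedSink sink = Okio.buffer(Okio.sink(new File("response.json")))) {
        sink.writeAll(source);
    }
}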
If you want something more robust you can repurpose Relay from OkHttp.

Choose Class in BIRT is empty even though I have added the jar in the data source

While creating a dataset, the Choose Class window is empty even though I added the jar to the data source. I am using Luna Service Release 2 (4.4.2).
From: http://yaragalla.blogspot.com/2013/10/using-pojo-datasource-in-birt-43.html
In the dataset class, the three methods “public void open(Object obj, Map map)”, “public Object next()” and “public void close()” must be implemented.
Make sure you have implemented these.
Here is a sample that I tested with:
public class UserDataSet {

    public Iterator<User> itr;

    public List<User> getUsers() throws ParseException {
        List<User> users = new ArrayList<>();
        // Add to Users
        ....
        return users;
    }

    public void open(Object obj, Map<String, Object> map) {
        try {
            itr = getUsers().iterator();
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }

    public Object next() {
        if (itr.hasNext())
            return itr.next();
        return null;
    }

    public void close() {
    }
}
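For completeness, a hypothetical User bean the sample could iterate over; the fields here are made up:

public class User {
    private String name;
    private int id;

    public User(String name, int id) {
        this.name = name;
        this.id = id;
    }

    // These getters are what you map to dataset columns in the BIRT editor
    public String getName() { return name; }
    public int getId() { return id; }
}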

Freemarker removeIntrospectionInfo does not work with DCEVM after model hotswap

I am using FreeMarker and the DCEVM+HotSwapManager agent. This basically allows me to hotswap classes even when adding/removing methods.
Everything works like a charm until FreeMarker uses a hotswapped class as a model. It then throws freemarker.ext.beans.InvalidPropertyException: No such bean property at me, even though reflection shows that the method is there (checked during a debug session).
I am using

final Method clearInfoMethod = beanWrapper.getClass().getDeclaredMethod("removeIntrospectionInfo", Class.class);
clearInfoMethod.setAccessible(true);
clearInfoMethod.invoke(clazz);

to clear the cache, but it does not work. I even tried to obtain the classCache member field and clear it using reflection, but that does not work either.
What am I doing wrong? I just need to force FreeMarker to throw away any introspection info it has already obtained for the model class(es). Is there any way?
UPDATE
Example code
Application.java
// Application.java
public class Application
{
    public static final String TEMPLATE_PATH = "TemplatePath";
    public static final String DEFAULT_TEMPLATE_PATH = "./";

    private static Application INSTANCE;
    private Configuration freemarkerConfiguration;
    private BeansWrapper beanWrapper;

    public static void main(String[] args)
    {
        final Application application = new Application();
        INSTANCE = application;
        try
        {
            application.run(args);
        }
        catch (InterruptedException e)
        {
            System.out.println("Exiting");
        }
        catch (IOException e)
        {
            System.out.println("IO Error");
            e.printStackTrace();
        }
    }

    public Configuration getFreemarkerConfiguration()
    {
        return freemarkerConfiguration;
    }

    public static Application getInstance()
    {
        return INSTANCE;
    }

    private void run(String[] args) throws InterruptedException, IOException
    {
        final String templatePath = System.getProperty(TEMPLATE_PATH) != null
                ? System.getProperty(TEMPLATE_PATH)
                : DEFAULT_TEMPLATE_PATH;
        final Configuration configuration = new Configuration();
        freemarkerConfiguration = configuration;
        beanWrapper = new BeansWrapper();
        beanWrapper.setUseCache(false);
        configuration.setObjectWrapper(beanWrapper);
        try
        {
            final File templateDir = new File(templatePath);
            configuration.setTemplateLoader(new FileTemplateLoader(templateDir));
        }
        catch (IOException e)
        {
            throw new RuntimeException(e);
        }
        final RunnerImpl runner = new RunnerImpl();
        try
        {
            runner.run(args);
        }
        catch (RuntimeException e)
        {
            e.printStackTrace();
        }
    }

    public BeansWrapper getBeanWrapper()
    {
        return beanWrapper;
    }
}
RunnerImpl.java
// RunnerImpl.java
public class RunnerImpl implements Runner
{
    @Override
    public void run(String[] args) throws InterruptedException
    {
        long counter = 0;
        while (true)
        {
            ++counter;
            System.out.printf("Run %d\n", counter);
            // Application.getInstance().getFreemarkerConfiguration().setObjectWrapper(new BeansWrapper());
            Application.getInstance().getBeanWrapper().clearClassIntrospecitonCache();
            final Worker worker = new Worker();
            worker.doWork();
            Thread.sleep(1000);
        }
    }
}
Worker.java
// Worker.java
public class Worker
{
    void doWork()
    {
        final Application application = Application.getInstance();
        final Configuration freemarkerConfiguration = application.getFreemarkerConfiguration();
        try
        {
            final Template template = freemarkerConfiguration.getTemplate("test.ftl");
            final Model model = new Model();
            final PrintWriter printWriter = new PrintWriter(System.out);
            printObjectInto(model);
            System.out.println("-----TEMPLATE MACRO PROCESSING-----");
            template.process(model, printWriter);
            System.out.println();
            System.out.println("-----END OF PROCESSING------");
            System.out.println();
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
        catch (TemplateException e)
        {
            e.printStackTrace();
        }
    }

    private void printObjectInto(Object o)
    {
        final Class<?> aClass = o.getClass();
        final Method[] methods = aClass.getDeclaredMethods();
        for (final Method method : methods)
        {
            System.out.println(String.format("Method name: %s, public: %s", method.getName(), Modifier.isPublic(method.getModifiers())));
        }
    }
}
Model.java
// Model.java
public class Model
{
    public String getMessage()
    {
        return "Hello";
    }

    public String getAnotherMessage()
    {
        return "Hello World!";
    }
}
This example does not work at all. Even changing BeansWrapper during runtime won't have any effect.
BeansWrapper's (and DefaultObjectWrapper's, etc.) introspection cache relies on java.beans.Introspector.getBeanInfo(aClass), not on raw reflection. (That's because it treats objects as JavaBeans.) java.beans.Introspector has its own internal cache, so it can return stale information, and in that case BeansWrapper will just recreate its own class introspection data from that stale information. As far as java.beans.Introspector's caching goes, it's in fact correct, as it builds on the assumption that classes in Java are immutable. If something breaks that basic rule, it should ensure that java.beans.Introspector's cache is cleared (and many other caches...), or else it's not just FreeMarker that will break. At JRebel, for example, they put a lot of effort into clearing all kinds of caches. I guess DCEVM doesn't have the resources for that. So it seems you have to call Introspector.flushCaches() yourself.
Update: For a while (Java 7, maybe 6) java.beans.Introspector has had one cache per thread group, so you have to call flushCaches() from all thread groups. And all of this is actually an implementation detail that, in principle, can change at any time. Sadly, the JavaDoc of Introspector.flushCaches() doesn't warn you...
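A minimal sketch of the flush in the context of this example, run after a hotswap and before the next template processing (the per-thread-group caveat above still applies):

// Flush the JavaBeans-level cache first; FreeMarker rebuilds its own
// introspection data from whatever Introspector returns.
java.beans.Introspector.flushCaches();
// Then drop FreeMarker's cached class info as well (this method name really
// is spelled this way in the BeansWrapper API).
Application.getInstance().getBeanWrapper().clearClassIntrospecitonCache();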

Custom WritableComparable displays object reference as output

I am new to Hadoop and Java, and I feel there is something obvious I am just missing. I am using Hadoop 1.0.3, if that means anything.
My goal for using Hadoop is to take a bunch of files and parse them one file at a time (as opposed to line by line). Each file will produce multiple key-value pairs, but the context of the other lines is important. The key and value are multi-value/composite, so I have implemented WritableComparable for the key and Writable for the value. Because the processing of each file takes a bit of CPU, I want to save the output of the mapper, then run multiple reducers later on.
For the composite keys, I followed http://stackoverflow.com/questions/12427090/hadoop-composite-key
The problem is, the output is just Java object references as opposed to the composite key and value. Example:
LinkKeyWritable@bd2f9730 LinkValueWritable@8752408c
I am not sure if the problem is related to not reducing the data at all, or something else.
Here is my main class:
public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(Parser.class);
    conf.setJobName("raw_parser");

    conf.setOutputKeyClass(LinkKeyWritable.class);
    conf.setOutputValueClass(LinkValueWritable.class);

    conf.setMapperClass(RawMap.class);
    conf.setNumMapTasks(0);

    conf.setInputFormat(PerFileInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    PerFileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}
And my Mapper class:
public class RawMap extends MapReduceBase implements
        Mapper<NullWritable, Text, LinkKeyWritable, LinkValueWritable> {

    public void map(NullWritable key, Text value,
            OutputCollector<LinkKeyWritable, LinkValueWritable> output,
            Reporter reporter) throws IOException {
        String json = value.toString();
        SerpyReader reader = new SerpyReader(json);
        GoogleParser parser = new GoogleParser(reader);
        for (String page : reader.getPages()) {
            String content = reader.readPageContent(page);
            parser.addPage(content);
        }
        for (Link link : parser.getLinks()) {
            LinkKeyWritable linkKey = new LinkKeyWritable(link);
            LinkValueWritable linkValue = new LinkValueWritable(link);
            output.collect(linkKey, linkValue);
        }
    }
}
Link is basically a struct of various information that gets split between LinkKeyWritable and LinkValueWritable.
LinkKeyWritable:
public class LinkKeyWritable implements WritableComparable<LinkKeyWritable> {

    protected Link link;

    public LinkKeyWritable() {
        super();
        link = new Link();
    }

    public LinkKeyWritable(Link link) {
        super();
        this.link = link;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        link.batchDay = in.readLong();
        link.source = in.readUTF();
        link.domain = in.readUTF();
        link.path = in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(link.batchDay);
        out.writeUTF(link.source);
        out.writeUTF(link.domain);
        out.writeUTF(link.path);
    }

    @Override
    public int compareTo(LinkKeyWritable o) {
        return ComparisonChain.start().
                compare(link.batchDay, o.link.batchDay).
                compare(link.domain, o.link.domain).
                compare(link.path, o.link.path).
                result();
    }

    @Override
    public int hashCode() {
        return Objects.hashCode(link.batchDay, link.source, link.domain, link.path);
    }

    @Override
    public boolean equals(final Object obj) {
        if (obj instanceof LinkKeyWritable) {
            final LinkKeyWritable o = (LinkKeyWritable) obj;
            return Objects.equal(link.batchDay, o.link.batchDay)
                    && Objects.equal(link.source, o.link.source)
                    && Objects.equal(link.domain, o.link.domain)
                    && Objects.equal(link.path, o.link.path);
        }
        return false;
    }
}
LinkValueWritable:
public class LinkValueWritable implements Writable {

    protected Link link;

    public LinkValueWritable() {
        link = new Link();
    }

    public LinkValueWritable(Link link) {
        this.link = new Link();
        this.link.type = link.type;
        this.link.description = link.description;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        link.type = in.readUTF();
        link.description = in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(link.type);
        out.writeUTF(link.description);
    }

    @Override
    public int hashCode() {
        return Objects.hashCode(link.type, link.description);
    }

    @Override
    public boolean equals(final Object obj) {
        if (obj instanceof LinkValueWritable) {
            final LinkValueWritable o = (LinkValueWritable) obj;
            return Objects.equal(link.type, o.link.type)
                    && Objects.equal(link.description, o.link.description);
        }
        return false;
    }
}
I think the answer is in the implementation of TextOutputFormat. Specifically, LineRecordWriter's writeObject method:

/**
 * Write the object to the byte stream, handling Text as a special
 * case.
 * @param o the object to print
 * @throws IOException if the write throws, we pass it on
 */
private void writeObject(Object o) throws IOException {
    if (o instanceof Text) {
        Text to = (Text) o;
        out.write(to.getBytes(), 0, to.getLength());
    } else {
        out.write(o.toString().getBytes(utf8));
    }
}
As you can see, if your key or value is not a Text object, it calls toString() on it and writes that out. Since you've left toString() unimplemented in your key and value classes, it's using the Object class's implementation, which writes out the reference.
I'd say you should try writing an appropriate toString() or using a different OutputFormat.
It looks like you have a list of objects just like you wanted. You need to implement toString() on your Writable if you want a human-readable version printed out instead of an ugly Java reference.
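For example, a minimal toString() for LinkKeyWritable that mirrors the fields serialized in write(); the tab-separated layout is an illustrative choice:

@Override
public String toString() {
    // Emit the same fields, in the same order, as write(DataOutput)
    return link.batchDay + "\t" + link.source + "\t" + link.domain + "\t" + link.path;
}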
