I am using Spring Batch to read a flat file. The file has related records, i.e. there can be a parent record and any number of child records. I want to read all the records and call a web service to store them. I also want to capture the relationship and store it. One challenge is that a child record can be anywhere in the file, and a child can itself have many child records. I am unable to find a solution for this with Spring Batch.
Please provide your suggestions.
Update: I don't have the option to use a database as temporary storage for the data.
I solved a similar problem by processing the file multiple times.
On every pass I try to read/process every record in the file with the following algorithm:
if the record has a parent: check whether the parent is already stored; if not, skip the record in the processor (see the sketch below)
if the record is unchanged (or already stored, if updates are not possible): skip it in the processor
else: store it in the db
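In the processor that boils down to something like the following rough sketch (RecordDto and recordService are placeholders I'm using for illustration, not the original code):
import org.springframework.batch.item.ItemProcessor;

public class ProcessParentAndChildsProcessor implements ItemProcessor<RecordDto, RecordDto> {

    private final RecordService recordService; // placeholder for whatever lookup/storage you use

    public ProcessParentAndChildsProcessor(RecordService recordService) {
        this.recordService = recordService;
    }

    @Override
    public RecordDto process(RecordDto record) {
        // parent not stored yet -> skip on this pass; a later pass will pick the record up
        if (record.getParentId() != null && !recordService.isStored(record.getParentId())) {
            return null;
        }
        // already stored (or unchanged) -> nothing to write
        if (recordService.isStored(record.getId())) {
            return null;
        }
        // otherwise hand it to the writer; a write count > 0 triggers another pass (see the decider below)
        return record;
    }
}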
Then declare a loop and a decider:
<batch:step id="processParentAndChilds" next="loop">
    <batch:tasklet>
        <batch:chunk reader="processParentAndChildsReader"
                     commit-interval="1000">
            <batch:processor>
                <bean class="processParentAndChildsProcessor"/>
            </batch:processor>
            <batch:writer>
                <bean class="processParentAndChildsWriter"/>
            </batch:writer>
        </batch:chunk>
    </batch:tasklet>
</batch:step>

<batch:decision decider="processParentAndChildsRetryDecider" id="loop">
    <batch:next on="NEXT_LEVEL" to="processParentAndChilds"/>
    <batch:next on="COMPLETED" to="goToNextSteps"/>
</batch:decision>
public class ProcessParentAndChildsRetryDecider implements JobExecutionDecider {

    @Override
    public FlowExecutionStatus decide(JobExecution jobExecution, StepExecution stepExecution) {
        // if no record was written on this pass, there is no point in trying again
        if (stepExecution.getWriteCount() > 0) {
            return new FlowExecutionStatus("NEXT_LEVEL");
        } else {
            return FlowExecutionStatus.COMPLETED;
        }
    }
}
I created a Spring Batch job which reads orders from MongoDB and makes a REST call to upload them. However, the job completes even though not all records have been read by the MongoItemReader.
I maintain a boolean field batchProcessed on the Orders collection. The MongoItemReader reads the records matching {batchProcessed:{$ne:true}}, as I need to run the job multiple times without processing the same documents again and again.
In my OrderWriter I set batchProcessed to true.
@Bean
@StepScope
public MongoItemReader<Order> orderReader() {
    MongoItemReader<Order> reader = new MongoItemReader<>();
    reader.setTemplate(mongoTemplate);
    HashMap<String, Sort.Direction> sortMap = new HashMap<>();
    sortMap.put("_id", Sort.Direction.ASC);
    reader.setSort(sortMap);
    reader.setTargetType(Order.class);
    reader.setQuery("{batchProcessed:{$ne:true}}");
    return reader;
}
@Bean
public Step uploadOrdersStep(OrderItemProcessor processor) {
    return stepBuilderFactory.get("step1").<Order, Order>chunk(1)
            .reader(orderReader()).processor(processor).writer(orderWriter).build();
}

@Bean
public Job orderUploadBatchJob(JobBuilderFactory factory, OrderItemProcessor processor) {
    return factory.get("uploadOrder").flow(uploadOrdersStep(processor)).end().build();
}
The MongoItemReader is a paging item reader. When reading items in pages while changing items that the query might return (i.e. a field that is used in the query's "where" clause), the paging logic is thrown off and some items are skipped. There is a similar problem with the JPA paging item reader, explained in detail here: Spring batch jpaPagingItemReader why some rows are not read?
Common techniques to work around this issue are to use a cursor-based reader, a staging table/collection, a partitioned step with a partition per page, etc.
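One simple way around the problem for moderate data volumes is my own sketch below (not code from the question or the answer): load the matching orders exactly once, up front, so the writer's updates to batchProcessed cannot shift the pages. For large collections a true cursor-based reader or a staging collection is preferable.
import java.util.Iterator;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemStreamReader;
import org.springframework.data.domain.Sort;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;

public class UnprocessedOrdersReader implements ItemStreamReader<Order> {

    private final MongoTemplate mongoTemplate;
    private Iterator<Order> orders;

    public UnprocessedOrdersReader(MongoTemplate mongoTemplate) {
        this.mongoTemplate = mongoTemplate;
    }

    @Override
    public void open(ExecutionContext executionContext) {
        // the query is executed exactly once, before any batchProcessed flag is flipped
        Query query = Query.query(Criteria.where("batchProcessed").ne(true))
                .with(Sort.by(Sort.Direction.ASC, "_id"));
        orders = mongoTemplate.find(query, Order.class).iterator();
    }

    @Override
    public Order read() {
        return orders != null && orders.hasNext() ? orders.next() : null;
    }

    @Override
    public void update(ExecutionContext executionContext) {
        // nothing persisted in this simplified sketch, so the step is not restartable mid-way
    }

    @Override
    public void close() {
        orders = null;
    }
}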
I am using Spring Batch and, as usual, I have a reader, a processor and a writer.
I have 2 questions:
1> The reader queries all 200 records (the total record count in the table is 200 and I have set pageSize=200), so it gives me all 200 records. In the processor I want the list of all these records, because I have to compare each record with the other 199 to group them into different tiers. So I am thinking that if I can get that list in the processing step, I can manipulate it. How should I approach this?
2> In the processing stage I need some master data from the database, based on which all the input records will be processed. I am thinking of injecting a DataSource into the processing bean and fetching all the master table data to process the records. Is this a good approach, or can you suggest otherwise?
<job id="sampleJob">
<step id="step1">
<tasklet>
<chunk reader="itemReader" processor="processor" writer="itemWriter" commit-interval="20"/>
</tasklet>
</step>
</job>
And the processor is:
@Override
public User process(Object item) throws Exception {
    // transform item to user
    return user;
}
And I want something like:
public List<User> process(List<Object> items) throws Exception {
    // transform items to users
    return users;
}
I found some posts here, but they say to collect the list in the writer. I don't want to do any processing in the writer, because that defeats the separation between writer and processor. Is there any configuration to get the list inside this process method?
Thank you
Since the ItemProcessor receives whatever you return from the ItemReader, you need your ItemReader to return the List. That List is really the "item" you're processing. There is an example of this in the Spring Batch samples: the AggregateItemReader reads all the items from a delegate ItemReader and returns them as a single list. You can take a look at it on GitHub here: https://github.com/spring-projects/spring-batch/blob/master/spring-batch-samples/src/main/java/org/springframework/batch/sample/domain/multiline/AggregateItemReader.java
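If you don't want to pull in the sample class, the idea is simple enough to sketch yourself (a simplified, hypothetical version, not the actual sample class): drain a delegate reader and return everything as one List, which then becomes the single item your ItemProcessor<List<Object>, List<User>> receives.
import java.util.ArrayList;
import java.util.List;
import org.springframework.batch.item.ItemReader;

public class ListWrappingItemReader<T> implements ItemReader<List<T>> {

    private final ItemReader<T> delegate;
    private boolean exhausted;

    public ListWrappingItemReader(ItemReader<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public List<T> read() throws Exception {
        if (exhausted) {
            return null; // end of input: the single list has already been emitted
        }
        List<T> items = new ArrayList<>();
        T item;
        while ((item = delegate.read()) != null) {
            items.add(item);
        }
        exhausted = true;
        return items.isEmpty() ? null : items;
    }
}
Since the reader now emits a single item (the whole list), a commit-interval of 1 is enough for that step.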
I have a transformer which returns a Map as its result. This result is then put on the output channel. What I want is to go to a different channel for each KEY in the map. How can I configure this in Spring Integration?
e.g.
Transformer -- produces --> Map
Map contains {(Key1, "some data"), (Key2, "some data")}
So for Key1 --> go to channel 1
So for Key2 --> go to channel 2
etc..
Code examples would be helpful.
Thanks in advance
GM
Your processing should consist of two steps:
Splitting the message into separate parts that will be processed independently,
Routing the separate messages (the result of the split) to the appropriate channels.
For the first task you have to use a splitter, and for the second a router (a header value router fits best here).
Please find a sample Spring Integration configuration below. You may want to use an aggregator at the end of the chain in order to recombine the messages; I leave that at your discretion.
<channel id="inputChannel">
<!-- splitting message into separate parts -->
<splitter id="messageSplitter" input-channel="inputChannel" method="split"
output-channel="routingChannel">
<beans:bean class="com.stackoverflow.MapSplitter"/>
</spliter>
<channel id="routingChannel">
<!-- routing messages into appropriate channels basis on header value -->
<header-value-router input-channel="routingChannel" header-name="routingHeader">
<mapping value="someHeaderValue1" channel="someChannel1" />
<mapping value="someHeaderValue2" channel="someChannel2" />
</header-value-router>
<channel id="someChannel1" />
<channel id="someChannel2" />
And the splitter:
public final class MapSplitter {

    public static final String ROUTING_HEADER_NAME = "routingHeader";

    public List<Message<SomeData>> split(final Message<Map<Key, SomeData>> message) {
        List<Message<SomeData>> result = new LinkedList<>();
        for (Entry<Key, SomeData> entry : message.getPayload().entrySet()) {
            final Message<SomeData> part = MessageBuilder
                    .withPayload(entry.getValue())
                    .setHeader(ROUTING_HEADER_NAME, entry.getKey())
                    .build();
            result.add(part);
        }
        return result;
    }
}
My batched statements in MyBatis are timing out. I'd like to throttle the load I'm sending to the database by flushing the statements periodically. In iBATIS, I used a callback, something like this:
sqlMapClientTemplate.execute(new SqlMapClientCallback<Integer>() {
    @Override
    public Integer doInSqlMapClient(SqlMapExecutor executor) throws SQLException {
        executor.startBatch();
        int tally = 0;
        for (Foo foo : foos) {
            executor.insert("FooSql.insertFoo", foo.getData());
            /* executes batch when > MAX_TALLY */
            tally = BatchHelper.updateTallyOnMod(executor, tally);
        }
        return executor.executeBatch();
    }
});
Is there a better way to do this in MyBatis? Or do I need to do the same kind of thing with a SqlSessionCallback? That feels cumbersome. What I'd really like to do is configure the project to flush every N batched statements.
I did not get any responses, so I'll share the solution I settled on.
MyBatis provides direct access to statement flushing. I autowired the SqlSession, used Guava to partition the collection into manageable chunks, then flushed the statements after each chunk.
// split the collection into chunks of at most MAX_TALLY items
Iterable<List<Foo>> partitions = Iterables.partition(foos, MAX_TALLY);
for (List<Foo> partition : partitions) {
    for (Foo foo : partition) {
        mapper.insertFoo(foo); // statements are only batched here, not yet executed
    }
    sqlSession.flushStatements(); // executes the batch accumulated for this chunk
}
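For completeness, the snippet above assumes a batch-mode session; flushStatements() is only meaningful when the session uses the BATCH executor. A rough sketch of that wiring with mybatis-spring (bean and class names here are illustrative, not from the original answer):
import org.apache.ibatis.session.ExecutorType;
import org.apache.ibatis.session.SqlSessionFactory;
import org.mybatis.spring.SqlSessionTemplate;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MyBatisBatchConfig {

    // SqlSessionTemplate implements SqlSession, so this is what gets autowired
    // where the loop above calls sqlSession.flushStatements()
    @Bean
    public SqlSessionTemplate batchSqlSessionTemplate(SqlSessionFactory sqlSessionFactory) {
        return new SqlSessionTemplate(sqlSessionFactory, ExecutorType.BATCH);
    }
}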
Sorry for the late response, but I just stumbled onto this question right now. However, hopefully it will help others with a similar problem.
You don't need to explicitly autowire the SqlSession. You can use the mapper interface itself. In the mapper interface simply define a method that is annotated with the @Flush annotation and has a return type of List<BatchResult>. Here's an example of such a method in the mapper interface:
@Flush
List<BatchResult> flushBatchedStatements();
Then simply call this method on your mapper object like so:
Iterable<List<Foo>> partitions = Iterables.partition(foos, MAX_TALLY);
for (List<Foo> partition : partitions) {
    for (Foo foo : partition) {
        mapper.insertFoo(foo);
    }
    mapper.flushBatchedStatements(); // flushes all the statements batched up to this point into the table
}
Note that you don't need to add anything special into your mapper XML file to support this type of statement flushing via the mapper interface. Your XML mapper may simply be something like
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd" >
<mapper namespace=".....">
    <insert id="bulkInsertIntoTable" parameterType="myPackage.Foo">
        insert into MyDatabaseTable(col1, col2, col3)
        values ( #{data1}, #{data2}, #{data3} )
    </insert>
</mapper>
The only thing that is needed is that you use MyBatis 3.3 or higher. Here's what the MyBatis docs on the MyBatis website state:
If this annotation is used, it can be called the
SqlSession#flushStatements() via method defined at a Mapper
interface.(MyBatis 3.3 or above)
For more details please visit the MyBatis official documentation site:
http://www.mybatis.org/mybatis-3/java-api.html
I have multiple log files, 1.csv, 2.csv and 3.csv, generated by a log report.
I want to read and parse those files concurrently using Scriptella.
Scriptella does not provide parallel job execution out of the box. Instead you should use a job scheduler provided by an operating system or a programming environment (e.g. run multiple ETL files by submitting jobs to an ExecutorService).
Here is a working example that imports a single file whose name is passed in as a parameter:
ETL file:
<!DOCTYPE etl SYSTEM "http://scriptella.javaforge.com/dtd/etl.dtd">
<etl>
    <connection id="in" driver="csv" url="$input"/>
    <connection id="out" driver="text"/>
    <query connection-id="in">
        <script connection-id="out">
            Importing: $1, $2
        </script>
    </query>
</etl>
Java code to run files in parallel:
// Imports 3 CSV files in parallel using a fixed thread pool
public class ParallelCsvTest {
    public static void main(String[] args) throws EtlExecutorException, MalformedURLException, InterruptedException {
        final ExecutorService service = Executors.newFixedThreadPool(3);
        for (int i = 1; i <= 3; i++) {
            // pass the file name as a parameter to the ETL file, e.g. input1.csv
            final Map<String, ?> map = Collections.singletonMap("input", "input" + i + ".csv");
            EtlExecutor executor = EtlExecutor.newExecutor(new File("parallel.csv.etl.xml").toURI().toURL(), map);
            service.submit((Callable<ExecutionStatistics>) executor);
        }
        service.shutdown();
        service.awaitTermination(10, TimeUnit.SECONDS);
    }
}
To run this example, create 3 CSV files input1.csv, input2.csv and input3.csv and put them in the current working directory. Example of a CSV file:
Level, Message
INFO,Process 1 started
INFO,Process 1 stopped