CSV File To DB2 Database - skip columns - Spring Batch project

I am working on a Spring Batch project where I have to push data from a CSV file into a database. I managed to implement the batch job and the rest, and currently the data is being pushed as it should, but I wonder if there is any way to skip some of the columns in the CSV file, as some of them are irrelevant.
I did a bit of research but wasn't able to find an answer, unless I missed something.
A sample of my code is below.
<bean id="mysqlItemWriter"
class="org.springframework.batch.item.database.JdbcBatchItemWriter">
<property name="dataSource" ref="dataSource" />
<property name="sql">
<value>
<![CDATA[
insert into WEBREPORT.RAWREPORT(CLIENT,CLIENTUSER,GPS,EXTENSION) values (:client, :clientuser, :gps, :extension)
]]>
</value>
</property>

You can implement your own FieldSetMapper, which maps the structure of one line to your POJO in the reader.
Let's say you have:
name, surname, email
Mike, Evans, test@test.com
And you have a Person model with only name and email; you are not interested in surname. Here is a reader example:
@Component
@StepScope
public class PersonReader extends FlatFileItemReader<Person> {

    @Override
    public void afterPropertiesSet() throws Exception {
        // load file in csvResource variable
        setResource(csvResource);
        setLineMapper(new DefaultLineMapper<Person>() {
            {
                setLineTokenizer(new DelimitedLineTokenizer());
                setFieldSetMapper(new PersonFieldSetMapper());
            }
        });
        super.afterPropertiesSet();
    }
}
And you can define PersonFieldSetMapper:
@Component
@JobScope
public class PersonFieldSetMapper implements FieldSetMapper<Person> {

    @Override
    public Person mapFieldSet(final FieldSet fieldSet) throws BindException {
        final Person person = new Person();
        person.setName(fieldSet.readString(0));  // columns are zero based
        person.setEmail(fieldSet.readString(2));
        return person;
    }
}
This is for skipping columns, which, if I understood right, is what you want. If you want to skip rows, that can be done as well; I explained how to skip blank lines, for example, in this question.
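A related option, if you want to keep a plain DefaultLineMapper instead of a hand-written reader, is to drop the unwanted columns at tokenizer level with DelimitedLineTokenizer.setIncludedFields. A minimal sketch (the column positions 0 and 2 and the property names are assumptions about your file layout):
import org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;

// keep only the columns you care about; the rest are never tokenized
DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer();
tokenizer.setIncludedFields(0, 2);        // skip column 1 (surname)
tokenizer.setNames("name", "email");      // names for the columns that are kept

BeanWrapperFieldSetMapper<Person> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
fieldSetMapper.setTargetType(Person.class);

DefaultLineMapper<Person> lineMapper = new DefaultLineMapper<>();
lineMapper.setLineTokenizer(tokenizer);
lineMapper.setFieldSetMapper(fieldSetMapper);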

If the check for the skip is simple and does not need a database round trip, you can use a simple ItemProcessor which returns null for skipped items.
Really simple pseudo code:
public class SkipProcessor implements ItemProcessor<Foo, Foo> {

    @Override
    public Foo process(Foo foo) throws Exception {
        // check for a skip
        if (skip(foo)) {
            return null;
        } else {
            return foo;
        }
    }
}
If the skip check is more complex and needs a database round trip, you can still use the item processor, but performance (if that matters) will suffer.
If performance is critical... well, then it depends on the setup, the requirements and your possibilities. I would try it with two steps: the first step loads the CSV into the database (without any checks), and the second step reads the data from the database, with the skip check done by a clever JOIN in the SQL of the ItemReader.
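Something along those lines for the second step's reader, as a rough sketch only (RAW_IMPORT and CLIENT_BLACKLIST are made-up staging tables, and Foo stands in for your item type):
// the JOIN in the SQL does the skip check, so no processor is needed
JdbcCursorItemReader<Foo> reader = new JdbcCursorItemReader<>();
reader.setDataSource(dataSource);
reader.setSql(
    "SELECT r.client, r.gps, r.extension " +
    "FROM RAW_IMPORT r " +
    "LEFT JOIN CLIENT_BLACKLIST b ON b.client = r.client " +
    "WHERE b.client IS NULL");
reader.setRowMapper(new BeanPropertyRowMapper<>(Foo.class));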

Related

Extending RepositoryItemWriter does not delete rows so that the next step sees the rows

I am attempting to use a RepositoryItemWriter to delete Products by their ids, as recommended. I have configured the repository as follows:
@Repository
public interface ProductRepository extends CrudRepository<Product, Long> {
}
Next, I specify a RepositoryItemWriter bean as follows:
class ProductRepositoryItemWriter extends RepositoryItemWriter<Product> {

    private CrudRepository<Product, Long> productRepository;

    ProductRepositoryItemWriter(CrudRepository<Product, Long> productRepository) {
        super.setRepository(this.productRepository = productRepository);
    }

    @Transactional
    @Override
    protected void doWrite(List<? extends Product> products) {
        this.productRepository.deleteAll(products);
    }
}
My step looks like this:
public Step processStep(@Qualifier("jpaTransactionManager") final PlatformTransactionManager jpaTransactionManager) {
    return stepBuilderFactory.get("processStep")
            .transactionManager(jpaTransactionManager)
            .chunk(120)
            .reader(productJpaItemReader)
            .writer((ItemWriter) productRepositoryWriter)
            .build();
}
I see the deletes occurring, but the products are not actually deleted, so the next step fails when it tries to insert the products. That step follows the delete step like this: on("COMPLETED").to("uploadStep").end()
@Bean("repopulateFlow")
Flow repopulateFlow() {
    FlowBuilder<Flow> flowBuilder = new FlowBuilder<>("repopulateFlow");
    flowBuilder.start(deleteStep).on("COMPLETED")
            .to(upLoadStep)
            .end();
    return flowBuilder.build();
}
The deleteStep uses my ProductRepositoryItemWriter to delete the rows, and then the next step in the flow tries to re-insert the data, but that step finds data in the table that the deleteStep should have deleted. How can I achieve what I am trying to do?
I ran the delete step alone in a job and it does not delete the rows in the table after it completes. I use a HikariCP pool, with properties that populate a DataSourceProperties object to create the datasource. I wonder if Spring is not setting it to auto-commit true, or whether I need to create the Hikari pool myself and then set the auto-commit property to true.
The uploadStep fails because the rows are still there, so I get a ConstraintViolationException. I removed the @Transactional but still see the problem.
You need to remove @Transactional from your doWrite method. The writer is already executed in a transaction driven by Spring Batch.
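So the override shrinks to just the delete call, roughly:
@Override
protected void doWrite(List<? extends Product> products) {
    // runs inside the chunk transaction Spring Batch opened for the step,
    // so the deletes are committed together with the rest of the chunk
    this.productRepository.deleteAll(products);
}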

Spring Batch CompositeItemProcessor get value from other delegates

I have a compositeItemProcessor as below
<bean id="compositeItemProcessor" class="org.springframework.batch.item.support.CompositeItemProcessor">
<property name="delegates">
<list>
<bean class="com.example.itemProcessor1"/>
<bean class="com.example.itemProcessor2"/>
<bean class="com.example.itemProcessor3"/>
<bean class="com.example.itemProcessor4"/>
</list>
</property>
</bean>
The issue I have is that within itemProcessor4 I require values from both itemProcessor1 and itemProcessor3.
I have looked at using the step execution context, but this does not work as this is within one step. I have also looked at using @AfterProcess within ItemProcessor1, but this does not work as it isn't called until after ItemProcessor4.
What is the correct way to share data between delegates in a CompositeItemProcessor?
Is a solution of using a util:map that is updated in itemProcessor1 and read in itemProcessor4 acceptable, given that the commit-interval is set to 1?
Using the step execution context won't work, as it is persisted at chunk boundary, so it can't be shared between processors within the same chunk.
AfterProcess is called after the registered item processor, which is the composite processor in your case (so after ItemProcessor4). This won't work either.
The only option left is to use some data holder object that you share between the item processors.
Hope this helps.
This page seems to state that there are two types of ExecutionContext, one at step level and one at job level:
https://docs.spring.io/spring-batch/trunk/reference/html/patterns.html#passingDataToFutureSteps
You should be able to get the job context from the step context and set keys on that.
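One rough sketch of that idea (untested against your setup; the class name and key are made up, Foo stands in for your item type, and the delegate has to be registered explicitly as a step listener because delegates of a CompositeItemProcessor are not auto-registered):
public class SharedValueProcessor implements ItemProcessor<Foo, Foo> {

    private ExecutionContext jobContext;

    @BeforeStep
    public void beforeStep(StepExecution stepExecution) {
        // job-level context, visible to every delegate in the chain (and to later steps)
        this.jobContext = stepExecution.getJobExecution().getExecutionContext();
    }

    @Override
    public Foo process(Foo item) {
        jobContext.put("valueForProcessor4", item.toString());
        return item;
    }
}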
I had a similar requirement in my application too. I went with creating a data transfer object, ItemProcessorDto, which is shared by all the ItemProcessors. You store data in this DTO in the first processor, and all the remaining processors can get the information out of it. In addition, any ItemProcessor can update or retrieve data from the DTO.
Below is a code snippet:
@Bean
public ItemProcessor1<ItemProcessorDto> itemProcessor1() {
    log.info("Generating ItemProcessor1");
    return new ItemProcessor1();
}

@Bean
public ItemProcessor2<ItemProcessorDto> itemProcessor2() {
    log.info("Generating ItemProcessor2");
    return new ItemProcessor2();
}

@Bean
public ItemProcessor3<ItemProcessorDto> itemProcessor3() {
    log.info("Generating ItemProcessor3");
    return new ItemProcessor3();
}

@Bean
public ItemProcessor4<ItemProcessorDto> itemProcessor4() {
    log.info("Generating ItemProcessor4");
    return new ItemProcessor4();
}

@Bean
@StepScope
public CompositeItemProcessor<ItemProcessorDto> compositeItemProcessor() {
    log.info("Generating CompositeItemProcessor");
    CompositeItemProcessor<ItemProcessorDto> compositeItemProcessor = new CompositeItemProcessor<>();
    compositeItemProcessor.setDelegates(Arrays.asList(itemProcessor1(), itemProcessor2(), itemProcessor3(), itemProcessor4()));
    return compositeItemProcessor;
}
@Data
public class ItemProcessorDto {

    private List<String> sharedData_1;
    private Map<String, String> sharedData_2;
}
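A delegate then just fills or reads the DTO it receives as the item. For example (a sketch with made-up field use, and the class declarations simplified compared to the configuration above), ItemProcessor1 stores a value and ItemProcessor4 picks it up later in the same chain:
public class ItemProcessor1 implements ItemProcessor<ItemProcessorDto, ItemProcessorDto> {

    @Override
    public ItemProcessorDto process(ItemProcessorDto dto) {
        // stash whatever later processors will need on the shared DTO
        dto.setSharedData_1(Arrays.asList("computed-in-processor-1"));
        return dto;
    }
}

public class ItemProcessor4 implements ItemProcessor<ItemProcessorDto, ItemProcessorDto> {

    @Override
    public ItemProcessorDto process(ItemProcessorDto dto) {
        // read what ItemProcessor1 stored earlier for this item
        List<String> fromProcessor1 = dto.getSharedData_1();
        // ... use it ...
        return dto;
    }
}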

How spring batch share data between job

I have a question about a Spring Batch job.
I want to share data of one job with another job in the same execution context. Is it possible? If so, how?
My requirement is caching. I have a file where some data is stored. My job runs daily and needs the data from that file. I don't want my job to read the file every day; instead, I want to store the data of the file in a cache (a HashMap), so when the same job runs the next day, it uses the data from the cache only. Is this possible in Spring Batch?
Your suggestions are welcome.
You can use a Spring initializing bean which initializes your cache at startup.
Add the initializing bean to your application context:
<bean id="yourCacheBean" class="yourpackage.YourCacheBean" init-method="initialize">
</bean>
YourCacheBean looks like:
public class YourCacheBean {

    private Map<Object, Object> yourCache;

    public void initialize() {
        // TODO: initialize your cache
    }
}
Pass the initializing bean to the itemReader, itemProcessor or itemWriter in job.xml:
<bean id="exampleProcessor" class="yourpackage.ExampleProcessor" scope="step">
<property name="cacheBean" ref="yourCacheBean" />
</bean>
ExampleProcessor looks like:
public class ExampleProcessor implements ItemProcessor<String, String> {

    private YourCacheBean cacheBean;

    public String process(String arg0) {
        return "";
    }

    public void setCacheBean(YourCacheBean cacheBean) {
        this.cacheBean = cacheBean;
    }
}
Create a job to import the file into a database; other jobs will then use the data from the database as a cache.
Another way may be to read the file into a Map<>, serialize the object to a file and de-serialize it when needed (but I still prefer a database as cache).
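A sketch of that serialize/de-serialize variant (the file name and key/value types are arbitrary):
// write the cache once after parsing the file
Map<String, String> cache = new HashMap<>();
try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("cache.ser"))) {
    out.writeObject(cache);
}

// on the next run, restore it instead of re-reading the CSV
try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("cache.ser"))) {
    @SuppressWarnings("unchecked")
    Map<String, String> restored = (Map<String, String>) in.readObject();
}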
Spring has a cache annotation that may help in that kind of case, and it is really easy to implement. The first call to a method will be executed; afterwards, if you call the same method with exactly the same arguments, the value will be returned from the cache.
Here you have a little tutorial: http://www.baeldung.com/spring-cache-tutorial
In your case, if your call to read the file is always made with the same arguments, it will work as you want. Just take care of the TTL.
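A minimal sketch of that approach (the service, cache name and file format are assumptions, and it needs @EnableCaching plus a CacheManager in your context):
@Service
public class FileDataService {

    // first call reads the file; later calls with the same path return the cached map
    @Cacheable("fileData")
    public Map<String, String> loadFile(String path) throws IOException {
        Map<String, String> data = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(path))) {
            String[] parts = line.split(",", 2);
            if (parts.length == 2) {
                data.put(parts[0], parts[1]);
            }
        }
        return data;
    }
}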

DbUnit check autogenerating id

I need to execute integration tests using DbUnit. I have created two datasets (before and after the test) and compare them using the @DatabaseSetup and @ExpectedDatabase annotations. During the test one new database row is created (it is present in the after-test dataset, which I specify with the @ExpectedDatabase annotation). The problem is that the row id is generated automatically (I am using Hibernate), so it changes every run. Therefore my test passes only once, and after that I need to change the id in the after-test dataset, which is not what I want. Can you suggest any solutions for this issue, if it can be resolved with DbUnit?
Solution A:
Use an assigned id strategy and a separate query to retrieve the next value in your business logic. That way you can always assign a known id in your persistence tests, with some appropriate database cleanup. Note that this only works if you're using an Oracle sequence.
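A sketch of that (sequence name, entity and injected helpers are assumptions; it only works where the sequence can be queried directly, e.g. Oracle):
// fetch the next value from the sequence yourself ...
Long nextId = jdbcTemplate.queryForObject("SELECT my_entity_seq.NEXTVAL FROM dual", Long.class);

// ... and assign it explicitly, so the test dataset can contain a predictable id
MyEntity entity = new MyEntity();
entity.setId(nextId);
entityManager.persist(entity);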
Solution B:
If I'm not mistaken, there are methods similar to assertEqualsIgnoreCols() in org.dbunit.Assertion, so you can ignore the id assertion if you don't mind. Usually I compensate for that with a not-null check on the id. Maybe there are some options in @ExpectedDatabase, but I'm not sure.
Solution C:
I'd like to know if there is a better solution, because solution A introduces some performance overhead while solution B sacrifices a little test coverage.
Which version of DbUnit are you using, by the way? I have never seen these annotations in 2.4.9 and below; they look easier to use.
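For solution B, the programmatic variant looks roughly like this (table and column names are assumed):
// compare the table while ignoring the generated ID column
IDataSet expectedDataSet = new FlatXmlDataSetBuilder().build(new File("expected.xml"));
ITable expectedTable = expectedDataSet.getTable("MY_TABLE");
ITable actualTable = connection.createQueryTable("MY_TABLE", "select * from MY_TABLE");
Assertion.assertEqualsIgnoreCols(expectedTable, actualTable, new String[] { "ID" });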
This workaround has been saving my skin till now:
I implemented an AbstractDataSetLoader with a replacement feature:
public class ReplacerDataSetLoader extends AbstractDataSetLoader {

    private Map<String, Object> replacements = new ConcurrentHashMap<>();

    @Override
    protected IDataSet createDataSet(Resource resource) throws Exception {
        FlatXmlDataSetBuilder builder = new FlatXmlDataSetBuilder();
        builder.setColumnSensing(true);
        try (InputStream inputStream = resource.getInputStream()) {
            return createReplacementDataSet(builder.build(inputStream));
        }
    }

    /**
     * prepare some replacements
     * @param dataSet
     * @return
     */
    private ReplacementDataSet createReplacementDataSet(FlatXmlDataSet dataSet) {
        ReplacementDataSet replacementDataSet = new ReplacementDataSet(dataSet);
        // Configure the replacement dataset to replace '[null]' strings with null.
        replacementDataSet.addReplacementObject("[null]", null);
        replacementDataSet.addReplacementObject("[NULL]", null);
        replacementDataSet.addReplacementObject("[TODAY]", new Date());
        replacementDataSet.addReplacementObject("[NOW]", new Timestamp(System.currentTimeMillis()));
        for (java.util.Map.Entry<String, Object> entry : replacements.entrySet()) {
            replacementDataSet.addReplacementObject("[" + entry.getKey() + "]", entry.getValue());
        }
        replacements.clear();
        return replacementDataSet;
    }

    public void replace(String replacement, Object value) {
        replacements.put(replacement, value);
    }
}
With this you can somehow track the ids you need and replace them in your tests:
@DatabaseSetup(value = "/test_data_user.xml")
@DbUnitConfiguration(dataSetLoaderBean = "replacerDataSetLoader")
public class ControllerITest extends WebAppConfigurationAware {

    // reference my test db connection so I can get the last id using a regular query
    @Autowired
    DatabaseDataSourceConnection dbUnitDatabaseConnection;

    // reference my dataset loader so I can interact with it
    @Autowired
    ReplacerDataSetLoader datasetLoader;

    private static Number lastid = Integer.valueOf(15156);

    @Before
    public void setup() {
        System.out.println("setting " + lastid);
        datasetLoader.replace("emp1", lastid.intValue() + 1);
        datasetLoader.replace("emp2", lastid.intValue() + 2);
    }

    @After
    public void tearDown() throws SQLException, DataSetException {
        ITable table = dbUnitDatabaseConnection.createQueryTable("ids", "select max(id) as id from company.entity_group");
        lastid = (Number) table.getValue(0, "id");
    }

    @Test
    @ExpectedDatabase(value = "/expected_data.xml", assertionMode = DatabaseAssertionMode.NON_STRICT)
    public void test1() throws Exception {
        // run your test logic
    }

    @Test
    @ExpectedDatabase(value = "/expected_data.xml", assertionMode = DatabaseAssertionMode.NON_STRICT)
    public void test2() throws Exception {
        // run your test logic
    }
}
And my expected dataset needs the emp1 and emp2 replacements:
<?xml version='1.0' encoding='UTF-8'?>
<dataset>
<company.entity_group ID="15155" corporate_name="comp1"/>
<company.entity_group ID="15156" corporate_name="comp2"/>
<company.entity_group ID="[emp1]" corporate_name="comp3"/>
<company.entity_group ID="[emp2]" corporate_name="comp3"/>
<company.ref_entity ID="1" entity_group_id="[emp1]"/>
<company.ref_entity ID="2" entity_group_id="[emp2]"/>
</dataset>
Use DatabaseAssertionMode.NON_STRICT, and delete the 'id' column from your expected dataset XML.
DbUnit will then ignore this column.

ItemProcessor needs to both modify the working entity but also create new referenced ones

Using Spring Batch and JPA (provided by Hibernate).
I have a step that does the following:
reads all clients from the DB (Client entity)
enhances them with data from a 3rd party. The ItemProcessor goes to the 3rd-party data source and fetches some data that it stores in the Client entity itself (its fields), but it also brings back more data that is stored as different entities (ClientSale), and Client has a List property of these, mapped by ManyToOne.
The modified entity (Client) and the new ones (ClientSale) need to be stored in the DB.
The reader part is straightforward, and for the writer I used JpaItemWriter. At the processing stage I tried to update the fields, create the new entities, add them to the client's list and return the client, hoping that the writer would write both the referenced objects and the client itself to the DB.
Instead, I got an error saying that a ClientSale with id #123213213 doesn't exist in the DB.
How do I overcome this? Should I return a list of objects (of different types) from my processor (the client plus all the ClientSale entities)? Can JpaItemWriter handle a list of objects? Another problem with that is that I'd have to manually update the client_id in the ClientSale entities instead of adding them to the list and letting Hibernate understand the relation between them and who points where.
What's the best practice here?
Thanks!
OK, here is what I did in the end, based on everything:
I created a MultiEntityItemWriter that can receive a list as an item (and in that case it opens it up and writes all its elements to the delegate ItemWriter).
Code:
public class MultiEntityItemWriter implements ItemWriter<Object> {

    private ItemWriter<Object> delegate;

    @Override
    public void write(List<? extends Object> items) throws Exception {
        List<Object> toWrite = new ArrayList<>();
        for (Object item : items) {
            if (item instanceof Collection<?>) {
                toWrite.addAll((Collection<?>) item);
            } else {
                toWrite.add(item);
            }
        }
        delegate.write(toWrite);
    }

    public ItemWriter<Object> getDelegate() {
        return delegate;
    }

    public void setDelegate(ItemWriter<Object> delegate) {
        this.delegate = delegate;
    }
}
Now my ItemProcessor can output a list with all the entities to be written, and I don't need to rely on JPA understanding that there are more entities to be committed to the DB.
Hope it helps.
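For illustration, the processor side that feeds such a writer could return a collection like this (a sketch; the getSales() accessor and the class name are assumptions about the Client entity):
public class ClientEnrichmentProcessor implements ItemProcessor<Client, List<Object>> {

    @Override
    public List<Object> process(Client client) {
        // ... enrich the client from the 3rd party here ...
        List<Object> out = new ArrayList<>();
        out.add(client);                  // the modified Client itself
        out.addAll(client.getSales());    // plus the newly created ClientSale entities
        return out;
    }
}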
I think you are trying to accommodate multiple steps in a single step. Try finding a way to define your job as a two-step process instead of one.
<batch:job id="MyJob" incrementer="incrementer" job-repository="jobRepository">
    <batch:step id="step1" next="step2">
        <batch:tasklet>
            <batch:chunk reader="reader1"
                         writer="writer1"
                         processor="processor1"
                         commit-interval="10" />
        </batch:tasklet>
    </batch:step>
    <batch:step id="step2">
        <batch:tasklet>
            <batch:chunk reader="reader2"
                         writer="writer2"
                         processor="processor2"
                         commit-interval="10" />
        </batch:tasklet>
    </batch:step>
</batch:job>
If required, use appropriate caching for optimal performance.
EDIT:
In your item writer, please make sure you are using the entityManager/session of the first data source. Also, use merge to persist your entities instead of persist.
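A sketch of a merge-based writer, assuming a JPA EntityManager bound to the first data source (the class name is made up):
public class MergingItemWriter implements ItemWriter<Object> {

    @PersistenceContext
    private EntityManager entityManager;

    @Override
    public void write(List<? extends Object> items) {
        for (Object item : items) {
            // merge re-attaches detached/enriched entities; persist would fail for existing ids
            entityManager.merge(item);
        }
    }
}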
