Within a Spring Batch processor, how to commit an entity to the database immediately on calling repository.save() - spring-boot

I am creating a Spring Batch process (Spring Boot 2) that reads a file and writes it to a database. It processes one record at a time: read from the file, process it, and write (or update) to the database.
If a record for the same ID already exists in the DB, the process has to update the end date of the existing record and create a new record with the new start date. Below is the code:
public class Processor implements ItemProcessor<CelebVO, CelebVO> {

    @Autowired
    EndorseTableRepository endorseTableRepository;

    @Override
    @Transactional
    public CelebVO process(CelebVO celebVO) {
        CelebEndorsement celebEndorsement =
                endorseTableRepository.findAllByCelebIDAndBrandID(celebVO.getCelebID(), celebVO.getBrandID());
        if (celebEndorsement == null) {
            CelebEndorsement newEndorsement = new CelebEndorsement(celebVO);
            endorseTableRepository.save(newEndorsement);
        } else {
            // close out the existing endorsement
            celebEndorsement.setEndDate(celebVO.getEffDt().minusDays(1));
            endorseTableRepository.save(celebEndorsement);
            // create a new row with the new start date
            CelebEndorsement newEndorsement = new CelebEndorsement(celebVO);
            newEndorsement.setStartDate(celebVO.getEffDt());
            endorseTableRepository.save(newEndorsement);
        }
        return celebVO;
    }
}
Below is the input txt file (CelebVO):
CelebID BrandID EffDt
J Lo Pepsi 2021-01-05
J Lo Pepsi 2021-05-30
Now, let's suppose we start with an empty EndorseTable. When the process picks up the file and reads the first record, it sees there is no record for CelebID 'J Lo', so it inserts a row into the DB.
Next, the process reads the second row and processes it. It should see that there is already a record in the table for J Lo, so it should set an end date on that record and then create a new record.
After this file is processed we should see two records in the table.
But that is not what is happening. Though I call repository.save() for the first record, it is still not committed to the table. So when the process reads the second row, it doesn't find any rows in the table, and it ends up writing only one record to the table.
I tried repository.saveAndFlush(). That doesn't help.
My chunk size is 1.
I tried removing @Transactional, but that breaks the code, so I kept it there.

The chunk-oriented processing model of Spring Batch commits a transaction per chunk, not per record. So in your case, if the insert and the update happen to be in the same chunk, the processor won't see the change made for the previous record, because the transaction is not committed yet at that point.
Adding @Transactional on your processor's method is incorrect, because the processor is already executed within the scope of a transaction driven by Spring Batch. What you are trying to do would work if you set the commit interval to 1, but this would impact the performance of your step.
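For reference, a minimal sketch of what a commit interval of 1 looks like in a step definition (Spring Batch 4 / Boot 2 style; the step name and reader/writer beans are illustrative, not from the question):
@Autowired
private StepBuilderFactory stepBuilderFactory;

@Bean
public Step celebStep(ItemReader<CelebVO> reader,
                      Processor processor,
                      ItemWriter<CelebVO> writer) {
    return stepBuilderFactory.get("celebStep")
            // commit interval (chunk size) of 1: one transaction per record,
            // so each saved row is committed before the next item is processed
            .<CelebVO, CelebVO>chunk(1)
            .reader(reader)
            .processor(processor)
            .writer(writer)
            .build();
}
With that in place each record is committed before the next one is read, at the cost of many small transactions.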

I had to modify the entity class. I replaced
@ManyToOne(cascade = CascadeType.ALL)
with
@ManyToOne(cascade = {CascadeType.MERGE, CascadeType.DETACH})
and it worked.

Related

How do I update one column of all rows in a large table in my Spring Boot application?

I have a Spring Boot 2.x project with a big table in my Cassandra database. In my Liquibase migration class, I need to replace a value in one column across all rows.
For me it's a big performance hit when I try to solve this with
SELECT * FROM BOOKING
forEach Row
Update Row
because of the total number of rows, even when I select only one column.
Is it possible to make something like a "partwise/pagination" loop?
Pseudocode:
Take first 1000 rows
do Update
Take next 1000 rows
do Update
loop.
I'm also happy about any other solution approaches you have.
Must know:
Make sure there is a way to group the updates by partition. If you try a batch update on 1000 rows that are not in the same partition, the coordinator of the request will suffer: you are moving the load from your client to the coordinator, and you want to parallelize the writes instead. A batch update in Cassandra has nothing to do with the one in relational databases.
For fine-grained operations like this you want to go back to using the drivers, with CassandraOperations and CqlSession, for maximum control.
There is a way to paginate with Spring Data Cassandra using Slice, but you do not have control over how the operations are implemented.
Spring Data Cassandra core:
Slice<MyEntity> slice = myEntityRepo.findAll(CassandraPageRequest.first(size));
while (slice.hasNext() && currpage < page) {
    slice = myEntityRepo.findAll(slice.nextPageable());
    currpage++;
}
slice.getContent();
Drivers:
// Prepare statements to speed up the queries
PreparedStatement selectPS = session.prepare(QueryBuilder
        .selectFrom("myEntity").all()
        .build()
        .setPageSize(1000)                    // 1000 rows per page
        .setTimeout(Duration.ofSeconds(10))); // 10s timeout
PreparedStatement updatePS = session.prepare(QueryBuilder
        .update("mytable")
        .setColumn("myColumn", QueryBuilder.bindMarker())
        .whereColumn("myPK").isEqualTo(QueryBuilder.bindMarker())
        .build()
        .setConsistencyLevel(ConsistencyLevel.ONE)); // fast writes
// Paginate
ResultSet page1 = session.execute(selectPS.bind());
Iterator<Row> page1Iter = page1.iterator();
while (0 < page1.getAvailableWithoutFetching()) {
    Row row = page1Iter.next();
    session.executeAsync(updatePS.bind(...));
}
ByteBuffer pagingStateAsBytes =
        page1.getExecutionInfo().getPagingState();
ResultSet page2 = session.execute(
        selectPS.bind().setPagingState(pagingStateAsBytes));
You could of course include this pagination in a loop and track progress.
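As a rough sketch (same assumptions as above: driver 4 API, illustrative column names, and a hypothetical newValue being written into myColumn), such a loop could look like this:
// Hypothetical pagination loop: read page by page via the paging state and
// fire the updates asynchronously as each row of the current page is consumed.
String newValue = "new-value"; // whatever you are writing into myColumn
ByteBuffer pagingState = null;
do {
    BoundStatement bound = selectPS.bind().setPagingState(pagingState);
    ResultSet page = session.execute(bound);
    Iterator<Row> rows = page.iterator();
    while (page.getAvailableWithoutFetching() > 0) {
        Row row = rows.next();
        session.executeAsync(updatePS.bind(newValue, row.getString("myPK")));
    }
    pagingState = page.getExecutionInfo().getPagingState(); // null when there are no more pages
} while (pagingState != null);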

Spring Boot JPA save() method trying to insert existing row

I have a simple Kafka consumer that collects events and, based on the data in them, inserts or updates a record in the database; the table has a unique constraint on the ID column, which is also declared on the entity field.
Everything works fine when the table is pre-populated and inserts happen every now and then. However, when I truncate the table and send a couple of thousand events with a limited number of IDs (I was doing 50 unique IDs within 3k events), the events are processed simultaneously and the save() method randomly fails with a unique constraint violation exception. I debugged it and the outcome is pretty simple.
event1={id = 1 ... //somedata} gets picked up, the service method saveOrUpdateRecord() looks for a record with ID=1, finds none, and inserts a new record.
event2={id = 1 ... //somedata} gets picked up almost at the same time, the service method saveOrUpdateRecord() looks for a record with ID=1, finds none (the previous one is mid-insert), tries to insert and fails with a constraint violation exception - it should find that record and merge it with the input from the event based on my conditions.
How can I get saveOrUpdateRecord() to run only after the previous call has fully executed, to prevent such behaviour? I really don't want to slow the Kafka consumer down with poll size etc.; I just want my service to execute one transaction at a time.
The service method:
public void saveOrUpdateRecord(Object input) {
    Object output = repository.findById(input.getId()).orElse(null);
    if (output == null) {
        repository.save(input);
    } else {
        mergeRecord(input, output);
        repository.save(output);
    }
}
Will the @Transactional annotation on the method do the job?
Make your service thread-safe.
Use this:
public synchronized void saveOrUpdateRecord(Object input) {
    Object output = repository.findById(input.getId()).orElse(null);
    if (output == null) {
        repository.save(input);
    } else {
        mergeRecord(input, output);
        repository.save(output);
    }
}

Native SQL select forces flushing in transactional method

I have a transactional method where objects are inserted. The debugger shows that upon eventsDAO.save(..) the actual insert doesn't take place, but there is only a sequence fetch. The first time I see insert into events_t .. in the debugger is when there's a reference to the just-inserted Event.
@Transactional(propagation = Propagation.REQUIRED, rollbackFor = Exception.class, readOnly = false)
public void insertEvent(..) {
    EventsT eventsT = new EventsT();
    // Fill it out...
    EventsT savedEventsT = eventsDAO.save(eventsT); // No actual insert happens here
    // .. Some other HQL fetches or statements ...
    // The actual save (insert) only happens after some actual reference to this EventsT (below).
    // This is also HQL.
    SomeField someField = eventsDAO.findSomeAttrForEventId(savedEventsT.getId());
}
But I also see that this only holds true if all the statements are HQL (non-native).
As soon as I put a Native-SQL Select somewhere before any actual reference to this table, even if it does not touch the table in any way, it forces an immediate flush and I see the statement insert into events_t ... on the console at that exact point.
If I don't touch the table EventsT with my Native SQL Select in any way, why does the flushing happen at that point?
According to the Hibernate documentation:
6.1. AUTO flush
By default, Hibernate uses the AUTO flush mode which triggers a flush in the following circumstances:
prior to committing a Transaction
prior to executing a JPQL/HQL query that overlaps with the queued entity actions
before executing any native SQL query that has no registered synchronization
So, this is expected behaviour. See also this section. It shows how you can use a synchronization.
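A minimal sketch of that workaround, assuming a plain Hibernate Session and an unrelated other_table (the names are illustrative): declaring the query space the native query touches lets Hibernate skip the auto-flush when your queued EventsT insert does not overlap with it.
// Sketch only: register the query space so the flush-before-native-query check
// can see that this statement does not overlap with the pending EventsT insert.
Session session = entityManager.unwrap(Session.class);
List<?> ids = session.createNativeQuery("select id from other_table")
        .addSynchronizedQuerySpace("other_table") // or addSynchronizedEntityClass(OtherEntity.class)
        .getResultList();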

Spring Data / Hibernate save entity with Postgres using Insert on Conflict Update Some fields

I have a domain object in Spring which I am saving using the JpaRepository.save method, with a sequence generator from Postgres to generate the id automatically.
@SequenceGenerator(initialValue = 1, name = "device_metric_gen", sequenceName = "device_metric_seq")
public class DeviceMetric extends BaseTimeModel {

    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "device_metric_gen")
    @Column(nullable = false, updatable = false)
    private Long id;

    ///// extra fields
My use case requires an upsert instead of the normal save operation (which, I am aware, will update if the id is present). I want to update an existing row if a combination of three columns (assume a composite unique constraint) is present, or else create a new row.
This is something similar to this:
INSERT INTO customers (name, email)
VALUES
(
    'Microsoft',
    'hotline@microsoft.com'
)
ON CONFLICT (name)
DO
    UPDATE
    SET email = EXCLUDED.email || ';' || customers.email;
One way of achieving this in Spring Data that I can think of is:
Write a custom save operation in the service layer that
does a get on the three columns and, if a row is present,
sets the same id on the current object and does a repository.save;
if no row is present, does a normal repository.save.
The problem with the above approach is that every insert now does a select and then a save, which makes two database calls, whereas the same can be achieved with the Postgres insert-on-conflict feature in just one DB call.
Any pointers on how to implement this in Spring Data?
One way is to write a native query: insert into ... values (all fields here). The object in question has around 25 fields, so I am looking for a better way to achieve the same.
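For illustration, a hedged sketch of that native-query route (the table and column names, the unique constraint over field1/field2/field3, and the upsert method itself are assumptions, not something Spring Data generates for you), with the downside being exactly the long column list mentioned above:
public interface DeviceMetricRepository extends JpaRepository<DeviceMetric, Long> {

    // Hypothetical single-call upsert; the id is taken from the Postgres sequence
    // explicitly because native inserts bypass the JPA sequence generator.
    @Modifying
    @Transactional
    @Query(value = "insert into device_metric (id, field1, field2, field3, metric_value) "
                 + "values (nextval('device_metric_seq'), :f1, :f2, :f3, :val) "
                 + "on conflict (field1, field2, field3) "
                 + "do update set metric_value = excluded.metric_value",
           nativeQuery = true)
    int upsert(@Param("f1") String f1, @Param("f2") String f2,
               @Param("f3") String f3, @Param("val") String val);
}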
As @JBNizet mentioned, you answered your own question by suggesting reading the data first, then updating if found and inserting otherwise. Here's how you could do it using Spring Data and Optional.
Define a findByField1AndField2AndField3 method on your DeviceMetricRepository.
public interface DeviceMetricRepository extends JpaRepository<DeviceMetric, Long> {
    Optional<DeviceMetric> findByField1AndField2AndField3(String field1, String field2, String field3);
}
Use the repository in a service method.
@RequiredArgsConstructor
public class DeviceMetricService {

    private final DeviceMetricRepository repo;

    DeviceMetric save(String email, String phoneNumber) {
        DeviceMetric deviceMetric = repo.findByField1AndField2AndField3("field1", "field2", "field3")
                .orElse(new DeviceMetric()); // create a new object in whatever way makes sense for you
        deviceMetric.setEmail(email);
        deviceMetric.setPhoneNumber(phoneNumber);
        return repo.save(deviceMetric);
    }
}
A word of advice on observability:
You mentioned that this is a high-throughput use case in your system. Regardless of the approach taken, consider instrumenting timers around this save. That way you can objectively measure the initial performance against any tunings you make. Look at this as an experiment, and be prepared to pivot to other solutions as needed. If you are always reading these three columns together, ensure they are indexed. With these things in place, you may find that reading to determine update/insert is acceptable.
I would recommend using a named query to fetch a row based on your candidate keys. If a row is present, update it; otherwise create a new row. Both of these operations can be done using the save method.
@NamedQuery(name = "getCustomerByNameAndEmail", query = "select a from Customers a where a.name = :name and a.email = :email")
You can also put a unique constraint on the entity (@Table(uniqueConstraints = @UniqueConstraint(columnNames = {"name", "email"}))) to make sure that these columns always maintain uniqueness when grouped together.
Optional<Customers> customer = customerRepo.getCustomerByNameAndEmail(name, email);
Implement the above method in your repository. All it will do is call the query and pass the name and email as parameters. Make sure to return an Optional.empty() if there is no row present.
Customers c;
if (customer.isPresent()) {
    c = customer.get();
    c.setEmail("newemail@gmail.com");
    c.setPhone("9420420420");
    customerRepo.save(c);
} else {
    c = new Customers(0, "name", "email", "5451515478");
    customerRepo.save(c);
}
Pass the ID as 0 and JPA will insert a new row with the ID generated according to the sequence generator.
Although I never recommend using a number as an ID: if possible, use a randomly generated UUID for the primary key; it will guarantee uniqueness and avoid any unexpected behaviour that may come with sequence generators.
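For example, a minimal sketch of a UUID primary key (Hibernate 5 style, as shipped with Spring Boot 2; the generator name is arbitrary):
@Id
@GeneratedValue(generator = "uuid2")
@GenericGenerator(name = "uuid2", strategy = "uuid2") // Hibernate's built-in UUID generator
@Column(nullable = false, updatable = false)
private UUID id;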
With Spring JPA it's pretty simple to implement this with clean Java code.
Using Spring Data JPA's T getOne(ID id) method, you're not querying the DB itself but obtaining a reference (proxy) to the DB object. Therefore, when updating/saving the entity you perform a single database operation.
To be able to modify the object, Spring provides the @Transactional annotation, a method-level annotation declaring that the method starts a transaction and closes it only when the method itself finishes.
You'd have to:
Start a JPA transaction
get the DB reference through getOne
modify the DB reference
save it to the database
close the transaction
Not having much visibility of your actual code, I'm going to abstract it as much as possible:
@Transactional
public void saveOrUpdate(DeviceMetric metric) {
    DeviceMetric deviceMetric = metricRepository.getOne(metric.getId());
    // modify it
    deviceMetric.setName("Hello World!");
    metricRepository.save(deviceMetric);
}
The tricky part is not to think of getOne as a SELECT from the DB. The database never gets called until the save method.

Spring data Neo4j Affected row count

Considering a Spring Boot / Neo4j environment with Spring Data Neo4j 4, I want to perform a delete and get an error message when it fails to delete anything.
My problem is that since Repository.delete() returns void, I have no idea whether the delete modified anything or not.
First question: is there any way to get the number of rows affected by the last query? For example, in PL/SQL I could use SQL%ROWCOUNT.
So anyway, I tried the following code:
public void deletesomething(Long somethingId) {
    somethingRepository.delete(getExistingsomething(somethingId).getId());
}

private something getExistingsomething(Long somethingId, int depth) {
    return Optional.ofNullable(somethingRepository.findOne(somethingId, depth))
            .orElseThrow(() -> new somethingNotFoundException(somethingId));
}
In the code above I query the database to check that the value exists before I delete it.
Second question: do you recommend a different approach?
So now, just to add some complexity: I have a clustered database where db1 can only create, update and delete, while db2 and db3 can only read (this is ensured by the cluster sockets). db2 and db3 receive the data from db1 through the replication process.
From what I have seen so far, replication can take up to 90s, which means that for up to 90s the databases can be in different states.
Looking again to the code above:
public void deletesomething(Long somethingId) {
    somethingRepository.delete(getExistingsomething(somethingId).getId());
}
in debug that means:
getExistingsomething(somethingId).getId() // will hit db2
somethingRepository.delete(...) // will hit db1
and so, if replication has not yet inserted the value into db2, this code will throw the exception.
The third question is: without changing those sockets, is there any way for me to delete and return the correct response?
This is not currently supported in Spring Data Neo4j; if you need it, please open a feature request.
In the meantime, perhaps the easiest workaround is to drop down to the OGM level of abstraction:
Create a class that is injected with org.neo4j.ogm.session.Session
Use the query method on Session and read the statistics from the Result
Example (in Kotlin, which is what was on hand):
fun deleteProfilesByColor(color: String) {
    val query = """
        MATCH (n:Profile {color: {color}})
        DETACH DELETE n;
        """
    val params = mutableMapOf(
        "color" to color
    )
    val result = session.query(query, params)
    val statistics = result.queryStatistics() // use these!
}
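Roughly the same thing in Java, for reference (a sketch: session here is the injected org.neo4j.ogm.session.Session, and the query statistics give you the affected-count the question asks for):
// Run the delete through the OGM session and read the statistics back.
Map<String, Object> params = Collections.singletonMap("color", color);
Result result = session.query("MATCH (n:Profile {color: {color}}) DETACH DELETE n", params);
int deleted = result.queryStatistics().getNodesDeleted(); // affected nodes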
