Read count in Spring Batch reader is less than source count - spring-boot

I am using Spring Batch within Spring Boot, reading a table from a source database and writing it to a destination.
I am populating three tables, and the read count for one of them is less than the actual number of rows.
The source table has 2,656,274 rows, but I am getting readCount=2577203, writeCount=2577203.
The other tables, with roughly 800,000 and 10,000 records, work fine.
public class DxDatabaseItemReader extends JdbcCursorItemReader<DxDataRead> {

    public DxDatabaseItemReader(DxJobContext jobCntx) {
        synchronized (this) {
            System.err.print("Into DxDatabaseItemReader");
            String[] dsParam = jobCntx.srcDbName.split("#");
            this.setDataSource(createOracleDataSource(dsParam[1], dsParam[2], dsParam[3]));
            this.setSql(createFetchQuery(jobCntx.srcFieldNames, jobCntx.srcTableName)); // e.g. "SELECT SOEID, FST_NAM, LST_NAM FROM REF_PRSNL_MSTR"
            this.setFetchSize(0);
            this.setRowMapper((ResultSet rs, int rowNum) -> {
                // Copy every configured source column into a name -> value map
                Map<String, String> fieldNameMap = new HashMap<>();
                for (int i = 0; i < jobCntx.srcFieldNames.size(); i++) {
                    fieldNameMap.put(getSrcFieldName(jobCntx.srcFieldNames, i), rs.getString(jobCntx.srcFieldNames.get(i)));
                }
                DxDataRead dataRead = new DxDataRead();
                dataRead.setFieldNameMap(fieldNameMap);
                return dataRead;
            });
        }
    }
}
Expected: the reader should read all of the rows and the writer should write all of them.
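One way to narrow down where the rows go missing (this is not part of the original post, and the class name is purely illustrative) is to attach a StepExecutionListener to the step and log all of the step's counters once it completes, so readCount, filterCount, skipCount and writeCount can be compared against the source row count:

    import org.springframework.batch.core.ExitStatus;
    import org.springframework.batch.core.StepExecution;
    import org.springframework.batch.core.StepExecutionListener;

    // Illustrative diagnostic listener: register it on the step that uses DxDatabaseItemReader.
    public class CountLoggingListener implements StepExecutionListener {

        @Override
        public void beforeStep(StepExecution stepExecution) {
            // nothing to do before the step
        }

        @Override
        public ExitStatus afterStep(StepExecution stepExecution) {
            System.err.println("readCount=" + stepExecution.getReadCount()
                    + ", filterCount=" + stepExecution.getFilterCount()
                    + ", writeCount=" + stepExecution.getWriteCount()
                    + ", skipCount=" + stepExecution.getSkipCount()
                    + ", rollbackCount=" + stepExecution.getRollbackCount());
            return stepExecution.getExitStatus();
        }
    }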

Related

Kafka Streams exactly-once re-balance aggregation state data loss

Running 3 Kafka Streams instances with exactly-once processing, but experiencing data loss when restarting one of the Streams instances (while the other 2 re-balance).
If I restart the instance quickly (within session.timeout.ms), so the other 2 do not re-balance, everything works as expected.
Input and output topics are created with 6 partitions.
Running 3 Kafka brokers.
Producing data with a single Python producer in a loop (acks='all').
Outputting data to SQL with a single Kafka Connect configured with consumer.override.isolation.level=read_committed
I am expecting the aggregated data to have the same count as the output of my Python loop, and this works just fine as long as Kafka Streams does not go into a re-balance.
In short, the Streams instance does the following:
Collect session data and update a session state.
Delta updates on the session state are then re-partitioned and summed using a windowed aggregation.
Grepping through my own debug output, I'm inclined to believe the problem is related to transferring the aggregation state:
Record A, which is an update to session X, adds 0 to the aggregation.
Output from the aggregation is now 6.
Record B, which is an update to session X, adds 1 to the aggregation.
Output from the aggregation is now 7.
Rebalance.
An update to session X (which may or may not be a replay of Record A) adds 0 to the aggregation.
Output from the aggregation is now 6.
Simplified and stripped-down version of the code (not really a Java developer, so sorry for the non-optimal syntax):
public static void main(String[] args) throws Exception {
    // props and builder are set up as usual (application id, bootstrap servers, serdes, ...)
    props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
    props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 2);
    props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);

    final StoreBuilder<KeyValueStore<MediaKey, SessionState>> storeBuilder = Stores.keyValueStoreBuilder(
            Stores.persistentKeyValueStore(SESSION_STATE_STORE),
            mediaKeySerde,
            sessionStateSerde
    );
    builder.addStateStore(storeBuilder);

    KStream<String, IncomingData> incomingData = builder.stream(
            SESSION_TOPIC, Consumed.with(Serdes.String(), mediaDataSerde));

    KGroupedStream<MediaKey, AggregatedData> mediaData = incomingData
            .transform(new SessionProcessingSupplier(SESSION_STATE_STORE), SESSION_STATE_STORE)
            .selectKey(...)
            .groupByKey(...);

    KTable<Windowed<MediaKey>, AggregatedData> aggregatedMedia = mediaData
            .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
            .aggregate(
                    new Initializer<AggregatedData>() {...},
                    new Aggregator<MediaKey, AggregatedData, AggregatedData>() {
                        @Override
                        public AggregatedData apply(MediaKey key, AggregatedData input, AggregatedData aggregated) {
                            // ... Add stuff to "aggregated"
                            return aggregated;
                        }
                    },
                    Materialized.<MediaKey, AggregatedData, WindowStore<Bytes, byte[]>>as("aggregated-media")
                            .withValueSerde(aggregatedDataSerde)
            );

    aggregatedMedia.toStream()
            .map(new KeyValueMapper<Windowed<MediaKey>, AggregatedData, KeyValue<MediaKey, PostgresOutput>>() {
                @Override
                public KeyValue<MediaKey, PostgresOutput> apply(Windowed<MediaKey> mediaidKey, AggregatedData data) {
                    // ... Some re-formatting and then
                    return new KeyValue<>(mediaidKey.key(), output);
                }
            })
            .to(POSTGRES_TOPIC, Produced.with(mediaKeySerde, postgresSerde));

    final Topology topology = builder.build();
    final KafkaStreams streams = new KafkaStreams(topology, props);
    // Shutdown hook
}
and:
public class SessionProcessingSupplier implements TransformerSupplier<String, Processing.IncomingData, KeyValue<String, Processing.AggregatedData>> {

    private final String sessionStateStoreName;

    public SessionProcessingSupplier(String sessionStateStoreName) {
        this.sessionStateStoreName = sessionStateStoreName;
    }

    @Override
    public Transformer<String, Processing.IncomingData, KeyValue<String, Processing.AggregatedData>> get() {
        return new Transformer<String, Processing.IncomingData, KeyValue<String, Processing.AggregatedData>>() {

            private ProcessorContext context;
            private KeyValueStore<String, Processing.SessionState> stateStore;

            @Override
            public void init(ProcessorContext processorContext) {
                this.context = processorContext;
                this.stateStore = (KeyValueStore<String, Processing.SessionState>) context.getStateStore(sessionStateStoreName);
            }

            @Override
            public KeyValue<String, Processing.AggregatedData> transform(String sessionid, Processing.IncomingData data) {
                Processing.SessionState state = this.stateStore.get(sessionid);
                // ... Update or create session state
                return new KeyValue<String, Processing.AggregatedData>(sessionid, output);
            }

            @Override
            public void close() {
                // nothing to clean up
            }
        };
    }
}
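The "// Shutdown hook" placeholder at the end of main() presumably stands for the usual Kafka Streams start/close boilerplate; a minimal sketch of that standard pattern (not the poster's actual code) would be:

    // At the end of main(), after building `streams` (sketch only, not the poster's code):
    final CountDownLatch latch = new CountDownLatch(1);   // java.util.concurrent.CountDownLatch
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
        streams.close();      // close cleanly so in-flight EOS transactions are completed or aborted
        latch.countDown();
    }));
    streams.start();
    latch.await();            // keep the main thread alive until shutdown is requested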

How to repeat Job with Partitioner when data is dynamic with Spring Batch?

I am trying to develop a batch process using Spring Batch + Spring Boot (Java config), but I have a problem doing so. I have a piece of software that has a database and a Java API, and I read records from there. The batch process should retrieve all the documents whose expiration date is earlier than a certain date, update the date, and save them again in the same database.
My first approach was reading the records 100 by 100: the ItemReader retrieves 100 records, I process them one by one, and finally I write them again. In the reader, I put this code:
public class DocumentItemReader implements ItemReader<Document> {

    public List<Document> documents = new ArrayList<>();

    @Override
    public Document read() throws Exception, UnexpectedInputException, ParseException, NonTransientResourceException {
        if (documents.isEmpty()) {
            getDocuments(); // This method retrieves 100 documents and stores them in the "documents" list.
            if (documents.isEmpty()) return null;
        }
        Document doc = documents.get(0);
        documents.remove(0);
        return doc;
    }
}
So, with this code, the reader reads from the database until no records are found. When the getDocuments() method doesn't retrieve any documents, the list is empty and the reader returns null (so the job finishes). Everything worked fine here.
However, the problem appears if I want to use several threads. In this case, I started using the Partitioner approach instead of multi-threading. The reason for doing that is that I read from the same database, so if I repeat the full step with several threads, all of them will find the same records, and I cannot use pagination (see below).
Another problem is that database records are updated dynamically, so I cannot use pagination. For example, let's suppose I have 200 records, and all of them are going to expire soon, so the process is going to retrieve them. Now imagine I retrieve 10 with one thread, and before anything else, that thread processes one and updates it in the same database. The next thread cannot retrieve records 11 to 20, as the first record is not going to appear in the search (it has been processed, its date has been updated, and it no longer matches the query).
It is a little difficult to understand, and some things may sound strange, but in my project:
I am forced to use the same database to read and write.
I can have millions of documents, so I cannot read all the records at the same time. I need to read them 100 by 100, or 500 by 500.
I need to use several threads.
I cannot use pagination, as the query to the database will retrieve different documents each time it is executed.
So, after hours of thinking, I think the only possible solution is to repeat the job until the query retrieves no documents. Is this possible? I want to do something like the step does: do something until null is returned - repeat the job until the query returns zero records.
If this is not a good approach, I will appreciate other possible solutions.
Thank you.
Maybe you can add a partitioner to your step that will:
Select all the ids of the data that needs to be updated (and other columns if needed).
Split them into x partitions (x = the gridSize parameter) and write them to temporary files (one per partition).
Register the file name to read in the executionContext.
Then your reader is no longer reading from the database but from the partitioned file (see the reader sketch after the example below).
It seems complicated but it's not that much work. Here is an example which handles millions of records using a JDBC query; it can easily be transposed to your use case:
public class JdbcToFilePartitioner implements Partitioner {

    /** number of records by database fetch */
    private int fetchSize = 100;

    /** working directory */
    private File tmpDir;

    /** limit the number of items to select */
    private Long nbItemMax;

    /** extra parameters to copy into each partition context (set elsewhere) */
    private Map<String, Object> contextParameters;

    /** Hibernate session used to run the scrollable query (set elsewhere) */
    private Session session;

    @Override
    public Map<String, ExecutionContext> partition(final int gridSize) {
        // Create contexts for each partition
        Map<String, ExecutionContext> executionsContexte = createExecutionsContext(gridSize);
        // Fill the partitions with the ids to handle
        getIdsAndFillPartitionFiles(executionsContexte);
        return executionsContexte;
    }

    /**
     * @param gridSize number of partitions
     * @return map of execution contexts, one for each partition
     */
    private Map<String, ExecutionContext> createExecutionsContext(final int gridSize) {
        final Map<String, ExecutionContext> map = new HashMap<>();
        for (int partitionId = 0; partitionId < gridSize; partitionId++) {
            map.put(String.valueOf(partitionId), createContext(partitionId));
        }
        return map;
    }

    /**
     * @param partitionId id of the partition to create the context for
     * @return created executionContext
     */
    private ExecutionContext createContext(final int partitionId) {
        final ExecutionContext context = new ExecutionContext();
        String fileName = tmpDir + File.separator + "partition_" + partitionId + ".txt";
        context.put(PartitionerConstantes.ID_GRID.getCode(), partitionId);
        context.put(PartitionerConstantes.FILE_NAME.getCode(), fileName);
        if (contextParameters != null) {
            for (Entry<String, Object> entry : contextParameters.entrySet()) {
                context.put(entry.getKey(), entry.getValue());
            }
        }
        return context;
    }

    private void getIdsAndFillPartitionFiles(final Map<String, ExecutionContext> executionsContexte) {
        List<BufferedWriter> fileWriters = new ArrayList<>();
        try {
            // One BufferedWriter for each partition
            for (int i = 0; i < executionsContexte.size(); i++) {
                BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter(executionsContexte.get(String.valueOf(i)).getString(
                        PartitionerConstantes.FILE_NAME.getCode())));
                fileWriters.add(bufferedWriter);
            }
            // Fetching the data
            ScrollableResults results = runQuery();
            // Get the results and fill the files, round-robin over the partitions
            int currentPartition = 0;
            int nbWriting = 0;
            while (results.next()) {
                fileWriters.get(currentPartition).write(results.get(0).toString());
                fileWriters.get(currentPartition).newLine();
                currentPartition++;
                nbWriting++;
                // If we have already written to all partitions, we start again
                if (currentPartition >= executionsContexte.size()) {
                    currentPartition = 0;
                }
                // If we reach the max number of items to read, we stop
                if (nbItemMax != null && nbItemMax != 0 && nbWriting >= nbItemMax) {
                    break;
                }
            }
            // closing
            results.close();
            session.close();
            for (BufferedWriter bufferedWriter : fileWriters) {
                bufferedWriter.close();
            }
        } catch (IOException | SQLException e) {
            throw new UnexpectedJobExecutionException("Error writing partition file", e);
        }
    }

    private ScrollableResults runQuery() {
        ...
    }
}
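To complete the picture (this part is not in the original answer): each partition's step can then use a step-scoped reader that picks the partition file name up from the step execution context, for example a FlatFileItemReader. The 'FILE_NAME' key below is an assumption standing in for whatever PartitionerConstantes.FILE_NAME.getCode() returns:

    // Declared in a @Configuration class; reads the ids written by the partitioner above.
    @Bean
    @StepScope
    public FlatFileItemReader<String> partitionFileReader(
            @Value("#{stepExecutionContext['FILE_NAME']}") String fileName) {
        return new FlatFileItemReaderBuilder<String>()
                .name("partitionFileReader")
                .resource(new FileSystemResource(fileName))
                .lineMapper(new PassThroughLineMapper())
                .build();
    }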

Spring Batch: how to skip records or not consider them for writing?

I already went through many links like Spring Batch - Skip Record On Process, and I'm simply looking to validate the records in the processor before writing them to MongoDB.
I have 500 records in the Oracle DB, and on the 162nd record the condition at Line-1 below is satisfied; after that, no further records are considered for writing. Instead of 500 records I expect to get 480, because I want to skip the 20 records whose EFFECTIVE_DATE is null, which I don't want to consider for writing.
public class StudentRowMapper implements RowMapper<Student> {

    @Override
    public Student mapRow(ResultSet rs, int rowNum) throws SQLException {
        if (rs.getString("EFFECTIVE_DATE") == null) { // Line-1
            return null;
        }
        else {
            Student student = new Student();
            student.setRowIdObject(rs.getInt("PK_ID"));
            .............
            .............
            .............
            .............
            return student;
        }
    }
}
Agreed with @Mahmoud, you can also:
Add this filter on the query of your MongoDB reader: "{ EFFECTIVE_DATE: null }"
Return null in your processor (a minimal sketch follows below).
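For the "return null in your processor" option, here is a minimal sketch (the getter name on Student is an assumption): returning null from an ItemProcessor filters the item, so it never reaches the writer.

    import org.springframework.batch.item.ItemProcessor;

    public class EffectiveDateFilterProcessor implements ItemProcessor<Student, Student> {

        @Override
        public Student process(Student student) {
            // getEffectiveDate() is assumed; adapt to the real field/getter.
            if (student.getEffectiveDate() == null) {
                return null; // filtered out, counted in the step's filterCount
            }
            return student;
        }
    }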
simply looking to validate the records in the processor before writing it to the MongoDB.
ValidatingItemProcessor is what you are looking for. It allows you to validate items and skip them or filter them (see the filter parameter) before passing them to the writer.
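A minimal sketch of that approach, assuming a Student item type and an EFFECTIVE_DATE getter (both names are taken from the question and are assumptions here):

    // Bean definition sketch: the Validator throws ValidationException for null dates,
    // and setFilter(true) makes such items be filtered instead of failing the step.
    @Bean
    public ValidatingItemProcessor<Student> studentValidatingProcessor() {
        ValidatingItemProcessor<Student> processor = new ValidatingItemProcessor<>(item -> {
            if (item.getEffectiveDate() == null) { // getter name assumed
                throw new ValidationException("EFFECTIVE_DATE is null");
            }
        });
        processor.setFilter(true);
        return processor;
    }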

Hibernate queries getting slower and slower

I'm working on a process that checks and updates data from an Oracle database. I'm using Hibernate and the Spring Framework in my application.
The application reads a CSV file, processes the content, then persists the entities:
public class Main {
    public static void main(String[] args) {
        Input input = ReadCSV(path);
        EntityList resultList = Process.process(input);
        WriteResult.write(resultList);
        ...
    }
}

// Process class that loops over the input
public class Process {

    public EntityList process(Input input) {
        EntityList results = ...;
        ...
        for (Line line : input.readLine()) {
            results.add(ProcessLine.process(line));
            ...
        }
        return results;
    }
}

// retrieving and updating entities
class ProcessLine {

    @Autowired
    DomaineRepository domaineRepository;

    @Autowired
    CompanyDomaineService companydomaineService;

    @Transactional
    public MyEntity process(Line line) {
        // getCompanyByXX is a CrudRepository method with @Query that returns an entity object
        MyEntity companyToAttach = domaineRepository.getCompanyByCode(line.getCode());
        MyEntity companyToDetach = domaineRepository.getCompanyBySiret(line.getSiret());
        if (companyToDetach == null || companyToAttach == null) {
            throw new CustomException("Custom Exception");
        }
        // attachCompany retrieves some entity relationEntity, then removes companyToDetach and adds companyToAttach.
        // This updates the relationEntity.company attribute.
        companydomaineService.attachCompany(companyToAttach, companyToDetach);
        return companyToAttach;
    }
}

public class WriteResult {

    @Autowired
    DomaineRepository domaineRepository;

    @Transactional
    public void write(EntityList results) {
        for (MyEntity result : results) {
            domaineRepository.save(result);
        }
    }
}
The application works well on files with few lines, but when I try to process large files (200,000 lines), performance slows down drastically and I get an SQL timeout.
I suspect cache issues, but I'm wondering whether saving all the entities at the end of the processing is a bad practice.
The problem is your for loop, which saves each result individually and thus does single inserts, slowing things down. Hibernate and Spring support batch inserts, and they should be used whenever possible.
Something like domaineRepository.saveAll(results).
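A rough sketch of that idea, assuming DomaineRepository is a Spring Data JpaRepository, that Hibernate JDBC batching is enabled (e.g. spring.jpa.properties.hibernate.jdbc.batch_size=50), and that an EntityManager is injected; the chunk size is arbitrary:

    @Transactional
    public void write(List<MyEntity> results) {
        int chunkSize = 500; // arbitrary
        for (int i = 0; i < results.size(); i += chunkSize) {
            List<MyEntity> chunk = results.subList(i, Math.min(i + chunkSize, results.size()));
            domaineRepository.saveAll(chunk);  // batched inserts/updates, given hibernate.jdbc.batch_size is set
            domaineRepository.flush();         // push pending statements (JpaRepository)
            entityManager.clear();             // detach saved entities so the persistence context stays small (assumes an injected EntityManager)
        }
    }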
Since you are processing a lot of data, it might be better to do things in batches: instead of getting one company to attach, get a list of companies to attach and process those, then get a list of companies to detach and process those.
public EntityList process(Input input) {
    EntityList results;
    List<Code> companiesToAdd = new ArrayList<>();
    List<Siret> companiesToRemove = new ArrayList<>();
    for (Line line : input.readLine()) {
        companiesToAdd.add(line.getCode());
        companiesToRemove.add(line.getSiret());
        ...
    }
    results = process(companiesToAdd, companiesToRemove);
    return results;
}

public List<MyEntity> process(List<Code> companiesToAdd, List<Siret> companiesToRemove) {
    List<MyEntity> attachList = domaineRepository.getCompanyByCodeIn(companiesToAdd);
    List<MyEntity> detachList = domaineRepository.getCompanyBySiretIn(companiesToRemove);
    if (attachList.isEmpty() || detachList.isEmpty()) {
        throw new CustomException("Custom Exception");
    }
    companydomaineService.attachCompany(attachList, detachList);
    return attachList;
}
The above code is just pseudocode to point you in the right direction; you will need to work out what works for you.
For every line you read, you are doing 2 read operations here:
MyEntity companyToAttach = domaineRepository.getCompanyByCode(line.getCode());
MyEntity companyToDetach = domaineRepository.getCompanyBySiret(line.getSiret());
You can read more than one line, use an IN query, and then process that list of companies (a repository sketch follows below).
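For reference, the getCompanyByCodeIn / getCompanyBySiretIn calls above could be Spring Data derived IN queries along these lines (this is only a sketch; the id type and the property types and names on MyEntity are assumptions):

    public interface DomaineRepository extends JpaRepository<MyEntity, Long> {
        // Derived queries generating "... WHERE code IN (:codes)" / "... WHERE siret IN (:sirets)"
        List<MyEntity> getCompanyByCodeIn(List<Code> codes);
        List<MyEntity> getCompanyBySiretIn(List<Siret> sirets);
    }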

Deleting a List<Object> using HibernateTemplate is giving an exception

I am trying to fetch a List from the FlexiBooking table, add its contents to another list, i.e. move them to another table, and then delete the entries fetched from the first table. I have used a HibernateTemplate object, since the project is built with the Spring Framework. But I am getting an exception saying that I am trying to delete a detached object. What is the problem?
Below is my code:
@Override
public void moveToNormalBooking(User user, int no_of_seats) {
    String queryString = "FROM FlexiBooking";
    HibernateTemplate hibernateTemplate = getHibernateTemplate();
    hibernateTemplate.setMaxResults(no_of_seats);
    List<FlexiBooking> flexiBookingsTobeMoved = hibernateTemplate.find(queryString);
    List<FlightBooking> flightBookings = new ArrayList<FlightBooking>();
    int i = 0;
    while (i < flexiBookingsTobeMoved.size()) {
        FlightBooking flightbooking = new FlightBooking();
        flightbooking.setCostPerTicket(flexiBookingsTobeMoved.get(i).getTotalCost());
        flightbooking.setDateOfJourney(flexiBookingsTobeMoved.get(i).getScheduledFlight().getScheduledFlightDate());
        Booking booking = new Booking();
        booking.setBooker(flexiBookingsTobeMoved.get(i).getUser());
        booking.setBookingDate(flexiBookingsTobeMoved.get(i).getBookingDate());
        booking.setBookingReferenceNo(flexiBookingsTobeMoved.get(i).getBookingReferenceNumber());
        booking.setCancelled(false);
        flightbooking.setBooking(booking);
        flightbooking.setFlightRoute(flexiBookingsTobeMoved.get(i).getScheduledFlight());
        flightbooking.setCouponCode(flexiBookingsTobeMoved.get(i).getCouponCode());
        flightBookings.add(flightbooking);
        i++;
    }
    hibernateTemplate.saveOrUpdateAll(flightBookings);
    // hibernateTemplate.update(flexiBookingsTobeMoved);
    hibernateTemplate.deleteAll(flexiBookingsTobeMoved);
}