How do I update one column of all rows in a large table in my Spring Boot application?

I have a Spring Boot 2.x project with a big table in my Cassandra database. In my Liquibase migration class, I need to replace a value in one column across all rows.
It is a big performance hit for me when I try to solve this with
SELECT * FROM BOOKING
forEach Row
Update Row
because of the total number of rows, even when I select only one column.
Is it possible to do something like a "partwise/pagination" loop?
Pseudocode:
Take first 1000 rows
do Update
Take next 1000 rows
do Update
loop.
I'm also happy about any other solution approaches you have.

Must know:
Make sure there is a way to group the updates by partition. If you try a batch update on 1000 rows that are not in the same partition, the coordinator of the request will suffer: you move the load from your client to the coordinator, when you want to parallelize the writes instead. A batch in Cassandra has nothing to do with a batch in a relational database (a minimal sketch follows these notes).
For fine-grained operations like this you want to go back to the drivers, with CassandraOperations and CqlSession, for maximum control.
There is a way to paginate with Spring Data Cassandra using Slice, but you do not have control over how the operations are implemented.
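To make the partition point concrete, here is a minimal, purely illustrative sketch, assuming driver 4.x, a CqlSession named session, and a hypothetical table keyed by a partition key pk and a clustering column ck; every statement in the batch targets the same partition:
// Hedged sketch: group the updates of one partition into a single UNLOGGED batch;
// statements spanning many partitions would push the work onto the coordinator instead.
PreparedStatement psByPartition = session.prepare(
        "UPDATE mytable SET myColumn = ? WHERE pk = ? AND ck = ?"); // hypothetical schema
BatchStatementBuilder batch = BatchStatement.builder(DefaultBatchType.UNLOGGED);
for (String ck : clusteringKeysOfOnePartition) { // hypothetical: rows sharing one partition key
    batch.addStatement(psByPartition.bind(newValue, partitionKey, ck)); // newValue is hypothetical
}
session.execute(batch.build());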
Spring Data Cassandra core
// Page through the table slice by slice (size, page and currPage are up to you)
Slice<MyEntity> slice = myEntityRepo.findAll(CassandraPageRequest.first(size));
while (slice.hasNext() && currPage < page) {
    slice = myEntityRepo.findAll(slice.nextPageable());
    currPage++;
}
slice.getContent();
Drivers:
// Prepare statements once to speed up the queries
PreparedStatement selectPS = session.prepare(QueryBuilder
        .selectFrom("myEntity").all()
        .build()
        .setPageSize(1000)                    // 1000 rows per page
        .setTimeout(Duration.ofSeconds(10))); // 10s timeout
PreparedStatement updatePS = session.prepare(QueryBuilder
        .update("mytable")
        .setColumn("myColumn", QueryBuilder.bindMarker())
        .whereColumn("myPK").isEqualTo(QueryBuilder.bindMarker())
        .build()
        .setConsistencyLevel(ConsistencyLevel.ONE)); // fast writes
// Paginate
ResultSet page1 = session.execute(selectPS.bind());
Iterator<Row> page1Iter = page1.iterator();
while (page1.getAvailableWithoutFetching() > 0) {
    Row row = page1Iter.next();
    session.executeAsync(updatePS.bind(...)); // bind the new value and the row's primary key
}
// Resume the select where the first page stopped
ByteBuffer pagingStateAsBytes =
        page1.getExecutionInfo().getPagingState();
ResultSet page2 = session.execute(selectPS.bind().setPagingState(pagingStateAsBytes));
You could of course include this pagination in a loop and track progress.
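For completeness, a minimal sketch of that loop, assuming the prepared statements above, a CqlSession named session, a hypothetical newValue to write into myColumn, and a text primary key:
// Hedged sketch: drive the select page by page via the paging state, fire the
// updates asynchronously, and stop when the server reports no further pages.
ByteBuffer pagingState = null;
long updatedRows = 0;
do {
    BoundStatement select = selectPS.bind();
    if (pagingState != null) {
        select = select.setPagingState(pagingState); // resume after the previous page
    }
    ResultSet rs = session.execute(select);
    Iterator<Row> it = rs.iterator();
    while (rs.getAvailableWithoutFetching() > 0) {   // stay within the current page
        Row row = it.next();
        session.executeAsync(updatePS.bind(newValue, row.getString("myPK"))); // hypothetical value and key
        updatedRows++;
    }
    pagingState = rs.getExecutionInfo().getPagingState();
    System.out.println("Updated so far: " + updatedRows); // simple progress tracking
} while (pagingState != null);
In a real migration you would also want to collect and await the futures returned by executeAsync, or throttle them, so the cluster is not flooded with in-flight writes.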

Related

Pagination with duplicate records

I need to apply pagination in a Spring Boot project.
I apply pagination in 2 queries. Each of them gives me data from a different table. Now, some of these records are identical in the two tables and hence need to be removed.
In the end, the number of entries that I need to send will be reduced, thereby ruining the initial pagination applied. How do I go about this? What should be my approach?
Here I take 2 lists from 2 JPA calls (highRiskCust and amlPositiveCust) that each apply pagination, then remove the duplicates and return the final result (tempReport).
List<L1ComplianceResponseDTO> highRiskCust =
        customerKyc.findAllHighRiskL1Returned(startDateTime, endDateTime, agentIds);
List<L1ComplianceResponseDTO> amlPositiveCust =
        customerKyc.findAllAmlPositiveL1Returned(startDateTime, endDateTime, agentIds);

List<L1ComplianceResponseDTO> tempReport = new ArrayList<>();
tempReport.addAll(amlPositiveCust);
tempReport.addAll(highRiskCust);
// distinctByKey is a custom predicate that keeps the first element per key
tempReport = tempReport.stream()
        .filter(distinctByKey(p -> p.getKycTicketId()))
        .collect(Collectors.toList());
In order to have pagination work correctly, you need to do it in a single query.
Fundamentally, this query should use UNION.
Since JPA does not support UNION, either you write a native query or you change your query logic to use outer joins (a hedged sketch of the native-query route follows).
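Purely as an illustration, here is a minimal sketch of the native-query route, assuming hypothetical table names (high_risk_customers, aml_positive_customers) with identical columns, a shared kyc_ticket_id column, and a hypothetical L1ComplianceEntity mapped to that shape; the point is that UNION removes the duplicates on the database side before Spring Data cuts the page:
import java.time.LocalDateTime;

import org.springframework.data.domain.Page;
import org.springframework.data.domain.Pageable;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

public interface ComplianceReportRepository extends JpaRepository<L1ComplianceEntity, Long> {

    // Hypothetical tables and columns; UNION (unlike UNION ALL) drops rows that
    // appear in both sources, so the page size stays correct.
    @Query(value = "SELECT * FROM high_risk_customers WHERE returned_at BETWEEN :start AND :end "
                 + "UNION "
                 + "SELECT * FROM aml_positive_customers WHERE returned_at BETWEEN :start AND :end",
           countQuery = "SELECT COUNT(*) FROM ("
                      + "SELECT kyc_ticket_id FROM high_risk_customers WHERE returned_at BETWEEN :start AND :end "
                      + "UNION "
                      + "SELECT kyc_ticket_id FROM aml_positive_customers WHERE returned_at BETWEEN :start AND :end) u",
           nativeQuery = true)
    Page<L1ComplianceEntity> findCombined(@Param("start") LocalDateTime start,
                                          @Param("end") LocalDateTime end,
                                          Pageable pageable);
}
Spring Data applies the Pageable as LIMIT/OFFSET around the native query and uses the countQuery for the total, so the page numbers stay consistent even though duplicates are gone.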

Getting max value on server (Entity Framework)

I'm using EF Core but I'm not really an expert with it, especially when it comes to details like querying tables in a performant manner...
So what I'm trying to do is simply get the max value of one column from a table with filtered data.
What I have so far is this:
protected override void ReadExistingDBEntry()
{
    using Model.ResultContext db = new();

    // Filter the table data down to the rows relevant to us.
    // The whole table may contain 0 rows or millions of them.
    IQueryable<Measurement> dbMeasuringsExisting = db.Measurements
        .Where(meas => meas.MeasuringInstanceGuid == Globals.MeasProgInstance.Guid
            && meas.MachineId == DBMatchingItem.Id);

    if (dbMeasuringsExisting.Any())
    {
        // The max value we're interested in. dbMeasuringsExisting could still contain millions of rows.
        iMaxMessID = dbMeasuringsExisting.Max(meas => meas.MessID);
    }
}
The equivalent SQL to what I want would be something like this.
select max(MessID)
from Measurement
where MeasuringInstanceGuid = Globals.MeasProgInstance.Guid
and MachineId = DBMatchingItem.Id;
While the above code works (it returns the correct value), I think it has a performance issue when the database table is getting larger, because the max filtering is done at the client-side after all rows are transferred, or am I wrong here?
How to do it better? I want the database server to filter my data. Of course I don't want any SQL script ;-)
This can be addressed by typing the return as nullable, so that an empty result does not throw, and then applying a default value for the int. Alternatively, you can just assign it to a nullable int. Note the assumption here that the ID is an integer; the same principle would apply to a Guid as well.
int maxMessID = dbMeasuringsExisting.Max(p => (int?)p.MessID) ?? 0;
There is no need for the Any() check, as that causes an additional round trip to the database, which is not desirable in this case. The Where filter and the Max are both translated into SQL, so the aggregation runs on the server rather than on the client.

ADO.NET - Data Adapter Fill Method - Fill Dataset with rows modified in SQL

I am using ADO.NET with a DataAdapter to fill a DataSet in my .NET Core 3.1 project.
The first run of the Fill method occurs when my program starts, so I have an in-memory cache to use in my business/program logic. When I then make any changes to the tables using EF Core, once the changes have been saved I run the DataAdapter Fill method again to re-populate the DataSet with the updates from the tables that were modified in SQL through EF Core.
Having read various docs for a number of days now, what I'm unclear about is whether the DataAdapter Fill method overwrites all of the existing table rows in the DataSet each time the method is called. I.e. if I'm loading a DataSet with a table from SQL that has 10k rows, is it going to overwrite all 10k rows that exist in the DataSet, even if 99% of the rows have not changed?
The reason I am going down the DataSet route is that I want to keep an in-memory cache of the various tables from SQL so I can query the data as fast as possible without issuing queries against SQL all the time.
The solution I want is something along the lines of the DataAdapter Fill method, but I don't want the DataSet to be overwritten for any rows that have not been modified in SQL since the last run.
Is this how things work already, or do I have to look for another solution?
Below is just an example of the adapter Fill method.
public async Task<AdoNetResult> FillAlarmsDataSet()
{
    string connectionString = _config.GetConnectionString("DefaultConnection");
    try
    {
        string cmdText1 = "SELECT * FROM [dbo].[Alarm] ORDER BY Id;" +
                          "SELECT * FROM [dbo].[AlarmApplicationRole] ORDER BY Id;";
        dataAdapter = new SqlDataAdapter(cmdText1, connectionString);

        // Create table mappings. With multiple result sets the source tables are
        // named "Table", "Table1", ... by default, so map those names.
        dataAdapter.TableMappings.Add("Table", "Alarm");
        dataAdapter.TableMappings.Add("Table1", "AlarmApplicationRole");

        alarmDataSet = new DataSet
        {
            Locale = CultureInfo.InvariantCulture
        };

        // Create and fill the DataSet
        await Task.Run(() => dataAdapter.Fill(alarmDataSet));
        return AdoNetResult.Success;
    }
    catch (Exception ex)
    {
        // Return the task with details of the exception
        return AdoNetResult.Failed(ex);
    }
}

Large Resultset with Spring Boot and QueryDSL

I have a Spring Boot application where I use QueryDSL for dynamic queries.
Now the results should be exported as a csv file.
The model is an Order which contains products. The products should be included in the csv file.
However, as there are many thousands of orders with millions of products, this should not be loaded into memory all at once.
Unfortunately, the solutions offered by Hibernate (ScrollableResults) and streams are not supported by QueryDSL.
How can this be achieved while still using QueryDSL (to avoid duplication of filtering logic)?
One workaround to this problem is to keep iterating using offset and limit.
Something like:
long limit = 100;
long offset = 0;
List<MyEntity> entities = new JPAQuery<MyEntity>(em)
        .select(QMyEntity.myEntity)
        .from(QMyEntity.myEntity)
        .limit(limit)
        .offset(offset)
        .fetch();
offset += limit;
With that approach you can fetch smaller chunks of data. It is important to analyze if the limit and offset field will work well with your query. There are situations where even if you use limit and offset you will end up making a full scan on the tables involved on the query. If that happens you will face a performance problem instead of a memory one.
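A minimal sketch of the full loop under those caveats, assuming a hypothetical writeCsv helper, the shared dynamic predicate cond, and an id property that gives the query a stable order:
// Hedged sketch: export the result in chunks so only `limit` rows are in memory
// at a time; the ORDER BY keeps offset-based pages from overlapping.
long limit = 100;
long offset = 0;
List<MyEntity> chunk;
do {
    chunk = new JPAQuery<MyEntity>(em)
            .select(QMyEntity.myEntity)
            .from(QMyEntity.myEntity)
            .where(cond)                          // the same dynamic predicate as the filters
            .orderBy(QMyEntity.myEntity.id.asc()) // assumed id property, for a stable order
            .offset(offset)
            .limit(limit)
            .fetch();
    writeCsv(chunk);                              // hypothetical CSV writer
    em.clear();                                   // drop the chunk from the persistence context
    offset += limit;
} while (chunk.size() == limit);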
Use JPAQueryFactory
// com.querydsl.jpa.impl.JPAQueryFactory
JPAQueryFactory jpaQueryFactory = new JPAQueryFactory(entityManager);

QMyEntity myEntity = QMyEntity.myEntity;

// createQuery() exposes the underlying JPA query, whose getResultStream()
// (JPA 2.2) lets the provider stream results instead of loading them all
Stream<MyEntity> stream = jpaQueryFactory
        .selectFrom(myEntity)
        .where(cond) // your dynamic predicate
        .createQuery()
        .getResultStream();
// do something with the stream, e.g. write each row to the csv
stream.close();

Spark not able to retrieve all HBase data in a specific column

My HBase table has 30 million records; each record has the column raw:sample, where raw is the column family and sample is the column qualifier. This column is very big, ranging in size from a few KB to 50 MB. When I run the following Spark code, it can only get 40 thousand records, but I should get 30 million:
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "10.1.1.15:2181")
conf.set(TableInputFormat.INPUT_TABLE, "sampleData")
conf.set(TableInputFormat.SCAN_COLUMNS, "raw:sample")
conf.set("hbase.client.keyvalue.maxsize", "0")
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
var arrRdd: RDD[Map[String, Object]] = hBaseRDD.map(tuple => tuple._2).map(...)
Right now I work around this by getting the ID list first and then iterating over it to fetch the column raw:sample with the plain HBase Java client inside a Spark foreach.
Any ideas why I cannot get all of the column raw:sample through Spark? Is it because the column is too big?
A few days ago one of my ZooKeeper nodes and one of my datanodes went down, but I fixed it soon after; since the replication factor is 3, could this be the reason? Do you think running hbck -repair would help? Thanks a lot!
Internally, TableInputFormat creates a Scan object in order to retrieve the data from HBase.
Try to create a Scan object (without using Spark), configured to retrieve the same column from HBase, see if the error repeats:
// Instantiate the configuration
Configuration config = HBaseConfiguration.create();

// Get a Table reference (HTable is deprecated; go through a Connection instead)
Connection connection = ConnectionFactory.createConnection(config);
Table table = connection.getTable(TableName.valueOf("sampleData"));

// Instantiate the Scan and restrict it to the same column as the Spark job
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("raw"), Bytes.toBytes("sample"));

// Get the scan result
ResultScanner scanner = table.getScanner(scan);

// Read values from the scan result
for (Result result = scanner.next(); result != null; result = scanner.next())
    System.out.println("Found row : " + result);

// Close the scanner and the connection
scanner.close();
connection.close();
In addition, by default, TableInputFormat is configured to request a very small chunk of data from the HBase server, which is bad and causes a large overhead. For the standalone Scan, set the following to increase the chunk size:
scan.setCacheBlocks(false);
scan.setCaching(2000);
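For the Spark path in the question, which builds its Scan from the Hadoop Configuration, the equivalent knobs can be set as TableInputFormat properties; a small sketch against the question's conf object (assuming these constants exist in your HBase version):
// Hedged sketch: the same caching settings expressed as TableInputFormat
// configuration keys, so the Scan built inside newAPIHadoopRDD picks them up
conf.set(TableInputFormat.SCAN_CACHEDROWS, "2000");   // rows fetched per scanner RPC
conf.set(TableInputFormat.SCAN_CACHEBLOCKS, "false"); // skip the block cache for a full scan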
For a high-throughput workload like yours, Apache Kafka is well suited to integrating the data flow and keeping the data pipeline alive. Please refer to http://kafka.apache.org/08/uses.html for some Kafka use cases.
One more: http://sites.computer.org/debull/A12june/pipeline.pdf
