Datastax Cassandra Java Driver 4 paging problem - spring-boot

I need to fetch results from a Cassandra table page by page. I am using Spring Boot 2.3.1, which in turn uses Cassandra Java driver 4. With previous driver versions pagination worked without issues: the driver fetched exactly as many results as the specified page size, like so:
Where select = QueryBuilder.select("column1", "column2", "column3")
.from("my_table")
.where(QueryBuilder.eq("column4", "some_value"))
.and(QueryBuilder.eq("column5", "some_value"));
select.setFetchSize(5);
if (!page.equals("0"))
select.setPagingState(PagingState.fromString(page));
ResultSet results = cassandraTemplate.getCqlOperations().queryForResultSet(select);
PagingState nextPage = results.getExecutionInfo().getPagingState();
int remaining = results.getAvailableWithoutFetching(); // gives 5 results as specified
In Java driver version 4, the method has been changed from setFetchSize(int) to setPageSize(int). But the same thing is not working here. It is fetching all the results even after specifying the size:
SimpleStatement stmt = QueryBuilder.selectFrom("my_keyspace", "my_table")
.columns(Arrays.asList("column1", "column2", "column3"))
.where(Relation.column("column4")
.isEqualTo(QueryBuilder.literal("some_value")),
Relation.column("column5")
.isEqualTo(QueryBuilder.literal("some_value"))).build();
stmt.setPageSize(5);
if (!page.equals("0"))
stmt.setPagingState(Bytes.fromHexString(page));
ResultSet results = cassandraTemplate.getCqlOperations().queryForResultSet(stmt);
ByteBuffer nextPage = results.getExecutionInfo().getPagingState();
int remaining = results.getAvailableWithoutFetching(); // gives all the results, even though size is specified as 5
Am I doing something wrong? If I'm not then what should be the solution to this problem?

Statements in driver 4 are immutable. You need to change the following lines:
stmt = stmt.setPageSize(5);
if (!page.equals("0"))
stmt = stmt.setPagingState(Bytes.fromHexString(page));
In other words, each "mutating" method returns a new instance and leaves the original untouched, so you need to capture the result by reassigning the stmt variable each time.
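To make the reassignment point concrete, here is a minimal sketch that uses only the JDK. Stmt is a hypothetical stand-in written for this illustration, not the driver's SimpleStatement; it only mimics the "setter returns a modified copy" behaviour described above.

```java
// Illustration only: Stmt is a hypothetical stand-in, not the driver's SimpleStatement.
class ImmutableStatementDemo {

    static final class Stmt {
        private final String query;
        private final int pageSize;

        Stmt(String query, int pageSize) {
            this.query = query;
            this.pageSize = pageSize;
        }

        // Returns a modified copy; the receiver is left untouched.
        Stmt setPageSize(int newPageSize) {
            return new Stmt(query, newPageSize);
        }

        int getPageSize() {
            return pageSize;
        }
    }

    // Page size seen when the setter's return value is thrown away.
    static int pageSizeWhenResultDiscarded(int initial, int requested) {
        Stmt stmt = new Stmt("SELECT ...", initial);
        stmt.setPageSize(requested); // the new instance is discarded
        return stmt.getPageSize();   // still the initial value
    }

    // Page size seen when the variable is reassigned.
    static int pageSizeWhenResultCaptured(int initial, int requested) {
        Stmt stmt = new Stmt("SELECT ...", initial);
        stmt = stmt.setPageSize(requested);
        return stmt.getPageSize();
    }

    public static void main(String[] args) {
        System.out.println(pageSizeWhenResultDiscarded(100, 5)); // prints 100
        System.out.println(pageSizeWhenResultCaptured(100, 5));  // prints 5
    }
}
```

The first call is the pattern from the question: the page size silently stays at its previous value, which would explain why all results came back.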

Related

How do I update one column of all rows in a large table in my Spring Boot application?

I have a Spring Boot 2.x project with a big Table in my Cassandra Database. In my Liquibase Migration Class, I need to replace a value from one column in all rows.
For me it's a big performance hit when I try to solve this with
SELECT * FROM BOOKING
forEach Row
Update Row
because of the total number of rows, even when I select only one column.
Is it possible to do something like a chunked/paginated loop?
Pseudocode:
Take first 1000 rows
do Update
Take next 1000 rows
do Update
loop.
I'm also happy to hear about any other approaches.
Things to know:
Make sure there is a way to group the updates by partition. If you run a batch update on 1000 rows that are not in the same partition, the coordinator of the request will suffer: you are moving the load from your client to the coordinator, when you want to parallelize the writes instead. A batch update in Cassandra has nothing to do with the one in relational databases.
For fine-grained operations like this, go back to using the driver directly, with CassandraOperations and CqlSession, for maximum control.
There is a way to paginate with Spring Data Cassandra using Slice, but you do not control how the operations are implemented.
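As a rough illustration of the first point, the sketch below groups row references by partition key with plain JDK collections, so that each group could then be written on its own. RowRef, the key names and the sample data are assumptions made for this example; no driver call is involved.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PartitionGroupingDemo {

    // Hypothetical row reference: partition key plus the rest of the primary key.
    static final class RowRef {
        final String partitionKey;
        final int clusteringId;

        RowRef(String partitionKey, int clusteringId) {
            this.partitionKey = partitionKey;
            this.clusteringId = clusteringId;
        }
    }

    // Groups rows by partition key, preserving first-seen order of the keys.
    static Map<String, List<RowRef>> groupByPartition(List<RowRef> rows) {
        Map<String, List<RowRef>> groups = new LinkedHashMap<>();
        for (RowRef row : rows) {
            groups.computeIfAbsent(row.partitionKey, key -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    // Made-up sample data: two partitions, three rows.
    static List<RowRef> sampleRows() {
        List<RowRef> rows = new ArrayList<>();
        rows.add(new RowRef("hotel-a", 1));
        rows.add(new RowRef("hotel-b", 1));
        rows.add(new RowRef("hotel-a", 2));
        return rows;
    }

    public static void main(String[] args) {
        Map<String, List<RowRef>> groups = groupByPartition(sampleRows());
        // prints "hotel-a -> 2 row(s)" then "hotel-b -> 1 row(s)"
        groups.forEach((key, rows) -> System.out.println(key + " -> " + rows.size() + " row(s)"));
    }
}
```

Each group then touches a single partition, which is the case where batching does not push extra work onto the coordinator.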
Spring Data Cassandra core
Slice<MyEntity> slice = myEntityRepo.findAll(CassandraPageRequest.first(size));
while (slice.hasNext() && currpage < page) {
    slice = myEntityRepo.findAll(slice.nextPageable());
    currpage++;
}
slice.getContent();
Drivers:
// Prepare Statements to speed up queries
PreparedStatement selectPS = session.prepare(QueryBuilder
    .selectFrom("myEntity").all()
    .build()
    .setPageSize(1000)                    // 1000 rows per page
    .setTimeout(Duration.ofSeconds(10))); // 10s timeout
PreparedStatement updatePS = session.prepare(QueryBuilder
.update("mytable")
.setColumn("myColumn", QueryBuilder.bindMarker())
.whereColumn("myPK").isEqualTo(QueryBuilder.bindMarker())
.build()
.setConsistencyLevel(ConsistencyLevel.ONE)); // Fast writes
// Paginate
ResultSet page1 = session.execute(selectPS.bind());
Iterator<Row> page1Iter = page1.iterator();
// Consume only the rows of the current page; iterating further would fetch the next page
while (0 < page1.getAvailableWithoutFetching()) {
    Row row = page1Iter.next();
    session.executeAsync(updatePS.bind(...));
}
// The paging state is set on a bound statement; the setter returns a new instance
ByteBuffer pagingStateAsBytes =
    page1.getExecutionInfo().getPagingState();
ResultSet page2 = session.execute(
    selectPS.bind().setPagingState(pagingStateAsBytes));
You could of course include this pagination in a loop and track progress.
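A minimal sketch of such a loop, using only the JDK so it can run without a cluster: fetchPage is a hypothetical in-memory stand-in for "execute the select with this paging state", and the Integer token plays the role of the driver's opaque paging state (null meaning no more pages). These names are assumptions for illustration, not driver API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class PagingLoopDemo {

    // One page of rows plus the token needed to ask for the next page.
    static final class Page {
        final List<String> rows;
        final Integer nextState; // null when there are no more pages

        Page(List<String> rows, Integer nextState) {
            this.rows = rows;
            this.nextState = nextState;
        }
    }

    // In-memory stand-in for executing the select with a paging state.
    static Page fetchPage(List<String> table, Integer state, int pageSize) {
        int from = (state == null) ? 0 : state;
        int to = Math.min(from + pageSize, table.size());
        Integer next = (to < table.size()) ? Integer.valueOf(to) : null;
        return new Page(new ArrayList<>(table.subList(from, to)), next);
    }

    // Walks all pages, applies the update to every row, returns the number of pages fetched.
    static int forEachPage(List<String> table, int pageSize, Consumer<String> update) {
        int pages = 0;
        Integer state = null;
        do {
            Page page = fetchPage(table, state, pageSize);
            page.rows.forEach(update);
            state = page.nextState; // carry the token into the next request
            pages++;
        } while (state != null);
        return pages;
    }

    static List<String> sampleTable(int rowCount) {
        List<String> table = new ArrayList<>();
        for (int i = 0; i < rowCount; i++) {
            table.add("row-" + i);
        }
        return table;
    }

    public static void main(String[] args) {
        List<String> updated = new ArrayList<>();
        int pages = forEachPage(sampleTable(2500), 1000, updated::add);
        System.out.println(pages + " pages, " + updated.size() + " rows updated"); // 3 pages, 2500 rows updated
    }
}
```

The page counter is the simplest form of progress tracking; persisting the last token instead would let an interrupted migration resume where it stopped.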

Large Resultset with Spring Boot and QueryDSL

I have a Spring Boot application where I use QueryDSL for dynamic queries.
Now the results should be exported as a csv file.
The model is an Order which contains products. The products should be included in the csv file.
However, as there are many thousand orders with millions of products this should not be loaded into memory at once.
However, solutions proposed by Hibernate (ScrollableResults) and streams are not supported by QueryDSL.
How can this be achieved while still using QueryDSL (to avoid duplication of filtering logic)?
One workaround to this problem is to keep iterating using offset and limit.
Something like:
long limit = 100;
long lastLimitUsed = 0;
List<MyEntity> entities = new JPAQuery<>(em)
.from(QMyEntity.entity)
.limit(limit)
.offset(lastLimitUsed)
.fetch();
lastLimitUsed += limit;
With that approach you can fetch smaller chunks of data. It is important to check whether limit and offset work well with your query. There are situations where, even with limit and offset, you end up doing a full scan of the tables involved in the query. If that happens, you will face a performance problem instead of a memory one.
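The snippet above fetches a single chunk; a full export repeats it until a chunk comes back short. Here is a minimal JDK-only sketch of that loop, writing each chunk out before fetching the next so that only one chunk is held in memory. The fetchChunk function is a hypothetical stand-in for the QueryDSL query with .offset(offset).limit(limit).fetch(), and the sample rows are made up.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

class ChunkedExportDemo {

    // Fetches chunk after chunk and writes one line per row; returns the number of lines written.
    static long export(BiFunction<Long, Long, List<String>> fetchChunk, long limit, Writer out)
            throws IOException {
        long offset = 0;
        long written = 0;
        while (true) {
            List<String> chunk = fetchChunk.apply(offset, limit);
            for (String line : chunk) {
                out.write(line);
                out.write('\n');
                written++;
            }
            if (chunk.size() < limit) {
                break; // a short (or empty) chunk means there are no more rows
            }
            offset += limit;
        }
        return written;
    }

    // In-memory stand-in for the database query with offset and limit.
    static List<String> slice(List<String> rows, long offset, long limit) {
        int from = (int) Math.min(offset, rows.size());
        int to = (int) Math.min(offset + limit, rows.size());
        return rows.subList(from, to);
    }

    // Exports rowCount made-up rows in chunks of the given size and returns the CSV text.
    static String exportSample(int rowCount, long limit) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < rowCount; i++) {
            rows.add("order-" + i);
        }
        StringWriter out = new StringWriter();
        try {
            export((offset, lim) -> slice(rows, offset, lim), limit, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(exportSample(250, 100).split("\n").length + " lines written"); // 250 lines written
    }
}
```

Against a real database the query would also need a stable ORDER BY; otherwise consecutive offsets are not guaranteed to see the rows in the same order.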
Use JPAQueryFactory
// com.querydsl.jpa.impl.JPAQueryFactory
JPAQueryFactory jpaQueryFactory = new JPAQueryFactory(entityManager);
Expression<MyEntity> select = QMyEntity.myEntity;
EntityPath<MyEntity> entityPath = QMyEntity.myEntity;
Stream stream = jpaQueryFactory
.select(select)
.from(entityPath)
.where(cond)
.createQuery() // get jpa query
.getResultStream();
// do something
stream.close();

Select Count very slow using EF with Oracle

I'm using EF 5 with Oracle database.
I'm doing a select count on a table with a specific parameter. When I use EF, the query returns the value 31, as expected, but the result takes about 10 seconds to come back.
using (var serv = new Aperam.SIP.PXP.Negocio.Modelos.SIP_PA())
{
var teste = (from ens in serv.PA_ENSAIOS_UM
where ens.COD_IDENT_UNMET == "FBLDY3840"
select ens).Count();
}
If I execute the simple query below, the result is the same (31), but it is returned in 500 milliseconds.
SELECT
count(*)
FROM
PA_ENSAIOS_UM
WHERE
COD_IDENT_UNMET = 'FBLDY3840'
Is there a way to improve the performance when I'm using EF?
Note: There are 13,000,000 rows in this table.
Here are some things you can try:
Capture the query that is being generated and see if it is the same as the one you are using. Details can be found here, but essentially, you will instantiate your DbContext (let's call it "_context") and then set the Database.Log property to be the logging method. It's fine if this method doesn't actually do anything--you can just set a breakpoint in there and see what's going on.
So, as an example: define a logging function (I have a static class called "Logging" which uses nLog to write to files)
public static void LogQuery(string queryData)
{
if (string.IsNullOrWhiteSpace(queryData))
return;
var message = string.Format("{0}{1}",
queryData.Trim().Contains(Environment.NewLine) ?
Environment.NewLine : "", queryData);
_sqlLogger.Info(message);
_genLogger.Trace($"EntityFW query (len {message.Length} chars)");
}
Then when you create your context point to LogQuery:
_context.Database.Log = Logging.LogQuery;
When you do your tests, remember that often the first run is the slowest because the server has to actually do the work, but on the subsequent runs, it often uses cached data. Try running your tests 2-3 times back to back and see if they don't start to run in the same time.
I don't know if it generates the same query or not, but try this other form (which should be functionally equivalent, but may provide better time)
var teste = serv.PA_ENSAIOS_UM.Count(ens=>ens.COD_IDENT_UNMET == "FBLDY3840");
I'm wondering if the version you have pulls data from the DB and THEN counts it. If so, this other syntax may leave all the work to be done at the server, where it belongs. Not sure, though, esp. since I haven't ever used EF with Oracle and I don't know if it behaves the same as SQL or not.

SQLAlchemy - when I iterate on a query, do I get a list or an iterator?

I'm starting to learn how to use SQLAlchemy and I'm running into some efficiency problems.
I created an object mapping an existing big table on our Oracle database:
engine = create_engine(connectionString, echo=False)
class POI(object):
    def __repr__(self):
        return "{poi_id} - {title}, {city} - {uf}".format(**self.__dict__)
def loadSession():
    metadata = MetaData(engine)
    _poi = Table('tbl_ourpois', metadata, autoload = True)
    mapper(POI, _poi)
    Session = sessionmaker(bind = engine)
    session = Session()
    return session
This table has millions of records. When I do a simple query and try to iterate over it:
session = loadSession()
for poi in session.query(POI):
    print poi
I noticed two things: (1) it takes some minutes for it to start printing objects on the screen, (2) memory usage starts to grow like crazy. So, my conclusion was that this code was fetching all the result set in a list and then iterating over it. Is this correct?
With cx_Oracle, when I do a query like:
conn = cx_Oracle.connect(connectionString)
cursor = conn.cursor()
cursor.execute("select * from tbl_ourpois")
for poi in cursor:
    print poi
the resulting cursor behaves as an iterator that gets results into a buffer and returns them as they are needed instead of loading the whole thing in a list. This loop starts printing results almost instantly and memory usage is pretty low and constant.
Can I get this kind of behavior with SQLAlchemy? Is there a way to get a constant memory iterator out of session.query(POI) instead of a list?

Hibernate, Query SQL with params. Bad Performance

I have an issue related to the performance of a SQL query using JPA.
Response time:
Using Toad - 200 ms
Inside my project using Glassfish 2.1, Java 1.5, Hibernate 3.4.0.ga - 27 s
Oracle 10g
Glassfish and Toad are hosted on the same machine. I have connected to other databases from the same Glassfish, JPA, etc., and performance is good, so I don't know what is happening.
I have two different environments. In one of them (theoretically the worse one) the query runs fast. The other is where I have the problem.
The query is executed through a javax.persistence.Query object, and the parameters are set on this object with the setParameter() method. After that, I call getResultList(), which returns the records. This is the point where the time is excessive.
But if I replace the parameters in code and call getResultList() directly, without setting parameters on the Query object, the performance is much better.
Could anyone give me a clue about the problem or how to trace it?
Query
SELECT A, B, ..., DATE_FIELD FROM
(SELECT A, B, C FROM Table1
WHERE REGEXP_LIKE(A, NVL(UPPER(:A),'')) AND DATE_FIELD = :DATE
UNION
SELECT A, B, C FROM Table2
WHERE REGEXP_LIKE(A, NVL(UPPER(:A),'')) AND DATE_FIELD = :DATE)
Java Code
public Query generateQuerySQL(String stringQuery, HashMap<String, Object> hParams) {
    Query query = em.createNativeQuery(stringQuery);
    if (hParams != null) {
        // Bind every named parameter found in the map
        for (Iterator<String> paramNameList = hParams.keySet().iterator(); paramNameList.hasNext();) {
            String name = paramNameList.next();
            Object value = hParams.get(name);
            query.setParameter(name, value);
        }
    }
    return query;
}
Query query = em.createNativeQuery(stringQuery);
will build a query plan to execute the query. Unfortunately, the metadata used to build the plan does not fit the actual parameter values that will be used when the query is executed.
If you substitute the parameters before the plan is built, the plan is fine and the query runs very fast.
Similar question here
You should set cursor_sharing = FORCE in Oracle to enable Hibernate support in JPA for Oracle.
please refer to following for more details

Resources