Paging SELECT query results from Cassandra in Spring Boot application

During my research I came across this JIRA issue for Spring Data Cassandra:
https://jira.spring.io/browse/DATACASS-56
According to the issue above, SDC currently does not support pagination in a Spring application because of how Cassandra is structured. However, if I can pull the entire row list into a Java List, can I paginate that list? I don't have much experience with Spring, but is there something I'm missing when I assume this can be done?

Cassandra does not support pagination in the sense of pointing to a specific page (limit/offset); instead it generates a continuation token (PagingState), which is a set of bytes. Pulling a List of records will load all records into memory and can exhaust it (depending on the amount of data).
Spring Data Cassandra 1.5.0 RC1 comes with a streaming API in CassandraTemplate:
Iterator<Person> it = template.stream("SELECT * FROM person WHERE … ;", Person.class);
while (it.hasNext()) {
    // …
}
CassandraTemplate.stream(…) returns an Iterator that operates on an underlying ResultSet. The DataStax driver uses a configurable fetch size (5000 rows by default) for bulk fetching. Streaming data access fetches only as much data as you need to process. The data is retained neither by the driver nor by Spring Data Cassandra; once a fetched bulk has been consumed through the Iterator, the underlying ResultSet fetches the next bulk itself.
The other alternative is using ResultSet directly, which gives you access to the PagingState so you can do all the continuation/paging business yourself. You would, however, lose all the higher-level benefits of Spring Data Cassandra.
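For illustration, a minimal sketch of that manual approach, assuming the DataStax Java driver 3.x, an existing Session, and an illustrative person table with a fetch size of 100:

import com.datastax.driver.core.PagingState;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class ManualPaging {

    // Fetches one page of rows; pagingState is null for the first page.
    // Returns the serialized continuation token for the next page, or null at the end.
    public static String fetchPage(Session session, String pagingState) {
        Statement stmt = new SimpleStatement("SELECT * FROM person");
        stmt.setFetchSize(100); // rows per page
        if (pagingState != null) {
            stmt.setPagingState(PagingState.fromString(pagingState));
        }
        ResultSet rs = session.execute(stmt);
        int remaining = rs.getAvailableWithoutFetching();
        for (Row row : rs) {
            // … process row …
            if (--remaining == 0) {
                break; // stop before the driver transparently fetches the next page
            }
        }
        PagingState next = rs.getExecutionInfo().getPagingState();
        return next == null ? null : next.toString();
    }
}

The serialized token can be handed to the client and passed back on the next request to continue where the previous page left off.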

Related

Uploading data to Kafka producer

I am new to Kafka in Spring Boot; I have been through many tutorials and have a fair knowledge of it.
Currently I have been assigned a task and I am facing an issue. I hope to get some help here.
The scenario is as follows:
1) I have a DB which is continuously updated with millions of rows.
2) I have to hit the DB every 5 minutes, pick up the recently updated data and send it to Kafka.
Condition: the old data that I picked up in my previous iteration must not be picked up again in my next DB call and Kafka push.
I am done with the Spring Scheduling part, picking up the data using findAll() from Spring Data JPA, but how can I write the logic so that it skips the old DB records and pushes only the new ones to Kafka?
My DB table also has a field called "Recent_timeStamp" of type "datetime".
It's hard to tell without really seeing your logic and the way you work with the database, but from what you've described you shouldn't just call "findAll" here.
Instead, treat your DB table as time-driven data:
Since it has a timestamp field, make sure there is an index on it.
Instead of "findAll", execute something like:
SELECT <...>
FROM <YOUR_TABLE>
WHERE RECENT_TIMESTAMP > ?
ORDER BY RECENT_TIMESTAMP ASC
In this case you'll get the records ordered by increasing timestamp.
The ? denotes the last memorized timestamp that you've handled, so you'll have to maintain that state yourself; a sketch follows below.
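A minimal sketch of this first method, assuming Spring's JdbcTemplate and Spring Kafka's KafkaTemplate (table, column and topic names are illustrative, and @EnableScheduling must be active); the pointer is kept in memory for brevity, although as noted below it belongs in persistent storage:

import java.sql.Timestamp;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

@Service
public class DbToKafkaPoller {

    private final JdbcTemplate jdbc;
    private final KafkaTemplate<String, String> kafka;

    // Last handled timestamp; in-memory here, persist it in real code.
    private Timestamp lastSeen = Timestamp.from(Instant.EPOCH);

    public DbToKafkaPoller(JdbcTemplate jdbc, KafkaTemplate<String, String> kafka) {
        this.jdbc = jdbc;
        this.kafka = kafka;
    }

    @Scheduled(fixedRate = 300_000) // every 5 minutes
    public void pollAndPush() {
        List<Map<String, Object>> rows = jdbc.queryForList(
                "SELECT * FROM my_table WHERE recent_timestamp > ? ORDER BY recent_timestamp ASC",
                lastSeen);
        for (Map<String, Object> row : rows) {
            kafka.send("my-topic", row.toString()); // replace with real serialization
            lastSeen = (Timestamp) row.get("recent_timestamp");
        }
    }
}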
Another option is to query only the data whose timestamp is at most 5 minutes old; in that case the query will look like this (pseudocode, since the actual syntax varies):
SELECT <...>
FROM <YOUR_TABLE>
WHERE RECENT_TIMESTAMP > now() - 5 minutes
ORDER BY RECENT_TIMESTAMP ASC
The first method is more robust: if your Spring Boot application is down for some reason, you'll be able to recover and query all the records from the point where it failed to send the data. On the other hand, you'll have to save this kind of pointer in some type of persistent storage.
The second solution is "easier" in the sense that you don't have state to maintain, but you will miss the data that arrived while the application was down.
In both cases you might want to use some kind of pagination, because you don't know in advance how many records you'll get from the database, and if the number of records exceeds your memory limits the application will end up throwing an OutOfMemoryError.
A completely different approach is sending the data to Kafka when you write to the database instead of when you read from it. At that point you have a data chunk of (probably) reasonably limited size, and in general you don't need any state, because you can store to the DB and send to Kafka from the same service, if the architecture of your application permits it.
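A sketch of that write-through idea, assuming a hypothetical DataRecord entity with a Spring Data repository, so the DB write and the Kafka send happen in one place:

import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class DataRecordService {

    private final DataRecordRepository repository; // hypothetical Spring Data repository
    private final KafkaTemplate<String, String> kafka;

    public DataRecordService(DataRecordRepository repository, KafkaTemplate<String, String> kafka) {
        this.repository = repository;
        this.kafka = kafka;
    }

    // Single write path: every record stored in the DB is also sent to Kafka,
    // so there is no polling and no pointer state to maintain.
    public DataRecord save(DataRecord record) {
        DataRecord saved = repository.save(record);
        kafka.send("my-topic", saved.toString()); // replace with real serialization
        return saved;
    }
}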
You can also look into the Kafka Connect component if it serves your purpose:
Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka® and other data systems. It makes it simple to quickly define connectors that move large data sets in and out of Kafka. Kafka Connect can ingest entire databases or collect metrics from all your application servers into Kafka topics, making the data available for stream processing with low latency. An export connector can deliver data from Kafka topics into secondary indexes like Elasticsearch, or into batch systems such as Hadoop for offline analysis.

How to fetch data from a database using Spring Boot without mapping

I have a database, and in that database there are many tables. I want to fetch the data from any one of those tables by entering a query from the front-end application. I'm not doing any manipulation of the data, just retrieving it from the database.
Also, mapping the data would require writing many entity or POJO classes, so I don't want to map the data to any object. How can I achieve this?
In this case, assuming the mapping of tables is not relevant, you don't need to use JPA/Hibernate at all.
You can use the old, battle-tested JdbcTemplate, which can execute a query of your choice (passed from the client), serialize the response to a JSON object and return it as the response in your controller.
The client side will be responsible for rendering the result.
You might also query the database metadata to obtain information about column names, types, etc., so that the client side gets this information as well and can show the results in a more convenient/"advanced" way.
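A minimal sketch of that approach, assuming the query string arrives from the client (the endpoint and names are illustrative):

import java.util.List;
import java.util.Map;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class AdHocQueryController {

    private final JdbcTemplate jdbc;

    public AdHocQueryController(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // Each row becomes a Map of column name -> value; Spring serializes the
    // list to JSON automatically, so no entity/POJO classes are needed.
    @PostMapping("/query")
    public List<Map<String, Object>> run(@RequestBody String sql) {
        return jdbc.queryForList(sql);
    }
}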
Beware of the security implications, though. It basically means that the client can delete all the records in the database with a simple query, and you won't be able to prevent it :)

Spring Data JPA garbage collection

I have a Spring Batch application with a JpaPagingItemReader (I modified it a bit) and 4 JPA repositories to enrich the Model that comes from the JpaPagingItemReader.
My flow is:
Select Model (page size = 8192), then collect this List<Model> into a Map<String, List<Model>> (grouped by id, because the models are not unique and I need to enrich by id), then enrich it with 4 custom JpaRepositories using native queries with an IN clause, and merge the results with Java 8 Streams.
Convert the data to XML objects and write them with StAX via a MultiFileItemWriter to files, split at no more than 20000 records per file.
All works great, but today I tried to run the flow with a large amount of data from the database. I generated 20 files (2.2 GB), but sometimes I got an OutOfMemory Java heap error (I had 1 GB -Xms/-Xmx). I then raised it to 2 GB and everything worked, but in Instana I see that the old-gen Java memory is always around 900 MB in use after GC, and overall about 1.3-1.7 GB is in use. So I started to think about how to optimize GC of the Spring Data JPA objects; I think they stay in memory too long.
When I select a Model with the JpaPagingItemReader I detach every Model (with entityManager.detach), but when I enrich a Model with the custom Spring Data JPA requests I do not detach the results. Maybe this is the problem and I should detach them too?
I do not need to insert data into the database, I only need to read it. Or should I make the page size smaller and select about 4000 per request?
I need to process 370 000 records from the database and enrich them.
Solved. I added flags to my run configuration and doubled -Xms and -Xmx.
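For example (the jar name is illustrative; the doubled heap matches what is described above):

java -Xms2g -Xmx2g -jar my-batch-app.jar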

spring jpa, findAll(Iterable<Integer> userIds)

findAll(…) in Spring Data JPA is called with a list of n UUIDs. When SQL logging is enabled, I see n SQL statements being logged, so I assume the DB is being called n times (the size of the list) to fetch the records.
Is it possible to call the DB only once and fetch all the records in a single query, so that performance can be improved?
Spring Data uses an IN clause for this method, except when your entity has a composite key. That is already just one query.
So the multiple queries you see are most likely your JPA implementation deciding to return proxies containing just the id and then lazy-loading the attributes on demand. See the documentation of the implementation you are using for how to prevent/control that.
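For illustration, a derived query on a hypothetical User entity that issues a single IN query, plus an entity graph variant that also fetches a (hypothetical) lazy "roles" association in the same statement:

import java.util.Collection;
import java.util.List;
import java.util.UUID;
import org.springframework.data.jpa.repository.EntityGraph;
import org.springframework.data.jpa.repository.JpaRepository;

public interface UserRepository extends JpaRepository<User, UUID> {

    // Translates to a single SELECT … WHERE id IN (?, ?, …) query.
    List<User> findByIdIn(Collection<UUID> ids);

    // If the extra statements come from lazily loaded associations,
    // an entity graph fetches them together with the users.
    @EntityGraph(attributePaths = "roles")
    List<User> findWithRolesByIdIn(Collection<UUID> ids);
}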

jdbi jdbc vertica streaming result set for handling large data

I am trying to connect to Vertica through JDBI JDBC to get a huge result set.
I followed the JDBI documentation and added this to my DAO:
@SqlQuery("<query>")
@Mapper(ResultRow.StreamMapper.class)
@FetchSize(chunkSizeInRows)
public Iterable<List<Object>> getStreamingResultSet(@Define("query") String query);
But it seems like it's loading the entire data set into memory instead of streaming it.
I've been looking at streaming result sets from JDBI and came across this question. The answer is on the SQL Object Queries documentation page:
because the method returns a java.util.Iterator it loads results lazily
So in this case the Iterable<List<Object>> should be an Iterator<List<Object>> (I assume JDBI can convert a database row to a List<Object>).
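Applied to the DAO from the question, only the return type changes (annotations and mapper kept as in the original):

// Iterator instead of Iterable, so JDBI streams rows lazily
// instead of materializing the whole result set.
@SqlQuery("<query>")
@Mapper(ResultRow.StreamMapper.class)
@FetchSize(chunkSizeInRows)
public Iterator<List<Object>> getStreamingResultSet(@Define("query") String query);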
