Fetch large data set from Database and Send it to Kafka - spring

I have 2 tables, let's say Emp and Courses. Emp has 30k rows and Courses has 100k rows.
One employee can have many courses, i.e. a one-to-many relation. I need to fetch the records from the tables and send them to Kafka.
Data from table -----> convert to JSON -----> send to Kafka
I don't want to load all the rows into memory at once, as that can cause an OutOfMemoryError.
How can I achieve this? Should I use JdbcTemplate or Spring Data JPA?
I am using Spring Boot 2+ and Java 8.
FYI
For example, in the Emp table I have emp_id = 1, which has 5 corresponding rows in the Courses table.
I will convert these 5 rows into 1 Java object and then into 1 JSON object.

Importing data from a database into Apache Kafka is a really common use case. Kafka Connect allows you to stream data from and to Kafka in a reliable, scalable and fault-tolerant way. Specifically, the JDBC source connector does what you are trying to do; if you build a custom solution you'll probably end up with a partial implementation of what the connector already does.
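To give a rough idea, a minimal standalone configuration for the Confluent JDBC source connector could look something like the sketch below. The connection URL, credentials, column and topic names are placeholders, not taken from the question:

name=emp-courses-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:mysql://localhost:3306/mydb
connection.user=user
connection.password=secret
mode=timestamp+incrementing
incrementing.column.name=emp_id
timestamp.column.name=updated_at
table.whitelist=EMP,COURSES
topic.prefix=db-
poll.interval.ms=5000

Note that the connector streams each table as flat rows; producing the nested employee-with-courses JSON described in the question would still require joining/aggregating downstream (for example with a stream processor) or writing a custom producer.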

Spring Data provides a way to stream large result sets instead of loading them all at once, so you can process results while the rest of the data is still being fetched. To do that, you just have to return a Stream of your @Entity class:
import java.util.stream.Stream;

public interface EmployeeRepository extends JpaRepository<EmployeeEntity, String> {
    // Lazily populated stream; consume it inside a read-only transaction and close it when done.
    Stream<EmployeeEntity> findAllBy();
}
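To tie this back to the original question, here is a minimal sketch of consuming that stream and publishing JSON to Kafka. EmployeeEntity, its getId() accessor, the "employees" topic name and the Jackson mapping are assumptions for illustration, not part of the question's code:

import java.util.stream.Stream;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class EmployeeExporter {

    private final EmployeeRepository repository;
    private final KafkaTemplate<String, String> kafkaTemplate;
    private final ObjectMapper objectMapper = new ObjectMapper();

    public EmployeeExporter(EmployeeRepository repository, KafkaTemplate<String, String> kafkaTemplate) {
        this.repository = repository;
        this.kafkaTemplate = kafkaTemplate;
    }

    // The stream must be consumed inside an open transaction and closed afterwards,
    // hence @Transactional plus try-with-resources.
    @Transactional(readOnly = true)
    public void export() {
        try (Stream<EmployeeEntity> employees = repository.findAllBy()) {
            employees.forEach(employee -> {
                try {
                    String json = objectMapper.writeValueAsString(employee);
                    kafkaTemplate.send("employees", employee.getId(), json);
                } catch (JsonProcessingException e) {
                    throw new IllegalStateException("Could not serialize employee " + employee.getId(), e);
                }
            });
        }
    }
}

With JPA you may also want to detach entities periodically (for example via EntityManager.detach) so the persistence context does not grow without bound while streaming.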

Related

What is the best approach when polling data from the DB and querying the DB again to fetch additional information?

The Spring Boot application that I am working on
polls 1000 messages from table X [this table X is populated by another service s1]
From each message it gets the account number and queries table Y to get additional information about the account.
I am using Spring Integration to poll messages from table X, and for reading the additional account information I am planning to use Spring JDBC.
We are expecting about 10k messages every day.
Is the above approach, querying table Y for each message, a good one?
No, it indeed is not. If all of that data is in the same database, consider writing a proper SELECT that joins those tables in a single query performed by that source polling channel adapter.
Another approach is to implement a stored procedure which will do that job for you and return all the needed data: https://docs.spring.io/spring-integration/reference/html/jdbc.html#stored-procedures.
However, if the memory needed to handle that number of records at once is a limit in your environment, or you don't care how fast all of them are processed, then an integration flow with parallel processing of the split polling result is indeed OK. For that goal you can use a JdbcOutboundGateway as a service in your flow instead of playing with a plain JdbcTemplate: https://docs.spring.io/spring-integration/reference/html/jdbc.html#jdbc-outbound-gateway
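For the single-query suggestion, a sketch of a JDBC inbound adapter doing the join (declared inside a @Configuration class) could look like this; the table names, columns and the "processed" flag are assumptions made up for illustration:

import javax.sql.DataSource;

import org.springframework.context.annotation.Bean;
import org.springframework.integration.annotation.InboundChannelAdapter;
import org.springframework.integration.annotation.Poller;
import org.springframework.integration.jdbc.JdbcPollingChannelAdapter;

@Bean
@InboundChannelAdapter(channel = "accountMessages", poller = @Poller(fixedDelay = "5000"))
public JdbcPollingChannelAdapter accountMessageSource(DataSource dataSource) {
    JdbcPollingChannelAdapter adapter = new JdbcPollingChannelAdapter(dataSource,
            "SELECT x.id, x.account_number, x.payload, y.account_name, y.account_status "
          + "FROM table_x x JOIN table_y y ON y.account_number = x.account_number "
          + "WHERE x.processed = 0");
    // Mark the polled rows so they are not picked up again on the next poll.
    adapter.setUpdateSql("UPDATE table_x SET processed = 1 WHERE id IN (:id)");
    return adapter;
}

Each poll then delivers rows that already carry the account details, so there is no per-message query against table Y.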

Uploading data to a Kafka producer

I am new to Kafka in Spring Boot. I have been through many tutorials and have gained a fair knowledge of it.
Currently I have been assigned a task and I am facing an issue. Hope to get some help here.
The scenario is as follows.
1) I have a DB which is continuously updated with millions of records.
2) I have to hit the DB every 5 minutes, pick up the recently updated data and send it to Kafka.
Condition: the data that I picked up in a previous iteration should not be picked up again in the next DB call and Kafka push.
I am done with the Spring Scheduling part that picks up the data using findAll() from Spring Data JPA, but how can I write the logic so that it does not pick up the old DB records, only takes the new ones and pushes them to Kafka?
My DB table also has a field called "Recent_timeStamp" of type "datetime".
It's hard to tell without really seeing your logic and the way you work with the database, but from what you've described you should not just do "findAll" here.
Instead, you should treat your DB table as time-driven data:
Since it has a timestamp field, make sure there is an index on it.
Instead of "findAll", execute something like:
SELECT <...>
FROM <YOUR_TABLE>
WHERE RECENT_TIMESTAMP > ?
ORDER BY RECENT_TIMESTAMP ASC
In this case you'll get the records ordered by increasing timestamp.
The ? denotes the last memorized timestamp that you've handled,
so you'll have to maintain that state yourself.
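A minimal sketch of this first approach with Spring Data JPA and a scheduled job might look like the following. The RecordEntity/RecordRepository names, the "records" topic, a JSON value serializer on the KafkaTemplate and the in-memory lastSeen pointer are all assumptions; in practice you would persist the pointer, as noted below:

import java.time.LocalDateTime;
import java.util.List;

import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

public interface RecordRepository extends JpaRepository<RecordEntity, Long> {
    // Spring Data derives: WHERE recent_timestamp > ? ORDER BY recent_timestamp ASC
    List<RecordEntity> findByRecentTimestampGreaterThanOrderByRecentTimestampAsc(LocalDateTime since);
}

@Component
public class RecordPoller {

    private final RecordRepository repository;
    private final KafkaTemplate<String, RecordEntity> kafkaTemplate;

    // Last handled timestamp; persist it (e.g. in a small control table) to survive restarts.
    private LocalDateTime lastSeen = LocalDateTime.of(1970, 1, 1, 0, 0);

    public RecordPoller(RecordRepository repository, KafkaTemplate<String, RecordEntity> kafkaTemplate) {
        this.repository = repository;
        this.kafkaTemplate = kafkaTemplate;
    }

    @Scheduled(fixedDelay = 300_000) // every 5 minutes
    public void poll() {
        List<RecordEntity> records =
                repository.findByRecentTimestampGreaterThanOrderByRecentTimestampAsc(lastSeen);
        for (RecordEntity record : records) {
            kafkaTemplate.send("records", record);
            lastSeen = record.getRecentTimestamp(); // advance the pointer only after sending
        }
    }
}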
Another option is to query only the data whose timestamp falls within the last 5 minutes; in this case the query will look like this (pseudocode, since the actual syntax varies):
SELECT <...>
FROM <YOUR_TABLE>
WHERE RECENT_TIMESTAMP > now() - 5 minutes
ORDER BY RECENT_TIMESTAMP ASC
The first method is more robust because if your Spring Boot application is "down" for some reason, you'll be able to recover and query all your records from the point where it failed to send the data. On the other hand, you'll have to save this kind of pointer in some type of persistent storage.
The second solution is "easier" in the sense that you don't have state to maintain, but on the other hand you will miss the data that arrived while the application was down.
In both cases you might want to use some kind of pagination, because you don't know in advance how many records you'll get from the database, and if the amount of records exceeds your memory limits the application will end up throwing an OutOfMemoryError.
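For the pagination part, one option is to page through the result with a Slice instead of loading the whole list. This is only a sketch, reusing the hypothetical names from the snippet above; the pageable repository method and page size are assumptions:

import org.springframework.data.domain.PageRequest;
import org.springframework.data.domain.Pageable;
import org.springframework.data.domain.Slice;
import org.springframework.data.domain.Sort;

// In RecordRepository (a pageable variant of the derived query above):
// Slice<RecordEntity> findByRecentTimestampGreaterThan(LocalDateTime since, Pageable pageable);

Pageable page = PageRequest.of(0, 500, Sort.by("recentTimestamp").ascending());
Slice<RecordEntity> slice;
do {
    slice = repository.findByRecentTimestampGreaterThan(lastSeen, page);
    slice.getContent().forEach(record -> kafkaTemplate.send("records", record));
    page = slice.nextPageable();
} while (slice.hasNext());
// Advance lastSeen only after all pages have been processed, so a failure mid-way can be retried.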
A completely different approach is sending the data to Kafka when you write to the database instead of when you read from it. At that point you're dealing with a data chunk of (probably) reasonably limited size, and in general you don't need any state, because you can store to the DB and send to Kafka from the same service, if the architecture of your application permits that.
You can look into the Kafka Connect component if it serves your purpose.
Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka® and other data systems. It makes it simple to quickly define connectors that move large data sets in and out of Kafka. Kafka Connect can ingest entire databases or collect metrics from all your application servers into Kafka topics, making the data available for stream processing with low latency. An export connector can deliver data from Kafka topics into secondary indexes like Elasticsearch, or into batch systems such as Hadoop for offline analysis.

How to read multiple tables using Spring Batch

I am looking to read data from multiple tables (different database tables), aggregate them and create a final result set. In my case, each query will return a List of objects. I went through the web many times and found no link other than Spring Batch How to read multiple table (queries) as Reader and write it as flat file write, but it returns only a single object.
Is there any way we can do this? A working sample example would help a lot.
Example -
One query gives a List of Departments - from an Oracle DB
One query gives a List of Employees - from Postgres
Now I want to build the Employee and Department relationship and send the final object to the processor for a further lookup against MongoDB, then send the final object to the writer.
The question should rather be "how to join three tables from three different databases and write the result in a file". There is no built-in reader in Spring Batch that reads from multiple tables. You either need to create a custom reader, or decompose the problem at hand into tasks that can be implemented using Spring Batch tasklet/chunk-oriented steps.
I believe you can use the driving query pattern in a single chunk-oriented step. The reader reads employee items, then a processor enriches each item with 1) the department from Postgres and 2) other info from Mongo. This should work for small/medium datasets. If you have a lot of data, you can use partitioning to parallelize things and improve performance.
Another option, if you want to avoid a query per item, is to load all departments into a cache (I guess there should be fewer departments than employees) and enrich items from the cache rather than with individual queries to the DB, as in the sketch below.
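A rough sketch of such an enrichment processor; the Employee, Department, EnrichedEmployee and ExtraInfo types, the departmentId accessor and the Mongo lookup are all assumptions made up for illustration:

import java.util.Map;

import org.springframework.batch.item.ItemProcessor;
import org.springframework.data.mongodb.core.MongoTemplate;

public class EmployeeEnrichmentProcessor implements ItemProcessor<Employee, EnrichedEmployee> {

    private final Map<Long, Department> departmentsById; // preloaded once from Postgres
    private final MongoTemplate mongoTemplate;

    public EmployeeEnrichmentProcessor(Map<Long, Department> departmentsById, MongoTemplate mongoTemplate) {
        this.departmentsById = departmentsById;
        this.mongoTemplate = mongoTemplate;
    }

    @Override
    public EnrichedEmployee process(Employee employee) {
        // Enrich from the in-memory department cache instead of querying the DB per item.
        Department department = departmentsById.get(employee.getDepartmentId());
        // One lookup against MongoDB per item (could also be cached if the data set allows it).
        ExtraInfo extraInfo = mongoTemplate.findById(employee.getId(), ExtraInfo.class);
        return new EnrichedEmployee(employee, department, extraInfo);
    }
}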

Embedded H2 database for dynamic files

In our application, we need to load large CSV files and fetch some data out of them, for example the distinct values in a CSV file. For this, we decided to go with an in-memory DB like H2, as there is no need to store the data in persistent storage.
However, the files are dynamic, so the columns may not always be the same. I need to load the file into an H2 table that is temporary for that session.
The tech stack is Spring Boot and H2.
The examples I see on forums use a standard entity that knows what fields the table has. However, in my case the table columns will be dynamic.
I tried the below in Spring Boot:
public interface ImportCSVRepository extends JpaRepository<Object, String>
with
@Query(value = "CREATE TABLE TEST AS SELECT * FROM CSVREAD('test.csv');", nativeQuery = true)
But this gives an unmanaged entity error. I understand why the error is thrown; however, I am not sure how to achieve this. Also, please clarify whether I should use Spring Batch.
You can use JdbcTemplate to manually create tables and query/update the data in them.
An example of how to create a table with JdbcTemplate
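A minimal sketch of that idea, combining JdbcTemplate with H2's CSVREAD function; the service shape, table name handling and file path are assumptions:

import java.util.List;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;

@Service
public class CsvImportService {

    private final JdbcTemplate jdbcTemplate;

    public CsvImportService(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // H2 infers the column names from the CSV header line, so no entity is needed.
    // Only pass trusted values for tableName/csvPath: they are concatenated into SQL.
    public void importCsv(String tableName, String csvPath) {
        jdbcTemplate.execute("CREATE TABLE " + tableName + " AS SELECT * FROM CSVREAD('" + csvPath + "')");
    }

    public List<String> distinctValues(String tableName, String columnName) {
        return jdbcTemplate.queryForList("SELECT DISTINCT " + columnName + " FROM " + tableName, String.class);
    }

    public void dropTable(String tableName) {
        jdbcTemplate.execute("DROP TABLE IF EXISTS " + tableName);
    }
}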
Dynamically creating tables and defining new entities (or modifying existing ones) is hardly possible with Spring Data repositories and @Entity classes. You should probably also check some NoSQL DBs like MongoDB - it's easier to define documents (or key-value objects - Redis) with dynamic structures in them.

Paging SELECT query results from Cassandra in Spring Boot application

During my research I have come across this JIRA for Spring-Data-Cassandra:
https://jira.spring.io/browse/DATACASS-56
Now, according to the post above, SDC currently does not support pagination in a Spring app due to the structure of Cassandra. However, I'm thinking: if I can pull the entire list of rows into a Java List, can I paginate that list? I don't have much experience with Spring, but is there something I am missing when I assume this can be done?
Cassandra does not support pagination in the sense of pointing to a specific page (limit/offset), but generates a continuation token (PagingState) that is a set of bytes. Pulling a List of records will load all records into memory and possibly exhaust it (depending on the amount of data).
Spring Data Cassandra 1.5.0 RC1 comes with a streaming API in CassandraTemplate:
Iterator<Person> it = template.stream("SELECT * FROM person WHERE … ;", Person.class);
while(it.hasNext()) {
// …
}
CassandraTemplate.stream(…) will return an Iterator that operates on an underlying ResultSet. The DataStax driver uses a configurable fetch size (5000 rows by default) for bulk fetching, so streaming data access fetches only as much data as you require to process it. The data is retained neither by the driver nor by Spring Data Cassandra, and once a fetched bulk has been consumed from the Iterator, the underlying ResultSet fetches the next bulk itself.
The other alternative is using the ResultSet directly, which gives you access to the PagingState so you can do all the continuation/paging business yourself, but then you lose all the higher-level benefits of Spring Data Cassandra.
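If you do go the manual route, a rough sketch with the DataStax 3.x driver could look like this; the table, columns, page size and the string round-trip of the PagingState are assumptions for illustration:

import com.datastax.driver.core.PagingState;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class PersonPageFetcher {

    // Returns the paging state for the next page (null when there are no more pages).
    public String fetchPage(Session session, String pagingStateString) {
        Statement statement = new SimpleStatement("SELECT id, name FROM person").setFetchSize(100);
        if (pagingStateString != null) {
            statement.setPagingState(PagingState.fromString(pagingStateString));
        }
        ResultSet resultSet = session.execute(statement);
        int remaining = resultSet.getAvailableWithoutFetching();
        for (Row row : resultSet) {
            // handle the row, e.g. map it to a DTO
            if (--remaining == 0) {
                break; // stop before the iterator triggers the fetch of the next page
            }
        }
        PagingState nextPage = resultSet.getExecutionInfo().getPagingState();
        return nextPage == null ? null : nextPage.toString();
    }
}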

Resources