JDBI JDBC Vertica streaming result set for handling large data

I am trying to connect to Vertica through JDBI (JDBC) to fetch a huge result set.
I followed the JDBI documentation and added this to my DAO:
@SqlQuery("<query>")
@Mapper(ResultRow.StreamMapper.class)
@FetchSize(chunkSizeInRows)
public Iterable<List<Object>> getStreamingResultSet(@Define("query") String query);
But it seems like it's loading the entire data set into memory instead of streaming it.

I've been looking at streaming result sets from JDBI, and came across this question. The answer is on the SQL Object Queries documentation page:
because the method returns a java.util.Iterator it loads results lazily
So in this case, the Iterable<List<Object>> should be an Iterator<List<Object>> (I assume JDBI can convert a database row to a List<Object>).
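For reference, a minimal sketch of the DAO method with the return type changed accordingly; it keeps the question's own names (ResultRow.StreamMapper, chunkSizeInRows) and assumes the JDBI v2 SQL Object API:
// Returning Iterator instead of Iterable lets JDBI hand out rows lazily from the
// open ResultSet rather than materializing everything up front.
@SqlQuery("<query>")
@Mapper(ResultRow.StreamMapper.class)
@FetchSize(chunkSizeInRows)   // annotation value must be a compile-time constant int
Iterator<List<Object>> getStreamingResultSet(@Define("query") String query);
Keep in mind that the Iterator keeps the underlying statement and result set open while it is being consumed, so it should be iterated to the end (or the handle closed).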

Related

Uploading data to a Kafka producer

I am new to Kafka in Spring Boot; I have been through many tutorials and have a fair knowledge of it.
Currently I have been assigned a task and I am facing an issue. Hope to get some help here.
The scenario is as follows.
1) I have a DB which is continuously being updated with millions of records.
2) I have to hit the DB every 5 minutes, pick the recently updated data, and send it to Kafka.
Condition: the data I picked in a previous iteration must not be picked again in the next DB call and Kafka push.
I am done with the Spring scheduling part that picks the data using findAll() from Spring Data JPA, but how can I write the logic so that it does not pick the old DB records and only takes the new records and pushes them to Kafka?
My DB table also has a field called "Recent_timeStamp" of type "datetime".
It's hard to tell without really seeing your logic and the way you work with the database, but from what you've described you shouldn't just do findAll here.
Instead, you should treat your DB table as time-driven data:
Since it has a timestamp field, make sure there is an index on it.
Instead of findAll, execute something like:
SELECT <...>
FROM <YOUR_TABLE>
WHERE RECENT_TIMESTAMP > ?
ORDER BY RECENT_TIMESTAMP ASC
In this case you'll get the records ordered by increasing timestamp.
The ? denotes the last memorized timestamp that you've handled, so you'll have to maintain that state.
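A minimal sketch of this approach, assuming Spring's JdbcTemplate, KafkaTemplate and @Scheduled (with @EnableScheduling configured elsewhere); the table, column and topic names and the in-memory watermark are illustrative only, and in practice the watermark should live in persistent storage:
import java.sql.Timestamp;
import java.util.List;
import java.util.Map;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class RecentRowsPublisher {

    private final JdbcTemplate jdbc;
    private final KafkaTemplate<String, String> kafka;

    // Last handled timestamp; persist this somewhere durable so a restart can resume.
    private Timestamp lastSeen = new Timestamp(0L);

    public RecentRowsPublisher(JdbcTemplate jdbc, KafkaTemplate<String, String> kafka) {
        this.jdbc = jdbc;
        this.kafka = kafka;
    }

    @Scheduled(fixedRate = 300_000)   // every 5 minutes
    public void publishNewRows() {
        List<Map<String, Object>> rows = jdbc.queryForList(
                "SELECT * FROM my_table WHERE recent_timestamp > ? ORDER BY recent_timestamp ASC",
                lastSeen);

        for (Map<String, Object> row : rows) {
            kafka.send("my-topic", row.toString());
            // Advance the watermark as rows are processed (they arrive in ascending order).
            lastSeen = (Timestamp) row.get("recent_timestamp");
        }
    }
}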
Another option is to query only the data whose timestamp is newer than 5 minutes ago; in this case the query will look like this (pseudocode, since the actual syntax varies):
SELECT <...>
FROM <YOUR_TABLE>
WHERE RECENT_TIMESTAMP > now() - 5 minutes
ORDER BY RECENT_TIMESTAMP ASC
The first method is more robust, because if your Spring Boot application is down for some reason you'll be able to recover and query all the records from the point where it failed to send the data. On the other hand, you'll have to save this kind of pointer in some sort of persistent storage.
The second solution is "easier" in the sense that you don't have state to maintain, but on the other hand you will miss the data produced while the application was down.
In both cases you might want to use some kind of pagination, because you don't know in advance how many records you'll get from the database, and if the number of records exceeds your memory limits the application will end up throwing an OutOfMemoryError.
A completely different approach is to send the data to Kafka when you write to the database instead of when you read from it. At that point you have a data chunk of (probably) reasonably limited size, and in general you don't need the state, because you can store to the DB and send to Kafka from the same service, if the architecture of your application permits it.
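A rough sketch of that write-path variant, assuming Spring Data JPA and Spring Kafka; Order, OrderRepository and the topic name are hypothetical:
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class OrderService {

    private final OrderRepository repository;            // hypothetical Spring Data JPA repository
    private final KafkaTemplate<String, String> kafka;

    public OrderService(OrderRepository repository, KafkaTemplate<String, String> kafka) {
        this.repository = repository;
        this.kafka = kafka;
    }

    public Order save(Order order) {
        Order saved = repository.save(order);
        // Publish right after the write; note this is not atomic with the DB commit,
        // so a failure in between can still lose or duplicate a message.
        kafka.send("orders", saved.toString());
        return saved;
    }
}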
You can also look into the Kafka Connect component if it serves your purpose:
Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka® and other data systems. It makes it simple to quickly define connectors that move large data sets in and out of Kafka. Kafka Connect can ingest entire databases or collect metrics from all your application servers into Kafka topics, making the data available for stream processing with low latency. An export connector can deliver data from Kafka topics into secondary indexes like Elasticsearch, or into batch systems such as Hadoop for offline analysis.

How to fetch data from a database using Spring Boot without mapping

I have a database and in that database there are many tables. I want to fetch data from any one of those tables by entering a query from the front-end application. I'm not doing any manipulation of the data, just retrieving it from the database.
Also, mapping the data requires writing many entity or POJO classes, so I don't want to map the data to any object. How can I achieve this?
In this case, assuming the mapping of tables is not relevant, you don't need to use JPA/Hibernate at all.
You can use the old, battle-tested JdbcTemplate, which can execute a query of your choice (the one you'll pass from the client), serialize the response to a JSON object, and return it as the response in your controller.
The client side will be responsible for rendering the result.
You might also query the database metadata to obtain information about column names, types, etc., so that the client side gets this information as well and can show the results in a more convenient / "advanced" way.
Beware of the security implications, though. Basically it means that the client will be able to delete all the records in the database with a simple query, and you won't be able to avoid it :)
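A minimal sketch of such an endpoint, assuming Spring Web and an auto-configured JdbcTemplate; the /query path and the raw-SQL request body are illustrative (and exactly what the caveat above is about):
import java.util.List;
import java.util.Map;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class AdHocQueryController {

    private final JdbcTemplate jdbc;

    public AdHocQueryController(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // Runs the SQL sent by the client and returns the rows as column-name -> value maps,
    // which Spring serializes to JSON; no entity/POJO classes are required.
    @PostMapping("/query")
    public List<Map<String, Object>> runQuery(@RequestBody String sql) {
        return jdbc.queryForList(sql);
    }
}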

Role of H2 database in Apache Ignite

I have an Apache Spark job, and one of its components fires queries at the Apache Ignite data grid using Ignite SQL; the query is a SqlFieldsQuery. I was going through the thread dump, and in one of the executor logs I saw the following:
org.h2.mvstore.db.TransactionStore.begin(TransactionStore.java:229)
org.h2.engine.Session.getTransaction(Session.java:1580)
org.h2.engine.Session.getStatementSavepoint(Session.java:1588)
org.h2.engine.Session.setSavepoint(Session.java:793)
org.h2.command.Command.executeUpdate(Command.java:252)
org.h2.jdbc.JdbcStatement.executeUpdateInternal(JdbcStatement.java:130)
org.h2.jdbc.JdbcStatement.executeUpdate(JdbcStatement.java:115)
org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.connectionForThread(IgniteH2Indexing.java:428)
org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.connectionForSpace(IgniteH2Indexing.java:360)
org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.queryLocalSqlFields(IgniteH2Indexing.java:770)
org.apache.ignite.internal.processors.query.GridQueryProcessor$5.applyx(GridQueryProcessor.java:892)
org.apache.ignite.internal.processors.query.GridQueryProcessor$5.applyx(GridQueryProcessor.java:886)
org.apache.ignite.internal.util.lang.IgniteOutClosureX.apply(IgniteOutClosureX.java:36)
org.apache.ignite.internal.processors.query.GridQueryProcessor.executeQuery(GridQueryProcessor.java:1666)
org.apache.ignite.internal.processors.query.GridQueryProcessor.queryLocalFields(GridQueryProcessor.java:886)
org.apache.ignite.internal.processors.cache.IgniteCacheProxy.query(IgniteCacheProxy.java:698)
com.test.ignite.cache.CacheWrapper.queryFields(CacheWrapper.java:1019)
The last line in my code executes a SqlFieldsQuery as follows:
SqlFieldsQuery sql = new SqlFieldsQuery(queryString).setArgs(args);
cache.query(sql);
According to my understanding, Ignite has its own data grid which it uses to store the cache data and indexes. It only makes use of the H2 database to parse the SQL query and get a query execution plan.
But the thread dump shows that updates are being executed and transactions are involved. I don't understand the need for transactions or updates in a SQL SELECT query.
I want to know the following about the role of the H2 database in Ignite:
I went into the open source code of Apache Ignite (version 1.7.0) and saw that it was trying to open a connection to a specific schema in the H2 database by executing the query SET SCHEMA schema_name (the connectionForThread() method of the IgniteH2Indexing class). Is one schema or one table created for every cache? If yes, what information does it contain, since all the data is stored in Ignite's data grid?
I also came across another interesting thing in the source code: Ignite tries to derive the H2 schema name from the space name (see the queryLocalSqlFields() method of the IgniteH2Indexing class). I want to know what this space name indicates and whether it is something internal to Ignite or configurable.
Would the setting of the schema and the connection to the H2 DB happen for each of my SQL queries, and if yes, is there any way to avoid this?
Yes, we call executeUpdate to set the schema. In Ignite 2.x we will be able to switch to Connection.setSchema for that. Right now we create a SQL schema for each cache, and you can create multiple tables in it, but this is going to change in the future. The schema does not actually contain anything; we just utilize some H2 APIs.
Space name is basically the same thing as a cache name. You can configure the SQL schema name for a cache using CacheConfiguration.setSqlSchema.
If you run queries using the same cache instance, the schema will not change.
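A small sketch of the configuration side; an Ignite instance named ignite and a hypothetical Person value class registered for SQL are assumed. CacheConfiguration.setSqlSchema is the setter mentioned above, and reusing the same IgniteCache instance for queries avoids repeated schema switches:
// Cache with an explicit SQL schema name (otherwise the space/cache name is used).
CacheConfiguration<Long, Person> ccfg = new CacheConfiguration<>("personCache");
ccfg.setIndexedTypes(Long.class, Person.class);   // Person is a hypothetical annotated value class
ccfg.setSqlSchema("PERSONS");

IgniteCache<Long, Person> cache = ignite.getOrCreateCache(ccfg);

// Querying through the same cache instance keeps the thread's H2 schema stable.
SqlFieldsQuery qry = new SqlFieldsQuery("SELECT name, age FROM Person WHERE age > ?").setArgs(30);
try (QueryCursor<List<?>> cursor = cache.query(qry)) {
    for (List<?> row : cursor)
        System.out.println(row);
}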

Paging SELECT query results from Cassandra in Spring Boot application

During my research I came across this JIRA for Spring Data Cassandra:
https://jira.spring.io/browse/DATACASS-56
Now, according to the post above, SDC currently does not support pagination in a Spring app due to the structure of Cassandra. However, I'm thinking: if I can pull the entire row list into a Java List, can I paginate that list? I don't have much experience in Spring, but is there something I am missing when I assume this can be done?
Cassandra does not support pagination in the sense of pointing to a specific page (limit/offset), but generates a continuation token (PagingState), which is a set of bytes. Pulling a List of records will load all records into memory and possibly exhaust your memory (depending on the amount of data).
Spring Data Cassandra 1.5.0 RC1 comes with a streaming API in CassandraTemplate:
Iterator<Person> it = template.stream("SELECT * FROM person WHERE … ;", Person.class);
while(it.hasNext()) {
// …
}
CassandraTemplate.stream(…) will return an Iterator that operates on an underlying ResultSet. The DataStax driver uses a configurable fetch size (5000 rows by default) for bulk fetching. Streaming data access fetches as much or as little data as you require to process it. Data is not retained by the driver or by Spring Data Cassandra, and once a fetched bulk has been consumed from the Iterator, the underlying ResultSet fetches the next bulk itself.
The other alternative is using ResultSet directly, which gives you access to the PagingState, and doing all the continuation/paging work yourself. You would, however, lose all the higher-level benefits of Spring Data Cassandra.
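For completeness, a rough sketch of that manual route with the DataStax Java driver 3.x (com.datastax.driver.core), which is what Spring Data Cassandra 1.5.x sits on; the Session named session, the table and the page size are assumptions:
// First page: fetch at most 100 rows and capture the continuation token.
Statement stmt = new SimpleStatement("SELECT * FROM person").setFetchSize(100);
ResultSet rs = session.execute(stmt);

int remaining = rs.getAvailableWithoutFetching();   // rows in the current page
for (Row row : rs) {
    // ... map the row ...
    if (--remaining == 0)
        break;                                       // stop before the driver fetches the next page
}

PagingState pagingState = rs.getExecutionInfo().getPagingState();
String token = (pagingState == null) ? null : pagingState.toString();   // hand this to the client

// Next request: resume from the token supplied by the client.
Statement next = new SimpleStatement("SELECT * FROM person")
        .setFetchSize(100)
        .setPagingState(PagingState.fromString(token));
ResultSet nextPage = session.execute(next);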

BIRT Scripted Data Source using existing JDBC DataSource

I know that my overall problem is generally approached using two of the more common solutions, such as a join data set or a sub-table/sub-report. I have looked at those, and I am not sure they will work effectively.
Background:
The JDBC data source has local data which includes a series of IDs that reference records in a master data repository accessed via a web service. This is where the need for a scripted data source arises. The data can be filtered on attributes within the local JDBC data and/or the extended data from the web service. The complication is that my only interface to the web service is the id argument.
Ideal Solution:
Aside from creating a reporting table or other truly desirable scenarios, I am looking at creating a unified data source through a single scripted data source that will handle all the complexities. This leaves report generation and parameter creation a bit cleaner, hopefully. The idea is to leverage the JDBC query as well as the web service queries in the scripted data source, do the filtering and joins there, and create that single unified view.
I tried using the following code as a reference to use the existing JDBC connection in the BIRT report definition to execute the query. However, my breakdown of what should go in open vs. fetch may be off, given that the reference code came from beforeFactory and was written for a completely different purpose... The truth is I see no errors; it just returns 0 records.
a link
I have also found a code snippet to dynamically load a JDBC connection, but that seems a bit obtuse and a ton of overhead for what I need to do. a link
In short: how in all that is holy do you simply run a query against a database within a scripted data source, if you wanted to? The merit of doing that is another issue, but technically, how?
Thanks in Advance!
