Java based ETL Application - spring

I want to build a spring framework based ETL application. I should be able to create an exact copy of any table in a database. Hence, the structure of the table is not known to me beforehand. So, creation of entities is not possible within the application.
The idea is to provide some external configuration to the application for each table. The application should then be able to create an exact copy of the table.
I cannot use Spring Data JPA because it requires entities, so I am planning to use Spring's JdbcTemplate. Is JdbcTemplate the right tool for this application?
I do not want to use Pentaho; rather, I want to build something like it in Java.

You can use Spark.
Here is an example of how you can do it:
import java.util.Properties;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DemoApp {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[1]")
                .appName(DemoApp.class.getName())
                .getOrCreate();
        // Load the whole table; Spark derives the schema from the JDBC metadata.
        Dataset<Row> table1 = spark.read().jdbc(
                "jdbc:postgresql://127.0.0.1:5432/postgres", "demo.table", getConnectionProperties());
    }

    // Connection settings passed through to the PostgreSQL JDBC driver.
    private static Properties getConnectionProperties() {
        Properties connectionProperties = new Properties();
        connectionProperties.put("user", "postgres");
        connectionProperties.put("password", "password");
        connectionProperties.put("driver", "org.postgresql.Driver");
        connectionProperties.put("stringtype", "unspecified");
        return connectionProperties;
    }
}
You can read several tables and after that join them or do other things you like.
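If you would rather stay on plain JDBC, as the question proposes, the copy itself needs no entities, because the column list can be read from `ResultSetMetaData` at runtime. Below is a minimal sketch, not a tuned implementation. The class name `TableCopier`, its method names, and the batch size of 1000 are illustrative assumptions, and it assumes the target table already exists with the same column names. Table and column names are concatenated into the SQL, so they must come from trusted configuration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: copies every row of one table into an identically shaped table.
final class TableCopier {

    // Builds "INSERT INTO t (a, b) VALUES (?, ?)" for the given columns.
    static String buildInsertSql(String table, List<String> columns) {
        String placeholders = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "INSERT INTO " + table + " (" + String.join(", ", columns)
                + ") VALUES (" + placeholders + ")";
    }

    // Streams rows from source to target and returns the number of rows copied.
    static long copy(Connection source, Connection target, String table) throws SQLException {
        try (Statement st = source.createStatement();
             ResultSet rs = st.executeQuery("SELECT * FROM " + table)) {
            // Discover the columns at runtime instead of declaring an entity.
            ResultSetMetaData meta = rs.getMetaData();
            List<String> columns = new ArrayList<>();
            for (int i = 1; i <= meta.getColumnCount(); i++) {
                columns.add(meta.getColumnName(i));
            }
            long copied = 0;
            try (PreparedStatement insert = target.prepareStatement(buildInsertSql(table, columns))) {
                while (rs.next()) {
                    for (int i = 1; i <= columns.size(); i++) {
                        insert.setObject(i, rs.getObject(i));
                    }
                    insert.addBatch();
                    if (++copied % 1000 == 0) {
                        insert.executeBatch(); // flush periodically to bound memory
                    }
                }
                insert.executeBatch(); // flush the remainder
            }
            return copied;
        }
    }
}
```

With Spring, the same idea should map onto JdbcTemplate's `query` with a `RowCallbackHandler` and `batchUpdate`, so JdbcTemplate looks like a reasonable fit for this kind of entity-free copying.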

Related

Embedded H2 Database for dynamic files

In our application, we need to load large CSV files and fetch some data out of it. For example, getting the distinct values from the CSV file. For this, we decided to go with in-memory DB's like H2, as there is no need to store the data in persistent storage.
However, the file is so dynamic that the columns may not be the same. I need to load the file to the H2 database to a table that is temporary for that session.
Tech Stack is Spring boot and H2.
The examples I see on forums use a standard entity that knows which fields the table has. In my case, however, the table columns will be dynamic.
I tried the below in spring boot
public interface ImportCSVRepository extends JpaRepository<Object, String>
with
@Query(value = "CREATE TABLE TEST AS SELECT * FROM CSVREAD('test.csv');", nativeQuery = true)
But this gives an unmanaged entity error. I understand why the error is thrown, but I am not sure how to achieve this. Also, please clarify whether I should use Spring Batch.
You can use JdbcTemplate to manually create tables and query/update the data in them.
An example of how to create a table with JdbcTemplate
Dynamically creating tables and defining new entities (or modifying existing ones) is hardly possible with Spring Data repositories and @Entity classes. You should probably also look at NoSQL databases such as MongoDB: it is easier to define documents (or key-value objects, as in Redis) with dynamic structures in them.
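The JdbcTemplate suggestion above can be sketched without any entity, because the `CSVREAD` statement from the question is ordinary SQL. The sketch below uses a plain JDBC `Connection` so that it stays self-contained. `CsvTableLoader` and its method names are hypothetical, and the table-name check is an assumption about what counts as a safe identifier.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper: loads a CSV file into a new H2 table whose columns come from the file.
final class CsvTableLoader {

    // Builds the H2 statement from the question for a given table name and CSV path.
    static String buildCreateTableSql(String tableName, String csvPath) {
        // The table name cannot be a bind parameter, so restrict it to a simple identifier.
        if (!tableName.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("Unsafe table name: " + tableName);
        }
        String escapedPath = csvPath.replace("'", "''"); // escape quotes for the SQL string literal
        return "CREATE TABLE " + tableName + " AS SELECT * FROM CSVREAD('" + escapedPath + "')";
    }

    // Executes the statement on an open H2 connection.
    static void load(Connection h2Connection, String tableName, String csvPath) throws SQLException {
        try (Statement st = h2Connection.createStatement()) {
            st.execute(buildCreateTableSql(tableName, csvPath));
        }
    }
}
```

In a Spring Boot application the same string could be passed to `jdbcTemplate.execute(sql)`, and the distinct values could then be read with something like `jdbcTemplate.queryForList("SELECT DISTINCT ...", String.class)`, with the column name taken from the file header.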

How to create a datasource programmatically in Spring Batch?

I want to copy a lot of data from several databases on different machines into one central database.
I think Spring Batch may fit my requirement.
So I would create many jobs to accomplish the whole task, like this:
job A: copy from db1 to db111;
job B: copy from db2 to db111;
job C: copy from db3 to db111;
etc...
The tables in db1, db2, db3... are quite different.
So far, I know how to create datasources at Spring Boot startup time, but I don't know how to create a datasource inside a job instance at runtime. Any ideas? (Support for Spring Data JPA would be even better.)
Or is there a better way than Spring Batch?
Thanks.
A datasource is a set of connections to one database. Whether you have several kinds of databases or several databases of the same kind, you create one datasource per database and use it wherever the code needs it.
Step 1: Write one configuration class per database to set up its datasource. At the property-file level you cannot rely on the default properties; use custom ones, prefixed with the database name to tell them apart.
You also need to define a transaction manager etc. for each datasource, and give each datasource a unique name.
Step 2: Use the appropriate datasource with the appropriate DAO classes. If you use JPA, the configuration class above already contains the relevant settings, including the entity packages and repository packages. JdbcTemplate takes a datasource in its constructor.
All in all, the scenario is similar to the single-datasource one: you set up all datasources in advance at application startup, with appropriately qualified bean names, and then use them wherever you need.
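The prefixed-properties idea from Step 1 can be sketched in plain Java. This is only an illustration: `DataSourceSettings`, `groupByDatabase`, and the `<db>.<setting>` key layout (for example `db1.url`) are assumptions, not a Spring convention.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper: groups "<db>.<setting>" keys into one Properties object per database.
final class DataSourceSettings {

    // Splits keys such as "db1.url" into per-database settings, keyed by database name.
    static Map<String, Properties> groupByDatabase(Properties all) {
        Map<String, Properties> byDatabase = new TreeMap<>();
        for (String key : all.stringPropertyNames()) {
            int dot = key.indexOf('.');
            if (dot <= 0 || dot == key.length() - 1) {
                continue; // not of the form "<db>.<setting>"
            }
            String database = key.substring(0, dot);
            String setting = key.substring(dot + 1);
            byDatabase.computeIfAbsent(database, name -> new Properties())
                      .setProperty(setting, all.getProperty(key));
        }
        return byDatabase;
    }
}
```

Each resulting `Properties` object could then feed one datasource, for instance through Spring Boot's `DataSourceBuilder.create().url(...).username(...).password(...).build()`, and each datasource could be wrapped in its own `new JdbcTemplate(dataSource)`.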
This Answer is what works for me

Data migration among multiple databases with spring-boot

I am trying to make an application where data will be migrated from one database to another database (Multiple dbs will be used). User can select the table at runtime & push it to target db. I am using spring-boot, spring data JPA & trying with Flyway.
My issue is how to read the complete schema from source db as user can select the source db at runtime?
Sumit
You can obtain a MetaData object from a JDBC connection and use it to obtain all kinds of information about the database, e.g. the list of tables.
See the following example which I took from a tutorial.
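// Requires java.sql.DatabaseMetaData and java.sql.ResultSet; "connection" is an open JDBC connection.
DatabaseMetaData databaseMetaData = connection.getMetaData();
// Passing null for catalog, schema and table name pattern means "do not filter on it".
ResultSet resultSet = databaseMetaData.getTables(null, null, null, new String[]{"TABLE"});
System.out.println("Printing TABLE_TYPE \"TABLE\" ");
System.out.println("----------------------------------");
while (resultSet.next()) {
    System.out.println(resultSet.getString("TABLE_NAME"));
}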
Note: JPA is most likely not the right tool for the job. Consider using Spring's JdbcTemplate instead.
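To go from the table list to the structure of one selected table, the same `DatabaseMetaData` object also offers `getColumns`. Below is a minimal sketch under some assumptions: `SchemaReader`, `Column`, and `toCreateTable` are hypothetical names, and the generated DDL uses only the column name, `TYPE_NAME`, and nullability. Lengths, defaults, keys, and indexes are not carried over, and type names may need mapping when the source and target are different database products.

```java
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: reads column definitions and renders a simple CREATE TABLE statement.
final class SchemaReader {

    // Hypothetical value type describing one column.
    static final class Column {
        final String name;
        final String typeName;
        final boolean nullable;

        Column(String name, String typeName, boolean nullable) {
            this.name = name;
            this.typeName = typeName;
            this.nullable = nullable;
        }
    }

    // Reads the columns of one table; note that the table argument is a LIKE-style pattern.
    static List<Column> readColumns(DatabaseMetaData metaData, String schema, String table)
            throws SQLException {
        List<Column> columns = new ArrayList<>();
        try (ResultSet rs = metaData.getColumns(null, schema, table, "%")) {
            while (rs.next()) {
                columns.add(new Column(
                        rs.getString("COLUMN_NAME"),
                        rs.getString("TYPE_NAME"),
                        rs.getInt("NULLABLE") != DatabaseMetaData.columnNoNulls));
            }
        }
        return columns;
    }

    // Renders "CREATE TABLE t (a TYPE NOT NULL, b TYPE)" from the column list.
    static String toCreateTable(String table, List<Column> columns) {
        return columns.stream()
                .map(c -> c.name + " " + c.typeName + (c.nullable ? "" : " NOT NULL"))
                .collect(Collectors.joining(", ", "CREATE TABLE " + table + " (", ")"));
    }
}
```

The resulting statement could be run against the target database with `jdbcTemplate.execute(ddl)` once the user has picked a table.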

Specifying a Sharded Collection with Spring Data MongoDB

I am using Spring Boot and Spring Data MongoDB to interface with an underlying sharded MongoDB cluster. My Spring Boot Application access the cluster via a mongos router.
Using Spring Data MongoDB, you can specify the collection an object is persisted to via @Document(collection = "nameOfCollection"), or it defaults to the class name (first letter lowercase). These collections do not need to exist beforehand; they can be created at runtime.
To shard a collection in MongoDB, you need to
1 - Enable sharding on the Database: sh.enableSharding("myDb")
2 - Shard the collection on a sharded database: sh.shardCollection("myDb.myCollection", {id:"hashed"})
Assuming there is an existing sharded database, does Spring Data MongoDB offer a way to shard a collection with a shard key? As far as I can tell, I cannot shard a collection with Spring, and therefore must configure the sharded collection before my Boot application runs. I find it odd that Spring would allow me to use undefined collections, but does not provide a way to configure the collection.
Edit:
I have seen both Sharding with spring mongo and How configuring access to a sharded collection in spring-data for mongo? which refer more to the deployment of a sharded MongoDB cluster. This question assumes all the plumbing is there and that the collection itself simply must be sharded.
Although this question is old, I had the same question, and it looks like there has recently been a way to provide a custom shard key.
Annotation-based shard key configuration is available in spring-data-mongodb 3.x:
https://docs.spring.io/spring-data/mongodb/docs/3.0.x/reference/html/#sharding
@Document("users")
@Sharded(shardKey = { "country", "userId" })
public class User {
    @Id
    Long id;
    @Field("userid")
    String userId;
    String country;
}
As of this writing, though, spring-boot-starter-data-mongodb still ships with a 2.x version.
Even though this is not a Spring Data solution, a potential workaround is posed in how to execute mongo admin command from java, where DB can be acquired from a Spring MongoTemplate.
DB db = mongo.getDB("admin");
DBObject cmd = new BasicDBObject();
cmd.put("shardcollection", "testDB.x");
cmd.put("key", new BasicDBObject("userId", 1));
CommandResult result = db.command(cmd);
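The snippet above uses the legacy `DB` API. With the newer driver API, the same admin command is usually expressed as a document, which can be built from JSON. The following is only a sketch of assembling that JSON: `ShardCommand` and `shardCollectionJson` are hypothetical names, and it assumes field names without quotes or backslashes. The key specification maps each field to `1` (ranged) or `"hashed"`, as in the `sh.shardCollection` call from the question.

```java
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: renders the shardCollection admin command as a JSON string.
final class ShardCommand {

    // namespace is "<database>.<collection>"; keySpec maps field name to 1 or "hashed".
    static String shardCollectionJson(String namespace, Map<String, Object> keySpec) {
        String key = keySpec.entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\": " + render(e.getValue()))
                .collect(Collectors.joining(", ", "{", "}"));
        return "{\"shardCollection\": \"" + namespace + "\", \"key\": " + key + "}";
    }

    // Numbers are emitted bare, everything else as a quoted string.
    private static String render(Object value) {
        return value instanceof Number ? value.toString() : "\"" + value + "\"";
    }
}
```

The string can be turned into a command with `org.bson.Document.parse(json)` and run with `MongoDatabase.runCommand(...)` on the `admin` database. Pass a map with a stable iteration order (for example a `LinkedHashMap`), since field order matters in a compound shard key. As far as I can tell from the Spring Data reference documentation, the `@Sharded` annotation does not shard the collection by itself, so a command like this still has to run once.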
I was running into the same problem with our update queries, which internally used save().
How was it solved?
I overrode the spring-data-mongodb dependency pulled in by the Spring Boot starter (2.1.x) with a 3.x release, which supports the @Sharded annotation:
<dependency>
<groupId>org.springframework.data</groupId>
<artifactId>spring-data-mongodb</artifactId>
<version>3.1.5</version>
</dependency>
allows you to say
@Document(collection = "hotelsdevice")
@Sharded(shardKey = { "device" })
public class Hotel extends BaseModel {
This lets Spring Data tell the underlying MongoDB which field is our shard key. I assume it will also fix our count() queries, which were failing with the same error, "query need to target a shard".

Spring Data : relationships between 2 different data sources

In a Spring Boot Application project, I have 2 data sources:
a MySQL database (aka "db1")
a MongoDB database (aka "db2")
I'm using Spring Data JPA and Spring Data MongoDB, and it's working great... one at a time.
Say db1 handles "Players" and db2 handles "Teams" (each with a list of player IDs). Is it possible to make relationships between those two heterogeneous entities work (i.e. @ManyToOne, @Transactional, lazy/eager loading, etc.)?
For example, I want to be able to write:
List<Player> fooPlayers = teamDao.findOneById(foo).getPlayers();
EDIT: If possible, I'd like to find a solution working with any spring data project
Unfortunately, your conundrum has no solution in Spring Data.
One possibility is to create a DAO interface of your own, whose implementation queries both of your databases. A very crude and short example would be:
YourDao {
    yourFind(id) {
        // find the team in db2 and get its list of player IDs
        findOneById(id)
        // for each player ID in that list, query db1
        getPlayer(player)
    }
}
I hope this points you in the right direction.
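The crude outline above can be made concrete without tying it to a particular Spring Data module. In this sketch the two stores are passed in as functions. `TeamPlayerDao`, `findPlayersByTeamId`, and the ID types (`String` for teams, `Long` for players) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical DAO that joins two stores in application code.
final class TeamPlayerDao<P> {

    private final Function<String, List<Long>> playerIdsByTeam; // backed by db2 (MongoDB)
    private final Function<Long, Optional<P>> playerById;       // backed by db1 (MySQL)

    TeamPlayerDao(Function<String, List<Long>> playerIdsByTeam,
                  Function<Long, Optional<P>> playerById) {
        this.playerIdsByTeam = playerIdsByTeam;
        this.playerById = playerById;
    }

    // Looks up the team's player IDs in db2, then resolves each ID against db1.
    List<P> findPlayersByTeamId(String teamId) {
        List<P> players = new ArrayList<>();
        for (Long playerId : playerIdsByTeam.apply(teamId)) {
            playerById.apply(playerId).ifPresent(players::add); // skip IDs with no matching player
        }
        return players;
    }
}
```

In a Spring application the two functions could be method references into a MongoDB repository and a JPA repository. The join happens in application code, so there is no lazy loading across stores, and a JPA transaction would not cover the MongoDB read.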

Resources