Apache Ignite Logs for joining two tables - spring-boot

I am getting some WARN logs from Apache Ignite when I execute a join query, and I don't know what to do.
When I create the caches, I set CacheConfiguration cfg.setAffinity(AffinityFunction instance); the instance has 24 partitions for both of them, and they also share the same nodeAffinityKey.
The warning says: "For join two partitioned tables join condition should contain the equality operation of affinity keys. Left side: EMPLOYEE; right side: DEPARTMENT"
I need to solve this.
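If I understand the warning correctly, the join condition has to compare the affinity key columns of the two tables directly. A sketch of the kind of query I believe it expects, assuming DEPARTMENT_ID is the affinity key column on EMPLOYEE and ID is the one on DEPARTMENT (the column names here are hypothetical):

    SELECT e.NAME, d.NAME
    FROM EMPLOYEE e
    JOIN DEPARTMENT d
      ON e.DEPARTMENT_ID = d.ID  -- equality on the affinity keys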

Related

How to read multiple tables using Spring Batch

I am looking to read data from multiple tables (different database tables), aggregate, and create a final result set. In my case, each query will return a List of objects. I searched the web many times and found no link other than "Spring Batch How to read multiple tables (queries) as Reader and write it as flat file write", but it returns only a single object.
Is there any way we can do this? Any working sample example would help a lot.
Example -
One query gives List of Departments - from Oracle DB
One query gives List of Employee - from Postgres
Now I want to build the Employee and Department relationship, send the combined object to a processor for a further lookup against MongoDB, and then send the final object to the writer.
The question should rather be "how to join three tables from three different databases and write the result in a file". There is no built-in reader in Spring Batch that reads from multiple tables. You either need to create a custom reader, or decompose the problem at hand into tasks that can be implemented using Spring Batch tasklet/chunk-oriented steps.
I believe you can use the driving query pattern in a single chunk-oriented step. The reader reads employee items, then a processor enriches each item with 1) the department from Oracle and 2) other info from Mongo (see the sketch below). This should work for small/medium datasets. If you have a lot of data, you can use partitioning to parallelize things and improve performance.
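A minimal sketch of the two queries involved, assuming hypothetical employees and departments tables linked by a department_id column:

    -- driving query executed by the reader (Postgres side)
    SELECT id, name, department_id FROM employees;

    -- per-item enrichment query executed by the processor (Oracle side)
    SELECT id, name FROM departments WHERE id = ?;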
Another option, if you want to avoid a query per item, is to load all departments into a cache (I would guess there are fewer departments than employees) and enrich items from the cache rather than with individual queries to the db.
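In that variant the cache could be primed once, at step start, with a single query (again against the hypothetical departments table):

    -- one-off query to prime the cache, e.g. a Map keyed by id,
    -- instead of one lookup per employee item
    SELECT id, name FROM departments;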

Syncing data between services using Kafka JDBC Connector

I have a system with a microservice architecture. It has two services, Service A and Service B, each with its own database, as in the following diagram.
As far as I understand having a separate database for each service is a better approach. In this design each service is the owner of its data, it's responsible for creating, updating, deleting and enforcing constraints.
In order to have Service A's data in Database B, I was thinking of using the Kafka JDBC Connector, but I am not sure whether Table1 and Table2 in Database B should enforce the constraints from Database A.
If a constraint, like the foreign key from Table2 to Table1, should exist in Database B, is there a way to have the connector know about this?
What are other common or better ways to sync data or solve this problem?
The easiest solution seems to be to sync per table without any constraints in Database B. That would make things easier, but it could also lead to a situation where Service A's data in Service B is inconsistent, for example entries in Table2 that point to a non-existing entry in Table1.
If a constraint, like the foreign key from Table2 to Table1, should exist in Database B, is there a way to have the connector know about this?
No, unfortunately the Kafka JDBC Connector does not know about constraints.
Based on your question I assume that Table1 and Table2 in Database B are duplicates of tables which exist in Database A, and that Database A has constraints which you are not sure you should also add in Database B.
If that is the case, then I am not sure that using the Kafka JDBC Connector to sync the data is the best choice.
You have a couple of options:
Enforce constraints like foreign keys in Database B, but update it from your application level and not through the Kafka JDBC Connector. For this option you cannot use the Kafka JDBC Connector; you would need to write a small service/worker that reads the data from that Kafka topic and populates your database tables. This way you control what is saved to the db, and you can validate the constraints even before trying to save to your database (see the first sketch after these options). But the question here is: do you really need the constraints? They are important in micro-service-A, but do you really need them in micro-service-B, which holds just a copy of the data?
Don't use constraints and allow temporary inconsistency. This is common in the micro-services world. When working with distributed systems you always have to think about the CAP theorem, so you accept that some data might at some point be inconsistent, but you make sure that it is eventually brought back to a consistent state. This means you would need to develop, at the application level, some cleanup/healing mechanism which recognizes this data and corrects it (see the second sketch after these options). So DB constraints do not necessarily have to be enforced on data which the micro-service does not own and which is considered external to that micro-service's Domain.
Rethink your design. Usually we duplicate data from micro-service-A in micro-service-B in order to avoid coupling between the services, so that micro-service-B can live and operate even when micro-service-A is down or not running for some reason. We also do it to reduce the load that micro-service-B would otherwise place on micro-service-A for every operation which needs data from Table1 and Table2. Table1 and Table2 are owned by micro-service-A, and micro-service-A is the only source of truth for this data; micro-service-B is using a duplicate of that data for its operations.
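For illustration, two hedged Postgres-style sketches of the first two options, assuming a hypothetical table1_id foreign key column on Table2:

    -- option 1: the worker validates the constraint itself, inserting
    -- only when the referenced Table1 row already exists
    INSERT INTO Table2 (id, table1_id, payload)
    SELECT ?, ?, ?
    WHERE EXISTS (SELECT 1 FROM Table1 WHERE id = ?);

    -- option 2: a periodic healing job detects orphaned Table2 rows
    -- so the application can correct or remove them
    SELECT t2.id
    FROM Table2 t2
    LEFT JOIN Table1 t1 ON t1.id = t2.table1_id
    WHERE t1.id IS NULL;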
Looking at your database design, the following questions might help you figure out what would be the best option for your system:
Is it necessary to duplicate the data in micro-service-B?
If I duplicate the data, do I need both tables, and do I need all their columns/data in micro-service-B? Usually you store/duplicate only the subset of the Entity/Table that you need.
Do I need the same table structure in micro-service-B as in micro-service-A? You have to decide this based on your Domain, but very often you denormalize your tables and change them to fit the needs of micro-service-B's operations. As usual, all these design decisions depend on your application Domain and use case.

How to read/write in Kafka using JDBC with conditions

I have a Kafka server that works fine for syncing a table between servers. My DB is PostgreSQL and I'm using the JDBC sink/source connectors.
Now my question is: how can I read data from two tables on the source side and insert the data into four different tables on the sink side?
example:
Source table: Users, Roles
Sink tables: Workers, Managers, Employers, ...
On the parent server, all users are available in the Users table and have a relation to the Roles table. On the other side, I want to insert the data into a specific table according to each user's role.
For the JDBC Sink you need one topic per target table. Thus you need four topics, one per target table, populated with the joined data. This join needs to happen at some point in the pipeline. Options would be:
As part of the JDBC Source, using the query option of the connector. Build four connectors, each with the necessary query to populate its target topic, with the join done on the Postgres side in SQL (see the sketch after these options).
As a streaming application, e.g. in Kafka Streams or KSQL. The JDBC Source would pull in the source Users and Roles tables, and you'd perform the join as each record flows through.
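For illustration, a hedged sketch of the first option, assuming a hypothetical role_id column on Users referencing Roles:

    -- query for the source connector that feeds the Managers topic;
    -- one such connector/query per target table
    SELECT u.id, u.name, r.name AS role_name
    FROM Users u
    JOIN Roles r ON r.id = u.role_id
    WHERE r.name = 'Manager';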

NiFi: join two tables from different databases

I have two transactional tables originating from different databases on different servers. I would like to join them based on a common attribute and store the result in a different database.
I have been looking for various options in NIFI to execute this as a job which runs monthly.
So far I have been trying out various options, but none seem to work. For example, I used ExecuteSQL1 & ExecuteSQL2 -> MergeContent -> PutSQL.
Could anyone provide pointers on the same?
NiFi is not really meant to do a streaming join like this. The best option would be to implement the join in the SQL query using a single ExecuteSQL processor.
As Bryan said, NiFi doesn't (currently) do this. Perhaps look at Presto: you can set up multiple connections "under the hood" and use its JDBC driver to do what Bryan described, a join across tables in different DBs.
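For illustration, a hedged sketch of that Presto approach, with hypothetical catalog, schema and table names (assuming catalogs db1 and db2 have been configured):

    -- one query joining tables that live in two different databases
    SELECT t1.txn_id, t1.amount, t2.customer_name
    FROM db1.public.transactions t1
    JOIN db2.public.customers t2 ON t2.customer_id = t1.customer_id;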
I'm thinking about adding a JoinTables processor that would let you join two tables using two different DBCPConnectionPool controller services, but there are lots of things to consider, such as whether the join can be done in memory. For joining dimensions to fact tables, we could try to load the smaller table into memory and then do more of a streaming join against the larger fact table. Feel free to file a New Feature Jira if you like, and we can discuss it there.

Generating a star schema in Hive

I am from the SQL data warehouse world, where I generate dimension and fact tables from a flat feed. In general data warehouse projects we divide the feed into facts and dimensions. Ex:
I am completely new to Hadoop, and I came to know that I can build a data warehouse in Hive. Now, I am familiar with using GUIDs, which I think are applicable as primary keys in Hive. So, is the strategy below the right way to load facts and dimensions in Hive?
Load source data into a Hive table; let's say Sales_Data_Warehouse
Generate dimensions from Sales_Data_Warehouse; ex:
SELECT New_Guid(), Customer_Name, Customer_Address FROM Sales_Data_Warehouse
When all dimensions are done, then load the fact table like:
SELECT New_Guid() AS Fact_Key, Customer.Customer_Key, Store.Store_Key...
FROM Sales_Data_Warehouse AS source
JOIN Customer_Dimension AS Customer
  ON source.Customer_Name = Customer.Customer_Name
 AND source.Customer_Address = Customer.Customer_Address
JOIN Store_Dimension AS Store
  ON Store.Store_Name = source.Store_Name
JOIN Product_Dimension AS Product ON .....
Is this the way I should load my fact and dimension tables in Hive?
Also, in general warehouse projects we need to update dimension attributes (ex: Customer_Address is changed to something else) or have to update the fact table's foreign keys (rarely, but it does happen). So, how can I have an INSERT-UPDATE load in Hive (like we do with a Lookup in SSIS or a MERGE statement in T-SQL)?
We still get the benefits of dimensional models on Hadoop and Hive. However, some features of Hadoop require us to slightly adapt the standard approach to dimensional modelling.
The Hadoop File System is immutable; we can only add, not update, data. As a result we can only append records to dimension tables (while Hive has added an update feature and transactions, this seems to be rather buggy). Slowly Changing Dimensions on Hadoop become the default behaviour. In order to get the latest and most up-to-date record in a dimension table we have three options. First, we can create a View that retrieves the latest record using windowing functions. Second, we can have a compaction service running in the background that recreates the latest state. Third, we can store our dimension tables in mutable storage, e.g. HBase, and federate queries across the two types of storage.
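A minimal sketch of the first option, assuming a hypothetical customer_dim table with a customer_id natural key and a load_ts ingestion timestamp:

    -- view exposing only the most recent version of each customer
    CREATE VIEW customer_dim_latest AS
    SELECT * FROM (
      SELECT d.*,
             ROW_NUMBER() OVER (PARTITION BY customer_id
                                ORDER BY load_ts DESC) AS rn
      FROM customer_dim d
    ) t
    WHERE rn = 1;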
The way data is distributed across HDFS makes it expensive to join data. In a distributed relational database (MPP) we can co-locate records with the same primary and foreign keys on the same node in a cluster, which makes it relatively cheap to join very large tables: no data needs to travel across the network to perform the join. This is very different on Hadoop and HDFS. There, tables are split into big chunks and distributed across the nodes of the cluster, and we don't have any control over how individual records and their keys are spread across it. As a result, joins of two very large tables on Hadoop are quite expensive, as data has to travel across the network. We should avoid joins where possible. For a large fact and dimension table we can de-normalise the dimension table directly into the fact table. For two very large transaction tables we can nest the records of the child table inside the parent table and flatten out the data at run time, using SQL extensions such as array_agg in BigQuery/Postgres (see the sketch below) to handle multiple grains in a fact table.
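A hypothetical BigQuery-style sketch of that nesting, assuming tables orders(order_id, order_date) and order_items(order_id, product_id, quantity):

    -- nest each order's items into an array column on the parent row
    SELECT o.order_id,
           o.order_date,
           ARRAY_AGG(STRUCT(oi.product_id, oi.quantity)) AS items
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.order_id
    GROUP BY o.order_id, o.order_date;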
I would also question the usefulness of surrogate keys: why not use the natural key? Performance with complex compound keys may be an issue, but otherwise surrogate keys are not really useful, and I never use them.
