I have two transactional tables originating from different databases on different servers. I would like to join them on a common attribute and store the result in a third database.
I have been looking at various options in NiFi to run this as a monthly job.
So far I have tried a few approaches, but none of them seem to work out. For example, I used ExecuteSQL1 & ExecuteSQL2 -> MergeContent -> PutSQL.
Could anyone provide pointers on the same?
NiFi is not really meant to do a streaming join like this. The best option would be to implement the join in the SQL query using a single ExecuteSQL processor.
As Bryan said, NiFi doesn't (currently) do this. Perhaps look at Presto: you can set up multiple connections "under the hood" and use its JDBC driver to do what Bryan described, a join across tables in different DBs.
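As a hedged sketch of what such a federated Presto query might look like, assuming you have configured, say, a mysql and a postgresql catalog (all catalog, schema, table and column names below are placeholders):

-- Hypothetical Presto federated query; each catalog is just a connector config on the Presto side.
SELECT o.order_id,
       o.amount,
       c.customer_name
FROM mysql.sales.orders o
JOIN postgresql.crm.customers c
  ON o.customer_id = c.customer_id;

NiFi would then only need a single JDBC connection, to Presto itself, and a single ExecuteSQL processor.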
I'm thinking about adding a JoinTables processor that would let you join two tables using two different DBCPConnectionPool controller services, but there are lots of things to consider, such as whether the join can be done in memory. For joining dimensions to fact tables, we could load the smaller table into memory and then do more of a streaming join against the larger fact table. Feel free to file a New Feature Jira if you like, and we can discuss there.
Currently we have data in a transactional database (Oracle) and are fetching data through queries to build reports, e.g. fetching all people under company A along with their details and lookup values from a few more tables. It looks something like:
select p.name,
       p.address,
       (select country_name from country where country_id = p.country_id),
       ...
from person p, company c, person_file pf, ...
where c.company_id = p.company_id
  and c.company_id = 1
  ... <all remaining joins and conditions for the tables>
The query takes a long time to fetch the records when a company has a large number of people. My question is: what would be a better reporting solution, in terms of design and technology, to get results faster, given that I don't want to stay on Oracle since the data will keep growing? Logically, the answer would be to implement something that works in parallel. An option like Spark seems like overkill.
First of all, if you want to keep Oracle as the existing store, you can use Spark as your parallel processing framework. It has a learning curve, but with Spark SQL you can use your own query to read data from Oracle. You can read the data in parallel, though the degree of parallelism depends on how many concurrent sessions are allowed by your Oracle profile; please check with the DBA.
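A minimal sketch of that parallel read in Spark SQL (the connection string, table, column and credential values are placeholders; the partitioning options are what Spark's JDBC source uses to split the read across sessions):

-- Expose the Oracle table as a temporary view, read with N parallel partitions.
CREATE TEMPORARY VIEW person_src
USING jdbc
OPTIONS (
  url 'jdbc:oracle:thin:@//dbhost:1521/ORCL',  -- placeholder connection string
  dbtable 'PERSON',
  user 'report_user',
  password 'change_me',
  partitionColumn 'PERSON_ID',                 -- numeric column used to split the read
  lowerBound '1',
  upperBound '10000000',
  numPartitions '8'                            -- roughly the number of parallel Oracle sessions
);

SELECT count(*) FROM person_src WHERE company_id = 1;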
Another option is migrating to a NoSQL database like Cassandra so you can scale your machines horizontally rather than vertically. The migration won't be easy or straightforward, though: since NoSQL databases do not support joins by design, the data model has to change accordingly. Once that is done you can run Spark on top of it. You could also consider Talend, which ships predefined Spark components.
I'm trying to think out loud here to understand if GraphQL is a likely candidate for my needs.
We have a home-grown self servicing report creation tool. This is web-based. It starts with user selecting a particular report type.
The report type in itself is a base SQL query. In subsequent screens, one can select the required columns, filters, etc. The output of all these steps is a SQL query, which is then run on an Oracle database.
As you can see, there are a lot of cons with this tool. It is tightly coupled to the Oracle OLTP tables, and there are hundreds of tables.
Given the current data model, and the presence of many tables, I'm wondering if GraphQL would be the right approach to design a UI that could act like a "data explorer". If I could combine some of the closely related tables and abstract them via GraphQL into logical groups, I'm wondering if I could create a report out of them.
**Logical Group 1**
Table1
Table2
Table3
Table4
Table5
**Logical Group 2**
Table6
Table7
Table8
Table9
Table10
and so on..
Let's say I want 2 columns from tables in Logical Group 1 and 4 columns from Logical Group 2. Is this something that could be defined as a GraphQL object and retrieved, to be either rendered on a screen or written to a file?
I think I'm trying to write a data modelling UI via GraphQL. Is this even a good candidate for such a need?
We have also been evaluating Looker as a possible data modelling layer. However, it seems like there could be some limitations there as well.
Thanks.
Without understanding your data better, it is hard to say for certain, but at first glance, this does not seem like a problem that is well suited to GraphQL.
GraphQL's strength is its ability to model + traverse a graph of data. It sounds to me like you are not so much traversing a continuous graph of data as cherry picking tables from a DB. It certainly is possible, but there may be a good deal of friction since this was not its intended design.
The litmus test I would use is the following two questions:
Can you imagine your problem mapping well to a REST API?
Does your API get consumed by performance-sensitive clients?
If so, then GraphQL may serve your needs well; if not, you may want to look at something like https://grpc.io/
We seem to have a bit of a debate on a discussion point in our team.
We are working on a Data Warehouse in the Microsoft SQL Server 2012 platform. We have followed the Kimball Architecture to build this Data Warehouse.
Issue:
A reporting solution (built on SSRS), which sources data from this Warehouse, has significant performance issues when sourcing data from fact and dim tables. Some of our team members suggest that we extract data from facts and dims into a new set of tables using SSIS packages. This would mean we denormalise these tables into ‘Snapshot’ tables. In this way we would not need to join these tables to create data sets within the reports; data could be read out of these tables directly.
I do have my own worries about this: inconsistencies, maintenance of different data structures, duplication of data, etc., to name a few.
Question:
Would you consider creating snapshot tables (by denormalising fact and dim tables) for reporting to be the right approach?
Would like to hear your thoughts on this.
Cheers
Nithin
I don't think there is anything wrong with snapshot tables. The two most important aspects of a data warehouse are:
The data is correct.
The data is useful.
If your users are unable to extract the totals they require, in a reasonable timescale, they won't use the warehouse.
My own solution includes 3 snapshot tables. Like you, I was worried about inconsistencies. To address this we built an automated checking process. This sub-system executes a series of queries, stored on a network drive, once an hour. Any records returned by the queries are considered a fail. Fails are reported and immediately investigated by my ETL team. This sub-system ensures the snapshots and underlying facts are always aligned and consistent with each other. Drift is prevented.
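As an illustration, one of those checks might look something like the following (a sketch only; the snapshot and fact table names, keys and measures are placeholders). Any rows returned mean the snapshot has drifted from the underlying fact:

SELECT s.date_key,
       s.total_amount AS snapshot_total,
       f.total_amount AS fact_total
FROM dbo.SalesSnapshot s
JOIN (
    SELECT date_key, SUM(amount) AS total_amount
    FROM dbo.FactSales
    GROUP BY date_key
) f ON f.date_key = s.date_key
WHERE s.total_amount <> f.total_amount;   -- any returned row is a 'fail'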
That said, additional tables mean additional complexity, and that requires more time/effort to manage. Before introducing another layer to your warehouse, you should investigate why these queries are underperforming. If joins are to blame:
Are you using an inappropriate data type for your primary/foreign keys?
Are the foreign keys indexed (some RDBMSs do this by default, others do not)? See the sketch after this list.
Have you looked at the execution plans for the offending queries?
Is the join really to blame, or is it a filter applied to the dim table?
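If the execution plan shows the fact table's foreign key column being scanned during the join, an index along these lines is usually the first thing to try (hypothetical object names; SQL Server does not index foreign keys automatically):

CREATE NONCLUSTERED INDEX IX_FactSales_CustomerKey
    ON dbo.FactSales (customer_key);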
For raw cube performance my advice would be to always try to denormalize your tables and have one fact table plus one table for each dimension (star schema).
If you are unsure whether it will actually help, you could start by creating materialized views. These are kind of the best of both worlds; in the long run you should alter your ETL.
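On SQL Server (the platform mentioned above) the closest equivalent of a materialized view is an indexed view; a minimal sketch, with placeholder object names, might be:

CREATE VIEW dbo.vw_SalesFlat
WITH SCHEMABINDING
AS
SELECT f.sales_key, d.calendar_date, p.product_name, f.sales_amount
FROM dbo.FactSales f
JOIN dbo.DimDate d    ON d.date_key = f.date_key
JOIN dbo.DimProduct p ON p.product_key = f.product_key
GO
-- The unique clustered index is what actually materializes the view.
CREATE UNIQUE CLUSTERED INDEX IX_vw_SalesFlat ON dbo.vw_SalesFlat (sales_key)
GO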
In my previous job we only had flattened tables, which worked quite well. Currently we have a normalized schema but flatten it in the last step.
We have different sets of data in different systems such as Hadoop, Cassandra and MongoDB, but our analytics team wants to query the stitched data across these systems. For example, customer information with demographics lives in one system while their transactions live in another, and analysts should be able to answer questions like "what was the transaction volume for US users?". We need to build an application that provides an easy way to interact with the different systems. What is the best way to do this?
Another requirement:
If we want to provide a custom workspace in a system like MongoDB, they should be able to easily play with it. What is the best strategy to pull data from one system to another on demand?
Any pointers or common architectures used to solve this kind of problem would be really helpful.
I see two questions here:
How can I consolidate data from different systems into one system?
How can I create some data in Mongo for people to experiment with?
Here we go ... =)
I would pick one system and target that for consolidation. In other words, between Hadoop, Cassandra and MongoDB, which one does your team have the most experience with? Which one do you find easiest to query with? Which one do you have set up to scale well?
Each one has pros and cons around scale, storage and queryability.
I would pick one and then pump all data to that system. At a recent job, that ended up being MongoDB. It was easy to move data to Mongo and it had by far the best query language. It also had a great community and setting up nodes was easier than Hadoop, etc.
Once you have solved (1), you can trim your data set and create a scaled down sandbox for people to run ad-hoc queries against. That would be my approach. You don't want to support the entire data set, because it would likely be too expensive and complicated.
If you were doing this in a relational database, I would say just run a
select top 1000 * from [table]
query on each table and use that data for people to play with.
I'm wondering if it is possible to pull from more than one database in the same Source Qualifier. You can only specify a single database connection per Source Qualifier, so I'm not sure if this is possible.
Ben,
If they are both from the same database vendor and db links are set up, you can use DBLINKS in the source qualifier.
select a.col1,
       b.col2
from schema1.table1 a,
     schema1.table2@db2 b
where a.col3 = b.col4;
But if they are heterogeneous databases, I think the best way to implement this would be to create two different Source Qualifiers (or different lookups, based on your requirements and the number of columns) and use the parameter file / session to specify the different connections.
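For completeness, if the db2 link used above does not exist yet, a DBA would typically create it along these lines (the link name, credentials and TNS alias are placeholders):

CREATE DATABASE LINK db2
  CONNECT TO remote_user IDENTIFIED BY remote_password
  USING 'DB2_TNS_ALIAS';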
Assuming that the account used to connect has equivalent rights in both databases, and that both databases live on the same server, it's DATABASE_NAME.TABLE_NAME:
SELECT
a.id
,a.name
,a.company
,b.company_id
,b.company_name
,b.address
FROM
database1.users as a
JOIN
database2.companies as b ON a.company=b.company_id
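If the two databases sit on different servers, the same idea still works on SQL Server through a linked server and four-part naming; a hedged sketch (the linked server name, schemas and columns are placeholders):

SELECT a.id,
       a.name,
       b.company_name
FROM database1.dbo.users AS a
JOIN [RemoteServer].database2.dbo.companies AS b
  ON a.company = b.company_id;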
I would implement this using database links, which allow queries across two databases.
Although not a preferred solution for many reasons, this would help you achieve what you described.
However, from an ideal solution perspective, you should not be doing this in the first place :)
If the data is coming from two different databases, read it through two different Source Qualifiers and then, depending on your needs, go for a Joiner or a Lookup (it depends on the functional requirements), etc.
If the login has at least read access and the schemas are on the same service, it is possible.
For some reason our DBAs don't allow DB links...
One reason to use Informatica is that you can create a specific Source Qualifier (SQ) for each source and then use a Joiner/Union transformation... Believe me, if you have issues in one of the data sources, fixing and troubleshooting will be easier.
Also, imagine that you leave the company and another team takes over those jobs; graphically and logically it will be easier to maintain...