Can Datagrip do cross DB queries/joins? - datagrip

I'm looking to use DG to join data from Snowflake and Impala or Hadoop. I can't see how I can do this, as the job is sent to the DB to run the query. Are there any workarounds, or a way to hold the data/table in memory to use across sessions?
Thanks

The workaround is to create tables from the results/tables of the other databases, and then JOIN with these tables.
Please see: https://www.jetbrains.com/help/datagrip/tables-copy.html
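For example (a purely illustrative sketch: the table names are invented, and it assumes the Impala table has already been copied into the Snowflake data source with the table-copy feature linked above):
SELECT c.customer_id, c.customer_name, o.order_total
FROM customers c                 -- native Snowflake table
JOIN impala_orders_copy o        -- copy of the Impala table made with DataGrip
  ON o.customer_id = c.customer_id;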

Related

Data replication between Oracle and Postgres

Is there a way to replicate data (e.g. via triggers or jobs) from Oracle tables to Postgres tables and vice versa (for a different set of tables) without using external tools? Just one-way replication for both scenarios.
Just a hint:
You can think of creating a DB link from Oracle to Postgres, known as heterogeneous connectivity, which makes it possible to select data from Postgres with a SELECT statement in Oracle.
Then use materialized views to schedule and store the results of those selects.
That's because you don't want to use any external tool; otherwise the solution would have been much simpler.
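For illustration only, assuming a heterogeneous database link named PG_LINK has already been set up (e.g. via Oracle Database Gateway / DG4ODBC) and a Postgres table called orders; all names are placeholders:
CREATE MATERIALIZED VIEW orders_from_pg
  REFRESH COMPLETE
  START WITH SYSDATE
  NEXT SYSDATE + 1/24    -- refresh every hour
AS
  SELECT * FROM "orders"@PG_LINK;
Note that with heterogeneous links the remote identifiers usually have to be quoted exactly as they exist in Postgres.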
For 20 tables I need to replicate data from Oracle to Postgres. For 40 different tables, I need to replicate from Postgres to Oracle.
I could imagine the following setup:
For the Oracle tables that need to be accessible from Postgres, simply create foreign tables inside the Postgres server. They appear to be "local" tables in the Postgres server, but the FDW ("foreign data wrapper") will forward any request to the Oracle server, so no replication is required. Whether or not this will be fast enough depends on how you access the tables. Some operations (WHERE clause, ORDER BY etc.) can be pushed down to the Oracle server; some will be done by the Postgres server after all the rows have been fetched.
For the Postgres tables that need to be replicated to Oracle you could have a similar setup: create a foreign table that points to the target table in Oracle. Then create triggers on the Postgres table that will update the foreign table, thus sending the changes to Oracle.
This could all be managed on the Postgres side.
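A rough sketch of that setup, assuming the oracle_fdw extension is available on the Postgres server; all object names (oracle_srv, ora_orders, order_log, ...) and column types are placeholders, not taken from the question:
-- Oracle -> Postgres direction: expose the Oracle table as a foreign table
-- (reads are forwarded to Oracle, nothing is replicated)
CREATE EXTENSION oracle_fdw;
CREATE SERVER oracle_srv FOREIGN DATA WRAPPER oracle_fdw
    OPTIONS (dbserver '//oracle-host:1521/ORCL');
CREATE USER MAPPING FOR postgres SERVER oracle_srv
    OPTIONS (user 'scott', password 'tiger');
CREATE FOREIGN TABLE ora_orders (
    order_id integer,
    customer text,
    amount   numeric
) SERVER oracle_srv OPTIONS (schema 'SCOTT', table 'ORDERS');

-- Postgres -> Oracle direction: a foreign table pointing at the Oracle target,
-- plus a trigger on the local table that pushes every change through it
CREATE FOREIGN TABLE ora_order_log (
    order_id integer OPTIONS (key 'true'),  -- oracle_fdw needs a key column for UPDATE/DELETE
    status   text
) SERVER oracle_srv OPTIONS (schema 'SCOTT', table 'ORDER_LOG');

CREATE FUNCTION push_order_log() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO ora_order_log VALUES (NEW.order_id, NEW.status);
    ELSIF TG_OP = 'UPDATE' THEN
        UPDATE ora_order_log SET status = NEW.status WHERE order_id = NEW.order_id;
    ELSE
        DELETE FROM ora_order_log WHERE order_id = OLD.order_id;
    END IF;
    RETURN NULL;  -- AFTER trigger, return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER order_log_to_oracle
AFTER INSERT OR UPDATE OR DELETE ON order_log
FOR EACH ROW EXECUTE PROCEDURE push_order_log();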

Find a table across databases in hadoop

I would like to find a specific table across multiple databases in Hadoop. I'm looking for an automatic solution since there are dozens of databases involved.
Is there a hive command that could help me to do this?
Or do I have to write something in bash instead?
Thanks
You can simply query your metastore. In my case the metastore is MySQL, so I did it this way:
Connect to your metastore, e.g. mysql -uUser -hHost -pPassword
Use your metastore DB, e.g. use metastoredb;
select * from TBLS where TBL_NAME='table_name';
I queried three columns, like this:
select TBL_ID,DB_ID,TBL_NAME from TBLS where TBL_NAME='ri_reg_datamodels_tmp';
Ask me if you run into any issues with it.
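If you also need to know which database each match belongs to, you can join TBLS with DBS (assuming the standard MySQL-backed metastore schema):
select d.NAME as db_name, t.TBL_NAME
from TBLS t
join DBS d on t.DB_ID = d.DB_ID
where t.TBL_NAME = 'table_name';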

How to convert CONNECT BY in greenplum

Can anyone suggest how to convert a CONNECT BY Oracle query into Greenplum? Greenplum doesn't support recursive queries, so we cannot use WITH RECURSIVE. Is there any alternative way to rewrite the query below?
SELECT child_id, Parnet_id, LEVEL, SYS_CONNECT_BY_PATH(child_id, '/') AS HIERARCHY
FROM pathnode
START WITH Parnet_id = child_id
CONNECT BY NOCYCLE PRIOR child_id = Parnet_id;
There are ways to do this but it will be a one-off per query. You will need to create a function that loops through your pathnode table and "return next" to return each row. You can search on this site to find examples of doing this with PostgreSQL 8.2.
Work is happening to rebase Greenplum to PostgreSQL 8.3, 8.4, and so on. Those later PostgreSQL versions support "with recursive" which is the ANSI SQL way to write your SQL but Greenplum doesn't support it yet. When it does get supported by Greenplum, I don't think it will perform all that well. The query will force looping and individual row lookups. This works great in an OLTP database but not so well for an MPP database.
I suggest you transform your data in Oracle with a VIEW and then just dump the view to a file to load into Greenplum. The DDL of having a self-referencing, N-level table will never be a good idea in an MPP database.
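To make the "return next" idea concrete, here is a rough, untested sketch for a PostgreSQL 8.2-era Greenplum; the table and column names follow the question, and the integer types are an assumption:
CREATE TYPE connect_by_row AS (child_id int, parnet_id int, lvl int, hierarchy text);

-- Walk the tree below one node, emitting one row per visited node
-- (CONNECT BY NOCYCLE PRIOR child_id = Parnet_id, path built by hand)
CREATE OR REPLACE FUNCTION connect_by_pathnode(p_child int, p_parent int, p_lvl int, p_path text)
RETURNS SETOF connect_by_row AS $$
DECLARE
    node pathnode%ROWTYPE;
    r    connect_by_row;
    sub  connect_by_row;
BEGIN
    r.child_id  := p_child;
    r.parnet_id := p_parent;
    r.lvl       := p_lvl;
    r.hierarchy := p_path || '/' || p_child::text;
    RETURN NEXT r;

    FOR node IN SELECT * FROM pathnode WHERE Parnet_id = p_child LOOP
        -- NOCYCLE: skip nodes already on the current path
        IF position('/' || node.child_id::text || '/' IN r.hierarchy || '/') > 0 THEN
            CONTINUE;
        END IF;
        FOR sub IN
            SELECT * FROM connect_by_pathnode(node.child_id, node.Parnet_id, p_lvl + 1, r.hierarchy)
        LOOP
            RETURN NEXT sub;
        END LOOP;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;

-- Driver: START WITH Parnet_id = child_id
CREATE OR REPLACE FUNCTION connect_by_all() RETURNS SETOF connect_by_row AS $$
DECLARE
    root pathnode%ROWTYPE;
    sub  connect_by_row;
BEGIN
    FOR root IN SELECT * FROM pathnode WHERE Parnet_id = child_id LOOP
        FOR sub IN SELECT * FROM connect_by_pathnode(root.child_id, root.Parnet_id, 1, '') LOOP
            RETURN NEXT sub;
        END LOOP;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;

SELECT * FROM connect_by_all();
Expect it to run row-by-row on the master, which is exactly the performance caveat mentioned above.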

Presto and hive partition discovery

I'm using Presto, mainly with the Hive connector, to connect to the Hive metastore.
All of my tables are external tables pointing to data stored in S3.
My main issue with this is that there is no way (at least none I'm aware of) to do partition discovery in Presto, so before I start querying a table in Presto I need to switch to Hive and run msck repair table mytable.
Is there a more reasonable way to do it in Presto?
I'm on version 0.227 and the following helps me:
select * from hive.yourschema."yourtable$partitions"
This select returns all the partitions mapped in your catalog. You can filter, order, etc. as in a normal query.
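For example (the partition column name ds is just an illustration):
select * from hive.yourschema."yourtable$partitions" where ds >= '2019-01-01' order by ds desc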
No.
If the Hive metastore doesn't see the partitions, PrestoDB will not see them.
Maybe a cron can help you.
There is now a way to do this:
CALL system.sync_partition_metadata(schema_name=>'<your-schema>', table_name=>'<your-table>', mode=>'FULL')
Credit to this post and this video

Microstrategy - HBase connection

We are trying to connect MS 9.4 to HBase via the Impala connector.
First we created the Hive tables, linking them to the HBase tables with the following CREATE TABLE (as we saw in the docs):
CREATE TABLE hiveTableName1
(key int, columnName1 codClient, columnName2 clientName)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,columnfamily1:columnName1,columnfamily1:columnName2")
TBLPROPERTIES ("hbase.table.name" = "hbaseTableName1");
We did this twice, since we want to create two Hive tables and their corresponding HBase tables, in order to perform a join between them later with MS.
For the connection between MS and HBase, we followed the steps of selecting the MicroStrategy ODBC Driver for Impala Wire Protocol, and filling in the Data Source Name (the Impala data source previously created with the Impala driver), host and port (both for the Impala installation in our AWS infrastructure), and impala/impala for credentials.
The thing is that when we complete the wizard and select the default namespace (which is the only one available; no other namespace has been created), we can see the Hive tables that we created before, instead of the HBase tables.
I mean:
hiveTableName1
hiveTableName2
instead of
hbaseTableName1
hbaseTableName2
And, since these are the only tables available, we can only build our report with these two tables: a very simple join between them on one field.
Both tables have 200,000 records and the join takes more than 1 minute to complete.
I'm sure that we are missing something here, and that the process of linking the Hive tables to the HBase ones is not completely right.
Is there a way to be able to connect to these two hbase tables instead of hive ones?
Any help will be really appreciated.
1. HBase does not support SQL and does not support the concept of a "join" anyway.
2. Mapping Hive tables onto HBase tables means that every Hive query triggers a full scan on the HBase side; the result is then fed to a MapReduce batch job that does the filters and the joins.
Bottom line: 1 min is quite fast for what you are doing.
If you expect sub-second results, try some "small data" technologies (e.g. MySQL, Oracle, even MS Access) or forget about joins.
For sub-minute results, you might give Apache Phoenix a try: it's an HBase wrapper with indexes and some kind of SQL. Not sure about ODBC/JDBC drivers, though.
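For what it's worth, a hedged sketch of what the Phoenix route could look like (assuming Phoenix is installed on the cluster; the row key name and the VARCHAR types are illustrative):
-- Map the existing HBase tables as Phoenix views
CREATE VIEW "hbaseTableName1" (
    pk VARCHAR PRIMARY KEY,
    "columnfamily1"."columnName1" VARCHAR,
    "columnfamily1"."columnName2" VARCHAR
);
CREATE VIEW "hbaseTableName2" (
    pk VARCHAR PRIMARY KEY,
    "columnfamily1"."columnName1" VARCHAR
);

-- The join then runs through Phoenix's SQL engine instead of Hive/MapReduce
SELECT t1."columnName2", t2."columnName1"
FROM "hbaseTableName1" t1
JOIN "hbaseTableName2" t2 ON t1.pk = t2.pk;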
