Where to do a join to flatten tables: Hive or Oracle?

I have 7 normalized tables in Oracle that I need to flatten (some columns, not all) to work with MapReduce jobs. I have two choices: do the join in Oracle and use Sqoop to import the joined table into HDFS, or import the tables one by one and then do the join in Hive itself.
Is there any difference between the two approaches? What are the pros and cons?
Thank you.

I am comfortable with both Oracle and Hive. In this case it seems reasonable to do the join in Oracle. You can ensure that all the moving parts are in sync and available.
You may also consider creating an Oracle view that embodies the join. You can then more repeatably verify and extract the contents of the various tables into your single denormalized one.
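As a rough illustration of the view idea, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Oracle. The tables (customers, orders), their columns, and the view name flat_orders are all hypothetical; the point is only that the view selects a subset of columns across the join, so the extract step reads one flat relation.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Oracle (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two hypothetical normalized tables.
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, internal_notes TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann', 'n/a'), (2, 'Bob', 'n/a');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.5);

    -- The view embodies the join and exposes only the columns to flatten.
    CREATE VIEW flat_orders AS
    SELECT o.order_id, c.name AS customer_name, o.amount
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id;
""")

# The extract step reads the view as if it were a single denormalized table.
flat_rows = conn.execute(
    "SELECT order_id, customer_name, amount FROM flat_orders ORDER BY order_id"
).fetchall()
print(flat_rows)
```

Because the join lives in the view definition, the same query can be rerun to verify the flattened contents before each import.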

Related

How to join three heterogeneous sources using single joiner

How can three heterogeneous sources be joined using a single joiner? The sources may be three flat files, tables from three different relational databases (Oracle, Teradata, SQL Server), or one flat file, one Oracle table, and one SQL Server table.
We need to use only one joiner. How can we implement this?
If you have 3 flat files, it is not possible to join them with one joiner. As mentioned in the Informatica Joiner Transformation documentation, the Joiner transformation joins 2 heterogeneous sources.
If you have 2 tables and 1 flat file, you could use a SQL override to join the 2 tables, then use a single joiner to join the result with the flat file.

Is it possible to make a join between tables that are in different databases?

I have two databases, one Oracle and the other PostgreSQL, and I need to run a SELECT that joins tables across those databases. Is there any way to make this possible?
That is simple.
Install oracle_fdw in the PostgreSQL database and define a foreign table for the Oracle table.
Then you can perform the join as if it were two PostgreSQL tables.
Be careful with big or complicated queries though: of course, the performance will be worse than for a join of two local tables.

Searching from all tables in oracle

I am developing a Java app that connects to an Oracle database and lets the user select column names from any tables. After the columns are selected, I have to query the data from the tables the user chose. My question is how to join all the tables in the database so that the query returns data successfully. I want to be able to connect to any Oracle schema, not only a specific one. I will write the logic in Java, but I cannot find a query that can extract the data from all tables. I tried a natural join among all tables, but it depends on the connecting columns having the same name. So I want to know a generic way that works in all conditions.
As others have mentioned, it seems that there are other tools out there that you should probably leverage before trying to roll your own complex solution.
That said, if you wish to roll your own solution, you could look into some of Oracle's data dictionary views, such as:
Select * from all_tables;
Select * from all_tab_cols;
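One way the dictionary output might be used is to validate the user's column choices before building a query. The sketch below is an assumption-laden illustration, not a complete solution: the rows are hard-coded in the shape of SELECT owner, table_name, column_name FROM all_tab_cols, the HR schema and its tables are hypothetical, and the helpers index_columns and build_select are made-up names. It handles a single table only and does not solve the generic join problem.

```python
from collections import defaultdict

# Sample rows shaped like:
#   SELECT owner, table_name, column_name FROM all_tab_cols
# Hard-coded here; the schema and table names are hypothetical.
DICTIONARY_ROWS = [
    ("HR", "EMPLOYEES", "EMPLOYEE_ID"),
    ("HR", "EMPLOYEES", "LAST_NAME"),
    ("HR", "EMPLOYEES", "DEPARTMENT_ID"),
    ("HR", "DEPARTMENTS", "DEPARTMENT_ID"),
    ("HR", "DEPARTMENTS", "DEPARTMENT_NAME"),
]


def index_columns(rows):
    """Group column names by (owner, table) for quick lookup."""
    columns = defaultdict(set)
    for owner, table, column in rows:
        columns[(owner, table)].add(column)
    return columns


def build_select(columns, owner, table, requested):
    """Build a SELECT for one table, rejecting names the dictionary does not list."""
    known = columns.get((owner, table))
    if known is None:
        raise ValueError(f"unknown table {owner}.{table}")
    missing = [c for c in requested if c not in known]
    if missing:
        raise ValueError(f"unknown columns for {owner}.{table}: {missing}")
    # Only dictionary-verified identifiers are interpolated, each double-quoted.
    column_list = ", ".join(f'"{c}"' for c in requested)
    return f'SELECT {column_list} FROM "{owner}"."{table}"'


query = build_select(
    index_columns(DICTIONARY_ROWS), "HR", "EMPLOYEES", ["EMPLOYEE_ID", "LAST_NAME"]
)
print(query)
```

The same lookup-then-build logic should carry over to the Java app. Checking names against the dictionary before interpolating them also keeps arbitrary user input out of the generated SQL.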

Microstrategy - HBase connection

We are trying to connect MicroStrategy (MS) 9.4 to HBase via the Impala connector.
First we created the Hive tables, linking them to HBase tables with the following CREATE TABLE statement (as we saw in the docs):
CREATE TABLE hiveTableName1
(key int, columnName1 codClient, columnName2 clientName)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,columnfamily1:columnName1,columnfamily1:columnName2")
TBLPROPERTIES ("hbase.table.name" = "hbaseTableName1");
We did this twice, since we want to create two Hive tables and their corresponding HBase tables, in order to perform a join between them later with MS.
For the connection between MS and HBase, we followed the steps by selecting the MicroStrategy ODBC Driver for Impala Wire Protocol, and filling in the Data Source Name (an Impala Data Source previously created with the Impala Driver), host and port (both for the Impala installation in our AWS infrastructure) and impala/impala for credentials.
The thing is that when we complete the wizard and select the default namespace (which is the only one available; no other namespace has been created), we see the Hive tables that we created before, instead of the HBase tables.
I mean:
hiveTableName1
hiveTableName2
instead of
hbaseTableName1
hbaseTableName2
And, since these are the only tables available, we can only build our report with these two tables: a very simple join between them on one field.
Both tables have 200,000 records and the join takes more than 1 minute to complete.
I'm sure that we are missing something here, and that the process of linking Hive tables to HBase ones is not completely right.
Is there a way to be able to connect to these two hbase tables instead of hive ones?
Any help will be really appreciated.
1. HBase does not support SQL and does not support the concept of "join" anyway.
2. Mapping Hive tables on HBase tables means that every Hive query triggers a full scan on HBase side, then the result is fed to a MapReduce batch job that does the filters and the joins.
Bottom line: 1 min is quite fast for what you are doing.
If you expect sub-second results, try some "small data" technologies (e.g. MySQL, Oracle, even MS Access) or forget about joins.
For sub-minute results, you might give Apache Phoenix a try: it's an HBase wrapper with indexes and some kind of SQL. Not sure about ODBC/JDBC drivers though.

Why do we need to move an external table to a managed Hive table?

I am new to Hadoop and learning Hive.
In Hadoop: The Definitive Guide, 3rd edition, page 428, last paragraph,
I don't understand the paragraph below regarding external tables in Hive.
"A common pattern is to use an external table to access an initial dataset stored in HDFS (created by another process), then use a Hive transform to move the data into a managed Hive table."
Can anybody briefly explain what the above passage says?
Usually the data in the initial dataset is not constructed in the optimal way for queries.
You may want to modify the data (modifying some columns, adding columns, aggregating, etc.) and store it in a specific way (partitions, buckets, sorted, etc.) so that the queries benefit from these optimizations.
The key difference between external and managed tables in Hive is that data in an external table is not managed by Hive.
When you create an external table you define an HDFS directory for that table. Hive simply "looks" in it and can read data from it, but Hive does not own the data in that folder. When you drop an external table, Hive only deletes the metadata from its metastore and the data in HDFS remains unchanged.
A managed table is basically a directory in HDFS that is created and managed by Hive. What is more, all operations that remove or change partitions, raw data, or the table itself MUST be done through Hive, otherwise the metadata in the Hive metastore may become incorrect (e.g. you manually delete a partition from HDFS but the Hive metastore still says that the partition exists).
In the Definitive Guide, I think the author meant that it is common practice to write an MR job that produces some raw data and keeps it in some folder. Then you create a Hive external table that looks into that folder. And then you can safely run queries without the risk that dropping the table deletes the data.
In other words, you can run an MR job that produces some generic data and then use a Hive external table as the source of data for inserts into managed tables. This helps you avoid writing boring, similar MR jobs and delegates the task to Hive queries: you write a query that takes data from the external table, aggregates or processes it however you want, and puts the result into managed tables.
Another use of an external table is as a source for data coming from remote servers, e.g. in CSV format.
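The drop behaviour described above can be mimicked with a toy model. This is not Hive code and not how Hive is implemented: the metastore dict, the create_table and drop_table helpers, and the directory names are all made up for illustration, with local temporary directories standing in for HDFS.

```python
import os
import shutil
import tempfile

# Toy stand-in for the Hive metastore: table name -> location and table type.
metastore = {}


def create_table(name, location, external):
    """Register a table; the directory stands in for its HDFS location."""
    os.makedirs(location, exist_ok=True)
    metastore[name] = {"location": location, "external": external}


def drop_table(name):
    """Always remove the metadata; remove the data only for managed tables."""
    entry = metastore.pop(name)
    if not entry["external"]:
        shutil.rmtree(entry["location"])


root = tempfile.mkdtemp()
raw_dir = os.path.join(root, "raw_events")                # written by another process
managed_dir = os.path.join(root, "warehouse", "events")   # owned by the "warehouse"

create_table("raw_events_ext", raw_dir, external=True)
create_table("events", managed_dir, external=False)

drop_table("raw_events_ext")
drop_table("events")

external_data_survives = os.path.isdir(raw_dir)
managed_data_survives = os.path.isdir(managed_dir)
print(external_data_survives, managed_data_survives)

shutil.rmtree(root)  # clean up the temporary directories
```

In this model both drops remove the metastore entry, but only the managed table's directory disappears, which is the asymmetry the answer describes.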
There is no reason to move a table to managed unless you are going to enable ACID or other features supported only for managed tables.
The list of differences in features supported by managed and external tables may change in the future, so it is better to check the current documentation. Currently these differences are:
ARCHIVE/UNARCHIVE/TRUNCATE/MERGE/CONCATENATE only work for managed tables
DROP deletes data for managed tables while it only deletes metadata for external ones
ACID/Transactional only works for managed tables
Query Results Caching only works for managed tables
Only the RELY constraint is allowed on external tables
Some Materialized View features only work on managed tables
You can create both EXTERNAL and MANAGED tables on top of the same location, see this answer with more details and tests: https://stackoverflow.com/a/54038932/2700344
Data structure has nothing to do with the external/managed table type. If you want to change the structure, you do not necessarily need to change the table's managed/external type.
It is also mentioned in the book.
When your table is an external table,
you can use other technologies like Pig, Cascading, or MapReduce to process it.
You can also use multiple schemas for that dataset.
You can also create the data lazily if it is an external table.
When you decide that the dataset should be used only by Hive, make it a Hive managed table.

Resources