Database versions for Oozie - oracle

I would like to change my Oozie installation from a MySQL db to an Oracle db.
My cluster is running CDH 5.4.7 with Oozie 4.1. The Oracle db that I have access to is version 12c.
In the Cloudera documentation it states that Oracle db 12c is only supported by Cloudera Manager and CDH 5.6 and newer.
My question is therefore: is there any reason why my Oozie installation should not be able to use this database, even though the Cloudera components do not support it? As far as I have found, the Oozie documentation does not state anything version-related.
I currently lack a non-production system to test this on, but I am looking into setting one up.
Any answers, including speculation, are appreciated.
If any information is missing, I will gladly append.
Thanks

Oozie in CDH 5.4.7 uses a quite old OpenJPA version, 2.2.2.
OpenJPA 2.2.2 does not support Oracle 12c.
However, CDH 5.8.0 (newer than the 5.6 release from which Oracle 12c is listed as supported) still uses OpenJPA 2.2.2, so my guess is that it will probably work on 5.4.7 but was never tested. Make sure to create a backup of your DB before the migration. Also, you might try the DB migration tool developed in OOZIE-2632.
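For reference, pointing Oozie at Oracle comes down to four JPAService properties in oozie-site.xml. The following is a minimal sketch that renders them; the host, port, service name and credentials are placeholders, and on a cluster managed by Cloudera Manager you would normally enter the same values through the Oozie service configuration rather than edit the file by hand.

```python
from xml.sax.saxutils import escape


def oozie_jdbc_properties(host, port, service, user, password):
    """Render oozie-site.xml <property> elements for an Oracle-backed Oozie.

    All argument values are placeholders supplied by the caller.
    """
    props = {
        "oozie.service.JPAService.jdbc.driver": "oracle.jdbc.OracleDriver",
        # Thin-driver URL using the service-name form.
        "oozie.service.JPAService.jdbc.url": f"jdbc:oracle:thin:@//{host}:{port}/{service}",
        "oozie.service.JPAService.jdbc.username": user,
        "oozie.service.JPAService.jdbc.password": password,
    }
    return "\n".join(
        f"<property>\n  <name>{name}</name>\n  <value>{escape(value)}</value>\n</property>"
        for name, value in props.items()
    )


if __name__ == "__main__":
    # Hypothetical connection details, for illustration only.
    print(oozie_jdbc_properties("dbhost.example.com", 1521, "ORCL", "oozie", "changeme"))
```

The Oracle JDBC driver jar also has to be on Oozie's classpath, and the schema still needs to be created or migrated separately.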

Related

Apache Sqoop moved into the Attic in 2021-06

I have installed Hadoop 3.3.1 and Sqoop 1.4.7, which don't seem to be compatible: I am getting a deprecated-API error while importing an RDBMS table.
When I googled for compatible versions, I found that Apache Sqoop has been moved to the Apache Attic, and the documentation for 1.4.7, the last stable version, says: "Sqoop is currently supporting 4 major Hadoop releases - 0.20, 0.23, 1.0 and 2.0."
Would you please explain what this means and what I should do?
Could you also suggest alternatives to Sqoop?
It means just what the board minutes say: Sqoop has become inactive and is now moved to the Apache Attic. This doesn't mean Sqoop is deprecated in favor of some other project, but for practical purposes you should probably not build new implementations using it.
Much of the same functionality is available in other tools, including other Apache projects. Possible options are Spark, Kafka, Flume. Which one to use is very dependent on the specifics of your use case, since none of these quite fill the same niche as Sqoop. The database connectivity capabilities of Spark make it the most flexible solution, but it also could be the most labor-intensive to set up. Kafka might work, although it's not quite as ad-hoc friendly as Sqoop (take a look at Kafka Connect). I probably wouldn't use Flume, but it might be worth a look (it is mainly meant for shipping logs).
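To give an idea of what the Spark route looks like, here is a minimal sketch of a Sqoop-style table import using Spark's JDBC data source. The helper names (`jdbc_read_options`, `import_table`), the connection details and the target path are made up for illustration; the option keys are the ones the Spark JDBC source accepts. It assumes an existing SparkSession and a JDBC driver jar on the Spark classpath.

```python
def jdbc_read_options(url, table, user, password, driver,
                      partition_column=None, lower=None, upper=None, partitions=None):
    """Build the option map for spark.read.format("jdbc").

    The four partitioning arguments play the role of Sqoop's --split-by and
    --num-mappers: either give all of them or none.
    """
    opts = {"url": url, "dbtable": table, "user": user,
            "password": password, "driver": driver}
    split = (partition_column, lower, upper, partitions)
    if any(v is not None for v in split):
        if any(v is None for v in split):
            raise ValueError("partition_column, lower, upper and partitions must be set together")
        opts.update({
            "partitionColumn": partition_column,
            "lowerBound": str(lower),
            "upperBound": str(upper),
            "numPartitions": str(partitions),
        })
    return opts


def import_table(spark, opts, target_path):
    """Read one table over JDBC and write it to target_path as Parquet."""
    df = spark.read.format("jdbc").options(**opts).load()
    df.write.mode("overwrite").parquet(target_path)
```

A call might look like `import_table(spark, jdbc_read_options("jdbc:mysql://dbhost/shop", "orders", "etl", "secret", "com.mysql.cj.jdbc.Driver", "id", 1, 1000000, 8), "hdfs:///data/orders")`, where every value is a placeholder.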

HBase client - server’s version compatibility

I wonder how I can know whether my HBase client jar is compatible with my HBase server's version. Is there any place that specifies which HBase server versions a given HBase client jar supports?
In my case I want to use the newest HBase client jar (2.4.5) with a pretty old HBase server (version 1.2). Is there any place where I can check the compatibility to know if it's possible and supported?
I'd like to know if there's a table that shows the full compatibility matrix like other databases have. Something like:
https://docs.mongodb.com/drivers/java/sync/current/compatibility/
Perhaps you can use the checkcompatibility.py script provided in the HBase distribution itself to generate a client API compatibility report between 1.2 and 2.4. I haven't used 2.4 myself, but based on prior history I would expect breaking changes across two different major versions.
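As a quick first check before generating a full report, you can at least flag client/server pairs that sit on different major versions. This is only a rough sketch (the function names are made up here); it says nothing about whether two versions within the same major line are actually compatible.

```python
def major_version(version):
    """Return the leading numeric component of a dotted version string."""
    return int(version.split(".")[0])


def crosses_major(client_version, server_version):
    """True when client and server are on different major version lines."""
    return major_version(client_version) != major_version(server_version)


if __name__ == "__main__":
    # The pair from the question: a 2.4.5 client against a 1.2 server.
    if crosses_major("2.4.5", "1.2"):
        print("client and server are on different major versions; expect breaking changes")
```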

Set up IBM Open Platform with an external Oracle Database

I'm a little confused when I try to install a single node IBM Open Platform cluster using an Oracle database as RDBMS.
Firstly, I understand that the Hadoop part of IBM Big Insights is not a modified version of the corresponding Apache version (the same is true of Hortonworks), so when Ambari (from the IBM repo) offers to let me use an external Oracle database, I suppose it should work. I may be wrong, and I can't find any Oracle reference in the crappy IBM installation guide explaining how to set it up correctly (only that it should work with Oracle 11g R2).
So, as I would with an equivalent Hortonworks distribution (but using the binaries from IBM), I set up my ambari-server with all the Oracle parameters (--jdbc-db=oracle --jdbc-driver=path/to/ojdbc6.jar; I'm using Oracle 11g XE on CentOS 6.5, which is supposed to be supported by IOP) and I specified everything required to use Ambari with Oracle (service name, host, port, ...).
I created the ambari user, loaded the corresponding Oracle DDL (packaged with Ambari) and created my Hive & Oozie users, as specified in the... Hortonworks installation guide.
Well, Ambari seems to work well with Oracle, I can set up my cluster until the last step :
If I configure Hive and/or Oozie to work with Oracle (the Oracle connection validates OK from the service configuration tab), the "review" step (step 8) doesn't show anything (or sometimes only the IOP repos; it seems to be arbitrary). Trying to deploy starts the task preparation and then blocks the installation: I can't do anything other than drop the database and reload the entire DDL to try again (otherwise I get lots of unexpected NullPointerExceptions).
If I configure Hive AND Oozie to work with an embedded MySQL (the default choice), keeping Ambari against Oracle, everything works fine.
Am I doing something wrong? Or is there any limitation on configuring (IBM Open Platform) Hive and Oozie to use Oracle 11? (It works with the Hortonworks distribution, which has the same Apache versions, and with the Cloudera distribution.)
Of course, log files don't tell me anything...
UPDATE:
I tried to install IOP 4.1, firstly using MySQL as my Ambari, Hive and Oozie database, everything was fine.
Next I tried to install IOP 4.1 with Oracle 11 XE as the external database. I configured Oracle, created the ambari, hive and oozie Oracle users, loaded the Ambari Oracle schema shipped with IOP 4.1, and configured the same cluster as the first time, specifying the Oracle particulars for Hive, Oozie and Sqoop (Oracle driver). Before deploying the services to all the nodes, Ambari is supposed to summarize what it is going to install, but it doesn't: sometimes it doesn't show anything, sometimes it shows only the IOP repo URLs. Next, when I try to deploy, it starts the preparation tasks but never finishes, and that's it. No message, no log, nothing; it just gets stuck.
As the desired components of IOP 4.1 are the same versions as in HDP 2.3 (Ambari 2.1, Hive 1.2.1, Oozie 4.2.0, Hadoop 2.7.1, Pig 0.15.0, Sqoop 1.4.6 and ZooKeeper 3.4.6), I tried to configure exactly the same cluster with HDP 2.3, Oracle 11 XE, ... and everything worked. I noticed that HDP 2.3 forces me to use SSL, while IOP does not. HDP works with an Oracle JDK 1.8 by default, while IOP actually offers to use an OpenJDK 1.8 instead. I don't know if it matters; I'll test it to be sure. I'll take pictures of the Ambari screen when it blocks and copy the log traces, even if there's no error message...
If anyone got an idea, please share it!
Thanks!
When I run the same installation using the Oracle JDK 1.8, everything works fine.
I don't know if there is any restriction on using the Oracle JDBC driver with OpenJDK 1.8, but using Oracle 11 XE with IOP 4.1 + Oracle JDK 1.8 works.
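Since the JDK vendor turned out to be the deciding factor, it can help to confirm which JDK a host is really running before the install. A minimal sketch, based on the banner printed by `java -version` (which goes to stderr); the function names are made up for illustration, and the `java` found on the PATH is not necessarily the one Ambari is configured to use.

```python
import subprocess


def classify_jdk(version_output):
    """Classify a `java -version` banner as OpenJDK or Oracle JDK."""
    text = version_output.lower()
    # OpenJDK banners read "OpenJDK Runtime Environment" / "OpenJDK 64-Bit Server VM".
    if "openjdk" in text:
        return "OpenJDK"
    # Oracle banners read "Java(TM) SE Runtime Environment" / "Java HotSpot(TM) ... VM".
    if "java(tm)" in text or "hotspot" in text:
        return "Oracle JDK"
    return "unknown"


def current_jdk(java="java"):
    """Run `java -version` and classify the result."""
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return classify_jdk(result.stderr + result.stdout)


if __name__ == "__main__":
    print(current_jdk())
```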

Can ETL informatica Big Data edition (not the cloud version) connect to Cloudera Impala?

We are trying to do a proof of concept on Informatica Big Data edition (not the cloud version) and I have seen that we might be able to use HDFS and Hive as source and target. But my question is: does Informatica connect to Cloudera Impala? If so, do we need any additional connector for that? I have done comprehensive research to check whether this is supported but could not find anything. Has anyone already tried this? If so, can you specify the steps and link to any documentation?
Informatica version: 9.6.1 (Hotfix 2)
You can use the ODBC driver provided by Cloudera.
http://www.cloudera.com/downloads/connectors/impala/odbc/2-5-22.html
For Irene, you can use the same driver; the one above is based on the Simba driver.
http://www.simba.com/drivers/hbase-odbc-jdbc/

How to configure Hue-2.5.0 and HIve-0.11.0

For the past 2 days I have been working on setting up Hue, but with no luck.
The versions I tried with hive 0.11.0 :- 3.5, 3.0, 2.4, 2.1, 2.3, 2.5
After much googling I came to know that 3.5 and 3.0 (the documentation says 0.11) are compatible with Hive 0.12 or 0.13, but as mine is 0.11 I faced issues like: required client protocol, no database found, list index error.
Finally I was able to set up Hue 2.5.0 and it indeed connects with hiveserver2.
My Properties in hue.ini :
beeswax_server_host=localhost
server_interface=hiveserver2
beeswax_server_port=10000
hive_home_dir=/usr/lib/hive/hive-0.11.0
hive_conf_dir=/usr/lib/hive/hive-0.11.0/conf
All my tables are in Hive, but HiveServer2 does not show them when I access it using "beeline";
if I start the Hive Thrift server instead, I can access all my tables and schemas in RStudio.
I don't understand why HiveServer2 cannot access the Hive tables. Is it something different?
The hue.ini file gives only two options for connectivity: beeswax and hiveserver2.
I have googled a lot, but at this point nothing is helping.
Please let me know if:
hiveserver2 can import Hive data
OR
hiveserver can be used with Hue 2.5.0
OR
I'm missing anything
If there is any more information required please let me know.
Apache Hive is missing some patches from CDH that have not been accepted by the community. The Thrift protocol version also differs depending on the release.
The current workarounds are to cherry-pick the missing patches from CDH or to use Hive from CDH.
You can read more here for example.
You should have a hive client installed on the Hue machine, with a configured hive-site.xml.
Then you can comment out all the [[beeswax]] section and Hue should run correctly.
