JDBC connection to Oracle server from Spark

When a connection to an Oracle server is made from a Spark cluster, is the JDBC connection to the Oracle server established from the node/box where the code is being executed, or from the data nodes? In the latter case, do the drivers need to be installed on all data nodes for them to connect to the Oracle server?

When a connection to an Oracle server is made from a Spark cluster, is the JDBC connection to the Oracle server established from the node/box where the code is being executed, or from the data nodes?
Data is always loaded from the executor nodes. However, the driver node needs access to the database as well, to be able to fetch metadata.
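For illustration, a minimal Spark read sketch (Scala), where the host, service name, table, and credentials are placeholders: the driver resolves the table's schema over JDBC when the DataFrame is defined, and the executors fetch the actual rows when an action runs.

```scala
// Minimal sketch, assuming a reachable Oracle instance; the host, port,
// service name, table, and credentials below are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("oracle-jdbc-read").getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")
  .option("dbtable", "SCHEMA_OWNER.SOME_TABLE")
  .option("user", "spark_user")
  .option("password", "secret")
  .option("driver", "oracle.jdbc.OracleDriver")
  .load()

// The schema (metadata) is resolved on the driver when load() is called;
// the rows themselves are fetched by the executors when an action runs.
df.show(10)
```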
In the latter case, do the drivers need to be installed on all data nodes for them to connect to the Oracle server?
Yes. The driver has to be present on each node used by the Spark application. This can be done by (a configuration sketch follows the list):
Having the required jars on the classpath of each node.
Using spark.jars to distribute jars at runtime.
Using spark.jars.packages to fetch jars using Maven coordinates.
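As mentioned above, a minimal sketch of the last two options, assuming the Oracle driver jar is available locally and/or published under Maven coordinates your cluster can reach (the path, coordinates, and version below are examples, not fixed values); these settings are also commonly passed to spark-submit rather than set in code.

```scala
// Sketch only: distribute the Oracle JDBC driver to every executor.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("oracle-jdbc")
  // Option 1: ship a local jar to the executors (path is a placeholder).
  .config("spark.jars", "/opt/drivers/ojdbc8.jar")
  // Option 2: resolve the driver from Maven instead (coordinates/version
  // vary by Oracle driver release; treat these as an example).
  // .config("spark.jars.packages", "com.oracle.database.jdbc:ojdbc8:21.9.0.0")
  .getOrCreate()
```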

Related

Can I use a SnappyData JDBC connection with only a Locator and Server nodes?

SnappyData documentation and architecture diagrams seem to indicate that a JDBC thin client connection goes from a client to a Locator and then it is routed to a direct connection to a Server.
If this is true, then I can run JDBC queries without a Lead node, correct?
Yes, that is correct. The locator provides load and connectivity information back to the client, which can then connect to one or more servers, either for direct access to a bucket for low-latency queries or, more importantly, for HA: the connection can fail over and fail back.
So, yes, your connected clients will continue to function even when the locator goes away. Note that the "lead" plays a different role than the locator. Its primary function is to host the Spark driver, orchestrate Spark jobs, and provide HA to Spark. With no lead, you won't be able to run such jobs.
In addition to what @jagsr has mentioned, if you do not intend to run the lead nodes (and thus no Spark jobs or column store), then you can run the cluster as a pure row store using snappy-start-all.sh rowstore (see the rowstore docs).
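A hedged thin-client sketch of such a connection, assuming SnappyData's default client port (1527) and the client JDBC driver on the classpath; the host name is a placeholder, and the exact driver class and URL form should be verified against your SnappyData version's documentation.

```scala
// Sketch: connect through the locator, which hands back server connectivity
// information, so the connection keeps working even if the locator later
// goes away. Driver class/URL form assumed; check your version's docs.
import java.sql.DriverManager

Class.forName("io.snappydata.jdbc.ClientDriver")

val conn = DriverManager.getConnection("jdbc:snappydata://locator-host:1527/")
try {
  val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM some_table")
  while (rs.next()) println(rs.getLong(1))
} finally {
  conn.close()
}
```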

Apache Hive Installation on pseudo distributed or multi node cluster environment

I have installed Hadoop in a multi-node environment on my PC as below:
1: 4 VirtualBox instances loaded with Ubuntu (14.04)
2: 1 master node, 2 slave nodes, and the remaining VM instance acts as a client
Note: All 4 VMs are running on my PC itself.
I was able to complete the Apache Hadoop 2.6 setup successfully on the above-mentioned environment. Now I want to install Hive in order to do some data summarization, querying, and analysis.
But I am not sure how to proceed further. I have a few queries mentioned below:
Q1: Do I need to install/set up Apache Hive (0.14) on all nodes (master/name node and slave/data nodes), or only on the master node?
Q2: Which mode should be used for the metastore: local mode or remote mode?
Q3: If I want to use MySQL for the Hive metastore, should I install it on the master/name node itself, or do I need to use a separate client machine for this?
Could someone also share the steps to be followed to configure the metastore in a multi-node/pseudo-distributed environment?
BR,
San
You need to install the required Hive services (HiveServer2, Metastore, WebHCat) only once. In your lab scenario, you would probably put them on the master. The client can then run Beeline (the HiveServer2 client); a minimal JDBC sketch of such a client connection follows below.
If you configure the Metastore as Local, Hive will use a local Derby database. Again, for your lab setup, this is probably just what you need/want.
In a production scenario, you would
set up a dedicated server for supporting services that should not fight for resources with the namenode process(es)
and use a dedicated database server for your Metastore database, which will be remote.
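For the lab setup, a minimal sketch of how the client machine could talk to HiveServer2 on the master over JDBC (the programmatic equivalent of Beeline), assuming the default HiveServer2 port (10000), no authentication, and the Hive JDBC driver jar on the client classpath; the host and database names are placeholders.

```scala
// Sketch: client-side JDBC connection to HiveServer2 running on the master.
// Assumes no authentication; "master" and "default" are placeholders.
import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")

val conn = DriverManager.getConnection("jdbc:hive2://master:10000/default", "", "")
try {
  val rs = conn.createStatement().executeQuery("SHOW TABLES")
  while (rs.next()) println(rs.getString(1))
} finally {
  conn.close()
}
```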

Stored proc behaving differently on two nodes

I have two Oracle nodes running on RAC. Using TOAD, I compiled a stored proc. My Java application runs on JBoss and uses a connection pool to the Oracle server. On one node I still see the old query running, while the other node behaves fine. How is this possible? Any ideas?
Thanks

Cloudera beeswax server and hive server

I have a fundamental question regarding the two servers mentioned in the context of the Cloudera CDH4 distribution.
Are those two interchangeable/replaceable, as in, could you run Beeswax in place of Hive Server?
I'm trying to use a Thrift client to connect, and in my setup only Beeswax is running, not Hive Server. In that case, can I connect to the Beeswax server?
Hive Server is the default process and Beeswax is a newer process designed to better support concurrency and provide authentication using Kerberos. You should run one or the other.
And yes, you should definitely be able to connect to beeswax using Thrift. You can find clients for Beeswax and Hive server here.
What is the difference between HiveServer2 and Beeswax? They are both designed to better support concurrency and security.

How can I establish a connection between an AS/400 server and Hadoop, and move data?

How do you get data/tables from DB2 on an AS/400 server to a Hadoop filesystem? How do you establish a connection between the AS/400 server and a Hadoop filesystem?
I know that we can get data/tables from a MySQL server to a Hadoop filesystem using Sqoop.
I believe that DB2 has a native connector in Sqoop, so I would recommend the same approach you're using to move data from MySQL (i.e. using the JDBC interface).
Sqoop uses a JDBC connector to interact with databases. You can use the JT400 Java library (http://jt400.sourceforge.net/) for the JDBC driver; a minimal connectivity sketch is shown below.
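A minimal connectivity sketch using the jt400 JDBC driver (the same driver Sqoop would use), where the host, library/schema, table, and credentials are placeholders; this just verifies the connection works before wiring it into a Sqoop import.

```scala
// Sketch: verify JDBC connectivity to DB2 on the AS/400 via the jt400 driver.
// Host, library/schema, table, and credentials are placeholders.
import java.sql.DriverManager

Class.forName("com.ibm.as400.access.AS400JDBCDriver")

val conn = DriverManager.getConnection(
  "jdbc:as400://as400-host;naming=sql", "USER", "PASSWORD")
try {
  val rs = conn.createStatement()
    .executeQuery("SELECT * FROM MYLIB.MYTABLE FETCH FIRST 5 ROWS ONLY")
  while (rs.next()) println(rs.getString(1))
} finally {
  conn.close()
}
```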

Resources