Is it possible to use an existing Trino cluster as a data source for a different Trino cluster? - trino

Say that I have access to a Trino cluster (call it Trino_external) that has connections to various data stores. I also have access to another data store (call it RDB_isolated) that I can't connect to through the original Trino cluster. Is it possible to create a local Trino cluster (call it Trino_local) that has connections to both Trino_external and RDB_isolated, so that I can run a single query that joins data from these two sources? Something like this:
Trino_local
+- Trino_external
|  +- DB_external_1
|
+- RDB_isolated
Sample query
SELECT *
FROM Trino_local.Trino_external.DB_external_1 as l
JOIN Trino_local.RDB_isolated as r
ON l.column = r.column
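To make the intent concrete, here is a rough sketch of the two catalog files I imagine on Trino_local. All connector names, hosts, and credentials are placeholders; in particular, I don't know whether a connector for querying a remote Trino cluster is available in my Trino version, or whether this needs something like Starburst Stargate:
# etc/catalog/trino_external.properties on Trino_local
# connector.name and its properties are assumptions -- check which remote-Trino
# connector (if any) your Trino version or distribution actually ships
connector.name=trino
connection-url=jdbc:trino://trino-external.example.com:8080
connection-user=my_user

# etc/catalog/rdb_isolated.properties on Trino_local
# assuming RDB_isolated is PostgreSQL, purely for illustration
connector.name=postgresql
connection-url=jdbc:postgresql://rdb-isolated.example.com:5432/rdb_isolated
connection-user=my_user
connection-password=my_password
I assume that with such catalogs the query would actually use catalog.schema.table names, for example trino_external.db_external_1.some_table joined with rdb_isolated.public.some_table, rather than the Trino_local prefix shown in the sample above.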

Related

Transfer hive query result from one hadoop cluster to another hadoop cluster

I have two clusters, A and B. Cluster A has 5 tables. I need to run a Hive query on these 5 tables, and the result of the query should update the Cluster B table data (the result covers all the columns).
Note: We should not create any files on Cluster A during this process, but a temp file is allowed.
Is this doable? What permissions/configurations are required between the two clusters to achieve this?
How can I accomplish this task, or is there a more efficient alternative?
After achieving this, I need to automate it using Oozie.
Do you use a database for each cluster's metadata / Hive tables? If so, and you use the same database for storing Hive tables in both clusters, then you can share them. I know it sounds obvious, but I mention it in case you haven't thought about it.
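If the metastores are separate, another common pattern is to export the query result to a temp directory on Cluster A, copy it across, and load it on Cluster B. A rough sketch follows; all paths, host names, and table names are illustrative assumptions, and the real 5-table query goes where the sample two-table join is:
-- On cluster A: write the query result to a temp directory (the question allows a temp file)
INSERT OVERWRITE DIRECTORY '/tmp/result_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT t1.id, t1.col_a, t2.col_b        -- stand-in for the real 5-table query
FROM table1 t1
JOIN table2 t2 ON t1.id = t2.id;

-- Copy the exported files to cluster B (run from a shell, not Hive):
--   hadoop distcp hdfs://clusterA-nn:8020/tmp/result_export hdfs://clusterB-nn:8020/tmp/result_import

-- On cluster B: load the copied files into the target table (its layout must match the export)
LOAD DATA INPATH '/tmp/result_import' OVERWRITE INTO TABLE target_table;
Each step can then be wrapped in an Oozie workflow (Hive actions for the queries, a distcp action for the copy).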

How to connect to multiple Cassandra in different dc

I'm setting up an application in which I am using a Spark session to read data from Cassandra. I can read the data from Cassandra if I pass one Cassandra node from a single DC.
But how can I connect to 3 different Cassandra nodes that belong to 3 different DCs in the Spark session?
Here is the code I am using:
Spark session:
spark = SparkSession.builder().appName("SparkCassandraApp")
.config("spark.cassandra.connection.host", cassandraContactPoints)
.config("spark.cassandra.connection.port", cassandraPort)
.config("spark.cassandra.auth.username", userName).config("spark.cassandra.auth.password", password)
.config("spark.dynamicAllocation.enabled", "false").config("spark.shuffle.service.enabled", "false")
.master("local[4]").getOrCreate();
Property file:
spring.data.cassandra.contact-points=cassandra1ofdc1, cassandra2ofdc2, cassandra3ofdc3
spring.data.cassandra.port=9042
When I try the above scenario, I get the following exception:
Caused by:
java.lang.IllegalArgumentException: requirement failed: Contact points contain multiple data centers: dc1, dc2, dc3
Any help would be appreciated.
Thanks in advance.
The Spark Cassandra Connector (SCC) only allows the use of nodes from the local data center, which is either defined by the spark.cassandra.connection.local_dc configuration parameter or determined from the DC of the contact point(s) (this is done by the function LocalNodeFirstLoadBalancingPolicy.determineDataCenter). SCC will never use nodes from other DCs...
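So the fix is to keep the contact points within a single data center and make the local DC explicit. A minimal sketch of the builder from the question, assuming dc1 is the DC you want to read from (the hostname and DC name reuse the question's placeholders, and userName/password are the question's own variables):
spark = SparkSession.builder().appName("SparkCassandraApp")
        // contact point(s) from one DC only, e.g. dc1
        .config("spark.cassandra.connection.host", "cassandra1ofdc1")
        .config("spark.cassandra.connection.port", "9042")
        // pin the connector to that DC explicitly
        .config("spark.cassandra.connection.local_dc", "dc1")
        .config("spark.cassandra.auth.username", userName)
        .config("spark.cassandra.auth.password", password)
        .master("local[4]").getOrCreate();
Reading from the other two DCs would then require separate connections configured with their own contact points and local_dc values.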

Joining Oracle Table Data with MongoDB Collection

I have a reporting framework to build and generate reports (tabular-format reports). Until now I have written SQL queries that fetch data from Oracle. Now I have an interesting challenge where half of the data will come from Oracle and the remaining data will come from MongoDB, based on the output of the Oracle query. The tabular data fetched from Oracle will have one additional column containing the key used to fetch data from MongoDB. With this I will have two data sets in tabular format, one from Oracle and one from MongoDB. Based on one common column, I need to merge both tables and produce one data set for the report.
I can write logic in Java code to merge the two tables (say, data in 2D-array format). But instead of doing this on my own, I am thinking of using some in-memory RDBMS concept, for example the H2 database, where I can create two tables in memory on the fly and execute H2 queries to merge them. Or, I believe, there could be something in Oracle too, like a global temp table. Could someone please suggest a better approach to join Oracle table data with a MongoDB collection?
I think you can try to use Kafka and Spark Streaming to solve this problem. Assuming your data is transactional, you can create a Kafka broker and create a topic. Then modify the existing services where you save to Oracle and MongoDB: create 2 Kafka producers (one for Oracle and another for Mongo) to write the data as streams to the Kafka topic. Then create a consumer group to receive the streams from Kafka. You may then aggregate the real-time streams using a Spark cluster (see the Spark Streaming API for Kafka) and save the results back to MongoDB (using the Spark Connector from MongoDB) or any other distributed database. Then you can do data visualization/reporting on the results stored in MongoDB.
Another suggestion would be to use Apache Drill: https://drill.apache.org
You can use the Mongo and JDBC storage plugins in Drill and then join Oracle tables and Mongo collections together.
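For example, once a JDBC storage plugin for Oracle (named oracle here, an assumption) and the mongo plugin are enabled in Drill, a single query can join the two sides on the shared key. All schema, table, and collection names below are illustrative:
SELECT o.report_key, o.amount, m.extra_details
FROM oracle.REPORT_USER.REPORT_DATA o
JOIN mongo.reportdb.report_details m
  ON o.report_key = m.report_key;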

Implementing change in ThreadPoolSize on client side - JDBC driver Apache Phoenix

I have recently set up a JDBC driver to connect to a Hadoop DB using Apache Phoenix. Basic queries in SQuirreL have worked well (for example, "select * from datafile"), but as soon as I run a slightly more complicated query (e.g., "select column1 from datafile where column2 = 'filter1'"), I encounter this error:
org.apache.phoenix.exception.PhoenixIOException: Task
org.apache.phoenix.job.JobManager$InstrumentedJobFutureTask rejected from
org.apache.phoenix.job.JobManager[Running, pool size = 128, active threads =
128, queued tasks = 5000, completed tasks = 5132]
From some searching, it seems that I should increase the ThreadPoolSize in the Apache Phoenix hbase-site.xml configuration file in order to avoid this error, which I have done, increasing it from 128 to 512. However, the driver does not seem to have noticed this change: the error persists and the "pool size" is still reported as 128 in the error.
In the Phoenix driver settings in SQuirreL, I have pointed "Extra Class Path" at the HBase and HDFS directories containing the .xml config files.
Is there any way to make the driver "notice" that the ThreadPoolSize has changed?
Thank you!
I spent a lot of time on this issue...
The first step is to run an EXPLAIN on the query and look at the number of chunks (e.g., CLIENT 4819-CHUNK):
explain select the_date, sum(row2) from my_table where the_date = to_date('2018-01-01') group by the_date;
+------------------------------------------------------------------------------+
| PLAN |
+------------------------------------------------------------------------------+
| CLIENT 4819-CHUNK 2339029958 ROWS 1707237752908 BYTES PARALLEL 4819-WAY FULL |
| SERVER FILTER BY "THE_DATE" = DATE '2018-01-01 01:00:00.000' |
| SERVER AGGREGATE INTO DISTINCT ROWS BY ["THE_DATE"] |
| CLIENT MERGE SORT |
+------------------------------------------------------------------------------+
4 rows selected (0.247 seconds)
Check the number of regions and/or guideposts in the table
Set the phoenix.stats.guidepost.width property to a value larger than its default size of 100MB and restart HBase Region Servers to apply the change
Update the table statistics by running the following command:
jdbc:phoenix...> UPDATE STATISTICS my_table
Set these values in Ambari/hbase config:
phoenix.query.threadPoolSize: the number of concurrent threads to run for each query. It should be set according to the number of vcores on the client side / Region Servers in the cluster.
phoenix.query.queueSize: the maximum queue depth for tasks to be run for any queue, beyond which an attempt to queue additional work is rejected. Set this property to the number of 'chunks' for the table, as seen in the EXPLAIN output.
REFERENCE
https://phoenix.apache.org/update_statistics.html
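For reference, the client-side hbase-site.xml fragment for those last two settings might look like this (the values are illustrative; size them to your vcores and to the chunk count reported by EXPLAIN):
<property>
  <name>phoenix.query.threadPoolSize</name>
  <value>512</value>
</property>
<property>
  <name>phoenix.query.queueSize</name>
  <value>5000</value>
</property>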
A couple of things to check:
Ensure your Phoenix client jar version is compatible with that of your Phoenix server.
Get the hbase-site.xml from your HBase master node (ensure the Phoenix thread pool size is set appropriately, in sync with the master), add it to the Phoenix client jar file (using 7-Zip), and try running the SQuirreL client again.

Configured HA cluster with Hive-2.0.1 (Derby support) shows redundant database names?

I have configured an HA cluster with one NameNode, one standby NameNode, and one DataNode.
I have started the Derby database with HiveServer2 (Hive-2.0.1).
After starting the Hive server, I opened beeline.cmd to check which databases exist.
It shows default twice:
0: jdbc:hive2://hostname:port/default> show databases;
+----------------+--+
| database_name |
+----------------+--+
| default |
| default |
+----------------+--+
3 rows selected (0.027 seconds)
At that point I am not able to create a table in Hive2.
Can anyone tell me the reason for that issue?
Any help appreciated.
It is not possible to have the same database twice.
Try to create the same database from two different clients at the same time.
If a duplicate database can be created, then ask your question on the Hive mailing list or report it in Jira.
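A quick way to run that check (the database name is just an example): from two separate beeline sessions connected to the same HiveServer2, run the statements below, roughly at the same time. The second CREATE should fail with an "already exists" error if the metastore is behaving correctly, and SHOW DATABASES should then list each database exactly once.
-- run in both beeline clients
CREATE DATABASE dup_check_db;
SHOW DATABASES;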
