Greenplum PXF JDBC parallel insert

When inserting data into a PXF JDBC external table from a Greenplum table:
insert into <pxf jdbc target>
select * from <gp table>
does the insert run in parallel on all PXF instances, with many connections to the external RDBMS, or on a single PXF instance, as happens when selecting from a JDBC source without partitioning?
GP version: 6.17

The insert will run in parallel on all PXF instances.

Per the Greenplum documentation, load and unload operations through external tables execute in parallel across all segments.
Reference: https://docs.greenplum.org/6-10/admin_guide/external/g-external-tables.html
Below is the relevant quote from the documentation.
External tables access external files from within the database as if
they are regular database tables. External tables defined with the
gpfdist/gpfdists, pxf, and s3 protocols utilize Greenplum parallelism
by using the resources of all Greenplum Database segments to load or
unload data. The pxf protocol leverages the parallel architecture of
the Hadoop Distributed File System to access files on that system. The
s3 protocol utilizes the Amazon Web Services (AWS) capabilities.
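As a concrete illustration, here is a minimal sketch of a writable PXF JDBC external table and the parallel insert. The table and column names (sales, pxf_sales_target) and the PXF server configuration name (mydb) are assumptions for illustration, not taken from the question.

CREATE WRITABLE EXTERNAL TABLE pxf_sales_target (id int, amount numeric)
LOCATION ('pxf://public.sales_target?PROFILE=jdbc&SERVER=mydb')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');

-- Each Greenplum segment streams its slice of the SELECT result through the
-- PXF server on its host, so several JDBC connections to the external RDBMS
-- are opened in parallel.
INSERT INTO pxf_sales_target
SELECT id, amount FROM sales;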

Related

Can I use the PutDatabaseRecord processor to directly upsert into Apache Kudu?

I'm trying to sync MySQL with Apache Kudu. I used the CaptureChangeMySQL processor to fetch new update/insert records (in JSON format). How can I use PutDatabaseRecord to put/update the data into Kudu?
Note that I'm syncing at the database level, not just a specific table with a fixed schema.
According to this Apache Kudu doc, you should be able to insert records into a Kudu table using Impala. Depending on the version you may have automatic access to the table (meaning Impala already "knows about" the Kudu table) or you might need an external table created in Impala that sits atop the Kudu table (see the aforementioned doc). Either way you should be able to use the Impala JDBC driver in PutDatabaseRecord or any of the SQL-based processors (like PutSQL if you need to create the table in your flow, for example).
Alternatively you can try the PutKudu processor, which has been in Apache NiFi since version 1.4.0 (via NIFI-3973).
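For reference, a minimal sketch of the Impala side, assuming an existing Kudu table named customers with an (id, name) schema; all names and values here are illustrative only:

-- Map the existing Kudu table into Impala so JDBC clients (such as NiFi's
-- PutDatabaseRecord via the Impala JDBC driver) can write to it.
CREATE EXTERNAL TABLE customers_kudu
STORED AS KUDU
TBLPROPERTIES ('kudu.table_name' = 'customers');

-- Impala supports UPSERT on Kudu tables, which matches a record-level
-- insert/update sync; the values here are placeholders.
UPSERT INTO customers_kudu VALUES ('cust-1', 'Alice');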

Multi-tenancy implementation with Apache Kudu

I am implementing a big data system using Apache Kudu. Preliminary requirements are as follows:
Support Multi-tenancy
Front end will use Apache Impala JDBC drivers to access data.
Customers will write Spark Jobs on Kudu for analytical use cases.
Since Kudu does not support multi-tenancy out of the box, I can think of the following way to support it.
Way:
Each table will have tenantID column and all data from all tenants will be stored in the same table with corresponding tenantID.
Map the Kudu tables as external tables in Impala. Create views over these tables with a WHERE clause for each tenant, like
CREATE VIEW IF NOT EXISTS cust1.table AS SELECT * FROM table WHERE tenantid = 'cust1';
Customer1 will access cust1.table for cust1's data using the Impala JDBC drivers or from Spark, Customer2 will access cust2.table for cust2's data, and so on.
Questions:
Is this an acceptable way to implement multi-tenancy, or is there a better way to do it (maybe with other external services)?
If implemented this way, how do I restrict Customer2 from accessing cust1.table in Kudu, especially when customers write their own Spark jobs for analytical purposes?
We had a meeting with Cloudera folks, and the following is the response we received to the two questions above.
As pointed out by Samson in the comments, Kudu currently has an all-or-nothing access policy (either no access or full access). Therefore the suggested option is to access Kudu through Impala.
Instead of having a tenantID column in each table, each tenant's tables are created separately. These Kudu tables are mapped in Impala as external tables (preferably in separate Impala databases per tenant).
Access to these tables is then controlled using Sentry authorization in Impala.
For Spark SQL access as well, the suggested approach was to make only the Impala tables visible and not to access the Kudu tables directly. Authentication and authorization are then handled at the Impala level before Spark jobs are given access to the underlying Kudu tables.
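A minimal sketch of the Sentry grants this implies, assuming illustrative role, database, and group names (cust1_analyst, cust1, cust1_users):

CREATE ROLE cust1_analyst;
GRANT SELECT ON DATABASE cust1 TO ROLE cust1_analyst;   -- only cust1's database of views/tables
GRANT ROLE cust1_analyst TO GROUP cust1_users;          -- group membership determines who gets the role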

When to use HCatalog and what are its benefits

I'm new to HCatalog (HCat). We would like to know in which use cases/scenarios to use HCat, the benefits of using it, and whether any performance improvement can be gained from HCatalog. Can anyone provide information on when to use HCatalog?
Apache HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Apache Pig, Apache Map/Reduce, and Apache Hive – to more easily read and write data on the grid.
HCatalog creates a table abstraction layer over data stored on an HDFS cluster. This table abstraction layer presents the data in a familiar relational format and makes it easier to read and write data using familiar query language concepts.
HCatalog data structures are defined using Hive's data definition language (DDL), and the Hive metastore stores the HCatalog data structures. Using the command-line interface (CLI), users can create, alter, and drop tables. Tables are organized into databases, or are placed in the default database if none is defined for the table. Once tables are created, you can explore their metadata using commands such as SHOW TABLES and DESCRIBE.
HCatalog commands are the same as Hive's DDL commands.
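For example, a short sketch of the kind of DDL you would issue through the HCatalog CLI or the Hive shell; the table and column names are assumptions for illustration:

CREATE TABLE web_logs (
  ip  STRING,
  ts  STRING,
  url STRING
)
PARTITIONED BY (dt STRING)   -- partition column kept out of the main column list
STORED AS RCFILE;

SHOW TABLES;
DESCRIBE web_logs;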
HCatalog ensures that users need not worry about where or in what format their data is stored. HCatalog presents data stored in RCFile format, text files, or sequence files in a tabular view. It also provides REST APIs so that external systems can access these tables' metadata.
HCatalog opens up the Hive metadata to other MapReduce tools. Every MapReduce tool has its own notion of HDFS data (for example, Pig sees HDFS data as a set of files, while Hive sees it as tables). Tools that support HCatalog do not need to care about where or in what format the data is stored.
It assists integration with other tools and supplies read and write interfaces for Pig, Hive, and MapReduce.
It provides a shared schema and data types for Hadoop tools, so you do not have to explicitly define the data structures in each program.
It exposes the information through a REST interface for external data access.
It also integrates with Sqoop, a tool designed to transfer data back and forth between Hadoop and relational databases such as SQL Server and Oracle.
It provides APIs and a web service wrapper for accessing metadata in the Hive metastore.
HCatalog also exposes a REST interface so that you can create custom tools and applications to interact with Hadoop data structures.
This allows us to use the right tool for the right job. For example, we can load data into Hadoop using HCatalog, perform some ETL on the data using Pig, and then aggregate the data using Hive. After the processing, you could then send the data to your data warehouse housed in SQL Server using Sqoop. You can even automate the process using Oozie.
How it works:
Pig - HCatLoader and HCatStorer interfaces
MapReduce - HCatInputFormat and HCatOutputFormat interfaces
Hive - no interface necessary; direct access to the metadata
References:
Microsoft Big Data Solution
http://hortonworks.com/hadoop/hcatalog/
Answer to your question:
As described earlier, HCatalog provides a shared schema and data types for Hadoop tools, which simplifies your work during data processing. If you have created a table using HCatalog, you can access that Hive table directly through Pig or MapReduce (without HCatalog you cannot simply access a Hive table through Pig or MapReduce). You don't need to create a schema for every tool.
If you are working with shared data that is used by multiple teams (some using Hive, some using Pig, some using MapReduce), then HCatalog is useful, since each team only needs the table to access the data for processing.
It is not a replacement for any tool; it is a facility that provides a single point of access to many tools.
Performance depends on your Hadoop cluster. You should do some performance benchmarking on your Hadoop cluster to measure it.

Oracle to Hadoop data ingestion in real-time

I have a requirement to ingest the data from an Oracle database to Hadoop in real-time.
What's the best way to achieve this on Hadoop?
The important problem here is getting the data out of the Oracle DB in real time. This is usually called Change Data Capture, or CDC. The complete solution depends on how you do this part.
Other things that matter for this answer are:
What is the target for the data and what are you going to do with it?
just store plain HDFS files and access them for ad-hoc queries with something like Impala?
store in HBase for use in other apps?
use in a CEP solution like Storm?
...
What tools is your team familiar with?
Do you prefer the DIY approach, gluing together existing open-source tools and writing code for the missing parts?
Or do you prefer a data integration tool like Informatica?
Coming back to CDC, there are three different approaches to it:
Easy: if you don't need true real time and you have a way to identify new data with a SQL query that executes fast enough for the required data latency, then you can run this query over and over and ingest its results (the exact method depends on the target, the size of each chunk, and the preferred tools); see the sketch after this list.
Complicated: roll your own CDC solution: download the database logs, parse them into a series of inserts/updates/deletes, and ingest these into Hadoop.
Expensive: buy a CDC solution that does this for you (such as GoldenGate or Attunity)
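A minimal sketch of the easy approach, assuming an illustrative orders table with a last_modified watermark column (both names are assumptions):

SELECT order_id, customer_id, amount, last_modified
FROM orders
WHERE last_modified > :last_watermark   -- highest last_modified seen in the previous run
ORDER BY last_modified;

Each run ingests the result set into Hadoop and then advances the watermark to the maximum last_modified value it saw.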
Expanding a bit on what #Nickolay mentioned, there are a few options, but the best would be too opinion based to state.
Tungsten (open source)
Tungsten Replicator is an open source replication engine supporting a variety of different extractor and applier modules. Data can be extracted from MySQL, Oracle, and Amazon RDS, and applied to transactional stores, including MySQL, Oracle, and Amazon RDS; NoSQL stores such as MongoDB; and data warehouse stores such as Vertica, Hadoop, and Amazon RDS.
Oracle GoldenGate
Oracle GoldenGate is a comprehensive software package for real-time data integration and replication in heterogeneous IT environments. The product set enables high availability solutions, real-time data integration, transactional change data capture, data replication, transformations, and verification between operational and analytical enterprise systems. It provides a handler for HDFS.
Dell Shareplex
SharePlex™ Connector for Hadoop® loads and continuously replicates changes from an Oracle® database to a Hadoop® cluster. This gives you all the benefits of maintaining a real-time or near-real-time copy of the source tables.
Apache Sqoop is a data transfer tool for moving bulk data from any RDBMS with JDBC connectivity (Oracle included) into Hadoop HDFS.

How to convert Cassandra to the HDFS file system for Shark/Hive queries

Is there any way to expose Cassandra data as HDFS and then perform Shark/Hive queries on that HDFS data?
If yes, kindly provide some links on transforming a Cassandra DB into HDFS.
You can write an identity MapReduce job which takes its input from CFS (the Cassandra file system) and dumps the data to HDFS.
Once you have the data in HDFS, you can map a Hive table onto it and run queries, as sketched below.
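A minimal sketch of that mapping, assuming the identity job wrote tab-delimited text to an illustrative HDFS path; the table, columns, and path are assumptions:

CREATE EXTERNAL TABLE events_from_cfs (
  event_id STRING,
  payload  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/cassandra_dump/events';   -- directory the MapReduce job wrote to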
The typical way to access Cassandra data in Hive is to use the CqlStorageHandler.
For details, see Hive Support for Cassandra CQL3.
But if you have some reasons to access the data directly, take a look at Cassowary. It is a "Hive storage handler for Cassandra and Shark that reads the SSTables directly. This allows total control over the resources used to run ad-hoc queries so that the impact on real-time Cassandra performance is controlled."
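As a rough sketch only: a Hive table mapped onto a Cassandra CQL3 column family through a CqlStorageHandler typically looks something like the following, but the handler class name and SERDEPROPERTIES keys differ between the storage-handler projects linked later in this thread, so treat every identifier here as an assumption to adapt:

CREATE EXTERNAL TABLE cassandra_events (
  event_id STRING,
  payload  STRING
)
STORED BY 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'   -- class name varies by project
WITH SERDEPROPERTIES (
  'cassandra.host'    = '127.0.0.1',
  'cassandra.ks.name' = 'my_keyspace',   -- keyspace
  'cassandra.cf.name' = 'events'         -- column family / table
);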
I think you are trying to run Hive/Shark against data already in Cassandra. If that is the case, then you don't need to access it as HDFS, but you do need a Hive storage handler for Cassandra.
For this you can use Tuplejump's project CASH; the README provides instructions on how to build and use it. If you want to put your "big files" in Cassandra and query them as you would from HDFS, you will need a file system that runs on Cassandra, such as DataStax's CFS (present in DSE) or Tuplejump's SnackFS (present in the Calliope Project early-access repo).
Disclaimer: I work for Tuplejump, Inc.
You can use the Tuplejump Calliope project.
https://github.com/tuplejump/calliope
Configure an external Cassandra table in Shark (like Hive) using the storage handler provided in the Tuplejump code.
All the best!
Three Cassandra Hive storage handlers:
https://github.com/2013Commons/hive-cassandra for Cassandra 2.0 and Hadoop 2
https://github.com/dvasilen/Hive-Cassandra/tree/HIVE-0.11.0-HADOOP-2.0.0-CASSANDRA-1.2.9
https://github.com/richardalow/cassowary, which reads directly from SSTables
