Can I use the PutDatabaseRecord processor to directly upsert into Apache Kudu? - apache-nifi

I'm trying to sync MySQL with Apache Kudu. I used the CaptureChangeMySQL processor to fetch new update/insert records (in JSON format). How can I use PutDatabaseRecord to insert/update that data into Kudu?
Note that I'm syncing at the database level, not just a specific table with a fixed schema.

According to this Apache Kudu doc, you should be able to insert records into a Kudu table using Impala. Depending on the version, you may have automatic access to the table (meaning Impala already "knows about" the Kudu table), or you might need an external table created in Impala that sits atop the Kudu table (see the aforementioned doc). Either way, you should be able to use the Impala JDBC driver in PutDatabaseRecord or any of the SQL-based processors (PutSQL if you need to create the table in your flow, for example).
Alternatively, you can try the PutKudu processor, which has been in Apache NiFi since version 1.4.0 (via NIFI-3973).
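For illustration, here is a rough sketch of the Impala SQL involved; the table name, columns, and values are hypothetical, and the exact CREATE TABLE syntax depends on your Impala and Kudu versions (see the integration doc above):

-- Hypothetical example: map an existing Kudu table into Impala, then upsert.
CREATE EXTERNAL TABLE customers_kudu
STORED AS KUDU
TBLPROPERTIES ('kudu.table_name' = 'customers');

-- UPSERT inserts new rows and updates existing ones by primary key, which
-- matches the insert/update stream coming from CaptureChangeMySQL.
UPSERT INTO customers_kudu (id, name, email)
VALUES (42, 'Jane Doe', 'jane@example.com');

PutDatabaseRecord, pointed at the Impala JDBC driver, would generate statements along these lines from your incoming records.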

Related

Is Hive and Impala integration possible?
After processing data in Hive, I want to store the result data in Impala for better read performance. Is that possible?
If yes, can you please share an example?
Neither Hive nor Impala stores any data itself. The data is stored in HDFS, and both Hive and Impala are used just to query and transform the data present in HDFS.
So yes, you can process the data using Hive and then read it using Impala, provided both of them have been set up properly. But since Impala's metadata needs to be refreshed, you have to run the INVALIDATE METADATA and REFRESH commands.
Impala uses the Hive metastore to read the data. Once you have a table created in Hive, it is possible to read and query it using Impala. All you need to do is refresh the table or run INVALIDATE METADATA in Impala.
Hope this helps :)
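For reference, these are the statements you would run in impala-shell (the table name here is just an example). INVALIDATE METADATA reloads a table's metadata from the metastore, which is needed when the table was just created in Hive, while REFRESH picks up new data files for a table Impala already knows about:

-- 'sales' is a hypothetical table name.
INVALIDATE METADATA sales;   -- after the table is first created or altered in Hive
REFRESH sales;               -- after new data files/partitions are loaded into it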
Hive and Impala are two different query engines, each unique in terms of its architecture and performance. We can use the Hive metastore for metadata while running queries with Impala. A common use case is connecting to Impala/Hive from Tableau. If we are visualizing Hive data from Tableau, we get the latest data without any workaround: if we keep loading data continuously, the metastore is updated as well. Impala, however, is not aware of those changes, so we should run a metadata invalidation query against impalad to refresh its state and sync it with the latest information in the metastore. That way, users get the same results as Hive when they run the same query from Tableau using the Impala engine.
There is currently no configuration parameter to run this invalidation query periodically. This blog describes executing the metadata invalidation query periodically through the Oozie scheduler to handle such problems, or we can simply set up a cron job on the server itself.

Can NiFi's SelectHiveQL read data from a table on a CDH cluster in Parquet format?

I have a use case where I have to move data from an in-house CDH cluster to an AWS EMR cluster.
I am thinking of setting up NiFi on an AWS EC2 instance to move the data from the in-house cluster to AWS S3 storage.
All my tables on the CDH cluster are stored in Parquet format.
Question #1:
Does NiFi have support for reading tables in Parquet format?
Or is my only option to read the data directly from the HDFS directory, place it on S3, and then create a Hive table in EMR?
Question #2: How does NiFi determine which data has been newly inserted into a table so it reads only the new data? In my case, all tables are partitioned by yyyymm.
If you use SelectHiveQL, it can read anything Hive can (including Parquet). All the conversion work is done in Hive and the result is returned through the JDBC driver as a ResultSet, so you'll get the data out as Avro or CSV depending on what you set as the Output Format property in SelectHiveQL.
Having said that, your CDH cluster would need a Hive version of at least 1.2.1; I've seen quite a few compatibility questions where CDH ships Hive 1.1.x, which NiFi's Hive processors do not support. For that you'd need something like the Simba JDBC driver (not the Apache Hive JDBC driver, which doesn't implement all the necessary JDBC methods), and then you can use ExecuteSQL and the other SQL-based processors with that driver.
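As a rough illustration, this is the kind of statement you might put in SelectHiveQL's HiveQL Select Query property; the orders table is hypothetical, and the yyyymm filter matches the partitioning scheme mentioned in the question, which is also how you would pick up only a new month's data:

-- Hypothetical query for SelectHiveQL; Hive reads the Parquet files and the
-- result comes back as a ResultSet, written out by NiFi as Avro or CSV.
SELECT *
FROM orders
WHERE yyyymm = '201801';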

Multi-tenancy implementation with Apache Kudu

I am implementing a big data system using Apache Kudu. The preliminary requirements are as follows:
Support Multi-tenancy
Front end will use Apache Impala JDBC drivers to access data.
Customers will write Spark Jobs on Kudu for analytical use cases.
Since Kudu does not support multi-tenancy out of the box, I can think of the following way to support it.
Proposed approach:
Each table will have a tenantID column, and data from all tenants will be stored in the same table with the corresponding tenantID.
Map the Kudu tables as external tables in Impala. Create views on these tables with a WHERE clause for each tenant, like:
CREATE VIEW IF NOT EXISTS cust1.table AS SELECT * FROM table WHERE tenantid = 'cust1';
Customer1 will access cust1.table for cust1's data using Impala JDBC drivers or from Spark, Customer2 will access cust2.table for cust2's data, and so on.
Questions:
Is this an acceptable way to implement multi-tenancy, or is there a better way to do it (maybe with other external services)?
If implemented this way, how do I restrict Customer2 from accessing cust1.table in Kudu, especially when customers will write their own Spark jobs for analytical purposes?
We had a meeting with Cloudera folks, and the following is the response we received for the questions I posted above.
Questions:
Is this an acceptable way to implement multi-tenancy, or is there a better way to do it (maybe with other external services)?
If implemented this way, how do I restrict Customer2 from accessing cust1.table in Kudu, especially when customers will write their own Spark jobs for analytical purposes?
Answers:
As pointed out by Samson in the comments, Kudu currently has an all-or-nothing access policy (either no access or full access). The suggested option is therefore to access Kudu through Impala.
So instead of each table carrying a tenantID column, each tenant's tables are created separately. These Kudu tables are mapped in Impala as external tables (preferably in a separate Impala database per tenant).
Access to these tables is then controlled using Sentry authorization in Impala.
For Spark SQL access as well, the suggested approach was to make only the Impala tables visible and not access the Kudu tables directly. Authentication and authorization are then handled at the Impala level before Spark jobs are given access to the underlying Kudu tables.
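To make that concrete, here is a sketch of what the per-tenant setup might look like in Impala with Sentry; the database, table, role, and group names are hypothetical, and the exact syntax depends on your Impala and Sentry versions:

-- Map tenant cust1's Kudu table into its own Impala database.
CREATE DATABASE IF NOT EXISTS cust1;
CREATE EXTERNAL TABLE cust1.events
STORED AS KUDU
TBLPROPERTIES ('kudu.table_name' = 'cust1_events');

-- Restrict access with Sentry so only cust1's role can read that database.
CREATE ROLE cust1_role;
GRANT SELECT ON DATABASE cust1 TO ROLE cust1_role;
GRANT ROLE cust1_role TO GROUP cust1_group;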

Import MySQL data to HDFS using Apache NiFi

I am learning Apache NiFi and am currently exploring how to import MySQL data to HDFS using Apache NiFi.
Please guide me on creating the flow by providing a doc for the end-to-end flow.
I have searched several sites, but it's not available.
To import MySQL data, you would create a DBCPConnectionPool controller service, pointing at your MySQL instance, driver, etc. Then you can use any of the following processors to get data from your database (please see the documentation for usage of each):
ExecuteSQL
QueryDatabaseTable
GenerateTableFetch
Once the data is fetched from the database, it is usually in Avro format. If you want it in another format, you will need to use some conversion processor(s) such as ConvertAvroToJSON. When the content of the flow file(s) is the way you want it, you can use PutHDFS to place the files into HDFS.
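As a rough illustration only (NiFi builds the actual statement itself), the incremental queries these processors issue against MySQL look roughly like the following, assuming a hypothetical customers table with an auto-incrementing id configured as the Maximum-value Column:

-- QueryDatabaseTable/GenerateTableFetch keep the largest 'id' seen so far in
-- processor state and fetch only newer rows on each run.
SELECT *
FROM customers
WHERE id > 1500   -- last maximum value recorded in NiFi's state
ORDER BY id;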

Cache a Spark table in the Thrift server

When using Jupyter to cache some data into Spark (using sqlContext.cacheTable), I can see the table cached for the SparkContext running within Jupyter. But now I want to access those cached tables from BI tools via ODBC using the Thrift server. When checking the Thrift server cache, I don't see any table. The question is: how do I get those tables cached so they can be consumed from BI tools?
Do I have to send the same Spark commands via JDBC? In that case, is the context tied to the current session?
Regards,
Miguel
I found the solution. In order to have the tables cached for use with JDBC/ODBC clients via the Thrift server, I have to run CACHE TABLE from one of the clients, for example from Beeline. Once this is done, the table is in memory for all the different sessions.
It is also important to be sure you are using the right Spark Thrift server. To check, just run SHOW TABLES; in Beeline: if you get only one column back, you are not using the Spark Thrift server and CACHE TABLE won't work.
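For reference, a Beeline session connected to the Spark Thrift server would look something like this; the events table is just an example:

-- Against the Spark Thrift server, SHOW TABLES returns several columns
-- (database, tableName, isTemporary); a single column means you are on plain
-- HiveServer2 and CACHE TABLE won't work.
SHOW TABLES;
CACHE TABLE events;            -- caches the table for all Thrift server sessions
SELECT COUNT(*) FROM events;   -- subsequent queries are served from the cache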
