How can I get the Nifi ExecuteSQL processor COUNT(*) Result and use it in PutSQL processor? - apache-nifi

I have a table (T1) that is inserted into frequently, and I want to check whether the insertions were successful. I have another table (T2) to store the row counts of T1.
I used the ExecuteSQL processor to send a SELECT COUNT(*) query against T1. I want to put the result into T2 with the PutSQL processor.
Can I get the result of the query as flow file attributes?

If you are allowed to use a custom processor, the one below does exactly this: it creates FlowFile attributes from the SQL query result set instead of replacing the FlowFile content (the source code is worth a look). There is also an alternative approach using a DistributedMapCache implementation.
UpdateAttributesFromSQL
https://github.com/brettryan/nifi-drunken-bundle
https://mvnrepository.com/artifact/com.drunkendev/nifi-drunken-nar
https://lists.apache.org/thread/dlcfmvs3djpm5hn4rsj11vht0fo4pp11
The alternate approach is described here:
https://community.cloudera.com/t5/Support-Questions/Apache-Nifi-Add-attribute-to-Flow-File-based-on-Sql-query/m-p/349885
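If a custom processor is not an option, the same result can be had with stock processors, using the same pattern described in the MySQL-to-Hive answer further down: ExecuteSQL -> ConvertAvroToJSON -> EvaluateJsonPath (extract the count into an attribute) -> ReplaceText -> PutSQL. As a minimal sketch, assuming EvaluateJsonPath has stored the count in a hypothetical row.count attribute and that T2 has the columns shown (both are assumptions), ReplaceText could rewrite the flow file content to:
INSERT INTO T2 (source_table, row_count, checked_at)
VALUES ('T1', ${row.count}, CURRENT_TIMESTAMP);
PutSQL then executes that statement against the target database.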

Related

NiFi || Can we execute multiple SQL queries in a single database session

I have a requirement in which I have to execute multiple SQL queries in NiFi using the ExecuteSQL processor.
Those queries depend on each other, because I am storing data in a session-based temporary table.
So I need to know whether, if I do so, all the queries will be executed in a single database session.
Example: query A inserts data into Temp_A, and I need that data in the next query, so will that be possible?
Note: I am talking about session-based temporary tables only.
You can use the ExecuteSQL processor, whose SQL Pre-Query property can contain multiple commands separated by semicolons (;).
All of those commands, together with the main SQL query, are executed on the same connection for one flow file.
Note that the same SQL connection may also be reused for the next flow file.
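As a minimal sketch of how the two properties could be configured (the table names, columns, and exact temporary-table syntax depend on your database and are assumptions here):
SQL Pre-Query:
CREATE TEMPORARY TABLE Temp_A AS SELECT * FROM source_table WHERE load_date = CURRENT_DATE;
SQL select query (runs on the same connection, so it can see Temp_A):
SELECT t.id, t.amount FROM Temp_A t JOIN lookup_table l ON l.id = t.id;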

How to import data from MySQL into Hive using Apache NiFi?

I am trying to import data from MySQL into Hive using the QueryDatabaseTable and PutHiveQL processors, but an error occurs.
I have some questions:
What is the output format of PutHiveQL?
Should the output table be created beforehand, or will the processor do that?
Where can I find a template for the MySQL-to-Hive process?
Here is some information about your questions:
The flow files input to PutHiveQL are output after they have been sent to Hive (or if the send fails), so the output format (and contents) are identical to the input format/contents.
The output table should be created beforehand, but you could send PutHiveQL a "CREATE TABLE IF NOT EXISTS" statement first and it will create the table for you.
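For example, a pre-create statement sent to PutHiveQL could look like the following (the database, table, and column names are hypothetical):
CREATE TABLE IF NOT EXISTS mydb.mytable (
  id INT,
  name STRING
);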
I'm not aware of an existing template, but a basic approach could be the following:
QueryDatabaseTable -> ConvertAvroToJSON -> SplitJson -> EvaluateJsonPath -> UpdateAttribute (optional) -> ReplaceText -> PutHiveQL
QueryDatabaseTable will do incremental fetches of your MySQL table.
ConvertAvroToJSON will get the records into a format you can manipulate (there currently aren't many processors that handle Avro).
SplitJson will create a flow file for each of the records/rows.
EvaluateJsonPath can extract values from the records and put them in flow file attributes.
UpdateAttribute could add attributes containing type information. This is optional, used if you are using prepared statements for PutHiveQL.
ReplaceText builds a HiveQL statement (an INSERT, for example), either with parameters (if you want prepared statements) or with values hard-coded from the attributes (see the sketch after this list).
PutHiveQL executes the statement(s) to get the records into Hive.
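A minimal sketch of what ReplaceText could emit for each record, assuming hypothetical table, column, and attribute names; the first form interpolates attributes extracted by EvaluateJsonPath, the second is the prepared-statement form, which relies on the argument attributes (hiveql.args.N.type / hiveql.args.N.value) set upstream as mentioned under UpdateAttribute:
-- values hard-coded from flow file attributes:
INSERT INTO TABLE mydb.mytable VALUES (${id}, '${name}');
-- prepared-statement form for PutHiveQL:
INSERT INTO TABLE mydb.mytable VALUES (?, ?);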
In NiFi 1.0 there will be a ConvertAvroToORC processor, which is a more efficient way to get data into Hive (as well as to query it from Hive). That approach converts the results of QueryDatabaseTable to ORC files, which are then placed in HDFS (using PutHDFS), and it generates a partial Hive DDL statement to create the table for you (using the type information from the Avro records). You pass that statement (after filling in the target location) to PutHiveQL, and you can immediately start querying your table.
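As an illustration, the completed DDL passed to PutHiveQL could look something like this (the column list would come from the Avro schema; the table name and location here are hypothetical):
CREATE EXTERNAL TABLE IF NOT EXISTS mytable (id INT, name STRING)
STORED AS ORC
LOCATION '/path/to/orc/files';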
There are also plans for a PutHiveStreaming processor which takes Avro records as input, so that flow would just be QueryDatabaseTable -> PutHiveStreaming, which would insert the records directly into Hive (and is much more efficient than multiple INSERT statements).

Hive count(*) query is not invoking MapReduce

I have external tables in Hive, and I am trying to run a SELECT COUNT(*) FROM table_name query, but the query returns instantly with a result that I think is already stored. The result returned by the query is not correct. Is there a way to force a MapReduce job so the query actually executes each time?
Note: this behavior does not occur for all external tables, only some of them.
Versions used: Hive 0.14.0.2.2.6.0-2800, Hadoop 2.6.0.2.2.6.0-2800 (Hortonworks)
After some digging I found a method that kicks off a MapReduce job to count the number of records in an ORC table:
ANALYZE TABLE table_name PARTITION(partition_columns) COMPUTE STATISTICS;
-- OR
ANALYZE TABLE table_name COMPUTE STATISTICS;
This is not a direct alternative to COUNT(*), but it refreshes the statistics so that they reflect the latest record count for the table.
Doing a wc -l on ORC data won't give you an accurate result, since the data is encoded. That would only work if the data were stored in a simple text file format with one row per line.
Hive does not need to launch a MapReduce job for COUNT(*) on an ORC table, since it can use the ORC metadata to determine the total count.
Use the orcfiledump command to analyse ORC data from the command line
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-ORCFileDumpUtility
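For example (the file path is hypothetical): hive --orcfiledump /apps/hive/warehouse/mytable/000000_0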
From personal experience, COUNT(*) on an ORC table usually returns wrong figures, i.e. it returns the number of rows in the first data file only. If the table was fed by multiple INSERTs then you are stuck.
With V0.13 you could fool the optimizer into running a dummy M/R job by adding a dummy "where 1=1" clause; it takes much longer, but it actually counts the rows.
With 0.14 the optimizer got smarter, so you must add a non-deterministic clause, e.g. "where MYKEY is null". That assumes MYKEY is a string; otherwise the "is null" clause may crash your query, which is another ugly ORC bug.
By the way, a SELECT DISTINCT on partition key(s) will also return wrong results: all existing partitions will be shown, even the empty ones. That one is not specific to ORC.
Please try the following: run set hive.fetch.task.conversion=none; in your Hive session and then trigger the SELECT COUNT(*) in the same session to force a MapReduce job.
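In other words (the table name is hypothetical):
set hive.fetch.task.conversion=none;
SELECT COUNT(*) FROM table_name;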

Finding unmatched records using pig script

In my POC, I am trying to implement an ETL data flow (star schema) using a Pig script. As you all know, before loading into the fact table I would like to load the dimension. For the dimension I need to load only the new records from the source (CSV file), i.e. records which are not already in the dimension (SQL Server). All the joins in Pig (skewed, replicated and merge joins) match against the existing records and produce only matched records. Can you please tell me how to get the unmatched records as output so that I can load them into my dimension?
Thanks
Selvam
Do a left outer join of the source (CSV file) with the dimension (SQL Server) table. The resulting records whose dimension-side join column is null are the new records, so filter for records where that join column is null.
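Conceptually, the logic looks like this (written as SQL for clarity; in Pig the same shape is a LEFT OUTER JOIN followed by a FILTER on the null join key, and the table and column names here are hypothetical):
SELECT s.*
FROM source_csv s
LEFT OUTER JOIN dimension d ON s.business_key = d.business_key
WHERE d.business_key IS NULL;  -- unmatched rows are the new dimension records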

Hive : inserting into multiple tables based on query result

I am trying to run a Hive query to filter out invalid records. Here is what I am doing:
1. Load the CSV file into a single-column table.
2. Define a UDF my_validation to validate each record.
3. Execute the query:
FROM pgstg
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/validrecords.out'
  SELECT * WHERE my_validation(record) IS NOT NULL
INSERT OVERWRITE TABLE PGERR
  SELECT record WHERE my_validation(record) IS NULL;
Here are my questions:
a. Is there a better way to filter invalid records?
b. Does the my_validation UDF run twice on the whole table?
c. What is the best way to split a single column into multiple columns?
Thanks much for your help.
To answer your questions:
1) If you have custom validation criteria, a UDF is probably the way to go. If I were doing it, I would create an is_valid UDF that returns a boolean (instead of returning NULL vs. not NULL); see the sketch after this list.
2) Yes, the UDF does get run twice.
3) Glad you asked. Look at the explode function available in Hive; an example is sketched below.
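A minimal sketch of both suggestions, assuming a hypothetical boolean is_valid UDF and comma-delimited records (the output column names are also hypothetical):
-- 1) the multi-insert rewritten with the boolean UDF:
FROM pgstg
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/validrecords.out'
  SELECT * WHERE is_valid(record)
INSERT OVERWRITE TABLE PGERR
  SELECT record WHERE NOT is_valid(record);
-- 3) split a single column into multiple columns with split(); explode (via LATERAL VIEW)
--    turns the resulting array into rows instead:
SELECT split(record, ',')[0] AS col1,
       split(record, ',')[1] AS col2,
       split(record, ',')[2] AS col3
FROM pgstg;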
