ExecuteSQL does nothing - hadoop

I am trying to fetch data from an Oracle database through NiFi. In the canvas, I put a "GenerateFlowFile" processor with a file size of 0 KB, scheduled to run every 5 min, just to trigger the "ExecuteSQL" processor on success. For "ExecuteSQL", I set the DB Connection Pooling Service to DBCPConnectionPool and entered the SQL query "SELECT * FROM SOMETABLE". My DBCPConnectionPool configuration is as follows:
URL = jdbc:oracle:thin:@hostname:port:sid
Driver = oracle.jdbc.driver.OracleDriver
Jar URL = file:///somelocation/ojdbc6.jar
User = someuser
Password = somepassword
When I try to run it, nothing happens. The red box becomes green and a number 1 appears in the top right corner of the "ExecuteSQL" processor, but nothing happens. Then, when I stop it, the Active Threads count is still 1.
Can you please advise me, as I am new to this? Thank you.

Since the original post has been answered, I'll respond to the follow-up question asked in its comments:
You can set the GenerateFlowFile processor to run every 30 seconds or so, then start and immediately stop it. This will cause ExecuteSQL to run exactly once, fetching all rows.
Alternatively (in NiFi 0.6.0+) you can use the QueryDatabaseTable processor, which will fetch all the rows the first time, but thereafter (based on a maximum-value column such as an increasing primary key) only return rows as they are added.
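As a rough sketch (the table and column names are only placeholders), the relevant QueryDatabaseTable properties might look like:
Database Connection Pooling Service = DBCPConnectionPool
Table Name = SOMETABLE
Maximum-value Columns = ID
On the first run every row is returned; on later runs the processor remembers the largest ID it has seen and fetches only rows with a greater value.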

Related

Nifi processor runs recursively

I am exploring NiFi. I have created a process group with a number of processors that select data from an Oracle DB and insert it into MongoDB. The flow works as expected.
The flow is QueryDatabaseTable -> SplitAvro -> ConvertAvroToJSON -> PutMongoRecord
In QueryDatabaseTable I have the custom query select * from employee, which gives me 100 records, and these 100 records are inserted into MongoDB. The issue is that QueryDatabaseTable is called again and again, so the same 100 records get added to MongoDB over and over. Is there any way to stop this repeated execution? Thanks in advance.
Update: I am using NiFi 1.9.2
(Screenshots of the QueryDatabaseTable Scheduling and Properties tabs, and an Update 2 showing the configuration, accompanied the original question but are not reproduced here.)
Use maximum-value columns if you want to prevent duplicate selection.
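For example, assuming the employee table has an increasing primary key column named EMP_ID (a hypothetical name), setting Maximum-value Columns = EMP_ID means that, conceptually, after the first run the processor issues something like:
select * from employee where EMP_ID > 100
so only rows added since the last run are returned, and the repeated inserts into MongoDB stop.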

How to run a processor only when another processor has finished its execution?

I'm migrating a table (2 million rows) from DB2 to SQL Server. I'm using the following flow:
ExecuteSQL (to select records from the DB2 table).
SplitAvro (to split the records; I configured it with Output Size = 1 so that if one record fails, the rest are still inserted without problems).
PutDatabaseRecord (to insert the records into the SQL Server table).
ExecuteSQL (I need to call a stored procedure that executes UPDATE statements against the same table that PutDatabaseRecord writes to).
The problem is that the second ExecuteSQL runs before PutDatabaseRecord has finished inserting all the records.
How can I tell NiFi to run that processor only when the other one finishes?
Thanks in advance!
After PutDatabaseRecord you can use MergeContent in Defragment mode to undo the split operation performed by SplitAvro. This way a single flow file will come out of MergeContent only when all splits have been seen, and at that point you know it's time for the second ExecuteSQL to run.
The answer provided by @bryan-bende is great, as it is simple and elegant. If that doesn't work for some reason, you could also look at Wait/Notify. Having said that, Bryan's answer is simpler and probably more robust.
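A minimal sketch of Bryan's suggestion, assuming the default fragment.* attributes that SplitAvro writes: route the success relationship of PutDatabaseRecord into a MergeContent configured with
Merge Strategy = Defragment
Merge Format = Avro
In Defragment mode, MergeContent holds the incoming splits until it has seen fragment.count of them (matched on fragment.identifier), then emits one merged flow file, which can be wired to the second ExecuteSQL. The merged content itself no longer matters here, since the flow file is only used as a trigger.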

Pull Data from Hive to SQL Server without duplicates using Apache Nifi

Sorry, I'm new to Apache NiFi. I made a data flow that pulls data from Hive and stores it in SQL Server. There is no error in my data flow; the only problem is that it pulls data repeatedly.
My data flow consists of the following:
SelectHiveQL
SplitAvro
ConvertAvroToJSON
ConvertJSONToSQL
PutSQL
For example, my table in Hive has only 20 rows, but when I run the data flow and check my table in MS SQL, it has saved 5,000 rows. SelectHiveQL pulled the data repeatedly.
What do I need to do so it will only pull 20 rows, or exactly the number of rows in my Hive table?
Thank you
SelectHiveQL (like many NiFi processors) runs on a user-specified schedule. To get a processor to only run once, you can set the run schedule to something like 30 sec, then start and immediately stop the processor. The processor will be triggered once, and stopping it does not interrupt that current execution, it just causes it not to be scheduled again.
Another way might be to set the run schedule to something very large, such that it would only execute once per some very long time interval (days, years, etc.)
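As a concrete sketch of the first approach (the values are only examples), on the SelectHiveQL Scheduling tab set:
Scheduling Strategy = Timer driven
Run Schedule = 30 sec
Start the processor, wait for the single flow file to appear in the outbound connection, then stop it; the 30-second schedule leaves enough time to stop the processor before a second execution is triggered.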

Hive UPDATE, INSERT, DELETE

I have been trying to implement the UPDATE, INSERT, and DELETE operations on a Hive table as per the instructions. But whenever I include the properties that enable this, i.e. the configuration values set for INSERT, UPDATE, DELETE:
hive.support.concurrency = true (default is false)
hive.enforce.bucketing = true (default is false)
hive.exec.dynamic.partition.mode = nonstrict (default is strict)
then show tables on the Hive shell takes 65.15 seconds, where it normally runs in 0.18 seconds without the above properties. Apart from show tables, the rest of the commands give no output, i.e. they keep on running unless I kill the process. Could you tell me the reason for this?
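For reference, the Hive transactions documentation sets these per session in the Hive shell, together with a transaction manager, which the question does not mention configuring:
set hive.support.concurrency=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;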
Hive is not an RDBMS. A query that ran for 2 minutes may run for 5 minutes under the same configuration; neither Hive nor Hadoop guarantees how long a query will take to execute. Also, please include whether you are running a single-node or multi-node cluster, and the size of the data you are querying; the information you have provided is insufficient. But don't come to any conclusion based on execution time alone, because many factors such as disk, CPU slots, network, etc. are involved in deciding the run time of a query.

AWS Hive + Kinesis on EMR = Understanding check-pointing

I have an AWS Kinesis stream and I created an external table in Hive pointing at it. I then created a DynamoDB table for the checkpoints, and in my Hive query I set the following properties, as described here:
set kinesis.checkpoint.enabled=true;
set kinesis.checkpoint.metastore.table.name=my_dynamodb_table;
set kinesis.checkpoint.metastore.hash.key.name=HashKey;
set kinesis.checkpoint.metastore.range.key.name=RangeKey;
set kinesis.checkpoint.logical.name=my_logical_name;
set kinesis.checkpoint.iteration.no=0;
I have the following questions:
Do I always have to start with iteration.no set to 0?
Does this always start from the beginning of the script (oldest Kinesis record about to be evicted)?
Imagine I set up a cron to schedule the execution of the script, how do I retrieve the 'next' iteration number?
To re-execute the script on the same data, is it enough to re run the query with the same execution number?
If I execute select * from kinesis_ext_table limit 100 with iteration.no=0 over and over, will I get different/weird results once the first Kinesis records start to be evicted?
Given the DynamoDB checkpoint entry:
{"startSeqNo":"1234",
"endSeqNo":"5678",
"closed":false}
What's the meaning of the closed field?
Are sequence number incremental and is there a relation between the start and end (EG: end - start = number of records read)?
I noticed that sometimes there is only the endSeqNum (no startSeqNum), how should I interpret that?
I know that it's a lot of questions but I could not find these answers on the documentation.
Check out the Kinesis documentation and the Kinesis Storage Handler Readme, which contain answers to many of your questions.
Do I always have to start with iteration.no set to 0?
Yes, unless you are doing some advanced logic which requires you to skip a known or already-processed part of the stream.
Does this always start from the beginning of the script (oldest Kinesis record about to be evicted)?
Yes
Imagine I set up a cron to schedule the execution of the script, how do I retrieve the 'next' iteration number?
This is handled by the hive script, since it is querying all data in the kinesis stream at each run
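For instance, the cron set-up asked about above might simply re-run the same script on each tick (the path and schedule here are invented):
0 * * * * hive -f /home/hadoop/kinesis_query.q
Each run re-issues the set statements from the question and queries whatever is in the stream's retention window at that time.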
To re-execute the script on the same data, is it enough to re run the query with the same execution number?
As Kinesis data is a 24-hour time window, the data has (possibly) changed since your last query, so you probably would want to query all records again in the Hive job
If I execute select * from kinesis_ext_table limit 100 with iteration.no=0 over and over, will I get different/weird results once the first Kinesis records start to be evicted?
Yes, you would expect the results to change as the stream changes
Given the DynamoDB checkpoint entry:
What's the meaning of the closed field?
Although this is an internal detail of the Kinesis Storage Handler, I believe this indicates whether the shard is a parent shard, i.e. whether it is open and accepting new data or closed and no longer accepting new data. If you have scaled your stream up or down, parent shards exist for 24 hours and contain all the data from before you scaled, but no new data will be inserted into them.
Are sequence number incremental and is there a relation between the start and end (EG: end - start = number of records read)?
The only guidance Amazon provides on this is that sequence numbers generally increase over time.
I noticed that sometimes there is only the endSeqNum (no startSeqNum), how should I interpret that?
This means the shard is open and still accepting new data (not a parent shard)
