I've got an issue with FlowFiles passing through a MergeRecord processor.
Here is the flow (click on link for image):
Flow Queue
I've tried most permutations of the configuration settings but can't seem to get FlowFiles out of the queue no matter what I do:
MergeRecord Configuration
Does anyone know what could be blocking this MergeRecord from passing FlowFiles? The FlowFiles are currently "text" files; would they need to be JSON for MergeRecord to group them correctly?
The merge is correlating on TableName, meaning it will only merge FlowFiles where the TableName attribute has the same value.
However, you only have 10 total bins, meaning that if 10 FlowFiles come in with TableName values table1 through table10, you have maxed out your bins, so any FlowFiles with table11, table12, table13, etc. aren't going to get merged until a bin frees up. They will just sit in the queue and wait.
Further, your merge config is only set with Min Records 1 and Max Records 1000 and no Max Bin Age, meaning a bin won't be flushed until it accumulates 1000 records with the same TableName; only then are those files merged and the bin released.
With 5000 FlowFiles making up 3 MB, I'm going to assume there aren't many records per FlowFile, so you aren't reaching 1000 records in any bin and releasing it.
So, double-check that your TableName attribute is being set as you expect, and consider modifying the settings that control when a merge happens. You could lower the Max Records from 1000 so bins fill sooner, you could add a Max Size, or you could add a Max Bin Age to time-bound it.
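For reference, a MergeRecord configuration along these lines would release a bin on whichever threshold is hit first. The values here are only illustrative, not taken from your screenshot:

Merge Strategy = Bin-Packing Algorithm
Correlation Attribute Name = TableName
Minimum Number of Records = 1
Maximum Number of Records = 100
Maximum Bin Size = 1 MB
Max Bin Age = 30 sec
Maximum Number of Bins = 50

With a Max Bin Age set, even a bin that never fills up will still be flushed after 30 seconds.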
Related
I was inserting batches of data into CockroachDB, multiple batches (up to 10) in a single transaction. After a couple of batches the insert failed with "message size 50 MiB bigger than maximum allowed message size 16 MiB", which is expected: this batch contained a record with an outsized string.
I added a line to the transaction to update the max_read_buffer_size cluster setting to 100 MiB, but I'm still getting the error.
max_read_buffer_size is a cluster setting, and cluster settings cannot be changed inside a transaction. Make sure you update the setting outside of the transaction.
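A minimal sketch of what that looks like, run as its own statement rather than inside your BEGIN ... COMMIT block (the setting name is taken from your post; changing cluster settings also requires the appropriate privileges):

-- run on its own, outside any explicit transaction
SET CLUSTER SETTING max_read_buffer_size = '100MiB';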
I use ExecuteSQLRecord to run a query and write to CSV format. The table has 10M rows. Although I can split the output into multiple flow files, the query is executed by only a single thread and is very slow.
Is there a way to partition the query into multiple queries so that the next processor can run multiple concurrent tasks, each one processing one partition? It would be like:
GenerateTableFetch -> ExecuteSQLRecord (with concurrent tasks)
The problem is that GenerateTableFetch only accepts a table name as input; it does not accept custom queries.
Please advise if you have solutions. Thank you in advance.
You can increase the concurrency on NiFi processors (by increasing the number of Concurrent Tasks), and you can also increase the throughput; sometimes it works:
Also, if you are running a cluster, you can apply load balancing on the queue before the processor so that it distributes the workload among the nodes of your cluster (set the load balance strategy to Round Robin):
Check this YouTube channel for NiFi anti-patterns (there is a video on concurrency): Nifi Notes
Please clarify your question if I didn't answer it.
Figured out an alternative way. I developed an Oracle PL/SQL function which takes a table name as an argument and produces a series of queries like "SELECT * FROM T1 OFFSET x ROWS FETCH NEXT 10000 ROWS ONLY". The number of queries is based on the number of rows in the table, which is a statistic from the catalog views. If the table has 1M rows and I want 100k rows in each batch, it produces 10 queries. I use ExecuteSQLRecord to call this function, which effectively does the job of the GenerateTableFetch processor. My next processor (e.g. ExecuteSQLRecord again) can now have 10 concurrent tasks working in parallel.
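Not the actual function, but a minimal sketch of the same idea in plain SQL, assuming Oracle 12c+ (for OFFSET/FETCH), a table T1, a 100k batch size, and a reasonably fresh num_rows statistic:

-- generate one paginated query per 100k-row batch, based on the row-count statistic
WITH cnt AS (
  SELECT CEIL(num_rows / 100000) AS batches
  FROM   user_tables
  WHERE  table_name = 'T1'
)
SELECT 'SELECT * FROM T1 OFFSET ' || (LEVEL - 1) * 100000 ||
       ' ROWS FETCH NEXT 100000 ROWS ONLY' AS batch_query
FROM   cnt
CONNECT BY LEVEL <= cnt.batches;

The PL/SQL function can wrap the same idea so ExecuteSQLRecord only needs to pass in the table name and batch size.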
My current flow in NiFi is:
ListHDFS -> FetchHDFS -> SplitText -> JoltTransformJSON -> PutHBaseJSON
Hourly input JSON files would be a max of 10 GB in total.
A single file is 80-100 MB.
SplitText & JoltTransformJSON transform the text and send it on as 4 KB files, so the hourly job takes anywhere from 50 minutes to 1 hour 20 minutes to complete the flow. How can I make this faster? What would be the best flow to handle this use case?
I have tried to use MergeContent, but it didn't work out well.
Thanks All
You can use a MergeRecord processor after the JoltTransformJSON processor and set the Maximum Number of Records so that FlowFiles become eligible to merge into a single FlowFile.
Use the Max Bin Age property as a fallback to force a bin to be merged after a given time.
Then use the record-oriented processor for HBase, i.e. the PutHBaseRecord processor, configure your Record Reader controller service (JsonTreeReader) to read the incoming FlowFile, and tune the Batch Size property value to get maximum performance.
With this approach we process chunks of records at a time, which increases the performance of storing data into HBase.
Flow:
ListHDFS -> FetchHDFS -> SplitText -> JoltTransformJSON -> MergeRecord -> PutHBaseRecord
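As a rough starting point, the two new processors might be configured something like this (values are illustrative only and will need tuning for your data; anything in angle brackets is a placeholder for your own names):

MergeRecord:
  Record Reader = JsonTreeReader
  Record Writer = JsonRecordSetWriter
  Merge Strategy = Bin-Packing Algorithm
  Minimum Number of Records = 10000
  Max Bin Age = 1 min

PutHBaseRecord:
  HBase Client Service = <your HBase client controller service>
  Record Reader = JsonTreeReader
  Table Name = <your table>
  Row Identifier Field Name = <your row key field>
  Column Family = <your column family>
  Batch Size = 1000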
Refer to these links for MergeRecord configs and Record Reader configs.
Is there a way to get the fragment index from the SplitRecord processor in NiFi? I am splitting a very big XLS (4 million records) with "Records Per Split" = 100000.
Now I want to process just the first 2 splits, to check the quality of the file, and reject the rest of the file.
I can see the fragment index in other split processors (e.g. SplitJson), but not in SplitRecord. Any other hack?
Method1:
We can achieve this by using the ControlRate processor.
ControlRate Processor:
With these configs we are releasing 2 FlowFiles every minute.
Flow:
Configure the queue expiration to something like 10 sec (or a lower number if you need); the remaining FlowFiles will expire in the queue, but the first 2 FlowFiles will have been released.
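For reference, a ControlRate configuration along these lines releases 2 FlowFiles per minute (illustrative values matching the description above):

Rate Control Criteria = flowfile count
Maximum Rate = 2
Time Duration = 1 min

On the connection feeding ControlRate, set FlowFile Expiration = 10 sec so the FlowFiles that are not released in time simply expire.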
Method2:
Use the SplitText processor, then a RouteOnAttribute processor, and add a new property such as
${fragment.index:le(2)}
The above Expression Language allows only the first 2 fragment indexes through.
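In RouteOnAttribute that would look something like this (the property name first_two_splits is arbitrary):

Routing Strategy = Route to Property name
first_two_splits = ${fragment.index:le(2)}

FlowFiles that match go to the first_two_splits relationship; route that onward and auto-terminate (or expire) the unmatched relationship.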
Refer to this link for splitting a big file in NiFi.
I'm evaluating NiFi for our ETL process.
I want to build the following flow:
Fetch a lot of data from a SQL database -> split into chunks of 1000 records each -> count error records in each chunk -> count the total number of error records -> if it exceeds a threshold, fail the process -> else save each chunk to the database.
The problem I can't resolve is how to wait until all chunks are validated. If, for example, I have 5 validation tasks working concurrently, I need some kind of barrier to wait until all chunks are processed, and only after that run the error-count processor, because I don't want to save invalid data and then have to delete it if the threshold is reached.
The other question I have is whether it is possible to run this validation processor on multiple nodes in parallel and still be able to wait until they have all completed.
One solution is to use the ExecuteScript processor as a "relief valve" that holds a simple count in memory, triggered off the first receipt of a FlowFile with a specific attribute value (stored in local/cluster state as, essentially, a map of key attribute value to count). Once that count reaches a threshold, you can generate a new FlowFile containing the attribute value that has finished and route it to the success relationship. Meanwhile, send the other results (the FlowFiles that need to be batched) to a MergeContent processor and set the minimum batching size to whatever you like. The follow-on processor to the valve should have its Scheduling Strategy set to Event Driven so it only runs when it receives a FlowFile from the valve.
Updating a count in the DistributedMapCache is not the correct way, as fetch and update are separate operations and cannot be made atomic; there is no processor that just increments a count.
http://apache-nifi-users-list.2361937.n4.nabble.com/How-do-I-atomically-increment-a-variable-in-NiFi-td1084.html