How to update the configuration of an apache nifi processor without stopping it? - apache-nifi

Good morning. I'm using Apache NiFi and I wonder if anyone knows of a way to change the configuration of a processor without having to stop it, or some viable alternative that prevents the loss of information.
Thanks

The configuration of a processor cannot be changed while the processor is running, and this is intentional. It gives the developer of a processor a guarantee that, inside the onTrigger method, all the properties still have the same values that passed validation when the processor was started.
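To illustrate what that guarantee buys you, here is a minimal sketch of a custom processor reading a property inside onTrigger. The processor class, the "Endpoint URL" property and the relationship are hypothetical examples; only the framework classes (AbstractProcessor, PropertyDescriptor, ProcessContext, ProcessSession) come from the NiFi API.

import org.apache.nifi.components.PropertyDescriptor;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.processor.util.StandardValidators;

public class MyExampleProcessor extends AbstractProcessor {

    // Hypothetical property, used only for illustration.
    static final PropertyDescriptor ENDPOINT_URL = new PropertyDescriptor.Builder()
            .name("Endpoint URL")
            .description("Where to send the data")
            .required(true)
            .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
            .build();

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("FlowFiles that were processed")
            .build();

    // A real processor would also override getSupportedPropertyDescriptors()
    // and getRelationships() to expose the property and relationship above.

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        // This value cannot change while the processor is running, so it is the
        // same validated value on every invocation of onTrigger.
        final String url = context.getProperty(ENDPOINT_URL).getValue();
        getLogger().info("Would send " + flowFile + " to " + url);
        session.transfer(flowFile, REL_SUCCESS);
    }
}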
If you can describe your use-case more we might be able to come up with alternative approaches.

There is an alternative solution: duplicate the processor and update the duplicate's configuration to the desired values. Connect the output of the duplicate to the next processor. Then stop the original processor, route its queued connection to the duplicate, and start the duplicate.
One way or another the data flow has to be interrupted, but this way the changes that take the most time to make in the processor can be prepared in the duplicate first, reducing the impact of the interruption as much as possible.
Regards

Related

How can I track an event across multiple resources in gem5?

I would like to know if there is a proper method to track memory accesses
across multiple resources at once. For example, I set up a simple dual-core CPU by extending the simple.py from learning gem5 (I just added another TimingSimpleCPU and made the port connections).
I took a look at the different debug options and found for example the
MemoryAccess flag (and others), but this seemed to only show the accesses at
the DRAM or one other resource component.
Nevertheless, I imagine there is a way to track events across the CPU, the bus and finally the memory.
Does this feature already exist?
What can I try next? Is it an idea to add my own --debug-flag, or can I work with the TraceCPU for my specific use case?
I haven't worked much with gem5 yet, so I'm not sure how to achieve this. Since until now I have only run in SE mode, would FS mode be a solution?
Finally I also found the TraceCPUData flag in the --debug-flags, but running
this with my config script created no output (like many other flags btw. ...).
It seems that this is a --debug-flag for the TraceCPU; what kind of output does this flag create, and can it help me?

An issue of partial insertion of data into the target when job fails

We have a 17-record data set in one of the source tables, with erroneous data in the 14th record that causes the job to fail. Because the commit size on the tMysqlOutput component is set to 10, only the first 10 records are inserted into the target before the job fails. In the next execution, after correcting the erroneous record, the job fetches all 17 records and completes successfully, which leaves duplicates in the target.
What we tried:
To overcome this, we tried the tMysqlRollback component together with the tMysqlConnection and tMysqlCommit components.
Q1: Is there any other option to use tMysqlRollback without using the tMysqlConnection and tMysqlCommit components?
We explored the tMysqlRollback and tMysqlCommit components in the documentation
https://help.talend.com/reader/QgrwjIQJDI2TJ1pa2caRQA/7cjWwNfCqPnCvCSyETEpIQ
but are still looking for a clue on how to design the above process in an efficient manner.
Q2: Also, we'd like to know about the RAM usage and disk space consumption from a performance perspective.
Any help on this would be much appreciated.
No, the only way to do transactions in Talend is to open a connection using tMysqlConnection, then either commit using a tMysqlCommit or rollback using tMysqlRollback.
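Conceptually, those three components correspond to the standard JDBC transaction pattern, so the whole batch is either committed or rolled back as one unit, which avoids the partial-insert problem from the question. A rough plain-JDBC sketch of the same idea (the URL, credentials and table are placeholders, not anything Talend generates):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransactionSketch {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/target_db", "user", "password")) {
            conn.setAutoCommit(false);               // roughly what a shared tMysqlConnection does
            try (PreparedStatement ps =
                     conn.prepareStatement("INSERT INTO target_table (id, value) VALUES (?, ?)")) {
                for (int i = 1; i <= 17; i++) {
                    ps.setInt(1, i);
                    ps.setString(2, "row " + i);
                    ps.addBatch();
                }
                ps.executeBatch();
                conn.commit();                       // tMysqlCommit: all 17 rows land, or none
            } catch (SQLException e) {
                conn.rollback();                     // tMysqlRollback: nothing is left half-inserted
                throw e;
            }
        }
    }
}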
Without knowing what you're doing in your job (lookups, transformations, etc.), it's hard to advise you on the RAM consumption and performance. But if you only have a source-to-target flow, then RAM consumption should be minimal (make sure you enable "stream" on the tMysqlInput component). If your source is another database, then RAM consumption depends on how that database's driver is configured (JDBC drivers usually accept a parameter telling them to fetch only a certain number of records at a time).
Lookups and components that process data in memory (tSortRow, tUniqRow, tAggregateRow, etc.) are what cause memory issues, but it's possible to tweak their usage (using disk among other methods).
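As an illustration of that driver-side fetch parameter, here is a hedged plain-JDBC sketch of reading from MySQL with a bounded fetch size. The useCursorFetch URL option with a positive fetch size is one common way to do this with Connector/J; the connection details and table are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingReadSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/source_db?useCursorFetch=true", "user", "password");
             Statement stmt = conn.createStatement(
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            stmt.setFetchSize(1000);                 // fetch 1000 rows at a time instead of the whole table
            try (ResultSet rs = stmt.executeQuery("SELECT id, value FROM source_table")) {
                while (rs.next()) {
                    // process one row at a time; memory usage stays bounded
                    System.out.println(rs.getInt("id") + " -> " + rs.getString("value"));
                }
            }
        }
    }
}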

HBase - hotspotting check

I am using HBase, and I suspect that my rowkey design has caused hotspotting. Before trying to salt the rowkey, I would like to check whether hotspotting has already occurred. Is there any way in HBase to analyze the data distribution across the region servers to check whether hotspotting has occurred?
Thanks,
Partha
You can use the HMaster Info Web UI to detect this.
It should be http://master-address:16010 by default.
If it's not available, check that the UI is not disabled in the configuration (hbase-site.xml) and make sure that hbase.master.info.port is not set to -1.
When you are on it, you have to click on the table that you want to check.
You will be on this page
https://docs.prediction.io/images/cloudformation/hbase-32538c47.png
Then if you see that one region server has a lot more regions than the others, this is a good hint that one of your region servers is probably hotspotted.
It means that the regions in this part of the rowkey range are split more often! The requests per second can also be an indicator, but in my experience it's not always very accurate.
But these are just hints, and the only simple, reliable way I know to be sure that a hotspot is occurring is to benchmark it, because when it happens the write performance is really, REALLY different. So you should measure the throughput you get with a hashed rowkey on the same data and then compare. You'll see very quickly whether there is a hotspot.
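If you prefer to check the region distribution programmatically rather than through the Master UI, here is a small sketch using the HBase client API to count how many regions of one table each region server hosts. The table name is a placeholder, and keep in mind this shows region placement only, not the per-region request load.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;

public class RegionDistributionCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             RegionLocator locator = connection.getRegionLocator(TableName.valueOf("my_table"))) {
            List<HRegionLocation> locations = locator.getAllRegionLocations();
            // Count regions per region server; a strong skew hints at a bad rowkey.
            Map<ServerName, Integer> regionsPerServer = new HashMap<>();
            for (HRegionLocation location : locations) {
                regionsPerServer.merge(location.getServerName(), 1, Integer::sum);
            }
            regionsPerServer.forEach((server, count) ->
                    System.out.println(server.getServerName() + " hosts " + count + " region(s)"));
        }
    }
}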

Nifi processor batch insert - handle failure

I am currently in the process of writing an Elasticsearch NiFi processor. Individual inserts/writes to ES are not optimal; batching documents is preferred instead. What would be considered the optimal approach within a NiFi processor to track (batch) documents (FlowFiles) and, once a certain amount is reached, batch them in? The part I am most concerned about is what happens if ES is unavailable, down, behind a network partition, etc., and the batch does not succeed. The main point of the question: given that NiFi has content storage for queuing, back-pressure, etc., is there a preferred method for using it to ensure no FlowFiles get lost if a destination is down? Maybe there is another processor I should look at for an example?
I have looked at the Mongo processor, Merge, etc. to try and get an idea of the preferred approach for batching inside of a processor, but can't seem to find anything specific. Any suggestions would be appreciated.
Good chance I am overlooking some basic functionality baked into Nifi. I am still fairly new to the platform.
Thanks!
Great question and a pretty common pattern. This is why we have the concept of a ProcessSession. It allows you to send zero or more things to an external endpoint and only commit once you know they have been acknowledged by the recipient. In this sense it offers at-least-once semantics. If the protocol you're using supports two-phase-commit-style semantics you can get pretty close to the ever elusive exactly-once semantics. Many of the details of what you're asking about here will depend on the destination system's API and behavior.
There are some examples in the Apache codebase which show ways to do this. One way is if you can produce a merged collection of events prior to pushing to the destination system; it depends on its API. I think PutMongo and PutSolr operate this way (though the experts on those would need to weigh in). An example that might be closer to what you're looking for can be found in PutSQL, which operates on batches of FlowFiles sent in a single transaction (on the destination DB).
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/PutSQL.java
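As a rough sketch of that pattern (not the actual PutSQL code): grab a batch of FlowFiles from the session, try to send them, and only transfer them to success once the destination has acknowledged the batch; if onTrigger throws, the session is rolled back and the FlowFiles stay in the queue. The sendBulk() helper, the relationships and the batch size of 100 are hypothetical placeholders; only the framework classes come from the NiFi API.

import java.util.List;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

public class PutElasticsearchBatchSketch extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success").description("Documents indexed successfully").build();
    static final Relationship REL_FAILURE = new Relationship.Builder()
            .name("failure").description("Documents that could not be indexed").build();

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        // Pull up to 100 FlowFiles from the incoming queue as one batch.
        List<FlowFile> batch = session.get(100);
        if (batch.isEmpty()) {
            return;
        }
        try {
            sendBulk(batch, session);                  // hypothetical helper wrapping the ES bulk API
            session.transfer(batch, REL_SUCCESS);      // the session commits after onTrigger returns
        } catch (Exception e) {
            getLogger().error("Bulk index failed, routing batch to failure", e);
            session.transfer(batch, REL_FAILURE);      // or session.rollback(true) to retry later
            context.yield();
        }
    }

    private void sendBulk(List<FlowFile> batch, ProcessSession session) {
        // Placeholder: read each FlowFile's content via session.read(...) and
        // send it to Elasticsearch with a bulk request, throwing on failure.
    }
}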
Will keep an eye here, but you can get the attention of a larger NiFi group at users@nifi.apache.org
Thanks
Joe

The best way to store restart information in spring-batch readers, processors, writers and tasklets

Currently I'm designing my first batch application with Spring Batch, using several tasklets and my own readers, writers and processors, primarily doing input data checks and TIFF file handling (split, merge, etc.) depending on the input data, i.e. document metadata with the associated image files. I want to store and use restart information persisted in the batch_step_execution_context of the Spring Batch job repository. Unfortunately I did not find many examples of where and how to do this best. I want to make the application restartable so that after error correction it can continue at the point where it left off.
What I have done so far, checking in each case that the step information is persisted when an exception occurs:
Implemented ItemStream in a custom ItemWriter, using update() and open() to store and restore information to/from the step execution context, e.g. executionContext.putLong("count", count). Works well. (A minimal sketch of this pattern follows below.)
Used StepListeners and found that the context information written in beforeStep() has been persisted. Also works.
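For reference, here is a minimal sketch of that ItemStream pattern in a custom writer. It uses the Spring Batch 4.x write(List) signature (in 5.x the argument is a Chunk); the "count" key comes from the question, while the item type and the body of write() are placeholders.

import java.util.List;

import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemStreamException;
import org.springframework.batch.item.ItemStreamWriter;

public class CountingDocumentWriter implements ItemStreamWriter<String> {

    private static final String COUNT_KEY = "count";
    private long count;

    @Override
    public void open(ExecutionContext executionContext) throws ItemStreamException {
        // On a restart, the previously persisted value is restored here.
        count = executionContext.containsKey(COUNT_KEY) ? executionContext.getLong(COUNT_KEY) : 0L;
    }

    @Override
    public void update(ExecutionContext executionContext) throws ItemStreamException {
        // Called at chunk boundaries; what is stored here is persisted in
        // BATCH_STEP_EXECUTION_CONTEXT and available again after a restart.
        executionContext.putLong(COUNT_KEY, count);
    }

    @Override
    public void close() throws ItemStreamException {
        // nothing to release in this sketch
    }

    @Override
    public void write(List<? extends String> items) throws Exception {
        for (String item : items) {
            // ... write the item (e.g. handle one document's image files) ...
            count++;
        }
    }
}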
I would appreciate help that gives or points to some examples, a "restart tutorial", or sources explaining how to do this in readers, processors, writers and tasklets. Does it make sense in readers and processors? I'm aware that handling restart information might also depend on the commit interval, restartable flags, etc.
Remark: Maybe I need a deeper understanding of Spring Batch concepts beyond what I have read and tried so far; hints regarding this are also welcome. I consider myself at an intermediate level, lacking the details needed to make my application use some of the conveniences of Spring Batch.
