Apache NiFi GetMongo Processor - apache-nifi

I am new to NiFi. I am using GetMongo to extract documents from MongoDB, but the same results keep coming in again and again, even though the query matches only 2 documents. The query is {"qty":{$gt:10}}.

There is a similar question regarding this. Let me quote what I had said there:
"GetMongo will continue to pull data from MongoDB based on the provided properties such as Query, Projection, Limit. It has no way of tracking the execution process, at least for now. What you can do, however, is changing the Run Schedule and/or Scheduling Strategy. You can find them by right clicking on the processor and clicking Configure. By default, Run Schedule will be 0 sec which means running continuously. Changing it to, say, 60 min will make the processor run every one hour. This will still read the same documents from MongoDB again every one hour but since you have mentioned that you just want to run it only once, I'm suggesting this approach."
The question can be found here.
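For reference, the filter itself behaves the same way outside NiFi. A minimal pymongo sketch (the connection string, database, and collection names below are hypothetical; only the query comes from the question) shows that it simply returns the same two matching documents on every run, which is exactly what GetMongo does on each scheduled execution:

from pymongo import MongoClient

# hypothetical connection and names, for illustration only
client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["items"]

query = {"qty": {"$gt": 10}}
print(collection.count_documents(query))  # per the question, this should print 2
for doc in collection.find(query):
    print(doc)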

Related

Understanding Apache Spark Web UI performance metrics

I'm new to Spark and I'm trying to understand the metrics in the Web UI that relate to my Spark application (developed through the Dataset API). I've watched a few videos from Spark Summit and Databricks, and most of them give a general overview of the Web UI, e.g. definitions of stage/job/task, how to understand when something is not working properly (such as unbalanced work between executors), suggestions about things to avoid while programming, etc.
However, I couldn't find a detailed explanation of each performance metric. In particular, I'm interested in understanding the things in the following images, which relate to a query that contains a groupBy(Col1, Col2), an orderBy(Col1, Col2) and a show().
Job 0
If I understood correctly, the default maximum partition size is 128 MB. Since my dataset is 1378 MB, I get 11 tasks that each work on roughly 128 MB, right? And since in the first stage I did some filtering (before applying the groupBy), tasks write to memory, so Shuffle Write is 108.3 KB. But why do I get 200 tasks for the second stage?
After the groupBy I used an orderBy; is the number of tasks related to how my dataset is laid out, or to its size?
UPDATE: I found the question "spark.sql.shuffle.partitions of 200 default partitions conundrum" and some others, but now I'm wondering whether there is a specific reason for it to be 200.
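As an aside, both of those numbers are configuration defaults rather than something derived from the data. A small PySpark sketch (the session setup is illustrative) shows where they live:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-defaults").getOrCreate()

# Input splits: files are read in chunks of at most this many bytes
# (128 MB by default), so a ~1378 MB input becomes ~11 read tasks.
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

# Wide transformations (groupBy, orderBy, joins) shuffle into this many
# partitions (200 by default), hence 200 tasks in the post-shuffle stage.
print(spark.conf.get("spark.sql.shuffle.partitions"))

# The 200 is just a fixed default; it can be tuned to the data volume.
spark.conf.set("spark.sql.shuffle.partitions", "50")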
Stage 0
Why do some tasks have result serialization here? If I understood correctly, serialization is related to the output, i.e. any show(), count(), collect(), etc. But in this stage those actions are not present (it is before the groupBy).
Stage 1
Is it normal that result serialization time takes such a huge share here? I called show() (which takes 20 rows by default, and there is an orderBy), so all tasks run in parallel and that one task serialized all of its records?
Why does only one task have a considerable Shuffle Read Time? I expected all of them to have at least a small amount of Shuffle Read Time; again, is this related to my dataset?
Is the deserialization time related to reading my dataset file? I'm asking because I wouldn't have expected it here, since this is stage 1 and it was already present in stage 0.
Job 1 - caching
Since I'm dealing with 3 queries that start from the same dataset, I used cache() at the beginning of the first query. I was wondering why it shows 739.9 MB / 765 [input size / records] ... while in the first query it shows 1378.8 MB / 7647431 [input size / records].
I guess that it has 11 tasks since the cached dataset is still 1378 MB, but 765 is a really low number compared to the initial 7647431, so I don't think it is really about records/rows, right?
Thanks for reading.

NiFi - Process the files based on count or time elapsed?

I have a following flow,
ListFile ---> FetchFile ---> ? ExecuteScript (maybe) ---> Notify
Basically, I want to go to Notify if:
Total flowfiles (from FetchFile) is, say, 200; OR
Time elapsed (from the last signal) is, say, 3 hours.
I think the 1st condition is easy to achieve: I can have a Groovy script which reads the number of flowfiles and, if it is 200, goes to SUCCESS, or else does a ROLLBACK of the session.
But I also want to know how to check whether the time elapsed for the n flowfiles in the queue (where n can be less than 200) is more than 3 hours or so.
Update
Here is the problem: we currently have a batch process (~200 files, which can increase based on business needs in the future). We have a NiFi pipeline, i.e. List, Fetch, basic validation on checksum, etc., and processing (calling the SQL), which is working fine.
As per the business, throughout the day we can have corrections to the data, so we can get all or some of the files to "re-process". That is also fine and working.
Now, as per new requirements, we need to run a process after this "batch" is completed. So in the best case, I can have a MergeContent processor with a max bin of n and signal or notify my new processor.
However, as explained above, throughout the day a few or all of the files can be processed again. So now my "n" may not match the new number of files re-processed. Hence, even in this case, if say 3 hours have elapsed, then irrespective of "n" not being equal to the new number of files reprocessed, I should notify the new process to run again.
Hence, I am looking for an "n files OR m hours elapsed" check.
I think this may be an example of an XY problem -- you're trying to solve a problem and believe that counting the number of files fetched or time elapsed will help, but this pattern is usually discouraged in Apache NiFi and there are other solutions to the original problem. I would encourage you to describe more fully the higher level problem you are trying to solve to see if there is a better solution.
I will answer the question though (none of these are ideal solutions).
You can use a MergeContent processor with a minimum bin count of 200
You can use an ExecuteScript processor as you noted (a rough sketch follows this list)
You can write a value (the current timestamp) to a DistributedMapCacheServer when the Notify processor executes, check that value with a FetchDistributedMapCache processor, and use a simple Expression Language statement to compare it against the current timestamp
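As a rough illustration of the ExecuteScript route, here is a Jython sketch (Script Engine set to python). It is untested; the 200-file and 3-hour thresholds come from the question, and it approximates "time elapsed" with the oldest flowfile's lineage start date rather than the time since the last Notify signal, so treat it as a starting point only.

# 'session' and 'REL_SUCCESS' are bindings that ExecuteScript provides.
import time

BATCH_SIZE = 200                     # "n" files
MAX_WAIT_MS = 3 * 60 * 60 * 1000     # "m" hours, in milliseconds

flow_files = session.get(BATCH_SIZE)
if not flow_files.isEmpty():
    now_ms = long(time.time() * 1000)
    # age of the oldest flowfile in this batch, from its lineage start date
    oldest_age_ms = max(now_ms - ff.getLineageStartDate() for ff in flow_files)
    if flow_files.size() >= BATCH_SIZE or oldest_age_ms >= MAX_WAIT_MS:
        session.transfer(flow_files, REL_SUCCESS)
    else:
        # neither threshold reached: put everything back on the queue and
        # try again later (set a Run Schedule so this does not spin)
        session.rollback()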
I think you may also want to read some examples of Wait/Notify logic, because creating thresholds like "200 incoming flowfiles || 3 hours elapsed time" is what the Wait processor does.
"How to wait for all fragments to be processed, then do something?" by Koji Kawamura
"NiFi workflow monitoring – Wait/Notify pattern with split and merge" by Pierre Villard
"Simple NiFi Wait/Notify Example" answer by Abdelkrim Hadjidj

Update operation concurrency on multiple nodes

I have a single application, maintained on two different nodes in the cloud. The application has a scheduler which triggers every 5 minutes and performs an update operation in the database. How can I prevent the two operations from causing an anomaly in the database? Is there a way one instance can know that the other instance has already been triggered, or some sort of inter-node communication that may happen in Cloud Foundry?
Many Thanks
A couple options come to mind for Cloud Foundry:
Create a distributed "lock" with your database. This could be as simple as a table or record in the DB that the scheduler checks out first before it does anything else. Once it has the lock, the scheduler can work. If it fails to obtain the lock, it goes back to sleep. Then when it's done, it returns the lock.
If you have lots of work to do, you could divide it into sections and have locks for each section, that way you could spread the work out across your different instances. This gets more complicated though, so you'd have to weigh the advantages against the extra complication to see if it's worth it for your use case.
Only run the scheduler on the first node. You can determine the first node by looking at your application instance number, via either the env variable CF_INSTANCE_INDEX or VCAP_APPLICATION, which contains JSON and has an instance_index property. For either option, the value will be 0 for the first instance: if it's 0, the scheduler runs; if it's greater than zero, it doesn't. For example:
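A minimal Python sketch of the instance-index check; apart from CF_INSTANCE_INDEX, which Cloud Foundry sets, everything here is illustrative placeholder code:

import os

def run_update():
    # placeholder for the real 5-minute update job
    print("running the scheduled update")

def is_first_instance():
    # CF_INSTANCE_INDEX is set by Cloud Foundry; "0" means the first instance
    return os.environ.get("CF_INSTANCE_INDEX", "0") == "0"

if __name__ == "__main__":
    if is_first_instance():
        run_update()
    # otherwise do nothing: instance 0 will do the work

And the lock idea from the first option could look like the sketch below, substituting a PostgreSQL advisory lock for the dedicated lock table described above (that substitution, psycopg2, and the lock key are my assumptions, not part of the original answer):

import psycopg2  # assumes the app's database is PostgreSQL

LOCK_KEY = 42  # arbitrary application-wide lock id

def run_update_once(conn):
    with conn.cursor() as cur:
        cur.execute("SELECT pg_try_advisory_lock(%s)", (LOCK_KEY,))
        if not cur.fetchone()[0]:
            return  # another instance holds the lock; skip this run
        try:
            print("running the scheduled update")  # placeholder for the real work
        finally:
            cur.execute("SELECT pg_advisory_unlock(%s)", (LOCK_KEY,))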
Hope that helps!

running data science clustering job with a parse.com background job

So I have a mobile app that wants to use Parse for login, user data, content, etc., but I also need to run an hourly k-means clustering job on my entire user data set. I was looking at Parse background jobs as a possible solution. My question is: since the clustering algorithms will probably take up a lot of memory (they will need to load all the users into memory), will it be possible or useful to use Parse for this, or to run map-reduce jobs with the background jobs? Or is this really beyond the means of Parse, and should I look at setting up my own backend instead of using a backend-as-a-service?
Parse offers background jobs https://parse.com/docs/cloud_code_guide#jobs
As long as the job completes within 15 minutes, it is fine.
However, there is an open bug on Parse that hasn't been fixed over the last 1.5 months. We believe that this is actually a memory issue (and if that's the case you might run into it, as well). Here's the bug ID: https://developers.facebook.com/bugs/1586656868273252/

Why is Cacti showing an empty graph, even though the rrd file is created?

I have developed my own SNMP service, and I want to plot a graph of a provided OID.
So, I have created a graph in Cacti.
-) It is showing the device as up.
-) It is creating the rrd file (RRDTool says OK).
-) It is showing the graph, but it's empty.
But when I check it with, say,
rrdtool fetch <rrd file> AVERAGE
it shows me NaN for all the values. The monitored OID has the value 47, and I have set min=0 and max=100.
I am using the Cacti appliance by rPath:
http://www.rpath.org/ui/#/appliances?id=http://www.rpath.org/api/products/cacti-appliance
Still, I can't get the value to show on the graph.
Where is the problem? Can anyone please tell me?
First of all, use Cacti's "Rebuild Poller Cache" function under the Utilities menu.
If that didn't work, check whether the RRD file is actually updating with new data.
To do this use the command:
rrdtool last [filename.rrd]
This will output the last time (as a Unix timestamp) that a new value was inserted into the RRD file, which you can compare to the current time that date +%s outputs.
If it's not updating with data, then you should change the Cacti log level to DEBUG via the settings page in Cacti's web UI and look for relevant messages.
If the poller couldn't get the data, it's usually an issue relating to connectivity/SNMP.
You can further investigate such issues by manually polling the specific OID on that host:
snmpwalk -c[SNMP COMMUNITY] -v2c [HOSTNAME OR IP ADDRESS] 1.3.6.1.2.1
You can use the above command and OID (1.3.6.1.2.1) just to see if you're getting a reply.
If that worked then you should change the command from snmpwalk to snmpget and the OID to the actual OID you're trying to poll and retry.
If the RRD is updating with new data but you're still getting NaN in your graphs then I suggest looking into the heartbeat and step values of the data source (via the data template) in relation to your polling interval and poller cronjob interval.
These values determine how many times the RRD file will miss data before inserting a NaN.
The cronjob calls the Cacti poller to start performing its polling cycle.
The poller interval is the actual time that the poller will wait between two polling cycles if it was indeed invoked in time by the cronjob.
So for 1 minute polling (on the poller and the cronjob) you will have to use a step of 60 (seconds) and a heartbeat of 120.
For 5 minutes polling, the step will be 300 and the heartbeat will be 600.
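To make the relationship concrete, here is how those numbers would appear if you created an RRD by hand with the Python rrdtool bindings; the file and data-source names are made up, and Cacti normally generates all of this from the data template, so this is purely illustrative:

import rrdtool

# step=300 means one value is expected every 300 s; the data source's
# heartbeat of 600 means the value is stored as unknown (NaN) if no
# update arrives within 600 s.
rrdtool.create(
    "example.rrd",                 # hypothetical file name
    "--step", "300",
    "DS:myoid:GAUGE:600:0:100",    # heartbeat 600, min 0, max 100 (as in the question)
    "RRA:AVERAGE:0.5:1:600",       # keep 600 averaged data points
)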
This is mainly caused by someone changing the poller interval on the settings page.
Gandalf from the Cacti forums wrote a nice guide that you can use, and further help can be found on the Cacti forums.
Good luck! :)
Maybe Cacti doesn't have the needed permissions to access the RRD file, and your test was done with a user who has the required permissions, for example root?
Are you sure you have collected enough data?
If your RRD has a step of 1 minute, and your first RRA has a consolidated count of 1 (1cdp=1pdp), then you should collect data for at least (step x ( count + 1 )) seconds before you expect to see any data in the graph. Make sure you are collecting data at least as often as the step size.
If you collect data for 10 min and nothing shows up, then make sure you are actually collecting the data, make sure the values you get are within range, and that they are being used. Check the last modification time on the RRD file. Print out the values before you update to verify they are what you think they are.
You should double-check the range Cacti is plotting in. I moved the values in the graph filter and spotted a little chunk of data in the graphs; then you just have to adjust it.
