We work with a lot of data and have a high throughput of files going through our NiFi instances. We have recently been losing provenance records and can't determine the cause.
Below are some details, if relevant:
We have our provenance repository on its own drive in the cloud, and are not seeing any high I/O usage or resource contention.
We have allocated additional threads to it, and raised the limit to 999k file handles.
If it matters, provenance data is kept for two weeks in our configuration.
We are on NiFi version 1.15.3, but are planning an upgrade in the near future.
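For reference, the provenance-related keys in our nifi.properties look roughly like this (illustrative values, not our exact settings):

    # Provenance repository settings (values illustrative)
    nifi.provenance.repository.implementation=org.apache.nifi.provenance.WriteAheadProvenanceRepository
    nifi.provenance.repository.directory.default=/data/provenance_repository
    nifi.provenance.repository.max.storage.time=14 days
    nifi.provenance.repository.max.storage.size=100 GB
    nifi.provenance.repository.index.threads=4

One thing we are still checking: as far as we understand, records are aged off when either the time or the size limit is reached, so a max.storage.size that is too small for our throughput would delete records well before the two-week window expires.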
Any ideas on what the cause may be and how to remediate this? Thanks!
What are some good panels to have in a Kibana dashboard for developers to troubleshoot issues in applications? I am trying to create a dashboard that developers could use to pinpoint where the app is having issues, so that they can resolve them. These are a few factors that I have considered:
CPU usage of the pod, memory usage of the pod, network in and out, and application logs are the ones I have in mind. Are there any other panels I could add so that developers can get an idea of where to check when something goes wrong in the app?
For example, application slowness could be because of high CPU consumption, the app going down could be because of an OOM kill, a request taking longer could be due to latency or cache issues, etc. Is there anything else I could take into consideration? If so, please suggest it.
So here are a few things that we could add:
1. Number of pods, deployments, daemonsets, and statefulsets present in the cluster
2. CPU utilised by each pod (pod-wise breakdown)
3. Memory utilised by each pod (pod-wise breakdown)
4. Network in/out
5. Top memory/CPU-consuming pods and nodes
6. Latency
7. Persistent disk details
8. Error logs as annotations in TSVB (sketch below)
9. Log streams to check logs within the dashboard
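For the TSVB annotations point, a minimal sketch of the annotation data source settings (the field names assume an ECS/Filebeat-style mapping; adjust them to your own index):

    Index pattern:  logs-*
    Time field:     @timestamp
    Query string:   log.level:ERROR OR log.level:FATAL
    Fields:         message, kubernetes.pod.name
    Row template:   {{kubernetes.pod.name}}: {{message}}

This overlays each error log as a marker on the time series panels, so developers can line up error bursts with CPU or memory spikes.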
My flow works correctly, but after one hour the flow's data disappears. I've varied the heap size from 100 MB up to 8 GB and it did not help; my CPU usage climbed to 500% and then my flow's data disappeared. I mean the In/Out counters of all processors became zero. I attached my flow. Does anybody have a solution?
My system config:
macOS High Sierra
Processor: 2.3 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
This is the log of my flow (screenshot attached).
This is my flow after losing the data and deleting the content (screenshot attached).
I hope this explanation of these basic concepts clears up the confusion.
About NiFi
NiFi is a flow management tool: you can have processors to ingest, process, and egest data.
Typically a message comes in, and goes out once NiFi is done with it.
About statistics
Each processor will keep track of incoming and outgoing messages. These statistics are tracked on the processor for a while, and then 'forgotten'. I believe the time period is 5 minutes (see the sketch below for pulling the same numbers outside the UI).
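If you want to look at those rolling statistics programmatically, you can pull them from the REST API. A minimal sketch, assuming an unsecured instance on port 8080 and a known processor UUID:

    curl -s http://localhost:8080/nifi-api/flow/processors/<processor-id>/status

The response contains the aggregate snapshot (flowfiles and bytes in/out) for that same five-minute window.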
About queues
You can inspect a queue to see the messages in it; if there are no messages, there is of course nothing to inspect (a rough REST sketch is below). You might be interested in the provenance instead.
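In the UI this is the "list queue" action on a connection. The same thing can be done through the REST API; a rough sketch, again assuming an unsecured instance, with the connection UUID as a placeholder:

    # create a listing request for the queue
    curl -s -X POST http://localhost:8080/nifi-api/flowfile-queues/<connection-id>/listing-requests
    # then fetch the result using the request id returned above
    curl -s http://localhost:8080/nifi-api/flowfile-queues/<connection-id>/listing-requests/<request-id>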
About provenance
You can check the provenance of a message in the queue to see how it developed (content, timestamps) as it passed through the processors. I have personally worked mostly with NiFi in HDF, so I'm not sure whether this option is available when you run NiFi without a platform around it.
Detecting problems in NiFi
Of course there may be exceptions, but if NiFi is unable to pick up messages, I would expect them to get stuck in a queue. And if NiFi is processing them but failing, you would expect red squares to start appearing in the UI.
So usually it is quite easy to tell if something is going wrong in NiFi.
When our Cascading jobs encounter an error in data, they throw various exceptions… These end up in the logs, and if the logs fill up, the cluster stops working. Do we have any config file that can be edited/configured to avoid such scenarios?
We are using MapR 3.1.0, and we are looking for a way to limit the log use (syslogs/userlogs) without using centralized logging and without adjusting the logging level. We are not particularly bothered about whether it keeps the first N bytes or the last N bytes of the logs and discards the remaining part.
We don't really care about the logs, and we only need the first (or last) few megs to figure out what went wrong. We don't want to use centralized logging, because we don't really want to keep the logs and don't care to spend the perf overhead of replicating them. Also, correct me if I'm wrong: user_log.retain-size has issues when JVM re-use is used.
Any clue/answer will be greatly appreciated!
Thanks,
Srinivas
This should probably be on a different StackExchange site, as it's more of a DevOps question than a programming question.
Anyway, what you need is for your DevOps team to set up logrotate and configure it according to your needs, or to edit the log4j configuration files on the cluster to change the way logging is done. A sketch of the logrotate route is below.
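A minimal logrotate rule might look like this (the log path is an assumption for a MapR 3.x Hadoop layout; point it at wherever your task logs actually live):

    # /etc/logrotate.d/mapr-userlogs -- path and size are illustrative
    # copytruncate truncates the file in place, so the running JVM keeps
    # writing to the same handle and only the most recent chunk survives
    /opt/mapr/hadoop/hadoop-0.20.2/logs/userlogs/*/*.log {
        size 10M
        rotate 1
        copytruncate
        compress
        missingok
        notifempty
    }

Since rotation keeps the tail of the file, this matches the "last few megs" variant of your requirement.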
I am working on a clustered MarkLogic environment where we have 10 nodes. All nodes are shared E&D nodes.
Problem that we are facing:
When a page is written in MarkLogic, it takes some time (up to 3 secs) for all the nodes in the cluster to get updated, and during this time, if I do a read operation to fetch the previously written page, it is not found.
Has anyone experienced this latency issue and looked at eliminating it? If so, please let me know.
Thanks
It's normal for a new document to only appear after the database transaction commits. But it is not normal for a commit to take 3 seconds.
Which version of MarkLogic Server?
Which OS and version?
Can you describe the hardware configuration?
How large are these documents? All other things equal, update time should be proportional to document size.
Can you reproduce this with a standalone host? That should eliminate cluster-related network latency from the transaction, which might tell you something. Possibly your cluster network has problems, or possibly one or more of the hosts has problems.
If you can reproduce the problem with a standalone host, use system monitoring to see what that host is doing at the time. On Linux I favor something like iostat -Mxz 5 and top, but other tools can also help. The problem could be disk I/O - though it would have to be really slow to result in 3-second commits. Or it might be that your servers are low on RAM, so they are paging during the commit phase.
If you can't reproduce it with a standalone host, then I think you'll have to run similar system monitoring on all the hosts in the cluster. That's harder, but for 10 hosts it is just barely manageable.
We have a JPA -> Hibernate -> Oracle setup, where we are only able to crank up to 22 transactions per second (two reads and one write per transaction). The CPU, disk, and network are not bottlenecking.
Is there something I am missing? I wonder if there could be some sort of Oracle-imposed limit that the DBAs have applied?
Network is not the problem: when I do raw reads on the table, I can do 2,000 reads per second. The problem is clearly the writes.
CPU is not the problem on the app server; the CPU is basically idling.
Disk is not the problem on the app server; the data is completely loaded into memory before the processing starts.
Might be worth comparing performance with a different client technology (or even just a simple test using SQL*Plus) to see if you can beat this performance anyway - it may simply be an under-resourced or misconfigured database.
I'd also compare the results for SQL*Plus running directly on the d/b server to it running locally on whatever machine your Java code is running on (where it is communicating over SQL*Net). This would confirm whether the problem is below your Java tier.
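A bare JDBC loop (no JPA/Hibernate) also makes a useful middle point between SQL*Plus and the full app. A minimal sketch; the URL, credentials, table, and column names are placeholders for whatever your schema actually looks like:

    import java.sql.*;

    public class TxBenchmark {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "pass")) {
                conn.setAutoCommit(false);
                int iterations = 1000;
                long start = System.nanoTime();
                for (int i = 0; i < iterations; i++) {
                    // two reads and one write, mirroring the app's transaction shape
                    try (PreparedStatement read = conn.prepareStatement(
                             "SELECT val FROM t WHERE id = ?");
                         PreparedStatement write = conn.prepareStatement(
                             "UPDATE t SET val = ? WHERE id = ?")) {
                        read.setInt(1, i);
                        try (ResultSet rs = read.executeQuery()) { rs.next(); }
                        read.setInt(1, i + 1);
                        try (ResultSet rs = read.executeQuery()) { rs.next(); }
                        write.setInt(1, i);
                        write.setInt(2, i);
                        write.executeUpdate();
                    }
                    conn.commit(); // one commit per transaction, like the app
                }
                long ms = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("%d tx in %d ms (%.1f tx/s)%n",
                        iterations, ms, iterations * 1000.0 / ms);
            }
        }
    }

If this loop also tops out around 22 tx/s, the ceiling is below your Java tier (driver, network, or database); if it is much faster, look at the JPA/Hibernate layer, e.g. flush and commit behaviour.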
To be honest, there are so many layers between your JPA code and the database itself that diagnosing the cause is going to be fun... I recall one mysterious d/b performance problem that turned out to be a misconfigured network card - the DBAs were rightly insistent that the database wasn't showing any bottlenecks.
It sounds like the application is doing a transaction in a bit less than 0.05 seconds. If the SELECT and UPDATE statements are extracted from the app and run by themselves, using SQL*Plus or some other tool, how long do they take, and if you add up the times for the statements, do they come pretty near to 0.05? Where does the data come from that is used in the queries, and which eventually gets used in the UPDATE? It's entirely possible that the slowdown is not the database but somewhere else in the app, such as the data acquisition phase. Perhaps something like a profiler could be used to find out where the app is spending its time.
Share and enjoy.