collectd + graphite Disk Writes Cumulative Instead Of Per Time Period - metrics

Most of the collectd stats (such as memory free/used, etc) for my hosts display in units per time period on graphite, producing jagged graphs as expected. However, when I examine metrics from the collectd disk plugin, I see an ever-rising slope of values over time, as if the values were being summed together, as in the image below:
When I examine the .conf files for collectd plugins on the host, the relevant one for the disk plugin merely contains 'LoadPlugin 'disk'". What might prevent my setup from displaying disk statistic graphs like the ones on the collectd disk plugin wiki?

Related

Size of metricbeat metricsets/fields?

I've been tasked with reducing the footprint on disk of our Metricbeats. We have a number of modules active, each with a set of metricsets, which each in turn have all the defaults uploading.
We're using an enormous amount of storage. I would like to write up what fields we can get away with dropping entirely, and also include an estimate for what % this will reduce our storage usage.
Is there anywhere where storage requirements/ranges are given for metricbeat metricsets, or even whole modules? Maybe on a per-document basis? I'm hoping there's some approximations available, but I've come up empty handed so far.

nifi splitRecord hung

I'm testing nifi SplitRecord with a small file of only 11 records
However, SplitRecord hangs for a long time. I don't get a clue what it is doing.
Processor Hung
SPlitRecord Properties:
more properties
Is Records Per Split controlling
the maximum, or the minimum, or exact number of records per split?
if the total number of records is less than records per split, what's the behavior of SplitRecords? does it wait until a time-out and then put all on-hold records in to a single split?
After about 10 minutes or random number of start/stop/terminate/restart
it may trigger the processor to split the data sooner.
Records Per Split controls the maximum, see "SplitRecord.java" for the code. If there are fewer records than the RECORDS_PER_SPLIT value, it will immediately push them all out.
However, it does look like it is creating a new FlowFile, even if the total record count is less than the RECORDS_PER_SPLIT value, meaning it's doing disk writing regardless of whether a split really occured.
I would probably investigate two things:
Host memory - how much memory does the host have? How much is configured as NiFi max heap? How much total system memory is in use/free? NiFi performs best when plenty of system memory is left for file cache.
Host's disks, specifically the disk that has the Content Repository on it. Capacity? IO? Is it shared with other services? FlowFile content is written to the Content Repository, if the disk is shared with the OS, or other busy services (or other NiFi repos) it can really slow content modification down.
Note: your NiFi version over 3 years old, please consider upgrading.

Is there a way to find out if load on Elastic stack is growing?

I have just started learning Elastic stack and I already have to diagnose production issue. Our setup from time to time has problems with pulling messages from ActiveMq to Elastic Search using Logstash. There is a lag which can be 1-3 hours.
One suspicion is that maybe load went up after latest release of our application.
Is there a way to find out total size of messages stored grouped by month? Not only their number but total size of them. Maybe documents' size went up not number of documents.
Start with setting up a production monitoring instance to provide detailed statistics on your cluster: https://www.elastic.co/guide/en/elastic-stack-overview/7.1/monitoring-production.html
This will allow you to get at those metrics like messages/month, average document size, index performance, buffer load, etc. A bit more detail on internal performance is available with https://visualvm.github.io/
While putting that piece together, you can also tweak Logstash performance e.g.
Tune Logstash worker settings:
Begin by scaling up the number of pipeline workers by using the -w flag. This will increase the number of threads available for filters and outputs. It is safe to scale this up to a multiple of CPU cores, if need be, as the threads can become idle on I/O.
You may also tune the output batch size. For many outputs, such as the Elasticsearch output, this setting will correspond to the size of I/O operations. In the case of the Elasticsearch output, this setting corresponds to the batch size.
From https://www.elastic.co/guide/en/logstash/current/performance-troubleshooting.html

Frequent Updates on Apache Ignite

I hope someone experienced with Apache Ignite can help guide my team towards the answer regarding a new setup with Apache Ignite.
Overall Setup
Data is continuously generated from many distributed sensors and streamed into our database. Each sensor may deliver many updates every second, but generally generates <10 updates/sec.
Daily the magnitude of the data is approx. 50 million records, per site.
Data Description
Each record consists of the following values
Sensor ID
Point ID
Timestamp
Proximity
where 1, is our ID of the sensor, 2 is an ID of some point on the site, and 3 is a proximity measurement from the sensor to the point.
Each second there is approx. 1000 such new records. A record is never updated.
Query Workload
Queries are fairly complex with significant (and dynamic) look-back in time. A query may require data from several sensors in one site, but the required sensors are determined dynamically. Most continuous queries only require data from the last few hours, but frequently it is necessary to query over many days.
Generally, we therefore have a write-once query-many scenario.
Initial Strategy
If we load data into primitive integer arrays in, e.g., java, the space consumption for a week approaches 5 GB. Because that is "peanuts" in the platforms of today, we intend to load all data onto all nodes in the Ignite cluster/distributed cache. In other words, use a replicated cache.
However, the continuous updates keep puzzling me. If I update the entire cache, I image quite substantial amounts of data needs to be transferred across the network every second.
Creating chunks for, say, each minute/hour is not necessarily going to work (well) either as each sensor can be temporarily offline, which will make it deliver stale data at some later point in time.
My question is therefore how to efficiently handle this stream of updates, while maintaining a consistent view of the data for the last 7-10 days.
My current, local, implementation is chunking the data into 1-hour chunks. When a new record for a given chunk arrives, the chunk is replaced with an updated chunk. This works well on a single machine but is likely too expensive in terms of network overhead in a cluster. I do not have an Ignite implementation, yet, so I have not been able to test this.
Ideally, each node in the ignite cluster would maintain its own copy of all data within the last X days, and apply the small update workload continuously.
So my question is, how would fellow Igniters approach this problem?
It sounds like you want to scale the load across multiple servers, but it's not possible with replicated caches, because each update will always update all nodes, and more nodes you have the more network traffic you will get. I think you should use partitioned caches instead and try adding nodes until the system is capable of handling the load.

Apache NiFi tuning issues

I've developed a NiFi flow prototype for data ingestion in HDFS. Now I would like to improve the overall performances but it seems I cannot really move forward.
​
​The flow takes in input csv files (each row has 80 fields), split them at row level, applies some transformations to the fields (using 4 custom processors executed sequentially), buffers the new rows into csv files, outputs them into HDFS. I've developed the processors in such a way the content of the flow file is accessed only once when each individual record is read and its fields are moved to flowfile attributes. Tests have been performed on a amazon EC2 m4.4xlarge instance (16 cores CPU, 64 GB RAM).
​​This is what I tried so far:
​​Moved the flowfile repository and the content repository on different SSD drives
Moved the provenance repository in memory (NiFi could not keep up with the events rate)
Configuring the system according to the ​configuration best practices
I've tried assigning multiple threads to each of the processors in order to reach different numbers of total threads
I've tried increasing the nifi.queue.swap.threshold and setting backpressure to never reach the swap limit
Tried different JVM memory settings from 8 up to 32 GB (in combination with the G1GC)
I've tried increasing the instance specifications, nothing changes
From the monitoring I've performed it looks like disks are not the bottleneck (they are basically idle a great part of the time, showing the computation is actually being performed in-memory) and the average CPU load is below 60%.
​The most I can get is 215k rows/minute, which is 3,5k rows/second. In terms of volume, it's just 4,7 MB/s. I am aiming to something definitely greater than this.
​
​Just as a comparison, I created a flow that reads a file, splits it in rows, merges them together in blocks and outputs on disk. Here I get 12k rows/second, or 17 MB/s. Doesn't look surprisingly fast too and let me think that probably I am doing something wrong.
​
​Does anyone has suggestions about how to improve the performances? How much will I benefit from running NiFi on cluster instead of growing with the instance specs? Thank you all
It turned out the poor performances were a combination of both the custom processors developed, and the merge content built-in processor. The same question mirrored on the hortonworks community forum got interesting feedback.
Regarding the first issue, a suggestion is to add the SupportsBatching annotation to the processors. This allows the processors to batch together several commits, and allows the NiFi user to favor latency or throughput with the processor execution from the configuration menu. Additional info can be found on the documentation here.
The other finding was that the MergeContent built-in processor doesn't seem to have optimal performances itself, therefore if possible one should consider modifying the flow and avoid the merging phase.

Resources