I'm looking for a simple tool to compare two time series from two different tests.
For example, I'm recording the access time to a database while I delete 1000000 rows, one after another. I end up with two CSV files:
the first one with all the tags and information about the test (database version, exe name, run params, etc.):
startTime,clientComputerName,pid,testId
2022-09-29T09:20:16.453Z,COMPUTER-22,4608,A-1664443216
the second one with every value of my time series, in microseconds:
startTime,duration
5,140
146,145
291,146
438,21
460,21
482,21
504,24
529,25
555,22
578,21
600,24
624,21
646,21
668,23
692,21
I get several of these test-info lines and time-series files.
With this I would like to load one or several selected tests (based on tags in test info file) and plot them on top of one another to compare them.
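To make that concrete, here's roughly the kind of load/filter/overlay step I have in mind (a minimal sketch with pandas and matplotlib; file names and the filtering tag are hypothetical, only the column names come from the CSVs above):

    # Rough sketch of the load/filter/overlay idea (file names are hypothetical).
    import glob

    import matplotlib.pyplot as plt
    import pandas as pd

    # Load every test-info CSV and keep only the tests matching the tags we want.
    info = pd.concat(pd.read_csv(path) for path in glob.glob("results/*_info.csv"))
    selected = info[info["clientComputerName"] == "COMPUTER-22"]

    fig, ax = plt.subplots()
    for test_id in selected["testId"]:
        # One timeseries CSV per test, columns: startTime, duration (microseconds).
        ts = pd.read_csv(f"results/{test_id}_timeseries.csv")
        ax.plot(ts["startTime"], ts["duration"], label=test_id, linewidth=0.5)

    ax.set_xlabel("startTime (µs)")
    ax.set_ylabel("duration (µs)")
    ax.legend()
    plt.show()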
Here's an example of the desired result:
[screenshot: several tests' duration curves overlaid on one plot]
Now here's my problem:
I tried using Plotly, but with this number of points it gets too slow and I lose interactivity, and I cannot down-sample because I need to be able to investigate anomalies.
Kibana and Grafana are not options either, since they are datetime-based (unless I'm missing something here).
I'd like to find something as simple as possible (tools like Power BI might be too complicated for this usage).
Do you know what I could use?
Thanks.
I'm very experienced with Apache Camel and EIPs, and am struggling to understand how to implement equivalents in NiFi. I understand that NiFi uses a different paradigm (flow-based programming), but I don't think what I'm trying to do is unreasonable.
In a nutshell, I want the contents of each file to be sent to many REST services, and I want to aggregate the responses into a single document which will be stored in Elasticsearch. I might also do some further processing and cleanup to improve what is stored (but this isn't my immediate issue).
The screenshot is a quick mock-up of what I'm trying to achieve, but I don't understand enough about NiFi to know how to implement this pattern correctly.
If you are going to take a single piece of data and then fork to multiple parts of the flow and then converge back, there needs to be a way for MergeContent to know which pieces go together.
There are generally two ways this can be done...
The first is using MergeContent in "defragment mode". Think of this as reversing a split operation that was performed by one of the split processors like SplitText. For example, you split a file of 100 lines into 100 flow files of 1 line each, then do some stuff to each one, then want to converge back. The split processors produce a standard set of split attributes (described in the docs of the processors) and the defragment mode knows how to bin the splits accordingly and merge them back together. This probably doesn't apply to your example since you didn't start with a split processor.
The second approach is the "Correlation Attribute" in MergeContent. This tells MergeContent to only merge flow files together that have the same value for the specified attribute. In your example, when a file gets picked up by GetFile and sent to 3 InvokeHttp processors, there are 3 flow files created, and they all should have their "filename" attribute set to the name of the file picked up from disk. So telling MergeContent to correlate on filename should do the trick; you should probably also set the minimum and maximum number of entries to the number you expect (3 here), and a maximum time in case one of them fails or hangs.
I'm new to Snakemake (started trying it out in the last week or so) as a way to handle fewer of the small details of workflows; previously I have coded up my own specific workflows in Python.
I generated a small workflow which, among other steps, takes Illumina PE reads and runs Kraken against them. I then parse the Kraken output to detect the most common species (within a set of allowable species) if a species value wasn't provided (running with snakemake -s test.snake --config R1_reads= R2_reads= species='').
I have 2 questions.
What is the recommended approach given the dynamic output/input?
Currently my strategy for this is to create a temp file which contains the detected species and then cat {input.species} it into other shell commands (a rough sketch of this is shown below). This doesn't seem elegant, but looking through the docs I couldn't quite find an adequate alternative. I noticed PersistentDicts would let me pass variables between run: commands, but I'm unsure if I can use that to load variables into a shell: section. I also noticed that wrappers could allow me to handle it, but from the point where I need that variable onward I'd be wrapping the remainder of my workflow.
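For reference, the temp-file strategy looks roughly like this (rule, script, and file names are made up for illustration):

    # Hypothetical rule that parses the Kraken report and writes the detected
    # species name into a small temp file (parse_kraken_report.py is made up).
    rule detect_species:
        input:
            kraken="kraken/report.txt"
        output:
            species=temp("detected_species.txt")
        shell:
            "parse_kraken_report.py {input.kraken} > {output.species}"

    # Downstream rule that reads the species back out of the temp file
    # inside its shell command (run_species_step.sh is made up).
    rule downstream_step:
        input:
            species="detected_species.txt",
            r1=config["R1_reads"],
            r2=config["R2_reads"]
        output:
            "downstream/result.txt"
        shell:
            "species=$(cat {input.species}); "
            "run_species_step.sh --species $species {input.r1} {input.r2} > {output}"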
Is Snakemake the right tool if I want to use the species afterwards to run a set of scripts specific to that species (with multiple species-specific workflows)?
Right now my impression of how to solve this is to have multiple workflow files for the species and have a run with a switch which calls the associated species workflow depending on the species.
Appreciate any insight on these questions.
-Kim
You can mark output as dynamic (e.g. expecting one file per species). Then, Snakemake will determine the downstream DAG of jobs after those files have been generated. See http://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#dynamic-files
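A rough sketch of what that can look like (rule, script, and file names are hypothetical; note that newer Snakemake releases replace dynamic files with checkpoints, but the idea is the same):

    # "rule all" drives the workflow: it asks for one summary per species,
    # where the set of species is only known after split_by_species has run.
    rule all:
        input:
            dynamic("results/{species}.summary")

    # Hypothetical rule that writes one file per detected species.
    rule split_by_species:
        input:
            "kraken/report.txt"
        output:
            dynamic("species/{species}.txt")
        shell:
            "split_report_by_species.py {input} species/"

    # Per-species analysis; Snakemake fills in the {species} wildcard
    # once the dynamic files exist.
    rule per_species_analysis:
        input:
            "species/{species}.txt"
        output:
            "results/{species}.summary"
        shell:
            "analyse_species.sh {wildcards.species} {input} > {output}"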
Goal
I wish to use RRDTool to count logical "user activity" from our web application's apache/tomcat access logs.
Specifically, we want to count, for a given period, occurrences of several URL patterns.
Example
We have two applications (call them 'foo' and 'bar').
These URLs interest us. They indicate when users 'did interesting stuff':
/foo/hop
/foo/skip
/foo/jump
/bar/crawl
/bar/walk
/bar/run
Basically we want to know, for a given interval (10 minutes, hour, day, etc.), how many users hopped, skipped, jumped, crawled, walked, etc.
Reference/Starting point
This article on importing access logs into RRDTool seemed like a helpful starting point.
http://neidetcher.com/programming/2014/05/13/just-enough-rrdtool.html
However, to clarify: this example uses the access log directly, whereas we want to put a handful of URLs 'into buckets' and count the 'number in each bucket'.
Some Scripting Required..
I could do this with bash, grep & wc, iterating through the patterns and sending output to an 'intermediate results' text file.
That said, I believe RRDTool could do this with minimal 'outside coding', but I am unclear on the details.
Some points
I mention 'two applications' because we actually serve them up from separate servers with different log file formats. I'd like to get them into the same RRA file.
Eventually I'd like to report on this in Cacti; initially, however, I want to understand the RRDTool details.
I'm open to doing any coding, but would like to keep it as efficient as possible, both administratively and in computer resources. (By administratively, I mean: easy to monitor new instances.)
I am very new to RRDTool and am RTM'ing (and walking through the tutorial). I'm used to relational databases, spreadsheets, etc., and don't have my mind around all the nuances of the RRA format.
Thanks in advance!
You could set up a separate RRD file with an ABSOLUTE-type data source for each address you want to track.
Then you tail the log file, and whenever you see one of the interesting URLs rush by you call:
rrdtool update url-xyz.rrd N:1
The ABSOLUTE data source type is like a counter, but it gets reset every time it is read. Your counter will just count to one, but that should not be a problem.
In the example above I am using N: and not the timestamp from the access log. You could also use that timestamp if you are not doing this in real time, but beware that you cannot update the same RRD file twice at the same time. N: uses millisecond timestamps internally and thus probably avoids this problem.
On the other hand it may make more sense to accumulate matching log entries with the same timestamp and only update rrdtool with that number once the timestamp on the logfile changes.
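For example, a minimal sketch of the tail-and-update loop (the URL patterns and RRD file names are made up; each RRD is assumed to have been created beforehand with a single ABSOLUTE data source):

    # Minimal sketch: feed access-log lines in on stdin, e.g.
    #   tail -F access.log | python count_hits.py
    # Each RRD is assumed to exist already, created roughly like:
    #   rrdtool create foo-hop.rrd --step 600 DS:hits:ABSOLUTE:1200:0:U RRA:AVERAGE:0.5:1:1008
    import re
    import subprocess
    import sys

    PATTERNS = {
        "foo-hop.rrd":   re.compile(r"GET /foo/hop"),
        "foo-skip.rrd":  re.compile(r"GET /foo/skip"),
        "bar-crawl.rrd": re.compile(r"GET /bar/crawl"),
    }

    for line in sys.stdin:
        for rrd, pattern in PATTERNS.items():
            if pattern.search(line):
                # N: means "now"; with an ABSOLUTE data source the value is reset
                # on every read, so one update per matching request counts hits.
                subprocess.run(["rrdtool", "update", rrd, "N:1"], check=True)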
Is it possible to use one Hadoop job run to output data to different directories based on keys?
My use case is server access logs. Say I have them all together, but I want to split them out based on some common URL patterns.
For example,
Anything that starts with /foo/ should go to /year/month/day/hour/foo/file
Anything that starts with /bar/ should go to /year/month/day/hour/bar/file
Anything that doesn't match should go to /year/month/day/hour/other/file
There are two problems here (from my understanding of MapReduce): first, I'd prefer to just iterate over my data one time, instead of running one "grep" job per URL type I'd like to match. How would I split up the output, though? If I key the first with "foo", the second with "bar", and the rest with "other", don't they all still go to the same reducers? How do I tell Hadoop to output them into different files?
The second problem is related (maybe the same?), I need to break output up by the timestamp in the access log line.
I should note that I'm not looking for code to solve this, but rather the proper terminology and high level solution to look into. If I have to do it with multiple runs, that's alright, but I can't run one "grep" for each possible hour (to make a file for that hour), there must be another way?
You need to partition the data just as you describe, and then you need to have multiple output files. See "Generating Multiple Output files with Hadoop 0.20+".
I am using JMeter and have 2 questions (I have read the FAQ + Wiki etc):
I use the Graph Results listener. It seems to have a fixed span, e.g. 2 hours (just guessing; this is not indicated anywhere AFAIK), after which it wraps around and starts drawing on the same canvas from the left again. Hence after a long weekend run it only shows the results of the last 2 hours. Can I configure that span or other properties (beyond the check boxes I see on the Graph Results listener itself)?
Can I save the results of a run and later open them? I know I can save the test plan or parts of it. I am unclear if I can save separately just the test results data, and later open them and perform comparisons etc. And furthermore can I open them with different listeners even if they weren't part of original test (i.e. I think of the test as accumulating data, and later on I want to view and interpret the data using different "viewers").
Thanks,
-- Shaul
Don't know about 1. Regarding 2: listeners typically have a configuration field for "Write All Data to a File", which lets you specify the file name. You can use the Simple Data Writer to store results efficiently for later analysis.
You can load results from a previous test into a visualizer by choosing "Write All Data to a File" and browsing for the file you wish to load. Somewhat counterintuitively, selecting a file for writing also loads that file into the visualizer and displays the results. Just make sure you don't run the test again while that file is selected, otherwise you will lose your saved test data. :-)
Well, I later found a JMeter group that was discussing the issue raised in my first question, and B.Ramann gave me an excellent suggestion to use a better graph, found here, instead.
-- Shaul