Perfmon File Analysis Tools

I have a bunch of perfmon files that have captured information over a period of time. What's the best tool to crunch this information? Ideally I'd like to be able to see average stats per hour for the object counters that have been monitored.

From my experience, even just Excel makes a pretty good tool for quickly whipping up graphs of perfmon data if you relog the data to CSV or TSV. You can just plot a rolling average & see the progression. Excel isn't fancy, but if you don't have more than 30-40 megs of data it can do a pretty quick job. I've found that Excel 2007 tends to get unstable when using tables with over 50 megs of data: at one point an 'undo' caused it to consume 100% CPU & 1.3 GB of RAM.
Addendum - relog isn't the best-known tool, but it is very useful. I don't know of any GUI front ends, so you just have to run it from the command line. The two most common cases I've used it for are the following (a couple of example invocations are sketched below):
1. Removing unnecessary counters from logs that a sysadmin gave me, e.g. the entire Process and Memory objects.
2. Converting the binary perfmon logs to .csv or .tsv files.
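For reference, a minimal sketch of those two relog invocations driven from Python (relog.exe ships with Windows; the log and counter-list file names here are just examples, not anything from the original logs):

```python
# Minimal sketch of the two relog uses above, driven from Python.
import subprocess

# 1. Keep only the counters listed in counters.txt (one counter path per line,
#    e.g. \Processor(_Total)\% Processor Time) and write a trimmed binary log.
subprocess.run(
    ["relog", "server01.blg", "-cf", "counters.txt", "-o", "server01_trimmed.blg", "-y"],
    check=True,
)

# 2. Convert the trimmed binary log to CSV so Excel (or a short script) can read it.
subprocess.run(
    ["relog", "server01_trimmed.blg", "-f", "CSV", "-o", "server01.csv", "-y"],
    check=True,
)
```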

Perhaps look into using LogParser.
It depends on how the info was logged (Perfmon doesn't lack flexibility).
If they're CSV files you can even use the ODBC Text driver and run queries against them!
(performance would be 'intriguing')
And here's the obligatory link to a CodingHorror article on the topic ;-)
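If the ODBC route feels heavier than you need, a short script will also get you the per-hour averages the question asks for. Here's a rough sketch in plain Python against a relog'd CSV; it assumes (but doesn't guarantee) that the first column is the PDH timestamp, e.g. "03/07/2011 14:05:13.276", and that the remaining columns are numeric counters:

```python
# Rough sketch of "average stats per hour" against a relog'd perfmon CSV.
import csv
from collections import defaultdict
from datetime import datetime

sums = defaultdict(lambda: defaultdict(float))
counts = defaultdict(lambda: defaultdict(int))

with open("server01.csv", newline="", encoding="utf-8-sig") as f:
    reader = csv.reader(f)
    header = next(reader)                     # first column is the timestamp column
    for row in reader:
        try:
            ts = datetime.strptime(row[0], "%m/%d/%Y %H:%M:%S.%f")
        except ValueError:
            continue                          # skip malformed or blank rows
        hour = ts.strftime("%Y-%m-%d %H:00")
        for name, value in zip(header[1:], row[1:]):
            try:
                sums[hour][name] += float(value)
                counts[hour][name] += 1
            except ValueError:
                pass                          # counter had no sample for this interval

for hour in sorted(sums):
    for name, total in sums[hour].items():
        print(f"{hour}  {name}  avg={total / counts[hour][name]:.2f}")
```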

PAL (Performance Analysis of Logs) is a free tool provided on CodePlex. It provides charting capabilities and built-in thresholds for different server roles (which can also be modified), and it generates HTML reports.
http://www.codeplex.com/PAL/Release/ProjectReleases.aspx?ReleaseId=21261

Take a look at SmartMon (www.perfmonanalysis.com). It analyzes Perfmon data in CSV files and SQL Server databases.

Related

Monitor Hadoop Cluster using Collectl

I am evaluating various system monitoring tools to pick one to monitor my Hadoop cluster.
One of the tools I am impressed by is collectl. I have been playing around with it for a couple of days.
I am struggling to find out how we can aggregate the metrics captured by collectl when using colmux.
Say I have 10 nodes in my Hadoop cluster, each running collectl as a service. Using colmux I can see the performance metrics of each node in a single view (in single- and multi-line formats). Great!
But what if I want an aggregate of CPU, IO, etc. across all the nodes in the cluster? That is, I want to find out how my cluster as a whole is performing by aggregating the performance metrics from each node into corresponding cluster-level numbers instead of node-level ones.
Any help is greatly appreciated. Thanks!
I had already answered this on the mailing list, but for the benefit of those not on it I'll repeat myself here.
That's a cool idea. So if I understand you correctly you might see some sort of total line at the bottom? I can always add to my wish list but no promises. But I think I may also have a solution if you don't mind doing a little extra work on your own ;) btw - can I assume you've installed readkey so you can change sort columns with the arrow keys?
If you run colmux with --noesc, it will take it out of full-screen mode and simply print everything as scrolling output. If you then also include "--lines 99999" (or some big number) it will print all the output from all the remote systems so you don't miss anything. Finally you can pipe the output through perl, python, bash, or whatever your favorite scripting tool might be and do the totals yourself. Then whenever you see a new header fly by, print the totals and reset the counters to 0. You could even add timestamps and maybe ultimately make it your own open-source project. I bet others would find it useful too.
-mark
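For anyone who wants to try Mark's suggestion, here is a rough sketch of such a totaling filter in Python. It assumes header lines start with '#', data lines are whitespace-separated with the hostname in the first column, and every other column is numeric - none of which is guaranteed by colmux, so adjust it to your actual output:

```python
#!/usr/bin/env python3
# Totaling filter: run colmux with --noesc and a big --lines value (as
# described above) and pipe its output into this script.
import sys

totals = None

def flush(totals):
    if totals:
        print("TOTAL " + " ".join(f"{t:g}" for t in totals))

for line in sys.stdin:
    line = line.rstrip("\n")
    print(line)                      # pass the original output straight through
    if line.startswith("#"):         # new header: print totals, reset to 0
        flush(totals)
        totals = None
        continue
    fields = line.split()
    if not fields:
        continue
    values = []
    for f in fields[1:]:             # skip the hostname column
        try:
            values.append(float(f))
        except ValueError:
            values.append(0.0)       # non-numeric field, count it as zero
    totals = values if totals is None else [t + v for t, v in zip(totals, values)]

flush(totals)                        # emit whatever accumulated at EOF
```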

Creating a printable report showing statistics from Windows Performance Analyzer Xperf

This is a very simple question.
I run Xperf and get all the statistics about the execution of programs, applications, and so on.
I would like to find a tool that enables me to create a printable report of all the data collected thanks to Xperf.
Xperf, in fact, lets me view all the data and information regarding disk usage, CPU usage, times, overheads and so on, but does not let me print them. How can I do something like this?
Thanks.
Take a look at the xperf command line "actions" described at http://msdn.microsoft.com/en-us/library/ff190853(v=VS.85).aspx.
I've also read that Microsoft's LogParser can handle ETW input files, but I haven't tried it myself.
Gary Kratkin
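For a concrete starting point, here is a hedged sketch that shells out to the "dumper" action (one of the actions documented at the link above) to turn a trace into text you can print or post-process. The paths are placeholders, and you should check "xperf -help processing" for the actions available in your WPT build:

```python
# Hedged sketch: export an ETL trace to a text/CSV event dump via xperf's
# dumper action, then print or post-process the result.
import subprocess

etl = r"C:\traces\trace.etl"          # assumed input trace (placeholder path)
out = r"C:\traces\trace_events.csv"   # dumper writes a CSV-style event dump

subprocess.run(["xperf", "-i", etl, "-o", out, "-a", "dumper"], check=True)
print(f"Wrote {out}; open it in Excel or feed it to LogParser to build a report.")
```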
The xperf actions are a decent solution, but with WPT 8.1 there is a better option -- wpaexporter.exe. This lets you configure the data to be exported in a more-or-less arbitrary way. See this blog post for details:
http://randomascii.wordpress.com/2013/11/04/exporting-arbitrary-data-from-xperf-etl-files/

Do I need an ETL?

We currently use Datastage ETL to export a CSV/text file with data from 15 tables (3 different schemas) on a daily basis.
I am wondering if there is a simpler way to accomplish this without using an ETL tool. I tried Scriptella; it looks simple and fast, but again, it is an ETL. Please suggest.
We use Python. Every programming language -- every single one ever invented -- is an alternative to an ETL.
You never need an ETL.
The questions are these:
Which is cheaper to build: custom software or a configuration of an ETL tool?
Which is cheaper to maintain and operate?
Which is easier to adapt to changing requirements?
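To make the custom-software option concrete for the original task (a daily CSV export from a handful of tables), here is a rough Python sketch; the ODBC driver, connection string, and table names are placeholders, not details from the question:

```python
# Rough sketch of the "just write code" option: dump tables to CSV once a day.
import csv
import pyodbc

CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=dbhost;DATABASE=sales;Trusted_Connection=yes")   # placeholder
TABLES = ["schema1.orders", "schema2.customers", "schema3.shipments"]  # ...up to 15 tables

def export_table(cursor, table, path):
    cursor.execute(f"SELECT * FROM {table}")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cursor.description])  # header row
        writer.writerows(cursor.fetchall())

with pyodbc.connect(CONN_STR) as conn:
    cursor = conn.cursor()
    for table in TABLES:
        export_table(cursor, table, table.replace(".", "_") + ".csv")
```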
Why not use a free and easy-to-use ETL tool such as expressor Studio? You can download it at http://www.expressorstudio.com.
My 2 cents.
Datastage is an awful tool, and expensive to license.
SSIS is much simpler, or cloverETL is good.
ETL tool vs. code is a good question.
ETL tools often have better performance, since they can queue data up ready to be used, whereas hand-written code typically processes it one row at a time; Datastage can also do this in parallel (but again, I think it blows). Plus, ETL tools can get data from multiple heterogeneous sources, which you can't do (easily) with code.
However, if the data transformations are all to be done with data on the same server, I generally end up doing as much in SQL/TSQL (or PL/SQL) as possible, as it is just tonnes easier to debug and maintain. Primary keys and foreign keys are your friend, and any missed lookups can be caught by checking counts later on to ensure data integrity is in order.
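As a small illustration of that count-check idea, here is a hedged sketch using sqlite3 as a stand-in database; the table and column names are hypothetical:

```python
# Check for missed lookups after a load: count fact rows with no matching dimension row.
import sqlite3

conn = sqlite3.connect("warehouse.db")
missed = conn.execute(
    """
    SELECT COUNT(*)
    FROM orders o
    LEFT JOIN customers c ON c.customer_id = o.customer_id
    WHERE c.customer_id IS NULL      -- rows whose lookup found no match
    """
).fetchone()[0]
if missed:
    print(f"Data integrity check failed: {missed} orders have no matching customer")
conn.close()
```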
You do not need an ETL tool for that purpose. You can perform all the tasks using Python, from extracting data from CSV/XML/text files, through transforming the data (identifying data types, null-value transformation), to loading it into tables.
https://towardsdatascience.com/python-etl-tools-best-8-options-5ef731e70b49
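A tiny sketch of the transform-and-load half in plain Python, using sqlite3 as a stand-in target; the file name, table, columns, and null handling are hypothetical examples:

```python
# Read a CSV, apply simple type/null transformations, and load it into a table.
import csv
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount REAL, region TEXT)")

with open("orders.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        conn.execute(
            "INSERT INTO orders (id, amount, region) VALUES (?, ?, ?)",
            (
                int(row["id"]),
                float(row["amount"]) if row["amount"] else None,  # null-value transformation
                row["region"].strip() or None,
            ),
        )
conn.commit()
conn.close()
```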
ETL can definitely be performed without the help of ETL tools.
For example, we can develop Python scripts, or there are open-source options like Drift to work with.
I think it's better to use a cheap ETL tool for your task, because ETL tools generally work better than hand-rolled code and make your task easier.
ETL Tool Vs Manual Script
“According to the IT research firm Forrester, the low-code development platform market will reach a value of $21.2 billion by 2022, growing at an annual rate of 40 percent. What’s more, 45 percent of developers have already used a low-code platform or expect to do so in the near future.”

5GB file to read

I have a design question. I have a 3-4 GB data file, ordered by time stamp. I am trying to figure out what the best way is to deal with this file.
I was thinking of reading this whole file into memory, then transmitting this data to different machines and then running my analysis on those machines.
Would it be wise to upload this into a database before running my analysis?
I plan to run my analysis on different machines, so doing it through a database would be easier, but if I increase the number of machines running my analysis, the database might get too slow.
Any ideas?
Update:
I want to process the records one by one. Basically I am trying to run a model on the timestamped data, but I have various models, so I want to distribute the work so that the whole process runs overnight every day. I want to make sure that I can easily increase the number of models without decreasing system performance, which is why I am planning to distribute the data to all the machines running the models (each machine will run a single model).
You can even access the file on the hard disk itself and read a small chunk at a time. Java has something called RandomAccessFile for this, but the same concept is available in other languages as well.
Whether you want to load it into a database and do the analysis there should be governed purely by the requirements. If you can read the file and keep processing it as you go, there is no need to store it in a database. But if your analysis requires data from all the different areas of the file, then a database would be a good idea.
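A rough Python analogue of that RandomAccessFile idea (the path, offset, and chunk size are arbitrary examples):

```python
# Jump to an offset and read a fixed-size chunk instead of loading the whole file.
CHUNK = 64 * 1024  # 64 KB per read

with open("big_timestamped_file.dat", "rb") as f:
    f.seek(1_000_000_000)        # jump straight to ~1 GB into the file
    chunk = f.read(CHUNK)        # read just this small window
    # ...parse the records found in `chunk` here...
```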
You do not need the whole file in memory, just the data you need for analysis. You can read every line, store only the needed parts of the line, and additionally store the offset where the line starts in the file, so you can find it later if you need more data from that line.
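Here is a small sketch of that offset-index idea; the file name and field layout are hypothetical:

```python
# Keep only the fields you need plus the byte offset of each line, so you
# can seek back later for the full record.
index = []  # (timestamp_field, byte_offset) pairs

with open("big_timestamped_file.dat", "rb") as f:
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        timestamp = line.split(b",", 1)[0]   # keep just the first field
        index.append((timestamp, offset))

# Later: re-read the full record for, say, the 43rd entry.
with open("big_timestamped_file.dat", "rb") as f:
    f.seek(index[42][1])
    full_line = f.readline()
```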
Would it be wise to upload this into a database before running my analysis?
Yes.
I plan to run my analysis on different machines, so doing it through a database would be easier, but if I increase the number of machines running my analysis, the database might get too slow.
Don't worry about it, it will be fine. Just introduce a marker so the rows processed by each computer are identified.
I'm not sure I fully understand all of your requirements, but if you need to persist the data (refer to it more than once), then a DB is the way to go.
Only store the data you need, not everything in the files.
Depending on the analysis needed, this sounds like a textbook case for using MapReduce with Hadoop. It will support your requirement of adding more machines in the future. Have a look at the Hadoop wiki: http://wiki.apache.org/hadoop/
Start with the overview, get the standalone setup working on a single machine, and try doing a simple analysis on your file (e.g. start with a "grep" or something). There is some assembly required but once you have things configured I think it could be the right path for you.
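To make the "start with a grep" suggestion concrete, here is a rough sketch of a first Hadoop Streaming job, kept in Python; the pattern and the assumption that each line starts with a timestamp are placeholders:

```python
#!/usr/bin/env python3
# mapper.py - emit a count of 1 for every line matching a pattern, keyed
# by the hour taken from the leading timestamp (layout is a placeholder).
import sys

PATTERN = "ERROR"                      # hypothetical thing to grep for

for line in sys.stdin:
    if PATTERN in line:
        hour = line[:13]               # e.g. "2011-03-07 14" from the timestamp
        print(f"{hour}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py - sum the counts emitted by mapper.py (input arrives sorted by key).
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{total}")
        current_key, total = key, 0
    total += int(value)
if current_key is not None:
    print(f"{current_key}\t{total}")
```

Submit them with the hadoop-streaming jar that ships with your Hadoop distribution, passing mapper.py and reducer.py as the -mapper and -reducer arguments.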
I had a similar problem recently, and just as lalit mentioned, I used the RandomAccessFile reader against my file on the hard disk.
In my case I only needed read access to the file, so I launched a bunch of threads, each thread starting at a different point in the file. That got the job done and really improved my throughput, since each thread could spend a good amount of time blocked doing some processing while other threads could be reading the file.
A program like the one I mentioned should be very easy to write, just try it and see if the performance is what you need.
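A rough Python sketch of that multi-threaded pattern, with each worker seeking to its own slice of the file (the file name, worker count, and process() body are placeholders):

```python
# Each worker opens its own handle, seeks to its slice, and processes it independently.
import os
import threading

PATH = "big_timestamped_file.dat"
WORKERS = 4

def process(line):
    pass                                 # placeholder for the real analysis

def worker(start, end):
    with open(PATH, "rb") as f:          # each thread gets its own file handle
        f.seek(start)
        if start:
            f.readline()                 # skip the partial line at the slice boundary
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            process(line)

size = os.path.getsize(PATH)
slice_size = size // WORKERS
threads = [
    threading.Thread(
        target=worker,
        args=(i * slice_size, size if i == WORKERS - 1 else (i + 1) * slice_size),
    )
    for i in range(WORKERS)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Note that in CPython the threads only help while they are blocked on I/O or in native code; for CPU-bound models you would likely swap threading for multiprocessing.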

Perform a batch of queries on a set of Shark performance logs?

I've been using Shark to benchmark a (very large) application and have a set of features I drill down into each time (e.g., focus on one function and remove stacks with particular others to determine the milliseconds for a particular feature on that run). So far, so good.
I'd like to write a script that takes in a bunch of shark session files and outputs the results of these queries for each file: is there a way to programmatically interact with Shark, or perhaps a way to understand the session log format?
Thanks!
I think this will be tricky unless you can reverse-engineer the Shark data files. The only other possibility I can think of is to export the profiles as text and manipulate these (obviously only works if there's enough info in the exported text to do what you need to do.)
I would also suggest asking the question again on Apple's PerfOptimization-dev mailing list (PerfOptimization-dev@lists.apple.com) - there are a number of Apple engineers on that list who can usually come up with good advice when it comes to performance and the Apple CHUD tools etc.
