Post Processing LTTng (CTF) Trace Data

I've created tracepoints that capture some raw data. I want to post-process this data and possibly create a new viewer for the tracing perspective in Eclipse, but I really have no idea where to start. I was hoping to find a document that describes how to create a new viewer for the Eclipse tracing perspective, how to read the CTF files, and how to graph the results in the view.
Alternatively, I'd just like to read the trace data and add some new trace events with post-processed data.
As background to the question, I want to perform analysis on the trace timestamps and generate statistics about the average throughput and latency. Although I can do this while inserting the tracepoint, I'd like to offload the math to the analysis portion.
Rich

In general, such analysis is better done in post-processing. Doing it at runtime in your traced program may affect performance, to a point where the data you collect is not representative of the real behaviour of the application anymore!
The Trace Compass documentation, particularly this section, explains how to create new graphical views in Eclipse.
If you want to output a time-graph or XY-chart view, you can also look at the data-driven XML interface. It is more limited in features, but it can work straight off the RCP (no need to recompile, no need to set up the dev environment).
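If you'd rather do the post-processing outside of Eclipse entirely, the Babeltrace 2 Python bindings (the bt2 module that ships with Babeltrace 2) can iterate over an LTTng/CTF trace directly. Below is a minimal sketch computing average latency and throughput from paired events; the event names myapp:req_start / myapp:req_end and the id payload field are placeholders for whatever your tracepoints actually record.

    import bt2

    # Iterate over every message in a CTF trace directory recorded by LTTng.
    msg_it = bt2.TraceCollectionMessageIterator('/path/to/lttng/trace')

    starts = {}          # request id -> start timestamp (ns)
    latencies_ns = []
    first_ts = last_ts = None

    for msg in msg_it:
        # Only event messages carry tracepoint payloads; skip stream/packet messages.
        if type(msg) is not bt2._EventMessageConst:
            continue
        event = msg.event
        ts = msg.default_clock_snapshot.ns_from_origin
        first_ts = ts if first_ts is None else first_ts
        last_ts = ts
        if event.name == 'myapp:req_start':
            starts[int(event.payload_field['id'])] = ts
        elif event.name == 'myapp:req_end':
            start = starts.pop(int(event.payload_field['id']), None)
            if start is not None:
                latencies_ns.append(ts - start)

    if latencies_ns and last_ts > first_ts:
        print('avg latency : %.3f us' % (sum(latencies_ns) / len(latencies_ns) / 1e3))
        print('throughput  : %.1f req/s' % (len(latencies_ns) / ((last_ts - first_ts) / 1e9)))

From there you can dump the derived statistics to CSV or feed them into whatever viewer you end up building.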

Related

Allow OpenTelemetry sampling decision at the end of the span

It seems that the current OTel specification only allows making a sampling decision based on a span's initial attributes.
This is a shame, because I'd like to always include some high-signal spans, e.g. the ones with errors or long durations. These fields are typically only populated just before a span ends, i.e. too late for a sampling decision under the current specs.
Is there some other approach to get what I want? Or is it reasonable to open an issue in the repo to discuss allowing this use case?
Some context for my situation:
I'm working on a fairly small project with no dedicated resources for telemetry infrastructure. Instead, we're exporting spans directly from our Node.js app server to Honeycomb, and we would like to get a more complete picture of errors and long-duration requests while sampling out low-signal spans to keep our costs under control.
There are a few ways you could achieve this.
Implement your own SpanProcessor that filters out these spans (see the sketch below). This can get hairy quickly, since it breaks the trace: some spans may end up with a parentId pointing to a span that was dropped.
Another way is tail sampling, i.e. making the sampling decision for an entire trace after its spans have completed, based on certain criteria. There is a processor for that in opentelemetry-collector-contrib: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor. Note that the agent/gateway Collector deployment doing tail sampling has to have access to the full trace, and some buffering is involved.
I think Honeycomb also has a component that can be used for sampling telemetry, but I have never used it: https://github.com/honeycombio/refinery.
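For the first option, here is a rough sketch of the filtering idea, shown with the OpenTelemetry Python SDK for brevity (the Node.js SDK exposes an analogous SpanProcessor interface). The 500 ms threshold and the wrapped processor are placeholders, and the caveat above about broken parent links still applies.

    from opentelemetry.sdk.trace import SpanProcessor
    from opentelemetry.trace.status import StatusCode

    class HighSignalSpanProcessor(SpanProcessor):
        """Forward only error spans and slow spans to the wrapped processor."""

        def __init__(self, wrapped, min_duration_ns=500_000_000):  # 500 ms, adjust to taste
            self._wrapped = wrapped
            self._min_duration_ns = min_duration_ns

        def on_start(self, span, parent_context=None):
            self._wrapped.on_start(span, parent_context)

        def on_end(self, span):
            # The span is finished here, so its duration and status are known.
            slow = (span.end_time - span.start_time) >= self._min_duration_ns
            failed = span.status.status_code is StatusCode.ERROR
            if slow or failed:
                self._wrapped.on_end(span)

        def shutdown(self):
            self._wrapped.shutdown()

        def force_flush(self, timeout_millis=30000):
            return self._wrapped.force_flush(timeout_millis)

You would register it with something like provider.add_span_processor(HighSignalSpanProcessor(BatchSpanProcessor(exporter))). Because the decision is made per span rather than per trace, the parents of a kept span may still be missing unless you also keep them.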

Hadoop and Stata

Does anyone have any experience using Stata and Hadoop? Stata 13 now has a Java Plugin API, so I think it should be straightforward to get them to play nice.
I am particularly interested in being able to parse weblog data to get it into a form suitable for statistical analysis.
This question came up on Statalist recently, but there was no response, so I thought I would try it here where the audience is more likely to have experience with this technology.
Dimitry,
I think it would be easier to do something like this using the ELK stack (http://www.elastic.co). Logstash (the middle layer) has several parsers/tokenizers/analyzers built on the Apache Lucene engine for cleaning and formatting log data, and it can push the resulting data into Elasticsearch. Elasticsearch exposes an HTTP API that you can curl fairly easily to get results (e.g., use insheetjson and pass the HTTP GET request as the URL, and it should be imported into Stata without much problem).
I've been trying to cobble together a program to use the Jackson JSON library to build out more robust JSON I/O capabilities from within Stata and would definitely not mind trying to work with others to get it done.
Hope this helps,
Billy
I'll take an (un?)educated stab at this. From the looks of the Java API, the caller seems to treat Stata as essentially a datastore. If that's the case, then I would imagine Stata would fit into the Hadoop world as a database and would be accessed through its own InputFormat and OutputFormat. In your specific case, I'd imagine you'd write a StataOutputFormat which your reducer would use to write the parsed data. The only drawback, given your referenced comments that Stata apps tend to be I/O bound, is that I don't know that using Hadoop is really going to help you, since
you'll have to write all that data anyway, and
that write will be I/O bound, whether you use Hadoop or not.

Store debug data in format having good viewer support

I'd like to include various debug outputs in my Python program. Summary information and some variables should be saved to a file for later examination. This will happen at different levels, down to very verbose output inside loops (that's why simple logging isn't enough), and maybe larger dumps of big tables.
To debug a problem, I'd like to examine the different data states of the program after it has run. I want to filter entries (e.g. by module) and choose the desired verbosity level per module.
Do you know a good data format for that (i.e. one that stores different verbosity levels and lets me select one, and that can hold dumped data tables)? Ideally one supporting a tree view and having a ready-made, nice viewer. The tree view could reflect module and method names, but other ideas are welcome.

Perform a batch of queries on a set of Shark performance logs?

I've been using Shark to benchmark a (very large) application, and I have a set of features I drill down into each time (e.g., focus on one function and remove stacks containing particular other functions to determine the milliseconds spent on a particular feature in that run). So far, so good.
I'd like to write a script that takes in a bunch of Shark session files and outputs the results of these queries for each file: is there a way to programmatically interact with Shark, or perhaps a way to understand the session log format?
Thanks!
I think this will be tricky unless you can reverse-engineer the Shark data files. The only other possibility I can think of is to export the profiles as text and manipulate these (obviously only works if there's enough info in the exported text to do what you need to do.)
I would also suggest asking the question again on Apple's PerfOptimization-dev mailing list (PerfOptimization-dev@lists.apple.com) - there are a number of Apple engineers on that list who can usually come up with good advice when it comes to performance and the Apple CHUD tools, etc.

Where is Pentaho Kettle's architecture?

Where can I find documentation on Pentaho Kettle's architecture? I'm looking for a short wiki, design document, blog post - anything that gives a good overview of how things work. This question is not about specific "how to" getting-started guides but rather about a good view of the technology and architecture.
Specific questions I have are:
How does data flow between steps? It would seem everything is in memory - am I right about this?
Is the above true about different transformations as well?
How are the Collect steps implemented?
Any specific performance guidelines for using it?
Is the ftp task reliable and performant?
Any other "Dos and Don'ts" ?
See this PDF.
How does data flow between steps? It would seem everything is in memory - am I right about this?
Data flow is row-based. In a transformation, every step produces 'tuples', i.e. rows with fields, where every field is a pair of data and metadata. Every step has inputs and outputs: a step takes rows from its input, modifies them, and sends them to its outputs. In most cases all of that information is in memory, but steps read data in a streaming fashion (over JDBC, for example), so typically only a part of the stream is in memory at any given time.
Is the above true about different transformations as well?
Kettle has a 'job' concept and a 'transformation' concept. Everything written above is mostly true for transformations; 'mostly' because a transformation can contain very different steps, and some of them, like the collect steps, may try to collect all the data from a stream. Jobs are a way to perform actions that do not follow the streaming concept, such as sending an email on success, loading files from the network, or executing transformations one after another.
How are the Collect steps implemented?
It depends on the particular step. Typically, as said above, collect steps may try to collect all the data from the stream, which can be a source of OutOfMemory exceptions. If the data is too big, consider replacing 'collect' steps with a different approach to processing the data (for example, use steps that do not collect all the data).
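To make the difference concrete, here is a toy sketch in plain Python (not Kettle's actual API) of the row-streaming model: streaming steps touch one row at a time, while a collect-style step has to materialize the whole stream, which is where the memory pressure comes from.

    def read_rows(n):
        # Source step: yields one row at a time instead of building a list.
        for i in range(n):
            yield {'id': i, 'value': i * 2}

    def add_field(rows):
        # Streaming step: constant memory, one row in flight at a time.
        for row in rows:
            row['value_squared'] = row['value'] ** 2
            yield row

    def sort_rows(rows):
        # 'Collect'-style step: must hold every row before it can emit anything.
        return sorted(rows, key=lambda r: r['value_squared'], reverse=True)

    pipeline = add_field(read_rows(100_000))   # still lazy, nothing buffered yet
    top_five = sort_rows(pipeline)[:5]         # materializes the entire stream
    print(top_five)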
Any specific performence guidelines to using it?
There are a lot of them. They depend on the steps the transformation consists of and the data sources used. I would rather discuss an exact scenario than give general guidelines.
Is the ftp task reliable and performant?
As far as I remember, FTP is backed by the EdtFTP implementation, and there can be some issues with those steps, such as some parameters not being saved or an HTTP-to-FTP proxy not working. I would say Kettle in general is reliable and performant, but for some less commonly used scenarios it may not be.
Any other "Dos and Don'ts" ?
I would say the main "do" is to understand the tool before starting to use it intensively. As mentioned in this discussion, there is some literature on Kettle/Pentaho Data Integration; you can search for it on specific sites.
One of the advantages of Pentaho Data Integration/Kettle is its relatively big community, which you can ask about specific aspects:
http://forums.pentaho.com/
https://help.pentaho.com/Documentation
