Allow Open Telemetry sampling decision at the end of the span - open-telemetry

It seems that the current otel specification only allows to make a sampling decision based on the initial attributes.
This is a shame because I'd like to always include some high signal spans. E.g the ones with errors or long durations. These fields are typically only populated before ending a span. I.e. too late for a sampling decision under the current specs.
Is there some other approach to get what I want? Or is it reasonable to open an issue in the repo to discuss allowing this use case?
Some context for my situation:
I'm working on a fairly small project with no dedicated resources for telemetry infrastructure. Instead we're exporting spans directly from our node.js app server to honeycomb and would like to get a more complete picture of errors and long-duration requests while sampling low-signal spans to keep our cost under control.

There are certain ways you could achieve this.
Implementing your own SpanProcessor which filters out these spans. This can get hairy quickly since it breaks the trace and some span might have parentId set to span which is dropped.
Another way to achieve this is to do tail sampling i.e drop the entire trace if it matches certain criteria and there is processor for that in opentelemetry collector contrib https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor. Please note the agent/gateway deployment of collector which is doing tail sampling has to have the access to full trace and there is also some buffering done.
I think honeycomb also has some component which can be used for sampling the telemetry but I have never used it https://github.com/honeycombio/refinery.

Related

Using Graphite/Grafana for non time based data

I have been reading through documentation and have been able to get Graphite to receive data I have been sending it. I can definitely see the benefit in tracking things like concurrent users and network load.
But now I have been tasked with implementing it on the client to show things like RAM, and CPU usage. In addition to this, I must also track actions (users buying things). Maybe I am missing a large chunk of the picture here, but I am not sure how I would do these things. Do I need a timestamp? I've also seen plugins for pie charts and such and this would indicate I could perhaps create graphs from devices with different statistics.
What am I missing?
I don't think you're missing anything.
Any data you send on to, say, InfluxDB (as that's what I've used the most) will be timestamped automatically when it arrives - unless you specify an explicit one yourself.
If you're showing, for example, CPU load you can write your query to pick up the latest value, or perhaps an average (mean value) over time, or the max or min over a period of time as appropriate.
Pie charts can also be used successfully to graph relationships over time.
The key is to create a specific query (I use SQL directly) to craft the data used for the panel type. There are excellent examples in the documentation.

Nifi processor batch insert - handle failure

I am currently in the process of writing an ElasticSearch Nifi processor. Individual inserts / writes to ES are not optimal, instead batching documents is preferred. What would be considered the optimal approach within a Nifi processor to track (batch) documents (FlowFiles) and when at a certain amount batch them in? The part I am most concerned about is if ES is unavailable, down, network partition, etc. prevents the batch from being successful. The primary point of the question, is given that Nifi has content storage for queuing / back-pressure, etc. is there a preferred method for using that to ensure no FlowFiles get lost if a destination is down? Maybe there is another processor I should look at for an example?
I have looked at the Mongo processor, Merge, etc. to try and get an idea of the preferred approach for batching inside of a processor, but can't seem to find anything specific. Any suggestions would be appreciated.
Good chance I am overlooking some basic functionality baked into Nifi. I am still fairly new to the platform.
Thanks!
Great question and a pretty common pattern. This is why we have the concept of a ProcessSession. It allows you to send zero or more things to an external endpoint and only commit once you know it has been ack'd by the recipient. In this sense it offers at least-once semantics. If the protocol you're using supports two-phase commit style semantics you can get pretty close to the ever elusive exactly-once semantic. Much of the details of what you're asking about here will depend on the destination systems API and behavior.
There are some examples in the apache codebase which reveal ways to do this. One way is if you can produce a merged collection of events prior to pushing to the destination system. Depends on its API. I think PutMongo and PutSolr operate this way (though the experts on that would need to weigh in). An example that might be more like what you're looking for can be found in PutSQL which operates on batches of flowfiles to send in a single transaction (on the destination DB).
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/PutSQL.java
Will keep an eye here but can get the eye of a larger NiFi group at users#nifi.apache.org
Thanks
Joe

Post Processing LTTNG (CTF) Trace Data

I've created tracepoints that capture some raw data. I want to be able to post-process this data and possibly create a new viewer for the tracing perspective in Eclipse but I really have no idea where to start. I was hoping to find a document that described how to create a new viewer for the trace eclipse perspective, how to read the ctf files, and how to graph the results in the view.
Alternatively, I'd just like to read the trace data and add some new trace events with postprocessed data.
As background to the question, I want to perform analysis on the trace timestamps and generate statistics about the average throughput and latency. Although I can do this while inserting the tracepoint, I'd like to offload the math to the analysis portion.
Rich
In general, such analysis is better done in post-processing. Doing it at runtime in your traced program may affect performance, to a point where the data you collect is not representative of the real behaviour of the application anymore!
The Trace Compass documentation, particularly this section, explains how to create new graphical views in Eclipse.
If you want to output a time-graph or XY-chart view, you can also look at the data-driven XML interface. It is more limited in features, but can work straight off the RCP (no need to recompile, no need to setup the dev environmnent).

Are Sitecore's sublayout rendering stats incorrect?

The built-in Sitecore rendering stats http://<sitename>/sitecore/admin/stats.aspx is really helpful for identifying inefficient and slow-loading XSLT renders. Recently I've started switching to .ascx sub layouts to take advantage of the Sitecore C# API which can help improve performance when used correctly.
However, I've noticed that sub layouts (as opposed to XSLT renders) are not reported correctly on the stats page. See the screenshot below....
I know for a fact that this sub layout takes about 1.8 seconds to generate (I calculated this in the code-behind). Caching is turned off. I've refreshed the page 20 times to ensure I get an average. You will see that the "Avg. items" is always 0 - I can live with this - but the "Avg. time (ms)" is less than 1ms which is just clearly wrong.
Does anyone have any insights into this? Has anyone found a way to get it to work correctly?
Judging whether a statistic is right/wrong is going to rely on understanding exactly what it is measuring.
Digging around in Sitecore.Diagnostics.Statistics using Reflector I note the following:
Sitecore.Web.UI.Webcontrol contains a field m_timer
This is 'started' in the BeforeRender() method and 'stopped' in the AfterRender() method
Data from that timer is sent to Statistics.AddRenderingData() and is logged against the control
This means it is measuring the time taken to render the control, which for an XSLT includes the processing time for preparing all the data in it, but as much of the work of a normal ASCX is done prior to the Render-stage the statistic is much less useful. Incorporating the Load stage in the time would inadvertently include the processing time for all child components, since the Load sequence is chained and called recursively, so that probably doesn't help much either.
I suspect there is no good way of measuring the processing time for a specific ASCX control (excluding children) without first acquiring cumulative data then post-processing the call chain and splitting the time apart. This is the sort of thing RedGate ANTS does really well, but might not be so good if it was being executed on a live production system, given the overheads.

How are Process Explorer's memory metrics: WS Private, WS Shareable, WS Shared columns calculated?

I am continuing my saga to understand memory consumption by VB6 application.
The option that seems to work best so far is to monitor various memory metrics at key points at run-time and understand where big memory hogs are.
The measure driver to study this, is to understand how the application scalability in multi-user environment in Terminal Server (Citrix) is impacted due to changes in memory consumption (in simple terms more memory you use, less users you can fit on the server).
I can get most memory metrics for the process using GetProcessMemoryInfo.
Process Explorer reports additional metrics WS Private, WS Shareable, WS Shared - which seem very interesting for my investigation.
So question is, is there standard/hidden API to get these metric for a process? I would like to query these metrics programatically, so that I can capture them at key spots during application run and understand memory usage better.
See the QueryWorkingSet API. This looks rather nasty to use though, as it returns info on a per-page basis and would therefore leave it up to you to aggregate the totals. If there's a better method, please leave a comment and I'll delete this answer.
Also, if you have specific places in mind where you want to monitor changes in the working set, you might want to check out the InitializeProcessForWsWatch and GetWsChanges APIs -- these might make it easier to see how many pages have been faulted in rather than having to walk the entire page set before and after.

Resources