Store debug data in format having good viewer support - debugging

I'd like to include various debug outputs in my Python program. Summary information and some variables should be saved in a file for later examination. This will be at different levels down to very verbose outputs in loops (that's why simple logging isn't enough). And maybe larger data dumps of large tables.
For debugging a bug, I'd like to examine the different data states of the program after the program has run. I want to filter entries (e.g. only my module) and choose the desired verbosity level per module.
Do you know a good data format for that (i.e. storing different verbosity levels and being able to select one; dumping data tables)? Maybe one supporting a tree view and having a ready-made nice viewer. The tree view could relate to module and method names, but other ideas are welcome.
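One possible shape for such a store, sketched under assumptions rather than offered as a recommendation: JSON Lines (one JSON object per line), where each record carries a module name, a numeric verbosity level, a message and an optional data payload. The field names and the `DebugRecorder` helper are hypothetical and not part of any existing library. Selecting a verbosity level per module is then a single pass over the file.

```python
import io
import json


class DebugRecorder:
    """Hypothetical helper: writes one JSON object per line (JSON Lines)."""

    def __init__(self, stream):
        self.stream = stream

    def record(self, module, level, message, data=None):
        entry = {"module": module, "level": level, "msg": message, "data": data}
        self.stream.write(json.dumps(entry) + "\n")


def select(stream, max_level_by_module, default_level=0):
    """Yield records whose level is within the verbosity chosen for their module."""
    for line in stream:
        entry = json.loads(line)
        limit = max_level_by_module.get(entry["module"], default_level)
        if entry["level"] <= limit:
            yield entry


# An in-memory buffer stands in for the debug file.
buf = io.StringIO()
rec = DebugRecorder(buf)
rec.record("parser", 1, "start")
rec.record("parser", 3, "loop iteration", {"i": 0, "row": [1, 2]})
rec.record("solver", 2, "converged")

buf.seek(0)
selected = list(select(buf, {"parser": 3}))  # verbose for parser, silent for the rest
```

Because every line is a self-contained JSON object, the file can also be opened with generic JSON or table viewers; a dotted module name such as `package.module.method` could carry the tree structure.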

Related

Python structure for working with variable log messages in a large file

I am trying to get data out of debug log messages created by a certain piece of open source software. It has many lines describing what it is doing during stages. It does not have a specific structure, i.e. some data covers multiple lines with different indents and no separator so does not import nicely into a pandas data frame, which would be my go-to usually.
Is there a good way to structure a Python script that parses this data, can be reused in the future for the same job, and can be extended to extract different data? I have to do a bunch of different steps to extract the data. The other complication is that the file is much too big to store in memory (10^6 lines), so I need to iterate through the lines.
Could anyone give me some tips on how to do this? Is it best to do each step and save the result to a new file? Alternatively, my idea is to create a data object that stores the relevant line numbers as attributes in lists, generated by different methods, so that each subsequent method only loads the lines in its list.
Or maybe I am using the wrong tool entirely and need to learn awk or regex commands to do it? I already know Python, so I have a preference for it. I'm not necessarily looking for a specific answer; tips and pointers would also be very useful!
(--details--) I am trying to trace on a freeradius server the difference between log messages of requests, accepts and rejects of a mac address to see if I can find out why it is sometimes accepted and other times rejected, seemingly randomly.
There were a lot of plugins running on the server setup before I got to dealing with it, so the debug output is a massive wall of text that labels each request with a number. I can split it into requests by that number, find the requests that mention the MAC, and split those requests into different files. Then I want to filter out all the boilerplate info that comes with each message and get to the things that differ between them. (--details--)
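A minimal sketch of the line-by-line approach described above, assuming each debug line starts with its request number in parentheses, e.g. `(12) ...`. That line format, the regex, the function names and the sample lines are all assumptions to adapt to the real log. It makes two passes over the file so that only a small set of request numbers is held in memory.

```python
import io
import re

REQUEST_RE = re.compile(r"^\((\d+)\)")  # assumed prefix, e.g. "(12) ..."


def requests_mentioning(lines, needle):
    """First pass: collect the numbers of requests whose lines contain needle."""
    wanted = set()
    for line in lines:
        match = REQUEST_RE.match(line)
        if match and needle in line:
            wanted.add(int(match.group(1)))
    return wanted


def lines_of_requests(lines, wanted):
    """Second pass: yield (request_number, line) for the selected requests only."""
    for line in lines:
        match = REQUEST_RE.match(line)
        if match and int(match.group(1)) in wanted:
            yield int(match.group(1)), line.rstrip("\n")


# Made-up sample; with a real log, pass an open file object instead.
sample = io.StringIO(
    "(1) User-Name = aa-bb-cc\n"
    "(2) User-Name = 11-22-33\n"
    "(1) Sent Access-Accept\n"
    "(2) Sent Access-Reject\n"
)
wanted = requests_mentioning(sample, "aa-bb-cc")
sample.seek(0)
selected = list(lines_of_requests(sample, wanted))
```

Each further step (dropping boilerplate, comparing accepts with rejects) can be another generator that consumes the previous one, so no intermediate files are required unless they are wanted for inspection.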

Post Processing LTTNG (CTF) Trace Data

I've created tracepoints that capture some raw data. I want to be able to post-process this data and possibly create a new viewer for the tracing perspective in Eclipse but I really have no idea where to start. I was hoping to find a document that described how to create a new viewer for the trace eclipse perspective, how to read the ctf files, and how to graph the results in the view.
Alternatively, I'd just like to read the trace data and add some new trace events with postprocessed data.
As background to the question, I want to perform analysis on the trace timestamps and generate statistics about the average throughput and latency. Although I can do this while inserting the tracepoint, I'd like to offload the math to the analysis portion.
Rich
In general, such analysis is better done in post-processing. Doing it at runtime in your traced program may affect performance, to a point where the data you collect is not representative of the real behaviour of the application anymore!
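As an illustration of doing the math offline, here is a sketch that assumes the trace has already been read into plain tuples of `(timestamp_ns, kind, request_id, n_bytes)`, sorted by timestamp. The tuple layout and the `begin`/`end` event kinds are hypothetical; this is not a CTF or Trace Compass API.

```python
def latency_and_throughput(events):
    """Return (mean latency in seconds, throughput in bytes per second)."""
    started = {}      # request_id -> timestamp of its "begin" event
    latencies = []    # per-request latency in nanoseconds
    total_bytes = 0
    for ts, kind, req_id, n_bytes in events:
        if kind == "begin":
            started[req_id] = ts
        elif kind == "end" and req_id in started:
            latencies.append(ts - started.pop(req_id))
            total_bytes += n_bytes
    if not latencies:
        return 0.0, 0.0
    span_ns = events[-1][0] - events[0][0]  # wall-clock span of the trace
    mean_latency_s = sum(latencies) / len(latencies) / 1e9
    throughput = total_bytes / (span_ns / 1e9) if span_ns else 0.0
    return mean_latency_s, throughput


# Made-up events for illustration only.
events = [
    (0, "begin", 1, 0),
    (1_000_000, "begin", 2, 0),
    (2_000_000, "end", 1, 1000),
    (4_000_000, "end", 2, 3000),
]
mean_latency, throughput = latency_and_throughput(events)
```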
The Trace Compass documentation, particularly this section, explains how to create new graphical views in Eclipse.
If you want to output a time-graph or XY-chart view, you can also look at the data-driven XML interface. It is more limited in features, but can work straight off the RCP (no need to recompile, no need to set up the dev environment).

Debugging classic ASP - simple way to dump locals to a file in multiple places?

I'm dealing with a bunch of spaghetti in a classic ASP site. I'm trying to figure out what I need for a particular object construct, but am having a lot of trouble because of the number of variables that are used which have (for all practical purposes) global scope. Walking through with the VS debugger isn't getting me very far because there are a lot of state changes and database lookups happening all over the place.
What I'd like to do is create some kind of debug utility to dump all variables in local scope into a file, and be able to call that from several different places in the code so that I can simply compare values to understand the necessary state changes.
It's a simple enough problem to write data out to a file, but there are so many local vars in play, and I'm not sure what are just UI, what are just business data and what are actually controlling flow that I don't yet know what I want to grab.
So that's the problem--the question is:
Is there a built-in way (or a tool of some sort) that will allow me to make a simple call to dump all local variables to some sort of output as name/value pairs or something similar?

Processing gcov data files for tracing purposes

I'm trying to create a tool similar to TraceGL, but for C-type languages:
As you can see, the tool above highlights code flows that were not executed in red.
In terms of building this tool for Objective-C, for example, I know that gcov (and libprofile_rt in clang) output data files that can help determine how many times a given line of code has been executed. However, would the gcov data files be able to tell me when a given line of code occurred during a program's execution?
For example, if line X is called during code paths A and B, would I be able to ascertain from the gcov that code paths A and B called line X given line X alone?
As far as I know, GCOV instrumentation data only tells that some point in the code was executed (and maybe how many times). But there is no relationship between the code points that are instrumented.
It sounds like what you want is to determine paths through the code. To do that, you either need to do static analysis of the code (requiring a full up C parser, name resolver, flow analyzer), or you need to couple the dynamic instrumentation points together in execution order.
The first requires you to find machinery capable of processing C in all of its glory; you don't want to repeat that yourself. GCC, Clang, and our DMS Toolkit are choices. I know that GCC and Clang do pretty serious analysis; I'm pretty sure you could find at least intraprocedural control flow analysis; I know that DMS can do this. You'd have to customize GCC and Clang to extract this data. You'd have to configure DMS to extract this data; configuration is easier than customization because it is a design property rather than a "custom" action. YMMV.
Then, using the GCOV data, you could determine the flows between the GCOV data points. It isn't clear to me that this buys you anything beyond what you already get with just the static control flow analysis, unless your goal is to exhibit execution traces.
To do this dynamically, what you could do is force each data collection point in the instrumented code to note that it is the most recent point encountered; before doing that, it would record the most recent point encountered before it was. This would produce in effect a chain of references between points which would match the control flow. This has two problems from your point of view, I think: a) you'd have to modify GCOV or some other tool to insert this different kind of instrumentation, b) you have to worry about what and how you record "predecessors" when a data collection point gets hit more than once.
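The predecessor-chaining idea can be sketched as follows. This only illustrates the bookkeeping, written in Python for brevity; it is not how GCOV instruments code, and the `FlowRecorder` class and the probe names are hypothetical. Counting each (predecessor, point) edge is one possible answer to the problem of points that are hit more than once.

```python
from collections import Counter


class FlowRecorder:
    """Hypothetical instrumentation hook: records edges between probe points."""

    def __init__(self):
        self.last_point = None   # most recent point encountered
        self.edges = Counter()   # (predecessor, point) -> hit count

    def hit(self, point):
        # Record the predecessor first, then become the most recent point.
        self.edges[(self.last_point, point)] += 1
        self.last_point = point


recorder = FlowRecorder()


def example(flag):
    recorder.hit("entry")
    if flag:
        recorder.hit("branch_a")
    else:
        recorder.hit("branch_b")
    recorder.hit("line_x")


example(True)
example(False)
# recorder.edges now shows that line_x was reached from both branch_a and branch_b.
```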
gcov (or lcov) is one option. It does produce most of the information you are looking for, though how often those files are updated depends on how often __gcov_flush() is called. It's not really intended to be real time, and does not include all of the information you are looking for (notably, the 'when'). There is a short summary of the gcov data format here and in the header file here. lcov data is described here.
For what you are looking for DTrace should be able to provide all of the information you need, and in real time. For Objective-C on Apple platforms there are dtrace probes for the runtime which allow you to trace pretty much anything. There are a number of useful guides and examples out there for learning about dtrace and how to write scripts. Brendan Gregg provides some really great examples. Big Nerd Ranch has done a series of articles on it.

Is avoiding the T in ETL possible?

ETL is pretty commonplace. Data is out there somewhere so you go get it. After you get it, it's probably in a weird format so you transform it into something and then load it somewhere. The only problem I see with this method is you have to write the transform rules. Of course, I can't think of anything better. I suppose you could load whatever you get into a blob (sql) or into an object/document (non-sql) but then I think you're just delaying the parsing. Eventually you'll have to parse it into something structured (assuming you want to). So is there anything better? Does it have a name? Does this problem have a name?
Example
Ok, let me give you an example. I've got a printer, an ATM and a voicemail system. They're all network enabled or I can give you connectivity. How would you collect the state from all these devices? For example, the printer dumps a text file when you type status over port 9000:
> status
===============
has_paper:true
jobs:0
ink:low
The ATM has a CLI after you connect on port whatever and you can type individual commands to get different values:
maint-mode> GET BILLS_1
[$1 bills]: 7
maint-mode> GET BILLS_5
[$5 bills]: 2
etc ...
The voicemail system requires certain key sequences to get any kind of information over a network port:
telnet> 7,9*
0 new messages
telnet> 7,0*
2 total messages
My thoughts
Printer - So this is pretty straight-forward. You can just capture everything after sending "status", split on lines and then split on colons or something. Pretty easy. It's almost like getting a crap-formatted result from a web service or something. I could avoid parsing and just dump the whole conversation from port 9000. But eventually I'll want to get rid of that equal signs line. It doesn't really mean anything.
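The "split on lines, then split on colons" idea could look like this sketch. The function name is made up, and the input is the status dump shown earlier; the prompt echo and the equal-signs line are skipped.

```python
def parse_status(text):
    """Parse 'key:value' lines, skipping the prompt echo and separator lines."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(">") or set(line) == {"="}:
            continue
        key, sep, value = line.partition(":")
        if sep:  # keep only lines that actually contain a colon
            state[key.strip()] = value.strip()
    return state


raw = "> status\n===============\nhas_paper:true\njobs:0\nink:low\n"
printer_state = parse_status(raw)
```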
ATM - So this is a bit more of a pain because it's interactive. Now I'm approaching expect or a protocol territory. It'd be better if they had a service that I could query these values but that's out of scope for this post. So I write a client that gets all the values. But now if I want to collect all the data, I have to define what all the questions are. For example, I know that the ATM has more bills than $1 and $5 so I'd have a complete list like "BILLS_1 BILLS_5 BILLS_10 BILLS_20". If I ask all the questions then I have an inventory of the ATM machine. Of course, I still have to parse out the results and clean up the text if I wanted to figure out how much money is left in the ATM machine. So I could parse the results and figure out the total at data collection time or just store it raw and make sense of it later.
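For the ATM case, a sketch of parsing stored raw responses and computing the total later. It assumes every answer follows the `[$N bills]: count` shape shown in the example; the function names are hypothetical.

```python
import re

# Matches responses such as "[$5 bills]: 2" (format taken from the example session).
BILL_RE = re.compile(r"^\[\$(\d+) bills\]:\s*(\d+)$")


def parse_bills(lines):
    """Map each denomination to the number of bills reported."""
    counts = {}
    for line in lines:
        match = BILL_RE.match(line.strip())
        if match:
            counts[int(match.group(1))] = int(match.group(2))
    return counts


def total_cash(counts):
    """Total dollar amount across all denominations."""
    return sum(denomination * n for denomination, n in counts.items())


raw_lines = [
    "maint-mode> GET BILLS_1",
    "[$1 bills]: 7",
    "maint-mode> GET BILLS_5",
    "[$5 bills]: 2",
]
counts = parse_bills(raw_lines)
total = total_cash(counts)
```

Storing `raw_lines` untouched and running `parse_bills` afterwards corresponds to the "store it raw and make sense of it later" option; running it during collection is the other.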
Voicemail - This is similar to the ATM machine where it's interactive. It's just a bit weirder because the key sequences/commands aren't "get key". But essentially it's the same problem and solution.
Future Proof
Now what if I was going to give you an unknown device? Like a refrigerator. Or a toaster. Or anything? You'd have to write "connectors" ahead of time or write a parser afterwards against some raw field you stored earlier. Maybe in the case of these very limited examples there's no alternative. There's no way to future-proof. You just have to understand the new device and parse it at collection or parse it after the fact (your stored blob/object/document).
I was thinking that all these systems are text driven so maybe you could create a line iterator type abstraction layer that simply requires the device to split out lines. Then you could have a text processing piece that parses based on rules. For the ATM device, you'd have to write something that "speaks ATM" and turns it into lines which the iterator would then take care of. At this point, hopefully you'd be able to say "I can handle anything that has lines of text".
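A rough sketch of that "lines plus rules" layer, assuming each rule is a regular expression with named groups `key` and `value`. The rule sets and the `extract` function are made up for illustration, based on the printer and voicemail outputs shown earlier.

```python
import re

# Hypothetical per-device rule sets; each rule names the key and the value.
PRINTER_RULES = [re.compile(r"^(?P<key>\w+):(?P<value>.+)$")]
VOICEMAIL_RULES = [re.compile(r"^(?P<value>\d+) (?P<key>new|total) messages$")]


def extract(lines, rules):
    """Apply the first matching rule to each line; unmatched lines are ignored."""
    record = {}
    for line in lines:
        for rule in rules:
            match = rule.match(line.strip())
            if match:
                record[match.group("key")] = match.group("value")
                break
    return record


printer = extract(
    ["> status", "===============", "has_paper:true", "jobs:0", "ink:low"],
    PRINTER_RULES,
)
voicemail = extract(["0 new messages", "2 total messages"], VOICEMAIL_RULES)
```

The device-specific part shrinks to whatever produces the lines plus a list of patterns, which is still a transform, only expressed as data rather than code.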
But then what will you call these rules for parsing the text? "Printer rules" might as well be called "printer parser" which is the same to me as "printer transform". Is there a better term for all of this?
I apologize for this question being so open ended. :)
When your sources of information are as disparate as what you illustrate then you have no choice but to implement the Transform in order to bring the items into a common data repository. Usually your data sources won't be this extreme, the data will all be related in some way but you may be retrieving it from different sources (some might come from a nicely structured database, some more might come from an Excel or XML or text file, some more might come from a web service call, etc).
When coding up a custom ETL application, a common pattern is the Provider model, which lets you write a whole bunch of custom providers to load/query and then transform the data. All the providers will implement a common interface with some relatively common function definitions (for example QueryData(), TransformData()), but the implementation of those methods will be wildly different depending on the data source being dealt with - the interface just gives a common way to deal with all the different providers. You can then use an XML configuration file to dictate which providers to run and any other initial settings they may require. Tools like SSIS abstract this stuff away for you by giving you a nice visual designer, but you can still get down and dirty and write your own code which it calls.
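A compact sketch of the Provider model, written in Python for illustration. The method names mirror the QueryData()/TransformData() examples above; the concrete provider classes, the canned responses standing in for network conversations, and the `run_etl` helper are assumptions.

```python
from abc import ABC, abstractmethod


class Provider(ABC):
    """Common interface: every provider can query its source and transform the result."""

    @abstractmethod
    def query_data(self):
        ...

    @abstractmethod
    def transform_data(self, raw):
        ...


class PrinterProvider(Provider):
    def query_data(self):
        # Canned response standing in for the conversation on port 9000.
        return "has_paper:true\njobs:0\nink:low"

    def transform_data(self, raw):
        return dict(line.split(":", 1) for line in raw.splitlines())


class VoicemailProvider(Provider):
    def query_data(self):
        # Canned responses standing in for the key-sequence session.
        return ["0 new messages", "2 total messages"]

    def transform_data(self, raw):
        return {line.split()[1]: int(line.split()[0]) for line in raw}


def run_etl(providers):
    """Run every provider through the same interface and collect the results."""
    return {type(p).__name__: p.transform_data(p.query_data()) for p in providers}


repository = run_etl([PrinterProvider(), VoicemailProvider()])
```

Supporting a new device then means adding one more `Provider` subclass; `run_etl` does not change.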
"Now what if I was going to give you an unknown device? Like a refrigerator. Or a toaster."
No problem, I would just write a new provider, which can sit in its very own assembly (dll), so it can be shipped (or modified, upgraded, etc) in isolation from any other providers I already have. Or if I was using SSIS then I would write a new DTS package.
"I was thinking that all these systems are text driven so maybe you could create a line iterator type abstraction layer ... Then you could have a text processing piece that parses based on rules."
Absolutely - you can have a base class containing common functionality which several different providers can implement, and each provider can use its own set of rules which could be coded into it or they can be contained in an external configuration file.
"So I could parse the results and figure out the total at data collection time or just store it raw and make sense of it later."
Use whichever approach makes sense for the data you are grabbing. It is also quite common for an ETL process to dump its data into a staging area (like some staging tables in a database) while the data is all being aggregated and accumulated, and then further process it to link related data and perform calculations. In the case of your ATM it may not be necessary to calculate a cash balance at ETL time because you can easily calculate it at any time in the future.
