Assisted manual inspection of log files - debugging

Rookie question here. I've been inspecting a lot of log files to try to pinpoint errors in an application. Specifically I'm trying to compare success scenarios with failure scenarios... but due to the volume of logs it's difficult to identify which log messages are "good" and which ones are "bad". (I'm not the developer, so I can't just change the logs.)
Ideally I'd love a tool where I could...
Load a log file and manually specify which entries are associated with a success scenario
The rest of the logfile would then be formatted according to whether a given line appears in the "success" scenario or not. (Ideally it could do fuzzy matching, so an exact match (sans timestamp, of course) could be one color and a close match (different value) could be another.
This would make it easy to skim through the failure scenarios and identify messages associated with the failure condition. Think of it like a smart diff.
Most of the tools I've looked at (e.g., Splunk, OtrosLogViewer) seem to be focused on automated server-side deployments. While that could work, I'd love something lighter weight for quick analysis.
Does anything like this exist? Any pointers welcome/appreciated.

Related

Is there a process for munging data from many different formats in RapidMiner?

I'm trying to help my team streamline a data ingestion process that is taking up a substantial amount of time. We receive data in multiple formats and with attributes arranged differently. Is there a way using RapidMiner to create a process that:
Processes files on a schedule that are dropped into a folder (this
one I think I know but I'd love tips on this as scheduled processes
are new to me)
Automatically identifies input filetype and routes to the correct operator ("Read CSV" for example)
Recognizes a relatively small number of attributes and arranges them accordingly. In some cases, attributes are named the same way as our ingestion format and in others they are not (phone vs phone # vs Phone for example)
The attributes we process mostly consist of name, id, phone, email, address. Also, in some cases names are split first/last and in some they are full name.
I recognize that munging files for such simple attributes shouldn't be that hard but the number of files we receive and lack of order makes it very difficult to streamline a process without a bit of automation. I'm also going to move to a standardized receiving format but for a number of reasons that's on the horizon and not an immediate solution.
I appreciate any tips or guidance you can share.
Your question is relative broad, so unfortunately I can't give you complete answer. But here are some ideas on how I would tackle the points you mentioned:
For a full process scheduling RapidMiner Server is what you are
looking for. In that case you can either define a schedule (e.g.,
check regularly for new files) or even define a web service to
trigger the process.
For selecting the correct operator depending on file type, you could
use a combination of "Loop Files" and macro extraction to get the
correct type and the use either "Branch" or "Select Subprocess" for
switching to different input routes.
The "Select Attributes" operator has some very powerful options to
select specific subsets only. In your example I would go for a
regular expression akin to [pP]hone.* to get the different spelling
variants. Also very helpful in that case would be the "Reorder
Attributes" operator and "Rename by Replacing" to create a common
naming schema.
A general tip when building more complex process pipelines is to organize your different tasks in sub-processes and use the "Execute Process" operator. This makes everything much more readable and maintainable. Also a good error handling strategy is important to handle unforeseen data formats.
For more elaborate answers and tips from many adavanced RapidMiner users, I also highly recommend the RapidMiner community.
I hope this gives a good starting point for your project.

Practices for allowing systems to accommodate human error?

Systems have to sometimes accommodate the possibility of real world bad data. Consider that some data originates with paper forms. And forms inherently have a limited means of validating data.
Example 1: On one form users are expected to enter an integer distance (in miles) into a blank. We capture the information as written as a string since we don't always end up getting integer values.
Example 2: On another form we capture a code. That code should map to one of the codes in our system. However, sometimes the code written on the form is incorrect. We capture the code and allow it to exist with an invalid value until some future time of resolution. That is, we temporarily allow bad data since it's important to record the record even if some of it is invalid.
I'm interested in learning more about how systems accommodate bad data, that is, human error. Databases are supposed to be bastions of data integrity, but the real world is messy and people make mistakes. Systems must allow us to reflect those mistakes.
What are some ways systems you've developed accommodate human error? What practices have you used? What lessons have you learned?
Any further reading on the topic? (I had trouble Googling it.)
I agree with you, whatever we do there's no guarantee that we can get rid of bad or incorrect data. Especially, but not only, if it comes to user input. In my experience the same problems exist in complex integration projects, in which you have to integrate and merge (often inconsistent) data retrieved from different systems.
A good strategy is to decouple the input from the operational system itself. First, place user (or external system) provided data in a separate datastore (e.g. different schema). In a second step load this data into your operational datastore, but only if it confirms to strict rules (e.g. use address verification software to verify a given address). This Extract, Transform, Load (ETL) approach is fairly common in Data Warehousing (DWH) solutions, but can be applied programmatically in transactional systems as well (in my experience).
The above approach often leads to asynchronous processes in which the input is subitted first and (maybe) at a later time the external entity (user or system) retrives feedback whether its data was correct or not.
EDIT: For further readings I recommend to have a look at DWH concepts. Alhtough, you may not want to build such a thing, you could partially apply those concepts:
http://en.wikipedia.org/wiki/Extract,_transform,_load
http://en.wikipedia.org/wiki/Data_warehouse
http://en.wikipedia.org/wiki/Data_cleansing
A government department I worked in does a lot of surveys, most of which are (were) still paper based.
All the results were OCR'd into the system.
As part of the OCR process a digital scan of the forms is kept.
Data is then validated, data that is undecipherable or which fails validation is flagged.
When a human operator reviews the digital data they can modify the data if they are confident that they can correctly interpret what the code could not; they (here's the cool bit) can also bring up the scan of the paper based original, and use that to determine what the user was trying to say.
On a different thread; at some point you want to validate the data coming in against any expected data ranges that you want it to conform to; buy rejecting it at the point of entry you give the user a chance to correct it - the trade off is that every time you reject it you increase the chance of them abandoning the whole process.
At some point in your system you need to specify the rules which will be used for validation. At the end of the day a system is only going to be as smart as those rules. You can develop these yourself into the code (probably the business logic) or you might use a 3rd party component.
having flexible control over the validation is pretty important as they are likely to change overtime.
To be honest with you, one point of migrating from paper-based systems to IT is to remove these errors and make sure all data is always correct. I doubt any correctly planned and developed IT system (especially business financial systems) would allow such errors. Not in the company I am working for anyway...
There are lots of software tools that address the kinds of problems you mention. There are platforms and tools that let you define rules for scrubbing and transforming data and handling validation errors. Those techniques are widely used for Data Integration and Business Intelligence applications. Google for "Data Quality" or "Data Integration".
The easiest thing to do is to (this is not always possible) design the interface where users enter the data to limit as much as possible the amount of text that they need to enter. In my experience this seems to be where a lot of problems come from. One simple example of this is to provide a select, or auto-complete select field
One thing that you could do is do everything possible to determine if the data is correct before going into the db. I try to give the user entering the data as much feedback as possible so they can (ideally) fix some of the issues before the data gets persisted. For example, it is a very quick check to determine if the data being entered is of the correct type.
I got started in legal systems before the PC era. Litigation support databases routinely have to accommodate factually incorrect, incomplete, and contradictory information. It takes a different way of thinking.
The short version . . .
Instead of recording a single fact, you record multiple assertions about a fact. It boils down to designing a database to store data from assertions like these.
In an interview at 2011-01-03 08:13, Neil Rimes told Officer Cane
that he was at home from 2011-01-02 20:00 until 2011-01-03 08:13.
In an interview at 2011-01-03 08:25, Liza Nevers told Officer Cane
that Neil Rimes came home at 2011-01-02 23:45.
In a deposition at 2011-05-13 10:22, Cody Maxon told attorney Kurt
Schlagel that he saw Neil Rimes at Kroger at 2011-01-03 03:00

Is avoiding the T in ETL possible?

ETL is pretty common-place. Data is out there somewhere so you go get it. After you get it, it's probably in a weird format so you transform it into something and then load it somewhere. The only problem I see with this method is you have to write the transform rules. Of course, I can't think of anything better. I supposed you could load whatever you get into a blob (sql) or into a object/document (non-sql) but then I think you're just delaying the parsing. Eventually you'll have to parse it into something structured (assuming you want to). So is there anything better? Does it have a name? Does this problem have a name?
Example
Ok, let me give you an example. I've got a printer, an ATM and a voicemail system. They're all network enabled or I can give you connectivity. How would you collect the state from all these devices? For example, the printer dumps a text file when you type status over port 9000:
> status
===============
has_paper:true
jobs:0
ink:low
The ATM has a CLI after you connect on port whatever and you can type individual commands to get different values:
maint-mode> GET BILLS_1
[$1 bills]: 7
maint-mode> GET BILLS_5
[$5 bills]: 2
etc ...
The voicemail system requires certain key sequences to get any kind of information over a network port:
telnet> 7,9*
0 new messages
telnet> 7,0*
2 total messages
My thoughts
Printer - So this is pretty straight-forward. You can just capture everything after sending "status", split on lines and then split on colons or something. Pretty easy. It's almost like getting a crap-formatted result from a web service or something. I could avoid parsing and just dump the whole conversation from port 9000. But eventually I'll want to get rid of that equal signs line. It doesn't really mean anything.
ATM - So this is a bit more of a pain because it's interactive. Now I'm approaching expect or a protocol territory. It'd be better if they had a service that I could query these values but that's out of scope for this post. So I write a client that gets all the values. But now if I want to collect all the data, I have to define what all the questions are. For example, I know that the ATM has more bills than $1 and $5 so I'd have a complete list like "BILLS_1 BILLS_5 BILLS_10 BILLS_20". If I ask all the questions then I have an inventory of the ATM machine. Of course, I still have to parse out the results and clean up the text if I wanted to figure out how much money is left in the ATM machine. So I could parse the results and figure out the total at data collection time or just store it raw and make sense of it later.
Voicemail - This is similar to the ATM machine where it's interactive. It's just a bit weirder because the key sequences/commands aren't "get key". But essentially it's the same problem and solution.
Future Proof
Now what if I was going to give you an unknown device? Like a refrigerator. Or a toaster. Or anything? You'd have to write "connectors" ahead of time or write a parser afterwards against some raw field you stored earlier. Maybe in the case of these very limited examples there's no alternative. There's no way to future-proof. You just have to understand the new device and parse it at collection or parse it after the fact (your stored blob/object/document).
I was thinking that all these systems are text driven so maybe you could create a line iterator type abstraction layer that simply requires the device to split out lines. Then you could have a text processing piece that parses based on rules. For the ATM device, you'd have to write something that "speaks ATM" and turns it into lines which the iterator would then take care of. At this point, hopefully you'd be able to say "I can handle anything that has lines of text".
But then what will you call these rules for parsing the text? "Printer rules" might as well be called "printer parser" which is the same to me as "printer transform". Is there a better term for all of this?
I apologize for this question being so open ended. :)
When your sources of information are as disparate as what you illustrate then you have no choice but to implement the Transform in order to bring the items into a common data repository. Usually your data sources won't be this extreme, the data will all be related in some way but you may be retrieving it from different sources (some might come from a nicely structured database, some more might come from an Excel or XML or text file, some more might come from a web service call, etc).
When coding up a custom ETL application, a common pattern that is used is the Provider model, this enables you to write a whole bunch of custom providers to load/query and then transform the data. All the providers will implement a common interface with some relatively common function definitions (for example QueryData(), TransformData()), but the implementation of those methods will be wildly different depending on the data source being dealt with - the interface just gives a common way to deal with all the different providers. You can then use an XML configuration file to dictate which providers to run and any other initial settings they may require. Tools like SSIS abstract this stuff away for you by giving you a nice visual designer, but you can still get down and dirty and write your own code which it calls.
Now what if I was going to give you an unknown device? Like a refrigerator. Or a toaster.
No problem, i would just write a new provider, which can sit in its very own assembly (dll), so it can be shipped (or modified, upgraded, etc) in isolation to any other providers i already have. Or if i was using SSIS then i would write a new DTS package.
I was thinking that all these systems are text driven so maybe you could create a line iterator type abstraction layer ... Then you could have a text processing piece that parses based on rules.
Absolutely - you can have a base class containing common functionality which several different providers can implement, and each provider can use its own set of rules which could be coded into it or they can be contained in an external configuration file.
So I could parse the results and figure out the total at data collection time or just store it raw and make sense of it later.
Use whichever approach makes sense for the data you are grabbing. It is also quite common for an ETL process to dump its data into a staging area (like some staging tables in a database) while the data is all being aggregated and accumulated, and then further process it to link related data and perform calculations. In the case of your ATM it may not be necessary to calculate a cash balance at ETL time because you can easily calculate it at any time in the future.

What is the quickest way to isolate the source of an error amongst a list of potential sources?

What is the quickest way to isolate the source of an error amongst an ordered list of potential sources? For example, given a list of column mappings, and one of those column mappings is incorrect, what debugging technique would lead you to most quickly identify which mapping is invalid? (By most quickly, I mean, which approach would require the fewest compilation, load, and run cycles?)
Assume that whatever error message the database or database driver generates does not identify the name of the errant column. Sound familiar?
Hint:
The technique is similar to that which you might use to answer the question, "What number am I thinking of between 1 and 1000?", but with the fewest guesses.
You can use interpolation in some cases. I've used this successfully to isolate a bad record.
Sounds familiar, but I hate to be the one to tell you that there is no "quick" way of isolating the sources of errors. I know from my own experience that you want to be absolutely sure you've found the correct source of error before you go about resolving it, and this requires plenty of testing and tracing.
Keep adding more and more diagnostic information until I either isolate the issue, or can't add anymore. If it's my code vs. external code, I will go crazy with trace statements until I isolate the critical bit of code if I otherwise don't know where the issue is. On Windows, the SysInternals suite is my friend... especially the debug viewer. That will show any trace statements from anything running on the system that is emitting trace.
If I truly cannot get more specific information from the error source, then I will go into experimental mode... testing one small change at a time. This works best if you know you have a case that succeeds and a case that does not.
Trivial example: If I have row X that won't be inserted into the database, but I know row Y will, I will then take row Y and change one field at a time and keep inserting until row Y's values = row X's value.
If you really are stumped at where the issue is coming from, time to dust off your Google-fu skills. Someone has probably run into the same problem and posted a question to a forum somewhere. Of course, that's what SO is for too.
You're the human... be more stubborn than the computer!

Fast Text Search Over Logs

Here's the problem I'm having, I've got a set of logs that can grow fairly quickly. They're split into individual files every day, and the files can easily grow up to a gig in size. To help keep the size down, entries older than 30 days or so are cleared out.
The problem is when I want to search these files for a certain string. Right now, a Boyer-Moore search is unfeasibly slow. I know that applications like dtSearch can provide a really fast search using indexing, but I'm not really sure how to implement that without taking up twice the space a log already takes up.
Are there any resources I can check out that can help? I'm really looking for a standard algorithm that'll explain what I should do to build an index and use it to search.
Edit:
Grep won't work as this search needs to be integrated into a cross-platform application. There's no way I'll be able to swing including any external program into it.
The way it works is that there's a web front end that has a log browser. This talks to a custom C++ web server backend. This server needs to search the logs in a reasonable amount of time. Currently searching through several gigs of logs takes ages.
Edit 2:
Some of these suggestions are great, but I have to reiterate that I can't integrate another application, it's part of the contract. But to answer some questions, the data in the logs varies from either received messages in a health-care specific format or messages relating to these. I'm looking to rely on an index because while it may take up to a minute to rebuild the index, searching currently takes a very long time (I've seen it take up to 2.5 minutes). Also, a lot of the data IS discarded before even recording it. Unless some debug logging options are turned on, more than half of the log messages are ignored.
The search basically goes like this: A user on the web form is presented with a list of the most recent messages (streamed from disk as they scroll, yay for ajax), usually, they'll want to search for messages with some information in it, maybe a patient id, or some string they've sent, and so they can enter the string into the search. The search gets sent asychronously and the custom web server linearly searches through the logs 1MB at a time for some results. This process can take a very long time when the logs get big. And it's what I'm trying to optimize.
grep usually works pretty well for me with big logs (sometimes 12G+). You can find a version for windows here as well.
Check out the algorithms that Lucene uses to do its thing. They aren't likely to be very simple, though. I had to study some of these algorithms once upon a time, and some of them are very sophisticated.
If you can identify the "words" in the text you want to index, just build a large hash table of the words which maps a hash of the word to its occurrences in each file. If users repeat the same search frequently, cache the search results. When a search is done, you can then check each location to confirm the search term falls there, rather than just a word with a matching hash.
Also, who really cares if the index is larger than the files themselves? If your system is really this big, with so much activity, is a few dozen gigs for an index the end of the world?
You'll most likely want to integrate some type of indexing search engine into your application. There are dozens out there, Lucene seems to be very popular. Check these two questions for some more suggestions:
Best text search engine for integrating with custom web app?
How do I implement Search Functionality in a website?
More details on the kind of search you're performing could definitely help. Why, in particular do you want to rely on an index, since you'll have to rebuild it every day when the logs roll over? What kind of information is in these logs? Can some of it be discarded before it is ever even recorded?
How long are these searches taking now?
You may want to check out the source for BSD grep. You may not be able to rely on grep being there for you, but nothing says you can't recreate similar functionality, right?
Splunk is great for searching through lots of logs. May be overkill for your purpose. You pay according to the amount of data (size of the logs) you want to process. I'm pretty sure they have an API so you don't have to use their front-end if you don't want to.

Resources