"Replay" the steps needed to recreate an error - debugging

I am going to create a typical business application that will be used by a few hundred consultants. Normally, the consultants would be presented with a standard error message. As the application will be complicated, with lots of changes being made to it constantly, I would like the following:
When an error message is presented, the user has the option to "send" the error message to the developers. The developers should be able to open the incoming file in, e.g., Eclipse and debug the last 10 minutes of the user's work step by step (one line at a time if they want to). Everything should be transparent, meaning that they should, for example, be able to see the return values of calls to the database.
Are there any solutions that offer such functionality today? My preferred languages are Python and Java. I know that there will be a huge performance hit from such functionality, but that is acceptable, as this kind of software is not performance sensitive.
It would be VERY nice if the database also had a chronology, so that one could query it for the values that existed at the exact time a specific line of code was run in the application, leading up to the bug.

You should try to use logging, e.g. commit logs from the DB and logs of the user's interactions with the application. If it is a web application, you can start with the log files from the web server. Make sure that the log files include all submitted data, such as the complete GET URL with parameters and the POST entity body. You can configure the web server to generate such logs when necessary.
Then you build a test client that can parse the log files and re-create all the user interaction that caused the problem to appear. If you suspect race conditions, you should log with high precision (ms resolution) and make sure that the test client can run through the same sequences over and over again to stress those critical parts.
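As a concrete starting point, here is a minimal sketch of such a replay client in Java. The log format (one request per line: epoch-millis, method, URL, optional POST body) is an assumption for illustration - adapt the parsing to whatever your web server actually writes:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Minimal replay client: re-issues logged requests in their original order.
// Assumed log format, one request per line:
//   "<epoch-millis> <METHOD> <url> [<body>]"
public class ReplayClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        for (String line : Files.readAllLines(Path.of(args[0]))) {
            String[] parts = line.split(" ", 4);
            String method = parts[1];
            String url = parts[2];
            HttpRequest.Builder req = HttpRequest.newBuilder(URI.create(url));
            if ("POST".equals(method)) {
                req.POST(HttpRequest.BodyPublishers.ofString(parts[3]));
            } else {
                req.GET();
            }
            HttpResponse<String> resp =
                    client.send(req.build(), HttpResponse.BodyHandlers.ofString());
            System.out.println(parts[0] + " " + method + " " + url + " -> " + resp.statusCode());
        }
    }
}
```

To stress suspected race conditions, wrap the loop so it replays the same file many times, optionally honoring the logged timestamps to preserve the original pacing.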
Replay (as your title suggests) is the best way to reproduce an error: just collect all the data needed to recreate the input that generated a specific state/situation. Do not focus on internal structures and return values. When it comes to hunting down an error or a bug, you should not work in forensic mode, i.e. trying to determine the cause of the crash by analyzing the wreck; you should crash the plane over and over again, adding more and more logging (or using a debugger) until you know what goes wrong.

Related

Why/when does GetComputerName() return 'TENTATIVE'?

It appears that in some very rare cases the GetComputerName() Windows API returns 'TENTATIVE' as the computer name. Based on information from our customers, we suspect that this sometimes happens on laptops when they 'awaken' from sleep. The odd thing is that this happens after our app is already running and able to connect to a back-end server and retrieve some information from it - i.e. the computer name has already been written into a log file by that time.
We are considering adding a slight delay during 'awakening' but would like to confirm that the cause is what we suspect.
We could not find anything online that would confirm the source of the issue.
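If the wake-from-sleep theory holds, one defensive workaround is to retry until a plausible name comes back rather than trusting the first read. A hedged sketch of that idea in Java via the JNA library (Kernel32Util.getComputerName wraps the same Win32 call; the retry count and delay here are arbitrary):

```java
import com.sun.jna.platform.win32.Kernel32Util;

// Sketch of a retry-on-'TENTATIVE' workaround: poll GetComputerName until a
// plausible name comes back, instead of trusting the first value read right
// after the machine wakes from sleep.
public class ComputerNameProbe {
    public static String stableComputerName() throws InterruptedException {
        for (int attempt = 0; attempt < 10; attempt++) {
            String name = Kernel32Util.getComputerName();
            if (!"TENTATIVE".equalsIgnoreCase(name)) {
                return name;
            }
            Thread.sleep(1000); // give the network stack time to settle
        }
        return Kernel32Util.getComputerName(); // give up and return whatever we get
    }
}
```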

Performance Testing in Mirth Connect Using JMeter

Mirth Connect is software designed to handle message flows, with built-in support for HL7 messages in particular, and it is therefore widely used for interfacing in healthcare applications. Over the years I have seen Mirth experiencing performance issues, primarily due to messages building up over time and in scenarios where it receives a heavy message load in quick succession.
Mirth has a channel-based architecture, so it would be ideal if there were some way to performance test a Mirth channel and get JMeter statistics for it, whereby we could gather the information necessary to optimize the channel transformers and also to set the purge routines accordingly.
However, on the Internet there was little to no information on this area, that is, how one can use JMeter to test a Mirth channel. A team in Sri Lanka did some research on this back in 2013, and I found their findings and achievements here:
http://pragmatictestlabs.com/2016/10/09/performance-testing-healthcare-application-hl7-jmeter/
However, this is very specific: the output there was a JSON object, which they extracted. In Mirth, outputs can take various forms, so there needs to be a more general way to do this. An important takeaway is that the input side is general: we can use JMeter to generate HL7 messages and pass them to Mirth, which is great. But how do we capture the response in a general way? It would be ideal if there were a way to read the Mirth Dashboard through JMeter - all the output statistics are there, so it's just a matter of reading them.
I have an application where Mirth reads HL7 messages (both ADT and RDE), creates a text file with the appropriate content, and drops it in a shared location. The application then reads the files and shows the information to the user.
I wish to do two performance tests here:
Measure how much time the complete system takes, from the arrival of a message until its information is available to the user, and how that varies with load.
Measure how much time the channel itself takes, and how that changes as the load increases.
I can do the first one, because I can generate HL7 messages using JMeter and I can get JMeter to read the output in the application or the database. The problem is with the second: can I do this in a general way?
You asked for suggestions, so I'm going to share my general strategy for performance testing Mirth channels. I suspect that this won't be a complete answer to your question, and I might not be telling you anything you don't already know, but I'm hoping this will help you find an answer that you are comfortable with.
For several reasons, try not to spend too much time "testing the complete system":
Firstly, testing the entire system necessarily includes testing low-level configuration like the number of CPU cores, the NICs in the box, and kernel-level software like the TCP/IP stack. You don't usually have any control over these things, so you can't optimize them in any way.
Secondly, the performance of the entire system is going to be heavily dependent on whatever ancillary code is running on the box. If a sysadmin decides to 'nice' my Mirth process down, or to use that box to also host a SQL server, that will have an impact on the system that I (again) have no control over.
Thirdly and most frankly, I find that the "performance of an entire system" is something that management asks about during system setup so they can get a cost estimate; but they know that they're only getting an estimate. You do your best to use test metrics to give a good guess for the initial hardware provisioning, but everyone knows that it's really the production performance metrics that will drive later provisioning costs.
Make sure that you build your channels for testability. I find that it's much easier to test a channel when the source and destination can be changed to "Channel Reader" and "Channel Writer" without changing message handling. One way to look at this is that you're not going to overhaul Mirth's MLLP stack or Java's TCP stack, so just eliminate these things from your testing.
I keep a source of useful test messages. I have a couple of files on a network drive that have around a hundred messages that test for nasty edge cases that I've run into over the years on my HL7 interfaces. I wrote a small Mirth channel that reads these in from a file and spews out copies as fast as it can. By turning on "Queueing" on the destination side of that channel, I can queue up a bajillion test messages that are ready to send to the channel I want to test. In the past I took the time to build a test interface that acted like a fake EMR to spew out randomly constructed messages, but there didn't seem to be any advantage over just spewing copies of the same messages from my test files.
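If you'd rather not build the feeder as a Mirth channel, the same idea can be sketched as a standalone sender. This assumes blank-line-separated messages in a test file and a channel listening for MLLP on localhost:6661 (both placeholders); the 0x0B and 0x1C 0x0D bytes are the standard MLLP framing:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Fires copies of test HL7 messages at an MLLP listener as fast as it can.
// Assumes the test file separates messages with blank lines, and that the
// channel under test listens on localhost:6661 (adjust to your setup).
public class Hl7Spewer {
    public static void main(String[] args) throws Exception {
        String[] messages =
                Files.readString(Path.of("test-messages.hl7")).split("\\n\\s*\\n");
        int copies = 10_000;
        try (Socket socket = new Socket("localhost", 6661)) {
            OutputStream out = socket.getOutputStream();
            InputStream in = socket.getInputStream();
            for (int i = 0; i < copies; i++) {
                String msg = messages[i % messages.length];
                out.write(0x0B);                      // MLLP start block
                out.write(msg.replace("\n", "\r")     // HL7 wants \r segment breaks
                             .getBytes(StandardCharsets.ISO_8859_1));
                out.write(0x1C);                      // MLLP end block
                out.write(0x0D);
                out.flush();
                int b;
                while ((b = in.read()) != -1 && b != 0x1C) { /* swallow the ACK */ }
                in.read();                            // trailing carriage return
            }
        }
    }
}
```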
Finally, and most importantly, it's critical that you measure the performance of your test instance using the same metrics that you'll use to measure the performance of your production instance. If the sole production metric you care about is 'messages per second', then that's what you need to measure on your test box. If memory footprint is a concern in production, then you need to measure memory usage in your test environment as well. When you make a change to your test instance that decreases an important metric by 10%, you'll need to make sure your management is aware before you push that change to production.
Note that getting some of these metrics can be tricky, since Mirth doesn't include good tools to monitor its own performance. The Mirth dashboard is a good place to keep an eye on errors or crashes, but it's not a great place to find performance data. During my testing I make sure that I use whatever resource monitoring tool the sysadmins will be using to monitor the performance of the production instance. Beyond that, I use a manual process to test performance: if I want to count messages per second, I send through a batch of messages and look at the timestamps of the first and last messages. If I want to get an idea of the CPU load of a Mirth channel, I use the Windows Performance Monitor or the POSIX 'top' command.
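The messages-per-second arithmetic itself is trivial - a small sketch, with the batch size and timestamps as placeholders for the values you'd read off the first and last messages:

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Batch throughput from the timestamps of the first and last messages.
public class Throughput {
    public static void main(String[] args) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");
        LocalDateTime first = LocalDateTime.parse("2019-03-04 10:15:00.000", fmt);
        LocalDateTime last  = LocalDateTime.parse("2019-03-04 10:16:40.000", fmt);
        long batchSize = 10_000;
        double seconds = Duration.between(first, last).toMillis() / 1000.0;
        System.out.printf("%.1f messages/second%n", batchSize / seconds);
    }
}
```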

Why are my ADODB queries not being persisted to the SQL server?

My VB6 program uses ADODB to do a lot of SQL (2000) CRUD.
Sometimes the network connection between the remote clients and the data center "drops", making it impossible to establish new connections (so users launching the program can't use it).
The issue is the following:
Anyone who is using the program at the moment of the "drop" can continue using it with no issues whatsoever: they can perform every operation, update data, read data, and everything seems to be working normally.
The user then proceeds to fire up a "sum-up" report which lists everything that was done (before or after the "drop").
If we then check the database, none of the data for whatever was done after the network drop is there. The user goes back into the program and everything is as it was before the network drop.
It seems like all queries were somehow performed in-memory? I'm at a loss about how to even approach the issue (I'm familiar enough with VB6 to work with the source code, but I don't know a lot about ADODB).
I haven't yet tried to replicate the behavior due to limited customer availability (the development environment is housed in their offices); I'll try starting up the program from the IDE and then ripping out the network cable.
Provided I can replicate the issue, how do I fix this? Is there some setting I'm not aware of?
On a side note: the issue is sporadic (it happened a handful of times during the last year, and the software is being used heavily and on a daily basis by multiple concurrent users).
After reading up on Disconnected Recordsets, it seems that's what's behind this odd behavior I'm experiencing.
This is not something that can be simply "turned off".

Using ODBC Trace or Oracle Trace to find cause of error?

I have a third-party Windows service which controls/monitors equipment and updates an Oracle database. Their services occasionally report an error about a row/column in the database being "bad", but do not give the underlying database error; their services then need to be restarted, and afterwards everything is fine. The current suspicion is that something from our applications/services, which read/write to those same tables/rows, is interfering - i.e. some kind of blocking/locking. I suspect that there is some sort of leak in their system, since it happens about once a week, but our systems never need any restarting like this.
I attempted to have the DBA run a trace in Oracle (10g), but this managed to make our apps unable to access the Oracle database. Our systems access Oracle from .NET, either using the Oracle ODP client or the Microsoft client (older programs), either on this same server (web apps or services) or from other control workstations. The third-party service connects to Oracle via ODBC on this server. I also attempted to run an ODBC trace (since that would capture only activity from the third-party service), but didn't get anything in the trace file at all.
So I'm trying to find a way to either get ODBC tracing working, or to learn what I need to look out for so that the Oracle trace doesn't kill my server.
I'm looking for the underlying error which Oracle is returning to the third-party service, so I can tell if we are interfering with their access to the data in some way.
If a block in the database is corrupted ("bad"), this should show up in the alert log as an ORA-01578 error. I would search the alert log for the ORA- error and then compare its timestamp with the timestamp on the client error being reported. This makes an assumption about the definition of "bad"; it would be better to have the exact error messages posted.
Blanket tracing in the database is a tricky thing, as it will tend to affect the performance of your entire application, and leaving it on for an entire week may not be feasible. I have also found one case (can't remember the exact circumstances) where turning on tracing fixed the error.
One method I have used in the past is to add a SQL statement to the application that alters the session and turns on SQL trace. This is predicated on the ability to modify the code in some way; depending on the application, this may or may not be possible.
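In a JDBC-based application, for instance, the session-level trace could be switched on right after connecting - a sketch with placeholder connection details (the TRACEFILE_IDENTIFIER just makes the resulting file easier to find in user_dump_dest):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Turns on SQL trace for this session only, right after connecting.
// Connection details are placeholders; requires the Oracle JDBC driver.
public class SessionTrace {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//dbhost:1521/ORCL", "appuser", "password");
             Statement stmt = conn.createStatement()) {
            stmt.execute("ALTER SESSION SET TRACEFILE_IDENTIFIER = 'svc_trace'");
            stmt.execute("ALTER SESSION SET SQL_TRACE = TRUE");
            // ... run the suspect workload here; the trace file lands in user_dump_dest
        }
    }
}
```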
Another method is to work with the DBA to identify the session and turn on SQL trace for just that session. Also, if you can identify the offending SQL statements and parameter values, you may be able to replicate the problem outside the service.
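Once the SID and serial# are known from v$session, the DBA can switch tracing on from outside the application with DBMS_MONITOR - sketched here via JDBC with placeholder values, though it is just as easily run from SQL*Plus:

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

// Enables SQL trace (with waits and binds) for someone else's session.
// The SID and serial# come from v$session; the values here are placeholders.
public class TraceOtherSession {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//dbhost:1521/ORCL", "dbauser", "password");
             CallableStatement cs = conn.prepareCall(
                     "BEGIN DBMS_MONITOR.SESSION_TRACE_ENABLE("
                     + "session_id => ?, serial_num => ?, "
                     + "waits => TRUE, binds => TRUE); END;")) {
            cs.setInt(1, 123);  // SID from v$session
            cs.setInt(2, 4567); // serial# from v$session
            cs.execute();
        }
    }
}
```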
I have found that most ORMs avoid passing the ORA- error back. However, it is typically logged in the application server layer along with the associated ORM error.
I have used these methods, and variations of them, to troubleshoot errors in applications. I hope this is useful.

performance of accessing a mono server application via remoting

This is my setting: I have written a .NET application for local client machines, which implements a feature that could also be used on a webpage. To keep this example simple, assume that the client installs a software into which he can enter some data and gets some data back.
The idea is to create a webpage that holds a form into which the user enters the same data and gets the same results back as above. Given the company's available web servers, the first idea was to create a mono webservice, but this was dismissed for reasons unknown. The "service" is not to be run as a webservice, but should be called by a PHP script. This is currently realized by calling the mono application via shell_exec from PHP.
So now I am stuck with a mono port of my application, which works fine but takes way too long to execute. I have already stripped out all unnecessary DLLs, methods, etc., but calling the application via the command line - submitting the desired data via command-line parameters - takes approximately 700ms. We expect about 10 hits per second, so this could only work by setting up a lot of servers for this task.
I assume the 700ms are related to the cost of starting the application every time, because it does not make much difference in terms of time whether I handle the request once or five hundred times (I take the original input, vary it slightly, and do 500 iterations with "new" data every time; starting from the second iteration, the processing time drops to approximately 1ms per iteration).
My next idea was to set up the mono application as a remoting server, so that it only has to be started once and can then handle incoming requests. I therefore wrote another mono application that serves as the client. Calling the client, letting the client pass the data to the server, and retrieving the result now takes 344ms. This is better, but still way slower than I would expect and want it to be.
I then implemented a new project from scratch based on this blog post and got stuck with the same performance issues.
The question is: am I missing something related to the mono projects that could improve the speed of the client/server? Although the idea of creating a webservice for this task was dismissed, would a webservice perform better under these circumstances (as I would not need a client application to call the service), even though it is said that remoting is faster than webservices?
I could have made that clearer, but implementing a webservice is currently not an option (and please don't ask why, I didn't write the requirements ;))
Meanwhile I have checked that it is indeed the startup of the client that takes most of the time in the remoting scenario.
I could imagine accessing the server via pipes from the command line, which would be perfectly suitable in my scenario. I guess this would be done using sockets?
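For what it's worth, the keep-the-process-warm pattern being described looks roughly like this (sketched in Java only to show the shape; the real thing would be a Mono/.NET listener, which the PHP script could reach with a plain socket call instead of shell_exec):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// Keep-the-process-warm pattern: the expensive startup happens once;
// after that, each request is a cheap line-based socket round trip.
public class WarmServer {
    public static void main(String[] args) throws Exception {
        // ... expensive one-time initialization goes here ...
        try (ServerSocket server = new ServerSocket(9000)) {
            while (true) {
                try (Socket client = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(client.getInputStream()));
                     PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                    String request = in.readLine(); // one request per connection
                    out.println(handle(request));   // ~1ms once the process is warm
                }
            }
        }
    }

    static String handle(String request) {
        return "result-for:" + request; // placeholder for the real computation
    }
}
```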
You can try to use AOT to reduce the startup time. On .NET you would use ngen for that purpose; on mono, just do a mono --aot on all assemblies used by your application.
AOT'ed code is slower than JIT'ed code, but has the advantage of reducing startup time.
You can even try to AOT framework assemblies such as mscorlib and System.
I believe that remoting is not an ideal thing to use in this scenario. However, your idea of having mono already running on the server instead of starting it every time is indeed solid.
Did you consider using SOAP webservices over HTTP? This would also help you with your 'web page' scenario.
Even if it is a little too slow for you, in my experience a custom RESTful service implementation would be easier to work with than remoting.